Equation 9.17 - Decoder’s (masked) attention scores
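The equation itself did not survive extraction. As a sketch only, assuming the scaled dot-product notation used elsewhere in the chapter (queries Q, keys K, dimension d_k) and an additive mask M with negative infinity above the diagonal, the masked attention scores take the standard form:

\mathrm{alphas} = \operatorname{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right),
\qquad
M_{ij} =
\begin{cases}
0 & \text{if } j \le i \\
-\infty & \text{if } j > i
\end{cases}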

Therefore, we need a mask that flags every element above the diagonal as invalid, as we did with the padded data points in the source mask. The shape of the target mask, though, must match the shape of the alphas attribute: (1, L_target, L_target).

We can create a function to generate the mask for us:

Subsequent Mask

import torch

def subsequent_mask(size):
    # (1, L, L) so the mask broadcasts over the batch dimension
    attn_shape = (1, size, size)
    # torch.triu with diagonal=1 keeps only the elements above the diagonal;
    # subtracting from one flags the diagonal and below as valid (True)
    subsequent_mask = (
        1 - torch.triu(torch.ones(attn_shape), diagonal=1)
    ).bool()
    return subsequent_mask

subsequent_mask(2) # 1, L, L

Output

tensor([[[ True, False],
         [ True, True]]])

Perfect! The element above the diagonal is indeed set to False.

We must use this mask while querying the decoder to prevent it from cheating. You can choose to use an additional mask to "hide" more data from the decoder if you wish, but the subsequent mask is a necessity with the self-attention decoder.
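As a minimal sketch of how such a mask could be applied (this is not the book's decoder code; the random scores and the masked_fill-based approach are illustrative assumptions), masked positions are typically set to negative infinity before the softmax so they receive zero attention weight:

import torch
import torch.nn.functional as F

# Hypothetical example: applying the subsequent mask to raw attention scores
scores = torch.randn(1, 2, 2)             # (N, L_target, L_target) raw scores (made up)
mask = subsequent_mask(2)                 # (1, L_target, L_target) boolean mask
masked_scores = scores.masked_fill(mask == 0, float('-inf'))
alphas = F.softmax(masked_scores, dim=-1) # invalid (future) positions get zero weight

In the first row, only the first position gets any weight; in the second row, both positions do, which is exactly the "no peeking ahead" behavior the mask is meant to enforce.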
