Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide


shifted_seq = torch.cat([source_seq[:, -1:], target_seq[:, :-1]], dim=1)

The shifted target sequence was already used (even though we didn't have a name for it) when we discussed teacher forcing. There, at every step (after the first one), it randomly chose as the input to the subsequent step either an actual element from that sequence or a prediction. It worked very well with recurrent layers that were sequential in nature. But this isn't the case anymore.

One of the advantages of self-attention over recurrent layers is that operations can be parallelized. No need to do anything sequentially anymore, teacher forcing included. This means we're using the whole shifted target sequence at once as the "query" argument of the decoder.
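To make the shifted target sequence concrete, here is a minimal sketch assuming a toy batch of one source and one target sequence, each with two two-dimensional points (the tensor values are made up for illustration, not taken from the book's dataset):

import torch

# Toy batch: one sequence of two points with two coordinates each -> (N=1, L=2, F=2)
# (made-up values for illustration only)
source_seq = torch.tensor([[[-1., -1.], [-1., 1.]]])
target_seq = torch.tensor([[[1., 1.], [1., -1.]]])

# Last point of the source sequence + all but the last point of the target sequence
shifted_seq = torch.cat([source_seq[:, -1:], target_seq[:, :-1]], dim=1)
print(shifted_seq)
# Output:
# tensor([[[-1.,  1.],
#          [ 1.,  1.]]])

The result has the same length as the target sequence, but its last element has been dropped and the last known data point (taken from the source sequence) has been prepended to it.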

That's very nice and cool, sure, but it raises one big problem involving the…

Attention Scores

To understand what the problem is, let's look at the context vector that will result in the first "hidden state" produced by the decoder, which, in turn, will lead to the first prediction:

Equation 9.14 - Context vector for the first target

"What's the problem with it?"

The problem is that it is using a "key" (K2) and a "value" (V2) that are transformations of the data point it is trying to predict.

In other words, the model is being allowed to cheat by peeking into the future because we're giving it all data points in the target sequence except the very last one.
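To see the cheating in action, here is a rough sketch that runs plain (unmasked) scaled dot-product attention over a length-two shifted sequence; the query, key, and value projections below are made-up numbers for illustration, not the model's actual projections:

import torch
import torch.nn.functional as F

# Made-up projections for the two points of the shifted target sequence
Q = torch.tensor([[[0.5, 0.1], [0.3, 0.8]]])  # queries for positions 1 and 2
K = torch.tensor([[[0.6, 0.2], [0.4, 0.7]]])  # K1, K2
V = torch.tensor([[[0.9, 0.4], [0.2, 0.5]]])  # V1, V2

# Regular (unmasked) scaled dot-product attention
scores = torch.bmm(Q, K.transpose(-2, -1)) / (K.size(-1) ** 0.5)
alphas = F.softmax(scores, dim=-1)
context = torch.bmm(alphas, V)

# alphas[0, 0, 1] is strictly positive: the first "hidden state" is attending
# to V2, a transformation of the very data point it is trying to predict
print(alphas[0, 0])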

If we look at the context vector corresponding to the last prediction, it should be clear that the model simply cannot cheat (there's no K3 or V3):

Equation 9.15 - Context vector for the second target

We can also check it quickly by looking at the subscript indices: As long as the indices of the "values" are lower than the index of the context vector, there is no cheating. By the way, it is even easier to check what's happening if we use the alphas matrix:

Equation 9.16 - Decoder’s attention scores

For the decoder, the shape of the alphas attribute is given by (N, L_target, L_target) since it is looking at itself. Any alphas above the diagonal are, literally, cheating codes. We need to force the self-attention mechanism to ignore them. If only there was a way to do it…

"What about those masks we discussed earlier?"

You’re absolutely right! They are perfect for this case.
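Concretely, the mask we need is a lower-triangular one over the target positions. Here is a rough sketch of how such a mask could be built; the helper name subsequent_mask and the exact shape convention are assumptions for illustration, not necessarily the book's own implementation:

import torch

# Lower-triangular boolean mask of shape (1, L_target, L_target):
# True on and below the diagonal (allowed), False above it (future positions)
def subsequent_mask(size):
    attn_shape = (1, size, size)
    return torch.ones(attn_shape).tril().bool()

print(subsequent_mask(2))
# Output:
# tensor([[[ True, False],
#          [ True,  True]]])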

Target Mask (Training)

The purpose of the target mask is to zero attention scores for "future" data points. In our example, that's the alphas matrix we're aiming for: one in which every score above the diagonal is zeroed out.
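Continuing the sketch above, here is a minimal example of what applying that mask does: scores above the diagonal are filled with negative infinity before the softmax, so the corresponding alphas come out as exactly zero (the raw score values are made up):

import torch
import torch.nn.functional as F

# Made-up raw attention scores, shape (N=1, L_target=2, L_target=2)
scores = torch.tensor([[[0.5, 0.3],
                        [0.2, 0.9]]])
mask = torch.ones(1, 2, 2).tril().bool()  # the target (subsequent) mask

# Fill the "future" positions with -inf so the softmax assigns them zero weight
masked_scores = scores.masked_fill(~mask, float('-inf'))
alphas = F.softmax(masked_scores, dim=-1)
print(alphas)
# Output:
# tensor([[[1.0000, 0.0000],
#          [0.3318, 0.6682]]])

Notice how the first row now attends only to the first "value," while the second row keeps its (renormalized) scores untouched: exactly the pattern of zeros above the diagonal we were after.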

