shifted_seq = torch.cat([source_seq[:, -1:], target_seq[:, :-1]], dim=1)

The shifted target sequence was already used (even though we didn't have a name for it) when we discussed teacher forcing. There, at every step (after the first one), we randomly chose, as the input to the subsequent step, either an actual element from that sequence or a prediction. That worked very well with recurrent layers, which are sequential in nature. But this isn't the case anymore.

One of the advantages of self-attention over recurrent layers is that operations can be parallelized. There is no need to do anything sequentially anymore, teacher forcing included. This means we're using the whole shifted target sequence at once as the "query" argument of the decoder.

That's very nice and cool, sure, but it raises one big problem involving the…

Attention Scores

To understand what the problem is, let's look at the context vector that will result in the first "hidden state" produced by the decoder, which, in turn, will lead to the first prediction:

$$\text{context vector}_2 = \alpha_{2,1} V_1 + \alpha_{2,2} V_2$$

Equation 9.14 - Context vector for the first target

"What's the problem with it?"

The problem is that it is using a "key" (K_2) and a "value" (V_2) that are transformations of the very data point it is trying to predict.

In other words, the model is being allowed to cheat by peeking into the future because we're giving it all data points in the target sequence except the very last one.
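To make this concrete, here's a quick toy example (the coordinate values below are made up for illustration, not taken from our dataset) showing that the decoder's input already contains the very point it is supposed to predict first:

import torch

# Made-up example: source and target sequences with two 2D points each
source_seq = torch.tensor([[[1., 1.], [2., 2.]]])  # x0, x1
target_seq = torch.tensor([[[3., 3.], [4., 4.]]])  # x2, x3

# Last source point + all target points but the last one
shifted_seq = torch.cat([source_seq[:, -1:], target_seq[:, :-1]], dim=1)
print(shifted_seq)
# Output: tensor([[[2., 2.], [3., 3.]]])
# The decoder's input is (x1, x2) -- but x2 is exactly the first
# point it is supposed to predict!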
If we look at the context vector corresponding to the last prediction, it should be clear that the model simply cannot cheat (there's no K_3 or V_3):

$$\text{context vector}_3 = \alpha_{3,1} V_1 + \alpha_{3,2} V_2$$

Equation 9.15 - Context vector for the second target
We can also check it quickly by looking at the subscript indices: As long as the
indices of the "values" are lower than the index of the context vector, there is no
cheating. By the way, it is even easier to check what’s happening if we use the
alphas matrix:
$$\boldsymbol{\alpha} = \begin{bmatrix} \alpha_{2,1} & \alpha_{2,2} \\ \alpha_{3,1} & \alpha_{3,2} \end{bmatrix}$$

Equation 9.16 - Decoder's attention scores
For the decoder, the shape of the alphas attribute is given by (N, L_target, L_target) since it is looking at itself. Any alphas above the diagonal are, literally, cheating codes.
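Just to drive the point home, here's a rough sketch, with made-up alphas, of how one could spot those entries; torch.triu() keeps only the elements above a given diagonal:

import torch

# Made-up alphas for a single sequence, shape (N, L_target, L_target)
alphas = torch.tensor([[[0.8, 0.2],
                        [0.3, 0.7]]])

# Entries strictly above the diagonal = attention paid to the future
cheating = torch.triu(alphas, diagonal=1)
print(cheating)
# Output: tensor([[[0.0000, 0.2000],
#                  [0.0000, 0.0000]]])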
We need to force the self-attention mechanism to ignore them. If only there was a
way to do it…
"What about those masks we discussed earlier?"
You’re absolutely right! They are perfect for this case.
Target Mask (Training)
The purpose of the target mask is to zero the attention scores for "future" data points. In our example, that's the alphas matrix we're aiming for:

$$\boldsymbol{\alpha} = \begin{bmatrix} \alpha_{2,1} & 0 \\ \alpha_{3,1} & \alpha_{3,2} \end{bmatrix}$$

Since each row of alphas comes out of a softmax, zeroing out the only "future" entry in the first row means the remaining score must be equal to one.
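Here's a minimal sketch of the general idea, assuming the mask is applied to the scaled dot products before the softmax (the variable names, like trg_mask and products, are illustrative):

import torch
import torch.nn.functional as F

# Made-up scaled dot products for one sequence, shape (N, L_target, L_target)
products = torch.randn(1, 2, 2)

# Lower-triangular Boolean mask: True where attention is allowed
trg_mask = torch.tril(torch.ones(2, 2)).bool()

# "Future" positions get minus infinity, so the softmax
# assigns them exactly zero weight
masked = products.masked_fill(~trg_mask, float('-inf'))
alphas = F.softmax(masked, dim=-1)
print(alphas)
# The first row attends only to the first position (alpha = 1.0),
# and everything above the diagonal is exactly zero: no more cheating.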