Visualizing Predictions

Let's plot the predicted coordinates and connect them using dashed lines, while
using solid lines to connect the actual coordinates, just like before:

fig = sequence_pred(sbs_seq_selfattnpe, full_test, test_directions)

Figure 9.50 - Predicting the last two corners

Awesome, it looks like positional encoding is working well indeed: the predicted
coordinates are quite close to the actual ones for the most part.
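In case you want to reproduce a plot like the one above without the book's helper, here is a minimal sketch of the kind of plotting sequence_pred() performs. The names model, source_seq, and target_seq are assumptions; the sketch simply assumes the model maps the first two corners of each square (shape N, 2, 2) to predictions for the last two.

import torch
import matplotlib.pyplot as plt

def plot_sequence(model, source_seq, target_seq):
    # predicted last two corners for the first square in the batch
    model.eval()
    with torch.no_grad():
        pred_seq = model(source_seq)  # N, 2, 2

    # actual path (solid) vs. predicted path (dashed), first square only
    actual = torch.cat([source_seq, target_seq], dim=1)[0].cpu().numpy()
    predicted = torch.cat([source_seq, pred_seq], dim=1)[0].cpu().numpy()

    fig, ax = plt.subplots(figsize=(5, 5))
    ax.plot(*actual.T, 'b-o', label='actual')      # solid line
    ax.plot(*predicted.T, 'r--o', label='predicted')  # dashed line
    ax.legend()
    return fig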
Visualizing Attention
Now, let’s check what the model is paying attention to for the first two sequences
in the training set. Unlike last time, though, there are three heads and three
attention mechanisms to visualize now.
We’re starting with the three heads of the self-attention mechanism of the
encoder. There are two data points in our source sequence, so each attention head
has a two-by-two matrix of attention scores.
Figure 9.51 - Encoder’s self-attention scores for its three heads
It seems that, in Attention Head #3, each data point is dividing its attention
between itself and the other data point. In the other attention heads, though, the
data points are paying attention to a single data point, either itself or the other one.
Of course, these are just two data points used for visualization; the attention
scores are different for each source sequence.
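As for how a plot like Figure 9.51 can be produced, here is a minimal sketch that assumes the encoder's multi-headed self-attention exposes its scores through an alphas attribute shaped (N, n_heads, L, L); the attribute name, its shape, and the plot_attention_heads() function are assumptions for illustration, not the book's exact helper.

import matplotlib.pyplot as plt

def plot_attention_heads(alphas, seq_idx=0):
    # one heat map per attention head for a single source sequence
    n_heads = alphas.shape[1]
    fig, axs = plt.subplots(1, n_heads, figsize=(4 * n_heads, 4))
    for head, ax in enumerate(axs):
        scores = alphas[seq_idx, head].detach().cpu().numpy()  # L x L matrix
        ax.imshow(scores, vmin=0, vmax=1, cmap='gray')
        ax.set_title(f'Attention Head #{head + 1}')
        ax.set_xlabel('Source Sequence')
        ax.set_ylabel('Source Sequence')
        # annotate each cell with its attention score
        for i in range(scores.shape[0]):
            for j in range(scores.shape[1]):
                ax.text(j, i, f'{scores[i, j]:.2f}',
                        ha='center', va='center', color='red')
    fig.tight_layout()
    return fig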