Output
tensor([[[-1.0000, 0.0000],
         [-0.1585, 1.5403]]])
"What am I looking at?"
It turns out that the original coordinates were somewhat crowded out by the addition of
the positional encoding (especially the first row). This may happen if the data
points have values roughly in the same range as the positional encoding.
Unfortunately, this is fairly common: Both standardized inputs and word
embeddings (we’ll get back to them in Chapter 11) are likely to have most of their
values inside the [-1, 1] range of the positional encoding.
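To see this crowding concretely, we can rebuild the first two rows of a sinusoidal encoding for two dimensions by hand and add them to a couple of coordinates in that same range. This is a standalone sketch, not part of the model's code, and the coordinate values below were simply chosen to match the output above:

import torch

# Two "standardized" coordinates in the [-1, 1] range, plus the first two
# rows of a sinusoidal encoding with d_model=2:
# position 0 -> [sin(0), cos(0)], position 1 -> [sin(1), cos(1)]
coords = torch.tensor([[-1.0, -1.0],
                       [-1.0,  1.0]])
position = torch.tensor([[0.0], [1.0]])
pe = torch.cat([torch.sin(position), torch.cos(position)], dim=1)
print(pe)           # tensor([[0.0000, 1.0000],
                    #         [0.8415, 0.5403]])
print(coords + pe)  # tensor([[-1.0000,  0.0000],
                    #         [-0.1585,  1.5403]])

The encoding values are the same order of magnitude as the coordinates themselves, so the sum ends up saying more about the positions than about the original data points.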
"How can we handle it then?"
That’s what the scaling in the forward() method is for: It’s as if we were "reversing
the standardization" of the inputs (using a standard deviation equal to the square
root of their dimensionality) to retrieve the hypothetical "raw" inputs.
Equation 9.22 - "Reversing" the standardization
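In plain terms, if we assume the standardization had zero mean and a standard deviation equal to the square root of the dimensionality, reversing it is just a multiplication:

$$\text{"raw" inputs} = \sqrt{d_{model}} \times \text{(standardized) inputs}$$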
By the way, previously, we scaled the dot product using the
inverse of the square root of its dimensionality, which was its
standard deviation.
Even though this is not the same thing, the analogy might help
you remember that the inputs are also scaled by the square root
of their number of dimensions before the positional encoding
gets added to them.
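As a refresher, a minimal sketch of a sinusoidal PositionalEncoding module with that scaling step in its forward() method could look like the code below. It is consistent with the outputs shown in this section, but the class built a few pages back is the one the model actually uses:

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        self.d_model = d_model
        position = torch.arange(max_len).float().unsqueeze(1)  # (max_len, 1)
        # one frequency for each PAIR of dimensions
        # (assumes an even d_model, as in our examples)
        angular_speed = torch.pow(
            10000, -torch.arange(0, d_model, 2).float() / d_model
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * angular_speed)  # even dimensions
        pe[:, 1::2] = torch.cos(position * angular_speed)  # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # x: (N, L, D) -- scale the inputs up by the square root of their
        # dimensionality ("reversing the standardization") ...
        scaled = x * (self.d_model ** 0.5)
        # ... and only then add the (bounded) positional encoding
        return scaled + self.pe[:, :x.size(1), :]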
In our example, the dimensionality is two (coordinates), so the inputs are going to
be scaled by the square root of two:
posenc(source_seq)
Output
tensor([[[-1.4142, -0.4142],
         [-0.5727,  1.9545]]])

The results above (after the encoding) illustrate the effect of scaling the inputs: It seems to have lessened the crowding-out effect of the positional encoding. For inputs with many dimensions, the effect will be much more pronounced: A 300-dimension embedding will have a scaling factor around 17, for example.

"Wait, isn’t this bad for the model?"

Left unchecked, yes, it could be bad for the model. That’s why we’ll pull off yet another normalization trick: layer normalization. We’ll discuss it in detail in the next chapter.

For now, scaling the coordinates by the square root of two isn’t going to be an issue, so we can move on and integrate positional encoding into our model.

Encoder + Decoder + PE

The new encoder and decoder classes are just wrapping their self-attention counterparts by assigning the latter to be the layer attribute of the former, and encoding the inputs prior to calling the corresponding layer:

Encoder with Positional Encoding

class EncoderPe(nn.Module):
    def __init__(self, n_heads, d_model, ff_units,
                 n_features=None, max_len=100):
        super().__init__()
        pe_dim = d_model if n_features is None else n_features
        self.pe = PositionalEncoding(max_len, pe_dim)
        self.layer = EncoderSelfAttn(n_heads, d_model,
                                     ff_units, n_features)

    def forward(self, query, mask=None):
        query_pe = self.pe(query)
        out = self.layer(query_pe, mask)
        return out
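A quick sanity check of the wrapper, using illustrative values (three heads, ten feed-forward units) and assuming the EncoderSelfAttn and PositionalEncoding classes from the previous sections are in scope:

import torch

torch.manual_seed(42)
# illustrative configuration: 2-D features, three heads, ten hidden units
enc_pe = EncoderPe(n_heads=3, d_model=2, ff_units=10, n_features=2)

source_seq = torch.tensor([[[-1., -1.], [-1., 1.]]])  # (N=1, L=2, D=2)
print(enc_pe(source_seq).shape)  # torch.Size([1, 2, 2])

The output keeps the shape of the source sequence, so the wrapper is a drop-in replacement for the original encoder.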