Transformer Encoder

class EncoderTransf(nn.Module):
    def __init__(self, encoder_layer, n_layers=1, max_len=100):
        super().__init__()
        self.d_model = encoder_layer.d_model
        self.pe = PositionalEncoding(max_len, self.d_model)
        self.norm = nn.LayerNorm(self.d_model)
        self.layers = nn.ModuleList([copy.deepcopy(encoder_layer)
                                     for _ in range(n_layers)])

    def forward(self, query, mask=None):
        # Positional Encoding
        x = self.pe(query)
        for layer in self.layers:
            x = layer(x, mask)
        # Norm
        return self.norm(x)

In PyTorch, the encoder is implemented as nn.TransformerEncoder, and its constructor method expects similar arguments: encoder_layer, num_layers, and an optional normalization layer to normalize (or not) the outputs.

enclayer = nn.TransformerEncoderLayer(d_model=6, nhead=3, dim_feedforward=20)
enctransf = nn.TransformerEncoder(enclayer, num_layers=1,
                                  norm=nn.LayerNorm(6))  # norm must be an instance

Therefore, it behaves a bit differently than ours, since it does not (at the time of writing) implement positional encoding for the inputs, and it does not normalize the outputs by default.
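If we wanted PyTorch's encoder to behave more like ours, we could wrap it ourselves. The sketch below is only an illustration (the PTEncoderWithPE name and its sinusoidal buffer are made up for this example, and it assumes a PyTorch version recent enough to support the batch_first argument):

import math
import torch
import torch.nn as nn

class PTEncoderWithPE(nn.Module):
    def __init__(self, d_model, n_heads, ff_units, n_layers=1, max_len=100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ff_units,
            batch_first=True  # inputs shaped (N, L, D)
        )
        # norm is an instance, so the final outputs do get normalized
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=n_layers, norm=nn.LayerNorm(d_model)
        )
        # precomputed sinusoidal positional encoding, stored as a buffer
        position = torch.arange(max_len).float().unsqueeze(1)
        angular_speed = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(1, max_len, d_model)
        pe[0, :, 0::2] = torch.sin(position * angular_speed)
        pe[0, :, 1::2] = torch.cos(position * angular_speed)
        self.register_buffer('pe', pe)

    def forward(self, query, mask=None):
        # adds positional information before handing the input to the encoder
        x = query + self.pe[:, :query.size(1), :]
        return self.encoder(x, mask=mask)

dummy_source = torch.randn(16, 2, 6)                        # (N, L, D)
encoder = PTEncoderWithPE(d_model=6, n_heads=3, ff_units=20)
print(encoder(dummy_source).shape)                          # torch.Size([16, 2, 6])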
Transformer Decoder
We’ll be representing the decoder using "stacked" layers in detail (like Figure 10.6
(b)); that is, showing the internal wrapped "sub-layers" (the dashed rectangles).
Figure 10.9 - Transformer decoder—norm-last vs norm-first
The small arrow on the left represents the states produced by the encoder, which
will be used as inputs for "keys" and "values" of the (cross-)multi-headed attention
mechanism in each "layer."
Moreover, there is one final linear layer responsible for projecting the decoder’s output back to the original number of dimensions (the corners’ coordinates, in our case). This linear layer is not included in our decoder’s class, though: It will be part of the encoder-decoder (or Transformer) model.
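To make that last step concrete, here is a tiny sketch (the names are hypothetical, and the two output features stand for the two coordinates of each corner):

import torch
import torch.nn as nn

d_model, n_features = 6, 2                    # two coordinates per corner
proj = nn.Linear(d_model, n_features)         # final projection, outside the decoder

decoder_states = torch.randn(16, 2, d_model)  # (N, L, d_model) out of the decoder
predictions = proj(decoder_states)            # (N, L, 2): predicted coordinates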