Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub

Transformer Encoder

    import copy

    import torch.nn as nn

    # PositionalEncoding is assumed to be the class defined earlier in the book
    class EncoderTransf(nn.Module):
        def __init__(self, encoder_layer, n_layers=1, max_len=100):
            super().__init__()
            self.d_model = encoder_layer.d_model
            self.pe = PositionalEncoding(max_len, self.d_model)
            self.norm = nn.LayerNorm(self.d_model)
            # Each layer is an independent copy of the given encoder layer
            self.layers = nn.ModuleList([copy.deepcopy(encoder_layer)
                                         for _ in range(n_layers)])

        def forward(self, query, mask=None):
            # Positional Encoding
            x = self.pe(query)
            for layer in self.layers:
                x = layer(x, mask)
            # Norm
            return self.norm(x)

In PyTorch, the encoder is implemented as nn.TransformerEncoder, and its constructor expects similar arguments: encoder_layer, num_layers, and an optional normalization layer to normalize (or not) the outputs.

    enclayer = nn.TransformerEncoderLayer(d_model=6, nhead=3, dim_feedforward=20)
    # norm expects a layer instance (not the class), sized to match d_model
    enctransf = nn.TransformerEncoder(enclayer, num_layers=1, norm=nn.LayerNorm(6))

Therefore, it behaves a bit differently from ours, since it does not (at the time of writing) implement positional encoding for the inputs, and it does not normalize the outputs by default.
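To make those two differences concrete, here is a minimal sketch of a wrapper that adds both missing pieces around nn.TransformerEncoder. The names EncoderWithPE and sinusoidal_encoding are hypothetical, introduced only for illustration; the sine/cosine encoding and the scaling by the square root of d_model are assumptions standing in for our PositionalEncoding class, and batch_first=True assumes a reasonably recent PyTorch version.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_encoding(max_len, d_model):
    # Hypothetical helper: sine/cosine positional encoding, used here as a
    # stand-in for the PositionalEncoding class (assumes an even d_model)
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float)
        * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


class EncoderWithPE(nn.Module):
    # Hypothetical wrapper: positional encoding on the way in,
    # layer normalization on the way out
    def __init__(self, encoder_layer, num_layers=1, max_len=100):
        super().__init__()
        # The model dimension is read from the layer's self-attention module
        self.d_model = encoder_layer.self_attn.embed_dim
        self.register_buffer("pe", sinusoidal_encoding(max_len, self.d_model))
        self.encoder = nn.TransformerEncoder(
            encoder_layer,
            num_layers=num_layers,
            norm=nn.LayerNorm(self.d_model),  # an instance, not the class
        )

    def forward(self, src, mask=None):
        # src is expected to be (batch, sequence length, d_model)
        # Scaling the inputs before adding the encoding is an assumption
        x = src * math.sqrt(self.d_model) + self.pe[: src.size(1)]
        return self.encoder(x, mask=mask)


torch.manual_seed(42)
enclayer = nn.TransformerEncoderLayer(
    d_model=6, nhead=3, dim_feedforward=20, batch_first=True
)
encoder = EncoderWithPE(enclayer, num_layers=1)
encoder.eval()  # turns dropout off, so the output is deterministic

src = torch.randn(2, 4, 6)  # two sequences of four points, six features each
with torch.no_grad():
    out = encoder(src)
print(out.shape)  # torch.Size([2, 4, 6])
```

Since the final nn.LayerNorm starts with a weight of one and a bias of zero, every position in the output should have (approximately) zero mean across its six features, which is an easy way to check that the normalization was actually applied.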

Transformer Decoder

We’ll be representing the decoder using "stacked" layers in detail (like Figure 10.6(b)); that is, showing the internal wrapped "sub-layers" (the dashed rectangles).

Figure 10.9 - Transformer decoder: norm-last vs norm-first

The small arrow on the left represents the states produced by the encoder, which will be used as inputs for "keys" and "values" of the (cross-)multi-headed attention mechanism in each "layer."

Moreover, there is one final linear layer responsible for projecting the decoder’s output back to the original number of dimensions (corner’s coordinates, in our case). This linear layer is not included in our decoder’s class, though: It will be part

