Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide


Transformer Encoder

class EncoderTransf(nn.Module):
    def __init__(self, encoder_layer, n_layers=1, max_len=100):
        super().__init__()
        self.d_model = encoder_layer.d_model
        # PositionalEncoding is the class built earlier in the chapter
        self.pe = PositionalEncoding(max_len, self.d_model)
        self.norm = nn.LayerNorm(self.d_model)
        # stack of identical (deep-copied) encoder "layers"
        self.layers = nn.ModuleList([copy.deepcopy(encoder_layer)
                                     for _ in range(n_layers)])

    def forward(self, query, mask=None):
        # Positional Encoding
        x = self.pe(query)
        for layer in self.layers:
            x = layer(x, mask)
        # Norm
        return self.norm(x)

In PyTorch, the encoder is implemented as nn.TransformerEncoder, and its constructor method expects similar arguments: encoder_layer, num_layers, and an optional normalization layer to normalize (or not) the outputs.

enclayer = nn.TransformerEncoderLayer(d_model=6, nhead=3, dim_feedforward=20)
enctransf = nn.TransformerEncoder(enclayer, num_layers=1, norm=nn.LayerNorm(6))

Therefore, it behaves a bit differently than ours, since it does not (at the time of writing) implement positional encoding for the inputs, and it does not normalize the outputs by default.
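If you want the built-in nn.TransformerEncoder to behave like our own EncoderTransf, you can wrap it yourself, applying the positional encoding to the inputs and normalizing the outputs of the last layer. The snippet below is only a sketch of that idea (the WrappedTransformerEncoder name is made up here; it assumes the PositionalEncoding class built earlier in the chapter and a PyTorch version recent enough to support batch_first):

from torch import nn

class WrappedTransformerEncoder(nn.Module):  # hypothetical name, not from the book
    def __init__(self, d_model, n_heads, ff_units, n_layers=1, max_len=100):
        super().__init__()
        # PositionalEncoding is the class built earlier in the chapter
        self.pe = PositionalEncoding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ff_units,
            batch_first=True  # requires a recent PyTorch version
        )
        # the optional norm argument normalizes the output of the last layer,
        # playing the same role as self.norm in EncoderTransf
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=n_layers, norm=nn.LayerNorm(d_model)
        )

    def forward(self, query, mask=None):
        # nn.TransformerEncoder does not add positional encoding by itself
        x = self.pe(query)
        # careful: PyTorch's mask argument uses the opposite convention of ours
        # (True / -inf means the position is NOT allowed to be attended to)
        return self.encoder(x, mask=mask)

Called as WrappedTransformerEncoder(d_model=6, n_heads=3, ff_units=20), it takes batch-first inputs of shape (N, L, 6), just like our own encoder.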


Transformer Decoder

We'll be representing the decoder using "stacked" layers in detail (like Figure 10.6 (b)); that is, showing the internal wrapped "sub-layers" (the dashed rectangles).

Figure 10.9 - Transformer decoder—norm-last vs norm-first

The small arrow on the left represents the states produced by the encoder, which will be used as inputs for "keys" and "values" of the (cross-)multi-headed attention mechanism in each "layer."

Moreover, there is one final linear layer responsible for projecting the decoder's output back to the original number of dimensions (corners' coordinates, in our case). This linear layer is not included in our decoder's class, though: It will be part
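To make the description above more concrete, here is a sketch of how such a decoder could be assembled out of PyTorch's own building blocks. This is not the decoder class developed in the chapter: the SketchDecoder name is made up, it assumes the PositionalEncoding class built earlier and a recent PyTorch version and, unlike in the text, the final linear projection is bundled into the class only to keep the example self-contained.

from torch import nn

class SketchDecoder(nn.Module):  # hypothetical name, not from the book
    def __init__(self, d_model, n_heads, ff_units,
                 n_layers=1, max_len=100, n_features=2):
        super().__init__()
        # PositionalEncoding is the class built earlier in the chapter
        self.pe = PositionalEncoding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=ff_units,
            batch_first=True  # requires a recent PyTorch version
        )
        self.decoder = nn.TransformerDecoder(
            layer, num_layers=n_layers, norm=nn.LayerNorm(d_model)
        )
        # final linear layer: projects the decoder's output back to the
        # original number of dimensions (here assumed to be two, one pair
        # of coordinates per corner)
        self.proj = nn.Linear(d_model, n_features)

    def forward(self, query, memory, tgt_mask=None):
        # memory holds the states produced by the encoder; inside each
        # decoder "layer" they are used as "keys" and "values" of the
        # cross-attention mechanism
        x = self.pe(query)
        x = self.decoder(x, memory, tgt_mask=tgt_mask)
        return self.proj(x)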

