Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)
1. Encoder-Decoder

The encoder-decoder architecture was extended from the one developed in the previous chapter (EncoderDecoderSelfAttn), which handled training and prediction using greedy decoding. There are no changes here, except for the omission of both the encode() and decode() methods, which are going to be overridden anyway:

Encoder + Decoder + Self-Attention

import torch
import torch.nn as nn

class EncoderDecoderSelfAttn(nn.Module):
    def __init__(self, encoder, decoder, input_len, target_len):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.input_len = input_len
        self.target_len = target_len
        self.trg_masks = self.subsequent_mask(self.target_len)

    @staticmethod
    def subsequent_mask(size):
        # Lower-triangular (causal) mask: True where attention is allowed
        attn_shape = (1, size, size)
        subsequent_mask = (
            1 - torch.triu(torch.ones(attn_shape), diagonal=1)
        ).bool()
        return subsequent_mask

    def predict(self, source_seq, source_mask):
        # Decodes/generates a sequence using one input
        # at a time - used in EVAL mode
        inputs = source_seq[:, -1:]
        for i in range(self.target_len):
            out = self.decode(inputs,
                              source_mask,
                              self.trg_masks[:, :i+1, :i+1])
            out = torch.cat([inputs, out[:, -1:, :]], dim=-2)
            inputs = out.detach()
        outputs = inputs[:, 1:, :]
        return outputs

    def forward(self, X, source_mask=None):
        # Sends the mask to the same device as the inputs
        self.trg_masks = self.trg_masks.type_as(X).bool()
        # Slices the input to get the source sequence
        source_seq = X[:, :self.input_len, :]
        # Encodes the source sequence AND initializes the decoder
        self.encode(source_seq, source_mask)
        if self.training:
            # Slices the input to get the shifted target sequence
            shifted_target_seq = X[:, self.input_len-1:-1, :]
            # Decodes using the mask to prevent cheating
            outputs = self.decode(shifted_target_seq,
                                  source_mask,
                                  self.trg_masks)
        else:
            # Decodes using its own predictions
            outputs = self.predict(source_seq, source_mask)

        return outputs
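To make the causal masking concrete, here is a quick check (my own snippet, not one of the book's listings) of what subsequent_mask() returns for a target length of three. True marks positions the decoder is allowed to attend to; False hides the "future" positions:

    mask = EncoderDecoderSelfAttn.subsequent_mask(3)
    # mask.shape is (1, 3, 3); each query position can attend to itself
    # and to earlier positions only:
    # tensor([[[ True, False, False],
    #          [ True,  True, False],
    #          [ True,  True,  True]]])

During predict(), the slice self.trg_masks[:, :i+1, :i+1] simply takes the top-left corner of this mask, matching however many points have been generated so far.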
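The slicing in forward() is easier to follow with a concrete tensor. The sketch below uses my own toy shapes (not the book's data generation): a single sequence of four two-dimensional points, with input_len = target_len = 2, the same lengths used for the square-corner sequences in earlier chapters:

    input_len, target_len = 2, 2
    X = torch.arange(8.).reshape(1, 4, 2)           # one sequence of four 2D points
    source_seq = X[:, :input_len, :]                # points 0 and 1 -> encode()
    shifted_target_seq = X[:, input_len-1:-1, :]    # points 1 and 2 -> decoder input

In training mode, the decoder receives points one and two (with the target mask preventing it from peeking ahead) and learns to predict points two and three. In evaluation mode, predict() starts from the last source point alone and generates target_len points, one at a time, feeding each prediction back in as the next input.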