of the encoder-decoder (or Transformer) class.

Let’s see it in code, starting with the "layer" and all its wrapped "sub-layers." By the way, the code below is remarkably similar to that of the EncoderLayer, except for the fact that it has a third "sub-layer" (cross-attention) in between the other two.
Decoder "Layer"
class DecoderLayer(nn.Module):
    def __init__(self, n_heads, d_model, ff_units, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.d_model = d_model
        self.ff_units = ff_units
        self.self_attn_heads = \
            MultiHeadedAttention(n_heads, d_model, dropout)
        self.cross_attn_heads = \
            MultiHeadedAttention(n_heads, d_model, dropout)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ff_units),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(ff_units, d_model),
        )

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.drop1 = nn.Dropout(dropout)
        self.drop2 = nn.Dropout(dropout)
        self.drop3 = nn.Dropout(dropout)

    def init_keys(self, states):
        self.cross_attn_heads.init_keys(states)

    def forward(self, query, source_mask=None, target_mask=None):
        # Sublayer #0
        # Norm
        norm_query = self.norm1(query)
        # Masked Multi-head Attention
        self.self_attn_heads.init_keys(norm_query)
        states = self.self_attn_heads(norm_query, target_mask)
        # Add
        att1 = query + self.drop1(states)

        # Sublayer #1
        # Norm
        norm_att1 = self.norm2(att1)
        # Multi-head Attention
        encoder_states = self.cross_attn_heads(norm_att1,
                                               source_mask)
        # Add
        att2 = att1 + self.drop2(encoder_states)

        # Sublayer #2
        # Norm
        norm_att2 = self.norm3(att2)
        # Feed Forward
        out = self.ffn(norm_att2)
        # Add
        out = att2 + self.drop3(out)
        return out

The constructor method of the decoder "layer" takes the same arguments as the encoder "layer" does. The forward() method takes three arguments: the "query," the source mask that’s going to be used to ignore padded data points in the source sequence during cross-attention, and the target mask used to avoid cheating by peeking into the future.
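To make the wiring concrete, here is a minimal usage sketch (not from the book's notebooks): it assumes the MultiHeadedAttention class developed earlier in the chapter is available, and it uses made-up dimensions and random tensors purely to illustrate the expected shapes. The lower-triangular target mask shown here is just one common way of preventing each position from attending to future positions.

import torch

torch.manual_seed(42)

d_model, n_heads, ff_units = 6, 3, 10
batch_size, source_len, target_len = 16, 4, 2

decoder_layer = DecoderLayer(n_heads, d_model, ff_units, dropout=0.1)
decoder_layer.eval()  # disables dropout for a deterministic pass

# Pretend these are the encoder's outputs, one state per source data point;
# they become the keys and values of the cross-attention "sub-layer"
encoder_states = torch.randn(batch_size, source_len, d_model)
decoder_layer.init_keys(encoder_states)

# The (shifted) target sequence plays the role of the "query"
query = torch.randn(batch_size, target_len, d_model)

# Lower-triangular mask: position i may only attend to positions up to i
target_mask = torch.tril(
    torch.ones(1, target_len, target_len)
).bool()

out = decoder_layer(query, source_mask=None, target_mask=target_mask)
print(out.shape)  # torch.Size([16, 2, 6])

Notice that init_keys() only feeds the cross-attention heads; the self-attention heads get their keys from the (normed) "query" itself, inside forward().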