
…of the encoder-decoder (or Transformer) class.

Let's see it in code, starting with the "layer" and all its wrapped "sub-layers." By the way, the code below is remarkably similar to that of the EncoderLayer, except for the fact that it has a third "sub-layer" (cross-attention) in between the other two.

Decoder "Layer"

class DecoderLayer(nn.Module):
    def __init__(self, n_heads, d_model, ff_units, dropout=0.1):
        super().__init__()
        self.n_heads = n_heads
        self.d_model = d_model
        self.ff_units = ff_units
        self.self_attn_heads = \
            MultiHeadedAttention(n_heads, d_model, dropout)
        self.cross_attn_heads = \
            MultiHeadedAttention(n_heads, d_model, dropout)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ff_units),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(ff_units, d_model),
        )

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.drop1 = nn.Dropout(dropout)
        self.drop2 = nn.Dropout(dropout)
        self.drop3 = nn.Dropout(dropout)

    def init_keys(self, states):
        self.cross_attn_heads.init_keys(states)

    def forward(self, query, source_mask=None, target_mask=None):
        # Sublayer #0
        # Norm
        norm_query = self.norm1(query)
        # Masked Multi-head Attention
        self.self_attn_heads.init_keys(norm_query)
        states = self.self_attn_heads(norm_query, target_mask)
        # Add
        att1 = query + self.drop1(states)

        # Sublayer #1
        # Norm
        norm_att1 = self.norm2(att1)
        # Multi-head Attention
        encoder_states = self.cross_attn_heads(norm_att1,
                                               source_mask)
        # Add
        att2 = att1 + self.drop2(encoder_states)

        # Sublayer #2
        # Norm
        norm_att2 = self.norm3(att2)
        # Feed Forward
        out = self.ffn(norm_att2)
        # Add
        out = att2 + self.drop3(out)
        return out

The constructor method of the decoder "layer" takes the same arguments as the encoder "layer" does. The forward() method takes three arguments: the "query," the source mask that's going to be used to ignore padded data points in the source sequence during cross-attention, and the target mask used to avoid cheating by peeking into the future.
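To make the role of the two masks more concrete, here is a minimal sketch (not taken from the book) of how they could be built. The subsequent_mask() helper, the padding index (zero), and the toy sequence lengths are assumptions for illustration only.

import torch

def subsequent_mask(size):
    # Hypothetical helper: lower-triangular (causal) mask so that
    # position i can only attend to positions 0..i -- no peeking ahead
    return torch.tril(torch.ones(1, size, size)).bool()

# Source mask: flags the real (non-padded) tokens of the source sequence,
# assuming index zero is the padding token
source_seq = torch.tensor([[3, 7, 5, 0, 0]])   # (N=1, L_source=5)
source_mask = (source_seq != 0).unsqueeze(1)   # (1, 1, L_source)

# Target mask: causal mask over the (shifted) target sequence
target_mask = subsequent_mask(4)               # (1, L_target, L_target)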

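And here is a hypothetical usage sketch that continues the mask-building sketch above. It assumes the MultiHeadedAttention class from the previous chapter is available, and it uses toy dimensions (a batch of one, four target tokens, and a model dimension of eight).

torch.manual_seed(42)

# Toy configuration: d_model must be divisible by n_heads
decoder_layer = DecoderLayer(n_heads=2, d_model=8, ff_units=16)

# Pretend these are the encoder's outputs: (N=1, L_source=5, d_model=8)
encoder_states = torch.randn(1, 5, 8)
decoder_layer.init_keys(encoder_states)

# Embedded (shifted) target sequence: (N=1, L_target=4, d_model=8)
query = torch.randn(1, 4, 8)

out = decoder_layer(query,
                    source_mask=source_mask,   # from the sketch above
                    target_mask=target_mask)
out.shape  # torch.Size([1, 4, 8])

Since every "sub-layer" adds its output back to its input, the output keeps the shape of the "query," so decoder "layers" can be stacked on top of one another, each one feeding the next.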
