Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide


Encoder + Self-Attention

class EncoderSelfAttn(nn.Module):
    def __init__(self, n_heads, d_model,
                 ff_units, n_features=None):
        super().__init__()
        self.n_heads = n_heads
        self.d_model = d_model
        self.ff_units = ff_units
        self.n_features = n_features
        self.self_attn_heads = \
            MultiHeadAttention(n_heads,
                               d_model,
                               input_dim=n_features)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ff_units),
            nn.ReLU(),
            nn.Linear(ff_units, d_model),
        )

    def forward(self, query, mask=None):
        self.self_attn_heads.init_keys(query)
        att = self.self_attn_heads(query, mask)
        out = self.ffn(att)
        return out

Remember that the "query" in the forward() method actually gets the data points from the source sequence. These data points will be transformed into different "keys," "values," and "queries" inside each of the attention heads. The output of the attention heads is a context vector (att) that goes through a feed-forward network to produce a "hidden state."

By the way, now that we've gotten rid of the recurrent layer, we'll be talking about model dimensions (d_model) instead of hidden dimensions (hidden_dim). You still get to choose it, though.

The mask argument should receive the source mask; that is, the mask we use to ignore padded data points in our source sequence.
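The code above relies on the MultiHeadAttention mechanism we developed earlier in this chapter; all EncoderSelfAttn actually needs from it is an init_keys() method and a forward() method that takes a query and an optional mask. If you don't have that implementation at hand, the sketch below is one (simplified) way of filling that interface in: each head projects its inputs into "queries," "keys," and "values," applies scaled dot-product attention, and the heads' context vectors are concatenated and projected back to d_model. Keep in mind this is an illustrative stand-in, not the book's implementation, so its outputs won't necessarily match the numbers shown in this section.

import torch
import torch.nn as nn

class Attention(nn.Module):
    # A single attention head: projects its inputs into "queries," "keys,"
    # and "values" of size d_model and applies scaled dot-product attention
    def __init__(self, d_model, input_dim=None):
        super().__init__()
        input_dim = d_model if input_dim is None else input_dim
        self.d_model = d_model
        self.linear_query = nn.Linear(input_dim, d_model)
        self.linear_key = nn.Linear(input_dim, d_model)
        self.linear_value = nn.Linear(input_dim, d_model)
        self.keys, self.values = None, None

    def init_keys(self, key):
        # "keys" and "values" are computed once from the source and stored
        self.keys = self.linear_key(key)        # (N, L, d_model)
        self.values = self.linear_value(key)    # (N, L, d_model)

    def forward(self, query, mask=None):
        q = self.linear_query(query)            # (N, L_q, d_model)
        scores = torch.bmm(q, self.keys.transpose(1, 2))
        scores = scores / (self.d_model ** 0.5)
        if mask is not None:
            # mask: (N, 1, L) with zeros/False at padded source positions
            scores = scores.masked_fill(mask == 0, float('-inf'))
        alphas = torch.softmax(scores, dim=-1)
        return torch.bmm(alphas, self.values)   # context: (N, L_q, d_model)

class MultiHeadAttention(nn.Module):
    # n_heads independent heads; their context vectors are concatenated
    # and projected back to d_model by a linear output layer
    def __init__(self, n_heads, d_model, input_dim=None):
        super().__init__()
        self.heads = nn.ModuleList(
            [Attention(d_model, input_dim=input_dim)
             for _ in range(n_heads)]
        )
        self.linear_out = nn.Linear(n_heads * d_model, d_model)

    def init_keys(self, key):
        for head in self.heads:
            head.init_keys(key)

    def forward(self, query, mask=None):
        contexts = [head(query, mask) for head in self.heads]
        return self.linear_out(torch.cat(contexts, dim=-1))

With these pieces in place, the EncoderSelfAttn class above can be instantiated and run end to end (the exact output values, of course, depend on the actual implementation and the random seed).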

Let's create an encoder and feed it a source sequence:

torch.manual_seed(11)
encself = EncoderSelfAttn(n_heads=3, d_model=2,
                          ff_units=10, n_features=2)
query = source_seq
encoder_states = encself(query)
encoder_states

Output

tensor([[[-0.0498, 0.2193],
         [-0.0642, 0.2258]]], grad_fn=<AddBackward0>)

It produced a sequence of states that will be the input of the (cross-)attention mechanism used by the decoder. Business as usual.

Cross-Attention

The cross-attention was the first mechanism we discussed: The decoder provided a "query" (Q), which served not only as input but also got concatenated to the resulting context vector. That won't be the case anymore! Instead of concatenation, the context vector will go through a feed-forward network in the decoder to generate the predicted coordinates.

The figure below illustrates the current state of the architecture: self-attention as encoder, cross-attention on top of it, and the modifications to the decoder.
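To make that idea concrete, here is a purely hypothetical sketch of a decoder block wired that way: the encoder states provide the "keys" and "values," the decoder's own input sequence provides the "query," and the resulting context vector goes through a feed-forward network to produce the predicted coordinates, with no concatenation anywhere. The class name (CrossAttnDecoder) and the dummy query are ours, for illustration only; they are not the implementation developed in the book.

class CrossAttnDecoder(nn.Module):
    # Hypothetical illustration: cross-attention gets K and V from the
    # encoder states and Q from the decoder's input sequence; the context
    # vector goes through a feed-forward network instead of being
    # concatenated to the query
    def __init__(self, n_heads, d_model, ff_units, n_features=None):
        super().__init__()
        n_features = d_model if n_features is None else n_features
        # here n_features == d_model == 2, so a single input_dim works
        # for both the target query and the encoder states
        self.cross_attn_heads = MultiHeadAttention(n_heads, d_model,
                                                   input_dim=n_features)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ff_units),
            nn.ReLU(),
            nn.Linear(ff_units, n_features),
        )

    def init_keys(self, states):
        # encoder_states become the "keys" and "values"
        self.cross_attn_heads.init_keys(states)

    def forward(self, query, source_mask=None):
        context = self.cross_attn_heads(query, source_mask)
        return self.ffn(context)   # predicted coordinates

decoder = CrossAttnDecoder(n_heads=3, d_model=2, ff_units=10, n_features=2)
decoder.init_keys(encoder_states)
# a (N=1, L=2, F=2) dummy tensor standing in for the decoder's input sequence
dummy_query = torch.zeros(1, 2, 2)
preds = decoder(dummy_query)   # (1, 2, 2)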

