Encoder + Self-Attention

class EncoderSelfAttn(nn.Module):
    def __init__(self, n_heads, d_model,
                 ff_units, n_features=None):
        super().__init__()
        self.n_heads = n_heads
        self.d_model = d_model
        self.ff_units = ff_units
        self.n_features = n_features
        self.self_attn_heads = \
            MultiHeadAttention(n_heads,
                               d_model,
                               input_dim=n_features)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ff_units),
            nn.ReLU(),
            nn.Linear(ff_units, d_model),
        )

    def forward(self, query, mask=None):
        self.self_attn_heads.init_keys(query)
        att = self.self_attn_heads(query, mask)
        out = self.ffn(att)
        return out

Remember that the "query" in the forward() method actually gets the data points from the source sequence. These data points will be transformed into different "keys," "values," and "queries" inside each of the attention heads. The output of the attention heads is a context vector (att) that goes through a feed-forward network to produce a "hidden state."

By the way, now that we’ve gotten rid of the recurrent layer, we’ll be talking about model dimensions (d_model) instead of hidden dimensions (hidden_dim). You still get to choose it, though.

The mask argument should receive the source mask; that is, the mask we use to ignore padded data points in our source sequence.
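The source mask itself is not built in this section, but here is a minimal sketch of how one could be created from a padded source sequence. The all-zeros padding convention, the tensor values, and the (N, 1, L) mask shape are assumptions for illustration only; the exact shape expected depends on the MultiHeadAttention implementation from earlier in the chapter.

import torch

# Hypothetical padded source sequence: a batch of one, three data points,
# two coordinates each; the last point is padding (all zeros, by assumption)
padded_source_seq = torch.tensor([[[-1.0,  1.0],
                                   [ 1.0, -1.0],
                                   [ 0.0,  0.0]]])
# True for real data points, False for padded ones; shape (N, 1, L)
source_mask = (padded_source_seq != 0).all(dim=2).unsqueeze(1)
# tensor([[[ True,  True, False]]])
# The mask would then be passed along with the sequence:
# encoder_states = encself(padded_source_seq, mask=source_mask)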
Let’s create an encoder and feed it a source sequence:

torch.manual_seed(11)
encself = EncoderSelfAttn(n_heads=3, d_model=2,
                          ff_units=10, n_features=2)
query = source_seq
encoder_states = encself(query)
encoder_states

Output

tensor([[[-0.0498,  0.2193],
         [-0.0642,  0.2258]]], grad_fn=<AddBackward0>)

It produced a sequence of states that will be the input of the (cross-)attention mechanism used by the decoder. Business as usual.

Cross-Attention

The cross-attention was the first mechanism we discussed: the decoder provided a "query" (Q), which served not only as input but also got concatenated to the resulting context vector. That won’t be the case anymore! Instead of concatenation, the context vector will go through a feed-forward network in the decoder to generate the predicted coordinates.

The figure below illustrates the current state of the architecture: self-attention as encoder, cross-attention on top of it, and the modifications to the decoder.
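To make the modified decoder more concrete, here is a minimal sketch of a decoder block built around cross-attention. The class name, constructor arguments, and the final projection to coordinates are illustrative assumptions, not the book’s actual implementation; it reuses the MultiHeadAttention class from earlier in the chapter.

# Illustrative sketch only: the decoder's input acts as the "query," the
# encoder states act as "keys" and "values," and the resulting context
# vector goes through a feed-forward network (no concatenation) to
# produce the predicted coordinates
class CrossAttnDecoderSketch(nn.Module):
    def __init__(self, n_heads, d_model, ff_units, n_features):
        super().__init__()
        self.cross_attn_heads = MultiHeadAttention(n_heads, d_model,
                                                   input_dim=n_features)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ff_units),
            nn.ReLU(),
            nn.Linear(ff_units, n_features),
        )

    def init_keys(self, states):
        # the encoder's output states become "keys" and "values"
        self.cross_attn_heads.init_keys(states)

    def forward(self, query, source_mask=None):
        # context vector: the decoder's "query" matched against the
        # encoder states
        att = self.cross_attn_heads(query, source_mask)
        # feed-forward network replaces the former concatenation step
        out = self.ffn(att)
        return out

The source mask can be passed here as well, so the cross-attention also ignores padded data points in the source sequence.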