Figure 9.27 - Encoder with self- and cross-attentions

If you're wondering why we removed the concatenation part, here comes the
answer: We're using self-attention in the decoder too.
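To make the distinction concrete, here is the scaled dot-product attention written out once for each mechanism. This is a sketch in my own notation, not one of the book's numbered equations: X stands for the target (sub)sequence fed to the decoder, H for the encoder's states, and the W's for the corresponding projections.

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

self-attention (decoder): Q = X W_q,  K = X W_k,  V = X W_v
cross-attention:          Q = A W_q,  K = H W_k,  V = H W_v

where A is the output of the decoder's (masked) self-attention.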
Decoder
There is one main difference (in the code) between the encoder and the
decoder—the latter includes a cross-attention mechanism, as you can see below:
Decoder + Self-Attention

class DecoderSelfAttn(nn.Module):
    # relies on torch.nn (imported as nn) and the MultiHeadAttention
    # class defined earlier in the chapter
    def __init__(self, n_heads, d_model, ff_units, n_features=None):
        super().__init__()
        self.n_heads = n_heads
        self.d_model = d_model
        self.ff_units = ff_units
        self.n_features = d_model if n_features is None else n_features
        # self-attention over the (shifted) target sequence
        self.self_attn_heads = MultiHeadAttention(n_heads, d_model,
                                                  input_dim=self.n_features)
        # cross-attention over the encoder's states
        self.cross_attn_heads = MultiHeadAttention(n_heads, d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, ff_units),
                                 nn.ReLU(),
                                 nn.Linear(ff_units, self.n_features))

    def init_keys(self, states):
        # keys (and values) for the cross-attention come from the encoder
        self.cross_attn_heads.init_keys(states)

    def forward(self, query, source_mask=None, target_mask=None):
        # the target sequence provides its own keys, values, and queries
        self.self_attn_heads.init_keys(query)
        att1 = self.self_attn_heads(query, target_mask)
        # the self-attention output is the query of the cross-attention
        att2 = self.cross_attn_heads(att1, source_mask)
        out = self.ffn(att2)
        return out
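To get a feel for the shapes involved, here is a minimal usage sketch. It is not part of the book's pipeline: the states tensor merely stands in for an encoder output with d_model features, the shifted target sequence is random, the lower-triangular target mask is built by hand, and all sizes are arbitrary.

import torch

torch.manual_seed(42)

n_features, d_model = 2, 6
decoder = DecoderSelfAttn(n_heads=3, d_model=d_model,
                          ff_units=10, n_features=n_features)

# stand-in for the encoder's output: (N, L_source, d_model)
states = torch.randn(1, 4, d_model)
# shifted target sequence: (N, L_target, n_features)
shifted_seq = torch.randn(1, 2, n_features)

# lower-triangular target mask: position i attends to positions up to i only
trg_len = shifted_seq.size(1)
trg_mask = torch.tril(torch.ones(1, trg_len, trg_len)).bool()

decoder.init_keys(states)   # cross-attention keys/values from the encoder
out = decoder(shifted_seq, target_mask=trg_mask)
print(out.shape)            # expected: torch.Size([1, 2, 2])

Notice that init_keys() only touches the cross-attention heads; the self-attention heads get their keys from the query itself, inside forward().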