Wide vs Narrow Attention

This mechanism is known as wide attention: Each attention head gets the full hidden state and produces a context vector of the same size. This is totally fine if the number of hidden dimensions is small.

For a larger number of dimensions, though, each attention head will get a chunk of the affine transformation of the hidden state to work with. This is a detail of utmost importance: It is not a chunk of the original hidden state, but of its transformation. For example, say there are 512 dimensions in the hidden state and we'd like to use eight attention heads: Each attention head would work with a chunk of only 64 dimensions. This mechanism is known as narrow attention, and we'll get back to it in the next chapter.

"Which one should I use?"

On the one hand, wide attention will likely yield better models than narrow attention on the same number of dimensions. On the other hand, narrow attention makes it possible to use more dimensions, which may improve the quality of the model as well. It's hard to tell you which one is best overall, but I can tell you that state-of-the-art large Transformer models use narrow attention. In our much simpler and smaller model, though, we're sticking with wide attention.

The multi-headed attention mechanism is usually depicted like this:

Figure 9.22 - Multi-headed attention mechanism
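Before turning to the code, here is a minimal sketch to make the narrow attention chunking arithmetic above concrete. The names (proj, hidden, chunks) are just illustrative, not the chapter's implementation; the point is that the chunks come from the transformed hidden state, not the original one:

import torch
import torch.nn as nn

d_model, n_heads = 512, 8
d_head = d_model // n_heads            # 512 / 8 = 64 dimensions per head

# Affine transformation of the hidden state (the chunks come from THIS)
proj = nn.Linear(d_model, d_model)
hidden = torch.randn(1, 1, d_model)    # N, 1, d_model
transformed = proj(hidden)             # N, 1, d_model

# Narrow attention: each head gets a 64-dimension chunk of the transformation
chunks = transformed.chunk(n_heads, dim=-1)
print(len(chunks), chunks[0].shape)    # 8 torch.Size([1, 1, 64])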
The code for the multi-headed attention mechanism looks like this:

Multi-Headed Attention Mechanism

class MultiHeadAttention(nn.Module):
    def __init__(self, n_heads, d_model,
                 input_dim=None, proj_values=True):
        super().__init__()
        self.linear_out = nn.Linear(n_heads * d_model, d_model)
        self.attn_heads = nn.ModuleList(
            [Attention(d_model,
                       input_dim=input_dim,
                       proj_values=proj_values)
             for _ in range(n_heads)]
        )

    def init_keys(self, key):
        for attn in self.attn_heads:
            attn.init_keys(key)

    @property
    def alphas(self):
        # Shape: n_heads, N, 1, L (source)
        return torch.stack(
            [attn.alphas for attn in self.attn_heads], dim=0
        )

    def output_function(self, contexts):
        # Concatenate the heads' context vectors: N, 1, n_heads * D
        concatenated = torch.cat(contexts, dim=-1)
        # Linear transformation to go back to the original dimension
        out = self.linear_out(concatenated)  # N, 1, D
        return out

    def forward(self, query, mask=None):
        contexts = [attn(query, mask=mask)
                    for attn in self.attn_heads]
        out = self.output_function(contexts)
        return out

It is pretty much a list of attention mechanisms with an extra linear layer on top. But it is not just any list; it is a special list: an nn.ModuleList.
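As a quick sanity check, here is a minimal usage sketch. It assumes the Attention class implemented earlier in the chapter, and the sizes (d_model=2, three heads, N=16, L=5) are arbitrary choices for illustration:

import torch

n_heads, d_model = 3, 2
mha = MultiHeadAttention(n_heads, d_model)

# A batch of N=16 source sequences, each with L=5 hidden states
keys = torch.randn(16, 5, d_model)
# One query (the decoder's current hidden state) per sequence
query = torch.randn(16, 1, d_model)

mha.init_keys(keys)
context = mha(query)
print(context.shape)     # torch.Size([16, 1, 2])   -> N, 1, D
print(mha.alphas.shape)  # torch.Size([3, 16, 1, 5]) -> n_heads, N, 1, L

The nn.ModuleList matters here: Unlike a plain Python list, it registers each head as a submodule, so the heads' parameters show up in mha.parameters() (and thus get trained) and move along with the model when you call mha.to(device).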