Output

tensor([[[0.4132, 0.3728],
         [0.4132, 0.3728]]], grad_fn=<SliceBackward>)

Greedy Decoding vs Beam Search

This is called greedy decoding because each prediction is deemed final. "No backsies": once it's done, it's really done, and you just move along to the next prediction and never look back. In the context of our sequence-to-sequence problem, a regression, it wouldn't make much sense to do otherwise anyway.

But that may not be the case for other types of sequence-to-sequence problems. In machine translation, for example, the decoder outputs probabilities for the next word in the sentence at each step. The greedy approach would simply take the word with the highest probability and move on to the next.

However, since each prediction is an input to the next step, taking the top word at every step is not necessarily the winning approach (translating from one language to another is not exactly "linear"). It is probably wiser to keep a handful of candidates at every step and try their combinations to choose the best one: that's called beam search. We're not delving into its details here, but you can find more information in Jason Brownlee's "How to Implement a Beam Search Decoder for Natural Language Processing." [143]
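Just to make the contrast concrete, here is a minimal sketch of both strategies. The decode_step() function below is a hypothetical stand-in for a real decoder (it just produces log-probabilities over a tiny five-token vocabulary), so we can focus on the decoding logic itself:

import torch
import torch.nn.functional as F

def decode_step(seq):
    # Hypothetical stand-in for a real decoder: returns log-probabilities
    # over a five-token vocabulary, (crudely) conditioned on the sequence
    torch.manual_seed(sum(seq) + len(seq))
    return F.log_softmax(torch.randn(5), dim=-1)

def greedy_decode(start_token, n_steps):
    seq = [start_token]
    for _ in range(n_steps):
        log_probs = decode_step(seq)
        # "No backsies": keep only the single most likely token
        seq.append(log_probs.argmax().item())
    return seq

def beam_search(start_token, n_steps, beam_width=3):
    # Each candidate is a (sequence, cumulative log-probability) pair
    beams = [([start_token], 0.0)]
    for _ in range(n_steps):
        candidates = []
        for seq, score in beams:
            log_probs = decode_step(seq)
            for token, log_prob in enumerate(log_probs.tolist()):
                candidates.append((seq + [token], score + log_prob))
        # Keep only the top beam_width sequences found so far
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # Return the highest-scoring complete sequence
    return max(beams, key=lambda c: c[1])[0]

greedy_decode(0, n_steps=4), beam_search(0, n_steps=4)

Since beam search keeps beam_width candidates alive at every step, it may end up choosing a sequence whose first token was not the most likely one, something greedy decoding can never do.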
Encoder + Decoder + Self-Attention

Let's join the encoder and the decoder together again, each using self-attention to compute their corresponding "hidden states," and the decoder using cross-attention to make predictions. The full picture looks like this (including the need for masking one of the inputs to avoid cheating).

Figure 9.32 - Encoder + decoder + attention (simplified)

For some cool animations of the self-attention mechanism, make sure to check out Raimi Karim's "Illustrated: Self-Attention." [144]

But, if you prefer an even more simplified diagram, here it is:

Figure 9.33 - Encoder + decoder + attention
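In code, the wiring could look roughly like the sketch below. To keep it self-contained, it uses PyTorch's built-in nn.MultiheadAttention instead of the attention classes we built ourselves, and the TinyEncoder / TinyDecoder names are purely illustrative, not our actual implementation:

import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    # Illustrative only: self-attention over the source sequence
    def __init__(self, d_model=2, n_heads=1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               batch_first=True)

    def forward(self, source_seq):
        # Queries, keys, and values all come from the source itself
        out, _ = self.self_attn(source_seq, source_seq, source_seq)
        return out  # the encoder's "hidden states"

class TinyDecoder(nn.Module):
    # Illustrative only: masked self-attention plus cross-attention
    def __init__(self, d_model=2, n_heads=1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.linear = nn.Linear(d_model, d_model)

    def forward(self, shifted_trg, encoder_states):
        seq_len = shifted_trg.size(1)
        # Mask out "future" positions, so position i cannot cheat by
        # peeking at positions greater than i
        trg_mask = torch.triu(torch.ones(seq_len, seq_len),
                              diagonal=1).bool()
        h, _ = self.self_attn(shifted_trg, shifted_trg, shifted_trg,
                              attn_mask=trg_mask)
        # Cross-attention: queries from the decoder, keys and values
        # from the encoder's hidden states
        h, _ = self.cross_attn(h, encoder_states, encoder_states)
        return self.linear(h)

source_seq = torch.randn(1, 2, 2)   # (N, L, F), like our toy sequences
shifted_seq = torch.randn(1, 2, 2)  # target sequence shifted by one step
encoder, decoder = TinyEncoder(), TinyDecoder()
predictions = decoder(shifted_seq, encoder(source_seq))  # shape (1, 2, 2)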