


Output

tensor([[[0.4132, 0.3728],
         [0.4132, 0.3728]]], grad_fn=<SliceBackward>)

Greedy Decoding vs Beam Search

This is called greedy decoding because each prediction is deemed final. "No backsies": Once it’s done, it’s really done, and you just move along to the next prediction and never look back. In the context of our sequence-to-sequence problem, a regression, it wouldn’t make much sense to do otherwise anyway.

But that may not be the case for other types of sequence-to-sequence problems. In machine translation, for example, the decoder outputs probabilities for the next word in the sentence at each step. The greedy approach would simply take the word with the highest probability and move on to the next.

However, since each prediction is an input to the next step, taking the top word at every step is not necessarily the winning approach (translating from one language to another is not exactly "linear"). It is probably wiser to keep a handful of candidates at every step and try their combinations to choose the best one: That’s called beam search. We’re not delving into its details here, but you can find more information in Jason Brownlee’s "How to Implement a Beam Search Decoder for Natural Language Processing." [143]
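The snippet below is a minimal sketch of the difference between the two strategies, not code from the book: it assumes we already have a matrix of per-step probabilities over a toy vocabulary, ignoring the fact that, in a real decoder, each step’s distribution depends on the word chosen at the previous step. The function names (greedy_decode, beam_search_decode) are made up for illustration.

import numpy as np

def greedy_decode(probs):
    # Pick the single highest-probability token at every step.
    return [int(np.argmax(step)) for step in probs]

def beam_search_decode(probs, k=3):
    # Keep the k best partial sequences (beams) at every step,
    # scoring each by the sum of the log-probabilities of its tokens.
    beams = [([], 0.0)]
    for step in probs:
        candidates = []
        for seq, score in beams:
            for token, p in enumerate(step):
                candidates.append((seq + [token], score + np.log(p + 1e-12)))
        # Retain only the k highest-scoring candidates so far.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Toy per-step distributions over a five-word vocabulary (four steps)
probs = np.array([[0.1, 0.5, 0.1, 0.2, 0.1],
                  [0.4, 0.3, 0.1, 0.1, 0.1],
                  [0.1, 0.1, 0.6, 0.1, 0.1],
                  [0.2, 0.2, 0.2, 0.3, 0.1]])

print(greedy_decode(probs))           # [1, 0, 2, 3]
print(beam_search_decode(probs, k=3)) # three candidate sequences with scores

With a fixed probability matrix like this one, the top beam coincides with the greedy result; the point of beam search is that, when each step’s distribution depends on the previous choice, a slightly worse word now can lead to a much better sequence overall.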

Encoder + Decoder + Self-Attention

Let’s join the encoder and the decoder together again, each using self-attention to compute their corresponding "hidden states," and the decoder using cross-attention to make predictions. The full picture looks like this (including the need for masking one of the inputs to avoid cheating).

Figure 9.32 - Encoder + decoder + attention (simplified)

For some cool animations of the self-attention mechanism, make sure to check out Raimi Karim’s "Illustrated: Self-Attention." [144]

But, if you prefer an even more simplified diagram, here it is:
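As a rough sketch of where each attention mechanism sits in this picture, the snippet below uses PyTorch’s built-in nn.MultiheadAttention instead of the chapter’s own attention classes; the dimensions and tensors are made up for illustration and this is not the book’s model.

import torch
import torch.nn as nn

torch.manual_seed(42)

# Toy dimensions: batch of 1, source/target length 2, model size 4
d_model, n_heads = 4, 1
source = torch.randn(1, 2, d_model)  # encoder input
target = torch.randn(1, 2, d_model)  # shifted decoder input

self_attn_enc = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
self_attn_dec = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Encoder: self-attention over the source produces its "hidden states"
enc_states, _ = self_attn_enc(source, source, source)

# Decoder: MASKED self-attention over the target, so position i cannot
# peek at later positions (the "cheating" the mask prevents)
causal_mask = torch.triu(torch.ones(2, 2, dtype=torch.bool), diagonal=1)
dec_states, _ = self_attn_dec(target, target, target, attn_mask=causal_mask)

# Decoder: cross-attention uses the decoder states as queries and the
# encoder states as keys/values to produce the inputs for the predictions
out, _ = cross_attn(dec_states, enc_states, enc_states)
print(out.shape)  # torch.Size([1, 2, 4])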
