
We can also use a simplified diagram for the decoder (Figure 9.29, although depicting a single attention head, corresponds to the "Masked Multi-Headed Self-Attention" box below).

Figure 9.30 - Decoder with self- and cross-attentions (diagram)
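To make the diagram more concrete, here is a minimal sketch of a decoder block that combines both attention mechanisms. This is not the implementation we develop in this chapter: it leans on PyTorch's built-in nn.MultiheadAttention, and the class name (DecoderSketch) and its arguments are made up for illustration only.

import torch
import torch.nn as nn

class DecoderSketch(nn.Module):
    # Minimal decoder block: masked self-attention over the (shifted) target,
    # then cross-attention over the encoder's states, then a feed-forward layer
    def __init__(self, d_model=2, n_heads=1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, shifted_tgt, encoder_states,
                tgt_mask=None, src_key_padding_mask=None):
        # Masked self-attention: tgt_mask (L, L) is True wherever attending is
        # NOT allowed, so each position only sees itself and earlier positions
        h, _ = self.self_attn(shifted_tgt, shifted_tgt, shifted_tgt,
                              attn_mask=tgt_mask)
        # Cross-attention: queries come from the decoder, keys and values come
        # from the encoder; src_key_padding_mask (N, S) is True at padded positions
        h, _ = self.cross_attn(h, encoder_states, encoder_states,
                               key_padding_mask=src_key_padding_mask)
        return self.ffn(h)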

The decoder’s first input (x₁₀, x₁₁) is the last known element of the source sequence, as usual. The source mask is the same mask used to ignore padded data points in the encoder.
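Just to make that mask concrete, here is a hedged sketch of how a source mask for padded data points could be built. The PAD sentinel value and the shapes below are made up for illustration; they are not necessarily the padding scheme used elsewhere in the book.

import torch

PAD = -999.0  # hypothetical sentinel marking padded coordinates
source_seq = torch.tensor([[[-1., -1.], [-1., 1.], [PAD, PAD]]])  # (N, L, F)
# True where the position holds real data, False where it is padding
source_mask = (source_seq != PAD).all(dim=-1).unsqueeze(1)        # (N, 1, L)
print(source_mask)  # tensor([[[ True,  True, False]]])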

"What about the target mask?"

We’ll get to that shortly. First, we need to discuss the subsequent inputs.

Subsequent Inputs and Teacher Forcing

In our problem, the first two data points are the source sequence, while the last two are the target sequence. Now, let’s define the shifted target sequence, which includes the last known element of the source sequence and all elements in the target sequence but the last one.

Figure 9.31 - Shifted target sequence
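Here is a quick sketch of how that shifted sequence can be assembled. The tensor values below are just an example of a square's four corners, and the variable names are illustrative.

import torch

# Full sequence of corners, shaped (N, L, F) = (1, 4, 2)
full_seq = torch.tensor([[[-1., -1.], [-1., 1.], [1., 1.], [1., -1.]]])
source_seq = full_seq[:, :2]   # first two data points
target_seq = full_seq[:, 2:]   # last two data points
# last known element of the source + all target elements but the last one
shifted_seq = torch.cat([source_seq[:, -1:], target_seq[:, :-1]], dim=1)
print(shifted_seq)  # tensor([[[-1.,  1.], [ 1.,  1.]]])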
