Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide


Next, we shift our focus to the self-attention mechanism on the right:

• It is the second data point's turn to be the "query" (Q), being paired with both "keys" (K), generating attention scores and a context vector, and resulting in the second "hidden state" (a minimal code sketch of this computation follows Figure 9.25 below):

Equation 9.12 - Context vector for the second input (x₁)

As you have probably already noticed, the context vector (and thus the "hidden state") associated with a data point is basically a function of the corresponding "query" (Q), while everything else (the "keys" (K), the "values" (V), and the parameters of the self-attention mechanism) is held constant across all queries.

Therefore, we can simplify our previous diagram a bit and depict only one self-attention mechanism, assuming it will be fed a different "query" (Q) every time.

Figure 9.25 - Encoder with self-attention
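Here is the minimal sketch of that per-query computation, using scaled dot-product attention; the tensor names and sizes below are illustrative only, not the book's actual variables:

import torch
import torch.nn.functional as F

torch.manual_seed(17)

# Illustrative sizes: one sequence (N=1) of length two, hidden dimension of three
d_model = 3
keys = torch.randn(1, 2, d_model)    # "keys" (K), one per data point
values = torch.randn(1, 2, d_model)  # "values" (V), one per data point
query = torch.randn(1, 1, d_model)   # "query" (Q) of the second data point (x1)

# One attention score (alpha) per key, using the scaled dot-product
products = torch.bmm(query, keys.transpose(1, 2)) / d_model ** 0.5
alphas = F.softmax(products, dim=-1)  # shape (1, 1, 2), scores sum to one

# The context vector is the alpha-weighted sum of the "values" (V)
context = torch.bmm(alphas, values)   # shape (1, 1, d_model)

Feeding a different "query" (Q) while keeping the same "keys" (K) and "values" (V) yields the context vector, and thus the "hidden state", for each of the other data points.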

The alphas are the attention scores, and they are organized as follows in the alphas attribute (as we have already seen in the "Visualizing Attention" section):

Equation 9.13 - Attention scores

For the encoder, the shape of the alphas attribute is given by (N, L_source, L_source), since the source sequence is looking at itself.

Even though I have described the process as if it were sequential, these operations can be parallelized to generate all "hidden states" at once, which is much more efficient than using a recurrent layer, which is sequential in nature.

We can also use an even more simplified diagram of the encoder that abstracts away the nitty-gritty details of the self-attention mechanism.

Figure 9.26 - Encoder with self-attention (diagram)

The code for our encoder with self-attention is actually quite simple, since most of the moving parts are inside the attention heads:
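The listing itself is not part of this excerpt, so here is a sketch of what such an encoder could look like. It uses PyTorch's built-in nn.MultiheadAttention as a stand-in for the attention-head class developed earlier in the chapter, so the class name and arguments below are assumptions, not the book's actual implementation:

import torch
import torch.nn as nn

class SelfAttnEncoder(nn.Module):
    # A sketch of an encoder built around self-attention: the attention heads
    # do the heavy lifting, and a small feed-forward network transforms their
    # output into the "hidden states".
    def __init__(self, n_heads, d_model, ff_units):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ff_units),
            nn.ReLU(),
            nn.Linear(ff_units, d_model),
        )
        self.alphas = None

    def forward(self, source, mask=None):
        # Self-attention: the source sequence provides queries, keys, and values
        att, alphas = self.self_attn(source, source, source,
                                     key_padding_mask=mask)
        self.alphas = alphas  # attention scores, shape (N, L_source, L_source)
        return self.ffn(att)

encoder = SelfAttnEncoder(n_heads=1, d_model=3, ff_units=10)
source_seq = torch.randn(16, 2, 3)   # (N, L_source, d_model)
hidden_states = encoder(source_seq)  # (N, L_source, d_model)
print(encoder.alphas.shape)          # torch.Size([16, 2, 2])

The overall flow is the point here: self-attention over the whole source sequence at once, followed by a feed-forward network, producing all "hidden states" in parallel.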
