

Output

tensor([[[ 0.5475, 0.0875, -1.2350]]])

Even though the first "key" (K₀) is the smallest in size, it is the most well-aligned to the "query," and, overall, is the "key" with the largest dot product. This means that the decoder would pay the most attention to this particular key. Makes sense, right?

Applying the softmax to these values gives us the following attention scores:

scores = F.softmax(prod, dim=-1)
scores

Output

tensor([[[0.5557, 0.3508, 0.0935]]])

Unsurprisingly, the first key got the largest weight. Let's use these weights to compute the context vector:

Equation 9.8 - Computing the context vector

context vector = α₀ v₀ + α₁ v₁ + α₂ v₂

v = k
context = torch.bmm(scores, v)
context

Output

tensor([[[ 0.5706, -0.0993]]])

Better yet, let's visualize the context vector.
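Since scores has shape (1, 1, 3) and the values have shape (1, 3, 2), torch.bmm() is doing nothing more than the weighted sum in Equation 9.8. A minimal sanity check, reusing the scores above but with made-up value vectors (so the result won't match the context vector printed above), could look like this:

import torch

# attention scores from the output above: (N=1, 1, L=3)
scores = torch.tensor([[[0.5557, 0.3508, 0.0935]]])
# made-up value vectors, just for illustration: (N=1, L=3, D=2)
v = torch.tensor([[[ 0.8, -0.2],
                   [ 0.3,  0.1],
                   [-0.5,  0.4]]])

context = torch.bmm(scores, v)          # (1, 1, 2)
# Equation 9.8, spelled out: a weighted sum of the value vectors
weighted_sum = (scores[0, 0, 0] * v[0, 0]
                + scores[0, 0, 1] * v[0, 1]
                + scores[0, 0, 2] * v[0, 2])
print(torch.allclose(context[0, 0], weighted_sum))  # True

Plugging in the chapter's actual value vectors would reproduce the context vector printed above.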

Since the context vector is a weighted sum of the values (or keys, since we're not applying any affine transformations yet), it is only logical that its location is somewhere between the other vectors.

"Why do we need to scale the dot product?"

If we don't, the distribution of attention scores will get too skewed because the softmax function is actually affected by the scale of its inputs:

dummy_product = torch.tensor([4.0, 1.0])
(F.softmax(dummy_product, dim=-1),
 F.softmax(100*dummy_product, dim=-1))

Output

(tensor([0.9526, 0.0474]), tensor([1., 0.]))

See? As the scale of the dot products grows larger, the resulting distribution of the softmax gets more and more skewed.

In our case, there isn't much difference because our vectors have only two dimensions:
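A minimal sketch of that comparison, assuming the usual scaling by the square root of the number of dimensions and using made-up query and key vectors (so the numbers won't match the chapter's), could look like this:

import torch
import torch.nn.functional as F

# made-up query and keys (two dimensions each), just for illustration
q = torch.tensor([[[ 0.8,  0.4]]])              # query: (N=1, 1, dims=2)
k = torch.tensor([[[ 0.9,  0.3],
                   [ 0.4,  0.5],
                   [-0.6, -0.7]]])              # keys: (N=1, L=3, dims=2)

dims = k.size(-1)
prod = torch.bmm(q, k.permute(0, 2, 1))         # raw dot products: (1, 1, 3)
scaled_prod = prod / dims ** 0.5                # divide by the square root of dims

# with only two dimensions, the two score distributions stay close to each other
scores = F.softmax(prod, dim=-1)
scaled_scores = F.softmax(scaled_prod, dim=-1)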

