Output

tensor([[[ 0.5475,  0.0875, -1.2350]]])

Even though the first "key" (K0) is the smallest in size, it is the most well-aligned to the "query," and, overall, is the "key" with the largest dot product. This means that the decoder would pay the most attention to this particular key. Makes sense, right?

Applying the softmax to these values gives us the following attention scores:

scores = F.softmax(prod, dim=-1)
scores

Output

tensor([[[0.5557, 0.3508, 0.0935]]])

Unsurprisingly, the first key got the largest weight. Let’s use these weights to compute the context vector:

Equation 9.8 - Computing the context vector

v = k
context = torch.bmm(scores, v)
context

Output

tensor([[[ 0.5706, -0.0993]]])

Better yet, let’s visualize the context vector.
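If you want to reproduce the numbers above from scratch, the following is a minimal, self-contained sketch. The query and key values in it are assumptions: they do not appear in this passage and were simply chosen so that the dot products come out as shown above. As in the text, the values are the keys themselves, since no affine transformations are applied yet.

```python
import torch
import torch.nn.functional as F

# Assumed query and keys (not given in this passage), picked so that
# the dot products match the output above. Shapes follow the batch-first
# convention that torch.bmm expects: (N, L, H).
q = torch.tensor([.55, .95]).view(1, 1, 2)            # one query, two dimensions
k = torch.tensor([[.65, .2],
                  [.85, -.4],
                  [-.95, -.75]]).view(1, 3, 2)        # three keys, two dimensions

# Dot product between the query and every key: (1, 1, 2) x (1, 2, 3) -> (1, 1, 3)
prod = torch.bmm(q, k.permute(0, 2, 1))

# Softmax over the last dimension turns the dot products into attention scores
scores = F.softmax(prod, dim=-1)

# No affine transformations yet, so the values are the keys themselves
v = k

# Context vector: the score-weighted sum of the values, (1, 1, 3) x (1, 3, 2) -> (1, 1, 2)
context = torch.bmm(scores, v)

print(prod)
print(scores)
print(context)
```

Since the scores add up to one, the context vector is a weighted average of the three value vectors, dominated by the first one.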

Since the context vector is a weighted sum of the values (or keys, since we’re not applying any affine transformations yet), it is only logical that its location is somewhere between the other vectors.

"Why do we need to scale the dot product?"

If we don’t, the distribution of attention scores will get too skewed because the softmax function is actually affected by the scale of its inputs:

dummy_product = torch.tensor([4.0, 1.0])
(F.softmax(dummy_product, dim=-1),
 F.softmax(100*dummy_product, dim=-1))

Output

(tensor([0.9526, 0.0474]), tensor([1., 0.]))

See? As the scale of the dot products grows larger, the resulting distribution of the softmax gets more and more skewed.

In our case, there isn’t much difference because our vectors have only two dimensions:
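Both claims can be checked with a short sketch. It assumes the usual scaled dot-product convention of dividing by the square root of the number of dimensions (two, in our case), which this passage does not spell out. The dot products are copied from the output shown earlier, and the multipliers applied to the dummy product are arbitrary illustrative choices.

```python
import math
import torch
import torch.nn.functional as F

# Dot products copied from the output shown earlier
prod = torch.tensor([[[0.5475, 0.0875, -1.2350]]])

# Assumption: scale by the square root of the number of dimensions
dims = 2
unscaled = F.softmax(prod, dim=-1)
scaled = F.softmax(prod / math.sqrt(dims), dim=-1)

print(unscaled)
print(scaled)

# The same effect on the dummy product, for a few arbitrary multipliers:
# the larger the inputs, the closer the top weight gets to one
dummy_product = torch.tensor([4.0, 1.0])
multipliers = (0.1, 1.0, 10.0, 100.0)
top_weights = [F.softmax(m * dummy_product, dim=-1).max().item()
               for m in multipliers]

for m, w in zip(multipliers, top_weights):
    print(f"multiplier={m:>5}: largest weight={w:.4f}")
```

With only two dimensions, the scaling factor is roughly 0.71, so the scaled scores are only slightly flatter than the unscaled ones.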


