Output

tensor([[[2.6334e-01, 6.9912e-02, 1.6958e-01, 1.6574e-01,
          1.1365e-01, 1.3449e-01, 6.6508e-02, 1.6772e-02],
         [2.7878e-05, 2.5806e-03, 2.9353e-03, 1.3467e-01,
          1.7490e-03, 8.5641e-01, 7.3843e-04, 8.8371e-04]],

        [[6.8102e-02, 1.8080e-02, 1.0238e-01, 6.1889e-02,
          6.2652e-01, 1.0388e-02, 1.6588e-02, 9.6055e-02],
         [2.2783e-04, 2.1089e-02, 3.4972e-01, 2.3252e-02,
          5.2879e-01, 3.5840e-02, 2.5432e-02, 1.5650e-02]]],
       device='cuda:0')

"Why are we slicing the third dimension? What is the third dimension

again?"

In the multi-headed self-attention mechanism, the scores have the following shape: (N, n_heads, L, L). We have two sentences (N=2), two attention heads (n_heads=2), and our sequence has eight tokens (L=8).

"I’m sorry, but our sequences have seven tokens, not eight."

Yes, that’s true. But don’t forget about the special classifier token that was prepended to the sequence of embeddings. That’s also the reason why we’re slicing the third dimension: the zero index means we’re looking at the attention scores of the special classifier token. Since we’re using the output corresponding to that token to classify the sentences, it’s only logical to check what it’s paying attention to, right? Moreover, the first value in each attention score tensor above represents how much attention the special classifier token is paying to itself.
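In code, the slicing itself looks like the minimal sketch below. The alphas tensor standing in for our model's attention scores is a made-up placeholder (its name and random values are assumptions); only its shape and the indexing match what we just discussed:

import torch

# stand-in for the attention scores: (N, n_heads, L, L) = (2, 2, 8, 8)
alphas = torch.rand(2, 2, 8, 8)
alphas = alphas / alphas.sum(dim=-1, keepdim=True)  # each row sums to one, like a softmax output

# index zero in the third dimension selects the row of scores computed
# FOR the special classifier token, the first token in every sequence
cls_scores = alphas[:, :, 0, :]     # shape: (N, n_heads, L) = (2, 2, 8)

# the first value in each of those rows is the attention the classifier
# token pays to itself
cls_to_itself = cls_scores[..., 0]  # shape: (N, n_heads)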

So, what is the model paying attention to then? Let’s see!

Figure 11.21 - Attention scores
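If you would like to draw a similar chart yourself, a rough Matplotlib sketch could look like the one below; the placeholder token labels and the cls_scores tensor from the previous sketch are assumptions, not the code that produced Figure 11.21:

import matplotlib.pyplot as plt

# placeholder labels: the classifier token followed by seven generic tokens
tokens = ['[CLS]'] + [f'token {i}' for i in range(1, 8)]

fig, axs = plt.subplots(1, 2, figsize=(10, 3))
for head in range(2):
    # attention paid by the first sentence's classifier token, one head per panel
    scores = cls_scores[0, head].detach().cpu().numpy()
    axs[head].bar(tokens, scores, color='gray')
    axs[head].set_title(f'Attention Head #{head}')
    axs[head].set_ylabel('Attention Score')
    axs[head].tick_params(axis='x', rotation=45)
fig.tight_layout()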

Clearly, the model learned that "white rabbit" and "Alice" are strong signs that a given sentence belongs to Alice’s Adventures in Wonderland. Conversely, if there is a

