narrow attention comes in: Each attention head will get a chunk of the transformed data points (projections) to work with.

Chunking

This is a detail of utmost importance: The attention heads do not use chunks of the original data points, but rather those of their projections.

"Why?"

To understand why, let's take an example of an affine transformation, one that generates "values" (v₀) from the first data point (x₀).

Figure 10.1 - Narrow attention

The transformation above takes a single data point of four dimensions (features) and turns it into a "value" (also with four dimensions) that's going to be used in the attention mechanism.
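For concreteness, here is a minimal PyTorch sketch of such a transformation and of how narrow attention would chunk its output. The layer name (linear_v), the random input, and the choice of two heads are illustrative assumptions, not the book's own code:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# a single data point with four features, as in Figure 10.1
x0 = torch.randn(1, 4)

# the affine transformation that generates "values"
# (hypothetical name; its weights and bias live inside the layer)
linear_v = nn.Linear(in_features=4, out_features=4)

# the projection: v0 also has four dimensions
v0 = linear_v(x0)

# narrow attention: each head gets a chunk of the PROJECTION,
# not of the original data point (assuming two heads here)
v0_chunks = v0.chunk(2, dim=-1)

print(v0.shape)            # torch.Size([1, 4])
print(v0_chunks[0].shape)  # torch.Size([1, 2])
```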

At first sight, it may look like we'll get the same result whether we split the inputs into chunks or we split the projections into chunks. But that's definitely not the case.
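To see the difference numerically, here is a quick check that builds on the sketch above (same hypothetical linear_v and x0): it compares the first chunk of the full projection with the result of projecting only the first chunk of the input through a matching slice of the weights. The slicing scheme is just one illustrative way of "splitting the inputs"; the point is that the two results differ, because every projected dimension mixes all four input features:

```python
# weights and bias of the (hypothetical) value projection above
W, b = linear_v.weight, linear_v.bias  # W is 4x4, b has 4 elements

# first chunk of the projection: every output dimension is a
# weighted sum of ALL four features of x0
proj_then_chunk = (x0 @ W.T + b).chunk(2, dim=-1)[0]

# projecting a chunk of the input instead: only the first two
# features (and a 2x2 slice of W) contribute
chunk_then_proj = x0[:, :2] @ W[:2, :2].T + b[:2]

print(proj_then_chunk)  # not equal to the line below
print(chunk_then_proj)
```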

So, let’s zoom in and look at the individual weights inside that transformation.
