
Now each sequence has ten elements, and we have everything we need to build our model.

The Model

The main part of the model is the Transformer encoder, which, coincidentally, is implemented by normalizing the inputs first (norm-first), like our own EncoderLayer and EncoderTransf classes (and unlike PyTorch's default implementation).
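For comparison, here is a minimal sketch of a norm-first encoder built from PyTorch's own classes instead of ours; the norm_first argument is available in recent PyTorch versions, and all sizes below are illustrative:

```python
import torch.nn as nn

# Norm-first: layer normalization is applied BEFORE the self-attention
# and feed-forward sub-layers, as in our EncoderLayer / EncoderTransf
# classes (PyTorch's default, norm_first=False, normalizes afterward)
encoder_layer = nn.TransformerEncoderLayer(
    d_model=16,         # embedding dimension of each sequence element
    nhead=4,            # number of attention heads
    dim_feedforward=64,
    norm_first=True,    # <- normalizes the inputs first
    batch_first=True,   # inputs shaped (N, L, D)
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
```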

The encoder outputs a sequence of "hidden states" (memory), the first of which is used as input to a classifier ("MLP Head"), as briefly discussed in the previous section. So, the model is all about pre-processing the inputs, our images, using a series of transformations (sketched in code right after the list):

• computing a sequence of patch embeddings
• prepending the same special classifier token [CLS] embedding to every sequence
• adding position embedding (or, in our case, position encoding implemented in our encoder)
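To make these three steps concrete, here is a minimal sketch. The sizes match our running example (12x12 single-channel images split into 4x4 patches, so nine patches plus [CLS] gives ten elements), but the names (patch_embed, cls_token) and the convolution-based patch embedding are illustrative assumptions, one common way to implement the projection:

```python
import torch
import torch.nn as nn

n, c, img, patch, d = 32, 1, 12, 4, 16  # batch, channels, sizes, dims
x = torch.randn(n, c, img, img)

# 1) patch embeddings: a convolution with kernel size = stride = patch
#    size chops each image into (12/4)**2 = 9 patches and projects
#    every patch to a d-dimensional embedding
patch_embed = nn.Conv2d(c, d, kernel_size=patch, stride=patch)
embedded = patch_embed(x).flatten(2).transpose(1, 2)  # (N, 9, d)

# 2) prepend the SAME learnable [CLS] embedding to every sequence
cls_token = nn.Parameter(torch.zeros(1, 1, d))
embedded = torch.cat([cls_token.expand(n, -1, -1), embedded], dim=1)

print(embedded.shape)  # torch.Size([32, 10, 16]) -> ten elements each

# 3) position information is added next; in our case, the position
#    encoding is applied inside the encoder itself
```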

The figure below illustrates the architecture.

Figure 10.28 - The Vision Transformer (ViT)

Let’s see it in code!
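Before diving into the full implementation, here is a rough, hypothetical sketch of how the pieces fit together, with PyTorch's built-in encoder standing in for our EncoderTransf and all names and sizes being mine; the one essential detail is that the "MLP Head" reads only the first hidden state, the one at the [CLS] position:

```python
import torch.nn as nn

class ViTSketch(nn.Module):
    def __init__(self, d_model=16, n_heads=4, n_layers=2, n_classes=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=64,
            norm_first=True, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # the "MLP Head": a classifier on top of the [CLS] hidden state
        self.mlp_head = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, n_classes),
        )

    def forward(self, embedded):         # (N, 10, d_model), [CLS] first
        memory = self.encoder(embedded)  # sequence of hidden states
        cls_state = memory[:, 0]         # first hidden state only
        return self.mlp_head(cls_state)  # (N, n_classes) logits
```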

