Einops

"There is more than one way to skin a cat," as the saying goes, and so there is

more than one way to rearrange the pixels into sequences. An alternative

approach uses a package called einops. [153] It is very minimalistic (maybe

even a bit too much) and allows you to express complex rearrangements in a

couple lines of code. It may take a while to get the hang of how it works,

though.

We’re not using it here, but, if you’re interested, this is the einops equivalent of the extract_image_patches() function above:

# Adapted from https://github.com/lucidrains/vit-pytorch/blob/
# main/vit_pytorch/vit_pytorch.py
# !pip install einops
from einops import rearrange

patches = rearrange(padded_img,
                    'b c (h p1) (w p2) -> b (h w) (p1 p2 c)',
                    p1=kernel_size, p2=kernel_size)
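
To see what the pattern does, here is a minimal, self-contained sketch (not from the book; the 4x4 single-channel image and the 2x2 patch size are made up for illustration):

# A hypothetical example: one single-channel 4x4 image, split into 2x2 patches
import torch
from einops import rearrange

kernel_size = 2
padded_img = torch.arange(16.).view(1, 1, 4, 4)  # (batch, channel, height, width)
patches = rearrange(padded_img,
                    'b c (h p1) (w p2) -> b (h w) (p1 p2 c)',
                    p1=kernel_size, p2=kernel_size)
print(patches.shape)  # torch.Size([1, 4, 4]) - four flattened 2x2 patches
print(patches[0, 0])  # tensor([0., 1., 4., 5.]) - the top-left patch

The pattern string names the axes: the input height is split into h blocks of p1 rows, the width into w blocks of p2 columns, and the output stacks the resulting h * w patches along a single sequence dimension, flattening each patch into a vector.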

Special Classifier Token

In Chapter 8, the final hidden state represented the full sequence. This approach had its shortcomings (the attention mechanism was developed to compensate for them), but it leveraged the fact that there was an underlying sequential structure to the data.

This is not quite the same for images, though. The sequence of patches is a clever way of making the data suitable for the encoder, sure, but it does not necessarily reflect a sequential structure; after all, we end up with two different sequences depending on which direction we choose to go over the patches: row-wise or column-wise.
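
A quick sketch makes this concrete (this example is not from the book; the 4x4 image and 2x2 patches are made up): swapping h and w in the output pattern changes the traversal order, so the very same patches come out as a different sequence.

import torch
from einops import rearrange

img = torch.arange(16.).view(1, 1, 4, 4)
# Row-wise traversal: patch rows vary slowest (the pattern used above)
row_wise = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=2, p2=2)
# Column-wise traversal: patch columns vary slowest
col_wise = rearrange(img, 'b c (h p1) (w p2) -> b (w h) (p1 p2 c)', p1=2, p2=2)
print(torch.equal(row_wise, col_wise))  # False - same patches, different order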

"Can’t we use the full sequence of 'hidden states' then? Or maybe

average them?"

It is definitely possible to use the average of the "hidden states" produced by the encoder as input for the classifier.
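
As a sketch of that idea (the shapes and names here are illustrative, not the book’s actual model), averaging over the sequence dimension produces one feature vector per image, which a linear layer can then classify:

import torch
import torch.nn as nn

# Hypothetical shapes: a mini-batch of 16 images, each a sequence of
# 64 "hidden states" of dimension 96 coming out of the encoder
hidden_states = torch.randn(16, 64, 96)

features = hidden_states.mean(dim=1)  # (16, 96) - one vector per image
classifier = nn.Linear(96, 10)        # e.g., ten classes
logits = classifier(features)         # (16, 10)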
