Figure 10.23 - Sample image—split into a sequence of flattened patches

That’s more like it: each image is turned into a sequence of length nine, each element in the sequence having 16 features (pixel values, in this case).
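To make this concrete, here is a minimal sketch (assuming a single-channel 12x12 image split into 4x4 patches, which matches the nine-patch, 16-feature sequence above) that uses PyTorch’s F.unfold to produce the flattened patches:

import torch
import torch.nn.functional as F

# A dummy single-channel 12x12 image (batch of one)
img = torch.arange(144.).reshape(1, 1, 12, 12)

# Slide a 4x4 window with stride 4 (no overlap), flattening each patch;
# unfold returns (N, C * 4 * 4, num_patches), so we transpose it into
# the sequence-first layout: (N, num_patches, features)
patches = F.unfold(img, kernel_size=4, stride=4).transpose(1, 2)
print(patches.shape)  # torch.Size([1, 9, 16])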

Embeddings

If each patch is like a receptive field, and we even talked about kernel size and stride, why not go full convolution, then? That’s how the Vision Transformer (ViT) actually implements patch embeddings:

Patch Embeddings

# Adapted from https://amaarora.github.io/2021/01/18/ViT.html
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16,
                 in_channels=3, embed_dim=768, dilation=1):
        super().__init__()
        # Non-overlapping patches form a grid with
        # (img_size // patch_size) patches per side
        num_patches = (img_size // patch_size) * \
                      (img_size // patch_size)
        self.img_size = img_size
        self.patch_size = patch_size
        self.num_patches = num_patches
        # A convolution whose kernel size and stride both equal the
        # patch size projects each patch into an embed_dim-dimensional
        # vector (dilation is kept in the signature but unused here)
        self.proj = nn.Conv2d(in_channels,
                              embed_dim,
                              kernel_size=patch_size,
                              stride=patch_size)

    def forward(self, x):
        # (N, C, H, W) -> (N, embed_dim, H/P, W/P), then flatten the
        # spatial dimensions and swap axes: (N, num_patches, embed_dim)
        x = self.proj(x).flatten(2).transpose(1, 2)
        return x
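As a quick sanity check (a hypothetical usage example, not part of the original listing), feeding a dummy batch through the default configuration yields a sequence of 196 patch embeddings, one per 16x16 patch of a 224x224 image:

dummy = torch.randn(1, 3, 224, 224)  # one RGB 224x224 image
embeddings = PatchEmbed()(dummy)
print(embeddings.shape)  # torch.Size([1, 196, 768]); (224 // 16) ** 2 = 196 patches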
