Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)


Figure 10.18 - Losses - PyTorch's Transformer

Once again, the validation loss is significantly lower than the training loss. No surprises here since it is roughly the same model.

Visualizing Predictions

Let's plot the predicted coordinates and connect them using dashed lines, while using solid lines to connect the actual coordinates, just like before.

Figure 10.19 - Predictions

Once again, looking good, right?
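A minimal sketch of how such a plot could be drawn with Matplotlib (the names actual and predicted and the toy coordinates are assumptions for illustration, not the book's helper function):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical (x, y) coordinate arrays of shape (n_points, 2)
actual = np.array([[-1., -1.], [-1., 1.], [1., 1.], [1., -1.]])
predicted = actual + np.random.randn(4, 2) * .05  # toy "predictions"

fig, ax = plt.subplots()
ax.plot(actual[:, 0], actual[:, 1], 'b-o', label='Actual')            # solid lines
ax.plot(predicted[:, 0], predicted[:, 1], 'r--o', label='Predicted')  # dashed lines
ax.legend()
plt.show()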


Vision Transformer

The Transformer architecture is fairly flexible and, although it was originally devised to handle NLP tasks, it is already starting to spread to other areas, including computer vision. Let's take a look at one of the latest developments in the field: the Vision Transformer (ViT). It was introduced by Dosovitskiy, A., et al. in their paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." [152]

"Cool, but I thought the Transformer handled sequences, not images."

That's a fair point. The answer is deceptively simple: let's break an image into a sequence of patches.
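To put numbers on it: a single-channel 12x12 image split into non-overlapping 4x4 patches yields a 3x3 grid of patches, that is, a sequence of nine "tokens" with 16 pixel values each, and a sequence is something the Transformer already knows how to handle.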

Data Generation & Preparation

First, let's bring back our multiclass classification problem from Chapter 5. We're generating a synthetic dataset of 1,000 images, each containing either a diagonal or a parallel line, and labeling them according to the table below:

Line                                 Label/Class Index
Parallel (Horizontal OR Vertical)    0
Diagonal, Tilted to the Right        1
Diagonal, Tilted to the Left         2

Data Generation

# 1,000 single-channel 12x12 images across the three classes (binary=False)
images, labels = generate_dataset(img_size=12, n_images=1000,
                                  binary=False, seed=17)
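The preparation step mirrors what we did in Chapter 5. Here is a minimal sketch of one possible way to wrap the generated arrays into data loaders (the 80/20 split and the batch size are simplifying assumptions for illustration, not the book's own helper classes; images is assumed to have shape (1000, 1, 12, 12)):

import torch
from torch.utils.data import TensorDataset, DataLoader

# Scale pixel values to [0, 1]; shapes: x (1000, 1, 12, 12), y (1000,)
x_tensor = torch.as_tensor(images / 255.).float()
y_tensor = torch.as_tensor(labels).long()

# Simple 80/20 train/validation split (assumption for illustration)
n_train = int(.8 * len(x_tensor))
train_loader = DataLoader(TensorDataset(x_tensor[:n_train], y_tensor[:n_train]),
                          batch_size=16, shuffle=True)
val_loader = DataLoader(TensorDataset(x_tensor[n_train:], y_tensor[n_train:]),
                        batch_size=16)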

Each image, like the example below, is 12x12 pixels in size and has a single channel:

img = torch.as_tensor(images[2]).unsqueeze(0).float()/255.  # add batch dim, scale to [0, 1]
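As a preview of the patch idea, here is a minimal sketch of how this single image could be chopped into a sequence of flat patches with PyTorch's unfold (the 4x4 patch size is an assumption for illustration; img is assumed to have shape (1, 1, 12, 12)):

import torch.nn.functional as F

# Split the image into non-overlapping 4x4 patches
patches = F.unfold(img, kernel_size=4, stride=4)  # (1, 16, 9)
# Each of the 9 patches becomes a 16-dimensional "token"
seq = patches.transpose(1, 2)                     # (1, 9, 16)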

