
For this chapter, we need the following imports:

import copy
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset, random_split, \
    TensorDataset
from torchvision.transforms import Compose, Normalize, Pad
from data_generation.square_sequences import generate_sequences
from data_generation.image_classification import generate_dataset
from helpers import index_splitter, make_balanced_sampler
from stepbystep.v4 import StepByStep
# These are the classes we built in Chapter 9
from seq2seq import PositionalEncoding, subsequent_mask, \
    EncoderDecoderSelfAttn

Transform and Roll Out

We’re actually quite close to developing our own version of the famous Transformer model. The encoder-decoder architecture with positional encoding is missing only a few details to effectively "transform and roll out" :-)

"What’s missing?"

First, we need to revisit the multi-headed attention mechanism to make it less computationally expensive by using narrow attention. Then, we’ll learn about a new kind of normalization: layer normalization. Finally, we’ll add some more bells and whistles: dropout, residual connections, and more "layers" (like the encoder and decoder "layers" from the last chapter).
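Just to get a rough picture of how those pieces will fit together, here is a minimal, illustrative sketch of a Transformer-style encoder "layer" that combines self-attention, layer normalization (nn.LayerNorm), dropout, and residual connections. The class name EncoderLayerSketch is ours (not one of the book’s classes), it leans on PyTorch’s built-in nn.MultiheadAttention only to stay runnable, and it picks one common ordering of normalization and residuals; the chapter develops its own version of each of these components.

import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    # Illustrative only: not one of the book's classes
    def __init__(self, n_heads, d_model, ff_units, dropout=0.1):
        super().__init__()
        # PyTorch's built-in multi-headed attention, used here only to
        # keep the sketch self-contained
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ff_units),
            nn.ReLU(),
            nn.Linear(ff_units, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)   # layer normalization
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)      # dropout

    def forward(self, x):
        # residual connection around the self-attention sub-layer
        normed = self.norm1(x)
        att, _ = self.self_attn(normed, normed, normed)
        x = x + self.drop(att)
        # residual connection around the feed-forward sub-layer
        x = x + self.drop(self.ffn(self.norm2(x)))
        return x

# tiny usage example: one sequence, four steps, six features
layer = EncoderLayerSketch(n_heads=3, d_model=6, ff_units=32)
out = layer(torch.randn(1, 4, 6))   # shape is preserved: (1, 4, 6)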

Narrow Attention

In the last chapter, we used full attention heads to build a multi-headed attention mechanism, and we called it wide attention. Although this mechanism works well, it gets prohibitively expensive as the number of dimensions grows. That’s when the
