Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)


In PyTorch, the decoder is implemented as nn.TransformerDecoder, and its constructor method expects similar arguments: decoder_layer, num_layers, and an optional normalization layer to normalize (or not) the outputs. Note that the norm argument takes a module instance (sized to match d_model), not the nn.LayerNorm class itself:

declayer = nn.TransformerDecoderLayer(
    d_model=6, nhead=3, dim_feedforward=20
)
dectransf = nn.TransformerDecoder(
    declayer, num_layers=1, norm=nn.LayerNorm(6)
)

PyTorch's decoder also behaves a bit differently than ours, since it does not (at the time of writing) implement positional encoding for the inputs, and it does not normalize the outputs by default.
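To make the call signature concrete, here is a minimal usage sketch for the decoder built above. The shapes and tensor names are my own illustration, not the book's code, and they assume PyTorch's default (L, N, d_model) layout. The decoder's forward() takes the (shifted) target sequence and the encoder's output, called the "memory":

torch.manual_seed(42)
# Hypothetical shapes: batch of 16, source length 4, target length 2
memory = torch.randn(4, 16, 6)  # encoder output: (L_source, N, d_model)
tgt = torch.randn(2, 16, 6)     # target sequence: (L_target, N, d_model)
# In practice, a causal tgt_mask would also be passed to hide future positions
out = dectransf(tgt=tgt, memory=memory)
out.shape  # torch.Size([2, 16, 6])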

Before putting the encoder and the decoder together, we still have to make a short pit-stop and address that teeny-tiny detail…

Layer Normalization

Layer normalization was introduced by Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton in their 2016 paper "Layer Normalization," [151] but it only got really popular after being used in the hugely successful Transformer architecture. They say: "…we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case" (the highlight is mine).

Simply put: Layer normalization standardizes individual data points, not features.

This is completely different than the standardizations we've performed so far. Before, each feature, either in the whole training set (using Scikit-learn's StandardScaler way back in Chapter 0), or in a mini-batch (using batch norm in Chapter 7), was standardized to have zero mean and unit standard deviation. In a tabular dataset, we standardized the columns.

Layer normalization, in a tabular dataset, standardizes the rows. Each data point will have the average of its features equal zero, and the standard deviation of its features will equal one.

Let's assume we have a mini-batch of three sequences (N=3), each sequence having a length of two (L=2), each data point having four features (D=4), and, to illustrate the importance of layer normalization, let's add positional encoding to it too:

d_model = 4
seq_len = 2
n_points = 3

torch.manual_seed(34)
data = torch.randn(n_points, seq_len, d_model)
pe = PositionalEncoding(seq_len, d_model)
inputs = pe(data)
inputs

Output

tensor([[[-3.8049,  1.9899, -1.7325,  2.1359],
         [ 1.7854,  0.8155,  0.1116, -1.7420]],

        [[-2.4273,  1.3559,  2.8615,  2.0084],
         [-1.0353, -1.2766, -2.2082, -0.6952]],

        [[-0.8044,  1.9707,  3.3704,  2.0587],
         [ 4.2256,  6.9575,  1.4770,  2.0762]]])

It should be straightforward to identify the different dimensions, N (three vertical groups), L (two rows in each group), and D (four columns), in the tensor above. There are six data points in total, and their value range is mostly the result of the addition of positional encoding.

Well, layer normalization standardizes individual data points, the rows in the tensor above, so we need to compute statistics over the corresponding dimension (D). Let's start with the means:
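The chapter picks up the computation from here; as a bridge, here is a minimal sketch of the whole standardization, continuing from the snippet above and assuming the statistics are taken over the last dimension (D) with keepdim=True for broadcasting. The comparison against PyTorch's nn.LayerNorm is my own sanity check, not the book's code:

# One mean and one (biased) standard deviation per data point (row)
inputs_mean = inputs.mean(dim=2, keepdim=True)                 # (N, L, 1)
inputs_std = inputs.std(dim=2, unbiased=False, keepdim=True)   # (N, L, 1)

# Standardizing the rows: zero mean and unit standard deviation over D
standardized = (inputs - inputs_mean) / (inputs_std + 1e-5)

# nn.LayerNorm(d_model) should agree, since its learnable parameters are
# initialized to weight=1 and bias=0 (an identity affine transformation)
layer_norm = nn.LayerNorm(d_model)
torch.allclose(standardized, layer_norm(inputs), atol=1e-4)  # True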

