Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)


layer_norm = nn.LayerNorm(d_model)
normalized = layer_norm(inputs)
normalized[0][0].mean(), normalized[0][0].std(unbiased=False)

Output

(tensor(-1.4901e-08, grad_fn=<MeanBackward0>),
 tensor(1.0000, grad_fn=<StdBackward0>))

Zero mean and unit standard deviation, as expected.

"Why do they have a grad_fn attribute?"

Like batch normalization, layer normalization can learn affine transformations. Yes, plural: Each feature has its own affine transformation. Since we're using layer normalization on d_model, and its dimensionality is four, there will be four weights and four biases in the state_dict():

layer_norm.state_dict()

Output

OrderedDict([('weight', tensor([1., 1., 1., 1.])),
             ('bias', tensor([0., 0., 0., 0.]))])

The weights and biases are used to scale and translate, respectively, the standardized values:

Equation 10.10 - Layer normalization (with affine transformation)

$\text{layer normed } x = b + w \odot \text{standardized } x$

In PyTorch's documentation, though, you'll find gamma and beta instead:

Equation 10.11 - Layer Normalization (with affine transformation)

$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta$
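To make the affine transformation concrete, here is a minimal sketch (not from the book) that reproduces nn.LayerNorm by hand, following the formula above from PyTorch's documentation; since d_model and inputs were defined earlier in the chapter, it uses stand-ins (d_model = 4 and a random tensor of shape (N, L, d_model)):

import torch
import torch.nn as nn

# Stand-ins for d_model and `inputs` from earlier in the chapter
torch.manual_seed(42)
d_model = 4
inputs = torch.randn(3, 2, d_model)  # (N, L, d_model)

layer_norm = nn.LayerNorm(d_model)
normalized = layer_norm(inputs)

# Standardize over the last dimension (one mean / std per data point)...
mean = inputs.mean(dim=-1, keepdim=True)
var = inputs.var(dim=-1, keepdim=True, unbiased=False)
standardized = (inputs - mean) / torch.sqrt(var + layer_norm.eps)
# ...then scale by the weights (gamma) and translate by the biases (beta)
manual = layer_norm.weight * standardized + layer_norm.bias

torch.allclose(normalized, manual)

Output

True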


Batch and layer normalization look quite similar to one another, but there are some important differences between them that we need to point out.

Batch vs Layer

Although both normalizations compute statistics, namely, mean and biased standard deviation, to standardize the inputs, only batch norm needs to keep track of running statistics.

Moreover, since layer normalization considers data points individually, it exhibits the same behavior whether the model is in training or in evaluation mode.
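A quick sanity check (a minimal sketch, not from the book, using a hypothetical mini-batch of eight points with four features) makes both points concrete: batch norm registers buffers for its running statistics and behaves differently in evaluation mode, while layer norm has no such buffers and is unaffected by train() / eval():

import torch
import torch.nn as nn

torch.manual_seed(17)
dummy = torch.randn(8, 4)  # hypothetical mini-batch: 8 points, 4 features

batch_norm = nn.BatchNorm1d(num_features=4)
layer_norm = nn.LayerNorm(4)

# Only batch norm keeps running statistics (registered as buffers)
[name for name, _ in batch_norm.named_buffers()]
# ['running_mean', 'running_var', 'num_batches_tracked']
[name for name, _ in layer_norm.named_buffers()]
# []

# Layer norm: identical outputs in training and evaluation modes
layer_norm.train()
out_train = layer_norm(dummy)
layer_norm.eval()
out_eval = layer_norm(dummy)
torch.allclose(out_train, out_eval)  # True

# Batch norm: evaluation mode uses the running statistics instead of the
# mini-batch's own statistics, so the outputs are different
batch_norm.train()
bn_train = batch_norm(dummy)
batch_norm.eval()
bn_eval = batch_norm(dummy)
torch.allclose(bn_train, bn_eval)  # False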

To illustrate the difference between the two types of normalization, let's generate yet another dummy example (again adding positional encoding to it):

torch.manual_seed(23)
dummy_points = torch.randn(4, 1, 256)
dummy_pe = PositionalEncoding(1, 256)
dummy_enc = dummy_pe(dummy_points)
dummy_enc

Output

tensor([[[-14.4193,  10.0495,  -7.8116,  ..., -18.0732,  -3.9566]],

        [[  2.6628,  -3.5462, -23.6461,  ..., -18.4375, -37.4197]],

        [[-24.6397,  -1.9127, -16.4244,  ..., -26.0550, -14.0706]],

        [[ 13.7988,  21.4612,  10.4125,  ..., -17.0188,   3.9237]]])

There are four sequences, so let's pretend there are two mini-batches of two sequences each (N=2). Each sequence has a length of one (L=1 is not quite a sequence, I know), and their sole data points have 256 features (D=256). The figure below illustrates the difference between applying batch norm (over features / columns) and layer norm (over data points / rows).
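As a rough sketch of the same idea (not the book's figure; it reuses dummy_enc from the snippet above, squeezes out the length dimension, and, for simplicity, treats all four encoded data points as a single mini-batch instead of the two mini-batches of two described above), we can check which dimension each normalization standardizes:

import torch.nn as nn

# (4, 1, 256) -> (4, 256): rows are data points, columns are features
points = dummy_enc.squeeze(1)

# Batch norm standardizes each feature (column) across the data points
col_normed = nn.BatchNorm1d(256)(points)
col_normed.mean(dim=0).abs().max()     # ~0: every column has zero mean

# Layer norm standardizes each data point (row) across its features
row_normed = nn.LayerNorm(256)(points)
row_normed.mean(dim=1)                 # ~0 for every row
row_normed.std(dim=1, unbiased=False)  # ~1 for every row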

