layer_norm = nn.LayerNorm(d_model)
normalized = layer_norm(inputs)
normalized[0][0].mean(), normalized[0][0].std(unbiased=False)

Output

(tensor(-1.4901e-08, grad_fn=<MeanBackward0>),
 tensor(1.0000, grad_fn=<StdBackward0>))

Zero mean and unit standard deviation, as expected.

"Why do they have a grad_fn attribute?"

Like batch normalization, layer normalization can learn affine transformations. Yes, plural: each feature has its own affine transformation. Since we're using layer normalization on d_model, and its dimensionality is four, there will be four weights and four biases in the state_dict():

layer_norm.state_dict()

Output

OrderedDict([('weight', tensor([1., 1., 1., 1.])),
             ('bias', tensor([0., 0., 0., 0.]))])

The weights and biases are used to scale and translate, respectively, the standardized values:

Equation 10.10 - Layer normalization (with affine transformation)

$$\text{layer norm}(x) = w \odot \frac{x - \bar{x}}{\sigma(x)} + b$$

In PyTorch's documentation, though, you'll find gamma and beta instead:

Equation 10.11 - Layer normalization (with affine transformation)

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta$$
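To make Equation 10.10 concrete, here is a minimal sketch that reproduces nn.LayerNorm's output by hand; the d_model size and the randomly generated inputs are assumptions standing in for the running example:

import torch
import torch.nn as nn

torch.manual_seed(42)
d_model = 4
inputs = torch.randn(2, 3, d_model)  # hypothetical (N, L, d_model) batch

layer_norm = nn.LayerNorm(d_model)

# Standardize each data point over its (last) feature dimension...
mean = inputs.mean(dim=-1, keepdim=True)
var = inputs.var(dim=-1, unbiased=False, keepdim=True)
standardized = (inputs - mean) / torch.sqrt(var + layer_norm.eps)
# ...then scale by the weights and translate by the biases
manual = layer_norm.weight * standardized + layer_norm.bias

print(torch.allclose(manual, layer_norm(inputs), atol=1e-6))  # True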
Batch and layer normalization look quite similar to one another, but there are some
important differences between them that we need to point out.
Batch vs Layer
Although both normalizations compute statistics, namely, mean and biased
standard deviation, to standardize the inputs, only batch norm needs to keep track
of running statistics.
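We can see that directly in each module's state_dict(); a quick sketch (the 256-dimensional size simply anticipates the example below):

import torch.nn as nn

batch_norm = nn.BatchNorm1d(256)
layer_norm = nn.LayerNorm(256)

# Batch norm's state carries running statistics (and a batch counter)
# in addition to the learnable affine parameters
print(list(batch_norm.state_dict().keys()))
# ['weight', 'bias', 'running_mean', 'running_var', 'num_batches_tracked']

# Layer norm's state holds the affine parameters alone
print(list(layer_norm.state_dict().keys()))
# ['weight', 'bias']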
Moreover, since layer normalization considers data points
individually, it exhibits the same behavior whether the model is
in training or in evaluation mode.
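A short sketch of that point, using a hypothetical mini-batch of eight 256-feature data points: layer norm's output is identical in both modes, while batch norm's is not (batch statistics in training, running statistics in evaluation):

import torch
import torch.nn as nn

torch.manual_seed(11)
points = torch.randn(8, 256)  # hypothetical mini-batch

layer_norm = nn.LayerNorm(256)
ln_train = layer_norm.train()(points)
ln_eval = layer_norm.eval()(points)
print(torch.allclose(ln_train, ln_eval))  # True: same behavior in both modes

batch_norm = nn.BatchNorm1d(256)
bn_train = batch_norm.train()(points)  # standardizes with this batch's statistics
bn_eval = batch_norm.eval()(points)    # standardizes with the running statistics
print(torch.allclose(bn_train, bn_eval))  # False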
To illustrate the difference between the two types of normalization, let’s generate
yet another dummy example (again adding positional encoding to it):
torch.manual_seed(23)
dummy_points = torch.randn(4, 1, 256)
dummy_pe = PositionalEncoding(1, 256)
dummy_enc = dummy_pe(dummy_points)
dummy_enc
Output
tensor([[[-14.4193, 10.0495, -7.8116, ..., -18.0732, -3.9566]],
[[ 2.6628, -3.5462, -23.6461, ..., -18.4375, -37.4197]],
[[-24.6397, -1.9127, -16.4244, ..., -26.0550, -14.0706]],
[[ 13.7988, 21.4612, 10.4125, ..., -17.0188, 3.9237]]])
There are four sequences, so let's pretend there are two mini-batches of two
sequences each (N=2). Each sequence has a length of one (L=1 is not quite a
sequence, I know), and its sole data point has 256 features (D=256). The figure
below illustrates the difference between applying batch norm (over features /
columns) and layer norm (over data points / rows).
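Complementing the figure, the sketch below applies both normalizations to the first pretend mini-batch (squeezing the length dimension away is an assumption, just to get a plain N=2 by D=256 matrix) and checks which dimension ends up standardized:

import torch.nn as nn

first_batch = dummy_enc.squeeze(1)[:2]  # two data points, 256 features each

# Batch norm standardizes each feature (column) across the data points
bn_out = nn.BatchNorm1d(256)(first_batch)
print(bn_out.mean(dim=0).abs().max(), bn_out.std(dim=0, unbiased=False).mean())
# per-feature means ~0 and standard deviations ~1

# Layer norm standardizes each data point (row) across its features
ln_out = nn.LayerNorm(256)(first_batch)
print(ln_out.mean(dim=1), ln_out.std(dim=1, unbiased=False))
# per-data-point means ~0 and standard deviations ~1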