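The snippets below operate on an inputs tensor of shape (N, L, D), with N=3 sequences, L=2 data points each, and D=4 features, defined earlier in the chapter. A hypothetical stand-in to make them runnable (its random values will not reproduce the exact outputs shown):

import torch
import torch.nn as nn

torch.manual_seed(42)  # arbitrary seed; the book's original values differ
inputs = torch.randn(3, 2, 4)  # (N, L, D)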
Equation 10.7 - Data points' means over features (D)

\mu_{n,l} = \frac{1}{D}\sum_{d=1}^{D}{x_{n,l,d}}

inputs_mean = inputs.mean(axis=2).unsqueeze(2)
inputs_mean

Output

tensor([[[-0.3529],
         [ 0.2426]],

        [[ 0.9496],
         [-1.3038]],

        [[ 1.6489],
         [ 3.6841]]])

As expected, six mean values, one for each data point. The unsqueeze() is there to preserve the original dimensionality, thus making the result a tensor of (N, L, 1) shape.

Next, we compute the biased variances over the same dimension (D); their square roots, the biased standard deviations, will be taken during standardization:

Equation 10.8 - Data points' standard deviations over features (D)

\sigma_{n,l} = \sqrt{\frac{1}{D}\sum_{d=1}^{D}{(x_{n,l,d} - \mu_{n,l})^2}}

inputs_var = inputs.var(axis=2, unbiased=False).unsqueeze(2)
inputs_var
Output

tensor([[[6.3756],
         [1.6661]],

        [[4.0862],
         [0.3153]],

        [[2.3135],
         [4.6163]]])
No surprises here.
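As a quick sanity check of what unbiased=False means (a sketch, not from the original text): the biased variance divides by D instead of D-1, so it is simply the mean of the squared deviations.

# Biased variance by hand: average squared deviation from the mean (divide by D)
manual_var = ((inputs - inputs_mean) ** 2).mean(axis=2).unsqueeze(2)
print(torch.allclose(manual_var, inputs_var))  # True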
The actual standardization is then computed using the mean, the biased variance, and a tiny epsilon (added to the variance before taking the square root) to guarantee numerical stability:
Equation 10.9 - Layer normalization

\hat{x}_{n,l,d} = \frac{x_{n,l,d} - \mu_{n,l}}{\sqrt{\sigma_{n,l}^2 + \epsilon}}

(inputs - inputs_mean) / torch.sqrt(inputs_var + 1e-5)
Output

tensor([[[-1.3671,  0.9279, -0.5464,  0.9857],
         [ 1.1953,  0.4438, -0.1015, -1.5376]],

        [[-1.6706,  0.2010,  0.9458,  0.5238],
         [ 0.4782,  0.0485, -1.6106,  1.0839]],

        [[-1.6129,  0.2116,  1.1318,  0.2695],
         [ 0.2520,  1.5236, -1.0272, -0.7484]]])
The values above are layer normalized. It is possible to achieve the very same
results by using PyTorch’s own nn.LayerNorm, of course:
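A minimal sketch of that, assuming inputs holds D=4 features in its last dimension (the dimension being normalized): nn.LayerNorm's learnable weight and bias are initialized to ones and zeros, respectively, so a freshly created instance, with its default eps of 1e-5, reproduces the manual standardization above.

layer_norm = nn.LayerNorm(4)  # normalized_shape matches D, the feature dimension
normalized = layer_norm(inputs)

# A fresh instance (weight=1, bias=0, eps=1e-5) matches the manual computation
manual = (inputs - inputs_mean) / torch.sqrt(inputs_var + 1e-5)
print(torch.allclose(normalized, manual, atol=1e-6))  # True

During training, the weight and bias are learned, so the model can scale and shift the normalized values if that improves its performance.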