"… with great depth comes great complexity …"Peter Parker…and, along with that, overfitting.But we also know that dropout works pretty well as a regularizer, so we can throwthat in the mix as well."How are we adding normalization, residual connections, and dropoutto our model?"We’ll wrap each and every "sub-layer" with them! Cool, right? But that brings upanother question: How to wrap them? It turns out, we can wrap a "sub-layer" in oneof two ways: norm-last or norm-first.Figure 10.7 - "Sub-Layers"—norm-last vs norm-firstThe norm-last wrapper follows the "Attention Is All you Need" [149] paper to theletter:"We employ a residual connection around each of the two sub-layers, followed bylayer normalization. That is, the output of each sub-layer isLayerNorm(x+Sublayer(x)), where Sublayer(x) is the function implemented bythe sub-layer itself."The norm-first wrapper follows the "sub-layer" implementation described in "TheAnnotated Transformer," [150] which explicitly places norm first as opposed to lastWrapping "Sub-Layers" | 809
for the sake of code simplicity.Let’s turn the diagrams above into equations:Equation 10.3 - Outputs—norm-first vs norm-lastThe equations are almost the same, except for the fact that the norm-last wrapper(from "Attention Is All You Need") normalizes the outputs and the norm-firstwrapper (from "The Annotated Transformer") normalizes the inputs. That’s asmall, yet important, difference."Why?"If you’re using positional encoding, you want to normalize your inputs, so normfirstis more convenient."What about the outputs?"We’ll normalize the final outputs; that is, the output of the last "layer" (which isthe output of its last, not normalized, "sub-layer"). Any intermediate output issimply the input of the subsequent "sub-layer," and each "sub-layer" normalizes itsown inputs.There is another important difference that will be discussed in thenext section.From now on, we’re sticking with norm-first, thus normalizing the inputs:Equation 10.4 - Outputs—norm-firstBy wrapping each and every "sub-layer" inside both encoder "layers" and decoder"layers," we’ll arrive at the desired Transformer architecture.Let’s start with the…810 | Chapter 10: Transform and Roll Out
Let’s turn the diagrams above into equations:
Equation 10.3 - Outputs—norm-first vs norm-last

norm-first: output = x + Dropout(Sublayer(LayerNorm(x)))
norm-last:  output = LayerNorm(x + Dropout(Sublayer(x)))
The equations are almost the same, except for the fact that the norm-last wrapper
(from "Attention Is All You Need") normalizes the outputs and the norm-first
wrapper (from "The Annotated Transformer") normalizes the inputs. That’s a
small, yet important, difference.
"Why?"
If you’re using positional encoding, you want to normalize your inputs, so norm-first is more convenient.
"What about the outputs?"
We’ll normalize the final outputs; that is, the output of the last "layer" (which is
the output of its last, not normalized, "sub-layer"). Any intermediate output is
simply the input of the subsequent "sub-layer," and each "sub-layer" normalizes its
own inputs.
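To make the contrast concrete, here is a minimal sketch of both wrappers written as plain functions. The names norm_last and norm_first are just for illustration, and a simple linear layer stands in for an actual attention or feed-forward "sub-layer":

import torch
import torch.nn as nn

d_model = 4
layer_norm = nn.LayerNorm(d_model)
dropout = nn.Dropout(0.1)
sublayer = nn.Linear(d_model, d_model)  # stand-in for a real "sub-layer"

def norm_last(x):
    # "Attention Is All You Need": residual connection first, norm last
    return layer_norm(x + dropout(sublayer(x)))

def norm_first(x):
    # "The Annotated Transformer": norm first, residual connection last
    return x + dropout(sublayer(layer_norm(x)))

x = torch.randn(1, 3, d_model)  # N, L, d_model
print(norm_last(x).shape, norm_first(x).shape)  # same shapes

Notice that norm_last returns normalized outputs, while norm_first returns the raw sum of the residual connection, which is exactly the difference discussed above.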
There is another important difference that will be discussed in the
next section.
From now on, we’re sticking with norm-first, thus normalizing the inputs:
Equation 10.4 - Outputs—norm-first

output = x + Dropout(Sublayer(LayerNorm(x)))
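In code, this can be a small module that receives the "sub-layer" as an argument and applies Equation 10.4 to it. This is a minimal sketch, assuming dropout is applied to the "sub-layer" outputs; the class name SubLayerWrapper is illustrative:

import torch.nn as nn

class SubLayerWrapper(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # norm-first: normalize the inputs, run the "sub-layer",
        # apply dropout, and add the residual connection
        return x + self.drop(sublayer(self.norm(x)))

Since the "sub-layer" is an argument of forward(), the very same wrapper can handle a self-attention mechanism or a feed-forward network alike.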
By wrapping each and every "sub-layer" inside both encoder "layers" and decoder
"layers," we’ll arrive at the desired Transformer architecture.
Let’s start with the…