sequential order of the data
• figuring out that attention is not enough and that we also need positional
encoding to incorporate sequential order back into the model
• using alternating sines and cosines of different frequencies as positional
encoding
• learning that combining sines and cosines yields interesting properties, such as
keeping constant the encoded distance between any two positions T steps
apart
• using register_buffer() to add an attribute that should be part of the
module’s state without being a parameter
• visualizing self- and cross-attention scores (code sketches of the sinusoidal
encoding and of plotting the scores follow this list)
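
To make those bullets concrete, here is a minimal sketch of a sinusoidal positional
encoding module. It is not the chapter's exact implementation, and the names used here
(PositionalEncoding, angular_speed, max_len, d_model) are only illustrative. The
encoding matrix is attached to the module with register_buffer(), so it is saved in the
state dict and moved between devices together with the module, yet it is not a trainable
parameter. The quick check at the end illustrates the constant-distance property: the
distance between the encodings of any two positions T steps apart does not depend on
where in the sequence they are.

Positional Encoding (sketch)

import math

import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()
        position = torch.arange(max_len).float().unsqueeze(1)      # (max_len, 1)
        # one frequency per pair of dimensions, decreasing geometrically
        angular_speed = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * angular_speed)           # sines on even dimensions
        pe[:, 1::2] = torch.cos(position * angular_speed)           # cosines on odd dimensions
        # part of the module's state, but NOT a parameter (no gradients, not trained)
        self.register_buffer('pe', pe.unsqueeze(0))                 # (1, max_len, d_model)

    def forward(self, x):
        # x: (N, L, d_model) -> add the encodings of the first L positions
        return x + self.pe[:, :x.size(1), :]

penc = PositionalEncoding(max_len=100, d_model=8)
print([name for name, _ in penc.named_parameters()])   # [] - the buffer is not a parameter
print([name for name, _ in penc.named_buffers()])      # ['pe']

# the encoded distance between any two positions T steps apart is constant
encodings = penc.pe.squeeze(0)
T = 4
distances = torch.stack([(encodings[t + T] - encodings[t]).norm()
                         for t in range(encodings.size(0) - T)])
print(distances.min(), distances.max())                 # (virtually) identical values

The distances come out identical (up to floating-point noise) because moving T steps
ahead rotates each sine-cosine pair by a fixed angle, and rotations do not change
distances.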
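
As for the last bullet, the scores are just attention weights, so visualizing them
boils down to drawing one heatmap per head. The snippet below is only a sketch under
assumed names: it supposes you have collected the scores in an alphas tensor of shape
(N, n_heads, L_query, L_key), with each row adding up to one; neither the variable name
nor the shape is prescribed by the chapter's code.

Attention Scores (sketch)

import matplotlib.pyplot as plt
import torch

# hypothetical scores: one data point, three heads, two query and two key positions
alphas = torch.softmax(torch.randn(1, 3, 2, 2), dim=-1)

fig, axs = plt.subplots(1, alphas.size(1), figsize=(9, 3))
for head, ax in enumerate(axs):
    # rows index query (target) positions, columns index key (source) positions
    ax.imshow(alphas[0, head].numpy(), vmin=0, vmax=1, cmap='gray')
    ax.set_title(f'Attention Head #{head}')
    ax.set_xlabel('Key / source position')
    ax.set_ylabel('Query / target position')
fig.tight_layout()

For self-attention, rows and columns index the same sequence; for cross-attention, the
rows come from the decoder's (target) sequence and the columns from the encoder's
(source) sequence.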
Congratulations! That was definitely an intense chapter. The attention mechanism
in its different forms—single-head, multi-headed, self-attention, and cross-attention—is
very flexible and built on top of fairly simple concepts, but the whole
thing is definitely not that easy to grasp. Maybe you feel a bit overwhelmed by the
huge amount of information and details involved in it, but don’t worry. I guess
everyone does feel like that at first; I know I did. It gets better with time!
The good thing is, you have already learned most of the techniques that make up
the famous Transformer architecture: attention mechanisms, masks, and
positional encoding. There are still a few things left to learn about it, like layer
normalization, and we’ll cover them all in the next chapter.
Transform and roll out!