Deep Learning with PyTorch Step-by-Step: A Beginner's Guide, by Daniel Voigt Godoy


sequential order of the data

• figuring out that attention is not enough and that we also need positional encoding to incorporate sequential order back into the model
• using alternating sines and cosines of different frequencies as positional encoding (see the short sketch right after this recap)
• learning that combining sines and cosines yields interesting properties, such as keeping the encoded distance between any two positions T steps apart constant
• using register_buffer() to add an attribute that should be part of the module's state without being a parameter
• visualizing self- and cross-attention scores

Congratulations! That was definitely an intense chapter. The attention mechanism in its different forms (single-head, multi-headed, self-attention, and cross-attention) is very flexible and built on top of fairly simple concepts, but the whole thing is definitely not that easy to grasp. Maybe you feel a bit overwhelmed by the huge amount of information and details involved in it, but don't worry. I guess everyone does feel like that at first; I know I did. It gets better with time!

The good thing is, you have already learned most of the techniques that make up the famous Transformer architecture: attention mechanisms, masks, and positional encoding. There are still a few things left to learn about it, like layer normalization, and we'll cover them all in the next chapter.

Transform and roll out!
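Before we roll out, here is a quick refresher on the last two ideas above: a minimal sketch, assuming an even d_model, of sinusoidal positional encoding that relies on register_buffer(). The SinusoidalEncoding class below is just for illustration and isn't necessarily identical to the implementation we built in Chapter 9.

import math
import torch
import torch.nn as nn

class SinusoidalEncoding(nn.Module):
    # Sines on the even dimensions, cosines on the odd ones, with
    # frequencies decreasing geometrically (assumes d_model is even)
    def __init__(self, max_len, d_model):
        super().__init__()
        position = torch.arange(max_len).float().unsqueeze(1)            # (max_len, 1)
        angular_speed = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )                                                                 # (d_model/2,)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * angular_speed)
        pe[:, 1::2] = torch.cos(position * angular_speed)
        # register_buffer(): "pe" becomes part of the module's state
        # (it moves with .to(device) and is saved in the state_dict),
        # but it is NOT a parameter, so the optimizer never updates it
        self.register_buffer('pe', pe.unsqueeze(0))                       # (1, max_len, d_model)

    def forward(self, x):
        # x: (N, L, d_model); add the encodings for the first L positions
        return x + self.pe[:, :x.size(1), :]

# The "constant distance" property: encodings T steps apart are always
# the same distance from each other, no matter where they sit in the sequence
pe = SinusoidalEncoding(max_len=100, d_model=8).pe[0]
T = 5
print(torch.dist(pe[0], pe[0 + T]), torch.dist(pe[37], pe[37 + T]))  # same value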

Chapter 10
Transform and Roll Out

Spoilers

In this chapter, we will:

• modify the multi-headed attention mechanism to use narrow attention
• use layer normalization to standardize individual data points (see the short preview sketch at the end of this section)
• stack "layers" together to build Transformer encoders and decoders
• add layer normalization, dropout, and residual connections to each "sub-layer" operation
• learn the difference between norm-last and norm-first "sub-layers"
• train a Transformer to predict a target sequence from a source sequence
• build and train a Vision Transformer to perform image classification

Jupyter Notebook

The Jupyter notebook corresponding to Chapter 10 [146] is part of the official Deep Learning with PyTorch Step-by-Step repository on GitHub. You can also run it directly in Google Colab [147].

If you're using a local installation, open your terminal or Anaconda prompt and navigate to the PyTorchStepByStep folder you cloned from GitHub. Then, activate the pytorchbook environment and run jupyter notebook:

$ conda activate pytorchbook
(pytorchbook)$ jupyter notebook

If you're using Jupyter's default settings, this link should open Chapter 10's notebook. If not, just click on Chapter10.ipynb in your Jupyter's home page.

Imports

For the sake of organization, all libraries needed throughout the code used in any given chapter are imported at its very beginning. For this chapter, we'll need the following imports.
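To give a taste of the second spoiler before we dive in, the sketch below is a minimal illustration (not this chapter's code) of what "standardizing individual data points" means: nn.LayerNorm normalizes each data point over its own features, while nn.BatchNorm1d normalizes each feature over the mini-batch.

import torch
import torch.nn as nn

torch.manual_seed(42)
# a mini-batch of three "data points", each with five features
batch = torch.randn(3, 5) * 10 + 3

# LayerNorm standardizes EACH DATA POINT across its own features
layer_norm = nn.LayerNorm(5)
normed_rows = layer_norm(batch)
print(normed_rows.mean(dim=-1), normed_rows.std(dim=-1, unbiased=False))  # ~0 and ~1 per row

# BatchNorm1d, by contrast, standardizes EACH FEATURE across the mini-batch
batch_norm = nn.BatchNorm1d(5)
normed_cols = batch_norm(batch)
print(normed_cols.mean(dim=0), normed_cols.std(dim=0, unbiased=False))    # ~0 and ~1 per column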

