Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide
Model Configuration & Training

Model Configuration

torch.manual_seed(42)
# Layers
enclayer = EncoderLayer(n_heads=3, d_model=6,
                        ff_units=10, dropout=0.1)
declayer = DecoderLayer(n_heads=3, d_model=6,
                        ff_units=10, dropout=0.1)
# Encoder and Decoder
enctransf = EncoderTransf(enclayer, n_layers=2)
dectransf = DecoderTransf(declayer, n_layers=2)
# Transformer
model_transf = EncoderDecoderTransf(enctransf,
                                    dectransf,
                                    input_len=2,
                                    target_len=2,
                                    n_features=2)
loss = nn.MSELoss()
optimizer = torch.optim.Adam(model_transf.parameters(), lr=0.01)

Weight Initialization

for p in model_transf.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

Model Training

sbs_seq_transf = StepByStep(model_transf, loss, optimizer)
sbs_seq_transf.set_loaders(train_loader, test_loader)
sbs_seq_transf.train(50)

sbs_seq_transf.losses[-1], sbs_seq_transf.val_losses[-1]

Output

(0.019648547226097435, 0.011462601833045483)
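For comparison only, PyTorch ships its own full Transformer module, nn.Transformer, which can be wired up with the same hyper-parameters used above. The sketch below is illustrative, not a drop-in replacement for EncoderDecoderTransf: it does not project the n_features=2 inputs up to d_model, add positional encodings, or loop over target_len prediction steps, and the batch_first / norm_first arguments assume a reasonably recent PyTorch release (1.9+ and 1.10+, respectively). The variable names and dummy tensors are assumptions made for this sketch.

import torch
import torch.nn as nn

torch.manual_seed(42)

# A roughly comparable stack built from PyTorch's own modules
pytorch_transf = nn.Transformer(
    d_model=6,              # must be divisible by nhead (6 / 3 = 2 dims per head)
    nhead=3,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=10,
    dropout=0.1,
    batch_first=True,       # tensors shaped (N, L, d_model); PyTorch >= 1.9
    norm_first=True,        # norm-first sub-layers; PyTorch >= 1.10
)

# Same Glorot / Xavier initialization scheme as above
for p in pytorch_transf.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# Dummy inputs just to show the expected shapes
src = torch.randn(16, 2, 6)   # source sequence of length two
tgt = torch.randn(16, 2, 6)   # (shifted) target sequence of length two
out = pytorch_transf(src, tgt)
print(out.shape)              # torch.Size([16, 2, 6])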
Recap

In this chapter, we've extended the encoder-decoder architecture and transformed it into a Transformer (the last pun of the chapter; I couldn't resist it!). First, we modified the multi-headed attention mechanism to use narrow attention. Then, we introduced layer normalization and the need to change the dimensionality of the inputs using either projections or embeddings. Next, we used our former encoder and decoder as "layers" that could be stacked to form the new Transformer encoder and decoder. That made our model much deeper, thus raising the need for wrapping the internal operations (self-, cross-attention, and feed-forward network, now called "sub-layers") of each "layer" with a combination of layer normalization, dropout, and residual connection. This is what we've covered:

• using narrow attention in the multi-headed attention mechanism
• chunking the projections of the inputs to implement narrow attention
• learning that chunking projections allows different heads to focus on, literally, different dimensions of the inputs (sketched right after this list)
• standardizing individual data points using layer normalization
• using layer normalization to standardize positionally-encoded inputs
• changing the dimensionality of the inputs using projections (embeddings)
• defining an encoder "layer" that uses two "sub-layers": a self-attention mechanism and a feed-forward network
• stacking encoder "layers" to build a Transformer encoder
• wrapping "sub-layer" operations with a combination of layer normalization, dropout, and residual connection
• learning the difference between norm-last and norm-first "sub-layers"
• understanding that norm-first "sub-layers" allow the inputs to flow unimpeded all the way to the top through the residual connections (also sketched after this list)
• defining a decoder "layer" that uses three "sub-layers": a masked self-attention mechanism, a cross-attention mechanism, and a feed-forward network
• stacking decoder "layers" to build a Transformer decoder
• combining both encoder and decoder into a full-blown, norm-first Transformer architecture
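To make the chunking idea concrete, here is a minimal sketch, not the book's exact multi-headed attention code, of how a projected tensor can be split into per-head chunks along the feature dimension, so each head attends over d_model // n_heads dimensions instead of all of them. The function name and the dummy tensor are assumptions made for this sketch.

import torch

def make_chunks(x, n_heads):
    # Split (N, L, d_model) into (N, n_heads, L, d_model // n_heads)
    batch_size, seq_len, d_model = x.size()
    d_head = d_model // n_heads                        # narrow attention: 6 // 3 = 2 dims per head
    return (x
            .view(batch_size, seq_len, n_heads, d_head)  # split the last dimension
            .transpose(1, 2))                            # move heads before the sequence dimension

proj = torch.randn(16, 2, 6)          # e.g., projected queries for a batch of 16
chunks = make_chunks(proj, n_heads=3)
print(chunks.shape)                   # torch.Size([16, 3, 2, 2])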
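Along the same lines, here is a minimal sketch of the "sub-layer" wrapping idea (layer normalization, sub-layer, dropout, and residual connection), contrasting the norm-last and norm-first arrangements. The plain Linear layer stands in for an attention mechanism or feed-forward network, and the function names are illustrative, not the book's actual wrapper class.

import torch
import torch.nn as nn

d_model = 6
norm = nn.LayerNorm(d_model)             # standardizes each data point over its d_model features
drop = nn.Dropout(0.1)
sublayer = nn.Linear(d_model, d_model)   # stand-in for self-/cross-attention or feed-forward

def norm_last(x):
    # "post-norm": add the residual first, then normalize the sum
    return norm(x + drop(sublayer(x)))

def norm_first(x):
    # "pre-norm": normalize first, so the residual path (x + ...) is left untouched
    # and the original inputs can flow unimpeded all the way to the top
    return x + drop(sublayer(norm(x)))

x = torch.randn(16, 2, d_model)
print(norm_last(x).shape, norm_first(x).shape)   # both torch.Size([16, 2, 6])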