Figure 10.15 - Losses—Transformer model

"Why is the validation loss so much better than the training loss?"

This phenomenon may happen for a variety of reasons, from having an easier validation set to being a "side effect" of regularization (e.g., dropout) in our current model. The regularization makes it harder for the model to learn or, in other words, it yields higher losses. In our Transformer model, there are many dropout layers, so it gets increasingly difficult for the model to learn.

Let's observe this effect by using the same mini-batch to compute the loss using the trained model in both train and eval modes:

torch.manual_seed(11)
x, y = next(iter(train_loader))
device = sbs_seq_transf.device

# Training
model_transf.train()
loss(model_transf(x.to(device)), y.to(device))

Output

tensor(0.0158, device='cuda:0', grad_fn=<MseLossBackward>)

# Validation
model_transf.eval()
loss(model_transf(x.to(device)), y.to(device))
Output
tensor(0.0091, device='cuda:0')
See the difference? The loss is roughly two times larger in training mode. You can also set dropout to zero and retrain the model to verify that both loss curves get much closer to each other (by the way, the overall loss level gets lower without dropout, but that's just because our sequence-to-sequence problem is actually quite simple).
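Instead of retraining from scratch, we can also run a quick sanity check by zeroing out the dropout probabilities of the already-trained model. This is a minimal sketch; it assumes every dropout layer in the model is a plain nn.Dropout module:

import torch.nn as nn

# Zero out every dropout layer in the trained model (this assumes
# they are all plain nn.Dropout modules)
for module in model_transf.modules():
    if isinstance(module, nn.Dropout):
        module.p = 0.0

# With p=0, dropout becomes a no-op, so train and eval modes should
# now yield (nearly) the same loss on the same mini-batch
model_transf.train()
loss(model_transf(x.to(device)), y.to(device))

Keep in mind that this only removes the regularization effect after the fact; to see both loss curves actually get closer, the model still has to be retrained without dropout.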
Visualizing Predictions
Let’s plot the predicted coordinates and connect them using dashed lines, while
using solid lines to connect the actual coordinates, just like before:
fig = sequence_pred(sbs_seq_transf, full_test, test_directions)
Figure 10.16 - Predictions
Looking good, right?
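In case you're curious about what a helper like sequence_pred might be doing under the hood, here is a rough, hypothetical sketch of the plotting logic (an illustrative approximation, not the book's actual implementation); it assumes the model, in eval mode, takes the first two corners as the source sequence and predicts the remaining two:

import matplotlib.pyplot as plt
import torch

def plot_prediction(model, full_seq, device):
    # full_seq: tensor of shape (L, 2) with every corner's coordinates
    model.eval()
    source = full_seq[:2].unsqueeze(0).to(device)      # first two corners
    with torch.no_grad():
        pred = model(source).squeeze(0).cpu().numpy()  # predicted corners
    actual = full_seq.cpu().numpy()
    fig, ax = plt.subplots()
    ax.plot(actual[:, 0], actual[:, 1], 'b-o')   # solid line: actual
    ax.plot(pred[:, 0], pred[:, 1], 'r--o')      # dashed line: predicted
    return fig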
The PyTorch Transformer
So far we’ve been using our own classes to build encoder and decoder "layers" and
assemble them all into a Transformer. We don’t have to do it like that, though.
PyTorch implements a full-fledged Transformer class of its own: nn.Transformer.
There are some differences between PyTorch’s implementation and our own:
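Before we get to those differences, here is what instantiating PyTorch's class looks like. This is a minimal sketch, and the hyperparameter values below are illustrative placeholders, not this chapter's configuration:

import torch.nn as nn

# A minimal sketch of instantiating nn.Transformer (placeholder values)
torch_transf = nn.Transformer(
    d_model=6,             # dimensionality of the model
    nhead=3,               # attention heads (d_model must be divisible by nhead)
    num_encoder_layers=1,  # number of stacked encoder "layers"
    num_decoder_layers=1,  # number of stacked decoder "layers"
    dim_feedforward=20,    # hidden units in each feed-forward network
    dropout=0.1            # dropout probability
)

One thing worth knowing up front: by default, nn.Transformer expects sequence-first inputs of shape (L, N, F); more recent PyTorch versions also accept batch_first=True to work with batch-first inputs of shape (N, L, F).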