encdec = EncoderDecoder(encoder, decoder_attn,
                        input_len=2, target_len=2,
                        teacher_forcing_prob=0.0)
encdec(full_seq)

Output

tensor([[[-0.3555, -0.1220],
         [-0.2641, -0.2521]]], grad_fn=<CopySlices>)

We could use it to train a model already, but we would miss something interesting: visualizing the attention scores. To visualize them, we need to store them first. The easiest way to do so is to create a new class that inherits from EncoderDecoder and then override the init_outputs() and store_output() methods:

Encoder + Decoder + Attention

 1 class EncoderDecoderAttn(EncoderDecoder):
 2     def __init__(self, encoder, decoder, input_len, target_len,
 3                  teacher_forcing_prob=0.5):
 4         super().__init__(encoder, decoder, input_len, target_len,
 5                          teacher_forcing_prob)
 6         self.alphas = None
 7
 8     def init_outputs(self, batch_size):
 9         device = next(self.parameters()).device
10         # N, L (target), F
11         self.outputs = torch.zeros(batch_size,
12                                    self.target_len,
13                                    self.encoder.n_features).to(device)
14         # N, L (target), L (source)
15         self.alphas = torch.zeros(batch_size,
16                                   self.target_len,
17                                   self.input_len).to(device)
18
19     def store_output(self, i, out):
20         # Stores both the output and the attention scores for step i
21         self.outputs[:, i:i+1, :] = out
22         self.alphas[:, i:i+1, :] = self.decoder.attn.alphas
The attention scores are stored in the alphas attribute of the attention model,
which, in turn, is the decoder’s attn attribute. For each step in the target sequence
generation, the corresponding scores are copied to the alphas attribute of the
EncoderDecoderAttn model (line 22).
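Let's run a quick sanity check to make sure the scores are indeed being captured. The snippet below is a minimal sketch that reuses the encoder, decoder_attn, and full_seq objects created above (the encdec_attn name is ours, and the shape in the comment follows from full_seq containing a single sequence):

encdec_attn = EncoderDecoderAttn(encoder, decoder_attn,
                                 input_len=2, target_len=2,
                                 teacher_forcing_prob=0.0)
encdec_attn(full_seq)
# one (L_target, L_source) matrix of scores per sequence in the mini-batch
print(encdec_attn.alphas.shape)  # torch.Size([1, 2, 2])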
IMPORTANT: Pay attention (pun very much intended!) to the
shape of the alphas attribute: (N, L_target, L_source). For each of
the N sequences in a mini-batch, there is a matrix in which each
"query" (Q) coming from the target sequence (a row in this
matrix) has as many attention scores as there are "keys" (K) in
the source sequence (the columns in this matrix).
We’ll visualize these matrices shortly. Moreover, a proper
understanding of how attention scores are organized in the
alphas attribute will make it much easier to understand the next
section: "Self-Attention."
Model Configuration & Training
We just have to replace the original classes for both decoder and model with their
attention counterparts, and we’re good to go:
Model Configuration
1 torch.manual_seed(17)
2 encoder = Encoder(n_features=2, hidden_dim=2)
3 decoder_attn = DecoderAttn(n_features=2, hidden_dim=2)
4 model = EncoderDecoderAttn(encoder, decoder_attn,
5                            input_len=2, target_len=2,
6                            teacher_forcing_prob=0.5)
7 loss = nn.MSELoss()
8 optimizer = optim.Adam(model.parameters(), lr=0.01)
Model Training
1 sbs_seq_attn = StepByStep(model, loss, optimizer)
2 sbs_seq_attn.set_loaders(train_loader, test_loader)
3 sbs_seq_attn.train(100)
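We'll look at the actual attention plots shortly, but, just to illustrate, the snippet below is a minimal sketch of how one of these matrices could be displayed using Matplotlib (the book has its own plotting helpers; full_seq is the sequence used in the examples above, and the axis labels are our own):

import matplotlib.pyplot as plt

# runs one prediction just to populate the alphas attribute
device = next(model.parameters()).device
model.eval()
with torch.no_grad():
    model(full_seq.to(device))

# (L_target, L_source) matrix of scores for the first sequence
attn = model.alphas[0].cpu()
fig, ax = plt.subplots()
im = ax.imshow(attn)
ax.set_xlabel('Source Sequence (keys)')
ax.set_ylabel('Target Sequence (queries)')
fig.colorbar(im, ax=ax)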