coordinates of a "perfect" square and split it into source and target sequences:

full_seq = (torch.tensor([[-1, -1], [-1, 1], [1, 1], [1, -1]])
            .float()
            .view(1, 4, 2))
source_seq = full_seq[:, :2]   # first two corners
target_seq = full_seq[:, 2:]   # last two corners

Now, let’s encode the source sequence and take the final hidden state:

torch.manual_seed(21)
encoder = Encoder(n_features=2, hidden_dim=2)
hidden_seq = encoder(source_seq)   # output is N, L, F
hidden_final = hidden_seq[:, -1:]  # takes last hidden state
hidden_final

Output

tensor([[[ 0.3105, -0.5263]]], grad_fn=<SliceBackward>)

Of course, the model is untrained, so the final hidden state above is totally random. In a trained model, however, the final hidden state will encode information about the source sequence. In Chapter 8, we used it to classify the direction in which the square was drawn, so it is safe to say that the final hidden state encoded the drawing direction (clockwise or counterclockwise).

Pretty straightforward, right? Now, let’s go over the…

Decoder

The decoder’s goal is to generate the target sequence from an initial representation; that is, to decode it.

Sounds like a perfect match, doesn’t it? Encode the source sequence, get its representation (final hidden state), and feed it to the decoder so it generates the target sequence.

"How does the decoder transform a hidden state into a sequence?"
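A quick aside before answering that: the Encoder class used above is not defined in this excerpt. A minimal sketch, assuming a single batch-first GRU layer along the lines of Chapter 8 (the actual implementation in the book may differ), could look like this:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_features, hidden_dim):
        super().__init__()
        self.n_features = n_features
        self.hidden_dim = hidden_dim
        self.hidden = None
        # a single recurrent layer; batch_first makes inputs and outputs N, L, F
        self.basic_rnn = nn.GRU(self.n_features, self.hidden_dim,
                                batch_first=True)

    def forward(self, X):
        # X is N, L, F; rnn_out is N, L, H, one hidden state for every step
        rnn_out, self.hidden = self.basic_rnn(X)
        return rnn_out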
We can use recurrent layers for that as well.
Figure 9.5 - Decoder
Let’s analyze the figure above:
• In the first step, the initial hidden state is the encoder’s final hidden state (h_f, in blue).
• The first cell will output a new hidden state (h_2): That’s both the output of that cell and one of the inputs of the next cell, as we’ve already seen in Chapter 8.
• Before, we’d only run the final hidden state through a linear layer to produce the logits, but now we’ll run the output of every cell through a linear layer (wᵀh) to convert each hidden state into predicted coordinates (x_2).
• The predicted coordinates are then used as one of the inputs of the second step (x_2).
"Great, but we’re missing one input in the first step, right?"
That’s right! The first cell takes both an initial hidden state (h_f, in blue, the encoder’s output) and a first data point (x_1, in red).
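To make these steps concrete, here is a minimal sketch of a decoder built along those lines: a single batch-first GRU layer plus a linear layer that converts each hidden state into predicted coordinates. The class name, the init_hidden() helper, and the single-step forward() are illustrative assumptions; the actual implementation in the book may differ.

class Decoder(nn.Module):
    def __init__(self, n_features, hidden_dim):
        super().__init__()
        self.n_features = n_features
        self.hidden_dim = hidden_dim
        self.hidden = None
        self.basic_rnn = nn.GRU(self.n_features, self.hidden_dim,
                                batch_first=True)
        # linear layer (wᵀh) mapping each hidden state (H) to coordinates (F)
        self.regression = nn.Linear(self.hidden_dim, self.n_features)

    def init_hidden(self, hidden_seq):
        # takes the encoder's output (N, L, H), keeps only its final hidden
        # state (N, 1, H), and makes it sequence-first (1, N, H), as the
        # recurrent layer expects for its initial hidden state
        hidden_final = hidden_seq[:, -1:]
        self.hidden = hidden_final.permute(1, 0, 2)

    def forward(self, X):
        # X is a single data point: N, 1, F
        batch_first_output, self.hidden = self.basic_rnn(X, self.hidden)
        last_output = batch_first_output[:, -1:]
        # converts the hidden state into predicted coordinates
        out = self.regression(last_output)
        return out.view(-1, 1, self.n_features)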
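As a usage sketch (using the hypothetical Decoder above and the untrained encoder from before, so the numbers are meaningless), the first step takes the last corner of the source sequence (x_1) as input, and its prediction is then fed back as the input to the second step:

torch.manual_seed(21)
decoder = Decoder(n_features=2, hidden_dim=2)

decoder.init_hidden(hidden_seq)   # encoder's final hidden state (h_f)
inputs = source_seq[:, -1:]       # first data point (x_1)

x2_hat = decoder(inputs)   # predicted coordinates of the third corner
x3_hat = decoder(x2_hat)   # prediction fed back to predict the fourth corner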