Every gate worthy of its name will use a sigmoid activation function to produce gate-compatible values between zero and one.

Moreover, since all components of a GRU (n, r, and z) share a similar structure, it should be no surprise that their corresponding transformations (t_h and t_x) are also similarly computed:

Equation 8.7 - Transformations of a GRU

$$
\begin{aligned}
t_{hr} &= W_{hr} h + b_{hr} \qquad & t_{xr} &= W_{xr} x + b_{xr} \\
t_{hz} &= W_{hz} h + b_{hz} \qquad & t_{xz} &= W_{xz} x + b_{xz} \\
t_{hn} &= W_{hn} h + b_{hn} \qquad & t_{xn} &= W_{xn} x + b_{xn}
\end{aligned}
$$

See? They all follow the same logic! Let's double-check it in code before looking at the diagram.
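If you want to convince yourself that these transformations are exactly what PyTorch's GRU cell computes, here is a minimal sketch. The split order (reset, update, candidate) follows PyTorch's weight layout for nn.GRUCell, and the helper t() is just a name I made up for the affine transformation:

```python
import torch
import torch.nn as nn

torch.manual_seed(17)
n_features, hidden_dim = 2, 2
gru_cell = nn.GRUCell(input_size=n_features, hidden_size=hidden_dim)
state = gru_cell.state_dict()

# PyTorch stacks the weights of the three components in (r, z, n)
# order, so split() recovers one block per component
Wxr, Wxz, Wxn = state['weight_ih'].split(hidden_dim, dim=0)
bxr, bxz, bxn = state['bias_ih'].split(hidden_dim, dim=0)
Whr, Whz, Whn = state['weight_hh'].split(hidden_dim, dim=0)
bhr, bhz, bhn = state['bias_hh'].split(hidden_dim, dim=0)

h = torch.zeros(1, hidden_dim)   # initial hidden state
x = torch.randn(1, n_features)   # a single data point

def t(W, b, v):
    # one transformation: t = W v + b
    return v @ W.t() + b

r = torch.sigmoid(t(Whr, bhr, h) + t(Wxr, bxr, x))    # reset gate
z = torch.sigmoid(t(Whz, bhz, h) + t(Wxz, bxz, x))    # update gate
n = torch.tanh(r * t(Whn, bhn, h) + t(Wxn, bxn, x))   # candidate
h_prime = (1 - z) * n + z * h                         # new hidden state

# sanity check: it should match the cell's own output
assert torch.allclose(h_prime, gru_cell(x, h))
```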
Actually, let's literally see how all these components are connected in the following diagram.

Figure 8.18 - Internals of a GRU cell

The gates follow the same color convention I used in the equations: red for the reset gate (r) and blue for the update gate (z). The path of the (new) candidate hidden state (n) is drawn in black and joins the (old) hidden state (h), drawn in gray, to produce the actual new hidden state (h').

To really understand the flow of information inside the GRU cell, I suggest you try these exercises (there is a short sketch in code right after this list):

• First, learn to look past (or literally ignore) the internals of the gates: both r and z are simply values between zero and one (for each hidden dimension).
• Pretend r=1; can you see that the resulting n is equivalent to the output of a simple RNN?
• Keep r=1, and now pretend z=0; can you see that the new hidden state h' is equivalent to the output of a simple RNN?
• Now pretend z=1; can you see that the new hidden state h' is simply a copy of the old hidden state (in other words, the data [x] does not have any effect)?
• If you decrease r all the way to zero, the resulting n is less and less influenced by the old hidden state.
• If you decrease z all the way to zero, the new hidden state h' gets closer and closer to n.
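To make the first few exercises concrete, here is a tiny sketch. The layers t_hn and t_xn below are made-up stand-ins for the candidate's two transformations, and r and z are passed in as constants rather than being computed by the gates:

```python
import torch
import torch.nn as nn

torch.manual_seed(17)
n_features, hidden_dim = 2, 2
# made-up linear layers standing in for the candidate's two
# transformations, t_hn and t_xn
t_hn = nn.Linear(hidden_dim, hidden_dim)
t_xn = nn.Linear(n_features, hidden_dim)

h = torch.randn(1, hidden_dim)   # old hidden state
x = torch.randn(1, n_features)   # data point

def gru_update(h, x, r, z):
    n = torch.tanh(r * t_hn(h) + t_xn(x))   # candidate hidden state
    return (1 - z) * n + z * h              # new hidden state h'

# what a simple RNN cell would produce with the same weights
simple_rnn = torch.tanh(t_hn(h) + t_xn(x))

# r=1, z=0: the GRU collapses into a simple RNN
assert torch.allclose(gru_update(h, x, r=1.0, z=0.0), simple_rnn)
# z=1: h' is a copy of the old hidden state; x has no effect
assert torch.allclose(gru_update(h, x, r=1.0, z=1.0), h)
```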