Figure 8.22 - Transforming the hidden state
I’d like to draw your attention to the third column in particular: It clearly shows the effect of a gate, the reset gate in this case, on the feature space. Since a gate has a distinct value for each dimension, each dimension will shrink differently (it can only shrink because gate values are always between zero and one). In the third row, for example, the first dimension gets multiplied by 0.70, while the second dimension gets multiplied by only 0.05, making the resulting feature space really small.
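If you’d like to see this shrinking in action, the sketch below multiplies a gate by a hidden state, element-wise. The gate values match the example above; the hidden state values are hypothetical.

```python
import torch

# A gate is the output of a sigmoid, so each value lies between zero and one
r = torch.tensor([0.70, 0.05])   # reset gate values from the example above
h = torch.tensor([-0.50, 0.80])  # hypothetical previous hidden state

# Element-wise multiplication: each dimension shrinks differently
gated = r * h
print(gated)  # tensor([-0.3500,  0.0400])
```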
Can We Do Better?
The gated recurrent unit is definitely an improvement over the regular RNN, but there are a few points I’d like to raise:
• Using the reset gate inside the hyperbolic tangent seems "weird" (not a
scientific argument at all, I know).
• The best thing about the hidden state is that it is bounded by the hyperbolic
tangent—it guarantees the next cell will get the hidden state in the same range.
• The worst thing about the hidden state is that it is bounded by the hyperbolic tangent—it constrains the values the hidden state can take and, along with them, the corresponding gradients (see the short sketch right after this list).
• Since we cannot have our cake and eat it too when it comes to the hidden state being bounded, what is preventing us from using two hidden states in the same cell?
Yes, let’s try that—two hidden states are surely better than one, right?
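To make the point about the gradients concrete, here is a minimal sketch (the inputs are made up): the derivative of the hyperbolic tangent is 1 - tanh²(z), so it quickly approaches zero as inputs move away from zero.

```python
import torch

# tanh squashes any input into (-1, 1); its derivative, 1 - tanh(z)**2,
# vanishes as |z| grows (hypothetical inputs below)
z = torch.linspace(-5, 5, steps=5, requires_grad=True)
h = torch.tanh(z)   # bounded values for the hidden state
h.sum().backward()  # computes gradients w.r.t. the inputs

print(h)       # tensor([-0.9999, -0.9866,  0.0000,  0.9866,  0.9999], ...)
print(z.grad)  # tensor([1.8158e-04, 2.6646e-02, 1.0000e+00, 2.6646e-02, 1.8158e-04])
```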
By the way—I know that GRUs were invented a long time AFTER
the development of LSTMs, but I’ve decided to present them in
order of increasing complexity. Please don’t take the "story" I’m
telling too literally—it is just a way to facilitate learning.
Long Short-Term Memory (LSTM)
Long short-term memory, or LSTM for short, uses two states instead of one. Besides the regular hidden state (h), which is bounded by the hyperbolic tangent as usual, it introduces a second state, the cell state (c), which is unbounded.
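Before we build one ourselves, here is a minimal sketch using PyTorch’s nn.LSTMCell, just to show that it does return two states (the sizes and the seed are made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(21)  # hypothetical seed, for reproducibility only
cell = nn.LSTMCell(input_size=2, hidden_size=2)

x = torch.randn(1, 2)  # a single data point with two features
h, c = cell(x)         # both initial states default to zeros

print(h)  # hidden state: always inside (-1, 1), thanks to the tanh
print(c)  # cell state: no tanh applied to it, so not constrained to (-1, 1)
```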
So, let’s work through the points raised in the last section. First, let’s keep it simple
and use a regular RNN to generate a candidate hidden state (g):
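In essence, the candidate hidden state follows the familiar RNN update; the sketch below uses my own choice of weight and bias names (W and b with g subscripts):

$$
g = \tanh(W_{hg} \, h + b_{hg} + W_{xg} \, x + b_{xg})
$$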