
As we backpropagate across multiple time steps, the product of gradients gets smaller and smaller, ultimately leading to the problem of vanishing gradients. Similarly, if the gradients are larger than 1, the products get larger and larger, and ultimately lead to the problem of exploding gradients.

Of the two, exploding gradients are more easily detectable. The gradients will become very large and turn into Not a Number (NaN), and the training process will crash. Exploding gradients can be controlled by clipping them at a predefined threshold [13]. TensorFlow 2.0 allows you to clip gradients using the clipvalue or clipnorm parameter during optimizer construction, or by explicitly clipping gradients using tf.clip_by_value.
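As a minimal sketch of both approaches (the clipping threshold of 1.0 and the model, loss_fn, x_batch, and y_batch names are illustrative placeholders, not prescribed here):

import tensorflow as tf

# Option 1: ask the optimizer to clip gradients at construction time,
# either element-wise (clipvalue) or by L2 norm (clipnorm).
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# Option 2: clip gradients explicitly in a custom training step.
def train_step(model, loss_fn, x_batch, y_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(y_batch, model(x_batch, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    grads = [tf.clip_by_value(g, -1.0, 1.0) for g in grads]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss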

The effect of vanishing gradients is that gradients from time steps that are far away do not contribute anything to the learning process, so the RNN ends up not learning any long-range dependencies. While there are a few approaches to minimizing the problem, such as proper initialization of the W matrix, more aggressive regularization, using ReLU instead of tanh activation, and pretraining the layers using unsupervised methods, the most popular solution is to use LSTM or GRU architectures, each of which will be explained shortly. These architectures have been designed to deal with vanishing gradients and learn long-term dependencies more effectively.

RNN cell variants

In this section we'll look at some cell variants of RNNs. We'll begin by looking at a variant of the SimpleRNN cell: the Long short-term memory RNN.

Long short-term memory (LSTM)

The LSTM is a variant of the SimpleRNN cell that is capable of learning long-term dependencies. LSTMs were first proposed by Hochreiter and Schmidhuber [14] and refined by many other researchers. They work well on a large variety of problems and are the most widely used RNN variant.

We have seen how the SimpleRNN combines the hidden state from the previous time step and the current input through a tanh layer to implement recurrence. LSTMs also implement recurrence in a similar way, but instead of a single tanh layer, there are four layers interacting in a very specific way. The following diagram illustrates the transformations that are applied to the hidden state at time step t.

The diagram looks complicated, but let us look at it component by component. The line across the top of the diagram is the cell state c, representing the internal memory of the unit.

The line across the bottom is the hidden state h, and the i, f, o, and g gates are the mechanisms by which the LSTM works around the vanishing gradient problem. During training, the LSTM learns the parameters for these gates:

Figure 3: An LSTM cell

An alternative way to think about how these gates work inside an LSTM cell is to consider the equations for the cell. These equations describe how the value of the hidden state h_t at time t is calculated from the value of the hidden state h_{t-1} at the previous time step. In general, the equation-based description tends to be clearer and more concise, and is usually the way a new cell design is presented in academic papers. Diagrams, when provided, may or may not be comparable to ones you have seen earlier. For these reasons, it usually makes sense to learn to read the equations and visualize the cell design. To that end, we will describe the other cell variants in this book using equations only.

The set of equations representing an LSTM is shown as follows:

i = \sigma(W_i h_{t-1} + U_i x_t + V_i c_{t-1})

f = \sigma(W_f h_{t-1} + U_f x_t + V_f c_{t-1})

o = \sigma(W_o h_{t-1} + U_o x_t + V_o c_{t-1})

g = \tanh(W_g h_{t-1} + U_g x_t)

c_t = (f \ast c_{t-1}) + (g \ast i)

h_t = \tanh(c_t) \ast o
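To make the data flow concrete, here is a minimal NumPy sketch of a single LSTM step that transcribes these equations directly. The lstm_step helper and its params dictionary of W, U, and V matrices are illustrative assumptions, not a library API; in practice you would use an optimized implementation such as tf.keras.layers.LSTM rather than writing the step by hand.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    # Gates i, f, and o look at the previous hidden state, the current input,
    # and (through the V matrices) the previous cell state.
    i = sigmoid(params["W_i"] @ h_prev + params["U_i"] @ x_t + params["V_i"] @ c_prev)
    f = sigmoid(params["W_f"] @ h_prev + params["U_f"] @ x_t + params["V_f"] @ c_prev)
    o = sigmoid(params["W_o"] @ h_prev + params["U_o"] @ x_t + params["V_o"] @ c_prev)
    # The candidate state g is computed from the previous hidden state and the input.
    g = np.tanh(params["W_g"] @ h_prev + params["U_g"] @ x_t)
    # The forget gate scales the old memory, and the input gate scales the candidate.
    c_t = (f * c_prev) + (g * i)
    # The output gate decides how much of the squashed memory becomes the new hidden state.
    h_t = np.tanh(c_t) * o
    return h_t, c_t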
