As we backpropagate across multiple time steps, the product of gradients gets smaller and smaller, ultimately leading to the problem of vanishing gradients. Similarly, if the gradients are larger than 1, the products get larger and larger, and ultimately lead to the problem of exploding gradients.

Of the two, exploding gradients are more easily detectable. The gradients will become very large and turn into Not a Number (NaN), and the training process will crash. Exploding gradients can be controlled by clipping them at a predefined threshold [13]. TensorFlow 2.0 allows you to clip gradients using the clipvalue or clipnorm parameter during optimizer construction, or by explicitly clipping gradients using tf.clip_by_value (a short sketch of both options appears at the end of this section).

The effect of vanishing gradients is that gradients from time steps that are far away do not contribute anything to the learning process, so the RNN ends up not learning any long-range dependencies. While there are a few approaches to minimizing the problem, such as proper initialization of the W matrix, more aggressive regularization, using ReLU instead of tanh activation, and pretraining the layers using unsupervised methods, the most popular solution is to use the LSTM or GRU architectures, each of which will be explained shortly. These architectures have been designed to deal with vanishing gradients and learn long-term dependencies more effectively.

RNN cell variants

In this section we'll look at some cell variants of RNNs. We'll begin by looking at a variant of the SimpleRNN cell: the long short-term memory (LSTM) RNN.

Long short-term memory (LSTM)

The LSTM is a variant of the SimpleRNN cell that is capable of learning long-term dependencies. LSTMs were first proposed by Hochreiter and Schmidhuber [14] and refined by many other researchers. They work well on a large variety of problems and are the most widely used RNN variant.

We have seen how the SimpleRNN combines the hidden state from the previous time step and the current input through a tanh layer to implement recurrence. LSTMs also implement recurrence in a similar way, but instead of a single tanh layer, there are four layers interacting in a very specific way. The following diagram illustrates the transformations that are applied to the hidden state at time step t.

The diagram looks complicated, but let us look at it component by component. The line across the top of the diagram is the cell state c, representing the internal memory of the unit.
The line across the bottom is the hidden state h, and the i, f, o, and g gates are the
mechanisms by which the LSTM works around the vanishing gradient problem.
During training, the LSTM learns the parameters for these gates:
Figure 3: An LSTM cell
An alternative way to think about how these gates work inside an LSTM cell is
to consider the equations for the cell. These equations describe how the value
of the hidden state h_t at time t is calculated from the value of hidden state h_{t-1} at
the previous time step. In general, the equation-based description tends to be
clearer and more concise, and is usually the way a new cell design is presented in
academic papers. Diagrams, when provided, may or may not be comparable to
ones you have seen earlier. For these reasons, it usually makes sense to learn to
read the equations and visualize the cell design. To that end, we will describe the
other cell variants in this book using equations only.
The set of equations representing an LSTM is shown as follows:
\[
\begin{aligned}
i &= \sigma(W_i h_{t-1} + U_i x_t + V_i c_{t-1}) \\
f &= \sigma(W_f h_{t-1} + U_f x_t + V_f c_{t-1}) \\
o &= \sigma(W_o h_{t-1} + U_o x_t + V_o c_{t-1}) \\
g &= \tanh(W_g h_{t-1} + U_g x_t) \\
c_t &= (f \ast c_{t-1}) + (g \ast i) \\
h_t &= \tanh(c_t) \ast o
\end{aligned}
\]
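Here i, f, and o are the input, forget, and output gates, g is the candidate internal state, and σ is the sigmoid function; the W, U, and V matrices are the learned parameters applied to the previous hidden state, the current input, and the previous cell state respectively. To make the update concrete, the following is a minimal NumPy sketch of a single step that follows these equations literally. It is only an illustration of the math, not how tf.keras.layers.LSTM is implemented (the Keras layer, for instance, adds bias vectors and does not use the V terms), and all names and dimensions below are assumptions made for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, V):
    # i, f, o: input, forget, and output gates; g: candidate internal state.
    i = sigmoid(W["i"] @ h_prev + U["i"] @ x_t + V["i"] @ c_prev)
    f = sigmoid(W["f"] @ h_prev + U["f"] @ x_t + V["f"] @ c_prev)
    o = sigmoid(W["o"] @ h_prev + U["o"] @ x_t + V["o"] @ c_prev)
    g = np.tanh(W["g"] @ h_prev + U["g"] @ x_t)
    c_t = (f * c_prev) + (g * i)   # new cell state (internal memory)
    h_t = np.tanh(c_t) * o         # new hidden state, gated by the output gate
    return h_t, c_t

# Toy usage with arbitrary dimensions: input size 4, hidden/cell size 3.
rng = np.random.default_rng(42)
n_in, n_hid = 4, 3
W = {k: rng.normal(size=(n_hid, n_hid)) for k in "ifog"}
U = {k: rng.normal(size=(n_hid, n_in)) for k in "ifog"}
V = {k: rng.normal(size=(n_hid, n_hid)) for k in "ifo"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, V)

In a real network this step would be applied repeatedly across the sequence, carrying h and c forward from one time step to the next.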
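Finally, returning to the exploding gradient problem discussed at the start of this section, the sketch below illustrates the two clipping options mentioned there: passing clipvalue (or clipnorm) when constructing a Keras optimizer, and clipping gradients explicitly with tf.clip_by_value inside a custom training step. The model, loss function, and clipping thresholds are placeholders chosen only for illustration.

import tensorflow as tf

# Option 1: clip every gradient element to [-5, 5] at optimizer construction.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=5.0)
# Alternatively, clip each gradient by its norm instead:
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)

# Option 2: clip explicitly inside a custom training step.
def train_step(model, loss_fn, x_batch, y_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(y_batch, model(x_batch, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    clipped = [tf.clip_by_value(g, -5.0, 5.0) for g in grads]
    optimizer.apply_gradients(zip(clipped, model.trainable_variables))
    return loss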