Just as in a traditional neural network, where the learned parameters are stored as weight matrices, the RNN's parameters are defined by the three weight matrices U, V, and W, corresponding to the weights of the input, output, and hidden states respectively:

Figure 1: (a) Schematic of an RNN cell; (b) RNN cell unrolled

Figure 1(b) shows the same RNN in an "unrolled view". Unrolling just means that we draw the network out for the complete sequence. The network shown here has three time steps, suitable for processing three-element sequences. Note that the weight matrices U, V, and W that we spoke about earlier are shared across the time steps. This is because we are applying the same operation to different inputs at each time step. Being able to share these weights across all the time steps greatly reduces the number of parameters that the RNN needs to learn.

We can also describe the RNN as a computation graph in terms of equations. The internal state of the RNN at time t is given by the value of the hidden vector h_t, which is the sum of the product of the weight matrix W and the hidden state h_{t-1} at time t-1, and the product of the weight matrix U and the input x_t at time t, passed through a tanh activation function. The choice of tanh over other activation functions such as sigmoid has to do with it being more efficient for learning in practice, and it helps combat the vanishing gradient problem, which we will learn about later in the chapter.
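Before we write this recurrence out as equations, the short NumPy sketch below walks one input sequence through the unrolled loop. The update h_t = tanh(W h_{t-1} + U x_t) and the shared matrices U and W come from the description above; the dimensions, random values, and variable names are illustrative assumptions.

import numpy as np

# Assumed, illustrative sizes: 4 input features, 8 hidden units, 3 time steps.
input_dim, hidden_dim, time_steps = 4, 8, 3

rng = np.random.default_rng(42)
U = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights

x = rng.normal(size=(time_steps, input_dim))   # one three-step input sequence
h = np.zeros(hidden_dim)                       # initial hidden state

# The same U and W are reused at every step: this is the weight sharing
# that keeps the parameter count independent of the sequence length.
for t in range(time_steps):
    h = np.tanh(W @ h + U @ x[t])

print(h.shape)  # (8,): the hidden state after unrolling three time steps

In a full RNN, the output matrix V would then be applied to the hidden state to produce a prediction at each step, as the equations below show.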
For notational convenience, in all our equations describing different types of RNN architectures in this chapter, we have omitted explicit reference to the bias terms by incorporating them within the weight matrix. Consider the following equation of a line in an n-dimensional space. Here w_1 through w_n refer to the coefficients of the line in each of the n dimensions, and the bias b refers to the y-intercept along each of these dimensions:

y = w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n + b

We can rewrite the equation in matrix notation as follows:

y = Wx + b

Here W is a matrix of shape (m, n) and b is a vector of shape (m, 1), where m is the number of rows corresponding to the records in our dataset, and n is the number of columns corresponding to the features for each record. Equivalently, we can eliminate the vector b by folding it into the matrix W, treating b as the weight of a "unit" feature column whose value is always 1. Thus:

y = w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n + w_0(1) = W'X

Here W' is a matrix of shape (m, n+1), where the last column contains the values of b. The resulting notation ends up being more compact and (we believe) easier for the reader to comprehend and retain as well.

The output vector y_t at time t is the product of the weight matrix V and the hidden state h_t, passed through a softmax activation, such that the resulting vector is a set of output probabilities:

h_t = tanh(W h_{t-1} + U x_t)
y_t = softmax(V h_t)

Keras provides the SimpleRNN recurrent layer that incorporates all the logic we have seen so far, as well as more advanced variants such as LSTM and GRU, which we will learn about later in this chapter. Strictly speaking, it is not necessary to understand how they work in order to start building with them.
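As a minimal sketch of what building with these layers can look like, the snippet below stacks SimpleRNN between an embedding layer and a softmax output. The SimpleRNN layer name comes from the text; the vocabulary size, sequence length, and layer sizes are illustrative assumptions, and the Dense softmax layer stands in for the output matrix V.

import tensorflow as tf

# Assumed, illustrative sizes: a 5,000-word vocabulary, 3-step sequences,
# 32-dimensional embeddings, and 8 hidden units.
vocab_size, seq_len, embed_dim, hidden_dim = 5000, 3, 32, 8

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(seq_len,)),
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    # SimpleRNN applies the tanh recurrence shown above at each time step
    # and, by default, returns only the final hidden state.
    tf.keras.layers.SimpleRNN(hidden_dim),
    # A Dense softmax layer plays the role of V here, turning the hidden
    # state into a vector of output probabilities.
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()

Swapping SimpleRNN for LSTM or GRU in a model like this is a one-line change, which is part of what makes these layers convenient to experiment with.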