Just as in a traditional neural network, where the learned parameters are stored as weight matrices, the RNN's parameters are defined by the three weight matrices U, V, and W, corresponding to the weights of the input, output, and hidden states respectively:

Figure 1: (a) Schematic of an RNN cell; (b) RNN cell unrolled

Figure 1(b) shows the same RNN in an "unrolled view". Unrolling just means that we draw the network out for the complete sequence. The network shown here has three time steps, suitable for processing three-element sequences. Note that the weight matrices U, V, and W that we spoke about earlier are shared between each of the time steps. This is because we are applying the same operation to different inputs at each time step. Being able to share these weights across all the time steps greatly reduces the number of parameters that the RNN needs to learn.

We can also describe the RNN as a computation graph in terms of equations. The internal state of the RNN at a time t is given by the value of the hidden vector h_t, which is the sum of the product of the weight matrix W and the hidden state h_{t-1} at time t-1, and the product of the weight matrix U and the input x_t at time t, passed through a tanh activation function. The choice of tanh over other activation functions such as sigmoid has to do with it being more efficient for learning in practice, and it helps combat the vanishing gradient problem, which we will learn about later in the chapter.
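To make the weight sharing concrete, here is a minimal NumPy sketch of the hidden-state recurrence h_t = tanh(W h_{t-1} + U x_t) unrolled over three time steps. The sizes and variable names are illustrative assumptions, not taken from the book's code:

```python
import numpy as np

# Illustrative sizes; these numbers are assumptions, not from the text
input_dim, hidden_dim, seq_len = 4, 3, 3

rng = np.random.default_rng(42)
U = rng.normal(size=(hidden_dim, input_dim))    # input-to-hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim))   # hidden-to-hidden weights

x = rng.normal(size=(seq_len, input_dim))       # a three-element input sequence
h = np.zeros(hidden_dim)                        # initial hidden state

# The same U and W are reused at every time step (weight sharing)
for t in range(seq_len):
    h = np.tanh(W @ h + U @ x[t])
    print(f"h_{t+1} =", h)
```

Because only one copy of U and W exists regardless of the sequence length, the number of learned parameters does not grow with the number of time steps.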


For notational convenience, in all our equations describing different types of RNN architectures in this chapter, we have omitted explicit reference to the bias terms by incorporating them within the matrix. Consider the following equation of a line in an n-dimensional space. Here w_1 through w_n refer to the coefficients of the line in each of the n dimensions, and the bias b refers to the y-intercept along each of these dimensions:

y = w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n + b

We can rewrite the equation in matrix notation as follows:

y = WX + b

Here W is a matrix of shape (m, n) and b is a vector of shape (m, 1), where m is the number of rows corresponding to the records in our dataset, and n is the number of columns corresponding to the features for each record. Equivalently, we can eliminate the vector b by folding it into the matrix W, treating b as an additional column of W that corresponds to a constant "unit" feature appended to X. Thus:

y = w_1 x_1 + w_2 x_2 + ⋯ + w_n x_n + w_0(1) = W′X

Here W′ is a matrix of shape (m, n+1), where the last column contains the values of b.

The resulting notation ends up being more compact and (we believe) easier for the reader to comprehend and retain as well.
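As a quick numerical check, the following sketch (with made-up sizes and values, not from the text) verifies that folding b into the last column of W′ and appending a unit feature to X gives the same result as computing WX + b directly:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 3                          # illustrative sizes
W = rng.normal(size=(m, n))
b = rng.normal(size=(m, 1))
x = rng.normal(size=(n, 1))

y_direct = W @ x + b                 # y = WX + b

W_prime = np.hstack([W, b])          # fold b in as the last column of W'
x_unit = np.vstack([x, [[1.0]]])     # append the constant "unit" feature
y_folded = W_prime @ x_unit          # y = W'X (with the unit feature)

print(np.allclose(y_direct, y_folded))   # True
```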

The output vector y_t at time t is the product of the weight matrix V and the hidden state h_t, passed through a softmax activation, such that the resulting vector is a set of output probabilities:

h_t = tanh(W h_{t-1} + U x_t)
y_t = softmax(V h_t)
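Putting the two equations together, a single time step of the cell could be sketched in NumPy as follows. The sizes are illustrative assumptions, and the softmax helper is a standard numerically stable formulation rather than anything specific to the book:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Illustrative sizes; the names U, W, V follow the equations above
input_dim, hidden_dim, output_dim = 4, 3, 5

rng = np.random.default_rng(1)
U = rng.normal(size=(hidden_dim, input_dim))
W = rng.normal(size=(hidden_dim, hidden_dim))
V = rng.normal(size=(output_dim, hidden_dim))

def rnn_step(x_t, h_prev):
    h_t = np.tanh(W @ h_prev + U @ x_t)   # h_t = tanh(W h_{t-1} + U x_t)
    y_t = softmax(V @ h_t)                # y_t = softmax(V h_t)
    return h_t, y_t

h_1, y_1 = rnn_step(rng.normal(size=input_dim), np.zeros(hidden_dim))
print(y_1.sum())   # the outputs are probabilities and sum to 1
```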

Keras provides the SimpleRNN recurrent layer that incorporates all the logic we have seen so far, as well as the more advanced variants such as LSTM and GRU, which we will learn about later in this chapter. Strictly speaking, it is not necessary to understand how they work in order to start building with them.
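For example, a minimal Keras model using SimpleRNN might look like the following; the layer sizes, sequence length, and number of classes are illustrative assumptions, not taken from the text:

```python
import tensorflow as tf

# Illustrative shapes: sequences of 3 time steps with 4 features each,
# a hidden state of size 3, and 5 output classes (all assumptions)
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(3, input_shape=(3, 4)),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```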
