
Just as in a traditional neural network, where the learned parameters are stored as weight matrices, the RNN's parameters are defined by the three weight matrices U, V, and W, corresponding to the weights of the input, output, and hidden states respectively:

Figure 1: (a) Schematic of an RNN cell; (b) RNN cell unrolled

Figure 1(b) shows the same RNN in an "unrolled view". Unrolling just means that we draw the network out for the complete sequence. The network shown here has three time steps, suitable for processing three-element sequences. Note that the weight matrices U, V, and W that we spoke about earlier are shared between the time steps. This is because we are applying the same operation to different inputs at each time step. Being able to share these weights across all the time steps greatly reduces the number of parameters that the RNN needs to learn.

We can also describe the RNN as a computation graph in terms of equations. The internal state of the RNN at time t is given by the value of the hidden vector h_t, which is the sum of the product of the weight matrix W and the hidden state h_{t-1} at time t-1, and the product of the weight matrix U and the input x_t at time t, passed through a tanh activation function. The choice of tanh over other activation functions such as sigmoid has to do with it being more efficient for learning in practice, and it helps combat the vanishing gradient problem, which we will learn about later in the chapter.
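To make the hidden state update concrete, here is a minimal NumPy sketch of a single recurrent step using the notation above; the dimensions, the random initialization, and the helper name rnn_step are illustrative assumptions rather than code from this chapter, and the bias term is omitted following the convention described below.

import numpy as np

def rnn_step(x_t, h_prev, U, W):
    # One time step: h_t = tanh(W h_{t-1} + U x_t)
    return np.tanh(W @ h_prev + U @ x_t)

input_dim, hidden_dim = 4, 3                   # illustrative sizes
rng = np.random.default_rng(0)
U = rng.normal(size=(hidden_dim, input_dim))   # input-to-hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights

h = np.zeros(hidden_dim)                       # initial hidden state h_0
for x_t in rng.normal(size=(3, input_dim)):    # a three-step input sequence
    h = rnn_step(x_t, h, U, W)                 # the same U and W are reused at every step

Note how the same U and W appear in every iteration of the loop; this is exactly the weight sharing across time steps described above.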

For notational convenience, in all our equations describing different types of RNN architectures in this chapter, we have omitted explicit reference to the bias terms by incorporating them within the matrix. Consider the following equation of a line in an n-dimensional space. Here w_1 through w_n refer to the coefficients of the line in each of the n dimensions, and the bias b refers to the y-intercept along each of these dimensions:

y = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b

We can rewrite the equation in matrix notation as follows:

y = WX + b

Here W is a matrix of shape (m, n) and b is a vector of shape (m, 1), where m is the number of rows corresponding to the records in our dataset, and n is the number of columns corresponding to the features for each record. Equivalently, we can eliminate the vector b by folding it into the matrix W, treating b as the weight of an extra "unit" feature column whose value is always 1. Thus:

y = w_1 x_1 + w_2 x_2 + ... + w_n x_n + w_0 (1) = W'X

Here W' is a matrix of shape (m, n+1), where the last column contains the values of b. The resulting notation ends up being more compact and (we believe) easier for the reader to comprehend and retain as well.

The output vector y_t at time t is the product of the weight matrix V and the hidden state h_t, passed through a softmax activation, such that the resulting vector is a set of output probabilities:

h_t = tanh(W h_{t-1} + U x_t)
y_t = softmax(V h_t)

Keras provides the SimpleRNN recurrent layer that incorporates all the logic we have seen so far, as well as the more advanced variants such as LSTM and GRU, which we will learn about later in this chapter. Strictly speaking, it is not necessary to understand how they work in order to start building with them.
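As a quick illustration of that layer, the following sketch stacks a SimpleRNN between an Embedding layer and a Dense classifier; the vocabulary size, the number of units, and the 10-class output are arbitrary values chosen for this example rather than settings taken from the book.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5000, output_dim=64),  # map token ids to 64-d vectors
    tf.keras.layers.SimpleRNN(128),                            # the recurrent layer discussed above
    tf.keras.layers.Dense(10, activation="softmax")            # per-sequence class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

By default SimpleRNN returns only the hidden state of the last time step, which is what the Dense layer consumes here; passing return_sequences=True would instead return the hidden state at every time step.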

