
The Math Behind Deep Learning

If we use SGD, we need to sum the errors and the gradients at each timestep for one given training example:

Figure 16: Recurrent neural network unrolled with equations
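In symbols, and sketching only the high-level structure conveyed by the figure (with $E_t$ denoting the loss at timestep $t$ and $W$ the shared weights):

\[
E(y, \hat{y}) = \sum_{t} E_t(y_t, \hat{y}_t), \qquad
\frac{\partial E}{\partial W} = \sum_{t} \frac{\partial E_t}{\partial W}
\]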

We are not going to write out all the tedious math behind all the gradients, but rather focus only on a few peculiar cases. For instance, with computations similar to the ones made in the previous chapters, it can be proven using the chain rule that the gradient for $V$ depends only on the values at the current timestep, $s_3$, $y_3$, and $\hat{y}_3$:

\[
\frac{\partial E_3}{\partial V}
= \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial V}
= \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial z_3}\,\frac{\partial z_3}{\partial V}
= (\hat{y}_3 - y_3)\, s_3
\]
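As a rough numerical check of this result, here is a minimal NumPy sketch. The sizes, the variable names, and the use of a softmax output with cross-entropy loss (which is what makes $\partial E_3 / \partial z_3 = \hat{y}_3 - y_3$) are illustrative assumptions, not code from the text:

import numpy as np

np.random.seed(0)
hidden, out = 4, 3                         # illustrative sizes
V = 0.1 * np.random.randn(out, hidden)
s3 = np.tanh(np.random.randn(hidden))      # hidden state at t = 3
y3 = np.array([0.0, 1.0, 0.0])             # one-hot target at t = 3

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z3 = V @ s3
y_hat3 = softmax(z3)
E3 = -np.sum(y3 * np.log(y_hat3))          # cross-entropy loss at t = 3

# Analytical gradient from the equation above: (y_hat3 - y3) times s3
grad_V = np.outer(y_hat3 - y3, s3)

# Finite-difference check on a single entry of V
eps = 1e-6
V_pert = V.copy()
V_pert[1, 2] += eps
E3_pert = -np.sum(y3 * np.log(softmax(V_pert @ s3)))
print(grad_V[1, 2], (E3_pert - E3) / eps)  # the two numbers should agree

The printed analytical and numerical values should match to several decimal places.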

However, $\frac{\partial E_3}{\partial W}$ has dependencies carried across timesteps because, for instance, $s_3 = \tanh(U x_3 + W s_2)$ depends on $s_2$, which in turn depends on $W$ and $s_1$. As a consequence, the gradient is a bit more complicated because we need to sum up the contributions of each timestep:

\[
\frac{\partial E_3}{\partial W}
= \sum_{k=0}^{3}
\frac{\partial E_3}{\partial \hat{y}_3}\,
\frac{\partial \hat{y}_3}{\partial s_3}\,
\frac{\partial s_3}{\partial s_k}\,
\frac{\partial s_k}{\partial W}
\]
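To make the summation concrete, the following NumPy sketch unrolls a tiny RNN for timesteps $t = 0, \dots, 3$ with the recurrence $s_t = \tanh(U x_t + W s_{t-1})$ and the same assumed softmax/cross-entropy output as above, then accumulates the contribution of every step $k$ to $\partial E_3 / \partial W$. All sizes and data are illustrative assumptions:

import numpy as np

np.random.seed(1)
hidden, inp, out, T = 4, 2, 3, 4           # illustrative sizes, timesteps t = 0..3
U = 0.1 * np.random.randn(hidden, inp)
W = 0.1 * np.random.randn(hidden, hidden)
V = 0.1 * np.random.randn(out, hidden)
x = np.random.randn(T, inp)                # inputs x_0 .. x_3
y3 = np.array([1.0, 0.0, 0.0])             # one-hot target at t = 3

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(W):
    s = [np.zeros(hidden)]                 # s[0] is s_{-1} = 0
    for t in range(T):
        s.append(np.tanh(U @ x[t] + W @ s[-1]))
    y_hat3 = softmax(V @ s[-1])
    E3 = -np.sum(y3 * np.log(y_hat3))
    return s, y_hat3, E3

s, y_hat3, E3 = forward(W)

# Backpropagation through time: sum the contribution of every timestep k
grad_W = np.zeros_like(W)
delta = (1 - s[-1] ** 2) * (V.T @ (y_hat3 - y3))   # dE3/da_3, with a_t = U x_t + W s_{t-1}
for k in range(T - 1, -1, -1):                     # k = 3, 2, 1, 0
    grad_W += np.outer(delta, s[k])                # s[k] holds s_{k-1}
    delta = (1 - s[k] ** 2) * (W.T @ delta)        # carry the error back one timestep

# Finite-difference check on a single entry of W
eps = 1e-6
W_pert = W.copy()
W_pert[0, 1] += eps
print(grad_W[0, 1], (forward(W_pert)[2] - E3) / eps)

Again, the analytical and numerical values should agree closely, which is a quick way to convince yourself that the summation over $k$ is doing the right thing.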

In order to understand the preceding equation, you can think of it as the standard backpropagation algorithm used for traditional feed-forward neural networks, except that for RNNs we additionally need to sum the gradients of $W$ across timesteps. That's because we can effectively make the dependencies across time explicit by unrolling the RNN. This is the reason why backpropagation for RNNs is frequently called backpropagation through time (BPTT). The intuition is shown in Figure 17, where the backpropagated signals are represented:

