
Figure 17: RNN equations and back propagated signals

I hope that you have been following along up to this point, because now the discussion becomes slightly more difficult. If we consider:

$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial s_3} \frac{\partial s_3}{\partial s_k} \frac{\partial s_k}{\partial W}$$

Then we notice that $\frac{\partial s_3}{\partial s_k}$ should again be computed with the chain rule, producing a number of multiplications. In this case, we take the derivative of a vector function with respect to a vector, so we need a matrix whose elements are all the pointwise derivatives (in math, this matrix is called a Jacobian). Mathematically, it can be proven that:

$$\frac{\partial s_3}{\partial s_k} = \prod_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}}$$

Therefore, we have:

$$\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial s_3} \left( \prod_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_k}{\partial W}$$
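To make the chain of multiplications concrete, unrolling the product for $k = 0$ (a step not written out in the original text) gives:

$$\frac{\partial s_3}{\partial s_0} = \frac{\partial s_3}{\partial s_2} \frac{\partial s_2}{\partial s_1} \frac{\partial s_1}{\partial s_0}$$

so the term for the earliest time step is a product of three Jacobians, and in general the term for time step $k$ contains $3 - k$ of them.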

The multiplication in the preceding equations is particularly problematic since both the sigmoid and tanh saturate at both ends, where their derivatives go to 0. When this happens, they drive the gradients in earlier layers towards 0. This makes the gradient vanish completely after a few time steps, and the network stops learning from "far away."
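As a minimal numerical sketch of this effect (not code from the chapter), assume a single scalar recurrent unit with a tanh activation and a hypothetical recurrent weight w; each factor ds_j/ds_{j-1} is then w * (1 - s_j^2), and multiplying these factors over many time steps drives the gradient towards zero:

import numpy as np

# Minimal sketch: a scalar "RNN" state s_t = tanh(w * s_{t-1} + u * x_t).
# Each factor ds_j/ds_{j-1} = w * (1 - s_j^2), i.e. the recurrent weight
# times the tanh derivative evaluated at the new state.
np.random.seed(0)
w, u = 0.5, 1.0           # hypothetical recurrent and input weights
s = 0.0                   # initial state s_0
grad_product = 1.0        # running product of the factors ds_j/ds_{j-1}

for t in range(1, 21):
    x = np.random.randn()             # random input at time step t
    s = np.tanh(w * s + u * x)        # state update
    grad_product *= w * (1.0 - s**2)  # multiply in the new Jacobian factor
    print(f"t={t:2d}  |ds_t/ds_0| = {abs(grad_product):.2e}")

With |w| < 1 and the tanh derivative never exceeding 1, the running product is bounded by 0.5^t and so falls below roughly 10^-6 within 20 steps, which is exactly the vanishing-gradient behaviour described above.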
