
Mathematically this direction is the value of the partial derivative $\frac{\partial c}{\partial w}$ evaluated at point $w_r$, reached at step $r$. Therefore, by taking the opposite direction $-\frac{\partial c}{\partial w}(w_r)$, the hiker can move towards the ditch.

At each step, the hiker can decide how big a stride to take before the next stop. This is the so-called "learning rate" $\eta \geq 0$ in gradient descent jargon. Note that if $\eta$ is too small, then the hiker will move slowly. However, if $\eta$ is too high, then the hiker will possibly miss the ditch by stepping over it.
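To make the intuition concrete, the following minimal sketch runs gradient descent on a toy one-dimensional cost $c(w) = w^2$; the cost function, the starting point, and the learning rate value are illustrative assumptions, not taken from this chapter:

# Minimal gradient descent sketch on the toy cost c(w) = w**2
# (the cost, starting point, and learning rate are illustrative assumptions).

def dc_dw(w):
    # Analytical derivative of c(w) = w**2
    return 2.0 * w

w = 5.0      # starting position of the "hiker"
eta = 0.1    # learning rate

for step in range(50):
    # Update rule: w_{r+1} = w_r - eta * dc/dw(w_r)
    w = w - eta * dc_dw(w)

print(w)     # close to 0, the minimum (the "ditch") of c(w) = w**2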

Now you should remember that a sigmoid is a continuous function and it is possible to compute the derivative. It can be proven that the sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$ has the derivative $\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$.
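As a quick sanity check (not part of the original derivation), the identity can be verified numerically with a central finite-difference approximation; the test point and step size below are arbitrary choices:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.7     # arbitrary test point
h = 1e-6    # small step for the finite-difference approximation

numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central difference
analytical = sigmoid(x) * (1 - sigmoid(x))                # sigma(x) * (1 - sigma(x))

print(numerical, analytical)   # the two values agree to several decimal places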

ReLU is not differentiable at 0. We can, however, extend the first derivative at 0 to a function over the whole domain by defining it to be either 0 or 1. The piecewise derivative of ReLU $y = \max(0, x)$ is

$$\frac{dy}{dx} = \begin{cases} 0 & x \leq 0 \\ 1 & x > 0 \end{cases}$$

Once we have the derivative, it is possible to optimize the nets with a gradient descent technique.
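The piecewise rule translates directly into code; the small pure-Python helpers below are written only for illustration and define the derivative at 0 to be 0:

def relu(x):
    return max(0.0, x)

def relu_derivative(x):
    # dy/dx = 0 for x <= 0 and 1 for x > 0 (the value at 0 is set to 0 here)
    return 0.0 if x <= 0 else 1.0

for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_derivative(x))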

TensorFlow computes the derivative on our behalf, so we don't need to worry about implementing or computing it.
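For example, tf.GradientTape records the operations applied to a variable and returns the gradient automatically; this is a minimal sketch assuming TensorFlow 2, with an arbitrary input value:

import tensorflow as tf

x = tf.Variable(2.0)   # arbitrary input value

with tf.GradientTape() as tape:
    y = tf.nn.sigmoid(x)        # TensorFlow knows how to differentiate the sigmoid

dy_dx = tape.gradient(y, x)     # automatic differentiation, no manual formula needed
print(dy_dx.numpy())            # matches sigmoid(2) * (1 - sigmoid(2))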

A neural network is essentially a composition of multiple differentiable functions with

thousands and sometimes millions of parameters. Each network layer computes a

function, the error of which should be minimized in order to improve the accuracy

observed during the learning phase. When we discuss backpropagation, we will

discover that the minimization game is a bit more complex than our toy example.

However, it is still based on the same intuition of descending a slope to reach a ditch.
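To ground the idea of a composition of differentiable functions, the sketch below stacks two sigmoid layers and differentiates the resulting error with respect to both weight matrices; the shapes and random values are illustrative assumptions, not the network used later in the chapter:

import tensorflow as tf

# Two "layers" seen as differentiable functions (illustrative shapes and values).
w1 = tf.Variable(tf.random.normal((3, 4)))
w2 = tf.Variable(tf.random.normal((4, 1)))
x = tf.random.normal((1, 3))
target = tf.constant([[1.0]])

with tf.GradientTape() as tape:
    h = tf.nn.sigmoid(tf.matmul(x, w1))             # layer 1
    y = tf.nn.sigmoid(tf.matmul(h, w2))             # layer 2, composed on top of layer 1
    error = tf.reduce_mean(tf.square(y - target))   # the quantity to minimize

grads = tape.gradient(error, [w1, w2])              # chain rule applied automatically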

TensorFlow implements a fast variant of gradient descent known as SGD and many

more advanced optimization techniques such as RMSProp and Adam. RMSProp

and Adam include the concept of momentum (a velocity component), in addition to

the acceleration component that SGD has. This allows faster convergence at the cost

of more computation. Think about a hiker who starts to move in one direction then

decides to change direction but remembers previous choices. It can be proven that

momentum helps accelerate SGD in the relevant direction and dampens

oscillations [10].
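In Keras, a momentum (velocity) term can be added to SGD through a single constructor argument; the values below are common illustrative choices rather than recommendations from this chapter:

import tensorflow as tf

# SGD with a momentum (velocity) term.
sgd_with_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)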

A complete list of optimizers can be found at https://www.tensorflow.org/api_docs/python/tf/keras/optimizers.

SGD has been our default choice so far. Now let's try the other two.
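Switching optimizer in Keras is a one-line change in model.compile(); the tiny model below is only a placeholder used to show the call, so its layer sizes, loss, and metrics are assumptions rather than the chapter's actual example:

import tensorflow as tf

# Placeholder model, used only to show where the optimizer is selected.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# Swap RMSprop() for Adam() or SGD() to try the other optimizers.
model.compile(optimizer=tf.keras.optimizers.RMSprop(),
              loss='binary_crossentropy',
              metrics=['accuracy'])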

