Advanced Deep Learning with Keras

Chapter 1

Figure 1.3.8: Plot of a function with 2 minima, x = -1.51 and x = 1.66. Also shown is the derivative of the function.

Gradient descent is not typically used in deep neural networks since such networks often have millions of parameters to train, and performing full gradient descent over the entire dataset is computationally inefficient. Instead, SGD is used. In SGD, a mini-batch of samples is chosen to compute an approximate value of the descent. The parameters (for example, weights and biases) are adjusted by the following equation:

$$\theta \leftarrow \theta - \epsilon g \qquad \text{(Equation 1.3.7)}$$

In this equation, $\theta$ and $g = \frac{1}{m}\nabla_{\theta} \sum L$ are the parameters and gradients tensor of the loss function, respectively. The $g$ is computed from the partial derivatives of the loss function.
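
To make the update concrete, the following is a minimal NumPy sketch of Equation 1.3.7 for a single linear layer trained with a mean squared error loss, treating $\epsilon$ as the learning rate. The variable names, toy data, and learning rate value are illustrative placeholders, not the book's network.

```python
import numpy as np

# A minimal sketch of the SGD update in Equation 1.3.7 for one linear layer.
# All names, the toy data, and the learning rate are illustrative placeholders.
rng = np.random.default_rng(0)
x = rng.normal(size=(128, 4))            # one mini-batch of m = 128 samples
y = rng.normal(size=(128, 1))

w = rng.normal(size=(4, 1))              # parameters theta = (w, b)
b = np.zeros(1)
lr = 0.001                               # epsilon in Equation 1.3.7

# forward pass and loss L = mean((y_hat - y)^2) over the mini-batch
y_hat = x @ w + b
err = y_hat - y

# g = (1/m) * grad_theta sum(L), obtained from the partial derivatives of L
grad_w = (2.0 / len(x)) * (x.T @ err)
grad_b = (2.0 / len(x)) * err.sum(axis=0)

# Equation 1.3.7: theta <- theta - epsilon * g
w -= lr * grad_w
b -= lr * grad_b
```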

The mini-batch size is recommended to be a power of 2 for GPU optimization purposes. In the proposed network, batch_size=128.
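
In Keras, the mini-batch size is simply passed to fit() and the SGD optimizer to compile(). The sketch below assumes a hypothetical two-layer classifier with randomly generated stand-in data; the layer sizes, learning rate, and epoch count are placeholders rather than the book's settings.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Hypothetical stand-in classifier; layer sizes and data are placeholders.
model = Sequential([
    Dense(256, activation='relu', input_shape=(784,)),
    Dense(10, activation='softmax'),
])

# SGD applies Equation 1.3.7 on every mini-batch of 128 samples,
# a power of 2 as recommended for GPU efficiency.
model.compile(optimizer=SGD(learning_rate=0.1),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

x_train = np.random.random((512, 784)).astype('float32')
y_train = np.eye(10)[np.random.randint(0, 10, 512)].astype('float32')
model.fit(x_train, y_train, batch_size=128, epochs=1)
```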

Equation 1.3.7 computes the last layer parameter updates. So, how do we adjust the parameters of the preceding layers? For this case, the chain rule of differentiation is applied to propagate the derivatives to the lower layers and compute the gradients accordingly. This algorithm is known as backpropagation in deep learning. The details of backpropagation are beyond the scope of this book. However, a good online reference can be found at http://neuralnetworksanddeeplearning.com.
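
To show how the chain rule carries the derivative from the output layer back to a preceding layer, here is a minimal NumPy sketch of backpropagation through two dense layers with a mean squared error loss. The architecture, data, and learning rate are illustrative assumptions, not the book's network.

```python
import numpy as np

# A minimal two-layer backpropagation sketch; all shapes and data are illustrative.
rng = np.random.default_rng(0)
x = rng.normal(size=(128, 4))
y = rng.normal(size=(128, 1))

w1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # hidden layer parameters
w2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # output layer parameters

# forward pass
h = np.maximum(0.0, x @ w1 + b1)                # ReLU hidden activation
y_hat = h @ w2 + b2
m = len(x)

# output-layer gradients (same form as the single-layer case)
d_yhat = (2.0 / m) * (y_hat - y)
grad_w2 = h.T @ d_yhat
grad_b2 = d_yhat.sum(axis=0)

# chain rule: propagate the derivative through w2 and the ReLU
# to obtain the hidden-layer gradients
d_h = d_yhat @ w2.T
d_pre = d_h * (h > 0)                           # derivative of ReLU
grad_w1 = x.T @ d_pre
grad_b1 = d_pre.sum(axis=0)

# Equation 1.3.7 applied to every layer's parameters
lr = 0.001
for p, g in ((w1, grad_w1), (b1, grad_b1), (w2, grad_w2), (b2, grad_b2)):
    p -= lr * g
```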
