
Chapter 15

The error is computed via loss functions such as MSE, or cross-entropy for non-continuous values such as Booleans (Chapter 1, Neural Network Foundations with TensorFlow 2.0). A gradient descent optimization algorithm is used to adjust the weights of the neurons by calculating the gradient of the loss function: backpropagation computes the gradient, and gradient descent uses those gradients to train the model. Reducing the error rate of the predictions increases accuracy, allowing machine learning models to improve. Stochastic gradient descent is the simplest thing you could possibly do: take one step in the direction opposite to the gradient.
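
To make this concrete, here is a minimal sketch (not a listing from this chapter) of a single stochastic gradient descent step in TensorFlow 2.0, using tf.GradientTape to backpropagate an MSE loss; the toy weight w, the inputs x, the targets y_true, and the learning rate are illustrative choices, not values from the book.

import tensorflow as tf

# One SGD step on a single weight, with MSE as the loss.
w = tf.Variable(0.5)
x = tf.constant([1.0, 2.0, 3.0])
y_true = tf.constant([2.0, 4.0, 6.0])
learning_rate = 0.1

with tf.GradientTape() as tape:
    y_pred = w * x                                      # forward pass
    loss = tf.reduce_mean(tf.square(y_true - y_pred))   # MSE loss

grad = tape.gradient(loss, w)        # backpropagation computes the gradient
w.assign_sub(learning_rate * grad)   # one step against the gradient direction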

This chapter does not cover the math behind other optimizers such as Adam and RMSProp (Chapter 1). However, they involve using the first and the second moments of the gradients: the first moment is an exponentially decaying average of the previous gradients, and the second moment is an exponentially decaying average of the previous squared gradients.
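
For reference, a standard way to write these moment estimates (a sketch of the Adam-style updates; the symbols g_t, beta_1, and beta_2 and the default decay rates 0.9 and 0.999 are supplied here, not taken from this chapter) is:

\[
m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t, \qquad
v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2
\]

where m_t is the first-moment estimate, v_t the second-moment estimate, and g_t the gradient at step t.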

There are three big properties of your data that justify using deep learning; otherwise, you might just use regular machine learning: (1) very high-dimensional input (text, images, audio signals, videos, and time series are frequently good examples), (2) dealing with complex decision surfaces that cannot be approximated with a low-order polynomial function, and (3) having a large amount of training data available.

Deep learning models can be thought of as a computational graph made up of several basic components stacked together, such as Dense layers (Chapter 1), CNNs (Chapters 4 and 5), Embeddings (Chapter 6), RNNs (Chapter 7), GANs (Chapter 8), and Autoencoders (Chapter 9), sometimes adopting shortcut connections such as "peephole", "skip", and "residual" connections because they help data flow a bit more smoothly (Chapters 5 and 7). Each node in the graph takes tensors as input and produces tensors as output. As discussed, training happens by adjusting the weights in each node with backpropagation, where the key intuition is to reduce the error in the final output node(s) via gradient descent. GPUs and TPUs (Chapter 16) can significantly accelerate the optimization process since it is essentially based on (hundreds of) millions of matrix computations.
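
As a rough illustration of such a computational graph (a minimal sketch, not a listing from the book; the layer types and sizes are arbitrary), tf.keras lets you stack these components and wire up the loss and the gradient descent optimizer in a few lines:

import tensorflow as tf

# Stack basic components into a computational graph; each layer (node) takes
# tensors as input and produces tensors as output.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Compiling wires up the loss and the gradient-descent optimizer that will
# adjust the weights in each node via backpropagation.
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()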

There are a few other mathematical tools that might be helpful to improve your learning process. Regularization (L1, L2, Lasso, from Chapter 1) can significantly improve the learning by keeping the weights small and penalizing overly complex models. Batch normalization (Chapter 1) helps to keep track of the mean and the standard deviation of your data across multiple deep layers; the key intuition is to have data resembling a normal distribution while it flows through the computational graph. Dropout (Chapters 1, 4, and 5) helps by introducing some elements of redundancy in your computation; this prevents overfitting and allows better generalization.
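
A minimal sketch of how these three tools appear in tf.keras (the layer sizes, the L2 factor, and the dropout rate below are illustrative assumptions, not values from the book):

import tensorflow as tf

# Combine L2 weight regularization, batch normalization, and dropout.
regularized = tf.keras.Sequential([
    tf.keras.layers.Dense(
        128, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),  # penalize large weights
        input_shape=(784,)),
    tf.keras.layers.BatchNormalization(),  # normalize activations batch by batch
    tf.keras.layers.Dropout(0.5),          # randomly drop units to curb overfitting
    tf.keras.layers.Dense(10, activation="softmax"),
])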

