
The Math Behind Deep Learning

In many cases, evaluating the above gradient might require an expensive evaluation of the gradients of all the summand functions. When the training set is very large, this can be extremely expensive: if we have three million samples, we have to loop through them three million times or use the dot product. That's a lot! How can we simplify this?
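To make the summand structure explicit, here is the generic form assumed in this discussion (a sketch only: the symbols $L$, $L_i$, $w$, and $N$ are notation chosen here, not definitions from an earlier page). The loss decomposes into a sum of per-example terms, so its gradient is the average of the per-example gradients:

$$
L(w) = \frac{1}{N}\sum_{i=1}^{N} L_i(w),
\qquad
\nabla L(w) = \frac{1}{N}\sum_{i=1}^{N} \nabla L_i(w)
$$

Computing $\nabla L(w)$ exactly therefore means touching all $N$ summands, which is what the variants below try to avoid or amortize.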

There are three types of gradient descent, each differing in how it handles the training dataset:

Batch Gradient Descent (BGD)

Batch gradient descent computes the gradient of the error, but updates the model only once the entire dataset has been evaluated. Computationally it is very efficient, but it requires that the results for the whole dataset be held in memory.
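To make this concrete, here is a minimal BGD sketch on a made-up linear-regression problem with a mean squared error loss (the data, learning rate, and epoch count are illustrative assumptions, not values from the text):

```python
import numpy as np

# Toy linear-regression data, invented for illustration: y = Xw + noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)   # model parameters
lr = 0.1          # learning rate

for epoch in range(100):
    # One gradient over the ENTIRE dataset; the model is updated
    # only once per full pass, so all results must fit in memory.
    grad = (2.0 / len(X)) * X.T @ (X @ w - y)
    w -= lr * grad
```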

Stochastic Gradient Descent (SGD)

Instead of updating the model only after the whole dataset has been evaluated, SGD does so after every single training example. The key idea is very simple: SGD samples a subset of the summand functions at every step.
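As a sketch only (same made-up linear-regression setup and hypothetical hyperparameters as the BGD example), pure SGD updates the parameters after each individual example:

```python
import numpy as np

# Same illustrative linear-regression data as in the BGD sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr = 0.01

for epoch in range(10):
    # Visit the training examples in a random order each epoch.
    for i in rng.permutation(len(X)):
        x_i, y_i = X[i], y[i]
        # Gradient of the squared error for a SINGLE example:
        # the model is updated after every training sample.
        grad = 2.0 * x_i * (x_i @ w - y_i)
        w -= lr * grad
```

Because each update is based on a single summand, the gradient estimate is cheap to compute but noisy.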

Mini-Batch Gradient Descent (MBGD)

This is the method most frequently used in deep learning. MBGD (or mini-batch) combines BGD and SGD in a single heuristic: the dataset is divided into small batches of size bs, which is generally between 64 and 256, and each batch is then evaluated separately, as shown in the sketch below.

Note that bs is another hyperparameter to fine-tune during training. MBGD lies between the extremes of BGD and SGD: by adjusting the batch size and the learning rate, we sometimes find a solution that descends closer to the global minimum than either extreme can achieve.

In contrast with batch gradient descent, where the cost function decreases more smoothly, mini-batch gradient descent has a somewhat noisier, bumpier descent, but the cost function still trends downhill. The reason for the noise is that each mini-batch is a sample of all the examples, and this sampling can cause the loss function to oscillate.
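Putting the pieces together, here is a minimal mini-batch sketch on the same made-up linear-regression problem (bs, the learning rate, and the epoch count are illustrative assumptions):

```python
import numpy as np

# Same illustrative linear-regression data as in the earlier sketches.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr = 0.05
bs = 64   # batch size: the extra hyperparameter discussed above

for epoch in range(20):
    # Shuffle, then walk through the dataset in chunks of size bs.
    idx = rng.permutation(len(X))
    for start in range(0, len(X), bs):
        batch = idx[start:start + bs]
        X_b, y_b = X[batch], y[batch]
        # Gradient estimated from one mini-batch only; this sampling
        # is what makes the loss curve noisier than full-batch descent.
        grad = (2.0 / len(batch)) * X_b.T @ (X_b @ w - y_b)
        w -= lr * grad
```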
