Deep Learning with PyTorch Step-by-Step: A Beginner's Guide - Daniel Voigt Godoy


Sure, in the real world, you'll never get a pretty bowl like that. But our conclusion still holds:

1. Always standardize (scale) your features.
2. DO NOT EVER FORGET #1!

Step 5 - Rinse and Repeat!

Now we use the updated parameters to go back to Step 1 and restart the process.

Definition of Epoch

An epoch is complete whenever every point in the training set (N) has already been used in all steps: forward pass, computing loss, computing gradients, and updating parameters.

During one epoch, we perform at least one update, but no more than N updates.

The number of updates (N/n) will depend on the type of gradient descent being used:

• For batch (n = N) gradient descent, this is trivial, as it uses all points for computing the loss: one epoch is the same as one update.
• For stochastic (n = 1) gradient descent, one epoch means N updates, since every individual data point is used to perform an update.
• For mini-batch (of size n), one epoch has N/n updates, since a mini-batch of n data points is used to perform an update.

Repeating this process over and over for many epochs is, in a nutshell, training a model.

What happens if we run it over 1,000 epochs?
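Before we look at that, a quick sanity check on the N/n arithmetic above may help. The short Python sketch below counts the updates performed in a single epoch for each flavor of gradient descent, assuming a hypothetical training set of N = 80 points (the same size used later in this chapter); it is an illustration, not code from the book.

    # Minimal sketch: how many parameter updates happen in ONE epoch,
    # assuming a hypothetical training set of N = 80 points
    N = 80  # number of points in the training set

    for flavor, n in [("batch", 80), ("mini-batch", 16), ("stochastic", 1)]:
        updates_per_epoch = N // n
        print(f"{flavor}: n = {n} -> {updates_per_epoch} update(s) per epoch")

    # batch: n = 80 -> 1 update(s) per epoch
    # mini-batch: n = 16 -> 5 update(s) per epoch
    # stochastic: n = 1 -> 80 update(s) per epoch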


Figure 0.18 - Final model’s predictions

In the next chapter, we'll put all these steps together and run it for 1,000 epochs, so we'll get to the parameters depicted in the figure above, b = 1.0235 and w = 1.9690.

"Why 1,000 epochs?"

No particular reason, but this is a fairly simple model, and we can afford to run it over a large number of epochs. In more complex models, though, a couple of dozen epochs may be enough. We'll discuss this a bit more in Chapter 1.
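Putting the five steps inside a loop that runs for 1,000 epochs is all the "rinse and repeat" amounts to. The NumPy sketch below shows one way such a loop could look using batch gradient descent; the synthetic data, the learning rate of 0.1, and the variable names are assumptions made for illustration, not the book's exact code.

    import numpy as np

    # Synthetic linear data (assumed for illustration): y = 1 + 2x + noise
    np.random.seed(42)
    x_train = np.random.rand(80, 1)
    y_train = 1 + 2 * x_train + 0.1 * np.random.randn(80, 1)

    # Step 0 - random initialization of the parameters
    b = np.random.randn(1)
    w = np.random.randn(1)
    lr = 0.1  # learning rate (assumed value)

    for epoch in range(1000):
        # Step 1 - forward pass: compute the model's predictions
        yhat = b + w * x_train
        # Step 2 - compute the loss (mean squared error)
        error = yhat - y_train
        loss = (error ** 2).mean()
        # Step 3 - compute the gradients of the loss w.r.t. b and w
        b_grad = 2 * error.mean()
        w_grad = 2 * (x_train * error).mean()
        # Step 4 - update the parameters using the gradients and the learning rate
        b = b - lr * b_grad
        w = w - lr * w_grad
        # Step 5 - rinse and repeat!

    print(b, w)  # should land close to the data-generating values (b = 1, w = 2)

With different random data or a different learning rate, the exact parameter values will differ from the b = 1.0235 and w = 1.9690 mentioned above, but the loop itself stays the same.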

The Path of Gradient Descent

In Step 3, we have seen the loss surface and both the random start and minimum points.

Which path is gradient descent going to take to go from the random start to a minimum? How long will it take? Will it actually reach the minimum?

The answers to all these questions depend on many things, like the learning rate, the shape of the loss surface, and the number of points we use to compute the loss.

Depending on whether we use batch, mini-batch, or stochastic gradient descent, the path is going to be more or less smooth, and it is likely to reach the minimum in more or less time.

To illustrate the differences, I've generated paths over 100 epochs using either 80 data points (batch), 16 data points (mini-batch), or a single data point (stochastic) for each parameter update.

