Deep Learning with PyTorch Step-by-Step: A Beginner's Guide, by Daniel Voigt Godoy
Sure, in the real world, you'll never get a pretty bowl like that. But our conclusion still holds:

1. Always standardize (scale) your features.
2. DO NOT EVER FORGET #1!

Step 5 - Rinse and Repeat!

Now we use the updated parameters to go back to Step 1 and restart the process.

Definition of Epoch

An epoch is complete whenever every point in the training set (N) has already been used in all steps: forward pass, computing loss, computing gradients, and updating parameters.

During one epoch, we perform at least one update, but no more than N updates.

The number of updates (N/n) will depend on the type of gradient descent being used:

• For batch (n = N) gradient descent, this is trivial, as it uses all points for computing the loss; one epoch is the same as one update.
• For stochastic (n = 1) gradient descent, one epoch means N updates, since every individual data point is used to perform an update.
• For mini-batch (of size n), one epoch has N/n updates, since a mini-batch of n data points is used to perform an update.

Repeating this process over and over for many epochs is, in a nutshell, training a model.

What happens if we run it over 1,000 epochs?
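Before we look at the answer, here is a minimal sketch of the updates-per-epoch arithmetic (this is my own illustration, not the book's code; the training set size of 80 points matches the split used in this chapter):

N = 80  # number of points in the training set (the 80/20 split used in this chapter)

# One epoch = every training point used once; the batch size n
# determines how many parameter updates happen in one epoch (N/n)
for n in (80, 16, 1):  # batch, mini-batch of 16, stochastic
    print(f"batch size {n:>2}: {N // n} update(s) per epoch")

For our 80-point training set, that's one update per epoch for batch gradient descent, five updates for mini-batches of 16, and 80 updates for stochastic gradient descent.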
Figure 0.18 - Final model’s predictions
In the next chapter, we’ll put all these steps together and run it for 1,000 epochs, so
we’ll get to the parameters depicted in the figure above, b = 1.0235 and w = 1.9690.
"Why 1,000 epochs?"
No particular reason, but this is a fairly simple model, and we can afford to run it over a large number of epochs. In more complex models, though, a couple of dozen epochs may be enough. We'll discuss this a bit more in Chapter 1.
The Path of Gradient Descent
In Step 3, we saw the loss surface, along with both the random start and minimum points.
Which path is gradient descent going to take to go from the random start to the
minimum? How long will it take? Will it actually reach the minimum?
The answers to all these questions depend on many things, like the learning rate, the
shape of the loss surface, and the number of points we use to compute the loss.
Depending on whether we use batch, mini-batch, or stochastic gradient descent,
the path is going to be more or less smooth, and it is likely to reach the minimum in
more or less time.
To illustrate the differences, I've generated paths over 100 epochs using either 80
data points (batch), 16 data points (mini-batch), or a single data point (stochastic) for
computing the loss.
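Here is a minimal sketch of how such paths could be generated, in the spirit of this chapter's NumPy implementation (this is my own illustration, not the book's actual code; the learning rate of 0.1, the number of epochs, and the random seed are assumptions):

import numpy as np

# Synthetic data, mirroring the chapter's setup: true_b = 1, true_w = 2, N = 100
np.random.seed(42)
true_b, true_w, N = 1, 2, 100
x = np.random.rand(N, 1)
y = true_b + true_w * x + 0.1 * np.random.randn(N, 1)
# First 80 points as the training set (a simplification; the book shuffles first)
x_train, y_train = x[:80], y[:80]

def descend(batch_size, n_epochs=100, lr=0.1):
    # Returns the sequence of (b, w) pairs visited during training
    b, w = np.random.randn(1), np.random.randn(1)
    path = [(b.item(), w.item())]
    for _ in range(n_epochs):
        idx = np.random.permutation(len(x_train))
        for start in range(0, len(x_train), batch_size):
            sel = idx[start:start + batch_size]
            error = (b + w * x_train[sel]) - y_train[sel]
            b = b - lr * 2 * error.mean()                   # gradient of MSE w.r.t. b
            w = w - lr * 2 * (x_train[sel] * error).mean()  # gradient of MSE w.r.t. w
            path.append((b.item(), w.item()))
    return np.array(path)

# Paths for batch (80 points), mini-batch (16), and stochastic (1) gradient descent
paths = {n: descend(n) for n in (80, 16, 1)}

Plotting each of these paths over the loss surface would show the differences discussed above: the batch path is smooth, while the stochastic one zig-zags its way toward the minimum.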