Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub
There are many optimizers: SGD is the most basic of them, and Adam is one of the most popular.

Different optimizers use different mechanics for updating the parameters, but they all achieve the same goal through, literally, different paths.

To see what I mean by this, check out this animated GIF [45] developed by Alec Radford [46], available at Stanford’s "CS231n: Convolutional Neural Networks for Visual Recognition" [47] course. The animation shows a loss surface, just like the ones we computed in Chapter 0, and the paths traversed by some optimizers to achieve the minimum (represented by a star).

Remember, the choice of mini-batch size influences the path of gradient descent, and so does the choice of an optimizer.

step / zero_grad

An optimizer takes the parameters we want to update, the learning rate we want to use (and possibly many other hyper-parameters as well!), and performs the updates through its step() method.

In the code below, we create a stochastic gradient descent (SGD) optimizer to update our parameters b and w.

# Defines an SGD optimizer to update the parameters
optimizer = optim.SGD([b, w], lr=lr)

Don’t be fooled by the optimizer’s name: if we use all training data at once for the update—as we are actually doing in the code—the optimizer is performing a batch gradient descent, despite its name.

Besides, we also don’t need to zero the gradients one by one anymore. We just invoke the optimizer’s zero_grad() method, and that’s it!
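To make the correspondence with the manual updates explicit, here is a minimal sketch of what the two calls roughly boil down to for plain SGD (no momentum, no weight decay). This is for illustration only (it is not the optimizer's actual implementation), and it assumes backward() has already populated the gradients of b and w.

# Illustration only: roughly what the optimizer does for vanilla SGD
with torch.no_grad():
    for param in [b, w]:
        param -= lr * param.grad  # what optimizer.step() amounts to
        param.grad.zero_()        # what optimizer.zero_grad() amounts to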
Notebook Cell 1.7 - PyTorch’s optimizer in action—no more manual update of parameters!

# Sets learning rate - this is "eta" ~ the "n"-like Greek letter
lr = 0.1

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)

# Defines a SGD optimizer to update the parameters
optimizer = optim.SGD([b, w], lr=lr)

# Defines number of epochs
n_epochs = 1000

for epoch in range(n_epochs):
    # Step 1 - Computes model's predicted output - forward pass
    yhat = b + w * x_train_tensor

    # Step 2 - Computes the loss
    # We are using ALL data points, so this is BATCH gradient
    # descent. How wrong is our model? That's the error!
    error = (yhat - y_train_tensor)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()

    # Step 3 - Computes gradients for both "b" and "w" parameters
    loss.backward()

    # Step 4 - Updates parameters using gradients and
    # the learning rate. No more manual update!
    # with torch.no_grad():
    #     b -= lr * b.grad
    #     w -= lr * w.grad
    optimizer.step()

    # No more telling PyTorch to let gradients go!
    # b.grad.zero_()
    # w.grad.zero_()
    optimizer.zero_grad()
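Since every optimizer exposes the same step() and zero_grad() interface, trying a different one is mostly a matter of changing a single line. The variation below is not part of the original cell; it is only a sketch of how Adam could be plugged into the very same loop, assuming b and w are re-initialized as in Step 0 first, and it would trace a different path to the same minimum.

# Sketch (not in the original cell): same loop, different optimizer
# Assumes "b" and "w" were re-initialized as in Step 0 above
optimizer = optim.Adam([b, w], lr=lr)

for epoch in range(n_epochs):
    yhat = b + w * x_train_tensor                 # Step 1 - forward pass
    loss = ((yhat - y_train_tensor) ** 2).mean()  # Step 2 - MSE loss
    loss.backward()                               # Step 3 - gradients
    optimizer.step()                              # Step 4 - Adam update
    optimizer.zero_grad()                         # gradients back to zero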