Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)


There are many optimizers: SGD is the most basic of them, and Adam is one of the most popular.

Different optimizers use different mechanics for updating the parameters, but they all achieve the same goal through, literally, different paths.

To see what I mean by this, check out this animated GIF [45] developed by Alec Radford [46], available at Stanford's "CS231n: Convolutional Neural Networks for Visual Recognition" [47] course. The animation shows a loss surface, just like the ones we computed in Chapter 0, and the paths traversed by some optimizers to achieve the minimum (represented by a star).

Remember, the choice of mini-batch size influences the path of gradient descent, and so does the choice of an optimizer.

step / zero_grad

An optimizer takes the parameters we want to update, the learning rate we want to use (and possibly many other hyper-parameters as well!), and performs the updates through its step() method.

In the code below, we create a stochastic gradient descent (SGD) optimizer to update our parameters b and w.

# Defines an SGD optimizer to update the parameters
optimizer = optim.SGD([b, w], lr=lr)

Besides, we also don't need to zero the gradients one by one anymore. We just invoke the optimizer's zero_grad() method, and that's it!

Don't be fooled by the optimizer's name: if we use all the training data at once for the update, as we are actually doing in the code, the optimizer is performing a batch gradient descent, despite its name.
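Since every optimizer exposes the same step() and zero_grad() interface, trying a different one is a one-line change. The snippet below is just a minimal sketch of that idea, swapping SGD for Adam; the betas and eps values shown are simply Adam's defaults, not something tuned for our regression.

# A minimal sketch: swapping SGD for Adam only changes the line that
# builds the optimizer; the calls to step() and zero_grad() stay the same
# (betas and eps below are just Adam's default values)
optimizer = optim.Adam([b, w], lr=lr, betas=(0.9, 0.999), eps=1e-8)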

Notebook Cell 1.7 - PyTorch's optimizer in action—no more manual update of parameters!

# Sets learning rate - this is "eta" ~ the "n"-like Greek letter
lr = 0.1

# Step 0 - Initializes parameters "b" and "w" randomly
torch.manual_seed(42)
b = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)
w = torch.randn(1, requires_grad=True, \
                dtype=torch.float, device=device)

# Defines an SGD optimizer to update the parameters
optimizer = optim.SGD([b, w], lr=lr)

# Defines number of epochs
n_epochs = 1000

for epoch in range(n_epochs):
    # Step 1 - Computes model's predicted output - forward pass
    yhat = b + w * x_train_tensor

    # Step 2 - Computes the loss
    # We are using ALL data points, so this is BATCH gradient
    # descent. How wrong is our model? That's the error!
    error = (yhat - y_train_tensor)
    # It is a regression, so it computes mean squared error (MSE)
    loss = (error ** 2).mean()

    # Step 3 - Computes gradients for both "b" and "w" parameters
    loss.backward()

    # Step 4 - Updates parameters using gradients and
    # the learning rate. No more manual update!
    # with torch.no_grad():
    #     b -= lr * b.grad
    #     w -= lr * w.grad
    optimizer.step()

    # No more telling PyTorch to let gradients go!
    # b.grad.zero_()
    # w.grad.zero_()
    optimizer.zero_grad()
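Notebook Cell 1.7 assumes objects created earlier in the chapter: device, x_train_tensor, and y_train_tensor. If you want to run the cell on its own, the sketch below shows one possible setup; it assumes the chapter's synthetic dataset with true parameters b = 1 and w = 2, so the trained b and w should end up close to those values.

import numpy as np
import torch
import torch.optim as optim

# Uses a GPU if one is available, otherwise falls back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Generates synthetic linear data (assuming true b = 1 and w = 2,
# as in the chapter's data-generation step)
true_b, true_w = 1, 2
np.random.seed(42)
x = np.random.rand(100, 1)
y = true_b + true_w * x + .1 * np.random.randn(100, 1)

# Converts the Numpy arrays to float tensors on the chosen device
x_train_tensor = torch.as_tensor(x).float().to(device)
y_train_tensor = torch.as_tensor(y).float().to(device)

After running the cell, printing b and w should show values close to the true ones used to generate the data.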

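The aside above also explains the "despite its name" remark: to get actual stochastic or mini-batch gradient descent, each update would have to use only a subset of the data points. The loop below is a minimal sketch of the mini-batch variant using the same optimizer; the batch size of 16 is an arbitrary, illustrative choice.

# A minimal sketch of MINI-BATCH gradient descent using the same
# optimizer: each parameter update uses only a slice of the data
batch_size = 16
n_points = x_train_tensor.shape[0]

for epoch in range(n_epochs):
    # Shuffles the indices so each epoch visits the batches in a new order
    indices = torch.randperm(n_points, device=x_train_tensor.device)
    for start in range(0, n_points, batch_size):
        batch = indices[start:start + batch_size]
        x_batch = x_train_tensor[batch]
        y_batch = y_train_tensor[batch]

        # Same four steps as before, just on a mini-batch
        yhat = b + w * x_batch
        loss = ((yhat - y_batch) ** 2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()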
