
print(error.requires_grad, yhat.requires_grad, \
      b.requires_grad, w.requires_grad)
print(y_train_tensor.requires_grad, x_train_tensor.requires_grad)

Output

True True True True
False False

grad

What about the actual values of the gradients? We can inspect them by looking at the grad attribute of a tensor.

print(b.grad, w.grad)

Output

tensor([-3.3881], device='cuda:0') tensor([-1.9439], device='cuda:0')

If you check the method's documentation, it clearly states that gradients are accumulated. What does that mean? It means that, if we run Notebook Cell 1.5's code (Steps 1 to 3) twice and check the grad attribute afterward, we will end up with:

Output

tensor([-6.7762], device='cuda:0') tensor([-3.8878], device='cuda:0')

If you do not have a GPU, your outputs are going to be slightly different:

Output

tensor([-3.1125]) tensor([-1.8156])

...and, after running the code twice, they double as well:


Output

tensor([-6.2250]) tensor([-3.6313])

These gradients' values are exactly twice as much as they were before, as expected!
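If you want to see the accumulation in isolation, here is a tiny standalone sketch (not the book's notebook code; the tensor and the values are made up for illustration): calling backward() twice on the same computation doubles the stored gradient.

import torch

# Standalone illustration: gradients ACCUMULATE in .grad
# across successive backward() calls
w = torch.tensor(2.0, requires_grad=True)

loss = 3.0 * w     # d(loss)/dw = 3
loss.backward()
print(w.grad)      # tensor(3.)

loss = 3.0 * w     # build the same graph again
loss.backward()
print(w.grad)      # tensor(6.) -- accumulated, not overwritten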

OK, but that is actually a problem: We need to use the gradients corresponding to the current loss to perform the parameter update. We should NOT use accumulated gradients.

"If accumulating gradients is a problem, why does PyTorch do it by

default?"

It turns out this behavior can be useful to circumvent hardware limitations. During the training of large models, the necessary number of data points in a mini-batch may be too large to fit in memory (of the graphics card). How can one solve this, other than buying more-expensive hardware?

One can split a mini-batch into "sub-mini-batches" (horrible name, I know, don't quote me on this!), compute the gradients for those "subs" and accumulate them to achieve the same result as computing the gradients on the full mini-batch.

Sounds confusing? No worries, this is fairly advanced already and somewhat outside of the scope of this book, but I thought this particular behavior of PyTorch needed to be explained.
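To make the "sub-mini-batch" idea a bit more concrete, here is a hedged sketch (the model, optimizer, and data below are illustrative placeholders, not this chapter's regression code): a mini-batch of 32 points is processed as four sub-batches of eight, and the gradients pile up across the four backward() calls before a single parameter update.

import torch

# Illustrative gradient accumulation: 4 sub-batches of 8 stand in
# for one mini-batch of 32 that (hypothetically) won't fit in memory
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

x = torch.randn(32, 10)  # synthetic mini-batch
y = torch.randn(32, 1)

acc_steps = 4
optimizer.zero_grad()
for i in range(acc_steps):
    xb = x[i * 8:(i + 1) * 8]
    yb = y[i * 8:(i + 1) * 8]
    # Dividing by acc_steps makes the accumulated gradient equal to
    # the gradient of the mean loss over the full mini-batch
    loss = loss_fn(model(xb), yb) / acc_steps
    loss.backward()      # gradients accumulate in the .grad attributes
optimizer.step()         # one update, as if computed on all 32 points
optimizer.zero_grad()

In our case, though, we are not short on memory; we simply want fresh gradients for every update.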

Luckily, this is easy to solve!

zero_

Every time we use the gradients to update the parameters, we need to zero the gradients afterward. And that's what zero_() is good for.

# This code will be placed _after_ Step 4
# (updating the parameters)
b.grad.zero_(), w.grad.zero_()
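To see where this call sits, here is a minimal sketch stitching Steps 1 to 4 together with the zeroing at the end (variable names follow the chapter; the learning rate value and epoch count are illustrative, and the full version appears in the notebook itself):

lr = 0.1

for epoch in range(1000):
    # Step 1: compute the model's predictions (forward pass)
    yhat = b + w * x_train_tensor
    # Step 2: compute the loss
    error = yhat - y_train_tensor
    loss = (error ** 2).mean()
    # Step 3: compute the gradients
    loss.backward()
    # Step 4: update the parameters; no_grad keeps the updates
    # themselves out of the computation graph
    with torch.no_grad():
        b -= lr * b.grad
        w -= lr * w.grad
    # Zero the gradients, so the next epoch starts from scratch
    b.grad.zero_()
    w.grad.zero_()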

