    print(error.requires_grad, yhat.requires_grad, \
          b.requires_grad, w.requires_grad)
    print(y_train_tensor.requires_grad, x_train_tensor.requires_grad)

Output

    True True True True
    False False

grad

What about the actual values of the gradients? We can inspect them by looking at the grad attribute of a tensor.

    print(b.grad, w.grad)

Output

    tensor([-3.3881], device='cuda:0') tensor([-1.9439], device='cuda:0')

If you check the method’s documentation, it clearly states that gradients are accumulated. What does that mean? It means that, if we run Notebook Cell 1.5's code (Steps 1 to 3) twice and check the grad attribute afterward, we will end up with:

Output

    tensor([-6.7762], device='cuda:0') tensor([-3.8878], device='cuda:0')

If you do not have a GPU, your outputs are going to be slightly different:

Output

    tensor([-3.1125]) tensor([-1.8156])

Output

    tensor([-6.2250]) tensor([-3.6313])

These gradients' values are exactly twice as much as they were before, as expected!

OK, but that is actually a problem: We need to use the gradients corresponding to the current loss to perform the parameter update. We should NOT use accumulated gradients.

"If accumulating gradients is a problem, why does PyTorch do it by default?"

It turns out this behavior can be useful to circumvent hardware limitations.

During the training of large models, the necessary number of data points in a mini-batch may be too large to fit in memory (of the graphics card). How can one solve this, other than buying more-expensive hardware?

One can split a mini-batch into "sub-mini-batches" (horrible name, I know, don’t quote me on this!), compute the gradients for those "subs" and accumulate them to achieve the same result as computing the gradients on the full mini-batch.

Sounds confusing? No worries, this is fairly advanced already and somewhat outside of the scope of this book, but I thought this particular behavior of PyTorch needed to be explained.

Luckily, this is easy to solve!

zero_

Every time we use the gradients to update the parameters, we need to zero the gradients afterward. And that’s what zero_() is good for.

    # This code will be placed _after_ Step 4
    # (updating the parameters)
    b.grad.zero_(), w.grad.zero_()
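To see both behaviors side by side, here is a minimal sketch that runs the backward pass twice and then zeroes the gradients. It does not use the book's dataset: the synthetic data, the seed, and the helper name compute_loss are assumptions made up for illustration, so the actual gradient values will differ from the outputs shown above.

```python
import torch

torch.manual_seed(42)

# Hypothetical stand-in data: a noisy line y = 1 + 2x (NOT the book's dataset)
x = torch.rand(100, 1)
y = 1 + 2 * x + 0.1 * torch.randn(100, 1)

# Parameters that require gradients
b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)

def compute_loss():
    # Hypothetical helper: forward pass and MSE loss
    yhat = b + w * x
    error = yhat - y
    return (error ** 2).mean()

# First backward pass: gradients for the current loss
compute_loss().backward()
first_b, first_w = b.grad.clone(), w.grad.clone()

# Second backward pass WITHOUT zeroing: gradients are added to the old ones
compute_loss().backward()
doubled = torch.allclose(b.grad, 2 * first_b) and torch.allclose(w.grad, 2 * first_w)
print(doubled)

# Zero the gradients in place, then run the backward pass once more
b.grad.zero_()
w.grad.zero_()
compute_loss().backward()
fresh = torch.allclose(b.grad, first_b) and torch.allclose(w.grad, first_w)
print(fresh)
```

Both checks should print True: the second pass doubles the gradients, and after zero_() a new pass yields the original values again.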

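Going back to the "sub-mini-batches" idea, the sketch below suggests how accumulation can reproduce the gradients of a full mini-batch. It relies on a few assumptions of my own: synthetic data, four equally sized chunks, and an MSE loss that averages over data points, which is why each sub-loss is divided by the number of chunks. The names mse and n_subs are hypothetical.

```python
import torch

torch.manual_seed(13)

# Hypothetical mini-batch of 64 points (NOT the book's dataset)
x = torch.rand(64, 1)
y = 1 + 2 * x + 0.1 * torch.randn(64, 1)

b = torch.randn(1, requires_grad=True)
w = torch.randn(1, requires_grad=True)

def mse(x_part, y_part):
    # Hypothetical helper: mean squared error on the given points
    return ((b + w * x_part - y_part) ** 2).mean()

# Reference: gradients computed on the full mini-batch
mse(x, y).backward()
full_b, full_w = b.grad.clone(), w.grad.clone()
b.grad.zero_()
w.grad.zero_()

# Same mini-batch split into equally sized sub-mini-batches
n_subs = 4
for x_sub, y_sub in zip(x.chunk(n_subs), y.chunk(n_subs)):
    # Each sub-loss is a mean over 16 points, so scaling by 1 / n_subs
    # makes the accumulated gradients match the mean over all 64 points
    loss = mse(x_sub, y_sub) / n_subs
    loss.backward()  # accumulates into b.grad and w.grad

same = (torch.allclose(b.grad, full_b, atol=1e-6)
        and torch.allclose(w.grad, full_w, atol=1e-6))
print(same)
```

Only one sub-mini-batch needs to go through the forward and backward passes at a time, which is where the memory savings would come from. The parameter update, followed by zeroing the gradients, happens only after the last chunk.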

