print(error.requires_grad, yhat.requires_grad, \
      b.requires_grad, w.requires_grad)
print(y_train_tensor.requires_grad, x_train_tensor.requires_grad)

Output
True True True True
False False

grad

What about the actual values of the gradients? We can inspect them by looking at
the grad attribute of a tensor.

print(b.grad, w.grad)

Output
tensor([-3.3881], device='cuda:0')
tensor([-1.9439], device='cuda:0')

If you check the method's documentation, it clearly states that gradients are
accumulated. What does that mean? It means that, if we run Notebook Cell 1.5's
code (Steps 1 to 3) twice and check the grad attribute afterward, we will end up
with:

Output
tensor([-6.7762], device='cuda:0')
tensor([-3.8878], device='cuda:0')

If you do not have a GPU, your outputs are going to be slightly different:

Output
tensor([-3.1125]) tensor([-1.8156])
Output
tensor([-6.2250]) tensor([-3.6313])
These gradients' values are exactly twice as much as they were before, as
expected!
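To see the accumulation in isolation, here is a minimal, self-contained sketch (the
tensors and values below are made up for illustration; they are not the notebook's
b, w, or data): calling backward() a second time without zeroing simply adds the
new gradients to whatever .grad already holds.

import torch

# Toy leaf tensors, just to illustrate accumulation (hypothetical values)
b = torch.zeros(1, requires_grad=True)
w = torch.zeros(1, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([3.0, 5.0, 7.0])

for i in range(2):
    yhat = b + w * x
    loss = ((yhat - y) ** 2).mean()
    loss.backward()           # gradients are ADDED to .grad, not replaced
    print(i, b.grad, w.grad)  # the second print shows exactly twice the first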
OK, but that is actually a problem: We need to use the gradients corresponding to
the current loss to perform the parameter update. We should NOT use
accumulated gradients.
"If accumulating gradients is a problem, why does PyTorch do it by
default?"
It turns out this behavior can be useful to circumvent hardware limitations.
During the training of large models, the necessary number of data points in a
mini-batch may be too large to fit in memory (of the graphics card). How can one solve
this, other than buying more-expensive hardware?
One can split a mini-batch into "sub-mini-batches" (horrible name, I know, don’t
quote me on this!), compute the gradients for those "subs" and accumulate them to
achieve the same result as computing the gradients on the full mini-batch.
Sounds confusing? No worries, this is fairly advanced already and somewhat
outside of the scope of this book, but I thought this particular behavior of PyTorch
needed to be explained.
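Just to make the idea concrete, here is a hedged sketch of that trick (all the names
and numbers below are made up for illustration; this is not the chapter's code): each
"sub" contributes its share of the gradients, and the accumulated result matches the
gradients of the full mini-batch.

import torch

# Hypothetical example: one mini-batch of four points split into two "subs"
b = torch.zeros(1, requires_grad=True)
w = torch.zeros(1, requires_grad=True)
x_mini = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_mini = torch.tensor([3.0, 5.0, 7.0, 9.0])

n_subs = 2
for x_sub, y_sub in zip(x_mini.chunk(n_subs), y_mini.chunk(n_subs)):
    yhat = b + w * x_sub
    # scale each sub-loss so the accumulated sum equals the mini-batch mean
    loss = ((yhat - y_sub) ** 2).mean() / n_subs
    loss.backward()  # gradients accumulate across subs

# b.grad and w.grad now equal the gradients of the loss over the full mini-batch
print(b.grad, w.grad)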
Luckily, our accumulation problem is easy to solve!
zero_
Every time we use the gradients to update the parameters, we need to zero the
gradients afterward. And that’s what zero_() is good for.
# This code will be placed _after_ Step 4
# (updating the parameters)
b.grad.zero_(), w.grad.zero_()
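Putting it all together, this is roughly where the zeroing sits in the training loop
built in this chapter (a hedged sketch with a made-up learning rate, not the notebook
cell itself): update the parameters inside torch.no_grad() so autograd does not track
the update, then zero the gradients before the next iteration starts.

lr = 0.1  # hypothetical learning rate

yhat = b + w * x_train_tensor      # Step 1 - compute predictions
error = yhat - y_train_tensor
loss = (error ** 2).mean()         # Step 2 - compute the loss
loss.backward()                    # Step 3 - compute the gradients

with torch.no_grad():              # Step 4 - update the parameters
    b -= lr * b.grad
    w -= lr * w.grad

b.grad.zero_()                     # zero the gradients AFTER the update
w.grad.zero_()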