If (and only if) the norm exceeds the clipping value, the gradients are scaled down to match the desired norm; otherwise, they keep their values. We can use PyTorch’s nn.utils.clip_grad_norm_() to scale gradients in-place:

parm.grad = fake_grads.clone()
# Gradient Norm Clipping
nn.utils.clip_grad_norm_(parm, max_norm=1.0, norm_type=2)
fake_grads.norm(), parm.grad.view(-1,), parm.grad.norm()

Output

(tensor(2.6249), tensor([0.9524, 0.3048]), tensor(1.0000))

The norm of our fake gradients was 2.6249, and we’re clipping the norm at 1.0, so the gradients get scaled by a factor of 0.3810 (that is, 1.0 / 2.6249).

Clipping the norm preserves the direction of the gradient vector.

Figure E.6 - Gradients: before and after clipping by norm

"A couple of questions … first, which one is better?"

On the one hand, norm clipping maintains the balance between the updates of all parameters, since it only scales the norm and preserves the direction. On the other hand, value clipping is faster, and the fact that it slightly changes the direction of the gradient vector does not seem to have any harmful impact on performance. So, you’re probably OK using value clipping. The sketch below contrasts the two techniques.

"Second, which clip value should I use?"

That’s trickier to answer: the clip value is a hyper-parameter that can be fine-tuned like any other. You can start with a value like ten, and work your way down if the
gradients keep exploding.
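Here is a minimal sketch (not part of the original listing) contrasting the two techniques on the same fake gradients; the gradient values are assumptions chosen so that their norm matches the 2.6249 quoted above:

import torch
import torch.nn as nn

# Assumed values: norm = sqrt(2.5**2 + 0.8**2) ~= 2.6249,
# matching the output quoted in the text
fake_grads = torch.tensor([2.5, 0.8])
parm = nn.Parameter(torch.zeros(2))

# Norm clipping: the whole vector is scaled by 1.0 / 2.6249 ~= 0.3810,
# so the ratio between its components (its direction) is preserved
parm.grad = fake_grads.clone()
nn.utils.clip_grad_norm_(parm, max_norm=1.0, norm_type=2)
print(parm.grad)  # tensor([0.9524, 0.3048]) - same direction, norm 1.0

# Value clipping: each component is clamped at +/- 1.0 independently,
# so only the larger one changes and the direction rotates
parm.grad = fake_grads.clone()
nn.utils.clip_grad_value_(parm, clip_value=1.0)
print(parm.grad)  # tensor([1.0000, 0.8000]) - different direction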
"Finally, how do I actually do it in practice?"
Glad you asked! We’re creating some more methods in our StepByStep class to
handle both kinds of clipping, and modifying the _make_train_step_fn() method
to account for them. Gradient clipping must happen after gradients are computed
(loss.backward()) and before updating the parameters (optimizer.step()):
StepByStep Method
setattr(StepByStep, 'clipping', None)

def set_clip_grad_value(self, clip_value):
    self.clipping = lambda: nn.utils.clip_grad_value_(
        self.model.parameters(), clip_value=clip_value
    )

def set_clip_grad_norm(self, max_norm, norm_type=2):
    self.clipping = lambda: nn.utils.clip_grad_norm_(
        self.model.parameters(), max_norm, norm_type
    )

def remove_clip(self):
    self.clipping = None

# Attaches the methods to the StepByStep class
setattr(StepByStep, 'set_clip_grad_value', set_clip_grad_value)
setattr(StepByStep, 'set_clip_grad_norm', set_clip_grad_norm)
setattr(StepByStep, 'remove_clip', remove_clip)
def _make_train_step_fn(self):
    # This method does not need ARGS... it can refer to
    # the attributes: self.model, self.loss_fn, and self.optimizer
    # Builds function that performs a step in the train loop
    def perform_train_step_fn(x, y):
        # Sets model to TRAIN mode
        self.model.train()
        # Step 1 - Computes model's predicted output - forward pass
        yhat = self.model(x)
        # Step 2 - Computes the loss
        loss = self.loss_fn(yhat, y)
        # Step 3 - Computes gradients
        loss.backward()
        # Clipping goes after the gradients are computed and
        # before the parameters are updated
        if callable(self.clipping):
            self.clipping()
        # Step 4 - Updates parameters using gradients and learning rate
        self.optimizer.step()
        self.optimizer.zero_grad()
        # Returns the loss
        return loss.item()

    return perform_train_step_fn

setattr(StepByStep, '_make_train_step_fn', _make_train_step_fn)
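To put the new methods to use, here is a hypothetical usage sketch. It assumes a model, loss function, optimizer, and data loaders have already been created, and that the StepByStep class exposes the set_loaders() and train() methods built in earlier chapters:

sbs = StepByStep(model, loss_fn, optimizer)
sbs.set_loaders(train_loader, val_loader)

# Clips the total gradient norm at 1.0 before every parameter update
sbs.set_clip_grad_norm(max_norm=1.0, norm_type=2)
sbs.train(n_epochs=10)

# Disables clipping for any subsequent training
sbs.remove_clip()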