Clipping with Hooks

First, we reset the parameters once again:

Model Configuration

torch.manual_seed(42)
with torch.no_grad():
    model.apply(weights_init)

Next, we use set_clip_backprop() to clip the gradients during backpropagation using hooks:

Model Training

sbs_reg_clip_hook = StepByStep(model, loss_fn, optimizer)
sbs_reg_clip_hook.set_loaders(train_loader)
sbs_reg_clip_hook.set_clip_backprop(1.0)
sbs_reg_clip_hook.capture_gradients(['fc1'])
sbs_reg_clip_hook.train(10)
sbs_reg_clip_hook.remove_clip()
sbs_reg_clip_hook.remove_hooks()

fig = sbs_reg_clip_hook.plot_losses()

Figure E.8 - Losses—clipping by value with hooks

The loss is, once again, well behaved. At first sight, there doesn’t seem to be any difference…
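Under the hood, clipping gradients during backpropagation boils down to registering a backward hook on each parameter, so that its gradient is clamped the moment it is computed. The snippet below is a minimal sketch of that mechanism, not the actual implementation of set_clip_backprop(); it assumes the same model as above and uses a clip value of 1.0 to mirror the call in the training snippet.

import torch

# sketch: clamp every parameter's gradient to [-1.0, 1.0] as soon as it is
# computed during the backward pass
handles = []
for p in model.parameters():
    if p.requires_grad:
        handles.append(p.register_hook(lambda grad: torch.clamp(grad, -1.0, 1.0)))

# ... training happens here, with gradients clipped on the fly ...

# the stored handles allow us to remove the hooks afterwards (this is the kind
# of bookkeeping a method like remove_clip() takes care of)
for handle in handles:
    handle.remove()

Since the hooks are attached to the parameters themselves, they keep firing on every backward pass until they are explicitly removed, hence the calls to remove_clip() and remove_hooks() at the end of the training snippet.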
Is there really no difference, though? Let’s compare the distributions of the computed gradients over the whole training loop for both methods.

Figure E.9 - Distributions of gradients during training

Well, that’s a big difference! On the left plot, the gradients were computed as usual and only clipped right before the parameter update, to prevent the compounding effect that led to the explosion of the gradients. On the right plot, no gradients are ever above the clip value (in absolute terms).

Keep in mind that, even though the choice of clipping method does not seem to have an impact on the overall loss of our simple model, this won’t hold true for recurrent neural networks; you should use hooks for clipping gradients in that case.

Recap

This extra chapter was much shorter than the others, and its purpose was to illustrate some simple techniques to take back control of gradients gone wild. Therefore, we’re skipping the "Putting It All Together" section this time. We used two simple datasets, together with two simple models, to show the signs of both vanishing and exploding gradients. The former issue was addressed with different initialization schemes and, optionally, batch normalization, while the latter was addressed by clipping the gradients in different ways. This is what we’ve covered:

• visualizing the vanishing gradients problem in deeper models
• using a function to initialize the weights of a model (see the sketch after this list)
• visualizing the effect of initialization schemes on the gradients
• realizing that batch normalization can compensate for bad initializations
• understanding the exploding gradients problem
• using gradient clipping to address the exploding gradients problem
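Since model.apply(weights_init) was used at the top of this section without repeating the function, here is a minimal sketch of what such an initialization function can look like. This is a hypothetical example using Kaiming (He) initialization for linear layers; the actual weights_init defined earlier in the chapter may use a different scheme.

import torch
import torch.nn as nn

def weights_init(m):
    # hypothetical scheme: Kaiming/He initialization for every linear layer
    if isinstance(m, nn.Linear):
        nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# applied to every submodule, mirroring the Model Configuration snippet above
with torch.no_grad():
    model.apply(weights_init)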