Figure E.3 - The effect of batch normalization
The left-most plot shows us the result of a bad initialization scheme: vanished
gradients. The center plot shows us the result of a proper initialization scheme.
Finally, the right-most plot shows us that batch normalization can indeed
compensate for a bad initialization.
Not all bad gradients vanish, though—some bad gradients explode!
Exploding Gradients
The root of the problem is the same: deeper and deeper models. If the model has a
couple of layers only, one large gradient won’t do any harm. But, if there are many
layers, the gradients may end up growing uncontrollably. That’s the so-called
exploding gradients problem, and it’s fairly easy to spot: Just look for NaN values in
the loss. If that’s the case, it means that the gradients grew so large that they cannot
be properly represented anymore.
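Inside a training loop, a cheap guard is to test the loss before calling backward(). The snippet below is only a minimal sketch of that check, with a made-up loss value standing in for a real training step:

import torch

# Stand-in for a loss that has already overflowed during training
loss = torch.tensor(float('inf')) * 0  # inf * 0 produces nan

if not torch.isfinite(loss):  # flags nan as well as +/- inf
    print('Exploding gradients suspected: loss =', loss.item())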
"Why does it happen?"
There may be a compounding effect (think of raising a relatively small number (e.g.,
1.5) to a large power (e.g., 20)), especially in recurrent neural networks (the topic
of Chapter 8), since the same weights are used repeatedly along a sequence. But,
there are other reasons as well: The learning rate may be too high, or the target
variable (in a regression problem) may have a large range of values. Let me
illustrate it.
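Just to put a number on that compounding effect before moving on to the actual example (this little loop is not part of the book's code, only an illustration): twenty successive factors of 1.5 already scale a gradient up by more than three orders of magnitude.

factor = 1.0
for _ in range(20):   # twenty layers (or time steps), each contributing a local gradient of 1.5
    factor *= 1.5
print(factor)         # roughly 3325.26, that is, 1.5 ** 20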
Data Generation & Preparation
Let’s use Scikit-Learn’s make_regression() to generate a dataset of 1,000 points
with ten features each, and a little bit of noise:
Data Generation & Preparation

X_reg, y_reg = make_regression(
    n_samples=1000, n_features=10, noise=0.1, random_state=42
)
X_reg = torch.as_tensor(X_reg).float()
y_reg = torch.as_tensor(y_reg).float().view(-1, 1)

dataset = TensorDataset(X_reg, y_reg)
train_loader = DataLoader(
    dataset=dataset, batch_size=32, shuffle=True
)

Even though we cannot plot a ten-dimensional regression, we can still visualize the distribution of both features and target values.

Figure E.4 - Distributions of feature and target values

It's all good and fine with our feature values since they are inside a typical standardized range (-3, 3). The target values, though, are on a very different scale, from -400 to 400. If the target variable represents a monetary value, for example, these ranges are fairly common. Sure, we could standardize the target value as well, but that would ruin the example of exploding gradients!

Model Configuration & Training

We can build a fairly simple model to tackle this regression problem: a network with one hidden layer with 15 units, a ReLU as activation function, and an output layer.
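Before getting to the actual configuration, here is a minimal sketch of what such a model could look like (the variable names and the learning rate below are illustrative placeholders, not the book's listing):

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)
model = nn.Sequential(
    nn.Linear(10, 15),  # ten features in, 15 hidden units
    nn.ReLU(),
    nn.Linear(15, 1),   # single output for regression
)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)  # learning rate chosen arbitrarily here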