

Figure E.3 - The effect of batch normalization

The left-most plot shows us the result of a bad initialization scheme: vanished gradients. The center plot shows us the result of a proper initialization scheme. Finally, the right-most plot shows us that batch normalization can indeed compensate for a bad initialization.

Not all bad gradients vanish, though; some bad gradients explode!

Exploding Gradients

The root of the problem is the same: deeper and deeper models. If the model has only a couple of layers, one large gradient won't do any harm. But, if there are many layers, the gradients may end up growing uncontrollably. That's the so-called exploding gradients problem, and it's fairly easy to spot: Just look for NaN values in the loss. If that's the case, it means that the gradients grew so large that they cannot be properly represented anymore.

"Why does it happen?"

There may be a compounding effect (think of raising a relatively small number, e.g., 1.5, to a large power, e.g., 20), especially in recurrent neural networks (the topic of Chapter 8), since the same weights are used repeatedly along a sequence. But there are other reasons as well: The learning rate may be too high, or the target variable (in a regression problem) may have a large range of values. Let me illustrate it.
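To make the compounding effect concrete, here is a minimal sketch (not one of the book's listings) that raises 1.5 to the 20th power and shows the kind of NaN check you could use to spot an exploded loss; the loss value below is purely illustrative:

import torch

# Raising a relatively small number to a large power blows up quickly
factor = torch.tensor(1.5)
print(factor ** 20)  # roughly 3,325 - more than three orders of magnitude larger

# Exploding gradients are easy to spot: the loss eventually turns into NaN
loss = torch.tensor(float('nan'))  # illustrative value only
if torch.isnan(loss):
    print('Loss is NaN - the gradients may have exploded')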

Data Generation & Preparation

Let's use Scikit-Learn's make_regression() to generate a dataset of 1,000 points with ten features each, and a little bit of noise:

Data Generation & Preparation

from sklearn.datasets import make_regression
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic regression data: 1,000 points, ten features, a little noise
X_reg, y_reg = make_regression(
    n_samples=1000, n_features=10, noise=0.1, random_state=42
)
X_reg = torch.as_tensor(X_reg).float()
y_reg = torch.as_tensor(y_reg).float().view(-1, 1)

# Builds the dataset and a mini-batch loader
dataset = TensorDataset(X_reg, y_reg)
train_loader = DataLoader(
    dataset=dataset, batch_size=32, shuffle=True
)

Even though we cannot plot a ten-dimensional regression, we can still visualize the distribution of both feature and target values.

Figure E.4 - Distributions of feature and target values

It's all good and fine with our feature values, since they are inside a typical standardized range (-3, 3). The target values, though, are on a very different scale, from -400 to 400. If the target variable represents a monetary value, for example, these ranges are fairly common. Sure, we could standardize the target values as well, but that would ruin the example of exploding gradients!

Model Configuration & Training

We can build a fairly simple model to tackle this regression problem: a network with one hidden layer of 15 units, a ReLU as activation function, and an output layer.
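A minimal sketch of such a model configuration, assuming a plain nn.Sequential model (the MSE loss, the SGD optimizer, and the 0.01 learning rate below are illustrative assumptions, not the book's exact choices):

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)
model = nn.Sequential(
    nn.Linear(10, 15),  # ten input features, 15 hidden units
    nn.ReLU(),          # activation function
    nn.Linear(15, 1),   # single output unit for the regression target
)
loss_fn = nn.MSELoss()                              # typical loss for regression
optimizer = optim.SGD(model.parameters(), lr=0.01)  # illustrative learning rate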

