
weights. If done properly, the initial distribution of the weights may lead to a more consistent distribution of activation values across layers.

If you haven’t noticed already, keeping similar distributions of activation values across all layers is exactly what batch normalization was doing.

So, if you’re using batch normalization, vanishing gradients are likely not an issue. But, before batch normalization layers became a thing, there was another way of tackling the problem, which is the topic of the next section.

Initialization Schemes

An initialization scheme is a clever way of tweaking the initial distribution of the weights. It is all about choosing the best standard deviation to use for drawing random weights from a normal or uniform distribution. In this section, we’ll briefly discuss two of the most traditional schemes, Xavier (Glorot) and Kaiming (He), and how to manually initialize weights in PyTorch. For a more detailed explanation of the inner workings of these initialization schemes, please check my post: "Hyperparameters in Action! Part II — Weight Initializers." [133]

The Xavier (Glorot) initialization scheme was developed by Xavier Glorot and Yoshua Bengio and is meant to be used with the hyperbolic-tangent (TanH) activation function. It is referred to as either Xavier or Glorot initialization, depending on the context. In PyTorch, it is available as both nn.init.xavier_uniform_() and nn.init.xavier_normal_() (the in-place versions, with the trailing underscore).
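A minimal sketch of what this might look like for a single linear layer followed by TanH (the layer sizes, the gain adjustment, and the zero-initialized bias below are illustrative assumptions, not prescriptions from the text):

import torch
import torch.nn as nn

# Sketch: re-initialize a linear layer's weights with the Xavier (Glorot) scheme
layer = nn.Linear(10, 5)  # illustrative sizes

with torch.no_grad():
    # Draw weights from a uniform distribution scaled by the Xavier formula;
    # the gain argument adjusts the scale for the TanH activation
    nn.init.xavier_uniform_(layer.weight, gain=nn.init.calculate_gain('tanh'))
    # Biases are commonly initialized to zero
    nn.init.zeros_(layer.bias)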

The Kaiming (He) initialization scheme was developed by Kaiming He (yes, the same guy from the ResNet architecture) et al. and is meant to be used with the rectified linear unit (ReLU) activation function. It is referred to as either Kaiming or He initialization, depending on the context. In PyTorch, it is available as both nn.init.kaiming_uniform_() and nn.init.kaiming_normal_().
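As a sketch of how one might apply it to every linear layer in a model, the snippet below uses model.apply() with a small helper (the helper name weights_init and the model layout are illustrative assumptions):

import torch
import torch.nn as nn

def weights_init(m):
    # Initialize only the linear layers; leave other modules untouched
    if isinstance(m, nn.Linear):
        # nonlinearity='relu' sets the gain appropriate for ReLU activations
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = nn.Sequential(
    nn.Linear(10, 5), nn.ReLU(),
    nn.Linear(5, 1),
)

# .apply() calls weights_init recursively on every submodule of the model
with torch.no_grad():
    model.apply(weights_init)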

"Should I use uniform or normal distribution?"

It shouldn’t make much of a difference, but using the uniform distribution usually

delivers slightly better results than the alternative.

"Do I have to manually initialize the weights?"
