Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide


weights. If done properly, the initial distribution of the weights may lead to a more consistent distribution of activation values across layers.

If you haven't noticed already, keeping similar distributions of activation values across all layers is exactly what batch normalization was doing.

So, if you're using batch normalization, vanishing gradients are likely not an issue. But, before batch normalization layers became a thing, there was another way of tackling the problem, which is the topic of the next section.

Initialization Schemes

An initialization scheme is a clever way of tweaking the initial distribution of the weights. It is all about choosing the best standard deviation to use for drawing random weights from a normal or uniform distribution. In this section, we'll briefly discuss two of the most traditional schemes, Xavier (Glorot) and Kaiming (He), and how to manually initialize weights in PyTorch. For a more detailed explanation of the inner workings of these initialization schemes, please check my post: "Hyper-parameters in Action! Part II — Weight Initializers." [133]

The Xavier (Glorot) initialization scheme was developed by Xavier Glorot and Yoshua Bengio and is meant to be used with the hyperbolic tangent (TanH) activation function. It is referred to as either Xavier or Glorot initialization, depending on the context. In PyTorch, it is available as both nn.init.xavier_uniform_() and nn.init.xavier_normal_().

The Kaiming (He) initialization scheme was developed by Kaiming He (yes, the same guy from the ResNet architecture) et al. and is meant to be used with the rectified linear unit (ReLU) activation function. It is referred to as either Kaiming or He initialization, depending on the context. In PyTorch, it is available as both nn.init.kaiming_uniform_() and nn.init.kaiming_normal_().

"Should I use uniform or normal distribution?"

It shouldn't make much of a difference, but using the uniform distribution usually delivers slightly better results than the alternative.

"Do I have to manually initialize the weights?"


Not necessarily, no. If you're using transfer learning, for instance, this is pretty much not an issue because most of the model would be already trained, and a bad initialization of the trainable part should have little to no impact on model training. Besides, as we'll see in a short while, using batch normalization layers makes your model much more forgiving when it comes to a bad initialization of the weights.
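If you do want (or need) to initialize weights yourself, it boils down to calling the corresponding nn.init function on a layer's weight tensor. Here is a minimal sketch (the layer below is made up for illustration; each call modifies the tensor in place, so in practice you would pick only one scheme):

import torch.nn as nn

layer = nn.Linear(10, 20)  # made-up layer, just for illustration

# Xavier (Glorot), meant for TanH activations
nn.init.xavier_uniform_(layer.weight)  # or nn.init.xavier_normal_()

# Kaiming (He), meant for ReLU activations (this overwrites the call above)
nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')  # or nn.init.kaiming_normal_()

# biases are commonly zero-initialized
nn.init.zeros_(layer.bias)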

"What about PyTorch’s defaults? Can’t I simply trust them?"

Trust, but verify. Each PyTorch layer has its own default initialization of the weights in the reset_parameters() method. For instance, the nn.Linear layer is initialized using the Kaiming (He) scheme drawn from a uniform distribution:

# nn.Linear.reset_parameters()
def reset_parameters(self) -> None:
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)

Moreover, it also initializes the biases based on the "fan-in," which is simply the number of units in the preceding layer.
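As a quick sanity check, we can reproduce that bound for a made-up layer, using the same (private) helper function that appears in the code above:

import math
import torch.nn as nn
from torch.nn import init

layer = nn.Linear(10, 20)  # fan-in is 10, the number of input units
fan_in, _ = init._calculate_fan_in_and_fan_out(layer.weight)
bound = 1 / math.sqrt(fan_in)
print(fan_in, bound)  # 10, ~0.316
# the default biases should indeed lie inside (-bound, bound)
print(layer.bias.min().item() >= -bound, layer.bias.max().item() <= bound)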

IMPORTANT: Every default initialization has its own assumptions, and in this particular case it is assumed (in the reset_parameters() method) that the nn.Linear layer will be followed by a leaky ReLU (the default value for the nonlinearity argument in the Kaiming initialization) with a negative slope equal to the square root of five (the "a" argument in the Kaiming initialization).
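To get a sense of how different that assumption is from a regular ReLU, we can compare the gains PyTorch computes for the two nonlinearities (a quick check, not part of the original code):

import math
import torch.nn as nn

# gain assumed by nn.Linear's default: leaky ReLU with negative slope sqrt(5)
default_gain = nn.init.calculate_gain('leaky_relu', math.sqrt(5))
# gain for a regular ReLU
relu_gain = nn.init.calculate_gain('relu')
print(default_gain, relu_gain)  # ~0.577 vs. ~1.414

The default thus draws weights with a much smaller spread than a ReLU-appropriate scheme would.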

If your model does not follow these assumptions, you may run into problems. For instance, our model used a regular ReLU instead of a leaky one, so the default initialization scheme was off and we ended up with vanishing gradients.
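One way out, in that case, is to override the default right after creating the model. The snippet below is a minimal sketch (the model and the weights_init function are made up for illustration), re-initializing every nn.Linear layer with a ReLU-appropriate Kaiming scheme:

import torch
import torch.nn as nn

# hypothetical ReLU-based model, just for illustration
model = nn.Sequential(
    nn.Linear(25, 10), nn.ReLU(),
    nn.Linear(10, 5), nn.ReLU(),
    nn.Linear(5, 1),
)

def weights_init(m):
    if isinstance(m, nn.Linear):
        # Kaiming (He) scheme, now telling PyTorch the actual
        # nonlinearity is a regular ReLU (not the default leaky ReLU)
        nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

with torch.no_grad():
    model.apply(weights_init)

Wrapping the call in no_grad() makes it explicit that these in-place changes to the parameters should not be tracked by autograd.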

"How am I supposed to know that?"

Unfortunately, there is no easy way around it. You may inspect a layer's reset_parameters() method and figure out its assumptions from the code (like we just did).
