weights. If done properly, the initial distribution of the weights may lead to a more
consistent distribution of activation values across layers.

If you haven’t noticed already, keeping similar distributions of activation values
across all layers is exactly what batch normalization was doing.

So, if you’re using batch normalization, vanishing gradients are likely not an issue.
But, before batch normalization layers became a thing, there was another way of
tackling the problem, which is the topic of the next section.

Initialization Schemes

An initialization scheme is a clever way of tweaking the initial distribution of the
weights. It is all about choosing the best standard deviation to use for drawing
random weights from a normal or uniform distribution. In this section, we’ll briefly
discuss two of the most traditional schemes, Xavier (Glorot) and Kaiming (He), and
how to manually initialize weights in PyTorch. For a more detailed explanation of
the inner workings of these initialization schemes, please check my post:
"Hyper-parameters in Action! Part II — Weight Initializers." [133]

The Xavier (Glorot) initialization scheme was developed by Xavier Glorot and
Yoshua Bengio and is meant to be used with the hyperbolic tangent (TanH)
activation function. It is referred to as either Xavier or Glorot initialization,
depending on the context. In PyTorch, it is available as both
nn.init.xavier_uniform_() and nn.init.xavier_normal_().

The Kaiming (He) initialization scheme was developed by Kaiming He (yes, the
same guy from the ResNet architecture) et al. and is meant to be used with the
rectified linear unit (ReLU) activation function. It is referred to as either Kaiming
or He initialization, depending on the context. In PyTorch, it is available as both
nn.init.kaiming_uniform_() and nn.init.kaiming_normal_().

"Should I use uniform or normal distribution?"

It shouldn’t make much of a difference, but using the uniform distribution usually
delivers slightly better results than the alternative.

"Do I have to manually initialize the weights?"
Not necessarily, no. If you’re using transfer learning, for instance, this is pretty
much not an issue because most of the model would be already trained, and a bad
initialization of the trainable part should have little to no impact on model training.
Besides, as we’ll see in a short while, using batch normalization layers makes your
model much more forgiving when it comes to a bad initialization of the weights.
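If you do want to set the weights yourself, though, the initialization functions
mentioned above operate in place on a layer’s parameter tensors. Here is a minimal
sketch (the layer sizes are made up for illustration only):

import torch
import torch.nn as nn

torch.manual_seed(42)
layer = nn.Linear(10, 5)   # made-up sizes, just for illustration

# Xavier (Glorot): the usual choice if a TanH activation follows the layer
nn.init.xavier_uniform_(layer.weight)

# Kaiming (He): the usual choice if a ReLU activation follows the layer
# nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')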
"What about PyTorch’s defaults? Can’t I simply trust them?"
Trust, but verify. Each PyTorch layer has its own default initialization of the
weights in the reset_parameters() method. For instance, the nn.Linear layer is
initialized using the Kaiming (He) scheme drawn from a uniform distribution:
# nn.Linear.reset_parameters()
def reset_parameters(self) -> None:
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)
Moreover, it initializes the biases based on the "fan-in," which is simply the
number of units in the preceding layer (for nn.Linear, its number of input features).
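As a quick sanity check, here is a minimal sketch (the layer size is made up)
confirming that the default bias bound of an nn.Linear layer is one over the square
root of its fan-in:

import math
import torch.nn as nn

# Made-up layer with 300 input features: its fan-in is 300, so the
# default bias bound is 1 / sqrt(300), roughly 0.0577
layer = nn.Linear(300, 50)
fan_in = layer.in_features
bound = 1 / math.sqrt(fan_in)
print(fan_in, round(bound, 4))                  # 300 0.0577
print(layer.bias.abs().max().item() <= bound)   # True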
IMPORTANT: Every default initialization has its own
assumptions, and in this particular case it is assumed (in the
reset_parameters() method) that the nn.Linear layer will be
followed by a leaky ReLU (the default value for the nonlinearity
argument in the Kaiming initialization) with a negative slope
equal to the square root of five (the “a” argument in the Kaiming
initialization).
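To see what these defaults amount to, here is a small sketch (not PyTorch’s own
code) computing the gain implied by a leaky ReLU with negative slope equal to the
square root of five; combined with the fan-in term, the resulting weight bound works
out to 1/sqrt(fan_in), the same value used for the biases above:

import math
import torch.nn as nn

# Gain implied by the defaults: for a leaky ReLU with negative slope a,
# gain = sqrt(2 / (1 + a^2)); with a = sqrt(5), that is sqrt(1/3), about 0.577.
# Kaiming uniform then uses bound = sqrt(3) * gain / sqrt(fan_in),
# which simplifies to 1 / sqrt(fan_in).
gain = nn.init.calculate_gain('leaky_relu', math.sqrt(5))
print(round(gain, 4), round(math.sqrt(1 / 3), 4))  # 0.5774 0.5774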
If your model does not follow these assumptions, you may run into problems. For
instance, our model used a regular ReLU instead of a leaky one, so the default
initialization scheme was off and we ended up with vanishing gradients.
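One way of addressing that is to re-initialize the affected layers using the
appropriate nonlinearity argument. Below is a minimal, illustrative sketch (the
weights_init helper and the model variable are placeholders, not the book’s code),
assuming a model built from nn.Linear layers followed by regular ReLUs:

import torch
import torch.nn as nn

def weights_init(m):
    # Illustrative helper: re-initialize Linear layers assuming each one is
    # followed by a regular ReLU (nonlinearity='relu' adjusts the gain)
    if isinstance(m, nn.Linear):
        nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# Hypothetical usage, applied recursively to every submodule of `model`:
# with torch.no_grad():
#     model.apply(weights_init)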
"How am I supposed to know that?"
Unfortunately, there is no easy way around it. You may inspect a layer’s
reset_parameters() method and figure out its assumptions from the code (like we