Output

OrderedDict([('running_mean', tensor([0.9070, 1.0931])),
             ('running_var', tensor([1.2592, 0.9192])),
             ('num_batches_tracked', tensor(2))])

Both running mean and running variance are simple averages over the mini-batches:

mean2, var2 = batch2[0].mean(axis=0), batch2[0].var(axis=0)
running_mean, running_var = (mean1 + mean2) / 2, (var1 + var2) / 2
running_mean, running_var

Output

(tensor([0.9070, 1.0931]), tensor([1.2592, 0.9192]))

Now, let’s pretend we have finished training (even though we don’t have an actual model), and we’re using the third mini-batch for evaluation.

Evaluation Phase

Just like dropout, batch normalization exhibits different behaviors depending on the mode: train or eval. We’ve already seen what it does during the training phase. We’ve also realized that it doesn’t make sense to compute statistics for any data that isn’t training data.

So, in the evaluation phase, it will use the running statistics computed during training to standardize the new data (the third mini-batch, in our small example):

batch_normalizer.eval()
normed3 = batch_normalizer(batch3[0])
normed3.mean(axis=0), normed3.var(axis=0, unbiased=False)

Output

(tensor([ 0.1590, -0.0970]), tensor([1.0134, 1.4166]))
"Is it a bit off again?"Actually, no—since it is standardizing unseen data using statistics computed ontraining data, the results above are expected. The mean will be around zero andthe standard deviation will be around one.MomentumThere is an alternative way of computing running statistics: Instead of using asimple average, it uses an exponentially weighted moving average (EWMA) of thestatistics.The naming convention, though, is very unfortunate: The alpha parameter of theEWMA was named momentum, adding to the confusion. There is even a note inPyTorch’s documentation warning about this:"This momentum argument is different from one used in optimizer classes and theconventional notion of momentum." [128]The bottom line is: Ignore the confusing naming convention andthink of the "momentum" argument as the alpha parameter of aregular EWMA.The documentation also uses x to refer to a particular statistic when introducingthe mathematical formula of "momentum," which does not help at all.So, to make it abundantly clear what is being computed, I present the formulasbelow:Equation 7.7 - Running statisticLet’s try it out in practice:batch_normalizer_mom = nn.BatchNorm1d(num_features=2, affine=False, momentum=0.1)batch_normalizer_mom.state_dict()542 | Chapter 7: Transfer Learning
"Is it a bit off again?"
Actually, no—since it is standardizing unseen data using statistics computed on
training data, the results above are expected. The mean will be around zero and
the standard deviation will be around one.
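If you’d like to verify this, here is a minimal sketch (it assumes batch3, batch_normalizer, and normed3 from the snippets above are still in scope) that standardizes the third mini-batch by hand using the stored running statistics:

state = batch_normalizer.state_dict()
# In eval mode, BatchNorm1d uses the stored running statistics, not
# the statistics of the mini-batch being normalized; eps is the small
# constant (1e-5 by default) added for numerical stability
manual3 = ((batch3[0] - state['running_mean'])
           / (state['running_var'] + batch_normalizer.eps).sqrt())
# Should print True: the manual result matches the layer's output
print(torch.allclose(manual3, normed3))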
Momentum
There is an alternative way of computing running statistics: Instead of using a
simple average, it uses an exponentially weighted moving average (EWMA) of the
statistics.
The naming convention, though, is very unfortunate: The alpha parameter of the
EWMA was named momentum, adding to the confusion. There is even a note in
PyTorch’s documentation warning about this:
"This momentum argument is different from one used in optimizer classes and the
conventional notion of momentum." [128]
The bottom line is: Ignore the confusing naming convention and
think of the "momentum" argument as the alpha parameter of a
regular EWMA.
The documentation also uses x to refer to a particular statistic when introducing
the mathematical formula of "momentum," which does not help at all.
So, to make it abundantly clear what is being computed, I present the formulas
below:
$$\text{running statistic} = (1 - \text{momentum}) \cdot \text{running statistic} + \text{momentum} \cdot \text{statistic}_{\text{new mini-batch}}$$

Equation 7.7 - Running statistic
Let’s try it out in practice:
batch_normalizer_mom = nn.BatchNorm1d(
num_features=2, affine=False, momentum=0.1
)
batch_normalizer_mom.state_dict()
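To see the EWMA at work, we can feed the first mini-batch to this new batch normalizer and reproduce its updated running statistics by hand. This is just a sketch, assuming batch1, mean1, and var1 from the earlier snippets are still in scope:

batch_normalizer_mom.train()
normed1_mom = batch_normalizer_mom(batch1[0])

# The running statistics start at zero (mean) and one (variance), so
# after a single mini-batch, Equation 7.7 gives us:
alpha = 0.1  # the "momentum" argument
ewma_mean = (1 - alpha) * torch.zeros(2) + alpha * mean1
ewma_var = (1 - alpha) * torch.ones(2) + alpha * var1

state = batch_normalizer_mom.state_dict()
# Should print True twice
print(torch.allclose(state['running_mean'], ewma_mean))
print(torch.allclose(state['running_var'], ewma_var))

With momentum=0.1, the running statistics move only 10% of the way toward each new mini-batch’s statistics, so recent mini-batches are weighted exponentially more than older ones.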