After applying each scheduler to SGD with momentum, and to SGD with Nesterov's momentum, we obtain the following paths:

Figure 6.28 - Paths taken by SGD combining momentum and scheduler

Adding a scheduler to the mix seems to have helped the optimizer to achieve a more stable path toward the minimum.

The general idea behind using a scheduler is to allow the optimizer to alternate between exploring the loss surface (high learning rate phase) and targeting a minimum (low learning rate phase).

What is the impact of the scheduler on loss trajectories? Let's check it out!
Figure 6.29 - Losses for SGD combining momentum and scheduler
It is definitely harder to tell the difference between curves in the same row, except
for the combination of Nesterov’s momentum and cyclical scheduler, which
produced a smoother reduction in the training loss.
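In case you want to experiment with a combination like these, here is a minimal sketch of pairing SGD with Nesterov's momentum and a cyclical scheduler. The one-parameter model, synthetic data, and hyper-parameter values below are made up for illustration only:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CyclicLR

torch.manual_seed(13)

# Hypothetical one-parameter linear model and synthetic data,
# just to illustrate the optimizer + scheduler interplay
model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
x = torch.randn(16, 1)
y = 2 * x + 1

# SGD with Nesterov's momentum; the scheduler cycles the learning
# rate between base_lr (targeting phase) and max_lr (exploring phase)
optimizer = optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, nesterov=True)
scheduler = CyclicLR(optimizer, base_lr=0.025, max_lr=0.1,
                     step_size_up=10, mode='triangular')

for step in range(40):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # cyclical schedulers step once per mini-batch

Since cyclical schedulers operate at the mini-batch level, scheduler.step() is called after every optimizer.step(), not once per epoch.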
Adaptive vs Cycling
Although adaptive learning rates are often considered competitors of cyclical learning rates, nothing prevents you from combining them; that is, cycling the learning rate while using Adam. While Adam adapts the gradients using its EWMAs, the cycling policy modifies the learning rate itself, so the two techniques can indeed work together.
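As a minimal sketch (the model and hyper-parameter values below are placeholders), such a combination could look like this. Notice that, since Adam has no "momentum" hyper-parameter (it uses betas instead), PyTorch's CyclicLR requires cycle_momentum=False to accept it:

import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CyclicLR

model = nn.Linear(10, 2)  # hypothetical stand-in model

# Adam adapts per-parameter step sizes via its EWMAs, while the
# scheduler cycles the base learning rate itself
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Momentum cycling must be disabled, since Adam's defaults
# do not include a "momentum" entry
scheduler = CyclicLR(optimizer, base_lr=1e-4, max_lr=1e-3,
                     step_size_up=200, cycle_momentum=False)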
There is much more to learn about learning rates; this section is meant to be only a short introduction to the topic.
Putting It All Together
In this chapter, we were all over the place: data preparation, model configuration,
and model training—a little bit of everything. Starting with a brand-new dataset,
Rock Paper Scissors, we built a method for standardizing the images (for real this
time) using a temporary data loader. Next, we developed a fancier model that
included dropout layers for regularization. Then, we turned our focus to the
training part, diving deeper into learning rates, optimizers, and schedulers. We
implemented many methods: for finding a learning rate, for capturing gradients
and parameters, and for updating the learning rate using a scheduler.
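To give you a taste of one of those methods, here is a condensed sketch of the idea behind the learning rate range test. It is a simplified, hypothetical helper, not the actual implementation developed in this chapter:

import torch

# Hypothetical helper - a condensed sketch of the learning rate
# range test idea: the learning rate is multiplied by a constant
# factor after every mini-batch, while the loss is recorded
def lr_range_test(model, loss_fn, optimizer, data_loader,
                  end_lr=1.0, num_iter=100):
    start_lr = optimizer.param_groups[0]['lr']
    factor = (end_lr / start_lr) ** (1 / num_iter)
    lrs, losses = [], []
    iterator = iter(data_loader)
    for _ in range(num_iter):
        try:
            x_batch, y_batch = next(iterator)
        except StopIteration:  # restart the loader if it runs out
            iterator = iter(data_loader)
            x_batch, y_batch = next(iterator)
        optimizer.zero_grad()
        loss = loss_fn(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()
        lrs.append(optimizer.param_groups[0]['lr'])
        losses.append(loss.item())
        for group in optimizer.param_groups:
            group['lr'] *= factor  # exponential learning rate growth
    return lrs, losses

Plotting the recorded losses against the learning rates (on a log scale) produces the kind of curve we used earlier in this chapter to pick a suitable learning rate.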