
After applying each scheduler to SGD with momentum, and to SGD with Nesterov’s momentum, we obtain the following paths:

Figure 6.28 - Paths taken by SGD combining momentum and scheduler

Adding a scheduler to the mix seems to have helped the optimizer to achieve a more stable path toward the minimum.

The general idea behind using a scheduler is to allow the optimizer to alternate between exploring the loss surface (high learning rate phase) and targeting a minimum (low learning rate phase).

What is the impact of the scheduler on loss trajectories? Let’s check it out!


Figure 6.29 - Losses for SGD combining momentum and scheduler

It is definitely harder to tell the difference between curves in the same row, except for the combination of Nesterov’s momentum and cyclical scheduler, which produced a smoother reduction in the training loss.
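
In PyTorch, wiring a scheduler on top of SGD with Nesterov’s momentum takes only a couple of extra lines. The sketch below is illustrative only (a toy linear model, dummy data, and arbitrary hyperparameters rather than the chapter’s actual experiment); it uses StepLR, an epoch-level scheduler, to show where scheduler.step() fits relative to optimizer.step():

import torch
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR

# toy model and data, for illustration only
model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
x = torch.randn(64, 1)
y = 2 * x + 1

# SGD with Nesterov's momentum...
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
# ...plus a scheduler that halves the learning rate every four epochs
scheduler = StepLR(optimizer, step_size=4, gamma=0.5)

for epoch in range(12):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()    # parameter update uses the current learning rate
    scheduler.step()    # learning rate update happens once per epoch
    print(epoch, scheduler.get_last_lr())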

Adaptive vs Cycling

Although adaptive learning rates are considered competitors of cyclical learning rates, nothing prevents you from combining them and cycling learning rates while using Adam. While Adam adapts the gradients using its EWMAs, the cycling policy modifies the learning rate itself, so the two techniques can indeed work together.
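
As a minimal sketch of that combination (placeholder model and arbitrary bounds, not values from the chapter), Adam can be paired with PyTorch’s CyclicLR. Since Adam has no momentum argument (its "momentum" lives in the betas), momentum cycling must be disabled:

import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import CyclicLR

model = nn.Linear(10, 3)                       # placeholder model
optimizer = Adam(model.parameters(), lr=1e-3)

scheduler = CyclicLR(
    optimizer,
    base_lr=1e-4,           # lower bound of the cycle
    max_lr=1e-3,            # upper bound of the cycle
    step_size_up=50,        # mini-batch steps from base_lr up to max_lr
    mode='triangular',
    cycle_momentum=False,   # required with Adam, which has no 'momentum' entry
)

# Inside the training loop, CyclicLR is stepped once per mini-batch,
# right after optimizer.step().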

There is much more to learn about learning rates: this section is meant to be only a short introduction to the topic.

Putting It All Together

In this chapter, we were all over the place: data preparation, model configuration, and model training—a little bit of everything. Starting with a brand-new dataset, Rock Paper Scissors, we built a method for standardizing the images (for real this time) using a temporary data loader. Next, we developed a fancier model that included dropout layers for regularization. Then, we turned our focus to the training part, diving deeper into learning rates, optimizers, and schedulers. We implemented many methods: for finding a learning rate, for capturing gradients and parameters, and for updating the learning rate using a scheduler.
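
The chapter implements these steps through its helper methods; the condensed outline below is not that code, just a rough standalone sketch of the same pipeline with made-up folder names, image size, and hyperparameters: a temporary loader to compute the statistics used for standardization, a small CNN with dropout, and SGD with Nesterov’s momentum driven by a scheduler.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from torchvision.transforms import Compose, Resize, ToTensor, Normalize

# 1) Data preparation: a temporary loader used only to compute per-channel
#    statistics over the training images (hypothetical 'rps/train' folder)
temp_transform = Compose([Resize((28, 28)), ToTensor()])
temp_dataset = ImageFolder(root='rps/train', transform=temp_transform)
temp_loader = DataLoader(temp_dataset, batch_size=16)

channel_sum = torch.zeros(3)
channel_sq_sum = torch.zeros(3)
n_pixels = 0
for images, _ in temp_loader:
    channel_sum += images.sum(dim=[0, 2, 3])
    channel_sq_sum += (images ** 2).sum(dim=[0, 2, 3])
    n_pixels += images.numel() // images.size(1)
mean = channel_sum / n_pixels
std = (channel_sq_sum / n_pixels - mean ** 2).sqrt()

# the real loader standardizes the images using the computed statistics
transform = Compose([Resize((28, 28)), ToTensor(),
                     Normalize(mean=mean.tolist(), std=std.tolist())])
train_dataset = ImageFolder(root='rps/train', transform=transform)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# 2) Model configuration: a small CNN with dropout layers for regularization
class SmallCNN(nn.Module):
    def __init__(self, n_filters=5, p=0.3):
        super().__init__()
        self.conv1 = nn.Conv2d(3, n_filters, kernel_size=3)
        self.conv2 = nn.Conv2d(n_filters, n_filters, kernel_size=3)
        self.drop = nn.Dropout(p)
        self.fc1 = nn.Linear(n_filters * 5 * 5, 50)
        self.fc2 = nn.Linear(50, 3)    # rock, paper, scissors

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)   # 28 -> 26 -> 13
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)   # 13 -> 11 -> 5
        x = x.flatten(start_dim=1)
        x = self.drop(F.relu(self.fc1(x)))
        return self.fc2(x)

model = SmallCNN()
loss_fn = nn.CrossEntropyLoss()

# 3) Model training: SGD with Nesterov's momentum plus an epoch-level scheduler
optimizer = SGD(model.parameters(), lr=0.03, momentum=0.9, nesterov=True)
scheduler = StepLR(optimizer, step_size=4, gamma=0.5)

for epoch in range(10):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()    # one learning rate update per epoch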

