Figure 6.20 - Paths taken by SGD (with and without momentum)

Like the Adam optimizer, SGD with momentum moves faster and overshoots. But it does get somewhat carried away with its speed, so much so that it gets past the target and has to backtrack to approach it from a different direction.

The analogy for the momentum update is that of a ball rolling down a hill: It picks up so much speed that it ends up climbing the opposite side of the valley, only to roll back down again with a little bit less speed, doing this back and forth over and over again until eventually reaching the bottom.
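In PyTorch, momentum is just an argument of the SGD optimizer. Here is a minimal sketch of setting it up, assuming a placeholder linear model and illustrative values for the learning rate and momentum (not the hyperparameters used elsewhere in this chapter):

import torch.nn as nn
import torch.optim as optim

# Placeholder model, just so there are parameters to optimize
model = nn.Linear(1, 1)

# SGD with momentum keeps a "velocity" for each parameter:
#   v = momentum * v + grad
#   param = param - lr * v
# The velocity is what makes the "ball" keep rolling past the target.
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

The larger the momentum value, the longer the ball keeps rolling in its current direction before the gradients manage to turn it around.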

"Isn’t Adam better than this already?"

Yes and no. Adam indeed converges more quickly to a minimum, but not necessarily a good one. In a simple linear regression, there is a global minimum corresponding to the optimal value of the parameters. This is not the case in deep learning models: There are many minima (plural of minimum), and some are better than others (corresponding to lower losses). So, Adam will find one of these minima and move there fast, perhaps overlooking better alternatives in the neighborhood.

Momentum may seem a bit sloppy at first, but it may be combined with a learning rate scheduler (more on that shortly!) to better explore the loss surface in hopes of finding a better-quality minimum than Adam does.
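As a rough sketch of that combination, assuming the same placeholder model and an illustrative StepLR scheduler (schedulers themselves are covered shortly), the training setup could look like this:

import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(1, 1)  # placeholder model, as in the sketch above

# SGD with momentum plus a scheduler that multiplies the learning rate
# by 0.1 every 30 epochs (both values are illustrative choices)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... forward pass, loss computation, and loss.backward() go here ...
    optimizer.step()       # update parameters using the momentum rule
    optimizer.zero_grad()  # clear gradients for the next iteration
    scheduler.step()       # shrink the learning rate on schedule

Shrinking the learning rate over time reins in the ball's speed, so the big early steps explore the surface while the small later steps settle into whichever minimum was found.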

Both alternatives, Adam and SGD with momentum (especially when combined with a learning rate scheduler), are commonly used.
