Figure 6.20 - Paths taken by SGD (with and without momentum)

Like the Adam optimizer, SGD with momentum moves faster and overshoots. But it does seem to get carried away with it, so much so that it goes past the target and has to backtrack to approach it from a different direction.

The analogy for the momentum update is that of a ball rolling down a hill: It picks up so much speed that it ends up climbing the opposite side of the valley, only to roll back down again with a little bit less speed, doing this back and forth over and over again until eventually reaching the bottom.

"Isn’t Adam better than this already?"

Yes and no. Adam indeed converges more quickly to a minimum, but not necessarily a good one. In a simple linear regression, there is a global minimum corresponding to the optimal value of the parameters. This is not the case in deep learning models: There are many minima (plural of minimum), and some are better than others (corresponding to lower losses). So, Adam will find one of these minima and move there fast, perhaps overlooking better alternatives in the neighborhood.

Momentum may seem a bit sloppy at first, but it may be combined with a learning rate scheduler (more on that shortly!) to better explore the loss surface in hopes of finding a better-quality minimum than Adam does.

Both alternatives, Adam and SGD with momentum (especially when combined with a learning rate scheduler), are commonly used.
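In PyTorch, both alternatives are one-liners from torch.optim. A minimal sketch; the toy model and the learning rate of 0.1 are arbitrary placeholders, just for illustration:

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 1))  # toy model, for illustration only

# Adam: adaptive learning rates, typically fast to converge
adam = optim.Adam(model.parameters(), lr=0.1)

# SGD with momentum: a momentum (beta) of 0.9 is a common choice
sgd_mom = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)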
But, if a ball running downhill seems a bit too out of control for your taste, maybe
you’ll like its variant better…
Nesterov
The Nesterov accelerated gradient (NAG), or Nesterov for short, is a clever variant
of SGD with momentum. Let’s say we’re computing momentum for two
consecutive steps (t and t+1):
Equation 6.15 - Nesterov momentum

$$\mathrm{mom}_t = \beta \, \mathrm{mom}_{t-1} + \mathrm{grad}_t \qquad \mathrm{mom}_{t+1} = \beta \, \mathrm{mom}_t + \mathrm{grad}_{t+1}$$
In the current step (t), we use the current gradient (t) and the momentum from the
previous step (t-1) to compute the current momentum. So far, nothing new.
In the next step (t+1), we’ll use the next gradient (t+1) and the momentum we’ve
just computed for the current step (t) to compute the next momentum. Again,
nothing new.
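To make the recurrence concrete, here is a minimal sketch in plain Python; beta and the gradient values below are made up purely for illustration:

beta = 0.9  # a typical momentum value

grad_t, grad_t_plus_1 = 1.0, 0.8  # made-up gradients at steps t and t+1
mom_t_minus_1 = 0.5               # momentum carried over from step t-1

# current step (t): current gradient + beta times previous momentum
mom_t = beta * mom_t_minus_1 + grad_t          # 1.45

# next step (t+1): next gradient + beta times current momentum
mom_t_plus_1 = beta * mom_t + grad_t_plus_1    # 2.105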
What if I ask you to compute momentum one step ahead?
"Can you tell me momentum at step t+1 while you’re still at step t?"
"Of course I can’t, I do not know the gradient at step t+1!" you say, puzzled at my
bizarre question. Fair enough. So I ask you yet another question:
"What’s your best guess for the gradient at step t+1?"
I hope you answered, "The gradient at step t." If you do not know the future value of
a variable, the naive estimate is its current value. So, let’s go with it, Nesterov-style!
In a nutshell, what NAG does is try to compute momentum one step ahead since it
is only missing one value and has a good (naive) guess for it. It is as if it were
computing momentum between two steps:
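In other words, it plugs the naive guess (the gradient at step t) in place of the unknown gradient at step t+1, producing a look-ahead momentum halfway between the two steps:

$$\mathrm{mom}_{\text{look-ahead}} = \beta \, \mathrm{mom}_t + \mathrm{grad}_t$$

In PyTorch, Nesterov momentum is just a flag on the same SGD optimizer. A minimal sketch, reusing the placeholder model and learning rate from before:

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 1))  # toy model, for illustration only

# nesterov=True requires a non-zero momentum (and zero dampening)
optimizer = optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, nesterov=True)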