• dampening: dampening factor for momentum

• nesterov: enables Nesterov momentum, which is a smarter version of the regular momentum, and also has its own section

That's the perfect moment to dive deeper into momentum (sorry, I really cannot miss a pun!).

Momentum

One of SGD's tricks is called momentum. At first sight, it looks very much like using an EWMA for gradients, but it isn't. Let's compare EWMA's beta formulation with momentum's:

Equation 6.11 - Momentum vs EWMA

$ewma_t = \beta \, ewma_{t-1} + (1 - \beta) \, grad_t$

$mo_t = \beta \, mo_{t-1} + grad_t$

See the difference? Momentum does not average the gradients; it runs a cumulative sum of "discounted" gradients. In other words, all past gradients contribute to the sum, but they are "discounted" more and more as they grow older. The "discount" is driven by the beta parameter. We can also write the formula for momentum like this:

Equation 6.12 - Compounding momentum

$mo_t = grad_t + \beta \, grad_{t-1} + \beta^2 \, grad_{t-2} + \dots = \sum_{j=0}^{t-1} \beta^j \, grad_{t-j}$

The disadvantage of this second formula is that it requires the full history of gradients, while the previous one depends only on the gradient's current value and momentum's latest value.
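To make the difference concrete, here is a minimal sketch (not from the book; the gradient sequence and the beta value are made up) that runs both recursions side by side and checks that the recursive momentum matches the compounded sum of Equation 6.12:

beta = 0.9  # typical momentum value
grads = [0.5, 0.4, 0.6, 0.5, 0.7]  # hypothetical sequence of (all positive) gradients

mo, ewma = 0.0, 0.0
for grad in grads:
    mo = beta * mo + grad                   # momentum: discounted cumulative SUM
    ewma = beta * ewma + (1 - beta) * grad  # EWMA: discounted (weighted) AVERAGE

# Equation 6.12: the same momentum value, written as a sum of discounted gradients
mo_compound = sum(beta ** j * g for j, g in enumerate(reversed(grads)))

print(f'momentum: {mo:.4f}  compounded: {mo_compound:.4f}  ewma: {ewma:.4f}')
# momentum ends up around 2.26 (both ways of computing it agree), while the
# EWMA stays at the gradients' own scale, around 0.23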
"What about the dampening factor?"

The dampening factor is a way to, well, dampen the effect of the latest gradient. Instead of having its full value added, the latest gradient gets its contribution to momentum reduced by the dampening factor. So, if the dampening factor is 0.3, only 70% of the latest gradient gets added to momentum. Its formula is given by:

Equation 6.13 - Momentum with dampening factor

$mo_t = \beta \, mo_{t-1} + (1 - \text{dampening}) \, grad_t$

If the dampening factor equals the momentum factor (beta), it becomes a true EWMA!

Similar to Adam, SGD with momentum keeps the value of momentum for each parameter, and the beta parameter (momentum) is stored there as well. We can take a peek at it using the optimizer's state_dict():

{'state': {139863047119488: {'momentum_buffer': tensor([[-0.0053]])},
           139863047119168: {'momentum_buffer': tensor([-0.1568])}},
 'param_groups': [{'lr': 0.1,
                   'momentum': 0.9,
                   'dampening': 0,
                   'weight_decay': 0,
                   'nesterov': False,
                   'params': [139863047119488, 139863047119168]}]}

Even though old gradients slowly fade away, contributing less and less to the sum, very recent gradients are taken almost at face value (assuming a typical value of 0.9 for beta and no dampening). This means that, given a sequence of all positive (or all negative) gradients, their sum, that is, the momentum, goes up really fast (in absolute value). A large momentum translates into a large update, since momentum replaces the gradient in the parameter update:

Equation 6.14 - Parameter update

$param_t = param_{t-1} - \eta \, mo_t$

This behavior can be easily visualized in the path taken by SGD with momentum.
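Before looking at those paths, here is a minimal sketch of how a state dictionary like the one above comes about; it is not from the book, and the nn.Linear stand-in model, the made-up data, and the single training step are assumptions purely for illustration. It also verifies the parameter update of Equation 6.14 by hand:

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(42)
model = nn.Linear(1, 1)  # tiny stand-in model, just to have parameters to update
lr = 0.1
optimizer = optim.SGD(model.parameters(), lr=lr,
                      momentum=0.9, dampening=0.0, nesterov=False)

# one dummy training step, so the momentum buffers get created
x = torch.randn(16, 1)
y = 2 * x + 1  # made-up targets
old_weight = model.weight.detach().clone()
loss = nn.MSELoss()(model(x), y)
loss.backward()
optimizer.step()

state = optimizer.state_dict()
print(state['param_groups'][0]['momentum'])  # 0.9, that is, our beta

# each parameter gets its own momentum_buffer, and that buffer replaces the
# gradient in the update: param_t = param_(t-1) - lr * mo_t (Equation 6.14);
# the first entry corresponds to the weight of our nn.Linear model
weight_buffer = list(state['state'].values())[0]['momentum_buffer']
print(torch.allclose(model.weight, old_weight - lr * weight_buffer))  # True

Setting dampening to the same value as momentum (0.9) would turn the momentum formula into a true EWMA of the gradients, as discussed above.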