Figure 6.22 - Path taken by each SGD flavor

Take the third point in the lower-left part of the black line, for instance: Its location
is quite different in each of the plots, and thus so are the corresponding gradients.
The two plots on the left are already known to us. The new plot in town is the one
to the right. The dampening of the oscillations is abundantly clear, but Nesterov’s
momentum still gets past its target and has to backtrack a little to approach it
from the opposite direction. And let me remind you that this is one of the easiest
loss surfaces of all!

Talking about losses, let’s take a peek at their trajectories.

Figure 6.23 - Losses for each SGD flavor

The plot on the left is there just for comparison; it is the same as before. The one on
the right is quite straightforward too, depicting the fact that Nesterov’s
momentum quickly found its way to a lower loss and slowly approached the
optimal value.

The plot in the middle is a bit more intriguing: Even though regular momentum
produced a path with wild swings over the loss surface (each black dot
corresponds to a mini-batch), its loss trajectory oscillates less than Adam’s does.
This is an artifact of this simple linear regression problem (namely, the bowl-shaped
loss surface) and should not be taken as representative of typical behavior.
If you’re not convinced by momentum, either regular or Nesterov, let’s add
something else to the mix…
Learning Rate Schedulers
It is also possible to schedule changes in the learning rate as training progresses,
instead of adapting the gradients. Say you’d like to reduce the learning rate by one
order of magnitude (that is, multiply it by 0.1) every T epochs, so that training
is faster at the beginning and slows down after a while to try to avoid convergence
problems.
That’s what a learning rate scheduler does: It updates the
learning rate of the optimizer.
So, it should be no surprise that one of the scheduler’s arguments is the optimizer
itself. The learning rate set for the optimizer will be the initial learning rate of the
scheduler. As an example, let’s take the simplest of the schedulers: StepLR, which
simply multiplies the learning rate by a factor gamma every step_size epochs.
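In other words, denoting the epoch number by t and the initial learning rate by lr_0 (these names are just for illustration), the learning rate produced by StepLR can be written, Python-style, as:

lr_t = lr_0 * gamma ** (t // step_size)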
In the code below, we create a dummy optimizer, which is "updating" some fake
parameter with an initial learning rate of 0.01. The dummy scheduler, an instance
of StepLR, will multiply that learning rate by 0.1 every two epochs.
from torch.optim.lr_scheduler import StepLR

dummy_optimizer = optim.SGD([nn.Parameter(torch.randn(1))], lr=0.01)
dummy_scheduler = StepLR(dummy_optimizer, step_size=2, gamma=0.1)
The scheduler has a step() method just like the optimizer.
You should call the scheduler’s step() method after calling the
optimizer’s step() method.
Inside the training loop, it will look like this:
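Here is a minimal sketch of how such a loop might look, reusing the dummy optimizer and scheduler from above (the forward and backward passes are omitted, and the four epochs are just for illustration):

for epoch in range(4):
    # the learning rate used in this epoch
    print(dummy_scheduler.get_last_lr())

    # the usual mini-batch loop (forward pass, loss,
    # backward pass) would go here

    # first, the optimizer's step...
    dummy_optimizer.step()
    # ...and only then, the scheduler's step
    dummy_scheduler.step()

Since step_size is two and gamma is 0.1, the printed learning rate should stay at 0.01 for the first two epochs and drop to 0.001 for the last two.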