LRFinder

The function we've implemented above is fairly basic. For an implementation with more bells and whistles, check out this Python package: torch_lr_finder.[102] I am illustrating its usage here, since it is quite similar to what we've done above, but please refer to the documentation for more details.

!pip install --quiet torch-lr-finder
from torch_lr_finder import LRFinder

Instead of calling a function directly, we need to create an instance of LRFinder first, using the typical model configuration objects (model, optimizer, loss function, and the device). Then, we can take the range_test() method for a spin, providing familiar arguments to it: a data loader, the upper limit of the learning rate range, and the number of iterations. The reset() method restores the original states of both model and optimizer.

torch.manual_seed(11)
new_model = CNN2(n_feature=5, p=0.3)
multi_loss_fn = nn.CrossEntropyLoss(reduction='mean')
new_optimizer = optim.Adam(new_model.parameters(), lr=3e-4)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
lr_finder = LRFinder(
    new_model, new_optimizer, multi_loss_fn, device=device
)
# runs the test for 100 iterations, increasing the learning rate up to 1e-1
lr_finder.range_test(train_loader, end_lr=1e-1, num_iter=100)
lr_finder.plot(log_lr=True)  # plots loss vs. learning rate (log scale)
lr_finder.reset()  # restores model and optimizer to their original states

Not quite a "U" shape, but we can still tell that something in the ballpark of 1e-2 is a good starting point.

Adaptive Learning Rate

That's what the Adam optimizer is actually doing for us: it starts with the learning rate provided as an argument, but it adapts the learning rate(s) as it goes, tweaking the rate in a different way for each parameter in the model. Or does it?

Truth be told, Adam does not adapt the learning rate; it really adapts the gradients. But, since the parameter update is given by the multiplication of both terms, the learning rate and the gradient, this is a distinction without a difference.

Adam combines the characteristics of two other optimizers: SGD (with momentum) and RMSProp. Like the former, it uses a moving average of gradients instead of the gradients themselves (that's the first moment, in statistics jargon); like the latter, it scales the gradients using a moving average of squared gradients (that's the second moment, or uncentered variance, in statistics jargon).
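To make that concrete, here is a minimal sketch of an Adam-style parameter update. The function name and signature are hypothetical (this is an illustration, not PyTorch's actual implementation), and it leaves out details such as bias correction:

import torch

def adam_style_update(param, grad, m, v,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # hypothetical helper, for illustration only
    # first moment: a moving average of the gradients (the momentum-like part)
    m = beta1 * m + (1 - beta1) * grad
    # second moment: a moving average of the squared gradients
    # (the RMSProp-like part)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # the gradient is adapted (the first moment scaled by the second)
    # before being multiplied by the learning rate
    param = param - lr * m / (v.sqrt() + eps)
    return param, m, v

In a real training loop, m and v would be carried over from one step to the next, one pair per parameter; that per-parameter state is what makes the adaptation different for each parameter in the model.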
But this is not a simple average. It is a moving average. And it is not just any moving average. It is an exponentially weighted moving average (EWMA).

Before diving into EWMAs, though, we need to briefly go over simple moving averages.

Moving Average (MA)

To compute the moving average of a given feature x over a certain number of periods, we just have to average the values observed over that many time steps (from the initial value, observed periods-1 steps ago, all the way up to the current value):

$\text{MA}_t = \frac{1}{\text{periods}} \sum_{i=0}^{\text{periods}-1}{x_{t-i}}$

Equation 6.1 - Simple moving average

But, instead of averaging the values themselves, let's compute the average age of the values. The current value has an age equal to one unit of time, while the oldest value in our moving average has an age equal to periods units of time.
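Here is a quick sketch (with made-up numbers) of Equation 6.1 in code, together with the average age of the values used in it:

import numpy as np

# hypothetical feature values, ordered from oldest to newest
x = np.array([10., 11., 12., 13., 14.])
periods = 5

# Equation 6.1: the simple moving average is the mean
# of the last `periods` values
ma = x[-periods:].mean()  # 12.0

# ages of the values: the current one has age 1,
# the oldest has age `periods`
ages = np.arange(1, periods + 1)
avg_age = ages.mean()  # (1 + periods) / 2 = 3.0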