Model Training
We have everything set to train the "top" layer of our modified version of AlexNet:
Model Training
1 sbs_alex = StepByStep(alex, multi_loss_fn, optimizer_alex)
2 sbs_alex.set_loaders(train_loader, val_loader)
3 sbs_alex.train(1)
You probably noticed it took several seconds (and a lot more if you’re running on a
CPU) to run the code above, even though it is training for one single epoch.
"How come? Most of the model is frozen; there is only one measly
layer to train…"
You’re right, there is only one measly layer to compute gradients for and to update
parameters for, but the forward pass still uses the whole model. So, every single
image (out of 2,520 in our training set) will have its features computed using more
than 61 million parameters! No wonder it is taking some time! By the way, only
12,291 parameters are trainable.
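In case you want to check those numbers yourself, a quick sketch like the one below (using the same alex model we just modified) adds up total and trainable parameters:
# total counts every parameter; trainable keeps only those with requires_grad=True
total_params = sum(p.numel() for p in alex.parameters())
trainable_params = sum(p.numel() for p in alex.parameters() if p.requires_grad)
print(total_params, trainable_params)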
If you’re thinking "there must be a better way…," you’re absolutely right—that’s the
topic of the next section.
But, first, let’s see how effective transfer learning is by evaluating our model after
having trained it over one epoch only:
StepByStep.loader_apply(val_loader, sbs_alex.correct)
Output
tensor([[111, 124],
[124, 124],
[124, 124]])
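Each row above holds the number of correct predictions and the number of data points for one class; assuming that layout, a quick sketch to turn the counts into an overall accuracy could look like this:
results = StepByStep.loader_apply(val_loader, sbs_alex.correct)
# sum over the classes: total correct predictions and total data points
correct, total = results.sum(dim=0)
print((correct / total).item())  # roughly 0.9651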
That’s 96.51% accuracy in the validation set (it is 99.33% for the training set, in
case you’re wondering). Even if it is taking some time to train, these results are
pretty good!
Generating a Dataset of Features
We've just realized that most of the time it takes to train the last layer of our model over one single epoch was spent in the forward pass. Now, imagine if we wanted to train it over ten epochs: Not only would the model spend most of its time performing the forward pass, but, even worse, it would perform the same operations ten times over.
Since all layers but the last are frozen, the output of the second-to-last layer is always the same.
That's assuming you're not doing data augmentation, of course.
That's a huge waste of your time, energy, and money (if you're paying for cloud computing).
"What can we do about it?"
Well, since the frozen layers are simply generating features that will be the input of the trainable layers, why not treat the frozen layers as such? We could do it in four easy steps:
• Keep only the frozen layers in the model.
• Run the whole dataset through it and collect its outputs as a dataset of features.
• Train a separate model (that corresponds to the "top" of the original model) using the dataset of features.
• Attach the trained model to the top of the frozen layers.
This way, we're effectively splitting the feature extraction and actual training phases, thus avoiding the overhead of generating features over and over again for every single forward pass.
To keep only the frozen layers, we need to get rid of the "top" of the original model. But, since we also want to attach our new layer to the whole model after training, it is a better idea to simply replace the "top" layer with an identity layer instead of removing it entirely:
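A minimal sketch of the idea (assuming our modified AlexNet, whose new "top" layer sits at alex.classifier[6], and the train_loader from data preparation; the variable names are illustrative only) could look like this:
import torch
import torch.nn as nn

# Replace the trainable "top" layer with an identity layer, so the model
# outputs the features produced by the frozen layers instead of logits
alex.classifier[6] = nn.Identity()

# Run the whole dataset through the frozen model once, collecting features
device = next(alex.parameters()).device
features, labels = [], []
alex.eval()
with torch.no_grad():
    for x, y in train_loader:
        features.append(alex(x.to(device)).cpu())
        labels.append(y)

# The dataset of features can then be used to train the "top" layer on its own
train_ds_features = torch.utils.data.TensorDataset(
    torch.cat(features), torch.cat(labels)
)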