Replacing the "Top" of the Model

alex.classifier[6] = nn.Linear(4096, 3)

The following diagram may help you visualize what's happening.

Figure 7.2 - AlexNet
Source: Generated using Alexander Lenail's NN-SVG [124] and adapted by the author.

Notice that the number of input features remains the same, since it still takes the
output from the hidden layer that precedes it. The new output layer requires
gradients by default, but we can double-check it:

for name, param in alex.named_parameters():
    if param.requires_grad == True:
        print(name)

Output

classifier.6.weight
classifier.6.bias

Great, the only layer that will be learning anything is our brand new output layer
(classifier.6), the "top" of the model.

"What about unfreezing some of the hidden layers?"
That's also a possibility; in this case, it is like resuming training for the hidden
layers, while learning from scratch for the output layer. You'd probably need more
data to pull this off, though.
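As a rough sketch (not part of the book's pipeline), assuming the alex model from the
snippets above, with everything frozen and classifier.6 already replaced, unfreezing
the last hidden layer of the classifier could look like this (in torchvision's AlexNet,
classifier[4] is the last hidden Linear layer):

# Assuming "alex" is the frozen, pre-trained AlexNet with classifier[6]
# already replaced, as in the snippets above
for param in alex.classifier[4].parameters():
    param.requires_grad = True

# Both the unfrozen hidden layer and the new output layer will now learn
print([name for name, param in alex.named_parameters()
       if param.requires_grad])
# ['classifier.4.weight', 'classifier.4.bias',
#  'classifier.6.weight', 'classifier.6.bias']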
"Could I have changed the whole classifier instead of just the output
layer?"
Sure thing! It would be possible to have a different architecture for the classifier
part, as long as it takes the 9,216 input features produced by the first part of
AlexNet, and outputs as many logits as necessary for the task at hand. In this case,
the whole classifier would be learning from scratch, and you’d need even more
data to pull it off.
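For illustration only, here is a minimal sketch of what a replacement classifier could
look like; the hidden size of 512 and the dropout probability are arbitrary choices,
not the book's configuration:

import torch.nn as nn

# Assuming "alex" has its convolutional part ("features") frozen already.
# The new classifier must take the 9,216 features (256 channels x 6 x 6)
# produced by the first part of AlexNet and output three logits.
alex.classifier = nn.Sequential(
    nn.Linear(9216, 512),  # arbitrary hidden size, for illustration
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(512, 3),     # three logits, one for each class
)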
The more layers you unfreeze or replace, the more data you’ll
need to fine-tune the model.
We’re sticking with the simplest approach here; that is, replacing the output layer
only.
Technically speaking, we’re only fine-tuning a model if we do not
freeze pre-trained weights; that is, the whole model will be
(slightly) updated. Since we are freezing everything but the last
layer, we are actually using the pre-trained model for feature
extraction only.
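To make the distinction concrete, here is a hedged sketch of what actual fine-tuning
could look like; the variable name alex_ft and the learning rate are illustrative
choices (newer torchvision versions use the weights argument instead of pretrained):

import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Fine-tuning: load the pre-trained weights but do NOT freeze anything,
# so the whole model gets (slightly) updated during training
alex_ft = models.alexnet(pretrained=True)
alex_ft.classifier[6] = nn.Linear(4096, 3)

# A small learning rate (illustrative value) helps preserve the
# pre-trained weights during the first updates
optimizer = optim.Adam(alex_ft.parameters(), lr=1e-4)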
"What if I use a different model? Which layer should I replace then?"
The table below covers some of the most common models you may use for transfer
learning. It lists the expected size of the input images, the classifier layer to be
replaced, and the appropriate replacement, given the number of classes for the
task at hand (three in our case):
Model        Size  Classifier Layer(s)   Replacement Layer(s)
AlexNet      224   model.classifier[6]   nn.Linear(4096, num_classes)
VGG          224   model.classifier[6]   nn.Linear(4096, num_classes)
InceptionV3  299   model.fc              nn.Linear(2048, num_classes)
                   model.AuxLogits.fc    nn.Linear(768, num_classes)
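For instance, since InceptionV3 has an auxiliary classifier, both layers listed in the
table would have to be replaced. A minimal sketch, assuming three classes and feature
extraction (that is, freezing everything else):

import torch.nn as nn
from torchvision import models

num_classes = 3

# InceptionV3 expects 299x299 inputs and has an auxiliary classifier
model = models.inception_v3(pretrained=True)

# Freeze the pre-trained weights (feature extraction)
for param in model.parameters():
    param.requires_grad = False

# Replace both classifier layers listed in the table
model.fc = nn.Linear(2048, num_classes)
model.AuxLogits.fc = nn.Linear(768, num_classes)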