
Adam

So, choosing the Adam optimizer is an easy and straightforward way to tackle your learning rate needs. Let’s take a closer look at PyTorch’s Adam optimizer and its arguments:

• params: the model’s parameters
• lr: the learning rate, default value 1e-3
• betas: a tuple containing beta1 and beta2 for the EWMAs
• eps: the epsilon value (1e-8) added to the denominator

The four arguments above should be clear by now. But there are two others we haven’t talked about yet:

• weight_decay: L2 penalty
• amsgrad: whether the AMSGrad variant should be used
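
To make the argument list concrete, here is a minimal sketch (not from the original text) of an Adam optimizer with every argument spelled out at its PyTorch default value; the tiny linear model is just a placeholder:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)  # placeholder model, only for illustration

# Adam with all six arguments set to their default values
optimizer = optim.Adam(
    model.parameters(),
    lr=1e-3,             # learning rate
    betas=(0.9, 0.999),  # beta1 and beta2 for the two EWMAs
    eps=1e-8,            # epsilon added to the denominator
    weight_decay=0,      # L2 penalty (disabled by default)
    amsgrad=False,       # whether to use the AMSGrad variant
)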

The first of the two new arguments, weight_decay, introduces a regularization term (L2 penalty) to the model’s weights. As with every regularization procedure, it aims to prevent overfitting by penalizing weights with large values. The term weight decay comes from the fact that the regularization actually increases the gradients by adding the weight value multiplied by the weight decay argument.

"If it increases the gradients, how come it is called weight decay?"

In the parameter update, the gradient is multiplied by the learning rate and subtracted from the weight’s previous value. So, in effect, adding a penalty to the value of the gradients makes the weights smaller. The smaller the weights, the smaller the penalty, thus making further reductions even smaller; in other words, the weights are decaying.
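
To see the decaying effect in numbers, here is a toy sketch (not from the book) that applies a plain gradient descent update, with the loss gradient set to zero so that only the penalty term acts on the weight; Adam’s adaptive scaling is left out to keep the arithmetic visible:

# Plain gradient descent with weight decay, loss gradient assumed zero
w, lr, wd = 1.0, 0.1, 0.5
for step in range(3):
    grad = 0.0            # pretend the gradient of the loss is zero
    grad = grad + wd * w  # weight decay increases the gradient...
    w = w - lr * grad     # ...but the update shrinks the weight
    print(f'step {step}: w = {w:.4f}')
# prints w = 0.9500, then 0.9025, then 0.8574: the weight is decaying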

The second argument, amsgrad, makes the optimizer compatible with a variant of the same name. In a nutshell, it modifies the formula used to compute adapted gradients, ditching the bias correction and using the peak value of the EWMA of squared gradients instead.
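
A simplified, scalar sketch of that modification, following the description above rather than PyTorch’s internal implementation, could look like this (the function name and state dictionary are made up for illustration):

# Simplified AMSGrad-style adapted gradient for a single parameter
def adapted_gradient(grad, state, beta1=0.9, beta2=0.999, eps=1e-8):
    # EWMAs of gradients and of squared gradients
    state['m'] = beta1 * state['m'] + (1 - beta1) * grad
    state['v'] = beta2 * state['v'] + (1 - beta2) * grad ** 2
    # AMSGrad: keep the peak value of the EWMA of squared gradients...
    state['v_max'] = max(state['v_max'], state['v'])
    # ...and use it in the denominator, without bias correction
    return state['m'] / (state['v_max'] ** 0.5 + eps)

state = {'m': 0.0, 'v': 0.0, 'v_max': 0.0}
print(adapted_gradient(2.0, state))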

For now, we’re sticking with the first four arguments, which are already well known to us.
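
A minimal sketch of such a call, using only those four arguments, might look like the one below (the model and the learning rate value are placeholders, not the book’s):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)  # placeholder model, only for illustration
optimizer = optim.Adam(model.parameters(), lr=3e-4,
                       betas=(0.9, 0.999), eps=1e-8)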
