
• First, and most important, PyTorch implements norm-last "sub-layer" wrappers, normalizing the output of each "sub-layer."

Figure 10.17 - "Sub-Layer"—norm-last vs norm-first

• It does not implement positional encoding, the final linear layer, or the projection layer, so we have to handle those ourselves (a minimal sketch of these missing pieces follows this list).
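Here is a minimal sketch of what "handling those ourselves" could look like. The class name WrappedTransformer is hypothetical (it is not the book's implementation), and it assumes the wrapped nn.Transformer was created with batch_first=True (available in more recent PyTorch versions): the wrapper projects the inputs, adds a fixed sinusoidal positional encoding, delegates the encoder and decoder work to nn.Transformer, and applies a final linear layer.

import math
import torch
import torch.nn as nn

class WrappedTransformer(nn.Module):
    # hypothetical wrapper: adds the pieces nn.Transformer leaves out
    def __init__(self, transf, input_dim, d_model, n_outputs, max_len=100):
        super().__init__()
        self.transf = transf                          # an nn.Transformer built with batch_first=True
        self.proj = nn.Linear(input_dim, d_model)     # projection layer
        self.linear = nn.Linear(d_model, n_outputs)   # final linear layer
        # fixed sinusoidal positional encoding, stored as a buffer
        pos = torch.arange(max_len).float().unsqueeze(1)
        freq = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * freq)
        pe[:, 1::2] = torch.cos(pos * freq)
        self.register_buffer('pe', pe.unsqueeze(0))   # shape: (1, max_len, d_model)

    def forward(self, source, target, src_mask=None, tgt_mask=None):
        # project inputs to d_model dimensions and add positional encoding
        src = self.proj(source) + self.pe[:, :source.size(1), :]
        tgt = self.proj(target) + self.pe[:, :target.size(1), :]
        # nn.Transformer runs both the encoder and the decoder
        out = self.transf(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
        return self.linear(out)                       # final linear layer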

Let’s take a look at its constructor and forward() methods. The constructor expects many arguments because PyTorch’s Transformer actually builds both encoder and decoder by itself; a minimal instantiation using these arguments follows the list:

• d_model: the number of (projected) features, that is, the dimensionality of the model (remember, this number will be split among the attention heads, so it must be a multiple of the number of heads; its default value is 512)

• nhead: the number of attention heads in each attention mechanism (default is eight, so each attention head gets 64 out of the 512 dimensions)

• num_encoder_layers: the number of "layers" in the encoder (the Transformer uses six layers by default)

• num_decoder_layers: the number of "layers" in the decoder (the Transformer uses six layers by default)

• dim_feedforward: the number of units in the hidden layer of the feed-forward network (default is 2048)

• dropout: the probability of dropping out inputs (default is 0.1)

• activation: the activation function to be used in the feed-forward network (ReLU by default)
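The snippet below is a minimal sketch of calling the constructor with these values spelled out; the argument names are nn.Transformer's actual parameters, the values mirror the defaults listed above, and the variable name transf is only illustrative:

import torch.nn as nn

# builds the full encoder-decoder stack using the defaults described above
transf = nn.Transformer(
    d_model=512,            # dimensionality of the model, split among the heads
    nhead=8,                # 8 heads, so each one gets 64 out of 512 dimensions
    num_encoder_layers=6,   # six "layers" in the encoder
    num_decoder_layers=6,   # six "layers" in the decoder
    dim_feedforward=2048,   # hidden units in each feed-forward network
    dropout=0.1,            # dropout probability
    activation='relu',      # activation used in the feed-forward networks
)

Remember, though, that the inputs still need to be projected to d_model dimensions and have positional encoding added before reaching this model, since nn.Transformer does not handle those steps itself.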

