Deep Learning with PyTorch Step-by-Step

- Pages 2-3: Title page
- Pages 4-5: Epigraph: "What I cannot create, I do not understand." (Richard Feynman)
- Pages 6-23: Table of contents (entries include Train-Validation-Test Split; Random Split; The Precision Quirk; A REAL Filter; Jupyter Notebook; Stacked RNN; Attention Mechanism; Positional Encoding; Model Configuration & Training)
- Pages 24-25: Acknowledgements
- Pages 26-31: Frequently Asked Questions (FAQ), starting with "Why PyTorch?"
- Pages 32-45: Setup guide: choosing and cloning the repository; installing Anaconda, PyTorch (conda install), torchviz (pip install), and Jupyter
- Pages 46-47: Part I, Fundamentals
- Pages 48-85: Gradient descent fundamentals: random initialization; batch, mini-batch, and stochastic gradient descent; computing gradients for each parameter using the chain rule; the impact of the learning rate (including a very high one); zero-mean and unit-standard-deviation feature scaling
- Pages 86-147: Chapter 1: splitting synthetic data; implementing gradient descent; tensors, CUDA devices, and Numpy interoperability; autograd (requires_grad); optimizers (SGD, zero_grad()); PyTorch loss functions; building a proper (yet simple) model class (the __init__() method and its layers); the training steps; recap
- Pages 148-199: Chapter 2, "Rethinking the Training Loop": higher-order functions; datasets and data loaders; evaluation; TensorBoard (loading the extension, scalars, the model's graph); saving and loading models
- Pages 200-231: Chapter 2.1, "Going Classy": building the StepByStep class (train-step builders, setattr, plotting losses, making predictions)
- Pages 232-289: Chapter 3, a simple classification problem: standardizing features; linear and logistic regression; odds ratios and the sigmoid function; binary cross-entropy and nn.BCEWithLogitsLoss(); decision boundaries and separability; true and false positives and negatives; TPR and FPR; precision-recall trade-offs across thresholds
- Pages 290-355: Chapter 4, "Classifying Images": data generation; images and channels; tensor-based transformations and data augmentation (Compose, RandomHorizontalFlip); class imbalance; deeper models and the equivalence of deep and shallow linear models; weights as pixels; activation functions (sigmoid, tanh, ReLU)
- Pages 356-369: Bonus Chapter, "Feature Space": affine transformations; how activation functions make data points separable
- Pages 370-441: Chapter 5, convolutions: filters (identity, edge), striding, padding, and pooling; the LeNet-5 architecture; softmax and cross-entropy loss; static methods; hooks (registering and removing); visualizing filters and feature maps
- Pages 442-523: Chapter 6, training tricks: dropout; Adam (adaptive moment estimation); finding a learning rate (LRFinder); exponentially weighted moving averages (EWMA) and bias correction; SGD with momentum and Nesterov momentum ("looking ahead"); epoch and mini-batch learning rate schedulers in StepByStep
- Pages 524-585: Chapter 7, transfer learning in practice: ILSVRC-2012; AlexNet and ResNet; replacing (or "removing") the top of a pre-trained model; 1x1 convolutions; batch normalization and its running statistics; residual (shortcut) connections; freezing the layers of the model
- Pages 586-611: Extra Chapter, "Vanishing and Exploding Gradients": weight initialization; the effect of batch normalization; gradient clipping (by value and by norm, including clipping with hooks)
- Pages 612-711: Chapter 8, "Sequences": RNN cells and hidden states; stacked and bidirectional RNNs; GRU gates; LSTM gates and the candidate hidden state; variable-length and packed sequences; 1D convolutions and dilated filters
- Pages 712-821: Chapter 9, sequence-to-sequence: the encoder-decoder architecture; the attention mechanism (queries, keys, and values; scaled dot products; softmax-ed alphas); wide vs. narrow attention; self-attention; masked attention in the decoder; positional encoding
- Pages 822-903: Chapter 10, Transformers: chunking; multi-headed attention in detail; stacking encoders and decoders; layer normalization vs. batch normalization; PyTorch's Transformer, encoder, and decoder classes; the Vision Transformer (patch embeddings, Einops); putting it all together
- Pages 904-905: Part IV, Natural Language Processing
- Pages 906-993: Chapter 11, "Down the Yellow Brick Rabbit Hole": sentence tokenization (NLTK's punkt, spaCy); vocabularies and data augmentation; HuggingFace tokenizers (BertTokenizer, attention masks); word embeddings (continuous bag-of-words, GloVe and embedding arithmetic); contextual word embeddings (ELMo via flair); BERT (word pieces, position ids, masked language modeling, next sentence prediction); fine-tuning with the HuggingFace Trainer; DistilBERT; GPT-2 and language-model fine-tuning