Figure 6.20 - Paths taken by SGD (with and without momentum)

Like the Adam optimizer, SGD with momentum moves faster and overshoots. But it does seem to get carried away with it, so much so that it goes past the target and has to backtrack to approach it from a different direction.

The analogy for the momentum update is that of a ball rolling down a hill: It picks up so much speed that it ends up climbing the opposite side of the valley, only to roll back down again with a little bit less speed, doing this back and forth over and over again until eventually reaching the bottom.

"Isn’t Adam better than this already?"

Yes and no. Adam indeed converges more quickly to a minimum, but not necessarily a good one. In a simple linear regression, there is a global minimum corresponding to the optimal value of the parameters. This is not the case in deep learning models: There are many minima (plural of minimum), and some are better than others (corresponding to lower losses). So, Adam will find one of these minima and move there fast, perhaps overlooking better alternatives in the neighborhood.

Momentum may seem a bit sloppy at first, but it may be combined with a learning rate scheduler (more on that shortly!) to better explore the loss surface in hopes of finding a better-quality minimum than Adam does.

Both alternatives, Adam and SGD with momentum (especially when combined with a learning rate scheduler), are commonly used.
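In PyTorch, both alternatives are one-liners from torch.optim. A minimal sketch; the toy model and the learning rate of 0.1 are arbitrary placeholders, just for illustration:

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 1))  # toy model, for illustration only

# Adam: adaptive learning rates, typically fast to converge
adam = optim.Adam(model.parameters(), lr=0.1)

# SGD with momentum: a momentum (beta) of 0.9 is a common choice
sgd_mom = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)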
But, if a ball running downhill seems a bit too out of control for your taste, maybe
you’ll like its variant better…
Nesterov
The Nesterov accelerated gradient (NAG), or Nesterov for short, is a clever variant
of SGD with momentum. Let’s say we’re computing momentum for two
consecutive steps (t and t+1):
Equation 6.15 - Nesterov momentum

$$\mathrm{mom}_t = \beta \, \mathrm{mom}_{t-1} + \mathrm{grad}_t \qquad \mathrm{mom}_{t+1} = \beta \, \mathrm{mom}_t + \mathrm{grad}_{t+1}$$
In the current step (t), we use the current gradient (t) and the momentum from the
previous step (t-1) to compute the current momentum. So far, nothing new.
In the next step (t+1), we’ll use the next gradient (t+1) and the momentum we’ve
just computed for the current step (t) to compute the next momentum. Again,
nothing new.
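To make the recurrence concrete, here is a minimal sketch in plain Python; beta and the gradient values below are made up purely for illustration:

beta = 0.9  # a typical momentum value

grad_t, grad_t_plus_1 = 1.0, 0.8  # made-up gradients at steps t and t+1
mom_t_minus_1 = 0.5               # momentum carried over from step t-1

# current step (t): current gradient + beta times previous momentum
mom_t = beta * mom_t_minus_1 + grad_t          # 1.45

# next step (t+1): next gradient + beta times current momentum
mom_t_plus_1 = beta * mom_t + grad_t_plus_1    # 2.105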
What if I ask you to compute momentum one step ahead?
"Can you tell me momentum at step t+1 while you’re still at step t?"
"Of course I can’t, I do not know the gradient at step t+1!" you say, puzzled at my
bizarre question. Fair enough. So I ask you yet another question:
"What’s your best guess for the gradient at step t+1?"
I hope you answered, "The gradient at step t." If you do not know the future value of
a variable, the naive estimate is its current value. So, let’s go with it, Nesterov-style!
In a nutshell, what NAG does is try to compute momentum one step ahead since it
is only missing one value and has a good (naive) guess for it. It is as if it were
computing momentum between two steps:
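In other words, it plugs the naive guess (the gradient at step t) in place of the unknown gradient at step t+1, producing a look-ahead momentum halfway between the two steps:

$$\mathrm{mom}_{\text{look-ahead}} = \beta \, \mathrm{mom}_t + \mathrm{grad}_t$$

In PyTorch, Nesterov momentum is just a flag on the same SGD optimizer. A minimal sketch, reusing the placeholder model and learning rate from before:

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 1))  # toy model, for illustration only

# nesterov=True requires a non-zero momentum (and zero dampening)
optimizer = optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, nesterov=True)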