Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)
Frequently Asked Questions (FAQ)

Why PyTorch?

First, coding in PyTorch is fun :-) Really, there is something to it that makes it very enjoyable to write code in. Some say it is because it is very pythonic, or maybe there is something else, who knows? I hope that, by the end of this book, you feel like that too!

Second, maybe there are even some unexpected benefits to your health: check Andrej Karpathy's tweet [3] about it!

Jokes aside, PyTorch is the fastest-growing [4] framework for developing deep learning models, and it has a huge ecosystem. [5] That is, there are many tools and libraries developed on top of PyTorch. It is already the preferred framework [6] in academia and is making its way into industry.

Several companies are already powered by PyTorch; [7] to name a few:

• Facebook: The company is the original developer of PyTorch, released in October 2016.
• Tesla: Watch Andrej Karpathy (AI director at Tesla) speak about "how Tesla is using PyTorch to develop full self-driving capabilities for its vehicles" in this video. [8]
• OpenAI: In January 2020, OpenAI decided to standardize its deep learning framework on PyTorch (source [9]).
• fastai: fastai is a library [10] built on top of PyTorch to simplify model training and is used in its "Practical Deep Learning for Coders" [11] course. The fastai library is deeply connected to PyTorch and "you can't become really proficient at using fastai if you don't know PyTorch well, too." [12]
• Uber: The company is a significant contributor to PyTorch's ecosystem, having developed libraries like Pyro [13] (probabilistic programming) and Horovod [14] (a distributed training framework).
• Airbnb: PyTorch sits at the core of the company's dialog assistant for customer service (source [15]).

This book aims to get you started with PyTorch while giving you a solid understanding of how it works.
Why This Book?
If you’re looking for a book where you can learn about deep learning and PyTorch
without having to spend hours deciphering cryptic text and code, and one that’s
easy and enjoyable to read, this is it :-)
The book covers everything from the basics of gradient descent all the way up to
fine-tuning large NLP models (BERT and GPT-2) using HuggingFace. It is divided
into four parts:
• Part I: Fundamentals (gradient descent, training linear and logistic regressions
in PyTorch)
• Part II: Computer Vision (deeper models and activation functions,
convolutions, transfer learning, initialization schemes)
• Part III: Sequences (RNN, GRU, LSTM, seq2seq models, attention, self-attention,
Transformers)
• Part IV: Natural Language Processing (tokenization, embeddings, contextual
word embeddings, ELMo, BERT, GPT-2)
This is not a typical book: most tutorials start with some nice and pretty image
classification problem to illustrate how to use PyTorch. It may seem cool, but I
believe it distracts you from the main goal: learning how PyTorch works. In this
book, I present a structured, incremental, and from-first-principles approach to
learning PyTorch.
Moreover, this is not a formal book in any way: I am writing this book as if I were
having a conversation with you, the reader. I will ask you questions (and give you
answers shortly afterward), and I will also make (silly) jokes.
My job here is to make you understand the topic, so I will avoid fancy
mathematical notation as much as possible and spell it out in plain English.
In this book, I will guide you through the development of many models in PyTorch,
showing you why PyTorch makes it much easier and more intuitive to build models
in Python: autograd, dynamic computation graph, model classes, and much, much
more.
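To give you a tiny taste of what is ahead, here is a minimal sketch of autograd and the dynamic computation graph in action (assuming you have PyTorch installed; we will cover all of this properly in Part I):

```python
import torch

# A tensor that tracks every operation performed on it
w = torch.tensor(2.0, requires_grad=True)

# The computation graph is built dynamically, as the code runs:
# y = w^2 + 3w
y = w ** 2 + 3 * w

# Autograd walks the graph backward and computes dy/dw = 2w + 3
y.backward()

print(w.grad)  # tensor(7.) since 2 * 2.0 + 3 = 7
```

No sessions, no static graph compilation: you write plain Python, and PyTorch differentiates it for you.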
We will build, step-by-step, not only the models themselves but also your
understanding as I show you both the reasoning behind the code and how to avoid
some common pitfalls and errors along the way.