Let's take a look at the RNN's arguments:

• input_size: It is the number of features in each data point of the sequence.
• hidden_size: It is the number of hidden dimensions you want to use.
• bias: Just like any other layer, it includes the bias in the equations.
• nonlinearity: By default, it uses the hyperbolic tangent, but you can change it to ReLU if you want.

The four arguments above are exactly the same as those in the RNN cell. So, we can easily create a full-fledged RNN like that:

n_features = 2
hidden_dim = 2

torch.manual_seed(19)
rnn = nn.RNN(input_size=n_features, hidden_size=hidden_dim)
rnn.state_dict()

Output

OrderedDict([('weight_ih_l0', tensor([[ 0.6627, -0.4245],
                                      [ 0.5373,  0.2294]])),
             ('weight_hh_l0', tensor([[-0.4015, -0.5385],
                                      [-0.1956, -0.6835]])),
             ('bias_ih_l0', tensor([0.4954, 0.6533])),
             ('bias_hh_l0', tensor([-0.3565, -0.2904]))])

Since the seed is exactly the same, you'll notice that the weights and biases have exactly the same values as our former RNN cell. The only difference is in the parameters' names: Now they all have an _l0 suffix to indicate they belong to the first "layer."

"What do you mean by layer? Isn't the RNN itself a layer?"

Yes, the RNN itself can be a layer in our model. But it may have its own internal layers! You can configure those with the following four extra arguments:

• num_layers: The RNN we've been using so far has one layer (the default value), but if you use more than one, you'll be creating a stacked RNN, which we'll see in its own section.
• bidirectional: So far, our RNNs have been handling sequences in the left-to-right direction (the default), but if you set this argument to True, you'll be creating a bidirectional RNN, which we'll also see in its own section.
• dropout: This introduces an RNN's own dropout layer between its internal layers, so it only makes sense if you're using a stacked RNN.

And I saved the best (actually, the worst) for last:

• batch_first: The documentation says, "if True, then the input and output tensors are provided as (batch, seq, feature)," which makes you think that you only need to set it to True and it will turn everything into your nice and familiar tensors, where different batches are concatenated together as the first dimension. You'd be sorely mistaken.

"Why? What's wrong with that?"

The problem is, you need to read the documentation very literally: Only the input and output tensors are going to be batch-first; the hidden state will never be batch-first. This behavior may bring complications you need to be aware of.
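Before we look at the shapes in detail, here is a minimal sketch of the quirk (the variable names and batch dimensions below are made up for illustration): we create an RNN with batch_first=True, feed it a batch shaped (N, L, F), and check both returned shapes.

import torch
import torch.nn as nn

torch.manual_seed(19)
rnn_batch_first = nn.RNN(input_size=n_features,
                         hidden_size=hidden_dim,
                         batch_first=True)

# A dummy batch of three sequences, four data points each: (N=3, L=4, F=2)
dummy_batch = torch.randn(3, 4, n_features)

out, hidden = rnn_batch_first(dummy_batch)
print(out.shape, hidden.shape)

Output

torch.Size([3, 4, 2]) torch.Size([1, 3, 2])

The output does follow the batch-first (N, L, H) layout, but the hidden state is still (1, N, H), with the batch size in its second dimension, no matter what.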
Shapes

Before going through an example, let's take a look at the expected inputs and outputs of our RNN:

• Inputs:
  ◦ The input tensor containing the sequence you want to run through the RNN:
    ▪ The default shape is sequence-first; that is, (sequence length, batch size, number of features), which we're abbreviating to (L, N, F).
    ▪ But if you choose batch_first, it will flip the first two dimensions, and then it will expect an (N, L, F) shape, which is what you're likely getting from a data loader (see the sketch after this list).
    ▪ By the way, the input can also be a packed sequence; we'll get back to that in a later section.
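To make those two layouts concrete, here is a minimal sketch reusing the rnn we created above (the tensor names and dimensions are made up for illustration): a sequence-first tensor goes straight into the default RNN, while a data-loader-style batch needs its first two dimensions flipped first.

# Sequence-first input, the default: L=4 steps, N=3 sequences, F=2 features
seq_first = torch.randn(4, 3, n_features)
out, hidden = rnn(seq_first)
print(out.shape)   # (L, N, H) -> torch.Size([4, 3, 2])

# A data loader would likely hand us (N, L, F) instead; with the default
# RNN, we have to flip the first two dimensions back ourselves
loader_batch = torch.randn(3, 4, n_features)
out, hidden = rnn(loader_batch.permute(1, 0, 2))  # back to (L, N, F)
print(out.shape)   # (L, N, H) -> torch.Size([4, 3, 2])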