In code, we can use split() to get tensors for each of the components:

Wxr, Wxz, Wxn = Wx.split(hidden_dim, dim=0)
bxr, bxz, bxn = bx.split(hidden_dim, dim=0)
Whr, Whz, Whn = Wh.split(hidden_dim, dim=0)
bhr, bhz, bhn = bh.split(hidden_dim, dim=0)

Wxr, bxr

Output

(tensor([[-0.0930, 0.0497],
         [ 0.4670, -0.5319]]),
 tensor([-0.4316, 0.4019]))

Next, let’s use the weights and biases to create the corresponding linear layers:

def linear_layers(Wx, bx, Wh, bh):
    hidden_dim, n_features = Wx.size()
    lin_input = nn.Linear(n_features, hidden_dim)
    lin_input.load_state_dict({'weight': Wx, 'bias': bx})
    lin_hidden = nn.Linear(hidden_dim, hidden_dim)
    lin_hidden.load_state_dict({'weight': Wh, 'bias': bh})
    return lin_hidden, lin_input

# reset gate - red
r_hidden, r_input = linear_layers(Wxr, bxr, Whr, bhr)
# update gate - blue
z_hidden, z_input = linear_layers(Wxz, bxz, Whz, bhz)
# candidate state - black
n_hidden, n_input = linear_layers(Wxn, bxn, Whn, bhn)
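In case you are wondering where Wx, bx, Wh, and bh came from in the first place, here is a minimal sketch, assuming they were taken from the state dict of a GRU cell built earlier (the names n_features, hidden_dim, and gru_cell are assumptions for this illustration, not part of the excerpt above). In PyTorch, weight_ih and weight_hh stack the reset (r), update (z), and candidate (n) components along dimension 0, which is exactly why split(hidden_dim, dim=0) recovers the three pieces:

import torch
import torch.nn as nn

n_features, hidden_dim = 2, 2
torch.manual_seed(17)

# Hypothetical GRU cell whose parameters we want to dissect
gru_cell = nn.GRUCell(input_size=n_features, hidden_size=hidden_dim)
gru_state = gru_cell.state_dict()

# weight_ih / weight_hh stack the r, z, and n components along dim 0,
# so they have 3 * hidden_dim rows (and the biases, 3 * hidden_dim elements)
Wx, bx = gru_state['weight_ih'], gru_state['bias_ih']
Wh, bh = gru_state['weight_hh'], gru_state['bias_hh']

print(Wx.shape, bx.shape)  # torch.Size([6, 2]) torch.Size([6])
print(Wh.shape, bh.shape)  # torch.Size([6, 2]) torch.Size([6])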
Then, let’s use these layers to create functions that replicate both gates (r and z) and the candidate hidden state (n):

def reset_gate(h, x):
    thr = r_hidden(h)
    txr = r_input(x)
    r = torch.sigmoid(thr + txr)
    return r  # red

def update_gate(h, x):
    thz = z_hidden(h)
    txz = z_input(x)
    z = torch.sigmoid(thz + txz)
    return z  # blue

def candidate_n(h, x, r):
    thn = n_hidden(h)
    txn = n_input(x)
    n = torch.tanh(r * thn + txn)
    return n  # black

Cool! All the transformations and activations are handled by the functions above. This means we can replicate the mechanics of a GRU cell at its component level (r, z, and n). We also need an initial hidden state and the first data point (corner) of a sequence:

initial_hidden = torch.zeros(1, hidden_dim)
X = torch.as_tensor(points[0]).float()
first_corner = X[0:1]

We use both values to get the output from the reset gate (r):

r = reset_gate(initial_hidden, first_corner)
r

Output

tensor([[0.2387, 0.6928]], grad_fn=<SigmoidBackward>)
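To close the loop on the mechanics, here is a minimal sketch (not part of the excerpt above) of how the three component functions combine into a full GRU step. PyTorch’s GRU computes the new hidden state as h' = (1 - z) * n + z * h, so, reusing r from above:

# update gate (z) and candidate hidden state (n) for the same inputs
z = update_gate(initial_hidden, first_corner)
n = candidate_n(initial_hidden, first_corner, r)

# PyTorch's GRU update: h' = (1 - z) * n + z * h
new_hidden = (1 - z) * n + z * initial_hidden

# If the original gru_cell (assumed to exist, see the sketch earlier) is at
# hand, the result should match its output for the same input and hidden state:
# torch.allclose(new_hidden, gru_cell(first_corner, initial_hidden))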