Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub


Output

    tensor([[[ 1.4636,  2.3663],
             [ 1.9806, -0.7564]]])

Next, we normalize it:

    norm = nn.LayerNorm(2)
    norm(source_seq_enc)

Output

    tensor([[[-1.0000,  1.0000],
             [ 1.0000, -1.0000]]], grad_fn=<NativeLayerNormBackward>)

"Wait, what happened here?"

That’s what happens when one tries to normalize two features only: They become either minus one or one. Even worse, it will be the same for every data point. These values won’t get us anywhere, that’s for sure.

We need to do better, we need…

Projections or Embeddings

Sometimes projections and embeddings are used interchangeably. Here, though, we’re sticking with embeddings for categorical values and projections for numerical values.

In Chapter 11, we’ll be using embeddings to get a numerical representation (a vector) for a given word or token. Since words or tokens are categorical values, the embedding layer works like a large lookup table: It will look up a given word or token in its keys and return the corresponding tensor. But, since we’re dealing with coordinates, that is, numerical values, we are using projections instead. A simple linear layer is all that it takes to project our pair of coordinates into a higher-dimensional feature space:

Layer Normalization | 829

    torch.manual_seed(11)
    proj_dim = 6
    linear_proj = nn.Linear(2, proj_dim)
    pe = PositionalEncoding(2, proj_dim)
    source_seq_proj = linear_proj(source_seq)
    source_seq_proj_enc = pe(source_seq_proj)
    source_seq_proj_enc

Output

    tensor([[[-2.0934,  1.5040,  1.8742,  0.0628,  0.3034,  2.0190],
             [-0.8853,  2.8213,  0.5911,  2.4193, -2.5230,  0.3599]]],
           grad_fn=<AddBackward0>)

See? Now each data point in our source sequence has six features (the projected dimensions), and they are positionally-encoded too. Sure, this particular projection is totally random, but that won’t be the case once we add the corresponding linear layer to our model. It will learn a meaningful projection that, after being positionally-encoded, will be normalized:

    norm = nn.LayerNorm(proj_dim)
    norm(source_seq_proj_enc)

Output

    tensor([[[-1.9061,  0.6287,  0.8896, -0.3868, -0.2172,  0.9917],
             [-0.7362,  1.2864,  0.0694,  1.0670, -1.6299, -0.0568]]],
           grad_fn=<NativeLayerNormBackward>)

Problem solved! Finally, we have everything we need to build a full-blown Transformer!

In Chapter 9, we used affine transformations inside the attention heads to map from input dimensions to hidden (or model) dimensions. Now, this change in dimensionality is being performed using projections directly on the input sequences before they are passed to the encoder and the decoder.

830 | Chapter 10: Transform and Roll Out
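To see why two features collapse to minus one or one while six features do not, it may help to redo the normalization by hand. The sketch below is plain Python rather than PyTorch; the `layer_norm` helper is a hypothetical stand-in that assumes the default behavior of `nn.LayerNorm` (biased variance, `eps=1e-5`, and a scale and shift still at their initial values of one and zero). With only two features, both values sit at the same distance from their mean, so dividing by the standard deviation can only yield minus one or one. The inputs are the values printed above.

```python
import math

def layer_norm(features, eps=1e-5):
    """Normalize one data point across its features.

    Assumption: mirrors nn.LayerNorm's defaults, that is, biased
    variance (divide by N) plus a small eps. The learnable scale and
    shift are left out, since they start at one and zero.
    """
    n = len(features)
    mean = sum(features) / n
    var = sum((f - mean) ** 2 for f in features) / n
    return [(f - mean) / math.sqrt(var + eps) for f in features]

# Two features per data point: the positionally-encoded coordinates.
two_features = [[1.4636, 2.3663], [1.9806, -0.7564]]

# Six features per data point: the projected, positionally-encoded values.
six_features = [
    [-2.0934, 1.5040, 1.8742, 0.0628, 0.3034, 2.0190],
    [-0.8853, 2.8213, 0.5911, 2.4193, -2.5230, 0.3599],
]

normed_two = [layer_norm(row) for row in two_features]
normed_six = [layer_norm(row) for row in six_features]

for row in normed_two + normed_six:
    print([round(value, 4) for value in row])
```

Since the inputs were copied from rounded printouts, the results should agree with the normalized tensors shown above only up to rounding in the last digit.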

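The difference between an embedding (a lookup) and a projection (an affine transformation) can also be sketched in a few lines of plain Python. Everything here is hypothetical and for illustration only: the toy vocabulary, the random weights, and the `embed` and `project` helpers, which loosely stand in for what `nn.Embedding` and `nn.Linear(2, proj_dim)` do. The weight matrix is laid out as (output features, input features), the same layout `nn.Linear` uses for its `weight`.

```python
import random

random.seed(11)  # arbitrary seed, only so the sketch is repeatable

def random_matrix(rows, cols):
    """Random numbers standing in for weights a model would learn."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)]
            for _ in range(rows)]

proj_dim = 6

# Embedding: one stored vector per category, fetched by its index.
vocab = {"the": 0, "cat": 1, "sat": 2}  # hypothetical toy vocabulary
embedding_table = random_matrix(len(vocab), proj_dim)

def embed(token):
    """Look the token up and return its row; no arithmetic on the input."""
    return embedding_table[vocab[token]]

# Projection: an affine map (weight times point, plus bias) on numbers.
weight = random_matrix(proj_dim, 2)  # (out_features, in_features)
bias = [random.uniform(-1.0, 1.0) for _ in range(proj_dim)]

def project(point):
    """Map a pair of coordinates to proj_dim features."""
    return [sum(w * x for w, x in zip(row, point)) + b
            for row, b in zip(weight, bias)]

embedded = embed("cat")
projected = project([0.25, -0.75])  # hypothetical coordinates
print(len(embedded), len(projected))  # prints: 6 6
```

Both routes end in a six-dimensional vector, but the embedding only works for tokens it has a row for, while the projection accepts any pair of numbers.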

