Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide


Output

tensor([[[-1.0000,  0.0000],
         [-0.1585,  1.5403]]])

"What am I looking at?"

It turns out, the original coordinates were somewhat crowded out by the addition of the positional encoding (especially in the first row). This may happen if the data points have values roughly in the same range as the positional encoding. Unfortunately, this is fairly common: Both standardized inputs and word embeddings (we’ll get back to them in Chapter 11) are likely to have most of their values inside the [-1, 1] range of the positional encoding.

"How can we handle it then?"

That’s what the scaling in the forward() method is for: It’s as if we were "reversing the standardization" of the inputs (using a standard deviation equal to the square root of their dimensionality) to retrieve the hypothetical "raw" inputs.

Equation 9.22 - "Reversing" the standardization

$$\text{"raw" inputs} = \text{"standardized" inputs} \times \sqrt{d_{model}}$$

By the way, previously, we scaled the dot product using the inverse of the square root of its dimensionality, which was its standard deviation. Even though this is not the same thing, the analogy might help you remember that the inputs are also scaled by the square root of their number of dimensions before the positional encoding gets added to them.
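Before moving on, it may help to reproduce the unscaled sum by hand. The snippet below is a minimal sketch rather than the chapter’s actual code: the source_seq values are assumed to be the two-data-point sequence used earlier (they are consistent with the outputs shown in this section), and, with only two dimensions, the encoding of position p boils down to [sin(p), cos(p)]:

import torch

# assumed source sequence (consistent with the outputs in this section)
source_seq = torch.tensor([[[-1., -1.],
                            [-1.,  1.]]])

# sinusoidal encoding for positions 0 and 1 with d_model=2:
# the even dimension gets the sine, the odd dimension gets the cosine
position = torch.arange(2).float().unsqueeze(1)
pe = torch.cat([torch.sin(position), torch.cos(position)], dim=1)

# adding the encoding WITHOUT any scaling crowds out the coordinates
print(source_seq + pe)
# tensor([[[-1.0000,  0.0000],
#          [-0.1585,  1.5403]]])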

In our example, the dimensionality is two (coordinates), so the inputs are going to be scaled by the square root of two:

posenc(source_seq)

Output

tensor([[[-1.4142, -0.4142],
         [-0.5727,  1.9545]]])

The results above (after the encoding) illustrate the effect of scaling the inputs: It seems to have lessened the crowding-out effect of the positional encoding. For inputs with many dimensions, the effect will be much more pronounced: A 300-dimension embedding will have a scaling factor of around 17, for example.

"Wait, isn’t this bad for the model?"

Left unchecked, yes, it could be bad for the model. That’s why we’ll pull off yet another normalization trick: layer normalization. We’ll discuss it in detail in the next chapter.

For now, scaling the coordinates by the square root of two isn’t going to be an issue, so we can move on and integrate positional encoding into our model.

Encoder + Decoder + PE

The new encoder and decoder classes simply wrap their self-attention counterparts: each one assigns the corresponding self-attention module to its layer attribute and encodes the inputs before calling that layer:

Encoder with Positional Encoding

class EncoderPe(nn.Module):
    def __init__(self, n_heads, d_model, ff_units,
                 n_features=None, max_len=100):
        super().__init__()
        pe_dim = d_model if n_features is None else n_features
        self.pe = PositionalEncoding(max_len, pe_dim)
        self.layer = EncoderSelfAttn(n_heads, d_model,
                                     ff_units, n_features)

    def forward(self, query, mask=None):
        query_pe = self.pe(query)
        out = self.layer(query_pe, mask)
        return out
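The decoder gets the same treatment. Since only the encoder code appears on this page, the sketch below shows what the decoder wrapper might look like; the class name DecoderPe, and the init_keys()/forward() signature of DecoderSelfAttn, are assumptions based on the interface used earlier in the chapter (the same imports, plus the PositionalEncoding and DecoderSelfAttn classes, are assumed to be available):

Decoder with Positional Encoding (sketch)

class DecoderPe(nn.Module):
    def __init__(self, n_heads, d_model, ff_units,
                 n_features=None, max_len=100):
        super().__init__()
        pe_dim = d_model if n_features is None else n_features
        self.pe = PositionalEncoding(max_len, pe_dim)
        self.layer = DecoderSelfAttn(n_heads, d_model,
                                     ff_units, n_features)

    def init_keys(self, states):
        # hand the encoder's states over to the wrapped decoder (assumed method)
        self.layer.init_keys(states)

    def forward(self, query, source_mask=None, target_mask=None):
        # encode (and scale) the target sequence before calling the wrapped layer
        query_pe = self.pe(query)
        out = self.layer(query_pe, source_mask, target_mask)
        return out

Wrapping, rather than subclassing, keeps the positional encoding concern separate from the attention logic itself.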
