
Dot Product’s Standard Deviation

You probably noticed that using the square root of the number of dimensions as the standard deviation simply appeared out of thin air. We're not going to prove it or anything, but we can simulate a ton of dot products to see what happens, right?

import torch

n_dims = 10
# 10,000 pairs of random vectors, each with n_dims dimensions
vector1 = torch.randn(10000, 1, n_dims)
vector2 = torch.randn(10000, 1, n_dims).permute(0, 2, 1)
# Batch matrix multiplication computes 10,000 dot products at once;
# their variance should be close to n_dims
torch.bmm(vector1, vector2).squeeze().var()

Output

tensor(9.8681)
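The variance comes out close to ten, which is exactly the number of dimensions. To see why (a sketch, assuming the components of both vectors are independent standard normal variables): each product $q_i k_i$ has zero mean and unit variance, since $\operatorname{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1$, and the dot product sums $d$ independent terms:

$$\operatorname{Var}(Q \cdot K) = \operatorname{Var}\left(\sum_{i=1}^{d} q_i k_i\right) = \sum_{i=1}^{d} \operatorname{Var}(q_i k_i) = d$$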

Even though the values in the hidden states coming out of both encoder and decoder are bounded to (-1, 1) by the hyperbolic tangent, remember that we're likely performing an affine transformation on them to produce both "keys" and "query." This means that the simulation above, where values are drawn from a normal distribution, is not as far-fetched as it may seem at first sight.
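To illustrate the point, here is a minimal sketch (the hidden states and the projection layers, proj_key and proj_query, are hypothetical, not the book's actual model) showing that an affine transformation does not keep the values constrained to (-1, 1):

import torch
import torch.nn as nn

torch.manual_seed(17)
hidden_dim = 10
# Hypothetical hidden states, bounded to (-1, 1) by tanh
hidden_states = torch.tanh(torch.randn(10000, hidden_dim))
# Hypothetical affine transformations producing "keys" and "query"
proj_key = nn.Linear(hidden_dim, hidden_dim)
proj_query = nn.Linear(hidden_dim, hidden_dim)
keys = proj_key(hidden_states)
query = proj_query(hidden_states)
# The projected values are no longer constrained to (-1, 1)
print(keys.min().item(), keys.max().item())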

If you try different values for the number of dimensions, you'll see that, on average, the variance equals the number of dimensions. So, the standard deviation is given by the square root of the number of dimensions:

$$\operatorname{Var}(Q \cdot K) = d_k \implies \operatorname{std}(Q \cdot K) = \sqrt{d_k}$$

Equation 9.9 - Standard deviation of the dot product
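You can check that claim by looping over a few dimensionalities, along the lines of the simulation above:

import torch

for n_dims in [2, 8, 32, 128]:
    v1 = torch.randn(10000, 1, n_dims)
    v2 = torch.randn(10000, 1, n_dims).permute(0, 2, 1)
    var = torch.bmm(v1, v2).squeeze().var()
    # The variance tracks n_dims, so the standard deviation tracks sqrt(n_dims)
    print(n_dims, var.item())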

As before, the context vector is the batch matrix multiplication between the attention scores and the "values":

alphas = calc_alphas(keys, query)
# N, 1, L x N, L, H -> N, 1, H
context_vector = torch.bmm(alphas, values)
context_vector
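The calc_alphas() function was defined earlier in the chapter; a minimal sketch of a scaled dot-product version, assuming "keys" of shape (N, L, H) and a "query" of shape (N, 1, H), could look like this:

import torch
import torch.nn.functional as F

def calc_alphas(ks, q):
    dims = q.size(-1)
    # N, 1, H x N, H, L -> N, 1, L
    products = torch.bmm(q, ks.permute(0, 2, 1))
    # Scale the dot products by their standard deviation, sqrt(dims)
    scaled_products = products / (dims ** 0.5)
    # Softmax over the source sequence turns scores into attention weights
    alphas = F.softmax(scaled_products, dim=-1)
    return alphas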

