Output

    tensor([[[ 0.0832, -0.0356],
             [ 0.3105, -0.5263]]], grad_fn=<PermuteBackward>)

    keys = hidden_seq  # N, L, H
    keys

Output

    tensor([[[ 0.0832, -0.0356],
             [ 0.3105, -0.5263]]], grad_fn=<PermuteBackward>)

The encoder-decoder dynamics stay exactly the same: we still use the encoder’s final hidden state as the decoder’s initial hidden state (even though we’re sending the whole sequence to the decoder, it still uses the last hidden state only), and we still use the last element of the source sequence as input to the first step of the decoder:

    torch.manual_seed(21)
    decoder = Decoder(n_features=2, hidden_dim=2)
    decoder.init_hidden(hidden_seq)

    inputs = source_seq[:, -1:]
    out = decoder(inputs)

The first "query" (Q) is the decoder’s hidden state (remember, hidden states are always sequence-first, so we’re permuting it to batch-first):

    query = decoder.hidden.permute(1, 0, 2)  # N, 1, H
    query

Output

    tensor([[[ 0.3913, -0.6853]]], grad_fn=<PermuteBackward>)
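The remark about sequence-first hidden states is easy to verify on its own. The snippet below is a minimal sketch, not the Decoder class used above: it assumes a single-layer nn.GRU built with batch_first=True as a stand-in, and the sizes (a mini-batch of three sequences) are hypothetical, chosen only so that the effect of the permutation is visible in the shapes.

```python
import torch
import torch.nn as nn

torch.manual_seed(21)

# Hypothetical sizes: 3 sequences, 2 steps each, 2 features, 2 hidden dimensions
N, L, F, H = 3, 2, 2, 2

# Stand-in recurrent layer (an assumption; the book's Decoder is not reproduced here)
rnn = nn.GRU(input_size=F, hidden_size=H, batch_first=True)

x = torch.randn(N, L, F)           # batch-first input: N, L, F
outputs, hidden = rnn(x)

print(outputs.shape)               # N, L, H -> the outputs follow batch_first
print(hidden.shape)                # 1, N, H -> the hidden state does not

# Same permutation used to build the "query"
query = hidden.permute(1, 0, 2)    # N, 1, H
print(query.shape)

# For a single-layer, unidirectional GRU, the final hidden state
# matches the last step of the output sequence
print(torch.allclose(outputs[:, -1:], query))
```

The batch_first argument only affects the inputs and outputs of the recurrent layer. The hidden state keeps its (layers, N, H) layout either way, which is why the permutation is needed before the query can be combined with batch-first keys and values.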

OK, we have the "keys" and a "query," so let’s pretend we can compute attention scores (alphas) using them:

    def calc_alphas(ks, q):
        N, L, H = ks.size()
        # placeholder: every "key" gets the same score, 1/L
        alphas = torch.ones(N, 1, L).float() * 1/L
        return alphas

    alphas = calc_alphas(keys, query)
    alphas

Output

    tensor([[[0.5000, 0.5000]]])

We had to make sure alphas had the right shape (N, 1, L) so that, when multiplied by the "values" with shape (N, L, H), it results in a weighted sum of the alignment vectors with shape (N, 1, H). We can use batch matrix multiplication (torch.bmm()) for that:

$$(N, 1, L) \times (N, L, H) = (N, 1, H)$$

Equation 9.2 - Shapes for batch matrix multiplication

In other words, we can simply ignore the first dimension, and PyTorch will go over all the elements in the mini-batch for us:

    # N, 1, L x N, L, H -> 1, L x L, H -> 1, H
    context_vector = torch.bmm(alphas, values)
    context_vector

Output

    tensor([[[ 0.1968, -0.2809]]], grad_fn=<BmmBackward0>)

"Why are you spending so much time on matrix multiplication, of all things?"

Although it seems a fairly basic topic, getting the shapes and dimensions right is of


