
The Attention context signal $c_j$ is computed as follows. For every decoder step $j$, we compute the alignment between the decoder state $s_{j-1}$ and every encoder state $h_i$. This gives us a set of $N$ similarity values $e_{ij}$ for each decoder step $j$, which we then convert to a probability distribution by computing their corresponding softmax values $b_{ij}$. Finally, the Attention context $c_j$ is computed as the weighted sum of the encoder states $h_i$ and their corresponding softmax weights $b_{ij}$ over all $N$ encoder time steps. The following set of equations encapsulates this transformation for each decoder step $j$:

$$e_{ij} = \text{align}(h_i, s_{j-1}) \quad \forall i$$
$$b_{ij} = \text{softmax}(e_{ij})$$
$$c_j = \sum_{i=1}^{N} h_i b_{ij}$$
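
To make the computation concrete, here is a minimal NumPy sketch of the context calculation for a single decoder step. The helper names (softmax, attention_context) and the array shapes are illustrative assumptions, not code from this chapter; the alignment function is passed in so that any of the formulations described next can be plugged in.

import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of scores."""
    x = x - np.max(x)
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()

def attention_context(enc_states, dec_state, align):
    """Attention context c_j for one decoder step j.

    enc_states: (N, d) array holding the encoder states h_i
    dec_state:  (d,) array holding the previous decoder state s_{j-1}
    align:      callable(h_i, s_{j-1}) -> scalar alignment score e_ij
    """
    e = np.array([align(h, dec_state) for h in enc_states])  # e_ij for all i
    b = softmax(e)                                            # b_ij, a distribution over i
    c = (b[:, None] * enc_states).sum(axis=0)                 # weighted sum of the h_i
    return c, b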

Multiple Attention mechanisms have been proposed, based on how the alignment is done. We will describe a few next. For notational convenience, we will indicate the state vector $h_i$ on the encoder side with $h$, and the state vector $s_{j-1}$ on the decoder side with $s$.

The simplest formulation of alignment is content-based attention. It was proposed by Graves, Wayne, and Danihelka [27], and is just the cosine similarity between the encoder and decoder states. A precondition for using this formulation is that the hidden state vectors on both the encoder and the decoder must have the same dimensions:

$$e = \text{cosine}(h, s)$$
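
As a sketch under the same assumptions as above, the content-based score is a plain cosine similarity (the small epsilon guarding against zero-norm vectors is an added safeguard, not part of the original formulation):

def cosine_align(h, s):
    """Content-based alignment: cosine similarity between encoder state h
    and decoder state s; both must have the same dimensionality."""
    return float(h @ s / (np.linalg.norm(h) * np.linalg.norm(s) + 1e-9))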

Another formulation, known as additive or Bahdanau attention, was proposed by Bahdanau, Cho, and Bengio [28]. It combines the state vectors using learnable weights in a small neural network, as given by the following equation. Here, the $s$ and $h$ vectors are concatenated and multiplied by the learned weight matrix $W$, which is equivalent to multiplying $s$ and $h$ by two separate learned weight matrices $W_s$ and $W_h$ and adding the results:

$$e = v^T \tanh(W[s; h])$$
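
A minimal sketch of the additive score follows; the class name is hypothetical, and the weights are randomly initialized here purely for illustration, whereas in practice W and v would be learned along with the rest of the network:

class AdditiveAlign:
    """Additive (Bahdanau) alignment: e = v^T tanh(W [s; h])."""
    def __init__(self, enc_dim, dec_dim, attn_dim, seed=0):
        rng = np.random.default_rng(seed)
        # W projects the concatenated [s; h] vector; v reduces it to a scalar score.
        self.W = rng.normal(scale=0.1, size=(attn_dim, dec_dim + enc_dim))
        self.v = rng.normal(scale=0.1, size=(attn_dim,))

    def __call__(self, h, s):
        return float(self.v @ np.tanh(self.W @ np.concatenate([s, h])))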

Luong, Pham, and Manning [29] proposed a set of three attention formulations (dot, general, and concat), of which the general formulation is also known as multiplicative or Luong's attention. The dot and concat formulations are similar to the content-based and additive attention formulations discussed earlier. The multiplicative attention formulation is given by the following equation:

$$e = h^T W s$$
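
A corresponding sketch of the multiplicative (general) score, again with a randomly initialized weight matrix standing in for learned parameters, followed by a short usage example showing how any of these alignment functions plugs into the attention_context() helper defined earlier (all shapes are arbitrary illustrative values):

class MultiplicativeAlign:
    """Multiplicative (Luong, general) alignment: e = h^T W s."""
    def __init__(self, enc_dim, dec_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(enc_dim, dec_dim))

    def __call__(self, h, s):
        return float(h @ self.W @ s)

# Example usage with made-up shapes: N=6 encoder states of dimension 8.
rng = np.random.default_rng(42)
enc_states = rng.normal(size=(6, 8))   # h_1 ... h_N
dec_state = rng.normal(size=(8,))      # s_{j-1}
c, b = attention_context(enc_states, dec_state, MultiplicativeAlign(enc_dim=8, dec_dim=8))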

