
Recurrent Neural Networks

Attention mechanism

In the previous section we saw how the context or thought vector from the last time step of the encoder is fed into the decoder as the initial hidden state. As the context flows through the time steps of the decoder, it gets combined with the decoder output and grows progressively weaker. The result is that the context has little effect on the later time steps of the decoder.

In addition, certain sections of the decoder output may depend more heavily on certain sections of the input. For example, consider the input "thank you very much" and the corresponding output "merci beaucoup" for an English-to-French translation network such as the one we looked at in the previous section. Here the English phrases "thank you" and "very much" correspond to the French "merci" and "beaucoup" respectively. This alignment information is also not conveyed adequately through the single context vector.

The Attention mechanism provides access to all encoder hidden states at every time step on the decoder, and the decoder learns which parts of the encoder states to pay more attention to. The use of attention has resulted in great improvements to the quality of machine translation, as well as to a variety of standard natural language processing tasks.

The use of Attention is not limited to seq2seq networks. For example, Attention is a key component in the "Embed, Encode, Attend, Predict" formula for creating state-of-the-art deep learning models for NLP [34]. Here, Attention has been used to preserve as much information as possible when downsizing from a larger to a more compact representation, for example, when reducing a sequence of word vectors into a single sentence vector.
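To make this idea concrete, here is a minimal sketch of such attention pooling in TensorFlow/Keras. It is an illustration under simplified assumptions, not the exact formulation used in [34]: each word vector is scored against a single learned vector, the scores are normalized with a softmax, and the sentence vector is the resulting weighted sum of the word vectors.

import tensorflow as tf

class AttentionPooling(tf.keras.layers.Layer):
    # Illustrative sketch, not the formulation from [34]: learns one scalar
    # relevance score per word vector and returns a weighted sum of the
    # sequence as the sentence vector.
    def build(self, input_shape):
        self.w = self.add_weight(name="w",
                                 shape=(input_shape[-1], 1),
                                 initializer="glorot_uniform")

    def call(self, word_vectors):
        # word_vectors: (batch, num_words, dim)
        scores = tf.matmul(word_vectors, self.w)       # (batch, num_words, 1)
        weights = tf.nn.softmax(scores, axis=1)        # weights sum to 1 over words
        return tf.reduce_sum(weights * word_vectors, axis=1)  # (batch, dim)

# example: pool 7 word vectors of size 64 into a single 64-dimensional sentence vector
sentence_vector = AttentionPooling()(tf.random.normal((2, 7, 64)))

Because the weights are learned rather than fixed, as they would be with simple averaging, such a layer can emphasize the most informative words when it compresses the sequence.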

Essentially, the Attention mechanism provides a way to score tokens in the target against all tokens in the source and modify the input signal to the decoder accordingly. Consider an encoder-decoder architecture where the input and output time steps are denoted by indices i and j respectively, and the hidden states on the encoder and decoder at these respective time steps are denoted by h_i and s_j. Inputs to the encoder are denoted by x_i and outputs from the decoder are denoted by y_j. In an encoder-decoder network without attention, the value of the decoder state s_j is given by the hidden state s_{j-1} and the output y_{j-1} at the previous time step. The Attention mechanism adds a third signal c_j, known as the Attention context. With Attention, therefore, the decoder hidden state s_j is a function of y_{j-1}, s_{j-1}, and c_j, shown as follows:

s_j = f(y_{j-1}, s_{j-1}, c_j)
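As a rough sketch in TensorFlow/Keras (assuming a simple dot-product scoring function and a standard Keras GRUCell as the decoder cell; the network built earlier in this chapter may differ), one decoder time step with Attention could look like this: the previous decoder state s_{j-1} is scored against every encoder state h_i, the scores are normalized into weights with a softmax, the context c_j is the weighted sum of the encoder states, and the cell then consumes y_{j-1} together with c_j to produce s_j.

import tensorflow as tf

def decoder_step_with_attention(encoder_states, prev_state, prev_output_emb, decoder_cell):
    # encoder_states:  (batch, T_enc, dim) -- all encoder hidden states h_i
    # prev_state:      (batch, dim)        -- previous decoder state s_{j-1}
    # prev_output_emb: (batch, emb_dim)    -- embedding of the previous output y_{j-1}
    # score each h_i against s_{j-1} with a dot product (an assumed scoring function)
    scores = tf.einsum("bd,btd->bt", prev_state, encoder_states)
    weights = tf.nn.softmax(scores, axis=1)              # attention weights over i
    # attention context c_j: weighted sum of the encoder states
    context = tf.einsum("bt,btd->bd", weights, encoder_states)
    # s_j = f(y_{j-1}, s_{j-1}, c_j): feed [y_{j-1}; c_j] and s_{j-1} to the cell
    cell_input = tf.concat([prev_output_emb, context], axis=-1)
    output, new_states = decoder_cell(cell_input, [prev_state])
    return output, new_states[0], weights

# example usage with hypothetical sizes: 128-unit GRU cell, 10 encoder steps, batch of 2
cell = tf.keras.layers.GRUCell(128)
enc_states = tf.random.normal((2, 10, 128))
s_prev = tf.zeros((2, 128))
y_prev_emb = tf.zeros((2, 64))
out, s_j, attn = decoder_step_with_attention(enc_states, s_prev, y_prev_emb, cell)

Running this step in a loop over j, with the whole network trained end to end, is what gives the decoder a different, learned view of the encoder states at every output position.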

