Recurrent Neural Networks

Finally, Vaswani et al. [30] proposed a variation on content-based attention, called scaled dot-product attention, which is given by the following equation. Here, N is the dimension of the encoder hidden state h. Scaled dot-product attention is used in the transformer architecture, which we will learn about shortly in this chapter:

e = \frac{h^T s}{\sqrt{N}}
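
To make the equation concrete, here is a minimal sketch in TensorFlow (not taken from the book's code) that computes the scaled dot-product scores for a batch of encoder states h and a decoder state s, and then turns them into attention weights and a context vector; the shapes and the softmax/context steps are illustrative assumptions:

import tensorflow as tf

# Toy shapes for illustration; N is the encoder hidden state dimension.
batch_size, num_timesteps, N = 4, 10, 16

h = tf.random.normal((batch_size, num_timesteps, N))   # encoder hidden states
s = tf.random.normal((batch_size, N))                   # current decoder state

# e = h^T s / sqrt(N): one score per encoder time step
scores = tf.einsum("btn,bn->bt", h, s) / tf.sqrt(tf.cast(N, tf.float32))

# A softmax over the time steps gives the attention weights, and the
# weighted sum of the encoder states gives the context vector.
weights = tf.nn.softmax(scores, axis=-1)        # (batch_size, num_timesteps)
context = tf.einsum("bt,btn->bn", weights, h)   # (batch_size, N)

print(scores.shape, weights.shape, context.shape)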

Attention mechanisms can also be categorized by what they attend to. Using this categorization scheme, attention mechanisms can be self-attention, global or soft attention, and local or hard attention.

Self-attention is when the alignment is computed across different sections of the same sequence, and has been found to be useful for applications such as machine reading, abstractive text summarization, and image caption generation.

Soft or global attention is when the alignment is computed over the entire input sequence, and hard or local attention is when the alignment is computed over part of the sequence. The advantage of soft attention is that it is differentiable; however, it can be expensive to compute. Conversely, hard attention is cheaper to compute at inference time, but it is non-differentiable and requires more complicated techniques during training.

In the next section, we will see how to integrate the attention mechanism with a seq2seq network, and how it improves performance.

Example ‒ seq2seq with attention for machine translation

Let us look at the same example of machine translation that we saw earlier in this chapter, except that the decoder will now attend to the encoder outputs using the additive attention mechanism proposed by Bahdanau et al. [28] and the multiplicative one proposed by Luong et al. [29].
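
Before making the changes, it may help to see what the two scoring functions look like in code. The following is a minimal sketch of additive (Bahdanau) and multiplicative (Luong) attention as Keras layers; the class names, the num_units parameter, and the softmax/context steps are illustrative assumptions rather than the book's exact listings:

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    # Additive attention: score = v^T tanh(W1 h + W2 s)
    def __init__(self, num_units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(num_units)
        self.W2 = tf.keras.layers.Dense(num_units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: decoder state (batch, decoder_dim)
        # values: encoder outputs (batch, num_timesteps, encoder_dim)
        query = tf.expand_dims(query, axis=1)
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(query)))
        weights = tf.nn.softmax(score, axis=1)             # (batch, T, 1)
        context = tf.reduce_sum(weights * values, axis=1)  # (batch, encoder_dim)
        return context, weights

class LuongAttention(tf.keras.layers.Layer):
    # Multiplicative attention: score = h^T W s
    # num_units must match the encoder output dimension.
    def __init__(self, num_units):
        super().__init__()
        self.W = tf.keras.layers.Dense(num_units)

    def call(self, query, values):
        query = tf.expand_dims(query, axis=1)                       # (batch, 1, decoder_dim)
        score = tf.matmul(values, self.W(query), transpose_b=True)  # (batch, T, 1)
        weights = tf.nn.softmax(score, axis=1)
        context = tf.reduce_sum(weights * values, axis=1)
        return context, weights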

The first change is to the Encoder. Instead of returning a single context or thought vector, it will return outputs at every time point, because the attention mechanism will need this information. Here is the revised Encoder class, with the change highlighted:

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, num_timesteps,
            embedding_dim, encoder_dim, **kwargs):
        super(Encoder, self).__init__(**kwargs)
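        # The listing is cut off by the page break here. What follows is a
        # minimal sketch of a plausible continuation (not the book's exact
        # code), assuming a GRU-based encoder. The key change for attention
        # is return_sequences=True, so the encoder returns its output at
        # every time step instead of only the final state.
        self.encoder_dim = encoder_dim
        self.embedding = tf.keras.layers.Embedding(
            vocab_size, embedding_dim, input_length=num_timesteps)
        self.rnn = tf.keras.layers.GRU(
            encoder_dim, return_sequences=True, return_state=True)

    def call(self, x, state):
        x = self.embedding(x)
        x, state = self.rnn(x, initial_state=state)
        return x, state

    def init_state(self, batch_size):
        return tf.zeros((batch_size, self.encoder_dim))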
