
3. The target sequences are fed into dense blocks that convert the output embeddings into the final sequence of integer tokens, as sketched after the figure below:

Figure 7: Flow of data in (a) seq2seq + Attention, and (b) Transformer architecture.

Image Source: Zhang, et al. [31].
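
As a minimal, hypothetical sketch of that dense block (the sizes and names below are assumptions, not the book's code): a Dense layer projects each output embedding onto the target vocabulary, and an argmax over the resulting logits yields the integer token ids.

import tensorflow as tf

vocab_size = 8000   # assumed target vocabulary size
embed_dim = 256     # assumed output embedding size
seq_len = 20        # assumed target sequence length

# Dense block: project each output embedding onto the target vocabulary.
to_vocab = tf.keras.layers.Dense(vocab_size)

# Stand-in for the decoder's output embeddings: (batch, seq_len, embed_dim).
decoder_output = tf.random.uniform((1, seq_len, embed_dim))

logits = to_vocab(decoder_output)        # (1, seq_len, vocab_size)
token_ids = tf.argmax(logits, axis=-1)   # (1, seq_len) integer sequence
print(token_ids.shape)                   # (1, 20)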

And the two architectures differ in the following ways:

1. The recurrent and attention layers on the encoder, and the recurrent layer in the decoder of the seq2seq network, have been replaced with transformer blocks. The transformer block on the encoder side consists of a sequence of multi-head attention layers, an add and norm layer, and a position-wise feed-forward layer. The transformer block on the decoder side has an additional multi-head attention layer, with its own add and norm layers, in front of the encoder state signal.

2. The encoder state is passed to every transformer block on the decoder, instead of only to the first recurrent time step as in the seq2seq with attention network. This allows transformers to work in parallel across time steps, since there is no longer a temporal dependency as with seq2seq networks. A minimal sketch of both kinds of block follows this list.
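
To make the two differences concrete, here is a minimal sketch of an encoder-side and a decoder-side transformer block written with tf.keras layers. This is not the book's implementation; the class names, hyperparameters, and the use of tf.keras.layers.MultiHeadAttention are assumptions, and positional encodings, dropout, and masking are left out. The encoder block stacks multi-head self-attention, add and norm, and a position-wise feed-forward layer; the decoder block adds a second multi-head attention layer, with its own add and norm, that receives the encoder state, and the same encoder state is fed to every decoder block.

import tensorflow as tf
from tensorflow.keras import layers

EMBED_DIM, NUM_HEADS, FF_DIM = 256, 8, 512   # assumed hyperparameters

class EncoderBlock(layers.Layer):
    # Multi-head self-attention -> add & norm -> position-wise feed-forward -> add & norm.
    def __init__(self):
        super().__init__()
        self.attn = layers.MultiHeadAttention(NUM_HEADS, EMBED_DIM // NUM_HEADS)
        self.norm1 = layers.LayerNormalization()
        self.ffn = tf.keras.Sequential(
            [layers.Dense(FF_DIM, activation="relu"), layers.Dense(EMBED_DIM)])
        self.norm2 = layers.LayerNormalization()

    def call(self, x):
        x = self.norm1(x + self.attn(x, x))   # self-attention, then add & norm
        return self.norm2(x + self.ffn(x))    # feed-forward, then add & norm

class DecoderBlock(layers.Layer):
    # Same as EncoderBlock, plus an extra attention layer over the encoder state.
    def __init__(self):
        super().__init__()
        self.self_attn = layers.MultiHeadAttention(NUM_HEADS, EMBED_DIM // NUM_HEADS)
        self.norm1 = layers.LayerNormalization()
        self.cross_attn = layers.MultiHeadAttention(NUM_HEADS, EMBED_DIM // NUM_HEADS)
        self.norm2 = layers.LayerNormalization()
        self.ffn = tf.keras.Sequential(
            [layers.Dense(FF_DIM, activation="relu"), layers.Dense(EMBED_DIM)])
        self.norm3 = layers.LayerNormalization()

    def call(self, x, enc_state):
        # Causal masking of the self-attention is omitted for brevity.
        x = self.norm1(x + self.self_attn(x, x))
        # The encoder state enters every decoder block here, not just the first time step.
        x = self.norm2(x + self.cross_attn(x, enc_state))
        return self.norm3(x + self.ffn(x))

# All target positions are processed at once; there is no recurrence across time steps.
enc_out = EncoderBlock()(tf.random.uniform((1, 12, EMBED_DIM)))   # (batch, src_len, embed)
dec_out = tf.random.uniform((1, 20, EMBED_DIM))                   # (batch, tgt_len, embed)
for block in [DecoderBlock() for _ in range(2)]:                  # same enc_out for every block
    dec_out = block(dec_out, enc_out)
print(dec_out.shape)                                              # (1, 20, 256)

Because each decoder block depends only on the full encoder output and the previous block's output, all target positions can be computed in parallel, which is what removing the recurrent time steps buys.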
