
Since the entire sequence is consumed in parallel by the encoder, information about the positions of individual elements is lost. To compensate for this, the input embeddings are augmented with a positional embedding, which is implemented as a sinusoidal function without learned parameters. The positional embedding is added to the input embedding.
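To make this concrete, here is a minimal NumPy sketch of the sinusoidal positional embedding scheme used in the reference implementation. The sequence length, the value of d_model, and the random token embeddings are illustrative assumptions, not taken from the book's code.

```python
import numpy as np

def sinusoidal_positional_embedding(max_len, d_model):
    """Build the fixed (max_len, d_model) sinusoidal positional embedding:
    even dimensions use sine, odd dimensions use cosine, with wavelengths
    forming a geometric progression controlled by the constant 10000."""
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                 # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                         # (max_len, d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# The positional embedding is simply added to the token embeddings.
# Hypothetical input: a sequence of 10 tokens with d_model = 512.
seq_len, d_model = 10, 512
token_embeddings = np.random.randn(seq_len, d_model)
inputs = token_embeddings + sinusoidal_positional_embedding(seq_len, d_model)
```

Because the function has no learned parameters, the same positional embedding matrix can be precomputed once and reused for any input of the same length.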

The output of the encoder is a pair of attention vectors K and V. These are sent in parallel to all the transformer blocks in the decoder. The transformer block on the decoder is similar to the one on the encoder, except that it has an additional multi-head attention layer to attend to the attention vectors from the encoder. This additional multi-head attention layer works similarly to the one in the encoder and the one below it, except that it combines the Q vector from the layer below it with the K and V vectors from the encoder state.
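The sketch below illustrates this encoder-decoder attention as plain scaled dot-product attention in NumPy. The shapes are hypothetical, and the learned projection matrices of the real model are replaced with identity projections to keep the example short.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (tgt_len, src_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                 # (tgt_len, d_model)

# Encoder-decoder attention: Q comes from the decoder layer below,
# K and V come from the encoder output (hypothetical shapes).
d_model, src_len, tgt_len = 512, 12, 7
encoder_output = np.random.randn(src_len, d_model)
decoder_hidden = np.random.randn(tgt_len, d_model)

Q = decoder_hidden          # identity projection for brevity
K = V = encoder_output      # identity projection for brevity
context = scaled_dot_product_attention(Q, K, V)        # (tgt_len, d_model)
```

This is how every decoder position gets to look at the entire encoded source sequence, while the self-attention layer below it only looks at the decoder's own (partial) output.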

Similar to the seq2seq network, the output sequence is generated one token at a time, using the input from the previous time step. As with the input to the encoder, the input to the decoder is also augmented with a positional embedding. Unlike the encoder, the self-attention process in the decoder is only allowed to attend to tokens at previous time points. This is done by masking out tokens at future time points.
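A minimal sketch of this masking step, assuming raw attention scores of shape (seq_len, seq_len): positions above the diagonal (future tokens) are set to negative infinity so that their softmax weights become zero.

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular boolean mask: position i may attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def mask_future_positions(scores, mask):
    """Replace scores at masked-out (future) positions with -inf so that
    the subsequent softmax assigns them zero attention weight."""
    return np.where(mask, scores, -np.inf)

scores = np.random.randn(5, 5)      # hypothetical raw Q K^T / sqrt(d_k) scores
masked = mask_future_positions(scores, causal_mask(5))
```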

The output of the last transformer block in the decoder is a sequence of low-dimensional embeddings (512 for the reference implementation [30], as noted earlier). This is passed to the Dense layer, which converts it into a sequence of probability distributions across the target vocabulary, from which we generate the most probable word either greedily or by a more sophisticated technique such as beam search.
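As an illustration of the greedy option, the toy sketch below picks the argmax token from each probability distribution; the four-word vocabulary and the probability values are made up for the example.

```python
import numpy as np

def greedy_decode(prob_sequence, id_to_token):
    """Pick the highest-probability token at each step.
    prob_sequence: (seq_len, vocab_size) probability distributions, as
    produced by the final Dense + softmax layer (hypothetical shapes)."""
    token_ids = prob_sequence.argmax(axis=-1)
    return [id_to_token[i] for i in token_ids]

# Toy example with a 4-word target vocabulary.
id_to_token = {0: "<pad>", 1: "hello", 2: "world", 3: "<eos>"}
probs = np.array([[0.1, 0.7, 0.1, 0.1],
                  [0.1, 0.1, 0.7, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
print(greedy_decode(probs, id_to_token))   # ['hello', 'world', '<eos>']
```

Beam search differs in that it keeps the top-k partial sequences at each step instead of committing to a single token, which usually yields better translations at the cost of more computation.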

This has been a fairly high-level coverage of the transformer architecture. It has achieved state-of-the-art results in some machine translation benchmarks. The BERT embedding, which we discussed in the previous chapter, is the encoder portion of a transformer network trained on sentence pairs in the same language. The BERT network comes in two flavors, both of which are somewhat larger than the reference implementation – BERT-base has 12 encoder layers, a hidden dimension of 768, and 12 attention heads on its multi-head attention layers, while BERT-large has 24 encoder layers, a hidden dimension of 1024, and 16 attention heads.

If you would like to learn more about transformers, the Illustrated Transformer blog post by Alammar [33] provides a very detailed, and very visual, guide to the structure and inner workings of this network. In addition, for those of you who prefer code, the textbook by Zhang et al. [31] describes and builds up a working model of the transformer network using MXNet.
