Since the entire sequence is consumed in parallel on the Encoder, information about the positions of individual elements is lost. To compensate for this, the input embeddings are augmented with a positional embedding, which is implemented as a sinusoidal function without learned parameters. The positional embedding is added to the input embedding.

The output of the Encoder is a pair of attention vectors K and V. This is sent in parallel to all the transformer blocks in the decoder. The transformer block on the decoder is similar to that on the encoder, except that it has an additional multi-head attention layer to attend to the attention vectors from the encoder. This additional multi-head attention layer works similarly to the one in the encoder and the one below it, except that it combines the Q vector from the layer below it with the K and V vectors from the encoder state.

Similar to the seq2seq network, the output sequence is generated one token at a time, using the input from the previous time step. As with the input to the encoder, the input to the decoder is also augmented with a positional embedding. Unlike the encoder, the self-attention process in the decoder is only allowed to attend to tokens at previous time points. This is done by masking out tokens at future time points.

The output of the last transformer block in the Decoder is a sequence of low-dimensional embeddings (512 for the reference implementation [30], as noted earlier). This is passed to the Dense layer, which converts it into a sequence of probability distributions across the target vocabulary, from which we generate the most probable word either greedily or by a more sophisticated technique such as beam search.

This has been a fairly high-level coverage of the transformer architecture. It has achieved state-of-the-art results in some machine translation benchmarks.
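
To make two of the ideas above more concrete, the following is a minimal sketch of the sinusoidal positional embedding and of the masking used in the decoder's self-attention. It is written in plain Python rather than TensorFlow, and it is not the reference implementation. The function names are hypothetical. The frequency base of 10000 and the sine/cosine interleaving are assumed to follow the formula in the original Transformer paper, and the mask is assumed to let each position attend to itself as well as to earlier positions, which is the usual convention.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal positional embedding: sin on even dimensions, cos on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model)
            # (assumed from the original Transformer paper).
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Element-wise sum of the input embeddings and the positional embeddings."""
    seq_len, d_model = len(embeddings), len(embeddings[0])
    pe = positional_encoding(seq_len, d_model)
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embeddings, pe)]


def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores; masked (future) entries get weight 0."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(score_row, mask_row) if m]
        peak = max(allowed)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

Because the positional table depends only on the position and the embedding size, it can be computed once and reused for every input. In the masked softmax, each row of attention weights still sums to 1, but all of the weight falls on the current and earlier positions.
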
The BERT embedding, which we have talked about in the previous chapter, is the encoder portion of a transformer network trained on sentence pairs in the same language. The BERT network comes in two flavors, both of which are somewhat larger than the reference implementation: BERT-base has 12 encoder layers, a hidden dimension of 768, and 12 attention heads on its multi-head attention layers, while BERT-large has 24 encoder layers, a hidden dimension of 1024, and 16 attention heads.

If you would like to learn more about transformers, the Illustrated Transformer blog post by Alammar [33] provides a very detailed, and very visual, guide to the structure and inner workings of this network. In addition, for those of you who prefer code, the textbook by Zhang, et al. [31] describes and builds up a working model of the transformer network using MXNet.

