
Since the entire sequence is consumed in parallel by the Encoder, information about the positions of individual elements is lost. To compensate for this, the input embeddings are augmented with a positional embedding, which is implemented as a sinusoidal function without learned parameters. The positional embedding is added to the input embedding.

The output of the Encoder is a pair of attention vectors K and V. These are sent in parallel to all the transformer blocks in the Decoder. The transformer block on the Decoder is similar to the one on the Encoder, except that it has an additional multi-head attention layer to attend to the attention vectors from the Encoder. This additional multi-head attention layer works similarly to the one in the Encoder and the one below it, except that it combines the Q vector from the layer below it with the K and V vectors from the Encoder state.

Similar to the seq2seq network, the output sequence is generated one token at a time, using the input from the previous time step. As with the input to the Encoder, the input to the Decoder is also augmented with a positional embedding. Unlike the Encoder, the self-attention process in the Decoder is only allowed to attend to tokens at previous time points. This is done by masking out tokens at future time points.

The output of the last transformer block in the Decoder is a sequence of low-dimensional embeddings (512 for the reference implementation [30], as noted earlier). This is passed to the Dense layer, which converts it into a sequence of probability distributions across the target vocabulary, from which we generate the most probable word either greedily or by a more sophisticated technique such as beam search.

This has been a fairly high-level coverage of the transformer architecture. It has achieved state-of-the-art results in some machine translation benchmarks. The BERT embedding, which we talked about in the previous chapter, is the encoder portion of a transformer network trained on sentence pairs in the same language. The BERT network comes in two flavors, both of which are somewhat larger than the reference implementation: BERT-base has 12 encoder layers, a hidden dimension of 768, and 12 attention heads in its multi-head attention layers, while BERT-large has 24 encoder layers, a hidden dimension of 1024, and 16 attention heads.

If you would like to learn more about transformers, The Illustrated Transformer blog post by Alammar [33] provides a very detailed, and very visual, guide to the structure and inner workings of this network. In addition, for those of you who prefer code, the textbook by Zhang et al. [31] describes and builds up a working model of the transformer network using MXNet.
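
To make the sinusoidal positional embedding described above concrete, here is a minimal NumPy sketch of the encoding used in the reference implementation [30]; the function name and arguments are our own, chosen for illustration rather than taken from any particular codebase.

import numpy as np

def positional_encoding(max_len, d_model):
    # angle rate for dimension i is 1 / 10000^(2*(i//2) / d_model)
    positions = np.arange(max_len)[:, np.newaxis]        # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
    angles[:, 0::2] = np.sin(angles[:, 0::2])            # sine on even dimensions
    angles[:, 1::2] = np.cos(angles[:, 1::2])            # cosine on odd dimensions
    return angles                                        # (max_len, d_model)

# The encoding is simply added to the token embeddings, for example:
# x = token_embeddings + positional_encoding(max_len, 512)[:seq_len]

Because the encoding is a fixed function of position, it adds no trainable parameters, which is what allows it to be computed once and reused for any sequence up to max_len.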
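
The masking of future time points in the Decoder's self-attention can be implemented as a triangular mask applied inside scaled dot-product attention. The following TensorFlow sketch shows one common way to do this; the helper names are illustrative and not taken from the reference implementation.

import tensorflow as tf

def look_ahead_mask(size):
    # 1s mark the future positions each query is not allowed to attend to
    return 1.0 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

def scaled_dot_product_attention(q, k, v, mask=None):
    # attention weights are softmax(Q K^T / sqrt(d_k)); masked positions are
    # pushed to a large negative value so they vanish after the softmax
    scores = tf.matmul(q, k, transpose_b=True)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = scores / tf.math.sqrt(d_k)
    if mask is not None:
        scores += mask * -1e9
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v), weights

In the Decoder's first multi-head attention layer, q, k, and v all come from the layer below and the look-ahead mask is applied; in the additional attention layer described above, q comes from the layer below while k and v come from the Encoder output, so no look-ahead mask is needed there.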
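
Finally, the greedy generation of the output sequence one token at a time can be sketched as follows. This assumes a hypothetical transformer callable that takes the encoder input and the tokens generated so far and returns logits over the target vocabulary; the names are illustrative only.

import tensorflow as tf

def greedy_decode(transformer, encoder_input, start_id, end_id, max_len=50):
    # start from the start-of-sequence token and extend one token per step
    output = tf.constant([[start_id]], dtype=tf.int64)
    for _ in range(max_len):
        logits = transformer(encoder_input, output)      # (1, cur_len, vocab_size)
        next_id = tf.argmax(logits[:, -1, :], axis=-1)   # most probable next token
        output = tf.concat([output, next_id[:, tf.newaxis]], axis=-1)
        if int(next_id[0]) == end_id:                    # stop at end-of-sequence
            break
    return output

Beam search keeps the top k partial sequences at each step instead of only the single most probable one, which usually improves translation quality at the cost of more forward passes per step.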



Summary

In this chapter, we learned about RNNs, a class of networks that are specialized for dealing with sequences such as natural language, time series, speech, and so on. Just like CNNs exploit the geometry of images, RNNs exploit the sequential structure of their inputs. We learned about the basic RNN cell, how it handles state from previous time steps, and how it suffers from vanishing and exploding gradients because of inherent problems with BPTT. We saw how these problems led to the development of novel RNN cell architectures such as LSTM, GRU, and peephole LSTMs. We also learned about some simple ways to make your RNN more effective, such as making it Bidirectional or Stateful.

We then looked at different RNN topologies, and how each topology is adapted to a particular set of problems. After a lot of theory, we finally saw examples of three of these topologies. We then focused on one of these topologies, called seq2seq, which first gained popularity in the machine translation community, but has since been used in situations where the use case can be adapted to look like a machine translation problem.

From here, we looked at attention, which started off as a way to improve the performance of seq2seq networks, but has since been used very effectively in many situations where we want to compress the representation while keeping data loss to a minimum. We looked at different kinds of attention, and saw an example of using them in a seq2seq network with attention.

Finally, we looked at the transformer network, which is basically an Encoder-Decoder architecture where the recurrent layers have been replaced with attention layers. At the time of writing, transformer networks are considered state of the art, and they are being used in an increasing number of situations.

In the next chapter, you will learn about Autoencoders, another type of Encoder-Decoder architecture that has proven to be useful in semi-supervised or unsupervised settings.

References

1. Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015). An Empirical Exploration of Recurrent Neural Network Architectures. Proceedings of the 32nd International Conference on Machine Learning (ICML).

2. Greff, K., et al. (July 2016). LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning Systems.

3. Bernal, A., Fok, S., and Pidaparthi, R. (December 2012). Financial Markets Time Series Prediction with Recurrent Neural Networks.

