Since the entire sequence is consumed in parallel by the encoder, information about the positions of individual elements is lost. To compensate for this, the input embeddings are augmented with a positional embedding, which is implemented as a sinusoidal function without learned parameters. The positional embedding is added to the input embedding.

The output of the encoder is a pair of attention vectors K and V. These are sent in parallel to all the transformer blocks in the decoder. The transformer block on the decoder side is similar to the one on the encoder side, except that it has an additional multi-head attention layer to attend to the attention vectors from the encoder. This additional multi-head attention layer works similarly to the one in the encoder and the one below it, except that it combines the Q vector from the layer below it with the K and V vectors from the encoder state.

Similar to the seq2seq network, the output sequence is generated one token at a time, using the input from the previous time step. As with the input to the encoder, the input to the decoder is also augmented with a positional embedding. Unlike the encoder, the self-attention process in the decoder is only allowed to attend to tokens at previous time points. This is done by masking out tokens at future time points.

The output of the last transformer block in the decoder is a sequence of low-dimensional embeddings (512 for the reference implementation [30], as noted earlier). This is passed to the Dense layer, which converts it into a sequence of probability distributions across the target vocabulary, from which we generate the most probable word either greedily or by a more sophisticated technique such as beam search.

This has been a fairly high-level coverage of the transformer architecture. It has achieved state-of-the-art results on some machine translation benchmarks. The BERT embedding, which we talked about in the previous chapter, is the encoder portion of a transformer network trained on sentence pairs in the same language. The BERT network comes in two flavors, both of which are somewhat larger than the reference implementation: BERT-base has 12 encoder layers, a hidden dimension of 768, and 12 attention heads in its multi-head attention layers, while BERT-large has 24 encoder layers, a hidden dimension of 1024, and 16 attention heads.

If you would like to learn more about transformers, the Illustrated Transformer blog post by Alammar [33] provides a very detailed, and very visual, guide to the structure and inner workings of this network. In addition, for those of you who prefer code, the textbook by Zhang, et al. [31] describes and builds up a working model of the transformer network using MXNet.
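To make the mechanisms described above more concrete, the following is a minimal TensorFlow 2 sketch of the sinusoidal positional encoding, the look-ahead mask used by the decoder's self-attention, and the scaled dot-product attention at the heart of each attention layer. This is an illustrative sketch rather than the reference code: the function names (positional_encoding, look_ahead_mask, scaled_dot_product_attention) and the toy shapes are our own choices, and the learned per-head projections of Q, K, and V are omitted for brevity; only d_model = 512 follows the reference implementation [30].

import numpy as np
import tensorflow as tf

def positional_encoding(max_len, d_model):
    # Sinusoidal positional encoding: fixed, no learned parameters
    pos = np.arange(max_len)[:, np.newaxis]              # (max_len, 1)
    i = np.arange(d_model)[np.newaxis, :]                # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / np.float32(d_model))
    angles = pos * angle_rates                            # (max_len, d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])             # even dimensions: sine
    angles[:, 1::2] = np.cos(angles[:, 1::2])             # odd dimensions: cosine
    return tf.constant(angles, dtype=tf.float32)

def look_ahead_mask(size):
    # 1.0 marks the future positions that decoder self-attention must not see
    return 1.0 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)

def scaled_dot_product_attention(q, k, v, mask=None):
    # attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(d_k)
    if mask is not None:
        scores += mask * -1e9        # drive masked positions to ~zero weight
    weights = tf.nn.softmax(scores, axis=-1)
    return tf.matmul(weights, v)

# Usage: add positions to the embeddings, then run masked self-attention
batch, seq_len, d_model = 2, 10, 512
x = tf.random.uniform((batch, seq_len, d_model))
x = x + positional_encoding(seq_len, d_model)[tf.newaxis, :, :]
out = scaled_dot_product_attention(x, x, x, mask=look_ahead_mask(seq_len))
print(out.shape)    # (2, 10, 512)

In the decoder's masked self-attention, q, k, and v all come from the decoder input and the look-ahead mask is applied; in the additional encoder-decoder attention layer, q comes from the decoder layer below while k and v come from the encoder output, and no look-ahead mask is needed.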
Summary
In this chapter, we learned about RNNs, a class of networks that are specialized
for dealing with sequences such as natural language, time series, speech, and so
on. Just like CNNs exploit the geometry of images, RNNs exploit the sequential
structure of their inputs. We learned about the basic RNN cell and how it handles
state from previous time steps, and how it suffers from vanishing and exploding
gradients because of inherent problems with BPTT. We saw how these problems
led to the development of novel RNN cell architectures such as LSTM, GRU, and
peephole LSTMs. We also learned about some simple ways to make your RNN
more effective, such as making it Bidirectional or Stateful.
We then looked at different RNN topologies, and how each topology is adapted to
a particular set of problems. After a lot of theory, we finally saw examples of three
of these topologies. We then focused on one of these topologies, called seq2seq,
which first gained popularity in the machine translation community, but has since
been used in situations where the use case can be adapted to look like a machine
translation problem.
From here, we looked at attention, which started off as a way to improve the
performance of seq2seq networks, but has since been used very effectively in many
situations where we want to compress the representation while keeping data loss
to a minimum. We looked at different kinds of attention, and saw an example of
using attention in a seq2seq network.
Finally, we looked at the transformer network, which is basically an Encoder-
Decoder architecture where the recurrent layers have been replaced with Attention
layers. At the time of writing this, transformer networks are considered state of
the art, and they are being increasingly used in many situations.
In the next chapter, you will learn about Autoencoders, another type of Encoder-
Decoder architecture that has proven to be useful in semi-supervised or
unsupervised settings.
References
1. Jozefowicz, R., Zaremba, W., and Sutskever, I. (2015). An Empirical Exploration
of Recurrent Network Architectures. Proceedings of the 32nd International Conference on Machine Learning (ICML).
2. Greff, K., et al. (July 2016). LSTM: A Search Space Odyssey. IEEE Transactions
on Neural Networks and Learning Systems.
3. Bernal, A., Fok, S., and Pidaparthi, R. (December 2012). Financial Markets Time
Series Prediction with Recurrent Neural Networks.