Figure 4: Different stages of training ULMFit embeddings (Howard and Ruder, 2018)

The main difference between using a language model as a word embedding and more traditional embeddings is that traditional embeddings are applied as a single initial transformation on the data and are then fine-tuned for specific tasks. In contrast, language models are trained on large external corpora and represent a model of a particular language, say English. This step is called pretraining. The computing cost of pretraining these language models is usually fairly high; however, the people who pretrain these models generally make them available for others to use, so we usually do not need to worry about this step. The next step is to fine-tune these general-purpose language models for your particular application domain. For example, if you are working in the travel or healthcare industry, you would fine-tune the language model with text from your own domain. Fine-tuning involves retraining the last few layers with your own text. Once fine-tuned, you can reuse this model for multiple tasks within your domain. The fine-tuning step is generally much less expensive than the pretraining step.
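The following is a minimal Keras sketch of this freeze-and-retrain pattern. The small LSTM language model below is only a stand-in for a real pretrained model whose published weights you would normally load; the vocabulary size, layer sizes, and the domain_x/domain_y arrays are illustrative assumptions, not part of ULMFit itself.

import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, SEQ_LEN = 10000, 128, 40

# Stand-in for a general-purpose pretrained language model; in practice you
# would load published weights rather than train this from scratch.
language_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, input_length=SEQ_LEN),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax")  # next-word prediction
])

# Domain fine-tuning: freeze the lower layers and retrain only the last few
# on text from your own domain (travel, healthcare, and so on).
for layer in language_model.layers[:-2]:
    layer.trainable = False

language_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# domain_x: integer-encoded domain sentences, shape (num_examples, SEQ_LEN)
# domain_y: the same sentences shifted left by one token (next-word targets)
# language_model.fit(domain_x, domain_y, epochs=2, batch_size=32)

Because only the top layers remain trainable, this step needs far fewer updates than pretraining, which is why fine-tuning is so much cheaper.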
Once you have the fine-tuned language model, you remove its last layer and replace it with a one- to two-layer fully connected network that converts the language model embedding for your input into the final categorical or regression output that your task needs; a sketch of this head-swap appears at the end of this section. The idea is identical to the transfer learning you learned about in Chapter 5, Advanced Convolutional Neural Networks; the only difference here is that you are doing transfer learning on text instead of images. As with transfer learning on images, these language model-based embeddings allow us to get surprisingly good results with very little labeled data. Not surprisingly, language model embeddings have been referred to as the "ImageNet moment" for natural language processing.

The language model-based embedding idea has its roots in the ELMo [28] network, which you have already seen in this chapter. ELMo learns about its language by being trained on a large text corpus to predict the next and previous words given a sequence of words. ELMo is based on a bidirectional LSTM, which you will learn more about in Chapter 9, Autoencoders.

The first viable language model embedding was proposed by Howard and Ruder [27] via their Universal Language Model Fine-Tuning (ULMFit) model, which was trained on the wikitext-103 dataset consisting of 28,595 Wikipedia articles and 103 million words. ULMFit provides the same benefits that transfer learning provides for image tasks: better results on supervised learning tasks with comparatively less labeled data.

Meanwhile, the transformer architecture had become the preferred network for machine translation tasks, replacing the LSTM network because it allows for parallel operation and handles long-term dependencies better. We will learn more about the transformer architecture in the next chapter. The OpenAI team of Radford et al. [30] proposed using the decoder stack from the standard transformer network instead of the LSTM network used in ULMFit. Using this, they built a language model embedding called Generative Pretraining (GPT) that achieved state-of-the-art results on many language processing tasks. The paper proposes several configurations for supervised tasks involving single- and multi-sentence inputs, such as classification, entailment, similarity, and multiple-choice question answering.

The OpenAI team later followed this up by building an even larger language model called GPT-2, which they initially chose not to release to the public because of fears of the technology being misused by malicious operators [31]. Instead, they released a smaller model for researchers to experiment with.

One problem with the OpenAI transformer architecture is that it is unidirectional, whereas its predecessors ELMo and ULMFit were bidirectional. Bidirectional Encoder Representations from Transformers (BERT), proposed by the Google AI team [29], uses the encoder stack of the transformer architecture and achieves bidirectionality safely by masking about 15% of its input tokens, which it then asks the model to predict.
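The head-swap described at the start of this section can be sketched in a few lines of Keras. This is a hedged sketch rather than the book's own code: it reuses the stand-in language_model from the earlier fine-tuning sketch, and the three-class task, pooling choice, and layer sizes are made up for illustration.

import tensorflow as tf

# Everything except the final next-word softmax acts as a text encoder;
# reusing the existing layer objects keeps the fine-tuned weights.
encoder = tf.keras.Sequential(language_model.layers[:-1])
encoder.trainable = False  # set to True to fine-tune the encoder end to end

NUM_CLASSES = 3  # hypothetical task, e.g. three-way sentiment

classifier = tf.keras.Sequential([
    encoder,
    tf.keras.layers.GlobalAveragePooling1D(),      # pool per-token states into one vector
    tf.keras.layers.Dense(64, activation="relu"),  # the one- to two-layer head
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")
])

classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
# classifier.fit(task_x, task_y, epochs=3)  # small labeled, task-specific dataset

The same encoder can be reused with different heads for other tasks in the same domain, which is exactly the reuse the text describes.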
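Finally, the masking idea behind BERT can be illustrated with a few lines of NumPy. The token IDs and the [MASK] ID below are invented for the example, and BERT's real procedure is slightly more involved (some selected tokens are left unchanged or replaced with random words), but the core of the pretraining objective is this: hide roughly 15% of the input and predict what was hidden.

import numpy as np

rng = np.random.default_rng(42)
MASK_ID = 103  # hypothetical ID for the [MASK] token
token_ids = np.array([7, 2023, 2003, 1037, 7099, 6251, 102])  # one encoded sentence

mask = rng.random(token_ids.shape) < 0.15      # select roughly 15% of positions
masked_input = np.where(mask, MASK_ID, token_ids)

# The training target is the original ID at each masked position; the
# remaining positions are marked so the loss ignores them.
labels = np.where(mask, token_ids, -1)
print(masked_input)
print(labels)

Because the model sees the unmasked tokens on both sides of each hidden position, it must use left and right context together, which is how the encoder stack achieves bidirectionality without trivially "seeing" the word it is asked to predict.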