
Figure 4: Different stages of training ULMFit embeddings (Howard and Ruder, 2018)

The main difference between using a language model as a word embedding and more traditional embeddings is that traditional embeddings are applied as a single initial transformation on the data and are then fine-tuned for specific tasks. In contrast, language models are trained on large external corpora and represent a model of a particular language, say English. This step is called pretraining. The computing cost to pretrain these language models is usually fairly high; however, the people who pretrain these models generally make them available for use by others, so we usually do not need to worry about this step. The next step is to fine-tune these general-purpose language models for your particular application domain. For example, if you are working in the travel or healthcare industry, you would fine-tune the language model with text from your own domain. Fine-tuning involves retraining the last few layers with your own text. Once fine-tuned, you can reuse this model for multiple tasks within your domain. The fine-tuning step is generally much less expensive compared to the pretraining step.
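To make this workflow concrete, here is a minimal Keras sketch of the fine-tuning step: freeze all but the last few layers of a language model and continue training it on domain text. The architecture, vocabulary size, and the randomly generated arrays standing in for a domain corpus are illustrative assumptions, not code from the chapter; in practice you would load a real pretrained model, for example with tf.keras.models.load_model().

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE, SEQ_LEN = 10000, 40

# Stand-in for a pretrained language model; in practice you would load
# one instead, e.g. lm = tf.keras.models.load_model("pretrained_lm.h5").
lm = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])

# Fine-tuning: freeze everything except the last few layers, then
# continue training on (context, next-word) pairs from your own domain.
for layer in lm.layers[:-2]:
    layer.trainable = False

lm.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
           loss="sparse_categorical_crossentropy")

# Random integer-encoded data stands in for the domain corpus here.
domain_sequences = np.random.randint(0, VOCAB_SIZE, size=(512, SEQ_LEN))
domain_targets = np.random.randint(0, VOCAB_SIZE, size=(512,))
lm.fit(domain_sequences, domain_targets, batch_size=32, epochs=1)
```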

Once you have the fine-tuned language model, you remove its last layer and replace it with a one- to two-layer fully connected network that converts the language model embedding for your input into the final categorical or regression output that your task needs. The idea is identical to the transfer learning you learned about in Chapter 5, Advanced Convolutional Neural Networks; the only difference is that here you are doing transfer learning on text instead of images. As with transfer learning on images, these language model-based embeddings allow us to get surprisingly good results with very little labeled data. Not surprisingly, language model embeddings have been referred to as the "ImageNet moment" for natural language processing.
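Continuing the illustrative model from the previous sketch, the head-swapping step can look like the following: drop the language model's output layer, keep the rest as a frozen feature extractor, and add a small fully connected classifier. The layer sizes and the three-class output are assumptions made for the example.

```python
# Reuse the fine-tuned language model as a feature extractor by taking
# the output just below its final next-word prediction layer...
backbone = tf.keras.Model(inputs=lm.input, outputs=lm.layers[-2].output)
backbone.trainable = False

# ...and replace that last layer with a small fully connected head for
# the downstream task, here a hypothetical 3-class classifier.
task_model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

task_model.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```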

The language model-based embedding idea has its roots in the ELMo [28] network, which you have already seen in this chapter. ELMo learns about its language by being trained on a large text corpus to predict the next and previous words given a sequence of words. ELMo is based on a bidirectional LSTM, which you will learn more about in Chapter 9, Autoencoders.
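The chapter does not give code for ELMo, but the bidirectional recurrent encoding it relies on can be sketched with Keras' Bidirectional LSTM wrapper. Note that the real ELMo trains separate forward and backward language models and mixes their layer outputs, so this is only a rough analogue, with illustrative vocabulary and layer sizes.

```python
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM = 10000, 128

# A contextual encoder in the spirit of ELMo: each token representation
# depends on the words both to its left and to its right.
inputs = tf.keras.Input(shape=(None,), dtype="int32")
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True))(x)
contextual = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True))(x)

encoder = tf.keras.Model(inputs, contextual)
encoder.summary()
```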

The first viable language model embedding was proposed by Howard and Ruder [27] via their Universal Language Model Fine-Tuning (ULMFit) model, which was trained on the wikitext-103 dataset consisting of 28,595 Wikipedia articles and 103 million words. ULMFit provides the same benefits that transfer learning provides for image tasks: better results on supervised learning tasks with comparatively less labeled data.
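ULMFit itself ships with the fastai library rather than TensorFlow. Assuming fastai v2 is installed, fine-tuning its pretrained AWD-LSTM language model on the small IMDB sample bundled with the library looks roughly like this; the dataset, column name, and epoch count are just example choices.

```python
import pandas as pd
from fastai.text.all import *

# Small IMDB sample bundled with fastai, used as a stand-in corpus.
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path / "texts.csv")

# Build language-model DataLoaders (is_lm=True) and fine-tune the
# pretrained AWD-LSTM on this text.
dls = TextDataLoaders.from_df(df, text_col="text", is_lm=True)
learn = language_model_learner(dls, AWD_LSTM, metrics=Perplexity())
learn.fine_tune(1)
```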

Meanwhile, the transformer architecture had become the preferred network for machine translation tasks, replacing the LSTM network because it allows for parallel operations and better handling of long-term dependencies. We will learn more about the transformer architecture in the next chapter. The OpenAI team of Radford et al. [30] proposed using the decoder stack from the standard transformer network instead of the LSTM network used in ULMFit. Using this, they built a language model embedding called Generative Pretraining (GPT) that achieved state-of-the-art results on many language processing tasks. The paper proposes several configurations for supervised tasks involving single- and multi-sentence inputs, such as classification, entailment, similarity, and multiple-choice question answering.

The OpenAI team later followed this up by building an even larger language model called GPT-2, which they initially declined to release to the public because of fears of the technology being misused by malicious operators [31]. Instead, they released a smaller model for researchers to experiment with.
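The released GPT and GPT-2 checkpoints are easiest to experiment with through the Hugging Face transformers library, which is not otherwise used in this chapter. Assuming it is installed, a minimal TensorFlow sketch that loads the small GPT-2 model and continues a prompt might look like this; the prompt and sampling settings are arbitrary.

```python
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

# Load the small released GPT-2 checkpoint and its tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

# Encode a prompt and let the language model continue it.
inputs = tokenizer("The flight to Singapore was", return_tensors="tf")
output_ids = model.generate(inputs["input_ids"], max_length=30,
                            do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```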

One problem with the OpenAI transformer architecture is that it is unidirectional, whereas its predecessors ELMo and ULMFit were bidirectional. Bidirectional Encoder Representations from Transformers (BERT), proposed by the Google AI team [29], uses the encoder stack of the transformer architecture and achieves bidirectionality safely by masking 15% of its input tokens, which it asks the model to predict.
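To make the masking idea concrete, the snippet below randomly selects roughly 15% of the token positions in a toy batch and replaces them with a [MASK] id; during pretraining, BERT is asked to predict the original tokens at exactly these positions. The vocabulary size and mask id used here are hypothetical values for this sketch, not BERT's actual vocabulary.

```python
import tensorflow as tf

VOCAB_SIZE = 30000
MASK_ID = 103          # hypothetical [MASK] token id for this sketch
MASK_FRACTION = 0.15   # fraction of input tokens hidden from the model

# A toy batch of integer-encoded token ids (4 sequences of length 16).
token_ids = tf.random.uniform((4, 16), maxval=VOCAB_SIZE, dtype=tf.int32)

# Choose roughly 15% of the positions at random and replace them with
# [MASK]; the hidden originals become the prediction targets.
mask = tf.random.uniform(tf.shape(token_ids)) < MASK_FRACTION
masked_inputs = tf.where(mask, tf.fill(tf.shape(token_ids), MASK_ID), token_ids)
targets = tf.where(mask, token_ids, tf.fill(tf.shape(token_ids), -1))  # -1 marks positions ignored in the loss
```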

