
Word Embeddings

Once you have the fine-tuned language model, you remove the last layer of the language model and replace it with a one- to two-layer fully connected network that converts the language model embedding for your input into the final categorical or regression output that your task needs. The idea is identical to the transfer learning you learned about in Chapter 5, Advanced Convolutional Neural Networks; the only difference is that here you are doing transfer learning on text instead of images. As with transfer learning on images, these language model-based embeddings allow us to get surprisingly good results with very little labeled data. Not surprisingly, language model embeddings have been referred to as the "ImageNet moment" for natural language processing.

The language model-based embedding idea has its roots in the ELMo [28] network, which you have already seen in this chapter. ELMo learns about language by being trained on a large text corpus to predict the next and previous words given a sequence of words. ELMo is based on a bidirectional LSTM, which you will learn more about in Chapter 9, Autoencoders.
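The following is a rough sketch of that training setup, not ELMo itself: two separate LSTMs read the sequence left-to-right and right-to-left, each predicts the word in its own direction, and their hidden states are concatenated to form the contextual embedding (all names and sizes here are illustrative assumptions):

import tensorflow as tf

vocab_size, max_len, embed_dim, hidden = 20000, 64, 128, 256

tokens = tf.keras.Input(shape=(max_len,), dtype="int32")
embedded = tf.keras.layers.Embedding(vocab_size, embed_dim)(tokens)

# Forward LSTM sees only the left context and is trained to predict the next
# word; backward LSTM sees only the right context and predicts the previous
# word. go_backwards reverses the scan, so we flip its output back in time.
fwd = tf.keras.layers.LSTM(hidden, return_sequences=True)(embedded)
bwd = tf.keras.layers.LSTM(hidden, return_sequences=True,
                           go_backwards=True)(embedded)
bwd = tf.keras.layers.Lambda(lambda t: tf.reverse(t, axis=[1]))(bwd)

# Per-direction softmax over the vocabulary: next-word and previous-word
# prediction targets come from shifting the input sequence appropriately.
next_word = tf.keras.layers.Dense(vocab_size, activation="softmax")(fwd)
prev_word = tf.keras.layers.Dense(vocab_size, activation="softmax")(bwd)
bilm = tf.keras.Model(tokens, [next_word, prev_word])

# After training, the concatenated hidden states serve as contextual
# word embeddings for downstream tasks.
contextual = tf.keras.layers.Concatenate()([fwd, bwd])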

The first viable language model embedding was proposed by Howard and Ruder [27] via their Universal Language Model Fine-Tuning (ULMFit) model, which was trained on the wikitext-103 dataset consisting of 28,595 Wikipedia articles and 103 million words. ULMFit provides the same benefits that transfer learning provides for image tasks: better results on supervised learning tasks with comparatively little labeled data.

Meanwhile, the Transformer architecture had become the preferred network for machine translation tasks, replacing the LSTM network because it allows for parallel operations and better handling of long-term dependencies. We will learn more about the Transformer architecture in the next chapter. The OpenAI team of Radford et al. [30] proposed using the decoder stack from the standard Transformer network instead of the LSTM network used in ULMFit. Using this, they built a language model embedding called Generative Pretraining (GPT) that achieved state-of-the-art results on many language processing tasks. The paper proposes several configurations for supervised single- and multi-sentence tasks such as classification, entailment, similarity, and multiple-choice question answering.

The OpenAI team later followed this up by building an even larger language model called GPT-2, which they initially chose not to release to the public because of fears of the technology being misused by malicious operators [31]. Instead, they released a smaller model for researchers to experiment with.

One problem with the OpenAI Transformer architecture is that it is unidirectional, whereas its predecessors ELMo and ULMFit were bidirectional. Bidirectional Encoder Representations from Transformers (BERT), proposed by the Google AI team [29], uses the encoder stack of the Transformer architecture and achieves bidirectionality safely by masking up to 15% of its input tokens and training the model to predict them.
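As a rough illustration of the masking step only (BERT's actual preprocessing also replaces some selected tokens with random words or leaves them unchanged), the sketch below hides roughly 15% of the token positions behind a hypothetical [MASK] id and keeps the original ids as the prediction targets:

import tensorflow as tf

MASK_ID = 103     # hypothetical id of the [MASK] token in the vocabulary
MASK_PROB = 0.15  # fraction of input positions hidden from the model

def mask_tokens(token_ids):
    """Randomly mask ~15% of positions; return (masked inputs, targets, mask)."""
    token_ids = tf.convert_to_tensor(token_ids, dtype=tf.int32)
    # Decide independently for each position whether to mask it.
    mask = tf.random.uniform(tf.shape(token_ids)) < MASK_PROB
    masked = tf.where(mask, tf.fill(tf.shape(token_ids), MASK_ID), token_ids)
    # The loss is computed only at the masked positions, where the encoder
    # must recover the original token from context on both sides.
    return masked, token_ids, mask

# Example: a batch containing one toy sequence of token ids.
inputs, targets, mask = mask_tokens([[5, 17, 42, 8, 99, 23, 7, 64]])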
