Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide


Pre-training Tasks

Masked Language Model (MLM)

BERT is said to be an autoencoding model because it is a Transformer encoder and because it was trained to "reconstruct" sentences from corrupted inputs (it does not reconstruct the entire input but predicts the corrected words instead). That's the masked language model (MLM) pre-training task.

In the "Language Model" section we saw that the goal of a language model is to estimate the probability of a token or a sequence of tokens or, simply put, to predict the tokens more likely to fill in a blank. That looks like a perfect task for a Transformer decoder, right?

"But BERT is an encoder…"

Well, yeah, but who said the blank must be at the end? In the continuous bag-of-words (CBoW) model, the blank was the word in the center, and the remaining words were the context. In a way, that's what the MLM task is doing: It is randomly choosing words to be masked as blanks in a sentence. BERT then tries to predict the correct words that fill in the blanks.

Actually, it's a bit more structured than that:

• 80% of the time, it masks 15% of the tokens at random: "Alice followed the [MASK] rabbit."
• 10% of the time, it replaces 15% of the tokens with some other random word: "Alice followed the watch rabbit."
• The remaining 10% of the time, the tokens are unchanged: "Alice followed the white rabbit."

The target is the original sentence: "Alice followed the white rabbit." This way, the model effectively learns to reconstruct the original sentence from corrupted inputs; that is, inputs containing missing (masked) or randomly replaced words.

This is the perfect use case (besides padding) for the source mask argument of the Transformer encoder.
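The corruption procedure above can be sketched in a few lines of code. The mask_tokens() helper below is purely illustrative (it is neither BERT's nor HuggingFace's actual implementation and, for simplicity, it does not spare special tokens like [CLS] and [SEP] from masking): it selects roughly 15% of the positions at random and then applies the 80/10/10 rule to them.

import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15):
    # Illustrative sketch of the MLM corruption procedure (not the real code)
    labels = input_ids.clone()
    # Select roughly 15% of the positions at random
    selected = torch.rand(input_ids.shape) < mlm_probability
    # Positions that were NOT selected get -100 so the loss ignores them later
    labels[~selected] = -100

    # 80% of the selected positions are replaced by the [MASK] token
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[masked] = mask_token_id

    # Half of the remaining 20% (that is, 10%) get some other random token
    random_repl = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    random_tokens = torch.randint(vocab_size, input_ids.shape)
    input_ids[random_repl] = random_tokens[random_repl]

    # The remaining 10% of the selected positions are left unchanged
    return input_ids, labels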


Figure 11.26 - Pre-training task—masked language model (MLM)

Also, notice that BERT computes logits for the randomly masked inputs only. The remaining inputs are not even considered for computing the loss.
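In practice, this is usually implemented by assigning the label -100 to every position that was not selected for masking: that is the value PyTorch's nn.CrossEntropyLoss ignores by default (its ignore_index argument), and it is also the value the data collator we'll meet below assigns to those positions. Here is a minimal sketch using made-up logits and labels (the two token IDs are borrowed from the output further down):

import torch
import torch.nn as nn

vocab_size = 30522  # bert-base-uncased vocabulary size

# Made-up logits for a batch with one sentence of five tokens
logits = torch.randn(1, 5, vocab_size)
# Only the third and fifth positions were selected for masking;
# every other position gets the ignored label (-100)
labels = torch.tensor([[-100, -100, 2317, -100, 10442]])

loss_fn = nn.CrossEntropyLoss()  # ignore_index defaults to -100
# CrossEntropyLoss expects (N, C) logits and (N,) targets
loss = loss_fn(logits.view(-1, vocab_size), labels.view(-1))

Only the two labeled positions contribute to the loss; everything else is simply skipped.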

"OK, but how can we randomly replace tokens like that?"

One alternative, similar to the way we do data augmentation for images, would be to implement a custom dataset that performs the replacements on the fly in the __getitem__() method. There is a better alternative, though: using a collate function or, better yet, a data collator. There's a data collator that performs the replacement procedure prescribed by BERT: DataCollatorForLanguageModeling.
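Just to make the first alternative concrete, here is a minimal, hypothetical sketch of such a dataset (MaskedSentenceDataset is an illustrative name, not part of any library); it draws a fresh corruption, for example using a function like the mask_tokens() sketch above, every time an item is fetched:

from torch.utils.data import Dataset

class MaskedSentenceDataset(Dataset):
    # Hypothetical dataset that corrupts the inputs on the fly in __getitem__()
    def __init__(self, input_ids, corrupt_fn):
        self.input_ids = input_ids    # tensor of shape (n_sentences, seq_len)
        self.corrupt_fn = corrupt_fn  # e.g., a function like mask_tokens() above

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # A fresh, random corruption is drawn every time the item is fetched
        corrupted, labels = self.corrupt_fn(self.input_ids[idx].clone())
        return {'input_ids': corrupted, 'labels': labels}

The data collator, though, also takes care of batching (and padding) for us, so that's the approach we'll follow here.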

Let’s see an example of it in action, starting with an input sentence:

sentence = 'Alice is inexplicably following the white rabbit'
tokens = bert_tokenizer(sentence)
tokens['input_ids']

Output

[101, 5650, 2003, 1999, 10288, 24759, 5555, 6321, 2206, 1996, 2317, 10442, 102]

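Here is a sketch of what the data collator could do with the tokens above (it reuses the bert_tokenizer and tokens from the snippet above; since the masking is random, the output will vary from run to run): it returns a batch containing the corrupted input_ids and the corresponding labels, with -100 assigned to every position that was not selected.

import torch
from transformers import DataCollatorForLanguageModeling

torch.manual_seed(41)  # masking is random; seeding only for reproducibility
# The collator implements the 80/10/10 corruption procedure described above
data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.15
)
mlm_batch = data_collator([tokens])
mlm_batch['input_ids'], mlm_batch['labels']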
