Pre-training Tasks

Masked Language Model (MLM)

BERT is said to be an autoencoding model because it is a Transformer encoder and because it was trained to "reconstruct" sentences from corrupted inputs (it does not reconstruct the entire input, but only predicts the original words at the corrupted positions). That's the masked language model (MLM) pre-training task.

In the "Language Model" section, we saw that the goal of a language model is to estimate the probability of a token or a sequence of tokens or, simply put, to predict the tokens most likely to fill in a blank. That looks like a perfect task for a Transformer decoder, right?

"But BERT is an encoder…"

Well, yeah, but who said the blank must be at the end? In the continuous bag-of-words (CBoW) model, the blank was the word in the center, and the remaining words were the context. In a way, that's what the MLM task is doing: It randomly chooses words to be masked as blanks in a sentence, and BERT then tries to predict the correct words that fill in the blanks.

Actually, it's a bit more structured than that. First, 15% of the tokens are randomly selected to be corrupted; then, for each selected token:

• 80% of the time, it is replaced with the special [MASK] token: "Alice followed the [MASK] rabbit."
• 10% of the time, it is replaced with some other random word: "Alice followed the watch rabbit."
• The remaining 10% of the time, it is left unchanged: "Alice followed the white rabbit."

The target is the original sentence: "Alice followed the white rabbit." This way, the model effectively learns to reconstruct the original sentence from corrupted inputs (with words either masked or randomly replaced); a small sketch of this corruption procedure is shown right after the aside below.

This is the perfect use case (besides padding) for the source mask argument of the Transformer encoder.
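To make the procedure concrete, here is a minimal sketch (not the book's implementation) of how the 80/10/10 corruption rule could be applied to a batch of token IDs using plain PyTorch. The mask_token_id, vocab_size, and the toy input_ids are assumptions for illustration only:

import torch

torch.manual_seed(42)

# toy batch of token IDs (assumed values, for illustration only)
input_ids = torch.randint(1000, 2000, (2, 10))
labels = input_ids.clone()

mask_token_id = 103   # assumed ID of the [MASK] token
vocab_size = 30522    # assumed vocabulary size

# step 1: randomly select 15% of the tokens to be corrupted
selected = torch.rand(input_ids.shape) < .15

# step 2: 80% of the selected tokens are replaced with [MASK]
masked = selected & (torch.rand(input_ids.shape) < .8)
input_ids[masked] = mask_token_id

# step 3: half of the remaining selected tokens (10% overall)
# are replaced with a random word
random_word = selected & ~masked & (torch.rand(input_ids.shape) < .5)
input_ids[random_word] = torch.randint(vocab_size, (int(random_word.sum()),))

# the other 10% are left unchanged; the loss is computed on the
# selected positions only, so every other label is set to the
# ignore index of the cross-entropy loss
labels[~selected] = -100

The data collator discussed next performs essentially this procedure for us, so we never have to implement it by hand.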
Figure 11.26 - Pre-training task: masked language model (MLM)

Also, notice that BERT computes logits for the randomly masked inputs only. The remaining inputs are not even considered when computing the loss.

"OK, but how can we randomly replace tokens like that?"

One alternative, similar to the way we do data augmentation for images, would be to implement a custom dataset that performs the replacements on the fly in the __getitem__() method. There is a better alternative, though: using a collate function or, better yet, a data collator. There is a data collator that performs the replacement procedure prescribed by BERT: DataCollatorForLanguageModeling. Let's see an example of it in action, starting with an input sentence:

sentence = 'Alice is inexplicably following the white rabbit'
tokens = bert_tokenizer(sentence)
tokens['input_ids']

Output

[101, 5650, 2003, 1999, 10288, 24759, 5555, 6321, 2206, 1996, 2317, 10442, 102]
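Notice that the seven-word sentence became 13 token IDs: the tokenizer splits "inexplicably" into word pieces and adds the special [CLS] and [SEP] tokens. Below is a sketch of how one might feed the tokenized sentence to DataCollatorForLanguageModeling; it assumes bert_tokenizer is a pre-trained 'bert-base-uncased' tokenizer loaded with AutoTokenizer, and the mlm_probability value is simply BERT's usual 15%:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# assumed setup: the same tokenizer used above
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

sentence = 'Alice is inexplicably following the white rabbit'
tokens = bert_tokenizer(sentence)

# inspecting the word pieces behind the IDs above
bert_tokenizer.convert_ids_to_tokens(tokens['input_ids'])

# the collator implements BERT's 80/10/10 replacement procedure
data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.15
)

# it takes a list of tokenized examples and returns a batch with
# (possibly) corrupted 'input_ids' and the corresponding 'labels'
# (positions that were not selected are set to -100 and are
# therefore ignored by the loss)
batch = data_collator([tokens])
batch['input_ids'], batch['labels']

Since the replacements are drawn at random, every call to the collator may corrupt different positions, so the model sees a slightly different version of the sentence each time.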