Well, you probably don’t want to go through all this trouble—adjusting the datasets and writing a model class—to fine-tune a BERT model, right?

Say no more!

Fine-Tuning with HuggingFace

What if I told you that there is a BERT model for every task, and you just need to fine-tune it? Cool, isn’t it? Then, what if I told you that you can use a trainer to do most of the fine-tuning work for you? Amazing, right? The HuggingFace library is that good, really!

There are BERT models available for many different tasks:

• Pre-training tasks:
  ◦ Masked language model (BertForMaskedLM)
  ◦ Next sentence prediction (BertForNextSentencePrediction)
• Typical tasks (also available as AutoModel):
  ◦ Sequence classification (BertForSequenceClassification)
  ◦ Token classification (BertForTokenClassification)
  ◦ Question answering (BertForQuestionAnswering)
• BERT (and family) specific:
  ◦ Multiple choice (BertForMultipleChoice)

We’re sticking with the sequence classification task using DistilBERT instead of regular BERT so as to make the fine-tuning faster.
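Before we do, here is a quick, minimal sketch of how the classes listed above are loaded; they all follow the same from_pretrained() pattern, and the Auto classes pick the right architecture from the checkpoint’s configuration. The checkpoint and variable names below (bert-base-uncased, mlm_model, cls_model) are ours, just for illustration:

from transformers import BertForMaskedLM, AutoModelForSequenceClassification

# pre-training head: masked language model
mlm_model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# typical task head, loaded through the generic Auto class
# instead of the BERT-specific one
cls_model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2
)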
Sequence Classification (or Regression)

Let’s load the pre-trained model using its corresponding class:

Model Configuration

import torch
from transformers import DistilBertForSequenceClassification

torch.manual_seed(42)
bert_cls = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=2
)

It comes with a warning:

Output

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

It makes sense! Since ours is a binary classification task, the num_labels argument is two, which happens to be the default value. Unfortunately, at the time of writing, the documentation is not as explicit as it should be in this case. There is no mention of num_labels as a possible argument of the model, and it’s only referred to in the documentation of the forward() method of DistilBertForSequenceClassification (highlights are mine):

• labels (torch.LongTensor of shape (batch_size,), optional) – Labels for computing the sequence classification / regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (Mean-Square loss); if config.num_labels > 1, a classification loss is computed (Cross-Entropy).

Some of the return values of the forward() method also include references to the num_labels argument:

• loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.
• logits (torch.FloatTensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).

That’s right! DistilBertForSequenceClassification (or any other ForSequenceClassification model) can be used for regression too, as long as you set num_labels=1 as an argument.
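A minimal sketch of that regression use, assuming a hypothetical pair of sentences and made-up float targets (the tokenizer call and the variable names bert_reg, tokens, labels, and output are ours, not from the book):

import torch
from transformers import (DistilBertTokenizer,
                          DistilBertForSequenceClassification)

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# with num_labels=1, forward() computes a Mean-Squared-Error loss
# instead of Cross-Entropy
bert_reg = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=1
)

# hypothetical mini-batch: two sentences and two made-up float targets
tokens = tokenizer(['a first sentence', 'a second sentence'],
                   padding=True, return_tensors='pt')
labels = torch.tensor([0.7, 0.3])

# passing labels makes forward() return the loss alongside the logits
output = bert_reg(**tokens, labels=labels)
print(output.loss)          # scalar MSE loss
print(output.logits.shape)  # torch.Size([2, 1])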