Model Configuration

class BERTClassifier(nn.Module):
    def __init__(self, bert_model, ff_units,
                 n_outputs, dropout=0.3):
        super().__init__()
        self.d_model = bert_model.config.dim
        self.n_outputs = n_outputs
        self.encoder = bert_model
        self.mlp = nn.Sequential(
            nn.Linear(self.d_model, ff_units),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(ff_units, n_outputs)
        )

    def encode(self, source, source_mask=None):
        # Hidden states for every token; the first one is [CLS]
        states = self.encoder(
            input_ids=source, attention_mask=source_mask)[0]
        cls_state = states[:, 0]
        return cls_state

    def forward(self, X):
        # Padding tokens have ID zero, so the mask can be
        # recovered from the token IDs themselves
        source_mask = (X > 0)
        # Featurizer
        cls_state = self.encode(X, source_mask)
        # Classifier
        out = self.mlp(cls_state)
        return out

Both the encode() and forward() methods are roughly the same as before, but the classifier (mlp) now has both a hidden layer and a dropout layer.

Our model takes an instance of a pre-trained BERT model, the number of units in the hidden layer of the classifier, and the desired number of outputs (logits), corresponding to the number of existing classes. The forward() method takes a mini-batch of token IDs, encodes them using BERT (featurizer), and outputs logits (classifier).

"Why does the model compute the source mask itself instead of using the output from the tokenizer?"
Good catch! I know that's less than ideal, but our StepByStep class can only take a single mini-batch of inputs, and no additional information like the attention masks. Of course, we could modify our class to handle that, but HuggingFace has its own trainer (more on that soon!), so there's no point in doing so.

This is actually the last time we'll use the StepByStep class, since it requires too many adjustments to the inputs to work well with HuggingFace's tokenizers and models.

Data Preparation

To turn the sentences in our datasets into mini-batches of token IDs and labels for a binary classification task, we can create a helper function that takes a HuggingFace Dataset, the names of the fields corresponding to the sentences and labels, and a tokenizer, and builds a TensorDataset out of them:

From HF's Dataset to Tokenized TensorDataset

def tokenize_dataset(hf_dataset, sentence_field,
                     label_field, tokenizer, **kwargs):
    sentences = hf_dataset[sentence_field]
    # Keep only the token IDs; the model builds its own attention mask
    token_ids = tokenizer(
        sentences, return_tensors='pt', **kwargs
    )['input_ids']
    labels = torch.as_tensor(hf_dataset[label_field])
    dataset = TensorDataset(token_ids, labels)
    return dataset

First, we create a tokenizer and define the parameters we'll use while tokenizing the sentences:

Data Preparation

auto_tokenizer = AutoTokenizer.from_pretrained(
    'distilbert-base-uncased'
)
tokenizer_kwargs = dict(truncation=True,
                        padding=True,
                        max_length=30,
                        add_special_tokens=True)
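To see how these pieces could fit together, here is a minimal sketch. The split names (hf_train and hf_test) and the field names ('sentence' and 'labels') are assumptions for illustration only; they depend on the dataset actually being used:

Building TensorDatasets and Data Loaders (sketch)

from torch.utils.data import DataLoader

# Hypothetical HF Dataset splits with 'sentence' and 'labels' fields
train_dataset = tokenize_dataset(hf_train, 'sentence', 'labels',
                                 auto_tokenizer, **tokenizer_kwargs)
test_dataset = tokenize_dataset(hf_test, 'sentence', 'labels',
                                auto_tokenizer, **tokenizer_kwargs)

# Regular PyTorch data loaders built on top of the TensorDatasets
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)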
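Going back to the question about the source mask: since the padding token ID is zero for this tokenizer, a simple comparison against zero recovers exactly the attention mask the tokenizer would have returned. Here is a quick check (the example sentences are made up):

Recovering the Attention Mask from Token IDs (sketch)

batch = auto_tokenizer(['I loved it!', 'Not my cup of tea.'],
                       return_tensors='pt', **tokenizer_kwargs)
# [PAD] has ID zero, so every ID greater than zero is a real token
recovered_mask = (batch['input_ids'] > 0).long()
print(torch.equal(recovered_mask, batch['attention_mask']))  # True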
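Finally, a minimal sketch of how the classifier itself could be instantiated on top of a pre-trained DistilBERT model and fed a mini-batch of token IDs. The 128 hidden units and the single output logit (for binary classification with a BCE-with-logits loss) are assumptions, not a prescribed configuration:

Model Configuration (sketch)

from transformers import AutoModel

bert_model = AutoModel.from_pretrained('distilbert-base-uncased')
model = BERTClassifier(bert_model, ff_units=128, n_outputs=1)

# A tiny mini-batch of token IDs built with the tokenizer defined above
token_ids = auto_tokenizer(['I loved it!', 'Not my cup of tea.'],
                           return_tensors='pt',
                           **tokenizer_kwargs)['input_ids']
logits = model(token_ids)  # shape: (2, 1), one logit per sentence

Since the pre-trained encoder is kept as a regular attribute, fine-tuning updates both the encoder and the classification head, unless the encoder's parameters are explicitly frozen.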