Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide
The BERT model may take many other arguments, and we're using three of them to get richer outputs:

bert_model.eval()
out = bert_model(input_ids=tokens['input_ids'],
                 attention_mask=tokens['attention_mask'],
                 output_attentions=True,
                 output_hidden_states=True,
                 return_dict=True)
out.keys()

Output

odict_keys(['last_hidden_state', 'pooler_output', 'hidden_states',
'attentions'])

Let's see what's inside each of these four outputs:

• last_hidden_state is returned by default and is the most important output of all: It contains the final hidden states for each and every token in the input, which can be used as contextual word embeddings.

Figure 11.28 - Word embeddings from BERT's last layer

Don't forget that the first token is the special classifier token [CLS] and that there may be padding ([PAD]) and separator ([SEP]) tokens as well!
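Before we strip out those special tokens, a quick way to get a feel for the four outputs is to inspect their shapes and lengths. The sketch below assumes out and tokens are the objects created above and that bert_model is a BERT-base-sized model, so the counts of 12 layers and 12 heads apply to that size only:

print(out['last_hidden_state'].shape)  # (batch size, sequence length, hidden size)
print(out['pooler_output'].shape)      # (batch size, hidden size)
print(len(out['hidden_states']))       # 13 for BERT base: embedding layer + 12 encoder layers
print(len(out['attentions']))          # 12 for BERT base: one tensor per encoder layer
print(out['attentions'][0].shape)      # (batch size, attention heads, seq length, seq length)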
last_hidden_batch = out['last_hidden_state']
last_hidden_sentence = last_hidden_batch[0]
# Removes hidden states for [PAD] tokens using the mask
mask = tokens['attention_mask'].squeeze().bool()
embeddings = last_hidden_sentence[mask]
# Removes embeddings for the first [CLS] and last [SEP] tokens
embeddings[1:-1]
Output
tensor([[ 0.0100, 0.8575, -0.5429, ..., 0.4241, -0.2035],
[-0.3705, 1.1001, 0.3326, ..., 0.0656, -0.5644],
[-0.2947, 0.5797, 0.1997, ..., -0.3062, 0.6690],
...,
[ 0.0691, 0.7393, 0.0552, ..., -0.4896, -0.4832],
[-0.1566, 0.6177, 0.1536, ..., 0.0904, -0.4917],
[ 0.7511, 0.3110, -0.3116, ..., -0.1740, -0.2337]],
grad_fn=<SliceBackward>)
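If you find yourself repeating these steps, you could bundle them into a small helper. The function below is only a sketch (the name contextual_embeddings is made up, not from the book), and it assumes a Hugging Face BERT model and a tokenized input like the ones above:

import torch

def contextual_embeddings(model, tokens):
    model.eval()
    with torch.no_grad():  # no gradients needed for inference
        out = model(input_ids=tokens['input_ids'],
                    attention_mask=tokens['attention_mask'],
                    return_dict=True)
    # keep only real tokens (drop [PAD]), then drop [CLS] and [SEP]
    mask = tokens['attention_mask'].squeeze().bool()
    return out['last_hidden_state'][0][mask][1:-1]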
The flair library is doing exactly that under its hood! We can use our
get_embeddings() function to get embeddings for our sentence using the wrapper
for BERT from flair:
get_embeddings(bert_flair, sentence)
Output
tensor([[ 0.0100, 0.8575, -0.5429, ..., 0.4241, -0.2035],
[-0.3705, 1.1001, 0.3326, ..., 0.0656, -0.5644],
[-0.2947, 0.5797, 0.1997, ..., -0.3062, 0.6690],
...,
[ 0.0691, 0.7393, 0.0552, ..., -0.4896, -0.4832],
[-0.1566, 0.6177, 0.1536, ..., 0.0904, -0.4917],
[ 0.7511, 0.3110, -0.3116, ..., -0.1740, -0.2337]],
device='cuda:0')
Perfect match!
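Eyeballing the numbers is reassuring, but we can also let PyTorch check the match for us. This is just a sanity-check sketch that reuses the objects from above: the manually extracted embeddings still carry gradient information and may live on a different device than flair's output, so we detach and move them first.

manual = embeddings[1:-1].detach()
from_flair = get_embeddings(bert_flair, sentence)
# should print True if the two sets of embeddings really match
print(torch.allclose(manual.to(from_flair.device), from_flair, atol=1e-5))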