
"What about the separation token?"This special token is used to, well, separate inputs into two distinct sentences. Yes,it is possible to feed BERT with two sentences at once, and this kind of input is usedfor the next sentence prediction task. We won’t be using that in our example, butwe’ll get back to it while discussing how BERT is trained.We can actually get rid of the special tokens if we’re not using them:tokenizer.encode(new_sentence, add_special_tokens=False)Output[1219, 5, 229, 200, 1]"OK, but where is the promised additional information?"That’s easy enough—we can simply call the tokenizer itself instead of a particularmethod and it will produce an enriched output:tokenizer(new_sentence,add_special_tokens=False,return_tensors='pt')Output{'input_ids': tensor([[1219, 5, 229, 200, 1]]),'token_type_ids': tensor([[0, 0, 0, 0, 0]]),'attention_mask': tensor([[1, 1, 1, 1, 1]])}By default, the outputs are lists, but we used the return_tensors argument to getPyTorch tensors instead (pt stands for PyTorch). There are three outputs in thedictionary: input_ids, token_type_ids, and attention_mask.The first one, input_ids, is the familiar list of token IDs. They are the mostfundamental input, and sometimes the only one, required by the model.Word Tokenization | 909

The second output, token_type_ids, works as a sentence index, and it only makessense if the input has more than one sentence (and the special separation tokensbetween them). For example:sentence1 = 'follow the white rabbit neo'sentence2 = 'no one can be told what the matrix is'tokenizer(sentence1, sentence2)Output{'input_ids': [3, 1219, 5, 229, 200, 1, 2, 51, 42, 78, 32, 307, 41,5, 1, 30, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1]}Although the tokenizer received two sentences as arguments, it considered them asingle input, thus producing a single sequence of IDs. Let’s convert the IDs back totokens and inspect the result:print(tokenizer.convert_ids_to_tokens(joined_sentences['input_ids']))Output['[CLS]', 'follow', 'the', 'white', 'rabbit', '[UNK]', '[SEP]','no', 'one', 'can', 'be', 'told', 'what', 'the', '[UNK]', 'is','[SEP]']The two sentences were concatenated together with a special separation token([SEP]) at the end of each one.910 | Chapter 11: Down the Yellow Brick Rabbit Hole

"What about the separation token?"

This special token is used to, well, separate inputs into two distinct sentences. Yes, it is possible to feed BERT with two sentences at once, and this kind of input is used for the next sentence prediction task. We won’t be using that in our example, but we’ll get back to it while discussing how BERT is trained.
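By the way, if you’re curious about which special tokens our tokenizer uses (and the IDs assigned to them), you can inspect the tokenizer’s attributes. The snippet below is just a quick sketch, assuming the same tokenizer object we’ve been using:

# a quick sketch: inspecting the special tokens of our tokenizer
print(tokenizer.cls_token, tokenizer.cls_token_id)  # classification token ([CLS]) and its ID
print(tokenizer.sep_token, tokenizer.sep_token_id)  # separation token ([SEP]) and its ID
print(tokenizer.special_tokens_map)                 # the full map of special tokens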

We can actually get rid of the special tokens if we’re not using them:

tokenizer.encode(new_sentence, add_special_tokens=False)

Output

[1219, 5, 229, 200, 1]
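If you’d like to double-check exactly which tokens get dropped, you can encode the sentence both ways and convert the IDs back into tokens. This is just a sketch, assuming the same tokenizer and new_sentence as before:

# a sketch: comparing the tokens produced with and without special tokens
with_special = tokenizer.encode(new_sentence)  # add_special_tokens=True is the default
without_special = tokenizer.encode(new_sentence, add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(with_special))
print(tokenizer.convert_ids_to_tokens(without_special))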

"OK, but where is the promised additional information?"

That’s easy enough—we can simply call the tokenizer itself instead of a particular method and it will produce an enriched output:

tokenizer(new_sentence,
          add_special_tokens=False,
          return_tensors='pt')

Output

{'input_ids': tensor([[1219, 5, 229, 200, 1]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1]])}

By default, the outputs are lists, but we used the return_tensors argument to get PyTorch tensors instead (pt stands for PyTorch). There are three outputs in the dictionary: input_ids, token_type_ids, and attention_mask.
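To make the difference concrete, here is a quick sketch (assuming the same tokenizer and new_sentence): without the return_tensors argument we get plain Python lists back, and with return_tensors='pt' we get PyTorch tensors:

# a sketch: lists by default vs. PyTorch tensors with return_tensors='pt'
as_lists = tokenizer(new_sentence, add_special_tokens=False)
as_tensors = tokenizer(new_sentence, add_special_tokens=False, return_tensors='pt')
print(type(as_lists['input_ids']))    # <class 'list'>
print(type(as_tensors['input_ids']))  # <class 'torch.Tensor'>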

The first one, input_ids, is the familiar list of token IDs. They are the most fundamental input, and sometimes the only one, required by the model.
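Since the keys of that dictionary match the argument names of a Hugging Face BERT model’s forward method, we can unpack it directly into the model call. Take the snippet below as a sketch only: bert_model stands for a BERT model compatible with our tokenizer, and we’ll only get to loading one later in the chapter:

# a sketch: bert_model is assumed to be a Hugging Face BERT model
# whose vocabulary matches our tokenizer
tokens = tokenizer(new_sentence, return_tensors='pt')
outputs = bert_model(**tokens)              # unpacks input_ids, token_type_ids, attention_mask
outputs = bert_model(tokens['input_ids'])   # often, the token IDs alone are enough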

The second output, token_type_ids, works as a sentence index, and it only makes sense if the input has more than one sentence (and the special separation tokens between them). For example:

sentence1 = 'follow the white rabbit neo'
sentence2 = 'no one can be told what the matrix is'
joined_sentences = tokenizer(sentence1, sentence2)
joined_sentences

Output

{'input_ids': [3, 1219, 5, 229, 200, 1, 2, 51, 42, 78, 32, 307, 41, 5, 1, 30, 2],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Although the tokenizer received two sentences as arguments, it considered them a single input, thus producing a single sequence of IDs. Let’s convert the IDs back to tokens and inspect the result:

print(tokenizer.convert_ids_to_tokens(joined_sentences['input_ids']))

Output

['[CLS]', 'follow', 'the', 'white', 'rabbit', '[UNK]', '[SEP]', 'no', 'one', 'can', 'be', 'told', 'what', 'the', '[UNK]', 'is', '[SEP]']

The two sentences were concatenated together with a special separation token ([SEP]) at the end of each one.
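To see the sentence index in action, we can line up each token with its corresponding token_type_ids value. Once again, this is just a quick sketch, assuming the joined_sentences from above:

# a sketch: pairing each token with its sentence index (0 = first sentence, 1 = second)
tokens = tokenizer.convert_ids_to_tokens(joined_sentences['input_ids'])
for token, type_id in zip(tokens, joined_sentences['token_type_ids']):
    print(token, type_id)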
