Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub


The second output, token_type_ids, works as a sentence index, and it only makes sense if the input has more than one sentence (and the special separation tokens between them). For example:

sentence1 = 'follow the white rabbit neo'
sentence2 = 'no one can be told what the matrix is'

joined_sentences = tokenizer(sentence1, sentence2)
joined_sentences

Output

{'input_ids': [3, 1219, 5, 229, 200, 1, 2, 51, 42, 78, 32, 307, 41, 5, 1, 30, 2],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Although the tokenizer received two sentences as arguments, it considered them a single input, thus producing a single sequence of IDs. Let's convert the IDs back to tokens and inspect the result:

print(
    tokenizer.convert_ids_to_tokens(joined_sentences['input_ids'])
)

Output

['[CLS]', 'follow', 'the', 'white', 'rabbit', '[UNK]', '[SEP]',
 'no', 'one', 'can', 'be', 'told', 'what', 'the', '[UNK]', 'is',
 '[SEP]']

The two sentences were concatenated into a single sequence, with a special separation token ([SEP]) appended to the end of each one.
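Since token_type_ids assigns an index to every position in the sequence, it can also be used to split the combined sequence back into its two sentences. A minimal sketch in plain Python, using the ID lists from the output above (the grouping logic is an illustration, not part of the tokenizer's API):

```python
# IDs taken verbatim from the tokenizer output shown above
input_ids = [3, 1219, 5, 229, 200, 1, 2, 51, 42, 78, 32, 307, 41, 5, 1, 30, 2]
token_type_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Group each token ID under its sentence index
segments = {}
for token_id, sentence_idx in zip(input_ids, token_type_ids):
    segments.setdefault(sentence_idx, []).append(token_id)

# Sentence 0 spans seven positions ([CLS] + five tokens + [SEP]);
# sentence 1 spans the remaining ten (nine tokens + [SEP])
print(len(segments[0]), len(segments[1]))  # 7 10
```

Feeding each group of IDs back through convert_ids_to_tokens would recover the two tokenized sentences separately.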

910 | Chapter 11: Down the Yellow Brick Rabbit Hole
