
Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)


Excellent question! It’s actually easy: simply use two lists, one containing the first sentence of each pair, the other containing the second sentence of each pair:

first_sentences = [sentence1, 'another first sentence']
second_sentences = [sentence2, 'a second sentence here']
batch_of_pairs = tokenizer(first_sentences, second_sentences)
first_input = tokenizer.convert_ids_to_tokens(
    batch_of_pairs['input_ids'][0]
)
second_input = tokenizer.convert_ids_to_tokens(
    batch_of_pairs['input_ids'][1]
)
print(first_input)
print(second_input)

Output

['[CLS]', 'follow', 'the', 'white', 'rabbit', '[UNK]', '[SEP]',
 'no', 'one', 'can', 'be', 'told', 'what', 'the', '[UNK]', 'is',
 '[SEP]']
['[CLS]', 'another', 'first', 'sentence', '[SEP]', '[UNK]',
 'second', 'sentence', 'here', '[SEP]']

The batch above has only two inputs, and each input has two sentences.
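The pairing logic itself can be sketched in plain Python. The function below is a simplified stand-in, not the real BERT tokenizer: it splits on whitespace instead of using a vocabulary, and it also builds the token type IDs that mark which segment each token belongs to.

```python
# Toy sketch of sentence-pair assembly: [CLS] A [SEP] B [SEP],
# plus token_type_ids (0 = first segment, 1 = second segment).
# This is an illustration only, not the actual tokenizer's logic.
def toy_pair_encode(first, second):
    tokens_a = first.lower().split()
    tokens_b = second.lower().split()
    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    # segment 0 covers [CLS], sentence A, and its [SEP];
    # segment 1 covers sentence B and the final [SEP]
    token_type_ids = ([0] * (len(tokens_a) + 2)
                      + [1] * (len(tokens_b) + 1))
    return tokens, token_type_ids

tokens, type_ids = toy_pair_encode('another first sentence',
                                   'a second sentence here')
print(tokens)
# ['[CLS]', 'another', 'first', 'sentence', '[SEP]',
#  'a', 'second', 'sentence', 'here', '[SEP]']
print(type_ids)
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

The real tokenizer returns these same token type IDs in the `token_type_ids` field of its output, alongside `input_ids`.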

Finally, let’s apply our tokenizer to our dataset of sentences, padding them and returning PyTorch tensors:

tokenized_dataset = tokenizer(dataset['sentence'],
                              padding=True,
                              return_tensors='pt',
                              max_length=50,
                              truncation=True)
tokenized_dataset['input_ids']
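What `padding=True` boils down to can also be sketched in plain Python. The function below is a simplified stand-in for the real tokenizer's padding step (it assumes a pad token ID of zero, which happens to match BERT's `[PAD]`): every sequence is extended to the length of the longest one in the batch, and an attention mask records which positions hold real tokens.

```python
# Toy sketch of batch padding: sequences of token IDs are padded
# with pad_id up to the longest sequence in the batch, and the
# attention mask flags real tokens (1) versus padding (0).
def toy_pad_batch(sequences, pad_id=0):
    max_len = max(len(ids) for ids in sequences)
    input_ids, attention_mask = [], []
    for ids in sequences:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = toy_pad_batch([[101, 2003, 102], [101, 102]])
print(ids)   # [[101, 2003, 102], [101, 102, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

The real tokenizer returns the mask in the `attention_mask` field of its output, so the model can ignore the padded positions.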

912 | Chapter 11: Down the Yellow Brick Rabbit Hole
