
Let’s retrieve the embeddings for the words in the first sentence of our training set:

sentence = train_dataset[0]['sentence']
sentence

Output

'And, so far as they knew, they were quite right.'

First, we need to tokenize it:

tokens = bert_tokenizer(sentence,
                        padding='max_length',
                        max_length=30,
                        truncation=True,
                        return_tensors="pt")
tokens

Output

{'input_ids': tensor([[ 101, 1998, 1010, 2061, 2521, 2004, 2027,
         2354, 1010, 2027, 2020, 3243, 2157, 1012,  102,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
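IDs 101 and 102 are BERT's special [CLS] and [SEP] tokens, and the trailing zeros are [PAD] tokens. As a quick sanity check (a sketch, not part of the book's code, assuming the same bert_tokenizer as above), we can map the IDs back to their tokens:

bert_tokenizer.convert_ids_to_tokens(tokens['input_ids'][0].tolist())
# the result starts with '[CLS]', closes the sentence with '[SEP]',
# and fills the remaining positions up to max_length with '[PAD]'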

The tokenizer is padding the sentence up to the maximum length (only 30 in this example, to make the outputs easier to visualize), and this is reflected in the attention_mask as well. We'll use both input_ids and attention_mask as inputs to our BERT model (the token_type_ids are irrelevant here because there is only one sentence).
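As a minimal sketch of that step (assuming bert_model is a pre-trained model loaded with Hugging Face's AutoModel from the 'bert-base-uncased' checkpoint, an assumption since this excerpt doesn't show how it was created), retrieving the embeddings could look like this:

import torch
from transformers import AutoModel

# assumption: the model matches the tokenizer's checkpoint
bert_model = AutoModel.from_pretrained('bert-base-uncased')
bert_model.eval()

with torch.no_grad():
    output = bert_model(input_ids=tokens['input_ids'],
                        attention_mask=tokens['attention_mask'])

# one 768-dimensional contextual embedding per token:
# shape is (batch_size=1, sequence_length=30, hidden_size=768)
embeddings = output.last_hidden_state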
