
Model I — GloVe + Classifier

Data Preparation

It all starts with the data preparation step. As we already know, we need to tokenize the sentences to get their corresponding sequences of token IDs. The sentences (and the labels) can be easily retrieved from HF's dataset, like a dictionary:

Data Preparation

train_sentences = train_dataset['sentence']
train_labels = train_dataset['labels']

test_sentences = test_dataset['sentence']
test_labels = test_dataset['labels']
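HF's Dataset objects support this dictionary-style column access out of the box, with each lookup returning a plain Python list. Here is a minimal sketch of that behavior using a toy dataset built with Dataset.from_dict(); the two sentences and their labels are made up for illustration only:

from datasets import Dataset

# Toy dataset for illustration only - the real train_dataset is built
# earlier in the chapter
toy_dataset = Dataset.from_dict(
    {'sentence': ['The first made-up sentence.',
                  'The second made-up sentence.'],
     'labels': [0, 1]}
)

# Column access works like a dictionary lookup and returns plain lists
sentences = toy_dataset['sentence']  # list of two strings
labels = toy_dataset['labels']       # [0, 1]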

Next, we use our glove_tokenizer() to tokenize the sentences, making sure that we pad and truncate them so they all end up with 60 tokens (like we did in the "HuggingFace’s Tokenizer" section). We only need the input_ids to fetch their corresponding embeddings later on:

Data Preparation — Tokenizing

train_ids = glove_tokenizer(train_sentences,
                            truncation=True,
                            padding=True,
                            max_length=60,
                            add_special_tokens=False,
                            return_tensors='pt')['input_ids']
train_labels = torch.as_tensor(train_labels).float().view(-1, 1)

test_ids = glove_tokenizer(test_sentences,
                           truncation=True,
                           padding=True,
                           max_length=60,
                           add_special_tokens=False,
                           return_tensors='pt')['input_ids']
test_labels = torch.as_tensor(test_labels).float().view(-1, 1)
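Fetching the embeddings later on amounts to an index lookup into an embedding layer loaded with the GloVe vectors. The sketch below is only an illustration of that lookup: it assumes glove_tokenizer is a HF tokenizer (so it exposes get_vocab()) and uses a random stand-in for the real, pre-trained GloVe matrix, which is assembled later in the chapter:

import torch
import torch.nn as nn

# Stand-in for the real matrix of pre-trained GloVe vectors - one row
# per token in the tokenizer's vocabulary (50 dimensions, for example)
vocab_size, emb_dim = len(glove_tokenizer.get_vocab()), 50
glove_vectors = torch.randn(vocab_size, emb_dim)

# freeze=True keeps the pre-trained vectors fixed during training
embedding = nn.Embedding.from_pretrained(glove_vectors, freeze=True)

# Every token ID in train_ids is replaced by its corresponding vector
embedded = embedding(train_ids)
# embedded.shape -> (number of sentences, 60, 50)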

