
Finally, if we want to convert a list of tokens into a list of their corresponding indices in the vocabulary, we can use the doc2idx() method:

from gensim.utils import simple_preprocess

# `dictionary` is the Gensim Dictionary built earlier from our corpus
sentence = 'follow the white rabbit'
new_tokens = simple_preprocess(sentence)  # lowercased list of tokens
ids = dictionary.doc2idx(new_tokens)      # token -> vocabulary index
print(new_tokens)
print(ids)

Output

['follow', 'the', 'white', 'rabbit']

[1482, 20, 497, 333]

The problem is, however large we make the vocabulary, there will always be a new word that’s not in there.
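Out of the box, doc2idx() maps any token it has never seen to its unknown_word_index argument, which defaults to -1. A quick sketch of that behavior, using a made-up word purely for illustration:

unseen_tokens = simple_preprocess('follow the white aardwolf')
print(dictionary.doc2idx(unseen_tokens))
# indices for the known tokens, and -1 for the unknown one, e.g.:
# [1482, 20, 497, -1]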

"What do we do with words that aren’t in the vocabulary?"

If the word isn’t in the vocabulary, it is an unknown word, and it is going to be replaced by the corresponding special token: [UNK]. This means we need to add [UNK] to the vocabulary. Luckily, Gensim’s Dictionary has a patch_with_special_tokens() method that makes it very easy to patch our vocabulary:

special_tokens = {'[PAD]': 0, '[UNK]': 1}
dictionary.patch_with_special_tokens(special_tokens)

Besides, while we’re at it, let’s add yet another special token: [PAD]. At some point, we’ll have to pad our sequences (like we did in Chapter 8), so it will be useful to have a token ready for it.
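To see both tokens in action, here is a minimal sketch (not the book’s actual pipeline) that maps unknown words to the [UNK] index using doc2idx()’s unknown_word_index argument, and then pads the sequence to an arbitrary length of ten using the [PAD] index:

oov_tokens = simple_preprocess('follow the white aardwolf')
# send unknown tokens to the [UNK] index instead of the default -1
ids = dictionary.doc2idx(oov_tokens, unknown_word_index=special_tokens['[UNK]'])
# pad (or truncate) the sequence to a fixed length using the [PAD] index
max_len = 10
padded = ids[:max_len] + [special_tokens['[PAD]']] * (max_len - len(ids))
print(padded)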

What if, instead of adding more tokens to the vocabulary, we try removing words from it? Maybe we’d like to remove rare words (aardvark always comes to my mind) to get a smaller vocabulary, or maybe we’d like to remove bad words (profanity) from it.
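Gensim’s Dictionary has methods for exactly that: filter_extremes() drops tokens based on document frequency, and filter_tokens() removes specific ids. A hedged sketch (the thresholds and the word list are illustrative; pruning would normally happen before patching in the special tokens, since filtering reassigns the indices):

# drop tokens that appear in fewer than five documents,
# keeping at most the 25,000 most frequent ones
dictionary.filter_extremes(no_below=5, keep_n=25000)

# drop an explicit list of unwanted words (illustrative only)
unwanted = ['aardvark']
bad_ids = [dictionary.token2id[w] for w in unwanted if w in dictionary.token2id]
dictionary.filter_tokens(bad_ids=bad_ids)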

