Once we’re happy with the size and scope of a vocabulary, we can save it to disk as a plain-text file, one token (word) per line. The helper function below takes a list of sentences, generates the corresponding vocabulary, and saves it to a file named vocab.txt:

Method to Build a Vocabulary from a Dataset of Sentences

import os

from gensim import corpora
from gensim.utils import simple_preprocess

def make_vocab(sentences, folder=None, special_tokens=None,
               vocab_size=None, min_freq=None):
    # defaults to the current directory if no folder is given
    if folder is None:
        folder = '.'
    if not os.path.exists(folder):
        os.mkdir(folder)

    # tokenizes the sentences and creates a Dictionary
    tokens = [simple_preprocess(sent) for sent in sentences]
    dictionary = corpora.Dictionary(tokens)
    # keeps only the most frequent words (vocab size)
    if vocab_size is not None:
        dictionary.filter_extremes(keep_n=vocab_size)
    # removes rare words (in case the vocab size still
    # includes words with low frequency)
    if min_freq is not None:
        rare_tokens = get_rare_ids(dictionary, min_freq)
        dictionary.filter_tokens(bad_ids=rare_tokens)
    # gets the whole list of token ids and frequencies
    # (cfs maps each token id to its collection frequency)
    items = dictionary.cfs.items()
    # sorts the tokens in descending order of frequency
    words = [dictionary[token_id]
             for token_id, freq in sorted(items,
                                          key=lambda t: -t[1])]
    # prepends special tokens, if any
    if special_tokens is not None:
        to_add = []
        for special_token in special_tokens:
            if special_token not in words:
                to_add.append(special_token)
        words = to_add + words

    # writes the vocabulary to disk, one token per line
    with open(os.path.join(folder, 'vocab.txt'), 'w') as f:
        for word in words:
            f.write(f'{word}\n')
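The get_rare_ids() helper is defined elsewhere in the book. A minimal sketch of what it could look like, assuming it simply returns the ids of every token whose collection frequency (cfs) falls below min_freq:

def get_rare_ids(dictionary, min_freq):
    # hypothetical sketch: collects the ids of tokens that
    # appear fewer than min_freq times in the whole corpus
    return [token_id
            for token_id, freq in dictionary.cfs.items()
            if freq < min_freq]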

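A quick usage sketch (the sentences and folder name here are made up for illustration):

sentences = ['The quick brown fox jumps over the lazy dog.',
             'The lazy dog barked at the fox.']
make_vocab(sentences, folder='my_vocab',
           special_tokens=['[PAD]', '[UNK]'], min_freq=2)
# vocab.txt lists the special tokens first, then the surviving
# tokens in descending order of frequency, one per line
with open(os.path.join('my_vocab', 'vocab.txt')) as f:
    print(f.read())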
