
Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub


The corpora’s dictionary is not a typical Python dictionary. It has some specific (and useful) attributes:

dictionary.num_docs

Output

3081

The num_docs attribute tells us how many documents were processed (sentences, in our case), and it corresponds to the length of the (outer) list of tokens.

dictionary.num_pos

Output

50802

The num_pos attribute tells us how many tokens (words) were processed over all documents (sentences).
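To make these two counts concrete, here is a minimal pure-Python sketch. The `sentences` list and its contents are made up for illustration and are not the book's data. The code only mirrors what num_docs and num_pos report, and it is not the dictionary's actual implementation.

```python
# Toy stand-in for the tokenized corpus: an outer list of sentences,
# each one an inner list of tokens (made-up data, for illustration only).
sentences = [
    ["and", "as", "far", "as", "she", "knew"],
    ["she", "was", "quite", "right"],
    ["quite", "so"],
]

# Counterpart of num_docs: one document per inner list.
num_docs = len(sentences)

# Counterpart of num_pos: every token occurrence counts, repeats included.
num_pos = sum(len(tokens) for tokens in sentences)

print(num_docs)  # 3
print(num_pos)   # 12
```

Note that the repeated "as" in the first sentence is counted twice: num_pos counts positions in the text, not unique words.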

dictionary.token2id

Output

{'and': 0,
 'as': 1,
 'far': 2,
 'knew': 3,
 'quite': 4,
 ...

The token2id attribute is a (Python) dictionary that maps each unique word found in the text corpora to a unique ID, and the IDs are assigned sequentially.
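The mapping can be imitated in a few lines of plain Python. This is a rough sketch on made-up sentences, assuming the dictionary comes from Gensim's `corpora.Dictionary` and that it registers the new words of each document in sorted order, which is consistent with the alphabetical IDs in the output above. The `sentences` list is hypothetical.

```python
# Made-up tokenized sentences, for illustration only.
sentences = [
    ["and", "as", "far", "as", "she", "knew"],
    ["she", "was", "quite", "right"],
]

token2id = {}
for tokens in sentences:
    # Assumption: the new words of each document are registered in sorted
    # order. Words seen in an earlier document keep the ID they already have.
    for token in sorted(set(tokens)):
        if token not in token2id:
            # The next free ID is simply the current vocabulary size.
            token2id[token] = len(token2id)

print(token2id)
# {'and': 0, 'as': 1, 'far': 2, 'knew': 3, 'she': 4, 'quite': 5, 'right': 6, 'was': 7}
```

Because each new ID equals the current size of the mapping, the IDs always form the contiguous range from 0 to the vocabulary size minus one.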

The keys of the token2id dictionary are the actual vocabulary of our corpora:
