
Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub


Output

["I'm", 'following', 'the', 'white', 'rabbit']

"What about 'I’m'? Isn’t it two words?"

Yes, and no. Not helpful, right? As usual, it depends: word contractions like that are fairly common, and maybe you want to keep them as single tokens. But it is also possible to have the token itself split into its two basic components, "I" and "am," such that the sentence above has six tokens instead of five. For now, we're only interested in sentence tokenization, which, as you probably already guessed, means to split a text into its sentences.
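To make the "six tokens instead of five" point concrete, here is a minimal sketch of expanding contractions before splitting on whitespace. The contraction map and function name below are illustrative assumptions for this example, not part of NLTK:

```python
# Illustrative contraction map (an assumption for this sketch,
# not an exhaustive or standard list).
CONTRACTIONS = {"I'm": "I am", "isn't": "is not", "don't": "do not"}

def expand_contractions(sentence):
    # Replace each known contraction with its expanded form,
    # then flatten the result into a single list of tokens.
    return [word
            for token in sentence.split()
            for word in CONTRACTIONS.get(token, token).split()]

tokens = expand_contractions("I'm following the white rabbit")
print(tokens)  # ['I', 'am', 'following', 'the', 'white', 'rabbit']
```

With the contraction expanded, the same sentence yields six tokens instead of five.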

We'll get back to the topic of tokenization at word (and subword) levels later.

For a brief introduction to the topic, check the "Tokenization" [163] section of the Introduction to Information Retrieval [164] book by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press (2008).

We're using NLTK's sent_tokenize() method to accomplish this instead of trying to devise the splitting rules ourselves (NLTK is the Natural Language Toolkit library and is one of the most traditional tools for handling NLP tasks):

import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

corpus_alice = sent_tokenize(alice)
corpus_wizard = sent_tokenize(wizard)

len(corpus_alice), len(corpus_wizard)

Output

(1612, 2240)

There are 1,612 sentences in Alice's Adventures in Wonderland and 2,240 sentences in The Wonderful Wizard of Oz.
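To see why it pays to use sent_tokenize() rather than devising our own rules, here is a naive regex-based splitter sketched for comparison. The function name and sample text are assumptions for this example; NLTK's pre-trained punkt model handles abbreviations, quotes, and other edge cases that a simple pattern like this misses:

```python
import re

def naive_sent_tokenize(text):
    # Naively split after '.', '!', or '?' followed by whitespace.
    # Breaks on abbreviations like "Mr." or "e.g." (hence NLTK).
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

sample = "Follow the white rabbit. How deep does the hole go? Very deep!"
print(naive_sent_tokenize(sample))
# ['Follow the white rabbit.', 'How deep does the hole go?', 'Very deep!']
```

On clean text the two approaches agree, but a sentence like "Mr. Rabbit is late." would be wrongly split in two by the naive version.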

886 | Chapter 11: Down the Yellow Brick Rabbit Hole
