22.02.2024 Views

Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub


"What is this punkt?"

That’s the Punkt Sentence Tokenizer, and its pre-trained model (for the English

language) is included in the NLTK package.

"And what is a corpus?"

A corpus is a structured set of documents. But there is quite a lot of wiggle room in

this definition: One can define a document as a sentence, a paragraph, or even a

whole book. In our case, the document is a sentence, so each book is actually a set

of sentences, and thus each book may be considered a corpus. The plural of corpus

is actually corpora (yay, Latin!), so what we do have are corpora.
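
To make the terminology concrete, here is a minimal sketch using made-up sentences as stand-ins for the two books. The names corpus_first, corpus_second, and corpora are hypothetical and chosen only for illustration; the only assumption carried over from the text is that a document is a sentence and a corpus is a list of them.

```python
# Toy stand-ins for the real books; these sentences are made up for illustration.
corpus_first = [
    "This is the first sentence.",
    "This is the second one.",
]
corpus_second = [
    "Another book starts here.",
    "It also has sentences.",
    "Three, in fact.",
]

# Each corpus is a list of documents (here, sentences);
# the corpora is simply a collection of such lists.
corpora = {"first": corpus_first, "second": corpus_second}

for name, corpus in corpora.items():
    print(f"{name}: {len(corpus)} documents")

# Total number of documents across all corpora
n_documents = sum(len(corpus) for corpus in corpora.values())
print(n_documents)
```

Had we chosen paragraphs or whole books as documents instead, the structure would be the same; only the granularity of each list element would change.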

Let’s check one sentence from the first corpus of text:

corpus_alice[2]

Output

'There was nothing so VERY remarkable in that; nor did Alice\nthink

it so VERY much out of the way to hear the Rabbit say to\nitself,

`Oh dear!'

Notice that it still includes the line breaks (\n) from the original text. The sentence

tokenizer only handles the sentence splitting; it does not clean up the line breaks.
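
One possible way to get rid of those line breaks, sketched below, is to collapse every run of whitespace into a single space. The helper name clean_line_breaks is hypothetical, and this is not necessarily the cleanup the book itself applies; the sentence is the one shown in the output above.

```python
def clean_line_breaks(sentence: str) -> str:
    """Replace line breaks (and any other whitespace runs) with single spaces."""
    # str.split() with no arguments splits on any whitespace, including "\n"
    return " ".join(sentence.split())


# The sentence from corpus_alice[2], line breaks included
sentence = (
    "There was nothing so VERY remarkable in that; nor did Alice\n"
    "think it so VERY much out of the way to hear the Rabbit say to\n"
    "itself, `Oh dear!"
)

cleaned = clean_line_breaks(sentence)
print(cleaned)
```

Splitting on whitespace and re-joining also removes tabs and repeated spaces, which may or may not be desirable depending on the text.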

Let’s check one sentence from the second corpus of text:

corpus_wizard[30]

Output

'"There\'s a cyclone coming, Em," he called to his wife.'

No line breaks here, but notice the quotation marks (") in the text.

"Why do we care about line breaks and quotation marks anyway?"

