Deep Learning with PyTorch Step-by-Step: A Beginner's Guide, by Daniel Voigt Godoy (Leanpub)


The actual texts of the books are contained between lines 105 and 3703 (remember Python's zero-based indexing) and 309 and 5099, respectively. Moreover, we're joining all the lines together into a single large string of text for each book, because we're going to organize the resulting texts into sentences, and in a regular book there are line breaks mid-sentence all over.

We definitely do not want to do that manually every time, right? Although it would be more difficult to automatically remove any additions to the original text, we can partially automate the removal of the extra lines by setting the real start and end lines of each text in a configuration file (lines.cfg):

Configuration File

import os  # localfolder ('texts') was defined earlier in the chapter

text_cfg = """fname,start,end
alice28-1476.txt,104,3704
wizoz10-1740.txt,310,5100"""
bytes_written = open(
    os.path.join(localfolder, 'lines.cfg'), 'w'
).write(text_cfg)

Your local folder (texts) should have three files now: alice28-1476.txt, lines.cfg, and wizoz10-1740.txt.
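Before moving on, here is a minimal sketch of how lines.cfg could be read back to slice each raw file and build the single-string alice and wizard variables used below. This is not the chapter's actual helper (the load_text name and the details are this sketch's own choices), just one way to do it:

import os

localfolder = 'texts'  # assumed: set up earlier in the chapter

def load_text(fname, start, end):
    # read the raw file and keep only the real text lines,
    # using a zero-based slice that matches the values in lines.cfg
    with open(os.path.join(localfolder, fname), 'r') as f:
        lines = f.readlines()
    return ''.join(lines[start:end])

# parse lines.cfg, skipping the header row (fname,start,end)
with open(os.path.join(localfolder, 'lines.cfg'), 'r') as f:
    rows = [line.strip().split(',') for line in f.readlines()[1:]]

texts = {fname: load_text(fname, int(start), int(end))
         for fname, start, end in rows}
alice = texts['alice28-1476.txt']
wizard = texts['wizoz10-1740.txt']

Note that ''.join() keeps the original line breaks inside each string; as we'll see, that's harmless for sentence tokenization.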

Output["I'm", 'following', 'the', 'white', 'rabbit']"What about 'I’m'? Isn’t it two words?"Yes, and no. Not helpful, right? As usual, it depends—word contractions like thatare fairly common, and maybe you want to keep them as single tokens. But it is alsopossible to have the token itself split into its two basic components, "I" and "am,"such that the sentence above has six tokens instead of five. For now, we’re onlyinterested in sentence tokenization, which, as you probably already guessed,means to split a text into its sentences.We’ll get back to the topic of tokenization at word (and subword)levels later.For a brief introduction to the topic, check the "Tokenization" [163]section of the Introduction to Information Retrieval [164] book byChristopher D. Manning, Prabhakar Raghavan, and HinrichSchütze, Cambridge University Press (2008).We’re using NLTK’s sent_tokenize() method to accomplish this instead of tryingto devise the splitting rules ourselves (NLTK is the natural language toolKit libraryand is one of the most traditional tools for handling NLP tasks):import nltkfrom nltk.tokenize import sent_tokenizenltk.download('punkt')corpus_alice = sent_tokenize(alice)corpus_wizard = sent_tokenize(wizard)len(corpus_alice), len(corpus_wizard)Output(1612, 2240)There are 1,612 sentences in Alice’s Adventures in Wonderland and 2,240 sentencesin The Wonderful Wizard of Oz.886 | Chapter 11: Down the Yellow Brick Rabbit Hole

For now, we're only interested in sentence tokenization, which, as you probably already guessed, means to split a text into its sentences.

We'll get back to the topic of tokenization at word (and subword) levels later. For a brief introduction to the topic, check the "Tokenization" [163] section of the Introduction to Information Retrieval [164] book by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press (2008).

We're using NLTK's sent_tokenize() method to accomplish this instead of trying to devise the splitting rules ourselves (NLTK is the Natural Language Toolkit library and one of the most traditional tools for handling NLP tasks):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # downloads the pre-trained sentence-splitting model

corpus_alice = sent_tokenize(alice)
corpus_wizard = sent_tokenize(wizard)
len(corpus_alice), len(corpus_wizard)

Output

(1612, 2240)

There are 1,612 sentences in Alice's Adventures in Wonderland and 2,240 sentences in The Wonderful Wizard of Oz.
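Incidentally, this is why joining the lines into one big string per book was enough: punkt decides sentence boundaries from punctuation and context, not from raw line breaks, so any newlines left inside the joined text should not create bogus sentences. A mini-check of that behavior (the output shown is an expectation, not taken from the book):

from nltk.tokenize import sent_tokenize

broken = "The rabbit was\nrunning late. Alice followed it down the hole."
sent_tokenize(broken)
# expected: ['The rabbit was\nrunning late.',
#            'Alice followed it down the hole.']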
