
Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step A Beginner’s Guide-leanpub


Our dataset is going to be a collection of CSV files, one file for each book, with each CSV file containing one sentence per line.

Therefore, we need to:

• clean the line breaks to make sure each sentence is on one line only;

• define an appropriate quote char to "wrap" each sentence so that commas and semicolons in the original text do not get misinterpreted as separation chars of the CSV file; and

• add a second column to the CSV file (the first one is the sentence itself) to identify the original source of the sentence, since we'll be concatenating and shuffling the sentences before training a model on our corpora.
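The first of these steps, cleaning the line breaks, can be sketched like this (a minimal illustration of the idea using naive whitespace collapsing, not the book's actual function):

```python
# a hard-wrapped sentence, as it might appear in the raw text file
raw = "There's a cyclone\ncoming, Em,\" he called to his wife."

# collapse line breaks (and any runs of whitespace) into single
# spaces, so the whole sentence sits on one line
clean = ' '.join(raw.split())
```

Any sentence that was split across lines in the source file now occupies a single line, ready to be written out as one CSV row.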

The sentence above should end up looking like this:

\"There's a cyclone coming, Em," he called to his wife.\,wizoz10-1740.txt
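A row like that can be produced with Python's standard csv module; the sketch below (an illustration, not the book's own code) sets quotechar to the backslash, so only the field containing commas gets wrapped:

```python
import csv
import io

buf = io.StringIO()
# with the default QUOTE_MINIMAL quoting, only fields containing the
# delimiter (or the quote char itself) get wrapped in the quote char
writer = csv.writer(buf, delimiter=',', quotechar='\\')
sentence = '"There\'s a cyclone coming, Em," he called to his wife.'
writer.writerow([sentence, 'wizoz10-1740.txt'])
row = buf.getvalue()
```

The sentence field contains commas, so it gets wrapped in backslashes; the source field does not, so it is written as-is, reproducing the example above.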

The escape character "\" is a good choice for quote char because it is not present in any of the books (we would probably have to choose something else if our books of choice were about coding).
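A quick way to confirm that a candidate quote char is safe is to scan each book's text for it before committing (a hedged sketch; the helper name is ours, not the book's):

```python
def quote_char_is_safe(text, quote_char='\\'):
    # True only if the candidate quote char never occurs in the text,
    # so it can wrap fields without needing any escaping
    return quote_char not in text

sample = 'There\'s a cyclone coming, Em," he called to his wife.'
```

For a corpus about coding, `quote_char_is_safe` would likely fail for the backslash, and another character would have to be picked.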

The function below does the grunt work of cleaning, splitting, and saving the sentences to a CSV file for us:

Method to Generate CSV of Sentences

def sentence_tokenize(source, quote_char='\\', sep_char=',',
                      include_header=True, include_source=True,
                      extensions=('txt',), **kwargs):
    nltk.download('punkt')
    # If source is a folder, goes through all files inside it
    # that match the desired extensions ('txt' by default)
    if os.path.isdir(source):
        filenames = [f for f in os.listdir(source)
                     if os.path.isfile(os.path.join(source, f))
                     and os.path.splitext(f)[1][1:] in extensions]
    elif isinstance(source, str):
        filenames = [source]

888 | Chapter 11: Down the Yellow Brick Rabbit Hole
