Deep Learning with PyTorch Step-by-Step: A Beginner's Guide, by Daniel Voigt Godoy (Leanpub)


    # If there is a configuration file, builds a dictionary with
    # the corresponding start and end lines of each text file
    config_file = os.path.join(source, 'lines.cfg')
    config = {}
    if os.path.exists(config_file):
        with open(config_file, 'r') as f:
            rows = f.readlines()

        for r in rows[1:]:
            fname, start, end = r.strip().split(',')
            config.update({fname: (int(start), int(end))})

    new_fnames = []
    # For each file of text
    for fname in filenames:
        # If there's a start and end line for that file, use it
        try:
            start, end = config[fname]
        except KeyError:
            start = None
            end = None

        # Opens the file, slices the configured lines (if any),
        # cleans line breaks, and uses the sentence tokenizer
        with open(os.path.join(source, fname), 'r') as f:
            contents = (
                ''.join(f.readlines()[slice(start, end, None)])
                .replace('\n', ' ').replace('\r', '')
            )
        corpus = sent_tokenize(contents, **kwargs)

        # Builds a CSV file containing the tokenized sentences
        base = os.path.splitext(fname)[0]
        new_fname = f'{base}.sent.csv'
        new_fname = os.path.join(source, new_fname)
        with open(new_fname, 'w') as f:
            # Header of the file
            if include_header:
                if include_source:
                    f.write('sentence,source\n')
                else:
                    f.write('sentence\n')

            # Writes one line for each sentence
            for sentence in corpus:
                if include_source:
                    f.write(f'{quote_char}{sentence}{quote_char}'
                            f'{sep_char}{fname}\n')
                else:
                    f.write(f'{quote_char}{sentence}'
                            f'{quote_char}\n')
        new_fnames.append(new_fname)

    # Returns the list of the newly generated CSV files
    return sorted(new_fnames)
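For reference, the configuration parsing at the top of the listing expects lines.cfg to be a tiny CSV of its own: a header row followed by one fname,start,end row per text file. Here is a minimal sketch of writing such a file into the localfolder used in the example below; the file names and line ranges are hypothetical:

import os

# Writes a hypothetical lines.cfg: a header row, then one
# "fname,start,end" row per text file whose lines should be sliced
config_text = ('fname,start,end\n'
               'alice28-1476.txt,104,3704\n'
               'wizoz10-1740.txt,310,5100\n')
with open(os.path.join(localfolder, 'lines.cfg'), 'w') as f:
    f.write(config_text)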


It takes a source folder (or a single file) and goes through the files with the right extensions (only .txt by default), removing lines based on the lines.cfg file (if any), applying the sentence tokenizer to each file, and generating the corresponding CSV files of sentences using the configured quote_char and sep_char. Depending on the include_header and include_source arguments, it may also add a header row and a source column to each CSV file.
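The sent_tokenize called in the listing is NLTK's sentence tokenizer. A standalone sketch of what it does, assuming NLTK is installed and using a hypothetical input string:

import nltk
from nltk.tokenize import sent_tokenize

# Downloads the models used by the tokenizer, if not yet present
nltk.download('punkt', quiet=True)

text = "Follow the white rabbit. Then follow the yellow brick road!"
sent_tokenize(text)

Output

['Follow the white rabbit.', 'Then follow the yellow brick road!']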

The CSV files are named after the corresponding text files by dropping the original extension and appending .sent.csv to them. Let's see it in action:

Generating Dataset of Sentences

new_fnames = sentence_tokenize(localfolder)
new_fnames

Output

['texts/alice28-1476.sent.csv', 'texts/wizoz10-1740.sent.csv']

Each CSV file contains the sentences of a book, and we'll use both of them to build our own dataset.
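As a quick sanity check, one can load a generated file back with pandas. This is only a sketch: the quotechar argument must match the quote_char used by sentence_tokenize, assumed here to be a backslash.

import pandas as pd

# Reads one of the generated CSV files back into a DataFrame;
# quotechar is assumed to match sentence_tokenize's quote_char
df = pd.read_csv('texts/alice28-1476.sent.csv', quotechar='\\')
print(df.head())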

