Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide (Leanpub)


    # If there is a configuration file, builds a dictionary with
    # the corresponding start and end lines of each text file
    config_file = os.path.join(source, 'lines.cfg')
    config = {}
    if os.path.exists(config_file):
        with open(config_file, 'r') as f:
            rows = f.readlines()

        for r in rows[1:]:
            fname, start, end = r.strip().split(',')
            config.update({fname: (int(start), int(end))})

    new_fnames = []
    # For each file of text
    for fname in filenames:
        # If there's a start and end line for that file, use it
        try:
            start, end = config[fname]
        except KeyError:
            start = None
            end = None

        # Opens the file, slices the configured lines (if any),
        # cleans line breaks, and uses the sentence tokenizer
        with open(os.path.join(source, fname), 'r') as f:
            contents = (
                ''.join(f.readlines()[slice(start, end, None)])
                .replace('\n', ' ').replace('\r', '')
            )
        corpus = sent_tokenize(contents, **kwargs)

        # Builds a CSV file containing tokenized sentences
        base = os.path.splitext(fname)[0]
        new_fname = f'{base}.sent.csv'
        new_fname = os.path.join(source, new_fname)
        with open(new_fname, 'w') as f:
            # Header of the file
            if include_header:
                if include_source:
                    f.write('sentence,source\n')
                else:
                    f.write('sentence\n')
            # Writes one line for each sentence
            for sentence in corpus:
                if include_source:
                    f.write(f'{quote_char}{sentence}{quote_char}'
                            f'{sep_char}{fname}\n')
                else:
                    f.write(f'{quote_char}{sentence}{quote_char}\n')
        new_fnames.append(new_fname)

    # Returns list of the newly generated CSV files
    return sorted(new_fnames)

The function takes a source folder (or a single file) and goes through the files with the right extensions (only .txt by default), removing lines based on the lines.cfg file (if there is one), applying the sentence tokenizer to each file, and generating the corresponding CSV files of sentences using the configured quote_char and sep_char. It may also use include_header and include_source in the CSV file. The CSV files are named after the corresponding text files by dropping the original extension and appending .sent.csv. Let's see it in action:

Generating Dataset of Sentences

new_fnames = sentence_tokenize(localfolder)
new_fnames

Output

['texts/alice28-1476.sent.csv', 'texts/wizoz10-1740.sent.csv']

Each CSV file contains the sentences of a book, and we'll use both of them to build our own dataset.
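The listing above picks up in the middle of the sentence_tokenize() function, so its signature and the lines that collect the file names to be processed are not shown here. A minimal sketch of what that preamble could look like, assuming the parameter names used in the body and the ".txt by default" behavior described above (the default values and the file-listing logic are assumptions, not the book's exact code):

import os
from nltk.tokenize import sent_tokenize
# sent_tokenize relies on NLTK's "punkt" model; run
# nltk.download('punkt') once if you haven't already

def sentence_tokenize(source, quote_char='\\', sep_char=',',
                      include_header=True, include_source=True,
                      extensions=('txt',), **kwargs):
    # Accepts either a folder or a single text file as source
    if os.path.isfile(source):
        filenames = [os.path.basename(source)]
        source = os.path.dirname(source)
    else:
        # Keeps only the files with the configured extensions
        filenames = sorted(
            f for f in os.listdir(source)
            if os.path.splitext(f)[1][1:] in extensions
        )
    # ... the body shown in the listing above goes here ...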

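As for the configuration file, lines.cfg is expected to be a small comma-separated file with a header row (which is why the loop above starts at rows[1:]), followed by one fname,start,end row per text file. The start and end values below are made up just to illustrate the format; use whatever range covers the text you actually want to keep:

fname,start,end
alice28-1476.txt,104,3704
wizoz10-1740.txt,310,5100

Given a file like that, the parsing loop would build the following dictionary, mapping each file name to its (start, end) slice:

config = {'alice28-1476.txt': (104, 3704),
          'wizoz10-1740.txt': (310, 5100)}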

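If you want to double-check one of the generated files, a simple option is loading it with pandas. The quotechar and sep arguments below are assumptions: they must match whatever quote_char and sep_char were used when the file was written (a backslash and a comma are assumed here):

import pandas as pd

# quotechar/sep must mirror the quote_char/sep_char used by
# sentence_tokenize; a backslash and a comma are assumed here
df = pd.read_csv('texts/alice28-1476.sent.csv',
                 quotechar='\\', sep=',')
df.head()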