
Autoencoders

import collections
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
from time import gmtime, strftime
from tensorflow.keras.callbacks import TensorBoard
import re

# Needed to run only once
nltk.download('punkt')

The data is provided as a set of SGML files. The helper code that converts the SGML
files to text.tsv, based on the Scikit-learn example at https://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html,
is included in the GitHub repository as the file named parse.py. We will read the data from
this file and first convert each block of text into a list of sentences, one sentence per
line. Each word in a sentence is also normalized as it is added: the normalization
replaces every token that is a number with the digit 9, then converts the word
to lowercase. The same loop simultaneously accumulates word
frequencies. The result is the word frequency table, word_freqs:

DATA_DIR = "data"

def is_number(n):
    temp = re.sub("[.,-/]", "", n)
    return temp.isdigit()

# parsing sentences and building vocabulary
word_freqs = collections.Counter()
ftext = open(os.path.join(DATA_DIR, "text.tsv"), "r")
sents = []
sent_lens = []
for line in ftext:
    docid, text = line.strip().split("\t")
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if is_number(word):
                word = "9"
            word = word.lower()
            word_freqs[word] += 1
        sents.append(sent)
        sent_lens.append(len(sent))
ftext.close()
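The statistics collected above are typically used to size the model's inputs. As a minimal sketch (the sample data, the frequency cutoff of 5, and the 90th-percentile cap are illustrative assumptions, not values from the original code), one might derive a vocabulary size and a maximum sequence length from word_freqs and sent_lens like this:

```python
import collections

# Illustrative stand-ins for the structures built in the parsing loop.
word_freqs = collections.Counter(
    {"the": 50, "9": 30, "price": 12, "rose": 7, "slightly": 1})
sent_lens = [12, 45, 30, 120, 60, 33]

# Keep only words seen at least 5 times; reserve one slot for an UNK token
# that will stand in for all rarer words.
VOCAB_SIZE = 1 + sum(1 for count in word_freqs.values() if count >= 5)

# Cap the sequence length near the upper end of the observed distribution
# (here the 90th percentile) so a few very long sentences do not dominate.
SEQUENCE_LEN = sorted(sent_lens)[int(0.9 * len(sent_lens))]

print(VOCAB_SIZE, SEQUENCE_LEN)
```

Cutting the long tail of rare words keeps the embedding matrix small while still covering most token occurrences, and capping the sequence length bounds the padded input size.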

