24.07.2016 Views


Learning Data Mining with Python


Chapter 9

We create lists for storing the documents themselves and the author classes:

    documents = []
    authors = []

We then create a list of the subfolders in the parent directory, as the script creates a subfolder for each author. The code is as follows:

    subfolders = [subfolder for subfolder in os.listdir(folder)
                  if os.path.isdir(os.path.join(folder, subfolder))]
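To see what this comprehension keeps and what it drops, here is a minimal sketch run against a temporary directory. The helper name `list_subfolders` and the entries `austen`, `dickens`, and `notes.txt` are made up for illustration and are not part of the book's dataset.

```python
import os
import tempfile


def list_subfolders(folder):
    """Return the names of the immediate subdirectories of folder (hypothetical helper)."""
    return [subfolder for subfolder in os.listdir(folder)
            if os.path.isdir(os.path.join(folder, subfolder))]


# Build a throwaway directory with two author folders and one stray file.
with tempfile.TemporaryDirectory() as folder:
    os.mkdir(os.path.join(folder, "austen"))
    os.mkdir(os.path.join(folder, "dickens"))
    with open(os.path.join(folder, "notes.txt"), "w") as outf:
        outf.write("not an author folder")
    # os.listdir returns entries in arbitrary order, so sort for a stable result.
    found = sorted(list_subfolders(folder))

print(found)  # the stray file is filtered out, only the two folders remain
```

Because `os.listdir` makes no promise about ordering, sorting the result is one way to keep the author numbers assigned in the next step the same from run to run.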

Next we iterate over these subfolders, assigning each subfolder a number using enumerate:

    for author_number, subfolder in enumerate(subfolders):
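As a quick illustration of how enumerate pairs each item with a running number, the sketch below uses a hard-coded list of hypothetical subfolder names in place of the real directory listing.

```python
# Hypothetical subfolder names, standing in for the result of os.listdir.
subfolders = ["austen", "dickens", "twain"]

author_numbers = {}
for author_number, subfolder in enumerate(subfolders):
    # enumerate yields (0, "austen"), then (1, "dickens"), then (2, "twain").
    author_numbers[subfolder] = author_number

print(author_numbers)
```

The numbers start at 0 and follow the order of the list, so the same author receives the same number only if the list order is the same.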

We then create the full subfolder path and look for all documents within that subfolder:

        full_subfolder_path = os.path.join(folder, subfolder)
        for document_name in os.listdir(full_subfolder_path):

For each of those files, we open it, read its contents, preprocess them, and append the result to our documents list. The code is as follows:

            with open(os.path.join(full_subfolder_path, document_name)) as inf:
                documents.append(clean_book(inf.read()))

We also append the number we assigned to this author to our authors list, which will form our classes:

            authors.append(author_number)
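The two lists are filled in lockstep, so the label at a given position always belongs to the document at the same position. A minimal sketch of that pattern follows, using a made-up in-memory corpus instead of files on disk.

```python
# Hypothetical in-memory corpus: author name -> list of raw texts.
corpus = {
    "austen": ["Emma", "Persuasion"],
    "dickens": ["Bleak House"],
}

documents = []
authors = []
for author_number, (author, texts) in enumerate(sorted(corpus.items())):
    for text in texts:
        documents.append(text)
        # One label per document, so both lists always have equal length.
        authors.append(author_number)

# Position i in authors is the class of position i in documents.
for document, author_number in zip(documents, authors):
    print(author_number, document)
```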

We then return the documents and classes (which we transform into a NumPy array for easy indexing later on):

    return documents, np.array(authors, dtype='int')

We can now get our documents and classes using the following function call:

documents, classes = load_books_data(data_folder)
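Putting the fragments together, a minimal sketch of the whole function might look like the following. `clean_book` is defined outside this section, so a trivial stand-in is used here. The `sorted` call, the explicit `encoding="utf-8"`, and the temporary folder with made-up author names and texts are assumptions added for the demonstration, not part of the original code.

```python
import os
import tempfile

import numpy as np


def clean_book(text):
    # Stand-in for the preprocessing function defined outside this section;
    # here it only strips surrounding whitespace.
    return text.strip()


def load_books_data(folder):
    documents = []
    authors = []
    # sorted() is an addition: os.listdir gives no ordering guarantee, and
    # sorting makes the author numbers reproducible between runs.
    subfolders = sorted(subfolder for subfolder in os.listdir(folder)
                        if os.path.isdir(os.path.join(folder, subfolder)))
    for author_number, subfolder in enumerate(subfolders):
        full_subfolder_path = os.path.join(folder, subfolder)
        for document_name in os.listdir(full_subfolder_path):
            path = os.path.join(full_subfolder_path, document_name)
            # The encoding is an assumption; adjust it to match your files.
            with open(path, encoding="utf-8") as inf:
                documents.append(clean_book(inf.read()))
            authors.append(author_number)
    return documents, np.array(authors, dtype='int')


# Made-up corpus: two author folders with one and two short "books".
sample = {"austen": ["Emma\n"], "dickens": ["Bleak House\n", "Hard Times\n"]}
with tempfile.TemporaryDirectory() as data_folder:
    for author, texts in sample.items():
        os.mkdir(os.path.join(data_folder, author))
        for i, text in enumerate(texts):
            filename = os.path.join(data_folder, author, "book{}.txt".format(i))
            with open(filename, "w", encoding="utf-8") as outf:
                outf.write(text)
    documents, classes = load_books_data(data_folder)

# Because classes is a NumPy array, a comparison gives a boolean mask that
# picks out the positions of one author's books.
dickens_books = [documents[i] for i in np.flatnonzero(classes == 1)]
print(len(documents), classes, sorted(dickens_books))
```

The last lines show why the labels are returned as an array: `classes == 1` compares every label at once, which a plain Python list would not do.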
