24.07.2016 Views

www.allitebooks.com

Learning Data Mining with Python


Getting the data

The data we will use for this chapter is a set of books from Project Gutenberg at www.gutenberg.org, a repository of public domain literary works. The books I used for these experiments come from a variety of authors:

• Booth Tarkington (22 titles)
• Charles Dickens (44 titles)
• Edith Nesbit (10 titles)
• Arthur Conan Doyle (51 titles)
• Mark Twain (29 titles)
• Sir Richard Francis Burton (11 titles)
• Emile Gaboriau (10 titles)

Overall, there are 177 documents from 7 authors, giving a significant amount of text to work with. A full list of the titles, along with download links and a script to automatically fetch them, is given in the code bundle.
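As a quick sanity check, the totals quoted above can be recomputed from the per-author counts. This is only a sketch; the dictionary name `titles_per_author` is an illustrative choice, and the numbers are copied from the list in this section.

```python
# Titles per author, copied from the list in this section.
titles_per_author = {
    "Booth Tarkington": 22,
    "Charles Dickens": 44,
    "Edith Nesbit": 10,
    "Arthur Conan Doyle": 51,
    "Mark Twain": 29,
    "Sir Richard Francis Burton": 11,
    "Emile Gaboriau": 10,
}

n_authors = len(titles_per_author)
n_documents = sum(titles_per_author.values())
print(n_authors, "authors,", n_documents, "documents")

# Share of the corpus contributed by each author, largest first.
for author, count in sorted(titles_per_author.items(), key=lambda kv: -kv[1]):
    print("{:<28} {:>3} ({:.1%})".format(author, count, count / n_documents))
```

The per-author shares show that the corpus is not evenly split between authors.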

To download these books, we use the requests library to save the files into our data directory. First, set up the data directory and make sure the following code points to it:

Chapter 9<br />

import os
import sys
data_folder = os.path.join(os.path.expanduser("~"), "Data", "books")
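To make the download step concrete, here is a minimal sketch of how one book might be saved into an author subfolder using requests. It is not the getdata.py script from the code bundle: the function name `download_book`, its parameters, and the `fetch` hook are hypothetical, and the one-subfolder-per-author layout is an assumption.

```python
import os


def download_book(url, author, filename, data_folder, fetch=None):
    """Save the file at `url` to data_folder/author/filename; return the path.

    `fetch` is a hypothetical hook: any callable that takes a URL and
    returns bytes. If omitted, a requests-based fetch is used.
    """
    if fetch is None:
        import requests  # third-party library, imported only when needed

        def fetch(u):
            response = requests.get(u, timeout=30)
            response.raise_for_status()  # raise on HTTP error status codes
            return response.content

    author_folder = os.path.join(data_folder, author)
    os.makedirs(author_folder, exist_ok=True)  # create the subfolder if missing
    path = os.path.join(author_folder, filename)
    if not os.path.exists(path):  # skip books that are already on disk
        with open(path, "wb") as out:
            out.write(fetch(url))
    return path
```

Calling `download_book(url, "twain", "book.txt", data_folder)` with the URL of a plain-text book would then write the file under the `twain` subfolder. Skipping files that already exist means an interrupted run can be restarted without fetching everything again.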

Next, run the script from the code bundle to download each of the books from Project Gutenberg. This will place them in the appropriate subfolders of this data folder.

To run the script, download the getdata.py script from the Chapter 9 folder in the code bundle. Save it to your notebooks folder and enter the following into a new cell:

%load getdata.py

Then, from inside your IPython Notebook, press Shift + Enter to run the cell. This will load the script into the cell. Then click inside the cell again and press Shift + Enter to run the script itself. This will take a while, but it will print a message to let you know it is complete.
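Once the script has finished, it can be worth checking that the files actually arrived. The sketch below counts the files in each subfolder of the data folder. The helper name `count_books` is hypothetical, and it assumes the script creates one subfolder per author; if every download succeeded, the total should match the 177 documents listed earlier.

```python
import os


def count_books(data_folder):
    """Return {subfolder name: number of files} for each subfolder."""
    counts = {}
    for name in sorted(os.listdir(data_folder)):
        subfolder = os.path.join(data_folder, name)
        if os.path.isdir(subfolder):
            counts[name] = sum(
                1 for f in os.listdir(subfolder)
                if os.path.isfile(os.path.join(subfolder, f))
            )
    return counts


data_folder = os.path.join(os.path.expanduser("~"), "Data", "books")
if os.path.isdir(data_folder):  # only report if the folder exists
    counts = count_books(data_folder)
    for name, n in counts.items():
        print(name, n)
    print("Total:", sum(counts.values()))
```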
