
Learning Data Mining with Python


Chapter 3

Using pandas to load the dataset

The pandas library is a library for loading, managing, and manipulating data. It handles data structures behind the scenes and supports analysis methods, such as computing the mean.

When doing multiple data mining experiments, you will find that you write many of the same functions again and again, such as reading files and extracting features. Each time this reimplementation happens, you run the risk of introducing bugs. Using a high-quality library such as pandas significantly reduces the amount of work needed to write these functions and also gives you more confidence, because you are relying on well-tested code.

Throughout this book, we will use pandas extensively, introducing use cases as we go.

We can load the dataset using the read_csv function:

    import pandas as pd
    dataset = pd.read_csv(data_filename)

The result is a pandas DataFrame, which has some useful functions that we will use later on. Looking at the resulting dataset, we can see some issues. Type the following and run the code to see the first five rows of the dataset:

    dataset.head()

The output shows a usable dataset, but it contains some problems that we will fix up soon.
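To try the loading step without the real data file, here is a minimal sketch that reads a tiny in-memory CSV in place of data_filename. The rows, team names, and column names are made up for illustration and are not the book's dataset.

```python
import io

import pandas as pd

# Made-up sample data standing in for the file at data_filename.
csv_text = """Date,Visitor,VisitorPts,Home,HomePts
2016-01-01,Team A,90,Team B,100
2016-01-02,Team C,105,Team A,95
2016-01-03,Team B,88,Team C,92
"""

# read_csv accepts a path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows (fewer if the dataset is shorter).
first_rows = dataset.head()
print(first_rows)

# An example of a built-in analysis method: the mean of a numeric column.
mean_home_pts = dataset["HomePts"].mean()
print(mean_home_pts)

# Without further options, the Date column is not parsed into datetimes.
date_is_datetime = pd.api.types.is_datetime64_any_dtype(dataset["Date"])
print(date_is_datetime)
```

Note that the Date column is left as plain text here, which is one of the issues we address when cleaning up the dataset.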

Predicting Sports Winners with Decision Trees

Cleaning up the dataset

After looking at the output, we can see a number of problems:

• The date is just a string and not a date object
• The first row is blank
• From visually inspecting the results, the headings aren't complete or correct

These issues come from the data, and we could fix them by altering the data itself. However, in doing this, we could forget the steps we took or misapply them; that is, we might not be able to replicate our results. As with the previous section, where we used pipelines to track the transformations we made to a dataset, we will use pandas to apply transformations to the raw data itself.

The pandas.read_csv function has parameters to fix each of these issues, which we can specify when loading the file. We can also change the headings after loading the file, as shown in the following code:

    dataset = pd.read_csv(data_filename, parse_dates=["Date"], skiprows=[0,])
    dataset.columns = ["Date", "Score Type", "Visitor Team", "VisitorPts", "Home Team", "HomePts", "OT?", "Notes"]

The results have significantly improved, as we can see if we print out the first rows of the resulting data frame:

    dataset.head()

Even in well-compiled data sources such as this one, you need to make some adjustments. Different systems have different nuances, resulting in data files that are not quite compatible with each other.
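One way to keep these cleaning steps repeatable is to wrap them in a small function, so the same transformations are applied every time the raw file is loaded. The sketch below assumes a file whose first line is junk and whose second line holds the real headings. The function name load_games and the sample rows are hypothetical; only the read_csv parameters and the column names come from the text above.

```python
import io

import pandas as pd

# Column names taken from the text above.
COLUMNS = ["Date", "Score Type", "Visitor Team", "VisitorPts",
           "Home Team", "HomePts", "OT?", "Notes"]


def load_games(source):
    """Load and clean the raw CSV (hypothetical helper, not from the book)."""
    # skiprows=[0] drops the first line of the file, so the second line
    # becomes the header; parse_dates converts the Date column to datetimes.
    dataset = pd.read_csv(source, parse_dates=["Date"], skiprows=[0])
    # Replace the incomplete headings with descriptive ones.
    dataset.columns = COLUMNS
    return dataset


# Made-up sample data: a junk first line, a header line, and two rows.
raw_csv = """,,,,,,,
Date,Box,Visitor,PTS,Home,PTS,OT,Notes
2016-01-01,Box Score,Team A,90,Team B,100,,
2016-01-02,Box Score,Team C,105,Team A,95,OT,
"""

dataset = load_games(io.StringIO(raw_csv))
print(dataset.head())
print(dataset.dtypes)
```

Because the raw data is never edited by hand, rerunning the function on the original file reproduces the same cleaned data frame.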

