24.07.2016 Views

www.allitebooks.com

Learning%20Data%20Mining%20with%20Python

Learning%20Data%20Mining%20with%20Python

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Chapter 3<br />

Now that we have our dataset, we can <strong>com</strong>pute a baseline. A baseline is an accuracy<br />

that indicates an easy way to get a good accuracy. Any data mining solution should<br />

beat this.<br />

In each match, we have two teams: a home team and a visitor team. An obvious<br />

baseline, called the chance rate, is 50 percent. Choosing randomly will (over time)<br />

result in an accuracy of 50 percent.<br />

Extracting new features<br />

We can now extract our features from this dataset by <strong>com</strong>bining and <strong>com</strong>paring<br />

the existing data. First up, we need to specify our class value, which will give<br />

our classification algorithm something to <strong>com</strong>pare against to see if its prediction<br />

is correct or not. This could be encoded in a number of ways; however, for this<br />

application, we will specify our class as 1 if the home team wins and 0 if the visitor<br />

team wins. In basketball, the team with the most points wins. So, while the data set<br />

doesn't specify who wins, we can <strong>com</strong>pute it easily.<br />

We can specify the data set by the following:<br />

dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]<br />

We then copy those values into a NumPy array to use later for our scikit-learn<br />

classifiers. There is not currently a clean integration between pandas and scikit-learn,<br />

but they work nicely together through the use of NumPy arrays. While we will use<br />

pandas to extract features, we will need to extract the values to use them with<br />

scikit-learn:<br />

y_true = dataset["HomeWin"].values<br />

The preceding array now holds our class values in a format that scikit-learn can read.<br />

We can also start creating some features to use in our data mining. While sometimes<br />

we just throw the raw data into our classifier, we often need to derive continuous<br />

numerical or categorical features.<br />

The first two features we want to create to help us predict which team will win<br />

are whether either of those two teams won their last game. This would roughly<br />

approximate which team is playing well.<br />

We will <strong>com</strong>pute this feature by iterating through the rows in order and recording<br />

which team won. When we get to a new row, we look up whether the team won the<br />

last time we saw them.<br />

[ 45 ]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!