
Predicting Sports Winners with Decision Trees

These integers can be fed into the decision tree, but they will still be interpreted as continuous features by DecisionTreeClassifier. For example, teams may be allocated integers 0 to 16. The algorithm will see teams 1 and 2 as being similar, while teams 4 and 10 will be very different, but this makes no sense at all. All of the teams are different from each other: two teams are either the same or they are not!
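To make the issue concrete, here is a minimal sketch (the team names are illustrative only; LabelEncoder is the transformer used in the earlier step of this chapter):

from sklearn.preprocessing import LabelEncoder

# Illustrative team names; the real dataset uses the full NBA team list
teams = ["Bulls", "Celtics", "Heat", "Lakers"]
print(LabelEncoder().fit_transform(teams))
# [0 1 2 3]: a tree would treat "Bulls" (0) as closer to "Celtics" (1)
# than to "Lakers" (3), even though no such ordering exists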

To fix this inconsistency, we use the OneHotEncoder transformer to encode these integers into a number of binary features. Each binary feature represents a single value of the original feature. For example, if the NBA team Chicago Bulls is allocated as integer 7 by the LabelEncoder, then the seventh feature returned by the OneHotEncoder will be a 1 if the team is Chicago Bulls and 0 for all other teams. This is done for every possible value, resulting in a much larger dataset. The code is as follows:

from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()

We fit and transform on the same dataset, saving the results:

X_teams_expanded = onehot.fit_transform(X_teams).todense()
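As a quick check of what the transformer produces, consider this toy array (an assumption for illustration, not the chapter's data):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

toy = np.array([[0], [2], [1]])
print(OneHotEncoder().fit_transform(toy).todense())
# Each row has a single 1 in the column for its integer value:
# 0 -> [1, 0, 0], 2 -> [0, 0, 1], 1 -> [0, 1, 0]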

Next, we run the decision tree as before on the new dataset:

# Imports repeated from earlier in the chapter for a self-contained snippet
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases
import numpy as np

clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_teams_expanded, y_true,
                         scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This scores an accuracy of 60 percent. The score is better than the baseline, but not as good as before. It is possible that the larger number of features was not handled properly by the decision trees. For this reason, we will try changing the algorithm and see if that helps. Data mining can be an iterative process of trying new algorithms and features.

Random forests

A single decision tree can learn quite complex functions. However, it will be prone to overfitting: learning rules that work only for the training set. One of the ways that we can adjust for this is to limit the number of rules that it learns. For instance, we could limit the depth of the tree to just three layers. Such a tree will learn the best rules for splitting the dataset at a global level, but won't learn highly specific rules that separate the dataset into highly accurate groups. This trade-off results in trees that may generalize well, but with slightly poorer performance overall.
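Depth limiting of this kind is a single parameter on the classifier. A minimal sketch, assuming the same scikit-learn setup as above (max_depth=3 mirrors the three-layer example):

from sklearn.tree import DecisionTreeClassifier

# Cap the tree at three levels of splits: it keeps the globally
# strongest rules but cannot memorize tiny training-set pockets
clf = DecisionTreeClassifier(max_depth=3, random_state=14)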
