Learning Data Mining with Python

Predicting Sports Winners with Decision Trees

Engineering new features

In the previous few examples, we saw that changing the features can have quite a large impact on the performance of the algorithm. Even in our small amount of testing, results varied by more than 10 percent depending only on the features used.

In pandas, you can create a feature from the output of a simple function by doing something like this:

    dataset["New Feature"] = feature_creator()

The feature_creator function must return a list containing the feature's value for each sample in the dataset. A common pattern is to pass the dataset in as a parameter:

    dataset["New Feature"] = feature_creator(dataset)
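As a minimal sketch of this pattern, the feature_creator below compares two score columns and returns one value per row. The column names HomePts and VisitorPts, the team names, and the scores are assumptions made up for illustration; substitute whatever columns your own dataset has.

```python
import pandas as pd

def feature_creator(dataset):
    """Return one value per row: True if the home team outscored the visitors."""
    # Comparing two columns gives a boolean Series with one entry per sample;
    # tolist() turns it into the plain list described in the text
    return (dataset["HomePts"] > dataset["VisitorPts"]).tolist()

# Toy data, invented for this example
dataset = pd.DataFrame({
    "Home Team": ["Hawks", "Bulls", "Hawks"],
    "Visitor Team": ["Bulls", "Hawks", "Celtics"],
    "HomePts": [101, 95, 88],
    "VisitorPts": [99, 97, 90],
})

dataset["HomeWin"] = feature_creator(dataset)
```

Because the function receives the whole dataset, it can use any combination of existing columns to build the new one.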

You can also create a feature more directly by setting every value to a single "default" value, such as 0 in the next line:

    dataset["My New Feature"] = 0

You can then iterate over the dataset, computing the features as you go. We used this format in this chapter to create many of our features:

    for index, row in dataset.iterrows():
        home_team = row["Home Team"]
        visitor_team = row["Visitor Team"]
        # Some calculation here to alter row
        # .loc replaces .ix, which was removed in pandas 1.0
        dataset.loc[index] = row

Keep in mind that this pattern isn't very efficient, because it visits the rows one at a time in Python. If you are going to use it, compute all of your features in the same pass. A common "best practice" is to touch every sample as little as possible, preferably only once.
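To make the "touch every sample once" advice concrete, here is a sketch that fills in both HomeLastWin and VisitorLastWin during a single loop over the rows. The HomeWin column, the team names, and the won_last dictionary are assumptions for illustration, and the rows are assumed to be in chronological order. It writes individual cells with dataset.at rather than assigning the whole row back, which is one reasonable option rather than the only one.

```python
from collections import defaultdict
import pandas as pd

# Toy data, invented for this example; rows are assumed to be in date order
dataset = pd.DataFrame({
    "Home Team": ["Hawks", "Bulls", "Hawks", "Celtics"],
    "Visitor Team": ["Bulls", "Hawks", "Celtics", "Bulls"],
    "HomeWin": [True, False, False, True],
})

# Start both new features at a default value
dataset["HomeLastWin"] = False
dataset["VisitorLastWin"] = False

# Whether each team won its most recent game; unseen teams default to False
won_last = defaultdict(bool)

for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Fill in both features during the same visit to this row
    dataset.at[index, "HomeLastWin"] = won_last[home_team]
    dataset.at[index, "VisitorLastWin"] = won_last[visitor_team]
    # Update the running state only after this row's features are set,
    # so a game never sees its own result
    won_last[home_team] = bool(row["HomeWin"])
    won_last[visitor_team] = not row["HomeWin"]
```

Adding a third feature here means adding lines inside the existing loop, not writing a second loop over the dataset.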

Some example features that you could try to implement are as follows:

• How many days has it been since each team's previous match? Teams may be tired if they play too many games in a short time frame.

• How many games of the last five did each team win? This will give a more stable form of the HomeLastWin and VisitorLastWin features we extracted earlier (and can be extracted in a very similar way).

• Do teams have a good record when visiting certain other teams? For instance, one team may play well in a particular stadium, even if they are the visitors.
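As one possible starting point for the second idea in the list, the sketch below counts how many of its previous five games each team won. The function name last_five_wins, the new column names, the HomeWin column, and the toy data are all assumptions for illustration, and the rows are assumed to be in chronological order.

```python
from collections import defaultdict, deque
import pandas as pd

def last_five_wins(dataset):
    """Return two lists: wins in the previous five games for home and visitor."""
    # Each team keeps only its five most recent results;
    # deque(maxlen=5) drops the oldest result automatically
    recent = defaultdict(lambda: deque(maxlen=5))
    home_counts, visitor_counts = [], []
    games = zip(dataset["Home Team"], dataset["Visitor Team"], dataset["HomeWin"])
    for home_team, visitor_team, home_win in games:
        # Count wins before recording this game, so the feature
        # only uses information available before the match
        home_counts.append(sum(recent[home_team]))
        visitor_counts.append(sum(recent[visitor_team]))
        recent[home_team].append(bool(home_win))
        recent[visitor_team].append(not home_win)
    return home_counts, visitor_counts

# Toy data, invented for this example: the same two teams meet seven times
dataset = pd.DataFrame({
    "Home Team": ["Hawks"] * 7,
    "Visitor Team": ["Bulls"] * 7,
    "HomeWin": [True, True, True, True, True, True, False],
})

dataset["HomeWinsLastFive"], dataset["VisitorWinsLastFive"] = last_five_wins(dataset)
```

A team with fewer than five earlier games simply gets the number of wins it has so far, which you may want to treat differently in a real experiment.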
