
Learning Data Mining with Python


Appendix: Next Steps…

To install it, clone the repository and follow the instructions for installing the Bleeding Edge code, available at http://scikit-learn.org/stable/install.html. Remember to use that repository's code rather than the official source. I recommend installing it with virtualenv or inside a virtual machine, rather than directly on your computer. A good guide to virtualenv can be found here: http://docs.python-guide.org/en/latest/dev/virtualenvs/.

More complex pipelines

http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces

The Pipelines we have used in this book follow a single stream: the output of one step is the input of the next step. Pipelines also follow the transformer and estimator interfaces, which allows us to embed Pipelines within Pipelines. This is a useful construct for very complex models, and it becomes very powerful when combined with Feature Unions, as shown at the above link. Feature Unions allow us to extract multiple types of features at the same time and then combine them into a single dataset; a short sketch appears below, after the Comparing classifiers section. For more details, see the example at http://scikit-learn.org/stable/auto_examples/feature_stacker.html.

Comparing classifiers

There are lots of classifiers in scikit-learn that are ready to use, and the one you choose for a particular task will depend on a variety of factors. You can compare the f1-scores to see which method is better, and you can investigate the deviation of those scores to check whether the difference is statistically significant. An important requirement is that the classifiers are trained and tested on the same data; that is, the test set for one classifier is the test set for all classifiers. Our use of random states allows us to ensure this is the case, which is also an important factor for replicating experiments.
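To make the comparison concrete, here is a minimal sketch rather than code from the book: the synthetic dataset, the two classifiers, and the random state value are placeholder choices, and it assumes a scikit-learn version that provides the model_selection module. The key point is that a single cross-validation splitter with a fixed random state is reused, so every classifier is scored on exactly the same folds:

# Minimal sketch: compare classifiers on identical folds using the f1-score.
# The synthetic data and the classifiers chosen here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=14)

# A fixed random state means every classifier sees exactly the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)

for clf in (DecisionTreeClassifier(random_state=14), GaussianNB()):
    scores = cross_val_score(clf, X, y, scoring="f1", cv=cv)
    print("{}: mean f1 = {:.3f}, std = {:.3f}".format(
        type(clf).__name__, scores.mean(), scores.std()))

From the mean and standard deviation printed for each classifier, you can judge whether the gap between methods is larger than the fold-to-fold variation.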

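Returning to the More complex pipelines section above, the following is a minimal sketch of a Feature Union embedded in a Pipeline; it is not code from the book. The tiny in-line documents, the choice of word counts plus character n-grams, and the Naive Bayes classifier are all placeholder assumptions. The point is only that both extractors run on the same input and their outputs are concatenated into a single feature matrix:

# Minimal sketch: two feature extractors combined with a FeatureUnion.
# The documents, labels, and parameter values below are illustrative only.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["the spacecraft reached a stable orbit",
        "the car engine needs a new belt",
        "rockets burn liquid fuel in stages",
        "the tyres and brakes were replaced"]
labels = [0, 1, 0, 1]  # 0 = space, 1 = cars (made-up labels)

# Both extractors see the same raw documents; FeatureUnion concatenates
# their outputs column-wise into one feature matrix.
features = FeatureUnion([
    ("word_counts", CountVectorizer(analyzer="word")),
    ("char_ngrams", CountVectorizer(analyzer="char", ngram_range=(2, 3))),
])

model = Pipeline([
    ("features", features),
    ("classifier", MultinomialNB()),
])

model.fit(docs, labels)
print(model.predict(["the shuttle fuel tank was inspected"]))

Because the whole model is itself a Pipeline, it can in turn be nested inside another Pipeline or passed to cross_val_score like any other estimator.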
Chapter 3: Predicting Sports Winners with Decision Trees

More on pandas

http://pandas.pydata.org/pandas-docs/stable/tutorials.html

The pandas library is a great package: anything you normally write to do data loading is probably already implemented in pandas. You can learn more about it from the tutorial linked above. There is also a great blog post, written by Chris Moffitt, that overviews common tasks people do in Excel and how to do them in pandas: http://pbpython.com/excel-pandas-comp.html. You can also handle large datasets with pandas; see the answer from user Jeff (the top answer at the time of writing) to this StackOverflow question for an extensive overview of the process: http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas. Another great tutorial on pandas is written by Brian Connelly: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/. A short example of loading and summarizing data this way appears at the end of these notes.

More complex features

http://www.basketball-reference.com/teams/ORL/2014_roster_status.html

Sports teams change regularly from game to game, and what would be an easy win for a team can turn into a difficult game if a couple of its best players are injured. You can get team rosters from basketball-reference as well. For example, the roster for the Orlando Magic's 2013-14 season is available at the above link, and similar data is available for all NBA teams. Writing code that measures how much a team changes between games, and using that to add new features, can improve the model significantly. This task will take quite a bit of work, though!
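The following minimal sketch shows one way to turn two consecutive rosters into a single numeric feature. The player names are invented, and the step of actually fetching rosters from basketball-reference is left out, since scraping the site is its own task:

# Hypothetical sketch: turn roster churn between consecutive games into a
# numeric feature. The player names below are invented for illustration.
def roster_change(previous_roster, current_roster):
    """Fraction of last game's roster that is missing from this game."""
    previous_roster = set(previous_roster)
    current_roster = set(current_roster)
    if not previous_roster:
        return 0.0
    return len(previous_roster - current_roster) / len(previous_roster)

last_game = ["Player A", "Player B", "Player C", "Player D", "Player E"]
this_game = ["Player A", "Player B", "Player C", "Player F", "Player G"]

print(roster_change(last_game, this_game))  # 0.4: two of five players changed

A value like this could be computed for both the home team and the visitor team of each game and appended as extra columns to the games dataset from Chapter 3 before training the decision tree.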

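Finally, returning to the More on pandas section above, here is the small loading-and-summarizing sketch promised there. The filename and column names are hypothetical placeholders rather than files used in the book:

# Small sketch of loading and summarizing data with pandas.
# "nba_games.csv", "Date", "Home Team", and "Home Points" are hypothetical.
import pandas as pd

games = pd.read_csv("nba_games.csv", parse_dates=["Date"])

# One-line numeric summary of every column
print(games.describe())

# Average points scored by each home team
print(games.groupby("Home Team")["Home Points"].mean())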
