Learning Data Mining with Python
Appendix: Next Steps…

To install it, clone the repository and follow the instructions to install the Bleeding Edge code, available at http://scikit-learn.org/stable/install.html. Remember to use the above repository's code, rather than the official source. I recommend you install using virtualenv or a virtual machine, rather than installing it directly on your computer. A great guide to virtualenv can be found here: http://docs.python-guide.org/en/latest/dev/virtualenvs/.

More complex pipelines

http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces

The Pipelines we have used in this book follow a single stream: the output of one step is the input of the next.

Pipelines follow the transformer and estimator interfaces as well, which allows us to embed Pipelines within Pipelines. This is a useful construct for very complex models, but it becomes very powerful when combined with Feature Unions, as shown in the above link. Feature Unions let us extract multiple types of features at a time and then combine them to form a single dataset; a sketch appears at the end of this section. For more details, see the example at http://scikit-learn.org/stable/auto_examples/feature_stacker.html.

Comparing classifiers

There are lots of ready-to-use classifiers in scikit-learn, and the one you choose for a particular task will depend on a variety of factors. You can compare f1-scores to see which method is better, and you can investigate the deviation of those scores to see whether the difference is statistically significant.

An important factor is that the classifiers are trained and tested on the same data; that is, the test set for one classifier is the test set for all classifiers. Our use of random states ensures that this is the case, which is also important for replicating experiments.
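A minimal sketch of such a comparison, assuming a feature matrix X and a label vector y have already been loaded (the two classifiers are just examples):

from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

classifiers = {"Decision Tree": DecisionTreeClassifier(random_state=14),
               "Nearest Neighbors": KNeighborsClassifier()}
for name, classifier in classifiers.items():
    # An integer cv builds the same deterministic, unshuffled stratified folds
    # on every call, so each classifier is tested on exactly the same data
    scores = cross_val_score(classifier, X, y, scoring='f1', cv=5)
    print("{0}: mean f1 = {1:.3f} (std {2:.3f})".format(name, scores.mean(), scores.std()))

Returning to the Feature Unions mentioned above, here is a minimal sketch for a hypothetical text-classification task, assuming a list of strings called documents and the labels y are already loaded; the two vectorizers chosen here are purely illustrative:

from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    # FeatureUnion extracts both types of features and concatenates the
    # resulting matrices column-wise into a single dataset
    ('features', FeatureUnion([
        ('word_counts', CountVectorizer(analyzer='word')),
        ('char_ngrams', TfidfVectorizer(analyzer='char', ngram_range=(2, 3))),
    ])),
    ('classifier', MultinomialNB()),
])
scores = cross_val_score(pipeline, documents, y, scoring='f1')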
Chapter 3: Predicting Sports Winners with Decision Trees

More on pandas

http://pandas.pydata.org/pandas-docs/stable/tutorials.html

The pandas library is a great package; anything you normally write to do data loading is probably already implemented in pandas. You can learn more about it from the tutorials linked above. There is also a great blog post, written by Chris Moffitt, that gives an overview of common tasks people do in Excel and how to do them in pandas: http://pbpython.com/excel-pandas-comp.html

You can also handle large datasets with pandas; see the answer from user Jeff (the top answer at the time of writing) to this StackOverflow question for an extensive overview of the process: http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas.

Another great tutorial on pandas was written by Brian Connelly: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/

More complex features

http://www.basketball-reference.com/teams/ORL/2014_roster_status.html

Sports teams change regularly from game to game. An easy win for one team can turn into a difficult game if a couple of its best players are injured. You can get team rosters from basketball-reference as well; for example, the roster for the Orlando Magic's 2013-2014 season is available at the above link, and similar data is available for all NBA teams. Writing code to measure how much a team has changed between games, and using that to add new features, can improve the model significantly (a sketch of one approach follows). This task will take quite a bit of work, though!
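A minimal sketch of one way to quantify roster change between consecutive games; the data structure (a date-ordered list of (date, set of player names) pairs per team) and the toy roster below are hypothetical stand-ins for data you would scrape yourself:

def roster_changes(rosters):
    """Yield (team, date, n_changes) for each game after a team's first."""
    for team, games in rosters.items():
        previous_players = None
        for date, players in games:
            if previous_players is not None:
                # The symmetric difference counts players added plus players lost
                yield team, date, len(players ^ previous_players)
            previous_players = players

# Toy data: two players changed between these two games
rosters = {"Orlando Magic": [("2013-11-01", {"Afflalo", "Oladipo", "Vucevic"}),
                             ("2013-11-03", {"Afflalo", "Nelson", "Vucevic"})]}
for team, date, n_changes in roster_changes(rosters):
    print(team, date, n_changes)  # Orlando Magic 2013-11-03 2

The resulting counts could then be merged back into the dataset from Chapter 3 as an extra feature column for each game.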