
To compensate for this, we could create many decision trees and then ask each to predict the class value. We could take a majority vote and use that answer as our overall prediction. Random forests work on this principle.

There are two problems with the aforementioned procedure. The first problem is that building decision trees is largely deterministic: using the same input will result in the same output each time. We only have one training dataset, which means our input (and therefore the output) will be the same if we try to build multiple trees. We can address this by choosing a random subsample of our dataset, effectively creating new training sets. This process is called bagging.

The second problem is that the features that are used for the first few decision nodes in our tree will be quite good. Even if we choose random subsamples of our training data, it is still quite possible that the decision trees built will be largely the same. To compensate for this, we also choose a random subset of the features to perform our data splits on.

Then, we have randomly built trees using randomly chosen samples and (nearly) randomly chosen features. This is a Random Forest and, perhaps unintuitively, this algorithm is very effective on many datasets.

How do ensembles work?

The randomness inherent in Random forests may make it seem like we are leaving the results of the algorithm up to chance. However, we apply the benefits of averaging to nearly randomly built decision trees, resulting in an algorithm that reduces the variance of the result.

Variance is the error introduced by the algorithm's sensitivity to variations in the training dataset. Algorithms with a high variance (such as decision trees) can be greatly affected by variations to the training dataset, which results in models that have the problem of overfitting.

In contrast, bias is the error introduced by assumptions in the algorithm rather than anything to do with the dataset. For example, if we had an algorithm that presumed that all features would be normally distributed, then our algorithm may have a high error if the features were not. Negative impacts from bias can be reduced by analyzing the data to see if the classifier's data model matches that of the actual data.
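The bagging-and-random-features construction described above can be sketched directly in code. The following is only a minimal illustration, not the book's pipeline: the synthetic dataset from make_classification and the tree and feature counts are placeholders, and scikit-learn's own Random Forest picks a fresh feature subset at every split rather than once per tree, as this simplified version does.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder dataset; in this chapter the features come from the sports results.
X, y = make_classification(n_samples=500, n_features=20, random_state=14)

rng = np.random.RandomState(14)
n_trees, n_subfeatures = 10, 5
trees, feature_subsets = [], []

for _ in range(n_trees):
    # Bagging: sample rows with replacement to create a new training set.
    rows = rng.randint(0, X.shape[0], size=X.shape[0])
    # Use only a random subset of the features for this tree's splits.
    features = rng.choice(X.shape[1], size=n_subfeatures, replace=False)
    tree = DecisionTreeClassifier(random_state=14)
    tree.fit(X[rows][:, features], y[rows])
    trees.append(tree)
    feature_subsets.append(features)

# Each tree votes; the majority vote is the ensemble's prediction.
votes = np.array([tree.predict(X[:, features])
                  for tree, features in zip(trees, feature_subsets)], dtype=int)
ensemble_prediction = np.array([np.bincount(column).argmax() for column in votes.T])
print("Ensemble training accuracy: {0:.1f}%".format(
    100 * np.mean(ensemble_prediction == y)))

Each individual tree sees a slightly different dataset and a different slice of the features, so their mistakes differ, which is exactly what the averaging step relies on.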

By averaging a large number of decision trees, this variance is greatly reduced. This results in a model with a higher overall accuracy.

In general, ensembles work on the assumption that errors in prediction are effectively random and that those errors are quite different from classifier to classifier. By averaging the results across many models, these random errors are canceled out, leaving the true prediction. We will see many more ensembles in action throughout the rest of the book.

Parameters in Random forests

The Random forest implementation in scikit-learn is called RandomForestClassifier, and it has a number of parameters. As Random forests use many instances of DecisionTreeClassifier, they share many of the same parameters, such as the criterion (Gini Impurity or Entropy/Information Gain), max_features, and min_samples_split.

Also, there are some new parameters that are used in the ensemble process:

• n_estimators: This dictates how many decision trees should be built. A higher value will take longer to run, but will (probably) result in a higher accuracy.
• oob_score: If true, the method is tested using samples that aren't in the random subsamples chosen for training the decision trees.
• n_jobs: This specifies the number of cores to use when training the decision trees in parallel.

The scikit-learn package uses a library called Joblib for built-in parallelization. This parameter dictates how many cores to use. By default, only a single core is used; if you have more cores, you can increase this, or set it to -1 to use all cores. A short sketch showing these parameters in use follows the cross-validation code below.

Applying Random forests

Random forests in scikit-learn use the estimator interface, allowing us to use almost exactly the same code as before to perform cross-fold validation:

import numpy as np
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))
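As a rough follow-up (assuming the same X_teams and y_true arrays from the code above), the sketch below shows the ensemble parameters just listed being set explicitly; the particular values chosen for n_estimators and n_jobs are illustrative rather than the book's.

from sklearn.ensemble import RandomForestClassifier

# Build more trees, train them on all available cores, and keep the
# out-of-bag score, which is computed on the rows each tree's bootstrap
# sample left out.
clf = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1,
                             random_state=14)
clf.fit(X_teams, y_true)
print("OOB accuracy estimate: {0:.1f}%".format(clf.oob_score_ * 100))

Because the out-of-bag rows were not used to grow the tree that is tested on them, oob_score_ behaves like a built-in validation estimate and does not require a separate cross-validation run.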

