To compensate for this, we could create many decision trees and then ask each to predict the class value. We could then take a majority vote and use that answer as our overall prediction. Random forests work on this principle.

There are two problems with the aforementioned procedure. The first problem is that building decision trees is largely deterministic: the same input will produce the same tree each time. We only have one training dataset, which means our input (and therefore the output) will be the same if we try to build multiple trees. We can address this by choosing a random subsample of our dataset, effectively creating new training sets. This process is called bagging.

The second problem is that the features used for the first few decision nodes in our tree will be quite good. Even if we choose random subsamples of our training data, it is still quite possible that the decision trees built will be largely the same. To compensate for this, we also choose a random subset of the features to perform our data splits on.

Then, we have randomly built trees using randomly chosen samples and (nearly) randomly chosen features. This is a Random Forest and, perhaps unintuitively, this algorithm is very effective on many datasets.

How do ensembles work?
The randomness inherent in Random forests may make it seem like we are leaving the results of the algorithm up to chance. However, by averaging over many nearly randomly built decision trees, we obtain an algorithm that reduces the variance of the result.

Variance is the error introduced by the algorithm's sensitivity to variations in the training dataset. Algorithms with a high variance (such as decision trees) can be greatly affected by small changes to the training dataset, which results in models that overfit. In contrast, bias is the error introduced by assumptions in the algorithm rather than by the dataset; for example, an algorithm that presumes all features are normally distributed may have a high error when the features are not. Negative impacts from bias can be reduced by analyzing the data to check whether the classifier's model of the data matches the actual data.
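To make the random forest procedure described above concrete, here is a minimal sketch, assuming X is a NumPy feature matrix and y is an array of non-negative integer class labels (both hypothetical here). Each tree is trained on a bootstrap sample, the max_features option of DecisionTreeClassifier stands in for the per-split feature sampling, and prediction is a majority vote:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, n_trees=100, random_state=14):
    rng = np.random.RandomState(random_state)
    trees = []
    for _ in range(n_trees):
        # Bagging: sample row indices with replacement, creating a new
        # training set the same size as the original
        indices = rng.randint(0, X.shape[0], X.shape[0])
        # max_features="sqrt" makes each split consider only a random
        # subset of the features
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=rng.randint(2 ** 31))
        trees.append(tree.fit(X[indices], y[indices]))
    return trees

def predict_majority_vote(trees, X):
    # Collect one row of predictions per tree, then take the most
    # common class value for each sample
    all_predictions = np.array([tree.predict(X) for tree in trees])
    return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(),
                               axis=0, arr=all_predictions)

In practice, you would use scikit-learn's built-in implementation, shown later in this section, rather than rolling your own.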
By averaging a large number of decision trees, this variance is greatly reduced. This results in a model with a higher overall accuracy.

In general, ensembles work on the assumption that errors in prediction are effectively random and that those errors are quite different from classifier to classifier. By averaging the results across many models, these random errors are canceled out, leaving the true prediction. We will see many more ensembles in action throughout the rest of the book.
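As a purely illustrative sketch of this cancellation effect, we can simulate classifiers whose errors really are independent and compare one classifier against a majority vote (real models' errors are usually correlated, so practical gains are smaller):

import numpy as np

rng = np.random.RandomState(14)
n_classifiers, n_samples = 100, 1000
y_true = rng.randint(0, 2, n_samples)

# Each simulated classifier is wrong on an independent random 40
# percent of the samples, that is, 60 percent accuracy on its own
errors = rng.random_sample((n_classifiers, n_samples)) < 0.4
predictions = np.where(errors, 1 - y_true, y_true)

print(np.mean(predictions[0] == y_true))  # one classifier: about 0.6
majority_vote = (predictions.mean(axis=0) > 0.5).astype(int)
print(np.mean(majority_vote == y_true))   # majority vote: far higher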
Parameters in Random forests
The Random forest implementation in scikit-learn is called RandomForestClassifier, and it has a number of parameters. As Random forests use many instances of DecisionTreeClassifier, they share many of the same parameters, such as the criterion (Gini Impurity or Entropy/Information Gain), max_features, and min_samples_split.
Also, there are some new parameters that are used in the ensemble process:

• n_estimators: This dictates how many decision trees should be built. A higher value will take longer to run, but will (probably) result in a higher accuracy.
• oob_score: If true, the method is tested using samples that aren't in the random subsamples chosen for training the decision trees.
• n_jobs: This specifies the number of cores to use when training the decision trees in parallel.

The scikit-learn package uses a library called Joblib for in-built parallelization. This parameter dictates how many cores to use. By default, only a single core is used; if you have more cores, you can increase this, or set it to -1 to use all cores.
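As a quick illustration of these parameters together, using the X_teams and y_true arrays from this chapter (the values chosen here are arbitrary examples, not tuned recommendations):

from sklearn.ensemble import RandomForestClassifier

# Build 100 trees, keep an out-of-bag accuracy estimate, and train
# the trees on all available cores
clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                             n_jobs=-1, random_state=14)
clf.fit(X_teams, y_true)
print(clf.oob_score_)  # accuracy estimated on the out-of-bag samples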
Applying Random forests
Random forests in scikit-learn use the estimator interface, allowing us to use almost exactly the same code as before to perform cross-fold validation:
import numpy as np
# In older scikit-learn versions, cross_val_score lives in
# sklearn.cross_validation rather than sklearn.model_selection
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))