Our dataset, X, is a simple array containing the numbers 0 to 29, arranged in ten rows of three columns:

X = np.arange(30).reshape((10, 3))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23],
       [24, 25, 26],
       [27, 28, 29]])

Then, we set the entire second column/feature to the value 1:

X[:,1] = 1

The result has lots of variance in the first and third columns, but no variance in the second column:

array([[ 0,  1,  2],
       [ 3,  1,  5],
       [ 6,  1,  8],
       [ 9,  1, 11],
       [12,  1, 14],
       [15,  1, 17],
       [18,  1, 20],
       [21,  1, 23],
       [24,  1, 26],
       [27,  1, 29]])

We can now create a VarianceThreshold transformer and apply it to our dataset:

from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold()
Xt = vt.fit_transform(X)

Now, the result Xt does not have the second column:

array([[ 0,  2],
       [ 3,  5],
       [ 6,  8],
       [ 9, 11],
       [12, 14],
       [15, 17],
       [18, 20],
       [21, 23],
       [24, 26],
       [27, 29]])

We can observe the variances for each column by printing the vt.variances_ attribute:

print(vt.variances_)

The result shows that while the first and third columns contain at least some information, the second column has no variance:

array([ 74.25, 0. , 74.25])

A simple and obvious test like this is always good to run when seeing data for the first time. Features with no variance do not add any value to a data mining application; however, they can slow down the performance of the algorithm.
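By default, VarianceThreshold removes only features whose variance is exactly zero. It also accepts a threshold parameter, so features whose variance is merely very small can be discarded too. As a minimal sketch, using made-up data in which the second feature barely varies:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: the second feature varies only slightly
X_demo = np.array([[0, 2.0, 0],
                   [1, 2.1, 3],
                   [2, 1.9, 6],
                   [3, 2.0, 9]])

# Keep only features whose variance exceeds 0.5
vt = VarianceThreshold(threshold=0.5)
Xt_demo = vt.fit_transform(X_demo)

# The variances are 1.25, 0.005 and 11.25; the middle one falls below 0.5
print(vt.variances_)
print(Xt_demo)  # only the first and third columns remain

A sensible threshold depends on the scale of your features, so it is usually worth inspecting vt.variances_ before settling on a value.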
Selecting the best individual features

If we have a number of features, the problem of finding the best subset is a difficult task. It amounts to solving the data mining problem itself, multiple times. As we saw in Chapter 4, Recommending Movies Using Affinity Analysis, subset-based tasks grow exponentially as the number of features increases. This exponential growth in the time needed is also true for finding the best subset of features.

A workaround to this problem is not to look for a subset that works well together, but rather to just find the best individual features. This univariate feature selection gives us a score based on how well each feature performs by itself. It is usually done for classification tasks, and we generally measure some type of correlation between a variable and the target class.

The scikit-learn package has a number of transformers for performing univariate feature selection. They include SelectKBest, which returns the k best performing features, and SelectPercentile, which returns the top r% of features. In both cases, there are a number of methods of computing the quality of a feature.

There are many different methods to compute how effectively a single feature correlates with a class value. A commonly used method is the chi-squared (χ²) test. Other methods include mutual information and entropy.

We can observe single-feature tests in action using our Adult dataset. First, we extract a dataset and class values from our pandas DataFrame. We get a selection of the features:

X = adult[["Age", "Education-Num", "Capital-gain", "Capital-loss", "Hours-per-week"]].values

We will also create a target class array by testing whether the Earnings-Raw value is above $50,000 or not. If it is, the class will be True; otherwise, it will be False. Let's look at the code:

y = (adult["Earnings-Raw"] == ' >50K').values
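With X and y in place, we can run one of these univariate tests. The following is a minimal sketch using SelectKBest with the chi-squared scoring function (k=3 is an arbitrary choice for illustration, and the code assumes the adult DataFrame from earlier in the chapter has been loaded):

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Score each of the five features against the target and keep the best three
transformer = SelectKBest(score_func=chi2, k=3)
Xt_chi2 = transformer.fit_transform(X, y)

# scores_ holds the chi-squared statistic for each original feature
print(transformer.scores_)

Note that scikit-learn's chi2 scorer requires non-negative feature values, which holds for the five features selected above.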