
First, we create a simple matrix with 10 samples and 3 features, using NumPy:

import numpy as np

X = np.arange(30).reshape((10, 3))

The resulting array looks like this:

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23],
       [24, 25, 26],
       [27, 28, 29]])

Then, we set the entire second column/feature to the value 1:

X[:,1] = 1

The result has lots of variance in the first and third columns, but no variance in the second column:

array([[ 0,  1,  2],
       [ 3,  1,  5],
       [ 6,  1,  8],
       [ 9,  1, 11],
       [12,  1, 14],
       [15,  1, 17],
       [18,  1, 20],
       [21,  1, 23],
       [24,  1, 26],
       [27,  1, 29]])

We can now create a VarianceThreshold transformer and apply it to our dataset:

from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold()
Xt = vt.fit_transform(X)

Now, the result Xt does not have the second column:

array([[ 0,  2],
       [ 3,  5],
       [ 6,  8],
       [ 9, 11],
       [12, 14],
       [15, 17],
       [18, 20],
       [21, 23],
       [24, 26],
       [27, 29]])

We can observe the variances for each column by printing the vt.variances_ attribute:

print(vt.variances_)

The result shows that while the first and third columns contain at least some information, the second column has no variance:

array([ 74.25,  0.  , 74.25])
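By default, VarianceThreshold only removes features whose variance is zero. The transformer also takes a threshold parameter, so we could raise the cutoff to discard low-variance (but not constant) features as well. A minimal sketch, assuming the X array from above; the cutoff of 50 and the variable names are illustrative choices for this toy data:

from sklearn.feature_selection import VarianceThreshold

# Remove every feature whose variance falls below 50. The constant
# second column is dropped; the first and third columns (variance
# 74.25) are kept.
vt_strict = VarianceThreshold(threshold=50)
X_high_variance = vt_strict.fit_transform(X)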

A simple and obvious test like this is always good to run when seeing data for the first time. Features with no variance do not add any value to a data mining application; however, they can slow down the performance of the algorithm.

Selecting the best individual features

If we have a number of features, the problem of finding the best subset is a difficult task. It relates to solving the data mining problem itself, multiple times. As we saw in Chapter 4, Recommending Movies Using Affinity Analysis, subset-based tasks grow exponentially as the number of features increases. This exponential growth in the time needed also applies to finding the best subset of features.

A workaround to this problem is not to look for a subset that works well together, but rather to just find the best individual features. This univariate feature selection gives us a score based on how well a feature performs by itself. This is usually done for classification tasks, and we generally measure some type of correlation between a variable and the target class.

The scikit-learn package has a number of transformers for performing univariate feature selection. They include SelectKBest, which returns the k best-performing features, and SelectPercentile, which returns the top r% of features. In both cases, there are a number of methods for computing the quality of a feature.

There are many different methods to compute how effectively a single feature correlates with a class value. A commonly used method is the chi-squared (χ2) test. Other methods include mutual information and entropy.

We can observe single-feature tests in action using our Adult dataset. First, we extract a dataset and class values from our pandas DataFrame. We get a selection of the features:

X = adult[["Age", "Education-Num", "Capital-gain", "Capital-loss", "Hours-per-week"]].values

We will also create a target class array by testing whether the Earnings-Raw value is above $50,000 or not. If it is, the class will be True. Otherwise, it will be False. Let's look at the code:

y = (adult["Earnings-Raw"] == ' >50K').values
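With X and y in hand, univariate selection could look like the following. This is a minimal sketch rather than the full experiment: the chi-squared scoring function works here because all five features are non-negative, and the choice of k=3 is illustrative:

from sklearn.feature_selection import SelectKBest, chi2

# Score each feature independently against the target class using the
# chi-squared test, then keep the three best-scoring features.
transformer = SelectKBest(score_func=chi2, k=3)
Xt_chi2 = transformer.fit_transform(X, y)

# The per-feature scores indicate how strongly each column correlates
# with the target.
print(transformer.scores_)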
