Learning Data Mining with Python

Extracting Features with Transformers

Feature selection

We will often have a large number of features to choose from, but we wish to select only a small subset. There are many possible reasons for this:

• Reducing complexity: Many data mining algorithms require more time and resources as the number of features increases. Reducing the number of features is a great way to make an algorithm run faster or with fewer resources.

• Reducing noise: Adding extra features doesn't always lead to better performance. Extra features may confuse the algorithm into finding correlations and patterns that have no meaning (this is common in smaller datasets). Choosing only the appropriate features is a good way to reduce the chance of random correlations that have no real meaning.

• Creating readable models: While many data mining algorithms will happily compute an answer for models with thousands of features, the results may be difficult for a human to interpret. In these cases, it may be worth using fewer features and creating a model that a human can understand.

Some classification algorithms can handle data with issues such as these, but getting the data right and choosing features that effectively describe the dataset you are modeling still assists the algorithms.

There are some basic tests we can perform, such as ensuring that the features are at least different. If a feature's values are all the same, it can't give us any extra information to perform our data mining.
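As a quick sanity check before reaching for a library transformer, you can spot constant features directly with NumPy by testing whether each column's variance is zero. This is only an illustrative sketch with made-up values, not code from the book:

```python
import numpy as np

# A toy matrix whose middle feature never varies (illustrative values).
X = np.array([[1, 5, 0],
              [2, 5, 3],
              [4, 5, 9]])

# A feature whose values are all the same has zero variance,
# so a per-column variance of 0 flags an uninformative feature.
constant_features = np.var(X, axis=0) == 0
print(constant_features)  # [False  True False]
```

Here the second column is flagged because every sample shares the value 5.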

The VarianceThreshold transformer in scikit-learn, for instance, will remove any feature that doesn't have at least a minimum level of variance in its values. To show how this works, we first create a simple matrix using NumPy:

import numpy as np
X = np.arange(30).reshape((10, 3))

The result is the numbers zero to 29, in three columns and 10 rows. This represents a synthetic dataset with 10 samples and three features:

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23],
       [24, 25, 26],
       [27, 28, 29]])
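The excerpt ends before VarianceThreshold is applied, but a minimal sketch of using it on this matrix might look like the following. Overwriting the second column with a constant is an illustrative assumption on my part (to give one feature zero variance); the default threshold of 0.0, which drops only constant features, is scikit-learn's documented default:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Recreate the dataset from the text: 10 samples, 3 features.
X = np.arange(30).reshape((10, 3))

# Make the second feature constant so it carries no information
# (an illustrative assumption, not taken from the text).
X[:, 1] = 1

# With the default threshold of 0.0, VarianceThreshold removes
# only features whose values never vary.
vt = VarianceThreshold()
Xt = vt.fit_transform(X)

print(vt.variances_)  # per-feature variances; the middle one is 0
print(Xt.shape)       # (10, 2): the constant column was removed
```

After fitting, `vt.variances_` exposes each feature's variance, which is useful for checking what was removed and why.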
