
Classifying with scikit-learn Estimators

For a mathematics-based algorithm to compare each of these features, the differences in scale, range, and units can be difficult to interpret. If we used the above features in many algorithms, the weight would probably be the most influential feature, purely because its values are larger, and not because of anything to do with the actual effectiveness of the feature.

One of the methods to overcome this is to use a process called preprocessing to normalize the features so that they all have the same range, or are put into categories such as small, medium, and large. Suddenly, the large difference in feature scales has less of an impact on the algorithm, which can lead to large increases in accuracy.
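As a minimal sketch of what normalizing to the same range means, min-max normalization rescales each feature to [0, 1] by subtracting the feature's minimum and dividing by its range. The small array here is a hypothetical example, not part of the Ionosphere dataset:

import numpy as np

# Hypothetical features: weight (kg) and height (m) on very different scales
X_example = np.array([[70.0, 1.80], [60.0, 1.65], [85.0, 1.90]])
# Rescale each column to [0, 1]: (x - min) / (max - min)
X_example_normalized = (X_example - X_example.min(axis=0)) / \
    (X_example.max(axis=0) - X_example.min(axis=0))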

Preprocessing can also be used to choose only the more effective features, create new features, and so on. Preprocessing in scikit-learn is done through Transformer objects, which take a dataset in one form and return an altered dataset after some transformation of the data. These don't have to be numerical, as Transformers are also used to extract features; however, in this section, we will stick with preprocessing.
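To make the Transformer interface concrete, here is a minimal sketch using scikit-learn's MinMaxScaler, one of the built-in preprocessing Transformers. Like every Transformer, it exposes fit, transform, and the fit_transform shortcut (X here stands for any feature array, such as the one above):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# fit() learns the per-feature minimum and maximum from the data
scaler.fit(X_example)
# transform() rescales each feature to the [0, 1] range
X_transformed = scaler.transform(X_example)
# fit_transform() combines both steps in a single call
X_transformed = scaler.fit_transform(X_example)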

An example

We can show an example of the problem by deliberately breaking the Ionosphere dataset. While this is only an example, many real-world datasets have problems of this form. First, we create a copy of the array so that we do not alter the original dataset:

X_broken = np.array(X)

Next, we break the dataset by dividing every second feature by 10:

X_broken[:,::2] /= 10

In theory, this should not have a great effect on the result. After all, the relative values within each feature are unchanged. The major issue is that the scale has changed: the untouched odd-indexed features now have values ten times larger than the even-indexed ones. We can see the effect of this by computing the accuracy:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

estimator = KNeighborsClassifier()
original_scores = cross_val_score(estimator, X, y, scoring='accuracy')
print("The original average accuracy is {0:.1f}%".format(np.mean(original_scores) * 100))
broken_scores = cross_val_score(estimator, X_broken, y, scoring='accuracy')
print("The 'broken' average accuracy is {0:.1f}%".format(np.mean(broken_scores) * 100))
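Applying the preprocessing described above should repair the damage. As a minimal sketch, assuming X_broken and y from the code above, we can rescale the broken dataset with MinMaxScaler before cross-validation and compare the resulting accuracy:

from sklearn.preprocessing import MinMaxScaler

# Rescale every feature back to the [0, 1] range, undoing the scale mismatch
X_rescaled = MinMaxScaler().fit_transform(X_broken)
rescaled_scores = cross_val_score(estimator, X_rescaled, y, scoring='accuracy')
print("The rescaled average accuracy is {0:.1f}%".format(np.mean(rescaled_scores) * 100))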

