
Chapter 5

This returns a different set of features! The features chosen this way are the first, second, and fifth columns: Age, Education, and Hours-per-week worked. This shows that there is no definitive answer to which features are best; it depends on the metric. We can see which feature set is better by running both through a classifier. Keep in mind that the results only indicate which subset is better for a particular classifier and feature combination; there is rarely a case in data mining where one method is strictly better than another in all cases! Let's look at the code:

from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import cross_val_score
clf = DecisionTreeClassifier(random_state=14)
scores_chi2 = cross_val_score(clf, Xt_chi2, y, scoring='accuracy')
scores_pearson = cross_val_score(clf, Xt_pearson, y, scoring='accuracy')

The chi2 average here is 0.83, while the Pearson score is lower at 0.77. For this combination, chi2 returns better results!

It is worth remembering the goal of this data mining activity: predicting wealth. Using a combination of good features and feature selection, we can achieve 83 percent accuracy using just three features of a person!

Feature creation

Sometimes, just selecting features from what we have isn't enough. We can create features in different ways from features we already have. The one-hot encoding method we saw previously is an example of this: instead of having a categorical feature with options A, B, and C, we create three new features, Is it A?, Is it B?, and Is it C?.

Creating new features may seem unnecessary, with no clear benefit; after all, the information is already in the dataset and we just need to use it. However, some algorithms struggle when features correlate significantly, or when there are redundant features. For this reason, there are various ways to create new features from the features we already have.
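As a quick illustration of the one-hot idea described above, here is a minimal sketch using pandas' get_dummies function. The toy categorical column is invented for this example and is not part of the Adult dataset:

```python
import pandas as pd

# An invented categorical feature with options A, B, and C
data = pd.DataFrame({"category": ["A", "B", "C", "A"]})

# One-hot encoding: one binary "Is it X?" column per option
encoded = pd.get_dummies(data["category"], prefix="is")
print(encoded)
```

Each row now has exactly one of the is_A, is_B, and is_C columns set, so a classifier never has to interpret an arbitrary ordering between the categories.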
We are going to load a new dataset, so now is a good time to start a new IPython Notebook. Download the Advertisements dataset from http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements and save it to your Data folder.

Extracting Features with Transformers

Next, we need to load the dataset with pandas. First, we set the data's filename as always:

import os
import numpy as np
import pandas as pd
data_folder = os.path.join(os.path.expanduser("~"), "Data")
data_filename = os.path.join(data_folder, "Ads", "ad.data")

There are a couple of issues with this dataset that stop us from loading it easily. First, the first few features are numerical, but pandas will load them as strings. To fix this, we need to write a converting function that will convert strings to numbers if possible. Otherwise, we will get a NaN (which is short for Not a Number), a special value that indicates that the value could not be interpreted as a number. It is similar to None or null in other programming languages.

Another issue with this dataset is that some values are missing. These are represented in the dataset using the string ?. Luckily, the question mark doesn't convert to a float, so we can convert those to NaNs using the same concept. In later chapters, we will look at other ways of dealing with missing values like this.

We will create a function that will do this conversion for us:

def convert_number(x):

First, we try to convert the string to a number. We surround the conversion in a try/except block, catching a ValueError exception (which is what is thrown if a string cannot be converted into a number this way):

    try:
        return float(x)
    except ValueError:

Finally, if the conversion failed, we return the NaN value that comes from the NumPy library we imported previously:

        return np.nan

Now, we create a dictionary for the conversion. We want to convert all of the features to floats. The defaultdict type comes from the collections module in the standard library:

from collections import defaultdict
converters = defaultdict(convert_number)
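To see the converter in action without downloading the full dataset, here is a self-contained sketch that applies the same convert_number function to a tiny invented CSV held in memory. The sample rows and the three-column layout are made up for illustration; the real ad.data file is read the same way, via its filename instead of a StringIO:

```python
import io

import numpy as np
import pandas as pd

def convert_number(x):
    # Try to interpret the string as a float; missing-value markers like "?"
    # fail the conversion and become NaN instead.
    try:
        return float(x)
    except ValueError:
        return np.nan

# Invented sample in the same spirit as ad.data: numeric columns with "?"
# marking missing values.
sample = io.StringIO("125,125,1.0\n?,468,8.2105\n57,?,3.0\n")

# Apply the converter to every column (three columns in this sample)
converters = {i: convert_number for i in range(3)}
frame = pd.read_csv(sample, header=None, converters=converters)

print(frame)
print(frame.isnull().sum())
```

The question marks come through as NaN, so pandas' missing-value tools (such as isnull) can find them immediately, while the genuine numbers are proper floats rather than strings.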

