24.07.2016 Views

www.allitebooks.com

Learning%20Data%20Mining%20with%20Python

Learning%20Data%20Mining%20with%20Python

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Chapter 5<br />

Similarly, we can convert numerical features to categorical features through a<br />

process called discretization, as we saw in Chapter 4, Re<strong>com</strong>mending Movies Using<br />

Affiity Analysis. We can call any person who is taller than 1.7 m tall, and any person<br />

shorter than 1.7 m short. This gives us a categorical feature (although still an ordinal<br />

one). We do lose some data here. For instance, two people, one 1.69 m tall and one<br />

1.71 m, will be in two different categories and considered drastically different from<br />

each other. In contrast, a person 1.2 m tall will be considered "of roughly the same<br />

height" as the person 1.69 m tall! This loss of detail is a side effect of discretization,<br />

and it is an issue that we deal with when creating models.<br />

In the Adult dataset, we can create a LongHours feature, which tells us if a<br />

person works more than 40 hours per week. This turns our continuous feature<br />

(Hours-per-week) into a categorical one:<br />

adult["LongHours"] = adult["Hours-per-week"] > 40<br />

Creating good features<br />

Modeling, and the loss of information that the simplification causes, are the reasons<br />

why we do not have data mining methods that can just be applied to any dataset. A<br />

good data mining practitioner will have, or obtain, domain knowledge in the area<br />

they are applying data mining to. They will look at the problem, the available data,<br />

and <strong>com</strong>e up with a model that represents what they are trying to achieve.<br />

For instance, a height feature may describe one <strong>com</strong>ponent of a person, but may<br />

not describe their academic performance well. If we were attempting to predict a<br />

person's grade, we may not bother measuring each person's height.<br />

This is where data mining be<strong>com</strong>es more art than science. Extracting good features<br />

is difficult and is the topic of significant and ongoing research. Choosing better<br />

classification algorithms can improve the performance of a data mining application,<br />

but choosing better features is often a better option.<br />

In all data mining applications, you should first outline what you are looking for<br />

before you start designing the methodology that will find it. This will dictate the<br />

types of features you are aiming for, the types of algorithms that you can use, and<br />

the expectations on the final result.<br />

[ 87 ]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!