
Learning Data Mining with Python

24.07.2016 Views

Chapter 5

Thought should always be given to how to represent reality in the form of a model. Rather than just using what has been used in the past, you need to consider the goal of the data mining exercise. What are you trying to achieve? In Chapter 3, Predicting Sports Winners with Decision Trees, we created features by thinking about the goal (predicting winners) and used a little domain knowledge to come up with ideas for new features.

Not all features need to be numeric or categorical. Algorithms have been developed that work directly on text, graphs, and other data structures. Unfortunately, those algorithms are outside the scope of this book. In this book, we mainly use numeric or categorical features.

The Adult dataset is a great example of taking a complex reality and attempting to model it using features. In this dataset, the aim is to estimate whether someone earns more than $50,000 per year. To download the dataset, navigate to http://archive.ics.uci.edu/ml/datasets/Adult and click on the Data Folder link. Download the adult.data and adult.names files into a directory named Adult in your data folder.

This dataset takes a complex task and describes it in features. These features describe the person, their environment, their background, and their life status.

Open a new IPython Notebook for this chapter, import pandas, and set the data's filename:

    import os
    import pandas as pd
    data_folder = os.path.join(os.path.expanduser("~"), "Data", "Adult")
    adult_filename = os.path.join(data_folder, "adult.data")

Using pandas as before, we load the file with read_csv:

    adult = pd.read_csv(adult_filename, header=None,
                        names=["Age", "Work-Class", "fnlwgt", "Education",
                               "Education-Num", "Marital-Status", "Occupation",
                               "Relationship", "Race", "Sex", "Capital-gain",
                               "Capital-loss", "Hours-per-week", "Native-Country",
                               "Earnings-Raw"])

Most of the code is the same as in the previous chapters.
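As a minimal sketch of what header=None and names do in the read_csv call above, the following reads a tiny in-memory table instead of adult.data. The three rows and the shortened column list are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Made-up rows in a headerless, comma-separated layout (not real adult.data records).
raw = io.StringIO("25,Bachelors,40\n52,HS-grad,35\n38,Masters,45\n")

# header=None tells pandas that the first line is data, not column names;
# names supplies the column labels to use instead.
toy = pd.read_csv(raw, header=None,
                  names=["Age", "Education", "Hours-per-week"])

print(toy.columns)
print(toy.shape)
```

Without header=None, pandas would treat the first data row as the header, and that record would be lost from the table.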

Extracting Features with Transformers

The adult file itself contains two blank lines at the end of the file. By default, pandas may interpret the penultimate new line as an empty (but valid) row. To remove it, we drop any row in which every value is missing (the use of inplace makes sure the same DataFrame is modified, rather than a new one being created):

    adult.dropna(how='all', inplace=True)

Having a look at the dataset, we can see a variety of features from adult.columns:

    adult.columns

The results show each of the feature names, which are stored inside an Index object from pandas:

    Index(['Age', 'Work-Class', 'fnlwgt', 'Education',
           'Education-Num', 'Marital-Status', 'Occupation', 'Relationship',
           'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'Hours-per-week',
           'Native-Country', 'Earnings-Raw'], dtype='object')

Common feature patterns

While there are countless ways to create features, some common patterns are employed across different disciplines. However, choosing appropriate features is tricky, and it is worth considering how a feature might correlate with the end result. As the adage says, don't judge a book by its cover: it is probably not worth considering the size of a book if you are interested in the message contained within.

Some commonly used features focus on the physical properties of the real-world objects being studied, for example:

• Spatial properties such as the length, width, and height of an object
• Weight and/or density of the object
• Age of an object or its components
• The type of the object
• The quality of the object

Other features might rely on the usage or history of the object:

• The producer, publisher, or creator of the object
• The year of manufacturing
• The use of the object
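The patterns listed above can be sketched with pandas. The table of books below, its column names, and the reference year are all hypothetical; the point is only to show how physical and historical properties might be turned into numeric features.

```python
import pandas as pd

# Hypothetical objects (books); every value here is invented for illustration.
books = pd.DataFrame({
    "length_cm": [20.0, 25.0],
    "width_cm": [10.0, 20.0],
    "height_cm": [2.0, 4.0],
    "weight_g": [300.0, 1000.0],
    "year_published": [2010, 1996],
})

REFERENCE_YEAR = 2016  # assumed "current" year used to compute age

# Spatial properties combined into a volume, then used with weight to get density.
books["volume_cm3"] = books["length_cm"] * books["width_cm"] * books["height_cm"]
books["density_g_per_cm3"] = books["weight_g"] / books["volume_cm3"]

# History of the object: age derived from the year of manufacturing.
books["age_years"] = REFERENCE_YEAR - books["year_published"]

print(books[["volume_cm3", "density_g_per_cm3", "age_years"]])
```

Whether any of these derived columns is useful depends on the goal of the exercise, in line with the book-cover adage.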

