Learning Data Mining with Python
Chapter 5

Thought should always be given to how to represent reality in the form of a model. Rather than just using what has been used in the past, you need to consider the goal of the data mining exercise. What are you trying to achieve? In Chapter 3, Predicting Sports Winners with Decision Trees, we created features by thinking about the goal (predicting winners) and used a little domain knowledge to come up with ideas for new features.

Not all features need to be numeric or categorical. Algorithms have been developed that work directly on text, graphs, and other data structures. Unfortunately, those algorithms are outside the scope of this book. In this book, we mainly use numeric or categorical features.

The Adult dataset is a great example of taking a complex reality and attempting to model it using features. In this dataset, the aim is to estimate whether someone earns more than $50,000 per year. To download the dataset, navigate to http://archive.ics.uci.edu/ml/datasets/Adult and click on the Data Folder link. Download the adult.data and adult.names files into a directory named Adult in your data folder.

This dataset takes a complex task and describes it in features. These features describe the person, their environment, their background, and their life status.

Open a new IPython Notebook for this chapter, import pandas, and set the data's filename:

    import os
    import pandas as pd

    # Folder holding the downloaded Adult dataset
    data_folder = os.path.join(os.path.expanduser("~"), "Data", "Adult")
    adult_filename = os.path.join(data_folder, "adult.data")

Using pandas as before, we load the file with read_csv:

    # The file has no header row, so the column names are supplied explicitly
    adult = pd.read_csv(adult_filename, header=None,
                        names=["Age", "Work-Class", "fnlwgt", "Education",
                               "Education-Num", "Marital-Status", "Occupation",
                               "Relationship", "Race", "Sex", "Capital-gain",
                               "Capital-loss", "Hours-per-week",
                               "Native-Country", "Earnings-Raw"])

Most of the code is the same as in the previous chapters.
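
The effect of header=None and names can be tried without downloading the file. The following is a minimal sketch, assuming a tiny in-memory sample: the two rows and the three-column layout are made up for illustration and are not taken from adult.data.

```python
import io
import pandas as pd

# Hypothetical two-row sample with no header line (not real Adult data).
sample_csv = "39, State-gov, <=50K\n50, Private, >50K\n"

# header=None tells pandas that the first line is data, not column names;
# names supplies the column labels instead.
sample = pd.read_csv(io.StringIO(sample_csv), header=None,
                     names=["Age", "Work-Class", "Earnings-Raw"])

print(list(sample.columns))
print(sample["Age"].tolist())
```

Without header=None, pandas would treat the first data row as the header and silently lose one sample.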
Extracting Features with Transformers

The adult file itself contains two blank lines at the end of the file. Depending on the pandas version and its read_csv settings, the penultimate newline may be read as an empty (but valid) row. To remove this, we drop any row in which every value is missing (the use of inplace just makes sure the same DataFrame is affected, rather than creating a new one):

    adult.dropna(how='all', inplace=True)
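
To see what how='all' does and does not remove, here is a small sketch on a hypothetical three-row frame (the values are invented for illustration). It returns a new frame rather than using inplace, so that both versions can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 has one missing value, row 2 is entirely missing.
df = pd.DataFrame({"Age": [39.0, np.nan, np.nan],
                   "Work-Class": ["State-gov", "Private", np.nan]})

# how='all' drops a row only when every value in it is missing,
# so the partially filled row 1 survives.
cleaned = df.dropna(how="all")

print(len(df), len(cleaned))
print(cleaned.index.tolist())
```

Rows that are only partly missing are kept, which is why this call is a safe way to discard an empty trailing row without touching real samples.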
Having a look at the dataset, we can see a variety of features from adult.columns:

    adult.columns

The results show each of the feature names that are stored inside an Index object from pandas:

    Index(['Age', 'Work-Class', 'fnlwgt', 'Education',
           'Education-Num', 'Marital-Status', 'Occupation', 'Relationship',
           'Race', 'Sex', 'Capital-gain', 'Capital-loss', 'Hours-per-week',
           'Native-Country', 'Earnings-Raw'], dtype='object')

Common feature patterns

While there are millions of ways to create features, there are some common patterns that are employed across different disciplines. However, choosing appropriate features is tricky, and it is worth considering how a feature might correlate with the end result. As the adage says, don't judge a book by its cover: it is probably not worth considering the size of a book if you are interested in the message contained within.

Some commonly used features focus on the physical properties of the real-world objects being studied, for example:

• Spatial properties such as the length, width, and height of an object
• Weight and/or density of the object
• Age of an object or its components
• The type of the object
• The quality of the object

Other features might rely on the usage or history of the object:

• The producer, publisher, or creator of the object
• The year of manufacturing
• The use of the object
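
The patterns listed above can be sketched as columns of a small table. This is an illustrative sketch only: the two books, their measurements, the reference year, and the derived column names are all hypothetical and do not come from the Adult dataset.

```python
import pandas as pd

# Hypothetical objects (books) described by physical and historical features.
books = pd.DataFrame({
    "Length-cm": [20.0, 25.0],
    "Width-cm": [13.0, 18.0],
    "Height-cm": [2.0, 4.0],
    "Weight-g": [260.0, 900.0],
    "Type": ["paperback", "hardcover"],
    "Publisher": ["Publisher A", "Publisher B"],
    "Year": [2010, 2015],
})

# Spatial properties combine into a derived feature (volume), and
# weight divided by volume gives a density feature.
books["Volume-cm3"] = books["Length-cm"] * books["Width-cm"] * books["Height-cm"]
books["Density-g-per-cm3"] = books["Weight-g"] / books["Volume-cm3"]

# Age of the object relative to an assumed reference year.
reference_year = 2020
books["Age-years"] = reference_year - books["Year"]

print(books[["Volume-cm3", "Density-g-per-cm3", "Age-years"]])
```

Whether any of these columns is useful still depends on the goal of the exercise; a derived feature only helps if it plausibly relates to the result being predicted.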