
Also, we want to set the final column (column index 1558), which is the class, to a binary feature. In the Adult dataset, we created a new feature for this. For this dataset, we will instead convert the feature while we load it:

converters[1558] = lambda x: 1 if x.strip() == "ad." else 0

Now we can load the dataset using read_csv. We use the converters parameter to pass our custom conversion into pandas:

ads = pd.read_csv(data_filename, header=None, converters=converters)

The resulting dataset is quite large, with 1,559 features and more than 2,000 rows. We can inspect the feature values for the first five rows by entering ads[:5] into a new cell.

This dataset describes images on websites, with the goal of determining whether a given image is an advertisement or not. The features in this dataset are not described well by their headings. Two files accompanying the ad.data file have more information: ad.DOCUMENTATION and ad.names. The first three features are the height, width, and aspect ratio of the image. The final feature is 1 if the image is an advertisement and 0 if it is not.

The other features are 1 for the presence of certain words in the URL, alt text, or caption of the image. Words such as sponsor are used to determine whether the image is likely to be an advertisement. Many of the features overlap considerably, as they are combinations of other features, so this dataset contains a lot of redundant information.

With our dataset loaded in pandas, we will now extract the X and y data for our classification algorithms. The X matrix will be all of the columns in our DataFrame except the last column, while the y array will be only that last column, feature 1558. Let's look at the code:

X = ads.drop(1558, axis=1).values
y = ads[1558]
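To see these steps together, here is a minimal end-to-end sketch of the loading code. It assumes data_filename points to the downloaded ad.data file; the convert_number helper and the explicit converters dictionary are illustrative choices for handling the numeric columns (which mark missing values with "?"), not the only way to do it:

import pandas as pd

data_filename = "ad.data"  # assumed path to the downloaded file

def convert_number(x):
    # The first columns are numeric but use "?" for missing values;
    # treat anything that does not parse as NaN.
    try:
        return float(x)
    except ValueError:
        return float("nan")

# Apply the numeric conversion to every feature column, then set
# the class column (index 1558) to the binary conversion shown above.
converters = {i: convert_number for i in range(1558)}
converters[1558] = lambda x: 1 if x.strip() == "ad." else 0

ads = pd.read_csv(data_filename, header=None, converters=converters)
X = ads.drop(1558, axis=1).values
y = ads[1558]
print(X.shape, y.shape)  # expect more than 2,000 rows and 1,558 feature columns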

Principal Component Analysis

In some datasets, features heavily correlate with each other. For example, the speed and the fuel consumption would be heavily correlated in a go-kart with a single gear. While it can be useful to find these correlations for some applications, data mining algorithms typically do not need the redundant information.

The ads dataset has heavily correlated features, as many of the keywords are repeated across the alt text and the caption.

Principal Component Analysis (PCA) aims to find combinations of features that describe the dataset using less information. It aims to discover principal components, which are features that do not correlate with each other and that explain the information (specifically, the variance) of the dataset. What this means is that we can often capture most of the information in a dataset in fewer features.

We apply PCA just like any other transformer. It has one key parameter, which is the number of components to find. By default, it will produce as many components as there are features in the original dataset. However, the principal components are ranked: the first explains the largest amount of the variance in the dataset, the second a little less, and so on. Therefore, keeping just the first few components is often enough to explain much of the dataset. Let's look at the code:

from sklearn.decomposition import PCA
pca = PCA(n_components=5)
Xd = pca.fit_transform(X)

The resulting matrix, Xd, has just five features. However, let's look at the amount of variance that is explained by each of these components:

np.set_printoptions(precision=3, suppress=True)
pca.explained_variance_ratio_

The result, array([ 0.854, 0.145, 0.001, 0., 0.]), shows us that the first component accounts for 85.4 percent of the variance in the dataset, the second accounts for 14.5 percent, and so on. By the fourth component, less than one-tenth of a percent of the variance is captured. The remaining 1,553 components explain even less.
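As a quick check on how much of the dataset those five components retain, we can sum and accumulate the explained variance ratios. This is a small illustrative sketch, assuming X is the feature matrix extracted earlier; the sum and cumulative sum calls are additions for inspection, not part of the original example:

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=5)
Xd = pca.fit_transform(X)

# Total fraction of the variance captured by the five components together.
# Based on the ratios reported above, this should be very close to 1.0.
print(pca.explained_variance_ratio_.sum())

# Cumulative ratios show how quickly the variance is used up,
# roughly [0.854, 0.999, 1.0, 1.0, 1.0] for this dataset.
print(np.cumsum(pca.explained_variance_ratio_))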

Extracting Features with Transformers<br />

Principal Component Analysis<br />

In some datasets, features heavily correlate with each other. For example, the speed<br />

and the fuel consumption would be heavily correlated in a go-kart with a single gear.<br />

While it can be useful to find these correlations for some applications, data mining<br />

algorithms typically do not need the redundant information.<br />

The ads dataset has heavily correlated features, as many of the keywords are<br />

repeated across the alt text and caption.<br />

The Principal Component Analysis (PCA) aims to find <strong>com</strong>binations of<br />

features that describe the dataset in less information. It aims to discover principal<br />

<strong>com</strong>ponents, which are features that do not correlate with each other and explain the<br />

information—specifically the variance—of the dataset. What this means is that we<br />

can often capture most of the information in a dataset in fewer features.<br />

We apply PCA just like any other transformer. It has one key parameter, which is the<br />

number of <strong>com</strong>ponents to find. By default, it will result in as many features as you<br />

have in the original dataset. However, these principal <strong>com</strong>ponents are ranked—the<br />

first feature explains the largest amount of the variance in the dataset, the second a<br />

little less, and so on. Therefore, finding just the first few features is often enough to<br />

explain much of the dataset. Let's look at the code:<br />

from sklearn.de<strong>com</strong>position import PCA<br />

pca = PCA(n_<strong>com</strong>ponents=5)<br />

Xd = pca.fit_transform(X)<br />

The resulting matrix, Xd, has just five features. However, let's look at the amount of<br />

variance that is explained by each of these features:<br />

np.set_printoptions(precision=3, suppress=True)<br />

pca.explained_variance_ratio_<br />

The result, array([ 0.854, 0.145, 0.001, 0. , 0. ]), shows<br />

us that the first feature accounts for 85.4 percent of the variance in the dataset,<br />

the second accounts for 14.5 percent, and so on. By the fourth feature, less than<br />

one-tenth of a percent of the variance is contained in the feature. The other 1,553<br />

features explain even less.<br />

[ 96 ]
