Also, we want to set the final column (column index #1558), which is the class, to a binary feature. In the Adult dataset, we created a new feature for this. In this dataset, we will convert the feature while we load it:

converters[1558] = lambda x: 1 if x.strip() == "ad." else 0

Now we can load the dataset using read_csv. We use the converters parameter to pass our custom conversion into pandas:

ads = pd.read_csv(data_filename, header=None, converters=converters)

The resulting dataset is quite large, with 1,559 features and more than 2,000 rows. Here are some of the feature values for the first five rows, printed by entering ads[:5] into a new cell:

This dataset describes images on websites, with the goal of determining whether a given image is an advertisement or not. The features in this dataset are not described well by their headings. There are two files accompanying the ad.data file that have more information: ad.DOCUMENTATION and ad.names.

The first three features are the height, width, and aspect ratio of the image. The final feature is 1 if it is an advertisement and 0 if it is not. The other features are 1 for the presence of certain words in the URL, alt text, or caption of the image. These words, such as the word sponsor, are used to determine whether the image is likely to be an advertisement. Many of the features overlap considerably, as they are combinations of other features. Therefore, this dataset has a lot of redundant information.

With our dataset loaded in pandas, we will now extract the X and y data for our classification algorithms. The X matrix will be all of the columns in our DataFrame except for the last column, while the y array will be only that last column, feature #1558. Let's look at the code:

X = ads.drop(1558, axis=1).values
y = ads[1558]
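To see the whole loading pattern in one place, here is a minimal, self-contained sketch. It uses a tiny in-memory sample in the same layout as ad.data rather than the real file, and only three feature columns instead of 1,558, purely for illustration:

```python
import pandas as pd
from io import StringIO

# A tiny in-memory sample mimicking the ad.data layout: numeric features
# followed by the class label ("ad." or "nonad."). In the real file the
# class is column 1558; in this three-feature sample it is column 3.
sample = StringIO("125,125,1.0,ad.\n57,468,8.2105,nonad.\n")

# Convert the class column to a binary feature while loading
converters = {3: lambda x: 1 if x.strip() == "ad." else 0}

ads = pd.read_csv(sample, header=None, converters=converters)

X = ads.drop(3, axis=1).values  # every column except the class
y = ads[3].values               # the binary class column

print(y)
```

The converters dictionary maps a column index to a function that pandas applies to each raw value in that column as the file is parsed, so the DataFrame never contains the string labels at all.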
Extracting Features with Transformers

Principal Component Analysis

In some datasets, features heavily correlate with each other. For example, the speed and the fuel consumption would be heavily correlated in a go-kart with a single gear. While it can be useful to find these correlations for some applications, data mining algorithms typically do not need the redundant information.

The ads dataset has heavily correlated features, as many of the keywords are repeated across the alt text and caption.

Principal Component Analysis (PCA) aims to find combinations of features that describe the dataset using less information. It aims to discover principal components, which are features that do not correlate with each other and that explain the information, specifically the variance, of the dataset. What this means is that we can often capture most of the information in a dataset in fewer features.

We apply PCA just like any other transformer. It has one key parameter, which is the number of components to find. By default, it will result in as many features as you have in the original dataset. However, these principal components are ranked: the first component explains the largest amount of the variance in the dataset, the second a little less, and so on. Therefore, finding just the first few components is often enough to explain much of the dataset. Let's look at the code:

from sklearn.decomposition import PCA
pca = PCA(n_components=5)
Xd = pca.fit_transform(X)

The resulting matrix, Xd, has just five features. However, let's look at the amount of variance that is explained by each of these features:

np.set_printoptions(precision=3, suppress=True)
pca.explained_variance_ratio_

The result, array([ 0.854, 0.145, 0.001, 0., 0.]), shows us that the first component accounts for 85.4 percent of the variance in the dataset, the second accounts for 14.5 percent, and so on. By the fourth component, less than one-tenth of a percent of the variance is contained in it. The other 1,553 components explain even less.
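As a quick illustration of why correlated features collapse into so few components, the following sketch (on synthetic data, not the ads dataset) builds three heavily correlated columns from one underlying signal and confirms that PCA packs nearly all of their variance into the first component:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(14)
base = rng.normal(size=(500, 1))

# Three near-duplicate columns: each is a scaled copy of the same
# underlying signal, plus a little noise
X = np.hstack([base,
               2 * base + 0.01 * rng.normal(size=(500, 1)),
               -base + 0.01 * rng.normal(size=(500, 1))])

pca = PCA(n_components=3)
Xd = pca.fit_transform(X)

# Almost all of the variance lands in the first component, because the
# columns carry the same information
print(pca.explained_variance_ratio_)
```

Because the three columns are redundant, the first entry of explained_variance_ratio_ comes out above 0.99, which mirrors (in miniature) what happens with the repeated keyword features in the ads dataset.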