
Extracting Features with Transformers

The preceding function almost fits the interface needed for scikit-learn's univariate transformers. Such a function must accept two arrays (X and y in our example) as parameters and return two arrays: the scores for each feature and the corresponding p-values. The chi2 function we used earlier implements exactly this interface, which allowed us to pass it directly to SelectKBest.

The pearsonr function in SciPy also accepts two arrays; however, the X array it accepts must be one-dimensional. We will write a wrapper function that allows us to use it on multivariate arrays like the one we have. Let's look at the code:

    def multivariate_pearsonr(X, y):

We create our scores and pvalues arrays, and then iterate over each column of the dataset:

        scores, pvalues = [], []
        for column in range(X.shape[1]):

We compute the Pearson correlation for this column only, and record both the score and the p-value:

            cur_score, cur_p = pearsonr(X[:,column], y)
            scores.append(abs(cur_score))
            pvalues.append(cur_p)

The Pearson value lies between -1 and 1. A value of 1 implies a perfect positive correlation between two variables, while a value of -1 implies a perfect negative correlation, that is, high values in one variable correspond to low values in the other and vice versa. Such features are really useful to have, but a selector that keeps only the highest scores would discard them, because -1 ranks lowest. For this reason, we store the absolute value in the scores array, rather than the original signed value.
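As a quick illustration of why the absolute value matters, the following sketch (with made-up numbers) runs pearsonr on a feature that decreases linearly as the target increases:

```python
import numpy as np
from scipy.stats import pearsonr

# A feature that is perfectly negatively correlated with the target
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 8.0, 6.0, 4.0, 2.0])

score, p_value = pearsonr(x, y)
print(score)       # -1.0: perfect negative correlation
print(abs(score))  # 1.0: the absolute value keeps this feature ranked highly
```

Ranking by the signed score would place this highly informative feature last; ranking by the absolute value places it first.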

Finally, we return the scores and p-values in a tuple:

        return (np.array(scores), np.array(pvalues))

Now, we can use the transformer class as before to rank the features using the Pearson correlation coefficient:

    transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
    Xt_pearson = transformer.fit_transform(X, y)
    print(transformer.scores_)
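Putting the pieces together, here is a self-contained sketch of the wrapper in use. The dataset is synthetic (the original chapter uses its own data): two of five random features are made to track the target, so they should receive the highest scores.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest

def multivariate_pearsonr(X, y):
    # Apply pearsonr column by column, keeping absolute scores
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        cur_score, cur_p = pearsonr(X[:, column], y)
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return (np.array(scores), np.array(pvalues))

# Hypothetical data: only columns 0 and 3 influence the target
rng = np.random.RandomState(14)
X = rng.normal(size=(100, 5))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=100)

transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
Xt_pearson = transformer.fit_transform(X, y)
print(transformer.scores_)   # five absolute correlations, one per feature
print(Xt_pearson.shape)      # (100, 3): the three best-scoring features kept
```

Because the scores are absolute Pearson correlations, they all fall between 0 and 1, and SelectKBest keeps the k columns with the largest values.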
