The derivation of these equations is outside the scope of this book, but I recommend interested readers go through the derivations at http://en.wikibooks.org/wiki/Support_Vector_Machines for the details. Alternatively, you can visit http://docs.opencv.org/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html.

Classifying with SVMs

After training the model, we have a line of maximum margin. The classification of a new sample is then simply a matter of asking: does it fall above the line, or below it? If it falls above the line, it is predicted as one class; if it falls below the line, it is predicted as the other class.

For multiple classes, we create multiple SVMs, each of which is a binary classifier. We then connect them using any one of a variety of strategies. A basic strategy is to create a one-versus-all classifier for each class, where we train using two classes: the given class and all other samples. We do this for each class, run each classifier on a new sample, and choose the best match from these results. This process is performed automatically in most SVM implementations.

We saw two parameters in our previous code: C and the kernel. We will cover the kernel parameter in the next section, but the C parameter is an important parameter for fitting SVMs. The C parameter relates to how much the classifier should aim to predict all training samples correctly, at the risk of overfitting. Selecting a higher C value will find a line of separation with a smaller margin, aiming to classify all training samples correctly. Choosing a lower C value will result in a line of separation with a larger margin, even if that means that some training samples are incorrectly classified. In this case, a lower C value presents a lower chance of overfitting, at the risk of choosing a generally poorer line of separation.

One limitation of SVMs (in their basic form) is that they can only separate data that is linearly separable. What happens if the data isn't? For that problem, we use kernels.
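To get a feel for the effect of C, the following sketch trains scikit-learn's SVC with a few different C values. The make_classification dataset and the specific C values here are illustrative stand-ins for the chapter's real feature matrix, not the chapter's actual code, and depending on your scikit-learn version, cross_val_score may live in sklearn.cross_validation rather than sklearn.model_selection:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# A small synthetic dataset standing in for the function word features
# we extracted earlier in this chapter
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=14)

# A higher C pushes the classifier towards fitting every training sample
# (smaller margin); a lower C allows a wider margin even if some training
# samples are misclassified. scikit-learn handles the multiclass case for
# us automatically.
for C in [0.01, 1.0, 100.0]:
    svm = SVC(kernel='linear', C=C)
    scores = cross_val_score(svm, X, y, cv=3)
    print("C={}: average accuracy {:.3f}".format(C, scores.mean()))

In practice, searching over several C values, for example with GridSearchCV, is usually safer than guessing a single value.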
Kernels

When the data cannot be separated linearly, the trick is to embed it onto a higher dimensional space. What this means, with a lot of hand-waving about the details, is to add pseudo-features until the data is linearly separable (which will always happen if you add enough of the right kinds of features).

The trick is that we often compute the inner product of the samples when finding the best line to separate the dataset. Given a function that uses the dot product, we effectively manufacture new features without having to actually define those new features. This is handy because we don't know what those features were going to be anyway. We now define a kernel as a function that is itself the dot product of a function of two samples from the dataset, rather than being based on the samples (and the made-up features) themselves.

We can now compute what that dot product is (or approximate it) and then just use that.
There are a number of kernels in common use. The linear kernel is the most straightforward and is simply the dot product of the two sample feature vectors, together with a weight and a bias value. There is also a polynomial kernel, which raises the dot product to a given degree (for instance, 2). Others include the Gaussian (rbf) and sigmoid functions. In our previous code sample, we tested the linear and rbf kernels.
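As a rough illustration of what these kernel functions compute, here is a small sketch of the linear, polynomial, and rbf kernels written directly with NumPy. The bias, degree, and gamma values are arbitrary illustrative choices, not recommended settings:

import numpy as np

def linear_kernel(x, y, bias=1.0):
    # The dot product of the two feature vectors, plus a bias value
    return np.dot(x, y) + bias

def polynomial_kernel(x, y, degree=2, bias=1.0):
    # The (biased) dot product raised to a given degree
    return (np.dot(x, y) + bias) ** degree

def rbf_kernel(x, y, gamma=0.1):
    # The Gaussian (rbf) kernel, based on the squared distance between samples
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 0.5, 0.0])
y = np.array([0.8, 0.3, 0.1])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))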
The end result from all this derivation is that these kernels effectively define a distance between two samples that is used in the classification of new samples in SVMs. In theory, any distance could be used, although it may not share the same characteristics that enable easy optimization of the SVM training.

In scikit-learn's implementation of SVMs, we can define the kernel parameter to change which kernel function is used in computations, as we saw in the previous code sample.
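As a minimal sketch of switching kernels, the following code searches over the kernel parameter (and C) with GridSearchCV. The synthetic dataset and the grid values are illustrative choices rather than the chapter's exact settings:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=14)

# Try both kernels (and a few C values) and keep the combination with the
# best cross-validated score
parameters = {'kernel': ('linear', 'rbf'), 'C': [0.1, 1.0, 10.0]}
grid = GridSearchCV(SVC(), parameters)
grid.fit(X, y)
print(grid.best_params_)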
Character n-grams

We saw how function words can be used as features to predict the author of a document. Another feature type is character n-grams. An n-gram is a sequence of n objects, where n is a value (for text, generally between 2 and 6). Word n-grams have been used in many studies, usually relating to the topic of the documents. However, character n-grams have proven to be of high quality for authorship attribution.
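One way to extract character n-grams is with scikit-learn's CountVectorizer, setting its analyzer to work on characters rather than words. The sample documents and the choice of 3-character n-grams below are just illustrative:

from sklearn.feature_extraction.text import CountVectorizer

documents = ["A simple example sentence.",
             "Another sentence, written by a different author."]

# analyzer='char' extracts character n-grams instead of word tokens;
# ngram_range=(3, 3) keeps only sequences of exactly three characters
vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 3))
counts = vectorizer.fit_transform(documents)

print(len(vectorizer.vocabulary_))  # number of distinct character 3-grams
print(counts.shape)                 # (number of documents, number of 3-grams)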