Large-Scale Semi-Supervised Learning for Natural Language ...

produces a feature vector. A vector is just a sequence of numbers, like (0, 34, 2.3). We can think of a vector as having multiple dimensions, where each dimension is a number in the sequence. So 0 is in the first dimension of (0, 34, 2.3), 34 is in the second dimension, and 2.3 is in the third dimension. For text categorization, each dimension might correspond to a particular word (although character-based representations are also possible [Lodhi et al., 2002]). The value at that dimension could be 1 if the word is present in the document, and 0 otherwise. These are binary feature values. We sometimes say that a feature fires if that feature value is non-zero, meaning, for text categorization, that the word is present in the document. We also sometimes refer to the feature vector as the feature representation of the problem. In machine learning, the feature vector is usually denoted as x̄, so x̄ = Φ(d).

A simple feature representation would be to have the first dimension be for the presence of the word the, the second dimension for the presence of curling, and the third for the presence of Obama. If the document read only “Obama attended yesterday’s curling match,” then the feature vector would be (0, 1, 1). If the document read “stocks are up today on Wall Street,” then the feature vector would be (0, 0, 0). Notice the order of the words in the text doesn’t matter.
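As a concrete sketch of this binary bag-of-words representation (the three-word vocabulary and the function name here are illustrative, not taken from the chapter), a featurizer Φ might look like:

```python
def feature_vector(document, vocabulary):
    """Binary bag-of-words featurizer: dimension i is 1 if
    vocabulary[i] appears anywhere in the document, else 0.
    Word order is deliberately ignored."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

vocab = ["the", "curling", "obama"]
print(feature_vector("Obama attended yesterday's curling match", vocab))  # [0, 1, 1]
print(feature_vector("stocks are up today on Wall Street", vocab))        # [0, 0, 0]
```

Note that "Obama went curling" and "Curling went Obama" would map to the same vector, which is exactly the order-insensitivity discussed next.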
“Curling went Obama” would have the same feature vector as “Obama went curling.” So this is sometimes referred to as the bag-of-words feature representation. That’s not really important, but it’s a term that is often seen in bold text when describing machine learning.

The linear classifier, h(x̄), works by multiplying the feature vector, x̄ = (x_1, x_2, ..., x_N), by a set of learned weights, w̄ = (w_1, w_2, ..., w_N):

    h(x̄) = w̄ · x̄ = Σ_i w_i x_i    (2.1)

where the dot product (·) is a mathematical shorthand meaning, as indicated, that each w_i is multiplied with the feature value at dimension i and the results are summed. We can also write a dot product using matrix notation as w̄ᵀx̄. A linear classifier using an N-dimensional feature vector will sum the products of N multiplications. It’s known as a linear classifier because this is a linear combination of the features. Note, sometimes the weights are also represented using λ = (λ_1, ..., λ_N). This is sometimes convenient in NLP when we might want to use w to refer to a word.

The objective of the linear classifier is to produce labels on new examples. Labels are almost always represented as y. We choose the label using the output of the linear classifier. In a common paradigm, if the output is positive, that is, h(x̄) > 0, then we take this as a positive decision: yes, the document d does belong to the sports category, so the label y equals +1 (the positive class). If h(x̄) < 0, we say the document does not belong in the sports category, and y = −1 (the negative class).

Now, the job of the machine learning algorithm is to learn these weights. That’s really it. In the context of the widely-used linear classifier, the weights fully define the classifier. Training means choosing the weights, and testing means computing the dot product with the weights for new feature vectors.
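Equation (2.1) and the sign-based decision rule can be sketched in a few lines (the weight values below are made up purely for illustration; they are not learned from anything):

```python
def h(w, x):
    """Linear classifier score: the dot product w̄ · x̄ = Σ_i w_i x_i, as in Eq. (2.1)."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def label(w, x):
    """Decision rule: +1 (positive class) if h(x̄) > 0, else -1 (negative class)."""
    return +1 if h(w, x) > 0 else -1

w = [-0.5, 1.0, 1.0]   # hypothetical learned weights for (the, curling, obama)
x = [0, 1, 1]          # feature vector for the curling/Obama document
print(h(w, x))         # 2.0
print(label(w, x))     # 1
```

With these made-up weights the curling document scores 2.0 and is labeled +1 (sports), while the all-zeros Wall Street vector scores 0 and falls on the negative side.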
How does the algorithm actually choose the weights? In supervised machine learning, you give some examples of feature vectors and the correct decision on each vector. The index of each training example is usually written as a superscript, so that a training set of M examples can be written as: {(x̄¹, y¹), ..., (x̄^M, y^M)}. For example, a set of two training examples might be ((0,1,0), +1) and ((1,0,0), −1) for a positive (+1) and a negative (−1) example. The algorithm tries to choose the parameters (a synonym for the weights, w̄) that result in the correct decision on this training data.
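One classic way to choose such weights (offered here only as an illustration; it is not necessarily the learning algorithm this text goes on to develop) is the perceptron update: whenever the current weights get a training example wrong, nudge w̄ in the direction of y · x̄:

```python
def perceptron(training_data, n_dims, epochs=10):
    """Perceptron sketch: start from zero weights and, for every
    training example (x̄, y) that is currently misclassified
    (y * h(x̄) <= 0), update w̄ ← w̄ + y * x̄."""
    w = [0.0] * n_dims
    for _ in range(epochs):
        for x, y in training_data:
            score = sum(w_i * x_i for w_i, x_i in zip(w, x))
            if y * score <= 0:
                w = [w_i + y * x_i for w_i, x_i in zip(w, x)]
    return w

# The two training examples from the text: ((0,1,0), +1) and ((1,0,0), -1).
data = [([0, 1, 0], +1), ([1, 0, 0], -1)]
w = perceptron(data, 3)
print(w)  # [-1.0, 1.0, 0.0]
```

The resulting weights score the positive example at +1 and the negative example at −1, i.e. they produce the correct decision on this tiny training set.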
