produces a feature vector. A vector is just a sequence of numbers, like $(0, 34, 2.3)$. We can think of a vector as having multiple dimensions, where each dimension is a number in the sequence. So 0 is in the first dimension of $(0, 34, 2.3)$, 34 is in the second dimension, and 2.3 is in the third dimension. For text categorization, each dimension might correspond to a particular word (although character-based representations are also possible [Lodhi et al., 2002]). The value at that dimension could be 1 if the word is present in the document, and 0 otherwise. These are binary feature values. We sometimes say that a feature fires if that feature value is non-zero, meaning, for text categorization, that the word is present in the document. We also sometimes refer to the feature vector as the feature representation of the problem. In machine learning, the feature vector is usually denoted as $\bar{x}$, so $\bar{x} = \Phi(d)$.

A simple feature representation would be to have the first dimension be for the presence of the word the, the second dimension for the presence of curling, and the third for the presence of Obama. If the document read only "Obama attended yesterday's curling match," then the feature vector would be $(0, 1, 1)$. If the document read "stocks are up today on Wall Street," then the feature vector would be $(0, 0, 0)$. Notice that the order of the words in the text doesn't matter: "Curling went Obama" would have the same feature vector as "Obama went curling." So this is sometimes referred to as the bag-of-words feature representation. That's not really important, but it's a term that is often seen in bold text when describing machine learning.

The linear classifier, $h(\bar{x})$, works by multiplying the feature vector, $\bar{x} = (x_1, x_2, \ldots, x_N)$, by a set of learned weights, $\bar{w} = (w_1, w_2, \ldots, w_N)$:

$$h(\bar{x}) = \bar{w} \cdot \bar{x} = \sum_i w_i x_i \qquad (2.1)$$

where the dot product ($\cdot$) is a mathematical shorthand meaning, as indicated, that each $w_i$ is multiplied with the feature value at dimension $i$ and the results are summed. We can also write a dot product using matrix notation as $\bar{w}^T \bar{x}$. A linear classifier using an $N$-dimensional feature vector will sum the results of $N$ multiplications. It's known as a linear classifier because this is a linear combination of the features. Note that the weights are also sometimes represented using $\lambda = (\lambda_1, \ldots, \lambda_N)$. This is sometimes convenient in NLP, where we might want to use $w$ to refer to a word.

The objective of the linear classifier is to produce labels on new examples. Labels are almost always represented as $y$. We choose the label using the output of the linear classifier. In a common paradigm, if the output is positive, that is, $h(\bar{x}) > 0$, then we take this as a positive decision: yes, the document $d$ does belong to the sports category, so the label $y$ equals $+1$ (the positive class). If $h(\bar{x}) < 0$, we say the document does not belong to the sports category, and $y = -1$ (the negative class).

Now, the job of the machine learning algorithm is to learn these weights. That's really it. In the context of the widely-used linear classifier, the weights fully define the classifier. Training means choosing the weights, and testing means computing the dot product with the weights for new feature vectors. How does the algorithm actually choose the weights? In supervised machine learning, you give some examples of feature vectors and the correct decision on each vector.
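To make this concrete, here is a minimal sketch in Python (not code from the thesis) of the set-up so far: a binary bag-of-words featurizer over the illustrative three-word vocabulary (the, curling, Obama), the dot product of Equation (2.1), and the sign-based decision rule. The weight values are made up for illustration.

```python
# Minimal sketch of a binary bag-of-words featurizer and a linear classifier.
# The vocabulary and the weights are illustrative choices, not values fixed
# by the text.

VOCAB = ["the", "curling", "obama"]  # dimension i corresponds to VOCAB[i]

def featurize(document: str) -> list[float]:
    """Map a document d to a binary feature vector x = Phi(d)."""
    words = set(document.lower().split())
    return [1.0 if w in words else 0.0 for w in VOCAB]

def h(weights: list[float], x: list[float]) -> float:
    """Linear classifier: h(x) = w . x = sum_i w_i * x_i (Equation 2.1)."""
    return sum(w_i * x_i for w_i, x_i in zip(weights, x))

def classify(weights: list[float], x: list[float]) -> int:
    """Return the label y: +1 if h(x) > 0, else -1."""
    return 1 if h(weights, x) > 0 else -1

w = [0.0, 2.0, 0.5]  # e.g. "the" uninformative, "curling" strongly sports

x = featurize("Obama attended yesterday's curling match")
print(x)               # [0.0, 1.0, 1.0]
print(classify(w, x))  # 1, i.e. the positive (sports) class
```

Because featurize only tests set membership, word order is discarded; this is exactly the bag-of-words property noted above.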
The index of each training example is usually written as a superscript, so that a training set of $M$ examples can be written as $\{(\bar{x}^1, y^1), \ldots, (\bar{x}^M, y^M)\}$. For example, a set of two training examples might be $\{((0,1,0), +1), ((1,0,0), -1)\}$ for a positive ($+1$) and a negative ($-1$) example. The algorithm tries to choose the parameters (a synonym for the weights, $\bar{w}$) that result in the correct decision on this training data when the dot product is computed (here between three weights and three features).
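How the weights are actually chosen is the subject of Section 2.3; purely as an illustration of the idea, the sketch below uses the classic perceptron rule (an assumption of this example, not necessarily the algorithm used in the thesis): whenever the current weights get a training example wrong, the example's features are added to or subtracted from the weights.

```python
# Illustrative only: a perceptron-style learner, one simple way to choose
# weights w that get the training data right. (Section 2.3 covers the
# algorithms actually used.)

def dot(w, x):
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def train_perceptron(examples, epochs=10):
    """examples: list of (x, y) pairs with y in {+1, -1}."""
    n_dims = len(examples[0][0])
    w = [0.0] * n_dims
    for _ in range(epochs):
        for x, y in examples:
            if y * dot(w, x) <= 0:  # mistake (or exactly on the boundary)
                # nudge w toward classifying (x, y) correctly
                w = [w_i + y * x_i for w_i, x_i in zip(w, x)]
    return w

# The two training examples from the text: {((0,1,0), +1), ((1,0,0), -1)}.
train = [((0, 1, 0), +1), ((1, 0, 0), -1)]
print(train_perceptron(train))  # [-1.0, 1.0, 0.0]
```

On these two examples the learner ends up with a positive weight on the second dimension (curling, in the earlier vocabulary) and a negative weight on the first, which is the kind of outcome we would hope for.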
[Figure 2.1: The linear classifier hyperplane (as given by an SVM, with support vectors indicated).]

For our sports example, we would hope that the algorithm would learn, for example, that curling should get a positive weight, since documents that contain the word curling are usually about sports. It should assign a fairly low weight, perhaps zero weight, to the word the, since this word doesn't have much to say one way or the other. Choosing an appropriate weight for the Obama feature is left as an exercise for the reader. Note that weights can be negative. Section 2.3 has more details on some of the different algorithms that learn the weights.

If we take a geometric view, and think of the feature vectors as points in $N$-dimensional space, then learning the weights can also be thought of as learning a separating hyperplane. Once we have a classifier, all feature vectors that get positive scores will be in one region of the space, and all feature vectors that get negative scores will be in another. With a linear classifier, a hyperplane divides these two regions. Figure 2.1 depicts this set-up in two dimensions, with the points of one class on the left, the points of the other class on the right, and the dividing hyperplane as a bar down the middle.¹

In this discussion, we've focused on binary classification: is the document about sports or not? In many practical applications, however, we have more than two categories, e.g. sports, finance, politics, etc. It's fairly easy to adapt the binary linear classifier to the multiclass case. For $K$ classes, one common approach is the one-versus-all strategy: we have $K$ binary classifiers that each predict whether a document is part of a given category or not. Thus we might classify a document about Obama going curling as both a sports and a politics document. In cases where only one category is possible (i.e., the classes are mutually exclusive, such as the restriction that each word have only one part-of-speech tag), we could take the class of the highest-scoring classifier (the highest $h(\bar{x})$). There are also multiclass classifiers, like the approach we use in Chapter 3, that essentially jointly optimize the $K$ classifiers (e.g. [Crammer and Singer, 2001]). Chapter 4 defines and evaluates various multi-class learning approaches.

A final point to address: should we be using a linear classifier for our problems at all? Linear classifiers are very simple, extremely fast, and work very well on a range of problems.

¹ From: www.stat.columbia.edu/~cook/movabletype/archives/2006/02/interesting_cas_1.html
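Returning to the one-versus-all strategy above, here is a minimal sketch assuming we have already learned one weight vector per class; the weight values are invented for illustration, over the same three feature dimensions as before. Each binary classifier scores the document, and when the classes are mutually exclusive we take the top-scoring class.

```python
# Minimal one-versus-all sketch: K binary linear classifiers, one per class.
# The weight vectors are made-up illustrations over the (the, curling, obama)
# feature dimensions used earlier.

def dot(w, x):
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

class_weights = {
    "sports":   [0.0, 2.0, -0.1],
    "politics": [0.0, -0.5, 1.5],
    "finance":  [0.1, -1.0, -0.2],
}

def predict_all(x):
    """Multi-label: every class whose binary classifier fires, h(x) > 0."""
    return [c for c, w in class_weights.items() if dot(w, x) > 0]

def predict_one(x):
    """Mutually exclusive classes: the class with the highest score h(x)."""
    return max(class_weights, key=lambda c: dot(class_weights[c], x))

x = [0.0, 1.0, 1.0]    # "Obama attended yesterday's curling match"
print(predict_all(x))  # ['sports', 'politics']: both classifiers fire
print(predict_one(x))  # 'sports': its score 1.9 beats politics at 1.0
```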