For each target word v0, there are five 5-gram context patterns that may span it. For Example (1) in Section 3.1, we can extract the following 5-gram patterns:

    system tried to decide v0
    tried to decide v0 the
    to decide v0 the two
    decide v0 the two confusable
    v0 the two confusable words

Similarly, there are four 4-gram patterns, three 3-gram patterns, and two 2-gram patterns spanning the target. With |F| fillers, there are 14|F| filled patterns with relevant N-gram counts. For example, for F = {among, between}, there are two filled 5-gram patterns that begin with the word decide: "decide among the two confusable" and "decide between the two confusable." We collect counts for each of these, along with all the other filled patterns for this example. When F = {among, between}, there are 28 relevant counts for each example.

We now describe various systems that use these counts.

3.3.1 SUPERLM

We use supervised learning to map a target word and its context to an output. There are two steps in this mapping: a) converting the word and its context into a feature vector, and b) applying a classifier to determine the output class.

In order to use the standard x, y notation for classifiers, we write things as follows: Let x̄ = Φ(V) be a mapping of the input to a feature representation, x̄. We might also think of the feature function as being parameterized by the set of fillers, F, and the N-gram corpus, R, so that x̄ = Φ_{F,R}(V). The feature function Φ_{F,R}(·) outputs the count (in logarithmic form) of the different context patterns with the different fillers. Each of these has a corresponding dimension in the feature representation. If N = 14|F| counts are used, then each x̄ is an N-dimensional feature vector.

Now, the classifier outputs the index of the highest-scoring candidate in the set of candidate outputs, C = {c_1, c_2, ..., c_K}. That is, we let y ∈ {1, ..., K} be the set of classes that can be produced by the classifier. The classifier, H, is therefore a K-class classifier, mapping an attribute vector, x̄, to a class, y. Using the standard [Crammer and Singer, 2001]-style multi-class formulation, H is parameterized by a K-by-N matrix of weights, W:

    H_W(x̄) = argmax_{r=1,...,K} W̄_r · x̄    (3.1)

where W̄_r is the rth row of W. That is, the predicted class is the index of the row of W that has the highest inner product with the attributes, x̄. The weights are optimized using a set of M training examples, {(x̄_1, y_1), ..., (x̄_M, y_M)}.

This differs a little from the linear classifier that we presented in Section 2.2. Here we actually have K linear classifiers. Although there is only one set of N features, there is a different linear combination for each row of W. The weight on a particular count therefore depends on the class we are scoring (corresponding to the row of W, r), as well as on the filler, the context position, and the context size, all of which select one of the 14|F| base features. There are therefore a total of 14|F|K count-weight parameters. Chapter 4 formally describes how these parameters are learned using a multi-class SVM. Chapter 4 also discusses enhancements to this model that can enable better performance with fewer training examples.
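To make the pipeline concrete, the following Python sketch builds the 14|F|-dimensional log-count feature vector Φ_{F,R}(V) and applies the argmax classifier of Equation (3.1). It is an illustration only: get_ngram_count is a hypothetical stand-in for a lookup into the N-gram corpus R, the add-one inside the logarithm is our own smoothing choice, and the weight matrix W is assumed to have already been learned as described in Chapter 4.

```python
import math
import numpy as np

def feature_vector(tokens, i, fillers, get_ngram_count):
    """Map a target position and its context to the 14*|F| log-count features.

    tokens          -- the sentence as a list of words; position i is the target slot
    fillers         -- the filler set F, e.g. ["among", "between"]
    get_ngram_count -- hypothetical lookup into the N-gram corpus R (assumption)
    """
    feats = []
    for filler in fillers:
        # Substitute the filler into the target slot.
        filled = tokens[:i] + [filler] + tokens[i + 1:]
        # Five 5-grams, four 4-grams, three 3-grams, two 2-grams = 14 patterns.
        for n in range(5, 1, -1):
            for start in range(i - n + 1, i + 1):   # every n-gram window spanning i
                if start < 0 or start + n > len(filled):
                    feats.append(0.0)               # pattern falls outside the sentence
                else:
                    pattern = " ".join(filled[start:start + n])
                    count = get_ngram_count(pattern)
                    feats.append(math.log(count + 1.0))  # log-scaled count (+1 is our choice)
    return np.array(feats)                          # length 14 * |F|

def classify(x, W):
    """Equation (3.1): H_W(x) = argmax_r  W_r . x, returned as a 0-based row index."""
    return int(np.argmax(W @ x))

# Usage sketch for Example (1), with a placeholder for the target slot:
# tokens  = "system tried to decide ___ the two confusable words".split()
# fillers = ["among", "between"]
# counts  = lambda ngram: web1t.get(ngram, 0)          # hypothetical corpus lookup
# x = feature_vector(tokens, 4, fillers, counts)       # 28-dimensional vector
# y = classify(x, W)                                   # index into C = {c_1, ..., c_K}
```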
