These rules and scores might depend on many properties of the input sentence: the word itself, the surrounding words, the case, the prefixes and suffixes of the surrounding words, etc. The number of properties of interest (what in machine learning is called “the number of features”) may be quite large, and it is difficult to choose the set of rules and weights that results in the best performance (see Chapter 2, Section 2.1 for further discussion).

Rather than specifying the rules and weights by hand, the current dominant approach in NLP is to provide a set of labeled examples that the system can learn from. That is, we train the system to make decisions using guidance from labeled data. By labeled data, we simply mean data where the correct, gold-standard answer has been explicitly provided. The properties of the input are typically encoded as numerical features. A score is produced using a weighted combination of the features. The learning algorithm assigns weights to the features so that the correct output scores higher than incorrect outputs on the training set. Or, in cases where the true output cannot be generated by the system, the weights are chosen so that the highest-scoring output (the system prediction) is as close as possible to the known true answer.

For example, feature 96345 might be a binary feature, equal to one if “the word is wind,” and otherwise equal to zero. This feature (e.g., f_96345) may get a high weight for predicting whether the word is a common noun, NN (e.g., the corresponding weight parameter, w_96345, may be 10). If the weighted-sum-of-features score for the NN tag is higher than the scores for the other tags, then NN is predicted. Again, the key point is that these weights are chosen, automatically, in order to maximize performance on human-provided, labeled examples. Chapter 2 covers the fundamental equations of machine learning (ML) and discusses how machine learning is used in NLP.
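To make the scoring step concrete, here is a minimal sketch in Python of a weighted sum of binary features selecting a tag. The feature names and weight values are invented for illustration; they are not the actual features or weights of any system described in this dissertation.

```python
# Minimal sketch of weighted-sum-of-features tag scoring.
# Feature names and weight values are hypothetical.

def extract_features(word):
    """Map a word to the set of binary features that fire (value 1)."""
    features = set()
    features.add("word=" + word.lower())          # e.g., "word=wind"
    features.add("suffix2=" + word[-2:].lower())  # e.g., "suffix2=nd"
    if word[0].isupper():
        features.add("capitalized")
    return features

# One weight vector per tag. In a real system these values are learned
# from labeled data; here we hard-code a few (e.g., "word=wind" -> 10
# for NN, mirroring the w_96345 example above).
weights = {
    "NN":  {"word=wind": 10.0, "suffix2=nd": 1.0},
    "VB":  {"word=wind": 2.0, "capitalized": -1.0},
    "NNP": {"capitalized": 5.0},
}

def predict_tag(word):
    """Return the tag whose weighted feature sum is highest."""
    feats = extract_features(word)
    scores = {
        tag: sum(w.get(f, 0.0) for f in feats)
        for tag, w in weights.items()
    }
    return max(scores, key=scores.get)

print(predict_tag("wind"))  # NN: the "word=wind" feature dominates
```

In practice, of course, the weight values are set by a learning algorithm so as to maximize performance on the labeled training examples, as described in Chapter 2, rather than written by hand as above.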
Statistical machine learning works a lot better than specifying rules by hand. ML systems are easier to develop (because a computer program fine-tunes the rules, not a human) and easier to adapt to new domains (because we need only annotate new data, rather than write new rules). ML systems also tend to achieve better performance (again, see Chapter 2, Section 2.1).²

²Machine-learned systems are also more fun to design. At a talk last year at Johns Hopkins University (June 2009), BBN employee Ralph Weischedel suggested that one of the reasons BBN switched to machine-learning approaches was that one of their chief designers got so bored writing rules for their information extraction system that he decided to go back to graduate school.

The chief bottleneck in developing supervised systems is the manual annotation of data. Historically, most labeled data sets were created by experts in linguistics. Because of the great cost of producing this data, the size and variety of these data sets are quite limited.

Although the amount of labeled data is limited, there is quite a lot of unlabeled data available (as we mentioned above). This dissertation explores various methods to combine very large amounts of unlabeled data with standard supervised learning on a variety of NLP tasks. This combination of learning from both labeled and unlabeled data is often referred to as semi-supervised learning.

1.3 Learning from Unlabeled Data

An example from part-of-speech tagging will help illustrate how unlabeled data can be useful. Suppose we are trying to label the parts-of-speech in the following examples. Specifically, there is some ambiguity for the tag of the verb won.

(1) “He saw the Bears won yesterday.”
(2) “He saw the trophy won yesterday.”

(3) “He saw the boog won yesterday.”

Only one word differs in each sentence: the word before the verb won. In Example 1, Bears is the subject of the verb won (it was the Bears who won yesterday). Here, won should get the VBD tag. In Example 2, trophy is the object of the verb won (it was the trophy that was won). In this sentence, won gets a VBN tag. In a typical training set (i.e., the training sections of the Penn Treebank [Marcus et al., 1993]), we don't see Bears won or trophy won at all. In fact, both the words Bears and trophy are rare enough to essentially look like Example 3 to our system. They might as well be boog! Based on even a fairly large set of labeled data, like the Penn Treebank, the correct tag for won is ambiguous.

However, the relationship between Bears and won, and between trophy and won, is fairly unambiguous if we look at unlabeled data. For both pairs of words, I have collected all 2-to-5-grams where the words co-occur in the Google V2 corpus, a collection of N-grams from the entire world wide web. An N-gram corpus states how often each sequence of words (up to length N) occurs (N-grams are discussed in detail in Chapter 3, while the Google V2 corpus is described in Chapter 5; note the Google V2 corpus includes part-of-speech tags). I replace non-stopwords by their part-of-speech tag, and sum the counts for each pattern. The top fifty most frequent patterns for {Bears, won} and {trophy, won} are given:

Bears won:
• Bears won: 3215
• the Bears won: 1252
• Bears won the: 956
• The Bears won: 875
• Bears have won: 874
• NNP Bears won: 767
• Bears won their: 443
• Bears won CD: 436
• The Bears have won: 328
• Bears won their JJ: 321
• Bears have won CD: 305
• , the Bears won: 305
• the NNP Bears won: 305
• The Bears won the: 296
• the Bears won the: 293
• The NNP Bears won: 274
• NNP Bears won the: 262
• the Bears have won: 255
• NNP Bears have won: 217
• as the Bears won: 168
• the Bears won CD: 168
• Bears won the NNP: 162
• Bears have won 00: 160
• Bears won the NN: 157
• Bears won a: 153
• the Bears won their: 148
• NNP Bears won their: 129
• The Bears have won CD: 128
• Bears won ,: 124
• Bears had won: 121
• The Bears won their: 121
• when the Bears won: 119
• The NNP Bears have won: 117
• Bears have won the: 112
• Bears won the JJ: 112
• Bears , who won: 107
• The Bears won CD: 103
• Bears won the NNP NNP: 102
• The NNP Bears won the: 100
• the NNP Bears won the: 96
• Bears have RB won: 94
• , the Bears have won: 93
• and the Bears won: 91
• IN the Bears won: 89
• Bears also won: 87
• Bears won 00: 86
• Bears have won CD of: 84
• as the NNP Bears won: 80
• Bears won CD .: 80
• , the Bears won the: 77

trophy won:
• won the trophy: 4868
• won a trophy: 2770
• won the trophy for: 1375
• won the JJ trophy: 825
• trophy was won: 811
• trophy won: 803
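The pattern-extraction procedure just described can be sketched as follows. This is a simplified illustration in Python: the stopword list, the tiny in-memory corpus, and its (word, tag) format are stand-ins for exposition, not the actual Google V2 corpus interface or the tooling used in this dissertation.

```python
from collections import Counter

# Hypothetical stand-ins: a real run would stream the POS-tagged Google
# V2 N-gram corpus; here the corpus is a small in-memory list of
# (tagged n-gram, count) pairs, where each token is a (word, tag) tuple.
STOPWORDS = {"the", "a", "an", "have", "had", "was", "their", ",", "."}

corpus = [
    ([("the", "DT"), ("Bears", "NNP"), ("won", "VBD")], 1252),
    ([("Chicago", "NNP"), ("Bears", "NNP"), ("won", "VBD")], 767),
    ([("Bears", "NNP"), ("won", "VBD"), ("three", "CD")], 436),
    ([("the", "DT"), ("trophy", "NN"), ("was", "VBD"), ("won", "VBN")], 811),
]

def collect_patterns(corpus, target1, target2):
    """Sum counts of n-grams where both target words co-occur, after
    replacing non-stopwords (other than the targets) by their POS tag."""
    patterns = Counter()
    for ngram, count in corpus:
        words = [w for w, _ in ngram]
        if target1 not in words or target2 not in words:
            continue  # keep only n-grams containing both targets
        pattern = " ".join(
            w if w in (target1, target2) or w.lower() in STOPWORDS else t
            for w, t in ngram
        )
        patterns[pattern] += count
    return patterns

for pattern, count in collect_patterns(corpus, "Bears", "won").most_common(50):
    print(f"{pattern}: {count}")  # e.g., "the Bears won: 1252"
```

Aggregated over the full web-scale corpus, this is the procedure that produces pattern lists like the ones shown above.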