won a trophy for (689) | won the trophy for the (631) | trophy was won by (626) | won a JJ trophy (513)
won the trophy in (511) | won the trophy . (493) | RB won a trophy (439) | trophy they won (421)
won the NN trophy (405) | trophy won by (396) | have won the trophy (396) | won this trophy (377)
the trophy they won (329) | won the NNP trophy (325) | won a trophy . (313) | won the trophy NN (295)
trophy he won (292) | has won the trophy (290) | won the trophy for JJS (284) | won a trophy in (274)
won the trophy in 0000 (272) | won the JJ NN trophy (267) | won a trophy and (249) | RB won the trophy (242)
who won the trophy (242) | and won the trophy (240) | won the trophy , (228) | won a trophy , (215)
won a trophy at (199) | , won the trophy (191) | also won the trophy (189) | had won the trophy (186)
won DT trophy (184) | and won a trophy (178) | the trophy won (173) | won their JJ trophy (169)
JJ trophy (168) | won the trophy RB (161) | won the JJ trophy in (155) | won a JJ NN trophy (155)
I won a trophy (153) | won the trophy CD (145) | won the trophy and (141) | trophy , won (141)

In this data, Bears is almost always the subject of the verb, occurring before won and with an object phrase afterwards (like won the or won their, etc.). On the other hand, trophy almost always appears as an object, occurring after won, or in passive constructions (trophy was won, trophy won by), or with another noun in the subject role (trophy they won, trophy he won). If, on the web, a pair of words tends to occur in a particular relationship, then for an ambiguous instance of this pair at test time, it is reasonable to also predict this relationship.

Now think about boog. A lot of words look like boog to a system that has only seen limited labeled data. Now, if globally the words boog and won occur in the same patterns in which trophy and won occur, then it would be clear that boog is also usually the object of won, and thus won is likely a past participle (VBN) in Example 3.
If, on the other hand, boog occurs in the same patterns as Bears, we would consider it a subject, and label won as a past-tense verb (VBD).³

    ³ Of course, it might be the case that boog and won don't occur in unlabeled data either, in which case we might back off to even more general global features, but we leave this issue aside for the moment.

So, in summary, while a pair of words, like trophy and won, might be very rare in our labeled data, the patterns in which these words occur (the distribution of the words), like won the trophy and trophy was won, may be very indicative of a particular relationship. These indicative patterns will likely be shared by other pairs in the labeled training data (e.g., we'll see global patterns like bought the securities, market was closed, etc. for labeled examples like "the securities bought by" and "the market closed up 134 points"). So, we supplement our sparse information (the identity of individual words) with more-general information (statistics from the distribution of those words on the web). The word's global distribution can provide features just like the features taken from the word's local context. By local, I mean the contextual information surrounding the words to be classified in a given sentence. Combining local and global sources of information together, we can achieve higher performance.

Note, however, that when the local context is unambiguous, it is usually a better bet to rely on the local information over the global, distributional statistics. For example, if the
actual sentence said, "My son's simple trophy won their hearts," then we should guess VBD for won, regardless of the global distribution of trophy won. Of course, we let the learning algorithm choose the relative weight on global vs. local information. In my experience, when good local features are available, the learning algorithm will usually put most of the weight on them, as the algorithm finds these features to be statistically more reliable. So we must lower our expectations for the possible benefits of purely distributional information. When there are already other good sources of information available locally, the effect of global information is diminished. Section 5.6 presents some experimental results on VBN-VBD disambiguation and discusses this point further.

Using N-grams for Learning from Unlabeled Data

In our work, we make use of aggregate counts over a large corpus; we don't inspect the individual instances of each phrase. That is, we do not separately process the 4868 sentences where "won the trophy" occurs on the web; rather, we use the N-gram, won the trophy, and its count, 4868, as a single unit of information. We do this mainly because it's computationally inefficient to process all the instances (that is, the entire web). Very good inferences can be drawn from the aggregate statistics. Chapter 2 describes a range of alternative methods for exploiting unlabeled data; many of these cannot scale to web-scale text.

1.4 A Perspective on Statistical vs. Linguistic Approaches

When reading any document, it can be useful to think about the author's perspective. Sometimes, when we establish the author's perspective, we might also establish that the document is not worth reading any further.
This might happen, for example, if the author's perspective is completely at odds with our own, or if it seems likely the author's perspective will prevent them from viewing evidence objectively.

Surely, some readers of this document are also wondering about the perspective of its author. Does he approach language from a purely statistical viewpoint, or is he interested in linguistics itself? The answer: although I certainly advocate the use of statistical methods and huge volumes of data, I am mostly interested in how these resources can help with real linguistic phenomena. I agree that linguistics has an essential role to play in the future of NLP [Jelinek, 2005; Hajič and Hajičová, 2007]. I aim to be aware of the knowledge of linguists, and I try to think about where this knowledge might apply in my own work. I try to gain insight into problems by annotating data myself. When I tackle a particular linguistic phenomenon, I try to think about how that phenomenon serves human communication and thought, how it may work differently in written or spoken language, how it may work differently across human languages, and how a particular computational representation may be inadequate. By doing these things, I hope not only to produce more interesting and insightful research, but to produce systems that work better. For example, while a search on Google Scholar reveals a number of papers proposing "language independent" approaches to tasks such as named-entity recognition, parsing, grapheme-to-phoneme conversion, and information retrieval, it is my experience that approaches that pay attention to language-specific issues tend to work better (e.g., in transliteration [Jiampojamarn et al., 2010]). In fact, exploiting linguistic knowledge can even help the Google statistical translation system [Xu et al., 2009] – a system that is often mentioned as an example of a purely data-driven NLP approach.
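The aggregate-count idea from the preceding section can be made concrete with a small sketch. Rather than inspecting individual web sentences, we sum the counts of a few patterns in which a noun looks like the object of won (after won, or in passives) versus the subject (before won, followed by an object phrase), and let those totals vote for a VBN or VBD reading. The pattern templates, the toy count table, and the function name below are illustrative assumptions for exposition; they are not the actual feature set or classifier used in this thesis.

```python
# Sketch: aggregate web N-gram counts as "global" distributional evidence
# for VBN/VBD disambiguation. Toy counts, loosely modeled on the
# (trophy, won) data above; not real web statistics.

# Toy N-gram table: pattern string -> aggregate web count.
NGRAM_COUNTS = {
    "trophy was won": 626,    # passive: trophy is the object
    "trophy won by": 396,     # passive: trophy is the object
    "won the trophy": 4868,   # trophy follows won as its object
    "won a trophy": 689,
    "trophy they won": 421,   # another noun fills the subject role
    "trophy won the": 12,     # rare: trophy acting as the subject
}

# Patterns signaling the noun is usually the OBJECT of "won" (supports a
# past-participle, VBN, reading in "the <noun> won ..." contexts) versus
# the SUBJECT (supports a past-tense, VBD, reading). "{n}" marks the noun slot.
OBJECT_PATTERNS = ["{n} was won", "{n} won by", "won the {n}",
                   "won a {n}", "{n} they won"]
SUBJECT_PATTERNS = ["{n} won the"]

def distributional_vote(noun, counts):
    """Predict VBN vs. VBD for 'won' next to `noun`, using only
    aggregate pattern counts (no per-instance processing)."""
    obj = sum(counts.get(p.format(n=noun), 0) for p in OBJECT_PATTERNS)
    subj = sum(counts.get(p.format(n=noun), 0) for p in SUBJECT_PATTERNS)
    return "VBN" if obj > subj else "VBD"

print(distributional_vote("trophy", NGRAM_COUNTS))  # object-like evidence dominates -> VBN
```

In the actual system, totals like these would not decide the tag directly; they would enter a classifier as global features alongside local-context features, so that the learner can down-weight the distributional evidence whenever the local context is unambiguous.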