won a trophy for (689) | won the trophy for the (631) | trophy was won by (626) | won a JJ trophy (513)
won the trophy in (511) | won the trophy . (493) | RB won a trophy (439) | trophy they won (421)
won the NN trophy (405) | trophy won by (396) | have won the trophy (396) | won this trophy (377)
the trophy they won (329) | won the NNP trophy (325) | won a trophy . (313) | won the trophy NN (295)
trophy he won (292) | has won the trophy (290) | won the trophy for JJS (284) | won a trophy in (274)
won the trophy in 0000 (272) | won the JJ NN trophy (267) | won a trophy and (249) | RB won the trophy (242)
who won the trophy (242) | and won the trophy (240) | won the trophy , (228) | won a trophy , (215)
won a trophy at (199) | , won the trophy (191) | also won the trophy (189) | had won the trophy (186)
won DT trophy (184) | and won a trophy (178) | the trophy won (173) | won their JJ trophy (169)
JJ trophy (168) | won the trophy RB (161) | won the JJ trophy in (155) | won a JJ NN trophy (155)
I won a trophy (153) | won the trophy CD (145) | won the trophy and (141) | trophy , won (141)

In this data, Bears is almost always the subject of the verb, occurring before won and with an object phrase afterwards (like won the or won their, etc.). On the other hand, trophy almost always appears as an object, occurring after won, or in passive constructions (trophy was won, trophy won by), or with another noun in the subject role (trophy they won, trophy he won). If, on the web, a pair of words tends to occur in a particular relationship, then for an ambiguous instance of this pair at test time, it is reasonable to also predict this relationship.

Now think about boog. A lot of words look like boog to a system that has only seen limited labeled data. Now, if globally the words boog and won occur in the same patterns in which trophy and won occur, then it would be clear that boog is also usually the object of won, and thus won is likely a past participle (VBN) in Example 3.
If, on the other hand, boog occurs in the same patterns as Bears, we would consider it a subject, and label won as a past-tense verb (VBD).³

    ³ Of course, it might be the case that boog and won don't occur in unlabeled data either, in which case we might back off to even more general global features, but we leave this issue aside for the moment.

So, in summary, while a pair of words, like trophy and won, might be very rare in our labeled data, the patterns in which these words occur (the distribution of the words), like won the trophy and trophy was won, may be very indicative of a particular relationship. These indicative patterns will likely be shared by other pairs in the labeled training data (e.g., we'll see global patterns like bought the securities, market was closed, etc. for labeled examples like "the securities bought by" and "the market closed up 134 points"). So, we supplement our sparse information (the identity of individual words) with more-general information (statistics from the distribution of those words on the web). The word's global distribution can provide features just like the features taken from the word's local context. By local, I mean the contextual information surrounding the words to be classified in a given sentence. Combining local and global sources of information together, we can achieve higher performance.

Note, however, that when the local context is unambiguous, it is usually a better bet to rely on the local information over the global, distributional statistics. For example, if the
actual sentence said, "My son's simple trophy won their hearts," then we should guess VBD for won, regardless of the global distribution of trophy won. Of course, we let the learning algorithm choose the relative weight on global vs. local information. In my experience, when good local features are available, the learning algorithm will usually put most of the weight on them, as the algorithm finds these features to be statistically more reliable. So we must lower our expectations for the possible benefits of purely distributional information. When there are already other good sources of information available locally, the effect of global information is diminished. Section 5.6 presents some experimental results on VBN-VBD disambiguation and discusses this point further.

Using N-grams for Learning from Unlabeled Data

In our work, we make use of aggregate counts over a large corpus; we don't inspect the individual instances of each phrase. That is, we do not separately process the 4868 sentences where "won the trophy" occurs on the web; rather, we use the N-gram, won the trophy, and its count, 4868, as a single unit of information. We do this mainly because it's computationally inefficient to process all the instances (that is, the entire web). Very good inferences can be drawn from the aggregate statistics. Chapter 2 describes a range of alternative methods for exploiting unlabeled data; many of these cannot scale to web-scale text.

1.4 A Perspective on Statistical vs. Linguistic Approaches

When reading any document, it can be useful to think about the author's perspective. Sometimes, when we establish the author's perspective, we might also establish that the document is not worth reading any further.
This might happen, for example, if the author's perspective is completely at odds with our own, or if it seems likely the author's perspective will prevent them from viewing evidence objectively.

Surely, some readers of this document are also wondering about the perspective of its author. Does he approach language from a purely statistical viewpoint, or is he interested in linguistics itself? The answer: although I certainly advocate the use of statistical methods and huge volumes of data, I am mostly interested in how these resources can help with real linguistic phenomena. I agree that linguistics has an essential role to play in the future of NLP [Jelinek, 2005; Hajič and Hajičová, 2007]. I aim to be aware of the knowledge of linguists, and I try to think about where this knowledge might apply in my own work. I try to gain insight into problems by annotating data myself. When I tackle a particular linguistic phenomenon, I try to think about how that phenomenon serves human communication and thought, how it may work differently in written or spoken language, how it may work differently across human languages, and how a particular computational representation may be inadequate. By doing these things, I hope not only to produce more interesting and insightful research, but to produce systems that work better. For example, while a search on Google Scholar reveals a number of papers proposing "language independent" approaches to tasks such as named-entity recognition, parsing, grapheme-to-phoneme conversion, and information retrieval, it is my experience that approaches that pay attention to language-specific issues tend to work better (e.g., in transliteration [Jiampojamarn et al., 2010]). In fact, exploiting linguistic knowledge can even help the Google statistical translation system [Xu et al., 2009] – a system that is often mentioned as an example of a purely data-driven NLP approach.
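The aggregate-count idea from the preceding section can be made concrete with a small sketch. Rather than inspecting individual web sentences, we sum the counts of a few patterns in which a noun looks like the object of won (after won, or in passives) versus the subject (before won, followed by an object phrase), and let those totals vote for a VBN or VBD reading. The pattern templates, the toy count table, and the function name below are illustrative assumptions for exposition; they are not the actual feature set or classifier used in this thesis.

```python
# Sketch: aggregate web N-gram counts as "global" distributional evidence
# for VBN/VBD disambiguation. Toy counts, loosely modeled on the
# (trophy, won) data above; not real web statistics.

# Toy N-gram table: pattern string -> aggregate web count.
NGRAM_COUNTS = {
    "trophy was won": 626,    # passive: trophy is the object
    "trophy won by": 396,     # passive: trophy is the object
    "won the trophy": 4868,   # trophy follows won as its object
    "won a trophy": 689,
    "trophy they won": 421,   # another noun fills the subject role
    "trophy won the": 12,     # rare: trophy acting as the subject
}

# Patterns signaling the noun is usually the OBJECT of "won" (supports a
# past-participle, VBN, reading in "the <noun> won ..." contexts) versus
# the SUBJECT (supports a past-tense, VBD, reading). "{n}" marks the noun slot.
OBJECT_PATTERNS = ["{n} was won", "{n} won by", "won the {n}",
                   "won a {n}", "{n} they won"]
SUBJECT_PATTERNS = ["{n} won the"]

def distributional_vote(noun, counts):
    """Predict VBN vs. VBD for 'won' next to `noun`, using only
    aggregate pattern counts (no per-instance processing)."""
    obj = sum(counts.get(p.format(n=noun), 0) for p in OBJECT_PATTERNS)
    subj = sum(counts.get(p.format(n=noun), 0) for p in SUBJECT_PATTERNS)
    return "VBN" if obj > subj else "VBD"

print(distributional_vote("trophy", NGRAM_COUNTS))  # object-like evidence dominates -> VBN
```

In the actual system, totals like these would not decide the tag directly; they would enter a classifier as global features alongside local-context features, so that the learner can down-weight the distributional evidence whenever the local context is unambiguous.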