Large-Scale Semi-Supervised Learning for Natural Language ...
pronoun (#3) guarantees that at the very least the pattern filler is a noun. A match with the infrequent word token 〈UNK〉 (#4) (explained in Section 3.2.2) will likely be a noun because nouns account for a large proportion of rare words in a corpus. Gathering any other token (#5) also mostly finds nouns; inserting another part-of-speech usually results in an unlikely-to-be-observed, ungrammatical pattern.

Unlike our work in preposition selection and spelling correction above, we process our input examples and our N-gram corpus in various ways to improve generality. We change the patterns to lower-case, convert sequences of digits to the # symbol, and run the Porter stemmer [Porter, 1980].7 Our method also works without the stemmer; we simply truncate the words in the pattern at a given maximum length. With simple truncation, all the pattern processing can be easily applied to other languages. To generalize rare names, we convert capitalized words longer than five characters to a special NE (named entity) tag. We also added a few simple rules to stem the irregular verbs be, have, do, and said, and convert the common contractions 'nt, 's, 'm, 're, 've, 'd, and 'll to their most likely stem. When we extract counts for a processed pattern, we sum all the counts for matching N-grams in the identically-processed Google corpus.

We run SUPERLM using the above fillers and their processed-pattern counts as described in Section 3.3.1. For SUMLM, we decide NonRef if the difference between the SUMLM scores for it and they is above a threshold. For TRIGRAM, we also threshold the ratio between it-counts and they-counts. For RATIOLM, we compare the frequencies of it and all, and decide NonRef if the count of it is higher. These thresholds and comparison choices were optimized on the development set.

3.7.3 Non-referential Pronoun Detection Data

We need labeled data for training and evaluation of our system.
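The pattern-processing steps above (lower-casing, digit conversion, truncation in place of the Porter stemmer, and the NE tag for long capitalized words) can be sketched as follows. This is a minimal illustration, not the thesis code: the function names, the truncation length, and the contraction-to-stem mapping are all assumptions for the example.

```python
# Illustrative sketch of the pattern processing described above.
# MAX_LEN and the CONTRACTIONS mapping are assumptions, not values
# taken from the thesis.

MAX_LEN = 4  # hypothetical maximum word length for simple truncation

CONTRACTIONS = {"'nt": "not", "'s": "is", "'m": "am", "'re": "are",
                "'ve": "have", "'d": "would", "'ll": "will"}

def process_token(tok):
    # Generalize rare names: capitalized words longer than 5 chars -> NE.
    if tok[0].isupper() and len(tok) > 5:
        return "NE"
    tok = tok.lower()
    # Convert digit sequences to the # symbol.
    if tok.isdigit():
        return "#"
    # Map common contractions to a likely stem.
    tok = CONTRACTIONS.get(tok, tok)
    # Simple truncation in place of the Porter stemmer.
    return tok[:MAX_LEN]

def process_pattern(pattern):
    return " ".join(process_token(t) for t in pattern.split())

print(process_pattern("Gorbachev said it 's 1985"))  # NE said it is #
```

With truncation rather than stemming, the same pipeline applies unchanged to other languages, which is the portability point made above.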
This data indicates, for every occurrence of the pronoun it, whether it refers to a preceding noun phrase or not. Standard coreference resolution data sets annotate all noun phrases that have an antecedent noun phrase in the text. Therefore, we can extract labeled instances of it from these sets. We do this for the dry-run and formal sets from MUC-7 [1997], and merge them into a single data set.

Of course, full coreference-annotated data is a precious resource, with the pronoun it making up only a small portion of the marked-up noun phrases. We thus created annotated data specifically for the pronoun it. We annotated 1020 instances in a collection of Science News articles (from 1995-2000), downloaded from the Science News website. We also annotated 709 instances in the WSJ portion of the DARPA TIPSTER Project [Harman, 1992], and 279 instances in the English portion of the Europarl Corpus [Koehn, 2005]. We take the first half of each of the subsets for training, the next quarter for development and the final quarter for testing, creating an aggregate set with 1070 training, 533 development and 534 test examples.

A single annotator (A1) labeled all three data sets, while two additional annotators not connected with the project (A2 and A3) were asked to separately re-annotate a portion of each, so that inter-annotator agreement could be calculated. A1 and A2 agreed on 96% of annotation decisions, while A1-A3, and A2-A3, agreed on 91% and 93% of decisions, respectively. The Kappa statistic [Jurafsky and Martin, 2000, page 315], with Pr(E) computed from the confusion matrices, was a high 0.90 for A1-A2, and 0.79 and 0.81 for the

7. Adapted from the Bow-toolkit [McCallum, 1996].
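The Kappa statistic cited above compares observed agreement Pr(A) against chance agreement Pr(E) derived from the marginals of the annotators' confusion matrix. A minimal sketch, with an illustrative 2x2 matrix (referential vs. non-referential counts are made up for the example, not the actual annotation data):

```python
# Cohen's Kappa from a confusion matrix: Pr(A) is the diagonal mass,
# Pr(E) is the sum over labels of row-marginal * column-marginal.

def kappa(confusion):
    total = sum(sum(row) for row in confusion)
    pr_a = sum(confusion[i][i] for i in range(len(confusion))) / total
    pr_e = sum(
        (sum(confusion[i]) / total) *               # row marginal
        (sum(row[i] for row in confusion) / total)  # column marginal
        for i in range(len(confusion))
    )
    return (pr_a - pr_e) / (1 - pr_e)

# Illustrative counts for two annotators over 100 instances:
conf = [[40, 3],   # annotator 1 said referential
        [2, 55]]   # annotator 1 said non-referential
print(round(kappa(conf), 3))  # -> 0.898
```

Perfect agreement gives Kappa = 1.0, and agreement at chance level gives 0.0, which is why the 0.79-0.90 values above indicate reliable annotation.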
