Pattern Filler Type            String
#1: 3rd-person pron. sing.     it/its
#2: 3rd-person pron. plur.     they/them/their
#3: any other pronoun          he/him/his, I/me/my, etc.
#4: infrequent word token      〈UNK〉
#5: any other token            *

Table 3.3: Pattern filler types

that we can reliably classify a pronoun as being referential or non-referential based solely on the local context surrounding the pronoun, using the techniques described in Section 3.3.

The difficulty of non-referential pronouns has been acknowledged since the beginning of computational resolution of anaphora. Hobbs [1978] notes that his algorithm handles neither pronominal references to sentences nor cases where it occurs in time or weather expressions. Hirst [1981, page 17] emphasizes the importance of detecting non-referential pronouns, “lest precious hours be lost in bootless searches for textual referents.” Mueller [2006] summarizes the evolution of computational approaches to non-referential it detection. In particular, note the pioneering work of Paice and Husk [1987], the inclusion of non-referential it detection in a full anaphora resolution system by Lappin and Leass [1994], and the machine learning approach of Evans [2001].

3.7.2 Our Approach to Non-referential Pronoun Detection

We apply our web-scale disambiguation systems to this task. As in the approaches above, we turn the context into patterns, with it as the word to be labeled. Since the output classes are not explicit words, we devise surrogate fillers. To illustrate, for Example (1) we can extract the context pattern “make * in advance” and for Example (2) “make * in Hollywood,” where “*” represents the filler in the position of it. Non-referential instances tend to have the word it filling this position in the pattern’s distribution, because non-referential patterns are fairly unique to non-referential pronouns. Referential distributions occur with many other noun-phrase fillers.
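To make the comparison concrete, the it-versus-they count check can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name, the threshold of 10, and the simplified decision rule are assumptions.

```python
# Illustrative sketch: decide whether a pattern looks non-referential by
# comparing how often "it" fills the wildcard versus the plural fillers.
# The threshold and decision rule here are hypothetical simplifications.

def nonref_score(filler_counts: dict) -> float:
    """Ratio of it/its counts to they/them/their counts for one pattern."""
    it_count = sum(filler_counts.get(w, 0) for w in ("it", "its"))
    they_count = sum(filler_counts.get(w, 0) for w in ("they", "them", "their"))
    return it_count / max(they_count, 1)

# Filler distributions for the two example patterns, using the counts
# quoted from the Google N-gram corpus:
advance = {"it": 442, "them": 449}     # "make * in advance"
hollywood = {"it": 3421, "them": 0}    # "make * in Hollywood"

print(nonref_score(advance) > 10)      # False: referential pattern
print(nonref_score(hollywood) > 10)    # True: non-referential pattern
```

A referential pattern accepts many noun-phrase fillers, so its it and they counts are comparable; a non-referential pattern is dominated by it.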
For example, in the Google N-gram corpus, “make it in advance” and “make them in advance” occur roughly the same number of times (442 vs. 449), indicating a referential pattern. In contrast, “make it in Hollywood” occurs 3421 times while “make them in Hollywood” does not occur at all. This indicates that counts for patterns filled with the words it and them are useful statistics.

These simple counts strongly indicate whether another noun can replace the pronoun. Thus we can computationally distinguish between a) pronouns that refer to nouns, and b) all other instances, including those that have no antecedent, like Example (2), and those that refer to sentences, clauses, or implied topics of discourse.

We now discuss our full set of pattern fillers. For identifying non-referential it in English, we are interested in how often it occurs as a pattern filler versus other nouns. As surrogates for nouns, we gather counts for five different classes of words that fill the wildcard position, determined by string match (Table 3.3).6 The third-person plural they (#2) reliably occurs in patterns where referential it also resides. The occurrence of any other

6 Note that this work was done before the availability of the POS-tagged Google V2 corpus (Chapter 5). We could directly count noun fillers using that corpus.
pronoun (#3) guarantees that at the very least the pattern filler is a noun. A match with the infrequent word token 〈UNK〉 (#4) (explained in Section 3.2.2) will likely be a noun, because nouns account for a large proportion of rare words in a corpus. Gathering any other token (#5) also mostly finds nouns; inserting another part of speech usually results in an unlikely-to-be-observed, ungrammatical pattern.

Unlike in our work on preposition selection and spelling correction above, we process our input examples and our N-gram corpus in various ways to improve generality. We change the patterns to lower-case, convert sequences of digits to the # symbol, and run the Porter stemmer [Porter, 1980].7 Our method also works without the stemmer; we simply truncate the words in the pattern at a given maximum length. With simple truncation, all the pattern processing can be easily applied to other languages. To generalize rare names, we convert capitalized words longer than five characters to a special NE (named entity) tag. We also added a few simple rules to stem the irregular verbs be, have, do, and said, and to convert the common contractions ’nt, ’s, ’m, ’re, ’ve, ’d, and ’ll to their most likely stems. When we extract counts for a processed pattern, we sum all the counts for matching N-grams in the identically-processed Google corpus.

We run SUPERLM using the above fillers and their processed-pattern counts as described in Section 3.3.1. For SUMLM, we decide NonRef if the difference between the SUMLM scores for it and they is above a threshold. For TRIGRAM, we also threshold the ratio between it-counts and they-counts. For RATIOLM, we compare the frequencies of it and all, and decide NonRef if the count of it is higher. These thresholds and comparison choices were optimized on the development set.

3.7.3 Non-referential Pronoun Detection Data

We need labeled data for training and evaluation of our system.
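The pattern processing of Section 3.7.2 can be sketched as follows. This is a minimal illustration using the truncation fallback in place of the Porter stemmer; the truncation length of four, the contraction mapping, and the function names are assumptions, not the actual implementation.

```python
import re

# Sketch of the pattern preprocessing: lower-case, digits -> '#',
# rare-name generalization, contraction normalization, and truncation
# in place of stemming. Specific choices below are illustrative.

CONTRACTIONS = {"'nt": "not", "'s": "is", "'m": "am", "'re": "are",
                "'ve": "have", "'d": "would", "'ll": "will"}

def process_token(tok: str, max_len: int = 4) -> str:
    if tok in CONTRACTIONS:
        return CONTRACTIONS[tok]
    if len(tok) > 5 and tok[0].isupper():
        return "NE"                      # generalize rare capitalized names
    tok = tok.lower()
    tok = re.sub(r"\d+", "#", tok)       # digit sequences become '#'
    return tok[:max_len]                 # truncation stands in for stemming

def process_pattern(pattern: str) -> str:
    return " ".join(process_token(t) for t in pattern.split())

print(process_pattern("Make 1000 copies for Spielberg"))  # make # copi for NE
```

The same processing is applied to the N-gram corpus, so a processed pattern's count is the sum over all raw N-grams that map to it.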
This data indicates, for every occurrence of the pronoun it, whether it refers to a preceding noun phrase or not. Standard coreference resolution data sets annotate all noun phrases that have an antecedent noun phrase in the text. Therefore, we can extract labeled instances of it from these sets. We do this for the dry-run and formal sets from MUC-7 [1997], and merge them into a single data set.

Of course, fully coreference-annotated data is a precious resource, with the pronoun it making up only a small portion of the marked-up noun phrases. We thus created annotated data specifically for the pronoun it. We annotated 1020 instances in a collection of Science News articles (from 1995-2000), downloaded from the Science News website. We also annotated 709 instances in the WSJ portion of the DARPA TIPSTER Project [Harman, 1992], and 279 instances in the English portion of the Europarl Corpus [Koehn, 2005]. We take the first half of each of the subsets for training, the next quarter for development, and the final quarter for testing, creating an aggregate set with 1070 training, 533 development, and 534 test examples.

A single annotator (A1) labeled all three data sets, while two additional annotators not connected with the project (A2 and A3) were asked to separately re-annotate a portion of each, so that inter-annotator agreement could be calculated. A1 and A2 agreed on 96% of annotation decisions, while A1-A3 and A2-A3 agreed on 91% and 93% of decisions, respectively. The Kappa statistic [Jurafsky and Martin, 2000, page 315], with Pr(E) computed from the confusion matrices, was a high 0.90 for A1-A2, and 0.79 and 0.81 for the

7 Adapted from the Bow-toolkit [McCallum, 1996].
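The Kappa statistic used here can be sketched as follows, with Pr(E) computed from the confusion matrix of two annotators' decisions. The 2x2 Ref/NonRef matrix below is illustrative only, not the actual annotation data from this study.

```python
# Minimal sketch of Cohen's Kappa with chance agreement Pr(E) taken
# from the row/column marginals of an inter-annotator confusion matrix.

def kappa(confusion):
    """Cohen's Kappa for a square confusion matrix (rows: annotator 1,
    columns: annotator 2)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    pr_a = sum(confusion[i][i] for i in range(n)) / total   # observed agreement
    pr_e = sum(                                             # chance agreement
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(n)
    )
    return (pr_a - pr_e) / (1 - pr_e)

# Hypothetical Ref/NonRef decisions for two annotators
print(kappa([[45, 3], [2, 50]]))
```

Kappa discounts the agreement two annotators would reach by chance, which is substantial for a two-class labeling task.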