Unni Cathrine Eiken February 2005
with such an approach is stated by the developers: “the strategy is simple, but requires a fairly large amount of knowledge to be useful for a broad range of cases” (Carbonell and Brown 1988, p. 97). Generally speaking, the knowledge bases that knowledge-based systems for anaphora resolution rely on are difficult to represent and process, and require a considerable amount of human input (Mitkov 2001, p. 110). The information is structured using different frameworks; often each anaphora resolution system structures its knowledge base in a system-specific manner. Rather than outlining the various specific methods belonging to the traditional approaches, this section briefly mentions some of the formats used for knowledge representation.

Several frameworks have been developed to meet the need for a formalism to represent real-world or domain knowledge. Most of these have been part of specific anaphora resolution systems and have not constituted independent frameworks for the representation of real-world knowledge. Minsky’s Frames (Minsky 1975, in Botley and McEnery 2000) is a framework for representing knowledge about stereotyped objects and events. The frames are dynamic in the sense that the information they hold about a particular object or event can change when new information is encountered. Input to the system is interpreted in accordance with the information present in the frames; the frames generate expectations about the input (Botley and McEnery 2000, p. 12). If a “shooting frame” is evoked when processing the sentence in (2-9a), it creates the expectation that if somebody misses, it is likely to be the same person who was doing the shooting. Following such an expectation, it is easy to identify the correct antecedent for the anaphor.
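The idea that a frame stores role slots and generates expectations about incoming input can be illustrated with a minimal sketch. This is not an implementation from the literature; the dictionary layout, the `resolve_pronoun` helper and all names are illustrative assumptions, chosen only to show how a stored expectation (“whoever misses is probably the shooter”) can point a pronoun back to its antecedent.

```python
# Minimal sketch of a Minsky-style frame (illustrative, not from the thesis).
# A frame for a stereotyped "shooting" event holds role slots plus an
# expectation linking the subject of "miss" back to the shooter role.
shooting_frame = {
    "event": "shooting",
    "roles": {"shooter": None, "target": None},
    # Expectation: the entity that misses is likely the one that shot.
    "expectations": {"miss": "shooter"},
}

def resolve_pronoun(frame, verb):
    """Return the antecedent predicted for a pronoun that is the subject
    of `verb`, using the role expectation stored in the frame (or None
    if the frame holds no expectation for this verb)."""
    expected_role = frame["expectations"].get(verb)
    if expected_role is None:
        return None
    return frame["roles"].get(expected_role)

# Processing "John shot at the target" fills the role slots ...
shooting_frame["roles"]["shooter"] = "John"
shooting_frame["roles"]["target"] = "the target"
# ... so for a following "He missed", the frame predicts the antecedent:
print(resolve_pronoun(shooting_frame, "miss"))  # prints: John
```

The point of the sketch is only that the expectation is data stored in the frame, so the resolution step itself is a trivial lookup; the difficult part, as the surrounding discussion notes, is hand-building such knowledge for a broad range of events.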
Schank’s Scripts (Schank 1972, in Botley and McEnery 2000) have some similarity to Minsky’s Frames, but are primarily used to represent knowledge about events which do not undergo change (Botley and McEnery 2000, p. 12). Information about role assignment and the sequence of events in given contexts is represented in the script.

2.1.2.3 Alternative approaches to anaphora resolution

Hand-coded knowledge bases that aim at representing real-world or domain knowledge are expensive and labor-intensive to build and maintain. As a consequence, the focus has shifted in the last 15 years toward systems that rely less heavily on world knowledge (see Mitkov 2003
for an overview). Many of these systems incorporate semantic and real-world knowledge, but use methods that allow this information to be collected with a high degree of automation (Baldwin 1997; Dagan and Itai 1990; Dagan et al. 1995; Nasukawa 1994; inter alia). Mitkov (2003) terms these systems knowledge-poor and attributes their growth in number in recent years to the fact that corpora and similar electronic linguistic resources have become better, larger and more available. Some of these systems do not really attempt to build a world- or domain-knowledge base (Baldwin 1997; Nasukawa 1994), but rather look at features such as co-occurrence patterns in the text itself, while others integrate corpora and use them as a form of abstract knowledge base (Dagan and Itai 1990; Dagan et al. 1995).

Among the different “alternative” approaches, Dagan and Itai’s (1990) statistical approach, Dagan et al.’s (1995) estimation of unseen patterns and Nasukawa’s (1994) knowledge-free method are of particular interest for this project. Dagan and Itai’s (1990) method uses co-occurrence patterns observed in a corpus as a type of selectional restriction. Co-occurrence patterns observed in a large corpus are thought to reflect the semantic constraints that apply to natural language. Candidates for antecedents of the anaphor it are identified in the text and substituted for the anaphor to be resolved. This produces co-occurrence patterns that are checked against the corpus. The candidate appearing in the most frequent co-occurrence pattern is then chosen as the antecedent. This method relies on a large corpus, as only patterns which have actually been observed in the corpus are considered. Infrequent patterns will not be picked, since they will generally not appear at the top of the pattern list. Dagan et al. (1995) offer a solution to this problem by presenting a similar method which also estimates the probability of co-occurrence patterns that have not been observed in the training data. They stress the importance of distinguishing between probable and improbable unobserved co-occurrence patterns and emphasise that the “distinctions ought to be made using the data that do occur in the corpus” (Dagan et al. 1995, p. 164). Analogies are made between specific unseen co-occurrence patterns and observed co-occurrences which contain similar words, with word similarity determined by a similarity metric. Patterns that contain words similar to the target word and that have been observed in the training data are used to calculate how likely the target word is to occur in the same pattern. Nasukawa (1994) reports a resolution rate of 93.8% for an even more knowledge-poor method for pronoun resolution. Instead of drawing information from a corpus, word frequency and co-occurrence patterns in the text itself are used to filter out the most likely candidate for the antecedent. In Nasukawa’s approach, inter-sentential data is