and annotators were asked to mark the senses in additional columns. They therefore had all occurrences of the same word available on the screen. They could mark them in any order, and revise their judgement as they went along. Annotators were instructed to choose either one sense, or several if they felt that more than one was appropriate in the given context. They could also choose no sense at all, if they felt that no sense in the dictionary was appropriate in the context. In the latter case, they were instructed to write down a question mark in the sense column. In the subsequent study, the question mark was treated as an additional sense for each word, grouping all meanings that were not found in the dictionary.

3.3. Results

The annotators gave more senses per context for verbs than for adjectives and nouns (Table 3, column Nsen). This is likely to be a result of the larger number of senses offered for verbs in the dictionary (see discussion above and Table 2). The average number of senses (used by a single judge in a given context) per POS category is not very high, which shows that annotators tend to avoid multiple answers (as said above, the "no sense" answer is counted as a special sense). However, the average per POS category masks important differences between words: the average number of responses per word ranges from 1 to 1.311 (verb comprendre). In some cases, annotators used up to six senses in a single response for a given context.

Agreement was computed according to several measures (summarised in Table 3):

(1) Full agreement among the six annotators. Two variants were computed:

Min: counts agreement when judges agree on all senses proposed for a given context
Max: counts agreement when judges agree on at least one of the senses proposed for a given context

The difference between the min and max measures is small, except for a few words (sûr, comprendre, importer). This is due to the fact that the average number of senses given by judges is close to 1 (Table 3, column Nsen). Of course, these measures are biased by the number of judges, as mentioned above. It is however striking to note that for some words (correct, historique, économie, comprendre) there was full agreement on none of the contexts for that word.
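As an illustration, the following is a minimal sketch of how the two variants could be computed, assuming each judge's response for a context is stored as a set of sense labels (with the question mark counted as a label of its own); the function and variable names are illustrative and not taken from the annotation setup:

    def full_agreement(contexts):
        """contexts: one list per context, holding one sense-label set per judge."""
        strict = 0   # "min": all judges proposed exactly the same senses
        relaxed = 0  # "max": all judges share at least one sense
        for judgements in contexts:
            if all(j == judgements[0] for j in judgements):
                strict += 1
            if set.intersection(*judgements):
                relaxed += 1
        return strict / len(contexts), relaxed / len(contexts)

    # Example with three contexts annotated by three judges:
    contexts = [
        [{"1a"}, {"1a"}, {"1a"}],          # identical answers
        [{"1a", "2"}, {"2"}, {"2", "3"}],  # only sense "2" is shared
        [{"1a"}, {"3"}, {"?"}],            # no sense in common
    ]
    print(full_agreement(contexts))        # -> (0.333..., 0.666...)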
(2) Pairwise agreement. Three variants were computed:

Min: counts agreement when judges agree on all senses proposed for a given context
Max: counts agreement when judges agree on at least one of the senses proposed for a given context
Weighted: accounts for partial agreement using the Dice coefficient:

    Dice = 2|A ∩ B| / (|A| + |B|)

Again, there is not much difference between the measures, except for a few words, interestingly enough not exactly the same as before (chef, comprendre, connaître).

(3) Agreement corrected for chance. The measures above are not completely satisfactory, because they do not allow a comparison between the observed agreement and the agreement that would be obtained by pure chance. The κ statistic mentioned above enables such a comparison. In order to account for partial agreement, κ was computed on the weighted pairwise measure using the extension proposed in [8].

It is interesting to note that κ ranges between 0.92 (noun détention) and 0.007 (adjective correct). In other words, there is no more agreement than chance for some words. The average κ values are low, below 0.5, which indicates a great amount of disagreement among judges.

                      Full           Pairwise
    POS   Nsen    min    max     min    max    wgh      κ
    A     1.013   0.43   0.46    0.69   0.72   0.71   0.41
    N     1.009   0.44   0.45    0.72   0.74   0.73   0.46
    V     1.045   0.29   0.34    0.60   0.65   0.63   0.41

Table 3. Agreement measures per POS category
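A comparable sketch for the pairwise measures follows: the Dice weighting matches the definition above, while the chance correction is only a generic kappa-style formulation, since the exact extension of [8] is not reproduced in this excerpt; the expected-agreement term (judges drawing senses at random according to the overall sense distribution) is an assumption made for illustration only.

    from itertools import combinations
    from collections import Counter

    def dice(a, b):
        # Dice coefficient between two sense sets: 2|A ∩ B| / (|A| + |B|)
        return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0

    def pairwise_agreement(contexts):
        """contexts: one list per context, holding one sense-label set per judge."""
        strict = relaxed = weighted = pairs = 0
        for judgements in contexts:
            for a, b in combinations(judgements, 2):
                pairs += 1
                strict += (a == b)        # "min": identical sense sets
                relaxed += bool(a & b)    # "max": at least one shared sense
                weighted += dice(a, b)    # partial credit for overlapping sets
        return strict / pairs, relaxed / pairs, weighted / pairs

    def chance_corrected_kappa(contexts):
        _, _, observed = pairwise_agreement(contexts)
        # Expected agreement if judges picked one sense at random following the
        # overall sense distribution (an illustrative assumption, not the [8] extension).
        freq = Counter(s for judgements in contexts for j in judgements for s in j)
        total = sum(freq.values())
        expected = sum((n / total) ** 2 for n in freq.values())
        return (observed - expected) / (1 - expected)

Note that when every judge assigns exactly one sense per context, the Dice-weighted measure reduces to the plain pairwise measures, which is consistent with the small differences reported above.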
It is possible that the sense divisions contained in dictionaries are too fine-grained for NLP purposes. This argument has been made many times, and many WSD systems have been restricted to homograph-level or broad sense distinctions.

In order to test this hypothesis, I have computed the degree of inter-annotator agreement when their responses are reduced to the top-level distinctions made in the dictionary (French dictionaries are much more hierarchical than English ones, due to different lexicographic traditions). The