Large-Scale Semi-Supervised Learning for Natural Language ...
drawn in by establishing a partial vacuum," but rather, "to be disagreeable." So another potentially useful annotation is word sense:

• "The movie sucked → The movie sucked 〈Sense=IS-DISAGREEABLE〉."

More directly, we might consider the German translation itself as the annotation:

• "The movie sucked → Der Film war schrecklich."

Finally, if we're the company Powerset, our stated objective is to produce "parse trees" for the entire web as a preprocessing step for our search engine. One part of parsing is to label the syntactic category of each word (i.e., which are nouns, which are verbs, etc.). The part-of-speech annotation might look as follows:

• "The movie sucked → The\DT movie\NN sucked\VBD"

where DT means determiner, NN means a singular or mass noun, and VBD means a past-tense verb.¹ Again, note the potential ambiguity for the tag of sucked; it could also be labeled VBN (verb, past participle). For example, sucked is a VBN in the phrase, "the movie sucked into the vacuum cleaner was destroyed."

These outputs are just a few of the possible annotations that can be produced for textual natural language input. Other branches and fields of NLP may operate over speech signals rather than actual text. Also, in the natural language generation (NLG) community, the input may not be text, but information in another form, with the desired output being grammatically-correct English sentences. Most of the work in the NLP community, however, operates exactly in this framework: text comes in, annotations come out. But how does an NLP system produce these annotations automatically?

1.2 Writing Rules vs. Machine Learning

One might imagine writing some rules to produce these annotations automatically. For part-of-speech tagging, we might say, "if the word is movie, then label the word as NN." These word-based rules fail when the word can have multiple tags (e.g., saw, wind, etc. can be nouns or verbs). Also, no matter how many rules we write, there will always be new or rare words that didn't make it into our rule set. For ambiguous words, we could try to use rules that depend on the word's context. Such a rule might be, "if the previous word is The and the next word ends in -ed, then label as NN." But this rule would fail for "the Oilers skated," since here the tag is not NN but NNPS: a plural proper noun. We could change the rule to: "if the previous word is The and the next word ends in -ed, and the word is lower-case, then label as NN." But this would fail for "The begrudgingly viewed movie," where now "begrudgingly" is an adverb, not a noun. We might imagine adding many, many more rules.

Also, we might wish to attach scores to our rules, in order to resolve conflicting rules in a principled way. We could say, "if the word is wind, give the score for being a NN a ten and for being a VB a two," and this score could be combined with other context-based scores to produce a cumulative score for each possible tag. The highest-scoring tag would be taken as the output.

¹ Refer to Appendix A for definitions and examples from the Penn Treebank tag set, the most commonly-used part-of-speech tag set.
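The scored-rule idea above can be sketched in a few lines of code. This is a hypothetical illustration, not an implementation from this thesis: the particular rules, score values, and helper names (`score_tags`, `tag`) are invented for the example. Each rule that fires adds a score to a candidate tag, and the highest-scoring tag wins.

```python
def score_tags(words, i):
    """Return a {tag: score} dict for words[i] by summing rule scores."""
    word = words[i]
    prev_word = words[i - 1] if i > 0 else None
    next_word = words[i + 1] if i + 1 < len(words) else None

    scores = {}

    def vote(tag, score):
        scores[tag] = scores.get(tag, 0) + score

    # Word-based rules, e.g. "if the word is wind, give the score for
    # being a NN a ten and for being a VB a two."
    if word == "wind":
        vote("NN", 10)
        vote("VB", 2)
    if word == "movie":
        vote("NN", 10)

    # A context-based rule, e.g. "if the previous word is The and the
    # next word ends in -ed, then prefer NN."
    if prev_word == "The" and next_word is not None and next_word.endswith("ed"):
        vote("NN", 5)

    return scores


def tag(words):
    """Tag each word with its highest-scoring tag, or UNK if no rule fired."""
    output = []
    for i, word in enumerate(words):
        scores = score_tags(words, i)
        best = max(scores, key=scores.get) if scores else "UNK"
        output.append(f"{word}\\{best}")
    return output


print(tag(["The", "movie", "sucked"]))  # movie gets NN; the others hit no rule
```

With only three toy rules, most words fall through to UNK, which is exactly the coverage problem the text describes: hand-written rules never keep up with new or rare words, motivating the machine-learned approach that follows.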
