• won a trophy for: 689
• won the trophy for the: 631
• trophy was won by: 626
• won a JJ trophy: 513
• won the trophy in: 511
• won the trophy.: 493
• RB won a trophy: 439
• trophy they won: 421
• won the NN trophy: 405
• trophy won by: 396
• have won the trophy: 396
• won this trophy: 377
• the trophy they won: 329
• won the NNP trophy: 325
• won a trophy .: 313
• won the trophy NN: 295
• trophy he won: 292
• has won the trophy: 290
• won the trophy for JJS: 284
• won a trophy in: 274
• won the trophy in 0000: 272
• won the JJ NN trophy: 267
• won a trophy and: 249
• RB won the trophy: 242
• who won the trophy: 242
• and won the trophy: 240
• won the trophy,: 228
• won a trophy ,: 215
• won a trophy at: 199
• , won the trophy: 191
• also won the trophy: 189
• had won the trophy: 186
• won DT trophy: 184
• and won a trophy: 178
• the trophy won: 173
• won their JJ trophy: 169
• JJ trophy: 168
• won the trophy RB: 161
• won the JJ trophy in: 155
• won a JJ NN trophy: 155
• I won a trophy: 153
• won the trophy CD: 145
• won the trophy and: 141
• trophy , won: 141

In this data, Bears is almost always the subject of the verb, occurring before won and with an object phrase afterwards (like won the or won their, etc.). On the other hand, trophy almost always appears as an object, occurring after won or in passive constructions (trophy was won, trophy won by) or with another noun in the subject role (trophy they won, trophy he won). If, on the web, a pair of words tends to occur in a particular relationship, then for an ambiguous instance of this pair at test time, it is reasonable to also predict this relationship.

Now think about boog. A lot of words look like boog to a system that has only seen limited labeled data. Now, if globally the words boog and won occur in the same patterns in which trophy and won occur, then it would be clear that boog is also usually the object of won, and thus won is likely a past participle (VBN) in Example 3. If, on the other hand, boog occurs in the same patterns as Bears, we would consider it a subject, and label won as a past-tense verb (VBD).³

So, in summary, while a pair of words, like trophy and won, might be very rare in our labeled data, the patterns in which these words occur (the distribution of the words), like won the trophy, and trophy was won, may be very indicative of a particular relationship. These indicative patterns will likely be shared by other pairs in the labeled training data (e.g., we'll see global patterns like bought the securities, market was closed, etc. for labeled examples like "the securities bought by" and "the market closed up 134 points"). So, we supplement our sparse information (the identity of individual words) with more-general information (statistics from the distribution of those words on the web). The word's global distribution can provide features just like the features taken from the word's local context. By local, I mean the contextual information surrounding the words to be classified in a given sentence. Combining local and global sources of information together, we can achieve higher performance.

Note, however, that when the local context is unambiguous, it is usually a better bet to rely on the local information over the global, distributional statistics. For example, if the

³ Of course, it might be the case that boog and won don't occur in unlabeled data either, in which case we might back off to even more general global features, but we leave this issue aside for the moment.
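To make the distributional comparison concrete, here is a minimal sketch of the idea, not the actual system developed in this thesis. The pattern strings, counts, exemplar pairs, and function names are all invented for illustration: each (noun, verb) pair is represented by the counts of the web patterns it occurs in (with the noun abstracted to X), and an ambiguous pair like (boog, won) is compared against pairs whose role is already clear.

from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Pattern-count vectors for pairs whose relationship we trust.
# "X" stands for the noun; the counts are made up for illustration.
object_exemplar = {    # behaves like (trophy, won): the noun is the object
    "won the X": 631, "X was won by": 626, "won a X": 689, "X they won": 421,
}
subject_exemplar = {   # behaves like (Bears, won): the noun is the subject
    "X won the": 800, "X won their": 450, "X have won": 300, "when X won": 120,
}

def predict_tag(pair_counts):
    """Guess VBN (noun acts as object) vs. VBD (noun acts as subject) for
    an ambiguous instance of 'won', using only the pair's web distribution."""
    obj_sim = cosine(pair_counts, object_exemplar)
    subj_sim = cosine(pair_counts, subject_exemplar)
    return "VBN" if obj_sim > subj_sim else "VBD"

# Hypothetical web counts for the unseen pair (boog, won):
boog_counts = {"won the X": 40, "X was won by": 12, "won a X": 25}
print(predict_tag(boog_counts))   # -> "VBN": boog distributes like an object

In a real tagger, scores like obj_sim and subj_sim (or the pattern counts themselves) would be added as extra features alongside the usual local-context features rather than applied as a standalone rule, reflecting the combination of local and global information described above.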
