Large-Scale Semi-Supervised Learning for Natural Language ...


(2) “He saw the trophy won yesterday.”
(3) “He saw the boog won yesterday.”

Only one word differs in each sentence: the word before the verb won. In Example 1, Bears is the subject of the verb won (it was the Bears who won yesterday). Here, won should get the VBD tag. In Example 2, trophy is the object of the verb won (it was the trophy that was won). In this sentence, won gets a VBN tag. In a typical training set (i.e. the training sections of the Penn Treebank [Marcus et al., 1993]), we don’t see Bears won or trophy won at all. In fact, both the words Bears and trophy are rare enough to essentially look like Example 3 to our system. They might as well be boog! Based on even a fairly large set of labeled data, like the Penn Treebank, the correct tag for won is ambiguous.

However, the relationship between Bears and won, and between trophy and won, is fairly unambiguous if we look at unlabeled data. For both pairs of words, I have collected all 2-to-5-grams where the words co-occur in the Google V2 corpus, a collection of N-grams from the entire world wide web.
An N-gram corpus states how often each sequence of words (up to length N) occurs (N-grams are discussed in detail in Chapter 3, while the Google V2 corpus is described in Chapter 5; note the Google V2 corpus includes part-of-speech tags). I replace non-stopwords by their part-of-speech tag, and sum the counts for each pattern. The top fifty most frequent patterns for {Bears, won} and {trophy, won} are given:

Bears won:
• Bears won: 3215
• the Bears won: 1252
• Bears won the: 956
• The Bears won: 875
• Bears have won: 874
• NNP Bears won: 767
• Bears won their: 443
• Bears won CD: 436
• The Bears have won: 328
• Bears won their JJ: 321
• Bears have won CD: 305
• , the Bears won: 305
• the NNP Bears won: 305
• The Bears won the: 296
• the Bears won the: 293
• The NNP Bears won: 274
• NNP Bears won the: 262
• the Bears have won: 255
• NNP Bears have won: 217
• as the Bears won: 168
• the Bears won CD: 168
• Bears won the NNP: 162
• Bears have won 00: 160
• Bears won the NN: 157
• Bears won a: 153
• the Bears won their: 148
• NNP Bears won their: 129
• The Bears have won CD: 128
• Bears won ,: 124
• Bears had won: 121
• The Bears won their: 121
• when the Bears won: 119
• The NNP Bears have won: 117
• Bears have won the: 112
• Bears won the JJ: 112
• Bears , who won: 107
• The Bears won CD: 103
• Bears won the NNP NNP: 102
• The NNP Bears won the: 100
• the NNP Bears won the: 96
• Bears have RB won: 94
• , the Bears have won: 93
• and the Bears won: 91
• IN the Bears won: 89
• Bears also won: 87
• Bears won 00: 86
• Bears have won CD of: 84
• as the NNP Bears won: 80
• Bears won CD .: 80
• , the Bears won the: 77

trophy won:
• won the trophy: 4868
• won a trophy: 2770
• won the trophy for: 1375
• won the JJ trophy: 825
• trophy was won: 811
• trophy won: 803
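
The pattern-counting procedure described in the text (collect the 2-to-5-grams in which both words occur, replace non-stopwords by their part-of-speech tags, and sum the counts for each pattern) can be sketched in Python. This is only an illustration under stated assumptions: the record format (a list of (word, tag) pairs plus a count), the stopword list, and the function name `top_patterns` are hypothetical and are not the Google V2 corpus format or the code that produced the lists above. The toy counts are invented for the example.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Tuple

# Hypothetical record format: ([(word, POS tag), ...], count).
TaggedNgram = Tuple[List[Tuple[str, str]], int]

# Illustrative stopword list only; the text does not say which list was used.
STOPWORDS: FrozenSet[str] = frozenset(
    {"the", "a", "their", "have", "had", "as", "when", "and", ",", "."}
)


def top_patterns(
    ngrams: Iterable[TaggedNgram],
    targets: FrozenSet[str],
    stopwords: FrozenSet[str] = STOPWORDS,
    k: int = 50,
) -> List[Tuple[str, int]]:
    """Return the k most frequent patterns in which all target words co-occur."""
    totals: Counter = Counter()
    for tokens, count in ngrams:
        # Keep only 2-to-5-grams that contain every target word.
        if not 2 <= len(tokens) <= 5:
            continue
        if not targets <= {word for word, _ in tokens}:
            continue
        # Target words and stopwords stay as written; everything else
        # is replaced by its part-of-speech tag.
        pattern = " ".join(
            word if word in targets or word.lower() in stopwords else tag
            for word, tag in tokens
        )
        totals[pattern] += count
    return totals.most_common(k)


# Invented toy data, not real corpus counts.
toy_ngrams: List[TaggedNgram] = [
    ([("the", "DT"), ("Bears", "NNPS"), ("won", "VBD")], 10),
    ([("Chicago", "NNP"), ("Bears", "NNPS"), ("won", "VBD")], 4),
    ([("Detroit", "NNP"), ("Bears", "NNPS"), ("won", "VBD")], 3),
    ([("the", "DT"), ("Lions", "NNPS"), ("won", "VBD")], 99),
]

print(top_patterns(toy_ngrams, frozenset({"Bears", "won"})))
# [('the Bears won', 10), ('NNP Bears won', 7)]
```

In the toy data, the two N-grams that differ only in the proper noun before Bears collapse into the single pattern `NNP Bears won`, which is the effect the tag substitution is meant to achieve: counts are pooled across lexically different but structurally identical contexts.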
