We KnowItAll:
Lessons from a Quarter Century of Web Extraction Research

Oren Etzioni
Turing Center
University of Washington
http://turing.cs.washington.edu

KnowItAll Team
• Michele Banko
• Michael Cafarella
• Tom Lin
• Mausam
• Alan Ritter
• Stefan Schoenmackers
• Stephen Soderland
• Dan Weld
• Alumni: Doug Downey, Ana-Maria Popescu, Alex Yates, and others.
"In the Knowledge Lies the Power"
• Manual "knowledge engineering"
  • Cyc, Freebase, WordNet
• Volunteers
  • Open Mind, DBPedia
  • von Ahn: games → Verbosity
• KnowItAll project: "read" the Web

Outline
1. Turing Center
2. Machine Reading & Open IE
3. TextRunner
4. Textual Inference
5. Conclusions
2. What is Machine Reading?
Unsupervised understanding of text
Information extraction + inference

Information Extraction (IE) via Rules (Hearst '92)
• "…Cities such as Boston, Los Angeles, and Seattle…"
• ("C such as NP1, NP2, and NP3") =>
  IS-A(each(head(NP)), C), where ProperNoun(head(NP))
• But the rule misfires on sentences like:
  • "Detailed information for several countries such as maps, …"
  • "I listen to pretty much all music but prefer country such as Garth Brooks"
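The "such as" rule above can be sketched as a regular expression. This is an illustrative sketch, not Hearst's implementation: the proper-noun NP pattern stands in for real noun-phrase chunking.

```python
import re

# A proper-noun NP: one or more capitalized words ("Los Angeles").
NP = r"[A-Z]\w*(?:\s[A-Z]\w*)*"

# "C such as NP1, NP2, and NP3"
PATTERN = re.compile(
    rf"(?P<cls>\w+)\s+such\s+as\s+"
    rf"(?P<items>{NP}(?:\s*,\s*{NP})*(?:\s*,?\s+and\s+{NP})?)"
)

def extract_isa(sentence):
    """Return (instance, class) pairs matched by the rule."""
    pairs = []
    for m in PATTERN.finditer(sentence):
        # Split the matched item list on commas and "and".
        items = re.split(r",\s*(?:and\s+)?|\s+and\s+", m.group("items"))
        pairs.extend((item, m.group("cls")) for item in items if item)
    return pairs

print(extract_isa("Cities such as Boston, Los Angeles, and Seattle grew."))
```

Note that even the proper-noun filter does not rescue "prefer country such as Garth Brooks", where the rule happily extracts IS-A(Garth Brooks, country), echoing the slide's point that simple rules misfire.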
IE as Supervised Learning (E.g., Riloff '96, Soderland '99)
Labeled Examples → Relation Extractor
  ManagementSuccession: Person-In, Person-Out, Organization, Position
• Hand-label examples of each relation
  + "S. Smith formerly chairman of XYZ Corporation …"
• Manual labor linear in |relations|
• Learn lexicalized extractor
  "<Person-Out> formerly <Position> of <Organization>" = one pattern per relation!

Semi-Supervised Learning
• Few hand-labeled examples
  → Limit on the number of relations
  → Relations are pre-specified
  → Problematic for Machine Reading
• Alternative: self-supervised learning
  • Learner discovers relations on the fly (Sekine '06)
  • Learner automatically labels examples
  • "Look Ma, No Hands!" (Downey & Etzioni '08)
Open IE = Self-supervised IE (Banko et al., IJCAI '07, ACL '08)

              Traditional IE               Open IE
Input:        Corpus + Hand-labeled Data   Corpus
Relations:    Specified in Advance         Discovered Automatically
Complexity:   O(D * R)                     O(D)
              (D documents, R relations)   (D documents)
Output:       Lexicalized,                 Relation-independent
              relation-specific

How is Open IE Possible?
• There is a compact set of relationship expressions in English (Banko & Etzioni ACL '08)
• "Relationship expressions" can be specified without reference to the individual relation!
  • Gates founded Microsoft → founded(Gates, Microsoft)
Relation-Indep. Extraction Rules

Category      Pattern               Example                   Frequency
Verb          E1 Verb E2            X established Y           37.8%
Noun+Prep     E1 NP Prep E2         the X settlement with Y   22.8%
Verb+Prep     E1 Verb Prep E2       X moved to Y              16.0%
Infinitive    E1 to Verb E2         X to acquire Y            9.4%
Modifier      E1 Verb E2 NP         X is Y winner             5.2%
Coordinate_n  E1 (and|,|-|:) E2 NP  X - Y deal                1.8%
Coordinate_v  E1 (and|,) E2 Verb    X , Y merge               1.0%
Appositive    E1 NP (:|,)? E2

Relation-Independent Observations
• 95% of sample → 8 broad patterns!
• Rules are simplified…
• Applicability conditions are complex, e.g. for "E1 Verb E2":
  1. Kentucky Fried Chicken
  2. Microsoft announced Tuesday that…
• Learned via a CRF model
• Effective, but far from perfect!
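A toy illustration of why the applicability conditions for "E1 Verb E2" are complex. The POS tags below are hand-assigned for the example rather than produced by a tagger, and the rule itself is a deliberately naive stand-in for the learned CRF.

```python
# Naive "E1 Verb E2" matcher over POS-tagged tokens.
def naive_verb_extractions(tagged):
    """tagged: list of (token, pos) pairs; returns (E1, verb, E2) triples."""
    triples = []
    for i in range(1, len(tagged) - 1):
        (e1, t1), (v, tv), (e2, t2) = tagged[i - 1], tagged[i], tagged[i + 1]
        if t1 == "NNP" and tv.startswith("VB") and t2 == "NNP":
            triples.append((e1, v, e2))
    return triples

# "Fried" can be tagged as a past-tense verb, so the naive rule
# hallucinates fried(Kentucky, Chicken).
print(naive_verb_extractions(
    [("Kentucky", "NNP"), ("Fried", "VBD"), ("Chicken", "NNP")]))

# A correct hit: Gates founded Microsoft → founded(Gates, Microsoft).
print(naive_verb_extractions(
    [("Gates", "NNP"), ("founded", "VBD"), ("Microsoft", "NNP")]))
```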
Lesson 1: to scale IE to the Web corpus, use Open IE

What about extraction errors?
Leveraging the Web's Redundancy (Downey, Soderland, & Etzioni, IJCAI '05)
1) Repetition in distinct sentences
2) Multiple, distinct extraction rules

Phrase                        Hits
"Atlanta and other cities"     980
"Canada and other cities"      286
"cities such as Atlanta"      5860
"cities such as Canada"          7

IJCAI '05: A formal model of these intuitions

IJCAI '05 Combinatorial Model
If an extraction x appears k times in a set of n distinct sentences matching a rule, what is the probability that x ∈ class C?
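One way to make the combinatorial question concrete is a toy Bayes update that treats each matching sentence as noisy evidence. The prior and the per-sentence repetition rates below are illustrative assumptions; this is a simplification, not the paper's actual combinatorial ("urns") model.

```python
from math import comb

def p_correct(k, n, prior=0.5, rate_correct=0.9, rate_error=0.1):
    """P(x in C | x appeared in k of n matching sentences).

    Correct extractions are assumed to repeat at rate_correct per
    matching sentence, errors at rate_error (both illustrative).
    """
    like_c = comb(n, k) * rate_correct**k * (1 - rate_correct)**(n - k)
    like_e = comb(n, k) * rate_error**k * (1 - rate_error)**(n - k)
    return prior * like_c / (prior * like_c + (1 - prior) * like_e)

# Repetition raises confidence: 9-of-10 vs 1-of-10 matching sentences.
print(p_correct(9, 10))
print(p_correct(1, 10))
```

The qualitative behavior matches the hit-count table above: "cities such as Atlanta" (5860 hits) earns far more confidence than "cities such as Canada" (7 hits).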
Lesson 2: utilize the Web's redundancy

What about complex sentences?
• Paris, which has been proclaimed by many literary figures to be a great source of inspiration, is also a capital city, but not the capital city of an ordinary country, but rather the capital of a great republic - the republic of France!
• Paris is the Capital of France.
But What About…
• Relative clauses
• Disjunction & conjunction
• Anaphora
• Quantification
• Temporal qualification
• Counterfactuals
See Van Durme & Schubert '08; MacCartney '08

Lesson 3: focus on 'tractable' sentences
3. TextRunner (Web's 1st Open IE system)
1. Self-Supervised Learner: automatically labels example extractions & trains a CRF-based extractor
2. Single-Pass Extractor: makes a single pass over corpus, identifying extractions in 'tractable' sentences
3. Search Engine: indexes and ranks extractions based on redundancy
4. Query Processor: interactive speeds

Application: Information Fusion
• What kills bacteria?
• What west coast, nano-technology companies are hiring?
• Compare Obama's "buzz" versus Hillary's?
• What is a quiet, inexpensive, 4-star hotel in Vancouver?
TextRunner Demo
Extraction run at Google on 500,000,000 high-quality Web pages.
Sample of Web Pages
• Triples: 11.3 million
• With Well-Formed Relation: 9.3 million
• With Well-Formed Entities: 7.8 million
  • Abstract: 6.8 million (79.2% correct)
  • Concrete: 1.0 million (88.1% correct)
• Concrete facts: (Oppenheimer, taught at, Berkeley)
• Abstract facts: (fruit, contain, vitamins)
Lesson 4: Open IE yields a mass of isolated "nuggets"
Textual inference is the next step towards machine reading

4. Examples of Textual Inference
I. Entity and predicate resolution
II. Composing facts to draw conclusions
I. Entity Resolution
P(String1 = String2)?
Resolver (Yates & Etzioni, HLT '07): determines synonymy based on relations found by TextRunner (cf. Pantel & Lin '01)
• (X, born in, 1941)    (M, born in, 1941)
• (X, citizen of, US)   (M, citizen of, US)
• (X, friend of, Joe)   (M, friend of, Mary)
P(X = M) ~ shared relations

Relation Synonymy
• (1, R, 2)    (1, Q, 2)
• (2, R, 4)    (2, Q, 4)
• (4, R, 8)    (4, Q, 8)
• Etc.
P(R = Q) ~ shared argument pairs
• See paper for probabilistic model
• O(n log(n)) algorithm, n = |tuples|
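The shared-relations intuition can be sketched as counting overlapping (relation, other-argument) properties for two strings. Scoring by raw overlap is a simplification of Resolver's probabilistic model; the function names are illustrative.

```python
def properties(triples, entity):
    """Set of (relation, other-argument) pairs an entity occurs with."""
    props = set()
    for subj, rel, obj in triples:
        if subj == entity:
            props.add((rel, obj))
        elif obj == entity:
            props.add((subj, rel))
    return props

def shared_relation_score(triples, a, b):
    """Count properties the two candidate synonyms share."""
    return len(properties(triples, a) & properties(triples, b))

triples = [
    ("X", "born in", "1941"), ("M", "born in", "1941"),
    ("X", "citizen of", "US"), ("M", "citizen of", "US"),
    ("X", "friend of", "Joe"), ("M", "friend of", "Mary"),
]
# X and M share (born in, 1941) and (citizen of, US), but not the
# friend-of property, matching the slide's example.
print(shared_relation_score(triples, "X", "M"))
```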
Functional Relations help resolution
1. married-to(Oren, X)
2. married-to(Oren, Y)
If function(married-to) then P(X=Y) is high
• Caveat: Oren & Oren refer to the same individual

Lesson 4: leverage relations implicit in the text
Functional relations are particularly informative
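A minimal sketch of the functional-relation inference: if married-to maps each first argument to a unique spouse, then two extractions sharing a first argument make their second arguments merge candidates. The helper name and data are illustrative, and per the slide's caveat, this assumes the two "Oren" mentions denote the same person.

```python
def candidate_merges(triples, functional_relations):
    """Pairs of arguments that a functional relation suggests co-refer."""
    merges = []
    by_key = {}
    for subj, rel, obj in triples:
        if rel in functional_relations:
            # First value seen for (subj, rel) is kept as the reference.
            prev = by_key.setdefault((subj, rel), obj)
            if prev != obj:
                merges.append((prev, obj))
    return merges

triples = [("Oren", "married-to", "X"), ("Oren", "married-to", "Y")]
print(candidate_merges(triples, {"married-to"}))  # [('X', 'Y')]
```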
Returning to Our Caveat…
• BornIn(Oren, New York)
• BornIn(Oren, Maryland)
• But these are two different Orens. How can you tell?
• Web page context

Lesson 5: context enhances entity resolution
To date, our extraction process has been myopic
II. Compositional Inference is Key

Information Extraction:
• "The Tour de France is the world's largest cycle race." → Is-A(Tour de France, cycle race)
• "The Tour de France takes place in France." → In(Tour de France, France)
• "The illuminated helmet is used during sporting events such as cycle racing." → Is-A(cycle race, sporting event)

Sporting event in France? → Tour de France

Scalable Textual Inference (Schoenmackers, Etzioni, Weld EMNLP '08)
Desiderata for inference:
• In text → probabilistic inference
• On the Web → scalable (linear in |Corpus|)
Novel property of textual relations:
• Prevalent
• Provably linear
• Empirically linear!
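The Tour de France chain above can be reproduced with a toy join over extracted triples. This small forward-chaining closure is an illustration of compositional inference, not the Holmes engine.

```python
def find_instances(facts, cls, place):
    """Entities x with In(x, place) whose class reaches `cls` via Is-A."""
    isa = {(s, o) for s, r, o in facts if r == "Is-A"}
    located = {s for s, r, o in facts if r == "In" and o == place}

    # Transitive closure of Is-A (fine for small fact sets).
    closure = set(isa)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in isa:
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return sorted(x for x, y in closure if y == cls and x in located)

facts = [
    ("Tour de France", "Is-A", "cycle race"),
    ("Tour de France", "In", "France"),
    ("cycle race", "Is-A", "sporting event"),
]
# Composing Is-A(TdF, cycle race) with Is-A(cycle race, sporting event)
# and In(TdF, France) answers "sporting event in France?".
print(find_instances(facts, "sporting event", "France"))
```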
Inference Scalability for Holmes

Related Work
• Weld's Intelligence in Wikipedia project
• Sekine's "pre-emptive IE"
• Powerset
• Textual Entailment
• AAAI '07 Symposium on "Machine Reading"
• Growing body of work on IE from the Web
Lessons Reprised
• Scale & diversity of the Web → Open IE
• Redundancy helps accuracy!
• Focus on 'tractable' sentences
• Open IE(Web) = disjointed "nuggets"
  → Scalable inference is key
• Leverage relations implicit in the text
• Sentence-based IE → page-based IE

Directions for Future Work
• Characterize 'tractable' sentences
• Page-based IE
• Better entity resolution
• Integrate Open IE with ontologies
• Investigate other domains for Open IE (e.g., news)
Implications for Web Search
Imagine search systems that operate over a (more) semantic space
• Key words, documents → extractions
• TF-IDF, PageRank → entities, relations
• Web pages, links → Q/A, information fusion
Reading the Web → new Search Paradigm

[Diagram: the Information Food Chain for Machine Reading of the Web. A user asks "How is the ThinkPad T-40?"; Open IE over 10^10 Web pages of the World Wide Web answers "Found 2,300 reviews; 85% positive, key features are…"]
Machine Reading at Web Scale?
• Difficult, ungrammatical sentences → tractable sentences
• Unreliable information → redundancy (PageRank)
• Unanticipated objects & relations → Open Information Extraction
• Massive scale → linear-time inference
Holmes - KBMC
• Knowledge Bases (KBs):
    0.9 Prevents(Ca, osteo.)
    0.8 Contains(kale, Ca)
    …
• Inference Rules (Weighted Horn Clauses):
    1.5 P(X,Z) :- C(X,Y) ^ P(Y,Z)
• Query:
    Q(X) :- prevent(X, osteo.)
Chain backwards from the query to find proof trees, convert each proof tree into a Markov network, then run approximate probabilistic inference to produce answers.
• Assess candidate extraction E using Pointwise Mutual Information (PMI) between E and a "discriminator phrase" D
• PMI-IR (Turney '01): use hit counts for efficient PMI computation over the Web

PMI(Yakima, City) = |Hits(Yakima + City)| / |Hits(Yakima)|
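The hit-count ratio above is trivial to compute once the counts are in hand. The numbers below are the slide's own examples; the helper is a sketch, since a real PMI-IR system would issue live search-engine queries for the counts.

```python
def pmi(hits_e_and_d, hits_e):
    """PMI-IR score: hit count of E with discriminator D over E alone."""
    return hits_e_and_d / hits_e

# "Yakima city" (2,760 hits) vs "Yakima" (1,340,000 hits) ≈ 0.002
print(pmi(2760, 1_340_000))
# "Avocado city" (10 hits) vs "Avocado" (1,000,000 hits) = 0.00001
print(pmi(10, 1_000_000))
```

The two-orders-of-magnitude gap between the scores is what lets the discriminator separate true cities from non-cities.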
PMI In Action
1. E = "Yakima" (1,340,000 hits)
2. D = "city" (the discriminator)
3. E+D = "Yakima city" (2,760 hits)
4. PMI = 2760 / 1,340,000 = 0.002

• E = "Avocado" (1,000,000 hits)
• E+D = "Avocado city" (10 hits)
• PMI = 0.00001

Can IE achieve "adequate" precision/recall on the Web corpus?
Recall - Precision Curve (City, Country, USState)

High Precision for Binary Predicates
[Figure: precision vs. recall curves for CapitalOf and CeoOf, both axes 0 to 1.]
Methods for Improving Recall
• RL: learn class-specific patterns, e.g. "headquartered in <city>"
• SE: recursively extract subclasses
  "Scientists such as physicists, chemists, and biologists…"
• LE: extract lists of items & vote
• Bootstrapped from generic KnowItAll: no hand-labeled examples!

Results for City
Found 10,300 cities missing from the Tipster Gazetteer.
Sources of ambiguity
• Time: "Bush is the president" (in 2006!)
• Context: "common misconceptions…"
• Opinion: Who killed JFK?
• Multiple word senses: Amazon, Chicago, Chevy Chase, etc.

Critique of KnowItAll 1.0
• User query ("African cities") launches search engine queries → slow, limited
• Relations specified by user
• N relations → N passes over the data
Can we make a single pass over the Web corpus?
Extractor Overview (Banko & Etzioni, '08)
1. Use a simple model of relationships in English to label extractions
2. Bootstrap a general model of relationships in English sentences, encoded as a CRF
3. Decompose each sentence into one or more (NP1, VP, NP2) "chunks"
4. Use the CRF model to retain relevant parts of each NP and VP.
The extractor is relation-independent!
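Steps 3 and 4 can be illustrated with a rule-based trimmer over a single (NP1, VP, NP2) chunk. The stop-word lists below stand in for the CRF's learned notion of "relevant parts" and are purely illustrative.

```python
# Words to drop when trimming each chunk to its core (illustrative).
DROP_FROM_NP = {"the", "a", "an", "famous", "giant"}
DROP_FROM_VP = {"recently", "reportedly", "officially"}

def extract_triple(np1, vp, np2):
    """Trim an (NP1, VP, NP2) chunk to a relational triple."""
    def trim(phrase, stop):
        return " ".join(w for w in phrase.split() if w.lower() not in stop)
    return (trim(np1, DROP_FROM_NP), trim(vp, DROP_FROM_VP),
            trim(np2, DROP_FROM_NP))

# ("The famous Bill Gates", "recently founded", "the giant Microsoft")
# reduces to the triple (Bill Gates, founded, Microsoft).
print(extract_triple("The famous Bill Gates", "recently founded",
                     "the giant Microsoft"))
```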
Fundamental AI Questions
• How to accumulate massive amounts of knowledge?
• Can a program read and "understand" what it's read?
Massive Information Extraction from the Web
[Diagram: the Information Food Chain, with KnowItAll over the World Wide Web. A user asks "How is the ThinkPad T-40?"; stats over 10^10 web pages answer "Found 2,300 reviews; 85% positive."]