k-best Parsing
Liang Huang (Penn)

[Figure: the 1-best and 2nd-best parse trees of "I saw a boy with a telescope.", differing in where the PP "with a telescope" attaches.]
Not a trivial task...
I saw her duck.   [photo: Aravind Joshi]
I saw her duck.
How about "I saw her duck with a telescope."?

I saw her duck with a telescope.
Another Example
I eat sushi with tuna.

Or even...
I eat sushi with tuna.
Why k-best?
• postpone disambiguation in a pipeline
  – the 1-best is not always optimal later in the pipeline
  – propagate k-best lists instead of the 1-best
  – e.g., a semantic role labeler uses k-best parses
• approximate the set of all possible interpretations
  – reranking (Collins, 2000)
  – minimum error training (Och, 2003)
  – online training (McDonald et al., 2005)
[Figure: pipeline — part-of-speech tagging → (lattice) → syntactic parsing → (k-best parses) → semantic interpretation]
Dynamic Programming: k-Best Viterbi Algorithm?
1. topological sort
2. visit each vertex v in sorted order and do updates
   • for each incoming edge (u, v) in E,
     use d(u) to update d(v): d(v) ⊕= d(u) ⊗ w(u, v)
   • key observation: d(u) is already fixed to its optimal value at this time
3. time complexity: O(V + E)
naive k-best: O(k E);  good k-best: O(E + k log k V)
Parsing as Deduction
• parsing with context-free grammars (CFGs)
  – dynamic programming (the CKY algorithm)
• deduction step: (B, i, j)  (C, j+1, k)  ⊢  (A, i, k)   for a rule A → B C
  example: (NP, 1, 3)  (VP, 4, 6)  ⊢  (S, 1, 6)   for the rule S → NP VP
• computational complexity: O(n³ |P|), where P is the set of productions (rules)
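The deduction step above can be sketched as a minimal probabilistic CKY recognizer. This is an illustrative sketch, not the parser used in the talk; the rule/lexicon dictionary format and the function name `cky` are my own choices.

```python
# best[i][k][A] holds the best inside probability of nonterminal A
# spanning words i..k (inclusive, 0-indexed).

def cky(words, lexicon, rules):
    """lexicon: {(A, word): prob}; rules: {(A, B, C): prob} for A -> B C."""
    n = len(words)
    best = [[{} for _ in range(n)] for _ in range(n)]
    # base case: fill in the preterminals
    for i, w in enumerate(words):
        for (A, word), p in lexicon.items():
            if word == w and p > best[i][i].get(A, 0.0):
                best[i][i][A] = p
    # deduction step: (B, i, j) and (C, j+1, k) yield (A, i, k)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span - 1
            for j in range(i, k):
                for (A, B, C), p in rules.items():
                    if B in best[i][j] and C in best[j + 1][k]:
                        cand = p * best[i][j][B] * best[j + 1][k][C]
                        if cand > best[i][k].get(A, 0.0):
                            best[i][k][A] = cand
    return best[0][n - 1]
```

The triple loop over (i, j, k) plus the loop over rules gives exactly the O(n³ |P|) complexity stated above.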
Deduction => Hypergraph
• a hypergraph is a generalization of a graph
  – each hyperedge connects several vertices (tails) to one vertex (head)
  – e.g., the deduction of (S, 1, 6) from (NP, 1, 3) and (VP, 4, 6) is one hyperedge; the deduction of (VP, 4, 6) from (VB, 3, 3) and (PP, 4, 6) is another
Packed Forest as Hypergraph
• a packed forest is a compact representation of all parse trees
[Figure: packed forest for "I saw a boy with a telescope", sharing the NP, PP, and VP subtrees between the two PP attachments.]
Weighted Deduction/Hypergraph
• weighted deduction step: (B, i, j): p  (C, j+1, k): q  ⊢  (A, i, k): f(p, q)   for a rule A → B C
• f is the weight function
  e.g., in probabilistic context-free grammars: f(p, q) = p · q · Pr(A → B C)
Monotonic Weight Functions
• all weight functions must be monotonic in each of their arguments:
  if b' ≤ b, then f(b', c) ≤ f(b, c)
• this gives the optimal-subproblem property needed for dynamic programming
• CKY example: A = (S, 1, 5), B = (NP, 1, 2), C = (VP, 3, 5), f(b, c) = b · c · Pr(S → NP VP)
k-best Problem in Hypergraphs
• 1-best problem: find the best derivation of the target vertex t
• k-best problem: find the top k derivations of the target vertex t
  (in CKY, t = (S, 1, n))
• assumption: the hypergraph is acyclic, so that we can use a topological order
Outline
• Formulations
• Algorithms
  – generic 1-best Viterbi algorithm
  – Algorithm 0: naïve
  – Algorithm 1: hyperedge-level
  – Algorithm 2: vertex (item)-level
  – Algorithm 3: lazy algorithm
• Experiments
• Applications to Machine Translation
Generic 1-best Viterbi Algorithm
• traverse the hypergraph in topological order ("bottom-up")
  – for each vertex, for each incoming hyperedge:
    • compute the result of the f function along the hyperedge
    • update the 1-best value for the current vertex if possible
  – e.g., for hyperedges with tails (u: a, w: b) and (u': c, w': d):
    v: better(better(f_1(a, b), f_2(c, d)), ...)
• overall time complexity: O(|E|); in CKY, |E| = O(n³ |P|)
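The generic traversal above can be sketched directly on a hypergraph. This is a minimal sketch assuming weights are probabilities to maximize and source vertices get value 1.0; the representation of `incoming` hyperedges is my own.

```python
def viterbi_1best(topo_vertices, incoming):
    """topo_vertices: vertices in topological order.
    incoming[v]: list of (tails, f) hyperedges, where f combines tail values."""
    d = {}
    for v in topo_vertices:
        if not incoming[v]:          # source (axiom/leaf) vertex
            d[v] = 1.0
            continue
        # tails come earlier in topological order, so each d[u] is
        # already fixed to its optimal value -- the key observation above
        d[v] = max(f(*[d[u] for u in tails]) for tails, f in incoming[v])
    return d
```

Each hyperedge is touched exactly once, giving the O(|E|) bound stated on the slide.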
Dynamic Programming: 1950s
Richard Bellman, Andrew Viterbi
"We knew everything so far in your talk 40 years ago."
k-best Viterbi Algorithm 0: naïve
• straightforward k-best extension:
  – a vector of length k instead of a single value, with components kept sorted
  – now what is f(a, b)?
    • k² values — the Cartesian product f(a_i, b_j)
    • we just need the top k out of the k² values: mult_k(f, a, b) = top_k { f(a_i, b_j) }
  – and how to update?
    • from two k-length vectors (2k elements), select the top k elements: O(k)
    • v: merge_k(mult_k(f_1, a, b), mult_k(f_2, c, d))
• overall time complexity: O(k² |E|)
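The two primitives of the naïve algorithm can be sketched in a few lines. A minimal sketch assuming weights are probabilities to maximize; the names mult_k and merge_k follow the slides, but this Python rendering is mine.

```python
import heapq

def mult_k(f, a, b, k):
    """Top k of f(a_i, b_j) over the full Cartesian product: enumerates
    all k^2 combinations, which is exactly Algorithm 0's bottleneck."""
    return heapq.nlargest(k, (f(x, y) for x in a for y in b))

def merge_k(xs, ys, k):
    """Top k elements out of two k-best vectors (2k elements)."""
    return heapq.nlargest(k, xs + ys)
```

With the demo vectors a = (.6, .4, .3, .3) and b = (.5, .4, .3, .1) and f(a, b) = ab, mult_k returns .30, .24, .20 as its top three, matching the grid in the demo slides.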
Algorithm 1: speed up mult_k
mult_k(f, a, b) = top_k { f(a_i, b_j) }
• we are only interested in the top k — why enumerate all k²?
• a and b are sorted, and f is monotonic, so:
  – f(a_1, b_1) must be the 1-best
  – the 2nd-best must be either f(a_2, b_1) or f(a_1, b_2)
  – what about the 3rd-best?
Algorithm 1 (Demo), with f(a, b) = ab
[Figure: grid of products of a_i = (.6, .4, .3, .3) and b_j = (.5, .4, .3, .1); the cells .30, .24, .20, .18, .16, .15, ... are filled in best-first from the top-left corner.]
• use a priority queue (heap) to store the candidates (the frontier)
• in each iteration:
  1. extract-max from the heap
  2. push the two "shoulders" of the extracted cell into the heap
• k iterations; O(k log k |E|) overall time
Algorithm 2: speed up merge_k
• Algorithm 1 works on each hyperedge sequentially
• can we process all incoming hyperedges of a vertex simultaneously?
Algorithm 2 (Demo), with k = 2 and d = 3 incoming hyperedges (B1 × C1, B2 × C2, B3 × C3)
• start with an item-level heap holding the 1-best derivation from each hyperedge (here .42, .36, .32)
• pop the best (.42) and push its two successors (.07 and .24)
• pop the 2nd-best (.36) — done for k = 2
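The item-level heap in the demo can be sketched as follows. A minimal sketch assuming each hyperedge's tail k-best lists are already computed and sorted descending; the function kbest_vertex and the (edge, index-tuple) encoding of derivations are my own.

```python
import heapq

def kbest_vertex(hyperedges, k):
    """hyperedges: list of (f, tails), where tails is a tuple of sorted
    (descending) k-best lists, one per tail of the hyperedge."""
    heap, seen = [], set()

    def push(e, idx):
        f, tails = hyperedges[e]
        if (e, idx) not in seen and all(i < len(t) for i, t in zip(idx, tails)):
            seen.add((e, idx))
            val = f(*[t[i] for i, t in zip(idx, tails)])
            heapq.heappush(heap, (-val, e, idx))

    # seed the heap with the 1-best derivation from each hyperedge
    for e in range(len(hyperedges)):
        push(e, (0,) * len(hyperedges[e][1]))
    out = []
    while heap and len(out) < k:
        neg, e, idx = heapq.heappop(heap)
        out.append(-neg)
        for pos in range(len(idx)):   # successors: advance one tail index
            push(e, idx[:pos] + (idx[pos] + 1,) + idx[pos + 1:])
    return out
```

The heap never holds more than O(d + k) candidates for a vertex with d incoming hyperedges, which is where the |V| k log k term in Algorithm 2's complexity comes from.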
Algorithm 3: Offline (lazy)
• from Algorithm 0 to Algorithm 2: delaying the calculations until needed — lazier, with larger locality
• even lazier (one step further): we are interested in the k-best derivations of the final item only!

• forward phase
  – do a normal 1-best search up to the final item
  – but construct the hypergraph (forest) along the way
• recursive backward phase
  – ask the final item: what's your 2nd-best?
  – the final item propagates this question down to the leaves
  – then ask the final item: what's your 3rd-best? ...
Algorithm 3 demo (k = 2)
• after the "forward" step (1-best parsing), the forest records the 1-best derivation from each hyperedge of S(1, 7): .20 via NP(1, 2) VP(3, 7), .42 via NP(1, 3) VP(4, 7), .28 via VP(1, 5) NP(6, 7); the 1-best is .42
• now the backward step: what's your 2nd-best, S(1, 7)?
  – "I'm not sure... let me ask my parents" — it must be one of a few candidates, but their values (the ?'s) are not known yet
  – "oops... forgot to ask more questions recursively" — the recursion goes on down to the leaf nodes, which report back the numbers (e.g., 2nd-bests 0.5 and 0.3 along the winning hyperedge)
  – push the resulting candidates .30 (= 0.5 × 0.6) and .21 (= 0.7 × 0.3) onto the candidate heap (priority queue)
  – pop the root of the heap (.30): "now I know my 2nd-best"
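The recursive backward phase can be sketched end to end. This is a simplified sketch of the lazy idea, not the paper's implementation: it tracks only derivation values (a real implementation stores backpointers to recover trees), and the toy forest representation, where leaves are absent from the dict and score 1.0, is my own.

```python
import heapq

def lazy_kbest(forest, root, k):
    """forest[v]: list of (f, tails) hyperedges; vertices absent from
    forest are leaves with a single derivation of value 1.0."""
    cand, dlist, seen = {}, {}, {}

    def nth(v, j):
        """Return the j-th best (0-indexed) value of v, or None."""
        if v not in forest or not forest[v]:          # leaf vertex
            return 1.0 if j == 0 else None
        if v not in dlist:                            # first question asked of v
            dlist[v], cand[v], seen[v] = [], [], set()
            for e in range(len(forest[v])):           # seed: 1-best per hyperedge
                push(v, e, (0,) * len(forest[v][e][1]))
        while len(dlist[v]) <= j and cand[v]:         # answer lazily, on demand
            neg, e, idx = heapq.heappop(cand[v])
            dlist[v].append(-neg)
            for pos in range(len(idx)):               # successors of the popped one
                push(v, e, idx[:pos] + (idx[pos] + 1,) + idx[pos + 1:])
        return dlist[v][j] if j < len(dlist[v]) else None

    def push(v, e, idx):
        if (e, idx) in seen[v]:
            return
        seen[v].add((e, idx))
        f, tails = forest[v][e]
        vals = [nth(u, i) for u, i in zip(tails, idx)]  # recursive questions
        if all(x is not None for x in vals):
            heapq.heappush(cand[v], (-f(*vals), e, idx))

    return [x for x in (nth(root, j) for j in range(k)) if x is not None]
```

Vertices that the top-k derivations of the root never touch are never asked anything beyond the forward 1-best pass, which is exactly the source of Algorithm 3's speedup.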
Interesting Properties
• the 1-best is best everywhere (all decisions optimal)
• the 2nd-best is optimal everywhere except one decision
  – that decision must be a 2nd-best
  – and it is the best of all 2nd-best decisions
• so what about the 3rd-best? the kth-best is...
[Figure: local picture — the a_i × b_j product grid again, showing which cells each rank of derivation can occupy.]
Summary of Algorithms

  Algorithm       Time Complexity        Locality
  1-best          O(|E|)                 hyperedge
  alg. 0: naïve   O(k^a |E|)             hyperedge (mult_k)
  alg. 1          O(k log k |E|)         hyperedge (mult_k)
  alg. 2          O(|E| + |V| k log k)   item (merge_k)
  alg. 3: lazy    O(|E| + |D| k log k)   global

for CKY: a = 2, |E| = O(n³ |P|), |V| = O(n² |N|), |D| = O(n), where a is the arity of the grammar
Outline
• Formulations
• Algorithms: Alg. 0 through Alg. 3
• Experiments
  – Collins/Bikel parser
  – both efficiency and accuracy
• Applications in Machine Translation
Background: Statistical Parsing
• probabilistic grammar, induced from a treebank (the Penn Treebank)
• state-of-the-art parsers: Collins (1999), Bikel (2004), Charniak (2000), etc.
• evaluation of accuracy — PARSEVAL: tree similarity (English treebank: ~90%)
• previous work on k-best parsing:
  – Collins (2000): turn off dynamic programming
  – Charniak/Johnson (2005): coarse-to-fine, still too slow
Efficiency
• implemented Algorithms 0, 1, and 3 on top of the Collins/Bikel parser
[Figure: average (wall-clock) time per sentence on the Penn Treebank as a function of k; the curve for the lazy algorithm is labeled O(|E| + |D| k log k).]
Oracle Reranking
given k parses of a sentence:
• oracle reranking: pick the best parse according to the gold standard
• real reranking: pick the best parse according to the score function
[Figure: accuracy scale — the gold-standard "correct" parse at 100%; the k-best parses range over 96%, 91%, 89%, ..., 78%, with the 1-best at 89% and the oracle picking the best of the list.]
Quality of the k-best lists
[Figure: oracle accuracy versus k — this work on top of the Collins parser, compared with Collins (2000).]
Why are our k-best lists better?
[Figure: average number of parses for sentences of a given length — this work versus Collins (2000); the region where Collins (2000) produces fewer parses covers 7% of the test set.]
• Collins (2000) turns down dynamic programming:
  – theoretically exponential time complexity
  – aggressive beam pruning to make it tractable in practice
Dynamic Programming: k-Best Viterbi Algorithm? (recap)
1. topological sort
2. visit each vertex v in sorted order and do updates
   • for each incoming edge (u, v) in E,
     use d(u) to update d(v): d(v) ⊕= d(u) ⊗ w(u, v)
   • key observation: d(u) is fixed to optimal at this time
3. time complexity: O(V + E)
Algorithm 0: O(k E);  Algorithm 2: O(E + k log k V);  Algorithm 3: O(E + k log k D)
Conclusions
• monotonic hypergraph formulation; the k-best derivations problem
• k-best algorithms: Algorithm 0 (naïve) to Algorithm 3 (lazy)
• experimental results: efficiency, and accuracy (effectively searching over a larger space)
• applications in machine translation: k-best rescoring and forest rescoring
Implemented in ...
• state-of-the-art statistical parsers
  – Charniak parser (2005); Berkeley parser (2006)
  – McDonald et al. dependency parser (2005)
  – Microsoft Research (Redmond) dependency parser (2006)
• generic dynamic programming languages/packages
  – Dyna (Eisner et al., 2005) and Tiburon (May and Knight, 2006)
• state-of-the-art syntax-based translation systems
  – Hiero (Chiang, 2005)
  – ISI syntax-based system (2005)
  – CMU syntax-based system (2006)
  – BBN syntax-based system (2007)
Applications in Machine Translation
Syntax-based Translation
• synchronous context-free grammars (SCFGs)
  – generate pairs of strings/trees simultaneously
  – co-indexed nonterminals are further rewritten as a unit
• example rules:
  S → NP(1) VP(2) , NP(1) VP(2)
  VP → PP(1) VP(2) , VP(2) PP(1)
  NP → Baoweier , Powell
  ...
[Figure: paired trees for "Baoweier yu Shalong juxing le huitan" / "Powell held a meeting with Sharon", with the PP ("yu Shalong" / "with Sharon") and VP swapped across the two languages.]
Translation as Parsing
• translation ("decoding") => monolingual parsing
  – parse the source input with the source projection of the SCFG
  – build the corresponding target-side sub-strings in parallel
[Figure: parsing "Baoweier yu Shalong juxing le huitan" bottom-up — first NP, PP, and the inner VP (with weights 1.0, 1.5, 2.0), then the outer VP (3.0); the target side is assembled in parallel, reordering "Powell" + "with Sharon" + "held a talk" into "Powell held a talk with Sharon".]
Language Model: Rescoring
[Figure: the classic pipeline — statistical analysis of Spanish/English bilingual text yields the translation model (TM, "competency"; here a synchronous CFG), and statistical analysis of English text yields the language model (LM, "fluency"; here an n-gram LM).]
• k-best rescoring: the decoder produces k candidates with TM costs; the LM then re-scores them
  e.g., for "Que hambre tengo yo":
    TM 3.7, LM 5.4: What hunger have I
    TM 3.8, LM 7.0: Hungry I am so
    TM 4.1, LM 9.8: Have I that hunger
    TM 4.5, LM 0.2: I am so hungry   ← selected
    TM 7.2, LM 8.7: How hunger have I
    ...
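The k-best rescoring step above amounts to a one-line re-ranking. A minimal sketch assuming higher scores are better (the slide's TM/LM numbers are costs, so they are negated here); the toy dictionary-backed lm_score stands in for a real n-gram LM.

```python
def rescore(kbest, lm_score, lm_weight=1.0):
    """kbest: list of (translation, tm_score); pick the candidate with the
    best combined TM + weighted LM score."""
    return max(kbest, key=lambda cand: cand[1] + lm_weight * lm_score(cand[0]))
```

With the slide's numbers (costs negated into scores), "I am so hungry" wins despite its worse TM score, because its LM score is by far the best.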
k-best Rescoring Results
• the ISI syntax-based translation system
  – currently the best-performing system on the Chinese-to-English task in NIST evaluations
  – based on synchronous grammars
• BLEU scores:
  – translation model (TM) only: 24.45
  – rescoring with a trigram LM on a 25000-best list: 34.58
Language Model: Integration (Knight and Koehn, 2003)
• instead of rescoring a k-best list, integrate the n-gram LM directly into the decoder
  – an integrated (LM-integrated) decoder translates "Que hambre tengo yo" straight to "I am so hungry"
  – but this is computationally challenging! ☹
Integrated Decoding Results (see my second talk for details!)
• the ISI syntax-based translation system, BLEU scores:
  – translation model (TM) only: 24.45
  – rescoring with a trigram LM on a 25000-best list: 34.58
  – trigram integrated decoding: 38.44 — but over 100 times slower!
Rescoring vs. Integration
[Figure: quality/speed trade-off — rescoring is fast but poor; integration is good but slow.]
• any compromise (on-the-fly rescoring)?
• yes, forest rescoring: almost as fast as rescoring, and almost as good as integration
Why Is Integration Slow?

• split each node into +LM items (with boundary words), e.g. for the hyperedge VP1,6 → PP1,3 VP3,6:
  PP1,3 items: (with ... Sharon), (along ... Sharon), (with ... Shalong)
  VP3,6 items: (held ... talk), (held ... meeting), (hold ... talks)
  VP1,6 items: (hold ... Sharon), (held ... Shalong), (hold ... Shalong)
• beam search: only keep the top-k +LM items at each node
• but there are many ways to derive each node
• can we avoid enumerating all combinations?
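The combination cost the slide warns about can be made concrete with a small sketch (the function name `combine_naive` and the `lm_cost` argument are illustrative, not the talk's actual implementation): every pair of antecedent +LM items is scored before pruning back to the beam.

```python
from itertools import product

def combine_naive(left_items, right_items, lm_cost, k):
    """Naive +LM combination at one hyperedge: score every pair of
    antecedent +LM items (cost, boundary-words), then keep only the
    top-k.  With beams of size k on both antecedents this enumerates
    O(k^2) combinations per hyperedge -- the source of the slowdown."""
    candidates = []
    for (lcost, lwords), (rcost, rwords) in product(left_items, right_items):
        candidates.append((lcost + rcost + lm_cost(lwords, rwords),
                           lwords, rwords))
    candidates.sort()
    return candidates[:k]
```

Cube pruning, introduced later in the talk, avoids touching most of these pairs.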
Forest Rescoring (k-best parsing Algorithm 2)

• with LM cost, we can only do k-best approximately
• process all hyperedges of a node simultaneously, e.g. VP1,6 ← {PP1,3 VP3,6; PP1,4 VP4,6; NP1,4 VP4,6}
• significant savings of computation
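One way to picture "process all hyperedges simultaneously" is a single shared frontier over all incoming hyperedges of a node; this is only a sketch of the idea, not the paper's Algorithm 2 (the name `node_k_best` and the input format are made up), assuming each hyperedge offers its combination costs in roughly sorted order:

```python
import heapq

def node_k_best(edge_candidates, k):
    """Seed one priority queue with the best candidate of every
    incoming hyperedge and pop k items total, so computation
    concentrates on promising hyperedges instead of running a full
    per-hyperedge enumeration.  `edge_candidates` maps a hyperedge
    id to its sorted list of combination costs."""
    heap = [(costs[0], edge, 0)
            for edge, costs in edge_candidates.items() if costs]
    heapq.heapify(heap)
    results = []
    while heap and len(results) < k:
        cost, edge, i = heapq.heappop(heap)
        results.append((cost, edge))
        costs = edge_candidates[edge]
        if i + 1 < len(costs):  # advance this hyperedge's frontier
            heapq.heappush(heap, (costs[i + 1], edge, i + 1))
    return results
```

A weak hyperedge whose best candidate never reaches the top of the queue contributes nothing beyond its single seed entry.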
Forest Rescoring Results

• on the Hiero system (Chiang, 2005): ~10-fold speed-up at the same level of BLEU
• on my syntax-directed system (Huang et al., 2006): ~10-fold speed-up at the same level of search error
• on a typical phrase-based system (Pharaoh): ~30-fold speed-up at the same level of search error; ~100-fold speed-up at the same level of BLEU
• also used in NIST evals by the ISI, CMU, and BBN syntax-based systems
• see my ACL '07 paper for details
Conclusions

• monotonic hypergraph formulation; the k-best derivations problem
• k-best algorithms: Algorithm 0 (naïve) to Algorithm 3 (lazy)
• experimental results: efficiency; accuracy (effectively searching over a larger space)
• applications in machine translation: k-best rescoring and forest rescoring
Thank you!! Questions? Comments?

• Liang Huang and David Chiang (2005). Better k-best Parsing. In Proceedings of IWPT, Vancouver, B.C.
• Liang Huang and David Chiang (2007). Forest Rescoring: Faster Decoding with Integrated Language Models. In Proceedings of ACL, Prague, Czech Rep.
Quality of the k-best Lists

(figure: Collins, 2000)
Syntax-based Experiments
Tree-to-String System

• syntax-directed, English to Chinese (Huang, Knight, Joshi, 2006); the reverse direction is found in (Liu et al., 2006)
• synchronous tree-substitution grammars (STSG) (Galley et al., 2004; Eisner, 2003); related to STAG (Shieber/Schabes, 90)
• example rule: the English VP "(VBD was) (VP-C (VBN shot) (PP (TO to) (NP-C (NN death))) (PP (IN by) (NP-C (DT the) (NN police))))" paired with its Chinese bei-construction translation
• tested on 140 sentences; slightly better BLEU scores than Pharaoh
Speed vs. Search Quality (figure)
Speed vs. Translation Accuracy (figure)
Cube Pruning

VP1,6 ← PP1,3 VP3,6: is the grid of combined costs monotonic?

                           (with ⋆ Sharon, 1.0)   (along ⋆ Sharon, 3.0)   (with ⋆ Shalong, 8.0)
(held ⋆ meeting, 1.0)              2.0                     4.0                     9.0
(held ⋆ talk, 1.1)                 2.1                     4.1                     9.1
(hold ⋆ conference, 3.5)           4.5                     6.5                    11.5
Cube Pruning

non-monotonic grid due to LM combination costs, e.g. for the bigram (meeting, with):

                           (with ⋆ Sharon, 1.0)   (along ⋆ Sharon, 3.0)   (with ⋆ Shalong, 8.0)
(held ⋆ meeting, 1.0)           2.0 + 0.5               4.0 + 5.0               9.0 + 0.5
(held ⋆ talk, 1.1)              2.1 + 0.3               4.1 + 5.4               9.1 + 0.3
(hold ⋆ conference, 3.5)        4.5 + 0.6               6.5 + 10.5             11.5 + 0.6
Cube Pruning

the same grid with the LM combination costs summed in; no longer monotonic along rows or columns:

                           (with ⋆ Sharon, 1.0)   (along ⋆ Sharon, 3.0)   (with ⋆ Shalong, 8.0)
(held ⋆ meeting, 1.0)              2.5                     9.0                     9.5
(held ⋆ talk, 1.1)                 2.4                     9.5                     9.4
(hold ⋆ conference, 3.5)           5.1                    17.0                    12.1
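The grid above is just elementwise sums; a tiny helper (the name `combined_grid` is hypothetical) reproduces it from the antecedent costs and the LM combination costs:

```python
def combined_grid(row_costs, col_costs, combo_costs):
    """Entry (i, j) of the grid is row_costs[i] + col_costs[j] +
    combo_costs[i][j].  Without the combo term the grid is monotone
    non-decreasing along every row and column (both cost lists are
    sorted); the LM combination costs break that property."""
    return [[r + c + combo_costs[i][j]
             for j, c in enumerate(col_costs)]
            for i, r in enumerate(row_costs)]
```

Plugging in the slide's numbers (rows 1.0, 1.1, 3.5; columns 1.0, 3.0, 8.0; combo costs 0.5, 5.0, 0.5, ...) reproduces the grid shown, and the first column's 2.4 below 2.5 is exactly the non-monotonicity cube pruning must cope with.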
Cube Pruning (k-best parsing Algorithm 1)

• maintain a priority queue of candidates over the grid
• repeatedly extract the best candidate and push its two successors (the right and down neighbors)
• items are popped out of order (here: 2.5, 2.4, 5.1) because the grid is non-monotonic
• solution: keep a buffer of pop-ups; finally re-sort the buffer and return in order: 2.4, 2.5, 5.1
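The steps above (priority queue, extract best, push the two successors, buffer, re-sort) can be sketched as follows; a minimal illustration assuming both antecedent lists are sorted by their own cost, with `lm_cost` a stand-in for the real LM combination cost:

```python
import heapq

def cube_pruning(left, right, lm_cost, k):
    """Best-first enumeration over the grid of antecedent
    combinations at one hyperedge.  Because the LM combination cost
    makes the grid non-monotonic, items can pop slightly out of
    order; we collect k of them in a buffer and re-sort at the end,
    so the result is approximately (not exactly) the k best."""
    def score(i, j):
        (lc, lw), (rc, rw) = left[i], right[j]
        return lc + rc + lm_cost(lw, rw)

    heap = [(score(0, 0), 0, 0)]   # priority queue of candidates
    seen = {(0, 0)}
    buffer = []
    while heap and len(buffer) < k:
        cost, i, j = heapq.heappop(heap)          # extract the best
        buffer.append((cost, left[i][1], right[j][1]))
        for ni, nj in ((i + 1, j), (i, j + 1)):   # push two successors
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (score(ni, nj), ni, nj))
    buffer.sort()   # re-sort the buffer and return in order
    return buffer
```

With a constant `lm_cost` the grid is monotonic and the pops already come out in order; a non-constant `lm_cost` is what produces the 2.5-before-2.4 behavior on the slide.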