k-best Parsing
Liang Huang (Penn)

[Figure: the 1-best and 2nd-best parse trees of "I saw a boy with a telescope.", differing in where the PP "with a telescope" attaches.]
Not a trivial task...
I saw her duck.   [photo: Aravind Joshi]
I saw her duck.
How about "I saw her duck with a telescope."?

I saw her duck with a telescope.
Another Example
I eat sushi with tuna.

Or even...
I eat sushi with tuna.
Why k-best?
• postpone disambiguation in a pipeline
  – the 1-best is not always optimal later in the pipeline
  – propagate k-best lists instead of the 1-best
  – e.g., a semantic role labeler uses k-best parses
• approximate the set of all possible interpretations
  – reranking (Collins, 2000)
  – minimum error training (Och, 2003)
  – online training (McDonald et al., 2005)
[Figure: pipeline — part-of-speech tagging → (lattice) → syntactic parsing → (k-best parses) → semantic interpretation]
Dynamic Programming: k-Best Viterbi Algorithm?
1. topological sort
2. visit each vertex v in sorted order and do updates
   • for each incoming edge (u, v) in E,
     use d(u) to update d(v): d(v) ⊕= d(u) ⊗ w(u, v)
   • key observation: d(u) is already fixed to its optimal value at this time
3. time complexity: O(V + E)
naive k-best: O(k E);  good k-best: O(E + k log k V)
Parsing as Deduction
• parsing with context-free grammars (CFGs)
  – dynamic programming (the CKY algorithm)
• deduction step: (B, i, j)  (C, j+1, k)  ⊢  (A, i, k)   for a rule A → B C
  example: (NP, 1, 3)  (VP, 4, 6)  ⊢  (S, 1, 6)   for the rule S → NP VP
• computational complexity: O(n³ |P|), where P is the set of productions (rules)
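The deduction step above can be sketched as a minimal probabilistic CKY recognizer. This is an illustrative sketch, not the parser used in the talk; the rule/lexicon dictionary format and the function name `cky` are my own choices.

```python
# best[i][k][A] holds the best inside probability of nonterminal A
# spanning words i..k (inclusive, 0-indexed).

def cky(words, lexicon, rules):
    """lexicon: {(A, word): prob}; rules: {(A, B, C): prob} for A -> B C."""
    n = len(words)
    best = [[{} for _ in range(n)] for _ in range(n)]
    # base case: fill in the preterminals
    for i, w in enumerate(words):
        for (A, word), p in lexicon.items():
            if word == w and p > best[i][i].get(A, 0.0):
                best[i][i][A] = p
    # deduction step: (B, i, j) and (C, j+1, k) yield (A, i, k)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span - 1
            for j in range(i, k):
                for (A, B, C), p in rules.items():
                    if B in best[i][j] and C in best[j + 1][k]:
                        cand = p * best[i][j][B] * best[j + 1][k][C]
                        if cand > best[i][k].get(A, 0.0):
                            best[i][k][A] = cand
    return best[0][n - 1]
```

The triple loop over (i, j, k) plus the loop over rules gives exactly the O(n³ |P|) complexity stated above.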
Deduction => Hypergraph
• a hypergraph is a generalization of a graph
  – each hyperedge connects several vertices (tails) to one vertex (head)
  – e.g., the deduction of (S, 1, 6) from (NP, 1, 3) and (VP, 4, 6) is one hyperedge; the deduction of (VP, 4, 6) from (VB, 3, 3) and (PP, 4, 6) is another
Packed Forest as Hypergraph
• a packed forest is a compact representation of all parse trees
[Figure: packed forest for "I saw a boy with a telescope", sharing the NP, PP, and VP subtrees between the two PP attachments.]
Weighted Deduction/Hypergraph
• weighted deduction step: (B, i, j): p  (C, j+1, k): q  ⊢  (A, i, k): f(p, q)   for a rule A → B C
• f is the weight function
  e.g., in probabilistic context-free grammars: f(p, q) = p · q · Pr(A → B C)
Monotonic Weight Functions
• all weight functions must be monotonic in each of their arguments:
  if b' ≤ b, then f(b', c) ≤ f(b, c)
• this gives the optimal-subproblem property needed for dynamic programming
• CKY example: A = (S, 1, 5), B = (NP, 1, 2), C = (VP, 3, 5), f(b, c) = b · c · Pr(S → NP VP)
k-best Problem in Hypergraphs
• 1-best problem: find the best derivation of the target vertex t
• k-best problem: find the top k derivations of the target vertex t
  (in CKY, t = (S, 1, n))
• assumption: the hypergraph is acyclic, so that we can use a topological order
Outline
• Formulations
• Algorithms
  – generic 1-best Viterbi algorithm
  – Algorithm 0: naïve
  – Algorithm 1: hyperedge-level
  – Algorithm 2: vertex (item)-level
  – Algorithm 3: lazy algorithm
• Experiments
• Applications to Machine Translation
Generic 1-best Viterbi Algorithm
• traverse the hypergraph in topological order ("bottom-up")
  – for each vertex, for each incoming hyperedge:
    • compute the result of the f function along the hyperedge
    • update the 1-best value for the current vertex if possible
  – e.g., for hyperedges with tails (u: a, w: b) and (u': c, w': d):
    v: better(better(f_1(a, b), f_2(c, d)), ...)
• overall time complexity: O(|E|); in CKY, |E| = O(n³ |P|)
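The generic traversal above can be sketched directly on a hypergraph. This is a minimal sketch assuming weights are probabilities to maximize and source vertices get value 1.0; the representation of `incoming` hyperedges is my own.

```python
def viterbi_1best(topo_vertices, incoming):
    """topo_vertices: vertices in topological order.
    incoming[v]: list of (tails, f) hyperedges, where f combines tail values."""
    d = {}
    for v in topo_vertices:
        if not incoming[v]:          # source (axiom/leaf) vertex
            d[v] = 1.0
            continue
        # tails come earlier in topological order, so each d[u] is
        # already fixed to its optimal value -- the key observation above
        d[v] = max(f(*[d[u] for u in tails]) for tails, f in incoming[v])
    return d
```

Each hyperedge is touched exactly once, giving the O(|E|) bound stated on the slide.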
Dynamic Programming: 1950s
Richard Bellman, Andrew Viterbi
"We knew everything so far in your talk 40 years ago."
k-best Viterbi Algorithm 0: naïve
• straightforward k-best extension:
  – a vector of length k instead of a single value, with components kept sorted
  – now what is f(a, b)?
    • k² values — the Cartesian product f(a_i, b_j)
    • we just need the top k out of the k² values: mult_k(f, a, b) = top_k { f(a_i, b_j) }
  – and how to update?
    • from two k-length vectors (2k elements), select the top k elements: O(k)
    • v: merge_k(mult_k(f_1, a, b), mult_k(f_2, c, d))
• overall time complexity: O(k² |E|)
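The two primitives of the naïve algorithm can be sketched in a few lines. A minimal sketch assuming weights are probabilities to maximize; the names mult_k and merge_k follow the slides, but this Python rendering is mine.

```python
import heapq

def mult_k(f, a, b, k):
    """Top k of f(a_i, b_j) over the full Cartesian product: enumerates
    all k^2 combinations, which is exactly Algorithm 0's bottleneck."""
    return heapq.nlargest(k, (f(x, y) for x in a for y in b))

def merge_k(xs, ys, k):
    """Top k elements out of two k-best vectors (2k elements)."""
    return heapq.nlargest(k, xs + ys)
```

With the demo vectors a = (.6, .4, .3, .3) and b = (.5, .4, .3, .1) and f(a, b) = ab, mult_k returns .30, .24, .20 as its top three, matching the grid in the demo slides.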
Algorithm 1: speed up mult_k
mult_k(f, a, b) = top_k { f(a_i, b_j) }
• we are only interested in the top k — why enumerate all k²?
• a and b are sorted, and f is monotonic, so:
  – f(a_1, b_1) must be the 1-best
  – the 2nd-best must be either f(a_2, b_1) or f(a_1, b_2)
  – what about the 3rd-best?
Algorithm 1 (Demo), with f(a, b) = ab
[Figure: grid of products of a_i = (.6, .4, .3, .3) and b_j = (.5, .4, .3, .1); the cells .30, .24, .20, .18, .16, .15, ... are filled in best-first from the top-left corner.]
• use a priority queue (heap) to store the candidates (the frontier)
• in each iteration:
  1. extract-max from the heap
  2. push the two "shoulders" of the extracted cell into the heap
• k iterations; O(k log k |E|) overall time
Algorithm 2: speed up merge_k
• Algorithm 1 works on each hyperedge sequentially
• can we process all incoming hyperedges of a vertex simultaneously?
Algorithm 2 (Demo), with k = 2 and d = 3 incoming hyperedges (B1 × C1, B2 × C2, B3 × C3)
• start with an item-level heap holding the 1-best derivation from each hyperedge (here .42, .36, .32)
• pop the best (.42) and push its two successors (.07 and .24)
• pop the 2nd-best (.36) — done for k = 2
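The item-level heap in the demo can be sketched as follows. A minimal sketch assuming each hyperedge's tail k-best lists are already computed and sorted descending; the function kbest_vertex and the (edge, index-tuple) encoding of derivations are my own.

```python
import heapq

def kbest_vertex(hyperedges, k):
    """hyperedges: list of (f, tails), where tails is a tuple of sorted
    (descending) k-best lists, one per tail of the hyperedge."""
    heap, seen = [], set()

    def push(e, idx):
        f, tails = hyperedges[e]
        if (e, idx) not in seen and all(i < len(t) for i, t in zip(idx, tails)):
            seen.add((e, idx))
            val = f(*[t[i] for i, t in zip(idx, tails)])
            heapq.heappush(heap, (-val, e, idx))

    # seed the heap with the 1-best derivation from each hyperedge
    for e in range(len(hyperedges)):
        push(e, (0,) * len(hyperedges[e][1]))
    out = []
    while heap and len(out) < k:
        neg, e, idx = heapq.heappop(heap)
        out.append(-neg)
        for pos in range(len(idx)):   # successors: advance one tail index
            push(e, idx[:pos] + (idx[pos] + 1,) + idx[pos + 1:])
    return out
```

The heap never holds more than O(d + k) candidates for a vertex with d incoming hyperedges, which is where the |V| k log k term in Algorithm 2's complexity comes from.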
Algorithm 3: Offline (lazy)
• from Algorithm 0 to Algorithm 2: delaying the calculations until needed — lazier, with larger locality
• even lazier (one step further): we are interested in the k-best derivations of the final item only!

• forward phase
  – do a normal 1-best search up to the final item
  – but construct the hypergraph (forest) along the way
• recursive backward phase
  – ask the final item: what's your 2nd-best?
  – the final item propagates this question down to the leaves
  – then ask the final item: what's your 3rd-best? ...
Algorithm 3 demo (k = 2)
• after the "forward" step (1-best parsing), the forest records the 1-best derivation from each hyperedge of S(1, 7): .20 via NP(1, 2) VP(3, 7), .42 via NP(1, 3) VP(4, 7), .28 via VP(1, 5) NP(6, 7); the 1-best is .42
• now the backward step: what's your 2nd-best, S(1, 7)?
  – "I'm not sure... let me ask my parents" — it must be one of a few candidates, but their values (the ?'s) are not known yet
  – "oops... forgot to ask more questions recursively" — the recursion goes on down to the leaf nodes, which report back the numbers (e.g., 2nd-bests 0.5 and 0.3 along the winning hyperedge)
  – push the resulting candidates .30 (= 0.5 × 0.6) and .21 (= 0.7 × 0.3) onto the candidate heap (priority queue)
  – pop the root of the heap (.30): "now I know my 2nd-best"
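The recursive backward phase can be sketched end to end. This is a simplified sketch of the lazy idea, not the paper's implementation: it tracks only derivation values (a real implementation stores backpointers to recover trees), and the toy forest representation, where leaves are absent from the dict and score 1.0, is my own.

```python
import heapq

def lazy_kbest(forest, root, k):
    """forest[v]: list of (f, tails) hyperedges; vertices absent from
    forest are leaves with a single derivation of value 1.0."""
    cand, dlist, seen = {}, {}, {}

    def nth(v, j):
        """Return the j-th best (0-indexed) value of v, or None."""
        if v not in forest or not forest[v]:          # leaf vertex
            return 1.0 if j == 0 else None
        if v not in dlist:                            # first question asked of v
            dlist[v], cand[v], seen[v] = [], [], set()
            for e in range(len(forest[v])):           # seed: 1-best per hyperedge
                push(v, e, (0,) * len(forest[v][e][1]))
        while len(dlist[v]) <= j and cand[v]:         # answer lazily, on demand
            neg, e, idx = heapq.heappop(cand[v])
            dlist[v].append(-neg)
            for pos in range(len(idx)):               # successors of the popped one
                push(v, e, idx[:pos] + (idx[pos] + 1,) + idx[pos + 1:])
        return dlist[v][j] if j < len(dlist[v]) else None

    def push(v, e, idx):
        if (e, idx) in seen[v]:
            return
        seen[v].add((e, idx))
        f, tails = forest[v][e]
        vals = [nth(u, i) for u, i in zip(tails, idx)]  # recursive questions
        if all(x is not None for x in vals):
            heapq.heappush(cand[v], (-f(*vals), e, idx))

    return [x for x in (nth(root, j) for j in range(k)) if x is not None]
```

Vertices that the top-k derivations of the root never touch are never asked anything beyond the forward 1-best pass, which is exactly the source of Algorithm 3's speedup.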
Interesting Properties
• the 1-best is best everywhere (all decisions optimal)
• the 2nd-best is optimal everywhere except one decision
  – that decision must be a 2nd-best
  – and it is the best of all 2nd-best decisions
• so what about the 3rd-best? the kth-best is...
[Figure: local picture — the a_i × b_j product grid again, showing which cells each rank of derivation can occupy.]
Summary of Algorithms

  Algorithm       Time Complexity        Locality
  1-best          O(|E|)                 hyperedge
  alg. 0: naïve   O(k^a |E|)             hyperedge (mult_k)
  alg. 1          O(k log k |E|)         hyperedge (mult_k)
  alg. 2          O(|E| + |V| k log k)   item (merge_k)
  alg. 3: lazy    O(|E| + |D| k log k)   global

for CKY: a = 2, |E| = O(n³ |P|), |V| = O(n² |N|), |D| = O(n), where a is the arity of the grammar
Outline
• Formulations
• Algorithms: Alg. 0 through Alg. 3
• Experiments
  – Collins/Bikel parser
  – both efficiency and accuracy
• Applications in Machine Translation
Background: Statistical Parsing
• probabilistic grammar, induced from a treebank (the Penn Treebank)
• state-of-the-art parsers: Collins (1999), Bikel (2004), Charniak (2000), etc.
• evaluation of accuracy — PARSEVAL: tree similarity (English treebank: ~90%)
• previous work on k-best parsing:
  – Collins (2000): turn off dynamic programming
  – Charniak/Johnson (2005): coarse-to-fine, still too slow
Efficiency
• implemented Algorithms 0, 1, and 3 on top of the Collins/Bikel parser
[Figure: average (wall-clock) time per sentence on the Penn Treebank as a function of k; the curve for the lazy algorithm is labeled O(|E| + |D| k log k).]
Oracle Reranking
given k parses of a sentence:
• oracle reranking: pick the best parse according to the gold standard
• real reranking: pick the best parse according to the score function
[Figure: accuracy scale — the gold-standard "correct" parse at 100%; the k-best parses range over 96%, 91%, 89%, ..., 78%, with the 1-best at 89% and the oracle picking the best of the list.]
Quality of the k-best lists
[Figure: oracle accuracy versus k — this work on top of the Collins parser, compared with Collins (2000).]
Why are our k-best lists better?
[Figure: average number of parses for sentences of a given length — this work versus Collins (2000); the region where Collins (2000) produces fewer parses covers 7% of the test set.]
• Collins (2000) turns down dynamic programming:
  – theoretically exponential time complexity
  – aggressive beam pruning to make it tractable in practice
Dynamic Programming: k-Best Viterbi Algorithm? (recap)
1. topological sort
2. visit each vertex v in sorted order and do updates
   • for each incoming edge (u, v) in E,
     use d(u) to update d(v): d(v) ⊕= d(u) ⊗ w(u, v)
   • key observation: d(u) is fixed to optimal at this time
3. time complexity: O(V + E)
Algorithm 0: O(k E);  Algorithm 2: O(E + k log k V);  Algorithm 3: O(E + k log k D)
Conclusions
• monotonic hypergraph formulation; the k-best derivations problem
• k-best algorithms: Algorithm 0 (naïve) to Algorithm 3 (lazy)
• experimental results: efficiency, and accuracy (effectively searching over a larger space)
• applications in machine translation: k-best rescoring and forest rescoring
Implemented in ...
• state-of-the-art statistical parsers
  – Charniak parser (2005); Berkeley parser (2006)
  – McDonald et al. dependency parser (2005)
  – Microsoft Research (Redmond) dependency parser (2006)
• generic dynamic programming languages/packages
  – Dyna (Eisner et al., 2005) and Tiburon (May and Knight, 2006)
• state-of-the-art syntax-based translation systems
  – Hiero (Chiang, 2005)
  – ISI syntax-based system (2005)
  – CMU syntax-based system (2006)
  – BBN syntax-based system (2007)
Applications in Machine Translation
Syntax-based Translation
• synchronous context-free grammars (SCFGs)
  – generate pairs of strings/trees simultaneously
  – co-indexed nonterminals are further rewritten as a unit
• example rules:
  S → NP(1) VP(2) , NP(1) VP(2)
  VP → PP(1) VP(2) , VP(2) PP(1)
  NP → Baoweier , Powell
  ...
[Figure: paired trees for "Baoweier yu Shalong juxing le huitan" / "Powell held a meeting with Sharon", with the PP ("yu Shalong" / "with Sharon") and VP swapped across the two languages.]
Translation as Parsing
• translation ("decoding") => monolingual parsing
  – parse the source input with the source projection of the SCFG
  – build the corresponding target-side sub-strings in parallel
[Figure: parsing "Baoweier yu Shalong juxing le huitan" bottom-up — first NP, PP, and the inner VP (with weights 1.0, 1.5, 2.0), then the outer VP (3.0); the target side is assembled in parallel, reordering "Powell" + "with Sharon" + "held a talk" into "Powell held a talk with Sharon".]
Language Model: Rescoring
[Figure: the classic pipeline — statistical analysis of Spanish/English bilingual text yields the translation model (TM, "competency"; here a synchronous CFG), and statistical analysis of English text yields the language model (LM, "fluency"; here an n-gram LM).]
• k-best rescoring: the decoder produces k candidates with TM costs; the LM then re-scores them
  e.g., for "Que hambre tengo yo":
    TM 3.7, LM 5.4: What hunger have I
    TM 3.8, LM 7.0: Hungry I am so
    TM 4.1, LM 9.8: Have I that hunger
    TM 4.5, LM 0.2: I am so hungry   ← selected
    TM 7.2, LM 8.7: How hunger have I
    ...
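The k-best rescoring step above amounts to a one-line re-ranking. A minimal sketch assuming higher scores are better (the slide's TM/LM numbers are costs, so they are negated here); the toy dictionary-backed lm_score stands in for a real n-gram LM.

```python
def rescore(kbest, lm_score, lm_weight=1.0):
    """kbest: list of (translation, tm_score); pick the candidate with the
    best combined TM + weighted LM score."""
    return max(kbest, key=lambda cand: cand[1] + lm_weight * lm_score(cand[0]))
```

With the slide's numbers (costs negated into scores), "I am so hungry" wins despite its worse TM score, because its LM score is by far the best.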
k-best Rescoring Results
• the ISI syntax-based translation system
  – currently the best-performing system on the Chinese-to-English task in NIST evaluations
  – based on synchronous grammars
• BLEU scores:
  – translation model (TM) only: 24.45
  – rescoring with a trigram LM on a 25000-best list: 34.58
Language Model: Integration (Knight and Koehn, 2003)
• instead of rescoring a k-best list, integrate the n-gram LM directly into the decoder
  – an integrated (LM-integrated) decoder translates "Que hambre tengo yo" straight to "I am so hungry"
  – but this is computationally challenging! ☹
Integrated Decoding Results (see my second talk for details!)
• the ISI syntax-based translation system, BLEU scores:
  – translation model (TM) only: 24.45
  – rescoring with a trigram LM on a 25000-best list: 34.58
  – trigram integrated decoding: 38.44 — but over 100 times slower!
Rescoring vs. Integration
[Figure: quality/speed trade-off — rescoring is fast but poor; integration is good but slow.]
• any compromise (on-the-fly rescoring)?
• yes, forest rescoring: almost as fast as rescoring, and almost as good as integration
Why Is Integration Slow?

• split each node into +LM items (with boundary words), e.g. for the hyperedge VP1,6 → PP1,3 VP3,6:
  PP1,3 items: (with ... Sharon), (along ... Sharon), (with ... Shalong)
  VP3,6 items: (held ... talk), (held ... meeting), (hold ... talks)
  VP1,6 items: (hold ... Sharon), (held ... Shalong), (hold ... Shalong)
• beam search: only keep the top-k +LM items at each node
• but there are many ways to derive each node
• can we avoid enumerating all combinations?
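The combination cost the slide warns about can be made concrete with a small sketch (the function name `combine_naive` and the `lm_cost` argument are illustrative, not the talk's actual implementation): every pair of antecedent +LM items is scored before pruning back to the beam.

```python
from itertools import product

def combine_naive(left_items, right_items, lm_cost, k):
    """Naive +LM combination at one hyperedge: score every pair of
    antecedent +LM items (cost, boundary-words), then keep only the
    top-k.  With beams of size k on both antecedents this enumerates
    O(k^2) combinations per hyperedge -- the source of the slowdown."""
    candidates = []
    for (lcost, lwords), (rcost, rwords) in product(left_items, right_items):
        candidates.append((lcost + rcost + lm_cost(lwords, rwords),
                           lwords, rwords))
    candidates.sort()
    return candidates[:k]
```

Cube pruning, introduced later in the talk, avoids touching most of these pairs.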
Forest Rescoring (k-best parsing Algorithm 2)

• with LM cost, we can only do k-best approximately
• process all hyperedges of a node simultaneously, e.g. VP1,6 ← {PP1,3 VP3,6; PP1,4 VP4,6; NP1,4 VP4,6}
• significant savings of computation
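One way to picture "process all hyperedges simultaneously" is a single shared frontier over all incoming hyperedges of a node; this is only a sketch of the idea, not the paper's Algorithm 2 (the name `node_k_best` and the input format are made up), assuming each hyperedge offers its combination costs in roughly sorted order:

```python
import heapq

def node_k_best(edge_candidates, k):
    """Seed one priority queue with the best candidate of every
    incoming hyperedge and pop k items total, so computation
    concentrates on promising hyperedges instead of running a full
    per-hyperedge enumeration.  `edge_candidates` maps a hyperedge
    id to its sorted list of combination costs."""
    heap = [(costs[0], edge, 0)
            for edge, costs in edge_candidates.items() if costs]
    heapq.heapify(heap)
    results = []
    while heap and len(results) < k:
        cost, edge, i = heapq.heappop(heap)
        results.append((cost, edge))
        costs = edge_candidates[edge]
        if i + 1 < len(costs):  # advance this hyperedge's frontier
            heapq.heappush(heap, (costs[i + 1], edge, i + 1))
    return results
```

A weak hyperedge whose best candidate never reaches the top of the queue contributes nothing beyond its single seed entry.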
Forest Rescoring Results

• on the Hiero system (Chiang, 2005): ~10-fold speed-up at the same level of BLEU
• on my syntax-directed system (Huang et al., 2006): ~10-fold speed-up at the same level of search error
• on a typical phrase-based system (Pharaoh): ~30-fold speed-up at the same level of search error; ~100-fold speed-up at the same level of BLEU
• also used in NIST evals by the ISI, CMU, and BBN syntax-based systems
• see my ACL '07 paper for details
Conclusions

• monotonic hypergraph formulation; the k-best derivations problem
• k-best algorithms: Algorithm 0 (naïve) to Algorithm 3 (lazy)
• experimental results: efficiency; accuracy (effectively searching over a larger space)
• applications in machine translation: k-best rescoring and forest rescoring
Thank you!! Questions? Comments?

• Liang Huang and David Chiang (2005). Better k-best Parsing. In Proceedings of IWPT, Vancouver, B.C.
• Liang Huang and David Chiang (2007). Forest Rescoring: Faster Decoding with Integrated Language Models. In Proceedings of ACL, Prague, Czech Rep.
Quality of the k-best Lists

(figure: Collins, 2000)
Syntax-based Experiments
Tree-to-String System

• syntax-directed, English to Chinese (Huang, Knight, Joshi, 2006); the reverse direction is found in (Liu et al., 2006)
• synchronous tree-substitution grammars (STSG) (Galley et al., 2004; Eisner, 2003); related to STAG (Shieber/Schabes, 90)
• example rule: the English VP "(VBD was) (VP-C (VBN shot) (PP (TO to) (NP-C (NN death))) (PP (IN by) (NP-C (DT the) (NN police))))" paired with its Chinese bei-construction translation
• tested on 140 sentences; slightly better BLEU scores than Pharaoh
Speed vs. Search Quality (figure)
Speed vs. Translation Accuracy (figure)
Cube Pruning

VP1,6 ← PP1,3 VP3,6: is the grid of combined costs monotonic?

                           (with ⋆ Sharon, 1.0)   (along ⋆ Sharon, 3.0)   (with ⋆ Shalong, 8.0)
(held ⋆ meeting, 1.0)              2.0                     4.0                     9.0
(held ⋆ talk, 1.1)                 2.1                     4.1                     9.1
(hold ⋆ conference, 3.5)           4.5                     6.5                    11.5
Cube Pruning

non-monotonic grid due to LM combination costs, e.g. for the bigram (meeting, with):

                           (with ⋆ Sharon, 1.0)   (along ⋆ Sharon, 3.0)   (with ⋆ Shalong, 8.0)
(held ⋆ meeting, 1.0)           2.0 + 0.5               4.0 + 5.0               9.0 + 0.5
(held ⋆ talk, 1.1)              2.1 + 0.3               4.1 + 5.4               9.1 + 0.3
(hold ⋆ conference, 3.5)        4.5 + 0.6               6.5 + 10.5             11.5 + 0.6
Cube Pruning

the same grid with the LM combination costs summed in; no longer monotonic along rows or columns:

                           (with ⋆ Sharon, 1.0)   (along ⋆ Sharon, 3.0)   (with ⋆ Shalong, 8.0)
(held ⋆ meeting, 1.0)              2.5                     9.0                     9.5
(held ⋆ talk, 1.1)                 2.4                     9.5                     9.4
(hold ⋆ conference, 3.5)           5.1                    17.0                    12.1
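The grid above is just elementwise sums; a tiny helper (the name `combined_grid` is hypothetical) reproduces it from the antecedent costs and the LM combination costs:

```python
def combined_grid(row_costs, col_costs, combo_costs):
    """Entry (i, j) of the grid is row_costs[i] + col_costs[j] +
    combo_costs[i][j].  Without the combo term the grid is monotone
    non-decreasing along every row and column (both cost lists are
    sorted); the LM combination costs break that property."""
    return [[r + c + combo_costs[i][j]
             for j, c in enumerate(col_costs)]
            for i, r in enumerate(row_costs)]
```

Plugging in the slide's numbers (rows 1.0, 1.1, 3.5; columns 1.0, 3.0, 8.0; combo costs 0.5, 5.0, 0.5, ...) reproduces the grid shown, and the first column's 2.4 below 2.5 is exactly the non-monotonicity cube pruning must cope with.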
Cube Pruning (k-best parsing Algorithm 1)

• maintain a priority queue of candidates over the grid
• repeatedly extract the best candidate and push its two successors (the right and down neighbors)
• items are popped out of order (here: 2.5, 2.4, 5.1) because the grid is non-monotonic
• solution: keep a buffer of pop-ups; finally re-sort the buffer and return in order: 2.4, 2.5, 5.1
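The steps above (priority queue, extract best, push the two successors, buffer, re-sort) can be sketched as follows; a minimal illustration assuming both antecedent lists are sorted by their own cost, with `lm_cost` a stand-in for the real LM combination cost:

```python
import heapq

def cube_pruning(left, right, lm_cost, k):
    """Best-first enumeration over the grid of antecedent
    combinations at one hyperedge.  Because the LM combination cost
    makes the grid non-monotonic, items can pop slightly out of
    order; we collect k of them in a buffer and re-sort at the end,
    so the result is approximately (not exactly) the k best."""
    def score(i, j):
        (lc, lw), (rc, rw) = left[i], right[j]
        return lc + rc + lm_cost(lw, rw)

    heap = [(score(0, 0), 0, 0)]   # priority queue of candidates
    seen = {(0, 0)}
    buffer = []
    while heap and len(buffer) < k:
        cost, i, j = heapq.heappop(heap)          # extract the best
        buffer.append((cost, left[i][1], right[j][1]))
        for ni, nj in ((i + 1, j), (i, j + 1)):   # push two successors
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (score(ni, nj), ni, nj))
    buffer.sort()   # re-sort the buffer and return in order
    return buffer
```

With a constant `lm_cost` the grid is monotonic and the pops already come out in order; a non-constant `lm_cost` is what produces the 2.5-before-2.4 behavior on the slide.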