
k-best Parsing
Liang Huang (Penn)

[figure: the 1-best and 2nd-best parse trees of “I saw a boy with a telescope.”, which differ in whether the PP “with a telescope” attaches to the VP (“saw ... with a telescope”) or to the NP (“a boy with a telescope”)]


Not a trivial task…

  I saw her duck.

[photo: Aravind Joshi]


I saw her duck.

how about “I saw her duck with a telescope.”


I saw her duck with a telescope.


Another Example

  I eat sushi with tuna.

[photo: Aravind Joshi]


Or even...

  I eat sushi with tuna.


Why k-best?
• postpone disambiguation in a pipeline
  – the 1-best is not always optimal for downstream stages
  – propagate k-best lists instead of the 1-best
  – e.g.: a semantic role labeler uses k-best parses
• approximate the set of all possible interpretations
  – reranking (Collins, 2000)
  – minimum error training (Och, 2003)
  – online training (McDonald et al., 2005)

[figure: a pipeline: part-of-speech tagging → lattice → syntactic parsing → k-best parses → semantic interpretation]


k-Best Viterbi Algorithm?
1. topological sort
2. visit each vertex v in sorted order and do updates
   • for each incoming edge (u, v) in E
   • use d(u) to update d(v): d(v) ⊕= d(u) ⊗ w(u, v)
   • key observation: d(u) is fixed to optimal at this time
3. time complexity: O(V + E)

naive k-best: O(k E);  good k-best: O(E + k log k V)
(a runnable sketch of the 1-best algorithm follows below)
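In code, the 1-best version reads as follows. A minimal sketch, assuming the (⊕, ⊗) = (max, ×) semiring over probabilities and an input DAG whose vertices come pre-sorted topologically; the data layout (`in_edges` as adjacency lists of incoming edges) is illustrative, not from the slides:

```python
def viterbi_dag(topo_vertices, in_edges, source):
    """1-best Viterbi on a DAG in the (max, *) semiring.

    topo_vertices: vertices in topological order.
    in_edges: dict v -> list of (u, w) pairs, the incoming edges (u, v)
    with weight w(u, v).  d(v) = max over incoming edges of d(u) * w(u, v).
    """
    d = {v: 0.0 for v in topo_vertices}
    back = {}                                  # backpointers for the 1-best path
    d[source] = 1.0
    for v in topo_vertices:                    # step 2: visit in sorted order
        for u, w in in_edges.get(v, []):       # each incoming edge (u, v)
            if d[u] * w > d[v]:                # d(v) (+)= d(u) (x) w(u, v)
                d[v], back[v] = d[u] * w, u
    return d, back
```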


Parsing as Deduction
• Parsing with context-free grammars (CFGs)
  – Dynamic Programming (CKY algorithm)

  (B, i, j)   (C, j+1, k)              (NP, 1, 3)   (VP, 4, 6)
  ───────────────────────  A → B C     ───────────────────────  S → NP VP
         (A, i, k)                            (S, 1, 6)

[figure: the rule A → B C combining span (i, j) labeled B with span (j+1, k) labeled C into span (i, k) labeled A]

computational complexity: O(n³ |P|), where P is the set of productions (rules)
(a CKY sketch follows below)
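As a concrete instance of this deduction rule, here is a minimal CKY sketch over a PCFG in Chomsky normal form; the grammar/lexicon encoding is an assumption for illustration, not the parser used in the talk:

```python
from collections import defaultdict

def cky(words, rules, lexicon):
    """CKY Viterbi chart: chart[(X, i, k)] = best inside probability of
    item (X, i, k), with 1-based inclusive spans as on the slides.

    rules:   list of (A, B, C, prob) for binary rules A -> B C
    lexicon: dict word -> list of (X, prob) for preterminal rules
    """
    n = len(words)
    chart = defaultdict(float)
    for i, w in enumerate(words, 1):
        for X, p in lexicon.get(w, []):
            chart[X, i, i] = max(chart[X, i, i], p)
    for span in range(2, n + 1):
        for i in range(1, n - span + 2):
            k = i + span - 1
            for j in range(i, k):                  # split: (B, i, j) (C, j+1, k)
                for A, B, C, p in rules:           # deduce (A, i, k) via A -> B C
                    s = chart[B, i, j] * chart[C, j + 1, k] * p
                    if s > chart[A, i, k]:
                        chart[A, i, k] = s
    return chart
```

The three span/split loops give the n³ factor and the rule loop the |P| factor in the O(n³ |P|) bound.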


Deduction => Hypergraph
• hypergraph is a generalization of graph
  – each hyperedge connects several vertices to one vertex

[figure: the deductions (VB, 3, 3), (PP, 4, 6) ⊢ (VP, …) and (NP, 1, 3), (VP, 4, 6) ⊢ (S, 1, 6) drawn as a hypergraph: the items are vertices, each deduction step is a hyperedge]


Packed Forest as Hypergraph
• packed forest: a compact representation of all parse trees

[figure: the packed forest for “I saw a boy with a telescope” as a hypergraph, with shared NP, VP, and PP nodes and two competing hyperedges deriving the sentence’s VP]


Weighted Deduction/Hypergraph

  (B, i, j): p   (C, j+1, k): q
  ─────────────────────────────  A → B C
       (A, i, k): f (p, q)

• f is the weight function
  – e.g., in Probabilistic Context-Free Grammars: f (p, q) = p · q · Pr(A → B C)


Monotonic Weight Functions
• all weight functions must be monotonic on each of their arguments:
  if b′ ≤ b, then f (b′, c) ≤ f (b, c)
• this is the optimal sub-problem property in dynamic programming

CKY example: A = (S, 1, 5); B = (NP, 1, 2), C = (VP, 3, 5);
f (b, c) = b · c · Pr(S → NP VP)


k-best Problem in Hypergraphs
• 1-best problem
  – find the best derivation of the target vertex t
• k-best problem
  – find the top k derivations of the target vertex t
  – in CKY, t = (S, 1, n)
• assumption
  – acyclic, so that we can use a topological order


Outline
• Formulations
• Algorithms
  – Generic 1-best Viterbi Algorithm
  – Algorithm 0: naïve
  – Algorithm 1: hyperedge-level
  – Algorithm 2: vertex (item)-level
  – Algorithm 3: lazy algorithm
• Experiments
• Applications to Machine Translation


Generic 1-best Viterbi Algorithm
• traverse the hypergraph in topological order (“bottom-up”)
  – for each vertex, for each incoming hyperedge:
    • compute the result of the f function along the hyperedge
    • update the 1-best value for the current vertex if possible

[figure: vertex v with incoming hyperedges, e.g. tails (u: a, w: b) via f₁ and (u′: c, w′: d) via f₂, so that v = better(better(f₁(a, b), f₂(c, d)), …); instantiated in CKY as (VP, 2, 4), (PP, 4, 7) ⊢ (VP, 2, 7)]

overall time complexity: O(|E|); in CKY: |E| = O(n³ |P|)
(a hypergraph version of the earlier sketch follows below)
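The same procedure on a hypergraph, as a minimal sketch; encoding each incoming hyperedge as a `(tails, f)` pair is an illustrative assumption consistent with the weighted-deduction view above:

```python
def viterbi_hypergraph(topo_vertices, in_edges):
    """Generic 1-best Viterbi over an acyclic hypergraph.

    topo_vertices: vertices in topological (bottom-up) order.
    in_edges: dict v -> list of (tails, f), where tails is a tuple of
    antecedent vertices and f is the monotonic weight function along
    the hyperedge.  Vertices with no incoming hyperedges are axioms
    with best value 1.0 (probability semiring).
    """
    best = {}
    for v in topo_vertices:
        edges = in_edges.get(v, [])
        if not edges:
            best[v] = 1.0                          # axiom / leaf item
            continue
        best[v] = 0.0
        for tails, f in edges:                     # each incoming hyperedge
            score = f(*(best[u] for u in tails))   # f along the hyperedge
            if score > best[v]:                    # better(old, new)
                best[v] = score
    return best
```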


Dynamic Programming: 1950’s

[photos: Richard Bellman and Andrew Viterbi]

“We knew everything so far in your talk 40 years ago.”


k-best Viterbi Algorithm 0: naïve
• straightforward k-best extension:
  – a vector of length k instead of a single value
  – vector components are kept sorted
  – now what’s f (a, b)?
    • k² values: the Cartesian product f (a_i, b_j)
    • we just need the top k out of the k² values
  – mult_k(f, a, b) = top_k { f (a_i, b_j) }

[figure: the k × k grid of combinations of two sorted vectors, a = (.6, .4, .3, .3) and b = (.5, .4, .3, .1)]


Algorithm 0: naïve (continued)
• and how to update across multiple hyperedges?
  – from two k-lengthed vectors (2k elements), select the top k elements: O(k)
  – v: merge_k(mult_k(f₁, a, b), mult_k(f₂, c, d))

overall time complexity: O(k² |E|)
(a sketch of mult_k and merge_k follows below)
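A minimal sketch of Algorithm 0; the names `mult_k` and `merge_k` follow the slides, while the Python encoding (`heapq.nlargest` for top-k selection) is an assumption:

```python
import heapq

def mult_k(f, a, b, k):
    """Algorithm 0: enumerate all k^2 combinations, keep the top k."""
    return heapq.nlargest(k, (f(x, y) for x in a for y in b))

def merge_k(vectors, k):
    """Merge sorted k-best vectors from several hyperedges, keep the top k."""
    return heapq.nlargest(k, (v for vec in vectors for v in vec))

# with the slides' vectors and f(a, b) = a * b:
a, b = [.6, .4, .3, .3], [.5, .4, .3, .1]
print(mult_k(lambda x, y: x * y, a, b, 4))       # [0.3, 0.24, 0.2, 0.18]
```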


Algorithm 1: speed up mult_k
mult_k(f, a, b) = top_k { f (a_i, b_j) }
• only interested in the top k: why enumerate all k²?
• a and b are sorted!
• f is monotonic!
• so f (a₁, b₁) must be the 1-best
• the 2nd-best must be either f (a₂, b₁) or f (a₁, b₂)
• what about the 3rd-best?


Algorithm 1 (Demo)
f (a, b) = a · b, with a = (.6, .4, .3, .3) and b = (.5, .4, .3, .1)

[figure: the grid of products filled in frontier order: .30 first, then its neighbors .24 and .20, then .18, .16, .15, …]

use a priority queue (heap) to store the candidates (the frontier);
in each iteration:
1. extract-max from the heap
2. push the two “shoulders” of the extracted cell into the heap
k iterations; O(k log k |E|) overall time
(a sketch follows below)
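A sketch of the frontier trick for one hyperedge, assuming a and b are sorted descending and f is monotonic in each argument; a max-heap is simulated by negated scores, and the numbers reproduce the demo:

```python
import heapq

def mult_k_lazy(f, a, b, k):
    """Algorithm 1: lazy k-best along one hyperedge.

    f(a[0], b[0]) is the 1-best; each popped cell's successors are its
    right and lower neighbors ("shoulders") in the grid.
    """
    heap = [(-f(a[0], b[0]), 0, 0)]              # entries: (-score, i, j)
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)          # extract-max
        out.append(-neg)
        for i2, j2 in ((i + 1, j), (i, j + 1)):  # push the two shoulders
            if i2 < len(a) and j2 < len(b) and (i2, j2) not in seen:
                seen.add((i2, j2))
                heapq.heappush(heap, (-f(a[i2], b[j2]), i2, j2))
    return out

a, b = [.6, .4, .3, .3], [.5, .4, .3, .1]
print(mult_k_lazy(lambda x, y: x * y, a, b, 4))  # [0.3, 0.24, 0.2, 0.18]
```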


Algorithm 2: speed up merge_k
• Algorithm 1 works on each hyperedge sequentially
• can we process them simultaneously?

[figure: vertex v with d incoming hyperedges f₁ … f_d, with tails (u: a, w: b), …, (p: x, q: y)]


Algorithm 2 (Demo)    (k = 2, d = 3)
• start with an initial item-level heap of the 1-best derivations from
  each hyperedge: .42 from B₁ × C₁, .36 from B₂ × C₂, .32 from B₃ × C₃
• pop the best (.42) into the output and push its two successors (.07 and .24)
• pop the 2nd-best (.36); the output is now ⟨.42, .36⟩ and we are done
(a sketch follows below)
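A sketch of the item-level heap, which holds one frontier cell per hyperedge at a time. The tail vectors below are reconstructed from the demo grids (an assumption), but they reproduce the popped scores .42 and .36 and the successors .07 and .24:

```python
import heapq

def kbest_vertex(hyperedges, k):
    """Algorithm 2 sketch: merge all incoming hyperedges with one heap.

    hyperedges: list of (f, a, b), the weight function plus the sorted
    k-best vectors of the two tails.
    """
    heap, seen, out = [], set(), []

    def push(e, i, j):
        f, a, b = hyperedges[e]
        if i < len(a) and j < len(b) and (e, i, j) not in seen:
            seen.add((e, i, j))
            heapq.heappush(heap, (-f(a[i], b[j]), e, i, j))

    for e in range(len(hyperedges)):      # seed: the 1-best of each hyperedge
        push(e, 0, 0)
    while heap and len(out) < k:
        neg, e, i, j = heapq.heappop(heap)
        out.append(-neg)
        push(e, i + 1, j)                 # the two successors of the pop
        push(e, i, j + 1)
    return out

mul = lambda x, y: x * y
hs = [(mul, [0.6, 0.1], [0.7, 0.4]),      # B1 x C1: 1-best .42
      (mul, [0.9, 0.5], [0.4, 0.3]),      # B2 x C2: 1-best .36
      (mul, [0.4, 0.1], [0.8, 0.7])]      # B3 x C3: 1-best .32
print(kbest_vertex(hs, 2))                # [0.42, 0.36], as in the demo
```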


Algorithm 3: Offline (lazy)
• from Algorithm 0 to Algorithm 2:
  – delaying the calculations until needed, i.e. lazier
  – larger locality
• even lazier (one step further):
  – we are interested in the k-best derivations of the final item only!

• forward phase
  – do a normal 1-best search till the final item
  – but construct the hypergraph (forest) along the way
• recursive backward phase
  – ask the final item: what’s your 2nd-best?
  – the final item will propagate this question till the leaves
  – then ask the final item: what’s your 3rd-best?
(a sketch follows after the demo below)


Algorithm 3 demo    (k = 2)

[figure: a forest fragment with three hyperedges into S(1, 7), from
NP(1, 2) VP(3, 7), NP(1, 3) VP(4, 7), and VP(1, 5) NP(6, 7); the
forward 1-best of S(1, 7) is .42, via NP(1, 3) VP(4, 7)]

• after the “forward” step (1-best parsing), the forest stores the 1-best derivation from each hyperedge
• now the backward step: ask S(1, 7), “what’s your 2nd-best?”
  – “I’m not sure... let me ask my parents…”
  – “well, it must be either … or …”
  – “but wait a minute… did you already know the ?’s?”
  – “oops… forgot to ask more questions recursively…”
• the recursion goes on to the leaf nodes, and reports back the numbers
• push .30 and .21 to the candidate heap (priority queue)
• pop the root of the heap (.30): “now I know my 2nd-best”
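A compact sketch of the lazy recursion (scores only; a real implementation also stores backpointers to recover the derivations). For brevity it folds the 1-best forward pass into the same recursion, whereas the talk's Algorithm 3 runs a separate Viterbi pass first; the hypergraph encoding matches the earlier sketches and is an assumption:

```python
import heapq

class LazyKBest:
    """Sketch of the lazy k-best recursion (Huang and Chiang, 2005).

    in_edges: dict v -> list of (tails, f); leaf: dict axiom -> score.
    kth(v, j) returns the j-th best score of v (0-based), or None.
    """

    def __init__(self, in_edges, leaf):
        self.in_edges, self.leaf = in_edges, leaf
        self.D, self.cand, self.seen = {}, {}, {}

    def kth(self, v, j):
        if v in self.leaf:                        # axiom: one derivation
            return self.leaf[v] if j == 0 else None
        if v not in self.D:                       # first visit: seed frontier
            self.D[v], self.cand[v], self.seen[v] = [], [], set()
            for e in range(len(self.in_edges[v])):    # each hyperedge's 1-best
                self._push(v, e, (0,) * len(self.in_edges[v][e][0]))
        while len(self.D[v]) <= j:                # lazily pop until j-th exists
            if self.D[v]:                         # expand last pop's successors
                _, e, pos = self.D[v][-1]
                for i in range(len(pos)):
                    self._push(v, e, pos[:i] + (pos[i] + 1,) + pos[i + 1:])
            if not self.cand[v]:
                return None                       # fewer than j+1 derivations
            neg, e, pos = heapq.heappop(self.cand[v])
            self.D[v].append((-neg, e, pos))
        return self.D[v][j][0]

    def _push(self, v, e, pos):
        if (e, pos) in self.seen[v]:
            return
        tails, f = self.in_edges[v][e]
        subs = [self.kth(u, j) for u, j in zip(tails, pos)]  # ask the parents
        if None not in subs:
            self.seen[v].add((e, pos))
            heapq.heappush(self.cand[v], (-f(*subs), e, pos))

# toy forest: two hyperedges into S over leaf items NP and VP
g = {'S': [(('NP', 'VP'), lambda a, b: a * b * 0.9),
           (('NP', 'VP'), lambda a, b: a * b * 0.3)]}
kb = LazyKBest(g, leaf={'NP': 0.5, 'VP': 0.8})
print([kb.kth('S', j) for j in range(3)])         # [0.36, 0.12, None]
```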


Interesting Properties
• the 1-best is best everywhere (all decisions optimal)
• the 2nd-best is optimal everywhere except one decision
  – and that decision must be a 2nd-best
  – and it’s the best of all 2nd-best decisions
• so what about the 3rd-best? the kth-best is…

[figure: local picture: the grid of products of a = (.6, .4, .3, .3) and b = (.5, .4, .3, .1)]


Summary of Algorithms

  Algorithm       Time Complexity       Locality
  1-best          O(|E|)                hyperedge
  alg. 0: naïve   O(kᵃ |E|)             hyperedge (mult_k)
  alg. 1          O(k log k |E|)        hyperedge (mult_k)
  alg. 2          O(|E| + |V| k log k)  item (merge_k)
  alg. 3: lazy    O(|E| + |D| k log k)  global

for CKY: a = 2, |E| = O(n³ |P|), |V| = O(n² |N|), |D| = O(n);
a is the arity of the grammar


Outline
• Formulations
• Algorithms: Alg. 0 thru Alg. 3
• Experiments
  – Collins/Bikel parser
  – both efficiency and accuracy
• Applications in Machine Translation


Background: Statistical Parsing
• Probabilistic Grammar
  – induced from a treebank (Penn Treebank)
• State-of-the-art Parsers
  – Collins (1999), Bikel (2004), Charniak (2000), etc.
• Evaluation of Accuracy
  – PARSEVAL: tree similarity (English treebank: ~90%)
• Previous work on k-best Parsing:
  – Collins (2000): turn off dynamic programming
  – Charniak and Johnson (2005): coarse-to-fine, still too slow


Efficiency
Implemented Algorithms 0, 1, and 3 on top of the Collins/Bikel parser.

[figure: average (wall-clock) time per sentence on the Penn Treebank as a function of k; Algorithm 3 scales as O(|E| + |D| k log k)]


Oracle Reranking
given k parses of a sentence:
• oracle reranking: pick the best parse according to the gold standard
• real reranking: pick the best parse according to the score function

[figure: an accuracy scale: the gold-standard “correct” parse at 100%; the 1-best parse at 89%; the k-best parses spread over e.g. 96%, 91%, 89%, 78%, with the oracle picking the best of them]


Quality of the k-best lists

[figure: oracle accuracy as a function of k: this work on top of the Collins parser vs. Collins (2000)]


Why are our k-best lists better?

[figure: average number of parses for sentences of a given length: this work vs. Collins (2000); the Collins (2000) data covers only 7% of the test set]

Collins (2000) turns off dynamic programming: theoretically exponential
time complexity, with aggressive beam pruning to make it tractable in
practice.


k-Best Viterbi Algorithm? (recap)
1. topological sort
2. visit each vertex v in sorted order and do updates
   • for each incoming edge (u, v) in E
   • use d(u) to update d(v): d(v) ⊕= d(u) ⊗ w(u, v)
   • key observation: d(u) is fixed to optimal at this time
3. time complexity: O(V + E)

Algorithm 0: O(k E);  Algorithm 2: O(E + k log k V);  Algorithm 3: O(E + k log k D)


Conclusions
• monotonic hypergraph formulation
  – the k-best derivations problem
• k-best algorithms
  – Algorithm 0 (naïve) to Algorithm 3 (lazy)
• experimental results
  – efficiency
  – accuracy (effectively searching over a larger space)
• applications in machine translation
  – k-best rescoring and forest rescoring


Implemented in ...
– state-of-the-art statistical parsers
  • Charniak parser (2005); Berkeley parser (2006)
  • McDonald et al. dependency parser (2005)
  • Microsoft Research (Redmond) dependency parser (2006)
– generic dynamic programming languages/packages
  • Dyna (Eisner et al., 2005) and Tiburon (May and Knight, 2006)
– state-of-the-art syntax-based translation systems
  • Hiero (Chiang, 2005)
  • ISI syntax-based system (2005)
  • CMU syntax-based system (2006)
  • BBN syntax-based system (2007)


Applications in Machine Translation


Syntax-based Translation
• synchronous context-free grammars (SCFGs)
• generating pairs of strings/trees simultaneously
• a co-indexed nonterminal is further rewritten as a unit

  S  → NP(1) VP(2) , NP(1) VP(2)
  VP → PP(1) VP(2) , VP(2) PP(1)
  NP → Baoweier , Powell
  . . .

[figure: paired source/target trees: “Baoweier yu Shalong juxing le huitan” and “Powell held a meeting with Sharon”, with the PP and VP in swapped order across the two languages]


Translation as Parsing
• translation (“decoding”) => monolingual parsing
• parse the source input with the source projection
• build the corresponding target sub-strings in parallel

  S  → NP(1) VP(2) , NP(1) VP(2)
  VP → PP(1) VP(2) , VP(2) PP(1)
  NP → Baoweier , Powell
  . . .

[figure: parsing “Baoweier yu Shalong juxing le huitan” bottom-up into NP, PP, and VP spans (with weights such as 1.0, 1.5, 2.0, 3.0), then a full VP and S, while assembling the target output “Powell held a talk with Sharon” in parallel]
(a toy sketch of the target-side assembly follows below)
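A toy sketch of the parallel target-side construction: each source-side parse node carries its rule's target side, and co-indexed nonterminals are realized in target order. The two non-lexical rules are the slide's; the lexical entries for “yu Shalong” and “juxing le huitan” are assumed for illustration:

```python
def realize(node):
    """Assemble the target string from a source parse decorated with
    the synchronized target sides of the SCFG rules.

    A node is (template, children); the template lists target terminals
    and child indices (ints), in target-side order.
    """
    template, children = node
    out = []
    for sym in template:
        if isinstance(sym, int):                 # co-indexed nonterminal
            out.extend(realize(children[sym]))
        else:                                    # target terminal
            out.append(sym)
    return out

np  = (['Powell'], [])                           # NP -> Baoweier , Powell
pp  = (['with', 'Sharon'], [])                   # assumed lexical rule
vp2 = (['held', 'a', 'talk'], [])                # assumed lexical rule
vp  = ([1, 0], [pp, vp2])                        # VP -> PP(1) VP(2) , VP(2) PP(1)
s   = ([0, 1], [np, vp])                         # S -> NP(1) VP(2) , NP(1) VP(2)
print(' '.join(realize(s)))                      # Powell held a talk with Sharon
```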


Language Model: Rescoring

[figure: the noisy-channel pipeline: Spanish/English bilingual text and English text are statistically analyzed into a translation model (TM, competency; here a synchronous CFG) and a language model (LM, fluency; here an n-gram LM); the Spanish input “Que hambre tengo yo” yields candidate translations (“What hunger have I”, “Hungry I am so”, “Have I that hunger”, “I am so hungry”, “How hunger have I”, …), scored by the TM (3.7, 3.8, 4.1, 4.5, 7.2) and the LM (5.4, 7.0, 9.8, 0.2, 8.7); k-best rescoring picks “I am so hungry”]


k-best Rescoring Results
• the ISI syntax-based translation system
  – currently the best performing system on the Chinese-to-English task in NIST evaluations
  – based on synchronous grammars

                                                   BLEU score
  translation model (TM) only:                       24.45
  rescoring with trigram LM on 25000-best list:      34.58


Language Model: Integration

[figure: the same pipeline, but now with an integrated (LM-integrated) decoder that translates “Que hambre tengo yo” directly into “I am so hungry” (Knight and Koehn, 2003)]

computationally challenging! ☹


Integrated Decoding Results    (see my second talk for details!)
• the ISI syntax-based translation system
  – currently the best performing system on the Chinese-to-English task in NIST evaluations
  – based on synchronous grammars

                                                   BLEU score
  translation model (TM) only:                       24.45
  rescoring with trigram LM on 25000-best list:      34.58
  trigram integrated decoding:                       38.44
                                  (but over 100 times slower!)


Rescoring vs. Integration

[figure: quality vs. speed: rescoring is fast but poor; integration is good but slow; is there any compromise (on-the-fly rescoring)?]

Yes, forest rescoring: almost as fast as rescoring, and almost as good as integration.


Why Integration is Slow?

[figure: node VP(1, 6) with three incoming hyperedges, from PP(1, 3) VP(3, 6), PP(1, 4) VP(4, 6), and NP(1, 4) VP(4, 6); each tail splits into +LM items such as (PP “with ... Sharon”), (PP “along ... Sharon”), (VP “held ... talk”), (VP “held ... meeting”), which combine into items like (VP “held ... Shalong”) and (VP “hold ... Sharon”)]

• split each node into +LM items (with boundary words)
• beam search: only keep the top-k +LM items at each node
• but there are many ways to derive each node
• can we avoid enumerating all combinations?


Forest Rescoring
• with LM cost, we can only do k-best approximately
• apply k-best parsing Algorithm 2 to the hyperedges PP(1, 3) VP(3, 6), PP(1, 4) VP(4, 6), NP(1, 4) VP(4, 6):
  process all hyperedges simultaneously!
  – significant savings of computation


Forest Rescoring Results
• on the Hiero system (Chiang, 2005)
  – ~10-fold speed-up at the same level of BLEU
• on my syntax-directed system (Huang et al., 2006)
  – ~10-fold speed-up at the same level of search error
• on a typical phrase-based system (Pharaoh)
  – ~30-fold speed-up at the same level of search error
  – ~100-fold speed-up at the same level of BLEU
• also used in NIST evals by the ISI, CMU, and BBN syntax-based systems

see my ACL ’07 paper for details


Conclusions
• monotonic hypergraph formulation
  – the k-best derivations problem
• k-best algorithms
  – Algorithm 0 (naïve) to Algorithm 3 (lazy)
• experimental results
  – efficiency
  – accuracy (effectively searching over a larger space)
• applications in machine translation
  – k-best rescoring and forest rescoring


Thank you!! Questions? Comments?

• Liang Huang and David Chiang (2005). Better k-best Parsing. In Proceedings of IWPT, Vancouver, B.C.
• Liang Huang and David Chiang (2007). Forest Rescoring: Faster Decoding with Integrated Language Models. In Proceedings of ACL, Prague, Czech Rep.


Quality of the k-best lists

[figure: oracle accuracy as a function of k for Collins (2000)]


Syntax-based Experiments


Tree-to-String System
• syntax-directed, English to Chinese (Huang, Knight, Joshi, 2006)
  – the reverse direction is found in (Liu et al., 2006)
• synchronous tree-substitution grammars (STSG) (Galley et al., 2004; Eisner, 2003)
  – related to STAG (Shieber and Schabes, 1990)
• tested on 140 sentences; slightly better BLEU scores than Pharaoh

[figure: a synchronous tree pair: the English VP “was shot to death by the police” and its Chinese translation]


Speed vs. Search Quality

[figure: decoding speed vs. search quality]


Speed vs. Translation Accuracy

[figure: decoding speed vs. translation accuracy (BLEU)]


Cube Pruning
hyperedge: PP(1, 3) VP(3, 6) ⊢ VP(1, 6)
• without the LM, the grid of combination costs is monotonic; LM
  combination costs (e.g. from the bigram (meeting, with)) make it
  non-monotonic:

                                (PP with⋆Sharon)   (PP along⋆Sharon)   (PP with⋆Shalong)
                                      1.0                3.0                 8.0
  (VP held⋆meeting)     1.0      2.0 + 0.5 = 2.5    4.0 + 5.0 = 9.0     9.0 + 0.5 = 9.5
  (VP held⋆talk)        1.1      2.1 + 0.3 = 2.4    4.1 + 5.4 = 9.5     9.1 + 0.3 = 9.4
  (VP hold⋆conference)  3.5      4.5 + 0.6 = 5.1    6.5 + 10.5 = 17.0   11.5 + 0.6 = 12.1


Cube Pruning (continued)
• as in k-best parsing Algorithm 1: keep a priority queue of candidates
  – extract the best candidate and push its two successors
• but on the non-monotonic grid, items are popped out of order: 2.5, 2.4, 5.1
  – solution: keep a buffer of pop-ups
  – finally, re-sort the buffer and return in order: 2.4, 2.5, 5.1
(a sketch follows below)
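A sketch of cube pruning for one hyperedge, using additive costs (lower is better); the score tables are taken from the grids above, and the buffer re-sort handles the out-of-order pops:

```python
import heapq

def cube_prune(row_costs, col_costs, lm_cost, k):
    """Cube pruning for one hyperedge.

    row_costs/col_costs: sorted +LM item costs of the two tails;
    lm_cost(i, j): the (non-monotonic) LM combination cost, which is
    why popped items can come out of order and must be re-sorted.
    """
    heap = [(row_costs[0] + col_costs[0] + lm_cost(0, 0), 0, 0)]
    seen, buf = {(0, 0)}, []
    while heap and len(buf) < k:
        cost, i, j = heapq.heappop(heap)             # extract best candidate
        buf.append(cost)
        for i2, j2 in ((i + 1, j), (i, j + 1)):      # push the two successors
            if i2 < len(row_costs) and j2 < len(col_costs) and (i2, j2) not in seen:
                seen.add((i2, j2))
                c = row_costs[i2] + col_costs[j2] + lm_cost(i2, j2)
                heapq.heappush(heap, (c, i2, j2))
    return sorted(buf)                               # re-sort the pop-up buffer

# the grid from the slides: base costs plus LM combination costs
lm = {(0, 0): 0.5, (0, 1): 5.0, (0, 2): 0.5,
      (1, 0): 0.3, (1, 1): 5.4, (1, 2): 0.3,
      (2, 0): 0.6, (2, 1): 10.5, (2, 2): 0.6}
print(cube_prune([1.0, 1.1, 3.5], [1.0, 3.0, 8.0],
                 lambda i, j: lm[i, j], 3))          # [2.4, 2.5, 5.1]
```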
