Finite State Transducer mechanisms in speech recognition
Spoiler

◮ The "new" part of this talk is a "perfect" lattice generation algorithm.
◮ Informally, when we generate a lattice for an utterance, we want it to:
  ◮ Contain all sufficiently likely word sequences
  ◮ Have the "correct" alignments and likelihoods for those sequences
  ◮ Not be too large due to too-unlikely sequences or duplicate alignments
◮ Current lattice generation algorithms generally fall into two categories:
  ◮ Store one traceback for each state → quick, but approximate.
  ◮ Store N tracebacks → fairly exact, but slow.
◮ Our method stores one traceback, but is exact.

Daniel Povey, Finite State Transducer mechanisms in speech recognition
Outline

Introduction to Weighted Finite State Transducers (WFSTs)
  Finite State Acceptors (FSAs)
  Weighted Finite State Acceptors (WFSAs)
  Weighted Finite State Transducers (WFSTs)
WFSTs in speech recognition
  The "standard" recipe
  The Kaldi recipe
  Lattice generation
A less trivial FSA

[FSA diagram: 0 →a→ 1, 1 →a→ 1 (self-loop), 1 →b→ 2 (final)]

◮ The previous example doesn't show the power of FSAs, because we could represent the set of strings finitely.
◮ This example represents the infinite set {ab, aab, aaab, ...}
◮ Note: a string is "accepted" (included in the set) if:
  ◮ There is a path with that sequence of symbols on it.
  ◮ That path is "successful" (starts at an initial state, ends at a final state).
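The acceptance test described above is easy to sketch in code. This is a minimal illustration (not from the talk), assuming a dict-of-sets arc representation of my own choosing:

```python
# A sketch of the acceptance test: a string is accepted if some path with that
# label sequence leads from the initial state to a final state.
# Arcs are represented as (state, symbol) -> set of next states.

def accepts(arcs, finals, start, string):
    """Return True if the FSA accepts `string` (NFA-style set tracking)."""
    current = {start}                      # all states reachable so far
    for sym in string:
        nxt = set()
        for state in current:
            nxt |= arcs.get((state, sym), set())
        current = nxt
        if not current:                    # no path matches this prefix
            return False
    return bool(current & finals)          # successful = ends in a final state

# The FSA on this slide: accepts the infinite set {ab, aab, aaab, ...}
arcs = {(0, "a"): {1}, (1, "a"): {1}, (1, "b"): {2}}
finals = {2}
```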
The ε (epsilon) symbol

[FSA diagram: 0 →a→ 1 (final), with an ε arc from 1 back to 0]

◮ The symbol ε has a special meaning in FSAs (and FSTs):
◮ it means "no symbol is there".
◮ This example represents the set of strings {a, aa, aaa, ...}
◮ If ε were treated as a normal symbol, this would be {a, aεa, aεaεa, ...}
◮ In text form, ε is sometimes written as <eps>.
◮ Toolkits implementing FSAs/FSTs generally assume <eps> is the symbol numbered zero.
Deterministic FSAs

[Two FSA diagrams. Left: 0 →a→ 1, 0 →a→ 3, 1 →b→ 2, 3 →c→ 4. Right: 0 →a→ 1, 1 →b→ 2, 1 →c→ 3.]

◮ The right-hand FSA above is deterministic.
◮ This means: no state has two outgoing arcs with the same label.
◮ The classical definition of "deterministic" also forbids ε arcs.
◮ There is a definition that allows ε (Mohri/AT&T/OpenFst).
◮ Working out whether a given string (e.g. ab) is accepted is more efficient in deterministic FSAs.
Determinizing FSAs

[Diagram: the nondeterministic FSA (0 →a→ 1, 0 →a→ 3, 1 →b→ 2, 3 →c→ 4) and its determinization (0 →a→ 1, 1 →b→ 2, 1 →c→ 3)]

◮ There is a fairly easy algorithm to determinize FSAs.
◮ Each state in the determinized FSA corresponds to a set of states in the original.
◮ The algorithm starts from the start state and uses a queue...
◮ In this example, states 1 and 3 get combined as state 1.
◮ The classical approach has to treat ε specially.
◮ (Mohri's version simplifies things, making ε removal a separate stage.)
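The queue-driven subset construction outlined above can be sketched as follows. This is an illustration under my own assumptions (unweighted, ε-free, dict-of-sets arcs), not the toolkit's implementation:

```python
from collections import deque

# Subset-construction determinization: each determinized state is a frozenset
# of original states, and a queue drives exploration from the start state.

def determinize(arcs, finals, start):
    """arcs: dict (state, symbol) -> set of next states. Returns
    (det_arcs, det_finals, det_start); determinized states are frozensets."""
    det_start = frozenset([start])
    det_arcs, det_finals = {}, set()
    queue, seen = deque([det_start]), {det_start}
    while queue:
        subset = queue.popleft()
        if subset & finals:                 # final if any member is final
            det_finals.add(subset)
        # Group successor states by symbol.
        by_symbol = {}
        for state in subset:
            for (s, sym), dests in arcs.items():
                if s == state:
                    by_symbol.setdefault(sym, set()).update(dests)
        for sym, dests in by_symbol.items():
            nxt = frozenset(dests)
            det_arcs[(subset, sym)] = nxt   # one arc per symbol: deterministic
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return det_arcs, det_finals, det_start

# The slide's example: state 0 has two a-arcs (to 1 and 3), which the
# algorithm combines into the single determinized state {1, 3}.
arcs = {(0, "a"): {1, 3}, (1, "b"): {2}, (3, "c"): {4}}
det_arcs, det_finals, det_start = determinize(arcs, {2, 4}, 0)
```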
Minimal deterministic FSAs

[Two FSA diagrams accepting {abd, acd}: in the left one, the two d arcs lead to separate final states 4 and 5; in the right one they share a final state 4]

◮ Here, the left FSA is not minimal but the right one is.
◮ "Minimal" is normally only applied to deterministic FSAs.
◮ Think of it as suffix sharing, or combining redundant states.
◮ It's useful to save space (but not as crucial as determinization, for ASR).
Minimization of FSAs

◮ The common minimization algorithm partitions states into sets.
◮ Start out with a partition of just two sets (final/non-final).
◮ Keep splitting sets until we can't split any more.
◮ The resulting sets are the states of the minimal FSA.
◮ Set splitting is based on yes/no criteria like "does a state have a transition with symbol a to a member of set s?"
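The partition-refinement idea above can be sketched for a complete deterministic FSA. This is a minimal illustration under my own assumptions (a total transition function, completed with a hypothetical dead state), not the talk's implementation; note that on an example like the previous slide's it also merges the two states feeding the shared suffix:

```python
# Start from {final, non-final} and keep splitting any set whose members
# disagree on which set some symbol leads to (Moore-style refinement).

def minimize_partition(states, symbols, delta, finals):
    """delta: dict (state, symbol) -> state (total). Returns the final
    partition as a set of frozensets; each set is one minimal-FSA state."""
    partition = {frozenset(finals), frozenset(states - finals)}
    partition.discard(frozenset())
    changed = True
    while changed:
        changed = False
        # Map each state to the block it currently belongs to.
        block_of = {s: b for b in partition for s in b}
        new_partition = set()
        for block in partition:
            # Two states stay together iff every symbol leads to the same block.
            groups = {}
            for s in block:
                signature = tuple(block_of[delta[(s, sym)]] for sym in symbols)
                groups.setdefault(signature, set()).add(s)
            if len(groups) > 1:
                changed = True
            new_partition.update(frozenset(g) for g in groups.values())
        partition = new_partition
    return partition

# Hypothetical example: accepts {abd, acd}, completed with dead state 6.
states, symbols = {0, 1, 2, 3, 4, 5, 6}, ["a", "b", "c", "d"]
delta = {(s, x): 6 for s in states for x in symbols}
delta.update({(0, "a"): 1, (1, "b"): 2, (1, "c"): 3, (2, "d"): 4, (3, "d"): 5})
partition = minimize_partition(states, symbols, delta, {4, 5})
```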
Other algorithms for FSAs

◮ Equality and equivalence testing
◮ Reversing (like reversing the arrows)
◮ Trimming (removing unreachable states)
◮ Union, difference (view these in terms of the "set of strings")
◮ Concatenation (of strings accepted by two FSAs)
◮ Closure (allowing arbitrary repetitions of the accepted strings)
◮ Epsilon removal
FSAs/FSTs and testability

◮ Many useful FSA/FST operations preserve equivalence (e.g. determinization, minimization).
◮ We can test equivalence.
◮ This is useful for testing the correctness of algorithms like determinization and minimization.
◮ E.g.: generate a random FSA; determinize; check the result is deterministic; check it is equivalent to the original.
◮ This is one of the advantages of the FST framework.
◮ You could hand-code algorithms to solve your specific problems (e.g. decoding), but could you test them?
Weighted Finite State Acceptors: normal case

[WFSA diagram: 0 →a/1→ 1, 1 →b/1→ 2, final state 2/1]

◮ Like a normal FSA, but with costs on the arcs and final states.
◮ Note: the cost comes after the "/". For a final state, "2/1" means final cost 1 on state 2.
◮ View a WFSA as a function from a string to a cost.
◮ In this view, an unweighted FSA is f : string → {0, ∞}.
◮ If multiple paths have the same string, take the one with the lowest cost.
◮ This example maps ab to 3 (= 1 + 1 + 1), and everything else to ∞.
Semirings

◮ The semiring concept makes WFSAs more general.
◮ A semiring is:
  ◮ A set of elements (e.g. R)
  ◮ Two special elements 1̄ and 0̄ (the identity element and zero)
  ◮ Two operations, ⊕ (plus) and ⊗ (times), satisfying certain axioms.
◮ Examples:
  ◮ The reals, with ⊕ and ⊗ defined as the normal + and ×, 1̄ = 1, 0̄ = 0.
  ◮ The integers, as above.
  ◮ The nonnegative reals, as above.
  ◮ The nonnegative reals as above, except with x ⊕ y = max(x, y).
Semirings and Weighted Finite State Acceptors (WFSAs)

◮ The semirings normally used in ASR applications are:
  ◮ The tropical semiring, which is the simplest case (⊕ means: take the minimum cost; ⊗ means: add the costs).
  ◮ The log semiring, which is equivalent to the nonnegative reals, except representing them as negative logs.
◮ There are also "fancier" semirings used (e.g.) in determinization, where the weight contains a string component.
◮ In WFSAs, weights are ⊗-multiplied along paths.
◮ Weights are ⊕-summed over paths with identical symbol sequences.
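The two semirings named above can be written out as plain functions over costs (negative logs). This is a sketch of the standard definitions, not code from the talk:

```python
import math

# Tropical semiring: plus = min (keep the best path), times = add costs.
# Log semiring: plus = -log(e^-x + e^-y) (sum path probabilities), times = add.

def tropical_plus(x, y):
    return min(x, y)           # tropical "+": keep the lowest-cost path

def log_plus(x, y):
    # log "+": -log(e^-x + e^-y), computed stably via log1p.
    if math.isinf(x):
        return y
    if math.isinf(y):
        return x
    m = min(x, y)
    return m - math.log1p(math.exp(-abs(x - y)))

def times(x, y):
    return x + y               # "x" in both semirings: costs add along a path

ZERO = math.inf                # the 0-bar element: weight of "no path"
ONE = 0.0                      # the 1-bar element: identity for "times"
```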
Costs versus weights: terminology

[WFSA diagram: 0 →a/0→ 1 (final cost 0), 0 →b/2→ 2, 2 →c/1→ 3 (final cost 1)]

◮ Personally I use "cost" to refer to the numeric value, and "weight" when speaking abstractly, e.g.:
  ◮ The acceptor above accepts a with unit weight.
  ◮ It accepts a with zero cost.
  ◮ It accepts bc with cost 4 (= 2 + 1 + 1).
  ◮ State 1 is final with unit weight.
  ◮ The acceptor assigns zero weight to xyz.
  ◮ It assigns infinite cost to xyz.
Function interpretation of WFSAs

[WFSA diagram: the string ab has two paths, one through states (0, 1, 1) and one through (0, 2, 3)]

◮ Consider the WFSA above, with the tropical ("Viterbi-like") semiring. Take the string ab.
◮ We "multiply" (⊗) the weights along paths; this means adding the costs.
◮ Two paths for ab:
  ◮ One goes through states (0, 1, 1); its cost is (0 + 1 + 2) = 3.
  ◮ One goes through states (0, 2, 3); its cost is (1 + 2 + 2) = 5.
◮ We ⊕-add weights across different paths; tropical ⊕ is "take the minimum cost" → this WFSA maps ab to 3.
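The computation above can be sketched directly: in the tropical semiring, the cost a WFSA assigns to a string is the minimum over all matching successful paths of (sum of arc costs) + (final cost). The representation (dict of arc lists) and the example weights mirror the slide but are my own encoding:

```python
import math

# arcs: dict state -> list of (symbol, cost, next_state)
# finals: dict state -> final cost

def tropical_score(arcs, finals, start, string):
    """Min total cost over successful paths labeled `string` (else inf)."""
    best = {start: 0.0}                      # best cost to reach each state
    for sym in string:
        nxt = {}
        for state, cost in best.items():
            for label, w, dest in arcs.get(state, []):
                if label == sym:
                    c = cost + w             # "times": add costs along a path
                    if c < nxt.get(dest, math.inf):
                        nxt[dest] = c        # "plus": keep the minimum
        best = nxt
    return min((c + finals[s] for s, c in best.items() if s in finals),
               default=math.inf)

# A WFSA (hypothetical encoding) with two paths for ab, costing 3 and 5.
arcs = {0: [("a", 0.0, 1), ("a", 1.0, 2)],
        1: [("b", 1.0, 1)],
        2: [("b", 2.0, 3)]}
finals = {1: 2.0, 3: 2.0}
```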
Equivalence of WFSAs

[Two WFSA diagrams: 0 →a/0→ 1/1 with a b/1 self-loop on state 1, and 0 →a/1→ 1/0 with a b/1 self-loop on state 1]

◮ Two WFSAs are equivalent if they represent the same function from (string) to (weight).
◮ For example, the above two WFSAs are equivalent (but not equal).
◮ Equivalence is easy to test only if we can determinize.
◮ Otherwise there are randomized algorithms (pick a random string from one; make sure it has the same weight in both).
Determinization of Weighted Finite State Acceptors

◮ Determinization is also applicable to WFSAs.
◮ In the algorithm, (subset of states) → (weighted subset of states).
◮ Not all WFSAs are determinizable.
◮ If you try to determinize a nondeterminizable WFSA:
  ◮ There will be an infinite sequence of determinized states with the same subset of states but different weights.
  ◮ Determinization will fail to terminate; eventually memory will be exhausted.
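The weighted-subset idea can be sketched for the tropical semiring, ε-free acceptor case. This is a simplified illustration under my own assumptions (not Mohri's full algorithm): each determinized state is a set of (state, residual-weight) pairs, normalized so the smallest residual is 0, with the removed common weight going onto the arc.

```python
import math
from collections import deque

def determinize_tropical(arcs, finals, start):
    """arcs: dict state -> list of (symbol, cost, next_state);
    finals: dict state -> final cost. Returns (det_arcs, det_finals, det_start)
    where determinized states are frozensets of (state, residual) pairs and
    det_arcs maps (det_state, symbol) -> (arc cost, next det_state)."""
    det_start = frozenset([(start, 0.0)])
    det_arcs, det_finals = {}, {}
    queue, seen = deque([det_start]), {det_start}
    while queue:
        subset = queue.popleft()
        fin = min((r + finals[q] for q, r in subset if q in finals),
                  default=math.inf)
        if not math.isinf(fin):
            det_finals[subset] = fin
        by_symbol = {}                       # symbol -> {dest: best cost}
        for q, r in subset:
            for sym, w, dest in arcs.get(q, []):
                old = by_symbol.setdefault(sym, {}).get(dest, math.inf)
                by_symbol[sym][dest] = min(old, r + w)
        for sym, dests in by_symbol.items():
            common = min(dests.values())     # weight pushed onto the arc
            nxt = frozenset((d, w - common) for d, w in dests.items())
            det_arcs[(subset, sym)] = (common, nxt)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return det_arcs, det_finals, det_start

# Hypothetical example: two a-arcs from 0 (costs 1 and 2), rejoining on b.
d_arcs, d_finals, d_start = determinize_tropical(
    {0: [("a", 1.0, 1), ("a", 2.0, 2)], 1: [("b", 3.0, 3)], 2: [("b", 1.0, 3)]},
    {3: 0.0}, 0)
```

A nondeterminizable input shows up here exactly as the slide describes: the queue keeps producing states with the same underlying subset but ever-different residuals, and the loop never drains.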
The twins property

[WFSA diagram: 0 →a/1→ 1 and 0 →a/1→ 2; state 1 has a b/1 self-loop and an x/1 arc to final state 3/1; state 2 has a b/2 self-loop and a y/1 arc to final state 4/1]

◮ A WFSA is determinizable if it has the "twins property"...
◮ (That "if" is "iff", subject to certain extra conditions.)
◮ A WFSA (as above) fails to have the twins property¹ if:
  ◮ There are two states s and t...
  ◮ that are reachable from the start state with the same string...
  ◮ and there are cycles at both s and t...
  ◮ with the same string on them but different weights.

¹ We are glossing over some details here.
Other algorithms on Weighted Finite State Acceptors

◮ Shortest-path and shortest-distance algorithms, e.g.:
  ◮ Total weight of the FSA, summed over all "successful" paths, i.e. from initial to final.
  ◮ Total weight of all paths to a final state, starting at each state.
  ◮ Best path through the FSA.
  ◮ N-best paths through the FSA.
◮ We can also apply most unweighted FSA algorithms to the weighted case.
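The "best path" computation listed above can be sketched with a Dijkstra-style search, which works in the tropical semiring when arc costs are nonnegative. This is an illustration under my own representation, not any toolkit's shortest-path implementation:

```python
import heapq
import math

def best_path_cost(arcs, finals, start):
    """arcs: dict state -> list of (cost, next_state); finals: dict
    state -> final cost. Returns the minimum total cost over successful
    paths (initial to final), or inf if no final state is reachable."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    best = math.inf
    while heap:
        d, state = heapq.heappop(heap)
        if d > dist.get(state, math.inf):
            continue                       # stale queue entry, skip
        if state in finals:                # finishing here is one option
            best = min(best, d + finals[state])
        for w, dest in arcs.get(state, []):
            nd = d + w                     # costs add along the path
            if nd < dist.get(dest, math.inf):
                dist[dest] = nd
                heapq.heappush(heap, (nd, dest))
    return best
```

Replacing min with the log-semiring ⊕ would instead give the total weight summed over all successful paths (the first bullet), though that needs a different traversal when the automaton has cycles.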
Weighted Finite State Transducers (WFSTs)

[WFST diagram: 0 →a:x/1→ 1/0]

◮ Like a WFSA, except with two labels on each arc.
◮ View it as a function from a (pair of strings) to a weight.
◮ This one maps (a, x) to 1 and everything else to ∞.
◮ Note: view 1 and ∞ as costs. ∞ is 0̄ in the semiring.
◮ The symbols on the left and right are termed "input" and "output" symbols.
Equivalence of WFSTs

[Two WFST diagrams: 0 →a:x/1→ 1/0, and 0 →a:ε/1→ 1, 1 →ε:x/0→ 2/0]

◮ WFSTs are equivalent if they represent the same function from (string, string) to weight.
◮ Both these WFSTs map (a, x) to 1 and everything else to ∞.
◮ Therefore they are equivalent (although not equal).
WFSTs versus matrices

[WFST diagram: 0 →a:x/2→ 1/0, 1 →b:y/2→ 2/0, 1 →b:ε/3→ 2/0]

         x    xy
  a      2    ∞
  ab     5    4

◮ WFSTs can be viewed as a function from (input string, output string) to (weight).
◮ Matrices (over the reals) can be viewed as a function from (row index, column index) to (real).
◮ Note: the set of possible input and output strings is infinite.
Functional WFSTs

[Left WFST: 0 →a:x/1→ 1, 1 →ε:y/1→ 2/0, 0 →b:z/1→ 3/0. Right WFST: same, but with a:z/1 in place of b:z/1.]

◮ A WFST is functional iff each input string has a non-0̄ weight with at most one output string.
◮ The left transducer is functional: it maps a to xy with cost 2, and b to z with cost 1.
◮ The right transducer is non-functional:
  ◮ The string a has a nonzero weight with two output strings.
  ◮ I.e. (a, xy) and (a, z) both appear on successful paths.
◮ A WFST can be interpreted as a function from input to (output, weight) only if it is functional.
Inversion of WFSTs

[WFST diagram: 0 →a:x/1→ 1/0, and its inverse 0 →x:a/1→ 1/0]

◮ Inversion just means swapping the input and output symbols.
◮ If A is a WFST, we write its inverse as A⁻¹.
◮ This is only vaguely analogous to function inversion...
◮ In the matrix analogy, it's transposition.
Projection of WFSTs

[WFST diagram: arcs a:x/1 and b:y/1 from state 0 to final state 1/0; after input projection the arcs become a:a/1 and b:b/1]

◮ Projection on the input means copying the input symbols to the output, so both are identical.
◮ Projection on the output means copying the output symbols to the input.
◮ In FST toolkits like OpenFst, if the input and output symbols are identical, the WFST is termed an acceptor (WFSA) and can be treated as one.
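Inversion and projection are both one-line label manipulations. This sketch uses a hypothetical flat arc-tuple representation of my own (src, in_sym, out_sym, cost, dst), not any toolkit's API:

```python
def invert(arcs):
    """Inversion: swap input and output labels on every arc."""
    return [(s, o, i, w, d) for (s, i, o, w, d) in arcs]

def project(arcs, to_output=False):
    """Projection: copy input labels to output (default) or output labels
    to input, yielding an acceptor (both labels identical)."""
    if to_output:
        return [(s, o, o, w, d) for (s, i, o, w, d) in arcs]
    return [(s, i, i, w, d) for (s, i, o, w, d) in arcs]

# The slide's example transducer.
arcs = [(0, "a", "x", 1.0, 1), (0, "b", "y", 1.0, 1)]
```

Note that inversion is an involution (inverting twice gives the original back), which matches the transposition analogy on the previous slide.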
Composition of WFSTs

[Diagrams. A: 0 →a:x/1→ 1/0 and 0 →b:x/1→ 1/0. B: 0 →x:y/1→ 1/0. C: 0 →a:y/2→ 1/0 and 0 →b:y/2→ 1/0.]

◮ Notation: C = A ◦ B means C is A composed with B.
◮ In special cases, composition is similar to function composition.
◮ The composition algorithm "matches up" the "inner symbols",
◮ i.e. those on the output (right) of A and the input (left) of B.
Composition of WFSTs vs matrix multiplication

◮ Composition of WFSTs corresponds exactly to matrix multiplication.
◮ Let s, t and u be strings, and w be a weight.
◮ Interpret A, B, C as functions from (string, string) to (weight).
◮ Then C(s, u) = ⊕_t A(s, t) ⊗ B(t, u) (sum over, and discard, the central string t).
◮ Matrix multiplication follows the same pattern (sum over the central index).
Composition of WFSTs (algorithm)

◮ Ignoring ε symbols, the algorithm is quite simple.
◮ States in C correspond to tuples of (state in A, state in B).
  ◮ But some of these may be inaccessible and pruned away.
◮ Maintain a queue of pairs, initially the single pair (0, 0) (the start states).
◮ When processing a pair (s, t):
  ◮ Consider each pair of (arc a from s), (arc b from t).
  ◮ If these have matching symbols (output of a, input of b):
    ◮ Create a transition to the state in C corresponding to (next-state of a, next-state of b).
    ◮ If not seen before, add this pair to the queue.
◮ With ε involved, we need to be careful to avoid redundant paths...
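The steps above can be sketched directly. This is a minimal ε-free version in Python; the dict representation and arc fields are my own, not Kaldi's or OpenFst's:

```python
from collections import deque

def compose(A, B):
    """Epsilon-free WFST composition in the tropical semiring (costs add).

    Each FST maps state -> list of arcs (in_sym, out_sym, cost, next_state),
    with 0 as the start state. States of C are (state-in-A, state-in-B)
    pairs, and only pairs reachable from (0, 0) are ever created.
    """
    arcs_out = {(0, 0): []}
    queue = deque([(0, 0)])
    while queue:
        s, t = queue.popleft()
        for a_in, a_out, a_cost, a_next in A.get(s, []):
            for b_in, b_out, b_cost, b_next in B.get(t, []):
                if a_out == b_in:  # match output of A with input of B
                    nxt = (a_next, b_next)
                    if nxt not in arcs_out:
                        arcs_out[nxt] = []
                        queue.append(nxt)
                    arcs_out[(s, t)].append((a_in, b_out, a_cost + b_cost, nxt))
    return arcs_out

# The a:x / b:x / x:y example from the earlier composition slide:
A = {0: [("a", "x", 1.0, 1), ("b", "x", 1.0, 1)]}
B = {0: [("x", "y", 1.0, 1)]}
C = compose(A, B)
```

Running this reproduces the C of that slide: two arcs a:y/2 and b:y/2 from the composed start state.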
Deterministic WFSTs

0 -(a:x/1)-> 1/0,  0 -(b:x/1)-> 1/0

◮ Taken to mean "deterministic on the input symbol".
◮ I.e., no state can have > 1 arc out of it with the same input symbol.
◮ Some interpretations (e.g. Mohri/AT&T/OpenFst) allow ε input symbols (i.e. being ε-free is a separate issue).
◮ I prefer a definition that disallows epsilons, except as necessary to encode a string of output symbols on an arc.
◮ Regardless of the definition, not all WFSTs can be determinized.
Determinizing WFSTs

◮ Easiest to view the output symbol as part of the weight and determinize as a WFSA.
  ◮ Encode the WFST as a WFSA in an appropriate semiring that contains the weight and the string.
  ◮ Do WFSA determinization (and hope it succeeds).
◮ Two basic failure modes:
  ◮ fails because the input FST was non-functional (multiple outputs for one input);
  ◮ fails because the WFSA fails to have the twins property, i.e. has cycles with the same input-symbol sequence but different weights.
◮ Can also fail if there are paths with more output than input symbols and you don't want to introduce ε-input arcs (I allow these as a special case).
Speech recognition application of WFSTs

◮ HCLG ≡ H ◦ C ◦ L ◦ G is the recognition graph
  ◮ G is the grammar or LM (an acceptor)
  ◮ L is the lexicon
  ◮ C adds phonetic context-dependency
  ◮ H specifies the HMM structure of context-dependent phones

      Input            Output
  H   p.d.f. labels    ctx-dep. phones
  C   ctx-dep. phones  phones
  L   phones           words
  G   words            words
Decoding graph construction (simple version)

◮ Create H, C, L, G separately.
◮ Compose them together.
◮ Determinize (like making a tree-structured lexicon).
◮ Minimize (like suffix sharing).
Decoding graph construction (complexities)

◮ Have to do things in a careful order or the algorithms "blow up".
◮ Determinization for WFSTs can fail.
  ◮ Need to insert "disambiguation symbols" into the lexicon.
  ◮ Need to "propagate these through" H and C.
◮ Need to make sure the final HCLG is stochastic, for optimal pruning performance.
  ◮ I.e. sums to one, like a properly normalized HMM.
  ◮ The standard algorithm to do this (weight-pushing) can fail...
  ◮ ...because the FST representation of backoff LMs is non-stochastic.
◮ Also, we can't always recover the phone sequence from a path through HCLG (but see later).
Decoding graph construction (our approach)

◮ Mostly followed the AT&T recipe quite closely.
◮ Added a disambiguation symbol for backoff arcs in the LM FST.
  ◮ Partly for reasons of taste: not comfortable with treating ε as a "real symbol".
  ◮ It's actually necessary because we use a determinization algorithm that does ε removal.
◮ Different approach to "weight pushing": see next slide.
◮ Put special symbols on the input of HCLG that encode more than the p.d.f.: see below.
Our approach to weight pushing

◮ Want to ensure that HCLG can be interpreted as a properly normalized HMM ("stochastic").
◮ The AT&T recipe does this as a final "weight pushing" step.
  ◮ This can fail for backoff LMs, or lead to weird results.
◮ Our approach is to ensure that if G is stochastic, each step of graph creation keeps it stochastic.
  ◮ This is done in such a way that where G is "nearly" stochastic, HCLG will be "nearly" stochastic.
◮ Requires doing some algorithms (e.g. determinization) in the log semiring.
◮ Some algorithms (e.g. epsilon removal) need to "know about" two semirings at once.
Symbols on the input of HCLG: the problem

◮ Normally, the symbols on the input of H (or HCLG) represent p.d.f.'s.
  ◮ The symbol would actually be the integer identifier of the p.d.f., with range e.g. 5k if there are 5k p.d.f.'s.
◮ The decoder will give you an input-symbol string I, an output-symbol string O, and a weight w.
◮ We want I (the input-symbol sequence) to give enough information to
  ◮ train transition models,
  ◮ work out the phone sequence,
  ◮ reconstruct the path through the HMM.
◮ We don't want to have to assume different phones have distinct p.d.f.'s.
Symbols on the input of HCLG: our solution

◮ Make the symbols on HCLG a slightly finer-grained identifier.
◮ We call this a "transition id".
◮ From a transition id, we can work out
  ◮ the p.d.f. index,
  ◮ the phone,
  ◮ the transition arc that was taken within the prototype HMM topology.
◮ There would typically be about twice as many transition-ids as p.d.f. ids.
◮ Doesn't expand the graph size, in normal configurations.
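As an illustration only (this flat numbering scheme is hypothetical, not Kaldi's actual one), one can enumerate every (phone, HMM-state, arc) transition of the topology once and use the index as the transition-id; the inverse map then recovers the phone and p.d.f. even when two phones share a p.d.f.:

```python
def build_transition_ids(topology):
    """topology: list of (phone, hmm_state, arc_index, pdf_id) entries,
    one per transition in the prototype HMMs. Returns maps in both
    directions; transition-ids start at 1 (0 is reserved for epsilon)."""
    id_to_info = dict(enumerate(topology, start=1))
    info_to_id = {info: tid for tid, info in id_to_info.items()}
    return id_to_info, info_to_id

# Two single-state phones sharing pdf 0, each with a self-loop (arc 0)
# and a forward arc (arc 1):
topo = [("a", 0, 0, 0), ("a", 0, 1, 0), ("b", 0, 0, 0), ("b", 0, 1, 0)]
id_to_info, info_to_id = build_transition_ids(topo)
```

Note that transition-id 3 maps back to phone "b" even though both phones use p.d.f. 0, which is exactly the information a bare p.d.f. label would lose; and there are two transition-ids per HMM state, matching the "about twice as many" figure.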
Decoding characteristics

◮ For a simple, medium-vocabulary task (Resource Management), about 20x faster than real time.
◮ We can decode Wall Street Journal read speech in about real time without significant loss in accuracy.
  ◮ This is with a pruned trigram language model: about 0.8M arcs.
  ◮ The final HCLG takes about 0.5 GB in the default OpenFst format.
  ◮ Can't go much larger, or memory for graph compilation becomes a problem.
◮ Note: our HCLG currently has self-loops.
Ways to decode with bigger LMs

◮ Lattice rescoring
  ◮ We can already do this (see lattice generation, below).
◮ Store HCLG in a more compact form
  ◮ E.g. remove self-loops (the decoder would add them).
  ◮ Could then "factor" the FST by compactly storing linear sequences of states.
  ◮ Could decrease the final graph size 10-fold, but graph creation would still require a lot of memory.
◮ Online "correction" of LM scores in HCLG
  ◮ The idea is to add the difference in LM score between the (small, big) LMs when we cross a word.
  ◮ Already done preliminary work (w/ Gilles Boulianne).
◮ OpenFst's "fast lookahead matcher": need to investigate this.
Lattice generation

◮ From now on, we will describe how lattice generation in Kaldi works.
◮ Up to now, we have described standard concepts (with a few minor twists).
◮ Here starts the "novel" part.
◮ We will begin with some preliminaries before describing the algorithm.
What is a lattice?

◮ The word "lattice" is used in the ASR literature to mean:
  ◮ some kind of compact representation of the alternative word hypotheses for an utterance;
  ◮ like an N-best list, but with less redundancy.
◮ Usually has time information, sometimes state or word alignment information.
◮ Generally a directed acyclic graph with only one start node and only one end node.
WFST view of a speech recognition decoder

[Figure: the acceptor U for a 3-frame utterance. States 0 through 3, with state 3 final (weight 0); between each pair of consecutive states there is one arc per p.d.f., labeled p.d.f.-id/acoustic-cost, e.g. 1/4.86, 2/4.94, 3/5.16, ...]

◮ First, a "WFST definition" of the decoding problem.
◮ Let U be an FST that encodes the acoustic scores of an utterance (as above).
◮ Let S = U ◦ HCLG be called the search graph for an utterance.
◮ Note: if U has N frames (3, above), then
  ◮ #states in S is ≤ (N + 1) times #states in HCLG.
  ◮ Like N + 1 copies of HCLG.
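A sketch of how such a U could be built (the arc format is mine; acoustic log-likelihoods become costs by negation):

```python
def make_utterance_fst(loglikes):
    """Build the acceptor U: one state per frame boundary; between
    states t and t+1, one arc per p.d.f., weighted by the negated
    acoustic log-likelihood of that p.d.f. on frame t.

    loglikes[t][p] = log-likelihood of p.d.f. p+1 on frame t.
    Returns arcs (src, dst, pdf_label, cost); state len(loglikes) is final.
    """
    return [(t, t + 1, p, -ll)
            for t, frame in enumerate(loglikes)
            for p, ll in enumerate(frame, start=1)]
```

Composing this U with HCLG attaches an acoustic cost to every p.d.f. label on every frame, which is what makes S a complete search space for the utterance.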
Pruning in a speech recognition decoder

◮ With beam pruning, we search a subgraph of S.
  ◮ The set of "active states" on all frames, with arcs linking them as appropriate, is a subgraph of S.
  ◮ Let this be called the beam-pruned subgraph of S; call it B.
◮ A standard speech recognition decoder finds the best path through B.
◮ In our case, the output of the decoder is a linear WFST that consists of this best path.
◮ This contains the following useful information:
  ◮ the word sequence, as output symbols;
  ◮ the state alignment, as input symbols;
  ◮ the cost of the path, as the total weight.
A lattice: our format

◮ We let a lattice be an FST, similar to HCLG.
  ◮ The input symbols will be the p.d.f.'s (actually, transition-ids).
  ◮ The output symbols will be words.
  ◮ The weights will contain the acoustic costs plus the costs in HCLG.
  ◮ ...well, actually we use a semiring that contains two floats:
    ◮ keeps track of the (graph, acoustic) costs separately while behaving as if they were added together.
◮ We also have an "acceptor" form of the lattice. Here,
  ◮ words appear on both the input and output side;
  ◮ the p.d.f.'s/transition-ids are encoded as part of the weight.
  ◮ This format is more compact, and sometimes more convenient.
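A minimal sketch of such a two-float weight (modeled on, but not copied from, Kaldi's lattice weight): the graph and acoustic costs ride along separately, but ⊕ and comparisons use their sum:

```python
class LatticeWeight:
    """Pair of (graph_cost, acoustic_cost) that behaves as their sum."""

    def __init__(self, graph, acoustic):
        self.graph, self.acoustic = graph, acoustic

    def total(self):
        return self.graph + self.acoustic

    def times(self, other):  # (x): costs add componentwise along a path
        return LatticeWeight(self.graph + other.graph,
                             self.acoustic + other.acoustic)

    def plus(self, other):   # (+), tropical: keep the lower total cost
        return self if self.total() <= other.total() else other
```

Keeping the two components separate is what makes later LM rescoring cheap: the graph cost can be replaced while the acoustic cost is reused.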
Our statement of the lattice generation problem

◮ Let the lattice beam be α (e.g. α = 8).
◮ No missing paths:
  ◮ every word sequence whose best path in B is within α of the overall best path should be present in L.
◮ No extra paths:
  ◮ no path should be in L if there is not an equivalent path in B.
◮ No duplicate paths:
  ◮ no two paths in L should have the same word sequence.
◮ Accurate paths:
  ◮ each path through L should be equivalent to the best path through B with the same word sequence.
Our solution to the lattice generation problem

◮ During decoding, generate B (the beam-pruned subgraph of S).
◮ Prune B with beam α.
  ◮ I.e. prune all arcs not on a path within α of the best path.
  ◮ Actually we do this in an online way to avoid memory bloat.
  ◮ This is not the same as the beam pruning used in search.
◮ Let this pruned version of B be called P.
◮ Do a special form of determinization on P (next slide).
◮ Prune the result with beam α.
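The α-pruning step ("prune all arcs not on a path within α of the best path") can be sketched with a forward/backward pass over a DAG. This is a toy offline version; the online variant mentioned above is more involved:

```python
def prune_lattice(arcs, start, final, alpha):
    """Keep only arcs lying on some path whose cost is within alpha
    of the best path's cost.

    arcs: list of (src, dst, cost) in a DAG whose states are
    topologically numbered (src < dst). Returns the surviving arcs.
    """
    INF = float("inf")
    fwd = {start: 0.0}  # best cost from start to each state
    for s, d, w in sorted(arcs):
        if s in fwd:
            fwd[d] = min(fwd.get(d, INF), fwd[s] + w)
    bwd = {final: 0.0}  # best cost from each state to final
    for s, d, w in sorted(arcs, reverse=True):
        if d in bwd:
            bwd[s] = min(bwd.get(s, INF), w + bwd[d])
    best = fwd[final]
    return [(s, d, w) for (s, d, w) in arcs
            if fwd.get(s, INF) + w + bwd.get(d, INF) <= best + alpha]
```

An arc survives exactly when the best complete path through it costs at most best + α, which matches the "no missing paths / no extra paths" properties of the previous slide.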
Lattice determinization

◮ A special determinization algorithm we call "lattice determinization".
◮ This is not really a determinization algorithm in a WFST sense, as it does not preserve equivalence.
◮ It keeps only the lowest-cost output-symbol sequence for each input-symbol sequence.
  ◮ Note: we'd apply it after inversion, so words are on the input.
◮ Conceptual view (not how we really do it):
  ◮ convert the WFST to an acceptor in a special semiring (see below);
  ◮ remove epsilons from that acceptor;
  ◮ determinize;
  ◮ convert back.
Converting an FST to an acceptor

◮ Suppose we had an arc in B with "1520:HI/3.67" on it.
  ◮ I.e. the input symbol is 1520, the output is "HI", and the cost is 3.67.
◮ First, invert → "HI:1520/3.67".
◮ Next, convert to an acceptor → "HI:HI/(3.67,1520)".
  ◮ Remember: acceptors have the same input/output symbols on arcs.
  ◮ The weight (after "/") is a tuple of (old weight, string).
◮ Other valid weights: "(12.20, 1520 1520 1629)", "(-2.64,)" (empty string).
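The per-arc conversion is mechanical; here is a small sketch (representation mine) where the new weight is a (cost, symbol-sequence) tuple:

```python
def convert_arc(in_sym, out_sym, cost):
    """Turn an FST arc like 1520:HI/3.67 into an acceptor arc:
    invert so the word is the symbol, then fold the old input symbol
    into the weight, which becomes a (cost, symbol-sequence) pair.
    Input symbol 0 stands for epsilon and contributes the empty string."""
    string = (in_sym,) if in_sym != 0 else ()
    return out_sym, (cost, string)
```

The second example below reproduces the "(-2.64,)" empty-string weight from the slide: an ε-input arc carries a cost but no transition-id.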
Acceptor determinization: conventional semiring

◮ Multiplication (⊗) is ⊗-multiplying the weights and concatenating the strings.
◮ Addition (⊕) is only defined if the string part is the same:
  ◮ it just corresponds to ⊕-adding the weights in their semiring.
◮ I.e., addition will crash if the strings are different (this is the way it's done in OpenFst, anyway).
  ◮ This is not really a proper semiring...
  ◮ ...it is designed to detect non-functional FSTs and to fail then.
◮ The common divisor of (w, s) and (x, t) is (w ⊕ x, u),
  ◮ where u is the longest common prefix of s and t.
  ◮ This is an operation needed for determinization (to normalize subsets).
Semiring for lattice determinization

◮ Only defined if there is a total order on the weight part, with ⊕ on the weights corresponding to taking the max.
◮ Multiplication is as before.
◮ Addition means taking the pair that has the larger weight part.
  ◮ We need to be a bit careful when the weights are the same.
  ◮ This is for mathematical purity; it wouldn't make a practical difference.
  ◮ It's a question of satisfying the semiring axioms.
  ◮ We take the shorter string (if the lengths differ), then use lexicographical order.
  ◮ Simple lexicographical order would have violated one of the axioms (distributivity of ⊗ over ⊕).
◮ The common divisor is as before.
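A sketch of the three operations on (cost, string) weights. The weights are costs here, so "better" means lower (the slide's "max" is with respect to the semiring's natural order); the tie-breaking in ⊕ follows the rule above, shorter string first, then lexicographic:

```python
def times(w1, w2):
    """(x): add the costs, concatenate the strings."""
    return (w1[0] + w2[0], w1[1] + w2[1])

def lattice_plus(w1, w2):
    """(+): keep the pair with the better (lower) cost; on exact cost
    ties, prefer the shorter string, then the lexicographically
    smaller one, so the semiring axioms hold."""
    key = lambda w: (w[0], len(w[1]), w[1])
    return min(w1, w2, key=key)

def common_divisor(w1, w2):
    """Better cost, plus the longest common prefix of the strings
    (used to normalize subsets during determinization)."""
    prefix = []
    for a, b in zip(w1[1], w2[1]):
        if a != b:
            break
        prefix.append(a)
    return (min(w1[0], w2[0]), tuple(prefix))
```

Unlike the conventional semiring of the previous slide, `lattice_plus` never fails on differing strings: it simply discards the worse alternative, which is exactly the "keep only the lowest-cost sequence" behavior lattice determinization needs.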
Lattice determinization vs. conventional determinization

◮ Let's compare them where they are both defined (for the tropical semiring, which is like Viterbi).
◮ When creating a weighted subset of states...
◮ Suppose a particular state appears twice in the subset (i.e. we process > 1 links to that state)...
◮ If the "weights" on that state have the same string part, both semirings behave the same.
◮ Otherwise (different string part):
  ◮ lattice determinization will continue happily, choosing the string with the better weight;
  ◮ conventional determinization will say "I detected a non-functional FST" and fail.
Lattice determinization in practice

◮ The "OpenFst"/AT&T way to do it would be to:
  ◮ convert to our special semiring; remove epsilons; determinize; convert back.
◮ We don't do it this way, because the lattices would have a lot of input-epsilons:
  ◮ the epsilon-removal step would "blow up".
◮ Instead, we wrote a determinization algorithm that removes epsilons itself.
  ◮ This is a bit more complex and harder to formally prove correct (probably why Mohri doesn't like it).
  ◮ But it is easy to verify using random examples (we can check equivalence and check whether the result is deterministic).
  ◮ We optimized it by using efficient data structures to store the strings.
Lattice generation: simple version

◮ If we had unlimited memory, the basic process would be:
  ◮ Generate the beam-pruned subgraph B of the search graph S.
    ◮ The states in B correspond to the active states on particular frames.
  ◮ Prune B with beam α to get the pruned version P.
  ◮ Convert P to an acceptor and lattice-determinize to get A (a deterministic acceptor).
  ◮ Prune A with beam α to get the final lattice L (in acceptor form).
Lattice generation: real version

◮ B would be too large to easily fit in memory, so we prune as we go, not at the end.
  ◮ We have a special algorithm for this that's efficient and exact.
  ◮ I.e. it won't prune anything we wouldn't have pruned away at the end.
◮ Determinization of P can use a lot of memory for some utterances (if α is large, e.g. ≥ 10).
  ◮ This is caused by links being generated that would have been pruned away later anyway.
  ◮ We could create a determinization algorithm that would avoid this bloat...
  ◮ ...but it would probably be quite slow.
  ◮ For now we just detect this, reduce the beam a bit, and try again.
Lattice generation: why our algorithm is good

◮ We can prove that lattices generated in this way will satisfy the properties we stated (we have to slightly weaken one of the properties).
◮ We can get "exact" lattices without approximation (e.g. the "word-pair approximation" of Ney), while only storing a single traceback.
◮ No extra time cost for generating lattices (for small-ish α).
  ◮ This is unlike the traditional "more exact" approaches, which have to store multiple tokens in each state.
  ◮ ...and those approaches are not even as exact as ours.
◮ We have verified that LM rescoring of these lattices gives the same results as decoding directly with the original LM.
◮ We will also verify, using an N-best algorithm, that the lattices have the properties we stated.
Lattice generation: experimental aspects

◮ We have verified that LM rescoring of these lattices gives the same results as decoding directly with the original LM (as beams get large enough).
◮ Will also verify using an N-best algorithm that the lattices have the properties we stated.
◮ Also measuring oracle WERs of lattices:
  ◮ This is mainly because people expect it.
  ◮ I'm not convinced that oracle WER is a relevant metric (consider a single word-loop).
◮ Not presenting detailed results here (not ready yet / not that useful anyway).
◮ Hard to compare experimentally with previous approaches:
  ◮ These are often either not published, not completely described, or too tied to a particular type of decoder.
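Oracle WER is the error rate of the closest-to-reference path in the lattice (or N-best list). A minimal sketch over an N-best list, with illustrative function names:

```python
# Minimal oracle-WER sketch over an N-best list (illustrative only).
# Oracle WER = edit distance from the reference to the *closest*
# hypothesis, divided by the reference length.

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    d = list(range(len(hyp) + 1))  # DP row for the empty-reference case
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution/match
    return d[len(hyp)]

def oracle_wer(ref, nbest):
    """Best achievable WER over a list of hypotheses."""
    return min(edit_distance(ref, hyp) for hyp in nbest) / max(1, len(ref))
```

This also makes the word-loop objection concrete: a lattice containing a loop over the whole vocabulary admits every word sequence, so its oracle WER is 0 even though it carries no useful information.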
Lattice generation: summary

◮ I believe this is the "right" way to do lattice generation.
  ◮ No compromise between speed and exactness.
  ◮ Lattices have clearly defined properties that can be verified.
◮ Interesting algorithm:
  ◮ A common objection to WFSTs is, "I could code this more easily without WFSTs."
  ◮ Many WFST algorithms have simple analogs, e.g. determinization = tree-structured lexicon; minimization = suffix sharing.
  ◮ This algorithm is hard to explain without WFSTs, and shows the power of the framework.
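The "determinization = tree-structured lexicon" analogy can be made concrete: determinizing a lexicon transducer merges arcs that share the same input label, which is exactly what building a trie over pronunciations does. A toy illustration (the lexicon entries and the `"#word"` marker are made up for the example):

```python
# Toy illustration of "determinization = tree-structured lexicon":
# building a trie over pronunciations merges shared phone prefixes,
# just as determinization merges arcs with the same input label.

def build_lexicon_trie(lexicon):
    """lexicon: dict mapping word -> list of phones.

    Returns a nested-dict trie; each word label is stored under the
    "#word" key at the end of its pronunciation path.
    """
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})  # one branch per distinct phone
        node["#word"] = word
    return root
```

For example, "bat" (b ae t) and "bad" (b ae d) share the prefix b-ae, so the trie has a single b → ae path that branches only at the last phone; determinizing the corresponding union-of-pronunciations FST produces the same sharing.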
The end

◮ Questions?