… predicted. Considering for clarity the bigram case,[6] given G(.) the language model has the terms w_i, w_{i-1}, G(w_i) and G(w_{i-1}) available to it. The probability estimate can be decomposed as follows:

\[
P_{\text{class}'}(w_i \mid w_{i-1}) = P(w_i \mid G(w_i), G(w_{i-1}), w_{i-1}) \times P(G(w_i) \mid G(w_{i-1}), w_{i-1}) \tag{14.7}
\]

It is assumed that P(w_i | G(w_i), G(w_{i-1}), w_{i-1}) is independent of G(w_{i-1}) and w_{i-1}, and that P(G(w_i) | G(w_{i-1}), w_{i-1}) is independent of w_{i-1}, resulting in the model:

\[
P_{\text{class}}(w_i \mid w_{i-1}) = P(w_i \mid G(w_i)) \times P(G(w_i) \mid G(w_{i-1})) \tag{14.8}
\]

Almost all reported class n-gram work using statistically-found classes is based on clustering algorithms which optimise G(.) on the basis of bigram training-set likelihood, even if the class map is to be used with longer-context models. It is interesting to note, however, that this approximation appears to work well, suggesting that the class maps found are in some respects "general" and capture features of natural language which apply irrespective of the context length used when finding them.

14.2 Statistically-derived Class Maps

An obvious question that arises is how to compute or otherwise obtain a class map for use in a language model. This section discusses one strategy which has been used successfully.

Methods of statistical class map construction seek to maximise the likelihood of the training text given the class model by making iterative controlled changes to an initial class map; to make this problem more computationally feasible, they typically use a deterministic map.

14.2.1 Word exchange algorithm

[Kneser and Ney 1993][7] describes an algorithm which builds a class map by starting from some initial guess at a solution and then iteratively searching for changes that improve the existing class map. This is repeated until some minimum change threshold has been reached or a chosen number of iterations have been performed. The initial guess at a class map is typically chosen by a simple method, such as randomly distributing words amongst classes, or placing all words in the first class except for the most frequent words, which are put into singleton classes. Potential moves are then evaluated, and those which most increase the likelihood of the training text are applied to the class map. The algorithm is described in detail below, and is implemented in the HTK tool Cluster.

Let W be the training text, a list of words (w_1, w_2, w_3, ...), and let 𝒲 be the set of all words occurring in W. From equation 14.1 it follows that:

\[
P_{\text{class}}(\mathbf{W}) = \prod_{x,y \in \mathcal{W}} P_{\text{class}}(x \mid y)^{C(x,y)} \tag{14.9}
\]

where (x, y) is some word pair 'x' preceded by 'y', and C(x, y) is the number of times that the word pair 'y x' occurs in the list W.

In general, evaluating equation 14.9 will lead to problematically small values, so logarithms are used:

\[
\log P_{\text{class}}(\mathbf{W}) = \sum_{x,y \in \mathcal{W}} C(x,y) \log P_{\text{class}}(x \mid y) \tag{14.10}
\]

[6] By convention, unigram refers to a 1-gram, bigram indicates a 2-gram and trigram is a 3-gram; there is no standard term for a 4-gram.
[7] R. Kneser and H. Ney, "Improved Clustering Techniques for Class-Based Statistical Language Modelling", Proceedings of the European Conference on Speech Communication and Technology, 1993, pp. 973-976.
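To make the decomposition of equation 14.8 and the log-likelihood sum of equation 14.10 concrete, the following sketch applies them to a toy class map with hand-picked probability tables. It is purely illustrative Python rather than HTK code, and every name, word and number in it is invented for the example.

```python
# Minimal sketch (not HTK code) of the class bigram model of equation 14.8 and the
# log-likelihood of equation 14.10, using a hypothetical toy class map and
# hand-picked probability tables.
import math

# Deterministic class map G(.)
G = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN", "sat": "VERB", "ran": "VERB"}

# P(w | G(w)): probability of a word given its own class
p_word_given_class = {
    "the": 0.7, "a": 0.3,      # class DET
    "cat": 0.5, "dog": 0.5,    # class NOUN
    "sat": 0.6, "ran": 0.4,    # class VERB
}

# P(G(w_i) | G(w_{i-1})): class transition probabilities
p_class_given_class = {
    ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.8, ("NOUN", "DET"): 0.2,
    ("VERB", "DET"): 0.7, ("VERB", "NOUN"): 0.3,
}

def p_class_bigram(w_cur, w_prev):
    """Equation 14.8: P_class(w_i | w_{i-1}) = P(w_i | G(w_i)) * P(G(w_i) | G(w_{i-1}))."""
    return p_word_given_class[w_cur] * p_class_given_class[(G[w_prev], G[w_cur])]

def log_likelihood(words):
    """Equation 14.10: sum the log probability of every bigram in the word list W."""
    return sum(math.log(p_class_bigram(cur, prev)) for prev, cur in zip(words, words[1:]))

print(p_class_bigram("cat", "the"))               # 0.5 * 0.9 = 0.45
print(log_likelihood(["the", "cat", "sat"]))      # log 0.45 + log 0.48 = approx. -1.53
```

Summing over token bigrams, as above, is equivalent to the count-weighted sum over word pairs in equation 14.10, and only one P(w | G(w)) entry per word plus one transition per class pair is needed, which illustrates the reduction in parameters that motivates class n-gram models.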

Given the definition of a class n-gram model in equation 14.8, the maximum likelihood bigram probability estimate of a word is:

\[
P_{\text{class}}(w_i \mid w_{i-1}) = \frac{C(w_i)}{C(G(w_i))} \times \frac{C(G(w_i), G(w_{i-1}))}{C(G(w_{i-1}))} \tag{14.11}
\]

where C(w) is the number of times that the word 'w' occurs in the list W, and C(G(w)) is the number of times that the class G(w) occurs in the list resulting from applying G(.) to each entry of W;[8] similarly, C(G(w_x), G(w_y)) is the count of the class pair 'G(w_y) G(w_x)' in that resultant list.

Substituting equation 14.11 into equation 14.10 and then rearranging gives:

\[
\begin{aligned}
\log P_{\text{class}}(\mathbf{W})
  &= \sum_{x,y \in \mathcal{W}} C(x,y) \log \left( \frac{C(x)}{C(G(x))} \times \frac{C(G(x), G(y))}{C(G(y))} \right) \\
  &= \sum_{x,y \in \mathcal{W}} C(x,y) \log \frac{C(x)}{C(G(x))}
   + \sum_{x,y \in \mathcal{W}} C(x,y) \log \frac{C(G(x), G(y))}{C(G(y))} \\
  &= \sum_{x \in \mathcal{W}} C(x) \log \frac{C(x)}{C(G(x))}
   + \sum_{g,h \in \mathcal{G}} C(g,h) \log \frac{C(g,h)}{C(h)} \\
  &= \sum_{x \in \mathcal{W}} C(x) \log C(x) - \sum_{x \in \mathcal{W}} C(x) \log C(G(x))
   + \sum_{g,h \in \mathcal{G}} C(g,h) \log C(g,h) - \sum_{g \in \mathcal{G}} C(g) \log C(g) \\
  &= \sum_{x \in \mathcal{W}} C(x) \log C(x)
   + \sum_{g,h \in \mathcal{G}} C(g,h) \log C(g,h)
   - 2 \sum_{g \in \mathcal{G}} C(g) \log C(g)
\end{aligned} \tag{14.12}
\]

where (g, h) is some class sequence 'h g'.

Note that the first of the three terms in the final stage of equation 14.12, Σ_{x∈𝒲} C(x) log C(x), is independent of the class map function G(.), so it need not be considered when optimising G(.). The value a class map must seek to maximise, F_MC, can now be defined:

\[
F_{\text{MC}} = \sum_{g,h \in \mathcal{G}} C(g,h) \log C(g,h) \;-\; 2 \sum_{g \in \mathcal{G}} C(g) \log C(g) \tag{14.13}
\]

A fixed number of classes must be decided before running the algorithm, which can now be formally defined:

1. Initialise: ∀w ∈ 𝒲 : G(w) = 1
   Set up the class map so that all words are in the first class and all other classes are empty (or initialise using some other scheme).

2. Iterate: ∀i ∈ {1 ... n} ∧ ¬s
   For a given number of iterations 1 ... n, or until some stop criterion s is fulfilled:

   (a) Iterate: ∀w ∈ 𝒲
       For each word w in the vocabulary:

       i. Iterate: ∀c ∈ 𝒢
          For each class c:

          A. Move word w to class c, remembering its previous class.
          B. Calculate the change in F_MC for this move.
          C. Move word w back to its previous class.

       ii. Move word w to the class which increased F_MC the most, or do not move it if no move increased F_MC.

The initialisation scheme given here in step 1 represents a word unigram language model, making no assumptions about which words should belong in which class.[9] The algorithm is greedy, so it is not guaranteed to find the class map which maximises the training text likelihood; it may converge to a local maximum of F_MC.

[8] That is, C(G(w)) = Σ_{x:G(x)=G(w)} C(x).
[9] Given this initialisation, however, the first (|𝒢| − 1) moves will simply place words into the empty classes, since the class map which maximises F_MC is the one which places each word into a singleton class.
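The numbered procedure above can be sketched in a few lines. The following is a deliberately simple Python illustration, not the C implementation in the HTK tool Cluster: it recomputes F_MC in full for every candidate move, whereas step 2(a)iB only requires the change in F_MC, and the function names and toy text are invented for the example.

```python
# Illustrative sketch of the word exchange algorithm driven by F_MC (equation 14.13).
# Recomputing F_MC from scratch for every trial move is slow but keeps the code short.
import math
from collections import Counter

def f_mc(words, class_map):
    """Equation 14.13: sum_{g,h} C(g,h) log C(g,h) - 2 * sum_g C(g) log C(g)."""
    class_seq = [class_map[w] for w in words]                   # apply G(.) to every entry of W
    class_count = Counter(class_seq)                            # C(g)
    class_pair_count = Counter(zip(class_seq, class_seq[1:]))   # C(g, h)
    pair_term = sum(c * math.log(c) for c in class_pair_count.values())
    class_term = sum(c * math.log(c) for c in class_count.values())
    return pair_term - 2.0 * class_term

def exchange_words(words, num_classes, max_iterations=10):
    """Greedy word exchange: move each word to the class that most increases F_MC."""
    vocab = sorted(set(words))
    class_map = {w: 0 for w in vocab}                  # step 1: all words in the first class
    for _ in range(max_iterations):                    # step 2: bounded number of iterations
        any_moved = False
        for w in vocab:                                # step 2(a): each word in turn
            original = class_map[w]
            best_class, best_score = original, f_mc(words, class_map)
            for c in range(num_classes):               # step 2(a)i: try every class
                class_map[w] = c                       # A: tentatively move w to class c
                score = f_mc(words, class_map)         # B: evaluate F_MC after the move
                if score > best_score:
                    best_class, best_score = c, score
                class_map[w] = original                # C: move w back before the next trial
            class_map[w] = best_class                  # step 2(a)ii: keep the best move, if any
            any_moved = any_moved or (best_class != original)
        if not any_moved:                              # stop criterion s: no word moved
            break
    return class_map

text = "the cat sat on the mat the dog ran to the mat".split()
print(exchange_words(text, num_classes=3))
```

Because moving a single word only alters the counts involving its old and new classes, a practical implementation evaluates just the change in F_MC for each candidate move, as step 2(a)iB indicates, rather than recomputing the whole sum; that is what makes the search feasible for large vocabularies.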
