A NOVEL ESTIMATION OF FEATURE-SPACE MLLR FOR FULL-COVARIANCE MODELS

Arnab Ghoshal¹, Daniel Povey², Mohit Agarwal³, Pinar Akyazi⁴, Lukáš Burget⁵, Kai Feng⁶, Ondřej Glembek⁵, Nagendra Goel⁷, Martin Karafiát⁵, Ariya Rastrow⁸, Richard C. Rose⁹, Petr Schwarz⁵, Samuel Thomas⁸

¹Saarland University, Germany, arnab.ghoshal@lsv.uni-saarland.de; ²Microsoft Research, USA, dpovey@microsoft.com; ³IIIT Allahabad, India; ⁴Boğaziçi University, Turkey; ⁵Brno University of Technology, Czech Republic; ⁶Hong Kong UST; ⁷Go-Vivace Inc., USA; ⁸Johns Hopkins University, USA; ⁹McGill University, Canada

* This work was conducted at the Johns Hopkins University Summer Workshop, which was (partially) supported by National Science Foundation Grant Number IIS-0833652, with supplemental funding from Google Research, DARPA's GALE program, and the Johns Hopkins University Human Language Technology Center of Excellence. BUT researchers were partially supported by Czech MPO project No. FR-TI1/034. Thanks to CLSP staff and faculty, to Tomas Kašpárek for system support, to Patrick Nguyen for introducing the participants, to Mark Gales for advice and HTK help, and to Jan Černocký for proofreading and useful comments.

ABSTRACT

In this paper we present a novel approach for estimating feature-space maximum likelihood linear regression (fMLLR) transforms for full-covariance Gaussian models, by directly maximizing the likelihood function through repeated line search in the direction of the gradient. We do this in a pre-transformed parameter space in which an approximation to the expected Hessian is proportional to the unit matrix. The proposed algorithm is as efficient as or more efficient than standard approaches, and is more flexible because it can naturally be combined with sets of basis transforms and with full-covariance and subspace precision and mean (SPAM) models.

Index Terms— Speech recognition, Speaker adaptation, Hidden Markov models, Optimization methods, Linear algebra

1. INTRODUCTION

Feature-space MLLR (fMLLR), also known as Constrained MLLR, is a popular adaptation technique used in automatic speech recognition to reduce the mismatch between the models and the acoustic data from a particular speaker [1]. The method uses an affine transform of the feature vectors,

    x^(s) = A^(s) x + b^(s),    (1)

where the transform parameters A^(s) and b^(s) are estimated to maximize the likelihood of the transformed speaker-specific data under the acoustic model. In this paper we use the compact notation x^(s) = W^(s) x^+, where W^(s) = [A^(s), b^(s)] and x^+ = [x^T, 1]^T denotes the vector x extended with a 1.

For each speaker transform W^(s) we need to estimate d² + d parameters, where d is the dimensionality of the feature vectors. When small amounts of per-speaker adaptation data are present, these estimates may not be reliable. In such cases we can express W^(s) as a linear combination of a set of basis transforms W_b (1 ≤ b ≤ B),

    W^(s) = W_0 + ∑_{b=1}^B α_b^(s) W_b.    (2)

This is similar to the formulation presented in [2]. The bases W_b are estimated from the entire training set, and essentially define a subspace for the fMLLR transforms. For each speaker we then only need to estimate the coefficients α_b, where typically B ≪ d(d+1).
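To make Equations (1) and (2) concrete, here is a minimal numpy sketch; it is not part of the paper, and the shapes and function names are our own illustration:

```python
import numpy as np

def apply_fmllr(W, x):
    """Apply an fMLLR transform W = [A, b] to a feature vector x (Eq. 1)."""
    x_plus = np.append(x, 1.0)   # x^+ = [x^T, 1]^T
    return W @ x_plus            # equals A x + b

def compose_subspace_transform(W0, bases, alphas):
    """Form W^(s) = W_0 + sum_b alpha_b^(s) W_b (Eq. 2)."""
    W = W0.copy()
    for alpha, Wb in zip(alphas, bases):
        W += alpha * Wb
    return W

# Toy usage with d = 3 features and B = 2 basis transforms.
d, B = 3, 2
rng = np.random.default_rng(0)
W0 = np.hstack([np.eye(d), np.zeros((d, 1))])   # default transform [I; 0]
bases = [rng.standard_normal((d, d + 1)) for _ in range(B)]
alphas = rng.standard_normal(B)
x_adapted = apply_fmllr(compose_subspace_transform(W0, bases, alphas),
                        rng.standard_normal(d))
```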
When the underlying probability model has a diagonal covariance structure, the fMLLR transform can be estimated by iteratively updating the matrix rows [1]; a similar row-by-row update has also been proposed for the full-covariance case [3]. Such updates are hard to combine with the subspace formulation of Equation (2).

We propose to estimate the transformation matrices W^(s) by repeated line search in the gradient direction, in a pre-transformed parameter space where the Hessian is proportional to the unit matrix. Our experiments are with a new kind of GMM-based system, a "subspace Gaussian mixture model" (SGMM) system [4]. For the purposes of fMLLR estimation, the only important fact is that this system uses a relatively small number (less than 1000) of shared full-covariance matrices. However, our approach is also applicable to other types of systems (e.g. diagonal-covariance, full-covariance, SPAM [5]).

2. SUBSPACE GMM ACOUSTIC MODEL

The simplest form of SGMM can be expressed as follows:

    p(x|j) = ∑_{i=1}^I w_ji N(x; μ_ji, Σ_i)    (3)
    μ_ji = M_i v_j    (4)
    w_ji = exp(w_i^T v_j) / ∑_{i′=1}^I exp(w_{i′}^T v_j),    (5)

where x ∈ R^D is the feature, j is the speech state, and v_j ∈ R^S is the "state vector", with S ≃ D being the subspace dimension. The model in each state is a GMM with I Gaussians whose covariances Σ_i are shared between states. The means μ_ji and mixture weights w_ji are not parameters of the model; instead, they are derived from v_j ∈ R^S via the globally shared parameters M_i and w_i. Refer to [4, 6] for details about the model, parameter estimation, etc.
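As a rough illustration of Equations (3)–(5) (again our own sketch, not code from the paper), the state likelihood can be computed as follows:

```python
import numpy as np

def log_gauss(x, mu, Sigma):
    """Log density of N(x; mu, Sigma) with a full covariance Sigma."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def sgmm_state_likelihood(x, v_j, M, w, Sigmas):
    """p(x|j) for the simplest SGMM form (Eqs. 3-5).

    M:      list of I projection matrices M_i, each D x S
    w:      I x S array of weight-projection vectors w_i
    Sigmas: list of I shared full-covariance matrices Sigma_i
    """
    logits = w @ v_j                          # w_i^T v_j for all i
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                  # softmax of Eq. (5)
    return sum(w_ji * np.exp(log_gauss(x, M_i @ v_j, Sigma_i))  # Eqs. (3)-(4)
               for w_ji, M_i, Sigma_i in zip(weights, M, Sigmas))
```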


3. ESTIMATION OF FMLLR TRANSFORM

The auxiliary function for estimating the fMLLR transform W by maximum likelihood, obtained by the standard approach using Jensen's inequality, is as follows:

    Q(W) = ∑_{t∈T(s),j,i} γ_ji(t) [ log |det A| − (1/2) (W x^+(t) − μ_ji)^T Σ_i^{−1} (W x^+(t) − μ_ji) ]    (6)

    Q(W) = C + β log |det A| + tr(W K^T) − (1/2) ∑_i tr(W^T Σ_i^{−1} W G_i),    (7)

where C is the sum of the terms without W, T(s) is the set of frames for speaker s, and β, K, G_i are the sufficient statistics that need to be accumulated:

    β = ∑_{t∈T(s),j,i} γ_ji(t)    (8)
    K = ∑_{t∈T(s),j,i} γ_ji(t) Σ_i^{−1} μ_ji x^+(t)^T    (9)
    G_i = ∑_{t∈T(s),j} γ_ji(t) x^+(t) x^+(t)^T    (type 1)    (10)
    G_k = ∑_{t∈T(s),j,i} γ_ji(t) p_jik x^+(t) x^+(t)^T    (type 2)    (11)

The "type 1" formulation is efficient for the SGMM, where we have a relatively small number of full covariance matrices. The "type 2" formulation is a more general approach that covers SPAM systems [5] (in which the Σ_ji are represented via a weighted combination of basis inverse variances A_k for 1 ≤ k ≤ K, with Σ_ji^{−1} = ∑_k p_jik A_k), as well as diagonal and general full covariance systems. This notation is however not optimal for diagonal and full-covariance systems.

Defining P ∈ R^{d×(d+1)} as the derivative of Q(W) with respect to W, we have:

    P ≡ ∂Q(W)/∂W = β [A^{−T} ; 0] + K − G,    (12)
    G = ∑_i Σ_i^{−1} W G_i    (type 1)    (13)
    G = ∑_k A_k W G_k    (type 2)    (14)

This gradient is computed on each iteration of an iterative update process.
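A minimal sketch of the statistics accumulation (Eqs. 8–10, type 1) and the gradient (Eqs. 12–13); the data layout (posterior dictionaries keyed by (j, i)) is our own assumption:

```python
import numpy as np

def accumulate_stats(frames, posteriors, means, Sigma_inv):
    """Accumulate beta, K and G_i over one speaker's frames (Eqs. 8-10).

    frames:     T x d feature matrix
    posteriors: per frame, a dict {(j, i): gamma_ji(t)}
    means:      dict {(j, i): mu_ji}
    Sigma_inv:  list of I precision matrices Sigma_i^{-1}
    """
    d = frames.shape[1]
    beta = 0.0
    K = np.zeros((d, d + 1))
    G = [np.zeros((d + 1, d + 1)) for _ in Sigma_inv]
    for x, post in zip(frames, posteriors):
        x_plus = np.append(x, 1.0)
        outer = np.outer(x_plus, x_plus)
        for (j, i), gamma in post.items():
            beta += gamma
            K += gamma * np.outer(Sigma_inv[i] @ means[(j, i)], x_plus)
            G[i] += gamma * outer
    return beta, K, G

def gradient(W, beta, K, G, Sigma_inv):
    """Gradient P of Q(W) with respect to W (Eqs. 12-13, type 1)."""
    d = W.shape[0]
    A = W[:, :d]
    G_sum = sum(S @ W @ Gi for S, Gi in zip(Sigma_inv, G))
    return beta * np.hstack([np.linalg.inv(A).T, np.zeros((d, 1))]) + K - G_sum
```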
4. HESSIAN COMPUTATION

In general we would expect the computation of the matrix of second derivatives (the Hessian) of the likelihood function Q(W) to be difficult. However, with appropriate pre-scaling and assumptions this d(d+1) by d(d+1) matrix has a simple structure. The general process is that we pre-transform so that the means have diagonal variance and zero mean and the average variance is unity; we make the simplifying assumption that all variances are unity, and then compute the expected Hessian for statistics generated from the model around W = [I ; 0] (i.e. the default value). We then use its inverse Cholesky factor as a pre-conditioning transform. We never have to explicitly construct a d(d+1) by d(d+1) matrix.

4.1. Pre-transform

To compute the transform matrix W_pre = [A_pre, b_pre] we start by calculating the average within-class and between-class covariances, and the average mean, as in the LDA computation:

    Σ_W = ∑_i γ_i Σ_i    (15)
    μ = (1 / ∑_{j,i} γ_ji) ∑_{j,i} γ_ji μ_ji    (16)
    Σ_B = (1 / ∑_{j,i} γ_ji) ∑_{j,i} γ_ji μ_ji μ_ji^T − μ μ^T,    (17)

where γ_ji = ∑_t γ_ji(t), and γ_i = ∑_j γ_ji. We first do the Cholesky decomposition Σ_W = L L^T, compute B = L^{−1} Σ_B L^{−T}, do the singular value decomposition

    B = U Λ V^T,    (18)

and compute A_pre = U^T L^{−1}, b_pre = −A_pre μ, and

    W_pre = [A_pre ; b_pre] = [U^T L^{−1} ; −U^T L^{−1} μ].    (19)

The bias term b_pre makes the data zero-mean on average, whereas the transformation A_pre makes the average within-class covariance unity, and the between-class covariance diagonal. To revert the transformation by W_pre we compute:

    W_inv = [A_pre^{−1} ; μ] = [L U ; μ].    (20)

If W is the transformation matrix in the original space, let W′ be the equivalent transform in the space where the model and features are transformed by W_pre. In effect, W′ is equivalent to transforming from the pre-transformed space to the original space with W_inv, applying W, and then transforming back with W_pre:

    W′ = W_pre W^+ W_inv^+    (21)
    W = W_inv W′^+ W_pre^+,    (22)

where the notation M^+ denotes a matrix M with an extra row whose last element is 1 and the rest are 0. If P ≡ ∂Q(W)/∂W is the gradient with respect to the transform in the original space, we can compute the transformed gradient [6] as:

    P′ = A_pre^{−T} P W_pre^{+T}.    (23)
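The pre-transform computation (Eqs. 18–20) is a few lines of linear algebra; a sketch under the same assumptions as the earlier ones (the helper plus() implements the M^+ notation of Eqs. 21–23):

```python
import numpy as np

def plus(M):
    """M^+ : M with an extra row whose last element is 1 and the rest are 0."""
    row = np.zeros((1, M.shape[1]))
    row[0, -1] = 1.0
    return np.vstack([M, row])

def compute_pre_transform(Sigma_W, Sigma_B, mu):
    """Compute W_pre, W_inv and Lambda (Eqs. 18-20) from the averaged
    within-class covariance, between-class covariance and mean (Eqs. 15-17)."""
    L = np.linalg.cholesky(Sigma_W)               # Sigma_W = L L^T
    L_inv = np.linalg.inv(L)
    B = L_inv @ Sigma_B @ L_inv.T
    U, lam, _ = np.linalg.svd(B)                  # B = U Lambda V^T, Eq. (18)
    A_pre = U.T @ L_inv
    W_pre = np.hstack([A_pre, (-A_pre @ mu)[:, None]])   # Eq. (19)
    W_inv = np.hstack([L @ U, mu[:, None]])              # Eq. (20)
    return W_pre, W_inv, lam
```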


4.2. Hessian transform

Assuming that the model and data have been pre-transformed as described above, and approximating all variances with the average variance I, the expected per-frame model likelihood is:

    Q(W) = log |det A| − (1/2) ∑_{j,i} γ_ji E_ji[ (A x + b − μ_ji)^T (A x + b − μ_ji) ],    (24)

where the expectation E_ji is over typical features x generated from Gaussian i of state j. Then we use the fact that the features x for Gaussian i of state j are distributed with unit variance and mean μ_ji, and the fact that the means μ_ji have zero mean and variance Λ (obtained while computing the pre-transforms from Equation (18)), keeping only terms quadratic in A and/or b, to get:

    Q(W) = K + log |det A| + linear terms − (1/2) [ tr(A (I + Λ) A^T) + b^T b ].    (25)

At this point it is possible to work out [6] that, around A = I and b = 0,

    ∂²Q / ∂a_rc ∂a_{r′c′} = −δ(r, c′) δ(c, r′) − (1 + λ_c) δ(r, r′) δ(c, c′)    (26)
    ∂²Q / ∂a_rc ∂b_{r′} = 0    (27)
    ∂²Q / ∂b_r ∂b_c = −δ(r, c),    (28)

where δ(·, ·) is the Kronecker delta function. So the quadratic terms in a quadratic expansion of the auxiliary function around that point can be written as:

    −(1/2) ∑_{r,c} [ a_rc a_cr + a_rc² (1 + λ_c) ] − (1/2) ∑_r b_r².    (29)

Note that the term a_rc a_cr arises from the determinant and a_rc²(1 + λ_c) arises from the expression tr(A(I + Λ)A^T). This shows that each element of A is only correlated with its transpose, so with an appropriate reordering of the elements of W, the Hessian would have a block-diagonal structure with blocks of size 2 and 1 (the size-1 blocks correspond to the diagonal of A and the elements of b).

Consider the general problem where we have a parameter f, and an auxiliary function of the form Q(f) = f · g − (1/2) f^T H f, where −H is the Hessian w.r.t. f. We need to compute a co-ordinate transformation f → f̃ such that the Hessian in the transformed space is −I. By expressing H in terms of its Cholesky factors, H = L L^T, it is clear that the appropriate transformed parameter is f̃ = L^T f, since Q(f̃) = f̃ · (L^{−1} g) − (1/2) f̃^T f̃. This also makes clear that the appropriate transformation on the gradient is L^{−1}.

The details of this Cholesky factor computation can be found in [6]. Multiplying P′ (cf. Equation (23)) by L^{−1} to obtain P̃ has the following simple form. For 1 ≤ r ≤ d and 1 ≤ c < r,

    p̃_rc = (1 + λ_c)^{−1/2} p′_rc    (30)
    p̃_cr = −(1 + λ_r − (1 + λ_c)^{−1})^{−1/2} (1 + λ_c)^{−1} p′_rc + (1 + λ_r − (1 + λ_c)^{−1})^{−1/2} p′_cr    (31)
    p̃_rr = (2 + λ_r)^{−1/2} p′_rr    (32)
    p̃_{r,(d+1)} = p′_{r,(d+1)},    (33)

where the λ_i are the elements of the diagonal matrix Λ of (18).

In this transformed space, the proposed change Δ in W will be:

    Δ̃ = (1/β) P̃.    (34)

The factor 1/β is necessary because the expected Hessian is a per-observation quantity. The co-ordinate transformation from Δ̃ to Δ′ is as follows: for 1 ≤ r ≤ d and 1 ≤ c < r,

    δ′_rc = (1 + λ_c)^{−1/2} δ̃_rc − (1 + λ_r − (1 + λ_c)^{−1})^{−1/2} (1 + λ_c)^{−1} δ̃_cr    (35)
    δ′_cr = (1 + λ_r − (1 + λ_c)^{−1})^{−1/2} δ̃_cr    (36)
    δ′_rr = (2 + λ_r)^{−1/2} δ̃_rr    (37)
    δ′_{r,(d+1)} = δ̃_{r,(d+1)}.    (38)

We can then transform Δ′ to Δ; referring to Equation (22),

    Δ = A_inv Δ′ W_pre^+.    (39)

At this point a useful check is to make sure that in the three co-ordinate systems the computed auxiliary function change, based on a linear approximation, is the same:

    tr(Δ P^T) = tr(Δ′ P′^T) = tr(Δ̃ P̃^T).    (40)
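The Hessian transform of Eqs. (30)–(33) and its reverse (Eqs. 35–38) are simple element-wise operations; a sketch (our own rendering, 0-based indices):

```python
import numpy as np

def hessian_transform(P_prime, lam):
    """Map the pre-transformed gradient P' to P~ (Eqs. 30-33)."""
    d = P_prime.shape[0]
    P_t = np.zeros_like(P_prime)
    for r in range(d):
        P_t[r, d] = P_prime[r, d]                          # Eq. (33)
        P_t[r, r] = P_prime[r, r] / np.sqrt(2.0 + lam[r])  # Eq. (32)
        for c in range(r):
            s = np.sqrt(1.0 + lam[r] - 1.0 / (1.0 + lam[c]))
            P_t[r, c] = P_prime[r, c] / np.sqrt(1.0 + lam[c])     # Eq. (30)
            P_t[c, r] = (-P_prime[r, c] / ((1.0 + lam[c]) * s)
                         + P_prime[c, r] / s)                     # Eq. (31)
    return P_t

def inverse_hessian_transform(D_t, lam):
    """Map a step Delta~ back to Delta' (Eqs. 35-38)."""
    d = D_t.shape[0]
    D_p = np.zeros_like(D_t)
    for r in range(d):
        D_p[r, d] = D_t[r, d]                              # Eq. (38)
        D_p[r, r] = D_t[r, r] / np.sqrt(2.0 + lam[r])      # Eq. (37)
        for c in range(r):
            s = np.sqrt(1.0 + lam[r] - 1.0 / (1.0 + lam[c]))
            D_p[c, r] = D_t[c, r] / s                      # Eq. (36)
            D_p[r, c] = (D_t[r, c] / np.sqrt(1.0 + lam[c])
                         - D_t[c, r] / ((1.0 + lam[c]) * s))  # Eq. (35)
    return D_p
```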
5. ADAPTATION RECIPE

In this section we summarize the steps for estimating the fMLLR transform in the form of a recipe (Algorithm 5.1).

Algorithm 5.1: fMLLR estimation
 1: Compute W_pre, W_inv and Λ    // Eqs. (18)–(20)
 2: Initialize: W_0^(s) = [I ; 0]
 3: Accumulate statistics:    // Eqs. (8)–(10)
      β ← ∑_{t,j,i} γ_ji(t)
      K ← ∑_{t,j,i} γ_ji(t) Σ_i^{−1} μ_ji x^+(t)^T
      G_i ← ∑_{t,j} γ_ji(t) x^+(t) x^+(t)^T
 4: for n ← 1 … N do    // e.g. for N = 5 iterations
 5:   G ← ∑_i Σ_i^{−1} W_{n−1}^(s) G_i
 6:   P ← β [A_{n−1}^{−T} ; 0] + K − G
 7:   P′ ← A_pre^{−T} P W_pre^{+T}    // Pre-transform: Eq. (23)
 8:   P′ → P̃    // Hessian transform: Eqs. (30)–(33)
 9:   Δ̃ ← (1/β) P̃    // Step in transformed space: Eq. (34)
10:   Δ̃ → Δ′    // Reverse Hessian transform: Eqs. (35)–(38)
11:   Δ ← A_inv Δ′ W_pre^+    // Reverse pre-transform: Eq. (39)
12:   Compute step size k    // Appendix A
13:   W_n^(s) ← W_{n−1}^(s) + kΔ    // Update
14: end for

The first step is calculating the pre-transform W_pre, the associated inverse transform W_inv, and the diagonal matrix of singular values Λ. These quantities depend only on the models, and need to be computed only once before starting the iterative estimation of W.

The next step is to initialize W^(s). If a previous estimate of W^(s) exists (for example, if we are running multiple passes over the adaptation data), it is used as the initial estimate. Otherwise W^(s) = [I ; 0] is a reasonable starting point.

For each pass over the adaptation data, we first accumulate the sufficient statistics β, K, and G_i, which can be done in O(T d²) time (type 1) or O(T K d²) time (type 2).

To iteratively update W^(s), we first compute the gradient P in the original space, and P̃ in the fully transformed space. We then compute Δ̃, the change in W in this transformed space, from which we obtain the change Δ in the original space by reversing the Hessian transform and the pre-transform.

We next compute the optimal step size k using an iterative procedure described in Appendix A, which is then used to update W^(s):

    W^(s) ← W^(s) + kΔ.    (41)

The update time is dominated by Equation (13) (type 1) or (14) (type 2), which takes O(I d³) time for the type 1 update, and O(K d³) for the type 2 update with a general type of basis (e.g. SPAM) but O(K d²) for a "simple" basis as in a standard diagonal or full-covariance system; this reduces to O(d³) for diagonal systems (versus O(d⁴) for the standard approach) and O(d⁴) for full covariance (versus O(d⁴) in [3]). The accumulation time (the part proportional to T) is the same as for standard approaches (type 2) or faster (type 1).
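Putting the pieces together, here is a sketch of the main loop of Algorithm 5.1 (type 1 statistics); it reuses gradient(), plus(), hessian_transform() and inverse_hessian_transform() from the sketches above, and compute_step_size() from the Appendix A sketch below:

```python
import numpy as np

def estimate_fmllr(beta, K, G, Sigma_inv, W_pre, W_inv, lam, n_iters=5):
    """Iteratively estimate W^(s) (Algorithm 5.1, type 1 statistics)."""
    d = W_pre.shape[0]
    A_inv = W_inv[:, :d]                             # A_pre^{-1} = L U
    W = np.hstack([np.eye(d), np.zeros((d, 1))])     # W_0^(s) = [I; 0]
    for _ in range(n_iters):
        P = gradient(W, beta, K, G, Sigma_inv)               # Eqs. (12)-(13)
        P_prime = A_inv.T @ P @ plus(W_pre).T                # Eq. (23)
        P_tilde = hessian_transform(P_prime, lam)            # Eqs. (30)-(33)
        D_prime = inverse_hessian_transform(P_tilde / beta,  # Eq. (34)
                                            lam)             # Eqs. (35)-(38)
        Delta = A_inv @ D_prime @ plus(W_pre)                # Eq. (39)
        k = compute_step_size(W, Delta, beta, K, G, Sigma_inv)  # Appendix A
        W = W + k * Delta                                    # Eq. (41)
    return W
```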


6. SUBSPACE VERSION OF FMLLR

We express Equation (2) in the fully transformed space:

    W̃^(s) = W̃_0 + ∑_{b=1}^B α_b^(s) W̃_b,    (42)

where the W̃_b form an orthonormal basis, i.e., tr(W̃_b W̃_c^T) = δ(b, c). With this subspace approach, Equation (34) is modified as:

    Δ̃ = (1/β) ∑_{b=1}^B W̃_b tr(W̃_b P̃^T).    (43)

Note that in this method of calculation the quantities α_b^(s) are implicit and are never referred to in the calculation, but the updated W will still be constrained by the subspace. This simplifies the coding procedure, but at the cost of slightly higher memory and storage requirements.

6.1. Training the bases

The auxiliary function improvement in the transformed space can be computed as (1/2) tr(Δ̃ P̃^T) (up to a linear approximation). This is the same as (1/2) tr((1/√β) P̃ (1/√β) P̃^T). So the auxiliary function improvement is the trace of the scatter of (1/√β) P̃ projected onto the subspace.

The first step in training the basis is to compute the quantity (1/√β) P̃^(s) for each speaker. We then compute the scatter matrix:

    S = ∑_s vec((1/√β) P̃^(s)) vec((1/√β) P̃^(s))^T,    (44)

where vec(M) represents concatenating the rows of a matrix M into a vector. The column vectors u_b corresponding to the top B singular values in the SVD S = U L V^T give the bases W̃_b, i.e. u_b = vec(W̃_b).
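A sketch of the subspace step of Eq. (43) and the basis training of Eq. (44) (again our own rendering, with the same caveats as the earlier sketches):

```python
import numpy as np

def subspace_step(P_tilde, beta, bases_tilde):
    """Constrained step of Eq. (43): project (1/beta) P~ onto the
    orthonormal basis {W~_b}."""
    D_tilde = np.zeros_like(P_tilde)
    for Wb in bases_tilde:
        D_tilde += Wb * np.sum(Wb * P_tilde)   # tr(W~_b P~^T)
    return D_tilde / beta

def train_bases(P_tilde_per_speaker, betas, B):
    """Estimate the top-B basis matrices {W~_b} from per-speaker
    gradients via the scatter matrix of Eq. (44)."""
    rows = [(P / np.sqrt(b)).reshape(-1)       # vec((1/sqrt(beta)) P~(s))
            for P, b in zip(P_tilde_per_speaker, betas)]
    S = sum(np.outer(v, v) for v in rows)      # scatter matrix, Eq. (44)
    U, _, _ = np.linalg.svd(S)
    shape = P_tilde_per_speaker[0].shape
    return [U[:, b].reshape(shape) for b in range(B)]
```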
7. EXPERIMENTS

Our experiments are with an SGMM-style system on the CALLHOME English database; see [4] for system details. Results are without speaker adaptive training.

Table 1. fMLLR adaptation results (in % WER).

    Number of SGMM substates:  1800  2700   4k    6k    9k   12k   16k
    SGMM (unadapted):          51.6  50.9  50.6  50.1  49.9  49.3  49.4
    Per-speaker adaptation
      fMLLR:                   49.7  49.4  48.7  48.3  48.0  47.6  47.6
    Per-utterance adaptation
      fMLLR:                   51.1  50.7  50.3  49.8  49.5  49.1  49.2
      + subspaces:             50.2  49.9  49.5  48.9  48.6  48.0  47.9

In Table 1 we show adaptation results for SGMM systems of varying model complexity [4]. We can see that the proposed method for fMLLR provides substantial improvements over an unadapted SGMM baseline when adapting using all the available data for a particular speaker. The improvements are consistent with those obtained by a standard implementation of fMLLR over a baseline system that uses conventional GMMs.

When adapting per utterance (i.e. with little adaptation data), we see that normal fMLLR adaptation provides only very modest gains (we use a minimum of 100 speech frames for adaptation, which gives good performance). However, using the subspace fMLLR with B = 100 basis transforms W_b (and the same minimum of 100 frames), we are able to get performance that is comparable to per-speaker adaptation.

8. CONCLUSIONS

In this paper we presented a novel estimation algorithm for fMLLR transforms with full-covariance models, which iteratively finds the gradient in a transformed space where the expected Hessian is proportional to unity. The proposed algorithm provides large improvements over a competitive unadapted SGMM baseline on an LVCSR task. It is also used to estimate a subspace-constrained fMLLR, which provides better results with limited adaptation data. The algorithm itself is independent of the SGMM framework, and can be applied to any HMM that uses GMM emission densities.

9. REFERENCES

[1] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12, no. 2, pp. 75–98, April 1998.
[2] K. Visweswariah, V. Goel, and R. Gopinath, "Structuring linear transforms for adaptation using training time information," in Proc. IEEE ICASSP, 2002, vol. 1, pp. 585–588.
[3] K. C. Sim and M. J. F. Gales, "Adaptation of precision matrix models on large vocabulary continuous speech recognition," in Proc. IEEE ICASSP, 2005, vol. I, pp. 97–100.
[4] D. Povey et al., "Subspace Gaussian mixture models for speech recognition," submitted to ICASSP, 2010.
[5] S. Axelrod et al., "Subspace constrained Gaussian mixture models for speech recognition," IEEE Trans. Speech Audio Process., vol. 13, no. 6, pp. 1144–1160, 2005.
[6] D. Povey, "A tutorial-style introduction to subspace Gaussian mixture models for speech recognition," Tech. Rep. MSR-TR-2009-111, Microsoft Research, 2009.

A. CALCULATING OPTIMAL STEP SIZE

The auxiliary function in the step size k is:

    Q(k) = β log det(A + k Δ_{1:d,1:d}) + k m − (1/2) k² n,    (45)
    m = tr(Δ K^T) − tr(Δ G^T)    (46)
    n = ∑_i tr(Δ^T Σ_i^{−1} Δ G_i)    (type 1)    (47)
    n = ∑_k tr(Δ^T A_k Δ G_k)    (type 2)    (48)

where Δ_{1:d,1:d} is the first d columns of Δ. We use Newton's method to optimize k. After computing

    B = (A + k Δ_{1:d,1:d})^{−1} Δ_{1:d,1:d}    (49)
    d_1 = β tr(B) + m − k n    (50)
    d_2 = −β tr(B B) − n,    (51)

where d_1 and d_2 are the first and second derivatives of (45) with respect to k, we update k as:

    k̂ = k − d_1 / d_2.    (52)

At this point we check that Q(k̂) ≥ Q(k). If Q(·) decreases, we keep halving the step, k̂ ← (k + k̂)/2, until Q(k̂) ≥ Q(k). The final k should typically be close to 1.
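A sketch of this Newton update for k (type 1 statistics; the bounded backtracking count is our own safeguard, not specified in the paper):

```python
import numpy as np

def compute_step_size(W, Delta, beta, K, G, Sigma_inv, n_iters=10):
    """Newton's method for the step size k (Eqs. 45-52)."""
    d = W.shape[0]
    A, D = W[:, :d], Delta[:, :d]
    G_grad = sum(S @ W @ Gi for S, Gi in zip(Sigma_inv, G))  # G of Eq. (13)
    m = np.trace(Delta @ K.T) - np.trace(Delta @ G_grad.T)   # Eq. (46)
    n = sum(np.trace(Delta.T @ S @ Delta @ Gi)               # Eq. (47)
            for S, Gi in zip(Sigma_inv, G))

    def Q(k):                                                # Eq. (45)
        sign, logdet = np.linalg.slogdet(A + k * D)
        return -np.inf if sign <= 0 else beta * logdet + k * m - 0.5 * k * k * n

    k = 1.0
    for _ in range(n_iters):
        B = np.linalg.solve(A + k * D, D)                    # Eq. (49)
        d1 = beta * np.trace(B) + m - k * n                  # Eq. (50)
        d2 = -beta * np.trace(B @ B) - n                     # Eq. (51)
        k_hat = k - d1 / d2                                  # Eq. (52)
        for _ in range(20):          # halve toward k if Q decreases
            if Q(k_hat) >= Q(k):
                break
            k_hat = (k + k_hat) / 2.0
        k = k_hat
    return k
```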
