The Matrix Cookbook
[ http://matrixcookbook.com ]

Kaare Brandt Petersen
Michael Syskind Pedersen

Version: September 5, 2007

What is this? These pages are a collection of facts (identities, approximations, inequalities, relations, ...) about matrices and matters relating to them. It is collected in this form for the convenience of anyone who wants a quick desktop reference.

Disclaimer: The identities, approximations and relations presented here were obviously not invented but collected, borrowed and copied from a large amount of sources. These sources include similar but shorter notes found on the internet and appendices in books - see the references for a full list.

Errors: Very likely there are errors, typos, and mistakes for which we apologize and would be grateful to receive corrections at cookbook@2302.dk.

It's ongoing: The project of keeping a large repository of relations involving matrices is naturally ongoing and the version will be apparent from the date in the header.

Suggestions: Your suggestions for additional content or elaboration of some topics are most welcome at cookbook@2302.dk.

Keywords: Matrix algebra, matrix relations, matrix identities, derivative of determinant, derivative of inverse matrix, differentiate a matrix.

Acknowledgements: We would like to thank the following for contributions and suggestions: Bill Baxter, Christian Rishøj, Douglas L. Theobald, Esben Hoegh-Rasmussen, Jan Larsen, Korbinian Strimmer, Lars Christiansen, Lars Kai Hansen, Leland Wilkinson, Liguo He, Loic Thibaut, Ole Winther, Stephan Hattinger, and Vasile Sima. We would also like to thank The Oticon Foundation for funding our PhD studies.


Contents

1 Basics  5
  1.1 Trace and Determinants  5
  1.2 The Special Case 2x2  5

2 Derivatives  7
  2.1 Derivatives of a Determinant  7
  2.2 Derivatives of an Inverse  8
  2.3 Derivatives of Eigenvalues  9
  2.4 Derivatives of Matrices, Vectors and Scalar Forms  9
  2.5 Derivatives of Traces  11
  2.6 Derivatives of vector norms  13
  2.7 Derivatives of matrix norms  13
  2.8 Derivatives of Structured Matrices  13

3 Inverses  16
  3.1 Basic  16
  3.2 Exact Relations  17
  3.3 Implication on Inverses  19
  3.4 Approximations  19
  3.5 Generalized Inverse  20
  3.6 Pseudo Inverse  20

4 Complex Matrices  22
  4.1 Complex Derivatives  22

5 Decompositions  25
  5.1 Eigenvalues and Eigenvectors  25
  5.2 Singular Value Decomposition  25
  5.3 Triangular Decomposition  27

6 Statistics and Probability  28
  6.1 Definition of Moments  28
  6.2 Expectation of Linear Combinations  29
  6.3 Weighted Scalar Variable  30

7 Multivariate Distributions  31
  7.1 Student's t  31
  7.2 Cauchy  32
  7.3 Gaussian  32
  7.4 Multinomial  32
  7.5 Dirichlet  32
  7.6 Normal-Inverse Gamma  32
  7.7 Wishart  32
  7.8 Inverse Wishart  33


8 Gaussians  34
  8.1 Basics  34
  8.2 Moments  36
  8.3 Miscellaneous  38
  8.4 Mixture of Gaussians  39

9 Special Matrices  40
  9.1 Orthogonal, Ortho-symmetric, and Ortho-skew  40
  9.2 Units, Permutation and Shift  40
  9.3 The Singleentry Matrix  41
  9.4 Symmetric and Antisymmetric  43
  9.5 Orthogonal matrices  44
  9.6 Vandermonde Matrices  45
  9.7 Toeplitz Matrices  45
  9.8 The DFT Matrix  46
  9.9 Positive Definite and Semi-definite Matrices  47
  9.10 Block matrices  49

10 Functions and Operators  51
  10.1 Functions and Series  51
  10.2 Kronecker and Vec Operator  52
  10.3 Solutions to Systems of Equations  53
  10.4 Vector Norms  56
  10.5 Matrix Norms  56
  10.6 Rank  58
  10.7 Integral Involving Dirac Delta Functions  58
  10.8 Miscellaneous  58

A One-dimensional Results  59
  A.1 Gaussian  59
  A.2 One Dimensional Mixture of Gaussians  60

B Proofs and Details  62
  B.1 Misc Proofs  62


Notation and Nomenclature

A           Matrix
A_ij        Matrix indexed for some purpose
A_i         Matrix indexed for some purpose
A^ij        Matrix indexed for some purpose
A^n         Matrix indexed for some purpose, or the n.th power of a square matrix
A^{-1}      The inverse matrix of the matrix A
A^+         The pseudo inverse matrix of the matrix A (see Sec. 3.6)
A^{1/2}     The square root of a matrix (if unique), not elementwise
(A)_ij      The (i, j).th entry of the matrix A
A_ij        The (i, j).th entry of the matrix A
[A]_ij      The ij-submatrix, i.e. A with i.th row and j.th column deleted
a           Vector
a_i         Vector indexed for some purpose
a_i         The i.th element of the vector a
a           Scalar
Rz          Real part of a scalar
Rz          Real part of a vector
RZ          Real part of a matrix
Iz          Imaginary part of a scalar
Iz          Imaginary part of a vector
IZ          Imaginary part of a matrix
det(A)      Determinant of A
Tr(A)       Trace of the matrix A
diag(A)     Diagonal matrix of the matrix A, i.e. (diag(A))_ij = δ_ij A_ij
vec(A)      The vector-version of the matrix A (see Sec. 10.2.2)
sup         Supremum of a set
||A||       Matrix norm (subscript if any denotes what norm)
A^T         Transposed matrix
A^*         Complex conjugated matrix
A^H         Transposed and complex conjugated matrix (Hermitian)
A ◦ B       Hadamard (elementwise) product
A ⊗ B       Kronecker product
0           The null matrix. Zero in all entries.
I           The identity matrix
J^ij        The single-entry matrix, 1 at (i, j) and zero elsewhere
Σ           A positive definite matrix
Λ           A diagonal matrix


1 Basics

(AB)^{-1} = B^{-1} A^{-1}                                    (1)
(ABC...)^{-1} = ...C^{-1} B^{-1} A^{-1}                      (2)
(A^T)^{-1} = (A^{-1})^T                                      (3)
(A + B)^T = A^T + B^T                                        (4)
(AB)^T = B^T A^T                                             (5)
(ABC...)^T = ...C^T B^T A^T                                  (6)
(A^H)^{-1} = (A^{-1})^H                                      (7)
(A + B)^H = A^H + B^H                                        (8)
(AB)^H = B^H A^H                                             (9)
(ABC...)^H = ...C^H B^H A^H                                  (10)

1.1 Trace and Determinants

Tr(A) = Σ_i A_ii                                             (11)
Tr(A) = Σ_i λ_i,   λ_i = eig(A)                              (12)
Tr(A) = Tr(A^T)                                              (13)
Tr(AB) = Tr(BA)                                              (14)
Tr(A + B) = Tr(A) + Tr(B)                                    (15)
Tr(ABC) = Tr(BCA) = Tr(CAB)                                  (16)

det(A) = Π_i λ_i,   λ_i = eig(A)                             (17)
det(cA) = c^n det(A),   if A ∈ R^{n×n}                       (18)
det(AB) = det(A) det(B)                                      (19)
det(A^{-1}) = 1/det(A)                                       (20)
det(A^n) = det(A)^n                                          (21)
det(I + uv^T) = 1 + u^T v                                    (22)
det(I + εA) ≅ 1 + ε Tr(A),   ε small                         (23)

1.2 The Special Case 2x2

Consider the matrix A

A = [ A_11  A_12 ]
    [ A_21  A_22 ]

Determinant and trace

det(A) = A_11 A_22 − A_12 A_21                               (24)
Tr(A) = A_11 + A_22                                          (25)
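A quick numerical sanity check of a few of the identities above, e.g. (1), (14), (19) and (22), as a minimal NumPy sketch on random matrices (the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
u = rng.standard_normal((n, 1))
v = rng.standard_normal((n, 1))

# (1)  (AB)^{-1} = B^{-1} A^{-1}
assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))

# (14) Tr(AB) = Tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# (19) det(AB) = det(A) det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))

# (22) det(I + u v^T) = 1 + u^T v
assert np.isclose(np.linalg.det(np.eye(n) + u @ v.T), 1 + (u.T @ v).item())
```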


Eigenvalues

λ² − λ Tr(A) + det(A) = 0

λ_1 = (Tr(A) + √(Tr(A)² − 4 det(A))) / 2
λ_2 = (Tr(A) − √(Tr(A)² − 4 det(A))) / 2

λ_1 + λ_2 = Tr(A)
λ_1 λ_2 = det(A)

Eigenvectors

v_1 ∝ [ A_12        ]        v_2 ∝ [ A_12        ]
      [ λ_1 − A_11  ]              [ λ_2 − A_11  ]

Inverse

A^{-1} = (1/det(A)) [  A_22  −A_12 ]
                    [ −A_21   A_11 ]                         (26)
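A small NumPy sketch of the 2x2 case: the eigenvalues from the trace/determinant formulas above match np.linalg.eigvals, and (26) matches np.linalg.inv (the example matrix is arbitrary):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [2.0, 4.0]])

tr, det = np.trace(A), np.linalg.det(A)
disc = np.sqrt(tr**2 - 4 * det)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2

# eigenvalues from the characteristic polynomial vs. the library routine
assert np.allclose(sorted([lam1, lam2]), sorted(np.linalg.eigvals(A).real))

# (26) explicit 2x2 inverse
Ainv = np.array([[ A[1, 1], -A[0, 1]],
                 [-A[1, 0],  A[0, 0]]]) / det
assert np.allclose(Ainv, np.linalg.inv(A))
```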


2 Derivatives

This section covers differentiation of a number of expressions with respect to a matrix X. Note that it is always assumed that X has no special structure, i.e. that the elements of X are independent (e.g. not symmetric, Toeplitz, positive definite). See Section 2.8 for differentiation of structured matrices. The basic assumptions can be written in a formula as

∂X_kl/∂X_ij = δ_ik δ_lj                                      (27)

that is, for e.g. vector forms,

[∂x/∂y]_i = ∂x_i/∂y      [∂x/∂y]_i = ∂x/∂y_i      [∂x/∂y]_ij = ∂x_i/∂y_j

The following rules are general and very useful when deriving the differential of an expression ([18]):

∂A = 0   (A is a constant)                                   (28)
∂(αX) = α ∂X                                                 (29)
∂(X + Y) = ∂X + ∂Y                                           (30)
∂(Tr(X)) = Tr(∂X)                                            (31)
∂(XY) = (∂X)Y + X(∂Y)                                        (32)
∂(X ◦ Y) = (∂X) ◦ Y + X ◦ (∂Y)                               (33)
∂(X ⊗ Y) = (∂X) ⊗ Y + X ⊗ (∂Y)                               (34)
∂(X^{-1}) = −X^{-1}(∂X)X^{-1}                                (35)
∂(det(X)) = det(X) Tr(X^{-1} ∂X)                             (36)
∂(ln(det(X))) = Tr(X^{-1} ∂X)                                (37)
∂X^T = (∂X)^T                                                (38)
∂X^H = (∂X)^H                                                (39)

2.1 Derivatives of a Determinant

2.1.1 General form

∂det(Y)/∂x = det(Y) Tr[Y^{-1} ∂Y/∂x]                         (40)

2.1.2 Linear forms

∂det(X)/∂X = det(X)(X^{-1})^T                                (41)
∂det(AXB)/∂X = det(AXB)(X^{-1})^T = det(AXB)(X^T)^{-1}       (42)


2.1.3 Square forms

If X is square and invertible, then

∂det(X^T A X)/∂X = 2 det(X^T A X) X^{-T}                     (43)

If X is not square but A is symmetric, then

∂det(X^T A X)/∂X = 2 det(X^T A X) A X (X^T A X)^{-1}         (44)

If X is not square and A is not symmetric, then

∂det(X^T A X)/∂X = det(X^T A X)(A X (X^T A X)^{-1} + A^T X (X^T A^T X)^{-1})     (45)

2.1.4 Other nonlinear forms

Some special cases are (see [9, 7])

∂ln det(X^T X)/∂X = 2(X^+)^T                                 (46)
∂ln det(X^T X)/∂X^+ = −2X^T                                  (47)
∂ln |det(X)|/∂X = (X^{-1})^T = (X^T)^{-1}                    (48)
∂det(X^k)/∂X = k det(X^k) X^{-T}                             (49)

2.2 Derivatives of an Inverse

From [26] we have the basic identity

∂Y^{-1}/∂x = −Y^{-1} (∂Y/∂x) Y^{-1}                          (50)

from which it follows

∂(X^{-1})_kl/∂X_ij = −(X^{-1})_ki (X^{-1})_jl                (51)
∂a^T X^{-1} b/∂X = −X^{-T} a b^T X^{-T}                      (52)
∂det(X^{-1})/∂X = −det(X^{-1})(X^{-1})^T                     (53)
∂Tr(A X^{-1} B)/∂X = −(X^{-1} B A X^{-1})^T                  (54)
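One convenient way to check such formulas numerically is a finite-difference comparison; here is a minimal NumPy sketch for (52), using an arbitrary well-conditioned test matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # keep X well conditioned
a, b = rng.standard_normal(n), rng.standard_normal(n)

f = lambda X: a @ np.linalg.inv(X) @ b

# closed form (52): df/dX = -X^{-T} a b^T X^{-T}
Xinv = np.linalg.inv(X)
grad_exact = -Xinv.T @ np.outer(a, b) @ Xinv.T

# central finite differences, entry by entry
eps = 1e-6
grad_fd = np.zeros_like(X)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(X)
        E[i, j] = eps
        grad_fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

assert np.allclose(grad_exact, grad_fd, atol=1e-5)
```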


2.3 Derivatives of Eigenvalues

∂/∂X Σ eig(X) = ∂/∂X Tr(X) = I                               (55)
∂/∂X Π eig(X) = ∂/∂X det(X) = det(X) X^{-T}                  (56)

2.4 Derivatives of Matrices, Vectors and Scalar Forms

2.4.1 First Order

∂x^T a/∂x = ∂a^T x/∂x = a                                    (57)
∂a^T X b/∂X = a b^T                                          (58)
∂a^T X^T b/∂X = b a^T                                        (59)
∂a^T X a/∂X = ∂a^T X^T a/∂X = a a^T                          (60)
∂X/∂X_ij = J^ij                                              (61)
∂(XA)_ij/∂X_mn = δ_im (A)_nj = (J^mn A)_ij                   (62)
∂(X^T A)_ij/∂X_mn = δ_in (A)_mj = (J^nm A)_ij                (63)

2.4.2 Second Order

∂/∂X_ij Σ_klmn X_kl X_mn = 2 Σ_kl X_kl                       (64)
∂b^T X^T X c/∂X = X(b c^T + c b^T)                           (65)
∂(Bx + b)^T C (Dx + d)/∂x = B^T C (Dx + d) + D^T C^T (Bx + b)          (66)
∂(X^T B X)_kl/∂X_ij = δ_lj (X^T B)_ki + δ_kj (BX)_il                   (67)
∂(X^T B X)/∂X_ij = X^T B J^ij + J^ji B X,   (J^ij)_kl = δ_ik δ_jl      (68)


See Sec. 9.3 for useful properties of the single-entry matrix J^ij.

∂x^T B x/∂x = (B + B^T) x                                    (69)
∂b^T X^T D X c/∂X = D^T X b c^T + D X c b^T                  (70)
∂/∂X (Xb + c)^T D (Xb + c) = (D + D^T)(Xb + c) b^T           (71)

Assume W is symmetric, then

∂/∂s (x − As)^T W (x − As) = −2 A^T W (x − As)               (72)
∂/∂s (x − s)^T W (x − s) = −2 W (x − s)                      (73)
∂/∂x (x − As)^T W (x − As) = 2 W (x − As)                    (74)
∂/∂A (x − As)^T W (x − As) = −2 W (x − As) s^T               (75)

2.4.3 Higher order and non-linear

∂(X^n)_kl/∂X_ij = Σ_{r=0}^{n−1} (X^r J^ij X^{n−1−r})_kl                (76)
∂/∂X a^T X^n b = Σ_{r=0}^{n−1} (X^r)^T a b^T (X^{n−1−r})^T             (77)
∂/∂X a^T (X^n)^T X^n b = Σ_{r=0}^{n−1} [ X^{n−1−r} a b^T (X^n)^T X^r
                                         + (X^r)^T X^n a b^T (X^{n−1−r})^T ]     (78)

See B.1.1 for a proof.

Assume s and r are functions of x, i.e. s = s(x), r = r(x), and that A is a constant, then

∂/∂x s^T A r = [∂s/∂x]^T A r + [∂r/∂x]^T A^T s               (79)


2.4.4 Gradient and Hessian

Using the above we have for the gradient and the Hessian

f = x^T A x + b^T x                                          (80)
∇_x f = ∂f/∂x = (A + A^T) x + b                              (81)
∂²f/∂x∂x^T = A + A^T                                         (82)

2.5 Derivatives of Traces

2.5.1 First Order

∂Tr(X)/∂X = I                                                (83)
∂/∂X Tr(XA) = A^T                                            (84)
∂/∂X Tr(AXB) = A^T B^T                                       (85)
∂/∂X Tr(AX^T B) = BA                                         (86)
∂/∂X Tr(X^T A) = A                                           (87)
∂/∂X Tr(AX^T) = A                                            (88)
∂/∂X Tr(A ⊗ X) = Tr(A) I                                     (89)
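A minimal NumPy sketch verifying (81) and (82) by finite differences; the test problem below is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)

f = lambda x: x @ A @ x + b @ x           # (80)

grad_exact = (A + A.T) @ x + b            # (81)
hess_exact = A + A.T                      # (82)

eps = 1e-3
E = np.eye(n) * eps
grad_fd = np.array([(f(x + E[i]) - f(x - E[i])) / (2 * eps) for i in range(n)])
hess_fd = np.array([[(f(x + E[i] + E[j]) - f(x + E[i] - E[j])
                      - f(x - E[i] + E[j]) + f(x - E[i] - E[j])) / (4 * eps**2)
                     for j in range(n)] for i in range(n)])

assert np.allclose(grad_exact, grad_fd)
assert np.allclose(hess_exact, hess_fd)
```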


2.5.2 Second Order

See [7].

∂/∂X Tr(X²) = 2X^T                                           (90)
∂/∂X Tr(X²B) = (XB + BX)^T                                   (91)
∂/∂X Tr(X^T B X) = BX + B^T X                                (92)
∂/∂X Tr(X B X^T) = X B^T + X B                               (93)
∂/∂X Tr(A X B X) = A^T X^T B^T + B^T X^T A^T                 (94)
∂/∂X Tr(X^T X) = 2X                                          (95)
∂/∂X Tr(B X X^T) = (B + B^T) X                               (96)
∂/∂X Tr(B^T X^T C X B) = C^T X B B^T + C X B B^T             (97)
∂/∂X Tr[X^T B X C] = B X C + B^T X C^T                       (98)
∂/∂X Tr(A X B X^T C) = A^T C^T X B^T + C A X B               (99)
∂/∂X Tr[(A X b + c)(A X b + c)^T] = 2 A^T (A X b + c) b^T    (100)
∂/∂X Tr(X ⊗ X) = ∂/∂X Tr(X)Tr(X) = 2 Tr(X) I                 (101)

2.5.3 Higher Order

∂/∂X Tr(X^k) = k (X^{k−1})^T                                 (102)
∂/∂X Tr(A X^k) = Σ_{r=0}^{k−1} (X^r A X^{k−r−1})^T           (103)
∂/∂X Tr[B^T X^T C X X^T C X B] = C X X^T C X B B^T
                                + C^T X B B^T X^T C^T X
                                + C X B B^T X^T C X
                                + C^T X X^T C^T X B B^T      (104)

2.5.4 Other

∂/∂X Tr(A X^{-1} B) = −(X^{-1} B A X^{-1})^T = −X^{-T} A^T B^T X^{-T}      (105)


Assume B and C to be symmetric, then

∂/∂X Tr[(X^T C X)^{-1} A] = −(C X (X^T C X)^{-1})(A + A^T)(X^T C X)^{-1}            (106)
∂/∂X Tr[(X^T C X)^{-1}(X^T B X)] = −2 C X (X^T C X)^{-1} X^T B X (X^T C X)^{-1}
                                   + 2 B X (X^T C X)^{-1}                           (107)

See [7].

2.6 Derivatives of vector norms

2.6.1 Two-norm

∂/∂x ||x − a||_2 = (x − a)/||x − a||_2                                              (108)
∂/∂x [(x − a)/‖x − a‖_2] = I/‖x − a‖_2 − (x − a)(x − a)^T/‖x − a‖_2³                (109)
∂||x||²_2/∂x = ∂(x^T x)/∂x = 2x                                                     (110)

2.7 Derivatives of matrix norms

For more on matrix norms, see Sec. 10.5.

2.7.1 Frobenius norm

∂/∂X ||X||²_F = ∂/∂X Tr(X X^H) = 2X                                                 (111)

See (201).

2.8 Derivatives of Structured Matrices

Assume that the matrix A has some structure, i.e. symmetric, Toeplitz, etc. In that case the derivatives of the previous sections do not apply in general. Instead, consider the following general rule for differentiating a scalar function f(A)

df/dA_ij = Σ_kl (∂f/∂A_kl)(∂A_kl/∂A_ij) = Tr[ (∂f/∂A)^T ∂A/∂A_ij ]                  (112)

The matrix differentiated with respect to itself is in this document referred to as the structure matrix of A and is defined simply by

∂A/∂A_ij = S^ij                                                                     (113)

If A has no special structure we have simply S^ij = J^ij, that is, the structure matrix is simply the single-entry matrix. Many structures have a representation in single-entry matrices, see Sec. 9.3.6 for more examples of structure matrices.


2.8.1 The Chain Rule

Sometimes the objective is to find the derivative of a matrix which is a function of another matrix. Let U = f(X); the goal is to find the derivative of the function g(U) with respect to X:

∂g(U)/∂X = ∂g(f(X))/∂X                                       (114)

The Chain Rule can then be written the following way:

∂g(U)/∂X_ij = Σ_{k=1}^{M} Σ_{l=1}^{N} (∂g(U)/∂u_kl)(∂u_kl/∂x_ij)       (115)

Using matrix notation, this can be written as:

∂g(U)/∂X_ij = Tr[ (∂g(U)/∂U)^T ∂U/∂X_ij ].                   (116)

2.8.2 Symmetric

If A is symmetric, then S^ij = J^ij + J^ji − J^ij J^ij and therefore

df/dA = [∂f/∂A] + [∂f/∂A]^T − diag[∂f/∂A]                    (117)

That is, e.g., ([5]):

∂Tr(AX)/∂X = A + A^T − (A ◦ I),   see (121)                  (118)
∂det(X)/∂X = det(X)(2X^{-1} − (X^{-1} ◦ I))                  (119)
∂ln det(X)/∂X = 2X^{-1} − (X^{-1} ◦ I)                       (120)

2.8.3 Diagonal

If X is diagonal, then ([18]):

∂Tr(AX)/∂X = A ◦ I                                           (121)

2.8.4 Toeplitz

Like symmetric and diagonal matrices, Toeplitz matrices have a special structure which must be taken into account when taking the derivative with respect to a matrix with Toeplitz structure.


∂Tr(AT)/∂T = ∂Tr(TA)/∂T

 = ⎡ Tr(A)                     Tr([A^T]_{n1})    Tr([[A^T]_{1n}]_{n−1,2})   ⋯    A_{n1}                    ⎤
   ⎢ Tr([A^T]_{1n})            Tr(A)             Tr([A^T]_{n1})             ⋱    ⋮                         ⎥
   ⎢ Tr([[A^T]_{1n}]_{2,n−1})  Tr([A^T]_{1n})    ⋱                          ⋱    Tr([[A^T]_{1n}]_{n−1,2})  ⎥
   ⎢ ⋮                         ⋱                 ⋱                          ⋱    Tr([A^T]_{n1})            ⎥
   ⎣ A_{1n}                    ⋯                 Tr([[A^T]_{1n}]_{2,n−1})   Tr([A^T]_{1n})    Tr(A)        ⎦

 ≡ α(A)                                                      (122)

As can be seen, the derivative α(A) also has a Toeplitz structure. Each value on the main diagonal is the sum of all the diagonal values of A; the values on the diagonals next to the main diagonal equal the sum of the diagonal next to the main diagonal in A^T. This result is only valid for the unconstrained Toeplitz matrix. If the Toeplitz matrix also is symmetric, the same derivative yields

∂Tr(AT)/∂T = ∂Tr(TA)/∂T = α(A) + α(A)^T − α(A) ◦ I           (123)


3 Inverses

3.1 Basic

3.1.1 Definition

The inverse A^{-1} of a matrix A ∈ C^{n×n} is defined such that

A A^{-1} = A^{-1} A = I,                                     (124)

where I is the n × n identity matrix. If A^{-1} exists, A is said to be nonsingular. Otherwise, A is said to be singular (see e.g. [12]).

3.1.2 Cofactors and Adjoint

The submatrix of a matrix A, denoted by [A]_ij, is a (n − 1) × (n − 1) matrix obtained by deleting the ith row and the jth column of A. The (i, j) cofactor of a matrix is defined as

cof(A, i, j) = (−1)^{i+j} det([A]_ij).                       (125)

The matrix of cofactors can be created from the cofactors

cof(A) = ⎡ cof(A, 1, 1)   ···   cof(A, 1, n) ⎤
         ⎢      ⋮      cof(A, i, j)    ⋮     ⎥               (126)
         ⎣ cof(A, n, 1)   ···   cof(A, n, n) ⎦

The adjoint matrix is the transpose of the cofactor matrix

adj(A) = (cof(A))^T.                                         (127)

3.1.3 Determinant

The determinant of a matrix A ∈ C^{n×n} is defined as (see [12])

det(A) = Σ_{j=1}^{n} (−1)^{j+1} A_1j det([A]_1j)             (128)
       = Σ_{j=1}^{n} A_1j cof(A, 1, j).                      (129)

3.1.4 Construction

The inverse matrix can be constructed, using the adjoint matrix, by

A^{-1} = (1/det(A)) · adj(A)                                 (130)

For the case of 2 × 2 matrices, see Section 1.2.


3.1.5 Condition number

The condition number of a matrix c(A) is the ratio between the largest and the smallest singular value of a matrix (see Section 5.2 on singular values),

c(A) = d_+ / d_−                                             (131)

The condition number can be used to measure how singular a matrix is. If the condition number is large, it indicates that the matrix is nearly singular. The condition number can also be estimated from the matrix norms. Here

c(A) = ‖A‖ · ‖A^{-1}‖,                                       (132)

where ‖·‖ is a norm such as e.g. the 1-norm, the 2-norm, the ∞-norm or the Frobenius norm (see Sec. 10.5 for more on matrix norms).

The 2-norm of A equals √(max(eig(A^H A))) [12, p. 57]. For a symmetric matrix, this reduces to ||A||_2 = max(|eig(A)|) [12, p. 394]. If the matrix is symmetric and positive definite, ||A||_2 = max(eig(A)). The condition number based on the 2-norm thus reduces to

‖A‖_2 ‖A^{-1}‖_2 = max(eig(A)) max(eig(A^{-1})) = max(eig(A)) / min(eig(A)).       (133)

3.2 Exact Relations

3.2.1 Basic

(AB)^{-1} = B^{-1} A^{-1}                                    (134)

3.2.2 The Woodbury identity

The Woodbury identity comes in many variants. The latter of the two can be found in [12]

(A + C B C^T)^{-1} = A^{-1} − A^{-1} C (B^{-1} + C^T A^{-1} C)^{-1} C^T A^{-1}     (135)
(A + U B V)^{-1} = A^{-1} − A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1}           (136)

If P, R are positive definite, then (see [29])

(P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = P B^T (B P B^T + R)^{-1}                 (137)

3.2.3 The Kailath Variant

See [4, page 153].

(A + BC)^{-1} = A^{-1} − A^{-1} B (I + C A^{-1} B)^{-1} C A^{-1}                   (138)
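A numerical check (NumPy; the sizes n = 6 and k = 2 are arbitrary) of the Woodbury identity (136) and the Kailath variant (138) on random, well-conditioned matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 2
inv = np.linalg.inv

A = rng.standard_normal((n, n)) + n * np.eye(n)
U = rng.standard_normal((n, k))
B = rng.standard_normal((k, k)) + k * np.eye(k)
V = rng.standard_normal((k, n))

# (136) Woodbury: (A + UBV)^{-1} = A^{-1} - A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1}
lhs = inv(A + U @ B @ V)
rhs = inv(A) - inv(A) @ U @ inv(inv(B) + V @ inv(A) @ U) @ V @ inv(A)
assert np.allclose(lhs, rhs)

# (138) Kailath: (A + BC)^{-1} = A^{-1} - A^{-1} B (I + C A^{-1} B)^{-1} C A^{-1}
Bm, Cm = U, V   # reuse the n×k and k×n factors
lhs = inv(A + Bm @ Cm)
rhs = inv(A) - inv(A) @ Bm @ inv(np.eye(k) + Cm @ inv(A) @ Bm) @ Cm @ inv(A)
assert np.allclose(lhs, rhs)
```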


3.2.4 The Searle Set of Identities

The following set of identities can be found in [24, page 151]:

(I + A^{-1})^{-1} = A(A + I)^{-1}                            (139)
(A + B B^T)^{-1} B = A^{-1} B (I + B^T A^{-1} B)^{-1}        (140)
(A^{-1} + B^{-1})^{-1} = A(A + B)^{-1} B = B(A + B)^{-1} A   (141)
A − A(A + B)^{-1} A = B − B(A + B)^{-1} B                    (142)
A^{-1} + B^{-1} = A^{-1}(A + B) B^{-1}                       (143)
(I + AB)^{-1} = I − A(I + BA)^{-1} B                         (144)
(I + AB)^{-1} A = A(I + BA)^{-1}                             (145)

3.2.5 Rank-1 update of Moore-Penrose Inverse

The following is a rank-1 update for the Moore-Penrose pseudo-inverse; a proof can be found in [17]. The matrix G is defined below:

(A + c d^T)^+ = A^+ + G                                      (146)

Using the notation

β = 1 + d^T A^+ c                                            (147)
v = A^+ c                                                    (148)
n = (A^+)^T d                                                (149)
w = (I − A A^+) c                                            (150)
m = (I − A^+ A)^T d                                          (151)

the solution is given as six different cases, depending on the entities ||w||, ||m||, and β. Please note that for any (column) vector v it holds that v^+ = v^T (v^T v)^{-1} = v^T/||v||². The solution is:

Case 1 of 6: If ||w|| ≠ 0 and ||m|| ≠ 0. Then

G = −v w^+ − (m^+)^T n^T + β (m^+)^T w^+                     (152)
  = −(1/||w||²) v w^T − (1/||m||²) m n^T + (β/(||m||²||w||²)) m w^T           (153)

Case 2 of 6: If ||w|| = 0 and ||m|| ≠ 0 and β = 0. Then

G = −v v^+ A^+ − (m^+)^T n^T                                 (154)
  = −(1/||v||²) v v^T A^+ − (1/||m||²) m n^T                 (155)


Case 3 of 6: If ||w|| = 0 and β ≠ 0. Then

G = (1/β) m v^T A^+ − (β/(||v||²||m||² + |β|²)) ( (||v||²/β) m + v ) ( (||m||²/β) (A^+)^T v + n )^T        (156)

Case 4 of 6: If ||w|| ≠ 0 and ||m|| = 0 and β = 0. Then

G = −A^+ n n^+ − v w^+                                       (157)
  = −(1/||n||²) A^+ n n^T − (1/||w||²) v w^T                 (158)

Case 5 of 6: If ||m|| = 0 and β ≠ 0. Then

G = (1/β) A^+ n w^T − (β/(||n||²||w||² + |β|²)) ( (||w||²/β) A^+ n + v ) ( (||n||²/β) w + n )^T            (159)

Case 6 of 6: If ||w|| = 0 and ||m|| = 0 and β = 0. Then

G = −v v^+ A^+ − A^+ n n^+ + v^+ A^+ n v n^+                 (160)
  = −(1/||v||²) v v^T A^+ − (1/||n||²) A^+ n n^T + (v^T A^+ n/(||v||²||n||²)) v n^T                        (161)

3.3 Implication on Inverses

If (A + B)^{-1} = A^{-1} + B^{-1} then A B^{-1} A = B A^{-1} B          (162)

See [24].

3.3.1 A PosDef identity

Assume P, R to be positive definite and invertible, then

(P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = P B^T (B P B^T + R)^{-1}      (163)

See [29].

3.4 Approximations

The following is a Taylor expansion

(I + A)^{-1} = I − A + A² − A³ + ...                         (164)

The following approximation is from [21] and holds when A is large and symmetric

A − A(I + A)^{-1} A ≅ I − A^{-1}                             (165)

If σ² is small compared to Q and M then

(Q + σ² M)^{-1} ≅ Q^{-1} − σ² Q^{-1} M Q^{-1}                (166)


3.5 Generalized Inverse

3.5.1 Definition

A generalized inverse matrix of the matrix A is any matrix A^− such that (see [25])

A A^− A = A                                                  (167)

The matrix A^− is not unique.

3.6 Pseudo Inverse

3.6.1 Definition

The pseudo inverse (or Moore-Penrose inverse) of a matrix A is the matrix A^+ that fulfils

I    A A^+ A = A
II   A^+ A A^+ = A^+
III  A A^+ symmetric
IV   A^+ A symmetric

The matrix A^+ is unique and always exists. Note that in the case of complex matrices, the symmetric condition is substituted by a condition of being Hermitian.

3.6.2 Properties

Assume A^+ to be the pseudo-inverse of A, then (see [3])

(A^+)^+ = A                                                  (168)
(A^T)^+ = (A^+)^T                                            (169)
(cA)^+ = (1/c) A^+                                           (170)
(A^T A)^+ = A^+ (A^T)^+                                      (171)
(A A^T)^+ = (A^T)^+ A^+                                      (172)

Assume A to have full rank, then

(A A^+)(A A^+) = A A^+                                       (173)
(A^+ A)(A^+ A) = A^+ A                                       (174)

Assume that A has full rank, then

Tr(A A^+) = rank(A A^+)     (see [25])                       (175)
Tr(A^+ A) = rank(A^+ A)     (see [25])                       (176)

3.6.3 Construction


A n × n   Square   rank(A) = n   ⇒   A^+ = A^{-1}
A n × m   Broad    rank(A) = n   ⇒   A^+ = A^T (A A^T)^{-1}
A n × m   Tall     rank(A) = m   ⇒   A^+ = (A^T A)^{-1} A^T

Assume A does not have full rank, i.e. A is n × m and rank(A) = r < min(n, m). The pseudo inverse A^+ can be constructed from the singular value decomposition A = U D V^T, by

A^+ = V_r D_r^{-1} U_r^T                                     (177)

where U_r, D_r, and V_r are the matrices with the degenerated rows and columns deleted. A different way is this: There always exist two matrices C (n × r) and D (r × m) of rank r, such that A = CD. Using these matrices it holds that

A^+ = D^T (D D^T)^{-1} (C^T C)^{-1} C^T                      (178)

See [3].
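A short NumPy sketch (illustrative example with an arbitrary rank-deficient A) building A^+ as in (177) from the reduced SVD and as in (178) from a full-rank factorization A = CD, comparing both with np.linalg.pinv:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, r = 6, 4, 2
# build a rank-r matrix A = C D with C (n×r) and D (r×m)
C = rng.standard_normal((n, r))
D = rng.standard_normal((r, m))
A = C @ D

# (177): keep only the r non-zero singular values
U, s, Vt = np.linalg.svd(A)
Ur, Dr, Vr = U[:, :r], np.diag(s[:r]), Vt[:r, :].T
A_plus = Vr @ np.linalg.inv(Dr) @ Ur.T
assert np.allclose(A_plus, np.linalg.pinv(A))

# (178): the full-rank-factorization construction gives the same result
A_plus2 = D.T @ np.linalg.inv(D @ D.T) @ np.linalg.inv(C.T @ C) @ C.T
assert np.allclose(A_plus2, A_plus)
```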


4 Complex Matrices

4.1 Complex Derivatives

In order to differentiate an expression f(z) with respect to a complex z, the Cauchy-Riemann equations have to be satisfied ([7]):

df(z)/dz = ∂R(f(z))/∂Rz + i ∂I(f(z))/∂Rz                     (179)

and

df(z)/dz = −i ∂R(f(z))/∂Iz + ∂I(f(z))/∂Iz                    (180)

or in a more compact form:

∂f(z)/∂Iz = i ∂f(z)/∂Rz.                                     (181)

A complex function that satisfies the Cauchy-Riemann equations for points in a region R is said to be analytic in this region R. In general, expressions involving complex conjugate or conjugate transpose do not satisfy the Cauchy-Riemann equations. In order to avoid this problem, a more generalized definition of complex derivative is used ([23], [6]):

• Generalized Complex Derivative:

df(z)/dz = (1/2)(∂f(z)/∂Rz − i ∂f(z)/∂Iz).                   (182)

• Conjugate Complex Derivative:

df(z)/dz* = (1/2)(∂f(z)/∂Rz + i ∂f(z)/∂Iz).                  (183)

The Generalized Complex Derivative equals the normal derivative when f is an analytic function. For a non-analytic function such as f(z) = z*, the derivative equals zero. The Conjugate Complex Derivative equals zero when f is an analytic function. The Conjugate Complex Derivative has e.g. been used by [20] when deriving a complex gradient.

Notice:

df(z)/dz ≠ ∂f(z)/∂Rz + i ∂f(z)/∂Iz.                          (184)

• Complex Gradient Vector: If f is a real function of a complex vector z, then the complex gradient vector is given by ([14, p. 798])

∇f(z) = 2 df(z)/dz* = ∂f(z)/∂Rz + i ∂f(z)/∂Iz.               (185)


4.1 Complex Derivatives 4 COMPLEX MATRICES• Complex Gradient <strong>Matrix</strong>: If f is a real function of a complex matrix Z,then the complex gradient matrix is given by ([2])∇f(Z) = 2 df(Z)dZ ∗ (186)= ∂f(Z)∂RZ + i∂f(Z) ∂IZ .<strong>The</strong>se expressions can be used for gradient descent algorithms.4.1.1 <strong>The</strong> Chain Rule for complex numbers<strong>The</strong> chain rule is a little more complicated when the function of a complexu = f(x) is non-analytic. For a non-analytic function, the following chain rulecan be applied ([7])∂g(u)∂x= ∂g ∂u∂u ∂x + ∂g ∂u ∗∂u ∗ ∂x= ∂g ∂u( ∂g∗ ) ∗ ∂u∗∂u ∂x + ∂u ∂x(187)Notice, if the function is analytic, the second term reduces to zero, and the functionis reduced to the normal well-known chain rule. For the matrix derivativeof a scalar function g(U), the chain rule can be written the following way:∂g(U)∂X∂g(U)Tr((= ∂U )T ∂U)+∂X4.1.2 Complex Derivatives of TracesTr((∂g(U)∂U ∗ ) T ∂U ∗ )∂X. (188)If the derivatives involve complex numbers, the conjugate transpose is often involved.<strong>The</strong> most useful way to show complex derivative is to show the derivativewith respect to the real and the imaginary part separately. An easy example is:∂Tr(X ∗ )∂RX = ∂Tr(XH )∂RXi ∂Tr(X∗ )∂IX = )i∂Tr(XH ∂IX= I (189)= I (190)Since the two results have the same sign, the conjugate complex derivative (183)should be used.∂Tr(X)∂RX = ∂Tr(XT )∂RXi ∂Tr(X)∂IX = )i∂Tr(XT ∂IX= I (191)= −I (192)Here, the two results have different signs, and the generalized complex derivative(182) should be used. Hereby, it can be seen that (84) holds even if X is aPetersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 23


4.1 Complex Derivatives 4 COMPLEX MATRICEScomplex number.∂Tr(AX H )∂RXi ∂Tr(AXH )∂IX∂Tr(AX ∗ )∂RXi ∂Tr(AX∗ )∂IX= A (193)= A (194)= A T (195)= A T (196)∂Tr(XX H )∂RXi ∂Tr(XXH )∂IX= ∂Tr(XH X)∂RX= i ∂Tr(XH X)∂IX= 2RX (197)= i2IX (198)By inserting (197) and (198) in (182) and (183), it can be seen that∂Tr(XX H )= X ∗∂X(199)∂Tr(XX H )∂X ∗ = X (200)Since the function Tr(XX H ) is a real function of the complex matrix X, thecomplex gradient matrix (186) is given by∇Tr(XX H ) = 2 ∂Tr(XXH )∂X ∗ = 2X (201)4.1.3 Complex Derivative Involving DeterminantsHere, a calculation example is provided. <strong>The</strong> objective is to find the derivative ofdet(X H AX) with respect to X ∈ C m×n . <strong>The</strong> derivative is found with respect tothe real part and the imaginary part of X, by use of (36) and (32), det(X H AX)can be calculated as (see App. B.1.2 for details)∂ det(X H AX)∂Xand the complex conjugate derivative yields= 1 ( ∂ det(X H AX)− i ∂ det(XH AX))2 ∂RX∂IX= det(X H AX) ( (X H AX) −1 X H A ) T(202)∂ det(X H AX)∂X ∗ = 1 ( ∂ det(X H AX)+ i ∂ det(XH AX))2 ∂RX∂IX= det(X H AX)AX(X H AX) −1 (203)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 24


5 Decompositions

5.1 Eigenvalues and Eigenvectors

5.1.1 Definition

The eigenvectors v and eigenvalues λ are the ones satisfying

A v_i = λ_i v_i                                              (204)
A V = V D,   (D)_ij = δ_ij λ_i,                              (205)

where the columns of V are the vectors v_i.

5.1.2 General Properties

eig(AB) = eig(BA)                                            (206)
A is n × m   ⇒   at most min(n, m) distinct λ_i              (207)
rank(A) = r   ⇒   at most r non-zero λ_i                     (208)

5.1.3 Symmetric

Assume A is symmetric, then

V V^T = I    (i.e. V is orthogonal)                          (209)
λ_i ∈ R      (i.e. λ_i is real)                              (210)
Tr(A^p) = Σ_i λ_i^p                                          (211)
eig(I + cA) = 1 + c λ_i                                      (212)
eig(A − cI) = λ_i − c                                        (213)
eig(A^{-1}) = λ_i^{-1}                                       (214)

For a symmetric, positive matrix A,

eig(A^T A) = eig(A A^T) = eig(A) ◦ eig(A)                    (215)

5.2 Singular Value Decomposition

Any n × m matrix A can be written as

A = U D V^T,                                                 (216)

where

U = eigenvectors of A A^T        (n × n)
D = √diag(eig(A A^T))            (n × m)                     (217)
V = eigenvectors of A^T A        (m × m)


Assume A ∈ R n×n . <strong>The</strong>n[A]=[ V] [ D] [UT ] , (219)5.2 Singular Value Decomposition 5 DECOMPOSITIONS5.2.1 Symmetric Square decomposed into squaresAssume A to be n × n and symmetric. <strong>The</strong>n[ ] [ ] [ ] [ A = V D VT ] , (218)where D is diagonal with the eigenvalues of A, and V is orthogonal and theeigenvectors of A.5.2.2 Square decomposed into squareswhere D is diagonal with the square root of the eigenvalues of AA T , V is theeigenvectors of AA T and U T is the eigenvectors of A T A.5.2.3 Square decomposed into rectangularAssume V ∗ D ∗ U T ∗ = 0 then we can expand the SVD of A into[ A]=[ V V∗] [ D 00 D ∗] [ UTU T ∗where the SVD of A is A = VDU T .5.2.4 Rectangular decomposition I], (220)Assume A is n × m, V is n × n, D is n × n, U T is n × m[A ] = [ V ] [ D ] [ U T ], (221)where D is diagonal with the square root of the eigenvalues of AA T , V is theeigenvectors of AA T and U T is the eigenvectors of A T A.5.2.5 Rectangular decomposition IIAssume A is n × m, V is n × m, D is m × m, U T is m × m⎡ ⎤ ⎡ ⎤[A ] = [ V ] ⎣ D ⎦ ⎣ U T ⎦ (222)5.2.6 Rectangular decomposition IIIAssume A is n × m, V is n × n, D is n × m, U T is m × m⎡ ⎤[A ] = [ V ] [ D ] ⎣ U T ⎦ , (223)where D is diagonal with the square root of the eigenvalues of AA T , V is theeigenvectors of AA T and U T is the eigenvectors of A T A.Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 26


5.3 Triangular Decomposition 5 DECOMPOSITIONS5.3 Triangular Decomposition5.3.1 Cholesky-decompositionAssume A is positive definite, thenA = B T B, (224)where B is a unique upper triangular matrix.Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 27
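A two-line check with NumPy; note that np.linalg.cholesky returns the lower-triangular factor, so the upper-triangular B of (224) is its transpose (the test matrix below is an arbitrary positive definite example):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
X = rng.standard_normal((n, n))
A = X @ X.T + n * np.eye(n)        # an arbitrary positive definite matrix

L = np.linalg.cholesky(A)          # lower-triangular factor, A = L L^T
B = L.T                            # upper-triangular factor as in (224), A = B^T B
assert np.allclose(A, B.T @ B)
assert np.allclose(B, np.triu(B))  # B is indeed upper triangular
```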


6 STATISTICS AND PROBABILITY6 Statistics and Probability6.1 Definition of MomentsAssume x ∈ R n×1 is a random variable6.1.1 Mean<strong>The</strong> vector of means, m, is defined by(m) i = 〈x i 〉 (225)6.1.2 Covariance<strong>The</strong> matrix of covariance M is defined by(M) ij = 〈(x i − 〈x i 〉)(x j − 〈x j 〉)〉 (226)or alternatively asM = 〈(x − m)(x − m) T 〉 (227)6.1.3 Third moments<strong>The</strong> matrix of third centralized moments – in some contexts referred to ascoskewness – is defined using the notationasm (3)ijk = 〈(x i − 〈x i 〉)(x j − 〈x j 〉)(x k − 〈x k 〉)〉 (228)M 3 =[]m (3)::1 m(3) ::2 ...m(3) ::n(229)where ’:’ denotes all elements within the given index. M 3 can alternatively beexpressed asM 3 = 〈(x − m)(x − m) T ⊗ (x − m) T 〉 (230)6.1.4 Fourth moments<strong>The</strong> matrix of fourth centralized moments – in some contexts referred to ascokurtosis – is defined using the notationm (4)ijkl = 〈(x i − 〈x i 〉)(x j − 〈x j 〉)(x k − 〈x k 〉)(x l − 〈x l 〉)〉 (231)asM 4 =[]m (4)::11 m(4) ::21 ...m(4) ::n1 |m(4) ::12 m(4) ::22 ...m(4) ::n2 |...|m(4) ::1n m(4) ::2n ...m(4) ::nn(232)or alternatively asM 4 = 〈(x − m)(x − m) T ⊗ (x − m) T ⊗ (x − m) T 〉 (233)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 28


6.2 Expectation of Linear Combinations 6 STATISTICS AND PROBABILITY6.2 Expectation of Linear Combinations6.2.1 Linear FormsAssume X and x to be a matrix and a vector of random variables. <strong>The</strong>n (seeSee [25])E[AXB + C] = AE[X]B + C (234)Var[Ax] = AVar[x]A T (235)Cov[Ax, By] = ACov[x, y]B T (236)Assume x to be a stochastic vector with mean m, then (see [7])6.2.2 Quadratic FormsE[Ax + b] = Am + b (237)E[Ax] = Am (238)E[x + b] = m + b (239)Assume A is symmetric, c = E[x] and Σ = Var[x]. Assume also that allcoordinates x i are independent, have the same central moments µ 1 , µ 2 , µ 3 , µ 4and denote a = diag(A). <strong>The</strong>n (See [25])E[x T Ax] = Tr(AΣ) + c T Ac (240)Var[x T Ax] = 2µ 2 2Tr(A 2 ) + 4µ 2 c T A 2 c + 4µ 3 c T Aa + (µ 4 − 3µ 2 2)a T a (241)Also, assume x to be a stochastic vector with mean m, and covariance M. <strong>The</strong>n(see [7])E[(Ax + a)(Bx + b) T ] = AMB T + (Am + a)(Bm + b) T (242)E[xx T ] = M + mm T (243)E[xa T x] = (M + mm T )a (244)E[x T ax T ] = a T (M + mm T ) (245)E[(Ax)(Ax) T ] = A(M + mm T )A T (246)E[(x + a)(x + a) T ] = M + (m + a)(m + a) T (247)E[(Ax + a) T (Bx + b)] = Tr(AMB T ) + (Am + a) T (Bm + b) (248)E[x T x] = Tr(M) + m T m (249)E[x T Ax] = Tr(AM) + m T Am (250)E[(Ax) T (Ax)] = Tr(AMA T ) + (Am) T (Am) (251)E[(x + a) T (x + a)] = Tr(M) + (m + a) T (m + a) (252)See [7].Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 29
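A small Monte Carlo sketch (NumPy; the distribution and numbers are an arbitrary example, not part of the text) illustrating (240), E[x^T A x] = Tr(AΣ) + c^T A c:

```python
import numpy as np

rng = np.random.default_rng(6)
n, N = 3, 200_000
c = np.array([1.0, -0.5, 2.0])
A = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.5]])              # symmetric, as assumed above
Sigma = np.diag([0.5, 1.0, 2.0])             # independent coordinates

# sample x with mean c and covariance Sigma, then average the quadratic form
x = c + rng.standard_normal((N, n)) * np.sqrt(np.diag(Sigma))
quad = np.einsum('ni,ij,nj->n', x, A, x)

expected = np.trace(A @ Sigma) + c @ A @ c
print(quad.mean(), expected)                 # the two agree up to Monte Carlo error
```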


6.3 Weighted Scalar Variable 6 STATISTICS AND PROBABILITY6.2.3 Cubic FormsAssume x to be a stochastic vector with independent coordinates, mean m,covariance M and central moments v 3 = E[(x − m) 3 ]. <strong>The</strong>n (see [7])E[(Ax + a)(Bx + b) T (Cx + c)] = Adiag(B T C)v 3+Tr(BMC T )(Am + a)+AMC T (Bm + b)+(AMB T + (Am + a)(Bm + b) T )(Cm + c)E[xx T x] = v 3 + 2Mm + (Tr(M) + m T m)mE[(Ax + a)(Ax + a) T (Ax + a)] = Adiag(A T A)v 3+[2AMA T + (Ax + a)(Ax + a) T ](Am + a)+Tr(AMA T )(Am + a)E[(Ax + a)b T (Cx + c)(Dx + d) T ] = (Ax + a)b T (CMD T + (Cm + c)(Dm + d) T )+(AMC T + (Am + a)(Cm + c) T )b(Dm + d) T6.3 Weighted Scalar Variable+b T (Cm + c)(AMD T − (Am + a)(Dm + d) T )Assume x ∈ R n×1 is a random variable, w ∈ R n×1 is a vector of constants andy is the linear combination y = w T x. Assume further that m, M 2 , M 3 , M 4denotes the mean, covariance, and central third and fourth moment matrix ofthe variable x. <strong>The</strong>n it holds that〈y〉 = w T m (253)〈(y − 〈y〉) 2 〉 = w T M 2 w (254)〈(y − 〈y〉) 3 〉 = w T M 3 w ⊗ w (255)〈(y − 〈y〉) 4 〉 = w T M 4 w ⊗ w ⊗ w (256)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 30


7 Multivariate Distributions

7.1 Student's t

The density of a Student-t distributed vector t ∈ R^{P×1} is given by

p(t|µ, Σ, ν) = (πν)^{−P/2} (Γ((ν+P)/2) / Γ(ν/2)) det(Σ)^{−1/2} [ 1 + ν^{−1}(t − µ)^T Σ^{−1} (t − µ) ]^{−(ν+P)/2}       (257)

where µ is the location, the scale matrix Σ is symmetric, positive definite, ν is the degrees of freedom, and Γ denotes the gamma function. For ν = 1, the Student-t distribution becomes the Cauchy distribution (see Sec. 7.2).

7.1.1 Mean

E(t) = µ,   ν > 1                                            (258)

7.1.2 Variance

cov(t) = (ν/(ν − 2)) Σ,   ν > 2                              (259)

7.1.3 Mode

The notion mode means the position of the most probable value.

mode(t) = µ                                                  (260)

7.1.4 Full Matrix Version

If instead of a vector t ∈ R^{P×1} one has a matrix T ∈ R^{P×N}, then the Student-t distribution for T is

p(T|M, Ω, Σ, ν) = π^{−NP/2} Π_{p=1}^{P} ( Γ[(ν + P − p + 1)/2] / Γ[(ν − p + 1)/2] ) ×
                  ν det(Ω)^{−ν/2} det(Σ)^{−N/2} det[ Ω^{−1} + (T − M) Σ^{−1} (T − M)^T ]^{−(ν+P)/2}                    (261)

where M is the location, Ω is the rescaling matrix, Σ is positive definite, ν is the degrees of freedom, and Γ denotes the gamma function.

7.2 Cauchy

The density function for a Cauchy distributed vector t ∈ R^{P×1} is given by

p(t|µ, Σ) = π^{−P/2} (Γ((1+P)/2) / Γ(1/2)) det(Σ)^{−1/2} [ 1 + (t − µ)^T Σ^{−1} (t − µ) ]^{−(1+P)/2}                   (262)

where µ is the location, Σ is positive definite, and Γ denotes the gamma function. The Cauchy distribution is a special case of the Student-t distribution.


7.3 Gaussian 7 MULTIVARIATE DISTRIBUTIONS7.3 GaussianSee sec. 8.7.4 MultinomialIf the vector n contains counts, i.e. (n) i ∈ 0, 1, 2, ..., then the discrete multinomialdisitrbution for n is given byP (n|a, n) =n!n 1 ! . . . n d !d∏ia nii ,d∑n i = n (263)iwhere a i are probabilities, i.e. 0 ≤ a i ≤ 1 and ∑ i a i = 1.7.5 Dirichlet<strong>The</strong> Dirichlet distribution is a kind of “inverse” distribution compared to themultinomial distribution on the bounded continuous variate x = [x 1 , . . . , x P ][16, p. 44]( ∑P)Γp α pP∏p(x|α) = ∏ Pp Γ(α x αp−1pp)7.6 Normal-Inverse Gamma7.7 Wishart<strong>The</strong> central Wishart distribution for M ∈ R P ×P , M is positive definite, wherem can be regarded as a degree of freedom parameter [16, equation 3.8.1] [8,section 2.5],[11]pp(M|Σ, m) =12 mP/2 π ∏ P (P −1)/4 Pp Γ[ ×12(m + 1 − p)]det(Σ) −m/2 det(M) (m−P −1)/2 ×[exp − 1 ]2 Tr(Σ−1 M)(264)7.7.1 MeanE(M) = mΣ (265)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 32


7.8 Inverse Wishart 7 MULTIVARIATE DISTRIBUTIONS7.8 Inverse Wishart<strong>The</strong> (normal) Inverse Wishart distribution for M ∈ R P ×P , M is positive definite,where m can be regarded as a degree of freedom parameter [11]p(M|Σ, m) =12 mP/2 π ∏ P (P −1)/4 Pp Γ[ ×12(m + 1 − p)]det(Σ) m/2 det(M) −(m−P −1)/2 ×[exp − 1 ]2 Tr(ΣM−1 )(266)7.8.1 Mean1E(M) = Σm − P − 1(267)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 33


8 Gaussians

8.1 Basics

8.1.1 Density and normalization

The density of x ∼ N(m, Σ) is

p(x) = (1/√det(2πΣ)) exp[ −(1/2)(x − m)^T Σ^{-1} (x − m) ]                    (268)

Note that if x is d-dimensional, then det(2πΣ) = (2π)^d det(Σ).

Integration and normalization

∫ exp[ −(1/2)(x − m)^T Σ^{-1} (x − m) ] dx = √det(2πΣ)
∫ exp[ −(1/2) x^T A x + b^T x ] dx = √det(2πA^{-1}) exp[ (1/2) b^T A^{-1} b ]
∫ exp[ −(1/2) Tr(S^T A S) + Tr(B^T S) ] dS = √det(2πA^{-1}) exp[ (1/2) Tr(B^T A^{-1} B) ]

The derivatives of the density are

∂p(x)/∂x = −p(x) Σ^{-1} (x − m)                                               (269)
∂²p/∂x∂x^T = p(x) ( Σ^{-1}(x − m)(x − m)^T Σ^{-1} − Σ^{-1} )                  (270)

8.1.2 Marginal Distribution

Assume x ∼ N_x(µ, Σ) where

x = [ x_a ]    µ = [ µ_a ]    Σ = [ Σ_a    Σ_c ]
    [ x_b ]        [ µ_b ]        [ Σ_c^T  Σ_b ]                              (271)

then

p(x_a) = N_{x_a}(µ_a, Σ_a)                                                    (272)
p(x_b) = N_{x_b}(µ_b, Σ_b)                                                    (273)

8.1.3 Conditional Distribution

Assume x ∼ N_x(µ, Σ) where

x = [ x_a ]    µ = [ µ_a ]    Σ = [ Σ_a    Σ_c ]
    [ x_b ]        [ µ_b ]        [ Σ_c^T  Σ_b ]                              (274)


8.1 Basics 8 GAUSSIANSthenp(x a |x b ) = N xa (ˆµ a , ˆΣ a )p(x b |x a ) = N xb (ˆµ b , ˆΣ b ){ ˆµa = µ a + Σ c Σ −1b(x b − µ b )ˆΣ a = Σ a − Σ c Σ −1bΣ T (275)c{ ˆµb = µ b + Σ T c Σ −1a (x a − µ a )ˆΣ b = Σ b − Σ T c Σ −1 (276)a Σ cNote, that the covariance matrices are the Schur complement of the block matrix,see 9.10.5 for details.8.1.4 Linear combinationAssume x ∼ N (m x , Σ x ) and y ∼ N (m y , Σ y ) thenAx + By + c ∼ N (Am x + Bm y + c, AΣ x A T + BΣ y B T ) (277)8.1.5 Rearranging Means√det(2π(AT ΣN Ax [m, Σ] =−1 A) −1 )√ N x [A −1 m, (A T Σ −1 A) −1 ] (278)det(2πΣ)8.1.6 Rearranging into squared formIf A is symmetric, then− 1 2 xT Ax + b T x = − 1 2 (x − A−1 b) T A(x − A −1 b) + 1 2 bT A −1 b− 1 2 Tr(XT AX) + Tr(B T X) = − 1 2 Tr[(X − A−1 B) T A(X − A −1 B)] + 1 2 Tr(BT A −1 B)8.1.7 Sum of two squared formsIn vector formulation (assuming Σ 1 , Σ 2 are symmetric)− 1 2 (x − m 1) T Σ −11 (x − m 1) (279)− 1 2 (x − m 2) T Σ −12 (x − m 2) (280)= − 1 2 (x − m c) T Σ −1c (x − m c ) + C (281)Σ −1c = Σ −11 + Σ −12 (282)m c = (Σ −11 + Σ −12 )−1 (Σ −11 m 1 + Σ −12 m 2) (283)C = 1 2 (mT 1 Σ −11 + m T 2 Σ −12 )(Σ−1 1 + Σ −1− 1 ()m T 1 Σ −112m 1 + m T 2 Σ −12 m 22 )−1 (Σ −11 m 1 + Σ −12 m 2)(284)(285)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 35


8.2 Moments 8 GAUSSIANSIn a trace formulation (assuming Σ 1 , Σ 2 are symmetric)− 1 2 Tr((X − M 1) T Σ −11 (X − M 1)) (286)− 1 2 Tr((X − M 2) T Σ −12 (X − M 2)) (287)= − 1 2 Tr[(X − M c) T Σ −1c (X − M c )] + C (288)Σ −1c = Σ −11 + Σ −12 (289)M c = (Σ −11 + Σ −12 )−1 (Σ −11 M 1 + Σ −12 M 2) (290)C = 1 []2 Tr (Σ −11 M 1 + Σ −12 M 2) T (Σ −11 + Σ −12 )−1 (Σ −11 M 1 + Σ −12 M 2)− 1 2 Tr(MT 1 Σ −11 M 1 + M T 2 Σ −12 M 2) (291)8.1.8 Product of gaussian densitiesLet N x (m, Σ) denote a density of x, thenN x (m 1 , Σ 1 ) · N x (m 2 , Σ 2 ) = c c N x (m c , Σ c ) (292)c c = N m1 (m 2 , (Σ 1 + Σ 2 ))1= √[−det(2π(Σ1 + Σ 2 )) exp 1 ]2 (m 1 − m 2 ) T (Σ 1 + Σ 2 ) −1 (m 1 − m 2 )m c = (Σ −11 + Σ −12 )−1 (Σ −11 m 1 + Σ −12 m 2)Σ c = (Σ −11 + Σ −12 )−1but note that the product is not normalized as a density of x.8.2 Moments8.2.1 Mean and covariance of linear formsFirst and second moments. Assume x ∼ N (m, Σ)E(x) = m (293)Cov(x, x) = Var(x) = Σ = E(xx T ) − E(x)E(x T ) = E(xx T ) − mm T (294)As for any other distribution is holds for gaussians thatE[Ax] = AE[x] (295)Var[Ax] = AVar[x]A T (296)Cov[Ax, By] = ACov[x, y]B T (297)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 36


8.2 Moments 8 GAUSSIANS8.2.2 Mean and variance of square formsMean and variance of square forms: Assume x ∼ N (m, Σ)E(xx T ) = Σ + mm T (298)E[x T Ax] = Tr(AΣ) + m T Am (299)Var(x T Ax) = 2σ 4 Tr(A 2 ) + 4σ 2 m T A 2 m (300)E[(x − m ′ ) T A(x − m ′ )] = (m − m ′ ) T A(m − m ′ ) + Tr(AΣ) (301)Assume x ∼ N (0, σ 2 I) and A and B to be symmetric, thenCov(x T Ax, x T Bx) = 2σ 4 Tr(AB) (302)8.2.3 Cubic formsE[xb T xx T ] = mb T (M + mm T ) + (M + mm T )bm T8.2.4 Mean of Quartic Forms+b T m(M − mm T ) (303)E[xx T xx T ] = 2(Σ + mm T ) 2 + m T m(Σ − mm T )+Tr(Σ)(Σ + mm T )E[xx T Axx T ] = (Σ + mm T )(A + A T )(Σ + mm T )+m T Am(Σ − mm T ) + Tr[AΣ](Σ + mm T )E[x T xx T x] = 2Tr(Σ 2 ) + 4m T Σm + (Tr(Σ) + m T m) 2E[x T Axx T Bx] = Tr[AΣ(B + B T )Σ] + m T (A + A T )Σ(B + B T )m+(Tr(AΣ) + m T Am)(Tr(BΣ) + m T Bm)E[a T xb T xc T xd T x]= (a T (Σ + mm T )b)(c T (Σ + mm T )d)+(a T (Σ + mm T )c)(b T (Σ + mm T )d)+(a T (Σ + mm T )d)(b T (Σ + mm T )c) − 2a T mb T mc T md T mE[(Ax + a)(Bx + b) T (Cx + c)(Dx + d) T ]= [AΣB T + (Am + a)(Bm + b) T ][CΣD T + (Cm + c)(Dm + d) T ]+[AΣC T + (Am + a)(Cm + c) T ][BΣD T + (Bm + b)(Dm + d) T ]+(Bm + b) T (Cm + c)[AΣD T − (Am + a)(Dm + d) T ]+Tr(BΣC T )[AΣD T + (Am + a)(Dm + d) T ]Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 37


8.3 Miscellaneous 8 GAUSSIANSE[(Ax + a) T (Bx + b)(Cx + c) T (Dx + d)]= Tr[AΣ(C T D + D T C)ΣB T ]+[(Am + a) T B + (Bm + b) T A]Σ[C T (Dm + d) + D T (Cm + c)]+[Tr(AΣB T ) + (Am + a) T (Bm + b)][Tr(CΣD T ) + (Cm + c) T (Dm + d)]See [7].8.2.5 MomentsE[x] = ∑ kCov(x) = ∑ kρ k m k (304)∑k ′ ρ k ρ k ′(Σ k + m k m T k − m k m T k ′) (305)8.3 Miscellaneous8.3.1 WhiteningAssume x ∼ N (m, Σ) thenz = Σ −1/2 (x − m) ∼ N (0, I) (306)Conversely having z ∼ N (0, I) one can generate data x ∼ N (m, Σ) by settingx = Σ 1/2 z + m ∼ N (m, Σ) (307)Note that Σ 1/2 means the matrix which fulfils Σ 1/2 Σ 1/2 = Σ, and that it existsand is unique since Σ is positive definite.8.3.2 <strong>The</strong> Chi-Square connectionAssume x ∼ N (m, Σ) and x to be n dimensional, thenz = (x − m) T Σ −1 (x − m) ∼ χ 2 n (308)where χ 2 n denotes the Chi square distribution with n degrees of freedom.8.3.3 EntropyEntropy of a D-dimensional gaussian∫H(x) = − N (m, Σ) ln N (m, Σ)dx = ln √ det(2πΣ) + D 2(309)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 38
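A minimal NumPy sketch of (306)-(307), forming Σ^{1/2} via an eigendecomposition; the specific mean, covariance and sample size are just an example:

```python
import numpy as np

rng = np.random.default_rng(7)
m = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# symmetric square root from the eigendecomposition: Sigma^{1/2} Sigma^{1/2} = Sigma
w, V = np.linalg.eigh(Sigma)
Sigma_half = V @ np.diag(np.sqrt(w)) @ V.T
Sigma_half_inv = V @ np.diag(1 / np.sqrt(w)) @ V.T

# (307): generate x ~ N(m, Sigma) from white noise, then (306): whiten it again
N = 100_000
x = rng.standard_normal((N, 2)) @ Sigma_half + m
z = (x - m) @ Sigma_half_inv
print(np.cov(z.T))   # approximately the identity matrix
```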


8.4 Mixture of Gaussians 8 GAUSSIANS8.4 Mixture of Gaussians8.4.1 Density<strong>The</strong> variable x is distributed as a mixture of gaussians if it has the densityp(x) =K∑k=11ρ k √[−det(2πΣk ) exp 1 ]2 (x − m k) T Σ −1k (x − m k)where ρ k sum to 1 and the Σ k all are positive definite.8.4.2 DerivativesDefining p(s) = ∑ k ρ kN s (µ k , Σ k ) one get∂ ln p(s)∂ρ j==∂ ln p(s)∂µ j==∂ ln p(s)∂Σ j==ρ j N s (µ j , Σ j )∑k ρ kN s (µ k , Σ k )ρ j N s (µ j , Σ j )∑k ρ kN s (µ k , Σ k )ρ j N s (µ j , Σ j )∑k ρ kN s (µ k , Σ k )ρ j N s (µ j , Σ j )∑k ρ kN s (µ k , Σ k )ρ j N s (µ j , Σ j )∑∂ln[ρ j N s (µ∂ρ j , Σ j )]j1ρ j∂ln[ρ j N s (µ∂µ j , Σ j )]j[−Σ−1k (s − µ k) ]∂k ρ ln[ρ j N s (µkN s (µ k , Σ k ) ∂Σ j , Σ j )]jρ j N s (µ j , Σ j ) 1 [∑−Σ−1k ρ j + Σ −1jkN s (µ k , Σ k ) 2But ρ k and Σ k needs to be constrained.(s − µ j )(s − µ j ) T Σ −1 ]j(310)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 39


9 Special Matrices

9.1 Orthogonal, Ortho-symmetric, and Ortho-skew

9.1.1 Orthogonal

By definition, the real matrix A is orthogonal if and only if

A^{-1} = A^T

Basic properties for the orthogonal matrix A:

A^{-T} = A
A A^T = I
A^T A = I
det(A) = ±1

9.2 Units, Permutation and Shift

9.2.1 Unit vector

Let e_i ∈ R^{n×1} be the ith unit vector, i.e. the vector which is zero in all entries except the ith, at which it is 1.

9.2.2 Rows and Columns

i.th row of A = e_i^T A                                      (311)
j.th column of A = A e_j                                     (312)

9.2.3 Permutations

Let P be some permutation matrix, e.g.

P = [ 0 1 0 ]                      [ e_2^T ]
    [ 1 0 0 ] = [ e_2  e_1  e_3 ] = [ e_1^T ]                (313)
    [ 0 0 1 ]                      [ e_3^T ]

For permutation matrices it holds that

P P^T = I                                                    (314)

and that

A P = [ A e_2   A e_1   A e_3 ]        P A = [ e_2^T A ]
                                             [ e_1^T A ]
                                             [ e_3^T A ]     (315)

That is, the first is a matrix which has the columns of A but in permuted sequence, and the second is a matrix which has the rows of A but in permuted sequence.
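A small NumPy illustration of (313)-(315) on an arbitrary 3 × 3 example:

```python
import numpy as np

A = np.arange(9.0).reshape(3, 3)
P = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]], dtype=float)   # the permutation matrix from (313)

assert np.allclose(P @ P.T, np.eye(3))   # (314)
# A P permutes the columns of A, P A permutes the rows of A, as in (315)
assert np.allclose(A @ P, A[:, [1, 0, 2]])
assert np.allclose(P @ A, A[[1, 0, 2], :])
```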


9.3 <strong>The</strong> Singleentry <strong>Matrix</strong> 9 SPECIAL MATRICES9.2.4 Translation, Shift or Lag OperatorsLet L denote the lag (or ’translation’ or ’shift’) operator defined on a 4 × 4example by⎡⎤0 0 0 0L = ⎢ 1 0 0 0⎥⎣ 0 1 0 0 ⎦ (316)0 0 1 0i.e. a matrix of zeros with one on the sub-diagonal, (L) ij = δ i,j+1 . With somesignal x t for t = 1, ..., N, the n.th power of the lag operator shifts the indices,i.e.{(L n 0 for t = 1, .., nx) t =(317)x t−n for t = n + 1, ..., NA related but slightly different matrix is the ’recurrent shifted’ operator definedon a 4x4 example by⎡⎤0 0 0 1ˆL = ⎢ 1 0 0 0⎥⎣ 0 1 0 0 ⎦ (318)0 0 1 0i.e. a matrix defined by (ˆL) ij = δ i,j+1 + δ i,1 δ j,dim(L) . On a signal x it has theeffect(ˆL n x) t = x t ′, t ′ = [(t − n) mod N] + 1 (319)That is, ˆL is like the shift operator L except that it ’wraps’ the signal as if itwas periodic and shifted (substituting the zeros with the rear end of the signal).Note that ˆL is invertible and orthogonal, i.e.ˆL −1 = ˆL T (320)9.3 <strong>The</strong> Singleentry <strong>Matrix</strong>9.3.1 Definition<strong>The</strong> single-entry matrix J ij ∈ R n×n is defined as the matrix which is zeroeverywhere except in the entry (i, j) in which it is 1. In a 4 × 4 example onemight haveJ 23 =⎡⎢⎣0 0 0 00 0 1 00 0 0 00 0 0 0⎤⎥⎦ (321)<strong>The</strong> single-entry matrix is very useful when working with derivatives of expressionsinvolving matrices.Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 41


9.3 <strong>The</strong> Singleentry <strong>Matrix</strong> 9 SPECIAL MATRICES9.3.2 Swap and ZerosAssume A to be n × m and J ij to be m × pAJ ij = [ 0 0 . . . A i . . . 0 ] (322)i.e. an n × p matrix of zeros with the i.th column of A in place of the j.thcolumn. Assume A to be n × m and J ij to be p × n⎡ ⎤0.0J ij A =A j(323)0⎢ ⎥⎣ . ⎦0i.e. an p × m matrix of zeros with the j.th row of A in the placed of the i.throw.9.3.3 Rewriting product of elementsA ki B jl = (Ae i e T j B) kl = (AJ ij B) kl (324)A ik B lj = (A T e i e T j B T ) kl = (A T J ij B T ) kl (325)A ik B jl = (A T e i e T j B) kl = (A T J ij B) kl (326)A ki B lj = (Ae i e T j B T ) kl = (AJ ij B T ) kl (327)9.3.4 Properties of the Singleentry <strong>Matrix</strong>If i = jIf i ≠ jJ ij J ij = J ij (J ij ) T (J ij ) T = J ijJ ij (J ij ) T = J ij (J ij ) T J ij = J ijJ ij J ij = 0 (J ij ) T (J ij ) T = 0J ij (J ij ) T = J ii (J ij ) T J ij = J jj9.3.5 <strong>The</strong> Singleentry <strong>Matrix</strong> in Scalar ExpressionsAssume A is n × m and J is m × n, thenTr(AJ ij ) = Tr(J ij A) = (A T ) ij (328)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 42


9.4 Symmetric and Antisymmetric 9 SPECIAL MATRICESAssume A is n × n, J is n × m and B is m × n, thenTr(AJ ij B) = (A T B T ) ij (329)Tr(AJ ji B) = (BA) ij (330)Tr(AJ ij J ij B) = diag(A T B T ) ij (331)Assume A is n × n, J ij is n × m B is m × n, then9.3.6 Structure Matrices<strong>The</strong> structure matrix is defined byIf A has no special structure thenx T AJ ij Bx = (A T xx T B T ) ij (332)x T AJ ij J ij Bx = diag(A T xx T B T ) ij (333)∂A∂A ij= S ij (334)S ij = J ij (335)If A is symmetric thenS ij = J ij + J ji − J ij J ij (336)9.4 Symmetric and Antisymmetric9.4.1 Symmetric<strong>The</strong> matrix A is said to be symmetric ifA = A T (337)Symmetric matrices have many important properties, e.g. that their eigenvaluesare real and eigenvectors orthogonal.9.4.2 Antisymmetric<strong>The</strong> antisymmetric matrix is also known as the skew symmetric matrix. It hasthe following property from which it is definedA = −A T (338)Hereby, it can be seen that the antisymmetric matrices always have a zerodiagonal. <strong>The</strong> n × n antisymmetric matrices also have the following properties.det(A T ) = det(−A) = (−1) n det(A) (339)− det(A) = det(−A) = 0, if n is odd (340)<strong>The</strong> eigenvalues of an antisymmetric matrix are placed on the imaginary axisand the eigenvectors are unitary.Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 43


9.5 Orthogonal matrices 9 SPECIAL MATRICES9.4.3 DecompositionA square matrix A can always be written as a sum of a symmetric A + and anantisymmetric matrix A −A = A + + A − (341)Such a decomposition could e.g. beA = A + AT2+ A − AT2= A + + A − (342)9.5 Orthogonal matricesIf a square matrix Q is orthogonal Q T Q = QQ T = I. Furthermore Q had thefollowing properties• Its eigenvalues are placed on the unit circle.• Its eigenvectors are unitary.• det(Q) = ±1.• <strong>The</strong> inverse of an orthogonal matrix is orthogonal too.9.5.1 Ortho-SymA matrix Q + which simultaneously is orthogonal and symmetric is called anortho-sym matrix [19]. HerebyQ T +Q + = I (343)Q + = Q T + (344)<strong>The</strong> powers of an ortho-sym matrix are given by the following rule9.5.2 Ortho-SkewQ k + = 1 + (−1)k2= 1 + cos(kπ)2I + 1 + (−1)k+1 Q + (345)2I + 1 − cos(kπ) Q + (346)2A matrix which simultaneously is orthogonal and antisymmetric is called anortho-skew matrix [19]. HerebyQ H − Q − = I (347)Q − = −Q H − (348)<strong>The</strong> powers of an ortho-skew matrix are given by the following ruleQ k − = ik + (−i) k2I − i ik − (−i) kQ − (349)2= cos(k π 2 )I + sin(k π 2 )Q − (350)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 44


9.6 Vandermonde Matrices 9 SPECIAL MATRICES9.5.3 DecompositionA square matrix A can always be written as a sum of a symmetric A + and anantisymmetric matrix A −A = A + + A − (351)9.6 Vandermonde MatricesA Vandermonde matrix has the form [15]⎡V = ⎢⎣ .1 v 1 v 2 1 · · · v n−111 v 2 v 2 2 · · · v n−12. . .1 v n vn 2 · · · vnn−1⎤⎥⎦ . (352)<strong>The</strong> transpose of V is also said to a Vandermonde matrix. <strong>The</strong> determinant isgiven by [28]det V = ∏ i>j(v i − v j ) (353)9.7 Toeplitz MatricesA Toeplitz matrix T is a matrix where the elements of each diagonal is thesame. In the n × n square case, it has the following structure:⎡⎤ ⎡⎤t 11 t 12 · · · t 1n t 0 t 1 · · · t n−1.T =t .. . .. 21 .⎢⎣.. .. . ⎥ .. t12 ⎦ = . t .. . .. −1 .⎢⎣.. .. . ⎥ (354)..t1 ⎦t n1 · · · t 21 t 11 t −(n−1) · · · t −1 t 0A Toeplitz matrix is persymmetric. If a matrix is persymmetric (or orthosymmetric),it means that the matrix is symmetric about its northeast-southwestdiagonal (anti-diagonal) [12]. Persymmetric matrices is a larger class of matrices,since a persymmetric matrix not necessarily has a Toeplitz structure. <strong>The</strong>reare some special cases of Toeplitz matrices. <strong>The</strong> symmetric Toeplitz matrix isgiven by:⎡⎤t 0 t 1 · · · t n−1.T =t .. . .. 1 .⎢⎣.. .. . ⎥(355)..t1 ⎦t −(n−1) · · · t 1 t 0Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 45
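A quick NumPy check of the Vandermonde determinant formula (353) above, on an arbitrary node set:

```python
import numpy as np

v = np.array([1.0, 2.0, 4.0, 7.0])
V = np.vander(v, increasing=True)   # rows [1, v_i, v_i^2, v_i^3], as in (352)

# (353): det V = product over i > j of (v_i - v_j)
prod = np.prod([v[i] - v[j] for i in range(len(v)) for j in range(i)])
assert np.isclose(np.linalg.det(V), prod)
```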


9.8 <strong>The</strong> DFT <strong>Matrix</strong> 9 SPECIAL MATRICES<strong>The</strong> circular Toeplitz matrix:⎡⎤t 0 t 1 · · · t n−1.T C =t .. . .. n .⎢⎣.. .. . ⎥.. t1 ⎦t 1 · · · t n−1 t 0(356)<strong>The</strong> upper triangular Toeplitz matrix:⎡t 0 t 1 · · · t n−1⎤0 · · · 0 t 0 .T U =0 .. . .. .⎢⎣.. .. . ⎥ .. t1 ⎦ , (357)and the lower triangular Toeplitz matrix:⎡⎤t 0 0 · · · 0.T L =t .. . .. −1 .⎢⎣.. .. . ⎥ .. 0 ⎦t −(n−1) · · · t −1 t 09.7.1 Properties of Toeplitz Matrices(358)<strong>The</strong> Toeplitz matrix has some computational advantages. <strong>The</strong> addition of twoToeplitz matrices can be done with O(n) flops, multiplication of two Toeplitzmatrices can be done in O(n ln n) flops. Toeplitz equation systems can be solvedin O(n 2 ) flops. <strong>The</strong> inverse of a positive definite Toeplitz matrix can be foundin O(n 2 ) flops too. <strong>The</strong> inverse of a Toeplitz matrix is persymmetric. <strong>The</strong>product of two lower triangular Toeplitz matrices is a Toeplitz matrix. Moreinformation on Toeplitz matrices and circulant matrices can be found in [13, 7].9.8 <strong>The</strong> DFT <strong>Matrix</strong><strong>The</strong> DFT matrix is an N × N symmetric matrix W N , where the k, nth elementis given by= e −j2πknN (359)W knNThus the discrete Fourier transform (DFT) can be expressed asX(k) =N−1∑n=0x(n)W knN . (360)Likewise the inverse discrete Fourier transform (IDFT) can be expressed asx(n) = 1 NN−1∑k=0X(k)W −knN . (361)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 46
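A short NumPy sketch building W_N from (359) and checking the DFT sum (360) against np.fft.fft; it also confirms the inverse relation W_N^{-1} = (1/N) W_N^* used for the IDFT (the length N = 8 is arbitrary):

```python
import numpy as np

N = 8
k = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(k, k) / N)   # (359): W_N[k, n] = exp(-j 2 pi k n / N)

x = np.random.default_rng(9).standard_normal(N)
# (360): the DFT as a matrix-vector product, compared with the FFT
assert np.allclose(W @ x, np.fft.fft(x))
# the IDFT matrix: W_N^{-1} = (1/N) W_N^*
assert np.allclose(np.linalg.inv(W), W.conj() / N)
```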


The DFT of the vector x = [x(0), x(1), ..., x(N-1)]^T can be written in matrix form as

  X = W_N x,                                                                (362)

where X = [X(0), X(1), ..., X(N-1)]^T. The IDFT is similarly given as

  x = W_N^{-1} X                                                            (363)

Some properties of W_N exist: if W_N = e^{-j2\pi/N}, then [22]

  W_N^{-1} = \frac{1}{N} W_N^*                                              (364)
  W_N W_N^* = N I                                                           (365)
  W_N^* = W_N^H                                                             (366)
  W_N^{m+N/2} = -W_N^m                                                      (367)

Notice that the DFT matrix is a Vandermonde matrix.

The following important relation between the circulant matrix and the discrete Fourier transform (DFT) exists

  T_C = W_N^{-1} (I \circ (W_N t)) W_N,                                     (368)

where t = [t_0, t_1, ..., t_{n-1}]^T is the first row of T_C.

9.9 Positive Definite and Semi-definite Matrices

9.9.1 Definitions

A matrix A is positive definite if and only if

  x^T A x > 0,   ∀x ≠ 0                                                     (369)

A matrix A is positive semi-definite if and only if

  x^T A x ≥ 0,   ∀x                                                         (370)

Note that if A is positive definite, then A is also positive semi-definite.

9.9.2 Eigenvalues

The following holds with respect to the eigenvalues:

  A pos. def.       ⇔  eig((A + A^H)/2) > 0
  A pos. semi-def.  ⇔  eig((A + A^H)/2) ≥ 0                                 (371)
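Relation (368) says that the DFT matrix diagonalizes a circulant (circular Toeplitz) matrix. The sketch below checks this numerically using SciPy's circulant constructor, which is parameterized by the first column rather than the first row (the two conventions differ essentially by a transpose), so it illustrates the idea rather than transcribing eq. (368) literally:

```python
import numpy as np
from scipy.linalg import circulant

n = 5
c = np.arange(1.0, n + 1)                     # first *column* of the circulant matrix
C = circulant(c)

# DFT matrix with elements e^{-j 2 pi k n / N}, eq. (359)
k = np.arange(n)
W = np.exp(-2j * np.pi * np.outer(k, k) / n)

# C = W^{-1} diag(W c) W   (first-column convention)
D = np.diag(W @ c)
assert np.allclose(C, np.linalg.inv(W) @ D @ W)
```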


9.9.3 Trace

The following holds with respect to the trace:

  A pos. def.       ⇒  Tr(A) > 0
  A pos. semi-def.  ⇒  Tr(A) ≥ 0                                            (372)

9.9.4 Inverse

If A is positive definite, then A is invertible and A^{-1} is also positive definite.

9.9.5 Diagonal

If A is positive definite, then A_{ii} > 0 for all i.

9.9.6 Decomposition I

The matrix A is positive semi-definite of rank r ⇔ there exists a matrix B of rank r such that A = BB^T.

The matrix A is positive definite ⇔ there exists an invertible matrix B such that A = BB^T.

9.9.7 Decomposition II

Assume A is an n × n positive semi-definite matrix; then there exists an n × r matrix B of rank r such that B^T A B = I.

9.9.8 Equation with zeros

Assume A is positive semi-definite; then X^T A X = 0 ⇒ A X = 0.

9.9.9 Rank of product

Assume A is positive definite; then rank(BAB^T) = rank(B).

9.9.10 Positive definite property

If A is n × n positive definite and B is r × n of rank r, then BAB^T is positive definite.

9.9.11 Outer Product

If X is n × r, where n ≤ r and rank(X) = n, then XX^T is positive definite.

9.9.12 Small perturbations

If A is positive definite and B is symmetric, then A − tB is positive definite for sufficiently small t.
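A few of the positive-definiteness statements above can be checked numerically. In the sketch below, A = BB^T with a random square B (invertible with probability one), so A is positive definite by Decomposition I; the Cholesky factorization is used as a practical test:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))      # random square B, almost surely invertible
A = B @ B.T                          # positive definite by Decomposition I

# Eigenvalue characterization, eq. (371), for symmetric A
assert np.all(np.linalg.eigvalsh(A) > 0)

# Cholesky factorization exists exactly for (numerically) positive definite matrices
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)

# Section 9.9.4: A^{-1} is positive definite; section 9.9.5: the diagonal is positive
assert np.all(np.linalg.eigvalsh(np.linalg.inv(A)) > 0)
assert np.all(np.diag(A) > 0)
```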


9.10 Block matrices

Let A_{ij} denote the ij-th block of A.

9.10.1 Multiplication

Assuming the dimensions of the blocks match, we have

  \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
  \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}
  =
  \begin{bmatrix}
  A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\
  A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22}
  \end{bmatrix}

9.10.2 The Determinant

The determinant can be expressed by use of

  C_1 = A_{11} - A_{12} A_{22}^{-1} A_{21}                                  (373)
  C_2 = A_{22} - A_{21} A_{11}^{-1} A_{12}                                  (374)

as

  \det\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
  = \det(A_{22}) \cdot \det(C_1) = \det(A_{11}) \cdot \det(C_2)

9.10.3 The Inverse

The inverse can be expressed by use of

  C_1 = A_{11} - A_{12} A_{22}^{-1} A_{21}                                  (375)
  C_2 = A_{22} - A_{21} A_{11}^{-1} A_{12}                                  (376)

as

  \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1}
  = \begin{bmatrix}
    C_1^{-1} & -A_{11}^{-1} A_{12} C_2^{-1} \\
    -C_2^{-1} A_{21} A_{11}^{-1} & C_2^{-1}
    \end{bmatrix}
  = \begin{bmatrix}
    A_{11}^{-1} + A_{11}^{-1} A_{12} C_2^{-1} A_{21} A_{11}^{-1} & -C_1^{-1} A_{12} A_{22}^{-1} \\
    -A_{22}^{-1} A_{21} C_1^{-1} & A_{22}^{-1} + A_{22}^{-1} A_{21} C_1^{-1} A_{12} A_{22}^{-1}
    \end{bmatrix}

9.10.4 Block diagonal

For block diagonal matrices we have

  \begin{bmatrix} A_{11} & 0 \\ 0 & A_{22} \end{bmatrix}^{-1}
  = \begin{bmatrix} (A_{11})^{-1} & 0 \\ 0 & (A_{22})^{-1} \end{bmatrix}    (377)

  \det\begin{bmatrix} A_{11} & 0 \\ 0 & A_{22} \end{bmatrix}
  = \det(A_{11}) \cdot \det(A_{22})                                         (378)


9.10.5 Schur complement

The Schur complement of the matrix

  \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}

is the matrix

  A_{11} - A_{12} A_{22}^{-1} A_{21},

that is, what is denoted C_1 above. Using the Schur complement, one can rewrite the inverse of a block matrix

  \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1}
  = \begin{bmatrix} I & 0 \\ -A_{22}^{-1} A_{21} & I \end{bmatrix}
    \begin{bmatrix} (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} & 0 \\ 0 & A_{22}^{-1} \end{bmatrix}
    \begin{bmatrix} I & -A_{12} A_{22}^{-1} \\ 0 & I \end{bmatrix}

The Schur complement is useful when solving linear systems of the form

  \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
  \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
  = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}

which has the following equation for x_1:

  (A_{11} - A_{12} A_{22}^{-1} A_{21}) x_1 = b_1 - A_{12} A_{22}^{-1} b_2

When the appropriate inverses exist, this can be solved for x_1, which can then be inserted in the equation for x_2 to solve for x_2.
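As a numerical sanity check of the block-inverse formulas in Section 9.10.3, the sketch below assembles the inverse of a random block matrix from the two Schur complements and compares it with a direct inverse (the blocks of a random Gaussian matrix are invertible with probability one):

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 3, 2
M = rng.standard_normal((n1 + n2, n1 + n2))
A11, A12 = M[:n1, :n1], M[:n1, n1:]
A21, A22 = M[n1:, :n1], M[n1:, n1:]

inv = np.linalg.inv
C1 = A11 - A12 @ inv(A22) @ A21      # Schur complement, eq. (375)
C2 = A22 - A21 @ inv(A11) @ A12      # eq. (376)

# First form of the block inverse in Section 9.10.3
top = np.hstack([inv(C1), -inv(A11) @ A12 @ inv(C2)])
bottom = np.hstack([-inv(C2) @ A21 @ inv(A11), inv(C2)])
assert np.allclose(np.vstack([top, bottom]), inv(M))
```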


10 Functions and Operators

10.1 Functions and Series

10.1.1 Finite Series

  (X^n - I)(X - I)^{-1} = I + X + X^2 + ... + X^{n-1}                       (379)

10.1.2 Taylor Expansion of Scalar Function

Consider some scalar function f(x) which takes the vector x as an argument. This we can Taylor expand around x_0:

  f(x) ≅ f(x_0) + g(x_0)^T (x - x_0) + \frac{1}{2}(x - x_0)^T H(x_0)(x - x_0)   (380)

where

  g(x_0) = \left.\frac{\partial f(x)}{\partial x}\right|_{x_0},
  H(x_0) = \left.\frac{\partial^2 f(x)}{\partial x \partial x^T}\right|_{x_0}

10.1.3 Matrix Functions by Infinite Series

As for analytical functions in one dimension, one can define a matrix function for square matrices X by an infinite series

  f(X) = \sum_{n=0}^{\infty} c_n X^n                                        (381)

assuming the limit exists and is finite. If the coefficients c_n fulfil \sum_n c_n x^n < ∞, then one can prove that the above series exists and is finite, see [1]. Thus for any analytical function f(x) there exists a corresponding matrix function f(X) constructed by the Taylor expansion. Using this one can prove the following results:

1) A matrix A is a zero of its own characteristic polynomial [1]:

  p(λ) = \det(Iλ - A) = \sum_n c_n λ^n   ⇒   p(A) = 0                       (382)

2) If A is square it holds that [1]

  A = UBU^{-1}   ⇒   f(A) = U f(B) U^{-1}                                   (383)

3) A useful fact when using power series is that

  A^n → 0  for  n → ∞  if  |A| < 1                                          (384)
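A small NumPy check of the finite series (379), with an arbitrary test matrix scaled so that X − I is safely invertible:

```python
import numpy as np

rng = np.random.default_rng(3)
X = 0.5 * rng.standard_normal((3, 3))
n = 6
I = np.eye(3)

# (X^n - I)(X - I)^{-1} = I + X + ... + X^{n-1}, eq. (379)
lhs = (np.linalg.matrix_power(X, n) - I) @ np.linalg.inv(X - I)
rhs = sum(np.linalg.matrix_power(X, k) for k in range(n))
assert np.allclose(lhs, rhs)
```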


10.1.4 Exponential Matrix Function

In analogy to the ordinary scalar exponential function, one can define exponential and logarithmic matrix functions:

  e^A ≡ \sum_{n=0}^{\infty} \frac{1}{n!} A^n = I + A + \frac{1}{2}A^2 + ...                       (385)
  e^{-A} ≡ \sum_{n=0}^{\infty} \frac{1}{n!} (-1)^n A^n = I - A + \frac{1}{2}A^2 - ...             (386)
  e^{tA} ≡ \sum_{n=0}^{\infty} \frac{1}{n!} (tA)^n = I + tA + \frac{1}{2}t^2A^2 + ...             (387)
  \ln(I + A) ≡ \sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{n} A^n = A - \frac{1}{2}A^2 + \frac{1}{3}A^3 - ...   (388)

Some of the properties of the exponential function are [1]

  e^A e^B = e^{A+B}   if   AB = BA                                          (389)
  (e^A)^{-1} = e^{-A}                                                       (390)
  \frac{d}{dt} e^{tA} = A e^{tA} = e^{tA} A,   t ∈ ℝ                        (391)
  \frac{d}{dt} Tr(e^{tA}) = Tr(A e^{tA})                                    (392)
  \det(e^A) = e^{Tr(A)}                                                     (393)

10.1.5 Trigonometric Functions

  \sin(A) ≡ \sum_{n=0}^{\infty} \frac{(-1)^n A^{2n+1}}{(2n+1)!} = A - \frac{1}{3!}A^3 + \frac{1}{5!}A^5 - ...   (394)
  \cos(A) ≡ \sum_{n=0}^{\infty} \frac{(-1)^n A^{2n}}{(2n)!} = I - \frac{1}{2!}A^2 + \frac{1}{4!}A^4 - ...       (395)

10.2 Kronecker and Vec Operator

10.2.1 The Kronecker Product

The Kronecker product of an m × n matrix A and an r × q matrix B is an mr × nq matrix, A ⊗ B, defined as

  A ⊗ B = \begin{bmatrix}
          A_{11}B & A_{12}B & \cdots & A_{1n}B \\
          A_{21}B & A_{22}B & \cdots & A_{2n}B \\
          \vdots  &         &        & \vdots  \\
          A_{m1}B & A_{m2}B & \cdots & A_{mn}B
          \end{bmatrix}                                                     (396)
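The exponential-matrix properties (390) and (393) can be verified numerically with SciPy's expm; the random test matrix is an arbitrary choice for this sketch.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))

# det(e^A) = e^{Tr(A)}, eq. (393)
assert np.isclose(np.linalg.det(expm(A)), np.exp(np.trace(A)))

# (e^A)^{-1} = e^{-A}, eq. (390)
assert np.allclose(np.linalg.inv(expm(A)), expm(-A))
```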


The Kronecker product has the following properties (see [18])

  A ⊗ (B + C) = A ⊗ B + A ⊗ C                                               (397)
  A ⊗ B ≠ B ⊗ A   in general                                                (398)
  A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C                                                 (399)
  (α_A A ⊗ α_B B) = α_A α_B (A ⊗ B)                                         (400)
  (A ⊗ B)^T = A^T ⊗ B^T                                                     (401)
  (A ⊗ B)(C ⊗ D) = AC ⊗ BD                                                  (402)
  (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}                                            (403)
  rank(A ⊗ B) = rank(A) rank(B)                                             (404)
  Tr(A ⊗ B) = Tr(A) Tr(B)                                                   (405)
  \det(A ⊗ B) = \det(A)^{rank(B)} \det(B)^{rank(A)}                         (406)
  {eig(A ⊗ B)} = {eig(B ⊗ A)}        if A, B are square                     (407)
  {eig(A ⊗ B)} = {eig(A) eig(B)^T}   if A, B are square                     (408)

where {λ_i} denotes the set of values λ_i, that is, the values in no particular order or structure.

10.2.2 The Vec Operator

The vec-operator applied on a matrix A stacks the columns into a vector, i.e. for a 2 × 2 matrix

  A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},
  vec(A) = \begin{bmatrix} A_{11} \\ A_{21} \\ A_{12} \\ A_{22} \end{bmatrix}

Properties of the vec-operator include (see [18])

  vec(AXB) = (B^T ⊗ A) vec(X)                                               (409)
  Tr(A^T B) = vec(A)^T vec(B)                                               (410)
  vec(A + B) = vec(A) + vec(B)                                              (411)
  vec(αA) = α · vec(A)                                                      (412)

10.3 Solutions to Systems of Equations

10.3.1 Simple Linear Regression

Assume we have data (x_n, y_n) for n = 1, ..., N and are seeking the parameters a, b ∈ ℝ such that y_i ≅ a x_i + b. With a least squares error function, the optimal values for a, b can be expressed using the notation

  x = (x_1, ..., x_N)^T,   y = (y_1, ..., y_N)^T,   1 = (1, ..., 1)^T ∈ ℝ^{N×1}


and

  R_{xx} = x^T x,   R_{x1} = x^T 1,   R_{11} = 1^T 1,
  R_{yx} = y^T x,   R_{y1} = y^T 1

as

  \begin{bmatrix} a \\ b \end{bmatrix}
  = \begin{bmatrix} R_{xx} & R_{x1} \\ R_{x1} & R_{11} \end{bmatrix}^{-1}
    \begin{bmatrix} R_{yx} \\ R_{y1} \end{bmatrix}                          (413)

10.3.2 Existence in Linear Systems

Assume A is n × m and consider the linear system

  Ax = b                                                                    (414)

Construct the augmented matrix B = [A b]; then

  Condition                     Solution
  rank(A) = rank(B) = m         Unique solution x
  rank(A) = rank(B) < m         Many solutions x
  rank(A) < rank(B)             No solutions x

10.3.3 Standard Square

Assume A is square and invertible, then

  Ax = b   ⇒   x = A^{-1} b                                                 (415)

10.3.4 Degenerated Square

Assume A is n × n but of rank r < n. In that case, the system Ax = b is solved by

  x = A^+ b

where A^+ is the pseudo-inverse of the rank-deficient matrix, constructed as described in section 3.6.3.

10.3.5 Cramer's rule

The equation

  Ax = b,                                                                   (416)

where A is square and invertible, has exactly one solution x, and the i-th element of x can be found as

  x_i = \frac{\det B}{\det A},                                              (417)

where B equals A, but with the i-th column of A substituted by b.
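A minimal NumPy sketch of the regression solution (413), using synthetic data (the slope, intercept, and noise level are arbitrary choices) and comparing against a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 50)
y = 2.5 * x + 1.0 + 0.1 * rng.standard_normal(50)
one = np.ones_like(x)

# Normal equations assembled as in eq. (413)
R = np.array([[x @ x, x @ one],
              [x @ one, one @ one]])
rhs = np.array([y @ x, y @ one])
a, b = np.linalg.solve(R, rhs)

# Same answer from a generic least-squares solver
coeffs, *_ = np.linalg.lstsq(np.column_stack([x, one]), y, rcond=None)
assert np.allclose([a, b], coeffs)
```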


10.3.6 Over-determined Rectangular

Assume A to be n × m, n > m (tall) and rank(A) = m, then

  Ax = b   ⇒   x = (A^T A)^{-1} A^T b = A^+ b                               (418)

that is, if there exists a solution x at all! If there is no solution, the following can be useful:

  Ax = b   ⇒   x_min = A^+ b                                                (419)

Now x_min is the vector x which minimizes ||Ax - b||^2, i.e. the vector which is "least wrong". The matrix A^+ is the pseudo-inverse of A. See [3].

10.3.7 Under-determined Rectangular

Assume A is n × m and n < m ("broad") and rank(A) = n.

  Ax = b   ⇒   x_min = A^T (AA^T)^{-1} b                                    (420)

The equation has many solutions x. But x_min is the solution which minimizes ||Ax - b||^2 and also the solution with the smallest norm ||x||^2. The same holds for a matrix version: Assume A is n × m, X is m × n and B is n × n, then

  AX = B   ⇒   X_min = A^+ B                                                (421)

The equation has many solutions X. But X_min is the solution which minimizes ||AX - B||^2 and also the solution with the smallest norm ||X||^2. See [3].

Similar but different: Assume A is square n × n and the matrices B_0, B_1 are n × N, where N > n. Then, if B_0 has maximal rank,

  AB_0 = B_1   ⇒   A_min = B_1 B_0^T (B_0 B_0^T)^{-1}                       (422)

where A_min denotes the matrix which is optimal in a least square sense. An interpretation is that A is the linear approximation which maps the column vectors of B_0 into the column vectors of B_1.

10.3.8 Linear form and zeros

  Ax = 0, ∀x   ⇒   A = 0                                                    (423)

10.3.9 Square form and zeros

If A is symmetric, then

  x^T A x = 0, ∀x   ⇒   A = 0                                               (424)

10.3.10 The Lyapunov Equation

  AX + XB = C                                                               (425)
  vec(X) = (I ⊗ A + B^T ⊗ I)^{-1} vec(C)                                    (426)

See Sec 10.2.1 and 10.2.2 for details on the Kronecker product and the vec operator.
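The solution (426) is a direct application of the vec identity (409). The sketch below solves AX + XB = C via the Kronecker form and compares with SciPy's dedicated solver; a unique solution exists as long as A and −B share no eigenvalues, which holds almost surely for the random data used here.

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(7)
n, m = 3, 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((m, m))
C = rng.standard_normal((n, m))

vec = lambda M: M.reshape(-1, order="F")       # column stacking, Section 10.2.2

# vec(X) = (I kron A + B^T kron I)^{-1} vec(C), eq. (426)
K = np.kron(np.eye(m), A) + np.kron(B.T, np.eye(n))
X = np.linalg.solve(K, vec(C)).reshape(n, m, order="F")
assert np.allclose(A @ X + X @ B, C)

# Same solution from the dedicated Sylvester solver
assert np.allclose(X, solve_sylvester(A, B, C))
```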


10.3.11 Encapsulating Sum

  \sum_n A_n X B_n = C                                                      (427)
  vec(X) = \left(\sum_n B_n^T ⊗ A_n\right)^{-1} vec(C)                      (428)

See Sec 10.2.1 and 10.2.2 for details on the Kronecker product and the vec operator.

10.4 Vector Norms

10.4.1 Examples

  ||x||_1 = \sum_i |x_i|                                                    (429)
  ||x||_2^2 = x^H x                                                         (430)
  ||x||_p = \left[\sum_i |x_i|^p\right]^{1/p}                               (431)
  ||x||_∞ = \max_i |x_i|                                                    (432)

Further reading in e.g. [12, p. 52].

10.5 Matrix Norms

10.5.1 Definitions

A matrix norm is a mapping which fulfils

  ||A|| ≥ 0                                                                 (433)
  ||A|| = 0   ⇔   A = 0                                                     (434)
  ||cA|| = |c| ||A||,   c ∈ ℝ                                               (435)
  ||A + B|| ≤ ||A|| + ||B||                                                 (436)

10.5.2 Induced Norm or Operator Norm

An induced norm is a matrix norm induced by a vector norm by the following

  ||A|| = sup{ ||Ax||  :  ||x|| = 1 }                                       (437)

where || · || on the left side is the induced matrix norm, while || · || on the right side denotes the vector norm. For induced norms it holds that

  ||I|| = 1                                                                 (438)
  ||Ax|| ≤ ||A|| · ||x||,   for all A, x                                    (439)
  ||AB|| ≤ ||A|| · ||B||,   for all A, B                                    (440)


10.5.3 Examples

  ||A||_1 = \max_j \sum_i |A_{ij}|                                          (441)
  ||A||_2 = \sqrt{\max \mathrm{eig}(A^H A)}                                 (442)
  ||A||_p = \max_{||x||_p = 1} ||Ax||_p                                     (443)
  ||A||_∞ = \max_i \sum_j |A_{ij}|                                          (444)
  ||A||_F = \sqrt{\sum_{ij} |A_{ij}|^2} = \sqrt{Tr(AA^H)}   (Frobenius)     (445)
  ||A||_{max} = \max_{ij} |A_{ij}|                                          (446)
  ||A||_{KF} = ||sing(A)||_1   (Ky Fan)                                     (447)

where sing(A) is the vector of singular values of the matrix A.

10.5.4 Inequalities

E. H. Rasmussen has in yet unpublished material derived and collected the following inequalities. They are collected in the table below, assuming A is m × n and d = rank(A):

              ||A||_max   ||A||_1   ||A||_∞   ||A||_2   ||A||_F   ||A||_KF
  ||A||_max       1          1         1         1         1         1
  ||A||_1         m          1         m        √m        √m        √m
  ||A||_∞         n          n         1        √n        √n        √n
  ||A||_2       √(mn)       √n        √m         1         1         1
  ||A||_F       √(mn)       √n        √m        √d         1         1
  ||A||_KF      √(mnd)     √(nd)     √(md)       d        √d         1

which is to be read as, e.g.

  ||A||_2 ≤ √m · ||A||_∞                                                    (448)

10.5.5 Condition Number

The 2-norm of A equals \sqrt{\max(\mathrm{eig}(A^T A))} [12, p. 57]. For a symmetric, positive definite matrix, this reduces to max(eig(A)). The condition number based on the 2-norm thus reduces to

  ||A||_2 ||A^{-1}||_2 = \max(\mathrm{eig}(A)) \max(\mathrm{eig}(A^{-1})) = \frac{\max(\mathrm{eig}(A))}{\min(\mathrm{eig}(A))}   (449)

10.6 Rank

10.6.1 Sylvester's Inequality

If A is m × n and B is n × r, then

  rank(A) + rank(B) - n ≤ rank(AB) ≤ min{rank(A), rank(B)}                  (450)
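A few of the norm relations above are easy to confirm with NumPy: eq. (442) for the 2-norm, one entry of the inequality table (eq. (448)), and the eigenvalue form of the condition number (449) for a symmetric positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((5, 3))
m, n = A.shape

# ||A||_2 = sqrt(max eig(A^H A)), eq. (442)
two_norm = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))
assert np.isclose(two_norm, np.linalg.norm(A, 2))

# One entry of the inequality table: ||A||_2 <= sqrt(m) * ||A||_inf, eq. (448)
assert two_norm <= np.sqrt(m) * np.linalg.norm(A, np.inf) + 1e-12

# Condition number of a symmetric positive definite matrix, eq. (449)
S = A.T @ A
eigs = np.linalg.eigvalsh(S)
assert np.isclose(np.linalg.cond(S), eigs.max() / eigs.min())
```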


10.7 Integral Involving Dirac Delta Functions

Assuming A to be square, then

  \int p(s) \delta(x - As) ds = \frac{1}{\det(A)} p(A^{-1} x)               (451)

Assuming A to be "underdetermined", i.e. "tall", then

  \int p(s) \delta(x - As) ds =
  \begin{cases}
  \frac{1}{\sqrt{\det(A^T A)}} p(A^+ x) & \text{if } x = AA^+ x \\
  0 & \text{elsewhere}
  \end{cases}                                                               (452)

See [9].

10.8 Miscellaneous

For any A it holds that

  rank(A) = rank(A^T) = rank(AA^T) = rank(A^T A)                            (453)

It holds that

  A is positive definite   ⇔   ∃B invertible, such that A = BB^T            (454)


A One-dimensional Results

A.1 Gaussian

A.1.1 Density

  p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)     (455)

A.1.2 Normalization

  \int e^{-\frac{(s-\mu)^2}{2\sigma^2}} ds = \sqrt{2\pi\sigma^2}                            (456)
  \int e^{-(ax^2+bx+c)} dx = \sqrt{\frac{\pi}{a}} \exp\left[\frac{b^2 - 4ac}{4a}\right]     (457)
  \int e^{c_2 x^2 + c_1 x + c_0} dx = \sqrt{\frac{\pi}{-c_2}} \exp\left[\frac{c_1^2 - 4c_2 c_0}{-4c_2}\right]   (458)

A.1.3 Derivatives

  \frac{\partial p(x)}{\partial \mu} = p(x) \frac{(x - \mu)}{\sigma^2}                      (459)
  \frac{\partial \ln p(x)}{\partial \mu} = \frac{(x - \mu)}{\sigma^2}                       (460)
  \frac{\partial p(x)}{\partial \sigma} = p(x) \frac{1}{\sigma}\left[\frac{(x - \mu)^2}{\sigma^2} - 1\right]    (461)
  \frac{\partial \ln p(x)}{\partial \sigma} = \frac{1}{\sigma}\left[\frac{(x - \mu)^2}{\sigma^2} - 1\right]     (462)

A.1.4 Completing the Squares

  c_2 x^2 + c_1 x + c_0 = -a(x - b)^2 + w

  -a = c_2,    b = -\frac{1}{2}\frac{c_1}{c_2},    w = c_0 - \frac{1}{4}\frac{c_1^2}{c_2}

or

  c_2 x^2 + c_1 x + c_0 = -\frac{1}{2\sigma^2}(x - \mu)^2 + d

  \mu = \frac{-c_1}{2c_2},    \sigma^2 = \frac{-1}{2c_2},    d = c_0 - \frac{c_1^2}{4c_2}

A.1.5 Moments

If the density is expressed by

  p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]
  or    p(x) = C \exp(c_2 x^2 + c_1 x)                                                      (463)

then the first few basic moments are


  ⟨x⟩   = \mu                  = \frac{-c_1}{2c_2}
  ⟨x^2⟩ = \sigma^2 + \mu^2     = \frac{-1}{2c_2} + \left(\frac{-c_1}{2c_2}\right)^2
  ⟨x^3⟩ = 3\sigma^2\mu + \mu^3 = \frac{c_1}{(2c_2)^2}\left[3 - \frac{c_1^2}{2c_2}\right]
  ⟨x^4⟩ = \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4
        = \left(\frac{c_1}{2c_2}\right)^4 + 6\left(\frac{c_1}{2c_2}\right)^2\frac{-1}{2c_2} + 3\left(\frac{-1}{2c_2}\right)^2

and the central moments are

  ⟨(x - \mu)⟩   = 0         = 0
  ⟨(x - \mu)^2⟩ = \sigma^2  = \frac{-1}{2c_2}
  ⟨(x - \mu)^3⟩ = 0         = 0
  ⟨(x - \mu)^4⟩ = 3\sigma^4 = 3\left(\frac{-1}{2c_2}\right)^2

A kind of pseudo-moments (un-normalized integrals) can easily be derived as

  \int \exp(c_2 x^2 + c_1 x)\, x^n dx = Z⟨x^n⟩ = \sqrt{\frac{\pi}{-c_2}} \exp\left[\frac{c_1^2}{-4c_2}\right] ⟨x^n⟩   (464)

From the un-centralized moments one can derive other entities like

  ⟨x^2⟩ - ⟨x⟩^2      = \sigma^2                    = \frac{-1}{2c_2}
  ⟨x^3⟩ - ⟨x^2⟩⟨x⟩   = 2\sigma^2\mu                = \frac{2c_1}{(2c_2)^2}
  ⟨x^4⟩ - ⟨x^2⟩^2    = 2\sigma^4 + 4\mu^2\sigma^2  = \frac{2}{(2c_2)^2}\left[1 - \frac{c_1^2}{c_2}\right]

A.2 One Dimensional Mixture of Gaussians

A.2.1 Density and Normalization

  p(s) = \sum_k^{K} \frac{\rho_k}{\sqrt{2\pi\sigma_k^2}} \exp\left[-\frac{1}{2}\frac{(s - \mu_k)^2}{\sigma_k^2}\right]   (465)

A.2.2 Moments

A useful fact of MoG is that

  ⟨x^n⟩ = \sum_k \rho_k ⟨x^n⟩_k                                              (466)

where ⟨·⟩_k denotes average with respect to the k-th component. We can calculate the first four moments from the densities

  p(x) = \sum_k \rho_k \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left[-\frac{1}{2}\frac{(x - \mu_k)^2}{\sigma_k^2}\right]     (467)
  p(x) = \sum_k \rho_k C_k \exp\left[c_{k2} x^2 + c_{k1} x\right]                                                        (468)

as


  ⟨x⟩   = \sum_k \rho_k \mu_k                  = \sum_k \rho_k \left[\frac{-c_{k1}}{2c_{k2}}\right]
  ⟨x^2⟩ = \sum_k \rho_k (\sigma_k^2 + \mu_k^2) = \sum_k \rho_k \left[\frac{-1}{2c_{k2}} + \left(\frac{-c_{k1}}{2c_{k2}}\right)^2\right]
  ⟨x^3⟩ = \sum_k \rho_k (3\sigma_k^2\mu_k + \mu_k^3) = \sum_k \rho_k \left[\frac{c_{k1}}{(2c_{k2})^2}\left(3 - \frac{c_{k1}^2}{2c_{k2}}\right)\right]
  ⟨x^4⟩ = \sum_k \rho_k (\mu_k^4 + 6\mu_k^2\sigma_k^2 + 3\sigma_k^4)
        = \sum_k \rho_k \left[\left(\frac{1}{2c_{k2}}\right)^2\left(\left(\frac{c_{k1}^2}{2c_{k2}}\right)^2 - 6\frac{c_{k1}^2}{2c_{k2}} + 3\right)\right]

If all the Gaussians are centered, i.e. \mu_k = 0 for all k, then

  ⟨x⟩   = 0                          = 0
  ⟨x^2⟩ = \sum_k \rho_k \sigma_k^2   = \sum_k \rho_k \left[\frac{-1}{2c_{k2}}\right]
  ⟨x^3⟩ = 0                          = 0
  ⟨x^4⟩ = \sum_k \rho_k 3\sigma_k^4  = \sum_k \rho_k 3\left[\frac{-1}{2c_{k2}}\right]^2

From the un-centralized moments one can derive other entities like

  ⟨x^2⟩ - ⟨x⟩^2     = \sum_{k,k'} \rho_k \rho_{k'} \left[\mu_k^2 + \sigma_k^2 - \mu_k \mu_{k'}\right]
  ⟨x^3⟩ - ⟨x^2⟩⟨x⟩  = \sum_{k,k'} \rho_k \rho_{k'} \left[3\sigma_k^2\mu_k + \mu_k^3 - (\sigma_k^2 + \mu_k^2)\mu_{k'}\right]
  ⟨x^4⟩ - ⟨x^2⟩^2   = \sum_{k,k'} \rho_k \rho_{k'} \left[\mu_k^4 + 6\mu_k^2\sigma_k^2 + 3\sigma_k^4 - (\sigma_k^2 + \mu_k^2)(\sigma_{k'}^2 + \mu_{k'}^2)\right]

A.2.3 Derivatives

Defining p(s) = \sum_k \rho_k N_s(\mu_k, \sigma_k^2) we get for a parameter \theta_j of the j-th component

  \frac{\partial \ln p(s)}{\partial \theta_j} = \frac{\rho_j N_s(\mu_j, \sigma_j^2)}{\sum_k \rho_k N_s(\mu_k, \sigma_k^2)} \frac{\partial \ln(\rho_j N_s(\mu_j, \sigma_j^2))}{\partial \theta_j}   (469)

that is,

  \frac{\partial \ln p(s)}{\partial \rho_j} = \frac{\rho_j N_s(\mu_j, \sigma_j^2)}{\sum_k \rho_k N_s(\mu_k, \sigma_k^2)} \frac{1}{\rho_j}                                                          (470)
  \frac{\partial \ln p(s)}{\partial \mu_j} = \frac{\rho_j N_s(\mu_j, \sigma_j^2)}{\sum_k \rho_k N_s(\mu_k, \sigma_k^2)} \frac{(s - \mu_j)}{\sigma_j^2}                                             (471)
  \frac{\partial \ln p(s)}{\partial \sigma_j} = \frac{\rho_j N_s(\mu_j, \sigma_j^2)}{\sum_k \rho_k N_s(\mu_k, \sigma_k^2)} \frac{1}{\sigma_j}\left[\frac{(s - \mu_j)^2}{\sigma_j^2} - 1\right]     (472)

Note that \rho_k must be constrained to be proper ratios. Defining the ratios by \rho_j = e^{r_j} / \sum_k e^{r_k}, we obtain

  \frac{\partial \ln p(s)}{\partial r_j} = \sum_l \frac{\partial \ln p(s)}{\partial \rho_l} \frac{\partial \rho_l}{\partial r_j}
  \quad\text{where}\quad \frac{\partial \rho_l}{\partial r_j} = \rho_l (\delta_{lj} - \rho_j)                                                                                                      (473)
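As a numerical cross-check of the Gaussian moments in Section A.1.5 and of the mixture relation (466), the sketch below integrates the densities directly; the parameter values are arbitrary illustration choices.

```python
import numpy as np
from scipy import integrate

# Single Gaussian: <x^2> = sigma^2 + mu^2
mu, sigma = 1.5, 0.7
g = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
m2, _ = integrate.quad(lambda x: x ** 2 * g(x), -np.inf, np.inf)
assert np.isclose(m2, sigma ** 2 + mu ** 2)

# Mixture of Gaussians: <x^2> = sum_k rho_k (sigma_k^2 + mu_k^2), via eq. (466)
rho = np.array([0.3, 0.7])
mus = np.array([-1.0, 2.0])
sigmas = np.array([0.5, 1.5])
p = lambda x: np.sum(rho * np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2))
                     / np.sqrt(2 * np.pi * sigmas ** 2))
m2_mix, _ = integrate.quad(lambda x: x ** 2 * p(x), -np.inf, np.inf)
assert np.isclose(m2_mix, np.sum(rho * (sigmas ** 2 + mus ** 2)))
```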


B Proofs and Details

B.1 Misc Proofs

B.1.1 Proof of Equation 77

Essentially we need to calculate

  \frac{\partial (X^n)_{kl}}{\partial X_{ij}}
  = \frac{\partial}{\partial X_{ij}} \sum_{u_1,...,u_{n-1}} X_{k,u_1} X_{u_1,u_2} \cdots X_{u_{n-1},l}
  = \delta_{k,i}\delta_{u_1,j} X_{u_1,u_2} \cdots X_{u_{n-1},l}
    + X_{k,u_1} \delta_{u_1,i}\delta_{u_2,j} \cdots X_{u_{n-1},l}
    + \cdots
    + X_{k,u_1} X_{u_1,u_2} \cdots \delta_{u_{n-1},i}\delta_{l,j}
  = \sum_{r=0}^{n-1} (X^r)_{ki} (X^{n-1-r})_{jl}
  = \sum_{r=0}^{n-1} (X^r J^{ij} X^{n-1-r})_{kl}

Using the properties of the single entry matrix found in Sec. 9.3.4, the result follows easily.

B.1.2 Details on Eq. 475

  \partial \det(X^H A X)
  = \det(X^H A X)\, Tr[(X^H A X)^{-1} \partial(X^H A X)]
  = \det(X^H A X)\, Tr[(X^H A X)^{-1} (\partial(X^H) A X + X^H \partial(A X))]
  = \det(X^H A X) \big( Tr[(X^H A X)^{-1} \partial(X^H) A X] + Tr[(X^H A X)^{-1} X^H \partial(A X)] \big)
  = \det(X^H A X) \big( Tr[A X (X^H A X)^{-1} \partial(X^H)] + Tr[(X^H A X)^{-1} X^H A \partial(X)] \big)

First, the derivative is found with respect to the real part of X

  \frac{\partial \det(X^H A X)}{\partial \Re X}
  = \det(X^H A X) \left( \frac{Tr[A X (X^H A X)^{-1} \partial(X^H)]}{\partial \Re X} + \frac{Tr[(X^H A X)^{-1} X^H A \partial(X)]}{\partial \Re X} \right)
  = \det(X^H A X) \big( A X (X^H A X)^{-1} + ((X^H A X)^{-1} X^H A)^T \big)


Through the calculations, (84) and (193) were used. In addition, by use of (194), the derivative is found with respect to the imaginary part of X

  i \frac{\partial \det(X^H A X)}{\partial \Im X}
  = i \det(X^H A X) \left( \frac{Tr[A X (X^H A X)^{-1} \partial(X^H)]}{\partial \Im X} + \frac{Tr[(X^H A X)^{-1} X^H A \partial(X)]}{\partial \Im X} \right)
  = \det(X^H A X) \big( A X (X^H A X)^{-1} - ((X^H A X)^{-1} X^H A)^T \big)

Hence, the derivative yields

  \frac{\partial \det(X^H A X)}{\partial X}
  = \frac{1}{2} \left( \frac{\partial \det(X^H A X)}{\partial \Re X} - i \frac{\partial \det(X^H A X)}{\partial \Im X} \right)
  = \det(X^H A X) \big( (X^H A X)^{-1} X^H A \big)^T

and the complex conjugate derivative yields

  \frac{\partial \det(X^H A X)}{\partial X^*}
  = \frac{1}{2} \left( \frac{\partial \det(X^H A X)}{\partial \Re X} + i \frac{\partial \det(X^H A X)}{\partial \Im X} \right)
  = \det(X^H A X)\, A X (X^H A X)^{-1}

Notice, for real X, A, the sum of (202) and (203) is reduced to (45).

Similar calculations yield

  \frac{\partial \det(X A X^H)}{\partial X}
  = \frac{1}{2} \left( \frac{\partial \det(X A X^H)}{\partial \Re X} - i \frac{\partial \det(X A X^H)}{\partial \Im X} \right)
  = \det(X A X^H) \big( A X^H (X A X^H)^{-1} \big)^T                        (474)

and

  \frac{\partial \det(X A X^H)}{\partial X^*}
  = \frac{1}{2} \left( \frac{\partial \det(X A X^H)}{\partial \Re X} + i \frac{\partial \det(X A X^H)}{\partial \Im X} \right)
  = \det(X A X^H) (X A X^H)^{-1} X A                                        (475)


References

[1] Karl Gustav Andersson and Lars-Christer Boiers. Ordinära differentialekvationer. Studentlitteratur, 1992.

[2] Jörn Anemüller, Terrence J. Sejnowski, and Scott Makeig. Complex independent component analysis of frequency-domain electroencephalographic data. Neural Networks, 16(9):1311–1323, November 2003.

[3] S. Barnet. Matrices. Methods and Applications. Oxford Applied Mathematics and Computing Science Series. Clarendon Press, 1990.

[4] Christopher Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[5] Robert J. Boik. Lecture notes: Statistics 550. Online, April 22, 2002. Notes.

[6] D. H. Brandwood. A complex gradient operator and its application in adaptive array theory. IEE Proceedings, 130(1):11–16, February 1983. Pts. F and H.

[7] M. Brookes. Matrix Reference Manual, 2004. Website, May 20, 2004.

[8] K. Conradsen. En introduktion til statistik. IMM lecture notes, 1984.

[9] Mads Dyrholm. Some matrix results, 2004. Website, August 23, 2004.

[10] F. A. Nielsen. Formula. Neuro Research Unit and Technical University of Denmark, 2002.

[11] A. B. Gelman, J. S. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall / CRC, 1995.

[12] Gene H. Golub and Charles F. van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, 3rd edition, 1996.

[13] Robert M. Gray. Toeplitz and circulant matrices: A review. Technical report, Information Systems Laboratory, Department of Electrical Engineering, Stanford University, Stanford, California 94305, August 2002.

[14] Simon Haykin. Adaptive Filter Theory. Prentice Hall, Upper Saddle River, NJ, 4th edition, 2002.

[15] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

[16] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press Ltd., 1979.

[17] Carl D. Meyer. Generalized inversion of modified matrices. SIAM Journal of Applied Mathematics, 24(3):315–323, May 1973.


[18] Thomas P. Minka. Old and new matrix algebra useful for statistics, December 2000. Notes.

[19] Daniele Mortari. Ortho–Skew and Ortho–Sym Matrix Trigonometry. John Lee Junkins Astrodynamics Symposium, AAS 03–265, May 2003. Texas A&M University, College Station, TX.

[20] L. Parra and C. Spence. Convolutive blind separation of non-stationary sources. In IEEE Transactions Speech and Audio Processing, pages 320–327, May 2000.

[21] Kaare Brandt Petersen, Jiucang Hao, and Te-Won Lee. Generative and filtering approaches for overcomplete representations. Neural Information Processing – Letters and Reviews, vol. 8(1), 2005.

[22] John G. Proakis and Dimitris G. Manolakis. Digital Signal Processing. Prentice-Hall, 1996.

[23] Laurent Schwartz. Cours d'Analyse, volume II. Hermann, Paris, 1967. As referenced in [14].

[24] Shayle R. Searle. Matrix Algebra Useful for Statistics. John Wiley and Sons, 1982.

[25] G. Seber and A. Lee. Linear Regression Analysis. John Wiley and Sons, 2002.

[26] S. M. Selby. Standard Mathematical Tables. CRC Press, 1974.

[27] Inna Stainvas. Matrix algebra in differential calculus. Neural Computing Research Group, Information Engineering, Aston University, UK, August 2002. Notes.

[28] P. P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice Hall, 1993.

[29] Max Welling. The Kalman Filter. Lecture Note.


Index

Anti-symmetric, 43
Block matrix, 47
Chain rule, 14
Cholesky-decomposition, 27
Co-kurtosis, 28
Co-skewness, 28
Cramer's Rule, 53
Derivative of a complex matrix, 22
Derivative of a determinant, 7
Derivative of a trace, 11
Derivative of an inverse, 8
Derivative of symmetric matrix, 14
Derivatives of Toeplitz matrix, 15
Dirichlet distribution, 32
Eigenvalues, 25
Eigenvectors, 25
Exponential Matrix Function, 51
Gaussian, conditional, 34
Gaussian, entropy, 38
Gaussian, linear combination, 35
Gaussian, marginal, 34
Gaussian, product of densities, 36
Generalized inverse, 20
Kronecker product, 51
Moore-Penrose inverse, 20
Multinomial distribution, 32
Norm of a matrix, 55
Norm of a vector, 55
Normal-Inverse Gamma distribution, 32
Normal-Inverse Wishart distribution, 33
Pseudo-inverse, 20
Schur complement, 35, 48
Single entry matrix, 41
Singular Value Decomposition (SVD), 25
Student-t, 31
Sylvester's Inequality, 57
Symmetric, 43
Toeplitz matrix, 44
Vandermonde matrix, 44
Vec operator, 51
Wishart distribution, 32
Woodbury identity, 17
