The Matrix Cookbook
[ http://matrixcookbook.com ]

Kaare Brandt Petersen
Michael Syskind Pedersen

Version: September 5, 2007

What is this? These pages are a collection of facts (identities, approximations, inequalities, relations, ...) about matrices and matters relating to them. It is collected in this form for the convenience of anyone who wants a quick desktop reference.

Disclaimer: The identities, approximations and relations presented here were obviously not invented but collected, borrowed and copied from a large amount of sources. These sources include similar but shorter notes found on the internet and appendices in books - see the references for a full list.

Errors: Very likely there are errors, typos, and mistakes for which we apologize and would be grateful to receive corrections at cookbook@2302.dk.

It's ongoing: The project of keeping a large repository of relations involving matrices is naturally ongoing and the version will be apparent from the date in the header.

Suggestions: Your suggestions for additional content or elaboration of some topics are most welcome at cookbook@2302.dk.

Keywords: Matrix algebra, matrix relations, matrix identities, derivative of determinant, derivative of inverse matrix, differentiate a matrix.

Acknowledgements: We would like to thank the following for contributions and suggestions: Bill Baxter, Christian Rishøj, Douglas L. Theobald, Esben Hoegh-Rasmussen, Jan Larsen, Korbinian Strimmer, Lars Christiansen, Lars Kai Hansen, Leland Wilkinson, Liguo He, Loic Thibaut, Ole Winther, Stephan Hattinger, and Vasile Sima. We would also like to thank The Oticon Foundation for funding our PhD studies.


Contents

1 Basics  5
  1.1 Trace and Determinants  5
  1.2 The Special Case 2x2  5

2 Derivatives  7
  2.1 Derivatives of a Determinant  7
  2.2 Derivatives of an Inverse  8
  2.3 Derivatives of Eigenvalues  9
  2.4 Derivatives of Matrices, Vectors and Scalar Forms  9
  2.5 Derivatives of Traces  11
  2.6 Derivatives of vector norms  13
  2.7 Derivatives of matrix norms  13
  2.8 Derivatives of Structured Matrices  13

3 Inverses  16
  3.1 Basic  16
  3.2 Exact Relations  17
  3.3 Implication on Inverses  19
  3.4 Approximations  19
  3.5 Generalized Inverse  20
  3.6 Pseudo Inverse  20

4 Complex Matrices  22
  4.1 Complex Derivatives  22

5 Decompositions  25
  5.1 Eigenvalues and Eigenvectors  25
  5.2 Singular Value Decomposition  25
  5.3 Triangular Decomposition  27

6 Statistics and Probability  28
  6.1 Definition of Moments  28
  6.2 Expectation of Linear Combinations  29
  6.3 Weighted Scalar Variable  30

7 Multivariate Distributions  31
  7.1 Student's t  31
  7.2 Cauchy  32
  7.3 Gaussian  32
  7.4 Multinomial  32
  7.5 Dirichlet  32
  7.6 Normal-Inverse Gamma  32
  7.7 Wishart  32
  7.8 Inverse Wishart  33


8 Gaussians  34
  8.1 Basics  34
  8.2 Moments  36
  8.3 Miscellaneous  38
  8.4 Mixture of Gaussians  39

9 Special Matrices  40
  9.1 Orthogonal, Ortho-symmetric, and Ortho-skew  40
  9.2 Units, Permutation and Shift  40
  9.3 The Singleentry Matrix  41
  9.4 Symmetric and Antisymmetric  43
  9.5 Orthogonal matrices  44
  9.6 Vandermonde Matrices  45
  9.7 Toeplitz Matrices  45
  9.8 The DFT Matrix  46
  9.9 Positive Definite and Semi-definite Matrices  47
  9.10 Block matrices  49

10 Functions and Operators  51
  10.1 Functions and Series  51
  10.2 Kronecker and Vec Operator  52
  10.3 Solutions to Systems of Equations  53
  10.4 Vector Norms  56
  10.5 Matrix Norms  56
  10.6 Rank  58
  10.7 Integral Involving Dirac Delta Functions  58
  10.8 Miscellaneous  58

A One-dimensional Results  59
  A.1 Gaussian  59
  A.2 One Dimensional Mixture of Gaussians  60

B Proofs and Details  62
  B.1 Misc Proofs  62


Notation and Nomenclature

A           Matrix
A_ij        Matrix indexed for some purpose
A_i         Matrix indexed for some purpose
A^ij        Matrix indexed for some purpose
A^n         Matrix indexed for some purpose, or the n.th power of a square matrix
A^{-1}      The inverse matrix of the matrix A
A^+         The pseudo inverse matrix of the matrix A (see Sec. 3.6)
A^{1/2}     The square root of a matrix (if unique), not elementwise
(A)_ij      The (i, j).th entry of the matrix A
A_ij        The (i, j).th entry of the matrix A
[A]_ij      The ij-submatrix, i.e. A with i.th row and j.th column deleted
a           Vector
a_i         Vector indexed for some purpose
a_i         The i.th element of the vector a
a           Scalar
Rz          Real part of a scalar
Rz          Real part of a vector
RZ          Real part of a matrix
Iz          Imaginary part of a scalar
Iz          Imaginary part of a vector
IZ          Imaginary part of a matrix
det(A)      Determinant of A
Tr(A)       Trace of the matrix A
diag(A)     Diagonal matrix of the matrix A, i.e. (diag(A))_ij = δ_ij A_ij
vec(A)      The vector-version of the matrix A (see Sec. 10.2.2)
sup         Supremum of a set
||A||       Matrix norm (subscript if any denotes what norm)
A^T         Transposed matrix
A^*         Complex conjugated matrix
A^H         Transposed and complex conjugated matrix (Hermitian)
A ◦ B       Hadamard (elementwise) product
A ⊗ B       Kronecker product
0           The null matrix. Zero in all entries.
I           The identity matrix
J^ij        The single-entry matrix, 1 at (i, j) and zero elsewhere
Σ           A positive definite matrix
Λ           A diagonal matrix


1 Basics

(AB)^{-1} = B^{-1} A^{-1}                                    (1)
(ABC...)^{-1} = ...C^{-1} B^{-1} A^{-1}                      (2)
(A^T)^{-1} = (A^{-1})^T                                      (3)
(A + B)^T = A^T + B^T                                        (4)
(AB)^T = B^T A^T                                             (5)
(ABC...)^T = ...C^T B^T A^T                                  (6)
(A^H)^{-1} = (A^{-1})^H                                      (7)
(A + B)^H = A^H + B^H                                        (8)
(AB)^H = B^H A^H                                             (9)
(ABC...)^H = ...C^H B^H A^H                                  (10)

1.1 Trace and Determinants

Tr(A) = Σ_i A_ii                                             (11)
Tr(A) = Σ_i λ_i,   λ_i = eig(A)                              (12)
Tr(A) = Tr(A^T)                                              (13)
Tr(AB) = Tr(BA)                                              (14)
Tr(A + B) = Tr(A) + Tr(B)                                    (15)
Tr(ABC) = Tr(BCA) = Tr(CAB)                                  (16)

det(A) = Π_i λ_i,   λ_i = eig(A)                             (17)
det(cA) = c^n det(A),   if A ∈ R^{n×n}                       (18)
det(AB) = det(A) det(B)                                      (19)
det(A^{-1}) = 1/det(A)                                       (20)
det(A^n) = det(A)^n                                          (21)
det(I + uv^T) = 1 + u^T v                                    (22)
det(I + εA) ≅ 1 + ε Tr(A),   ε small                         (23)

1.2 The Special Case 2x2

Consider the matrix A

A = [ A_11  A_12 ]
    [ A_21  A_22 ]

Determinant and trace

det(A) = A_11 A_22 − A_12 A_21                               (24)
Tr(A) = A_11 + A_22                                          (25)
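A quick numerical sanity check of a few of the identities above, e.g. (1), (14), (19) and (22), as a minimal NumPy sketch on random matrices (the sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
u = rng.standard_normal((n, 1))
v = rng.standard_normal((n, 1))

# (1)  (AB)^{-1} = B^{-1} A^{-1}
assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))

# (14) Tr(AB) = Tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# (19) det(AB) = det(A) det(B)
assert np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B))

# (22) det(I + u v^T) = 1 + u^T v
assert np.isclose(np.linalg.det(np.eye(n) + u @ v.T), 1 + (u.T @ v).item())
```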


Eigenvalues

λ² − λ Tr(A) + det(A) = 0

λ_1 = (Tr(A) + √(Tr(A)² − 4 det(A))) / 2
λ_2 = (Tr(A) − √(Tr(A)² − 4 det(A))) / 2

λ_1 + λ_2 = Tr(A)
λ_1 λ_2 = det(A)

Eigenvectors

v_1 ∝ [ A_12        ]        v_2 ∝ [ A_12        ]
      [ λ_1 − A_11  ]              [ λ_2 − A_11  ]

Inverse

A^{-1} = (1/det(A)) [  A_22  −A_12 ]
                    [ −A_21   A_11 ]                         (26)
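A small NumPy sketch of the 2x2 case: the eigenvalues from the trace/determinant formulas above match np.linalg.eigvals, and (26) matches np.linalg.inv (the example matrix is arbitrary):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [2.0, 4.0]])

tr, det = np.trace(A), np.linalg.det(A)
disc = np.sqrt(tr**2 - 4 * det)
lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2

# eigenvalues from the characteristic polynomial vs. the library routine
assert np.allclose(sorted([lam1, lam2]), sorted(np.linalg.eigvals(A).real))

# (26) explicit 2x2 inverse
Ainv = np.array([[ A[1, 1], -A[0, 1]],
                 [-A[1, 0],  A[0, 0]]]) / det
assert np.allclose(Ainv, np.linalg.inv(A))
```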


2 Derivatives

This section covers differentiation of a number of expressions with respect to a matrix X. Note that it is always assumed that X has no special structure, i.e. that the elements of X are independent (e.g. not symmetric, Toeplitz, positive definite). See Section 2.8 for differentiation of structured matrices. The basic assumptions can be written in a formula as

∂X_kl/∂X_ij = δ_ik δ_lj                                      (27)

that is, for e.g. vector forms,

[∂x/∂y]_i = ∂x_i/∂y      [∂x/∂y]_i = ∂x/∂y_i      [∂x/∂y]_ij = ∂x_i/∂y_j

The following rules are general and very useful when deriving the differential of an expression ([18]):

∂A = 0   (A is a constant)                                   (28)
∂(αX) = α ∂X                                                 (29)
∂(X + Y) = ∂X + ∂Y                                           (30)
∂(Tr(X)) = Tr(∂X)                                            (31)
∂(XY) = (∂X)Y + X(∂Y)                                        (32)
∂(X ◦ Y) = (∂X) ◦ Y + X ◦ (∂Y)                               (33)
∂(X ⊗ Y) = (∂X) ⊗ Y + X ⊗ (∂Y)                               (34)
∂(X^{-1}) = −X^{-1}(∂X)X^{-1}                                (35)
∂(det(X)) = det(X) Tr(X^{-1} ∂X)                             (36)
∂(ln(det(X))) = Tr(X^{-1} ∂X)                                (37)
∂X^T = (∂X)^T                                                (38)
∂X^H = (∂X)^H                                                (39)

2.1 Derivatives of a Determinant

2.1.1 General form

∂det(Y)/∂x = det(Y) Tr[Y^{-1} ∂Y/∂x]                         (40)

2.1.2 Linear forms

∂det(X)/∂X = det(X)(X^{-1})^T                                (41)
∂det(AXB)/∂X = det(AXB)(X^{-1})^T = det(AXB)(X^T)^{-1}       (42)


2.1.3 Square forms

If X is square and invertible, then

∂det(X^T A X)/∂X = 2 det(X^T A X) X^{-T}                     (43)

If X is not square but A is symmetric, then

∂det(X^T A X)/∂X = 2 det(X^T A X) A X (X^T A X)^{-1}         (44)

If X is not square and A is not symmetric, then

∂det(X^T A X)/∂X = det(X^T A X)(A X (X^T A X)^{-1} + A^T X (X^T A^T X)^{-1})     (45)

2.1.4 Other nonlinear forms

Some special cases are (see [9, 7])

∂ln det(X^T X)/∂X = 2(X^+)^T                                 (46)
∂ln det(X^T X)/∂X^+ = −2X^T                                  (47)
∂ln |det(X)|/∂X = (X^{-1})^T = (X^T)^{-1}                    (48)
∂det(X^k)/∂X = k det(X^k) X^{-T}                             (49)

2.2 Derivatives of an Inverse

From [26] we have the basic identity

∂Y^{-1}/∂x = −Y^{-1} (∂Y/∂x) Y^{-1}                          (50)

from which it follows

∂(X^{-1})_kl/∂X_ij = −(X^{-1})_ki (X^{-1})_jl                (51)
∂a^T X^{-1} b/∂X = −X^{-T} a b^T X^{-T}                      (52)
∂det(X^{-1})/∂X = −det(X^{-1})(X^{-1})^T                     (53)
∂Tr(A X^{-1} B)/∂X = −(X^{-1} B A X^{-1})^T                  (54)
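One convenient way to check such formulas numerically is a finite-difference comparison; here is a minimal NumPy sketch for (52), using an arbitrary well-conditioned test matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # keep X well conditioned
a, b = rng.standard_normal(n), rng.standard_normal(n)

f = lambda X: a @ np.linalg.inv(X) @ b

# closed form (52): df/dX = -X^{-T} a b^T X^{-T}
Xinv = np.linalg.inv(X)
grad_exact = -Xinv.T @ np.outer(a, b) @ Xinv.T

# central finite differences, entry by entry
eps = 1e-6
grad_fd = np.zeros_like(X)
for i in range(n):
    for j in range(n):
        E = np.zeros_like(X)
        E[i, j] = eps
        grad_fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

assert np.allclose(grad_exact, grad_fd, atol=1e-5)
```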


2.3 Derivatives of Eigenvalues

∂/∂X Σ eig(X) = ∂/∂X Tr(X) = I                               (55)
∂/∂X Π eig(X) = ∂/∂X det(X) = det(X) X^{-T}                  (56)

2.4 Derivatives of Matrices, Vectors and Scalar Forms

2.4.1 First Order

∂x^T a/∂x = ∂a^T x/∂x = a                                    (57)
∂a^T X b/∂X = a b^T                                          (58)
∂a^T X^T b/∂X = b a^T                                        (59)
∂a^T X a/∂X = ∂a^T X^T a/∂X = a a^T                          (60)
∂X/∂X_ij = J^ij                                              (61)
∂(XA)_ij/∂X_mn = δ_im (A)_nj = (J^mn A)_ij                   (62)
∂(X^T A)_ij/∂X_mn = δ_in (A)_mj = (J^nm A)_ij                (63)

2.4.2 Second Order

∂/∂X_ij Σ_klmn X_kl X_mn = 2 Σ_kl X_kl                       (64)
∂b^T X^T X c/∂X = X(b c^T + c b^T)                           (65)
∂(Bx + b)^T C (Dx + d)/∂x = B^T C (Dx + d) + D^T C^T (Bx + b)          (66)
∂(X^T B X)_kl/∂X_ij = δ_lj (X^T B)_ki + δ_kj (BX)_il                   (67)
∂(X^T B X)/∂X_ij = X^T B J^ij + J^ji B X,   (J^ij)_kl = δ_ik δ_jl      (68)


See Sec. 9.3 for useful properties of the single-entry matrix J^ij.

∂x^T B x/∂x = (B + B^T) x                                    (69)
∂b^T X^T D X c/∂X = D^T X b c^T + D X c b^T                  (70)
∂/∂X (Xb + c)^T D (Xb + c) = (D + D^T)(Xb + c) b^T           (71)

Assume W is symmetric, then

∂/∂s (x − As)^T W (x − As) = −2 A^T W (x − As)               (72)
∂/∂s (x − s)^T W (x − s) = −2 W (x − s)                      (73)
∂/∂x (x − As)^T W (x − As) = 2 W (x − As)                    (74)
∂/∂A (x − As)^T W (x − As) = −2 W (x − As) s^T               (75)

2.4.3 Higher order and non-linear

∂(X^n)_kl/∂X_ij = Σ_{r=0}^{n−1} (X^r J^ij X^{n−1−r})_kl                (76)
∂/∂X a^T X^n b = Σ_{r=0}^{n−1} (X^r)^T a b^T (X^{n−1−r})^T             (77)
∂/∂X a^T (X^n)^T X^n b = Σ_{r=0}^{n−1} [ X^{n−1−r} a b^T (X^n)^T X^r
                                         + (X^r)^T X^n a b^T (X^{n−1−r})^T ]     (78)

See B.1.1 for a proof.

Assume s and r are functions of x, i.e. s = s(x), r = r(x), and that A is a constant, then

∂/∂x s^T A r = [∂s/∂x]^T A r + [∂r/∂x]^T A^T s               (79)


2.4.4 Gradient and Hessian

Using the above we have for the gradient and the Hessian

f = x^T A x + b^T x                                          (80)
∇_x f = ∂f/∂x = (A + A^T) x + b                              (81)
∂²f/∂x∂x^T = A + A^T                                         (82)

2.5 Derivatives of Traces

2.5.1 First Order

∂Tr(X)/∂X = I                                                (83)
∂/∂X Tr(XA) = A^T                                            (84)
∂/∂X Tr(AXB) = A^T B^T                                       (85)
∂/∂X Tr(AX^T B) = BA                                         (86)
∂/∂X Tr(X^T A) = A                                           (87)
∂/∂X Tr(AX^T) = A                                            (88)
∂/∂X Tr(A ⊗ X) = Tr(A) I                                     (89)
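A minimal NumPy sketch verifying (81) and (82) by finite differences; the test problem below is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)

f = lambda x: x @ A @ x + b @ x           # (80)

grad_exact = (A + A.T) @ x + b            # (81)
hess_exact = A + A.T                      # (82)

eps = 1e-3
E = np.eye(n) * eps
grad_fd = np.array([(f(x + E[i]) - f(x - E[i])) / (2 * eps) for i in range(n)])
hess_fd = np.array([[(f(x + E[i] + E[j]) - f(x + E[i] - E[j])
                      - f(x - E[i] + E[j]) + f(x - E[i] - E[j])) / (4 * eps**2)
                     for j in range(n)] for i in range(n)])

assert np.allclose(grad_exact, grad_fd)
assert np.allclose(hess_exact, hess_fd)
```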


2.5.2 Second Order

See [7].

∂/∂X Tr(X²) = 2X^T                                           (90)
∂/∂X Tr(X²B) = (XB + BX)^T                                   (91)
∂/∂X Tr(X^T B X) = BX + B^T X                                (92)
∂/∂X Tr(X B X^T) = X B^T + X B                               (93)
∂/∂X Tr(A X B X) = A^T X^T B^T + B^T X^T A^T                 (94)
∂/∂X Tr(X^T X) = 2X                                          (95)
∂/∂X Tr(B X X^T) = (B + B^T) X                               (96)
∂/∂X Tr(B^T X^T C X B) = C^T X B B^T + C X B B^T             (97)
∂/∂X Tr[X^T B X C] = B X C + B^T X C^T                       (98)
∂/∂X Tr(A X B X^T C) = A^T C^T X B^T + C A X B               (99)
∂/∂X Tr[(A X b + c)(A X b + c)^T] = 2 A^T (A X b + c) b^T    (100)
∂/∂X Tr(X ⊗ X) = ∂/∂X Tr(X)Tr(X) = 2 Tr(X) I                 (101)

2.5.3 Higher Order

∂/∂X Tr(X^k) = k (X^{k−1})^T                                 (102)
∂/∂X Tr(A X^k) = Σ_{r=0}^{k−1} (X^r A X^{k−r−1})^T           (103)
∂/∂X Tr[B^T X^T C X X^T C X B] = C X X^T C X B B^T
                                + C^T X B B^T X^T C^T X
                                + C X B B^T X^T C X
                                + C^T X X^T C^T X B B^T      (104)

2.5.4 Other

∂/∂X Tr(A X^{-1} B) = −(X^{-1} B A X^{-1})^T = −X^{-T} A^T B^T X^{-T}      (105)


Assume B and C to be symmetric, then

∂/∂X Tr[(X^T C X)^{-1} A] = −(C X (X^T C X)^{-1})(A + A^T)(X^T C X)^{-1}            (106)
∂/∂X Tr[(X^T C X)^{-1}(X^T B X)] = −2 C X (X^T C X)^{-1} X^T B X (X^T C X)^{-1}
                                   + 2 B X (X^T C X)^{-1}                           (107)

See [7].

2.6 Derivatives of vector norms

2.6.1 Two-norm

∂/∂x ||x − a||_2 = (x − a)/||x − a||_2                                              (108)
∂/∂x [(x − a)/‖x − a‖_2] = I/‖x − a‖_2 − (x − a)(x − a)^T/‖x − a‖_2³                (109)
∂||x||²_2/∂x = ∂(x^T x)/∂x = 2x                                                     (110)

2.7 Derivatives of matrix norms

For more on matrix norms, see Sec. 10.5.

2.7.1 Frobenius norm

∂/∂X ||X||²_F = ∂/∂X Tr(X X^H) = 2X                                                 (111)

See (201).

2.8 Derivatives of Structured Matrices

Assume that the matrix A has some structure, i.e. symmetric, Toeplitz, etc. In that case the derivatives of the previous sections do not apply in general. Instead, consider the following general rule for differentiating a scalar function f(A)

df/dA_ij = Σ_kl (∂f/∂A_kl)(∂A_kl/∂A_ij) = Tr[ (∂f/∂A)^T ∂A/∂A_ij ]                  (112)

The matrix differentiated with respect to itself is in this document referred to as the structure matrix of A and is defined simply by

∂A/∂A_ij = S^ij                                                                     (113)

If A has no special structure we have simply S^ij = J^ij, that is, the structure matrix is simply the single-entry matrix. Many structures have a representation in single-entry matrices, see Sec. 9.3.6 for more examples of structure matrices.


2.8.1 The Chain Rule

Sometimes the objective is to find the derivative of a matrix which is a function of another matrix. Let U = f(X); the goal is to find the derivative of the function g(U) with respect to X:

∂g(U)/∂X = ∂g(f(X))/∂X                                       (114)

The Chain Rule can then be written the following way:

∂g(U)/∂X_ij = Σ_{k=1}^{M} Σ_{l=1}^{N} (∂g(U)/∂u_kl)(∂u_kl/∂x_ij)       (115)

Using matrix notation, this can be written as:

∂g(U)/∂X_ij = Tr[ (∂g(U)/∂U)^T ∂U/∂X_ij ].                   (116)

2.8.2 Symmetric

If A is symmetric, then S^ij = J^ij + J^ji − J^ij J^ij and therefore

df/dA = [∂f/∂A] + [∂f/∂A]^T − diag[∂f/∂A]                    (117)

That is, e.g., ([5]):

∂Tr(AX)/∂X = A + A^T − (A ◦ I),   see (121)                  (118)
∂det(X)/∂X = det(X)(2X^{-1} − (X^{-1} ◦ I))                  (119)
∂ln det(X)/∂X = 2X^{-1} − (X^{-1} ◦ I)                       (120)

2.8.3 Diagonal

If X is diagonal, then ([18]):

∂Tr(AX)/∂X = A ◦ I                                           (121)

2.8.4 Toeplitz

Like symmetric and diagonal matrices, Toeplitz matrices have a special structure which must be taken into account when taking the derivative with respect to a matrix with Toeplitz structure.


∂Tr(AT)/∂T = ∂Tr(TA)/∂T

 = ⎡ Tr(A)                     Tr([A^T]_{n1})    Tr([[A^T]_{1n}]_{n−1,2})   ⋯    A_{n1}                    ⎤
   ⎢ Tr([A^T]_{1n})            Tr(A)             Tr([A^T]_{n1})             ⋱    ⋮                         ⎥
   ⎢ Tr([[A^T]_{1n}]_{2,n−1})  Tr([A^T]_{1n})    ⋱                          ⋱    Tr([[A^T]_{1n}]_{n−1,2})  ⎥
   ⎢ ⋮                         ⋱                 ⋱                          ⋱    Tr([A^T]_{n1})            ⎥
   ⎣ A_{1n}                    ⋯                 Tr([[A^T]_{1n}]_{2,n−1})   Tr([A^T]_{1n})    Tr(A)        ⎦

 ≡ α(A)                                                      (122)

As can be seen, the derivative α(A) also has a Toeplitz structure. Each value on the main diagonal is the sum of all the diagonal values of A; the values on the diagonals next to the main diagonal equal the sum of the diagonal next to the main diagonal in A^T. This result is only valid for the unconstrained Toeplitz matrix. If the Toeplitz matrix also is symmetric, the same derivative yields

∂Tr(AT)/∂T = ∂Tr(TA)/∂T = α(A) + α(A)^T − α(A) ◦ I           (123)


3 Inverses

3.1 Basic

3.1.1 Definition

The inverse A^{-1} of a matrix A ∈ C^{n×n} is defined such that

A A^{-1} = A^{-1} A = I,                                     (124)

where I is the n × n identity matrix. If A^{-1} exists, A is said to be nonsingular. Otherwise, A is said to be singular (see e.g. [12]).

3.1.2 Cofactors and Adjoint

The submatrix of a matrix A, denoted by [A]_ij, is a (n − 1) × (n − 1) matrix obtained by deleting the ith row and the jth column of A. The (i, j) cofactor of a matrix is defined as

cof(A, i, j) = (−1)^{i+j} det([A]_ij).                       (125)

The matrix of cofactors can be created from the cofactors

cof(A) = ⎡ cof(A, 1, 1)   ···   cof(A, 1, n) ⎤
         ⎢      ⋮      cof(A, i, j)    ⋮     ⎥               (126)
         ⎣ cof(A, n, 1)   ···   cof(A, n, n) ⎦

The adjoint matrix is the transpose of the cofactor matrix

adj(A) = (cof(A))^T.                                         (127)

3.1.3 Determinant

The determinant of a matrix A ∈ C^{n×n} is defined as (see [12])

det(A) = Σ_{j=1}^{n} (−1)^{j+1} A_1j det([A]_1j)             (128)
       = Σ_{j=1}^{n} A_1j cof(A, 1, j).                      (129)

3.1.4 Construction

The inverse matrix can be constructed, using the adjoint matrix, by

A^{-1} = (1/det(A)) · adj(A)                                 (130)

For the case of 2 × 2 matrices, see Section 1.2.


3.1.5 Condition number

The condition number of a matrix c(A) is the ratio between the largest and the smallest singular value of a matrix (see Section 5.2 on singular values),

c(A) = d_+ / d_−                                             (131)

The condition number can be used to measure how singular a matrix is. If the condition number is large, it indicates that the matrix is nearly singular. The condition number can also be estimated from the matrix norms. Here

c(A) = ‖A‖ · ‖A^{-1}‖,                                       (132)

where ‖·‖ is a norm such as e.g. the 1-norm, the 2-norm, the ∞-norm or the Frobenius norm (see Sec. 10.5 for more on matrix norms).

The 2-norm of A equals √(max(eig(A^H A))) [12, p. 57]. For a symmetric matrix, this reduces to ||A||_2 = max(|eig(A)|) [12, p. 394]. If the matrix is symmetric and positive definite, ||A||_2 = max(eig(A)). The condition number based on the 2-norm thus reduces to

‖A‖_2 ‖A^{-1}‖_2 = max(eig(A)) max(eig(A^{-1})) = max(eig(A)) / min(eig(A)).       (133)

3.2 Exact Relations

3.2.1 Basic

(AB)^{-1} = B^{-1} A^{-1}                                    (134)

3.2.2 The Woodbury identity

The Woodbury identity comes in many variants. The latter of the two can be found in [12]

(A + C B C^T)^{-1} = A^{-1} − A^{-1} C (B^{-1} + C^T A^{-1} C)^{-1} C^T A^{-1}     (135)
(A + U B V)^{-1} = A^{-1} − A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1}           (136)

If P, R are positive definite, then (see [29])

(P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = P B^T (B P B^T + R)^{-1}                 (137)

3.2.3 The Kailath Variant

See [4, page 153].

(A + BC)^{-1} = A^{-1} − A^{-1} B (I + C A^{-1} B)^{-1} C A^{-1}                   (138)
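A numerical check (NumPy; the sizes n = 6 and k = 2 are arbitrary) of the Woodbury identity (136) and the Kailath variant (138) on random, well-conditioned matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 2
inv = np.linalg.inv

A = rng.standard_normal((n, n)) + n * np.eye(n)
U = rng.standard_normal((n, k))
B = rng.standard_normal((k, k)) + k * np.eye(k)
V = rng.standard_normal((k, n))

# (136) Woodbury: (A + UBV)^{-1} = A^{-1} - A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1}
lhs = inv(A + U @ B @ V)
rhs = inv(A) - inv(A) @ U @ inv(inv(B) + V @ inv(A) @ U) @ V @ inv(A)
assert np.allclose(lhs, rhs)

# (138) Kailath: (A + BC)^{-1} = A^{-1} - A^{-1} B (I + C A^{-1} B)^{-1} C A^{-1}
Bm, Cm = U, V   # reuse the n×k and k×n factors
lhs = inv(A + Bm @ Cm)
rhs = inv(A) - inv(A) @ Bm @ inv(np.eye(k) + Cm @ inv(A) @ Bm) @ Cm @ inv(A)
assert np.allclose(lhs, rhs)
```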


3.2.4 The Searle Set of Identities

The following set of identities can be found in [24, page 151]:

(I + A^{-1})^{-1} = A(A + I)^{-1}                            (139)
(A + B B^T)^{-1} B = A^{-1} B (I + B^T A^{-1} B)^{-1}        (140)
(A^{-1} + B^{-1})^{-1} = A(A + B)^{-1} B = B(A + B)^{-1} A   (141)
A − A(A + B)^{-1} A = B − B(A + B)^{-1} B                    (142)
A^{-1} + B^{-1} = A^{-1}(A + B) B^{-1}                       (143)
(I + AB)^{-1} = I − A(I + BA)^{-1} B                         (144)
(I + AB)^{-1} A = A(I + BA)^{-1}                             (145)

3.2.5 Rank-1 update of Moore-Penrose Inverse

The following is a rank-1 update for the Moore-Penrose pseudo-inverse; a proof can be found in [17]. The matrix G is defined below:

(A + c d^T)^+ = A^+ + G                                      (146)

Using the notation

β = 1 + d^T A^+ c                                            (147)
v = A^+ c                                                    (148)
n = (A^+)^T d                                                (149)
w = (I − A A^+) c                                            (150)
m = (I − A^+ A)^T d                                          (151)

the solution is given as six different cases, depending on the entities ||w||, ||m||, and β. Please note that for any (column) vector v it holds that v^+ = v^T (v^T v)^{-1} = v^T/||v||². The solution is:

Case 1 of 6: If ||w|| ≠ 0 and ||m|| ≠ 0. Then

G = −v w^+ − (m^+)^T n^T + β (m^+)^T w^+                     (152)
  = −(1/||w||²) v w^T − (1/||m||²) m n^T + (β/(||m||²||w||²)) m w^T           (153)

Case 2 of 6: If ||w|| = 0 and ||m|| ≠ 0 and β = 0. Then

G = −v v^+ A^+ − (m^+)^T n^T                                 (154)
  = −(1/||v||²) v v^T A^+ − (1/||m||²) m n^T                 (155)


Case 3 of 6: If ||w|| = 0 and β ≠ 0. Then

G = (1/β) m v^T A^+ − (β/(||v||²||m||² + |β|²)) ( (||v||²/β) m + v ) ( (||m||²/β) (A^+)^T v + n )^T        (156)

Case 4 of 6: If ||w|| ≠ 0 and ||m|| = 0 and β = 0. Then

G = −A^+ n n^+ − v w^+                                       (157)
  = −(1/||n||²) A^+ n n^T − (1/||w||²) v w^T                 (158)

Case 5 of 6: If ||m|| = 0 and β ≠ 0. Then

G = (1/β) A^+ n w^T − (β/(||n||²||w||² + |β|²)) ( (||w||²/β) A^+ n + v ) ( (||n||²/β) w + n )^T            (159)

Case 6 of 6: If ||w|| = 0 and ||m|| = 0 and β = 0. Then

G = −v v^+ A^+ − A^+ n n^+ + v^+ A^+ n v n^+                 (160)
  = −(1/||v||²) v v^T A^+ − (1/||n||²) A^+ n n^T + (v^T A^+ n/(||v||²||n||²)) v n^T                        (161)

3.3 Implication on Inverses

If (A + B)^{-1} = A^{-1} + B^{-1} then A B^{-1} A = B A^{-1} B          (162)

See [24].

3.3.1 A PosDef identity

Assume P, R to be positive definite and invertible, then

(P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = P B^T (B P B^T + R)^{-1}      (163)

See [29].

3.4 Approximations

The following is a Taylor expansion

(I + A)^{-1} = I − A + A² − A³ + ...                         (164)

The following approximation is from [21] and holds when A is large and symmetric

A − A(I + A)^{-1} A ≅ I − A^{-1}                             (165)

If σ² is small compared to Q and M then

(Q + σ² M)^{-1} ≅ Q^{-1} − σ² Q^{-1} M Q^{-1}                (166)


3.5 Generalized Inverse

3.5.1 Definition

A generalized inverse matrix of the matrix A is any matrix A^− such that (see [25])

A A^− A = A                                                  (167)

The matrix A^− is not unique.

3.6 Pseudo Inverse

3.6.1 Definition

The pseudo inverse (or Moore-Penrose inverse) of a matrix A is the matrix A^+ that fulfils

I    A A^+ A = A
II   A^+ A A^+ = A^+
III  A A^+ symmetric
IV   A^+ A symmetric

The matrix A^+ is unique and always exists. Note that in the case of complex matrices, the symmetric condition is substituted by a condition of being Hermitian.

3.6.2 Properties

Assume A^+ to be the pseudo-inverse of A, then (see [3])

(A^+)^+ = A                                                  (168)
(A^T)^+ = (A^+)^T                                            (169)
(cA)^+ = (1/c) A^+                                           (170)
(A^T A)^+ = A^+ (A^T)^+                                      (171)
(A A^T)^+ = (A^T)^+ A^+                                      (172)

Assume A to have full rank, then

(A A^+)(A A^+) = A A^+                                       (173)
(A^+ A)(A^+ A) = A^+ A                                       (174)

Assume that A has full rank, then

Tr(A A^+) = rank(A A^+)     (see [25])                       (175)
Tr(A^+ A) = rank(A^+ A)     (see [25])                       (176)

3.6.3 Construction


A n × n   Square   rank(A) = n   ⇒   A^+ = A^{-1}
A n × m   Broad    rank(A) = n   ⇒   A^+ = A^T (A A^T)^{-1}
A n × m   Tall     rank(A) = m   ⇒   A^+ = (A^T A)^{-1} A^T

Assume A does not have full rank, i.e. A is n × m and rank(A) = r < min(n, m). The pseudo inverse A^+ can be constructed from the singular value decomposition A = U D V^T, by

A^+ = V_r D_r^{-1} U_r^T                                     (177)

where U_r, D_r, and V_r are the matrices with the degenerated rows and columns deleted. A different way is this: There always exist two matrices C (n × r) and D (r × m) of rank r, such that A = CD. Using these matrices it holds that

A^+ = D^T (D D^T)^{-1} (C^T C)^{-1} C^T                      (178)

See [3].
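A short NumPy sketch (illustrative example with an arbitrary rank-deficient A) building A^+ as in (177) from the reduced SVD and as in (178) from a full-rank factorization A = CD, comparing both with np.linalg.pinv:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, r = 6, 4, 2
# build a rank-r matrix A = C D with C (n×r) and D (r×m)
C = rng.standard_normal((n, r))
D = rng.standard_normal((r, m))
A = C @ D

# (177): keep only the r non-zero singular values
U, s, Vt = np.linalg.svd(A)
Ur, Dr, Vr = U[:, :r], np.diag(s[:r]), Vt[:r, :].T
A_plus = Vr @ np.linalg.inv(Dr) @ Ur.T
assert np.allclose(A_plus, np.linalg.pinv(A))

# (178): the full-rank-factorization construction gives the same result
A_plus2 = D.T @ np.linalg.inv(D @ D.T) @ np.linalg.inv(C.T @ C) @ C.T
assert np.allclose(A_plus2, A_plus)
```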


4 Complex Matrices

4.1 Complex Derivatives

In order to differentiate an expression f(z) with respect to a complex z, the Cauchy-Riemann equations have to be satisfied ([7]):

df(z)/dz = ∂R(f(z))/∂Rz + i ∂I(f(z))/∂Rz                     (179)

and

df(z)/dz = −i ∂R(f(z))/∂Iz + ∂I(f(z))/∂Iz                    (180)

or in a more compact form:

∂f(z)/∂Iz = i ∂f(z)/∂Rz.                                     (181)

A complex function that satisfies the Cauchy-Riemann equations for points in a region R is said to be analytic in this region R. In general, expressions involving complex conjugate or conjugate transpose do not satisfy the Cauchy-Riemann equations. In order to avoid this problem, a more generalized definition of complex derivative is used ([23], [6]):

• Generalized Complex Derivative:

df(z)/dz = (1/2)(∂f(z)/∂Rz − i ∂f(z)/∂Iz).                   (182)

• Conjugate Complex Derivative:

df(z)/dz* = (1/2)(∂f(z)/∂Rz + i ∂f(z)/∂Iz).                  (183)

The Generalized Complex Derivative equals the normal derivative when f is an analytic function. For a non-analytic function such as f(z) = z*, the derivative equals zero. The Conjugate Complex Derivative equals zero when f is an analytic function. The Conjugate Complex Derivative has e.g. been used by [20] when deriving a complex gradient.

Notice:

df(z)/dz ≠ ∂f(z)/∂Rz + i ∂f(z)/∂Iz.                          (184)

• Complex Gradient Vector: If f is a real function of a complex vector z, then the complex gradient vector is given by ([14, p. 798])

∇f(z) = 2 df(z)/dz* = ∂f(z)/∂Rz + i ∂f(z)/∂Iz.               (185)


4.1 Complex Derivatives 4 COMPLEX MATRICES• Complex Gradient <strong>Matrix</strong>: If f is a real function of a complex matrix Z,then the complex gradient matrix is given by ([2])∇f(Z) = 2 df(Z)dZ ∗ (186)= ∂f(Z)∂RZ + i∂f(Z) ∂IZ .<strong>The</strong>se expressions can be used for gradient descent algorithms.4.1.1 <strong>The</strong> Chain Rule for complex numbers<strong>The</strong> chain rule is a little more complicated when the function of a complexu = f(x) is non-analytic. For a non-analytic function, the following chain rulecan be applied ([7])∂g(u)∂x= ∂g ∂u∂u ∂x + ∂g ∂u ∗∂u ∗ ∂x= ∂g ∂u( ∂g∗ ) ∗ ∂u∗∂u ∂x + ∂u ∂x(187)Notice, if the function is analytic, the second term reduces to zero, and the functionis reduced to the normal well-known chain rule. For the matrix derivativeof a scalar function g(U), the chain rule can be written the following way:∂g(U)∂X∂g(U)Tr((= ∂U )T ∂U)+∂X4.1.2 Complex Derivatives of TracesTr((∂g(U)∂U ∗ ) T ∂U ∗ )∂X. (188)If the derivatives involve complex numbers, the conjugate transpose is often involved.<strong>The</strong> most useful way to show complex derivative is to show the derivativewith respect to the real and the imaginary part separately. An easy example is:∂Tr(X ∗ )∂RX = ∂Tr(XH )∂RXi ∂Tr(X∗ )∂IX = )i∂Tr(XH ∂IX= I (189)= I (190)Since the two results have the same sign, the conjugate complex derivative (183)should be used.∂Tr(X)∂RX = ∂Tr(XT )∂RXi ∂Tr(X)∂IX = )i∂Tr(XT ∂IX= I (191)= −I (192)Here, the two results have different signs, and the generalized complex derivative(182) should be used. Hereby, it can be seen that (84) holds even if X is aPetersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 23


4.1 Complex Derivatives 4 COMPLEX MATRICEScomplex number.∂Tr(AX H )∂RXi ∂Tr(AXH )∂IX∂Tr(AX ∗ )∂RXi ∂Tr(AX∗ )∂IX= A (193)= A (194)= A T (195)= A T (196)∂Tr(XX H )∂RXi ∂Tr(XXH )∂IX= ∂Tr(XH X)∂RX= i ∂Tr(XH X)∂IX= 2RX (197)= i2IX (198)By inserting (197) and (198) in (182) and (183), it can be seen that∂Tr(XX H )= X ∗∂X(199)∂Tr(XX H )∂X ∗ = X (200)Since the function Tr(XX H ) is a real function of the complex matrix X, thecomplex gradient matrix (186) is given by∇Tr(XX H ) = 2 ∂Tr(XXH )∂X ∗ = 2X (201)4.1.3 Complex Derivative Involving DeterminantsHere, a calculation example is provided. <strong>The</strong> objective is to find the derivative ofdet(X H AX) with respect to X ∈ C m×n . <strong>The</strong> derivative is found with respect tothe real part and the imaginary part of X, by use of (36) and (32), det(X H AX)can be calculated as (see App. B.1.2 for details)∂ det(X H AX)∂Xand the complex conjugate derivative yields= 1 ( ∂ det(X H AX)− i ∂ det(XH AX))2 ∂RX∂IX= det(X H AX) ( (X H AX) −1 X H A ) T(202)∂ det(X H AX)∂X ∗ = 1 ( ∂ det(X H AX)+ i ∂ det(XH AX))2 ∂RX∂IX= det(X H AX)AX(X H AX) −1 (203)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 24


5 Decompositions

5.1 Eigenvalues and Eigenvectors

5.1.1 Definition

The eigenvectors v and eigenvalues λ are the ones satisfying

A v_i = λ_i v_i                                              (204)
A V = V D,   (D)_ij = δ_ij λ_i,                              (205)

where the columns of V are the vectors v_i.

5.1.2 General Properties

eig(AB) = eig(BA)                                            (206)
A is n × m   ⇒   at most min(n, m) distinct λ_i              (207)
rank(A) = r   ⇒   at most r non-zero λ_i                     (208)

5.1.3 Symmetric

Assume A is symmetric, then

V V^T = I    (i.e. V is orthogonal)                          (209)
λ_i ∈ R      (i.e. λ_i is real)                              (210)
Tr(A^p) = Σ_i λ_i^p                                          (211)
eig(I + cA) = 1 + c λ_i                                      (212)
eig(A − cI) = λ_i − c                                        (213)
eig(A^{-1}) = λ_i^{-1}                                       (214)

For a symmetric, positive matrix A,

eig(A^T A) = eig(A A^T) = eig(A) ◦ eig(A)                    (215)

5.2 Singular Value Decomposition

Any n × m matrix A can be written as

A = U D V^T,                                                 (216)

where

U = eigenvectors of A A^T        (n × n)
D = √diag(eig(A A^T))            (n × m)                     (217)
V = eigenvectors of A^T A        (m × m)


Assume A ∈ R n×n . <strong>The</strong>n[A]=[ V] [ D] [UT ] , (219)5.2 Singular Value Decomposition 5 DECOMPOSITIONS5.2.1 Symmetric Square decomposed into squaresAssume A to be n × n and symmetric. <strong>The</strong>n[ ] [ ] [ ] [ A = V D VT ] , (218)where D is diagonal with the eigenvalues of A, and V is orthogonal and theeigenvectors of A.5.2.2 Square decomposed into squareswhere D is diagonal with the square root of the eigenvalues of AA T , V is theeigenvectors of AA T and U T is the eigenvectors of A T A.5.2.3 Square decomposed into rectangularAssume V ∗ D ∗ U T ∗ = 0 then we can expand the SVD of A into[ A]=[ V V∗] [ D 00 D ∗] [ UTU T ∗where the SVD of A is A = VDU T .5.2.4 Rectangular decomposition I], (220)Assume A is n × m, V is n × n, D is n × n, U T is n × m[A ] = [ V ] [ D ] [ U T ], (221)where D is diagonal with the square root of the eigenvalues of AA T , V is theeigenvectors of AA T and U T is the eigenvectors of A T A.5.2.5 Rectangular decomposition IIAssume A is n × m, V is n × m, D is m × m, U T is m × m⎡ ⎤ ⎡ ⎤[A ] = [ V ] ⎣ D ⎦ ⎣ U T ⎦ (222)5.2.6 Rectangular decomposition IIIAssume A is n × m, V is n × n, D is n × m, U T is m × m⎡ ⎤[A ] = [ V ] [ D ] ⎣ U T ⎦ , (223)where D is diagonal with the square root of the eigenvalues of AA T , V is theeigenvectors of AA T and U T is the eigenvectors of A T A.Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 26


5.3 Triangular Decomposition 5 DECOMPOSITIONS5.3 Triangular Decomposition5.3.1 Cholesky-decompositionAssume A is positive definite, thenA = B T B, (224)where B is a unique upper triangular matrix.Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 27
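A two-line check with NumPy; note that np.linalg.cholesky returns the lower-triangular factor, so the upper-triangular B of (224) is its transpose (the test matrix below is an arbitrary positive definite example):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
X = rng.standard_normal((n, n))
A = X @ X.T + n * np.eye(n)        # an arbitrary positive definite matrix

L = np.linalg.cholesky(A)          # lower-triangular factor, A = L L^T
B = L.T                            # upper-triangular factor as in (224), A = B^T B
assert np.allclose(A, B.T @ B)
assert np.allclose(B, np.triu(B))  # B is indeed upper triangular
```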


6 STATISTICS AND PROBABILITY6 Statistics and Probability6.1 Definition of MomentsAssume x ∈ R n×1 is a random variable6.1.1 Mean<strong>The</strong> vector of means, m, is defined by(m) i = 〈x i 〉 (225)6.1.2 Covariance<strong>The</strong> matrix of covariance M is defined by(M) ij = 〈(x i − 〈x i 〉)(x j − 〈x j 〉)〉 (226)or alternatively asM = 〈(x − m)(x − m) T 〉 (227)6.1.3 Third moments<strong>The</strong> matrix of third centralized moments – in some contexts referred to ascoskewness – is defined using the notationasm (3)ijk = 〈(x i − 〈x i 〉)(x j − 〈x j 〉)(x k − 〈x k 〉)〉 (228)M 3 =[]m (3)::1 m(3) ::2 ...m(3) ::n(229)where ’:’ denotes all elements within the given index. M 3 can alternatively beexpressed asM 3 = 〈(x − m)(x − m) T ⊗ (x − m) T 〉 (230)6.1.4 Fourth moments<strong>The</strong> matrix of fourth centralized moments – in some contexts referred to ascokurtosis – is defined using the notationm (4)ijkl = 〈(x i − 〈x i 〉)(x j − 〈x j 〉)(x k − 〈x k 〉)(x l − 〈x l 〉)〉 (231)asM 4 =[]m (4)::11 m(4) ::21 ...m(4) ::n1 |m(4) ::12 m(4) ::22 ...m(4) ::n2 |...|m(4) ::1n m(4) ::2n ...m(4) ::nn(232)or alternatively asM 4 = 〈(x − m)(x − m) T ⊗ (x − m) T ⊗ (x − m) T 〉 (233)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 28


6.2 Expectation of Linear Combinations 6 STATISTICS AND PROBABILITY6.2 Expectation of Linear Combinations6.2.1 Linear FormsAssume X and x to be a matrix and a vector of random variables. <strong>The</strong>n (seeSee [25])E[AXB + C] = AE[X]B + C (234)Var[Ax] = AVar[x]A T (235)Cov[Ax, By] = ACov[x, y]B T (236)Assume x to be a stochastic vector with mean m, then (see [7])6.2.2 Quadratic FormsE[Ax + b] = Am + b (237)E[Ax] = Am (238)E[x + b] = m + b (239)Assume A is symmetric, c = E[x] and Σ = Var[x]. Assume also that allcoordinates x i are independent, have the same central moments µ 1 , µ 2 , µ 3 , µ 4and denote a = diag(A). <strong>The</strong>n (See [25])E[x T Ax] = Tr(AΣ) + c T Ac (240)Var[x T Ax] = 2µ 2 2Tr(A 2 ) + 4µ 2 c T A 2 c + 4µ 3 c T Aa + (µ 4 − 3µ 2 2)a T a (241)Also, assume x to be a stochastic vector with mean m, and covariance M. <strong>The</strong>n(see [7])E[(Ax + a)(Bx + b) T ] = AMB T + (Am + a)(Bm + b) T (242)E[xx T ] = M + mm T (243)E[xa T x] = (M + mm T )a (244)E[x T ax T ] = a T (M + mm T ) (245)E[(Ax)(Ax) T ] = A(M + mm T )A T (246)E[(x + a)(x + a) T ] = M + (m + a)(m + a) T (247)E[(Ax + a) T (Bx + b)] = Tr(AMB T ) + (Am + a) T (Bm + b) (248)E[x T x] = Tr(M) + m T m (249)E[x T Ax] = Tr(AM) + m T Am (250)E[(Ax) T (Ax)] = Tr(AMA T ) + (Am) T (Am) (251)E[(x + a) T (x + a)] = Tr(M) + (m + a) T (m + a) (252)See [7].Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 29
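A small Monte Carlo sketch (NumPy; the distribution and numbers are an arbitrary example, not part of the text) illustrating (240), E[x^T A x] = Tr(AΣ) + c^T A c:

```python
import numpy as np

rng = np.random.default_rng(6)
n, N = 3, 200_000
c = np.array([1.0, -0.5, 2.0])
A = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.5]])              # symmetric, as assumed above
Sigma = np.diag([0.5, 1.0, 2.0])             # independent coordinates

# sample x with mean c and covariance Sigma, then average the quadratic form
x = c + rng.standard_normal((N, n)) * np.sqrt(np.diag(Sigma))
quad = np.einsum('ni,ij,nj->n', x, A, x)

expected = np.trace(A @ Sigma) + c @ A @ c
print(quad.mean(), expected)                 # the two agree up to Monte Carlo error
```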


6.3 Weighted Scalar Variable 6 STATISTICS AND PROBABILITY6.2.3 Cubic FormsAssume x to be a stochastic vector with independent coordinates, mean m,covariance M and central moments v 3 = E[(x − m) 3 ]. <strong>The</strong>n (see [7])E[(Ax + a)(Bx + b) T (Cx + c)] = Adiag(B T C)v 3+Tr(BMC T )(Am + a)+AMC T (Bm + b)+(AMB T + (Am + a)(Bm + b) T )(Cm + c)E[xx T x] = v 3 + 2Mm + (Tr(M) + m T m)mE[(Ax + a)(Ax + a) T (Ax + a)] = Adiag(A T A)v 3+[2AMA T + (Ax + a)(Ax + a) T ](Am + a)+Tr(AMA T )(Am + a)E[(Ax + a)b T (Cx + c)(Dx + d) T ] = (Ax + a)b T (CMD T + (Cm + c)(Dm + d) T )+(AMC T + (Am + a)(Cm + c) T )b(Dm + d) T6.3 Weighted Scalar Variable+b T (Cm + c)(AMD T − (Am + a)(Dm + d) T )Assume x ∈ R n×1 is a random variable, w ∈ R n×1 is a vector of constants andy is the linear combination y = w T x. Assume further that m, M 2 , M 3 , M 4denotes the mean, covariance, and central third and fourth moment matrix ofthe variable x. <strong>The</strong>n it holds that〈y〉 = w T m (253)〈(y − 〈y〉) 2 〉 = w T M 2 w (254)〈(y − 〈y〉) 3 〉 = w T M 3 w ⊗ w (255)〈(y − 〈y〉) 4 〉 = w T M 4 w ⊗ w ⊗ w (256)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 30


7 Multivariate Distributions

7.1 Student's t

The density of a Student-t distributed vector t ∈ R^{P×1} is given by

p(t|µ, Σ, ν) = (πν)^{−P/2} (Γ((ν+P)/2) / Γ(ν/2)) det(Σ)^{−1/2} [ 1 + ν^{−1}(t − µ)^T Σ^{−1} (t − µ) ]^{−(ν+P)/2}       (257)

where µ is the location, the scale matrix Σ is symmetric, positive definite, ν is the degrees of freedom, and Γ denotes the gamma function. For ν = 1, the Student-t distribution becomes the Cauchy distribution (see Sec. 7.2).

7.1.1 Mean

E(t) = µ,   ν > 1                                            (258)

7.1.2 Variance

cov(t) = (ν/(ν − 2)) Σ,   ν > 2                              (259)

7.1.3 Mode

The notion mode means the position of the most probable value.

mode(t) = µ                                                  (260)

7.1.4 Full Matrix Version

If instead of a vector t ∈ R^{P×1} one has a matrix T ∈ R^{P×N}, then the Student-t distribution for T is

p(T|M, Ω, Σ, ν) = π^{−NP/2} Π_{p=1}^{P} ( Γ[(ν + P − p + 1)/2] / Γ[(ν − p + 1)/2] ) ×
                  ν det(Ω)^{−ν/2} det(Σ)^{−N/2} det[ Ω^{−1} + (T − M) Σ^{−1} (T − M)^T ]^{−(ν+P)/2}                    (261)

where M is the location, Ω is the rescaling matrix, Σ is positive definite, ν is the degrees of freedom, and Γ denotes the gamma function.

7.2 Cauchy

The density function for a Cauchy distributed vector t ∈ R^{P×1} is given by

p(t|µ, Σ) = π^{−P/2} (Γ((1+P)/2) / Γ(1/2)) det(Σ)^{−1/2} [ 1 + (t − µ)^T Σ^{−1} (t − µ) ]^{−(1+P)/2}                   (262)

where µ is the location, Σ is positive definite, and Γ denotes the gamma function. The Cauchy distribution is a special case of the Student-t distribution.


7.3 Gaussian 7 MULTIVARIATE DISTRIBUTIONS7.3 GaussianSee sec. 8.7.4 MultinomialIf the vector n contains counts, i.e. (n) i ∈ 0, 1, 2, ..., then the discrete multinomialdisitrbution for n is given byP (n|a, n) =n!n 1 ! . . . n d !d∏ia nii ,d∑n i = n (263)iwhere a i are probabilities, i.e. 0 ≤ a i ≤ 1 and ∑ i a i = 1.7.5 Dirichlet<strong>The</strong> Dirichlet distribution is a kind of “inverse” distribution compared to themultinomial distribution on the bounded continuous variate x = [x 1 , . . . , x P ][16, p. 44]( ∑P)Γp α pP∏p(x|α) = ∏ Pp Γ(α x αp−1pp)7.6 Normal-Inverse Gamma7.7 Wishart<strong>The</strong> central Wishart distribution for M ∈ R P ×P , M is positive definite, wherem can be regarded as a degree of freedom parameter [16, equation 3.8.1] [8,section 2.5],[11]pp(M|Σ, m) =12 mP/2 π ∏ P (P −1)/4 Pp Γ[ ×12(m + 1 − p)]det(Σ) −m/2 det(M) (m−P −1)/2 ×[exp − 1 ]2 Tr(Σ−1 M)(264)7.7.1 MeanE(M) = mΣ (265)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 32


7.8 Inverse Wishart 7 MULTIVARIATE DISTRIBUTIONS7.8 Inverse Wishart<strong>The</strong> (normal) Inverse Wishart distribution for M ∈ R P ×P , M is positive definite,where m can be regarded as a degree of freedom parameter [11]p(M|Σ, m) =12 mP/2 π ∏ P (P −1)/4 Pp Γ[ ×12(m + 1 − p)]det(Σ) m/2 det(M) −(m−P −1)/2 ×[exp − 1 ]2 Tr(ΣM−1 )(266)7.8.1 Mean1E(M) = Σm − P − 1(267)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 33


8 Gaussians

8.1 Basics

8.1.1 Density and normalization

The density of x ∼ N(m, Σ) is

p(x) = (1/√det(2πΣ)) exp[ −(1/2)(x − m)^T Σ^{-1} (x − m) ]                    (268)

Note that if x is d-dimensional, then det(2πΣ) = (2π)^d det(Σ).

Integration and normalization

∫ exp[ −(1/2)(x − m)^T Σ^{-1} (x − m) ] dx = √det(2πΣ)
∫ exp[ −(1/2) x^T A x + b^T x ] dx = √det(2πA^{-1}) exp[ (1/2) b^T A^{-1} b ]
∫ exp[ −(1/2) Tr(S^T A S) + Tr(B^T S) ] dS = √det(2πA^{-1}) exp[ (1/2) Tr(B^T A^{-1} B) ]

The derivatives of the density are

∂p(x)/∂x = −p(x) Σ^{-1} (x − m)                                               (269)
∂²p/∂x∂x^T = p(x) ( Σ^{-1}(x − m)(x − m)^T Σ^{-1} − Σ^{-1} )                  (270)

8.1.2 Marginal Distribution

Assume x ∼ N_x(µ, Σ) where

x = [ x_a ]    µ = [ µ_a ]    Σ = [ Σ_a    Σ_c ]
    [ x_b ]        [ µ_b ]        [ Σ_c^T  Σ_b ]                              (271)

then

p(x_a) = N_{x_a}(µ_a, Σ_a)                                                    (272)
p(x_b) = N_{x_b}(µ_b, Σ_b)                                                    (273)

8.1.3 Conditional Distribution

Assume x ∼ N_x(µ, Σ) where

x = [ x_a ]    µ = [ µ_a ]    Σ = [ Σ_a    Σ_c ]
    [ x_b ]        [ µ_b ]        [ Σ_c^T  Σ_b ]                              (274)


8.1 Basics 8 GAUSSIANSthenp(x a |x b ) = N xa (ˆµ a , ˆΣ a )p(x b |x a ) = N xb (ˆµ b , ˆΣ b ){ ˆµa = µ a + Σ c Σ −1b(x b − µ b )ˆΣ a = Σ a − Σ c Σ −1bΣ T (275)c{ ˆµb = µ b + Σ T c Σ −1a (x a − µ a )ˆΣ b = Σ b − Σ T c Σ −1 (276)a Σ cNote, that the covariance matrices are the Schur complement of the block matrix,see 9.10.5 for details.8.1.4 Linear combinationAssume x ∼ N (m x , Σ x ) and y ∼ N (m y , Σ y ) thenAx + By + c ∼ N (Am x + Bm y + c, AΣ x A T + BΣ y B T ) (277)8.1.5 Rearranging Means√det(2π(AT ΣN Ax [m, Σ] =−1 A) −1 )√ N x [A −1 m, (A T Σ −1 A) −1 ] (278)det(2πΣ)8.1.6 Rearranging into squared formIf A is symmetric, then− 1 2 xT Ax + b T x = − 1 2 (x − A−1 b) T A(x − A −1 b) + 1 2 bT A −1 b− 1 2 Tr(XT AX) + Tr(B T X) = − 1 2 Tr[(X − A−1 B) T A(X − A −1 B)] + 1 2 Tr(BT A −1 B)8.1.7 Sum of two squared formsIn vector formulation (assuming Σ 1 , Σ 2 are symmetric)− 1 2 (x − m 1) T Σ −11 (x − m 1) (279)− 1 2 (x − m 2) T Σ −12 (x − m 2) (280)= − 1 2 (x − m c) T Σ −1c (x − m c ) + C (281)Σ −1c = Σ −11 + Σ −12 (282)m c = (Σ −11 + Σ −12 )−1 (Σ −11 m 1 + Σ −12 m 2) (283)C = 1 2 (mT 1 Σ −11 + m T 2 Σ −12 )(Σ−1 1 + Σ −1− 1 ()m T 1 Σ −112m 1 + m T 2 Σ −12 m 22 )−1 (Σ −11 m 1 + Σ −12 m 2)(284)(285)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 35


8.2 Moments 8 GAUSSIANSIn a trace formulation (assuming Σ 1 , Σ 2 are symmetric)− 1 2 Tr((X − M 1) T Σ −11 (X − M 1)) (286)− 1 2 Tr((X − M 2) T Σ −12 (X − M 2)) (287)= − 1 2 Tr[(X − M c) T Σ −1c (X − M c )] + C (288)Σ −1c = Σ −11 + Σ −12 (289)M c = (Σ −11 + Σ −12 )−1 (Σ −11 M 1 + Σ −12 M 2) (290)C = 1 []2 Tr (Σ −11 M 1 + Σ −12 M 2) T (Σ −11 + Σ −12 )−1 (Σ −11 M 1 + Σ −12 M 2)− 1 2 Tr(MT 1 Σ −11 M 1 + M T 2 Σ −12 M 2) (291)8.1.8 Product of gaussian densitiesLet N x (m, Σ) denote a density of x, thenN x (m 1 , Σ 1 ) · N x (m 2 , Σ 2 ) = c c N x (m c , Σ c ) (292)c c = N m1 (m 2 , (Σ 1 + Σ 2 ))1= √[−det(2π(Σ1 + Σ 2 )) exp 1 ]2 (m 1 − m 2 ) T (Σ 1 + Σ 2 ) −1 (m 1 − m 2 )m c = (Σ −11 + Σ −12 )−1 (Σ −11 m 1 + Σ −12 m 2)Σ c = (Σ −11 + Σ −12 )−1but note that the product is not normalized as a density of x.8.2 Moments8.2.1 Mean and covariance of linear formsFirst and second moments. Assume x ∼ N (m, Σ)E(x) = m (293)Cov(x, x) = Var(x) = Σ = E(xx T ) − E(x)E(x T ) = E(xx T ) − mm T (294)As for any other distribution is holds for gaussians thatE[Ax] = AE[x] (295)Var[Ax] = AVar[x]A T (296)Cov[Ax, By] = ACov[x, y]B T (297)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 36


8.2 Moments 8 GAUSSIANS8.2.2 Mean and variance of square formsMean and variance of square forms: Assume x ∼ N (m, Σ)E(xx T ) = Σ + mm T (298)E[x T Ax] = Tr(AΣ) + m T Am (299)Var(x T Ax) = 2σ 4 Tr(A 2 ) + 4σ 2 m T A 2 m (300)E[(x − m ′ ) T A(x − m ′ )] = (m − m ′ ) T A(m − m ′ ) + Tr(AΣ) (301)Assume x ∼ N (0, σ 2 I) and A and B to be symmetric, thenCov(x T Ax, x T Bx) = 2σ 4 Tr(AB) (302)8.2.3 Cubic formsE[xb T xx T ] = mb T (M + mm T ) + (M + mm T )bm T8.2.4 Mean of Quartic Forms+b T m(M − mm T ) (303)E[xx T xx T ] = 2(Σ + mm T ) 2 + m T m(Σ − mm T )+Tr(Σ)(Σ + mm T )E[xx T Axx T ] = (Σ + mm T )(A + A T )(Σ + mm T )+m T Am(Σ − mm T ) + Tr[AΣ](Σ + mm T )E[x T xx T x] = 2Tr(Σ 2 ) + 4m T Σm + (Tr(Σ) + m T m) 2E[x T Axx T Bx] = Tr[AΣ(B + B T )Σ] + m T (A + A T )Σ(B + B T )m+(Tr(AΣ) + m T Am)(Tr(BΣ) + m T Bm)E[a T xb T xc T xd T x]= (a T (Σ + mm T )b)(c T (Σ + mm T )d)+(a T (Σ + mm T )c)(b T (Σ + mm T )d)+(a T (Σ + mm T )d)(b T (Σ + mm T )c) − 2a T mb T mc T md T mE[(Ax + a)(Bx + b) T (Cx + c)(Dx + d) T ]= [AΣB T + (Am + a)(Bm + b) T ][CΣD T + (Cm + c)(Dm + d) T ]+[AΣC T + (Am + a)(Cm + c) T ][BΣD T + (Bm + b)(Dm + d) T ]+(Bm + b) T (Cm + c)[AΣD T − (Am + a)(Dm + d) T ]+Tr(BΣC T )[AΣD T + (Am + a)(Dm + d) T ]Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 37


8.3 Miscellaneous 8 GAUSSIANSE[(Ax + a) T (Bx + b)(Cx + c) T (Dx + d)]= Tr[AΣ(C T D + D T C)ΣB T ]+[(Am + a) T B + (Bm + b) T A]Σ[C T (Dm + d) + D T (Cm + c)]+[Tr(AΣB T ) + (Am + a) T (Bm + b)][Tr(CΣD T ) + (Cm + c) T (Dm + d)]See [7].8.2.5 MomentsE[x] = ∑ kCov(x) = ∑ kρ k m k (304)∑k ′ ρ k ρ k ′(Σ k + m k m T k − m k m T k ′) (305)8.3 Miscellaneous8.3.1 WhiteningAssume x ∼ N (m, Σ) thenz = Σ −1/2 (x − m) ∼ N (0, I) (306)Conversely having z ∼ N (0, I) one can generate data x ∼ N (m, Σ) by settingx = Σ 1/2 z + m ∼ N (m, Σ) (307)Note that Σ 1/2 means the matrix which fulfils Σ 1/2 Σ 1/2 = Σ, and that it existsand is unique since Σ is positive definite.8.3.2 <strong>The</strong> Chi-Square connectionAssume x ∼ N (m, Σ) and x to be n dimensional, thenz = (x − m) T Σ −1 (x − m) ∼ χ 2 n (308)where χ 2 n denotes the Chi square distribution with n degrees of freedom.8.3.3 EntropyEntropy of a D-dimensional gaussian∫H(x) = − N (m, Σ) ln N (m, Σ)dx = ln √ det(2πΣ) + D 2(309)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 38
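A minimal NumPy sketch of (306)-(307), forming Σ^{1/2} via an eigendecomposition; the specific mean, covariance and sample size are just an example:

```python
import numpy as np

rng = np.random.default_rng(7)
m = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# symmetric square root from the eigendecomposition: Sigma^{1/2} Sigma^{1/2} = Sigma
w, V = np.linalg.eigh(Sigma)
Sigma_half = V @ np.diag(np.sqrt(w)) @ V.T
Sigma_half_inv = V @ np.diag(1 / np.sqrt(w)) @ V.T

# (307): generate x ~ N(m, Sigma) from white noise, then (306): whiten it again
N = 100_000
x = rng.standard_normal((N, 2)) @ Sigma_half + m
z = (x - m) @ Sigma_half_inv
print(np.cov(z.T))   # approximately the identity matrix
```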


8.4 Mixture of Gaussians 8 GAUSSIANS8.4 Mixture of Gaussians8.4.1 Density<strong>The</strong> variable x is distributed as a mixture of gaussians if it has the densityp(x) =K∑k=11ρ k √[−det(2πΣk ) exp 1 ]2 (x − m k) T Σ −1k (x − m k)where ρ k sum to 1 and the Σ k all are positive definite.8.4.2 DerivativesDefining p(s) = ∑ k ρ kN s (µ k , Σ k ) one get∂ ln p(s)∂ρ j==∂ ln p(s)∂µ j==∂ ln p(s)∂Σ j==ρ j N s (µ j , Σ j )∑k ρ kN s (µ k , Σ k )ρ j N s (µ j , Σ j )∑k ρ kN s (µ k , Σ k )ρ j N s (µ j , Σ j )∑k ρ kN s (µ k , Σ k )ρ j N s (µ j , Σ j )∑k ρ kN s (µ k , Σ k )ρ j N s (µ j , Σ j )∑∂ln[ρ j N s (µ∂ρ j , Σ j )]j1ρ j∂ln[ρ j N s (µ∂µ j , Σ j )]j[−Σ−1k (s − µ k) ]∂k ρ ln[ρ j N s (µkN s (µ k , Σ k ) ∂Σ j , Σ j )]jρ j N s (µ j , Σ j ) 1 [∑−Σ−1k ρ j + Σ −1jkN s (µ k , Σ k ) 2But ρ k and Σ k needs to be constrained.(s − µ j )(s − µ j ) T Σ −1 ]j(310)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 39


9 Special Matrices

9.1 Orthogonal, Ortho-symmetric, and Ortho-skew

9.1.1 Orthogonal

By definition, the real matrix A is orthogonal if and only if

A^{-1} = A^T

Basic properties for the orthogonal matrix A:

A^{-T} = A
A A^T = I
A^T A = I
det(A) = ±1

9.2 Units, Permutation and Shift

9.2.1 Unit vector

Let e_i ∈ R^{n×1} be the ith unit vector, i.e. the vector which is zero in all entries except the ith, at which it is 1.

9.2.2 Rows and Columns

i.th row of A = e_i^T A                                      (311)
j.th column of A = A e_j                                     (312)

9.2.3 Permutations

Let P be some permutation matrix, e.g.

P = [ 0 1 0 ]                      [ e_2^T ]
    [ 1 0 0 ] = [ e_2  e_1  e_3 ] = [ e_1^T ]                (313)
    [ 0 0 1 ]                      [ e_3^T ]

For permutation matrices it holds that

P P^T = I                                                    (314)

and that

A P = [ A e_2   A e_1   A e_3 ]        P A = [ e_2^T A ]
                                             [ e_1^T A ]
                                             [ e_3^T A ]     (315)

That is, the first is a matrix which has the columns of A but in permuted sequence, and the second is a matrix which has the rows of A but in permuted sequence.
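A small NumPy illustration of (313)-(315) on an arbitrary 3 × 3 example:

```python
import numpy as np

A = np.arange(9.0).reshape(3, 3)
P = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 1]], dtype=float)   # the permutation matrix from (313)

assert np.allclose(P @ P.T, np.eye(3))   # (314)
# A P permutes the columns of A, P A permutes the rows of A, as in (315)
assert np.allclose(A @ P, A[:, [1, 0, 2]])
assert np.allclose(P @ A, A[[1, 0, 2], :])
```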


9.3 <strong>The</strong> Singleentry <strong>Matrix</strong> 9 SPECIAL MATRICES9.2.4 Translation, Shift or Lag OperatorsLet L denote the lag (or ’translation’ or ’shift’) operator defined on a 4 × 4example by⎡⎤0 0 0 0L = ⎢ 1 0 0 0⎥⎣ 0 1 0 0 ⎦ (316)0 0 1 0i.e. a matrix of zeros with one on the sub-diagonal, (L) ij = δ i,j+1 . With somesignal x t for t = 1, ..., N, the n.th power of the lag operator shifts the indices,i.e.{(L n 0 for t = 1, .., nx) t =(317)x t−n for t = n + 1, ..., NA related but slightly different matrix is the ’recurrent shifted’ operator definedon a 4x4 example by⎡⎤0 0 0 1ˆL = ⎢ 1 0 0 0⎥⎣ 0 1 0 0 ⎦ (318)0 0 1 0i.e. a matrix defined by (ˆL) ij = δ i,j+1 + δ i,1 δ j,dim(L) . On a signal x it has theeffect(ˆL n x) t = x t ′, t ′ = [(t − n) mod N] + 1 (319)That is, ˆL is like the shift operator L except that it ’wraps’ the signal as if itwas periodic and shifted (substituting the zeros with the rear end of the signal).Note that ˆL is invertible and orthogonal, i.e.ˆL −1 = ˆL T (320)9.3 <strong>The</strong> Singleentry <strong>Matrix</strong>9.3.1 Definition<strong>The</strong> single-entry matrix J ij ∈ R n×n is defined as the matrix which is zeroeverywhere except in the entry (i, j) in which it is 1. In a 4 × 4 example onemight haveJ 23 =⎡⎢⎣0 0 0 00 0 1 00 0 0 00 0 0 0⎤⎥⎦ (321)<strong>The</strong> single-entry matrix is very useful when working with derivatives of expressionsinvolving matrices.Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 41


9.3 <strong>The</strong> Singleentry <strong>Matrix</strong> 9 SPECIAL MATRICES9.3.2 Swap and ZerosAssume A to be n × m and J ij to be m × pAJ ij = [ 0 0 . . . A i . . . 0 ] (322)i.e. an n × p matrix of zeros with the i.th column of A in place of the j.thcolumn. Assume A to be n × m and J ij to be p × n⎡ ⎤0.0J ij A =A j(323)0⎢ ⎥⎣ . ⎦0i.e. an p × m matrix of zeros with the j.th row of A in the placed of the i.throw.9.3.3 Rewriting product of elementsA ki B jl = (Ae i e T j B) kl = (AJ ij B) kl (324)A ik B lj = (A T e i e T j B T ) kl = (A T J ij B T ) kl (325)A ik B jl = (A T e i e T j B) kl = (A T J ij B) kl (326)A ki B lj = (Ae i e T j B T ) kl = (AJ ij B T ) kl (327)9.3.4 Properties of the Singleentry <strong>Matrix</strong>If i = jIf i ≠ jJ ij J ij = J ij (J ij ) T (J ij ) T = J ijJ ij (J ij ) T = J ij (J ij ) T J ij = J ijJ ij J ij = 0 (J ij ) T (J ij ) T = 0J ij (J ij ) T = J ii (J ij ) T J ij = J jj9.3.5 <strong>The</strong> Singleentry <strong>Matrix</strong> in Scalar ExpressionsAssume A is n × m and J is m × n, thenTr(AJ ij ) = Tr(J ij A) = (A T ) ij (328)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 42


9.4 Symmetric and Antisymmetric 9 SPECIAL MATRICESAssume A is n × n, J is n × m and B is m × n, thenTr(AJ ij B) = (A T B T ) ij (329)Tr(AJ ji B) = (BA) ij (330)Tr(AJ ij J ij B) = diag(A T B T ) ij (331)Assume A is n × n, J ij is n × m B is m × n, then9.3.6 Structure Matrices<strong>The</strong> structure matrix is defined byIf A has no special structure thenx T AJ ij Bx = (A T xx T B T ) ij (332)x T AJ ij J ij Bx = diag(A T xx T B T ) ij (333)∂A∂A ij= S ij (334)S ij = J ij (335)If A is symmetric thenS ij = J ij + J ji − J ij J ij (336)9.4 Symmetric and Antisymmetric9.4.1 Symmetric<strong>The</strong> matrix A is said to be symmetric ifA = A T (337)Symmetric matrices have many important properties, e.g. that their eigenvaluesare real and eigenvectors orthogonal.9.4.2 Antisymmetric<strong>The</strong> antisymmetric matrix is also known as the skew symmetric matrix. It hasthe following property from which it is definedA = −A T (338)Hereby, it can be seen that the antisymmetric matrices always have a zerodiagonal. <strong>The</strong> n × n antisymmetric matrices also have the following properties.det(A T ) = det(−A) = (−1) n det(A) (339)− det(A) = det(−A) = 0, if n is odd (340)<strong>The</strong> eigenvalues of an antisymmetric matrix are placed on the imaginary axisand the eigenvectors are unitary.Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 43


9.5 Orthogonal matrices 9 SPECIAL MATRICES9.4.3 DecompositionA square matrix A can always be written as a sum of a symmetric A + and anantisymmetric matrix A −A = A + + A − (341)Such a decomposition could e.g. beA = A + AT2+ A − AT2= A + + A − (342)9.5 Orthogonal matricesIf a square matrix Q is orthogonal Q T Q = QQ T = I. Furthermore Q had thefollowing properties• Its eigenvalues are placed on the unit circle.• Its eigenvectors are unitary.• det(Q) = ±1.• <strong>The</strong> inverse of an orthogonal matrix is orthogonal too.9.5.1 Ortho-SymA matrix Q + which simultaneously is orthogonal and symmetric is called anortho-sym matrix [19]. HerebyQ T +Q + = I (343)Q + = Q T + (344)<strong>The</strong> powers of an ortho-sym matrix are given by the following rule9.5.2 Ortho-SkewQ k + = 1 + (−1)k2= 1 + cos(kπ)2I + 1 + (−1)k+1 Q + (345)2I + 1 − cos(kπ) Q + (346)2A matrix which simultaneously is orthogonal and antisymmetric is called anortho-skew matrix [19]. HerebyQ H − Q − = I (347)Q − = −Q H − (348)<strong>The</strong> powers of an ortho-skew matrix are given by the following ruleQ k − = ik + (−i) k2I − i ik − (−i) kQ − (349)2= cos(k π 2 )I + sin(k π 2 )Q − (350)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 44


9.6 Vandermonde Matrices 9 SPECIAL MATRICES9.5.3 DecompositionA square matrix A can always be written as a sum of a symmetric A + and anantisymmetric matrix A −A = A + + A − (351)9.6 Vandermonde MatricesA Vandermonde matrix has the form [15]⎡V = ⎢⎣ .1 v 1 v 2 1 · · · v n−111 v 2 v 2 2 · · · v n−12. . .1 v n vn 2 · · · vnn−1⎤⎥⎦ . (352)<strong>The</strong> transpose of V is also said to a Vandermonde matrix. <strong>The</strong> determinant isgiven by [28]det V = ∏ i>j(v i − v j ) (353)9.7 Toeplitz MatricesA Toeplitz matrix T is a matrix where the elements of each diagonal is thesame. In the n × n square case, it has the following structure:⎡⎤ ⎡⎤t 11 t 12 · · · t 1n t 0 t 1 · · · t n−1.T =t .. . .. 21 .⎢⎣.. .. . ⎥ .. t12 ⎦ = . t .. . .. −1 .⎢⎣.. .. . ⎥ (354)..t1 ⎦t n1 · · · t 21 t 11 t −(n−1) · · · t −1 t 0A Toeplitz matrix is persymmetric. If a matrix is persymmetric (or orthosymmetric),it means that the matrix is symmetric about its northeast-southwestdiagonal (anti-diagonal) [12]. Persymmetric matrices is a larger class of matrices,since a persymmetric matrix not necessarily has a Toeplitz structure. <strong>The</strong>reare some special cases of Toeplitz matrices. <strong>The</strong> symmetric Toeplitz matrix isgiven by:⎡⎤t 0 t 1 · · · t n−1.T =t .. . .. 1 .⎢⎣.. .. . ⎥(355)..t1 ⎦t −(n−1) · · · t 1 t 0Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 45
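A quick NumPy check of the Vandermonde determinant formula (353) above, on an arbitrary node set:

```python
import numpy as np

v = np.array([1.0, 2.0, 4.0, 7.0])
V = np.vander(v, increasing=True)   # rows [1, v_i, v_i^2, v_i^3], as in (352)

# (353): det V = product over i > j of (v_i - v_j)
prod = np.prod([v[i] - v[j] for i in range(len(v)) for j in range(i)])
assert np.isclose(np.linalg.det(V), prod)
```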


9.8 <strong>The</strong> DFT <strong>Matrix</strong> 9 SPECIAL MATRICES<strong>The</strong> circular Toeplitz matrix:⎡⎤t 0 t 1 · · · t n−1.T C =t .. . .. n .⎢⎣.. .. . ⎥.. t1 ⎦t 1 · · · t n−1 t 0(356)<strong>The</strong> upper triangular Toeplitz matrix:⎡t 0 t 1 · · · t n−1⎤0 · · · 0 t 0 .T U =0 .. . .. .⎢⎣.. .. . ⎥ .. t1 ⎦ , (357)and the lower triangular Toeplitz matrix:⎡⎤t 0 0 · · · 0.T L =t .. . .. −1 .⎢⎣.. .. . ⎥ .. 0 ⎦t −(n−1) · · · t −1 t 09.7.1 Properties of Toeplitz Matrices(358)<strong>The</strong> Toeplitz matrix has some computational advantages. <strong>The</strong> addition of twoToeplitz matrices can be done with O(n) flops, multiplication of two Toeplitzmatrices can be done in O(n ln n) flops. Toeplitz equation systems can be solvedin O(n 2 ) flops. <strong>The</strong> inverse of a positive definite Toeplitz matrix can be foundin O(n 2 ) flops too. <strong>The</strong> inverse of a Toeplitz matrix is persymmetric. <strong>The</strong>product of two lower triangular Toeplitz matrices is a Toeplitz matrix. Moreinformation on Toeplitz matrices and circulant matrices can be found in [13, 7].9.8 <strong>The</strong> DFT <strong>Matrix</strong><strong>The</strong> DFT matrix is an N × N symmetric matrix W N , where the k, nth elementis given by= e −j2πknN (359)W knNThus the discrete Fourier transform (DFT) can be expressed asX(k) =N−1∑n=0x(n)W knN . (360)Likewise the inverse discrete Fourier transform (IDFT) can be expressed asx(n) = 1 NN−1∑k=0X(k)W −knN . (361)Petersen & Pedersen, <strong>The</strong> <strong>Matrix</strong> <strong>Cookbook</strong>, Version: September 5, 2007, Page 46
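A short NumPy sketch building W_N from (359) and checking the DFT sum (360) against np.fft.fft; it also confirms the inverse relation W_N^{-1} = (1/N) W_N^* used for the IDFT (the length N = 8 is arbitrary):

```python
import numpy as np

N = 8
k = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(k, k) / N)   # (359): W_N[k, n] = exp(-j 2 pi k n / N)

x = np.random.default_rng(9).standard_normal(N)
# (360): the DFT as a matrix-vector product, compared with the FFT
assert np.allclose(W @ x, np.fft.fft(x))
# the IDFT matrix: W_N^{-1} = (1/N) W_N^*
assert np.allclose(np.linalg.inv(W), W.conj() / N)
```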


The DFT of the vector x = [x(0), x(1), ..., x(N-1)]^T can be written in matrix form as

  X = W_N x,                                                                (362)

where X = [X(0), X(1), ..., X(N-1)]^T. The IDFT is similarly given as

  x = W_N^{-1} X                                                            (363)

Some properties of W_N exist: if W_N = e^{-j2\pi/N}, then [22]

  W_N^{-1} = \frac{1}{N} W_N^*                                              (364)
  W_N W_N^* = N I                                                           (365)
  W_N^* = W_N^H                                                             (366)
  W_N^{m+N/2} = -W_N^m                                                      (367)

Notice that the DFT matrix is a Vandermonde matrix.

The following important relation between the circulant matrix and the discrete Fourier transform (DFT) exists

  T_C = W_N^{-1} (I \circ (W_N t)) W_N,                                     (368)

where t = [t_0, t_1, ..., t_{n-1}]^T is the first row of T_C.

9.9 Positive Definite and Semi-definite Matrices

9.9.1 Definitions

A matrix A is positive definite if and only if

  x^T A x > 0,   ∀x ≠ 0                                                     (369)

A matrix A is positive semi-definite if and only if

  x^T A x ≥ 0,   ∀x                                                         (370)

Note that if A is positive definite, then A is also positive semi-definite.

9.9.2 Eigenvalues

The following holds with respect to the eigenvalues:

  A pos. def.       ⇔  eig((A + A^H)/2) > 0
  A pos. semi-def.  ⇔  eig((A + A^H)/2) ≥ 0                                 (371)
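Relation (368) says that the DFT matrix diagonalizes a circulant (circular Toeplitz) matrix. The sketch below checks this numerically using SciPy's circulant constructor, which is parameterized by the first column rather than the first row (the two conventions differ essentially by a transpose), so it illustrates the idea rather than transcribing eq. (368) literally:

```python
import numpy as np
from scipy.linalg import circulant

n = 5
c = np.arange(1.0, n + 1)                     # first *column* of the circulant matrix
C = circulant(c)

# DFT matrix with elements e^{-j 2 pi k n / N}, eq. (359)
k = np.arange(n)
W = np.exp(-2j * np.pi * np.outer(k, k) / n)

# C = W^{-1} diag(W c) W   (first-column convention)
D = np.diag(W @ c)
assert np.allclose(C, np.linalg.inv(W) @ D @ W)
```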


9.9.3 Trace

The following holds with respect to the trace:

  A pos. def.       ⇒  Tr(A) > 0
  A pos. semi-def.  ⇒  Tr(A) ≥ 0                                            (372)

9.9.4 Inverse

If A is positive definite, then A is invertible and A^{-1} is also positive definite.

9.9.5 Diagonal

If A is positive definite, then A_{ii} > 0 for all i.

9.9.6 Decomposition I

The matrix A is positive semi-definite of rank r ⇔ there exists a matrix B of rank r such that A = BB^T.

The matrix A is positive definite ⇔ there exists an invertible matrix B such that A = BB^T.

9.9.7 Decomposition II

Assume A is an n × n positive semi-definite matrix; then there exists an n × r matrix B of rank r such that B^T A B = I.

9.9.8 Equation with zeros

Assume A is positive semi-definite; then X^T A X = 0 ⇒ A X = 0.

9.9.9 Rank of product

Assume A is positive definite; then rank(BAB^T) = rank(B).

9.9.10 Positive definite property

If A is n × n positive definite and B is r × n of rank r, then BAB^T is positive definite.

9.9.11 Outer Product

If X is n × r, where n ≤ r and rank(X) = n, then XX^T is positive definite.

9.9.12 Small perturbations

If A is positive definite and B is symmetric, then A − tB is positive definite for sufficiently small t.
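A few of the positive-definiteness statements above can be checked numerically. In the sketch below, A = BB^T with a random square B (invertible with probability one), so A is positive definite by Decomposition I; the Cholesky factorization is used as a practical test:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))      # random square B, almost surely invertible
A = B @ B.T                          # positive definite by Decomposition I

# Eigenvalue characterization, eq. (371), for symmetric A
assert np.all(np.linalg.eigvalsh(A) > 0)

# Cholesky factorization exists exactly for (numerically) positive definite matrices
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)

# Section 9.9.4: A^{-1} is positive definite; section 9.9.5: the diagonal is positive
assert np.all(np.linalg.eigvalsh(np.linalg.inv(A)) > 0)
assert np.all(np.diag(A) > 0)
```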


9.10 Block matrices

Let A_{ij} denote the ij-th block of A.

9.10.1 Multiplication

Assuming the dimensions of the blocks match, we have

  \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
  \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}
  =
  \begin{bmatrix}
  A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\
  A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22}
  \end{bmatrix}

9.10.2 The Determinant

The determinant can be expressed by use of

  C_1 = A_{11} - A_{12} A_{22}^{-1} A_{21}                                  (373)
  C_2 = A_{22} - A_{21} A_{11}^{-1} A_{12}                                  (374)

as

  \det\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
  = \det(A_{22}) \cdot \det(C_1) = \det(A_{11}) \cdot \det(C_2)

9.10.3 The Inverse

The inverse can be expressed by use of

  C_1 = A_{11} - A_{12} A_{22}^{-1} A_{21}                                  (375)
  C_2 = A_{22} - A_{21} A_{11}^{-1} A_{12}                                  (376)

as

  \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1}
  = \begin{bmatrix}
    C_1^{-1} & -A_{11}^{-1} A_{12} C_2^{-1} \\
    -C_2^{-1} A_{21} A_{11}^{-1} & C_2^{-1}
    \end{bmatrix}
  = \begin{bmatrix}
    A_{11}^{-1} + A_{11}^{-1} A_{12} C_2^{-1} A_{21} A_{11}^{-1} & -C_1^{-1} A_{12} A_{22}^{-1} \\
    -A_{22}^{-1} A_{21} C_1^{-1} & A_{22}^{-1} + A_{22}^{-1} A_{21} C_1^{-1} A_{12} A_{22}^{-1}
    \end{bmatrix}

9.10.4 Block diagonal

For block diagonal matrices we have

  \begin{bmatrix} A_{11} & 0 \\ 0 & A_{22} \end{bmatrix}^{-1}
  = \begin{bmatrix} (A_{11})^{-1} & 0 \\ 0 & (A_{22})^{-1} \end{bmatrix}    (377)

  \det\begin{bmatrix} A_{11} & 0 \\ 0 & A_{22} \end{bmatrix}
  = \det(A_{11}) \cdot \det(A_{22})                                         (378)


9.10.5 Schur complement

The Schur complement of the matrix

  \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}

is the matrix

  A_{11} - A_{12} A_{22}^{-1} A_{21},

that is, what is denoted C_1 above. Using the Schur complement, one can rewrite the inverse of a block matrix

  \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1}
  = \begin{bmatrix} I & 0 \\ -A_{22}^{-1} A_{21} & I \end{bmatrix}
    \begin{bmatrix} (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} & 0 \\ 0 & A_{22}^{-1} \end{bmatrix}
    \begin{bmatrix} I & -A_{12} A_{22}^{-1} \\ 0 & I \end{bmatrix}

The Schur complement is useful when solving linear systems of the form

  \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}
  \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
  = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}

which has the following equation for x_1:

  (A_{11} - A_{12} A_{22}^{-1} A_{21}) x_1 = b_1 - A_{12} A_{22}^{-1} b_2

When the appropriate inverses exist, this can be solved for x_1, which can then be inserted in the equation for x_2 to solve for x_2.
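As a numerical sanity check of the block-inverse formulas in Section 9.10.3, the sketch below assembles the inverse of a random block matrix from the two Schur complements and compares it with a direct inverse (the blocks of a random Gaussian matrix are invertible with probability one):

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 3, 2
M = rng.standard_normal((n1 + n2, n1 + n2))
A11, A12 = M[:n1, :n1], M[:n1, n1:]
A21, A22 = M[n1:, :n1], M[n1:, n1:]

inv = np.linalg.inv
C1 = A11 - A12 @ inv(A22) @ A21      # Schur complement, eq. (375)
C2 = A22 - A21 @ inv(A11) @ A12      # eq. (376)

# First form of the block inverse in Section 9.10.3
top = np.hstack([inv(C1), -inv(A11) @ A12 @ inv(C2)])
bottom = np.hstack([-inv(C2) @ A21 @ inv(A11), inv(C2)])
assert np.allclose(np.vstack([top, bottom]), inv(M))
```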


10 Functions and Operators

10.1 Functions and Series

10.1.1 Finite Series

  (X^n - I)(X - I)^{-1} = I + X + X^2 + ... + X^{n-1}                       (379)

10.1.2 Taylor Expansion of Scalar Function

Consider some scalar function f(x) which takes the vector x as an argument. This we can Taylor expand around x_0:

  f(x) ≅ f(x_0) + g(x_0)^T (x - x_0) + \frac{1}{2}(x - x_0)^T H(x_0)(x - x_0)   (380)

where

  g(x_0) = \left.\frac{\partial f(x)}{\partial x}\right|_{x_0},
  H(x_0) = \left.\frac{\partial^2 f(x)}{\partial x \partial x^T}\right|_{x_0}

10.1.3 Matrix Functions by Infinite Series

As for analytical functions in one dimension, one can define a matrix function for square matrices X by an infinite series

  f(X) = \sum_{n=0}^{\infty} c_n X^n                                        (381)

assuming the limit exists and is finite. If the coefficients c_n fulfil \sum_n c_n x^n < ∞, then one can prove that the above series exists and is finite, see [1]. Thus for any analytical function f(x) there exists a corresponding matrix function f(X) constructed by the Taylor expansion. Using this one can prove the following results:

1) A matrix A is a zero of its own characteristic polynomial [1]:

  p(λ) = \det(Iλ - A) = \sum_n c_n λ^n   ⇒   p(A) = 0                       (382)

2) If A is square it holds that [1]

  A = UBU^{-1}   ⇒   f(A) = U f(B) U^{-1}                                   (383)

3) A useful fact when using power series is that

  A^n → 0  for  n → ∞  if  |A| < 1                                          (384)
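A small NumPy check of the finite series (379), with an arbitrary test matrix scaled so that X − I is safely invertible:

```python
import numpy as np

rng = np.random.default_rng(3)
X = 0.5 * rng.standard_normal((3, 3))
n = 6
I = np.eye(3)

# (X^n - I)(X - I)^{-1} = I + X + ... + X^{n-1}, eq. (379)
lhs = (np.linalg.matrix_power(X, n) - I) @ np.linalg.inv(X - I)
rhs = sum(np.linalg.matrix_power(X, k) for k in range(n))
assert np.allclose(lhs, rhs)
```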


10.1.4 Exponential Matrix Function

In analogy to the ordinary scalar exponential function, one can define exponential and logarithmic matrix functions:

  e^A ≡ \sum_{n=0}^{\infty} \frac{1}{n!} A^n = I + A + \frac{1}{2}A^2 + ...                       (385)
  e^{-A} ≡ \sum_{n=0}^{\infty} \frac{1}{n!} (-1)^n A^n = I - A + \frac{1}{2}A^2 - ...             (386)
  e^{tA} ≡ \sum_{n=0}^{\infty} \frac{1}{n!} (tA)^n = I + tA + \frac{1}{2}t^2A^2 + ...             (387)
  \ln(I + A) ≡ \sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{n} A^n = A - \frac{1}{2}A^2 + \frac{1}{3}A^3 - ...   (388)

Some of the properties of the exponential function are [1]

  e^A e^B = e^{A+B}   if   AB = BA                                          (389)
  (e^A)^{-1} = e^{-A}                                                       (390)
  \frac{d}{dt} e^{tA} = A e^{tA} = e^{tA} A,   t ∈ ℝ                        (391)
  \frac{d}{dt} Tr(e^{tA}) = Tr(A e^{tA})                                    (392)
  \det(e^A) = e^{Tr(A)}                                                     (393)

10.1.5 Trigonometric Functions

  \sin(A) ≡ \sum_{n=0}^{\infty} \frac{(-1)^n A^{2n+1}}{(2n+1)!} = A - \frac{1}{3!}A^3 + \frac{1}{5!}A^5 - ...   (394)
  \cos(A) ≡ \sum_{n=0}^{\infty} \frac{(-1)^n A^{2n}}{(2n)!} = I - \frac{1}{2!}A^2 + \frac{1}{4!}A^4 - ...       (395)

10.2 Kronecker and Vec Operator

10.2.1 The Kronecker Product

The Kronecker product of an m × n matrix A and an r × q matrix B is an mr × nq matrix, A ⊗ B, defined as

  A ⊗ B = \begin{bmatrix}
          A_{11}B & A_{12}B & \cdots & A_{1n}B \\
          A_{21}B & A_{22}B & \cdots & A_{2n}B \\
          \vdots  &         &        & \vdots  \\
          A_{m1}B & A_{m2}B & \cdots & A_{mn}B
          \end{bmatrix}                                                     (396)
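The exponential-matrix properties (390) and (393) can be verified numerically with SciPy's expm; the random test matrix is an arbitrary choice for this sketch.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))

# det(e^A) = e^{Tr(A)}, eq. (393)
assert np.isclose(np.linalg.det(expm(A)), np.exp(np.trace(A)))

# (e^A)^{-1} = e^{-A}, eq. (390)
assert np.allclose(np.linalg.inv(expm(A)), expm(-A))
```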


The Kronecker product has the following properties (see [18])

  A ⊗ (B + C) = A ⊗ B + A ⊗ C                                               (397)
  A ⊗ B ≠ B ⊗ A   in general                                                (398)
  A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C                                                 (399)
  (α_A A ⊗ α_B B) = α_A α_B (A ⊗ B)                                         (400)
  (A ⊗ B)^T = A^T ⊗ B^T                                                     (401)
  (A ⊗ B)(C ⊗ D) = AC ⊗ BD                                                  (402)
  (A ⊗ B)^{-1} = A^{-1} ⊗ B^{-1}                                            (403)
  rank(A ⊗ B) = rank(A) rank(B)                                             (404)
  Tr(A ⊗ B) = Tr(A) Tr(B)                                                   (405)
  \det(A ⊗ B) = \det(A)^{rank(B)} \det(B)^{rank(A)}                         (406)
  {eig(A ⊗ B)} = {eig(B ⊗ A)}        if A, B are square                     (407)
  {eig(A ⊗ B)} = {eig(A) eig(B)^T}   if A, B are square                     (408)

where {λ_i} denotes the set of values λ_i, that is, the values in no particular order or structure.

10.2.2 The Vec Operator

The vec-operator applied on a matrix A stacks the columns into a vector, i.e. for a 2 × 2 matrix

  A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},
  vec(A) = \begin{bmatrix} A_{11} \\ A_{21} \\ A_{12} \\ A_{22} \end{bmatrix}

Properties of the vec-operator include (see [18])

  vec(AXB) = (B^T ⊗ A) vec(X)                                               (409)
  Tr(A^T B) = vec(A)^T vec(B)                                               (410)
  vec(A + B) = vec(A) + vec(B)                                              (411)
  vec(αA) = α · vec(A)                                                      (412)

10.3 Solutions to Systems of Equations

10.3.1 Simple Linear Regression

Assume we have data (x_n, y_n) for n = 1, ..., N and are seeking the parameters a, b ∈ ℝ such that y_i ≅ a x_i + b. With a least squares error function, the optimal values for a, b can be expressed using the notation

  x = (x_1, ..., x_N)^T,   y = (y_1, ..., y_N)^T,   1 = (1, ..., 1)^T ∈ ℝ^{N×1}


and

  R_{xx} = x^T x,   R_{x1} = x^T 1,   R_{11} = 1^T 1,
  R_{yx} = y^T x,   R_{y1} = y^T 1

as

  \begin{bmatrix} a \\ b \end{bmatrix}
  = \begin{bmatrix} R_{xx} & R_{x1} \\ R_{x1} & R_{11} \end{bmatrix}^{-1}
    \begin{bmatrix} R_{yx} \\ R_{y1} \end{bmatrix}                          (413)

10.3.2 Existence in Linear Systems

Assume A is n × m and consider the linear system

  Ax = b                                                                    (414)

Construct the augmented matrix B = [A b]; then

  Condition                     Solution
  rank(A) = rank(B) = m         Unique solution x
  rank(A) = rank(B) < m         Many solutions x
  rank(A) < rank(B)             No solutions x

10.3.3 Standard Square

Assume A is square and invertible, then

  Ax = b   ⇒   x = A^{-1} b                                                 (415)

10.3.4 Degenerated Square

Assume A is n × n but of rank r < n. In that case, the system Ax = b is solved by

  x = A^+ b

where A^+ is the pseudo-inverse of the rank-deficient matrix, constructed as described in section 3.6.3.

10.3.5 Cramer's rule

The equation

  Ax = b,                                                                   (416)

where A is square and invertible, has exactly one solution x, and the i-th element of x can be found as

  x_i = \frac{\det B}{\det A},                                              (417)

where B equals A, but with the i-th column of A substituted by b.
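A minimal NumPy sketch of the regression solution (413), using synthetic data (the slope, intercept, and noise level are arbitrary choices) and comparing against a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 50)
y = 2.5 * x + 1.0 + 0.1 * rng.standard_normal(50)
one = np.ones_like(x)

# Normal equations assembled as in eq. (413)
R = np.array([[x @ x, x @ one],
              [x @ one, one @ one]])
rhs = np.array([y @ x, y @ one])
a, b = np.linalg.solve(R, rhs)

# Same answer from a generic least-squares solver
coeffs, *_ = np.linalg.lstsq(np.column_stack([x, one]), y, rcond=None)
assert np.allclose([a, b], coeffs)
```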


10.3.6 Over-determined Rectangular

Assume A to be n × m, n > m (tall) and rank(A) = m, then

  Ax = b   ⇒   x = (A^T A)^{-1} A^T b = A^+ b                               (418)

that is, if there exists a solution x at all! If there is no solution, the following can be useful:

  Ax = b   ⇒   x_min = A^+ b                                                (419)

Now x_min is the vector x which minimizes ||Ax - b||^2, i.e. the vector which is "least wrong". The matrix A^+ is the pseudo-inverse of A. See [3].

10.3.7 Under-determined Rectangular

Assume A is n × m and n < m ("broad") and rank(A) = n.

  Ax = b   ⇒   x_min = A^T (AA^T)^{-1} b                                    (420)

The equation has many solutions x. But x_min is the solution which minimizes ||Ax - b||^2 and also the solution with the smallest norm ||x||^2. The same holds for a matrix version: Assume A is n × m, X is m × n and B is n × n, then

  AX = B   ⇒   X_min = A^+ B                                                (421)

The equation has many solutions X. But X_min is the solution which minimizes ||AX - B||^2 and also the solution with the smallest norm ||X||^2. See [3].

Similar but different: Assume A is square n × n and the matrices B_0, B_1 are n × N, where N > n. Then, if B_0 has maximal rank,

  AB_0 = B_1   ⇒   A_min = B_1 B_0^T (B_0 B_0^T)^{-1}                       (422)

where A_min denotes the matrix which is optimal in a least square sense. An interpretation is that A is the linear approximation which maps the column vectors of B_0 into the column vectors of B_1.

10.3.8 Linear form and zeros

  Ax = 0, ∀x   ⇒   A = 0                                                    (423)

10.3.9 Square form and zeros

If A is symmetric, then

  x^T A x = 0, ∀x   ⇒   A = 0                                               (424)

10.3.10 The Lyapunov Equation

  AX + XB = C                                                               (425)
  vec(X) = (I ⊗ A + B^T ⊗ I)^{-1} vec(C)                                    (426)

See Sec 10.2.1 and 10.2.2 for details on the Kronecker product and the vec operator.
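The solution (426) is a direct application of the vec identity (409). The sketch below solves AX + XB = C via the Kronecker form and compares with SciPy's dedicated solver; a unique solution exists as long as A and −B share no eigenvalues, which holds almost surely for the random data used here.

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(7)
n, m = 3, 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((m, m))
C = rng.standard_normal((n, m))

vec = lambda M: M.reshape(-1, order="F")       # column stacking, Section 10.2.2

# vec(X) = (I kron A + B^T kron I)^{-1} vec(C), eq. (426)
K = np.kron(np.eye(m), A) + np.kron(B.T, np.eye(n))
X = np.linalg.solve(K, vec(C)).reshape(n, m, order="F")
assert np.allclose(A @ X + X @ B, C)

# Same solution from the dedicated Sylvester solver
assert np.allclose(X, solve_sylvester(A, B, C))
```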


10.3.11 Encapsulating Sum

  \sum_n A_n X B_n = C                                                      (427)
  vec(X) = \left(\sum_n B_n^T ⊗ A_n\right)^{-1} vec(C)                      (428)

See Sec 10.2.1 and 10.2.2 for details on the Kronecker product and the vec operator.

10.4 Vector Norms

10.4.1 Examples

  ||x||_1 = \sum_i |x_i|                                                    (429)
  ||x||_2^2 = x^H x                                                         (430)
  ||x||_p = \left[\sum_i |x_i|^p\right]^{1/p}                               (431)
  ||x||_∞ = \max_i |x_i|                                                    (432)

Further reading in e.g. [12, p. 52].

10.5 Matrix Norms

10.5.1 Definitions

A matrix norm is a mapping which fulfils

  ||A|| ≥ 0                                                                 (433)
  ||A|| = 0   ⇔   A = 0                                                     (434)
  ||cA|| = |c| ||A||,   c ∈ ℝ                                               (435)
  ||A + B|| ≤ ||A|| + ||B||                                                 (436)

10.5.2 Induced Norm or Operator Norm

An induced norm is a matrix norm induced by a vector norm by the following

  ||A|| = sup{ ||Ax||  :  ||x|| = 1 }                                       (437)

where || · || on the left side is the induced matrix norm, while || · || on the right side denotes the vector norm. For induced norms it holds that

  ||I|| = 1                                                                 (438)
  ||Ax|| ≤ ||A|| · ||x||,   for all A, x                                    (439)
  ||AB|| ≤ ||A|| · ||B||,   for all A, B                                    (440)


10.5.3 Examples

  ||A||_1 = \max_j \sum_i |A_{ij}|                                          (441)
  ||A||_2 = \sqrt{\max \mathrm{eig}(A^H A)}                                 (442)
  ||A||_p = \max_{||x||_p = 1} ||Ax||_p                                     (443)
  ||A||_∞ = \max_i \sum_j |A_{ij}|                                          (444)
  ||A||_F = \sqrt{\sum_{ij} |A_{ij}|^2} = \sqrt{Tr(AA^H)}   (Frobenius)     (445)
  ||A||_{max} = \max_{ij} |A_{ij}|                                          (446)
  ||A||_{KF} = ||sing(A)||_1   (Ky Fan)                                     (447)

where sing(A) is the vector of singular values of the matrix A.

10.5.4 Inequalities

E. H. Rasmussen has in yet unpublished material derived and collected the following inequalities. They are collected in the table below, assuming A is m × n and d = rank(A):

              ||A||_max   ||A||_1   ||A||_∞   ||A||_2   ||A||_F   ||A||_KF
  ||A||_max       1          1         1         1         1         1
  ||A||_1         m          1         m        √m        √m        √m
  ||A||_∞         n          n         1        √n        √n        √n
  ||A||_2       √(mn)       √n        √m         1         1         1
  ||A||_F       √(mn)       √n        √m        √d         1         1
  ||A||_KF      √(mnd)     √(nd)     √(md)       d        √d         1

which is to be read as, e.g.

  ||A||_2 ≤ √m · ||A||_∞                                                    (448)

10.5.5 Condition Number

The 2-norm of A equals \sqrt{\max(\mathrm{eig}(A^T A))} [12, p. 57]. For a symmetric, positive definite matrix, this reduces to max(eig(A)). The condition number based on the 2-norm thus reduces to

  ||A||_2 ||A^{-1}||_2 = \max(\mathrm{eig}(A)) \max(\mathrm{eig}(A^{-1})) = \frac{\max(\mathrm{eig}(A))}{\min(\mathrm{eig}(A))}   (449)

10.6 Rank

10.6.1 Sylvester's Inequality

If A is m × n and B is n × r, then

  rank(A) + rank(B) - n ≤ rank(AB) ≤ min{rank(A), rank(B)}                  (450)
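A few of the norm relations above are easy to confirm with NumPy: eq. (442) for the 2-norm, one entry of the inequality table (eq. (448)), and the eigenvalue form of the condition number (449) for a symmetric positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((5, 3))
m, n = A.shape

# ||A||_2 = sqrt(max eig(A^H A)), eq. (442)
two_norm = np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A)))
assert np.isclose(two_norm, np.linalg.norm(A, 2))

# One entry of the inequality table: ||A||_2 <= sqrt(m) * ||A||_inf, eq. (448)
assert two_norm <= np.sqrt(m) * np.linalg.norm(A, np.inf) + 1e-12

# Condition number of a symmetric positive definite matrix, eq. (449)
S = A.T @ A
eigs = np.linalg.eigvalsh(S)
assert np.isclose(np.linalg.cond(S), eigs.max() / eigs.min())
```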


10.7 Integral Involving Dirac Delta Functions

Assuming A to be square, then

  \int p(s) \delta(x - As) ds = \frac{1}{\det(A)} p(A^{-1} x)               (451)

Assuming A to be "underdetermined", i.e. "tall", then

  \int p(s) \delta(x - As) ds =
  \begin{cases}
  \frac{1}{\sqrt{\det(A^T A)}} p(A^+ x) & \text{if } x = AA^+ x \\
  0 & \text{elsewhere}
  \end{cases}                                                               (452)

See [9].

10.8 Miscellaneous

For any A it holds that

  rank(A) = rank(A^T) = rank(AA^T) = rank(A^T A)                            (453)

It holds that

  A is positive definite   ⇔   ∃B invertible, such that A = BB^T            (454)


A One-dimensional Results

A.1 Gaussian

A.1.1 Density

  p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)     (455)

A.1.2 Normalization

  \int e^{-\frac{(s-\mu)^2}{2\sigma^2}} ds = \sqrt{2\pi\sigma^2}                            (456)
  \int e^{-(ax^2+bx+c)} dx = \sqrt{\frac{\pi}{a}} \exp\left[\frac{b^2 - 4ac}{4a}\right]     (457)
  \int e^{c_2 x^2 + c_1 x + c_0} dx = \sqrt{\frac{\pi}{-c_2}} \exp\left[\frac{c_1^2 - 4c_2 c_0}{-4c_2}\right]   (458)

A.1.3 Derivatives

  \frac{\partial p(x)}{\partial \mu} = p(x) \frac{(x - \mu)}{\sigma^2}                      (459)
  \frac{\partial \ln p(x)}{\partial \mu} = \frac{(x - \mu)}{\sigma^2}                       (460)
  \frac{\partial p(x)}{\partial \sigma} = p(x) \frac{1}{\sigma}\left[\frac{(x - \mu)^2}{\sigma^2} - 1\right]    (461)
  \frac{\partial \ln p(x)}{\partial \sigma} = \frac{1}{\sigma}\left[\frac{(x - \mu)^2}{\sigma^2} - 1\right]     (462)

A.1.4 Completing the Squares

  c_2 x^2 + c_1 x + c_0 = -a(x - b)^2 + w

  -a = c_2,    b = -\frac{1}{2}\frac{c_1}{c_2},    w = c_0 - \frac{1}{4}\frac{c_1^2}{c_2}

or

  c_2 x^2 + c_1 x + c_0 = -\frac{1}{2\sigma^2}(x - \mu)^2 + d

  \mu = \frac{-c_1}{2c_2},    \sigma^2 = \frac{-1}{2c_2},    d = c_0 - \frac{c_1^2}{4c_2}

A.1.5 Moments

If the density is expressed by

  p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right]
  or    p(x) = C \exp(c_2 x^2 + c_1 x)                                                      (463)

then the first few basic moments are


  ⟨x⟩   = \mu                  = \frac{-c_1}{2c_2}
  ⟨x^2⟩ = \sigma^2 + \mu^2     = \frac{-1}{2c_2} + \left(\frac{-c_1}{2c_2}\right)^2
  ⟨x^3⟩ = 3\sigma^2\mu + \mu^3 = \frac{c_1}{(2c_2)^2}\left[3 - \frac{c_1^2}{2c_2}\right]
  ⟨x^4⟩ = \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4
        = \left(\frac{c_1}{2c_2}\right)^4 + 6\left(\frac{c_1}{2c_2}\right)^2\frac{-1}{2c_2} + 3\left(\frac{-1}{2c_2}\right)^2

and the central moments are

  ⟨(x - \mu)⟩   = 0         = 0
  ⟨(x - \mu)^2⟩ = \sigma^2  = \frac{-1}{2c_2}
  ⟨(x - \mu)^3⟩ = 0         = 0
  ⟨(x - \mu)^4⟩ = 3\sigma^4 = 3\left(\frac{-1}{2c_2}\right)^2

A kind of pseudo-moments (un-normalized integrals) can easily be derived as

  \int \exp(c_2 x^2 + c_1 x)\, x^n dx = Z⟨x^n⟩ = \sqrt{\frac{\pi}{-c_2}} \exp\left[\frac{c_1^2}{-4c_2}\right] ⟨x^n⟩   (464)

From the un-centralized moments one can derive other entities like

  ⟨x^2⟩ - ⟨x⟩^2      = \sigma^2                    = \frac{-1}{2c_2}
  ⟨x^3⟩ - ⟨x^2⟩⟨x⟩   = 2\sigma^2\mu                = \frac{2c_1}{(2c_2)^2}
  ⟨x^4⟩ - ⟨x^2⟩^2    = 2\sigma^4 + 4\mu^2\sigma^2  = \frac{2}{(2c_2)^2}\left[1 - \frac{c_1^2}{c_2}\right]

A.2 One Dimensional Mixture of Gaussians

A.2.1 Density and Normalization

  p(s) = \sum_k^{K} \frac{\rho_k}{\sqrt{2\pi\sigma_k^2}} \exp\left[-\frac{1}{2}\frac{(s - \mu_k)^2}{\sigma_k^2}\right]   (465)

A.2.2 Moments

A useful fact of MoG is that

  ⟨x^n⟩ = \sum_k \rho_k ⟨x^n⟩_k                                              (466)

where ⟨·⟩_k denotes average with respect to the k-th component. We can calculate the first four moments from the densities

  p(x) = \sum_k \rho_k \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left[-\frac{1}{2}\frac{(x - \mu_k)^2}{\sigma_k^2}\right]     (467)
  p(x) = \sum_k \rho_k C_k \exp\left[c_{k2} x^2 + c_{k1} x\right]                                                        (468)

as


  ⟨x⟩   = \sum_k \rho_k \mu_k                  = \sum_k \rho_k \left[\frac{-c_{k1}}{2c_{k2}}\right]
  ⟨x^2⟩ = \sum_k \rho_k (\sigma_k^2 + \mu_k^2) = \sum_k \rho_k \left[\frac{-1}{2c_{k2}} + \left(\frac{-c_{k1}}{2c_{k2}}\right)^2\right]
  ⟨x^3⟩ = \sum_k \rho_k (3\sigma_k^2\mu_k + \mu_k^3) = \sum_k \rho_k \left[\frac{c_{k1}}{(2c_{k2})^2}\left(3 - \frac{c_{k1}^2}{2c_{k2}}\right)\right]
  ⟨x^4⟩ = \sum_k \rho_k (\mu_k^4 + 6\mu_k^2\sigma_k^2 + 3\sigma_k^4)
        = \sum_k \rho_k \left[\left(\frac{1}{2c_{k2}}\right)^2\left(\left(\frac{c_{k1}^2}{2c_{k2}}\right)^2 - 6\frac{c_{k1}^2}{2c_{k2}} + 3\right)\right]

If all the Gaussians are centered, i.e. \mu_k = 0 for all k, then

  ⟨x⟩   = 0                          = 0
  ⟨x^2⟩ = \sum_k \rho_k \sigma_k^2   = \sum_k \rho_k \left[\frac{-1}{2c_{k2}}\right]
  ⟨x^3⟩ = 0                          = 0
  ⟨x^4⟩ = \sum_k \rho_k 3\sigma_k^4  = \sum_k \rho_k 3\left[\frac{-1}{2c_{k2}}\right]^2

From the un-centralized moments one can derive other entities like

  ⟨x^2⟩ - ⟨x⟩^2     = \sum_{k,k'} \rho_k \rho_{k'} \left[\mu_k^2 + \sigma_k^2 - \mu_k \mu_{k'}\right]
  ⟨x^3⟩ - ⟨x^2⟩⟨x⟩  = \sum_{k,k'} \rho_k \rho_{k'} \left[3\sigma_k^2\mu_k + \mu_k^3 - (\sigma_k^2 + \mu_k^2)\mu_{k'}\right]
  ⟨x^4⟩ - ⟨x^2⟩^2   = \sum_{k,k'} \rho_k \rho_{k'} \left[\mu_k^4 + 6\mu_k^2\sigma_k^2 + 3\sigma_k^4 - (\sigma_k^2 + \mu_k^2)(\sigma_{k'}^2 + \mu_{k'}^2)\right]

A.2.3 Derivatives

Defining p(s) = \sum_k \rho_k N_s(\mu_k, \sigma_k^2) we get for a parameter \theta_j of the j-th component

  \frac{\partial \ln p(s)}{\partial \theta_j} = \frac{\rho_j N_s(\mu_j, \sigma_j^2)}{\sum_k \rho_k N_s(\mu_k, \sigma_k^2)} \frac{\partial \ln(\rho_j N_s(\mu_j, \sigma_j^2))}{\partial \theta_j}   (469)

that is,

  \frac{\partial \ln p(s)}{\partial \rho_j} = \frac{\rho_j N_s(\mu_j, \sigma_j^2)}{\sum_k \rho_k N_s(\mu_k, \sigma_k^2)} \frac{1}{\rho_j}                                                          (470)
  \frac{\partial \ln p(s)}{\partial \mu_j} = \frac{\rho_j N_s(\mu_j, \sigma_j^2)}{\sum_k \rho_k N_s(\mu_k, \sigma_k^2)} \frac{(s - \mu_j)}{\sigma_j^2}                                             (471)
  \frac{\partial \ln p(s)}{\partial \sigma_j} = \frac{\rho_j N_s(\mu_j, \sigma_j^2)}{\sum_k \rho_k N_s(\mu_k, \sigma_k^2)} \frac{1}{\sigma_j}\left[\frac{(s - \mu_j)^2}{\sigma_j^2} - 1\right]     (472)

Note that \rho_k must be constrained to be proper ratios. Defining the ratios by \rho_j = e^{r_j} / \sum_k e^{r_k}, we obtain

  \frac{\partial \ln p(s)}{\partial r_j} = \sum_l \frac{\partial \ln p(s)}{\partial \rho_l} \frac{\partial \rho_l}{\partial r_j}
  \quad\text{where}\quad \frac{\partial \rho_l}{\partial r_j} = \rho_l (\delta_{lj} - \rho_j)                                                                                                      (473)
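As a numerical cross-check of the Gaussian moments in Section A.1.5 and of the mixture relation (466), the sketch below integrates the densities directly; the parameter values are arbitrary illustration choices.

```python
import numpy as np
from scipy import integrate

# Single Gaussian: <x^2> = sigma^2 + mu^2
mu, sigma = 1.5, 0.7
g = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
m2, _ = integrate.quad(lambda x: x ** 2 * g(x), -np.inf, np.inf)
assert np.isclose(m2, sigma ** 2 + mu ** 2)

# Mixture of Gaussians: <x^2> = sum_k rho_k (sigma_k^2 + mu_k^2), via eq. (466)
rho = np.array([0.3, 0.7])
mus = np.array([-1.0, 2.0])
sigmas = np.array([0.5, 1.5])
p = lambda x: np.sum(rho * np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2))
                     / np.sqrt(2 * np.pi * sigmas ** 2))
m2_mix, _ = integrate.quad(lambda x: x ** 2 * p(x), -np.inf, np.inf)
assert np.isclose(m2_mix, np.sum(rho * (sigmas ** 2 + mus ** 2)))
```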


B Proofs and Details

B.1 Misc Proofs

B.1.1 Proof of Equation 77

Essentially we need to calculate

  \frac{\partial (X^n)_{kl}}{\partial X_{ij}}
  = \frac{\partial}{\partial X_{ij}} \sum_{u_1,...,u_{n-1}} X_{k,u_1} X_{u_1,u_2} \cdots X_{u_{n-1},l}
  = \delta_{k,i}\delta_{u_1,j} X_{u_1,u_2} \cdots X_{u_{n-1},l}
    + X_{k,u_1} \delta_{u_1,i}\delta_{u_2,j} \cdots X_{u_{n-1},l}
    + \cdots
    + X_{k,u_1} X_{u_1,u_2} \cdots \delta_{u_{n-1},i}\delta_{l,j}
  = \sum_{r=0}^{n-1} (X^r)_{ki} (X^{n-1-r})_{jl}
  = \sum_{r=0}^{n-1} (X^r J^{ij} X^{n-1-r})_{kl}

Using the properties of the single entry matrix found in Sec. 9.3.4, the result follows easily.

B.1.2 Details on Eq. 475

  \partial \det(X^H A X)
  = \det(X^H A X)\, Tr[(X^H A X)^{-1} \partial(X^H A X)]
  = \det(X^H A X)\, Tr[(X^H A X)^{-1} (\partial(X^H) A X + X^H \partial(A X))]
  = \det(X^H A X) \big( Tr[(X^H A X)^{-1} \partial(X^H) A X] + Tr[(X^H A X)^{-1} X^H \partial(A X)] \big)
  = \det(X^H A X) \big( Tr[A X (X^H A X)^{-1} \partial(X^H)] + Tr[(X^H A X)^{-1} X^H A \partial(X)] \big)

First, the derivative is found with respect to the real part of X

  \frac{\partial \det(X^H A X)}{\partial \Re X}
  = \det(X^H A X) \left( \frac{Tr[A X (X^H A X)^{-1} \partial(X^H)]}{\partial \Re X} + \frac{Tr[(X^H A X)^{-1} X^H A \partial(X)]}{\partial \Re X} \right)
  = \det(X^H A X) \big( A X (X^H A X)^{-1} + ((X^H A X)^{-1} X^H A)^T \big)


Through the calculations, (84) and (193) were used. In addition, by use of (194), the derivative is found with respect to the imaginary part of X

  i \frac{\partial \det(X^H A X)}{\partial \Im X}
  = i \det(X^H A X) \left( \frac{Tr[A X (X^H A X)^{-1} \partial(X^H)]}{\partial \Im X} + \frac{Tr[(X^H A X)^{-1} X^H A \partial(X)]}{\partial \Im X} \right)
  = \det(X^H A X) \big( A X (X^H A X)^{-1} - ((X^H A X)^{-1} X^H A)^T \big)

Hence, the derivative yields

  \frac{\partial \det(X^H A X)}{\partial X}
  = \frac{1}{2} \left( \frac{\partial \det(X^H A X)}{\partial \Re X} - i \frac{\partial \det(X^H A X)}{\partial \Im X} \right)
  = \det(X^H A X) \big( (X^H A X)^{-1} X^H A \big)^T

and the complex conjugate derivative yields

  \frac{\partial \det(X^H A X)}{\partial X^*}
  = \frac{1}{2} \left( \frac{\partial \det(X^H A X)}{\partial \Re X} + i \frac{\partial \det(X^H A X)}{\partial \Im X} \right)
  = \det(X^H A X)\, A X (X^H A X)^{-1}

Notice, for real X, A, the sum of (202) and (203) is reduced to (45).

Similar calculations yield

  \frac{\partial \det(X A X^H)}{\partial X}
  = \frac{1}{2} \left( \frac{\partial \det(X A X^H)}{\partial \Re X} - i \frac{\partial \det(X A X^H)}{\partial \Im X} \right)
  = \det(X A X^H) \big( A X^H (X A X^H)^{-1} \big)^T                        (474)

and

  \frac{\partial \det(X A X^H)}{\partial X^*}
  = \frac{1}{2} \left( \frac{\partial \det(X A X^H)}{\partial \Re X} + i \frac{\partial \det(X A X^H)}{\partial \Im X} \right)
  = \det(X A X^H) (X A X^H)^{-1} X A                                        (475)


References

[1] Karl Gustav Andersson and Lars-Christer Boiers. Ordinära differentialekvationer. Studentlitteratur, 1992.

[2] Jörn Anemüller, Terrence J. Sejnowski, and Scott Makeig. Complex independent component analysis of frequency-domain electroencephalographic data. Neural Networks, 16(9):1311–1323, November 2003.

[3] S. Barnet. Matrices. Methods and Applications. Oxford Applied Mathematics and Computing Science Series. Clarendon Press, 1990.

[4] Christopher Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[5] Robert J. Boik. Lecture notes: Statistics 550. Online, April 22, 2002. Notes.

[6] D. H. Brandwood. A complex gradient operator and its application in adaptive array theory. IEE Proceedings, 130(1):11–16, February 1983. Pts. F and H.

[7] M. Brookes. Matrix Reference Manual, 2004. Website, May 20, 2004.

[8] K. Conradsen. En introduktion til statistik. IMM lecture notes, 1984.

[9] Mads Dyrholm. Some matrix results, 2004. Website, August 23, 2004.

[10] F. A. Nielsen. Formula. Neuro Research Unit and Technical University of Denmark, 2002.

[11] A. B. Gelman, J. S. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall / CRC, 1995.

[12] Gene H. Golub and Charles F. van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, 3rd edition, 1996.

[13] Robert M. Gray. Toeplitz and circulant matrices: A review. Technical report, Information Systems Laboratory, Department of Electrical Engineering, Stanford University, Stanford, California 94305, August 2002.

[14] Simon Haykin. Adaptive Filter Theory. Prentice Hall, Upper Saddle River, NJ, 4th edition, 2002.

[15] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

[16] K. V. Mardia, J. T. Kent, and J. M. Bibby. Multivariate Analysis. Academic Press Ltd., 1979.

[17] Carl D. Meyer. Generalized inversion of modified matrices. SIAM Journal of Applied Mathematics, 24(3):315–323, May 1973.


[18] Thomas P. Minka. Old and new matrix algebra useful for statistics, December 2000. Notes.

[19] Daniele Mortari. Ortho–Skew and Ortho–Sym Matrix Trigonometry. John Lee Junkins Astrodynamics Symposium, AAS 03–265, May 2003. Texas A&M University, College Station, TX.

[20] L. Parra and C. Spence. Convolutive blind separation of non-stationary sources. In IEEE Transactions Speech and Audio Processing, pages 320–327, May 2000.

[21] Kaare Brandt Petersen, Jiucang Hao, and Te-Won Lee. Generative and filtering approaches for overcomplete representations. Neural Information Processing – Letters and Reviews, vol. 8(1), 2005.

[22] John G. Proakis and Dimitris G. Manolakis. Digital Signal Processing. Prentice-Hall, 1996.

[23] Laurent Schwartz. Cours d'Analyse, volume II. Hermann, Paris, 1967. As referenced in [14].

[24] Shayle R. Searle. Matrix Algebra Useful for Statistics. John Wiley and Sons, 1982.

[25] G. Seber and A. Lee. Linear Regression Analysis. John Wiley and Sons, 2002.

[26] S. M. Selby. Standard Mathematical Tables. CRC Press, 1974.

[27] Inna Stainvas. Matrix algebra in differential calculus. Neural Computing Research Group, Information Engineering, Aston University, UK, August 2002. Notes.

[28] P. P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice Hall, 1993.

[29] Max Welling. The Kalman Filter. Lecture Note.


Index

Anti-symmetric, 43
Block matrix, 47
Chain rule, 14
Cholesky-decomposition, 27
Co-kurtosis, 28
Co-skewness, 28
Cramer's Rule, 53
Derivative of a complex matrix, 22
Derivative of a determinant, 7
Derivative of a trace, 11
Derivative of an inverse, 8
Derivative of symmetric matrix, 14
Derivatives of Toeplitz matrix, 15
Dirichlet distribution, 32
Eigenvalues, 25
Eigenvectors, 25
Exponential Matrix Function, 51
Gaussian, conditional, 34
Gaussian, entropy, 38
Gaussian, linear combination, 35
Gaussian, marginal, 34
Gaussian, product of densities, 36
Generalized inverse, 20
Kronecker product, 51
Moore-Penrose inverse, 20
Multinomial distribution, 32
Norm of a matrix, 55
Norm of a vector, 55
Normal-Inverse Gamma distribution, 32
Normal-Inverse Wishart distribution, 33
Pseudo-inverse, 20
Schur complement, 35, 48
Single entry matrix, 41
Singular Value Decomposition (SVD), 25
Student-t, 31
Sylvester's Inequality, 57
Symmetric, 43
Toeplitz matrix, 44
Vandermonde matrix, 44
Vec operator, 51
Wishart distribution, 32
Woodbury identity, 17
