Jolliffe I., Principal Component Analysis (2nd ed., Springer, 2002)

2.1. Optimal Algebraic Properties of Population Principal Components

and, from (2.1.10),

$$\Sigma - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx} = \sum_{k=q+1}^{p} \lambda_k \alpha_k \alpha_k'.$$

Finding a linear function of $x$ having maximum conditional variance reduces to finding the eigenvalues and eigenvectors of the conditional covariance matrix, and it is easy to verify that these are simply $(\lambda_{q+1}, \alpha_{q+1}), (\lambda_{q+2}, \alpha_{q+2}), \ldots, (\lambda_p, \alpha_p)$. The eigenvector associated with the largest of these eigenvalues is $\alpha_{q+1}$, so the required linear function is $\alpha_{q+1}' x$, namely the $(q+1)$th PC.
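As a quick numerical illustration (added here, not from the book), the sketch below checks this eigenstructure with NumPy. It assumes the setup of the surrounding text, in which $z = A_q' x$ is the vector of the first $q$ PCs, so that $\Sigma_{zz} = A_q' \Sigma A_q$ and $\Sigma_{xz} = \Sigma A_q$; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 6, 2

# Random positive-definite covariance matrix Sigma.
M = rng.standard_normal((p, p))
Sigma = M @ M.T

# Eigenvalues/eigenvectors of Sigma, sorted in decreasing order.
lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]
A_q = A[:, :q]  # columns alpha_1, ..., alpha_q

# Assumed setup: z = A_q' x, so Sigma_zz = A_q' Sigma A_q and Sigma_xz = Sigma A_q.
Sigma_zz = A_q.T @ Sigma @ A_q
Sigma_xz = Sigma @ A_q

# Conditional covariance matrix of x given z.
cond = Sigma - Sigma_xz @ np.linalg.inv(Sigma_zz) @ Sigma_xz.T

# Its eigenvalues should be lambda_{q+1}, ..., lambda_p, plus q zeros.
mu = np.sort(np.linalg.eigvalsh(cond))[::-1]
print(np.allclose(mu[:p - q], lam[q:]))  # True
print(np.allclose(mu[p - q:], 0.0))      # True
```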

Property A4. As in Properties A1 and A2, consider the transformation $y = B'x$. If $\det(\Sigma_y)$ denotes the determinant of the covariance matrix of $y$, then $\det(\Sigma_y)$ is maximized when $B = A_q$.

Proof. Consider any integer $k$ between 1 and $q$, and let $S_k$ be the subspace of $p$-dimensional vectors orthogonal to $\alpha_1, \ldots, \alpha_{k-1}$. Then $\dim(S_k) = p - k + 1$, where $\dim(S_k)$ denotes the dimension of $S_k$. The $k$th eigenvalue, $\lambda_k$, of $\Sigma$ satisfies

$$\lambda_k = \sup_{\substack{\alpha \in S_k \\ \alpha \neq 0}} \left\{ \frac{\alpha' \Sigma \alpha}{\alpha' \alpha} \right\}.$$

Suppose that $\mu_1 > \mu_2 > \cdots > \mu_q$ are the eigenvalues of $B' \Sigma B$ and that $\gamma_1, \gamma_2, \ldots, \gamma_q$ are the corresponding eigenvectors. Let $T_k$ be the subspace of $q$-dimensional vectors orthogonal to $\gamma_{k+1}, \ldots, \gamma_q$, with $\dim(T_k) = k$. Then, for any non-zero vector $\gamma$ in $T_k$,

$$\frac{\gamma' B' \Sigma B \gamma}{\gamma' \gamma} \geq \mu_k.$$

Consider the subspace $\tilde{S}_k$ of $p$-dimensional vectors of the form $B\gamma$ for $\gamma$ in $T_k$. Then $\dim(\tilde{S}_k) = \dim(T_k) = k$, because $B$ is one-to-one; in fact, $B$ preserves lengths of vectors. From a general result concerning dimensions of two vector spaces, we have

$$\dim(S_k \cap \tilde{S}_k) + \dim(S_k + \tilde{S}_k) = \dim S_k + \dim \tilde{S}_k.$$

But $\dim(S_k + \tilde{S}_k) \leq p$, $\dim(S_k) = p - k + 1$ and $\dim(\tilde{S}_k) = k$, so

$$\dim(S_k \cap \tilde{S}_k) \geq 1.$$

There is therefore a non-zero vector $\alpha$ in $S_k$ of the form $\alpha = B\gamma$ for a $\gamma$ in $T_k$, and it follows that

$$\mu_k \leq \frac{\gamma' B' \Sigma B \gamma}{\gamma' \gamma} = \frac{\gamma' B' \Sigma B \gamma}{\gamma' B' B \gamma} = \frac{\alpha' \Sigma \alpha}{\alpha' \alpha} \leq \lambda_k,$$

where the first equality uses $\gamma' \gamma = \gamma' B' B \gamma$, since $B'B = I_q$. Thus the $k$th eigenvalue of $B' \Sigma B$ is no greater than the $k$th eigenvalue of $\Sigma$ for $k = 1, \ldots, q$. This means that

$$\det(\Sigma_y) = \prod_{k=1}^{q} (k\text{th eigenvalue of } B' \Sigma B) \leq \prod_{k=1}^{q} \lambda_k.$$

But if $B = A_q$, then the eigenvalues of $B' \Sigma B$ are $\lambda_1, \lambda_2, \ldots, \lambda_q$, so that $\det(\Sigma_y) = \prod_{k=1}^{q} \lambda_k$ in this case, and therefore $\det(\Sigma_y)$ is maximized when $B = A_q$. $\Box$

The result can be extended to the case where the columns of $B$ are not necessarily orthonormal, but the diagonal elements of $B'B$ are unity (see Okamoto (1969)). A stronger, stepwise version of Property A4 is discussed by O'Hagan (1984), who argues that it provides an alternative derivation of PCs, and that this derivation can be helpful in motivating the use of PCA. O'Hagan's derivation is, in fact, equivalent to (though a stepwise version of) Property A5, which is discussed next.

Note that Property A1 could also have been proved using reasoning similar to that just employed for Property A4, but some of the intermediate results derived during the earlier proof of A1 are useful elsewhere in the chapter.

The statistical importance of the present result follows because the determinant of a covariance matrix, which is called the generalized variance, can be used as a single measure of spread for a multivariate random variable (Press, 1972, p. 108). The square root of the generalized variance, for a multivariate normal distribution, is proportional to the 'volume' in $p$-dimensional space that encloses a fixed proportion of the probability distribution of $x$. For multivariate normal $x$, the first $q$ PCs are, therefore, as a consequence of Property A4, $q$ linear functions of $x$ whose joint probability distribution has contours of fixed probability enclosing the maximum volume.
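A minimal NumPy check of Property A4 (again an added illustration, not part of the original text): for a random covariance matrix, $\det(B' \Sigma B)$ computed for many random matrices $B$ with orthonormal columns never exceeds $\prod_{k=1}^{q} \lambda_k$, and $B = A_q$ attains that bound.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 6, 2

M = rng.standard_normal((p, p))
Sigma = M @ M.T
lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]   # decreasing eigenvalues, matching eigenvectors
A_q = A[:, :q]

def gen_var(B):
    """Generalized variance det(Sigma_y) of y = B'x."""
    return np.linalg.det(B.T @ Sigma @ B)

bound = np.prod(lam[:q])                # product of the q largest eigenvalues
print(np.isclose(gen_var(A_q), bound))  # True: B = A_q attains the bound

for _ in range(1000):
    # Random p x q matrix with orthonormal columns, via thin QR.
    B, _ = np.linalg.qr(rng.standard_normal((p, q)))
    assert gen_var(B) <= bound * (1 + 1e-8)
```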

Property A5. Suppose that we wish to predict each random variable $x_j$ in $x$ by a linear function of $y$, where $y = B'x$, as before. If $\sigma_j^2$ is the residual variance in predicting $x_j$ from $y$, then $\sum_{j=1}^{p} \sigma_j^2$ is minimized if $B = A_q$.

The statistical implication of this result is that if we wish to get the best linear predictor of $x$ in a $q$-dimensional subspace, in the sense of minimizing the sum over elements of $x$ of the residual variances, then this optimal subspace is defined by the first $q$ PCs.
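Property A5 can be checked in the same spirit. The sketch below (an added illustration) uses the standard residual covariance of the best linear predictor of $x$ from $y = B'x$, namely $\Sigma - \Sigma B (B' \Sigma B)^{-1} B' \Sigma$, the same form as the conditional covariance quoted above with $z$ replaced by $y$; the sum of residual variances is its trace, which should bottom out at $\sum_{k=q+1}^{p} \lambda_k$ when $B = A_q$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 6, 2

M = rng.standard_normal((p, p))
Sigma = M @ M.T
lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]
A_q = A[:, :q]

def resid_var_sum(B):
    """Sum of residual variances when predicting x linearly from y = B'x."""
    SB = Sigma @ B
    return np.trace(Sigma - SB @ np.linalg.solve(B.T @ SB, SB.T))

best = lam[q:].sum()                         # sum of the p - q smallest eigenvalues
print(np.isclose(resid_var_sum(A_q), best))  # True: B = A_q attains the minimum

for _ in range(1000):
    B, _ = np.linalg.qr(rng.standard_normal((p, q)))
    assert resid_var_sum(B) >= best * (1 - 1e-8)
```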
