Jolliffe, I.T. Principal Component Analysis, 2nd edition. Springer, 2002.
2.1. Optimal Algebraic Properties of Population Principal Components

and, from (2.1.10),
\[
\Sigma - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx} = \sum_{k=q+1}^{p} \lambda_k \alpha_k \alpha_k'.
\]
Finding a linear function of $x$ having maximum conditional variance reduces to finding the eigenvalues and eigenvectors of the conditional covariance matrix, and it is easy to verify that these are simply $(\lambda_{q+1}, \alpha_{q+1}), (\lambda_{q+2}, \alpha_{q+2}), \ldots, (\lambda_p, \alpha_p)$. The eigenvector associated with the largest of these eigenvalues is $\alpha_{q+1}$, so the required linear function is $\alpha_{q+1}' x$, namely the $(q+1)$th PC.

Property A4. As in Properties A1 and A2, consider the transformation $y = B'x$. If $\det(\Sigma_y)$ denotes the determinant of the covariance matrix of $y$, then $\det(\Sigma_y)$ is maximized when $B = A_q$.

Proof. Consider any integer $k$ between 1 and $q$, and let $S_k$ be the subspace of $p$-dimensional vectors orthogonal to $\alpha_1, \ldots, \alpha_{k-1}$. Then $\dim(S_k) = p - k + 1$, where $\dim(S_k)$ denotes the dimension of $S_k$. The $k$th eigenvalue, $\lambda_k$, of $\Sigma$ satisfies
\[
\lambda_k = \sup_{\substack{\alpha \in S_k \\ \alpha \neq 0}} \left\{ \frac{\alpha' \Sigma \alpha}{\alpha' \alpha} \right\}.
\]
Suppose that $\mu_1 > \mu_2 > \cdots > \mu_q$ are the eigenvalues of $B' \Sigma B$ and that $\gamma_1, \gamma_2, \ldots, \gamma_q$ are the corresponding eigenvectors. Let $T_k$ be the subspace of $q$-dimensional vectors orthogonal to $\gamma_{k+1}, \ldots, \gamma_q$, with $\dim(T_k) = k$. Then, for any non-zero vector $\gamma$ in $T_k$,
\[
\frac{\gamma' B' \Sigma B \gamma}{\gamma' \gamma} \geq \mu_k.
\]
Consider the subspace $\tilde{S}_k$ of $p$-dimensional vectors of the form $B\gamma$ for $\gamma$ in $T_k$. Then $\dim(\tilde{S}_k) = \dim(T_k) = k$, because $B$ is one-to-one; in fact, $B$ preserves lengths of vectors. From a general result concerning the dimensions of two vector spaces, we have
\[
\dim(S_k \cap \tilde{S}_k) + \dim(S_k + \tilde{S}_k) = \dim S_k + \dim \tilde{S}_k.
\]
But $\dim(S_k + \tilde{S}_k) \leq p$, $\dim(S_k) = p - k + 1$ and $\dim(\tilde{S}_k) = k$, so
\[
\dim(S_k \cap \tilde{S}_k) \geq 1.
\]
There is therefore a non-zero vector $\alpha$ in $S_k$ of the form $\alpha = B\gamma$ for a $\gamma$ in $T_k$, and it follows that
\[
\mu_k \leq \frac{\gamma' B' \Sigma B \gamma}{\gamma' \gamma} = \frac{\gamma' B' \Sigma B \gamma}{\gamma' B' B \gamma} = \frac{\alpha' \Sigma \alpha}{\alpha' \alpha} \leq \lambda_k.
\]
Thus the $k$th eigenvalue of $B' \Sigma B$ is no greater than the $k$th eigenvalue of $\Sigma$ for $k = 1, \ldots, q$. This means that
\[
\det(\Sigma_y) = \prod_{k=1}^{q} (k\text{th eigenvalue of } B' \Sigma B) \leq \prod_{k=1}^{q} \lambda_k.
\]
But if $B = A_q$, then the eigenvalues of $B' \Sigma B$ are $\lambda_1, \lambda_2, \ldots, \lambda_q$, so that $\det(\Sigma_y) = \prod_{k=1}^{q} \lambda_k$ in this case, and therefore $\det(\Sigma_y)$ is maximized when $B = A_q$. $\Box$
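Property A4 is straightforward to check numerically. The following sketch is an illustration, not part of the book: it assumes NumPy is available and uses a randomly generated positive definite matrix in place of the population covariance matrix $\Sigma$. It confirms that, over matrices $B$ with orthonormal columns, $\det(B'\Sigma B)$ attains its maximum $\prod_{k=1}^{q} \lambda_k$ at $B = A_q$.

import numpy as np

rng = np.random.default_rng(0)
p, q = 6, 3

# A random positive definite matrix standing in for Sigma.
M = rng.standard_normal((p, p))
Sigma = M @ M.T + p * np.eye(p)

# Eigenvectors alpha_1, ..., alpha_p as columns of A, eigenvalues descending.
lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]

A_q = A[:, :q]
det_pc = np.linalg.det(A_q.T @ Sigma @ A_q)
print(np.isclose(det_pc, np.prod(lam[:q])))  # True: det(Sigma_y) = lambda_1 ... lambda_q

# Random p x q matrices with orthonormal columns never do better.
for _ in range(1000):
    B, _ = np.linalg.qr(rng.standard_normal((p, q)))
    assert np.linalg.det(B.T @ Sigma @ B) <= det_pc + 1e-9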
The result can be extended to the case where the columns of $B$ are not necessarily orthonormal, but the diagonal elements of $B'B$ are unity (see Okamoto (1969)). A stronger, stepwise version of Property A4 is discussed by O'Hagan (1984), who argues that it provides an alternative derivation of PCs, and that this derivation can be helpful in motivating the use of PCA. O'Hagan's derivation is, in fact, equivalent to (though a stepwise version of) Property A5, which is discussed next.

Note that Property A1 could also have been proved using reasoning similar to that just employed for Property A4, but some of the intermediate results derived during the earlier proof of A1 are useful elsewhere in the chapter.

The statistical importance of the present result follows because the determinant of a covariance matrix, which is called the generalized variance, can be used as a single measure of spread for a multivariate random variable (Press, 1972, p. 108). The square root of the generalized variance, for a multivariate normal distribution, is proportional to the 'volume' in $p$-dimensional space that encloses a fixed proportion of the probability distribution of $x$. For multivariate normal $x$, the first $q$ PCs are, therefore, as a consequence of Property A4, $q$ linear functions of $x$ whose joint probability distribution has contours of fixed probability enclosing the maximum volume.

Property A5. Suppose that we wish to predict each random variable $x_j$ in $x$ by a linear function of $y$, where $y = B'x$, as before. If $\sigma_j^2$ is the residual variance in predicting $x_j$ from $y$, then $\sum_{j=1}^{p} \sigma_j^2$ is minimized if $B = A_q$.

The statistical implication of this result is that if we wish to get the best linear predictor of $x$ in a $q$-dimensional subspace, in the sense of minimizing the sum over elements of $x$ of the residual variances, then this optimal subspace is defined by the first $q$ PCs.
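Property A5 can be checked in the same way. In the sketch below (again an illustration under the same assumptions, not the book's own code), the best linear predictor of $x$ from $y = B'x$ has residual covariance matrix $\Sigma - \Sigma B (B'\Sigma B)^{-1} B'\Sigma$, so the sum of residual variances is its trace. With $B = A_q$ this trace reduces to $\lambda_{q+1} + \cdots + \lambda_p$, and no other $B$ with orthonormal columns does better.

import numpy as np

rng = np.random.default_rng(0)
p, q = 6, 3
M = rng.standard_normal((p, p))
Sigma = M @ M.T + p * np.eye(p)
lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]

def residual_variance_sum(Sigma, B):
    # Trace of the residual covariance of the best linear predictor of x
    # from y = B'x: Sigma - Sigma B (B' Sigma B)^{-1} B' Sigma.
    S_xy = Sigma @ B
    return np.trace(Sigma - S_xy @ np.linalg.solve(B.T @ Sigma @ B, S_xy.T))

best = residual_variance_sum(Sigma, A[:, :q])
print(np.isclose(best, lam[q:].sum()))  # True: equals lambda_{q+1} + ... + lambda_p

for _ in range(500):
    B, _ = np.linalg.qr(rng.standard_normal((p, q)))
    assert residual_variance_sum(Sigma, B) >= best - 1e-9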