Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)
If a distribution other than the multivariate normal is assumed, distributional results for PCs will typically become less tractable. Jackson (1991, Section 4.8) gives a number of references that examine the non-normal case. In addition, for non-normal distributions a number of alternatives to PCs can reasonably be suggested (see Sections 13.1, 13.3 and 14.4).

Another deviation from the assumptions underlying most of the distributional results arises when the $n$ observations are not independent. The classic examples of this are when the observations correspond to adjacent points in time (a time series) or in space. Another situation where non-independence occurs is found in sample surveys, where survey designs are often more complex than simple random sampling, and induce dependence between observations (see Skinner et al. (1986)). PCA for non-independent data, especially time series, is discussed in detail in Chapter 12.

As a complete contrast to the strict assumptions made in most work on the distributions of PCs, Efron and Tibshirani (1993, Section 7.2) look at the use of the ‘bootstrap’ in this context. The idea is, for a particular sample of $n$ observations $x_1, x_2, \ldots, x_n$, to take repeated random samples of size $n$ from the distribution that has $P[x = x_i] = 1/n$, $i = 1, 2, \ldots, n$, calculate the PCs for each sample, and build up empirical distributions for PC coefficients and variances. These distributions rely only on the structure of the sample, and not on any predetermined assumptions.
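The resampling scheme just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of Efron and Tibshirani: the function names (`pca`, `align_signs`, `bootstrap_pca`), the use of NumPy, the default of 500 replicates and the sign convention are all choices made here for illustration. Each bootstrap PC is flipped, if necessary, so that it has a non-negative inner product with the corresponding full-sample PC. This addresses sign switching only; reordering of PCs with similar variances is not handled.

```python
import numpy as np

def pca(X):
    """Eigenvalues (descending) and eigenvectors (as columns) of the sample covariance matrix."""
    S = np.cov(X, rowvar=False)            # divisor n - 1
    evals, evecs = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

def align_signs(A, A_ref):
    """Flip columns of A so each has a non-negative inner product with the same column of A_ref."""
    signs = np.sign(np.sum(A * A_ref, axis=0))
    signs[signs == 0] = 1.0                # leave exactly orthogonal columns unchanged
    return A * signs

def bootstrap_pca(X, n_boot=500, seed=0):
    """Empirical bootstrap distributions of PC variances and coefficients."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, p = X.shape
    l_ref, A_ref = pca(X)                  # full-sample PCs, used as the sign reference
    variances = np.empty((n_boot, p))
    coefficients = np.empty((n_boot, p, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # draw n rows with replacement: P[x = x_i] = 1/n
        l_b, A_b = pca(X[idx])
        variances[b] = l_b
        coefficients[b] = align_signs(A_b, A_ref)
    return l_ref, A_ref, variances, coefficients

if __name__ == "__main__":
    # Synthetic data, purely for illustration.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 3)) * np.array([3.0, 1.0, 0.3])
    l_ref, A_ref, variances, coefficients = bootstrap_pca(X, n_boot=200)
    print("sample PC variances:", l_ref)
    print("bootstrap std. dev. of PC variances:", variances.std(axis=0, ddof=1))
    print("bootstrap std. dev. of first PC coefficients:", coefficients[:, :, 0].std(axis=0, ddof=1))
```

The returned arrays `variances` and `coefficients` hold one row (or one matrix) per bootstrap replicate, so percentiles or standard deviations taken along the first axis give the empirical distributions referred to in the text.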
Care needs to be taken in comparing PCs from different bootstrap samples because of possible reordering and/or sign switching in the PCs from different samples. Failure to account for these phenomena is likely to give misleadingly wide distributions for PC coefficients, and distributions for PC variances that may be too narrow.

3.7 Inference Based on Sample Principal Components

The distributional results outlined in the previous section may be used to make inferences about population PCs, given the sample PCs, provided that the necessary assumptions are valid. The major assumption that $x$ has a multivariate normal distribution is often not satisfied and the practical value of the results is therefore limited. It can be argued that PCA should only ever be done for data that are, at least approximately, multivariate normal, for it is only then that ‘proper’ inferences can be made regarding the underlying population PCs. As already noted in Section 2.2, this is a rather narrow view of what PCA can do, as it is a much more widely applicable tool whose main use is descriptive rather than inferential. It can provide valuable descriptive information for a wide variety of data, whether the variables are continuous and normally distributed or not. The majority of applications of PCA successfully treat the technique as a purely descriptive tool, although Mandel (1972) argued that retaining $m$ PCs in an analysis implicitly assumes a model for the data, based on (3.5.3). There has recently been an upsurge of interest in models related to PCA; this is discussed further in Section 3.9.

Although the purely inferential side of PCA is a very small part of the overall picture, the ideas of inference can sometimes be useful and are discussed briefly in the next three subsections.

3.7.1 Point Estimation

The maximum likelihood estimator (MLE) for $\Sigma$, the covariance matrix of a multivariate normal distribution, is not $S$, but $\frac{n-1}{n} S$ (see, for example, Press (1972, Section 7.1) for a derivation). This result is hardly surprising, given the corresponding result for the univariate normal. If $\lambda$, $l$, $\alpha_k$, $a_k$ and related quantities are defined as in the previous section, then the MLEs of $\lambda$ and $\alpha_k$, $k = 1, 2, \ldots, p$, can be derived from the MLE of $\Sigma$ and are equal to $\hat{\lambda} = \frac{n-1}{n} l$ and $\hat{\alpha}_k = a_k$, $k = 1, 2, \ldots, p$, assuming that the elements of $\lambda$ are all positive and distinct. The MLEs are the same in this case as the estimators derived by the method of moments. The MLE for $\lambda_k$ is biased but asymptotically unbiased, as is the MLE for $\Sigma$. As noted in the previous section, $l$ itself, as well as $\hat{\lambda}$, is a biased estimator for $\lambda$, but ‘corrections’ can be made to reduce the bias.

In the case where some of the $\lambda_k$ are equal, the MLE for their common value is simply the average of the corresponding $l_k$, multiplied by $(n-1)/n$. The MLEs of the $\alpha_k$ corresponding to equal $\lambda_k$ are not unique; the $(p \times q)$ matrix whose columns are MLEs of $\alpha_k$ corresponding to equal $\lambda_k$ can be multiplied by any $(q \times q)$ orthogonal matrix, where $q$ is the multiplicity of the eigenvalues, to get another set of MLEs.

Most often, point estimates of $\lambda$, $\alpha_k$ are simply given by $l$, $a_k$, and they are rarely accompanied by standard errors. An exception is Flury (1997, Section 8.6). Jackson (1991, Sections 5.3, 7.5) goes further and gives examples that not only include estimated standard errors, but also estimates of the correlations between elements of $l$ and between elements of $a_k$ and $a_{k'}$. The practical implications of these (sometimes large) correlations are discussed in Jackson’s examples. Flury (1988, Sections 2.5, 2.6) gives a thorough discussion of asymptotic inference for functions of the variances and coefficients of covariance-based PCs.

If multivariate normality cannot be assumed, and if there is no obvious alternative distributional assumption, then it may be desirable to use a ‘robust’ approach to the estimation of the PCs: this topic is discussed in Section 10.4.
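The relationship between the sample quantities and the MLEs in Section 3.7.1 can be checked numerically. The sketch below is illustrative only: the function names `pc_point_estimates` and `pooled_eigenvalue_mle` and the use of NumPy are assumptions introduced here, and the data are synthetic. It computes the eigenvalues and eigenvectors of the sample covariance matrix (divisor n − 1), rescales the eigenvalues by (n − 1)/n to obtain the MLEs under multivariate normality, and averages a group of sample eigenvalues for the case where the corresponding population eigenvalues are assumed equal.

```python
import numpy as np

def pc_point_estimates(X):
    """Return (l, lam_hat, A): sample eigenvalues, their MLEs, and PC coefficients (columns)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    S = np.cov(X, rowvar=False)            # sample covariance matrix, divisor n - 1
    l, A = np.linalg.eigh(S)               # ascending eigenvalues
    order = np.argsort(l)[::-1]
    l, A = l[order], A[:, order]           # reorder so that l[0] >= l[1] >= ...
    lam_hat = (n - 1) / n * l              # MLE of the eigenvalues; MLE of each alpha_k is a_k
    return l, lam_hat, A

def pooled_eigenvalue_mle(l, group, n):
    """MLE of a common eigenvalue: mean of the sample eigenvalues in `group`, times (n - 1)/n."""
    return (n - 1) / n * np.mean(np.asarray(l)[list(group)])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 3))           # synthetic data, for illustration only
    l, lam_hat, A = pc_point_estimates(X)

    # The same values arise as eigenvalues of the MLE of the covariance matrix (divisor n).
    S_mle = np.cov(X, rowvar=False, bias=True)
    print(np.allclose(np.sort(np.linalg.eigvalsh(S_mle))[::-1], lam_hat))

    # If the last two population eigenvalues are assumed equal:
    print(pooled_eigenvalue_mle(l, group=[1, 2], n=X.shape[0]))
```

Rescaling a covariance matrix by a constant changes its eigenvalues by the same constant and leaves its eigenvectors unchanged, which is why only the eigenvalue estimates differ between the sample PCs and the MLEs.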