Jolliffe, I. Principal Component Analysis (2nd ed., Springer, 2002)
Table 3.5. Principal components based on the correlation matrix of Table 3.4

Component number         1     2     3     4     5     6     7     8     9    10

Coefficients
V1                     0.3  −0.2   0.2  −0.5   0.3   0.1  −0.1  −0.0  −0.6   0.2
V2                     0.4  −0.2   0.2  −0.5   0.3   0.0  −0.1  −0.0   0.7  −0.3
V3                     0.4  −0.1  −0.1  −0.0  −0.7   0.5  −0.2   0.0   0.1   0.1
V4                     0.4  −0.1  −0.1  −0.0  −0.4  −0.7   0.3  −0.0  −0.1  −0.1
V5                     0.3  −0.2   0.1   0.5   0.2   0.2  −0.0  −0.1  −0.2  −0.6
V6                     0.3  −0.2   0.2   0.5   0.2  −0.1  −0.0   0.1   0.2   0.6
V7                     0.3   0.3  −0.5  −0.0   0.2   0.3   0.7   0.0  −0.0   0.0
V8                     0.3   0.3  −0.5   0.1   0.2  −0.2  −0.7  −0.0  −0.0  −0.0
V9                     0.2   0.5   0.4   0.0  −0.1   0.0  −0.0   0.7  −0.0  −0.1
V10                    0.2   0.5   0.4   0.0  −0.1   0.0   0.0  −0.7   0.0   0.0

Percentage of total
variation explained   52.3  20.4  11.0   8.5   5.0   1.0   0.9   0.6   0.2   0.2

3.9 Models for Principal Component Analysis

There is a variety of interpretations of what is meant by a model in the context of PCA. Mandel (1972) considers the retention of m PCs, based on the SVD (3.5.3), as implicitly using a model. Caussinus (1986) discusses three types of ‘model.’ The first is a ‘descriptive algebraic model,’ which in its simplest form reduces to the SVD. It can also be generalized to include a choice of metric, rather than simply using a least squares approach. Such generalizations are discussed further in Section 14.2.2. This model has no random element, so there is no idea of expectation or variance. Hence it corresponds to Pearson’s geometric view of PCA, rather than to Hotelling’s variance-based approach.

Caussinus’s (1986) second type of model introduces probability distributions and corresponds to Hotelling’s definition. Once again, the ‘model’ can be generalized by allowing a choice of metric.

The third type of model described by Caussinus is the so-called fixed effects model (see also Esposito (1998)). In this model we assume that the rows x_1, x_2, ..., x_n of X are independent random variables, such that E(x_i) = z_i, where z_i lies in a q-dimensional subspace, F_q.
Furthermore, if e_i = x_i − z_i, then E(e_i) = 0 and var(e_i) = (σ²/w_i) Γ, where Γ is a positive definite symmetric matrix and the w_i are positive scalars whose sum is 1. Both Γ and the w_i are assumed to be known, but σ², the z_i and the subspace F_q all need to be estimated. This is done by minimizing

$$\sum_{i=1}^{n} w_i \,\|x_i - z_i\|_M^2 , \qquad (3.9.1)$$
where M denotes a metric (see Section 14.2.2) and may be related to Γ. This statement of the model generalizes the usual form of PCA, for which w_i = 1/n, i = 1, 2, ..., n, and M = I_p, to allow different weights on the observations and a choice of metric. When M = Γ⁻¹, and the distribution of the x_i is multivariate normal, the estimates obtained by minimizing (3.9.1) are maximum likelihood estimates (Besse, 1994b). An interesting aspect of the fixed effects model is that it moves away from the idea of a sample of identically distributed observations whose covariance or correlation structure is to be explored, to a formulation in which the variation among the means of the observations is the feature of interest.

Tipping and Bishop (1999a) describe a model in which column-centred observations x_i are independent normally distributed random variables with zero means and covariance matrix BB′ + σ²I_p, where B is a (p × q) matrix. We shall see in Chapter 7 that this is a special case of a factor analysis model. The fixed effects model also has links to factor analysis and, indeed, de Leeuw (1986) suggests in discussion of Caussinus (1986) that the model is closer to factor analysis than to PCA. Similar models date back to Young (1941).

Tipping and Bishop (1999a) show that, apart from a renormalization of columns, and the possibility of rotation, the maximum likelihood estimate of B is the matrix A_q of PC coefficients defined earlier (see also de Leeuw (1986)). The MLE for σ² is the average of the smallest (p − q) eigenvalues of the sample covariance matrix S. Tipping and Bishop (1999a) fit their model using the EM algorithm (Dempster et al. (1977)), treating the unknown underlying components as ‘missing values.’ Clearly, the complication of the EM algorithm is not necessary once we realise that we are dealing with PCA, but it has advantages when the model is extended to cope with genuinely missing data or with mixtures of distributions (see Sections 13.6, 9.2.3). Bishop (1999) describes a Bayesian treatment of Tipping and Bishop’s (1999a) model. The main objective in introducing a prior distribution for B appears to be as a means of deciding on its dimension q (see Section 6.1.5).

Roweis (1997) also uses the EM algorithm to fit a model for PCA. His model is more general than Tipping and Bishop’s, with the error covariance matrix allowed to take any form, rather than being restricted to σ²I_p. In this respect it is more similar to the fixed effects model with equal weights, but differs from it by not specifying different means for different observations. Roweis (1997) notes that a full PCA, with all p PCs, is obtained from his model in the special case where the covariance matrix is σ²I_p and σ² → 0. He refers to the analysis based on Tipping and Bishop’s (1999a) model with σ² > 0 as sensible principal component analysis.

Martin (1988) considers another type of probability-based PCA, in which each of the n observations has a probability distribution in p-dimensional space centred on it, rather than being represented by a single point.
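Both results above lend themselves to a short numerical check. The following NumPy sketch (the dimensions, seed and variable names are illustrative choices of mine, not from the text) simulates data from Tipping and Bishop's model, forms the closed-form MLEs from the eigendecomposition of the sample covariance matrix S, and then runs a Roweis-style EM iteration that converges to the same principal subspace without ever computing an eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, n = 5, 2, 20000  # illustrative dimensions, not from the text

# Simulate column-centred data from the model x_i ~ N(0, B B' + sigma^2 I_p).
B_true = rng.standard_normal((p, q))
sigma2_true = 0.1
X = (rng.standard_normal((n, q)) @ B_true.T
     + np.sqrt(sigma2_true) * rng.standard_normal((n, p)))
X -= X.mean(axis=0)

# Sample covariance matrix S and its eigendecomposition, sorted descending.
S = X.T @ X / (n - 1)
lam, A = np.linalg.eigh(S)            # eigh returns ascending order
lam, A = lam[::-1], A[:, ::-1]

# MLE of sigma^2: the average of the smallest p - q eigenvalues of S.
sigma2_hat = lam[q:].mean()

# MLE of B, up to column scaling and rotation: A_q (Lambda_q - sigma^2 I)^(1/2),
# where A_q holds the first q PC coefficient vectors.
A_q = A[:, :q]
B_hat = A_q * np.sqrt(lam[:q] - sigma2_hat)

# Roweis-style EM for PCA (the sigma^2 -> 0 limit): alternate between
# estimating the latent components Y and the loadings C.  The fixed point
# of this iteration spans the q-dimensional principal subspace of S.
C = rng.standard_normal((p, q))
for _ in range(100):
    Y = np.linalg.solve(C.T @ C, C.T @ X.T)   # E-step: components (q x n)
    C = X.T @ Y.T @ np.linalg.inv(Y @ Y.T)    # M-step: loadings   (p x q)

# Projectors onto the two estimated subspaces; they should coincide.
P_eig = A_q @ A_q.T
P_em = C @ np.linalg.solve(C.T @ C, C.T)

print("sigma2_hat:", sigma2_hat)
```

The EM route only ever solves q × q linear systems, which hints at why, as noted above, the algorithm becomes attractive once the model is extended to missing data or mixtures, where no closed-form eigendecomposition answer is available.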