Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)
14.2. Weights, Metrics, Transformations and Centerings

Standardization, in the sense of dividing each column of the data matrix by its standard deviation, leads to PCA based on the correlation matrix, and its pros and cons are discussed in Sections 2.3 and 3.3. This can be thought of as a version of weighted PCA (Section 14.2.1). So, also, can dividing each column by its range or its mean (Gower, 1966), in the latter case giving a matrix of coefficients of variation. Underhill (1990) suggests a biplot based on this matrix (see Section 5.3.2). Such plots are only relevant when variables are non-negative, as with species abundance data.

Principal components are linear functions of x whose coefficients are given by the eigenvectors of a covariance or correlation matrix or, equivalently, the eigenvectors of a matrix X′X. Here X is an (n × p) matrix whose (i, j)th element is the value for the ith observation of the jth variable, measured about the mean for that variable. Thus, the columns of X have been centred, so that the sum of each column is zero, though Holmes-Junca (1985) notes that centering by either medians or modes has been suggested as an alternative to centering by means.

Two alternatives to ‘column-centering’ are:

(i) the columns of X are left uncentred, that is x_ij is now the value for the ith observation of the jth variable, as originally measured;

(ii) both rows and columns of X are centred, so that sums of rows, as well as sums of columns, are zero.

In either (i) or (ii) the analysis now proceeds by looking at linear functions of x whose coefficients are the eigenvectors of X′X, with X now non-centred or doubly centred. Of course, these linear functions no longer maximize variance, and so are not PCs according to the usual definition, but it is convenient to refer to them as non-centred and doubly centred PCs, respectively.

Non-centred PCA is a fairly well-established technique in ecology (Ter Braak, 1983). It has also been used in chemistry (Jackson, 1991, Section 3.4; Cochran and Horne, 1977) and geology (Reyment and Jöreskog, 1993). As noted by Ter Braak (1983), the technique projects observations onto the best fitting plane (or flat) through the origin, rather than through the centroid of the data set. If the data are such that the origin is an important point of reference, then this type of analysis can be relevant. However, if the centre of the observations is a long way from the origin, then the first ‘PC’ will dominate the analysis, and will simply reflect the position of the centroid.

For data that consist of counts of a number of biological species (the variables) at various sites (the observations), Ter Braak (1983) claims that non-centred PCA is better than standard (centred) PCA at simultaneously representing within-site diversity and between-site diversity of species (see also Digby and Kempton (1987, Section 3.5.5)). Centred PCA is better at representing between-site species diversity than non-centred PCA, but it is more difficult to deduce within-site diversity from a centred PCA.
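The remark that the first non-centred ‘PC’ simply reflects the position of the centroid when the data lie far from the origin can be checked numerically. The following is a minimal sketch only, assuming NumPy is available; the simulated data, the centroid values and the helper name `leading_direction` are illustrative assumptions and are not taken from the text.

```python
import numpy as np

def leading_direction(X):
    """Return the unit eigenvector of X'X with the largest eigenvalue,
    and that eigenvalue's share of the sum of all eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigenvalues in ascending order
    return eigvecs[:, -1], eigvals[-1] / eigvals.sum()

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations on 3 variables whose centroid
# is a long way from the origin relative to the spread of the data.
centroid = np.array([50.0, 30.0, 20.0])
raw = centroid + rng.normal(scale=[1.0, 2.0, 0.5], size=(200, 3))

# Alternative (i): eigenvectors of X'X with X left uncentred.
v_unc, share_unc = leading_direction(raw)

# Usual PCA: eigenvectors of X'X with the columns of X centred.
v_cen, share_cen = leading_direction(raw - raw.mean(axis=0))

# Compare each leading direction with the direction of the centroid
# (absolute cosine, since the sign of an eigenvector is arbitrary).
mean_dir = raw.mean(axis=0) / np.linalg.norm(raw.mean(axis=0))
cos_unc = abs(v_unc @ mean_dir)
cos_cen = abs(v_cen @ mean_dir)

print(f"non-centred: |cos| with centroid = {cos_unc:.4f}, share = {share_unc:.4f}")
print(f"centred:     |cos| with centroid = {cos_cen:.4f}, share = {share_cen:.4f}")
```

With data of this kind, the non-centred leading direction is almost collinear with the centroid and accounts for nearly all of the total sum of squares, whereas the centred leading direction follows the variable with the largest variance.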
Reyment and Jöreskog (1993, Section 8.7) discuss an application of the method (which they refer to as Imbrie’s Q-mode method) in a similar context concerning the abundance of various marine micro-organisms in cores taken at a number of sites on the seabed. The same authors also suggest that this type of analysis is relevant for data where the p variables are amounts of p chemical constituents in n soil or rock samples. If the degree to which two samples have the same proportions of each constituent is considered to be an important index of similarity between samples, then the similarity measure implied by non-centred PCA is appropriate (Reyment and Jöreskog, 1993, Section 5.4). An alternative approach if proportions are of interest is to reduce the data to compositional form (see Section 13.3).

The technique of empirical orthogonal teleconnections (van den Dool et al., 2000), described in Section 11.2.3, operates on uncentred data. Here matters are confused by referring to uncentred sums of squares and cross-products as ‘variances’ and ‘correlations.’ Devijver and Kittler (1982, Section 9.3) use similar misleading terminology in a population derivation and discussion of uncentred PCA.

Doubly centred PCA was proposed by Buckland and Anderson (1985) as another method of analysis for data that consist of species counts at various sites. They argue that centred PCA of such data may be dominated by a ‘size’ component, which measures the relative abundance of the various species. It is possible to simply ignore the first PC, and concentrate on later PCs, but an alternative is provided by double centering, which ‘removes’ the ‘size’ PC. The same idea has been suggested in the analysis of size/shape data (see Section 13.2). Double centering introduces a component with zero eigenvalue, because the constraint x_i1 + x_i2 + ... + x_ip = 0 now holds for all i. A further alternative for removing the ‘size’ effect of different abundances of different species is, for some such data sets, to record only whether a species is present or absent at each site, rather than the actual counts for each species.

In fact, what is being done in double centering is the same as Mandel’s (1971, 1972) approach to data in a two-way analysis of variance (see Section 13.4). It removes main effects due to rows/observations/sites, and due to columns/variables/species, and concentrates on the interaction between species and sites. In the regression context, Hoerl et al. (1985) suggest that double centering can remove ‘non-essential ill-conditioning,’ which is caused by the presence of a row (observation) effect in the original data.

Kazmierczak (1985) advocates a logarithmic transformation of data, followed by double centering. This gives a procedure that is invariant to pre- and post-multiplication of the data matrix by diagonal matrices. Hence it is invariant to different weightings of observations and to different scalings of the variables.

One reason for the suggestion of both non-centred and doubly-centred PCA for counts of species at various sites is perhaps that it is not entirely
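Two properties of double centering described above, the zero eigenvalue and the invariance obtained when a logarithmic transformation is applied first, can be illustrated with a short numerical sketch. It assumes NumPy, a hypothetical table of strictly positive values, and diagonal scaling matrices with positive entries (so that logarithms exist); the function name `double_centre` and all data are illustrative assumptions rather than anything given in the text.

```python
import numpy as np

def double_centre(X):
    """Subtract row means and column means, then add back the grand mean,
    so that every row sum and every column sum of the result is zero."""
    return (X
            - X.mean(axis=1, keepdims=True)
            - X.mean(axis=0, keepdims=True)
            + X.mean())

rng = np.random.default_rng(1)

# Hypothetical positive 'abundance' table: 6 sites (rows) by 4 species (columns).
counts = rng.uniform(1.0, 50.0, size=(6, 4))

# Double centering forces x_i1 + ... + x_ip = 0 for every row i, so the
# vector of ones is an eigenvector of Z'Z with eigenvalue zero.
Z = double_centre(counts)
eigvals = np.linalg.eigvalsh(Z.T @ Z)  # ascending order
smallest = eigvals[0]

# Log transformation followed by double centering: multiplying rows and
# columns by arbitrary positive factors (pre- and post-multiplication by
# positive diagonal matrices) adds row and column effects on the log scale,
# and these are removed by the double centering.
row_scale = rng.uniform(0.5, 2.0, size=(6, 1))
col_scale = rng.uniform(0.5, 2.0, size=(1, 4))
Z_log = double_centre(np.log(counts))
Z_log_scaled = double_centre(np.log(row_scale * counts * col_scale))

print("largest |row sum|:   ", np.abs(Z.sum(axis=1)).max())
print("smallest eigenvalue: ", smallest)
print("invariant to scaling:", np.allclose(Z_log, Z_log_scaled))
```

The same `double_centre` step corresponds to removing row and column main effects in a two-way layout, leaving only the interaction terms.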