Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)



14.2. Weights, Metrics, Transformations and Centerings

Standardization, in the sense of dividing each column of the data matrix by its standard deviation, leads to PCA based on the correlation matrix, and its pros and cons are discussed in Sections 2.3 and 3.3. This can be thought of as a version of weighted PCA (Section 14.2.1). So, also, can dividing each column by its range or its mean (Gower, 1966), in the latter case giving a matrix of coefficients of variation. Underhill (1990) suggests a biplot based on this matrix (see Section 5.3.2). Such plots are only relevant when variables are non-negative, as with species abundance data.

Principal components are linear functions of $x$ whose coefficients are given by the eigenvectors of a covariance or correlation matrix or, equivalently, the eigenvectors of a matrix $X'X$. Here $X$ is an $(n \times p)$ matrix whose $(i, j)$th element is the value for the $i$th observation of the $j$th variable, measured about the mean for that variable. Thus, the columns of $X$ have been centred, so that the sum of each column is zero, though Holmes-Junca (1985) notes that centering by either medians or modes has been suggested as an alternative to centering by means.

Two alternatives to ‘column-centering’ are:

(i) the columns of $X$ are left uncentred, that is $x_{ij}$ is now the value for the $i$th observation of the $j$th variable, as originally measured;

(ii) both rows and columns of $X$ are centred, so that sums of rows, as well as sums of columns, are zero.

In either (i) or (ii) the analysis now proceeds by looking at linear functions of $x$ whose coefficients are the eigenvectors of $X'X$, with $X$ now non-centred or doubly centred. Of course, these linear functions no longer maximize variance, and so are not PCs according to the usual definition, but it is convenient to refer to them as non-centred and doubly centred PCs, respectively.

Non-centred PCA is a fairly well-established technique in ecology (Ter Braak, 1983). It has also been used in chemistry (Jackson, 1991, Section 3.4; Cochran and Horne, 1977) and geology (Reyment and Jöreskog, 1993). As noted by Ter Braak (1983), the technique projects observations onto the best fitting plane (or flat) through the origin, rather than through the centroid of the data set. If the data are such that the origin is an important point of reference, then this type of analysis can be relevant. However, if the centre of the observations is a long way from the origin, then the first ‘PC’ will dominate the analysis, and will simply reflect the position of the centroid.

For data that consist of counts of a number of biological species (the variables) at various sites (the observations), Ter Braak (1983) claims that non-centred PCA is better than standard (centred) PCA at simultaneously representing within-site diversity and between-site diversity of species (see also Digby and Kempton (1987, Section 3.5.5)). Centred PCA is better at representing between-site species diversity than non-centred PCA, but it is more difficult to deduce within-site diversity from a centred PCA.
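The remark that the first non-centred ‘PC’ simply reflects the position of the centroid can be checked numerically. The following is a minimal sketch in Python with NumPy, not a procedure taken from any of the cited authors: the data are simulated, and the names `X_raw`, `eig_of_cross_products` and `alignment` are hypothetical choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration): n = 50 observations on
# p = 4 variables whose centroid lies far from the origin.
n, p = 50, 4
X_raw = 10.0 + rng.normal(size=(n, p))


def eig_of_cross_products(X):
    """Eigenvalues (largest first) and eigenvectors (as columns) of X'X."""
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


# Usual PCA: columns measured about their means, so each column sums to zero.
X_centred = X_raw - X_raw.mean(axis=0)
vals_centred, vecs_centred = eig_of_cross_products(X_centred)

# Alternative (i): columns left as originally measured.
vals_uncentred, vecs_uncentred = eig_of_cross_products(X_raw)

# Unit vector pointing from the origin towards the centroid.
centroid = X_raw.mean(axis=0)
centroid_direction = centroid / np.linalg.norm(centroid)

# Absolute cosine between the first non-centred 'PC' and that direction.
alignment = abs(vecs_uncentred[:, 0] @ centroid_direction)

# Share of the total sum of squares taken by the first component.
share_uncentred = vals_uncentred[0] / vals_uncentred.sum()
share_centred = vals_centred[0] / vals_centred.sum()

print(f"alignment with centroid direction: {alignment:.4f}")
print(f"first 'PC' share, non-centred:     {share_uncentred:.3f}")
print(f"first PC share, centred:           {share_centred:.3f}")
```

With a centroid this far from the origin, the alignment and the non-centred share should both be close to 1, whereas the centred analysis of the same data shows no such dominant component.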

14. Generalizations and Adaptations of Principal Component Analysis

Reyment and Jöreskog (1993, Section 8.7) discuss an application of the method (which they refer to as Imbrie’s Q-mode method) in a similar context concerning the abundance of various marine micro-organisms in cores taken at a number of sites on the seabed. The same authors also suggest that this type of analysis is relevant for data where the $p$ variables are amounts of $p$ chemical constituents in $n$ soil or rock samples. If the degree to which two samples have the same proportions of each constituent is considered to be an important index of similarity between samples, then the similarity measure implied by non-centred PCA is appropriate (Reyment and Jöreskog, 1993, Section 5.4). An alternative approach if proportions are of interest is to reduce the data to compositional form (see Section 13.3).

The technique of empirical orthogonal teleconnections (van den Dool et al., 2000), described in Section 11.2.3, operates on uncentred data. Here matters are confused by referring to uncentred sums of squares and cross-products as ‘variances’ and ‘correlations.’ Devijver and Kittler (1982, Section 9.3) use similar misleading terminology in a population derivation and discussion of uncentred PCA.

Doubly centred PCA was proposed by Buckland and Anderson (1985) as another method of analysis for data that consist of species counts at various sites. They argue that centred PCA of such data may be dominated by a ‘size’ component, which measures the relative abundance of the various species. It is possible to simply ignore the first PC, and concentrate on later PCs, but an alternative is provided by double centering, which ‘removes’ the ‘size’ PC. The same idea has been suggested in the analysis of size/shape data (see Section 13.2).
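Double centering of a sites-by-species table can be sketched as follows, again in Python with NumPy. This is an illustration under stated assumptions rather than Buckland and Anderson’s own computation: the counts are simulated from a Poisson model, and `double_centre`, `site_effect` and `species_abundance` are hypothetical names introduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated counts (an assumption for illustration): 30 sites (rows) by
# 5 species (columns), with species of very different overall abundance.
site_effect = rng.uniform(0.5, 2.0, size=(30, 1))
species_abundance = np.array([50.0, 20.0, 10.0, 5.0, 2.0])
counts = rng.poisson(site_effect * species_abundance).astype(float)


def double_centre(X):
    """Subtract row means and column means, then add back the grand mean."""
    row_means = X.mean(axis=1, keepdims=True)
    col_means = X.mean(axis=0, keepdims=True)
    return X - row_means - col_means + X.mean()


X_dbl = double_centre(counts)

# Sums of rows, as well as sums of columns, are zero after double centering.
row_sums = X_dbl.sum(axis=1)
col_sums = X_dbl.sum(axis=0)

# Eigenvalues of X'X for the doubly centred matrix, in ascending order.
S = X_dbl.T @ X_dbl
eigvals = np.linalg.eigvalsh(S)
smallest = eigvals[0]

# The equal-weight combination of the variables carries no variation.
ones = np.ones(S.shape[0])
residual = np.abs(S @ ones).max()

print("largest |row sum|:   ", np.abs(row_sums).max())
print("largest |column sum|:", np.abs(col_sums).max())
print("smallest eigenvalue: ", smallest)
print("max |S @ ones|:      ", residual)
```

All four printed quantities should be zero up to rounding error, which shows that the equal-weight direction has been removed from the analysis.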
Double centering introduces a component with zero eigenvalue, because the constraint $x_{i1} + x_{i2} + \cdots + x_{ip} = 0$ now holds for all $i$. A further alternative for removing the ‘size’ effect of different abundances of different species is, for some such data sets, to record only whether a species is present or absent at each site, rather than the actual counts for each species.

In fact, what is being done in double centering is the same as Mandel’s (1971, 1972) approach to data in a two-way analysis of variance (see Section 13.4). It removes main effects due to rows/observations/sites, and due to columns/variables/species, and concentrates on the interaction between species and sites. In the regression context, Hoerl et al. (1985) suggest that double centering can remove ‘non-essential ill-conditioning,’ which is caused by the presence of a row (observation) effect in the original data. Kazmierczak (1985) advocates a logarithmic transformation of data, followed by double centering. This gives a procedure that is invariant to pre- and post-multiplication of the data matrix by diagonal matrices. Hence it is invariant to different weightings of observations and to different scalings of the variables.

One reason for the suggestion of both non-centred and doubly-centred PCA for counts of species at various sites is perhaps that it is not entirely

