Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)


cda.psych.uiuc.edu

13.8. Some Other Types of Data

[...] discuss two adaptations of PCA for such data. In the first, called the VERTICES method, the $i$th row of the $(n \times p)$ data matrix is replaced by the $2^p$ distinct rows whose elements have either $\underline{x}_{ij}$ or $\overline{x}_{ij}$ in their $j$th column. A PCA is then done on the resulting $(n2^p \times p)$ matrix. The value or score of a PC from this analysis can be calculated for each of the $n2^p$ rows of the new data matrix. For the $i$th observation there are $2^p$ such scores and an interval can be constructed for the observation, bounded by the smallest and largest of these scores. In plotting the observations, either with respect to the original variables or with respect to PCs, each observation is represented by a rectangle or hyperrectangle in two or higher-dimensional space. The boundaries of the (hyper)rectangle are determined by the intervals for the variables or PC scores. Chouakria et al. (2000) examine a number of indices measuring the quality of representation of an interval data set by a small number of ‘interval PCs’ and the contributions of each observation to individual PCs.

For large values of $p$, the VERTICES method produces very large matrices. As an alternative, Chouakria et al. suggest the CENTERS procedure, in which a PCA is done on the $(n \times p)$ matrix whose $(i, j)$th element is $(\underline{x}_{ij} + \overline{x}_{ij})/2$. The immediate results give a single score for each observation on each PC, but Chouakria and coworkers use the intervals of possible values for the variables to construct intervals for the PC scores. This is done by finding the combinations of allowable values for the variables, which, when inserted in the expression for a PC in terms of the variables, give the maximum and minimum scores for the PC.
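
The two procedures can be sketched in NumPy as follows. This is a minimal illustration, not the implementation of Chouakria et al.: it assumes a covariance-based PCA computed by SVD of the column-centred matrix, and the names `pca_axes`, `vertices_intervals`, `centers_intervals`, `lower` and `upper` are hypothetical. For CENTERS it relies on the fact that a PC score is linear in the variables, so its extremes over a hyperrectangle are attained by taking the upper bound where the coefficient is positive and the lower bound where it is negative.

```python
import itertools
import numpy as np

def pca_axes(X):
    """Column means and PC coefficient vectors (as columns) of X.

    Assumption: covariance-based PCA via SVD of the centred matrix.
    """
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt.T

def vertices_intervals(lower, upper):
    """VERTICES: PCA on the (n * 2^p) x p matrix of hyperrectangle corners."""
    n, p = lower.shape
    # Each row of `choices` selects lower (0) or upper (1) for every variable.
    choices = np.array(list(itertools.product([0, 1], repeat=p)))
    # corners[i, r, j] is upper[i, j] if choices[r, j] == 1, else lower[i, j].
    corners = np.where(choices[None, :, :] == 1,
                       upper[:, None, :], lower[:, None, :])
    stacked = corners.reshape(n * 2**p, p)
    mean, A = pca_axes(stacked)
    scores = ((stacked - mean) @ A).reshape(n, 2**p, -1)
    # Interval for observation i on each PC: smallest and largest corner score.
    return scores.min(axis=1), scores.max(axis=1)

def centers_intervals(lower, upper):
    """CENTERS: PCA on interval midpoints, then score intervals per PC."""
    mid = (lower + upper) / 2
    half = (upper - lower) / 2
    mean, A = pca_axes(mid)
    centre_scores = (mid - mean) @ A
    # Largest deviation of a linear score over the box: sum_j |a_jk| * half_ij.
    spread = half @ np.abs(A)
    return centre_scores - spread, centre_scores + spread
```

VERTICES needs a matrix with $n2^p$ rows, so it is only practical for small $p$, whereas the CENTERS sketch works with $(n \times p)$ arrays throughout.
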
An example is given to compare the VERTICES and CENTERS approaches.

Ichino and Yaguchi (1994) describe a generalization of PCA that can be used on a wide variety of data types, including discrete variables in which a measurement is a subset of more than one of the possible values for a variable; continuous variables recorded as intervals are also included. To carry out PCA, the measurement on each variable is converted to a single value. This is done by first calculating a ‘distance’ between any two observations on each variable, constructed from a formula that involves the union and intersection of the values of the variable taken by the two observations. From these distances a ‘reference event’ is found, defined as the observation whose sum of distances from all other observations is minimized, where distance here refers to the sum of ‘distances’ for each of the $p$ variables. The coordinate of each observation for a particular variable is then taken as the distance on that variable from the reference event, with a suitably assigned sign. The coordinates of the $n$ observations on the $p$ variables thus defined form a data set, which is then subjected to PCA.

Species Abundance Data

These data are common in ecology; an example was given in Section 5.4.1. When the study area has diverse habitats and many species are included, there may be a large number of zeros in the data. If two variables $x_j$ and $x_k$ simultaneously record zero for a non-trivial number of sites, the calculation of covariance or correlation between this pair of variables is likely to be distorted. Legendre and Legendre (1983, p. 285) argue that data are better analysed by nonmetric multidimensional scaling (Cox and Cox, 2001) or with correspondence analysis (as in Section 5.4.1), rather than by PCA, when there are many such ‘double zeros’ present. Even when such zeros are not a problem, species abundance data often have highly skewed distributions, and a transformation, for example taking logarithms, may be advisable before PCA is contemplated.

Another unique aspect of species abundance data is an interest in the diversity of species at the various sites. It has been argued that to examine diversity, it is more appropriate to use uncentred than column-centred PCA. This is discussed further in Section 14.2.3, together with doubly centred PCA, which has also found applications to species abundance data.

Large Data Sets

The problems of large data sets are different depending on whether the number of observations $n$ or the number of variables $p$ is large, with the latter typically causing greater difficulties than the former. With large $n$ there may be problems in viewing graphs because of superimposed observations, but it is the size of the covariance or correlation matrix that usually determines computational limitations. However, if $p > n$ it should be remembered (Property G4 of Section 3.2) that the eigenvectors of $\mathbf{X}'\mathbf{X}$ corresponding to non-zero eigenvalues can be found from those of the smaller matrix $\mathbf{X}\mathbf{X}'$.

For very large values of $p$, Preisendorfer and Mobley (1988, Chapter 11) suggest splitting the variables into subsets of manageable size, performing PCA on each subset, and then using the separate eigenanalyses to approximate the eigenstructure of the original large data matrix. Developments in computer architecture may soon allow very large problems to be tackled much faster using neural network algorithms for PCA (see Appendix A1 and Diamantaras and Kung (1996, Chapter 8)).
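
The remark about $p > n$ can be illustrated with a short NumPy sketch. It rests on the standard identity that if $\mathbf{X}\mathbf{X}'\mathbf{u} = \lambda\mathbf{u}$ with $\lambda > 0$, then $\mathbf{X}'\mathbf{u}/\sqrt{\lambda}$ is a unit-length eigenvector of $\mathbf{X}'\mathbf{X}$ with the same eigenvalue. The function name `pcs_via_small_gram`, the tolerance `tol`, and the choice to column-centre the data inside the function are assumptions made for illustration only.

```python
import numpy as np

def pcs_via_small_gram(X, tol=1e-10):
    """Eigenvalues and eigenvectors of Xc'Xc obtained from the n x n matrix Xc Xc'.

    Assumption: Xc is the column-centred version of the (n x p) matrix X.
    """
    Xc = X - X.mean(axis=0)
    G = Xc @ Xc.T                        # n x n, cheap when p >> n
    evals, U = np.linalg.eigh(G)         # ascending order for symmetric G
    order = np.argsort(evals)[::-1]      # reorder to descending
    evals, U = evals[order], U[:, order]
    keep = evals > tol * evals[0]        # discard numerically zero eigenvalues
    evals, U = evals[keep], U[:, keep]
    # Map each eigenvector u of Xc Xc' to Xc'u / sqrt(lambda).
    V = Xc.T @ U / np.sqrt(evals)
    return evals, V
```

With $n = 5$ and $p = 40$, for instance, this requires the eigendecomposition of a $5 \times 5$ matrix rather than a $40 \times 40$ one; dividing the returned eigenvalues by $n - 1$ gives the PC variances.
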

