Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)
13.8. Some Other Types of Data

discuss two adaptations of PCA for such data. In the first, called the VERTICES method, the ith row of the (n × p) data matrix is replaced by the 2^p distinct rows whose jth element is either the lower limit x_ij or the upper limit x̄_ij of the interval recorded for the jth variable. A PCA is then done on the resulting (n2^p × p) matrix. The value or score of a PC from this analysis can be calculated for each of the n2^p rows of the new data matrix. For the ith observation there are 2^p such scores, and an interval can be constructed for the observation, bounded by the smallest and largest of these scores. In plotting the observations, either with respect to the original variables or with respect to PCs, each observation is represented by a rectangle or hyperrectangle in two- or higher-dimensional space. The boundaries of the (hyper)rectangle are determined by the intervals for the variables or PC scores. Chouakria et al. (2000) examine a number of indices measuring the quality of representation of an interval data set by a small number of 'interval PCs', and the contributions of each observation to individual PCs.

For large values of p, the VERTICES method produces very large matrices. As an alternative, Chouakria et al. suggest the CENTERS procedure, in which a PCA is done on the (n × p) matrix whose (i, j)th element is the interval midpoint (x_ij + x̄_ij)/2. The immediate results give a single score for each observation on each PC, but Chouakria and coworkers use the intervals of possible values for the variables to construct intervals for the PC scores. This is done by finding the combinations of allowable values for the variables which, when inserted in the expression for a PC in terms of the variables, give the maximum and minimum scores for the PC.
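The two constructions just described can be sketched in code. This is only an illustrative reading of the VERTICES and CENTERS ideas, not the implementation of Chouakria et al. (2000); the function names, the choice of SVD for the PCA step, and the centring over all vertices are assumptions.

```python
# Sketch of the VERTICES and CENTERS constructions for interval data,
# where each measurement is an interval [lo[i, j], up[i, j]].
import itertools
import numpy as np

def vertices_scores(lo, up, n_components=2):
    """PCA on all 2^p corner points of each observation's hyperrectangle;
    returns per-observation [min, max] intervals on each PC."""
    n, p = lo.shape
    corners = np.array(list(itertools.product([0, 1], repeat=p)))   # (2^p, p)
    # Build the (n * 2^p, p) vertex matrix: pick lo or up in each coordinate.
    X = np.vstack([np.where(c, up[i], lo[i]) for i in range(n) for c in corners])
    Xc = X - X.mean(axis=0)                     # centring choice is an assumption
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T           # (n * 2^p, k) vertex scores
    scores = scores.reshape(n, 2 ** p, n_components)
    return scores.min(axis=1), scores.max(axis=1)

def centers_scores(lo, up, n_components=2):
    """PCA on the interval midpoints; the extreme PC scores follow from the
    signs of the loadings (upper bound where the loading is positive,
    lower bound where it is negative)."""
    mid = (lo + up) / 2
    midc = mid - mid.mean(axis=0)
    _, _, Vt = np.linalg.svd(midc, full_matrices=False)
    V = Vt[:n_components].T                     # (p, k) loadings
    half = (up - lo) / 2                        # interval half-widths
    centre = midc @ V
    radius = half @ np.abs(V)                   # max deviation attainable per PC
    return centre - radius, centre + radius
```

The CENTERS interval comes directly from the observation in the text: a PC score is a linear combination of the variables, so its extreme values over the hyperrectangle are attained at the bound matching the sign of each loading.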
An example is given to compare the VERTICES and CENTERS approaches.

Ichino and Yaguchi (1994) describe a generalization of PCA that can be used on a wide variety of data types, including discrete variables in which a measurement is a subset of more than one of the possible values for a variable; continuous variables recorded as intervals are also included. To carry out PCA, the measurement on each variable is converted to a single value. This is done by first calculating a 'distance' between any two observations on each variable, constructed from a formula that involves the union and intersection of the values of the variable taken by the two observations. From these distances a 'reference event' is found, defined as the observation whose sum of distances from all other observations is minimized, where distance here refers to the sum of 'distances' over the p variables. The coordinate of each observation for a particular variable is then taken as the distance on that variable from the reference event, with a suitably assigned sign. The coordinates of the n observations on the p variables thus defined form a data set, which is then subjected to PCA.

Species Abundance Data

These data are common in ecology; an example was given in Section 5.4.1. When the study area has diverse habitats and many species are included,
there may be a large number of zeros in the data. If two variables x_j and x_k simultaneously record zero for a non-trivial number of sites, the calculation of covariance or correlation between this pair of variables is likely to be distorted. Legendre and Legendre (1983, p. 285) argue that when many such 'double zeros' are present, the data are better analysed by nonmetric multidimensional scaling (Cox and Cox, 2001) or by correspondence analysis (as in Section 5.4.1) than by PCA. Even when such zeros are not a problem, species abundance data often have highly skewed distributions, and a transformation (for example, taking logarithms) may be advisable before PCA is contemplated.

Another unique aspect of species abundance data is an interest in the diversity of species at the various sites. It has been argued that to examine diversity, it is more appropriate to use uncentred rather than column-centred PCA. This is discussed further in Section 14.2.3, together with doubly centred PCA, which has also found applications to species abundance data.

Large Data Sets

The problems of large data sets differ depending on whether the number of observations n or the number of variables p is large, with the latter typically causing greater difficulties than the former. With large n there may be problems in viewing graphs because of superimposed observations, but it is usually the size of the covariance or correlation matrix that determines computational limitations. However, if p > n it should be remembered (Property G4 of Section 3.2) that the eigenvectors of X'X corresponding to non-zero eigenvalues can be found from those of the smaller matrix XX'.

For very large values of p, Preisendorfer and Mobley (1988, Chapter 11) suggest splitting the variables into subsets of manageable size, performing PCA on each subset, and then using the separate eigenanalyses to approximate the eigenstructure of the original large data matrix.
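The p > n shortcut of Property G4 is easy to verify numerically. If u is an eigenvector of the small n × n matrix XX' with eigenvalue l > 0, then X'X(X'u) = X'(XX'u) = l(X'u), so X'u/√l is a unit eigenvector of the large p × p matrix X'X with the same eigenvalue. A minimal check, with arbitrary simulated data:

```python
# Numerical illustration of Property G4: eigenvectors of the large p x p
# matrix X'X (non-zero eigenvalues only) recovered from the small n x n
# matrix XX'.
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 200                         # many more variables than observations
X = rng.standard_normal((n, p))

# Eigen-decompose the small n x n matrix and sort eigenvalues descending.
evals, U = np.linalg.eigh(X @ X.T)
order = np.argsort(evals)[::-1]
evals, U = evals[order], U[:, order]

# X'u has squared norm u'XX'u = l, so dividing by sqrt(l) gives unit vectors.
V = X.T @ U / np.sqrt(evals)          # (p, n): eigenvectors of X'X

# Check the eigenvector equation against the direct p x p product.
for k in range(n):
    resid = (X.T @ X) @ V[:, k] - evals[k] * V[:, k]
    assert np.allclose(resid, 0, atol=1e-8)
```

The expensive object here would be the 200 × 200 matrix X'X; the decomposition actually performed is only 5 × 5.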
Developments in computer architecture may soon allow very large problems to be tackled much faster using neural network algorithms for PCA (see Appendix A1 and Diamantaras and Kung (1996, Chapter 8)).
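The subset-splitting suggestion of Preisendorfer and Mobley above might be sketched, loosely, as a two-stage PCA: an inner PCA within each block of variables, keeping a few components per block, followed by a second PCA on the retained scores. This is only a rough reading of the idea; the block size, the number of components kept per block, and the second-stage combination are all assumptions, not their prescription.

```python
# Loose two-stage sketch of PCA by subsets for very large p.
import numpy as np

def two_stage_pca(X, block_size=50, keep_per_block=5, n_components=3):
    """Approximate leading PC scores by PCA within variable blocks,
    then PCA on the pooled retained scores."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    stage1 = []
    for start in range(0, p, block_size):
        block = Xc[:, start:start + block_size]
        _, _, Vt = np.linalg.svd(block, full_matrices=False)
        k = min(keep_per_block, Vt.shape[0])
        stage1.append(block @ Vt[:k].T)      # retained scores for this block
    Z = np.hstack(stage1)                    # (n, n_blocks * keep_per_block)
    _, _, Vt = np.linalg.svd(Z - Z.mean(axis=0), full_matrices=False)
    return Z @ Vt[:n_components].T           # approximate leading PC scores
```

Each SVD here acts on a matrix no wider than block_size (or the pooled score matrix), so no decomposition of the full p × p covariance matrix is ever formed.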