Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)


cda.psych.uiuc.edu
12.07.2015

14.2. Weights, Metrics, Transformations and Centerings

clear which of ‘sites’ and ‘species’ should be treated as ‘variables’ and which as ‘observations.’ Another possibility is to centre with respect to sites, but not species, in other words, carrying out an analysis with sites rather than species as the variables. Buckland and Anderson (1985) analyse their data in this way.

Yet another technique which has been suggested for analysing some types of site-species data is correspondence analysis (see, for example, Section 5.4.1 and Gauch, 1982). As pointed out in Section 13.4, correspondence analysis has some similarity to Mandel’s approach, and hence to doubly centred PCA. In doubly centred PCA we analyse the residuals from an additive model for row and column (site and species) effects, whereas in correspondence analysis the residuals from a multiplicative (independence) model are considered.

Both uncentred and doubly centred PCA perform eigenanalyses on matrices whose elements are not covariances or correlations, but which can still be viewed as measures of similarity or association between pairs of variables. Another technique in the same vein is proposed by Elmore and Richman (2001). Their idea is to find ‘distances’ between variables which can then be converted into similarities and an eigenanalysis done on the resulting similarity matrix. Although Elmore and Richman (2001) note a number of possible distance measures, they concentrate on Euclidean distance, so that the distance $d_{jk}$ between variables $j$ and $k$ is

\[ d_{jk} = \left[ \sum_{i=1}^{n} (x_{ij} - x_{ik})^2 \right]^{1/2}. \]

If $D$ is the largest of the $p^2$ distances $d_{jk}$, the corresponding similarity matrix is defined to have elements

\[ s_{jk} = 1 - \frac{d_{jk}}{D}. \]

The procedure is referred to as PCA based on ES (Euclidean similarity). There is an apparent connection with principal coordinate analysis (Section 5.2) but for ES-based PCA it is distances between variables, rather than between observations, that are analysed.

The technique is only appropriate if the variables are all measured in the same units; it makes no sense to compute a distance between a vector of temperatures and a vector of heights, for example. Elmore and Richman (2001) report that the method does better at finding known ‘modes’ in a data set than PCA based on either a covariance or a correlation matrix. However, as with uncentred and doubly centred PCA, it is much less clear than it is for PCA what is optimized by the technique, and hence it is more difficult to know how to interpret its results.
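The construction of the Euclidean similarity matrix can be sketched in a few lines of Python. This is only an illustration of the two definitions above, not Elmore and Richman's own code: the function names `euclidean_similarity` and `es_pca` are hypothetical, the data are simulated, and it is assumed that the (n × p) data matrix holds variables in its columns, all measured in the same units.

```python
import numpy as np

def euclidean_similarity(X):
    """Return the p x p matrix with elements s_jk = 1 - d_jk / D.

    X is an (n x p) array whose columns are the variables (assumed layout).
    """
    X = np.asarray(X, dtype=float)
    # Differences between every pair of columns: shape (n, p, p).
    diff = X[:, :, None] - X[:, None, :]
    # d_jk = [sum_i (x_ij - x_ik)^2]^(1/2)
    d = np.sqrt((diff ** 2).sum(axis=0))
    D = d.max()  # largest of the p^2 distances
    if D == 0:
        raise ValueError("all variables are identical; similarities are undefined")
    return 1.0 - d / D

def es_pca(X):
    """Eigenanalysis of the Euclidean similarity matrix, largest eigenvalue first."""
    S = euclidean_similarity(X)
    eigvals, eigvecs = np.linalg.eigh(S)  # S is symmetric
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Simulated example: 50 observations on 4 variables in common units.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
S = euclidean_similarity(X)
eigvals, eigvecs = es_pca(X)
```

By construction the diagonal elements of the similarity matrix equal 1 and the most distant pair of variables has similarity 0, so the eigenvalues sum to p, as they do for a correlation matrix.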

14. Generalizations and Adaptations of Principal Component Analysis

14.3 Principal Components in the Presence of Secondary or Instrumental Variables

Rao (1964) describes two modifications of PCA that involve what he calls ‘instrumental variables.’ These are variables which are of secondary importance, but which may be useful in various ways in examining the variables that are of primary concern. The term ‘instrumental variable’ is in widespread use in econometrics, but in a rather more restricted context (see, for example, Darnell (1994, pp. 197–200)).

Suppose that $\mathbf{x}$ is, as usual, a $p$-element vector of primary variables, and that $\mathbf{w}$ is a vector of $s$ secondary, or instrumental, variables. Rao (1964) considers the following two problems, described respectively as ‘principal components of instrumental variables’ and ‘principal components ... uncorrelated with instrumental variables’:

(i) Find linear functions $\boldsymbol{\gamma}_1'\mathbf{w}, \boldsymbol{\gamma}_2'\mathbf{w}, \ldots$, of $\mathbf{w}$ that best predict $\mathbf{x}$.

(ii) Find linear functions $\boldsymbol{\alpha}_1'\mathbf{x}, \boldsymbol{\alpha}_2'\mathbf{x}, \ldots$ with maximum variances that, as well as being uncorrelated with each other, are also uncorrelated with $\mathbf{w}$.

For (i), Rao (1964) notes that $\mathbf{w}$ may contain some or all of the elements of $\mathbf{x}$, and gives two possible measures of predictive ability, corresponding to the trace and Euclidean norm criteria discussed with respect to Property A5 in Section 2.1. He also mentions the possibility of introducing weights into the analysis. The two criteria lead to different solutions to (i), one of which is more straightforward to derive than the other. There is a superficial resemblance between the current problem and that of canonical correlation analysis, where relationships between two sets of variables are also investigated (see Section 9.3), but the two situations are easily seen to be different. However, as noted in Sections 6.3 and 9.3.4, the methodology of Rao’s (1964) PCA of instrumental variables has reappeared under other names. In particular, it is equivalent to redundancy analysis (van den Wollenberg, 1977) and to one way of fitting a reduced rank regression model (Davies and Tso, 1982).

The same technique is derived by Esposito (1998). He projects the matrix $\mathbf{X}$ onto the space spanned by $\mathbf{W}$, where $\mathbf{X}$, $\mathbf{W}$ are data matrices associated with $\mathbf{x}$, $\mathbf{w}$, and then finds principal components of the projected data. This leads to an eigenequation

\[ \mathbf{S}_{XW}\mathbf{S}_{WW}^{-1}\mathbf{S}_{WX}\mathbf{a}_k = l_k\mathbf{a}_k, \]

which is the same as equation (9.3.5). Solving that equation leads to redundancy analysis. Kazi-Aoual et al. (1995) provide a permutation test, using the test statistic $\mathrm{tr}(\mathbf{S}_{WX}\mathbf{S}_{XX}^{-1}\mathbf{S}_{XW})$, to decide whether there is any relationship between the $\mathbf{x}$ and $\mathbf{w}$ variables.
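The projection argument can be checked numerically with a short Python sketch. The function names and the simulated data are hypothetical. The sketch assumes column-centred data matrices and sample covariances with divisor n − 1. The permutation scheme shown is a plain Monte Carlo one that shuffles the rows of W, which is an assumption made for illustration and not necessarily the procedure of Kazi-Aoual et al. (1995).

```python
import numpy as np

def cross_cov(A, B):
    """Sample cross-covariance matrix between the columns of A and B."""
    Ac = A - A.mean(axis=0)
    Bc = B - B.mean(axis=0)
    return Ac.T @ Bc / (A.shape[0] - 1)

def pca_instrumental(X, W):
    """PCA of X after projecting it onto the space spanned by centred W."""
    Xc = X - X.mean(axis=0)
    Wc = W - W.mean(axis=0)
    # Least-squares fit of Xc on Wc gives the projected data X_hat.
    coef, *_ = np.linalg.lstsq(Wc, Xc, rcond=None)
    X_hat = Wc @ coef
    S_hat = X_hat.T @ X_hat / (X.shape[0] - 1)
    l, A = np.linalg.eigh(S_hat)
    order = np.argsort(l)[::-1]
    return l[order], A[:, order], S_hat

def trace_statistic(X, W):
    """tr(S_WX S_XX^{-1} S_XW), the statistic quoted in the text."""
    S_xx = cross_cov(X, X)
    S_xw = cross_cov(X, W)
    return np.trace(S_xw.T @ np.linalg.solve(S_xx, S_xw))

def permutation_p_value(X, W, n_perm=199, seed=0):
    """Monte Carlo p-value from shuffling rows of W (illustrative assumption)."""
    rng = np.random.default_rng(seed)
    observed = trace_statistic(X, W)
    exceed = sum(
        trace_statistic(X, W[rng.permutation(len(W))]) >= observed
        for _ in range(n_perm)
    )
    return (1 + exceed) / (1 + n_perm)

# Simulated data: p = 4 primary variables that depend on s = 3 instrumental ones.
rng = np.random.default_rng(1)
W = rng.normal(size=(100, 3))
X = W @ rng.normal(size=(3, 4)) + 0.5 * rng.normal(size=(100, 4))

l, A, S_hat = pca_instrumental(X, W)

# Matrix on the left-hand side of the eigenequation: S_XW S_WW^{-1} S_WX.
S_xw = cross_cov(X, W)
M = S_xw @ np.linalg.solve(cross_cov(W, W), S_xw.T)

p_value = permutation_p_value(X, W)
```

In this sketch the covariance matrix of the projected data coincides with S_XW S_WW^{-1} S_WX, so the columns of `A` satisfy the eigenequation. Because W has only three columns, at most three of the eigenvalues in `l` are non-zero.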

