12.07.2015 Views

Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)



88 5. Graphical Representation of Data Using Principal Components

similarity matrix. In this adjustment, $t_{hi}$ is replaced by $t_{hi} - \bar{t}_h - \bar{t}_i + \bar{t}$, where $\bar{t}_h$ denotes the mean of the elements in the $h$th row (or column, since $T$ is symmetric) of $T$, and $\bar{t}$ is the mean of all elements in $T$. This adjusted similarity matrix has $\sum_{i=1}^{n} c_{ij} = 0$, and gives the same value of $\Delta_{hi}^2$ for each pair of observations as does $T$ (Gower, 1966). Thus we can replace the second stage of principal coordinate analysis by an initial adjustment of $T$, for any similarity matrix $T$.

Principal coordinate analysis is equivalent to a plot with respect to the first $q$ PCs when the measure of similarity between two points is proportional to $-d_{hi}^2$, where $d_{hi}^2$ is the Euclidean squared distance between the $h$th and $i$th observations, calculated from the usual $(n \times p)$ data matrix. Assume $t_{hi} = -\gamma d_{hi}^2$, where $\gamma$ is a positive constant; then if stage (i) of a principal coordinate analysis is carried out, the 'distance' between a pair of points in the constructed $n$-dimensional space is

\[
\Delta_{hi}^2 = (t_{hh} + t_{ii} - 2t_{hi}) = \gamma(-d_{hh}^2 - d_{ii}^2 + 2d_{hi}^2) = 2\gamma d_{hi}^2,
\]

as Euclidean distance from a point to itself is zero. Thus, apart from a possible rescaling if $\gamma$ is taken to be a value other than $\tfrac{1}{2}$, the first stage of principal coordinate analysis correctly reproduces the relative positions of the $n$ observations, which lie in a $p$-dimensional subspace of $n$-dimensional space, so that the subsequent PCA in stage (ii) gives the same result as a PCA on the original data.

Two related special cases are of interest. First, consider the situation where all variables are binary. A commonly used measure of similarity between individuals $h$ and $i$ is the proportion of the $p$ variables for which $h$ and $i$ take the same value, and it can be easily demonstrated (Gower, 1966) that this measure is equivalent to Euclidean distance.
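The binary case can be checked numerically. The following is a minimal sketch, not the book's own procedure: the function names (`simple_matching`, `squared_euclidean`, `principal_coordinates`, `pc_scores`) and the random 0/1 data are assumptions made for illustration, and the two stages of principal coordinate analysis are collapsed into the usual shortcut of double-centring $T$ and taking its eigendecomposition. It checks that the matching proportion equals $1 - d_{hi}^2/p$, that $\Delta_{hi}^2 = 2d_{hi}^2/p$ (the case $\gamma = 1/p$, since the added constant 1 cancels), and that the first $q$ principal coordinates reproduce the first $q$ PC scores up to the scale factor $\sqrt{2/p}$ and the sign of each axis.

```python
import numpy as np

def simple_matching(X):
    """Proportion of the p variables on which each pair of rows agrees."""
    X = np.asarray(X)
    p = X.shape[1]
    agree = (X[:, None, :] == X[None, :, :]).sum(axis=2)
    return agree / p

def squared_euclidean(X):
    """Matrix of squared Euclidean distances d_hi^2 between rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=2)

def principal_coordinates(T, q):
    """First q principal coordinates from a symmetric similarity matrix T."""
    n = T.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = J @ T @ J                            # t_hi - tbar_h - tbar_i + tbar
    evals, evecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:q]      # indices of the q largest
    # Clip tiny negative eigenvalues caused by rounding before the square root.
    return evecs[:, order] * np.sqrt(np.clip(evals[order], 0.0, None))

def pc_scores(X, q):
    """First q principal component scores of the column-centred data."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :q] * s[:q]

# Hypothetical binary data: n observations on p 0/1 variables.
rng = np.random.default_rng(0)
n, p, q = 8, 5, 2
X = rng.integers(0, 2, size=(n, p))

S = simple_matching(X)
D2 = squared_euclidean(X)

# For 0/1 data every mismatch adds exactly 1 to d^2, so s_hi = 1 - d_hi^2 / p.
matching_is_distance = np.allclose(S, 1.0 - D2 / p)

# Delta_hi^2 = t_hh + t_ii - 2 t_hi with T = S.
diag = np.diag(S)
Delta2 = diag[:, None] + diag[None, :] - 2.0 * S
delta_matches = np.allclose(Delta2, 2.0 * D2 / p)

# Compare the two q-dimensional plots through their Gram matrices,
# which are unaffected by the arbitrary sign of each axis.
coords = principal_coordinates(S, q)
scores = pc_scores(X, q)
same_plot = np.allclose(coords @ coords.T, (2.0 / p) * (scores @ scores.T))

print(matching_is_distance, delta_matches, same_plot)
```

The Gram-matrix comparison is used because eigenvectors are determined only up to sign; it assumes the $q$th and $(q+1)$th eigenvalues are distinct, which is the same condition needed for the first $q$ PCs to be well defined.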
Thus, although PCA of discrete, and in particular binary, data has its critics, it is equivalent to principal coordinate analysis with a very plausible measure of similarity. Principal component analysis for discrete data is discussed further in Section 13.1.

The second special case occurs when the elements of the similarity matrix $T$ are defined as 'covariances' between observations, so that $T$ is proportional to $XX'$, where $X$, as before, is the column-centred $(n \times p)$ matrix whose $(i, j)$th element is the value of the $j$th variable, measured about its mean $\bar{x}_j$, for the $i$th observation. In this case the $(h, i)$th similarity is, apart from a constant,

\[
t_{hi} = \sum_{j=1}^{p} x_{hj} x_{ij}
\]

and the distances between the points in the $n$-dimensional space con-
