Jolliffe, I. Principal Component Analysis (2nd ed., Springer, 2002)
9.2. Cluster Analysis

…dimensional space defined by the variables. If the variables are measured in non-compatible units, then each variable can be standardized by dividing by its standard deviation, and an arbitrary, but obvious, measure of dissimilarity is then the Euclidean distance between a pair of observations in the p-dimensional space defined by the standardized variables.

Suppose that a PCA is done based on the covariance or correlation matrix, and that m (< p) PCs account for most of the variation in x. A possible alternative dissimilarity measure is the Euclidean distance between a pair of observations in the m-dimensional subspace defined by the first m PCs; such dissimilarity measures have been used in several published studies, for example Jolliffe et al. (1980). There is often no real advantage in using this measure, rather than the Euclidean distance in the original p-dimensional space, as the Euclidean distance calculated using all p PCs from the covariance matrix is identical to that calculated from the original variables. Similarly, the distance calculated from all p PCs for the correlation matrix is the same as that calculated from the p standardized variables. Using m instead of p PCs simply provides an approximation to the original Euclidean distance, and the extra calculation involved in finding the PCs far outweighs any saving which results from using m instead of p variables in computing the distance. However, if, as in Jolliffe et al. (1980), the PCs are being calculated in any case, the reduction from p to m variables may be worthwhile.

In calculating Euclidean distances, the PCs have the usual normalization, so that the sample variance of $a'_k x$ is $l_k$, $k = 1, 2, \ldots, p$, and $l_1 \ge l_2 \ge \cdots \ge l_p$, using the notation of Section 3.1. As an alternative, a distance can be calculated based on PCs that have been renormalized so that each PC has the same variance. This renormalization is discussed further in the context of outlier detection in Section 10.1. In the present setting, where the objective is the calculation of a dissimilarity measure, its use is based on the following idea. Suppose that one of the original variables is almost independent of all the others, but that several of the remaining variables are measuring essentially the same property as each other. Euclidean distance will then give more weight to this property than to the property described by the 'independent' variable. If it is thought desirable to give equal weight to each property, then this can be achieved by finding the PCs and then giving equal weight to each of the first m PCs.

To see that this works, consider a simple example in which four meteorological variables are measured. Three of the variables are temperatures, namely air temperature, sea surface temperature and dewpoint, and the fourth is the height of the cloudbase. The first three variables are highly correlated with each other, but nearly independent of the fourth. For a sample of 30 measurements on these variables, a PCA based on the correlation matrix gave a first PC with variance 2.95, which is a nearly equally weighted average of the three temperature variables. The second PC, with variance 0.99, is dominated by cloudbase height, and together the first two PCs account for 98.5% of the total variation in the four variables.
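The book gives no code, but the structure of this example is easy to reproduce. The following is a small synthetic analogue (the data are simulated assumptions, not the book's 30 meteorological measurements): three strongly correlated 'temperature' variables and one nearly independent 'cloudbase' variable, analysed via the correlation matrix.

```python
# Synthetic analogue of the meteorological example (assumed data).
import numpy as np

rng = np.random.default_rng(1)
n = 30
common = rng.normal(size=n)                        # shared temperature signal
temps = common[:, None] + 0.1 * rng.normal(size=(n, 3))
cloudbase = rng.normal(size=(n, 1))                # nearly independent variable
X = np.hstack([temps, cloudbase])

R = np.corrcoef(X, rowvar=False)                   # 4 x 4 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

print(eigvals)        # first eigenvalue near 3, second near 1
print(eigvecs[:, 0])  # PC1: nearly equal weights on the three 'temperatures'
print(eigvecs[:, 1])  # PC2: dominated by 'cloudbase height'
```

With this construction the first two sample eigenvalues come out close to 3 and 1, mirroring the 2.95 and 0.99 reported for the real data.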
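More generally, the dissimilarity measures discussed above can be sketched as follows. This is a minimal NumPy illustration under assumed names and random data: it checks that Euclidean distances from all p correlation-matrix PCs equal those from the standardized variables, then forms the m-PC approximation and the renormalized (equal-variance) variant.

```python
# Minimal sketch of the PC-based dissimilarity measures (illustrative names).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                # n = 30 observations, p = 4 variables

# Standardize, then obtain correlation-matrix PCs via the SVD.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T                           # PC scores, variances l_1 >= ... >= l_p

def pairwise_sq_dist(A):
    """Squared Euclidean distances between all pairs of rows of A."""
    sq = (A ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

# Distances from all p PCs are identical to those from the standardized data...
assert np.allclose(pairwise_sq_dist(scores), pairwise_sq_dist(Z))

# ...while the first m PCs give only an approximation to them.
m = 2
d_approx = pairwise_sq_dist(scores[:, :m])

# Renormalized PCs: divide each score by sqrt(l_k) so every PC has unit
# variance, giving equal weight to each 'property' (cf. Section 10.1).
l = s ** 2 / (Z.shape[0] - 1)               # sample variances of the PC scores
d_equal = pairwise_sq_dist(scores[:, :m] / np.sqrt(l[:m]))
```

Taking the SVD of the standardized data, rather than eigendecomposing the correlation matrix explicitly, is just a convenience here; both routes yield the same PCs up to sign.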
