Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)


cda.psych.uiuc.edu
12.07.2015 Views

9. Principal Components Used with Other Multivariate Techniques

9.2. Cluster Analysis

Before looking at examples of the uses just described of PCA in cluster analysis, we discuss a rather different way in which cluster analysis can be used and its connections with PCA. So far we have discussed cluster analysis on observations or individuals, but in some circumstances it is desirable to divide variables, rather than observations, into groups. In fact, by far the earliest book on cluster analysis (Tryon, 1939) was concerned with this type of application. Provided that a suitable measure of similarity between variables can be defined (the correlation coefficient is an obvious candidate), methods of cluster analysis used for observations can be readily adapted for variables.

One connection with PCA is that when the variables fall into well-defined clusters, then, as discussed in Section 3.8, there will be one high-variance PC and, except in the case of ‘single-variable’ clusters, one or more low-variance PCs associated with each cluster of variables. Thus, PCA will identify the presence of clusters among the variables, and can be thought of as a competitor to standard cluster analysis of variables. The use of PCA in this way is fairly common in climatology (see, for example, Cohen (1983), White et al. (1991), Romero et al. (1999)). In an analysis of a climate variable recorded at stations over a large geographical area, the loadings of the PCs at the various stations can be used to divide the area into regions with high loadings on each PC. In fact, this regionalization procedure is usually more effective if the PCs are rotated (see Section 11.1), so that most analyses are done using rotated loadings.

Identifying clusters of variables may be of general interest in investigating the structure of a data set but, more specifically, if we wish to reduce the number of variables without sacrificing too much information, then we could retain one variable from each cluster.
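
To make the link between PC loadings and clusters of variables concrete, here is a minimal sketch in Python with NumPy; it is an illustration, not a procedure taken from the text. It assumes synthetic data built from two hypothetical latent factors, assigns each variable to whichever of the first G unrotated PCs of the correlation matrix gives it the largest absolute loading, and then keeps the single highest-loading variable in each cluster. The function names, the synthetic data and the ‘largest absolute loading’ rule are all assumptions made for illustration; as noted above, rotated loadings usually give a cleaner division.

```python
import numpy as np


def cluster_variables_by_loadings(X, n_clusters):
    """Assign each variable to the leading PC on which it loads most heavily.

    X is an (n observations x p variables) array. PCA is done on the
    correlation matrix; PCs are NOT rotated (an illustrative simplification).
    """
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)             # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_clusters]   # indices of the largest ones
    loadings = eigvecs[:, order]                     # p x n_clusters
    labels = np.argmax(np.abs(loadings), axis=1)
    return labels, loadings


def one_variable_per_cluster(labels, loadings):
    """Keep, for each cluster, the variable with the largest absolute loading."""
    kept = []
    for g in range(loadings.shape[1]):
        members = np.flatnonzero(labels == g)
        if members.size:
            best = members[np.argmax(np.abs(loadings[members, g]))]
            kept.append(int(best))
    return kept


# Synthetic data (assumption): 4 variables driven by factor f1, 3 by factor f2.
rng = np.random.default_rng(0)
n = 2000
f1, f2 = rng.standard_normal((2, n))
X = np.column_stack(
    [f1 + 0.5 * rng.standard_normal(n) for _ in range(4)]
    + [f2 + 0.5 * rng.standard_normal(n) for _ in range(3)]
)

labels, loadings = cluster_variables_by_loadings(X, n_clusters=2)
kept = one_variable_per_cluster(labels, loadings)
print("cluster label per variable:", labels)
print("retained variable indices:", kept)
```

The two synthetic clusters are deliberately of different sizes so that the two leading eigenvalues are well separated; with equal-sized clusters the unrotated PCs can mix the groups, which is one reason rotation is often preferred in practice.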
Retaining one variable from each cluster is essentially the idea behind some of the variable selection techniques based on PCA that were described in Section 6.3.

Hastie et al. (2000) describe a novel clustering procedure for ‘variables’ which uses PCA applied in a genetic context. They call their method ‘gene shaving.’ Their data consist of p = 4673 gene expression measurements for n = 48 patients, and the objective is to classify the 4673 genes into groups that have coherent expressions. The first PC is found for these data and a proportion of the genes (typically 10%) having the smallest absolute inner products with this PC are deleted (shaved). PCA followed by shaving is repeated for the reduced data set, and this procedure continues until ultimately only one gene remains. A nested sequence of subsets of genes results from this algorithm and an optimality criterion is used to decide which set in the sequence is best. This gives the first cluster of genes. The whole procedure is then repeated after centering the data with respect to the ‘average gene expression’ in the first cluster, to give a second cluster and so on.

Another way of constructing clusters of variables, which simultaneously finds the first PC within each cluster, is proposed by Vigneau and Qannari (2001). Suppose that the p variables are divided into G groups or clusters, and that $\mathbf{x}_g$ denotes the vector of variables in the gth group, $g = 1, 2, \ldots, G$. Vigneau and Qannari (2001) seek vectors $\mathbf{a}_{11}, \mathbf{a}_{21}, \ldots, \mathbf{a}_{G1}$ that maximize $\sum_{g=1}^{G} \operatorname{var}(\mathbf{a}_{g1}'\mathbf{x}_g)$, where $\operatorname{var}(\mathbf{a}_{g1}'\mathbf{x}_g)$ is the sample variance of the linear function $\mathbf{a}_{g1}'\mathbf{x}_g$. This sample variance is clearly maximized by the first PC for the variables in the gth group, but simultaneously we wish to find the partition of the variables into G groups for which the sum of these variances is maximized. An iterative procedure is presented by Vigneau and Qannari (2001) for solving this problem.

The formulation of the problem assumes that variables with large squared correlations with the first PC in a cluster should be assigned to that cluster. Vigneau and Qannari consider two variations of their technique. In the first, the signs of the correlations between variables and PCs are important; only those variables with large positive correlations with a PC should be in its cluster. In the second, relationships with external variables are taken into account.

9.2.1 Examples

Only one example will be described in detail here, although a number of other examples that have appeared elsewhere will be discussed briefly. In many of the published examples where PCs have been used in conjunction with cluster analysis, there is no clear-cut cluster structure, and cluster analysis has been used as a dissection technique. An exception is the well-known example given by Jeffers (1967), which was discussed in the context of variable selection in Section 6.4.1. The data consist of 19 variables measured on 40 aphids and, when the 40 observations are plotted with respect to the first two PCs, there is a strong suggestion of four distinct groups; refer to Figure 9.3, on which convex hulls (see Section 5.1) have been drawn around the four suspected groups. It is likely that the four groups indicated on Figure 9.3 correspond to four different species of aphids; these four species cannot be readily distinguished using only one variable at a time, but the plot with respect to the first two PCs clearly distinguishes the four populations.

The example introduced in Section 1.1 and discussed further in Section 5.1.1, which has seven physical measurements on 28 students, also shows (in Figures 1.3, 5.1) how a plot with respect to the first two PCs can distinguish two groups, in this case men and women. There is, unlike the aphid data, a small amount of overlap between groups and if the PC plot is used to identify, rather than verify, a cluster structure, then it is likely that some misclassification between sexes will occur. A simple but specialized use of PC scores, one PC at a time, to classify seabird communities is described by Huettmann and Diamond (2001).

In the situation where cluster analysis is used for dissection, the aim of a two-dimensional plot with respect to the first two PCs will almost always be
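
Returning to the gene shaving procedure of Hastie et al. (2000) summarized earlier in this section, the shaving loop can be sketched roughly as follows in Python with NumPy. This is only an illustration under stated assumptions: the data are synthetic, each gene (row) is centred before the PCA, and the function name `shave_sequence` is hypothetical. The sketch produces only the nested sequence of gene subsets; the optimality criterion that picks the best subset, and the re-centering step that leads to second and later clusters, are not specified in enough detail here and are left out.

```python
import numpy as np


def shave_sequence(X, shave_fraction=0.10):
    """Return the nested sequence of gene subsets produced by repeated shaving.

    X is a (genes x samples) array. On each pass the first PC of the
    remaining genes is found and the genes with the smallest absolute inner
    product with it are removed, until a single gene remains.
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # centre each gene (assumption)
    active = np.arange(X.shape[0])
    subsets = [active.copy()]
    while active.size > 1:
        sub = Xc[active]
        # Leading right singular vector = first PC direction across samples.
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        pc1 = vt[0]
        inner = np.abs(sub @ pc1)             # |inner product| of each gene with PC1
        n_drop = max(1, int(np.floor(shave_fraction * active.size)))
        n_drop = min(n_drop, active.size - 1)  # always leave at least one gene
        keep = np.sort(np.argsort(inner)[n_drop:])
        active = active[keep]
        subsets.append(active.copy())
    return subsets


# Synthetic example (assumption): 50 genes, 20 samples, first 10 genes coherent.
rng = np.random.default_rng(1)
signal = rng.standard_normal(20)
X = rng.standard_normal((50, 20))
X[:10] += 3.0 * signal

subsets = shave_sequence(X)
print("subset sizes:", [len(s) for s in subsets])
print("last surviving gene:", subsets[-1])
```

Removing at least one gene on every pass guarantees that the loop terminates even when 10% of the remaining genes rounds down to zero.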

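
The alternating idea behind the Vigneau and Qannari (2001) formulation described earlier (find the first PC within each group, then reassign each variable to the group whose first PC it is most strongly correlated with) can be illustrated in the same spirit. The sketch below is not their published algorithm: it assumes standardized variables, uses the squared-correlation reassignment rule mentioned in the text, takes a user-supplied starting partition, and stops when the partition no longer changes. All names and the synthetic data are hypothetical.

```python
import numpy as np


def first_pc_scores(Z):
    """Unit-length score vector of the first PC of the (centred) columns of Z."""
    u, _, _ = np.linalg.svd(Z, full_matrices=False)
    return u[:, 0]


def cluster_variables(X, init_labels, max_iter=100):
    """Alternate between within-group first PCs and variable reassignment."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized variables (assumption)
    n, p = Z.shape
    labels = np.asarray(init_labels).copy()
    n_groups = int(labels.max()) + 1
    for _ in range(max_iter):
        r2 = np.zeros((p, n_groups))
        for g in range(n_groups):
            members = np.flatnonzero(labels == g)
            if members.size == 0:
                continue                       # an empty group attracts no variables
            c = first_pc_scores(Z[:, members])
            # Squared correlation of every variable with this group's first PC.
            r2[:, g] = (Z.T @ c) ** 2 / n
        new_labels = np.argmax(r2, axis=1)
        if np.array_equal(new_labels, labels):
            break                              # partition is stable
        labels = new_labels
    return labels


# Synthetic data (assumption): 4 variables driven by f1, 3 driven by f2.
rng = np.random.default_rng(0)
n = 2000
f1, f2 = rng.standard_normal((2, n))
X = np.column_stack(
    [f1 + 0.5 * rng.standard_normal(n) for _ in range(4)]
    + [f2 + 0.5 * rng.standard_normal(n) for _ in range(3)]
)

# Deliberately scrambled starting partition.
labels = cluster_variables(X, init_labels=[0, 1, 0, 0, 1, 1, 0])
print("final group of each variable:", labels)
```

Like k-means, a scheme of this kind can stop at a local optimum, so in practice one would probably try several starting partitions and keep the one with the largest value of the criterion.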
