Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)
9.2. Cluster Analysis

...county cluster in the top right of the plot splits into three groups containing 13, 10 and 4 counties, with some overlap between them.

This example is typical of many in which cluster analysis is used for dissection. Examples like that of Jeffers’ (1967) aphids, where a very clear-cut and previously unknown cluster structure is uncovered, are relatively unusual, although another illustration is given by Blackith and Reyment (1971, p. 155). In their example, a plot of the observations with respect to the second and third (out of seven) PCs shows a very clear separation into two groups. It is probable that in many circumstances ‘projection-pursuit’ methods, which are discussed next, will provide a better two-dimensional space in which to view the results of a cluster analysis than that defined by the first two PCs. However, if dissection rather than discovery of a clear-cut cluster structure is the objective of a cluster analysis, then there is likely to be little improvement over a plot with respect to the first two PCs.

9.2.2 Projection Pursuit

As mentioned earlier in this chapter, it may be possible to find low-dimensional representations of a data set that are better than the first few PCs at displaying ‘structure’ in the data. One approach to doing this is to define structure as ‘interesting’ and then construct an index of ‘interestingness,’ which is successively maximized. This is the idea behind projection pursuit, with different indices leading to different displays. If ‘interesting’ is defined as ‘large variance,’ it is seen that PCA is a special case of projection pursuit. However, the types of structure of interest are often clusters or outliers, and there is no guarantee that the high-variance PCs will find such features. The term ‘projection pursuit’ dates back to Friedman and Tukey (1974), and a great deal of work was done in the early 1980s. This is described at length in three key papers: Friedman (1987), Huber (1985), and Jones and Sibson (1987).
The last two both include extensive discussion, in addition to the paper itself. Some techniques are good at finding clusters, whereas others are better at detecting outliers.

Most projection pursuit techniques start from the premise that the least interesting structure is multivariate normality, so that deviations from normality form the basis of many indices. There are measures based on skewness and kurtosis, on entropy, on looking for deviations from uniformity in transformed data, and on finding ‘holes’ in the data. More recently, Foster (1998) suggested looking for directions of high density, after ‘sphering’ the data to remove linear structure. Sphering operates by transforming the variables $x$ to $z = S^{-\frac{1}{2}}(x - \bar{x})$, which is equivalent to converting to PCs, which are then standardized to have zero mean and unit variance. Friedman (1987) also advocates sphering as a first step in his version of projection pursuit. After identifying the high-density directions for the sphered data, Foster (1998) uses the inverse transformation to discover the nature of the interesting structures in terms of the original variables.
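The sphering transformation described above can be sketched in a few lines of numpy. This is an illustrative implementation, not code from the book; the function name `sphere` and the simulated two-variable data are arbitrary choices for the example.

```python
import numpy as np

def sphere(X):
    """Transform each x to z = S^{-1/2}(x - xbar): the sphered data
    have zero mean and identity sample covariance, removing all linear
    structure (equivalent to taking PCs and standardizing them)."""
    X = np.asarray(X, dtype=float)
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    # Symmetric inverse square root of S via its eigendecomposition
    lam, V = np.linalg.eigh(S)
    S_inv_sqrt = V @ np.diag(lam ** -0.5) @ V.T
    return (X - xbar) @ S_inv_sqrt

# Correlated two-variable data; after sphering, the sample mean is zero
# and the sample covariance matrix is the identity.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[4.0, 1.5], [1.5, 1.0]], size=500)
Z = sphere(X)
print(np.allclose(Z.mean(axis=0), 0.0))                  # True: zero mean
print(np.allclose(np.cov(Z, rowvar=False), np.eye(2)))   # True: identity covariance
```

Because sphering is invertible, any structure found in the sphered data can be mapped back through $S^{\frac{1}{2}}$ to interpret it in terms of the original variables, as Foster (1998) does.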
Projection pursuit indices usually seek out deviations from multivariate normality. Bolton and Krzanowski (1999) show that if normality holds then PCA finds directions for which the maximized likelihood is minimized. They interpret this result as PCA choosing interesting directions to be those for which normality is least likely, thus providing a link with the ideas of projection pursuit.

A different projection pursuit technique with an implicit assumption of normality is based on the fixed effects model of Section 3.9. Recall that the model postulates that, apart from an error term $e_i$ with $\text{var}(e_i) = \frac{\sigma^2}{w_i}\Gamma$, the variables $x$ lie in a $q$-dimensional subspace. To find the best-fitting subspace, $\sum_{i=1}^{n} w_i \|x_i - z_i\|^2_M$ is minimized for an appropriately chosen metric $M$. For multivariate normal $e_i$ the optimal choice for $M$ is $\Gamma^{-1}$. Given a structure of clusters in the data, all $w_i$ equal, and $e_i$ describing variation within clusters, Caussinus and Ruiz (1990) suggest a robust estimate of $\Gamma$, defined by

$$\hat{\Gamma} = \frac{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} K\left[\|x_i - x_j\|^2_{S^{-1}}\right](x_i - x_j)(x_i - x_j)'}{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} K\left[\|x_i - x_j\|^2_{S^{-1}}\right]}, \qquad (9.2.1)$$

where $K[\cdot]$ is a decreasing positive real function (Caussinus and Ruiz, 1990, use $K[d] = e^{-\beta d/2}$ for $\beta > 0$) and $S$ is the sample covariance matrix. The best fit is then given by finding eigenvalues and eigenvectors of $S\hat{\Gamma}^{-1}$, which is a type of generalized PCA (see Section 14.2.2). There is a similarity here with canonical discriminant analysis (Section 9.1), which finds eigenvalues and eigenvectors of $S_b S_w^{-1}$, where $S_b$, $S_w$ are between- and within-group covariance matrices. In Caussinus and Ruiz’s (1990) form of projection pursuit, $S$ is the overall covariance matrix, and $\hat{\Gamma}$ is an estimate of the within-group covariance matrix.
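The pairwise-weighted estimate of equation (9.2.1) and the subsequent generalized PCA can be sketched in numpy as follows. This is an illustrative implementation, not code from the book; the function names, the two-cluster test data, and the choice $\beta = 2$ (within the recommended 0.5 to 3.0 range) are all assumptions made for the example.

```python
import numpy as np

def gamma_hat(X, beta=2.0):
    """Robust within-group covariance estimate of equation (9.2.1):
    a kernel-weighted average of (x_i - x_j)(x_i - x_j)' over all
    pairs, with K[d] = exp(-beta * d / 2) and d the squared distance
    ||x_i - x_j||^2 in the metric S^{-1}."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    num = np.zeros((p, p))
    den = 0.0
    for i in range(n - 1):
        diff = X[i] - X[i + 1:]                           # all pairs (i, j), j > i
        d2 = np.einsum('kp,pq,kq->k', diff, S_inv, diff)  # squared distances in S^{-1}
        w = np.exp(-beta * d2 / 2.0)                      # kernel weights K[d]
        num += diff.T @ (diff * w[:, None])
        den += w.sum()
    return num / den

def interesting_directions(X, beta=2.0):
    """Generalized PCA: eigenvalues/vectors of S Gamma-hat^{-1}. Large
    eigenvalues flag directions in which the overall spread greatly
    exceeds the (robustly estimated) within-cluster spread."""
    S = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eig(S @ np.linalg.inv(gamma_hat(X, beta)))
    order = np.argsort(evals.real)[::-1]
    return evals.real[order], evecs.real[:, order]

# Two well-separated clusters along the first coordinate: the leading
# direction should pick out the cluster-separating axis.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([-3.0, 0.0], 0.5, (60, 2)),
               rng.normal([3.0, 0.0], 0.5, (60, 2))])
evals, evecs = interesting_directions(X)
print("eigenvalues:", evals)
```

Because the groups are unknown, the kernel weights do the work that group labels do in canonical discriminant analysis: pairs that are close (and so likely within a cluster) dominate $\hat{\Gamma}$, while distant cross-cluster pairs are heavily downweighted.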
Equivalent results would be obtained if $S$ were replaced by an estimate of between-group covariance, so that the only real difference from canonical discriminant analysis is that the groups are known in the latter case but are unknown in projection pursuit. Further theoretical details and examples of Caussinus and Ruiz’s technique can be found in Caussinus and Ruiz-Gazen (1993, 1995). The choice of values for $\beta$ is discussed, and values in the range 0.5 to 3.0 are recommended.

There is a link between Caussinus and Ruiz-Gazen’s technique and the mixture models of Section 9.2.3. In discussing theoretical properties of their technique, they consider a framework in which clusters arise from a mixture of multivariate normal distributions. The $q$ dimensions of the underlying model correspond to $q$ clusters and $\Gamma$ represents ‘residual’ or within-group covariance.

Although not projection pursuit as such, Krzanowski (1987b) also looks for low-dimensional representations of the data that preserve structure, but in the context of variable selection. Plots are made with respect to the first two PCs calculated from only a subset of the variables. A criterion for