Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)
10.1. Detection of Outliers Using Principal Components

that this (male) student has the equal largest chest measurement, but that only 3 of the other 16 male students are shorter than him, and only two have a smaller waist measurement. Perhaps he was a body builder? Similar analyses can be done for other observations in Table 10.1. For example, observation 20 is extreme on the fifth PC. This PC, which accounts for 2.7% of the total variation, is mainly a contrast between height and forearm length with coefficients 0.67, −0.52, respectively. Observation 20 is (jointly with one other) the shortest student of the 28, but only one of the other ten women has a larger forearm measurement. Thus, observations 15 and 20, and other observations indicated as extreme by the last few PCs, are students for whom some aspects of their physical measurements contradict the general positive correlation among all seven measurements.

Household Formation Data

These data were described in Section 8.7.2 and are discussed in detail by Garnham (1979) and Bassett et al. (1980). Section 8.7.2 gives the results of a PC regression of average annual total income per adult on 28 other demographic variables for 168 local government areas in England and Wales.

Garnham (1979) also examined plots of the last few and first few PCs of the 28 predictor variables in an attempt to detect outliers. Two such plots, for the first two and last two PCs, are reproduced in Figures 10.4 and 10.5. An interesting aspect of these figures is that the most extreme observations with respect to the last two PCs, namely observations 54, 67, 41 (and 47, 53), are also among the most extreme with respect to the first two PCs. Some of these observations are, in addition, in outlying positions on plots of other low-variance PCs. The most blatant case is observation 54, which is among the few most extreme observations on PCs 24 to 28 inclusive, and also on PC1.
This observation is ‘Kensington and Chelsea,’ which must be an outlier with respect to several variables individually, as well as being different in correlation structure from most of the remaining observations.

In addition to plotting the data with respect to the last few and first few PCs, Garnham (1979) examined the statistics $d_{1i}^2$ for $q = 1, 2, \ldots, 8$ using gamma plots, and also looked at normal probability plots of the values of various PCs. As a combined result of these analyses, he identified six likely outliers, the five mentioned above together with observation 126, which is moderately extreme according to several analyses.

The PC regression was then repeated without these six observations. The results of the regression were noticeably changed, and were better in two respects than those derived from all the observations. The number of PCs which it was necessary to retain in the regression was decreased, and the prediction accuracy was improved, with the standard error of prediction reduced to 77.3% of that for the full data set.
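
The idea of looking for outliers in the last few PCs can be illustrated with a small sketch. It assumes that the statistic examined by Garnham is the sum of squared scores on the last q PCs, $d_{1i}^2 = \sum_{k=p-q+1}^{p} z_{ik}^2$, a definition that this passage does not restate, and that the PCs are found from the correlation matrix. The data are synthetic, not the household formation data, and the function name `last_pc_statistic` is purely illustrative.

```python
import numpy as np

def last_pc_statistic(X, q):
    """Sum of squared scores on the last q PCs, one value per observation."""
    X = np.asarray(X, dtype=float)
    # Standardise the columns (assumption: PCA on the correlation matrix).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # eigh returns eigenvalues in ascending order, so the first q
    # eigenvectors belong to the q lowest-variance PCs.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    scores = Z @ eigvecs[:, :q]
    return (scores ** 2).sum(axis=1)

rng = np.random.default_rng(0)
n = 200
# Synthetic data: two pairs of almost identical variables, so that the
# last two PCs are essentially the within-pair contrasts.
a = rng.normal(size=n)
b = 0.5 * a + np.sqrt(0.75) * rng.normal(size=n)
X = np.column_stack([
    a, a + 0.05 * rng.normal(size=n),
    b, b + 0.05 * rng.normal(size=n),
])
# Planted observation: unremarkable on each variable taken singly, but
# it contradicts the near-perfect correlation within the first pair.
X[0] = [1.0, -1.0, 0.0, 0.0]

d1_sq = last_pc_statistic(X, q=2)
most_extreme = int(np.argmax(d1_sq))
print(most_extreme, d1_sq[most_extreme], np.median(d1_sq))
```

In this construction the planted observation is inconspicuous on any single variable, yet it dominates the statistic because it violates the correlation structure, which is the kind of observation that the last few PCs are suited to detecting. In practice the values would be inspected graphically, for example with the gamma plots mentioned above, rather than by simply taking the maximum.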
Figure 10.4. Household formation data: plot of the observations with respect to the first two PCs. (Axes: PC1 horizontal, PC2 vertical; labelled observations: 67, 54, 41, 47, 53.)