Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)


cda.psych.uiuc.edu

10.1. Detection of Outliers Using Principal Components

...that this (male) student has the equal largest chest measurement, but that only 3 of the other 16 male students are shorter than him, and only two have a smaller waist measurement; perhaps he was a body builder? Similar analyses can be done for other observations in Table 10.1. For example, observation 20 is extreme on the fifth PC. This PC, which accounts for 2.7% of the total variation, is mainly a contrast between height and forearm length, with coefficients 0.67 and −0.52, respectively. Observation 20 is (jointly with one other) the shortest student of the 28, but only one of the other ten women has a larger forearm measurement. Thus, observations 15 and 20, and other observations indicated as extreme by the last few PCs, are students for whom some aspects of their physical measurements contradict the general positive correlation among all seven measurements.

Household Formation Data

These data were described in Section 8.7.2 and are discussed in detail by Garnham (1979) and Bassett et al. (1980). Section 8.7.2 gives the results of a PC regression of average annual total income per adult on 28 other demographic variables for 168 local government areas in England and Wales. Garnham (1979) also examined plots of the last few and first few PCs of the 28 predictor variables in an attempt to detect outliers. Two such plots, for the first two and last two PCs, are reproduced in Figures 10.4 and 10.5.

An interesting aspect of these figures is that the most extreme observations with respect to the last two PCs, namely observations 54, 67, 41 (and 47, 53), are also among the most extreme with respect to the first two PCs. Some of these observations are, in addition, in outlying positions on plots of other low-variance PCs. The most blatant case is observation 54, which is among the few most extreme observations on PCs 24 to 28 inclusive, and also on PC1. This observation is ‘Kensington and Chelsea,’ which must be an outlier with respect to several variables individually, as well as being different in correlation structure from most of the remaining observations.

In addition to plotting the data with respect to the last few and first few PCs, Garnham (1979) examined the statistics $d_{1i}^2$ for $q = 1, 2, \ldots, 8$ using gamma plots, and also looked at normal probability plots of the values of various PCs. As a combined result of these analyses, he identified six likely outliers, the five mentioned above together with observation 126, which is moderately extreme according to several analyses.

The PC regression was then repeated without these six observations. The results of the regression were noticeably changed, and were better in two respects than those derived from all the observations. The number of PCs which it was necessary to retain in the regression was decreased, and the prediction accuracy was improved, with the standard error of prediction reduced to 77.3% of that for the full data set.
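The idea of screening observations on the last few PCs can be illustrated with a minimal sketch. It assumes that $d_{1i}^2$ is the sum of squared scores of observation $i$ on the last $q$ PCs (the definition is not part of this excerpt) and that the PCs are taken from the correlation matrix. The function name `last_pc_outlier_stat` and the simulated data are hypothetical; they are not the household formation data or Garnham's procedure.

```python
import numpy as np

def last_pc_outlier_stat(X, q):
    """Sum of squared scores on the q lowest-variance PCs (assumed form of d^2_1i)."""
    X = np.asarray(X, dtype=float)
    # Standardise each column so that the PCs are those of the correlation matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    # eigh returns eigenvalues in ascending order, so the first q
    # eigenvectors define the q PCs with the smallest variances.
    _, eigvecs = np.linalg.eigh(R)
    scores = Z @ eigvecs[:, :q]
    return (scores ** 2).sum(axis=1)

# Simulated data (hypothetical): two pairs of almost identical variables.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
X = np.column_stack([
    a, a + 0.1 * rng.normal(size=n),
    b, b + 0.1 * rng.normal(size=n),
])
# Observation 0 is unremarkable on each variable separately,
# but contradicts the strong positive correlation within the first pair.
X[0] = [1.5, -1.5, 0.0, 0.0]

d2 = last_pc_outlier_stat(X, q=2)
suspects = np.argsort(d2)[::-1][:3]  # indices of the three largest values
print(suspects)
```

In this sketch the planted observation would not stand out in a plot of any single variable, yet it dominates the statistic based on the last two PCs, which is the kind of behaviour described for observations 15 and 20 above. In practice the values of `d2` would be inspected with a probability plot rather than a fixed cut-off.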

[Figure 10.4. Household formation data: plot of the observations with respect to the first two PCs. Horizontal axis: PC1; vertical axis: PC2; labelled observations: 41, 47, 53, 54, 67.]

