Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)


cda.psych.uiuc.edu

10.1. Detection of Outliers Using Principal Components

Table 10.1. Anatomical measurements: values of $d_{1i}^2$, $d_{2i}^2$, $d_{4i}$ for the most extreme observations.

Number of PCs used, q

q = 1                  q = 2
$d_{1i}^2$  Obs. No.   $d_{1i}^2$  Obs. No.   $d_{2i}^2$  Obs. No.   $d_{4i}$  Obs. No.
0.81        15         1.00        7          7.71        15         2.64      15
0.47        1          0.96        11         7.69        7          2.59      11
0.44        7          0.91        15         6.70        11         2.01      1
0.16        16         0.48        1          4.11        1          1.97      7
0.15        4          0.48        23         3.52        23         1.58      23
0.14        2          0.36        12         2.62        12         1.49      27

q = 3
$d_{1i}^2$  Obs. No.   $d_{2i}^2$  Obs. No.   $d_{4i}$  Obs. No.
1.55        20         9.03        20         2.64      15
1.37        5          7.82        15         2.59      5
1.06        11         7.70        5          2.59      11
1.00        7          7.69        7          2.53      20
0.96        1          7.23        11         2.01      1
0.93        15         6.71        1          1.97      7

observations on each statistic, where the number of PCs included, q, is 1, 2 or 3. The observations that correspond to the most extreme values of $d_{1i}^2$, $d_{2i}^2$ and $d_{4i}$ are identified in Table 10.1, and also on Figure 10.3.

Note that when q = 1 the observations have the same ordering for all three statistics, so only the values of $d_{1i}^2$ are given in Table 10.1. When q is increased to 2 or 3, the six most extreme observations are the same (in a slightly different order) for both $d_{1i}^2$ and $d_{2i}^2$. With the exception of the sixth most extreme observation for q = 2, the same observations are also identified by $d_{4i}$. Although the sets of the six most extreme observations are virtually the same for $d_{1i}^2$, $d_{2i}^2$ and $d_{4i}$, there are some differences in ordering. The most notable example is observation 15 which, for q = 3, is most extreme for $d_{4i}$ but only sixth most extreme for $d_{1i}^2$.

Observations 1, 7 and 15 are extreme on all seven statistics given in Table 10.1, due to large contributions from the final PC alone for observation 15, the last two PCs for observation 7, and the fifth and seventh PCs for observation 1. Observations 11 and 20, which are not extreme for the final PC, appear in the columns for q = 2 and 3 because of extreme behaviour on the sixth PC for observation 11, and on both the fifth and sixth PCs for observation 20. Observation 16, which was discussed earlier as a clear outlier on the second PC, appears in the list for q = 1, but is not notably extreme for any of the last three PCs.
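The comparison of the three statistics can be tried out with a short Python sketch. The definitions are not reproduced in this excerpt, so the code assumes the usual ones from earlier in the section: $d_{1i}^2$ is the sum of squared scores on the last q PCs, $d_{2i}^2$ is the same sum with each squared score divided by that PC's variance, and $d_{4i}$ is the largest absolute standardized score among the last q PCs. The function name `low_variance_pc_statistics`, the use of the covariance (rather than correlation) matrix, and the synthetic data are all illustrative assumptions; the anatomical measurements themselves are not used here.

```python
import numpy as np


def low_variance_pc_statistics(X, q):
    """Return (d1_sq, d2_sq, d4) per observation, using the last q PCs.

    Assumed definitions (not stated in this excerpt):
      d1_sq = sum of z_ik^2 over the last q PCs
      d2_sq = sum of z_ik^2 / l_k over the last q PCs
      d4    = max of |z_ik| / sqrt(l_k) over the last q PCs
    where z_ik are PC scores and l_k the corresponding PC variances.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # centre each variable
    S = np.cov(Xc, rowvar=False)            # covariance matrix (an assumption)
    eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
    variances = eigvals[:q]                 # the q smallest PC variances
    Z = Xc @ eigvecs[:, :q]                 # scores on the last q PCs
    d1_sq = (Z ** 2).sum(axis=1)
    d2_sq = (Z ** 2 / variances).sum(axis=1)
    d4 = np.abs(Z / np.sqrt(variances)).max(axis=1)
    return d1_sq, d2_sq, d4


# Synthetic stand-in data: 28 observations on 7 correlated variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(28, 7)) @ rng.normal(size=(7, 7))

d1, d2, d4 = low_variance_pc_statistics(X, q=1)
# Zero-based row indices of the six most extreme observations on d1.
top6_d1 = np.argsort(d1)[::-1][:6]
print(top6_d1)
```

With q = 1 all three statistics are monotone functions of the absolute score on the final PC, which is why the text can report a single column for that case. For q = 2 or 3 the rankings can differ, as they do for observation 15 in Table 10.1.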

10. Outlier Detection, Influential Observations and Robust Estimation

Figure 10.3. Anatomical measurements: plot of observations with respect to the last two PCs.

Most of the observations identified in Table 10.1 are near the edge of the plot given in Figure 10.3. Observations 2, 4, 5, 12, 16, 20, 23 and 27 are close to the main body of the data, but observations 7, 11, 15, and to a lesser extent 1, are sufficiently far from the remaining data to be worthy of further consideration. To roughly judge their ‘significance,’ recall that, if no outliers are present and the data are approximately multivariate normal, then the values of $d_{4i}$ are (approximately) absolute values of a normal random variable with zero mean and unit variance. The quantities given in the relevant columns of Table 10.1 are therefore the six largest among 28q such variables, and none of them look particularly extreme. Nevertheless, it is of interest to investigate the reasons for the outlying positions of some of the observations, and to do so it is necessary to examine the coefficients of the last few PCs. The final PC, accounting for only 1.7% of the total variation, is largely a contrast between chest and hand measurements with positive coefficients 0.55, 0.51, and waist and height measurements, which have negative coefficients −0.55, −0.32. Looking at observation 15, we find
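
The rough ‘significance’ judgement made earlier on this page, that the largest $d_{4i}$ values are unremarkable as the largest among 28q absolute standard normal values, can be checked with a small calculation. The sketch below is only an approximation under stated assumptions: it reads 28q as 28 observations times q PCs, treats the 28q standardized scores as independent N(0, 1) variables, and uses the hypothetical helper name `prob_max_abs_normal_exceeds`. The value 2.64 is the largest $d_{4i}$ reported in Table 10.1 for q = 2 and q = 3.

```python
import math


def prob_max_abs_normal_exceeds(x, m):
    """P(largest of m independent |N(0,1)| values is at least x)."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF at x
    p_one_below = 2.0 * phi - 1.0                     # P(|Z| < x) for a single value
    return 1.0 - p_one_below ** m


n_obs = 28            # read off from "28q" in the text (an assumption)
largest_d4 = 2.64     # largest d_4i in Table 10.1 for q = 2 and q = 3

for q in (2, 3):
    p = prob_max_abs_normal_exceeds(largest_d4, n_obs * q)
    print(f"q = {q}: P(max of {n_obs * q} values >= {largest_d4}) = {p:.2f}")
```

Under these assumptions the probability of seeing a maximum at least this large is far above any conventional significance threshold, which agrees with the remark that none of the tabulated values look particularly extreme.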

