12.07.2015 Views

Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)

Hoaglin et al. (1983) for a more readable approach), and robustness with respect to distributional assumptions, as well as with respect to outlying or influential observations, may be of interest. A number of techniques have been suggested for robustly estimating PCs, and these are discussed in the fourth section of this chapter; the final section presents a few concluding remarks.

10.1 Detection of Outliers Using Principal Components

There is no formal, widely accepted, definition of what is meant by an ‘outlier.’ The books on the subject by Barnett and Lewis (1994) and Hawkins (1980) both rely on informal, intuitive definitions, namely that outliers are observations that are in some way different from, or inconsistent with, the remainder of a data set. For p-variate data, this definition implies that outliers are a long way from the rest of the observations in the p-dimensional space defined by the variables. Numerous procedures have been suggested for detecting outliers with respect to a single variable, and many of these are reviewed by Barnett and Lewis (1994) and Hawkins (1980). The literature on multivariate outliers is less extensive, with each of these two books containing only one chapter (comprising less than 15% of their total content) on the subject. Several approaches to the detection of multivariate outliers use PCs, and these will now be discussed in some detail. As well as the methods described in this section, which use PCs in fairly direct ways to identify potential outliers, techniques for robustly estimating PCs (see Section 10.4) may also be used to detect outlying observations.

A major problem in detecting multivariate outliers is that an observation that is not extreme on any of the original variables can still be an outlier, because it does not conform with the correlation structure of the remainder of the data. It is impossible to detect such outliers by looking solely at the original variables one at a time. As a simple example, suppose that heights and weights are measured for a sample of healthy children of various ages between 5 and 15 years old. Then an ‘observation’ with height and weight of 175 cm (70 in) and 25 kg (55 lb), respectively, is not particularly extreme on either the height or weight variables individually, as 175 cm is a plausible height for the older children and 25 kg is a plausible weight for the youngest children. However, the combination (175 cm, 25 kg) is virtually impossible, and will be a clear outlier because it combines a large height with a small weight, thus violating the general pattern of a positive correlation between the two variables. Such an outlier is apparent on a plot of the two variables (see Figure 10.1) but, if the number of variables p is large, it is quite possible that some outliers will not be apparent on any of the $\frac{1}{2}p(p-1)$ plots of two variables at a time. Thus, for large p we need to consider the possibility
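
The height and weight example can be illustrated numerically. The following Python sketch is not taken from the book: it simulates a sample of children from a hypothetical linear growth model (the coefficients, noise levels, sample size and random seed are all assumptions chosen only to give plausible values), and then compares the suspect observation (175 cm, 25 kg) on each variable separately and on the PCs of the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical growth model for children aged 5 to 15.
# All coefficients and noise levels are illustrative assumptions.
n = 500
age = rng.uniform(5, 15, size=n)
height = 80.0 + 6.0 * age + rng.normal(0.0, 6.0, size=n)  # cm
weight = 1.0 + 3.5 * age + rng.normal(0.0, 4.0, size=n)   # kg
X = np.column_stack([height, weight])

# The suspect observation from the text: tall but very light.
suspect = np.array([175.0, 25.0])

# Standardize using the mean and standard deviation of the sample.
mean = X.mean(axis=0)
sd = X.std(axis=0, ddof=1)
Z = (X - mean) / sd
z_suspect = (suspect - mean) / sd

# PCs of the correlation matrix; eigh returns eigenvalues in ascending
# order, so column 0 of eigvecs is the low-variance (last) PC.
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)

# PC scores of the suspect, each divided by its PC's standard deviation.
std_pc_scores = (z_suspect @ eigvecs) / np.sqrt(eigvals)

print("z-scores on (height, weight):", np.round(z_suspect, 2))
print("standardized score on low-variance PC:", round(abs(std_pc_scores[0]), 2))
print("standardized score on high-variance PC:", round(abs(std_pc_scores[1]), 2))
```

Under these assumptions, the suspect observation lies within three standard deviations of the mean on each variable separately, whereas its standardized score on the low-variance PC, which measures departure from the positive correlation between height and weight, is much larger.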
