Jolliffe, I. T. (2002). Principal Component Analysis, 2nd edition. Springer.

10.1. Detection of Outliers Using Principal Components

Note that $d_{1i}^2$, computed separately for several populations, is also used in a form of discriminant analysis (SIMCA) by Wold (1976) (see Section 9.1). Mertens et al. (1994) use this relationship to suggest modifications to SIMCA. They investigate variants in which $d_{1i}^2$ is replaced by $d_{2i}^2$, $d_{3i}^2$ or $d_{4i}$ as a measure of the discrepancy between a new observation and a group. In an example they find that $d_{2i}^2$, but not $d_{3i}^2$ or $d_{4i}$, improves the cross-validated misclassification rate compared to that for $d_{1i}^2$.

The exact distributions of $d_{1i}^2$, $d_{2i}^2$, $d_{3i}^2$ and $d_{4i}$ can be deduced if we assume that the observations are from a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$, where $\mu$, $\Sigma$ are both known (see Hawkins (1980, p. 113) for results for $d_{2i}^2$, $d_{4i}$). Both $d_{3i}^2$ and $d_{2i}^2$ when $q = p$, as well as $d_{1i}^2$, have (approximate) gamma distributions if no outliers are present and if normality can be (approximately) assumed (Gnanadesikan and Kettenring, 1972), so that gamma probability plots of $d_{2i}^2$ (with $q = p$) and $d_{3i}^2$ can again be used to look for outliers. However, in practice $\mu$, $\Sigma$ are unknown, and the data will often not have a multivariate normal distribution, so that any distributional results derived under these restrictive assumptions can only be approximations. Jackson (1991, Section 2.7.2) gives a fairly complicated function of $d_{1i}^2$ that has, approximately, a standard normal distribution when no outliers are present.

In order to be satisfactory, such approximations to the distributions of $d_{1i}^2$, $d_{2i}^2$, $d_{3i}^2$, $d_{4i}$ often need not be particularly accurate.
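The gamma probability plot mentioned above can be sketched numerically. The sketch below is illustrative rather than the book's own code: following the text, it takes $d_{2i}^2$ with $q = p$ to be the sum of squared PC scores over all $p$ components (equivalently, the squared Euclidean distance of an observation from the mean), plants one gross outlier in simulated data, and plots the ordered statistics against quantiles of a fitted gamma distribution. The simulated data and all variable names are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data: n observations on p correlated variables, plus one
# planted gross outlier (observation 0), which is an assumption of this sketch.
n, p = 200, 5
A = rng.standard_normal((p, p))
X = rng.standard_normal((n, p)) @ A
X[0] += 40.0  # shift observation 0 far from the rest

# PC scores of the centred data, via the SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T

# d^2_{2i} with q = p: sum of squared scores over all p PCs,
# i.e. squared Euclidean distance from the sample mean.
d2 = (scores ** 2).sum(axis=1)

# Gamma probability plot: ordered statistics against fitted gamma quantiles.
# Points far above the 45-degree line at the upper end are outlier candidates.
shape, loc, scale = stats.gamma.fit(d2, floc=0)
probs = (np.arange(1, n + 1) - 0.5) / n
theo = stats.gamma.ppf(probs, shape, loc=loc, scale=scale)
ordered = np.sort(d2)
print(int(np.argmax(d2)))  # index of the most extreme observation
```

In practice one would plot `theo` against `ordered` and inspect the upper tail; the planted outlier's statistic dwarfs the gamma quantile predicted for it.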
Although there are exceptions, such as detecting possible unusual patient behaviour in safety data from clinical trials (see Penny and Jolliffe, 2001), outlier detection is frequently concerned with finding observations that are blatantly different from the rest, corresponding to very small significance levels for the test statistics. An observation that is 'barely significant at 5%' is typically not of interest, so that there is no great incentive to compute significance levels very accurately. The outliers that we wish to detect should 'stick out like a sore thumb' provided we find the right direction in which to view the data; the problem in multivariate outlier detection is to find appropriate directions. If, on the other hand, identification of less clear-cut outliers is important and multivariate normality cannot be assumed, Dunn and Duncan (2000) propose a procedure, in the context of evaluating habitat suitability, for assessing 'significance' based on the empirical distribution of their test statistics. The statistics they use are individual terms from $d_{2i}^2$.

PCs can be used to detect outliers in any multivariate data set, regardless of the subsequent analysis which is envisaged for that data set. For particular types of data or analysis, other considerations come into play. For multiple regression, Hocking (1984) suggests that plots of PCs derived from $(p + 1)$ variables, consisting of the $p$ predictor variables and the dependent variable, as used in latent root regression (see Section 8.4), tend to reveal outliers together with observations that are highly influential (Section 10.2) for the regression equation. Plots of PCs derived from the predictor variables only also tend to reveal influential observations. Hocking's (1984)
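Hocking's suggestion can be illustrated with a minimal sketch, assuming simulated regression data with one observation that violates the fitted relation. The intuition is that the last (smallest-variance) PC of the $(p+1)$ variables is dominated by the near-linear regression relation, so an observation that breaks that relation receives an extreme score on it. The data, the planted aberrant point, and the choice to inspect only the last PC are all assumptions of this example, not the book's prescription.

```python
import numpy as np

rng = np.random.default_rng(1)

# Regression data: y depends linearly on two predictors, with small noise.
n = 100
X = rng.standard_normal((n, 2))
beta = np.array([2.0, -1.0])
y = X @ beta + 0.1 * rng.standard_normal(n)
y[0] += 10.0  # observation 0 grossly violates the regression relation

# PCs of the (p + 1) variables: the p predictors together with y.
Z = np.column_stack([X, y])
Zc = Z - Z.mean(axis=0)
U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
scores = Zc @ Vt.T

# The smallest-variance PC approximates the linear relation among the
# (p + 1) variables; a large absolute score on it flags an observation
# that does not conform to that relation.
last_pc = np.abs(scores[:, -1])
print(int(np.argmax(last_pc)))  # flags the aberrant observation
```

A scatter plot of the scores on the last one or two PCs would be the graphical version of this check; here the planted observation stands far from the cluster of the rest.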
