Jolliffe I. Principal Component Analysis (2nd ed., Springer, 2002)

10. Outlier Detection, Influential Observations and Robust Estimation

…suggestions are illustrated with an example, but no indication is given of whether the first few or last few PCs are more likely to be useful; his example has only three predictor variables, so it is easy to look at all possible plots. Mason and Gunst (1985) refer to outliers among the predictor variables as leverage points. They recommend constructing scatter plots of the first few PCs normalized to have unit variance, and claim that such plots are often effective in detecting leverage points that cluster and leverage points that are extreme in two or more dimensions (a sketch of such a plot is given at the end of this section). In the case of multivariate regression, another possibility for detecting outliers (Gnanadesikan and Kettenring, 1972) is to look at the PCs of the (multivariate) residuals from the regression analysis (see the second sketch below).

Peña and Yohai (1999) propose a PCA on a matrix of regression diagnostics that is also useful in detecting outliers in multiple regression. Suppose that a sample of n observations is available for the analysis. Then an (n × n) matrix can be calculated whose (h, i)th element is the difference $\hat{y}_h - \hat{y}_{h(i)}$ between the predicted value of the dependent variable y for the hth observation when all n observations are used in the regression, and when (n − 1) observations are used with the ith observation omitted. Peña and Yohai (1999) refer to this as a sensitivity matrix and seek a unit-length vector such that the sum of squared lengths of the projections of the rows of the matrix onto that vector is maximized. This leads to the first principal component of the sensitivity matrix, and subsequent components can be found in the usual way. Peña and Yohai (1999) call these components principal sensitivity components and show that they also represent directions that maximize standardized changes to the vector of regression coefficients. The definition and properties of principal sensitivity components mean that high-leverage outliers are likely to appear as extremes on at least one of the first few components (the third sketch below illustrates the computation).

Lu et al. (1997) also advocate the use of the PCs of a matrix of regression diagnostics. In their case the matrix is what they call the standardized influence matrix (SIM). If a regression equation has p unknown parameters and n observations with which to estimate them, a (p × n) influence matrix can be formed whose (j, i)th element is a standardized version of the theoretical influence function (see Section 10.2) for the jth parameter evaluated for the ith observation. Leaving aside the technical details, the so-called complement of the standardized influence matrix (SIM$^c$) can be viewed as a covariance matrix for the 'data' in the influence matrix. Lu et al. (1997) show that finding the PCs of these standardized data, and hence the eigenvalues and eigenvectors of SIM$^c$, can identify outliers and influential points and give insights into the structure of that influence. Sample versions of SIM and SIM$^c$ are given, as are illustrations of their use.

Another specialized field in which the use of PCs has been proposed in order to detect outlying observations is that of statistical process control, which is the subject of Section 13.7. A different way of using PCs to detect …
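Mason and Gunst's plot is straightforward to produce in practice. The following is a minimal sketch in Python/NumPy, not taken from any of the papers cited above; the function name, the simulated data, and the choice of two components are all illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def unit_variance_pc_scores(X, k=2):
    """Scores on the first k PCs of X, rescaled to unit variance.

    Illustrative helper (not from the book): column-centre X, take its
    SVD, and divide each score vector by its sample standard deviation.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:k].T                       # PC scores
    return scores / scores.std(axis=0, ddof=1)   # unit-variance scores

# Scatter plot of the first two unit-variance PCs of the predictors;
# clustered or extreme points are candidate leverage points.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:3] += 4                                       # plant a small leverage cluster
Z = unit_variance_pc_scores(X, k=2)
plt.scatter(Z[:, 0], Z[:, 1])
plt.xlabel("PC1 (unit variance)")
plt.ylabel("PC2 (unit variance)")
plt.show()
```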
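For the Gnanadesikan and Kettenring (1972) suggestion, the same machinery applies to the residual matrix of a multivariate regression rather than to the raw responses. Again a minimal sketch under stated assumptions (a least-squares fit with an intercept; the function name is ours):

```python
import numpy as np

def residual_pc_scores(X, Y, k=2):
    """PC scores of the residuals from a multivariate least-squares fit.

    Illustrative sketch: regress each column of Y on X (with intercept),
    then do a PCA of the (n x q) residual matrix; outliers are expected
    to be extreme on the first few residual PCs.
    """
    n = len(Y)
    X1 = np.column_stack([np.ones(n), X])        # design with intercept
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # coefficient matrix
    E = Y - X1 @ B                               # multivariate residuals
    Ec = E - E.mean(axis=0)                      # centring (harmless here)
    U, s, Vt = np.linalg.svd(Ec, full_matrices=False)
    return Ec @ Vt[:k].T                         # residual PC scores
```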
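Finally, the Peña and Yohai (1999) construction can be sketched in a few lines. The code below is an illustration rather than the authors' implementation: it avoids refitting n regressions by using the standard case-deletion identity $\hat{y}_h - \hat{y}_{h(i)} = H_{hi}\, e_i / (1 - H_{ii})$, where H is the hat matrix and e the vector of OLS residuals.

```python
import numpy as np

def principal_sensitivity_components(X, y, k=2):
    """Scores of the observations on the first k principal sensitivity
    components of Pena and Yohai (1999) -- an illustrative sketch.

    The (h, i)th element of the sensitivity matrix T is yhat_h - yhat_h(i),
    computed via the case-deletion identity
        yhat_h - yhat_h(i) = H[h, i] * e[i] / (1 - H[i, i]),
    where H is the hat matrix and e the OLS residual vector.
    """
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])         # design with intercept
    H = X1 @ np.linalg.solve(X1.T @ X1, X1.T)     # hat matrix
    e = y - H @ y                                 # OLS residuals
    T = H * (e / (1.0 - np.diag(H)))              # scales column i of H
    # Unit vectors maximizing the sum of squared projections of the rows
    # of T are the eigenvectors of T'T; the projections themselves are
    # the principal sensitivity components.
    vals, vecs = np.linalg.eigh(T.T @ T)
    order = np.argsort(vals)[::-1]                # largest eigenvalues first
    return T @ vecs[:, order[:k]]                 # one score per observation
```

High-leverage outliers should then appear as extreme coordinates in the first few columns of the returned score matrix, in line with the property noted above.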
