Jolliffe I., Principal Component Analysis (2nd ed., Springer, 2002)

10.2. Influential Observations in a Principal Component Analysis

...change to a covariance matrix may change one of the eigenvalues without affecting the others, but that this cannot happen for a correlation matrix. For a correlation matrix the sum of the eigenvalues is a constant, so that if one of them is changed there must be compensatory changes in at least one of the others.

Expressions for $I(x; \alpha_k)$ are more complicated than those for $I(x; \lambda_k)$; for example, for covariance matrices we have

$$I(x; \alpha_k) = -z_k \sum_{\substack{h=1 \\ h \neq k}}^{p} z_h \alpha_h (\lambda_h - \lambda_k)^{-1} \qquad (10.2.4)$$

compared with (10.2.2) for $I(x; \lambda_k)$. A number of comments can be made concerning (10.2.4) and the corresponding expression for correlation matrices, which is

$$I(x; \alpha_k) = \sum_{\substack{h=1 \\ h \neq k}}^{p} \alpha_h (\lambda_h - \lambda_k)^{-1} \sum_{i=1}^{p} \sum_{\substack{j=1 \\ j \neq i}}^{p} \alpha_{hi} \alpha_{kj} \, I(x; \rho_{ij}). \qquad (10.2.5)$$

First, and perhaps most important, the form of the expression is completely different from that for $I(x; \lambda_k)$. It is possible for an observation to be influential for $\lambda_k$ but not for $\alpha_k$, and vice versa. This behaviour is illustrated by the examples in Section 10.2.1 below.

A second, related point is that for covariance matrices $I(x; \alpha_k)$ depends on all of the PCs, $z_1, z_2, \ldots, z_p$, unlike $I(x; \lambda_k)$, which depends just on $z_k$. The dependence is quadratic, but involves only cross-product terms $z_j z_k$, $j \neq k$, and not linear or squared terms. The general shape of the influence curves $I(x; \alpha_k)$ is hyperbolic for both covariance and correlation matrices, but the details of the functions are different. The dependence of both (10.2.4) and (10.2.5) on eigenvalues is through $(\lambda_h - \lambda_k)^{-1}$. This means that influence, and hence changes to $\alpha_k$ resulting from small perturbations to the data, tend to be large when $\lambda_k$ is close to $\lambda_{k-1}$ or to $\lambda_{k+1}$.

A final point is that, unlike regression, the influence of different observations in PCA is approximately additive; that is, the presence of one observation does not affect the influence of another (Calder, 1986; Tanaka and Tarumi, 1987).

To show that theoretical influence functions are relevant to sample data, predictions from the theoretical influence function can be compared with the sample influence function, which measures the actual changes caused by deleting one observation at a time from a data set. The theoretical influence function typically contains unknown parameters, and these must be replaced by equivalent sample quantities in such comparisons. This gives what Critchley (1985) calls the empirical influence function. He also considers a third sample-based influence function, the deleted empirical influence function, in which the unknown quantities in the theoretical influence function are estimated using a sample from which the observation whose influence is to be assessed is omitted.
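To make this comparison concrete, here is a minimal numerical sketch (assuming NumPy; the simulated data, variable names, and the $(n-1)$ scaling convention used for the sample influence function are illustrative choices, not taken from the book). It evaluates the empirical influence function for an eigenvector by plugging sample eigenvalues, eigenvectors, and PC scores into (10.2.4), and compares the result with the sample influence function obtained by actually deleting one observation at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])  # toy data
n, p = X.shape

def eig_desc(S):
    """Eigenvalues and eigenvectors of a symmetric matrix, descending order."""
    lam, A = np.linalg.eigh(S)
    return lam[::-1], A[:, ::-1]

lam, A = eig_desc(np.cov(X, rowvar=False))
xbar = X.mean(axis=0)
k = 0  # study influence on the first eigenvector, alpha_1

for i in range(5):  # first few observations
    # Empirical influence function: sample quantities plugged into (10.2.4),
    # I(x; alpha_k) = -z_k * sum_{h != k} z_h * alpha_h / (lambda_h - lambda_k).
    z = A.T @ (X[i] - xbar)  # PC scores of observation i
    emp = np.zeros(p)
    for h in range(p):
        if h != k:
            emp -= z[k] * z[h] * A[:, h] / (lam[h] - lam[k])

    # Sample influence function: the actual change in alpha_k when observation
    # i is deleted, scaled by (n - 1) to be comparable with the theory.
    _, A_i = eig_desc(np.cov(np.delete(X, i, axis=0), rowvar=False))
    a_del = A_i[:, k] * np.sign(A_i[:, k] @ A[:, k])  # resolve sign ambiguity
    samp = (n - 1) * (A[:, k] - a_del)

    cos = emp @ samp / (np.linalg.norm(emp) * np.linalg.norm(samp))
    print(f"obs {i}: cos(empirical, sample) = {cos:+.3f}")
```

With well-separated eigenvalues the two vectors should line up closely (cosine near $+1$); the agreement degrades as $\lambda_k$ approaches a neighbouring eigenvalue, which is exactly the $(\lambda_h - \lambda_k)^{-1}$ sensitivity noted above.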

The first example given in Section 10.2.1 below illustrates that the empirical influence function can give a good approximation to the sample influence function for moderate sample sizes. Critchley (1985) compares the various influence functions from a more theoretical viewpoint.

A considerable amount of work was done in the late 1980s and early 1990s on influence functions in multivariate analysis, some of which extends the basic results for PCA given earlier in this section. Benasseni in France, and Tanaka and co-workers in Japan, were particularly active in various aspects of influence and sensitivity for a wide range of multivariate techniques. Some of their work on sensitivity will be discussed further in Section 10.3.

Tanaka (1988) extends earlier work on influence in PCA in two related ways. The first is to consider explicitly the situation where there are equal eigenvalues, in which case equations (10.2.4) and (10.2.5) break down. Secondly, he considers influence functions for subspaces spanned by subsets of PCs, not simply individual PCs. Specifically, if $A_q$ is a matrix whose columns are a subset of $q$ eigenvectors, and $\Lambda_q$ is the diagonal matrix of corresponding eigenvalues, Tanaka (1988) finds expressions for $I(x; A_q \Lambda_q A_q')$ and $I(x; A_q A_q')$. In discussing a general strategy for analysing influence in multivariate methods, Tanaka (1995) suggests that groups of observations with similar patterns of influence across a set of parameters may be detected by means of a PCA of the empirical influence functions for each parameter.

Benasseni (1990) examines a number of measures for comparing principal component subspaces computed with and without one of the observations. After eliminating some possible measures, such as the RV-coefficient (Robert and Escoufier, 1976) and Yanai's generalized coefficient of determination (Yanai, 1980), as being too insensitive to perturbations, he settles on

$$\rho_{1(i)} = 1 - \sum_{k=1}^{q} \frac{\|a_k - P_{(i)} a_k\|}{q}$$

and

$$\rho_{2(i)} = 1 - \sum_{k=1}^{q} \frac{\|a_{k(i)} - P a_{k(i)}\|}{q},$$

where $a_k$, $a_{k(i)}$ are eigenvectors with and without the $i$th observation, $P$, $P_{(i)}$ are projection matrices onto the subspaces derived with and without the $i$th observation, and the summation is over the $q$ eigenvectors within the subspace of interest. Benasseni (1990) goes on to find expressions for the theoretical influence functions for these two quantities, which can then be used to compute empirical influences.

Reducing the comparison of two subspaces to a single measure inevitably leads to a loss of information about the structure of the differences between the two subspaces.
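Benasseni's two measures are straightforward to compute directly from leave-one-out eigendecompositions. The sketch below assumes NumPy; the helper names (pc_basis, benasseni_rho) and the toy data are illustrative, not from the book:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])  # toy data

def pc_basis(X, q):
    """Leading q eigenvectors of the sample covariance matrix (as columns)."""
    _, A = np.linalg.eigh(np.cov(X, rowvar=False))
    return A[:, ::-1][:, :q]

def benasseni_rho(X, q, i):
    """rho_1(i) and rho_2(i) for the leading q-dimensional PC subspace."""
    Aq = pc_basis(X, q)                           # with all observations
    Aq_i = pc_basis(np.delete(X, i, axis=0), q)   # without observation i
    P, P_i = Aq @ Aq.T, Aq_i @ Aq_i.T             # projection matrices
    rho1 = 1 - sum(np.linalg.norm(Aq[:, k] - P_i @ Aq[:, k]) for k in range(q)) / q
    rho2 = 1 - sum(np.linalg.norm(Aq_i[:, k] - P @ Aq_i[:, k]) for k in range(q)) / q
    return rho1, rho2

# Both values are close to 1 when deleting observation i barely moves the subspace.
print(benasseni_rho(X, q=2, i=0))
```

Note that $\|a - Pa\|$ is unchanged when $a$ is replaced by $-a$, so the usual sign ambiguity of eigenvectors does not affect either measure; this is one practical attraction of working with projections rather than with individual eigenvector differences.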
