Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)

13.6 Principal Component Analysis in the Presence of Missing Data

In all the examples given in this text, the data sets are complete. However, it is not uncommon, especially for large data sets, for some of the values of some of the variables to be missing. The most usual way of dealing with such a situation is to delete, entirely, any observation for which at least one of the variables has a missing value. This is satisfactory if missing values are few, but clearly wasteful of information if a high proportion of observations have missing values for just one or two variables. To meet this problem, a number of alternatives have been suggested.

The first step in a PCA is usually to compute the covariance or correlation matrix, so interest often centres on estimating these matrices in the presence of missing data. There are a number of what Little and Rubin (1987, Chapter 3) call 'quick' methods. One option is to compute the (j, k)th correlation or covariance element-wise, using all observations for which the values of both x_j and x_k are available. Unfortunately, this leads to covariance or correlation matrices that are not necessarily positive semi-definite. Beale and Little (1975) note a modification of this option. When computing the summation ∑_i (x_ij − x̄_j)(x_ik − x̄_k) in the covariance or correlation matrix, x̄_j, x̄_k are calculated from all available values of x_j, x_k, respectively, instead of only from observations for which both x_j and x_k have values present. They state that, at least in the regression context, the results can be unsatisfactory. However, Mehrotra (1995), in discussing robust estimation of covariance matrices (see Section 10.4), argues that the problem of a possible lack of positive semi-definiteness is less important than making efficient use of as many data as possible. He therefore advocates element-wise estimation of the variances and covariances in a covariance matrix, with possible adjustment if positive semi-definiteness is lost.

Another quick method is to replace missing values for variable x_j by the mean value x̄_j, calculated from the observations for which the value of x_j is available. This is a simple way of 'imputing' rather than ignoring missing values. A more sophisticated method of imputation is to use regression of the missing variables on the available variables, case by case. An extension of the idea of imputing missing values is multiple imputation. Each missing value is replaced by a value drawn from a probability distribution, and this procedure is repeated M times (Little and Rubin, 1987, Section 12.4; Schafer, 1997, Section 4.3). The analysis, in our case PCA, is then done M times, corresponding to each of the M different sets of imputed values. The variability in the results of the analyses gives an indication of the uncertainty associated with the presence of missing values.

A different class of procedures is based on maximum likelihood estimation (Little and Rubin, 1987, Section 8.2). The well-known EM algorithm (Dempster et al., 1977) can easily cope with maximum likelihood estimation
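The element-wise ('quick') covariance computation, and the way it can lose positive semi-definiteness, can be sketched as follows. This is a minimal illustration in Python/NumPy under invented data; it is not the book's own computation, and the function name is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 observations of 3 variables, with ~20% of entries missing (NaN).
X = rng.normal(size=(20, 3))
X[rng.random(X.shape) < 0.2] = np.nan

def pairwise_cov(X):
    """Element-wise covariance: entry (j, k) uses only the observations for
    which both x_j and x_k are present, with means computed from those same
    observations. The resulting matrix need not be positive semi-definite."""
    p = X.shape[1]
    C = np.empty((p, p))
    for j in range(p):
        for k in range(p):
            both = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            xj, xk = X[both, j], X[both, k]
            # Means taken over the jointly observed cases only
            C[j, k] = np.mean((xj - xj.mean()) * (xk - xk.mean()))
    return C

C = pairwise_cov(X)
# A negative value here would signal loss of positive semi-definiteness
print(np.linalg.eigvalsh(C))
```

Beale and Little's modification would instead compute each mean x̄_j from *all* available values of x_j, not only from the cases jointly observed with x_k.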
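Mean imputation and the multiple-imputation idea can be sketched in the same style. The independent normal draws below are a deliberately crude stand-in for a proper imputation model of the joint distribution (such as those in Schafer, 1997); data and names are again invented:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X[rng.random(X.shape) < 0.1] = np.nan

# Mean imputation: replace each missing x_j by the mean of the observed x_j.
col_means = np.nanmean(X, axis=0)
X_mean = np.where(np.isnan(X), col_means, X)

# Crude multiple imputation: draw each missing value from a normal with the
# column's observed mean and standard deviation, complete the data set, do the
# analysis (here, the largest eigenvalue of the sample covariance matrix, i.e.
# the variance of the first PC), and repeat M times.
M = 5
col_sds = np.nanstd(X, axis=0)
first_eigvals = []
for m in range(M):
    draws = rng.normal(col_means, col_sds, size=X.shape)
    Xm = np.where(np.isnan(X), draws, X)
    S = np.cov(Xm, rowvar=False)
    first_eigvals.append(np.linalg.eigvalsh(S)[-1])  # eigvalsh sorts ascending

# The spread across the M analyses indicates the uncertainty due to missingness
print(np.std(first_eigvals))
```

The point of repeating the PCA M times is exactly the one in the text: the between-imputation variability of any quantity of interest (here, the first eigenvalue) reflects the uncertainty introduced by the missing values.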
