12.07.2015 Views

Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)



13.6. Principal Component Analysis in the Presence of Missing Data

of prior knowledge about q, there is, at present, no procedure for choosing its value without repeating the analysis for a range of values.

Most published work, including Little and Rubin (1987), does not explicitly deal with PCA, but with the estimation of covariance matrices in general. Tipping and Bishop (1999a) is one of relatively few papers that focus specifically on PCA when discussing missing data. Another is Wiberg (1976). His approach is via the singular value decomposition (SVD), which gives a least squares approximation of rank m to the data matrix X. In other words, the approximation ${}_m\tilde{x}_{ij}$ minimizes

$$\sum_{i=1}^{n} \sum_{j=1}^{p} \left( {}_m x_{ij} - x_{ij} \right)^2,$$

where ${}_m x_{ij}$ is any rank m approximation to $x_{ij}$ (see Section 3.5). Principal components can be computed from the SVD (see Section 3.5 and Appendix A1). With missing data, Wiberg (1976) suggests minimizing the same quantity, but with the summation only over values of (i, j) for which $x_{ij}$ is not missing; PCs can then be estimated from the modified SVD. The same idea is implicitly suggested by Gabriel and Zamir (1979). Wiberg (1976) reports that for simulated multivariate normal data his method is slightly worse than the method based on maximum likelihood estimation.
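The criterion just described, a sum of squares restricted to the non-missing (i, j), can be illustrated with a short sketch. It assumes that missing values are coded as NaN and uses plain alternating least squares, a stand-in chosen for brevity rather than Wiberg's own algorithm. The names `masked_low_rank` and `masked_sse` and the parameters `n_iter`, `ridge` and `seed` are hypothetical and do not come from the text.

```python
import numpy as np


def masked_low_rank(X, m, n_iter=200, ridge=1e-8, seed=0):
    """Rank-m fit that minimizes squared error over observed entries only.

    Missing entries of X are assumed to be coded as NaN. The fit uses
    alternating least squares, an illustrative choice, not Wiberg's algorithm.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    n, p = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, m))  # row factors, n x m
    B = rng.standard_normal((p, m))  # column factors, p x m
    reg = ridge * np.eye(m)          # tiny ridge keeps each solve well-posed
    for _ in range(n_iter):
        # Update row i of A from the columns observed in row i.
        for i in range(n):
            cols = observed[i]
            Bi = B[cols]
            A[i] = np.linalg.solve(Bi.T @ Bi + reg, Bi.T @ X[i, cols])
        # Update row j of B from the rows observed in column j.
        for j in range(p):
            rows = observed[:, j]
            Aj = A[rows]
            B[j] = np.linalg.solve(Aj.T @ Aj + reg, Aj.T @ X[rows, j])
    return A @ B.T


def masked_sse(X, X_hat):
    """Sum of squared residuals over the non-missing entries of X."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    return float(np.sum((X_hat[observed] - X[observed]) ** 2))
```

With complete data the same criterion is minimized by truncating the SVD of X to its first m terms, so the sketch only matters when some entries are missing. One possible follow-up, again an assumption rather than something the text prescribes, is to column-centre the fitted matrix and take its SVD to obtain estimated PCs.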
However, his method has the virtue that it can be used regardless of whether or not the data come from a multivariate normal distribution.

For the specialized use of PCA in analysing residuals from an additive model for data from designed experiments (see Section 13.4), Freeman (1975) shows that incomplete data can be easily handled, although modifications to procedures for deciding the rank of the model are needed. Michailidis and de Leeuw (1998) note three ways of dealing with missing data in non-linear multivariate analysis, including non-linear PCA (Section 14.1).

A special type of ‘missing’ data occurs when observations or variables correspond to different times or different spatial locations, but with irregular spacing between them. In the common atmospheric science set-up, where variables correspond to spatial locations, Karl et al. (1982) examine differences between PCAs when locations are on a regularly spaced grid, and when they are irregularly spaced. Unsurprisingly, for the irregular data the locations in areas with the highest density of measurements tend to increase their loadings on the leading PCs, compared to the regularly spaced data. This is because of the larger correlations observed in the high-density regions. Kaplan et al. (2001) discuss methodology based on PCA for interpolating spatial fields (see Section 12.4.4). Such interpolation is, in effect, imputing missing data.

Another special type of data in which some values are missing occurs when candidates choose to take a subset of p′ out of p examinations, with different candidates choosing different subsets. Scores on examinations not
