Jolliffe I.T., Principal Component Analysis (2nd ed., Springer, 2002)


…in some way to make the optimization problem tractable. One choice is to use step functions, which leads back towards Gifi's (1990) system of non-linear PCA. Besse and Ferraty (1995) favour an approach based on splines. They contrast their proposal, in which flexibility of the functional transformation is controlled by the choice of smoothing parameters, with earlier spline-based procedures controlled by the number and positioning of knots (see, for example, van Rijckevorsel (1988) and Winsberg (1988)). Using splines as Besse and Ferraty do is equivalent to adding a roughness penalty function to the quantity to be minimized. This is similar to Besse et al.'s (1997) approach to analysing functional data described in Section 12.3.4 using equation (12.3.6).

As with Gifi's (1990) non-linear PCA, Besse and Ferraty's (1995) proposal is implemented by means of an alternating least squares algorithm and, as in Besse and de Falguerolles (1993) for the linear case (see Section 6.1.5), bootstrapping of residuals from a q-dimensional model is used to decide on the best fit. Here, instead of simply using the bootstrap to choose q, simultaneous optimization with respect to q and with respect to the smoothing parameters which determine the function f(x) is needed. At this stage it might be asked 'where is the PCA in all this?' The name 'PCA' is still appropriate because the q-dimensional subspace is determined by an optimal set of q linear functions of the vector of transformed random variables f(x), and it is these linear functions that are the non-linear PCs.

14.1.2 Additive Principal Components and Principal Curves

Fowlkes and Kettenring (1985) note that one possible objective for transforming data before performing a PCA is to find near-singularities in the transformed data. In other words, $\mathbf{x}' = (x_1, x_2, \ldots, x_p)$ is transformed to $\mathbf{f}'(\mathbf{x}) = (f_1(x_1), f_2(x_2), \ldots, f_p(x_p))$, and we are interested in finding linear functions $\mathbf{a}'\mathbf{f}(\mathbf{x})$ of $\mathbf{f}(\mathbf{x})$ for which $\operatorname{var}[\mathbf{a}'\mathbf{f}(\mathbf{x})] \approx 0$. Fowlkes and Kettenring (1985) suggest looking for a transformation that minimizes the determinant of the correlation matrix of the transformed variables. The last few PCs derived from this correlation matrix should then identify the required near-constant relationships, if any exist.

A similar idea underlies additive principal components, which are discussed in detail by Donnell et al. (1994). The additive principal components take the form $\sum_{j=1}^{p} \phi_j(x_j)$ instead of $\sum_{j=1}^{p} a_j x_j$ in standard PCA and, as with Fowlkes and Kettenring (1985), interest centres on components for which $\operatorname{var}[\sum_{j=1}^{p} \phi_j(x_j)]$ is small. To define a non-linear analogue of PCA there is a choice of either an algebraic definition that minimizes variance, or a geometric definition that optimizes expected squared distance from the additive manifold $\sum_{j=1}^{p} \phi_j(x_j) = \text{const}$. Once we move away from linear PCA, the two definitions lead to different solutions, and Donnell et al. (1994) choose to minimize variance.
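To make the low-variance idea concrete, here is a minimal numerical sketch (not from the book; the simulated data, the hand-picked transformations and all variable names are invented for illustration). After each variable is transformed separately, the smallest eigenvalue of the correlation matrix of the transformed variables is close to zero, and its eigenvector identifies a near-constant combination $\mathbf{a}'\mathbf{f}(\mathbf{x})$ of the kind Fowlkes and Kettenring (1985) aim to uncover.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a hidden non-linear near-singularity:
# exp(x1) - x2 is nearly constant, but no linear combination of x1, x2, x3 is.
n = 500
x1 = rng.normal(size=n)
x2 = np.exp(x1) + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Hypothetical per-variable transformations f_j, fixed by hand for illustration
# (in practice they would be estimated, e.g. by splines or step functions).
transforms = [np.exp, lambda v: v, lambda v: v]
Z = np.column_stack([f(X[:, j]) for j, f in enumerate(transforms)])

# Standardize, form the correlation matrix, and inspect its smallest eigenvalues.
Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0)
R = np.corrcoef(Zs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)   # eigenvalues returned in ascending order

print("eigenvalues of R:", np.round(eigvals, 4))

# An eigenvalue near zero flags a near-constant linear combination of the
# transformed variables, i.e. the near-singularity being sought.
a = eigvecs[:, 0]
low_var_component = Zs @ a
print("variance of flagged component:", low_var_component.var())
```

In practice, of course, the transformations would themselves be chosen by the optimization rather than fixed in advance, which is exactly what the additive principal component problem below formalizes.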

The optimization problem to be solved is then to successively find $p$-variate vectors $\boldsymbol{\phi}^{(k)}$, $k = 1, 2, \ldots$, whose elements are $\phi_j^{(k)}(x_j)$, which minimize
$$\operatorname{var}\Bigl[\sum_{j=1}^{p} \phi_j^{(k)}(x_j)\Bigr]$$
subject to $\sum_{j=1}^{p} \operatorname{var}[\phi_j^{(k)}(x_j)] = 1$ and, for $k > 1$, $l < k$,
$$\sum_{j=1}^{p} \operatorname{cov}\bigl[\phi_j^{(k)}(x_j), \phi_j^{(l)}(x_j)\bigr] = 0.$$
As with linear PCA, this reduces to an eigenvalue problem. The main choice to be made is the set of functions $\phi(\cdot)$ over which optimization is to take place. In an example Donnell et al. (1994) use splines, but their theoretical results are quite general and they discuss other, more sophisticated, smoothers. They identify two main uses for low-variance additive principal components, namely to fit additive implicit equations to data and to identify the presence of 'concurvities,' which play the same rôle and cause the same problems in additive regression as do collinearities in linear regression.

Principal curves are included in the same section as additive principal components despite the insistence by Donnell and coworkers, in a response to discussion of their paper by Flury, that they are very different. One difference is that, although the range of functions allowed in additive principal components is wide, an equation is found relating the variables via the functions $\phi_j(x_j)$, whereas a principal curve is just that, a smooth curve with no necessity for a parametric equation. A second difference is that additive principal components concentrate on low-variance relationships, while principal curves minimize variation orthogonal to the curve.

There is nevertheless a similarity between the two techniques, in that both replace an optimum line or plane produced by linear PCA by an optimal non-linear curve or surface. In the case of principal curves, a smooth one-dimensional curve is sought that passes through the 'middle' of the data set. With an appropriate definition of 'middle,' the first PC gives the best straight line through the middle of the data, and principal curves generalize this using the idea of self-consistency, which was introduced at the end of Section 2.2. We saw there that, for $p$-variate random vectors $\mathbf{x}$, $\mathbf{y}$, the vector of random variables $\mathbf{y}$ is self-consistent for $\mathbf{x}$ if $E[\mathbf{x} \mid \mathbf{y}] = \mathbf{y}$. Consider a smooth curve in the $p$-dimensional space defined by $\mathbf{x}$. The curve can be written $\mathbf{f}(\lambda)$, where $\lambda$ defines the position along the curve, and the vector $\mathbf{f}(\lambda)$ contains the values of the elements of $\mathbf{x}$ for a given value of $\lambda$. A curve $\mathbf{f}(\lambda)$ is self-consistent, that is, a principal curve, if $E[\mathbf{x} \mid \mathbf{f}^{-1}(\mathbf{x}) = \lambda] = \mathbf{f}(\lambda)$, where $\mathbf{f}^{-1}(\mathbf{x})$ is the value of $\lambda$ for which $\|\mathbf{x} - \mathbf{f}(\lambda)\|$ is minimized. What this means intuitively is that, for any given value of $\lambda$, say $\lambda_0$, the average of all values of $\mathbf{x}$ that have $\mathbf{f}(\lambda_0)$ as their closest point on the curve is precisely $\mathbf{f}(\lambda_0)$.
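The self-consistency condition suggests a simple fitting strategy, sketched below (an illustration written for this discussion, not the original principal-curve estimation algorithm of Hastie and Stuetzle, and with a crude running-mean smoother standing in for a proper scatterplot smoother; the function and variable names are invented). Starting from the first linear PC, it alternates two steps: smooth each coordinate of $\mathbf{x}$ against the current $\lambda$ values to approximate $E[\mathbf{x} \mid \lambda]$, then re-project each point onto the resulting discretized curve to update its $\lambda$.

```python
import numpy as np

def principal_curve(X, n_iter=10, span=0.1):
    """Rough sketch of a principal-curve fit by alternating smoothing and
    projection. X is an (n, p) data matrix; the curve is represented by the
    fitted values at the data points, ordered by lambda."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)

    # Initialize lambda with the first linear PC scores (the best straight line).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    lam = Xc @ Vt[0]

    f = np.empty_like(Xc, dtype=float)
    window = max(int(span * n), 5)
    for _ in range(n_iter):
        order = np.argsort(lam)
        # Conditional-expectation step: smooth each coordinate against lambda
        # with a simple running mean, approximating E[x | lambda].
        for j in range(p):
            smoothed = np.convolve(Xc[order, j], np.ones(window) / window, mode="same")
            f[order, j] = smoothed
        # Projection step: lambda for each point is the arc-length position of
        # its nearest point on the current (discretized) curve.
        curve = f[order]
        seg = np.r_[0.0, np.cumsum(np.linalg.norm(np.diff(curve, axis=0), axis=1))]
        nearest = np.argmin(
            ((Xc[:, None, :] - curve[None, :, :]) ** 2).sum(axis=2), axis=1
        )
        lam = seg[nearest]
    return f + X.mean(axis=0), lam

# Usage: noisy points around a half-circle, a shape no straight PC line can follow.
rng = np.random.default_rng(1)
t = rng.uniform(0, np.pi, 300)
X = np.column_stack([np.cos(t), np.sin(t)]) + 0.05 * rng.normal(size=(300, 2))
curve_points, lam = principal_curve(X)
```

Each pass of the loop mirrors the definition above: the smoothing step averages all observations whose closest point on the curve is near a given $\lambda_0$, and the projection step recomputes which point of the curve is closest to each observation.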
