the adjusted multiple coefficient of determination,
\[
\bar{R}^2 = 1 - \frac{n-1}{n-p-1}\,(1 - R^2),
\]
where $R^2$ is the usual multiple coefficient of determination (squared multiple correlation) for the regression equation obtained from each subset $M$ of interest. The 'best' subset is then the one that maximizes $\bar{R}^2$. Lott demonstrates that this very simple procedure works well in a limited simulation study. Soofi (1988) uses a Bayesian approach to define the gain of information from the data about the $i$th element $\gamma_i$ of $\gamma$. The subset $M$ is chosen to consist of the integers corresponding to components with the largest values of this measure of information. Soofi shows that the measure combines the variance accounted for by a component with its correlation with the dependent variable.

It is difficult to give any general advice regarding the choice of a decision rule for determining $M$. It is clearly inadvisable to base the decision entirely on the size of variance; conversely, inclusion of highly predictive PCs can also be dangerous if they also have very small variances, because of the resulting instability of the estimated regression equation. Use of MSE criteria provides a number of compromise solutions, but they are essentially arbitrary.

What PC regression can do, which least squares cannot, is to indicate explicitly whether a problem exists with respect to the removal of multicollinearity, that is, whether instability in the regression coefficients can only be removed by simultaneously losing a substantial proportion of the predictability of $y$. An extension of the cross-validation procedure of Mertens et al. (1995) to general subsets $M$ would provide a less arbitrary way than most of deciding which PCs to keep, but the choice of $M$ for PC regression remains an open question.

8.3 Some Connections Between Principal Component Regression and Other Biased Regression Methods

Using the expressions (8.1.8), (8.1.9) for $\hat{\beta}$ and its variance-covariance matrix, it was seen in the previous section that deletion of the last few terms from the summation for $\hat{\beta}$ can dramatically reduce the high variances of elements of $\hat{\beta}$ caused by multicollinearities. However, if any of the elements of $\gamma$ corresponding to deleted components are non-zero, then the PC estimator $\tilde{\beta}$ for $\beta$ is biased. Various other methods of biased estimation that aim to remove collinearity-induced high variances have also been proposed. A full description of these methods will not be given here, as several do not involve PCs directly, but there are various relationships between PC regression and other biased regression methods which will be briefly discussed.
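As a concrete illustration, the following minimal numpy sketch (not taken from the text) computes the PC regression estimator for a nested subset $M = \{1, \ldots, m\}$ and applies a Lott-style adjusted-$R^2$ rule to choose $m$. The function names and the variables `X`, `y` and `m` are illustrative; `X` and `y` are assumed centred; attention is restricted to nested subsets for brevity, whereas the rule described above applies to general subsets $M$; and in the adjusted-$R^2$ formula the number of retained components stands in for $p$ in each candidate fit.

```python
import numpy as np

def pc_estimator(X, y, m):
    """PC regression estimator keeping components 1..m:
    beta_tilde = sum_{k<=m} l_k^{-1} a_k a_k' X' y."""
    l, A = np.linalg.eigh(X.T @ X)      # eigenvalues l_k, eigenvectors a_k of X'X
    order = np.argsort(l)[::-1]         # sort so that l_1 >= ... >= l_p
    l, A = l[order], A[:, order]
    Xty = X.T @ y
    return A[:, :m] @ ((A[:, :m].T @ Xty) / l[:m])

def choose_m_by_adjusted_r2(X, y):
    """Keep the m that maximizes adjusted R^2 over the nested subsets {1,...,m}
    (m plays the role of p in the formula for each candidate fit)."""
    n, p = X.shape
    tss = y @ y                          # total sum of squares (y centred)
    best_m, best_adj = 1, -np.inf
    for m in range(1, p + 1):
        resid = y - X @ pc_estimator(X, y, m)
        r2 = 1 - (resid @ resid) / tss
        adj = 1 - (n - 1) / (n - m - 1) * (1 - r2)
        if adj > best_adj:
            best_m, best_adj = m, adj
    return best_m
```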

Consider first ridge regression, which was described by Hoerl and Kennard (1970a,b) and which has since been the subject of much debate in the statistical literature. The estimator of $\beta$ using this technique can be written, among other ways, as
\[
\hat{\beta}_R = \sum_{k=1}^{p} (l_k + \kappa)^{-1} a_k a_k' X' y,
\]
where $\kappa$ is some fixed positive constant and the other terms in the expression have the same meaning as in (8.1.8). The variance-covariance matrix of $\hat{\beta}_R$ is equal to
\[
\sigma^2 \sum_{k=1}^{p} l_k (l_k + \kappa)^{-2} a_k a_k'.
\]
Thus, ridge regression estimators have rather similar expressions to those for least squares and PC estimators, but variance reduction is achieved not by deleting components, but by reducing the weight given to the later components. A generalization of ridge regression has $p$ constants $\kappa_k$, $k = 1, 2, \ldots, p$, that must be chosen, rather than a single constant $\kappa$.

A modification of PC regression, due to Marquardt (1970), uses a similar, but more restricted, idea. Here a PC regression estimator of the form (8.1.10) is adapted so that $M$ includes the first $m$ integers and excludes the integers $m+2, m+3, \ldots, p$, but includes the term corresponding to integer $(m+1)$ with a weighting less than unity. Detailed discussion of such estimators is given by Marquardt (1970).

Ridge regression estimators 'shrink' the least squares estimators towards the origin, and so are similar in effect to the shrinkage estimators proposed by Stein (1960) and Sclove (1968). These latter estimators start with the idea of shrinking some or all of the elements of $\hat{\gamma}$ (or $\hat{\beta}$) using arguments based on loss functions, admissibility and prior information; choice of shrinkage constants is based on optimization of MSE criteria. Partial least squares regression is sometimes viewed as another class of shrinkage estimators. However, Butler and Denham (2000) show that it has peculiar properties, shrinking some of the elements of $\hat{\gamma}$ but inflating others.

All these various biased estimators are related to one another. In particular, all the present estimators, as well as latent root regression, which is discussed in the next section along with partial least squares, can be viewed as optimizing $(\tilde{\beta} - \beta)' X' X (\tilde{\beta} - \beta)$, subject to different constraints for different estimators (see Hocking (1976)). If the data set is augmented by a set of dummy observations, and least squares is used to estimate $\beta$ from the augmented data, Hocking (1976) demonstrates further that ridge, generalized ridge, PC regression, Marquardt's modification and shrinkage estimators all appear as special cases for particular choices of the dummy observations.
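The summation form of $\hat{\beta}_R$ given above is algebraically identical to the more familiar closed form $(X'X + \kappa I)^{-1} X'y$, since $(X'X + \kappa I)^{-1} = \sum_k (l_k + \kappa)^{-1} a_k a_k'$. A small numpy sketch (not from the text; variable names are illustrative, and `X`, `y` are assumed centred) that computes both forms and can be used to check that they agree:

```python
import numpy as np

def ridge_spectral(X, y, kappa):
    """Ridge estimator in the spectral (summation) form:
    beta_R = sum_k (l_k + kappa)^{-1} a_k a_k' X' y."""
    l, A = np.linalg.eigh(X.T @ X)        # l_k, a_k: eigenpairs of X'X
    return A @ ((A.T @ (X.T @ y)) / (l + kappa))

def ridge_direct(X, y, kappa):
    """The same estimator written as (X'X + kappa I)^{-1} X' y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + kappa * np.eye(p), X.T @ y)

# For any centred X, y and kappa > 0 the two agree up to rounding error:
#     np.allclose(ridge_spectral(X, y, 0.1), ridge_direct(X, y, 0.1))
```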

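The estimators discussed in this section differ only in how they weight the terms of the summation: least squares gives every component weight one, PC regression gives weights of one or zero, ridge regression shrinks component $k$ by $l_k/(l_k+\kappa)$, and Marquardt's modification gives full weight to the first $m$ components and a fractional weight to component $(m+1)$. The sketch below (not from the text; the values of `kappa`, `m` and `frac` are illustrative, and `X`, `y` are assumed centred) writes all four in the common form $\tilde{\beta} = \sum_k w_k \hat{\gamma}_k a_k$, where $\hat{\gamma}_k = l_k^{-1} a_k' X'y$, so that only the weights $w_k$ differ.

```python
import numpy as np

def weighted_pc_estimators(X, y, kappa=1.0, m=2, frac=0.5):
    """Write several biased estimators in the common form
        beta = sum_k w_k * gammahat_k * a_k,
    where gammahat_k = a_k' X' y / l_k is the least squares coefficient
    on the k-th principal component.  kappa, m and frac are illustrative."""
    l, A = np.linalg.eigh(X.T @ X)
    order = np.argsort(l)[::-1]            # l_1 >= ... >= l_p
    l, A = l[order], A[:, order]
    p = len(l)
    gammahat = (A.T @ (X.T @ y)) / l

    weights = {
        "least squares": np.ones(p),
        "PC regression": np.r_[np.ones(m), np.zeros(p - m)],
        "ridge":         l / (l + kappa),
        "Marquardt":     np.r_[np.ones(m), [frac], np.zeros(p - m - 1)],
    }
    return {name: A @ (w * gammahat) for name, w in weights.items()}
```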
