Jolliffe, I. Principal Component Analysis (2nd ed., Springer, 2002)
the adjusted multiple coefficient of determination,
$$
\bar{R}^2 = 1 - \frac{n-1}{n-p-1}\,(1 - R^2),
$$
where $R^2$ is the usual multiple coefficient of determination (squared multiple correlation) for the regression equation obtained from each subset $M$ of interest. The 'best' subset is then the one that maximizes $\bar{R}^2$. Lott demonstrates that this very simple procedure works well in a limited simulation study. Soofi (1988) uses a Bayesian approach to define the gain of information from the data about the $i$th element $\gamma_i$ of $\gamma$. The subset $M$ is chosen to consist of the integers corresponding to components with the largest values of this measure of information. Soofi shows that the measure combines the variance accounted for by a component with its correlation with the dependent variable.

It is difficult to give any general advice regarding the choice of a decision rule for determining $M$. It is clearly inadvisable to base the decision entirely on the size of variance; conversely, inclusion of highly predictive PCs can also be dangerous if they also have very small variances, because of the resulting instability of the estimated regression equation. Use of MSE criteria provides a number of compromise solutions, but they are essentially arbitrary.

What PC regression can do, which least squares cannot, is to indicate explicitly whether a problem exists with respect to the removal of multicollinearity, that is, whether instability in the regression coefficients can only be removed by simultaneously losing a substantial proportion of the predictability of $y$. An extension of the cross-validation procedure of Mertens et al. (1995) to general subsets $M$ would provide a less arbitrary way than most of deciding which PCs to keep, but the choice of $M$ for PC regression remains an open question.

8.3 Some Connections Between Principal Component Regression and Other Biased Regression Methods

Using the expressions (8.1.8) and (8.1.9) for $\hat{\beta}$ and its variance-covariance matrix, it was seen in the previous section that deletion of the last few terms from the summation for $\hat{\beta}$ can dramatically reduce the high variances of elements of $\hat{\beta}$ caused by multicollinearities. However, if any of the elements of $\gamma$ corresponding to deleted components are non-zero, then the PC estimator $\tilde{\beta}$ for $\beta$ is biased. Various other methods of biased estimation that aim to remove collinearity-induced high variances have also been proposed. A full description of these methods will not be given here, as several do not involve PCs directly, but there are various relationships between PC regression and other biased regression methods, which will be briefly discussed.

Consider first ridge regression, which was described by Hoerl and Kennard (1970a,b) and which has since been the subject of much debate in the statistical literature. The estimator of $\beta$ using the technique can be written, among other ways, as
$$
\hat{\beta}_R = \sum_{k=1}^{p} (l_k + \kappa)^{-1}\, a_k a_k' X' y,
$$
where $\kappa$ is some fixed positive constant and the other terms in the expression have the same meaning as in (8.1.8). The variance-covariance matrix of $\hat{\beta}_R$ is equal to
$$
\sigma^2 \sum_{k=1}^{p} l_k (l_k + \kappa)^{-2}\, a_k a_k'.
$$
Thus, ridge regression estimators have rather similar expressions to those for least squares and PC estimators, but variance reduction is achieved not by deleting components, but by reducing the weight given to the later components. A generalization of ridge regression has $p$ constants $\kappa_k$, $k = 1, 2, \ldots, p$, that must be chosen, rather than a single constant $\kappa$.
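The contrast between deleting and downweighting components can be made concrete with two short numerical sketches. The first is a minimal sketch, not taken from the text, of the PC regression estimator in the spirit of (8.1.8) and (8.1.10): the least squares estimator is written as a sum over components, and the terms outside $M = \{1, \ldots, m\}$ are simply dropped. The function name, its arguments and the choice of NumPy are illustrative assumptions.

```python
import numpy as np

def pc_regression(X, y, m):
    """Sketch of PC regression: retain only the first m principal components.

    Assumes X has been centred (and typically standardised), and orders the
    components by decreasing eigenvalue l_1 >= l_2 >= ... >= l_p.
    """
    # Eigendecomposition of X'X gives eigenvalues l_k and eigenvectors a_k.
    l, A = np.linalg.eigh(X.T @ X)      # eigh returns ascending eigenvalues
    order = np.argsort(l)[::-1]         # reorder to descending
    l, A = l[order], A[:, order]

    # Least squares (cf. (8.1.8)):  beta_hat = sum_k l_k^{-1} a_k a_k' X' y.
    # PC regression (cf. (8.1.10)) deletes the low-variance terms k > m that
    # inflate the variance of beta_hat, at the cost of bias if the
    # corresponding elements of gamma are non-zero.
    Xty = X.T @ y
    beta_tilde = sum((1.0 / l[k]) * np.outer(A[:, k], A[:, k]) @ Xty
                     for k in range(m))
    return beta_tilde
```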
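The second sketch, again purely illustrative rather than an implementation from the text, writes the ridge estimator in the same component form. It makes explicit that ridge regression retains every component but gives component $k$ the weight $l_k/(l_k + \kappa)$ relative to least squares, so the low-variance components are shrunk heavily instead of being deleted.

```python
import numpy as np

def ridge_via_components(X, y, kappa):
    """Sketch of the ridge estimator: sum_k (l_k + kappa)^{-1} a_k a_k' X' y."""
    l, A = np.linalg.eigh(X.T @ X)      # eigenvalues l_k, eigenvectors a_k
    Xty = X.T @ y

    # Every component is kept, but term k is multiplied by l_k / (l_k + kappa)
    # compared with its least squares weight l_k^{-1}.
    beta_R = sum((1.0 / (l[k] + kappa)) * np.outer(A[:, k], A[:, k]) @ Xty
                 for k in range(len(l)))

    # Equivalent closed form (X'X + kappa I)^{-1} X'y, useful as a sanity check.
    beta_R_direct = np.linalg.solve(X.T @ X + kappa * np.eye(X.shape[1]), Xty)
    assert np.allclose(beta_R, beta_R_direct)
    return beta_R
```

As $\kappa \to 0$ the weights approach those of least squares, while increasing $\kappa$ shrinks all components, and especially the low-variance ones, towards zero; deleting a component in PC regression corresponds to giving it zero weight.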
A modification of PC regression, due to Marquardt (1970), uses a similar, but more restricted, idea. Here a PC regression estimator of the form (8.1.10) is adapted so that $M$ includes the first $m$ integers, excludes the integers $m+2, m+3, \ldots, p$, but includes the term corresponding to the integer $(m+1)$ with a weighting less than unity. Detailed discussion of such estimators is given by Marquardt (1970).

Ridge regression estimators 'shrink' the least squares estimators towards the origin, and so are similar in effect to the shrinkage estimators proposed by Stein (1960) and Sclove (1968). These latter estimators start with the idea of shrinking some or all of the elements of $\hat{\gamma}$ (or $\hat{\beta}$) using arguments based on loss functions, admissibility and prior information; the choice of shrinkage constants is based on optimization of MSE criteria. Partial least squares regression is sometimes viewed as another class of shrinkage estimators. However, Butler and Denham (2000) show that it has peculiar properties, shrinking some of the elements of $\hat{\gamma}$ but inflating others.

All these various biased estimators have relationships between them. In particular, all the present estimators, as well as latent root regression, which is discussed in the next section along with partial least squares, can be viewed as optimizing $(\tilde{\beta} - \beta)' X' X (\tilde{\beta} - \beta)$, subject to different constraints for different estimators (see Hocking (1976)). If the data set is augmented by a set of dummy observations, and least squares is used to estimate $\beta$ from the augmented data, Hocking (1976) demonstrates further that ridge, generalized ridge, PC regression, Marquardt's modification and shrinkage estimators all appear as special cases for particular