Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)
ridge regression, although this latter conclusion is disputed by S. Wold in the published discussion that follows the article.

Naes and Isaksson (1992) use a locally weighted version of PC regression in the calibration of spectroscopic data. PCA is done on the predictor variables, and to form a predictor for a particular observation only the k observations closest to the chosen observation in the space of the first m PCs are used. These k observations are given weights in a regression of the dependent variable on the first m PCs whose values decrease as distance from the chosen observation increases. The values of m and k are chosen by cross-validation, and the technique is shown to outperform both PC regression and PLS.

Bertrand et al. (2001) revisit latent root regression, and replace the PCA of the matrix of (p + 1) variables formed by y together with X by the equivalent PCA of y together with the PC scores Z. This makes it easier to identify predictive and non-predictive multicollinearities, and gives a simple expression for the MSE of the latent root estimator. Bertrand et al. (2001) present their version of latent root regression as an alternative to PLS or PC regression for near infrared spectroscopic data.

Marx and Smith (1990) extend PC regression from linear models to generalized linear models. Straying further from ordinary PCA, Li et al. (2000) discuss principal Hessian directions, which utilize a variety of generalized PCA (see Section 14.2.2) in a regression context. These directions are used to define splits in a regression tree, where the objective is to find directions along which the regression surface 'bends' as much as possible. A weighted covariance matrix S_W is calculated for the predictor variables, where the weights are residuals from a multiple regression of y on all the predictor variables.
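The construction just described can be sketched numerically. The following is a minimal illustration on simulated data (all variable names are hypothetical, not from the original): the residual-weighted covariance matrix S_W is formed, and the first principal Hessian direction is obtained as the eigenvector of the generalized eigenproblem S_W v = lambda S v with the largest absolute eigenvalue.

```python
import numpy as np
from scipy.linalg import eigh, lstsq

rng = np.random.default_rng(0)

# Simulated data: the regression surface 'bends' along the first predictor.
n, p = 200, 4
X = rng.normal(size=(n, p))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=n)

# Weights: residuals from a multiple regression of y on all the predictors.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
coef, _, _, _ = lstsq(Xc, yc)
e = yc - Xc @ coef

# Unweighted covariance S and residual-weighted covariance S_W.
S = Xc.T @ Xc / n
S_W = (Xc.T * e) @ Xc / n

# First principal Hessian direction: the eigenvector of S_W v = lambda S v
# with the largest |lambda|, i.e. a generalized PCA of S_W in the S^{-1} metric.
lam, V = eigh(S_W, S)
first_phd = V[:, np.argmax(np.abs(lam))]
print(first_phd)
```

Since y here bends along the first predictor only, the leading direction should lie close to the first coordinate axis.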
Given the (unweighted) covariance matrix S, their derivation of the first principal Hessian direction is equivalent to finding the first eigenvector in a generalized PCA of S_W with metric Q = S^{-1} and D = (1/n)I_n, in the notation of Section 14.2.2.

8.5 Variable Selection in Regression Using Principal Components

Principal component regression, latent root regression, and other biased regression estimates keep all the predictor variables in the model, but change the estimates from least squares estimates in a way that reduces the effects of multicollinearity. As mentioned in the introductory section of this chapter, an alternative way of dealing with multicollinearity problems is to use only a subset of the predictor variables. Among the very many possible methods of selecting a subset of variables, a few use PCs.

As noted in the previous section, the procedures due to Hawkins (1973) and Hawkins and Eplett (1982) can be used in this way. Rotation of the PCs
produces a large number of near-zero coefficients for the rotated variables, so that in low-variance relationships involving y (if such low-variance relationships exist) only a subset of the predictor variables will have coefficients substantially different from zero. This subset forms a plausible selection of variables to be included in a regression model. There may be other low-variance relationships between the predictor variables alone, again with relatively few coefficients far from zero. If such relationships exist, and involve some of the same variables as are in the relationship involving y, then substitution will lead to alternative subsets of predictor variables. Jeffers (1981) argues that in this way it is possible to identify all good subregressions using Hawkins' (1973) original procedure. Hawkins and Eplett (1982) demonstrate that their newer technique, incorporating Cholesky factorization, can do even better than the earlier method. In particular, for an example that is analysed by both methods, two subsets of variables selected by the first method are shown to be inappropriate by the second.

Principal component regression and latent root regression may also be used in an iterative manner to select variables. Consider, first, PC regression and suppose that β̃ given by (8.1.12) is the proposed estimator for β. Then it is possible to test whether or not subsets of the elements of β̃ are significantly different from zero, and those variables whose coefficients are found to be not significantly non-zero can then be deleted from the model. Mansfield et al. (1977), after a moderate amount of algebra, construct the appropriate tests for estimators of the form (8.1.10), that is, where the PCs deleted from the regression are restricted to be those with the smallest variances.
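For concreteness, an estimator of this restricted form, in which the PCs with the smallest variances are deleted and the fit is mapped back to the original x-coordinates, can be sketched as follows. This is a hypothetical illustration with invented data, not a reproduction of (8.1.10) itself; the near-collinear pair makes the effect of deleting the low-variance PC visible.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with two nearly collinear predictors.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # near-multicollinearity
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x3 + 0.1 * rng.normal(size=n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()

# PCA of the predictors: eigenvectors of X'X, ordered by decreasing variance.
eigvals, A = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]

# Delete the PC with the smallest variance: regress y on the scores of the
# first m PCs, then map the coefficients back to the x-coordinates.
m = 2
Z = Xc @ A[:, :m]                        # scores on the retained PCs
gamma = np.linalg.solve(Z.T @ Z, Z.T @ yc)
beta_tilde = A[:, :m] @ gamma            # PC regression estimator
print(beta_tilde)
```

Because the deleted PC is essentially the difference of the two collinear predictors, the estimator spreads the true coefficient of x1 roughly equally over x1 and x2, while the coefficient of x3 is close to its least squares value.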
Provided that the true coefficients of the deleted PCs are zero and that normality assumptions are valid, the appropriate test statistics are F-statistics, reducing to t-statistics if only one variable is considered at a time. A corresponding result will also hold for the more general form of estimator (8.1.12).

Although the variable selection procedure could stop at this stage, it may be more fruitful to use an iterative procedure, similar to that suggested by Jolliffe (1972) for variable selection in another (non-regression) context (see Section 6.3, method (i)). The next step in such a procedure is to perform a PC regression on the reduced set of variables, and then see if any further variables can be deleted from the reduced set, using the same reasoning as before. This process is repeated, until eventually no more variables are deleted. Two variations on this iterative procedure are described by Mansfield et al. (1977). The first is a stepwise procedure that first looks for the best single variable to delete, then the best pair of variables, one of which is the best single variable, then the best triple of variables, which includes the best pair, and so on. The procedure stops when the test for zero regression coefficients on the subset of excluded variables first gives a significant result. The second variation is to delete only one variable at each stage, and then recompute the PCs using the reduced set of variables, rather than allowing the deletion of several variables before the PCs are recomputed. According
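The second variation, deleting one variable at a time and recomputing the PCs at each stage, might be sketched as follows. This is only an outline under simplifying assumptions: it uses ordinary t-statistics for the PC regression coefficients and a rough cut-off of 2, rather than the exact tests constructed by Mansfield et al. (1977), and the data and function names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data: only the first of five predictors drives y.
n, p = 150, 5
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + 0.5 * rng.normal(size=n)

def pcr_fit(X, y, tol=1e-8):
    """PC regression deleting PCs of negligible variance; returns the
    coefficient estimates and approximate t-statistics."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    l, A = np.linalg.eigh(Xc.T @ Xc)           # eigenvalues ascending
    keep_pc = l > tol * l.max()                # delete tiny-variance PCs
    A_m, l_m = A[:, keep_pc], l[keep_pc]
    beta = A_m @ ((A_m.T @ (Xc.T @ yc)) / l_m)
    resid = yc - Xc @ beta
    s2 = resid @ resid / (len(y) - 1 - keep_pc.sum())
    # Var(beta) = s2 * A_m diag(1/l) A_m'; take the diagonal for the SEs.
    se = np.sqrt(s2 * np.einsum('jk,k->j', A_m**2, 1.0 / l_m))
    return beta, beta / se

# One-at-a-time backward deletion: drop the variable with the smallest
# |t| while it is insignificant, recomputing the PCs after each deletion.
keep = list(range(p))
while len(keep) > 1:
    beta, t = pcr_fit(X[:, keep], y)
    j = int(np.argmin(np.abs(t)))
    if abs(t[j]) >= 2.0:                       # rough 5% criterion
        break
    del keep[j]
print(keep)
```

With well-conditioned simulated data no PCs are actually deleted within each fit, so the tests reduce to ordinary backward elimination; the structure of the loop, not the exact test, is the point of the sketch.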