Jolliffe, I. Principal Component Analysis (2nd ed., Springer, 2002)
8.4. Variations on Principal Component Regression

The coefficients f_k are given by
\[
f_k = -\delta_{0k}\,\eta_y\,\tilde{l}_k^{-1}\Bigl(\sum_{M_{LR}} \delta_{0k}^2\,\tilde{l}_k^{-1}\Bigr)^{-1},
\tag{8.4.2}
\]
where $\eta_y^2 = \sum_{i=1}^{n}(y_i - \bar{y})^2$, and $\delta_{0k}$, $\tilde{l}_k$ are as defined above. Note that the least squares estimator $\hat{\beta}$ can also be written in the form (8.4.1) if $M_{LR}$ in (8.4.1) and (8.4.2) is taken to be the full set of PCs.

The full derivation of this expression for f_k is fairly lengthy, and can be found in Webster et al. (1974). It is interesting to note that f_k is proportional to the size of the coefficient of y in the kth PC, and inversely proportional to the variance of the kth PC; both of these relationships are intuitively reasonable.

In order to choose the subset M_LR it is necessary to decide not only how small the eigenvalues must be in order to indicate multicollinearities, but also how large the coefficient of y must be in order to indicate a predictive multicollinearity. Again, these are arbitrary choices, and ad hoc rules have been used, for example, by Gunst et al. (1976). A more formal procedure for identifying non-predictive multicollinearities is described by White and Gunst (1979), but its derivation is based on asymptotic properties of the statistics used in latent root regression.

Gunst et al. (1976) compared the latent root estimator and the least squares estimator in terms of MSE, using a simulation study, for cases of only one multicollinearity, and found that the latent root estimator showed substantial improvement over least squares when the multicollinearity is non-predictive. However, in cases where the single multicollinearity had some predictive value, the results were, unsurprisingly, less favourable to the latent root estimator. Gunst and Mason (1977a) reported a larger simulation study, which compared PC, latent root, ridge and shrinkage estimators, again on the basis of MSE.
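The estimator defined by (8.4.1) and (8.4.2) is short enough to compute directly. The following sketch assumes the correlation-matrix form of latent root regression, in which y and the predictors are centred and scaled to unit length before the eigenanalysis of the enlarged (p + 1)-variable set; the function name and interface are illustrative, not from the book. As a check, taking M_LR to be the full set of PCs reproduces the least squares coefficients, as noted above.

```python
import numpy as np

def latent_root_coefficients(X, y, keep):
    """Sketch of the latent root regression estimator, (8.4.1)-(8.4.2).

    X    : (n, p) predictor matrix
    y    : (n,) response
    keep : indices of the retained PCs (the subset M_LR)

    Returns coefficients for predicting y - ybar from the predictors
    scaled to unit length (the scaling used in the eigenanalysis).
    """
    ybar = y.mean()
    eta_y = np.sqrt(((y - ybar) ** 2).sum())   # eta_y^2 = sum (y_i - ybar)^2
    # centre and scale y and the predictors to unit length
    ys = (y - ybar) / eta_y
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))
    # eigenanalysis of the (p+1) x (p+1) matrix of (y, x1, ..., xp)
    Z = np.column_stack([ys, Xs])
    lam, V = np.linalg.eigh(Z.T @ Z)           # l~_k and eigenvectors delta_k
    delta0 = V[0, :]                           # coefficient of y in each PC
    denom = np.sum(delta0[keep] ** 2 / lam[keep])
    f = -delta0[keep] * eta_y / (lam[keep] * denom)   # equation (8.4.2)
    # estimator (8.4.1): sum over M_LR of f_k times the predictor part of delta_k
    return V[1:, keep] @ f
```

With `keep = np.arange(p + 1)` the result coincides (up to rounding) with the least squares fit of the centred response on the scaled predictors.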
Overall, latent root estimators did well in many, but not all, situations studied, as did PC estimators, but no simulation study can ever be exhaustive, and different conclusions might be drawn for other types of simulated data.

Hawkins (1973) also proposed finding PCs for the enlarged set of (p + 1) variables, but he used the PCs in a rather different way from that of latent root regression as defined above. The idea here is to use the PCs themselves, or rather a rotated version of them, to decide upon a suitable regression equation. Any PC with a small variance gives a relationship between y and the predictor variables whose sum of squared residuals orthogonal to the fitted plane is small. Of course, in regression it is squared residuals in the y-direction, rather than orthogonal to the fitted plane, which are to be minimized (see Section 8.6), but the low-variance PCs can nevertheless be used to suggest low-variability relationships between y and the predictor variables. Hawkins (1973) goes further by suggesting that it may be more fruitful to look at rotated versions of the PCs, instead of the PCs themselves, in order to indicate low-variance relationships. This is done
182 8. Principal Components in Regression Analysisby rescaling and then using varimax rotation (see Chapter 7), which hasthe effect of transforming the PCs to a different set of uncorrelated variables.These variables are, like the PCs, linear functions of the original(p + 1) variables, but their coefficients are mostly close to zero or a longway from zero, with relatively few intermediate values. There is no guarantee,in general, that any of the new variables will have particularly large orparticularly small variances, as they are chosen by simplicity of structureof their coefficients, rather than for their variance properties. However, ifonly one or two of the coefficients for y are large, as should often happenwith varimax rotation, then Hawkins (1973) shows that the correspondingtransformed variables will have very small variances, and therefore suggestlow-variance relationships between y and the predictor variables. Otherpossible regression equations may be found by substitution of one subset ofpredictor variables in terms of another, using any low-variability relationshipsbetween predictor variables that are suggested by the other rotatedPCs.The above technique is advocated by Hawkins (1973) and by Jeffers(1981) as a means of selecting which variables should appear in the regressionequation (see Section 8.5), rather than as a way of directly estimatingtheir coefficients in the regression equation, although the technique couldbe used for the latter purpose. Daling and Tamura (1970) also discussedrotation of PCs in the context of variable selection, but their PCs were forthe predictor variables only.In a later paper, Hawkins and Eplett (1982) propose another variant of latentroot regression one which can be used to efficiently find low-variabilityrelationships between y and the predictor variables, and which also can beused in variable selection. 
This method replaces the rescaling and varimax rotation of Hawkins' earlier method by a sequence of rotations leading to a set of relationships between y and the predictor variables that are simpler to interpret than in the previous method. This simplicity is achieved because the matrix of coefficients defining the relationships has non-zero entries only in its lower-triangular region. Despite the apparent complexity of the new method, it is also computationally simple to implement. The covariance (or correlation) matrix $\tilde{\Sigma}$ of y and all the predictor variables is factorized using a Cholesky factorization
\[
\tilde{\Sigma} = DD',
\]
where D is lower-triangular. Then the matrix of coefficients defining the relationships is proportional to $D^{-1}$, which is also lower-triangular. To find D it is not necessary to calculate PCs based on $\tilde{\Sigma}$, which makes the links between the method and PCA rather more tenuous than those between PCA and latent root regression. The next section discusses variable selection in regression using PCs, and because all three variants of latent root regression described above can be used in variable selection, they will all be discussed further in that section.
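The Cholesky construction is a few lines of numpy. In this sketch the variables are ordered with the predictors first and y last; the ordering is a choice made here for illustration, not fixed by the description above. Each row of $D^{-1}$ then defines a unit-variance linear combination uncorrelated with the combinations in the rows above it, and in this small example the final row recovers the least-squares relationship between y and the predictors.

```python
import numpy as np

# Simulated data for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

# Covariance matrix of (x1, ..., xp, y); y is placed last here
Z = np.column_stack([X, y])
Sigma = np.cov(Z, rowvar=False)

D = np.linalg.cholesky(Sigma)   # Sigma = D D', D lower-triangular
C = np.linalg.inv(D)            # coefficient matrix, also lower-triangular

# The last row of C, rescaled so the y coefficient is 1, gives
# coefficients proportional to (-beta_hat, 1) from least squares.
relationship = C[-1] / C[-1, -1]
```

No eigendecomposition of $\tilde{\Sigma}$ is needed, matching the remark that the method's links to PCA are tenuous.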