Jolliffe, I. Principal Component Analysis (2nd ed., Springer, 2002)
9.1. Discriminant Analysis

...on the assumptions about group structure and on the training set, if one is available, rules are constructed for assigning future observations to one of the G groups in some ‘optimal’ way, for example, so as to minimize the probability or cost of misclassification.

The best-known form of discriminant analysis occurs when there are only two populations, and x is assumed to have a multivariate normal distribution that differs between the two populations with respect to its mean but not its covariance matrix. If the means µ₁, µ₂ and the common covariance matrix Σ are known, then the optimal rule (according to several different criteria) is based on the linear discriminant function x′Σ⁻¹(µ₁ − µ₂). If µ₁, µ₂, Σ are estimated from a ‘training set’ by x̄₁, x̄₂, S_w, respectively, then a rule based on the sample linear discriminant function x′S_w⁻¹(x̄₁ − x̄₂) is often used. There are many other varieties of discriminant analysis (McLachlan, 1992), depending on the assumptions made regarding the population structure, and much research has been done, for example, on discriminant analysis for discrete data and on non-parametric approaches (Goldstein and Dillon, 1978; Hand, 1982).

The most obvious way of using PCA in a discriminant analysis is to reduce the dimensionality of the analysis by replacing x by the first m (high-variance) PCs in the derivation of a discriminant rule. If the first two PCs account for a high proportion of the variance, they can also be used to provide a two-dimensional graphical representation of the data showing how good, or otherwise, the separation is between the groups.

The first point to be clarified is exactly what is meant by the PCs of x in the context of discriminant analysis. A common assumption in many forms of discriminant analysis is that the covariance matrix is the same for all groups, and the PCA may therefore be done on an estimate of this common within-group covariance (or correlation) matrix.
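The sample rule above can be made concrete with a minimal numerical sketch in Python with NumPy. The data and all names here are illustrative assumptions, not taken from the text: two groups are drawn with a shared covariance matrix, S_w is the pooled within-group covariance, and a new observation is assigned by comparing a′x against the midpoint threshold a′(x̄₁ + x̄₂)/2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-group training set with a shared covariance matrix and
# different means (illustrative data, not from the text).
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
X1 = rng.multivariate_normal(mu1, cov, size=200)
X2 = rng.multivariate_normal(mu2, cov, size=200)

# Pooled within-group covariance estimate S_w
n1, n2 = len(X1), len(X2)
Sw = ((n1 - 1) * np.cov(X1, rowvar=False) +
      (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)

# Sample linear discriminant function a'x, with a = S_w^{-1}(xbar1 - xbar2)
xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
a = np.linalg.solve(Sw, xbar1 - xbar2)

def assign(x):
    """Assign x to group 1 if a'x exceeds the midpoint a'(xbar1 + xbar2)/2."""
    return 1 if a @ x > a @ (xbar1 + xbar2) / 2 else 2

print(assign(np.array([0.2, -0.1])))  # near mu1 -> 1
print(assign(np.array([2.8, 1.2])))   # near mu2 -> 2
```

Solving the linear system S_w a = x̄₁ − x̄₂ rather than explicitly inverting S_w is the usual numerically safer way to compute the discriminant vector.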
Unfortunately, this procedure may be unsatisfactory for two reasons. First, the within-group covariance matrix may be different for different groups. Methods for comparing PCs from different groups are discussed in Section 13.5, and later in the present section we describe techniques that use PCs to discriminate between populations when equal covariance matrices are not assumed. For the moment, however, we make the equal covariance assumption.

The second, more serious, problem encountered in using PCs based on a common within-group covariance matrix to discriminate between groups is that there is no guarantee that the separation between groups will be in the direction of the high-variance PCs. This point is illustrated diagrammatically in Figures 9.1 and 9.2 for two variables. In both figures the two groups are well separated, but in the first the separation is in the direction of the first PC (that is, parallel to the major axis of within-group variation), whereas in the second the separation is orthogonal to this direction. Thus, the first few PCs will only be useful for discriminating between groups in the case where within- and between-group variation have the same dominant directions. If this does not occur (and in general there is no particular reason for it to
do so), then omitting the low-variance PCs may actually throw away most of the information in x concerning between-group variation.

[Figure 9.1. Two data sets whose direction of separation is the same as that of the first (within-group) PC.]

The problem is essentially the same one that arises in PC regression where, as discussed in Section 8.2, it is inadvisable to look only at high-variance PCs, as the low-variance PCs can also be highly correlated with the dependent variable. That the same problem arises in both multiple regression and discriminant analysis is hardly surprising, as linear discriminant analysis can be viewed as a special case of multiple regression in which the dependent variable is a dummy variable defining group membership (Rencher, 1995, Section 8.3).

An alternative to finding PCs from the within-group covariance matrix is mentioned by Rao (1964) and used by Chang (1983), Jolliffe et al. (1996) and Mager (1980b), among others. It ignores the group structure and calculates an overall covariance matrix based on the raw data. If the between-group variation is much larger than within-group variation, then the first few PCs for the overall covariance matrix will define directions in
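The contrast between the two covariance matrices can be sketched numerically (Python with NumPy; the data are synthetic and assumed for illustration). The within-group spread is largest along the first coordinate, but the groups are separated along the second, as in Figure 9.2; because between-group variation dominates, the leading PC of the overall covariance matrix recovers the separating direction that the leading within-group PC would miss.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data (not from the text): within-group spread is largest
# along x1, but the groups are separated along x2.
cov = np.array([[1.0, 0.0], [0.0, 0.2]])   # common within-group covariance
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=300)
X2 = rng.multivariate_normal([0.0, 8.0], cov, size=300)
X = np.vstack([X1, X2])

# PCs of the overall covariance matrix, ignoring group membership
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = evecs[:, np.argmax(evals)]           # leading overall PC

# Between-group variation dominates, so the leading overall PC lies
# almost exactly along the direction separating the two group means,
# even though the leading within-group PC is orthogonal to it.
sep = np.array([0.0, 1.0])
print(abs(pc1 @ sep))  # close to 1
```

The same computation with the pooled within-group covariance in place of the overall one would put the leading PC along x1 instead, which is exactly the failure mode the text warns about.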