Jolliffe I.T., Principal Component Analysis, 2nd edition. Springer, 2002.



9.1. Discriminant Analysis

on the assumptions about group structure and on the training set, if one is available, rules are constructed for assigning future observations to one of the G groups in some ‘optimal’ way, for example, so as to minimize the probability or cost of misclassification.

The best-known form of discriminant analysis occurs when there are only two populations, and x is assumed to have a multivariate normal distribution that differs between the two populations with respect to its mean but not its covariance matrix. If the means µ₁, µ₂ and the common covariance matrix Σ are known, then the optimal rule (according to several different criteria) is based on the linear discriminant function x′Σ⁻¹(µ₁ − µ₂). If µ₁, µ₂, Σ are estimated from a ‘training set’ by x̄₁, x̄₂, S_w, respectively, then a rule based on the sample linear discriminant function x′S_w⁻¹(x̄₁ − x̄₂) is often used. There are many other varieties of discriminant analysis (McLachlan, 1992), depending on the assumptions made regarding the population structure, and much research has been done, for example, on discriminant analysis for discrete data and on non-parametric approaches (Goldstein and Dillon, 1978; Hand, 1982).

The most obvious way of using PCA in a discriminant analysis is to reduce the dimensionality of the analysis by replacing x by the first m (high-variance) PCs in the derivation of a discriminant rule. If the first two PCs account for a high proportion of the variance, they can also be used to provide a two-dimensional graphical representation of the data, showing how good, or otherwise, the separation is between the groups.

The first point to be clarified is exactly what is meant by the PCs of x in the context of discriminant analysis. A common assumption in many forms of discriminant analysis is that the covariance matrix is the same for all groups, and the PCA may therefore be done on an estimate of this common within-group covariance (or correlation) matrix.
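The two-group rule described above can be sketched numerically. The following is a minimal illustration, not from the book: the group means, covariance matrix, sample sizes and the midpoint allocation threshold are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training set: two groups sharing a common covariance matrix
# (all parameter values here are illustrative, not from the text).
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
X1 = rng.multivariate_normal(mu1, Sigma, size=100)
X2 = rng.multivariate_normal(mu2, Sigma, size=100)

# Sample means and pooled within-group covariance matrix S_w
xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
S_w = ((X1 - xbar1).T @ (X1 - xbar1) +
       (X2 - xbar2).T @ (X2 - xbar2)) / (len(X1) + len(X2) - 2)

# Sample linear discriminant function a'x with a = S_w^{-1}(xbar1 - xbar2)
a = np.linalg.solve(S_w, xbar1 - xbar2)

def classify(x):
    # Allocate to group 1 if a'x exceeds the midpoint a'(xbar1 + xbar2)/2,
    # one common choice of cut-off under equal priors and costs
    threshold = a @ (xbar1 + xbar2) / 2
    return 1 if a @ x > threshold else 2
```

With well-separated means, points near µ₁ and µ₂ are allocated to groups 1 and 2 respectively.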
Unfortunately, this procedure may be unsatisfactory for two reasons. First, the within-group covariance matrix may be different for different groups. Methods for comparing PCs from different groups are discussed in Section 13.5, and later in the present section we describe techniques that use PCs to discriminate between populations when equal covariance matrices are not assumed. For the moment, however, we make the equal covariance assumption.

The second, more serious, problem encountered in using PCs based on a common within-group covariance matrix to discriminate between groups is that there is no guarantee that the separation between groups will be in the direction of the high-variance PCs. This point is illustrated diagrammatically in Figures 9.1 and 9.2 for two variables. In both figures the two groups are well separated, but in the first the separation is in the direction of the first PC (that is, parallel to the major axis of within-group variation), whereas in the second the separation is orthogonal to this direction. Thus, the first few PCs will only be useful for discriminating between groups in the case where within- and between-group variation have the same dominant directions. If this does not occur (and in general there is no particular reason for it to
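The situation of Figure 9.2 can be reproduced numerically. In this sketch (all numbers illustrative) the groups are separated along the low-variance direction of the within-group covariance matrix, so the first within-group PC carries almost none of the between-group separation while the last PC carries nearly all of it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Within-group covariance with large variance along x1 (illustrative values)
Sigma = np.array([[4.0, 0.0], [0.0, 0.25]])
# Group means differ along x2, orthogonal to the first within-group PC
X1 = rng.multivariate_normal([0.0, 0.0], Sigma, size=200)
X2 = rng.multivariate_normal([0.0, 2.0], Sigma, size=200)

# PCA of the pooled within-group covariance matrix
Xc = np.vstack([X1 - X1.mean(axis=0), X2 - X2.mean(axis=0)])
S_w = Xc.T @ Xc / (len(Xc) - 2)
eigvals, eigvecs = np.linalg.eigh(S_w)        # eigenvalues in ascending order
pc1, pc_last = eigvecs[:, -1], eigvecs[:, 0]  # high- and low-variance PCs

# Separation of the group means along each PC, in within-group s.d. units
delta = X2.mean(axis=0) - X1.mean(axis=0)
sep_pc1 = abs(delta @ pc1) / np.sqrt(eigvals[-1])
sep_last = abs(delta @ pc_last) / np.sqrt(eigvals[0])
# sep_last is far larger: discarding the low-variance PC would discard
# essentially all of the between-group information
```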

do so) then omitting the low-variance PCs may actually throw away most of the information in x concerning between-group variation.

[Figure 9.1. Two data sets whose direction of separation is the same as that of the first (within-group) PC.]

The problem is essentially the same one that arises in PC regression where, as discussed in Section 8.2, it is inadvisable to look only at high-variance PCs, as the low-variance PCs can also be highly correlated with the dependent variable. That the same problem arises in both multiple regression and discriminant analysis is hardly surprising, as linear discriminant analysis can be viewed as a special case of multiple regression in which the dependent variable is a dummy variable defining group membership (Rencher, 1995, Section 8.3).

An alternative to finding PCs from the within-group covariance matrix is mentioned by Rao (1964) and used by Chang (1983), Jolliffe et al. (1996) and Mager (1980b), among others. It ignores the group structure and calculates an overall covariance matrix based on the raw data. If the between-group variation is much larger than within-group variation, then the first few PCs for the overall covariance matrix will define directions in
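The alternative just described can also be sketched numerically. In this illustration (parameter values are assumptions, not from the book) the between-group variation dominates the within-group variation, and the first PC of the overall covariance matrix, computed with the group labels ignored, lies almost exactly along the direction separating the two group means.

```python
import numpy as np

rng = np.random.default_rng(2)

# Between-group separation large relative to within-group variance
Sigma = np.array([[0.3, 0.0], [0.0, 0.3]])
X1 = rng.multivariate_normal([0.0, 0.0], Sigma, size=200)
X2 = rng.multivariate_normal([4.0, 3.0], Sigma, size=200)
X = np.vstack([X1, X2])

# Overall covariance matrix based on the raw data, ignoring group structure
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
pc1 = eigvecs[:, -1]  # highest-variance PC of the overall matrix

# Unit vector along the true direction of group separation
d = np.array([4.0, 3.0]) / 5.0
alignment = abs(pc1 @ d)  # near 1 when pc1 points along the separation
```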

