Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)



to retain. In reading the concluding paragraph that follows, this message should be kept firmly in mind.

Some procedures, such as those introduced in Sections 6.1.4 and 6.1.6, are usually inappropriate because they retain, respectively, too many or too few PCs in most circumstances. Some rules have been derived in particular fields of application, such as atmospheric science (Sections 6.1.3, 6.1.7) or psychology (Sections 6.1.3, 6.1.6) and may be less relevant outside these fields than within them. The simple rules of Sections 6.1.1 and 6.1.2 seem to work well in many examples, although the recommended cut-offs must be treated flexibly. Ideally the threshold should not fall between two PCs with very similar variances, and it may also change depending on the values of n and p, and on the presence of variables with dominant variances (see the examples in the next section). A large amount of research has been done on rules for choosing m since the first edition of this book appeared. However, it still remains true that attempts to construct rules having more sound statistical foundations seem, at present, to offer little advantage over the simpler rules in most circumstances.

6.2 Choosing m, the Number of Components: Examples

Two examples are given here to illustrate several of the techniques described in Section 6.1; in addition, the examples of Section 6.4 include some relevant discussion, and Section 6.1.8 noted a number of comparative studies.

6.2.1 Clinical Trials Blood Chemistry

These data were introduced in Section 3.3 and consist of measurements of eight blood chemistry variables on 72 patients. The eigenvalues for the correlation matrix are given in Table 6.1, together with the related information that is required to implement the ad hoc methods described in Sections 6.1.1–6.1.3.

Table 6.1. First six eigenvalues for the correlation matrix, blood chemistry data.

Component number                        1      2      3      4      5      6
Eigenvalue, l_k                      2.79   1.53   1.25   0.78   0.62   0.49
t_m = 100 \sum_{k=1}^{m} l_k / p     34.9   54.1   69.7   79.4   87.2   93.3
l_{k-1} - l_k                               1.26   0.28   0.47   0.16   0.13

Looking at Table 6.1 and Figure 6.1, the three methods of Sections 6.1.1–6.1.3 suggest that between three and six PCs should be retained, but the decision on a single best number is not clear-cut. Four PCs account for
nearly 80% of the total variation, but it takes six PCs to account for 90%. A cut-off at l^* = 0.7 for the second criterion retains four PCs, but the next eigenvalue is not very much smaller, so perhaps five should be retained. In the scree graph the slope actually increases between k = 3 and 4, but then falls sharply and levels off, suggesting that perhaps only four PCs should be retained. The LEV diagram (not shown) is of little help here; it has no clear indication of constant slope after any value of k, and in fact has its steepest slope between k = 7 and 8.

Using Cattell’s (1966) formulation, there is no strong straight-line behaviour after any particular point, although perhaps a cut-off at k = 4 is most appropriate. Cattell suggests that the first point on the straight line (that is, the ‘elbow’ point) should be retained. However, if we consider the scree graph in the same light as the test of Section 6.1.4, then all eigenvalues after, and including, the elbow are deemed roughly equal and so all corresponding PCs should be deleted. This would lead to the retention of only three PCs in the present case.

Table 6.2. First six eigenvalues for the covariance matrix, blood chemistry data.

Component number                                           1       2       3       4        5         6
Eigenvalue, l_k                                      1704.68   15.07    6.98    2.64     0.13      0.07
l_k / \bar{l}                                           7.88    0.07    0.03    0.01   0.0006    0.0003
t_m = 100 \sum_{k=1}^{m} l_k / \sum_{k=1}^{p} l_k       98.6    99.4    99.8   99.99   99.995   99.9994
l_{k-1} - l_k                                                1689.61    8.09    4.34     2.51      0.06

Turning to Table 6.2, which gives information for the covariance matrix, corresponding to that presented for the correlation matrix in Table 6.1, the three ad hoc measures all conclusively suggest that one PC is sufficient. It is undoubtedly true that choosing m = 1 accounts for the vast majority of the variation in x, but this conclusion is not particularly informative as it merely reflects that one of the original variables accounts for nearly all the variation in x.
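
The three ad hoc quantities used in this example (cumulative percentage of variation, an eigenvalue cut-off l^*, and the successive differences read off a scree graph) can be recomputed from the eigenvalues alone. The following Python sketch is an illustration only and is not part of the original text; the function names are hypothetical. It uses the correlation-matrix eigenvalues, for which the total variation equals p = 8. The covariance-matrix case would need the sum of all p eigenvalues, which is not listed in full here. Because the input eigenvalues are rounded to two decimals, recomputed percentages may differ from the tabulated ones in the last digit.

```python
from itertools import accumulate

# First six eigenvalues of the correlation matrix for the blood chemistry data.
eigenvalues = [2.79, 1.53, 1.25, 0.78, 0.62, 0.49]
p = 8  # number of variables; correlation-matrix eigenvalues sum to p


def cumulative_percentage(l, total):
    """Return t_m = 100 * (l_1 + ... + l_m) / total for m = 1, 2, ..."""
    return [100.0 * partial_sum / total for partial_sum in accumulate(l)]


def count_above_cutoff(l, cutoff):
    """Return the number of eigenvalues strictly greater than the cut-off."""
    return sum(1 for value in l if value > cutoff)


def successive_differences(l):
    """Return l_{k-1} - l_k for k = 2, 3, ... (the drops in a scree graph)."""
    return [previous - current for previous, current in zip(l, l[1:])]


t = cumulative_percentage(eigenvalues, p)

# Smallest m whose cumulative percentage reaches 90% (an illustrative target).
m_90 = next(m for m, t_m in enumerate(t, start=1) if t_m >= 90.0)

# Number of PCs retained with the cut-off l* = 0.7.
m_cutoff = count_above_cutoff(eigenvalues, 0.7)

diffs = successive_differences(eigenvalues)

print("t_m:", [round(value, 1) for value in t])
print("m for 90%:", m_90)
print("m for l* = 0.7:", m_cutoff)
print("differences:", [round(d, 2) for d in diffs])
```

With these inputs the cut-off rule keeps four PCs and the 90% target needs six, in line with the discussion above. Four PCs reach just under 80%, so a strictly applied 80% threshold would select five, which is one reason the cut-offs are best treated flexibly.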
The PCs for the covariance matrix in this example were discussed in Section 3.3, and it can be argued that it is the use of the covariance matrix, rather than the rules of Sections 6.1.1–6.1.3, that is inappropriate for these data.

6.2.2 Gas Chromatography Data

These data, which were originally presented by McReynolds (1970), and which have been analysed by Wold (1978) and by Eastment and Krzanowski (1982), are concerned with gas chromatography retention indices. After removal of a number of apparent outliers and an observation with a missing value, there remain 212 (Eastment and Krzanowski) or 213 (Wold) measurements on ten variables. Wold (1978) claims that his method indicates

