Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)
Table 6.1. First six eigenvalues for the correlation matrix, blood chemistry data.

Component number                      1      2      3      4      5      6
Eigenvalue, $l_k$                     2.79   1.53   1.25   0.78   0.62   0.49
$t_m = 100 \sum_{k=1}^{m} l_k / p$    34.9   54.1   69.7   79.4   87.2   93.3
$l_{k-1} - l_k$                              1.26   0.28   0.47   0.16   0.13

to retain. In reading the concluding paragraph that follows, this message should be kept firmly in mind.

Some procedures, such as those introduced in Sections 6.1.4 and 6.1.6, are usually inappropriate because they retain, respectively, too many or too few PCs in most circumstances. Some rules have been derived in particular fields of application, such as atmospheric science (Sections 6.1.3, 6.1.7) or psychology (Sections 6.1.3, 6.1.6) and may be less relevant outside these fields than within them. The simple rules of Sections 6.1.1 and 6.1.2 seem to work well in many examples, although the recommended cut-offs must be treated flexibly. Ideally the threshold should not fall between two PCs with very similar variances, and it may also change depending on the values of n and p, and on the presence of variables with dominant variances (see the examples in the next section). A large amount of research has been done on rules for choosing m since the first edition of this book appeared. However, it still remains true that attempts to construct rules having more sound statistical foundations seem, at present, to offer little advantage over the simpler rules in most circumstances.

6.2 Choosing m, the Number of Components: Examples

Two examples are given here to illustrate several of the techniques described in Section 6.1; in addition, the examples of Section 6.4 include some relevant discussion, and Section 6.1.8 noted a number of comparative studies.

6.2.1 Clinical Trials Blood Chemistry

These data were introduced in Section 3.3 and consist of measurements of eight blood chemistry variables on 72 patients.
The eigenvalues for the correlation matrix are given in Table 6.1, together with the related information that is required to implement the ad hoc methods described in Sections 6.1.1–6.1.3.

Looking at Table 6.1 and Figure 6.1, the three methods of Sections 6.1.1–6.1.3 suggest that between three and six PCs should be retained, but the decision on a single best number is not clear-cut. Four PCs account for nearly 80% of the total variation, but it takes six PCs to account for 90%. A cut-off at $l^* = 0.7$ for the second criterion retains four PCs, but the next eigenvalue is not very much smaller, so perhaps five should be retained. In the scree graph the slope actually increases between k = 3 and 4, but then falls sharply and levels off, suggesting that perhaps only four PCs should be retained. The LEV diagram (not shown) is of little help here; it has no clear indication of constant slope after any value of k, and in fact has its steepest slope between k = 7 and 8.

Using Cattell’s (1966) formulation, there is no strong straight-line behaviour after any particular point, although perhaps a cut-off at k = 4 is most appropriate. Cattell suggests that the first point on the straight line (that is, the ‘elbow’ point) should be retained. However, if we consider the scree graph in the same light as the test of Section 6.1.4, then all eigenvalues after, and including, the elbow are deemed roughly equal and so all corresponding PCs should be deleted. This would lead to the retention of only three PCs in the present case.

Table 6.2. First six eigenvalues for the covariance matrix, blood chemistry data.

Component number                                      1         2         3         4         5         6
Eigenvalue, $l_k$                                     1704.68   15.07     6.98      2.64      0.13      0.07
$l_k / \bar{l}$                                       7.88      0.07      0.03      0.01      0.0006    0.0003
$t_m = 100 \sum_{k=1}^{m} l_k / \sum_{k=1}^{p} l_k$   98.6      99.4      99.8      99.99     99.995    99.9994
$l_{k-1} - l_k$                                                 1689.61   8.09      4.34      2.51      0.06

Turning to Table 6.2, which gives information for the covariance matrix, corresponding to that presented for the correlation matrix in Table 6.1, the three ad hoc measures all conclusively suggest that one PC is sufficient. It is undoubtedly true that choosing m = 1 accounts for the vast majority of the variation in x, but this conclusion is not particularly informative as it merely reflects that one of the original variables accounts for nearly all the variation in x.
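The arithmetic behind the ad hoc rules for the correlation matrix can be reproduced from the eigenvalues listed in Table 6.1. The following is a minimal sketch in Python with NumPy, not code from the book: the function names are hypothetical, and the 80% and 90% targets and the 0.7 cut-off are simply the values quoted in the discussion above. Because the tabulated eigenvalues are rounded to two decimals, the computed $t_m$ can differ from the printed row in the last digit.

```python
import numpy as np

# First six of the p = 8 correlation-matrix eigenvalues, copied from Table 6.1.
eigenvalues = np.array([2.79, 1.53, 1.25, 0.78, 0.62, 0.49])
p = 8  # number of variables; a correlation matrix has total variance p


def cumulative_percentage(l, total):
    """t_m = 100 * sum_{k<=m} l_k / total, for m = 1, ..., len(l)."""
    return 100.0 * np.cumsum(l) / total


def smallest_m_reaching(t, target):
    """Smallest m with t_m >= target, or None if no listed value reaches it."""
    hits = np.nonzero(t >= target)[0]
    return int(hits[0]) + 1 if hits.size else None


def count_above_cutoff(l, cutoff):
    """Number of eigenvalues strictly greater than the cut-off l*."""
    return int(np.sum(l > cutoff))


t = cumulative_percentage(eigenvalues, total=p)
drops = -np.diff(eigenvalues)  # l_{k-1} - l_k for k = 2, ..., 6

m_80 = smallest_m_reaching(t, 80.0)
m_90 = smallest_m_reaching(t, 90.0)
m_cutoff = count_above_cutoff(eigenvalues, 0.7)

print("t_m:", np.round(t, 1))
print("l_{k-1} - l_k:", np.round(drops, 2))
print("m for 80%:", m_80, "| m for 90%:", m_90, "| m with l_k > 0.7:", m_cutoff)
```

A strict 80% target returns m = 5 here, because four PCs reach only about 79.4%. This is one illustration of why the cut-offs need to be treated flexibly rather than applied mechanically.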
The PCs for the covariance matrix in this example were discussed in Section 3.3, and it can be argued that it is the use of the covariance matrix, rather than the rules of Sections 6.1.1–6.1.3, that is inappropriate for these data.

6.2.2 Gas Chromatography Data

These data, which were originally presented by McReynolds (1970), and which have been analysed by Wold (1978) and by Eastment and Krzanowski (1982), are concerned with gas chromatography retention indices. After removal of a number of apparent outliers and an observation with a missing value, there remain 212 (Eastment and Krzanowski) or 213 (Wold) measurements on ten variables. Wold (1978) claims that his method indicates