Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)


cda.psych.uiuc.edu

6.4. Examples Illustrating Variable Selection

both among the group of dominant variables for the second PC, and variable 13 (tibia length 3) has the largest coefficient of any variable for PC1.

Comparisons can be made regarding how well Jolliffe’s and McCabe’s selections perform with respect to the criteria (6.3.4) and (6.3.5). For (6.3.5), Jolliffe’s choices are closer to optimality than McCabe’s, achieving values of 0.933 and 0.945 for four variables, compared with 0.907 and 0.904 for McCabe, whereas the optimal value is 0.948. Discrepancies are generally larger but more variable for criterion (6.3.4). For example, the B2 selection of three variables achieves a value of only 0.746 compared with the optimal value of 0.942, which is attained by B4. Values for McCabe’s selections are intermediate (0.838, 0.880).

Regarding the choice of m, the l_k criterion of Section 6.1.2 was found by Jolliffe (1972), using simulation studies, to be appropriate for methods B2 and B4, with a cut-off close to l* = 0.7. In the present example the criterion suggests m = 3, as l_3 = 0.75 and l_4 = 0.50. Confirmation that m should be this small is given by the criterion t_m of Section 6.1.1. Two PCs account for 85.4% of the variation, three PCs give 89.4% and four PCs contribute 92.0%, from which Jeffers (1967) concludes that two PCs are sufficient to account for most of the variation. However, Jolliffe (1973) also looked at how well other aspects of the structure of data are reproduced for various values of m. For example, the form of the PCs and the division into four distinct groups of aphids (see Section 9.2 for further discussion of this aspect) were both examined and found to be noticeably better reproduced for m = 4 than for m = 2 or 3, so it seems that the criteria of Sections 6.1.1 and 6.1.2 might be relaxed somewhat when very small values of m are indicated, especially when coupled with small values of n, the sample size. McCabe (1982) notes that four or five of the original variables are necessary in order to account for as much variation as the first two PCs, confirming that m = 4 or 5 is probably appropriate here.

Tanaka and Mori (1997) suggest, on the basis of their two criteria and using a backward elimination algorithm, that seven or nine variables should be kept, rather more than Jolliffe (1973) or McCabe (1982). If only four variables are retained, Tanaka and Mori’s (1997) analysis keeps variables 5, 6, 14, 19 according to the RV-coefficient, and variables 5, 14, 17, 18 using residuals from regression. At least three of the four variables overlap with choices made in Table 6.4. On the other hand, the selection rule based on influential variables suggested by Mori et al. (2000) retains variables 2, 4, 12, 13 in a 4-variable subset, a quite different selection from those of the other methods.

6.4.2 Crime Rates

These data were given by Ahamad (1967) and consist of measurements of the crime rate in England and Wales for 18 different categories of crime (the variables) for the 14 years, 1950–63. The sample size n = 14 is very

small, and is in fact smaller than the number of variables. Furthermore, the data are time series, and the 14 observations are not independent (see Chapter 12), so that the effective sample size is even smaller than 14. Leaving aside this potential problem and other criticisms of Ahamad’s analysis (Walker, 1967), subsets of variables that are selected using the correlation matrix by the same methods as in Table 6.4 are shown in Table 6.5.

Table 6.5. Subsets of selected variables, crime rates.
(Each row corresponds to a selected subset with × denoting a selected variable.)

Variables: 1  3  4  5  7  8  10  13  14  16  17

McCabe, using criterion (a)
    Three variables   best          × × ×
    Three variables   second best   × × ×
    Four variables    best          × × × ×
    Four variables    second best   × × × ×
Jolliffe, using criteria B2, B4
    Three variables   B2            × × ×
    Three variables   B4            × × ×
    Four variables    B2            × × × ×
    Four variables    B4            × × × ×
Criterion (6.3.4)
    Three variables                 × × ×
    Four variables                  × × × ×
Criterion (6.3.5)
    Three variables                 × × ×
    Four variables                  × × × ×

There is a strong similarity between the correlation structure of the present data set and that of the previous example. Most of the variables considered increased during the time period considered, and the correlations between these variables are large and positive. (Some elements of the correlation matrix given by Ahamad (1967) are incorrect; Jolliffe (1970) gives the correct values.)

The first PC based on the correlation matrix therefore has large coefficients on all these variables; it measures an ‘average crime rate’ calculated largely from 13 of the 18 variables, and accounts for 71.7% of the total variation. The second PC, accounting for 16.1% of the total variation, has large coefficients on the five variables whose behaviour over the 14 years is ‘atypical’ in one way or another. The third PC, accounting for 5.5% of the total variation, is dominated by the single variable ‘homicide,’ which stayed almost constant compared with the trends in other variables over the period of study. On the basis of t_m only two or three PCs are necessary,
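
The arithmetic behind the t_m criterion for the crime data can be sketched in a few lines of Python. This is only an illustration, not part of the original analysis: the three percentages are those quoted above for the first three PCs, while the function names and the 80% and 90% thresholds are assumptions chosen for the example.

```python
from itertools import accumulate

# Percentages of total variation for the first three PCs of the crime data,
# as quoted in the text.
pc_percentages = [71.7, 16.1, 5.5]


def cumulative_percentages(percentages):
    """Return t_m for m = 1, 2, ...: the running total of variation explained."""
    return [round(total, 1) for total in accumulate(percentages)]


def smallest_m(percentages, threshold):
    """Return the smallest m with t_m >= threshold, or None if never reached."""
    for m, t_m in enumerate(cumulative_percentages(percentages), start=1):
        if t_m >= threshold:
            return m
    return None


print(cumulative_percentages(pc_percentages))  # [71.7, 87.8, 93.3]

# Illustrative thresholds (assumed for this sketch, not taken from the text).
for threshold in (80.0, 90.0):
    print(threshold, smallest_m(pc_percentages, threshold))
```

With these figures, two PCs give a cumulative 87.8% and three give 93.3%, so an 80% threshold selects m = 2 and a 90% threshold selects m = 3, consistent with the statement that two or three PCs suffice on this criterion.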

