Jolliffe, I. Principal Component Analysis (2nd ed., Springer, 2002)

6.4. Examples Illustrating Variable Selection

…both among the group of dominant variables for the second PC, and variable 13 (tibia length 3) has the largest coefficient of any variable for PC1.

Comparisons can be made regarding how well Jolliffe's and McCabe's selections perform with respect to the criteria (6.3.4) and (6.3.5). For (6.3.5), Jolliffe's choices are closer to optimality than McCabe's, achieving values of 0.933 and 0.945 for four variables, compared to 0.907 and 0.904 for McCabe, whereas the optimal value is 0.948. Discrepancies are generally larger but more variable for criterion (6.3.4). For example, the B2 selection of three variables achieves a value of only 0.746, compared with the optimal value of 0.942, which is attained by B4. Values for McCabe's selections are intermediate (0.838, 0.880).

Regarding the choice of $m$, the $l_k$ criterion of Section 6.1.2 was found by Jolliffe (1972), using simulation studies, to be appropriate for methods B2 and B4, with a cut-off close to $l^* = 0.7$. In the present example the criterion suggests $m = 3$, as $l_3 = 0.75$ and $l_4 = 0.50$. Confirmation that $m$ should be this small is given by the criterion $t_m$ of Section 6.1.1. Two PCs account for 85.4% of the variation, three PCs give 89.4% and four PCs contribute 92.0%, from which Jeffers (1967) concludes that two PCs are sufficient to account for most of the variation. However, Jolliffe (1973) also looked at how well other aspects of the structure of the data are reproduced for various values of $m$. For example, the form of the PCs and the division into four distinct groups of aphids (see Section 9.2 for further discussion of this aspect) were both examined and found to be noticeably better reproduced for $m = 4$ than for $m = 2$ or 3, so it seems that the criteria of Sections 6.1.1 and 6.1.2 might be relaxed somewhat when very small values of $m$ are indicated, especially when coupled with small values of $n$, the sample size. McCabe (1982) notes that four or five of the original variables are necessary in order to account for as much variation as the first two PCs, confirming that $m = 4$ or 5 is probably appropriate here.

Tanaka and Mori (1997) suggest, on the basis of their two criteria and using a backward elimination algorithm, that seven or nine variables should be kept, rather more than Jolliffe (1973) or McCabe (1982). If only four variables are retained, Tanaka and Mori's (1997) analysis keeps variables 5, 6, 14, 19 according to the RV-coefficient, and variables 5, 14, 17, 18 using residuals from regression. At least three of the four variables overlap with choices made in Table 6.4. On the other hand, the selection rule based on influential variables suggested by Mori et al. (2000) retains variables 2, 4, 12, 13 in a 4-variable subset, a quite different selection from those of the other methods.

6.4.2 Crime Rates

These data were given by Ahamad (1967) and consist of measurements of the crime rate in England and Wales for 18 different categories of crime (the variables) for the 14 years 1950–63. The sample size $n = 14$ is very
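The two rules used above for choosing $m$ lend themselves to a very short computational illustration. The sketch below (Python with NumPy; the function name, argument names and default thresholds are illustrative assumptions, not taken from the text) applies the eigenvalue cut-off $l^*$ of Section 6.1.2 and a cumulative proportion-of-variation target in the spirit of $t_m$ from Section 6.1.1 to a correlation-matrix PCA.

```python
import numpy as np

def suggest_m(X, l_star=0.7, t_target=0.9):
    """Sketch of two rules for choosing the number m of PCs to retain.

    X is an (n x p) data matrix and the PCA is based on its correlation
    matrix, as in the aphid example.  l_star is the eigenvalue cut-off of
    Section 6.1.2 and t_target a cumulative proportion-of-variation target
    in the spirit of Section 6.1.1.  (Names and default values here are
    illustrative, not taken from the text.)
    """
    R = np.corrcoef(X, rowvar=False)             # p x p correlation matrix
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # l_1 >= l_2 >= ... >= l_p

    # l_k rule: retain the components whose eigenvalues exceed l* (about 0.7)
    m_eigenvalue = int(np.sum(eigvals > l_star))

    # t_m rule: smallest m whose cumulative proportion of variation reaches the target
    t = np.cumsum(eigvals) / eigvals.sum()
    m_cumulative = int(np.searchsorted(t, t_target) + 1)

    return m_eigenvalue, m_cumulative
```

In the aphid example, where $l_3 = 0.75$ and $l_4 = 0.50$, the first rule would return $m = 3$; with a target of 90% of the variation, the second rule would return $m = 4$, since three PCs account for 89.4% and four for 92.0%.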
