Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)

Krzanowski (1983) examines the gas chromatography example further by generating six different artificial data sets with the same sample covariance matrix as the real data. The values of W are fairly stable across the replicates and confirm the choice of four PCs obtained above by slightly decreasing the cut-off for W. For the full data set, with outliers not removed, the replicates give some different, and useful, information from that in the original data.

6.3 Selecting a Subset of Variables

When p, the number of variables observed, is large it is often the case that a subset of m variables, with m ≪ p, contains virtually all the information available in all p variables. It is then useful to determine an appropriate value of m, and to decide which subset or subsets of m variables are best.

Solution of these two problems, the choice of m and the selection of a good subset, depends on the purpose to which the subset of variables is to be put. If the purpose is simply to preserve most of the variation in x, then the PCs of x can be used fairly straightforwardly to solve both problems, as will be explained shortly. A more familiar variable selection problem is in multiple regression, and although PCA can contribute in this context (see Section 8.5), it is used in a more complicated manner. This is because external considerations, namely the relationships of the predictor (regressor) variables with the dependent variable, as well as the internal relationships between the regressor variables, must be considered. External considerations are also relevant in other variable selection situations, for example in discriminant analysis (Section 9.1); these situations will not be considered in the present chapter.

Furthermore, practical considerations, such as ease of measurement of the selected variables, may be important in some circumstances, and it must be stressed that such considerations, as well as the purpose of the subsequent analysis, can play a prominent role in variable selection. Here, however, we concentrate on the problem of finding a subset of x in which the sole aim is to represent the internal variation of x as well as possible.

Regarding the choice of m, the methods of Section 6.1 are all relevant. The techniques described there find the number of PCs that account for most of the variation in x, but they can also be interpreted as finding the effective dimensionality of x. If x can be successfully described by only m PCs, then it will often be true that x can be replaced by a subset of m (or perhaps slightly more) variables, with a relatively small loss of information.

Moving on to the choice of m variables, Jolliffe (1970, 1972, 1973) discussed a number of methods for selecting a subset of m variables that preserve most of the variation in x. Some of the methods compared, and indeed some of those which performed quite well, are based on PCs. Other
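One of the simplest PC-based selection rules of the kind Jolliffe compared associates each retained component with a single variable: for each of the first m PCs in turn, keep the not-yet-chosen variable with the largest absolute loading on that component. The sketch below is illustrative only (the function name and the toy data are assumptions, not taken from the text), and assumes a data matrix X with observations in rows and variables in columns:

```python
import numpy as np

def select_variables_by_pc(X, m):
    """For each of the first m principal components (in order), keep the
    not-yet-chosen variable with the largest absolute loading on it."""
    # Centre the data; rows of Vt are the PC loading vectors of the
    # sample covariance matrix (up to a constant factor).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    chosen = []
    for k in range(m):
        loadings = np.abs(Vt[k])
        # Mask out variables already selected so each PC contributes
        # a new variable.
        loadings[chosen] = -1.0
        chosen.append(int(np.argmax(loadings)))
    return chosen

# Toy example: x3 and x4 nearly duplicate x1 and x2, so a good subset
# of size 2 should take one variable from each correlated pair.
rng = np.random.default_rng(0)
z = rng.standard_normal((200, 2))
noise = 0.01 * rng.standard_normal((200, 2))
X = np.column_stack([z[:, 0], 2 * z[:, 1],
                     z[:, 0] + noise[:, 0], 2 * z[:, 1] + noise[:, 1]])
print(select_variables_by_pc(X, 2))
```

Because the rule looks at one PC at a time, it is greedy: it can miss subsets that are jointly better, which is one reason several competing criteria were compared in the studies cited above.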
