Jolliffe I. Principal Component Analysis (2ed., Springer, 2002)(518s)



3.8. Patterned Covariance and Correlation Matrices 57

[...] correlations are positive and not close to zero. Sometimes a variable in such a group will initially have entirely negative correlations with the other members of the group, but the sign of a variable is often arbitrary, and switching the sign will give a group of the required structure. If correlations between the q members of the group and variables outside the group are close to zero, then there will be q PCs ‘associated with the group’ whose coefficients for variables outside the group are small. One of these PCs will have a large variance, approximately 1 + (q − 1)r̄, where r̄ is the average correlation within the group, and will have positive coefficients for all variables in the group. The remaining (q − 1) PCs will have much smaller variances (of order 1 − r̄), and will have some positive and some negative coefficients. Thus the ‘large variance PC’ for the group measures, roughly, the average size of variables in the group, whereas the ‘small variance PCs’ give ‘contrasts’ between some or all of the variables in the group. There may be several such groups of variables in a data set, in which case each group will have one ‘large variance PC’ and several ‘small variance PCs.’ Conversely, as happens not infrequently, especially in biological applications when all variables are measurements on individuals of some species, we may find that all p variables are positively correlated. In such cases, the first PC is often interpreted as a measure of size of the individuals, whereas subsequent PCs measure aspects of shape (see Sections 4.1, 13.2 for further discussion).

The discussion above implies that the approximate structure and variances of the first few PCs can be deduced from a correlation matrix, provided that well-defined groups of variables are detected, including possibly single-variable groups, whose within-group correlations are high, and whose between-group correlations are low.
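The variances quoted above for a group of correlated variables can be checked numerically in the idealized case where every pair in the group has exactly the same correlation r. This is a minimal sketch under that assumption (real groups only approximate it); the values q = 4 and r = 0.6 and the helper name `equicorrelation_matrix` are illustrative choices, not taken from the text.

```python
import numpy as np


def equicorrelation_matrix(q, r):
    """Return a q x q correlation matrix whose off-diagonal entries all equal r."""
    return (1.0 - r) * np.eye(q) + r * np.ones((q, q))


q, r = 4, 0.6  # illustrative values only
R = equicorrelation_matrix(q, r)

# eigh is for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(R)

largest = eigenvalues[-1]      # variance of the 'large variance PC'
smallest = eigenvalues[:-1]    # variances of the (q - 1) 'small variance PCs'

# The sign of a PC is arbitrary; flip it so the coefficients sum to a positive value.
first_pc = eigenvectors[:, -1]
first_pc = first_pc * np.sign(first_pc.sum())

print("largest variance:", largest, "vs 1 + (q - 1) r =", 1 + (q - 1) * r)
print("remaining variances:", smallest, "vs 1 - r =", 1 - r)
print("coefficients of the large-variance PC:", first_pc)
```

In this exact equal-correlation case the large-variance PC has equal positive coefficients on every variable, which is the 'average size' interpretation, while the remaining eigenvectors are orthogonal to it and therefore must mix positive and negative coefficients.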
The ideas can be taken further; upper and lower bounds on the variance of the first PC can be calculated, based on sums and averages of correlations (Friedman and Weisberg, 1981; Jackson, 1991, Section 4.2.3). However, it should be stressed that although data sets for which there is some group structure among variables are not uncommon, there are many others for which no such pattern is apparent. In such cases the structure of the PCs cannot usually be found without actually performing the PCA.

3.8.1 Example

In many of the examples discussed in later chapters, it will be seen that the structure of some of the PCs can be partially deduced from the correlation matrix, using the ideas just discussed. Here we describe an example in which all the PCs have a fairly clear pattern. The data consist of measurements of reflexes at 10 sites of the body, measured for 143 individuals. As with the examples discussed in Sections 3.3 and 3.4, the data were kindly supplied by Richard Hews of Pfizer Central Research.

58 3. Properties of Sample Principal Components

Table 3.4. Correlation matrix for ten variables measuring reflexes.

       V1    V2    V3    V4    V5    V6    V7    V8    V9    V10
V1    1.00
V2    0.98  1.00
V3    0.60  0.62  1.00
V4    0.71  0.73  0.88  1.00
V5    0.55  0.57  0.61  0.68  1.00
V6    0.55  0.57  0.56  0.68  0.97  1.00
V7    0.38  0.40  0.48  0.53  0.33  0.33  1.00
V8    0.25  0.28  0.42  0.47  0.27  0.27  0.90  1.00
V9    0.22  0.21  0.19  0.23  0.16  0.19  0.40  0.41  1.00
V10   0.20  0.19  0.18  0.21  0.13  0.16  0.39  0.40  0.94  1.00

The correlation matrix for these data is given in Table 3.4, and the coefficients of, and the variation accounted for by, the corresponding PCs are presented in Table 3.5. It should first be noted that the ten variables fall into five pairs. Thus, V1, V2, respectively, denote strength of reflexes for right and left triceps, with {V3, V4}, {V5, V6}, {V7, V8}, {V9, V10} similarly defined for right and left biceps, right and left wrists, right and left knees, and right and left ankles. The correlations between variables within each pair are large, so that the differences between variables in each pair have small variances. This is reflected in the last five PCs, which are mainly within-pair contrasts, with the more highly correlated pairs corresponding to the later components.

Turning to the first two PCs, there is a suggestion in the correlation matrix that, although all correlations are positive, the variables can be divided into two groups {V1–V6}, {V7–V10}. These correspond to sites in the arms and legs, respectively. Reflecting this group structure, the first and second PCs have their largest coefficients on the first and second groups of variables, respectively. Because the group structure is not clear-cut, these two PCs also have contributions from the less dominant group, and the first PC is a weighted average of variables from both groups, whereas the second PC is a weighted contrast between the groups.

The third, fourth and fifth PCs reinforce the idea of the two groups. The third PC is a contrast between the two pairs of variables in the second (smaller) group and the fourth and fifth PCs both give contrasts between the three pairs of variables in the first group.

It is relatively rare for examples with as many as ten variables to have such a nicely defined structure as in the present case for all their PCs. However, as will be seen in the examples of subsequent chapters, it is not unusual to be able to deduce the structure of at least a few PCs in this manner.
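Readers who want to examine the PCs described here can recompute them from Table 3.4. The following is a minimal sketch, not the author's computation: it rebuilds the full symmetric matrix from the printed lower triangle and takes its eigendecomposition with NumPy. Because the correlations are rounded to two decimals, and because the sign of each PC is arbitrary, the output may differ slightly from published values. The variable names (`lower`, `variances`, `coefficients`, `diff_var`) are illustrative. The sketch also evaluates the within-pair difference variances, using the fact that for standardized variables the variance of a difference is 2(1 − r).

```python
import numpy as np

# Lower triangle of Table 3.4 (rows V1..V10), as printed to two decimals.
lower = [
    [1.00],
    [0.98, 1.00],
    [0.60, 0.62, 1.00],
    [0.71, 0.73, 0.88, 1.00],
    [0.55, 0.57, 0.61, 0.68, 1.00],
    [0.55, 0.57, 0.56, 0.68, 0.97, 1.00],
    [0.38, 0.40, 0.48, 0.53, 0.33, 0.33, 1.00],
    [0.25, 0.28, 0.42, 0.47, 0.27, 0.27, 0.90, 1.00],
    [0.22, 0.21, 0.19, 0.23, 0.16, 0.19, 0.40, 0.41, 1.00],
    [0.20, 0.19, 0.18, 0.21, 0.13, 0.16, 0.39, 0.40, 0.94, 1.00],
]

p = len(lower)
R = np.zeros((p, p))
for i, row in enumerate(lower):
    R[i, : i + 1] = row
# Mirror the lower triangle into the upper one, keeping a unit diagonal.
R = R + R.T - np.diag(np.diag(R))

# PCA of a correlation matrix: eigenvalues are PC variances,
# eigenvector columns are the PC coefficients.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]          # largest variance first
variances = eigenvalues[order]
coefficients = eigenvectors[:, order]
coefficients[:, 0] *= np.sign(coefficients[:, 0].sum())  # fix arbitrary sign of PC1

percent = 100 * variances / variances.sum()

# Variance of the difference of two standardized variables: 2 * (1 - r).
pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]
diff_var = {f"V{i + 1}-V{j + 1}": 2 * (1 - R[i, j]) for i, j in pairs}

np.set_printoptions(precision=2, suppress=True)
print("PC variances:", variances)
print("percentage of total variation:", percent)
print("coefficients (columns are PCs):")
print(coefficients)
print("within-pair difference variances:", diff_var)
```

Comparing `diff_var` with the ordering of the last five components gives a direct check on the remark that the more highly correlated pairs correspond to the later components.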

