
Version: August 4, 2011

Notes on Principal Component Analysis

A preliminary introduction to principal components was given in our brief discussion of the spectral decomposition (i.e., the eigenvector/eigenvalue decomposition) of a matrix and what it might be used for. We will now be a bit more systematic, and begin by making three introductory comments:

(a) Principal component analysis (PCA) deals with only one set of variables, without the need for categorizing the variables as being independent or dependent. Such an asymmetry is present in the discussion of the general linear model; in PCA, however, we analyze the relationships among the variables in one set and not between two sets.

(b) As always, everything can be done computationally without the Multivariate Normal (MVN) assumption; we are just getting descriptive statistics. When significance tests and the like are desired, the MVN assumption becomes indispensable. Also, MVN gives some very nice interpretations for what the principal components are in terms of our constant density ellipsoids.

(c) Finally, it is probably best, if you are doing a PCA, not to refer to the resulting components as “factors”. A lot of blood and ill-will has been spilt and spread over the distinction between component analysis (which involves linear combinations of observable variables) and the estimation of a factor model (which involves the use of underlying latent variables or factors, and the estimation of the factor structure). We will get sloppy ourselves later, but some people really get exercised about these things.

We will begin working with the population (but everything translates more-or-less directly for a sample):

Suppose [X_1, X_2, . . . , X_p] = X′ is a set of p random variables, with mean vector µ and variance-covariance matrix Σ. I want to define p linear combinations of X′ that represent the information in X′ more parsimoniously. Specifically, find a_1, . . . , a_p such that a_1'X, . . . , a_p'X give the same information as X′, but the new random variables, a_1'X, . . . , a_p'X, are “nicer”.

Let λ_1 ≥ λ_2 ≥ · · · ≥ λ_p ≥ 0 be the p roots (eigenvalues) of the matrix Σ, and let a_1, . . . , a_p be the corresponding eigenvectors. If some roots are not distinct, I can still pick corresponding eigenvectors to be orthogonal. Choose each eigenvector a_i so that a_i'a_i = 1, i.e., a normalized eigenvector. Then, a_i'X is the i-th principal component of the random variables in X′.
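As a small numerical sketch (not part of the original notes; the 3 × 3 covariance matrix Sigma below is made up for illustration), the eigenvalues and normalized eigenvectors defining the principal components can be obtained as follows:

```python
import numpy as np

# A hypothetical 3 x 3 variance-covariance matrix (illustrative values only)
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# eigh is appropriate for symmetric matrices; it returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Reorder so that lambda_1 >= lambda_2 >= ... >= lambda_p
order = np.argsort(eigvals)[::-1]
lambdas = eigvals[order]     # roots: the variances of the components
A = eigvecs[:, order]        # columns a_1, ..., a_p, each of unit length

# The i-th principal component of X is a_i'X; its variance is lambda_i
print(lambdas)
print(A)
```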

Properties:

1) Var(a_i'X) = a_i'Σa_i = λ_i.

We know Σa_i = λ_i a_i, because a_i is the eigenvector for λ_i; thus, a_i'Σa_i = a_i'λ_i a_i = λ_i. In words, the variance of the i-th principal component is λ_i, the root.

Also, for all vectors b_i such that b_i is orthogonal to a_1, . . . , a_{i−1}, and b_i'b_i = 1, Var(b_i'X) is the greatest it can be (i.e., λ_i) when b_i = a_i.



2) a_i and a_j are orthogonal for i ≠ j, i.e., a_i'a_j = 0.

3) Cov(a_i'X, a_j'X) = a_i'Σa_j = a_i'λ_j a_j = λ_j a_i'a_j = 0.

4) Tr(Σ) = λ_1 + · · · + λ_p = the sum of the variances for all p principal components, and also for X_1, . . . , X_p. The importance of the i-th principal component is

λ_i / Tr(Σ),

which is equal to the variance of the i-th principal component divided by the total variance in the system of p random variables, X_1, . . . , X_p; it is the proportion of the total variance explained by the i-th component.
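Continuing the numerical sketch above (reusing the sorted eigenvalues lambdas from the hypothetical Sigma), the proportions of total variance explained are simply:

```python
# Proportion of total variance for each component: lambda_i / Tr(Sigma);
# Tr(Sigma) equals the sum of the eigenvalues
explained = lambdas / lambdas.sum()
print(explained)             # first entry is lambda_1 / Tr(Sigma)
print(np.cumsum(explained))  # cumulative proportion explained by the first r components
```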

If the first few principal components account for most of the variation, then we might interpret these components as “factors” underlying the whole set X_1, . . . , X_p. This is the basis of principal factor analysis.

The question of how many components (or factors, or clusters, or dimensions) to retain usually has no definitive answer. One common attempt is to construct what is called a scree plot and to judge graphically how many components to keep. A plot is constructed of the value of each eigenvalue on the y-axis against the number of the eigenvalue (e.g., 1, 2, 3, and so on) on the x-axis, and you look for an “elbow” to see where to stop. Scree or talus is the pile of rock debris (detritus) at the foot of a cliff, i.e., the sloping mass of debris at the bottom of the cliff. I, unfortunately, can never see an “elbow”!
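As a quick illustrative sketch (assuming matplotlib is available and reusing the sorted eigenvalues lambdas from the sketch above), a scree plot is just the eigenvalues against their index:

```python
import matplotlib.pyplot as plt

# Scree plot: eigenvalue magnitude against eigenvalue number (1, 2, ..., p)
ks = np.arange(1, len(lambdas) + 1)
plt.plot(ks, lambdas, marker="o")
plt.xlabel("eigenvalue number")
plt.ylabel("eigenvalue")
plt.title("Scree plot")
plt.show()   # look for an 'elbow' where the curve flattens out
```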

If we let a population correlation matrix corresponding to Σ be denoted as P, then Tr(P) = p, and we might use only those principal components that have variance λ_i ≥ 1; otherwise, a component would “explain” less variance than would a single variable.

A major rub: if I do principal components on the correlation matrix, P, and on the original variance-covariance matrix, Σ, the structures obtained are generally different. This is one reason the “true believers” might prefer a factor analysis model over a PCA, because the former holds out some hope for invariance to scaling. Generally, it seems more reasonable to always use the population correlation matrix, P; the units of the original variables become irrelevant, and it is much easier to interpret the principal components through their coefficients.
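As a brief sketch of this point (reusing the hypothetical Sigma from above), the eigenvector structures of the covariance matrix and of the corresponding correlation matrix are generally not simple rescalings of one another:

```python
# Convert the covariance matrix to the corresponding correlation matrix
d = np.sqrt(np.diag(Sigma))          # standard deviations sigma_i
P = Sigma / np.outer(d, d)           # P has a unit diagonal

vals_cov, vecs_cov = np.linalg.eigh(Sigma)
vals_cor, vecs_cor = np.linalg.eigh(P)

# The two eigenvector structures generally differ; Tr(P) = p
print(vecs_cov[:, ::-1])
print(vecs_cor[:, ::-1])
print(np.trace(P))                   # equals the number of variables, p
```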

The j-th principal component is a_j'X. Now, Cov(a_j'X, X_i) = Cov(a_j'X, b'X), where b' = [0 · · · 0 1 0 · · · 0], with the 1 in the i-th position. This equals a_j'Σb = b'Σa_j = b'λ_j a_j = λ_j times the i-th component of a_j, which is λ_j a_{ij}. Thus,

$$
\mathrm{Cor}(a_j'X, X_i) = \frac{\lambda_j a_{ij}}{\sqrt{\lambda_j}\,\sigma_i} = \frac{\sqrt{\lambda_j}\,a_{ij}}{\sigma_i},
$$

where σ_i is the standard deviation of X_i. This correlation is called the loading of X_i on the j-th component. Generally, these correlations can be used to see the contribution of each variable to each of the principal components.
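A small sketch of the loading computation (again reusing the hypothetical Sigma, the eigenvalues lambdas, and the eigenvector matrix A from above):

```python
# Loadings: correlation between each original variable X_i and component j,
# i.e., sqrt(lambda_j) * a_ij / sigma_i
sigmas = np.sqrt(np.diag(Sigma))                    # standard deviations of the X_i
loadings = A * np.sqrt(lambdas) / sigmas[:, None]   # p x p matrix of loadings
print(loadings)
```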

If the population covariance matrix, Σ, is replaced by the sample covariance matrix, S, we obtain sample principal components; if the population correlation matrix, P, is replaced by the sample correlation matrix, R, we again obtain sample principal components. These two structures are generally different.



The covariance matrix S (or Σ) can be represented by

$$
S = [a_1, \ldots, a_p]
\begin{pmatrix} \sqrt{\lambda_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sqrt{\lambda_p} \end{pmatrix}
\begin{pmatrix} \sqrt{\lambda_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sqrt{\lambda_p} \end{pmatrix}
\begin{pmatrix} a_1' \\ \vdots \\ a_p' \end{pmatrix}
\equiv LL',
$$

or as the sum of p, p × p matrices,

$$
S = \lambda_1 a_1 a_1' + \cdots + \lambda_p a_p a_p' .
$$

Given the ordering of the eigenvalues as λ_1 ≥ λ_2 ≥ · · · ≥ λ_p ≥ 0, the least-squares approximation to S of rank r is λ_1 a_1 a_1' + · · · + λ_r a_r a_r', and the residual matrix, S − λ_1 a_1 a_1' − · · · − λ_r a_r a_r', is λ_{r+1} a_{r+1} a_{r+1}' + · · · + λ_p a_p a_p'.

Note that for an arbitrary matrix, B_{p×q}, Tr(BB') is the sum of squares of the entries in B. Also, for two matrices, B and C, if both of the products BC and CB can be formed, then Tr(BC) is equal to Tr(CB). Using these two results, the least-squares criterion value can be given as

$$
\mathrm{Tr}\big( [\lambda_{r+1} a_{r+1} a_{r+1}' + \cdots + \lambda_p a_p a_p'][\lambda_{r+1} a_{r+1} a_{r+1}' + \cdots + \lambda_p a_p a_p']' \big) = \sum_{k \ge r+1} \lambda_k^2 .
$$

This measure is one of how bad the rank r approximation might be (i.e., the proportion of unexplained sum-of-squares when put over ∑_{k=1}^{p} λ_k²).
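As a numerical sketch of the rank-r approximation and its badness-of-fit measure (reusing lambdas and A from the hypothetical Sigma above; the choice r = 1 is arbitrary):

```python
r = 1   # rank of the approximation (illustrative choice)

# Least-squares rank-r approximation: sum of the first r terms lambda_k a_k a_k'
S_r = sum(lambdas[k] * np.outer(A[:, k], A[:, k]) for k in range(r))
residual = Sigma - S_r

# Badness of fit: the sum of squared residual entries equals sum_{k > r} lambda_k^2,
# usually reported as a proportion of sum_k lambda_k^2
print(np.sum(residual**2), np.sum(lambdas[r:]**2))   # these agree
print(np.sum(lambdas[r:]**2) / np.sum(lambdas**2))   # proportion unexplained
```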

For a geometric interpretation of principal components, suppose we have two variables, X_1 and X_2, that are centered at their respective means (i.e., the means of the scores on X_1 and X_2 are zero). In the diagram below, the ellipse represents the scatter diagram of the sample points. The first principal component is a line through the widest part of the ellipse; the second component is the line at right angles to the first principal component. In other words, the first principal component goes through the fattest part of the “football”, and the second principal component goes through the next fattest part of the “football”, orthogonal to the first; and so on. Equivalently, we take our original frame of reference and perform a rigid rotation about the origin to get a new set of axes; the origin is given by the sample means (of zero) on the two variables X_1 and X_2. (To make these same geometric points, we could have used a constant density contour for a bivariate normal pair of random variables, X_1 and X_2, with zero mean vector.)

[Figure: scatter ellipse for X_1 and X_2, with the first and second principal components drawn as a rotated pair of orthogonal axes.]

As an example of how to find the placement of the components in the picture given above, suppose we have the two variables, X_1 and X_2, with variance-covariance matrix

$$
\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}.
$$



Let a_11 and a_21 denote the weights from the first eigenvector of Σ; a_12 and a_22 are the weights from the second eigenvector. If these are placed in a 2 × 2 orthogonal (or rotation) matrix T, with the first column containing the first eigenvector weights and the second column the second eigenvector weights, we can obtain the direction cosines of the new axes system from the following:

$$
T = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
  = \begin{pmatrix} \cos(\theta) & \cos(90 + \theta) \\ \cos(\theta - 90) & \cos(\theta) \end{pmatrix}
  = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}.
$$

These are the cosines of the angles with the positive (horizontal and vertical) axes. If we wish to change the orientation of a transformed axis (i.e., to make the arrow go in the other direction), we merely multiply the relevant eigenvector values by −1 (i.e., we choose the other normalized eigenvector for that same eigenvalue, which still has unit length).

[Figure: the original axes and the rotated component axes, with the angles θ, θ − 90, and 90 + θ marked between them.]

If we denote the data matrix in this simple two-variable problem as X_{n×2}, where n is the number of subjects and the two columns represent the values on variables X_1 and X_2 (i.e., the coordinates of each subject on the original axes), then the n × 2 matrix of coordinates of the subjects on the transformed axes, say X_trans, can be given as XT.
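A compact sketch of this two-variable rotation (with an illustrative 2 × 2 covariance matrix and made-up centered data; none of these numbers come from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up centered data for n subjects on X1 and X2 (illustrative only)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[4.0, 1.5], [1.5, 1.0]], size=200)
X = X - X.mean(axis=0)                     # center at the sample means

S = np.cov(X, rowvar=False)                # 2 x 2 sample covariance matrix
vals, vecs = np.linalg.eigh(S)
T = vecs[:, np.argsort(vals)[::-1]]        # columns: first and second eigenvectors

theta = np.degrees(np.arctan2(T[1, 0], T[0, 0]))   # angle of the first component axis
X_trans = X @ T                            # coordinates on the rotated (component) axes
print(theta)
print(X_trans[:3])
```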

For another interpretation of principal components, the first component could be obtained by minimizing the sum of squared perpendicular residuals from a line (in analogy to simple regression, where the sum of squared vertical residuals from a line is minimized). This notion generalizes to more than one principal component, in analogy to the way that multiple regression generalizes simple regression: vertical residuals to hyperplanes are used in multiple regression, and perpendicular residuals to hyperplanes are used in PCA.
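As a short sketch contrasting the two kinds of residuals (reusing the made-up centered data X and rotation matrix T from the sketch above; the comparison is an illustration, not part of the notes):

```python
# Ordinary regression of X2 on X1: the slope minimizes vertical residuals
slope_ols = np.sum(X[:, 0] * X[:, 1]) / np.sum(X[:, 0] ** 2)

# First principal component direction: minimizes perpendicular residuals
slope_pca = T[1, 0] / T[0, 0]

# The two slopes generally differ
print(slope_ols, slope_pca)
```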

There are a number of specially patterned matrices that have interesting eigenvector/eigenvalue decompositions. For example, for the p × p diagonal variance-covariance matrix

$$
\Sigma_{p \times p} = \begin{pmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_p^2 \end{pmatrix},
$$

the roots are σ_1², . . . , σ_p², and the eigenvector corresponding to σ_i² is [0 0 . . . 1 . . . 0]', where the single 1 is in the i-th position. If we have a correlation matrix, the root of 1 has multiplicity p, and the eigenvectors could also be chosen as these same vectors having all zeros except for a single 1 in the i-th position, 1 ≤ i ≤ p.

If the p × p variance-covariance matrix demonstrates compound symmetry,

$$
\Sigma_{p \times p} = \sigma^2 \begin{pmatrix} 1 & \cdots & \rho \\ \vdots & \ddots & \vdots \\ \rho & \cdots & 1 \end{pmatrix},
$$

or is an equicorrelation matrix,

$$
P = \begin{pmatrix} 1 & \cdots & \rho \\ \vdots & \ddots & \vdots \\ \rho & \cdots & 1 \end{pmatrix},
$$

then the p − 1 smallest roots are all equal. For example, for the equicorrelation matrix, λ_1 = 1 + (p − 1)ρ, and the remaining p − 1 roots are all equal to 1 − ρ. The p × 1 eigenvector for λ_1 is [1/√p, . . . , 1/√p]', and defines an average. Generally, for any variance-covariance matrix with all entries greater than zero (or just nonnegative), the entries in the first eigenvector must all be greater than zero (or nonnegative). This is known as the Perron-Frobenius theorem.
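A quick numerical check of the equicorrelation result (a self-contained sketch with illustrative values p = 4 and ρ = 0.5):

```python
import numpy as np

p, rho = 4, 0.5
P_eq = (1 - rho) * np.eye(p) + rho * np.ones((p, p))   # equicorrelation matrix

vals = np.sort(np.linalg.eigvalsh(P_eq))[::-1]
print(vals)                          # largest root: 1 + (p - 1) * rho = 2.5
print(1 + (p - 1) * rho, 1 - rho)    # remaining p - 1 roots all equal 1 - rho = 0.5
```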

Although we will not give these tests explicitly here (they can be found in Johnson and Wichern's (2007) multivariate text), there are inference methods to test the null hypothesis of an equicorrelation matrix (i.e., the last p − 1 eigenvalues are equal); that the variance-covariance matrix is diagonal or the correlation matrix is the identity (i.e., all eigenvalues are equal); or a sphericity test of independence, that all eigenvalues are equal and Σ is σ² times the identity matrix.



0.1 Analytic Rotation Methods

Suppose we have a p × m matrix, A, containing the correlations (loadings) between our p variables and the first m principal components. We are seeking an orthogonal m × m matrix T that defines a rotation of the m components into a new p × m matrix, B, that contains the correlations (loadings) between the p variables and the newly rotated axes: AT = B. A rotation matrix T is sought that produces “nice” properties in B, e.g., a “simple structure”, where generally the loadings are positive and either close to 1.0 or to 0.0. The most common strategy is due to Kaiser, and calls for maximizing the normal varimax criterion:

$$
\frac{1}{p} \sum_{j=1}^{m} \left[ \sum_{i=1}^{p} (b_{ij}/h_i)^4 - \frac{\gamma}{p} \left\{ \sum_{i=1}^{p} (b_{ij}/h_i)^2 \right\}^2 \right],
$$

where the parameter γ = 1 for varimax, and h_i = √(∑_{j=1}^{m} b_ij²) (this is called the square root of the communality of the i-th variable in a factor analytic context). Other criteria have been suggested for this so-called orthomax criterion that use different values of γ: 0 for quartimax, m/2 for equamax, and p(m − 1)/(p + m − 2) for parsimax.
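A minimal sketch of the criterion itself (a hypothetical helper, not a full rotation algorithm; B is any p × m loading matrix and gamma selects the family member):

```python
import numpy as np

def orthomax_criterion(B, gamma=1.0):
    """Value of the normal orthomax criterion for a p x m loading matrix B.

    gamma = 1 gives varimax; 0 quartimax; m/2 equamax; p(m-1)/(p+m-2) parsimax.
    """
    p, m = B.shape
    h = np.sqrt(np.sum(B**2, axis=1))       # square roots of the communalities
    W = B / h[:, None]                      # Kaiser-normalized loadings b_ij / h_i
    per_column = np.sum(W**4, axis=0) - (gamma / p) * np.sum(W**2, axis=0) ** 2
    return per_column.sum() / p

# A rotation would choose an orthogonal m x m matrix T to maximize
# orthomax_criterion(A @ T, gamma) over all such T.
```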

Also, various methods are available for attempting oblique rotations, where the transformed axes do not need to maintain orthogonality, e.g., oblimin in SYSTAT or Procrustes in MATLAB. Generally, varimax seems to be a good default choice. It tends to “smear” the variance explained across the transformed axes rather evenly. We will stick with varimax in the various examples we do later.



0.2 Little Jiffy<br />

Chester Harris named a procedure proposed by Henry Kaiser for “factor analysis” Little Jiffy. It is defined very simply as “principal components (of a correlation matrix) with associated eigenvalues greater than 1.0, followed by normal varimax rotation”. To this date, it is the most used approach to doing a factor analysis, and could be called “the principal component solution to the factor analytic model”.

More explicitly, we start with the p × p sample correlation matrix R and assume it has r eigenvalues greater than 1.0. R is then approximated by a rank r matrix of the form:

$$
R = \lambda_1 a_1 a_1' + \cdots + \lambda_r a_r a_r'
  = (\sqrt{\lambda_1}\, a_1)(\sqrt{\lambda_1}\, a_1') + \cdots + (\sqrt{\lambda_r}\, a_r)(\sqrt{\lambda_r}\, a_r')
  = b_1 b_1' + \cdots + b_r b_r'
  = (b_1, \ldots, b_r) \begin{pmatrix} b_1' \\ \vdots \\ b_r' \end{pmatrix}
  = BB',
$$

where

$$
B_{p \times r} = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1r} \\ b_{21} & b_{22} & \cdots & b_{2r} \\ \vdots & \vdots & & \vdots \\ b_{p1} & b_{p2} & \cdots & b_{pr} \end{pmatrix}.
$$

The entries in B are the loadings of the row variables on the column components.

For any r × r orthogonal matrix T, we know TT' = I, and

$$
R = BIB' = BTT'B' = (BT)(BT)' = B^{*}_{p \times r} B^{*\prime}_{r \times p}.
$$
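A small sketch of the Little Jiffy construction up to the rotation step (little_jiffy_loadings is a hypothetical helper; it assumes a sample correlation matrix is already available, and the varimax rotation would then be applied to the resulting B, e.g., by maximizing the criterion sketched earlier):

```python
import numpy as np

def little_jiffy_loadings(R_sample):
    """Unrotated loading matrix B from the 'Little Jiffy' recipe:
    keep the components of a correlation matrix with eigenvalues > 1.0,
    and scale each retained eigenvector a_k by sqrt(lambda_k)."""
    vals, vecs = np.linalg.eigh(R_sample)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > 1.0                          # Kaiser's eigenvalue-greater-than-1 rule
    B = vecs[:, keep] * np.sqrt(vals[keep])    # b_k = sqrt(lambda_k) * a_k
    return B                                   # R is approximated by B @ B.T

# A varimax step would replace B by B @ T for an orthogonal T chosen to
# maximize the varimax criterion; B @ B.T is unchanged by that rotation.
```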



For example, varimax is one method for constructing B*. The columns of B*, when normalized to unit length, define r linear composites of the observable variables, where the sum of squares within a column of B* defines the variance for that composite. The composites are still orthogonal.

0.3 Principal Components in Terms of the Data Matrix

For convenience, suppose we transform our n × p data matrix X into the z-score data matrix Z, and, assuming n > p, let the SVD of Z be Z_{n×p} = U_{n×p} D_{p×p} V'_{p×p}. Note that the p × p correlation matrix is

$$
R = \frac{1}{n} Z'Z = \frac{1}{n}(VDU')(UDV') = V\left(\frac{1}{n}D^2\right)V'.
$$

So, the rows of V' are the principal component weights. Also,

$$
ZV = UDV'V = UD .
$$

In other words, (UD)_{n×p} gives the scores for the n subjects on the p principal components.
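A brief self-contained sketch of this data-matrix route (with made-up data; nothing here comes from the notes' own examples):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))                      # made-up n x p data matrix

# z-score the columns (dividing by n, to match R = (1/n) Z'Z)
n = X.shape[0]
Z = (X - X.mean(axis=0)) / X.std(axis=0)

U, d, Vt = np.linalg.svd(Z, full_matrices=False)   # Z = U D V'
scores = U * d                                     # (UD): subject scores on the components
weights = Vt                                       # rows of V': component weights

# Eigenvalues of R = (1/n) Z'Z are d**2 / n
R = Z.T @ Z / n
print(np.sort(np.linalg.eigvalsh(R))[::-1])
print(d**2 / n)                                    # these agree
```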

What's going on in “variable” space: Suppose we look at a rank 2 approximation Z_{n×p} ≈ U_{n×2} D_{2×2} V'_{2×p}. The i-th subject's row data vector sits somewhere in p-dimensional “variable” space; it is approximated by a linear combination of the two eigenvectors (which gives another point in p dimensions), where the weights used in the linear combination come from the i-th row of (UD)_{n×2}. Because we are doing least squares, we are minimizing the squared Euclidean distance between the subject's row vector and the vector defined by the particular linear combination of the two eigenvectors. These approximating vectors in p dimensions all lie in a plane defined by all linear combinations of the two eigenvectors. For a rank 1 approximation, we merely have a multiple of the first eigenvector (in p dimensions) as the approximating vector for a subject's row vector.

What's going on in “subject space”: Suppose we begin by looking at a rank 1 approximation Z_{n×p} ≈ U_{n×1} D_{1×1} V'_{1×p}. The j-th column (i.e., variable) of Z is a point in n-dimensional “subject space”, and is approximated by a multiple of the scores on the first component, (UD)_{n×1}. The multiple used is the j-th element of the 1 × p vector of first component weights, V'_{1×p}. Thus, each column of the n × p approximating matrix, U_{n×1} D_{1×1} V'_{1×p}, is a multiple of the same vector giving the scores on the first component. In other words, we represent each column (variable) by a multiple of one specific vector, where the multiple represents where the projection lies on this one single vector (the term “projection” is used because of the least-squares property of the approximation). For a rank 2 approximation, each column variable in Z is represented by a point in the plane defined by all linear combinations of the two component score columns in U_{n×2} D_{2×2}; the point in that plane is determined by the weights in the j-th column of V'_{2×p}. Alternatively, Z is approximated by the sum of two n × p matrices whose columns are multiples of the first or second component scores.

As a way of illustrating a graphical representation of the principal components of a data matrix (through a biplot), suppose we have the rank 2 approximation Z_{n×p} ≈ U_{n×2} D_{2×2} V'_{2×p}, and consider a two-dimensional Cartesian system where the horizontal axis corresponds to the first component and the vertical axis corresponds to the second component. Use the n two-dimensional coordinates in (U_{n×2} D_{2×2}) to plot the rows (subjects), and let V_{p×2} define the two-dimensional coordinates for the p variables in this same space. As in any biplot, if a vector is drawn from the origin through the i-th row (subject) point, and the p column points are projected onto this vector, the collection of such projections is proportional to the i-th row of the n × p approximation matrix (U_{n×2} D_{2×2} V'_{2×p}).
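A minimal biplot sketch along these lines (reusing U, d, and Vt from the SVD sketch above, and assuming matplotlib is available):

```python
import matplotlib.pyplot as plt

# Rank-2 biplot coordinates: rows (subjects) from the first two columns of UD,
# columns (variables) from the first two columns of V
row_pts = (U * d)[:, :2]
col_pts = Vt.T[:, :2]

plt.scatter(row_pts[:, 0], row_pts[:, 1], s=10, label="subjects")
for j, (x, y) in enumerate(col_pts):
    plt.arrow(0, 0, x, y, head_width=0.02, color="red")
    plt.annotate(f"X{j + 1}", (x, y))
plt.xlabel("first component")
plt.ylabel("second component")
plt.legend()
plt.show()
```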

The emphasis in these notes has been on the descriptive aspects of principal components. For a discussion of the statistical properties of these entities, consult Johnson and Wichern (2007): confidence intervals on the population eigenvalues; testing the equality of eigenvalues; assessing the patterning present in an eigenvector; and so on.

