
Notes on Principal Component Analysis

Version: August 4, 2011

A preliminary introduction to principal components was given in our brief discussion of the spectral decomposition (i.e., the eigenvector/eigenvalue decomposition) of a matrix and what it might be used for. We will now be a bit more systematic, and begin by making three introductory comments:

(a) Principal component analysis (PCA) deals with only one set of variables, without the need for categorizing the variables as being independent or dependent. There is asymmetry in the discussion of the general linear model; in PCA, however, we analyze the relationships among the variables in one set and not between two.

(b) As always, everything can be done computationally without the Multivariate Normal (MVN) assumption; we are just getting descriptive statistics. When significance tests and the like are desired, the MVN assumption becomes indispensable. Also, MVN gives some very nice interpretations for what the principal components are in terms of our constant density ellipsoids.

(c) Finally, if you are doing a PCA, it is probably best not to refer to the components as “factors”. A lot of blood and ill-will has been spilt and spread over the distinction between component analysis (which involves linear combinations of observable variables) and the estimation of a factor model (which involves the use of underlying latent variables or factors, and the estimation of the factor structure). We will get sloppy ourselves later, but some people really get exercised about these things.

We will begin working with the population (but everything translates more-or-less directly for a sample):

Suppose $[X_1, X_2, \ldots, X_p] = X'$ is a set of $p$ random variables, with mean vector $\mu$ and variance-covariance matrix $\Sigma$. I want to define $p$ linear combinations of $X'$ that represent the information in $X'$ more parsimoniously. Specifically, find $a_1, \ldots, a_p$ such that $a_1'X, \ldots, a_p'X$ give the same information as $X'$, but the new random variables, $a_1'X, \ldots, a_p'X$, are “nicer”.

Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$ be the $p$ roots (eigenvalues) of the matrix $\Sigma$, and let $a_1, \ldots, a_p$ be the corresponding eigenvectors. If some roots are not distinct, I can still pick corresponding eigenvectors to be orthogonal. Choose each eigenvector $a_i$ so that $a_i'a_i = 1$, i.e., a normalized eigenvector. Then $a_i'X$ is the $i$th principal component of the random variables in $X'$.

Properties:

1) Var($a_i'X$) $= a_i'\Sigma a_i = \lambda_i$.

We know $\Sigma a_i = \lambda_i a_i$, because $a_i$ is the eigenvector for $\lambda_i$; thus, $a_i'\Sigma a_i = a_i'\lambda_i a_i = \lambda_i$. In words, the variance of the $i$th principal component is $\lambda_i$, the root.

Also, for all vectors $b_i$ such that $b_i$ is orthogonal to $a_1, \ldots, a_{i-1}$ and $b_i'b_i = 1$, Var($b_i'X$) is the greatest it can be (i.e., $\lambda_i$) when $b_i = a_i$.

2) $a_i$ and $a_j$ are orthogonal, i.e., $a_i'a_j = 0$.

3) Cov($a_i'X, a_j'X$) $= a_i'\Sigma a_j = a_i'\lambda_j a_j = \lambda_j a_i'a_j = 0$.

4) Tr($\Sigma$) $= \lambda_1 + \cdots + \lambda_p$ = the sum of variances for all $p$ principal components, and for $X_1, \ldots, X_p$. The importance of the $i$th principal component is
\[ \lambda_i / \mathrm{Tr}(\Sigma) , \]
which is equal to the variance of the $i$th principal component divided by the total variance in the system of $p$ random variables, $X_1, \ldots, X_p$; it is the proportion of the total variance explained by the $i$th component.
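To make these properties concrete, here is a minimal numpy sketch (the covariance matrix below is made up purely for illustration and is not from these notes): the eigenvalues of $\Sigma$ are the component variances, and $\lambda_i/\mathrm{Tr}(\Sigma)$ gives the proportion of total variance each component explains.

```python
import numpy as np

# Hypothetical covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

lam, A = np.linalg.eigh(Sigma)          # eigh: ascending eigenvalues for symmetric matrices
lam, A = lam[::-1], A[:, ::-1]          # reorder so lambda_1 >= ... >= lambda_p

print(lam)                              # Var(a_i'X) = lambda_i
print(np.round(A.T @ Sigma @ A, 8))     # diagonal matrix holding the lambda_i
print(lam / np.trace(Sigma))            # proportion of total variance explained by each component
```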

If the first few principal components account for most of the variation, then we might interpret these components as “factors” underlying the whole set $X_1, \ldots, X_p$. This is the basis of principal factor analysis.

The question of how many components (or factors, or clusters, or dimensions) to retain usually has no definitive answer. Some attempt has been made to use what are called scree plots to see graphically how many components to retain. A plot is constructed with the value of the eigenvalue on the y-axis and the number of the eigenvalue (e.g., 1, 2, 3, and so on) on the x-axis, and you look for an “elbow” to see where to stop. Scree or talus is the pile of rock debris (detritus) at the foot of a cliff, i.e., the sloping mass of debris at the bottom of the cliff. I, unfortunately, can never see an “elbow”!
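As a hedged illustration of the scree idea, the following matplotlib sketch (with fabricated data, so any “elbow” it shows is not meaningful) plots the eigenvalues of a sample correlation matrix against their number:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6)) @ rng.standard_normal((6, 6))   # fabricated data
R = np.corrcoef(X, rowvar=False)                                  # sample correlation matrix

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]                    # descending eigenvalues

plt.plot(np.arange(1, len(eigvals) + 1), eigvals, marker="o")
plt.xlabel("eigenvalue number")
plt.ylabel("eigenvalue")
plt.title("Scree plot")
plt.show()
```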

If we let the population correlation matrix corresponding to $\Sigma$ be denoted by $P$, then Tr($P$) $= p$, and we might use only those principal components that have variance $\lambda_i \geq 1$; otherwise, the component would “explain” less variance than would a single variable.

A major rub: if I do principal components on the correlation matrix, $P$, and on the original variance-covariance matrix, $\Sigma$, the structures obtained are generally different. This is one reason the “true believers” might prefer a factor analysis model over a PCA, because the former holds out some hope for invariance to scaling. Generally, it seems more reasonable to always use the population correlation matrix, $P$; the units of the original variables become irrelevant, and it is much easier to interpret the principal components through their coefficients.
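A small numpy sketch of this rub, on fabricated data with wildly different variable scales, shows that the leading eigenvector of the covariance matrix and that of the correlation matrix generally differ:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 3))   # correlated fabricated data
X = X * np.array([1.0, 10.0, 100.0])                              # very different scales

S = np.cov(X, rowvar=False)            # sample covariance matrix
R = np.corrcoef(X, rowvar=False)       # sample correlation matrix

a1_S = np.linalg.eigh(S)[1][:, -1]     # first eigenvector (largest eigenvalue) from S
a1_R = np.linalg.eigh(R)[1][:, -1]     # first eigenvector from R
print(a1_S)                            # dominated by the large-variance variable
print(a1_R)                            # no longer tied to the original units
```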

The $j$th principal component is $a_j'X$. Now, Cov($a_j'X, X_i$) = Cov($a_j'X, b'X$), where $b' = [0 \cdots 0\ 1\ 0 \cdots 0]$, with the 1 in the $i$th position; this equals $a_j'\Sigma b = b'\Sigma a_j = b'\lambda_j a_j = \lambda_j$ times the $i$th component of $a_j$, which is $\lambda_j a_{ij}$. Thus,
\[ \mathrm{Cor}(a_j'X, X_i) = \frac{\lambda_j a_{ij}}{\sqrt{\lambda_j}\,\sigma_i} = \frac{\sqrt{\lambda_j}\,a_{ij}}{\sigma_i} , \]
where $\sigma_i$ is the standard deviation of $X_i$. This correlation is called the loading of $X_i$ on the $j$th component. Generally, these correlations can be used to see the contribution of each variable to each of the principal components.
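A brief numpy sketch of the loading formula (again with a made-up covariance matrix): each entry of the matrix printed below is $\sqrt{\lambda_j}\,a_{ij}/\sigma_i$.

```python
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])              # hypothetical covariance matrix
lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]                   # descending eigenvalue order

sigma = np.sqrt(np.diag(Sigma))                  # standard deviations of the X_i
loadings = A * np.sqrt(lam) / sigma[:, None]     # entry (i, j): loading of X_i on component j
print(np.round(loadings, 3))
```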

If the population covariance matrix, $\Sigma$, is replaced by the sample covariance matrix, $S$, we obtain sample principal components; if the population correlation matrix, $P$, is replaced by the sample correlation matrix, $R$, we again obtain sample principal components. These structures are generally different.


The covariance matrix $S$ (or $\Sigma$) can be represented by
\[
S = [a_1, \ldots, a_p]
\begin{pmatrix} \sqrt{\lambda_1} & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & \sqrt{\lambda_p} \end{pmatrix}
\begin{pmatrix} \sqrt{\lambda_1} & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & \sqrt{\lambda_p} \end{pmatrix}
\begin{pmatrix} a_1' \\ \vdots \\ a_p' \end{pmatrix}
\equiv LL' ,
\]
or as the sum of $p$, $p \times p$ matrices,
\[ S = \lambda_1 a_1 a_1' + \cdots + \lambda_p a_p a_p' . \]
Given the ordering of the eigenvalues as $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$, the least-squares approximation to $S$ of rank $r$ is $\lambda_1 a_1 a_1' + \cdots + \lambda_r a_r a_r'$, and the residual matrix, $S - \lambda_1 a_1 a_1' - \cdots - \lambda_r a_r a_r'$, is $\lambda_{r+1} a_{r+1} a_{r+1}' + \cdots + \lambda_p a_p a_p'$.

Note that for an arbitrary matrix, $B_{p \times q}$, Tr($BB'$) = the sum of squares of the entries in $B$. Also, for two matrices, $B$ and $C$, if both of the products $BC$ and $CB$ can be taken, then Tr($BC$) is equal to Tr($CB$). Using these two results, the least-squares criterion value can be given as
\[ \mathrm{Tr}\big( [\lambda_{r+1} a_{r+1} a_{r+1}' + \cdots + \lambda_p a_p a_p'][\lambda_{r+1} a_{r+1} a_{r+1}' + \cdots + \lambda_p a_p a_p']' \big) = \sum_{k \geq r+1} \lambda_k^2 . \]
This measure is one of how bad the rank $r$ approximation might be (i.e., the proportion of unexplained sum-of-squares when put over $\sum_{k=1}^{p} \lambda_k^2$).
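The following numpy sketch (with the same hypothetical $\Sigma$ used above, chosen only for illustration) builds the rank-$r$ approximation from the leading eigenvalue/eigenvector pairs and checks that the least-squares criterion value equals $\sum_{k > r} \lambda_k^2$:

```python
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])              # hypothetical covariance matrix
lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]                   # lambda_1 >= ... >= lambda_p

r = 1
S_r = sum(lam[k] * np.outer(A[:, k], A[:, k]) for k in range(r))   # rank-r approximation
residual = Sigma - S_r

print(np.trace(residual @ residual.T))           # least-squares criterion value
print(np.sum(lam[r:] ** 2))                      # equals the sum of the remaining lambda_k^2
```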

For a geometric interpretation of principal components, suppose we have two variables, $X_1$ and $X_2$, that are centered at their respective means (i.e., the means of the scores on $X_1$ and $X_2$ are zero). In the diagram below, the ellipse represents the scatter diagram of the sample points. The first principal component is a line through the widest part; the second component is the line at right angles to the first principal component. In other words, the first principal component goes through the fattest part of the “football”, and the second principal component through the next fattest part of the “football” and orthogonal to the first; and so on. Or, we take our original frame of reference and do a rigid transformation around the origin to get a new set of axes; the origin is given by the sample means (of zero) on the two variables $X_1$ and $X_2$. (To make these same geometric points, we could have used a constant density contour for a bivariate normal pair of random variables, $X_1$ and $X_2$, with zero mean vector.)

[Figure: scatter ellipse for $X_1$ and $X_2$, with the first and second principal components drawn as a new pair of orthogonal axes.]

As an example of how to find the placement of the components in the picture given above, suppose we have the two variables, $X_1$ and $X_2$, with variance-covariance matrix
\[ \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} . \]


Let $a_{11}$ and $a_{21}$ denote the weights from the first eigenvector of $\Sigma$; $a_{12}$ and $a_{22}$ are the weights from the second eigenvector. If these are placed in a $2 \times 2$ orthogonal (or rotation) matrix $T$, with the first column containing the first eigenvector weights and the second column the second eigenvector weights, we can obtain the direction cosines of the new axes system from the following:
\[ T = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = \begin{pmatrix} \cos(\theta) & \cos(90^\circ + \theta) \\ \cos(\theta - 90^\circ) & \cos(\theta) \end{pmatrix} = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix} . \]
These are the cosines of the angles with the positive (horizontal and vertical) axes. If we wish to change the orientation of a transformed axis (i.e., to make the arrow go in the other direction), we merely multiply the relevant eigenvector values by $-1$ (i.e., we choose the other normalized eigenvector for that same eigenvalue, which still has unit length).

[Figure: the original and rotated axes, with angle markers $\theta$, $\theta - 90^\circ$, and $90^\circ + \theta$.]

If we denote the data matrix in this simple two-variable problem as $X_{n \times 2}$, where $n$ is the number of subjects and the two columns represent the values on variables $X_1$ and $X_2$ (i.e., the coordinates of each subject on the original axes), the $n \times 2$ matrix of coordinates of the subjects on the transformed axes, say $X_{\mathrm{trans}}$, can be given as $XT$.
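A two-variable numpy sketch of this construction (the covariance values and data are invented): the eigenvectors of a $2 \times 2$ covariance matrix form $T$, the angle $\theta$ can be read off the first column, and $XT$ gives the coordinates on the rotated axes.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[4.0, 1.5], [1.5, 1.0]], size=50)   # fabricated scores, zero population means

S = np.cov(X, rowvar=False)
lam, T = np.linalg.eigh(S)
T = T[:, ::-1]                                    # first column = first eigenvector

theta = np.degrees(np.arctan2(T[1, 0], T[0, 0]))  # angle of the first component axis
X_trans = X @ T                                   # coordinates of the subjects on the new axes
print(theta)
print(np.round(np.cov(X_trans, rowvar=False), 3)) # approximately diagonal
```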

For another interpretation of principal components, the first component could be obtained by minimizing the sum of squared perpendicular residuals from a line (in analogy to simple regression, where the sum of squared vertical residuals from a line is minimized). This notion generalizes to more than one principal component, in analogy to the way that multiple regression generalizes simple regression: vertical residuals to hyperplanes are used in multiple regression, and perpendicular residuals to hyperplanes are used in PCA.

There are a number of specially patterned matrices that have interesting eigenvector/eigenvalue decompositions. For example, for the $p \times p$ diagonal variance-covariance matrix
\[ \Sigma_{p \times p} = \begin{pmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & \sigma_p^2 \end{pmatrix} , \]
the roots are $\sigma_1^2, \ldots, \sigma_p^2$, and the eigenvector corresponding to $\sigma_i^2$ is $[0\ 0 \cdots 1 \cdots 0]'$, where the single 1 is in the $i$th position. If we have a correlation matrix, the root of 1 has multiplicity $p$, and the eigenvectors could also be chosen as these same vectors having all zeros except for a single 1 in the $i$th position, $1 \leq i \leq p$.

If the $p \times p$ variance-covariance matrix demonstrates compound symmetry,
\[ \Sigma_{p \times p} = \sigma^2 \begin{pmatrix} 1 & \cdots & \rho \\ \vdots & & \vdots \\ \rho & \cdots & 1 \end{pmatrix} , \]
or is an equicorrelation matrix,
\[ P = \begin{pmatrix} 1 & \cdots & \rho \\ \vdots & & \vdots \\ \rho & \cdots & 1 \end{pmatrix} , \]
then the $p - 1$ smallest roots are all equal. For example, for the equicorrelation matrix, $\lambda_1 = 1 + (p-1)\rho$, and the remaining $p - 1$ roots are all equal to $1 - \rho$. The $p \times 1$ eigenvector for $\lambda_1$ is $[\frac{1}{\sqrt{p}}, \ldots, \frac{1}{\sqrt{p}}]'$, and defines an average. Generally, for any variance-covariance matrix with all entries greater than zero (or just non-negative), the entries in the first eigenvector must all be greater than zero (or non-negative). This is known as the Perron-Frobenius theorem.

Although we will not give these tests explicitly here (they can be found in Johnson and Wichern's (2007) multivariate text), there are inference methods to test the null hypothesis of an equicorrelation matrix (i.e., the last $p - 1$ eigenvalues are equal); that the variance-covariance matrix is diagonal or the correlation matrix is the identity (i.e., all eigenvalues are equal); or a sphericity test of independence, that all eigenvalues are equal and $\Sigma$ is $\sigma^2$ times the identity matrix.
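A quick numpy check of these equicorrelation facts (the values of $p$ and $\rho$ below are arbitrary): $\lambda_1 = 1 + (p-1)\rho$, the remaining $p-1$ roots equal $1 - \rho$, and the first eigenvector is proportional to $(1/\sqrt{p}, \ldots, 1/\sqrt{p})'$.

```python
import numpy as np

p, rho = 5, 0.4
P = np.full((p, p), rho) + (1 - rho) * np.eye(p)   # equicorrelation matrix

lam, A = np.linalg.eigh(P)
lam, A = lam[::-1], A[:, ::-1]

print(lam)                      # [1 + (p-1)*rho, 1 - rho, ..., 1 - rho]
print(A[:, 0])                  # all entries equal to +/- 1/sqrt(p)
```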


0.1 Analytic Rotation Methods

Suppose we have a $p \times m$ matrix, $A$, containing the correlations (loadings) between our $p$ variables and the first $m$ principal components. We are seeking an orthogonal $m \times m$ matrix $T$ that defines a rotation of the $m$ components into a new $p \times m$ matrix, $B$, that contains the correlations (loadings) between the $p$ variables and the newly rotated axes: $AT = B$. A rotation matrix $T$ is sought that produces “nice” properties in $B$, e.g., a “simple structure”, where generally the loadings are positive and either close to 1.0 or to 0.0. The most common strategy is due to Kaiser, and calls for maximizing the normal varimax criterion:
\[ \frac{1}{p} \sum_{j=1}^{m} \left[ \sum_{i=1}^{p} (b_{ij}/h_i)^4 - \frac{\gamma}{p} \left\{ \sum_{i=1}^{p} (b_{ij}/h_i)^2 \right\}^2 \right] , \]
where the parameter $\gamma = 1$ for varimax, and $h_i = \sqrt{\sum_{j=1}^{m} b_{ij}^2}$ (this is called the square root of the communality of the $i$th variable in a factor analytic context). Other criteria have been suggested for this so-called orthomax criterion that use different values of $\gamma$: 0 for quartimax, $m/2$ for equamax, and $p(m-1)/(p+m-2)$ for parsimax. Also, various methods are available for attempting oblique rotations, where the transformed axes do not need to maintain orthogonality, e.g., oblimin in SYSTAT; Procrustes in MATLAB.

Generally, varimax seems to be a good default choice. It tends to “smear” the variance explained across the transformed axes rather evenly. We will stick with varimax in the various examples we do later.
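For readers who want to experiment, here is a hedged numpy sketch of an orthomax/varimax rotation using the usual iterative SVD algorithm; the function name and the small loading matrix are ours, not part of these notes, and standard statistical packages provide their own implementations.

```python
import numpy as np

def varimax(A, gamma=1.0, max_iter=1000, tol=1e-6):
    """Rotate a p x m loading matrix A; gamma = 1 gives varimax, 0 gives quartimax."""
    p, m = A.shape
    h = np.sqrt((A ** 2).sum(axis=1))            # sqrt of communalities (Kaiser row normalization)
    B = A / h[:, None]
    T = np.eye(m)
    d_old = 0.0
    for _ in range(max_iter):
        L = B @ T
        G = B.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0)))
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                               # current orthogonal rotation matrix
        d = s.sum()
        if d_old != 0.0 and d < d_old * (1.0 + tol):
            break
        d_old = d
    return (B @ T) * h[:, None], T               # rotated loadings B* = A T, and T itself

# Example with a made-up 4 x 2 loading matrix:
A = np.array([[0.8, 0.3], [0.7, 0.4], [0.3, 0.8], [0.2, 0.7]])
B_star, T = varimax(A)
print(np.round(B_star, 3))
```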


0.2 Little Jiffy

Chester Harris named a procedure proposed by Henry Kaiser for “factor analysis”, Little Jiffy. It is defined very simply as “principal components (of a correlation matrix) with associated eigenvalues greater than 1.0, followed by normal varimax rotation”. To this date, it is the most used approach to doing a factor analysis, and could be called “the principal component solution to the factor analytic model”.

More explicitly, we start with the $p \times p$ sample correlation matrix $R$ and assume it has $r$ eigenvalues greater than 1.0. $R$ is then approximated by a rank $r$ matrix of the form
\[
R \approx \lambda_1 a_1 a_1' + \cdots + \lambda_r a_r a_r' = (\sqrt{\lambda_1} a_1)(\sqrt{\lambda_1} a_1)' + \cdots + (\sqrt{\lambda_r} a_r)(\sqrt{\lambda_r} a_r)' = b_1 b_1' + \cdots + b_r b_r' = (b_1, \ldots, b_r) \begin{pmatrix} b_1' \\ \vdots \\ b_r' \end{pmatrix} = BB' ,
\]
where
\[
B_{p \times r} = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1r} \\ b_{21} & b_{22} & \cdots & b_{2r} \\ \vdots & \vdots & & \vdots \\ b_{p1} & b_{p2} & \cdots & b_{pr} \end{pmatrix} .
\]
The entries in $B$ are the loadings of the row variables on the column components.

For any $r \times r$ orthogonal matrix $T$, we know $TT' = I$, and
\[
R \approx BIB' = BTT'B' = (BT)(BT)' = B^*_{p \times r} B^{*\prime}_{r \times p} .
\]


For example, varimax is one method for constructing $B^*$. The columns of $B^*$, when normalized to unit length, define $r$ linear composites of the observable variables, where the sum of squares within a column of $B^*$ defines the variance for that composite. The composites are still orthogonal.
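A numpy sketch of Little Jiffy as described above, on fabricated data; the final rotation step would use something like the varimax sketch from the previous section.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # fabricated raw data
R = np.corrcoef(X, rowvar=False)                                  # sample correlation matrix

lam, A = np.linalg.eigh(R)
lam, A = lam[::-1], A[:, ::-1]                  # descending eigenvalues

keep = lam > 1.0                                # eigenvalue-greater-than-one rule
B = A[:, keep] * np.sqrt(lam[keep])             # unrotated loadings, so that R is approx. B B'

print(int(keep.sum()), "components retained")
print(np.round(R - B @ B.T, 2))                 # residual of the rank-r approximation
# B_star, T = varimax(B)                        # final Little Jiffy step (see the sketch above)
```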

0.3 Principal Components in Terms of the Data Matrix

For convenience, suppose we transform our $n \times p$ data matrix $X$ into the z-score data matrix $Z$, and, assuming $n > p$, let the SVD of $Z$ be $Z_{n \times p} = U_{n \times p} D_{p \times p} V'_{p \times p}$. Note that the $p \times p$ correlation matrix
\[ R = \frac{1}{n} Z'Z = \frac{1}{n} (VDU')(UDV') = V\Big(\frac{1}{n}D^2\Big)V' . \]
So, the rows of $V'$ are the principal component weights. Also,
\[ ZV = UDV'V = UD . \]
In other words, $(UD)_{n \times p}$ are the scores for the $n$ subjects on the $p$ principal components.
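A short numpy check of these SVD relations on fabricated data (note the z-scores use the population standard deviation so that $R = Z'Z/n$ exactly):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 4))   # fabricated data, n > p
n = X.shape[0]
Z = (X - X.mean(axis=0)) / X.std(axis=0)        # z-scores (population standard deviations)

U, d, Vt = np.linalg.svd(Z, full_matrices=False)
R = Z.T @ Z / n                                 # the p x p correlation matrix

print(np.allclose(R, Vt.T @ np.diag(d ** 2 / n) @ Vt))   # R = V (D^2 / n) V'
scores = U * d                                            # UD: component scores for the n subjects
print(np.allclose(Z @ Vt.T, scores))                      # ZV = UD
```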

What’s going <strong>on</strong> in “variable” space: Suppose we look at a rank<br />

2 approximati<strong>on</strong> of Z n×p ≈ U n×2 D 2×2 V ′ 2×p. The i th subject’s row<br />

data vector sits somewhere in p-dimensi<strong>on</strong>al “variable” space; it is<br />

approximated by a linear combinati<strong>on</strong> of the two eigenvectors (which<br />

gives another point in p dimensi<strong>on</strong>s), where the weights used in the<br />

linear combinati<strong>on</strong> come from the i th row of (UD) n×2 . Because we<br />

do least-squares, we are minimizing the squared Euclidean distances<br />

12


etween the subject’s row vector and the vector defined by the particular<br />

linear combinati<strong>on</strong> of the two eigenvectors. These approximating<br />

vectors in p dimensi<strong>on</strong>s are all in a plane defined by all linear<br />

combinati<strong>on</strong>s of the two eigenvectors. For a rank 1 approximati<strong>on</strong>,<br />

we merely have a multiple of the first eigenvector (in p dimensi<strong>on</strong>s)<br />

as the approximating vector for a subject’s row vector.<br />

What’s going <strong>on</strong> in “subject space”: Suppose we begin by looking<br />

at a rank 1 approximati<strong>on</strong> of Z n×p ≈ U n×1 D 1×1 V ′ 1×p. The<br />

j th column (i.e., variable) of Z is a point in n-dimensi<strong>on</strong>al “subject<br />

space”, and is approximated by a multiple of the scores <strong>on</strong> the first<br />

comp<strong>on</strong>ent, (UD) n×1 . The multiple used is the j th element of the<br />

1 × p vector of first comp<strong>on</strong>ent weights, V ′ 1×p. Thus, each column<br />

of the n × p approximating matrix, U n×1 D 1×1 V ′ 1×p, is a multiple of<br />

the same vector giving the scores <strong>on</strong> the first comp<strong>on</strong>ent. In other<br />

words, we represent each column (variable) by a multiple of <strong>on</strong>e specific<br />

vector, where the multiple represents where the projecti<strong>on</strong> lies<br />

<strong>on</strong> this <strong>on</strong>e single vector (the term “projecti<strong>on</strong>” is used because of the<br />

least-squares property of the approximati<strong>on</strong>). For a rank 2 approximati<strong>on</strong>,<br />

each column variable in Z is represented by a point in the<br />

plane defined by all linear combinati<strong>on</strong>s of the two comp<strong>on</strong>ent score<br />

columns in U n×2 D 2×2 ; the point in that plane is determined by the<br />

weights in the j th column of V ′ 2×p. Alternatively, Z is approximated<br />

by the sum of two n×p matrices defined by columns being multiples<br />

of the first or sec<strong>on</strong>d comp<strong>on</strong>ent scores.<br />

As a way of illustrating a graphical method for representing the principal components of a data matrix (through a biplot), suppose we have the rank 2 approximation $Z_{n \times p} \approx U_{n \times 2} D_{2 \times 2} V'_{2 \times p}$, and consider a two-dimensional Cartesian system where the horizontal axis corresponds to the first component and the vertical axis corresponds to the second component. Use the $n$ two-dimensional coordinates in $(U_{n \times 2} D_{2 \times 2})_{n \times 2}$ to plot the rows (subjects), and let $V_{p \times 2}$ define the two-dimensional coordinates for the $p$ variables in this same space. As in any biplot, if a vector is drawn from the origin through the $i$th row (subject) point, and the $p$ column points are projected onto this vector, the collection of such projections is proportional to the $i$th row of the $n \times p$ approximation matrix $(U_{n \times 2} D_{2 \times 2} V'_{2 \times p})_{n \times p}$.
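A hedged matplotlib sketch of such a biplot on fabricated data (the arrow scaling for the variable points is arbitrary, chosen only for visibility):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 4))   # fabricated data
Z = (X - X.mean(axis=0)) / X.std(axis=0)                          # z-scores

U, d, Vt = np.linalg.svd(Z, full_matrices=False)
row_coords = (U * d)[:, :2]            # subjects: first two columns of UD
col_coords = Vt.T[:, :2]               # variables: first two columns of V

plt.scatter(row_coords[:, 0], row_coords[:, 1], s=10, alpha=0.5)
for j, (x, y) in enumerate(col_coords):
    plt.arrow(0, 0, 3 * x, 3 * y, head_width=0.1, color="red")    # arbitrary stretch for visibility
    plt.text(3.2 * x, 3.2 * y, f"X{j + 1}", color="red")
plt.xlabel("first component")
plt.ylabel("second component")
plt.title("Biplot of the rank-2 approximation")
plt.show()
```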

The emphasis in these notes has been on the descriptive aspects of principal components. For a discussion of the statistical properties of these entities, consult Johnson and Wichern (2007): confidence intervals on the population eigenvalues; testing equality of eigenvalues; assessing the patterning present in an eigenvector; and so on.
