Notes on Principal Component Analysis

Version: August 4, 2011
A preliminary introduction to principal components was given in our brief discussion of the spectral decomposition (i.e., the eigenvector/eigenvalue decomposition) of a matrix and what it might be used for. We will now be a bit more systematic, and begin by making three introductory comments:

(a) Principal component analysis (PCA) deals with only one set of variables, without the need for categorizing the variables as independent or dependent. There is asymmetry in the discussion of the general linear model; in PCA, however, we analyze the relationships among the variables in one set and not between two.
(b) As always, everything can be done computationally without the Multivariate Normal (MVN) assumption; we are just getting descriptive statistics. When significance tests and the like are desired, the MVN assumption becomes indispensable. Also, MVN gives some very nice interpretations for what the principal components are in terms of our constant density ellipsoids.
(c) Finally, if you are doing a PCA, it is probably best not to refer to the components as “factors”. A lot of blood and ill-will have been spilt and spread over the distinction between component analysis (which involves linear combinations of observable variables) and the estimation of a factor model (which involves the use of underlying latent variables or factors, and the estimation of the factor structure). We will get sloppy ourselves later, but some people really get exercised about these things.
We will begin working with the population (but everything translates more-or-less directly for a sample).

Suppose [X_1, X_2, ..., X_p] = X′ is a set of p random variables, with mean vector µ and variance-covariance matrix Σ. I want to define p linear combinations of X′ that represent the information in X′ more parsimoniously. Specifically, find a_1, ..., a_p such that a_1′X, ..., a_p′X give the same information as X′, but the new random variables, a_1′X, ..., a_p′X, are “nicer”.

Let λ_1 ≥ λ_2 ≥ · · · ≥ λ_p ≥ 0 be the p roots (eigenvalues) of the matrix Σ, and let a_1, ..., a_p be the corresponding eigenvectors. If some roots are not distinct, I can still pick the corresponding eigenvectors to be orthogonal. Choose each eigenvector a_i so that a_i′a_i = 1, i.e., a normalized eigenvector. Then a_i′X is the i-th principal component of the random variables in X′.
Properties:

1) Var(a_i′X) = a_i′Σa_i = λ_i

We know Σa_i = λ_i a_i because a_i is the eigenvector for λ_i; thus, a_i′Σa_i = a_i′λ_i a_i = λ_i. In words, the variance of the i-th principal component is λ_i, the root.

Also, for all vectors b_i such that b_i is orthogonal to a_1, ..., a_{i−1} and b_i′b_i = 1, Var(b_i′X) is the greatest it can be (i.e., λ_i) when b_i = a_i.
2) a_i and a_j are orthogonal for i ≠ j, i.e., a_i′a_j = 0

3) Cov(a_i′X, a_j′X) = a_i′Σa_j = a_i′λ_j a_j = λ_j a_i′a_j = 0 for i ≠ j
4) Tr(Σ) = λ_1 + · · · + λ_p = the sum of the variances of all p principal components, and of X_1, ..., X_p. The importance of the i-th principal component is

λ_i / Tr(Σ),

which is equal to the variance of the i-th principal component divided by the total variance in the system of p random variables, X_1, ..., X_p; it is the proportion of the total variance explained by the i-th component.
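Properties 1 and 4 are easy to check numerically. A minimal sketch in Python/NumPy, using a small made-up covariance matrix (the values are illustrative, not from the notes):

```python
import numpy as np

# Hypothetical 3x3 variance-covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

# eigh returns eigenvalues in ascending order for a symmetric matrix;
# reverse so that lambda_1 >= lambda_2 >= lambda_3.
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Property 1: Var(a_i'X) = a_i' Sigma a_i = lambda_i
for i in range(3):
    a_i = eigvecs[:, i]
    assert np.isclose(a_i @ Sigma @ a_i, eigvals[i])

# Property 4: Tr(Sigma) = lambda_1 + ... + lambda_p
assert np.isclose(np.trace(Sigma), eigvals.sum())

# Proportion of total variance explained by each component
proportions = eigvals / np.trace(Sigma)
print(np.round(proportions, 3))
```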
If the first few principal components account for most of the variation, then we might interpret these components as “factors” underlying the whole set X_1, ..., X_p. This is the basis of principal factor analysis.

The question of how many components (or factors, or clusters, or dimensions) to retain usually has no definitive answer. Some attempt has been made to answer it graphically with what are called scree plots: a plot is constructed with the value of each eigenvalue on the y-axis and the index of the eigenvalue (e.g., 1, 2, 3, and so on) on the x-axis, and you look for an “elbow” to see where to stop. Scree or talus is the pile of rock debris (detritus) at the foot of a cliff, i.e., the sloping mass of debris at the bottom of the cliff. I, unfortunately, can never see an “elbow”!
If we let the population correlation matrix corresponding to Σ be denoted by P, then Tr(P) = p, and we might use only those principal components that have variance λ_i ≥ 1; otherwise, the component would “explain” less variance than would a single variable.
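A short sketch of this λ_i ≥ 1 retention rule in code (the correlation matrix below is a hypothetical example, not from the notes):

```python
import numpy as np

# Hypothetical 4x4 correlation matrix (illustrative values only).
P = np.array([[1.0, 0.6, 0.5, 0.1],
              [0.6, 1.0, 0.4, 0.2],
              [0.5, 0.4, 1.0, 0.3],
              [0.1, 0.2, 0.3, 1.0]])

eigvals = np.linalg.eigvalsh(P)[::-1]   # eigenvalues, descending

# For a correlation matrix, Tr(P) = p
assert np.isclose(np.trace(P), len(P))

# Keep only components explaining at least as much variance as
# a single standardized variable (lambda_i >= 1).
n_retain = int(np.sum(eigvals >= 1.0))
print(np.round(eigvals, 3), "components retained:", n_retain)
```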
A major rub: if I do principal components on the correlation matrix, P, and on the original variance-covariance matrix, Σ, the structures obtained are generally different. This is one reason the “true believers” might prefer a factor analysis model over a PCA, because the former holds out some hope for an invariance (to scaling). Generally, it seems more reasonable to always use the population correlation matrix, P; the units of the original variables become irrelevant, and it is much easier to interpret the principal components through their coefficients.
The j-th principal component is a_j′X. Consider

Cov(a_j′X, X_i) = Cov(a_j′X, b′X), where b′ = [0 · · · 0 1 0 · · · 0],

with the 1 in the i-th position. This covariance equals a_j′Σb = b′Σa_j = b′λ_j a_j = λ_j times the i-th component of a_j, i.e., λ_j a_ij. Thus,

$$\mathrm{Cor}(\mathbf{a}_j'\mathbf{X}, X_i) = \frac{\lambda_j a_{ij}}{\sqrt{\lambda_j}\,\sigma_i} = \frac{\sqrt{\lambda_j}\,a_{ij}}{\sigma_i},$$

where σ_i is the standard deviation of X_i. This correlation is called the loading of X_i on the j-th component. Generally, these correlations can be used to see the contribution of each variable to each of the principal components.
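The loading formula can be verified directly from Σ. A sketch, using a made-up covariance matrix (illustrative values, not from the notes):

```python
import numpy as np

# Hypothetical 3x3 variance-covariance matrix (illustrative values only).
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

sd = np.sqrt(np.diag(Sigma))                 # sigma_i for each variable

# Loadings: Cor(a_j'X, X_i) = sqrt(lambda_j) a_ij / sigma_i
loadings = eigvecs * np.sqrt(eigvals) / sd[:, None]

# Check against the covariance route:
# Cov(a_j'X, X_i) = (Sigma a_j)_i = lambda_j a_ij
for j in range(3):
    cov_j = Sigma @ eigvecs[:, j]            # Cov(a_j'X, X_i) over all i
    cor_j = cov_j / (np.sqrt(eigvals[j]) * sd)
    assert np.allclose(cor_j, loadings[:, j])

# Each variable's squared loadings sum to 1 across all p components
assert np.allclose((loadings ** 2).sum(axis=1), 1.0)
print(np.round(loadings, 3))
```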
If the population covariance matrix, Σ, is replaced by the sample covariance matrix, S, we obtain sample principal components; if the population correlation matrix, P, is replaced by the sample correlation matrix, R, we again obtain sample principal components. These two structures are generally different.
The covariance matrix S (or Σ) can be represented as

$$S = [\mathbf{a}_1, \ldots, \mathbf{a}_p]
\begin{bmatrix} \sqrt{\lambda_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sqrt{\lambda_p} \end{bmatrix}
\begin{bmatrix} \sqrt{\lambda_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sqrt{\lambda_p} \end{bmatrix}
\begin{bmatrix} \mathbf{a}_1' \\ \vdots \\ \mathbf{a}_p' \end{bmatrix}
\equiv LL',$$

or as the sum of p p×p matrices,

S = λ_1 a_1 a_1′ + · · · + λ_p a_p a_p′ .
Given the ordering of the eigenvalues as λ_1 ≥ λ_2 ≥ · · · ≥ λ_p ≥ 0, the least-squares approximation to S of rank r is λ_1 a_1 a_1′ + · · · + λ_r a_r a_r′, and the residual matrix, S − λ_1 a_1 a_1′ − · · · − λ_r a_r a_r′, is λ_{r+1} a_{r+1} a_{r+1}′ + · · · + λ_p a_p a_p′.
Note that for an arbitrary matrix B_{p×q}, Tr(BB′) equals the sum of squares of the entries in B. Also, for two matrices, B and C, if both of the products BC and CB can be formed, then Tr(BC) = Tr(CB). Using these two results, the least-squares criterion value can be given as

$$\mathrm{Tr}\Big(\Big[\textstyle\sum_{k \ge r+1} \lambda_k \mathbf{a}_k \mathbf{a}_k'\Big]\Big[\textstyle\sum_{k \ge r+1} \lambda_k \mathbf{a}_k \mathbf{a}_k'\Big]'\Big) = \sum_{k \ge r+1} \lambda_k^2 .$$

This is a measure of how bad the rank-r approximation might be (i.e., the proportion of unexplained sum-of-squares when put over $\sum_{k=1}^{p} \lambda_k^2$).
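These identities are easy to confirm numerically; a sketch with an illustrative covariance matrix (made-up values, not from the notes):

```python
import numpy as np

# Hypothetical 3x3 covariance matrix (illustrative values only).
S = np.array([[4.0, 2.0, 0.5],
              [2.0, 3.0, 1.0],
              [0.5, 1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

r = 1   # rank of the approximation

# Least-squares rank-r approximation: lambda_1 a_1 a_1' + ... + lambda_r a_r a_r'
S_r = sum(eigvals[k] * np.outer(eigvecs[:, k], eigvecs[:, k]) for k in range(r))
E = S - S_r                                  # residual matrix

# Tr(EE') = sum of the squared residual eigenvalues
criterion = np.trace(E @ E.T)
assert np.isclose(criterion, np.sum(eigvals[r:] ** 2))

# Proportion of unexplained sum of squares
print(criterion / np.sum(eigvals ** 2))
```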
For a geometric interpretation of principal components, suppose we have two variables, X_1 and X_2, that are centered at their respective means (i.e., the means of the scores on X_1 and X_2 are zero). In the diagram below, the ellipse represents the scatter diagram of the sample points. The first principal component is a line through the widest part; the second component is the line at right angles to the first principal component. In other words, the first principal component goes through the fattest part of the “football”, and the second principal component through the next fattest part of the “football” and orthogonal to the first; and so on. Or, we take our original frame of reference and do a rigid transformation around the origin to get a new set of axes; the origin is given by the sample means (of zero) on the two variables X_1 and X_2. (To make these same geometric points, we could have used a constant density contour for a bivariate normal pair of random variables, X_1 and X_2, with zero mean vector.)

[Figure: an ellipse of sample points in the (X_1, X_2) plane, with the first and second principal components drawn as a rotated pair of orthogonal axes.]
As an example of how to find the placement of the components in the picture given above, suppose we have the two variables, X_1 and X_2, with variance-covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}.$$
Let a_11 and a_21 denote the weights from the first eigenvector of Σ; a_12 and a_22 are the weights from the second eigenvector. If these are placed in a 2×2 orthogonal (or rotation) matrix T, with the first column containing the first eigenvector weights and the second column the second eigenvector weights, we can obtain the direction cosines of the new axes system from the following:

$$T = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
= \begin{pmatrix} \cos(\theta) & \cos(90 + \theta) \\ \cos(\theta - 90) & \cos(\theta) \end{pmatrix}
= \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}.$$

These are the cosines of the angles with the positive (horizontal and vertical) axes. If we wish to change the orientation of a transformed axis (i.e., to make the arrow go in the other direction), we merely multiply the relevant eigenvector values by −1 (i.e., we choose the other normalized eigenvector for that same eigenvalue, which still has unit length).

[Figure: the original and rotated axes, with the angles θ, 90 + θ, and θ − 90 marked between them.]
If we denote the data matrix in this simple two-variable problem as X_{n×2}, where n is the number of subjects and the two columns represent the values on variables X_1 and X_2 (i.e., the coordinates of each subject on the original axes), then the n×2 matrix of coordinates of the subjects on the transformed axes, say X_trans, can be given as XT.
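A sketch of this rotation in code, with a made-up 2×2 covariance matrix and simulated data (illustrative values, not from the notes):

```python
import numpy as np

# Hypothetical 2x2 variance-covariance matrix (illustrative values only).
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
eigvals, T = np.linalg.eigh(Sigma)
eigvals, T = eigvals[::-1], T[:, ::-1]       # columns are the eigenvectors

# Flip a sign if needed so T is a proper rotation (det = +1),
# matching the cos/-sin, sin/cos pattern above.
if np.linalg.det(T) < 0:
    T[:, 1] = -T[:, 1]

theta = np.arctan2(T[1, 0], T[0, 0])         # angle of the first component axis
assert np.allclose(T, [[np.cos(theta), -np.sin(theta)],
                       [np.sin(theta),  np.cos(theta)]])

# T diagonalizes Sigma: the rotated axes carry variances lambda_1, lambda_2
assert np.allclose(T.T @ Sigma @ T, np.diag(eigvals))

# Coordinates of centered data on the transformed axes: X_trans = X T
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], Sigma, size=500)
X = X - X.mean(axis=0)
X_trans = X @ T                              # subjects on the new axes
```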
For another interpretation of principal components, the first component could be obtained by minimizing the sum of squared perpendicular residuals from a line (in analogy to simple regression, where the sum of squared vertical residuals from a line is minimized). This notion generalizes to more than one principal component, in analogy to the way that multiple regression generalizes simple regression: vertical residuals to hyperplanes are used in multiple regression, and perpendicular residuals to hyperplanes are used in PCA.
There are a number of specially patterned matrices that have interesting eigenvector/eigenvalue decompositions. For example, for the p×p diagonal variance-covariance matrix

$$\Sigma_{p \times p} = \begin{pmatrix} \sigma_1^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma_p^2 \end{pmatrix},$$

the roots are σ_1², ..., σ_p², and the eigenvector corresponding to σ_i² is [0 0 · · · 1 · · · 0]′, where the single 1 is in the i-th position. If we have a correlation matrix, the root 1 has multiplicity p, and the eigenvectors could also be chosen as these same vectors having all zeros except for a single 1 in the i-th position, 1 ≤ i ≤ p.
If the p×p variance-covariance matrix demonstrates compound symmetry,

$$\Sigma_{p \times p} = \sigma^2 \begin{pmatrix} 1 & \cdots & \rho \\ \vdots & \ddots & \vdots \\ \rho & \cdots & 1 \end{pmatrix},$$

or is an equicorrelation matrix,

$$P = \begin{pmatrix} 1 & \cdots & \rho \\ \vdots & \ddots & \vdots \\ \rho & \cdots & 1 \end{pmatrix},$$

then the p−1 smallest roots are all equal. For example, for the equicorrelation matrix, λ_1 = 1 + (p−1)ρ, and the remaining p−1 roots are all equal to 1 − ρ. The p×1 eigenvector for λ_1 is [1/√p, ..., 1/√p]′, and defines an average. Generally, for any variance-covariance matrix with all entries greater than zero (or just non-negative), the entries in the first eigenvector must all be greater than zero (or non-negative). This is known as the Perron-Frobenius theorem.
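The equicorrelation result is easy to check numerically; a sketch with illustrative values of p and ρ:

```python
import numpy as np

p, rho = 5, 0.4   # illustrative values, not from the notes
P = (1 - rho) * np.eye(p) + rho * np.ones((p, p))    # equicorrelation matrix

eigvals, eigvecs = np.linalg.eigh(P)                 # ascending order

# Largest root: 1 + (p-1)rho; the remaining p-1 roots all equal 1 - rho
assert np.isclose(eigvals[-1], 1 + (p - 1) * rho)
assert np.allclose(eigvals[:-1], 1 - rho)

# The eigenvector for lambda_1 is [1/sqrt(p), ..., 1/sqrt(p)]': an average.
a1 = eigvecs[:, -1]
a1 = a1 * np.sign(a1[0])    # fix the arbitrary sign; entries all positive
assert np.allclose(a1, 1 / np.sqrt(p))               # (Perron-Frobenius)
```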
Although we will not give these tests explicitly here (they can be found in Johnson and Wichern's (2007) multivariate text), there are inference methods to test the null hypothesis of an equicorrelation matrix (i.e., the last p−1 eigenvalues are equal); that the variance-covariance matrix is diagonal or the correlation matrix is the identity (i.e., all eigenvalues are equal); or a sphericity test of independence, that all eigenvalues are equal and Σ is σ² times the identity matrix.
0.1 Analytic Rotation Methods
Suppose we have a p×m matrix, A, containing the correlations (loadings) between our p variables and the first m principal components. We are seeking an orthogonal m×m matrix T that defines a rotation of the m components into a new p×m matrix, B, that contains the correlations (loadings) between the p variables and the newly rotated axes: AT = B. A rotation matrix T is sought that produces “nice” properties in B, e.g., a “simple structure”, where generally the loadings are positive and either close to 1.0 or to 0.0. The most common strategy is due to Kaiser, and calls for maximizing the normal varimax criterion:
$$\frac{1}{p} \sum_{j=1}^{m} \left[ \sum_{i=1}^{p} (b_{ij}/h_i)^4 - \frac{\gamma}{p} \Big\{ \sum_{i=1}^{p} (b_{ij}/h_i)^2 \Big\}^2 \right],$$

where the parameter γ = 1 for varimax, and $h_i = \sqrt{\sum_{j=1}^{m} b_{ij}^2}$ (this is called the square root of the communality of the i-th variable in a factor analytic context). Other criteria have been suggested for this so-called orthomax criterion that use different values of γ: 0 for quartimax, m/2 for equamax, and p(m−1)/(p+m−2) for parsimax.
Also, various methods are available for attempting oblique rotations, where the transformed axes do not need to maintain orthogonality, e.g., oblimin in SYSTAT; Procrustes in MATLAB.

Generally, varimax seems to be a good default choice. It tends to “smear” the variance explained rather evenly across the transformed axes. We will stick with varimax in the various examples we do later.
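Since varimax will be our default, here is a minimal sketch of an orthomax/varimax rotation. This is the standard SVD-based iteration for the raw criterion (it omits Kaiser's row normalization by h_i described above), and the loading matrix A is a made-up example:

```python
import numpy as np

def varimax(A, gamma=1.0, max_iter=100, tol=1e-8):
    """Orthomax rotation of a p x m loading matrix A; gamma = 1.0 gives
    varimax, 0.0 quartimax. A sketch of the standard SVD-based algorithm
    for the raw criterion (no Kaiser normalization); not from the notes."""
    p, m = A.shape
    T = np.eye(m)
    crit_old = 0.0
    for _ in range(max_iter):
        B = A @ T
        # SVD step that increases the orthomax criterion
        U, s, Vt = np.linalg.svd(
            A.T @ (B ** 3 - (gamma / p) * B @ np.diag((B ** 2).sum(axis=0))))
        T = U @ Vt
        if s.sum() - crit_old < tol:
            break
        crit_old = s.sum()
    return A @ T, T

# Hypothetical loadings of 4 variables on 2 components (illustrative only)
A = np.array([[0.8, 0.3],
              [0.7, 0.4],
              [0.3, 0.8],
              [0.2, 0.7]])
B, T = varimax(A)
assert np.allclose(T.T @ T, np.eye(2))   # T is orthogonal
assert np.allclose(B @ B.T, A @ A.T)     # communalities are preserved
```

Because T is orthogonal, the rotation changes the individual loadings but leaves each variable's communality (row sum of squares) untouched.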
0.2 Little Jiffy<br />
Chester Harris gave the name Little Jiffy to a procedure for “factor analysis” proposed by Henry Kaiser. It is defined very simply as “principal components (of a correlation matrix) with associated eigenvalues greater than 1.0, followed by normal varimax rotation”. To this day, it is the most used approach for doing a factor analysis, and could be called “the principal component solution to the factor analytic model”.

More explicitly, we start with the p×p sample correlation matrix R and assume it has r eigenvalues greater than 1.0. R is then approximated by a rank r matrix of the form:
$$R \approx \lambda_1 \mathbf{a}_1 \mathbf{a}_1' + \cdots + \lambda_r \mathbf{a}_r \mathbf{a}_r'
= (\sqrt{\lambda_1}\,\mathbf{a}_1)(\sqrt{\lambda_1}\,\mathbf{a}_1') + \cdots + (\sqrt{\lambda_r}\,\mathbf{a}_r)(\sqrt{\lambda_r}\,\mathbf{a}_r')
= \mathbf{b}_1 \mathbf{b}_1' + \cdots + \mathbf{b}_r \mathbf{b}_r'
= (\mathbf{b}_1, \ldots, \mathbf{b}_r) \begin{pmatrix} \mathbf{b}_1' \\ \vdots \\ \mathbf{b}_r' \end{pmatrix}
= BB',$$

where

$$B_{p \times r} = \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1r} \\ b_{21} & b_{22} & \cdots & b_{2r} \\ \vdots & \vdots & & \vdots \\ b_{p1} & b_{p2} & \cdots & b_{pr} \end{pmatrix}.$$

The entries in B are the loadings of the row variables on the column components.

For any r×r orthogonal matrix T, we know TT′ = I, and

$$R \approx BIB' = BTT'B' = (BT)(BT)' = B^*_{p \times r} B^{*\prime}_{r \times p}.$$
For example, varimax is one method for constructing B*. The columns of B*, when normalized to unit length, define r linear composites of the observable variables, where the sum of squares within a column of B* defines the variance for that composite. The composites are still orthogonal.
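A sketch of the principal-component factoring step in code (the correlation matrix is a hypothetical example, chosen so that r comes out to 2):

```python
import numpy as np

# Hypothetical 4x4 sample correlation matrix (illustrative values only).
R = np.array([[1.0, 0.7, 0.2, 0.1],
              [0.7, 1.0, 0.1, 0.2],
              [0.2, 0.1, 1.0, 0.6],
              [0.1, 0.2, 0.6, 1.0]])
eigvals, eigvecs = np.linalg.eigh(R)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

r = int(np.sum(eigvals > 1.0))               # Kaiser rule: roots greater than 1.0
assert r == 2                                # holds for this example matrix

# Columns b_k = sqrt(lambda_k) a_k give the loading matrix B
B = eigvecs[:, :r] * np.sqrt(eigvals[:r])

# Rank-r approximation R ~ BB'
R_r = B @ B.T
assert np.allclose(R_r, sum(eigvals[k] * np.outer(eigvecs[:, k], eigvecs[:, k])
                            for k in range(r)))

# Any orthogonal T leaves the fit unchanged: (BT)(BT)' = BB'
theta = 0.3                                  # arbitrary rotation angle
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B_star = B @ T
assert np.allclose(B_star @ B_star.T, R_r)
```

The last assertion is the point of the rotation step: varimax picks a particular T, but any orthogonal T reproduces R equally well.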
0.3 Principal Components in Terms of the Data Matrix
For convenience, suppose we transform our n×p data matrix X into the z-score data matrix Z and, assuming n > p, let the SVD of Z be Z_{n×p} = U_{n×p} D_{p×p} V′_{p×p}. Note that the p×p correlation matrix

$$R = \frac{1}{n} Z'Z = \frac{1}{n} (VDU')(UDV') = V\Big(\frac{1}{n}D^2\Big)V'.$$

So, the rows of V′ are the principal component weights. Also,

ZV = UDV′V = UD .

In other words, (UD)_{n×p} gives the scores for the n subjects on the p principal components.
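These SVD identities can be verified on simulated z-scored data (the data-generating values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
# Arbitrary correlated data (illustrative only)
X = rng.normal(size=(n, p)) @ np.array([[1.0, 0.5, 0.2],
                                        [0.0, 1.0, 0.4],
                                        [0.0, 0.0, 1.0]])

# z-score the data matrix (1/n variances, to match R = Z'Z / n)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

U, d, Vt = np.linalg.svd(Z, full_matrices=False)   # Z = U D V'

# R = Z'Z / n = V (D^2 / n) V'
R = Z.T @ Z / n
assert np.allclose(R, (Vt.T * (d ** 2 / n)) @ Vt)

# Component scores: ZV = UD
scores = Z @ Vt.T
assert np.allclose(scores, U * d)

# The score variances are the eigenvalues of R
assert np.allclose(scores.var(axis=0), d ** 2 / n)
```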
What's going on in “variable” space: Suppose we look at a rank 2 approximation, Z_{n×p} ≈ U_{n×2} D_{2×2} V′_{2×p}. The i-th subject's row data vector sits somewhere in p-dimensional “variable” space; it is approximated by a linear combination of the two eigenvectors (which gives another point in p dimensions), where the weights used in the linear combination come from the i-th row of (UD)_{n×2}. Because we do least-squares, we are minimizing the squared Euclidean distances between the subject's row vector and the vector defined by the particular linear combination of the two eigenvectors. These approximating vectors in p dimensions all lie in the plane defined by all linear combinations of the two eigenvectors. For a rank 1 approximation, we merely have a multiple of the first eigenvector (in p dimensions) as the approximating vector for a subject's row vector.
What's going on in “subject” space: Suppose we begin by looking at a rank 1 approximation, Z_{n×p} ≈ U_{n×1} D_{1×1} V′_{1×p}. The j-th column (i.e., variable) of Z is a point in n-dimensional “subject” space, and is approximated by a multiple of the scores on the first component, (UD)_{n×1}. The multiple used is the j-th element of the 1×p vector of first component weights, V′_{1×p}. Thus, each column of the n×p approximating matrix, U_{n×1} D_{1×1} V′_{1×p}, is a multiple of the same vector giving the scores on the first component. In other words, we represent each column (variable) by a multiple of one specific vector, where the multiple represents where the projection lies on this one single vector (the term “projection” is used because of the least-squares property of the approximation). For a rank 2 approximation, each column variable in Z is represented by a point in the plane defined by all linear combinations of the two component score columns in U_{n×2} D_{2×2}; the point in that plane is determined by the weights in the j-th column of V′_{2×p}. Alternatively, Z is approximated by the sum of two n×p matrices whose columns are multiples of the first or second component scores.
As a way of illustrating a graphical representation of the principal components of a data matrix (through a biplot), suppose we have the rank 2 approximation, Z_{n×p} ≈ U_{n×2} D_{2×2} V′_{2×p}, and consider a two-dimensional Cartesian system where the horizontal axis corresponds to the first component and the vertical axis to the second component. Use the n two-dimensional coordinates in (U_{n×2} D_{2×2}) to plot the rows (subjects), and let V_{p×2} define the two-dimensional coordinates for the p variables in this same space. As in any biplot, if a vector is drawn from the origin through the i-th row (subject) point, and the p column points are projected onto this vector, the collection of such projections is proportional to the i-th row of the n×p approximation matrix U_{n×2} D_{2×2} V′_{2×p}.
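The biplot's inner-product property can be sketched numerically, without any plotting (the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 4
Z = rng.normal(size=(n, p))                  # arbitrary data, then z-scored
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)

U, d, Vt = np.linalg.svd(Z, full_matrices=False)

# Rank-2 biplot coordinates: rows (subjects) and columns (variables)
row_pts = U[:, :2] * d[:2]                   # (U D) coordinates, n x 2
col_pts = Vt[:2].T                           # V coordinates, p x 2

# Inner products of row and column points reproduce the rank-2
# approximation of Z: row_i . col_j = [U_2 D_2 V_2']_{ij}
Z2 = (U[:, :2] * d[:2]) @ Vt[:2]
assert np.allclose(row_pts @ col_pts.T, Z2)

# Projections of the p column points onto the direction through the
# i-th subject point are proportional to row i of the approximation.
i = 0
direction = row_pts[i] / np.linalg.norm(row_pts[i])
projections = col_pts @ direction
assert np.allclose(projections * np.linalg.norm(row_pts[i]), Z2[i])
```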
The emphasis in these notes has been on the descriptive aspects of principal components. For a discussion of the statistical properties of these entities, consult Johnson and Wichern (2007): confidence intervals on the population eigenvalues; testing equality of eigenvalues; assessing the patterning present in an eigenvector; and so on.