Jolliffe, I. Principal Component Analysis (2nd ed., Springer, 2002)



3.2. Geometric Properties of Sample Principal Components

Figure 3.1. Orthogonal projection of a two-dimensional vector onto a one-dimensional subspace.

Now
$$x_i' x_i = (m_i + r_i)'(m_i + r_i) = m_i' m_i + r_i' r_i + 2 r_i' m_i = m_i' m_i + r_i' r_i,$$
since $r_i' m_i = 0$ (the residual $r_i$ is orthogonal to the projection $m_i$). Thus
$$\sum_{i=1}^{n} r_i' r_i = \sum_{i=1}^{n} x_i' x_i - \sum_{i=1}^{n} m_i' m_i,$$
so that, for a given set of observations, minimization of the sum of squared perpendicular distances is equivalent to maximization of $\sum_{i=1}^{n} m_i' m_i$. Distances are preserved under orthogonal transformations, so the squared distance $m_i' m_i$ of $y_i$ from the origin is the same in $y$ coordinates as in $x$ coordinates. Therefore, the quantity to be maximized is $\sum_{i=1}^{n} y_i' y_i$. But
$$\sum_{i=1}^{n} y_i' y_i = \sum_{i=1}^{n} x_i' B B' x_i.$$
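As a numerical aside (not part of the text), the decomposition of each $x_i$ into its projection $m_i$ onto a subspace and a perpendicular residual $r_i$ can be checked directly. The sketch below uses random centred data and a random one-dimensional subspace; all data and variable names are hypothetical:

```python
# Minimal check of x_i'x_i = m_i'm_i + r_i'r_i for an orthogonal projection.
# The data X and direction b are arbitrary, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
X -= X.mean(axis=0)                  # centre the observations x_1, ..., x_n

b = rng.standard_normal(3)
b /= np.linalg.norm(b)               # unit vector spanning the subspace

M = np.outer(X @ b, b)               # rows m_i: orthogonal projections
R = X - M                            # rows r_i: perpendicular residuals

# r_i'm_i = 0 for every i, so the cross term vanishes
assert np.allclose(np.einsum('ij,ij->i', R, M), 0.0)
# hence x_i'x_i = m_i'm_i + r_i'r_i
assert np.allclose(np.einsum('ij,ij->i', X, X),
                   np.einsum('ij,ij->i', M, M) + np.einsum('ij,ij->i', R, R))
```

Summing over $i$ gives the identity in the text: the total squared residual equals the total squared length minus the total squared projection length.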

$$\sum_{i=1}^{n} x_i' B B' x_i = \sum_{i=1}^{n} \operatorname{tr}(x_i' B B' x_i) = \sum_{i=1}^{n} \operatorname{tr}(B' x_i x_i' B) = \operatorname{tr}\left[ B' \left( \sum_{i=1}^{n} x_i x_i' \right) B \right] = \operatorname{tr}(B' X' X B) = (n-1) \operatorname{tr}(B' S B).$$
Finally, from Property A1, $\operatorname{tr}(B' S B)$ is maximized when $B = A_q$. ✷

Instead of treating this property (G3) as just another property of sample PCs, it can also be viewed as an alternative derivation of the PCs. Rather than adapting the algebraic definition of population PCs given in Chapter 1 for samples, there is an alternative geometric definition of sample PCs. They are defined as the linear functions (projections) of $x_1, x_2, \ldots, x_n$ that successively define subspaces of dimension $1, 2, \ldots, q, \ldots, (p-1)$ for which the sum of squared perpendicular distances of $x_1, x_2, \ldots, x_n$ from the subspace is minimized. This definition provides another way in which PCs can be interpreted as accounting for as much as possible of the total variation in the data within a lower-dimensional space. In fact, this is essentially the approach adopted by Pearson (1901), although he concentrated on the two special cases $q = 1$ and $q = (p-1)$. Given a set of points in $p$-dimensional space, Pearson found the 'best-fitting line' and the 'best-fitting hyperplane,' in the sense of minimizing the sum of squared deviations of the points from the line or hyperplane. The best-fitting line determines the first principal component, although Pearson did not use this terminology, and the direction of the last PC is orthogonal to the best-fitting hyperplane. The scores for the last PC are simply the perpendicular distances of the observations from this best-fitting hyperplane.

Property G4. Let $X$ be the $(n \times p)$ matrix whose $(i, j)$th element is $\tilde{x}_{ij} - \bar{x}_j$, and consider the matrix $XX'$.
The $i$th diagonal element of $XX'$ is $\sum_{j=1}^{p} (\tilde{x}_{ij} - \bar{x}_j)^2$, which is the squared Euclidean distance of $x_i$ from the centre of gravity $\bar{x}$ of the points $x_1, x_2, \ldots, x_n$, where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. Also, the $(h, i)$th element of $XX'$ is $\sum_{j=1}^{p} (\tilde{x}_{hj} - \bar{x}_j)(\tilde{x}_{ij} - \bar{x}_j)$, which measures the cosine of the angle between the lines joining $x_h$ and $x_i$ to $\bar{x}$, multiplied by the distances of $x_h$ and $x_i$ from $\bar{x}$. Thus $XX'$ contains information about the configuration of $x_1, x_2, \ldots, x_n$ relative to $\bar{x}$. Now suppose that $x_1, x_2, \ldots, x_n$ are projected onto a $q$-dimensional subspace with the usual orthogonal transformation $y_i = B' x_i$, $i = 1, 2, \ldots, n$. Then the transfor-
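The trace identity in Property G3, and the fact that the leading eigenvectors of $S$ maximize $\operatorname{tr}(B'SB)$, can be verified numerically. The sketch below uses a random centred data matrix; `S`, `B`, and `A_q` mirror the symbols in the derivation, but the data and names are hypothetical:

```python
# Check that sum_i y_i'y_i = (n-1) tr(B'SB) for any B with orthonormal
# columns, and that B = A_q (leading q eigenvectors of S) maximizes it.
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 50, 4, 2
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                  # centred data matrix, so X'X = (n-1)S
S = X.T @ X / (n - 1)                # sample covariance matrix

B, _ = np.linalg.qr(rng.standard_normal((p, q)))  # orthonormal columns
Y = X @ B                            # projected observations y_i = B'x_i
assert np.isclose((Y * Y).sum(), (n - 1) * np.trace(B.T @ S @ B))

eigvals, A = np.linalg.eigh(S)       # eigenvalues in ascending order
A_q = A[:, ::-1][:, :q]              # leading q eigenvectors of S
# an arbitrary orthonormal B cannot beat A_q (Property A1)
assert np.trace(B.T @ S @ B) <= np.trace(A_q.T @ S @ A_q) + 1e-12
```

This is exactly the sense in which the first $q$ sample PCs span the best-fitting $q$-dimensional subspace: any other choice of $B$ leaves a larger sum of squared perpendicular distances.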

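Property G4's interpretation of the elements of $XX'$ can likewise be illustrated with a small numerical sketch; the data and variable names below are hypothetical:

```python
# For centred data, the Gram matrix XX' encodes distances from the mean
# (diagonal) and angles between observations as seen from the mean
# (off-diagonal), as stated in Property G4.
import numpy as np

rng = np.random.default_rng(2)
raw = rng.standard_normal((10, 3))
xbar = raw.mean(axis=0)              # centre of gravity of the points
X = raw - xbar                       # (i, j)th element is x_ij - xbar_j
G = X @ X.T                          # the matrix XX'

i, h = 0, 1
d_i = np.linalg.norm(X[i])           # distance of x_i from xbar
d_h = np.linalg.norm(X[h])           # distance of x_h from xbar

# diagonal: squared Euclidean distance of x_i from the centre of gravity
assert np.isclose(G[i, i], d_i ** 2)
# off-diagonal: cos(angle at xbar between x_h and x_i) times both distances
cos_angle = (X[h] @ X[i]) / (d_h * d_i)
assert np.isclose(G[h, i], cos_angle * d_h * d_i)
```
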
