Subspace-based Learning with Grassmann Kernels - VideoLectures


Subspace-based Learning with Grassmann Kernels

Jihun Hamm and Daniel D. Lee
University of Pennsylvania
July 8, 2008

Subspace structure in data

• Image data
  • Illumination variation is low-dimensional: empirical [Hallinan94,Epstein95], theoretical [Belhumeur98,Ramamoorthi01,Ramamoorthi02,Basri03]
  • Pose, expression, etc. are also modeled well by subspaces: "Eigenface" [Sirovich87,Kirby90,Turk91]

Set of illumination-subspaces

[Figure: image sets $X_1, X_2$ are mapped by PCA to subspaces with bases $Y_1, Y_2 \in \mathbb{R}^{D \times m}$.]
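The PCA step in the figure can be sketched as follows. This is a minimal illustration, assuming each image set is stored as a D × n matrix with one vectorized image per column; the function name `subspace_basis` and the `center` option are hypothetical, and the slides do not say whether the mean is removed before PCA.

```python
import numpy as np

def subspace_basis(X, m, center=False):
    """Orthonormal basis (D x m) of the top-m principal subspace of X (D x n).

    X holds one vectorized image per column. Centering is optional because
    the slides do not state whether the mean is removed (assumption).
    """
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=1, keepdims=True)
    # Left singular vectors are ordered by decreasing singular value.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :m]

rng = np.random.default_rng(0)
X1 = rng.standard_normal((50, 8))   # toy "image set": D = 50 pixels, 8 images
X2 = rng.standard_normal((50, 8))
Y1 = subspace_basis(X1, m=3)
Y2 = subspace_basis(X2, m=3)
```

Each returned matrix satisfies $Y'Y = I_m$, so it can serve directly as a basis representation of a point on the Grassmann manifold.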

Set of pose-subspaces

[Figure: image sets $X_1, X_2$ are mapped by PCA to subspaces with bases $Y_1, Y_2 \in \mathbb{R}^{D \times m}$.]

Linear dynamical model

• ARMA model of sequential data
  • e.g., dynamic textures, human actions [Doretto03,Veeraraghavan05,Turaga08]
  $x(t+1) = A\,x(t) + v(t)$
  $y(t) = C\,x(t) + w(t)$
  x(t): internal state, y(t): observed output, v(t), w(t): noise
• Observability matrix [Cock02]
  $O_{(A,C)} = [C;\ CA;\ CA^2;\ \ldots] \in \mathbb{R}^{D \times m}$
  • Each ARMA model spans a unique subspace
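As a rough sketch of how an ARMA model can be turned into a subspace, the code below stacks a finite number of blocks $C, CA, CA^2, \ldots$ and orthonormalizes the result. Truncating the observability matrix to `n_blocks` blocks is an assumption made here for illustration, and the function name and the toy values of `A` and `C` are hypothetical.

```python
import numpy as np

def observability_basis(A, C, n_blocks):
    """Stack C, CA, ..., CA^(n_blocks-1) and orthonormalize the columns."""
    blocks = []
    CAk = np.asarray(C, dtype=float)
    for _ in range(n_blocks):
        blocks.append(CAk)
        CAk = CAk @ A
    O = np.vstack(blocks)          # shape (n_blocks * p, m)
    Q, _ = np.linalg.qr(O)         # reduced QR: Q has orthonormal columns
    return O, Q

# Toy system (hypothetical values): 2 hidden states, 4 observed outputs.
A = 0.9 * np.array([[np.cos(0.3), -np.sin(0.3)],
                    [np.sin(0.3),  np.cos(0.3)]])
C = np.random.default_rng(0).standard_normal((4, 2))
O, Y = observability_basis(A, C, n_blocks=5)
```

Here `Y` spans the same column space as the truncated matrix `O`, so it plays the role of the orthonormal basis used in the rest of the slides.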

Set of observability subspaces

[Figure: image sequences (e.g., "check watch", "scratch head", ...) are mapped to observability matrices $O_{A_1,C_1}, O_{A_2,C_2}, \ldots$]

Subspace-based learning

• Assumption: data consists of linear subspaces
  [Figure: subspace 1, subspace 2, ..., subspace N in $\mathbb{R}^D$]
• Model out the known (undesired) variability with linear subspaces
• Learn the unknown (interesting) variability between subspaces

Framework for subspace-based learning

• The Grassmann manifold G(m, D) is the set of m-dimensional linear subspaces of $\mathbb{R}^D$.
• Applications in signal processing and control [Srivastava00,Henkel05,Baumann07], optimization [Edelman99], and computer vision [Liu04,Lin06,Chang06,Turaga08].

[Figure: two subspaces span($Y_i$) and span($Y_j$) in $\mathbb{R}^D$ with principal angles $\theta_1, \ldots, \theta_m$ and vectors $u_1, v_1$; the same subspaces appear as points on G(m, D).]


Representation

• Quotient space representation: $G(m, D) = O(D) / (O(m) \times O(D - m))$
• Basis representation of an element: an element of G(m, D) is represented by a D × m matrix Y such that $Y'Y = I_m$, with the equivalence relation
  $Y_1 \cong Y_2 \iff \mathrm{span}(Y_1) = \mathrm{span}(Y_2) \iff \exists R_m \in O(m)$ such that $Y_1 = Y_2 R_m$
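The equivalence relation can be checked numerically. The sketch below is only an illustration with randomly generated matrices: it builds two bases of the same subspace, confirms that they share the orthogonal projector $YY'$, and recovers the orthogonal matrix relating them.

```python
import numpy as np

rng = np.random.default_rng(1)
D, m = 6, 2

# Orthonormal basis Y2 and a random orthogonal matrix R in O(m), both via QR.
Y2, _ = np.linalg.qr(rng.standard_normal((D, m)))
R, _ = np.linalg.qr(rng.standard_normal((m, m)))
Y1 = Y2 @ R                      # a different basis of the same subspace

# Same span <=> same orthogonal projector Y Y'.
same_span = np.allclose(Y1 @ Y1.T, Y2 @ Y2.T)

# Because Y2'Y2 = I, the change of basis is recovered as R = Y2' Y1.
R_recovered = Y2.T @ Y1
```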

Principal angle/canonical correlation

Let Y1 and Y2 be two orthonormal matrices of size D by m, and let u ∈ span(Y1) and v ∈ span(Y2) be unit vectors.

The first principal angle $\theta_1$ between span(Y1) and span(Y2), whose cosine is the first canonical correlation, is defined by
$\cos\theta_1 = \max_{u \in \mathrm{span}(Y_1)} \max_{v \in \mathrm{span}(Y_2)} u'v$, subject to $\|u\| = \|v\| = 1$.

[Figure: principal angles $\theta_1, \ldots, \theta_m$ between span($Y_i$) and span($Y_j$) in $\mathbb{R}^D$, and the corresponding points on G(m, D).]

k-th principal angle

• The k-th principal angle $\theta_k$, whose cosine is the k-th canonical correlation, is defined by
  $\cos\theta_k = \max_{u_k \in \mathrm{span}(Y_1)} \max_{v_k \in \mathrm{span}(Y_2)} u_k' v_k$, subject to
  $u_k' u_k = 1$, $v_k' v_k = 1$, $u_k' u_i = 0$, $v_k' v_i = 0$ $(i = 1, \ldots, k-1)$.
  $0 \le \theta_1 \le \cdots \le \theta_m \le \pi/2$ and $1 \ge \cos\theta_1 \ge \cdots \ge \cos\theta_m \ge 0$
• Use SVD for computation [Golub96]: $Y_1' Y_2 = U S V'$, where $U = [u_1 \ldots u_m]$, $V = [v_1 \ldots v_m]$, and S is the diagonal matrix $S = \mathrm{diag}(\cos\theta_1, \ldots, \cos\theta_m)$.
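The SVD recipe above can be sketched in a few lines. The function name `principal_angles` is hypothetical, and the clipping step is an added safeguard against round-off rather than part of the slides.

```python
import numpy as np

def principal_angles(Y1, Y2):
    """Principal angles (ascending, in radians) between span(Y1) and span(Y2)."""
    cosines = np.linalg.svd(Y1.T @ Y2, compute_uv=False)  # descending order
    cosines = np.clip(cosines, 0.0, 1.0)                  # guard against round-off
    return np.arccos(cosines)

# Two planes in R^3 that share the first axis and are tilted by phi.
phi = 0.3
Y1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
Y2 = np.array([[1.0, 0.0], [0.0, np.cos(phi)], [0.0, np.sin(phi)]])
theta = principal_angles(Y1, Y2)
```

In this toy case the planes share one direction, so the first angle should be close to 0 and the second close to `phi`.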

Principal angles and distance

• Given: two subspaces Y1 and Y2, and principal angles $\theta_1, \ldots, \theta_m$ from SVD
• How to define a good subspace distance from $\theta_1, \ldots, \theta_m$?
• A canonical distance [Edelman99]
  • Arc-length: $d^2_{Arc}(Y_1, Y_2) = \sum_i \theta_i^2$
• Is this the only distance?

Grassmann distances

• Projection distance [Edelman99,Wang06]
  $d^2_{Proj}(Y_1, Y_2) = \sum_{i=1}^m \sin^2\theta_i = 2^{-1}\|Y_1 Y_1' - Y_2 Y_2'\|_F^2$
• Binet-Cauchy distance [Wolf03,Vishwanathan04]
  $d^2_{BC}(Y_1, Y_2) = 1 - \prod_i \cos^2\theta_i = 1 - \det(Y_1' Y_2)^2$
• Martin distance between two ARMA models [Martin00]
• Max Corr, Min Corr, Procrustes 1/2

Comparison

| | Arc Length | Projection | Binet-Cauchy |
|---|---|---|---|
| $d^2(Y_1, Y_2)$ | · | $2^{-1}\lVert Y_1 Y_1' - Y_2 Y_2' \rVert_F^2$ | $1 - \det(Y_1' Y_2)^2$ |
| In terms of θ | $\sum_i \theta_i^2$ | $\sum_i \sin^2\theta_i$ | $1 - \prod_i \cos^2\theta_i$ |
| Is a metric? | Yes | Yes | Yes |

| | Max Corr | Min Corr | Procrustes 1 | Procrustes 2 |
|---|---|---|---|---|
| $d^2(Y_1, Y_2)$ | $2 - 2\lVert Y_1' Y_2 \rVert_2^2$ | $\lVert Y_1 Y_1' - Y_2 Y_2' \rVert_2^2$ | $\lVert Y_1 U - Y_2 V \rVert_F^2$ | $\lVert Y_1 U - Y_2 V \rVert_2^2$ |
| In terms of θ | $2\sin^2\theta_1$ | $\sin^2\theta_m$ | $4\sum_i \sin^2(\theta_i/2)$ | $4\sin^2(\theta_m/2)$ |
| Is a metric? | No | Yes | Yes | Yes |

• Characteristics
  • $d^2_{MaxCor} = 2\sin^2\theta_1$ is a rough measure
  • $d^2_{Arc} = \sum_i \theta_i^2$, $d^2_{Proj} = \sum_i \sin^2\theta_i$, $d^2_{Proc1} = 4\sum_i \sin^2(\theta_i/2)$: intermediate
  • $d^2_{MinCor} = \sin^2\theta_m$, $d^2_{Proc2} = 4\sin^2(\theta_m/2)$: too sensitive
  • $d^2_{BC} = 1 - \prod_{i=1}^m \cos^2\theta_i$
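The squared distances in the comparison can be computed from the singular values of $Y_1'Y_2$ alone. The sketch below is one possible arrangement, with hypothetical function and key names; the Min Corr entry follows the spectral-norm form $\lVert Y_1Y_1' - Y_2Y_2' \rVert_2^2$, and the last two lines evaluate the matrix forms of the projection and Binet-Cauchy distances for comparison.

```python
import numpy as np

def grassmann_distances_sq(Y1, Y2):
    """Squared Grassmann distances computed from the principal angles."""
    cos = np.linalg.svd(Y1.T @ Y2, compute_uv=False)   # cos(theta_1) >= ... >= cos(theta_m)
    cos = np.clip(cos, 0.0, 1.0)                       # guard against round-off
    theta = np.arccos(cos)
    sin2 = 1.0 - cos ** 2                              # sin^2(theta_i)
    half = (1.0 - cos) / 2.0                           # sin^2(theta_i / 2)
    return {
        "arc": np.sum(theta ** 2),
        "proj": np.sum(sin2),
        "bc": 1.0 - np.prod(cos ** 2),
        "maxcor": 2.0 * sin2[0],       # uses the smallest angle only
        "mincor": sin2[-1],            # uses the largest angle only
        "proc1": 4.0 * np.sum(half),
        "proc2": 4.0 * half[-1],
    }

rng = np.random.default_rng(0)
Y1, _ = np.linalg.qr(rng.standard_normal((10, 3)))
Y2, _ = np.linalg.qr(rng.standard_normal((10, 3)))
d = grassmann_distances_sq(Y1, Y2)

# Matrix forms from the slides, for comparison with the angle forms.
proj_matrix_form = 0.5 * np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, "fro") ** 2
bc_matrix_form = 1.0 - np.linalg.det(Y1.T @ Y2) ** 2
```

Computing $\sin^2\theta_i$ as $1 - \cos^2\theta_i$ avoids passing through `arccos` where it is not needed; only the arc-length distance uses the angles themselves.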

Applications

• Distance-based: e.g., k-NN
  • Mutual Subspace Method [Yamaguchi98]
• Beyond k-NN: subspace-based discriminant analysis
  • Previous methods
    • Constrained Mutual Subspace Method [Fukui03]
    • Discriminant Analysis of Canonical Correlations [Kim07]


Some complications

• Find a discriminative direction $w \in \mathbb{R}^D$, so that
  $d(w'X_i, w'X_j)$ becomes $\begin{cases} \text{small} & \text{if } y_i = y_j, \\ \text{large} & \text{if } y_i \neq y_j \end{cases}$
  • A difficult optimization problem
  • Procedures often iterative, and not well justified
• Inconsistency: projection is performed in image space, but distances are computed on Grassmann space

Easier solution

• Use kernel-induced Hilbert space

[Figure: subspaces span($Y_i$) and span($Y_j$), with principal angles $\theta_1, \ldots, \theta_m$, are represented as points $Y_i$, $Y_j$ on G(m, D) and mapped into a kernel-induced Hilbert space.]

• No need to 1) project data and 2) measure distances separately


Grassmann kernels

• Let $k : \mathbb{R}^{D \times m} \times \mathbb{R}^{D \times m} \to \mathbb{R}$ be a real-valued symmetric function, $k(x_1, x_2) = k(x_2, x_1)$
• Invariance: $k(Y_1, Y_2) = k(Y_1 R_1, Y_2 R_2)$, $\forall R_1, R_2 \in O(m)$
• Positive definiteness: $\sum_{i,j} c_i c_j k(x_i, x_j) \ge 0$, $\forall (x_1, \ldots, x_n)$, $\forall (c_1, \ldots, c_n)$, $n \in \mathbb{N}$
• $d_{Proj}$, $d_{BC}$ have corresponding Grassmann kernels

Projection kernel

• Projection embedding [Chikuse06]: the map $\Psi : G(m, D) \to \mathbb{R}^{D \times D}$, $\mathrm{span}(Y) \mapsto YY'$ is an isometric embedding from $(G, d_{Proj})$ to $(\mathbb{R}^{D \times D}, \|\cdot\|_F)$.
• Natural inner product in $\mathbb{R}^{D \times D}$: $\mathrm{tr}(Y_1 Y_1' Y_2 Y_2')$
• Projection kernel: $k_{Proj}(Y_1, Y_2) = \mathrm{tr}(Y_1 Y_1' Y_2 Y_2') = \|Y_1' Y_2\|_F^2$ is a Grassmann kernel
• Has a very simple form and requires only multiplications to evaluate.
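A small numerical check of the projection kernel is sketched below. The helper names and the random test data are hypothetical; the code compares the two equivalent forms of $k_{Proj}$, tests invariance under a change of basis, and builds a Gram matrix whose eigenvalues can be inspected for positive semidefiniteness.

```python
import numpy as np

def projection_kernel(Y1, Y2):
    """k_Proj(Y1, Y2) = ||Y1' Y2||_F^2 for orthonormal D x m bases."""
    return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

def random_basis(rng, D, m):
    """Random D x m matrix with orthonormal columns (QR of a Gaussian matrix)."""
    Q, _ = np.linalg.qr(rng.standard_normal((D, m)))
    return Q

rng = np.random.default_rng(0)
D, m = 8, 3
bases = [random_basis(rng, D, m) for _ in range(6)]

# Gram matrix over a small set of subspaces.
K = np.array([[projection_kernel(Yi, Yj) for Yj in bases] for Yi in bases])

# Invariance: rotating each basis within its subspace leaves the kernel unchanged.
R1, R2 = random_basis(rng, m, m), random_basis(rng, m, m)
k_plain = projection_kernel(bases[0], bases[1])
k_rotated = projection_kernel(bases[0] @ R1, bases[1] @ R2)

# Trace form from the slide: tr(Y1 Y1' Y2 Y2').
k_trace = np.trace(bases[0] @ bases[0].T @ bases[1] @ bases[1].T)
```

The Frobenius-norm form only needs the m × m product $Y_1'Y_2$, which is much cheaper than forming the D × D projectors when D is large.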


Binet-Cauchy kernel

• Binet-Cauchy identity [Horn85]: suppose we choose m rows from a D × m matrix A. Then there are $n = \binom{D}{m}$ square submatrices $A^{(s_1)}, \ldots, A^{(s_n)}$, and
  $\det(A'B) = \sum_s \det A^{(s)} \det B^{(s)}$
• Binet-Cauchy embedding: the map $\Psi : G(m, D) \to \mathbb{R}^n$, $\mathrm{span}(Y) \mapsto (\det Y^{(s_1)}, \ldots, \det Y^{(s_n)})$ is an embedding. It is also an isometry from $(G, d_{BC})$ to $(\mathbb{R}^n, \|\cdot\|_2)$.
• Binet-Cauchy kernel [Wolf03,Vishwanathan04]: $k_{BC}(Y_1, Y_2) = (\det Y_1' Y_2)^2$ is a Grassmann kernel
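The Binet-Cauchy identity and its link to the kernel can be verified by brute force for small sizes. In this sketch the function names `minors` and `bc_kernel` are hypothetical, and enumerating all $\binom{D}{m}$ row subsets is only practical for toy dimensions.

```python
import numpy as np
from itertools import combinations

def minors(A):
    """All m x m minors of a D x m matrix, one per choice of m rows."""
    D, m = A.shape
    return np.array([np.linalg.det(A[list(rows), :])
                     for rows in combinations(range(D), m)])

def bc_kernel(Y1, Y2):
    """k_BC(Y1, Y2) = det(Y1' Y2)^2 for orthonormal D x m bases."""
    return np.linalg.det(Y1.T @ Y2) ** 2

rng = np.random.default_rng(0)
D, m = 6, 3
A = rng.standard_normal((D, m))
B = rng.standard_normal((D, m))

# Binet-Cauchy identity: det(A'B) equals the inner product of the minor vectors.
lhs = np.linalg.det(A.T @ B)
rhs = minors(A) @ minors(B)

# The kernel can be evaluated directly or through the embedding.
Y1, _ = np.linalg.qr(A)
Y2, _ = np.linalg.qr(B)
k_direct = bc_kernel(Y1, Y2)
k_embedded = (minors(Y1) @ minors(Y2)) ** 2
```

Evaluating the determinant of the m × m matrix $Y_1'Y_2$ gives the same value as the inner product in the $\binom{D}{m}$-dimensional embedding, without ever forming that embedding.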

Advantages of Grassmann kernel

• Access to all the kernel-based algorithms for Hilbert spaces!
• Can generate a family of kernels: if $k_1(x, y)$ and $k_2(x, y)$ are PD kernels, then so are
  1. $\alpha_1 k_1(x, y) + \alpha_2 k_2(x, y)$, $(\alpha_1, \alpha_2 > 0)$,
  2. $k_1(x, y)\, k_2(x, y)$,
  3. $\int k_1(x, z)\, k_1(y, z)\, dz$,
  4. $f(x)\, k_1(x, y)\, f(y)$
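Rules 1, 2, and 4 can be illustrated on Gram matrices of the projection and Binet-Cauchy kernels. The sketch below uses random subspaces and arbitrary weights, all hypothetical; rule 3 involves an integral and is not covered here. A numerical check on one sample does not prove positive definiteness, it only shows the rules behaving as stated.

```python
import numpy as np

def k_proj(Y1, Y2):
    return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

def k_bc(Y1, Y2):
    return np.linalg.det(Y1.T @ Y2) ** 2

def gram(kernel, bases):
    """Gram matrix of a kernel over a list of orthonormal bases."""
    return np.array([[kernel(Yi, Yj) for Yj in bases] for Yi in bases])

rng = np.random.default_rng(0)
bases = [np.linalg.qr(rng.standard_normal((8, 2)))[0] for _ in range(10)]
K1, K2 = gram(k_proj, bases), gram(k_bc, bases)

f = rng.standard_normal(10)                       # arbitrary values f(x_i)
combined = {
    "sum": 0.5 * K1 + 2.0 * K2,                   # rule 1, positive weights
    "product": K1 * K2,                           # rule 2, elementwise product
    "scaled": f[:, None] * K1 * f[None, :],       # rule 4
}
min_eigs = {name: np.linalg.eigvalsh(K).min() for name, K in combined.items()}
```

Each entry of `min_eigs` should be nonnegative up to round-off, consistent with the combined Gram matrices being positive semidefinite.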

Extension to nonlinear subspace

• "Doubly kernel" method [Wolf03,Wang06]

[Figure: data X is mapped by kernel PCA into a feature space $H_1$, where the subspaces span($Y_i$) and span($Y_j$) are formed; the corresponding points $Y_i$, $Y_j$ on G(m, D) are then mapped into a second Hilbert space $H_2$.]

Kernel Fisher Discriminant Analysis

Training:
1. Compute the matrix $[K_{train}]_{ij} = k_{Proj}(Y_i, Y_j)$ or $k_{BC}(Y_i, Y_j)$ for all $Y_i$, $Y_j$ in the training set.
2. Solve $\max_\alpha L(\alpha)$ by eigen-decomposition.
3. Compute the (C − 1)-dimensional coefficients $F_{train} = \alpha' K_{train}$.

Testing:
1. Compute the matrix $[K_{test}]_{ij} = k_{Proj}(Y_i, Y_j)$ or $k_{BC}(Y_i, Y_j)$ for all $Y_i$ in the training set and $Y_j$ in the test set.
2. Compute the (C − 1)-dimensional coefficients $F_{test} = \alpha' K_{test}$.
3. Perform 1-NN classification using the Euclidean distance between $F_{train}$ and $F_{test}$.
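The training and testing steps can be sketched end to end on toy data. Several parts of this sketch are assumptions rather than the authors' implementation: the slides do not spell out $L(\alpha)$, so a standard regularized kernel Fisher quotient is used; the ridge parameter `reg`, the Cholesky-based solver, all function names, and the synthetic subspaces are hypothetical.

```python
import numpy as np

def k_proj(Y1, Y2):
    return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

def kernel_matrix(bases_a, bases_b):
    return np.array([[k_proj(Ya, Yb) for Yb in bases_b] for Ya in bases_a])

def kfda_fit(K, labels, reg=1e-3):
    """Return alpha (N x (C-1)) maximizing a regularized Fisher quotient."""
    N = K.shape[0]
    classes = np.unique(labels)
    mean_all = K.mean(axis=1)
    S_b = np.zeros((N, N))
    S_w = reg * np.eye(N)                      # ridge term keeps S_w invertible
    for c in classes:
        K_c = K[:, labels == c]                # kernel columns of class c
        mean_c = K_c.mean(axis=1)
        diff = mean_c - mean_all
        S_b += K_c.shape[1] * np.outer(diff, diff)
        centered = K_c - mean_c[:, None]
        S_w += centered @ centered.T
    # Generalized eigenproblem S_b a = lambda S_w a via Cholesky whitening.
    L = np.linalg.cholesky(S_w)
    M = np.linalg.solve(L, np.linalg.solve(L, S_b).T)   # L^-1 S_b L^-T
    _, V = np.linalg.eigh((M + M.T) / 2)
    top = V[:, ::-1][:, :len(classes) - 1]     # eigenvectors of the largest eigenvalues
    return np.linalg.solve(L.T, top)

def nearest_neighbor(F_train, labels, F_test):
    """1-NN labels for the columns of F_test, using squared Euclidean distance."""
    d2 = ((F_train[:, :, None] - F_test[:, None, :]) ** 2).sum(axis=0)
    return labels[np.argmin(d2, axis=0)]

# Toy data (hypothetical): 3 classes of noisy 2-dimensional subspaces in R^20.
rng = np.random.default_rng(0)

def sample(center, n):
    return [np.linalg.qr(center + 0.05 * rng.standard_normal(center.shape))[0]
            for _ in range(n)]

centers = [rng.standard_normal((20, 2)) for _ in range(3)]
train = [Y for c in centers for Y in sample(c, 6)]
test = [Y for c in centers for Y in sample(c, 3)]
y_train = np.repeat(np.arange(3), 6)
y_test = np.repeat(np.arange(3), 3)

K_train = kernel_matrix(train, train)          # N_train x N_train
K_test = kernel_matrix(train, test)            # N_train x N_test
alpha = kfda_fit(K_train, y_train)
F_train = alpha.T @ K_train                    # (C-1) x N_train
F_test = alpha.T @ K_test                      # (C-1) x N_test
y_pred = nearest_neighbor(F_train, y_train, F_test)
```

Swapping `k_proj` for a Binet-Cauchy kernel only changes `kernel_matrix`; the discriminant analysis itself sees nothing but the kernel matrices.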

Discriminant Analysis Algorithms

• Baseline: Euclidean FDA
• Grassmann Discriminant Analysis (GDA):
  • kernel FDA + Proj / BC kernel
• Others
  • MSM: no dimensionality reduction + MaxCor [Yamaguchi98]
  • cMSM: heuristic dimensionality reduction + MaxCor [Fukui03]
  • DCC: iterating between 1. NDA 2. Proc1 [Kim07]

Illumination-invariant face recognition

• Yale face database [Georghiades01]
  • 38 persons × 9 poses × 45 illuminations
  • PCA along illumination axis
  • 9-fold cross validation
• CMU-PIE database [Sim03]
  • 68 persons × 7 poses × 43 illuminations
  • 7-fold cross validation

[Figure: face images arranged by person and illumination.]


Pose-invariant object categorization

• ETH-80 database [Leibe02]
  • 8 categories × 10 objects × 41 poses
  • PCA along pose axis
  • 10-fold cross validation

[Figure: object images arranged by category and pose.]


Video-based action recognition

• IXMAS database [Weinland06]
  • 11 actions × 11 actors × T frames (× 3 trials)
  • ARMA model with T frames
  • 11-fold cross validation

[Figure: video frames arranged by action and frame.]


Results

[Figure: four result plots, one per database, showing rate (%) against subspace dimension (m); the plotted values are not recoverable from the extracted text.]


Conclusion

• Subspace-based learning: new paradigm for exploiting inherent linear structures in data
• Grassmann manifold as a framework: Projection distances and kernels
• Experiments: superior classification performance with proposed method
• Not limited to image data/FDA method/classification task
• An open question
