Subspace-based Learning with Grassmann Kernels - VideoLectures
<strong>Subspace</strong>-<strong>based</strong> <strong>Learning</strong> <strong>with</strong><br />
<strong>Grassmann</strong> <strong>Kernels</strong><br />
Jihun Hamm and Daniel D. Lee<br />
University of Pennsylvania<br />
July 8, 2008
<strong>Subspace</strong> structure in data<br />
• Image data<br />
• Illumination variation is low-dimensional: empirical [Hallinan94,Epstein95], theoretical [Belhumeur98,Ramamoorthi01,Ramamoorthi02,Basri03]<br />
• Pose, expression, etc. are also modeled well by subspaces: “Eigenface” [Sirovich87,Kirby90,Turk91]
Set of illumination-subspaces<br />
[Figure: image sets $X_1, X_2$ are mapped by PCA to subspaces with bases $Y_1, Y_2 \in \mathbb{R}^{D \times m}$]
Set of pose-subspaces<br />
[Figure: image sets $X_1, X_2$ are mapped by PCA to subspaces with bases $Y_1, Y_2 \in \mathbb{R}^{D \times m}$]
Linear dynamical model<br />
• ARMA model of sequential data, e.g., dynamic textures, human actions [Doretto03,Veeraraghavan05,Turaga08]<br />
$x(t + 1) = A x(t) + v(t)$<br />
$y(t) = C x(t) + w(t)$<br />
$x(t)$: internal state, $y(t)$: observed output, $v(t), w(t)$: noise<br />
• Observability matrix [Cock02]<br />
$O_{(A,C)} = [C;\ CA;\ CA^2;\ \ldots] \in \mathbb{R}^{D \times m}$<br />
• Each ARMA model spans a unique subspace
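A rough sketch of how an ARMA model could be turned into a subspace basis is given below. It assumes the infinite observability matrix is truncated to a fixed number of blocks and orthonormalized by QR; both choices, the function name `observability_basis`, and the toy dimensions are assumptions made for illustration, not details taken from the slides.

```python
import numpy as np

def observability_basis(A, C, n_blocks):
    """Orthonormal basis for the column space of [C; CA; ...; CA^(n_blocks-1)]."""
    blocks, M = [], C
    for _ in range(n_blocks):
        blocks.append(M)
        M = M @ A                      # next block: C A^k
    O = np.vstack(blocks)              # truncated observability matrix
    Q, _ = np.linalg.qr(O)             # reduced QR gives orthonormal columns
    return Q

rng = np.random.default_rng(0)
m, p = 3, 5                            # state and output dimensions (toy values)
A = 0.5 * rng.standard_normal((m, m))
C = rng.standard_normal((p, m))
Y = observability_basis(A, C, n_blocks=4)

# A change of state coordinates describes the same model and leaves
# the spanned subspace unchanged, so the projection matrices agree.
T = rng.standard_normal((m, m)) + 3.0 * np.eye(m)
Y_T = observability_basis(np.linalg.solve(T, A @ T), C @ T, n_blocks=4)
same_subspace = np.allclose(Y @ Y.T, Y_T @ Y_T.T)
print(Y.shape, same_subspace)
```

The comparison uses projection matrices because the two bases themselves may differ by an orthogonal factor even when the subspaces coincide.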
Set of observability subspaces<br />
[Figure: image sequences (e.g., “check watch”, “scratch head”) are mapped to observability matrices $O_{A_1,C_1}, O_{A_2,C_2}, \ldots$]
<strong>Subspace</strong>-<strong>based</strong> learning<br />
• Assumption: data consists of linear subspaces<br />
[Figure: subspace 1, subspace 2, ..., subspace N in $\mathbb{R}^D$]<br />
• Model out the known (undesired) variability <strong>with</strong> linear subspaces<br />
• Learn the unknown (interesting) variability between subspaces
Framework for subspace-<strong>based</strong> learning<br />
• The <strong>Grassmann</strong> manifold G(m, D) is the set of m-dimensional linear subspaces of $\mathbb{R}^D$.<br />
• Applications in signal processing and control [Srivastava00,Henkel05,Baumann07], optimization [Edelman99], and computer vision [Liu04,Lin06,Chang06,Turaga08].<br />
[Figure: subspaces span(Yi) and span(Yj) in $\mathbb{R}^D$, with vectors $u_1, v_1$ and angles $\theta_1, \ldots, \theta_m$, shown as points Yi, Yj on G(m, D)]
Representation<br />
• Quotient space representation: $G(m, D) = O(D) / (O(m) \times O(D - m))$<br />
• Basis representation of an element<br />
An element of G(m, D) is represented by a D × m matrix Y such that $Y'Y = I_m$, <strong>with</strong> the equivalence relation:<br />
$Y_1 \cong Y_2 \iff \mathrm{span}(Y_1) = \mathrm{span}(Y_2) \iff \exists R_m \in O(m)$ such that $Y_1 = Y_2 R_m$
Principal angle/canonical correlation<br />
Let Y1 and Y2 be two orthonormal matrices of size D by m, and let u ∈ span(Y1) and v ∈ span(Y2) be unit vectors.<br />
The first principal angle/canonical correlation between span(Y1) and span(Y2) is<br />
$\cos\theta_1 = \max_{u \in \mathrm{span}(Y_1)} \max_{v \in \mathrm{span}(Y_2)} u'v$, subject to $\|u\| = \|v\| = 1$.<br />
[Figure: subspaces span(Yi) and span(Yj) in $\mathbb{R}^D$, with vectors $u_1, v_1$ and angles $\theta_1, \ldots, \theta_m$, shown as points on G(m, D)]
k-th principal angle<br />
• The k-th principal angle/canonical correlation is:<br />
$\cos\theta_k = \max_{u_k \in \mathrm{span}(Y_1)} \max_{v_k \in \mathrm{span}(Y_2)} u_k' v_k$, subject to<br />
$u_k' u_k = 1,\ v_k' v_k = 1,\ u_k' u_i = 0,\ v_k' v_i = 0 \quad (i = 1, \ldots, k - 1)$.<br />
$0 \le \theta_1 \le \cdots \le \theta_m \le \pi/2$ and $1 \ge \cos\theta_1 \ge \cdots \ge \cos\theta_m \ge 0$<br />
• Use SVD for computation: $Y_1' Y_2 = U S V'$ [Golub96], where $U = [u_1 \ldots u_m]$, $V = [v_1 \ldots v_m]$, and S is the diagonal matrix $S = \mathrm{diag}(\cos\theta_1, \ldots, \cos\theta_m)$.
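The SVD recipe above can be illustrated with a short NumPy sketch. The function name `principal_angles` and the two example planes are chosen here for illustration only; the clipping step is an added safeguard against round-off, not something the slides mention.

```python
import numpy as np

def principal_angles(Y1, Y2):
    """Principal angles (ascending) between span(Y1) and span(Y2); both D x m, orthonormal."""
    # Singular values of Y1'Y2 are cos(theta_k), returned in descending order.
    s = np.linalg.svd(Y1.T @ Y2, compute_uv=False)
    # Clip guards against values slightly outside [0, 1] due to round-off.
    return np.arccos(np.clip(s, 0.0, 1.0))

# Two planes in R^3 that share the x-axis; the second is tilted by 30 degrees about it.
t = np.pi / 6
Y1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
Y2 = np.array([[1.0, 0.0], [0.0, np.cos(t)], [0.0, np.sin(t)]])
theta = principal_angles(Y1, Y2)
print(theta)
```

For this pair the first angle should be close to zero (the shared axis) and the second close to the tilt angle.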
Principal angles and distance<br />
• Given: two subspaces Y1 and Y2, and principal angles θ1, ... , θm from SVD<br />
• How to define a good subspace distance from θ1, ... , θm?<br />
• A canonical distance [Edelman99]<br />
• Arc-length: $d^2_{\mathrm{Arc}}(Y_1, Y_2) = \sum_i \theta_i^2$<br />
• Is this the only distance?
<strong>Grassmann</strong> distances<br />
• Projection distance [Edelman99,Wang06]<br />
$d^2_{\mathrm{Proj}}(Y_1, Y_2) = \sum_{i=1}^m \sin^2\theta_i = 2^{-1} \lVert Y_1 Y_1' - Y_2 Y_2' \rVert_F^2$<br />
• Binet-Cauchy distance [Wolf03,Vishwanathan04]<br />
$d^2_{\mathrm{BC}}(Y_1, Y_2) = 1 - \prod_i \cos^2\theta_i = 1 - \det(Y_1' Y_2)^2$<br />
• Martin distance between two ARMA models [Martin00]<br />
• Max Corr, Min Corr, Procrustes1/2
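As a quick numerical check of the projection and Binet-Cauchy formulas, the sketch below computes each squared distance twice, once from the principal angles and once from the matrix expression. The random test subspaces and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
D, m = 10, 3
Y1, _ = np.linalg.qr(rng.standard_normal((D, m)))   # random orthonormal bases
Y2, _ = np.linalg.qr(rng.standard_normal((D, m)))

cos_t = np.clip(np.linalg.svd(Y1.T @ Y2, compute_uv=False), 0.0, 1.0)
theta = np.arccos(cos_t)

# Angle forms
d2_arc = np.sum(theta ** 2)
d2_proj = np.sum(np.sin(theta) ** 2)
d2_bc = 1.0 - np.prod(cos_t ** 2)

# Matrix forms
d2_proj_mat = 0.5 * np.linalg.norm(Y1 @ Y1.T - Y2 @ Y2.T, "fro") ** 2
d2_bc_mat = 1.0 - np.linalg.det(Y1.T @ Y2) ** 2

print(np.isclose(d2_proj, d2_proj_mat), np.isclose(d2_bc, d2_bc_mat))
```

The arc-length distance has no matrix shortcut here and is computed from the angles alone.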
Comparison<br />
Distance | $d^2(Y_1, Y_2)$ | In terms of θ | Is a metric?<br />
Arc Length | · | $\sum_i \theta_i^2$ | Yes<br />
Projection | $2^{-1} \lVert Y_1 Y_1' - Y_2 Y_2' \rVert_F^2$ | $\sum_i \sin^2\theta_i$ | Yes<br />
Binet-Cauchy | $1 - \det(Y_1' Y_2)^2$ | $1 - \prod_i \cos^2\theta_i$ | Yes<br />
Max Corr | $2 - 2 \lVert Y_1' Y_2 \rVert_2^2$ | $2 \sin^2\theta_1$ | No<br />
Min Corr | $\lVert Y_1 Y_1' - Y_2 Y_2' \rVert_2^2$ | $\sin^2\theta_m$ | Yes<br />
Procrustes 1 | $\lVert Y_1 U - Y_2 V \rVert_F^2$ | $4 \sum_i \sin^2(\theta_i/2)$ | Yes<br />
Procrustes 2 | $\lVert Y_1 U - Y_2 V \rVert_2^2$ | $4 \sin^2(\theta_m/2)$ | Yes<br />
• Characteristics<br />
• $d^2_{\mathrm{MaxCor}} = 2 \sin^2\theta_1$ is a rough measure<br />
• $d^2_{\mathrm{MinCor}} = \sin^2\theta_m$, $d^2_{\mathrm{Proc2}} = 4 \sin^2(\theta_m/2)$: too sensitive<br />
• $d^2_{\mathrm{Arc}} = \sum_i \theta_i^2$, $d^2_{\mathrm{Proj}} = \sum_i \sin^2\theta_i$, $d^2_{\mathrm{Proc1}} = 4 \sum_i \sin^2(\theta_i/2)$, $d^2_{\mathrm{BC}} = 1 - \prod_{i=1}^m \cos^2\theta_i$: intermediate
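The matrix and angle forms of the four remaining distances (Max Corr, Min Corr, Procrustes 1 and 2) can be compared numerically. The following is a sketch on random subspaces; U and V are taken from the SVD of Y1'Y2 as on the earlier slide, and the dictionary layout is just a convenience chosen here.

```python
import numpy as np

rng = np.random.default_rng(3)
D, m = 9, 3
Y1, _ = np.linalg.qr(rng.standard_normal((D, m)))
Y2, _ = np.linalg.qr(rng.standard_normal((D, m)))

U, s, Vt = np.linalg.svd(Y1.T @ Y2)
theta = np.arccos(np.clip(s, 0.0, 1.0))   # ascending: theta[0] = theta_1, theta[-1] = theta_m
P_diff = Y1 @ Y1.T - Y2 @ Y2.T
aligned = Y1 @ U - Y2 @ Vt.T              # Y1 U - Y2 V

# Each entry pairs (matrix form, angle form); ord=2 is the spectral norm.
checks = {
    "MaxCor": (2 - 2 * np.linalg.norm(Y1.T @ Y2, 2) ** 2, 2 * np.sin(theta[0]) ** 2),
    "MinCor": (np.linalg.norm(P_diff, 2) ** 2, np.sin(theta[-1]) ** 2),
    "Proc1": (np.linalg.norm(aligned, "fro") ** 2, 4 * np.sum(np.sin(theta / 2) ** 2)),
    "Proc2": (np.linalg.norm(aligned, 2) ** 2, 4 * np.sin(theta[-1] / 2) ** 2),
}
for name, (matrix_form, angle_form) in checks.items():
    print(name, np.isclose(matrix_form, angle_form))
```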
Applications<br />
• Distance-<strong>based</strong>: e.g., k-NN<br />
• Mutual <strong>Subspace</strong> Method [Yamaguchi98]<br />
• Beyond k-NN: subspace-<strong>based</strong> discriminant analysis<br />
• Previous methods<br />
• Constrained Mutual <strong>Subspace</strong> Method [Fukui03]<br />
• Discriminant Analysis of Canonical Correlations [Kim06]
Some complications<br />
• Find a discriminative direction $w \in \mathbb{R}^D$, so that $d(w'X_i, w'X_j)$ becomes small if $y_i = y_j$ and large if $y_i \ne y_j$<br />
• A difficult optimization problem<br />
• Procedures often iterative, and not well justified<br />
• Inconsistency: projection is performed in image space, but distances are computed on <strong>Grassmann</strong> space
Easier solution<br />
• Use kernel-induced Hilbert space<br />
• No need to 1) project data and 2) measure distances separately<br />
[Figure: subspaces span(Yi), span(Yj) are points Yi, Yj on G(m, D), which are mapped into a kernel-induced Hilbert space]
<strong>Grassmann</strong> kernels<br />
• Let $k : \mathbb{R}^{D \times m} \times \mathbb{R}^{D \times m} \to \mathbb{R}$ be a real-valued symmetric function, $k(x_1, x_2) = k(x_2, x_1)$<br />
• Invariance: $k(Y_1, Y_2) = k(Y_1 R_1, Y_2 R_2),\ \forall R_1, R_2 \in O(m)$<br />
• Positive definiteness: $\sum_{i,j} c_i c_j k(x_i, x_j) \ge 0,\ \forall (x_1, \ldots, x_n),\ \forall (c_1, \ldots, c_n),\ n \in \mathbb{N}$<br />
• $d_{\mathrm{Proj}}$, $d_{\mathrm{BC}}$ have corresponding <strong>Grassmann</strong> kernels
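The three conditions (symmetry, invariance, positive definiteness) can be spot-checked on a random sample, as in the sketch below. It is only a sanity check on finitely many points, not a proof. The projection kernel is used as the positive example; `k_trace`, the plain inner product of the basis matrices, is a counter-example introduced here for contrast, and all helper names are hypothetical.

```python
import numpy as np

def k_proj(Y1, Y2):
    return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

def k_trace(Y1, Y2):
    # Euclidean inner product of the basis matrices: PD, but depends on the chosen basis.
    return np.trace(Y1.T @ Y2)

def random_orthonormal(rng, rows, cols):
    Q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return Q

def looks_like_grassmann_kernel(k, Ys, R1, R2, tol=1e-9):
    """Check symmetry, O(m)-invariance and PSD of the Gram matrix on a sample."""
    K = np.array([[k(a, b) for b in Ys] for a in Ys])
    K_rot = np.array([[k(a @ R1, b @ R2) for b in Ys] for a in Ys])
    symmetric = np.allclose(K, K.T)
    invariant = np.allclose(K, K_rot)
    psd = np.linalg.eigvalsh((K + K.T) / 2).min() >= -tol
    return bool(symmetric and invariant and psd)

rng = np.random.default_rng(2)
D, m, n = 12, 3, 6
Ys = [random_orthonormal(rng, D, m) for _ in range(n)]
R1 = random_orthonormal(rng, m, m)     # random elements of O(m)
R2 = random_orthonormal(rng, m, m)

print(looks_like_grassmann_kernel(k_proj, Ys, R1, R2))
print(looks_like_grassmann_kernel(k_trace, Ys, R1, R2))
```

The contrast shows why invariance is listed separately: a kernel can be symmetric and positive definite on matrices and still fail to be a function of the subspaces alone.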
Projection kernel<br />
• Projection embedding [Chikuse06]<br />
The map $\Psi : G(m, D) \to \mathbb{R}^{D \times D}$, $\mathrm{span}(Y) \mapsto Y Y'$ is an isometric embedding from $(G, d_{\mathrm{Proj}})$ to $(\mathbb{R}^{D \times D}, \lVert \cdot \rVert_F)$.<br />
• Natural inner product in $\mathbb{R}^{D \times D}$: $\mathrm{tr}(Y_1 Y_1' Y_2 Y_2')$<br />
• Projection kernel: $k_{\mathrm{Proj}}(Y_1, Y_2) = \mathrm{tr}(Y_1 Y_1' Y_2 Y_2') = \lVert Y_1' Y_2 \rVert_F^2$ is a <strong>Grassmann</strong> kernel<br />
• Has a very simple form and requires only O(Dm) multiplications to evaluate.
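A small sketch relating the projection kernel to the embedding and to the projection distance follows. It checks that the compact form agrees with the trace form, and that the squared projection distance equals m minus the kernel value, which follows from the sum of sin² being m minus the sum of cos². The random inputs and names are assumptions for illustration.

```python
import numpy as np

def k_proj(Y1, Y2):
    """Projection kernel computed from the small m x m product Y1'Y2."""
    return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

rng = np.random.default_rng(6)
D, m = 15, 4
Y1, _ = np.linalg.qr(rng.standard_normal((D, m)))
Y2, _ = np.linalg.qr(rng.standard_normal((D, m)))

P1, P2 = Y1 @ Y1.T, Y2 @ Y2.T                        # embedded points in R^{D x D}
k_trace_form = np.trace(P1 @ P2)                     # natural inner product of the embeddings
d2_proj = 0.5 * np.linalg.norm(P1 - P2, "fro") ** 2  # squared projection distance

print(np.isclose(k_proj(Y1, Y2), k_trace_form))
print(np.isclose(d2_proj, m - k_proj(Y1, Y2)))
```

In practice the compact form is preferable because it never builds the D × D projection matrices.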
Binet-Cauchy kernel<br />
• Binet-Cauchy identity [Horn85]<br />
Suppose we choose m rows from a D × m matrix A. Then there are $n = \binom{D}{m}$ square submatrices $A^{(s_1)}, \ldots, A^{(s_n)}$, and<br />
$\det(A'B) = \sum_s \det A^{(s)} \det B^{(s)}$<br />
• Binet-Cauchy embedding<br />
$\Psi : G(m, D) \to \mathbb{R}^n$, $\mathrm{span}(Y) \mapsto (\det Y^{(s_1)}, \ldots, \det Y^{(s_n)})$ is an embedding. It is also an isometry from $(G, d_{\mathrm{BC}})$ to $(\mathbb{R}^n, \lVert \cdot \rVert_2)$.<br />
• Binet-Cauchy kernel [Wolf03,Vishwanathan04]<br />
$k_{\mathrm{BC}}(Y_1, Y_2) = (\det Y_1' Y_2)^2$ is a <strong>Grassmann</strong> kernel
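The Binet-Cauchy identity can be verified by brute force for small D and m, enumerating every choice of m rows. This sketch is only meant to make the embedding concrete; the explicit enumeration grows combinatorially and is not how the kernel would be evaluated in practice. Function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def bc_embedding(Y):
    """Vector of all m x m minors of Y, one per choice of m rows."""
    D, m = Y.shape
    return np.array([np.linalg.det(Y[list(rows), :])
                     for rows in combinations(range(D), m)])

def k_bc(Y1, Y2):
    return np.linalg.det(Y1.T @ Y2) ** 2

rng = np.random.default_rng(4)
D, m = 6, 2
Y1, _ = np.linalg.qr(rng.standard_normal((D, m)))
Y2, _ = np.linalg.qr(rng.standard_normal((D, m)))

psi1, psi2 = bc_embedding(Y1), bc_embedding(Y2)
print(len(psi1))                                            # number of minors, C(6, 2)
print(np.isclose(psi1 @ psi2, np.linalg.det(Y1.T @ Y2)))    # Binet-Cauchy identity
print(np.isclose(k_bc(Y1, Y2), (psi1 @ psi2) ** 2))         # kernel from the embedding
```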
Advantages of <strong>Grassmann</strong> kernel<br />
• Access to all the kernel-<strong>based</strong> algorithms for<br />
Hilbert spaces!<br />
• Can generate a family of kernels:<br />
If k1(x, y) and k2(x, y) are PD kernels, then so are<br />
1. α1k1(x, y) + α2k2(x, y), (α1, α2 > 0),<br />
2. k1(x, y)k2(x, y),<br />
3. $\int k_1(x, z) k_1(y, z)\, dz$,<br />
4. f(x)k1(x, y)f(y)
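Rules 1, 2 and 4 can be illustrated at the level of Gram matrices, where they become a weighted sum, an elementwise product, and a diagonal rescaling. The sketch below builds projection and Binet-Cauchy Gram matrices on random subspaces and checks that the combinations stay positive semidefinite up to round-off. The weights, the function f, and the tolerance are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(4)
D, m, n = 8, 2, 5
Ys = [np.linalg.qr(rng.standard_normal((D, m)))[0] for _ in range(n)]

K_proj = np.array([[np.linalg.norm(a.T @ b, "fro") ** 2 for b in Ys] for a in Ys])
K_bc = np.array([[np.linalg.det(a.T @ b) ** 2 for b in Ys] for a in Ys])

K_sum = 0.7 * K_proj + 0.3 * K_bc              # rule 1: positive combination
K_prod = K_proj * K_bc                         # rule 2: elementwise product
f = np.arange(1.0, n + 1.0)                    # rule 4: arbitrary values f(x_i)
K_scaled = f[:, None] * K_proj * f[None, :]

def min_eig(K):
    # Symmetrize first so that round-off asymmetry does not affect eigvalsh.
    return np.linalg.eigvalsh((K + K.T) / 2).min()

for name, K in [("sum", K_sum), ("product", K_prod), ("scaled", K_scaled)]:
    print(name, min_eig(K) >= -1e-10)
```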
Extension to nonlinear subspace<br />
• “Doubly kernel” method [Wolf03,Wang06]<br />
[Figure: data X are mapped into a first Hilbert space, where kernel PCA yields subspaces span(Yi), span(Yj); these are points Yi, Yj on G(m, D), which are mapped in turn into a second Hilbert space]
Kernel Fisher Discriminant Analysis<br />
Training:<br />
1. Compute the matrix [Ktrain]ij = kProj(Yi, Yj) or kBC(Yi, Yj) for all Yi, Yj in<br />
the training set.<br />
2. Solve maxα L(α) by eigen-decomposition.<br />
3. Compute the (C − 1)-dimensional coefficients Ftrain = α ′ Ktrain.<br />
Testing:<br />
1. Compute the matrix [Ktest]ij = kProj(Yi, Yj) or kBC(Yi, Yj) for all Yi in<br />
the training set and Yj in the test set.<br />
2. Compute the (C − 1)-dim coefficients Ftest = α ′ Ktest.<br />
3. Perform 1-NN classification from the Euclidean distance between Ftrain<br />
and Ftest.
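The training and testing steps above can be sketched end to end with a precomputed projection kernel. The objective used here is a standard regularized kernel FDA (between-class over within-class scatter in the space of kernel columns), which may differ in detail from the L(α) on the slide. The regularizer value, the toy data generator, and all function names are assumptions.

```python
import numpy as np

def k_proj(Y1, Y2):
    return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

def gram(As, Bs):
    return np.array([[k_proj(a, b) for b in Bs] for a in As])

def kfda_fit(K, labels, reg=1e-2):
    """Return alpha (N x (C-1)) for a regularized kernel FDA on Gram matrix K."""
    N = K.shape[0]
    classes = np.unique(labels)
    mean_all = K.mean(axis=1)
    S_b, S_w = np.zeros((N, N)), np.zeros((N, N))
    for c in classes:
        Kc = K[:, labels == c]                        # kernel columns of class c
        diff = Kc.mean(axis=1) - mean_all
        S_b += Kc.shape[1] * np.outer(diff, diff)     # between-class scatter
        centered = Kc - Kc.mean(axis=1, keepdims=True)
        S_w += centered @ centered.T                  # within-class scatter
    S_w += reg * np.eye(N)                            # assumed regularizer
    L = np.linalg.cholesky(S_w)
    # Reduce S_b a = lambda S_w a to a symmetric problem: L^{-1} S_b L^{-T}.
    A = np.linalg.solve(L, np.linalg.solve(L, S_b).T)
    w, V = np.linalg.eigh((A + A.T) / 2)
    top = V[:, np.argsort(w)[::-1][: len(classes) - 1]]
    return np.linalg.solve(L.T, top)                  # alpha = L^{-T} v

def noisy_subspace(rng, base, noise=0.05):
    Q, _ = np.linalg.qr(base + noise * rng.standard_normal(base.shape))
    return Q

rng = np.random.default_rng(5)
D, m = 10, 2
bases = [rng.standard_normal((D, m)) for _ in range(2)]    # one base subspace per class
train = [noisy_subspace(rng, bases[c]) for c in (0, 1) for _ in range(5)]
y_train = np.repeat([0, 1], 5)
test = [noisy_subspace(rng, bases[c]) for c in (0, 1) for _ in range(3)]
y_test = np.repeat([0, 1], 3)

K_train = gram(train, train)
alpha = kfda_fit(K_train, y_train)
F_train = alpha.T @ K_train                  # (C-1) x N_train
F_test = alpha.T @ gram(train, test)         # (C-1) x N_test

# 1-NN in the discriminant space using Euclidean distance.
dists = np.linalg.norm(F_test[:, :, None] - F_train[:, None, :], axis=0)
y_pred = y_train[np.argmin(dists, axis=1)]
print("accuracy:", np.mean(y_pred == y_test))
```

Swapping `k_proj` for a Binet-Cauchy kernel in `gram` would give the other variant mentioned in step 1.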
Discriminant Analysis Algorithms<br />
• Baseline: Euclidean FDA<br />
• <strong>Grassmann</strong> Discriminant Analysis (GDA):<br />
• kernel FDA + Proj / BC kernel<br />
• Others<br />
• MSM : no dim reduction + MaxCor [Yamaguchi98]<br />
• cMSM : heur dim reduction + MaxCor [Fukui03]<br />
• DCC : iterating between 1.NDA 2. Proc1 [Kim07]
Illum-invariant face recognition<br />
• Yale face database [Georghiades01]<br />
• 38 persons x 9 poses x 45 illums<br />
• PCA along illum axis<br />
• 9-fold cross validation<br />
• Illum-invariant face recognition<br />
• CMU-PIE database [Sim03]<br />
• 68 persons x 7 poses x 43 illums<br />
• 7-fold cross validation<br />
[Figure: data cube with axes person, illumination, pose]
Pose-inv. object categorization<br />
• ETH-80 database [Leibe02]<br />
• 8 categories x 10 objects x 41 poses<br />
• PCA along pose axis<br />
• 10-fold cross validation<br />
• Pose-invariant object<br />
categorization<br />
[Figure: data cube with axes category, pose, object]
Video-<strong>based</strong> action recognition<br />
• IXMAS database [Weinland06]<br />
• 11 actions x 11 actors x T frms<br />
(x 3 trials)<br />
• ARMA model <strong>with</strong> T frms<br />
• 11-fold cross validation<br />
• Video-<strong>based</strong> action recognition<br />
[Figure: data cube with axes action, frame, person]
Results<br />
[Figure: four plots of classification rate (%) versus subspace dimension (m), one per dataset: Yale Face, ETH-80, CMU-PIE, IXMAS]
Conclusion<br />
• <strong>Subspace</strong>-<strong>based</strong> learning: new paradigm for<br />
exploiting inherent linear structures in data<br />
• <strong>Grassmann</strong> manifold as a framework:<br />
Projection distances and kernels<br />
• Experiments: superior classification<br />
performance <strong>with</strong> proposed method<br />
• Not limited to image data/FDA method/<br />
classification task<br />
• An open question