Self-Organizing Maps, Principal Components and Non-negative Matrix Factorization
Karoline Geissler
May 18, 2011
Table of Contents
1 Self-Organizing Maps
2 Principal Components, Curves and Surfaces
  Principal Components
  Principal Curves
  Spectral Clustering
3 Non-negative Matrix Factorization
Self-Organizing Maps

The simplest version of the SOM
Other versions of the SOM
Example
Reconstruction Error
The method can be viewed as a constrained version of K-means clustering.
The prototypes lie in a one- or two-dimensional manifold in the feature space.
The resulting manifold is referred to as a constrained topological map.
The original high-dimensional observations can be mapped down onto a two-dimensional coordinate system.
The simplest version of the SOM
A two-dimensional grid of K prototypes mj ∈ R^p.
Each of the K prototypes is parametrized by an integer coordinate pair ℓj ∈ Q1 × Q2.
The prototypes act like "buttons" placed on the grid.
Map the observations xi down onto the two-dimensional grid:
find the closest prototype mj to xi (in Euclidean distance).
We move the closest prototype mk toward xi via

mk ← mk + α(xi − mk)    (1)

The same update is then applied to all neighbors mj of mk, i.e. all j with ‖ℓj − ℓk‖ < r.

α ... the learning rate
r ... the distance threshold
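The loop above can be sketched in a few lines of NumPy. The function name, the grid size and the fixed α and r are illustrative assumptions; in practice α and r are decreased over the iterations.

```python
import numpy as np

def fit_som(X, grid=(5, 5), alpha=0.05, r=1.5, n_iter=2000, seed=0):
    """Simplest online SOM: for each observation x_i, find the closest
    prototype m_k (Euclidean distance) and move m_k and its grid
    neighbors (all j with ||l_j - l_k|| < r) toward x_i, rule (1)."""
    rng = np.random.default_rng(seed)
    q1, q2 = grid
    # integer grid coordinates l_j of the K = q1*q2 prototypes
    coords = np.array([(a, b) for a in range(q1) for b in range(q2)], float)
    # initialize the prototypes at randomly chosen data points
    M = X[rng.integers(0, len(X), q1 * q2)].astype(float)
    for _ in range(n_iter):
        x = X[rng.integers(0, len(X))]
        k = int(np.argmin(((M - x) ** 2).sum(axis=1)))
        near = np.linalg.norm(coords - coords[k], axis=1) < r
        M[near] += alpha * (x - M[near])   # m_k <- m_k + alpha (x_i - m_k)
    return M, coords
```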
Other versions of the SOM

The update step
mj ← mj + α h(‖ℓj − ℓk‖)(xi − mj)    (2)

h ... neighborhood function, which gives more weight to prototypes mj with indices ℓj closer to ℓk than to those further away.
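A sketch of one update of rule (2); the Gaussian form of h and the parameter values are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def gaussian_h(d, sigma=1.0):
    # more weight for prototypes whose grid index is close to the winner's
    return np.exp(-(d ** 2) / (2 * sigma ** 2))

def som_step(M, coords, x, alpha=0.05, sigma=1.0):
    """One smoothed-SOM update: every prototype m_j moves toward x,
    weighted by h(||l_j - l_k||), where m_k is the winning prototype."""
    k = int(np.argmin(((M - x) ** 2).sum(axis=1)))
    h = gaussian_h(np.linalg.norm(coords - coords[k], axis=1), sigma)
    return M + alpha * h[:, None] * (x - M)   # update rule (2)
```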
Karoline Geissler <strong>Self</strong>-<strong>Organizing</strong> <strong>Maps</strong>, <strong>Principal</strong> <strong>Components</strong> <strong>and</strong> <strong>Non</strong>-<strong>negative</strong> M
Example<br />
<strong>Self</strong> <strong>Organizing</strong> <strong>Maps</strong><br />
<strong>Principal</strong> <strong>Components</strong>, Curves <strong>and</strong> Surfaces<br />
<strong>Non</strong>-<strong>negative</strong> Matrix Factorization<br />
The simplest version of the SOM<br />
Other version of the SOM<br />
Example<br />
Reconstruction Error<br />
Generate 90 data points in three dimensions (near the surface of a half-sphere of radius 1).
5 × 5 grid of prototypes.
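Data of this kind can be generated along the following lines; the noise level and the omission of the three-class structure shown in the figures are simplifications.

```python
import numpy as np

def half_sphere_data(n=90, noise=0.05, seed=0):
    """n points near the surface of the unit half-sphere (z >= 0)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # uniform on the sphere
    v[:, 2] = np.abs(v[:, 2])                      # keep the upper half
    return v + noise * rng.normal(size=(n, 3))     # jitter off the surface
```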
Figure: Simulated data in three classes
Figure: Left panel is the initial configuration, right panel the final one.
Figure: Wiremesh representation of the fitted SOM model
Reconstruction Error

Σ_{i} ‖xi − mj(i)‖²

mj(i) ... the prototype closest to xi; the error is the total sum of squares of the data points around their prototypes.
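A sketch of the computation (the function name is an assumption): assign each point to its nearest prototype and sum the squared distances.

```python
import numpy as np

def reconstruction_error(X, M):
    """sum_i ||x_i - m_{j(i)}||^2, where m_{j(i)} is the prototype
    closest to x_i: the total sum of squares around the prototypes."""
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # N x K
    return float(d2.min(axis=1).sum())
```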
Figure: Reconstruction error
Document Organization and Retrieval

SOMs are useful for organizing and indexing large document corpora.
Term-document matrix, where each row represents a single document.
Principal Components, Curves and Surfaces
A sequence of projections of the data,
mutually uncorrelated and ordered in variance.
Principal Components
Principal components provide a sequence of best linear approximations to the given data in R^p, of all ranks q ≤ p.
Observations x1, ..., xN and the rank-q linear model

f(λ) = µ + Vq λ    (3)

µ ... location vector in R^p
Vq ... p × q matrix with q orthogonal unit vectors as columns
λ ... q-vector of parameters
Minimizing the reconstruction error

min_{µ, {λi}, Vq} Σ_{i=1}^{N} ‖xi − µ − Vq λi‖²    (4)

We obtain

µ̂ = x̄    (5)

λ̂i = Vq^T (xi − x̄)    (6)
This leaves us to find the orthogonal matrix Vq<br />
min_{Vq} Σ_{i=1}^{N} ‖(xi − x̄) − Vq Vq^T (xi − x̄)‖²    (7)

The projection matrix Hq = Vq Vq^T maps each point xi onto its rank-q reconstruction Hq xi.
The solution can also be expressed via the singular value decomposition

X = U D V^T    (8)

X ... the rows contain the centered observations
U ... N × p orthogonal matrix; its columns uj are the left singular vectors
V ... p × p orthogonal matrix; its columns vj are the right singular vectors
D ... p × p diagonal matrix with singular values d1 ≥ d2 ≥ ... ≥ dp ≥ 0
The columns of UD are called the principal components of X.
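Putting eqs. (3)-(8) together, a minimal sketch (names are illustrative): center the data, take the SVD, and read off scores and loadings.

```python
import numpy as np

def principal_components(X, q):
    """Rank-q PCA via the SVD of the centered data matrix X_c = U D V^T.
    Scores are the first q columns of UD; loadings the first q columns of V."""
    xbar = X.mean(axis=0)                       # mu_hat = xbar, eq. (5)
    U, d, Vt = np.linalg.svd(X - xbar, full_matrices=False)
    Vq = Vt[:q].T                               # p x q orthogonal columns
    scores = U[:, :q] * d[:q]                   # lambda_hat_i, eq. (6)
    return xbar, Vq, scores
```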
Handwritten Digits

Sample of 130 handwritten 3's.
We consider these images as points xi in R^256 and compute their principal components via the SVD.
Figure: A sample of 130 handwritten 3's shows a variety of writing styles.
Figure: The first two principal components of the handwritten threes
Principal Curves
Generalization of the principal component line.
First defined for random variables X ∈ R^p.
f(λ) ... a parameterized smooth curve; a vector function with p coordinates.
For each data value x, let λf(x) define the closest point on the curve to x.
The function f(λ) is called a principal curve for the distribution of X if

f(λ) = E(X | λf(X) = λ)    (9)

f(λ) is the average of all data points that project to it.
Principal Points

Set of k prototypes.
For each point x in the support of the distribution there is a closest prototype (the responsible prototype).
The set of k points that minimizes the expected distance from X to its prototype is called the set of principal points.
k = 1 ... the mean vector (for a circular normal distribution)
k = ∞ ... principal curves
Construction of a principal curve of a distribution

f(λ) = [f1(λ), f2(λ), ..., fp(λ)] ... coordinate functions
X^T = (X1, ..., Xp)

Consider the following alternating steps:
1. f̂j(λ) ← E(Xj | λ(X) = λ), j = 1, ..., p
2. λ̂f(x) ← argmin_λ ‖x − f̂(λ)‖²

The first step fixes λ and updates the curve; the second fixes the curve and finds the closest point on it to each data point.
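For finite data the conditional expectation in step 1 is replaced by a scatterplot smoother. A crude sketch of the alternation, in which the running-mean smoother, the span and the initialization by the first principal component are all illustrative assumptions:

```python
import numpy as np

def principal_curve(X, n_iter=10, span=0.2):
    """Alternating steps: (1) smooth each coordinate X_j against the
    current projection index lambda (running mean over nearby lambda),
    (2) re-project each point onto the fitted curve."""
    Xc = X - X.mean(axis=0)
    # initialize lambda with the first principal component scores
    lam = Xc @ np.linalg.svd(Xc, full_matrices=False)[2][0]
    k = max(3, int(span * len(X)))
    f = X.copy()
    for _ in range(n_iter):
        order = np.argsort(lam)
        # step 1: f_j(lambda) <- local average of X_j, in lambda order
        f = np.stack([np.convolve(X[order, j], np.ones(k) / k, mode="same")
                      for j in range(X.shape[1])], axis=1)
        # step 2: lambda(x) <- position of the closest curve point
        d2 = ((X[:, None, :] - f[None, :, :]) ** 2).sum(axis=2)
        lam = d2.argmin(axis=1).astype(float)
    return f, lam
```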
Spectral Clustering

For non-convex clusters.
Generalization of standard clustering methods.
N × N matrix of pairwise similarities sii′ ≥ 0 between all observation pairs.
Undirected similarity graph G = ⟨V, E⟩.
The N vertices vi represent the observations.
Pairs of vertices are connected by an edge if their similarity is positive.
⇒ a graph-partition problem (we identify connected components with clusters).
Partition the graph so that edges between different groups have low weight and edges within a group have high weight.
Idea: construct similarity graphs that represent the local neighborhood relationships between the observations.
Mutual K-nearest-neighbor graph
NK ... a symmetric set of nearby pairs of points.
We connect all mutual nearest neighbors and give each edge the weight wii′ = sii′ (all other weights are zero).
We set to zero all pairwise similarities not in NK and draw the graph.
Unnormalized graph Laplacian

A fully connected graph includes all pairwise edges with weights wii′ = sii′.
Adjacency matrix ... the matrix of edge weights W = {wii′}.
G ... diagonal matrix with diagonal elements gi = Σ_{i′} wii′ (the sum of the weights of the edges connected to vertex i).
The unnormalized graph Laplacian is

L = G − W    (10)
Procedure
Spectral clustering finds the m eigenvectors corresponding to the m smallest eigenvalues of L.
Consider any vector f:

f^T L f = Σ_{i=1}^{N} gi fi² − Σ_{i=1}^{N} Σ_{i′=1}^{N} fi fi′ wii′    (11)

= ½ Σ_{i=1}^{N} Σ_{i′=1}^{N} wii′ (fi − fi′)²    (12)

We have a small value of f^T L f if pairs of points with large wii′ have coordinates fi and fi′ close together.
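Identity (12) is easy to check numerically on a random symmetric weight matrix; this is a verification sketch, not part of the slides' derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
# random symmetric non-negative weights with an empty diagonal
W = rng.uniform(size=(6, 6))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)

G = np.diag(W.sum(axis=1))          # g_i = sum_i' w_ii'
L = G - W                           # unnormalized graph Laplacian, eq. (10)

f = rng.normal(size=6)
lhs = f @ L @ f                     # f^T L f
rhs = 0.5 * ((f[:, None] - f[None, :]) ** 2 * W).sum()   # eq. (12)
assert np.isclose(lhs, rhs)
```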
Since L1 = 0, the constant vector 1 is a trivial eigenvector of L with eigenvalue zero (and 1^T L 1 = 0).
If the graph is connected, it is the only eigenvector with eigenvalue zero.
A graph with m connected components has a Laplacian L with m eigenvectors of eigenvalue zero (the indicator vectors of the components).
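Assembling the pieces gives the following sketch; the Gaussian similarity, the mutual-kNN rule, the farthest-point initialization and the simple k-means loop are all illustrative choices rather than the slides' prescription.

```python
import numpy as np

def spectral_clusters(X, m, k=10, sigma=1.0):
    """Mutual k-NN similarity graph -> unnormalized Laplacian -> rows of
    the m smallest eigenvectors -> simple k-means on those rows."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = np.exp(-d2 / (2 * sigma ** 2))            # pairwise similarities
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]       # k nearest neighbors
    A = np.zeros((n, n), dtype=bool)
    A[np.arange(n)[:, None], nn] = True
    W = np.where(A & A.T, S, 0.0)                 # keep mutual pairs only
    L = np.diag(W.sum(axis=1)) - W                # unnormalized Laplacian
    _, vecs = np.linalg.eigh(L)
    Z = vecs[:, :m]                               # m smallest eigenvectors
    # k-means on the rows of Z, farthest-point initialization
    idx = [0]
    for _ in range(m - 1):
        d = ((Z[:, None, :] - Z[idx][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    C = Z[idx]
    for _ in range(50):
        lab = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        C = np.stack([Z[lab == j].mean(axis=0) if (lab == j).any() else C[j]
                      for j in range(m)])
    return lab
```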
Figure: Toy example illustrating spectral clustering
Non-negative Matrix Factorization
An alternative approach to principal components analysis.
The data and the components are assumed to be non-negative.
Useful for modeling non-negative data such as images.
The N × p data matrix X is approximated by

X ≈ WH    (13)

W is N × r
H is r × p
We assume xij, wik, hkj ≥ 0.
The matrices W and H are found by maximizing

L(W, H) = Σ_{i=1}^{N} Σ_{j=1}^{p} [xij log(WH)ij − (WH)ij]    (14)

This is the log-likelihood from a model in which xij has a Poisson distribution with mean (WH)ij.
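A minimal sketch of maximizing (14) with the multiplicative updates of Lee and Seung; the initialization and iteration count are arbitrary choices.

```python
import numpy as np

def nmf(X, r, n_iter=200, seed=0):
    """Factor X (N x p, non-negative) as W H with W: N x r, H: r x p,
    using multiplicative updates that increase the Poisson
    log-likelihood L(W, H) of eq. (14)."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    W = rng.uniform(0.1, 1.0, (N, r))
    H = rng.uniform(0.1, 1.0, (r, p))
    for _ in range(n_iter):
        W *= ((X / (W @ H)) @ H.T) / H.sum(axis=1)            # update W
        H *= (W.T @ (X / (W @ H))) / W.sum(axis=0)[:, None]   # update H
    return W, H
```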
Thank you!