slides - SNAP - Stanford University
slides - SNAP - Stanford University
slides - SNAP - Stanford University
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
CS224W: Social and Information Network Analysis<br />
Jure Leskovec <strong>Stanford</strong> <strong>University</strong><br />
Jure Leskovec, <strong>Stanford</strong> <strong>University</strong><br />
http://cs224w.stanford.edu
Task:<br />
Find coalitions in signed networks<br />
Incentives:<br />
European chocolates!<br />
FFame<br />
Up to 10% extra credit<br />
DDue:<br />
Friday midnight<br />
No late days!<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2
Today: 3 methods<br />
(3) Trawling: Community signatures that can be efficiently extracted<br />
(4) SSpectral t lgraph h partitioning: titi i LLaplacian l i matrix, ti 22nd deigenvector i t<br />
(5) Overlapping communities: Clique percolation method<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 3
[Kumar et al. ‘99]<br />
Searching for small communities in Web graph<br />
(1) What is the signature of a<br />
community/discussion in a Web graph<br />
…<br />
…<br />
A dense 2‐layer graph<br />
Intuition: many people all talking about the same things<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu<br />
Use this to define topics:<br />
Wh What t th the same people l on th the<br />
left talk about on the right<br />
4
(2) A more well‐defined well defined problem:<br />
Enumerate complete bipartite subgraphs K s,t<br />
Where K Ks,t = s nodes where each links to the<br />
same t other nodes<br />
[Kumar et al. ‘99]<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 5
Two points:<br />
(1) The signature of a community/discussion<br />
(2) Complete bipartite subgraph K Ks,t Ks,t = graph on s nodes, each links to the same t other nodes<br />
Plan:<br />
(A) From (2) get back to (1):<br />
Via: Any dense enough graph contains a<br />
smaller K s,t as a subgraph<br />
(B) How do we solve (2) in a giant graph?<br />
[Kumar et al. ‘99]<br />
What similar problems have been solved on big non‐graph non graph data?<br />
(3) Frequent itemset enumeration [Agrawal‐Srikant ‘99]<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 6
Marketbasket analysis:<br />
What items are bought together in a store?<br />
Setting:<br />
Universe U of n items<br />
m subsets b t of f UU: S S1, S S2, …, S Sm U<br />
(Si is a set of items one person bought)<br />
Frequency threshold f<br />
Goal:<br />
Fi Find d all ll subsets b t T s.t. t T S Si of f f f sets t S Si (items in T were bought together f times)<br />
[Agrawa‐Srikant ‘99]<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 7
Example:<br />
Universe of items:<br />
U={1,2,3,4,5} {,,,,}<br />
Itemsets:<br />
S1={1,3,5}, S2={2,3,4}, S3={2,4,5}, S4={3,4,5}, S5={1,3,4,5}, S6={2,3,4,5} Minimum support:<br />
f=3<br />
Algorithm: Build up the lists<br />
Insight: for a frequent set of size k, all<br />
its subsets are also frequent<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 8
U={1,2,3,4,5}, U={12345} f=3<br />
S1={1,3,5}, S2={2,3,4}, S3={2,4,5}, S S4={3,4,5}, ={345} SS={1345} 5={1,3,4,5}, SS={2345} 6={2,3,4,5}<br />
[Agrawa‐Srikant ‘99]<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 9
For i =1 = 1,…,kk<br />
Find all frequent sets of size i by<br />
composing sets of size ii-1 1 that<br />
differ in 1 element<br />
Open question:<br />
Efficiently find only<br />
maximal frequent sets<br />
[Agrawa‐Srikant ‘99]<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 10
[Kumar et al. ‘99]<br />
Claim: (3) (itemsets) solves (2) (bipartite graphs)<br />
How?<br />
View each node i as a<br />
set S i of nodes i points to<br />
K Ks,t = a set t y of f size i t<br />
that occurs in s sets Si Looking for Ks,t set of<br />
frequency threshold to s<br />
and look at layer t – all<br />
frequent sets of size t<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 11
[Kumar et al. ‘99]<br />
(2) (1): Informally, Informally every dense enough<br />
graph G contains a bipartite Ks,t subgraph<br />
where s and t depend on size (# of nodes) and<br />
density (avg. degree) of G [Kovan‐Sos‐Turan ‘53]<br />
Theorem: Let G=(X,Y,E), |X|=|Y|= n with<br />
g g<br />
1/<br />
t 1<br />
1/<br />
t<br />
avg. degree: d <br />
s n <br />
t<br />
then G contains K stas a subgraph<br />
s,t g p<br />
[Will not prove it here. See online <strong>slides</strong>]<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 12
For the proof we will need the following fact<br />
Recall:<br />
a<br />
a(<br />
a 1)...(<br />
a b 1)<br />
<br />
b<br />
<br />
<br />
<br />
b!<br />
Let f(x) = x(x-1)(x-2)…(x-k)<br />
Once x k, f(x) curves upward (convex)<br />
Suppose a setting:<br />
g(y) is convex<br />
Want to minimize g(xi) where xi=x TTo minimize i i i g(x ( i) ) make k each h xi = x/n /<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 13
Node i, i degree d i and<br />
neighbor set Si Put node i in buckets<br />
for all size t subsets of<br />
its neighbors<br />
Potential right‐hand<br />
sides of Ks,t (i.e., all<br />
size t subsets of Si) 11/10/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis 14
Note: As soon as s nodes appear in a bucket<br />
we have a Ks,t How many buckets node i contributes?<br />
d i … degree of node i<br />
i<br />
What is the total size of all buckets?<br />
11/10/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis 15
So, So the total height of<br />
all buckets is…<br />
Plug in:<br />
d<br />
<br />
s<br />
1<br />
/ t 1 1<br />
1 / t<br />
n<br />
<br />
t<br />
a<br />
a(<br />
a 1)...(<br />
a b 1)<br />
<br />
b<br />
<br />
<br />
b<br />
b !<br />
11/10/2009 Jure Leskovec, <strong>Stanford</strong> CS322: Network Analysis 16
We have: Total height of all buckets<br />
How many buckets are there?<br />
n s<br />
t<br />
<br />
t!<br />
t<br />
n<br />
n<br />
<br />
t t!<br />
<br />
<br />
What is the average height of buckets?<br />
s t<br />
s<br />
t n<br />
t ! height s<br />
n t<br />
! So, avg. bucket<br />
So by pigeonhole principle, principle there must be at least<br />
one bucket with more than s nodes in it.<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 17
Theoretical result:<br />
[Kumar et al. ‘99]<br />
Complete bipartite subgraphs Ks,t are embedded in<br />
larger dense enough graphs (i (i.e., e the communities)<br />
i.e., biparite subgraphs as signatures of communities<br />
Algorithmic result:<br />
Frequent itemset extraction and dynamic<br />
programming<br />
p g g<br />
SCALABLE!!!<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 18
Undirected graph G(V,E): G(V E):<br />
Bi‐partitioning Bi partitioning task:<br />
Divide vertices into two disjoint groups (A,B)<br />
Questions:<br />
A 1<br />
5 B<br />
2<br />
3<br />
How can we define a “good” g ppartition<br />
of G?<br />
How can we efficiently identify such a partition?<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu<br />
19<br />
4<br />
2<br />
6<br />
1<br />
3<br />
4<br />
5<br />
6
What makes a good partition?<br />
Maximize the number of within‐group connections<br />
Mi Minimize i i th the number b of f bt between‐group<br />
connections<br />
2<br />
1<br />
3<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 20<br />
4<br />
5<br />
6
A<br />
Express partitioning objectives as a<br />
function of the “edge cut” of the partition<br />
CCut: SSet of f edges d with ihonly l one vertex in i a<br />
group:<br />
2<br />
1<br />
3<br />
4<br />
5<br />
6<br />
B<br />
cut(A,B) = 2<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu<br />
21
Criterion: Minimum Minimum‐cut cut<br />
Minimise weight of connections between groups<br />
min minABcut(A,B) A,B cut(A,B)<br />
Degenerate case:<br />
Problem:<br />
Optimal cut Minimum cut<br />
OOnly l considers id external t lcluster l t connections ti<br />
Does not consider internal cluster connectivity<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu<br />
22
Criterion: Normalized‐cut [ [Shi‐Malik, , ’97] ]<br />
Connectivity between groups relative to the density of<br />
each group<br />
Vol(A): The total weight of the edges originating from group AA.<br />
Why use this criterion?<br />
Produces more balanced partitions<br />
How do we efficiently find a good partition?<br />
Problem: Computing optimal cut is NP‐hard<br />
[Shi‐Malik]<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu<br />
23
A: adjacency matrix of undirected G<br />
Aij =1 if (i,j) is an edge, else 0<br />
x is a vector in n x is a vector in with components (x x )<br />
n with components (x1,…, xn) just a label/value of each node of G<br />
What is the meaning of AAx? x?<br />
Entry y j is a sum of labels x i of neighbors of j<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 24
jth j coordinate of Ax:<br />
Sum of the x‐values<br />
of neighbors of j<br />
Make this a new value at node j<br />
Spectral Graph Theory:<br />
Analyze the “spectrum” of matrix representing G<br />
Spectrum: Eigenvectors of a graph, graph ordered by the<br />
magnitude (strength) of their corresponding<br />
eigenvalues:<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 25
Suppose all nodes in G have degree d<br />
and G is connected<br />
What are some eigenvalues/vectors of G?<br />
Ax = x What is ? What x?<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 26
What if G is not connected?<br />
Say G has 2 components, each d‐regular<br />
What are some eigenvectors?<br />
x= Put all 1s on A and 0s on B or vice versa<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 27
Adjacency matrix (A):<br />
n n matrix<br />
A=[a A [aij] ij], a aij=1 ij 1 if edge between node i and j<br />
2<br />
1<br />
3<br />
Important properties:<br />
4<br />
5<br />
Symmetric matrix<br />
Eigenvectors are real and orthogonal<br />
6<br />
1 2 3 4 5 6<br />
1 0 1 1 0 1 0<br />
2 1 0 1 0 0 0<br />
3 1 1 0 1 0 0<br />
4 0 0 1 0 1 1<br />
5 1 0 0 1 0 1<br />
6 0 0 0 1 1 0<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu<br />
28
Degree matrix (D):<br />
n n diagonal matrix<br />
D D=[d [d ii], ] d dii = ddegree of f node d i<br />
2<br />
1<br />
3<br />
4<br />
5<br />
6<br />
1 2 3 4 5 6<br />
1 3 0 0 0 0 0<br />
2 0 2 0 0 0 0<br />
3 0 0 3 0 0 0<br />
4 0 0 0 3 0 0<br />
5 0 0 0 0 3 0<br />
6 0 0 0 0 0 2<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu<br />
29
Laplacian p matrix () (L):<br />
n n symmetric matrix<br />
2<br />
1<br />
3<br />
What is trivial eigenvector/<br />
eigenvalue?<br />
Important properties:<br />
4<br />
5<br />
6<br />
1 2 3 4 5 6<br />
1 3 ‐1 ‐1 0 ‐1 0<br />
2 ‐1 2 ‐1 0 0 0<br />
3 ‐1 ‐1 3 ‐1 0 0<br />
4 0 0 ‐1 3 ‐1 ‐1<br />
5 ‐1 0 0 ‐1 3 ‐1<br />
6 0 0 0 ‐1 ‐1 2<br />
L = D - A<br />
Eigenvalues are non‐negative non negative real numbers<br />
Eigenvectors are real and orthogonal<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu<br />
30
For symmetric matrix M:<br />
<br />
2<br />
min<br />
x<br />
T<br />
Mx<br />
x<br />
T<br />
Mx<br />
x x<br />
T <br />
What is the meaning of min xT What is the meaning of min x Lx on G?<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 31
What else do we know about x?<br />
x is unit vector<br />
i th l t 1st x is orthogonal to 1 i t (1 1) th<br />
st eigenvector (1,…,1) thus:<br />
Then:<br />
( x <br />
x<br />
min<br />
i j<br />
2 All lbli f<br />
x<br />
2<br />
i<br />
All labelings of<br />
nodes so that<br />
sum(x i)=0<br />
<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 32<br />
)<br />
2
Express p p partition ( (A,B) , ) as a vector<br />
We can minimize the cut of the partition p by y<br />
finding a non‐trivial vector x that minimizes:<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu<br />
33
The minimum value is given by the 2 nd<br />
smallest eigenvalue λ 2 of the Laplacian<br />
matrix L<br />
The optimal solution for x is given by the<br />
corresponding eigenvector λ2, referred as<br />
the Fiedler vector<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 34
How to define a “good” good partition of a graph?<br />
Minimise a given graph cut criterion<br />
How to efficiently identify such a partition?<br />
Approximate using information provided by the<br />
eigenvalues and eigenvectors of a graph<br />
Spectral l Clustering l i<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu<br />
35
Three basic stages:<br />
1. Pre‐processing<br />
Construct a matrix representation of the graph<br />
2. Decomposition<br />
Compute eigenvalues and eigenvectors of the matrix<br />
Map each point to a lower‐dimensional<br />
representation based on one or more eigenvectors<br />
3. Grouping<br />
Assign points to two or more clusters, based on the<br />
new representation<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu<br />
36
p g<br />
Pre‐processing:<br />
Build Laplacian<br />
matrix L of the<br />
graph g p<br />
p<br />
1 2 3 4 5 6<br />
1 3<br />
‐1 ‐1 0 ‐1 0<br />
2 ‐1 2 ‐1 0 0 0<br />
3 ‐1 ‐1 3 ‐1 0 0<br />
4 0 0 ‐1 3 ‐1 ‐1<br />
5 ‐1 0 0 ‐1 3 ‐1<br />
Decomposition: 3.0<br />
0.4 0.3 0.1 0.6 ‐0.4 0.5<br />
Find eigenvalues <br />
and eigenvectors x<br />
of the matrix L<br />
Map vertices to<br />
corresponding p g<br />
components of 2 0.0<br />
1.0<br />
= X =<br />
1<br />
2<br />
3<br />
4<br />
0.3<br />
3.0<br />
4.0<br />
5.0<br />
0.6<br />
0.3<br />
6 0 0 0 ‐1 ‐1 2<br />
0.4<br />
0.4<br />
0.4<br />
0.4<br />
0.4<br />
0.3<br />
0.6<br />
‐0.3<br />
‐0.3<br />
‐0.6<br />
‐0.5<br />
0.4<br />
0.1<br />
‐0.5<br />
‐0.4<br />
‐0.2<br />
‐0.4<br />
0.6<br />
‐0.2<br />
‐0.4<br />
‐0.4<br />
0.4<br />
0.4<br />
0.4<br />
‐0.4<br />
‐0.5<br />
0.0<br />
‐0.5<br />
‐0.5<br />
0.0<br />
How do we now<br />
ffind dthe h clusters? l<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 37<br />
5<br />
6<br />
‐0.3<br />
‐0.3<br />
‐0.6
Grouping:<br />
Sort components of reduced 1‐dimensional vector<br />
Identify clusters by splitting the sorted vector in two<br />
How to choose h a splitting l point?<br />
Naïve approaches:<br />
Split p at 0, mean or median value<br />
More expensive approaches:<br />
Attempt to minimise normalized cut criterion in 1‐dimension<br />
1<br />
2<br />
3<br />
4<br />
5<br />
6<br />
03 0.3<br />
0.6<br />
0.3<br />
‐0.33<br />
‐0.3<br />
‐0.6<br />
Split p at 0:<br />
Cluster A: Positive points<br />
Cluster B: Negative points<br />
1<br />
2<br />
3<br />
03 0.3<br />
0.6<br />
0.3<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu<br />
38<br />
4<br />
5<br />
6<br />
‐0.3 03<br />
‐0.3<br />
‐0.6<br />
A<br />
B
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 39
How do we partition a graph into k clusters?<br />
Two basic approaches:<br />
Recursive bi‐partitioning [Hagen et al., ’92]<br />
Recursively apply bi‐partitioning algorithm in a<br />
hierarchical divisive manner<br />
Disadvantages: Inefficient, unstable<br />
Cluster multiple p eigenvectors g [Shi‐Malik, [ , ’00] ]<br />
Build a reduced space from multiple eigenvectors<br />
Commonly used in recent papers<br />
A preferable f bl approach… h<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 40
k‐eigenvector k eigenvector Algorithm [Ng et al., al ’01] 01]<br />
Pre‐processing:<br />
Construct the scaled adjacency matrix A': A :<br />
1/<br />
2 1/<br />
2<br />
A'<br />
Decomposition:<br />
D AD<br />
Find the eigenvalues and eigenvectors of A'<br />
Build embedded space from the eigenvectors<br />
corresponding to the k largest eigenvalues<br />
Grouping:<br />
Apply k‐means to reduced nk space to get k clusters<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 41
Approximates pp the optimal p cut [ [Shi‐Malik, , ’00] ]<br />
Can be used to approximate the optimal k‐way normalized cut<br />
Emphasizes cohesive clusters<br />
Increases the h unevenness in i the h distribution di ib i of f the h data d<br />
Associations between similar points are amplified, associations<br />
between dissimilar points are attenuated<br />
Th The data dt bbegins i t to “ “approximate i t a clustering” l t i ”<br />
Well‐separated space<br />
Transforms data to a new “embedded embedded space”, space ,<br />
consisting of k orthogonal basis vectors<br />
NB: Multiple eigenvectors prevent instability due to<br />
information f loss l<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 42
Eigengap:<br />
The difference between two consecutive eigenvalues<br />
Most stable clustering is generally given by the<br />
value k that maximises eigengap:<br />
p<br />
Example:<br />
Eigenvalue<br />
50<br />
45<br />
40<br />
35<br />
30<br />
25<br />
20<br />
15<br />
10<br />
5<br />
0<br />
λ 1<br />
λ 2<br />
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20<br />
k<br />
<br />
k k k1<br />
max <br />
<br />
k<br />
Choose<br />
kk=22 11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 43<br />
2<br />
1
METIS:<br />
Heuristic but works really well in practice<br />
http://glaros.dtc.umn.edu/gkhome/views/metis<br />
Graclus:<br />
Based on kernel k‐means<br />
http://www.cs.utexas.edu/users/dml/Software/graclus.html<br />
Cluto:<br />
http://glaros.dtc.umn.edu/gkhome/views/cluto/<br />
p //g /g / / /<br />
Clique percorlation method:<br />
For finding overlapping clusters<br />
http://angel.elte.hu/cfinder/<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 44
Non‐overlapping Non overlapping vs. vs overlapping communities<br />
11/8/2010 Jure Leskovec, <strong>Stanford</strong> CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 45