25.12.2012 Views

Tutorial on ranking methods in machine learning - Computer ...

Tutorial on ranking methods in machine learning - Computer ...

Tutorial on ranking methods in machine learning - Computer ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Rank<strong>in</strong>g Methods <strong>in</strong> Mach<strong>in</strong>e Learn<strong>in</strong>g<br />

A <str<strong>on</strong>g>Tutorial</str<strong>on</strong>g> Introducti<strong>on</strong><br />

Shivani Agarwal<br />

<strong>Computer</strong> Science & Artificial Intelligence Laboratory<br />

Massachusetts Institute of Technology


Example 1: Recommendati<strong>on</strong> Systems


Example 2: Informati<strong>on</strong> Retrieval


Example 2: Informati<strong>on</strong> Retrieval<br />

<strong>in</strong>formati<strong>on</strong>


Example 2: Informati<strong>on</strong> Retrieval<br />

<strong>in</strong>formati<strong>on</strong>


Example 3: Drug Discovery<br />

Problem: Milli<strong>on</strong>s of structures <strong>in</strong> a chemical library.<br />

How do we identify the most promis<strong>in</strong>g <strong>on</strong>es?


Example 4: Bio<strong>in</strong>formatics<br />

Human genetics is now at a critical juncture.<br />

The molecular <strong>methods</strong> used successfully to<br />

identify the genes underly<strong>in</strong>g rare mendelian<br />

syndromes are fail<strong>in</strong>g to f<strong>in</strong>d the numerous<br />

genes caus<strong>in</strong>g more comm<strong>on</strong>, familial, n<strong>on</strong>mendelian<br />

diseases . . .<br />

Nature 405:847–856, 2000


Example 4: Bio<strong>in</strong>formatics<br />

With the human genome sequence near<strong>in</strong>g<br />

completi<strong>on</strong>, new opportunities are be<strong>in</strong>g<br />

presented for unravell<strong>in</strong>g the complex genetic<br />

basis of n<strong>on</strong>mendelian disorders based <strong>on</strong><br />

large-scale genomewide studies . . .<br />

Nature 405:847–856, 2000


Types of Rank<strong>in</strong>g Problems<br />

Instance Rank<strong>in</strong>g<br />

Label Rank<strong>in</strong>g<br />

Subset Rank<strong>in</strong>g<br />

Rank Aggregati<strong>on</strong><br />

?


Instance Rank<strong>in</strong>g<br />

> , 10<br />

doc1 doc2<br />

> , 20<br />

doc3 doc5<br />


doc1<br />

doc2<br />

…<br />

Label Rank<strong>in</strong>g<br />

sports > politics<br />

health > m<strong>on</strong>ey<br />

…<br />

science > sports<br />

m<strong>on</strong>ey > politics<br />


Subset Rank<strong>in</strong>g<br />

query 1 doc1 > doc2 , doc3 doc5<br />

…<br />

> …<br />

query 2 doc2 > doc4 , doc11 ><br />

doc3 …


…<br />

Rank Aggregati<strong>on</strong><br />

query 1 of search of search …<br />

query 2<br />

results<br />

eng<strong>in</strong>e 1<br />

results<br />

of search<br />

eng<strong>in</strong>e 1<br />

results<br />

eng<strong>in</strong>e 2<br />

results<br />

of search<br />

eng<strong>in</strong>e 2<br />

…<br />

desired<br />

<strong>rank<strong>in</strong>g</strong><br />

desired<br />

<strong>rank<strong>in</strong>g</strong>


Types of Rank<strong>in</strong>g Problems<br />

Instance Rank<strong>in</strong>g<br />

Label Rank<strong>in</strong>g<br />

Subset Rank<strong>in</strong>g<br />

Rank Aggregati<strong>on</strong><br />

?<br />

This tutorial


<str<strong>on</strong>g>Tutorial</str<strong>on</strong>g> Road Map<br />

Part I: Theory & Algorithms<br />

Bipartite Rank<strong>in</strong>g<br />

k-partite Rank<strong>in</strong>g<br />

Rank<strong>in</strong>g with Real-Valued Labels<br />

General Instance Rank<strong>in</strong>g<br />

Part II: Applicati<strong>on</strong>s<br />

Further Read<strong>in</strong>g & Resources<br />

RankSVM<br />

RankBoost<br />

RankNet<br />

Applicati<strong>on</strong>s to Bio<strong>in</strong>formatics<br />

Applicati<strong>on</strong>s to Drug Discovery<br />

Subset Rank<strong>in</strong>g and Applicati<strong>on</strong>s to Informati<strong>on</strong> Retrieval


Part I<br />

Theory & Algorithms<br />

[for Instance Rank<strong>in</strong>g]


Bipartite Rank<strong>in</strong>g<br />

Relevant (+) …<br />

doc1+ doc2+ doc3+<br />

Irrelevant (-) …<br />

doc1- doc2- doc3- doc4-


Bipartite Rank<strong>in</strong>g


Bipartite Rank<strong>in</strong>g


Is Bipartite Rank<strong>in</strong>g Different from<br />

B<strong>in</strong>ary Classificati<strong>on</strong>?


Is Bipartite Rank<strong>in</strong>g Different from<br />

B<strong>in</strong>ary Classificati<strong>on</strong>?


Is Bipartite Rank<strong>in</strong>g Different from<br />

B<strong>in</strong>ary Classificati<strong>on</strong>?


Is Bipartite Rank<strong>in</strong>g Different from<br />

B<strong>in</strong>ary Classificati<strong>on</strong>?


Is Bipartite Rank<strong>in</strong>g Different from<br />

B<strong>in</strong>ary Classificati<strong>on</strong>?


Is Bipartite Rank<strong>in</strong>g Different from<br />

B<strong>in</strong>ary Classificati<strong>on</strong>?


Bipartite Rank<strong>in</strong>g:<br />

Basic Algorithmic Framework


Bipartite RankSVM Algorithm<br />

[Herbrich et al, 2000; Joachims, 2002; Rakotomam<strong>on</strong>jy, 2004]


Bipartite RankSVM Algorithm


Bipartite RankBoost Algorithm<br />

[Freund et al, 2003]


Bipartite RankBoost Algorithm


Bipartite RankNet Algorithm<br />

[Burges et al, 2005]


k-partite Rank<strong>in</strong>g<br />

Rat<strong>in</strong>g k …<br />

doc1k doc2k doc3k …<br />

Rat<strong>in</strong>g 2 …<br />

doc12 doc22 doc32 doc42 Rat<strong>in</strong>g 1 …<br />

doc11 doc21 doc31 doc41


k-partite Rank<strong>in</strong>g


k-partite Rank<strong>in</strong>g:<br />

Basic Algorithmic Framework


Rank<strong>in</strong>g with Real-Valued Labels<br />

doc1<br />

doc2<br />

doc3<br />

…<br />

y1<br />

y2<br />

y3


Rank<strong>in</strong>g with Real-Valued Labels


Rank<strong>in</strong>g with Real-Valued Labels:<br />

Basic Algorithmic Framework


General Instance Rank<strong>in</strong>g<br />

><br />

doc1 doc1’<br />

, r1<br />

> , r2<br />

doc2 doc2’<br />


General Instance Rank<strong>in</strong>g<br />

r i


General Instance Rank<strong>in</strong>g


General Instance Rank<strong>in</strong>g:<br />

Basic Algorithmic Framework


General RankSVM Algorithm<br />

[Herbrich et al, 2000; Joachims, 2002]


General RankBoost Algorithm<br />

[Freund et al, 2003]


General RankNet Algorithm<br />

[Burges et al, 2005]


<str<strong>on</strong>g>Tutorial</str<strong>on</strong>g> Road Map<br />

Part I: Theory & Algorithms<br />

Bipartite Rank<strong>in</strong>g<br />

k-partite Rank<strong>in</strong>g<br />

Rank<strong>in</strong>g with Real-Valued Labels<br />

General Instance Rank<strong>in</strong>g<br />

Part II: Applicati<strong>on</strong>s<br />

Further Read<strong>in</strong>g & Resources<br />

RankSVM<br />

RankBoost<br />

RankNet<br />

Applicati<strong>on</strong>s to Bio<strong>in</strong>formatics<br />

Applicati<strong>on</strong>s to Drug Discovery<br />

Subset Rank<strong>in</strong>g and Applicati<strong>on</strong>s to Informati<strong>on</strong> Retrieval


Part II<br />

Applicati<strong>on</strong>s<br />

[and Subset Rank<strong>in</strong>g]


Applicati<strong>on</strong> to Bio<strong>in</strong>formatics<br />

Human genetics is now at a critical juncture.<br />

The molecular <strong>methods</strong> used successfully to<br />

identify the genes underly<strong>in</strong>g rare mendelian<br />

syndromes are fail<strong>in</strong>g to f<strong>in</strong>d the numerous<br />

genes caus<strong>in</strong>g more comm<strong>on</strong>, familial, n<strong>on</strong>mendelian<br />

diseases . . .<br />

Nature 405:847–856, 2000


Applicati<strong>on</strong> to Bio<strong>in</strong>formatics<br />

With the human genome sequence near<strong>in</strong>g<br />

completi<strong>on</strong>, new opportunities are be<strong>in</strong>g<br />

presented for unravell<strong>in</strong>g the complex genetic<br />

basis of n<strong>on</strong>mendelian disorders based <strong>on</strong><br />

large-scale genomewide studies . . .<br />

Nature 405:847–856, 2000


Identify<strong>in</strong>g Genes Relevant to a Disease<br />

Us<strong>in</strong>g Microarrray Gene Expressi<strong>on</strong> Data


Identify<strong>in</strong>g Genes Relevant to a Disease<br />

Us<strong>in</strong>g Microarrray Gene Expressi<strong>on</strong> Data


Identify<strong>in</strong>g Genes Relevant to a Disease<br />

Us<strong>in</strong>g Microarrray Gene Expressi<strong>on</strong> Data


Identify<strong>in</strong>g Genes Relevant to a Disease<br />

Us<strong>in</strong>g Microarrray Gene Expressi<strong>on</strong> Data


Identify<strong>in</strong>g Genes Relevant to a Disease<br />

Us<strong>in</strong>g Microarrray Gene Expressi<strong>on</strong> Data


Identify<strong>in</strong>g Genes Relevant to a Disease<br />

Us<strong>in</strong>g Microarrray Gene Expressi<strong>on</strong> Data


Formulati<strong>on</strong> as a Bipartite<br />

Relevant<br />

Not relevant<br />

Rank<strong>in</strong>g Problem<br />

…<br />


Microarray Gene Expressi<strong>on</strong> Data Sets<br />

[Golub et al, 1999; Al<strong>on</strong> et al, 1999]


Selecti<strong>on</strong> of Tra<strong>in</strong><strong>in</strong>g Genes


Top-Rank<strong>in</strong>g Genes for Leukemia<br />

Returned by RankBoost<br />

[Agarwal & Sengupta, 2009]


Biological Validati<strong>on</strong><br />

[Agarwal et al, 2010]


Applicati<strong>on</strong> to Drug Discovery<br />

Problem: Milli<strong>on</strong>s of structures <strong>in</strong> a chemical library.<br />

How do we identify the most promis<strong>in</strong>g <strong>on</strong>es?


Formulati<strong>on</strong> as a Rank<strong>in</strong>g Problem<br />

with Real-Valued Labels


Chem<strong>in</strong>formatics Data Sets<br />

[Sutherland et al, 2004]


DHFR Results Us<strong>in</strong>g RankSVM<br />

[Agarwal et al, 2010]


Applicati<strong>on</strong> to Informati<strong>on</strong> Retrieval (IR)


Learn<strong>in</strong>g to Rank <strong>in</strong> IR


Learn<strong>in</strong>g to Rank <strong>in</strong> IR


Learn<strong>in</strong>g to Rank <strong>in</strong> IR


Learn<strong>in</strong>g to Rank <strong>in</strong> IR


Learn<strong>in</strong>g to Rank <strong>in</strong> IR


Learn<strong>in</strong>g to Rank <strong>in</strong> IR


Learn<strong>in</strong>g to Rank <strong>in</strong> IR


Learn<strong>in</strong>g to Rank <strong>in</strong> IR


Learn<strong>in</strong>g to Rank <strong>in</strong> IR


Learn<strong>in</strong>g to Rank <strong>in</strong> IR


General Subset Rank<strong>in</strong>g<br />

query 1 doc1 > doc2 , doc3 doc5<br />

…<br />

> …<br />

query 2 doc2 > doc4 , doc11 ><br />

doc3 …


General Subset Rank<strong>in</strong>g


Subset Rank<strong>in</strong>g with Real-Valued<br />

Relevance Labels<br />

query 1 doc1 y1 , doc2 y2 , doc3 y3<br />

…<br />

query 2<br />

…<br />

doc1<br />

y1 , y2 , y3<br />

doc2 doc3<br />


Subset Rank<strong>in</strong>g with Real-Valued<br />

Relevance Labels


RankSVM Applied to IR/Subset Rank<strong>in</strong>g<br />

Standard RankSVM<br />

[Joachims, 2002]


RankSVM Applied to IR/Subset Rank<strong>in</strong>g<br />

RankSVM with Query Normalizati<strong>on</strong><br />

& Relevance Weight<strong>in</strong>g<br />

[Agarwal & Coll<strong>in</strong>s, 2010; also Cao et al, 2006]


Rank<strong>in</strong>g Performance Measures <strong>in</strong> IR<br />

Mean Average Precisi<strong>on</strong> (MAP)


Rank<strong>in</strong>g Performance Measures <strong>in</strong> IR<br />

Normalized Discounted Cumulative Ga<strong>in</strong> (NDCG)


Rank<strong>in</strong>g Algorithms for Optimiz<strong>in</strong>g<br />

MAP/NDCG


LETOR 3.0/OHSUMED Data Set<br />

[Liu et al, 2007]


OHSUMED Results – NDCG


OHSUMED Results – MAP


Further Read<strong>in</strong>g & Resources<br />

[Incomplete!]


Early Papers <strong>on</strong> Rank<strong>in</strong>g<br />

W. W. Cohen, R. E. Schapire, and Y. S<strong>in</strong>ger, Learn<strong>in</strong>g to order th<strong>in</strong>gs,<br />

Journal of Artificial Intelligence Research, 10:243–270, 1999.<br />

R. Herbrich, T. Graepel, and K. Obermayer, Large marg<strong>in</strong> rank boundaries<br />

for ord<strong>in</strong>al regressi<strong>on</strong>. Advances <strong>in</strong> Large Marg<strong>in</strong> Classifiers, 2000.<br />

T. Joachims, Optimiz<strong>in</strong>g search eng<strong>in</strong>es us<strong>in</strong>g clickthrough data, KDD 2002.<br />

Y. Freund, R. Iyer, R. E. Schapire, and Y. S<strong>in</strong>ger, An efficient boost<strong>in</strong>g<br />

algorithm for comb<strong>in</strong><strong>in</strong>g preferences. Journal of Mach<strong>in</strong>e Learn<strong>in</strong>g<br />

Research, 4:933–969, 2003.<br />

C.J.C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilt<strong>on</strong>,<br />

G. Hullender, Learn<strong>in</strong>g to rank us<strong>in</strong>g gradient descent, ICML 2005.


Generalizati<strong>on</strong> Bounds for Rank<strong>in</strong>g<br />

S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, D. Roth, Generalizati<strong>on</strong><br />

bounds for the area under the ROC curve, Journal of Mach<strong>in</strong>e Learn<strong>in</strong>g<br />

Research, 6:393—425, 2005.<br />

S. Agarwal, P. Niyogi, Generalizati<strong>on</strong> bounds for <strong>rank<strong>in</strong>g</strong> algorithms via<br />

algorithmic stability, Journal of Mach<strong>in</strong>e Learn<strong>in</strong>g Research, 10:441—474,<br />

2009.<br />

C. Rud<strong>in</strong>, R. Schapire, Marg<strong>in</strong>-based <strong>rank<strong>in</strong>g</strong> and an equivalence between<br />

AdaBoost and RankBoost, Journal of Mach<strong>in</strong>e Learn<strong>in</strong>g Research,<br />

10: 2193—2232, 2009


Bio<strong>in</strong>formatics/Drug Discovery<br />

Applicati<strong>on</strong>s<br />

S. Agarwal and S. Sengupta, Rank<strong>in</strong>g genes by relevance to a disease,<br />

CSB 2009.<br />

S. Agarwal, D. Dugar, and S. Sengupta, Rank<strong>in</strong>g chemical structures for<br />

drug discovery: A new mach<strong>in</strong>e learn<strong>in</strong>g approach. Journal of Chemical<br />

Informati<strong>on</strong> and Model<strong>in</strong>g, DOI 10.1021/ci9003865, 2010.


Natural Language Process<strong>in</strong>g<br />

Other Applicati<strong>on</strong>s<br />

M. Coll<strong>in</strong>s and T. Koo, Discrim<strong>in</strong>ative re<strong>rank<strong>in</strong>g</strong> for natural language<br />

pars<strong>in</strong>g, Computati<strong>on</strong>al L<strong>in</strong>guistics, 31:25—69, 2005.<br />

Collaborative Filter<strong>in</strong>g<br />

M. Weimer, A. Karatzoglou, Q. V. Le, and A. Smola, CofiRank - Maximum<br />

marg<strong>in</strong> matrix factorizati<strong>on</strong> for collaborative <strong>rank<strong>in</strong>g</strong>, NIPS 2007.<br />

Manhole Event Predicti<strong>on</strong><br />

C. Rud<strong>in</strong>, R. Pass<strong>on</strong>neau, A. Radeva, H. Dutta, S. Ierome, and D. Isaac , A<br />

process for predict<strong>in</strong>g manhole events <strong>in</strong> Manhattan, Mach<strong>in</strong>e Learn<strong>in</strong>g,<br />

DOI 10.1007/s10994-009-5166, 2010.


IR Rank<strong>in</strong>g Algorithms<br />

Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Hunag, and H.W. H<strong>on</strong>, Adapt<strong>in</strong>g <strong>rank<strong>in</strong>g</strong><br />

SVM to document retrieval, SIGIR 2006.<br />

C.J.C. Burges, R. Ragno, and Q.V. Le, Learn<strong>in</strong>g to rank with n<strong>on</strong>-smooth<br />

cost functi<strong>on</strong>s. NIPS 2006.<br />

J. Xu and H. Li, AdaRank: A boost<strong>in</strong>g algorithm for <strong>in</strong>formati<strong>on</strong> retrieval.<br />

SIGIR 2007.<br />

Y. Yue, T. F<strong>in</strong>ley, F. Radl<strong>in</strong>ski, and T. Joachims, A support vector method<br />

for optimiz<strong>in</strong>g average precisi<strong>on</strong>. SIGIR 2007.<br />

M. Taylor, J. Guiver, S. Roberts<strong>on</strong>, T. M<strong>in</strong>ka, Softrank: optimiz<strong>in</strong>g<br />

n<strong>on</strong>-smooth rank metrics. WSDM 2008.


IR Rank<strong>in</strong>g Algorithms<br />

S. Chakrabarti, R. Khanna, U. Sawant, and C. Bhattacharyya, Structured<br />

learn<strong>in</strong>g for n<strong>on</strong>smooth <strong>rank<strong>in</strong>g</strong> losses. KDD 2008.<br />

D. Cossock and T. Zhang, Statistical analysis of Bayes optimal subset<br />

<strong>rank<strong>in</strong>g</strong>, IEEE Transacti<strong>on</strong>s <strong>on</strong> Informati<strong>on</strong> Theory, 54:5140–5154, 2008.<br />

T. Q<strong>in</strong>, X.D. Zhang, M.F. Tsai, D.S. Wang, T.Y. Liu, and H. Li. Query-level<br />

loss functi<strong>on</strong>s for <strong>in</strong>formati<strong>on</strong> retrieval. Informati<strong>on</strong> Process<strong>in</strong>g and<br />

Management, 44:838–855, 2008.<br />

O. Chapelle and M. Wu, Gradient descent optimizati<strong>on</strong> of smoothed<br />

<strong>in</strong>formati<strong>on</strong> retrieval metrics. Informati<strong>on</strong> Retrieval (To appear), 2010.<br />

S. Agarwal and M. Coll<strong>in</strong>s, Maximum marg<strong>in</strong> <strong>rank<strong>in</strong>g</strong> algorithms for<br />

<strong>in</strong>formati<strong>on</strong> retrieval, ECIR 2010.


NIPS Workshop 2005<br />

Learn<strong>in</strong>g to Rank<br />

SIGIR Workshops 2007-2009<br />

Learn<strong>in</strong>g to Rank for Informati<strong>on</strong> Retrieval<br />

NIPS Workshop 2009<br />

Advances <strong>in</strong> Rank<strong>in</strong>g<br />

American Institute of Mathematics<br />

Workshop <strong>in</strong> Summer 2010<br />

The Mathematics of Rank<strong>in</strong>g


<str<strong>on</strong>g>Tutorial</str<strong>on</strong>g> Articles & Books<br />

Tie-Yan Liu, Learn<strong>in</strong>g to Rank for Informati<strong>on</strong> Retrieval,<br />

Foundati<strong>on</strong>s & Trends <strong>in</strong> Informati<strong>on</strong> Retrieval, 2009.<br />

Shivani Agarwal, A <str<strong>on</strong>g>Tutorial</str<strong>on</strong>g> Introducti<strong>on</strong> to Rank<strong>in</strong>g Methods<br />

<strong>in</strong> Mach<strong>in</strong>e Learn<strong>in</strong>g, In preparati<strong>on</strong>.<br />

Shivani Agarwal (Ed.), Advances <strong>in</strong> Rank<strong>in</strong>g Methods <strong>in</strong> Mach<strong>in</strong>e<br />

Learn<strong>in</strong>g, Spr<strong>in</strong>ger-Verlag, In preparati<strong>on</strong>.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!