08.06.2015 Views

Learning to Rank for Information Retrieval

Learning to Rank for Information Retrieval

Learning to Rank for Information Retrieval

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Learning</strong> <strong>to</strong> <strong>Rank</strong><br />

<strong>for</strong> In<strong>for</strong>mation <strong>Retrieval</strong><br />

Tie-Yan Liu<br />

Lead Researcher, Microsoft Research Asia


Outline<br />

• Introduction<br />

• Pointwise approach<br />

• Pairwise approach<br />

• Listwise approach<br />

• Analysis of the approaches<br />

• Benchmarking learning <strong>to</strong> rank methods<br />

• Summary and Outlook<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

2


Introduction


Overwhelmed by Flood of In<strong>for</strong>mation<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

4


Facts about the Web<br />

• According <strong>to</strong> www.worldwidewebsize.com, major<br />

search engines indexed at least tens of billions of<br />

web pages.<br />

• CUIL.com indexed more than 120 Billion web pages.<br />

4/20/2010 Tie-Yan Liu @ Renda 5


Facts about the Web<br />

• According <strong>to</strong> www.worldwidewebsize.com, major<br />

search engines indexed at least tens of billions of<br />

web pages.<br />

• CUIL.com indexed more than 120 Billion web pages.<br />

4/20/2010 Tie-Yan Liu @ Renda 6


<strong>Rank</strong>ing is Essential<br />

Indexed<br />

Document Reposi<strong>to</strong>ry<br />

D <br />

d j<br />

<strong>Rank</strong>ing<br />

model<br />

Query<br />

4/20/2010 Tie-Yan Liu @ Renda 7


Evaluation of <strong>Rank</strong>ing Results<br />

• Construct a test set containing a large number of<br />

(randomly sampled) queries and their associated<br />

documents, and relevance judgment <strong>for</strong> each querydocument<br />

pair.<br />

• Evaluate the ranking result <strong>for</strong> a particular query in<br />

the test set with an evaluation measure.<br />

• Use the average measure over the entire test set <strong>to</strong><br />

represent the overall ranking per<strong>for</strong>mance.<br />

4/20/2010 Tie-Yan Liu @ Renda 8


Collecting Queries and Documents<br />

• Queries are usually randomly sampled from<br />

query logs.<br />

• TREC’s Pooling strategy <strong>for</strong> sampling documents<br />

– A pool of possibly relevant documents is created by<br />

taking a sample of documents selected by the various<br />

participating systems.<br />

– Top 100 documents retrieved in each submitted run<br />

<strong>for</strong> a given query are selected and merged in<strong>to</strong> the<br />

pool <strong>for</strong> human assessment.<br />

– On average, an assessor judges the relevance of<br />

approximately 1500 documents per query.<br />

4/20/2010 Tie-Yan Liu @ Renda 9


Relevance Judgment<br />

• Degree of relevance<br />

– Binary: relevant vs. irrelevant<br />

– Multiple ordered categories: Perfect > Excellent > Good ><br />

Fair > Bad<br />

• Pairwise preference<br />

– Document A is more relevant than<br />

document B<br />

• Total order<br />

l<br />

• Documents are ranked as {A,B,C,..}<br />

according <strong>to</strong> their relevance<br />

l k<br />

l<br />

u , v<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

10


Evaluation Measure - MAP<br />

• Precision at position k <strong>for</strong> query q:<br />

#{relevant documents in <strong>to</strong>p k results}<br />

P@<br />

k <br />

k<br />

• Average precision <strong>for</strong> query q:<br />

P@<br />

k l k<br />

k<br />

AP <br />

#{relevant documents}<br />

AP <br />

• MAP: averaged over all queries.<br />

1<br />

3<br />

1<br />

<br />

<br />

1<br />

2<br />

3<br />

<br />

3<br />

5<br />

<br />

<br />

<br />

0.76<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

11


Evaluation Measure - NDCG<br />

• NDCG at position n <strong>for</strong> query q:<br />

NDCG@<br />

k<br />

<br />

Z<br />

k<br />

k<br />

<br />

j1<br />

<br />

G <br />

1<br />

(<br />

<br />

j)<br />

(<br />

j)<br />

Normalization<br />

Cumulating<br />

Gain<br />

Position discount<br />

• Averaged over all queries.<br />

G<br />

<br />

1<br />

<br />

l<br />

<br />

<br />

( ) 2<br />

1 ( j )<br />

j 1,<br />

(<br />

j)<br />

1/ log( j 1)<br />

The document ranked at<br />

the j-th position in π.<br />

1<br />

( j)<br />

( j)<br />

The rank position of<br />

document j in π.<br />

4/20/2010 Tie-Yan Liu @ Renda 12


Evaluation Measure - Summary<br />

• Query-level: every query contributes equally <strong>to</strong> the<br />

measure.<br />

– Computed on documents associated with the same query.<br />

– Bounded <strong>for</strong> each query.<br />

– Averaged over all test queries.<br />

• Position-based: rank position is explicitly used.<br />

– Top-ranked objects are more important.<br />

– Relative order vs. relevance score of each document.<br />

– Non-continuous and non-differentiable w.r.t. scores<br />

4/20/2010 Tie-Yan Liu @ Renda 13


Conventional <strong>Rank</strong>ing Models<br />

• Query-dependent<br />

– Boolean model, extended Boolean model, etc.<br />

– Vec<strong>to</strong>r space model, latent semantic indexing (LSI), etc.<br />

– BM25 model, statistical language model, etc.<br />

– Span based model, distance aggregation model, etc.<br />

• Query-independent<br />

– Page<strong>Rank</strong>, Trust<strong>Rank</strong>, Browse<strong>Rank</strong>, Toolbar Clicks, etc.<br />

4/20/2010 Tie-Yan Liu @ Renda 14


Problems with Conventional Models<br />

• Manual parameter tuning is usually difficult,<br />

especially when there are many parameters and the<br />

evaluation measures are non-smooth.<br />

• Manual parameter tuning sometimes leads <strong>to</strong> overfitting.<br />

• It is non-trivial <strong>to</strong> combine the large number of<br />

models proposed in the literature <strong>to</strong> obtain an even<br />

more effective model.<br />

4/20/2010 Tie-Yan Liu @ Renda 15


Machine <strong>Learning</strong> Can Help<br />

• Machine learning is an effective <strong>to</strong>ol<br />

– To au<strong>to</strong>matically tune parameters.<br />

– To combine multiple evidences.<br />

– To avoid over-fitting (by means of regularization, etc.)<br />

• “<strong>Learning</strong> <strong>to</strong> <strong>Rank</strong>”<br />

– In general, those methods that use<br />

machine learning technologies <strong>to</strong> solve<br />

the problem of ranking can be named as<br />

“learning <strong>to</strong> rank” methods.<br />

4/20/2010 Tie-Yan Liu @ Renda 16


<strong>Learning</strong> <strong>to</strong> <strong>Rank</strong><br />

• In most recent works, learning <strong>to</strong> rank is<br />

defined as having the following two properties:<br />

– Feature based;<br />

– Discriminative training.<br />

4/20/2010 Tie-Yan Liu @ Renda 17


Feature Based<br />

• Documents represented by feature vec<strong>to</strong>rs.<br />

– Features are extracted <strong>for</strong> each query-document pair.<br />

– Even if a feature is the output of an existing retrieval<br />

model, one assumes that the parameter in the model is<br />

fixed, and only learns the optimal way of combining these<br />

features.<br />

• The capability of combining a large number of<br />

features is very promising.<br />

– It can easily incorporate any new progress on retrieval<br />

model, by including the output of the model as a feature.<br />

4/20/2010 Tie-Yan Liu @ Renda 18


Discriminative Training<br />

• An au<strong>to</strong>matic learning process based on the<br />

training data,<br />

– With the four pillars of • Contains discriminative the objects learning under<br />

• Input space,<br />

• Output space,<br />

• Hypothesis space,<br />

• Loss function.<br />

investigation.<br />

• Usually objects are<br />

represented by feature vec<strong>to</strong>rs,<br />

extracted according <strong>to</strong> different<br />

applications<br />

– Not even necessary <strong>to</strong> have a probabilistic<br />

explanation.<br />

4/20/2010 Tie-Yan Liu @ Renda 19


Discriminative Training<br />

• An au<strong>to</strong>matic learning process based on the<br />

training data,<br />

– With the four pillars of discriminative learning<br />

• Input space,<br />

• Output space,<br />

• Hypothesis space,<br />

• Loss function.<br />

• There are two related but different<br />

definitions of the output space.<br />

- Output space of the task.<br />

- Output space <strong>to</strong> facilitate learning.<br />

• Example: using regression <strong>to</strong> solve<br />

classification<br />

- Output space of the task: {+1, -1}<br />

- Output space <strong>to</strong> facilitate learning:<br />

– Not even necessary <strong>to</strong> have a probabilistic<br />

real numbers<br />

explanation.<br />

4/20/2010 Tie-Yan Liu @ Renda 20


Discriminative Training<br />

• An au<strong>to</strong>matic learning process based on the<br />

training data,<br />

– With the four pillars of discriminative learning<br />

• Input space,<br />

• Output space,<br />

• Hypothesis space,<br />

• Loss function.<br />

• Defines the class of functions<br />

mapping the input space <strong>to</strong> the output<br />

space.<br />

• The functions operate on the feature<br />

vec<strong>to</strong>rs of the input objects, and make<br />

predictions according <strong>to</strong> the <strong>for</strong>mat of<br />

– Not even necessary <strong>to</strong> the have output a probabilistic<br />

space.<br />

explanation.<br />

4/20/2010 Tie-Yan Liu @ Renda 21


Discriminative Training<br />

• An au<strong>to</strong>matic learning process based on the<br />

training data,<br />

– With the four pillars of discriminative learning<br />

• Input space,<br />

• Output space,<br />

• Hypothesis space,<br />

• Loss function.<br />

• The loss function measures <strong>to</strong> what<br />

degree the prediction generated by the<br />

hypothesis is in accordance with the<br />

ground truth label.<br />

- For example, widely used loss<br />

functions <strong>for</strong> classification include the<br />

exponential loss, the hinge loss, and<br />

the logistic loss.<br />

- With the loss function, an empirical<br />

risk can be defined on the training set,<br />

– Not even necessary <strong>to</strong> have a probabilistic<br />

explanation.<br />

and the optimal hypothesis is usually<br />

(but not always) learned by means of<br />

empirical risk minimization.<br />

4/20/2010 Tie-Yan Liu @ Renda 22


Discriminative Training<br />

• An au<strong>to</strong>matic learning process based on the<br />

training data,<br />

• This is highly demanding <strong>for</strong> real search<br />

engines,<br />

– Everyday these search engines will receive a lot of<br />

user feedback and usage logs<br />

– It is very important <strong>to</strong> au<strong>to</strong>matically learn from<br />

the feedback and constantly improve the ranking<br />

mechanism.<br />

4/20/2010 Tie-Yan Liu @ Renda 23


<strong>Learning</strong> <strong>to</strong> <strong>Rank</strong><br />

• In most recent works, learning <strong>to</strong> rank is<br />

defined as having the following two properties:<br />

– Feature based;<br />

– Discriminative training.<br />

• Au<strong>to</strong>matic parameter tuning <strong>for</strong> existing<br />

ranking models, and learning-based<br />

generative ranking models do not belong <strong>to</strong><br />

the current definition of “learning <strong>to</strong> rank.”<br />

4/20/2010 Tie-Yan Liu @ Renda 24


<strong>Learning</strong> <strong>to</strong> <strong>Rank</strong> Framework<br />

4/20/2010 Tie-Yan Liu @ Renda 25


<strong>Learning</strong> <strong>to</strong> <strong>Rank</strong> Algorithms


<strong>Learning</strong> <strong>to</strong> <strong>Rank</strong> Algorithms<br />

Least Square <strong>Retrieval</strong> Function Query refinement (WWW 2008)<br />

(TOIS 1989)<br />

SVM-MAP (SIGIR 2007) Nested <strong>Rank</strong>er (SIGIR 2006)<br />

ListNet (ICML 2007)<br />

Pranking (NIPS 2002)<br />

Lambda<strong>Rank</strong> (NIPS 2006) MP<strong>Rank</strong> (ICML 2007)<br />

Frank (SIGIR 2007)<br />

MHR (SIGIR 2007) <strong>Rank</strong>Boost (JMLR 2003)<br />

<strong>Learning</strong> <strong>to</strong> retrieval info (SCC 1995)<br />

LDM (SIGIR 2005)<br />

Large margin ranker (NIPS 2002)<br />

<strong>Rank</strong>Net (ICML 2005)<br />

<strong>Rank</strong>ing SVM (ICANN 1999) IRSVM (SIGIR 2006)<br />

Discriminative model <strong>for</strong> IR (SIGIR 2004) SVM Structure (JMLR 2005)<br />

OAP-BPM (ICML 2003)<br />

Subset <strong>Rank</strong>ing (COLT 2006)<br />

GP<strong>Rank</strong> (LR4IR 2007) QB<strong>Rank</strong> (NIPS 2007) GB<strong>Rank</strong> (SIGIR 2007)<br />

Constraint Ordinal Regression (ICML 2005) Mc<strong>Rank</strong> (NIPS 2007) Soft<strong>Rank</strong> (LR4IR 2007)<br />

Ada<strong>Rank</strong> (SIGIR 2007) CCA (SIGIR 2007) ListMLE (ICML 2008)<br />

<strong>Rank</strong>Cosine (IP&M 2007) Supervised <strong>Rank</strong> Aggregation (WWW 2007)<br />

Relational ranking (WWW 2008)<br />

<strong>Learning</strong> <strong>to</strong> order things (NIPS 1998)<br />

Round robin ranking (ECML 2003)<br />

4/20/2010 Tie-Yan Liu @ Renda 27


Key Questions <strong>to</strong> Answer<br />

• Q1: To what respect are these learning <strong>to</strong> rank algorithms<br />

similar and in which aspects do they differ? What are the<br />

strengths and weaknesses of each algorithm?<br />

• Q2: Empirically speaking, which of those many learning <strong>to</strong><br />

rank algorithms per<strong>for</strong>m the best?<br />

• Q3: Are there many remaining issues regarding learning <strong>to</strong><br />

rank <strong>to</strong> study in the future?<br />

4/20/2010 Tie-Yan Liu @ Renda 28


Categorization of the Algorithms<br />

Category<br />

Pointwise<br />

Approach<br />

Pairwise<br />

Approach<br />

Listwise<br />

Approach<br />

Algorithms<br />

Regression: Least Square <strong>Retrieval</strong> Function (TOIS 1989), Regression Tree <strong>for</strong> Ordinal<br />

Class Prediction (Fundamenta In<strong>for</strong>maticae, 2000), Subset <strong>Rank</strong>ing using Regression<br />

(COLT 2006), …<br />

Classification: Discriminative model <strong>for</strong> IR (SIGIR 2004), Mc<strong>Rank</strong> (NIPS 2007), …<br />

Ordinal regression: Pranking (NIPS 2002), OAP-BPM (EMCL 2003), <strong>Rank</strong>ing with<br />

Large Margin Principles (NIPS 2002), Constraint Ordinal Regression (ICML 2005), …<br />

<strong>Learning</strong> <strong>to</strong> Retrieve In<strong>for</strong>mation (SCC 1995), <strong>Learning</strong> <strong>to</strong> Order Things (NIPS 1998),<br />

<strong>Rank</strong>ing SVM (ICANN 1999), <strong>Rank</strong>Boost (JMLR 2003), LDM (SIGIR 2005), <strong>Rank</strong>Net<br />

(ICML 2005), Frank (SIGIR 2007), MHR(SIGIR 2007), GB<strong>Rank</strong> (SIGIR 2007), QB<strong>Rank</strong><br />

(NIPS 2007), MP<strong>Rank</strong> (ICML 2007), IRSVM (SIGIR 2006), Lambda<strong>Rank</strong> (NIPS 2006),…<br />

Non-measure specific loss minimization: ListNet (ICML 2007), ListMLE (ICML 2008),<br />

…<br />

Measure specific loss minimizai<strong>to</strong>n: Ada<strong>Rank</strong> (SIGIR 2007), SVM-MAP (SIGIR 2007),<br />

Soft<strong>Rank</strong> (LR4IR 2007), GP<strong>Rank</strong> (LR4IR 2007), CCA (SIGIR 2007), …<br />

4/20/2010 Tie-Yan Liu @ Renda 29


The Poinitwise Approach


The Pointwise Approach<br />

The Pointwise Approach<br />

Regression Classification Ordinal Regression<br />

Input Space<br />

Single documents x j<br />

Output Space Real values y j<br />

Non-ordered<br />

Categories y j<br />

Hypothesis Scoring function f(x) Classifier sgn.f(x)<br />

Loss Function<br />

Regression loss<br />

Classification loss<br />

L(<br />

f ; x j<br />

, y<br />

j)<br />

Ordinal categories y j<br />

Scoring function +<br />

thresholding<br />

Ordinal regression<br />

loss<br />

4/20/2010 Tie-Yan Liu @ Renda 31


The Pointwise Approach<br />

• Reduce ranking <strong>to</strong><br />

– Regression<br />

• Subset <strong>Rank</strong>ing<br />

– Classification<br />

• Discriminative model <strong>for</strong> IR<br />

• MC<strong>Rank</strong><br />

– Ordinal regression<br />

• P<strong>Rank</strong>ing<br />

• <strong>Rank</strong>ing with large margin principle<br />

<br />

<br />

<br />

<br />

<br />

<br />

<br />

<br />

x<br />

1<br />

<br />

<br />

<br />

<br />

<br />

<br />

<br />

x2<br />

q <br />

<br />

x m<br />

( x , y ),( x 1 2,<br />

y 2),...,(<br />

x m,<br />

y<br />

1 m<br />

)<br />

<br />

4/20/2010 Tie-Yan Liu @ Renda 32


Subset <strong>Rank</strong>ing using Regression<br />

(D. Cossock and T. Zhang, COLT 2006)<br />

• Regard relevance degree as real number, and use<br />

regression <strong>to</strong> learn the ranking function.<br />

f<br />

( x ) y 2<br />

L( f ; x<br />

j,<br />

y<br />

j)<br />

<br />

j j<br />

Regression-based<br />

4/20/2010 Tie-Yan Liu @ Renda 33


Discriminative Model <strong>for</strong> IR<br />

(R. Nallapati, SIGIR 2004)<br />

• Classification is used <strong>to</strong> learn the ranking function<br />

– Treat relevant documents as positive examples (y j =+1),<br />

while irrelevant documents as negative examples (y j =-1).<br />

– f ; x j<br />

, y ) is surrogate loss of I<br />

L (<br />

j<br />

{sgn<br />

f ( x j ) y<br />

}<br />

• Different algorithms optimize different surrogate<br />

losses.<br />

– e.g., Support Vec<strong>to</strong>r Machines (hinge loss)<br />

j<br />

Classification-based<br />

4/20/2010 Tie-Yan Liu @ Renda 34


Mc<strong>Rank</strong><br />

(P. Li, C. Burges, et al. NIPS 2007)<br />

• Multi-class classification is used <strong>to</strong> learn the ranking<br />

function.<br />

– For document x , the output of the classifier is ŷ<br />

j.<br />

– Loss function: surrogate function of<br />

• <strong>Rank</strong>ing is produced by combining the outputs of the<br />

classifiers.<br />

pˆ<br />

j<br />

j<br />

<br />

, k<br />

P(<br />

yˆ<br />

j<br />

k),<br />

f ( x<br />

j)<br />

pˆ<br />

K<br />

k1<br />

I<br />

yˆ<br />

j y<br />

{ j<br />

j,<br />

k<br />

k<br />

}<br />

Classification-based<br />

4/20/2010 Tie-Yan Liu @ Renda 35


Pranking with <strong>Rank</strong>ing<br />

(K. Krammer and Y. Singer, NIPS 2002)<br />

• <strong>Rank</strong>ing is solved by ordinal regression<br />

x<br />

<br />

w<br />

x<br />

Ordinal regression based<br />

4/20/2010 Tie-Yan Liu @ Renda 36


<strong>Rank</strong>ing with Large Margin Principles<br />

(A. Shashua and A. Levin, NIPS 2002)<br />

• Fixed margin strategy<br />

– Margin:<br />

2 w<br />

2<br />

– Constraints: every document is correctly placed in its<br />

corresponding category.<br />

if y j<br />

2<br />

f x )<br />

( j<br />

b 1 b 2<br />

Category 1 Category 2<br />

Category 3<br />

<br />

Ordinal regression based<br />

4/20/2010 Tie-Yan Liu @ Renda 37


<strong>Rank</strong>ing with Large Margin Principles<br />

• Sum of margins strategy<br />

– Margin:<br />

<br />

k<br />

(b )<br />

k<br />

a k<br />

– Constraints: every document is correctly placed in its<br />

corresponding category<br />

f x )<br />

( j<br />

if y j<br />

2<br />

4/20/2010 Tie-Yan Liu @ Renda 38<br />

<br />

Ordinal regression based


Problem with Pointwise Approach<br />

• Properties of IR evaluation measures have not been<br />

well considered.<br />

– The fact is ignored that some documents are associated<br />

with the same query and some are not.<br />

• When the number of associated documents varies largely <strong>for</strong><br />

different queries, the overall loss function will be dominated by<br />

those queries with a large number of documents.<br />

– The position of documents in the ranked list is invisible <strong>to</strong><br />

the loss functions.<br />

• The pointwise loss function may unconsciously emphasize <strong>to</strong>o<br />

much those unimportant documents (which are ranked low in the<br />

final results).<br />

4/20/2010 Tie-Yan Liu @ Renda 39


The Pairwise Approach


The Pairwise Approach<br />

The Pairwise Approach<br />

Input Space Document pairs (x u ,x v )<br />

Output Space<br />

Preference<br />

y u<br />

, v<br />

{<br />

1,<br />

1}<br />

Hypothesis Space<br />

Preference function<br />

h( xu,<br />

xv<br />

) 2<br />

I<br />

f ( x ) f ( x )}<br />

<br />

{ <br />

<br />

u v<br />

1<br />

Loss Function<br />

Pairwise classification loss<br />

L(<br />

h;<br />

xu,<br />

xv,<br />

yu,v)<br />

4/20/2010 Tie-Yan Liu @ Renda 41


The Pairwise Approach<br />

• Reduce ranking <strong>to</strong> pairwise classification<br />

– <strong>Learning</strong> <strong>to</strong> order things<br />

– <strong>Rank</strong>Net and Frank<br />

– <strong>Rank</strong>Boost<br />

– <strong>Rank</strong>ing SVM<br />

– MHR<br />

– IR-SVM<br />

<br />

<br />

<br />

<br />

<br />

<br />

<br />

x<br />

x<br />

<br />

1<br />

2<br />

x m<br />

<br />

<br />

<br />

<br />

<br />

<br />

<br />

<br />

x<br />

<br />

1<br />

, x2,<br />

1 , x2,<br />

x1<br />

, 1 ,...,<br />

<br />

<br />

<br />

x<br />

q <br />

2<br />

, x<br />

m<br />

, 1,<br />

x<br />

m<br />

, x<br />

2<br />

, 1<br />

4/20/2010 Tie-Yan Liu @ Renda 42


<strong>Learning</strong> <strong>to</strong> Order Things<br />

(W. Cohen, R. Schapire, et al. NIPS 1998)<br />

• Pairwise loss function<br />

L(<br />

h;<br />

x<br />

, x<br />

• Pairwise ranking function<br />

– Weighted majority algorithm is used <strong>to</strong> learn the parameters w.<br />

– From preference <strong>to</strong> <strong>to</strong>tal order:<br />

u<br />

v<br />

,<br />

y<br />

<br />

u,<br />

v<br />

)<br />

<br />

y<br />

u,<br />

v<br />

h( xu<br />

, xv<br />

) wt<br />

ht<br />

( xu<br />

, xv<br />

)<br />

<br />

max <br />

( x 1<br />

, x 1<br />

( u)<br />

( v )<br />

uv<br />

h )<br />

h(<br />

x<br />

2<br />

u<br />

, x<br />

v<br />

)<br />

4/20/2010 Tie-Yan Liu @ Renda 43


• Target probability:<br />

Pu<br />

v<br />

, if yu,<br />

v<br />

1;<br />

• Modeled probability:<br />

P<br />

u,<br />

v<br />

( f<br />

<strong>Rank</strong>Net<br />

(C. Burges, T. Shaked, et al. ICML 2005)<br />

,<br />

1 Pu<br />

, v<br />

<br />

exp( f ( xu<br />

) f<br />

) <br />

1<br />

exp( f ( x ) <br />

( xv<br />

))<br />

f ( x ))<br />

• Cross entropy as the loss function<br />

l(<br />

f ; x<br />

P<br />

u<br />

u,<br />

v<br />

, x<br />

v<br />

,<br />

log<br />

y<br />

P<br />

u,<br />

v<br />

u,<br />

v<br />

)<br />

(<br />

u<br />

• Use neural network as model, and gradient descent as<br />

algorithm, <strong>to</strong> optimize the cross-entropy loss.<br />

v<br />

f ) (1 P<br />

0,otherwise<br />

u,<br />

v<br />

)log(1 P<br />

u,<br />

v<br />

(<br />

f<br />

))<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

44


F<strong>Rank</strong><br />

(M. Tsai, T.-Y. Liu, et al. SIGIR 2007)<br />

• Similar setting <strong>to</strong> that of <strong>Rank</strong>Net<br />

• New loss function: fidelity<br />

L(<br />

f<br />

; x<br />

u<br />

, x<br />

v<br />

,<br />

y<br />

u,<br />

v<br />

)<br />

1<br />

P<br />

u,<br />

v<br />

P<br />

u,<br />

v<br />

(<br />

f<br />

)<br />

<br />

(1 <br />

P<br />

u,<br />

v<br />

) (1<br />

<br />

P<br />

u,<br />

v<br />

(<br />

f<br />

))<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

45


• Exponential loss<br />

<strong>Rank</strong>Boost<br />

(Y. Freund, R. Iyer, et al. JMLR 2003)<br />

L( f ; xu,<br />

xv,<br />

yu<br />

v)<br />

exp( yu,<br />

(<br />

f ( x<br />

) <br />

, v u v<br />

Weak learner that can<br />

classify important pairs<br />

more f ( correctly x ))) will have<br />

larger weight α<br />

Hypothesis is the<br />

combination of weak<br />

learners<br />

Emphasize those hard<br />

pairs by setting their<br />

distributions.<br />

4/20/2010 Tie-Yan Liu @ Renda 46


<strong>Rank</strong>ing SVM<br />

(R. Herbrich, T. Graepel, et al. , Advances in Large Margin Classifiers, 2000;<br />

T. Joachims, KDD 2002)<br />

• Hinge loss<br />

<br />

L(<br />

f ; xu,<br />

xv,<br />

yu, v)<br />

1<br />

yu,<br />

v<br />

f ( xu<br />

) f ( xv)<br />

<br />

<br />

<br />

min<br />

w<br />

<br />

T<br />

( i)<br />

uv<br />

<br />

1<br />

2<br />

x<br />

( i)<br />

u<br />

|| w ||<br />

<br />

0, i<br />

x<br />

2<br />

( i)<br />

v<br />

<br />

(<br />

1<br />

1,...,<br />

n.<br />

n<br />

( i)<br />

u,<br />

v<br />

i1 ( )<br />

u,<br />

v:<br />

y , 1<br />

C <br />

i)<br />

u,<br />

v<br />

u<br />

i<br />

v<br />

,if<br />

y<br />

( i)<br />

u,<br />

v<br />

1.<br />

x u -x v as positive instance of<br />

learning<br />

Use SVM <strong>to</strong> per<strong>for</strong>m binary<br />

classification on these instances,<br />

<strong>to</strong> learn model parameter w<br />

4/20/2010 Tie-Yan Liu @ Renda 47


Extensions of the Pairwise Approach<br />

• When constructing document pairs, category<br />

in<strong>for</strong>mation has been lost: different pairs of labels<br />

are treated identically.<br />

– Can we maintain more in<strong>for</strong>mation about ordered<br />

category?<br />

– Can we use different learner <strong>for</strong> different kinds of pairs?<br />

• Solutions<br />

– Multi-hyperplane ranker (MHR)<br />

4/20/2010 Tie-Yan Liu @ Renda 48


Multi-Hyperplane <strong>Rank</strong>er<br />

(T. Qin, T.-Y. Liu, et al. SIGIR 2007)<br />

• If we have K categories, then will have K(K-1)/2 pairwise<br />

preferences between two labels.<br />

– Learn a model f k,l <strong>for</strong> the pair of categories k and l, using any previous<br />

method (<strong>for</strong> example <strong>Rank</strong>ing SVM).<br />

• Testing<br />

– <strong>Rank</strong> Aggregation<br />

f ( x)<br />

k<br />

k,<br />

l<br />

, l<br />

fk,<br />

l<br />

( x)<br />

4/20/2010 Tie-Yan Liu @ Renda 49


Improvement of Pairwise Approach<br />

• Advantage<br />

– Predicting relative order is closer <strong>to</strong> the nature of ranking<br />

than predicting class label or relevance score.<br />

• Problem<br />

– The distribution of<br />

document pair number is<br />

more skewed than the<br />

distribution of document<br />

number, with respect <strong>to</strong><br />

different queries.<br />

– Extreme case: m 2 vs. m<br />

4/20/2010 Tie-Yan Liu @ Renda 50


IR-SVM<br />

• Introduce query-level normalization <strong>to</strong> the pairwise loss<br />

function.<br />

– Use the number of document pairs <strong>to</strong> normalize the pairwise loss.<br />

• IRSVM (Y. Cao, J. Xu, et al. SIGIR 2006; T. Qin, T.-Y. Liu, et al. IPM 2007)<br />

min<br />

1<br />

2<br />

|| w ||<br />

2<br />

C<br />

N<br />

<br />

1<br />

m~<br />

<br />

( i)<br />

i1 u,<br />

v<br />

Query-level normalizer<br />

<br />

( i)<br />

uv<br />

Significant improvement has been observed by using the query-level normalizer.<br />

4/20/2010 Tie-Yan Liu @ Renda 51


The Listwise Approach


The Listwise Approach<br />

The Listwise Approach<br />

Input Space<br />

Measure-specific<br />

Document set<br />

x {<br />

<br />

Non-measure specific<br />

m<br />

x<br />

j}<br />

j 1<br />

m<br />

Output Space Ordered category y { y<br />

j}<br />

j1<br />

Permutation<br />

y<br />

Hypothesis Space<br />

h( x)<br />

f ( x)<br />

h( x)<br />

sort<br />

f ( x)<br />

Loss Function 1-surrogate measure L ( h;<br />

x,<br />

y)<br />

Listwise ranking loss L h;<br />

x,<br />

)<br />

(<br />

y<br />

4/20/2010 Tie-Yan Liu @ Renda 53


Measure-specific Listwise <strong>Rank</strong>ing<br />

Methods<br />

• Also known as “Direct Optimization of IR Measures”.<br />

• It is natural <strong>to</strong> directly optimize what is used <strong>to</strong> evaluate the<br />

ranking results.<br />

• However, it is non-trivial.<br />

– Evaluation measures such as NDCG are non-continuous and nondifferentiable<br />

since they depend on the rank positions.<br />

(S. Robertson and H. Zaragoza, In<strong>for</strong>mation <strong>Retrieval</strong> 2007; J. Xu, T.-Y.<br />

Liu, et al. SIGIR 2008).<br />

– It is challenging <strong>to</strong> optimize such objective functions, since most<br />

optimization techniques in the literature were developed <strong>to</strong> handle<br />

continuous and differentiable cases.<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

54


Tackle the Challenges<br />

• Approximate the objective<br />

– Soften (approximate) the evaluation measure so as <strong>to</strong> make it smooth<br />

and differentiable (Soft<strong>Rank</strong>)<br />

• Bound the objective<br />

– Optimize a smooth and differentiable upper bound of the evaluation<br />

measure (SVM-MAP)<br />

• Optimize the non-smooth objective directly<br />

– Use IR measure <strong>to</strong> update the distribution in Boosting (Ada<strong>Rank</strong>)<br />

– Use genetic programming (<strong>Rank</strong>GP)<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

55


Soft<strong>Rank</strong><br />

(M. Taylor, J. Guiver, et al. LR4IR 2007/WSDM 2008)<br />

• Key Idea:<br />

– Avoid sorting by treating scores as random variables<br />

• Steps<br />

– Construct score distribution<br />

– Map score distribution <strong>to</strong> rank distribution<br />

– Compute expected NDCG (SoftNDCG) which is smooth and<br />

differentiable.<br />

– Optimize SoftNDCG by gradient descent.<br />

4/20/2010 Tie-Yan Liu @ Renda 56


Construct Score Distribution<br />

• Considering the score <strong>for</strong> each document s j as random<br />

variable, by using Gaussian.<br />

p(<br />

s<br />

j<br />

) <br />

Ν(<br />

s<br />

j<br />

|<br />

f<br />

( x<br />

j<br />

2<br />

), )<br />

s<br />

4/20/2010 Tie-Yan Liu @ Renda 57


From Scores <strong>to</strong> <strong>Rank</strong> Distribution<br />

• Probability of d u being ranked be<strong>for</strong>e d v<br />

<br />

uv<br />

Ν( s | f ( xu<br />

) f ( x<br />

2 v),2<br />

s<br />

) ds<br />

0<br />

• <strong>Rank</strong> Distribution<br />

p<br />

( u)<br />

j<br />

( r)<br />

<br />

p<br />

( u1)<br />

j<br />

( r<br />

1)<br />

<br />

uj<br />

<br />

p<br />

( u1)<br />

j<br />

( r)(1<br />

<br />

uj<br />

)<br />

Define the probability of a document being ranked at a particular position<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

58


Soft NDCG<br />

• Expected NDCG as the objective, or in the language of loss<br />

function,<br />

– Only P j (r) contain the model parameter w, and it is differential with<br />

respect <strong>to</strong> w.<br />

<br />

j1<br />

m1<br />

<br />

j<br />

L( f ; x,<br />

y)<br />

1<br />

E[<br />

NDCG]<br />

1<br />

Z (2 1)<br />

p<br />

m<br />

m<br />

y<br />

r0<br />

r<br />

j<br />

( r)<br />

• A neural network is used as the model, and gradient descent<br />

is used as the algorithm <strong>to</strong> optimize the above loss function.<br />

4/20/2010 Tie-Yan Liu @ Renda 59


SVM-MAP<br />

(Y. Yue, T. Finley, et al. SIGIR 2007)<br />

• Use the framework of Structured SVM <strong>for</strong> optimization<br />

– I. Tsochantaridis, et al. ICML 2004; T. Joachims, ICML 2005.<br />

• Sum of slacks ∑ξ upper bounds ∑ (1-AP).<br />

min<br />

1<br />

2<br />

y'<br />

y<br />

w<br />

( i)<br />

2<br />

:<br />

<br />

w<br />

C<br />

N<br />

T<br />

N<br />

<br />

i1<br />

(<br />

y<br />

<br />

( i)<br />

( i)<br />

, x<br />

( i)<br />

)<br />

<br />

w<br />

T<br />

(<br />

y',<br />

x<br />

( i)<br />

) (<br />

y<br />

( i)<br />

, y')<br />

<br />

( i)<br />

<br />

(<br />

y ', x)<br />

( y'<br />

y'<br />

v<br />

)( xu<br />

xv<br />

)<br />

( i)<br />

(<br />

y , y')<br />

1<br />

AP(<br />

y'<br />

)<br />

(<br />

y,<br />

x)<br />

<br />

u<br />

u,<br />

v:<br />

y u 1,<br />

y v 0<br />

<br />

u,<br />

v:<br />

y u 1,<br />

y v 0<br />

( x u<br />

x v<br />

)<br />

4/20/2010 Tie-Yan Liu @ Renda 60


Challenges in SVM-MAP<br />

• For AP, the true ranking is a ranking where the relevant<br />

documents are all ranked in the front, e.g.,<br />

• An incorrect ranking would be any other ranking, e.g.,<br />

• Exponential number of rankings, thus an exponential number<br />

of constraints!<br />

• Solution: use a working set only containing the most<br />

violated constraints.<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

61


Finding the Most Violated Constraint<br />

• Most violated y’:<br />

arg max<br />

y'<br />

<br />

(<br />

y,<br />

y'<br />

)<br />

w<br />

T<br />

(<br />

y',<br />

x)<br />

<br />

• If the relevance at each position is fixed, Δ(y,y’) will be the same. But<br />

if the documents are sorted by their scores in descending order, the<br />

second term will be maximized.<br />

• Strategy (with a complexity of O(mlogm))<br />

– Sort the relevant and irrelevant documents and <strong>for</strong>m a perfect ranking.<br />

– Start with the perfect ranking and swap two adjacent relevant and<br />

irrelevant documents, so as <strong>to</strong> find the optimal interleaving of the two<br />

sorted lists.<br />

4/20/2010 Tie-Yan Liu @ Renda 62


Ada<strong>Rank</strong><br />

(J. Xu, H. Li. SIGIR 2007)<br />

• Embed IR evaluation measure in the distribution<br />

update of Boosting<br />

Weak learner that can<br />

rank important queries<br />

more correctly will have<br />

larger weight α<br />

Emphasize those hard<br />

queries by setting their<br />

distributions.<br />

4/20/2010 Tie-Yan Liu @ Renda 63


<strong>Rank</strong>GP<br />

(J. Yeh, J. Lin, et al. LR4IR 2007)<br />

• Define ranking function as a tree<br />

• <strong>Learning</strong><br />

– Single-population GP, which has been widely used <strong>to</strong><br />

optimize non-smoothing non-differentiable objectives.<br />

• Evolution mechanism<br />

– Crossover, mutation, reproduction, <strong>to</strong>urnament selection.<br />

• Fitness function<br />

– IR evaluation measure (MAP).<br />

4/20/2010 Tie-Yan Liu @ Renda 64


Non-measure Specific Listwise <strong>Rank</strong>ing<br />

Methods<br />

• Defining listwise loss functions based on the<br />

understanding on the unique properties of ranking<br />

<strong>for</strong> IR.<br />

• Representative Algorithms<br />

– ListNet<br />

– ListMLE<br />

4/20/2010 Tie-Yan Liu @ Renda 65


<strong>Rank</strong>ing Loss is Non-trivial!<br />

• An example:<br />

– function f: f(A)=3, f(B)=0, f(C)=1 ACB<br />

– function h: h(A)=4, h(B)=6, h(C)=3 BAC<br />

– ground truth g: g(A)=6, g(B)=4, g(C)=3 ABC<br />

• Question: which function is closer <strong>to</strong> ground truth?<br />

– Based on pointwise similarity: sim(f,g) < sim(g,h).<br />

– Based on pairwise similarity: sim(f,g) = sim(g,h)<br />

– Based on cosine similarity between score vec<strong>to</strong>rs?<br />

f: g: h:<br />

sim(f,g)=0.85<br />

Closer!<br />

sim(g,h)=0.93<br />

• However, according <strong>to</strong> NDCG, f should be closer <strong>to</strong> g!<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

66


ListNet<br />

(Z. Cao, T. Qin, T.-Y. Liu, et al. ICML 2007)<br />

• <strong>Rank</strong>ed list ↔ Permutation probability distribution<br />

• More in<strong>for</strong>mative representation <strong>for</strong> ranked list:<br />

permutation and ranked list has 1-1 correspondence.<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

67


Properties of Luce Model<br />

• Continuous, differentiable, and concave w.r.t. score s.<br />

• Suppose f(A) = 3, f(B)=0, f(C)=1, P(ACB) is largest and P(BCA) is<br />

smallest; <strong>for</strong> the ranking based on f: ACB, swap any two,<br />

probability will decrease, e.g., P(ACB) > P(ABC)<br />

• Translation invariant, when<br />

( s)<br />

exp( s)<br />

• Scale invariant, when<br />

( s)<br />

<br />

as<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

69


Distance between <strong>Rank</strong>ed Lists<br />

φ = exp<br />

Using KL-divergence<br />

<strong>to</strong> measure difference<br />

between distributions<br />

dis(f,g) = 0.46<br />

dis(g,h) = 2.56<br />

4/20/2010<br />

Tie-Yan Liu @ Renda<br />

70


K-L Divergence Loss<br />

• Loss function = KL-divergence between two permutation<br />

probability distributions (φ = exp)<br />

<br />

<br />

||<br />

P<br />

| f ( ) <br />

L( f ; x,<br />

) D P<br />

x<br />

y<br />

y<br />

<br />

Probability distribution defined<br />

by the ground truth<br />

Probability distribution defined<br />

by the model output<br />

• Model = Neural Network<br />

• Algorithm = Gradient Descent<br />

4/20/2010 Tie-Yan Liu @ Renda<br />

71


Limitations of ListNet<br />

• Too complex <strong>to</strong> use in practice: O(m⋅m!)<br />

• By using the <strong>to</strong>p-k Luce model, the complexity of ListNet can<br />

be reduced <strong>to</strong> polynomial order of m.<br />

• However,<br />

– When k is large, the complexity is still high!<br />

– When k is small, the in<strong>for</strong>mation loss in the <strong>to</strong>p-k Luce model will<br />

affect the effectiveness of ListNet.<br />

4/20/2010 Tie-Yan Liu @ Renda 72


Solution<br />

• Given the scores outputted by the ranking model, compute<br />

the likelihood of a ground truth permutation, and maximize it<br />

<strong>to</strong> learn the model parameter.<br />

• The complexity is reduced from O(m⋅m!) <strong>to</strong> O(m).<br />

4/20/2010 Tie-Yan Liu @ Renda 73


ListMLE<br />

(F. Xia, T.-Y. Liu, et al. ICML 2008)<br />

• Likelihood Loss<br />

L( f ; x,<br />

) log<br />

P(<br />

| (<br />

f ( x)))<br />

y<br />

y<br />

• Model = Neural Network<br />

• Algorithm = S<strong>to</strong>chastic Gradient Descent<br />

4/20/2010 Tie-Yan Liu @ Renda 74


Extension of ListMLE<br />

• Dealing with other types of ground truth labels using<br />

the concept of “equivalent permutation set”.<br />

– For relevance degree<br />

<br />

– For pairwise preference<br />

<br />

y<br />

y<br />

{ <br />

y<br />

| u v,if<br />

l 1<br />

l 1<br />

}<br />

( u)<br />

( u )<br />

{ <br />

y<br />

| u v,if<br />

l 1<br />

( u),<br />

( u )<br />

<br />

y<br />

1 <br />

y y<br />

L( f , x,<br />

) min log P | f ( w,<br />

x)<br />

y<br />

<br />

y<br />

y<br />

<br />

y<br />

<br />

y<br />

1}<br />

<br />

<br />

4/20/2010 Tie-Yan Liu @ Renda 75


Discussions<br />

• Advantages<br />

– Take all the documents associated with the same<br />

query as the learning instance<br />

– <strong>Rank</strong> position is visible <strong>to</strong> the loss function<br />

• Problems<br />

– Complexity issue.<br />

– The use of the position in<strong>for</strong>mation is insufficient.<br />

4/20/2010 Tie-Yan Liu @ Renda 76


Analysis of the Approaches


Question<br />

• Different loss functions are used in different<br />

approaches, while the models learned with all<br />

the approaches are evaluated by IR measures.<br />

• What are the relationship between the loss<br />

functions and IR evaluation measures?<br />

4/20/2010 Tie-Yan Liu @ Renda 78


The Pointwise Approach<br />

• Many pointwise loss functions can upper<br />

bound (1-NDCG).<br />

• Minimization of the pointwise loss functions<br />

can lead <strong>to</strong> the maximization of NDCG.<br />

4/20/2010 Tie-Yan Liu @ Renda 79


The Pointwise Approach<br />

• For regression based algorithms (D. Cossock and T. Zhang,<br />

COLT 2006)<br />

1<br />

NDCG(<br />

f<br />

)<br />

<br />

1<br />

Z<br />

m<br />

<br />

<br />

2<br />

<br />

m<br />

<br />

j1<br />

<br />

<br />

j<br />

<br />

<br />

<br />

1/ <br />

<br />

<br />

<br />

m<br />

<br />

j1<br />

f<br />

( x<br />

j<br />

) <br />

y<br />

j<br />

<br />

<br />

<br />

<br />

<br />

1/ <br />

• For multi-class classification based algorithms (P. Li, C. Burges,<br />

et al. NIPS 2007)<br />

<br />

m<br />

m<br />

15<br />

2<br />

1<br />

NDCG(<br />

f ) 2<br />

<br />

j<br />

m<br />

j<br />

Zm<br />

j1 j1<br />

2<br />

m<br />

<br />

<br />

<br />

<br />

m<br />

<br />

j1<br />

I<br />

{ y<br />

yˆ<br />

}<br />

j<br />

j<br />

4/20/2010 Tie-Yan Liu @ Renda 80


The Pointwise Approach<br />

• The minimization of the losses is only the sufficient<br />

condition of minimizing (1-NDCG).<br />

x<br />

i, y i<br />

21.4<br />

xi<br />

, f ( xi<br />

)<br />

Z m<br />

x<br />

<br />

x<br />

x<br />

<br />

x<br />

1<br />

2<br />

3<br />

4<br />

,4 <br />

<br />

,3<br />

,2<br />

<br />

,1<br />

<br />

1<br />

NDCG(<br />

f ) 0<br />

m<br />

I{<br />

y f ( x )}<br />

4<br />

j j<br />

j1<br />

x<br />

<br />

x<br />

x<br />

<br />

x<br />

1<br />

2<br />

3<br />

4<br />

,3 <br />

<br />

,2<br />

,1 <br />

<br />

,0<br />

<br />

15<br />

Z<br />

m<br />

<br />

<br />

2<br />

<br />

m<br />

<br />

1 <br />

<br />

log( j 1)<br />

<br />

j1 j1<br />

2<br />

m<br />

m<br />

<br />

1 <br />

<br />

log( j 1)<br />

<br />

2<br />

m<br />

<br />

<br />

<br />

<br />

m<br />

<br />

j1<br />

I<br />

{ y<br />

j<br />

f ( x<br />

j<br />

)}<br />

1.15 1<br />

4/20/2010 Tie-Yan Liu @ Renda 81


The Pairwise and Non-measure<br />

Specific Listwise Approaches<br />

(W. Chen, T.-Y. Liu, et al. 2009)<br />

• Many pairwise and non-measure specific listwise<br />

loss functions are surrogate functions, and also<br />

upper bounds of a quantity named the “essential<br />

loss”.<br />

• The essential loss is an upper bound of (1-NDCG)<br />

and (1-MAP).<br />

• The minimization of these surrogate loss<br />

functions can lead <strong>to</strong> the maximization of NDCG<br />

and MAP.<br />

4/20/2010 Tie-Yan Liu @ Renda 82


Essential Loss <strong>for</strong> <strong>Rank</strong>ing<br />

• Model ranking as a sequence of classifications<br />

y<br />

A<br />

<br />

B<br />

<br />

C<br />

<br />

D<br />

<br />

Ground truth permutation:<br />

Prediction of the ranking function:<br />

<br />

B<br />

y<br />

<br />

B<br />

A<br />

<br />

C<br />

D<br />

<br />

D<br />

C<br />

√<br />

<br />

B<br />

y<br />

<br />

C<br />

D<br />

<br />

D<br />

C<br />

y { A B C D}<br />

{ B A D C}<br />

<br />

D<br />

<br />

C<br />

Classifier<br />

4/20/2010 Tie-Yan Liu @ Renda 83


Essential Loss vs. <strong>Rank</strong>ing Measures<br />

1) Both (1-NDCG) and (1-MAP) are upper bounded by the<br />

essential loss.<br />

2) The zero value of the essential loss is a necessary and<br />

sufficient condition <strong>for</strong> the zero values of (1-NDCG) and<br />

(1-MAP).<br />

4/20/2010 Tie-Yan Liu @ Renda 84


Essential Loss vs. Surrogate Losses<br />

1) Many pairwise and listwise loss functions are upper<br />

bounds of the essential loss.<br />

2) There<strong>for</strong>e, the pairwise and listwise loss functions are also<br />

upper bounds of (1-NDCG) and (1-MAP).<br />

4/20/2010 Tie-Yan Liu @ Renda 85


Discussions<br />

• Revealed that different loss functions are upper<br />

bounds of the same thing.<br />

• Explained why many different pairwise and<br />

listwise algorithms per<strong>for</strong>m similarly well.<br />

• Suggest better algorithms<br />

– The essential loss is a tighter upper bound of<br />

measure-based ranking errors than existing surrogate<br />

loss functions.<br />

– Our experiments showed that higher ranking<br />

per<strong>for</strong>mances can be achieved by optimizing such a<br />

tighter bound.<br />

4/20/2010 Tie-Yan Liu @ Renda 86


Measure-specific Listwise Approach<br />

• To what degree is a surrogate measure<br />

correlated with the ranking measure?<br />

• Can the maximization of a surrogate measure<br />

lead <strong>to</strong> the same ranking model that<br />

maximizes the ranking measure?<br />

4/20/2010 Tie-Yan Liu @ Renda 87


Tendency Correlation as a Tool<br />

(Y. He and T.-Y. Liu, 2010)<br />

0.2<br />

0<br />

Finding the best linear 0.4 trans<strong>for</strong>mation, by<br />

applying which <strong>to</strong> one of the functions,<br />

the maximum difference between the<br />

tendencies of the two functions in the<br />

entire input space can be minimized.<br />

The “tendency” of a function with<br />

respect <strong>to</strong> two inputs is defined as<br />

the difference in the values of the<br />

function at the two inputs.<br />

4/20/2010 Tie-Yan Liu @ Renda 88


Theoretical Justification of Tendency<br />

Correlation<br />

If the tendency correlation of a surrogate measure is<br />

arbitrarily strong, its optimization will lead <strong>to</strong> the<br />

ranking model that can optimize the corresponding<br />

ranking measure, on both training and test sets.<br />

4/20/2010 Tie-Yan Liu @ Renda 89


Major Results - Soft<strong>Rank</strong><br />

1) When a parameter in Soft<strong>Rank</strong> is set <strong>to</strong> be small, the<br />

tendency correlation of its surrogate measure will<br />

become strong, regardless of the data distribution.<br />

2) However, in this case, the surrogate measure will contain<br />

many local optima, there<strong>for</strong>e robust global optimization<br />

techniques are needed.<br />

4/20/2010 Tie-Yan Liu @ Renda 90


Major Results – SVM-MAP, etc.<br />

1) The surrogate measures in SVM-MAP, SVM-NDCG, DORM,<br />

and Permu<strong>Rank</strong> are concave and easy <strong>to</strong> optimize.<br />

2) No matter how the parameters in these algorithms are set,<br />

<strong>for</strong> certain data distributions, the tendency correlation of<br />

their surrogate measures can never be as strong as expected.<br />

4/20/2010 Tie-Yan Liu @ Renda 91


Summary<br />

• With the <strong>to</strong>ol of tendency correlation, we reveal<br />

the relationship between surrogate measures and<br />

original ranking measures.<br />

• The theorems well explain previous empirical<br />

observations<br />

– Some methods can per<strong>for</strong>m well on different datasets,<br />

while some others cannot on some datasets.<br />

• The theorems provide a guideline <strong>for</strong> trade-off<br />

between effectiveness and ease of optimization.<br />

4/20/2010 Tie-Yan Liu @ Renda 92


Benchmarking <strong>Learning</strong> <strong>to</strong> <strong>Rank</strong><br />

Methods


Role of a Benchmark Dataset<br />

• Enable researchers <strong>to</strong> focus on technologies<br />

instead of experimental preparations.<br />

• Enable fair comparison among algorithms.<br />

• Examples<br />

– Reuters and RCV-1 <strong>for</strong> text classification<br />

– UCI <strong>for</strong> general machine learning<br />

4/20/2010 Tie-Yan Liu @ Renda 94


The LETOR Collection<br />

• A benchmark dataset <strong>for</strong> learning <strong>to</strong> rank<br />

– Document corpora<br />

– Document sampling<br />

– Feature extraction<br />

– Meta in<strong>for</strong>mation<br />

– Cross validation<br />

4/20/2010 Tie-Yan Liu @ Renda 95


Document Corpora<br />

• The “.gov” corpus<br />

• The OHSUMED corpus<br />

– 106 queries<br />

4/20/2010 Tie-Yan Liu @ Renda 96


Document Sampling<br />

• The “.gov” corpus<br />

– Top 1000 documents per query returned by BM25<br />

• The OHSUMED corpus<br />

– All judged documents<br />

4/20/2010 Tie-Yan Liu @ Renda 97


Feature Extraction<br />

• The “.gov” corpus<br />

– 64 features<br />

– TF, IDF, BM25, LMIR, Page<strong>Rank</strong>, HITS, Relevance<br />

Propagation, URL length, etc.<br />

• The OHSUMED corpus<br />

– 40 features<br />

– TF, IDF, BM25, LMIR, etc.<br />

4/20/2010 Tie-Yan Liu @ Renda 98


Meta In<strong>for</strong>mation<br />

• Statistical in<strong>for</strong>mation about the corpus<br />

– Number of documents, number of streams, and number of<br />

(unique) terms in each stream.<br />

• Raw in<strong>for</strong>mation of the documents associated with<br />

each query<br />

– Term frequency, and document length.<br />

• Relational in<strong>for</strong>mation<br />

– Hyperlink graph, sitemap in<strong>for</strong>mation, and similarity<br />

relationship matrix of the corpora.<br />

4/20/2010 Tie-Yan Liu @ Renda 99


Cross Validation<br />

• Five-fold cross validation<br />

http://research.microsoft.com/~LETOR/<br />

4/20/2010 Tie-Yan Liu @ Renda 100


Algorithms under Investigation<br />

• The pointwise approach<br />

– Linear regression<br />

• The pairwise approach<br />

– <strong>Rank</strong>ing SVM, <strong>Rank</strong>Boost, Frank<br />

• The listwise approach<br />

– ListNet<br />

– SVM-MAP, Ada<strong>Rank</strong><br />

Linear ranking model (except <strong>Rank</strong>Boost)<br />

4/20/2010 Tie-Yan Liu @ Renda 101


Experimental Results – TD2003<br />

4/20/2010 Tie-Yan Liu @ Renda 102


Experimental Results – TD2004<br />

4/20/2010 Tie-Yan Liu @ Renda 103


Experimental Results – NP2003<br />

4/20/2010 Tie-Yan Liu @ Renda 104


Experimental Results – NP2004<br />

4/20/2010 Tie-Yan Liu @ Renda 105


Experimental Results – HP2003<br />

4/20/2010 Tie-Yan Liu @ Renda 106


Experimental Results – HP2004<br />

4/20/2010 Tie-Yan Liu @ Renda 107


Experimental Results - OHSUMED<br />

4/20/2010 Tie-Yan Liu @ Renda 108


Experimental Results - Winner Numbers<br />

4/20/2010 Tie-Yan Liu @ Renda 109


Experimental Results - Winner Numbers<br />

40<br />

35<br />

30<br />

25<br />

20<br />

15<br />

10<br />

5<br />

Regression<br />

<strong>Rank</strong>SVM<br />

<strong>Rank</strong>Boost<br />

F<strong>Rank</strong><br />

ListNet<br />

Ada<strong>Rank</strong><br />

SVMmap<br />

0<br />

N@1 N@3 N@10 P@1 P@3 P@10 MAP<br />

4/20/2010 Tie-Yan Liu @ Renda 110


• NDCG@1<br />

Discussions<br />

– {ListNet, Ada<strong>Rank</strong>} > {SVM-MAP, <strong>Rank</strong>SVM}<br />

> {<strong>Rank</strong>Boost, F<strong>Rank</strong>} > {Regression}<br />

• NDCG@3, @10<br />

– {ListNet} > {Ada<strong>Rank</strong>, SVM-MAP, <strong>Rank</strong>SVM, <strong>Rank</strong>Boost}<br />

> {F<strong>Rank</strong>} > {Regression}<br />

• MAP<br />

– {ListNet} > {Ada<strong>Rank</strong>, SVM-MAP, <strong>Rank</strong>SVM}<br />

> {<strong>Rank</strong>Boost, F<strong>Rank</strong>} > {Regression}<br />

The listwise approach has certain advantages, especially in<br />

correctly predicting <strong>to</strong>p-ranked documents.<br />

4/20/2010 Tie-Yan Liu @ Renda 111


Notes<br />

• The experimental results can be found at the<br />

LETOR website<br />

– http://research.microsoft.com/~le<strong>to</strong>r/<br />

• The results are still primal<br />

– The implementations of the these algorithms are<br />

not perfect<br />

– Almost every algorithm can be further improved.<br />

4/20/2010 Tie-Yan Liu @ Renda 112


Summary and Outlook


Algorithms Theory Data Activity<br />

Pointwise approach<br />

Pairwise approach<br />

Listwise approach<br />

- non-measure specific<br />

- measure specific<br />

• NIPS workshop on<br />

learning <strong>to</strong> rank<br />

• ICML workshop on<br />

structure output<br />

• 1 st SIGIR workshop<br />

on LR4IR<br />

• WWW tu<strong>to</strong>rial • WWW tu<strong>to</strong>rial<br />

• SIGIR tu<strong>to</strong>rial • ACL tu<strong>to</strong>rial<br />

• 2 nd LR4IR workshop • IRJ special issue<br />

• 1 st SIGIR workshop on • 3 rd LR4IR workshop<br />

diversity<br />

• 2 nd Diversity workshop<br />

• LETOR 1.0<br />

• LETOR 2.0<br />

• LETOR 3.0 • LETOR 4.0<br />

• Pranking<br />

• DMIR<br />

• <strong>Rank</strong>ing SVM<br />

• <strong>Rank</strong>Boost<br />

• Gaussian process <strong>for</strong><br />

ordinal regression<br />

• <strong>Rank</strong>Net<br />

• Consistency of<br />

pointwise methods<br />

• Subset <strong>Rank</strong>ing<br />

• IR-SVM<br />

• Lambda<strong>Rank</strong><br />

• P-norm push<br />

• Nested <strong>Rank</strong>er<br />

• <strong>Rank</strong>Cosine<br />

• MC-<strong>Rank</strong><br />

• GB<strong>Rank</strong><br />

• MHR<br />

• Frank<br />

• ListNet<br />

• Supervised RA<br />

• <strong>Rank</strong>GP<br />

• Ada<strong>Rank</strong><br />

• SVM-MAP<br />

• Generalization of<br />

pairwise methods<br />

• Consistency of<br />

Listwise methods<br />

• Iso<strong>Rank</strong><br />

• QB<strong>Rank</strong><br />

• CDN <strong>Rank</strong>er<br />

• ListMLE<br />

• Relational <strong>Rank</strong>ing<br />

• Global<strong>Rank</strong><br />

• Query-dependent <strong>Rank</strong>er<br />

• App<strong>Rank</strong><br />

• Permu<strong>Rank</strong><br />

• Soft<strong>Rank</strong><br />

• G-Soft<strong>Rank</strong><br />

• SVM-NDCG<br />

• Generalization of<br />

listwise methods<br />

< 2005 2005 2006 2007 2008 2009<br />

• OWA <strong>Rank</strong><br />

• Boltzman<strong>Rank</strong><br />

• Bayes-Luce<br />

• Smooth<strong>Rank</strong><br />

• Robust sparse<br />

ranker<br />

• … …<br />

4/20/2010 Tie-Yan Liu @ Renda 114


Answer <strong>to</strong> Question 1<br />

• To what respect are these learning <strong>to</strong> rank algorithms<br />

similar and in which aspects do they differ? What are<br />

the strengths and weaknesses of each algorithm?<br />

Approach Modeling Pro Con Connection<br />

Pointwise<br />

Pairwise<br />

Listwise<br />

Regression/<br />

classification/<br />

ordinal<br />

regression<br />

Pairwise<br />

classification<br />

<strong>Rank</strong>ing<br />

Easy <strong>to</strong> leverage<br />

existing theories<br />

and algorithms<br />

query-level and<br />

position based<br />

• Accurate ranking<br />

≠ Accurate score or<br />

category<br />

• Position info is<br />

invisible <strong>to</strong> the loss<br />

• More complex<br />

• New theory<br />

needed<br />

•Effectively<br />

minimize (1-<br />

NDCG).<br />

4/20/2010 Tie-Yan Liu @ Renda 115


Answer <strong>to</strong> Question 2<br />

• Empirically speaking, which of those many learning<br />

<strong>to</strong> rank algorithms per<strong>for</strong>m the best?<br />

– LETOR makes it possible <strong>to</strong> per<strong>for</strong>m fair comparisons<br />

among different learning <strong>to</strong> rank methods.<br />

– Empirical studies on LETOR have shown that the listwise<br />

ranking algorithms seem <strong>to</strong> have certain advantages over<br />

other algorithms, especially in predicting the <strong>to</strong>p-ranked<br />

documents, and the pairwise ranking algorithms seem <strong>to</strong><br />

outper<strong>for</strong>m the pointwise algorithms.<br />

4/20/2010 Tie-Yan Liu @ Renda 116


Answer <strong>to</strong> Question 3<br />

• Are there many remaining issues regarding learning<br />

<strong>to</strong> rank <strong>to</strong> study in the future?<br />

– <strong>Learning</strong> from logs<br />

– Feature engineering<br />

– Advanced ranking model<br />

– <strong>Rank</strong>ing theory<br />

4/20/2010 Tie-Yan Liu @ Renda 117


<strong>Learning</strong> from Logs<br />

• Data scale matters a lot!<br />

– Ground truth mining from logs is one of the possible<br />

ways <strong>to</strong> construct large-scale training data.<br />

• Ground truth mining may lose in<strong>for</strong>mation .<br />

– User sessions, frequency of clicking a certain<br />

document, frequency of a certain click pattern, and<br />

diversity in the intentions of different users.<br />

• Directly learn from logs.<br />

– E.g., Regard click-through logs as the “ground truth”,<br />

and define its likelihood as the loss function.<br />

4/20/2010 Tie-Yan Liu @ Renda 118


Feature Engineering<br />

• <strong>Rank</strong>ing is deeply rooted in IR<br />

– Not just new algorithms and theories.<br />

• Feature engineering is very importance<br />

– How <strong>to</strong> encode the knowledge on IR accumulated<br />

in the past half a century in the features?<br />

– Currently, these kinds of works have not been<br />

given enough attention <strong>to</strong>.<br />

4/20/2010 Tie-Yan Liu @ Renda 119


Advanced <strong>Rank</strong>ing Model<br />

• Listwise ranking function<br />

– Natural but challenging<br />

– Complexity is a big issue<br />

• Generative ranking model<br />

– Why not generative?<br />

• Semi-supervised ranking<br />

– The difference from semi-supervised classification<br />

4/20/2010 Tie-Yan Liu @ Renda 120


<strong>Rank</strong>ing Theory<br />

• A more unified view on loss functions<br />

• True loss <strong>for</strong> ranking<br />

• Statistical consistency <strong>for</strong> ranking<br />

• Complexity of ranking function class<br />

• ……<br />

4/20/2010 Tie-Yan Liu @ Renda 121


tyliu@microsoft.com<br />

http://research.microsoft.com/users/tyliu/<br />

4/20/2010 Tie-Yan Liu @ Renda 122

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!