Learning to Rank for Information Retrieval

Learning to Rank 

for Information Retrieval 

Tie-Yan Liu 

Lead Researcher, Microsoft Research Asia

Outline 

• Introduction 

• Pointwise approach 

• Pairwise approach 

• Listwise approach 

• Analysis of the approaches 

• Benchmarking learning to rank methods 

• Summary and Outlook 

4/20/2010 Tie-Yan Liu @ Renda 

2

Introduction

Overwhelmed by Flood of Information 


4

Facts about the Web 

• According to www.worldwidewebsize.com, major 

search engines indexed at least tens of billions of 

web pages. 

• CUIL.com indexed more than 120 Billion web pages. 

4/20/2010 Tie-Yan Liu @ Renda 5

Facts about the Web 

• According to www.worldwidewebsize.com, major 

search engines indexed at least tens of billions of 

web pages. 

• CUIL.com indexed more than 120 Billion web pages. 


Ranking is Essential 

Indexed 

Document Repository 

D 

d j 

Ranking 

model 

Query 


Evaluation of Ranking Results 

• Construct a test set containing a large number of 

(randomly sampled) queries and their associated 

documents, and relevance judgment for each querydocument 

pair. 

• Evaluate the ranking result for a particular query in 

the test set with an evaluation measure. 

• Use the average measure over the entire test set to 

represent the overall ranking performance. 


Collecting Queries and Documents 

• Queries are usually randomly sampled from 

query logs. 

• TREC’s Pooling strategy for sampling documents 

– A pool of possibly relevant documents is created by 

taking a sample of documents selected by the various 

participating systems. 

– Top 100 documents retrieved in each submitted run 

for a given query are selected and merged into the 

pool for human assessment. 

– On average, an assessor judges the relevance of 

approximately 1500 documents per query. 


Relevance Judgment 

• Degree of relevance 

– Binary: relevant vs. irrelevant 

– Multiple ordered categories: Perfect > Excellent > Good > 

Fair > Bad 

• Pairwise preference 

– Document A is more relevant than 

document B 

• Total order 

l 

• Documents are ranked as {A,B,C,..} 

according to their relevance 

l k 

l 

u , v 


10

Evaluation Measure - MAP 

• Precision at position k for query q: 

#{relevant documents in top k results} 

P@ 

k 

k 

• Average precision for query q: 

P@ 

k l k 

k 

AP 

#{relevant documents} 

AP 

• MAP: averaged over all queries. 

1 

3 

1 

 

 

1 

2 

3 

 

3 

5 

 

 

 

0.76 


11

Evaluation Measure - NDCG 

• NDCG at position n for query q: 

NDCG@ 

k 

 

Z 

k 

k 

 

j1 

 

G 

1 

( 

 

j) 

( 

j) 

Normalization 

Cumulating 

Gain 

Position discount 

• Averaged over all queries. 

G 

 

1 

 

l 

 

 

( ) 2 

1 ( j ) 

j 1, 

( 

j) 

1/ log( j 1) 

The document ranked at 

the j-th position in π. 

1 

( j) 

( j) 

The rank position of 

document j in π. 


Evaluation Measure - Summary 

• Query-level: every query contributes equally to the 

measure. 

– Computed on documents associated with the same query. 

– Bounded for each query. 

– Averaged over all test queries. 

• Position-based: rank position is explicitly used. 

– Top-ranked objects are more important. 

– Relative order vs. relevance score of each document. 

– Non-continuous and non-differentiable w.r.t. scores 


Conventional Ranking Models 

• Query-dependent 

– Boolean model, extended Boolean model, etc. 

– Vector space model, latent semantic indexing (LSI), etc. 

– BM25 model, statistical language model, etc. 

– Span based model, distance aggregation model, etc. 

• Query-independent 

– PageRank, TrustRank, BrowseRank, Toolbar Clicks, etc. 


Problems with Conventional Models 

• Manual parameter tuning is usually difficult, 

especially when there are many parameters and the 

evaluation measures are non-smooth. 

• Manual parameter tuning sometimes leads to overfitting. 

• It is non-trivial to combine the large number of 

models proposed in the literature to obtain an even 

more effective model. 


Machine Learning Can Help 

• Machine learning is an effective tool 

– To automatically tune parameters. 

– To combine multiple evidences. 

– To avoid over-fitting (by means of regularization, etc.) 

• “Learning to Rank” 

– In general, those methods that use 

machine learning technologies to solve 

the problem of ranking can be named as 

“learning to rank” methods. 



• In most recent works, learning to rank is 

defined as having the following two properties: 

– Feature based; 

– Discriminative training. 


Feature Based 

• Documents represented by feature vectors. 

– Features are extracted for each query-document pair. 

– Even if a feature is the output of an existing retrieval 

model, one assumes that the parameter in the model is 

fixed, and only learns the optimal way of combining these 

features. 

• The capability of combining a large number of 

features is very promising. 

– It can easily incorporate any new progress on retrieval 

model, by including the output of the model as a feature. 


Discriminative Training 

• An automatic learning process based on the 

training data, 

– With the four pillars of • Contains discriminative the objects learning under 

• Input space, 

• Output space, 

• Hypothesis space, 

• Loss function. 

investigation. 

• Usually objects are 

represented by feature vectors, 

extracted according to different 

applications 

– Not even necessary to have a probabilistic 

explanation. 





– With the four pillars of discriminative learning 





• There are two related but different 

definitions of the output space. 

- Output space of the task. 

- Output space to facilitate learning. 

• Example: using regression to solve 

classification 

- Output space of the task: {+1, -1} 

- Output space to facilitate learning: 


real numbers 

explanation. 










• Defines the class of functions 

mapping the input space to the output 

space. 

• The functions operate on the feature 

vectors of the input objects, and make 

predictions according to the format of 

– Not even necessary to the have output a probabilistic 

space. 

explanation. 










• The loss function measures to what 

degree the prediction generated by the 

hypothesis is in accordance with the 

ground truth label. 

- For example, widely used loss 

functions for classification include the 

exponential loss, the hinge loss, and 

the logistic loss. 

- With the loss function, an empirical 

risk can be defined on the training set, 


explanation. 

and the optimal hypothesis is usually 

(but not always) learned by means of 

empirical risk minimization. 





• This is highly demanding for real search 

engines, 

– Everyday these search engines will receive a lot of 

user feedback and usage logs 

– It is very important to automatically learn from 

the feedback and constantly improve the ranking 

mechanism. 



• In most recent works, learning to rank is 

defined as having the following two properties: 

– Feature based; 

– Discriminative training. 

• Automatic parameter tuning for existing 

ranking models, and learning-based 

generative ranking models do not belong to 

the current definition of “learning to rank.” 


Learning to Rank Framework 


Learning to Rank Algorithms

Learning to Rank Algorithms 

Least Square Retrieval Function Query refinement (WWW 2008) 

(TOIS 1989) 

SVM-MAP (SIGIR 2007) Nested Ranker (SIGIR 2006) 

ListNet (ICML 2007) 

Pranking (NIPS 2002) 

LambdaRank (NIPS 2006) MPRank (ICML 2007) 

Frank (SIGIR 2007) 

MHR (SIGIR 2007) RankBoost (JMLR 2003) 

Learning to retrieval info (SCC 1995) 

LDM (SIGIR 2005) 

Large margin ranker (NIPS 2002) 

RankNet (ICML 2005) 

Ranking SVM (ICANN 1999) IRSVM (SIGIR 2006) 

Discriminative model for IR (SIGIR 2004) SVM Structure (JMLR 2005) 

OAP-BPM (ICML 2003) 

Subset Ranking (COLT 2006) 

GPRank (LR4IR 2007) QBRank (NIPS 2007) GBRank (SIGIR 2007) 

Constraint Ordinal Regression (ICML 2005) McRank (NIPS 2007) SoftRank (LR4IR 2007) 

AdaRank (SIGIR 2007) CCA (SIGIR 2007) ListMLE (ICML 2008) 

RankCosine (IP&M 2007) Supervised Rank Aggregation (WWW 2007) 

Relational ranking (WWW 2008) 

Learning to order things (NIPS 1998) 

Round robin ranking (ECML 2003) 


Key Questions to Answer 

• Q1: To what respect are these learning to rank algorithms 

similar and in which aspects do they differ? What are the 

strengths and weaknesses of each algorithm? 

• Q2: Empirically speaking, which of those many learning to 

rank algorithms perform the best? 

• Q3: Are there many remaining issues regarding learning to 

rank to study in the future? 


Categorization of the Algorithms 

Category 

Pointwise 

Approach 

Pairwise 

Approach 

Listwise 

Approach 

Algorithms 

Regression: Least Square Retrieval Function (TOIS 1989), Regression Tree for Ordinal 

Class Prediction (Fundamenta Informaticae, 2000), Subset Ranking using Regression 

(COLT 2006), … 

Classification: Discriminative model for IR (SIGIR 2004), McRank (NIPS 2007), … 

Ordinal regression: Pranking (NIPS 2002), OAP-BPM (EMCL 2003), Ranking with 

Large Margin Principles (NIPS 2002), Constraint Ordinal Regression (ICML 2005), … 

Learning to Retrieve Information (SCC 1995), Learning to Order Things (NIPS 1998), 

Ranking SVM (ICANN 1999), RankBoost (JMLR 2003), LDM (SIGIR 2005), RankNet 

(ICML 2005), Frank (SIGIR 2007), MHR(SIGIR 2007), GBRank (SIGIR 2007), QBRank 

(NIPS 2007), MPRank (ICML 2007), IRSVM (SIGIR 2006), LambdaRank (NIPS 2006),… 

Non-measure specific loss minimization: ListNet (ICML 2007), ListMLE (ICML 2008), 

… 

Measure specific loss minimizaiton: AdaRank (SIGIR 2007), SVM-MAP (SIGIR 2007), 

SoftRank (LR4IR 2007), GPRank (LR4IR 2007), CCA (SIGIR 2007), … 


The Poinitwise Approach

The Pointwise Approach 


Regression Classification Ordinal Regression 

Input Space 

Single documents x j 

Output Space Real values y j 

Non-ordered 

Categories y j 

Hypothesis Scoring function f(x) Classifier sgn.f(x) 

Loss Function 

Regression loss 

Classification loss 

L( 

f ; x j 

, y 

j) 

Ordinal categories y j 

Scoring function + 

thresholding 

Ordinal regression 

loss 



• Reduce ranking to 

– Regression 

• Subset Ranking 

– Classification 

• Discriminative model for IR 

• MCRank 

– Ordinal regression 

• PRanking 

• Ranking with large margin principle 

 

 

 

 

 

 

 

 

x 

1 

 

 

 

 

 

 

 

x2 

q 

 

x m 

( x , y ),( x 1 2, 

y 2),...,( 

x m, 

y 

1 m 

) 

 


Subset Ranking using Regression 

(D. Cossock and T. Zhang, COLT 2006) 

• Regard relevance degree as real number, and use 

regression to learn the ranking function. 

f 

( x ) y 2 

L( f ; x 

j, 

y 

j) 

 

j j 

Regression-based 


Discriminative Model for IR 

(R. Nallapati, SIGIR 2004) 

• Classification is used to learn the ranking function 

– Treat relevant documents as positive examples (y j =+1), 

while irrelevant documents as negative examples (y j =-1). 

– f ; x j 

, y ) is surrogate loss of I 

L ( 

j 

{sgn 

f ( x j ) y 

} 

• Different algorithms optimize different surrogate 

losses. 

– e.g., Support Vector Machines (hinge loss) 

j 

Classification-based 


McRank 

(P. Li, C. Burges, et al. NIPS 2007) 

• Multi-class classification is used to learn the ranking 

function. 

– For document x , the output of the classifier is ŷ 

j. 

– Loss function: surrogate function of 

• Ranking is produced by combining the outputs of the 

classifiers. 

pˆ 

j 

j 

 

, k 

P( 

yˆ 

j 

k), 

f ( x 

j) 

pˆ 

K 

k1 

I 

yˆ 

j y 

{ j 

j, 

k 

k 

} 

Classification-based 


Pranking with Ranking 

(K. Krammer and Y. Singer, NIPS 2002) 

• Ranking is solved by ordinal regression 

x 

 

w 

x 

Ordinal regression based 


Ranking with Large Margin Principles 

(A. Shashua and A. Levin, NIPS 2002) 

• Fixed margin strategy 

– Margin: 

2 w 

2 

– Constraints: every document is correctly placed in its 

corresponding category. 

if y j 

2 

f x ) 

( j 

b 1 b 2 

Category 1 Category 2 

Category 3 

 

Ordinal regression based 


Ranking with Large Margin Principles 

• Sum of margins strategy 

– Margin: 

 

k 

(b ) 

k 

a k 

– Constraints: every document is correctly placed in its 

corresponding category 

f x ) 

( j 

if y j 

2 

4/20/2010 Tie-Yan Liu @ Renda 38 

 

Ordinal regression based

Problem with Pointwise Approach 

• Properties of IR evaluation measures have not been 

well considered. 

– The fact is ignored that some documents are associated 

with the same query and some are not. 

• When the number of associated documents varies largely for 

different queries, the overall loss function will be dominated by 

those queries with a large number of documents. 

– The position of documents in the ranked list is invisible to 

the loss functions. 

• The pointwise loss function may unconsciously emphasize too 

much those unimportant documents (which are ranked low in the 

final results). 


The Pairwise Approach

The Pairwise Approach 


Input Space Document pairs (x u ,x v ) 

Output Space 

Preference 

y u 

, v 

{ 

1, 

1} 

Hypothesis Space 

Preference function 

h( xu, 

xv 

) 2 

I 

f ( x ) f ( x )} 

 

{ 

 

u v 

1 

Loss Function 

Pairwise classification loss 

L( 

h; 

xu, 

xv, 

yu,v) 



• Reduce ranking to pairwise classification 

– Learning to order things 

– RankNet and Frank 

– RankBoost 

– Ranking SVM 

– MHR 

– IR-SVM 

 

 

 

 

 

 

 

x 

x 

 

1 

2 

x m 

 

 

 

 

 

 

 

 

x 

 

1 

, x2, 

1 , x2, 

x1 

, 1 ,..., 

 

 

 

x 

q 

2 

, x 

m 

, 1, 

x 

m 

, x 

2 

, 1 


Learning to Order Things 

(W. Cohen, R. Schapire, et al. NIPS 1998) 

• Pairwise loss function 

L( 

h; 

x 

, x 

• Pairwise ranking function 

– Weighted majority algorithm is used to learn the parameters w. 

– From preference to total order: 

u 

v 

, 

y 

 

u, 

v 

) 

 

y 

u, 

v 

h( xu 

, xv 

) wt 

ht 

( xu 

, xv 

) 

 

max 

( x 1 

, x 1 

( u) 

( v ) 

uv 

h ) 

h( 

x 

2 

u 

, x 

v 

) 


• Target probability: 

Pu 

v 

, if yu, 

v 

1; 

• Modeled probability: 

P 

u, 

v 

( f 

RankNet 

(C. Burges, T. Shaked, et al. ICML 2005) 

, 

1 Pu 

, v 

 

exp( f ( xu 

) f 

) 

1 

exp( f ( x ) 

( xv 

)) 

f ( x )) 

• Cross entropy as the loss function 

l( 

f ; x 

P 

u 

u, 

v 

, x 

v 

, 

log 

y 

P 

u, 

v 

u, 

v 

) 

( 

u 

• Use neural network as model, and gradient descent as 

algorithm, to optimize the cross-entropy loss. 

v 

f ) (1 P 

0,otherwise 

u, 

v 

)log(1 P 

u, 

v 

( 

f 

)) 


44

FRank 

(M. Tsai, T.-Y. Liu, et al. SIGIR 2007) 

• Similar setting to that of RankNet 

• New loss function: fidelity 

L( 

f 

; x 

u 

, x 

v 

, 

y 

u, 

v 

) 

1 

P 

u, 

v 

P 

u, 

v 

( 

f 

) 

 

(1 

P 

u, 

v 

) (1 

 

P 

u, 

v 

( 

f 

)) 


45

• Exponential loss 

RankBoost 

(Y. Freund, R. Iyer, et al. JMLR 2003) 

L( f ; xu, 

xv, 

yu 

v) 

exp( yu, 

( 

f ( x 

) 

, v u v 

Weak learner that can 

classify important pairs 

more f ( correctly x ))) will have 

larger weight α 

Hypothesis is the 

combination of weak 

learners 

Emphasize those hard 

pairs by setting their 

distributions. 


Ranking SVM 

(R. Herbrich, T. Graepel, et al. , Advances in Large Margin Classifiers, 2000; 

T. Joachims, KDD 2002) 

• Hinge loss 

 

L( 

f ; xu, 

xv, 

yu, v) 

1 

yu, 

v 

f ( xu 

) f ( xv) 

 

 

 

min 

w 

 

T 

( i) 

uv 

 

1 

2 

x 

( i) 

u 

|| w || 

 

0, i 

x 

2 

( i) 

v 

 

( 

1 

1,..., 

n. 

n 

( i) 

u, 

v 

i1 ( ) 

u, 

v: 

y , 1 

C 

i) 

u, 

v 

u 

i 

v 

,if 

y 

( i) 

u, 

v 

1. 

x u -x v as positive instance of 

learning 

Use SVM to perform binary 

classification on these instances, 

to learn model parameter w 


Extensions of the Pairwise Approach 

• When constructing document pairs, category 

information has been lost: different pairs of labels 

are treated identically. 

– Can we maintain more information about ordered 

category? 

– Can we use different learner for different kinds of pairs? 

• Solutions 

– Multi-hyperplane ranker (MHR) 


Multi-Hyperplane Ranker 

(T. Qin, T.-Y. Liu, et al. SIGIR 2007) 

• If we have K categories, then will have K(K-1)/2 pairwise 

preferences between two labels. 

– Learn a model f k,l for the pair of categories k and l, using any previous 

method (for example Ranking SVM). 

• Testing 

– Rank Aggregation 

f ( x) 

k 

k, 

l 

, l 

fk, 

l 

( x) 


Improvement of Pairwise Approach 

• Advantage 

– Predicting relative order is closer to the nature of ranking 

than predicting class label or relevance score. 

• Problem 

– The distribution of 

document pair number is 

more skewed than the 

distribution of document 

number, with respect to 

different queries. 

– Extreme case: m 2 vs. m 


IR-SVM 

• Introduce query-level normalization to the pairwise loss 

function. 

– Use the number of document pairs to normalize the pairwise loss. 

• IRSVM (Y. Cao, J. Xu, et al. SIGIR 2006; T. Qin, T.-Y. Liu, et al. IPM 2007) 

min 

1 

2 

|| w || 

2 

C 

N 

 

1 

m~ 

 

( i) 

i1 u, 

v 

Query-level normalizer 

 

( i) 

uv 

Significant improvement has been observed by using the query-level normalizer. 


The Listwise Approach

The Listwise Approach 

The Listwise Approach 

Input Space 

Measure-specific 

Document set 

x { 

 

Non-measure specific 

m 

x 

j} 

j 1 

m 

Output Space Ordered category y { y 

j} 

j1 

Permutation 

y 

Hypothesis Space 

h( x) 

f ( x) 

h( x) 

sort 

f ( x) 

Loss Function 1-surrogate measure L ( h; 

x, 

y) 

Listwise ranking loss L h; 

x, 

) 

( 

y 


Measure-specific Listwise Ranking 

Methods 

• Also known as “Direct Optimization of IR Measures”. 

• It is natural to directly optimize what is used to evaluate the 

ranking results. 

• However, it is non-trivial. 

– Evaluation measures such as NDCG are non-continuous and nondifferentiable 

since they depend on the rank positions. 

(S. Robertson and H. Zaragoza, Information Retrieval 2007; J. Xu, T.-Y. 

Liu, et al. SIGIR 2008). 

– It is challenging to optimize such objective functions, since most 

optimization techniques in the literature were developed to handle 

continuous and differentiable cases. 


54

Tackle the Challenges 

• Approximate the objective 

– Soften (approximate) the evaluation measure so as to make it smooth 

and differentiable (SoftRank) 

• Bound the objective 

– Optimize a smooth and differentiable upper bound of the evaluation 

measure (SVM-MAP) 

• Optimize the non-smooth objective directly 

– Use IR measure to update the distribution in Boosting (AdaRank) 

– Use genetic programming (RankGP) 


55

SoftRank 

(M. Taylor, J. Guiver, et al. LR4IR 2007/WSDM 2008) 

• Key Idea: 

– Avoid sorting by treating scores as random variables 

• Steps 

– Construct score distribution 

– Map score distribution to rank distribution 

– Compute expected NDCG (SoftNDCG) which is smooth and 

differentiable. 

– Optimize SoftNDCG by gradient descent. 


Construct Score Distribution 

• Considering the score for each document s j as random 

variable, by using Gaussian. 

p( 

s 

j 

) 

Ν( 

s 

j 

| 

f 

( x 

j 

2 

), ) 

s 


From Scores to Rank Distribution 

• Probability of d u being ranked before d v 

 

uv 

Ν( s | f ( xu 

) f ( x 

2 v),2 

s 

) ds 

0 

• Rank Distribution 

p 

( u) 

j 

( r) 

 

p 

( u1) 

j 

( r 

1) 

 

uj 

 

p 

( u1) 

j 

( r)(1 

 

uj 

) 

Define the probability of a document being ranked at a particular position 


58

Soft NDCG 

• Expected NDCG as the objective, or in the language of loss 

function, 

– Only P j (r) contain the model parameter w, and it is differential with 

respect to w. 

 

j1 

m1 

 

j 

L( f ; x, 

y) 

1 

E[ 

NDCG] 

1 

Z (2 1) 

p 

m 

m 

y 

r0 

r 

j 

( r) 

• A neural network is used as the model, and gradient descent 

is used as the algorithm to optimize the above loss function. 


SVM-MAP 

(Y. Yue, T. Finley, et al. SIGIR 2007) 

• Use the framework of Structured SVM for optimization 

– I. Tsochantaridis, et al. ICML 2004; T. Joachims, ICML 2005. 

• Sum of slacks ∑ξ upper bounds ∑ (1-AP). 

min 

1 

2 

y' 

y 

w 

( i) 

2 

: 

 

w 

C 

N 

T 

N 

 

i1 

( 

y 

 

( i) 

( i) 

, x 

( i) 

) 

 

w 

T 

( 

y', 

x 

( i) 

) ( 

y 

( i) 

, y') 

 

( i) 

 

( 

y ', x) 

( y' 

y' 

v 

)( xu 

xv 

) 

( i) 

( 

y , y') 

1 

AP( 

y' 

) 

( 

y, 

x) 

 

u 

u, 

v: 

y u 1, 

y v 0 

 

u, 

v: 

y u 1, 

y v 0 

( x u 

x v 

) 


Challenges in SVM-MAP 

• For AP, the true ranking is a ranking where the relevant 

documents are all ranked in the front, e.g., 

• An incorrect ranking would be any other ranking, e.g., 

• Exponential number of rankings, thus an exponential number 

of constraints! 

• Solution: use a working set only containing the most 

violated constraints. 


61

Finding the Most Violated Constraint 

• Most violated y’: 

arg max 

y' 

 

( 

y, 

y' 

) 

w 

T 

( 

y', 

x) 

 

• If the relevance at each position is fixed, Δ(y,y’) will be the same. But 

if the documents are sorted by their scores in descending order, the 

second term will be maximized. 

• Strategy (with a complexity of O(mlogm)) 

– Sort the relevant and irrelevant documents and form a perfect ranking. 

– Start with the perfect ranking and swap two adjacent relevant and 

irrelevant documents, so as to find the optimal interleaving of the two 

sorted lists. 


AdaRank 

(J. Xu, H. Li. SIGIR 2007) 

• Embed IR evaluation measure in the distribution 

update of Boosting 

Weak learner that can 

rank important queries 

more correctly will have 

larger weight α 

Emphasize those hard 

queries by setting their 

distributions. 


RankGP 

(J. Yeh, J. Lin, et al. LR4IR 2007) 

• Define ranking function as a tree 

• Learning 

– Single-population GP, which has been widely used to 

optimize non-smoothing non-differentiable objectives. 

• Evolution mechanism 

– Crossover, mutation, reproduction, tournament selection. 

• Fitness function 

– IR evaluation measure (MAP). 


Non-measure Specific Listwise Ranking 

Methods 

• Defining listwise loss functions based on the 

understanding on the unique properties of ranking 

for IR. 

• Representative Algorithms 

– ListNet 

– ListMLE 


Ranking Loss is Non-trivial! 

• An example: 

– function f: f(A)=3, f(B)=0, f(C)=1 ACB 

– function h: h(A)=4, h(B)=6, h(C)=3 BAC 

– ground truth g: g(A)=6, g(B)=4, g(C)=3 ABC 

• Question: which function is closer to ground truth? 

– Based on pointwise similarity: sim(f,g) < sim(g,h). 

– Based on pairwise similarity: sim(f,g) = sim(g,h) 

– Based on cosine similarity between score vectors? 

f: g: h: 

sim(f,g)=0.85 

Closer! 

sim(g,h)=0.93 

• However, according to NDCG, f should be closer to g! 


66

ListNet 

(Z. Cao, T. Qin, T.-Y. Liu, et al. ICML 2007) 

• Ranked list ↔ Permutation probability distribution 

• More informative representation for ranked list: 

permutation and ranked list has 1-1 correspondence. 


67

Properties of Luce Model 

• Continuous, differentiable, and concave w.r.t. score s. 

• Suppose f(A) = 3, f(B)=0, f(C)=1, P(ACB) is largest and P(BCA) is 

smallest; for the ranking based on f: ACB, swap any two, 

probability will decrease, e.g., P(ACB) > P(ABC) 

• Translation invariant, when 

( s) 

exp( s) 

• Scale invariant, when 

( s) 

 

as 


69

Distance between Ranked Lists 

φ = exp 

Using KL-divergence 

to measure difference 

between distributions 

dis(f,g) = 0.46 

dis(g,h) = 2.56 

4/20/2010 

Tie-Yan Liu @ Renda 

70

K-L Divergence Loss 

• Loss function = KL-divergence between two permutation 

probability distributions (φ = exp) 

 

 

|| 

P 

| f ( ) 

L( f ; x, 

) D P 

x 

y 

y 

 

Probability distribution defined 

by the ground truth 

Probability distribution defined 

by the model output 

• Model = Neural Network 

• Algorithm = Gradient Descent 


71

Limitations of ListNet 

• Too complex to use in practice: O(m⋅m!) 

• By using the top-k Luce model, the complexity of ListNet can 

be reduced to polynomial order of m. 

• However, 

– When k is large, the complexity is still high! 

– When k is small, the information loss in the top-k Luce model will 

affect the effectiveness of ListNet. 


Solution 

• Given the scores outputted by the ranking model, compute 

the likelihood of a ground truth permutation, and maximize it 

to learn the model parameter. 

• The complexity is reduced from O(m⋅m!) to O(m). 


ListMLE 

(F. Xia, T.-Y. Liu, et al. ICML 2008) 

• Likelihood Loss 

L( f ; x, 

) log 

P( 

| ( 

f ( x))) 

y 

y 

• Model = Neural Network 

• Algorithm = Stochastic Gradient Descent 


Extension of ListMLE 

• Dealing with other types of ground truth labels using 

the concept of “equivalent permutation set”. 

– For relevance degree 

 

– For pairwise preference 

 

y 

y 

{ 

y 

| u v,if 

l 1 

l 1 

} 

( u) 

( u ) 

{ 

y 

| u v,if 

l 1 

( u), 

( u ) 

 

y 

1 

y y 

L( f , x, 

) min log P | f ( w, 

x) 

y 

 

y 

y 

 

y 

 

y 

1} 

 

 


Discussions 

• Advantages 

– Take all the documents associated with the same 

query as the learning instance 

– Rank position is visible to the loss function 

• Problems 

– Complexity issue. 

– The use of the position information is insufficient. 


Analysis of the Approaches

Question 

• Different loss functions are used in different 

approaches, while the models learned with all 

the approaches are evaluated by IR measures. 

• What are the relationship between the loss 

functions and IR evaluation measures? 



• Many pointwise loss functions can upper 

bound (1-NDCG). 

• Minimization of the pointwise loss functions 

can lead to the maximization of NDCG. 



• For regression based algorithms (D. Cossock and T. Zhang, 

COLT 2006) 

1 

NDCG( 

f 

) 

 

1 

Z 

m 

 

 

2 

 

m 

 

j1 

 

 

j 

 

 

 

1/ 

 

 

 

m 

 

j1 

f 

( x 

j 

) 

y 

j 

 

 

 

 

 

1/ 

• For multi-class classification based algorithms (P. Li, C. Burges, 

et al. NIPS 2007) 

 

m 

m 

15 

2 

1 

NDCG( 

f ) 2 

 

j 

m 

j 

Zm 

j1 j1 

2 

m 

 

 

 

 

m 

 

j1 

I 

{ y 

yˆ 

} 

j 

j 



• The minimization of the losses is only the sufficient 

condition of minimizing (1-NDCG). 

x 

i, y i 

21.4 

xi 

, f ( xi 

) 

Z m 

x 

 

x 

x 

 

x 

1 

2 

3 

4 

,4 

 

,3 

,2 

 

,1 

 

1 

NDCG( 

f ) 0 

m 

I{ 

y f ( x )} 

4 

j j 

j1 

x 

 

x 

x 

 

x 

1 

2 

3 

4 

,3 

 

,2 

,1 

 

,0 

 

15 

Z 

m 

 

 

2 

 

m 

 

1 

 

log( j 1) 

 

j1 j1 

2 

m 

m 

 

1 

 

log( j 1) 

 

2 

m 

 

 

 

 

m 

 

j1 

I 

{ y 

j 

f ( x 

j 

)} 

1.15 1 


The Pairwise and Non-measure 

Specific Listwise Approaches 

(W. Chen, T.-Y. Liu, et al. 2009) 

• Many pairwise and non-measure specific listwise 

loss functions are surrogate functions, and also 

upper bounds of a quantity named the “essential 

loss”. 

• The essential loss is an upper bound of (1-NDCG) 

and (1-MAP). 

• The minimization of these surrogate loss 

functions can lead to the maximization of NDCG 

and MAP. 


Essential Loss for Ranking 

• Model ranking as a sequence of classifications 

y 

A 

 

B 

 

C 

 

D 

 

Ground truth permutation: 

Prediction of the ranking function: 

 

B 

y 

 

B 

A 

 

C 

D 

 

D 

C 

√ 

 

B 

y 

 

C 

D 

 

D 

C 

y { A B C D} 

{ B A D C} 

 

D 

 

C 

Classifier 


Essential Loss vs. Ranking Measures 

1) Both (1-NDCG) and (1-MAP) are upper bounded by the 

essential loss. 

2) The zero value of the essential loss is a necessary and 

sufficient condition for the zero values of (1-NDCG) and 

(1-MAP). 


Essential Loss vs. Surrogate Losses 

1) Many pairwise and listwise loss functions are upper 

bounds of the essential loss. 

2) Therefore, the pairwise and listwise loss functions are also 

upper bounds of (1-NDCG) and (1-MAP). 


Discussions 

• Revealed that different loss functions are upper 

bounds of the same thing. 

• Explained why many different pairwise and 

listwise algorithms perform similarly well. 

• Suggest better algorithms 

– The essential loss is a tighter upper bound of 

measure-based ranking errors than existing surrogate 

loss functions. 

– Our experiments showed that higher ranking 

performances can be achieved by optimizing such a 

tighter bound. 


Measure-specific Listwise Approach 

• To what degree is a surrogate measure 

correlated with the ranking measure? 

• Can the maximization of a surrogate measure 

lead to the same ranking model that 

maximizes the ranking measure? 


Tendency Correlation as a Tool 

(Y. He and T.-Y. Liu, 2010) 

0.2 

0 

Finding the best linear 0.4 transformation, by 

applying which to one of the functions, 

the maximum difference between the 

tendencies of the two functions in the 

entire input space can be minimized. 

The “tendency” of a function with 

respect to two inputs is defined as 

the difference in the values of the 

function at the two inputs. 


Theoretical Justification of Tendency 

Correlation 

If the tendency correlation of a surrogate measure is 

arbitrarily strong, its optimization will lead to the 

ranking model that can optimize the corresponding 

ranking measure, on both training and test sets. 


Major Results - SoftRank 

1) When a parameter in SoftRank is set to be small, the 

tendency correlation of its surrogate measure will 

become strong, regardless of the data distribution. 

2) However, in this case, the surrogate measure will contain 

many local optima, therefore robust global optimization 

techniques are needed. 


Major Results – SVM-MAP, etc. 

1) The surrogate measures in SVM-MAP, SVM-NDCG, DORM, 

and PermuRank are concave and easy to optimize. 

2) No matter how the parameters in these algorithms are set, 

for certain data distributions, the tendency correlation of 

their surrogate measures can never be as strong as expected. 


Summary 

• With the tool of tendency correlation, we reveal 

the relationship between surrogate measures and 

original ranking measures. 

• The theorems well explain previous empirical 

observations 

– Some methods can perform well on different datasets, 

while some others cannot on some datasets. 

• The theorems provide a guideline for trade-off 

between effectiveness and ease of optimization. 


Benchmarking Learning to Rank 

Methods

Role of a Benchmark Dataset 

• Enable researchers to focus on technologies 

instead of experimental preparations. 

• Enable fair comparison among algorithms. 

• Examples 

– Reuters and RCV-1 for text classification 

– UCI for general machine learning 


The LETOR Collection 

• A benchmark dataset for learning to rank 

– Document corpora 

– Document sampling 

– Feature extraction 

– Meta information 

– Cross validation 


Document Corpora 

• The “.gov” corpus 

• The OHSUMED corpus 

– 106 queries 


Document Sampling 


– Top 1000 documents per query returned by BM25 


– All judged documents 


Feature Extraction 


– 64 features 

– TF, IDF, BM25, LMIR, PageRank, HITS, Relevance 

Propagation, URL length, etc. 


– 40 features 

– TF, IDF, BM25, LMIR, etc. 


Meta Information 

• Statistical information about the corpus 

– Number of documents, number of streams, and number of 

(unique) terms in each stream. 

• Raw information of the documents associated with 

each query 

– Term frequency, and document length. 

• Relational information 

– Hyperlink graph, sitemap information, and similarity 

relationship matrix of the corpora. 


Cross Validation 

• Five-fold cross validation 

http://research.microsoft.com/~LETOR/ 


Algorithms under Investigation 

• The pointwise approach 

– Linear regression 

• The pairwise approach 

– Ranking SVM, RankBoost, Frank 

• The listwise approach 

– ListNet 

– SVM-MAP, AdaRank 

Linear ranking model (except RankBoost) 


Experimental Results – TD2003 


Experimental Results – TD2004 


Experimental Results – NP2003 


Experimental Results – NP2004 


Experimental Results – HP2003 


Experimental Results – HP2004 


Experimental Results - OHSUMED 


Experimental Results - Winner Numbers 


Experimental Results - Winner Numbers 

40 

35 

30 

25 

20 

15 

10 

5 

Regression 

RankSVM 

RankBoost 

FRank 

ListNet 

AdaRank 

SVMmap 

0 

N@1 N@3 N@10 P@1 P@3 P@10 MAP 


• NDCG@1 

Discussions 

– {ListNet, AdaRank} > {SVM-MAP, RankSVM} 

> {RankBoost, FRank} > {Regression} 

• NDCG@3, @10 

– {ListNet} > {AdaRank, SVM-MAP, RankSVM, RankBoost} 

> {FRank} > {Regression} 

• MAP 

– {ListNet} > {AdaRank, SVM-MAP, RankSVM} 

> {RankBoost, FRank} > {Regression} 

The listwise approach has certain advantages, especially in 

correctly predicting top-ranked documents. 


Notes 

• The experimental results can be found at the 

LETOR website 

– http://research.microsoft.com/~letor/ 

• The results are still primal 

– The implementations of the these algorithms are 

not perfect 

– Almost every algorithm can be further improved. 


Summary and Outlook

Algorithms Theory Data Activity 

Pointwise approach 

Pairwise approach 

Listwise approach 

- non-measure specific 

- measure specific 

• NIPS workshop on 

learning to rank 

• ICML workshop on 

structure output 

• 1 st SIGIR workshop 

on LR4IR 

• WWW tutorial • WWW tutorial 

• SIGIR tutorial • ACL tutorial 

• 2 nd LR4IR workshop • IRJ special issue 

• 1 st SIGIR workshop on • 3 rd LR4IR workshop 

diversity 

• 2 nd Diversity workshop 

• LETOR 1.0 

• LETOR 2.0 

• LETOR 3.0 • LETOR 4.0 

• Pranking 

• DMIR 

• Ranking SVM 

• RankBoost 

• Gaussian process for 

ordinal regression 

• RankNet 

• Consistency of 

pointwise methods 

• Subset Ranking 

• IR-SVM 

• LambdaRank 

• P-norm push 

• Nested Ranker 

• RankCosine 

• MC-Rank 

• GBRank 

• MHR 

• Frank 

• ListNet 

• Supervised RA 

• RankGP 

• AdaRank 

• SVM-MAP 

• Generalization of 

pairwise methods 

• Consistency of 

Listwise methods 

• IsoRank 

• QBRank 

• CDN Ranker 

• ListMLE 

• Relational Ranking 

• GlobalRank 

• Query-dependent Ranker 

• AppRank 

• PermuRank 

• SoftRank 

• G-SoftRank 

• SVM-NDCG 

• Generalization of 

listwise methods 

< 2005 2005 2006 2007 2008 2009 

• OWA Rank 

• BoltzmanRank 

• Bayes-Luce 

• SmoothRank 

• Robust sparse 

ranker 

• … … 


Answer to Question 1 

• To what respect are these learning to rank algorithms 

similar and in which aspects do they differ? What are 

the strengths and weaknesses of each algorithm? 

Approach Modeling Pro Con Connection 

Pointwise 

Pairwise 

Listwise 

Regression/ 

classification/ 

ordinal 

regression 

Pairwise 

classification 

Ranking 

Easy to leverage 

existing theories 

and algorithms 

query-level and 

position based 

• Accurate ranking 

≠ Accurate score or 

category 

• Position info is 

invisible to the loss 

• More complex 

• New theory 

needed 

•Effectively 

minimize (1- 

NDCG). 



• Empirically speaking, which of those many learning 

to rank algorithms perform the best? 

– LETOR makes it possible to perform fair comparisons 

among different learning to rank methods. 

– Empirical studies on LETOR have shown that the listwise 

ranking algorithms seem to have certain advantages over 

other algorithms, especially in predicting the top-ranked 

documents, and the pairwise ranking algorithms seem to 

outperform the pointwise algorithms. 



• Are there many remaining issues regarding learning 

to rank to study in the future? 

– Learning from logs 

– Feature engineering 

– Advanced ranking model 

– Ranking theory 


Learning from Logs 

• Data scale matters a lot! 

– Ground truth mining from logs is one of the possible 

ways to construct large-scale training data. 

• Ground truth mining may lose information . 

– User sessions, frequency of clicking a certain 

document, frequency of a certain click pattern, and 

diversity in the intentions of different users. 

• Directly learn from logs. 

– E.g., Regard click-through logs as the “ground truth”, 

and define its likelihood as the loss function. 


Feature Engineering 

• Ranking is deeply rooted in IR 

– Not just new algorithms and theories. 

• Feature engineering is very importance 

– How to encode the knowledge on IR accumulated 

in the past half a century in the features? 

– Currently, these kinds of works have not been 

given enough attention to. 


Advanced Ranking Model 

• Listwise ranking function 

– Natural but challenging 

– Complexity is a big issue 

• Generative ranking model 

– Why not generative? 

• Semi-supervised ranking 

– The difference from semi-supervised classification 


Ranking Theory 

• A more unified view on loss functions 

• True loss for ranking 

• Statistical consistency for ranking 

• Complexity of ranking function class 

• …… 


tyliu@microsoft.com 

http://research.microsoft.com/users/tyliu/

Learning to Rank for Information Retrieval

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?