# An Alternative Cross Entropy Loss for Learning-to-Rank

Listwise learning-to-rank methods form a powerful class of ranking algorithms that are widely adopted in applications such as information retrieval. These algorithms learn to rank a set of items by optimizing a loss that is a function of the entire set—as a surrogate to a typically non-differentiable ranking metric. Despite their empirical success, existing listwise methods are based on heuristics and remain theoretically ill-understood. In particular, none of the empirically-successful loss functions are related to ranking metrics. In this work, we propose a cross entropy-based learning-to-rank loss function that is theoretically sound and is a convex bound on NDCG, a popular ranking metric. Furthermore, empirical evaluation of an implementation of the proposed method with gradient boosting machines on benchmark learning-to-rank datasets demonstrates the superiority of our proposed formulation over existing algorithms in quality and robustness.


## 1 Introduction

Learning-to-rank or supervised ranking is a central problem in a range of applications including web search, recommendation systems, and question answering. The task is to learn a function that, conditioned on some context, arranges a set of items into an ordered list so as to maximize a given metric. In this work, without loss of generality, we take search as an example where a set of documents (items) are ranked by their relevance to a query (context).

Rather than working directly with permutations, learning-to-rank methods typically approach ranking as a "score and sort" problem: the objective is to learn a "scoring" function that measures the relevance of each document with respect to a query, and documents are subsequently sorted by decreasing score to form a ranked list. Ideally, the resulting ranked list maximizes a ranking metric.

Popular ranking metrics are instances of the general class of conditional linear rank statistics (Clémençon and Vayatis, 2008) that summarize the Receiver Operating Characteristic (ROC) curve. Of particular interest are the ranking statistics that care mostly about the leftmost portion of the ROC curve, corresponding to the top of the ranked list. Mean Reciprocal Rank and Normalized Discounted Cumulative Gain (Järvelin and Kekäläinen, 2002) are two such metrics that are widely used in information retrieval applications.

Ranking metrics, as functions of learning-to-rank scores, are flat almost everywhere; a small perturbation of scores is unlikely to lead to a change in the metric. This property poses a challenge for gradient-based optimization algorithms, making a direct optimization of ranking metrics over a complex hypothesis space infeasible. Addressing this challenge has been the focus of a large body of research (Liu, 2009), with most considering smooth loss functions as surrogates to metrics.

The majority of the proposed surrogate loss functions (Cao et al., 2007; Burges et al., 2005; Burges, 2010; Xia et al., 2008; Joachims, 2006), however, are only loosely related to ranking metrics such as NDCG. ListNet (Cao et al., 2007), as an example, projects labels and scores onto the probability simplex and minimizes the cross entropy between the resulting distributions. LambdaMART (Burges, 2010; Wu et al., 2010), denoted λMART in the remainder of this work, as another example, forgoes the loss function altogether and heuristically formulates the gradients.

The heuristic nature of learning-to-rank surrogate loss functions and a lack of theoretical justification for their use have hindered progress in the field. While λMART remains the state-of-the-art to date, the fact that its loss function, presumed to be smooth, is unknown makes a theoretical analysis of the algorithm difficult. Empirical improvements over existing methods remain marginal for similar reasons.

In this work, we are motivated to help close the gap above. To that end, we present a construction of the cross entropy loss, dubbed XE-NDCG, that differs only slightly from the ListNet loss but enjoys strong theoretical properties. In particular, we prove that our construction is a convex bound on negative (translated and log-transformed) mean NDCG, where NDCG, a utility, is turned into a cost by negation, thereby lending credence to its optimization for the purpose of learning ranking functions. Furthermore, we show that the generalization error bound of XE-NDCG compares favorably with that of λMART. Experiments on benchmark learning-to-rank datasets further reveal the empirical superiority of our proposed method. We anticipate that the theoretical soundness of our method and its strong connection to ranking metrics will enable future research and progress.

Our contributions can be summarized as follows:


• We present a cross entropy-based loss function, dubbed XE-NDCG, for learning-to-rank and prove that it is a convex bound on negative (translated and log-transformed) mean NDCG;

• We compare model complexity between XE-NDCG and λMART;

• We formulate an approximation to the inverse Hessian of XE-NDCG for optimization with second-order methods; and,

• We optimize XE-NDCG to learn gradient boosted regression trees (denoted XE-MART) and compare its performance and robustness with λMART on benchmark learning-to-rank datasets through extensive randomized experiments.

This document is organized as follows. Section 2 reviews existing work on learning-to-rank. In Section 3, we introduce the notation adopted in this work and formulate the problem. Section 4 presents a detailed description of our proposed learning-to-rank loss function and examines its theoretical properties, including a comparison of generalization error bounds. We empirically evaluate our proposed method and report our findings in Section 5. Finally, we conclude this work in Section 6.

## 2 Related Work

A large class of learning-to-rank methods attempt to optimize pairwise misranking error—a popular ranking statistic in many prioritization problems—by learning to correctly classify pairwise preferences. Examples include RankSVM (Joachims, 2006) and AdaRank (Xu and Li, 2007), which learn margin classifiers; RankNet (Burges et al., 2005), which optimizes a probabilistic loss function; and the P-Norm Push method (Rudin, 2009), which extends the problem to settings where we mostly care about the top of the ranked list. While the so-called "pairwise" methods typically optimize convex upper-bounds of the misranking error, direct optimization methods based on mathematical programming have also been proposed (Rudin and Wang, 2018), albeit for linear hypothesis spaces.

Pairwise learning-to-rank methods, while generally effective, optimize loss functions that are misaligned with more complex ranking statistics such as Expected Reciprocal Rank (Chapelle et al., 2009) or NDCG (Järvelin and Kekäläinen, 2002). This discrepancy has given rise to the so-called “listwise” learning-to-rank methods, where the loss function under optimization is defined over the entire list of items, not just pairs.

Listwise learning-to-rank methods either derive a smooth approximation to ranking metrics or use heuristics to construct smooth surrogate loss functions. Representative of the first class are SoftRank (Taylor et al., 2008), which takes every score to be the mean of a Gaussian distribution, and ApproxNDCG (Qin et al., 2010), which approximates the indicator function (used in the computation of ranks given scores) with a generalized sigmoid.

The other class of listwise learning-to-rank methods includes ListMLE (Xia et al., 2008), ListNet (Cao et al., 2007), and λMART (Wu et al., 2010; Burges, 2010). ListMLE maximizes the log-likelihood based on the Plackett-Luce probabilistic model, a loss function that is disconnected from ranking metrics. ListNet minimizes the cross entropy between the ground-truth and score distributions. Though a recent work (Bruch et al., 2019a) establishes a link between the ListNet loss function and NDCG under strict conditions (requiring binary relevance labels), in a general setting its loss is only loosely related to ranking metrics.

λMART is a gradient boosting machine (Friedman, 2001) that forgoes the loss function altogether and, instead, directly designs the gradients of its unknown loss function using heuristics. While a recent work (Wang et al., 2018) claims to have found λMART's loss function, it overlooks an important detail: the reported loss function is not differentiable.

There is abundant evidence to suggest listwise methods are empirically superior to pairwise methods where MRR, ERR, or NDCG is used to determine ranking quality (Wang et al., 2018; Bruch et al., 2019b; Liu, 2009). However, unlike pairwise methods, listwise algorithms remain theoretically ill-understood. Past studies have examined the generalization error bounds for existing surrogate loss functions (Tewari and Chaudhuri, 2015; Chapelle and Wu, 2010; Lan et al., 2009), but little attention has been paid to the validity of such functions which could shed light on their empirical success.

## 3 Preliminaries

In this section, we formalize the problem and introduce our notation. To simplify exposition, we write vectors in bold and use subscripts to index their elements (e.g., $x_i$ is the $i$-th element of $\mathbf{x}$).

Let $(\mathbf{x}, \mathbf{y}) \in \mathcal{X}^m \times \mathcal{Y}^m$ be a training example comprising $m$ items $\mathbf{x}$ and relevance labels $\mathbf{y}$, where $\mathcal{X}$ is the bounded space of items or item-context pairs represented by $d$-dimensional feature vectors, and $\mathcal{Y}$ is the space of relevance labels. For consistency with existing work on listwise learning-to-rank, we refer to each $x_i$ as a "document." Note, however, that $x_i$ could be the representation of any general item or item-context pair. We assume the training set $\Psi$ consists of $n$ such examples.

We denote a learning-to-rank scoring function by $f$ and assume $f \in \mathcal{F}$, where $\mathcal{F}$ is a compact hypothesis space of bounded functions endowed with the uniform norm. For brevity, we denote $f(x_i)$ by $f_i$ and, with a slight abuse of notation, define $f(\mathbf{x}) \triangleq (f_1, \ldots, f_m)$, the vector of scores for documents in $\mathbf{x}$.

As noted in earlier sections, the goal is to learn a scoring function that minimizes the empirical risk:

$$\mathcal{L}(f) = \frac{1}{|\Psi|} \sum_{(\mathbf{x},\mathbf{y}) \in \Psi} \ell(\mathbf{y}, f(\mathbf{x})), \tag{1}$$

where $\ell$ is by assumption a smooth loss function.

ListNet: The loss in ListNet (Cao et al., 2007) first projects labels and scores onto the probability simplex to form distributions $\phi_{\text{ListNet}}$ and $\rho_{\text{ListNet}}$, respectively. Given the two distributions, the loss is their distance as measured by cross entropy:

$$\ell(\mathbf{y}, f(\mathbf{x})) \triangleq -\sum_{i=1}^{m} \phi_{\text{ListNet}}(y_i) \log \rho_{\text{ListNet}}(f_i). \tag{2}$$

The distributions $\phi_{\text{ListNet}}$ and $\rho_{\text{ListNet}}$ may be understood as encoding the likelihood of document $x_i$ appearing at the top of the ranked list, referred to as the "top one" probability, according to the labels and scores respectively. In the original publication (Cao et al., 2007), $\phi_{\text{ListNet}}$ and $\rho_{\text{ListNet}}$ are defined as follows:

$$\phi_{\text{ListNet}}(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{m} e^{y_j}}, \qquad \rho_{\text{ListNet}}(f_i) = \frac{e^{f_i}}{\sum_{j=1}^{m} e^{f_j}}. \tag{3}$$
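To make Equations (2) and (3) concrete, the following minimal Python sketch (with made-up labels and scores) computes the ListNet loss:

```python
import math

def softmax(v):
    # Project labels or scores onto the probability simplex, Equation (3).
    # Subtracting the max before exponentiating improves numerical stability.
    mx = max(v)
    exps = [math.exp(x - mx) for x in v]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_loss(y, f):
    # Cross entropy between the "top one" label and score distributions, Equation (2).
    phi = softmax(y)
    rho = softmax(f)
    return -sum(p * math.log(r) for p, r in zip(phi, rho))

y = [3.0, 0.0, 1.0]    # graded relevance labels (made up)
f = [2.1, -0.4, 0.7]   # model scores (made up)
loss = listnet_loss(y, f)
```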

λMART: The loss in λMART is unknown, but its gradients with respect to the scoring function are designed as follows:

$$\frac{\partial \ell}{\partial f_i} = \sum_{y_i > y_j} \frac{\partial \ell_{ij}}{\partial f_i} + \sum_{y_k > y_i} \frac{\partial \ell_{ki}}{\partial f_i}, \tag{4}$$

where

$$\frac{\partial \ell_{mn}}{\partial f_m} = \frac{-\sigma\,|\Delta \text{NDCG}_{mn}|}{1 + e^{\sigma (f_m - f_n)}} = -\frac{\partial \ell_{nm}}{\partial f_m}, \tag{5}$$

where $\sigma$ is a hyperparameter and $|\Delta \text{NDCG}_{mn}|$ is the change in NDCG if the documents at ranks $m$ and $n$ are swapped. Finally, NDCG is defined as follows:

$$\text{NDCG}(\pi_f, \mathbf{y}) = \frac{\text{DCG}(\pi_f, \mathbf{y})}{\text{DCG}(\pi_y, \mathbf{y})}, \tag{6}$$

where $\pi_f$ is the ranked list induced by $f$ on $\mathbf{x}$, $\pi_y$ is the ideal ranked list (where $\mathbf{x}$ is sorted by $\mathbf{y}$), and DCG is defined as follows:

$$\text{DCG}(\pi, \mathbf{y}) = \sum_{i=1}^{m} \frac{2^{y_i} - 1}{\log_2(1 + \pi[i])}, \tag{7}$$

with $\pi[i]$ denoting the rank of document $x_i$.
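Equations (6) and (7) translate directly into code. The sketch below uses hypothetical scores and labels:

```python
import math

def dcg(ranked_labels):
    # Equation (7): gain 2^y - 1 discounted by log2(1 + rank); ranks start at 1.
    return sum((2 ** y - 1) / math.log2(1 + rank)
               for rank, y in enumerate(ranked_labels, start=1))

def ndcg(scores, labels):
    # Equation (6): DCG of the score-induced ordering (pi_f),
    # normalized by DCG of the ideal, label-sorted ordering (pi_y).
    by_score = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    ideal = sorted(labels, reverse=True)
    return dcg(by_score) / dcg(ideal)

quality = ndcg([1.2, 0.3, -0.5], [1, 3, 0])  # hypothetical scores and labels
```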

## 4 Proposed Method

In this section, we show how a slight modification to the ListNet loss function equips the loss with interesting theoretical properties. To avoid conflating implementation details with the loss function itself, we name our proposed loss function XE-NDCG.

###### Definition 1.

For a training example $(\mathbf{x}, \mathbf{y})$ and scores $f(\mathbf{x})$, we define XE-NDCG as the cross entropy between the score distribution $\rho$ and a parameterized class of label distributions $\phi$, defined as follows:

$$\rho(f_i) = \frac{e^{f_i}}{\sum_{j=1}^{m} e^{f_j}}, \qquad \phi(y_i; \gamma) = \frac{2^{y_i} - \gamma_i}{\sum_{j=1}^{m} (2^{y_j} - \gamma_j)},$$

where $\gamma \in [0, 1)^m$.

In effect, the distribution $\phi$ allocates to each document a mass proportional to a value in the interval $(2^{y_i} - 1,\, 2^{y_i}]$. As we will explain later, the vector $\gamma$ plays an important role in certain theoretical properties of our proposed loss function. Note that, in general, $\gamma$ may be unique to each training example $(\mathbf{x}, \mathbf{y})$.
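As an illustration of Definition 1, the following sketch (with hypothetical labels, scores, and γ values) computes the proposed loss; note that the only difference from ListNet is the label distribution:

```python
import math

def proposed_loss(y, f, gamma):
    # Label distribution phi(y; gamma) from Definition 1:
    # each document receives mass proportional to 2^y_i - gamma_i, gamma_i in [0, 1).
    w = [2 ** yi - gi for yi, gi in zip(y, gamma)]
    phi = [wi / sum(w) for wi in w]
    # Score distribution rho(f): a softmax, identical to ListNet's.
    mx = max(f)
    e = [math.exp(fi - mx) for fi in f]
    rho = [ei / sum(e) for ei in e]
    # Cross entropy between the two distributions.
    return -sum(p * math.log(r) for p, r in zip(phi, rho))

y = [2, 0, 1]                 # hypothetical graded labels
f = [1.0, -0.2, 0.4]          # hypothetical scores
loss_a = proposed_loss(y, f, gamma=[0.0, 0.0, 0.0])
loss_b = proposed_loss(y, f, gamma=[0.9, 0.1, 0.5])
```

Different choices of γ yield different label distributions, and hence different loss values for the same scores.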

### 4.1 Relationship to NDCG

The difference between XE-NDCG and ListNet is minor but consequential: the change to the definition of $\phi$ leads to our main result.

###### Theorem 1.

XE-NDCG is an upper-bound on negative (translated and log-transformed) mean Normalized Discounted Cumulative Gain.

Theorem 1 asserts that XE-NDCG is a convex proxy to minimizing negative NDCG (where we turn NDCG, which is a utility, into a cost by negation). No such analytical link exists between λMART, ListNet, or other listwise learning-to-rank loss functions and ranking metrics.

In proving Theorem 1, we make use of Jensen's inequality applied to the $\log$ function:

$$\log \mathbb{E}[X] \geq \mathbb{E}[\log X], \tag{8}$$

where $X$ is a random variable and $\mathbb{E}$ denotes expectation. We also use the following bound on ranks, originally derived in (Bruch et al., 2019a):

$$\pi[r] = 1 + \sum_{i \neq r} \mathbb{1}_{f_i > f_r} = 1 + \sum_{i \neq r} \mathbb{1}_{f_i - f_r > 0} \leq 1 + \sum_{i \neq r} e^{f_i - f_r} = \sum_{i} e^{f_i - f_r} = \frac{\sum_i e^{f_i}}{e^{f_r}},$$

where $\mathbb{1}_p$ is the indicator function, taking value $1$ when the predicate $p$ is true and $0$ otherwise. The above leads to:

$$\frac{1}{\pi[r]} \geq \frac{e^{f_r}}{\sum_i e^{f_i}} = \rho(f_r). \tag{9}$$
###### Proof.

Consider $\text{DCG}(\pi_y, \mathbf{y})$. Using $\log_2(1 + \pi_y[i]) \geq 1$:

$$\text{DCG}(\pi_y, \mathbf{y}) = \sum_{i=1}^{m} \frac{2^{y_i} - 1}{\log_2(1 + \pi_y[i])} \leq \sum_{i=1}^{m} (2^{y_i} - \gamma_i), \tag{10}$$

for $\gamma_i \leq 1$, which holds by Definition 1.

Turning to $\text{DCG}(\pi_f, \mathbf{y})$ and using $\log_2(1 + n) \leq n$ for a positive integer $n$, or equivalently $1/\log_2(1 + n) \geq 1/n$, we have the following:

$$\begin{aligned} \text{DCG}(\pi_f, \mathbf{y}) = \sum_r \frac{2^{y_r} - 1}{\log_2(1 + \pi_f[r])} &\geq \sum_r \frac{2^{y_r} - 1}{\pi_f[r]} \geq \sum_r (2^{y_r} - 1)\,\rho(f_r) \\ &= \Big[\sum_r 2^{y_r} \rho(f_r)\Big] - 1 \geq \Big[\sum_r (2^{y_r} - \gamma_r)\,\rho(f_r)\Big] - 1, \end{aligned} \tag{11}$$

where the second inequality holds by Equation (9), and the last because $\gamma_r \geq 0$.

Finally, consider a translation (by a constant) and $\log$-transformation of mean NDCG, $\overline{\text{NDCG}}$, as follows:

$$\widetilde{\text{NDCG}} \triangleq \log\Big(\overline{\text{NDCG}} + \frac{1}{|\Psi|} \sum_{(\mathbf{x},\mathbf{y})} \frac{1}{\text{DCG}(\pi_y, \mathbf{y})}\Big).$$

Given the monotonicity of $\log$, the maximizer of $\widetilde{\text{NDCG}}$ also maximizes $\overline{\text{NDCG}}$. We now proceed as follows:

$$\begin{aligned} \widetilde{\text{NDCG}} &= \log \frac{1}{|\Psi|} \sum_{(\mathbf{x},\mathbf{y})} \frac{\text{DCG}(\pi_f, \mathbf{y}) + 1}{\text{DCG}(\pi_y, \mathbf{y})} \\ &\geq \log \frac{1}{|\Psi|} \sum_{(\mathbf{x},\mathbf{y})} \frac{\text{DCG}(\pi_f, \mathbf{y}) + 1}{\sum_j (2^{y_j} - \gamma_j)} \\ &\geq \log \frac{1}{|\Psi|} \sum_{(\mathbf{x},\mathbf{y})} \sum_r \phi(y_r)\,\rho(f_r) \quad (12) \\ &\geq \frac{1}{|\Psi|} \sum_{(\mathbf{x},\mathbf{y})} \sum_r \phi(y_r) \log \rho(f_r), \quad (13) \end{aligned}$$

where the first inequality holds by Equation (10), the second by Equation (11) together with Definition 1, and the last by repeated applications of Equation (8). Finally, negating both sides completes the proof. ∎
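The rank bound in Equation (9), on which the proof relies, is easy to check numerically. The sketch below uses arbitrary scores:

```python
import math

def softmax(f):
    mx = max(f)
    e = [math.exp(x - mx) for x in f]
    return [x / sum(e) for x in e]

f = [1.2, -0.3, 0.8, 2.5, 0.0]  # arbitrary scores
rho = softmax(f)
# pi[r]: rank of document r, i.e., 1 + number of documents scored strictly higher.
ranks = [1 + sum(fj > fi for fj in f) for fi in f]
# Equation (9): the reciprocal rank dominates the softmax probability.
checks = [1.0 / r >= p for r, p in zip(ranks, rho)]
```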

### 4.2 Effect of γ

In the proof of Theorem 1, we made use of Jensen’s inequality as defined in Equation (8). In particular, the inequality from Equation (12) to Equation (13) holds by repeated applications of Jensen’s inequality, the last of which involves:

$$\log \mathbb{E}_{\phi}[\rho(f_r)] = \log \sum_r \phi(y_r)\,\rho(f_r) \geq \sum_r \phi(y_r) \log \rho(f_r) = \mathbb{E}_{\phi}[\log \rho(f_r)].$$

The gap between the LHS and the RHS in the above, known as the Jensen gap, contributes to the tightness of the bound on the ranking metric. This gap, and therefore the tightness of the resulting bound, can be controlled by the distribution $\phi$. This is illustrated in Figure 1 with simulated data points. The tightest bound can be achieved by solving the following constrained optimization problem per training example, given scores $f(\mathbf{x})$:

$$\begin{aligned} \underset{\gamma}{\text{minimize}} \quad & \log \sum_r \phi(y_r)\,\rho(f_r) - \sum_r \phi(y_r) \log \rho(f_r) \\ \text{such that} \quad & \phi(y_r) = \frac{2^{y_r} - \gamma_r}{\sum_j (2^{y_j} - \gamma_j)}. \end{aligned}$$

In addition to its effect on the Jensen gap, $\gamma$ affects the tightness of the bounds in Equations (10) and (11). Solving these optimization problems jointly per training example at every step of training, however, is nontrivial. In this work, as we will elaborate in Section 5, we reduce the problem of choosing $\gamma$ to a more tractable one by treating it as a hyperparameter that may be tuned on a validation dataset.
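The random-search strategy we adopt later in Section 5 can be sketched as follows (labels, scores, and the number of candidates are made up):

```python
import math
import random

def distributions(y, f, gamma):
    # phi from Definition 1 and the softmax score distribution rho.
    w = [2 ** yi - gi for yi, gi in zip(y, gamma)]
    phi = [wi / sum(w) for wi in w]
    e = [math.exp(fi) for fi in f]
    rho = [ei / sum(e) for ei in e]
    return phi, rho

def jensen_gap(y, f, gamma):
    # log E_phi[rho] - E_phi[log rho]; nonnegative by Jensen's inequality.
    phi, rho = distributions(y, f, gamma)
    return (math.log(sum(p * r for p, r in zip(phi, rho)))
            - sum(p * math.log(r) for p, r in zip(phi, rho)))

random.seed(13)
y, f = [2, 0, 1, 0], [0.9, -0.1, 0.3, -0.4]  # hypothetical example
# Draw candidate gammas uniformly from [0, 1)^m and keep the one with the smallest gap.
candidates = [[random.random() for _ in y] for _ in range(10)]
gaps = [jensen_gap(y, f, g) for g in candidates]
best = min(candidates, key=lambda g: jensen_gap(y, f, g))
```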

### 4.3 Comparison with λMART

In this section, we compare XE-NDCG with λMART in terms of model complexity and generalization error. In what follows, we proceed under the strong assumption that the loss optimized by λMART in fact exists. That is, we assume that there exists a differentiable function $\ell$ that satisfies Equation (4).

We begin with an examination of the Lipschitz constant of the two loss functions—an upper-bound on the variation a function can exhibit. Intuitively, functions with a smaller Lipschitz constant are simpler because they vary at a slower rate, and thus generalize better.

###### Proposition 1.

The λMART loss is $\sigma m^2$-Lipschitz with respect to $\lVert \cdot \rVert_\infty$.

###### Proof.

Recall the definition of the Lipschitz constant for a differentiable function $h$:

$$\text{Lip}_h = \sup_{f, f'} \frac{|h(f) - h(f')|}{\lVert f - f' \rVert} = \sup_{f, f'} \frac{|\nabla_f h(f'')\,(f - f')|}{\lVert f - f' \rVert} = \sup_f \lVert \nabla_f h(f) \rVert_*,$$

where the second equality holds by the Mean Value Theorem and the last by the definition of the dual norm, $\lVert \cdot \rVert_*$. Therefore, to derive the Lipschitz constant of a function with respect to the infinity norm, it is sufficient to calculate the $\ell_1$ norm of its gradient. Given that λMART's loss function is unknown, we resort to this strategy to derive its Lipschitz constant.

Observe that the terms in Equation (5) are bounded by $\sigma$, and Equation (4) has at most $m$ such terms. As such, we have that

$$\Big|\frac{\partial \ell}{\partial f_i}\Big| \leq \sigma m.$$

Then,

$$\lVert \nabla_f \ell \rVert_1 = \sum_{i=1}^{m} \Big|\frac{\partial \ell}{\partial f_i}\Big| \leq \sum_{i=1}^{m} \sigma m = \sigma m^2,$$

which completes the proof. ∎
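The per-coordinate bound $|\partial\ell/\partial f_i| \leq \sigma m$ can be checked numerically. The sketch below is one illustrative reading of the designed gradients in Equations (4) and (5), with $|\Delta\text{NDCG}_{mn}|$ computed by swapping pairs in the current ranking; it is not a reproduction of any particular implementation:

```python
import math

def lambda_gradients(y, f, sigma=1.0):
    # Sketch of Equations (4) and (5) on a single example.
    m = len(y)
    order = sorted(range(m), key=lambda i: -f[i])
    rank = [0] * m
    for r, i in enumerate(order, start=1):
        rank[i] = r
    ideal_dcg = sum((2 ** yv - 1) / math.log2(1 + r)
                    for r, yv in enumerate(sorted(y, reverse=True), start=1))
    grad = [0.0] * m
    for i in range(m):
        for j in range(m):
            if y[i] > y[j]:
                # |Delta NDCG| of swapping the documents at ranks rank[i] and rank[j].
                delta = abs((2 ** y[i] - 2 ** y[j])
                            * (1 / math.log2(1 + rank[i])
                               - 1 / math.log2(1 + rank[j]))) / ideal_dcg
                g = -sigma * delta / (1 + math.exp(sigma * (f[i] - f[j])))  # Equation (5)
                grad[i] += g   # pulls the more relevant document up
                grad[j] -= g   # pushes the less relevant document down
    return grad

y = [2, 0, 3, 1]          # hypothetical labels
f = [0.4, 1.1, -0.3, 0.2] # hypothetical scores
grads = lambda_gradients(y, f, sigma=1.0)
```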

###### Proposition 2.

XE-NDCG is $2$-Lipschitz with respect to $\lVert \cdot \rVert_\infty$.

###### Proof.

We present the proof in the appendix due to space constraints. ∎

In order to put this difference into perspective, we use the results above to derive bounds on the generalization error of the two algorithms. But first we need the following result.

###### Theorem 2.

Let $\mathcal{F}$ be a compact space of bounded functions from $\mathcal{X}^m$ to $\mathbb{R}^m$, $n$ the number of training examples, $\text{Lip}_\ell$ the Lipschitz constant of the loss function $\ell$, and $\mathcal{N}(r, \mathcal{F}, \lVert\cdot\rVert_\infty)$ the covering number of $\mathcal{F}$ by balls of radius $r$. The following generalization error bound holds:

$$\mathbb{P}\{\mathcal{E}(f) \leq \epsilon\} \geq 1 - 2\,\mathcal{N}\Big(\frac{\epsilon}{4\,\text{Lip}_\ell}, \mathcal{F}, \lVert\cdot\rVert_\infty\Big) \exp\Big(\frac{-2 n \epsilon^2}{\text{Lip}_\ell^2}\Big),$$

where the generalization error is defined as follows:

$$\mathcal{E}(f) \triangleq \mathbb{E}_{\mathcal{X}^m \times \mathcal{Y}^m}[\ell(\mathbf{y}, f(\mathbf{x}))] - \frac{1}{n} \sum_{(\mathbf{x},\mathbf{y}) \in \Psi} \ell(\mathbf{y}, f(\mathbf{x})).$$
###### Proof.

The proof builds on those in (Cucker and Smale, 2002; Rudin, 2009) and, for completeness, is presented in the appendix. ∎

The dependence of the generalization error bound on the Lipschitz constant suggests that, unlike λMART's, XE-NDCG's generalization error does not degrade as the number of documents per training example increases. Given λMART's higher complexity, we hypothesize that the algorithm is less robust to noise and to settings where the number of documents per training example is large.

We note that the independence of the ListNet generalization error bound from $m$ was also reported in (Tewari and Chaudhuri, 2015) for linear models, but we present the (structure of the) bounds here to allow a direct comparison between XE-NDCG and λMART.

### 4.4 Approximating the Inverse Hessian

In this work, we fix the hypothesis space $\mathcal{F}$ to Gradient Boosted Regression Trees (GBRTs). This is, in part, because we are interested in a fair comparison of ListNet, λMART, and XE-NDCG in isolation from other factors, as explained in Section 5. As most GBRT learning algorithms use second-order optimization methods (e.g., Newton's), however, we must approximate the inverse Hessian for ListNet and XE-NDCG.

Unfortunately, $\rho$ as defined in Definition 1 results in a singular Hessian, making the loss incompatible with a straightforward implementation of Newton's second-order method. We resolve this technical difficulty by making a small adjustment to the formulation of the loss function.

Let us re-define the score distribution $\rho$ from Definition 1 as follows, for a negligible $\epsilon > 0$:

$$\rho(f_i) = \frac{e^{f_i}}{\sum_{j=1}^{m} e^{f_j} + \epsilon}. \tag{14}$$

In effect, we take away a small probability mass, $\epsilon / (\sum_j e^{f_j} + \epsilon)$, from the score distribution and assign it to a nonexistent document with label probability $0$. The gradients of the loss then take the following form:

$$\frac{\partial \ell}{\partial f_r} = \frac{\partial}{\partial f_r}\Big[\sum_i \big(-\phi(y_i)\, f_i\big) + \log\Big(\sum_j e^{f_j} + \epsilon\Big)\Big] = -\phi_r + \rho_r,$$

where $\phi_r \triangleq \phi(y_r)$ and $\rho_r \triangleq \rho(f_r)$. The Hessian is as follows:

$$H_{ij} = \begin{cases} \rho_i (1 - \rho_i), & i = j \\ -\rho_i \rho_j, & i \neq j. \end{cases}$$
###### Claim 1.

The Hessian, as defined above, is positive definite.

###### Proof.

A complete proof may be found in the appendix. Observe that $H$ is strictly diagonally dominant:

$$|H_{kk}| = \rho_k (1 - \rho_k) = \rho_k \Big(1 - \frac{e^{f_k}}{\sum_j e^{f_j} + \epsilon}\Big) = \rho_k\, \frac{\sum_{j \neq k} e^{f_j} + \epsilon}{\sum_j e^{f_j} + \epsilon} > \rho_k \sum_{j \neq k} \rho_j = \sum_{j \neq k} |H_{kj}|.$$

By the properties of strictly diagonally dominant matrices and the fact that the diagonal elements of $H$ are positive, we have that $H \succ 0$ and is therefore invertible. ∎

We now turn to approximating the inverse of $H$ as required. Write $H = D(I - S)$, where $I$ is the identity matrix, $D$ is a diagonal matrix with $D_{ii} = \rho_i (1 - \rho_i)$, and $S$ is a square matrix where

$$S_{ij} = \begin{cases} 0, & i = j \\ \rho_j / (1 - \rho_i), & i \neq j. \end{cases}$$
###### Claim 2.

The spectral radius of $S$ is strictly less than $1$.

###### Proof.

A complete proof is presented in the appendix. $S$ is a square matrix with nonnegative entries. By the Perron-Frobenius theorem, its spectral radius is bounded above by the maximum row-wise sum of its entries, which, for $S$, is strictly less than $1$. ∎

Claim 2 allows us to apply the Neumann series to approximate $(I - S)^{-1}$ as follows:

$$(I - S)^{-1} = \sum_{k=0}^{\infty} S^k \approx I + S + S^2.$$

Using this result, we may approximate $H^{-1}$ as follows:

$$H^{-1} = (I - S)^{-1} D^{-1} \approx (I + S + S^2)\, D^{-1}.$$

With that, we can finally calculate the update rule in Newton's method, which requires the quantity $H^{-1} \nabla$:

$$\begin{aligned} (H^{-1} \nabla)_k = \sum_i H^{-1}_{ki} \nabla_i &\approx \sum_i (I + S + S^2)_{ki}\, (D^{-1} \nabla)_i = \sum_i (I + S + S^2)_{ki}\, \frac{-\phi_i + \rho_i}{\rho_i (1 - \rho_i)} \\ &= \frac{-\phi_k + \rho_k}{\rho_k (1 - \rho_k)} + \frac{1}{1 - \rho_k} \sum_{i \neq k} \frac{-\phi_i + \rho_i}{1 - \rho_i} + \frac{1}{1 - \rho_k} \sum_{i \neq k} \rho_i\, (S D^{-1} \nabla)_i \\ &= \frac{-\phi_k + \rho_k + \rho_k \sum_{i \neq k} \frac{-\phi_i + \rho_i}{1 - \rho_i} + \rho_k \sum_{i \neq k} \rho_i\, (S D^{-1} \nabla)_i}{\rho_k (1 - \rho_k)}, \end{aligned}$$

where the three terms on the second line are $(I D^{-1} \nabla)_k$, $(S D^{-1} \nabla)_k$, and $(S^2 D^{-1} \nabla)_k$, respectively.
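The factorization $H = D(I - S)$ and Claims 1 and 2 can be verified numerically on small examples. The following pure-Python sketch (with arbitrary scores) does so:

```python
import math

def matmul(A, B):
    # Plain matrix multiplication over nested lists.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

f, eps = [0.5, 1.5, -0.2], 1e-3       # arbitrary scores and smoothing constant
z = sum(math.exp(x) for x in f) + eps
rho = [math.exp(x) / z for x in f]    # smoothed score distribution, Equation (14)
n = len(f)

H = [[rho[i] * (1 - rho[i]) if i == j else -rho[i] * rho[j] for j in range(n)]
     for i in range(n)]
D = [[rho[i] * (1 - rho[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]
S = [[0.0 if i == j else rho[j] / (1 - rho[i]) for j in range(n)] for i in range(n)]
I = [[float(i == j) for j in range(n)] for i in range(n)]
IminusS = [[I[i][j] - S[i][j] for j in range(n)] for i in range(n)]
reconstructed = matmul(D, IminusS)    # should equal H

# Claim 1: strict diagonal dominance, |H_kk| > sum_{j != k} |H_kj|.
dominant = all(abs(H[k][k]) > sum(abs(H[k][j]) for j in range(n) if j != k)
               for k in range(n))
# Claim 2: every row sum of S is strictly below 1, bounding the spectral radius.
row_sums = [sum(S[i]) for i in range(n)]
```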

## 5 Experiments

We are largely interested in a comparison of (a) the overall performance of ListNet, λMART, and XE-MART on benchmark learning-to-rank datasets, and (b) the robustness of these models to various types and degrees of noise, as a proxy for comparing their complexity. In this section, we describe our experimental setup and report our empirical findings.

### 5.1 Datasets

We conduct experiments on two publicly available benchmark datasets: MSLR Web30K (Qin and Liu, 2013) and Yahoo! Learning to Rank Challenge Set 1 (Chapelle and Chang, 2011). Web30K contains roughly 30,000 examples, with an average of 120 documents per example. Documents are represented by 136 numeric features. Yahoo! also has about 30,000 examples, but the average number of documents per example is 24 and each document is represented by 519 features. Documents in both datasets are labeled with graded relevance from 0 to 4, with larger labels indicating higher relevance.

From each dataset, we sample training (60%), validation (20%), and test (20%) examples, and train and compare models on the resulting splits. We repeat this procedure 100 times and obtain mean NDCG at different rank cutoffs for each trial. We subsequently compare the ranking quality between pairs of models and determine statistical significance of differences using a paired t-test.

During evaluation, we discard examples with no relevant documents; there are 982 and 1,135 such examples in the Web30K and Yahoo! datasets, respectively. The reason for ignoring these examples during evaluation is that their ranking quality can arbitrarily be 0 or 1, and that arbitrary choice skews the mean metrics one way or the other.

### 5.2 Models

We train models using LightGBM (Ke et al., 2017). The hyperparameters are guided by previous work (Ke et al., 2017; Wang et al., 2018; Bruch et al., 2019a). For Web30K, max_bin is 255, learning_rate is 0.02, num_leaves is 400, min_data_in_leaf is 50, min_sum_hessian_in_leaf is set to 0, σ is 1, and lambdamart_norm is set to false. We do not utilize any regularization terms because we are interested in a comparison of the core algorithms. For Yahoo!, num_leaves is 200 and min_data_in_leaf is 100. We use NDCG@5 to select the best models on validation sets, fixing the early stopping round to 50 and training up to 500 trees.

We also implemented ListNet and XE-NDCG in LightGBM, which we intend to open source. As noted earlier, by fixing the hypothesis space to gradient boosted regression trees, we aim to strictly compare the performance of the loss functions and shield our analysis from any effect the hypothesis space may have on convergence and generalization. We use the same hyperparameters above for these algorithms as well.

Finally, we must address the choice of $\gamma$ in XE-NDCG. As explained in Section 4.2, in this work we turn $\gamma$ into a hyperparameter. In effect, our strategy is similar to the process that led to the visualization in Figure 1: at every iteration and for every training example with $m$ documents, we sample each $\gamma_i$ uniformly from $[0, 1)$. We train 10 models in this way and choose the one that performs best on the validation set.

### 5.3 Ranking Quality

We compare the ranking quality of the three models under consideration, measuring average NDCG at rank cutoffs 5 and 10. As noted earlier, we also test the statistical significance of differences in model quality using a paired t-test. Our results are summarized in Table 1.

From Table 1, we observe that ListNet consistently performs poorly across both datasets, and the quality gap between ListNet and λMART is statistically significant at all rank cutoffs. This observation is in agreement with past studies (Bruch et al., 2019a).

On the other hand, our proposed XE-MART yields a significant improvement over ListNet. This observation holds consistently across both datasets and rank cutoffs, and lends support to our theoretical findings in previous sections.

Not only does XE-MART outperform ListNet, its performance also surpasses λMART's. While XE-MART's gain over λMART is smaller than its gap with ListNet, the differences are statistically significant. This is an encouraging result: XE-NDCG is not only theoretically sound and equipped with better properties, it also performs well empirically compared to the state-of-the-art algorithm.

A notable difference between XE-MART and λMART is their convergence rate. Figure 4 plots NDCG@5 on validation sets as more trees are added to the ensemble. To avoid clutter, the figure illustrates just one trial (out of 100), but we observe a similar trend across trials. From Figure 4, it is clear that XE-MART outperforms λMART by a wider margin when the number of trees in the ensemble is small. This property is important in latency-sensitive applications where a smaller ensemble is preferred.

### 5.4 Robustness

We now turn to model robustness, where we perform a comparative analysis of the effect of noise on XE-MART and λMART. The robustness of a ranking model to noise is important in practice due to the uncertainty in relevance labels, whether judged by human experts or collected implicitly from user feedback such as clicks. We expect λMART to overfit to noise and be less robust due to its higher model complexity (see the findings in Section 4.3). As such, we expect the performance of λMART to degrade at a faster pace than XE-MART's as we inject more noise into the dataset. We put this hypothesis to the test through two series of experiments.

In the first series of experiments, we focus on the effect of enlarging the document list per training example by the addition of noise. In particular, we augment the document list of each training example with new negative (i.e., non-relevant) documents using the following process. For every training example $(\mathbf{x}, \mathbf{y})$, we sample a set of documents from the collection of all documents in the training set, excluding those in $\mathbf{x}$, and append them to $\mathbf{x}$ as non-relevant documents. Finally, we train models on the resulting training set and evaluate on the (unmodified) test set. As before, we repeat this experiment 100 times.
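This augmentation procedure can be sketched as follows; the feature vectors and pool below are hypothetical, and the sampling details are illustrative:

```python
import random

def augment_with_negatives(x, y, pool, pct=0.4, seed=0):
    # Append pct * len(x) documents drawn from an external pool,
    # labeled as non-relevant (label 0).
    rng = random.Random(seed)
    k = int(pct * len(x))
    extra = [pool[rng.randrange(len(pool))] for _ in range(k)]
    return x + extra, y + [0] * k

# Hypothetical feature vectors: x holds one example's documents, pool the rest.
x = [[0.1, 0.5], [0.3, 0.2], [0.9, 0.4], [0.0, 0.7], [0.6, 0.6]]
y = [2, 0, 3, 1, 0]
pool = [[0.2, 0.2], [0.8, 0.1], [0.4, 0.9]]
x_aug, y_aug = augment_with_negatives(x, y, pool, pct=0.4)
```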

We illustrate NDCG@5 on the test sets, averaged over 100 trials and for various degrees of augmentation, in Figures (a) and (b). The trend confirms our hypothesis: on both datasets, the performance of λMART degrades more severely as more noise is added to the training set, increasing the number of documents per example, $m$. This effect is more pronounced on the Yahoo! dataset, where $m$ is on average small. We note that the increase in NDCG@5 of λMART from the 40% mark to 60% on Web30K is not statistically significant.

In the second series of experiments, we perturb relevance labels in the training set. To that end, for each training example $(\mathbf{x}, \mathbf{y})$, we randomly choose a subset of its documents and set their labels (independently) to 0 through 4 with decreasing probabilities. We train models on the perturbed training set and evaluate on the (unmodified) test set. As before, we repeat this experiment 100 times.
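A sketch of this perturbation follows; since the exact relabeling probabilities are not given above, the decreasing weights below are purely illustrative:

```python
import random

def perturb_labels(y, fraction=0.25, weights=(16, 8, 4, 2, 1), seed=0):
    # Relabel a random subset of documents with grades 0..4 drawn with
    # decreasing probability (the weights are illustrative, not from the paper).
    rng = random.Random(seed)
    y = list(y)
    k = max(1, int(fraction * len(y)))
    for i in rng.sample(range(len(y)), k):
        y[i] = rng.choices(range(5), weights=weights)[0]
    return y

original = [4, 3, 2, 1, 0] * 4
noisy = perturb_labels(original, fraction=0.5)
```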

The results are shown in Figures (c) and (d). As before, λMART's performance degrades more rapidly with more noise. This behavior is more pronounced on Web30K.

We have included additional experiments in the appendix that explore the robustness of XE-MART and λMART to noise in a click dataset simulated from Yahoo! and Web30K. The results from those experiments further support our hypothesis that XE-MART is the more robust algorithm.

## 6 Conclusion

In this work, we presented a novel "listwise" learning-to-rank loss function, XE-NDCG, that, unlike existing methods, bounds NDCG, a popular ranking metric, in a general setting. We contrasted our proposed loss function with λMART's and showed its superior theoretical properties. In particular, we showed that the loss function optimized by λMART (if it exists) has a higher complexity, with a Lipschitz constant that is a function of the number of documents, $m$. In contrast, the complexity of XE-NDCG is invariant to $m$.

Furthermore, we proposed a model that optimizes XE-NDCG to learn an ensemble of gradient-boosted decision trees, which we refer to as XE-MART. Through extensive experiments on two benchmark learning-to-rank datasets, we demonstrated the superiority of our proposed method over ListNet and λMART in terms of quality and robustness. We showed that XE-MART is less sensitive to the number of documents and is more robust in the presence of noise. Finally, our experiments suggest that the performance gap between XE-MART and λMART widens if we constrain the size of the learned ensemble. Better performance with fewer trees is important for latency-sensitive applications.

As a future direction, we are interested in an examination of the tightness of the presented bound and its effect on the convergence of XE-MART. In particular, in this work we cast the problem presented in Section 4.2 as one of hyperparameter tuning; more effective strategies for solving that optimization problem and obtaining tighter bounds during boosting remain unexplored. Furthermore, given its robustness to label noise (implicit and explicit), we are also interested in studying XE-MART in an online learning setting.

## References

• S. Bruch, X. Wang, M. Bendersky, and M. Najork (2019a) An analysis of the softmax cross entropy loss for learning-to-rank with binary relevance. In Proceedings of the 2019 ACM SIGIR International Conference on the Theory of Information Retrieval, Cited by: §2, §4.1, §5.2, §5.3.
• S. Bruch, M. Zoghi, M. Bendersky, and M. Najork (2019b)

Revisiting approximate metric optimization in the age of deep neural networks

.
In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Cited by: §2.
• C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender (2005) Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pp. 89–96. Cited by: §1, §2.
• C. J.C. Burges (2010) From RankNet to LambdaRank to LambdaMART: an overview. Technical Report MSR-TR-2010-82, Microsoft Research. Cited by: §1, §2.
• Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 129–136. Cited by: §1, §2, §3.
• O. Chapelle and Y. Chang (2011) Yahoo! learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge, pp. 1–24. Cited by: §5.1.
• O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan (2009) Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 621–630. Cited by: §2.
• O. Chapelle and M. Wu (2010) Gradient descent optimization of smoothed information retrieval metrics. Information Retrieval 13 (3), pp. 216–235. Cited by: §2.
• S. Clémençon and N. Vayatis (2008) Empirical performance maximization for linear rank statistics. In Proceedings of the 21st International Conference on Neural Information Processing Systems, pp. 305–312. Cited by: §1.
• N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey (2008) An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 87–94. Cited by: Appendix E.
• F. Cucker and S. Smale (2002) On the mathematical foundations of learning. Bulletin of the American Mathematical Society 39, pp. 1–49. Cited by: Appendix B, §4.3.
• J. H. Friedman (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics 29 (5), pp. 1189–1232. Cited by: §2.
• K. Järvelin and J. Kekäläinen (2002) Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems 20 (4), pp. 422–446. Cited by: §1, §2.
• T. Joachims, A. Swaminathan, and T. Schnabel (2017) Unbiased learning-to-rank with biased feedback. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining, pp. 781–789. Cited by: Appendix E.
• T. Joachims (2006) Training linear svms in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217–226. Cited by: §1, §2.
• G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu (2017) LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30, pp. 3146–3154. Cited by: §5.2.
• Y. Lan, T. Liu, Z. Ma, and H. Li (2009) Generalization analysis of listwise learning-to-rank algorithms. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 577–584. Cited by: §2.
• T. Liu (2009) Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3 (3), pp. 225–331. Cited by: §1, §2.
• T. Qin, T. Liu, and H. Li (2010) A general approximation framework for direct optimization of information retrieval measures. Information Retrieval 13 (4), pp. 375–397. Cited by: §2.
• T. Qin and T. Liu (2013) Introducing LETOR 4.0 datasets. arXiv:1306.2597. Cited by: §5.1.
• C. Rudin and Y. Wang (2018) Direct learning to rank and rerank. In Proceedings of Artificial Intelligence and Statistics (AISTATS), Cited by: §2.
• C. Rudin (2009) The p-norm push: a simple convex ranking algorithm that concentrates at the top of the list. Journal of Machine Learning Research 10, pp. 2233–2271. Cited by: Appendix B, §2, §4.3.
• M. Taylor, J. Guiver, S. Robertson, and T. Minka (2008) SoftRank: optimizing non-smooth rank metrics. In Proceedings of the 1st International Conference on Web Search and Data Mining, pp. 77–86. Cited by: §2.
• A. Tewari and S. Chaudhuri (2015) Generalization error bounds for learning to rank: does the length of document lists matter?. In Proceedings of the 32nd International Conference on Machine Learning, pp. 315–323. Cited by: §2, §4.3.
• X. Wang, C. Li, N. Golbandi, M. Bendersky, and M. Najork (2018) The lambdaloss framework for ranking metric optimization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1313–1322. Cited by: §2, §2, §5.2.
• Q. Wu, C. J. Burges, K. M. Svore, and J. Gao (2010) Adapting boosting for information retrieval measures. Information Retrieval 13 (3), pp. 254–270. Cited by: §1, §2.
• F. Xia, T. Liu, J. Wang, W. Zhang, and H. Li (2008) Listwise approach to learning to rank: theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning, pp. 1192–1199. Cited by: §1, §2.
• J. Xu and H. Li (2007) AdaRank: a boosting algorithm for information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 391–398. Cited by: §2.

## Appendix A Proof of Proposition 2

###### Proposition.

The loss $\ell$ is $2$-Lipschitz with respect to the scores $f$.

###### Proof.

We follow the proof of Proposition 1.

Recall that the cost function is defined as follows:

$$\ell(y, f(x)) \triangleq -\sum_i \phi(y_i) \log \rho(f_i),$$

where $\phi(\cdot)$ and $\rho(\cdot)$ form probability distributions over labels and scores respectively, and $\rho(f_i) = e^{f_i} / \sum_j e^{f_j}$.

Observe that the derivative of the cost function with respect to a score $f_r$ is:

$$\begin{aligned} \frac{\partial \ell}{\partial f_r} &= \frac{\partial}{\partial f_r}\Big[-\sum_i \phi(y_i)\Big(f_i - \log\sum_j e^{f_j}\Big)\Big] \\ &= \frac{\partial}{\partial f_r}\Big[\Big(\sum_i -\phi(y_i) f_i\Big) + \log\sum_j e^{f_j}\Big] \\ &= -\phi(y_r) + \frac{e^{f_r}}{\sum_j e^{f_j}} = -\phi(y_r) + \rho(f_r). \end{aligned}$$

By the triangle inequality,

$$\Big|\frac{\partial \ell}{\partial f_r}\Big| \leq \phi(y_r) + \rho(f_r).$$

Then,

$$\|\nabla_f \ell\|_1 = \sum_r \Big|\frac{\partial \ell}{\partial f_r}\Big| \leq \sum_r \big(\phi(y_r) + \rho(f_r)\big) = 2,$$

as required. ∎
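The derivation above can be checked numerically. The sketch below is ours, not from the paper: it implements the loss with a softmax $\rho$ and an illustrative exponential-gain choice of $\phi$ (any distribution over labels works), compares the analytic gradient $-\phi(y_r)+\rho(f_r)$ against central finite differences, and confirms $\|\nabla_f \ell\|_1 \leq 2$.

```python
import math

def softmax(f):
    # rho(f_i) = exp(f_i) / sum_j exp(f_j), stabilized by subtracting max(f)
    m = max(f)
    e = [math.exp(v - m) for v in f]
    s = sum(e)
    return [v / s for v in e]

def label_dist(y):
    # phi(y_i): an illustrative choice (exponential gains, normalized);
    # the small offset keeps phi strictly positive
    g = [2.0 ** v - 1.0 + 1e-3 for v in y]
    s = sum(g)
    return [v / s for v in g]

def loss(y, f):
    phi, rho = label_dist(y), softmax(f)
    return -sum(p * math.log(r) for p, r in zip(phi, rho))

def grad(y, f):
    # analytic gradient: d loss / d f_r = -phi(y_r) + rho(f_r)
    phi, rho = label_dist(y), softmax(f)
    return [r - p for p, r in zip(phi, rho)]

y = [3, 1, 0, 2]            # graded relevance labels
f = [0.5, -1.2, 0.1, 2.0]   # model scores

# 1) the analytic gradient matches central finite differences
eps = 1e-6
for r in range(len(f)):
    fp = list(f); fp[r] += eps
    fm = list(f); fm[r] -= eps
    fd = (loss(y, fp) - loss(y, fm)) / (2 * eps)
    assert abs(fd - grad(y, f)[r]) < 1e-6

# 2) the L1 norm of the gradient never exceeds 2, as the proposition states
assert sum(abs(g) for g in grad(y, f)) <= 2.0
```

Because $\phi$ and $\rho$ each sum to one, the bound $\sum_r(\phi(y_r)+\rho(f_r))=2$ holds regardless of the labels or scores chosen.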

## Appendix B Proof of Theorem 2

###### Theorem.

Let $\mathcal{F}$ be a compact space of bounded functions from $\mathcal{X}$ to $[0,1]$ (the loss is invariant to a translation of scores, and scaling the scores is equivalent to scaling the generalized sigmoid's hyperparameter by a scalar and the gradients by the inverse of that scalar, which does not affect the Lipschitz constant of the loss; as such, any bounded function produced by $\mathcal{F}$ can be translated and scaled into $[0,1]$ without loss of generality). Let $n$ be the number of training examples, $\mathrm{Lip}_\ell$ the Lipschitz constant of the loss function $\ell$, and $\mathcal{N}(\epsilon, \mathcal{F}, \|\cdot\|_\infty)$ the covering number of $\mathcal{F}$ by $\|\cdot\|_\infty$-balls of radius $\epsilon$. The following generalization error bound holds:

$$P\{E(f) \leq \epsilon\} \geq 1 - 2\,\mathcal{N}\Big(\frac{\epsilon}{4\,\mathrm{Lip}_\ell}, \mathcal{F}, \|\cdot\|_\infty\Big) \exp\Big(-\frac{n\epsilon^2}{2\,\mathrm{Lip}_\ell^2}\Big),$$

where the generalization error is defined as follows:

$$E(f) \triangleq \mathbb{E}_{\mathcal{X}^m \times \mathcal{Y}^m}\big[\ell(y, f(x))\big] - \frac{1}{n}\sum_{(x,y)\in\Psi} \ell(y, f(x)).$$
###### Proof.

Based on the proofs in (Cucker and Smale, 2002; Rudin, 2009). At a high level, the proof can be sketched as follows. Consider a cover of $\mathcal{F}$ with $\|\cdot\|_\infty$-balls and let $\mathcal{N}$ be its covering number (i.e., the number of elements in the smallest such cover); that is, we assume there exist $\mathcal{N}$ balls centered at functions $f_r$, for $1 \leq r \leq \mathcal{N}$, that cover $\mathcal{F}$. In Lemma B.2, we show that if the radius of the balls is sufficiently small, it matters little which $f$ we choose to represent each ball; the generalization error does not change much within each ball. We then proceed to work with the centers $f_r$ and find a probabilistic generalization bound for each center. Finally, we use the union bound to derive a bound over the entire cover. But first we need the following lemma.

###### Lemma B.1.

For any $f, f' \in \mathcal{F}$, $E(f) - E(f') \leq 2\,\mathrm{Lip}_\ell \|f - f'\|_\infty$.

###### Proof.
$$\begin{aligned} E(f) - E(f') ={}& \Big[\mathbb{E}_{\mathcal{X}^m\times\mathcal{Y}^m}\big[\ell(y, f(x))\big] - \frac{1}{n}\sum_{(x,y)\in\Psi}\ell(y, f(x))\Big] \\ &- \Big[\mathbb{E}_{\mathcal{X}^m\times\mathcal{Y}^m}\big[\ell(y, f'(x))\big] - \frac{1}{n}\sum_{(x,y)\in\Psi}\ell(y, f'(x))\Big] \\ \leq{}& \mathbb{E}_{\mathcal{X}^m\times\mathcal{Y}^m}\big[\big|\ell(y, f(x)) - \ell(y, f'(x))\big|\big] + \frac{1}{n}\sum_{(x,y)\in\Psi}\big|\ell(y, f(x)) - \ell(y, f'(x))\big| \\ \leq{}& \mathrm{Lip}_\ell\|f - f'\|_\infty + \mathrm{Lip}_\ell\|f - f'\|_\infty \qquad \text{(by Lipschitz continuity)} \\ ={}& 2\,\mathrm{Lip}_\ell\|f - f'\|_\infty. \end{aligned}$$

∎

###### Lemma B.2.

Consider an $\|\cdot\|_\infty$-ball $N(f^*)$ in $\mathcal{F}$ centered at $f^*$ with radius $\frac{\epsilon}{4\,\mathrm{Lip}_\ell}$. We have that:

$$P\Big\{\sup_{f \in N(f^*)} E(f) \geq \epsilon\Big\} \leq P\Big\{E(f^*) \geq \frac{\epsilon}{2}\Big\}.$$
###### Proof.
$$\begin{aligned} \sup_{f \in N(f^*)} E(f) - E(f^*) &\leq 2\,\mathrm{Lip}_\ell \sup_{f \in N(f^*)} \|f - f^*\|_\infty \qquad \text{(by Lemma B.1)} \\ &\leq 2\,\mathrm{Lip}_\ell \cdot \frac{\epsilon}{4\,\mathrm{Lip}_\ell} = \frac{\epsilon}{2}. \end{aligned}$$

The above means:

$$\sup_{f \in N(f^*)} E(f) \geq \epsilon \implies E(f^*) \geq \frac{\epsilon}{2}.$$

The claim follows. ∎

Finally, we turn to proving the main result. Define $S \triangleq \frac{1}{n}\sum_{(x,y)\in\Psi} \ell(y, f(x))$. The largest possible change in $S$ due to the replacement of one training example with another is bounded by:

$$\frac{1}{n}\mathrm{Lip}_\ell\,\|f(x) - f(x')\|_\infty \leq \frac{1}{n}\mathrm{Lip}_\ell \sup_{x, x' \in \mathcal{X}}\|f(x) - f(x')\|_\infty \leq \frac{1}{n}\mathrm{Lip}_\ell.$$

Using McDiarmid’s inequality:

$$P\big\{\big|\mathbb{E}[S] - S\big| \geq \epsilon\big\} \leq 2\exp\Big(-\frac{2\epsilon^2}{n\big(\frac{1}{n}\mathrm{Lip}_\ell\big)^2}\Big).$$
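The constants in this application of McDiarmid's inequality can be made explicit; the following expansion (ours, added for completeness) shows where the bounded difference of $\frac{1}{n}\mathrm{Lip}_\ell$ enters:

```latex
% Each of the n summands of S changes by at most c_i = Lip_l / n when one
% training example is replaced, so the sum of squared bounded differences is
\sum_{i=1}^{n} c_i^2 = n\left(\frac{\mathrm{Lip}_\ell}{n}\right)^2
                     = \frac{\mathrm{Lip}_\ell^2}{n},
% and McDiarmid's inequality gives
P\big\{|\mathbb{E}[S] - S| \geq \epsilon\big\}
  \leq 2\exp\!\left(-\frac{2\epsilon^2}{\sum_i c_i^2}\right)
  = 2\exp\!\left(-\frac{2n\epsilon^2}{\mathrm{Lip}_\ell^2}\right).
```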

Finally, using Lemma B.2 and the union bound over the entire cover, we obtain:

$$P\Big\{\sup_{f \in \mathcal{F}} E(f) \geq \epsilon\Big\} \leq \mathcal{N}\Big(\frac{\epsilon}{4\,\mathrm{Lip}_\ell}, \mathcal{F}, \|\cdot\|_\infty\Big)\, P\Big\{E(f_r) \geq \frac{\epsilon}{2}\Big\} \leq 2\,\mathcal{N}\Big(\frac{\epsilon}{4\,\mathrm{Lip}_\ell}, \mathcal{F}, \|\cdot\|_\infty\Big) \exp\Big(-\frac{n\epsilon^2}{2\,\mathrm{Lip}_\ell^2}\Big),$$

which yields the claimed bound. ∎

## Appendix C Proof of Claim 1

Using $\rho_i$ to denote the score probability of the $i$-th document, the Hessian can be written as follows:

$$H_{ij} = \begin{cases} \rho_i(1 - \rho_i), & i = j \\ -\rho_i \rho_j, & i \neq j \end{cases}$$
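In matrix form the Hessian above is $H = \mathrm{diag}(\rho) - \rho\rho^{\mathsf{T}}$, and the quadratic form $x^{\mathsf{T}} H x = \sum_i \rho_i x_i^2 - (\sum_i \rho_i x_i)^2$ is the variance of $x$ under $\rho$, hence nonnegative. The sketch below is ours (not from the paper) and checks this numerically for a plain softmax-style $\rho$ summing to one; with such a $\rho$ the form vanishes only for constant $x$, so strict definiteness holds on non-constant score vectors.

```python
import random

def hessian(rho):
    # H_ij = rho_i * (1 - rho_i) if i == j, else -rho_i * rho_j
    n = len(rho)
    return [[rho[i] * (1.0 - rho[i]) if i == j else -rho[i] * rho[j]
             for j in range(n)] for i in range(n)]

def quad_form(H, x):
    # x^T H x
    n = len(x)
    return sum(H[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

random.seed(0)
rho = [0.1, 0.2, 0.3, 0.4]  # a probability distribution over scores
H = hessian(rho)

# x^T H x equals the variance of x under rho, so it is nonnegative
# for every x (up to floating-point noise)
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(4)]
    assert quad_form(H, x) >= -1e-12
```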
###### Claim.

The Hessian, as defined above, is positive definite.

###### Proof.

We first prove that $H$ is strictly diagonally dominant. By definition, a square matrix $A$ is said to be strictly diagonally dominant if the following holds for all $i$:

$$|A_{ii}| > \sum_{j \neq i} |A_{ij}|.$$

Observe that:

$$|H_{kk}| = \rho_k(1 - \rho_k)$$