# Just Sort It! A Simple and Effective Approach to Active Preference Learning

We address the problem of learning a ranking by using adaptively chosen pairwise comparisons. Our goal is to recover the ranking accurately but to sample the comparisons sparingly. If all comparison outcomes are consistent with the ranking, the optimal solution is to use an efficient sorting algorithm, such as Quicksort. But how do sorting algorithms behave if some comparison outcomes are inconsistent with the ranking? We give favorable guarantees for Quicksort for the popular Bradley-Terry model, under natural assumptions on the parameters. Furthermore, we empirically demonstrate that sorting algorithms lead to a very simple and effective active learning strategy: repeatedly sort the items. This strategy performs as well as state-of-the-art methods (and much better than random sampling) at a minuscule fraction of the computational cost.

## 1 Introduction

We consider the problem of recovering a ranking over a set of items from a collection of pairwise noisy comparisons. Applications are broad and diverse, ranging from ranking chess players based on game outcomes [Elo, 1978] to learning user preferences from individual choices in a recommender system [Wang et al., 2014]. Furthermore, pairwise comparisons have been advocated by psychologists and sociologists as a way of eliciting information for almost a century [Thurstone, 1927, Salganik and Levy, 2014]. Arguably, comparisons are one of the simplest and most natural ways for humans to give feedback. As an example, GIFGIF (see http://gifgif.media.mit.edu) is a large-scale, crowdsourced attempt at understanding how emotions are represented in animated GIF images. Visitors of the site are presented with a prompt that displays two images, together with a question such as, “Which better expresses happiness?” Aggregating millions of answers enables us to rank the images in a compelling way.

Clearly, any useful method should be able to cope with inconsistent or noisy data. A popular probabilistic model of comparison outcomes is the Bradley–Terry (BT) model [Bradley and Terry, 1952], which we describe briefly in Section 1.1. The BT model posits a notion of distance, or similarity, between items; comparisons are noisier for similar items than for distant items. As a consequence, the amount of information we can learn from a comparison depends on previous comparisons. For example, if we already know that two items are very distant, comparing them might be almost pointless. This suggests that active (or adaptive) strategies, which select items for comparison based on previous comparisons, might perform much better than agnostic strategies.

In this paper, we confirm this intuition by exploring strategies for active ranking from noisy comparison data. Specifically, we show that sorting algorithms (such as Quicksort), which recover the exact ranking in the noiseless case, provide a basis for efficient active approximate ranking in the noisy case. Running a sorting algorithm over a set of items induces a set of comparisons over these items. We propose a ranking method which runs a sorting algorithm repeatedly over the items until the budget is exhausted; the output of the sorting algorithm is ignored, but we retain the set of item pairs that were compared. We then estimate the final ranking from this set of comparisons and their outcomes.

This result requires a detour: We first develop the maximum-likelihood estimator for the BT model based on a set of noisy comparisons among items. The starting point for this is recent work on Rank Centrality (RC), an algorithm for aggregating pairwise comparisons [Negahban et al., 2012]. The analysis of this algorithm provided first-of-its-kind bounds for the finite sample error under the BT model. Since then, there have been several new results that analyze the statistical convergence of ranking estimators under this model [Hajek et al., 2014] and various generalizations [Rajkumar and Agarwal, 2014].

In Section 2, our first contribution is to show how RC can be interpreted as an approximation of the maximum likelihood (ML) estimator. This enables us to show that the ML estimate enjoys essentially the same error guarantees as the RC estimate, and to give a new, provably convergent iterative algorithm that computes the ML estimate. In Section 3, we return to the problem of ranking induced by the BT model. Previous results in the literature show that if compared pairs are chosen in advance, $\Omega(n^2)$ comparisons are necessary to even approximately recover the ranking. In light of this, we investigate what happens when we use a simple active strategy: sorting the items. This leads to our second contribution, which is to show (under a certain distribution over BT model parameters) that we can approximately recover the ranking over the $n$ items using $O(n \log n)$ comparisons, and that the “worst mistake” in the estimated ranking is bounded by $O(\lambda \log^2 n)$ with high probability. Finally, in Section 4 we test our theoretical findings and show experimentally that the speed of convergence is increased when we obtain comparison outcomes by sorting the items.

Notations. We define some of the notations used throughout the paper. The $n$ items are denoted without loss of generality by their true rank in $[n] = \{1, \ldots, n\}$, i.e., $i < j$ means that $i$ is ranked higher than $j$. When $i$ is preferred to $j$ as a result of a pairwise comparison, we denote the observation by $i \prec j$. Let $E$ be the set containing all pairs of items that have been compared at least once. A given pair might be compared more than once; let $k_{ij}$ be the number of times items $i$ and $j$ were compared, out of which $i$ was preferred $A_{ji}$ times and $j$ was preferred $A_{ij}$ times. When $k_{ij} = k$ for all pairs in $E$, we define $a_{ij} = A_{ij} / k$ as the ratio of wins of $j$ over $i$. When $(i, j) \notin E$ we let $a_{ij} = 0$. Finally, an event $A$ is said to occur with high probability if $P(A) \ge 1 - c / n^{\alpha}$ for $c$, $\alpha$ fixed.

### 1.1 Bradley–Terry Model

Informally, the Bradley–Terry (BT) model [Bradley and Terry, 1952] places every item on the real line. The probability of observing a comparison outcome that is inconsistent with the ranking decreases with the distance between the items. This captures the intuitive notion that some pairs are easy to compare, whereas some are more difficult.

More formally, the model defines the probability of observing a particular comparison outcome. It can be parameterized in two different ways, and we will make use of both. In the first parameterization, each item $i$ is associated with a parameter $w_i > 0$, which we call its weight. The probability that $i$ is preferred over $j$ is then

$$p(i \prec j) = \frac{w_i}{w_i + w_j}.$$

We call $b = \max_{i,j} w_i / w_j$ the dynamic range of the model. It is clear that the parameters are defined up to a multiplicative factor, so without loss of generality we can fix their scale. Sometimes, it will be convenient to view the parameters as a probability distribution over the items, in which case they need to sum to one. We will then use the notation $\pi$, with $\sum_i \pi_i = 1$. The distinction becomes important when we consider sequences with a growing number of items.

An alternative parameterization of the model is given by $\theta_i = \log w_i$, with $\theta_i \in \mathbb{R}$. Then

$$p(i \prec j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} = \frac{1}{1 + e^{-(\theta_i - \theta_j)}}.$$

In this formulation the parameters are defined up to an additive constant. This second parameterization enables us to make the link with our informal description: intuitively, “close” items have more uncertain comparison outcomes. Finally, let us note that as the log-likelihood is concave in $\theta$, maximum likelihood estimation is tractable, and efficient algorithms exist [Zermelo, 1929, Hunter, 2004].
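To make the two parameterizations concrete, the following minimal Python sketch evaluates the outcome probability under each of them and samples one noisy comparison. The function names are illustrative and not part of the paper; the only assumption is the relation $\theta_i = \log w_i$ that links the two forms.

```python
import math
import random


def bt_prob_weights(w_i, w_j):
    """Probability that item i is preferred over item j, given positive weights."""
    return w_i / (w_i + w_j)


def bt_prob_theta(theta_i, theta_j):
    """Same probability in the parameterization theta = log(w).

    Written so that math.exp is only ever called on a non-positive number,
    which avoids overflow for very distant items.
    """
    d = theta_i - theta_j
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def noisy_compare(theta_i, theta_j, rng=random):
    """Sample one comparison outcome: True if i is preferred over j."""
    return rng.random() < bt_prob_theta(theta_i, theta_j)


# Both parameterizations describe the same model when theta = log(w).
w_i, w_j = 3.0, 1.0
p_from_weights = bt_prob_weights(w_i, w_j)
p_from_theta = bt_prob_theta(math.log(w_i), math.log(w_j))
```

Only the difference $\theta_i - \theta_j$ enters the second form, which is why shifting all parameters by the same constant leaves every probability unchanged.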

### 1.2 Related Work

Negahban et al. [2012, 2014] propose Rank Centrality, a spectral ranking algorithm for aggregating pairwise comparisons. They analyze the case where comparisons are obtained under the BT model for a fixed set of pairs . They give bounds on the mean-square error (MSE) of the resulting parameter estimate, as a function of the structural properties of the comparisons graph . The random graph turns out to have optimal structural properties with high probability, which justifies selecting comparison pairs at random. In this case, their main result implies that comparisons are enough to ensure that the MSE is . In Section 2, we provide further insights into Rank Centrality by relating it to the ML estimator. We show, among other contributions, how RC can be used to compute the ML estimate, and provide a principled way of extending RC to the case where pairs might have a different number of comparisons.

Hajek et al. [2014] provide a result similar to that of Rank Centrality for the ML estimator. They also present a lower bound on the minimax error rate. It follows that both RC and ML are minimax-optimal up to logarithmic factors. In Section 2 we recover their result for the ML estimator by analyzing RC and give an independent proof of the error bound on the ML estimate. Furthermore, in Section 3 we observe that a minimax analysis is not insightful when the task is to recover the ranking instead of the model parameters. This is the reason why we develop a novel analysis of ranking estimates, in a way that enables us to provide useful guarantees.

The results mentioned all bound the MSE of the parameter estimate or . If there is interest in recovering the ranking induced by a BT model, a low average error on the parameter estimate is good only in so far as parameter values are not too close to each other. When comparisons are chosen at random, Rajkumar and Agarwal [2014] look at the statistical convergence of various ranking algorithms, including RC and ML. Their sample complexity results are impractical, with quartic dependencies in . However, they show that the ML estimator (under the BT model), unlike RC, can recover rankings from comparisons generated according to a model that only satisfies a weaker, more general low-noise condition. The robustness of the ML estimator—to variations in the model—provides another justification to use it for ranking aggregation.

Finally, several works consider active approaches for ranking when the observed comparisons are noisy. Using a different noise model, Braverman and Mossel [2008] show that comparisons are sufficient to find a ranking that is close to the original one. In their model, the noise is independent of the comparison pair. In a way, our developments in Section 3 provide similar guarantees for the BT model. Jamieson and Nowak [2011], Ailon [2012], Wang et al. [2014] also consider active ranking with pairwise data, but they depart from the BT model used in our setting.

## 2 Maximum Likelihood vs. Rank Centrality

In this section, we find a link between Rank Centrality [Negahban et al., 2012, 2014] and the ML estimator for the BT model. This allows us to bound the MSE of the ML estimate when comparisons are made over randomly chosen pairs, to present a novel, provably convergent iterative algorithm that computes the ML estimate using RC, and to extend RC to the case of an uneven number of comparisons across pairs. We will use $\pi^*$ and $w^*$ to denote true model parameters, $\bar{\pi}$ and $\bar{w}$ for parameters estimated by or related to RC, and $\hat{\pi}$ and $\hat{w}$ for parameters estimated by or related to the ML estimator.

### 2.1 Rank Centrality Approximates the ML Estimate

We start by recalling the Rank Centrality algorithm. Let $G = ([n], E)$ be the comparisons graph and assume that every pair in $E$ has been compared exactly $k$ times. Informally, the RC estimate is defined as the stationary distribution of a random walk on the graph that is biased towards items that are preferred more often. Denote by $d_{\max}$ and $d_{\min}$ the maximum and the minimum degree of a node in $G$, respectively, and let $\epsilon = 1 / d_{\max}$. Consider the Markov chain on $[n]$ with the following transition matrix, which we assume to be irreducible and aperiodic.

$$\bar{P}_{ij} = \begin{cases} \epsilon a_{ij} & \text{if } i \neq j, \\ 1 - \epsilon \sum_{l \neq i} a_{il} & \text{if } i = j. \end{cases}$$

The factor $\epsilon$ ensures that the matrix is stochastic. The RC estimate $\bar{\pi}$ is defined as the (unique) stationary distribution of the random walk on $G$, i.e., satisfying $\bar{\pi}^\top = \bar{\pi}^\top \bar{P}$.
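As an illustration of this definition, the sketch below builds the transition matrix from a matrix of win ratios and approximates its stationary distribution by power iteration. It is a minimal sketch under stated assumptions: the normalization is taken to be one over the maximum degree (as in the original Rank Centrality algorithm), `a[i][j]` is assumed to hold the ratio of wins of `j` over `i` (zero for pairs that were never compared), and the fixed number of iterations is an arbitrary choice rather than a convergence criterion from the paper.

```python
def rank_centrality(a, num_iters=1000):
    """Approximate the Rank Centrality estimate by power iteration.

    a[i][j] is the fraction of comparisons between i and j won by j,
    and 0 if the pair was never compared.
    """
    n = len(a)
    # Degree of a node: number of items it has been compared with.
    degrees = [
        sum(1 for j in range(n) if j != i and (a[i][j] > 0 or a[j][i] > 0))
        for i in range(n)
    ]
    eps = 1.0 / max(degrees)  # assumed normalization, keeps rows stochastic

    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                P[i][j] = eps * a[i][j]
        P[i][i] = 1.0 - sum(P[i])  # diagonal is still 0 when the sum is taken

    pi = [1.0 / n] * n
    for _ in range(num_iters):
        # One step of the chain: pi <- pi P (pi is a row vector).
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Noiseless win ratios generated from hypothetical weights.
weights = [1.0, 2.0, 4.0]
ratios = [
    [0.0 if i == j else weights[j] / (weights[i] + weights[j]) for j in range(3)]
    for i in range(3)
]
estimate = rank_centrality(ratios)
```

When the ratios are noiseless, as in this toy example, the chain is reversible with respect to the normalized weights, so the estimate should coincide with them up to numerical precision.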

Let be the adjacency matrix of , and the diagonal matrix of its nodes’ degrees. A Markov chain with transition matrix represents an unbiased random walk on . Let be the spectral gap of this transition matrix; intuitively, the larger the spectral gap is, the faster the convergence towards the stationary distribution is. The main result of the analysis of RC is as follows.

###### Theorem 1 (Negahban et al., 2012).

For , the error on the RC estimate satisfies w.h.p.

 ∥¯π−π∗∥2∥π∗∥2

where is a constant, and .

How to interpret the left-hand side of (1) is not necessarily clear at first sight, especially as parameters and contract as grows to ensure that they continue to form a probability distribution. For this reason, we state the following corollary, which expresses the error in a more interpretable way.

###### Corollary 1.

For , the error on the RC estimate satisfies w.h.p.

 ∥¯w−w∗∥2

where is a constant, and .

The proof can be found in Appendix A. Corollary 1 bounds the error of the parameter estimate when the dynamic range stays constant. The error depends on the structure of the comparisons graph through , and . In the best case, and are bounded by a constant, in which case comparisons are enough to make sure that the MSE is .

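We move on to show how RC relates to the ML estimator. On one hand, the Rank Centrality estimate, as the stationary distribution of a Markov chain, satisfies the global balance equations. Let $N_i$ be the set of neighbors of $i$ in $G$, i.e., the set of items that $i$ has been compared with. For all $i \in [n]$,

$$\sum_{j \in N_i} \bar{\pi}_i \bar{P}_{ij} = \sum_{j \in N_i} \bar{\pi}_j \bar{P}_{ji} \iff \sum_{j \in N_i} (\bar{\pi}_i a_{ij} - \bar{\pi}_j a_{ji}) = 0. \qquad (2)$$

On the other hand, the maximum likelihood estimate satisfies the following claim.

###### Claim.

The maximum likelihood estimate $\hat{\pi}$ satisfies, for all $i \in [n]$,

$$f_i(\hat{\pi}) = \sum_{j \in N_i} \frac{1}{\hat{\pi}_i + \hat{\pi}_j} (\hat{\pi}_i a_{ij} - \hat{\pi}_j a_{ji}) = 0. \qquad (3)$$

###### Proof.

Up to a positive constant factor, the log-likelihood of the observations given weights $\pi$ can be written as

$$\log L = \sum_{i \in [n]} \sum_{j \in N_i} a_{ji} \left( \log \pi_i - \log(\pi_i + \pi_j) \right) \implies \frac{\partial \log L}{\partial \pi_i} = -\sum_{j \in N_i} \frac{1}{\pi_i (\pi_i + \pi_j)} (\pi_i a_{ij} - \pi_j a_{ji}).$$

The ML estimate by definition satisfies $\partial \log L / \partial \pi_i = 0$ at $\pi = \hat{\pi}$ for all $i$. The nonzero factor $-1 / \hat{\pi}_i$ can be dismissed, and the claim follows. ∎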

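Consider the task of finding an appropriate linearization of the function $f_i$ in (3). We now present two approaches that lead to the same result. First, we can give up on linearizing $f_i$ directly, because there is no obvious point around which to develop it. Instead, by writing $f_i(\pi) = \sum_{j \in N_i} f_{ij}(\pi_i, \pi_j)$ with

$$f_{ij}(\pi_i, \pi_j) = \frac{1}{\pi_i + \pi_j} (\pi_i a_{ij} - \pi_j a_{ji}),$$

we can linearize each function $f_{ij}$ separately. A good point around which to develop $f_{ij}$ is $(\pi_i, \pi_j) = (c_{ij} a_{ji}, c_{ij} a_{ij})$, as the ratio $\pi_i / \pi_j$ is expected to be close to $a_{ji} / a_{ij}$ ($c_{ij}$ is a scaling constant). Summing up the first-order Taylor approximations of the $f_{ij}$, and using $a_{ij} + a_{ji} = 1$, we get

$$f_i(\pi) = \sum_{j \in N_i} \frac{1}{c_{ij}} (\pi_i a_{ij} - \pi_j a_{ji}) + \text{H.O.T.}$$

Taking the same constant $c_{ij} = c$ for all pairs yields Rank Centrality (2). Alternatively, if we were to linearize $f_i$ around the maximum likelihood estimate $\hat{\pi}$, we would find the following first-order Taylor approximation.

$$f_i(\pi) = \sum_{j \in N_i} \frac{1}{\hat{\pi}_i + \hat{\pi}_j} \left( \frac{\hat{\pi}_j}{\hat{\pi}_i + \hat{\pi}_j} \pi_i - \frac{\hat{\pi}_i}{\hat{\pi}_i + \hat{\pi}_j} \pi_j \right) + \text{H.O.T.}$$

As $\hat{\pi}_j / (\hat{\pi}_i + \hat{\pi}_j)$ is expected to be close to $a_{ij}$, we can plug the data in. As for the coefficient $1 / (\hat{\pi}_i + \hat{\pi}_j)$, in the absence of additional knowledge, we can just ignore it (or set it to a constant). Again we fall back on Rank Centrality (2).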

Furthermore, (3) enables us to view the ML estimate as the stationary distribution of a Markov chain with transition matrix , where

 (4)

The factor ensures that is stochastic. This matrix enables us to analyze the ML estimator by using the methods developed for RC.

###### Theorem 2.

For , the error on the ML estimate satisfies w.h.p.

 ∥^w−w∗∥2

where is a constant, and .

###### Proof.

Given the Markov chain , the proof essentially follows that of Theorem 1 of Negahban et al. [2014]. Let be the ideal Markov chain, when , i.e., the ratios are noiseless. The key insight is to note that the stationary distribution of is , the true model parameters. By bounding and , we can bound the error on the stationary distribution of . For the former, a straightforward application of the proof in the RC case suffices. For the latter, in the application of the comparison theorem, the lower bound on changes by a factor of . This is due to the additional factor in the off-diagonal entries of . ∎

This independently recovers the result recently shown by Hajek et al. [2014], by using an alternative method. For completeness, we provide a corollary that gives an instance of the bound for random graphs.

###### Corollary 2.

For , and , the error on the ML estimate satisfies w.h.p.

 ∥^w−w∗∥2

where , are constants and .

Corollary 2 shows that comparisons on randomly chosen pairs are sufficient also for the ML estimator to ensure that the MSE vanishes as increases. Combining these results with the lower bounds of Hajek et al. [2014] shows that ML and RC are minimax-optimal up to a logarithmic factor.

### 2.2 The Case of General kij

In this section, we make a small aside to show how our newly developed understanding of RC enables us to state a principled way to extend the algorithm to a non-uniform number of comparisons. Let , and . The Markov chain

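$$\tilde{P}_{ij} = \begin{cases} \epsilon A_{ij} & \text{if } i \neq j, \\ 1 - \epsilon \sum_{l \neq i} A_{il} & \text{if } i = j \end{cases}$$

leads to an estimate $\tilde{\pi}$ that, via the global balance equations, satisfies

$$\sum_{j \in N_i} (\tilde{\pi}_i A_{ij} - \tilde{\pi}_j A_{ji}) = 0.$$

This is to be compared to the maximum likelihood estimate $\hat{\pi}$ that satisfies

$$\sum_{j \in N_i} \frac{1}{\hat{\pi}_i + \hat{\pi}_j} (\hat{\pi}_i A_{ij} - \hat{\pi}_j A_{ji}) = 0.$$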

Unfortunately, the analysis of the error does not benefit from the higher number of comparisons available for certain pairs. Bounds in this general case can be obtained by setting in Theorem 1.

Equation (3) and the Markov chain in (2.2) suggest an iterative procedure to compute the maximum likelihood estimate using adapted Rank Centrality Markov chains as primitives. Algorithm 1, which we call Adjusted Rank Centrality (ARC), describes such a procedure. It makes use of the following Markov chain, parameterized by .

 (5)

The factor ensures that the matrix is stochastic.

###### Theorem 3.

Algorithm 1 converges to the maximum likelihood estimate, provided that the Markov chain is irreducible and aperiodic.

The full proof of Theorem 3 is given in Appendix A. The proof simply shows that is a contraction mapping; by (3) the unique fixed point is the ML estimate . Note that is aperiodic as soon as it has a self-transition, and that if is not irreducible, the ML estimate is ill-defined. Lastly, ARC bears similarities to Hunter’s minorization–maximization (MM) algorithm [Hunter, 2004]. Like MM, it solves the ML optimization problem by iterative linear relaxations. However, MM linearizes part of the likelihood function directly, whereas ARC linearizes the gradient.

## 3 Ranking by Sorting

In Section 2, we see that RC and the ML estimator are both minimax-optimal up to logarithmic factors, given comparisons for pairs chosen uniformly at random. This might lead us to think that we should not seek further and simply try to gather comparisons on random pairs of items. However, when the task is to recover the ranking induced by a BT model, some considerations change.

First, let us review the assumption of fixed dynamic range for growing . For finding a good BT parameter estimate, a constant is good, as it restricts the possibilities to a constant range. This is made clear in Theorem 2, where the error has a dependency in . The task of learning the induced ranking, however, seems easier for large (or growing) values of . If two items are in general more “distant” from each other, the outcome of the comparison is less noisy, resulting in a better aggregate order (even though the actual parameters or might be estimated less precisely.) Whether to consider as growing or fixed is ultimately a modeling choice that has to be tested on real cases and based on quantitative results.

Second, a minimax analysis is not insightful for learning a ranking. It is easy to construct an instance of the BT model where a fraction of the items are arbitrarily close to each other, thus making it arbitrarily hard to find their relative order. For this reason, we consider in this section the Bayes risk. We assume that the parameters $\theta^*$ are a realization of a Poisson point process with rate $\lambda$. Equivalently, for $i = 1, \ldots, n - 1$,

$$X_i = \theta^*_i - \theta^*_{i+1} \overset{\text{i.i.d.}}{\sim} \mathrm{Exp}(\lambda). \qquad (6)$$

Hence, . Let be the ranking induced by a BT estimate , i.e., . To measure the quality of this ranking, we look at the average displacement

$$\Delta(\sigma) = \frac{1}{n} \sum_{i=1}^{n} |\sigma(i) - i|.$$

This metric is the one considered by Braverman and Mossel [2008]. It is also known as the (average) Spearman’s footrule distance.
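The generative assumption and the error metric can be sketched in a few lines of Python. This is an illustrative sketch, not the code used for the experiments: the function names are hypothetical, ranks are 0-indexed (the paper indexes from 1), and the exponential gaps are assumed to be parameterized by their rate, so that a small rate spreads the items far apart.

```python
import random


def sample_bt_parameters(n, rate, rng):
    """Draw theta[0] >= theta[1] >= ... with i.i.d. exponential gaps.

    Item i is labeled by its true rank, so a larger theta means a
    better-ranked item.
    """
    theta = [0.0]
    for _ in range(n - 1):
        theta.append(theta[-1] - rng.expovariate(rate))
    return theta


def ranking_from_estimate(theta_hat):
    """Rank induced by an estimate: sigma[i] is the estimated rank of item i."""
    order = sorted(range(len(theta_hat)), key=lambda i: -theta_hat[i])
    sigma = [0] * len(theta_hat)
    for rank, item in enumerate(order):
        sigma[item] = rank
    return sigma


def average_displacement(sigma):
    """Average Spearman's footrule distance to the identity ranking."""
    return sum(abs(rank - i) for i, rank in enumerate(sigma)) / len(sigma)


def max_displacement(sigma):
    """Largest single mistake in the ranking."""
    return max(abs(rank - i) for i, rank in enumerate(sigma))
```

For instance, swapping the two best items of a four-item ranking displaces two items by one position each, giving an average displacement of 0.5 and a maximum displacement of 1.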

In the noiseless case, random comparisons are necessary to recover the ranking, even if we allow for a constant bounded away from zero [Alon et al., 1994]. However, various sorting algorithms find the true ranking with comparisons, attaining the information-theoretic lower bound. This raises the question: Can we characterize the quality of a ranking produced by a sorting procedure operating with noisy comparisons? In Section 3.1 we give theoretical guarantees for one such sorting algorithm, Quicksort. In Section 3.2 we compare the theory to simulation results.

### 3.1 Analysis of Quicksort

We consider the case of the randomized Quicksort algorithm [Sedgewick and Wayne, 2011]. Given a set of items, Quicksort works by selecting a pivot element $p$ uniformly at random and dividing the items into two subsets: elements lesser and greater than $p$. This operation is called partition. The algorithm then recurses on these two subsets. First, we show that Quicksort always terminates and is, in a sense, noise-agnostic.

###### Claim.

Quicksort returns a ranking after $O(n \log n)$ comparisons with high probability, regardless of the outcome of the comparisons.

###### Proof.

The claim follows from standard results of the analysis of the algorithm; see, for example, Sedgewick and Wayne [2011]. Comparisons used by Quicksort can be reinterpreted as those used in the construction of a binary search tree, where elements are inserted in a random order. That the comparisons form a tree shows that no non-transitive comparison is observed, and the ranking can simply be read by an in-order traversal. ∎
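The noise-agnostic behavior can be seen in a direct implementation. The following is a minimal sketch of randomized Quicksort driven by noisy BT comparisons; the function names and the optional `log` argument (which records every observed outcome) are illustrative additions, and items are assumed to be distinct integers that index into `theta`.

```python
import math
import random


def bt_prob(theta_i, theta_j):
    """P(i preferred over j) under the BT model, numerically stable."""
    d = theta_i - theta_j
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def noisy_quicksort(items, theta, rng, log=None):
    """Sort items (best first) using one noisy comparison per item and pivot.

    Each item is compared to a given pivot at most once, so the outcomes
    cannot contradict each other and the procedure always terminates.
    Observed outcomes are appended to `log` as (winner, loser) pairs.
    """
    if len(items) <= 1:
        return list(items)
    pivot = items[rng.randrange(len(items))]  # pivot chosen uniformly at random
    better, worse = [], []
    for x in items:
        if x == pivot:
            continue
        if rng.random() < bt_prob(theta[x], theta[pivot]):
            better.append(x)
            if log is not None:
                log.append((x, pivot))
        else:
            worse.append(x)
            if log is not None:
                log.append((pivot, x))
    return (noisy_quicksort(better, theta, rng, log)
            + [pivot]
            + noisy_quicksort(worse, theta, rng, log))


# Items very far apart: comparisons are essentially noiseless.
rng = random.Random(1)
theta = [-50.0 * i for i in range(8)]
comparisons = []
ranking = noisy_quicksort(list(range(8)), theta, rng, comparisons)
```

Whatever the outcomes are, the output is a permutation of the input, and the number of recorded comparisons lies between $n - 1$ and $n(n-1)/2$.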

Notice also that if we were to take the comparisons used by a single run of Quicksort and plug them into a ML estimator for BT, the ranking induced by the resulting estimate would necessarily match the ranking output by Quicksort. We will now state two theorems that bound the average and the maximum displacement of a ranking produced by Quicksort.

###### Theorem 4.

Assume comparisons are made under a BT model with parameters drawn as per (6). Let $\sigma$ be the ranking output by randomized Quicksort. Then, w.h.p.,

$$\Delta(\sigma) = O(\lambda^3).$$

We only give a sketch of the proof here; the full proof is available in Appendix B. In general, the analysis of a sorting algorithm under noisy comparisons is difficult, because comparisons are strongly interdependent and a single mistake can affect many items. In our case, we are helped by the fact that, in the BT model, the probability of making a mistake decreases exponentially fast with the distance between items. Within a single partition operation, consider the rank difference (relative to the pivot) of items that end up in the wrong subset. It is highly improbable that an incorrect outcome of the comparison is observed for an item with a high rank difference. This enables us to bound the expectation and variance of the total displacement caused by a single partition operation. As the total number of partition operations performed by Quicksort is upper bounded by $n$, corresponding bounds on the mean and variance of the average displacement follow.

###### Theorem 5.

Assume comparisons are made under a BT model with parameters drawn as per (6). Let $\sigma$ be the ranking output by randomized Quicksort. Then, w.h.p.,

$$\max_i |\sigma(i) - i| = O(\lambda \log^2 n).$$

Again, we sketch the proof, with the full proof available in Appendix B. Consider a partition operation. Using a Chernoff bound argument, with high probability no item with rank distance $\Omega(\lambda \log n)$ from the pivot will end up in the wrong subset. Combined with the fact that an item goes with high probability through $O(\log n)$ partition operations, the result follows.

These bounds are rather good news: They mean that a very simple and efficient active strategy, sorting the items, enables us to approximately recover the ranking.

### 3.2 Empirical Validation of Bounds

To empirically test our bounds, we perform some simulations. We generate values for according to (6), apply Quicksort, and record the ranking. Figure 1 shows the average and maximum displacement for varying values of and . Our bound for appears to be tight in , but seems to be loose in , as the simulations seem to indicate a linear dependency in this parameter. Our bound for the maximum displacement appears to be tight in . The dependency in seems to be , however, we observe that the variance does not decrease. Therefore, our with-high-probability bound is presumably optimal. The contraction observed on the right plot for small values of is due to border effects.

## 4 Experimental Evaluation

In this section, we further expand on the connection between sorting and rank estimation from noisy comparisons. Note that all (non-trivial) sorting algorithms are active, in the sense that each comparison pair is a function of the outcomes of earlier comparisons. For example, in Quicksort, two non-pivot items and are compared only if they fall on the same side of all previous pivots . A natural question to ask is whether the set of comparisons induced by such a sorting procedure can lead to a better ranking estimate than a set of random comparison pairs. In this section, we answer this question in the affirmative.

Section 3 shows that under some assumptions, selecting comparison pairs via sorting enables us to recover the ranking up to a constant displacement. In practice, the comparison budget for ranking estimation from noisy data is typically larger than that for a single sort. This suggests the following heuristic strategy: For a given budget of comparisons, run the sorting procedure repeatedly until the budget is depleted (the last call may have to be truncated). With high probability, the instances of the sorting procedure produce different comparisons, because of noisy comparison outcomes, and because the sorting procedure itself may be randomized; for example, in Quicksort, the pivot selection is random. We retain only the set of comparison pairs and their outcomes, but discard the rankings produced by the sorting procedure. The final ranking estimate is then given by the ML estimate over the set of comparison outcomes.
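The strategy can be sketched end to end as follows. This is a hedged illustration rather than the implementation behind the experiments: comparison outcomes are simulated from hypothetical ground-truth parameters `theta`, and the final fit uses a smoothed MM-style update in the spirit of Hunter [2004] instead of the ARC algorithm of Section 2. The pseudo-count `alpha` is our own assumption, added only to keep weights positive for items that never win.

```python
import math
import random
from collections import defaultdict


def bt_prob(theta_i, theta_j):
    """P(i preferred over j) under the BT model, numerically stable."""
    d = theta_i - theta_j
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def sort_once(n, theta, rng, budget, wins):
    """One noisy Quicksort pass over items 0..n-1, truncated at `budget`.

    Outcomes are accumulated in wins[(winner, loser)]; the ranking itself
    is discarded. Returns the number of comparisons used.
    """
    used = 0
    stack = [list(range(n))]
    while stack and used < budget:
        items = stack.pop()
        if len(items) <= 1:
            continue
        pivot = items[rng.randrange(len(items))]
        better, worse = [], []
        for x in items:
            if x == pivot:
                continue
            if used == budget:
                break
            used += 1
            if rng.random() < bt_prob(theta[x], theta[pivot]):
                wins[(x, pivot)] += 1
                better.append(x)
            else:
                wins[(pivot, x)] += 1
                worse.append(x)
        stack.extend([better, worse])
    return used


def collect_by_sorting(n, theta, budget, rng):
    """Repeat noisy Quicksort until exactly `budget` comparisons are collected."""
    wins = defaultdict(int)
    remaining = budget if n >= 2 else 0
    while remaining > 0:
        remaining -= sort_once(n, theta, rng, remaining, wins)
    return wins


def mm_estimate(n, wins, iters=200, alpha=0.1):
    """Smoothed MM-style iteration for BT weights (alpha is an assumption)."""
    total_wins = [0.0] * n
    pair_counts = defaultdict(int)
    for (winner, loser), count in wins.items():
        total_wins[winner] += count
        pair_counts[(min(winner, loser), max(winner, loser))] += count
    w = [1.0] * n
    for _ in range(iters):
        denom = [0.0] * n
        for (i, j), k in pair_counts.items():
            shared = k / (w[i] + w[j])
            denom[i] += shared
            denom[j] += shared
        w = [(total_wins[i] + alpha) / (denom[i] + alpha) for i in range(n)]
    return w


rng = random.Random(0)
n = 20
theta = [-0.5 * i for i in range(n)]  # hypothetical ground truth, item 0 is best
wins = collect_by_sorting(n, theta, budget=2000, rng=rng)
w_hat = mm_estimate(n, wins)
estimated_order = sorted(range(n), key=lambda i: -w_hat[i])
```

The explicit stack replaces recursion so that the last sort can be cut off exactly when the budget runs out, which mirrors the truncation mentioned above.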

In this section, we compare the quality of this estimate to one based on comparisons selected uniformly at random. We first present the results on synthetic data, and then use two real-world datasets: one containing comparisons between animated GIFs, and one consisting of sushi preferences.

### 4.1 Synthetic Data

In this experiment, we set and generate models following (6) for values of . For increasing values of , we collect comparisons outcomes using repeated calls to Quicksort, and random pairs of items. In both cases, the ranking is induced by the ML estimate fitted on the observed comparisons. In Figure 2, we show the relative improvement of selecting comparisons via sorting over random selection. We can see that the average displacement is always improved by some factor for our range of parameters. For example, with and for , the average displacement when comparisons are selected actively is half that obtained with the same number of random comparisons. As can be expected, the less noisy the comparisons are (small values of ), the better the improvement obtained by the active approach is. Furthermore, it seems that even for large , the active approach keeps its relative advantage.

### 4.2 GIFGIF Dataset

Next, we look at a dataset from GIFGIF, a survey to which we referred in Section 1. GIFGIF aims at explaining the emotions communicated by a collection of over animated GIF images. Users of the website are shown a prompt with two images and a question, “Which better expresses ?” where is one of 17 emotions. The users can click on either image, or use a third option, neither. To date, over 2.7 million comparison outcomes have been collected. For the purpose of our experiment, we restrict ourselves to a single emotion, happiness, and we ignore outcomes that resulted in neither. We considered comparison outcomes over items. To the best of our knowledge, this is the largest dataset of human-generated pairwise comparisons analyzed in the context of rank aggregation.

As the data, despite a large number of comparisons, remains sparse (less than 20 comparisons per item on average), we proceed as follows. We fit a BT model by using all the available comparisons and use it as ground truth. We then generate comparison outcomes directly from the BT model, for both active and passive approaches. In this sense, the experiment enables us to compare active and passive sampling by using a model with realistic parameters that do not necessarily follow (6) anymore.

Figure 3 (left) compares the average displacement of passive and active approaches. The active approach performs systematically better, with an improvement of about after observing outcomes. The improvement is noticeable but modest. We notice that item parameters are close to each other on average; fitting a Poisson point process yields . This is because there is a considerable fraction of items that have their parameters very close to one another. This might mean that a significant fraction of the images do not express any opinion with respect to happiness. Figuring out the exact order of these images is therefore difficult and probably of marginal value.

### 4.3 Sushi Dataset

For our last experiment, we look at a dataset of Sushi preferences [Kamishima and Akaho, 2009]. In this dataset, respondents give a strict ordering over 10 different types of sushis. These 10 sushis are chosen among a larger set of items. To suit our purposes, we decompose each partial ranking into pairwise comparisons, resulting in comparison outcomes.

This time, the data is dense enough to enable us to use the comparison outcomes directly. When an outcome for pair is requested, we sample uniformly at random over the outcomes observed for this pair. In the rare case where no outcome is available, we return with probability . This enables us to compare active and passive approaches in a more realistic setting, where the assumptions of the BT model do not necessarily hold anymore.
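The data preparation described here can be sketched as follows. The helper names are hypothetical and the data layout (one best-first list of item identifiers per respondent) is an assumption about how the rankings are stored; the fallback probability for pairs without any recorded outcome is exposed as a parameter, with a default of one half chosen purely for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations


def decompose_ranking(ranking):
    """Turn an ordering (best first) into (winner, loser) pairs."""
    return list(combinations(ranking, 2))


def build_outcome_table(rankings):
    """Group all observed outcomes by unordered pair of items."""
    outcomes = defaultdict(list)
    for ranking in rankings:
        for winner, loser in decompose_ranking(ranking):
            key = (min(winner, loser), max(winner, loser))
            outcomes[key].append(winner)
    return outcomes


def request_outcome(i, j, outcomes, rng, fallback_prob=0.5):
    """Return the winner of a requested comparison between items i and j.

    Samples uniformly among the outcomes recorded for the pair; if there
    are none, i wins with probability fallback_prob (an assumed default).
    """
    observed = outcomes.get((min(i, j), max(i, j)))
    if observed:
        return rng.choice(observed)
    return i if rng.random() < fallback_prob else j


# Two toy respondents ranking three items each.
table = build_outcome_table([[2, 0, 1], [0, 2, 1]])
winner = request_outcome(0, 1, table, random.Random(0))
```

A ranking over 10 items decomposes into 45 pairwise outcomes, so each respondent contributes 45 comparisons to the table.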

The ground truth is set to the ranking induced by a BT model fitted with all the available data. Results are shown in Figure 3 (right.) We see that the active approach is able to bring the average displacement down to a required level much more quickly than the passive approach. For example, an average displacement of is reached after random comparisons, but only if we collect comparisons with Quicksort—reducing the number of comparisons needed by a factor .

## 5 Conclusion

In this work, we have shown that recovering a ranking from pairwise comparisons can benefit from an active approach. Selecting the next pair of items to compare based on previous comparison outcomes yields systematic gains over selecting random pairs. Perhaps surprisingly, a model-agnostic black box, Quicksort, is able to approximately recover the ranking even in the presence of noise. On our way, we broadened the understanding of the Rank Centrality algorithm. This enabled us to develop a novel iterative algorithm that computes the maximum likelihood estimate in the Bradley–Terry model, and to give formal guarantees on the MSE of this estimate.

Future work on active strategies for ranking could investigate the very sparse case, for example when the budget of comparisons is linear in the number of items . In this case, there are too few comparisons for even a single sort to be carried out entirely.

## Acknowledgments

We are very grateful to Travis Rich and Kevin Hu for providing us with the GIFGIF dataset. Furthermore, we wish to thank Holly Cogliati-Bauereis and Ksenia Konyushkova for their careful proofreading and comments on the text.

## Appendix A Analysis of Rank Centrality

First, we give the proof of Corollary 1, which bounds the error of the RC estimate in a more interpretable way. Then, we prove the convergence of the Adjusted Rank Centrality (ARC) algorithm.

###### Proof of Corollary 1.

We will show that

$$\|w - w^*\|_2 \le \sqrt{nb}\,\frac{\|\pi - \pi^*\|_2}{\|\pi^*\|_2}. \tag{7}$$

First, we bound . Let s.t. and for all . We have, for ,

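$$
\begin{aligned}
\|v^{(i)}\|_2 &= \sqrt{\frac{n-1+b^2}{(n-1+b)^2}} = \sqrt{\frac{n+(b-1)^2}{n(n+2b-2)+(b-1)^2}} && (8) \\
&\le \sqrt{\frac{1}{n} + \frac{b-1}{n}} = \sqrt{\frac{b}{n}}. && (9)
\end{aligned}
$$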

Note that because is fixed, any feasible is a convex combination of the , hence by the triangle inequality . Furthermore, we have

$$\|\pi - \pi^*\|_2 = \frac{1}{\sum_i w^*_i}\,\|w - w^*\|_2 \ge \frac{1}{n}\,\|w - w^*\|_2. \tag{10}$$

The corollary ensues. ∎

The proof of convergence for the Adjusted Rank Centrality algorithm (Theorem 3) essentially analyzes the operator defined by (5).

###### Proof of Theorem 3.

The proof is inspired by Tresch [2008] and von Petersdorff [2014]. Let be a mapping of the interior of the simplex onto itself, defined by

 T(π)=πP(π). (11)

We will use the notation

$$T^k(\pi) = \underbrace{T \circ T \circ \cdots \circ T}_{k \text{ times}}(\pi), \tag{12}$$

with . The Jacobian matrix of is defined as

 (13)

Note that and are stochastic, and have the same non-zero entries as . By stochasticity, . Now consider the smallest such that , i.e., such that for all . That such a finite exists is guaranteed by the fact that is irreducible and aperiodic. By the above, it follows that

• is such that for all , and

• is such that for all and .

Let , and let be the all-ones matrix. We can write as

$$S'(\pi) = \epsilon J_n + R(\pi), \tag{14}$$

where . Note that by construction . Now pick any , and let . Then , and

$$S(y) - S(x) = \tilde{S}(1) - \tilde{S}(0) = \int_0^1 \tilde{S}'(u)\,du = \int_0^1 S'\big(x + u(y-x)\big)(y-x)\,du. \tag{15}$$

As is continuous, we have

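$$
\begin{aligned}
\|S(y) - S(x)\|_1 &\le \int_0^1 \big\|S'\big(x + u(y-x)\big)(y-x)\big\|_1\,du && (16) \\
&= \int_0^1 \big\|\underbrace{\epsilon J_n (y-x)}_{=0} + R\big(x + u(y-x)\big)(y-x)\big\|_1\,du && (17) \\
&\le \int_0^1 \underbrace{\big\|R\big(x + u(y-x)\big)\big\|_1}_{\le c}\,\|y-x\|_1\,du && (18) \\
&\le c\,\|y-x\|_1. && (19)
\end{aligned}
$$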

This shows that is a contraction in . Therefore, the Banach fixed-point theorem applies, and the unique fixed-point can be found by iterative applications of . Furthermore, is also a fixed point of since

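$$T(\hat{\pi}) = T\Big(\lim_{k\to\infty} S^k(\hat{\pi})\Big) = \lim_{k\to\infty} T\big(S^k(\hat{\pi})\big) = \lim_{k\to\infty} S^k\big(T(\hat{\pi})\big) = \hat{\pi}, \tag{20}$$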

and this is the only fixed point because any fixed point of is a fixed point of . Finally, let

. The vectors

$$x,\ T(x),\ T^2(x),\ \ldots \tag{21}$$

occur in one of the sequences , . All sequences converge to , and therefore

$$\lim_{k\to\infty} T^k(x) = \hat{\pi}. \tag{22}$$
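The contraction argument can be illustrated numerically in a simplified setting. The sketch below replaces the paper's operator by the linear map π ↦ πP for a fixed, strictly positive row-stochastic matrix P. This is an assumption made for illustration only: in the proof, P depends on π and positivity holds only for a power of T. The matrix entries are hypothetical. With ε the smallest entry of P, writing P = εJ_n + R gives the contraction factor c = 1 − nε in the ℓ1 norm.

```python
def apply(pi, P):
    """One step pi -> pi P for a row vector pi and a row-stochastic matrix P."""
    n = len(pi)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def l1(x, y):
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical 3-state chain with strictly positive entries (rows sum to 1).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.3, 0.6]]
n = len(P)
eps = min(min(row) for row in P)
c = 1.0 - n * eps  # contraction factor from the split P = eps * J_n + R

x = [1.0, 0.0, 0.0]
y = [0.0, 0.0, 1.0]
# One application shrinks the L1 distance by at least the factor c.
ratio = l1(apply(x, P), apply(y, P)) / l1(x, y)

# Fixed-point iteration: both starting points approach the same limit.
for _ in range(200):
    x, y = apply(x, P), apply(y, P)
```

After the loop, `x` and `y` agree up to floating-point error and are left unchanged by a further application of `apply`, which is the behavior that the Banach fixed-point theorem guarantees for a contraction.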

## Appendix B Analysis of Quicksort

Algorithm 2 formalizes the Quicksort algorithm (lines 4–13 are referred to as the partition operation). We consider the case where the outcomes of comparisons follow a Bradley–Terry (BT) model. Comparison outcomes are probabilistic, and might be inconsistent with the true ranking of the items.

We recall that w.l.o.g. the items are assumed to be labeled in decreasing rank order. Let be the parameters of the items, distributed according to a Poisson point process of rate . Equivalently, for

are i.i.d . We denote by the ranking distance between and , and by the distance between their parameters. If , then the probability of observing the outcome , inconsistent with the true ranking, is bounded by

$$P(j \prec i) = \frac{1}{1 + e^{D(i,j)}}. \tag{23}$$

Theorem 4 bounds the average displacement , and Theorem 5 bounds the maximum displacement .

### B.1 Proof of Theorem 4

First we will formalize the fact that in expectation, the probability of observing a wrong outcome in the comparison between and is exponentially decreasing, not only , but also in the ranking distance .

###### Lemma 1.

Let be two items such that . Then, the probability of observing the wrong outcome , in expectation over , is bounded by

$$P(j \prec i) < \left(\frac{\lambda}{\lambda+1}\right)^k.$$

###### Proof.

The distance is . Therefore, using Eq. 23,

 P(j≺i)
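Lemma 1 can be checked numerically with a small Monte Carlo experiment. The sketch below assumes that the gaps between consecutive parameters are i.i.d. exponential with rate λ (as is the case for a Poisson point process of rate λ), so that the distance between two items at rank distance k is a sum of k such gaps. The function name, seed and sample sizes are arbitrary choices.

```python
import math
import random


def expected_error(k, lam, rng, trials=20_000):
    """Monte Carlo estimate of E[1 / (1 + exp(D))], where D is the sum of
    k i.i.d. Exponential(lam) gaps between consecutive parameters."""
    total = 0.0
    for _ in range(trials):
        d = sum(rng.expovariate(lam) for _ in range(k))
        e = math.exp(-d)
        total += e / (1.0 + e)  # equals 1 / (1 + exp(d)), without overflow
    return total / trials


rng = random.Random(1)
lam = 1.0
# For each rank distance k: (estimated error probability, bound of Lemma 1).
results = {
    k: (expected_error(k, lam, rng), (lam / (lam + 1.0)) ** k)
    for k in (1, 2, 4)
}
```

In each entry of `results`, the estimate should fall below the bound, and both should shrink geometrically as k grows.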

Consider a partition operation at an arbitrary level of the recursion, where items in are compared against . Let be such that no item compared to with is misclassified. In other words, is the maximum rank distance of a misclassified item. There are at most items that are displaced (counting ), and each of them is displaced by at most if and are subsequently correctly sorted. In view of this, we define the random variable

$$\alpha = (2\bar{k}+1)\,\bar{k}, \tag{27}$$

which represents an upper bound to the local contribution to the total displacement caused by a single partition operation. We can bound its expected value and variance by a constant, as is made precise by the next lemma.

###### Lemma 2.

In a partition operation, let be such that no item compared to with is misclassified. Let . Then,

 E(α) <λ(4λ2+7λ+3), Var(α)

for some constant .

###### Proof.

We will use Lemma 1 to bound the probability that an item with a given rank distance is misclassified. First we observe that

 P(¯k=k)

Therefore,

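$$
\begin{aligned}
E\big((2\bar{k}+1)\bar{k}\big) &< \sum_{k=0}^{\infty} (2k+1)k \cdot P(\text{item at distance } k \text{ is misclassified}) && (29) \\
&< \sum_{k=0}^{\infty} k(2k+1)\left(\frac{\lambda}{\lambda+1}\right)^k = \lambda(4\lambda^2 + 7\lambda + 3). && (30)
\end{aligned}
$$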

We can use a similar argument for the variance.

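$$
\begin{aligned}
\operatorname{Var}\big((2\bar{k}+1)\bar{k}\big) &\le E\big((2\bar{k}+1)^2\bar{k}^2\big) && (31) \\
&< \sum_{k=0}^{\infty} k^2(2k+1)^2\left(\frac{\lambda}{\lambda+1}\right)^k && (32) \\
&= \lambda(96\lambda^4 + 264\lambda^3 + 250\lambda^2 + 91\lambda + 9). && (33)
\end{aligned}
$$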
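The two closed forms in (30) and (33) can be verified by comparing them with truncated versions of the series. The following sketch does this for a few values of λ; the helper names and the truncation length are arbitrary choices.

```python
def partial_sum(term, lam, terms=2_000):
    """Truncated series: sum over k of term(k) * (lam / (lam + 1))**k."""
    r = lam / (lam + 1.0)
    return sum(term(k) * r ** k for k in range(terms))


def mean_closed_form(lam):
    """Right-hand side of (30)."""
    return lam * (4 * lam**2 + 7 * lam + 3)


def second_moment_closed_form(lam):
    """Right-hand side of (33)."""
    return lam * (96 * lam**4 + 264 * lam**3 + 250 * lam**2 + 91 * lam + 9)


checks = []
for lam in (0.5, 1.0, 2.0):
    series_mean = partial_sum(lambda k: k * (2 * k + 1), lam)
    series_second = partial_sum(lambda k: k**2 * (2 * k + 1) ** 2, lam)
    checks.append((lam, series_mean, mean_closed_form(lam),
                   series_second, second_moment_closed_form(lam)))
```

For λ = 1, both the series and the closed forms evaluate to 14 and 710, respectively.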

Note that we did not make an assumption on , the size of the set under consideration in the partition operation. Therefore, Lemma 2 applies at any level of the recursion. We can now proceed to prove the main theorem that bounds the total expected displacement of the output of the Quicksort algorithm.

###### Proof of Theorem 4.

As each item can be a pivot at most once, there are such partition operations. Denote the upper bound on the displacement caused by each partition operation by . Then,

$$E(\Delta(\sigma)) \le E\left(\frac{1}{n}\sum_{l=1}^{k}\alpha_l\right) = \frac{k}{n}\,E(\alpha) \le E(\alpha)$$

for some constants , , and . By Chebyshev’s inequality,

$$P\left(\Delta(\sigma) \le C_1\lambda^3 + C_2 + \sqrt{C_3\lambda^5 + C_4}\right) \ge 1 - \frac{1}{n}. \tag{36}$$

Finally, let us remark that it is well known that randomized Quicksort needs comparisons on average [Sedgewick and Wayne, 2011].
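The setting of Theorem 4 can be explored empirically with a short simulation. The sketch below is not the paper's Algorithm 2: it is a plain randomized Quicksort driven by a comparison oracle, and all function names are hypothetical. It assumes that parameters are generated with i.i.d. exponential gaps of rate λ, that outcomes follow the BT model of Eq. 23, and that the average displacement is the mean absolute difference between an item's output position and its true rank.

```python
import math
import random


def sigmoid(d):
    """Numerically stable 1 / (1 + exp(-d))."""
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def quicksort(items, prefers, rng, counter):
    """Randomized Quicksort; prefers(a, b) is True if a is judged to rank
    before b. counter[0] accumulates the number of comparisons made."""
    if len(items) <= 1:
        return list(items)
    pivot = items[rng.randrange(len(items))]
    left, right = [], []
    for item in items:  # partition operation
        if item == pivot:
            continue
        counter[0] += 1
        (left if prefers(item, pivot) else right).append(item)
    return (quicksort(left, prefers, rng, counter) + [pivot]
            + quicksort(right, prefers, rng, counter))


def average_displacement(order):
    """Mean |position - true rank|, where item i has true rank i."""
    return sum(abs(pos - item) for pos, item in enumerate(order)) / len(order)


def simulate(n, lam, seed=0):
    rng = random.Random(seed)
    # Items 0..n-1 are labeled in decreasing rank order: parameters drop
    # by i.i.d. Exponential(lam) gaps from one item to the next.
    theta, current = [], 0.0
    for _ in range(n):
        theta.append(current)
        current -= rng.expovariate(lam)

    def noisy_prefers(a, b):
        # BT outcome: a wins with probability sigmoid(theta_a - theta_b).
        return rng.random() < sigmoid(theta[a] - theta[b])

    counter = [0]
    order = quicksort(list(range(n)), noisy_prefers, rng, counter)
    return order, average_displacement(order), counter[0]


order, displacement, comparisons = simulate(n=200, lam=1.0)
```

Running `simulate` for increasing `n` at a fixed `lam` gives a way to observe whether the average displacement stays bounded while the number of comparisons grows.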

### B.2 Proof of Theorem 5

The first lemma states that w.h.p., the maximum rank difference of an item getting displaced in any partition operation is . For convenience, we assume . The same result holds with different constants for smaller values of .

###### Lemma 3.

The maximum displacement in any single partition operation is , w.h.p.

###### Proof.

Consider a particular partition operation, with pivot . First, we will give a bound on given that . We recall that , where is i.i.d. . Using a Chernoff bound yields

$$P\big(D(p,i) \le (1-\delta)k/\lambda\big) \le \left(\frac{1-\delta}{e^{-\delta}}\right)^k. \tag{37}$$

Let . By using , we get

$$P\big(D(p,i) \le 3\log n\big) \le \frac{1}{n^{2\lambda}} \le \frac{1}{n^2}, \tag{38}$$

for . Hence with probability , elements and at ranking distance of on the left and on the right of are such that and , respectively. Then,

$$P\big(\text{at least one item } i \text{ s.t. } d(p,i) > 3e\log n \text{ is misclassified}\big) \le \frac{n}{e^{3\log n}} = \frac{1}{n^2}. \tag{39}$$

In total, there are at most such partition operations, and therefore the claim follows with probability . ∎

The next lemma states that with high probability, the Quicksort call tree has depth . This is a classical result in the analysis of randomized Quicksort.

###### Lemma 4.

The maximum call depth of Quicksort is w.h.p.

###### Proof.

At each recursion step, we select a pivot at random in the set of items . The probability that the rank of this item lies in the middle 50% of is , and if it does, the resulting sets and see their size decrease by at least . Therefore, in at most such partitions we get to a subset of size one and match the stopping condition. Even though we do not pick the pivot in the middle 50% every time, it is unlikely that more than recursions are needed (for some small ) to pick the pivot in the middle range at least times. We can make this formal with a Chernoff bound. Let be i.i.d. . Then,

$$P\left(\sum_{l=1}^{k} Z_l \le (1-\delta)k/2\right) \le e^{-\frac{\delta^2}{4}k}. \tag{40}$$

Setting and , it follows that

$$P\left(\sum_{l=1}^{24\log n} Z_l \le 4\log n\right) \le \frac{1}{n^2}. \tag{41}$$

This means that the depth of a leaf in the recursion tree is at most with probability at least , and therefore that the maximum depth is at most with probability at least . ∎
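The depth bound can be compared with simulated call depths. The sketch below assumes noiseless comparisons, so that only subproblem sizes matter, and tracks them with an explicit stack instead of actual recursion. It takes the 24 log n appearing in Eq. 41 as the reference value and reads the logarithm as natural; both the function name and these readings are assumptions.

```python
import math
import random


def max_call_depth(n, rng):
    """Maximum call depth of randomized Quicksort on n distinct items with
    noiseless comparisons (the top-level call has depth 1)."""
    deepest = 0
    stack = [(n, 1)]  # (subproblem size, depth of the call)
    while stack:
        size, depth = stack.pop()
        deepest = max(deepest, depth)
        if size <= 1:
            continue
        rank = rng.randrange(size)  # rank of the uniformly random pivot
        # Items before and after the pivot form the two recursive calls.
        for child in (rank, size - 1 - rank):
            if child > 0:
                stack.append((child, depth + 1))
    return deepest


rng = random.Random(0)
n = 1000
depths = [max_call_depth(n, rng) for _ in range(20)]
reference = 24 * math.log(n)  # roughly 165.8 for n = 1000
```

In such runs, the observed depths are typically far below the reference value, which reflects the fact that the constants in the lemma were chosen for a simple proof rather than for tightness.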

###### Proof of Theorem 5.

Lemma 4 states that an item goes through at most partition operations with probability . Lemma 3 states that the displacement due to a single partition operation never exceeds with probability . It follows that the total displacement of any item never exceeds with probability . ∎

## References

• Ailon [2012] N. Ailon. An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity. Journal of Machine Learning Research, 13:137–164, 2012.
• Alon et al. [1994] N. Alon, B. Bollobás, G. Brightwell, and S. Janson. Linear Extensions of a Random Partial Order. The Annals of Applied Probability, 4(1):108–123, 1994.
• Bradley and Terry [1952] R. A. Bradley and M. E. Terry. Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4):324–345, 1952.
• Braverman and Mossel [2008] M. Braverman and E. Mossel. Noisy sorting without resampling. In Proceedings of the 19th ACM-SIAM Symposium on Discrete Algorithms (SODA ’08), San Francisco, CA, 2008.
• Elo [1978] A. Elo. The Rating Of Chess Players, Past & Present. Arco, 1978.
• Hajek et al. [2014] B. Hajek, S. Oh, and J. Xu. Minimax-optimal Inference from Partial Rankings. In Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 2014.
• Hunter [2004] D. R. Hunter. MM algorithms for generalized Bradley–Terry models. The Annals of Statistics, 32(1):384–406, 2004.
• Jamieson and Nowak [2011] K. Jamieson and R. Nowak. Active Ranking using Pairwise Comparisons. In Advances in Neural Information Processing Systems 24 (NIPS 2011), Granada, Spain, 2011.
• Kamishima and Akaho [2009] T. Kamishima and S. Akaho. Efficient Clustering for Orders. In Mining Complex Data, pages 261–279. Springer, 2009.
• Negahban et al. [2012] S. Negahban, S. Oh, and D. Shah. Iterative Ranking from Pair-wise Comparisons. In Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, CA, 2012.
• Negahban et al. [2014] S. Negahban, S. Oh, and D. Shah. Rank Centrality: Ranking from Pair-wise Comparisons. preprint, arXiv:1209.1688v2 [cs.LG], Jan. 2014.
• Rajkumar and Agarwal [2014] A. Rajkumar and S. Agarwal. A Statistical Convergence Perspective of Algorithms for Rank Aggregation from Pairwise Data. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China, 2014.
• Salganik and Levy [2014] M. Salganik and K. Levy. Wiki surveys: Open and quantifiable social data collection. preprint, arXiv:1209.1688v2 [cs.LG], Jan. 2014.
• Sedgewick and Wayne [2011] R. Sedgewick and K. Wayne. Algorithms. Addison-Wesley, 4th edition, 2011.
• Thurstone [1927] L. Thurstone. The method of paired comparisons for social values. Journal of Abnormal and Social Psychology, 21(4):384–400, 1927.
• Tresch [2008] A. Tresch. Convergence Proof for discrete ergodic Markov Chains. Course notes, http://www.treschgroup.de/MachineLearning.html, Jan. 2008.
• von Petersdorff [2014] T. von Petersdorff. Fixed Point Iteration and Contraction Mapping Theorem. Course notes, http://terpconnect.umd.edu/~petersd/666/, 2014.
• Wang et al. [2014] J. Wang, N. Srebro, and J. Evans. Active Collaborative Permutation Learning. In Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’14), New York, NY, 2014.
• Zermelo [1929] E. Zermelo. Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 29(1):436–460, 1929.