The problem of inferring a ranking over a set of items, such as documents, images, movies, or URL links, is an important problem in machine learning and finds many applications in recommender systems, web search, social choice, and many other areas. One of the most popular forms of data for ranking is pairwise comparison data, which can be easily collected via, for example, crowdsourcing, online games, or tournament play. The problem of ranking aggregation from pairwise comparisons has been widely studied, and most work aims at inferring a total ordering of all the items (see, e.g., Negahban12RankCentrality). However, for some applications with a large number of items (e.g., rating of restaurants in a city), it is only necessary to identify the set of top items. For these applications, inferring the total global ranking order unnecessarily increases the complexity of the problem and requires significantly more samples.
In the basic setting for this problem, there is a set of items with some true underlying ranking. For each possible pair of items $(i, j)$, an analyst is given noisy pairwise comparisons between those two items, each independently ranking $i$ above $j$ with some probability $p_{ij}$. From this data, the analyst wishes to identify the top $K$ items in the ranking, ideally using as few samples as is necessary to be correct with sufficiently high probability. The noise in the pairwise comparisons (i.e., the probabilities $p_{ij}$) is constrained by the choice of noise model. Many existing models, such as the Bradley-Terry-Luce model (BTL) (Bradley52; Luce59), the Thurstone model (Thurstone:27), and their variants, are parametric comparison models, in that each probability $p_{ij}$ is of the form $f(s_i, s_j)$, where $s_i$ is a ‘score’ associated with item $i$. While these parametric models yield many interesting algorithms with provable guarantees (Chen15; Jang13; Suh16Adversarial), the models enforce strong assumptions on the probabilities of incorrect pairwise comparisons that might not hold in practice (Davidson59; McLaughlin65; Tversky72; Ballinger97).
A more general class of pairwise comparison model is the strong stochastic transitivity (SST) model, which subsumes the aforementioned parametric models as special cases and has a wide range of applications in psychology and social science (see, e.g., Davidson59; McLaughlin65; Fishburn73). The SST model only enforces the following coherence assumption: if $i$ is ranked above $j$, then $i$ beats $l$ with probability at least as large as $j$ does, i.e., $p_{il} \ge p_{jl}$, for all other items $l$. Shah15Sto pioneered the algorithmic and theoretical study of ranking aggregation under SST models. For top-$K$ ranking problems, Shah15Sim proposed a counting-based algorithm, which simply orders the items by the total number of pairwise comparisons won. For a certain class of instances, this algorithm is in fact optimal: any algorithm with a constant probability of success on these instances needs roughly at least as many samples as this counting algorithm. However, this does not rule out the existence of other instances where the counting algorithm performs asymptotically worse than some other algorithm.
In this paper, we study algorithms for the top-$K$ problem from the standpoint of competitive analysis. We give an algorithm which, on any instance, needs at most $\tilde{O}(\sqrt{n})$ times as many samples as the best possible algorithm for that instance to succeed with the same probability. We further show this result is tight: for any algorithm, there are instances where that algorithm needs at least $\tilde{\Omega}(\sqrt{n})$ times as many samples as the best possible algorithm. In contrast, the counting algorithm of Shah15Sto sometimes requires $\tilde{\Omega}(n)$ times as many samples as the best possible algorithm, even when the probabilities $p_{ij}$ are bounded away from $0$ and $1$.
Our main technical tool is the introduction of a new decision problem we call domination, which captures the difficulty of solving the top-$K$ problem while being simpler to analyze directly via information-theoretic techniques. The domination problem can be thought of as a restricted one-dimensional variant of the top-$K$ problem, where the analyst is only given the outcomes of pairwise comparisons that involve item $i$ or $j$, and wishes to determine whether $i$ is ranked above $j$. Our proof of the above claims proceeds by proving analogous competitive ratio results for the domination problem, and then carefully embedding the domination problem as part of the top-$K$ problem.
1.1 Related Work
The problem of sorting a set of items from a collection of pairwise comparisons is one of the most classical problems in computer science and statistics. Many works investigate the problem of recovering the total ordering under noisy comparisons drawn from some parametric model. For the BTL model, Negahban et al. Negahban12RankCentrality propose the RankCentrality algorithm, which serves as the building block for many spectral ranking algorithms. Lu and Boutilier Craig:11 give an algorithm for sorting in the Mallows model. Rajkumar and Agarwal Rajkumar14 investigate which statistical assumptions (BTL models, generalized low-noise condition, etc.) guarantee convergence of different algorithms to the true ranking.
More recently, the problem of top-$K$ ranking has received a lot of attention. Chen and Suh Chen15, Jang et al. Jang13, and Suh et al. Suh16Adversarial all propose various spectral methods for the BTL model or a mixture of BTL models. Eriksson Eriksson13 considers a noisy observation model where comparisons deviating from the true ordering are i.i.d. with bounded probability. In Shah15Sim, Shah and Wainwright consider the general SST models and propose the counting-based algorithm, which motivates our work. The top-$K$ ranking problem is also related to best arm identification in multi-armed bandits (Bubeck:13; Jamieson:14; Zhou:14). However, in the latter problem, the samples are i.i.d. random variables rather than pairwise comparisons, and the goal is to identify the top $K$ distributions with the largest means.
This paper and the above references all belong to the non-active setting: the set of data provided to the algorithm is fixed, and there is no way for the algorithm to adaptively choose additional pairwise comparisons to query. In several applications, this property is desirable, specifically if one is using a well-established dataset or if adaptivity is costly (e.g., on some crowdsourcing platforms). Nonetheless, the problems of sorting and top-$K$ ranking are incredibly interesting in the adaptive setting as well. Several works (Ailon11; Jamieson11; Mathieu07; Braverman08) consider the adaptive noisy sorting problem with (noisy) pairwise comparisons and explore the sample complexity to recover an (approximately) correct total ordering in terms of some distance function (e.g., Kendall’s tau). In Wauthier13, Wauthier et al. propose simple weighted counting algorithms to recover an approximate total ordering from noisy pairwise comparisons. Dwork et al. Dwork01 and Ailon et al. Ailon08 consider a related Kemeny optimization problem, where the goal is to determine the total ordering that minimizes the sum of the distances to different permutations. More recently, the top-$K$ ranking problem in the active setting has been studied by Braverman et al. BMW16, where they consider the sample complexity of algorithms that use a constant number of rounds of adaptivity. All of this work takes place in much more constrained noise models than the SST model. Extending our work to the active setting is an interesting open problem.
2 Preliminaries and Problem Setup
Consider the following problem. An analyst is given a collection of items, labelled through . These items have some true ordering defined by a permutation such that for , the item labelled has a better rank than the item labelled (i.e., the item with label has a better rank than the item if and only if ). The analyst’s goal is to determine the set of the top items, i.e., .
The analyst receives samples. Each sample consists of pairwise comparisons between all pairs of items. All the pairwise comparisons are independent of each other. The outcome of the pairwise comparison between any two items is characterized by the probability matrix . For a pair of items , let be the outcome of the comparison between the item and , where means is preferred to (denoted by ) and otherwise. Further, let denote the Bernoulli random variable with mean . The outcome follows , i.e.,
The probability matrix is said to be strong stochastic transitive (SST) if it satisfies the following definition.
The probability matrix is strong stochastic transitive (SST) if
For , for all .
$P$ is shifted-skew-symmetric (i.e., $P - \frac{1}{2}$ is skew-symmetric), where $P_{ji} = 1 - P_{ij}$ and $P_{ii} = \frac{1}{2}$ for all items $i, j$.
The first condition claims that when the item has a higher rank than item (i.e., ), for any other item , we have
Many classical parametric models, such as the BTL (Bradley52; Luce59) and Thurstone (Case V) (Thurstone:27) models, are special cases of SST. More specifically, parametric models assume a score vector . They further assume that the comparison probability , where is a non-decreasing function and (e.g., in BTL models). By the property of , it is easy to verify that satisfy the conditions in Definition 2.1.
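As an illustrative sketch (not part of the original analysis), the following Python snippet builds a BTL-style comparison matrix from a score vector and checks the two SST conditions numerically. The logistic link $F(t) = 1/(1 + e^{-t})$, the function names `btl_matrix` and `is_sst`, and the convention that items are indexed from best to worst are assumptions made for this example.

```python
import math

def btl_matrix(scores):
    """Comparison probabilities P[i][j] = F(w_i - w_j) with a logistic F (assumed)."""
    n = len(scores)
    return [[1.0 / (1.0 + math.exp(-(scores[i] - scores[j]))) for j in range(n)]
            for i in range(n)]

def is_sst(P, tol=1e-12):
    """Check both SST conditions, assuming items are indexed from best to worst."""
    n = len(P)
    # Condition 1: a better-ranked item beats every opponent at least as often.
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(n):
                if P[i][k] < P[j][k] - tol:
                    return False
    # Condition 2: P[i][j] + P[j][i] = 1, which also forces P[i][i] = 1/2.
    for i in range(n):
        for j in range(n):
            if abs(P[i][j] + P[j][i] - 1.0) > tol:
                return False
    return True
```

With decreasing scores such as `[2.0, 1.0, 0.0]` the check passes; reversing the scores violates the first condition because the indexing no longer matches the ranking.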
Under the SST models, we can formally define the top- ranking problem as follows. The top- ranking problem takes the inputs , , that are known to the algorithm and the SST probability matrix that is unknown to the algorithm.
Top-K is the following algorithmic problem:
A permutation of is uniformly sampled.
The algorithm is given samples for , where each is sampled independently according to . The algorithm is also given the value of , but not or the matrix .
The algorithm succeeds if it correctly outputs the set of labels of the top items.
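The observation model can be simulated with a short Python sketch, shown here together with the counting-based rule mentioned in the introduction (order items by the total number of comparisons won). The function names are hypothetical, and the sketch assumes that each sample holds one Bernoulli outcome per unordered pair, with the reverse outcome taken as its complement. For brevity, the matrix is indexed directly by labels, so the random relabelling step is omitted.

```python
import random

def sample_comparisons(P, r, rng):
    """Draw r samples; each sample holds one comparison outcome per pair of items.

    X[i][j] = 1 means item i was preferred to item j in that sample.
    """
    n = len(P)
    samples = []
    for _ in range(r):
        X = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                X[i][j] = 1 if rng.random() < P[i][j] else 0
                X[j][i] = 1 - X[i][j]  # assumption: reverse outcome is the complement
        samples.append(X)
    return samples

def top_k_by_counting(samples, k):
    """Return the k labels with the most wins (ties broken by smaller label)."""
    n = len(samples[0])
    wins = [sum(X[i][j] for X in samples for j in range(n)) for i in range(n)]
    ranked = sorted(range(n), key=lambda i: (-wins[i], i))
    return set(ranked[:k])
```

Passing a seeded `random.Random` instance as `rng` makes a simulation reproducible. The tie-breaking rule is an arbitrary choice for this sketch.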
We note that Shah15Sim considers a slightly different observation model in which each pair is queried times. For each query, one obtains a comparison result with probability , and with probability the query is invalid. In this model, each pair will be compared times in expectation. When , it reduces to our model in Definition 2.2, where we observe exactly comparisons for each pair. Our results can be easily extended to deal with the observation model in Shah15Sim by replacing with the effective sample size, . We omit the details for the sake of simplicity.
Our primary metric of concern is the sample complexity of various algorithms; that is, the minimum number of samples an algorithm requires to succeed with a given probability. To this end, we call the triple an instance of the Top-K problem, and write to denote the minimum value such that for all , succeeds on instance with probability when given samples. When is omitted, we will take ; i.e., .
Instead of working directly with Top-K, we will spend most of our time working with a problem we call Domination, which captures the core of the difficulty of the Top-K problem. Domination is formally defined as follows.
Domination is the following algorithmic problem:
and are two vectors of probabilities that satisfy for all . are not given to the algorithm.
A random bit is sampled from . Samples (for ) are generated as follows:
Case : each is independently sampled according to and each is independently sampled according to .
Case : each is independently sampled according to and each is independently sampled according to .
The algorithm is given the samples and , but is not given the bit or the values of and .
The algorithm succeeds if it correctly outputs the value of the hidden bit .
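The sampling process in this definition can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the hidden bit is taken to be a fair coin flip, the constraint on the two vectors is taken to be `p[i] >= q[i]` in every coordinate, and the name `sample_domination` is hypothetical.

```python
import random

def sample_domination(p, q, r, rng):
    """Generate one Domination input: the hidden bit B and the samples X, Y.

    X and Y are lists of r rows, one row per sample and one 0/1 entry per coordinate.
    """
    if len(p) != len(q) or any(pi < qi for pi, qi in zip(p, q)):
        raise ValueError("expected vectors of equal length with p[i] >= q[i]")
    b = rng.randrange(2)                   # assumption: B is a fair coin flip
    px, py = (p, q) if b == 0 else (q, p)  # B = 1 swaps the roles of p and q
    X = [[1 if rng.random() < t else 0 for t in px] for _ in range(r)]
    Y = [[1 if rng.random() < t else 0 for t in py] for _ in range(r)]
    return b, X, Y
```

An algorithm under test would receive only `X` and `Y`; the returned bit `b` is kept aside to score its answer.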
As before, we are interested in the sample complexity of algorithms for Domination. We call the triple an instance of Domination, and write to be the minimum value such that for all , succeeds at solving with probability at least (similarly, we let ).
3 Main Results
There are at least two main approaches one can take to analyze the sample complexity of problems like Top-K and Domination. The first (and more common) is to bound the value of by some explicit function of the instance . This is the approach taken by Shah15Sim . They show that for some simple function (roughly, the square of the reciprocal of the absolute difference of the sums of the -th and -th rows of the matrix i.e. ), there is an algorithm such that for all instances , ; moreover this is optimal in the sense that there exists an instance such that for all algorithms , . While this is a natural approach, it leaves open the question of what the correct choice of should be; indeed, different choices of give rise to different ‘optimal’ algorithms which outperform each other on different instances.
In this paper, we take the second approach, which is to compare the sample complexity of an algorithm on an instance to the sample complexity of the best possible algorithm on that instance. Formally, let and let . An ideal algorithm would satisfy for all instances of Top-K; more generally, we are interested in bounding the ratio between and . We call this ratio the competitive ratio of the algorithm, and say that an algorithm is -competitive if . (We likewise define all the corresponding notions for Domination).
In our main upper bound result, we give a linear-time algorithm for Top-K which is $\tilde{O}(\sqrt{n})$-competitive (restatement of Corollary 7.5):
There is an algorithm for Top-K such that runs in time and on every instance of Top-K on items,
In our main lower bound result, we show that up to logarithmic factors, this competitive ratio is optimal (restatement of Theorem 8.1):
For any algorithm for Top-K, there exists an instance of Top-K on items such that
In comparison, for the counting algorithm of Shah15Sim , there exist instances such that . For example, consider the instance with
It is straightforward to show that with samples, we can learn all pairwise comparisons correctly with high probability by taking a majority vote, and therefore even sort all the elements correctly. This implies that . On the other hand, we show in Corollary 5.4 that when .
3.1 Main Techniques and Overview
We prove our main results by first proving similar results for Domination, which we defined in Definition 2.3. Intuitively, Domination captures the main hardness of Top-K while being much simpler to analyze. Once we prove upper and lower bounds for the sample complexity of Domination, we will use reductions to prove analogous results for Top-K.
We begin in Section 4, by proving a general lower bound on the sample complexity of domination. Explicitly, for a given instance of Domination, we show that where is the amount of information we can learn about the bit from one sample of pairwise comparison in each of the coordinates.
In Section 5, we proceed to design algorithms for Domination restricted to instances where for some constant . In this regime , which makes it easier to argue our algorithms are not too bad compared with the optimal one. We first consider an algorithm we call the counting algorithm (Algorithm 1), which is a Domination analogue of the counting algorithm proposed by Shah15Sim. We show that the counting algorithm has a competitive ratio of . Intuitively, the main reason the counting algorithm fails is that it treats samples from different coordinates as equally important even when they are sampled from a very unbalanced distribution (for example, ). We then consider another algorithm we call the max algorithm (Algorithm 2), which simply finds the coordinate where the numbers of ones in the two sets of samples differ the most and outputs according to the sign of that difference. We show that the max algorithm also has a competitive ratio of . Interestingly, the max algorithm fails for a different reason from the counting algorithm, namely that it does not fully use the information from all coordinates when the samples are drawn from a very balanced distribution. In fact, the max algorithm performs well whenever the counting algorithm fails and vice versa. We therefore show how to combine the counting algorithm and the max algorithm in two different ways to get two new algorithms (Algorithm 3 and Algorithm 4). We show that both of these new algorithms have a competitive ratio of , which is tight by Theorem 8.2.
In Section 6, we design algorithms for Domination in the general regime. In this regime, can be much larger than , particularly for values of and very close to $0$ or $1$. In these corner cases, the counting algorithm and max algorithm can fail very badly; we will show that even for fixed , their competitive ratios can grow arbitrarily large (Lemma 6.6 and Lemma 6.7). One main reason for this failure is that, even when , samples from coordinate could convey much more information than the samples from coordinate (consider, for example, , and ). Taking this into account, we design a new algorithm (Algorithm 5) which has a competitive ratio of in the general regime. The new algorithm still combines features from both the counting algorithm and the max algorithm, but also better estimates the importance of each coordinate. To estimate how much information each coordinate has, the new algorithm divides the samples into groups and checks how often samples from coordinate are consistent with themselves. If one coordinate has a large proportion of the total information, it uses samples from that coordinate to decide ; otherwise it takes a majority vote on samples from all coordinates.
In Section 7, we return to Top-K and present an algorithm that has a competitive ratio of , thus proving Theorem 3.1. Our algorithm works by reducing the Top-K problem to several instances of the Domination problem (see Theorem 6.5). At a high level, the algorithm tries to find the top rows by pairwise comparisons of rows, each of which can be thought of as an instance of Domination. We use algorithm to solve these Domination instances. Since we only need to make at most comparisons, if outputs the correct answer with at least probability for each comparison, then by union bound all the comparisons will be correct with probability at least . However, to find the top rows, we do not actually need to compare all the rows to each other; Lemma 7.1 shows that we can find the top rows with high probability while making only comparisons. Using this lemma, we get a linear time algorithm for solving Top-K. Finally in Lemma 7.4, we extend the lower bound for Domination proved in Lemma 4.2 to show a lower bound on the number of samples any algorithm would need on a specific instance of Top-K. Combining these results, we prove Theorem 3.1.
Finally, in Section 8, we show that the algorithms for both Domination and Top-K presented in the previous sections have the optimal competitive ratio (up to polylogarithmic factors). Specifically, we show that for any algorithm solving Domination, there exists an instance of Domination where (Theorem 8.2). We accomplish this by constructing a distribution over instances of Domination such that each instance in the support of this distribution can be solved by an algorithm with low sample complexity (Theorem 8.5), but any algorithm that succeeds over the entire distribution requires times more samples (Theorem 8.7). We then embed Domination in Top-K (similarly to Section 7) to show an analogous lower bound for Top-K (Theorem 8.1).
4 Lower bounds on the sample complexity of domination
We start by establishing lower bounds on the number of samples needed by any algorithm to succeed with constant probability on a given instance of Domination. This is controlled by the quantity , which is the amount of information we can learn about the bit given one sample of pairwise comparison between each of the coordinates of and .
Given , define
Given , define
Let be an instance of Domination. Then .
The main idea is to bound the mutual information between the samples and the correct output, and then apply Fano’s inequality. Let and . Recall that indicates the correct output and that are the samples given to the algorithm. By Fact A.6,
When , and are given, each sample ( or ) is independent of the other samples, and thus . By Fact A.7, we then have
Repeating this, we get
By Fact A.9, we have
It follows that
For any algorithm, let be its error probability on Domination. By Fano’s inequality, we have that
Since , we find that , as desired. ∎
In the following section, we will concern ourselves with instances that satisfy for some constant for all . For such instances, we can approximate by the distance between and .
For some , let . Then
Let and . Then and . We need to show that
By Fact A.10,
this implies the desired upper bound. The lower bound also holds since,
Let be an instance of Domination satisfying for all . Then
5 Domination in the well-behaved regime
We now proceed to the problem of designing algorithms for Domination which are competitive on all instances. As a warmup, we begin by considering only instances of Domination satisfying for all where is some fixed constant. This regime of instances captures much of the interesting behavior of Domination, but with the added benefit that the mutual information between the samples and behaves nicely in this regime: in particular (see Lemma 4.3). By Corollary 4.4, we have . This fact will make it easier to design algorithms for Domination which are competitive in this regime.
In Section 5.1, we give two simple algorithms (counting algorithm and max algorithm) which can solve Domination given samples which gives them a competitive ratio of . We will then show that this is tight, i.e. their competitive ratio is in Lemma 5.3 and Lemma 5.5. While the sample complexities of these two algorithms are not optimal, they have the nice property that whenever one performs badly, the other performs well. In Section 5.2, we show how to combine the counting algorithm and the max algorithm to give two different algorithms which can solve Domination using only samples i.e. they have a competitive ratio of . According to Theorem 8.2, this is the best we can do up to polylogarithmic factors.
5.1 Counting algorithm and max algorithm
We now consider two simple algorithms for Domination, which we call the counting algorithm (Algorithm 1) and the max algorithm (Algorithm 2) denoted by and respectively. We show that both algorithms require samples to solve Domination (Lemmas 5.1 and 5.2). By Corollary 4.4, we have , leading to a competitive ratio for these algorithms. We show in Lemma 5.3 and Lemma 5.5 that this is tight up to polylogarithmic factors i.e. their competitive ratio is .
Both the counting algorithm and the max algorithm begin by computing (for each coordinate ) the differences between the number of ones in the samples and samples; i.e., we compute the values . The counting algorithm decides whether to output or based on the sign of , whereas the max algorithm decides its output based on the sign of the with the largest absolute value. See Algorithms 1 and 2 for detailed pseudocode for both and .
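The two decision rules just described can be sketched in a few lines of Python. This is a hedged illustration rather than a transcription of Algorithms 1 and 2: the samples are assumed to be given as lists of 0/1 rows (one row per sample), the output convention (answer 0 when the statistic is positive, 1 otherwise) and the tie-breaking are assumptions, and the function names are hypothetical.

```python
def coordinate_differences(X, Y):
    """For each coordinate, (#ones among the X samples) - (#ones among the Y samples)."""
    n = len(X[0])
    return [sum(x[i] for x in X) - sum(y[i] for y in Y) for i in range(n)]

def counting_algorithm(X, Y):
    """Decide by the sign of the sum of all coordinate differences."""
    z = coordinate_differences(X, Y)
    return 0 if sum(z) > 0 else 1  # assumption: ties are resolved as output 1

def max_algorithm(X, Y):
    """Decide by the sign of the difference with the largest absolute value."""
    z = coordinate_differences(X, Y)
    z_max = max(z, key=abs)  # on ties in |z|, the first such coordinate is used
    return 0 if z_max > 0 else 1
```

The two rules can disagree. For `X = [[1, 0, 0], [1, 0, 0]]` and `Y = [[0, 1, 1], [0, 0, 1]]`, the differences are `[2, -1, -2]`, so the counting rule outputs 1 (the sum is negative) while the max rule outputs 0 (the first coordinate of largest magnitude is positive).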
We begin by proving upper bounds for the sample complexities of both and . In particular, both and need at most times as many samples as the best possible algorithm for any instance in this regime.
Let be an instance of Domination. Then
If further satisfies for all for some constant , then
Let be the probability that and outputs when provided with samples. By symmetry is equal to the probability that we are in the case and outputs when provided with samples. It therefore suffices to show that is at most . When ,
By the Chernoff bound,
The second part of the lemma follows from Corollary 4.4, along with the observation that . ∎
Let be an instance of Domination. Then
If further satisfies for all for some constant , then
Assume without loss of generality that , and let . Let be the event that makes an error and outputs when given samples. We can upper bound the probability of error as
We will bound each term separately. Since , by Hoeffding’s inequality,
Similarly, by Hoeffding’s inequality and the union bound,
It follows that . The second part of the lemma follows from Corollary 4.4, along with the observation that . ∎
We now show that the upper bounds we proved above are essentially tight. In particular, we demonstrate instances where both and need times as many samples as the best possible algorithms for those instances. Interestingly, on the instance where suffers, performs well, and vice versa. This fact will prove useful in the next section.
For each and each sufficiently large , there exists an instance of Domination such that the following two statements are true:
Let be an arbitrary integer between 1 and . Let be any vectors satisfying the following constraints:
For all , .
If , .
If , .
Note that . Therefore, by Lemma 5.2, , thus proving the first part of the lemma.
Now assume that . We will show that with this many samples, solves instance with probability at most , thus implying the second part of the lemma. Without loss of generality, assume that . Define the following random variables :
for and .
. and .
It is straightforward to check that for all , , and . Let
be the cdf of the standard normal distribution.