Pairwise comparison data is frequently observed in various domains, including recommender systems, website ranking, voting and social choice (see, e.g., Baltrunas et al., 2010; Dwork et al., 2001; Liu, 2009; Young, 1988; Caplin and Nalebuff, 1991). For these applications, it is of significant interest to produce a suitable ranking of the items by aggregating the outcomes of pairwise comparisons. The general problem of interest can be stated as follows. Suppose there are $n$ items to be compared and an underlying matrix $M$ of probability parameters, each entry $M_{i,j}$ of which represents the probability that item $i$ beats item $j$ if they are compared. Hence we have $M_{i,j} + M_{j,i} = 1$, and the event that item $i$ beats item $j$ in a comparison can be viewed as a Bernoulli random variable with probability $M_{i,j}$. Observing the outcomes of independent pairwise comparisons, we aim to estimate the absolute ranking of the items.
For the sake of consistency, one of course needs to impose some structure on the matrix $M$ of comparison probabilities. These structural assumptions are traditionally split between parametric and nonparametric ones. Classical parametric models include the Bradley-Terry-Luce model (Bradley and Terry, 1952; Luce, 1959) and the Thurstone model (Thurstone, 1927). These models can be recast as log-linear models, which enables the use of the statistical and computational machinery of maximum likelihood estimation in generalized linear models (Hunter, 2004; Negahban et al., 2012; Rajkumar and Agarwal, 2014; Hajek et al., 2014; Shah et al., 2015; Negahban et al., 2016, 2017).
To allow richer structures on the matrix $M$ of comparison probabilities beyond the scope of parametric models, permutation-based models such as the noisy sorting model (Braverman and Mossel, 2008, 2009) and the strong stochastic transitivity (SST) model (Chatterjee, 2015; Shah et al., 2017a) have recently become more prevalent. These models only require shape constraints on the matrix $M$ and are typically called nonparametric. In these models, the underlying ranking of items is determined by an unknown permutation $\pi^*$, and, additionally, the comparison probabilities are assumed to have a bi-isotonic structure when the items are aligned according to $\pi^*$. While permutation-based models provide ordering structures that are not captured by parametric models (Agarwal, 2016; Shah et al., 2017a), they introduce both statistical and computational barriers for estimation of the underlying ranking. These barriers are mainly due to the complexity of the discrete set of permutations. On the one hand, the complexity of the set of permutations is not well understood (see the discussion following Theorem 8 in Collier and Dalalyan, 2016), which leads to logarithmic gaps in the current statistical bounds for permutation-based models. On the other hand, it is computationally challenging to optimize over the set of permutations, so current algorithms either sacrifice nontrivial statistical performance or have impractical time complexity. In this work, we aim to address both questions for the noisy sorting model.
In practice, it is unlikely that all the items are compared to each other. To account for this limitation, a widely used scheme consists in assuming that each pairwise comparison is observed with probability $p$ (see, e.g., Chatterjee, 2015; Shah et al., 2017a). In addition to this model of missing comparisons, we study the model where $N$ pairwise comparisons are sampled uniformly at random from the $\binom{n}{2}$ pairs of the $n$ items, with replacement and independently of each other. It turns out that sampling with and without replacement yields the same rate of estimation up to a constant when the expected numbers of observations coincide.
We focus on the noisy sorting model with partial observations, under which a stronger item wins a comparison against a weaker item with probability at least $1/2 + \lambda$ where $\lambda \in (0, 1/2)$. For sampling both with and without replacement, we establish the minimax rate of learning the underlying permutation. In particular, the rate does not involve a logarithmic term, and we explain this phenomenon through a careful analysis of the metric entropy of the set of permutations equipped with the Kendall tau distance, which is of independent theoretical interest.
Moreover, we propose a multistage sorting algorithm that has time complexity . For the sampling with replacement model, we prove a theoretical guarantee on the performance of the multistage sorting algorithm, which differs from the minimax rate by only a polylogarithmic factor. In addition, the algorithm is demonstrated to perform similarly for both sampling models using simulated examples.
The noisy sorting model was proposed by Braverman and Mossel (2008). In the original paper, the optimal rate of estimation achieved by the maximum likelihood estimator (MLE) is established, and an algorithm with time complexity $O(n^C)$ is shown to find the MLE with high probability in the case of full observations, where $C$ is a large unknown constant. (Footnote 1: If the algorithm is allowed to actively choose the pairs to be compared, the sample complexity can be reduced to $O(n \log n)$. However, in the passive setting which we adopt throughout this work, the algorithm still needs on the order of $n^2$ pairwise comparisons.) Moreover, their algorithm does not have a polynomial running time if only random pairwise comparisons are observed. Our work generalizes the optimal rate to the partial observation settings by studying a variant of the MLE for the upper bound. In the model of sampling with replacement, our fast multistage sorting algorithm provably achieves a near-optimal rate of estimation. Since finding the MLE for the noisy sorting model is an instance of the NP-hard feedback arc set problem (Alon, 2006; Kenyon-Mathieu and Schudy, 2007; Ailon et al., 2008; Braverman and Mossel, 2008), our results indicate that, despite the NP-hardness of the worst-case problem, it is still possible to achieve (near-)optimal rates for the average-case statistical setting in polynomial time.
The SST model generalizes the noisy sorting model, and minimax rates in the SST model have been studied by Shah et al. (2017a). However, the upper bound specialized to noisy sorting contains an extra logarithmic factor, which this work shows to be unnecessary. Moreover, the lower bound there is based on noisy sorting models with the signal parameter $\lambda$ shrinking to zero as $n \to \infty$, while we establish a matching lower bound at any fixed $\lambda$. In addition, the algorithms of Wauthier et al. (2013), Shah et al. (2017a), and Chatterjee and Mukherjee (2016) are all statistically suboptimal for the noisy sorting model. This is partially addressed by our multistage sorting algorithm as discussed above.
In fact, both with- and without-replacement sampling models discussed in this paper are restrictive for applications where the set of observed comparisons is subject to certain structural constraints (Hajek et al., 2014; Shah et al., 2015; Negahban et al., 2017; Pananjady et al., 2017a). Obtaining sharper rates of estimation for these more complex sampling models is of significant interest but is beyond the scope of the current work.
Finally, we mention a few other lines of related work. Besides permutation-based models, low-rank structures have also been proposed by Rajkumar and Agarwal (2016) to generalize classical parametric models. Moreover, there is an extensive literature on active ranking from pairwise comparisons (see, e.g., Jamieson and Nowak, 2011; Heckel et al., 2016; Agarwal et al., 2017, and references therein), where the pairs to be compared are chosen actively and in a sequential fashion by the learner. The sequential nature of the models greatly reduces sample complexity, so we do not compare our results for passive observations to the literature on active learning. However, it is interesting to note that our multistage sorting algorithm is reminiscent of active algorithms, because it uses different batches of samples for different stages. Thus active learning algorithms could potentially be useful even for passive sampling models.
The noisy sorting model together with the two sampling models is formalized in Section 2. In Section 3, we present our main results, the minimax rate of estimation for the latent permutation and the near-optimal rate achieved by an efficient multistage sorting algorithm. To complement our theoretical findings, we inspect the empirical performance of the multistage sorting algorithm on numerical examples in Section 4. Section 5 is devoted to the study of the set of permutations equipped with the Kendall tau distance. Proofs of the main results are provided in Section 6. We discuss directions for future research in Section 7.
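For a positive integer $n$, let $[n] = \{1, \dots, n\}$. For a finite set $S$, we denote its cardinality by $|S|$. Given $a, b \in \mathbb{R}$, let $a \wedge b = \min(a, b)$ and $a \vee b = \max(a, b)$. We use $c$ and $C$, possibly with subscripts, to denote universal positive constants that may change at each appearance. For two sequences $\{u_n\}$ and $\{v_n\}$, we write $u_n \lesssim v_n$ if there exists a universal constant $C > 0$ such that $u_n \le C v_n$ for all $n$. We define the relation $u_n \gtrsim v_n$ analogously, and write $u_n \asymp v_n$ if both $u_n \lesssim v_n$ and $u_n \gtrsim v_n$ hold. Let $\mathfrak{S}_n$ denote the symmetric group on $[n]$, i.e., the set of permutations $\pi : [n] \to [n]$.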
2 Problem formulation
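The noisy sorting model can be formulated as follows. Fix an unknown permutation $\pi^*$ in the symmetric group $\mathfrak{S}_n$, which determines the underlying order of $n$ items. More precisely, $\pi^*$ orders the items from the weakest to the strongest, so that item $i$ is the $\pi^*(i)$-th weakest among the $n$ items. For a fixed $\lambda \in (0, 1/2)$, we define a class of matrices
$$\mathcal{M}_n(\lambda) = \big\{ M \in [0,1]^{n \times n} : M + M^\top = \mathbf{1}\mathbf{1}^\top, \ M_{i,j} \ge 1/2 + \lambda \text{ if } i > j \big\},$$
where $\mathbf{1}$ is the $n$-dimensional all-ones vector. In addition, we define a special matrix $M_n(\lambda) \in \mathcal{M}_n(\lambda)$ by
$$[M_n(\lambda)]_{i,j} = \begin{cases} 1/2 + \lambda & \text{if } i > j, \\ 1/2 - \lambda & \text{if } i < j, \\ 1/2 & \text{if } i = j. \end{cases}$$
Note that $M_n(\lambda)$ satisfies strong stochastic transitivity but other matrices in $\mathcal{M}_n(\lambda)$ may not. Though this observation plays a crucial role in the design of efficient algorithms, our statistical results hold for general matrices in $\mathcal{M}_n(\lambda)$.
To model pairwise comparisons, fix $M \in \mathcal{M}_n(\lambda)$ and let $M_{\pi^*(i), \pi^*(j)}$ denote the probability that item $i$ beats item $j$ when they are compared, so that a stronger item beats a weaker item with probability at least $1/2 + \lambda$. (Footnote 2: The diagonal entries of $M$ are inessential in the model as an item is not compared to itself, and they are set to $1/2$ only for concreteness.) As a result, $\lambda$ captures the signal-to-noise ratio of our problem, and our minimax results explicitly capture the dependence on this key parameter.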
2.1 Sampling models
In the noisy sorting model, suppose that for each (unordered) pair $(i, j)$ with $i \ne j$, we observe the outcomes of $N_{i,j}$ comparisons between them, and item $i$ wins a comparison against item $j$ with probability $M_{\pi^*(i), \pi^*(j)}$ independently. The set of nonnegative integers $\{N_{i,j}\}$ is determined by certain sampling models described below. We allow $N_{i,j}$ to be zero, which means that $i$ and $j$ are not compared. We collect sufficient statistics into a matrix $A$ consisting of outcomes of pairwise comparisons, by defining $A_{i,j}$ to be the number of times item $i$ beats item $j$ among the $N_{i,j}$ comparisons between $i$ and $j$. In particular, we have $A_{i,j} + A_{j,i} = N_{i,j}$ for $i \ne j$ and $A_{i,i} = 0$. Our goal is to aggregate the results of pairwise comparisons to estimate $\pi^*$, the underlying order of items.
In the full observation setup of Braverman and Mossel (2008), the number of comparisons $N_{i,j}$ between items $i$ and $j$ equals $1$ for each pair, and the total number of observations is $\binom{n}{2}$. Instead, we are interested here in the regime where the total number of observations is much smaller than $\binom{n}{2}$. We study the following two sampling models in this work:
1. Sampling without replacement. In this sampling model, instead of observing all the pairwise comparisons, we observe each pair with probability $p$ independently. Hence each $N_{i,j}$ is a Bernoulli random variable with parameter $p$, and in expectation we have $p \binom{n}{2}$ observations in total.
2. Sampling with replacement. We observe $N$ pairwise comparisons between the items, sampled uniformly and independently with replacement from the $\binom{n}{2}$ pairs.
In the sequel, we study the noisy sorting model with either of the above two sampling models. In particular, the minimax rates of estimating the underlying permutation $\pi^*$ coincide for the two sampling models if $N \asymp p n^2$, i.e., if the expected numbers of observations are of the same order.
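The two sampling models can be simulated with a short sketch. It assumes, for illustration only, that the stronger item wins with probability exactly $1/2 + \lambda$ (the special matrix rather than a general member of the class), that items are 0-indexed, and that `pi_star[i]` is the rank of item `i`. The function name and its parameters are hypothetical and not part of the original text.

```python
import random
from itertools import combinations


def sample_comparisons(pi_star, lam, *, p=None, N=None, seed=0):
    """Return a wins matrix A, where A[i][j] counts wins of item i over item j.

    pi_star[i] is the rank of item i (a larger rank means a stronger item).
    Exactly one of p (model 1, without replacement) or N (model 2, with
    replacement) must be given.
    """
    if (p is None) == (N is None):
        raise ValueError("specify exactly one of p or N")
    rng = random.Random(seed)
    n = len(pi_star)
    pairs = list(combinations(range(n), 2))
    if p is not None:
        # Model 1: each unordered pair is observed once with probability p.
        observed = [pair for pair in pairs if rng.random() < p]
    else:
        # Model 2: N pairs drawn uniformly and independently with replacement.
        observed = [rng.choice(pairs) for _ in range(N)]
    A = [[0] * n for _ in range(n)]
    for i, j in observed:
        strong, weak = (i, j) if pi_star[i] > pi_star[j] else (j, i)
        # Illustrative assumption: the stronger item wins w.p. exactly 1/2 + lam.
        if rng.random() < 0.5 + lam:
            A[strong][weak] += 1
        else:
            A[weak][strong] += 1
    return A


pi_star = [3, 0, 4, 1, 2, 5]
A_without = sample_comparisons(pi_star, 0.3, p=0.5)   # model 1
A_with = sample_comparisons(pi_star, 0.3, N=20)       # model 2
```

In model 1 every pair contributes at most one comparison, whereas in model 2 the same pair may be drawn several times, so entries of the wins matrix can exceed one.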
2.2 Measures of performance
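Having discussed the sampling and comparison models, we turn to the distance used to measure the difference between the underlying permutation $\pi^*$ and an estimated permutation $\hat\pi$. Among various distances defined on the symmetric group, we consider primarily the Kendall tau distance, i.e., the number of inversions (or discordant pairs) between permutations, defined as
$$d_{\mathrm{KT}}(\pi, \sigma) = \sum_{i, j \in [n]} \mathbb{1}\big(\pi(i) > \pi(j), \ \sigma(i) < \sigma(j)\big)$$
for $\pi, \sigma \in \mathfrak{S}_n$. Note that $0 \le d_{\mathrm{KT}}(\pi, \sigma) \le \binom{n}{2}$. The Kendall tau distance between two permutations is a natural metric on $\mathfrak{S}_n$, and it is equal to the minimum number of adjacent transpositions required to change from one permutation to another (Knuth, 1998). A closely related distance on $\mathfrak{S}_n$ is the $\ell_1$-distance, also known as Spearman's footrule, defined as
$$\|\pi - \sigma\|_1 = \sum_{i=1}^n |\pi(i) - \sigma(i)|$$
for $\pi, \sigma \in \mathfrak{S}_n$. It is well known (Diaconis and Graham, 1977) that
$$d_{\mathrm{KT}}(\pi, \sigma) \le \|\pi - \sigma\|_1 \le 2 \, d_{\mathrm{KT}}(\pi, \sigma). \tag{2.1}$$
Hence the rates of estimation in the two distances coincide. Another distance on $\mathfrak{S}_n$ we use is the $\ell_\infty$-distance, defined as
$$\|\pi - \sigma\|_\infty = \max_{i \in [n]} |\pi(i) - \sigma(i)|.$$
Note that unlike existing literature on ranking from pairwise comparisons where metrics on the probability parameters are studied, we employ here distances that measure how far an item is from its true ranking.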
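As a small illustration of the three distances, the following sketch computes them for permutations stored as 0-indexed rank lists (a representation chosen here for convenience, not taken from the original text) and checks the Diaconis-Graham comparison between the Kendall tau distance and Spearman's footrule exhaustively for five items. The function names are hypothetical.

```python
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    """Number of discordant pairs between two rank lists."""
    return sum(
        1
        for i, j in combinations(range(len(pi)), 2)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )


def footrule(pi, sigma):
    """Spearman's footrule, i.e., the l1 distance between rank lists."""
    return sum(abs(a - b) for a, b in zip(pi, sigma))


def linf(pi, sigma):
    """Largest displacement of a single item between the two rankings."""
    return max(abs(a - b) for a, b in zip(pi, sigma))


# Exhaustive check of d_KT <= footrule <= 2 * d_KT over all pairs for n = 5.
all_perms = list(permutations(range(5)))
sandwich_holds = all(
    kendall_tau(p, s) <= footrule(p, s) <= 2 * kendall_tau(p, s)
    for p in all_perms
    for s in all_perms
)

# Two adjacent swaps: two discordant pairs, four units of total displacement.
example = (kendall_tau([0, 1, 2, 3], [1, 0, 3, 2]),
           footrule([0, 1, 2, 3], [1, 0, 3, 2]),
           linf([0, 1, 2, 3], [1, 0, 3, 2]))
```

The quadratic-time Kendall tau computation is used for clarity; a merge-sort based count would be preferable for large inputs.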
3 Main results
In this section, we state our main results. Specifically, we establish the minimax rates of estimating the underlying permutation $\pi^*$ in the Kendall tau distance (and thus in the $\ell_1$-distance) for noisy sorting under both sampling models 1 and 2. The minimax estimator that we propose is intractable in general, so we complement our results with an efficient estimator of $\pi^*$ which achieves near-optimal rates in both the Kendall tau distance and the $\ell_\infty$-distance, under sampling model 2.
3.1 Minimax rates of noisy sorting
Under the noisy sorting model with latent permutation $\pi^*$ and matrix of probabilities $M$, we determine the minimax rate of estimating $\pi^*$ in the following theorem. Let $\mathbb{E}_{\pi^*, M}$ denote the expectation with respect to the probability distribution of the observations in the noisy sorting model with underlying permutation $\pi^*$ and matrix of probabilities $M$, in either sampling model.
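Theorem 3.1. Fix $\lambda \in (0, 1/2 - c]$ where $c$ is a universal positive constant. It holds that
$$\min_{\tilde\pi} \max_{\pi^* \in \mathfrak{S}_n, \, M \in \mathcal{M}_n(\lambda)} \mathbb{E}_{\pi^*, M}\big[ d_{\mathrm{KT}}(\tilde\pi, \pi^*) \big] \asymp \begin{cases} \dfrac{n}{p \lambda^2} \wedge n^2 & \text{in sampling model 1}, \\[2mm] \dfrac{n^3}{N \lambda^2} \wedge n^2 & \text{in sampling model 2}, \end{cases}$$
where the minimum is taken over all permutation estimators $\tilde\pi$ that are measurable with respect to the observations.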
The theorem establishes the minimax rates for noisy sorting, including the case of partial observations and weak signals. The upper bounds in fact hold with high probability as shown in Theorem 6.1. If the expected numbers of observations in the two sampling models 1 and 2 are of the same order, i.e., $N \asymp p n^2$, then the two rates coincide. In this sense, the two sampling models are statistically equivalent. In sampling model 1, if $p = 1$ and $\lambda$ is larger than a constant, then the rate of order $n$ recovers the upper bound proved by Braverman and Mossel (2008).
Note in particular the absence of a logarithmic factor in the rates. Naively bounding the metric entropy of the symmetric group $\mathfrak{S}_n$ by $\log(n!) \asymp n \log n$ actually yields a superfluous logarithmic term in the upper bound. To avoid it, we study the doubling dimension of $\mathfrak{S}_n$; see the discussion after Proposition 5. Closing this logarithmic gap for other problems involving latent permutations (Collier and Dalalyan, 2016; Flammarion et al., 2016; Shah et al., 2017a; Pananjady et al., 2017b) remains an open question.
The technical assumption in Theorem 3.1 is very mild, because we are interested in the “noisy” sorting model (meaning that the pairwise comparisons are noisy, or equivalently that is not close to ). In fact the requirement that be bounded away from can be lifted, in which case we establish upper and lower bounds that match up to a logarithmic factor of order , where (see Section 6).
Finally, we note that the proof of Theorem 3.1 holds even in the so-called semi-random setting (Blum and Spencer, 1995; Makarychev et al., 2013), in which observations are generated by one of the random procedures described above, but a “helpful” adversary is allowed to reverse the outcome of any comparison in which a weaker item beat a stronger item. Though these reversals appear benign at first glance, the presence of such an adversary can in fact worsen statistical rates of estimation in more brittle models such as stochastic block models and the related broadcast tree model (Moitra et al., 2016). Our results indicate that no such degradation occurs for the rates of estimation in the noisy sorting problem.
3.2 Efficient multistage sorting
The minimax upper bound in Theorem 3.1 is established using a computationally prohibitive estimator, so we now introduce an efficient estimator of the underlying permutation that can be computed in time . In this section, we prove theoretical guarantees for this estimator under the noisy sorting model with probability matrix and observations sampled with replacement according to 2 when is bounded away from zero by a universal constant. No polynomial-time algorithm was previously known to achieve near-optimal rates even in this simplified setting when pairwise comparisons are observed.
Since we aim to prove guarantees up to constants, we may assume that we have pairwise comparisons, and split them into two independent samples, each containing pairwise comparisons. The first sample is used to estimate the parameter and the second one is used to estimate the permutation .
First, we introduce a fairly simple estimator of that can be described informally as follows: first sort in increasing order the items according to the number of wins. Then for any pair for which item is ranked positions higher than item , it is very likely that item is stronger than item so that it beats item with probability . We then average the variables over all such pairs to obtain an estimator of . More formally, we further split the first sample into two subsamples, each containing pairwise comparisons. Denote by and the number of wins item has against item in the first and second subsample, respectively. The estimator is given by the following procedure:
For each , associate with item a score .
Construct a permutation by sorting the scores in increasing order, i.e., is chosen so that if , with ties broken arbitrarily.
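The first two steps above (score each item by its number of wins, then sort the scores in increasing order) can be sketched as follows. This is only an illustration of the sorting step: the function name is hypothetical, items and ranks are 0-indexed, ties are broken by item index as one arbitrary choice, and the subsequent averaging step that produces the estimate of the signal parameter is not reproduced.

```python
def rank_by_wins(A):
    """Rank items by total number of wins.

    A[i][j] is the number of times item i beat item j.
    Returns pi_hat, where pi_hat[i] is the estimated rank of item i
    (0 = weakest). Ties are broken by item index.
    """
    n = len(A)
    scores = [sum(row) for row in A]
    # sorted() is stable, so items with equal scores keep their index order.
    order = sorted(range(n), key=lambda i: scores[i])
    pi_hat = [0] * n
    for position, item in enumerate(order):
        pi_hat[item] = position
    return pi_hat


# Noiseless full comparisons for true ranks [2, 0, 1]:
# item 0 beats items 1 and 2, and item 2 beats item 1.
A = [
    [0, 1, 1],
    [0, 0, 0],
    [0, 1, 0],
]
pi_hat = rank_by_wins(A)
```

With noiseless and complete data the win counts are strictly increasing in the true rank, so this simple rule recovers the ranking exactly; with noisy or partial data it only gives a coarse initial ordering.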
Given the estimator , we now describe a multistage procedure to estimate the permutation . To recover the underlying order of items, it is equivalent to estimate the row sums which we call scores of the items, because the scores are increasing linearly if the items are placed in order. Initially, for each , we estimate the score of item by the number of wins item has. If item has a much higher score than item in the first stage, then we are confident that item is stronger than item . Hence in the second stage, we can estimate by , which is very close to the truth. For those pairs that we are not certain about,
is still estimated by its empirical version. The variance of each score is thus greatly reduced in the second stage, thereby yielding a more accurate order of the items. Then we iterate this process to obtain finer and finer estimates of the scores and the underlying order.
To present the Multistage Sorting (MS) algorithm formally, let us fix a positive integer $T$ which is the number of stages of the algorithm. We further split the second sample of $N$ comparisons into $T$ subsamples each containing $N/T$ pairwise comparisons. (Footnote 3: We assume without loss of generality that $T$ divides $N$ to ease the notation.) Similar to the data matrix $A$ for the full sample, for $t \in [T]$ we define a matrix $A^{(t)}$ by setting $A^{(t)}_{i,j}$ to be the number of wins item $i$ has against item $j$ in the $t$-th sample. The MS algorithm proceeds as follows:
For each , define , and .
At the -th stage where , compute the score of item :
then we set the threshold
and define the sets
If (3.1) does not hold, then we define , and . Note that denotes the set of items whose ranking relative to has not been determined by the algorithm at stage .
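After repeating Steps 2 and 3 for every stage $t = 1, \dots, T$, output a permutation $\hat\pi^{\mathrm{MS}}$ by sorting the final-stage scores $S_i^{(T)}$ in increasing order, i.e., $\hat\pi^{\mathrm{MS}}$ is chosen so that $\hat\pi^{\mathrm{MS}}(i) < \hat\pi^{\mathrm{MS}}(j)$ if $S_i^{(T)} < S_j^{(T)}$, with ties broken arbitrarily.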
It is clear that the time complexity of each stage of the algorithm is . Take so that the overall time complexity of the MS algorithm is only . Our main result in this section is the following guarantee on the performance of the estimator given by the MS algorithm.
Suppose that for a sufficiently large constant and that where for a constant . Then, under the noisy sorting model with sampling model 2, the following holds. With probability at least , the MS algorithm with stages outputs an estimator that satisfies
Note that the second statement follows from the first one together with (2.1). Indeed, we have
which is optimal up to a polylogarithmic factor in the regime where is bounded away from according to Theorem 3.1 (and Theorem 6.1). Therefore, the MS algorithm achieves significant computational efficiency while sacrificing little in terms of statistical performance. On the downside, it is limited to the noisy sorting model where —this assumption is necessary to exploit strong stochastic transitivity—and our analysis does not account for the dependence in .
Furthermore, although we only consider model 2 of sampling with replacement in this section, the MS algorithm can be easily modified to handle model 1 of sampling without replacement. It is much more challenging to prove analogous theoretical guarantees in this case, because we cannot split the observations into independent samples. In Section 4, however, we provide empirical evidence showing that the MS estimator has very similar performance for the two sampling models.
Our algorithm bears comparison with the algorithm proposed by Braverman and Mossel (2008). Their algorithm—which works in the full observation case —achieves the statistically optimal rate in time , where is a large positive constant depending on . Though our algorithm’s statistical performance falls short of the optimal rate by a polylogarithmic factor, it runs in time and works in the partial observation setting as long as . Note by way of comparison that Theorem 6.1 indicates that no procedure achieves nontrivial recovery unless .
To support our theoretical findings in Section 3.2, we implement the MS algorithm on synthetic instances generated from the noisy sorting model. For simplicity, we take and set in the algorithm. Theorem 3.2 predicts a scaling of the estimation error in the Kendall tau distance for model 2 of sampling with replacement, where is the number of items and is the number of pairwise comparisons. This rate is optimal up to a polylogarithmic factor according to Theorem 6.1.
In Figure 1, we plot estimation errors averaged over instances generated from the model. In the left plot, we let range from to and set . For this choice of , Theorem 3.2 predicts that and we indeed observe a near-linear scaling in that plot. In the right plot, we fix and let the proportion of observed entries, range from to . For this choice of parameters, Theorem 3.2 predicts that (recall that here is fixed), and we clearly observe a sublinear relation between and . Note that this does not contradict the lower bound since the latter is stated up to constants.
Moreover, the MS algorithm can be easily modified to work for the without replacement model 1. Namely, given the partially observed pairwise comparisons, we assign each comparison to one of the samples uniformly at random, independent of all the other assignments. After splitting the whole sample into subsamples, we execute the MS algorithm as in the previous case. In Figure 1, we take and plot the estimation errors for sampling without replacement, which closely follow the errors for observations sampled with replacement. Therefore, although it seems difficult to prove analogous guarantees on the performance of the MS algorithm applied to the without replacement model, empirically the algorithm performs very similarly for the two sampling models.
To gain further intuition about the MS algorithm, we consider the set defined in the algorithm. At stage of the algorithm, the set consists of all indices for which we are not certain about the relative order of item and item . The proof of Theorem 3.2 essentially shows that the uncertainty set is shrinking as the algorithm proceeds. To verify this intuition, in Figure 2 we plot the uncertainty regions
at stages of the MS algorithm, for and . The items are ordered according to for visibility of the region. As exhibited in the plots, the uncertainty region is indeed shrinking as the algorithm proceeds.
5 The symmetric group and inversions
Before proving the main results for the noisy sorting model, we study the metric entropy of the symmetric group with respect to the Kendall tau distance. Counting permutations subject to constraints in terms of the Kendall tau distance is of theoretical importance and has interesting applications, e.g., in coding theory (see, e.g., Barg and Mazumdar, 2010; Mazumdar et al., 2013). We present the results in terms of metric entropy, which applies readily to the noisy sorting problem and may find further applications in statistical problems involving permutations.
For and , let and denote respectively the -covering number and the -packing number of with respect to the Kendall tau distance. The following main result of this section provides bounds on the metric entropy of balls in .
Consider the ball centered at with radius . We have that for ,
We now discuss some high-level implications of Proposition 5. Note that if , the lemma states that the -metric entropy of a ball of radius in the Kendall tau distance scales as . In other words, the symmetric group equipped with the Kendall tau metric is a doubling space with doubling dimension . One of the main messages of the current work is that although , the intrinsic dimension of is , which explains the absence of logarithmic factor in the minimax rate.
To start the proof, we first recall a useful tool for counting permutations, the inversion table. Formally, the inversion table $b_1, \dots, b_n$ of a permutation $\sigma \in \mathfrak{S}_n$ is defined by
$$b_i = \sum_{j : j < i} \mathbb{1}\big(\sigma(j) > \sigma(i)\big)$$
for $i \in [n]$. Clearly, we have that $b_i \in \{0, 1, \dots, i-1\}$ and $d_{\mathrm{KT}}(\sigma, \mathrm{id}) = \sum_{i=1}^n b_i$. It is easy to reconstruct a unique permutation using an inversion table $\{b_i\}_{i=1}^n$ with $b_i \in \{0, 1, \dots, i-1\}$, so the set of inversion tables is in bijection with $\mathfrak{S}_n$ via this relation; see, e.g., Mahmoud (2000). We use this bijection to bound the number of permutations that differ from the identity by at most $k$ inversions. The following lemma appears in a different form in Barg and Mazumdar (2010). We provide a simple proof here for completeness.
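The bijection between permutations and inversion tables can be made concrete with a short sketch. It assumes the convention that entry $i$ counts the earlier positions holding a larger value, uses 0-indexed lists (so entry `i` lies between 0 and `i`), and the function names are hypothetical.

```python
from itertools import permutations


def inversion_table(sigma):
    """b[i] = number of positions j < i with sigma[j] > sigma[i]."""
    return [sum(1 for j in range(i) if sigma[j] > sigma[i])
            for i in range(len(sigma))]


def from_inversion_table(b):
    """Rebuild the unique permutation whose inversion table is b."""
    remaining = sorted(range(len(b)), reverse=True)  # unused values, largest first
    sigma = [0] * len(b)
    # Working from the last position backwards, sigma[i] must be the value
    # with exactly b[i] larger values among those still unused.
    for i in reversed(range(len(b))):
        sigma[i] = remaining.pop(b[i])
    return sigma


sigma = [2, 0, 3, 1]
b = inversion_table(sigma)            # [0, 1, 0, 2]
rebuilt = from_inversion_table(b)     # [2, 0, 3, 1]

# Round trip over all permutations of five items.
round_trip_ok = all(
    from_inversion_table(inversion_table(list(s))) == list(s)
    for s in permutations(range(5))
)
```

The entries of the table sum to the number of inversions of the permutation, which is its Kendall tau distance to the identity; for the example above the sum is 3.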
For , we have that
According to the discussion above, the cardinality , which we denote by , is equal to the number of inversion tables where such that . On the one hand, if for all , then , so a lower bound on is given by
Using Stirling’s approximation, we see that
On the other hand, if is only required to be a nonnegative integer for each , then we can use a standard “stars and bars” counting argument (Feller, 1968) to get an upper bound of the form
Taking the logarithm finishes the proof.
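The two counting arguments in the proof can be checked numerically for small cases. The sketch below counts permutations with at most `k` inversions exactly, by dynamic programming over inversion tables, and compares the count with one explicit instance of each argument. The closed forms used here are assumptions made for illustration and need not match the constants in the lemma: the lower count caps every entry at `k // n`, and the upper count relaxes the entries to arbitrary nonnegative integers (the first entry is always zero, which leaves `n - 1` free entries for the stars and bars count).

```python
from math import comb, factorial, prod


def count_ball(n, k):
    """Exact number of permutations of n items with at most k inversions."""
    counts = [1] + [0] * k           # counts[s]: tables built so far with sum s
    for i in range(1, n + 1):        # the i-th entry ranges over 0..i-1
        new = [0] * (k + 1)
        for s, c in enumerate(counts):
            if c:
                for entry in range(min(i - 1, k - s) + 1):
                    new[s + entry] += c
        counts = new
    return sum(counts)


def lower_count(n, k):
    """Tables with every entry at most k // n; their sum cannot exceed k."""
    cap = k // n
    return prod(min(i - 1, cap) + 1 for i in range(1, n + 1))


def upper_count(n, k):
    """Stars and bars: n - 1 nonnegative integers with sum at most k."""
    return comb(n + k - 1, k)


n, k = 6, 7
bounds = (lower_count(n, k), count_ball(n, k), upper_count(n, k))
```

When `k` equals the maximal number of inversions, the exact count is the factorial of `n`, which gives a quick sanity check on the dynamic program.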
We are ready to prove Proposition 5.
Proof of Proposition 5. The relation between the covering and the packing number is standard.
We employ a standard volume argument to control these numbers. Let be a -packing of so that the balls are disjoint for . Moreover, by the triangle inequality, for each . By the invariance of the Kendall tau distance under composition, Lemma 5 yields
On the other hand, if is an -net of , then the set of balls covers . By Lemma 5, we obtain
The lower bound on the packing number in Proposition 5 becomes vacuous when and are smaller than , so we complement it with the following result, which is useful for proving minimax lower bounds.
Consider the ball where . We have that
Without loss of generality, we may assume that and is even. The sparse Varshamov-Gilbert bound (Massart, 2007, Lemma 4.10) states that there exists a set of -sparse vectors in , such that and any two distinct vectors in are separated by at least in the Hamming distance. We now map every to a permutation by defining
$\pi_v(2i-1) = 2i$ and $\pi_v(2i) = 2i-1$ if $v_i = 1$, and
$\pi_v(2i-1) = 2i-1$ and $\pi_v(2i) = 2i$ if $v_i = 0$,
for . Note that because swaps at most adjacent pairs. Denote by the image of under this mapping. Since the Hamming distance between any two distinct vectors in is lower bounded by , we see that for any distinct . Thus is an -packing of . By construction, , so we can use the standard relation to complete the proof.
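The packing construction in this proof can be illustrated with a small sketch. It assumes, as one concrete reading of the argument, that each coordinate of a binary vector controls whether one disjoint pair of adjacent positions is swapped; indices are 0-based and the function names are hypothetical. Under this reading, the Kendall tau distance between two image permutations equals the Hamming distance between the vectors, which is why a well-separated set of sparse vectors yields a packing of a small ball around the identity.

```python
from itertools import combinations, product


def swap_permutation(v):
    """Map a binary vector of length n/2 to a permutation of n items.

    Coordinate i controls the adjacent pair (2i, 2i + 1): the pair is
    swapped when v[i] == 1 and left in place when v[i] == 0.
    """
    pi = []
    for i, bit in enumerate(v):
        a, b = 2 * i, 2 * i + 1
        pi.extend([b, a] if bit else [a, b])
    return pi


def kendall_tau(pi, sigma):
    """Number of discordant pairs between two rank lists."""
    return sum(
        1
        for i, j in combinations(range(len(pi)), 2)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )


def hamming(v, w):
    return sum(a != b for a, b in zip(v, w))


vectors = list(product([0, 1], repeat=4))   # all binary vectors, n = 8 items
distances_match = all(
    kendall_tau(swap_permutation(v), swap_permutation(w)) == hamming(v, w)
    for v in vectors
    for w in vectors
)
identity = list(range(8))
```

In particular, a vector with at most `k` ones is mapped to a permutation within Kendall tau distance `k` of the identity.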
6 Proofs of the main results
This section is devoted to the proofs of our main results. We start with a lemma giving useful tail bounds for the binomial distribution.
Suppose that has the Binomial distribution where and . Then for and , we have
, by the definition of the Kullback-Leibler divergence, we have
Thus we also have
Moreover, by Theorem 1 of Arratia and Gordon (1989) and symmetry, it holds that
6.1 Proof of Theorem 3.1
First, to achieve optimal upper bounds, we consider a variant of maximum likelihood estimation. Fix and define in the case of sampling model 1, and in the case of sampling model 2. If or is unknown, one may learn these scalar parameters easily from the observations and define using the estimated values. For readability, we assume that they are given to avoid these technical complications.
Let be a maximal -packing (and thus a -net) of the symmetric group with respect to . Consider the following estimator:
It is easy to see that is the MLE of over . Such an estimator is often called sieve estimator (see, e.g. Le Cam, 1986) in the statistics literature. The estimator satisfies the following upper bounds.
Consider the noisy sorting model with underlying permutation and probability matrix where . Then, with probability at least , the estimator defined in (6.3) satisfies
By integrating the tail probabilities of the above bounds, we easily obtain bounds on the expectation of the same order, which then prove the upper bounds in Theorem 3.1. One may wonder whether the rate in Theorem 6.1 can be achieved by the MLE over defined by
Proof of Theorem 6.1. We assume that is lower bounded by a constant without loss of generality, and note that the bounds of order are trivial. The proof is split into four parts to improve readability.
Since is a maximal -packing of , it is also a -net and thus there exists such that . By definition of , Canceling concordant pairs under and , we see that
Splitting the summands according to yields that
Since , we may drop the leftmost term and drop the condition in the rightmost term to obtain that
This inequality is crucial to proving that is close to with high probability.
To set up the rest of the proof, we define, for ,