Simple, Robust and Optimal Ranking from Pairwise Comparisons

12/30/2015 ∙ by Nihar B. Shah, et al. ∙ berkeley college

We consider data in the form of pairwise comparisons of n items, with the goal of precisely identifying the top k items for some value of k < n, or alternatively, recovering a ranking of all the items. We analyze the Copeland counting algorithm that ranks the items in order of the number of pairwise comparisons won, and show it has three attractive features: (a) its computational efficiency leads to speed-ups of several orders of magnitude in computation time as compared to prior work; (b) it is robust in that theoretical guarantees impose no conditions on the underlying matrix of pairwise-comparison probabilities, in contrast to some prior work that applies only to the BTL parametric model; and (c) it is an optimal method up to constant factors, meaning that it achieves the information-theoretic limits for recovering the top k-subset. We extend our results to obtain sharp guarantees for approximate recovery under the Hamming distortion metric, and more generally, to any arbitrary error requirement that satisfies a simple and natural monotonicity condition.


1 Introduction

Ranking problems involve a collection of items, and some unknown underlying total ordering of these items. In many applications, one may observe (noisy) comparisons between various pairs of items. Examples include matches between football teams in tournament play; consumers' preference ratings in marketing; and certain types of voting systems in politics. Given a set of such noisy comparisons between items, it is often of interest to find the true underlying ordering of all items, or alternatively, given some positive integer $k$, to find the subset of the $k$ most highly rated items. These two problems are the focus of this paper.

There is a substantial literature on the problem of finding approximate rankings based on noisy pairwise comparisons. A number of papers (e.g., [KMS07, BM08, Eri13]) consider models in which the probability of a pairwise comparison agreeing with the underlying order is identical across all pairs. These results break down when, for one or more pairs, the probability of agreeing with the underlying ranking either comes close to or is exactly equal to $\frac{1}{2}$. Another set of papers [Hun04, NOS12, HOX14, SPX14, SBB16] work using parametric models of pairwise comparisons, and address the problem of recovering the parameters associated to every individual item. A more recent line of work [Cha14, SBGW16, SBW16] studies a more general class of models based on the notion of strong stochastic transitivity (SST), and derives conditions on recovering the pairwise comparison probabilities themselves. However, it remains unclear whether or not these results can directly extend to tight bounds for the problem of recovery of the top $k$ items. The works [JS08, MGCV11, AS12, DIS15] consider mixture models, in which every pairwise comparison is associated to a certain individual making the comparison, and it is assumed that the preferences across individuals can be described by a low-dimensional model.

Most related to our work are the papers [WJJ13, RA14, RGLA15, CS15], which we discuss in more detail here. Wauthier et al. [WJJ13] analyze a weighted counting algorithm to recover approximate rankings; their analysis applies to a specific model in which the pairwise comparison between any pair of items remains faithful to their relative positions in the true ranking with a probability common across all pairs. They consider recovery of an approximate ranking (under Kendall’s tau and maximum displacement metrics), but do not provide results on exact recovery. As the analysis of this paper shows, their bounds are quite loose: their results are tight only when there are a total of at least comparisons. The pair of papers [RA14, RGLA15] by Rajkumar et al. consider ranking under several models and several metrics. In the part that is common with our setting, they show that the counting algorithm is consistent in terms of recovering the full ranking, which automatically implies consistency in exactly recovering the top items. They obtain upper bounds on the sample complexity in terms of a separation threshold that is identical to a parameter defined subsequently in this paper (see Section 3). However, as our analysis shows, their bounds are loose by at least an order of magnitude. They also assume a certain high-SNR condition on the probabilities, an assumption that is not imposed in our analysis.

Finally, in very recent work on this problem, Chen and Suh [CS15] proposed an algorithm called the Spectral MLE for exact recovery of the top $k$ items. They showed that, if the pairwise observations are assumed to be drawn according to the Bradley-Terry-Luce (BTL) parametric model [BT52, Luc59], the Spectral MLE algorithm recovers the items correctly with high probability under certain regularity conditions. In addition, they also show, via matching lower bounds, that their regularity conditions are tight up to constant factors. While these guarantees are attractive, it is natural to ask how such an algorithm behaves when the data is not drawn from the BTL model. In real-world instances of pairwise ranking data, it is often found that parametric models, such as the BTL model and its variants, fail to provide accurate fits (for instance, see the papers [DM59, ML65, Tve72, BW97] and references therein).

With this context, the main contribution of this paper is to analyze a classical counting-based method for ranking, often called the Copeland method [Cop51], and to show that it is simple, optimal and robust. Our analysis does not require that the data-generating mechanism follow either the BTL or other parametric assumptions, nor other regularity conditions such as stochastic transitivity. We show that the Copeland counting algorithm has the following properties:

  • Simplicity: The algorithm is simple, as it just orders the items by the number of pairwise comparisons won. As we will subsequently see, the execution time of this counting algorithm is several orders of magnitude lower as compared to prior work.

  • Optimality: We derive conditions under which the counting algorithm achieves the stated goals, and by means of matching information-theoretic lower bounds, show that these conditions are tight.

  • Robustness: The guarantees that we prove do not require any assumptions on the pairwise-comparison probabilities, and the counting algorithm performs well for various classes of data sets. In contrast, we find that the spectral MLE algorithm performs poorly when the data is not drawn from the BTL model.

In doing so, we consider three different instantiations of the problem of set-based recovery: (i) Recovering the top $k$ items perfectly; (ii) Recovering the top $k$ items allowing for a certain Hamming error tolerance; and (iii) a more general recovery problem for set families that satisfy a natural “set-monotonicity” condition. In order to tackle this third problem, we introduce a general framework that allows us to treat a variety of problems in the literature in a unified manner.

The remainder of this paper is organized as follows. We begin in Section 2 with background and a more precise formulation of the problem. Section 3 presents our main theoretical results on top-$k$ recovery under various requirements. Section 4 provides the results of experiments on both simulated and real-world data sets. We provide all proofs in Section 5. The paper concludes with a discussion in Section 6.

2 Background and problem formulation

In this section, we provide a more formal statement of the problem along with background on various types of ranking models.

2.1 Problem statement

Given an integer $n \geq 2$, we consider a collection of $n$ items, indexed by the set $[n] := \{1, \ldots, n\}$. For each pair $(i, j)$, we let $M_{ij}$ denote the probability that item $i$ wins the comparison with item $j$. We assume that each comparison necessarily results in one winner, meaning that

$$M_{ij} + M_{ji} = 1 \quad \text{for every pair } (i, j), \tag{1}$$

where we set the diagonal $M_{ii} = \frac{1}{2}$ for concreteness.

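For any item $i \in [n]$, we define an associated score $\tau_i$ as

$$\tau_i := \frac{1}{n} \sum_{j=1}^{n} M_{ij}, \tag{2}$$

where $M_{ij}$ is the probability that item $i$ beats item $j$. In words, the score $\tau_i$ of any item $i$ corresponds to the probability that item $i$ beats an item chosen uniformly at random from all $n$ items.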
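As a small illustration of the score definition (2), the following sketch computes the score vector from a matrix of pairwise probabilities. The function name `scores` and the 3-item matrix are hypothetical and chosen only for illustration; the matrix is assumed to satisfy the one-winner condition (1) with diagonal entries equal to 1/2.

```python
def scores(M):
    """Score of item i: the average of row i of the probability matrix M."""
    n = len(M)
    return [sum(row) / n for row in M]


# Hypothetical example: M[i][j] is the probability that item i beats item j.
# Off-diagonal pairs sum to one and the diagonal is set to 1/2.
M_example = [
    [0.5, 0.8, 0.9],
    [0.2, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]

tau = scores(M_example)
for i, t in enumerate(tau):
    print(f"item {i}: score {t:.4f}")
```

Because every pair of entries sums to one, the scores always add up to $n/2$, which gives a quick sanity check on any input matrix.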

Given a set of noisy pairwise comparisons, our goals are (a) to recover the $k$ items with the maximum values of their scores; and (b) to recover the full ordering of all the items as defined by the score vector. The notion of ranking items via their scores (2) generalizes the explicit rankings under popular models in the literature. Indeed, as we discuss shortly, most models of pairwise comparisons considered in the literature either implicitly or explicitly assume that the items are ranked according to their scores. Note that neither the scores nor the matrix of probabilities are assumed to be known.

More concretely, we consider a random-design observation model defined as follows. Each pair is associated with a random number of noisy comparisons, following a binomial distribution with parameters $(r, p)$, where $r \geq 1$ is the number of trials and $p \in (0, 1]$ is the probability of making a comparison on any given trial. Thus, each pair $(i, j)$ is associated with a binomial random variable with parameters $(r, p)$ that governs the number of comparisons between the pair of items. We assume that the observation sequences for different pairs are independent. Note that in the special case $p = 1$, this random binomial model reduces to the case in which we observe exactly $r$ observations of each pair; in the special case $r = 1$, the set of pairs compared forms an Erdős-Rényi random graph.
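The random-design observation model can be simulated in a few lines. This is a minimal sketch, not the authors' code: the function name `sample_comparisons`, the `(winner, loser)` output format and the `seed` argument are assumptions made for illustration. For every pair and every trial, a comparison is made with the stated probability, and if it is made, the first item of the pair wins with its pairwise probability.

```python
import random


def sample_comparisons(M, r, p, seed=0):
    """Simulate the random-design model and return (winner, loser) tuples.

    M[i][j] is the probability that item i beats item j, r is the number of
    trials per pair, and p is the probability that a comparison is actually
    made on a given trial.
    """
    rng = random.Random(seed)
    n = len(M)
    outcomes = []
    for i in range(n):
        for j in range(i + 1, n):
            for _ in range(r):
                if rng.random() >= p:
                    continue  # no comparison is made on this trial
                if rng.random() < M[i][j]:
                    outcomes.append((i, j))
                else:
                    outcomes.append((j, i))
    return outcomes
```

With `p = 1.0` every pair is compared exactly `r` times, matching the first special case mentioned above.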

In this paper, we begin in Section 3.1 by analyzing the problem of exact recovery. More precisely, for a given matrix $M$ of pairwise probabilities, suppose that we let $S^*_k$ denote the (unknown) set of $k$ items with the largest values of their respective scores, assumed to be unique for concreteness.

Given noisy observations specified by the pairwise probabilities $M$, our goal is to establish conditions under which there exists some algorithm that identifies a set $\widehat{S}_k$ of $k$ items based on the outcomes of various comparisons such that the probability $\mathbb{P}[\widehat{S}_k = S^*_k]$ is very close to one. In the case of recovering the full ranking, our goal is to identify conditions that ensure that the probability of recovering the true ordering of all $n$ items is close to one.

In Section 3.2, we consider the problem of recovering a set of $k$ items that approximates the true top-$k$ set $S^*_k$ with a minimal Hamming error. For any two subsets $A, B$ of $[n] = \{1, \ldots, n\}$, we define their Hamming distance $D_H(A, B)$, also referred to as their Hamming error, to be the number of items that belong to exactly one of the two sets, that is,

$$D_H(A, B) = \left| (A \cup B) \setminus (A \cap B) \right|. \tag{3}$$

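For a given user-defined tolerance parameter $h \geq 0$, we derive conditions that ensure that $D_H(\widehat{S}_k, S^*_k) \leq 2h$ with high probability, where $\widehat{S}_k$ is the estimated set and $S^*_k$ is the true set of top $k$ items.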
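The Hamming distance between two sets, as described in words above (the number of items that belong to exactly one of the two sets), is the size of their symmetric difference. A minimal sketch, with the hypothetical function name `hamming_distance`:

```python
def hamming_distance(a, b):
    """Number of items that belong to exactly one of the two sets."""
    return len(set(a) ^ set(b))


true_top = {1, 2, 3}
estimate = {2, 3, 4}
print(hamming_distance(true_top, estimate))
```

For two sets of the same size $k$, this distance equals $2(k - |A \cap B|)$, so it is always even and never exceeds $2k$.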

Finally, we generalize our results to the problem of satisfying a general class of requirements on set families. These requirements are specified in terms of which $k$-sized subsets of the items are allowed, and are required to satisfy only one natural condition, that of set-monotonicity, meaning that replacing an item in an allowed set with a higher ranked item should also be allowed. See Section 3.3 for more details on this general framework.

2.2 A range of pairwise comparison models

To be clear, our work makes no assumptions on the form of the pairwise comparison probabilities. However, so as to put our work in context of the literature, let us briefly review some standard models used for pairwise comparison data.

Parametric models:

A broad class of parametric models, including the Bradley-Terry-Luce (BTL) model as a special case [BT52, Luc59], are based on assuming the existence of a “quality” parameter $w_i \in \mathbb{R}$ for each item $i$, and requiring that the probability of an item beating another is a specific function of the difference between their values. In the BTL model, the probability $M_{ij}$ that $i$ beats $j$ is given by the logistic model

$$M_{ij} = \frac{1}{1 + e^{-(w_i - w_j)}}. \tag{4a}$$

More generally, parametric models assume that the pairwise comparison probabilities take the form

$$M_{ij} = F(w_i - w_j), \tag{4b}$$

where $F : \mathbb{R} \to [0, 1]$ is some strictly increasing cumulative distribution function.

By construction, any parametric model has the following property: if the quality parameters satisfy $w_i > w_j$ for some pair of items $(i, j)$, then we are also guaranteed that $M_{i\ell} > M_{j\ell}$ for every item $\ell$. As a consequence, we are guaranteed that the scores satisfy $\tau_i > \tau_j$, which implies that the ordering of the items in terms of their quality vector $w$ is identical to their ordering in terms of the score vector $\tau$. Consequently, if the data is actually drawn from a parametric model, then recovering the top $k$ items according to their scores is the same as recovering the top $k$ items according to their respective quality parameters.
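The claim that a parametric model orders items identically by quality and by score can be checked numerically. The sketch below builds a BTL matrix from the logistic form, assuming the probability that item i beats item j is $1/(1 + e^{-(w_i - w_j)})$; the helper names and the quality vector are hypothetical.

```python
import math


def btl_matrix(w):
    """Pairwise probabilities under the logistic (BTL) model for qualities w."""
    n = len(w)
    return [
        [1.0 / (1.0 + math.exp(-(w[i] - w[j]))) for j in range(n)]
        for i in range(n)
    ]


def scores(M):
    """Score of item i: the average of row i of M."""
    n = len(M)
    return [sum(row) / n for row in M]


def order_by(values):
    """Indices sorted from the largest value to the smallest."""
    return sorted(range(len(values)), key=lambda i: -values[i])


w_example = [2.0, 0.5, -1.0, 0.0]   # hypothetical quality parameters
M_btl = btl_matrix(w_example)
print(order_by(w_example))
print(order_by(scores(M_btl)))
```

Both printed orderings coincide, as the monotonicity argument above predicts.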

Strong Stochastic Transitivity (SST) class:

The class of strong stochastic transitivity (SST) models is a superset of parametric models [SBGW16]. It does not assume the existence of a quality vector, nor does it assume any specific form of the probabilities as in equation (4a). Instead, the SST class is defined by assuming the existence of a total ordering of the items, and imposing the inequality constraints $M_{i\ell} \geq M_{j\ell}$ on the pairwise probabilities for every pair of items $(i, j)$ where $i$ is ranked above $j$ in the ordering, and every item $\ell$. One can verify that an ordering by the scores of the items leads to an ordering of the items that is consistent with that defined by the SST class.

Thus, we see that in a broad class of models for pairwise ranking, the total ordering defined by the score vectors (2) coincides with the underlying ordering used to define the models. In this paper, we analyze the performance of a counting algorithm, without imposing any modeling conditions on the family of pairwise probabilities. The next three sections establish theoretical guarantees on the recovery of the top $k$ items under various requirements.

2.3 Copeland counting algorithm

The analysis of this paper focuses on a simple counting-based algorithm, often called the Copeland method [Cop51]. It can also be viewed as a special case of the Borda count method [dB81], which applies more generally to observations that consist of rankings of two or more items. Here we describe how this method applies to the random-design observation model introduced earlier.

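More precisely, for each distinct $i, j \in [n]$ and every integer $\ell \in [r]$ indexing the $r$ trials, let $Y^{(\ell)}_{ij} \in \{-1, 0, +1\}$ represent the outcome of the $\ell$-th comparison between the pair $i$ and $j$, defined as

$$Y^{(\ell)}_{ij} = \begin{cases} 0 & \text{if no comparison between } (i, j) \text{ is made in trial } \ell, \\ +1 & \text{if the comparison is made and item } i \text{ beats item } j, \\ -1 & \text{if the comparison is made and item } j \text{ beats item } i. \end{cases} \tag{5}$$

Note that this definition ensures that $Y^{(\ell)}_{ij} = -Y^{(\ell)}_{ji}$. For $i \in [n]$, the quantity

$$N_i := \sum_{j \in [n] \setminus \{i\}} \sum_{\ell \in [r]} \mathbf{1}\big[ Y^{(\ell)}_{ij} = 1 \big] \tag{6}$$

corresponds to the number of pairwise comparisons won by item $i$. Here we use $\mathbf{1}[\cdot]$ to denote the indicator function that takes the value $1$ if its argument is true, and the value $0$ otherwise. For each integer $k \in [n]$, the vector $(N_1, \ldots, N_n)$ of number of pairwise wins defines a $k$-sized subset

$$\widetilde{S}_k = \big\{ i \in [n] \;\big|\; N_i \text{ is among the } k \text{ highest numbers of pairwise wins} \big\}, \tag{7}$$

corresponding to the set of $k$ items with the largest values of $N_i$. Otherwise stated, the set $\widetilde{S}_k$ corresponds to the rank statistics of the top $k$ items in the pairwise win ordering. (If there are any ties, we resolve them by choosing the indices with the smallest value of $i$.)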
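The counting procedure described above can be sketched as follows. The input format (a list of `(winner, loser)` pairs, one per comparison actually made) and the function names are assumptions made for this illustration; ties are broken in favour of the smaller index, as stated in the text.

```python
def count_wins(n, outcomes):
    """Number of pairwise comparisons won by each of the n items."""
    wins = [0] * n
    for winner, _loser in outcomes:
        wins[winner] += 1
    return wins


def top_k_by_wins(n, outcomes, k):
    """Return the top-k set and the full ordering by number of wins.

    Ties are resolved by preferring the item with the smaller index.
    """
    wins = count_wins(n, outcomes)
    order = sorted(range(n), key=lambda i: (-wins[i], i))
    return set(order[:k]), order


# Hypothetical observations among three items.
observed = [(0, 1), (0, 2), (1, 2), (2, 1), (0, 1)]
top_set, ranking = top_k_by_wins(3, observed, k=2)
print(top_set, ranking)
```

The whole computation is a single pass over the observations followed by one sort, which is the source of the simplicity emphasized in the paper.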

3 Main results

In this section, we present our main theoretical results on top-$k$ recovery under the three settings described earlier. Note that the three settings are ordered in terms of increasing generality, with the advantage that the least general setting leads to the simplest form of theoretical claim.

3.1 Thresholds for exact recovery of the top $k$ items

We begin with the goal of exactly recovering the $k$ top-ranked items. As one might expect, the difficulty of this problem turns out to depend on the degree of separation between the top $k$ items and the remaining items. More precisely, let us use $(k)$ and $(k+1)$ to denote the indices of the items that are ranked $k$-th and $(k+1)$-th respectively. With this notation, the $k$-separation threshold $\Delta_k$ is given by

$$\Delta_k := \tau_{(k)} - \tau_{(k+1)} = \frac{1}{n} \sum_{i=1}^{n} M_{(k)i} - \frac{1}{n} \sum_{i=1}^{n} M_{(k+1)i}. \tag{8}$$

In words, the quantity $\Delta_k$ is the difference in the probability of item $(k)$ beating another item chosen uniformly at random, versus the same probability for item $(k+1)$.
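The separation threshold described in words above is the gap between the $k$-th and $(k+1)$-th largest scores. A minimal sketch, assuming each score is the row average of the probability matrix; the function name and the example matrix are hypothetical.

```python
def separation_threshold(M, k):
    """Gap between the k-th and (k+1)-th largest scores, for 1 <= k < n."""
    n = len(M)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    sorted_scores = sorted((sum(row) / n for row in M), reverse=True)
    return sorted_scores[k - 1] - sorted_scores[k]


# Hypothetical 3-item matrix: M[i][j] is the probability that i beats j.
M_example = [
    [0.5, 0.8, 0.9],
    [0.2, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
print(separation_threshold(M_example, 1))
print(separation_threshold(M_example, 2))
```

A small gap for a given $k$ signals that the $k$-th and $(k+1)$-th items are hard to tell apart from comparisons alone.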

As shown by the following theorem, success or failure in recovering the top $k$ entries is determined by the size of the separation threshold $\Delta_k$ relative to the number of items $n$, observation probability $p$ and number of repetitions $r$. In particular, consider the family of matrices

(9)

To simplify notation, we often adopt as a convenient shorthand for this set, where its dependence on should be understood implicitly.

With this notation, the achievable result in part (a) of the following theorem is based on the estimator that returns the set of the $k$ items defined by the number of pairwise comparisons won, as defined in equation (7). On the other hand, the lower bound in part (b) applies to any estimator, meaning any measurable function of the observations.

Theorem 1.
  1. For any , the maximum pairwise win estimator from equation (7) satisfies

    (10a)
  2. Conversely, suppose that and . Then for any , the error probability of any estimator is lower bounded as

    (10b)

Remarks: First, it is important to note that the negative result in part (b) holds even if the supremum is further restricted to a particular parametric sub-class of , such as the pairwise comparison matrices generated by the BTL model, or by the SST model. Our proof of the lower bound for exact recovery is based on a generalization of a construction introduced by Chen and Suh [CS15], one adapted to the general definition (8) of the separation threshold .

Second, we note that in the regime , standard results from random graph theory [ER60] can be used to show that there are at least items (in expectation) that are never compared to any other item. Of course, estimating the rank is impossible in this pathological case, so we omit it from consideration.

Third, the two parts of the theorem in conjunction show that the counting algorithm is essentially optimal. The only room for improvement is in the difference between the value of in the achievable result, and the value in the lower bound.

Theorem 1 can also be used to derive guarantees for recovery of other functions of the underlying ranking. Here we consider the problem of identifying the ranking of all items, say denoted by the permutation . In this case, we require that each of the separations are suitably lower bounded: more precisely, we study models that belong to the intersection .

Corollary 1.

Let be the permutation of the items specified by the number of pairwise comparisons won. Then for any , we have

Moreover, the separation condition on that defines the set is unimprovable beyond constant factors.

This corollary follows from the equivalence between correct recovery of the ranking and recovering the top items for every value of .
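The equivalence invoked here can be made concrete: a full ranking determines the top-$k$ set for every $k$, and conversely a nested sequence of top-$k$ sets determines the ranking, since each set adds exactly one new item. The following sketch illustrates the second direction; the function name is hypothetical.

```python
def ranking_from_top_sets(top_sets):
    """Rebuild a ranking from its top-1, top-2, ..., top-n sets.

    top_sets[k - 1] must be the set of the k best items; consecutive sets
    must be nested and differ by exactly one item.
    """
    ranking = []
    previous = set()
    for current in top_sets:
        current = set(current)
        added = current - previous
        if not previous <= current or len(added) != 1:
            raise ValueError("top-k sets must be nested and grow by one item")
        ranking.append(next(iter(added)))
        previous = current
    return ranking


order = [2, 0, 1]                                  # hypothetical true ranking
nested = [set(order[:k]) for k in range(1, len(order) + 1)]
print(ranking_from_top_sets(nested))
```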

Detailed comparison to related work:

In the remainder of this subsection, we make a detailed comparison to the related works [WJJ13, RA14, RGLA15, CS15] that we briefly discussed earlier in Section 1.

Wauthier et al. [WJJ13] analyze a weighted counting algorithm for approximate recovery of rankings; they work under a model in which whenever item is ranked above item in an assumed underlying ordering. Here the parameter is independent of , and as a consequence, the best ranked item is assumed to be as likely to beat the worst item as it is to beat the second ranked item, for instance. They analyze approximate ranking under Kendall tau and maximum displacement metrics. In order to have a displacement upper bounded by some , their bounds require the order of pairwise comparisons. In comparison, our model is more general in that we do not impose the -condition on the pairwise probabilities. When specialized to the -model, the quantities in our analysis take the form , and Corollary 1 shows that observations are sufficient to recover the exact total ordering. Thus, for any constant , Corollary 1 guarantees recovery with a multiplicative factor of order smaller than that established by Wauthier et al. [WJJ13].

The pair of papers [RA14, RGLA15] by Rajkumar et al. consider ranking under several models and several metrics. For the subset of their models common with our setting (namely, Bradley-Terry-Luce and the so-called low noise models), they show that the counting algorithm is consistent in terms of recovering the full ranking or the top subset of items. The guarantees are obtained under a low-noise assumption: namely, that the probability of any item beating is at least whenever item is ranked higher than item in an assumed underlying ordering. Their guarantees are based on a sample size of at least , where is a parameter lower bounded as . Once again, our setting allows for the parameter to be arbitrarily close to zero, and furthermore as one can see from the discussion above, our bounds are much stronger. Moreover, while Rajkumar et al. focus on upper bounds alone, we also prove matching lower bounds on sample complexity showing that our results are unimprovable beyond constant factors. It should be noted that Rajkumar et al. also provide results for other types of ranking problems that lie outside the class of models treated in the current paper.

Most recently, Chen and Suh [CS15] show that if the pairwise observations are assumed to be drawn according to the Bradley-Terry-Luce (BTL) parametric model (4a), then their proposed Spectral MLE algorithm recovers the items correctly with high probability when a certain separation condition on the parameters of the BTL model is satisfied. In addition, they also show, via matching lower bounds, that this separation condition is tight up to constant factors. In real-world instances of pairwise ranking data, it is often found that parametric models, such as the BTL model and its variants, fail to provide accurate fits [DM59, ML65, Tve72, BW97]. Our results make no such assumptions on the noise, and furthermore, our notion of the ordering of the items in terms of their scores (2) strictly generalizes the notion of the ordering with respect to the BTL parameters. In empirical evaluations presented subsequently, we see that the counting algorithm is significantly more robust to various kinds of noise, and takes several orders of magnitude less time to compute.

Finally, in addition to the notion of exact recovery considered so far, in the next two subsections we also derive tight guarantees for the Hamming error metric and more general metrics inspired by the requirements of many relevant applications [IBS08, MTW05, BO03, MAEA05, KS06, FLN03].

3.2 Approximate recovery under Hamming error

In the previous section, we analyzed performance in terms of exactly recovering the $k$-ranked subset. Although exact recovery is suitable for some applications (e.g., a setting with high stakes, in which any single error has a large price), there are other settings in which it may be acceptable to return a subset that is “close” to the correct $k$-ranked subset. In this section, we analyze this problem of approximate recovery when closeness is measured under the Hamming error. More precisely, for a given threshold $h \in \{0, 1, \ldots, k-1\}$, suppose that our goal is to output a $k$-sized set $\widehat{S}_k$ such that its Hamming distance to the set $S^*_k$ of the true top $k$ items, as defined in equation (3), is bounded as

$$D_H(\widehat{S}_k, S^*_k) \leq 2h. \tag{11}$$

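Our goal is to establish conditions under which it is possible (or impossible) to return an estimate $\widehat{S}_k$ satisfying the bound (11) with high probability. (Footnote 1: The requirement $h < k$ on the threshold $h$ is sensible because if $h \geq k$, the problem is trivial: any two $k$-sized sets $\widehat{S}_k$ and $S^*_k$ satisfy the bound $D_H(\widehat{S}_k, S^*_k) \leq 2k \leq 2h$.)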

As before, we use $(1), \ldots, (n)$ to denote the permutation of the $n$ items in decreasing order of their scores. With this notation, the following quantity plays a central role in our analysis:

$$\Delta_{k,h} := \tau_{(k-h)} - \tau_{(k+h+1)}. \tag{12a}$$

Observe that $\Delta_{k,h}$ is a generalization of the quantity $\Delta_k$ defined previously in equation (8); more precisely, the quantity $\Delta_k$ corresponds to $\Delta_{k,h}$ with $h = 0$. We then define a generalization of the family from equation (9), namely
(12b)

As before, we frequently adopt the shorthand , with the dependence on being understood implicitly.

Theorem 2.
  1. For any , the maximum pairwise win set satisfies

    (13a)
  2. Conversely, in the regime and for given constants , suppose that . Then for any , any estimator has error at least

    (13b)

    for all larger than a constant .

This result is similar to that of Theorem 1, except that the relaxation of the exact recovery condition allows for a less constrained definition of the separation threshold . As with Theorem 1, the lower bound in part (b) applies even if probability matrix is restricted to lie in a parametric model (such as the BTL model), or the more general SST class. The counting algorithm is thus optimal for estimation under the relaxed Hamming metric as well.

Finally, it is worth making a few comments about the constants appearing in these claims. We can weaken the lower bound on required in Theorem 2(a) at the expense of a lower probability of success; for instance, if we instead require that , then the probability of error is guaranteed to be at most . Subsequently in the paper, we provide the results of simulations with items and . On the other hand, in Theorem 2(b), if we impose the stronger upper bound , then we can remove the condition .

3.3 An abstract form of $k$-set recovery

In earlier sections, we investigated recovery of the top $k$ items either exactly or under a Hamming error. Exact recovery may be quite strict for certain applications, whereas the property of Hamming error allowing for a few of the top $k$ items to be replaced by arbitrary items may be undesirable. Indeed, many applications have requirements that go beyond these metrics; for instance, see the papers [IBS08, MTW05, BO03, MAEA05, KS06, FLN03] and references therein for some examples. In this section, we generalize the notion of exact or Hamming-error recovery in order to accommodate a fairly general class of requirements.

Both the exact and approximate Hamming recovery settings require the estimator to output a set of $k$ items that is either exactly or approximately equal to the true set of top $k$ items. When is the estimate deemed successful? One way to think about the problem is as follows. The specified requirement of exact or approximate Hamming recovery is associated to a set of $k$-sized subsets of the $n$ possible ranks. The estimator is deemed successful if the set of true ranks of the chosen $k$ items equals one of these subsets. In our notion of generalized recovery, we refer to such sets as allowed sets. For example, in the case $k = 3$, we might say that the set $\{1, 4, 10\}$ is allowed, meaning that an output consisting of the “first”, “fourth” and “tenth” ranked items is considered correct.

In more generality, let $\mathfrak{S}$ denote a family of $k$-sized subsets of $[n]$, which we refer to as the family of allowed sets. Notice that any allowed set is defined by the positions of the items in the true ordering and not the items themselves. (Footnote 2: In case of two or more items with identical scores, the choice of any of these items is considered valid.) Once some true underlying ordering of the items is fixed, each element of the family $\mathfrak{S}$ then specifies a set of the items themselves. We use these two interpretations depending on the context: the definition in terms of positions to specify the requirements, and the definition in terms of the items to evaluate an estimator for a given underlying probability matrix $M$.

We let $\widehat{S}$ denote a $k$-set estimate, meaning a function that given a set of observations as input, returns a $k$-sized subset of $[n]$ as output.

Definition 1 ($\mathfrak{S}$-respecting estimators).

For any family $\mathfrak{S}$ of allowed sets, a $k$-set estimate $\widehat{S}$ respects its structure if the set of $k$ positions of the items in $\widehat{S}$ belongs to the set family $\mathfrak{S}$.

Our goal is to determine conditions on the set family under which there exist estimators that respect its structure. In order to illustrate this definition, let us return to the examples treated thus far:

Example 1 (Exact and approximate Hamming recovery).

The requirement of exact recovery of the top $k$ items has the family $\mathfrak{S}$ of allowed sets consisting of exactly one set, the set of the top $k$ positions $\{1, \ldots, k\}$. In the case of recovery with a Hamming error at most $2h$, the set of all allowed sets consists of all $k$-sized subsets of $[n]$ that contain at least $(k - h)$ positions in the top $k$ positions. For instance, in the case , and , we have

Apart from these two requirements, there are several other requirements for top-$k$ recovery popular in the literature [CCF01, FLN03, BO03, MTW05, MAEA05, KS06, IBS08]. Let us illustrate them with another example:

Example 2.

Let denote the true underlying ordering of the items. The following are four popular requirements on the set for top- identification, with respect to the true permutation , for a pre-specified parameter .

  (i) All items in the set must be contained within the top entries:

    (14a)
  (ii) The rank of any item in the set must lie within a multiplicative factor of the rank of any item not in the set :

    (14b)
  (iii) The rank of any item in the set must lie within an additive factor of the rank of any item not in the set :

    (14c)
  (iv) The sum of the ranks of the items in the set must be contained within a factor of the sums of ranks of the top entries:

    (14d)

Note that each of these requirements reduces to the exact recovery requirement when . Moreover, each of these requirements can be rephrased in terms of families of allowed sets. For instance, if we focus on requirement (i), then any -sized subset of the top positions is an allowable set.

In this paper, we derive conditions that govern -set recovery for allowable set systems that satisfy a natural “monotonicity” condition. Informally, the monotonicity condition requires that the set of items resulting from replacing an item in an allowed set with a higher ranked item must also be an allowed set. More precisely, for any set , let be the set defined by all of its monotone transformations—that is

Using this notation, we have the following:

Definition 2 (Monotonic set systems).

The set of allowed sets is a monotonic set system if

(15)

One can verify that condition (15) is satisfied by the settings of exact and Hamming-error recovery, as discussed in Example 1. The condition is also satisfied by all four requirements discussed in Example 2.
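The informal monotonicity condition (replacing an item in an allowed set with a higher ranked item must again give an allowed set) can be checked by brute force on small families. The sketch below works with positions 1 through n, where a smaller position means a higher rank, and tests closure under every single replacement; the function name and the example families are hypothetical, and this single-swap check is an illustration of the informal statement rather than a transcription of condition (15).

```python
from itertools import combinations


def is_monotonic(family):
    """Check closure under replacing a position by a better unused position."""
    allowed = {frozenset(s) for s in family}
    for s in allowed:
        for position in s:
            for better in range(1, position):
                if better in s:
                    continue
                swapped = (s - {position}) | {better}
                if swapped not in allowed:
                    return False
    return True


# Exact recovery of the top 2 positions: a single allowed set.
exact_family = [{1, 2}]

# Hypothetical Hamming-style family for n = 4, k = 2: every 2-subset that
# contains at least one of the top 2 positions.
hamming_family = [
    set(c) for c in combinations(range(1, 5), 2) if set(c) & {1, 2}
]

print(is_monotonic(exact_family), is_monotonic(hamming_family))
print(is_monotonic([{1, 3}]))
```

The last family fails the check because replacing position 3 by the better position 2 yields the set {1, 2}, which is not in the family.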

The following theorem establishes conditions under which one can (or cannot) produce an estimator that respects an allowable set requirement. In order to state it, recall the score , as previously defined in equation (2) for each . For notational convenience, we also define for every . Consider any monotonic family of allowed sets , and for some integer , let such that . For every , let denote the entries of . We then define the critical threshold based on the scores:

(16)

The term is a further generalization of the quantities and defined in earlier sections. We also define a generalization of the families and as

(17)

As before, we use the shorthand , with the dependence on being understood implicitly.

Theorem 3.

Consider any allowable set requirement specified by a monotonic set class .

  1. For any , the maximum pairwise win set satisfies

  2. Conversely, in the regime , and for given constants , suppose that and . Then for any smaller than a constant , any estimator has error at least

    (18)

    for all larger than a constant .

A few remarks on the lower bound are in order. First, the lower bound continues to hold even if the probability matrix is restricted to follow a parametric model such as BTL or restricted to lie in the SST class. Second, in terms of the threshold for , the lower bound holds with . Third, it is worth noting that one must necessarily impose some conditions for the lower bound, along the lines of those required in Theorem 3(b) for the allowable sets to be “interesting” enough.

As a concrete illustration, consider the requirement defined by the parameters , and . For , this requirement satisfies the condition but violates the condition . Now, a selection of items made uniformly at random (independent of the data) satisfies this allowable set requirement with probability . Given the success of such a random selection algorithm in this parameter regime, we see that the lower bounds therefore cannot be universal, but must require some conditions on the allowable sets.

4 Simulations and experiments

In this section, we empirically evaluate the performance of the counting algorithm and compare it with the Spectral MLE algorithm via simulations on synthetic data, as well as experiments using datasets from the Amazon Mechanical Turk crowdsourcing platform.

4.1 Simulated data

Figure 1: Simulation results comparing Spectral MLE and the counting algorithm in terms of error rates for exact recovery of the top items, and computation time. (a) Histogram of the fraction of instances in which each algorithm failed to recover the items correctly, with each bar being the average value across trials. The counting algorithm has 0% error across all problems, while Spectral MLE is accurate for parametric models (BTL, Thurstone), but increasingly inaccurate for other models. (b) Histogram plots of the maximum computation time taken by the counting algorithm and the minimum computation time taken by Spectral MLE across all trials. Even though this maximum-to-minimum comparison is unfair to the counting algorithm, the counting algorithm still requires five or more orders of magnitude less computation.

We begin with simulations using synthetically generated data with items and observation probability , and with pairwise comparison models ranging over six possible types. Panel (a) in Figure 1 provides a histogram plot of the associated error rates (with a bar for each one of these six models) in recovering the items for the counting algorithm versus the Spectral MLE algorithm. Each bar corresponds to the average over trials. Panel (b) compares the CPU times of the two algorithms. The value of (and in turn, the value of ) in the first five models is as derived in Section 3.1. In more detail, the six model types are given by:

  1. Bradley-Terry-Luce (BTL) model: Recall that the theoretical guarantees for the Spectral MLE algorithm [CS15] are applicable to data that is generated from the BTL model (4a), and as guaranteed, the Spectral MLE algorithm gives a accuracy under this model. The counting algorithm also obtains a accuracy, but importantly, the counting algorithm requires a computational time that is five orders of magnitude lower than that of Spectral MLE.

  2. Thurstone model: The Thurstone model [Thu27] is another parametric model, with the function in equation (4b) set as the cumulative distribution function of the standard Gaussian distribution. Both Spectral MLE and the counting algorithm gave accuracy under this model.

  3. BTL model with one (non-transitive) outlier: This model is identical to BTL, with one modification. Comparisons among of the items follow the BTL model as before, but the remaining item always beats the first items and always loses to each of the other items. We see that the counting algorithm continues to achieve an accuracy of as guaranteed by Theorem 1. The departure from the BTL model, however, prevents the Spectral MLE algorithm from identifying the top items.

  4. Strong stochastic transitivity (SST) model: We simulate the “independent diagonals” construction of [SBGW16] in the SST class. Spectral MLE is often unsuccessful in recovering the top items, while the counting algorithm always succeeds.

  5. Mixture of BTL models: Consider two sets of people with opposing preferences. The first set of people have a certain ordering of the items in their mind and their preferences follow a BTL model under this ordering. The second set of people have the opposite ordering, and their preferences also follow a BTL model under this opposite ordering. The overall preference probabilities are a mixture between these two sets of people. In the simulations, we observe that the counting algorithm is always successful while the Spectral MLE method often fails.

  6. BTL with violation of separation condition: We simulate the BTL model, but with a choice of parameter small enough that the value of is about one-tenth of its recommended value in Section 3.1. We observe that the counting algorithm incurs lower errors than the Spectral MLE algorithm, thereby demonstrating its robustness.
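To make the simulation setup concrete, the following is a minimal sketch of one trial under the BTL model, followed by the counting rule. It is not the code used for Figure 1. The function names, the weight vector `w`, and the values of `n`, `k`, `p` and the seed are all hypothetical, and we assume for simplicity that each pair is compared at most once (observed independently with probability `p`) and that ties in the win counts are broken by item index.

```python
import random

def btl_matrix(w):
    """Pairwise win probabilities M[i][j] = w[i] / (w[i] + w[j]) under BTL."""
    n = len(w)
    return [[w[i] / (w[i] + w[j]) for j in range(n)] for i in range(n)]

def simulate_wins(M, p, rng):
    """Observe each pair independently with probability p; return win counts."""
    n = len(M)
    wins = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:            # pair (i, j) is observed
                if rng.random() < M[i][j]:  # item i beats item j
                    wins[i] += 1
                else:
                    wins[j] += 1
    return wins

def top_k_by_counting(wins, k):
    """Counting rule: the k items with the most wins (ties broken by index)."""
    order = sorted(range(len(wins)), key=lambda i: (-wins[i], i))
    return set(order[:k])

# Hypothetical instance: the first k items share a larger BTL weight.
rng = random.Random(0)
n, k, p = 100, 10, 0.5
w = [4.0] * k + [1.0] * (n - k)
wins = simulate_wins(btl_matrix(w), p, rng)
estimate = top_k_by_counting(wins, k)
```

The only data-dependent work is one pass over the observed comparisons and one sort of the win counts, which is consistent with the small running times reported for the counting algorithm.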

To summarize, the performance of the two algorithms can be contrasted in the following way. When our stated lower bounds on are satisfied, then consistent with our theoretical claims, the Copeland counting algorithm succeeds irrespective of the form of the pairwise probability distributions. The Spectral MLE algorithm performs well when the pairwise comparison probabilities are faithful to parametric models, but is often unsuccessful otherwise. Even when the condition on is violated, the performance of the counting algorithm remains superior to that of the Spectral MLE. (Note that part (b) of Theorem 1 is a minimax converse, meaning that it appeals to the worst-case scenario.) In terms of computational complexity, for every instance we simulated, the counting algorithm took several orders of magnitude less time as compared to Spectral MLE.

4.2 Experiments on data from Amazon Mechanical Turk

In this section, we describe experiments on real world datasets collected from the Amazon Mechanical Turk (mturk.com) commercial crowdsourcing platform.

4.2.1 Data

In order to evaluate the accuracy of the algorithms under consideration, we require datasets consisting of pairwise comparisons in which the questions can be associated with an objective and verifiable ground truth. To this end, we used the “cardinal versus ordinal” dataset from our past work [SBB16]; three of the experiments performed in that paper are suitable for the evaluations here, namely, ones in which each question has a ground truth, and the pairs of items are chosen uniformly at random. The three experiments tested the workers’ general knowledge, audio, and visual understanding, and the respective tasks involved: (i) identifying the pair of cities with a greater geographical distance, (ii) identifying the higher frequency key of a piano, and (iii) identifying spelling mistakes in a paragraph of text. The numbers of items in the three experiments were , and respectively. The total numbers of pairwise comparisons were , and respectively. The fractions of pairwise comparisons whose outcomes were incorrect (as compared to the ground truth) in the raw data were , and respectively.

4.2.2 Results

We compared the performance of the counting algorithm with that of the Spectral MLE algorithm. For each value of a “subsampling probability” , we subsampled a fraction of the data and executed both algorithms on this subsampled data. We evaluated the performance of the algorithms on their ability to recover the top items under the Hamming error metric.
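The evaluation protocol just described can be sketched as follows. This is an illustration under stated assumptions rather than the script used for Figure 2: we assume each recorded comparison is stored as a hypothetical `(winner, loser)` pair of item indices, that subsampling keeps each comparison independently with probability `q`, and that the Hamming error between two item sets is the size of their symmetric difference.

```python
import random

def subsample(comparisons, q, rng):
    """Keep each recorded comparison independently with probability q."""
    return [c for c in comparisons if rng.random() < q]

def count_wins(comparisons, n):
    """Number of recorded wins per item, from (winner, loser) pairs."""
    wins = [0] * n
    for winner, _loser in comparisons:
        wins[winner] += 1
    return wins

def top_k(wins, k):
    """The k items with the most wins (ties broken by index)."""
    order = sorted(range(len(wins)), key=lambda i: (-wins[i], i))
    return set(order[:k])

def hamming_error(estimated, truth):
    """Number of items that belong to exactly one of the two sets."""
    return len(set(estimated) ^ set(truth))

def one_trial(comparisons, n, k, truth, q, rng):
    """Subsample the data, run the counting rule, and score the result."""
    kept = subsample(comparisons, q, rng)
    return hamming_error(top_k(count_wins(kept, n), k), truth)
```

Averaging `one_trial` over many independent seeds for each value of `q` would produce one point on a curve of the kind plotted for the counting algorithm.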

Figure 2 shows the results of the experiments. Each point in the plots is an average across trials. Observe that the counting algorithm consistently outperforms Spectral MLE. (We think that the erratic fluctuations in the spelling mistakes data are a consequence of high noise and a relatively small problem size.) Moreover, the Spectral MLE algorithm required about orders of magnitude more computation time (not shown in the figure) as compared to counting. Thus the counting algorithm performs well on simulated as well as real data. It outperforms Spectral MLE not only when the number of items is large (as in the simulations) but also when the problem sizes are small, as seen in these experiments.

Figure 2: Evaluation of Spectral MLE and the counting algorithm on three datasets from Amazon Mechanical Turk in terms of the error rates for top -subset recovery. The three panels plot the Hamming error when recovering the top items in the three datasets when a fraction of the total data is used, for various values of subsampling probability . The counting algorithm consistently outperforms the Spectral MLE algorithm.

5 Proofs

We now turn to the proofs of our main results. We continue to use the notation to denote the set for any integer . We ignore floor and ceiling conditions unless critical to the proof.

Our lower bounds are based on a standard form of Fano’s inequality [CT12, Tsy08] for lower bounding the probability of error in an -ary hypothesis testing problem. We state a version here for future reference. For some integer , fix some collection of distributions . Suppose that we observe a random variable that is obtained by first sampling an index uniformly at random from , and then drawing . (As a result, the variable is marginally distributed according to the mixture distribution .) Given the observation , our goal is to “decode” the value of , corresponding to the index of the underlying mixture component. Using to denote the sample space associated with the observation , Fano’s inequality asserts that any test function for this problem has error probability lower bounded as

where denotes the mutual information between and . A standard convexity argument for the mutual information yields the weaker bound

(19)

We make use of this weakened form of Fano’s inequality in several proofs.
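For reference, the two forms of Fano’s inequality invoked here can be written in their standard textbook form as follows. The symbols are our notational assumptions and may differ from those used elsewhere in the paper: $L$ is the number of hypotheses, $A$ the uniformly sampled index, $Y$ the observation, $\phi$ the test function, and $\mathbb{P}^a$ the $a$-th distribution in the collection.

```latex
\begin{align*}
\mathbb{P}[\phi(Y) \neq A]
  &\geq 1 - \frac{I(A; Y) + \log 2}{\log L}, \\
\mathbb{P}[\phi(Y) \neq A]
  &\geq 1 - \frac{\max_{a, b \in [L]}
      D_{\mathrm{KL}}(\mathbb{P}^a \,\|\, \mathbb{P}^b) + \log 2}{\log L}.
\end{align*}
```

The second line follows from the first because, by convexity of the Kullback-Leibler divergence, the mutual information $I(A; Y)$ is at most the average, and hence at most the maximum, of the pairwise divergences.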

5.1 Proof of Theorem 1

We begin with the proof of Theorem 1, dividing our argument into two parts.

5.1.1 Proof of part (a)

For any pair of items , let us encode the outcomes of the trials by an i.i.d. sequence of random vectors, indexed by . Each random vector follows the distribution

With this encoding, the variable encodes the number of wins for item .

Consider any item which ranks among the top in the true underlying ordering, and any item which ranks outside the top . We claim that with high probability, item will win more pairwise comparisons than item . More precisely, let denote the event that item wins at least as many pairwise comparisons as . We claim that

(20)

Given this bound, the probability that the counting algorithm will rank item above is no more than . Applying the union bound over all pairs of items and yields as claimed.

We note that inequality (ii) in equation (20) follows from inequality (i) combined with the condition on that arises by setting as assumed in the hypothesis of the theorem. Thus, it remains to prove inequality (i) in equation (20). By definition of , we have

(21)

It is convenient to recenter the random variables. For every and , define the zero-mean random variables

Also, let

We then have

Since and , from the definition of , we have , and consequently

(22)

By construction, all the random variables in the above inequality are zero-mean, mutually independent, and bounded in absolute value by . These properties alone would allow us to obtain a tail bound by Hoeffding’s inequality; however, in order to obtain the stated result (20), we need the more refined result afforded by Bernstein’s inequality (e.g., [BLM13]). In order to derive a bound of Bernstein type, the only remaining step is to bound the second moments of the random variables at hand. Some straightforward calculations yield

It follows that

where the inequality (iii) follows from the definition of , and step (iv) follows because for every and . Applying the Bernstein inequality now yields the stated bound (20)(i).
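To illustrate why a Bernstein-type bound is preferable to Hoeffding’s inequality when the second moments are small, the following sketch evaluates both tail bounds for a sum of independent, zero-mean, bounded random variables. The bounds are in their standard textbook form; the numbers `m`, `b`, `var_each` and `t` are hypothetical and are not the quantities appearing in the proof.

```python
import math

def hoeffding_bound(t, m, b):
    """P[sum >= t] <= exp(-t^2 / (2 m b^2)) for m independent zero-mean terms in [-b, b]."""
    return math.exp(-t * t / (2.0 * m * b * b))

def bernstein_bound(t, total_var, b):
    """P[sum >= t] <= exp(-t^2 / (2 (V + b t / 3))), with V the sum of second moments."""
    return math.exp(-t * t / (2.0 * (total_var + b * t / 3.0)))

# Hypothetical numbers: many bounded terms, each with a small second moment.
m, b, var_each, t = 10_000, 1.0, 0.01, 100.0
h = hoeffding_bound(t, m, b)
v = bernstein_bound(t, m * var_each, b)
print(f"Hoeffding: {h:.3e}   Bernstein: {v:.3e}")
```

Hoeffding’s bound depends only on the range of each term, so it cannot exploit the fact that most terms are nearly always zero. Bernstein’s bound replaces the worst-case range by the total second moment, which is what makes the sharper exponent available.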

5.1.2 Proof of part (b)

The symmetry of the problem allows us to assume, without loss of generality, that . We prove a lower bound by first constructing an ensemble of different problems, and considering the problem of distinguishing between them. For each , let us define the -sized subset , and the associated matrix of pairwise probabilities

where is a parameter to be chosen. We use to denote probabilities taken under pairwise comparisons drawn according to the model .

One can verify that the construction above falls in the intersection of parametric models and the SST model. In the parametric case, this construction amounts to the parameters associated with every item in having the same value, and those associated with every item in having the same value. Also observe that for every such distribution , the associated -separation threshold .
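A planted-subset construction of this kind can be sketched in code. The exact form below is an assumption on our part, not a transcription of the display above: items in the planted subset beat items outside it with probability 1/2 + `delta`, and all other pairs are even. We also assume that the score of an item is its average win probability over all `n` columns, with diagonal entries equal to 1/2. All function names are hypothetical.

```python
def planted_matrix(n, planted, delta):
    """Assumed form: 1/2 + delta if i is planted and j is not, 1/2 - delta if reversed, else 1/2."""
    S = set(planted)
    M = [[0.5] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i in S and j not in S:
                M[i][j] = 0.5 + delta
            elif i not in S and j in S:
                M[i][j] = 0.5 - delta
    return M

def scores(M):
    """Assumed score: average probability of winning against the n items."""
    n = len(M)
    return [sum(row) / n for row in M]

def separation(M, k):
    """Gap between the k-th and (k+1)-th largest scores."""
    tau = sorted(scores(M), reverse=True)
    return tau[k - 1] - tau[k]

# Hypothetical instance with n = 10 items and a planted subset of size k = 3.
M = planted_matrix(10, {0, 1, 2}, 0.1)
gap = separation(M, 3)
```

Under these assumptions, every planted item has the same score and every other item has the same (smaller) score, so the matrix is determined by two parameter values, and the gap computed by `separation` equals `delta` up to floating-point error.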

Any given set of observations can be described by the collection of random variables . When the true underlying model is , the random variable follows the distribution

The random variables are mutually independent, and the distribution is a product distribution across pairs and repetitions .

Let follow a uniform distribution over the index set, and suppose that given , our observation has components drawn according to the model . Consequently, the marginal distribution of is the mixture distribution over all models. Based on observing , our goal is to recover the correct index of the underlying model, which is equivalent to recovering the planted subset . We use the Fano bound (19) to lower bound the error probability associated with any test for this problem. In order to apply Fano’s inequality, the following result provides control over the Kullback-Leibler divergence between any pair of probabilities involved.

Lemma 1.

For any distinct pair , we have

(23)

See the end of this section for the proof of this claim.

Given this bound on the Kullback-Leibler divergence, Fano’s inequality (19) implies that any estimator of has error probability lower bounded as

Here the final inequality holds whenever , , and . The condition also ensures that , so that our construction is valid. It only remains to prove Lemma 1.

5.1.3 Proof of Lemma 1

Since the distributions and are formed by components that are independent across edges and repetitions , we have