Your 2 is My 1, Your 3 is My 9: Handling Arbitrary Miscalibrations in Ratings

06/13/2018 ∙ by Jingyan Wang, et al. ∙ Carnegie Mellon University

Cardinal scores (numeric ratings) collected from people are well known to suffer from miscalibrations. A popular approach to address this issue is to assume simplistic models of miscalibration (such as linear biases) to de-bias the scores. This approach, however, often fares poorly because people's miscalibrations are typically far more complex and not well understood. In the absence of simplifying assumptions on the miscalibration, it is widely believed that the only useful information in the cardinal scores is the induced ranking. In this paper, inspired by the framework of Stein's shrinkage and empirical Bayes, we contest this widespread belief. Specifically, we consider cardinal scores with arbitrary (or even adversarially chosen) miscalibrations that are only required to be consistent with the induced ranking. We design estimators that, despite making no assumptions on the miscalibration, surprisingly, strictly and uniformly outperform all possible estimators that rely on only the ranking. Our estimators are flexible in that they can be used as a plug-in for a variety of applications. Our results thus provide novel insights into the eternal debate between cardinal and ordinal data.


1 Introduction

  • “A raw rating of 7 out of 10 in the absence of any other information is potentially useless.” [MGCV11]

    “The rating scale as well as the individual ratings are often arbitrary and may not be consistent from one user to another.” [AS12]

Consider two items that need to be evaluated (for example, papers submitted to a conference) and two reviewers. Suppose each reviewer is assigned one distinct item for evaluation, and this assignment is done uniformly at random. The two reviewers provide their evaluations (say, in the range $[0, 1]$) for the respective items they evaluate, from which the better item must be chosen. However, the reviewers’ rating scales may be miscalibrated. It might be the case that the first reviewer is lenient and always provides scores near the top of the range, whereas the second reviewer is more stringent and provides scores near the bottom of the range. Or it might be the case that one reviewer is moderate whereas the other is extreme – the first reviewer’s 0.2 is equivalent to the second reviewer’s 0.1, whereas the first reviewer’s 0.3 is equivalent to the second reviewer’s 0.9. More generally, the miscalibration of the reviewers may be arbitrary and unknown. Then is there any hope of identifying the better of the two items with any non-trivial degree of certainty?

A variety of applications involve collection of human preferences or judgments in terms of cardinal scores (numeric ratings). A perennial problem with eliciting cardinal scores is that of miscalibration – the systematic errors introduced due to incomparability of cardinal scores provided by different people (see [GB08] and references therein).

This issue of miscalibration is sometimes addressed by making simplifying assumptions about the form of miscalibration and then applying post-hoc corrections under these assumptions. Such models include one-parameter-per-reviewer additive biases [Pau81, BK13, GWG13, MKLP17], two-parameter-per-reviewer scale-and-shift biases [Pau81, RRS11] and others [FSG10]. The calibration issues with human-provided scores are often significantly more complex, causing significant violations of these simplified assumptions (see [GB08] and references therein). Moreover, the algorithms for post-hoc correction often try to estimate the individual parameters, which may not be feasible due to low sample sizes. For instance, John Langford notes from his experience as the program chair of the ICML 2012 conference [Lan12]:

  • “We experimented with reviewer normalization and generally found it significantly harmful.”

This problem of low sample size is exacerbated in a number of applications such as A/B testing where every reviewer evaluates only one item, thereby making the problem underdetermined even under highly restrictive models.

It is commonly believed that when unable or unwilling to make any simplifying assumptions on the bias in cardinal scores, the only useful information is the ranking of the scores [Rok68, FISS03, HBBR09, MGCV11, AS12, NOS12]. This perception gives rise to a second approach towards handling miscalibrations – that of using only the induced ranking, or otherwise directly eliciting a ranking rather than scores from the users. As noted by Freund et al. [FISS03]:

  • “[Using rankings instead of ratings] becomes very important when we combine the rankings of many viewers who often use completely different ranges of scores to express identical preferences.”

These motivations have spurred a long line of literature on analyzing data that takes the form of partial or total rankings of items [CGPR07, BK09, AS12, NOS12, RGLA15, SBB16, SW18].

In this paper, we contest this widely held belief with the following two fundamental questions:


  • In the absence of simplifying modeling assumptions on the miscalibration, is there any estimator (based on the scores) that can outperform estimators based on the induced rankings?

  • If only one evaluation per reviewer is available, and if each reviewer may have an arbitrary (possibly adversarially chosen) miscalibration, is there hope of estimation better than random guessing?

We show that the answer to both questions is “Yes”: one need not make simplifying assumptions about the miscalibration and can nevertheless guarantee performance superior to that of any estimator that uses only the induced rankings.

In more detail, we consider settings where a number of people provide cardinal scores for one or more items from a collection of items. The calibration of each reviewer is represented by an unknown monotonic function that maps the space of true values to the scores given by this reviewer. These functions are arbitrary and may even be chosen adversarially. We present a class of estimators based on the cardinal scores given by the reviewers which uniformly outperform any estimator that uses only the induced rankings. A compelling feature of our estimators is that they can be used as a plug-in to improve ranking-based algorithms in a variety of applications, and we provide a proof-of-concept for two applications: A/B testing and ranking.

The techniques used in our analyses draw inspiration from the framework of Stein’s shrinkage [Ste56, JS61] and empirical Bayes [Rob56]. Moreover, our setting with two reviewers and two papers, presented subsequently in the paper, carries a close connection to the classic two-envelope problem (for a survey of the two-envelope problem, see [Gne16]), and our estimator in this setting is similar in spirit to the randomized strategy [Cov87] proposed by Thomas Cover. We discuss connections with the literature in more detail in Section 3.1.1.

Our work provides a new perspective on the eternal debate between cardinal scores and ordinal rankings. It is often believed that ordinal rankings are a panacea for the miscalibration issues with cardinal scores. Here we show that ordinal estimators are not only inadmissible, but are also strictly and uniformly beaten by our cardinal estimators. Our results thus uncover a new point on the bias-variance tradeoff for this class of problems: estimators that rely on simplified assumptions about the miscalibration incur biases due to model mismatch, whereas the absence of such assumptions in our work eliminates the modeling bias. Moreover, in this minimal-bias regime, our cardinal estimators incur a strictly smaller variance as compared to estimators based on ordinal data alone.

Finally, a note qualifying the scope of the problem setting considered here. In applications such as crowdsourced microtasks where workers often spend very little time answering every question, the cardinal scores elicited may not necessarily be consistent with the ordinal rankings, and moreover, ordinal rankings are often easier and faster to provide. These differences cease to exist in a variety of applications such as peer-review or in-person laboratory A/B tests which require the reviewers to spend a non-trivial amount of time and effort in the review process, and these applications form the motivation of this work.

2 Preliminaries

Consider a set of $n$ items denoted as $\{1, \ldots, n\}$, or $[n]$ in short. (We use the standard notation $[\nu]$ to denote the set $\{1, \ldots, \nu\}$ for any positive integer $\nu$.) Each item $j \in [n]$ has an unknown value $x_j \in \mathbb{R}$. For ease of exposition, we assume that all items have distinct values. There are $m$ reviewers and each reviewer evaluates a subset of the items. The calibration of any reviewer $i \in [m]$ is given by an unknown, strictly-increasing function $f_i : \mathbb{R} \to \mathbb{R}$. (More generally, our results hold for any non-singleton intervals on the real line as the domain and range of the calibration functions.) When reviewer $i$ evaluates item $j$, the reported score is $f_i(x_j)$. We make no other assumptions on the calibration functions $\{f_i\}_{i \in [m]}$. We use the notation $\succ$ to represent a relative order of any items; for instance, we write “$j \succ j'$” to say that item $j$ has a larger value (is ranked higher) than item $j'$. We assume that $m$ and $n$ are finite.

Every reviewer is assigned one or more items to evaluate. We denote the assignment of items to reviewers as $A = (A_1, \ldots, A_m)$, where $A_i \subseteq [n]$ is the set of items assigned to reviewer $i$. We use the notation $\Pi_n$ to represent the set of all permutations of the $n$ items. We let $\pi^* \in \Pi_n$ denote the ranking of the items induced by their respective values $(x_1, \ldots, x_n)$, such that $x_{\pi^*(1)} > x_{\pi^*(2)} > \cdots > x_{\pi^*(n)}$. The goal is to estimate this ranking $\pi^*$ from the evaluations of the reviewers. We consider two types of settings: an ordinal setting, where estimation is performed using the rankings induced by each reviewer’s reported scores, and a cardinal setting, where estimation is performed using the reviewers’ scores (which can have an arbitrary miscalibration and only need to be consistent with the induced rankings). Formally:


  • Ordinal: Each reviewer $i \in [m]$ reports a total ranking of the items in $A_i$, that is, the ranking induced by the values $\{f_i(x_j)\}_{j \in A_i}$. An ordinal estimator observes the assignment $A$ and the rankings reported by all reviewers.

  • Cardinal: Each reviewer $i \in [m]$ reports the scores for the items in $A_i$, that is, the values $\{f_i(x_j)\}_{j \in A_i}$. A cardinal estimator observes the assignment $A$ and the scores reported by all reviewers.

Observe that the setting described above considers “noiseless” data, where each reviewer reports either the scores or the induced rankings. We provide an extension to the noisy setting in Appendix A.
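To make the observation model concrete, the following is a minimal Python sketch of the setup (not from the paper): hidden item values, one unknown strictly-increasing calibration function per reviewer, and reported scores consistent with the induced ranking. The specific calibration functions below are illustrative assumptions.

```python
import math
import random

# Hypothetical strictly increasing calibration functions, one per reviewer
# (illustrative choices; the model allows any strictly increasing f_i).
calibrations = [
    lambda x: x,                           # a perfectly calibrated reviewer
    lambda x: 0.5 * x + 3.0,               # lenient: compressed scale, shifted up
    lambda x: math.tanh(4.0 * (x - 0.5)),  # extreme: saturates near the ends of the scale
]

def report_score(reviewer, item_value):
    """Score reported when `reviewer` evaluates an item with true value `item_value`."""
    return calibrations[reviewer](item_value)

# Two items with unknown true values; reviewers 0 and 1 each see one item,
# with the assignment drawn uniformly at random.
x = [0.3, 0.7]
items_for_reviewers = random.sample([0, 1], k=2)  # items_for_reviewers[i] = item given to reviewer i
scores = {items_for_reviewers[i]: report_score(i, x[items_for_reviewers[i]]) for i in range(2)}
print(scores)  # raw scores are not comparable across reviewers
```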

In order to compare the performance of different estimators, we use the notion of strict uniform dominance. Informally, we say that one estimator strictly uniformly dominates another if it incurs a strictly lower risk for all possible choices of the miscalibration functions and the item values.

In more detail, suppose that you wish to show that an estimator $\widehat{\pi}_1$ is superior to an estimator $\widehat{\pi}_2$ with respect to some metric for estimating $\pi^*$. However, there is a clever adversary who intends to thwart your attempts. The adversary can choose the miscalibration functions of all reviewers and the values of all items, and moreover, can tailor these choices to different realizations of $\pi^*$. Formally, for every possible ranking $\pi \in \Pi_n$, the adversary specifies a choice of miscalibration functions and item values. The only constraints on this choice are that the miscalibration functions must be strictly monotonic and that the item values must induce the ranking $\pi$. In the sequel, we consider two ways of choosing the true ranking $\pi^*$: in one setting, $\pi^*$ can be chosen by the adversary, and in the second setting $\pi^*$ is drawn uniformly at random from $\Pi_n$. Once this ranking is chosen, the actual miscalibration functions and item values are set to the adversary's choices associated with $\pi^*$. The items are then assigned to reviewers according to the (possibly random) assignment $A$. The reviewers now provide their ordinal or cardinal evaluations as described earlier, and these evaluations are used to compute and evaluate the two estimators $\widehat{\pi}_1$ and $\widehat{\pi}_2$. We say that estimator $\widehat{\pi}_1$ strictly uniformly dominates $\widehat{\pi}_2$ if $\widehat{\pi}_1$ is always guaranteed to incur a strictly smaller (expected) error than $\widehat{\pi}_2$. Formally:

Definition 1 (Strict uniform dominance)

Let $\widehat{\pi}_1$ and $\widehat{\pi}_2$ be two estimators for the true ranking $\pi^*$. Estimator $\widehat{\pi}_1$ is said to strictly uniformly dominate estimator $\widehat{\pi}_2$ with respect to a given loss $L$ if

$\mathbb{E}\big[L(\widehat{\pi}_1, \pi^*)\big] \;<\; \mathbb{E}\big[L(\widehat{\pi}_2, \pi^*)\big]$.   (1)

The expectation is taken over any randomness in the assignment $A$ and in the estimators. If the true ranking $\pi^*$ is drawn at random from a fixed distribution, then the expectation is also taken over this distribution; otherwise, inequality (1) must hold for all values of $\pi^*$.

Note that strict uniform dominance is a stronger notion than comparing estimators in terms of their minimax (worst-case) or average-case risks. Moreover, if an estimator $\widehat{\pi}_2$ is strictly uniformly dominated by some estimator $\widehat{\pi}_1$, then the estimator $\widehat{\pi}_2$ is inadmissible.

Finally, for ease of exposition, we focus on the 0-1 loss in the main text: $L(\widehat{\pi}, \pi^*) = \mathbb{1}\{\widehat{\pi} \neq \pi^*\}$, where we use the standard notation $\mathbb{1}\{E\}$ to denote the indicator function of an event $E$, with $\mathbb{1}\{E\} = 1$ if the event $E$ is true, and $0$ otherwise. Extensions to the Kendall-tau distance and Spearman’s footrule distance are provided in Appendix B.
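As a small illustration of how these quantities can be evaluated numerically (this harness is a sketch, not part of the paper), the expected 0-1 loss in Definition 1 can be approximated by Monte Carlo for any one fixed adversarial choice of calibration functions and item values:

```python
import random

def zero_one_loss(estimate, truth):
    """0-1 loss used in the main text: 1 if the estimated ranking is not exactly correct."""
    return 0 if estimate == truth else 1

def empirical_risk(estimator, draw_instance, trials=100_000):
    """Monte Carlo approximation of the expected loss E[L] for one fixed instance.

    `draw_instance()` draws the random assignment (and any other randomness in the
    data) for a fixed adversarial choice of calibrations and item values, returning
    (observations, true_ranking). The estimator may itself be randomized.
    """
    total = 0
    for _ in range(trials):
        observations, truth = draw_instance()
        total += zero_one_loss(estimator(observations), truth)
    return total / trials

# Example usage: a random-guessing baseline over two items, for a hypothetical instance.
random_guess = lambda observations: random.choice([(1, 2), (2, 1)])
```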

3 Main results

In this section we present our main theoretical results. All proofs are provided in Section 5.

3.1 A canonical setting

We begin with a canonical setting that involves two items and two reviewers (that is, $n = 2$ and $m = 2$), where each reviewer evaluates one of the two items. Our analysis for this setting conveys the key ideas underlying our general results. These ideas are directly applicable towards designing uniformly superior estimators for a variety of applications, and we subsequently demonstrate this general utility with two applications.

In this canonical setting, each of the two reviewers evaluates one of the two items chosen uniformly at random without replacement, that is, the assignment is chosen uniformly at random from the two possibilities $(A_1, A_2) = (\{1\}, \{2\})$ and $(A_1, A_2) = (\{2\}, \{1\})$. Since each reviewer is assigned only one item, the ordinal data is vacuous. The natural ordinal baseline is therefore an estimator that guesses the ranking of the two items uniformly at random; we refer to it as the random-guessing estimator.

In the cardinal setting, let $y_1$ denote the score reported for item 1 by its respective reviewer, and let $y_2$ denote the score reported for item 2 by its respective reviewer. Since the calibration functions are arbitrary (and may be adversarial), it appears hopeless to obtain information about the relative values of $x_1$ and $x_2$ from just this data. Indeed, as we show below, standard estimators such as the sign test — ranking the items by their reviewer-provided scores — provably fail to achieve this goal. More generally, the following theorem holds for the class of all deterministic estimators, that is, estimators given by deterministic mappings from $(A, y_1, y_2)$ to the set of the two possible rankings of the two items.

Theorem 1

No deterministic (cardinal or ordinal) estimator can strictly uniformly dominate the random-guessing estimator.

This theorem demonstrates the difficulty of this problem by ruling out all deterministic estimators. Our original question then still remains: is there any estimator that can strictly uniformly outperform the random-guessing ordinal baseline?

We show that the answer is yes, via the construction of a randomized estimator for this canonical setting, which we denote by $\widehat{\pi}_{\mathrm{can}}$. This estimator is based on a function $f$ mapping $\mathbb{R}$ to $[0, 1]$, which may be chosen as any arbitrary strictly-increasing function; for instance, one could choose the sigmoid function $f(y) = \frac{1}{1 + e^{-y}}$. Given the scores $y_1$ and $y_2$ reported for the two items, let $i_{\mathrm{high}}$ denote the item which receives the higher score, and let $i_{\mathrm{low}}$ denote the remaining item (with ties broken uniformly). Then our randomized estimator outputs a ranking of the two items at random, with the probability of placing $i_{\mathrm{high}}$ above $i_{\mathrm{low}}$ determined by the two reported scores through the function $f$:

(2)

Note that the output of this estimator is independent of the assignment $A$, so in the remainder of this paper we also write it as a function of the scores alone, $\widehat{\pi}_{\mathrm{can}}(y_1, y_2)$.
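The following is a minimal Python sketch of one such randomized estimator. It assumes a Cover-style rule in which the higher-scored item is kept on top with probability $(1 + f(y_{\mathrm{high}}) - f(y_{\mathrm{low}}))/2$; this particular form is an assumption of the sketch made for concreteness, consistent with the description above, and not a verbatim reproduction of display (2).

```python
import math
import random

def f(y):
    """A strictly increasing map into [0, 1]; the sigmoid is one valid choice."""
    return 1.0 / (1.0 + math.exp(-y))

def canonical_estimator(y1, y2):
    """Return 1 if item 1 is estimated to be better, else 2.

    Sketch of a randomized comparison: keep the higher-scored item on top with
    probability (1 + f(y_high) - f(y_low)) / 2 (an assumed, Cover-style form),
    otherwise flip the order. Ties in the scores are broken uniformly at random.
    """
    if y1 == y2:
        high, low = random.choice([(1, 2), (2, 1)])
        y_high = y_low = y1
    elif y1 > y2:
        high, low, y_high, y_low = 1, 2, y1, y2
    else:
        high, low, y_high, y_low = 2, 1, y2, y1
    p_keep = 0.5 * (1.0 + f(y_high) - f(y_low))
    return high if random.random() < p_keep else low
```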

The following theorem now proves that our proposed estimator indeed achieves the stated goal.

Theorem 2

The randomized estimator $\widehat{\pi}_{\mathrm{can}}$ strictly uniformly dominates the random-guessing baseline.

While this result considers a setting with “noiseless” observations (that is, where each reported score is exactly $f_i(x_j)$), in Appendix A we show that the guarantee for $\widehat{\pi}_{\mathrm{can}}$ continues to hold when the observations are noisy.

Having established the positive result for this canonical setting, we now discuss some connections and inspirations in the literature.

3.1.1 Connections to the literature

The canonical setting has a close connection to the randomized version of the two-envelope problem [Cov87]. In the two-envelope problem, there are two arbitrary numbers. One of the two numbers is observed uniformly at random, and the other remains unknown. The goal is to estimate which number is larger. This problem can also be viewed from a game-theoretic perspective [Gne16], as can ours, where one player picks an estimator and the other player picks the two values. Cover [Cov87] proposed a randomized estimator whose probability of success is strictly larger than $\frac{1}{2}$ uniformly across all pairs of numbers. The proposed estimator samples a new random variable $T$ whose distribution has a probability density function $p$ with $p(t) > 0$ for all $t \in \mathbb{R}$. Then if the observed number is smaller than $T$, the estimator decides that the observed number is the smaller of the two; if the observed number is larger than $T$, the estimator decides that the observed number is the larger of the two.
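Cover's strategy is easy to state in code. The sketch below uses a standard normal threshold, which is one of many valid choices of a distribution with everywhere-positive density.

```python
import random

def cover_guess(observed):
    """Cover's randomized strategy for the two-envelope problem (sketch).

    Draw a threshold T from a distribution whose density is positive on all of R
    (here, a standard normal), then guess that the observed number is the larger
    one exactly when it exceeds T.
    """
    threshold = random.gauss(0.0, 1.0)  # everywhere-positive density on the real line
    return "observed is larger" if observed > threshold else "observed is smaller"
```

The success probability exceeds $\frac{1}{2}$ because whenever $T$ falls strictly between the two hidden numbers the guess is necessarily correct, and otherwise it is correct with probability exactly $\frac{1}{2}$.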

Our canonical setting can be reduced to the two-envelope problem as follows. Consider the two values $f_1(x_1)$ and $f_1(x_2)$ that reviewer 1 could report. Since the two items are assigned to the two reviewers uniformly at random, reviewer 1 evaluates one of the two items uniformly at random, and hence we observe one of these two values uniformly at random. By the assumption that the calibration functions are monotonically increasing, we know that these two values are distinct, and furthermore, $f_1(x_1) < f_1(x_2)$ if and only if $x_1 < x_2$. Hence, the relative ordering of these two values is identical to the relative ordering of $x_1$ and $x_2$, reducing our canonical setting to the two-envelope problem. Our estimator also carries a close connection to Cover’s estimator for the two-envelope problem. Specifically, Cover’s estimator can be equivalently viewed as being described by a “switching function” [MA09]. This switching function specifies the probability to “switch” (that is, to guess that the unobserved value is larger), and is a monotonically-decreasing function of the observed value. The use of the monotonic function $f$ in our estimator in (2) is similar in spirit.

The two-envelope problem can also be alternatively viewed as a secretary problem with two candidates. Negative results have been shown regarding the effect of cardinal vs. ordinal data when there are more than two candidates [SN92, Gne94], and positive results have been shown on extensions of the secretary problem to different losses [GK96].

Our original inspiration for our proposed estimator arose from Stein’s phenomenon [Ste56] and empirical Bayes [Rob56]. This inspiration stems from the fact that the two items are not to be estimated in isolation, but in a joint manner. That said, a significant fraction of the work (e.g., [Rob56, Ste56, JS61, Bar70, Boc75, TKV17]) in these areas is based on deterministic estimators. In comparison, our negative result for all deterministic estimators (Theorem 1) and the positive result for our randomized estimator (Theorem 2) provide interesting insights in this space.

3.2 A/B testing

We now demonstrate how to use the result in the canonical setting as a plug-in for more general scenarios. Specifically, we construct simple extensions to our canonical estimator, as a proof-of-concept for the superiority of cardinal data over ordinal data in A/B testing (this section) and ranking (Section 3.3). A/B testing is concerned with the problem of choosing the better of two given items, based on multiple evaluations of each item, and is used widely on the web and in e-commerce (e.g. [KLSH09]). In many applications of A/B testing, the two items are rated by disjoint sets of individuals (for example, when comparing two web designs, each user sees one and only one design). It is therefore important to take into account the different calibrations of different individuals, and this problem fits our setting with $n = 2$ items and $m$ reviewers. For simplicity, we assume that $m$ is even. We consider the assignment obtained by assigning item 1 to $\frac{m}{2}$ reviewers chosen uniformly at random (without replacement) from the set of $m$ reviewers, and assigning item 2 to the remaining $\frac{m}{2}$ reviewers. (Our results also hold in the following settings: (a) each reviewer is assigned one of the two items independently and uniformly at random; (b) reviewers are grouped, in any arbitrary manner, into pairs, and within each pair the two reviewers are assigned one distinct item each uniformly at random.)

As in the canonical setting we studied earlier, in the absence of any direct comparison between the two items, the natural ordinal estimator in the A/B testing setting is a random guess between the two items.

For concreteness, we consider the following method of performing the random assignment of the two items to the reviewers. We first draw a uniformly random permutation of the $m$ reviewers, assign the first $\frac{m}{2}$ reviewers in this permutation to item 1, and assign the last $\frac{m}{2}$ reviewers to item 2. We let $y_1, \ldots, y_{m/2}$ denote the scores given by the reviewers assigned to item 1, and let $y'_1, \ldots, y'_{m/2}$ denote the scores given by the reviewers assigned to item 2 (both in the permuted order). Now consider the following standard (deterministic) estimators (minimal implementations are sketched in code below):


  • Sign estimator: For each $k \in \{1, \ldots, \frac{m}{2}\}$, compare the paired scores $y_k$ and $y'_k$; the sign estimator outputs the item that wins the larger number of these $\frac{m}{2}$ pairwise comparisons.

  • Mean estimator: The mean estimator outputs the item with the higher mean score, comparing $\frac{2}{m}\sum_{k} y_k$ with $\frac{2}{m}\sum_{k} y'_k$.

  • Median estimator: The median estimator outputs the item with the higher median score (upper median if there are multiple medians). (For values $z_1 \le \cdots \le z_N$, we define the median as the upper median $z_{\lceil (N+1)/2 \rceil}$. Theorem 3 also holds for the lower median $z_{\lfloor (N+1)/2 \rfloor}$, and for the median defined as the mean of the two middle values.)

In each estimator, ties are assumed to be broken uniformly at random.
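For concreteness, minimal Python implementations of the three baselines are sketched below, taking the paired score lists described above as input; the helper names are ours, not the paper's.

```python
import random
from statistics import mean

def _pick_higher(value1, value2):
    """Return 1 or 2 according to which value is larger, breaking ties uniformly."""
    if value1 > value2:
        return 1
    if value2 > value1:
        return 2
    return random.choice([1, 2])

def sign_estimator(y, y_prime):
    """y[k], y_prime[k]: scores for items 1 and 2 from the k-th pair of reviewers."""
    wins_1 = sum(a > b for a, b in zip(y, y_prime))
    wins_2 = sum(b > a for a, b in zip(y, y_prime))
    return _pick_higher(wins_1, wins_2)

def mean_estimator(y, y_prime):
    return _pick_higher(mean(y), mean(y_prime))

def upper_median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]  # upper median when the number of scores is even

def median_estimator(y, y_prime):
    return _pick_higher(upper_median(y), upper_median(y_prime))
```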

We now show that despite using the scores given by all $m$ reviewers, where $m$ can be arbitrarily large, these natural estimators fail to uniformly dominate the naïve random-guessing ordinal estimator.

Theorem 3

For any (even) number of reviewers $m$, none of the sign, mean, and median estimators can strictly uniformly dominate the random-guessing estimator.

The negative result of Theorem 3 demonstrates the challenges even when one is allowed to collect an arbitrarily large number of scores for each item. Intuitively, the more reviewers there are, the more miscalibration functions they introduce. Even if the statistics used by these estimators converge as the number of the reviewers grows large, these values are not guaranteed to be informative towards comparing the values of the items due to the miscalibrations.

The failure of these standard estimators suggests the need for a novel approach to A/B testing under arbitrary miscalibrations. To this end, we build on top of our canonical estimator from Section 3.1, and present a simple randomized estimator via the following two steps (a code sketch follows the list):

  1. For every $k \in \{1, \ldots, \frac{m}{2}\}$, apply the canonical estimator $\widehat{\pi}_{\mathrm{can}}$ to the pair of scores $(y_k, y'_k)$ and obtain an estimate of the better item.

  2. Set the output as the outcome of the majority vote among these $\frac{m}{2}$ estimates, with ties broken uniformly at random.
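A minimal sketch of this plug-in construction, assuming the `canonical_estimator` sketch from Section 3.1 is in scope:

```python
import random

def ab_estimator(y, y_prime):
    """Majority vote of the canonical estimator over the m/2 pairs of scores.

    Assumes `canonical_estimator(score_item1, score_item2)` (Section 3.1 sketch)
    is available and returns 1 or 2.
    """
    votes = [canonical_estimator(a, b) for a, b in zip(y, y_prime)]
    wins_1, wins_2 = votes.count(1), votes.count(2)
    if wins_1 == wins_2:
        return random.choice([1, 2])  # majority tie broken uniformly at random
    return 1 if wins_1 > wins_2 else 2
```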

The following theorem now shows that the results for the canonical setting from Section 3.1 translate to this A/B testing application.

Theorem 4

For any (even) number of reviewers $m$, the estimator described above strictly uniformly dominates the random-guessing estimator.

This result thus illustrates the use of our canonical estimator as a plug-in for A/B testing. So far we have considered settings where there are only two items and where each reviewer is assigned only one item, thereby making the ordinal information vacuous. We now turn to an application that is free of these restrictions.

3.3 Ranking

It is common in practice to estimate the partial or total ranking for a list of items by soliciting ordinal or cardinal responses from individuals. In conference reviews or peer-grading, each reviewer is asked to rank [Dou09, SBP13, STM17] or rate [GWG13, PHC13, STM17] a small subset of the papers, and this information is subsequently used to estimate a partial or total ranking of the papers (or student homework). Other applications for aggregating rankings include voting [You88, PSZ16], crowdsourcing [SBB16, SW18], recommendation systems [FISS03] and meta-search [DKNS01].

Formally, we let $n$ denote the number of items and $m$ denote the number of reviewers. For simplicity, we focus on a setting where each reviewer reports noiseless evaluations of some pair of items, and the goal is to estimate the total ranking of all $n$ items. We consider a random-design setup where the pairs compared are randomly chosen and randomly assigned to reviewers. We assume $m \le \binom{n}{2}$ so that the problem does not degenerate. Each reviewer evaluates a pair of items, and these pairs are drawn uniformly without replacement from the $\binom{n}{2}$ possible pairs of items. We let $(S_1, \ldots, S_m)$ denote these pairs of items to be evaluated by the respective reviewers, where $S_i$ denotes the pair of items evaluated by reviewer $i$. For each pair $S_i$, denote the cardinal evaluation as the two scores $\{f_i(x_j)\}_{j \in S_i}$ reported by reviewer $i$, and the ordinal evaluation as the ranking of the two items in $S_i$ induced by these scores. Denote the set of ordinal observations as $O$, and the set of cardinal observations as $Y$. The input to an ordinal estimator is the ordinal information, namely the assignment $(S_1, \ldots, S_m)$ and the induced pairwise rankings $O$. The input to a cardinal estimator is the reviewer assignment $(S_1, \ldots, S_m)$ and the set of cardinal observations $Y$. Finally, let $G$ denote a directed acyclic graph (DAG) with nodes comprising the $n$ items and with an edge from any node $j$ to any other node $j'$ if and only if some reviewer reports item $j$ above item $j'$. A topological ordering on $G$ is any total ranking of its vertices which does not violate any pairwise comparison indicated by $G$.

Input: the reviewer assignment $(S_1, \ldots, S_m)$ and the cardinal observations $Y$.
1. Deduce the ordinal observations $O$ from the cardinal observations $Y$. Compute a topological ordering $\widehat{\pi}$ on the graph $G$, with ties broken in order of the indices of the items. Set $t \leftarrow 1$.
2. While $t \le n - 1$:
   (a) Let $\pi'$ be the ranking obtained by flipping the positions of the $t^{\text{th}}$ and the $(t{+}1)^{\text{th}}$ items in $\widehat{\pi}$.
   (b) If $\pi'$ is a topological ordering on $G$, and both the $t^{\text{th}}$ and $(t{+}1)^{\text{th}}$ items are evaluated by at least one reviewer each in $Y$:
       i. From all of the scores of the $t^{\text{th}}$ item in $Y$, sample one uniformly at random and denote it by $y$; likewise let $y'$ be a randomly chosen score of the $(t{+}1)^{\text{th}}$ item from $Y$. Remove from $Y$ all scores provided by the two reviewers who reported $y$ and $y'$.
       ii. If $\widehat{\pi}_{\mathrm{can}}$ applied to the pair $(y, y')$ outputs the flipped order, set $\widehat{\pi} \leftarrow \pi'$.
       iii. Set $t \leftarrow t + 2$.
   (c) Else set $t \leftarrow t + 1$.
3. Output $\widehat{\pi}$.
Algorithm 1: Our cardinal ranking estimator.

We now present our (randomized) cardinal estimator in Algorithm 1. In words, this algorithm starts from a topological ordering of the items as the initial estimate of the true ranking. It then scans one-by-one over the pairs of adjacent items in the current estimated ranking. If a pair can be flipped (that is, if the ranking after flipping this pair is also a topological ordering), we uniformly sample a pair of scores for these two items from the cardinal observations $Y$, and use the randomized estimator $\widehat{\pi}_{\mathrm{can}}$ to determine the relative order of the pair. After $\widehat{\pi}_{\mathrm{can}}$ is called, the positions of this pair are finalized: we remove all scores provided by these two reviewers from future use, and jump to the next pair that does not contain these two items.
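A compact Python sketch of this procedure follows. It assumes a simple data layout (one tuple of reviewer, item pair, and two scores per reviewer) and reuses the `canonical_estimator` sketch from Section 3.1; the topological-ordering bookkeeping is simplified relative to the paper's pseudocode.

```python
import random
from collections import defaultdict

def ranking_estimator(n, reviews, canonical_estimator):
    """Sketch of Algorithm 1.

    `reviews` is a list of tuples (reviewer, item_a, item_b, score_a, score_b),
    one per reviewer; `canonical_estimator(s1, s2)` returns 1 if the item behind
    its first argument is estimated to be better, else 2 (Section 3.1 sketch).
    """
    # Edges of the comparison DAG: (u, v) means some reviewer scored u above v.
    edges = set()
    for _, a, b, sa, sb in reviews:
        edges.add((a, b) if sa > sb else (b, a))

    # Initial estimate: topological ordering with ties broken by item index (Kahn's algorithm).
    indegree = {j: 0 for j in range(n)}
    for _, v in edges:
        indegree[v] += 1
    available = sorted(j for j in range(n) if indegree[j] == 0)
    order = []
    while available:
        u = available.pop(0)
        order.append(u)
        for (a, b) in edges:
            if a == u:
                indegree[b] -= 1
                if indegree[b] == 0:
                    available.append(b)
        available.sort()

    # Remaining cardinal observations: per item, a list of (reviewer, score).
    scores = defaultdict(list)
    for r, a, b, sa, sb in reviews:
        scores[a].append((r, sa))
        scores[b].append((r, sb))

    t = 0
    while t < n - 1:
        u, v = order[t], order[t + 1]
        flippable = (u, v) not in edges  # flipping u and v keeps a valid topological ordering
        if flippable and scores[u] and scores[v]:
            reviewer_u, score_u = random.choice(scores[u])
            reviewer_v, score_v = random.choice(scores[v])
            if canonical_estimator(score_v, score_u) == 1:  # the flipped order (v above u) wins
                order[t], order[t + 1] = v, u
            # The positions of this pair are now final; drop all scores by these two reviewers.
            for item in scores:
                scores[item] = [(r, s) for (r, s) in scores[item]
                                if r not in (reviewer_u, reviewer_v)]
            t += 2
        else:
            t += 1
    return order
```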

The following theorem now presents the main result of this section.

Theorem 5

Suppose that the true ranking $\pi^*$ is drawn uniformly at random from the collection $\Pi_n$ of all possible rankings, and consider any ordinal estimator for $\pi^*$. Then the cardinal estimator of Algorithm 1 strictly uniformly dominates this ordinal estimator.

We note that Algorithm 1 runs in time polynomial in the number of items $n$, because the two major operations of this estimator – finding a topological ordering, and checking whether a ranking is a topological ordering on the DAG – can be implemented in polynomial time [DPV08]. Theorem 5 thus demonstrates again the power of the canonical estimator as a plug-in component to be used in a variety of applications. An extension of our results to the setting where the true ranking $\pi^*$ can be arbitrary (adversarially chosen) is presented in Appendix C.

4 Simulations

We now experimentally evaluate our proposed estimators for A/B testing and ranking. Since the performance of the ordinal estimators varies significantly across problem instances, we use the notion of “relative improvement”. The relative improvement of an estimator $\widehat{\pi}_1$ as compared to a baseline estimator $\widehat{\pi}_2$ is defined as the improvement in performance of $\widehat{\pi}_1$ over $\widehat{\pi}_2$, normalized by the performance of the baseline $\widehat{\pi}_2$. A positive value of the relative improvement indicates the superiority of estimator $\widehat{\pi}_1$ over the estimator $\widehat{\pi}_2$, and a relative improvement of zero indicates identical performance of the two estimators. In our proposed estimators, the function $f$ is set as a sigmoid.
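One concrete way to compute this metric (the normalization by the baseline's error is an assumption of this sketch, consistent with the description above):

```python
def relative_improvement(error_estimator, error_baseline):
    """Relative improvement of an estimator over a baseline: the reduction in
    (expected) error divided by the baseline's error."""
    return (error_baseline - error_estimator) / error_baseline

# Example: a baseline error of 0.50 and an estimator error of 0.45 give an improvement of 0.10.
```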

4.1 A/B testing

We now present simulations to evaluate various points on the bias-variance tradeoff. For A/B testing, we compare our estimator with the other standard estimators — the sign, mean and median estimators introduced in Section 3.2. The item values $x_1$ and $x_2$ are chosen independently and uniformly at random from a bounded interval. The calibration functions are linear, and are chosen according to one of the following three settings (a code sketch of these settings follows the list):

  (a) One biased reviewer: One reviewer gives an abnormally high (or low) score; that is, this reviewer's calibration function has a large additive bias, while the remaining reviewers report scores with no bias.

  (b) Incremental biases: The reviewers' calibration functions are additively shifted from one another by small, evenly spaced amounts.

  (c) Incremental biases with one biased reviewer: A combination of settings (a) and (b).
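A sketch of how such linear (additive-bias) calibration families can be generated; the bias magnitudes below are illustrative assumptions and are not the constants used in the paper's simulations.

```python
def linear_calibrations(m, setting, step=0.1, big_bias=10.0):
    """Illustrative additive-bias calibration functions f_i(x) = x + b_i for the three settings.

    The bias magnitudes (`step`, `big_bias`) are assumptions for illustration only.
    """
    if setting == "one_biased":
        biases = [0.0] * (m - 1) + [big_bias]
    elif setting == "incremental":
        biases = [k * step for k in range(m)]
    elif setting == "incremental_one_biased":
        biases = [k * step for k in range(m - 1)] + [big_bias]
    else:
        raise ValueError(f"unknown setting: {setting}")
    return [(lambda x, b=b: x + b) for b in biases]
```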

We simulate these settings and compute the relative improvement of the different estimators as compared to the random-guessing estimator. The results are shown in Figure 4. While the relative performance of the baseline estimators varies across settings, our estimator consistently beats the baseline whereas every other estimator fails to do so. Our estimator thus indeed operates at a unique point on the bias-variance tradeoff, with a low (zero) bias and a variance strictly smaller than that of the ordinal estimator, whereas all other estimators incur a non-zero error due to bias.

Figure 4: Relative improvement in exact recovery of various estimators as compared to the random-guessing ordinal estimator for A/B testing, in three settings: (a) one biased reviewer; (b) incremental biases; (c) incremental biases with one biased reviewer. Each point is an average over repeated trials; the error bars are too small to display.

4.2 Ranking

Figure 5: Relative improvement in Kendall-tau distance of our ranking estimator as compared to an optimal ordinal estimator for ranking. Each point is an average over repeated trials, where in each trial the expected losses of the two estimators are approximated by empirical averages over many samples.

Next, we evaluate the performance of our ranking estimator when the true ranking is drawn from a uniform prior. We compare this estimator with an optimal ordinal estimator which outputs a topological ordering with ties broken in order of the indices of the items (this ordinal estimator is optimal regardless of the tie-breaking strategy).

For any number of items $n$, we generate the values of the items i.i.d. uniformly from a bounded interval. We set the number of reviewers $m$ as a function of $n$, with each reviewer evaluating one pair of items. We assume that each reviewer $i$ has a linear calibration function $f_i(x) = a_i x + b_i$, where we sample the slope $a_i > 0$ and the offset $b_i$ i.i.d. uniformly from bounded intervals.

We have previously proved that our estimator based on cardinal data strictly uniformly outperforms the optimal ordinal estimator for the 0-1 loss. We use these simulations to evaluate the efficacy of our approach for a different loss function – the Kendall-tau distance. Specifically, Figure 5 compares these two estimators in terms of the Kendall-tau distance (Appendix B provides a formal definition of this distance and associated theoretical results). We observe that our estimator consistently yields improvements for this loss as well. The improvement becomes smaller when the number of items is large because, by flipping only adjacent pairs, our estimator modifies the ranking only in the neighborhood of the initial estimate. We strongly believe that it should be possible to design better estimators for the large-$n$ regime using the tools developed in this paper. Having met our stated goal of outperforming ordinal estimators under arbitrary miscalibrations, we leave this interesting problem for future work.

4.3 Tradeoff between estimation under perfect calibration vs. miscalibration

In this section, we present a preliminary experiment showing the tradeoff between estimation under perfect calibration (all reviewers reporting the true values of the items) and estimation under miscalibration. For simplicity, we consider the canonical setting from Section 3.1. We evaluate the performance of our estimator under two scenarios: (1) perfect calibration, where $f_i(x) = x$ for each reviewer $i$; (2) miscalibration with one biased reviewer, where one reviewer reports the true value and the other adds a large positive bias. We take the function $f$ in our estimator to be a sigmoid with a scale parameter $\lambda > 0$, so that larger values of $\lambda$ make $f$ closer to a step function. We sample the item values $x_1$ and $x_2$ uniformly at random from a bounded interval.

Figure 6: Relative improvement of our canonical estimator under perfect calibration and under miscalibration with one biased reviewer, as the scale parameter $\lambda$ of the function $f$ increases from left to right in the plot. Each point is an average over repeated trials; the error bars are too small to display.

Figure 6 shows the relative improvement of our estimator over the random-guessing baseline under perfect calibration and under miscalibration, as $\lambda$ increases from left to right. Let us focus on a few regimes in this plot. First, on the left end of the curve, when $\lambda$ is close to $0$, the function $f$ is close to constant, and the estimator is close to random guessing. At the other extreme, on the right end of the curve, as $\lambda$ goes to infinity, $f$ approaches a step function: the estimator then always outputs the item with the higher score, and hence gives perfect estimation under perfect calibration. Under miscalibration, this estimator always chooses the item evaluated by the biased reviewer (who gives the higher score) and hence performs the same as a random guess. Past the point at which the miscalibration curve attains its maximum, its value starts decreasing, suggesting a tradeoff between estimation accuracy under perfect calibration and under miscalibration. It is clear that points to the left of this maximum are not Pareto-efficient, since there exist other points with the same accuracy under miscalibration but improved accuracy under perfect calibration.
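A sketch of this experiment follows. It uses the sigmoid family $f_\lambda(y) = 1/(1 + e^{-\lambda y})$ and the same assumed acceptance probability as the Section 3.1 sketch; the bias constant and the value interval are illustrative assumptions.

```python
import math
import random

def make_f(lam):
    """Sigmoid with scale lam: lam -> 0 gives a near-constant f, lam -> infinity a step function."""
    return lambda y: 1.0 / (1.0 + math.exp(-lam * y))

def success_probability(f, f1, f2, trials=50_000):
    """Monte Carlo success probability of the canonical sketch under calibrations f1, f2."""
    wins = 0
    for _ in range(trials):
        x1, x2 = sorted(random.uniform(0.0, 1.0) for _ in range(2))  # item 2 is the better item
        if random.random() < 0.5:  # random assignment of the two reviewers to the two items
            y1, y2 = f1(x1), f2(x2)
        else:
            y1, y2 = f2(x1), f1(x2)
        # Assumed acceptance probability, as in the Section 3.1 sketch.
        keep = 0.5 * (1.0 + f(max(y1, y2)) - f(min(y1, y2)))
        higher_is_item2 = y2 > y1
        guessed_item2 = higher_is_item2 if random.random() < keep else not higher_is_item2
        wins += guessed_item2
    return wins / trials

identity = lambda x: x
biased = lambda x: x + 10.0  # one heavily biased reviewer (bias constant is an assumption)
for lam in (0.1, 1.0, 10.0):
    f = make_f(lam)
    print(lam, success_probability(f, identity, identity), success_probability(f, identity, biased))
```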

We thus see that robustness under arbitrary miscalibration comes at a cost of lower accuracy under perfect calibration. Establishing a formal understanding of this tradeoff and designing estimators that are provably Pareto-efficient are important open problems.

5 Proofs

In this section, we present the proofs of our theoretical results.

For notational simplicity, we use “$j \prec j'$” to denote that item $j$ has a smaller value than item $j'$. Since the items have distinct values, we have $j \prec j'$ if and only if $x_j < x_{j'}$. For the 0-1 loss, we call the expected loss of any estimator its “probability of error”, and one minus the expected loss its “probability of success”. For the canonical setting and A/B testing, the probability of success of random guessing is $\frac{1}{2}$. To show that some estimator strictly uniformly dominates random guessing for the canonical setting or A/B testing, we only need to show that the probability of success of this estimator is strictly higher than $\frac{1}{2}$, or equivalently, that its probability of error is strictly lower than $\frac{1}{2}$.

5.1 Proof of Theorem 1

We prove that no deterministic cardinal estimator can strictly uniformly dominate the random-guessing estimator, which implies the negative result for any deterministic ordinal estimator.

Recall the notation $i_{\mathrm{high}}$ for the item receiving the higher score (with ties broken uniformly at random), and $i_{\mathrm{low}}$ for the remaining item. First, we consider the deterministic estimator that always outputs $i_{\mathrm{high}}$ as the item whose value is greater. We call this estimator the “sign estimator”.

The proof consists of two steps. (1) We show that the sign estimator does not strictly uniformly dominate random guess. (2) Building on top of (1), we show that more generally, no deterministic estimator strictly uniformly dominates random guess.

Step 1: The sign estimator does not strictly uniformly dominate random guess.

We construct the following counterexample such that the probability of error of the sign estimator is $\frac{1}{2}$. We construct reviewer calibration functions whose ranges are disjoint, that is, one reviewer always gives a higher score than the other reviewer, regardless of the items they are assigned. Then the relative ordering of the two scores does not convey any information about the relative ordering of the two items, and we show that in this case the sign estimator has a probability of error of $\frac{1}{2}$. Concretely, let the item values lie in a bounded interval, and let the calibration functions be such that every value in the range of $f_1$ exceeds every value in the range of $f_2$ (for instance, $f_1$ adds a constant offset larger than the length of the interval while $f_2$ adds none). Then the score given by reviewer 1 is higher than the score given by reviewer 2 regardless of the items they are assigned, and the sign estimator always outputs the item assigned to reviewer 1 as the larger item. The assignment is either $(A_1, A_2) = (\{1\}, \{2\})$ or $(\{2\}, \{1\})$, with probability $\frac{1}{2}$ each. Under the first assignment the sign estimator outputs item 1 as the larger item, and under the second it outputs item 2. Under one (and exactly one) of the two assignments, the output of the sign estimator is correct. Hence, the probability of error of the sign estimator is $\frac{1}{2}$.

Step 2: No deterministic estimator strictly uniformly dominates random guess.

Let $\mathcal{A}$ be the set of the two possible assignments, $\mathcal{A} = \{(\{1\}, \{2\}), (\{2\}, \{1\})\}$. A deterministic estimator is a deterministic function that takes as input the assignment and the scores for the two items, and outputs the relative ordering between the two items. Step 1 has shown that the sign estimator does not strictly uniformly dominate random guess. Hence, we only need to prove that any deterministic estimator that differs from the sign estimator does not strictly uniformly dominate random guess. For any such deterministic estimator, there exist some input values $(A^\circ, y_1^\circ, y_2^\circ)$ on which its output differs from that of the sign estimator. If the two estimators only differ at points where $y_1^\circ = y_2^\circ$, then we can use the same counterexample as in Step 1 to show that the probability of error of this deterministic estimator is $\frac{1}{2}$. It remains to consider the case where $y_1^\circ \neq y_2^\circ$; without loss of generality, assume $y_1^\circ > y_2^\circ$. Then consider the following counterexample. Let $x_1 > x_2$, and let $f_1 = f_2$ be strictly-increasing functions such that $f_1(x_1) = y_1^\circ$ and $f_1(x_2) = y_2^\circ$. Regardless of the reviewer assignment, the score for item 1 is $y_1^\circ$ and the score for item 2 is $y_2^\circ$. The item receiving the higher score is always item 1, so the sign estimator always gives the correct output. Under the assignment $A^\circ$, the deterministic estimator differs from the sign estimator, so the deterministic estimator gives the incorrect output. The assignment $A^\circ$ happens with probability $\frac{1}{2}$, so the probability of error of this deterministic estimator is at least $\frac{1}{2}$.

The two steps above complete the proof that there exists no deterministic estimator that strictly uniformly dominates random guess.

5.2 Proof of Theorem 2

In what follows, we prove that the probability of success of our estimator $\widehat{\pi}_{\mathrm{can}}$ is strictly greater than $\frac{1}{2}$ under arbitrary item values $(x_1, x_2)$ and arbitrary calibration functions $(f_1, f_2)$. We start by rewriting our estimator in (2) in an alternative, equivalent form, and then prove the result using this new expression of our estimator.

Recall that $i_{\mathrm{high}}$ denotes the item receiving the higher score, and $i_{\mathrm{low}}$ denotes the remaining item (with ties broken uniformly). Depending on the relative ordering of $y_1$ and $y_2$, we can split (2) into the following three cases:

(3a)
(3b)
(3c)

Recall that the function $f$ is strictly increasing and maps $\mathbb{R}$ to $[0, 1]$. Now we define the following auxiliary function:

(4)

Combining (3) and (4), we have

(5)

Without loss of generality, assume $x_1 < x_2$. The assignment is either $(A_1, A_2) = (\{1\}, \{2\})$ or $(\{2\}, \{1\})$, with probability $\frac{1}{2}$ each. Thus, the estimator observes the scores $(y_1, y_2) = (f_1(x_1), f_2(x_2))$ under the first assignment, or $(y_1, y_2) = (f_2(x_1), f_1(x_2))$ under the second. The probability of success of our estimator is:

(6)

where step (i) follows by plugging in (5), and step (ii) follows from the definition of the auxiliary function in (4).

By the monotonicity of the functions $f_1$, $f_2$, and $f$, and by the assumption that $x_1 < x_2$, we have $f_1(x_1) < f_1(x_2)$ and $f_2(x_1) < f_2(x_2)$. Since $f$ is monotonically increasing, the compositions $f \circ f_1$ and $f \circ f_2$ are monotonically increasing as well, and therefore $f(f_1(x_1)) < f(f_1(x_2))$ and $f(f_2(x_1)) < f(f_2(x_2))$. Hence, we have

(7)

Combining (6) and (7), we conclude that the probability of success of our estimator is strictly greater than $\frac{1}{2}$, completing the proof.

5.3 Proof of Theorem 3

We construct a counterexample on which the mean, median, and sign estimators each have a probability of error of $\frac{1}{2}$. In this counterexample, let the item values lie in a bounded interval, and let the reviewer calibration functions be as follows:

(8)

In these calibration functions, the score provided by each reviewer is the sum of the true value of the item assigned to this reviewer and a bias term specific to this reviewer. The analysis is performed separately for the three estimators. At a high level, the analysis for the mean estimator uses the fact that one reviewer (specifically, the reviewer with the largest bias) has a significantly greater bias than the rest of the reviewers. The analysis for the median and the sign estimators uses the fact that the ranges of these calibration functions are disjoint.

Mean estimator: Recall that each reviewer is assigned one of the two items. Given any assignment, consider the item assigned to the reviewer with the largest bias. Trivially, the sum of the scores for this item must be strictly greater than this largest bias. Now consider the remaining item (not assigned to this reviewer). The sum of the scores for the remaining item is at most the sum of the remaining biases and item values, which is strictly smaller by the choice of the biases in (8).

From these two bounds on the sums of scores, an item has the greater sum of scores if and only if the reviewer with the largest bias is assigned to this item. By symmetry of the assignment, this reviewer is assigned to either item with probability $\frac{1}{2}$. With the true ranking being either $1 \succ 2$ or $2 \succ 1$, the mean estimator makes an error under exactly one of the two assignments, and this assignment happens with probability $\frac{1}{2}$. Hence, the mean estimator makes an error with probability $\frac{1}{2}$.

Median estimator and sign estimator: For the median estimator and the sign estimator, we first present an alternative view of the assignment, which is used in the analysis of both estimators. Recall that the assignment sends $\frac{m}{2}$ reviewers, drawn uniformly at random without replacement, to evaluate item 1, and the remaining $\frac{m}{2}$ reviewers to item 2. Equivalently, we can view this assignment as comprising the following two steps. (1) We sample uniformly at random a permutation of the $m$ reviewers, denoted as a list $(r_1, \ldots, r_m)$, and define $R = (r_1, \ldots, r_{m/2})$ and $R' = (r_{m/2+1}, \ldots, r_m)$ as the first half and second half of the reviewers in this list. (2) We draw uniformly at random one of the two items and assign the list $R$ of reviewers to this item; we then assign the list $R'$ of reviewers to the remaining item. For each $k \in \{1, \ldots, \frac{m}{2}\}$, call $(r_k, r_{m/2+k})$ the $k^{\text{th}}$ pair of reviewers.

For the median estimator and the sign estimator, we prove that given any arbitrary realization of the lists $R$ and $R'$ in Step (1) of the assignment, the randomness in Step (2) alone yields a probability of error of $\frac{1}{2}$ for each of the two estimators.

Recall that the item values lie in a bounded interval. Since the biases of any two reviewers in Eq. (8) differ by more than the length of this interval, any reviewer $i$ gives a higher score than any other reviewer $i'$ if and only if $i > i'$, independent of the item values and the assignment. Formally, for any two distinct reviewers $i > i'$ and any pair of item values, we have

(9)

The remaining analysis is performed separately for the median estimator and the sign estimator.

Median estimator: Denote by $i^*$ and $i'^*$ the reviewers providing the (upper) median scores within the lists $R$ and $R'$, respectively. From (9), these two reviewers are determined by the reviewer indices alone, independent of the item values and of Step (2) of the assignment:

(10)

Also from (9), the higher of the two median scores, given by reviewers $i^*$ and $i'^*$, comes from the reviewer with the larger index. In Step (2) of the assignment, this reviewer is assigned to item 1 or item 2 with equal probability. Hence, the probability of error of the median estimator is $\frac{1}{2}$. This proves the claim that the (upper) median estimator does not strictly uniformly dominate random guess.

We now comment on using the median function defined as the lower median, or as the mean of the two middle values. For the lower median, the same argument as above applies. Now consider the median defined as the mean of the two middle values. When $\frac{m}{2}$ is odd, Eq. (10) still holds, and the argument above still applies. When $\frac{m}{2}$ is even, the median value may not equal any single reviewer's score. In this case we construct a counterexample where the item values again lie in a bounded interval, and the calibration functions are again additive biases, with the gaps between the biases chosen sufficiently large. With these calibration functions, for any pair of reviewers and any item values, the comparison between the relevant averaged scores is determined by the reviewer indices alone. Using this fact, we can show that the output of this median estimator depends only on the reviewer indices and on the realization of Step (2), independent of the item values. The probability of error of this median estimator is therefore also $\frac{1}{2}$.

Sign estimator: Denote by $A$ the assignment in which the reviewers in $R$ are assigned to item 1, and by $A'$ the assignment in which the reviewers in $R'$ are assigned to item 1. For each $k \in \{1, \ldots, \frac{m}{2}\}$, define $\beta_k$ as the binary indicator of whether the higher score in the $k^{\text{th}}$ pair of scores comes from item 1 under assignment $A$: set $\beta_k = 1$ if the higher score comes from item 1 and $\beta_k = 0$ otherwise. Define $\beta'_k$ similarly under assignment $A'$: set $\beta'_k = 1$ if the higher score comes from item 1 and $\beta'_k = 0$ otherwise. Inequality (9) implies that $\beta_k + \beta'_k = 1$ for every $k$. Define $W = \sum_{k} \beta_k$ as the count of pairwise wins for item 1 under assignment $A$, and define $W' = \sum_{k} \beta'_k$ similarly. Then we have

$W + W' = \frac{m}{2}$.   (11)

The sign estimator outputs the item with more pairwise wins. That is, under assignment $A$ the sign estimator outputs item 1 if $W > \frac{m}{4}$, under assignment $A'$ it outputs item 1 if $W' > \frac{m}{4}$, and it outputs one of the two items uniformly at random if $W = \frac{m}{4}$ or $W' = \frac{m}{4}$, respectively. When $W = \frac{m}{4}$, then by (11) we also have $W' = \frac{m}{4}$, so under either assignment the sign estimator has a tie and outputs one of the two items uniformly at random; the probability of error of the sign estimator is then $\frac{1}{2}$. Otherwise, we have $W \neq \frac{m}{4}$, and by (11) we have either $W > \frac{m}{4} > W'$ or $W' > \frac{m}{4} > W$. The sign estimator then gives different outputs under the two assignments, out of which one and only one output is correct. Since the two assignments are equally likely, the probability of error of the sign estimator is $\frac{1}{2}$.

5.4 Proof of Theorem 4

Recall that a subset of $\frac{m}{2}$ reviewers, drawn uniformly at random without replacement, is assigned to item 1, and the remaining $\frac{m}{2}$ reviewers are assigned to item 2. We provide an alternative and equivalent view of the assignment as the following two steps:

  1. We sample two reviewers, uniformly at random without replacement, as the first pair of reviewers for the two items. Then we sample two reviewers, uniformly at random without replacement from the remaining reviewers, as the second pair of reviewers for the two items. We continue until all reviewers are exhausted, yielding $\frac{m}{2}$ pairs of reviewers in total.

  2. Within each pair, we assign the two reviewers to the two items uniformly at random. That is, for each pair, one reviewer is assigned to one of the two items uniformly at random, and the other reviewer is assigned to the remaining item. The assignments are independent across pairs.

Consider any arbitrary values of the items $x_1$ and $x_2$. Given any arbitrary realization of Step (1) of the assignment procedure described above, we apply Theorem 2 and show that on each pair of reviewers, the canonical estimator gives the correct output with probability strictly greater than $\frac{1}{2}$. We then show that combining the pairs by majority voting yields a probability of success strictly greater than $\frac{1}{2}$.

Denote by $p(x_1, x_2, g_1, g_2)$ the probability that our canonical estimator in Eq. (2) gives the correct output when comparing items of values $x_1$ and $x_2$ under reviewer calibration functions $g_1$ and $g_2$. In Step (2) of the assignment procedure described above, for any $k \in \{1, \ldots, \frac{m}{2}\}$, consider the $k^{\text{th}}$ pair of reviewers. Suppose that the calibration functions of these two reviewers are denoted as $g_1$ and $g_2$.