1 Introduction
A number of applications involve data in the form of pairwise comparisons among a collection of items, and entail an evaluation of the individual items from this data. An application gaining increasing popularity is competition between pairs of AI bots (e.g., [26]). Here a number of AI bots compete with each other in pairwise matchups for a certain task, where each bot plays every other bot a certain number of times in a round-robin fashion, with the goal of evaluating the quality of each bot. A second example is the evaluation of self-play of AI algorithms in their training phase [34], where again, different copies of an AI bot play against each other a number of times. Applications involving humans include sports and online games, such as the English Premier League of football [22, 2] (unofficial ratings) and official world rankings for chess (e.g., FIDE [1] and USCF [14] ratings). The influence of scientific journals has also been analyzed in this manner, where citations from one journal to another are modeled as pairwise comparisons [35].
A common method of evaluating the items based on pairwise comparisons is to assume that the probability of an item beating another equals the logistic function of the difference in the true quality of the two items, and then infer the true quality from the observed outcomes of the comparisons (e.g., the Elo rating system). Various applications employ such an approach to rating from pairwise comparisons, with some modifications tailored to the specific application. Our goal is not to study the application-specific versions, but the foundational underpinnings of such rating systems.
In this paper, we study the pairwise-comparison model [15, 4] that underlies these rating systems, namely the Bradley-Terry-Luce (BTL) model [6, 24]. The BTL model assumes that each item is associated with an unknown real-valued parameter representing the quality of that item, and that the probability of an item beating another is the logistic function applied to the difference of the parameters of these two items. The BTL model is also employed in the applications of peer grading [32, 23] (where the grades of the students are set as the BTL parameters to be estimated), crowdsourcing [7, 27], and understanding consumer choice in marketing [16].
1.1 BTL model and maximum likelihood estimation
Now we present a formal definition of the BTL model. Let $d$ denote the number of items. The items are associated with an unknown parameter vector $\theta^* \in \mathbb{R}^d$, whose $i$-th entry $\theta^*_i$ represents the underlying quality of item $i$. When any item $i$ is compared with any item $j$ in the BTL model, item $i$ beats item $j$ with probability
$\frac{1}{1 + e^{-(\theta^*_i - \theta^*_j)}},$ (1)
independent of all other comparisons. The probability of item $j$ beating item $i$ is one minus the expression (1) above. We consider the "league format" [4] of comparisons, where every pair of items is compared $k$ times.
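To make the model and the league format concrete, they can be simulated in a few lines. The following is a minimal sketch (the function names are ours, not from the paper), writing $d$ for the number of items and $k$ for the number of comparisons per pair:

```python
import math
import random

def win_prob(theta_i, theta_j):
    # BTL: the probability that item i beats item j is the logistic
    # function of the difference of their quality parameters, as in (1).
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))

def simulate_league(theta, k, rng):
    """Simulate the league format: every pair (i, j) with i < j is
    compared k times; wins[i][j] counts the times item i beat item j."""
    d = len(theta)
    wins = [[0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            p = win_prob(theta[i], theta[j])
            w = sum(rng.random() < p for _ in range(k))
            wins[i][j] = w
            wins[j][i] = k - w  # comparisons have no ties in this model
    return wins
```

Note that `wins[i][j] + wins[j][i] == k` for every pair, reflecting that each of the $k$ comparisons has exactly one winner.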
We follow the usual assumption [17, 31] under the BTL model that the true parameter vector $\theta^*$ lies in the set $\Theta_B$, parameterized by a constant $B > 0$:
$\Theta_B := \Big\{\theta \in \mathbb{R}^d : \|\theta\|_\infty \le B,\ \textstyle\sum_{i=1}^d \theta_i = 0\Big\}.$ (2)
The first constraint requires that the magnitude of each parameter is bounded by the constant $B$. We call this constraint the "box constraint". A box constraint is necessary, because otherwise the estimation error can diverge to infinity [31, Appendix G]. The second constraint requires the parameters to sum to zero. This is without loss of generality due to the shift-invariance property of the BTL model.
A large amount of both theoretical [19, 17, 38, 25, 31] and applied [35, 33, 7, 27] literature focuses on the goal of estimating the parameter vector of the BTL model. A standard and widely-studied estimator is the maximum-likelihood estimator (MLE):
$\hat{\theta}_B \in \operatorname*{argmin}_{\theta \in \Theta_B} \ell(\theta),$ (3)
where $\ell$ is the negative log-likelihood function. Letting $X_{ij}$ denote a random variable representing the number of times that item $i$ beats item $j$, the negative log-likelihood function is given by
$\ell(\theta) = \sum_{i < j} \Big[ -X_{ij} \log\frac{1}{1+e^{-(\theta_i - \theta_j)}} - X_{ji} \log\frac{1}{1+e^{-(\theta_j - \theta_i)}} \Big].$
1.2 Metrics
Accuracy.
A common metric used in the literature on estimating the BTL model is the accuracy of the estimate, measured in terms of the mean squared error. Formally, the accuracy of any estimator $\hat{\theta}$ is defined as
$\alpha(\hat{\theta}) := \sup_{\theta^* \in \Theta_B} \frac{1}{d}\, \mathbb{E}\big[\|\hat{\theta} - \theta^*\|_2^2\big].$
Importantly, past work [17, 31] has shown that the MLE (3) has the appealing property of being minimax-optimal in terms of accuracy.
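For reference, the MLE (3) is a convex program and can be computed by any convex solver. The following is a minimal sketch using projected gradient descent on the negative log-likelihood; the step size, iteration count, and the centering-then-clipping projection are our simplifications for illustration, not the paper's prescribed method (an exact Euclidean projection onto the intersection of the box and the hyperplane would alternate the two operations):

```python
import math

def sigma(x):
    # Logistic function.
    return 1.0 / (1.0 + math.exp(-x))

def neg_log_likelihood(theta, wins, k):
    """Negative log-likelihood of the BTL model; wins[i][j] is the number
    of times item i beat item j, with wins[i][j] + wins[j][i] = k."""
    d = len(theta)
    nll = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            p = sigma(theta[i] - theta[j])
            nll -= wins[i][j] * math.log(p) + wins[j][i] * math.log(1.0 - p)
    return nll

def constrained_mle(wins, k, B, steps=5000, lr=0.005):
    """Sketch of the MLE (3): gradient step on the negative log-likelihood,
    then re-center to sum zero and clip to the box [-B, B]."""
    d = len(wins)
    theta = [0.0] * d
    for _ in range(steps):
        # Gradient: expected wins under theta minus observed wins.
        grad = [sum(k * sigma(theta[i] - theta[j]) - wins[i][j]
                    for j in range(d) if j != i) for i in range(d)]
        theta = [t - lr * g for t, g in zip(theta, grad)]
        mean = sum(theta) / d
        theta = [max(-B, min(B, t - mean)) for t in theta]
    return theta
```

On small instances this converges quickly; in any existing implementation, only the box bound needs to change to obtain the stretched variant discussed later in the paper.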
Bias.
Another important desideratum for designing and evaluating estimators is fairness. For example, in sports or online games, we do not want to assign scores in such a way that certain players systematically receive higher scores than their true quality while certain other players receive lower scores than their true quality. In this paper, we use the standard statistical definition of bias as the notion of fairness. For any estimator, the bias incurred on a single parameter is defined as the difference between the expected value of the estimator and the true value of that parameter. Since the parameter is a vector, we consider the worst-case bias, that is, the maximum magnitude of the bias across all items. Formally, the bias of any estimator $\hat{\theta}$ is defined as
$\beta(\hat{\theta}) := \sup_{\theta^* \in \Theta_B} \max_{i \in \{1, \ldots, d\}} \big|\mathbb{E}[\hat{\theta}_i] - \theta^*_i\big|.$
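The two metrics can be written out directly. A minimal Monte Carlo sketch, where the expectation is replaced by an average over independent runs of the estimator (the function names are ours, not from the paper):

```python
def worst_case_bias(estimates, theta_true):
    """Bias metric for a fixed theta*: max over items of
    |E[theta_hat_i] - theta*_i|, with the expectation approximated by
    averaging a list of independently computed estimates."""
    d = len(theta_true)
    n = len(estimates)
    return max(abs(sum(e[i] for e in estimates) / n - theta_true[i])
               for i in range(d))

def mean_squared_error(estimates, theta_true):
    """Accuracy metric for a fixed theta*: average over runs of
    ||theta_hat - theta*||_2^2 / d."""
    d = len(theta_true)
    return sum(sum((e[i] - t) ** 2 for i, t in enumerate(theta_true))
               for e in estimates) / (len(estimates) * d)
```

Taking the supremum over $\theta^* \in \Theta_B$, as in the formal definitions, would require sweeping over candidate true parameter vectors; the sketch evaluates a single fixed $\theta^*$.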
With this background, we now provide an overview of the contributions of this paper.
Table 2: Comparison of the estimators in terms of bias and accuracy (informal; see Section 2 for the formal statements).

Estimator | Bias | Accuracy (mean squared error)
Standard MLE $\hat{\theta}_B$ | $\Omega\big(\frac{1}{\sqrt{dk}}\big)$ | minimax-optimal
Stretched-MLE $\hat{\theta}_A$ (constant $A > B$) | $O\big(\frac{1}{dk}\big)$ | minimax-optimal
Unconstrained MLE $\hat{\theta}_\infty$ | undefined (infinite with nonzero probability) | unbounded
1.3 Contribution I: Performance of MLE
Our first contribution is to analyze the widely-used MLE (3) in terms of its bias. Let us begin with a visual illustration through simulation. Consider $d$ items with parameter values equally spaced in the interval $[-B, B]$, where $k$ pairwise comparisons are observed between each pair of items under the BTL model. We estimate the parameters using the MLE, and plot the bias on each item, averaged across the iterations of the simulation, in Figure 2 (striped red). The MLE shows a systematic bias: it induces a negative bias (underestimation) on the large positive parameters, and a positive bias (overestimation) on the large negative parameters. In the applications of interest, the MLE thus systematically underestimates the abilities of the top players/students/items and overestimates the abilities of those at the bottom.
In this paper, we theoretically quantify the bias incurred by the MLE.
Theorem 1.1 (MLE bias lower bound; informal).
The MLE (3) incurs a bias lower bounded as $\Omega\big(\frac{1}{\sqrt{dk}}\big)$.
As shown by our results to follow, this bias is suboptimal. Our proof of this result indicates that the bias is incurred because the MLE operates under the accurately specified model with the box constraint at $B$. That is, the MLE "clips" the estimate to lie within the set $\Theta_B$. This issue is visible in the simulation of Figure 2, where the bias is largest when the true values of the parameters are near the boundaries $\pm B$. For example, consider a true parameter whose value equals $B$. The estimate of this parameter sometimes equals the largest allowed value $B$ (due to the box constraint), and sometimes is smaller than $B$ (due to the randomness of the data). Therefore, in expectation, the estimate of this parameter incurs a negative bias. An analogous argument explains the positive bias when the true parameter equals or is close to $-B$.
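The clipping argument above can be checked exactly in the simplest case of $d = 2$ items, where the MLE has a closed form (the rescaled logit of the empirical win fraction, clipped to the box; this form is derived in Section 2.1.1). A small sketch, computing the bias exactly by summing over the binomial distribution of outcomes (the setup and function name are ours):

```python
import math

def clip(value, low, high):
    return max(low, min(high, value))

def exact_bias_two_items(theta_star, A, k):
    """Exact bias of the box-clipped MLE for item 1 with d = 2 items, by
    summing over the Binomial(k, p*) distribution of the win count.
    Setting A = B recovers the standard MLE; A > B gives a stretched box."""
    p_star = 1.0 / (1.0 + math.exp(-2.0 * theta_star))
    expected_estimate = 0.0
    for wins in range(k + 1):
        prob = math.comb(k, wins) * p_star ** wins * (1.0 - p_star) ** (k - wins)
        if wins == 0:
            estimate = -A      # all losses: the unconstrained MLE is -infinity
        elif wins == k:
            estimate = A       # all wins: the unconstrained MLE is +infinity
        else:
            x = wins / k
            estimate = clip(0.5 * math.log(x / (1.0 - x)), -A, A)
        expected_estimate += prob * estimate
    return expected_estimate - theta_star
```

With the true parameter at the boundary ($\theta^*_1 = B$), this expectation comes out strictly negative, matching the clipping intuition; at $\theta^*_1 = 0$ the bias vanishes by symmetry.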
1.4 Contribution II: Proposed stretched estimator and its theoretical guarantees
Our goal is to design an estimator with a lower bias while maintaining high accuracy. Since the MLE (3) is already widely studied and used, it is also desirable from a practical and computational standpoint that the new estimator be a simple modification of the MLE (3). With this motivation in mind, an intuitive approach is to consider the MLE but without the box constraint "$\|\theta\|_\infty \le B$". We call the estimator without the box constraint the "unconstrained MLE", and denote it by $\hat{\theta}_\infty$, because removing the box constraint is equivalent to setting the box size to $\infty$:
$\hat{\theta}_\infty \in \operatorname*{argmin}_{\theta \in \Theta_\infty} \ell(\theta),$ (4)
where $\Theta_\infty := \{\theta \in \mathbb{R}^d : \sum_{i=1}^d \theta_i = 0\}$. The unconstrained MLE incurs an unbounded error in terms of accuracy. This is because with nonzero probability an item beats all others, in which case the unconstrained MLE estimates the parameter of this item as $\infty$, thereby inducing an unbounded mean squared error.
Consequently, in this work, we propose the following simple modification of the MLE, which is a middle ground between the standard MLE (3) and the unconstrained MLE. Specifically, we consider a "stretched-MLE", which is associated with a parameter $A$ such that $B < A < \infty$. Given the parameter $A$, the stretched-MLE $\hat{\theta}_A$ is identical to (3) but "stretches" the box constraint from $B$ to $A$:
$\hat{\theta}_A \in \operatorname*{argmin}_{\theta \in \Theta_A} \ell(\theta),$ (5)
where $\Theta_A := \{\theta \in \mathbb{R}^d : \|\theta\|_\infty \le A,\ \sum_{i=1}^d \theta_i = 0\}$. That is, $\hat{\theta}_A$ simply replaces the box constraint $\|\theta\|_\infty \le B$ in (2) by the "stretched" box constraint $\|\theta\|_\infty \le A$.
The bias induced by the stretched-MLE (with a stretched box $A > B$) in the previous experiment is also shown in Figure 2 (solid blue). Observe that the maximum bias (incurred at the leftmost item with the largest negative parameter, or the rightmost item with the largest positive parameter) is significantly reduced compared to the MLE. Moreover, the bias induced by the stretched-MLE is qualitatively more evened out across the items.
Our second main theoretical result proves that the stretchedMLE indeed incurs a significantly lower bias.
Theorem 1.2 (Stretched-MLE bias upper bound; informal).
The stretched-MLE (5) with any constant $A > B$ incurs a bias upper bounded as $O\big(\frac{1}{dk}\big)$.
Given the significant bias reduction achieved by our estimator, a natural question concerns the accuracy of the stretched-MLE, particularly in light of the unbounded error incurred by the unconstrained MLE. We prove that our stretched-MLE maintains the same minimax-optimal rate on the mean squared error as the standard MLE.
Theorem 1.3 (Stretched-MLE accuracy upper bound; informal).
The stretched-MLE (5) with any constant $A > B$ incurs a mean squared error upper bounded as $O\big(\frac{1}{dk}\big)$, which is minimax-optimal.
This result shows a win-win for our stretched-MLE: it reduces the bias while retaining the accuracy guarantee. The comparison of the MLE and the stretched-MLE in terms of accuracy and bias is summarized in Table 2. Another attractive feature of our result is that the proposed stretched-MLE is a simple modification of the standard MLE, which can easily be incorporated into any existing implementation. It is important to note that while our modification of the estimator is simple to implement, our theoretical analyses and proofs are nontrivial.
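The claimed bias reduction can be illustrated exactly in the two-item case introduced in the previous subsection, again by summing over the binomial distribution of outcomes (a sketch under our own setup, not a reproduction of the paper's experiments):

```python
import math

def exact_bias(theta_star, A, k):
    """Exact bias of the box-clipped MLE for item 1 with d = 2 items
    (A = B gives the standard MLE; A > B gives the stretched-MLE)."""
    p_star = 1.0 / (1.0 + math.exp(-2.0 * theta_star))
    expected = 0.0
    for wins in range(k + 1):
        prob = math.comb(k, wins) * p_star ** wins * (1.0 - p_star) ** (k - wins)
        if wins == 0:
            est = -A                       # unconstrained solution is -infinity
        elif wins == k:
            est = A                        # unconstrained solution is +infinity
        else:
            est = min(A, max(-A, 0.5 * math.log(wins / (k - wins))))
        expected += prob * est
    return expected - theta_star

# Worst case theta*_1 = B: the standard MLE clips and is biased downward,
# while stretching the box (here to A = 2B, an arbitrary illustrative choice)
# leaves a much smaller residual bias.
B = 1.0
bias_standard = exact_bias(theta_star=B, A=B, k=100)
bias_stretched = exact_bias(theta_star=B, A=2.0 * B, k=100)
```

The standard MLE's bias is negative (clipping), whereas the stretched variant trades a small positive bias (convexity of the logit) against it, with smaller magnitude overall.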
1.5 Related work
The logistic nature (1) of the BTL model relates our work to studies of logistic regression (e.g., [28, 18, 37, 11]), among which the paper [37] is the most closely related to ours. The paper [37] considers an unconstrained MLE in logistic regression, and shows a bias in the opposite direction as compared to our results on the standard (constrained) MLE in the BTL model. Specifically, the paper [37] shows that large positive coefficients are overestimated, and large negative coefficients are underestimated. There are several additional key differences between the results in [37] and the present paper. The paper [37] studies the asymptotic bias of the unconstrained MLE, showing that the unconstrained MLE is not consistent. On the other hand, we operate in a regime where the MLE is still consistent, and study finite-sample bounds. Moreover, the paper [37] assumes that the predictor variables are i.i.d. Gaussian, whereas in the BTL model the probability that item $i$ beats item $j$ can be written as $\frac{1}{1+e^{-\langle x^{(ij)},\, \theta^* \rangle}}$, where the predictor variable $x^{(ij)} \in \mathbb{R}^d$ has its $i$-th entry equal to $1$, its $j$-th entry equal to $-1$, and the remaining entries equal to $0$.

A common way to achieve bias reduction is to apply a finite-sample correction, such as the jackknife [29] and other methods [10, 5, 12], to the MLE (or other estimators). These methods operate in a low-dimensional regime (small dimension) where the MLE is asymptotically unbiased. Informally, these methods use a Taylor expansion to write the bias as an infinite sum $\sum_{t=1}^{\infty} b_t(\theta^*)/n^t$, where $n$ is the number of samples and the $b_t$ are some functions of the true parameter. These works then modify the estimator in a variety of ways to eliminate the lower-order terms in this bias expression. However, since the expression is an infinite sum, eliminating the first term does not guarantee a low rate of the bias. Moreover, since the functions $b_t$ are implicit functions of $\theta^*$, eliminating lower-order terms does not directly translate to explicit worst-case guarantees.
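For concreteness, the jackknife mentioned above can be sketched on a textbook example outside the BTL setting: applied to the plug-in (maximum-likelihood) variance estimator, whose bias is exactly $-\sigma^2/n$, the correction recovers the unbiased sample variance. This is an illustrative aside of ours, not an experiment from the paper:

```python
import statistics

def jackknife_bias_corrected(data, estimator):
    """Quenouille's jackknife: combine the full-sample estimate with
    leave-one-out estimates to cancel the O(1/n) term of the bias."""
    n = len(data)
    full = estimator(data)
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    return n * full - (n - 1) * sum(leave_one_out) / n

def plugin_variance(xs):
    # The ML variance estimate, biased downward by a factor (n - 1) / n.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
```

For this estimator the correction happens to be exact, because the bias is exactly of order $1/n$; in general, as the text notes, only the leading term of the bias expansion is removed.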
Returning to the pairwise-comparison setting, in addition to the mean squared error, some past work has also considered accuracy in terms of the $\ell_2$ norm error [3] and the $\ell_\infty$ norm error [9, 8, 20]. The $\ell_\infty$ bound for a regularized MLE is analyzed in [8]. Our proof for bounding the bias of the standard MLE (unregularized) relies on a high-probability $\ell_\infty$ bound for the unconstrained MLE (unregularized). It is important to note that the $\ell_\infty$ bound for the regularized MLE from [8] does not carry over to the unregularized MLE, because the proof from [8] relies on the strong convexity of the regularizer. On the other hand, our intermediate result provides a partial answer to the open question in [8] about the $\ell_\infty$ norm of the unregularized MLE (Lemma A.5 in Appendix A): we establish an $\ell_\infty$ bound for the unregularized MLE when $d$ and $k$ are sufficiently large, which has the same rate as that of the regularized MLE in [8].
Another common occurrence of bias is the phenomenon of regression towards the mean [36]. Regression towards the mean refers to the phenomenon that random variables taking large (or small) values in one measurement are likely to take more moderate (closer to average) values in subsequent measurements. On the contrary, we consider items whose indices are fixed (and are not order statistics). For fixed indices, our results suggest that under the BTL model, the bias (underestimation of large true values) is in the opposite direction as that in regression towards the mean (overestimation of large observed values).
Finally, the paper [22] models the notion of fairness in Elo ratings in terms of "variance", where an estimator is considered fair if it is not much affected by the underlying randomness of the pairwise-comparison outcomes. The paper [22] empirically evaluates this notion of fairness on English Premier League data, but presents no theoretical results.
2 Main results
In this section, we formally provide our main theoretical results on bias and on the mean squared error.
2.1 Bias
Recall that $d$ denotes the number of items and $k$ denotes the number of comparisons per pair of items. The true parameter vector is $\theta^* \in \Theta_B$ for some prespecified constant $B > 0$. The following theorem provides bounds on the bias of the standard MLE $\hat{\theta}_B$ and of our stretched-MLE $\hat{\theta}_A$ with parameter $A$. In particular, it shows that if $A$ is a finite constant strictly greater than $B$, then our stretched-MLE has a much smaller bias than the MLE when $d$ and $k$ are sufficiently large.
Theorem 2.1.
(a) There exists a constant $c > 0$ that depends only on the constant $B$, such that
$\beta(\hat{\theta}_B) \ge \frac{c}{\sqrt{dk}}$ (6a)
for all $d \ge d_0$ and all $k \ge k_0$, where $d_0$ and $k_0$ are constants that depend only on the constant $B$.
(b) Let $A$ be any finite constant such that $A > B$. There exists a constant $c' > 0$ that depends only on the constants $A$ and $B$, such that
$\beta(\hat{\theta}_A) \le \frac{c'}{dk}$ (6b)
for all $d \ge d_0'$ and all $k \ge k_0'$, where $d_0'$ and $k_0'$ are constants that depend only on the constants $A$ and $B$.
We note that in Theorem 2.1(b), we allow $A$ to be any finite constant as long as $A > B$. Therefore, the difference between $A$ and $B$ can be an arbitrarily small constant. It is perhaps surprising that stretching the box constraint by only a small constant yields such a significant improvement in the bias. We provide intuition behind this result in Section 2.1.1.
We devote the remainder of this section to a sketch of the proof of Theorem 2.1. We first prove Theorem 2.1(b) and then Theorem 2.1(a), because the proof of Theorem 2.1(a) depends on the proof of Theorem 2.1(b). The complete proof is provided in Appendix A.
For Theorem 2.1(b), we first analyze the unconstrained MLE $\hat{\theta}_\infty$. By plugging $\hat{\theta}_\infty$ into the first-order optimality condition of the negative log-likelihood function and using concentration of the comparison outcomes, we prove a high-probability $\ell_\infty$ bound on $\|\hat{\theta}_\infty - \theta^*\|_\infty$ (which partially resolves the open problem from [8], in the regime where $d$ and $k$ are sufficiently large). Next, using a second-order mean value theorem on the first-order optimality condition and taking an expectation, we show a result of the form $\|\mathbb{E}[\hat{\theta}_\infty \mid E] - \theta^*\|_\infty \le \frac{c}{dk}$, where $E$ is some high-probability event (recall from Table 2 that for the unconstrained MLE, the bias without conditioning on $E$ is undefined). Finally, we show that the unconstrained MLE and the stretched-MLE are identical with high probability for sufficiently large $d$ and $k$, and perform some algebraic manipulations to finally arrive at the claim (6b).
For Theorem 2.1(a), we first prove a bias lower bound of order $\frac{1}{\sqrt{k}}$ when there are $d = 2$ items. Then for general $d$, we consider the bias on item 1 under a true parameter vector $\theta^*$ with $\theta^*_1 = B$. We construct an "oracle" MLE, such that analyzing the bias of the oracle MLE can be reduced to the proof of the two-item case, and thereby prove a bias of order $\frac{1}{\sqrt{dk}}$ for the oracle MLE. Finally, we show that the difference between the oracle MLE and the standard MLE is small, by repeating arguments from the proof of Theorem 2.1(b).
2.1.1 Intuition for Theorem 2.1
In this section, we provide intuition for why stretching the box constraint from $B$ to $A$ significantly reduces the bias. Specifically, we consider a simplified setting with $d = 2$ items. Due to the centering constraint, we have $\theta^*_2 = -\theta^*_1$ for the true parameters, and $\hat{\theta}_2 = -\hat{\theta}_1$ for any estimator that satisfies the centering constraint. Therefore, it suffices to focus only on item 1. Denote by $X$ the random variable representing the fraction of times that item 1 beats item 2, and denote the true probability that item 1 beats item 2 as $p^*$. We consider the true parameter of item 1 as $\theta^*_1$. Then we have $kX \sim \text{Binomial}(k, p^*)$, where $p^* = \frac{1}{1+e^{-2\theta^*_1}}$ and $X \in \{0, \frac{1}{k}, \ldots, 1\}$. The standard MLE $\hat{\theta}_B$, the stretched-MLE $\hat{\theta}_A$ and the unconstrained MLE $\hat{\theta}_\infty$ can be solved in closed form (for the first coordinate):
$\hat{\theta}_{\infty,1} = \frac{1}{2}\log\frac{X}{1-X}, \qquad \hat{\theta}_{B,1} = \max\Big\{-B,\, \min\Big\{B,\, \frac{1}{2}\log\frac{X}{1-X}\Big\}\Big\}, \qquad \hat{\theta}_{A,1} = \max\Big\{-A,\, \min\Big\{A,\, \frac{1}{2}\log\frac{X}{1-X}\Big\}\Big\}.$
See Fig. 3(a) for a comparison of these three estimators.
Now we consider the bias incurred by these three estimators. For intuition, let us consider the case $\theta^*_1 = B$, which incurs the largest bias in our simulation of Fig. 2. If the observation were noiseless (and thus $X$ equaled the true probability $p^*$), then all three estimators would output the true parameter $\theta^*_1 = B$. However, the observation $X$ is noisy, and only concentrates around $p^*$. To investigate how these three estimators behave differently under this noise, we zoom in to the region around $p^*$ indicated by the grey box in Fig. 3(a). (Note that the observation $X$ can lie outside the grey box, but for intuition we ignore this low-probability event due to concentration.)
The behaviors of the three estimators in the grey box are shown in Figs. 3(b)-3(d). For each of these estimators, the blue dots on the x-axis denote the noisy observations of $X$ across different iterations, and the blue dots on the estimator function denote the corresponding noisy estimates. The expected value of the estimator is the mean of the blue dots on the estimator function. For the standard MLE (Fig. 3(b)), the box constraint requires that the estimate never exceed $B$. We call this phenomenon the "clipping" effect, which introduces a negative bias. For the unconstrained MLE (Fig. 3(c)), since the estimator function is convex in this region, by Jensen's inequality the unconstrained MLE introduces a positive bias. Our proposed stretched-MLE (Fig. 3(d)) lies in the middle between the standard MLE and the unconstrained MLE. Therefore, the stretched-MLE balances the negative bias from the "clipping" effect against the positive bias from the convexity of the estimator function, thereby yielding a smaller bias on the item parameter. In practice, one can numerically tune the parameter $A$ to minimize the bias across all possible parameter vectors $\theta^* \in \Theta_B$. Simulation results for different values of $A$ are included in Section 3.
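The closed forms above can be written out directly; a minimal sketch (the function names are ours, not from the paper):

```python
import math

def unconstrained_mle_2items(x):
    """Closed-form unconstrained MLE for item 1 with d = 2 items, where
    x in (0, 1) is the observed fraction of comparisons won by item 1."""
    return 0.5 * math.log(x / (1.0 - x))

def standard_mle_2items(x, B):
    # The standard MLE clips the unconstrained solution to the box [-B, B].
    return max(-B, min(B, unconstrained_mle_2items(x)))

def stretched_mle_2items(x, A):
    # The stretched-MLE is identical but clips to the larger box [-A, A].
    return max(-A, min(A, unconstrained_mle_2items(x)))
```

The unconstrained form inverts the model relation $x = \frac{1}{1+e^{-2\theta_1}}$, and the three estimators agree whenever the unconstrained solution already lies inside the smaller box.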
2.2 Accuracy
Given the result of Theorem 2.1 on the bias reduction of the estimator $\hat{\theta}_A$, we revisit the mean squared error. Past work [17, 31] has shown that the standard MLE $\hat{\theta}_B$ is minimax-optimal in terms of the mean squared error. The following theorem shows that this minimax-optimality also holds for our proposed stretched-MLE $\hat{\theta}_A$, where $A$ is any finite constant such that $A > B$. The theorem statement and its proof follow Theorem 2 from [31], after some modification to accommodate the bounding-box parameter $A$.
Theorem 2.2.
(a) [Theorem 2(a) from [31]] There exists a constant $c > 0$ that depends only on the constant $B$, such that any estimator $\hat{\theta}$ has a mean squared error lower bounded as
$\alpha(\hat{\theta}) \ge \frac{c}{dk}$ (7a)
for all $d$ and $k$.
(b) Let $A$ be any finite constant such that $A > B$. There exists a constant $c' > 0$ that depends only on the constants $A$ and $B$, such that
$\alpha(\hat{\theta}_A) \le \frac{c'}{dk}.$ (7b)
Theorem 2.2 shows that the estimator $\hat{\theta}_A$ retains the minimax-optimality achieved by $\hat{\theta}_B$ in terms of the mean squared error. Combining Theorem 2.1 and Theorem 2.2 shows the Pareto improvement achieved by our estimator $\hat{\theta}_A$: it decreases the rate of the bias, while still performing optimally on the mean squared error.
3 Simulations
In this section, we explore our problem space and compare the standard MLE and our proposed stretched-MLE by simulation. In what follows, we fix the constant $B$, and unless specified otherwise we fix the values of $d$, $k$ and $A$; we also evaluate the performance of other values of $A$ subsequently. Error bars in all the plots represent the standard error of the mean.


(i) Dependence on $d$: We vary the number of items $d$, while fixing $k$. The results are shown in Fig. 4. Observe that the stretched-MLE has a significantly smaller bias, and performs on par with the MLE in terms of the mean squared error when $d$ is large. Moreover, the simulations also suggest a bias of order $\frac{1}{\sqrt{d}}$ for the MLE and of order $\frac{1}{d}$ for the stretched-MLE, as predicted by our theoretical results.
Figure 4: Performance of the estimators for various values of $d$, with $k$ and $A$ fixed. Each point is a mean over multiple iterations. Figure 5: Performance of the estimators for various values of $k$, with $d$ and $A$ fixed. Each point is a mean over multiple iterations.
(ii) Dependence on $k$: We vary the number of comparisons $k$ per pair of items, while fixing $d$. The results are shown in Fig. 5. As in simulation (i) with varying $d$, we observe that the stretched-MLE has a significantly smaller bias, and performs on par with the MLE in terms of the mean squared error. Moreover, the simulations also suggest a bias of order $\frac{1}{\sqrt{k}}$ for the MLE and of order $\frac{1}{k}$ for the stretched-MLE, as predicted by our theoretical results.

(iii) Different values of $A$: In our theoretical analysis, we proved bounds that hold for any constant $A$ such that $A > B$. In this simulation, we empirically compare the performance of the stretched-MLE for different values of $A$ (note that setting $A = B$ is equivalent to the standard MLE). We fix $d$, varying $k$ and varying $A$ upwards from $B$. The results are shown in Fig. 6. For the bias, we observe that the bias keeps decreasing as $A$ increases past $B$. This is because as we increase $A$ beyond $B$, the negative bias introduced by the "clipping" effect is reduced. The optimal value of $A$ for all settings of $k$ is always strictly greater than $B$. Moreover, the optimal $A$ seems to move closer to $B$ as we increase $k$. This agrees with the intuition in Section 2.1.1: when $k$ is larger, the estimate becomes more concentrated around the true parameter, so the "clipping" effect becomes smaller and can be accommodated by a smaller $A$. The mean squared error is insensitive to the choice of $A$ as long as $A \ge B$.
Figure 6: Performance of the estimators for various values of $A$ and $k$, with $d$ fixed. Setting $A = B$ is equivalent to the standard MLE. Each point is a mean over multiple iterations. Figure 7: Performance of the estimators for various values of $A$ and various settings of the true parameter vector $\theta^*$, with $d$ and $k$ fixed. Setting $A = B$ is equivalent to the standard MLE. Each point is a mean over multiple iterations.
(iv) Different settings of the true parameter $\theta^*$: Our theoretical results consider the worst-case bias and accuracy. In this simulation, we empirically compare the performance of the stretched-MLE under different settings of the true parameter vector $\theta^*$ (again, recall that setting $A = B$ is equivalent to the standard MLE). Specifically, we consider the following values of $\theta^*$:

- Worst case: a parameter vector attaining the worst-case bias, with $\|\theta^*\|_\infty = B$.
- Worst case (0.5): the worst-case vector scaled so that $\|\theta^*\|_\infty = 0.5B$.
- Bipolar: half of the values are $B$, and the other half are $-B$.
- Linear: the values are equally spaced in the interval $[-B, B]$.
- All zeros: all parameters are $0$.
We fix $d$ and $k$, varying $A$ under different settings of the true parameter vector $\theta^*$. The results are shown in Fig. 7. Two high-level takeaways from the empirical evaluations are that the bias generally reduces as $A$ increases past $B$, and that the mean squared error remains relatively constant beyond $A = B$ in the plotted range. In more detail, for the bias, we observe that the performance primarily depends on the largest magnitude of the items (that is, on $\|\theta^*\|_\infty$). For the settings worst case, bipolar and linear (where $\|\theta^*\|_\infty = B$), the bias keeps decreasing as $A$ increases past $B$. For the setting worst case (0.5) (where $\|\theta^*\|_\infty = 0.5B$), the bias keeps decreasing as $A$ increases past $0.5B$. This makes sense, since in this case we effectively have a box constraint at $0.5B$ (although the algorithm would not know this in practice). The bias for the setting all zeros stays small across all values of $A$. For the mean squared error, the increase as $A$ grows past $B$ is relatively small under most settings of the true parameter vector $\theta^*$. The bipolar setting has the largest increase in the mean squared error; under this setting, all parameters take values at the boundaries $\pm B$, and therefore the estimates of all parameters are affected by the box constraint.


(v) Sparse observations: So far we have considered a league format where $k$ comparisons are observed between every pair of items. Now we consider a random-design setup, where $k$ comparisons are observed between any pair of items independently with probability $p$, and none otherwise [25, 8]. In our simulations, we vary $d$ while fixing $k$, $A$ and $p$. We discard an iteration if the comparison graph is not connected, since the problem is not identifiable under such a graph. The results are shown in Figure 8. We observe that the stretched-MLE continues to outperform the MLE in terms of bias, and performs on par in terms of the mean squared error.
Figure 8: Performance of the estimators for various values of $d$ under sparse observations, with $k$ and the observation probability $p$ fixed. $k$ comparisons are observed between any pair of items independently with probability $p$, and none otherwise. Each point is a mean over multiple iterations.
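The identifiability caveat in this random-design setup can be sketched in code: sample an Erdős-Rényi comparison design and check connectivity of the comparison graph before estimating (hypothetical helper functions of ours, not from the paper):

```python
import random

def random_design_graph(d, p, rng):
    """Random-design setup: each of the d-choose-2 pairs is selected for
    comparison independently with probability p (and compared k times
    if selected); returns the selected pairs as graph edges."""
    return [(i, j) for i in range(d) for j in range(i + 1, d)
            if rng.random() < p]

def is_connected(d, edges):
    """The BTL parameters are identifiable (up to centering) only if the
    comparison graph is connected; check with a graph search."""
    adjacency = {i: [] for i in range(d)}
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    seen, frontier = {0}, [0]
    while frontier:
        node = frontier.pop()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return len(seen) == d
```

Discarding disconnected iterations, as done in the simulation above, amounts to rejecting designs for which `is_connected` returns false.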
4 Conclusions and discussions
In this work, we show that the widely-used MLE is suboptimal in terms of bias, and propose a class of estimators called "stretched-MLEs", which provably reduce the bias while maintaining minimax-optimality in terms of accuracy. These results on the performance of the MLE and the stretched-MLE are of both theoretical and practical interest. From the theoretical point of view, our analysis and proofs provide insights into the cause of the bias, explain why stretching the box alleviates this cause, and establish theoretical guarantees on the bias reduction obtained by stretching the box. Our results on the benefits of the stretched-MLE thus suggest that theoreticians consider the stretched-MLE for analysis instead of the standard MLE.
From the practical point of view, the constant $B$ is often unknown, and practitioners often estimate the value of $B$ by fitting the data or from past experience. Our results thus suggest that one should estimate $B$ leniently, as an estimate smaller than or equal to the true $B$ causes significant bias. Moreover, our proposed estimator is a simple modification of the MLE, which can be incorporated into any existing implementation with ease.
Our results lead to several open problems. First, it is of interest to extend our theoretical analysis to settings where the observations are sparse. For example, one may consider a random-design setup, where $k$ comparisons are observed between any pair of items independently with probability $p$, and none otherwise [25, 8] (also see simulation (v) in Section 3). In terms of the bias under this random-design setup, we believe that the lower bound for the MLE and the upper bound for our stretched-MLE also hold with $dk$ replaced by $dkp$, as $\Omega\big(\frac{1}{\sqrt{dkp}}\big)$ and $O\big(\frac{1}{dkp}\big)$ respectively; we also believe that the dependence of the stretched-MLE on $p$ is no worse than that of the standard MLE. Second, it is of interest to extend our results to other parametric models such as the Thurstone model [39], and we envisage similar results to hold across a variety of such models. Finally, the ideas and techniques developed in this paper may also help improve the Pareto efficiency of other learning and estimation problems in terms of the bias-accuracy tradeoff.
Acknowledgements
The work of JW and NBS was supported in part by NSF grants 1755656 and 1763734. The work of RR was supported in part by NSF grant 1527032.
References
 [1] FIDE rating regulations effective from 1 July 2017, 2017. https://www.fide.com/fide/handbook.html?id=197&view=article [Online; accessed May 21, 2019].
 [2] Elo ratings  English Premier League, 2019. https://sinceawin.com/data/elo/league/div/e0 [Online; accessed May 21, 2019].

 [3] Arpit Agarwal, Prathamesh Patil, and Shivani Agarwal. Accelerated spectral ranking. In International Conference on Machine Learning, 2018.
 [4] David Aldous. Elo ratings and the sports model: A neglected topic in applied probability? Statistical Science, 32(4):616–629, 2017.
 [5] J. A. Anderson and S. C. Richardson. Logistic discrimination and bias correction in maximum likelihood estimation. Technometrics, 21(1):71–78, 1979.
 [6] Ralph A. Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.

 [7] Baiyu Chen, Sergio Escalera, Isabelle Guyon, Víctor Ponce-López, Nihar Shah, and Marc Oliu Simón. Overcoming calibration problems in pattern labeling with pairwise ratings: Application to personality traits. In European Conference on Computer Vision, 2016.
 [8] Yuxin Chen, Jianqing Fan, Cong Ma, and Kaizheng Wang. Spectral method and regularized MLE are both optimal for top-K ranking. Annals of Statistics, 47(4):2204–2235, 2019.
 [9] Yuxin Chen and Changho Suh. Spectral MLE: Top-K rank aggregation from pairwise comparisons. In International Conference on Machine Learning, 2015.
 [10] D. R. Cox and E. J. Snell. A general definition of residuals. Journal of the Royal Statistical Society. Series B (Methodological), 30(2):248–275, 1968.
 [11] Yingying Fan, Emre Demirkaya, and Jinchi Lv. Nonuniformity of pvalues can occur early in diverging dimensions. Journal of Machine Learning Research, 20(77):1–33, 2019.
 [12] David Firth. Bias reduction of maximum likelihood estimates. Biometrika, 80(1):27–38, 1993.
 [13] E. N. Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141–1144, 1959.
 [14] Mark E. Glickman and Thomas Doan. The US chess rating system, 2017. http://www.glicko.net/ratings/rating.system.pdf [Online; accessed May 21, 2019].
 [15] Mark E Glickman and Albyn C Jones. Rating the chess rating system. Chance, 12:21–28, 1999.
 [16] Paul E. Green, J. Douglas Carroll, and Wayne S. DeSarbo. Estimating choice probabilities in multiattribute decision making. Journal of Consumer Research, 8(1):76–84, 1981.
 [17] Bruce Hajek, Sewoong Oh, and Jiaming Xu. Minimaxoptimal inference from partial rankings. In Advances in Neural Information Processing Systems, 2014.

 [18] Xuming He and Qi-Man Shao. On parameters of increasing dimensions. Journal of Multivariate Analysis, 73(1):120–135, 2000.
 [19] David R. Hunter. MM algorithms for generalized Bradley-Terry models. Annals of Statistics, 32(1):384–406, 2004.
 [20] Minje Jang, Sunghyun Kim, Changho Suh, and Sewoong Oh. Optimal sample complexity of m-wise data for top-K ranking. In Advances in Neural Information Processing Systems, 2017.
 [21] L. R. Ford Jr. Solution of a ranking problem from binary comparisons. The American Mathematical Monthly, 64(8, Part 2):28–33, 1957.
 [22] Franz J. Király and Zhaozhi Qian. Modelling competitive sports: Bradley-Terry-Élő models for supervised and online learning of paired competition outcomes. preprint arXiv:1701.08055, 2017.
 [23] Alec Lamon, Dave Comroe, Peter Fader, Daniel McCarthy, Rob Ditto, and Don Huesman. Making WHOOPPEE: A collaborative approach to creating the modern student peer assessment ecosystem. In EDUCAUSE, 2016.
 [24] R. Duncan Luce. Individual Choice Behavior: A Theoretical analysis. Wiley, New York, NY, USA, 1959.
 [25] Sahand Negahban, Sewoong Oh, and Devavrat Shah. Rank Centrality: Ranking from pairwise comparisons. Operations Research, 65:266–287, 2016.
 [26] S. Ontañón, G. Synnaeve, A. Uriarte, F. Richoux, D. Churchill, and M. Preuss. A survey of realtime strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games, 5(4):293–311, 2013.
 [27] Víctor Ponce-López, Baiyu Chen, Marc Oliu, Ciprian Corneanu, Albert Clapés, Isabelle Guyon, Xavier Baró, Hugo Jair Escalante, and Sergio Escalera. ChaLearn LAP 2016: First round challenge on first impressions - dataset and results. In European Conference on Computer Vision, 2016.
 [28] Stephen Portnoy. Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Ann. Statist., 16(1):356–366, 03 1988.
 [29] M. H. Quenouille. Approximate tests of correlation in timeseries. Journal of the Royal Statistical Society. Series B (Methodological), 11(1):68–84, 1949.
 [30] Walter Rudin. Principles of Mathematical Analysis. McGrawHill, 1976.
 [31] Nihar B. Shah, Sivaraman Balakrishnan, Joseph Bradley, Abhay Parekh, Kannan Ramchandran, and Martin J. Wainwright. Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence. Journal of Machine Learning Research, 17(58):1–47, 2016.
 [32] Nihar B. Shah, Joseph K. Bradley, Abhay Parekh, Martin Wainwright, and Kannan Ramchandran. A case for ordinal peer-evaluation in MOOCs. In NIPS Workshop on Data Driven Education, 2013.
 [33] P. C. Sham and D. Curtis. An extended transmission/disequilibrium test (TDT) for multiallele marker loci. Annals of Human Genetics, 59(3):323–336, 1995.
 [34] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, L Robert Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy P. Lillicrap, Fan Fong Celine Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017.
 [35] Stephen M. Stigler. Citation patterns in the journals of statistics and probability. Statistical Science, 9(1):94–108, 1994.
 [36] Stephen M. Stigler. Regression towards the mean, historically considered. Statistical Methods in Medical Research, 6(2):103–114, 02 1997.
 [37] Pragya Sur and Emmanuel J. Candès. A modern maximumlikelihood theory for highdimensional logistic regression. preprint arXiv:1803.06964, 2018.
 [38] Balázs Szörényi, Róbert Busa-Fekete, Adil Paul, and Eyke Hüllermeier. Online rank elicitation for Plackett-Luce: A dueling bandits approach. In Advances in Neural Information Processing Systems, 2015.
 [39] L. L. Thurstone. A law of comparative judgment. Psychological Review, 34:278–286, 1927.
Appendix A Proof of Theorem 2.1
In this appendix, we present the proof of Theorem 2.1. We first introduce notation and preliminaries in Appendix A.1, to be used subsequently in proving both parts of Theorem 2.1. The proof of Theorem 2.1(2) is presented in Appendix A.2, and the proof of Theorem 2.1(1) is presented in Appendix A.3. We present the proof of Theorem 2.1(2) before that of Theorem 2.1(1), because the proof of Theorem 2.1(1) depends on the proof of Theorem 2.1(2).
In the proof of Theorem 2.1(1), the constants are allowed to depend only on the constant $B$. In the proof of Theorem 2.1(2), the constants are allowed to depend only on the constants $A$ and $B$. The proofs of all the lemmas are presented in Appendix A.4.
A.1 Notation and preliminaries
In this appendix, we introduce notation and preliminaries that are used subsequently in the proofs of both Theorem 2.1(1) and Theorem 2.1(2).


Notation
Recall that $d$ denotes the number of items, and $k$ denotes the number of comparisons per pair of items. The items are associated to a true parameter vector $\theta^* \in \mathbb{R}^d$. We have the set $\Theta_B := \{\theta \in \mathbb{R}^d : \mathbf{1}^T\theta = 0, \|\theta\|_\infty \le B\}$ and the set $\Theta_A := \{\theta \in \mathbb{R}^d : \mathbf{1}^T\theta = 0, \|\theta\|_\infty \le A\}$, where $A$ and $B$ are finite constants such that $A > B > 0$. The true parameter vector satisfies $\theta^* \in \Theta_B$.
Denote $p^*_{ij}$ as the probability that item $i$ beats item $j$. Under the BTL model, we have
$$p^*_{ij} = \frac{1}{1 + e^{-(\theta^*_i - \theta^*_j)}}. \qquad (8)$$
For every $t \in [k]$, denote the outcome of the $t$-th comparison between item $i$ and item $j$ as
$$X^{(t)}_{ij} = \begin{cases} 1 & \text{if item } i \text{ beats item } j, \\ 0 & \text{otherwise.} \end{cases}$$
We have $X^{(t)}_{ij} \sim \text{Bernoulli}(p^*_{ij})$, independent across all $i < j$ and all $t \in [k]$. Recall that $W_{ij}$ denotes the number of times that item $i$ beats item $j$. We have $W_{ij} = \sum_{t=1}^{k} X^{(t)}_{ij}$ and therefore $W_{ij} \sim \text{Binomial}(k, p^*_{ij})$. Denote $\overline{X}_{ij}$ as the fraction of times that item $i$ beats item $j$. That is,
$$\overline{X}_{ij} = \frac{W_{ij}}{k} = \frac{1}{k} \sum_{t=1}^{k} X^{(t)}_{ij}. \qquad (9)$$
We have $\overline{X}_{ij} \sim \frac{1}{k}\,\text{Binomial}(k, p^*_{ij})$, independent across all $i < j$.
Finally, we use $c, c', c_1, c_2$, etc. to denote finite positive constants whose values may change from line to line. We write $f \lesssim g$ if there exists a constant $c > 0$ such that $f \le c\,g$ for all $d$ and $k$. The notation $f \gtrsim g$ is defined analogously.
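The sampling model above is easy to make concrete in code. The following is a minimal simulation of the comparison data under the BTL win probabilities (8); this is our own illustrative sketch, and the helper name `simulate_btl` is ours, not the paper's.

```python
import numpy as np

def simulate_btl(theta, k, rng=None):
    """Simulate k pairwise comparisons per pair under the BTL model.

    Returns xbar[i, j]: the fraction of the k comparisons in which
    item i beats item j (so xbar[i, j] + xbar[j, i] == 1 for i != j).
    """
    rng = np.random.default_rng(rng)
    d = len(theta)
    xbar = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            # P(item i beats item j) is the sigmoid of theta_i - theta_j, as in (8).
            p_ij = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))
            wins = rng.binomial(k, p_ij)   # W_ij ~ Binomial(k, p_ij), as in (9)
            xbar[i, j] = wins / k
            xbar[j, i] = 1.0 - xbar[i, j]
    return xbar

# Example: d = 4 items with a centered true parameter vector, k = 1000.
theta_star = np.array([1.0, 0.5, -0.5, -1.0])   # entries sum to zero
xbar = simulate_btl(theta_star, k=1000, rng=0)
```

With many comparisons per pair, the empirical fractions concentrate around the model probabilities, which is exactly the concentration phenomenon the proofs below rely on.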

Notion of conditioning
Let $E$ be any event. The conditional bias of any estimator $\widehat{\theta}$ conditioned on the event $E$ is defined as:
$$\mathbb{E}\bigl[\,\widehat{\theta} - \theta^* \mid E\,\bigr].$$
We use “w.h.p.$(k)$” to denote that an event happens with probability at least
$$1 - c_1 e^{-c_2 k}$$
for all $d \ge 2$ and $k \ge 1$, where $c_1$ and $c_2$ are positive constants.
Similarly, we use “w.h.p.$(k)$” to denote that conditioned on some event $E$, some other event happens with conditional probability at least
$$1 - c_1 e^{-c_2 k}$$
for all $d \ge 2$ and $k \ge 1$, where $c_1$ and $c_2$ are positive constants.

The negative log-likelihood function and its derivative
Recall that $\ell$ denotes the negative log-likelihood function. Under the BTL model, we have
$$\ell(\theta; W) = -\sum_{1 \le i < j \le d} \Bigl[ W_{ij} \log \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} + W_{ji} \log \frac{e^{\theta_j}}{e^{\theta_i} + e^{\theta_j}} \Bigr]. \qquad (10)$$
Since $\overline{X}$ is simply a normalized version of $W$, we equivalently denote the negative log-likelihood function as $\ell(\theta; \overline{X})$.
From the expression of $\ell$ in (10), we compute the gradient for every $i \in [d]$ as
$$\frac{\partial \ell}{\partial \theta_i} = k \sum_{j \ne i} \Bigl( \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} - \overline{X}_{ij} \Bigr). \qquad (11)$$
Finally, the following lemma from [19] shows the strict convexity of the negative log-likelihood function $\ell$.
Lemma A.1 (Lemma 2(a) from [19]).
The negative log-likelihood function $\ell$ is strictly convex in $\theta$.
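As a numerical sanity check on the gradient expression (11), the snippet below implements the negative log-likelihood and its gradient under the standard BTL parameterization and verifies them against each other by finite differences. This is our own sketch; the function names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_log_likelihood(theta, xbar, k):
    """Negative log-likelihood as in (10), written in terms of the
    fractions xbar_ij = W_ij / k (each pair i < j contributes
    -k * [xbar_ij * log sigmoid(theta_i - theta_j) + xbar_ji * log sigmoid(theta_j - theta_i)])."""
    d = len(theta)
    nll = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            diff = theta[i] - theta[j]
            nll -= k * (xbar[i, j] * np.log(sigmoid(diff))
                        + xbar[j, i] * np.log(sigmoid(-diff)))
    return nll

def gradient(theta, xbar, k):
    """Gradient as in (11): d ell / d theta_i = k * sum_{j != i} (sigmoid(theta_i - theta_j) - xbar_ij)."""
    d = len(theta)
    g = np.zeros(d)
    for i in range(d):
        for j in range(d):
            if j != i:
                g[i] += k * (sigmoid(theta[i] - theta[j]) - xbar[i, j])
    return g
```

A finite-difference comparison of `gradient` against `neg_log_likelihood` confirms that the two expressions are consistent.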

The sigmoid function and its derivatives
Denote the function $\sigma : \mathbb{R} \to (0, 1)$ as the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$. It is straightforward to verify that the function $\sigma$ has the following two properties.


The first derivative $\sigma'$ is positive on $\mathbb{R}$. Moreover, on any bounded interval, the first derivative is bounded above and below. That is, for any constants $m_1 < m_2$, there exist constants $c_2 > c_1 > 0$ such that
$$c_1 \le \sigma'(x) \le c_2 \qquad \text{for all } x \in [m_1, m_2]. \qquad \text{(12a)}$$
The second derivative $\sigma''$ is bounded on any bounded interval. That is, for any constants $m_1 < m_2$, there exists a constant $c_3 > 0$ such that
$$|\sigma''(x)| \le c_3 \qquad \text{for all } x \in [m_1, m_2]. \qquad \text{(12b)}$$
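The two properties follow from the closed forms $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and $\sigma''(x) = \sigma(x)(1 - \sigma(x))(1 - 2\sigma(x))$. The check below verifies them numerically on a bounded interval; the choice of interval $[-4, 4]$ and the global bound $1/(6\sqrt{3})$ on $|\sigma''|$ are our additions, not quantities from the paper.

```python
import numpy as np

def sigma(x):
    """Sigmoid function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigma_prime(x):
    """First derivative: sigma'(x) = sigma(x) * (1 - sigma(x)), positive everywhere."""
    s = sigma(x)
    return s * (1.0 - s)

def sigma_double_prime(x):
    """Second derivative: sigma''(x) = sigma(x) * (1 - sigma(x)) * (1 - 2 * sigma(x))."""
    s = sigma(x)
    return s * (1.0 - s) * (1.0 - 2.0 * s)

# On the bounded interval [-M, M], sigma' is sandwiched between the positive
# constant sigma'(M) and its global maximum sigma'(0) = 1/4, matching (12a);
# |sigma''| is bounded (globally by 1/(6*sqrt(3))), matching (12b).
M = 4.0
xs = np.linspace(-M, M, 10001)
first_min = sigma_prime(xs).min()
first_max = sigma_prime(xs).max()
second_max = np.abs(sigma_double_prime(xs)).max()
```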


Existence and uniqueness of MLE
Recall that the MLE (3), the unconstrained MLE (4), and the stretched-MLE (5) are respectively defined as:
$$\widehat{\theta}_{(B)} \in \operatorname*{arg\,min}_{\theta \in \Theta_B} \ell(\theta; \overline{X}), \qquad (13)$$
$$\widehat{\theta}_{(\infty)} \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^d :\, \mathbf{1}^T\theta = 0} \ell(\theta; \overline{X}), \qquad (14)$$
$$\widehat{\theta}_{(A)} \in \operatorname*{arg\,min}_{\theta \in \Theta_A} \ell(\theta; \overline{X}). \qquad (15)$$
The following lemma shows the existence and uniqueness of the stretched-MLE (15) for any constant $A$, which incorporates the standard MLE (13) by setting $A = B$.
Lemma A.2.
For any finite constant $A > 0$, there always exists a unique solution to the stretched-MLE (15).
For the unconstrained MLE, due to the removal of the box constraint in (14), a finite solution may not exist. However, the following lemma shows that a unique finite solution exists with high probability.
Lemma A.3.
There exists a unique finite solution to the unconstrained MLE (14) w.h.p.$(k)$.
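To make the estimators concrete: the unconstrained MLE (14) can be computed by gradient descent on the negative log-likelihood, re-centering at the end (the likelihood depends on $\theta$ only through pairwise differences, so it is shift-invariant). This is a minimal sketch of ours, not the paper's algorithm; with noiseless comparison fractions as input it recovers the true parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unconstrained_mle(xbar, steps=5000, lr=0.05):
    """Gradient descent on the negative log-likelihood (10).

    xbar[i, j] is the fraction of comparisons in which item i beat item j.
    The likelihood is invariant to adding a constant to every theta_i, so
    the centering constraint 1^T theta = 0 is imposed at the end.
    """
    d = xbar.shape[0]
    theta = np.zeros(d)
    for _ in range(steps):
        # Gradient (11), up to the constant factor k:
        #   grad_i = sum_{j != i} (sigmoid(theta_i - theta_j) - xbar_ij).
        diff = theta[:, None] - theta[None, :]
        g = sigmoid(diff) - xbar
        np.fill_diagonal(g, 0.0)
        theta -= lr * g.sum(axis=1) / d
    return theta - theta.mean()

# With noiseless inputs xbar_ij = sigmoid(theta*_i - theta*_j), the MLE
# recovers theta* up to numerical tolerance.
theta_star = np.array([0.8, 0.2, -0.3, -0.7])    # centered truth
xbar = sigmoid(theta_star[:, None] - theta_star[None, :])
np.fill_diagonal(xbar, 0.0)
theta_hat = unconstrained_mle(xbar)
```

The stretched-MLE (15) would additionally project each iterate onto the box $\|\theta\|_\infty \le A$; we omit that projection step in this sketch.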
A.2 Proof of Theorem 2.1(2)
In this appendix, we present the proof of Theorem 2.1(2). To describe the main steps involved, we first present a proof sketch for the simple case of $d = 2$ items (Appendix A.2.1), followed by the complete proof of the general case (Appendix A.2.2). The reader may pass to the complete proof in Appendix A.2.2 without loss of continuity.
A.2.1 Simple case: 2 items
We first present an informal proof sketch for a simple case where there are $d = 2$ items. The proof for the general case in Appendix A.2.2 follows the same outline. In the case of $d = 2$ items, due to the centering constraint $\mathbf{1}^T\theta^* = 0$ on the true parameter vector, we have $\theta^*_2 = -\theta^*_1$. Similarly, we have $\widehat{\theta}_2 = -\widehat{\theta}_1$ for any estimator $\widehat{\theta}$ that satisfies the centering constraint (in particular, for the stretched-MLE $\widehat{\theta}_{(A)}$ and the unconstrained MLE $\widehat{\theta}_{(\infty)}$). Therefore, it suffices to focus only on item $1$. Since there are only two items, for ease of notation, we denote $\theta^* := \theta^*_1$ and $\widehat{\theta} := \widehat{\theta}_1$. We now present the main steps of the proof sketch.
Proof sketch of the $d = 2$ case (informal):
In the proof sketch, we fix any finite constants $A$ and $B$ such that $A > B > 0$, and fix any $\theta^* \in [-B, B]$.


Step 1: Establish concentration of $\overline{X}$
Denote $\overline{X} := \overline{X}_{12}$ and $p := p^*_{12} = \sigma(2\theta^*)$. By Hoeffding's inequality, we have
$$\mathbb{P}\bigl( |\overline{X} - p| \ge t \bigr) \le 2 e^{-2 k t^2} \qquad \text{for every } t \ge 0. \qquad (16)$$
Since $\theta^* \in [-B, B]$, we have that $p$ is bounded away from $0$ and $1$ by a constant. Hence, for sufficiently large $k$, there exist constants $0 < c_{\min} < c_{\max} < 1$ where $c_{\min} \le p \le c_{\max}$, such that
$$c_{\min} \le \overline{X} \le c_{\max} \qquad \text{w.h.p.}(k). \qquad (17)$$
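The concentration step can be checked empirically: the snippet below compares the empirical tail probability of $\overline{X} = W/k$ against the Hoeffding bound (16). The specific values of $k$, $p$, and $t$ are arbitrary choices of ours for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k, p, t, trials = 200, 0.7, 0.1, 20000

# xbar = W / k with W ~ Binomial(k, p), as in (9).
xbar = rng.binomial(k, p, size=trials) / k

# Empirical estimate of P(|xbar - p| >= t) versus the Hoeffding bound (16).
empirical = np.mean(np.abs(xbar - p) >= t)
hoeffding = 2.0 * np.exp(-2.0 * k * t ** 2)
```

For these values the Hoeffding bound is about 0.037, while the empirical tail probability is far smaller, as expected from a bound that holds uniformly over all distributions on $[0, 1]$.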
Step 2: Write the first-order optimality condition for $\widehat{\theta}_{(\infty)}$
The unconstrained MLE $\widehat{\theta}_{(\infty)}$ minimizes the negative log-likelihood $\ell$. If a finite unconstrained MLE exists (for the proof sketch, we ignore the high-probability nature of Lemma A.3, and assume that a finite $\widehat{\theta}_{(\infty)}$ always exists; this is made precise in the complete proof in Appendix A.2.2), we have $\nabla \ell(\widehat{\theta}_{(\infty)}) = 0$. Setting $d = 2$ in the gradient expression (11) and plugging in $\widehat{\theta}_2 = -\widehat{\theta}_1$, we have
$$\frac{\partial \ell}{\partial \theta_1}\Big|_{\theta = \widehat{\theta}_{(\infty)}} = k \bigl( \sigma(2\widehat{\theta}) - \overline{X} \bigr). \qquad (18)$$
Setting the derivative (18) to $0$, we have
$$\sigma(2\widehat{\theta}) = \overline{X}. \qquad (19)$$
By the definition of $p$ in (8), we have $p = \sigma(\theta^*_1 - \theta^*_2)$, which can be written as
$$\sigma(2\theta^*) = p. \qquad (20)$$
Define a function $F : \mathbb{R} \to (0, 1)$ as
$$F(t) := \sigma(2t). \qquad (21)$$
Subtracting (20) from (19) and using the definition of $F$ from (21), we have
$$F(\widehat{\theta}) - F(\theta^*) = \overline{X} - p. \qquad (22)$$
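In this two-item case the stationarity condition (19) can be inverted in closed form: $\widehat{\theta} = \frac{1}{2}\sigma^{-1}(\overline{X}) = \frac{1}{2}\log\frac{\overline{X}}{1 - \overline{X}}$. The snippet below is our own sanity check, not part of the proof, verifying that this value makes the model probability match $\overline{X}$ exactly.

```python
import numpy as np

def two_item_mle(xbar):
    """Closed-form solution of sigmoid(2 * theta_hat) = xbar, i.e. (19):
    theta_hat = (1/2) * logit(xbar).  Requires 0 < xbar < 1, which holds
    with high probability by the concentration step."""
    return 0.5 * np.log(xbar / (1.0 - xbar))

xbar = 0.75
theta_hat = two_item_mle(xbar)
model_p = 1.0 / (1.0 + np.exp(-2.0 * theta_hat))   # sigmoid(2 * theta_hat)
```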
Step 3: Bound the difference between $\widehat{\theta}$ and $\theta^*$, by the first-order mean value theorem
It can be verified that $F$ has a positive first derivative on $\mathbb{R}$. Moreover, there exists some constant $c > 0$ such that $F'(t) \ge c$ for all $t$ in any bounded interval. Applying the first-order mean value theorem on (22), we have the deterministic relation
$$F'(\widetilde{\theta})\,\bigl(\widehat{\theta} - \theta^*\bigr) = \overline{X} - p, \qquad (23)$$
where $\widetilde{\theta}$ is a random variable that depends on $\widehat{\theta}$ and $\theta^*$, and takes values between $\widehat{\theta}$ and $\theta^*$. By (17), we have that $\widehat{\theta}$, and hence $\widetilde{\theta}$, lies in a bounded interval w.h.p.$(k)$, so that $F'(\widetilde{\theta}) \ge c$ w.h.p.$(k)$. From (23) we have
$$\bigl|\widehat{\theta} - \theta^*\bigr| \le \frac{1}{c}\,\bigl|\overline{X} - p\bigr| \qquad \text{w.h.p.}(k), \qquad (24)$$
and hence, combining with the concentration (16), for any constant $\epsilon > 0$,
$$\bigl|\widehat{\theta} - \theta^*\bigr| \le \epsilon \qquad \text{w.h.p.}(k). \qquad (25)$$

Step 4: Bound the expected difference between $\widehat{\theta}$ and $\theta^*$, by the second-order mean value theorem
By the second-order mean value theorem on (22), we have the deterministic relation
$$\overline{X} - p = F'(\theta^*)\bigl(\widehat{\theta} - \theta^*\bigr) + \frac{1}{2} F''(\widetilde{\theta}')\bigl(\widehat{\theta} - \theta^*\bigr)^2, \qquad (26)$$
where $\widetilde{\theta}'$ is a random variable that depends on $\widehat{\theta}$ and $\theta^*$, and takes values between $\widehat{\theta}$ and $\theta^*$. By (17), we have that $\widetilde{\theta}'$ lies in a bounded interval w.h.p.$(k)$.
It can be verified that $F$ has a bounded second derivative. That is, $|F''(t)| \le c'$ for all $t \in \mathbb{R}$. Taking an expectation over (26), we have
$$\bigl| \mathbb{E}\bigl[\widehat{\theta} - \theta^*\bigr] \bigr| \overset{\text{(i)}}{\le} c'' \, \mathbb{E}\Bigl[\bigl(\widehat{\theta} - \theta^*\bigr)^2\Bigr] \qquad (27)$$
$$\overset{\text{(ii)}}{\le} c''' \, \mathbb{E}\Bigl[\bigl(\overline{X} - p\bigr)^2\Bigr] = c''' \, \frac{p(1-p)}{k} \le \frac{c''''}{k}, \qquad (28)$$
where (i) is true because $\mathbb{E}[\overline{X} - p] = 0$ combined with the fact that $F'(\theta^*)$ is bounded below by a constant, and (ii) is true by (16) (for the proof sketch, we ignore the high-probability nature of (16) and treat it as a deterministic relation; this is made precise in the complete proof in Appendix A.2.2).
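The $O(1/k)$ bias bound established in this step can be observed numerically. The simulation below is our own illustration (the parameter values are arbitrary): it estimates the bias of the two-item unconstrained MLE $\widehat{\theta} = \frac{1}{2}\log\frac{\overline{X}}{1-\overline{X}}$ by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 0.5
p = 1.0 / (1.0 + np.exp(-2.0 * theta_star))   # P(item 1 beats item 2)
k, trials = 100, 200000

# Draw xbar = W / k and compute theta_hat = (1/2) * logit(xbar) per trial.
# For these values of k and p, xbar hits 0 or 1 with negligible probability,
# so the clipping below is essentially never active.
xbar = rng.binomial(k, p, size=trials) / k
xbar = np.clip(xbar, 1.0 / (2 * k), 1.0 - 1.0 / (2 * k))
theta_hat = 0.5 * np.log(xbar / (1.0 - xbar))

bias_estimate = theta_hat.mean() - theta_star
```

For this configuration the estimated bias is small but strictly positive, consistent with a bias of order $1/k$ rather than zero.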

Step 5: Connect back to $\widehat{\theta}_{(A)}$
From (25), we have $|\widehat{\theta} - \theta^*| \le A - B$ w.h.p.$(k)$ for sufficiently large $k$. Hence,
$$\bigl|\widehat{\theta}\bigr| \le \bigl|\theta^*\bigr| + \bigl|\widehat{\theta} - \theta^*\bigr| \le B + (A - B) = A \qquad \text{w.h.p.}(k).$$
Moreover, we have $\widehat{\theta}_2 = -\widehat{\theta}_1$. Therefore, with high probability, the unconstrained MLE $\widehat{\theta}_{(\infty)}$ does not violate the box constraint $\|\theta\|_\infty \le A$, and therefore is identical to the stretched-MLE $\widehat{\theta}_{(A)}$. Hence, the bound (28) holds for the stretched-MLE (for the proof sketch, we ignore the high-probability nature of the fact that $\widehat{\theta}_{(\infty)} = \widehat{\theta}_{(A)}$, and treat it as a deterministic relation; this is made precise in the complete proof in Appendix A.2.2), completing the proof sketch.
A.2.2 Complete proof
In this appendix, we present the proof of Theorem 2.1(2), by formally extending the steps outlined for the simple case in Appendix A.2.1. In the general case, one notable challenge is that one can no longer write a closed-form solution of the MLE as we did in (19) of Step 2. The first-order optimality condition now becomes a system of $d$ equations that describes an implicit relation between $\widehat{\theta}_{(\infty)}$ and $\{\overline{X}_{ij}\}$, requiring a more involved analysis.
In the proof, we fix any finite constants $A$ and $B$ such that $A > B > 0$, and fix any $\theta^* \in \Theta_B$.


Step 1: Establish concentration of $\overline{X}$
We first use standard concentration inequalities to establish the following lemma, to be used in the subsequent steps of the proof.
Lemma A.4.
There exists a constant $c \in (0, \frac{1}{2})$, such that
$$c \le \overline{X}_{ij} \le 1 - c$$
simultaneously for all $i \ne j$ w.h.p.$(k)$.
Recall that Lemma A.3 states that a finite unconstrained MLE $\widehat{\theta}_{(\infty)}$ exists w.h.p.$(k)$. We denote $E$ as the event that Lemma A.3 and Lemma A.4 both hold. For the rest of the proof, we condition on $E$. Since both Lemma A.3 and Lemma A.4 hold w.h.p.$(k)$, taking a union bound, we have that $E$ holds w.h.p.$(k)$. That is,
$$\mathbb{P}(E) \ge 1 - c_1 e^{-c_2 k}. \qquad (29)$$
Step 2: Write the first-order optimality condition for the unconstrained MLE
Recall from Lemma A.1 that the negative log-likelihood function $\ell$ is convex in $\theta$. In this step, we first justify that whenever a finite unconstrained MLE exists, it sat