# Stretching the Effectiveness of MLE from Accuracy to Bias for Pairwise Comparisons

A number of applications (e.g., AI bot tournaments, sports, peer grading, crowdsourcing) use pairwise comparison data and the Bradley-Terry-Luce (BTL) model to evaluate a given collection of items (e.g., bots, teams, students, search results). Past work has shown that under the BTL model, the widely-used maximum-likelihood estimator (MLE) is minimax-optimal in estimating the item parameters, in terms of the mean squared error. However, another important desideratum for designing estimators is fairness. In this work, we consider fairness modeled by the notion of bias in statistics. We show that the MLE incurs a suboptimal rate in terms of bias. We then propose a simple modification to the MLE, which "stretches" the bounding box of the maximum-likelihood optimizer by a small constant factor from the underlying ground truth domain. We show that this simple modification leads to an improved rate in bias, while maintaining minimax-optimality in the mean squared error. In this manner, our proposed class of estimators provably improves fairness represented by bias without loss in accuracy.


## 1 Introduction

A number of applications involve data in the form of pairwise comparisons among a collection of items, and entail an evaluation of the individual items from this data. An application gaining increasing popularity is competition between pairs of AI bots (e.g., [26]). Here a number of AI bots compete with each other in pairwise matchups for a certain task, where each bot plays every other bot a certain number of times in a round robin fashion, with the goal of evaluating the quality of each bot. A second example is the evaluation of self-play of AI algorithms in their training phase [34], where again, different copies of an AI bot play against each other a number of times. Applications involving humans include sports and online games such as the English Premier League of football [22, 2] (unofficial ratings) and official world rankings for chess (e.g., FIDE [1] and USCF [14] ratings). The influence of scientific journals has also been analyzed in this manner, where citations from one journal to another are modeled by pairwise comparisons [35].

A common method of evaluating the items based on pairwise comparisons is to assume that the probability of an item beating another equals the logistic function of the difference in the true quality of the two items, and then infer the true quality from the observed outcomes of the comparisons (e.g., the Elo rating system). Various applications employ such an approach to rating from pairwise comparisons, with some modifications tailored to that specific application. Our goal is not to study the application-specific versions, but the foundational underpinnings of such rating systems.

In this paper, we study the pairwise-comparison model that underlies [15, 4] these rating systems, namely the Bradley-Terry-Luce (BTL) model [6, 24]. The BTL model assumes that each item is associated to an unknown real-valued parameter representing the quality of that item, and assumes that the probability of an item beating another is the logistic function applied to the difference of the parameters of these two items. The BTL model is also employed in the applications of peer grading [32, 23] (where the grades of the students are set as the BTL parameters to be estimated), crowdsourcing [7, 27], and understanding consumer choice in marketing [16].

### 1.1 BTL model and maximum likelihood estimation

Now we present a formal definition of the BTL model. Let $d$ denote the number of items. The $d$ items are associated with an unknown parameter vector $\theta^* \in \mathbb{R}^d$ whose $i$-th entry $\theta^*_i$ represents the underlying quality of item $i$. When any item $i$ is compared with any item $j$ in the BTL model, item $i$ beats item $j$ with probability

$$\frac{1}{1 + e^{-(\theta^*_i - \theta^*_j)}}, \tag{1}$$

independent of all other comparisons. The probability of item $j$ beating item $i$ is one minus the expression (1) above. We consider the “league format” [4] of comparisons where every pair of items is compared $k$ times.
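As a minimal sketch of the comparison model just described, the following Python snippet evaluates the win probability (1) and draws league-format outcomes. The function names, the parameter values, and the matrix layout `W[i][j]` (wins of item `i` over item `j`) are illustrative assumptions, not part of the paper.

```python
import math
import random


def btl_win_prob(theta_i, theta_j):
    """Probability that item i beats item j under the BTL model, as in (1)."""
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))


def simulate_league(theta, k, rng):
    """Compare every pair k times; W[i][j] counts the wins of item i over item j."""
    d = len(theta)
    W = [[0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            p = btl_win_prob(theta[i], theta[j])
            wins = sum(1 for _ in range(k) if rng.random() < p)
            W[i][j] = wins
            W[j][i] = k - wins
    return W


# Hypothetical parameter vector (sums to zero) and number of comparisons per pair.
theta_star = [1.0, 0.0, -1.0]
W = simulate_league(theta_star, k=5, rng=random.Random(0))
```

Each pair contributes exactly `k` outcomes, so `W[i][j] + W[j][i]` equals `k` for every pair.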

We follow the usual assumption [17, 31] under the BTL model that the true parameter vector $\theta^*$ lies in the set $\Theta_B$, parameterized by a constant $B > 0$:

$$\Theta_B = \Big\{\theta \in \mathbb{R}^d \;\Big|\; \|\theta\|_\infty \le B \text{ and } \sum_{i=1}^{d} \theta_i = 0\Big\}. \tag{2}$$

The first constraint requires that the magnitude of the parameters is bounded by some constant $B$. We call this constraint the “box constraint”. A box constraint is necessary, because otherwise the estimation error can diverge to infinity [31, Appendix G]. The second constraint requires the parameters to sum to $0$. This is without loss of generality due to the shift-invariance property of the BTL model.

A large amount of both theoretical [19, 17, 38, 25, 31] and applied [35, 33, 7, 27] literature focuses on the goal of estimating the parameter vector of the BTL model. A standard and widely-studied estimator is the maximum-likelihood estimator (MLE):

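$$\hat\theta^{(B)} = \operatorname*{arg\,min}_{\theta \in \Theta_B} \ell(\theta), \tag{3}$$

where $\ell$ is the negative log-likelihood function. Letting $W_{ij}$ denote a random variable representing the number of times that item $i$ beats item $j$, the negative log-likelihood function is given by:

$$\ell(\theta) := \ell(\{W_{ij}\}; \theta) = -\sum_{1 \le i < j \le d} \left[ W_{ij} \log\left(\frac{1}{1 + e^{-(\theta_i - \theta_j)}}\right) + W_{ji} \log\left(\frac{1}{1 + e^{-(\theta_j - \theta_i)}}\right) \right].$$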
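The objective can be evaluated directly from the win counts. The sketch below computes the BTL negative log-likelihood (dropping constants that do not depend on $\theta$) together with its gradient. The helper names and the example counts are hypothetical, chosen only for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def neg_log_likelihood(W, theta):
    """BTL negative log-likelihood for win counts W[i][j] (i beats j)."""
    d = len(theta)
    total = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            diff = theta[i] - theta[j]
            total -= W[i][j] * log_sigmoid(diff) + W[j][i] * log_sigmoid(-diff)
    return total


def gradient(W, theta):
    """Gradient: entry i sums (expected wins - observed wins) of item i over all opponents."""
    d = len(theta)
    grad = [0.0] * d
    for i in range(d):
        for j in range(d):
            if i != j:
                n_ij = W[i][j] + W[j][i]  # comparisons between i and j
                p_ij = 1.0 / (1.0 + math.exp(-(theta[i] - theta[j])))
                grad[i] += n_ij * p_ij - W[i][j]
    return grad


# Hypothetical win counts with 5 comparisons per pair.
W = [[0, 4, 5], [1, 0, 3], [0, 2, 0]]
theta = [0.5, 0.0, -0.5]
value = neg_log_likelihood(W, theta)
grad = gradient(W, theta)
```

The gradient entries sum to zero, which reflects the shift-invariance of the model: adding a constant to every parameter leaves the likelihood unchanged.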

### 1.2 Metrics

##### Accuracy.

A common metric used in the literature on estimating the BTL model is the accuracy of the estimate, measured in terms of the mean squared error. Formally, the accuracy of any estimator is defined as:

$$\alpha(\hat\theta) := \sup_{\theta^* \in \Theta_B} \mathbb{E}\big[\|\hat\theta - \theta^*\|_2^2\big].$$

Importantly, past work [17, 31] has shown that the MLE (3) has the appealing property of being minimax-optimal in terms of the accuracy.

##### Bias.

Another important desideratum for designing and evaluating estimators is fairness. For example, in sports or online games, we do not want to assign scores in such a way that it systematically gives certain players higher scores than their true quality, but at the same time gives certain other players lower scores than their true quality. In this paper, we use the standard definition of bias in statistics as the notion of fairness. For any estimator, the bias incurred by this estimator on a parameter is defined as the difference between the expected value of the estimator and the true value of the parameter. Since our parameters are a vector, we consider the worst-case bias, that is, the maximum magnitude of the bias across all items. Formally, the bias of any estimator is defined as:

$$\beta(\hat\theta) := \sup_{\theta^* \in \Theta_B} \big\|\mathbb{E}[\hat\theta] - \theta^*\big\|_\infty.$$
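Both metrics can be approximated empirically from repeated trials. The sketch below does so at a single fixed $\theta^*$. Note that the definitions above take a supremum over all $\theta^* \in \Theta_B$, so this only evaluates the inner quantities at one point. The function name and the toy numbers are assumptions made for illustration.

```python
def empirical_metrics(estimates, theta_star):
    """Monte Carlo versions of the squared error and bias at a fixed theta_star.

    estimates: list of estimated parameter vectors, one per independent trial.
    Returns (mean squared l2 error, per-item bias, maximum absolute bias).
    """
    n = len(estimates)
    d = len(theta_star)
    mse = sum(
        sum((est[i] - theta_star[i]) ** 2 for i in range(d)) for est in estimates
    ) / n
    mean_estimate = [sum(est[i] for est in estimates) / n for i in range(d)]
    per_item_bias = [mean_estimate[i] - theta_star[i] for i in range(d)]
    max_bias = max(abs(b) for b in per_item_bias)
    return mse, per_item_bias, max_bias


# Toy example with two trials and two items (hypothetical numbers).
mse, per_item_bias, max_bias = empirical_metrics(
    [[1.0, -1.0], [0.5, -0.5]], theta_star=[1.0, -1.0]
)
```

In this toy example the first item is under-estimated on average and the second is over-estimated, which is the sign pattern the bias metric is meant to capture.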

With this background, we now provide an overview of the contributions of this paper.

### 1.3 Contribution I: Performance of MLE

Our first contribution is to analyze the widely-used MLE (3) in terms of its bias. Let us begin with a visual illustration through simulation. Consider $d$ items with parameter values equally spaced in the interval $[-B, B]$, where $k$ pairwise comparisons are observed between each pair of items under the BTL model. We estimate the parameters using the MLE, and plot the bias on each item, computed across repeated iterations of the simulation, in Figure 2 (striped red). The MLE shows a systematic bias: it induces a negative bias (under-estimation) on the large positive parameters, and a positive bias (over-estimation) on the large negative parameters. In the applications of interest, the MLE thus systematically underestimates the abilities of the top players/students/items and overestimates the abilities of those at the bottom.

In this paper, we theoretically quantify the bias incurred by the MLE.

###### Theorem 1.1 (MLE bias lower bound; Informal).

The MLE (3) incurs a bias lower bounded as $\beta(\hat\theta^{(B)}) \gtrsim \frac{1}{\sqrt{dk}}$.

As shown by our results to follow, this bias is suboptimal. Our proof for this result indicates that the bias is incurred because the MLE operates under the accurately specified model with the box constraint at $B$. That is, the MLE “clips” the estimate to lie within the set $\Theta_B$. This issue is visible in the simulation of Figure 2 where the bias is the largest when the true values of the parameters are near the boundaries $\pm B$. For example, consider a true parameter whose value equals $B$. The estimate of this parameter sometimes equals the largest allowed value $B$ (due to the box constraint), and sometimes is smaller than $B$ (due to the randomness of the data). Therefore, in expectation, the estimate of this parameter incurs a negative bias. An analogous argument explains the positive bias when the true parameter equals or is close to $-B$.

### 1.4 Contribution II: Proposed stretched estimator and its theoretical guarantees

Our goal is to design an estimator with a lower bias while maintaining high accuracy. Since the MLE (3) is already widely studied and used, it is also desirable from a practical and computational standpoint that the new estimator is a simple modification of the MLE (3). With this motivation in mind, an intuitive approach is to consider the MLE but without the box constraint “$\|\theta\|_\infty \le B$”. We call the estimator without the box constraint the “unconstrained MLE”, and denote it by $\hat\theta^{(\infty)}$, because removing the box constraint is equivalent to setting the box constraint to $\infty$:

$$\hat\theta^{(\infty)} = \operatorname*{arg\,min}_{\theta \in \Theta_\infty} \ell(\theta), \tag{4}$$

where $\Theta_\infty = \{\theta \in \mathbb{R}^d \mid \sum_{i=1}^{d} \theta_i = 0\}$. The unconstrained MLE incurs an unbounded error in terms of accuracy. This is because with non-zero probability an item beats all others, in which case the unconstrained MLE estimates the parameter of this item as $\infty$, thereby inducing an unbounded mean squared error.
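To make the divergence concrete, the following worked expression gives the probability that a fixed item $i$ wins all $k(d-1)$ of its comparisons. It is a sketch that follows from (1) and the independence of comparisons, not a statement quoted from the paper.

$$\Pr\big[\text{item } i \text{ wins every comparison}\big] = \prod_{j \ne i} \left(\frac{1}{1 + e^{-(\theta^*_i - \theta^*_j)}}\right)^{k} > 0.$$

On this event the likelihood keeps increasing as $\theta_i$ grows, so no finite maximizer exists. Because the event has positive probability for every finite $\theta^*$, the expected squared error of the unconstrained estimator is infinite.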

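Consequently, in this work, we propose the following simple modification to the MLE which is a middle ground between the standard MLE (3) and the unconstrained MLE. Specifically, we consider a “stretched-MLE”, which is associated with a parameter $A$ such that $A > B$. Given the parameter $A$, the stretched-MLE is identical to (3) but “stretches” the box constraint to $A$:

$$\hat\theta^{(A)} = \operatorname*{arg\,min}_{\theta \in \Theta_A} \ell(\theta), \tag{5}$$

where $\Theta_A = \{\theta \in \mathbb{R}^d \mid \|\theta\|_\infty \le A \text{ and } \sum_{i=1}^{d} \theta_i = 0\}$. That is, $\hat\theta^{(A)}$ simply replaces the box constraint $\|\theta\|_\infty \le B$ in (2) by the “stretched” box constraint $\|\theta\|_\infty \le A$.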
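One way to compute such a box-constrained estimate is projected gradient descent. This solver is an assumption made here for illustration; the paper does not prescribe an optimization method. The sketch below uses a hypothetical function `stretched_mle(W, A)`, where `W[i][j]` counts the wins of item `i` over item `j`. Calling it with `A` set to $B$ gives the standard MLE (3).

```python
import math


def project(theta, A, iters=60):
    """Euclidean projection onto {|theta_i| <= A, sum(theta) = 0}.

    The projection has the form clip(theta_i - tau, -A, A); the shift tau is
    found by bisection so that the clipped entries sum to zero.
    """
    lo, hi = min(theta) - A, max(theta) + A
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        s = sum(min(A, max(-A, t - tau)) for t in theta)
        if s > 0:
            lo = tau
        else:
            hi = tau
    tau = 0.5 * (lo + hi)
    return [min(A, max(-A, t - tau)) for t in theta]


def stretched_mle(W, A, steps=2000):
    """Minimize the BTL negative log-likelihood over the box of size A."""
    d = len(W)
    n_max = max(W[i][j] + W[j][i] for i in range(d) for j in range(d) if i != j)
    step = 4.0 / (d * n_max)  # 1 / (upper bound on the largest Hessian eigenvalue)
    theta = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for i in range(d):
            for j in range(d):
                if i != j:
                    p = 1.0 / (1.0 + math.exp(-(theta[i] - theta[j])))
                    grad[i] += (W[i][j] + W[j][i]) * p - W[i][j]
        theta = project([t - step * g for t, g in zip(theta, grad)], A)
    return theta


# Two items, 10 comparisons, item 0 wins 8 of them (hypothetical data).
W2 = [[0, 8], [2, 0]]
interior = stretched_mle(W2, A=1.0)   # box is not active
clipped = stretched_mle(W2, A=0.5)    # estimate is clipped at the boundary
```

With the wider box the first estimate is the unclipped value $\frac{1}{2}\log 4$, while the narrower box clips it to $0.5$. This is the clipping effect described above, in miniature.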

The bias induced by the stretched-MLE (with a fixed $A > B$) in the previous experiment is also shown in Figure 2 (solid blue). Observe that the maximum bias (incurred at the leftmost item with the largest negative parameter, or the rightmost item with the largest positive parameter) is significantly reduced compared to the MLE. Moreover, the bias induced by the stretched-MLE looks qualitatively more evened out across the items.

Our second main theoretical result proves that the stretched-MLE indeed incurs a significantly lower bias.

###### Theorem 1.2 (Stretched-MLE bias upper bound; Informal).

The stretched-MLE (5) with any constant $A > B$ incurs a bias upper bounded as $\beta(\hat\theta^{(A)}) \lesssim \frac{\log d + \log k}{dk}$.

Given the significant bias reduction by our estimator, a natural question is about the accuracy of the stretched-MLE, particularly given the unbounded error incurred by the unconstrained MLE. We prove that our stretched-MLE is able to maintain the same minimax-optimal rate on the mean squared error as the standard MLE.

###### Theorem 1.3 (Stretched-MLE accuracy upper bound; Informal).

The stretched-MLE (5) with any constant $A > B$ incurs a mean squared error upper bounded as $\alpha(\hat\theta^{(A)}) \lesssim \frac{1}{k}$, which is minimax-optimal.

This result shows a win-win by our stretched-MLE: reducing the bias while retaining the accuracy guarantee. The comparison of the MLE and the stretched-MLE in terms of accuracy and bias is summarized in Table 2. Another attractive feature of our result is that the proposed stretched-MLE is a simple modification of the standard MLE, which can easily be incorporated in any existing implementation. It is important to note that while our modification to the estimator is simple to implement, our theoretical analyses and the proofs are non-trivial.

### 1.5 Related work

The logistic nature (1) of the BTL model relates our work to studies of logistic regression (e.g., [28, 18, 37, 11]), among which the paper [37] is the most closely related to ours. The paper [37] considers an unconstrained MLE in logistic regression, and shows its bias in the opposite direction as compared to our results on the standard MLE (constrained) in the BTL model. Specifically, the paper [37] shows that the large positive coefficients are overestimated, and the large negative coefficients are underestimated. There are several additional key differences between the results in [37] as compared to the present paper. The paper [37] studies the asymptotic bias of the unconstrained MLE, showing that the unconstrained MLE is not consistent. On the other hand, we operate in a regime where the MLE is still consistent, and study finite-sample bounds. Moreover, the paper [37] assumes that the predictor variables are i.i.d. Gaussian. On the other hand, in the BTL model the probability that item $i$ beats item $j$ can be written as $\frac{1}{1 + e^{-x^T \theta^*}}$, where each predictor variable $x \in \mathbb{R}^d$ has entry $i$ equal to $1$, entry $j$ equal to $-1$, and the remaining entries equal to $0$.

A common way to achieve bias reduction is to apply a finite-sample correction, such as the Jackknife [29] and other methods [10, 5, 12], to the MLE (or other estimators). These methods operate in a low-dimensional regime (small $d$) where the MLE is asymptotically unbiased. Informally, these methods use a Taylor expansion and write the expression for the bias as an infinite sum $\frac{b_1(\theta^*)}{n} + \frac{b_2(\theta^*)}{n^2} + \cdots$, where $n$ is the number of samples, for some functions $b_1, b_2, \ldots$. These works then modify the estimator in a variety of ways to eliminate the lower-order terms in this bias expression. However, since the expression is an infinite sum, eliminating the first term does not guarantee a low rate of the bias. Moreover, since the functions $b_1, b_2, \ldots$ are implicit functions of $\theta^*$, eliminating lower-order terms does not directly translate to explicit worst-case guarantees.

Returning to the pairwise-comparison setting, in addition to the mean squared error, some past work has also considered accuracy in terms of the $\ell_1$ norm error [3] and the $\ell_\infty$ norm error [9, 8, 20]. The $\ell_\infty$ bound for a regularized MLE is analyzed in [8]. Our proof for bounding the bias of the standard MLE (unregularized) relies on a high-probability $\ell_\infty$ bound for the unconstrained MLE (unregularized). It is important to note that the $\ell_\infty$ bound for the regularized MLE from [8] does not carry over to the unregularized MLE, because the proof from [8] relies on the strong convexity of the regularizer. On the other hand, our intermediate result provides a partial answer to the open question in [8] about the $\ell_\infty$ norm for the unregularized MLE (Lemma A.5 in Appendix A): we establish an $\ell_\infty$ bound for the unregularized MLE in our setting, which has the same rate as that of the regularized MLE in [8].

Another common occurrence of bias is the phenomenon of regression towards the mean [36]. Regression towards the mean refers to the phenomenon that random variables taking large (or small) values in one measurement are likely to take more moderate (closer to average) values in subsequent measurements. On the contrary, we consider items whose indices are fixed (and are not order statistics). For fixed indices, our results suggest that under the BTL model, the bias (under-estimation of large true values) is in the opposite direction as that in regression towards the mean (over-estimation of large observed values).

Finally, the paper [22] models the notion of fairness in Elo ratings in terms of the “variance”, where an estimator is considered fair if the estimator is not much affected by the underlying randomness of the pairwise-comparison outcomes. The paper [22] empirically evaluates this notion of fairness on the English Premier League data, but presents no theoretical results.

## 2 Main results

In this section, we formally provide our main theoretical results on bias and on the mean squared error.

### 2.1 Bias

Recall that $d$ denotes the number of items and $k$ denotes the number of comparisons per pair of items. The true parameter vector is $\theta^* \in \Theta_B$ for some pre-specified constant $B > 0$. The following theorem provides bounds on the bias of the standard MLE $\hat\theta^{(B)}$ and that of our stretched-MLE $\hat\theta^{(A)}$ with parameter $A$. In particular, it shows that if $A$ is a finite constant strictly greater than $B$, then our stretched-MLE has a much smaller bias than the MLE when $d$ and $k$ are sufficiently large.

###### Theorem 2.1.
(a) There exists a constant $c > 0$ that depends only on the constant $B$, such that

$$\beta(\hat\theta^{(B)}) \ge \frac{c}{\sqrt{dk}}, \tag{6a}$$

for all $d \ge d_0$ and all $k \ge k_0$, where $d_0$ and $k_0$ are constants that depend only on the constant $B$.

(b) Let $A$ be any finite constant such that $A > B$. There exists a constant $c > 0$ that depends only on the constants $A$ and $B$, such that

$$\beta(\hat\theta^{(A)}) \le c\,\frac{\log d + \log k}{dk}, \tag{6b}$$

for all $d \ge d_0$ and all $k \ge k_0$, where $d_0$ and $k_0$ are constants that depend only on the constants $A$ and $B$.

We note that in Theorem 2.1(b), we allow $A$ to be any positive constant as long as $A > B$. Therefore, the difference between $A$ and $B$ can be any arbitrarily small constant. It is perhaps surprising that stretching the box constraint only by a small constant yields such a significant improvement in the bias. We provide intuition behind this result in Section 2.1.1.

We devote the remainder of this section to providing a sketch of the proof of Theorem 2.1. We first prove Theorem 2.1(b) and then Theorem 2.1(a), because the proof of Theorem 2.1(a) depends on the proof of Theorem 2.1(b). The complete proof is provided in Appendix A.

For Theorem 2.1(b), we first analyze the unconstrained MLE $\hat\theta^{(\infty)}$. Using the first-order optimality condition of the negative log-likelihood function and concentration of the comparison outcomes, we prove an $\ell_\infty$ bound on $\hat\theta^{(\infty)} - \theta^*$ that holds with sufficiently high probability (which partially resolves the open problem from [8] in the setting considered here). Next, using a second-order mean value theorem on the first-order optimality condition and taking an expectation, we bound the bias of the unconstrained MLE conditioned on a high-probability event (recall from Table 2 that for the unconstrained MLE, the bias without this conditioning is undefined). Finally, we show that the unconstrained MLE and the stretched-MLE are identical with high probability for sufficiently large $d$ and $k$, and perform some algebraic manipulations to arrive at the claim (6b).

For Theorem 2.1(a), we first prove a bound on the order of $\frac{1}{\sqrt{k}}$ when there are $d = 2$ items. Then for general $d$, we consider the bias on a single item under a particular true parameter vector. We construct an “oracle” MLE, such that analyzing the bias of the “oracle” MLE can be reduced to the proof of the $2$-item case, and thereby prove a bias on the order of $\frac{1}{\sqrt{dk}}$ for the oracle MLE. Finally, we show that the difference between the oracle MLE and the standard MLE is small, by repeating arguments from the proof of Theorem 2.1(b).

#### 2.1.1 Intuition for Theorem 2.1

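In this section, we provide intuition for why stretching the box constraint from $B$ to $A$ significantly reduces the bias. Specifically, we consider a simplified setting with $d = 2$ items. Due to the centering constraint, we have $\theta^*_2 = -\theta^*_1$ for the true parameters, and we have $\hat\theta_2 = -\hat\theta_1$ for any estimator that satisfies the centering constraint. Therefore, it suffices to focus only on item $1$. Denote $\mu$ as the random variable representing the fraction of times that item $1$ beats item $2$, and denote the true probability that item $1$ beats item $2$ as $\mu^*$. We consider the true parameter of item $1$ as $\theta^*_1 \in [-B, B]$. Then we have $\mu^* \in [\mu_-, \mu_+]$, where $\mu_- = \frac{1}{1 + e^{2B}}$ and $\mu_+ = \frac{1}{1 + e^{-2B}}$. The standard MLE $\hat\theta^{(B)}$, the stretched-MLE $\hat\theta^{(A)}$ and the unconstrained MLE $\hat\theta^{(\infty)}$ can be solved in closed form:

$$\hat\theta^{(B)}_1(\mu) = \begin{cases} -B & \text{if } \mu \in [0, \mu_-] \\ -\frac{1}{2}\log\left(\frac{1}{\mu} - 1\right) & \text{if } \mu \in (\mu_-, \mu_+) \\ B & \text{if } \mu \in [\mu_+, 1], \end{cases}$$

$$\hat\theta^{(A)}_1(\mu) = \begin{cases} -A & \text{if } \mu \in \left[0, \frac{1}{1 + e^{2A}}\right] \\ -\frac{1}{2}\log\left(\frac{1}{\mu} - 1\right) & \text{if } \mu \in \left(\frac{1}{1 + e^{2A}}, \frac{1}{1 + e^{-2A}}\right) \\ A & \text{if } \mu \in \left[\frac{1}{1 + e^{-2A}}, 1\right], \end{cases}$$

$$\hat\theta^{(\infty)}_1(\mu) = -\frac{1}{2}\log\left(\frac{1}{\mu} - 1\right).$$

See panel (a) of the accompanying figure for a comparison of these three estimators.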

Now we consider the bias incurred by these three estimators. For intuition, let us consider the case $\theta^*_1 = B$, which incurs the largest bias in our simulation of Fig. 2. If the observation were noiseless (and thus $\mu$ equals the true probability $\mu^* = \mu_+$), then all three estimators would output the true parameter $B$. However, the observation is noisy, and $\mu$ only concentrates around $\mu_+$. To investigate how these three estimators behave differently under this noise, we zoom in to the region around $\mu_+$ indicated by the grey box in panel (a). (Note that the observation $\mu$ can lie outside the grey box, but for intuition we ignore this low-probability event due to concentration.)

The behaviors of the three estimators in the grey box are shown in panels (b), (c) and (d), respectively. For each of these estimators, the blue dots on the x-axis denote the noisy observations of $\mu$ across different iterations, and the blue dots on the estimator function denote the corresponding noisy estimates. The expected value of the estimator is a mean over the blue dots on the estimator function. For the standard MLE (panel (b)), the box constraint requires that the estimate never exceed $B$. We call this phenomenon the “clipping” effect, which introduces a negative bias. For the unconstrained MLE (panel (c)), since the estimator function is convex, by Jensen’s inequality, the unconstrained MLE introduces a positive bias. Our proposed stretched-MLE (panel (d)) lies in the middle between the standard MLE and the unconstrained MLE. Therefore, the stretched-MLE balances out the negative bias from the “clipping” effect and the positive bias from the convexity of the estimator function, thereby yielding a smaller bias on the item parameter. In practice, one can numerically tune the parameter $A$ to minimize the bias across all possible parameter vectors $\theta^*$. Simulation results on different values of $A$ are included in Section 3.
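The two-item intuition can be checked numerically without any sampling, because the number of wins of item 1 is binomial and the estimate is a clipped closed-form function of it. The sketch below computes the exact bias at $\theta^*_1 = B$. The values `B = 1`, `k = 100` and the stretch `A = 2B` are illustrative assumptions, not settings taken from the paper.

```python
import math


def clipped_estimate(wins, k, A):
    """Two-item estimate of theta_1 from the win count, clipped to [-A, A]."""
    if wins == 0:
        return -A  # unconstrained value would be -infinity
    if wins == k:
        return A   # unconstrained value would be +infinity
    mu = wins / k
    unconstrained = -0.5 * math.log(1.0 / mu - 1.0)
    return min(A, max(-A, unconstrained))


def exact_bias(theta_1, k, A):
    """Exact bias on item 1: sum over all Binomial(k, mu_star) outcomes."""
    mu_star = 1.0 / (1.0 + math.exp(-2.0 * theta_1))  # theta_2 = -theta_1
    expectation = 0.0
    for wins in range(k + 1):
        prob = math.comb(k, wins) * mu_star**wins * (1.0 - mu_star) ** (k - wins)
        expectation += prob * clipped_estimate(wins, k, A)
    return expectation - theta_1


B, k = 1.0, 100
bias_standard = exact_bias(B, k, A=B)          # box at B: clipping effect
bias_stretched = exact_bias(B, k, A=2.0 * B)   # box stretched to 2B (assumed value)
```

Under these assumed settings the standard estimate can never exceed $B$, so its bias at $\theta^*_1 = B$ is negative, while the stretched estimate trades part of that clipping against the positive contribution from convexity.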

### 2.2 Accuracy

Given the result of Theorem 2.1 on the bias reduction of the estimator $\hat\theta^{(A)}$, we revisit the mean squared error. Past work [17, 31] has shown that the standard MLE $\hat\theta^{(B)}$ is minimax-optimal in terms of the mean squared error. The following theorem shows that this minimax-optimality also holds for our proposed stretched-MLE $\hat\theta^{(A)}$, where $A$ is any constant such that $A > B$. The theorem statement and its proof follow Theorem 2 from [31], after some modification to accommodate the bounding box parameter $A$.

###### Theorem 2.2.
(a) [Theorem 2(a) from [31]] There exists a constant $c > 0$ that depends only on the constant $B$, such that any estimator $\hat\theta$ has a mean squared error lower bounded as

$$\alpha(\hat\theta) \ge \frac{c}{k}, \tag{7a}$$

for all $k \ge k_0$, where $k_0$ is a constant that depends only on the constant $B$.

(b) Let $A$ be any finite constant such that $A > B$. There exists a constant $c > 0$ that depends only on the constants $A$ and $B$, such that

$$\alpha(\hat\theta^{(A)}) \le \frac{c}{k}. \tag{7b}$$

Theorem 2.2 shows that using the estimator $\hat\theta^{(A)}$ retains the minimax-optimality achieved by $\hat\theta^{(B)}$ in terms of the mean squared error. Combining Theorem 2.1 and Theorem 2.2 shows the Pareto improvement of our estimator $\hat\theta^{(A)}$: the estimator decreases the rate of the bias, while still performing optimally on the mean squared error.

The proof of Theorem 2.2 closely mimics the proof of Theorem 2(b) from [31], replacing the steps involving the domain $\Theta_B$ by the stretched domain $\Theta_A$. The details are provided in Appendix B.

## 3 Simulations

In this section, we explore our problem space and compare the standard MLE and our proposed stretched-MLE by simulations. In what follows, we set , and unless specified otherwise we set and . We also evaluate the performance of other values of $A$ subsequently. Error bars in all the plots represent the standard error of the mean.

(i) Dependence on $d$: We vary the number of items $d$, while fixing $k$. The results are shown in Fig. 4. Observe that the stretched-MLE has a significantly smaller bias, and performs on par with the MLE in terms of the mean squared error when $d$ is large. Moreover, the simulations suggest that the bias scales with $d$ at the rates predicted by our theoretical results: on the order of $1/\sqrt{d}$ for the MLE and, up to a logarithmic factor, $1/d$ for the stretched-MLE.

(ii) Dependence on $k$: We vary the number of comparisons $k$ per pair of items, while fixing $d$. The results are shown in Fig. 5. As in simulation (i) with varying $d$, we observe that the stretched-MLE has a significantly smaller bias, and performs on par with the MLE in terms of the mean squared error. Moreover, the simulations suggest that the bias scales with $k$ at the rates predicted by our theoretical results: on the order of $1/\sqrt{k}$ for the MLE and, up to a logarithmic factor, $1/k$ for the stretched-MLE.

(iii) Different values of $A$: In our theoretical analysis, we proved bounds that hold for all constants $A$ such that $A > B$. In this simulation, we empirically compare the performance of the stretched-MLE for different values of $A$ (note that setting $A = B$ is equivalent to the standard MLE). We vary $A$ under several settings of the remaining simulation parameters. The results are shown in Fig. 6. For the bias, we observe that the bias initially keeps decreasing as $A$ increases beyond $B$. This is because increasing $A$ reduces the negative bias introduced by the “clipping” effect. The optimal value of $A$ in all settings considered is always greater than $B$. Moreover, the optimal $A$ seems to move closer to $B$ as more comparisons are observed. This agrees with the intuition in Section 2.1.1. With more comparisons, the estimate becomes more concentrated around the true parameter. Then the “clipping” effect becomes smaller and can be accommodated by a smaller $A$. The mean squared error is insensitive to the choice of $A$ as long as $A \ge B$.

(iv) Different settings of the true parameter $\theta^*$: Our theoretical result considers the worst-case bias and accuracy. In this simulation, we empirically compare the performance of the stretched-MLE under different settings of the true parameter vector $\theta^*$ (again, recall that setting $A = B$ is equivalent to the standard MLE). Specifically, we consider the following values of $\theta^*$:

• Worst case: .

• Worst case (0.5): .

• Bipolar: half of the values are $B$, and the other half are $-B$.

• Linear: the values are equally spaced in the interval $[-B, B]$.

• All zeros: all parameters are $0$.

We fix $d$ and $k$, varying $A$ under different settings of the true parameter vector $\theta^*$. The results are shown in Fig. 7. Two high-level takeaways from the empirical evaluations are that the bias generally reduces with an increase in $A$ till past $\|\theta^*\|_\infty$, and that the mean squared error remains relatively constant beyond $\|\theta^*\|_\infty$ in the plotted range. In more detail, for the bias, we observe that the performance primarily depends on the largest magnitude of the items (that is, $\|\theta^*\|_\infty$). For the settings worst case, bipolar and linear (where $\|\theta^*\|_\infty = B$), the bias keeps decreasing when $A$ is past $B$. For the setting worst case (0.5) (where $\|\theta^*\|_\infty = 0.5B$), the bias keeps decreasing when $A$ is past $0.5B$. This makes sense since in this case we effectively have a true box parameter of $0.5B$ (although the algorithm would not know this in practice). The bias for the setting all zeros stays small across values of $A$. For the mean squared error, the increase when $A$ is past $B$ is relatively small under most of the settings of the true parameter vector $\theta^*$. The bipolar setting has the largest increase in the mean squared error. Under this setting, all parameters take values at the boundaries $\pm B$, and therefore the estimates of all parameters are affected by the box constraint.

(v) Sparse observations: So far we have considered a league format where $k$ comparisons are observed between any pair of items. Now we consider a random-design setup, where $k$ comparisons are observed between any pair of items independently with some fixed probability, and none otherwise [25, 8]. In our simulations, we set and . We discard an iteration if the graph is not connected, since the problem is not identifiable under such a graph. The results are shown in Figure 8. We observe that the stretched-MLE continues to outperform the MLE in terms of bias, and performs on par in terms of the mean squared error.
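The random-design setup with the discard rule for disconnected graphs can be sketched as follows. The function names, the symbol `p_obs` for the observation probability, and the example values are hypothetical; the simulation settings actually used in the paper are not reproduced here.

```python
import random
from collections import deque


def sample_comparison_graph(d, p_obs, rng):
    """Include each unordered pair (i, j) independently with probability p_obs."""
    return [(i, j) for i in range(d) for j in range(i + 1, d) if rng.random() < p_obs]


def is_connected(d, edges):
    """Breadth-first search from item 0 over the comparison graph."""
    adjacency = [[] for _ in range(d)]
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == d


def sample_connected_graph(d, p_obs, rng, max_tries=1000):
    """Redraw until the graph is connected, mirroring the discard rule above."""
    for _ in range(max_tries):
        edges = sample_comparison_graph(d, p_obs, rng)
        if is_connected(d, edges):
            return edges
    raise RuntimeError("no connected graph found; p_obs may be too small")


# Hypothetical settings for illustration only.
edges = sample_connected_graph(d=5, p_obs=0.5, rng=random.Random(0))
```

Each retained pair would then receive $k$ simulated comparisons, and the likelihood sum runs only over the observed pairs.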

## 4 Conclusions and discussions

In this work, we show that the widely-used MLE is suboptimal in terms of bias, and propose a class of estimators called the “stretched-MLE”, which provably reduces the bias while maintaining the minimax-optimality in terms of accuracy. These results on the performance of the MLE and the stretched-MLE are of both theoretical and practical interest. From the theoretical point of view, our analysis and proofs provide insights on the cause of the bias, explain why stretching the box alleviates this cause, and prove theoretical guarantees in bias reduction by stretching the box. Our results on the benefits of the stretched-MLE thus suggest that theoreticians consider the stretched-MLE for analysis instead of the standard MLE.

From the practical point of view, the constant $B$ is often unknown, and practitioners often estimate the value of $B$ by fitting the data or from past experience. Our results thus suggest that one should estimate $B$ leniently, as an estimate smaller than or equal to the true $B$ causes significant bias. Moreover, our proposed estimator is a simple modification to the MLE, which can be incorporated into any existing implementation with ease.

Our results lead to several open problems. First, it is of interest to extend our theoretical analysis to settings where the observations are sparse. For example, one may consider a random-design setup, where $k$ comparisons are observed between any pair independently with some fixed probability and none otherwise [25, 8] (also see the sparse-observations simulation in Section 3). In terms of the bias under this random-design setup, we think that the lower bound for the MLE and the upper bound for our stretched-MLE also scale with $d$ and $k$ as in (6a) and (6b) respectively; we also think that the dependence of the stretched-MLE on the observation probability is no worse than that of the standard MLE. Second, it is of interest to extend our results to other parametric models such as the Thurstone model [39], and we envisage similar results to hold across a variety of such models. Finally, the ideas and techniques developed in this paper may also help in improving the Pareto efficiency on other learning and estimation problems, in terms of the bias-accuracy tradeoff.

## Acknowledgements

The work of JW and NBS was supported in part by NSF grants 1755656 and 1763734. The work of RR was supported in part by NSF grant 1527032.

## References

• [1] FIDE rating regulations effective from 1 July 2017, 2017. https://www.fide.com/fide/handbook.html?id=197&view=article [Online; accessed May 21, 2019].
• [2] Elo ratings - English Premier League, 2019. https://sinceawin.com/data/elo/league/div/e0 [Online; accessed May 21, 2019].
• [3] Arpit Agarwal, Prathamesh Patil, and Shivani Agarwal. Accelerated spectral ranking. In International Conference on Machine Learning, 2018.
• [4] David Aldous. Elo ratings and the sports model: A neglected topic in applied probability? Statistical Science, 32(4):616–629, 2017.
• [5] J. A. Anderson and S. C. Richardson. Logistic discrimination and bias correction in maximum likelihood estimation. Technometrics, 21(1):71–78, 1979.
• [6] Ralph A. Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
• [7] Baiyu Chen, Sergio Escalera, Isabelle Guyon, Víctor Ponce-López, Nihar Shah, and Marc Oliu Simón. Overcoming calibration problems in pattern labeling with pairwise ratings: application to personality traits. In European Conference on Computer Vision, 2016.
• [8] Yuxin Chen, Jianqing Fan, Cong Ma, and Kaizheng Wang. Spectral method and regularized MLE are both optimal for top-K ranking. Ann. Statist., 47(4):2204–2235, 08 2019.
• [9] Yuxin Chen and Changho Suh. Spectral MLE: top-K rank aggregation from pairwise comparisons. In International Conference on Machine Learning, 2015.
• [10] D. R. Cox and E. J. Snell. A general definition of residuals. Journal of the Royal Statistical Society. Series B (Methodological), 30(2):248–275, 1968.
• [11] Yingying Fan, Emre Demirkaya, and Jinchi Lv. Nonuniformity of p-values can occur early in diverging dimensions. Journal of Machine Learning Research, 20(77):1–33, 2019.
• [12] David Firth. Bias reduction of maximum likelihood estimates. Biometrika, 80(1):27–38, 1993.
• [13] E. N. Gilbert. Random graphs. The Annals of Mathematical Statistics, 30(4):1141–1144, 1959.
• [14] Mark E. Glickman and Thomas Doan. The US chess rating system, 2017. http://www.glicko.net/ratings/rating.system.pdf [Online; accessed May 21, 2019].
• [15] Mark E Glickman and Albyn C Jones. Rating the chess rating system. Chance, 12:21–28, 1999.
• [16] Paul E. Green, J. Douglas Carroll, and Wayne S. DeSarbo. Estimating choice probabilities in multiattribute decision making. Journal of Consumer Research, 8(1):76–84, 1981.
• [17] Bruce Hajek, Sewoong Oh, and Jiaming Xu. Minimax-optimal inference from partial rankings. In Advances in Neural Information Processing Systems, 2014.
• [18] Xuming He and Qi-Man Shao. On parameters of increasing dimensions. Journal of Multivariate Analysis, 73(1):120–135, 2000.
• [19] David R. Hunter. MM algorithms for generalized Bradley-Terry models. Ann. Statist., 32(1):384–406, 02 2004.
• [20] Minje Jang, Sunghyun Kim, Changho Suh, and Sewoong Oh. Optimal sample complexity of m-wise data for top-K ranking. In Advances in Neural Information Processing Systems, 2017.
• [21] L. R. Ford Jr. Solution of a ranking problem from binary comparisons. The American Mathematical Monthly, 64(8P2):28–33, 1957.
• [22] Franz J. Király and Zhaozhi Qian. Modelling competitive sports: Bradley-Terry-Élő models for supervised and on-line learning of paired competition outcomes. preprint arXiv:1701.08055, 2017.
• [23] Alec Lamon, Dave Comroe, Peter Fader, Daniel McCarthy, Rob Ditto, and Don Huesman. Making WHOOPPEE: A collaborative approach to creating the modern student peer assessment ecosystem. In EDUCAUSE, 2016.
• [24] R. Duncan Luce. Individual Choice Behavior: A Theoretical analysis. Wiley, New York, NY, USA, 1959.
• [25] Sahand Negahban, Sewoong Oh, and Devavrat Shah. RankCentrality: Ranking from pair-wise comparisons. Operations Research, 65:266–287, 2016.
• [26] S. Ontañón, G. Synnaeve, A. Uriarte, F. Richoux, D. Churchill, and M. Preuss. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games, 5(4):293–311, 2013.
• [27] Víctor Ponce-López, Baiyu Chen, Marc Oliu, Ciprian Corneanu, Albert Clapés, Isabelle Guyon, Xavier Baró, Hugo Jair Escalante, and Sergio Escalera. ChaLearn LAP 2016: First round challenge on first impressions-dataset and results. In European Conference on Computer Vision, 2016.
• [28] Stephen Portnoy. Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Ann. Statist., 16(1):356–366, 03 1988.
• [29] M. H. Quenouille. Approximate tests of correlation in time-series. Journal of the Royal Statistical Society. Series B (Methodological), 11(1):68–84, 1949.
• [30] Walter Rudin. Principles of Mathematical Analysis. McGraw-Hill, 1976.
• [31] Nihar B. Shah, Sivaraman Balakrishnan, Joseph Bradley, Abhay Parekh, Kannan Ramchandran, and Martin J. Wainwright. Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence. Journal of Machine Learning Research, 17(58):1–47, 2016.
• [32] Nihar B. Shah, Joseph K Bradley, Abhay Parekh, Martin Wainwright, and Kannan Ramchandran. A case for ordinal peer-evaluation in MOOCs. In NIPS Workshop on Data Driven Education, 2013.
• [33] P. C. Sham and D. Curtis. An extended transmission/disequilibrium test (TDT) for multi-allele marker loci. Annals of Human Genetics, 59(3):323–336, 1995.
• [34] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, L Robert Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy P. Lillicrap, Fan Fong Celine Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017.
• [35] Stephen M. Stigler. Citation patterns in the journals of statistics and probability. Statistical Science, 9(1):94–108, 1994.
• [36] Stephen M. Stigler. Regression towards the mean, historically considered. Statistical methods in medical research, 6(2):103–14, 02 1997.
• [37] Pragya Sur and Emmanuel J. Candès. A modern maximum-likelihood theory for high-dimensional logistic regression. preprint arXiv:1803.06964, 2018.
• [38] Balázs Szörényi, Róbert Busa-Fekete, Adil Paul, and Eyke Hüllermeier. Online rank elicitation for Plackett-Luce: A dueling bandits approach. In Advances in Neural Information Processing Systems, 2015.
• [39] L. L. Thurstone. A law of comparative judgement. Psychological Review, 34:278–286, 1927.

## Appendix A Proof of Theorem 2.1

In this appendix, we present the proof of Theorem 2.1. We first introduce notation and preliminaries in Appendix A.1, to be used subsequently in proving both parts of Theorem 2.1. The proof of Theorem 2.12 is presented in Appendix A.2. The proof of Theorem 2.11 is presented in Appendix A.3. We first present the proof of Theorem 2.12 followed by Theorem 2.11, because the proof of Theorem 2.11 depends on the proof of Theorem 2.12.

In the proof of Theorem 2.11, the constants are allowed to depend only on the constant $B$. In the proof of Theorem 2.12, the constants are allowed to depend only on the constants $A$ and $B$. The proofs for all the lemmas are presented in Appendix A.4.

### A.1 Notation and preliminaries

In this appendix, we introduce notation and preliminaries that are used subsequently in the proofs of both Theorem 2.12 and Theorem 2.11.


2. Notation

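Recall that $d$ denotes the number of items, and $k$ denotes the number of comparisons per pair of items. The items are associated to a true parameter vector $\theta^* \in \mathbb{R}^d$. We have the set $\Theta_B = \{\theta \in \mathbb{R}^d : \|\theta\|_\infty \le B, \ \sum_{i=1}^d \theta_i = 0\}$ and the set $\Theta_A = \{\theta \in \mathbb{R}^d : \|\theta\|_\infty \le A, \ \sum_{i=1}^d \theta_i = 0\}$, where $A$ and $B$ are finite constants such that $A > B > 0$. The true parameter vector satisfies $\theta^* \in \Theta_B$.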

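Denote $\mu^*_{ij}$ as the probability that item $i$ beats item $j$. Under the BTL model, we have

$$\mu^*_{ij} = \frac{1}{1 + e^{-(\theta^*_i - \theta^*_j)}}. \tag{8}$$

For every $r \in \{1, \ldots, k\}$, denote the outcome of the $r$-th comparison between item $i$ and item $j$ as

$$X^{(r)}_{ij} := \mathbb{1}\{\text{item } i \text{ beats item } j \text{ in their } r\text{-th comparison}\}.$$

We have $X^{(r)}_{ij} \sim \text{Bernoulli}(\mu^*_{ij})$, independent across all $r$ and all $i < j$. Recall that $W_{ij}$ denotes the number of times that item $i$ beats item $j$. We have $W_{ij} = \sum_{r=1}^{k} X^{(r)}_{ij}$ and therefore $W_{ij} \sim \text{Binomial}(k, \mu^*_{ij})$. Denote $\mu_{ij}$ as the fraction of times that item $i$ beats item $j$. That is,

$$\mu_{ij} := \frac{1}{k} W_{ij} = \frac{1}{k} \sum_{r=1}^{k} X^{(r)}_{ij}. \tag{9}$$

We have $k\mu_{ij} \sim \text{Binomial}(k, \mu^*_{ij})$, independent across all $i < j$.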
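The sampling model above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names, the three-item parameter vector, and the choice of 200 comparisons per pair are hypothetical and not taken from the paper.

```python
import math
import random


def btl_win_prob(theta_i: float, theta_j: float) -> float:
    """Probability that item i beats item j under the BTL model, as in (8)."""
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))


def simulate_fractions(theta_star, k, rng):
    """Draw k comparisons per pair and return the win fractions mu[i][j], as in (9)."""
    d = len(theta_star)
    mu = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            p = btl_win_prob(theta_star[i], theta_star[j])
            wins = sum(1 for _ in range(k) if rng.random() < p)  # W_ij
            mu[i][j] = wins / k
            mu[j][i] = 1.0 - mu[i][j]  # W_ji = k - W_ij
    return mu


# Hypothetical centered parameter vector (entries sum to zero).
theta_star = [0.5, 0.0, -0.5]
mu = simulate_fractions(theta_star, k=200, rng=random.Random(0))
```

Each pair is sampled once and the reverse fraction is filled in by symmetry, so `mu[i][j] + mu[j][i]` equals 1 for every pair.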

Finally, we use $c, c', c_1, c_2$, etc. to denote finite constants whose values may change from line to line. We write $f \lesssim g$ if there exists a constant $c$ such that $f \le c\,g$ for all $d$ and $k$ under consideration. The notation $\gtrsim$ is defined analogously.

3. Notion of conditioning

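Let $E$ be any event. The conditional bias of any estimator $\hat\theta$ conditioned on the event $E$ is defined as:

$$\beta(\hat\theta \mid E) := \sup_{\theta^* \in \Theta_B} \left\| \mathbb{E}[\hat\theta \mid E] - \theta^* \right\|_\infty.$$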

We use “w.h.p.()” to denote that an event $E$ happens with probability at least

$$\mathbb{P}(E) > 1 - \frac{c}{dk},$$

for all $d \ge d_0$ and $k \ge k_0$, where $c$, $d_0$ and $k_0$ are positive constants.

Similarly, we use “w.h.p.()” to denote that conditioned on some event $E$, some other event $E'$ happens with probability at least

$$\mathbb{P}(E' \mid E) \ge 1 - \frac{c}{dk},$$

for all $d \ge d_0$ and $k \ge k_0$, where $c$, $d_0$ and $k_0$ are positive constants.

4. The negative log-likelihood function and its derivative

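Recall that $\ell$ denotes the negative log-likelihood function. Under the BTL model, we have

$$\ell(\theta) := \ell(\{W_{ij}\}; \theta) = -\sum_{1 \le i < j \le d} \left[ W_{ij} \log \frac{1}{1 + e^{-(\theta_i - \theta_j)}} + W_{ji} \log \frac{1}{1 + e^{-(\theta_j - \theta_i)}} \right]. \tag{10}$$

Since $\mu_{ij}$ is simply a normalized version of $W_{ij}$, we equivalently denote the negative log-likelihood function as $\ell(\{\mu_{ij}\}; \theta)$.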

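From the expression of $\ell$ in (10), we compute the gradient for every $m \in \{1, \ldots, d\}$ as

$$\frac{\partial \ell}{\partial \theta_m} = k \sum_{i \ne m} \left( \frac{1}{1 + e^{-(\theta_m - \theta_i)}} - \mu_{mi} \right). \tag{11}$$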
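As a sanity check on the gradient expression (11), the sketch below evaluates a negative log-likelihood and its gradient and compares the gradient against central finite differences. It assumes the standard BTL form of the negative log-likelihood written in terms of the fractions $\mu_{ij}$ (with $\mu_{ji} = 1 - \mu_{ij}$); the function names and the numerical values are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def neg_log_likelihood(theta, mu, k):
    """BTL negative log-likelihood in terms of win fractions mu[i][j] (assumed form)."""
    d = len(theta)
    total = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            p = sigmoid(theta[i] - theta[j])
            total -= k * (mu[i][j] * math.log(p) + mu[j][i] * math.log(1.0 - p))
    return total


def gradient(theta, mu, k):
    """Partial derivatives with respect to each theta_m, following (11)."""
    d = len(theta)
    return [
        k * sum(sigmoid(theta[m] - theta[i]) - mu[m][i] for i in range(d) if i != m)
        for m in range(d)
    ]


# Hypothetical data: 3 items, 10 comparisons per pair, mu[i][j] + mu[j][i] = 1.
mu = [[0.0, 0.7, 0.9], [0.3, 0.0, 0.6], [0.1, 0.4, 0.0]]
theta = [0.3, -0.1, -0.2]
k = 10

grad = gradient(theta, mu, k)

# Central finite differences of the negative log-likelihood.
eps = 1e-6
finite_diff = []
for m in range(len(theta)):
    up = list(theta)
    up[m] += eps
    down = list(theta)
    down[m] -= eps
    finite_diff.append(
        (neg_log_likelihood(up, mu, k) - neg_log_likelihood(down, mu, k)) / (2 * eps)
    )
```

The entries of `grad` sum to zero because the likelihood is invariant to shifting all parameters by a common constant, which is why a centering constraint is needed to pin down a unique solution.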

Finally, the following lemma from [19] shows the strict convexity of the negative log-likelihood function $\ell$.

###### Lemma A.1 (Lemma 2(a) from [19]).

The negative log-likelihood function $\ell$ is strictly convex in $\theta \in \Theta_\infty$.

5. The sigmoid function and its derivatives

Denote the function $f: \mathbb{R} \to (0, 1)$ as the sigmoid function $f(x) = \frac{1}{1 + e^{-x}}$. It is straightforward to verify that the function $f$ has the following two properties.

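• The first derivative $f'$ is positive on $\mathbb{R}$. Moreover, on any bounded interval, the first derivative is bounded above and below. That is, for any constants $c_1 < c_2$, there exist constants $c_3, c_4 > 0$ such that

 $$0 < c_3 < f'(x) < c_4 \quad \text{for all } x \in (c_1, c_2).$$
• The second derivative $f''$ is bounded on any bounded interval. That is, for any constants $c_1 < c_2$, there exists a constant $c_5$ such that

 $$|f''(x)| < c_5 \quad \text{for all } x \in (c_1, c_2).$$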
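Both properties can be read off from the closed-form derivatives of the sigmoid. The following short derivation is a supplementary sketch rather than part of the original argument:

$$f'(x) = f(x)\,\bigl(1 - f(x)\bigr), \qquad f''(x) = f'(x)\,\bigl(1 - 2 f(x)\bigr).$$

Since $0 < f(x) < 1$, we get $0 < f'(x) \le \tfrac{1}{4}$ and $|f''(x)| \le f'(x) \le \tfrac{1}{4}$ for all $x$. On an interval $|x| \le b$, the first derivative is also bounded below, because $f'$ is symmetric and decreasing in $|x|$, so $f'(x) \ge f(b)\bigl(1 - f(b)\bigr) > 0$.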
6. Existence and uniqueness of MLE

Recall that the MLE (3), the unconstrained MLE (4), and the stretched-MLE (5) are respectively defined as:

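$$\hat\theta^{(B)}(\{\mu_{ij}\}) = \operatorname*{arg\,min}_{\theta \in \Theta_B} \ell(\{\mu_{ij}\}; \theta), \tag{13}$$

$$\hat\theta^{(\infty)}(\{\mu_{ij}\}) = \operatorname*{arg\,min}_{\theta \in \Theta_\infty} \ell(\{\mu_{ij}\}; \theta), \tag{14}$$

$$\hat\theta^{(A)}(\{\mu_{ij}\}) = \operatorname*{arg\,min}_{\theta \in \Theta_A} \ell(\{\mu_{ij}\}; \theta). \tag{15}$$

The following lemma shows the existence and uniqueness of the stretched-MLE (15) for any constant $A > 0$, which incorporates the standard MLE by setting $A = B$.

###### Lemma A.2.

For any finite constant $A > 0$, there always exists a unique solution to the stretched-MLE (15).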

See Appendix A.4.1 for the proof of Lemma A.2.

For the unconstrained MLE, due to the removal of the box constraint in (14), a finite solution may not exist. However, the following lemma shows that a unique finite solution exists with high probability.

###### Lemma A.3.

There exists a unique finite solution to the unconstrained MLE (14) w.h.p.().

See Appendix A.4.2 for the proof of Lemma A.3.

In the subsequent proofs of Theorem 2.12 and Theorem 2.11, we heavily use the unconstrained MLE as an intermediate quantity to analyze the MLE and the stretched-MLE.

### A.2 Proof of Theorem 2.12

In this appendix, we present the proof of Theorem 2.12. To describe the main steps involved, we first present a proof sketch of a simple case of $d = 2$ items (Appendix A.2.1), followed by the complete proof of the general case (Appendix A.2.2). The reader may pass to the complete proof in Appendix A.2.2 without loss of continuity.

#### A.2.1 Simple case: 2 items

We first present an informal proof sketch for a simple case where there are $d = 2$ items. The proof for the general case in Appendix A.2.2 follows the same outline. In the case of $d = 2$ items, due to the centering constraint on the true parameter vector $\theta^*$, we have $\theta^*_2 = -\theta^*_1$. Similarly, we have $\hat\theta_2 = -\hat\theta_1$ for any estimator $\hat\theta$ that satisfies the centering constraint (in particular, for the stretched-MLE $\hat\theta^{(A)}$ and the unconstrained MLE $\hat\theta^{(\infty)}$). Therefore, it suffices to focus only on item $1$. Since there are only two items, for ease of notation, we denote $\mu := \mu_{12}$ and $\mu^* := \mu^*_{12}$. We now present the main steps of the proof sketch.

Proof sketch of the $2$-item case (informal):

In the proof sketch, we fix any $\theta^* \in \Theta_B$, and any finite constants $A$ and $B$ such that $A > B > 0$.

Step 1: Establish concentration of $\mu$

By Hoeffding’s inequality, we have

$$|\mu - \mu^*| \lesssim \sqrt{\frac{\log k}{k}}, \quad \text{w.h.p.} \tag{16}$$

Since $|\theta^*_1| \le B$, we have that $\mu^*$ is bounded away from $0$ and $1$ by a constant. Hence, for sufficiently large $k$, there exist constants $c_L, c_U$ where $0 < c_L < c_U < 1$, such that

$$\mu, \mu^* \in (c_L, c_U). \tag{17}$$
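
Step 2: Write the first-order optimality condition for $\hat\theta^{(\infty)}$

The unconstrained MLE $\hat\theta^{(\infty)}$ minimizes the negative log-likelihood $\ell$. If a finite unconstrained MLE exists, we have $\frac{\partial \ell}{\partial \theta_1}\big|_{\theta = \hat\theta^{(\infty)}} = 0$. (For the proof sketch, we ignore the high-probability nature of Lemma A.3, and assume that a finite $\hat\theta^{(\infty)}$ always exists. It is made precise in the complete proof in Appendix A.2.2.) Setting $m = 1$ in the gradient expression (11) and plugging in $\hat\theta^{(\infty)}_2 = -\hat\theta^{(\infty)}_1$, we have

$$\frac{\partial \ell}{\partial \theta_1}\bigg|_{\theta = \hat\theta^{(\infty)}} = k \left( \frac{1}{1 + e^{-(\hat\theta^{(\infty)}_1 - \hat\theta^{(\infty)}_2)}} - \mu_{12} \right) = k \left( \frac{1}{1 + e^{-2\hat\theta^{(\infty)}_1}} - \mu \right). \tag{18}$$

Setting the derivative (18) to $0$, we have

$$\hat\theta^{(\infty)}_1 = -\frac{1}{2} \log\left( \frac{1}{\mu} - 1 \right). \tag{19}$$

By the definition of $\mu^*_{ij}$ in (8), we have $\mu^* = \frac{1}{1 + e^{-2\theta^*_1}}$, which can be written as

$$\theta^*_1 = -\frac{1}{2} \log\left( \frac{1}{\mu^*} - 1 \right). \tag{20}$$

Define a function $h: (0, 1) \to \mathbb{R}$ as

$$h(t) = -\frac{1}{2} \log\left( \frac{1}{t} - 1 \right). \tag{21}$$

Subtracting (20) from (19) and using the definition of $h$ from (21), we have

$$\hat\theta^{(\infty)}_1 - \theta^*_1 = h(\mu) - h(\mu^*). \tag{22}$$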
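The closed form (19) is easy to verify numerically. The sketch below, with hypothetical function names and values, checks that $h(\mu)$ zeroes the derivative (18) and that $h$ inverts the map $\theta_1 \mapsto 1/(1 + e^{-2\theta_1})$.

```python
import math


def h(t: float) -> float:
    """The function defined in (21), valid for 0 < t < 1."""
    return -0.5 * math.log(1.0 / t - 1.0)


def derivative_18(theta1: float, mu: float, k: int) -> float:
    """Derivative of the two-item negative log-likelihood in theta_1, as in (18)."""
    return k * (1.0 / (1.0 + math.exp(-2.0 * theta1)) - mu)


# Hypothetical observation: item 1 wins 80% of k = 50 comparisons.
k = 50
mu = 0.8
theta_hat = h(mu)                       # unconstrained MLE of theta_1, as in (19)
residual = derivative_18(theta_hat, mu, k)

# Round trip: start from a parameter, map to a win probability, and invert.
theta_true = 0.7
mu_star = 1.0 / (1.0 + math.exp(-2.0 * theta_true))
recovered = h(mu_star)
```

Note that `h` diverges as its argument approaches 0 or 1, which corresponds to the case where a finite unconstrained MLE does not exist.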
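
Step 3: Bound the difference between $\hat\theta^{(\infty)}_1$ and $\theta^*_1$, by the first-order mean value theorem

It can be verified that $h$ has positive first-order derivative on $(0, 1)$. Moreover, there exists some constant $c_1$ such that $0 < h'(t) \le c_1$ for all $t \in (c_L, c_U)$. Applying the first-order mean value theorem on (22), we have the deterministic relation

$$\hat\theta^{(\infty)}_1 - \theta^*_1 = h'(\lambda) \cdot (\mu - \mu^*), \tag{23}$$

where $\lambda$ is a random variable that depends on $\mu$ and $\mu^*$, and takes values between $\mu$ and $\mu^*$. By (17), we have $\lambda \in (c_L, c_U)$. From (23) we have

$$|\hat\theta^{(\infty)}_1 - \theta^*_1| \le c_1 |\mu - \mu^*|. \tag{24}$$

Combining (24) with (16), we have

$$|\hat\theta^{(\infty)}_1 - \theta^*_1| \lesssim \sqrt{\frac{\log k}{k}}, \quad \text{w.h.p.} \tag{25}$$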
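For concreteness, the derivative used in this step can be written out explicitly. The following is a supplementary computation from the definition (21), and the explicit constant is one admissible choice rather than the one used in the original proof:

$$h(t) = \frac{1}{2}\bigl[\log t - \log(1 - t)\bigr], \qquad h'(t) = \frac{1}{2}\left(\frac{1}{t} + \frac{1}{1 - t}\right) = \frac{1}{2t(1 - t)} > 0 \quad \text{for } t \in (0, 1).$$

Since $t(1 - t)$ is concave, its minimum over $[c_L, c_U]$ is attained at an endpoint, so one may take

$$c_1 = \frac{1}{2 \min\{c_L(1 - c_L),\; c_U(1 - c_U)\}}.$$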
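
Step 4: Bound the expected difference between $\hat\theta^{(\infty)}_1$ and $\theta^*_1$, by the second-order mean value theorem

By the second-order mean value theorem on (22), we have the deterministic relation

$$\hat\theta^{(\infty)}_1 - \theta^*_1 = h(\mu) - h(\mu^*) = h'(\mu^*) \cdot (\mu - \mu^*) + \frac{1}{2} h''(\tilde\lambda) \cdot (\mu - \mu^*)^2, \tag{26}$$

where $\tilde\lambda$ is a random variable that depends on $\mu$ and $\mu^*$, and takes values between $\mu$ and $\mu^*$. By (17), we have $\tilde\lambda \in (c_L, c_U)$.

It can be verified that $h$ has bounded second-order derivative on $(c_L, c_U)$. That is, $|h''(t)| \le c_2$ for all $t \in (c_L, c_U)$. Taking an expectation over (26), we have

$$\mathbb{E}[\hat\theta^{(\infty)}_1] - \theta^*_1 = h'(\mu^*) \cdot (\mathbb{E}[\mu] - \mu^*) + \frac{1}{2} \mathbb{E}\left[ h''(\tilde\lambda) \cdot (\mu - \mu^*)^2 \right] \tag{27}$$

$$\overset{(i)}{\le} c_2\, \mathbb{E}\left[ (\mu - \mu^*)^2 \right] \overset{(ii)}{\lesssim} \frac{\log k}{k}, \tag{28}$$

where (i) is true because $\mathbb{E}[\mu] = \mu^*$ combined with the fact that $|h''(t)| \le c_2$ on $(c_L, c_U)$, and (ii) is true by (16). (For the proof sketch, we ignore the high-probability nature of (16) and treat it as a deterministic relation. It is made precise in the complete proof in Appendix A.2.2.)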
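The second derivative appearing in this step also has a simple closed form. The computation below is a supplementary sketch based on the definition (21) and on $k\mu \sim \text{Binomial}(k, \mu^*)$:

$$h''(t) = \frac{1}{2}\left(\frac{1}{(1 - t)^2} - \frac{1}{t^2}\right) = \frac{2t - 1}{2t^2(1 - t)^2},$$

which is bounded on $(c_L, c_U)$ but not on all of $(0, 1)$. In addition, the second moment of $\mu$ around $\mu^*$ is

$$\mathbb{E}\left[(\mu - \mu^*)^2\right] = \frac{\mu^*(1 - \mu^*)}{k} \le \frac{1}{4k}.$$

The restriction to $(c_L, c_U)$ is the reason the sketch relies on the high-probability bound (16) to control $h''(\tilde\lambda)$.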

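Step 5: Connect back to $\hat\theta^{(A)}$

From (25), we have $|\hat\theta^{(\infty)}_1 - \theta^*_1| \le A - B$ w.h.p. for sufficiently large $k$. Hence,

$$|\hat\theta^{(\infty)}_1| \le |\theta^*_1| + |\hat\theta^{(\infty)}_1 - \theta^*_1| \le B + (A - B) = A, \quad \text{w.h.p.}$$

Moreover, we have $|\hat\theta^{(\infty)}_2| = |\hat\theta^{(\infty)}_1| \le A$. Therefore, with high probability, the unconstrained MLE $\hat\theta^{(\infty)}$ does not violate the box constraint at $A$, and therefore is identical to the stretched-MLE $\hat\theta^{(A)}$. Hence, the bound (28) holds for the stretched-MLE, completing the proof sketch. (For the proof sketch, we ignore the high-probability nature of the fact that $\hat\theta^{(\infty)} = \hat\theta^{(A)}$, and treat it as a deterministic relation. It is made precise in the complete proof in Appendix A.2.2.)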
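The effect described in the proof sketch can be illustrated numerically for two items. The sketch below computes the exact expectation of the box-constrained estimate of $\theta_1$ by summing over the binomial distribution of the win count. It assumes that, for two items, the box-constrained MLE is the unconstrained solution (19) clipped to the box, which holds because the one-dimensional objective is convex. The function names and the values $B = 1$, $A = 2$, $k = 400$ are hypothetical, and the bias is evaluated at a single boundary point $\theta^*_1 = B$ rather than as a supremum over $\Theta_B$.

```python
import math


def h(t: float) -> float:
    """Unconstrained two-item MLE of theta_1 as a function of the win fraction, as in (19)."""
    return -0.5 * math.log(1.0 / t - 1.0)


def clipped_mle(wins: int, k: int, bound: float) -> float:
    """Two-item MLE of theta_1 under the box constraint |theta_1| <= bound."""
    if wins == 0:
        return -bound   # unconstrained solution diverges to -infinity
    if wins == k:
        return bound    # unconstrained solution diverges to +infinity
    return max(-bound, min(bound, h(wins / k)))


def binom_pmf(w: int, k: int, p: float) -> float:
    """Binomial(k, p) probability mass at w, computed in log space for stability."""
    log_pmf = (
        math.lgamma(k + 1) - math.lgamma(w + 1) - math.lgamma(k - w + 1)
        + w * math.log(p) + (k - w) * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def exact_bias(theta1_star: float, k: int, bound: float) -> float:
    """E[estimate] - theta1_star, summing over all possible win counts."""
    p = 1.0 / (1.0 + math.exp(-2.0 * theta1_star))
    mean = sum(binom_pmf(w, k, p) * clipped_mle(w, k, bound) for w in range(k + 1))
    return mean - theta1_star


B, A, k = 1.0, 2.0, 400
bias_standard = exact_bias(B, k, bound=B)    # box at B: estimates above B are cut off
bias_stretched = exact_bias(B, k, bound=A)   # box at A > B: clipping is rarely active
```

With the box at $B$, every estimate above the true value is truncated while estimates below it are not, so the bias at the boundary is negative and driven by the standard deviation of the estimate. With the box at $A$, the remaining bias comes mainly from the curvature of $h$.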

#### A.2.2 Complete proof

In this appendix, we present the proof of Theorem 2.12, by formally extending the steps outlined for the simple case in Appendix A.2.1. In the general case, one notable challenge is that one can no longer write a closed-form solution of the MLE as we did in (19) of Step 2. The first-order optimality condition now becomes a system of equations that describe an implicit relation between $\hat\theta^{(\infty)}$ and $\{\mu_{ij}\}$, requiring more involved analysis.

In the proof, we fix any $\theta^* \in \Theta_B$, and fix any finite constants $A$ and $B$ such that $A > B > 0$.

Step 1: Establish concentration of $\{\mu_{ij}\}$

We first use standard concentration inequalities to establish the following lemma, to be used in the subsequent steps of the proof.

###### Lemma A.4.

There exists a constant $c > 0$, such that

$$\left| \sum_{i \ne m} \mu_{mi} - \sum_{i \ne m} \mu^*_{mi} \right| \le c \sqrt{\frac{d(\log d + \log k)}{k}},$$

simultaneously for all $m \in \{1, \ldots, d\}$ w.h.p.().

See Appendix A.4.3 for the proof of Lemma A.4.
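The rate in Lemma A.4 can be compared against a simulation. The sketch below draws one data set under the BTL model and computes the largest deviation of the row sums $\sum_{i \ne m} \mu_{mi}$ from their expectations, alongside the quantity $\sqrt{d(\log d + \log k)/k}$. The function name, the parameter vector, and the values $d = 10$, $k = 50$ are hypothetical, and the comparison uses the constant $c = 1$ purely for illustration.

```python
import math
import random


def max_row_sum_deviation(theta_star, k, rng):
    """Largest |sum_i mu_mi - sum_i mu*_mi| over items m, for one simulated data set."""
    d = len(theta_star)
    deviation = [0.0] * d
    for i in range(d):
        for j in range(i + 1, d):
            p = 1.0 / (1.0 + math.exp(-(theta_star[i] - theta_star[j])))  # mu*_ij
            mu_ij = sum(1 for _ in range(k) if rng.random() < p) / k
            deviation[i] += mu_ij - p                    # mu_ij - mu*_ij
            deviation[j] += (1.0 - mu_ij) - (1.0 - p)    # mu_ji - mu*_ji
    return max(abs(x) for x in deviation)


d, k = 10, 50
# Hypothetical centered parameters, evenly spaced in [-1, 1].
theta_star = [-1.0 + 2.0 * i / (d - 1) for i in range(d)]

observed = max_row_sum_deviation(theta_star, k, random.Random(0))
rate = math.sqrt(d * (math.log(d) + math.log(k)) / k)
```

Each row sum is an average of $(d - 1)k$ independent bounded variables scaled by $1/k$, so Hoeffding's inequality followed by a union bound over the $d$ items is one natural route to a bound of this form.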

Recall that Lemma A.3 states that a finite unconstrained MLE exists w.h.p.(). We denote $E_0$ as the event that Lemma A.3 and Lemma A.4 both hold. For the rest of the proof, we condition on $E_0$. Since both Lemma A.3 and Lemma A.4 hold w.h.p.(), taking a union bound, we have that $E_0$ holds w.h.p.(). That is,

$$\mathbb{P}(E_0) \ge 1 - \frac{c}{dk}, \quad \text{for some constant } c > 0. \tag{29}$$

Step 2: Write the first-order optimality condition for the unconstrained MLE

Recall from Lemma A.1 that the negative log-likelihood function $\ell$ is convex in $\theta$. In this step, we first justify that whenever a finite unconstrained MLE exists, it sat