Asymptotic comparison of identifying constraints for Bradley-Terry models

05/09/2022
by   Weichen Wu, et al.
Carnegie Mellon University
0

The Bradley-Terry model is widely used for pairwise comparison data analysis. In this paper, we analyze the asymptotic behavior of the maximum likelihood estimator of the Bradley-Terry model in its logistic parameterization, under a general class of linear identifiability constraints. We show that the constraint requiring the Bradley-Terry scores for all compared objects to sum to zero minimizes the sum of the variances of the estimated scores, and recommend using this constraint in practice.

READ FULL TEXT VIEW PDF

page 1

page 2

page 3

page 4

12/27/2021

Modeling Sparse Data Using MLE with Applications to Microbiome Data

Modeling sparse data such as microbiome and transcriptomics (RNA-seq) da...
02/20/2020

A General Pairwise Comparison Model for Extremely Sparse Networks

Statistical inference using pairwise comparison data has been an effecti...
05/13/2020

An Asymptotic Result of Conditional Logistic Regression Estimator

In cluster-specific studies, ordinary logistic regression and conditiona...
01/11/2018

Exact Calculation of Normalized Maximum Likelihood Code Length Using Fourier Analysis

The normalized maximum likelihood code length has been widely used in mo...
04/01/2022

Asymptotic normality of the least sum of squares of trimmed residuals estimator

To enhance the robustness of the classic least sum of squares (LS) of th...
01/27/2019

Asymptotics of maximum likelihood estimation for stable law with continuous parameterization

Asymptotics of maximum likelihood estimation for α-stable law are analyt...
01/27/2019

Asymptotics of maximum likelihood estimation for stable law with (M) parameterization

Asymptotics of maximum likelihood estimation for α-stable law are analyt...

1 Introduction

The Bradley-Terry model is widely used for analyzing pairwise comparison data (Bradley and Terry, 1952). It assumes that all comparisons are independent, and endows every compared object with a score. The probability that one objects beats another in a comparison is a function of the scores of the two objects. The model is known in two forms. Originally, Bradley and Terry (1952) let the score of an object be , and defined the probability that beats in a comparison by

(1)

By taking , model (1) can be re-written as

(2)

in which

denotes the sigmoid function,

. The -parametrization (2) is used in the statistical software package BradleyTerry2 in R (Turner et al., 2012), and empirical studies like (Liner and Amin, 2004) and (Varin et al., 2016). In this report, we focus on the -parameterization, which has so far received less theoretical attention but more practical use.

Probability (1) remains unchanged if all the ’s are multiplied by a constant and, equivalently, in (2), if a constant is added to all the ’s. Therefore, in order to guarantee identifiability, the Bradley-Terry scores are usually estimated by constrained maximum likelihood estimation. The reference constraint and the sum constraint are the two most widely-used constraints for Bradley-Terry models. For the reference constraint, one of the objects is set as a reference, and its score is set to (Simons and Yao, 1999) – or, equivalently, . This constraint is used, for example, in the R packages BradleyTerry2 (Turner et al., 2012) and prefmod (Hatzinger et al., 2012). For the sum constraint, the sum of all scores is set to a constant: in the -parameterization, the sum of all ’s are usually set to , as in theoretical papers (Bradley and Terry, 1952; Ford Jr, 1957; Dykstra, 1960) and the R package eba (Wickelmaier and Wickelmaier, 2007); in the -parameterization, the sum of all ’s is usually set to (e.g., Varin et al., 2016). Clearly, the sum constraint under the -parameterization is not equivalent to the sum constraint under the -parameterization.

The theoretical properties of the Bradley-Terry model have been studied extensively (Bradley, 1954; Dykstra, 1960; Ford Jr, 1957; Simons and Yao, 1999; Han et al., 2020). Most theoretical results are derived under the -parameterization. For example, Ford Jr (1957) gave the necessary and sufficient condition for the existence and uniqueness of the MLE under the sum constraint. Dykstra (1960) proved the consistency and asymptotic normality of the MLE under the sum constraint when the number of comparisons between all pairs of objects goes to infinity. Simons and Yao (1999) proved the consistency and asymptotic normality under the reference constraint when the number of objects goes to infinity and all pairs of objects are compared the same number of times. Han et al. (2020) generalized this result to the case where only a small proportion of pairs of objects are ever compared with each other.

In this paper, we extend these theoretical results to the -parameterization (3) and address a practical issue. Specifically, we show in Section 2 that the constrained MLE under the -parameterization is unique under the condition posited by Ford Jr (1957). We then prove in Section 3 the consistency and asymptotic normality of the estimators under a practical and flexible data collection method. Finally, in Section 4 we discuss the estimated variance of the estimator under other linear constraints, and prove that the sum constraint is the unique constraint that minimizes the sum of the variances of the estimated scores for all objects. Moreover, as we show empirically in Section 5, poor choice of a reference object for the reference constraint can artificially spread uncertainty to the other objects; this behavior is avoided when using the sum constraint, which tends to concentrate uncertainty of estimation on objects that have participated in fewer comparisons. Thus, use of the sum constraint in the parameterization will help applied researchers in making inferences from pairwise comparison data.

2 Existence and uniqueness of the constrained MLE

Suppose there are objects being compared, and let denote the number of times that beats , and the number of times that and are compared with each other, for . The corresponding matrices, and , satisfy . The MLE for the -parameterization under the linear constraint , with

a non-zero vector in

, is

(3)

where is the log-likelihood function

(4)

Note that by taking , we obtain the sum constraint, and by taking , we obtain the reference constraint.

The existence and uniqueness of the constrained MLE requires sufficient knowledge of the comparisons among the objects. The following assumption was proposed by Ford Jr (1957) as a necessary and sufficient condition for the existence and uniqueness of the constrained MLE under the -parameterization.

Assumption 2.1.

In every possible partition of the objects into two nonempty subsets, some object in the second set has been preferred at least once to some object in the first.

This assumption can also be interpreted in graph-theoretical terms. Define the (directed) comparison graph , where the vertex set represents the objects, and the edge set contains all edges such that has been preferred at least once over . Then Assumption 2.1 essentially states that is strongly connected. That is, for every pair of objects , there exists a directed path from to . Ford Jr (1957) proved the following theorem.

Theorem 2.2.

Assumption 2.1 is necessary and sufficient for the existence and uniqueness of the constrained MLE for the Bradley-Terry model under the -parameterization and the sum constraint .

Theorem 2.2 guarantees the existence and uniqueness of the constrained MLE under reasonable conditions. The violation of Assumption 2.1 means either that a group of objects has never been compared with the rest, or that this group has never been preferred over the rest. In both cases, it is hard to evaluate the comparative strength between the two groups, and it would be more reasonable to analyze them separately. A similar conclusion follows for the -parameterization.

Corollary 2.3.

Assumption 2.1 is necessary and sufficient for the existence and uniqueness of the constrained MLE for the Bradley-Terry model under the -parameterization and the constraint for any such that .

The proof of Corollary 2.3 is given in A.

3 Asymptotic properties of the constrained MLE

The asymptotic behavior of the constrained MLE under the -parameterization has been studied under various asymptotic regimes, implied by different data collection designs. The data collection design we will consider is as follows: suppose that objects are compared, with fixed. Also, suppose that there are subjects making the comparisons, and that subject makes random comparisons, where the are independent and identically distributed, with and (the i.i.d. assumption can be weakened, but we will maintain it in this paper for simplicity). Let denote the comparison matrix corresponding to subject , with its elements representing the number of times subject compares the pair of objects .

We here primarily consider the asymptotic properties of the MLE of under the sum constraint , as . We focus on this constraint because, as will be shown in the next section, it has the unique advantage of minimizing the sum of the variances of the estimated scores. We will discuss how the results can be generalized to other linear constraints at the end of this section. We will discuss these properties when the total number of comparisons goes to infinity. We make the following assumption.

Assumption 3.1.

There exists a matrix , such that for all subjects ,

(5)

In matrix , element , represents the overall frequency that the pair is compared in the survey. The equality assumption basically states that this frequency holds for every subject in the survey. This is true when all subjects are asked to compare every pair of objects, or i.i.d. pairs of objects are chosen for comparisons by the subjects.

The following proposition shows that as the number of subjects grows to infinity, the overall comparison matrix reflects the relative frequency with which each pair of objects is compared.

Proposition 3.2.

Let denote the total number of comparisons made by all subjects. Under the aforementioned data collection method, as the number of subjects grows to infinity,

(6)

Proof. According to the data collection design,

are i.i.d. random variables with finite variance. According to Assumption

2.1, for every pair of objects and every subject ,

are also i.i.d. random variables with finite variance. Therefore, by the weak law of large numbers and the continuous mapping theorem, we have that

(7)

We now prove the following theorem, which shows the consistency of the constrained MLE.

Theorem 3.3.

Under the aforementioned data collection method, as long as the constrained MLE (3) exists and is unique, as , it satisfies

(8)
Proof.

Let denote the contingency matrix corresponding to subject , such that represents the number of times that subject chooses object over object . Define as its expectation,

(9)

and let denote the sample mean of the contingency matrix of all subjects. Then, since the variance of is finite, according to the weak law of large numbers,

(10)

Since the log-likelihood function is continuous in both and , is a continuous function of . Hence, according to the continuous mapping theory, we have that

(11)

where the first equality follows from the definition of the estimator. It remains to show that . By definition, the likelihood is given by

(12)

Using the fact that is maximized when , we have that

(13)

which effectively means that , since is unique. ∎

Next, we construct the asymptotic distribution of the MLE under the sum constraint. Let denote the Hessian matrix of the log-likelihood function,

(14)

For simplicity, we will use to represent when there is no ambiguity. The following lemma states two properties of and serve as the basis of our analysis of the asymptotic properties of the constrained MLE. The proof of this lemma is given in B.

Lemma 3.4.

Let denote the pseudoinverse of .111A matrix is called the pseudoinverse of a matrix , denoted as , if the following four conditions hold: 1) , 2) , 3) , and 4) . (Ben-Israel and Greville, 2003, ,p.27). If Assumption 2.1 holds, then for any scores and such that , it holds that . Furthermore, .

The following theorem gives the asymptotic distribution of the MLE of under the sum constraint.

Theorem 3.5.

Let denote the MLE of under the sum constraint . Then as long as exists and is unique, as , and therefore , we have that

(15)
Proof.

By definition, the MLE under the sum constraint is obtained by maximizing the Lagrangian function

(16)

By first-order condition and Lemma 3.4, we find that

(17)

According to Lagrange’s mean-value theorem, there exists a between and such that

(18)

according to (17) and Lemma 3.4. Meanwhile, since is a linear functions of , we have , and therefore . As a consequence,

(19)

Now that we have linked the difference between the true score and the estimator

with the gradient of the log-likelihood, we will use the law of large numbers and the central limit theorem to construct the asymptotic distribution of the MLE. We can write the gradient of the log-likelihood function

as the sum of the gradients of the subject-specific log-likelihoods,

(20)

It is easy to verify that and , where denotes the Fisher information matrix evaluated in . Therefore, it follows from the central limit theorem that

(21)

On the other hand, since and , by the continuous mapping theorem we find that

(22)

Finally, since , by combining the results from (19),(21) and (22), we obtain that

(23)

Theorem 3.5 is important because it implies that, under a flexible data collection method, we can evaluate the uncertainty of as follows. Since

and , we find that the asymptotic variance of the MLE for under the sum constraint is the pseudoinverse of the Fisher information matrix, and

(24)

This is a generalization of the case for identifiable statistical models, where the asymptotic variance of the MLE is the (standard) inverse of the Fisher information matrix. This result is specific to the sum constraint and the -parameterization: we will see that, for other kinds of linear constraints under the -parameterization, the asymptotic variance of the estimator is the reflexive generalized inverse222A matrix is called the reflexive generalized inverse of a matrix , if the following two conditions hold: 1) , and 2) . (Ben-Israel and Greville, 2003, ,p.27). of the Fisher information matrix (see Theorem 4.1).

4 Variance for the Bradley-Terry scores under different linear constraints

In the previous section, we justified the use of the pseudoinverse of as the estimated variance of the Bradley-Terry scores under the sum constraint . In this section, we will discuss the estimated variance under the more general class of linear constraints, and show that under the -parameterization, the sum constraint minimizes the sum of the variances of the estimated scores for all objects.

Suppose that the Bradley-Terry scores are subject to the constraint , in which is neither a multiple of nor perpendicular to . If for some constant , the constraint is equivalent to the sum constraint. If , a constant can still be added to the scores of all subjects, with the constraint remaining satisfied and the log-likelihood function unchanged; therefore, the MLE would still not be unique. Use to denote the MLE under this constraint. We will firstly prove the following theorem, which characterizes the estimated variance under this constraint.

Theorem 4.1.

The estimated variance of , the MLE (for ) under the constraint , where and , is a reflexive generalized inverse of .

Proof.

For any such that , we can construct the following bijection:

(25)

Then it is easy to verify that . Therefore, if is the MLE under the sum constraint , then is the MLE under the constraint . Meanwhile, it is easy to verify from 25 that the difference between scores of every pair of objects does not change under different constraints. Hence, the Hessian matrix (and therefore, the Fisher information matrix) 14 remains the same when we evaluate at instead of . Thus, .

According to the transformation of variance,

(26)

It follows from Lemma 3.4 that and, therefore, . Since the has the same column space and null space as (page 36 of (Ben-Israel and Greville, 2003)), we also have that and . It therefore can be verified that

(27)

and that

Consequently, the estimated variance is a reflexive generalized inverse of . ∎

From equation (26), we see that varies with . Equation (26) also indicates the relationship between and , and implies the following theorem.

Theorem 4.2.

The sum constraint is the one that minimizes the sum of the variances of all scores.

Proof. Consider

(28)

Therefore, according to Lemma 3.4, as long as and ,

Theorem 4.2 strongly suggests that we use the sum constraint in practice. Currently, some statistical packages, like BradleyTerry2 in R, use the reference constraint ( = 0). We illustrate the difference between these two practices by analyzing the data collected in a pairwise comparison survey.

5 Example: pairwise comparison of activities

In March 2020, a group of faculty members at the Department of Statistics and Data Science of Carnegie Mellon University surveyed the department members about their favorite online activities during the COVID-19 pandemic. They used a so-called wiki survey (Salganik and Levy, 2015), in which the respondents (subjects) could vote between pairs of activities (objects) that were presented to them randomly, or enter their own activities for comparison. The survey lasted for about one month, and 52 respondents voted on 21 activities.

We fit the Bradley-Terry model to these data under both the reference constraint and the sum constraint. For the reference constraint, the activity “Virtual Meeting” is set as the reference by default of the BradleyTerry2 package in R

, because it happens to be the first activity (object) in the data set. We then estimate the variances of the scores, and plot the 2-standard error confidence intervals of the estimators under both constraints. The results are show in Figure

1.

The graphs show that under the reference constraint (the plot on the left), since the activity “Virtual Meeting” is set as the reference, its score is without any uncertainty. However, the scores of all other activities have very wide confidence intervals, suggesting that we are very uncertain about their estimated scores. Thus it is difficult to say that one object dominates another in preference order.

On the other hand, under the sum constraint (the plot on the right), the confidence interval for the score of “Virtual Meeting” is quite wide, while those of all other activities are relatively narrow. This shows that we are not certain about the estimate of the score of “Virtual Meeting”, but quite sure of others, among which “Online origami” is the one with the most uncertainty. The reason for this phenomenon is that “Virtual Meeting” happens to be the activity that was involved in the least number of comparisons. The respondents made 750 comparisons in total, but “Virtual Meeting” was only compared for 8 times. The greater uncertainty associated with the small number of comparisons becomes obvious under the sum constraint. However, under the reference constraint, when “Virtual Meeting” is set as the reference activity, this uncertainty is artificially propagated to all of the other activities.

Figure 1: The 2-standard-error confidence intervals for the estimated scores under the reference constraint (left) and the sum constraint (right).

6 Discussion

In this paper, we studied the asymptotic properties of the MLE for the Bradley-Terry model under the -parameterization and the sum constraint. We firstly pointed out the conditions under which the estimator exists and is unique, and then stated that the estimator is consistent and asymptotically normal. Specifically, we showed that the asymptotic variance of the estimator is the pseudoinverse of the Fisher information matrix. Similar results have been proved by Bradley (1954) and Dykstra (1960), who showed that under the -parameterization, the estimated variance of the MLE is the reflexive generalized inverse of the Fisher information matrix; our result, and its proof, however, is of special interest not only because it considers a different kind of parameterization and a more flexible data collection design, but also because it contains some insights about the asymptotic properties of a constrained MLE for an unidentifiable statistical model.

In particular, we pointed out that under the -parameterization, using the sum constraint has the unique advantage of minimizing the sum of the variances of all the estimators. We showed in an empirical example that poor choice of the reference object for the reference constraint can artificially propagate uncertainty to other objects; this behavior is avoided when using the sum constraint. Thus, we strongly recommend using the sum constraint in practice.

Acknowledgements

This work has been supported (in part) by Grant #2003-22461 from the Russell Sage Foundation. Any opinions expressed are those of the authors alone and should not be construed as representing the opinions of the Foundation.

Appendix A Proof of Corollary 2.3

Proof.

Let the MLE for the Bradley-Terry model under the -parameterization and the sum constraint be denoted by

(29)

for the log-likelihood function For any such that and any such that , define

where is an element-wise operation. It is easy to verify that

  1. ;

  2. If , then ;

  3. For every such that , we have , with an element-wise operation.

Consequently, is a bijection from the set to the set that preserves the value of the log-likelihood function. Therefore, the following two conditions are equivalent:

  1. There exists a unique such that and

  2. There exists a unique such that and

Thus the existence and uniqueness of the MLE under the -parameterization (3) is equivalent to the existence and uniqueness of the MLE for -parameterization under the sum constraint (29), for which Theorem 2.2 provides the necessary and sufficient condition (Assumption 2.1). ∎

Appendix B Proof of Lemma 3.4

In this section, we give a proof of Lemma 3.4. This proof is based on the properties of the pseudoinverse matrix and the following lemma.

Lemma B.1.

If Assumption 2.1 holds, then for any , the Hessian is negative semi-definite, and

Proof of Lemma B.1. For any ,

(30)

Furthermore, in order for the equality to be reached, for any pair such that , it must be true that . Under Assumption 2.1, the comparison graph is strongly connected. This effectively means that all entries of are the same. In other words, . So in summary, is negative semi-definite, and if and only if . ∎

Proof of Lemma 3.4 Lemma B.1 shows that for any , the null space of is . Therefore, if , then is in the column space of . The pseudoinverse of a matrix has the same column space as the original matrix (Ben-Israel and Greville, 2003), and thus is in the column space of . In other words, there exists , such that

(31)

Therefore, by the definition of the pseudoinverse matrix,

(32)

Furthermore, since has the same null space as , we have that . ∎

plus 0.3ex

References

  • A. Ben-Israel and T. N. Greville (2003) Generalized inverses: theory and applications. Vol. 15, Springer Verlag, New York. Cited by: Appendix B, §4, footnote 1, footnote 2.
  • R. A. Bradley and M. E. Terry (1952) Rank analysis of incomplete block designs: i. The method of paired comparisons. Biometrika 39 (3/4), pp. 324–345. Cited by: §1, §1.
  • R. A. Bradley (1954) Incomplete block rank analysis: on the appropriateness of the model for a method of paired comparisons. Biometrics 10 (3), pp. 375–390. Cited by: §1, §6.
  • O. Dykstra (1960) Rank analysis of incomplete block designs: a method of paired comparisons employing unequal repetitions on pairs. Biometrics 16 (2), pp. 176–188. Cited by: §1, §1, §6.
  • L. R. Ford Jr (1957) Solution of a ranking problem from binary comparisons. The American Mathematical Monthly 64 (8P2), pp. 28–33. Cited by: §1, §1, §1, §2, §2.
  • R. Han, R. Ye, C. Tan, and K. Chen (2020) Asymptotic theory of sparse Bradley-Terry model. The Annals of Applied Probability 30 (5), pp. 2491–2515. Cited by: §1.
  • R. Hatzinger, R. Dittrich, et al. (2012) Prefmod: an r package for modeling preferences based on paired comparisons, rankings, or ratings. Journal of Statistical Software 48 (10), pp. 1–31. Cited by: §1.
  • G. H. Liner and M. Amin (2004) Methods of ranking economics journals. Atlantic Economic Journal 32 (2), pp. 140–149. Cited by: §1.
  • M. J. Salganik and K. E. C. Levy (2015) Wiki surveys: open and quantifiable social data collection. PloS One 10 (5). Cited by: §5.
  • G. Simons and Y. Yao (1999) Asymptotics when the number of parameters tends to infinity in the Bradley-Terry model for paired comparisons. The Annals of Statistics 27 (3), pp. 1041–1060. Cited by: §1, §1.
  • H. Turner, D. Firth, et al. (2012) Bradley-Terry models in R: the BradleyTerry2 package. Journal of Statistical Software 48 (9). Cited by: §1, §1.
  • C. Varin, M. Cattelan, and D. Firth (2016) Statistical modelling of citation exchange between statistics journals. Journal of the Royal Statistical Society: Series A (Statistics in Society) 179 (1), pp. 1. Cited by: §1, §1.
  • F. Wickelmaier and M. F. Wickelmaier (2007) The eba package. Cited by: §1.