AcceleratedBradleyTerry
None
view repo
We show tight convergence rate bounds for gradient descent and MM algorithms for maximum likelihood estimation and maximum aposteriori probability estimation of a popular Bayesian inference method for generalized Bradley-Terry models. This class of models includes the Bradley-Terry model of paired comparisons, the Rao-Kupper model of paired comparisons with ties, the Luce choice model, and the Plackett-Luce ranking model. Our results show that MM algorithms have same convergence rates as gradient descent algorithms up to constant factors. For the maximum likelihood estimation, the convergence is linear with the rate crucially determined by the algebraic connectivity of the matrix of item pair co-occurrences in observed comparison data. For the Bayesian inference, the convergence rate is also linear, with the rate determined by a parameter of the prior distribution in a way that can make convergence arbitrarily slow for small values of this parameter. We propose a simple, first-order acceleration method that resolves the slow convergence issue.
READ FULL TEXT VIEW PDFNone
According to the Bradley-Terry model of paired comparisons [5], comparisons of pairs of items are independent and the outcome of each comparison of items and is one of the items winning with
(1.1) |
where is a positive-valued parameter representing the strength of item .
The Bradley-Terry model dates back to 1929, from the early work of Zermelo [49], and was subsequently popularized by the work of Bradley and Terry [5, 4]. David [10] covers early references. The Bradley-Terry model was studied by many, e.g. [15, 11, 12, 43], and is covered by classic books on categorical data analysis [1]. The original Bradley-Terry model has been extended in various directions, e.g. to allow for tie outcomes (e.g. the Rao-Kupper model [40]), choice (the Luce choice model [31]) and ranking outcomes (the Plackett-Luce ranking model [37]) for comparison sets of two or more items, as well as group comparisons (e.g. [22, 21]
). These models are commonly referred to as generalized Bradley-Terry models. They can be derived from suitably defined latent variable models, where each item is associated with a latent performance random variable. This latent variable representation is in spirit of the well-known Thurstone model of comparative judgment
[45]. Statistical models of paired comparisons have been a subject of intense recent research focused on characterization of the accuracy of parameter estimators and proposing new, scalable parameter estimation methods [18, 48, 19, 39, 9, 42, 47, 27, 35].Statistical models of ranking data play an important role in applications. The Bradley-Terry model of paired comparisons underlies the design of the Elo rating system, used for rating skills of chess players [13]. Extensions to team competitions and tie outcomes were implemented in popular online gaming platforms, e.g. TrueSkill rating system [20]. The generalized Bradley-Terry type of models have been used for estimation of relevance of items in information retrieval applications, e.g. learning to rank [7, 30]
. Statistical models of paired comparisons are used in timely applications such as evaluation of reinforcement learning algorithms
[2].An iterative optimization algorithm for maximum likelihood (ML) estimation of the Bradley-Terry model parameter vector has been known since the original work of Zermelo
[49]. Lange, Hunter and Yang [29] showed that this algorithm belongs to the class of MM optimization algorithms. Here MM refers to either minorize-maximization or majorize-minimization, depending on whether the optimization problem is a maximization or a minimization problem. The book by Lange [28] provides an excellent account on MM optimization algorithms while [25] provides a tutorial. Recent work by Mairal [32] provides results on the convergence of MM algorithms.In a seminal paper, Hunter [24] derived MM algorithms for generalized Bradley-Terry models and sufficient conditions for their convergence to maximum likelihood estimators using the framework of MM optimization algorithms. For the Bradley-Terry model of paired comparisons, a necessary and sufficient condition for the existence of a maximum likelihood estimator is that the directed graph whose vertices correspond to items and edges represent outcomes of paired comparisons is connected. In other words, there exists no partition of items in two disjoint sets such that none of the items in one partition won against an item in other partition.
A Bayesian inference method for generalized Bradley-Terry models was proposed by Caron and Doucet [8]
, showing that classical MM algorithms can be reinterpreted as special instances of Expectation-Maximization (EM) algorithms associated with suitably defined latent variables and proposed some original extensions. This amounts to MM algorithms for maximum aposteriori probability (MAP) parameter estimation, for a specific family of prior distributions. Specifically, the prior distribution is a product-form distribution with Gamma
marginal distributions, with the shape parameter and the rate parameter . Unlike to the ML estimation case, the MAP estimator is guaranteed to exist, for any observed data.The MM algorithms for fitting generalized Bradley-Terry models are implemented in popular CRAN packages BradleyTerry2 [46] and BradleyTerryScalable [26], for ML and MAP estimation, respectively. They are also implemented in Python packages such as Choix [33].
While the conditions for convergence of MM algorithms for generalized Bradley-Terry models are well known, to the best of our knowledge, not much was known about their convergence rates for either ML or MAP estimation. In this paper, we address this by providing tight characterizations of convergence rates. Our results identify key properties of input comparison data that determine convergence rates, and in the case of MAP estimation, how the convergence rates depend on parameters of the prior distribution. Our results reveal that popular MM algorithms used for MAP estimation for generalized Bradley-Terry models can have a slow convergence for certain values of parameters of the prior distribution. We address this by proposing a new, first-order acceleration method. We summarize our main contributions in some more details as follows.
We present tight characterizations of the rate of convergence of gradient descent and MM algorithms for ML and MAP estimation for generalized Bradley-Terry models. Our results show that both gradient descent and MM algorithms have linear convergence with rates of convergence that differ only in constant factors. We provide explicit characterizations of the rate of convergence bounds that provide insights into the key properties of the observed comparison data that determine the rate of convergence.
We show that the rate of convergence critically depends on the properties of matrix defined as the matrix of item pair co-occurrences in the observed comparison data. Specifically, two key properties are: (a) the maximum number of paired comparisons per item (denoted as ) and (b) the algebraic connectivity of matrix (denoted as ). Intuitively, quantifies how well is the graph of paired comparisons connected. Formally,
is the Fiedler value (eigenvalue)
[14], defined as the second smallest eigenvalue of the Laplacian matrix , where is a diagonal matrix whose diagonal elements are the row sums of .Our results imply the following bounds on the convergence time, defined as the number of iterations for an iterative optimization algorithm to reach the value of the underlying objective function that is within a given error tolerance parameter of the optimum value.
For the ML objective, the convergence time satisfies
(1.2) |
This reveals that the rate of convergence critically depends on the connectivity of the graph of paired comparisons in the observed data.
On the other hand, for the MAP estimation, we show that the convergence time satisfies
(1.3) |
where, recall, is the rate parameter of the Gamma prior distribution. This bound is shown to be tight for some input data instances. This shows that the MAP estimation can be arbitrarily slow by taking small enough parameter . The small values of parameter correspond to less informative (more vague) prior distributions.
Our results identify a slow rate of convergence issue for gradient descent and MM algorithms in the case of MAP estimation. While the MAP estimation resolves the non-existence of a maximum likelihood estimator when the graph of paired comparisons is disconnected, it can have much slower convergence than ML when the graph of paired comparisons is connected. Perhaps surprisingly, the rate of convergence has a discontinuity at in the sense that for and , the MM algorithm for the MAP estimation boils down to the classic MM algorithm for ML estimation, and in this case the convergence bound (1.2) holds, while for the MAP estimation, the convergence time grows arbitrarily large as approaches from above.
We propose a simple first-order acceleration method for the MAP estimation whose convergence time satisfies
(1.4) |
This acceleration method resolves the slow convergence issue of standard MM algorithms for the MAP estimation for generalized Bradley-Terry models. Note that this accelerated method does not have a discontinuity at with respect to the rate of convergence: as approaches from above, the convergence time bound corresponds to that of the MM algorithm for ML estimation. The acceleration method applies a transformation of the parameter vector in each iteration of the gradient descent or MM algorithm in a way that ensures moving in directions in which the objective function has certain smoothness and strong convexity properties. The acceleration method is derived by using a theoretical framework that may be of general interest. This framework can be applied to different statistical models of ranking data and prior distributions for Bayesian inference of these models.
We present numerical evaluation results for the convergence rate of different iterative optimization algorithms using input data comparisons from a collection of real-world datasets. These results demonstrate the extent of the slow convergence issue of the existing MM algorithms for MAP estimation and show significant speed ups achieved by our accelerated MM algorithms.
Our theoretical results are established by using the framework of convex optimization analysis and spectral theory of matrices. In particular, the convergence rate bounds are obtained using the concepts such as smoothness, strong convexity, and Polyak-Loyasiewicz condition. We derived accelerated iterative optimization algorithms based on a general approach that may be of independent interest. This approach amounts to transforming the parameter estimator in each iteration such that certain conditions hold for the gradient vector and the Hessian matrix of the objective function. In particular, the transformation ensures orthogonality of the gradient vector to a vector that is an eigenvector of the Hessian matrix, which corresponds to an eigenvalue of small value. For generalized Bradley-Terry models, this transformation is simple to implement.
In Section 2, we introduce various instances of generalized Bradley-Terry models, ML and MAP model fitting objectives, gradient descent and MM algorithms, and some key concepts of convex optimization that we use in the paper. Section 3 contains our main results on the characterization of convergence rates of gradient descent and MM algorithms, where for simplicity of exposition, we focus only on the Bradley-Terry model of paired comparisons. Section 4 presents our accelerated algorithm for MAP estimation. In Section 5 we show how our results extend to other instances of generalized Bradley-Terry model. Section 6 presents numerical results. We discuss our results and conclude in Section 7. All our proofs are provided in Appendix.
In this section we introduce several instances of generalized Bradley-Terry models, define maximum likelihood and maximum aposteriori probability estimation, and define some basic concepts from convex optimization analysis that we use throughout the paper.
According to the Bradley-Terry model, each paired comparison of items and has two possible outcomes: either wins against () or wins against (). The distribution of the outcomes is given by
where are model parameters.
The Rao-Kupper model is such that each paired comparison of items and has three possible outcomes: either or or
(tie). The model is defined by the probability distribution of outcomes that is given by
and
where and are model parameters.
The larger the value of parameter , the more mass is put on the tie outcome. For the value of parameter , the model corresponds to the Bradley-Terry model for paired comparisons.
The Luce choice model is a natural generalization of the Bradley-Terry model of paired comparisons to comparison sets of two or more items. For any given comparison set of two or more items, the outcome is a choice of one item (an event we denote as ) which occurs with probability
where are model parameters.
We will use the following definitions and notation. Let be the set of ordered sequences of two or more items from such that for each , is an arbitrary item and are sorted in lexicographical order. We can interpret each as a choice of item from the set of items . According to the Luce’s choice model, the probability of outcome is given by
We denote with the number of observed outcomes in the input data. For each , let denote the number of items in .
The Plackett-Luce ranking model is a model of full rankings: for each comparison set of items , the set of possible outcomes contains all possible permutations of items in . The distribution over possible outcomes is defined as follows. Let be the set of all possible permutations of subsets of two or more items from . Each corresponds to a permutation of the set of items . The probability of outcome is given by
where are model parameters.
The model has an intuitive explanation as a sampling of items without replacement proportional to the item weights . The Plackett-Luce ranking model corresponds to the Bradley-Terry model of paired comparisons when the comparison sets consist of two items. We denote with the number of observed outcomes in the input data.
Let be the log-likelihood function of the model under consideration. The maximum likelihood estimator is defined as follows
For generalized Bradley-Terry models, we can express the log-likelihood function as
where is the set of all possible outcomes of comparisons, is the number of observations of in the input data and is probability of observing outcome according to the underlying model with parameter vector .
We will consider the negative log-likelihood function , defined as for . Indeed, .
The MAP estimation is defined by following the Bayesian approach under which is assumed to be a random variable with a given prior distribution . Conditional on the observed data , the posterior distribution of is given by the Bayes’ formula:
The log-aposteriori probability function, defined as , can be written as
where is the log-likelihood function and is the prior log-likelihood function .
The maximum aposteriori probability estimator is defined as
Let be a differentiable function. The gradient descent algorithm for minimizing function in a general form is defined by
where is the step size,
is an invertible matrix, and
is the gradient vector of at point . Ifis the identity matrix, we have the standard gradient descent algorithm. On the other hand, if
, where is the Hessian matrix of at , we have the Newton’s algorithm. The standard gradient descent algorithm is a first-order method because it requires only access to the value of for a query point .In this paper, we only consider the standard gradient algorithm with constant step size , we refer to as gradient descent algorithm, i.e.
(2.1) |
The MM algorithm for minimizing function is defined by minimizing a surrogate function that majorizes function . A surrogate function is said to be a majorant function of if and for all and . The MM algorithm is defined by the iteration:
(2.2) |
For maximizing a function , we can analogously define the MM algorithm as minimization of a surrogate function that minorizes function . In particular, we consider a majorization surrogate function for minimization of a convex function, and a minorization surrogate function for maximization of a concave function.
The log-likelihood function of the Bradley-Terry model of paired comparisons is minorized by function
(2.3) |
where is the number of observed paired comparisons such that . Here, we utilize the facts that and the equality holds if and only if to break terms in the log-likelihood functions.
Here we review some basic concepts of convex optimization analysis that are used throughout the paper.
Function is said to be -strongly convex on if it satisfies the following subgradient inequality:
Function is -strongly convex on if, and only if, is convex on . If is twice differentiable, then the eigenvalues of the Hessian of are larger than or equal to , i.e. for all .
Function is said to be -smooth on if its gradient vector is -Lipschitz on , i.e.
If is twice differentiable, then -smoothness is equivalent to the largest eigenvalue of the Hessian of being smaller than or equal to at any point, i.e. , for all .
For any function that is -smooth on , the following property holds (see e.g. Lemma 3.4 [6]):
(2.5) |
Function is said to satisfy the Polyak-Lojasiewicz inequality on if there exists such that
(2.6) |
where is a minimizer of . When the PL inequality holds on , for a specific value of , we say that -PL inequality holds on .
If is -strongly convex on , then satisfies the -PL inequality on .
A sequence is said to converge linearly to as goes to infinity, if there exists such that
where is the optimum value. The value is referred to as the rate of convergence. We define as the rate of convergence bound, if for some ,
Note that any rate of convergence bound is an upper bound on the rate of convergence.
For a convergent sequence to and a tolerance parameter , let be the smallest integer such that . We refer to as the -convergence time, or simply the convergence time. If is a linearly convergent sequence with the rate of convergence bound , then
In this section, we provide the rate of convergence characterizations for gradient descent and MM algorithms for ML and MAP estimation for the Bradley-Terry model of paired comparisons. We first establish general convergence theorems that hold for any strongly convex and smooth function , which characterize the rate of convergence in terms of the strong-convexity and smoothness parameters of , and a smoothness parameter of the surrogate function of the MM algorithm. The rate of convergence bounds are then derived for the Bradley-Terry model by applying these general theorems.
It is a well-known fact that gradient descent algorithm (2.1) with a suitable choice of the step size has linear convergence with the rate of convergence , for any -strictly convex and -smooth function. This result is due to Nesterov [36] and a simple proof can be found in Boyd and Vandenberghe [3], Chapter 9.3. The linear convergence and the rate of convergence bound follow from the following fact, which we note for future reference.
Suppose is a convex function that is -smooth on and that satisfies the -PL inequality on . Let be the minimizer of and be according to the gradient descent algorithm (2.1) with step size . Then, if and , we have
We next show a theorem that establishes linear convergence of MM algorithms for any function that satisfies certain smoothness and strong convexity conditions and a surrogate function that satisfies a smoothness condition with respect to .
Suppose is a convex function that is -smooth on and that satisfies the -PL inequality on , and is a majorant surrogate function of such that for some , for all . Let be the minimizer of and be according to the MM algorithm (2.2). Then, if and , we have
The smoothness condition imposed on the surrogate function in Theorem 3.2 is related to the notion of the first-order surrogate functions introduced by Mairal [32]. A surrogate function is said to be a first-order surrogate function of on if is a surrogate function of on and the error function defined by is -smooth on for some . If is a first-order surrogate function of on with parameter , then , for all (Lemma 2.3 [32]). Thus, requiring that is a first-order surrogate function of on with parameter is a sufficient condition for the condition on in Theorem 3.2.
A set of sufficient conditions for linear convergence of MM algorithms was found by Mairal [32] (Proposition 2.7 therein). These conditions require that is a first-order surrogate function of . The result can be summarized as follows. Suppose that is -strongly convex on and is a first-order surrogate function of on with parameter . Let be according to the MM algorithm. Then, if and , then
where
The rate of convergence bounds derived from Theorem 3.2 can be tighter than those derived from Proposition 2.7 in [32]. We provide a more detailed discussion in Appendix A.3.
From Theorems 3.1 and 3.2, we observe that the MM algorithm has the same rate of convergence bound as the gradient descent algorithm except for the smoothness parameter being enlarged for value . If , for a constant , then the MM algorithm has essentially the same rate of convergence bound as gradient descent algorithm, which differs only for a constant factor.
The following lemma is instrumental for deducing that a function satisfies the -PL inequality from a -strong convexity condition on .
Suppose that is a convex set such that is -strongly convex on . Assume that is such that (C1) and (C2) for all and all for . Then, satisfies the -PL inequality on .
The proof of Lemma 3.1 follows by noting that if is -strongly convex on , then it satisfies the -PL inequality on . Since for every , for some and , by conditions (C1) and (C2) and the definition of the -PL inequality (2.6), it follows that if -PL inequality holds on , it holds as well on .
Conditions (C1) and (C2) are satisfied by negative log-likelihood functions for generalized Bradley-Terry models.
For the Bradley-Terry model, the log-likelihood function can be written as:
(3.1) |
where is the number of observed paired comparisons such that .
The negative log-likelihood function has the following properties. We use to denote the -th smallest eigenvalue of matrix .
The negative log-likelihood function for the Bradley-Terry model is -strongly convex on , where , and -smooth on with
where .
By Lemma 3.2, the smoothness parameter is proportional to the largest eigenvalue of the Laplacian matrix . By Gershgorin circle theorem, e.g. Theorem 7.2.1 in [16], we have . Thus, we can take . We will express all our convergence time results in terms of instead of . This is a tight characterization up to constant factors. When is a graph adjacency matrix, then [17]. In the context of comparisons, has an intuitive interpretation as the maximum number of observed paired comparisons involving an item.
Since the log-likelihood function (3.1) satisfies conditions (C1) and (C2) of Lemma 3.1, combining with Lemma 3.2, we observe that the negative log-likelihood function satisfies the -PL inequality on with . Furthermore, by Lemma 3.2, the negative log-likelihood function is -smooth on with . Combining these facts with Theorem 3.1, we obtain the following corollary:
Assume that is the maximum likelihood estimate in , and that is according to gradient descent algorithm with step size . Then, if , we have
The result in Corollary 3.1 implies linear convergence with the rate of convergence bound . Hence, we have the following convergence time bound:
(3.2) |
We next consider the classic MM algorithm given in (2.4), which can be written as: for ,
(3.3) |
The classic MM algorithm is derived for the surrogate function given by (2.3). This surrogate function has the following smoothness property:
For all , where
Proof of Lemma 3.3 is provided in Appendix A.5. As a side remark, we note that is a first-order surrogate function of on with as defined in Lemma 3.3. We discuss this in more details in Appendix A.6.
Assume that is the maximum likelihood estimate in , and that is according to the MM algorithm. Then, if , we have
where .
From Corollaries 3.1 and 3.2, we observe that both gradient descent and MM algorithms have the rate of convergence bound of the form for some constant . The only difference is the value of constant which in general admits different values. This shows that both gradient descent and MM algorithm have linear convergence, and convergence time bound (3.2).
Following Caron and Doucet [8], we consider the MAP estimation of with the prior distribution of product-form and marginal distributions such that
has a Gamma distribution with the shape parameter
and the inverse scale parameter ,The log-aposteriori probability function can be written as where is the log-likelihood function given by (3.1) and is the log-likelihood of the prior distribution that is given as follows
(3.4) |
For the choice of parameters and , the log-aposteriori probability function corresponds to the log-likelihood function.
The negative log-aposteriori probability function for the Bradley-Terry model and the Gamma prior is -strongly convex and -smooth on with
Suppose that is the maximum aposteriori point in and is according to gradient descent algorithm (2.1) with step size . Then, if , we have
The result in Corollary 3.3 implies linear convergence with the convergence time bound
(3.5) |
This bound can be arbitrarily large by taking parameter to be small enough. We will show later a simple instance for which this bound is tight. Hence, the convergence time for MAP estimation can be arbitrarily slow, and much slower than for the ML case.
We next consider the MM algorithm of Caron and Doucet [8]. This MM algorithm is derived for the minorant surrogate function of function , which is defined as
where is the minorant surrogate function of the log-likelihood function in (2.3) and is the prior log-likelihood function (3.4).
The MM algorithm can be written as: for ,
Indeed, this iterative optimization algorithm corresponds to the classic MM algorithm for ML estimation (3.3) when and .
Since , by Lemma 3.3, we have
For all , where
Suppose that is the maximum aposteriori point in and is according to the MM algorithm. Then, if , we have
From Corollaries 3.2 and 3.4, we observe that both gradient descent and MM algorithm have the rate of convergence bound , and hence both have linear convergence and convergence time bound (3.5).
We show that the bound in Corollary 3.4 is tight for a simple case of two items. Let denote the number of paired comparisons. Note that and . Let and denote the number of paired comparisons won by items and , respectively.
The maximum aposteriori estimate of parameter can be computed in an explicit form given as follows:
The MM algorithm iterates are such that
where . From this, observe that evolves according to the following autonomous nonlinear dynamical system:
(3.6) |
The limit point of as goes to infinity is . Note that is the mode of the prior Gamma distribution.
Now, let us define so that . Note that goes to as goes to infinity. By a tedious but straightforward calculus, we can show that
(3.7) |
From (3.6), note that evolves according to a linear dynamical system, which allows us to derive the solution for in the explicit form given as follows:
From (3.7), for large , and thus
It follows that the rate of convergence of the log-aposteriori probability function is given as follows:
The rate of convergence is approximately for small . By taking value of such that for a constant such that , we have the rate of convergence . This establishes the tightness of the rate of convergence bound in Corollary 3.4.
Consider an instance with items with each distinct pair of items compared times and the input data generated according to the Bradley-Terry model of paired comparisons with parameter such that and . We run the MM algorithm with random initial value until the smallest such that , for . The results in Figure 1 show that the MM algorithm for the MAP estimation can be slower for several orders of magnitude than for the ML estimation.
In this section, we present an acceleration method for gradient descent and MM algorithms for MAP estimation. This acceleration method amounts to performing a transformation of the parameter estimator in each iteration in a way that guarantees certain properties of the objective function along the trajectory of the iterative optimization algorithm.
Let be a given mapping. We define the -transformed gradient descent algorithm by the iteration:
(4.1) |
Similarly, we define the -transformed MM algorithm by the iteration:
(4.2) |
Assume that is a convex function that satisfies the following two conditions on a convex set that contains optimum point , for a given vector :
is -smooth on ;
satisfies the -PL inequality on .
Assume that satisfies the following two conditions:
for all ;
for all .
Condition (P1) means that applying the transformation at a point cannot increase the value of function . Condition (P2) means that at any -transformed point, the gradient of function is orthogonal to vector .
We have the following two theorems.
Assume that satisfies (F1) and (F2), satisfies (P1), and . Let be according to the -transformed gradient descent algorithm (4.1). Then, if and , we have
Assume that satisfies (F1) and (F2), satisfies (P1), and is a majorant surrogate function of such that