Rank aggregation is an important task in a wide range of learning and social contexts arising in recommendation systems, information retrieval, and sports and competitions. Given $n$ items, we wish to infer relevancy scores or an ordering on the items based on partial orderings provided through many (possibly contradictory) samples. Frequently, the available data is presented to us in the form of a comparison: player $A$ defeats player $B$; book $A$ is purchased when books $A$ and $B$ are displayed (a bigger collection of books implies multiple pairwise comparisons); movie $A$ is liked more than movie $B$. From such partial preferences in the form of comparisons, we frequently wish to deduce not only the order of the underlying objects, but also the scores associated with the objects themselves, so as to deduce the intensity of the resulting preference order.
For example, the Microsoft TrueSkill engine assigns scores to online gamers based on the outcomes of (pairwise) games between players. Indeed, it assumes that each player has inherent “skill” and the outcomes of the games are used to learn these skill parameters which in turn lead to scores associated with each player. In most such settings, similar model-based approaches are employed.
In this paper, we have set out with the following goal: develop an algorithm for the above stated problem which is computationally simple, works with available (comparison) data only, and when data is generated as per a reasonable model, then the algorithm should do as well as the best model aware algorithm. The main result of this paper is an affirmative answer to all these questions.
Related work. Most rating based systems rely on users to provide explicit numeric scores for their interests. While these assumptions have led to a flurry of theoretical research for item recommendations based on matrix completion (cf. Candès and Recht (2009), Keshavan et al. (2010), Negahban and Wainwright (2012)), it is widely believed that numeric scores provided by individual users are generally inconsistent. Furthermore, in a number of learning contexts as illustrated above, it is simply impractical to ask a user to provide explicit scores.
These observations have led to the need to develop methods that can aggregate such forms of ordering information into relevance ratings. In general, however, designing consistent aggregation methods can be challenging due in part to possible contradictions between individual preferences. For example, if we consider items $A$, $B$, and $C$, one user might prefer $A$ to $B$, while another prefers $B$ to $C$, and a third user prefers $C$ to $A$. Such problems have been well studied starting with (and potentially even before) Condorcet (1785). In the celebrated work by Arrow (1963), existence of a rank aggregation algorithm with reasonable sets of properties (or axioms) was shown to be impossible.
In this paper, we are interested in a more restrictive setting: we have outcomes of pairwise comparisons between pairs of items, rather than a complete ordering as considered in (Arrow 1963). Based on those pairwise comparisons, we want to obtain a ranking of items along with a score for each item indicating the intensity of the preference. One reasonable way to think about our setting is to imagine that there is a distribution over orderings or rankings or permutations of items and every time a pair of items is compared, the outcome is generated as per this underlying distribution. With this, our question becomes even harder than the setting considered by Arrow (1963) as, in that work, effectively the entire distribution over permutations was already known!
Such hurdles have not stopped the scientific community or practical designers from building such systems. Chess rating systems and the more recent MSR TrueSkill Ranking system are prime examples. Our work falls precisely into this realm: design algorithms that work well in practice, make sense in general, and perhaps more importantly, have attractive theoretical properties under common comparative judgment models.
With this philosophy in mind, in a recent work, Ammar and Shah (2011) presented an algorithm that tries to achieve the goal with which we have set out. However, their algorithm requires information about comparisons between all pairs, and for each pair it requires the exact pairwise comparison ‘marginal’ with respect to the underlying distribution over permutations. Indeed, in reality, not all pairs of items can typically be compared, and the number of times each pair is compared is also very small. Therefore, while an important step is taken by Ammar and Shah (2011), it stops short of achieving the desired goal. Another such work is Duchi et al. (2010), where a regret/optimization based framework is presented.
In a related work by Braverman and Mossel (2008), the authors present an algorithm that produces an ordering based on pair-wise comparisons on adaptively selected pairs. They assume that there is an underlying true ranking and one observes noisy comparison results. Each time a pair is queried, we are given the true ordering of the pair with probability $1/2 + \gamma$ for some $\gamma > 0$ which does not depend on the items being compared. One limitation of this model is that it does not capture the fact that in many applications, like chess matches, the outcome of a comparison very much depends on the opponents that are competing.
Such considerations have naturally led to the study of noise models induced by parametric distributions over permutations. An important and landmark model in this class is the Bradley-Terry-Luce (BTL) model (Bradley and Terry 1955, Luce 1959), which is also known as the Multinomial Logit (MNL) model (cf. McFadden (1973)). It has been the backbone of many practical system designs including pricing in the airline industry, e.g. see Talluri and VanRyzin (2005). Adler et al. (1994) used such models to design adaptive algorithms that select the winner from a small number of rounds. Interestingly enough, the (near-)optimal performance of their adaptive algorithm for winner selection is matched by our non-adaptive algorithm for assigning scores to obtain global rankings of all players.
Finally, earlier work by Dwork et al. (2001) proposed a number of Markov chain based methods for rank aggregation, which are very similar to the Rank Centrality presented here. However, the precise form of the algorithm proposed is distinctly different and this precise form does matter: the empirical results using synthetic data presented in Section 3.3 make this clear. In summary, our work provides firm theoretical grounding for Markov chain based ranking algorithms.
Our contributions. In this paper, we consider Rank Centrality, an iterative algorithm that takes the noisy comparison answers between a subset of all possible pairs of items as input and produces scores for each item as the output. The proposed algorithm has a nice intuitive explanation. Consider a graph with nodes/vertices corresponding to the items of interest (e.g. players). Construct a random walk on this graph where at each time, the random walk is likely to go from vertex $i$ to vertex $j$ if items $i$ and $j$ were ever compared; and if so, the likelihood of going from $i$ to $j$ depends on how often $i$ lost to $j$. That is, the random walk is more likely to move to a neighbor who has more “wins”. How frequently this walk visits a particular node in the long run, or equivalently the stationary distribution, is the score of the corresponding item. Thus, effectively this algorithm captures preference of the given item versus all of the others, not just immediate neighbors: the global effect induced by transitivity of comparisons is captured through the stationary distribution.
Such an interpretation of the stationary distribution of a Markov chain or a random walk has been an effective measure of relative importance of a node in a wide class of graph problems, popularly known as network centrality (Newman 2010). Notable examples of such network centralities include the random surfer model on the web graph for PageRank (Brin and Page 1998), which computes the relative importance of a web page, and a model of a random crawler in a peer-to-peer file-sharing network to assign a trust value to each peer in EigenTrust (Kamvar et al. 2003).
The computation of the stationary distribution of the Markov chain boils down to ‘power iteration’ using the transition matrix, leading to a simple iterative algorithm. To establish rigorous properties of the algorithm, we analyze its performance under the BTL model described in Section 2.1.
Formally, we establish the following result: given $n$ items, when comparison results between randomly chosen $O(n \,\mathrm{poly}(\log n))$ pairs of them are produced as per an (unknown) underlying BTL model, the stationary distribution (or scores) produced by Rank Centrality matches the true score (induced by the BTL model). It should be noted that $\Omega(n \log n)$ is a necessary number of (random) comparisons for any algorithm to even produce a consistent ranking (due to the connectivity threshold of a random graph). In that sense, up to a $\mathrm{poly}(\log n)$ factor, our algorithm is optimal in terms of sample complexity.
In general, the comparisons may not be available between randomly chosen pairs. Let $G = ([n], E)$ denote the graph of comparisons between these $n$ objects with an edge $(i,j) \in E$ if and only if objects $i$ and $j$ are compared. In this setting, we establish that with $O(\xi^{-2} n \,\mathrm{poly}(\log n))$ comparisons, our algorithm learns the true score of the underlying BTL model. Here, $\xi$ is the spectral gap for the Laplacian of $G$ and this is how the graph structure of comparisons plays a role. Indeed, as a special case when comparisons are chosen at random, the induced graph is Erdös-Rényi, for which $\xi$ turns out to be strictly positive, leading to the (order) optimal performance of the algorithm as stated earlier.
To understand the performance of our algorithm compared to the other options, we perform an empirical experimental study. It shows that the performance of our algorithm is identical to the ML estimation of the BTL model. Furthermore, it handsomely outperforms other popular choices including the algorithm by Ammar and Shah (2011). In summary, Rank Centrality is computationally simple, always produces a solution using available data, and has near optimal performance with respect to a reasonable generative model.
Some remarks about our analytic technique. Our analysis boils down to studying the induced stationary distribution of the random walk or Markov chain corresponding to the algorithm. As in most such scenarios, the only hope to obtain meaningful results for such a ‘random noisy’ Markov chain is to relate it to the stationary distribution of a known Markov chain. Through recent concentration of measure results for random matrices and the comparison technique using Dirichlet forms for characterizing the spectrum of reversible/self-adjoint operators, along with the known expansion property of the random graph, we obtain the eventual result. Indeed, it is such powerful results that lead to near-optimal analytic results for the random comparison model and a characterization of the algorithm’s performance in the general setting.
As an important comparison, we provide analysis of sample complexity required by the maximum likelihood estimator (MLE) using the state-of-art analytic techniques, cf. Negahban and Wainwright (2012). This leads to the conclusion that samples are needed to learn the parameters accurately using the MLE in contrast to using Rank Centrality. The comparable (near optimal) empirical performance of MLE and Rank Centrality seems to suggest that the state-of-art methods for analyzing the statistical performance of optimization based methods have room for improvement.
The remainder of the paper is organized as follows. In Section 2, we describe the model, problem statement and the Rank Centrality algorithm. Section 3 describes the main results: the key theoretical properties of Rank Centrality as well as its empirical performance in the context of two real datasets from NASCAR and One Day International (ODI) cricket. We provide a comparison of Rank Centrality with the maximum likelihood estimator using the existing analytic techniques in the same section. We derive the Cramer-Rao lower bound on the square error for estimating parameters by any algorithm; across a range of parameters, the performance of Rank Centrality and MLE matches the lower bound implied by the Cramer-Rao bound, as explained in Section 3 as well. Finally, Section 4 details proofs of all results. We discuss and conclude in Section 5.
In the remainder of this paper, we use $C$, $C'$, etc. to denote absolute constants, and their value might change from line to line. We use $A^T$ to denote the transpose of a matrix $A$. The Euclidean norm of a vector $x$ is denoted by $\|x\| = \sqrt{\sum_i x_i^2}$, and the operator norm of a linear operator $A$ is denoted by $\|A\|_2$. When we say with high probability, we mean that the probability of a sequence of events $\{\mathcal{A}_n\}_{n \ge 1}$ goes to one as $n$ grows: $\lim_{n \to \infty} \mathbb{P}(\mathcal{A}_n) = 1$. Also define $[n] = \{1, 2, \ldots, n\}$ to be the set of all integers from $1$ to $n$.
2 Model, Problem Statement and Algorithm
In this section, we discuss a model of comparisons between various items. This model will be used to analyze the Rank Centrality algorithm.
Bradley-Terry-Luce model for comparative judgment. When comparing pairs of items from $n$ items of interest, represented as $[n] = \{1, \ldots, n\}$, the Bradley-Terry-Luce model assumes that there is a weight or score $w_i \in \mathbb{R}_+$ associated with each item $i \in [n]$. The outcome of a comparison for a pair of items $i$ and $j$ is determined only by the corresponding weights $w_i$ and $w_j$. Let $Y_{ij}^l$ denote the outcome of the $l$-th comparison of the pair $i$ and $j$, such that $Y_{ij}^l = 1$ if $j$ is preferred over $i$ and $0$ otherwise. Then, according to the BTL model,
$$ Y_{ij}^l = \begin{cases} 1 & \text{with probability } \dfrac{w_j}{w_i + w_j}, \\ 0 & \text{otherwise.} \end{cases} $$
Furthermore, conditioned on the score vector $w = (w_1, \ldots, w_n)$, it is assumed that the random variables $Y_{ij}^l$ are independent of one another for all $i$, $j$, and $l$.
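As a minimal illustrative sketch of this comparison model (not code from the paper), the following simulates $k$ independent comparisons of a pair under the convention that an outcome of 1 means $j$ is preferred over $i$, which happens with probability $w_j/(w_i + w_j)$. The function name `simulate_btl`, the example weights, and the seed are hypothetical choices made only for illustration.

```python
import random

def simulate_btl(w, i, j, k, rng):
    """Fraction of k simulated comparisons in which item j is preferred over item i.

    Assumes the BTL convention P(j preferred over i) = w[j] / (w[i] + w[j]).
    """
    p_j_preferred = w[j] / (w[i] + w[j])
    wins_j = sum(1 for _ in range(k) if rng.random() < p_j_preferred)
    return wins_j / k

# Hypothetical example: item 1 is three times as strong as item 0.
rng = random.Random(0)
w = [1.0, 3.0]
a_01 = simulate_btl(w, 0, 1, 20000, rng)  # should be close to 3 / (1 + 3) = 0.75
```

Because only the ratio of weights enters the win probability, rescaling `w` by any positive constant leaves the simulated outcomes statistically unchanged.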
Since the BTL model is invariant under the scaling of the scores, an $n$-dimensional representation of the scores is not unique. Indeed, under the BTL model, a score vector $w \in \mathbb{R}_+^n$ is the equivalence class $[w] = \{ w' \in \mathbb{R}_+^n : w' = a w \text{ for some } a > 0 \}$. The outcome of a comparison only depends on the equivalence class of the score vector.
To get a unique representation, we represent each equivalence class by its projection onto the standard orthogonal simplex such that $\sum_i w_i = 1$. This representation naturally defines a distance between two equivalence classes as the Euclidean distance between two projections:
$$ d(w, w') \equiv \left\| \frac{w}{\langle w, \mathbf{1} \rangle} - \frac{w'}{\langle w', \mathbf{1} \rangle} \right\| . $$
Our main result provides an upper bound on the (normalized) distance between the estimated score vector and the true underlying score vector.
Bradley-Terry-Luce is equivalent to Multinomial Logit (MNL). We take a brief detour to remind the reader that the BTL model is identical to the MNL model in the sense that the pair-wise distributions between objects induced under BTL are identical to those under MNL. Consider an equivalent way to describe an MNL model. Each object $i$ has an associated score $w_i > 0$. A random ordering over all $n$ objects is drawn as follows: iteratively fill the ordered positions $1, \ldots, n$ by choosing object $i$ for position $\ell$, amongst the remaining objects (not chosen in the first $\ell - 1$ positions), with probability proportional to its weight $w_i$. It can be easily verified that in the random ordering of $n$ objects generated as per this process, $i$ is ranked higher than $j$ with probability $w_i/(w_i + w_j)$.
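The sequential sampling procedure described above can be sketched as follows. This is an illustrative simulation under the stated assumptions, not the authors' code; `sample_mnl_ordering`, the weights, the seed, and the trial count are hypothetical. It draws full orderings and empirically checks that item 0 precedes item 1 with frequency close to $w_0/(w_0 + w_1)$.

```python
import random

def sample_mnl_ordering(w, rng):
    """Draw one ordering: fill positions one by one, picking among the
    remaining items with probability proportional to their weights."""
    remaining = list(range(len(w)))
    ordering = []
    while remaining:
        weights = [w[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ordering.append(pick)
        remaining.remove(pick)
    return ordering

# Hypothetical weights; the pairwise marginal for items 0 and 1 should be 1/3.
rng = random.Random(1)
w = [1.0, 2.0, 4.0]
trials = 20000
count = 0
for _ in range(trials):
    order = sample_mnl_ordering(w, rng)
    if order.index(0) < order.index(1):
        count += 1
freq_0_before_1 = count / trials
```

Note that the presence of the third item does not change the pairwise marginal between items 0 and 1, which is the sense in which the two models agree.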
Sampling model. We also assume that we perform a fixed number $k$ of comparisons for all pairs $i$ and $j$ that are considered (e.g. a best of $k$ series). This assumption is mainly to simplify notation; the analysis as well as the algorithm easily generalize to the case when we might have a different number of comparisons for different pairs. Given observations of pairwise comparisons among $n$ items according to this sampling model, we define a comparisons graph $G = ([n], E, A)$ as a graph of $n$ items where two items are connected if we have comparisons data on that pair, and $A$ denotes the weights on each of the edges in $E$.
2.2 Rank Centrality
In our setting, we will assume that $a_{ij}$ represents the fraction of times object $j$ has been preferred to object $i$, for example the fraction of times chess player $j$ has defeated player $i$. Given the notation above, we have that $a_{ij} = (1/k) \sum_{l=1}^{k} Y_{ij}^l$. Consider a random walk on a weighted directed graph $G = ([n], E, A)$, where a pair $(i,j) \in E$ if and only if the pair has been compared. The edge weights are defined based on the outcome of the comparisons: $A_{ij} = a_{ij}/(a_{ij} + a_{ji})$ and $A_{ji} = a_{ji}/(a_{ij} + a_{ji})$ (note that $a_{ij} + a_{ji} = 1$ in our setting). We let $A_{ij} = 0$ if the pair has not been compared. Note that by the Strong Law of Large Numbers, as the number $k \to \infty$ the quantity $A_{ij}$ converges to $w_j/(w_i + w_j)$ almost surely.
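A random walk can be represented by a time-independent transition matrix $P$, where $P_{ij} = \mathbb{P}(X_{t+1} = j \mid X_t = i)$. By definition, the entries of a transition matrix are non-negative and satisfy $\sum_j P_{ij} = 1$. One way to define a valid transition matrix of a random walk on $G$ is to scale all the edge weights by $1/d_{\max}$, where we define $d_{\max}$ as the maximum out-degree of a node. This rescaling ensures that each row-sum is at most one. Finally, to ensure that each row-sum is exactly one, we add a self-loop to each node. Concretely,
$$ P_{ij} = \begin{cases} \dfrac{1}{d_{\max}} A_{ij} & \text{if } i \neq j, \\ 1 - \dfrac{1}{d_{\max}} \sum_{k \neq i} A_{ik} & \text{if } i = j. \end{cases} \qquad (1) $$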
The choice to construct our random walk as above is not arbitrary. In an ideal setting with infinite samples ($k \to \infty$) per comparison, the transition matrix $P$ would define a reversible Markov chain under the BTL model. Recall that a Markov chain is reversible if it satisfies the detailed balance equation: there exists $v \in \mathbb{R}_+^n$ such that $v(i) P_{ij} = v(j) P_{ji}$ for all $i, j$; and in that case, $\pi \in \mathbb{R}_+^n$ defined as $\pi(i) = v(i)/\sum_j v(j)$ is its unique stationary distribution. In the ideal setting (say $\tilde{P}$), we will have $\tilde{P}_{ij} = (1/d_{\max})\, w_j/(w_i + w_j)$. That is, the random walk will move from state $i$ to state $j$ with probability proportional to the chance that item $j$ is preferred to item $i$. In such a setting, it is clear that $v = w$ satisfies the reversibility conditions. Therefore, under these ideal conditions it immediately follows that the vector $w/\sum_i w_i$ acts as a valid stationary distribution for the Markov chain defined by $\tilde{P}$, the ideal matrix. Hence, as long as the graph $G$ is connected and at least one node has a self loop, we are guaranteed that our graph has a unique stationary distribution proportional to $w$. If the Markov chain is reversible then we may apply the spectral analysis of self-adjoint operators, which is crucial in the analysis of the behavior of the method.
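The reversibility claim can be checked numerically on a small example. The sketch below assumes the ideal transition probabilities take the form $\tilde{P}_{ij} = (1/d_{\max})\, w_j/(w_i + w_j)$ on compared pairs, with a self-loop absorbing the remaining mass; the function name `ideal_transition_matrix`, the weights, and the edge list are hypothetical.

```python
def ideal_transition_matrix(w, edges):
    """Build the infinite-sample transition matrix on an undirected comparison graph.

    Assumes off-diagonal entries w[j] / (w[i] + w[j]) / d_max on compared pairs
    and a self-loop on each node so that every row sums to one.
    """
    n = len(w)
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    d_max = max(degree)
    P = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        P[i][j] = w[j] / (w[i] + w[j]) / d_max
        P[j][i] = w[i] / (w[i] + w[j]) / d_max
    for i in range(n):
        P[i][i] = 1.0 - sum(P[i])  # diagonal is still 0 here, so this is the off-diagonal sum
    return P

# Hypothetical 4-item cycle graph.
w = [1.0, 2.0, 3.0, 4.0]
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
P = ideal_transition_matrix(w, edges)

# Detailed balance with v = w: w_i * P_ij should equal w_j * P_ji for all i, j.
max_violation = max(
    abs(w[i] * P[i][j] - w[j] * P[j][i]) for i in range(4) for j in range(4)
)
row_sums = [sum(row) for row in P]
```

Both sides of the detailed balance equation reduce to $w_i w_j / (d_{\max}(w_i + w_j))$, which is symmetric in $i$ and $j$, so the violation is zero up to floating-point error.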
In our setting, the matrix $P$ is a noisy version (due to finite sample error) of the ideal matrix $\tilde{P}$ discussed above. Therefore, it naturally suggests the following algorithm as a surrogate. We estimate the probability distribution obtained by applying the matrix $P$ repeatedly starting from any initial condition. Precisely, let $p_t(i) = \mathbb{P}(X_t = i)$ denote the distribution of the random walk at time $t$, with $p_0 = (p_0(i)) \in \mathbb{R}_+^n$ an arbitrary starting distribution on $[n]$. Then,
$$ p_{t+1}^T = p_t^T P . \qquad (2) $$
In general, the random walk converges to a stationary distribution $\pi = \lim_{t \to \infty} p_t$ which may depend on $p_0$. When the transition matrix has a unique largest eigenvector (unique stationary distribution), starting from any initial distribution, the limiting distribution $\pi$ is unique. This stationary distribution $\pi$ is the top left eigenvector of $P$, which makes computing $\pi$ a simple eigenvector computation. Formally, we state the algorithm, which assigns numerical scores to each node, which we shall call Rank Centrality:
|1:||Compute the transition matrix $P$ according to (1);|
|2:||Compute the stationary distribution $\pi$ (as the limit of (2)).|
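The two steps above can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions rather than the authors' reference code: `A[i][j]` is taken to be the fraction of times $j$ was preferred over $i$ (zero if the pair was never compared), a pair counts as compared when `A[i][j] + A[j][i] > 0`, and the stationary distribution is approximated by a fixed number of power-iteration steps. The name `rank_centrality` and the parameter `num_iters` are hypothetical.

```python
def rank_centrality(A, num_iters=1000):
    """Approximate Rank Centrality scores from a matrix of pairwise win fractions."""
    n = len(A)

    def compared(i, j):
        return i != j and (A[i][j] + A[j][i]) > 0

    # Step 1: transition matrix, scaling edge weights by 1 / d_max and adding self-loops.
    d_max = max(sum(compared(i, j) for j in range(n)) for i in range(n))
    P = [[A[i][j] / d_max if compared(i, j) else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        P[i][i] = 1.0 - sum(P[i])

    # Step 2: power iteration p_{t+1}^T = p_t^T P from the uniform distribution.
    p = [1.0 / n] * n
    for _ in range(num_iters):
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return p

# Hypothetical noiseless input on a complete graph: A_ij = w_j / (w_i + w_j).
w = [1.0, 2.0, 3.0]
A = [[0.0 if i == j else w[j] / (w[i] + w[j]) for j in range(3)] for i in range(3)]
scores = rank_centrality(A)  # should approach w / sum(w) = [1/6, 1/3, 1/2]
```

With noiseless inputs the scores recover the normalized weights; with finite samples they are only an estimate. A fixed iteration count is used here for simplicity, and a convergence check on successive iterates would be a natural refinement.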
The stationary distribution of the random walk is a fixed point of the following equation:
$$ \pi(i) = \sum_{j} \pi(j) \frac{A_{ji}}{\sum_{\ell} A_{i\ell}} . $$
This suggests an alternative intuitive justification: an object receives a high rank if it has been preferred to other high ranking objects or if it has been preferred to many objects.
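For completeness, here is a short derivation of this fixed-point form, assuming the transition matrix is built by scaling the edge weights by $1/d_{\max}$ and placing the leftover mass on a self-loop, as described in Section 2.2. Starting from stationarity, $\pi^T P = \pi^T$:

```latex
\begin{align*}
\pi(i) &= \sum_{j} \pi(j) P_{ji}
        = \frac{1}{d_{\max}} \sum_{j \neq i} \pi(j) A_{ji}
          + \pi(i) \Bigl( 1 - \frac{1}{d_{\max}} \sum_{\ell \neq i} A_{i\ell} \Bigr) \\
\Longrightarrow \quad
\pi(i) \sum_{\ell \neq i} A_{i\ell} &= \sum_{j \neq i} \pi(j) A_{ji} \\
\Longrightarrow \quad
\pi(i) &= \frac{\sum_{j \neq i} \pi(j) A_{ji}}{\sum_{\ell \neq i} A_{i\ell}} .
\end{align*}
```

The factor $1/d_{\max}$ cancels, so the stationary distribution does not depend on the particular scaling used to make the rows sum to one.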
One key question remains: does $P$ have a well defined stationary distribution? Since the Markov chain has a finite state space, there is always a stationary distribution or solution of the above stated fixed-point equations. However, it may not be unique if the Markov chain is not irreducible. The irreducibility follows easily when the graph is connected and for all edges $(i,j) \in E$, $a_{ij} > 0$ and $a_{ji} > 0$. Interestingly enough, we show that the iterative algorithm produces a meaningful solution with near optimal sample complexity as stated in Theorem 3.2 when the pairs of objects that are compared are chosen at random.
3 Main Results
The main result of this paper derives sufficient conditions under which the proposed iterative algorithm finds a solution that is close to the true solution (under the BTL model) for a general model of comparison (i.e. any graph $G$). This result is stated as Theorem 3.1 below. In words, the result implies that to learn the true score correctly as per our algorithm, it is sufficient to have a number of comparisons scaling as $O(\xi^{-2} n \,\mathrm{poly}(\log n))$, where $\xi$ is the spectral gap of the Laplacian of the graph $G$. This result explicitly identifies the role played by the graph structure in the ability of the algorithm to learn the true scores.
In the special case when the pairs of objects to be compared are chosen at random, that is, the induced $G$ is an Erdös-Rényi random graph, $\xi$ turns out to be constant and hence the resulting number of comparisons required scales as $O(n \,\mathrm{poly}(\log n))$. This is effectively the optimal sample complexity.
The bounds are presented as the rescaled Euclidean norm between our estimate $\pi$ and the underlying stationary distribution $\tilde{\pi}$ of the ideal chain. This error metric provides us with a means to quantify the relative certainty in guessing if one item is preferred over another.
After presenting our main theoretical result, we describe illustrative simulation results. We also present applications of the algorithm in the context of two real data-sets: results of NASCAR races for ranking drivers, and results of One Day International (ODI) cricket for ranking teams. We shall discuss the relation between Rank Centrality, the maximum likelihood estimator and the Cramer-Rao bound to conclude that our algorithm is effectively as good as the best possible.
3.1 Rank Centrality: Error bound for general graphs
Recall that in the general setting, each pair of objects or items is chosen for comparison as per the comparisons graph $G = ([n], E)$. For each such pair, we have $k$ comparisons available. The result below characterizes the performance of Rank Centrality for such a general setting.
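Before we state the result, we present some necessary notation. Let $d_i$ denote the degree of node $i$ in $G$; let the max-degree be denoted by $d_{\max} \equiv \max_i d_i$ and the min-degree by $d_{\min} \equiv \min_i d_i$; let $\kappa \equiv d_{\max}/d_{\min}$. The Laplacian matrix of the graph $G$ is defined as $L = D^{-1} B$, where $D$ is the diagonal matrix with $D_{ii} = d_i$ and $B$ is the adjacency matrix with $B_{ij} = B_{ji} = 1$ if $(i,j) \in E$ and $0$ otherwise. The Laplacian, defined thus, can be thought of as a transition matrix of a reversible random walk on graph $G$: from each node $i$, jump to one of its neighbors $j$ with equal probability. Given this, it is well known that the Laplacian of the graph has real eigenvalues denoted as
$$ -1 \le \lambda_n(L) \le \cdots \le \lambda_1(L) = 1 . $$
We shall denote the spectral gap of the Laplacian as
$$ \xi \equiv 1 - \lambda_{\max}(L), \quad \text{where } \lambda_{\max}(L) \equiv \max\{ \lambda_2(L), -\lambda_n(L) \} . $$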
Now we state the result establishing the performance of Rank Centrality. Given $n$ objects and a comparison graph $G = ([n], E)$, let each pair $(i,j) \in E$ be compared $k$ times with outcomes produced as per a BTL model with parameters $w_1, \ldots, w_n$. Then, there exist positive universal constants and such that for , the following bound on the normalized error holds with probability at least :
where $\tilde{\pi}(i) = w_i/\sum_{\ell} w_\ell$, $b \equiv \max_{i,j} w_i/w_j$, and $\kappa \equiv d_{\max}/d_{\min}$. The constant can be made as large as desired by increasing the constant .
3.2 Rank Centrality: Error bound for random graphs
Now we consider the special case when the comparison graph is an Erdös-Rényi random graph with each pair $(i,j)$ being compared with probability $d/n$. When $d$ is poly-logarithmic in $n$, we provide a strong performance guarantee. Specifically, the result stated below suggests that with $O(n \,\mathrm{poly}(\log n))$ comparisons, Rank Centrality manages to learn the true scores with high probability. Given $n$ objects, let the comparison graph $G = ([n], E)$ be generated by selecting each pair $(i,j)$ to be in $E$ with probability $d/n$ independently of everything else. Each such chosen pair of objects is compared $k$ times with the outcomes of comparisons produced as per a BTL model with parameters $w_1, \ldots, w_n$. Then, there exist positive universal constants and such that when , , and , the following bound on the error rate holds with probability at least :
where $\tilde{\pi}(i) = w_i/\sum_{\ell} w_\ell$ and $b \equiv \max_{i,j} w_i/w_j$. The can be made as large as desired by increasing the constants and .
Remarks. Some remarks are in order. First, Theorem 3.2 implies that as long as we choose and the error goes to . For , it goes down at a rate as increases. Since we are sampling each of the $\binom{n}{2}$ pairs with probability $d/n$ and then obtaining $k$ comparisons per pair, we obtain comparisons in total with and . Due to classical results on Erdös-Rényi graphs, the induced graph is connected with high probability only when the total number of pairs sampled scales as $\Omega(n \log n)$; we need at least that many comparisons. Thus, our result can be sub-optimal only up to ( if and ).
Second, the parameter $b$ should be treated as constant. It is the dynamic range in which we are trying to resolve the uncertainty between scores. If $b$ were scaling with $n$, then it would be really easy to differentiate scores of items that are at the two opposite ends of the dynamic range; in which case one could focus on differentiating scores of items that have their parameter values near-by. Therefore, the interesting and challenging regime is where $b$ is constant and not scaling.
Third, for a general graph, Theorem 3.1 implies that by choice of , the true scores can be learnt by Rank Centrality. That is, effectively the Rank Centrality algorithm requires comparisons to learn scores well. Ignoring $\kappa$, the graph structure plays a role through $\xi^{-2}$, the squared inverse of the spectral gap of the Laplacian of $G$, in dictating the performance of Rank Centrality. A reversible natural random walk on $G$, whose transition matrix is the Laplacian, has its mixing time scaling as $1/\xi$ (precisely, relaxation time). In that sense, the mixing time of the natural random walk on $G$ ends up playing an important role in the ability of Rank Centrality to learn the true scores.
3.3 Experimental Results
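Under the BTL model, define an error metric of an estimated ordering $\sigma$ as the weighted sum of pairs $(i,j)$ whose ordering is incorrect:
$$ D_w(\sigma) \equiv \Bigl\{ \frac{1}{2 n \|w\|^2} \sum_{i<j} (w_i - w_j)^2 \, \mathbb{I}\bigl( (w_i - w_j)(\sigma_i - \sigma_j) > 0 \bigr) \Bigr\}^{1/2} , $$
where $\mathbb{I}(\cdot)$ is an indicator function. This is a more natural error metric compared to the Kemeny distance, which is an unweighted version of the above sum, since $D_w(\cdot)$ is less sensitive to errors between pairs with similar weights. Further, assuming without loss of generality that $w$ is normalized such that $\sum_i w_i = 1$, the next lemma connects the error in $D_w(\cdot)$ to the bound provided in Theorem 3.2. Hence, the same upper bound holds for the $D_w$ error. A proof of this lemma is provided in the Appendix. Let $\sigma$ be an ordering of $n$ items induced by a scoring $\pi$. Then,
$$ D_w(\sigma) \le \frac{\|w - \pi\|}{\|w\|} . $$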
Synthetic data. To begin with, we generate data synthetically as per a BTL model for a specific choice of scores. A representative result is depicted in Figure 1: for fixed and a fixed , it shows how the error scales when varying two key parameters: varying the number of comparisons per pair with fixed (on left), and varying the sampling probability with fixed (on right). This figure compares the performance of Rank Centrality with a variety of other algorithms. Next, we provide a brief description of the various algorithms that we shall compare with.
Regularized Rank Centrality. When there are items that have been compared only a few times, the scores of those items might be sensitive to the randomness in the outcome of the comparisons, or even worse, the resulting comparisons graph might not be connected. To make the random walk irreducible and get a ranking that is more robust against comparison noise in those edges with only a few comparisons, one can add regularization to Rank Centrality. A reasonable way to add regularization is to consider the transition probability $P_{ij}$ as the prediction of the event that $j$ beats $i$, given data $(a_{ij}, a_{ji})$. Rank Centrality, in the non-regularized setting, uses the Haldane prior of Beta$(0,0)$, which gives $P_{ij} \propto a_{ij}/(a_{ij} + a_{ji})$. To add regularization, one can use different priors, for example Beta$(\varepsilon, \varepsilon)$, which gives
$$ P_{ij} \propto \frac{a_{ij} + \varepsilon}{a_{ij} + a_{ji} + 2\varepsilon} . $$
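Maximum Likelihood Estimator (MLE). The ML estimator directly maximizes the likelihood assuming the BTL model (L. R. Ford 1957). If we reparameterize the problem so that $w_i = e^{\theta_i}$, then we obtain our estimates $\hat{\theta}$ by solving the convex program
$$ \hat{\theta} \in \arg\min_{\theta} \sum_{(i,j) \in E} \sum_{l=1}^{k} \Bigl[ \log\bigl( 1 + \exp(\theta_j - \theta_i) \bigr) - Y_{ij}^l (\theta_j - \theta_i) \Bigr] , $$
which is pair-wise logistic regression. The MLE is known to be consistent (L. R. Ford 1957). The finite sample analysis of MLE is provided in Section 3.5.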
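To make the pair-wise logistic regression concrete, here is a minimal gradient-descent sketch. It assumes the negative log-likelihood of a single comparison is $\log(1 + e^{\theta_j - \theta_i}) - y(\theta_j - \theta_i)$, with $y = 1$ when $j$ is preferred over $i$; the function name `btl_mle`, the step size, and the iteration count are hypothetical choices, and plain gradient descent stands in for whatever convex solver one would use in practice.

```python
import math

def btl_mle(n, comparisons, steps=2000, lr=0.5):
    """Estimate log-scores theta by gradient descent on the pairwise logistic loss.

    comparisons: list of (i, j, y) with y = 1 if j was preferred over i, else 0.
    """
    theta = [0.0] * n
    m = len(comparisons)
    for _ in range(steps):
        grad = [0.0] * n
        for i, j, y in comparisons:
            # Model probability that j is preferred over i.
            p = 1.0 / (1.0 + math.exp(-(theta[j] - theta[i])))
            grad[j] += (p - y) / m
            grad[i] -= (p - y) / m
        theta = [t - lr * g for t, g in zip(theta, grad)]
        # The loss is invariant to shifting all theta; pin the mean at zero.
        mean = sum(theta) / n
        theta = [t - mean for t in theta]
    return theta

# Hypothetical data: item 1 preferred over item 0 in 3 of 4 comparisons.
data = [(0, 1, 1), (0, 1, 1), (0, 1, 1), (0, 1, 0)]
theta_hat = btl_mle(2, data)
gap = theta_hat[1] - theta_hat[0]  # the MLE satisfies sigmoid(gap) = 3/4, i.e. gap = log 3
```

If some item wins (or loses) all of its comparisons, the unregularized estimate diverges, which is one motivation for the regularized variant.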
For comparison with Regularized Rank Centrality, we provide regularized MLE or regularized Logistic Regression:
Borda Count. The Borda Count method, analyzed recently by Ammar and Shah (2011), scores an item by counting the number of wins divided by the total number of comparisons. This can be thought of as an extension of the standard Borda Count for aggregating full rankings. If we break the full rankings into pairwise comparisons and apply the pairwise version of the Borda Count from (Ammar and Shah 2011), then it produces the same ranking as the standard Borda Count applied to the original full rankings.
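As a small illustration of the pairwise Borda Count described above, the sketch below scores each item by its number of wins divided by its number of comparisons. The function name `borda_scores` and the input format (triples `(i, j, y)` with `y = 1` when `j` is preferred over `i`) are hypothetical conventions adopted only for this example.

```python
def borda_scores(n, comparisons):
    """Score each item by (number of wins) / (number of comparisons it took part in)."""
    wins = [0] * n
    total = [0] * n
    for i, j, y in comparisons:
        total[i] += 1
        total[j] += 1
        if y == 1:
            wins[j] += 1
        else:
            wins[i] += 1
    # Items that were never compared get a score of 0.0 by convention here.
    return [wins[i] / total[i] if total[i] else 0.0 for i in range(n)]

# Hypothetical data: item 1 beats 0 twice and beats 2 once; item 2 beats 0 once.
data = [(0, 1, 1), (0, 1, 1), (1, 2, 0), (0, 2, 1)]
scores = borda_scores(3, data)
ranking = sorted(range(3), key=lambda i: -scores[i])
```

Unlike the random-walk scores, this count treats every win equally regardless of the strength of the opponent.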
Spectral Ranking Algorithms.
Rank Centrality can be classified as part of the spectral ranking algorithms, which assign scores to the items according to the leading eigenvector of a matrix that represents the data. Different choices of the matrix based on data can lead to different algorithms. A few prominent examples are the Ratio matrix in (Saaty 2003) and those in Dwork et al. (2001). In the Ratio matrix algorithm, the scores are assigned as per the top eigenvector of the ratio matrix, whose -th entry is . Dwork et al. (2001) introduced four spectral ranking algorithms called MC1, MC2, MC3 and MC4. They are all based on a random walk very similar to (but distinctly different from) that of Rank Centrality.
We make note of the following observations from Figure 1. First, the error achieved by Rank Centrality is comparable to that of the ML estimator, and vanishes at the rate of $1/\sqrt{k}$ as predicted by our main result. Moreover, as predicted by our bounds, the error scales as $1/\sqrt{d}$. Second, for fixed $d$, both the Borda Count and Ratio Matrix algorithms have strictly positive error even if we take $k \to \infty$. This shows that these are inherently inefficient algorithms. Third, despite the strong similarity between Rank Centrality and the Markov chain based algorithms of Dwork et al. (2001), the careful choice of the transition matrix of Rank Centrality makes a noticeable difference as shown in the figure: like Borda Count and Ratio Matrix, for fixed $d$, despite increasing $k$ the error remains finite (and at times gets worse!).
Real data-sets. Next we show that Rank Centrality also improves over existing spectral ranking approaches on real datasets, which are not necessarily derived from the BTL model.
Dataset 1: Washington Post. This is the public dataset collected from an online poll run by the Washington Post (http://www.washingtonpost.com/wp-srv/interactivity/worst-year-voting.html) from December 2010 to January 2011. Using the allourideas platform (http://www.allourideas.org) developed by Salganik and Levy (2012), they asked who had the worst year in Washington, where each user was asked to compare a series of randomly selected pairs of political entities. There are political entities in the dataset, and the resulting graph is a complete graph on these nodes. We used Rank Centrality and other algorithms to aggregate this data. We use this data-set primarily to check the ‘robustness’ of algorithms rather than to understand their ability to identify ground truth, as by design it is not available.
Since the true ranking is unknown, for each algorithm we take as its ground truth the ranking that the algorithm itself produces on the complete dataset; these ground truth rankings therefore differ from algorithm to algorithm. This ground truth is compared to a ranking we get from only a subset of the data, which is generated by sampling each edge with a given sampling rate and revealing only the data on those sampled edges. We want to measure how much each algorithm is affected by eliminating edges from the complete graph. Let be the ranking we get by applying our choice of rank aggregation algorithm to the complete dataset, and be the ranking we get from the sampled dataset. To measure the resulting error in the ranking, we use the following metric:
Figure 2 illustrates that, compared to Borda Count, MC1, MC3, and MC4, the algorithms Rank Centrality, Logit Regression, and MC2 are less sensitive to sampling the dataset, and hence more robust when the available comparisons data is limited.
Dataset 2: NASCAR 2002. Table 1 shows ranking of drivers from NASCAR 2002 season racing results.
Hunter (2004) used this dataset for studying rank-aggregation algorithms, and we use the dataset, publicly available at (Guiver and Snelson 2009):
The dataset has 87 different drivers who competed in a total of 36 races, with 43 drivers racing in each race. Some drivers raced in all 36 races, whereas others participated in only one. To break the race results into pairwise comparisons and to be able to run comparison-based algorithms, as in Hunter (2004) and Guiver and Snelson (2009), we eliminated four drivers who finished last in every race in which they participated. The dataset we used therefore contains a total of 83 drivers.
Table 1 shows the top ten and bottom ten drivers according to their average place, together with their rankings from Rank Centrality and Logit Regression. The unregularized Rank Centrality can overfit the data by placing P. J. Jones and Scott Pruett in first and second place. They have a high average place, but each participated in only one race. In contrast, the regularized version places them lower and gives the top rankings to drivers with more races. Similarly, Morgan Shepherd is placed last in the regularized version, because he had consistently low performance in 5 races.
|Driver||Races||Av. place||Rank Centrality||Logit Regression|
|P. J. Jones||1||4.00||0.1837||1||0.0181||11||0.0124||23|
Dataset 3: ODI Cricket. Table 2 shows the ranking of international cricket teams based on One Day International (ODI) matches from the 2012 season. As with the NASCAR dataset, regularization moves teams with fewer matches, such as Scotland and Ireland, towards the middle of the ranking in Table 2, and moves New Zealand towards the end. Note that, regularized or not, the ranking from Rank Centrality differs from the simple ranking by average place or win ratio, because Rank Centrality gives more credit for winning against stronger opponents.
|Team||matches||Win ratio||deg||Rank Centrality||Logit Regression|
3.4 Information-theoretic lower bound
In previous sections, we presented the achievable error rate based on a particular low-complexity algorithm. In this section, we ask how this bound compares to the fundamental limit under the BTL model.
Our result in Theorem 3.2 provides an upper bound on the achievable error rate between estimated scores and the true underlying scores. We provide a constructive argument to lower bound the minimax error rate over a class of BTL models. Concretely, we consider the scores coming from a simplex with bounded dynamic range defined as
We constrain the scores to be on the simplex, because we represent the scores by their projection onto the standard simplex as explained in Section 2.1. Then, we can prove the following lower bound on the minimax error rate. Consider a minimax scenario where we first choose an estimator that estimates the BTL weights from given observations and, for this particular estimator , nature chooses the worst-case true BTL weights . Then, we can show that for any estimator that we choose, there exists a true score vector with dynamic range at most such that no algorithm can achieve an expected normalized error smaller than the following minimax lower bound:
where the infimum ranges over all estimators that are measurable functions over the observations, we observe the outcomes of comparisons for each pair of items, and we compare each pair of items with probability . By definition the dynamic range is always at least one. When , we can trivially achieve a minimax rate of zero. Since the infimum ranges over all measurable functions, it includes a trivial estimator which always outputs regardless of the observations, and this estimator achieves zero error when . In the regime where the dynamic range is bounded away from one and bounded above by a constant, Theorem 3.4 establishes that the upper bound obtained in Theorem 3.2 is minimax-optimal up to factors logarithmic in the number of items .
3.5 MLE: Error bounds using state-of-the-art method
It is well known that the maximum-likelihood estimate of a set of parameters is asymptotically normal with mean equal to the true parameters and covariance equal to the inverse Fisher information of the set of parameters. In this section we characterize the behavior of the estimates obtained through the logistic-regression-based approach to estimating the parameters in a finite-sample setting.
Recall that the logistic regression based method reparameterizes the model so that given items and the probability that defeats is
In order to ensure identifiability we also assume that , so that we also enforce the constraint . We also recall that we let . Similarly, we let and enforce the constraint that where . For simplicity we assume that .
Finally, recall that we are given i.i.d. observations. We take and let be the outcome of the comparison. Furthermore, if during the competition item competed against item we take where is the standard basis vector with entries that are all zero except for the entry, which equals one. Note that in this context the ordering of the competition does matter. Finally, we define the inner product between two vectors to be . Therefore, under the BTL model with parameters we have that
Now the estimation procedure is of the form
Before proceeding we recall that . With that in mind we have the following theorem. Suppose that we have observations of the form where and are drawn uniformly at random from and is Bernoulli with parameter . Then, we have with probability at least
With the assumption that , we have .
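To make the logistic-regression view concrete, the following is a minimal sketch of maximum-likelihood estimation under the BTL model. It assumes the standard parameterization in which item i beats item j with probability exp(θ_i) / (exp(θ_i) + exp(θ_j)), and it enforces identifiability by re-centering so that the parameters sum to zero. The function name `btl_mle`, the step size, the iteration count, and the toy data are all illustrative assumptions; this is plain unregularized gradient ascent, not necessarily the exact estimator analyzed in this section.

```python
import math

def btl_mle(n, comparisons, steps=2000, lr=0.5):
    """Gradient ascent on the average BTL log-likelihood in theta.

    comparisons: list of (winner, loser) index pairs.
    Returns theta with sum(theta) == 0 (identifiability constraint).
    """
    theta = [0.0] * n
    m = len(comparisons)
    for _ in range(steps):
        grad = [0.0] * n
        for winner, loser in comparisons:
            # Model probability that `winner` beats `loser` at the current theta.
            p = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            # d/d theta_winner of log p is (1 - p); the loser gets the negative.
            grad[winner] += (1.0 - p) / m
            grad[loser] -= (1.0 - p) / m
        theta = [t + lr * g for t, g in zip(theta, grad)]
        mean = sum(theta) / n
        theta = [t - mean for t in theta]  # project back onto sum(theta) = 0
    return theta

# Toy data (invented): each pair is compared 4 times, the stronger item wins 3.
comparisons = ([(0, 1)] * 3 + [(1, 0)] +
               [(1, 2)] * 3 + [(2, 1)] +
               [(0, 2)] * 3 + [(2, 0)])
theta = btl_mle(3, comparisons)
print([round(t, 3) for t in theta])
```

The toy data is chosen so that every item both wins and loses at least once, which keeps the maximum-likelihood estimate finite.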
3.6 Cramér-Rao lower bound
The Fisher information matrix (FIM) encodes the amount of information that the observed measurements carry about the parameter of interest. The Cramér-Rao bound we derive in this section provides a lower bound on the expected squared Euclidean norm of the estimation error and is directly related to the inverse of the Fisher information matrix.
Denote the log-likelihood function as
The Fisher information matrix with the BTL weights is defined as
Let denote our estimate of the weights. Applying the Cramér-Rao bound (Rao 1945), we get the following lower bound.
This bound depends on and the graph structure. Although a closed form expression is difficult to get, we can compare our numerical experiments with a numerically computed Cramér-Rao bound on the same graph and the weights .
3.6.1 Numerical comparisons
In Figure 3, the average normalized root mean squared error (RMSE) is shown as a function of various model parameters. We fixed the control parameters as , , and . Each point in the figure is averaged over random instances . Let be the resulting estimate at -th experiment, then
For all ranges of model parameters , , and , RMSE achieved using Rank Centrality is almost indistinguishable from that of the ML estimate and also the Cramér-Rao bound (CRB).
The CRB provides a lower bound on the expected mean squared error. Although we are plotting the average root mean squared error, as opposed to the average mean squared error, we do not expect any estimator to achieve an RMSE better than the CRB as long as the error concentrates around its mean.
The ML estimator in (7) with finds an estimate that maximizes the log-likelihood, and in general the ML estimate does not coincide with the minimum mean squared error estimator. From the figure we see that it in fact achieves the minimum mean squared error and matches the CRB.
What is perhaps surprising is that for all the parameters that we experimented with, the RMSE achieved by Rank Centrality is almost indistinguishable from that of the ML estimate and the CRB. Thus, coupled with the minimax lower bounds, these experiments indicate that one cannot do substantially better than Rank Centrality.
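A small simulation in the spirit of these numerical comparisons can be sketched as follows. It is a hedged illustration, not the code behind Figure 3: it assumes the usual Rank Centrality construction (a random walk that moves from i to j with probability proportional to the fraction of comparisons j won against i, scaled by the maximum degree), a complete comparison graph, and a relative Euclidean error as the normalization. The function names, the weights, and the number of comparisons are hypothetical choices.

```python
import random

def rank_centrality(beats, iters=1000):
    """Stationary distribution of the comparison random walk.

    beats[i][j] = number of comparisons in which i beat j.
    """
    n = len(beats)
    a = [[0.0] * n for _ in range(n)]
    degree = [0] * n
    for i in range(n):
        for j in range(n):
            total = beats[i][j] + beats[j][i]
            if i != j and total > 0:
                a[i][j] = beats[j][i] / total  # fraction of times j beat i
                degree[i] += 1
    d_max = max(degree)
    P = [[a[i][j] / d_max for j in range(n)] for i in range(n)]
    for i in range(n):
        P[i][i] = 1.0 - sum(P[i])  # self loop absorbs the remaining mass
    p = [1.0 / n] * n
    for _ in range(iters):  # power iteration: p^T <- p^T P
        p = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
    return p

def simulate_btl(w, k, rng):
    """Draw k BTL comparisons for every pair of items with weights w."""
    n = len(w)
    beats = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            for _ in range(k):
                if rng.random() < w[i] / (w[i] + w[j]):
                    beats[i][j] += 1
                else:
                    beats[j][i] += 1
    return beats

def relative_error(p, w):
    """Euclidean error against the normalized weights (assumed normalization)."""
    total = sum(w)
    target = [x / total for x in w]
    num = sum((x - y) ** 2 for x, y in zip(p, target))
    return (num / sum(y * y for y in target)) ** 0.5

w = [1.0, 2.0, 3.0, 4.0]  # hypothetical true weights
rng = random.Random(0)
for k in (10, 100, 1000):
    print(k, relative_error(rank_centrality(simulate_btl(w, k, rng)), w))
```

Averaging this error over many random instances, and repeating it for an ML estimate, gives the kind of curve the text compares against the CRB.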
We may now present proofs of Theorems 3.1 and 3.2. We first present a proof of convergence for general graphs in Theorem 3.1. This result follows from Lemma 4.1 that we state below, which shows that our algorithm enjoys convergence properties that result in useful upper bounds. The lemma is stated in general form and relies on standard techniques from spectral theory. The main difficulty arises in establishing that the Markov chain satisfies certain properties that we will discuss subsequently. Given the proof for the general graph, Theorem 3.2 follows by showing that in the case of Erdős-Rényi graphs, certain spectral properties are satisfied with high probability.
4.1 Proof of Theorem 3.1: General graph
In this section, we characterize the error rate achieved by our ranking algorithm. Given the random Markov chain , where the randomness comes from the outcome of the comparisons, we will show that it does not deviate too much from its expectation , where we recall is defined as
for all and otherwise.
Recall from the discussion following equation (1) that the transition matrix used in our ranking algorithm has been carefully chosen such that the corresponding expected transition matrix has two important properties. First, the stationary distribution of , which we denote with is proportional to the weight vector . Furthermore, when the graph is connected and has self loops (of which at least one exists), this Markov chain is irreducible and aperiodic, so the stationary distribution is unique. Second, is reversible, that is, it satisfies the detailed balance equations with respect to its stationary distribution. This observation implies that the operator is symmetric in an appropriately defined inner product space. The symmetry of the operator will be crucial in applying ideas from spectral analysis to prove our main results.
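Both properties are easy to check numerically on a small example. The sketch below assumes the expected transition matrix has the usual Rank Centrality form, with off-diagonal entry w_j / (w_i + w_j) divided by the maximum degree on each edge and the remaining mass on the self loop. The graph, the weights, and the function names are illustrative assumptions.

```python
def expected_transition_matrix(w, edges):
    """Expected chain: P[i][j] = w[j] / (w[i] + w[j]) / d_max on each edge."""
    n = len(w)
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    d_max = max(degree)
    P = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        P[i][j] = w[j] / (w[i] + w[j]) / d_max
        P[j][i] = w[i] / (w[i] + w[j]) / d_max
    for i in range(n):
        P[i][i] = 1.0 - sum(P[i])  # self loop keeps each row summing to one
    return P

def detailed_balance_gap(pi, P):
    """Largest violation of pi[i] * P[i][j] == pi[j] * P[j][i]."""
    n = len(pi)
    return max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
               for i in range(n) for j in range(n))

def stationarity_gap(pi, P):
    """Largest entry of |pi^T P - pi^T|."""
    n = len(pi)
    return max(abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j])
               for j in range(n))

w = [1.0, 2.0, 3.0, 4.0]                          # hypothetical weights
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]  # hypothetical connected graph
pi = [x / sum(w) for x in w]                      # candidate: proportional to w
P = expected_transition_matrix(w, edges)
print(detailed_balance_gap(pi, P), stationarity_gap(pi, P))
```

Detailed balance holds edge by edge because w_i w_j / (w_i + w_j) is symmetric in i and j, and summing the detailed balance equations over i gives stationarity.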
Let denote the fluctuation of the transition matrix around its mean, such that . The following lemma bounds the deviation of the Markov chain after steps in terms of two important quantities: the spectral radius of the fluctuation and the spectral gap , where
Since ’s are sorted, is the second largest eigenvalue in absolute value. For any Markov chain with a reversible Markov chain , let be the distribution of the Markov chain when started with initial distribution . Then,
where is the stationary distribution of , , , and . The above result provides a general mechanism for establishing error bounds between an estimated stationary distribution and the desired stationary distribution . It is worth noting that the result only requires control on the quantities and . We may now state two technical lemmas that provide control on the quantities and , respectively. For and with appropriately large constant , the error matrix satisfies
with probability at least : constant can be made large at the cost of possibly making and larger. The next lemma provides our desired bound on . When and , the spectral radius satisfies
Proof of Theorem 3.1. With the above results in hand we may now proceed with the proof of the main result of our interest. When there is a positive spectral gap such that , the first term in (11) vanishes as grows. The rest of the first term is bounded and independent of . Formally, we have
by the assumption that and the fact that . Hence, the error between the distribution at the iteration and the true stationary distribution is dominated by the second term in equation (11). Substituting the bounds from the two technical lemmas above, the dominant second term in equation (11) is bounded by
In fact, we only need to ensure that the above bound holds up to a constant factor. This finishes the proof of Theorem 3.1. Notice that in order for this result to hold, we need the following two conditions: for the lemma bounding the error matrix and for the lemma bounding the spectral radius. Since , , and , the second condition always implies the first for any choice of .
4.1.1 Proof of Lemma 4.1
Due to the reversibility of , we can view it as a self-adjoint operator on an appropriately defined inner product space. This observation allows us to apply the well-understood spectral analysis of self-adjoint operators. In order to establish this fact define an inner product space as a space of -dimensional vectors with
Similarly, we define as the -norm in . For a self-adjoint operator in , we define as the operator norm. These norms are related to the corresponding norms in the Euclidean space through the following inequalities.
A reversible Markov chain is self-adjoint in . To see this, define a closely related symmetric matrix , where is a diagonal matrix with . The assumption that is reversible, i.e. , implies that is symmetric, and it follows that is self-adjoint in .
Further, the asymmetric matrix and the symmetric matrix have the same set of eigenvalues. By the Perron-Frobenius theorem, the eigenvalues are at most one. Let be the eigenvalues, and let be the left eigenvector of corresponding to . Then the th left eigenvector of is . Since the first left eigenvector of is the stationary distribution, i.e. , we have .
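The symmetrization argument can be illustrated numerically. The sketch below assumes the standard construction in which the symmetric matrix has entries sqrt(π_i) P_ij / sqrt(π_j), that is, the transition matrix conjugated by the square root of the diagonal matrix of stationary probabilities. The two small chains and the helper names are invented for illustration: one chain is reversible, the other is a deterministic cycle that is not.

```python
import math

def symmetrize(P, pi):
    """Return S with S[i][j] = sqrt(pi[i]) * P[i][j] / sqrt(pi[j])."""
    n = len(pi)
    return [[math.sqrt(pi[i]) * P[i][j] / math.sqrt(pi[j]) for j in range(n)]
            for i in range(n)]

def asymmetry(S):
    """Largest absolute difference between S and its transpose."""
    n = len(S)
    return max(abs(S[i][j] - S[j][i]) for i in range(n) for j in range(n))

def left_multiply(v, M):
    """Row vector times matrix: (v^T M)."""
    n = len(v)
    return [sum(v[i] * M[i][j] for i in range(n)) for j in range(n)]

# A reversible chain with stationary distribution proportional to (1, 2, 3).
P_rev = [[7 / 24, 1 / 3, 3 / 8],
         [1 / 6, 8 / 15, 3 / 10],
         [1 / 8, 1 / 5, 27 / 40]]
pi_rev = [1 / 6, 1 / 3, 1 / 2]
S_rev = symmetrize(P_rev, pi_rev)

# A non-reversible chain: a deterministic cycle with uniform stationary distribution.
P_cyc = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
pi_cyc = [1 / 3, 1 / 3, 1 / 3]
S_cyc = symmetrize(P_cyc, pi_cyc)

print("reversible chain asymmetry:", asymmetry(S_rev))
print("cycle chain asymmetry:", asymmetry(S_cyc))

# The entrywise square root of pi is a left eigenvector of S with eigenvalue one.
root = [math.sqrt(x) for x in pi_rev]
print(left_multiply(root, S_rev), root)
```

The symmetric matrix is obtained from the transition matrix by a similarity transform, which is why the two share eigenvalues; symmetry itself is exactly the detailed balance condition, so it fails for the cycle.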
For the Markov chain , where is a reversible Markov chain such that , we let