The question of ranking items using pairwise comparisons is of interest in many applications. Some typical examples are from sports where pairs of players play against each other and people are interested in ranking the players from past games. This type of ranking problem is usually studied using the Bradley-Terry model  where each item is associated with a score measuring its competitiveness and
For this model, the maximum likelihood estimation of the score vector can be solved efficiently .
There are other examples where comparisons are obtained implicitly. For example, when a user clicks a result from a list returned by a search engine for a given request, it implies that this user prefers this result over nearby results on the list. Similarly, when a customer buys a product from an online retailer, it implies that this customer prefers this product over previously browsed products. Businesses providing these services are interested in inferring users’ rankings of items. In these examples, users can have different scores for the same item and a single score vector is insufficient to capture individual preferences. Therefore it is more appropriate to view the user preferences as generated from a mixture of Bradley-Terry models. Though this mixture model has been used in many fields (see [1, 23] and the references therein), little is known about how to cluster the users and learn the individual preferences efficiently, and how many pairwise comparisons are needed for a target estimation error.
In this work, we study the following mixture Bradley-Terry model: users are clustered into different types; users of the same type have the same score vector; every user independently generates a few pairwise comparisons according to the Bradley-Terry model. Notice that under our model users of the same type will have similar but not necessarily identical pairwise comparisons. The task is to estimate the score vector for each user. Essentially, we would like to cluster the users using the observed pairwise comparisons and then estimate the score vector for each cluster. However, there are two key challenges. First, for each user, if we stack all the possible pairwise comparisons as a vector, this comparison vector lies in a high dimensional space and only a small number of its entries are observed. Hence, directly clustering users based on the comparison vectors is likely to be too noisy to work well; our numerical experiments (see Section VII) confirm as much. Second, although standard algorithms like maximum likelihood estimation [15, 22] are available for estimating the score vector once the clusters (users of the same type) are exactly found, it is still unclear how the algorithms perform when the clusters are only approximately recovered.
Our first contribution is to propose and show the effectiveness of clustering users according to their net-win vectors. A net-win vector for a user is a vector of length , where its -th coordinate counts the number of times item is preferred over other items minus the number of times other items are preferred over item according to this user’s pairwise comparisons. The effectiveness of net-win vectors in clustering users relies on the following surprising fact: the means of all the comparison vectors are close to some -dimensional linear subspace; the net-win vectors are essentially the projection of the comparison vectors onto this low-dimensional linear subspace. We show that the projection to the net-win vectors preserves the distances between different clusters, while the net-win vectors are much less noisy than the comparison vectors. Given good separation of the net-win vectors corresponding to different clusters, we show that a standard spectral clustering algorithm approximately recovers the user clusters.
Our second contribution is to show that, even though the clusters have a few erroneously assigned users, the maximum likelihood estimator for the Bradley-Terry model is still close to the true score vector for this cluster. In our algorithm, as we only expect to approximately recover the user clusters, this robustness result ensures that we can still approximately recover the score vectors for most users.
The results for the clustering and estimation steps can be combined to provide a performance guarantee for the overall algorithm. Our algorithm accurately estimates the score vectors for most users with only pairwise comparisons per cluster, where is the number of user types (clusters) and is the total number of users. When there is only one cluster, it is known that pairwise comparisons are required for any algorithm to accurately estimate the score vector . Also, each user needs to provide at least one pairwise comparison; otherwise there is no hope to accurately estimate the preference for this user, so at least pairwise comparisons are needed in total to learn individual preferences. When is of order or , the sample complexity of our algorithm matches the lower bounds up to logarithmic factors.
I-A Related Work
In this section, we point out some connections of our model and results to prior work. There is a vast literature on the ranking and related rating prediction problems; here we cover a fraction of it we see as most relevant.
Rank aggregation has been extensively studied across various disciplines including statistics, psychology, sociology, and computer science [16, 14, 6, 27, 9]. The Bradley-Terry (BT) model proposed in [5, 19] and its various extensions are widely used for studying the rank aggregation problem [15, 12, 21, 25, 2, 24]. The classical results in  show that the likelihood function under the BT model is concave and the ML estimator for the score vector can be efficiently found using an EM algorithm. It is further shown in  and  that pairwise comparisons are necessary for any algorithm to accurately infer the score vector and
randomly chosen pairwise comparisons are sufficient for the ML estimator. In this paper, we show the ML estimator is able to estimate the score vector accurately even with a small number of arbitrarily corrupted pairwise comparisons. In addition to ML estimation, several Markov-chain-based iterative methods have been proposed in [10, 22] and have been shown to accurately estimate the score vector with randomly chosen pairwise comparisons, which matches the sample complexity of the ML estimator.
Previous works on rank aggregation, however, mostly focus on a single type of users and aim to combine the observed user preferences to output a single ranking that best represents the preferences of all users. Little is known about clustering and learning individual preferences when there are multiple types of users. In this paper, we consider a mixed Bradley-Terry model to capture the heterogeneity of the user preferences. Our mixed BT model is closely related to the so-called mixed multinomial logit model studied in and . Ammar et al. study a clustering problem similar to ours under the mixed multinomial logit model, where each user provides a set of favorite items instead of pairwise comparisons, and users are clustered based on the overlaps between the sets of
favorite items. Under a geometric decay condition on the score vectors, the algorithm is shown to cluster users correctly with high probability if and there are only users. However, it is unclear whether the geometric decay condition holds in practice and, more importantly, how the algorithm performs when there are a large number of users. Oh and Shah study a different problem of estimating the model parameters under the mixed multinomial logit model. A tensor decomposition based algorithm is shown to estimate the model parameters accurately with pairwise comparisons per component. If our algorithm is applied to estimate the model parameters, only pairwise comparisons per component are needed. (Footnote 1: Since there is no need to estimate the preferences for every user in this context, the dependency on in our sample complexity can be dropped.) Another mixture approach is proposed in  for clustering heterogeneous ranking data and an efficient EM algorithm is derived for parameter estimation. This method can take rankings of different lengths as input. However, no analytical performance guarantee is provided for the clustering. Very recently, a nuclear norm regularization approach was proposed in  to estimate the score vectors for all users. By assuming each user has a unique score vector and the score matrix formed by stacking all score vectors as rows is approximately of low rank , they prove the estimation error if there are randomly chosen pairwise comparisons. However, it is not immediately clear how the nuclear norm approach performs in terms of the estimation error of the score vector for each individual.
Finally, we point out that there is a large body of work studying the related problem of rating predictions. A popular approach is based on matrix completion methods [8, 17], in which the incomplete rating matrix is assumed to be of low rank. Another line of work [3, 28, 20] assumes there are multiple types of users and users of the same type provide similar ratings. However, the rating-based methods have several limitations compared to the pairwise-preference-based methods. First, not all preferences are available in the form of ratings, while numerical ratings can be transformed into pairwise comparisons. Second, ratings are user-biased, e.g., a user may give higher or lower ratings on average than others, while pairwise comparisons are absolute. Third, pairwise comparisons are more reliable and consistent than ratings, e.g., it is easier for a user to compare two items than to assign scores to them. Algorithmically, learning preferences from rankings is more challenging, because the vectors of pairwise comparisons lie in a -dimensional space, while the vectors of ratings lie in an -dimensional space. We overcome this challenge by a simple, but non-trivial projection of the comparison vectors into a low dimensional, linear subspace.
II Problem setup
Consider a system with user clusters of sizes and items and let . Each user has a score vector for the items , and he/she compares items according to the Bradley-Terry model: item is preferred over item with probability and vice versa with probability . Assume users in the same cluster have the same score vector and denote the common score vector for cluster by . As and for any constant
define the same probability distributions of pairwise comparisons in the Bradley-Terry model, is only identifiable up to a constant shift. To eliminate the ambiguity and without loss of generality, we always shift to ensure that .
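The shift invariance can be checked in one line. The identity below is a sketch in assumed notation: $\theta_i$ and $\theta_j$ stand for the scores of two items, $c$ for the constant shift, and the win probability is taken in the standard Bradley-Terry form.

```latex
\[
\frac{e^{\theta_i + c}}{e^{\theta_i + c} + e^{\theta_j + c}}
= \frac{e^{c}\, e^{\theta_i}}{e^{c}\,\bigl(e^{\theta_i} + e^{\theta_j}\bigr)}
= \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}} .
\]
```

Every pairwise probability is unchanged by the shift, so some normalization of the score vector is needed before estimation errors can be compared.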
The overall comparison result is represented by an sample comparison matrix . The -th row is the comparison vector of users . The columns are indexed by two numbers with , and the -th column corresponds to the comparisons for item and . For each user , and items and with , user ’s comparison result of item and is sampled with probability independently, where is the erasure probability. Let if prefers over , if prefers over , and if ’s comparison is not sampled. Then
Our goal is to estimate the score vectors from .
To simplify the analysis, we will assume ’s are generated independently as follows: for each and , generate i.i.d. uniformly in , and then define
Clearly, and for any and . Notice that are not independent, and .
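The sampling model of this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names, the interval `[0, b]` for the raw scores, the centering to zero sum, and the lexicographic ordering of the pairs are all hypothetical choices made for the example.

```python
import math
import random
from itertools import combinations


def sample_scores(m, b, rng):
    """Draw m raw scores i.i.d. uniform on [0, b] (assumed interval),
    then centre them so that they sum to zero (assumed normalization)."""
    raw = [rng.uniform(0.0, b) for _ in range(m)]
    mean = sum(raw) / m
    return [x - mean for x in raw]


def sample_comparison_vector(theta, erasure_prob, rng):
    """One user's comparison vector over all pairs (i, j) with i < j.

    Entry is +1 if item i is preferred, -1 if item j is preferred,
    and 0 if the comparison is erased (not sampled)."""
    m = len(theta)
    row = []
    for i, j in combinations(range(m), 2):
        if rng.random() < erasure_prob:
            row.append(0)
            continue
        # Bradley-Terry probability that item i beats item j.
        p_i_wins = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
        row.append(1 if rng.random() < p_i_wins else -1)
    return row


rng = random.Random(0)
theta = sample_scores(5, 2.0, rng)
row = sample_comparison_vector(theta, erasure_prob=0.7, rng=rng)
```

Stacking one such row per user, with users of the same cluster sharing `theta`, gives a sample comparison matrix of the kind described above.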
II-A Notation and Outline
denote the singular value decomposition of a matrix such that . The spectral norm of is denoted by , which is equal to the largest singular value. The best rank approximation of is defined as . For vectors, let denote the inner product between two vectors; the only norm that will be used is the usual norm, denoted as . In this paper, all vectors are row vectors. We say an event occurs with high probability when the probability of occurrence of that event goes to one as and go to infinity.
The rest of the paper is organized as follows. Section III describes our three-step algorithm and summarizes the main results. The key idea of de-noising the comparison vectors by projection in the first step is explained in Section IV. The details of the last two steps of our algorithm, i.e., user clustering and score vector estimation, are provided in Section V. All the proofs can be found in Section VI. The experimental results are given in Section VII. Section VIII concludes the paper.
III Algorithm and Main Result
Our algorithm for clustering users and inferring their preferences is presented as Algorithm 1. The basic idea is to estimate in two steps: cluster the users and then estimate a score vector for each cluster separately.
The difficulty lies in the clustering step. Recall that, in our problem, each user is represented by a comparison vector of length , and only roughly of its entries are observed. These comparison vectors are so noisy that directly clustering them results in poor performance, a fact which we confirm in our experiments in Section VII.
We overcome this difficulty by reducing the dimension of the comparison vectors. Consider user with comparison vector . For each item , define the normalized net number of wins
We call the net-win vector of user . To simplify the notation, let be the matrix with the -th column being , where is the length vector with all s except for a in the -th coordinate, then it is easy to verify that
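The net-win computation can be sketched as follows. The example assumes that comparison entries are +1, -1 or 0 as in Section II, that the pairs are ordered lexicographically with i < j, and that the normalization is division by the square root of the number of items; the function name and the normalization constant are assumptions made for illustration.

```python
from itertools import combinations


def net_win_vector(row, m, normalize=True):
    """Net number of wins per item from one user's comparison vector.

    row[k] refers to the k-th pair (i, j), i < j, in lexicographic order:
    +1 means item i was preferred, -1 means item j, 0 means not observed.
    """
    pairs = list(combinations(range(m), 2))
    if len(row) != len(pairs):
        raise ValueError("row must have one entry per unordered pair")
    net = [0.0] * m
    for (i, j), r in zip(pairs, row):
        net[i] += r  # a win for i (r = +1) or a loss for i (r = -1)
        net[j] -= r  # the opposite outcome for j
    if normalize:
        scale = m ** 0.5  # assumed normalization
        net = [x / scale for x in net]
    return net


# Three items, pairs (0,1), (0,2), (1,2): item 0 beats 1, loses to 2.
example = net_win_vector([1, -1, 0], m=3, normalize=False)
```

Each observed comparison adds one to the winner and subtracts one from the loser, so the entries of a net-win vector always sum to zero.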
The effectiveness of net-win vectors in clustering users relies on the following surprising fact: the expected comparison vectors for all users, which are -dimensional nonlinear functions of the score vectors , are close to the -dimensional linear subspace spanned by the rows of , or the row space of . It suggests denoising the ’s for all users by projecting them onto the row space of . The projections of the ’s turn out to be isometric to the net-win vectors ’s. In particular, recall our definition of given in (1), the term acts just like an orthogonal projection onto the row space of . We show in Section IV that for any two users in two different clusters, and for . Therefore, the net-win vectors are much less noisy and easier to separate than the comparison vectors .
We then cluster the net-win vectors by a standard spectral clustering algorithm in Step 2 of Algorithm 1. Let denote the true clusters and denote the clusters generated by Algorithm 1 with threshold . Since the clusters are only identifiable up to a permutation of the indices, we define the number of errors in as
where denotes the symmetric difference of two sets. The following theorem highlights a key contribution of the paper: the projection of comparison vectors to the row space of results in significant denoising, which allows for accurately clustering the users with only a small number of pairwise comparisons.
Let . If for any arbitrarily small constant or for some constant , then with high probability, there exists a permutation such that,
In particular, when
the fraction of misclustered users in each cluster , i.e.,
Theorem 1 implies that if and are on the same order, roughly each user only needs to give pairwise comparisons to allow for the correct clustering for all but users. Notice that a user needs to give at least one pairwise comparison. The specific choice of is just for simplicity of the proof, which can be relaxed to for any constant . The lower bound is required. Consider the extreme case where , then the score vectors for all users are all-zero vectors and the clusters are unidentifiable from pairwise comparisons. The upper bound is an artifact of our analysis as shown by our numerical experiments. Note that if , then the most favorable item is preferred over the least favorable item with probability approximately .
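The permutation-invariant error count used in Theorem 1 can be illustrated with a brute-force sketch. It assumes the estimated clustering has the same number of clusters as the true one and aggregates the symmetric differences by a sum; both choices, and the function name, are assumptions for illustration rather than the exact definition used in the analysis.

```python
from itertools import permutations


def clustering_errors(true_clusters, est_clusters):
    """Smallest total symmetric difference over all matchings of cluster indices.

    Returns (total, perm, per_cluster) where est_clusters[perm[k]] is matched
    to true_clusters[k]. Brute force, so only suitable for a few clusters.
    """
    r = len(true_clusters)
    if len(est_clusters) != r:
        raise ValueError("this sketch assumes equal numbers of clusters")
    best = None
    for perm in permutations(range(r)):
        per_cluster = [
            len(set(true_clusters[k]) ^ set(est_clusters[perm[k]]))
            for k in range(r)
        ]
        total = sum(per_cluster)
        if best is None or total < best[0]:
            best = (total, perm, per_cluster)
    return best


total, perm, per_cluster = clustering_errors(
    [{0, 1, 2}, {3, 4}], [{2, 3, 4}, {0, 1}]
)
```

In this example user 2 is the only misplaced user; it is counted once in each of the two clusters it affects.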
After estimating the clusters, Algorithm 1 treats each cluster separately and estimates a score vector using the maximum likelihood estimation for the single cluster Bradley-Terry model. In order to avoid the dependence between Step and Step , we generate two smaller samples and by subsampling , and use them in the two steps respectively. It is not hard to verify that the support sets and are independent.
The overall performance of Algorithm 1 is characterized by the following theorem, which shows that, when the number of pairwise comparisons is large enough, the estimations of the score vectors are accurate for most users with high probability.
Assume for any arbitrarily small constant , then there exists a constant such that with high probability
except for users. In particular, if , then except for users.
Theorem 2 shows that the estimation error depends on the maximum of and : characterizes the fraction of misclustered users in a given cluster as shown by Theorem 1; characterizes the estimation error of the maximum likelihood estimation assuming the clustering is perfect. If , then there is no clustering error and the estimation error only depends on which matches the existing results in  with a single type of user. The lower bounds in  and  show that at least pairwise comparisons per type are needed to ensure even when clusters are known. Also, a user needs to provide at least one pairwise comparison for us to infer his/her preference, which means that at least pairwise comparisons in total are required to infer the preferences for most users. Theorem 2 shows that Algorithm 1 needs approximately comparisons per cluster, which matches the lower bounds up to logarithmic factors if is poly-logarithmic in or .
IV Denoising using Net-win Vectors
In this section, we analyze Step 1 of Algorithm 1. We first argue that directly clustering based on the comparison vector is too noisy to work well. Then, we show the net-win vectors preserve the distances between different clusters from a geometric projection point of view. Finally, we prove the net-win vectors are much less noisy than the comparison vectors.
Recall that is a -dimensional vector of all pairwise comparisons for user . For any , the mean of the -th entry is
where . Since two users from the same cluster have identical score vectors, the means of their comparison vectors are also identical. With a slight abuse of notation, let denote the common means of the comparison vectors for users in cluster , where . For , we call the distance between cluster and . It is easy to check that with high probability. In other words, the distances between different clusters are roughly . Hence, if we observe the means of all the comparison vectors, then clustering becomes trivial. In our problem, for each user , we only observe , which is a noisy observation of . More specifically, since the expected number of comparisons a user provides is ,
Therefore, we would expect the deviation , which is much larger than the distances between different clusters given by . As a result, the comparison vectors for two users from the same cluster are likely to be far apart, while the comparison vectors for two users from different clusters might happen to be close. Therefore, the comparison vectors are too noisy to be clustered directly.
In the following, we explain how to denoise the comparison vectors. An interesting observation is that the mean of the comparison vector lies close to an -dimensional linear subspace. In particular, using the definition of , we get
where for a vector , . Although is a non-linear function, we are able to show the angle between and the -dimensional linear subspace spanned by the rows of , or the row space of , is not large. To see it, let us first assume is small. Recall that for any and . In this regime, we can linearize the function at and get
which means that is approximately on the row space of and the angle . Somewhat surprisingly, is still not too large even if becomes so large that the linear approximation does not work any more. Consider the extreme case when , under our assumption that
are uniformly distributed, we have, thus
The following lemma shows that is approximately in this case.
For any and assume for any and . Define row vector as . Then the angle between and the row space of is in the limit as .
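The limiting angle in the extreme case can be sanity-checked by a short calculation. The sketch below uses assumed notation: $A$ is the $m \times \binom{m}{2}$ matrix whose $(i,j)$-th column is $e_i - e_j$, and $v$ is the row vector with entries $v_{ij} = \operatorname{sign}(\theta_i - \theta_j)$. It is a plausibility check under these assumptions, not the proof given in the paper.

```latex
\begin{align*}
A A^{\top} &= m I - J
  \quad\Longrightarrow\quad
  \text{the projection onto the row space of } A \text{ is } \tfrac{1}{m} A^{\top} A, \\
\| v A^{\top} \|^2 &= \sum_{k=1}^{m} (m + 1 - 2k)^2 = \frac{m (m^2 - 1)}{3},
  \qquad
  \| v \|^2 = \binom{m}{2} = \frac{m (m - 1)}{2}, \\
\cos^2 \angle\bigl(v, \operatorname{row}(A)\bigr)
  &= \frac{\tfrac{1}{m} \| v A^{\top} \|^2}{\| v \|^2}
   = \frac{2 (m + 1)}{3 m}
   \;\xrightarrow[m \to \infty]{}\; \frac{2}{3}.
\end{align*}
```

Here the item ranked $k$-th wins against $m - k$ items and loses to $k - 1$, which gives the net wins $m + 1 - 2k$. Under these assumptions the angle tends to $\arccos\sqrt{2/3} \approx 35.3^{\circ}$.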
For the intermediate range of , we do not have an analytical result on the upper bound of the angle . Through extensive simulation as plotted in Figure 1, we can see that the averaged over independent simulations decreases monotonically with and it is always upper bounded by .
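A simulation of this kind can be reproduced in outline with the following sketch. It is not the authors' simulation code: it assumes the matrix with columns $e_i - e_j$, uses the identity that the Bradley-Terry win probability minus the loss probability equals $\tanh((\theta_i - \theta_j)/2)$, and drops the common sampling factor because it does not affect the angle. The function name is hypothetical.

```python
import math
import random
from itertools import combinations


def angle_to_row_space(theta):
    """Angle (radians) between the mean comparison vector and the row space
    of the matrix whose (i, j)-th column is e_i - e_j.

    Requires at least two distinct scores, otherwise the mean vector is zero.
    """
    m = len(theta)
    pairs = list(combinations(range(m), 2))
    # P(i beats j) - P(j beats i) under Bradley-Terry.
    mean_vec = [math.tanh((theta[i] - theta[j]) / 2.0) for i, j in pairs]
    # Expected (unnormalized) net wins, i.e. mean_vec times A transposed.
    net = [0.0] * m
    for (i, j), x in zip(pairs, mean_vec):
        net[i] += x
        net[j] -= x
    # Since A A^T = m I - J, the projection has squared norm ||net||^2 / m.
    proj_sq = sum(x * x for x in net) / m
    total_sq = sum(x * x for x in mean_vec)
    cos = math.sqrt(min(1.0, proj_sq / total_sq))
    return math.acos(cos)


rng = random.Random(1)
small = angle_to_row_space([rng.uniform(0.0, 0.01) for _ in range(20)])
large = angle_to_row_space([1000.0 * k for k in range(30)])
```

With nearly equal scores the angle is close to zero, in line with the linearization argument; with widely separated scores the mean vector becomes a sign pattern and the angle moves to the extreme case of Lemma 1.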
The observation that is close to the row space of suggests that we may denoise the comparison vectors by projecting onto the row space of . We show in Lemma 6 that the SVD of is given by , where and . Since the row vectors of form an orthonormal basis of the row space of , the projection of onto the row space of is given by and when represented in the basis , the projection is simply . Interestingly, we find that is isometric to the normalized net-win vectors used in Algorithm 1 (in Algorithm 1, we generate two independent samples from and is defined using ; here, we simply write as for ease of notation):
Since the rows of form an orthonormal basis, when represented in the basis , is simply , which is exactly the same as when represented in the basis . Hence, the net-win vectors are equivalent to the projection of comparison vectors onto the row space of . The benefit of using the net-win vectors instead of doing the projection is that they have a clearer physical meaning and are easier to compute; there is no need to compute the SVD of , which is prohibitive when is large.
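The equivalence between projecting and counting net wins can be checked numerically without computing an SVD. The sketch below assumes the matrix has columns $e_i - e_j$ for pairs $i < j$ in lexicographic order; the helper names are hypothetical, and the closed form of the projection follows from the Gram matrix being $mI - J$.

```python
from itertools import combinations


def difference_matrix(m):
    """m x C(m,2) matrix whose column for pair (i, j), i < j, is e_i - e_j."""
    pairs = list(combinations(range(m), 2))
    A = [[0] * len(pairs) for _ in range(m)]
    for col, (i, j) in enumerate(pairs):
        A[i][col] = 1
        A[j][col] = -1
    return A


def gram(A):
    """A times A transposed, as a list of lists."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in A] for ri in A]


def project(x, A):
    """Orthogonal projection of row vector x onto the row space of A,
    computed as (1/m) * x A^T A, which is valid because A A^T = m I - J."""
    m = len(A)
    net = [sum(a * xi for a, xi in zip(row, x)) for row in A]  # x A^T
    return [sum(net[i] * A[i][c] for i in range(m)) / m for c in range(len(x))]


A = difference_matrix(5)
G = gram(A)  # expected: 4 on the diagonal, -1 elsewhere
x = [1, -1, 0, 1, 1, -1, 0, 0, 1, -1]
p = project(x, A)
```

The squared length of `p` equals the squared length of the unnormalized net-win vector divided by the number of items, which is the isometry used in the text.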
Since , two users from the same cluster have the same expected net-win vectors. With a slight abuse of notation, let denote the common expected net-win vectors for users in cluster , where . The following lemma confirms that after the projection, and thus the projection preserves the distances between different clusters.
Assume for some constant . If for some constant and , then there exists some constant such that with high probability, for any ,
The lower bound is necessary. When is too small, ’s all become very close to the all-zero vector and the distance between different clusters given by is too small to distinguish different clusters. Even though our theorem requires or is very large, our experiment shows that the ’s are in fact well separated for any . Moreover, the proof indicates that Lemma 2 applies to general pairwise comparison models as long as the probability that item is preferred over item minus the probability that item is preferred over item can be parameterized as for some sigmoid function (in the BT model, ); the upper bound changes to , where .
Next, we show the net-win vectors are much less noisy than the comparison vectors. In particular, let , and then , which is much smaller than the deviation .
If , then with high probability,
preserves the distances between different clusters and at the same time dramatically reduces the noise variances. In particular, if
, then the net-win vectors corresponding to different clusters are well-separated; K-means or some thresholding-based algorithm will work. In the next section, we will show that spectral clustering based on the net-win vectors does even better and works if when and are on the same order.
Finally, we point out that the idea of projection or equivalently the net-win vectors introduced in this subsection, is not specific to the BT model and is applicable to general pairwise comparison models.
V User clustering and score vector estimation
In this section, we analyze Steps 2 and 3 of Algorithm 1. Step 2 clusters the net-win vectors by a variation of the standard spectral clustering algorithm. After clustering the users, the algorithm estimates the score vectors for each cluster using sample . Recall that the supports of and are independent, which is important for the analysis to decouple the two steps.
V-A User clustering
Step 2 of Algorithm 1 first computes the best rank approximation of , and then clusters the rows of by a simple threshold-based clustering algorithm. We consider this threshold-based clustering algorithm because it is easy to analyze. However, in the experiments presented later, the more robust K-means algorithm is used instead.
The use of can be understood from a geometric projection point of view. Let . Since the users from the same cluster have the same expected score vector, the rank of is . In other words, the expected net-win vectors lie in an -dimensional subspace of . Therefore, similar to the projection idea introduced in Section IV, we may de-noise the net-win vectors by projecting them onto this -dimensional subspace. However, this -dimensional subspace is determined by which is unobservable. Here the key idea is that is a perturbation of and thus the space spanned by the top right singular vectors of is close to the desired -dimensional subspace. Hence, we can de-noise the net-win vectors by projecting them onto the space spanned by the top right singular vectors of , which are exactly . The following lemma shows that such a projection is effective in de-noising. In particular, it shows which is much smaller than the deviation bound as shown by Lemma 3.
If , then with high probability,
Using a counting argument together with Lemma 4, we can show, for most users , is close to its expected comparison vector .
Let , then with high probability, there are at most
users such that .
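The two parts of Step 2 described at the start of this subsection can be sketched with NumPy. The rank-r truncation uses `numpy.linalg.svd`; the greedy threshold rule is a hypothetical stand-in, since Algorithm 1 is not reproduced in this section, and the names `rank_r_approximation`, `threshold_clusters` and `tau` are assumptions made for the example.

```python
import numpy as np


def rank_r_approximation(S, r):
    """Best rank-r approximation of S in spectral norm, via truncated SVD."""
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, :r] * sigma[:r]) @ Vt[:r, :]


def threshold_clusters(rows, tau):
    """Greedy threshold clustering (assumed rule, for illustration only):
    take an unassigned row as a seed and assign every unassigned row within
    Euclidean distance tau of the seed to the seed's cluster."""
    n = rows.shape[0]
    labels = -np.ones(n, dtype=int)
    next_label = 0
    for u in range(n):
        if labels[u] != -1:
            continue
        dist = np.linalg.norm(rows - rows[u], axis=1)
        members = (dist <= tau) & (labels == -1)
        labels[members] = next_label
        next_label += 1
    return labels


# Toy data: two groups of noisy rows around two fixed centres.
rng = np.random.default_rng(0)
centres = np.array([[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]])
S = np.repeat(centres, 10, axis=0) + 0.02 * rng.standard_normal((20, 4))
labels = threshold_clusters(rank_r_approximation(S, 2), tau=0.5)
```

Projecting onto the top singular directions never increases the distance between two rows, so rows that start close together stay within the threshold, while most of the separation between the cluster means is retained.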
V-B Score vector estimation
In Step 3, Algorithm 1 estimates the score vectors for each cluster separately. When there is no clustering error, the problem reduces to the inference problem for the classical Bradley-Terry model. In particular, if we let be the number of times item is preferred over item , then the ranking problem can be solved by the maximum likelihood estimation
In general, the clustering step is not perfect, but if there are sufficiently many pairwise comparisons, Theorem 1 shows that the clusters can be approximately recovered with high probability. In this case, Algorithm 1 simply views the users in each cluster as from the same true cluster, and again solves the optimization problem corresponding to the maximum likelihood estimation for the Bradley-Terry model for each cluster.
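For a single cluster, the maximum likelihood problem can be solved numerically in several ways. The sketch below uses the classical minorization-maximization (Zermelo-type) iteration for the Bradley-Terry likelihood, which is a swapped-in standard technique and not necessarily the solver used in Algorithm 1. The function name, the win-count layout `wins[i][j]`, and the centering of the output to zero sum are assumptions for illustration.

```python
import math


def bt_mle(wins, iters=2000, tol=1e-12):
    """Bradley-Terry maximum likelihood scores from pairwise win counts.

    wins[i][j] is the number of times item i was preferred over item j.
    Assumes every item has at least one win and the comparison graph is
    connected, so that the maximizer exists. Returns scores summing to zero.
    """
    m = len(wins)
    p = [1.0] * m  # p[i] = exp(score of item i)
    for _ in range(iters):
        new_p = []
        for i in range(m):
            total_wins = sum(wins[i][j] for j in range(m) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(m)
                if j != i
            )
            new_p.append(total_wins / denom)
        # Fix the scale: geometric mean 1, i.e. scores sum to zero.
        log_mean = sum(math.log(x) for x in new_p) / m
        new_p = [x / math.exp(log_mean) for x in new_p]
        delta = max(abs(a - b) for a, b in zip(p, new_p))
        p = new_p
        if delta < tol:
            break
    return [math.log(x) for x in p]


# Noise-free check: feed expected win counts generated from known scores.
theta = [0.5, -0.2, 0.1, -0.4]
wins = [
    [0.0 if i == j else 100.0 / (1.0 + math.exp(theta[j] - theta[i]))
     for j in range(4)]
    for i in range(4)
]
estimate = bt_mle(wins)
```

With expected counts as input the true scores satisfy the likelihood equations exactly, so the iteration should return them up to numerical tolerance; with real counts pooled from an approximately recovered cluster, the output is only an estimate.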
Take one such cluster as an example. Recall that denotes the set difference between the true cluster and the estimated cluster . It follows that at most users in are from other clusters and at most users in are assigned to wrong clusters. To simplify the notation, we omit the subscript and use to denote the true score vector for the cluster throughout this section. Let be the estimated score vector for cluster . The following theorem shows that when the number of comparisons is large enough, the relative error goes to zero when . We should emphasize that is only a good approximation for the score vectors of the users from cluster .
Let denote an estimator of a fixed cluster . Then there exists some constant such that with high probability
We first introduce some additional notation used in the proofs. Let denote the identity matrix. Let denote the vector with all-one entries and denote the matrix with all-one entries. For a matrix and a vector , let denote the matrix formed by adding as a column to the end of . For two matrices , we write if is positive semi-definite.
VI-A Proof of Theorem 1
Let be the set of good users and Lemma 5 shows that the number of bad users
Following the proof of Proposition in , we can conclude that there exists a permutation such that, for all and
VI-B Proof of Theorem 2
From Theorem 1, we get that there exists a permutation such that,
which requires and , respectively. Notice that the former condition is more stringent than the latter one; thus the clustering step needs more pairwise comparisons than the score vector estimation step to achieve the same error rate.
VI-C Proof of Theorem 3
Let denote the set difference between the true cluster and the estimated cluster. Recall that and are the random variables indicating ’s comparison result of . The estimated score vector is given by , where
Let be the random variables indicating if compared and . By definition,
where . Let . As is the optimal solution,
where the second step is by Taylor expansion and for some . Define , where denotes the diagonal matrix formed by vector and is known as the Laplacian. By the Cauchy-Schwarz inequality,
where the second inequality follows because since for any .
Let . First we bound . For each ,
The first term is independent of . For , and . By Bernstein’s inequality, with high probability for large ,
We bound the next two terms by
Since the matrix only depends on but not the comparison results, the right hand side above is independent of or . As are independent Bernoulli random variables with parameter , with high probability for large ,