Clustering and Inference From Pairwise Comparisons

02/16/2015, by Rui Wu, et al.

Given a set of pairwise comparisons, the classical ranking problem computes a single ranking that best represents the preferences of all users. In this paper, we study the problem of inferring individual preferences, arising in the context of making personalized recommendations. In particular, we assume that there are n users of r types; users of the same type provide similar pairwise comparisons for m items according to the Bradley-Terry model. We propose an efficient algorithm that accurately estimates the individual preferences for almost all users, if there are $r \max\{m, n\} \log m \log^2 n$ pairwise comparisons per type, which is near optimal in sample complexity when r only grows logarithmically with m or n. Our algorithm has three steps: first, for each user, compute the net-win vector which is a projection of its $\binom{m}{2}$-dimensional vector of pairwise comparisons onto an m-dimensional linear subspace; second, cluster the users based on the net-win vectors; third, estimate a single preference for each cluster separately. The net-win vectors are much less noisy than the high dimensional vectors of pairwise comparisons and clustering is more accurate after the projection as confirmed by numerical experiments. Moreover, we show that, when a cluster is only approximately correct, the maximum likelihood estimation for the Bradley-Terry model is still close to the true preference.


I Introduction

The question of ranking items using pairwise comparisons is of interest in many applications. Some typical examples are from sports, where pairs of players play against each other and people are interested in ranking the players from past games. This type of ranking problem is usually studied using the Bradley-Terry model [4], where each item $i$ is associated with a score $\theta_i$ measuring its competitiveness, and item $i$ is preferred over item $j$ with probability $\frac{e^{\theta_i}}{e^{\theta_i}+e^{\theta_j}}$, independently across comparisons.

For this model, the maximum likelihood estimation of the score vector can be solved efficiently [15].
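To make the model concrete, here is a minimal Python sketch (ours, not from the paper) that evaluates the Bradley-Terry preference probability for a given score vector and draws comparison outcomes from it; all function and variable names are our own.

```python
import numpy as np

def bt_win_prob(theta, i, j):
    """Probability that item i is preferred over item j under Bradley-Terry."""
    return np.exp(theta[i]) / (np.exp(theta[i]) + np.exp(theta[j]))

def sample_comparison(theta, i, j, rng):
    """Return +1 if item i wins the comparison against item j, else -1."""
    return 1 if rng.random() < bt_win_prob(theta, i, j) else -1

rng = np.random.default_rng(0)
theta = np.array([1.0, 0.0, -1.0])          # scores for 3 items, higher is better
print(bt_win_prob(theta, 0, 2))             # item 0 is strongly preferred over item 2
print([sample_comparison(theta, 0, 1, rng) for _ in range(5)])
```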

There are other examples where comparisons are obtained implicitly. For example, when a user clicks a result from a list returned by a search engine for a given request, it implies that this user prefers this result over nearby results on the list. Similarly, when a customer buys a product from an online retailer, it implies that this customer prefers this product over previously browsed products. Businesses providing these services are interested in inferring users’ rankings of items. In these examples, users can have different scores for the same item and a single score vector is insufficient to capture individual preferences. Therefore it is more appropriate to view the user preferences as generated from a mixture of Bradley-Terry models. Though this mixture model has been used in many fields (see [1, 23] and the references therein), little is known about how to cluster the users and learn the individual preferences efficiently, and how many pairwise comparisons are needed for a target estimation error.

In this work, we study the following mixture Bradley-Terry model: users are clustered into different types; users of the same type have the same score vector; every user independently generates a few pairwise comparisons according to the Bradley-Terry model. Notice that under our model users of the same type will have similar but not necessarily identical pairwise comparisons. The task is to estimate the score vector for each user. Essentially, we would like to cluster the users using the observed pairwise comparisons and then estimate the score vector for each cluster. However, there are two key challenges. First, for each user, if we stack all the possible pairwise comparisons as a vector, this comparison vector lies in a high dimensional space and only a small number of its entries are observed. Hence, directly clustering users based on the comparison vectors is likely to be too noisy to work well; our numerical experiments (see Section VII) confirm as much. Second, although standard algorithms like maximum likelihood estimation [15, 22] are available for estimating the score vector once the clusters (users of the same type) are exactly found, it is still unclear how the algorithms perform when the clusters are only approximately recovered.

Our first contribution is to propose and show the effectiveness of clustering users according to their net-win vectors. A net-win vector for a user is a vector of length $m$ whose $i$-th coordinate counts the number of times item $i$ is preferred over other items minus the number of times other items are preferred over item $i$, according to this user’s pairwise comparisons. The effectiveness of net-win vectors in clustering users relies on the following surprising fact: the means of all the comparison vectors are close to an $m$-dimensional linear subspace, and the net-win vectors are essentially the projections of the comparison vectors onto this low-dimensional linear subspace. We show that the projection to the net-win vectors preserves the distances between different clusters, while the net-win vectors are much less noisy than the comparison vectors. Given good separation of the net-win vectors corresponding to different clusters, we show that a standard spectral clustering algorithm approximately recovers the user clusters.
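The net-win vector is simple to compute from a user's observed comparisons. The following sketch (ours; the paper additionally normalizes by a constant, which we omit) illustrates the definition.

```python
import numpy as np

def net_win_vector(comparisons, m):
    """Net-win vector of one user.

    comparisons: list of (i, j, y) with y = +1 if item i was preferred over
    item j and y = -1 otherwise; unobserved pairs are simply absent.
    Returns a length-m vector whose i-th entry is (#wins of i) - (#losses of i).
    """
    w = np.zeros(m)
    for i, j, y in comparisons:
        w[i] += y
        w[j] -= y
    return w

# user preferred 0 over 1, 0 over 2, and 2 over 1
print(net_win_vector([(0, 1, 1), (0, 2, 1), (1, 2, -1)], m=3))  # [ 2. -2.  0.]
```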

Our second contribution is to show that, even though the clusters have a few erroneously assigned users, the maximum likelihood estimator for the Bradley-Terry model is still close to the true score vector for this cluster. In our algorithm, as we only expect to approximately recover the user clusters, this robustness result ensures that we can still approximately recover the score vectors for most users.

The results for the clustering and estimation steps can be combined to provide a performance guarantee for the overall algorithm. Our algorithm accurately estimates the score vectors for most users with only $r\max\{m,n\}\log m\log^2 n$ pairwise comparisons per cluster, where $r$ is the number of user types (clusters), $m$ is the number of items, and $n$ is the total number of users. When there is only one cluster, it is known that on the order of $m\log m$ pairwise comparisons are required for any algorithm to accurately estimate the score vector [22]. Also, each user needs to provide at least one pairwise comparison; otherwise there is no hope of accurately estimating the preference of this user, so at least $n$ pairwise comparisons are needed in total to learn individual preferences. When $r$ is of order $\log m$ or $\log n$, the sample complexity of our algorithm matches the lower bounds up to logarithmic factors.

I-A Related Work

In this section, we point out some connections of our model and results to prior work. There is a vast literature on the ranking and related rating prediction problems; here we cover a fraction of it we see as most relevant.

Rank aggregation has been extensively studied across various disciplines including statistics, psychology, sociology, and computer science [16, 14, 6, 27, 9]. The Bradley-Terry (BT) model proposed in [5, 19] and its various extensions are widely used for studying the rank aggregation problem [15, 12, 21, 25, 2, 24]. The classical results in [15] show that the likelihood function under the BT model is concave and the ML estimator for the score vector can be efficiently found using an EM algorithm. It is further shown in [22] and [13] that on the order of $m\log m$ pairwise comparisons are necessary for any algorithm to accurately infer the score vector, and that on the order of $m\log m$ randomly chosen pairwise comparisons are sufficient for the ML estimator. In this paper, we show the ML estimator is able to estimate the score vector accurately even with a small number of arbitrarily corrupted pairwise comparisons. In addition to ML estimation, several Markov chain based iterative methods have been proposed in [10, 22], and have been shown to accurately estimate the score vector with on the order of $m\log m$ randomly chosen pairwise comparisons, which matches the sample complexity of the ML estimator.

Previous works on rank aggregation, however, mostly focus on a single type of users and aim to combine the observed user preferences into a single ranking that best represents the preferences of all users. Little is known about clustering and learning individual preferences when there are multiple types of users. In this paper, we consider a mixed Bradley-Terry model to capture the heterogeneity of the user preferences. Our mixed BT model is closely related to the so-called mixed multinomial logit model studied in [1] and [23]. Ammar et al. [1] study a clustering problem similar to ours under the mixed multinomial logit model, where each user provides a set of favorite items instead of pairwise comparisons, and users are clustered based on the overlaps between the sets of favorite items. Under a geometric decay condition on the score vectors, their algorithm is shown to cluster users correctly with high probability, but only when the number of users is small. However, it is unclear whether the geometric decay condition holds in practice, and more importantly how the algorithm performs when there are a large number of users. Oh and Shah [23] study a different problem of estimating the model parameters under the mixed multinomial logit model. A tensor decomposition based algorithm is shown to estimate the model parameters accurately given sufficiently many pairwise comparisons per component. If our algorithm is applied to estimate the model parameters, fewer pairwise comparisons per component are needed (since there is no need to estimate the preferences for every user in this context, the dependency on $n$ in our sample complexity can be dropped). Another mixture approach is proposed in [7] for clustering heterogeneous ranking data, and an efficient EM algorithm is derived for parameter estimation. This method can take rankings of different lengths as input; however, no analytical performance guarantee is provided for the clustering. Very recently, a nuclear norm regularization approach was proposed in [18] to estimate the score vectors for all users. Assuming each user has a unique score vector and the score matrix formed by stacking all score vectors as rows is approximately of low rank, they prove a bound on the estimation error given sufficiently many randomly chosen pairwise comparisons. However, it is not immediately clear how the nuclear norm approach performs in terms of the estimation error of the score vector for each individual.

Finally, we point out that there is a large body of work studying the related problem of rating prediction. A popular approach is based on matrix completion methods [8, 17], where the incomplete rating matrix is assumed to be of low rank. Another line of work [3, 28, 20] assumes there are multiple types of users and that users of the same type provide similar ratings. However, rating based methods have several limitations compared to pairwise preference based methods. First, not all preferences are available in the form of ratings, while numerical ratings can always be transformed into pairwise comparisons. Second, ratings are user-biased, e.g., a user may give higher or lower ratings on average than others, while pairwise comparisons are unaffected by such a bias. Third, pairwise comparisons are more reliable and consistent than ratings, e.g., it is easier for a user to compare two items than to assign scores to them. Algorithmically, learning preferences from pairwise comparisons is more challenging, because the vectors of pairwise comparisons lie in a $\binom{m}{2}$-dimensional space, while the vectors of ratings lie in an $m$-dimensional space. We overcome this challenge by a simple but non-trivial projection of the comparison vectors into a low-dimensional linear subspace.

II Problem Setup

Consider a system with $r$ user clusters and $m$ items, and let $n$ denote the total number of users. Each user $u$ has a score vector $\theta_u \in \mathbb{R}^m$ for the items, and he/she compares items according to the Bradley-Terry model: item $i$ is preferred over item $j$ with probability $\frac{e^{\theta_{ui}}}{e^{\theta_{ui}}+e^{\theta_{uj}}}$ and vice versa with probability $\frac{e^{\theta_{uj}}}{e^{\theta_{ui}}+e^{\theta_{uj}}}$. Assume users in the same cluster have the same score vector, and denote the common score vector of cluster $k$ by $\theta^{(k)}$. As $\theta$ and $\theta + c\mathbf{1}$ for any constant $c$ define the same probability distribution of pairwise comparisons in the Bradley-Terry model, $\theta$ is only identifiable up to a constant shift. To eliminate the ambiguity and without loss of generality, we always shift $\theta$ so that its coordinates sum to zero.

The overall comparison result is represented by an $n \times \binom{m}{2}$ sample comparison matrix $R$. The $u$-th row $R_u$ is the comparison vector of user $u$. The columns are indexed by pairs $(i,j)$ with $i<j$, and the $(i,j)$-th column corresponds to the comparisons of items $i$ and $j$. For each user $u$ and items $i$ and $j$ with $i<j$, user $u$’s comparison of items $i$ and $j$ is sampled independently with probability $1-\epsilon$, where $\epsilon$ is the erasure probability. Let $R_{u,(i,j)}=1$ if $u$ prefers $i$ over $j$, $R_{u,(i,j)}=-1$ if $u$ prefers $j$ over $i$, and $R_{u,(i,j)}=0$ if $u$’s comparison of $i$ and $j$ is not sampled. Then $\mathbb{E}[R_{u,(i,j)}] = (1-\epsilon)\,\frac{e^{\theta_{ui}}-e^{\theta_{uj}}}{e^{\theta_{ui}}+e^{\theta_{uj}}}$.

Our goal is to estimate the score vectors $\{\theta_u\}$ from $R$.

To simplify the analysis, we will assume the score vectors are generated independently as follows: for each cluster $k$ and item $i$, generate the entries i.i.d. uniformly in an interval whose width is governed by a parameter $b$, and then recenter each vector so that its coordinates sum to zero. Consequently, the entries of each $\theta^{(k)}$ are bounded by a constant multiple of $b$ and sum to zero, but they are not independent.
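For intuition, the following Python sketch (ours, not the paper's code) generates synthetic data from this mixture model: cluster score vectors drawn uniformly and recentered, and a sparse comparison matrix $R$ with erasure probability $\epsilon$, following the sampling convention above. All names are our own.

```python
import numpy as np

def generate_data(n, m, r, b, eps, seed=0):
    """Synthetic mixture Bradley-Terry data.

    Returns the cluster labels z (length n), the r x m score matrix theta,
    and the n x C(m,2) comparison matrix R with entries in {-1, 0, +1}.
    """
    rng = np.random.default_rng(seed)
    theta = rng.uniform(-b, b, size=(r, m))
    theta -= theta.mean(axis=1, keepdims=True)      # recenter: coordinates sum to zero
    z = rng.integers(r, size=n)                      # cluster label of each user
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    R = np.zeros((n, len(pairs)))
    for u in range(n):
        t = theta[z[u]]
        for c, (i, j) in enumerate(pairs):
            if rng.random() < 1 - eps:               # comparison observed
                p = 1.0 / (1.0 + np.exp(-(t[i] - t[j])))   # BT probability i beats j
                R[u, c] = 1 if rng.random() < p else -1
    return z, theta, R

z, theta, R = generate_data(n=50, m=10, r=2, b=2.0, eps=0.8)
print(R.shape, np.mean(R != 0))   # roughly a (1 - eps) fraction of entries observed
```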

II-A Notation and Outline

Let $X = \sum_i \sigma_i u_i v_i^\top$ denote the singular value decomposition of a matrix $X$, where the singular values satisfy $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$. The spectral norm of $X$ is denoted by $\|X\|$, which is equal to the largest singular value. The best rank-$k$ approximation of $X$ is defined as $\sum_{i=1}^{k} \sigma_i u_i v_i^\top$. For vectors, let $\langle \cdot, \cdot \rangle$ denote the inner product between two vectors; the only norm that will be used is the usual $\ell_2$ norm, denoted as $\|\cdot\|$. In this paper, all vectors are row vectors. We say an event occurs with high probability when the probability of occurrence of that event goes to one as $m$ and $n$ go to infinity.
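The best rank-$k$ approximation, used repeatedly below, can be computed by truncating the SVD, as in this short sketch (ours):

```python
import numpy as np

def best_rank_k_approx(X, k):
    """Best rank-k approximation of X in spectral/Frobenius norm, via truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

X = np.random.default_rng(1).normal(size=(6, 4))
X2 = best_rank_k_approx(X, 2)
print(np.linalg.matrix_rank(X2))   # 2
```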

The rest of the paper is organized as follows. Section III describes our three-step algorithm and summarizes the main results. The key idea of de-noising the comparison vectors by projection in the first step is explained in Section IV. The details of the last two steps of our algorithm, i.e., user clustering and score vector estimation, are provided in Section V. All the proofs can be found in Section VI. The experimental results are given in Section VII. Section VIII concludes the paper.

III Algorithm and Main Result

Our algorithm for clustering users and inferring their preferences is presented as Algorithm 1. The basic idea is to estimate the score vectors in two steps: first cluster the users and then estimate a score vector for each cluster separately.

The difficulty lies in the clustering step. Recall that, in our problem, each user is represented by a comparison vector of length $\binom{m}{2}$, and only roughly a $(1-\epsilon)$ fraction of its entries are observed. These comparison vectors are so noisy that directly clustering them results in poor performance, a fact which we confirm in our experiments in Section VII.

  Step 0: Sample splitting. Let $\Omega$ be the support of $R$, i.e., the set of observed comparisons. We construct two sets $\Omega_1$ and $\Omega_2$ by independently assigning each element of $\Omega$ only to $\Omega_1$ with some probability, only to $\Omega_2$ with some probability, and to both $\Omega_1$ and $\Omega_2$ with the remaining probability. Define $R^{(1)}$ and $R^{(2)}$ to be the restrictions of $R$ to $\Omega_1$ and $\Omega_2$, respectively.
  Step 1: Denoising. Compute the net-win matrix $N$ from $R^{(1)}$; the $u$-th row of $N$ is the net-win vector of user $u$.
  Step 2: User clustering. Let $\hat N$ be the rank-$r$ approximation of $N$. Construct the clusters sequentially. For $k=1,\dots,r$, after the first $k-1$ clusters have been constructed, choose an initial user not in the first $k-1$ clusters uniformly at random, and let the $k$-th cluster consist of all unclustered users whose rows of $\hat N$ are within a threshold distance of the initial user’s row, where the threshold is specified later. Assign each remaining unclustered user to a cluster arbitrarily.
  Step 3: Score vector estimation. For users in each estimated cluster, the estimated score vector is given by the maximum likelihood estimator for the Bradley-Terry model computed from the comparisons in $R^{(2)}$ of the users in that cluster.
Algorithm 1 Multi-Cluster Projected Ranking
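A compact Python sketch of Steps 1 and 2 follows (our illustrative implementation, not the authors' code; it uses k-means on the rank-$r$ approximation instead of the threshold-based rule, as the paper's own experiments do, and it omits the sample splitting of Step 0).

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def cluster_users(R, m, r):
    """Steps 1-2: net-win denoising followed by spectral clustering.

    R: n x C(m,2) matrix with entries in {-1, 0, +1}, columns ordered as
    (0,1), (0,2), ..., (m-2,m-1).  Returns estimated cluster labels.
    """
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    # Step 1: net-win vectors, N[u, i] = wins of item i minus losses of item i
    N = np.zeros((R.shape[0], m))
    for c, (i, j) in enumerate(pairs):
        N[:, i] += R[:, c]
        N[:, j] -= R[:, c]
    # Step 2: rank-r denoising of N, then k-means on the rows
    U, s, Vt = np.linalg.svd(N, full_matrices=False)
    N_r = (U[:, :r] * s[:r]) @ Vt[:r, :]
    _, labels = kmeans2(N_r, r, minit="++", seed=0)
    return labels

# Example, reusing the generate_data sketch from Section II:
# z, theta, R = generate_data(n=200, m=20, r=2, b=2.0, eps=0.7)
# labels = cluster_users(R, m=20, r=2)
```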

We overcome this difficulty by reducing the dimension of the comparison vectors. Consider user $u$ with comparison vector $R_u$. For each item $i$, define the normalized net number of wins of item $i$ as the number of times item $i$ is preferred over other items minus the number of times other items are preferred over item $i$, scaled by a normalizing constant. We call the vector $N_u$ of these quantities the net-win vector of user $u$. To simplify the notation, let $A$ be the $m \times \binom{m}{2}$ matrix whose $(i,j)$-th column is the correspondingly scaled version of $e_i - e_j$, where $e_i$ is the length-$m$ vector with all $0$s except for a $1$ in the $i$-th coordinate; then it is easy to verify that

$$N_u = R_u A^\top. \qquad (1)$$

The effectiveness of net-win vectors in clustering users relies on the following surprising fact: the expected comparison vectors of all users, which are $\binom{m}{2}$-dimensional nonlinear functions of the score vectors, are close to the $m$-dimensional linear subspace spanned by the rows of $A$, i.e., the row space of $A$. This suggests denoising the comparison vectors $R_u$ of all users by projecting them onto the row space of $A$. The projections of the $R_u$’s turn out to be isometric to the net-win vectors $N_u$’s: recalling the definition of $N_u$ in (1), multiplication by $A^\top$ acts, up to an isometry, like an orthogonal projection onto the row space of $A$. We show in Section IV that the expected net-win vectors of any two users in two different clusters remain well separated, while the noise in the net-win vectors is much smaller than in the comparison vectors. Therefore, the net-win vectors are much less noisy and easier to separate than the comparison vectors $R_u$.

We then cluster the net-win vectors by a standard spectral clustering algorithm in Step 2 of Algorithm 1. Let $C_1,\dots,C_r$ denote the true clusters and $\hat C_1,\dots,\hat C_r$ denote the clusters generated by Algorithm 1 with a given threshold. Since the clusters are only identifiable up to a permutation of the indices, we define the number of errors in $\hat C_k$, relative to a permutation $\sigma$ of the cluster indices, as $|C_{\sigma(k)} \,\triangle\, \hat C_k|$, where $\triangle$ denotes the symmetric difference of two sets. The following theorem highlights a key contribution of the paper: the projection of comparison vectors onto the row space of $A$ results in significant denoising, which allows for accurately clustering the users with only a small number of pairwise comparisons.

Theorem 1.

Let . If for any arbitrarily small constant or for some constant , then with high probability, there exists a permutation such that,

and

In particular, when

the fraction of misclustered users in each cluster , i.e.,

Theorem 1 implies that, if $m$ and $n$ are on the same order, each user only needs to give a small (polylogarithmic) number of pairwise comparisons to allow for correct clustering of all but a vanishing fraction of users. Notice that a user needs to give at least one pairwise comparison. The specific constants in the theorem are chosen for simplicity of the proof and can be relaxed. The lower bound on $b$ is required: consider the extreme case where $b=0$; then the score vectors for all users are all-zero vectors and the clusters are unidentifiable from pairwise comparisons. The upper bound on $b$ appears to be an artifact of our analysis, as suggested by our numerical experiments. Note that when $b$ is a large constant, the most favorable item is preferred over the least favorable item with probability close to $1$.

After estimating the clusters, Algorithm 1 treats each cluster separately and estimates a score vector using the maximum likelihood estimation for the single cluster Bradley-Terry model. In order to avoid dependence between Step 2 and Step 3, we generate two smaller samples $R^{(1)}$ and $R^{(2)}$ by subsampling $R$, and use them in the two steps respectively. It is not hard to verify that the support sets $\Omega_1$ and $\Omega_2$ are independent.

The overall performance of Algorithm 1 is characterized by the following theorem, which shows that, when the number of pairwise comparisons is large enough, the estimates of the score vectors are accurate for most users with high probability.

Theorem 2.

Define

Assume for any arbitrarily small constant , then there exists a constant such that with high probability

except for users. In particular, if , then except for users.

Theorem 2 shows that the estimation error depends on the maximum of two terms: the first characterizes the fraction of misclustered users in a given cluster, as shown by Theorem 1; the second characterizes the estimation error of the maximum likelihood estimation assuming the clustering is perfect. If $r=1$, then there is no clustering error and the estimation error only depends on the second term, which matches the existing results in [22] with a single type of user. The lower bounds in [22] and [13] show that a minimum number of pairwise comparisons per type is needed even when the clusters are known. Also, a user needs to provide at least one pairwise comparison for us to infer his/her preference, which means that at least $n$ pairwise comparisons in total are required to infer the preferences for most users. Theorem 2 shows that Algorithm 1 needs approximately $r\max\{m,n\}\log m\log^2 n$ comparisons per cluster, which matches the lower bounds up to logarithmic factors if $r$ is poly-logarithmic in $m$ or $n$.

IV Denoising using Net-win Vectors

In this section, we analyze Step 1 of Algorithm 1. We first argue that directly clustering based on the comparison vector is too noisy to work well. Then, we show the net-win vectors preserve the distances between different clusters from a geometric projection point of view. Finally, we prove the net-win vectors are much less noisy than the comparison vectors.

Recall that $R_u$ is a $\binom{m}{2}$-dimensional vector of all pairwise comparisons for user $u$. For any pair $(i,j)$, the mean of the $(i,j)$-th entry is

$$\mu_{u,(i,j)} = (1-\epsilon)\,\frac{e^{\theta_{ui}}-e^{\theta_{uj}}}{e^{\theta_{ui}}+e^{\theta_{uj}}}.$$

Since two users from the same cluster have the identical score vector, the means of their comparison vectors are also identical. With a slight abuse of notation, let $\mu^{(k)}$ denote the common mean of the comparison vectors for users in cluster $k$. For $k \neq l$, we call $\|\mu^{(k)}-\mu^{(l)}\|$ the distance between clusters $k$ and $l$. One can check that these distances concentrate with high probability, so if we observed the means of all the comparison vectors, then clustering would become trivial. In our problem, for each user $u$, we only observe $R_u$, which is a noisy observation of $\mu_u$. More specifically, since the expected number of comparisons a user provides is $(1-\epsilon)\binom{m}{2}$, we would expect the deviation $\|R_u - \mu_u\|$ to be much larger than the distances between different clusters. As a result, the comparison vectors of two users from the same cluster are likely to be far apart, while the comparison vectors of two users from different clusters might happen to be close. Therefore, the comparison vectors are too noisy to be clustered directly.

In the following, we explain how to denoise the comparison vectors. An interesting observation is that the mean of the comparison vector lies close to an $m$-dimensional linear subspace. In particular, using the definition of $\mu_u$, each entry can be written as

$$\mu_{u,(i,j)} = (1-\epsilon)\tanh\!\Big(\frac{\theta_{ui}-\theta_{uj}}{2}\Big),$$

a nonlinear function of the score differences. Although this function is nonlinear, we are able to show that the angle between $\mu_u$ and the $m$-dimensional linear subspace spanned by the rows of $A$, i.e., the row space of $A$, is not large. To see this, first assume $b$ is small, so that $|\theta_{ui}-\theta_{uj}|$ is small for any $i$ and $j$. In this regime, we can linearize the function at $0$ and get

$$\mu_{u,(i,j)} \approx \frac{1-\epsilon}{2}\,(\theta_{ui}-\theta_{uj}),$$

which means that $\mu_u$ is approximately in the row space of $A$ and the angle is close to $0$. Somewhat surprisingly, the angle is still not too large even if $b$ becomes so large that the linear approximation no longer works. Consider the extreme case $b \to \infty$: under our assumption that the score entries are uniformly distributed, the entries of $\mu_u$ essentially become $(1-\epsilon)$ times the signs of the score differences. The following lemma characterizes the angle in this limit.

Lemma 1.

For any and assume for any and . Define row vector as . Then the angle between and the row space of is in the limit as .

For the intermediate range of $b$, we do not have an analytical upper bound on the angle. Through extensive simulations, plotted in Figure 1, we observe that the angle, averaged over independent simulations, varies monotonically with $b$ and remains bounded away from $\pi/2$.

Fig. 1: Cosine of the angle between $\mu$ and the row space of $A$ for various values of $b$.
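The quantity in Figure 1 can be estimated numerically. The following sketch (ours) builds a matrix $A$ with columns proportional to $e_i - e_j$ (our choice of normalization, which may differ from the paper's), draws a uniform score vector, forms the expected comparison vector, and measures the cosine of its angle to the row space of $A$.

```python
import numpy as np

def cosine_to_rowspace_A(m, b, seed=0):
    """Cosine of the angle between the expected comparison vector and row space of A."""
    rng = np.random.default_rng(seed)
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    A = np.zeros((m, len(pairs)))
    for c, (i, j) in enumerate(pairs):
        A[i, c], A[j, c] = 1.0, -1.0            # column proportional to e_i - e_j
    theta = rng.uniform(-b, b, size=m)
    theta -= theta.mean()
    mu = np.array([np.tanh((theta[i] - theta[j]) / 2) for i, j in pairs])
    # orthogonal projection of mu onto the row space of A (column space of A.T)
    proj = A.T @ np.linalg.lstsq(A.T, mu, rcond=None)[0]
    return np.linalg.norm(proj) / np.linalg.norm(mu)

for b in (0.1, 1.0, 5.0, 50.0):
    print(b, round(cosine_to_rowspace_A(20, b), 3))
```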

The observation that $\mu_u$ is close to the row space of $A$ suggests that we may denoise the comparison vectors by projecting them onto the row space of $A$. We show in Lemma 6 that the SVD of $A$ has a simple closed form, and that its right singular vectors provide an orthonormal basis of the row space of $A$. The projection of $R_u$ onto the row space of $A$ can therefore be written in this basis, and, interestingly, the resulting coordinate vector is isometric to the normalized net-win vector used in Algorithm 1. (In Algorithm 1, we generate two independent samples from $R$ and the net-win vector is defined using the first sample; here we simply write $R_u$ for ease of notation.) Since the basis is orthonormal, the projection of $R_u$, represented in this basis, coincides with the net-win vector up to a fixed isometry. Hence, the net-win vectors are equivalent to the projection of the comparison vectors onto the row space of $A$. The benefit of using the net-win vectors instead of computing the projection explicitly is that they have a clearer physical meaning and are easier to compute; there is no need to compute the SVD of $A$, which is prohibitive when $m$ is large.
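As a sanity check of this isometry claim, the following numerical sketch (ours, with $A$ taken to have columns $e_i - e_j$, a choice of normalization that may differ from the paper's) verifies that distances between projected comparison vectors and distances between net-win vectors agree up to a single fixed scale factor.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_users = 8, 5
pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
A = np.zeros((m, len(pairs)))
for c, (i, j) in enumerate(pairs):
    A[i, c], A[j, c] = 1.0, -1.0

R = rng.choice([-1.0, 0.0, 1.0], size=(n_users, len(pairs)))   # toy comparison matrix
N = R @ A.T                                                     # net-win vectors, as in (1)

# Orthogonal projector onto the row space of A (A @ A.T is singular, so use pinv)
P = A.T @ np.linalg.pinv(A @ A.T) @ A
RP = R @ P

# Distances between projected comparison vectors vs. between net-win vectors:
# the ratio is a fixed scale factor (sqrt(m) with this normalization of A).
d_proj = np.linalg.norm(RP[0] - RP[1])
d_net = np.linalg.norm(N[0] - N[1])
print(d_proj, d_net, d_net / d_proj)
```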

Since $\mathbb{E}[N_u]$ is determined by $\mu_u$ through (1), two users from the same cluster have the same expected net-win vector. With a slight abuse of notation, let $\bar N^{(k)}$ denote the common expected net-win vector for users in cluster $k$. The following lemma confirms that the expected net-win vectors of different clusters remain well separated after the projection, and thus the projection preserves the distances between different clusters.

Lemma 2.

Assume for some constant . If for some constant and , then there exists some constant such that with high probability, for any ,

Remark 1.

The lower bound on $b$ is necessary. When $b$ is too small, the expected net-win vectors all become very close to the all-zero vector, and the distance between different clusters is then too small to distinguish them. Even though our theorem requires $b$ to be either bounded by a constant or very large, our experiments show that the expected net-win vectors are in fact well separated for any $b$. Moreover, the proof indicates that Lemma 2 applies to general pairwise comparison models, as long as the probability that item $i$ is preferred over item $j$ minus the probability that item $j$ is preferred over item $i$ can be parameterized as a sigmoid-type function of the score difference (in the BT model, this function is $\tanh\big((\theta_i-\theta_j)/2\big)$); the upper bound in the lemma then changes accordingly, with constants depending on the sigmoid function.

Next, we show the net-win vectors are much less noisy than the comparison vectors: the deviation of a net-win vector from its mean is much smaller than the corresponding deviation of the comparison vector.

Lemma 3.

If , then with high probability,

Notice that Lemma 3 is independent of the pairwise comparison model. Together with Lemma 2, it shows that the projection of comparison vectors onto the row space of $A$ preserves the distances between different clusters and at the same time dramatically reduces the noise variance. In particular, if the cluster separation exceeds the noise level, then the net-win vectors corresponding to different clusters are well separated, and K-means or some thresholding-based algorithm is going to work. In the next section, we will show that spectral clustering based on the net-win vectors does even better and works under a weaker condition when $m$ and $n$ are on the same order.

Finally, we point out that the idea of projection or equivalently the net-win vectors introduced in this subsection, is not specific to the BT model and is applicable to general pairwise comparison models.
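Before moving on, here is a quick self-contained numerical illustration (ours) of the two effects above: with sparse observations, within- and between-cluster distances of the raw comparison vectors are nearly indistinguishable, while the net-win vectors show a clearly larger relative gap. All parameter values below are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r, b, eps = 100, 30, 2, 2.0, 0.8
theta = rng.uniform(-b, b, size=(r, m))
theta -= theta.mean(axis=1, keepdims=True)
z = rng.integers(r, size=n)
pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]

R = np.zeros((n, len(pairs)))
N = np.zeros((n, m))                      # net-win vectors
for u in range(n):
    for c, (i, j) in enumerate(pairs):
        if rng.random() < 1 - eps:        # comparison observed
            p = 1 / (1 + np.exp(-(theta[z[u], i] - theta[z[u], j])))
            R[u, c] = 1 if rng.random() < p else -1
            N[u, i] += R[u, c]
            N[u, j] -= R[u, c]

def avg_dist(X, same):
    d = [np.linalg.norm(X[u] - X[v])
         for u in range(n) for v in range(u + 1, n) if (z[u] == z[v]) == same]
    return float(np.mean(d))

# For R the two averages are nearly equal; for N the between-cluster
# average is noticeably larger than the within-cluster one.
for name, X in (("R", R), ("N", N)):
    print(name, "same:", round(avg_dist(X, True), 2), "diff:", round(avg_dist(X, False), 2))
```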

V User clustering and score vector estimation

In this section, we analyze Steps 2 and 3 of Algorithm 1. Step 2 clusters the net-win vectors by a variation of the standard spectral clustering algorithm. After clustering the users, the algorithm estimates the score vector for each cluster using the sample $R^{(2)}$. Recall that the supports of $R^{(1)}$ and $R^{(2)}$ are independent, which is important for the analysis to decouple the two steps.

V-A User clustering

Step 2 of Algorithm 1 first computes the best rank-$r$ approximation of the net-win matrix $N$, and then clusters its rows by a simple threshold-based clustering algorithm. The reason we consider this threshold-based clustering algorithm is that it is easy to analyze. However, in the experiments reported later, the more robust K-means algorithm is used instead.

The use of the rank-$r$ approximation can be understood from a geometric projection point of view. Let $\bar N = \mathbb{E}[N]$. Since users from the same cluster have the same expected net-win vector, the rank of $\bar N$ is at most $r$. In other words, the expected net-win vectors lie in an $r$-dimensional subspace of $\mathbb{R}^m$. Therefore, similar to the projection idea introduced in Section IV, we may de-noise the net-win vectors by projecting them onto this $r$-dimensional subspace. However, this $r$-dimensional subspace is determined by $\bar N$, which is unobservable. Here the key idea is that $N$ is a perturbation of $\bar N$, and thus the space spanned by the top $r$ right singular vectors of $N$ is close to the desired $r$-dimensional subspace. Hence, we can de-noise the net-win vectors by projecting them onto the space spanned by the top $r$ right singular vectors of $N$, which yields exactly the best rank-$r$ approximation $\hat N$. The following lemma shows that such a projection is effective in de-noising. In particular, it shows that the rows of $\hat N$ deviate from the corresponding rows of $\bar N$ by much less than the deviation bound of Lemma 3.

Lemma 4.

If , then with high probability,

Using a counting argument together with Lemma 4, we can show that, for most users, the corresponding row of $\hat N$ is close to the expected net-win vector.

Lemma 5.

Let , then with high probability, there are at most

users such that .

Combined with the fact that the expected net-win vectors are well separated, as shown in Lemma 2, this yields Theorem 1.

V-B Score vector estimation

In Step 3, Algorithm 1 estimates the score vector for each cluster separately. When there is no clustering error, the problem reduces to the inference problem for the classical Bradley-Terry model. In particular, if we let $w_{ij}$ be the number of times item $i$ is preferred over item $j$ within the cluster, then the ranking problem can be solved by the maximum likelihood estimation

$$\hat\theta \in \arg\max_{\theta:\, \sum_i \theta_i = 0}\ \sum_{i \neq j} w_{ij}\Big(\theta_i - \log\big(e^{\theta_i}+e^{\theta_j}\big)\Big).$$

The above optimization is convex and can be solved efficiently [15]. Further, the recent work [22] provides an error bound for the ML estimator when the pairs of items are chosen uniformly and independently.
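A minimal sketch of this per-cluster MLE (our illustration, using scipy's generic optimizer rather than the specialized iteration of [15]; the count matrix and names are our own):

```python
import numpy as np
from scipy.optimize import minimize

def bt_mle(W):
    """Maximum likelihood BT scores from a count matrix W, where W[i, j] is the
    number of comparisons in which item i was preferred over item j."""
    m = W.shape[0]

    def nll(theta):
        # negative log-likelihood: sum_{i != j} W[i, j] * (log(e^ti + e^tj) - ti)
        T_i = theta[:, None]
        T_j = theta[None, :]
        return np.sum(W * (np.logaddexp(T_i, T_j) - T_i))

    res = minimize(nll, np.zeros(m), method="L-BFGS-B")
    return res.x - res.x.mean()               # fix the additive shift

# toy example: item 0 wins most comparisons, item 2 loses most
W = np.array([[0, 8, 9],
              [2, 0, 7],
              [1, 3, 0]], dtype=float)
print(bt_mle(W))    # decreasing scores: theta_0 > theta_1 > theta_2
```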

In general, the clustering step is not perfect, but if there are sufficiently many pairwise comparisons, Theorem 1 shows that the clusters can be approximately recovered with high probability. In this case, Algorithm 1 simply treats all users in an estimated cluster as if they came from the same true cluster, and again solves the optimization problem corresponding to the maximum likelihood estimation for the Bradley-Terry model for each cluster.

Take one such cluster as an example. Recall that the clustering error of a cluster is measured by the symmetric difference between the true cluster and the estimated cluster. It follows that only a bounded number of users in the estimated cluster are from other clusters, and only a bounded number of users of the true cluster are assigned to wrong clusters. To simplify the notation, we omit the cluster subscript, use $\theta^*$ to denote the true score vector for the cluster throughout this section, and let $\hat\theta$ be the estimated score vector for the cluster. The following theorem shows that, when the number of comparisons is large enough, the relative error $\|\hat\theta - \theta^*\|/\|\theta^*\|$ goes to zero. We should emphasize that $\hat\theta$ is only a good approximation for the score vectors of the users from this cluster.

Theorem 3.

Let denote an estimator of a fixed cluster . Then there exists some constant such that with high probability

Theorem 3 extends the previous results in [22] to the setting with clustering errors. Notice that the error bound scales exponentially with $b$. This is likely to be an artifact of our analysis and also appears in the previous results in [22].

VI Proofs

In this section, we present the proofs for the main theorems first and then the lemmas. The proof of Theorem 1 uses Lemma 2 and Lemma 5. We prove Theorem 2 by combining Theorem 1 and Theorem 3.

We first introduce some additional notation used in the proofs. Let $I$ denote the identity matrix. Let $\mathbf{1}$ denote the vector with all-one entries and $J$ denote the matrix with all-one entries. For a matrix $X$ and a vector $v$, let $[X, v]$ denote the matrix formed by adding $v$ as a column to the end of $X$. For two matrices $X$ and $Y$, we write $X \succeq Y$ if $X - Y$ is positive semi-definite.

VI-A Proof of Theorem 1

Recall that . We say a user is a good user if . Under the assumption of Theorem 1, the condition of Lemma 2 holds. Then in view of Lemma 2, for all good users ,

Let be the set of good users and Lemma 5 shows that the number of bad users

Following the proof of Proposition in [28], we can conclude that there exists a permutation such that, for all and

VI-B Proof of Theorem 2

From Theorem 1, we get that there exists a permutation such that,

We then apply Theorem 3, and get the result of Theorem 2. If we want to achieve for the good users, we need

which requires and , respectively. Notice that the former condition is more stringent than the latter one; thus the clustering step needs more pairwise comparisons than the score vector estimation step to achieve the same error rate.

VI-C Proof of Theorem 3

Let denote the set difference between the true cluster and the estimated cluster. Recall that and

are the random variable indicating

’s comparison result of . The estimated score vector is given by , where

Let be the random variables indicating if compared and . By definition,

where . Let . As is the optimal solution,

where the second step is by Taylor expansion and for some . Define , where denotes the diagonal matrix formed by vector and is known as the Laplacian. By the Cauchy-Schwarz inequality,

where the second inequality follows because since for any .

Let . First we bound . For each ,

The first term is independent of . For , and . By Bernstein’s inequality, with high probability for large ,

We bound the next two terms by

Since the matrix only depends on but not the comparison results, the right hand side above is independent of or . As are independent Bernoulli random variables with parameter , with high probability for large ,