A Topic Modeling Approach to Ranking

12/11/2014 · by Weicong Ding, et al.

We propose a topic modeling approach to the prediction of preferences in pairwise comparisons. We develop a new generative model for pairwise comparisons that accounts for multiple shared latent rankings that are prevalent in a population of users. This new model also captures inconsistent user behavior in a natural way. We show how the estimation of latent rankings in the new generative model can be formally reduced to the estimation of topics in a statistically equivalent topic modeling problem. We leverage recent advances in the topic modeling literature to develop an algorithm that can learn shared latent rankings with provable consistency as well as sample and computational complexity guarantees. We demonstrate that the new approach is empirically competitive with the current state-of-the-art approaches in predicting preferences on some semi-synthetic and real world datasets.







1 Introduction

The recent explosion of web technologies has enabled us to collect an immense amount of partial preferences for large sets of items, e.g., products from Amazon, movies from Netflix, or restaurants from Yelp, from a large and diverse population of users through transactions, clicks, check-ins, etc. (e.g., Lu and Boutilier, 2011; Volkovs and Zemel, 2014; Rajkumar and Agarwal, 2014). The goal of this paper is to develop a new approach to model, learn, and ultimately predict the preference behavior of users in pairwise comparisons which can form a building block for other partial preferences. Predicting preference behavior is important to personal recommendation systems, e-commerce, information retrieval, etc.

We propose a novel topic modeling approach to ranking and introduce a new probabilistic generative model for pairwise comparisons that accounts for a heterogeneous population of inconsistent users. The essence of our approach is to view the outcomes of comparisons generated by each user as a probabilistic mixture of a few latent global rankings that are shared across the user-population. This is especially appealing in the context of emerging web-scale applications where (i) there are multiple factors that influence individual preference behavior, e.g., product preferences are influenced by price, brand, etc., (ii) each individual is influenced by multiple latent factors to different extents, (iii) individual preferences for very similar items may be noisy and change with time, and (iv) the number of comparisons available from each user is typically limited. Research on ranking models to-date does not fully capture all these important aspects.

In the literature, we can identify two categories of models. In the first category of models the focus is on learning one global ranking that “optimally” agrees with the observations according to some metric (e.g., Gleich and Lim, 2011; Rajkumar and Agarwal, 2014; Volkovs and Zemel, 2014). Loosely speaking, this tacitly presupposes a fairly homogeneous population of users having very similar preferences. In the second category of models, there are multiple constituent rankings in the user population, but each user is associated with a single ranking scheme sampled from a set of multiple constituent rankings (e.g., Farias et al., 2009; Lu and Boutilier, 2011). Loosely speaking, this tacitly presupposes a heterogeneous population of users who are clustered into different types by their preferences and whose preference behavior is influenced by only one factor. In contrast to both these categories, we model each user’s pairwise preference behavior as a mixed membership latent variable model. This captures both heterogeneity (via the multiple shared constituent rankings) and inconsistent preference behavior (via the probabilistic mixture). This is a fundamental change of perspective from the traditional clustering-based approach to a decomposition-based one.

A second contribution of this paper is the development of a novel algorithmic approach to efficiently and consistently estimate the latent rankings in our proposed model. This is achieved by establishing a formal connection to probabilistic topic modeling where each document in a corpus is viewed as a probabilistic mixture of a few prevailing topics (Blei, 2012). This formal link allows us to leverage algorithms that were recently proposed in the topic modeling literature (Arora et al., 2013; Ding et al., 2013b, 2014) for estimating latent shared rankings. Overall, our approach has a running time and a sample complexity bound that are provably polynomial in all model parameters. Our approach is asymptotically consistent as the number of users goes to infinity even when the number of comparisons for each user is a small constant.

We also demonstrate competitive empirical performance in collaborative prediction tasks. Through a variety of performance metrics, we demonstrate that our model can effectively capture the variability of real-world user preferences.

2 Related Work

Rank estimation from partial or total rankings has been extensively studied over the last several decades in various settings. A prominent setting is one in which individual user rankings (in a homogeneous population) are modeled as independent drawings from a probability distribution which is centered around a single ground-truth global ranking. Efficient algorithms have been developed to estimate the global ranking under a variety of probability models (Qin et al., 2010; Gleich and Lim, 2011; Negahban et al., 2012; Osting et al., 2013; Volkovs and Zemel, 2014). Chief among them are the Mallows model (Mallows, 1957), the Plackett-Luce (PL) model (Plackett, 1975), and the Bradley-Terry-Luce (BTL) model (Rajkumar and Agarwal, 2014).

To account for the heterogeneity in the user population, (Jagabathula and Shah, 2008; Farias et al., 2009) considered models with multiple prevalent rankings and proposed consistent combinatorial algorithms for estimating the rankings. The mixture of Mallows model recently studied in (Lu and Boutilier, 2011; Awasthi et al., 2014) considers multiple constituent rankings as the “centers” for the Mallows components, as do the “mixture of PL” and the “mixture of BTL” models (Azari Soufiani et al., 2013; Oh and Shah, 2014). In all these settings, however, each user is associated with only one ranking sampled from the mixture model. They capture the cases where the population can be clustered into a few types in terms of their preference behavior.

The setup of our model, although fundamentally different in modeling perspective, is most closely related to the seminal work in Jagabathula and Shah (2008); Farias et al. (2009) (denoted by FJS) (see Table 1 and the appendix). As it turns out, our proposed model subsumes those proposed in FJS as special cases. On the other hand, while the algorithm in FJS can be applied to our more general setting, our algorithm has provably better computational efficiency, polynomial sample complexity, and superior empirical performance.

| Method | Assumptions | Statistics used | Consistency proved? | Computational complexity | Sample complexity |
| --- | --- | --- | --- | --- | --- |
| FJS | Separability | 1st order | Yes | Exponential | Not provided |
| This paper | Separability | up to 2nd order | Yes | Polynomial | Polynomial |

Table 1: Comparison to closely related work (Jagabathula and Shah, 2008; Farias et al., 2009) (FJS)

Relation to topic modeling: Our ranking model shares the same motivation as topic models. Topic modeling has been extensively studied over the last decade and has yielded a number of powerful approaches (e.g., Blei, 2012). While the dominant trend is to fit a MAP/ML estimate using approximation heuristics such as variational Bayes or MCMC, recent work has demonstrated that the topic discovery problem can lend itself to provably efficient solutions under additional structural conditions (Arora et al., 2013; Ding et al., 2014). This forms the basis of our technical approach.

Relation to rating based methods: There is also a considerable body of work on modeling numerical ratings (e.g., Ricci et al., 2011) from which ranking preferences can be derived. An emerging trend explores the idea of combining a topic model for text reviews simultaneously with a rating-based model for “star ratings” (Wang and Blei, 2011). These approaches are, however, outside the scope of this paper.

The rest of the paper is organized as follows. We formally introduce the new generative model in Sec. 3. We then present the key geometrical perspective underlying the proposed approach in Sec. 4. We summarize the main steps of our algorithm and the overall computational and statistical efficiency in Sec. 5. We demonstrate competitive performance on semi-synthetic and real-world datasets in Sec. 6.

3 A new generative model

To formalize our proposed model, let there be a universe of Q items. Let the K latent rankings over the Q items that are shared across a population of users be denoted by the permutations σ_1, …, σ_K. Each user compares pairs of items. The unordered item pairs to be compared are assumed to be drawn independently from some distribution μ with μ(i, j) > 0 for all pairs. A comparison result of a user is denoted by an ordered pair (i ≻ j) if the user compares items i and j and prefers i over j. Let a probability vector θ = (θ_1, …, θ_K) be the user-specific weights over the K latent rankings. The generative model for the comparisons from each user is,

  1. Sample θ from a prior distribution

  2. For each comparison:

    1. Sample a pair of items (i, j) from μ

    2. Sample a ranking token z ∈ {1, …, K}, with z = k having probability θ_k

    3. If σ_z(i) < σ_z(j), then output (i ≻ j), otherwise (j ≻ i). (Here σ(i) is the position of item i in the ranking σ, and item i is preferred over item j if σ(i) < σ(j).)
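As a concrete illustration, the sampling steps above can be sketched in a few lines. This is a minimal simulation with hypothetical names (e.g., `sample_user_comparisons`); rankings are represented as tuples ordered from most- to least-preferred:

```python
import random

def sample_user_comparisons(rankings, weights, pair_dist, n_comparisons, rng):
    """Sample one user's comparison outcomes from the mixture-of-rankings model.

    rankings  : K permutations, each a tuple from most- to least-preferred
    weights   : the user's mixing weights over the K rankings
    pair_dist : list of (unordered pair, probability) entries
    """
    pairs, probs = zip(*pair_dist)
    out = []
    for _ in range(n_comparisons):
        i, j = rng.choices(pairs, weights=probs, k=1)[0]                # sample a pair from mu
        k = rng.choices(range(len(rankings)), weights=weights, k=1)[0]  # sample a ranking token
        sigma = rankings[k]
        # emit the ordered pair according to the sampled ranking
        out.append((i, j) if sigma.index(i) < sigma.index(j) else (j, i))
    return out

rng = random.Random(0)
rankings = [(0, 1, 2), (2, 1, 0)]                          # two shared rankings of 3 items
pair_dist = [((0, 1), 1/3), ((0, 2), 1/3), ((1, 2), 1/3)]  # uniform pair-sampling
comps = sample_user_comparisons(rankings, [0.9, 0.1], pair_dist, 20, rng)
```

With weights (0.9, 0.1) most emitted comparisons follow the first ranking, but an occasional comparison follows the second, which is exactly the inconsistency the mixture captures.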

Figure 1: Graphical model representation of the generative model. The boxes represent replicates. The outer plate represents users, and the inner plate represents ranking tokens and comparisons of each user.

Figure 1 is a standard graphical model representation of the proposed generative process. Each user is characterized by θ, the user-specific weights over the K shared rankings. For convenience, we represent the rankings by a nonnegative ranking matrix B whose rows are indexed by all the ordered pairs (i, j), i ≠ j. We set B((i, j), k) = 1 if σ_k(i) < σ_k(j) and 0 otherwise, so that the k-th column of B is an equivalent representation of the ranking σ_k. We then denote by Θ the weight matrix whose columns are the user-specific mixing weights θ. Finally, let X be the empirical comparisons-by-user matrix where X((i, j), u) denotes the number of times that user u compares pair (i, j) and prefers i over j. The principal algorithmic problem is to estimate the ranking matrix B given X and K.

If we denote by D the diagonal matrix whose diagonal component for the ordered pair (i, j) is μ(i, j), and set β = DB, then the generative model induces the following probabilities on the comparisons:

Pr(comparison is (i ≻ j) | θ) = μ(i, j) Σ_k θ_k 1(σ_k(i) < σ_k(j)) = (βθ)_{(i,j)}
Similarly, if we consider a probabilistic topic model on a set of documents, each composed of words drawn from a fixed vocabulary, with a topic matrix β and document-specific mixing weights θ sampled from a topic prior (e.g., Blei, 2012), then the distribution induced on the w-th word of a document has the same form as in (3):

Pr(w-th word is v | θ) = (βθ)_v

where v is any distinct word in the vocabulary. Noting that β = DB is column-stochastic, we have,

Lemma 1.

The proposed generative model is statistically equivalent to a standard topic model whose topic matrix is set to be β = DB, with ordered pairs playing the role of words, and whose topic prior is the prior on θ.


Note that since β = DB is column stochastic, it is a valid topic matrix. We need to show that the distributions on the comparisons and on the words in the topic model are the same; this follows by comparing (3) and (2).

Note that B((i, j), k) ∈ {0, 1}, B((i, j), k) + B((j, i), k) = 1, and μ(i, j) > 0. Hence B can be inferred directly from β:

B((i, j), k) = β((i, j), k) / (β((i, j), k) + β((j, i), k))
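The correspondence between B and β, and the recovery formula just stated, can be checked with a small sketch. The helper names are hypothetical, and both matrices are stored as dictionaries keyed by (ordered pair, ranking index):

```python
def ranking_matrix(rankings, n_items):
    """Column k encodes ranking k: entry ((i, j), k) is 1 iff i is preferred to j."""
    pairs = [(i, j) for i in range(n_items) for j in range(n_items) if i != j]
    B = {}
    for k, sigma in enumerate(rankings):
        pos = {item: r for r, item in enumerate(sigma)}
        for i, j in pairs:
            B[((i, j), k)] = 1 if pos[i] < pos[j] else 0
    return pairs, B

def to_topic_matrix(pairs, B, n_rankings, mu):
    """beta = D B: each row of B is scaled by its pair's sampling probability."""
    return {(p, k): mu[frozenset(p)] * B[(p, k)]
            for p in pairs for k in range(n_rankings)}

def recover_B(pairs, beta, n_rankings):
    """B((i,j),k) = beta((i,j),k) / (beta((i,j),k) + beta((j,i),k))."""
    return {((i, j), k): round(beta[((i, j), k)] /
                               (beta[((i, j), k)] + beta[((j, i), k)]))
            for (i, j) in pairs for k in range(n_rankings)}

rankings = [(0, 1, 2), (2, 0, 1)]
pairs, B = ranking_matrix(rankings, 3)
mu = {frozenset(p): 1/3 for p in pairs}    # uniform pair-sampling distribution
beta = to_topic_matrix(pairs, B, 2, mu)    # each column of beta sums to 1
```

Because exactly one of B((i, j), k) and B((j, i), k) is 1 for every unordered pair, the columns of β sum to one, so β is indeed a valid topic matrix.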
Thus, the problem of estimating the ranking matrix B can be solved by any approach that can learn the topic matrix β. Our approach is to leverage recent work in topic modeling (Arora et al., 2012, 2013; Ding et al., 2013b, 2014) that comes with consistency and statistical and computational efficiency guarantees by exploiting the second-order moments of the columns of X, i.e., a co-occurrence matrix of pairwise comparisons. We can establish parallel results for the ranking model via the equivalence result of Lemma 1. In particular, by combining Lemma 1 with results in (Ding et al., 2013b, Lemma 1 in Appendix), the following result can be immediately established:

Lemma 2.

If two matrices X1 and X2 are obtained from X by first splitting each user's comparisons into two independent copies and then re-scaling the rows to make them row-stochastic, then the second-order moment matrix E formed from X1 and X2 converges, as the number of users grows, to a matrix determined by a row-normalized version of β together with a and R, which are, respectively, the expectation and correlation matrix of the weight vector θ.
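A minimal sketch of this splitting-and-moment computation, under the assumption that the comparison counts are split by binomial thinning; the function name and shapes are illustrative, not the paper's implementation:

```python
import numpy as np

def cooccurrence_moment(counts, rng):
    """Second-order statistic from a pairs-by-users count matrix: split each
    user's comparisons into two independent halves (binomial thinning),
    re-scale rows to be row-stochastic, and form the cross-moment."""
    X1 = rng.binomial(counts, 0.5)   # each comparison goes to copy 1 w.p. 1/2
    X2 = counts - X1                 # ... and to copy 2 otherwise
    X1 = X1 / np.maximum(X1.sum(axis=1, keepdims=True), 1)
    X2 = X2 / np.maximum(X2.sum(axis=1, keepdims=True), 1)
    return X1 @ X2.T / counts.shape[1]   # average over users

rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(6, 200))   # 6 ordered pairs, 200 users
E = cooccurrence_moment(counts, rng)
```

The splitting makes the two halves conditionally independent given the user weights, which is what lets the cross-moment concentrate around its expectation.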

4 A Geometric Perspective

Figure 2: A separable ranking matrix with three rankings, and the underlying geometry of the row vectors of E. The extreme points are novel pairs. Shaded regions depict the solid angles of the extreme points.

The key insight of our approach is an intriguing geometric property of the normalized second-order moment matrix E (defined in Lemma 2), illustrated in Fig. 2. This arises from the so-called separability condition on the ranking matrix B:

Definition 1.

A ranking matrix B is separable if for each ranking k, there is at least one ordered pair (i, j) such that B((i, j), k) = 1 and B((i, j), l) = 0 for all l ≠ k.

In other words, for each ranking, there exists at least one “novel” pair of items (i, j) such that i is preferred over j in that ranking while j is ranked higher than i in all the other rankings. Figure 2 shows an example of a separable ranking matrix in which each of the three rankings has its own novel ordered pair.

The separability condition has been identified as a good approximation for real-world datasets in nonnegative matrix factorization (Donoho and Stodden, 2004) and topic modeling (Arora et al., 2013; Ding et al., 2014), etc. In the context of ranking, this condition has appeared, albeit implicitly in a different form, in the seminal works of (Jagabathula and Shah, 2008; Farias et al., 2009). Moreover, as shown in (Farias et al., 2009), the separability condition is satisfied with high probability when the underlying rankings are sampled uniformly from the set of all permutations. In our experiments we have observed that the ranking matrix induced by the rating matrix estimated by matrix factorization is often separable (Sec. 6.2).
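A separability check is straightforward to sketch. Assuming B is given as a 0/1 NumPy array with rows indexed by ordered pairs (a hypothetical layout for illustration), novel pairs are rows with a single nonzero entry:

```python
import numpy as np

def novel_pairs(B):
    """For each column (ranking) of a 0/1 matrix B, return the rows novel to
    it: rows with a 1 in that column and 0 in every other column."""
    return {k: [x for x in range(B.shape[0])
                if B[x, k] == 1 and B[x].sum() == 1]
            for k in range(B.shape[1])}

def is_separable(B):
    return all(len(rows) > 0 for rows in novel_pairs(B).values())

# rows index the ordered pairs (0,1), (1,0), (0,2), (2,0), (1,2), (2,1);
# columns encode the two rankings (0,1,2) and (2,1,0)
B = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])
```

Here every pair is novel because the two rankings are exact reverses; in general only some pairs are novel, and separability only requires one per ranking.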

If B is separable then the novel pairs correspond to extreme points of the convex hull formed by all the row vectors of E (Fig. 2). Thus, the novel pairs can be efficiently identified through an extreme point finding algorithm. Once all the novel pairs are identified, the ranking matrix can be estimated using a constrained linear regression (Arora et al., 2013; Ding et al., 2014). To exclude redundant rankings and ensure unique identifiability, we assume B has full column rank.

We leverage the normalized solid angle subtended by the extreme points to detect the novel pairs, as proposed in (Ding et al., 2014, Definition 1). The solid angles are indicated by the shaded regions in Fig. 2. From a statistical viewpoint, the solid angle of a row x can be defined as the probability that its row vector e_x has the maximum projection value along an isotropically distributed random direction d:

q_x = Pr( ⟨e_x, d⟩ > ⟨e_y, d⟩, ∀ y ≠ x )

These probabilities can be efficiently approximated using a few iid isotropic d's. By following the approach in (Ding et al., 2014, Lemma 2) for topic modeling, one can prove the following result, which shows that the solid angles can be used to detect novel pairs:

Lemma 3.

Suppose B is separable and R is full rank. Then q_x > 0 if and only if x is a novel pair.

This motivates the following solution approach: (i) estimate the solid angles q_x; (ii) select the distinct pairs with the largest q_x's, one novel pair per ranking; and (iii) estimate the ranking matrix using constrained linear regression.
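The solid-angle estimates at the heart of this procedure can be sketched as follows. Standard Gaussian directions are one isotropic choice, and the matrix E here is a toy example with three extreme rows and one interior row:

```python
import numpy as np

def solid_angle_estimates(E, n_projections, rng):
    """Approximate each row's normalized solid angle: the fraction of isotropic
    random directions along which that row attains the maximum projection."""
    d = rng.standard_normal((E.shape[1], n_projections))  # isotropic directions
    winners = np.argmax(E @ d, axis=0)                    # winning row per direction
    return np.bincount(winners, minlength=E.shape[0]) / n_projections

# toy example: three extreme rows and one interior row (their average)
E = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1/3, 1/3, 1/3]])
q = solid_angle_estimates(E, 20000, np.random.default_rng(0))
novel = np.argsort(q)[::-1][:3]   # the K rows with the largest solid angles
```

The interior row can (almost) never attain the maximum projection, since its projection is the average of the extreme rows' projections, so its estimated solid angle is essentially zero.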

Given the estimated ranking matrix (and the corresponding estimate of β), we follow the typical steps in topic modeling (Blei, 2012) to fit the ranking prior, infer user-specific preferences θ, and predict new comparisons (see Sec. 6).

5 Algorithm and Analysis

The main steps of our approach are outlined in Algorithm 1 and expanded in detail in Algorithms 2, 3, and 4. Algorithm 2 detects all the novel pairs for the distinct rankings. Once the novel pairs are identified, Algorithm 3 estimates the matrix β using constrained linear regression followed by row and then column scaling.

Algorithm 4 further processes the estimate of β to obtain an estimate of the ranking matrix B. Step 1 is based on Eq. (3) and step 2 further rounds each element to 0 or 1. Algorithm 4 guarantees that the estimate is binary and satisfies the condition B((i, j), k) + B((j, i), k) = 1 for all pairs and all k.
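The post-processing step just described can be sketched as follows, assuming (hypothetically) that the estimate of β is stored as a dictionary mapping each ordered pair to a length-K vector of estimates:

```python
import numpy as np

def post_process(beta_hat):
    """Round a noisy estimate into a valid 0/1 ranking matrix: rescale the two
    orientations of each pair to sum to one, then round, so that
    B[(i, j)] + B[(j, i)] = 1 holds exactly in every column."""
    B = {}
    for (i, j) in beta_hat:
        if (j, i) in B:                 # opposite orientation already decided
            B[(i, j)] = 1 - B[(j, i)]
            continue
        ratio = beta_hat[(i, j)] / (beta_hat[(i, j)] + beta_hat[(j, i)])
        B[(i, j)] = (ratio >= 0.5).astype(int)
    return B

# two noisy columns of estimates for the single unordered pair {0, 1}
beta_hat = {(0, 1): np.array([0.30, 0.01]), (1, 0): np.array([0.03, 0.32])}
B = post_process(beta_hat)
```

Deciding each unordered pair once and assigning the complement to the opposite orientation is what enforces the structural constraint exactly, even when the raw estimates are noisy.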

0:  Pairwise comparisons X1, X2; number of rankings K; number of projections P; tolerance parameters.
0:  Ranking matrix estimate B̂.
1:  Novel pairs I ← NovelPairDetect(X1, X2, K, P, tolerance)
2:  β̂ ← EstimateRankings(I, X1, X2, precision)
3:  B̂ ← PostProcess(β̂)
Algorithm 1 Ranking Recovery (Main Steps)
0:  X1, X2; number of rankings K; number of projections P; tolerance
0:  I: the set of all novel pairs of the K distinct rankings.
  Compute the second-order moment estimate Ê from X1 and X2
  for r = 1, …, P do
     Sample a direction d_r from an isotropic prior
  end for
  For each row x of Ê, set q̂_x to the fraction of directions d_r along which row x has the maximum projection
  while fewer than K novel pairs have been selected do
     x* ← index of the largest value among the remaining q̂_x's
     if row x* is not within the tolerance of an already selected row then add x* to I
     end if
  end while
Algorithm 2 NovelPairDetect (via Random Projections)
0:  I: the set of novel pairs of the K rankings; X1, X2; precision
0:  β̂ as the estimate of β.
  for all pairs x do
     Express row x of Ê as a convex combination of the rows indexed by I via constrained linear regression, solved to the given precision
  end for
  Scale the rows, then column normalize to obtain β̂
Algorithm 3 Estimate Rankings
0:  β̂ as the estimate of β
0:  B̂ as the estimate of B
1:  B̂((i, j), k) ← β̂((i, j), k) / (β̂((i, j), k) + β̂((j, i), k))
2:  Round each B̂((i, j), k) to 0 or 1 so that B̂((i, j), k) + B̂((j, i), k) = 1
Algorithm 4 Post Processing

Our approach inherits the polynomial computational complexity of the topic modeling algorithm in Ding et al. (2014):

Theorem 1.

The running time of Algorithm 1 is polynomial in all the model parameters.

We further derive sample complexity bounds for our approach, which are also polynomial in all model parameters and in 1/δ, where δ is the upper bound on the error probability. A major technical improvement compared to the results that appear in Ding et al. (2014) is that our analysis holds true for any isotropic distribution over the random directions in Alg. 2. The previous result in (Ding et al., 2014, Theorems 1, 2) was designed only for specific distributions such as the spherical Gaussian. Formally,

Theorem 2.

Let the ranking matrix B be separable and let R have full rank. Then Algorithm 1 can consistently recover B up to a column permutation as the number of users and the number of projections go to infinity. Furthermore, for any isotropically drawn random direction d, if the number of users and the number of projections are polynomially large in the model parameters, then Algorithm 1 fails with probability at most δ. The relevant model parameters include the numbers of items and rankings, the minimum pair-sampling probability, the minimum and maximum eigenvalues of R, and the minimum solid angle of the extreme points of the convex hull of the rows of E.

Detailed proofs are provided in the supplementary material. We combine the analysis of Alg. 4 and the re-scaling steps in Alg. 3 in order to exploit the structural constraints of the ranking model. As a result, we obtain an improved sample complexity bound compared to Ding et al. (2014) and Arora et al. (2013).

6 Experimental Validation

6.1 Overview of Experiments and Methodology

We conduct experiments first on a semi-synthetic dataset in order to validate the performance of our proposed algorithm when the model assumptions are satisfied, and then on real-world datasets in order to demonstrate that the proposed model can indeed effectively capture the variability that one encounters in the real world. We focus on collaborative filtering applications, where population heterogeneity and user inconsistency are well-known characteristics (e.g., Salakhutdinov and Mnih, 2008a).

We use Movielens, a benchmark movie-rating dataset widely used in the literature. (Another large benchmark, the Netflix dataset, is not available due to privacy issues. Movielens is currently available at http://grouplens.org/datasets/movielens/.) The rating-based data is selected due to its public availability and widespread use, but we convert it to pairwise comparison data and focus on modeling from a ranking viewpoint. This procedure has been suggested and widely used in the rank-aggregation literature (e.g., Lu and Boutilier, 2011; Volkovs and Zemel, 2014). For the semi-synthetic datasets, we evaluate the reconstruction error between the learned rankings and the ground truth using the standard Kendall's tau distance between two rankings. For the real-world datasets where true parameters are not available, we use the held-out log-likelihood, a standard metric in ranking prediction (Lu and Boutilier, 2011) and in topic modeling (Wallach et al., 2009).

In addition, we consider the standard task of rating prediction via our proposed ranking model. Our aim here is to illustrate that our model is suitable for real-world data; we do not optimize tuning parameters in order to achieve the best result. We measure the performance by the root-mean-square error (RMSE), which is standard in the literature (e.g., Salakhutdinov and Mnih, 2008a; Toscher et al., 2009).

The parameters of our algorithm are the same as in Ding et al. (2014); specifically, the number of random projections, the tolerance parameter for Alg. 2, and the precision parameter for Alg. 3 are fixed at the values used there.

6.2 Semi-synthetic simulation

We first use a semi-synthetic dataset to validate the performance of our algorithm. In order to match the dimensionality and other characteristics that are representative of real-world examples, we generate the semi-synthetic pairwise comparison dataset using a benchmark movie star-rating dataset, Movielens. The original dataset has approximately a million ratings of several thousand movies from several thousand users. The ratings range from 1 star to 5 stars.

Figure 3: The normalized Kendall's tau distance error of the estimated rankings, as a function of the number of users, estimated by RP and FJS from the semi-synthetic dataset.

We follow the procedure in (Lu and Boutilier, 2011) and (Volkovs and Zemel, 2014) to generate the semi-synthetic dataset as follows. We consider the most frequently rated movies and train a latent factor model on the star-ratings data using a state-of-the-art matrix factorization based algorithm (Salakhutdinov and Mnih, 2008a), selected for its performance on many real-world collaborative filtering tasks. This procedure learns a movie-factor matrix whose columns are interpreted as scores of the movies over the latent factors (Salakhutdinov and Mnih, 2008a; Volkovs and Zemel, 2014). By sorting the scores of each column of the movie-factor matrix, we obtain the rankings for generating the semi-synthetic dataset. We set the number of factors as suggested by Lu and Boutilier (2011) and Salakhutdinov and Mnih (2008a). We note that the resulting ranking matrix satisfies the separability condition.

The other model parameters are set as follows. The prior distribution for θ is set to be Dirichlet, as suggested by (Lu and Boutilier, 2011). The Dirichlet parameters are determined by a fixed concentration parameter together with an expectation that is sampled uniformly from the probability simplex for each random realization. We note that the correlation matrix R of the Dirichlet distribution has full rank (Arora et al., 2013). We fix the number of comparisons per user to approximate the observed average number of pairwise comparisons in the Movielens dataset, and vary the number of users.

Since the output of our algorithm is determined only up to a column permutation, we first align the columns of the estimate and the ground truth using bipartite matching on the column distances, and then measure the performance by the distance between the ground truth rankings and the estimates. Due to the way the ranking matrix is defined, this is equivalent to the widely used Kendall's tau distance between two rankings, which is proportional to the number of pairs on which two ranking schemes differ. We further normalize the error by the total number of item pairs so that the error measure for each column is a number between 0 and 1.
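The alignment-and-evaluation procedure can be sketched as follows. Rankings are represented as item-to-position maps, and brute-force search over column permutations stands in for bipartite matching, which is adequate when the number of rankings is small:

```python
from itertools import combinations, permutations

def kendall_tau(r1, r2):
    """Number of item pairs on which two rankings (item -> position maps) disagree."""
    return sum((r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
               for i, j in combinations(r1, 2))

def aligned_error(truth, estimates):
    """The estimate is only determined up to a column permutation, so match
    estimated rankings to ground-truth ones (brute force over permutations,
    fine for a small number of rankings) and report the mean normalized
    Kendall's tau distance."""
    n = len(truth[0])
    best = min(sum(kendall_tau(t, e) for t, e in zip(truth, perm))
               for perm in permutations(estimates))
    return best / len(truth) / (n * (n - 1) / 2)

truth = [{0: 0, 1: 1, 2: 2}, {0: 2, 1: 1, 2: 0}]
estimates = [{0: 2, 1: 1, 2: 0}, {0: 0, 1: 1, 2: 2}]   # same rankings, columns permuted
```

Dividing by the number of item pairs, n(n−1)/2, makes the per-column error a number between 0 and 1.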

We compare our proposed algorithm (denoted by RP) against the algorithm proposed in (Jagabathula and Shah, 2008; Farias et al., 2009) (denoted by FJS) for estimating the ranking matrix. To the best of our knowledge, this is the most recent algorithm with consistency guarantees for this setting. (We show in the appendix that Alg. FJS can be applied to our generative scheme since it only uses the first order statistics, and all the technical conditions are satisfied.) We compare how the estimation error varies with the number of users, and the results are depicted in Fig. 3. For each setting, we average over multiple Monte Carlo runs. Evidently, our algorithm shows superior performance over FJS. More specifically, since our ground truth ranking matrix is separable, the estimation error of RP converges to zero as the number of users increases, and the convergence is much faster than for FJS, whose error only eventually starts approaching 0.

6.3 Movielens - Comparison prediction

We apply the proposed algorithm (RP) to the real-world Movielens dataset introduced in Sec. 6.2 and consider the task of predicting pairwise comparisons. We consider two settings: new comparison prediction and new user prediction. We train and evaluate our model using the comparisons obtained from the star-ratings of the Movielens dataset. This procedure of generating comparisons from star-ratings is motivated by (Lu and Boutilier, 2011; Volkovs and Zemel, 2014). We focus on the most frequently rated movies and obtain a subset of star-ratings. The pairwise comparisons are generated from the star ratings following (Lu and Boutilier, 2011; Volkovs and Zemel, 2014): for each user, we select pairs of movies that the user rated, and compare the stars of the two movies to generate comparisons.

To select pairs of items to compare, we consider two strategies: (Full) all pairs of movies that a user has rated, or (Partial) a random selection of pairs whose number scales with the number of movies the user has rated.

To compare a pair of movies (i, j) rated by a user, we generate (i ≻ j) if the star rating of i is higher than that of j. For ties, we consider three strategies: (Both) generate both (i ≻ j) and (j ≻ i), (Ignore) do nothing, and (Random) select one of the two orders with equal probability.
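These comparison-generation strategies can be sketched as follows. The function name is illustrative, and the exact number of pairs drawn by the Partial strategy is an assumption for this sketch:

```python
import random
from itertools import combinations

def ratings_to_comparisons(ratings, strategy='full', ties='ignore', rng=None):
    """Convert one user's star ratings {movie: stars} into ordered comparisons.

    strategy: 'full' uses all rated pairs; 'partial' samples as many pairs as
              the user has rated movies (an assumption for illustration).
    ties:     'both' emits both orders, 'ignore' drops the pair,
              'random' picks one order with equal probability.
    """
    rng = rng or random.Random(0)
    pairs = list(combinations(ratings, 2))
    if strategy == 'partial':
        pairs = rng.sample(pairs, min(len(ratings), len(pairs)))
    out = []
    for i, j in pairs:
        if ratings[i] > ratings[j]:
            out.append((i, j))
        elif ratings[j] > ratings[i]:
            out.append((j, i))
        elif ties == 'both':
            out += [(i, j), (j, i)]
        elif ties == 'random':
            out.append((i, j) if rng.random() < 0.5 else (j, i))
    return out

comps = ratings_to_comparisons({'A': 5, 'B': 3, 'C': 3}, ties='both')
```

With `ties='both'` the tied pair (B, C) contributes comparisons in both orders, whereas `ties='ignore'` would drop it entirely.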

New comparison prediction: In this setting, for each user, a subset of her ratings is used to generate the training comparisons while the remaining ratings are used for testing comparisons. We follow the training/testing split as in (Salakhutdinov and Mnih, 2008a). (The split is available at http://www.cs.toronto.edu/~rsalakhu/BPMF.html.) We convert the training ratings and testing ratings into training comparisons and testing comparisons independently.

We evaluate the performance by the predictive log-likelihood of the testing comparisons. Given the estimated ranking matrix, we follow (Arora et al., 2013; Ding et al., 2014) to fit a Dirichlet prior model. We then calculate the predictive log-likelihood using the approximation in (Wallach et al., 2009), which is now standard. We compare against the FJS algorithm. Figure 4 (upper) summarizes the results for different strategies of generating the pairwise comparisons with the number of rankings held fixed. The log-likelihood is normalized by the total number of pairwise comparisons tested. As depicted in Fig. 4 (upper), the log-likelihood produced by the proposed algorithm RP is higher, by a large margin, than that of FJS. The predictive accuracy is robust to how the comparison data is constructed. We also consider the normalized log-likelihood as a function of the number of rankings (see Fig. 5). The results validate the superior performance and suggest a reasonable choice for the number of rankings.

Figure 4: The normalized log-likelihood under different settings for (upper) new comparison prediction and (lower) new user prediction on the truncated Movielens dataset.
Figure 5: The normalized log-likelihood for the Full + Ignore strategy for various numbers of rankings on the truncated Movielens dataset (new comparison prediction).

New user prediction: In this setting, all the ratings of a subset of users are used to generate the training comparisons while the remaining users' comparisons are used for testing. Following (Lu and Boutilier, 2011), we use an initial subset of users (in the original dataset order) in the Movielens dataset for training, and the remaining users for testing. We use the held-out log-likelihood of the test users' comparisons to measure the performance. The log-likelihoods are again calculated using the standard Gibbs sampling approximation (Wallach et al., 2009). We compare our algorithm RP with the FJS algorithm. The log-likelihoods are then normalized by the total number of comparisons in the testing phase, and the number of rankings is fixed. The results, which are summarized in Fig. 4 (lower), agree with the results of the previous task.

6.4 Movielens - Rating prediction via ranking model

The purpose of this experiment is to illustrate that our ranking model can capture real-world user behavior through rating prediction, an important task in personal recommendation (Toscher et al., 2009). We first train our ranking model using the training comparisons, and then predict ratings based on comparison prediction. Our objective is to demonstrate results comparable to the state-of-the-art rating-based methods rather than to achieve the best possible performance on certain datasets.

We use the same training/testing rating split from (Salakhutdinov and Mnih, 2008a) as used in new comparison prediction in Sec. 6.3, and focus only on the most rated movies. We first convert the training ratings into training comparisons (for each user, all pairs of movies she rated in the training set are converted into comparisons based on the stars and the ties are ignored) and train a ranking model. The prior is set to be Dirichlet.

To predict stars from comparison predictions, we propose the following method. Consider the problem of predicting the rating of a user on a movie she has not rated. We posit a candidate star value, then compare it against the ratings of the movies she has rated in training. This generates a set of pairwise comparisons: each training movie rated below the candidate value yields a comparison in which the new movie is preferred, and each training movie rated above it yields the opposite comparison. We then choose the candidate star value that maximizes the likelihood of the induced comparisons under the trained model.
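This prediction rule can be sketched as follows; the `loglik_of_comparison` interface standing in for the trained ranking model is hypothetical:

```python
import math

def predict_rating(train_ratings, loglik_of_comparison, target='new_movie'):
    """Choose the star value in 1..5 whose induced comparisons against the
    user's training ratings are most likely under the trained ranking model.
    `loglik_of_comparison(winner, loser)` is a hypothetical interface to the
    fitted model's comparison log-likelihood."""
    best_s, best_ll = None, -math.inf
    for s in range(1, 6):
        ll = 0.0
        for movie, stars in train_ratings.items():
            if s > stars:                 # candidate rating beats this movie
                ll += loglik_of_comparison(target, movie)
            elif s < stars:               # this movie beats the candidate
                ll += loglik_of_comparison(movie, target)
        if ll > best_ll:
            best_s, best_ll = s, ll
    return best_s

# toy model: the new movie beating a rated movie is likely, the reverse is not
toy = lambda w, l: 0.0 if w == 'new_movie' else -5.0
```

Ties between the candidate value and a training rating induce no comparison, mirroring the Ignore strategy used to build the training data.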

We evaluate the performance using the root-mean-square error (RMSE), a standard metric in collaborative filtering (Toscher et al., 2009). (Normalized Discounted Cumulative Gain (nDCG) is another standard metric; it requires, however, predicting a total ranking and is inapplicable in our test setting.) We compare our ranking-based algorithm, RP, against rating-based algorithms. We choose two benchmark algorithms, Probabilistic Matrix Factorization (PMF) (Salakhutdinov and Mnih, 2008b) and Bayesian Probabilistic Matrix Factorization (BPMF) (Salakhutdinov and Mnih, 2008a), for their robust empirical performance. (The implementation is available at http://www.cs.toronto.edu/~rsalakhu/BPMF.html.) Both PMF and BPMF are latent factor models, and the number of latent factors has a similar interpretation to the number of rankings in our model. The RMSE for different choices of this number is summarized in Table 2.

| Latent dimension | PMF | BPMF | RP | BPMF-int |
| --- | --- | --- | --- | --- |
| 10 | 1.0491 | 0.8254 | 0.8840 | 0.8723 |
| 15 | 0.9127 | 0.8236 | 0.8780 | 0.8734 |
| 20 | 0.9250 | 0.8213 | 0.8721 | 0.8678 |

Table 2: Testing RMSE on the Movielens dataset

Although coming from a different feature space and modeling perspective, our approach has RMSE performance similar to the rating-based PMF and BPMF. Since the ratings predicted by our algorithm are integers from 1 to 5, we also consider restricting the output of BPMF to integers (denoted BPMF-int), achieved by rounding the real-valued prediction of BPMF to the nearest integer from 1 to 5. We observe that our RP algorithm outperforms PMF, which is known for over-fitting issues, and matches the performance of BPMF-int. This demonstrates that our approach is indeed suitable for modeling real-world user behavior.

We point out that one can potentially improve these results by designing a better comparison generating strategy, ranking prior, aggregation strategies, etc. This is, however, beyond the scope of this paper.

We note that our proposed algorithm can be naturally parallelized in a distributed database for web scale problems as demonstrated in (Ding et al., 2014). The statistical efficiency of the centralized version can be retained with an insignificant communication cost.


This article is based upon work supported by the U.S. AFOSR and the U.S. NSF under award numbers # FA9550-10-1-0458 (subaward # A1795) and # 1218992 respectively. The views and conclusions contained in this article are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the agencies.

Supplementary Material

While our analysis of the proposed approach and algorithm largely tracks the methodology in (Ding et al., 2014), here we develop a set of new analysis tools that can handle more general settings. Specifically, our new tools can handle any isotropically distributed random projection directions, whereas the work in (Ding et al., 2014) can only handle special types of random projections, e.g., the spherical Gaussian. Our refined analysis not only covers more general settings but also yields an overall improved sample complexity bound.

We also analyse the post-processing step in Algorithm 4. This step accounts for the special constraints that a valid ranking representation must satisfy and guarantees a binary-valued estimate of the ranking matrix. The estimate must also satisfy the property that, for every pair of distinct items and every ranking, exactly one of the two corresponding ordered-pair entries is nonzero.

We note that the analysis framework that we present here for the solid angle can, in fact, be extended to handle other types of distributions for the random projection directions. This is, however, beyond the scope of this paper.

Appendix A On the generative model

Proposition 1.

is column stochastic.


Noting that by definition, and , therefore,
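Proposition 1 can be checked numerically under one plausible construction of the ranking matrix: the column for a ranking places mass on every ordered pair whose first item is ranked above the second, and is then normalized. This construction is an illustrative assumption consistent with the text, not necessarily the paper's exact definition.

```python
import numpy as np
from itertools import permutations

def ranking_column(sigma):
    """Column of the ranking matrix for one ranking sigma (a tuple of item
    ids, best first): the entry for ordered pair (i, j) is 1 if i is ranked
    above j, then the column is normalized to sum to one."""
    n = len(sigma)
    pos = {item: r for r, item in enumerate(sigma)}
    col = np.array([1.0 if pos[i] < pos[j] else 0.0
                    for i in range(n) for j in range(n) if i != j])
    return col / col.sum()

# All 6 rankings of 3 items; rows index the 6 ordered pairs.
cols = [ranking_column(s) for s in permutations(range(3))]
B = np.column_stack(cols)
print(np.allclose(B.sum(axis=0), 1.0))  # column stochastic
```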

Appendix B Connection to the model in FJS

Here we discuss in detail the connection to the probability model and the algorithm proposed in Jagabathula and Shah (2008) and Farias et al. (2009) (together denoted FJS).

First, the generative model proposed in FJS can be viewed as a special case of our generative model. If we take the prior distribution of the weight vector to be a pmf supported on the vertices of the probability simplex (so that the weight vector has exactly one nonzero component with probability one), i.e.,


where each point mass is placed on a standard basis vector, then each user is associated with exactly one of the types, the -th type occurring with its corresponding probability. We note that under this prior the weight correlation matrix is diagonal and has full rank.
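The contrast between the FJS-style vertex prior and a general mixed-membership prior can be sketched as follows. The type probabilities and the Dirichlet choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
p = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical type probabilities

def fjs_prior():
    """FJS-style prior: the weight vector is a vertex of the simplex,
    i.e. a standard basis vector chosen with probability p[k]."""
    return np.eye(K)[rng.choice(K, p=p)]

def general_prior(alpha=0.5):
    """General mixed-membership prior (e.g. Dirichlet): a user mixes
    several latent rankings."""
    return rng.dirichlet(alpha * np.ones(K))

w_fjs = fjs_prior()      # exactly one nonzero component
w_gen = general_prior()  # almost surely all components nonzero
print(np.count_nonzero(w_fjs), round(w_gen.sum(), 6))
```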

Second, the algorithm proposed in FJS can certainly be applied to our more general setting. Since the FJS algorithm only uses the first order statistic, which corresponds to pooling the comparisons from all the users together, it suffices to consider the pooled comparison probabilities obtained by marginalizing over the user weight vector:

where the last step is due to the definition of the ranking matrix . The above derivation shows that if the expectation vector in our generative model equals that in the model of FJS, then the probability distribution of the first order statistic in both models will be identical and the two models will be indistinguishable in terms of the first order statistic. This shows that the comparison with FJS in the experiments conducted in Sections 6.1 and 6.2 of the main paper is both sensible and fair.
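This indistinguishability can be illustrated numerically: pooling comparisons from users drawn under a mixed-membership prior recovers the same first-order statistic as an FJS-style model whose type probabilities equal the mean weight vector. The matrix sizes, prior, and seed below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n_pairs, K = 6, 3
B = rng.random((n_pairs, K))
B /= B.sum(axis=0)                  # toy column-stochastic ranking matrix

w_mean = np.array([0.5, 0.3, 0.2])  # E[w] in the mixed-membership model

# Simulate many users with Dirichlet weights whose mean is w_mean and
# pool their comparison probability vectors.
users = rng.dirichlet(5 * w_mean, size=20000)
pooled = (B @ users.T).mean(axis=1)

# An FJS-style model with type probabilities w_mean yields the same
# first-order statistic B @ w_mean, so the two are indistinguishable
# from pooled comparisons alone.
print(np.allclose(pooled, B @ w_mean, atol=1e-2))
```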

Indexing convention: For convenience, for the rest of this appendix we will index the rows of and by just a single index instead of an ordered pair as in the main paper.

Appendix C Proof of Lemma 2 in the main paper

Lemma 2 in the main paper is a result about the almost sure convergence of the estimate of the normalized second order moments . Our proof of this result will also provide an attainable rate of convergence.

We first provide a generic method to establish the convergence rate for a function of random variables given their individual convergence rates.

Proposition 2.

Let be random variables and be positive constants. Let for some constants , and be a continuously differentiable function in . If for , and , then,


Since is continuously differentiable in , such that


Now we are ready to prove Lemma 2 of the main paper. Recall that and are obtained from by first splitting each user’s comparisons into two independent copies and then re-scaling the rows to make them row-stochastic. Therefore, . Since , , and is row stochastic. From Lemma 2 of the main paper, we have

Lemma 4.

Let and . If , then,


For any ,

From the Strong Law of Large Numbers and equations (1), (2) in the main paper, we have

and by definition. Using McDiarmid’s inequality, we obtain

In order to calculate , we apply the results from Proposition 2. Let with , and , , . Let , , and . Then , , and .

If , , and , then , . Then note that

By applying Proposition 2, we get

where . There are many strategies for optimizing the free parameter . We set and solve for to obtain

Finally, by applying the union bound to the entries in , we obtain the claimed result. ∎

Appendix D Proof of Theorem 2 in the main paper

d.1 Outline

We focus on the case where the random projection directions are sampled from an arbitrary isotropic distribution. Our proof is not tied to the specific form of the distribution, only to its isotropy. In contrast, the method in (Ding et al., 2014) can only handle special types of distributions, such as the spherical Gaussian.

The proof of Theorem 2 in the main paper can be decoupled into two steps. First, we show that Algorithm 2 in the main paper consistently identifies all the novel words of the distinct rankings. Then, given the success of the first step, we show that Algorithm 3 in the main paper consistently estimates the ranking matrix.
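The random-projection idea behind the first step can be sketched as follows: rows that are extreme points of the row set are exactly the rows that maximize some linear projection, and isotropically distributed directions (normalized Gaussians being one instance) find them with high probability. This is a toy sketch with hypothetical data and function names, not Algorithm 2 itself.

```python
import numpy as np

rng = np.random.default_rng(3)

def isotropic_directions(num, dim):
    """Isotropically distributed unit directions; normalized Gaussians are
    one instance, and the analysis only relies on isotropy."""
    d = rng.normal(size=(num, dim))
    return d / np.linalg.norm(d, axis=1, keepdims=True)

def find_extreme_rows(E, num_proj=200):
    """Collect the rows of E that maximize at least one random projection;
    these are the extreme points (the "novel" rows) of the row set."""
    winners = set()
    for d in isotropic_directions(num_proj, E.shape[1]):
        winners.add(int(np.argmax(E @ d)))
    return sorted(winners)

# Toy example: rows 0 and 1 are vertices; the remaining rows are strict
# convex combinations of them, so only rows 0 and 1 can win a projection.
V = np.array([[1.0, 0.0], [0.0, 1.0]])
lam = rng.uniform(0.2, 0.8, size=5)
E = np.vstack([V, np.outer(lam, V[0]) + np.outer(1 - lam, V[1])])
print(find_extreme_rows(E))  # only the two vertex rows are returned
```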

d.2 Useful propositions

We denote by the set of all novel pairs of the ranking , for , and denote by the set of other non-novel pairs. We first prove the following result.

Proposition 3.

Let be the -th row of . Suppose is separable and has full rank. Then the following holds: