# Learning Mixed Membership Mallows Models from Pairwise Comparisons

We propose a novel parameterized family of Mixed Membership Mallows Models (M4) to account for variability in pairwise comparisons generated by a heterogeneous population of noisy and inconsistent users. M4 models individual preferences as a user-specific probabilistic mixture of shared latent Mallows components. Our key algorithmic insight for estimation is to establish a statistical connection between M4 and topic models by viewing pairwise comparisons as words, and users as documents. This key insight leads us to explore Mallows components with a separable structure and leverage recent advances in separable topic discovery. While separability appears to be overly restrictive, we nevertheless show that it is an inevitable outcome of a relatively small number of latent Mallows components in a world with a large number of items. We then develop an algorithm based on robust extreme-point identification of convex polygons to learn the reference rankings, and show that it is provably consistent with polynomial sample complexity guarantees. We demonstrate that our new model is empirically competitive with the current state-of-the-art approaches in predicting real-world preferences.

## 1 Introduction

The problem of predicting preferences for a diverse user population arises in many applications including personal recommendation systems, e-commerce and information retrieval (Volkovs & Zemel, 2014; Lu & Boutilier, 2014; Ding et al., 2015). Pairwise comparisons of items by a heterogeneous and inconsistent population can now be observed and recorded over the web through transactions, clicks and check-ins for a large set of items. Our goal is to model, infer, and predict user behavior in pairwise comparisons.

This paper proposes a new Mixed Membership Mallows Model (M4) for pairwise comparisons that builds on the widely used mixture of Mallows model (e.g., Lu & Boutilier, 2014; Awasthi et al., 2014). The building block of M4 is the popular Mallows distribution on permutations. The pmf of the Mallows model is centered around a reference ranking, and the deviation from it is captured by a dispersion constant (Mallows, 1957). M4 naturally captures heterogeneous, inconsistent, and noisy behavior by modeling each user's comparisons as a probabilistic mixture of a few shared latent Mallows components. By design, the latent Mallows components capture the heterogeneous influencing factors in the population, and the user-specific mixing weights reflect the influence of multiple latent factors on each user. Furthermore, the randomness of each Mallows component captures the fact that the same latent factor can result in different outcomes for different users, more so for very similar items. Overall, M4 generalizes the clustering perspective of the mixture of Mallows model into a decomposition modeling perspective that better fits emerging web-scale observations.

The key contribution of this paper is the first provable and polynomially efficient approach for learning multiple Mallows components in mixed membership settings from pairwise comparisons. As a special case of M4, the mixture of Mallows model has received significant attention (Lebanon & Lafferty, 2002; Busse et al., 2007; Lu & Boutilier, 2014; Awasthi et al., 2014), yet theoretical guarantees are not clear except for special cases (Awasthi et al., 2014). We propose to learn M4 by reducing it to an instance of probabilistic topic modeling (Blei, 2012). Topic modeling for text corpora has been extensively studied, but its connection to preference data is unclear. We view users as “documents”, pairwise comparisons as “words”, and the latent Mallows components as “topics”. This leads us to the question of topic discovery viewed within the context of M4.

The key technical contribution of our approach is to provably discover latent factors with a non-exact separability structure. Our approach is geometrically inspired by recent work in exactly separable topic discovery (e.g., Arora et al., 2013; Ding et al., 2013), and we provably generalize it to approximate separability with a finite degree of deviation. In M4, this requires that for each Mallows component there exists an item pair such that item A is preferred over B with very high probability in that Mallows component, and B is preferred over A with high probability under the other Mallows components. While this might appear restrictive, we show formally that approximate separability is inevitable and naturally arises from the fact that we have a large set of items relative to the number of shared latent preferences. As a consequence, most large M4 models are approximately separable. We then provably generalize the geometric property of the solid angle from (Ding et al., 2014b) and establish guarantees for consistent estimation of reference rankings along with polynomial sample and computational complexity bounds. Our results only require the number of users to scale while allowing the number of comparisons per user to be small.

### 1.1 Related work

Rank estimation from full or partial preferences has been extensively studied in different settings for decades (Marden, 1995; Rajkumar & Agarwal, 2014; Volkovs & Zemel, 2014). The family of mixtures of ranking models has demonstrated superior modeling power in capturing a heterogeneous population with noisy observations (e.g., Farias et al., 2009; Oh & Shah, 2014). In these models, each user is associated with one ranking component sampled from a set of multiple ranking components, hence the population can be clustered into heterogeneous preference types. The mixture of Mallows model has received significant attention (Lebanon & Lafferty, 2002; Busse et al., 2007; Lu & Boutilier, 2014; Awasthi et al., 2014). EM-based algorithms have been used for estimation from pairwise comparisons (Lu & Boutilier, 2014) or full rankings (Busse et al., 2007). Only recently, Awasthi et al. (2014) proposed a provably correct algorithm based on tensor decomposition that can handle a mixture of 2 Mallows models using the top-3 ranked items as the observations which, in effect, requires users to consider all items. This is impractical within the context of the target web-scale applications. Since the mixture of Mallows is a special case of M4 obtained by positing a specific prior on each user’s mixing weights, our algorithm can be viewed as providing a powerful alternative approach for learning the mixture of Mallows model. We note that mixtures of Bradley-Terry-Luce (BTL) models (Oh & Shah, 2014) and mixtures of Plackett-Luce (PL) models (Azari Soufiani et al., 2013) have also been studied.

Our model is closely related to (Ding et al., 2014a, 2015), which validated the advantages of adopting the mixed membership perspective. Ding et al. (2015) model each latent ranking factor as a single permutation, which is a degenerate special case of the Mallows distribution over permutations in M4. Therefore, while both (Ding et al., 2015) and M4 can capture the inconsistent behavior stemming from the influence of multiple latent factors, M4 can further account for inconsistency as the consequence of the randomness within each Mallows component. Our approach has similar polynomial time and sample guarantees as in (Ding et al., 2015). We note that, motivated by social choice applications, Gormley & Murphy (2008) proposed another mixed membership ranking model where the latent “topics” are PL models. An MCMC-based approach is used for estimation without theoretical guarantees. Table 1 summarizes all the closely related works.

Connection to Separable Topic Discovery: A key motivation of our approach is the recent work on consistent and efficient topic discovery for topic matrices that have an exact separable structure (Arora et al., 2013; Ding et al., 2014b). The exact separability has been exploited as a suitable approximation to many problems including topic modeling (Arora et al., 2013) and ranking estimation (Ding et al., 2015).

Closely related to our technical settings is the so-called near-separable structure, where the observations are viewed as a noisy perturbation of some exactly separable statistic. In the literature to date, establishing provable guarantees requires the perturbation to go to zero via either data augmentation (Arora et al., 2013; Ding et al., 2013, 2015) or improving the signal-to-noise ratio (Gillis & Vavasis, 2014; Benson et al., 2014). In contrast, the ideal statistic in our approach has a small but finite perturbation from the exactly separable ideal. Our provable guarantees require only a finite degree of approximate separability. We explicitly derive a sufficient condition that bounds the degree of approximate separability.

Bansal et al. (2014) recently proposed a provable approach that requires similar approximate separability as in our settings but also requires a strong condition on the weight prior. In M4, it requires each user to have a dominant latent factor. In contrast, we only require the second-order moments of the prior to be full rank, which is satisfied by many prior distributions (Arora et al., 2013).

Rating based methods: Considerable work in preference prediction has focused on numerical ratings. The most important idea is also to model the ratings as being influenced by a small number of latent factors shared by the population (e.g., Salakhutdinov & Mnih, 2008a). Although coming from a different feature space, our model shares the same mixed membership modeling perspective.

The rest of the paper is organized as follows. Section 2 introduces the M4 model. In Sec. 3, we formally introduce the approximate separability and show that the set of approximate separable M4 models has an overwhelming probability. Section 4 summarizes the steps of our algorithm and the computational and sample complexity bounds. We demonstrate competitive performances on some semi-synthetic and real-world datasets in Sec. 5.

## 2 Mixed Membership Mallows Model

We now describe the generative process of the Mixed Membership Mallows Model (M4). To set up the problem, we consider a universe of items and a population of $M$ users, each of whom compares $N$ pairs of items. We assume the item pairs to be compared, denoted by unordered pairs $\{i, j\}$, are drawn independently from some distribution $\mu$. The outcome of the $n$-th pairwise comparison of user $m$ is denoted by an ordered pair $w_{m,n} = (i, j)$ if user $m$ compares items $i$ and $j$ and prefers $i$ over $j$.

We first introduce the Mallows model (Mallows, 1957). In M4, let the $k$-th Mallows component define a probability distribution on the set of all permutations over the items. It is parameterized by a reference ranking $\sigma_k$ and a dispersion parameter $\phi_k$:

$$p_M(\sigma \mid \sigma_k, \phi_k) = \frac{\phi_k^{d(\sigma, \sigma_k)}}{Z_k} \tag{1}$$

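where $\sigma$ denotes an arbitrary permutation, $d(\sigma, \sigma_k)$ denotes the Kendall’s tau distance between two permutations, and $Z_k$ is the normalization constant. The generative process for the comparisons in M4 from user $m$ is:

1. Sample the ranking weight $\theta_m$ from the prior $\Pr(\theta)$.

2. For each comparison $n = 1, \ldots, N$:

    1. Sample a pair of items $\{i, j\}$ from $\mu$.

    2. Sample a ranking token $k \in \{1, \ldots, K\}$ according to the weights $\theta_m$.

    3. Sample a permutation $\sigma$ from the $k$-th Mallows component with parameters $(\sigma_k, \phi_k)$.

    4. If $\sigma(i) < \sigma(j)$, then $w_{m,n} = (i, j)$; otherwise $w_{m,n} = (j, i)$. Here $\sigma(i)$ is the position of item $i$ in a ranking $\sigma$, and item $i$ is preferred over $j$ if $\sigma(i) < \sigma(j)$.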
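The Mallows pmf in Eq. (1) and the generative steps above can be sketched for a handful of items as follows. This is a minimal illustration and not the authors' code: it enumerates all permutations (feasible only for very small item sets), and it assumes a symmetric Dirichlet prior with concentration `alpha` and a uniform pair distribution $\mu$. The function names `kendall_tau`, `mallows_pmf`, and `sample_user` are hypothetical.

```python
import random
from itertools import combinations, permutations


def kendall_tau(sigma, ref):
    """Count item pairs ordered differently by the two rankings.

    Rankings are tuples listing items from most to least preferred.
    """
    pos_s = {item: r for r, item in enumerate(sigma)}
    pos_r = {item: r for r, item in enumerate(ref)}
    return sum(
        1
        for a, b in combinations(ref, 2)
        if (pos_s[a] - pos_s[b]) * (pos_r[a] - pos_r[b]) < 0
    )


def mallows_pmf(ref, phi):
    """Eq. (1) by brute force: {ranking: phi**d(ranking, ref) / Z}."""
    weights = {s: phi ** kendall_tau(s, ref) for s in permutations(ref)}
    z = sum(weights.values())  # normalization constant Z_k
    return {s: w / z for s, w in weights.items()}


def sample_user(refs, phis, alpha, n_comparisons, rng):
    """Generate one user's comparisons following steps 1 and 2."""
    n_components = len(refs)
    # Step 1: theta_m from a symmetric Dirichlet (assumed prior),
    # drawn as normalized Gamma variates.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_components)]
    theta = [g / sum(gammas) for g in gammas]
    pmfs = [mallows_pmf(ref, phi) for ref, phi in zip(refs, phis)]
    items = list(refs[0])
    comparisons = []
    for _ in range(n_comparisons):
        i, j = rng.sample(items, 2)  # step 2.1 with uniform mu (assumption)
        k = rng.choices(range(n_components), weights=theta)[0]  # step 2.2
        rankings = list(pmfs[k])
        probs = list(pmfs[k].values())
        sigma = rng.choices(rankings, weights=probs)[0]  # step 2.3
        # Step 2.4: the item placed earlier in sigma wins the comparison.
        i_first = sigma.index(i) < sigma.index(j)
        comparisons.append((i, j) if i_first else (j, i))
    return theta, comparisons


rng = random.Random(0)
theta, comparisons = sample_user(
    refs=[(0, 1, 2, 3), (3, 2, 1, 0)], phis=[0.3, 0.3],
    alpha=0.5, n_comparisons=6, rng=rng,
)
```

Each returned tuple `(i, j)` means that the user preferred item `i` over item `j`. For realistic item counts, enumeration would have to be replaced by a dedicated Mallows sampler.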

Figure 1 is the plate representation of M4. The mixing weights $\theta_m$ over the $K$ shared Mallows components characterize each user. We denote by $X$ the matrix of empirical observations. Its rows are indexed by all the ordered pairs $(i, j)$, and $X_{(i,j),m}$ denotes the number of times that user $m$ prefers item $i$ over $j$. Given $X$ and $K$, the primary problem in this paper is to learn the parameters of the shared latent Mallows components.

Reduction to Topic Modeling

We show that the problem of learning model parameters in M4 can be formally reduced to topic discovery in an equivalent topic model. To establish the connection, we first consider the distribution on the pairwise comparisons $w_{m,n}$,

$$p(w_{m,n} = (i,j) \mid \theta_m) = \mu_{i,j} \sum_{k=1}^{K} \sum_{\sigma:\, \sigma(i) < \sigma(j)} p_M(\sigma \mid \sigma_k, \phi_k)\, \theta_{k,m}$$

where $\mu_{i,j}$ is the probability of comparing items $i$ and $j$. For further reference, we define the ranking matrix $\beta$ to be a $W \times K$ matrix, with one row per ordered pair, whose entries are

$$\beta_{(i,j),k} := \sum_{\sigma:\, \sigma(i) < \sigma(j)} p_M(\sigma \mid \sigma_k, \phi_k) \tag{2}$$

Statistically, $\beta_{(i,j),k}$ represents the probability that item $i$ is preferred over item $j$ if the ranking is sampled from the $k$-th Mallows component. The $k$-th column of $\beta$ therefore captures the pairwise comparison behavior induced by the $k$-th Mallows distribution and is a function determined only by $(\sigma_k, \phi_k)$. For convenience, we also define a matrix $B$ with entries $B_{(i,j),k} := \mu_{i,j}\, \beta_{(i,j),k}$. Therefore, the conditional probability of the comparisons can be simplified as

$$p(w_{m,n} = (i,j) \mid \theta_m) = \sum_{k=1}^{K} B_{(i,j),k}\, \theta_{k,m} \tag{3}$$
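To make Eq. (2) concrete, the following sketch computes one column of the ranking matrix by brute-force enumeration and then scales it into the corresponding column of $B$. It assumes a uniform comparison distribution $\mu$ over unordered pairs; the function name `ranking_matrix_column` is hypothetical, and the enumeration is only practical for a few items.

```python
from itertools import combinations, permutations


def kendall_tau(sigma, ref):
    """Count item pairs ordered differently by the two rankings."""
    pos_s = {item: r for r, item in enumerate(sigma)}
    pos_r = {item: r for r, item in enumerate(ref)}
    return sum(
        1
        for a, b in combinations(ref, 2)
        if (pos_s[a] - pos_s[b]) * (pos_r[a] - pos_r[b]) < 0
    )


def ranking_matrix_column(ref, phi):
    """Eq. (2): beta[(i, j)] = P(i placed before j) under Mallows(ref, phi)."""
    rankings = list(permutations(ref))
    weights = [phi ** kendall_tau(s, ref) for s in rankings]
    z = sum(weights)
    beta = {}
    for i, j in permutations(ref, 2):  # all ordered pairs (i, j), i != j
        mass = sum(
            w for s, w in zip(rankings, weights) if s.index(i) < s.index(j)
        )
        beta[(i, j)] = mass / z
    return beta


items = (0, 1, 2)
beta_k = ranking_matrix_column(ref=items, phi=0.5)
# Assumed uniform comparison distribution mu over the unordered pairs.
mu = 1.0 / len(list(combinations(items, 2)))
B_k = {pair: mu * p for pair, p in beta_k.items()}
```

In this toy column, each ordered pair and its reverse sum to one in `beta_k`, and the entries of `B_k` sum to one, which is the column-stochastic property used in the reduction to topic models.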

Before we connect to topic modeling, we summarize the properties of the ranking matrix $\beta$ that enable us to infer the Mallows parameters directly from $\beta$:

###### Proposition 1.

Let the ranking matrix $\beta$ be defined as in Eq. (2), and let $\sigma_k$, $\phi_k$ be the parameters of the Mallows distributions. Then, for all item pairs $(i, j)$ and all components $k$, we have,

1. If and ,

2. If and ,

First, by Prop. 1 a., we can directly infer $\beta$ from $B$. Second, by Prop. 1 b., one can infer the relative position of any two items in the reference rankings by comparing the entries in $\beta$ with $1/2$. Therefore, if the estimation error in $\beta$ is element-wise small and $\phi_k < 1$, then all the pairwise relations in the reference rankings can be correctly inferred, and hence the total rankings. Furthermore, the dispersion $\phi_k$ can be estimated using Prop. 1 c. (If $\phi_k = 1$, the $k$-th Mallows component is the uniform distribution and is unidentifiable. We consider $\phi_k < 1$ in this paper.) In sum, we can learn all the model parameters from $B$. For the rest of this paper, we focus on learning $B$.
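As a rough illustration of this inference step, the sketch below recovers a reference ranking and a dispersion estimate from one (noisy) column of the ranking matrix. It is not the paper's Alg. 4. It relies on two standard properties of the Mallows distribution, which are assumed here to match what Prop. 1 states: an item ranked above another in the reference ranking wins that comparison with probability above 1/2 when the dispersion is below one, and items adjacent in the reference ranking are ordered correctly with probability $1/(1+\phi_k)$. The function name `infer_mallows_params` and the toy numbers are hypothetical.

```python
def infer_mallows_params(beta, items):
    """Estimate (reference ranking, phi) from one column of the ranking matrix.

    beta maps ordered pairs (i, j) to the estimated probability that
    item i is preferred over item j.
    """
    # An item's rank follows from how many other items it beats
    # with probability above 1/2.
    wins = {
        i: sum(beta[(i, j)] > 0.5 for j in items if j != i) for i in items
    }
    ranking = tuple(sorted(items, key=lambda i: -wins[i]))
    # For adjacent items (a above b): beta = 1 / (1 + phi), hence
    # phi = (1 - beta) / beta. Average the estimates over adjacent pairs.
    estimates = [
        (1.0 - beta[(a, b)]) / beta[(a, b)]
        for a, b in zip(ranking, ranking[1:])
    ]
    return ranking, sum(estimates) / len(estimates)


# Toy column: pairwise marginals of a 3-item Mallows component with
# reference ranking (2, 0, 1) and phi = 0.5, slightly perturbed.
beta_hat = {
    (2, 0): 0.66, (0, 2): 0.34,
    (0, 1): 0.67, (1, 0): 0.33,
    (2, 1): 0.76, (1, 2): 0.24,
}
ranking, phi_hat = infer_mallows_params(beta_hat, items=(0, 1, 2))
```

The ranking step only needs every entry to fall on the correct side of 1/2, which is why element-wise small estimation error suffices.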

We note that Eq. (3) shares the same structure as in probabilistic topic modeling (Blei, 2012; Airoldi et al., 2014). We consider a topic model on a set of $M$ documents, each composed of $N$ words that are drawn from a vocabulary of size $W$, with a $W \times K$ topic matrix $\beta^{TM}$, and document-specific topic weights $\theta^{TM}_m$ sampled independently from a topic prior. The conditional distribution on $w^{TM}_{m,n}$, the $n$-th word in document $m$, is

$$p(w^{TM}_{m,n} = i \mid \theta^{TM}_m) = \sum_{k=1}^{K} \beta^{TM}_{i,k}\, \theta^{TM}_{k,m} \tag{4}$$

where $i = 1, \ldots, W$ are distinct words in the vocabulary. Noting that $B$ is also column-stochastic, we have,

###### Lemma 1.

The proposed Mixed Membership Mallows Model is statistically equivalent to a topic model whose topic matrix is set to be $B$ and whose topic prior is set to be the prior on the ranking weights $\theta_m$.

###### Proof.

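We consider the distribution on the observations in both models, i.e., the distribution on the outcomes of pairwise comparisons in M4 and on the words in the topic model. Noting that the users are independent conditioned on $B$, from the conditional probabilities in Eq. (3) and (4), we have,

$$\begin{aligned} p(w \mid B) &= \prod_{m=1}^{M} \int p(w_{m,1}, \ldots, w_{m,N} \mid \theta_m, B)\, \Pr(\theta_m)\, d\theta_m \\ &= \prod_{m=1}^{M} \int \Big( \prod_{n=1}^{N} \sum_{k=1}^{K} B_{w_{m,n},k}\, \theta_{k,m} \Big) \Pr(\theta_m)\, d\theta_m \\ &= p(w^{TM} \mid \beta^{TM}), \end{aligned}$$

which is the same as in topic models (Blei, 2012). ∎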

Thus, the estimation problem in M4 can be solved by first learning $B$ using any topic modeling algorithm, and then estimating the parameters of the shared Mallows components using Prop. 1. Before we discuss our approach in detail in the next section, we consider the relation between M4 and other ranking models. We highlight that the proposed M4 is a much more general family that subsumes a few existing ranking models as special cases:

###### Proposition 2.

In Mixed Membership Mallows Model,

1. If the dispersion parameters $\phi_k = 0$ for all $k$, then each Mallows component has non-zero probability only on the reference ranking $\sigma_k$, and the Mixed Membership Mallows Model reduces to the topic modeling framework proposed in (Ding et al., 2015).

2. If the topic prior has non-zero probability only on the vertices of the $K$-dimensional simplex, then each user can only be influenced by one Mallows component and the Mixed Membership Mallows Model reduces to the mixture of Mallows model (Lu & Boutilier, 2014; Awasthi et al., 2014).

## 3 A Geometric Approach

We discuss in this section the key geometric insights of our approach. We leverage recent work in separable topic discovery that comes with consistency and efficiency guarantees (Arora et al., 2013; Ding et al., 2014b; Kumar et al., 2013; Bansal et al., 2014, etc.). The consistency is favorable here since we are not enforcing the estimates to be valid total rankings. To be precise, we exploit the geometric property of the second-order moments of the columns of $X$, i.e., a co-occurrence matrix of pairwise comparisons, which can be estimated consistently:

###### Lemma 2.

If $\tilde{X}$ and $\tilde{X}'$ are obtained from $X$ by first splitting each user’s comparisons into two independent halves and then re-scaling the rows to make them row-stochastic, then

$$M \tilde{X}' \tilde{X}^{\top} \xrightarrow[\text{almost surely}]{M \to \infty} \bar{B} \bar{R} \bar{B}^{\top} =: E \tag{5}$$

where $\bar{B}$ and $\bar{R}$ are normalized versions of $B$ and $R$, and $\mathbf{a}$ and $R$ are, respectively, the expectation and correlation matrix of the weight vector $\theta_m$.

In this paper, we always assume that $R$ (the topic co-occurrence matrix) has full rank, which is satisfied by many important prior distributions (Arora et al., 2013).
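The empirical statistic on the left-hand side of Eq. (5) can be sketched in plain Python as below. This is a small illustration under stated assumptions rather than the authors' implementation: each user's comparisons are split at random into two halves, rows with no observations are left as zeros, and the function name `cooccurrence_estimate` is hypothetical.

```python
import random


def cooccurrence_estimate(user_comparisons, pairs, rng):
    """Return M * X_tilde_prime @ X_tilde^T as a nested list (W x W).

    user_comparisons: one list of ordered pairs per user.
    pairs: list of all ordered pairs, fixing the row order.
    """
    n_users = len(user_comparisons)
    row = {pair: r for r, pair in enumerate(pairs)}
    first = [[0.0] * n_users for _ in pairs]
    second = [[0.0] * n_users for _ in pairs]
    for m, comps in enumerate(user_comparisons):
        comps = list(comps)
        rng.shuffle(comps)  # random split into two halves
        half = len(comps) // 2
        for pair in comps[:half]:
            first[row[pair]][m] += 1.0
        for pair in comps[half:]:
            second[row[pair]][m] += 1.0

    def row_stochastic(matrix):
        # Rescale each row to sum to one; all-zero rows stay zero.
        out = []
        for r in matrix:
            total = sum(r)
            out.append([v / total if total > 0 else 0.0 for v in r])
        return out

    x_tilde, x_tilde_prime = row_stochastic(first), row_stochastic(second)
    return [
        [n_users * sum(a * b for a, b in zip(rp, r)) for r in x_tilde]
        for rp in x_tilde_prime
    ]


pairs = [(0, 1), (1, 0)]
users = [[(0, 1), (0, 1)], [(0, 1), (0, 1)]]
E_hat = cooccurrence_estimate(users, pairs, random.Random(0))
```

Using two disjoint halves keeps the two factors independent given each user's weights, which is what lets the product converge to a second-order moment instead of being biased by sampling noise.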

### 3.1 Approximate Separability

The consistent separable topic discovery approaches (e.g., Arora et al., 2013; Ding et al., 2014b) require the ranking matrix $\beta$ to be exactly separable, i.e., for each $k$, there exist some novel rows (i.e., ordered pairs $(i, j)$) such that $\beta_{(i,j),k} > 0$ and $\beta_{(i,j),l} = 0$ for all $l \neq k$. If this exact separability condition holds, the row vectors in $E$ of the novel pairs will be extreme points of the convex hull formed by all row vectors of $E$ (the shaded dashed circles in Fig. 2).

By the definition of the ranking matrix in Eq. (2), for $\phi_k > 0$, none of the entries in the ranking matrix is identically zero. Hence exact separability cannot be satisfied. However, recall that $\beta_{(i,j),k}$ is the probability of preferring item $i$ over $j$ in the $k$-th Mallows component. By the property of the Mallows distribution, $\beta_{(i,j),k}$ will be very close to 0 if item $i$ is ranked below item $j$ in the reference ranking by a large margin. Explicitly,

###### Proposition 3.

Let and be the positions of items and in the reference ranking of the -th Mallows component and . If and , then,

$$\beta_{(i,j),k} \le \frac{L \phi_k^{L-1}}{1 + L \phi_k^{L-1}} \tag{6}$$

Since $\phi_k < 1$, as $L$ increases the corresponding $\beta_{(i,j),k}$ becomes arbitrarily close to 0. Motivated by this observation in Prop. 3, we propose to consider ranking matrices that are approximately separable:

###### Definition 1.

(-Approximate Separability) A non-negative matrix is -approximately separable for some constant , if , there exists at least one row (i.e., ordered pair) such that and , .

The $\lambda$-approximate separability requires the existence of ordered pairs that have negligible probability in all but one Mallows component, i.e., the row weights concentrate predominantly in one column (see Fig. 2). We will refer to such pairs (rows of $\beta$) as $\lambda$-approximate novel pairs (rows) for each latent factor. By Prop. 3 for M4, approximate separability boils down to the existence of pairs of items $(i, j)$ such that $i$ is uniquely preferred over $j$ in one reference ranking, while $j$ is ranked higher than $i$ by a large margin in all other reference rankings.

For small $\lambda$, this seems to be a very restrictive condition on the shared latent Mallows distributions. However, as we show in the next section, most M4 models are approximately separable for a small constant $\lambda$ if the number of items scales sufficiently faster than $K$. Therefore, only a negligible fraction of models in M4 do not satisfy approximate separability.

### 3.2 Inevitability of the Approximate Separability

We investigate the probability that approximate separability is satisfied when we draw uniformly from M4. Specifically, we sample the $K$ reference rankings i.i.d. uniformly from the set of all permutations, and set $\phi_k = \phi$ for all $k$. We have,

###### Lemma 3.

Let the $K$ reference rankings be sampled i.i.d. uniformly from the set of all permutations, and let the dispersion parameters be $\phi_k = \phi$ for all $k$. Then, the probability that the ranking matrix $\beta$ is $\lambda$-approximately separable is at least

$$1 - K \exp\!\left(-\frac{Q}{L(\phi, \lambda)\, 2^{K-1}}\right) \tag{7}$$

where for some positive constant , and is the minimum integer that is no smaller than .

Therefore, when the number of items is sufficiently large, the ranking matrix is approximately separable with high probability. $L(\phi, \lambda)$ is determined by $\phi$ and $\lambda$, and remains small even for very small $\lambda$ because of the logarithmic dependence. The proof exploits the property illustrated in Prop. 3 and is deferred to the supplementary section. We note that the result in Eq. (7) is only a loose upper bound on the probability of non-separability.

We point out that, by definition, approximate separability of $\beta$ is equivalent to that of $B$. Therefore $B$ is also approximately separable with high probability.

### 3.3 Robust Novel Pair Detection

Recall that when $\beta$ is exactly separable, the novel rows in $E$ are extreme points (shaded dashed circles in Fig. 2). If $\beta$ is $\lambda$-approximately separable with small enough $\lambda$, the rows of $E$ can be viewed as a small perturbation from the ideal case. As a result, the rows corresponding to the approximate novel pairs will be inside the ideal convex hull and close to the ideal extreme points (see Fig. 2). On the other hand, the non-novel rows could become extreme points but would be close to the convex hull formed by the approximate novel rows (see Fig. 2).

We detect the approximate novel pairs as the most “extreme” rows of $E$ based on a robust geometric measure, the normalized solid angle subtended by extreme points (see Fig. 2) (Ding et al., 2014b). Statistically, it is the probability that a row vector has the maximum projection value along an isotropically distributed direction $d$:

$$q_{(i,j)} \triangleq p\big\{ \forall (s,t): \| E_{(i,j)} - E_{(s,t)} \| \ge \zeta,\ E_{(i,j)}\, d > E_{(s,t)}\, d \big\} \tag{8}$$

When $\beta$ is exactly separable, $q_{(i,j)} = 0$ for non-novel pairs and $q_{(i,j)}$ is strictly positive for novel pairs. When the deviation introduced by $\lambda$-approximate separability is small, the solid angles for approximate novel pairs will be close to those of the ideal extreme points. For the non-novel pairs that become extreme points due to $\lambda$-approximate separability (see Fig. 2), the associated solid angles will be close to 0 since these rows are very close to the convex hull formed by the rows of the approximate novel pairs. In summary, if we sort the solid angles for all rows in $E$, the ones with the largest solid angles must correspond to $\lambda$-approximate novel pairs for some constant $\lambda$ and a properly defined $\zeta$ in Eq. (8).

By the definition in Eq. (8), the solid angles can be consistently approximated using a few i.i.d. isotropic directions $d$ and an asymptotically consistent estimate of $E$ (Ding et al., 2014b). Once all the approximate novel pairs for the $K$ distinct Mallows components are identified, $B$ and therefore the model parameters can be estimated using constrained linear regression (Arora et al., 2013; Ding et al., 2014b) and Prop. 1. Given the estimated parameters of the ranking matrix, we can infer the user-specific preference weights $\theta_m$ and evaluate the prediction probability of new comparisons using standard inference in topic modeling (Blei, 2012).
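The Monte Carlo approximation of the solid angle in Eq. (8) can be sketched as follows. This is an illustrative reading of the definition, not the paper's Alg. 2: isotropic directions are drawn as standard Gaussian vectors, the matrix `E` is a toy example with three extreme rows and one interior row, and the function name `estimate_solid_angles` is hypothetical.

```python
import math
import random


def estimate_solid_angles(E, zeta, n_projections, rng):
    """Fraction of random directions along which each row of E beats
    every row that is at least zeta away from it (cf. Eq. (8))."""
    n_rows, dim = len(E), len(E[0])

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Rows closer than zeta are treated as near-duplicates and not compared.
    far = [
        [s for s in range(n_rows) if dist(E[r], E[s]) >= zeta]
        for r in range(n_rows)
    ]
    hits = [0] * n_rows
    for _ in range(n_projections):
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # isotropic direction
        proj = [sum(x * y for x, y in zip(row_vec, d)) for row_vec in E]
        for r in range(n_rows):
            if all(proj[r] > proj[s] for s in far[r]):
                hits[r] += 1
    return [h / n_projections for h in hits]


# Three extreme rows and their centroid, which lies inside the convex hull.
E = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1 / 3, 1 / 3, 1 / 3],
]
q = estimate_solid_angles(E, zeta=0.1, n_projections=600, rng=random.Random(0))
```

Sorting the rows by `q` and keeping the largest values that are mutually at least `zeta` apart would then give candidate novel pairs. The interior row never attains the maximum projection, so its estimated solid angle is zero.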

## 4 Algorithm and Complexity Bounds

The main steps of our approach are outlined in Alg. 1 and expanded in detail in Algs. 2, 3, and 4. Alg. 2 detects all the approximate novel pairs for the $K$ distinct latent components. Alg. 3 estimates the matrix $B$ using constrained linear regression followed by row scaling. Alg. 4 further infers the model parameters from the estimate of $B$ using Prop. 1. In particular, Step 2 of Alg. 4 estimates all the pairwise relations in the reference rankings (which is an equivalent representation of a total ranking), and Step 4 estimates the dispersion parameters $\phi_k$.

Our approach has an overall polynomial computation complexity in all model parameters,

###### Theorem 1.

The running time of Algorithm 1 is .

The proofs are in the supplementary material. We note that the term is a loose upper bound for linear regression in Alg. 3. We also derive sample complexity bounds for Alg. 1, which are also polynomial in all model parameters and in $\log(1/\delta)$, where $\delta$ is the error probability. Formally,

###### Theorem 2.

Let the ranking matrix be -approximate separable and the second order moments of ranking prior to be full rank. If

 λ≤aminκ(1−ϕ)q∧8K2a0√log(W/q∧) (9)

and , then, Algorithm 1 can consistently recover all the reference rankings of the latent Mallows distributions. Moreover, , if

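$$M \ge \max\left\{ \frac{640\, W^2 \log(3W/\delta)}{N \eta^4 d^2 q_\wedge^2},\ \frac{320\, W \log(3W/\delta)}{N \eta^4 \lambda_{\min}^2 a_{\min}^2 (1-\phi)^2} \right\}$$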

and for

$$P \ge \frac{32 \log(3W/\delta)}{q_\wedge^2}$$

the proposed algorithm fails with probability at most . The other model parameters are defined as follows: ; , are the max/min of entries of ; ; ; is the condition number of ; be the minimum normalized solid angle formed by row vectors of ; ; . is the number of comparisons of each user.

The detailed proofs are summarized in the supplementary file. Eq. (9) provides an explicit sufficient upper bound on the required degree $\lambda$ of approximate separability. It is roughly inverse polynomial in $K$. By Prop. 3, the margin $L$ required to satisfy Eq. (9) is small.

We note that in the complexity bounds, the term $1 - \phi$ represents the spread of the Mallows components and determines the hardness of estimation: for smaller $\phi$, $\lambda$ can be larger and the required $M$ is smaller. When $\phi \to 1$, Eq. (9) reduces to $\lambda \le 0$ and $M \to \infty$, which is not achievable, and the corresponding Mallows distribution is unidentifiable.

## 5 Experimental validation

We conduct experiments to validate the performance of our proposed approach on a semi-synthetic dataset where the M4 assumptions are satisfied, and then demonstrate that the proposed M4 can indeed effectively capture preference behavior in real-world datasets. In all experiments, we used the settings suggested by (Ding et al., 2014b) for the number of random projections and for the tolerance parameters in Alg. 2 and Alg. 3.

### 5.1 Semi-synthetic Simulation

We generate synthetic examples according to proposed M4 and evaluate the proposed algorithm using reconstruction error measured by the Kendall’s tau distance between the estimated reference rankings and the ground-truth. Since our estimation is up to a column permutation, we align the estimated reference rankings using bipartite matching based on the Kendall’s tau distance.
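The evaluation described above can be sketched as follows: estimated rankings are matched to the ground-truth rankings so that the total Kendall's tau distance is minimal, and the matched distances are averaged. This is a minimal sketch with two assumptions: the matching is found by trying all assignments (fine for a handful of components; a Hungarian-style solver would be the usual choice for larger $K$), and the error is normalized by the maximum possible distance between two rankings. The function name `aligned_error` is hypothetical.

```python
from itertools import combinations, permutations


def kendall_tau(sigma, ref):
    """Count item pairs ordered differently by the two rankings."""
    pos_s = {item: r for r, item in enumerate(sigma)}
    pos_r = {item: r for r, item in enumerate(ref)}
    return sum(
        1
        for a, b in combinations(ref, 2)
        if (pos_s[a] - pos_s[b]) * (pos_r[a] - pos_r[b]) < 0
    )


def aligned_error(estimated, truth):
    """Best-matching normalized Kendall's tau error and the matching used."""
    n_components, n_items = len(truth), len(truth[0])
    max_distance = n_items * (n_items - 1) / 2  # assumed normalizer

    def total(assignment):
        # assignment[k] is the estimated ranking matched to truth[k].
        return sum(
            kendall_tau(estimated[assignment[k]], truth[k])
            for k in range(n_components)
        )

    best = min(permutations(range(n_components)), key=total)
    return total(best) / (n_components * max_distance), best


truth = [(0, 1, 2, 3), (3, 2, 1, 0)]
estimated = [(3, 2, 0, 1), (0, 1, 2, 3)]  # columns arrive in swapped order
error, matching = aligned_error(estimated, truth)
```

In this toy case the matching swaps the two estimated rankings, and the only remaining disagreement is a single inverted pair in the second component.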

The ground-truth reference rankings are obtained from a real-world movie rating dataset, Movielens, using the same approach as in (Ding et al., 2015). We set the same dispersion parameter for all Mallows components. We use a symmetric Dirichlet prior to generate the ranking weights $\theta_m$. The error is further normalized and averaged across the reference rankings.

Fig. 3 depicts how the estimation error varies with the number of users with different values of dispersion. We can see that the reconstruction error in reference rankings for converges to zero at different rates as a function of . For M4 with , it converges to a small but non-zero number when . We note that for the ground-truth ranking matrix , it is approximate separable for respectively. Our approach therefore can correctly detect the reference rankings when is small. When is mild, it can still detect most of the reference rankings correctly. 333For a random with , it is -approximate separable with probability for in a 1000 Monte Carlo runs.

### 5.2 Comparison prediction - Movielens

We consider in this section prediction of pairwise comparisons in a benchmark real-world dataset, Movielens. The star rating dataset is selected due to public availability and widespread use, but we convert it to pairwise comparisons and focus on modeling from the partial ranking viewpoint, as suggested in the ranking literature (Lu & Boutilier, 2014; Volkovs & Zemel, 2014; Ding et al., 2015).

We focus on the most frequently rated movies in Movielens, split off the first users for training, and use the remaining users for testing (Lu & Boutilier, 2014). We convert the training and test ratings into comparisons independently: for every pair of movies $(i, j)$ that a user rated, the comparison $(i, j)$ is added if the star rating for $i$ is higher than that for $j$, and all ties are ignored. The prior is set to be Dirichlet, and it is estimated using the methods in (Arora et al., 2013) given the estimated ranking matrix.
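The conversion from star ratings to pairwise comparisons can be sketched in a few lines. The function name `ratings_to_comparisons` and the toy ratings are hypothetical; the rule itself (higher star rating wins, ties dropped) follows the description above.

```python
from itertools import combinations


def ratings_to_comparisons(user_ratings):
    """Turn {movie: stars} into ordered pairs (preferred, other).

    Pairs of movies with equal star ratings are skipped.
    """
    comparisons = []
    for a, b in combinations(sorted(user_ratings), 2):
        if user_ratings[a] > user_ratings[b]:
            comparisons.append((a, b))
        elif user_ratings[b] > user_ratings[a]:
            comparisons.append((b, a))
    return comparisons


example = ratings_to_comparisons({"A": 5, "B": 3, "C": 3})
```

Here the tie between `"B"` and `"C"` produces no comparison, so only the two pairs involving `"A"` remain.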

We evaluate the performance by the held-out log-likelihood of the test comparisons. The log-likelihoods are calculated using the standard Gibbs sampling approximation in (Wallach et al., 2009) and are then normalized by the total number of comparisons in the testing phase. We compared our new model (M4) against the topic-modeling-based model in (Ding et al., 2015) (TM), whose settings are closest to our model. We summarize the predictive probability for different $K$ in Fig. 4. One can see that M4 improves the prediction accuracy of TM for different choices of $K$ and can better fit the real-world observations.

### 5.3 Rating prediction via ranking model - Movielens

To further demonstrate that our model can capture real-world user behavior, we consider the standard rating prediction task in recommendation system (Toscher et al., 2009). We first train M4 using the training comparisons, and then predict ratings by aggregating the prediction of properly defined test comparisons. The purpose of this experiment is not to optimize to achieve the best empirical result in the rich literature on rating prediction.

We use the same training/testing rating split as (Salakhutdinov & Mnih, 2008a), and focus on the most rated movies in Movielens following (Ding et al., 2015). We convert the training ratings into training comparisons (for each user, all pairs of movies she rated in the training set are converted into comparisons based on the stars, and ties are ignored) and train an M4 model. The ranking prior is set to be Dirichlet. To predict the star rating $r_{i,m}$ of user $m$ for movie $i$, we consider the following method: for each candidate rating $s$, we set $r_{i,m} = s$ and compare it against the movies user $m$ has rated in the training set. This generates a set of pairwise comparisons $w_{i,m}(s)$. We choose $s$ such that,

$$\hat{r}_{i,m} = \arg\max_{s}\; p\big(w_{i,m}(s) \mid w^{\text{train}}, \hat{\beta}\big).$$
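The arg-max rule can be illustrated with a simplified sketch. It departs from the procedure in the text in ways that should be read as assumptions: the user's mixing weights are replaced by a fixed point estimate `theta` instead of being integrated out given the training comparisons, the pair-selection probabilities $\mu_{i,j}$ are ignored, and candidate ratings are taken to be the integers 1 to 5. The function name `predict_rating` and all numbers are hypothetical.

```python
import math


def predict_rating(target, train_ratings, beta_hat, theta, stars=(1, 2, 3, 4, 5)):
    """Pick the candidate rating whose implied comparisons are most likely.

    beta_hat: one dict per component mapping (winner, loser) to a probability.
    theta: point estimate of the user's mixing weights (assumption).
    """

    def pair_prob(winner, loser):
        # Probability of this ordered outcome given that the pair is compared.
        return sum(t * col[(winner, loser)] for t, col in zip(theta, beta_hat))

    def log_likelihood(s):
        total = 0.0
        for movie, stars_given in train_ratings.items():
            if s > stars_given:
                total += math.log(pair_prob(target, movie))
            elif s < stars_given:
                total += math.log(pair_prob(movie, target))
            # A tie generates no comparison and contributes nothing.
        return total

    return max(stars, key=log_likelihood)


beta_hat = [{
    ("T", "A"): 0.9, ("A", "T"): 0.1,
    ("T", "B"): 0.2, ("B", "T"): 0.8,
}]
predicted = predict_rating(
    "T", train_ratings={"A": 2, "B": 4}, beta_hat=beta_hat, theta=[1.0]
)
```

In this toy example the target movie is likely preferred over `"A"` (2 stars) and unlikely to be preferred over `"B"` (4 stars), so ratings strictly above 4 or at most 2 are penalized. Because ties drop out of the likelihood, candidate ratings that tie with many training movies are scored on fewer comparisons.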

We evaluate the performance using the standard root-mean-square-error (RMSE) metric (Toscher et al., 2009). We compared our approach, M4, against the topic-modeling-based method in (Ding et al., 2015) (TM), and two benchmark rating-based algorithms, Probabilistic Matrix Factorization (PMF) in (Salakhutdinov & Mnih, 2008b), and Bayesian Probabilistic Matrix Factorization (BPMF) in (Salakhutdinov & Mnih, 2008a), which have robust empirical performance. (We use the suggested settings to optimize the hyper-parameters and use the implementation and data split from http://www.cs.toronto.edu/~rsalakhu/BPMF.html.) Both PMF and BPMF are latent factor models, and the number of latent factors has a similar interpretation as $K$ in M4. Since the ratings predicted by our algorithm are integers, we also round the output of BPMF to the nearest integers in the same range (BPMF-int).

We report the RMSE for different choices of $K$ in Table 2. It is clear that M4 improves upon the ranking-based TM, in which the latent factors are restricted to single permutations. On the other hand, when compared to the rating-based algorithms, the RMSE of our M4 approach can match BPMF and outperforms BPMF-int and PMF, although they come from a different feature space. We note that BPMF typically provides robust benchmark results on real-world problems. This demonstrates that our approach can accommodate noisy real-world user behavior.

## References

• Airoldi et al. (2014) Airoldi, E. M., Blei, D., Erosheva, E. A., and Fienberg, S. E. Handbook of Mixed Membership Models and Their Applications. Chapman and Hall/CRC, 2014.
• Arora et al. (2013) Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, M. A practical algorithm for topic modeling with provable guarantees. In Proc. of the 30th International Conference on Machine Learning, Atlanta, GA, USA, Jun. 2013.
• Awasthi et al. (2014) Awasthi, P., Blum, A., Sheffet, O., and Vijayaraghavan, A. Learning mixtures of ranking models. In Advances in Neural Information Processing Systems. Montreal, Canada, Dec. 2014.
• Azari Soufiani et al. (2013) Azari Soufiani, H., Diao, H., Lai, Z., and Parkes, D. C. Generalized random utility models with multiple types. In Advances in Neural Information Processing Systems, pp. 73–81. Lake Tahoe, NV, USA, Dec. 2013.
• Bansal et al. (2014) Bansal, T., Bhattacharyya, C., and Kannan, R. A provable svd-based algorithm for learning topics in dominant admixture corpus. In Advances in Neural Information Processing Systems, pp. 1997–2005, 2014.
• Benson et al. (2014) Benson, A., Lee, J., Rajwa, B., and Gleich, D. Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices. In Advances in Neural Information Processing Systems, Montreal, Canada, Dec. 2014.
• Blei (2012) Blei, D. Probabilistic topic models. Commun. of the ACM, 55(4):77–84, 2012.
• Busse et al. (2007) Busse, Ludwig M, Orbanz, Peter, and Buhmann, Joachim M. Cluster analysis of heterogeneous rank data. In Proceedings of the 24th international conference on Machine learning, pp. 113–120, 2007.
• Ding et al. (2013) Ding, W., Rohban, M. H., Ishwar, P., and Saligrama, V. Topic discovery through data dependent and random projections. In Proc. of the 30th International Conference on Machine Learning, Atlanta, GA, USA, Jun. 2013.
• Ding et al. (2014a) Ding, W., Ishwar, P., and Saligrama, V. A Topic Modeling approach to Rank Aggregation. In Advances in Neural Information Processing Systems, workshop on Analysis of Rank data, Montreal, Canada, Dec. 2014a.
• Ding et al. (2014b) Ding, W., Rohban, M. H., Ishwar, P., and Saligrama, V. Efficient Distributed Topic Modeling with Provable Guarantees. In Proc. of the 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, Apr. 2014b.
• Ding et al. (2015) Ding, W., Ishwar, P., and Saligrama, V. A Topic Modeling approach to Ranking. In Proc. of the 18th International Conference on Artificial Intelligence and Statistics, San Diego, CA, May 2015.
• Farias et al. (2009) Farias, V., Jagabathula, S., and Shah, D. A data-driven approach to modeling choice. In Advances in Neural Information Processing Systems. Vancouver, Canada, Dec. 2009.
• Gillis & Vavasis (2014) Gillis, N. and Vavasis, S. A. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 36(4):698–714, 2014.
• Gormley & Murphy (2008) Gormley, I. and Murphy, T. A mixture of experts model for rank data with applications in election studies. The Annals of Applied Statistics, pp. 1452–1477, 2008.
• Kumar et al. (2013) Kumar, A., Sindhwani, V., and Kambadur, P. Fast conical hull algorithms for near-separable non-negative matrix factorization. In the 30th Int. Conf. on Machine Learning, Atlanta, GA, Jun. 2013.
• Lebanon & Lafferty (2002) Lebanon, G. and Lafferty, J. D. Cranking: Combining rankings using conditional probability models on permutations. In Proc. of the 19th Int. Conf. on Machine Learning (ICML), 2002.
• Lu & Boutilier (2014) Lu, T. and Boutilier, C. Effective Sampling and Learning for Mallows Models with Pairwise-Preference Data. Journal of Machine Learning Research, 2014.
• Mallows (1957) Mallows, C. L. Non-null ranking models. i. Biometrika, pp. 114–130, 1957.
• Marden (1995) Marden, J. I. Analyzing and modeling rank data. Chapman and Hall, 1995.
• Oh & Shah (2014) Oh, S. and Shah, D. Learning mixed multinomial logit model from ordinal data.
• Rajkumar & Agarwal (2014) Rajkumar, A. and Agarwal, S. A statistical convergence perspective of algorithms for rank aggregation from pairwise data. In Proc. of the 31st International Conference on Machine Learning, Beijing, China, Jun. 2014.
• Salakhutdinov & Mnih (2008a) Salakhutdinov, R. and Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proc. of the 25th International Conference on Machine Learning, pp. 880–887, Helsinki, Finland, Jun. 2008a.
• Salakhutdinov & Mnih (2008b) Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. In Advances in neural information processing systems, pp. 1257–1264, 2008b.
• Toscher et al. (2009) Toscher, A., Jahrer, M., and Bell, R. M. The BigChaos solution to the Netflix grand prize, 2009.
• Volkovs & Zemel (2014) Volkovs, M. and Zemel, R. New learning methods for supervised and unsupervised preference aggregation. Journal of Machine Learning Research, 15:1135–1176, 2014.
• Wallach et al. (2009) Wallach, H. M., Murray, I., Salakhutdinov, R., and Mimno, D. Evaluation methods for topic models. In Proc. of the 26th International Conference on Machine Learning, Montreal, Canada, Jun. 2009.

## Appendix A Proof for Proposition 1 in the main paper

We first consider the property of the ranking matrix for M4 as summarized in Proposition 1 in the main paper. Recall that the ranking matrix in M4 is defined as

$$\beta_{(i,j),k} := \sum_{\sigma:\ \sigma(i)<\sigma(j)} p_M(\sigma \mid \sigma_k, \phi_k) \tag{10}$$
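For a handful of items, the sum in Eq. (10) can be evaluated by brute force. The sketch below assumes the standard Mallows form $p_M(\sigma \mid \sigma_k, \phi_k) \propto \phi_k^{d(\sigma,\sigma_k)}$ with $d$ the Kendall tau distance. That definition belongs to the main paper rather than this appendix, so treat it as an assumption here. Rankings are represented as best-first item tuples, and the function names are illustrative.

```python
from itertools import permutations
from typing import Dict, Hashable, Sequence, Tuple


def kendall_tau(order_a: Sequence[Hashable], order_b: Sequence[Hashable]) -> int:
    """Number of item pairs that the two best-first rankings order differently."""
    pos_b = {item: p for p, item in enumerate(order_b)}
    n = len(order_a)
    return sum(
        1
        for x in range(n)
        for y in range(x + 1, n)
        if pos_b[order_a[x]] > pos_b[order_a[y]]
    )


def pairwise_marginals(
    reference: Sequence[Hashable], phi: float
) -> Dict[Tuple[Hashable, Hashable], float]:
    """beta[(i, j)] = probability that item i is ranked ahead of item j.

    Enumerates all permutations, so it is only practical for small item sets.
    """
    items = list(reference)
    weights = {perm: phi ** kendall_tau(perm, reference) for perm in permutations(items)}
    z = sum(weights.values())  # normalizing constant of the Mallows distribution
    beta = {(i, j): 0.0 for i in items for j in items if i != j}
    for perm, w in weights.items():
        for x in range(len(perm)):
            for y in range(x + 1, len(perm)):
                beta[(perm[x], perm[y])] += w / z
    return beta
```

For two items with reference `(0, 1)` and `phi = 0.5`, the two rankings have weights 1 and 0.5, so the first item is preferred with probability 2/3.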

Proposition 1 (in the main paper) Let the ranking matrix be defined as in Eq. (10), and ’s are parameters of the Mallows distribution. Then, and , we have,

1. If and ,

2. If and ,

###### Proof.

For ,

$$\frac{B_{(i,j),k}}{B_{(i,j),k}+B_{(j,i),k}} = \frac{\mu_{i,j}\,\beta_{(i,j),k}}{\mu_{i,j}\,\beta_{(i,j),k}+\mu_{j,i}\,\beta_{(j,i),k}} = \beta_{(i,j),k}$$

since and . The proof of can be derived from the proof of Proposition 3 in the main paper (see the next section). ∎

## Appendix B Proof for Proposition 3 and Lemma 3 in the main paper

We first prove Proposition 3 in the main paper.

Proposition 3 (in the main paper) Let and be the positions of items and in the reference ranking of the -th Mallows component. . If and , then,

$$\frac{\beta_{(j,i),k}}{\beta_{(i,j),k}} \le L\,\phi_k^{L-1} \tag{11}$$
###### Proof.

Due to the symmetry in the ranking space, we consider hence where indicates “prefer over”. Instead of directly calculating the summation in the definition,

$$\beta_{(i,j),k} = \sum_{\sigma:\ \sigma(i)<\sigma(j)} p(\sigma \mid \sigma_k, \phi_k)$$

we consider the Repeated Insertion Model (RIM) procedure. RIM is a generative procedure for sampling a ranking which is equivalent to sampling a ranking from a Mallows component. Specifically, in RIM, a ranking is obtained by sequentially placing the -th item in the reference permutation () into the -th position (of the current partial sequence of length ), , in a probabilistic fashion:

$$p_i(j_i = l) = \frac{\phi^{\,i-l}}{1+\phi+\cdots+\phi^{\,i-1}}$$

and , .
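The RIM procedure described above can be sketched in a few lines. This is an illustrative sampler with a hypothetical function name: positions are 1-based as in the text, so the $i$-th reference item is inserted at position $l \in \{1,\dots,i\}$ with weight $\phi^{i-l}$.

```python
import random
from typing import Hashable, List, Sequence


def sample_rim(reference: Sequence[Hashable], phi: float, rng: random.Random) -> List[Hashable]:
    """Draw one ranking (best-first list) by repeated insertion."""
    ranking: List[Hashable] = []
    for i, item in enumerate(reference, start=1):
        # Unnormalized insertion weights for positions 1..i; the last position
        # (keeping the reference order) always has weight phi**0 = 1.
        weights = [phi ** (i - l) for l in range(1, i + 1)]
        position = rng.choices(range(1, i + 1), weights=weights, k=1)[0]
        ranking.insert(position - 1, item)
    return ranking
```

With `phi = 0` every item is appended at the end, so the sampler returns the reference ranking itself; as `phi` approaches 1 the insertion positions become uniform.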

Let . By definition, is the probability that item is inserted after item in the RIM procedure. By the construction of RIM, this probability does not depend on the items after and, by symmetry, it does not depend on the items before . Without loss of generality, we set and consider . For simplicity, we denote .

We first consider , the probability of item being on the -th position in the sequence after inserting the -th item. and . By induction, we shall show that .

As the base case, after inserting the second item when , and . Assume that the hypothesis holds for all ; then for , and , (i.e., after inserting the item )

$$q_{r,s+1} = q_{r,s}\,\Pr(j_{s+1} > r) + q_{r-1,s}\,\Pr(j_{s+1} < r)$$

where is the position of item after inserting it into the partial sequence. By the induction assumption,

$$q_{r,s} = \frac{\phi^{\,r-1}}{1+\phi+\cdots+\phi^{\,s-1}}, \qquad q_{r-1,s} = \frac{\phi^{\,r-2}}{1+\phi+\cdots+\phi^{\,s-1}}$$

Therefore,

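$$\begin{aligned}
q_{r,s+1} &= \frac{\phi^{\,r-1}}{1+\phi+\cdots+\phi^{\,s-1}}\,\Pr(j_{s+1} > r) + \frac{\phi^{\,r-2}}{1+\phi+\cdots+\phi^{\,s-1}}\,\Pr(j_{s+1} < r) \\
&= \frac{\phi^{\,r-1}}{1+\phi+\cdots+\phi^{\,s}}
\end{aligned}$$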

Similarly, it is true for and . This completes the induction and shows that .
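The induction can be checked exactly for small cases. The sketch below tracks the position of the first reference item through successive RIM insertions using rational arithmetic and compares the result with the closed form $\phi^{r-1}/(1+\phi+\cdots+\phi^{s-1})$ used in the induction hypothesis. Function names are illustrative, and the recursion is one reading of the insertion step: inserting at or before the item's current position pushes it down by one.

```python
from fractions import Fraction
from typing import List


def insertion_prob(i: int, l: int, phi: Fraction) -> Fraction:
    """RIM probability of inserting the i-th reference item at position l (1-based)."""
    z = sum(phi ** t for t in range(i))  # 1 + phi + ... + phi^(i-1)
    return phi ** (i - l) / z


def first_item_position_dist(s: int, phi: Fraction) -> List[Fraction]:
    """Return [q_{1,s}, ..., q_{s,s}]: position law of item 1 after s insertions."""
    q = [Fraction(1)]  # after inserting only item 1, it sits at position 1
    for step in range(2, s + 1):
        new_q = [Fraction(0)] * step
        for r, mass in enumerate(q, start=1):
            for l in range(1, step + 1):
                # Inserting at or before position r shifts item 1 to r + 1.
                new_r = r + 1 if l <= r else r
                new_q[new_r - 1] += mass * insertion_prob(step, l, phi)
        q = new_q
    return q


def closed_form(r: int, s: int, phi: Fraction) -> Fraction:
    """Claimed value of q_{r,s} from the induction hypothesis."""
    return phi ** (r - 1) / sum(phi ** t for t in range(s))
```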

Now we can calculate ,

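$$\begin{aligned}
\beta_{(1,j),k} &= \sum_{r=1}^{j-1} q_{r,j-1}\,\Pr(j_j > r) \\
&= \sum_{r=1}^{j-1} \frac{\phi^{\,r-1}\,(1+\cdots+\phi^{\,j-r-1})}{(1+\cdots+\phi^{\,j-2})(1+\cdots+\phi^{\,j-1})} \\
&= \sum_{r=1}^{j-1}\sum_{l=r-1}^{j-2} \frac{\phi^{\,l}}{(1+\cdots+\phi^{\,j-2})(1+\cdots+\phi^{\,j-1})} \\
&= \frac{1 - j\phi^{\,j-1} + (j-1)\phi^{\,j}}{(1-\phi)^2\,(1+\cdots+\phi^{\,j-2})(1+\cdots+\phi^{\,j-1})}
\end{aligned}$$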

Similarly, we have,

$$\beta_{(j,1),k} = \frac{\phi^{\,j-1}\,\big(j-1-j\phi+\phi^{\,j}\big)}{(1-\phi)^2\,(1+\cdots+\phi^{\,j-2})(1+\cdots+\phi^{\,j-1})}$$

Therefore,

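$$\frac{\beta_{(1,j),k}}{\beta_{(j,1),k}} = \frac{1 - j\phi^{\,j-1} + (j-1)\phi^{\,j}}{\phi^{\,j-1}\,\big(j-1-j\phi+\phi^{\,j}\big)} \ \ge\ \frac{1}{j\,\phi^{\,j-1}}$$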

and this concludes the proof.

We note that in the above equation, if we set , we get . This proves Proposition 1 c. We also note that so . This proves Proposition 1 b. ∎
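The closed form for $\beta_{(1,j),k}$ and the ratio bound can be sanity-checked numerically. The sketch below is an independent check under stated assumptions rather than part of the proof: it assumes the Mallows weight $\phi^{d}$ with $d$ the Kendall tau distance to the reference $(1,\dots,j)$, which equals the number of inversions, and it uses exact rational arithmetic. Function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations
from typing import Sequence


def inversions(perm: Sequence[int]) -> int:
    """Kendall tau distance from `perm` to the identity ranking."""
    n = len(perm)
    return sum(1 for x in range(n) for y in range(x + 1, n) if perm[x] > perm[y])


def beta_first_vs_last(j: int, phi: Fraction) -> Fraction:
    """Exact probability that item 1 precedes item j, by enumeration."""
    total = Fraction(0)
    favourable = Fraction(0)
    for perm in permutations(range(1, j + 1)):
        w = phi ** inversions(perm)
        total += w
        if perm.index(1) < perm.index(j):
            favourable += w
    return favourable / total


def geometric_sum(phi: Fraction, m: int) -> Fraction:
    """1 + phi + ... + phi^(m-1)."""
    return sum(phi ** t for t in range(m))


def beta_closed_form(j: int, phi: Fraction) -> Fraction:
    """Closed form for beta_{(1,j),k} derived in the proof."""
    num = 1 - j * phi ** (j - 1) + (j - 1) * phi ** j
    den = (1 - phi) ** 2 * geometric_sum(phi, j - 1) * geometric_sum(phi, j)
    return num / den
```

Restricting to $j$ items is enough here because, as the proof notes, the probability does not depend on the items placed after item $j$ in the reference ranking.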

Now we consider Lemma 3 in the main paper, which shows the inevitability of the approximate separability of a random M4.

Lemma 2 (in the main paper) Let the reference rankings be sampled i.i.d. uniformly from the set of all permutations, and the dispersion parameters . Then the probability that the ranking matrix is -approximately separable is at least

$$1 - K\exp\!\left(-\frac{Q}{L(\phi,\lambda)^{2K-1}}\right) \tag{12}$$

where for some positive constant , and is the minimum integer that is no smaller than .

###### Proof.

Note that, by Proposition 3 in the main paper, if is preferred over in and under in the other central permutations, and the distance between their positions is , then the corresponding row is an at most approximately novel row for the first topic. The same holds for all the topics.

We note that if we consider two groups of disjoint items, then the relative rankings within each group are independent of those in the other group if the ranking is sampled uniformly from all the permutations. In general, we divide the items into groups of disjoint items, each containing items, denoted by , for . If a center permutation is sampled uniformly at random from the set of all permutations, then all the partial rankings within each group are independent of those of another group .

We now consider, for each of these -tuples, the probability that there exist two items such that is first and is last in the group for the first central permutation , and in the opposite order for the other permutations. We denote this probability by . By definition, we have,

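$$\begin{aligned}
p_1(\phi;\lambda,k) &\ge \Pr\big\{\exists\, i,j \in \{i_{t,1},\ldots,i_{t,L}\}\ \text{s.t.}\ \sigma_1(i)<\cdots<\sigma_1(j),\ \sigma_2(i)>\cdots>\sigma_2(j),\ \ldots,\ \sigma_K(i)>\cdots>\sigma_K(j)\big\} \\
&= L(L-1)\left(\frac{1}{L(L-1)}\right)^{K} = \big(L(L-1)\big)^{-(K-1)}
\end{aligned}$$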

Now, let denote the event that none of the groups has a -approximately separable row, then, following the same argument as in Lemma LABEL:lem:ranking1, we have,

$$\Pr\Big(\bigcup B_k\Big) \le K\exp\!\big(-Q\,p_1/L\big) \le K\exp\!\left(-\frac{Q}{L^{2K-1}}\right)$$

as an upper bound on the probability of not being separable. We require such that . This concludes the proof. ∎
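The counting step $L(L-1)\big(1/(L(L-1))\big)^{K} = (L(L-1))^{-(K-1)}$ can be confirmed by exhaustive enumeration for a small group. The sketch below, with a hypothetical function name, draws $K$ independent uniform permutations of $L$ items and counts the outcomes in which the first and last items of the first permutation appear as last and first, respectively, in every other permutation.

```python
from fractions import Fraction
from itertools import permutations, product


def opposite_extremes_probability(L: int, K: int) -> Fraction:
    """Exact probability of the 'first/last swapped' event within one group.

    Enumerates (L!)^K outcomes, so it is only feasible for very small L and K.
    """
    perms = list(permutations(range(L)))
    hits = 0
    for sigmas in product(perms, repeat=K):
        first, last = sigmas[0][0], sigmas[0][-1]
        # Every other permutation must rank `last` first and `first` last.
        if all(s[0] == last and s[-1] == first for s in sigmas[1:]):
            hits += 1
    return Fraction(hits, len(perms) ** K)
```

The events for different ordered pairs $(i,j)$ are disjoint, since the first and last items of the first permutation are uniquely determined, which is why the probabilities simply add up to the closed form.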

## Appendix C Analysis of Proposed Algorithm 1 in the main paper

Now we formally prove that if a ranking matrix is -approximately separable, where is small enough, the proposed Algorithm 1 can consistently estimate the reference rankings of the shared Mallows components.

Indexing convention: For convenience, for the rest of this appendix we will index the rows of and by just a single index instead of an ordered pair as in the main paper.

### C.1 Consistency of Algorithm 2

Recall that . We decouple the effect of -separability from the error in estimating . Since the second error converges to 0 as , we focus on the perturbation of the solid angle that results from the -approximate separability.

For being a -approximate novel row, let