The recent explosion of web technologies has enabled us to collect an immense amount of partial preferences for large sets of items, e.g., products from Amazon, movies from Netflix, or restaurants from Yelp, from a large and diverse population of users through transactions, clicks, check-ins, etc. (e.g., Lu and Boutilier, 2011; Volkovs and Zemel, 2014; Rajkumar and Agarwal, 2014). The goal of this paper is to develop a new approach to model, learn, and ultimately predict the preference behavior of users in pairwise comparisons which can form a building block for other partial preferences. Predicting preference behavior is important to personal recommendation systems, e-commerce, information retrieval, etc.
We propose a novel topic modeling approach to ranking and introduce a new probabilistic generative model for pairwise comparisons that accounts for a heterogeneous population of inconsistent users. The essence of our approach is to view the outcomes of comparisons generated by each user as a probabilistic mixture of a few latent global rankings that are shared across the user-population. This is especially appealing in the context of emerging web-scale applications where (i) there are multiple factors that influence individual preference behavior, e.g., product preferences are influenced by price, brand, etc., (ii) each individual is influenced by multiple latent factors to different extents, (iii) individual preferences for very similar items may be noisy and change with time, and (iv) the number of comparisons available from each user is typically limited. Research on ranking models to date does not fully capture all these important aspects.
In the literature, we can identify two categories of models. In the first category of models the focus is on learning one global ranking that “optimally” agrees with the observations according to some metric (e.g., Gleich and Lim, 2011; Rajkumar and Agarwal, 2014; Volkovs and Zemel, 2014). Loosely speaking, this tacitly presupposes a fairly homogeneous population of users having very similar preferences. In the second category of models, there are multiple constituent rankings in the user population, but each user is associated with a single ranking scheme sampled from a set of multiple constituent rankings (e.g., Farias et al., 2009; Lu and Boutilier, 2011). Loosely speaking, this tacitly presupposes a heterogeneous population of users who are clustered into different types by their preferences and whose preference behavior is influenced by only one factor. In contrast to both these categories, we model each user’s pairwise preference behavior as a mixed membership latent variable model. This captures both heterogeneity (via the multiple shared constituent rankings) and inconsistent preference behavior (via the probabilistic mixture). This is a fundamental change of perspective from the traditional clustering-based approach to a decomposition-based one.
A second contribution of this paper is the development of a novel algorithmic approach to efficiently and consistently estimate the latent rankings in our proposed model. This is achieved by establishing a formal connection to probabilistic topic modeling where each document in a corpus is viewed as a probabilistic mixture of a few prevailing topics (Blei, 2012). This formal link allows us to leverage algorithms that were recently proposed in the topic modeling literature (Arora et al., 2013; Ding et al., 2013b, 2014) for estimating latent shared rankings. Overall, our approach has a running time and a sample complexity bound that are provably polynomial in all model parameters. Our approach is asymptotically consistent as the number of users goes to infinity even when the number of comparisons for each user is a small constant.
We also demonstrate competitive empirical performance in collaborative prediction tasks. Through a variety of performance metrics, we demonstrate that our model can effectively capture the variability of real-world user preferences.
2 Related Work
Rank estimation from partial or total rankings has been extensively studied over the last several decades in various settings. A prominent setting is one in which individual user rankings (in a homogeneous population) are modeled as independent draws from a probability distribution which is centered around a single ground-truth global ranking. Efficient algorithms have been developed to estimate the global ranking under a variety of probability models (Qin et al., 2010; Gleich and Lim, 2011; Negahban et al., 2012; Osting et al., 2013; Volkovs and Zemel, 2014). Chief among them are the Mallows model (Mallows, 1957), the Plackett-Luce (PL) model (Plackett, 1975), and the Bradley-Terry-Luce (BTL) model (Rajkumar and Agarwal, 2014).
To account for the heterogeneity in the user population, Jagabathula and Shah (2008) and Farias et al. (2009) considered models with multiple prevalent rankings and proposed consistent combinatorial algorithms for estimating the rankings. The mixture of Mallows model recently studied by Lu and Boutilier (2011) and Awasthi et al. (2014) considers multiple constituent rankings as the “centers” for the Mallows components, as do the “mixture of PL” and the “mixture of BTL” models (Azari Soufiani et al., 2013; Oh and Shah, 2014). In all these settings, however, each user is associated with only one ranking sampled from the mixture model. They capture the cases where the population can be clustered into a few types in terms of their preference behavior.
The setup of our model, although being fundamentally different in modeling perspective, is most closely related to the seminal work in Jagabathula and Shah (2008); Farias et al. (2009) (denoted by FJS) (see Table 1 and appendix). As it turns out, our proposed model subsumes those proposed in FJS as special cases. On the other hand, while the algorithm in FJS can be applied to our more general setting, our algorithm has provably better computational efficiency, polynomial sample complexity, and superior empirical performance.
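Table 1: Comparison with the closely related work of FJS.
|Method||Assumption||Statistics used||Consistency proved?||Computational complexity||Sample complexity|
|FJS||Separability||1st order||Yes||Exponential in||Not provided|
|This paper||Separability||up to 2nd order||Yes||Polynomial||Polynomial|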
Relation to topic modeling: Our ranking model shares the same motivation as topic models. Topic modeling has been extensively studied over the last decade and has yielded a number of powerful approaches (e.g., Blei, 2012). While the dominant trend is to fit a MAP/ML estimate using approximation heuristics such as variational Bayes or MCMC, recent work has demonstrated that the topic discovery problem can lend itself to provably efficient solutions with additional structural conditions (Arora et al., 2013; Ding et al., 2014). This forms the basis of our technical approach.
Relation to rating based methods: There is also a considerable body of work on modeling numerical ratings (e.g., Ricci et al., 2011) from which ranking preferences can be derived. An emerging trend explores the idea of combining a topic model for text reviews simultaneously with a rating-based model for “star ratings” (Wang and Blei, 2011). These approaches are, however, outside the scope of this paper.
The rest of the paper is organized as follows. We formally introduce the new generative model in Sec. 3. We then present the key geometrical perspective underlying the proposed approach in Sec. 4. We summarize the main steps of our algorithm and the overall computational and statistical efficiency in Sec. 5. We demonstrate competitive performance on semi-synthetic and real-world datasets in Sec. 6.
3 A new generative model
To formalize our proposed model, let be a universe of items. Let the latent rankings over items that are shared across a population of users be denoted by permutations . Each user compares pairs of items. The unordered item pairs to be compared are assumed to be drawn independently from some distribution with for all pairs. The -th comparison result of user is denoted by an ordered pair , if user compares item and and prefers over . Let a probability vector be the user-specific weights over the latent rankings. The generative model for the comparisons from each user is,
Sample from a prior distribution
For each comparison :
Sample a pair of items from
Sample a ranking token
If , then , otherwise . (Footnote 1: is the position of item in the ranking and item is preferred over if .)
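The generative steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `sample_user_comparisons`, its argument names, and the choice of a Dirichlet prior for the user weights are all assumptions made here for concreteness (the Dirichlet prior is the one used later in the experiments).

```python
import random

def sample_user_comparisons(rankings, pair_probs, alpha, n_comparisons, rng):
    """Draw one user's comparisons from the mixed-membership ranking model.

    rankings[k][i]  : position of item i in shared ranking k (smaller = preferred).
    pair_probs      : dict mapping each unordered pair (i, j) to its sampling probability.
    alpha           : Dirichlet parameters for the user weights (assumed prior).
    """
    # Step 1: sample the user-specific weights over the shared rankings.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    theta = [g / total for g in gammas]

    pairs = list(pair_probs)
    weights = [pair_probs[p] for p in pairs]
    comparisons = []
    for _ in range(n_comparisons):
        # Step 2a: sample which unordered pair of items is compared.
        i, j = rng.choices(pairs, weights=weights, k=1)[0]
        # Step 2b: sample the ranking token that decides this comparison.
        k = rng.choices(range(len(rankings)), weights=theta, k=1)[0]
        # Step 2c: order the pair according to the selected ranking.
        if rankings[k][i] < rankings[k][j]:
            comparisons.append((i, j))
        else:
            comparisons.append((j, i))
    return theta, comparisons
```

Because a fresh ranking token is drawn for every comparison, the same user can order the same pair differently on different occasions, which is how the model represents inconsistent behavior.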
Figure 1 is a standard graphical model representation of the proposed generative process. Each user is characterized by , the user-specific weights over the shared rankings. For convenience, we represent by a nonnegative ranking matrix whose rows are indexed by all the ordered pairs. We set , so that the -th column of is an equivalent representation of the ranking . We then denote by the dimensional weight matrix whose columns are the user-specific mixing weights ’s. Finally, let be the empirical comparisons-by-user matrix where denotes the number of times that user compares pair and prefers item over . The principal algorithmic problem is to estimate the ranking matrix given and .
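As a concrete illustration of the two matrices described above, the following sketch builds a binary ranking matrix (rows indexed by ordered pairs, one column per ranking) and a comparisons-by-user count matrix. The helper names and the row ordering are hypothetical, and the convention that a smaller position means a more preferred item is an assumption.

```python
from itertools import permutations

def ordered_pairs(n_items):
    """All ordered pairs (i, j) with i != j, in a fixed row order."""
    return list(permutations(range(n_items), 2))

def ranking_matrix(rankings):
    """Entry [(i, j)][k] is 1 if ranking k places item i ahead of item j, else 0.

    rankings[k][i] is the position of item i in ranking k (smaller = preferred).
    """
    n_items = len(rankings[0])
    return [[1 if pos[i] < pos[j] else 0 for pos in rankings]
            for (i, j) in ordered_pairs(n_items)]

def comparison_counts(users_comparisons, n_items):
    """Entry [(i, j)][m] counts how often user m compared i and j and preferred i."""
    row = {pair: r for r, pair in enumerate(ordered_pairs(n_items))}
    counts = [[0] * len(users_comparisons) for _ in row]
    for m, comparisons in enumerate(users_comparisons):
        for pair in comparisons:
            counts[row[pair]][m] += 1
    return counts
```

In each column of the ranking matrix, the rows for (i, j) and (j, i) always sum to one, since a ranking orders every pair in exactly one direction.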
If we denote by a diagonal matrix with the -th diagonal component , and set , then the generative model induces the following probabilities on comparisons :
Similarly, if we consider a probabilistic topic model on a set of documents, each composed of words drawn from a vocabulary of size , with a topic matrix and document-specific mixing weights sampled from a topic prior (e.g. Blei, 2012), then, the distribution induced on the observation , i.e., the -th word in document , has the same form as in (3):
where is any distinct word in the vocabulary. Noting that is column-stochastic, we have,
Lemma 1. The proposed generative model is statistically equivalent to a standard topic model whose topic matrix is set to be and the topic prior to be .
Note that , , and . Hence can be inferred directly from :
Thus, the problem of estimating the ranking matrix can be solved by any approach that can learn the topic matrix . Our approach is to leverage recent works in topic modeling (Arora et al., 2012, 2013; Ding et al., 2013b, 2014) that come with consistency and statistical and computational efficiency guarantees by exploiting the second-order moments of the columns of , i.e., a co-occurrence matrix of pairwise comparisons. We can establish parallel results for the ranking model via the equivalence result of Lemma 1. In particular, by combining Lemma 1 with results in (Ding et al., 2013b, Lemma 1 in Appendix), the following result can be immediately established:
Lemma 2. If and are obtained from by first splitting each user’s comparisons into two independent copies and then re-scaling the rows to make them row-stochastic, then
where , , , and and are, respectively, the expectation and correlation matrix of the weight vector .
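The construction in the lemma above (split each user's comparisons in two, row-normalize, then form a second-order moment estimate) might look as follows. This is a sketch under assumptions: comparisons are given as row indices into the ordered-pair list, the split is a random half/half partition, and the scaling of the product by the number of users is an assumed normalization rather than one taken from the text.

```python
import random

def row_normalize(matrix):
    """Rescale every nonzero row to sum to one (row-stochastic)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total for v in row] if total > 0 else list(row))
    return normalized

def split_and_normalize(users_tokens, n_rows, rng):
    """Split each user's comparisons into two halves and row-normalize both count matrices.

    users_tokens[m] is the list of row indices (ordered pairs) observed for user m.
    """
    n_users = len(users_tokens)
    first = [[0.0] * n_users for _ in range(n_rows)]
    second = [[0.0] * n_users for _ in range(n_rows)]
    for m, tokens in enumerate(users_tokens):
        shuffled = list(tokens)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for r in shuffled[:half]:
            first[r][m] += 1.0
        for r in shuffled[half:]:
            second[r][m] += 1.0
    return row_normalize(first), row_normalize(second)

def cooccurrence(first, second):
    """Second-order moment estimate; the factor n_users is an assumed scaling."""
    n_users = len(first[0])
    return [[n_users * sum(a * b for a, b in zip(r1, r2)) for r2 in second]
            for r1 in first]
```

Using two disjoint halves keeps the two factors of each product independent given the user's weights, which is what lets the product concentrate around a clean second-order moment.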
4 A Geometric Perspective
The key insight of our approach is an intriguing geometric property of the normalized second-order moment matrix (defined in Lemma 2) illustrated in Fig. 2. This arises from the so-called separability condition on the ranking matrix ,
A ranking matrix is separable if for each ranking , there is at least one ordered pair , such that and , .
In other words, for each ranking, there exists at least one “novel” pair of items such that is uniquely preferred over in that ranking while is ranked higher than in all the other rankings. Figure 2 shows an example of a separable ranking matrix in which the ordered pair is novel to ranking , the pair to , and the pair to .
The separability condition has been identified as a good approximation for real-world datasets in nonnegative matrix factorization (Donoho and Stodden, 2004) and topic modeling (Arora et al., 2013; Ding et al., 2014), etc. In the context of ranking, this condition has appeared, albeit implicitly in a different form, in the seminal works of (Jagabathula and Shah, 2008; Farias et al., 2009). Moreover, as shown in (Farias et al., 2009), the separability condition is satisfied with high probability when the underlying rankings are sampled uniformly from the set of all permutations. In our experiments we have observed that the ranking matrix induced by the rating matrix estimated by matrix factorization is often separable (Sec. 6.2).
If is separable then the novel pairs correspond to extreme points of the convex hull formed by all the row vectors of (Fig. 2). Thus, the novel pairs can be efficiently identified through an extreme point finding algorithm. Once all the novel pairs are identified, the ranking matrix can be estimated using a constrained linear regression (Arora et al., 2013; Ding et al., 2014). To exclude redundant rankings and ensure unique identifiability, we assume has full rank.
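To illustrate the constrained linear regression step, the sketch below expresses one row vector as a convex combination of the rows of the novel pairs by least squares over the probability simplex. Projected gradient descent is used here only because it needs no external solver; it is a stand-in chosen for this example, not necessarily the solver used in the cited works, and the function names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {b : b >= 0, sum(b) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, tau = 0.0, 0.0
    for idx, value in enumerate(u, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / idx
        if value - candidate > 0:
            tau = candidate
    return [max(x - tau, 0.0) for x in v]

def convex_weights(target, novel_rows, steps=2000):
    """Weights b on the simplex minimizing ||target - sum_k b[k] * novel_rows[k]||^2."""
    k_count = len(novel_rows)
    gram = [[sum(a * c for a, c in zip(novel_rows[p], novel_rows[q]))
             for q in range(k_count)] for p in range(k_count)]
    rhs = [sum(a * c for a, c in zip(novel_rows[p], target)) for p in range(k_count)]
    # The trace of the Gram matrix upper-bounds its largest eigenvalue,
    # so 1 / trace is a safe (conservative) step size.
    step = 1.0 / max(sum(gram[p][p] for p in range(k_count)), 1e-12)
    b = [1.0 / k_count] * k_count
    for _ in range(steps):
        grad = [sum(gram[p][q] * b[q] for q in range(k_count)) - rhs[p]
                for p in range(k_count)]
        b = project_to_simplex([b[p] - step * grad[p] for p in range(k_count)])
    return b
```

When the novel rows are linearly independent, the minimizer is unique, which mirrors the role of the full-rank assumption in the text.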
We leverage the normalized Solid Angle subtended by extreme points to detect the novel pairs as proposed in (Ding et al., 2014, Definition 1). The solid angles are indicated by the shaded regions in Fig. 2. From a statistical viewpoint, it can be defined as the probability that a row vector has the maximum projection value along an isotropically distributed random direction :
These can be efficiently approximated using a few iid isotropic ’s. By following the approach in (Ding et al., 2014, Lemma 2) for topic modeling, one can prove the following result which shows that the solid angles can be used to detect novel pairs:
Suppose is separable and is full rank, then, if and only if is a novel pair.
This motivates the following solution approach: (i) estimate the solid angles , (ii) select distinct pairs with largest ’s, and (iii) estimate the ranking matrix using constrained linear regression.
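The first two steps of this approach can be sketched with random projections. The code below estimates, for each row of a matrix, the fraction of random directions along which that row has the largest projection, and then picks the rows with the largest estimates. Gaussian directions are one isotropic choice assumed here; the function names are hypothetical, and the check that the selected rows are distinct (so that two novel pairs of the same ranking are not both picked) is omitted for brevity.

```python
import random

def estimate_solid_angles(rows, n_projections, rng):
    """Fraction of isotropic random directions along which each row wins the max projection."""
    dim = len(rows[0])
    wins = [0] * len(rows)
    for _ in range(n_projections):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        projections = [sum(x * d for x, d in zip(row, direction)) for row in rows]
        winner = max(range(len(rows)), key=projections.__getitem__)
        wins[winner] += 1
    return [w / n_projections for w in wins]

def top_rows(angles, k):
    """Indices of the k rows with the largest estimated solid angles."""
    return sorted(range(len(angles)), key=lambda r: angles[r], reverse=True)[:k]
```

A row that lies strictly inside the convex hull of the others can never attain the maximum projection, so its estimate stays at zero, while extreme points receive a positive share of the directions.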
5 Algorithm and Analysis
The main steps of our approach are outlined in Algorithm 1 and expanded in detail in Algorithms 2, 3 and 4. Algorithm 2 detects all the novel pairs for the distinct rankings. Once the novel pairs are identified, Algorithm 3 estimates matrix using constrained linear regression followed by row and then column scaling.
Algorithm 4 further processes to obtain an estimate of the ranking matrix . Step 1 is based on Eq. (3) and step 2 further rounds each element to or . Algorithm 4 guarantees that is binary and satisfies the condition: for all and all .
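The post-processing idea can be illustrated as follows. The sketch assumes that each entry of the estimated topic matrix is the corresponding ranking-matrix entry scaled by the sampling probability of its unordered pair, so that dividing an entry by the sum of itself and its reversed-pair entry removes that scale; this normalization is an assumption made for the example, since the equation itself is not reproduced in this text. The function name is hypothetical.

```python
def recover_ranking_matrix(topic_matrix, pairs):
    """Normalize each ordered pair against its reverse, then round to a 0/1 matrix.

    topic_matrix[r][k] : estimated weight of ordered pair pairs[r] in ranking k.
    pairs              : list of ordered pairs (i, j), containing both directions.
    """
    row = {pair: r for r, pair in enumerate(pairs)}
    n_rankings = len(topic_matrix[0])
    sigma = [[0] * n_rankings for _ in pairs]
    for (i, j), r in row.items():
        if i > j:
            continue  # each unordered pair is handled once, from its (i < j) row
        reverse = row[(j, i)]
        for k in range(n_rankings):
            total = topic_matrix[r][k] + topic_matrix[reverse][k]
            # Step 1: remove the pair-sampling scale (assumed normalization).
            share = topic_matrix[r][k] / total if total > 0 else 0.5
            # Step 2: round, forcing the two directions to be complementary.
            sigma[r][k] = 1 if share >= 0.5 else 0
            sigma[reverse][k] = 1 - sigma[r][k]
    return sigma
```

By construction the output is binary and, for every pair and every ranking, exactly one of the two directions is set to one.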
Our approach inherits the polynomial computational complexity of the topic modeling algorithm in Ding et al. (2014):
Theorem 1. The running time of Algorithm 1 is .
We further derive sample complexity bounds for our approach, which are also polynomial in all model parameters and where is the upper bound on error probability. A major technical improvement compared to the results that appear in Ding et al. (2014) is that our analysis holds true for any isotropic distribution over the random directions in Alg. 2. The previous result in (Ding et al., 2014, Theorem 1, 2) was designed only for specific distributions such as spherical Gaussian. Formally,
Theorem 2. Let the ranking matrix be separable and have full rank. Then Algorithm 1 can consistently recover up to a column permutation as the number of users and number of projections . Furthermore, for any isotropically drawn random direction , , if
then Algorithm 1 fails with probability at most . The other model parameters are defined as , , , are the minimum/maximum eigenvalues of . is the minimum solid angle of the extreme points of the convex hull of the rows of .
Detailed proofs are provided in the supplementary material. We combine the analysis of Alg. 4 and the re-scaling steps in Alg. 3 in order to exploit the structural constraints of the ranking model. As a result, we obtain an improved sample complexity bound for compared to Ding et al. (2014) and Arora et al. (2013).
6 Experimental Validation
6.1 Overview of Experiments and Methodology
We conduct experiments first on a semi-synthetic dataset in order to validate the performance of our proposed algorithm when the model assumptions are satisfied, and then on real-world datasets in order to demonstrate that the proposed model can indeed effectively capture the variability that one encounters in the real world. We focus on collaborative filtering applications where population heterogeneity and user inconsistency are well-known characteristics (e.g., Salakhutdinov and Mnih, 2008a).
We use Movielens, a benchmark movie-rating dataset widely used in the literature. (Footnote 2: Another large benchmark, the Netflix dataset, is not available due to privacy issues. Movielens is currently available at http://grouplens.org/datasets/movielens/) The rating-based data is selected due to its public availability and widespread use, but we convert it to pairwise comparisons data and focus on modeling from a ranking viewpoint. This procedure has been suggested and widely used in the rank-aggregation literature (e.g., Lu and Boutilier, 2011; Volkovs and Zemel, 2014). For the semi-synthetic datasets, we evaluate the reconstruction error between the learned rankings and the ground truth. We adopt the standard Kendall’s tau distance between two rankings. For the real-world datasets where true parameters are not available, we use the held-out log-likelihood, a standard metric in ranking prediction (Lu and Boutilier, 2011) and in topic modeling (Wallach et al., 2009).
In addition, we consider the standard task of rating prediction via our proposed ranking model. Our aim here is to illustrate that our model is suitable for real-world data. We do not optimize tuning parameters in order to achieve the best result. We measure the performance by root-mean-square error (RMSE), which is standard in the literature (e.g., Salakhutdinov and Mnih, 2008a; Toscher et al., 2009).
6.2 Semi-synthetic simulation
We first use a semi-synthetic dataset to validate the performance of our algorithm. In order to match the dimensionality and other characteristics that are representative of real-world examples, we generate the semi-synthetic pairwise comparisons dataset using a benchmark movie star-ratings dataset, Movielens. The original dataset has approximately million ratings for movies from users. The ratings range from 1 star to 5 stars.
We follow the procedure in (Lu and Boutilier, 2011) and (Volkovs and Zemel, 2014) to generate the semi-synthetic dataset as follows. We consider the most frequently rated movies and train a latent factor model on the star-ratings data using a state-of-the-art matrix factorization based algorithm (Salakhutdinov and Mnih, 2008a). This approach is selected for its state-of-the-art performance on many real-world collaborative filtering tasks. This procedure learns a movie-factor matrix whose columns are interpreted as scores of the movies over the latent factors (Salakhutdinov and Mnih, 2008a; Volkovs and Zemel, 2014). By sorting the scores of each column of the movie-factor matrix, we obtain rankings for generating the semi-synthetic dataset. We set as suggested by Lu and Boutilier (2011) and Salakhutdinov and Mnih (2008a). We note that the resulting ranking matrix satisfies the separability condition.
The other model parameters are set as follows. , . The prior distribution for is set to be Dirichlet as suggested by (Lu and Boutilier, 2011). The parameters ’s are determined by , where the concentration parameter and the expectation is sampled uniformly from the dimensional simplex for each random realization. We note that the correlation matrix of the Dirichlet distribution has full rank (Arora et al., 2013). We fix comparisons per user to approximate the observed average pairwise comparisons in the Movielens dataset and vary .
Since the output of our algorithm is determined only up to a column permutation, we first align the columns of and using bipartite matching based on distance, and then measure the performance by the distance between the ground truth rankings and the estimate . Due to the way is defined, this is equivalent to the widely-used Kendall’s tau distance between two rankings which is proportional to the number of pairs in which two ranking schemes differ. We further normalize the error by so that the error measure for each column is a number between .
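For reference, Kendall's tau distance between two rankings counts the item pairs that the two rankings order differently. The sketch below computes it directly from item positions; dividing by the total number of pairs so that the result lies in [0, 1] is a normalization assumed for this example, and the function name is hypothetical.

```python
from itertools import combinations

def kendall_tau_distance(pos_a, pos_b, normalize=True):
    """Number (or fraction) of item pairs ordered differently by two rankings.

    pos_a[i] and pos_b[i] are the positions of item i in each ranking.
    """
    n_items = len(pos_a)
    disagreements = sum(
        1 for i, j in combinations(range(n_items), 2)
        if (pos_a[i] < pos_a[j]) != (pos_b[i] < pos_b[j])
    )
    if not normalize:
        return disagreements
    return disagreements / (n_items * (n_items - 1) / 2)
```

In the binary ranking-matrix representation, every discordant unordered pair flips two rows (one for each direction), which is why an entrywise distance between two columns is proportional to this count.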
We compare our proposed algorithm (denoted by RP) against the algorithm proposed in (Jagabathula and Shah, 2008; Farias et al., 2009) (denoted by FJS) for estimating the ranking matrix. To the best of our knowledge, this is the most recent algorithm with consistency guarantees for . (Footnote 3: We show in the appendix that Alg. FJS can be applied to our generative scheme since it only uses the first order statistics, and all the technical conditions are satisfied.) We compared how the estimation error varies with the number of users , and the results are depicted in Fig. 3. For each setting, we average over Monte Carlo runs. Evidently, our algorithm shows superior performance over FJS. More specifically, since our ground truth ranking matrix is separable, as increases, the estimation error of RP converges to zero, and the convergence is much faster than FJS. We note that only for does the error of the FJS algorithm eventually start approaching 0.
6.3 Movielens - Comparison prediction
We apply the proposed algorithm (RP) to the real-world Movielens dataset introduced in Sec. 6.2 and consider the task of predicting pairwise comparisons. We consider two settings: new comparison prediction, and new user prediction. We train and evaluate our model using the comparisons obtained from the star-ratings of the Movielens dataset. This procedure of generating comparisons from star-ratings is motivated by (Lu and Boutilier, 2011; Volkovs and Zemel, 2014). We focus on the most frequently rated movies and obtain a subset of star-ratings from users. The pairwise comparisons are generated from the star ratings following (Lu and Boutilier, 2011; Volkovs and Zemel, 2014): for each user , we select pairs of movies that user rated, and compare the stars of the two movies to generate comparisons.
To select pairs of items to compare, we consider: (Full) all pairs of movies that a user has rated, or (Partial) randomly select pairs where is the number of movies user has rated.
To compare a pair of movies rated by a user, if the star rating of is higher than . For ties, we consider: (Both) generate and , (Ignore) do nothing, and (Random) select one of with equal probability.
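The conversion from star ratings to pairwise comparisons, including the three tie-handling options, can be sketched as below. The sketch covers the "Full" pair-selection setting (all pairs of movies a user has rated); the function name, argument names, and string labels for the tie strategies are hypothetical.

```python
import random
from itertools import combinations

def ratings_to_comparisons(user_ratings, ties="ignore", rng=None):
    """Turn one user's {movie: stars} ratings into ordered pairs (preferred, other).

    ties: "both" emits both orderings, "ignore" emits nothing,
          "random" emits one ordering chosen with equal probability.
    """
    rng = rng or random.Random()
    comparisons = []
    for (a, stars_a), (b, stars_b) in combinations(sorted(user_ratings.items()), 2):
        if stars_a > stars_b:
            comparisons.append((a, b))
        elif stars_b > stars_a:
            comparisons.append((b, a))
        elif ties == "both":
            comparisons.extend([(a, b), (b, a)])
        elif ties == "random":
            comparisons.append(rng.choice([(a, b), (b, a)]))
        elif ties != "ignore":
            raise ValueError(f"unknown tie strategy: {ties}")
    return comparisons
```

For example, a user who gave movie 1 five stars and movies 2 and 3 three stars each yields the comparisons (1, 2) and (1, 3), plus whatever the chosen tie strategy emits for the pair (2, 3).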
New comparison prediction: In this setting, for each user, a subset of her ratings is used to generate the training comparisons while the remaining ratings are used for testing comparisons. We follow the training/testing split as in (Salakhutdinov and Mnih, 2008a). (Footnote 4: The training/testing split is available at http://www.cs.toronto.edu/~rsalakhu/BPMF.html) We convert both the training ratings and testing ratings into training comparisons and testing comparisons independently.
We evaluate the performance by the predictive log-likelihood of the testing data, i.e., . Given the estimate , we follow (Arora et al., 2013; Ding et al., 2014) to fit a Dirichlet prior model. We then calculate the prediction log-likelihood using the approximation in (Wallach et al., 2009) which is now the standard. We compare against the FJS algorithm. Figure 4 (upper) summarizes the results for different strategies in generating the pairwise comparisons with held fixed. The log-likelihood is normalized by the total number of pairwise comparisons tested. As depicted in Fig. 4 (upper), the log-likelihood produced by the proposed algorithm RP is higher, by a large margin, compared to FJS. The predictive accuracy is robust to how the comparison data is constructed. We also consider the normalized log-likelihood as a function of (see Fig. 5). The results validate the superior performance and suggest that is a reasonable parameter choice.
New user prediction: In this setting, all the ratings of a subset of users are used to generate the training comparisons while the remaining users’ comparisons are used for testing. Following (Lu and Boutilier, 2011), we split the first users (in the original dataset) in the Movielens dataset for training, and the remaining for testing. We use the held-out log-likelihood, i.e., to measure the performance. The log-likelihoods are again calculated using the standard Gibbs Sampling approximation (Wallach et al., 2009). We compare our algorithm RP with the FJS algorithm. The log-likelihoods are then normalized by the total number of comparisons in the testing phase. We fix the number of rankings at . The results which are summarized in Fig. 4 (lower) agree with the results of the previous task.
6.4 Movielens - Rating prediction via ranking model
The purpose of this experiment is to illustrate that our ranking model can capture real-world user behavior through rating predictions, one important task in personal recommendation (Toscher et al., 2009). We first train our ranking model using the training comparisons, and then predict ratings based on comparison prediction. Our objective is to demonstrate results comparable to the state-of-the-art rating-based methods rather than achieving the best possible performance on certain datasets.
We use the same training/testing rating split from (Salakhutdinov and Mnih, 2008a) as used in new comparison prediction in Sec. 6.3, and focus only on the most rated movies. We first convert the training ratings into training comparisons (for each user, all pairs of movies she rated in the training set are converted into comparisons based on the stars and the ties are ignored) and train a ranking model. The prior is set to be Dirichlet.
To predict stars from comparison prediction, we propose the following method. Consider the problem of predicting , i.e., the rating of user on movie . We assume , and then compare it against her ratings of the movies she has rated in training. This generates a set of pairwise comparisons . For example, suppose user has rated movies with stars respectively in the training set and we are predicting her rating of movie ; then for , while for , . We then choose to maximize the likelihood of ,
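The prediction rule described above can be sketched as a small search over candidate star values. Everything model-specific is abstracted behind a hypothetical callable `pref_prob(i, j)` that returns the fitted model's probability that the user prefers movie i over movie j. How ties between the candidate rating and an existing rating are scored is not stated in this text; the sketch assumes a tie contributes the average log-probability of the two orderings.

```python
import math

def predict_rating(rated, target_movie, pref_prob, star_levels=(1, 2, 3, 4, 5)):
    """Pick the star value whose implied comparisons are most likely under the model.

    rated     : dict {movie: stars} from the user's training ratings.
    pref_prob : hypothetical callable, pref_prob(i, j) = P(user prefers i over j).
    """
    def log_p(i, j):
        return math.log(max(pref_prob(i, j), 1e-12))  # guard against log(0)

    best_stars, best_loglik = None, -math.inf
    for stars in star_levels:
        loglik = 0.0
        for movie, movie_stars in rated.items():
            if stars > movie_stars:
                loglik += log_p(target_movie, movie)
            elif stars < movie_stars:
                loglik += log_p(movie, target_movie)
            else:
                # Assumed tie rule: average the two possible orderings.
                loglik += 0.5 * (log_p(target_movie, movie) + log_p(movie, target_movie))
        if loglik > best_loglik:
            best_stars, best_loglik = stars, loglik
    return best_stars
```

Some tie rule is needed in practice: if ties contributed nothing, a candidate rating that ties with many rated movies would be favored simply because it generates fewer comparisons.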
We evaluate the performance using root-mean-square-error (RMSE). This is a standard metric in collaborative filtering (Toscher et al., 2009). (Footnote 5: Normalized Discounted Cumulative Gain (nDCG) is another standard metric. It requires, however, predicting a total ranking and is inapplicable in our test setting.) We compared our ranking-based algorithm, RP, against rating-based algorithms. We choose to compare against two benchmark algorithms, Probabilistic Matrix Factorization (PMF) (Salakhutdinov and Mnih, 2008b) and Bayesian Probabilistic Matrix Factorization (BPMF) (Salakhutdinov and Mnih, 2008a), for their robust empirical performance. (Footnote 6: The implementation is available at http://www.cs.toronto.edu/~rsalakhu/BPMF.html.) Both PMF and BPMF are latent factor models. The number of latent factors has a similar interpretation as in our ranking model. The RMSE values for different choices of are summarized in Table 2.
Although coming from a different feature space and modeling perspective, our approach has similar RMSE performance to the rating-based PMF and BPMF. Since the ratings predicted by our algorithm are integers from 1 to 5, we also consider restricting the output of BPMF to be integers (denoted by BPMF-int). This is achieved by rounding the real-valued prediction of BPMF to the nearest integer from 1 to 5. We observe that our RP algorithm outperforms PMF, which is known for over-fitting issues, and matches the performance of BPMF-int. This demonstrates that our approach is in fact suitable for modeling real-world user behavior.
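The evaluation metric and the integer-restricted baseline can be written down in a few lines. This is a generic sketch with hypothetical function names; rounding halves upward is an assumption, since the text only says "nearest integer".

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences of ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def round_to_stars(value, low=1, high=5):
    """Round a real-valued prediction to the nearest integer star value in [low, high]."""
    return min(high, max(low, int(math.floor(value + 0.5))))
```

Applying `round_to_stars` to each real-valued prediction before computing `rmse` gives an integer-valued baseline that is directly comparable with a method whose outputs are star values.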
We point out that one can potentially improve these results by designing a better comparison generating strategy, ranking prior, aggregation strategies, etc. This is, however, beyond the scope of this paper.
We note that our proposed algorithm can be naturally parallelized in a distributed database for web scale problems as demonstrated in (Ding et al., 2014). The statistical efficiency of the centralized version can be retained with an insignificant communication cost.
This article is based upon work supported by the U.S. AFOSR and the U.S. NSF under award numbers # FA9550-10-1-0458 (subaward # A1795) and # 1218992 respectively. The views and conclusions contained in this article are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the agencies.
While our analysis of the proposed approach and algorithm largely tracks the methodology in (Ding et al., 2014), here we develop a set of new analysis tools that can handle more general settings. Specifically, our new analysis tools can handle any isotropically distributed random projection directions. In contrast, the work in (e.g., Ding et al., 2014) can only handle special types of random projections, e.g., spherical Gaussian. Our refined analysis not only handles more general settings but also gives an overall improved sample complexity bound.
We also analyse the post-processing step in Algorithm 4. This step accounts for the special constraints that a valid ranking representation must satisfy and guarantees a binary-valued estimate of . It should also satisfy the property that either or for all distinct and all .
We note that the analysis framework that we present here for the solid angle can in fact be extended to handle other types of distributions for the random projection directions. This is, however, beyond the scope of this paper.
Appendix A On the generative model
is column stochastic.
Noting that by definition, and , therefore,
Appendix B Connection to the model in FJS
First, the generative model proposed in FJS can be viewed as a special case of our generative model. If we consider the prior distribution of to be a pmf on the vertices of the -dimensional probability simplex (so that has only one nonzero component with probability one), i.e.,
where is the -th standard basis vector and , then each user is associated with only one of the types with probability for the -th type. We note that under this prior, and has full rank.
Second, the algorithm proposed in FJS can certainly be applied to our more general setting. Since the algorithm FJS only uses the first order statistic which corresponds to pooling the comparisons from all the users together, it suffices to consider only the probabilities of by marginalizing over :
where the last step is due to the definition of the ranking matrix
The above derivation shows that if the expectation vector in our generative model equals that in the model of FJS, then the probability distribution of the first order statistic in both models will be identical and the two models will be indistinguishable in terms of the first order statistic. This shows that the comparison with FJS in the experiments conducted in Sections 6.1 and 6.2 of the main paper is both sensible and fair.
Indexing convention: For convenience, for the rest of this appendix we will index the rows of and by just a single index instead of an ordered pair as in the main paper.
Appendix C Proof of Lemma 2 in the main paper
Lemma 2 in the main paper is a result about the almost sure convergence of the estimate of the normalized second order moments . Our proof of this result will also provide an attainable rate of convergence.
We first provide a generic method to establish the convergence rate for a function of random variables given their individual convergence rates.
Let be random variables and be positive constants. Let for some constants , and be a continuously differentiable function in . If for , and , then,
Since is continuously differentiable in , such that
Now we are ready to prove Lemma 2 of the main paper. Recall that and are obtained from by first splitting each user’s comparisons into two independent copies and then re-scaling the rows to make them row-stochastic. Therefore, . Since , , and is row stochastic. From Lemma 2 of the main paper, we have
Let and . If , then,
For any ,
From the Strong Law of Large Numbers and equations (1), (2) in the main paper, we have
and by definition. Using McDiarmid’s inequality, we obtain
In order to calculate , we apply the results from Proposition 2. Let with , and , , . Let , , and . Then , , and .
If , , and , then , . Then note that
By applying Proposition 2, we get
where . There are many strategies for optimizing the free parameter . We set and solve for to obtain
Finally, by applying the union bound to the entries in , we obtain the claimed result. ∎
Appendix D Proof of Theorem 2 in the main paper
We focus on the case when the random projection directions are sampled from any isotropic distribution. Our proof is not tied to the special form of the distribution; just its isotropic nature. In contrast, the method in (e.g., Ding et al., 2014) can only handle special types of distributions such as the spherical Gaussian.
The proof of Theorem 2 in the main paper can be decoupled into two steps. First, we show that Algorithm 2 in the main paper can consistently identify all the novel pairs of the distinct rankings. Then, given the success of the first step, we will show that Algorithm 3 proposed in the main paper can consistently estimate the ranking matrix .
D.2 Useful propositions
We denote by the set of all novel pairs of the ranking , for , and denote by the set of other non-novel pairs. We first prove the following result.
Let be the -th row of . Suppose is separable and has full rank, then the following is true: