1 Introduction
1.1 Background
User preferences, in a wide variety of settings ranging from voting [26] to information retrieval [8], are often modeled as a distribution on permutations. Here we study the problem of learning a mixture of Mallows models from random samples. First, a Mallows model $M(\phi, \pi^*)$ is described by a center $\pi^*$ and a scaling parameter $\phi$. The probability of generating a permutation $\pi$ is
$$\frac{\phi^{d_{KT}(\pi, \pi^*)}}{Z}$$
where $d_{KT}$ is the Kendall-Tau distance [17] and $Z$ is a normalizing constant that only depends on $\phi$ and the number of elements being permuted, which we denote by $n$. C. L. Mallows introduced this model and gave an inefficient procedure for sampling from it: Rank every pair of elements randomly and independently so that they agree with $\pi^*$ with probability $\frac{1}{1+\phi}$ and output the ranking if it is a total ordering. Doignon et al. [11] discovered a repeated insertion-based model that they proved is equivalent to the Mallows model and that lends itself more directly to an efficient sampling procedure.
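To make the pairwise procedure concrete, the following Python sketch enumerates every outcome of the independent pairwise comparisons for a small number of elements, keeps only the outcomes that form a total ordering, and returns the resulting conditional law. The function name `pairwise_law`, the encoding of a ranking as a tuple with the best-ranked element first, and the agreement probability $1/(1+\phi)$ are assumptions made for this illustration. The enumeration is exponential in the number of pairs, so it is only meant for very small $n$.

```python
from itertools import combinations, product


def pairwise_law(phi, center):
    """Exact law of the pairwise procedure, conditioned on a total ordering.

    `center` is a tuple listing the elements from best to worst rank.
    """
    pairs = list(combinations(center, 2))  # (a, b) with a ahead of b in center
    p_agree = 1 / (1 + phi)                # assumed agreement probability
    law = {}
    for flips in product([False, True], repeat=len(pairs)):
        wins = {x: 0 for x in center}
        weight = 1.0
        for (a, b), flipped in zip(pairs, flips):
            wins[b if flipped else a] += 1
            weight *= (1 - p_agree) if flipped else p_agree
        # A tournament is a total ordering iff its win counts are 0, 1, ..., n-1.
        if sorted(wins.values()) == list(range(len(center))):
            ranking = tuple(sorted(center, key=lambda x: -wins[x]))
            law[ranking] = law.get(ranking, 0.0) + weight
    total = sum(law.values())
    return {ranking: w / total for ranking, w in law.items()}
```

Each surviving outcome with $d$ disagreements has weight proportional to $\phi^d$, which is why the conditional law coincides with the Mallows probabilities; the rejected outcomes are what make the procedure inefficient as a sampler.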
The Mallows model is a natural way to represent noisy data when there is one true ranking that correlates well with everyone’s own individual ranking. However in many settings (e.g. voting [15], recommendation systems) the population is heterogeneous and composed of two or more subpopulations. In this case, it is more appropriate to model the data as a mixture of simpler models. Along these lines, there has been considerable interest in fitting the parameters of a mixture of Mallows models to ranking data [18, 22, 20]. However most of the existing approaches (e.g. Expectation-Maximization [18]) are heuristic and only recently were the first algorithms with provable guarantees given. For a single Mallows model, Braverman and Mossel
[7] showed how to learn it by quantifying how close the empirical average of the ordering of elements is to the ordering given by $\pi^*$ as the number of samples increases.
Awasthi et al. [3] gave the first polynomial time algorithm for learning mixtures of two Mallows models. Their algorithm learns the centers $\pi_1$ and $\pi_2$ exactly and the mixing weights and scaling parameters up to an additive $\epsilon$ with running time and sample complexity
Here $\phi_1$ and $\phi_2$ are the scaling parameters and $w_{\min}$ is the smallest mixing weight. Their algorithm works by recasting the parameter learning problem in the language of tensor decompositions, similarly to other algorithms for learning latent variable models
[2]. However there is a serious complication in that most of the entries in the tensor are exponentially small. So even though we can compute unbiased estimates of the entries of a tensor whose low rank decomposition would reveal the parameters of the Mallows model, most of the entries cannot be meaningfully estimated from few samples. Instead, Awasthi et al. [3] show how the entries that can be accurately estimated can be used to learn the prefixes of the permutations, which can be bootstrapped to learn the rest of the parameters. In fact before their work, it was not even known whether a mixture of two Mallows models was identifiable, i.e. whether its parameters can be uniquely determined from an infinite number of samples.
The natural open question was to give provable algorithms for learning mixtures of any constant number of Mallows models. For other learning problems like mixtures of product distributions [14, 12] and mixtures of Gaussians [16, 21, 4], algorithms for learning mixtures of two components under minimal conditions were eventually extended to any constant number of components. Chierichetti et al. [9] showed that when the number of components is exponential in $n$, identifiability fails. On the other hand, when all the scaling parameters are the same and known, Chierichetti et al. [9] showed that it is possible to learn the parameters when given an arbitrary (and at least exponential in $n$) number of samples. Their approach was based on the Hadamard matrix exponential. They also gave a clustering-based algorithm that runs in polynomial time and works whenever the centers are well-separated according to the Kendall-Tau distance, by utilizing recent concentration bounds for Mallows models that quantify how close a sample is likely to be to the center [5].
1.2 Our Results and Techniques
Our main result is a polynomial time algorithm for learning mixtures of Mallows models for any constant number of components. Let $d_{TV}$ be the total variation distance, let $w_{\min}$ be the smallest mixing weight, and let $U$ denote the uniform distribution over the $n!$ possible permutations. We prove:
Theorem 1.1.
For any constant , given samples from a mixture of Mallows models
where for all , for all and , there is an algorithm whose running time and sample complexity are
for learning each center exactly and the mixing weights and scaling parameters to within an additive . Moreover the algorithm succeeds with probability at least .
A main challenge in learning mixtures of Mallows models is in establishing polynomial identifiability, i.e. that the parameters of the model can be approximately determined from a polynomial number of samples. When addressing this question, there is a natural special matrix to consider: Let $A$ be an $n! \times n!$ matrix whose rows and columns are indexed by permutations with
$$A_{\pi, \sigma} = \phi^{d_{KT}(\pi, \sigma)}.$$
Zagier [27] used tools from representation theory to find a simple expression for the determinant of $A$. Interestingly his motivation for studying this problem came from interpolating between Bose and Fermi statistics in mathematical physics. We can translate his result into our context by observing that the columns of $A$, after normalizing so that they sum to one, correspond to Mallows models with the same fixed scaling parameter $\phi$. Thus Zagier’s result implies that any two distinct mixtures $M$ and $M'$ of Mallows models, where all the components have the same scaling parameter, produce different distributions. (Footnote 1: This result was rediscovered by Chierichetti et al. [9] using different tools, but without a quantitative lower bound on the smallest singular value.)
However the quantitative lower bounds that follow from Zagier’s expression for the determinant are too weak for our purposes, and are not adapted to the number of components in the mixture. We exploit symmetry properties to show lower bounds on the length of any column of $A$ projected onto the orthogonal complement of any small set of other columns, which allows us to show that not only does $A$ have full rank, any small number of its columns are robustly linearly independent [1]. More precisely we prove:
Theorem 1.2.
Let . Let be any distinct columns of , normalized so that they each sum to one. Then
where are arbitrary real coefficients.
Even though this result nominally applies to mixtures of Mallows models where all the scaling parameters are the same, we are able to use it as a black box to solve the more general learning problem. We reformulate our lower bound on how close any small set of columns can be to being linearly dependent in the language of test functions, which we use to show that when the scaling parameters are different, we can isolate one component at a time and subtract it off from the rest of the mixture. Combining these tools, we obtain our main algorithm. We note that the separation conditions we impose between pairs of components are information-theoretically necessary for our learning task.
It is natural to ask whether the dependence on can be improved. First, we show lower bounds on the sample complexity. We construct two mixtures and whose components are far apart — every pair of components has total variation distance at least — but and have total variation distance about . As a corollary we have:
Corollary 1.
Any algorithm for learning the components of a mixture of Mallows models within in total variation distance must take at least samples.
Second, we consider a restricted model where the learner can only make local queries of the form: Given elements and locations and a tolerance , what is the probability that the mixture assigns to location for all from to , up to an additive ? We show that our algorithms can be implemented in the local model. Moreover, in this model, we can prove lower bounds on the dependence on and . We show:
Theorem 1.3 (Informal).
Any algorithm for learning a mixture of Mallows models through local queries must make at least queries or make a query with .
This is reminiscent of statistical query lower bounds for other unsupervised learning problems, most notably learning mixtures of Gaussians [10]. However it is not clear how to prove lower bounds on the statistical query dimension [13], because of the complicated ways that the locations that each element is mapped to affect one another in a Mallows model, which makes it challenging to embed small hard instances into larger ones.
Finally we turn to beyond worst-case analysis and ask whether there are natural conditions on the mixture that allow us to get algorithms whose dependence on $n$ is a fixed polynomial rather than one whose degree depends on $k$. Rather than requiring the centers to be far apart, we merely require their scaling parameters to be separated from one another. We show:
Theorem 1.4.
Given samples from a mixture of Mallows models
where for all , for all and , there is an algorithm whose running time and sample complexity are
for learning each center exactly and the mixing weights and scaling parameters to within an additive , where . Moreover the algorithm succeeds with probability at least .
Our algorithm leverages many of the lower bounds on the total variation distance between mixtures of Mallows models and test functions for separating one component from the others that we have established along the way.
Further Related Work
There are other natural models for distributions on permutations such as the Bradley-Terry model [6] and the Plackett-Luce model [23, 19]. Zhao et al. [28] showed that a mixture of Plackett-Luce models is generically identifiable provided that
and gave a generalized method of moments algorithm that they proved is
consistent, meaning that as the number of samples goes to infinity, the algorithm recovers the true parameters. More generally, Teicher [24, 25] obtained sufficient conditions for the identifiability of finite mixtures but these conditions do not apply in our setting.
2 Preliminaries
2.1 Basic Notation
Let $[n] = \{1, 2, \dots, n\}$. Given two permutations $\pi$ and $\sigma$ on $[n]$, let $d_{KT}(\pi, \sigma)$ denote the Kendall-Tau distance, which counts the number of pairs $i < j$ for which the two rankings disagree.
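As a small illustration of this definition, the Python sketch below counts the pairs of elements that two rankings order differently. The name `kendall_tau` and the encoding of a ranking as a tuple listing the elements from best to worst are assumptions made for the example, and the quadratic pair enumeration is chosen for clarity rather than speed.

```python
from itertools import combinations


def kendall_tau(pi, sigma):
    """Number of element pairs that the two rankings order differently."""
    pos_pi = {x: i for i, x in enumerate(pi)}
    pos_sigma = {x: i for i, x in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        if (pos_pi[a] < pos_pi[b]) != (pos_sigma[a] < pos_sigma[b])
    )
```

The distance is zero exactly when the rankings coincide and reaches its maximum of $\binom{n}{2}$ when one ranking is the reverse of the other.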
Definition 1.
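A Mallows model $M(\phi, \pi^*)$ defines a distribution on permutations of the set $[n]$ where the probability of generating a permutation $\pi$ is equal to
$$\frac{\phi^{d_{KT}(\pi, \pi^*)}}{Z_n(\phi)},$$
where $Z_n(\phi) = \sum_{\pi} \phi^{d_{KT}(\pi, \pi^*)}$ is the normalizing constant, which is easy to see is independent of $\pi^*$. When the number of elements is clear from context, we will omit $n$ and write $Z(\phi)$.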
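The definition can be checked by brute force for small $n$. The sketch below, with the hypothetical helper names `normalizer` and `mallows_pmf`, computes the normalizing constant by summing over all permutations and builds the full probability table. It assumes rankings are tuples listing elements from best to worst, and it is exponential in $n$, so it serves only as a sanity check.

```python
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    """Number of element pairs that the two rankings order differently."""
    pos_pi = {x: i for i, x in enumerate(pi)}
    pos_sigma = {x: i for i, x in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        if (pos_pi[a] < pos_pi[b]) != (pos_sigma[a] < pos_sigma[b])
    )


def normalizer(phi, center):
    """Sum of phi^d_KT(pi, center) over all permutations pi (brute force)."""
    return sum(phi ** kendall_tau(pi, center) for pi in permutations(center))


def mallows_pmf(phi, center):
    """Probability table {ranking: probability} of the model with this center."""
    z = normalizer(phi, center)
    return {pi: phi ** kendall_tau(pi, center) / z for pi in permutations(center)}
```

Evaluating `normalizer` at two different centers gives the same value, which is the independence noted in the definition; for small cases it also agrees with the standard product form $\prod_{i=1}^{n} (1 + \phi + \cdots + \phi^{i-1})$.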
The following is a well-known (see e.g. [11]) iterative process for generating a ranking from $M(\phi, \pi^*)$: Consider the elements in rank decreasing order, according to $\pi^*$. When we reach the $i$-th ranked element, it is inserted into each of the $i$ possible positions with probabilities
$$\frac{\phi^{i-1}}{1 + \phi + \cdots + \phi^{i-1}}, \ \frac{\phi^{i-2}}{1 + \phi + \cdots + \phi^{i-1}}, \ \dots, \ \frac{1}{1 + \phi + \cdots + \phi^{i-1}}$$
respectively, where the order of the probabilities goes from the highest rank position it could be inserted to the lowest. When the last element is inserted, the result is a random permutation drawn from $M(\phi, \pi^*)$.
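The insertion process can be sketched in a few lines of Python. The names `sample_mallows` and `insertion_law` are hypothetical, and the sketch assumes the convention that inserting an element $j$ slots above the bottom of the current partial ranking creates $j$ new inversions and therefore has weight $\phi^j$. The second function enumerates all insertion choices so that the resulting law can be compared exactly against $\phi^{d_{KT}(\pi, \pi^*)}/Z$ for small $n$.

```python
import random
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    """Number of element pairs that the two rankings order differently."""
    pos_pi = {x: i for i, x in enumerate(pi)}
    pos_sigma = {x: i for i, x in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        if (pos_pi[a] < pos_pi[b]) != (pos_sigma[a] < pos_sigma[b])
    )


def sample_mallows(phi, center, rng):
    """Draw one ranking by inserting the elements of `center` one at a time."""
    ranking = []
    for i, x in enumerate(center, start=1):
        # Slot j places above the bottom creates j inversions: weight phi^j.
        weights = [phi ** j for j in range(i)]
        j = rng.choices(range(i), weights=weights)[0]
        ranking.insert(len(ranking) - j, x)
    return tuple(ranking)


def insertion_law(phi, center):
    """Exact law of the insertion process, enumerating every insertion choice."""
    law = {(): 1.0}
    for i, x in enumerate(center, start=1):
        z_i = sum(phi ** j for j in range(i))
        nxt = {}
        for partial, p in law.items():
            for j in range(i):
                ranking = list(partial)
                ranking.insert(len(ranking) - j, x)
                key = tuple(ranking)
                nxt[key] = nxt.get(key, 0.0) + p * phi ** j / z_i
        law = nxt
    return law
```

A call such as `sample_mallows(0.3, (1, 2, 3, 4), random.Random(0))` uses only $n$ weighted draws, in contrast to the brute-force table, which needs all $n!$ probabilities.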
A mixture of Mallows models is defined in the usual way: We write
$$M = w_1 M(\phi_1, \pi_1) + \cdots + w_k M(\phi_k, \pi_k)$$
where the mixing weights $w_i$ are nonnegative and sum to one. A permutation is generated by first choosing an index $i \in [k]$ (each $i$ is chosen with probability $w_i$) and then drawing a sample from the corresponding Mallows model $M(\phi_i, \pi_i)$.
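The two-stage generation of a mixture sample can be sketched as follows. The names `mixture_pmf` and `sample_mixture` and the representation of a component as a `(phi, center)` pair are assumptions made for this example, and each component is sampled from its brute-force probability table, so the code is only suitable for small $n$.

```python
import random
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    """Number of element pairs that the two rankings order differently."""
    pos_pi = {x: i for i, x in enumerate(pi)}
    pos_sigma = {x: i for i, x in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        if (pos_pi[a] < pos_pi[b]) != (pos_sigma[a] < pos_sigma[b])
    )


def mallows_pmf(phi, center):
    """Brute-force probability table of a single Mallows model."""
    raw = {pi: phi ** kendall_tau(pi, center) for pi in permutations(center)}
    z = sum(raw.values())
    return {pi: v / z for pi, v in raw.items()}


def mixture_pmf(weights, components):
    """Weighted sum of component tables; components are (phi, center) pairs."""
    mix = {}
    for w, (phi, center) in zip(weights, components):
        for pi, p in mallows_pmf(phi, center).items():
            mix[pi] = mix.get(pi, 0.0) + w * p
    return mix


def sample_mixture(weights, components, rng):
    """Pick a component with probability w_i, then sample a ranking from it."""
    phi, center = rng.choices(components, weights=weights)[0]
    pmf = mallows_pmf(phi, center)
    rankings = list(pmf)
    return rng.choices(rankings, weights=[pmf[r] for r in rankings])[0]
```

The table returned by `mixture_pmf` is the distribution a learner actually observes; the component index chosen inside `sample_mixture` is hidden, which is what makes recovering the parameters nontrivial.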
We will often work with the natural vectorizations of probability distributions:
Definition 2.
If is a distribution over permutations on we let denote the length vector whose entries are the probabilities of generating each possible permutation. We will abuse notation and write for the vectorization of a Mallows model .
Our algorithms and their analyses will frequently make use of the notion of restricting a permutation to a set of elements:
Definition 3.
Given a permutation $\pi$ on $[n]$ and a subset $S \subseteq [n]$, let $\pi|_S$ be the permutation on the elements of $S$ induced by $\pi$.
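Restriction simply drops the elements outside the chosen subset while keeping the relative order of the rest. A minimal sketch, with the hypothetical name `restrict` and rankings again encoded as tuples from best to worst:

```python
def restrict(pi, subset):
    """Ranking induced by `pi` on the elements of `subset`."""
    keep = set(subset)
    return tuple(x for x in pi if x in keep)
```

For instance, `restrict((3, 1, 4, 2), {1, 2, 3})` keeps the elements 3, 1 and 2 in the order they appear in the original ranking.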
2.2 Block and Orders
Our algorithms will be built on various structures we impose on permutations. The way to think about these structures is that each one gives us a statistic that we can measure: What is the probability that a permutation sampled from an unknown Mallows model has the desired structure? These act like natural moments of the distribution, that we will manipulate and use in conjunction with tensor methods to design our algorithms.
Definition 4.
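A block structure $\mathcal{B} = (S_1, S_2, \dots, S_j)$ is an ordered collection of disjoint subsets of $[n]$. We say that a permutation $\pi$ satisfies $\mathcal{B}$ as a block structure if for each $i$, the elements of $S_i$ occur consecutively (i.e. in positions $a, a+1, \dots, a+|S_i|-1$ for some $a$) in $\pi$ and moreover the blocks occur in the order $S_1, S_2, \dots, S_j$. Finally we let $\mathcal{S}_{\mathcal{B}}$ denote the set of permutations satisfying $\mathcal{B}$ as a block structure.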
Definition 5.
An order structure $\mathcal{B} = \{S_1, S_2, \dots, S_j\}$ is a collection of ordered subsets of $[n]$. We say a permutation $\pi$ satisfies $\mathcal{B}$ as an order structure if for each $i$, the elements of $S_i$ occur in $\pi$ in the same relative order as they do in $S_i$.
Definition 6.
An ordered block structure $\mathcal{B} = (S_1, S_2, \dots, S_j)$ is an ordered collection of ordered disjoint subsets of $[n]$. We say a permutation $\pi$ satisfies $\mathcal{B}$ as an ordered block structure if it satisfies $\mathcal{B}$ both as a block structure and as an order structure, i.e. we forget the order within each $S_i$ when we treat it as a block structure and we forget the order among the $S_i$’s when we treat it as an order structure.
To help parse these definitions, we include the following example:
Example 1.
Let and consider . The permutation satisfies as a block structure. The permutation satisfies as an order structure and the permutation satisfies as an ordered block structure.
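The three notions can also be compared programmatically. The sketch below is an illustration under stated assumptions rather than the construction used in the proofs: the function names are hypothetical, a ranking is a tuple listing elements from best to worst, each block is a nonempty tuple whose internal order matters only for the order-structure check, and "the blocks occur in order" is read as the first block starting before the second, and so on.

```python
def _positions(pi):
    return {x: i for i, x in enumerate(pi)}


def satisfies_block_structure(pi, blocks):
    """Each block occupies consecutive positions and blocks appear in order."""
    pos = _positions(pi)
    starts = []
    for block in blocks:
        where = sorted(pos[x] for x in block)
        if where[-1] - where[0] != len(where) - 1:
            return False  # the block's elements are not consecutive
        starts.append(where[0])
    return starts == sorted(starts)


def satisfies_order_structure(pi, ordered_sets):
    """Within each ordered set, elements keep their listed relative order."""
    pos = _positions(pi)
    return all(
        pos[a] < pos[b]
        for seq in ordered_sets
        for a, b in zip(seq, seq[1:])
    )


def satisfies_ordered_block_structure(pi, ordered_blocks):
    """Both conditions at once, applied to the same collection."""
    return (satisfies_block_structure(pi, ordered_blocks)
            and satisfies_order_structure(pi, ordered_blocks))
```

With the blocks `[(1, 2), (3, 4)]` on six elements, the ranking `(2, 1, 5, 4, 3, 6)` satisfies only the block structure, `(1, 5, 2, 3, 6, 4)` satisfies only the order structure, and `(1, 2, 5, 3, 4, 6)` satisfies the ordered block structure.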
3 Basic Facts
Here we collect some basic facts about Mallows models, in particular a lower bound on the probability that they satisfy a given block structure if their base permutation does, relationships between the total variation distance and parameter distance, and determinantal identities for special matrices.
3.1 What Block Structures are Likely to be Satisfied?
In this subsection, our main result is a lower bound on the probability that a permutation drawn from a Mallows model satisfies a block structure that the underlying base permutation does. Along the way, we will also establish some useful ways to think about conditioning and projecting Mallows models in terms of tensors.
Fact 1.
The conditional distribution of a Mallows model when restricted to rankings where the elements in the set (of size ) are ranked in positions and the ranking of elements in is fixed is precisely .
Proof.
It is easy to see that for any two permutations and on where the elements of are ranked in positions and agree on the rankings of elements in satisfy
Thus the ratio of their probabilities is the same as the ratio of probabilities of and in , which completes the proof. ∎
Next we will describe a natural way to think about the conditional distribution on permutations that satisfy a given block structure as a tensor. Recall that the subsets of in a block structure are required to be disjoint.
Definition 7.
Given a Mallows model and a block structure , we define a dimensional tensor as follows: Each entry corresponds to orderings of respectively and in it, we put the probability that a ranking drawn from satisfies and for each , the elements in occur in the order specified by .
It is easy to see that has rank one. Technically this requires the obvious generalization of Fact 1 where we condition on the elements in each occurring in specified consecutive locations, and then note that these events are all disjoint.
Corollary 2.
Our next result gives a convenient lower bound for the probability that a sample satisfies a given block structure provided that the base permutation satisfies :
Lemma 1.
For any Mallows model and block structure where satisfies and , we have
Proof.
Without loss of generality let and . We will first consider the set of permutations with the following property: For each , the elements occur in their natural order and each of them occurs after all of the elements . Using the iterative procedure for sampling from a Mallows model defined in Section 2.1 and the fact that , we have that
since we just need that when each element in each is inserted, it is inserted in the lowest ranked position available.
Next we build a correspondence between permutations in and permutations in . Consider . For each block , say that in , they are placed in positions
Note that these are not necessarily consecutive. We will map to a permutation in by making them consecutive while preserving their order by doing the following for each block: Place in positions and all other elements displaced from their position are placed afterwards while preserving their ordering. Crucially this process only reduces the number of inversions in due to the way that was defined. In particular, the permutation we get from this process has the property that it is at least as likely as to be generated from . Also note the intervals must be disjoint and occur in that order. Now in this correspondence, the order of all elements outside is preserved. Thus, there are at most different permutations in that can be mapped to the same element in . Putting this all together implies that
which completes the proof. ∎
3.2 Total Variation Distance Bounds
In this subsection, we give some useful relationships between the total variation distance and the parameter distance between two Mallows models (in special cases) in terms of the distance between their base permutations and scaling parameters. We will defer the proofs to Appendix A. First we prove that if two Mallows models have different base permutations and their scaling parameters are bounded away from one, then the distributions that they generate cannot be too close.
Claim 1.
Consider two Mallows models and where and . Then .
Second, we give a condition under which we can conclude that two Mallows models are close in total variation distance. An analogous result is proved in [3] (see Lemma 2.6) except that here we remove the dependence on .
Lemma 2.
Consider two Mallows models and with the same base permutation on elements. If then .
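For small $n$ the total variation distance between two Mallows models can be computed exactly, which is a convenient way to get a feel for statements of this kind. The sketch below uses the hypothetical names `mallows_pmf` and `total_variation`, takes the total variation distance to be half the $\ell_1$ distance between the probability tables, and relies on brute-force enumeration; it does not reproduce the quantitative bounds of Claim 1 or Lemma 2.

```python
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    """Number of element pairs that the two rankings order differently."""
    pos_pi = {x: i for i, x in enumerate(pi)}
    pos_sigma = {x: i for i, x in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        if (pos_pi[a] < pos_pi[b]) != (pos_sigma[a] < pos_sigma[b])
    )


def mallows_pmf(phi, center):
    """Brute-force probability table of a single Mallows model."""
    raw = {pi: phi ** kendall_tau(pi, center) for pi in permutations(center)}
    z = sum(raw.values())
    return {pi: v / z for pi, v in raw.items()}


def total_variation(p, q):
    """Half the l1 distance between two probability tables."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

Comparing two models with the same base permutation and nearby scaling parameters gives a small distance, while swapping the base permutation at a fixed scaling parameter bounded away from one gives a distance bounded away from zero, in line with the qualitative behavior described above.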
3.3 Special Matrix Results
Here we present a determinantal identity from mathematical physics that will play a central role in our learning algorithms. Note where counts the number of inversions in a permutation. We make the following definition.
Definition 8.
Let $A$ be the $n! \times n!$ matrix whose rows and columns are indexed by permutations on $[n]$ and whose entries are $A_{\pi, \sigma} = \phi^{d_{KT}(\pi, \sigma)}$.
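The matrix is easy to build explicitly for small $n$. The sketch below assumes the entries are $\phi^{d_{KT}(\pi, \sigma)}$, uses the hypothetical names `special_matrix` and `determinant`, and works with exact `Fraction` arithmetic so that determinants are not affected by rounding. It does not implement Zagier's formula; it only evaluates the determinant directly by Gaussian elimination.

```python
from fractions import Fraction
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    """Number of element pairs that the two rankings order differently."""
    pos_pi = {x: i for i, x in enumerate(pi)}
    pos_sigma = {x: i for i, x in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        if (pos_pi[a] < pos_pi[b]) != (pos_sigma[a] < pos_sigma[b])
    )


def special_matrix(phi, n):
    """Return (permutations, matrix) with matrix[p][s] = phi^d_KT(p, s)."""
    perms = list(permutations(range(1, n + 1)))
    return perms, [[phi ** kendall_tau(p, s) for s in perms] for p in perms]


def determinant(matrix):
    """Exact determinant by Gaussian elimination (meant for Fraction entries)."""
    m = [row[:] for row in matrix]
    det = Fraction(1)
    for c in range(len(m)):
        pivot = next((r for r in range(c, len(m)) if m[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det
        det *= m[c][c]
        for r in range(c + 1, len(m)):
            factor = m[r][c] / m[c][c]
            for k in range(c, len(m)):
                m[r][k] -= factor * m[c][k]
    return det
```

For $n = 2$ the matrix has ones on the diagonal and $\phi$ off the diagonal, so its determinant is $1 - \phi^2$. Every column sums to the same normalizing constant, which is why dividing a column by its sum yields the probability vector of a Mallows model centered at that column's permutation.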
Zagier [27] gives us an explicit form for the determinant of this matrix, which we quote here:
Theorem 3.1.
[27]
This expression for the determinant gives us weak lower bounds on the total variation distance between mixtures of Mallows models where all the scaling parameters are the same. We will bootstrap this identity to prove a stronger result about how far a column of $A$ is from the span of any set of other columns.
4 Identifiability
In this section we show that any two mixtures of Mallows models whose components are far from each other (and the uniform distribution) in total variation distance are far from each other as mixtures too, provided that .
4.1 Robust Kruskal Rank
Our first step is to show that any small set of columns of $A$ is not too close to being linearly dependent, i.e. the projection of any column onto the orthogonal complement of the span of the other columns in the set cannot be too small. The Kruskal rank of a collection of vectors is the largest $r$ so that every $r$ vectors are linearly independent. The property we establish here is sometimes called a robust Kruskal rank [1].
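To illustrate the (non-robust) notion of Kruskal rank, the following brute-force Python sketch checks every subset of a given size for linear independence using exact rational arithmetic. The names `rank` and `kruskal_rank` are hypothetical, the search is exponential in the number of vectors, and the robust version discussed here would additionally require a quantitative lower bound on the projections, which this sketch does not compute.

```python
from fractions import Fraction
from itertools import combinations


def rank(vectors):
    """Rank of a list of equal-length vectors, by exact row reduction."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    r = 0
    ncols = len(rows[0]) if rows else 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def kruskal_rank(vectors):
    """Largest r such that every r of the vectors are linearly independent."""
    best = 0
    for size in range(1, len(vectors) + 1):
        if all(rank(subset) == size for subset in combinations(vectors, size)):
            best = size
        else:
            break
    return best
```

The vectors `(1, 0, 0)`, `(0, 1, 0)` and `(1, 1, 0)` have Kruskal rank two, since every pair is independent but the three together are not, whereas a collection containing two parallel vectors has Kruskal rank one regardless of its overall rank.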
Lemma 3.
Suppose and consider columns of . The projection of one column onto the orthogonal complement of the other has euclidean length at least .
Proof.
Assume for the sake of contradiction that there is a set of columns that violates the statement of the lemma. In particular suppose that the projection of onto the orthogonal complement of has euclidean length for some . We will use this assumption to prove an upper bound on the determinant of that violates Theorem 3.1. Our approach is to find an ordering of the columns so that as we scan through, at least once every columns the euclidean length of its projection onto the orthogonal complement of the columns seen so far is at most . Then using the naive upper bound of on the euclidean length of any column of we have
However from Theorem 3.1 we have
which yields the desired contradiction.
Now we complete the argument by constructing the desired ordering of the columns as follows: We start with . And then we choose any column not yet selected. Let be the permutation that maps to . Now maps to columns, and suppose of them have not been selected yet. Call these . We continue the ordering of the columns by appending . It is easy to see that the euclidean length of the projection of onto the orthogonal complement of the columns seen so far is also at most , which now finishes the proof. ∎
The above lemma is not directly useful for two reasons: First, the lower bound is exponentially small in . Second, it is tantamount to a lower bound on the norm of any sparse linear combination of the columns of . What we really want in the context of identifiability is a lower bound on the norm (of a matrix whose columns represent the components).
Definition 9.
Let be obtained from by normalizing its columns to sum to one.
Lemma 4.
Suppose and consider any columns of . Then
provided that .
Proof.
Let be the permutations corresponding to the columns . Also without loss of generality suppose and that . First we build a block structure that satisfies but no other does: For each , pick two consecutive elements in , say and that are inverted in . Such a pair exists because . Now we can take the union of these pairs over all to form a block structure so that, for all , and are in the same block and satisfies . Note that can be less than , if for example two of the pairs contain the same element. In any case, we have .
Now for each , set and . From Corollary 2 we have that
Next we show that we can find unit vectors so that

whenever and

for all
This fact essentially follows from Lemma 3. First observe that the and the column of corresponding to differ only by a normalization, since the former sums to one. Now for each we can take to be the unit vector in the direction of the projection of onto the orthogonal complement of all the ’s for . Note that the additional factor arises because Lemma 3 deals with and to normalize any column of it we need to divide by at most .
With this construction, we have that for all since each differs from when restricted to at least one of the blocks . Moreover using property above and Lemma 1 we have
where the last inequality follows from the bound . Finally note that
since the entries of are at most one in absolute value, and each can be formed from by zeroing out entries (corresponding to permutations that do not satisfy ) and summing subsets of the remaining ones together (that correspond to permutations with the same ordering for each ). ∎
The above lemma readily implies that any two mixtures of Mallows models whose components all have the same scaling parameter and whose mixing weights are different are far from each other in total variation distance. In the sequel we will be interested in proving identifiability even when the scaling parameters are allowed to be different. As a step towards that goal, first we give a simple extension that allows the scaling parameters to be slightly different:
Lemma 5.
Consider any distinct permutations and scaling parameters . Let and suppose that for each , and for each ,
Then for any coefficients with we have
4.2 Polynomial Identifiability
Now we are ready to prove our main identifiability result. First we define a notion of non-degeneracy, which is information-theoretically necessary when our goal is to identify all the components in the mixture.
Definition 10.
We say a mixture of Mallows models is non degenerate if the total variation distance between any pair of components is at least and the total variation distance between any component and the uniform distribution over all permutations is also at least . Furthermore we say that the mixture is non degenerate if in addition each mixing weight is at least .
We will not need the following definition until later (when we state the guarantees of various intermediary algorithms), but let us also define a natural notion for two mixtures to be componentwise close:
Definition 11.
We say that two mixtures of Mallows models and with the same number of components are componentwise close if there is a relabelling of components in one of the mixtures after which and for all .
Most of these conditions are standard, because in order to be able to identify from a polynomial number of samples we need to get at least one sample from each component and at least one sample from the difference between any two components. In our context, we additionally require no component to be too close to the uniform distribution because the distribution $M(1, \pi)$ is the same regardless of the choice of $\pi$.
Below, we state and prove our main lemma in this section. The technical argument is quite involved, but many of the antecedents (in particular finding block structures that capture the disagreements between permutations, representing the distribution on a subset of permutations as a tensor and constructing test functions as the tensor product of simple vectors) were used already in the proof of Lemma 4.
Lemma 6.
Consider any (not necessarily distinct) permutations and scaling parameters . Set and suppose that the collection of Mallows models is non degenerate. Then for any coefficients with we have
Again, it suffices to consider . Set . We will break up the proof of this lemma into two cases. First consider the case where and . We will use the following intricate construction: Consider the set of such that
and suppose without loss of generality these are . Now by Lemma 2 and the assumption of non degeneracy we conclude that the permutations are all distinct and that are all at most . For each of these permutations, we will build an ordered block structure where the total size of the sets in it is at most , with the property that satisfies but do not. We will do this as follows: For each choice of a permutation in the list we add to two consecutive elements of whose order is reversed in . This completes the construction.
Now we will pick an additional elements in the following manner. First none of these elements should occur in any of the ordered block structures . Second we want that each pair is consecutive in . Third, we want the pairs to be as early as possible in . What is the largest rank that we will need to use to select these additional elements? There are at most elements among the ordered block structures and at worst the gap between these elements in is one so that we need to use at most rank .
Proof of First Case.
Now consider the set of permutations where the first elements are in that order except up to possibly reversing the order of some pairs and . There are such permutations and we will let denote the vector restricted to the indices corresponding to . Set and . As in Definition 7 we can form a dimensional rank one tensor of order whose entries represent the probability of any permutation in , which using Corollary 2 can be written as where
Furthermore since the probability that the elements are the first elements in that order is at least .
Now for any and any define the vector as follows: If occurs before in then set . Otherwise set . Note that by construction we have that and are orthogonal. Now define the tensor
which has the key property that for . Next we lower bound . First we note for any we have . Thus we conclude . Now putting it all together we have
We claim that the permutations must all be distinct because we chose the elements in to avoid each ordered block structure . Moreover the scaling parameters are all close to each other. We can apply Lemma 5 to lower bound the right hand side to complete the proof. ∎
We remark that everything in the first case works as is if and for any . Thus in the remaining case we can assume that for every either or .