This paper considers the following natural scenario: there is a large heterogeneous population which consists of $k$ disjoint subgroups, and for each subgroup there is a “central preference order” specifying a ranking over a fixed set of $n$ items (equivalently, specifying a permutation in the symmetric group $S_n$). For each $i \in [k]$, the preference order of each individual in subgroup $i$ is assumed to be a noisy version of the central preference order $\pi_i$ (the permutation corresponding to subgroup $i$). A natural learning task which arises in this scenario is the following: given access to the preference orders of randomly selected members of the population, is it possible to learn the central preference orders of the sub-populations, as well as the relative sizes of these sub-populations within the overall population?
Worst-case formulations of the above problem typically tend to be (difficult) variants of the feedback arc set problem, which is known to be NP-complete [GJ79]. In view of the practical importance of problems of this sort, though, there has been considerable recent research interest in studying various generative models corresponding to the above scenario (we discuss some of the recent work which is most closely related to our results in Section 1.3). In this paper we model the above general problem schema as follows: The “central preference orders” of the subgroups are given by unknown permutations $\pi_1, \dots, \pi_k \in S_n$. The fraction of the population belonging to the $i$-th subgroup, for $i \in [k]$, is given by an unknown mixing weight $w_i > 0$ (so $\sum_{i=1}^k w_i = 1$). Finally, the noise is modeled by some parametric family of distributions $\{\mathcal{D}(\theta)\}$, where each distribution $\mathcal{D}(\theta)$ is supported on $S_n$, and the preference order of a random individual in the $i$-th subgroup is given by $\boldsymbol{\sigma}\pi_i$, where $\boldsymbol{\sigma} \sim \mathcal{D}(\theta)$. Here $\theta$ is a model parameter capturing the “noise rate” (we will have much more to say about this for each of the specific noise models we consider below). The learning task is to recover the central rankings $\pi_1, \dots, \pi_k$ and their proportions $w_1, \dots, w_k$, given access to the preference orders of randomly chosen individuals from the population. In other words, each sample provided to the learner is independently generated by first choosing a random permutation $\boldsymbol{\pi}$, where $\boldsymbol{\pi}$ is chosen to be $\pi_i$ with probability $w_i$; then independently drawing a random $\boldsymbol{\sigma} \sim \mathcal{D}(\theta)$; and finally, providing the learner with the permutation $\boldsymbol{\sigma}\boldsymbol{\pi}$. Let $\Pi$ denote the function which is $w_i$ at $\pi_i$ and $0$ otherwise. With this notation, we write “$\mathcal{D}(\theta) * \Pi$” to denote the distribution over noisy samples described above, and our goal is to approximately recover $\Pi$ given such noisy samples. The reader may verify that the distribution over samples defined by this process is precisely given by the group convolution $\mathcal{D}(\theta) * \Pi$ (and hence the notation).
1.1 The noise models that we consider
We consider a range of different noise models, corresponding to different choices for the parametric family , and for each one we give an efficient algorithm for recovering the population in the presence of that kind of noise. In this subsection we detail the three specific noise models that we will work with (though as we discuss later, our general mode of analysis could be applied to other noise models as well).
(A.) Symmetric noise. In the symmetric noise model, the parametric family of distributions over $S_n$ is denoted $\{\mathcal{D}_{\mathrm{sym}}(\vec{p})\}$. Given a vector $\vec{p} = (p_0, p_1, \dots, p_n)$ (so each $p_m \geq 0$ and $\sum_{m=0}^{n} p_m = 1$), a draw of $\boldsymbol{\sigma} \sim \mathcal{D}_{\mathrm{sym}}(\vec{p})$ is obtained as follows:
Choose $\boldsymbol{m} \in \{0, 1, \dots, n\}$, where value $m$ is chosen with probability $p_m$.
Choose a uniformly random subset $\boldsymbol{S} \subseteq [n]$ of size exactly $\boldsymbol{m}$, and draw $\boldsymbol{\sigma}$ uniformly at random from the permutations supported on $\boldsymbol{S}$; in other words, $\boldsymbol{\sigma}$ is a uniformly random permutation over the set $\boldsymbol{S}$ and is the identity permutation on elements in $[n] \setminus \boldsymbol{S}$. (We denote this uniform distribution by $U_{\boldsymbol{S}}$.)
Note that in this model, if the noise vector $\vec{p}$ has $p_n = 1$, then every draw from the noise distribution is a uniform random permutation and there is no useful information available to the learner.
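The two-step draw described above (pick a size, pick a random subset of that size, randomly permute it and fix everything else) can be sketched directly. This is an illustrative helper under our own naming; `p` is the probability vector over sizes $0, \dots, n$.

```python
import random

def sample_symmetric_noise(n, p):
    """Draw sigma from the symmetric noise model (sketch, hypothetical helper):
    pick m with probability p[m], pick a uniform size-m subset S of {0,...,n-1},
    and apply a uniform random permutation to S, fixing everything outside S."""
    m = random.choices(range(n + 1), weights=p)[0]
    S = random.sample(range(n), m)          # the randomly chosen subset
    shuffled = S[:]
    random.shuffle(shuffled)                # uniform permutation of S
    sigma = list(range(n))                  # identity outside S
    for i, j in zip(S, shuffled):
        sigma[i] = j
    return sigma
```

With all weight on $m = n$ this reduces to a uniform random permutation, matching the remark above that such a noise vector destroys all information.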
In order to define the next two noise models that we consider, let us recall the notion of a right-invariant metric on $S_n$. Such a metric is one that satisfies $d(\pi, \sigma) = d(\pi\tau, \sigma\tau)$ for all $\pi, \sigma, \tau \in S_n$. We note that a metric is right-invariant if and only if it is invariant under relabeling of the items $1, \dots, n$, and that most metrics considered in the literature satisfy this condition (see [KV10, Dia88b] for discussions of this point). In this paper, for technical convenience we restrict our attention to the metric being the Cayley distance over $S_n$ (though see Section 1.5 for a discussion of how our methods and results could potentially be generalized to other right-invariant metrics):
Let $G$ be the undirected graph with vertex set $S_n$ and an edge between permutations $\pi$ and $\sigma$ if there is a transposition $\tau$ such that $\pi = \tau\sigma$. The Cayley distance $d_{\mathrm{Cay}}$ over $S_n$ is the metric induced by this graph; in other words, $d_{\mathrm{Cay}}(\pi, \sigma)$ is the smallest value $k$ such that there are transpositions $\tau_1, \dots, \tau_k$ satisfying $\pi = \tau_k \cdots \tau_1 \sigma$.
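The Cayley distance admits a standard closed form: it equals $n$ minus the number of cycles of $\pi\sigma^{-1}$, since each transposition can merge or split at most one pair of cycles. A minimal sketch (function name ours):

```python
def cayley_distance(pi, sigma):
    """Cayley distance between two permutations in one-line notation:
    n minus the number of cycles of the composition pi * sigma^{-1}."""
    n = len(pi)
    sigma_inv = [0] * n
    for i, s in enumerate(sigma):
        sigma_inv[s] = i
    tau = [pi[sigma_inv[i]] for i in range(n)]   # tau = pi o sigma^{-1}
    seen = [False] * n
    cycles = 0
    for i in range(n):                            # count cycles of tau
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = tau[j]
    return n - cycles
```

For example, a single transposition is at distance 1 from the identity, and a 3-cycle is at distance 2.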
Now we are ready to define the next two parameterized families of noise distributions that we consider. We note that each of the noise distributions $\mathcal{D}$ considered below has the natural property that $\mathcal{D}(\sigma)$ decreases with $d_{\mathrm{Cay}}(\sigma, e)$, where $e$ is the identity permutation.
(B.) Heat kernel random walk under Cayley distance. Let $L$ be the Laplacian of the graph $G$ defined above. Given a “temperature” parameter $t > 0$, the heat kernel is the matrix $K_t = e^{-tL}$. It is well known that $K_t$ is the transition matrix of the random walk induced by choosing a Poisson-distributed time parameter $\boldsymbol{T}$ and then taking $\boldsymbol{T}$ steps of a uniform random walk in the graph $G$. With this motivation, we define the heat kernel noise model as follows: the parametric family of distributions is $\{\mathcal{D}_{\mathrm{heat}}(t)\}_{t > 0}$, where the probability weight that $\mathcal{D}_{\mathrm{heat}}(t)$ assigns to permutation $\pi$ is the probability that the above-described random walk, starting at the identity permutation $e$, reaches $\pi$. (Observe that higher temperature parameters correspond to higher rates of noise. More precisely, it is well known that the mixing time of a uniform random walk on $G$ is $\Theta(n \log n)$ steps, so once the expected number of steps grows beyond this the distribution converges rapidly to the uniform distribution on $S_n$; see [DS81]
for detailed results along these lines.) We note that these probability distributions (or more precisely, the associated heat kernel) have been previously studied in the context of learning rankings, see e.g. [KL02, KB10, JV18]. In some of this work, a different underlying distance measure over $S_n$ was used rather than the Cayley distance; see our discussion of related work in Section 1.3.
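A draw from this model can be simulated by the random-walk description itself: sample a Poisson number of steps, then repeatedly multiply by uniformly random transpositions. The exact relationship between the temperature and the Poisson mean depends on how the Laplacian is normalized; the sketch below (with mean $t\binom{n}{2}$, so that each of the $\binom{n}{2}$ transposition edges contributes rate $t$) is one illustrative convention, and the function name is ours.

```python
import math
import random

def sample_heat_kernel_noise(n, t):
    """Sketch of a heat-kernel noise draw: take T ~ Poisson(t * C(n,2)) steps
    of the random-transposition walk starting at the identity. The Poisson
    mean is one assumed normalization, not necessarily the paper's."""
    lam = t * n * (n - 1) / 2
    # sample T ~ Poisson(lam) by CDF inversion (no external dependencies)
    T, u, p = 0, random.random(), math.exp(-lam)
    cdf = p
    while u > cdf:
        T += 1
        p *= lam / T
        cdf += p
    sigma = list(range(n))
    for _ in range(T):
        i, j = random.sample(range(n), 2)   # a uniformly random transposition
        sigma[i], sigma[j] = sigma[j], sigma[i]
    return sigma
```

At temperature $t = 0$ the walk takes no steps and the draw is the identity; as $t$ grows the output approaches a uniform random permutation, in line with the mixing-time discussion above.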
(C.) Mallows-type model under Cayley distance (Cayley-Mallows / Ewens model). While the heat kernel noise model arises naturally from an analyst’s perspective, a somewhat different model, called the Mallows model, has been more popular in the statistics and machine learning literature. The Mallows model is defined using the “Kendall $\tau$-distance” between permutations (defined in Section 1.3) rather than the Cayley distance; the Mallows model with parameter $q$ assigns probability weight $q^{d_{\mathrm{KT}}(\pi, e)}/Z$ to the permutation $\pi$, where $Z$ is a normalizing constant. As proposed by Fligner and Verducci [FV86], it is natural to consider generalizations of the Mallows model in which other distance measures take the place of the Kendall $\tau$-distance. The model which we consider is one in which the Cayley distance is used as the distance measure; so given a parameter $\theta$, the noise distribution $\mathcal{M}(\theta)$ which we consider assigns weight $\theta^{-d_{\mathrm{Cay}}(\pi, e)}/Z_\theta$ to each permutation $\pi$, where $Z_\theta$ is a normalizing constant. In fact, this noise model was already proposed in 1972 by W. Ewens in the context of population genetics [Ewe72] and has been intensively studied in that field (we note that [Ewe72] has been cited more than 2000 times according to Google Scholar). To align our terminology with the strand of research in machine learning and theoretical computer science which deals with the Mallows model, in the rest of this paper we refer to $\mathcal{M}(\theta)$ as the Cayley-Mallows model. For the same reason, we will also refer to the usual Mallows model (with the Kendall $\tau$-distance) as the Kendall-Mallows model. We observe that for the Cayley-Mallows model, in contrast with the heat kernel noise model, now smaller values of $\theta$ correspond to higher levels of noise, and when $\theta = 1$ the distribution is simply the uniform distribution over $S_n$ and there is no useful information available to the learner.
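The Ewens form of this distribution can be sampled exactly by a Chinese-restaurant-style sequential construction. The sketch below assumes the parameterization $P(\pi) \propto \theta^{c(\pi)}$, where $c(\pi)$ is the number of cycles of $\pi$; since $c(\pi) = n - d_{\mathrm{Cay}}(\pi, e)$, this is the same as an exponential reweighting by Cayley distance from the identity. The function name and parameterization details are our own illustrative choices.

```python
import random

def sample_ewens(n, theta):
    """Exact sampler for P(pi) proportional to theta^{#cycles(pi)} via the
    Chinese-restaurant construction: element i starts a new cycle with
    probability theta/(theta+i), else is inserted after a uniform earlier
    element. (Sketch under an assumed parameterization.)"""
    succ = {}          # succ[i] = next element after i in its cycle
    placed = []
    for i in range(n):
        if random.random() < theta / (theta + i):
            succ[i] = i                      # new singleton cycle
        else:
            j = random.choice(placed)        # insert i right after j
            succ[i] = succ[j]
            succ[j] = i
        placed.append(i)
    return [succ[i] for i in range(n)]       # one-line notation
```

As $\theta \to \infty$ the draw concentrates on the identity (every element forms its own cycle), i.e. low noise; at $\theta = 1$ every step is uniform and the output is a uniform random permutation, matching the remark above.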
1.2 Our results
For each of the noise models defined above, we give algorithms which, under a mild technical assumption (that no mixing weight is too small), provably recover the unknown central rankings and associated mixing weights to high accuracy. A notable feature of our results is that the sample complexity and running time dependence is only quasipolynomial in the number of elements $n$ and the number of sub-populations $k$; as we detail in Section 1.3 below, this is in contrast with recent results for similar problems in which the dependence on $k$ is exponential.
Below we give detailed statements of our results. The following notation and terminology will be used in these statements: for a distribution $\mathcal{P}$ over $S_n$ (or any function from $S_n$ to $\mathbb{R}$) we write $\mathrm{supp}(\mathcal{P})$ to denote the set of permutations $\pi$ that have $\mathcal{P}(\pi) \neq 0$. For a given noise model, we write “$\mathcal{D}(\theta) * \Pi$” to denote the distribution over noisy samples that is provided to the learning algorithm as described earlier. Given two functions $f, g : S_n \to \mathbb{R}$, we write “$\|f - g\|_1$” to denote $\sum_{\pi \in S_n} |f(\pi) - g(\pi)|$, the $\ell_1$ distance between $f$ and $g$. If $f$ and $g$ are both distributions then we write $d_{\mathrm{TV}}(f, g)$ to denote the total variation distance between them, which is $\frac{1}{2}\|f - g\|_1$. Finally, if $\mathcal{P}$ is a distribution over $S_n$ which puts weight at least $\epsilon$ on every permutation in its support, we say that $\mathcal{P}$ is $\epsilon$-heavy.
Learning from noisy rankings: Positive and negative results. Our first algorithmic result is for the symmetric noise model (A) defined earlier. Our first theorem, stated below, gives an efficient algorithm as long as the noise vector is “not too extreme” (i.e. not too biased towards putting almost all of its weight on large values of $m$ very close to $n$):
[Algorithm for symmetric noise] There is an algorithm with the following guarantee: Let be an unknown -heavy distribution over with . Let be such that
Given , the value of , a confidence parameter , and access to random samples from , the algorithm runs in time and with probability outputs a distribution such that
Our second algorithmic result, which is similar in spirit to the previous theorem, is for the heat kernel noise model:
[Algorithm for heat kernel noise] There is an algorithm with the following guarantee: Let be an unknown -heavy distribution over with . Let be any value that is . Given , the value of , a confidence parameter , and access to random samples from , the algorithm runs in time and with probability outputs a distribution such that
Recalling that the uniform random walk on the Cayley graph of $S_n$ mixes in $\Theta(n \log n)$ steps, we see that this algorithm is able to handle quite high levels of noise and still run quite efficiently (in quasi-polynomial time).
so this quantity measures the minimum distance between $\theta$ and any integer in the relevant range. Our next theorem gives an algorithm which can be quite efficient for the Cayley-Mallows noise model if the noise parameter $\theta$ is such that this quantity is not too small:
[Algorithm for the Cayley-Mallows model] There is an algorithm with the following guarantee: Let be an unknown -heavy distribution over with . Given , the value of , a confidence parameter , and access to random samples from , the algorithm runs in time and with probability outputs a distribution such that
As alluded to earlier, as $\theta$ approaches $1$ the difficulty of learning in the noise model $\mathcal{M}(\theta)$ increases (and indeed learning becomes impossible at $\theta = 1$); since for $\theta$ close to $1$ the distance from $\theta$ to the nearest integer is small, this is accounted for by the corresponding factor in our running time bound above. However, for larger values of $\theta$ the dependence may strike the reader as an unnatural artifact of our analysis: is it really hard to learn when $\theta$ is very close to one integer, easy when $\theta$ is well separated from the integers, and hard again when $\theta$ is very close to the next integer? Perhaps surprisingly, the answer is yes: it turns out that this distance-to-integer parameter captures a fundamental barrier to learning in the Cayley-Mallows model. We establish this by proving the following lower bound for the Cayley-Mallows model, which shows that a dependence of this sort is in fact inherent in the problem:
Given , there are infinitely many values of and such that the following holds: Let be such that , and let be any algorithm which, when given access to random samples from where is a distribution over with , with probability at least 0.51 outputs a distribution over that has . Then must use samples.
1.3 Relation to prior work
Starting with the work of Mallows [Mal57], there is a rich line of work in machine learning and statistics on probabilistic models of ranking data, see e.g. [Mar14, LL02, BOB07, MM09, MC10, LB11]. In order to describe the prior works which are most relevant to our paper, it will be useful for us to define the Kendall-Mallows model (referred to in the literature just as the Mallows model) in slightly more detail than we gave earlier. Introduced by Mallows [Mal57], the Kendall-Mallows model is quite similar to the Cayley-Mallows model that we consider: it is specified by a parametric family of noise distributions and a central permutation $\pi^*$, and a draw from the model is generated as follows: sample $\boldsymbol{\sigma}$ from the noise distribution and output $\boldsymbol{\sigma}\pi^*$. The noise distribution with parameter $q$ assigns probability weight $q^{d_{\mathrm{KT}}(\pi, e)}/Z$ to the permutation $\pi$, where $Z$ is the normalizing constant and $d_{\mathrm{KT}}$ is the Kendall $\tau$-distance (defined next). The Kendall $\tau$-distance is a distance metric on $S_n$ defined as the number of pairs of items whose relative order differs between the two permutations.
In other words, $d_{\mathrm{KT}}(\pi, \sigma)$ is the number of inversions between $\pi$ and $\sigma$. Like the Cayley distance, the Kendall $\tau$-distance is also a right-invariant metric. Another equivalent way to define $d_{\mathrm{KT}}$ is to consider the undirected graph on $S_n$ where vertices $\pi$ and $\sigma$ share an edge if and only if $\pi = \tau\sigma$ where $\tau$ is an adjacent transposition, in other words $\tau = (i, i+1)$ for some $i \in [n-1]$. Then $d_{\mathrm{KT}}$ is defined as the shortest path metric on this graph. From this perspective, the difference between the Kendall $\tau$-distance and the Cayley distance is that the former only allows adjacent transpositions while the latter allows all transpositions.
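The inversion-counting characterization of the Kendall $\tau$-distance translates directly into code. A minimal quadratic-time sketch (function name ours; an $O(n \log n)$ merge-sort variant exists but is omitted for brevity):

```python
def kendall_tau(pi, sigma):
    """Kendall tau distance: number of item pairs whose relative order
    differs between the two rankings pi and sigma (one-line notation)."""
    n = len(pi)
    pos = [0] * n
    for idx, item in enumerate(sigma):
        pos[item] = idx                      # rank of each item under sigma
    seq = [pos[item] for item in pi]         # pi's order, relabeled by sigma
    # the distance is the number of inversions in seq
    return sum(1 for i in range(n) for j in range(i + 1, n) if seq[i] > seq[j])
```

For instance, two identical rankings are at distance 0, one adjacent swap gives distance 1, and a full reversal of 3 items gives the maximum distance $\binom{3}{2} = 3$.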
Learning mixture models: As mentioned earlier, probabilistic models of ranking data have been studied extensively in probability, statistics and machine learning. Models that have been considered in this context include the Kendall-Mallows model [Mal57, LB11, MPPB07, GP18], the Cayley-Mallows model (and generalizations of it) [FV86, MM03, Muk16, DH92, Dia88a, Ewe72] and the heat kernel random walk model [KL02, KB10, JV18], among others. In contrast, within theoretical computer science interest in probabilistic models of ranking data is somewhat more recent, and the best-studied model in this community is the Kendall-Mallows model. Braverman and Mossel [BM08] initiated this study and (among other results) gave an efficient algorithm to recover a single Kendall-Mallows model from random samples. The question of learning mixtures of Kendall-Mallows models was raised soon thereafter, and Awasthi et al. [ABSV14] gave an efficient algorithm for the case of a mixture of two Kendall-Mallows models. We note two key distinctions between our work and that of [ABSV14]: (i) our results apply to the Cayley-Mallows model rather than the Kendall-Mallows model, and (ii) the work of [ABSV14] allows the two components in the mixture to have two different noise parameters, whereas our mixture models allow for only one noise parameter across all the components.
Very recently, Liu and Moitra [LM18] extended the result of [ABSV14] to any constant number of components $k$. It is interesting to contrast our results with those of [LM18]. Besides the obvious difference in the models treated (namely Kendall-Mallows in [LM18] versus Cayley-Mallows in this paper), another significant difference is that our running time scales only quasipolynomially in $k$ versus exponentially in $k$ for [LM18]. (In fact, [LM18] shows that an exponential dependence on $k$ is necessary for the problem they consider.) Another difference is that their algorithm allows each mixture component to have a different noise parameter whereas our result requires the same noise parameter across the mixture components. We observe one curious feature of the algorithm of [LM18]: when all the noise parameters are well-separated (meaning that the parameters of distinct components differ by some margin), the running time of [LM18] can be improved. This suggests that the case when all the noise parameters are the same might be the hardest for the Liu-Moitra [LM18] algorithm.
Finally, we note that while the analysis in this paper does not immediately extend to the Kendall-Mallows model (see Section 1.5 for more details), we point out that there is a sense in which the Kendall-Mallows and Cayley-Mallows models are fundamentally incomparable. This is because, while the results of [LM18] show that mixtures of Kendall-Mallows models are identifiable whenever each noise parameter is nontrivial, our lower bound (Theorem 1.2) shows that mixtures of Cayley-Mallows models are not identifiable at various larger values of the noise parameter, even when all of the components share a single common noise parameter which is provided to the algorithm.
1.4 Our techniques
A key notion for our algorithmic approach is that of the marginal of a distribution over $S_n$:
Fix $\mathcal{P}$ to be some distribution over $S_n$. Let $\ell \in [n]$, let $a = (a_1, \dots, a_\ell)$ be a vector of distinct elements of $[n]$ and likewise $b = (b_1, \dots, b_\ell)$. The $(a, b)$-marginal of $\mathcal{P}$ is the probability
that for all $j \in [\ell]$, the $a_j$-th element of a random $\boldsymbol{\pi}$ drawn from $\mathcal{P}$ is $b_j$. When $a$ and $b$ are of length $\ell$ we refer to such a probability as an $\ell$-way marginal of $\mathcal{P}$.
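To make the definition concrete, here is an empirical estimator of an $(a, b)$-marginal from direct samples of a distribution (a hypothetical helper; the genuinely difficult task in this paper, treated in Section 3, is estimating marginals of the unmixed distribution from noisy samples):

```python
def estimate_marginal(samples, a, b):
    """Empirical (a, b)-marginal: the fraction of sampled permutations pi
    (one-line notation) with pi[a[j]] == b[j] for every index j."""
    hits = sum(all(pi[aj] == bj for aj, bj in zip(a, b)) for pi in samples)
    return hits / len(samples)
```

For example, with samples `[[0,1,2], [1,0,2], [0,1,2], [2,1,0]]`, the 1-way marginal for `a = (1,)`, `b = (1,)` is the fraction of samples placing item 1 in position 1.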
The first key ingredient of our approach for learning from noisy rankings is a reduction from the problem of learning $\Pi$ (the unknown distribution supported on rankings $\pi_1, \dots, \pi_k$) given access to noisy samples, to the problem of estimating $\ell$-way marginals (for a not-too-large value of $\ell$). More precisely, in Section 2 we give an algorithm which, given the ability to efficiently estimate $\ell$-way marginals, efficiently computes a high-accuracy approximation for an unknown heavy distribution with bounded support size (see Section 2). This algorithm builds on ideas in the population recovery literature, suitably extended to the domain $S_n$ rather than the Boolean domain.
With the above-described reduction in hand, in order to obtain a positive result for a specific noise model the remaining task is to develop an algorithm which, given access to noisy samples, can reliably estimate the required marginals. In Section 3 we show that if the noise distribution $\mathcal{D}$ (a distribution over $S_n$) is efficiently samplable, then given noisy samples, the time required to estimate the required marginals essentially depends on the minimum, over a certain set of matrices arising from the Fourier transform (over the symmetric group $S_n$) of the noise distribution, of the minimum singular value of the matrix. (See Section 3 for a detailed statement.) At this point, we have reduced the algorithmic problem of obtaining a learning algorithm for a particular noise model to the analytic task of lower bounding the relevant singular values. We carry out the required analyses on a noise-model-by-noise-model basis in Sections 4, 5, and 6. These analyses employ ideas and results from the representation theory of the symmetric group and its connections to enumerative combinatorics; we give a brief overview of the necessary background in Appendix A.
To establish our lower bound for the Cayley-Mallows model, we exhibit two distributions over the symmetric group such that the corresponding distributions of noisy rankings have very small statistical distance from each other. Not surprisingly, the inspiration for this construction also comes from the representation theory of the symmetric group; more precisely, the two above-mentioned distributions are obtained from the character (over the symmetric group) corresponding to a particular carefully chosen partition of $n$. A crucial ingredient in the proof is the fact that characters of the symmetric group are rational-valued functions, and hence any character can be split into a positive part and a negative part; details are given in Section 8.
1.5 Discussion and future work
In this paper we have considered three particular noise models — symmetric noise, heat kernel noise, and Cayley-Mallows noise — and given efficient algorithms for these noise models. Looking beyond these specific noise models, though, our approach provides a general framework for obtaining algorithms for learning mixtures of noisy rankings. Indeed, for essentially any efficiently samplable noise distribution $\mathcal{D}$, given access to noisy samples our approach reduces the algorithmic problem of learning $\Pi$ to the analytic problem of lower bounding the minimum singular values of matrices arising from the Fourier transform of $\mathcal{D}$ (see Section 3). We believe that this technique may be useful in a broader range of contexts, e.g. to obtain results analogous to ours for the original Kendall-Mallows model or for other noise models.
As is made clear in Sections 4, 5, and 6, the representation-theoretic analysis that we require for our noise models is facilitated by the fact that each of the noise distributions considered in those sections is a class function (in other words, the value of the distribution on a given input permutation depends only on the cycle structure of the permutation). Extending the kinds of analyses that we perform to other noise models which are not class functions is a technical challenge that we leave for future work.
2 Algorithmic recovery of sparse functions
The main result of this section is the reduction alluded to in Section 1.4. In more detail, we give an algorithm which, given the ability to efficiently estimate -way marginals, efficiently computes a high-accuracy approximation for an unknown -heavy distribution with support size at most :
Let be an unknown -heavy distribution over with . Suppose there is an algorithm with the following property: given as input a value and two vectors and each composed of distinct elements of algorithm runs in time and outputs an additively -accurate estimate of the -marginal of (recall Section 1.4). Then there is an algorithm with the following property: given the value of , algorithm runs in time and returns a function such that .
Looking ahead, given Section 2, in order to obtain a positive result for a specific noise model the remaining task is to develop an algorithm which, given access to noisy samples from , can reliably estimate the required marginals. The algorithm is given in Section 3 and the detailed analyses establishing its efficiency for each of the noise models (by bounding minimum singular values of certain matrices arising from each specific noise distribution) is given in Sections 4, 5, and 6.
2.1 A useful structural result
The following structural result on functions from to with small support will be useful for us:
[Small-support functions are correlated with juntas] Fix and let be such that and . There is a subset and a list of values such that and
Section 2.1 is reminiscent of analogous structural results for functions over the Boolean domain which are implicit in the work of [WY12] (specifically, Theorem 1.5 of that work), and indeed Section 2.1 can be proved by following the techniques of [WY12]. Michael Saks [Sak18] has communicated to us an alternative, and arguably simpler, argument for the relevant structural result over the Boolean domain; here we follow that alternative argument (extending it in the essentially obvious way to the domain $S_n$).
Let the support of be . Note that since , there must exist some set of coordinates such that any two elements of differ in at least one of those coordinates. Without loss of generality, we assume that this set is the first coordinates
We prove Section 2.1 by analyzing an iterative process that iterates over the coordinates . At the beginning of the process, we initialize a set of “live coordinates” to be , initialize a set of constraints to be initially empty, and initialize a set of “live support elements” to be the entire support of . We will see that the iterative process maintains the following invariants:
The coordinates in are sufficient to distinguish between the elements in , i.e. any two distinct strings in have distinct projections onto the coordinates in ;
The only elements of that satisfy all the constraints in are the elements of .
Before presenting the iterative process we need to define some pertinent quantities. For each coordinate and each index , we define
the weight under of the live support elements that have , and we define
the number of live support elements that have (note that has nothing to do with ). It will also be useful to have notation for fractional versions of each of these quantities, so we define
Note that for any we have that , or equivalently
For each coordinate , we write to denote the element which is such that for all (we break ties arbitrarily). Finally, we let .
Now we are ready to present the iterative process:
If every live coordinate satisfies the halting condition (note that this condition means that almost all of the weight, under the live support elements, is on elements that agree with the majority value on that coordinate; note further that if the set of live coordinates is empty then the condition trivially holds), then halt the process. Otherwise, let $j$ be any live coordinate for which the condition fails.
For this coordinate , choose which maximizes the ratio (or equivalently, maximizes ) subject to and .
Add the constraint to , remove from , and remove all such that from . Go to Step 1.
When the iterative process ends, suppose that the set is . Then we claim that Equation 1 holds for .
To argue this, we first observe that both invariants (I1) and (I2) are clearly maintained by each round of the iterative process. We next observe that each time a pair is processed in Step 3, it holds that , and hence each round shrinks by a factor of at least . Thus, after steps, the set must be of size at most and hence the process must halt. (Note that the claimed bound follows from the fact that the process runs for at most stages.)
Next, note that when the process halts, by a union bound over the at most coordinates in it holds that
On the other hand, by the first invariant (I1), the cardinality of the set for all is precisely . This immediately implies that almost all of the weight of , across elements of , is on a single element; more precisely, that
from which it follows that
So to establish Equation 1, it remains only to establish a lower bound on when the process terminates. To do this, let us suppose that the process runs for steps where in the step the coordinate chosen is . Now, at any stage , we have
(because the denominator is at most and since the process does not terminate, the numerator is at least ). As a result, we get that if the constraint chosen at time is , then
By Equation 3, when the process halts we have
But since at least one element remains, we have that , and since , we conclude (recalling that ) that
Combining with (2), this yields the claim. ∎
2.2 Proof of Section 2
The idea of the proof is quite similar to the algorithmic component of several recent works on population recovery [MS13, WY12, LZ15, DST16]. Given any function and any integer , we define the function as follows:
At a high level, the algorithm of Section 2 works in stages, by successively reconstructing . In each stage it uses the procedure described in the following claim, which says that high-accuracy approximations of the -marginals together with the support of (or a not-too-large superset of it) suffices to reconstruct : Let be an unknown distribution over supported on a given set of size . There is an algorithm which has the following guarantee: The algorithm is given as input , and parameters (for every set of size at most and every ) which satisfy
runs in time and outputs a function such that
We consider a linear program which has a variable for each element of the given support set (representing the probability that the distribution puts on that element) and is defined by the following constraints:
For each of size at most and each , include the constraint
Algorithm sets up and solves the above linear program (this can clearly be done in time ). We observe that the linear program is feasible since by definition is a feasible solution. To prove the claim it suffices to show that every feasible solution is -close to ; so let denote any other feasible solution to the linear program, and let denote Define so By Section 2.1, we have that there is a subset of size at most and a such that
On the other hand, since both and are feasible solutions to the linear program, by the triangle inequality it must be the case that
Essentially the only remaining ingredient required to prove Section 2 is a procedure to find (a not-too-large superset of) the support of . This is given by the following claim, which inductively uses the algorithm to successively construct suitable (approximations of) the support sets for
Under the assumptions of Section 2, there is an algorithm with the following property: given as input a value , algorithm runs in time and for each outputs a set of size at most which contains the support of .
The algorithm works inductively: at the start of each stage (in which it will construct the next support set) it is assumed to have a set which contains the support of the current prefix function (note that this containment holds trivially at the start of the first stage).
Let us describe the execution of the -th stage of . For , we define the set as follows:
Observe that in time , we can compute up to error (denote this estimate by ) for all . Since is -heavy, we have that
Consequently, we can compute the set in time . The final observation is that the set (of cardinality at most ) obtained by appending each final -th character from to each element of must contain the support of . Set ; by the assumption of Section 2, in time it is possible to obtain additively -accurate estimates of each of the -way marginals of . In the -th stage, algorithm runs using and these estimates of the marginals; by Section 2.2, this takes time and yields a function such that Since by assumption is -heavy, it follows that any element in the support of such that must not be in the support of ; so the algorithm removes all such elements from to obtain the set This resulting is precisely the support of , and is clearly of size at most ∎
3 Computing $\ell$-way marginals from noisy samples
Recall that the noisy ranking learning problems we consider are of the following sort: there is a known noise distribution $\mathcal{D}$ supported on $S_n$, and an unknown $k$-sparse $\epsilon$-heavy distribution $\Pi$. Each sample provided to the learning algorithm is generated by the following probabilistic process: independent draws of $\boldsymbol{\sigma} \sim \mathcal{D}$ and $\boldsymbol{\pi} \sim \Pi$ are obtained, and the sample given to the learner is $\boldsymbol{\sigma}\boldsymbol{\pi}$. By the reduction established in Section 2, in order to give an algorithm that learns the distribution $\Pi$ in the presence of a particular kind of noise $\mathcal{D}$, it suffices to give an algorithm that can efficiently estimate $\ell$-way marginals given such noisy samples.
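The sampling process just described can be sketched end to end. The helper below is illustrative (our own naming); it takes any zero-argument noise sampler, and the composition convention (noise applied on the left) is one assumed choice consistent with a left group convolution.

```python
import random

def sample_noisy_ranking(centers, weights, sample_noise):
    """One draw from the noisy mixture: pick a center pi with probability
    proportional to its weight, draw sigma from the noise model, and return
    the composition sigma o pi (sketch; the composition order is an assumed
    convention)."""
    pi = random.choices(centers, weights=weights)[0]
    sigma = sample_noise()
    return [sigma[p] for p in pi]    # (sigma o pi)(j) = sigma(pi(j))
```

With an identity noise sampler this simply returns one of the central rankings, so the noiseless case is recovered as a sanity check.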
The main result of this section, Section 3, gives such an algorithm. Before stating the theorem we need some terminology and notation and we need to recall some necessary background from representation theory of the symmetric group (see Appendix A for a detailed overview of all of the required background).
First, let $\mathcal{D}$ be a distribution over $S_n$ (which should be thought of as a noise distribution as described earlier). We say that $\mathcal{D}$ is efficiently samplable if there is a $\mathrm{poly}(n)$-time randomized algorithm which takes no input and, each time it is invoked, returns an independent draw of $\boldsymbol{\sigma} \sim \mathcal{D}$.
Next, we recall that a partition $\lambda$ of the natural number $n$ (written “$\lambda \vdash n$”) is a vector $\lambda = (\lambda_1, \dots, \lambda_k)$ of natural numbers where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k$ and $\lambda_1 + \cdots + \lambda_k = n$ (see Section A.2 for more detail). For two partitions $\lambda$ and $\mu$ of $n$, we say that $\lambda$ dominates $\mu$, written $\lambda \trianglerighteq \mu$, if $\lambda_1 + \cdots + \lambda_i \geq \mu_1 + \cdots + \mu_i$ for all $i$ (see Section A.2). Given any $\lambda \vdash n$, we consider the set of all partitions $\mu$ such that $\mu \trianglerighteq \lambda$.
We recall that a representation of the symmetric group is a group homomorphism from $S_n$ to a group of invertible matrices (see Appendix A). We further recall that for each partition $\lambda \vdash n$ there is a corresponding irreducible representation, denoted $\rho_\lambda$ (see Section A.2). For a matrix $M$ we write $\sigma_{\min}(M)$ to denote the smallest singular value of $M$. Given a partition $\lambda$ we define the associated value to be
the smallest singular value across all Fourier coefficients of the noise distribution at irreducible representations corresponding to partitions that dominate $\lambda$. (We recall that the Fourier coefficients of functions over the symmetric group, and indeed over any finite group, are matrices; see Section A.2.)
Finally, for we define the partition to be
Now we can state the main result of this section: Let be an efficiently samplable distribution over . Let be an unknown distribution over . There is an algorithm with the following properties: receives as input a parameter , a confidence parameter , a pair of -tuples , each composed of distinct elements, and has access to random samples from . Algorithm runs in time and outputs a value which with probability at least is a -accurate estimate of the -marginal of .
We will use the following claim to prove Section 3:
Let be any unitary representation of , let be any efficiently samplable distribution over , and let