Optimal Learning of Mallows Block Model

06/03/2019 · by Robert Busa-Fekete, et al. · MIT, National Technical University of Athens, Verizon Media

The Mallows model, introduced in the seminal paper of Mallows (1957), is one of the most fundamental ranking distributions over the symmetric group S_m. To analyze more complex ranking data, several studies considered the Generalized Mallows model defined by Fligner and Verducci (1986). Despite the significant research interest in ranking distributions, the exact sample complexity of estimating the parameters of a Mallows or a Generalized Mallows model is not well understood. The main result of this paper is a tight sample complexity bound for learning Mallows and Generalized Mallows models. We approach the learning problem by analyzing a more general model which interpolates between the single parameter Mallows model and the m-parameter Generalized Mallows model. We call our model the Mallows Block model, referring to the block models that are popular in theoretical statistics. Our sample complexity analysis gives tight bounds for learning the Mallows Block model for any number of blocks. We provide essentially matching lower bounds for our sample complexity results. As a corollary of our analysis, it turns out that, if the central ranking is known, a single sample from the Mallows Block model is sufficient to estimate the spread parameters with error that goes to zero as the size of the permutations goes to infinity. In addition, we calculate the exact rate of the parameter estimation error.


1 Introduction

The Mallows model has been one of the most fundamental ranking distributions since it was introduced in the seminal paper of [Mallows(1957)]. The model has two parameters, the central ranking $\pi_0$ and the spread parameter $\beta$. Based on these, the probability of observing a ranking $\pi$ is proportional to $\exp(-\beta \, d(\pi, \pi_0))$, where $d$ is a ranking distance, such as the number of discordant pairs, a.k.a. Kendall's tau distance.

To capture more complicated distributions over rankings, several studies considered the generalized Mallows model FlignerV1986, DoPeRe04, Mar95, which assigns a different spread parameter $\beta_i$ to each alternative $i$. Now the probability of observing a ranking $\pi$ decreases exponentially in a weighted sum over the discordant pairs, where the weights are determined by the spread parameters of the discordant items. Statistical estimation of the distribution and the parameters of the Mallows model has been of interest in a wide range of scientific areas, including theoretical statistics Mukherjee16, machine learning LuBo11, AwasthiBS14, ChenBBK09, MeilaB10, social choice CaragiannisPS16, theoretical computer science LiuM18, and many more, as we discuss in Section 1.2.

Despite this extensive literature, to the best of our knowledge, no optimal results are known on the sample complexity of learning the parameters of a Mallows or a generalized Mallows model. In this work, we fill this gap by proving: (1) an upper bound on the number of samples needed by some simple estimators to accurately estimate the parameters of the Mallows model, (2) an essentially matching lower bound on the sample complexity of any accurate estimator. Using our tight sample analysis, we are able to quantify in the finite sample regime some results that were only known in the asymptotic regime (e.g., [Mukherjee(2016)]).

Additionally, we introduce the Mallows Block model, which interpolates between the simple Mallows and the generalized Mallows models. The definition of the Mallows Block model is similar in spirit to the Stochastic Block model KloppTV17, which is fundamental in theoretical statistics and admits similar statistical properties. Also, [Berthet et al.(2016)Berthet, Rigollet, and Srivastava] recently introduced the Ising Block model, which is conceptually similar to the Stochastic Block Model. As we prove, the Mallows Block model combines two nice properties: (a) like the generalized Mallows model, it describes a wider range of distributions over rankings than the Mallows model; and (b) it allows accurate estimation of the spread parameters even from one sample, as was proved in Mukherjee16 for the Mallows model. We analyze the sample complexity of the Mallows Block model by proving essentially tight upper and lower bounds when the block structure is known.

1.1 Results and Techniques

In this work, we fully determine the sample complexity of learning Mallows and Generalized Mallows distributions, in a unified way, via the definition of the Mallows Block model. In a nutshell, we show how to estimate the parameters of these distributions in a (sample and time) efficient way, and how this implies efficient density estimation in KL-divergence and in total variation distance. Our approach is general and exploits properties of the exponential family. As we illustrate in Section 3, the use of these properties might be useful in proving the exact learning rates for other complicated exponential families, such as the Ising model.

Learning in KL-divergence. Our learning algorithm for the spread parameters essentially finds the maximum likelihood solution, but in a provably computationally efficient way. The sample complexity analysis of the consistency of our estimator is based on some known and some novel results about exponential families. As we see in Theorem 2.4, the KL-divergence of two distributions in an exponential family is equal to the squared difference of their parameters multiplied by the variance of a corresponding distribution inside the exponential family. If we put this together with Theorem 3, where we obtain a new strong concentration inequality for distributions in an exponential family, we get a systematic way of proving upper bounds on the number of samples required to learn an exponential family in KL-divergence. Thus, we depart from the (only known) upper bounds on density estimation in total variation distance. We apply our technique to the Mallows Block model and get tight upper bounds on the number of samples in terms of the (known) number of blocks in the Mallows Block model. We sketch the statement of this result below; for a formal statement see Theorem 5.2.

Informal Theorem 1

Given a sufficient number of samples from a Mallows block distribution, we can learn a distribution that is close to it in KL-divergence, and hence also close in total variation distance.

Parameter Estimation. Extending a result of [Caragiannis et al.(2016)Caragiannis, Procaccia, and Shah], we show that a logarithmic number of samples is both sufficient and necessary to estimate the central ranking of a generalized Mallows distribution (Theorem 5.1). Then, using our results on exponential families, we show that estimating the spread parameters of a Mallows distribution boils down to obtaining a lower bound on the KL-divergence between two Mallows distributions with the same central ranking and different spread parameters. With such a lower bound on the KL-divergence, we can apply the concentration inequality of Theorem 3 and show that, once we learn the central ranking, with additional i.i.d. samples we can estimate the parameter vector of the underlying Mallows Block model within small error; the error bound depends on the number of blocks of the Mallows Block model and on the minimum size of any block. We put everything together in the following informal theorem and refer to Theorem 5.1 for a formal statement.

Informal Theorem 2

Given a sufficient number of samples from a Mallows block distribution with central ranking and spread parameters, we can recover the central ranking exactly and estimate the spread parameters within small error, with high probability.

A key observation in the proof of Theorem 5.1 is that the sufficient statistics for a generalized Mallows model with known central ranking are provided by an $(m-1)$-variate distribution whose $i$-th coordinate is an independent truncated geometric distribution. Truncated geometric distributions interpolate between Bernoulli and geometric distributions. The sufficient statistics of the Mallows Block model correspond to sums of truncated geometric distributions, which interpolate between Binomial and Negative Binomial distributions. We hence believe that the study of sums of truncated geometric distributions may be of independent interest. We should also highlight that, in our approach, only the lower bound on the variance depends on Kendall's tau distance. Once we have such a bound for other exponential families, we can immediately apply our technique, e.g., to Mallows models with Spearman's Footrule and Spearman's Rank Correlation, as in Mukherjee16.

Learning from one sample. Arguably, the most interesting corollary of our tight analysis is that a single sample from a Mallows block model with known central ranking is enough to estimate the spread parameters within an error that vanishes as the minimum block size grows, where again the minimum size of any block in the Mallows Block model determines the rate. This result provides the exact rate of an asymptotic result by [Mukherjee(2016)]. The formal version of the following informal theorem can be found in Corollary 5.1.

Informal Theorem 3

Given a single sample from a Mallows block distribution with known central ranking and spread parameters, we can estimate the spread parameters within an error that goes to zero as the minimum block size goes to infinity.

Lower Bounds. On the lower bound side, we use Fano's inequality to show a lower bound on the number of samples that are necessary even for learning a simple Mallows distribution in total variation distance (Lemma 4.2). Then, we show a corresponding lower bound on the number of samples necessary for learning a Mallows block distribution in total variation distance. For a formal statement of the following informal theorem we refer to Lemma 5.2.

Informal Theorem 4

Any distribution estimator that is based on too few samples from a Mallows block distribution fails to learn it within the target total variation distance.

Interestingly, our lower bound uses a general way to compute the total variation distance of two distributions that belong to the same exponential family (Theorem 3). This theorem states that the total variation distance of two distributions in the same exponential family is equal to the distance between their parameters times the absolute deviation of a corresponding distribution in the family. This should be compared with Theorem 2.4, on the KL-divergence between two distributions in the same exponential family. Using Theorem 3, our lower bound boils down to showing that, for some range of parameters, the absolute deviation is within a constant factor of the standard deviation. With this proven, we get that the total variation distance is within a constant factor of the square root of the KL-divergence, and Fano's inequality can be applied.

Open Problems. An open problem that naturally arises from the definition of the Mallows Block model is the possibility of estimating the spread parameters of the Mallows Block model, even from a single sample, when the block structure is unknown. Such results are known for the fundamental Stochastic Block model in theoretical statistics KloppTV17. Recently, [Berthet et al.(2016)Berthet, Rigollet, and Srivastava] introduced the Ising Block model and proved some similar results. Another interesting question concerns the minimum number of samples required to recover the block structure of the Mallows Block Model. Again, similar results are known for the Stochastic Block Model MosselNS18.

Another research direction is to obtain lower bounds on the variance of the distance to the central ranking for other notions of distance, such as Spearman’s Footrule and Spearman’s Rank Correlation. Then, we can apply our general approach and obtain tight bounds on the sample complexity of learning such models and on the quality of parameter estimation from a single sample, as in Mukherjee16.

1.2 Related work

There has been a significant volume of research work on algorithmic and learning problems related to our work. In the consensus ranking problem, a finite set of rankings is given, and we want to compute the ranking that minimizes the total distance to the given rankings. This problem is known to be NP-hard Bartholdi1989, but it admits a polynomial-time constant-factor approximation algorithm AiChNe05 and a PTAS KeSc07. When the rankings are i.i.d. samples from a Mallows distribution, consensus ranking is equivalent to computing the maximum likelihood ranking, which does not depend on the spread parameter. Intuitively, the problem of finding the central ranking should not be hard if the probability mass is concentrated around the central ranking. [Meila et al.(2012)Meila, Phadnis, Patterson, and Bilmes] came up with a branch and bound technique which relies on this observation. [Braverman and Mossel(2009)] proposed a dynamic programming approach that computes the consensus ranking efficiently under the Mallows model. [Caragiannis et al.(2016)Caragiannis, Procaccia, and Shah] showed that the central ranking can be recovered from a logarithmic number of i.i.d. samples from a Mallows distribution (see also Theorem 5.1).

[Mukherjee(2016)] considered learning the spread parameter of a Mallows model based on a single sample, assuming that the central ranking is known. He studied the asymptotic behavior of his estimator and proved consistency. We strengthen this result by showing that our parameter estimator, based on a single sample, can achieve optimal error for the Mallows Block model (Corollary 5.1).

There has been significant work either on learning a Mallows model based on partial information, e.g., partial rankings or pairwise comparisons AdFl98, LuBo11, Busa-FeketeHS14, or on learning generalizations of the Mallows model, such as learning mixtures of Mallows models LiuM18. Among these works, AwasthiBS14, LiuM18 seem the most relevant to our paper, since they considered learning mixtures of single parameter Mallows models in a learning setup that is similar in spirit to ours: find a model that is close to the underlying one, either in the parameter space or in total variation distance, based on as few samples as possible. However, the sample complexity of learning mixtures is necessarily much higher, a high degree polynomial of the relevant parameters. Hence their results do not compare with our optimal sample complexity analysis, even for the simple Mallows model case.

The parameter estimation of the Generalized Mallows Model has been examined from a practical point of view by [Meilă et al.(2007)Meilă, Phadnis, Patterson, and Bilmes], but no theoretical guarantees for the sample complexity have been provided. Several other ranking models are routinely used in analyzing ranking data Mar95, Shi16, such as the Plackett-Luce model Pla75, Luc59 and the Babington-Smith model HV93, along with spectral analysis based methods NIPS2012_4720, SibonyCJ15 and non-parametric methods LebanonM07. However, to the best of our knowledge, none of these ranking methods has been analyzed from the point of view of distribution learning with guarantees in some information theoretic distance. HajekOX14 considered the problem of learning the parameters of the Plackett-Luce model and came up with high probability bounds for their estimator that are tight, in the sense that no algorithm can achieve lower estimation error with fewer samples.

2 Preliminaries and Notation

Small bold letters refer to real vectors in finite dimension and capital bold letters refer to matrices. We denote by $x_i$ the $i$-th coordinate of a vector $\mathbf{x}$, and by $X_{ij}$ the $(i,j)$-th coordinate of a matrix $\mathbf{X}$. For any $n \in \mathbb{N}$ we define $[n] = \{1, \dots, n\}$.

Metrics between distributions. Let $P$, $Q$ be two probability measures on a discrete probability space $\Omega$; then the total variation distance between $P$ and $Q$ is defined as $d_{TV}(P, Q) = \frac{1}{2} \sum_{x \in \Omega} |P(x) - Q(x)|$, and the KL-divergence between $P$ and $Q$ is defined as $D_{KL}(P \| Q) = \sum_{x \in \Omega} P(x) \log \frac{P(x)}{Q(x)}$.
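To make these definitions concrete, here is a minimal Python sketch (not part of the paper; function names are illustrative) that computes both quantities for discrete distributions given as probability vectors.

```python
# A minimal sketch: total variation distance and KL-divergence
# for two discrete distributions given as probability vectors.
import numpy as np

def tv_distance(p, q):
    """d_TV(P, Q) = (1/2) * sum_x |P(x) - Q(x)|."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), with the convention 0 * log 0 = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.4, 0.2]
    print(tv_distance(p, q))      # 0.1
    print(kl_divergence(p, q))    # a small positive number
    # Pinsker's inequality: d_TV(P, Q)^2 <= KL(P || Q) / 2
    assert tv_distance(p, q) ** 2 <= kl_divergence(p, q) / 2
```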

Exponential Families. In this section we summarize the basic definitions and properties of exponential families of distributions. We follow the formulation and the expressions of Keener11, NielsenG09, to which we also refer for complete proofs of the statements presented in this section. Let $\mu$ be a measure on a set $\mathcal{X}$ and let $\mathbf{T} : \mathcal{X} \to \mathbb{R}^k$, $h : \mathcal{X} \to \mathbb{R}_+$ be measurable functions. We define the logarithmic partition function as $A(\boldsymbol{\theta}) = \log \int h(x) \exp\left(\boldsymbol{\theta}^\top \mathbf{T}(x)\right) \mathrm{d}\mu(x)$. We also define the range of natural parameters as $\Theta = \{\boldsymbol{\theta} \in \mathbb{R}^k : A(\boldsymbol{\theta}) < \infty\}$. The exponential family with sufficient statistics $\mathbf{T}$, carrier measure $\mu$ (weighted by $h$) and natural parameters $\boldsymbol{\theta} \in \Theta$ is the family of distributions $\{p_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \Theta\}$, where the probability distribution $p_{\boldsymbol{\theta}}$ has density

$$p_{\boldsymbol{\theta}}(x) = h(x) \exp\left(\boldsymbol{\theta}^\top \mathbf{T}(x) - A(\boldsymbol{\theta})\right). \qquad (2.1)$$

Truncated Geometric Distribution. We say that a random variable $X$ follows the truncated geometric distribution $TG(\beta, n)$ with parameters $\beta$ and $n$ if it has the probability mass function $\Pr[X = k] \propto e^{-\beta k}$ for $k \in \{0, 1, \dots, n\}$ and $\Pr[X = k] = 0$ otherwise. For $n = 1$ the distribution $TG(\beta, 1)$ is a Bernoulli distribution with success probability $e^{-\beta}/(1 + e^{-\beta})$. For $n = \infty$ and $\beta > 0$ the distribution is a geometric distribution with parameter $1 - e^{-\beta}$. Observe that if we fix $n$ then $\{TG(\beta, n)\}_{\beta}$ is an exponential family with natural parameter $-\beta$ and sufficient statistic $T(k) = k$. Again, the domain of $\beta$ changes to $(0, \infty)$ for $n = \infty$.
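The following short Python sketch (names are illustrative, not from the paper) implements the truncated geometric probability mass function and a sampler, and illustrates the Bernoulli special case for $n = 1$.

```python
# A sketch of the truncated geometric distribution TG(beta, n):
# P[X = k] is proportional to exp(-beta * k) for k in {0, ..., n}.
import numpy as np

def truncated_geometric_pmf(beta, n):
    k = np.arange(n + 1)
    w = np.exp(-beta * k)
    return w / w.sum()

def sample_truncated_geometric(beta, n, size=1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(n + 1, size=size, p=truncated_geometric_pmf(beta, n))

if __name__ == "__main__":
    b = 0.7
    # n = 1 recovers a Bernoulli with success probability e^{-b} / (1 + e^{-b}).
    print(truncated_geometric_pmf(b, 1))
    # Large n with beta > 0 approaches the (untruncated) geometric distribution.
    print(truncated_geometric_pmf(b, 10)[:4])
```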

Basic Properties of Exponential Families. We summarize in the next theorem the fundamental properties of exponential families. For a proof of this theorem we refer to Appendix A. Let $\{p_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \Theta\}$ be an exponential family parametrized by $\boldsymbol{\theta}$ and, for simplicity, write $\mathbb{E}_{\boldsymbol{\theta}}[\cdot]$ and $\mathrm{Cov}_{\boldsymbol{\theta}}[\cdot]$ for the expectation and covariance under $p_{\boldsymbol{\theta}}$; then the following hold.

  1. For all $\boldsymbol{\theta} \in \Theta$, it holds that
     $\nabla A(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{\theta}}[\mathbf{T}(x)]$.   (2.2)
  2. For all $\boldsymbol{\theta} \in \Theta$, it holds that
     $\nabla^2 A(\boldsymbol{\theta}) = \mathrm{Cov}_{\boldsymbol{\theta}}[\mathbf{T}(x)]$.   (2.3)
  3. For all $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \in \Theta$, it holds that
     $D_{KL}(p_{\boldsymbol{\theta}_1} \| p_{\boldsymbol{\theta}_2}) = A(\boldsymbol{\theta}_2) - A(\boldsymbol{\theta}_1) - (\boldsymbol{\theta}_2 - \boldsymbol{\theta}_1)^\top \nabla A(\boldsymbol{\theta}_1)$.   (2.4)
  4. For all $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \in \Theta$ and for some $\boldsymbol{\xi}$ on the segment between $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$, it holds that
     $D_{KL}(p_{\boldsymbol{\theta}_1} \| p_{\boldsymbol{\theta}_2}) = \frac{1}{2} (\boldsymbol{\theta}_1 - \boldsymbol{\theta}_2)^\top \nabla^2 A(\boldsymbol{\xi}) (\boldsymbol{\theta}_1 - \boldsymbol{\theta}_2)$.   (2.5)
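As a sanity check of these identities (in the standard textbook form reconstructed above), the following Python snippet verifies them numerically for the truncated geometric family; the code and its helper names are illustrative, not part of the paper.

```python
# Numerical check of the exponential family identities (2.2)-(2.4) for the
# truncated geometric family with natural parameter theta = -beta,
# sufficient statistic T(k) = k and A(theta) = log sum_{k=0}^{n} exp(theta * k).
import numpy as np

n = 6

def A(theta):
    return np.log(np.sum(np.exp(theta * np.arange(n + 1))))

def pmf(theta):
    k = np.arange(n + 1)
    return np.exp(theta * k - A(theta))

def kl(theta1, theta2):
    p, q = pmf(theta1), pmf(theta2)
    return np.sum(p * np.log(p / q))

theta1, theta2, h = -0.8, -0.5, 1e-4
k = np.arange(n + 1)
mean = np.sum(k * pmf(theta1))
var = np.sum(k ** 2 * pmf(theta1)) - mean ** 2

# (2.2): A'(theta) equals the mean of the sufficient statistic.
assert abs((A(theta1 + h) - A(theta1 - h)) / (2 * h) - mean) < 1e-6
# (2.3): A''(theta) equals the variance of the sufficient statistic.
assert abs((A(theta1 + h) - 2 * A(theta1) + A(theta1 - h)) / h ** 2 - var) < 1e-5
# (2.4): KL(p_theta1 || p_theta2) = A(theta2) - A(theta1) - (theta2 - theta1) * A'(theta1).
assert abs(kl(theta1, theta2) - (A(theta2) - A(theta1) - (theta2 - theta1) * mean)) < 1e-9
print("exponential family identities verified numerically")
```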

2.1 Ranking Distributions

In this section we review the basic definitions of exponential families over permutations. We define the single parameter Mallows model and its generalization.

Single Parameter Mallows Model. The Mallows model or, more specifically, the Mallows $\phi$-distribution is a parametrized, distance-based probability distribution that belongs to the family of exponential distributions, with probability mass function $\Pr[\pi] = \frac{1}{Z(\beta)} e^{-\beta \, d(\pi, \pi_0)}$, where $\pi_0 \in S_m$ and $\beta > 0$ are the parameters of the model: $\pi_0$ is the location parameter, also called the center ranking, and $\beta$ is the spread parameter. Moreover, $d(\cdot, \cdot)$ is a distance metric on permutations, which for our paper will be the Kendall tau distance $d_{KT}(\pi, \sigma)$, that is, the number of discordant item pairs $|\{(i, j) : i < j, \; (\pi(i) - \pi(j))(\sigma(i) - \sigma(j)) < 0\}|$.

The normalization factor in the definition of the model is equal to $Z(\beta) = \sum_{\pi \in S_m} e^{-\beta \, d(\pi, \pi_0)}$. When the distance metric is the Kendall tau distance we have $Z(\beta) = \prod_{j=1}^{m} \frac{1 - e^{-j\beta}}{1 - e^{-\beta}}$. Observe that the family of distributions as stated is not an exponential family because of the location parameter $\pi_0$. If we fix the permutation parameter $\pi_0$, then the family is an exponential family with natural parameter $-\beta$ and sufficient statistic $d(\pi, \pi_0)$.
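The closed form of the normalization constant can be checked by brute force for small $m$; the Python sketch below (illustrative, not from the paper) compares the sum over all of $S_m$ with the product formula.

```python
# Brute-force sanity check (small m only) of the closed form for the Mallows
# normalization constant under Kendall tau distance:
#   Z(beta) = sum_{pi in S_m} exp(-beta * d_KT(pi, pi0))
#           = prod_{j=1}^{m} (1 - exp(-j*beta)) / (1 - exp(-beta)).
from itertools import permutations
from math import exp

def kendall_tau(pi, sigma):
    """Number of discordant item pairs between two rankings of {0, ..., m-1}."""
    m = len(pi)
    return sum(1 for i in range(m) for j in range(i + 1, m)
               if (pi.index(i) - pi.index(j)) * (sigma.index(i) - sigma.index(j)) < 0)

def z_bruteforce(beta, m):
    pi0 = tuple(range(m))
    return sum(exp(-beta * kendall_tau(pi, pi0)) for pi in permutations(range(m)))

def z_closed_form(beta, m):
    out = 1.0
    for j in range(1, m + 1):
        out *= (1 - exp(-j * beta)) / (1 - exp(-beta))
    return out

if __name__ == "__main__":
    beta, m = 0.6, 5
    print(z_bruteforce(beta, m), z_closed_form(beta, m))  # the two values agree
```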

Generalized Mallows Model. One of the most famous generalizations of the Mallows model is the one introduced by [Fligner and Verducci(1986)] under the name Generalized Mallows Model. We define $V_i(\pi, \pi_0)$ to be the number of discordant item pairs involving item $i$ and items ranked below $i$ by $\pi_0$, i.e. $V_i(\pi, \pi_0) = |\{ j : \pi_0 \text{ ranks } i \text{ above } j \text{ and } \pi \text{ ranks } j \text{ above } i \}|$. The generalized Mallows family of distributions with parameters $\pi_0 \in S_m$ and $\boldsymbol{\beta} = (\beta_1, \dots, \beta_{m-1})$ is defined as the probability measure over $S_m$ with probability mass function $\Pr[\pi] = \frac{1}{Z(\boldsymbol{\beta})} \exp\left(-\sum_{i=1}^{m-1} \beta_i V_i(\pi, \pi_0)\right)$. One important property of the generalized Mallows model when the distance metric is the Kendall tau distance is that the random variables $V_i(\pi, \pi_0)$, $i \in [m-1]$, where $\pi$ is drawn from the model, are independent. This follows from the following decomposition lemma of the partition function $Z(\boldsymbol{\beta})$. For the proof of Lemma 2.1 we refer to Appendix A.

When $d$ is the Kendall tau distance, we have that $Z(\boldsymbol{\beta}) = \prod_{i=1}^{m-1} Z_i(\beta_i)$, where $Z_i(\beta_i) = \sum_{k=0}^{m-i} e^{-\beta_i k} = \frac{1 - e^{-(m-i+1)\beta_i}}{1 - e^{-\beta_i}}$.
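Lemma 2.1 also yields a simple exact sampler for the generalized Mallows model: draw the independent truncated geometric scores and reinsert the items accordingly. The sketch below assumes the central ranking is the identity; the helper names are illustrative, not from the paper.

```python
# A sketch of an exact sampler for the generalized Mallows model under Kendall
# tau distance with central ranking pi0 = (1, ..., m), using the decomposition
# of Lemma 2.1: the scores V_i are independent truncated geometric variables,
# V_i ~ TG(beta_i, m - i).
import numpy as np

def sample_truncated_geometric(beta, n, rng):
    k = np.arange(n + 1)
    w = np.exp(-beta * k)
    return int(rng.choice(n + 1, p=w / w.sum()))

def sample_generalized_mallows(betas, rng=None):
    """betas has length m-1; betas[i-1] is the spread parameter associated with item i."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(betas) + 1
    ranking = [m]                       # place items m, m-1, ..., 1
    for i in range(m - 1, 0, -1):
        v_i = sample_truncated_geometric(betas[i - 1], m - i, rng)
        ranking.insert(v_i, i)          # exactly v_i larger items precede item i
    return ranking                      # most preferred item first

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Single parameter Mallows model: all spread parameters equal.
    print(sample_generalized_mallows([0.5] * 4, rng))
    # A 2-block instance: the first two items are much less likely to be misplaced.
    print(sample_generalized_mallows([2.0, 2.0, 0.1, 0.1], rng))
```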

In Section 5 we introduce the Mallows Block Model that interpolates between the single parameter and the generalized Mallows model.

2.2 Fano’s Inequality

In this section we present Fano's inequality, which is our main technical tool for proving lower bounds on the sample complexity of learning Mallows Block Models. For this, let $\Theta$ denote some finite set of parameters.

Maximum Risk of an Estimator. Let $\{P_\theta : \theta \in \Theta\}$ be a family of distributions and assume that we have access to $n$ i.i.d. samples $x_1, \dots, x_n \sim P_\theta$. Let $\hat{\theta} = \hat{\theta}(x_1, \dots, x_n)$ be an estimator of $\theta$ and let $d(\cdot, \cdot)$ be a distance on $\Theta$. Then the maximum risk of $\hat{\theta}$ with respect to the family is equal to

$$\sup_{\theta \in \Theta} \mathbb{E}_{x_1, \dots, x_n \sim P_\theta}\left[d\big(\hat{\theta}(x_1, \dots, x_n), \theta\big)\right]. \qquad (2.6)$$

Minimax Risk. Let $\{P_\theta : \theta \in \Theta\}$ be a family of distributions and assume that we have access to $n$ i.i.d. samples $x_1, \dots, x_n \sim P_\theta$. Let also $\hat{\theta}$ range over all estimators that can be computed from $x_1, \dots, x_n$. Then we define the minimax risk of the family as

$$\inf_{\hat{\theta}} \sup_{\theta \in \Theta} \mathbb{E}_{x_1, \dots, x_n \sim P_\theta}\left[d\big(\hat{\theta}(x_1, \dots, x_n), \theta\big)\right]. \qquad (2.7)$$

We can now state Fano's inequality as presented by [Yu(1997)].

[Lemma 3 in Yu1997] Let $\{f_1, \dots, f_r\}$ be a finite family of densities such that $d(f_i, f_j) \ge \alpha$ and $D_{KL}(f_i \| f_j) \le \beta$ for all $i \ne j$; then it holds that

$$\inf_{\hat{f}} \max_{j \in [r]} \mathbb{E}_{f_j}\left[d(\hat{f}, f_j)\right] \ge \frac{\alpha}{2}\left(1 - \frac{\beta + \log 2}{\log r}\right).$$

3 Concentration Inequality and Total Variation of Exponential Families

We shall prove a concentration inequality for the sufficient statistics of an exponential family. This concentration inequality will be the basic building block of the general learning algorithm for exponential families that we present in the following sections. Then we prove an exact formula for the total variation distance between two distributions that belong to the same exponential family.

Let $\{p_\theta : \theta \in \Theta\}$ be an exponential family with natural parameter $\theta$, logarithmic partition function $A$ and range of parameters $\Theta \subseteq \mathbb{R}$. Then the following concentration inequality holds for all $\theta, \theta' \in \Theta$ with $\theta' \ge \theta$ and for i.i.d. samples $x_1, \dots, x_n \sim p_\theta$:

$$\Pr\left[\frac{1}{n}\sum_{i=1}^{n} T(x_i) \ge A'(\theta')\right] \le \exp\left(-n \, D_{KL}(p_{\theta'} \,\|\, p_\theta)\right). \qquad (3.1)$$

We give the proof for $\theta' \ge \theta$; the case $\theta' < \theta$ can be handled analogously. Let $\lambda \ge 0$ and, for simplicity, let $\bar{T} = \frac{1}{n}\sum_{i=1}^{n} T(x_i)$; then it holds that

$$\Pr\left[\bar{T} \ge A'(\theta')\right] \le e^{-\lambda n A'(\theta')} \, \mathbb{E}\left[e^{\lambda \sum_{i=1}^{n} T(x_i)}\right] \qquad \text{(Markov's Inequality)}$$
$$= e^{-\lambda n A'(\theta')} \prod_{i=1}^{n} \mathbb{E}\left[e^{\lambda T(x_i)}\right] \qquad \text{(Independence of the $x_i$'s)}$$
$$= \exp\left(-n \left(\lambda A'(\theta') - A(\theta + \lambda) + A(\theta)\right)\right). \qquad \text{(By the definition of $A$)}$$

Now we define the function $g(\lambda) = \lambda A'(\theta') - A(\theta + \lambda) + A(\theta)$. The second derivative of $g$ is $g''(\lambda) = -A''(\theta + \lambda)$. From (2.3) we conclude that $A''(\theta + \lambda) = \mathrm{Var}_{\theta + \lambda}[T(x)] \ge 0$ and hence $g''(\lambda) \le 0$, which implies that $g$ is a concave function. Hence $g$ achieves its maximum for $\lambda \ge 0$ at $\lambda^*$ such that $g'(\lambda^*) = 0$. But $g'(\lambda) = A'(\theta') - A'(\theta + \lambda)$, which implies that for $\lambda^* = \theta' - \theta$ it holds that $g'(\lambda^*) = 0$. Therefore the optimal bound of the above form is achieved for $\lambda = \theta' - \theta$. Hence we have the following

$$\Pr\left[\bar{T} \ge A'(\theta')\right] \le \exp\left(-n \left((\theta' - \theta) A'(\theta') - A(\theta') + A(\theta)\right)\right) = \exp\left(-n \, D_{KL}(p_{\theta'} \,\|\, p_\theta)\right),$$

where the last equality follows from (2.4), which concludes the proof.
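As a quick illustration of the statement (under the reconstruction above), the following simulation shows the empirical mean of the sufficient statistic of a truncated geometric family concentrating around its expectation; it is an illustrative sketch, not code from the paper.

```python
# Simulation of the concentration of the empirical sufficient statistic around
# its mean A'(theta), here for the truncated geometric family TG(beta, n) with T(k) = k.
import numpy as np

rng = np.random.default_rng(1)
beta, support_n, n_samples, trials = 0.5, 20, 200, 2000

k = np.arange(support_n + 1)
p = np.exp(-beta * k) / np.exp(-beta * k).sum()
true_mean = float(np.sum(k * p))            # = A'(theta) at theta = -beta

samples = rng.choice(support_n + 1, size=(trials, n_samples), p=p)
deviations = np.abs(samples.mean(axis=1) - true_mean)

# Empirically, the deviation of the empirical mean shrinks as n_samples grows,
# in line with the exponential tail bound of the theorem above.
print(f"true mean {true_mean:.3f}, 95th percentile of |deviation| "
      f"{np.quantile(deviations, 0.95):.3f}")
```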

The following useful corollary of Theorem 3.1 can be obtained if we apply Pinsker’s inequality to the right hand side of (3.1).

Let $\{p_\theta : \theta \in \Theta\}$ be an exponential family with natural parameter $\theta$, logarithmic partition function $A$ and range of parameters $\Theta \subseteq \mathbb{R}$. Then the following concentration inequality holds for all $\theta, \theta' \in \Theta$ with $\theta' \ge \theta$ and for i.i.d. samples $x_1, \dots, x_n \sim p_\theta$:

$$\Pr\left[\frac{1}{n}\sum_{i=1}^{n} T(x_i) \ge A'(\theta')\right] \le \exp\left(-2 n \, d_{TV}(p_{\theta'}, p_\theta)^2\right). \qquad (3.2)$$

We now move to proving an exact formula for the total variation distance between two distributions in the same exponential family. For the proof of Theorem 3 we refer to Appendix B.

Let $\{p_{\boldsymbol{\theta}} : \boldsymbol{\theta} \in \Theta\}$ be an exponential family with natural parameters in $\Theta$. If $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2 \in \Theta$ with $\boldsymbol{\theta}_1 - \boldsymbol{\theta}_2$ sufficiently small, then for some $\boldsymbol{\xi}$ on the segment between $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$ it holds that

$$d_{TV}(p_{\boldsymbol{\theta}_1}, p_{\boldsymbol{\theta}_2}) = \frac{1}{2}\, \mathbb{E}_{x \sim p_{\boldsymbol{\xi}}}\left[\left|(\boldsymbol{\theta}_1 - \boldsymbol{\theta}_2)^\top \left(\mathbf{T}(x) - \nabla A(\boldsymbol{\xi})\right)\right|\right].$$

To give some intuition about Theorem 3, consider the single dimensional case with $\theta_1 = \theta$ and $\theta_2 = \theta + \varepsilon$. In this case, it is easy to see that, for every outcome $x$, the sign of $p_{\theta_1}(x) - p_{\theta_2}(x)$ is determined by the sign of $T(x) - A'(\xi)$, and hence the expression becomes $d_{TV}(p_{\theta_1}, p_{\theta_2}) = \frac{|\varepsilon|}{2} \, \mathbb{E}_{x \sim p_{\xi}}\left[|T(x) - A'(\xi)|\right]$. This gives the intuition that the total variation distance of two distributions in the same exponential family, with parameters sufficiently close, is equal to the distance between their parameters times the absolute deviation of a corresponding distribution in the family. This should be compared with Theorem 2.4, on the KL-divergence between two distributions in the same exponential family. The single dimensional version of Theorem 2.4 states that the KL-divergence is equal to the squared difference of their parameters multiplied by the variance of a corresponding distribution inside the exponential family. Since the standard deviation is greater than the absolute deviation, this conclusion resembles the well known Pinsker's inequality. Furthermore, in a lot of exponential families, e.g. Gaussian distributions, the absolute deviation is only a constant fraction away from the standard deviation, which indicates the existence of a converse Pinsker's inequality in these settings.
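In the spirit of this converse Pinsker discussion, the small numeric check below (illustrative only; the range of spread parameters is arbitrary) computes the ratio of the mean absolute deviation to the standard deviation for the truncated geometric family.

```python
# Ratio of mean absolute deviation to standard deviation for the truncated
# geometric family on {0, ..., n}, for a few values of the spread parameter.
import numpy as np

n = 30
for beta in [0.05, 0.2, 0.5, 1.0, 2.0]:
    k = np.arange(n + 1)
    p = np.exp(-beta * k) / np.exp(-beta * k).sum()
    mean = np.sum(k * p)
    std = np.sqrt(np.sum((k - mean) ** 2 * p))
    mad = np.sum(np.abs(k - mean) * p)        # mean absolute deviation
    print(f"beta={beta:4.2f}  mad/std = {mad / std:.3f}")
# For a Gaussian the ratio would be sqrt(2/pi) ~ 0.798; for these values of
# beta the truncated geometric ratio stays comparable to that constant.
```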

4 Warm-up: Learning Single Parameter Mallows Model

In this section we give a simple algorithm and prove its sample complexity for learning the parameters of a single parameter Mallows distribution given i.i.d. samples from it. We also provide bounds for learning the distribution in total variation distance. As we will see, if the central ranking is known then an accurate estimation of the spread parameter is possible even from a single sample, hence giving an alternative proof of a phenomenon established by [Mukherjee(2016)].

4.1 Parameter Estimation

For the single parameter Mallows model, the sample complexity of estimating the central ranking has been identified in [Caragiannis et al.(2016)Caragiannis, Procaccia, and Shah], as we see in the next theorem. We focus on the case where the ranking distance is the Kendall tau distance $d_{KT}$.

[CaragiannisPS16] For any Mallows distribution and any failure probability, there exists a polynomial time estimator that, given a logarithmic number of i.i.d. samples, recovers the central ranking with high probability. Moreover, if the number of samples is below a logarithmic threshold, then for any estimator there exists a Mallows distribution on which it fails to recover the central ranking with constant probability.

Hence it remains to estimate the parameter $\beta$ given knowledge of the central ranking $\pi_0$. As we explained in the definition of the Mallows model, when the central ranking is known the family of distributions is a single parameter exponential family. The sufficient statistic of this family is $d_{KT}(\pi, \pi_0)$. The natural parameter of the family is $-\beta$ and its logarithmic partition function is $\log Z(\beta)$.
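A natural way to implement this estimator, sketched below under the assumption that the central ranking is known, is to match the empirical mean of the sufficient statistic $d_{KT}(\pi, \pi_0)$ to its expectation and solve for $\beta$ by bisection; for a single parameter exponential family this coincides with maximum likelihood. The helper names and the search range for $\beta$ are illustrative, not from the paper.

```python
# Moment-matching / maximum-likelihood estimation of beta with known pi0:
# the expected Kendall tau distance is strictly decreasing in beta, so we can
# invert it by bisection.
import numpy as np

def kendall_tau(pi, pi0):
    """pi, pi0: permutations of {0, ..., m-1}, listed from most to least preferred."""
    pos, pos0 = np.argsort(pi), np.argsort(pi0)
    m = len(pi)
    return sum(1 for i in range(m) for j in range(i + 1, m)
               if (pos[i] - pos[j]) * (pos0[i] - pos0[j]) < 0)

def expected_distance(beta, m):
    # E[d_KT] = sum_{j=1}^{m-1} E[V_j] with V_j ~ TG(beta, m - j).
    total = 0.0
    for j in range(1, m):
        k = np.arange(m - j + 1)
        w = np.exp(-beta * k)
        total += np.sum(k * w) / np.sum(w)
    return total

def estimate_beta(samples, pi0, lo=1e-6, hi=20.0, iters=60):
    """Bisection over an illustrative bracket [lo, hi] for beta."""
    m = len(pi0)
    target = np.mean([kendall_tau(pi, pi0) for pi in samples])
    for _ in range(iters):               # expected_distance is decreasing in beta
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if expected_distance(mid, m) > target else (lo, mid)
    return (lo + hi) / 2
```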

For any Mallows distribution with parameters $\pi_0$ and $\beta$, there exist estimators of $\pi_0$ and $\beta$ that can be computed in polynomial time from i.i.d. samples and that, given sufficiently many samples, recover $\pi_0$ and approximate $\beta$ within the target accuracy with high probability.

In the case where $\pi_0$ is known, there exists an estimator of $\beta$ that can be computed in polynomial time and that, given sufficiently many samples, approximates $\beta$ within the target accuracy with high probability.

Theorem 4.1 follows from the more general Theorem 5.1 and hence we postpone its proof to Section 5. One interesting thing to point out from Theorem 4.1 is that, in the case where $\pi_0$ is known, Theorem 4.1 provides accuracy for the parameter $\beta$ that goes to $0$, even with a single sample, as the size of the permutation goes to infinity, i.e., as $m \to \infty$. This was observed before by [Mukherjee(2016)], but no explicit rates, such as the ones we provide, were given. We summarize our result for a single sample in the following corollary, which immediately follows from Theorem 4.1.

For any known central ranking $\pi_0$ and any spread parameter $\beta$ in the allowed range, there exists an estimator of $\beta$ that can be computed in polynomial time from one sample and whose error goes to $0$ as $m \to \infty$.

4.2 Learning in KL and TV Distance

The upper bound on the number of samples that we need to learn the distribution in KL and TV distance follows from Theorem 3 and Theorem 4.1, as we show in the more general Theorem 5.2. To finish this section, we focus on proving the lower bound for learning in TV distance. The lower bound for learning the single parameter $\beta$ follows again from the corresponding lower bound of Section 5, and hence the corresponding term in the sample complexity is necessary. In the next lemma we prove that the remaining term is also necessary.

For any sufficiently small target accuracy, any algorithm that learns a single parameter Mallows distribution within that accuracy in total variation distance requires a number of samples that grows with this remaining term.

For the proof of Lemma 4.2 we refer to the Appendix C.

5 Learning Mallows Block Model

We start this section with properties of the Generalized Mallows Model as it is defined in Section 2. Then we move to the definition of the Mallows Block Model and the presentation of our main results. We remind the reader that the generalized Mallows family of distributions with parameters $\pi_0 \in S_m$ and $\boldsymbol{\beta} = (\beta_1, \dots, \beta_{m-1})$ is defined as the probability measure over $S_m$ with probability mass function that, using Lemma 2.1, is equal to

$$\Pr[\pi] = \prod_{i=1}^{m-1} \frac{e^{-\beta_i V_i(\pi, \pi_0)}}{Z_i(\beta_i)}. \qquad (5.1)$$

We now define the random variables $V_i = V_i(\pi, \pi_0)$, where $\pi$ is drawn from the generalized Mallows distribution; these are the sufficient statistics for $\boldsymbol{\beta}$ when $\pi_0$ is known. It is easy to observe from (5.1) that the probability mass function of the vector $\mathbf{V} = (V_1, \dots, V_{m-1})$ is

$$\Pr\left[\mathbf{V} = (v_1, \dots, v_{m-1})\right] = \prod_{i=1}^{m-1} \frac{e^{-\beta_i v_i}}{Z_i(\beta_i)}, \qquad (5.2)$$

and hence the random variables $V_1, \dots, V_{m-1}$ are independent. Observe also, from this probability mass function and the definition of the truncated geometric distribution in Section 2, that $V_i \sim TG(\beta_i, m - i)$. To formally summarize this observation, we consider the multivariate distribution of $\mathbf{V} = (V_1, \dots, V_{m-1})$, where the $V_i \sim TG(\beta_i, m - i)$ are independent. The following lemma relates the distribution of $\mathbf{V}$ with the generalized Mallows distribution when the central ranking is known. For the proof we refer to Appendix D.

Let $\pi_0 \in S_m$ and $\boldsymbol{\beta}$ be given. Consider the support of the generalized Mallows distribution and the support of the distribution of $\mathbf{V}$. Then there exists a bijective map between the two supports such that corresponding outcomes have equal probability. In particular, the total variation distance and the KL-divergence between two generalized Mallows distributions with the same central ranking are equal to those between the corresponding distributions of $\mathbf{V}$.

The above lemma reduces the problem of learning the Generalized Mallows distribution to learning the central ranking and the distribution of $\mathbf{V}$.

Mallows Block Model. The motivation of the Mallows Block Model is to incorporate settings where some groups of alternatives have the same probability of being misplaced, and hence share the same parameter $\beta_i$, but not all alternatives have the same probability of being misplaced as in the single parameter Mallows model. As we will explore in this section, the knowledge of the groups of alternatives with the same parameter can significantly decrease the number of samples needed to learn the parameters of the model. In the extreme case, when the sizes of the groups of alternatives are large enough, we can get very good rates even from just one sample from the distribution, as we already discussed in Corollary 4.1. The Mallows Block Model with parameters $\pi_0$, $\boldsymbol{\beta} = (\beta_1, \dots, \beta_K)$ and $\mathcal{B}$ is the family of distributions described below,

where $\mathcal{B} = (B_1, \dots, B_K)$ is a partitioning of the set of alternatives. Each distribution is defined as a probability measure over $S_m$ with the following probability mass function

$$\Pr[\pi] = \frac{1}{Z(\boldsymbol{\beta})} \exp\left(-\sum_{j=1}^{K} \beta_j \sum_{i \in B_j} V_i(\pi, \pi_0)\right). \qquad (5.3)$$

Again using Lemma 2.1 we have that $Z(\boldsymbol{\beta}) = \prod_{i} Z_i(\beta_{j(i)})$, where $j(i)$ denotes the block that contains $i$. The sufficient statistics of the model, when $\pi_0$ and $\mathcal{B}$ are known, is the $K$-dimensional vector $\mathbf{S} = (S_1, \dots, S_K)$, where $S_j = \sum_{i \in B_j} V_i(\pi, \pi_0)$. We define the corresponding product distribution to be the distribution of the random vector $\mathbf{S}$, where the $V_i$ are independent and $V_i$ follows the truncated geometric distribution $TG(\beta_{j(i)}, m - i)$.
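The sketch below (illustrative, not from the paper) shows how the block sufficient statistics can be computed from a sample when the central ranking is the identity and the block partition over $\{1, \dots, m-1\}$ is known; estimating each $\beta_j$ then reduces to inverting the mean of a sum of truncated geometric variables, as in the single parameter case.

```python
# Computing the block sufficient statistics S_j = sum_{i in B_j} V_i from a
# sample, assuming the central ranking is the identity over items 1..m.
import numpy as np

def v_scores(pi):
    """V_i = number of items j > i that pi ranks before item i (pi0 = identity).
    pi lists the items 1..m from most to least preferred."""
    m = len(pi)
    pos = {item: idx for idx, item in enumerate(pi)}
    return [sum(1 for j in range(i + 1, m + 1) if pos[j] < pos[i])
            for i in range(1, m)]

def block_sufficient_statistics(sample, partition):
    """sample: list of permutations; partition: list of blocks of {1, ..., m-1}.
    Returns, for each block, the average of sum_{i in block} V_i over the sample."""
    stats = np.zeros(len(partition))
    for pi in sample:
        v = v_scores(pi)
        for b, block in enumerate(partition):
            stats[b] += sum(v[i - 1] for i in block)
    return stats / len(sample)

if __name__ == "__main__":
    sample = [[1, 2, 3, 4, 5], [2, 1, 3, 5, 4], [1, 3, 2, 4, 5]]
    partition = [[1, 2], [3, 4]]          # a 2-block partition of {1, ..., 4}
    print(block_sufficient_statistics(sample, partition))
```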

One important parameter of the Mallows Block Model is the sizes of the sets in the partition $\mathcal{B}$. For this reason we define the minimum and the maximum block size of $\mathcal{B}$.

5.1 Parameter Estimation in Mallows Block Model

We start with the estimation of the central ranking. Since the single parameter Mallows model is a special case of the Mallows Block Model, the lower bound of [Caragiannis et al.(2016)Caragiannis, Procaccia, and Shah] presented in Theorem 4.1 still holds, and thus a logarithmic number of samples is necessary. We present the corresponding upper bound in Theorem 5.1; its proof is deferred to Appendix D.

For any generalized Mallows distribution and any known partition, there exists a polynomial time computable estimator that, given a logarithmic number of i.i.d. samples, recovers the central ranking with high probability. Moreover, if the number of samples is below a logarithmic threshold, then for any estimator there exists a distribution on which it fails to recover the central ranking with constant probability.

What remains is to estimate the vector of parameters $\boldsymbol{\beta}$ assuming knowledge of the central ranking $\pi_0$. As we explained in the definition of the Mallows Block Model, when the central ranking is known the family of distributions is an exponential family. The sufficient statistics of this family are $S_j = \sum_{i \in B_j} V_i(\pi, \pi_0)$ for $j \in [K]$. The natural parameters of the family are $-\beta_1, \dots, -\beta_K$, and its logarithmic partition function is $\log Z(\boldsymbol{\beta})$. We may simplify the notation when $\pi_0$ and $\mathcal{B}$ are clear from the context.

For any generalized Mallows distribution, any fixed partition with known block structure, and any target accuracy, there exist estimators of the central ranking and of the spread parameters that can be computed in polynomial time from i.i.d. samples and that, given sufficiently many samples, achieve the target accuracy with high probability.

In the case where the central ranking is known, there exists an estimator of the spread parameters that can be computed in polynomial time and that, given sufficiently many samples, achieves the target accuracy with high probability.

As a corollary of Theorem 5.1 we also have that, when the central ranking is known, even one sample is sufficient to consistently learn all the parameters as the size of the smallest block of the partition goes to infinity.

Let a Mallows Block distribution with known central ranking $\pi_0$ and a known partition $\mathcal{B}$ be given. Then there exists an estimator of the spread parameters that can be computed in polynomial time from a single sample and whose estimation error goes to zero as the minimum block size of $\mathcal{B}$ goes to infinity.

Proof of Theorem 5.1: (Sketch) Given the first part of Theorem 5.1, we focus on the estimation of the parameters $\boldsymbol{\beta}$. We describe the intuition for the single parameter Mallows Model and defer the full proof to Appendix E. Once the central ranking $\pi_0$ is known, the distribution is an exponential family; let $S = d_{KT}(\pi, \pi_0)$ be its sufficient statistic. It is not hard to prove that the expectation of $S$ is a strictly monotone function of the parameter $\beta$. Therefore, it follows by a simple argument that the better we estimate the expectation of $S$, the better we can estimate $\beta$. Now the main idea of our proof is to use the general concentration inequality of Theorem 3 to bound the accuracy with which we can estimate the expectation of $S$. As is clear from the form of the concentration inequality (3.1), to get good enough concentration we have to prove a strong lower bound on the KL-divergence of two distributions in the family. From (2.3) this reduces to proving a lower bound on the variance of a distribution in the family with parameter that is very close to $\beta$. Such a good lower bound is not always possible to prove and we have to consider several cases. But in the main case, a very careful lower bound on the variance in combination with (3.1) gives the sample complexity upper bound.

5.2 Learning in KL-divergence and Total Variation Distance

In this section we describe how we can use the concentration inequality that we proved in Section 3 to learn a distribution in KL-divergence from i.i.d. samples. We also prove a lower bound that essentially matches the upper bound.

For any Mallows Block distribution with a fixed partition and known block structure, and any target accuracy, there exist estimators that can be computed in polynomial time from i.i.d. samples such that, given sufficiently many samples, the learned distribution is close to the underlying one in KL-divergence,

and hence also close in total variation distance.

Furthermore, for any number of blocks there exists a large enough permutation size such that any estimator that uses fewer samples than an essentially matching threshold fails, for some Mallows Block distribution and some partition, to achieve the target total variation distance.

The proof of Theorem 5.2 is based on two lemmas, one for the upper bound and one for the lower bound, that we present here, together with Lemma 4.2 that we presented in Section 4. For the proofs of these two lemmas we refer to Appendix D.

For any , , any fixed partition