Estimating a symmetric property of a distribution given a small number of samples is a fundamental problem in algorithmic statistics. Formally, a property is symmetric if it is invariant to permutation of the labels, i.e. it is a function only of the multiset of probabilities and does not depend on the symbol labels. For many natural properties, including support size, coverage, distance from uniform and entropy, there has been extensive work that has led to designing efficient estimators both with respect to computational time and sample complexity [HJWW17, HJM17, AOST14, RVZ17, ZVV16, WY16b, RRSS07, WY15, OSW16, VV11b, WY16a, JVHW15, JHW16, VV11a]. In many cases these estimators are tailored to the particular property of interest. This paper is motivated by the goals of unifying the development of efficient estimators of symmetric properties of distributions and designing a single efficient universal algorithm for estimating arbitrary symmetric properties of distributions.
Our approach stems from the observation that a sufficient statistic for the problem of estimating a symmetric property from a sequence of samples is the profile of the sequence, i.e. the multiset of the frequencies (i.e multiplicities) of symbols in the sequence, e.g. the profile of is . Profiles are also called histograms of histograms, histogram order statistics, or fingerprints. Our approach to obtaining a universal estimator is based on the elegant problem of profile maximum likelihood (PML) introduced by Orlitsky et al. [OSS04]: Given a sequence of
samples, find the distribution that maximizes the probability of the observed profile. This problem has been studied in several papers since, applying heuristic approaches such as Bethe approximation[Von12, Von14], the EM algorithm [OSS04], and some algebraic approaches [ADM10] to calculate the PML. Recently Pavlichin, Jiao and Weissman [PJW17] introduced an efficient dynamic programming heuristic for PML that can be computed in linear time. While there are no approximation guarantees for the solution they produce, their approach was the initial impetus for our work.
A recent paper of Acharya et al. [ADOS16] showed that a distribution that optimizes the PML objective can be used to obtain a plug-in estimator for various symmetric properties of distributions. In fact it suffices to compute a distribution that approximates the PML objective to within a factor for constant where is the size of the sample. Unfortunately, no polynomial time computable PML estimator with such an approximation guarantee was known previously. In this paper, we provide an estimator with an approximation factor of , leading to a universal estimator for a host of symmetric properties. Moreover, our estimator is computable in time nearly linear in . Our techniques extend to computing a -dimensional generalization of PML, where we have access to samples from multiple distributions on a common domain. This allows for universal plug-in estimation of various symmetric relationships between multiple distributions.
1.1 Overview of approach
The bulk of our work is dedicated to find a distribution that approximately maximizes the PML objective within an factor for a constant . We call such a distribution an approximate PML distribution. Given a sequence and its corresponding profile , the PML optimization problem is a maximization problem over all distributions . The objective function of the PML optimization problem is the probability of observing profile with respect to a distribution , which in turn is equal to the summation of probabilities of sequences (with respect to p) that have as their corresponding profile. The distribution that maximizes this objective is called a profile maximum likelihood (PML) distribution. (See Section 2 for formal definitions.)
To efficiently compute an approximate PML distribution, we first restrict ourselves to maximizing the PML objective for a discretized version of the profile over a class of distributions we call discrete pseudo-distributions (See Section 4). Here, the probability values of the distribution are restricted to belong to a small set P of permissible values (See Section 4.1)), and the frequencies in the profile are similarly restricted to belong to a small set M (See Section 4.2). We call the resulting maximizing distribution, a discrete PML (DPML) distribution and the corresponding optimization problem as DPML optimization (See Section 4.3).
There are two main features of the DPML optimization problem. Firstly, the maximizing distribution DPML is an approximate PML distribution with an approximation guarantee that we can control (as a function of the sizes of P and M). Secondly, the DPML optimization problem has a simpler equivalent formulation, in which sequences that have the same associated probability value with respect to a discrete pseudo-distribution are combined together into sub groups and the whole summation is written as a summation over a small number of subgroups. The number of these subgroups is a function of the sizes of P and M which we control (See Section 4.3 for both these results).
As an illustration of DPML, consider the profile
and a probability distribution on 5 elements: two with a value ofand three with a value of . Note that the probability values come from the set . One way to get the profile is to have an element of probability appear twice and two elements of probability appear once. There are choices of such elements and for each such choice, sequences of length 4 with these elements. The probability of any such sequence is the same: . We consider the set of all these sequences as one subgroup. Different subgroups are identified by specifying, for each permissible probability value, the frequencies with which elements of that probability value are seen in the sample. The DPML objective then sums up the contributions of each such subgroup.
Reformulating the problem in terms of summation over a small number of subgroups is crucial to our approach. It allows us to focus on the subgroup that gives the largest contribution to the objective instead of summing over all the subgroups. We call the optimization problem that optimizes the contribution of a single subgroup (instead of summing over all terms) as single discrete PML (SDPML). We show that the SDPML optimization problem has a convex relaxation and can be solved efficiently. Since there were a small number of these subgroups in the summation, the optimizing discrete pseudo-distribution that optimizes over just one subgroup has objective function value that is lower by at most the number of subgroups. Hence the maximizing discrete pseudo-distribution for this new objective function approximately optimizes the earlier objectives (PML and DPML) with bounded loss (See Section 4.3).
Ultimately, our algorithm first solves this convex relaxation to the SDPML optimization problem to obtain a fractional solution (in some representation space of these discrete pseudo-distributions) (See Section 4.4). Then we apply a rounding algorithm that finds a distribution which maintains the approximation guarantee need to obtain an approximate PML distribution (See Section 4.5).
1.2 Related work
As discussed in the introduction, PML was introduced by Orlitsky et al. [OSS04] in 2004. Many heuristic approaches such as Bethe approximation [Von12, Von14], the EM algorithm [OSS04], algebraic approaches [ADM10] and a dynamic programming approach [PJW17] have been proposed to calculate the approximate PML.
The connection between PML and universal estimators was first studied in [ADOS16]. There have been several other approaches for designing universal estimators for symmetric properties. Valiant and Valiant [VV11b]
adopted and rigorously analyzed a linear programming based approach for universal estimators proposed by[ET76] and showed that it is sample complexity optimal in the constant error regime for estimating certain symmetric properties (namely, entropy, support size, support coverage, and distance to uniformity). Recent work of Han, Jiao and Weissman [HJW18]
applied a local moment matching based approach in designing efficient universal symmetric property estimators for a single distribution.[HJW18] achieves the optimal sample complexity in all error regimes for estimating the power sum function, support and entropy.
Estimating symmetric properties of a distribution is a rich field and extensive work has been dedicated to studying their optimal sample complexity for estimating each of these properties. Optimal sample complexities for estimating many symmetric properties were resolved in the past few years, including all the properties studied here: support [VV11b, WY15], support coverage [OSW16, ZVV16], entropy [VV11b, WY16a] and distance from uniform [VV11a, JHW16].
1.3 Paper organization
The rest of the paper is structured as follows. In Section 2, we provide definitions and notations. In Section 3, we state our main results of the paper. Our main contribution is to provide an algorithm that efficiently compute an approximate PML and in Section 4 we prove this result. In this section, we also present an almost linear time algorithm based on cutting plane methods for solving our convex relaxation to SDPML; however we defer all of its analysis to the appendix. Finally, in Section 5, we provide the connection between approximate PML distribution and a universal estimator for symmetric property estimation. The proof presented in [ADOS16] showed this connection for an -approximate PML estimator and we show it for an -approximate PML estimator. However it is easy to see the proof presented in [ADOS16] works for any -approximate PML estimator for constant . In Appendix E we show that the techniques presented here generalize to a higher dimensional version of PML.
Let and denote the interval of integers and reals and respectively and let . Let be the set of all distributions supported on domain and let be the size of the domain. We use the word distribution to refer to discrete distributions. Throughout this paper we assume that we receive a sequence of independent samples from an underlying distribution . Let be the set of all length sequences and be one such sequence with denoting its th element. The probability of observing sequence is:
where is the frequency (multiplicity) of symbol in sequence and is the probability of domain element .
We extend and use the definition for
to any vectorby letting . Further, for functions of probability distributions p, we assume those expressions are also defined for any vector just by replacing by everywhere.
For any given sequence one could define its type (histogram) and profile (histogram of a histogram or fingerprint) that are sufficient statistics for symmetric property estimation. The histogram of histogram perspective comes from viewing type as a histogram and profile as histogram of type.
Definition 2.1 (Type).
A type of a sequence is the vector of frequencies of domain elements in . We call the length of type and use to represent the set of all types of length .
To simplify notation we use just to denote type and the associated sequence will be clear from context. For a distribution , the probability of a type is:
where and .
Definition 2.2 (Profile).
For any sequence , let be the set of all its distinct frequencies and be elements of the set D. The profile of a sequence denoted is where is the number of domain elements with frequency in . We call the length of profile and as a function of profile , . We let denote the set of all profiles of length . 111The number of unseen domain elements is not part of the profile, because the domain size is unknown.
For any distribution , the probability of a profile is defined as:
One can also define the profile of a type . We overload notation and use to denote the profile associated with type and .
For future use, we also write the probability of a profile in terms of its types. All types with have the same value and we use notation to represent this quantity. The explicit expression for is written below:
We next derive an expression for the probability of a profile in terms of its types:
The distribution which maximizes the probability of a profile is called a profile maximum likelihood distribution.
Definition 2.3 (Profile maximum likelihood).
For any profile , a profile maximum likelihood (PML) distribution is:
and is the maximum PML objective value.
The central goal of this paper is to define efficient algorithms for computing approximate PML distributions defined as follows.
Definition 2.4 (Approximate PML).
For any profile , a distribution is a -approximate PML distribution if
Throughout this paper we use the phrase approximate PML to denote a -approximate PML distribution for some non-trivial .
2.1 Representation of a profile
For any profile , we represent using the set of tuples, where a tuple denotes that number of domain elements have frequency in the sequence. We use to denote the size of profile in this representation. It is not hard to see that for any length profile . Further it takes time to write the profile in this representation.
For all our algorithmic results, when we are given a profile, we assume the above representation. We will explicitly state running times when we start with a sequence instead of a profile.
Here we state the main results of this paper. Our first main theorem provides an algorithm to efficiently compute an approximate PML distribution. Our approximation guarantee in this result is something that depends on the running time itself and we can achieve sub-linear running times (in size of the sample) if we allow for weaker approximation guarantees.
Theorem 3.1 (Efficient and approximate PML distribution).
Given a profile , let be its corresponding PML distribution. There is an algorithm that for any , computes an -approximate PML distribution , i.e.
in time. Using this running time simplifies to .
In the above result, the best approximation is achieved for and we get an -approximate PML distribution in nearly linear time (in the number of samples). This result is summarized below.
Corollary 3.2 (Nearly linear time - approximate PML distribution).
Let be a sequence and be its corresponding profile. There is an algorithm that computes an -approximate PML distribution in time .
This results constitutes the first polynomial time algorithm to compute an -approximate PML for any constant . In the corollary above we start with a sequence instead of a profile; in this case our algorithm still runs in because we only need time to compute the profile of a sequence in the representation discussed in Section 2.1.
Our next result relates an approximate PML distribution to a universal plug-in estimator that is sample complexity optimal for support size, coverage, entropy and distance from uniform. In Section 5, we prove this result. However it is easy to see the proof presented in Section 5 proves a more general result that approximate PML is sample complexity optimal for a broad class of symmetric properties satisfying certain conditions. One such set of conditions (informally) is the existence of an estimator for with following properties: the estimator is sample complexity optimal, the estimator has low bias, and the output of the estimator is not changed by much when we change any individual sample. This result was already shown in [ADOS16] for an -approximate PML distribution. Using the same proof with slight modifications we get the following result.
Theorem 3.3 (Universal estimator using approximate PML).
Let be the optimal sample complexity of estimating entropy, support, support coverage and distance to uniformity and be a large positive constant. Let for any constant , then for any , the -approximate PML estimator estimates entropy, support, support coverage, and distance to uniformity to an accuracy of with probability at least .
Setting in the theorem above and combined with creftypecap 3.2, we obtain the following result.
Theorem 3.4 (Efficient universal estimator using approximate PML).
Let be the optimal sample complexity of estimating entropy, support, support coverage and distance to uniformity. If , then there exists a PML based universal plug-in estimator that runs in time and is sample complexity optimal for estimating entropy, support, support coverage and distance to uniformity to accuracy .
Our techniques for PML are general and can be extended to a generalization of PML to multiple dimensions (multidimensional PML). We provide a polynomial time (in number of samples) algorithm to compute approximate PML in multiple dimensions when the number of dimensions is constant. This allows for universal plug-in estimation of various symmetric relationships between multiple distributions. We next formally define and state our main results for multidimensional PML.
3.1 Results for multidimensional PML
First we describe the multidimensional setting, then we define multidimensional PML, and then state our main results. Throughout this paper we assume the number of dimensions is constant.
For each , we receive a sequence that consists of independent samples drawn from an underlying distribution supported on same domain (), further is independent of other sequences for and . We call a -sequence and its -length. Let be the set of all -sequences of -length equal to n. We use to denote the probability of domain element in distribution . We also refer to as a -distribution and let denote the set of all -distributions.
For any -distribution , the probability of a -sequence is defined as:
Recall that for each , is the frequency of domain element in sequence . For any -sequence , we call the -frequency of domain element in . Let be the set of all -frequencies generated by different domain elements in all possible -sequences in and we let denote its th element. We next define multidimensional generalizations of profile, PML, and approximate PML.
For any -sequence , we call a -profile if and is the number of domain elements with -frequency . We call n the -length of and use to denote the set of all -profiles of -length equal to n. For any -distribution , the probability of a -profile is defined as:
Profile maximum likelihood:
For any -profile , a Profile Maximum Likelihood -distribution is:
and is the maximum PML objective value.
Approximate profile maximum likelihood:
For any -profile , a -distribution is a -approximate PML -distribution if
We next state our results for approximate PML -distributions. In Footnote 2, we give a algorithm to efficiently compute an approximate PML -distribution. Then, we substitute in this result to get creftypecap 3.6.
Theorem 3.5 (Efficient and approximate multidimensional PML).
Let be a -sequence of -length . There is an algorithm that computes an -approximate PML -distribution in time222Here notation hides all terms and therefore term as well..
Corollary 3.6 (Efficient and approximate PML for two dimensions).
For , let be a -sequence of -length . There is an algorithm that computes an -approximate PML -distribution in time.
As mentioned before, one of the important applications of approximate multidimensional PML is in estimating symmetric properties for -distributions. A symmetric property is a function of -distributions that is invariant to a permutation of the labels. Here we study one such symmetric property for called KL divergence that is studied in the context of PML. Estimation of KL divergence between two distributions is well studied and estimators that achieve optimal sample complexity were given by [BZLV16, HJW16]. In Theorem 3.7, we show that approximate PML is sample complexity optimal for estimating KL divergence. A similar result was already shown in [Ach18] (Theorem 6) for exact PML and we use the same proof with slight modification to prove our result. In creftypecap 3.8, we give an efficient version of Theorem 3.7 by combining it with creftypecap 3.6.
Theorem 3.7 (Optimal sample complexity for KL divergence).
Let be such that, , and let be the optimal sample complexity for estimating KL divergence between and to an accuracy . If 333Recall here is the size of domain . and , then -approximate PML -distribution (for ) with is sample complexity optimal for estimating KL divergence to an accuracy .
Theorem 6 in [Ach18] also requires and a slightly weaker version of the other condition ().
Corollary 3.8 (Efficient estimator for KL divergence).
Let be such that, , and let be the optimal sample complexity for estimating KL divergence between and to an accuracy . If and , then there exists a PML based universal plug-in estimator that runs in time and is sample complexity optimal for estimating KL divergence to an accuracy .
4 Existence of Structured Approximate PML for One Dimension
Here we provide the proof for Theorem 3.1. First, we show the existence of an approximate PML distribution with a nice structure in Sections 4.1, 4.2 and 4.3. Then, we exploit this structure in Section 4.4 to give an algorithm that returns a fractional solution with running time ranging from nearly linear to sub linear depending on the desired approximation factor. Finally, in Section 4.5 we present a rounding algorithm that takes the fractional solution from the previous step as input and returns an approximate PML distribution within the desired approximation factor.
First, we show the existence of a distribution with minimum non-zero probability value that is an -approximate PML distribution.
Lemma 4.1 (Minimum probability lemma).
For any profile , there exists a distribution such that is a -approximate PML distribution and .
See Appendix A. ∎
This lemma allows us define a region in which our approximate PML takes all its probability values and we use this fact throughout the paper. In Section 4.1 and Section 4.2 we show how we can further simplify the problem of computing an approximate PML by discretizing the probability and the frequency spaces respectively.
4.1 Probability discretization
Let where is such that for some . P is the set representing discretization of probability space and discretization introduces a technicality of probability values not summing up to one and we define pseudo-distributions and discrete pseudo-distribution to handle it.
Definition 4.2 (Pseudo-distribution).
is a pseudo-distribution if and a discrete pseudo-distribution if all its entries are in P as well. We use and to denote the set of all such pseudo-distributions respectively. 444 As discussed in Section 2 we extend all functions of distributions as functions defined for any general vector in and therefore to pseudo-distributions as well. For convenience we refer to for any pseudo-distribution q as the “probability” of profile or PML objective value with respect to q.
One of the important structural properties we prove here is the following: there exists a discrete pseudo-distribution q that when converted to a distribution by dividing all its entries by its norm () is an approximate PML distribution. Even stronger, the discrete pseudo-distribution q itself has value that approximates within a good factor and converting q into a distribution by its norm is only going to help us in this probability because . In the rest of the paper we refer to such a discrete pseudo-distribution as an approximate PML pseudo-distribution and for the earlier reason we focus on finding an approximate PML pseudo-distribution.
The way we show the existence of such a discrete pseudo-distribution that is an approximate PML pseudo-distribution is by taking the PML distribution and converting it into a discrete pseudo-distribution while still preserving the PML objective value to a desired approximation factor. Our next lemma formally proves a general version of this statement. In the remainder of this paper, for notational convenience, for a scalar and set S we use the notation and to denote:
Definition 4.3 (Discrete pseudo-distribution).
For any distribution , its discrete pseudo-distribution is defined as:
Note that . Further, for , . We next state a result that captures the impact of discretizing the probability space.
Lemma 4.4 (Probability discretization lemma).
For any profile and distribution , its discrete pseudo-distribution satisfies:
The first inequality is immediate because for all . To show second inequality consider any sequence ,
In the inequality above we use . Now,
4.2 Multiplicity discretization
Let be the set representing discretization of multiplicities where is such that , and as before will be carefully choose later. Let and note the definition of M keeps all positive integers . We use to denote elements of set M and using this set M we define an analogous quantity to profile called discrete profile.
Definition 4.5 (Discrete profile).
For a sequence , its discrete profile is a profile and is defined as: , where and is the length of discrete profile with . We use to denote the set of all such discrete profiles.
As mentioned in the definition, a discrete profile is also a profile. Note that in the representation of discrete profile we might have indices with , however we have defined profiiles so that there are no such zero entries. We keep these zero entries in our discrete profile for notational convenience and proof simplification. Further it only takes time to write a discrete profile from access to a profile in the representation discussed in Section 2.1.
A discrete profile is a profile of length and it correspond to profile of some sequences of length . One such sequence can be obtained by appending of symbols to sequence itself. The probability of with respect to a distribution p is straightforward:
We next state a result that captures the impact of discretizing the multiplicity space. It is important to note that probability terms ( and ) have different summation terms and yet we show their values approximate each other.
Lemma 4.6 (Profile discretization lemma).
For any distribution , and a sequence :
where and are the profile and discrete profile of respectively.
See Appendix B. ∎
Corollary 4.7 (Discretization lemma).
For any distribution , and a sequence . If is the discrete distribution of p then,
where and are the profile and discrete profile of respectively.
The discretization lemma above suggests that optimizing over over discrete pseudo-distributions with as input is approximately as good as as optimizing over distributions with as input. This result motivates the definition of a new objective function which we introduce and study next.
4.3 Discrete PML Optimization
Here we define a new optimization problem that admits convex relaxations and further returns an approximate PML pseudo-distribution555Note we call a pseudo-distribution q an approximate PML pseudo-distribution if it satisfies , for some non-trivial .. First, we define a discrete profile maximum likelihood (DPML) which is just the PML objective maximized over discrete pseudo-distributions with discrete profile as input. In creftypecap 4.9 we show the optimal discrete pseudo-distribution of this new objective is an approximate PML pseudo-distribution. In Lemma 4.10, we rephrase the DPML optimization problem. Finally, using this DPML reformulation, we define a new optimization problem that we call a single discrete PML (SDPML) and in Lemma 4.14, we show the maximizing discrete pseudo-distribution for the SDPML objective is an approximate PML pseudo-distribution.
Definition 4.8 (Discrete profile maximum likelihood).
Let be any sequence, and be its profile and discrete profile respectively, a discrete profile maximum likelihood (DPML) pseudo-distribution is:
and is the maximum objective value.
Corollary 4.9 (DPML is an approximate PML).
For any sequence if and are its profile and discrete profile respectively, then
Note that is a discrete pseudo-distribution. The result follows from creftypecap 4.7 applied to . ∎
In a approximate sense, our creftypecap 4.7 suggests that working with discrete profile and discrete pseudo-distributions is no different than original profile and distribution itself.
In the next two lemmas we rephrase the DPML optimization problem in forms that are amenable to convex relaxation. To do this, we introduce some new notation.
As before let P and M be sets representing discretization of probabilities and frequencies respectively. Recall that we used to denote the elements of set M and we use to denote the elements of set P. Let be the vector with elements indexed from to and th element equal to . Also let be the vector with elements indexed from to . Its zeroth entry (denoted by ) is equal to and th entry is equal to .
Let be a variable matrix with entries for . As in the case for vector , our second index of variable matrix starts at and not at . Here the variable counts the number of domain symbols with probability value and frequency . Further, counts the number of unseen domain symbols with probability value .
For any vector v and set , we use to denote the length vector corresponding to the portion of vector v associated with index set .
For a discrete profile (corresponding to sequence ), define
Note the constraint does not involve variables that corresponds to unseen elements. These variables only appear in the constraint which ensures our output is always a pseudo-distribution.
For a discrete profile (of ) and a discrete pseudo-distribution q, also define
where and denote the number of domain elements with probability value in pseudo-distribution q. It will be clear from our next lemma why we define these constraint sets.
The advantage of probability and profile discretization we described earlier is that many types in the set share the same probability value of being observed and our goal is to group them using these