# The Broad Optimality of Profile Maximum Likelihood

We study three fundamental statistical-learning problems: distribution estimation, property estimation, and property testing. We establish the profile maximum likelihood (PML) estimator as the first unified sample-optimal approach to a wide range of learning tasks. In particular, for every alphabet size k and desired accuracy ε:

• Distribution estimation. Under ℓ_1 distance, PML yields optimal Θ(k/(ε² log k)) sample complexity for sorted-distribution estimation, and a PML-based estimator empirically outperforms the Good-Turing estimator on the actual distribution.

• Additive property estimation. For a broad class of additive properties, the PML plug-in estimator uses just four times the sample size required by the best estimator to achieve roughly twice its error, with exponentially higher confidence.

• α-Rényi entropy estimation. For integer α > 1, the PML plug-in estimator has optimal k^{1−1/α} sample complexity; for non-integer α > 3/4, the PML plug-in estimator has sample complexity lower than the state of the art.

• Identity testing. In testing whether an unknown distribution is equal to, or at least ε-far from, a given distribution in ℓ_1 distance, a PML-based tester achieves the optimal sample complexity up to logarithmic factors of k.

With minor modifications, most of these results also hold for a near-linear-time computable variant of PML.


## 1 Introduction

### 1.1 Distributions and their properties

A distribution over a discrete alphabet X of size k corresponds to an element of the simplex

 Δ_X := {p ∈ ℝ^k_{≥0} : ∑_{x∈X} p(x) = 1}.

A distribution property is a mapping f : Δ_X → ℝ associating a real value with each distribution, for example its support size. A distribution property is symmetric if it is invariant under domain-symbol permutations. A symmetric property is additive if it can be written as f(p) = ∑_{x∈X} f(p(x)), where for simplicity we use f to denote both the property and the corresponding real function.

Many important symmetric properties are additive. For example,

• Support size S(p) := ∑_{x∈X} 1_{p(x)>0}, a fundamental quantity arising in the study of vocabulary size (Efron and Thisted, 1976; McNeil, 1973; Thisted and Efron, 1987), population estimation (Good, 1953; Mao and Lindsay, 2007), and database studies (Haas et al., 1995);

• Support coverage C_m(p) := ∑_{x∈X} (1 − (1 − p(x))^m), where m is a given parameter, the expected number of distinct elements observed in a sample of size m, arising in biological (Chao, 1984; Kroes et al., 1999) and ecological (Chao, 1984; Chao and Chiu, 2014; Chao and Lee, 1992; Colwell et al., 2012) research;

• Shannon entropy H(p) := −∑_{x∈X} p(x) log p(x), the primary measure of information (Cover and Thomas, 2012; Shannon, 1948), with numerous applications to machine learning (Bresler, 2015; Chow and Liu, 1968; Quinn et al., 2013) and neuroscience (Gerstner and Kistler, 2002; Mainen and Sejnowski, 1995);

• Distance to uniformity D(p) := ∥p − u∥₁, where u is the uniform distribution over X, a property central to the field of distribution property testing (Batu et al., 2000, 2001; Canonne, 2017; Ron, 2010).
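To make the four properties concrete, the following sketch (an informal illustration with hypothetical helper names, not the paper's code) evaluates each of them on a probability vector:

```python
import math

def support_size(p):
    # S(p): number of symbols with nonzero probability
    return sum(1 for px in p if px > 0)

def support_coverage(p, m):
    # C_m(p): expected number of distinct symbols in a sample of size m
    return sum(1 - (1 - px) ** m for px in p)

def shannon_entropy(p):
    # H(p) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0
    return -sum(px * math.log(px) for px in p if px > 0)

def distance_to_uniformity(p):
    # D(p) = ||p - u||_1, where u is uniform over the alphabet
    u = 1 / len(p)
    return sum(abs(px - u) for px in p)
```

Each is additive in the sense above: the per-symbol functions are 1_{x>0}, 1 − (1 − x)^m, −x log x, and |x − 1/k|, respectively.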

Besides being symmetric and additive, these four properties have yet another attribute in common: under the appropriate interpretation, they are all 1-Lipschitz. Specifically, for two distributions p, q ∈ Δ_X, let Γ_{p,q} be the collection of distributions over X × X with marginals p and q on the first and second factors, respectively. The relative earth-mover distance (Valiant and Valiant, 2011b) between p and q is

 R(p, q) := inf_{γ ∈ Γ_{p,q}} E_{(X,Y)∼γ} |log(p(X)/q(Y))|.

One can verify (Valiant and Valiant, 2011b, 2015) that H, C_m, and D, suitably normalized, are all 1-Lipschitz on the metric space (Δ_X, R), and S/k is 1-Lipschitz over Δ_{X,≥1/k}, the set of distributions in Δ_X whose nonzero probabilities are at least 1/k. We will study all such Lipschitz properties in later sections.

An important symmetric non-additive property is Rényi entropy, a well-known measure of randomness with numerous applications to unsupervised learning (Jenssen et al., 2003; Xu, 1999) and image registration (Ma et al., 2000; Neemuchwala et al., 2006). For a distribution p and a non-negative real parameter α ≠ 1, the α-Rényi entropy (Rényi, 1961) of p is H_α(p) := log(∑_{x∈X} p(x)^α)/(1 − α). In particular, the 1-Rényi entropy, defined as the limit of H_α(p) as α → 1 and denoted by H(p), is exactly Shannon entropy (Rényi, 1961).
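The definition above can be checked numerically; the sketch below (a hypothetical helper, assuming natural-log entropies) computes H_α and verifies that it approaches Shannon entropy as α → 1:

```python
import math

def renyi_entropy(p, alpha):
    """H_alpha(p) = log(sum_x p(x)^alpha) / (1 - alpha) for alpha != 1;
    the limit as alpha -> 1 is Shannon entropy."""
    if alpha == 1:
        return -sum(px * math.log(px) for px in p if px > 0)
    return math.log(sum(px ** alpha for px in p if px > 0)) / (1 - alpha)
```

For the uniform distribution over k symbols, H_α equals log k for every α, as expected.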

### 1.2 Problems of interest

We consider the following three fundamental statistical-learning problems.

### Distribution estimation

A natural learning problem is to estimate an unknown distribution p ∈ Δ_X from an i.i.d. sample X^n ∼ p. For any two distributions p, q, let ℓ(p, q) be the loss when we approximate p by q. A distribution estimator p̂ associates every sequence x^n ∈ X* with a distribution p̂(x^n). We measure the performance of an estimator p̂ by its sample complexity

 n(p̂, ε, δ) := min{n : ∀p ∈ Δ_X, Pr_{X^n∼p}(ℓ(p, p̂(X^n)) ≥ ε) ≤ δ},

the smallest sample size that p̂ requires to estimate all distributions in Δ_X to a desired accuracy ε, with error probability δ. The sample complexity of distribution estimation over Δ_X is

 n(ε, δ) := min{n(p̂, ε, δ) : p̂ : X* → Δ_X},

the lowest sample complexity of any estimator. For simplicity, we will omit δ when δ = 1/3.

For a distribution p, we denote by {p} the multiset of its probabilities. The sorted ℓ_1 distance between two distributions p and q is

 ℓ₁^<(p, q) := min_{p′ ∈ Δ_X : {p′} = {p}} ∥p′ − q∥₁,

the smallest ℓ_1 distance between q and any sorted version of p. As illustrated in Section 7.1, this is essentially the 1-Wasserstein distance between uniform measures on the probability multisets {p} and {q}. We will consider both the sorted and unsorted ℓ_1 distances.
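Since the optimal ℓ_1 matching of two real multisets pairs their values in sorted order, the sorted distance reduces to sorting both probability vectors; a minimal sketch (assuming both vectors have the same length, padding with zeros otherwise):

```python
def sorted_l1_distance(p, q):
    # min over relabelings p' of p of ||p' - q||_1;
    # on the real line the optimal matching pairs sorted values
    return sum(abs(a - b) for a, b in zip(sorted(p), sorted(q)))

def l1_distance(p, q):
    # plain (unsorted) l1 distance
    return sum(abs(a - b) for a, b in zip(p, q))
```

The sorted distance is invariant to relabeling: swapping the two symbols of (0.7, 0.3) yields sorted distance 0 but unsorted distance 0.8.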

### Property estimation

Often we would like to estimate a given property f(p) of an unknown distribution p based on a sample X^n ∼ p. A property estimator is a mapping f̂ : X* → ℝ. Analogously, the sample complexity of f̂ in estimating f over a set P ⊆ Δ_X is

 n_f(f̂, P, ε, δ) := min{n : ∀p ∈ P, Pr_{X^n∼p}(|f̂(X^n) − f(p)| ≥ ε) ≤ δ},

the smallest sample size that f̂ requires to estimate f with accuracy ε and confidence 1 − δ, for all distributions in P. The sample complexity of estimating f over P is

 n_f(P, ε, δ) := min{n_f(f̂, P, ε, δ) : f̂ : X* → ℝ},

the lowest sample complexity of any estimator. For simplicity, we will omit P when P = Δ_X, and omit δ when δ = 1/3. The standard “median trick” shows that n_f(P, ε, δ) = O(n_f(P, ε) · log(1/δ)). By convention, we say an estimator f̂ is sample-optimal if n_f(f̂, ε) = Θ(n_f(ε)).
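The median trick can be illustrated by a small simulation (an informal sketch with a toy base estimator, not from the paper): run a constant-confidence base estimator on O(log(1/δ)) independent samples and return the median of its outputs; by a Chernoff bound, the median is accurate except with probability δ.

```python
import random
import statistics

def base_estimator(sample):
    # toy base estimator: empirical mean of a Bernoulli(theta) sample
    return sum(sample) / len(sample)

def median_boosted(theta, n, repetitions, rng):
    # run the base estimator on `repetitions` independent samples of
    # size n each, and return the median of the estimates
    estimates = [
        base_estimator([rng.random() < theta for _ in range(n)])
        for _ in range(repetitions)
    ]
    return statistics.median(estimates)

rng = random.Random(0)
est = median_boosted(theta=0.3, n=100, repetitions=15, rng=rng)
```

The median is accurate as soon as a majority of the repetitions are, which fails only with probability exponentially small in the number of repetitions.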

### Property testing: Identity testing

A closely related problem is distribution property testing, of which identity testing is the most fundamental and well-studied instance (Canonne, 2017; Goldreich, 2017). Given an error parameter ε, a distribution q ∈ Δ_X, and a sample X^n from an unknown distribution p, identity testing

aims to distinguish between the null hypothesis

 H0:p=q

and the alternative hypothesis

 H1:∥p−q∥1≥ε.

A property tester is a mapping t̂ : X* → {0, 1}, indicating whether H_0 or H_1 is accepted. Analogous to the two formulations above, the sample complexity of t̂ is

 n_q(t̂, ε, δ) := min{n : ∀i ∈ {0, 1}, Pr_{X^n∼p}(t̂(X^n) ≠ i ∣ H_i is true) ≤ δ},

and the sample complexity of identity testing with respect to q is

 n_q(ε, δ) := min{n_q(t̂, ε, δ) : t̂ : X* → {0, 1}}.

Again, when δ = 1/3, we will omit it. For q = u, the uniform distribution over X, the problem is also known as uniformity testing.

### 1.3 Profile maximum likelihood

The multiplicity μ_y(x^n) of a symbol y in a sequence x^n is the number of times y appears in x^n. These multiplicities induce an empirical distribution that associates a probability μ_y(x^n)/n with each symbol y.

The prevalence φ_i(x^n) of an integer i ≥ 1 in x^n is the number of symbols appearing i times in x^n. For known n, the value of φ_0 can be deduced from the remaining multiplicities, hence we define the profile of x^n to be φ(x^n) := (φ_1(x^n), φ_2(x^n), …), the vector of all positive prevalences. For example, the profile of “banana” is (1, 1, 1), since exactly one symbol (b) appears once, one (n) appears twice, and one (a) appears three times. Note that the profile of x^n also corresponds to the multiset of multiplicities of distinct symbols in x^n.
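A profile can be computed in two counting passes, first multiplicities and then prevalences; a minimal sketch:

```python
from collections import Counter

def profile(sequence):
    # multiplicity of each distinct symbol
    multiplicities = Counter(sequence)
    # prevalence phi_i: number of symbols appearing exactly i times
    prevalence = Counter(multiplicities.values())
    max_mult = max(prevalence)
    return tuple(prevalence.get(i, 0) for i in range(1, max_mult + 1))
```

The profile discards symbol identities, which is exactly the information a symmetric property does not need.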

For a distribution p, let

 p(x^n) := Pr_{X^n∼p}(X^n = x^n)

be the probability of observing a sequence x^n under i.i.d. sampling from p, and let

 p(φ) := ∑_{y^n : φ(y^n) = φ} p(y^n)

be the probability of observing a profile φ. While the sequence maximum likelihood estimator maps a sequence x^n to its empirical distribution, which maximizes the sequence probability p(x^n), the profile maximum likelihood (PML) estimator (Orlitsky et al., 2004b) over a set P ⊆ Δ_X maps each profile φ to a distribution

 p_φ := argmax_{p∈P} p(φ)

that maximizes the profile probability. Relaxing the optimization objective, for any β ∈ (0, 1], a β-approximate PML estimator (Acharya et al., 2017a) maps each profile φ to a distribution p^β_φ ∈ P such that p^β_φ(φ) ≥ β · p_φ(φ).
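For intuition, PML can be computed by brute force on tiny instances (an illustration only, not the paper's algorithm): enumerate all length-n sequences to evaluate p(φ), then maximize over a grid of candidate distributions. For the profile (2), i.e., two distinct symbols in n = 2 draws from a binary alphabet, p(φ) = 2p(1 − p), so PML returns the uniform distribution:

```python
from itertools import product
from collections import Counter

def profile_of(seq):
    counts = Counter(Counter(seq).values())
    m = max(counts)
    return tuple(counts.get(i, 0) for i in range(1, m + 1))

def profile_probability(p, phi, n):
    # p(phi) = sum of p(y^n) over all length-n sequences with profile phi
    total = 0.0
    for seq in product(range(len(p)), repeat=n):
        if profile_of(seq) == phi:
            prob = 1.0
            for s in seq:
                prob *= p[s]
            total += prob
    return total

# grid search over binary distributions for the profile (2):
# "two symbols, each seen once" in a sample of size n = 2
candidates = [(i / 100, 1 - i / 100) for i in range(101)]
pml = max(candidates, key=lambda p: profile_probability(p, (2,), 2))
```

The enumeration is exponential in n, which is why practical PML computation relies on approximations such as the MCMC-EM algorithm of Section 4.1.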

Originating from the principle of maximum likelihood, PML was proved (Acharya et al., 2012, 2017a; Anevski et al., 2017; Das, 2012; Orlitsky et al., 2004b) to possess a number of useful attributes, such as existence over finite discrete domains, majorization by empirical distributions, consistency for distribution estimation under both sorted and unsorted distances, and competitiveness with other profile-based estimators.

Let ε be an error parameter and f be one of the four properties in Section 1.1. Recent work of Acharya et al. (2017a) showed that for some absolute constant c > 0, if ε ≥ n^{−c} and β ≥ exp(−√n), then a plug-in estimator for f, using a β-approximate PML, is sample-optimal. Motivated by this result, Charikar et al. (2019) constructed an explicit exp(−O(√n log n))-approximate PML (APML) whose computation time is near-linear in n. Combined, these two results provide a unified, sample-optimal, and near-linear-time computable plug-in estimator for the four properties.

## 2 New results and implications

### 2.1 New results

Let f be an additive symmetric property that is 1-Lipschitz on (Δ_X, R). Let ε be an error parameter and n_f(ε) the smallest sample size required by any estimator to achieve accuracy ε with confidence 2/3, for all distributions in Δ_X. For an absolute constant c, if ε ≥ n^{−c}, the following holds.

###### Theorem 1.

The PML plug-in estimator, when given a sample of size 4·n_f(ε) from any distribution p ∈ Δ_X, will estimate f(p) up to an error of 2ε, with probability at least 1 − exp(−Ω(√n)).

For a different constant c, Theorem 1 also holds for APML, which is near-linear-time computable (Charikar et al., 2019). In Section 8, we propose a PML variation called truncated PML (TPML) for which the admissible range of ε can be improved to be near-optimal, for symmetric properties such as Shannon entropy, support coverage, and support size. See Theorems 7, 8, and 9 for details.

#### Rényi entropy estimation

For X of finite size k and any p ∈ Δ_X, it is well-known that H_α(p) ≤ log k for all α ≥ 0. The following theorems characterize the performance of the PML plug-in estimator in estimating Rényi entropy.

For any distribution p ∈ Δ_X, error parameter ε > 0, absolute constant λ ∈ (0, 1), and sample size n, draw a sample X^n ∼ p and denote its profile by φ. Then for sufficiently large n, the following hold.

###### Theorem 2.

For non-integer α > 1,

 Pr(|H_α(p_φ) − H_α(p)| ≥ ε) ≤ exp(−√n).
###### Theorem 3.

For non-integer α ∈ (3/4, 1),

 Pr(|H_α(p_φ) − H_α(p)| ≥ ε) ≤ exp(−n^{1−λ}).
###### Theorem 4.

For integer α > 1, if n = Ω(k^{1−1/α}) with an implicit constant depending on ε,

 Pr(|H_α(p_φ) − H_α(p)| ≥ ε) ≤ 1/3.

With minor modifications, Theorem 2 also holds for APML, with a better probability bound. In addition, Theorem 3 holds for APML without any modifications.

#### Distribution estimation

Let c be the absolute constant defined just prior to Theorem 1. For any distribution p ∈ Δ_X, error parameter ε > 0, and sample size n, draw a sample X^n ∼ p and denote its profile by φ.

###### Theorem 5.

If n = Ω(k/(ε² log k)) and ε ≥ n^{−c},

 Pr(ℓ₁^<(p_φ, p) ≥ ε) ≤ exp(−Ω(√n)).

For a different constant c, Theorem 5 also holds for APML with a better probability bound. In Section 8.3, we show that a simple combination of the TPML and empirical estimators accurately recovers the actual distribution under a variant of the relative earth-mover distance, regardless of the alphabet size k.

#### Identity testing

The recent works of Diakonikolas and Kane (2016) and Goldreich (2016) provided a procedure reducing identity testing to uniformity testing, while modifying the desired accuracy and alphabet size by only absolute constant factors. Hence below we consider uniformity testing.

The uniformity tester shown in Figure 1 is purely based on PML and satisfies

###### Theorem 6.

If n = Ω̃(√k/ε²) and ε is not too small, then the tester will be correct with probability at least 2/3. The tester also distinguishes between p = u and ∥p − u∥₂ ≥ ε/√k.

The Ω̃(·) notation only hides logarithmic factors of k. The tester is near-optimal since, for the uniform distribution q = u, the results in (Diakonikolas et al., 2018) yield an Ω(√k/ε²) lower bound on n_u(ε).
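For context, the ℓ_2-style guarantee can be illustrated with the classical collision-based uniformity tester (a baseline sketch, not the PML tester of Figure 1): the empirical collision rate is an unbiased estimate of ∥p∥₂² = 1/k + ∥p − u∥₂², which equals 1/k exactly when p is uniform and is at least (1 + ε²)/k when ∥p − u∥₂ ≥ ε/√k.

```python
import random
from itertools import combinations

def collision_tester(sample, k, eps):
    """Accept uniformity iff the empirical collision rate, an unbiased
    estimate of ||p||_2^2, is below the threshold (1 + eps^2/2) / k."""
    n = len(sample)
    collisions = sum(1 for a, b in combinations(sample, 2) if a == b)
    rate = collisions / (n * (n - 1) / 2)
    return rate < (1 + eps ** 2 / 2) / k

rng = random.Random(1)
k = 10
uniform_sample = [rng.randrange(k) for _ in range(400)]
biased = [0.55] + [0.05] * 9   # far from uniform in both l1 and l2
biased_sample = rng.choices(range(k), weights=biased, k=400)
```

The quadratic pair enumeration is for clarity only; counting multiplicities gives the same statistic in linear time.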

The rest of the paper is organized as follows. Sections 3 and 4 illustrate PML’s theoretical and practical advantages by comparing it to existing methods for a variety of learning tasks. With the exception of Section 8, which proposes TPML and establishes four new results, Sections 5 to 9 present the proofs of Theorems 1 to 6. Section 10 concludes the paper and outlines future directions.

### 2.2 Implications

Several immediate implications are in order.

Theorem 1 makes PML the first plug-in estimator that is universally sample-optimal for a broad class of distribution properties. In particular, Theorem 1 also covers the four properties considered in (Acharya et al., 2017a). To see this, as mentioned in Section 1.1, Shannon entropy, support coverage, and distance to uniformity are (suitably normalized) 1-Lipschitz on (Δ_X, R); as for support size, the following result (Acharya et al., 2017a) relates it to support coverage for distributions in Δ_{X,≥1/k}, and proves PML’s optimality.

###### Lemma 1.

For any ε > 0, p ∈ Δ_{X,≥1/k}, and m ≥ k log(1/ε),

 |S̃(p) − C̃_m(p)| ≤ ε,

where S̃ := S/k and C̃_m := C_m/k denote the normalized support size and support coverage.

The theorem also applies to many other properties. As an example (Valiant and Valiant, 2011b), given an integer m ≤ k, one can construct an additive property that, to within a factor of two, approximates the ℓ_1 distance between any distribution and the closest uniform distribution of support size m.

In Section 3.2 we compare Theorem 1 with existing results and present more of its implications.

Theorems 2 and 3 imply that for all non-integer α > 3/4, both the PML and the APML plug-in estimators achieve a sample complexity better than the best currently known (Acharya et al., 2017b). This makes both the PML and APML plug-in estimators the state-of-the-art algorithms for estimating non-integer-order Rényi entropy. See Section 3.3 for an overview of known results, and Section 3.4 for a detailed comparison between existing methods and ours.

Theorem 4 shows that for all integer α > 1, the sample complexity of the PML plug-in estimator has optimal dependence (Acharya et al., 2017b; Obremski and Skorski, 2017) on the alphabet size k.

Theorem 5 makes APML the first distribution estimator under sorted ℓ_1 distance that is both near-linear-time computable and sample-optimal for a range of desired accuracies ε beyond inverse polylogarithmic in k. In comparison, existing algorithms (Acharya et al., 2012; Han et al., 2018; Valiant and Valiant, 2017b) either run in time polynomial in the sample size, or are only known to achieve optimal sample complexity for relatively large ε, which is essentially different from the applicable range of ε in Theorem 5. We provide a more detailed comparison in Section 3.6.

Theorem 6 provides the first PML-based uniformity tester with near-optimal sample complexity. As stated, the tester also distinguishes between p = u and ∥p − u∥₂ ≥ ε/√k. This is a stronger guarantee since, by the Cauchy-Schwarz inequality, ∥p − u∥₁ ≥ ε implies ∥p − u∥₂ ≥ ε/√k.

## 3 Related work and comparisons

The study of additive property estimation dates back at least half a century (Carlton, 1969; Good, 1953; Good and Toulmin, 1956) and has steadily grown over the years. For any additive symmetric property f and sequence x^n, the simplest and most widely-used approach uses the empirical (plug-in) estimator that evaluates f at the empirical distribution. While the empirical estimator performs well in the large-sample regime, modern data-science applications often concern high-dimensional data, for which more involved methods have yielded property estimators that are more sample-efficient. For example, for relatively large ε and for f being support size, support coverage, Shannon entropy, or distance to uniformity, recent research (Jiao et al., 2015; Orlitsky et al., 2016; Valiant and Valiant, 2011a, b; Wu and Yang, 2016, 2019) showed that the empirical estimator is optimal only up to multiplicative factors that are logarithmic in k or 1/ε, depending on the property.

Below we classify the methods for deriving the corresponding sample-optimal estimators into two categories, plug-in and approximation, and provide a high-level description. For simplicity of illustration, we assume the error probability is set to the conventional δ = 1/3.

The plug-in approach essentially estimates the unknown probability multiset, which suffices for computing any symmetric property. Besides the empirical and PML estimators, Efron and Thisted (1976) proposed a linear-programming approach that finds a multiset estimate consistent with the sample’s profile. This approach was then adapted and analyzed by Valiant and Valiant (2011a, 2017b), yielding plug-in estimators that achieve near-optimal sample complexities for several of the above properties, and optimal sample complexity for Shannon entropy, when ε is relatively large.

The approximation approach modifies non-smooth segments of the probability function to correct the bias of empirical estimators. A popular modification is to replace those non-smooth segments by their low-degree polynomial approximations and then estimate the modified function. For several properties, including the above four and the power sum P_a(p) := ∑_{x∈X} p(x)^a, where a > 0 is a given parameter, this approach yields property-dependent estimators (Jiao et al., 2015; Orlitsky et al., 2016; Wu and Yang, 2016, 2019) that are sample-optimal for all ε.

More recently, Acharya et al. (2017a) proved the aforementioned results on the PML estimator and made it the first unified, sample-optimal plug-in estimator for support size, support coverage, Shannon entropy, and distance to uniformity, for relatively large ε. Following these advances, Han et al. (2018) refined the linear-programming approach and designed a plug-in estimator that implicitly performs polynomial approximation and is sample-optimal for several of these properties, when ε is relatively large.

### 3.2 Comparison I: Theorem 1 and related property-estimation work

In terms of the estimator’s theoretical guarantee, Theorem 1 is essentially the same as Valiant and Valiant (2011b). However, for each property, ε, and k, (Valiant and Valiant, 2011b) solves a different linear program and constructs a new estimator, which takes polynomial time. On the other hand, both the PML estimator and its near-linear-time computable variant, once computed, can be used to accurately estimate exponentially many properties that are 1-Lipschitz on (Δ_X, R). A similar comparison holds between the PML method and the approximation approach, while the latter is provably sample-optimal for only a few properties. In addition, Theorem 1 shows that the PML estimator often achieves the optimal sample complexity up to a small constant factor, a desired estimator attribute shared by some, but not all, approximation-based estimators (Jiao et al., 2015; Orlitsky et al., 2016; Wu and Yang, 2016, 2019).

In terms of the method and proof technique, Theorem 1 is closest to Acharya et al. (2017a). On the other hand, (Acharya et al., 2017a) establishes the optimality of PML for only four properties, while our result covers a much broader property class. In addition, both the above-mentioned “small constant factor” attribute and the confidence boost from 2/3 to 1 − exp(−Ω(√n)) are unique contributions of this work. The PML plug-in approach is also close in flavor to the plug-in estimators in Valiant and Valiant (2011a, 2017b) and their refinement in Han et al. (2018). On the other hand, as pointed out previously, these plug-in estimators are provably sample-optimal for only a few properties. More specifically, for estimating several of the above properties, the plug-in estimators in (Valiant and Valiant, 2011a, 2017b) achieve sub-optimal sample complexities with regard to the desired accuracy ε; and the estimation guarantee in (Han et al., 2018) is provided in terms of the approximation errors of polynomials that are not directly related to the optimal sample complexities.

### 3.3 Rényi entropy estimation

Motivated by the wide applications of Rényi entropy, heuristic estimators were proposed and studied in the physics literature following Grassberger (1988), and asymptotically consistent estimators were presented and analyzed in the statistical-learning literature (Källberg et al., 2012; Xu and Erdogmus, 2010). For the special case of 1-Rényi (i.e., Shannon) entropy, the works of (Valiant and Valiant, 2011a, b) determined the sample complexity to be Θ(k/(ε log k)).

For general α-Rényi entropy, the best-known results in Acharya et al. (2017b) state that for integer α > 1 and non-integer α > 1, the corresponding sample complexities are roughly k^{1−1/α} and k/log k, respectively. The upper bound for integer α is achieved by an estimator that corrects the bias of the empirical plug-in estimator. To achieve the upper bounds for non-integer orders, one needs to compute a best polynomial approximation of x^α, whose degree and domain both depend on the sample size, and construct a more involved estimator using the approximation approach (Jiao et al., 2015; Wu and Yang, 2016) mentioned in Section 3.1.

### 3.4 Comparison II: Theorem 2 to 4 and related Rényi-entropy-estimation work

Our result shows that a single PML estimate suffices to estimate the Rényi entropy of different orders α. Such adaptiveness to the order parameter is a significant advantage of PML over existing methods. For example, by Theorem 3 and the union bound, one can use a single APML or PML estimate to accurately approximate exponentially many non-integer-order Rényi entropy values, yet still maintain an overall high confidence. By comparison, the estimation heuristic in (Acharya et al., 2017b) requires different polynomial-based estimators for different values of α. In particular, to construct each estimator, one needs to compute a best polynomial approximation of x^α, which is not known to admit a closed-form formula for non-integer α. Furthermore, even for a single α and with a larger sample size, such an estimator is not known to achieve the same level of confidence as PML or APML.

As for the theoretical guarantees, the sample-complexity upper bounds in both Theorems 2 and 3 are better than those mentioned in the previous section. More specifically, for any non-integer α > 1, Theorem 2 yields an upper bound on the sample complexity of the PML plug-in estimator, and for any non-integer α ∈ (3/4, 1), Theorem 3 yields one as well. Both bounds improve on the best currently known.

### 3.5 Distribution estimation

Estimating large-alphabet distributions from their samples is a fundamental statistical-learning tenet. Over the past few decades, distribution estimation has found numerous applications, ranging from natural-language modeling (Chen and Goodman, 1999) to biological research (Armañanzas et al., 2008), and has been studied extensively. Under the classical ℓ_1 and KL losses, existing research (Braess and Sauer, 2004; Kamath et al., 2015) showed that the corresponding sample complexities are Θ(k/ε²) and Θ(k/ε), respectively. Several recent works have investigated the analogous formulation under the sorted ℓ_1 distance, and revealed a lower Θ(k/(ε² log k)) sample complexity. Specifically, under certain conditions, Valiant and Valiant (2017b) and Han et al. (2018) derived sample-optimal estimators using linear programming, and Acharya et al. (2012) showed that PML achieves a sub-optimal sample complexity for relatively large ε.

### 3.6 Comparison III: Theorem 5 and related distribution-estimation work

We compare our results with existing ones from three different perspectives.

Applicable parameter ranges: As shown by (Han et al., 2018), for relatively large ε, the simple empirical estimator is already sample-optimal. Hence we consider the parameter range where ε is small. For the results in (Valiant and Valiant, 2017b) and (Acharya et al., 2012) to hold, ε needs to be at least inverse polylogarithmic in k. On the other hand, Theorem 5 shows that PML and APML are sample-optimal for ε as small as an inverse polynomial. Here, the gap is exponentially large. The result in (Han et al., 2018) applies to the whole nontrivial range of ε, which is larger than the applicable range of our results.

Time complexity: Both the APML and the estimator in (Valiant and Valiant, 2017b) are near-linear-time computable in the sample sizes, while the estimator in (Han et al., 2018) would require polynomial time to be computed.

Statistical confidence: The PML and APML achieve the desired accuracy with an error probability of at most exp(−Ω(√n)). In contrast, the estimator in (Han et al., 2018) is only known to achieve an error probability that decreases polynomially in n. The gap is again exponentially large. The estimator in (Valiant and Valiant, 2017b) admits a stronger error-probability bound than (Han et al., 2018), but one that is still far from ours.

### 3.7 Identity testing

Initiated by the work of (Goldreich and Ron, 2000), identity testing is arguably one of the most important and widely-studied problems in distribution property testing. Over the past two decades, a sequence of works (Acharya et al., 2013, 2015; Batu et al., 2001; Chan et al., 2014; Diakonikolas and Kane, 2016; Diakonikolas et al., 2015, 2018; Goldreich and Ron, 2000; Paninski, 2008; Valiant and Valiant, 2017a) has addressed the sample complexity of this problem and proposed testers with a variety of guarantees. In particular, applying a coincidence-based tester, Paninski (2008) determined the sample complexity of uniformity testing up to constant factors; utilizing a variant of Pearson’s chi-squared statistic, Valiant and Valiant (2017a) resolved the general identity-testing problem. For an overview of related results, we refer interested readers to (Canonne, 2017) and (Goldreich, 2017). The contribution of this work is mainly showing that PML is a unified sample-optimal approach for several related problems and, as shown in Theorem 6, also provides a near-optimal tester for this important testing problem.

## 4 Numerical experiments

A number of different approaches have been taken to computing the PML and its approximations. Among existing works, Acharya et al. (2010) considered exact algebraic computation, Orlitsky et al. (2004a, b) designed an EM algorithm with MCMC acceleration, Vontobel (2012, 2014) proposed a Bethe-approximation heuristic, Anevski et al. (2017) introduced a sieved PML estimator and a stochastic approximation of the associated EM algorithm, and Pavlichin et al. (2017) derived a dynamic-programming approach. Notably and recently, for a sample of size n, Charikar et al. (2019) constructed an explicit exp(−O(√n log n))-approximate PML whose computation time is near-linear in n.

In this section, we first introduce a variant of the MCMC-EM algorithm in (Orlitsky et al., 2004a, b; Pan, 2012) and then demonstrate the efficacy of PML on a variety of learning tasks through experiments.

### 4.1 MCMC-EM algorithm variant

To approximate PML, the work (Orlitsky et al., 2004a) proposed an MCMC-EM algorithm, where MCMC and EM stand for Markov chain Monte Carlo and expectation maximization, respectively. A sketch of the original MCMC-EM algorithm can be found in (Orlitsky et al., 2004a), and a detailed description is available in Chapter 6 of (Pan, 2012). The EM part uses a simple iteration procedure to update the distribution estimates. One can show (Pan, 2012) that it is equivalent to the conventional generalized gradient ascent method. The MCMC part exploits local properties of the update process and accelerates the EM computation. Below we present a variant of this algorithm that often runs faster and is more accurate.

##### Step 1:

We separate the large and small multiplicities. Define a threshold parameter t, whose exact value we suppress for simplicity. For symbols y with multiplicity μ_y ≥ t, estimate their probabilities by the empirical estimates μ_y/n and remove them from the sample. Denote the collection of removed symbols by Y_L and the remaining sample sequence by X^r. In the subsequent steps, we apply the EM-MCMC algorithm to X^r.

The idea is simple: by the Chernoff-type bound for binomial random variables, with high probability, the empirical frequency μ_y/n of a large-multiplicity symbol y is very close to its mean value p(y). Hence for large-multiplicity symbols we can simply use the empirical estimates and focus on estimating the probabilities of small-multiplicity symbols. This is similar to initializing the EM algorithm with the empirical distribution and fixing the large probability estimates through the iterations. However, the approach described here is more efficient.
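Step 1 can be sketched as follows (hypothetical helper names; the threshold t is left as a parameter):

```python
from collections import Counter

def split_by_multiplicity(sample, t):
    """Symbols appearing at least t times receive empirical probability
    estimates and are removed; the rest of the sample is kept for the
    EM-MCMC stage."""
    n = len(sample)
    mult = Counter(sample)
    large = {y: m / n for y, m in mult.items() if m >= t}
    remaining = [y for y in sample if mult[y] < t]
    return large, remaining

large, remaining = split_by_multiplicity("aaaaabbc", t=3)
```

The removed symbols keep their empirical estimates; only the remaining small-multiplicity sample is passed to the EM stage.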

##### Step 2:

We determine a proper alphabet size for the output distribution of the EM algorithm. If the true value k is provided, then we simply use k. Otherwise, we apply the following support size estimator (Acharya et al., 2017a) to X^r:

 Ŝ(X^r) := ∑_{j≥1} (1 − (−(t−1))^j · Pr(L ≥ j)) · φ_j(X^r),

where t is the threshold parameter from Step 1 and L is an independent binomial random variable whose parameters are tuned to the desired accuracy. For any t larger than an absolute constant, the estimator Ŝ achieves the optimal sample complexity in estimating support size, up to constant factors (Acharya et al., 2017a).

##### Step 3:

Apply the MCMC-EM algorithm in (Orlitsky et al., 2004a; Pan, 2012) to X^r with the output alphabet size determined in the previous step, and denote the resulting distribution estimate by p̂_s. (In the experiments, we perform the EM iteration a fixed number of times.) Intuitively, this estimate corresponds to the conditional distribution given that the next observation is a symbol with small probability.

##### Step 4:

Let P_L := ∑_{y∈Y_L} μ_y/n be the total empirical probability of the large-multiplicity symbols. Treat the small-probability estimate p̂_s as a vector and scale it by 1 − P_L. For every symbol y ∈ Y_L, append its empirical estimate μ_y/n to the scaled vector, and return the resulting vector. Note that this vector corresponds to a valid discrete distribution.
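Step 4 can be sketched as follows (hypothetical helper names); the output is a valid distribution because the rescaled small-probability estimate and the appended empirical masses together sum to 1:

```python
def combine(large, small_dist):
    """Scale the small-probability estimate by the mass not claimed by
    large-multiplicity symbols, then append the empirical estimates of
    the large symbols."""
    p_large = sum(large.values())
    return [(1 - p_large) * q for q in small_dist] + list(large.values())

est = combine({"a": 0.625}, [0.5, 0.3, 0.2])
```

Here `large` maps each removed symbol to its empirical estimate (as produced in Step 1) and `small_dist` is the EM output of Step 3.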

##### Algorithm code

The implementation of our algorithm is available at https://github.com/ucsdyi/PML.

For computational efficiency, the program code for the original MCMC-EM algorithm in (Orlitsky et al., 2004a; Pan, 2012) is written in C++, with a file name “MCMCEM.cpp”. The program code for other functions is written in Python3. Note that to execute the program, one should have a 64-bit Windows/Linux system with Python3 installed (64-bit version). In addition, we also use functions provided by “NumPy” and “SciPy”, while the latter is not crucial and can be removed by modifying the code slightly.

Our implementation also makes use of “ctypes”, a built-in foreign language library for Python that allows us to call C++ functions directly. Note that before calling C++ functions in Python, we need to compile the corresponding C++ source files into DLLs or shared libraries. We have compiled and included two such files, one is “MCMCEM.so”, the other is “MCMCEM.dll”.

Functions in “MCMCEM.cpp” can be used separately. To compute a PML estimate, simply call the function “int PML(int MAXSZ=10000, int maximum_EM=20, int EM_n=100)”, where the first parameter specifies an upper bound on the support size of the output distribution, the second provides the maximum number of EM iterations, and the last corresponds to the sample size n. This function takes as input a local file called “proFile”, which contains the profile vector in the format of “1 4 7 10 …”. Specifically, the file “proFile” consists of only space-separated non-negative integers, and the i-th integer represents the value of φ_i. The output is a vector of length at most MAXSZ, and is stored in another local file called “PMLFile”. Each line of the file “PMLFile” contains a non-negative real number, corresponding to a probability estimate.

To perform experiments and save the plots to the directory containing the code, simply execute the file “Main.py”. To avoid further complication, the code compares our estimator with only three other estimators: empirical, empirical with a larger sample size, and improved Good-Turing (Orlitsky and Suresh, 2015) (for distribution estimation under unsorted ℓ_1 distance). The implementation covers all the distributions described in the next section. One can test any of these distributions by including it in “D_List” of the “main()” function. The implementation also covers a variety of learning tasks, such as distribution estimation under sorted and unsorted ℓ_1 distances, and property estimation for Shannon entropy, α-Rényi entropy, support coverage, and support size.

Finally, functions related to distribution and sample generation are available in the file “Samples.py”. Other functions, including the property computation routines, the sorted and unsorted ℓ1 distance functions, and the previously described support size estimator, are contained in the file “Functions.py”.

### 4.2 Experiment distributions

In the following experiments, samples are generated according to six distributions with the same support size k.

Three of them have finite support by definition: the uniform distribution, a two-step distribution in which half of the symbols share one probability value and the other half share another, and a three-step distribution in which the symbols are split into three equally sized groups, each group sharing one of three probability values.

The other three distributions are supported on the positive integers, truncated at k, and re-normalized: a geometric distribution, a Zipf distribution, and a log-series distribution, each with an appropriately chosen parameter.
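The six families above can be constructed as follows; this is our own illustrative sketch, and every parameter value in it (k, the step levels, q, s, θ) is a hypothetical choice, not the one used in the paper’s experiments:

```python
import numpy as np

def step_dist(k, num_steps):
    """Step distribution: symbols split into equal-size groups, with
    group j assigned a probability proportional to j (illustrative levels)."""
    levels = np.repeat(np.arange(1, num_steps + 1), k // num_steps)
    return levels / levels.sum()

def truncate(pmf, k):
    """Truncate a pmf over the positive integers to {1,...,k} and re-normalize."""
    p = np.array([pmf(i) for i in range(1, k + 1)], dtype=float)
    return p / p.sum()

k = 12  # illustrative support size
uniform    = np.full(k, 1.0 / k)
two_step   = step_dist(k, 2)
three_step = step_dist(k, 3)
geometric  = truncate(lambda i, q=0.3: q * (1 - q) ** (i - 1), k)
zipf       = truncate(lambda i, s=1.5: i ** -s, k)
log_series = truncate(lambda i, t=0.5: t ** i / i, k)
```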

### 4.3 Experiment results and details

As shown below, the proposed PML approximation algorithm has exceptional performance.

##### Distribution estimation under ℓ1 distance

We derive a new distribution estimator under the (unsorted) ℓ1 distance by combining the proposed PML computation algorithm with the denoising procedure in (Valiant and Valiant, 2015) and a missing-mass estimator (Orlitsky and Suresh, 2015).

First we describe this distribution estimator, which takes as input a length-n sample from some unknown distribution. An optional additional input is the underlying alphabet.

##### Step 1:

Apply the PML computation algorithm described in Section 4.1 to the sample; the returned vector consists of non-negative real numbers that sum to 1.

##### Step 2:

Employ the following variant of the denoising procedure in (Valiant and Valiant, 2015). Arbitrarily remove a small total probability mass from the entries of the returned vector without making any entry negative, and then augment the vector by a collection of additional low-probability entries. For every multiplicity μ appearing in the sample, assign to all symbols appearing μ times a common probability value, as follows. If μ is sufficiently large, simply assign to each of these symbols the empirical estimate μ/n; otherwise, temporarily associate a weight with each entry of the vector, and assign to each of these symbols the current weighted median of the entries.
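Step 2 relies on a weighted median; a minimal sketch of that subroutine (our helper, not the paper’s released code):

```python
def weighted_median(values, weights):
    """Weighted median: the smallest value v such that the total weight
    of entries <= v reaches at least half of the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v
    return pairs[-1][0]  # fallback for degenerate weights
```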

##### Step 3:

If the underlying alphabet is available, we can estimate the total probability mass of the unseen symbols (a.k.a. the missing mass) with the missing-mass estimator of (Orlitsky and Suresh, 2015).

We equally distribute this probability mass estimate among symbols that do not appear in the sample.
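The logic of Step 3 can be sketched with the classical Good-Turing missing-mass estimate (the fraction of the sample made up of symbols seen exactly once); note this simple variant is our illustration, whereas the paper uses the refined estimator of (Orlitsky and Suresh, 2015):

```python
from collections import Counter

def missing_mass_estimate(sample):
    """Classical Good-Turing estimate of the missing mass:
    (number of symbols appearing exactly once) / (sample size)."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)

def distribute_over_unseen(sample, alphabet):
    """Spread the missing-mass estimate equally over unseen symbols."""
    unseen = [x for x in alphabet if x not in set(sample)]
    if not unseen:
        return {}
    share = missing_mass_estimate(sample) / len(unseen)
    return {x: share for x in unseen}
```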

As shown below, the proposed distribution estimator achieves state-of-the-art performance.

In Figure 2, the horizontal axis reflects the sample size n, and the vertical axis reflects the (unsorted) ℓ1 distance between the true distribution and the estimates, averaged over independent trials. We compare our estimator with three others: the improved Good-Turing estimator (Orlitsky and Suresh, 2015), the empirical estimator, serving as a baseline, and the empirical estimator with a larger sample size. As shown in (Orlitsky and Suresh, 2015), the improved Good-Turing estimator is provably instance-by-instance near-optimal and substantially outperforms other estimators such as the Laplace (add-one) estimator, the Braess-Sauer estimator (Braess and Sauer, 2004), and the Krichevsky-Trofimov estimator (Krichevsky and Trofimov, 1981); hence we do not include those estimators in our comparisons.

As the following plots show, our proposed estimator outperformed the improved Good-Turing estimator in all experiments.

##### Distribution estimation under sorted ℓ1 distance

In Figure 3, the horizontal axis again reflects the sample size, and the vertical axis reflects the sorted ℓ1 distance between the true distribution and the estimates, averaged over independent trials. We compare our estimator with the linear-programming-based estimator proposed by Valiant and Valiant (2017b), with the empirical estimator, and with the empirical estimator with a larger sample size.

We do not include the estimator of (Han et al., 2018), since no implementation is available and, as pointed out by the recent work of Vinayak et al. (2019) (page 7), the approach in (Han et al., 2018) “is quite unwieldy. It involves significant parameter tuning and special treatment for the edge cases.” and “Some techniques …are quite crude and likely lose large constant factors both in theory and in practice.”

As shown in Figure 3, with the exception of the uniform distribution, for which the estimator of Valiant and Valiant (2017b) (VV-LP) is the best and PML is a close second, the PML estimator outperforms VV-LP for all tested distributions. As the underlying distribution becomes more skewed, the improvement of PML over VV-LP grows. For the log-series distribution, the performance of VV-LP is even worse than that of the empirical estimator.

Additionally, the plots demonstrate that PML has more stable performance than VV-LP.

##### Shannon entropy estimation under absolute error

In Figure 4, the horizontal axis reflects the sample size, and the vertical axis reflects the absolute difference between the true entropy values and the estimates, averaged over independent trials. We compare our estimator with two state-of-the-art estimators, WY (Wu and Yang, 2016) and JVHW (Jiao et al., 2015), as well as with the empirical estimator and the empirical estimator with a larger sample size. Additional entropy estimators, such as the Miller-Madow estimator (Carlton, 1969), the best upper bound (BUB) estimator (Paninski, 2003), and the Valiant-Valiant estimator (Valiant and Valiant, 2017b), were compared in (Wu and Yang, 2016; Jiao et al., 2015) and found to perform similarly to or worse than the two estimators above, so we do not include them here. Also, regarding the estimator of Valiant and Valiant (2017b), page 50 of (Yang, 2016) notes that “the performance of linear programming estimator starts to deteriorate when the sample size is very large.”

Note that the alphabet size k is a crucial input to WY, but is not required by either JVHW or our PML algorithm. In the experiments, we provide WY with the true value of k.

As shown in the plots, our estimator performs as well as these state-of-the-art estimators.
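The empirical baseline appearing throughout these comparisons is the plug-in entropy estimator; a minimal sketch (our helper, computing entropy in nats):

```python
import math
from collections import Counter

def empirical_entropy(sample):
    """Plug-in (empirical) Shannon entropy in nats: apply the entropy
    formula to the empirical distribution of the sample."""
    n = len(sample)
    return -sum((c / n) * math.log(c / n) for c in Counter(sample).values())
```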

##### α-Rényi entropy estimation under absolute error

For a distribution p, recall that its α-power sum is P_α(p) := ∑_x p_x^α, implying H_α(p) = log(P_α(p))/(1−α). To establish the sample-complexity upper bounds mentioned in Section 3.3 for non-integer α values, Acharya et al. (2017b) first estimate the power sum using the α-power-sum estimator proposed in (Jiao et al., 2015), and then substitute the estimate into the previous equation. The authors of (Jiao et al., 2015) have implemented this two-step Rényi entropy estimation algorithm. In the experiments, we take samples of a range of sizes and compare our estimator with this implementation, referred to as JVHW, with the empirical estimator, and with the empirical estimator with a larger sample size. According to the results in (Acharya et al., 2017b), the sample complexities for estimating α-Rényi entropy are quite different for non-integer and integer α, hence we consider two representative cases.
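The two-step power-sum route above is easy to sketch with the empirical plug-in in place of the (Jiao et al., 2015) power-sum estimator (our simplification, valid for α ≠ 1):

```python
import math
from collections import Counter

def empirical_renyi(sample, alpha):
    """Plug-in α-Rényi entropy: estimate the power sum P_α = Σ p_x^α
    with empirical probabilities, then apply H_α = log(P_α) / (1 - α)."""
    assert alpha != 1, "α = 1 is the Shannon-entropy limit"
    n = len(sample)
    power_sum = sum((c / n) ** alpha for c in Counter(sample).values())
    return math.log(power_sum) / (1.0 - alpha)
```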

As shown in Figures 5 and 6, our estimator clearly outperformed the one proposed in (Acharya et al., 2017b; Jiao et al., 2015).

We further note that for small sample sizes and several distributions, the estimator in (Acharya et al., 2017b; Jiao et al., 2015) performs significantly worse than ours, and that for large sample sizes it degenerates to the simple empirical plug-in estimator. In comparison, our proposed estimator tracks the performance of the empirical estimator with a larger sample size for nearly all the tested distributions.

## 5 Lipschitz-property estimation

### 5.1 Proof outline of Theorem 1

The proof proceeds as follows. First, fixing the alphabet size, the sample size, and a symmetric additive property that is Lipschitz, we consider a related linear program defined in (Valiant, 2012), and lower bound the worst-case error of any estimator by the linear program’s objective value. Second, following the construction in (Valiant, 2012), we find an explicit estimator that is linear, i.e., whose estimate can be expressed as a linear combination of the profile entries φ_i, and show optimality by upper bounding its worst-case error in terms of the same objective value. Third, we study the concentration of a general linear estimator and, through McDiarmid’s inequality (McDiarmid, 1989), relate the tail probability of its estimate to the estimator’s sensitivity to input changes. Fourth, we bound the sensitivity of the explicit estimator by the maximum difference between its consecutive coefficients, and further bound this difference, showing that the induced estimate concentrates tightly around its expectation. Finally, we invoke the result in (Acharya et al., 2017a) that the PML plug-in estimator is competitive with all profile-based estimators whose estimates are highly concentrated, concluding that PML shares the optimality of the explicit estimator and thereby establishing Theorem 1.

### 5.2 Technical details

Let f be a symmetric additive property that is Lipschitz on the relevant domain. Without loss of generality, we assume a suitable normalization of f.

##### Lower bound

First, fixing the alphabet size, the sample size, and the desired accuracy, we lower bound the worst-case error of any estimator.

Suppose there is an estimator that, when given a length-n sample from any distribution, estimates the property value up to a certain error with high constant probability. Then for any two distributions whose property values differ by more than twice this error, we can use the estimator to distinguish between them, and will be correct with high constant probability.

On the other hand, for suitable parameters, consider the corresponding linear program defined in Linear Program 6.7 in (Valiant, 2012), and consider the objective value of any of its solutions. Then, Proposition 6.8 in (Valiant, 2012) implies that we can find two distributions whose property values differ by an amount comparable to this objective value, yet no algorithm can use the given number of sample points to distinguish these two distributions with high constant probability.

Combining the two arguments above shows that the objective value, which by construction is a function of the fixed parameters, essentially serves as a lower bound on the optimal worst-case estimation error.

##### Upper Bound

Second, fixing the same parameters, we construct an explicit estimator based on the previously mentioned linear program, and show its optimality by upper bounding its worst-case error in terms of the linear program’s objective value.

A property estimator f̂ is linear if there exist real coefficients β_1, …, β_n such that the identity f̂(x^n) = ∑_i β_i·φ_i holds for every length-n sequence x^n, where φ_i denotes the number of symbols appearing exactly i times. The following lemma (Proposition 6.10 in (Valiant, 2012)) bounds the worst-case error of a linear estimator when its coefficients satisfy certain conditions.
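The linearity in the profile entries is concrete and easy to check; below is our illustrative sketch, using the fact that the empirical Shannon entropy is itself a linear estimator with coefficients β_i = (i/n)·log(n/i):

```python
import math
from collections import Counter

def profile(sample):
    """φ_i: number of distinct symbols appearing exactly i times."""
    mults = Counter(Counter(sample).values())
    return [mults.get(i, 0) for i in range(1, len(sample) + 1)]

def linear_estimate(sample, beta):
    """Evaluate a linear estimator: sum_i beta[i-1] * φ_i."""
    return sum(b * phi for b, phi in zip(beta, profile(sample)))

# The empirical entropy written in linear form:
sample = list("aabbbc")
n = len(sample)
beta = [(i / n) * math.log(n / i) for i in range(1, n + 1)]
H_emp = linear_estimate(sample, beta)
```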

###### Lemma 2.

Given any positive integer , and real coefficients , define . Let , and . If for some ,

1. ,

2. for any and such that ,

then, given a sample from any distribution, the linear estimator defined by these coefficients estimates the property with the stated accuracy and failure probability.

Following the construction in (Valiant, 2012) (page 124), let z be the vector of coefficients induced by any solution of the dual of the previously mentioned linear program. For our purposes, the way in which these coefficients are derived is largely irrelevant. With suitable parameters t and α, define

 β_i := (1 − e^{−tnαi}) · f((i+1)αn) / (n(i+1)α) + ∑_{ℓ=0}^{i} z_ℓ (1 − tn)^ℓ α^ℓ (1 − α)^{i−ℓ} (i choose ℓ),

for each index i in the appropriate range. The next lemma shows that we can find proper parameters t and α with which to apply Lemma 2 to the above construction. Specifically,

###### Lemma 3.

For any and some such that , if and satisfy , the two conditions in Lemma 2 hold for the above construction with , , , and . Furthermore, for any , we have .

This lemma differs from the results established in the proof of Proposition 6.19 in (Valiant, 2012) only in the applicable range of the parameters, where the latter assumes a more restrictive condition. For completeness, we present a proof of Lemma 3 in Appendix A.

By Lemmas 2 and 3, whenever the stated condition holds, given a sample from any distribution, the linear estimator estimates the property with the stated accuracy and failure probability. Recall that for fixed parameters the relevant threshold is a constant, and can thus be computed without samples. Furthermore, according to the last claim in Proposition 6.19 in (Valiant, 2012), in the remaining parameter regime the trivial estimator that always returns a fixed value has small error. Hence, with high probability, the combined estimator estimates the property up to the desired error for all possible parameter values.

##### Concentration of linear estimators

Third, we slightly diverge from the previous discussion and study the concentration of general linear estimators.

The sensitivity of a property estimator f̂ for a given input size n is

 s_n(f̂) := max{ f̂(x^n) − f̂(y^n) : x^n and y^n differ in one element },

the maximum change in its value when the input sequence is modified at exactly one location. For any estimator f̂ and sample size n, the following corollary of McDiarmid’s inequality (McDiarmid, 1989) relates the two-sided tail probability of f̂ to s_n(f̂).

###### Lemma 4.

For all ε > 0, we have

 Pr( |f̂(X^n) − E[f̂(X^n)]| ≥ ε ) ≤ 2·exp( −2ε² / (n·s_n(f̂)²) ).
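As a quick numerical illustration of how the sensitivity enters such tail bounds (our sketch; the sample-mean example and all numbers are ours, not the paper’s): for the sample mean of n fair coin flips, changing one flip changes the mean by at most 1/n, so a McDiarmid-type bound gives 2·exp(−2nε²).

```python
import math
import random

# Empirically check the tail bound 2*exp(-2*n*eps^2) for the sample mean
# of n fair coin flips, whose sensitivity is s_n = 1/n.
random.seed(0)
n, eps, trials = 100, 0.1, 2000
exceed = 0
for _ in range(trials):
    mean = sum(random.randint(0, 1) for _ in range(n)) / n
    if abs(mean - 0.5) >= eps:
        exceed += 1
empirical_tail = exceed / trials
mcdiarmid_bound = 2 * math.exp(-2 * n * eps ** 2)  # eps^2/(n*(1/n)^2) = n*eps^2
```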

The next lemma bounds the sensitivity of a linear estimator in terms of the maximum absolute difference between its consecutive coefficients.

###### Lemma 5.

For any n and any linear estimator f̂, the sensitivity s_n(f̂) is bounded in terms of the maximum absolute difference between consecutive coefficients of f̂.

###### Proof.

Let x^n and y^n be two arbitrary sequences over the alphabet that differ in one element, and let j be the index at which they differ. Then by definition, the following multiplicity equalities hold: the multiplicity of the symbol x_j decreases by one when it is replaced, the multiplicity of the symbol y_j increases by one, and the multiplicity of every other symbol is unchanged.

The first multiplicity equality implies and