1 Introduction
1.1 Distributions and their properties
A distribution over a discrete alphabet of size corresponds to an element of the simplex
A distribution property is a mapping associating a real value with each distribution, for example, its support size. A distribution property is symmetric if it is invariant under domain-symbol permutations. A symmetric property is additive if it can be written as , where for simplicity we use to denote both the property and the corresponding real function.
Many important symmetric properties are additive. For example,

Distance to uniformity , where is the uniform distribution over , a property central to the field of distribution property testing (Batu et al., 2000, 2001; Canonne, 2017; Ron, 2010).
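To make additivity concrete, a minimal Python sketch (our names, not the paper's) evaluates an additive symmetric property as a sum of a per-probability function, with Shannon entropy and support size as examples:

```python
import math

def additive_property(p, f):
    """Evaluate an additive symmetric property F(p) = sum over x of f(p_x)."""
    return sum(f(px) for px in p)

# Shannon entropy: f(t) = -t*log(t), with f(0) = 0.
entropy = lambda t: -t * math.log(t) if t > 0 else 0.0
# Support size: f(t) = 1 if t > 0, else 0.
support = lambda t: 1 if t > 0 else 0

p = [0.5, 0.25, 0.25, 0.0]
print(additive_property(p, entropy))  # ~1.0397 nats
print(additive_property(p, support))  # 3
```

Symmetry holds because a sum over symbols is unchanged by relabeling the domain.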
Besides being symmetric and additive, these four properties have yet another attribute in common. Under the appropriate interpretation, they are all Lipschitz. Specifically, for two distributions , let be the collection of distributions over with marginals and on the first and second factors, respectively. The relative earth-mover distance (Valiant and Valiant, 2011b) between and is
One can verify (Valiant and Valiant, 2011b, 2015) that , , and are all Lipschitz on the metric space , and is Lipschitz over , the set of distributions in whose nonzero probabilities are at least . We will study all such Lipschitz properties in later sections.
An important symmetric non-additive property is Rényi entropy , a well-known measure of randomness with numerous applications to unsupervised learning (Jenssen et al., 2003; Xu, 1999) and image registration (Ma et al., 2000; Neemuchwala et al., 2006). For a distribution and a nonnegative real parameter , the Rényi entropy (Rényi, 1961) of is . In particular, the Rényi entropy of order 1, denoted by , is exactly Shannon entropy (Rényi, 1961).
1.2 Problems of interest
We consider the following three fundamental statistical-learning problems.
Distribution estimation
A natural learning problem is to estimate an unknown distribution from an i.i.d. sample . For any two distributions , let be the loss when we approximate by . A distribution estimator associates every sequence with a distribution . We measure the performance of an estimator by its sample complexity
the smallest sample size that requires to estimate all distributions in to a desired accuracy , with error probability . The sample complexity of distribution estimation over is
the lowest sample complexity of any estimator. For simplicity, we will omit when .
For a distribution , we denote by the multiset of its probabilities. The sorted distance between two distributions is
the smallest distance between and any sorted version of . As illustrated in Section 7.1, this is essentially the Wasserstein distance between uniform measures on the probability multisets and . We will consider both the sorted and unsorted distances.
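A short illustrative sketch (names ours) of the sorted distance under the ℓ1 loss: sorting both probability multisets in decreasing order, padding the shorter with zeros, and summing coordinate-wise differences attains the minimum over relabelings:

```python
def sorted_l1(p, q):
    """l1 distance after sorting both probability multisets in decreasing order."""
    p, q = sorted(p, reverse=True), sorted(q, reverse=True)
    # Pad the shorter multiset with zeros so the vectors align.
    n = max(len(p), len(q))
    p += [0.0] * (n - len(p))
    q += [0.0] * (n - len(q))
    return sum(abs(a - b) for a, b in zip(p, q))

print(sorted_l1([0.5, 0.5], [0.5, 0.25, 0.25]))  # 0.5
```

By the rearrangement inequality, the decreasing order minimizes the ℓ1 distance over all pairings of the two multisets.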
Property estimation
Often we would like to estimate a given property of an unknown distribution based on a sample . A property estimator is a mapping . Analogously, the sample complexity of in estimating over a set is
the smallest sample size that requires to estimate with accuracy and confidence , for all distributions in . The sample complexity of estimating over is
the lowest sample complexity of any estimator. For simplicity, we will omit when , and omit when . The standard “median trick” shows that . By convention, we say an estimator is sample-optimal if .
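The median trick mentioned above runs a constant-confidence estimator on independent samples and reports the median, so that the number of repetitions needed grows only logarithmically in the inverse error probability. A toy Python illustration (the base estimator here is a hypothetical stand-in, not an estimator from the paper):

```python
import random
import statistics

def base_estimator(sample):
    """A stand-in constant-confidence estimator: the sample mean."""
    return sum(sample) / len(sample)

def median_trick(draw_sample, num_repeats):
    """Run the base estimator on independent samples and report the median.

    With num_repeats = O(log(1/delta)), a Chernoff bound on the number of
    inaccurate repetitions drives the error probability down to delta."""
    estimates = [base_estimator(draw_sample()) for _ in range(num_repeats)]
    return statistics.median(estimates)

random.seed(0)
draw = lambda: [random.gauss(1.0, 1.0) for _ in range(100)]
print(median_trick(draw, 15))  # close to the true mean 1.0
```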
Property testing: Identity testing
A closely related problem is distribution property testing, of which identity testing is the most fundamental and well-studied (Canonne, 2017; Goldreich, 2017). Given an error parameter , a distribution , and a sample from an unknown distribution , identity testing
aims to distinguish between the null hypothesis
and the alternative hypothesis
A property tester is a mapping , indicating whether or is accepted. Analogous to the two formulations above, the sample complexity of is
and the sample complexity of identity testing with respect to is
Again, when , we will omit . For , the problem is also known as uniformity testing.
1.3 Profile maximum likelihood
The multiplicity of a symbol in a sequence is , the number of times appears in . These multiplicities induce an empirical distribution that associates a probability with each symbol .
The prevalence of an integer in is the number of symbols appearing times in . For known , the value of can be deduced from the remaining multiplicities; hence we define the profile of to be , the vector of all positive prevalences. For example, . Note that the profile of also corresponds to the multiset of multiplicities of distinct symbols in .
For a distribution , let be the probability of observing a sequence under i.i.d. sampling from , and let be the probability of observing a profile . While the sequence maximum likelihood estimator maps a sequence to its empirical distribution, which maximizes the sequence probability , the profile maximum likelihood (PML) estimator (Orlitsky et al., 2004b) over a set maps each profile to a distribution that maximizes the profile probability. Relaxing the optimization objective, for any , an approximate PML estimator (Acharya et al., 2017a) maps each profile to a distribution such that .
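For intuition, profiles and an exhaustive PML can be computed directly on tiny instances. The following Python sketch (our own illustrative code, exponential-time and restricted to a two-symbol support, not the paper's algorithm) evaluates the profile probability by brute force and maximizes it over a grid:

```python
from collections import Counter
from itertools import product

def profile(sequence):
    """Profile: prevalences[i-1] = number of distinct symbols appearing
    exactly i times in the sequence."""
    mults = Counter(sequence)              # symbol -> multiplicity
    prevalences = Counter(mults.values())  # multiplicity -> prevalence
    return [prevalences.get(i, 0) for i in range(1, max(prevalences) + 1)]

def profile_prob(p, target, n):
    """Profile probability: sum of i.i.d. sequence probabilities over all
    length-n sequences whose profile equals the target."""
    target = list(target)
    total = 0.0
    for seq in product(range(len(p)), repeat=n):
        if profile(seq) == target:
            prob = 1.0
            for x in seq:
                prob *= p[x]
            total += prob
    return total

def brute_force_pml(target, n, grid=101):
    """Exhaustive PML over distributions (q, 1-q) on a 2-symbol support."""
    best_p, best_val = None, -1.0
    for i in range(grid):
        q = i / (grid - 1)
        val = profile_prob([q, 1 - q], target, n)
        if val > best_val:
            best_p, best_val = [q, 1 - q], val
    return sorted(best_p, reverse=True)

print(profile("abracadabra"))      # [2, 2, 0, 0, 1]
print(brute_force_pml([1, 1], 3))  # [0.5, 0.5]
```

For the profile of a length-3 sample with one doubleton and one singleton, the profile probability under (q, 1-q) is 3q(1-q), maximized by the uniform pair.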
Originating from the principle of maximum likelihood, PML was shown (Acharya et al., 2012, 2017a; Anevski et al., 2017; Das, 2012; Orlitsky et al., 2004b) to possess a number of useful attributes, such as existence over finite discrete domains, majorization by empirical distributions, consistency for distribution estimation under both sorted and unsorted distances, and competitiveness to other profile-based estimators.
Let be an error parameter and be one of the four properties in Section 1.1. Set . Recent work of Acharya et al. (2017a) showed that for some absolute constant , if and , then a plugin estimator for , using an approximate PML, is sample-optimal. Motivated by this result, Charikar et al. (2019) constructed an explicit approximate PML whose computation time is near-linear in . Combined, these two results provide a unified, sample-optimal, and near-linear-time computable plugin estimator for the four properties.
2 New results and implications
2.1 New results
Additive property estimation
Let be an additive symmetric property that is Lipschitz on . Let be an error parameter and , the smallest sample size required by any estimator to achieve accuracy with confidence , for all distributions in . For an absolute constant , if ,
Theorem 1.
The PML plugin estimator, when given a sample of size from any distribution , will estimate up to an error of , with probability at least .
For a different , Theorem 1 also holds for APML, which is near-linear-time computable (Charikar et al., 2019). In Section 8, we propose a PML variation called truncated PML (TPML), for which the lower bound on can be improved to the near-optimal , for symmetric properties such as Shannon entropy, support coverage, and support size. See Theorems 7, 8, and 9 for details.
Rényi entropy estimation
For of finite size and any , it is well-known that . The following theorems characterize the performance of the PML plugin estimator in estimating Rényi entropy.
For any distribution , error parameter , absolute constant , and sampling parameter , draw a sample and denote its profile by . Then for sufficiently large ,
Theorem 2.
For , if ,
Theorem 3.
For non-integer , if ,
Theorem 4.
For integer , if and ,
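For reference, the Rényi entropy functional appearing in Theorems 2 to 4 can be evaluated directly from a distribution; a minimal Python sketch (our own, with the order-1 case handled as the Shannon limit):

```python
import math

def renyi_entropy(p, alpha):
    """Order-alpha Rényi entropy: log(sum of p_x^alpha) / (1 - alpha)."""
    if abs(alpha - 1.0) < 1e-12:
        # alpha = 1: the limit is Shannon entropy.
        return -sum(px * math.log(px) for px in p if px > 0)
    return math.log(sum(px ** alpha for px in p if px > 0)) / (1 - alpha)

p = [0.5, 0.3, 0.2]
# As alpha -> 1, the Rényi entropy converges to Shannon entropy.
print(abs(renyi_entropy(p, 1.0001) - renyi_entropy(p, 1.0)) < 1e-3)  # True
```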
Distribution estimation
Let be the absolute constant defined just prior to Theorem 1. For any distribution , error parameter , and sampling parameter , draw a sample and denote its profile by .
Theorem 5.
If and ,
Identity testing
The recent works of Diakonikolas and Kane (2016) and Goldreich (2016) provided a procedure reducing identity testing to uniformity testing, while modifying the desired accuracy and alphabet size by only absolute constant factors. Hence below we consider uniformity testing.
The uniformity tester shown in Figure 1 is purely based on PML and satisfies
Theorem 6.
If and , then the tester will be correct with probability at least . The tester also distinguishes between and .
The notation hides only logarithmic factors of . The tester is near-optimal since for the uniform distribution , the results in (Diakonikolas et al., 2018) yield an lower bound on .
The rest of the paper is organized as follows. Sections 3 and 4 illustrate PML’s theoretical and practical advantages by comparing it to existing methods for a variety of learning tasks. With the exception of Section 8, which proposes TPML and establishes four new results, Sections 5 to 9 present the proofs of Theorems 1 to 6. Section 10 concludes the paper and outlines future directions.
2.2 Implications
Several immediate implications are in order.
Theorem 1 makes PML the first plugin estimator that is universally sample-optimal for a broad class of distribution properties. In particular, Theorem 1 also covers the four properties considered in (Acharya et al., 2017a). To see this, as mentioned in Section 1.1, , , and are Lipschitz on ; as for , the following result (Acharya et al., 2017a) relates it to for distributions in , and proves PML’s optimality.
Lemma 1.
For any , , and ,
The theorem also applies to many other properties. As an example (Valiant and Valiant, 2011b), given an integer , let . Then to within a factor of two, approximates the distance between any distribution and the closest uniform distribution in of support size .
Theorems 2 and 3 imply that for all non-integer (resp. ), the PML (resp. APML) plugin estimator achieves a sample complexity better than the best currently known (Acharya et al., 2017b). This makes both the PML and APML plugin estimators the state-of-the-art algorithms for estimating non-integer order Rényi entropy. See Section 3.3 for an overview of known results, and Section 3.4 for a detailed comparison between existing methods and ours.
Theorem 4 shows that for all integer , the sample complexity of the PML plugin estimator has optimal dependence (Acharya et al., 2017b; Obremski and Skorski, 2017) on the alphabet size .
Theorem 5 makes APML the first distribution estimator under sorted distance that is both near-linear-time computable and sample-optimal for a range of desired accuracies beyond inversely polylogarithmic in . In comparison, existing algorithms (Acharya et al., 2012; Han et al., 2018; Valiant and Valiant, 2017b) either run in time polynomial in the sample size, or are known to achieve optimal sample complexity only for , which is essentially different from the applicable range of in Theorem 5. We provide a more detailed comparison in Section 3.6.
Theorem 6 provides the first PML-based uniformity tester with near-optimal sample complexity. As stated, the tester also distinguishes between and . This is a stronger guarantee since, by the Cauchy–Schwarz inequality, implies .
3 Related work and comparisons
3.1 Additive property estimation
The study of additive property estimation dates back at least half a century (Carlton, 1969; Good, 1953; Good and Toulmin, 1956) and has steadily grown over the years. For any additive symmetric property and sequence , the simplest and most widely-used approach uses the empirical (plugin) estimator that evaluates
at the empirical distribution. While the empirical estimator performs well in the large-sample regime, modern data-science applications often concern high-dimensional data, for which more involved methods have yielded property estimators that are more sample-efficient. For example, for relatively large and for being , , , or , recent research (Jiao et al., 2015; Orlitsky et al., 2016; Valiant and Valiant, 2011a, b; Wu and Yang, 2016, 2019) showed that the empirical estimator is optimal up to logarithmic factors, namely where is for , and is for the other properties.
Below we classify the methods for deriving the corresponding sample-optimal estimators into two categories, plugin and approximation, and provide a high-level description. For simplicity of illustration, we assume that .
The plugin approach essentially estimates the unknown distribution multiset, which suffices for computing any symmetric property. Besides the empirical and PML estimators, Efron and Thisted (1976) proposed a linear-programming approach that finds a multiset estimate consistent with the sample’s profile. This approach was then adapted and analyzed by Valiant and Valiant (2011a, 2017b), yielding plugin estimators that achieve near-optimal sample complexities for and , and optimal sample complexity for , when is relatively large.
The approximation approach modifies non-smooth segments of the probability function to correct the bias of empirical estimators. A popular modification is to replace those non-smooth segments by their low-degree polynomial approximations and then estimate the modified function. For several properties, including the above four and the power sum , where is a given parameter, this approach yields property-dependent estimators (Jiao et al., 2015; Orlitsky et al., 2016; Wu and Yang, 2016, 2019) that are sample-optimal for all .
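As a concrete baseline for the discussion above, the empirical plugin estimator admits a short Python sketch (illustrative names; Shannon entropy as the example property):

```python
import math
from collections import Counter

def empirical_plugin(sample, f):
    """Evaluate an additive property F(p) = sum over x of f(p_x)
    at the empirical distribution of the sample."""
    n = len(sample)
    return sum(f(m / n) for m in Counter(sample).values())

# Example: empirical (plugin) estimate of Shannon entropy.
ent = lambda t: -t * math.log(t) if t > 0 else 0.0
print(empirical_plugin("abracadabra", ent))
```

The bias of this estimator for non-smooth f is exactly what the approximation approach corrects.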
More recently, Acharya et al. (2017a) proved the aforementioned results on the PML estimator, making it the first unified, sample-optimal plugin estimator for , , and and relatively large . Following these advances, Han et al. (2018) refined the linear-programming approach and designed a plugin estimator that implicitly performs polynomial approximation and is sample-optimal for , , and with , when is relatively large.
3.2 Comparison I: Theorem 1 and related property-estimation work
In terms of the estimator’s theoretical guarantee, Theorem 1 is essentially the same as Valiant and Valiant (2011b). However, for each property, , and , (Valiant and Valiant, 2011b) solves a different linear program and constructs a new estimator, which takes polynomial time. On the other hand, both the PML estimator and its near-linear-time computable variant, once computed, can be used to accurately estimate exponentially many properties that are Lipschitz on . A similar comparison holds between the PML method and the approximation approach, while the latter is provably sample-optimal for only a few properties. In addition, Theorem 1 shows that the PML estimator often achieves the optimal sample complexity up to a small constant factor, a desired attribute shared by some, but not all, approximation-based estimators (Jiao et al., 2015; Orlitsky et al., 2016; Wu and Yang, 2016, 2019).
In terms of the method and proof technique, Theorem 1 is closest to Acharya et al. (2017a). On the other hand, (Acharya et al., 2017a) establishes the optimality of PML for only four properties, while our result covers a much broader property class. In addition, both the above-mentioned “small constant factor” attribute and the confidence boost from to are unique contributions of this work. The PML plugin approach is also close in flavor to the plugin estimators in Valiant and Valiant (2011a, 2017b) and their refinement in Han et al. (2018). On the other hand, as pointed out previously, these plugin estimators are provably sample-optimal for only a few properties. More specifically, for estimating , , and , the plugin estimators in (Valiant and Valiant, 2011a, 2017b) achieve suboptimal sample complexities with regard to the desired accuracy , and the estimation guarantee in (Han et al., 2018) is stated in terms of the approximation errors of polynomials that are not directly related to the optimal sample complexities.
3.3 Rényi entropy estimation
Motivated by the wide applications of Rényi entropy, heuristic estimators were proposed and studied in the physics literature following Grassberger (1988), and asymptotically consistent estimators were presented and analyzed in the statistical-learning literature (Källberg et al., 2012; Xu and Erdogmuns, 2010). For the special case of 1-Rényi (or Shannon) entropy, the works of Valiant and Valiant (2011a, b) determined the sample complexity to be .
For general Rényi entropy, the best-known results in Acharya et al. (2017b) state that for integer and non-integer values, the corresponding sample complexities are and , respectively. The upper bounds for integer are achieved by an estimator that corrects the bias of the empirical plugin estimator. To achieve the upper bounds for non-integer values, one needs to compute a best polynomial approximation of , whose degree and domain both depend on , and construct a more involved estimator using the approximation approach (Jiao et al., 2015; Wu and Yang, 2016) mentioned in Section 3.1.
3.4 Comparison II: Theorems 2 to 4 and related Rényi-entropy-estimation work
Our result shows that a single PML estimate suffices to estimate the Rényi entropy of different orders . Such adaptiveness to the order parameter is a significant advantage of PML over existing methods. For example, by Theorem 3 and the union bound, one can use a single APML or PML to accurately approximate exponentially many non-integer order Rényi entropy values, while still maintaining an overall confidence of . By comparison, the estimation heuristic in (Acharya et al., 2017b) requires different polynomial-based estimators for different values. In particular, to construct each estimator, one needs to compute a best polynomial approximation of , which is not known to admit a closed-form formula for . Furthermore, even for a single and with a sample size times larger, such an estimator is not known to achieve the same level of confidence as PML or APML.
As for the theoretical guarantees, the sample-complexity upper bounds in both Theorems 2 and 3 are better than those mentioned in the previous section. More specifically, for any and , Theorem 2 shows that . Analogously, for any non-integer and , Theorem 3 shows that . Both bounds are better than the best currently known by a factor.
3.5 Distribution estimation
Estimating large-alphabet distributions from their samples is a fundamental statistical-learning task. Over the past few decades, distribution estimation has found numerous applications, ranging from natural language modeling (Chen and Goodman, 1999) to biological research (Armañanzas et al., 2008), and has been studied extensively. Under the classical and KL losses, existing research (Braess and Sauer, 2004; Kamath et al., 2015) showed that the corresponding sample complexities are and , respectively. Several recent works have investigated the analogous formulation under sorted distance, and revealed a lower sample complexity of . Specifically, under certain conditions, Valiant and Valiant (2017b) and Han et al. (2018) derived sample-optimal estimators using linear programming, and Acharya et al. (2012) showed that PML achieves a suboptimal sample complexity for relatively large .
3.6 Comparison III: Theorem 5 and related distribution-estimation work
We compare our results with existing ones from three different perspectives.
Applicable parameter ranges: As shown by (Han et al., 2018), for , the simple empirical estimator is already sample-optimal. Hence we consider the parameter range . For the results in (Valiant and Valiant, 2017b) and (Acharya et al., 2012) to hold, we would need to be at least . On the other hand, Theorem 5 shows that PML and APML are sample-optimal for larger than . Here, the gap is exponentially large. The result in (Han et al., 2018) applies to the whole range , which is larger than the applicable range of our results.
Time complexity: Both the APML and the estimator in (Valiant and Valiant, 2017b) are near-linear-time computable in the sample size, while the estimator in (Han et al., 2018) requires polynomial time to compute.
Statistical confidence: The PML and APML achieve the desired accuracy with an error probability of at most . In contrast, the estimator in (Han et al., 2018) is known to achieve an error probability that decreases only as . The gap is again exponentially large. The estimator in (Valiant and Valiant, 2017b) admits an error probability bound of , which is still far from ours.
3.7 Identity testing
Initiated by the work of Goldreich and Ron (2000), identity testing is arguably one of the most important and widely-studied problems in distribution property testing. Over the past two decades, a sequence of works (Acharya et al., 2013, 2015; Batu et al., 2001; Chan et al., 2014; Diakonikolas and Kane, 2016; Diakonikolas et al., 2015, 2018; Goldreich and Ron, 2000; Paninski, 2008; Valiant and Valiant, 2017a) has addressed the sample complexity of this problem and proposed testers with a variety of guarantees. In particular, applying a coincidence-based tester, Paninski (2008) determined the sample complexity of uniformity testing up to constant factors; utilizing a variant of Pearson’s chi-squared statistic, Valiant and Valiant (2017a) resolved the general identity testing problem. For an overview of related results, we refer interested readers to (Canonne, 2017) and (Goldreich, 2017). The main contribution of this work is showing that PML is a unified sample-optimal approach for several related problems and, as shown in Theorem 6, also provides a near-optimal tester for this important testing problem.
4 Numerical experiments
A number of different approaches have been taken to computing the PML and its approximations. Among the existing works, Acharya et al. (2010) considered exact algebraic computation, Orlitsky et al. (2004a, b) designed an EM algorithm with MCMC acceleration, Vontobel (2012, 2014) proposed a Bethe approximation heuristic, Anevski et al. (2017) introduced a sieved PML estimator and a stochastic approximation of the associated EM algorithm, and Pavlichin et al. (2017) derived a dynamic programming approach. Notably and recently, for a sample of size , Charikar et al. (2019) constructed an explicit approximate PML whose computation time is near-linear in .
In this section, we first introduce a variant of the MCMC-EM algorithm in (Orlitsky et al., 2004a, b; Pan, 2012) and then demonstrate the efficacy of PML on a variety of learning tasks through experiments.
4.1 MCMC-EM algorithm variant
To approximate PML, the work (Orlitsky et al., 2004a) proposed an MCMC-EM algorithm, where MCMC and EM stand for Markov chain Monte Carlo and expectation maximization, respectively. A sketch of the original MCMC-EM algorithm can be found in (Orlitsky et al., 2004a), and a detailed description is available in Chapter 6 of (Pan, 2012). The EM part uses a simple iterative procedure to update the distribution estimates; one can show (Pan, 2012) that it is equivalent to the conventional generalized gradient ascent method. The MCMC part exploits local properties of the update process and accelerates the EM computation. Below we present a variant of this algorithm that often runs faster and is more accurate.
Step 1:
We separate the large and small multiplicities. Define a threshold parameter and suppress in for simplicity. For symbols with , estimate their probabilities by and remove them from the sample. Denote the collection of removed symbols by and the remaining sample sequence by . In the subsequent steps, we apply the MCMC-EM algorithm to .
The idea is simple: by the Chernoff-type bound for binomial random variables, with high probability, the empirical frequency of a large-multiplicity symbol is very close to its mean value . Hence for large-multiplicity symbols we can simply use the empirical estimates and focus on estimating the probabilities of small-multiplicity symbols. This is similar to initializing the EM algorithm with the empirical distribution and fixing the large probability estimates through the iterations; however, the approach described here is more efficient.
Step 2:
We determine a proper alphabet size for the output distribution of the EM algorithm. If the true value is provided, then we simply use . Otherwise, we apply the following support size estimator (Acharya et al., 2017a) to :
where and is an independent binomial random variable with support size and success probability . For any larger than an absolute constant, this estimator achieves the optimal sample complexity for estimating the support size, up to constant factors (Acharya et al., 2017a).
Step 3:
Apply the MCMC-EM algorithm in (Orlitsky et al., 2004a; Pan, 2012) to with the output alphabet size determined in the previous step, and denote the resulting distribution estimate by . (In the experiments, we perform the EM iteration for times.) Intuitively, this estimate corresponds to the conditional distribution given that the next observation is a symbol with small probability.
Step 4:
Let be the total probability of the largemultiplicity symbols. Treat as a vector and let . For every symbol , append to , and return the resulting vector. Note that this vector corresponds to a valid discrete distribution.
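Steps 1 to 4 can be sketched as a wrapper around any estimator for the small-probability part. In the illustrative Python sketch below (our own code, not the paper's implementation), the MCMC-EM core is replaced by a placeholder, the empirical distribution, and the threshold is an arbitrary choice:

```python
from collections import Counter

def split_and_recombine(sample, small_prob_estimator, threshold):
    """Steps 1-4: empirical estimates for large-multiplicity symbols,
    a plugin estimator for the rest, recombined into one distribution."""
    n = len(sample)
    mults = Counter(sample)
    # Step 1: peel off large-multiplicity symbols via empirical estimates.
    large = {x: m / n for x, m in mults.items() if m >= threshold}
    rest = [x for x in sample if x not in large]
    # Step 3 (stubbed): estimate the small-probability part from the
    # remaining subsample.
    small = small_prob_estimator(rest) if rest else []
    # Step 4: rescale the small part so that the total mass is 1.
    mass_large = sum(large.values())
    small = [(1 - mass_large) * q for q in small]
    return sorted(list(large.values()) + small, reverse=True)

# Placeholder for the MCMC-EM core: just the empirical distribution.
empirical = lambda seq: [m / len(seq) for m in Counter(seq).values()]
p_hat = split_and_recombine("aaaaabbcdd", empirical, threshold=4)
print(p_hat)  # [0.5, 0.2, 0.2, 0.1]
```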
Algorithm code
The implementation of our algorithm is available at https://github.com/ucsdyi/PML.
For computational efficiency, the program code for the original MCMC-EM algorithm in (Orlitsky et al., 2004a; Pan, 2012) is written in C++, in the file “MCMCEM.cpp”. The program code for the other functions is written in Python 3. Note that to execute the program, one should have a 64-bit Windows/Linux system with a 64-bit version of Python 3 installed. In addition, we also use functions provided by “NumPy” and “SciPy”; the latter is not crucial and can be removed by modifying the code slightly.
Our implementation also makes use of “ctypes”, a built-in foreign-function library for Python that allows us to call C++ functions directly. Note that before calling C++ functions in Python, we need to compile the corresponding C++ source files into DLLs or shared libraries. We have compiled and included two such files: “MCMCEM.so” and “MCMCEM.dll”.
Functions in “MCMCEM.cpp” can be used separately. To compute a PML estimate, simply call the function “int PML(int MAXSZ=10000, int maximum_EM=20, int EM_n=100)”, where the first parameter specifies an upper bound on the support size of the output distribution, the second provides the maximum number of EM iterations, and the last corresponds to the sample size . This function takes as input a local file called “proFile”, which contains the profile vector in the format “1 4 7 10 …”. Specifically, the file “proFile” consists of only space-separated nonnegative integers, and the th integer represents the value of . The output is a vector of length at most MAXSZ, stored in another local file called “PMLFile”. Each line of “PMLFile” contains a nonnegative real number, corresponding to a probability estimate.
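Based on the interface described above, the C++ routine might be driven from Python roughly as follows. The library name and file formats come from the text; the helper functions are our own sketch, and the actual call is guarded since it requires the compiled shared library:

```python
import ctypes
import os

def write_profile(prevalences, path="proFile"):
    """Write the profile as space-separated integers; the i-th integer
    is the prevalence of multiplicity i."""
    with open(path, "w") as f:
        f.write(" ".join(str(v) for v in prevalences))

def read_pml(path="PMLFile"):
    """Read the PML estimate: one probability per line."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

def run_pml(prevalences, max_support=10000, max_em_iters=20):
    """Call int PML(int MAXSZ, int maximum_EM, int EM_n) from the shared library."""
    n = sum(i * v for i, v in enumerate(prevalences, start=1))  # sample size
    write_profile(prevalences)
    lib = ctypes.CDLL("./MCMCEM.so")  # "MCMCEM.dll" on Windows
    lib.PML(max_support, max_em_iters, n)
    return read_pml()

if os.path.exists("./MCMCEM.so"):
    print(run_pml([2, 2, 0, 0, 1]))  # profile of "abracadabra"
```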
To perform experiments and save the plots to the directory containing the code, simply execute the file “Main.py”. To avoid further complication, the code compares our estimator with only three other estimators: empirical, empirical with a larger sample size, and improved Good-Turing (Orlitsky and Suresh, 2015) (for distribution estimation under unsorted distance). The implementation covers all the distributions described in the next section. One can test any of these distributions by including it in “D_List” of the “main()” function. The implementation also covers a variety of learning tasks, such as distribution estimation under sorted and unsorted distances, and property estimation for Shannon entropy, Rényi entropy, support coverage, and support size.
Finally, functions related to distribution and sample generation are available in the file “Samples.py”. Others, including the property computation functions, the sorted and unsorted distance functions, and the previously-described support size estimator, are contained in the file “Functions.py”.
4.2 Experiment distributions
In the following experiments, samples are generated according to six distributions with the same support size .
Three of them have finite support by definition: the uniform distribution; a two-step distribution with half the symbols having probability and the other half having probability ; and a three-step distribution with one third of the symbols having probability , another third having probability , and the remaining third having probability .
The other three distributions are over , truncated at and renormalized: a geometric distribution with parameter satisfying , a Zipf distribution with parameter satisfying , and a log-series distribution with parameter satisfying .
4.3 Experiment results and details
As shown below, the proposed PML approximation algorithm has exceptional performance.
Distribution estimation under distance
We derive a new distribution estimator under the (unsorted) distance by combining the proposed PML computation algorithm with the denoising procedure in (Valiant and Valiant, 2015) and a missing-mass estimator (Orlitsky and Suresh, 2015).
First we describe this distribution estimator, which takes a sample from some unknown distribution . An optional input is , the underlying alphabet.
Step 1:
Apply the PML computation algorithm described in Section 4.1 to , and denote the returned vector, consisting of nonnegative real numbers that sum to , by .
Step 2:
Employ the following variant of the denoising procedure in (Valiant and Valiant, 2015). Arbitrarily remove a total probability mass of from entries of the vector without making any entry negative. Then for each , augment the vector by entries of probability . For every multiplicity appearing in the sample, assign to all symbols appearing times the following probability value: if , simply assign to each of these symbols the empirical estimate ; otherwise, temporarily associate a weight of with each entry in , and assign to each of these symbols the current weighted median of .
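The weighted-median assignment in this step can be sketched as follows (an illustrative helper, not the authors' code):

```python
def weighted_median(values, weights):
    """Smallest value v such that the total weight of entries <= v
    reaches half of the overall weight."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    cumulative = 0.0
    for v, w in pairs:
        cumulative += w
        if cumulative >= total / 2:
            return v
    return pairs[-1][0]

# Entries 0.1 and 0.4 with weights 1 and 3: the median sits at 0.4.
print(weighted_median([0.1, 0.4], [1.0, 3.0]))  # 0.4
```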
Step 3:
If is available, we can estimate the total probability mass of the unseen symbols (a.k.a., the missing mass) by the following estimator:
We equally distribute this probability mass estimate among symbols that do not appear in the sample.
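As an illustration of this step, the following Python sketch uses the classical Good-Turing missing-mass estimate (the number of singletons divided by the sample size) as a stand-in; the paper's estimator, stated above, may differ:

```python
from collections import Counter

def distribute_missing_mass(sample, alphabet):
    """Estimate the missing mass (Good-Turing stand-in: prevalence of
    singletons over the sample size) and spread it uniformly over the
    unseen symbols."""
    n = len(sample)
    mults = Counter(sample)
    singletons = sum(1 for m in mults.values() if m == 1)
    missing_mass = singletons / n
    unseen = [x for x in alphabet if x not in mults]
    if not unseen:
        return {}
    return {x: missing_mass / len(unseen) for x in unseen}

print(distribute_missing_mass("aabbc", "abcde"))  # {'d': 0.1, 'e': 0.1}
```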
As shown below, the proposed distribution estimator achieves the stateoftheart performance.
In Figure 2, the horizontal axis reflects the sample size , ranging from to , and the vertical axis reflects the (unsorted) distance between the true distribution and the estimates, averaged over independent trials. We compare our estimator with three others: the improved Good-Turing estimator (Orlitsky and Suresh, 2015), the empirical estimator, serving as a baseline, and the empirical estimator with a larger sample size. Note that is roughly . As shown in (Orlitsky and Suresh, 2015), the improved Good-Turing estimator is provably instance-by-instance near-optimal and substantially outperforms other estimators such as the Laplace (add) estimator, the Braess-Sauer estimator (Braess and Sauer, 2004), and the Krichevsky-Trofimov estimator (Krichevsky and Trofimov, 1981). Hence we do not include those estimators in our comparisons.
As the following plots show, our proposed estimator outperformed the improved GoodTuring estimator in all experiments.
Distribution estimation under sorted distance
In Figure 3, the sample size ranges from to , and the vertical axis reflects the sorted distance between the true distribution and the estimates, averaged over independent trials. We compare our estimator with the estimator proposed by Valiant and Valiant (2017b), which utilizes linear programming, with the empirical estimator, and with the empirical estimator with a larger sample size.
We do not include the estimator in (Han et al., 2018) since there is no implementation available, and as pointed out by the recent work of (Vinayak et al., 2019) (page 7), the approach in (Han et al., 2018) “is quite unwieldy. It involves significant parameter tuning and special treatment for the edge cases.” and “Some techniques …are quite crude and likely lose large constant factors both in theory and in practice.”
As shown in Figure 3, with the exception of the uniform distribution, for which the estimator in Valiant and Valiant (2017b) (VVLP) is the best and PML is a close second, the PML estimator outperforms VVLP on all other tested distributions. As the underlying distribution becomes more skewed, the improvement of PML over VVLP grows. For the log-series distribution, the performance of VVLP is even worse than that of the empirical estimator.
The plots also demonstrate that PML performs more stably than VVLP.
Shannon entropy estimation under absolute error
In Figure 4, the sample size ranges from to , and the vertical axis reflects the absolute difference between the true entropy values and the estimates, averaged over independent trials. We compare our estimator with two state-of-the-art estimators, WY (Wu and Yang, 2016) and JVHW (Jiao et al., 2015), as well as the empirical estimator and the empirical estimator with a larger sample size. Additional entropy estimators, such as the Miller-Madow estimator (Carlton, 1969), the best upper bound (BUB) estimator (Paninski, 2003), and the Valiant-Valiant estimator (Valiant and Valiant, 2017b), were compared in (Wu and Yang, 2016; Jiao et al., 2015) and found to perform similarly to or worse than the two estimators above, so we do not include them here. Also, regarding Valiant and Valiant (2017b), page 50 of (Yang, 2016) notes that “the performance of linear programming estimator starts to deteriorate when the sample size is very large.”
Note that the alphabet size is a crucial input to WY, but is not required by either JVHW or our PML algorithm. In the experiments, we provide WY with the true value of .
As shown in the plots, our estimator performs as well as these state-of-the-art estimators.
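For reference, the empirical baseline in these plots is the plug-in estimate: the Shannon entropy of the empirical distribution. A minimal sketch, in nats (the function name is ours):

```python
import math
from collections import Counter

def empirical_entropy(sample):
    """Plug-in Shannon entropy: entropy of the empirical distribution, in nats."""
    n = len(sample)
    # zero-count symbols contribute nothing, so iterating over observed counts suffices
    return -sum((c / n) * math.log(c / n) for c in Counter(sample).values())

# A uniform sample over two symbols gives log 2 nats:
# empirical_entropy(['a', 'b']) returns math.log(2)
```

The plug-in estimate is biased downward for small samples, which is why bias-corrected estimators such as WY and JVHW outperform it in the small-sample regime.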
Rényi entropy estimation under absolute error
For a distribution , recall that the power sum of is , implying . To establish the sample-complexity upper bounds mentioned in Section 3.3 for non-integer values, Acharya et al. (2017b) first estimate the using the power-sum estimator proposed in (Jiao et al., 2015), and then substitute the estimate into the previous equation. The authors of (Jiao et al., 2015) have implemented this two-step Rényi entropy estimation algorithm. In the experiments, we take a sample of size , ranging from to , and compare our estimator with this implementation, referred to as JVHW, the empirical estimator, and the empirical estimator with a larger sample size. Note that ranges from to . According to the results in (Acharya et al., 2017b), the sample complexities for estimating Rényi entropy are quite different for and , hence we consider two cases: and .
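The two-step scheme just described, estimating the power sum and then substituting it into the Rényi formula, can be illustrated with the naive empirical power-sum estimate standing in for the bias-corrected estimator of (Jiao et al., 2015); the function names are ours.

```python
import math
from collections import Counter

def empirical_power_sum(sample, alpha):
    """Plug-in estimate of the power sum: sum of empirical probabilities**alpha."""
    n = len(sample)
    return sum((c / n) ** alpha for c in Counter(sample).values())

def renyi_from_power_sum(power_sum, alpha):
    """Renyi entropy of order alpha != 1: log(power sum) / (1 - alpha)."""
    return math.log(power_sum) / (1.0 - alpha)

# Uniform sample over {a, b} with alpha = 2: power sum 0.5, entropy log 2.
h2 = renyi_from_power_sum(empirical_power_sum(['a', 'a', 'b', 'b'], 2), 2)
```

The plug-in power sum is a crude surrogate for the corrected estimator used in the compared implementation, but the second step, the substitution into the Rényi formula, is identical.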
As shown in Figures 5 and 6, our estimator clearly outperforms the one proposed in (Acharya et al., 2017b; Jiao et al., 2015).
We further note that for small sample sizes and several distributions, the estimator in (Acharya et al., 2017b; Jiao et al., 2015) performs significantly worse than ours. Also, for large sample sizes, that estimator degenerates to the simple empirical plug-in estimator. In comparison, our proposed estimator tracks the performance of the empirical estimator with a larger sample size for nearly all the tested distributions.
5 Lipschitz-property estimation
5.1 Proof outline of Theorem 1
The proof proceeds as follows. First, fixing , , and a symmetric additive property that is Lipschitz on , we consider a related linear program defined in (Valiant, 2012), and lower bound the worst-case error of any estimator by the linear program’s objective value, say . Second, following the construction in (Valiant, 2012), we find an explicit estimator that is linear, i.e., can be expressed as a linear combination of ’s, and show its optimality by upper bounding its worst-case error in terms of . Third, we study the concentration of a general linear estimator, and through McDiarmid’s inequality (McDiarmid, 1989), relate the tail probability of its estimate to the estimator’s sensitivity to input changes. Fourth, we bound the sensitivity of by the maximum difference between its consecutive coefficients, and further bound this difference by a function of , showing that the estimate induced by concentrates strongly around its expectation. Finally, we invoke the result in (Acharya et al., 2017a) that the PML plug-in estimator is competitive with all profile-based estimators whose estimates are highly concentrated, concluding that PML shares the optimality of , thereby establishing Theorem 1.
5.2 Technical details
Let be a symmetric additive property that is Lipschitz on . Without loss of generality, we assume that if for some .
Lower bound
First, fixing , , and , we lower bound the worst-case error of any estimator.
Let be a small absolute constant. Suppose there is an estimator that, when given a length sample from any distribution , will estimate up to an error of with probability at least . Then for any two distributions satisfying , we can use to distinguish from , and will be correct with probability at least .
On the other hand, for any parameter and , consider the corresponding linear program defined in Linear Program 6.7 in (Valiant, 2012), and denote by the objective value of any of its solutions. Then, Proposition 6.8 in (Valiant, 2012) implies that we can find two distributions such that , and no algorithm can use sample points to distinguish these two distributions with probability at least .
The previous reasoning yields that . By construction, is a function of and , and essentially serves as a lower bound for .
Upper Bound
Second, fixing , , and , we construct an explicit estimator based on the previously mentioned linear program, and show its optimality by upper bounding its worst-case error in terms of , the linear program’s objective value.
A property estimator is linear if there exist real coefficients such that the identity holds for all . The following lemma (Proposition 6.10 in (Valiant, 2012)) bounds the worstcase error of a linear estimator when its coefficients satisfy certain conditions.
Lemma 2.
Given any positive integer , and real coefficients , define . Let , and . If for some ,

,

for any and such that ,
then given a sample from any , the estimator defined by will estimate with an accuracy of and a failure probability at most .
Following the construction in (Valiant, 2012) (page 124), let be the vector of coefficients induced by any solution of the dual of the previously mentioned linear program. For our purposes, the way in which these coefficients are derived is largely irrelevant. One can show that . Let and , and define
for any , and for . The next lemma shows that we can find proper parameters and to apply Lemma 2 to the above construction. Specifically,
Lemma 3.
For any and some such that , if and satisfy , the two conditions in Lemma 2 hold for the above construction with , , , and . Furthermore, for any , we have .
This lemma differs from the results established in the proof of Proposition 6.19 in (Valiant, 2012) only in the applicable range of , where the latter assumes that . For completeness, we will present a proof of Lemma 3 in Appendix A.
By Lemmas 2 and 3, if , given a sample from any , the linear estimator will estimate with an accuracy of and a failure probability at most . Recall that for fixed and , the value of is a constant, and thus can be computed without samples. Furthermore, according to the last claim in Proposition 6.19 in (Valiant, 2012), for , the estimator that always returns has an error of at most . Hence with high probability, the estimator will estimate up to an error of , for all possible values of .
Concentration of linear estimators
Third, we slightly diverge from the previous discussion and study the concentration of general linear estimators.
The sensitivity of a property estimator for a given input size is
the maximum change in its value when the input sequence is modified in exactly one location. For any and , the following corollary of McDiarmid’s inequality (McDiarmid, 1989) relates the two-sided tail probability of to .
Lemma 4.
For all , we have
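For reference, the standard form of McDiarmid's bounded-differences inequality, which Lemma 4 specializes to the case of a uniform bound given by the estimator's sensitivity, is the following: if $f$ satisfies $|f(x_1,\dots,x_i,\dots,x_n) - f(x_1,\dots,x_i',\dots,x_n)| \le c_i$ for every coordinate $i$ and all arguments, then for all $t > 0$,
\[
\Pr\bigl[\,\lvert f(X_1,\dots,X_n) - \mathbb{E} f(X_1,\dots,X_n)\rvert \ge t\,\bigr] \;\le\; 2\exp\!\left(-\frac{2t^2}{\sum_{i=1}^{n} c_i^2}\right),
\]
where $X_1,\dots,X_n$ are independent. Taking each $c_i$ equal to the sensitivity $s$ makes the right-hand side $2\exp(-2t^2/(n s^2))$.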
Define . The next lemma bounds the sensitivity of a linear estimator in terms of , the maximum absolute difference between its consecutive coefficients.
Lemma 5.
For any and linear estimator , we have .
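The bound in Lemma 5 can be sanity-checked by brute force on tiny instances. The sketch below uses hypothetical coefficients, defines a linear estimator through the multiplicities of the observed symbols, and exhaustively computes the sensitivity; with a zero coefficient prepended for multiplicity 0, the sensitivity is at most twice the maximum gap between consecutive coefficients, since changing one sample element lowers one symbol's multiplicity by one and raises another's by one.

```python
from collections import Counter
from itertools import product

def linear_estimate(sample, beta):
    """Linear estimator: sum, over distinct observed symbols, of the coefficient
    indexed by that symbol's multiplicity (beta[0] is for multiplicity 1)."""
    return sum(beta[m - 1] for m in Counter(sample).values())

def sensitivity(n, alphabet, beta):
    """Brute-force sensitivity: largest estimate change over all length-n
    sequences modified in exactly one position (exponential; toy use only)."""
    worst = 0.0
    for seq in product(alphabet, repeat=n):
        base = linear_estimate(seq, beta)
        for i, sym in product(range(n), alphabet):
            if sym != seq[i]:
                changed = seq[:i] + (sym,) + seq[i + 1:]
                worst = max(worst, abs(linear_estimate(changed, beta) - base))
    return worst
```

On a toy instance, the computed sensitivity indeed stays within twice the maximum consecutive-coefficient gap, matching the form of the lemma.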
Proof.
Let and be two arbitrary sequences over that differ in one element. Let be the index where . Then by definition, the following multiplicity equalities hold: , , and for satisfying . For simplicity of notation, let , , and for any , let .
The first multiplicity equality implies and