1 Distribution Properties
Let denote the collection of distributions over a countable set of finite or infinite cardinality . A distribution property is a mapping . Many applications call for estimating properties of an unknown distribution from its samples. Often these properties are additive, namely, they can be written as a sum of functions of the probabilities. Symmetric additive properties can be written as
and arise in many biological, genomic, and language-processing applications:
- Shannon entropy
, where throughout the paper is the natural logarithm, is the fundamental information measure arising in a variety of applications info .
- Normalized support size
- Normalized support coverage
- Power sum
- Distance to uniformity
, appears in property testing testingu .
More generally, non-symmetric additive properties can be expressed as
for example distances to a given distribution, such as:
- L1 distance
, the distance of the unknown distribution from a given distribution , appears in hypothesis-testing errors testing .
- KL divergence
Given one of these, or other, properties, we would like to estimate its value based on samples from an underlying distribution.
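As a concrete reference for the properties listed above, the following minimal sketch evaluates each of them on a known probability vector. It uses the standard textbook forms; the function names, and the normalization by `k` and `m` for support size and support coverage, are assumptions made for illustration rather than definitions quoted from the text.

```python
import math

def entropy(p):
    """Shannon entropy with the natural logarithm: sum_x p_x * log(1 / p_x)."""
    return sum(-px * math.log(px) for px in p if px > 0)

def normalized_support_size(p, k):
    """Number of symbols with positive probability, divided by k (assumed normalization)."""
    return sum(1 for px in p if px > 0) / k

def normalized_support_coverage(p, m):
    """Expected number of distinct symbols in m samples, divided by m (assumed normalization)."""
    return sum(1.0 - (1.0 - px) ** m for px in p) / m

def power_sum(p, a):
    """Sum of p_x ** a over symbols with positive probability."""
    return sum(px ** a for px in p if px > 0)

def distance_to_uniformity(p):
    """L1 distance between p and the uniform distribution on len(p) symbols."""
    k = len(p)
    return sum(abs(px - 1.0 / k) for px in p)

def l1_distance(p, q):
    """L1 distance between p and a given distribution q on the same symbols."""
    return sum(abs(px - qx) for px, qx in zip(p, q))

def kl_divergence(p, q):
    """KL divergence D(p || q); assumes q_x > 0 wherever p_x > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)
```

Each function is a sum of per-symbol terms, which is what makes these properties additive; the last two depend on the symbol through `q` and are therefore not symmetric.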
2 Recent Results
In the common property-estimation setting, the unknown distribution generates i.i.d. samples , which in turn are used to estimate . Specifically, given property , we would like to construct an estimator such that is as close to as possible. The standard estimation loss is the expected squared loss
Generating exactly samples creates dependence between the number of times different symbols appear. To avoid these dependencies and simplify derivations, we use the well-known Poisson sampling poisamp paradigm. We first select , and then generate independent samples according to
. This modification does not change the statistical nature of the estimation problem, since a Poisson random variable is exponentially concentrated around its mean. Correspondingly, the estimation loss is
For simplicity, let be the number of occurrences of symbol in . An intuitive estimator is the plug-in empirical estimator that first uses the samples to estimate and then estimates as
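The sampling scheme and the plug-in estimator described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names are hypothetical, the Poisson draw uses a simple arrival-counting method chosen only because it needs no external library, and the per-symbol function `f` is assumed to satisfy `f(0) = 0` so that unobserved symbols contribute nothing.

```python
import math
import random
from collections import Counter

def poisson(lam, rng):
    """Draw Poi(lam) by counting unit-rate exponential arrivals in [0, lam]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def poisson_sample_counts(p, n, rng):
    """Draw N ~ Poi(n), then N i.i.d. symbols from p; return (symbol counts, N)."""
    big_n = poisson(n, rng)
    symbols = rng.choices(range(len(p)), weights=p, k=big_n)
    return Counter(symbols), big_n

def empirical_estimate(f, counts, total):
    """Plug-in estimate: sum over observed symbols of f(N_x / total).

    Assumes a symmetric additive property with f(0) = 0.
    """
    if total == 0:
        return 0.0
    return sum(f(c / total) for c in counts.values())
```

For example, with `f = lambda q: -q * math.log(q)` the last function returns the entropy of the empirical distribution. Under this sampling scheme the counts of different symbols are independent Poisson variables, which is the simplification the text refers to.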
Given an error tolerance parameter , the -sample complexity of an estimator in estimating is the smallest number of samples allowing for estimation loss smaller than ,
Since is unknown, the common min-max approach considers the worst case -sample complexity of an estimator over all possible ,
Finally, the estimator minimizing is called the min-max estimator of property , denoted . It follows that is the smallest Poisson parameter , or roughly the number of samples, needed for any estimator to estimate to estimation loss for all .
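To make the notions of estimation loss and sample complexity concrete, the sketch below approximates the expected squared loss by Monte Carlo simulation and searches a grid of Poisson parameters for the smallest one that meets a tolerance. It is an illustration under stated assumptions: it fixes a single distribution `p` rather than taking the worst case over a class, the grid search is a stand-in for the exact minimum, and all function names are hypothetical.

```python
import math
import random

def poisson(lam, rng):
    """Draw Poi(lam) by counting unit-rate exponential arrivals in [0, lam]."""
    count, t = 0, rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count

def plug_in_entropy(counts):
    """Entropy of the empirical distribution (natural logarithm)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log(c / total) for c in counts if c > 0)

def monte_carlo_loss(estimator, true_value, p, n, trials, rng):
    """Average squared error over independent Poisson-sampled count vectors."""
    sq_err = 0.0
    for _ in range(trials):
        # Under Poisson sampling, N_x ~ Poi(n * p_x) independently across symbols.
        counts = [poisson(n * px, rng) for px in p]
        sq_err += (estimator(counts) - true_value) ** 2
    return sq_err / trials

def sample_complexity_on_grid(estimator, true_value, p, eps, grid, trials, rng):
    """Smallest n in the grid whose estimated loss is below eps, or None."""
    for n in sorted(grid):
        if monte_carlo_loss(estimator, true_value, p, n, trials, rng) < eps:
            return n
    return None
```

A min-max analysis would additionally maximize the loss over all distributions in the class, which simulation on one `p` cannot replace.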
There has been a significant amount of recent work on property estimation. In particular, it was shown that for all seven properties mentioned earlier, improves the sample complexity by a logarithmic factor compared to . For example, for Shannon entropy mmentro , normalized support size mmsize , normalized support coverage mmcover , and distance to uniformity mml1 , while . Note that for normalized support size, is typically replaced by , and for normalized support coverage, is replaced by .
3 New Results
While the results already obtained are impressive, they also have some shortcomings. First, recent state-of-the-art estimators are designed mmentro ; mmsize ; mml1 or analyzed mmcover ; valiant to estimate each individual property. Consequently, these estimators cover only a few properties. Second, estimators proposed for more general properties mmcover ; jnew are limited to symmetric properties and are not known to be computable in time linear in the sample size. Last but not least, by design, min-max estimators are optimized for the “worst” distribution in a class. In practice, this distribution is often very different from, and frequently much more complex than, the actual underlying distribution. This “pessimistic” worst-case design results in sub-optimal estimation, as borne out by both the theoretical and experimental results.
In Section 6, we design an estimator that addresses all these issues. It is unified and applies to a wide range of properties, including all previously-mentioned properties ( for power sums) and all Lipschitz properties where each is Lipschitz. It can be computed in time linear in the sample size. It is competitive in that it is guaranteed to perform well not just for the worst distribution in the class, but for each and every distribution. It “amplifies” the data in that it uses just samples to approximate the performance of the empirical estimator with samples regardless of the underlying distribution , thereby providing an off-the-shelf, distribution-independent, “amplification” of the amount of data available relative to the estimators used by many practitioners. As we show in Section 8, it also works well in practice, outperforming existing estimators and often working as well as the empirical estimator with even samples.
For a more precise description, let represent a quantity that vanishes as and write for . Suppressing small for simplicity first, we show that
where the first right-hand-side term relates the performance of with samples to that of with samples. The second term adds a small loss that diminishes at a rate independent of the support size , and for fixed decreases roughly as . Specifically, we prove,
For every property satisfying the smoothness conditions in Section 5, there is a constant such that for all and all ,
The reflects a multiplicative factor unrelated to and . Again, for normalized support size, is replaced by , and we also modify as follows: if , we apply , and if , we apply the corresponding min-max estimator mmsize . However, for experiments shown in Section 8, the original is used without such modification. In Section 7, we note that for several properties, the second term can be strengthened so that it does not depend on .
Theorem 1 has three important implications.
Many modern applications, such as those arising in genomics and natural-language processing, concern properties of distributions whose support size is comparable to or even larger than the number of samples . For these properties, the estimation loss of the empirical estimator is often much larger than , hence the proposed estimator, , yields a much better estimate whose performance parallels that of with samples. This allows us to amplify the available data by a factor of regardless of the underlying distribution.
Note however that for some properties , when the underlying distributions are limited to a fixed small support size, . For such small support sizes, may not improve the estimation loss.
By contrast, is a linear-time estimator that works well for all properties satisfying simple Lipschitz-type and second-order smoothness conditions. All properties described earlier (Shannon entropy, normalized support size, normalized support coverage, power sum, L1 distance, and KL divergence) satisfy these conditions, and therefore applies to all of them.
More generally, recall that a property is Lipschitz if all are Lipschitz. It can be shown, e.g. learning , that with samples, approximates a -element distribution to a constant distance, and hence also estimates any Lipschitz property to a constant loss. It follows that estimates any Lipschitz property over a distribution of support size to constant estimation loss with samples. This provides the first general sublinear-sample estimator for all Lipschitz properties.
Previous results were geared towards the estimator’s worst estimation loss over all possible distributions. For example, they derived estimators that approximate the distance to uniformity of any -element distribution with samples, and showed that this number is optimal as for some distribution classes estimating this distance requires samples.
However, this approach may be too pessimistic. Distributions are rarely maximally complex or hardest to estimate. For example, most natural scenes have distinct simple patterns, such as straight lines or flat faces, and hence can be learned relatively easily.
More concretely, consider learning distance to uniformity for the collection of distributions with entropy bounded by . It can be shown that for sufficiently large , can learn distance to uniformity to constant estimation loss using samples. Theorem 1 therefore shows that the distance to uniformity can be learned to constant estimation loss with samples. (In fact, without even knowing that the entropy is bounded.) By contrast, the original min-max estimator results would still require the much larger samples.
The rest of the paper is organized as follows. Section 5 describes mild smoothness conditions satisfied by many natural properties, including all those mentioned above. Section 6 describes the estimator’s explicit form and some intuition behind its construction and performance. Section 7 describes two improvements of the estimator addressed in the supplementary material. Lastly, Section 8 describes various experiments that illustrate the estimator’s power and competitiveness. For space considerations, we relegate all the proofs to the appendix.
5 Smooth properties
Many natural properties, including all those mentioned in the introduction, satisfy some basic smoothness conditions. For , consider the Lipschitz-type parameter
We consider properties satisfying the following conditions: (1) , ; (2) for ; (3) for some absolute constant .
Note that the first condition, , entails no loss of generality. The second condition implies that is continuous over , and in particular right continuous at 0 and left-continuous at . It is easy to see that continuity is also essential for consistent estimation. Observe also that these conditions are more general than assuming that is Lipschitz, as can be seen for entropy where , and that all seven properties described earlier satisfy these three conditions. Finally, to ensure that distance satisfies these conditions, we let .
6 The Estimator
Given the sample size , define an amplification parameter , and let be the amplified sample size. Generate a sample sequence independently from , and let denote the number of times symbol appeared in . The empirical estimate of with samples is then
Our objective is to construct an estimator that approximates for large using just samples.
Since sharply concentrates around , we can show that can be approximated by the modified empirical estimator,
where for all and .
Since large probabilities are easier to estimate, it is natural to set a threshold parameter and rewrite the modified estimator as a separate sum over small and large probabilities,
Note however that we do not know the exact probabilities. Instead, we draw two independent sample sequences and from , each of an independent size, and let and be the number of occurrences of in the first and second sample sequence respectively. We then set a small/large-probability threshold
and classify a probability as large or small according to :
is the modified small-probability empirical estimator, and
is the modified large-probability empirical estimator. We rewrite the modified empirical estimator as
Correspondingly, we express our estimator as a combination of small- and large-probability estimators,
The large-probability estimator approximates as
Note that we replaced the length- sample sequence by the independent length- sample sequence . We can do so as large probabilities are well estimated from fewer samples.
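The sample-splitting idea above can be sketched in a few lines: one count vector is used to evaluate the per-symbol terms, and a second, independent count vector decides whether each symbol is treated as small or large. This is only an illustration of the decomposition. The numeric `threshold` argument is a hypothetical placeholder for the paper's threshold, and the function name and dictionary-based interface are assumptions.

```python
def split_estimate(f, counts, counts_prime, n, threshold):
    """Split sum_x f(N_x / n) into small and large parts.

    counts:       symbol -> N_x, used to evaluate each term
    counts_prime: symbol -> N'_x from an independent sample, used only to classify
    threshold:    hypothetical cutoff; a symbol is 'large' when N'_x > threshold
    Returns (small_part, large_part); their sum is the unsplit estimate.
    """
    small = 0.0
    large = 0.0
    for x in set(counts) | set(counts_prime):
        term = f(counts.get(x, 0) / n)
        if counts_prime.get(x, 0) > threshold:
            large += term
        else:
            small += term
    return small, large
```

Because the classification uses only the independent counts, the indicator that a symbol is large is independent of the term it multiplies, which is what keeps the two parts tractable to analyze separately.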
The small-probability estimator approximates and is more involved. We outline its construction below and details can be found in Appendix G. The expected value of for the small probabilities is
Let be the expected number of times symbol will be observed in , and define
As explained in Appendix G.1, the sum beyond a truncation threshold
is small, hence it suffices to consider the truncated sum
Observe that is the tail probability of a distribution that diminishes rapidly beyond . Hence determines which summation terms will be attenuated, and serves as a smoothing parameter.
An unbiased estimator of is
Finally, the small-probability estimator is
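Unbiased estimators of polynomial quantities under Poisson sampling typically rest on a standard identity: if a count is Poisson with mean n·p, then its j-th falling factorial divided by n^j has expectation exactly p^j. The sketch below checks this identity numerically by summing the Poisson probability mass function. It illustrates the general mechanism only and is not the specific estimator or coefficients of this section; the function names and the `cutoff` parameter are assumptions.

```python
import math

def falling_factorial(m, j):
    """m * (m - 1) * ... * (m - j + 1); equals 1 when j == 0."""
    out = 1
    for i in range(j):
        out *= (m - i)
    return out

def poisson_pmf(m, lam):
    """P(N = m) for N ~ Poi(lam), lam > 0, computed in log space."""
    return math.exp(-lam + m * math.log(lam) - math.lgamma(m + 1))

def expected_monomial_estimate(p, n, j, cutoff=400):
    """E[ falling_factorial(N, j) / n**j ] for N ~ Poi(n * p).

    The infinite sum is truncated at `cutoff`, which is accurate when n * p
    is far below the cutoff.
    """
    lam = n * p
    return sum(poisson_pmf(m, lam) * falling_factorial(m, j) / n ** j
               for m in range(cutoff))
```

Since any polynomial in p is a linear combination of monomials, a linear combination of these falling-factorial statistics estimates it without bias.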
In Theorem 1, for fixed , as , the final slack term approaches a constant. For certain properties it can be improved. For normalized support size, normalized support coverage, and distance to uniformity, a more involved estimator improves this term to
for any fixed constant .
For Shannon entropy, correcting the bias of emiller and further dividing the probability regions, reduces the slack term even more, to
Finally, the theorem compares the performance of with samples to that of with samples. As shown in the next section, the performance is often comparable to that of samples. It would be interesting to prove a competitive result that enlarges the amplification to or even . This would be essentially the best possible as it can be shown that for the symmetric properties mentioned in the introduction, amplification cannot exceed .
We evaluated the new estimator by comparing its performance to several recent estimators mmentro ; mmsize ; mmcover ; pnas ; jvhw . To ensure robustness of the results, we performed the comparisons for all the symmetric properties described in the introduction: entropy, support size, support coverage, power sums, and distance to uniformity. For each property, we considered six underlying distributions: uniform, Dirichlet-drawn-, Zipf, binomial, Poisson, and geometric. The results for the first three properties are shown in Figures 1–3, the plots for the final two properties can be found in Appendix I. For nearly all tested properties and distributions, achieved state-of-the-art performance.
As Theorem 1 implies, for all five properties, with just (not even ) samples, performed as well as the empirical estimator with roughly samples. Interestingly, in most cases performed even better, similar to with samples.
Relative to previous estimators, depending on the property and distribution, different previous estimators were best. But in essentially all experiments, was either comparable to or outperformed the best previous estimator. The only exception was PML, which attempts to smooth the estimate and hence performed better on uniform and near-uniform Dirichlet-drawn distributions for several properties.
Two additional advantages of the new estimator may be worth noting. First, underscoring its competitive performance for each distribution, the more skewed the distribution, the better its relative efficacy. This is because most other estimators are optimized for the worst distribution, and work less well for skewed ones.
Second, by its simple nature, the empirical estimator is very stable. Designed to emulate the empirical estimator with more samples, the new estimator is therefore stable as well. Note also that the empirical estimator is not always the best estimator choice. For example, it always underestimates the distribution’s support size. Yet even for normalized support size, Figure 2 shows that the new estimator outperforms other estimators, including those designed specifically for this property (except, as above, for PML on near-uniform distributions).
The next subsection describes the experimental settings. Additional details and further interpretation of the observed results can be found in Appendix I.
We tested the five properties on the following distributions: uniform distribution; a distribution randomly generated from Dirichlet prior with parameter 2; Zipf distribution with power
; Binomial distribution with success probability
; Poisson distribution with mean
; geometric distribution with success probability.
With the exception of normalized support coverage, all other properties were tested on distributions of support size . The Geometric, Poisson, and Zipf distributions were truncated at and re-normalized. The number of samples, , ranged from to , shown logarithmically on the horizontal axis. Each experiment was repeated 100 times and the reported results, shown on the vertical axis, reflect their mean squared error (MSE).
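The truncation and re-normalization step can be sketched as follows. The parameter values used in the experiments are not reproduced here, so `k`, `power`, `success`, and `mean` are left as arguments, and the function names are hypothetical.

```python
import math

def normalize(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def uniform(k):
    """Uniform distribution on k symbols."""
    return [1.0 / k] * k

def truncated_zipf(k, power):
    """Zipf weights i ** (-power) on i = 1..k, re-normalized."""
    return normalize([i ** (-power) for i in range(1, k + 1)])

def truncated_geometric(k, success):
    """Geometric pmf on i = 0..k-1, re-normalized after truncation."""
    return normalize([success * (1.0 - success) ** i for i in range(k)])

def truncated_poisson(k, mean):
    """Poisson pmf on i = 0..k-1 (computed in log space), re-normalized."""
    logs = [-mean + i * math.log(mean) - math.lgamma(i + 1) for i in range(k)]
    return normalize([math.exp(v) for v in logs])
```

Re-normalizing after truncation keeps each vector a valid distribution on exactly `k` symbols, so all properties are evaluated on the same support size.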
We compared the estimator’s performance with samples to that of four other recent estimators as well as the empirical estimator with , , and samples. We chose the amplification parameter as , where was selected based on independent data, and similarly for . Since performed even better than Theorem 1 guarantees, ended up between 0 and 0.3 for all properties, indicating amplification even beyond . The graphs denote by NEW, with samples by Empirical, with samples by Empirical+, with samples by Empirical++, the pattern maximum likelihood estimator in mmcover by PML, the Shannon-entropy estimator in jvhw by JVHW, the normalized-support-size estimator in mmsize and the entropy estimator in mmentro by WY, and the smoothed Good-Toulmin Estimator for normalized support coverage estimation pnas , slightly modified to account for previously-observed elements that may appear in the subsequent sample, by SGT.
While the empirical and the new estimators have the same form for all properties, as noted in the introduction, the recent estimators are property-specific, and each was derived for a subset of the properties. In the experiments we applied these estimators to all the properties for which they were derived. Also, additional estimators ventro ; pentro ; mentro ; gsize ; ccover ; cacover ; jcover for various properties were compared in mmentro ; mmsize ; pnas ; jvhw and found to perform similarly to or worse than recent estimators, hence we do not test them here.
In this paper, we considered the fundamental learning problem of estimating properties of discrete distributions. The best-known distribution-property estimation technique is the “empirical estimator” that takes the data’s empirical frequency and plugs it into the property functional. We designed a general estimator that for a wide class of properties, uses only samples to achieve the same accuracy as the plug-in estimator with samples. This provides an off-the-shelf method for amplifying the data available relative to traditional approaches. For all the properties and distributions we have tested, the proposed estimator performed as well as the best estimator(s). A meaningful future research direction would be to verify the optimality of our results: the amplification factor and the slack terms. There are also several important properties that are not included in our paper, for example, Rényi entropy jrenyi and the generalized distance to uniformity yi2018 ; batu17 . It would be interesting to determine whether data amplification could be obtained for these properties as well.
Appendix A Smooth properties
Theorem 1 holds for a wide class of properties . For , consider the Lipschitz-type parameter
We assume that satisfies the following conditions:
for some absolute constant .
Note that the first condition, , entails no loss of generality. The second condition implies that is continuous over , and in particular right continuous at 0 and left-continuous at . It is easy to see that continuity is also essential for consistent estimation. Observe also that these conditions are more general than assuming that is Lipschitz, as can be seen for entropy where , and that all seven properties described earlier satisfy these three conditions. Finally, to ensure that distance satisfies these conditions, we let .
For normalized support size, we modify our estimator as follows: if , we apply the estimator , and if , we apply the corresponding min-max estimator mmsize . However, for experiments shown in Section I, the original estimator is used without such modification.
Table 1 below summarizes the results on the quantity and for different properties. Note that for a given property, is unique while is not.
|Power sum ()||()||1|
|Normalized support coverage||1||1|
|Distance to uniformity||1||1|
For simplicity, we denote the partial expectation , and . To simplify our proofs and expressions, we assume that the number of samples , the amplification parameter , and . Without loss of generality, we also assume that , and are integers. Finally, set and , where and are fixed constants such that and .
Appendix B Outline
The rest of the appendix is organized as follows.
In Section C.1, we present a few concentration inequalities for Poisson and Binomial random variables that will be used in subsequent proofs. In Section C.2, we analyze the performance of the modified empirical estimator that estimates by instead of . We show that performs nearly as well as the original empirical estimator , but is significantly easier to analyze.
In Section D, we partition the loss of our estimator, , into three parts: , , and , corresponding to a quantity which is roughly , the loss incurred by , and the loss incurred by , respectively.
In Sections E and F, we bound the squared bias and variance of respectively.
In Section G.1, we partition the series to be estimated in into and , and show that it suffices to estimate the quantity . In Section G.2, we outline how we construct the linear estimator based on . Then, we bound term : in Section G.3 and G.4, we bound the variance and squared bias of respectively. In Section G.5, we derive a tight bound on .
In Section H, we prove Theorem 1 based on our previous results.
In Section I, we demonstrate the practical advantages of our methods through experiments on different properties and distributions. We show that our estimator can even match the performance of the -sample empirical estimator in estimating various properties.
Appendix C Preliminary Results
C.1 Concentration Inequalities for Poisson and Binomial
The following lemma gives tight tail probability bounds for Poisson and Binomial random variables.
concen Let be a Poisson or Binomial random variable with mean , then for any ,
and for any ,
We have the following corollary by choosing different values of .
Let be a Poisson or Binomial random variable with mean ,
where the second inequality follows from the fact that and decrease with and the equality follows as . ∎
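Tail bounds of this kind can be checked numerically. The sketch below compares exact Poisson tail probabilities with the standard multiplicative Chernoff bounds; that these are the forms intended by the lemma is an assumption, and the function names are hypothetical.

```python
import math

def poisson_cdf(lam, m_max):
    """P(X <= m_max) for X ~ Poi(lam), lam > 0; returns 0 when m_max < 0."""
    return sum(math.exp(-lam + m * math.log(lam) - math.lgamma(m + 1))
               for m in range(m_max + 1))

def chernoff_upper(lam, x):
    """Standard bound on P(X >= (1 + x) * lam) for x > 0."""
    return (math.exp(x) / (1.0 + x) ** (1.0 + x)) ** lam

def chernoff_lower(lam, x):
    """Standard bound on P(X <= (1 - x) * lam) for 0 < x < 1."""
    return (math.exp(-x) / (1.0 - x) ** (1.0 - x)) ** lam

lam, x = 20.0, 0.5
upper_tail = 1.0 - poisson_cdf(lam, 29)   # P(X >= 30) = P(X >= (1 + x) * lam)
lower_tail = poisson_cdf(lam, 10)         # P(X <= 10) = P(X <= (1 - x) * lam)
```

Both exact tails fall below their bounds, and both bounds decay exponentially in the mean for fixed `x`, which is the sense in which a Poisson variable concentrates around its mean.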
C.2 The Modified Empirical Estimator
The modified empirical estimator
estimates the probability of a symbol not by the fraction of times it appeared, but by , where is the parameter of the Poisson sampling distribution.
We show that the original and modified empirical estimators have very similar performance.
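The difference between the two estimators is small in code. The sketch below assumes a symmetric additive property with per-symbol function `f` satisfying `f(0) = 0`, and assumes that arguments above one are handled by evaluating `f` at one; both function names are hypothetical.

```python
def original_empirical(f, counts):
    """Plug-in estimate using the fraction of the realized sample size N."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(f(c / total) for c in counts if c > 0)

def modified_empirical(f, counts, n):
    """Plug-in estimate using N_x / n, where n is the Poisson parameter.

    N_x / n can exceed one, so the argument is clamped at one (an assumption
    about how f is extended beyond the unit interval).
    """
    return sum(f(min(c / n, 1.0)) for c in counts if c > 0)
```

Dividing by the fixed parameter `n` instead of the random total keeps the per-symbol terms independent under Poisson sampling, which is why the modified form is easier to analyze.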
For all ,
By the definition of , if ,
and if ,
where the last step follows as and . ∎
Appendix D Large and Small Probabilities
Recall that has the following form
We can rewrite the property as follows
The difference between and the actual value can be partitioned into three terms
is the bias of the modified empirical estimator with Poi() samples,
corresponds to the loss incurred by the large-probability estimator , and
corresponds to the loss incurred by the small-probability estimator .
By the Cauchy-Schwarz inequality, upper bounds on , , and suffice to also upper bound the estimation loss .
Appendix E Squared Bias:
We relate to through the following inequality.
Let be a positive function over ,
We upper bound in terms of using the Cauchy-Schwarz inequality and Lemma 4.