## 1 Introduction

Recent years have seen significant interest in estimating properties of discrete distributions over large domains [1, 2, 3, 4, 5, 6]. Chief among these properties are support size and coverage, Shannon entropy, and ℓ₁-distance to a known distribution. The main achievement of these papers is essentially estimating several properties of distributions with alphabet size k using just Θ(k/log k) samples.

In practice, however, the underlying distributions are often simple, and their properties can be accurately estimated with significantly fewer samples. For example, if the distribution is concentrated on a small part of the domain, or is exponential, very few samples may suffice to estimate the property. To address this discrepancy, [7] took the following competitive approach.

The best-known distribution property estimator is the *empirical estimator*, which replaces the unknown underlying distribution by the observed empirical distribution. For example, with n samples, it estimates entropy by Σ_x (N_x/n)·log(n/N_x), where N_x is the number of times symbol x appeared. Besides its simple and intuitive form, the empirical estimator is also consistent, stable, and universal. It is therefore the most commonly used property estimator for data-science applications.
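As a concrete illustration (a minimal sketch, not code from the paper), the empirical plug-in estimate of entropy can be computed directly from symbol counts:

```python
from collections import Counter
from math import log

def empirical_entropy(sample):
    """Plug-in estimate: the entropy of the empirical distribution."""
    n = len(sample)
    counts = Counter(sample)
    # Each observed symbol x contributes (N_x/n) * log(n/N_x).
    return sum((c / n) * log(n / c) for c in counts.values())

# A perfectly balanced two-symbol sample recovers log 2 exactly.
print(empirical_entropy(list("ababab")))  # → 0.6931471805599453
```

Unseen symbols contribute nothing, which is precisely why the plug-in estimator underestimates entropy on undersampled distributions.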

The estimator derived in [7] uses n samples and for any underlying distribution achieves the same performance that the empirical estimator would achieve with n·√(log n) samples. It therefore provides an effective way to *amplify* the amount of data available by a factor of √(log n), regardless of the domain or structure of the underlying distribution.

In this paper we present novel estimators that increase the amplification factor, for all sufficiently smooth properties including those mentioned above, from √(log n) to the information-theoretic bound of Θ(log n). Namely, for *every* distribution, their expected estimation error with n samples is that of the empirical estimator with Θ(n·log n) samples, and no further uniform amplification is possible.

It can further be shown [1, 2, 3, 6] that the empirical estimator estimates all of the above four properties with linearly many samples, hence the sample size required by the new estimators is always at most that guaranteed by the state-of-the-art estimators.

The current formulation has several additional advantages over previous approaches.

##### Fewer assumptions

It eliminates the need for some commonly used assumptions. For example, support size cannot be estimated with any number of samples, as arbitrarily many low-probability symbols may be missed. Hence previous research [5, 3] unrealistically assumed prior knowledge of the alphabet size k, and additionally that all positive probabilities exceed 1/k. By contrast, the current formulation does not need these assumptions. Intuitively, if a symbol's probability is so small that it will not be detected even with the amplified number of samples, we do not need to worry about it.

##### Refined bounds

For some properties, our results are more refined than previously shown. For example, existing results estimate the support size only to within an additive error proportional to the alphabet size k, rendering the estimates rather inaccurate when the true support size is much smaller than k. By contrast, the new estimation errors scale with the true support size itself, and are therefore accurate regardless of its magnitude. A similar improvement holds for support coverage.

##### Graceful degradation

For the previous results to apply, the sample size must exceed a threshold determined by the alphabet size; with fewer samples, the estimators have no guarantees. By contrast, the guarantees of the new estimators hold for any sample size n. Below that threshold, the performance may degrade, but it still tracks that of the empirical estimator with an amplified number of samples.

##### Instance optimality

With the recent exception of [7], all modern property-estimation research took a min-max-related approach, evaluating the estimation improvement based on the worst possible distribution for the property. In reality, practical distributions are rarely the worst possible and are often quite simple, rendering the min-max approach overly pessimistic and its estimators typically suboptimal in practice. In fact, for this very reason, practical distribution estimators do not use min-max-based approaches [8]. By contrast, our *competitive*, or *instance-optimal*, approach provably ensures amplification for every underlying distribution, regardless of its complexity.

In addition, the proposed estimators run in time linear in the sample size, and the constants involved are very small, properties shared by some, though not all, existing estimators.

We formalize the foregoing discussion in the following definitions.

Let Δ denote the collection of discrete distributions over a given domain. A distribution *property* is a mapping F: Δ → ℝ. It is *additive* if it can be written as

F(p) = Σ_x f_x(p_x),

where the f_x are real functions. Many important distribution properties are additive:

##### Shannon entropy

H(p) = Σ_x p_x·log(1/p_x).

##### ℓ₁-distance

∥p − q∥₁ = Σ_x |p_x − q_x|, for a given distribution q.

##### Support size

S(p) = Σ_x 1[p_x > 0].

##### Support coverage

C_m(p) = Σ_x (1 − (1 − p_x)^m), which, for a given m, represents the number of distinct elements we would expect to see in m independent samples, and arises in many ecological [25, 26, 27, 28], biological [29, 30], genomic [31], as well as database [32] studies.

Given an additive property F and sample access to an unknown distribution p, we would like to estimate the value of F(p) as accurately as possible. Let 𝒳ⁿ denote the collection of all length-n sequences over the domain; an estimator is a function F̂: 𝒳ⁿ → ℝ that maps a sample sequence Xⁿ to a property estimate F̂(Xⁿ). We evaluate the performance of F̂ in estimating F(p) via its *mean absolute error* (MAE),

E|F̂(Xⁿ) − F(p)|.

Since we do not know p, the common approach is to consider the worst-case MAE of F̂ over all distributions,

The best-known and most commonly used property estimator is the *empirical plug-in estimator*. Upon observing Xⁿ, let N_x denote the number of times symbol x appears in Xⁿ. The empirical estimator applies the property's defining functions to the empirical frequencies, estimating F(p) by

Σ_x f_x(N_x/n).
Starting with Shannon entropy, it has been shown [2] that in the large-alphabet regime, the worst-case MAE of the empirical estimator is

Θ(k/n + (log k)/√n). (1)

On the other hand, [1, 2, 3, 6] showed that in the same regime, more sophisticated estimators achieve the best min-max performance of

Θ(k/(n·log n) + (log k)/√n). (2)

Hence up to constant factors, for the “worst” distributions, the MAE of these estimators with n samples equals that of the empirical estimator with n·log n samples. A similar relation holds for the other three properties we consider.

However, the min-max formulation is pessimistic as it evaluates the estimator's performance based on its MAE for the worst distributions. In many practical applications, the underlying distribution is fairly simple and does not attain this worst-case loss; rather, a much smaller MAE can be achieved. Several recent works have therefore gone beyond worst-case analysis and designed algorithms that perform well for all distributions, not just those with the worst performance [33, 34].

For property estimation, [7] designed an estimator that for any underlying distribution uses n samples to achieve the performance of the n·√(log n)-sample empirical estimator, hence effectively multiplying the data size by a √(log n) *amplification factor*.

###### Lemma 1.

[7] For every property in a large class that includes the four properties above, there is an absolute constant c such that for all distributions p and all n,

In this work, we fully strengthen the above result and establish the limits of data amplification for all sufficiently smooth additive properties, including the four most important ones above. Using Shannon entropy as an example, we achieve a Θ(log n) amplification factor. Equations (1) and (2) imply that the improvement over the empirical estimator cannot always exceed O(log n), hence up to a constant, this amplification factor is information-theoretically optimal. Similar optimality arguments hold for our results on the other three properties.

Specifically, we derive linear-time-computable estimators for Shannon entropy, ℓ₁-distance, support size, support coverage, and a broad class of additive properties which we refer to as “Lipschitz properties”. These estimators take a single parameter ε, and given n samples, amplify the data as described below.

Abbreviate the support size by S. For some absolute constant c, the following five theorems hold for all ε, all distributions p, and all n.

###### Theorem 1 (Shannon entropy).

Note that the estimator does not need to know the alphabet size or the support size. When ε is a constant, the estimator amplifies the data by a factor of Θ(log n). As ε decreases, the amplification factor decreases, and so does the extra additive inaccuracy. One can also set ε to be a vanishing function of n. This result may be interpreted as follows. For distributions with support sizes so large that the min-max estimators provide no or only very weak guarantees, our estimator with n samples always tracks the performance of the empirical estimator with an amplified number of samples. On the other hand, for distributions with relatively small support sizes, our estimator achieves a near-optimal error rate.

In addition, the above result together with Proposition 1 in [35] immediately implies the following.

###### Corollary 1.

In the large alphabet regime where , the min-max MAE of estimating Shannon entropy satisfies

Similarly, for the ℓ₁-distance,

###### Theorem 2 (ℓ₁-distance).

For any , we can construct an estimator for such that

Besides having an interpretation similar to Theorem 1, the above result shows that for each ε and each n, the estimator uses just n samples to achieve the performance of the empirical estimator with an amplified number of samples. More generally, the following holds for any additive property satisfying the simple condition that each of its defining functions is Lipschitz.

###### Theorem 3 (General additive properties).

Given , we can construct an estimator such that

We refer to the above general distribution property class as the class of “Lipschitz properties”. Note that the ℓ₁-distance to any given distribution clearly belongs to this class.

Lipschitz properties are essentially bounded by absolute constants, and Shannon entropy grows at most logarithmically in the support size, so we were able to approximate all of them to within an additive error. Support size and support coverage, however, can grow linearly in S and m, respectively, and can be approximated only multiplicatively. We therefore evaluate the estimator's normalized performance.

Note that for both properties, the amplification factor is logarithmic in the property value, which can be arbitrarily larger than the sample size . The following two theorems hold for ,

###### Theorem 4 (Support size).

To make the slack term vanish, one can simply set ε to be a vanishing function of n (or S). Note that in this case, the slack term modifies the multiplicative error by only a lower-order amount, which is negligible in most applications. Similarly, for support coverage,

###### Theorem 5 (Support coverage).

Abbreviating by ,

For notational convenience, we adopt analogous abbreviations for the four estimators: one each for Shannon entropy, ℓ₁-distance, support size, and support coverage. In the next section, we provide an outline of the remaining contents and a high-level overview of our techniques.

## 2 Outline and technique overview

In the main paper, we focus on Shannon entropy and prove a weaker version of Theorem 1.

###### Theorem 6.

For all and all distributions , the estimator described in Section 5 satisfies

The proof of Theorem 6 in the rest of the paper is organized as follows. In Section 3, we present a few useful concentration inequalities for Poisson and binomial random variables. In Section 4, we relate the bias of the n-sample empirical estimator to the degree-n Bernstein polynomial of the function f. In Section 4.1, we show that the absolute difference between the *derivative* of this Bernstein polynomial and a simple function is uniformly small.

In Section 4.2, we approximate this simple function by a low-degree polynomial, whose degree is governed by the amplification parameter, and bound the approximation error uniformly.

In Section 5, we construct our estimator as follows. First, we divide the symbols into small- and large-probability symbols according to their counts in an independent sample sequence. The concentration inequalities in Section 3 imply that this step can be performed with relatively high confidence. Then, we estimate the partial entropy of each small-probability symbol with a near-unbiased polynomial estimator, and the combined partial entropy of the large-probability symbols with a simple variant of the empirical estimator. The final estimator is the sum of these small- and large-probability estimators. In Section 6, we bound the bias of the estimator. In Sections 6.1 and 6.2, we use properties of the approximating polynomial and the Bernstein polynomials to bound the partial biases of the small- and large-probability estimators, respectively. The key observation, established in Section 4, implies that the small-probability estimator has a small bias. To bound the bias of the large-probability estimator, we essentially rely on an elegant inequality for Bernstein polynomials.

By the triangle inequality, it remains to bound the mean absolute deviation of the estimator. We bound this quantity by bounding the partial variances of the small- and large-probability estimators in Sections 7.1 and 7.2, respectively. Intuitively speaking, the small-probability estimator has a small variance because it is constructed from a low-degree polynomial; the large-probability estimator has a small variance because the entropy function is smoother at larger probability values. To demonstrate the efficacy of our methods, in Section 8, we compare the experimental performance of our estimators with that of the state-of-the-art property estimators for Shannon entropy and support size over nine distributions. Our competitive estimators outperformed these existing algorithms on nearly all of the tested instances.

## 3 Concentration inequalities

The following lemma gives tight tail probability bounds for Poisson and binomial random variables.

###### Lemma 2.
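Concentration bounds of this type typically take the Chernoff form P(|X − λ| ≥ x) ≤ 2·exp(−x²/(2(λ + x))) for X ∼ Poi(λ); that specific form is an assumption here, standing in for the lemma's exact statement. The sketch below checks it numerically against the exact tail probability:

```python
from math import exp, lgamma, log

def pois_pmf(lam, j):
    # exp(-lam) * lam**j / j!, computed in log-space for numerical stability
    return exp(-lam + j * log(lam) - lgamma(j + 1))

def pois_two_sided_tail(lam, x, upto=10000):
    # P(|X - lam| >= x) for X ~ Poisson(lam), by direct summation
    return sum(pois_pmf(lam, j) for j in range(upto) if abs(j - lam) >= x)

lam = 20.0
for x in (5.0, 10.0, 15.0):
    exact = pois_two_sided_tail(lam, x)
    bound = 2 * exp(-x * x / (2 * (lam + x)))
    assert exact <= bound  # the Chernoff-style bound dominates the exact tail
```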

## 4 Approximating Bernstein polynomials

With n samples, the bias of the empirical estimator in estimating F(p) is

By the linearity of expectation, the right-hand side equals

Noting that the degree-n Bernstein polynomial of f is B_n(f)(y) = Σ_{j=0}^{n} f(j/n)·C(n, j)·y^j·(1 − y)^{n−j},

we can express the bias of the empirical estimator as
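To make the Bernstein-polynomial view concrete, the sketch below (taking f(x) = -x log x, the Shannon entropy case) computes the plug-in bias as Σ_x [B_n(f)(p_x) − f(p_x)] and checks two consequences of f's concavity: the plug-in estimator underestimates entropy, and the bias shrinks as n grows:

```python
from math import comb, log

def f(x):  # entropy integrand; f(0) = 0 by continuity
    return -x * log(x) if x > 0 else 0.0

def bernstein(f, n, y):
    # Degree-n Bernstein polynomial: B_n(f)(y) = sum_j f(j/n) * P(Bin(n, y) = j)
    return sum(f(j / n) * comb(n, j) * y**j * (1 - y)**(n - j)
               for j in range(n + 1))

def empirical_entropy_bias(p, n):
    # Bias of the n-sample plug-in estimator: sum_x [B_n(f)(p_x) - f(p_x)]
    return sum(bernstein(f, n, px) - f(px) for px in p)

p = [0.5, 0.3, 0.2]
b40, b80 = empirical_entropy_bias(p, 40), empirical_entropy_bias(p, 80)
# f is concave, so B_n(f) <= f pointwise: the plug-in estimator underestimates
# entropy, and the bias magnitude decreases with the sample size.
assert b40 < b80 < 0
```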

Given a sampling number n and a parameter ε, define the amplification factor accordingly. Let c and c′ be sufficiently large and small absolute constants, respectively. In the following sections, we find a polynomial of the corresponding degree, whose error in approximating the Bernstein polynomial over the relevant interval satisfies

Through a simple argument, the resulting lower-degree polynomial approximates the Bernstein polynomial with the following pointwise error guarantee.

###### Lemma 3.

For any ,

In Section 4.1, we relate the derivative of the Bernstein polynomial to a simple function. In Section 4.2, we approximate this function by a linear combination of low-degree min-max polynomials over different intervals; this yields the desired approximating polynomial.

### 4.1 The derivative of a Bernstein polynomial

According to [37], the first-order derivative of the Bernstein polynomial is

B_n(f)'(x) = n·Σ_{j=0}^{n−1} [f((j+1)/n) − f(j/n)]·C(n−1, j)·x^j·(1−x)^{n−1−j}.

Letting

we can write as

Recall that . After some algebra, we get
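The classical derivative identity B_n(f)'(x) = n * sum_j [f((j+1)/n) - f(j/n)] * C(n-1, j) x^j (1-x)^(n-1-j) can be sanity-checked numerically; the sketch below does so for the entropy case f(x) = -x log x, against a central finite difference:

```python
from math import comb, log

def f(x):
    return -x * log(x) if x > 0 else 0.0

def bernstein(n, y):
    # Degree-n Bernstein polynomial of f
    return sum(f(j / n) * comb(n, j) * y**j * (1 - y)**(n - j)
               for j in range(n + 1))

def bernstein_deriv(n, y):
    # Classical identity: the derivative is a degree-(n-1) Bernstein-type sum
    # of the forward differences n * (f((j+1)/n) - f(j/n)).
    return n * sum((f((j + 1) / n) - f(j / n)) * comb(n - 1, j)
                   * y**j * (1 - y)**(n - 1 - j) for j in range(n))

n, h = 30, 1e-6
for y in (0.1, 0.4, 0.7):
    fd = (bernstein(n, y + h) - bernstein(n, y - h)) / (2 * h)
    assert abs(fd - bernstein_deriv(n, y)) < 1e-5
```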

Furthermore, using properties of [38], we can bound the absolute difference between and its Bernstein polynomial as follows.

###### Lemma 4.

For any and ,

As an immediate corollary,

###### Corollary 2.

For ,

###### Proof.

### 4.2 Approximating the derivative function

Denote the degree- min-max polynomial of over by

As shown in [2], the coefficients of this min-max polynomial satisfy

and the error of this polynomial in approximating the target function is bounded as

By a change of variables, the degree- min-max polynomial of over is

Correspondingly, for any , we have

To approximate the target function, we approximate each of its pieces by the corresponding rescaled min-max polynomial. The resulting polynomial is

By the above reasoning, the error of in approximating over satisfies

Moreover, by Corollary 2,

The triangle inequality combines the above two inequalities and yields

Therefore, denoting

and noting that , we have

###### Lemma 5.

For any ,
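In practice, near-min-max polynomials like those above can be obtained by Chebyshev interpolation (a stand-in, not the Remez construction used in the analysis, whose uniform error is within a modest factor of the true min-max error). The sketch below uses the same change-of-variables rescaling from [a, b] to [−1, 1]:

```python
import numpy as np

def cheb_approx(func, a, b, degree):
    """Near-minimax polynomial of `func` on [a, b] via Chebyshev interpolation."""
    # Chebyshev nodes on [-1, 1], mapped to [a, b] by a linear change of variables
    k = np.arange(degree + 1)
    nodes = np.cos((2 * k + 1) * np.pi / (2 * (degree + 1)))
    x = 0.5 * (b - a) * nodes + 0.5 * (b + a)
    coeffs = np.polynomial.chebyshev.chebfit(nodes, func(x), degree)
    return lambda t: np.polynomial.chebyshev.chebval(
        (2 * t - (a + b)) / (b - a), coeffs)

grid = np.linspace(0.5, 2.0, 2001)
errs = [np.max(np.abs(cheb_approx(np.log, 0.5, 2.0, d)(grid) - np.log(grid)))
        for d in (4, 8, 16)]
# For smooth functions, the uniform error decays rapidly with the degree.
assert errs[0] > errs[1] > errs[2]
```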

## 5 A competitive entropy estimator

In this section, we design an explicit entropy estimator based on the approximating polynomial and the empirical estimator. Note that this polynomial has zero constant term. For notational convenience, denote

Setting for and , we have the following lemma.

###### Lemma 6.

The function can be written as

In addition, its coefficients satisfy

The proof of the above lemma is delayed to the end of this section.

To simplify our analysis and remove the dependency between the counts, we use the conventional *Poisson sampling* technique [2, 3]. Specifically, instead of drawing exactly n samples, we make the sample size an independent Poisson random variable with mean n. This does not change the statistical nature of the problem, as the Poisson sample size concentrates sharply around its mean (see Lemma 2). We still define N_x as the count of symbol x in the sample. Due to Poisson sampling, these counts are now independent, and satisfy N_x ∼ Poi(n·p_x).
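Under Poisson sampling, each count is marginally Poisson: mixing Bin(N, p_x) over N ∼ Poi(n) yields N_x ∼ Poi(n·p_x) exactly. A small sketch verifying this identity by direct summation (the truncation point is chosen so the neglected mass is negligible):

```python
from math import comb, exp, lgamma, log

def pois_pmf(lam, j):
    # Poisson pmf computed in log-space for numerical stability
    return exp(-lam + j * log(lam) - lgamma(j + 1))

def count_pmf(n, px, j, upto=500):
    # P(N_x = j) = sum_N P(Poi(n) = N) * P(Bin(N, px) = j)
    return sum(pois_pmf(n, N) * comb(N, j) * px**j * (1 - px)**(N - j)
               for N in range(j, upto))

n, px = 50, 0.12
for j in range(10):
    # the mixture matches the Poi(n * px) pmf term by term
    assert abs(count_pmf(n, px, j) - pois_pmf(n * px, j)) < 1e-10
```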

For each j, let N_x^(j) := N_x·(N_x − 1)⋯(N_x − j + 1) be the order-j falling factorial of N_x. The following identity is well-known:

E[N_x^(j)] = (n·p_x)^j.

Note that for sufficiently small ε, the degree parameter is correspondingly small. By the linearity of expectation, an unbiased estimator of the polynomial is obtained by replacing each monomial p_x^j with N_x^(j)/n^j.
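The identity in question is E[N^(j)] = λ^j for N ∼ Poi(λ), where N^(j) = N(N−1)⋯(N−j+1); with λ = n·p_x, it makes N_x^(j)/n^j an unbiased estimator of p_x^j. A quick numeric check by truncated summation (a sketch):

```python
from math import exp, lgamma, log

def pois_pmf(lam, j):
    return exp(-lam + j * log(lam) - lgamma(j + 1))

def falling(N, j):
    # order-j falling factorial: N (N-1) ... (N-j+1)
    out = 1
    for i in range(j):
        out *= N - i
    return out

def expected_falling(lam, j, upto=400):
    # E[N^(j)] for N ~ Poi(lam), by direct summation
    return sum(pois_pmf(lam, N) * falling(N, j) for N in range(upto))

n, px = 100, 0.05            # so lam = n * px = 5
for j in range(1, 6):
    # E[N^(j)] = lam**j, hence N^(j) / n**j is unbiased for px**j
    assert abs(expected_falling(n * px, j) - (n * px) ** j) < 1e-6
```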

Let N′ be an independent Poisson random variable with mean n, and let an independent length-N′ sample sequence be drawn from p. Analogously, we denote by N′_x the number of times that symbol x appears in this second sequence. Depending on whether N′_x exceeds a threshold or not, we classify the probabilities into two categories: small and large. For small probabilities, we apply a simple variant of the unbiased polynomial estimator; for large probabilities, we estimate essentially by the empirical estimator. Specifically, we estimate each symbol's partial entropy by the corresponding estimator, and consequently approximate the total entropy by their sum.

For simplicity of illustration, we will refer to the former as the *small-probability estimator*, and to the latter as the *large-probability estimator*. Clearly, the overall estimator is the sum of these two estimators.
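The overall construction can be sketched end to end, with loud caveats: the threshold tau, the degree, and the least-squares fit below are hypothetical stand-ins for the paper's carefully chosen min-max polynomial and constants, so this is a schematic of the two-regime structure rather than the actual estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

def falling(N, j):
    # order-j falling factorial: N (N-1) ... (N-j+1)
    out = 1.0
    for i in range(j):
        out *= N - i
    return out

def competitive_entropy(sample, helper, n, tau, degree=4):
    """Schematic two-regime estimator; tau and degree are illustrative choices."""
    # Least-squares stand-in for the min-max polynomial of -x*log(x) on (0, 2*tau],
    # fitted in the rescaled variable u = x/(2*tau) for numerical stability.
    u = np.linspace(1e-6, 1.0, 200)
    xs = 2 * tau * u
    b = np.polyfit(u, -xs * np.log(xs), degree)[::-1]    # b[j] multiplies u**j
    a = [b[j] / (2 * tau) ** j for j in range(degree + 1)]
    counts, helper_counts = {}, {}
    for s in sample:
        counts[s] = counts.get(s, 0) + 1
    for s in helper:
        helper_counts[s] = helper_counts.get(s, 0) + 1
    est = 0.0
    for x, Nx in counts.items():
        if helper_counts.get(x, 0) <= n * tau:
            # small-probability regime: unbiased estimate of sum_j a[j] * p_x**j
            # via E[N^(j)] = (n p_x)^j; a[0] is dropped, matching the
            # zero-constant-term polynomial used in the construction
            est += sum(a[j] * falling(Nx, j) / n ** j
                       for j in range(1, degree + 1))
        else:
            # large-probability regime: empirical plug-in term (N_x/n) log(n/N_x)
            est += (Nx / n) * np.log(n / Nx)
    return est

k, n = 100, 5000
p = np.ones(k) / k                       # uniform distribution, entropy log(k)
sample = rng.choice(k, size=n, p=p)
helper = rng.choice(k, size=n, p=p)      # independent sample used for the split
est = competitive_entropy(sample, helper, n, tau=np.log(n) / n)
assert abs(est - np.log(k)) < 0.5
```

On this well-sampled uniform instance every symbol lands in the large-probability regime, so the sketch reduces to the empirical estimator; the small-probability branch engages only for symbols whose helper counts fall below n·tau.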

In the next two sections, we analyze the bias and mean absolute deviation of . In Section 6, we show that for any , the absolute bias of satisfies

In Section 7, we show that the mean absolute deviation of satisfies

For sufficiently small , the triangle inequality combines the above inequalities and yields

This basically completes the proof of Theorem 6.

### Proof of Lemma 6

We begin by proving the first claim:

By definition, satisfies

The last step follows by reorganizing the indices.

Next we prove the second claim. Recall that , thus

Since for and , it suffices to bound the magnitude of :

## 6 Bounding the bias of

By the triangle inequality, the absolute bias of our estimator in estimating the entropy satisfies

Note that the first term on the right-hand side is the absolute bias of the empirical estimator with sample size , i.e.,

Hence, we only need to consider the second term on the right-hand side, which admits

where

is the absolute bias of the small-probability estimator, and

is the absolute bias of the large-probability estimator.

Assume that n is sufficiently large. In Section 6.1, we bound the small-probability bias by

In Section 6.2, we bound the large-probability bias by

### 6.1 Bias of the small-probability estimator

We first consider the quantity . By the triangle inequality,