# Data Amplification: Instance-Optimal Property Estimation

The best-known and most commonly used distribution-property estimation technique uses a plug-in estimator, with empirical frequency replacing the underlying distribution. We present novel linear-time-computable estimators that significantly "amplify" the effective amount of data available. For a large variety of distribution properties including four of the most popular ones, and for every underlying distribution, they achieve the accuracy that the empirical-frequency plug-in estimators would attain using a logarithmic-factor more samples. Specifically, for Shannon entropy and a broad class of properties including ℓ1-distance, the new estimators use n samples to achieve the accuracy attained by the empirical estimators with n log n samples. For support size and coverage, the new estimators use n samples to achieve the performance of empirical frequency with sample size n times the logarithm of the property value. Significantly strengthening the traditional min-max formulation, these results hold not only for the worst distributions, but for each and every underlying distribution. Furthermore, the logarithmic amplification factors are optimal. Experiments on a wide variety of distributions show that the new estimators outperform the previous state-of-the-art estimators designed for each specific property.


## 1 Introduction

Recent years have seen significant interest in estimating properties of discrete distributions over large domains [1, 2, 3, 4, 5, 6]. Chief among these properties are support size and coverage, Shannon entropy, and ℓ1-distance to a known distribution. The main achievement of these papers is essentially estimating several properties of distributions with alphabet size k using just O(k/log k) samples.

In practice however, the underlying distributions are often simple, and their properties can be accurately estimated with significantly fewer samples. For example, if the distribution is concentrated on a small part of the domain, or is exponential, very few samples may suffice to estimate the property. To address this discrepancy, [7] took the following competitive approach.

The best-known distribution property estimator is the empirical estimator, which replaces the unknown underlying distribution by the observed empirical distribution. For example, with n samples, it estimates entropy by ∑i −(Ni/n) log(Ni/n), where Ni is the number of times symbol i appeared. Besides its simple and intuitive form, the empirical estimator is also consistent, stable, and universal. It is therefore the most commonly used property estimator for data-science applications.

The estimator derived in [7] uses n samples and for any underlying distribution achieves the same performance that the empirical estimator would achieve with n√log n samples. It therefore provides an effective way to amplify the amount of data available by a factor of √log n, regardless of the domain or structure of the underlying distribution.

In this paper we present novel estimators that increase the amplification factor for all sufficiently smooth properties, including those mentioned above, from √log n to the information-theoretic bound of log n. Namely, for every distribution, their expected estimation error with n samples is that of the empirical estimator with n log n samples, and no further uniform amplification is possible.

It can further be shown [1, 2, 3, 6] that the empirical estimator estimates all of the above four properties with linearly many samples, hence the sample size required by the new estimators is always at most the O(k/log k) guaranteed by the state-of-the-art estimators.

The current formulation has several additional advantages over previous approaches.

##### Fewer assumptions

It eliminates the need for some commonly used assumptions. For example, support size cannot be estimated with any number of samples, as arbitrarily many low-probability symbols may go unobserved. Hence previous research [5, 3] unrealistically assumed prior knowledge of the alphabet size k, and additionally that all positive probabilities exceed 1/k. By contrast, the current formulation does not need these assumptions. Intuitively, if a symbol’s probability is so small that it will not be detected even with the amplified number of samples, we do not need to worry about it.

##### Refined bounds

For some properties, our results are more refined than previously shown. For example, existing results estimate the support size only to within an additive error proportional to the alphabet size k, rendering the estimates rather inaccurate when the true support size is much smaller than k. By contrast, the new estimation errors are bounded relative to the true support size S→p, and are therefore accurate regardless of the support size. A similar improvement holds for support coverage.

##### Graceful degradation

For the previous results to work, one needs at least roughly k/log k samples. With fewer samples, the estimators have no guarantees. By contrast, the guarantees of the new estimators hold for any sample size n. With very few samples the performance may degrade, but it will still track that of the empirical estimators with a log n factor more samples.

##### Instance optimality

With the recent exception of [7], all modern property-estimation research took a min-max-related approach, evaluating the estimation improvement based on the worst possible distribution for the property. In reality, practical distributions are rarely the worst possible and are often quite simple, rendering the min-max approach overly pessimistic and its estimators typically suboptimal in practice. In fact, for this very reason, practical distribution estimators do not use min-max based approaches [8]. By contrast, our competitive, or instance-optimal, approach provably ensures amplification for every underlying distribution, regardless of its complexity.

In addition, the proposed estimators run in time linear in the sample size, and the constants involved are very small, properties shared by some, though not all existing estimators.

We formalize the foregoing discussion in the following definitions.

Let Δk denote the collection of discrete distributions over [k]. A distribution property is a mapping F:Δk→ℝ. It is additive if it can be written as

 F(→p):=∑i∈[k]fi(pi),

where the fi are real functions. Many important distribution properties are additive:

##### Shannon entropy

H(→p):=∑i∈[k]−pi log pi, is the principal measure of information [9], and arises in a variety of machine-learning [10, 11, 12], neuroscience [13, 14, 15], and other applications.

##### ℓ1-distance

D(→p):=∑i∈[k]|pi−qi|, where →q is a given distribution, is one of the most basic and well-studied properties in the field of distribution property testing [16, 17, 18, 19].

##### Support size

S(→p):=∑i∈[k]1pi>0, is a fundamental quantity for discrete distributions, and plays an important role in vocabulary-size [20, 21, 22] and population estimation [23, 24].

##### Support coverage

Cm(→p):=∑i∈[k](1−(1−pi)m), for a given m, represents the number of distinct elements we would expect to see in m independent samples, and arises in many ecological [25, 26, 27, 28], biological [29, 30], genomic [31], as well as database [32] studies.
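To make the four definitions concrete, the short sketch below evaluates each property directly from a known probability vector (the function names are ours, chosen for illustration):

```python
import math

def shannon_entropy(p):
    """H(p) = sum of -p_i log p_i (additive with f_i(x) = -x log x)."""
    return sum(-x * math.log(x) for x in p if x > 0)

def l1_distance(p, q):
    """l1-distance to a given distribution q: sum of |p_i - q_i|."""
    return sum(abs(x - y) for x, y in zip(p, q))

def support_size(p):
    """Number of symbols with positive probability."""
    return sum(1 for x in p if x > 0)

def support_coverage(p, m):
    """Expected number of distinct symbols in m independent samples:
    sum of (1 - (1 - p_i)^m)."""
    return sum(1.0 - (1.0 - x) ** m for x in p if x > 0)
```

In the estimation problem below, →p is of course unknown; these direct formulas only serve as the ground truth that the estimators target.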

Given an additive property F and sample access to an unknown distribution →p, we would like to estimate the value of F(→p) as accurately as possible. Let [k]n denote the collection of all length-n sequences over [k]; an estimator is a function ^F:[k]n→ℝ that maps a sample sequence Xn to a property estimate ^F(Xn). We evaluate the performance of ^F in estimating F(→p) via its mean absolute error (MAE),

 L(^F,→p,n):=EXn∼→p∣∣^F(Xn)−F(→p)∣∣.

Since we do not know →p, the common approach is to consider the worst-case MAE of ^F over Δk,

 L(^F,n):=max→p∈ΔkL(^F,→p,n).

The best-known and most commonly-used property estimator is the empirical plug-in estimator. Upon observing Xn, let Ni denote the number of times symbol i appears in Xn. The empirical estimator estimates F(→p) by

 ^FE(Xn):=∑i∈[k]fi(Nin).
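For entropy, the plug-in recipe amounts to a few lines; this is a minimal sketch (the helper name is ours):

```python
import math
from collections import Counter

def empirical_entropy(sample):
    """Plug-in estimate: substitute the empirical frequencies N_i/n
    into h(x) = -x log x and sum over the observed symbols."""
    n = len(sample)
    return sum(-(c / n) * math.log(c / n) for c in Counter(sample).values())
```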

Starting with Shannon entropy, it has been shown [2] that the worst-case MAE of the empirical estimator ^HE is

 L(^HE,n)=Θ(kn+logk√n). (1)

On the other hand, [1, 2, 3, 6] showed that more sophisticated estimators achieve the best min-max performance of

 L(n):=min^Fmax→p∈ΔkL(^F,→p,n)=Θ(knlogn+logk√n). (2)

Hence up to constant factors, for the “worst” distributions, the MAE of these estimators with n samples equals that of the empirical estimator with n log n samples. A similar relation holds for the other three properties we consider.

However, the min-max formulation is pessimistic as it evaluates the estimator’s performance based on its MAE for the worst distributions. In many practical applications, the underlying distribution is fairly simple and does not attain this worst-case loss, rather, a much smaller MAE can be achieved. Several recent works have therefore gone beyond worst-case analysis and designed algorithms that perform well for all distributions, not just those with the worst performance [33, 34].

For property estimation, [7] designed an estimator that for any underlying distribution uses n samples to achieve the performance of the (n√log n)-sample empirical estimator, hence effectively multiplying the data size by a √log n amplification factor.

###### Lemma 1.

[7] For every property in a large class that includes the four properties above, there is an absolute constant cF such that for all distributions →p, all n, and all ε,

 L(^FA,→p,n)≤L(^FE,→p,εn√logn)+cF⋅ε.

In this work, we fully strengthen the above result and establish the limits of data amplification for all sufficiently smooth additive properties, including four of the most important ones. Using Shannon entropy as an example, we achieve a log n amplification factor. Equations (1) and (2) imply that the improvement over the empirical estimator cannot always exceed O(log n), hence up to a constant, this amplification factor is information-theoretically optimal. Similar optimality arguments hold for our results on the other three properties.

Specifically, we derive linear-time-computable estimators ^H, ^D, ^S, ^C, and ^F for Shannon entropy, ℓ1-distance, support size, support coverage, and a broad class of additive properties which we refer to as “Lipschitz properties”. These estimators take a single parameter ε, and given n samples Xn, amplify the data as described below.

Let x∧y denote min{x,y}, and abbreviate the support size S(→p) by S→p. For some absolute constant c, the following five theorems hold for all ε∈(0,1], all distributions →p, and all n.

###### Theorem 1 (Shannon entropy).
 L(^H,→p,n)≤L(^HE,→p,εnlogn)+c⋅(ε∧(S→pn+1n0.49)).

Note that the estimator ^H does not need to know k or S→p. When ε is a constant, the estimator amplifies the data by a factor of log n. As ε decreases, the amplification factor decreases, and so does the extra additive inaccuracy. One can also set ε to be a vanishing function of n, e.g., ε=1/log log n. This result may be interpreted as follows. For distributions with support sizes so large that the min-max estimators provide no or only very weak guarantees, our estimator with n samples always tracks the performance of the (εn log n)-sample empirical estimator. On the other hand, for distributions with relatively small support sizes, our estimator achieves a near-optimal O(S→p/n+n−0.49) error rate.

In addition, the above result together with Proposition 1 in [35] trivially implies that

###### Corollary 1.

In the large-alphabet regime where k=Ω(n log n), the min-max MAE of estimating Shannon entropy satisfies

 L(n)≤(1+o(1))log(1+k−1nlogn).

Similarly, for ℓ1-distance,

###### Theorem 2 (ℓ1-distance).

For any given distribution →q, we can construct an estimator ^D for the ℓ1-distance D(→p) such that

 L(^D,→p,n)≤L(^DE,→p,ε2nlogn)+c⋅⎛⎝ε∧⎛⎝√S→pn+1n0.49⎞⎠⎞⎠.

Besides having an interpretation similar to that of Theorem 1, the above result shows that for each ε and each →q, we can use just n samples to achieve the performance of the (ε²n log n)-sample empirical estimator. More generally, the same guarantee holds for any additive property F that satisfies the simple condition: fi is O(1)-Lipschitz for all i∈[k].

###### Theorem 3 (General additive properties).

Given such a Lipschitz property F, we can construct an estimator ^F such that

 L(^F,→p,n)≤L(^FE,→p,ε2nlogn)+O⎛⎝ε∧⎛⎝√S→pn+1n0.49⎞⎠⎞⎠.

We refer to the above general distribution-property class as the class of “Lipschitz properties”. Note that the ℓ1-distance to →q, for any →q, clearly belongs to this class.

Lipschitz properties are essentially bounded by absolute constants, and Shannon entropy grows at most logarithmically in the support size; we were therefore able to approximate all of them with just an additive error. Support size and support coverage, by contrast, can grow linearly in the alphabet size k and the parameter m, and can be approximated only multiplicatively. We therefore evaluate the estimator’s normalized performance.

Note that for both properties, the amplification factor is logarithmic in the property value, which can be arbitrarily larger than the sample size n. The following two theorems hold for all ε and n.

###### Theorem 4 (Support size).

To make the slack term vanish, one can simply set ε to be a vanishing function of n (or S→p), e.g., ε=1/log log n. Note that in this case, the slack term modifies the multiplicative error in estimating S→p by only o(1), which is negligible in most applications. Similarly, for support coverage,

###### Theorem 5 (Support coverage).

Abbreviating Cm(→p) by C→p,

 1C→pL(^C,→p,n)≤1C→pL(^CE,→p,|log−2ε|⋅nlogC→p)+c(C|log−1ε|−12→p+ε).

For notational convenience, the relevant amplification factors are log n for entropy, ε²log n for ℓ1-distance, log S→p for support size, and log C→p for support coverage. In the next section, we provide an outline of the remaining contents, and a high-level overview of our techniques.

## 2 Outline and technique overview

In the main paper, we focus on Shannon entropy and prove a weaker version of Theorem 1.

###### Theorem 6.

For all ε∈(0,1] and all distributions →p, the estimator ^H described in Section 5 satisfies

 L(^H,→p,n)≤L(^HE,→p,εnlogn)+(1+c⋅ε)∧(S→pεn+1n0.49).

The proof of Theorem 6 in the rest of the paper is organized as follows. In Section 3, we present a few useful concentration inequalities for Poisson and binomial random variables. In Section 4, we relate the bias of the n-sample empirical estimator to the degree-n Bernstein polynomial of h through Bias(^HE,n)=∑i∈[k](Bn(h,pi)−h(pi)). In Section 4.1, we show that the absolute difference between the derivative of Bn(h,x) and a simple function hn(x) is at most 1, uniformly for all x∈[0,1/2].

Let na:=εn log n be the amplified sample size. In Section 4.2 we approximate hna by a polynomial ~hna of degree O(log n), and bound the approximation error over In uniformly by 1+O(ε). Let ~Hna(x):=∫x0~hna(t)dt. By construction, |B′na(h,x)−~hna(x)|≤1+O(ε), implying |Bna(h,x)−~Hna(x)|≤x(1+O(ε)).

In Section 5, we construct our estimator ^H as follows. First, we divide the symbols into small- and large-probability symbols according to their counts in an independent n-element sample sequence. The concentration inequalities in Section 3 imply that this step can be performed with relatively high confidence. Then, we estimate the partial entropy of each small-probability symbol with a near-unbiased estimator derived from ~Hna, and the combined partial entropy of the large-probability symbols with a simple variant of the empirical estimator. The final estimator is the sum of these small- and large-probability estimators.

In Section 6, we bound the bias of ^H. In Sections 6.1 and 6.2, we use properties of ~Hna and the Bernstein polynomials to bound the partial biases of the small- and large-probability estimators in terms of Bias(^HE,na), respectively. The key observation is the pointwise bound of Lemma 5, implying that the small-probability estimator has a small bias. To bound the bias of the large-probability estimator, we essentially rely on the elegant inequality of Lemma 4.

By the triangle inequality, it remains to bound the mean absolute deviation of ^H. We bound this quantity by bounding the partial variances of the small- and large-probability estimators in Sections 7.1 and 7.2, respectively. Intuitively speaking, the small-probability estimator has a small variance because it is constructed from a low-degree polynomial; the large-probability estimator has a small variance because h is smoother for larger values of its argument.

To demonstrate the efficacy of our methods, in Section 8, we compare the experimental performance of our estimators with that of the state-of-the-art property estimators for Shannon entropy and support size over nine distributions. Our competitive estimators outperformed these existing algorithms on nearly all the tested instances.

Replacing the simple function hn by a much finer approximation of B′n(h,⋅), we establish the full version of Theorem 1 in Appendix A. Applying similar techniques, we prove the other four results in Appendices B (Theorems 2 and 3), C (Theorem 4), and D (Theorem 5).

## 3 Concentration inequalities

The following lemma gives tight tail probability bounds for Poisson and binomial random variables.

###### Lemma 2.

[36] Let X be a Poisson or binomial random variable with mean μ. Then for any δ>0,

 P(X≥(1+δ)μ)≤(eδ(1+δ)(1+δ))μ≤e−(δ2∧δ)μ/3,

and for any δ∈(0,1),

 P(X≤(1−δ)μ)≤(e−δ(1−δ)(1−δ))μ≤e−δ2μ/2.
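As a quick numerical sanity check (ours, not part of the proof), one can compare the exact Poisson upper tail with the first bound of the lemma by summing the pmf directly:

```python
import math

def poisson_upper_tail(mu, delta, terms=1000):
    """Exact P(X >= (1 + delta) * mu) for X ~ Poisson(mu), by summation."""
    threshold = (1 + delta) * mu
    total, pmf = 0.0, math.exp(-mu)
    for j in range(terms):
        if j >= threshold:
            total += pmf
        pmf *= mu / (j + 1)
    return total

def chernoff_upper_bound(mu, delta):
    """The bound (e^delta / (1 + delta)^(1 + delta))^mu from Lemma 2."""
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu
```

For example, with μ=10 and δ=0.5, the exact tail lies below the Chernoff bound, which in turn lies below e−(δ²∧δ)μ/3.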

## 4 Approximating Bernstein polynomials

With n samples, the bias of the empirical estimator ^HE in estimating H(→p) is

 Bias(^HE,n):=E[^HE(Xn)]−H(→p).

By the linearity of expectation, and since Ni∼bin(n,pi), the right-hand side equals ∑i∈[k](E[h(Ni/n)]−h(pi)). Noting that the degree-n Bernstein polynomial of h is

 Bn(h,x):=n∑j=0h(jn)(nj)xj(1−x)n−j,

we can express the bias of the empirical estimator as

 Bias(^HE,n)=∑i∈[k](Bn(h,pi)−h(pi)).
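The identity above is easy to verify numerically; the sketch below (our own helper names) computes Bn(h,pi)−h(pi) directly with h(x)=−x log x:

```python
import math

def h_ent(x):
    """h(x) = -x log x, with h(0) = 0."""
    return -x * math.log(x) if x > 0 else 0.0

def bernstein(h, n, x):
    """Degree-n Bernstein polynomial: sum_j h(j/n) C(n,j) x^j (1-x)^(n-j)."""
    return sum(h(j / n) * math.comb(n, j) * x ** j * (1 - x) ** (n - j)
               for j in range(n + 1))

def empirical_entropy_bias(p, n):
    """Bias of the n-sample empirical entropy estimator:
    sum_i (B_n(h, p_i) - h(p_i))."""
    return sum(bernstein(h_ent, n, pi) - h_ent(pi) for pi in p)
```

Consistent with Lemma 4 below, the computed bias is negative, and each symbol contributes at least −(1−pi)/n.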

Given a sampling number n and a parameter ε, define the amplified sample size na:=εn log n. Let cl and cs be sufficiently large and small constants, respectively, let d:=cs log n be a degree parameter, and let In:=[0, cl log n/n]. In the following sections, we find a polynomial ~hna of degree d, whose error in approximating B′na(h,x) over In satisfies

 |B′na(h,x)−~hna(x)|≤1+O(ε).

Through a simple argument, the degree-(d+1) polynomial

 ~Hna(x):=∫x0~hna(t)dt,

approximates Bna(h,x) with the following pointwise error guarantee.

###### Lemma 3.

For any x∈In,

 |Bna(h,x)−~Hna(x)|≤x(1+O(ε)).

In Section 4.1, we relate B′na(h,x) to a simple function hna(x), which can be expressed in terms of h. In Section 4.2, we approximate hna by a linear combination of degree-d min-max polynomials of h over different intervals. The resulting polynomial is ~hna.

### 4.1 The derivative of a Bernstein polynomial

According to [37], the first-order derivative of the Bernstein polynomial Bn(h,x) is

 B′n(h,x)=n⋅n−1∑j=0(h(j+1n)−h(jn))(n−1j)xj(1−x)(n−1)−j.

Letting

 hn(x):=n(h((n−1n)x+1n)−h((n−1n)x)),

we can write as

 B′n(h,x)=n−1∑j=0hn(jn−1)(n−1)jxj(1−x)(n−1)−j=Bn−1(hn,x).

Recall that h(x)=−x log x. After some algebra, we get

 hn(x)=−logn−1n+(n−1)(h(x+1n−1)−h(x)).

Furthermore, using properties of h [38], we can bound the absolute difference between h and its Bernstein polynomial as follows.

###### Lemma 4.

For any x∈[0,1] and any positive integer m,

 −1−xm≤Bm(h,x)−h(x)≤0.

As an immediate corollary,

###### Corollary 2.

For any x∈[0,1/2],

 |B′n(h,x)−hn(x)|=|Bn−1(hn,x)−hn(x)|≤1.
###### Proof.

By the closed form of hn above and Lemma 4, for any x∈[0,1/2],

 |Bn−1(hn,x)−hn(x)| ≤(n−1)|(Bn−1(h,x+(n−1)−1)−h(x+(n−1)−1)) −(Bn−1(h,x)−h(x))| ≤(n−1)∣∣ ∣∣max{1−x−(n−1)−1n−1,1−xn−1}∣∣ ∣∣ ≤1.

### 4.2 Approximating the derivative function

Denote the degree-d min-max polynomial of h over [0,1] by

 ~h(x):=d∑j=0bjxj.

As shown in [2], the coefficients of ~h satisfy

 |bj|≤O(23d),

and the error of ~h in approximating h is bounded as

 maxx∈[0,1]|h(x)−~h(x)|≤O(1log2n).

By a change of variables, the induced degree-d approximation of h over In is

 ~h1(x):=d∑j=0bj(ncllogn)j−1xj+(logncllogn)x.

Correspondingly, we have

 maxx∈In|h(x)−~h1(x)|≤O(1nlogn).

To approximate hna, we approximate h(x+(na−1)−1) by ~h1(x+(na−1)−1), and h(x) by ~h1(x). The resulting polynomial is

 ~hna(x) :=−logna−1na+(na−1)(~h1(x+(na−1)−1)−~h1(x)) =−logna−1clalogn+(na−1)(d∑j=0bj(ncllogn)j−1((x+1na−1)j−xj)).

By the above reasoning, the error of ~hna in approximating hna over In satisfies

 maxx∈In|hna(x)−~hna(x)|≤O(nanlogn)≤O(ε).

Moreover, by Corollary 2,

 maxx∈[0,1/2]|B′na(h,x)−hna(x)|=maxx∈[0,1/2]|Bna−1(hna,x)−hna(x)|≤1.

The triangle inequality combines the above two inequalities and yields

 maxx∈In|B′na(h,x)−~hna(x)|≤1+O(ε).

Therefore, denoting

 ~Hna(x):=∫x0~hna(t)dt,

and noting that Bna(h,0)=~Hna(0)=0, we have

###### Lemma 5.

For any x∈In,

 |Bna(h,x)−~Hna(x)|≤∫x0|B′na(h,t)−~hna(t)|dt≤x(1+O(ε)).

## 5 A competitive entropy estimator

In this section, we design an explicit entropy estimator based on ~Hna and the empirical estimator. Note that ~Hna is a polynomial with zero constant term. For each t∈[d], denote

 gt:=d∑j=tbjj+1(ncllogn)j−1(1na−1)j−t(j+1j−t+1).

Setting b′t:=gt for 2≤t≤d and b′1:=g1−log(na−1clalogn), we have the following lemma.

###### Lemma 6.

The function ~Hna can be written as

 ~Hna(x)=d∑t=1b′txt.

In addition, its coefficients satisfy

 |b′t|≤(ncllogn)t−1O(24d).

The proof of the above lemma is delayed to the end of this section.

To simplify our analysis and remove the dependency between the counts Ni, we use the conventional Poisson sampling technique [2, 3]. Specifically, instead of drawing exactly n samples, we make the sample size N an independent Poisson random variable with mean n. This does not change the statistical nature of the problem, as N concentrates highly around its mean (see Lemma 2). We still define Ni as the count of symbol i in XN. Due to Poisson sampling, these counts are now independent, and satisfy Ni∼Poi(npi).

For each t≤n, let Nt–i:=Ni(Ni−1)⋯(Ni−t+1) denote the order-t falling factorial of Ni. The following identity is well known:

 E[Nt–i]=(npi)t, ∀t≤n.
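This identity is the reason falling factorials appear in the construction: under Poisson sampling, Nt–i is an unbiased estimator of (npi)t. A direct numerical check (helper names ours):

```python
import math

def falling_factorial(x, t):
    """x (x-1) ... (x-t+1), the order-t falling factorial."""
    out = 1
    for j in range(t):
        out *= x - j
    return out

def poisson_falling_moment(lam, t, terms=500):
    """E[falling_factorial(X, t)] for X ~ Poisson(lam), by summation;
    the exact value is lam ** t."""
    total, pmf = 0.0, math.exp(-lam)
    for j in range(terms):
        total += falling_factorial(j, t) * pmf
        pmf *= lam / (j + 1)
    return total
```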

Note that for sufficiently small cs, the degree parameter satisfies d≤n. By the linearity of expectation, an unbiased estimator of ~Hna(pi) is

 ^Hna(Ni):=d∑t=1b′tNt–int.

Let N′ be an independent Poisson random variable with mean n, and XN′ be an independent length-N′ sample sequence drawn from →p. Analogously, we denote by N′i the number of times that symbol i appears in XN′. Depending on whether N′i≤ε−1 or not, we classify the symbols into two categories: small- and large-probabilities. For small probabilities, we apply a simple variant of ^Hna; for large probabilities, we estimate h(pi) by essentially the empirical estimator. Specifically, for each i∈[k], we estimate h(pi) by

 ^h(Ni,N′i):=^Hna(Ni)⋅1Ni≤cllogn⋅1N′i≤ε−1+h(Nin)⋅1N′i>ε−1.

Consequently, we approximate H(→p) by

 ^H(XN,XN′):=∑i∈[k]^h(Ni,N′i).

For simplicity of exposition, we will refer to

 ^HS(XN,XN′):=∑i∈[k]^Hna(Ni)⋅1Ni≤cllogn⋅1N′i≤ε−1

as the small-probability estimator, and

 ^HL(XN,XN′):=∑i∈[k]h(Nin)⋅1N′i>ε−1

as the large-probability estimator. Clearly, ^H is the sum of these two estimators.
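The two-regime structure can be sketched as follows. This is our own simplified illustration rather than the paper's exact construction: `coeffs` stands in for the coefficients b′t of Lemma 6, and a single `count_threshold` stands in for the two thresholds cl log n and ε−1.

```python
import math
from collections import Counter

def poly_symbol_estimate(count, n, coeffs):
    """Unbiased estimator of sum_t coeffs[t-1] * p^t under Poisson sampling,
    via E[falling factorial of order t] = (n p)^t."""
    est, ff = 0.0, 1.0
    for t, b in enumerate(coeffs, start=1):
        ff *= count - (t - 1)          # falling factorial N(N-1)...(N-t+1)
        est += b * ff / n ** t
    return est

def split_estimate(sample, sample2, coeffs, count_threshold, n):
    """Two-regime estimator: symbols that look rare in the independent
    second sample (and have small counts) get the polynomial estimator;
    the rest get the plug-in h(N_i / n) with h(x) = -x log x."""
    c1, c2 = Counter(sample), Counter(sample2)
    total = 0.0
    for sym, ni in c1.items():
        if ni <= count_threshold and c2[sym] <= count_threshold:
            total += poly_symbol_estimate(ni, n, coeffs)
        else:
            total += -(ni / n) * math.log(ni / n)
    return total
```

Passing coeffs = [1.0] makes the small-probability branch an unbiased estimator of pi itself, which is a convenient way to test the plumbing.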

In the next two sections, we analyze the bias and mean absolute deviation of ^H. In Section 6, we show that for any →p, the absolute bias of ^H satisfies

 ∣∣E[^H(XN,XN′)]−H(→p)∣∣≤∣∣Bias(^HE,na)∣∣+(1+O(ε))(1∧(ε−1+1)S→pn).

In Section 7, we show that the mean absolute deviation of ^H satisfies

 E∣∣^H(XN,XN′)−E[^H(XN,XN′)]∣∣≤O(1n1−Θ(cs)).

For sufficiently small cs, the triangle inequality combines the above inequalities and yields

 E∣∣^H(XN,XN′)−H(→p)∣∣≤∣∣Bias(^HE,na)∣∣+(1+c⋅ε)∧(S→pεn+1n0.49).

This basically completes the proof of Theorem 6.

### Proof of Lemma 6

We begin by proving the first claim:

 ~Hna(x)=d∑t=1b′txt.

By definition, ~Hna satisfies

 ~Hna(x)+(logna−1clalogn)x =(na−1)(d∑j=1bjj+1(ncllogn)j−1((x+1na−1)j+1−(1na−1)j+1−xj+1)) =d∑t=1xt(d∑j=tbjj+1(ncllogn)j−1(1na−1)j−t(j+1)j−t+1).

The last step follows by reorganizing the indices.

Next we prove the second claim. First note that

 logna−1clalogn≤O(24d).

Since b′t=gt for t≥2, and b′1 differs from g1 only by the logarithmic term above, it suffices to bound the magnitude of gt:

 |gt| ≤d∑j=t∣∣bj∣∣j+1(ncllogn)j−1(1na−1)j−t(j+1j−t+1) ≤d∑j=t∣∣bj∣∣(1cllogn)j−1nt−1(jt) ≤(ncllogn)t−1d∑j=t∣∣bj∣∣(jt) ≤(ncllogn)t−1d∑j=t∣∣bj∣∣(dj−t) ≤(ncllogn)t−1O(24d).

## 6 Bounding the bias of ^H

By the triangle inequality, the absolute bias of ^H in estimating H(→p) satisfies

 ∣∣ ∣∣∑i∈[k](E[^h(Ni,N′i)]−h(pi))∣∣ ∣∣ ≤∣∣ ∣∣∑i∈[k](Bna(h,pi)−h(pi))∣∣ ∣∣ +∣∣ ∣∣∑i∈[k](E[^h(Ni,N′i)]−Bna(h,pi))∣∣ ∣∣.

Note that the first term on the right-hand side is the absolute bias of the empirical estimator with sample size na, i.e.,

 ∣∣Bias(^HE,na)∣∣=∣∣ ∣∣∑i∈[k](Bna(h,pi)−h(pi))∣∣ ∣∣.

Hence, we only need to consider the second term on the right-hand side, which satisfies

 ∣∣ ∣∣∑i∈[k](E[^h(Ni,N′i)]−Bna(h,pi))∣∣ ∣∣≤BiasS+BiasL,

where

 BiasS:=∣∣ ∣∣∑i∈[k]E[(^Hna(Ni)⋅1Ni≤cllogn−Bna(h,pi))⋅1N′i≤ε−1]∣∣ ∣∣

is the absolute bias of the small-probability estimator, and

 BiasL:=∣∣ ∣∣∑i∈[k]E[(h(Nin)−Bna(h,pi))⋅1N′i>ε−1]∣∣ ∣∣

is the absolute bias of the large-probability estimator.

Assume that n is sufficiently large. In Section 6.1, we bound the small-probability bias by

 |BiasS|≤(1+O(ε))(1∧(ε−1+1)S→pn).

In Section 6.2, we bound the large-probability bias by

 |BiasL|≤2(ε∧S→pn).

### 6.1 Bias of the small-probability estimator

We first consider the quantity BiasS. By the triangle inequality,

 BiasS ≤∑i:pi∉In∣∣E[^Hna(Ni)⋅1Ni≤cllogn]−Bna(h,pi)∣∣⋅E[1N′i≤ε−1] +∑i:pi∈In∣∣E[^Hna(Ni)]−Bna(h,pi)∣∣⋅E[1N′i