# Data Amplification: A Unified and Competitive Approach to Property Estimation

Estimating properties of discrete distributions is a fundamental problem in statistical learning. We design the first unified, linear-time, competitive property estimator that, for a wide class of properties and for all underlying distributions, uses just $2n$ samples to achieve the performance attained by the empirical estimator with $n\sqrt{\log n}$ samples. This provides off-the-shelf, distribution-independent "amplification" of the amount of data available relative to common-practice estimators. We illustrate the estimator's practical advantages by comparing it to existing estimators for a wide variety of properties and distributions. In most cases, its performance with $n$ samples is even as good as that of the empirical estimator with $n\log n$ samples, and for essentially all properties, its performance is comparable to that of the best existing estimator designed specifically for that property.


## 1 Distribution Properties

Let $\mathcal{D}_{\mathcal{X}}$ denote the collection of distributions over a countable set $\mathcal{X}$ of finite or infinite cardinality $k$. A distribution property is a mapping $f:\mathcal{D}_{\mathcal{X}}\to\mathbb{R}$. Many applications call for estimating properties of an unknown distribution $p\in\mathcal{D}_{\mathcal{X}}$ from its samples. Often these properties are additive, namely, $f$ can be written as a sum of functions of the probabilities. Symmetric additive properties can be written as

$$f(p)\stackrel{\text{def}}{=}\sum_{x\in\mathcal{X}}f(p_x),$$

and arise in many biological, genomic, and language-processing applications:

Shannon entropy $\sum_{x}-p_x\log p_x$, where throughout the paper $\log$ is the natural logarithm, is the fundamental information measure arising in a variety of applications [info].

Normalized support size $\sum_{x}\frac{1}{k}\mathbb{1}_{p_x>0}$ plays an important role in population [population] and vocabulary size estimation [vocabulary].

Normalized support coverage is the normalized expected number of distinct elements observed upon drawing independent samples; it arises in ecological [ecological], genomic [genomic], and database studies [database].

Power sum $\sum_{x}p_x^{a}$ arises in Rényi entropy [renyientropy], Gini impurity [gini], and related diversity measures.

Distance to uniformity $\sum_{x}\left|p_x-\frac{1}{k}\right|$ appears in property testing [testingu].

More generally, non-symmetric additive properties can be expressed as

$$f(p)\stackrel{\text{def}}{=}\sum_{x\in\mathcal{X}}f_x(p_x),$$

for example distances to a given distribution, such as:

$L_1$ distance $\sum_{x}|p_x-q_x|$, the $L_1$ distance of the unknown distribution $p$ from a given distribution $q$, appears in hypothesis-testing errors [testing].

KL divergence $\sum_{x}p_x\log\frac{p_x}{q_x}$, the KL divergence of the unknown distribution $p$ from a given distribution $q$, reflects the compression [info] and prediction [kl] degradation when estimating $p$ by $q$.

Given one of these, or other, properties, we would like to estimate its value based on samples from an underlying distribution.
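As a concrete illustration of additive properties, the sketch below evaluates several of them on a known distribution. It assumes the standard textbook definitions of each property; the function names, the exponent argument `a`, and the example distributions are illustrative choices rather than notation from the paper.

```python
import math

def additive_property(p, f):
    """Sum f(p_x) over all symbols of p (a list of probabilities)."""
    return sum(f(px) for px in p)

def shannon_entropy(p):
    # -p log p with the convention 0 log 0 = 0 (natural logarithm).
    return additive_property(p, lambda x: -x * math.log(x) if x > 0 else 0.0)

def normalized_support_size(p):
    k = len(p)
    return additive_property(p, lambda x: (1.0 if x > 0 else 0.0) / k)

def power_sum(p, a):
    return additive_property(p, lambda x: x ** a)

def distance_to_uniformity(p):
    k = len(p)
    return additive_property(p, lambda x: abs(x - 1.0 / k))

def l1_distance(p, q):
    # Non-symmetric: each symbol has its own term |p_x - q_x|.
    return sum(abs(px - qx) for px, qx in zip(p, q))

def kl_divergence(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(shannon_entropy(uniform), shannon_entropy(skewed))
print(distance_to_uniformity(skewed), kl_divergence(skewed, uniform))
```

The symmetric properties share a single function applied to every probability, while the two distances need a per-symbol reference value, which is why they are written as sums over pairs.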

## 2 Recent Results

In the common property-estimation setting, the unknown distribution $p$ generates $n$ i.i.d. samples $X^n\sim p^n$, which in turn are used to estimate $f(p)$. Specifically, given property $f$, we would like to construct an estimator $\hat{f}$ such that $\hat{f}(X^n)$ is as close to $f(p)$ as possible. The standard estimation loss is the expected squared loss

$$\mathbb{E}_{X^n\sim p^n}\left(\hat{f}(X^n)-f(p)\right)^2.$$

Generating exactly $n$ samples creates dependence between the number of times different symbols appear. To avoid these dependencies and simplify derivations, we use the well-known Poisson sampling [poisamp] paradigm. We first select $N\sim\mathrm{Poi}(n)$, and then generate $N$ independent samples according to $p$. This modification does not change the statistical nature of the estimation problem since a Poisson random variable is exponentially concentrated around its mean. Correspondingly the estimation loss is

$$L_{\hat{f}}(p,n)\stackrel{\text{def}}{=}\mathbb{E}_{N\sim\mathrm{Poi}(n)}\left[\mathbb{E}_{X^N\sim p^N}\left(\hat{f}(X^N)-f(p)\right)^2\right].$$

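For simplicity, let $N_x$ be the number of occurrences of symbol $x$ in $X^N$. An intuitive estimator is the plug-in empirical estimator $f^E$ that first uses the $N$ samples to estimate $p_x$ by $N_x/N$ and then estimates $f(p)$ as

$$f^E(X^N)\stackrel{\text{def}}{=}\begin{cases}\sum_{x\in\mathcal{X}}f_x\left(\frac{N_x}{N}\right) & N>0,\\ 0 & N=0.\end{cases}$$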
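The plug-in estimator and its Poissonized loss can be sketched as follows. This is a minimal simulation under stated assumptions: per-symbol counts are drawn as independent $\mathrm{Poi}(np_x)$ variables (equivalent to Poisson sampling), the property is Shannon entropy, and the helper names, trial count, and seed are illustrative.

```python
import math
import random

def knuth_poisson(lam, rng):
    """Draw from Poi(lam) with Knuth's multiplication method (fine for moderate lam)."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count

def entropy_term(x):
    return -x * math.log(x) if x > 0 else 0.0

def empirical_estimate(counts, f):
    """Plug-in estimate sum_x f(N_x / N), defined as 0 when N = 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(f(c / total) for c in counts)

def poissonized_loss(p, n, f, trials=500, seed=0):
    """Monte Carlo approximation of the expected squared loss under Poisson sampling."""
    rng = random.Random(seed)
    truth = sum(f(px) for px in p)
    sq_err = 0.0
    for _ in range(trials):
        counts = [knuth_poisson(n * px, rng) for px in p]
        sq_err += (empirical_estimate(counts, f) - truth) ** 2
    return sq_err / trials

p = [0.1] * 10
for n in (20, 200):
    print(n, poissonized_loss(p, n, entropy_term))
```

Drawing the counts independently per symbol is exactly what Poisson sampling buys: no joint multinomial draw is needed.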

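Given an error tolerance parameter $\delta$, the $(\delta,p)$-sample complexity of an estimator $\hat{f}$ in estimating $f(p)$ is the smallest number of samples allowing for estimation loss smaller than $\delta$,

$$n_{\hat{f}}(\delta,p)\stackrel{\text{def}}{=}\min_{n\in\mathbb{N}}\left\{L_{\hat{f}}(p,n)<\delta\right\}.$$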

Since $p$ is unknown, the common min-max approach considers the worst case $\delta$-sample complexity of an estimator over all possible $p$,

$$n_{\hat{f}}(\delta)\stackrel{\text{def}}{=}\max_{p\in\mathcal{D}_{\mathcal{X}}}n_{\hat{f}}(\delta,p).$$

Finally, the estimator minimizing $n_{\hat{f}}(\delta)$ is called the min-max estimator of property $f$. It follows that its $\delta$-sample complexity is the smallest Poisson parameter $n$, or roughly the number of samples, needed for any estimator to estimate $f(p)$ to estimation loss $\delta$ for all $p$.

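There has been a significant amount of recent work on property estimation. In particular, it was shown that for all seven properties mentioned earlier, the min-max estimator improves the sample complexity by a logarithmic factor compared to the empirical estimator $f^E$. For example, for Shannon entropy [mmentro], normalized support size [mmsize], normalized support coverage [mmcover], and distance to uniformity [mml1], the min-max sample complexity scales as $k/\log k$, while that of $f^E$ scales as $k$. Note that for normalized support size, $\mathcal{D}_{\mathcal{X}}$ is typically replaced by the subclass of distributions whose nonzero probabilities are all at least $1/k$, and for normalized support coverage, $k$ is replaced by the number of samples whose coverage is being estimated.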

## 3 New Results

While the results already obtained are impressive, they also have some shortcomings. First, recent state-of-the-art estimators are designed [mmentro; mmsize; mml1] or analyzed [mmcover; valiant] to estimate each individual property. Consequently these estimators cover only a few properties. Second, estimators proposed for more general properties [mmcover; jnew] are limited to symmetric properties and are not known to be computable in time linear in the sample size. Last but not least, by design, min-max estimators are optimized for the "worst" distribution in a class. In practice, this distribution is often very different from, and frequently much more complex than, the actual underlying distribution. This "pessimistic" worst-case design results in sub-optimal estimation, as borne out by both the theoretical and experimental results.

In Section 6, we design an estimator $f^*$ that addresses all these issues. It is unified and applies to a wide range of properties, including all previously-mentioned properties (with a restriction on the exponent for power sums) and all Lipschitz properties $f$ where each $f_x$ is Lipschitz. It can be computed in time linear in the sample size. It is competitive in that it is guaranteed to perform well not just for the worst distribution in the class, but for each and every distribution. It "amplifies" the data in that it uses just $2n$ samples to approximate the performance of the empirical estimator with $n\sqrt{\log n}$ samples regardless of the underlying distribution $p$, thereby providing an off-the-shelf, distribution-independent "amplification" of the amount of data available relative to the estimators used by many practitioners. As we show in Section 8, it also works well in practice, outperforming existing estimators and often working as well as the empirical estimator with even $n\log n$ samples.

For a more precise description, let $o(1)$ represent a quantity that vanishes as $n\to\infty$ and write $a\lesssim b$ for $a\le(1+o(1))b$. Suppressing small $\epsilon$ for simplicity first, we show that

$$L_{f^*}(p,2n)\lesssim L_{f^E}\left(p,n\sqrt{\log n}\right)+o(1),$$

where the first right-hand-side term relates the performance of $f^*$ with $2n$ samples to that of $f^E$ with $n\sqrt{\log n}$ samples. The second term adds a small loss that diminishes at a rate independent of the support size $k$, and for fixed $k$ decreases roughly as $1/n$. Specifically, we prove,

###### Theorem 1.

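For every property $f$ satisfying the smoothness conditions in Section 5, there is a constant $C_f$ such that for all $p\in\mathcal{D}_{\mathcal{X}}$ and all $\epsilon\in(0,\frac{1}{2})$,

$$L_{f^*}(p,2n)\le\left(1+\frac{3}{\log^{\epsilon}n}\right)L_{f^E}\left(p,n\log^{\frac{1}{2}-\epsilon}n\right)+C_f\min\left\{\frac{k}{n\log^{\epsilon}n}+\tilde{O}\left(\frac{1}{n}\right),\frac{1}{\log^{\epsilon}n}\right\}.$$

The $\tilde{O}$ reflects a multiplicative polylogarithmic factor unrelated to $k$ and $p$. Again, for normalized support size, $\mathcal{D}_{\mathcal{X}}$ is replaced by the restricted class mentioned earlier, and we also modify $f^*$ as follows: in one parameter regime we apply $f^*$ itself, and in the complementary regime we apply the corresponding min-max estimator [mmsize]. However, for experiments shown in Section 8, the original $f^*$ is used without such modification. In Section 7, we note that for several properties, the second term can be strengthened so that it does not depend on $\epsilon$.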

## 4 Implications

Theorem 1 has three important implications.

#### Data amplification

Many modern applications, such as those arising in genomics and natural-language processing, concern properties of distributions whose support size $k$ is comparable to or even larger than the number of samples $n$. For these properties, the estimation loss of the empirical estimator $f^E$ is often much larger than $1/\log^{\epsilon}n$, hence the proposed estimator, $f^*$, yields a much better estimate whose performance parallels that of $f^E$ with $n\sqrt{\log n}$ samples. This allows us to amplify the available data by a factor of $\sqrt{\log n}$ regardless of the underlying distribution.

Note however that for some properties $f$, when the underlying distributions are limited to a fixed small support size, the loss of $f^E$ can already fall below the slack term of Theorem 1. For such small support sizes, $f^*$ may not improve the estimation loss.

#### Unified estimator

Recent works either prove efficacy results individually for each property [mmentro; mmsize; mml1], or are not known to be computable in linear time [mmcover; jnew].

By contrast, $f^*$ is a linear-time estimator that works well for all properties satisfying simple Lipschitz-type and second-order smoothness conditions. All properties described earlier (Shannon entropy, normalized support size, normalized support coverage, power sum, $L_1$ distance, and KL divergence) satisfy these conditions, and therefore $f^*$ applies to all of them.

More generally, recall that a property $f$ is Lipschitz if all $f_x$ are Lipschitz. It can be shown, e.g. [learning], that with $O(k)$ samples, $f^E$ approximates a $k$-element distribution to a constant $L_1$ distance, and hence also estimates any Lipschitz property to a constant loss. It follows that $f^*$ estimates any Lipschitz property over a distribution of support size $k$ to constant estimation loss with $O(k/\sqrt{\log k})$ samples. This provides the first general sublinear-sample estimator for all Lipschitz properties.

#### Competitive optimality

Previous results were geared towards the estimator's worst estimation loss over all possible distributions. For example, they derived estimators that approximate the distance to uniformity of any $k$-element distribution with $O(k/\log k)$ samples, and showed that this number is optimal, as for some distribution classes estimating this distance requires $\Omega(k/\log k)$ samples.

However, this approach may be too pessimistic. Distributions are rarely maximally complex or hardest to estimate. For example, most natural scenes have distinct simple patterns, such as straight lines or flat faces, and hence can be learned relatively easily.

More concretely, consider learning distance to uniformity for a collection of distributions with bounded entropy. It can be shown that for sufficiently large $k$, $f^E$ learns distance to uniformity over this collection to constant estimation loss with far fewer samples than the worst case requires. Theorem 1 therefore shows that $f^*$ attains constant estimation loss with fewer samples still, by the amplification factor. (In fact, without even knowing that the entropy is bounded.) By contrast, the original min-max estimator results would still require the much larger $\Omega(k/\log k)$ samples.

The rest of the paper is organized as follows. Section 5 describes mild smoothness conditions satisfied by many natural properties, including all those mentioned above. Section 6 describes the estimator’s explicit form and some intuition behind its construction and performance. Section 7 describes two improvements of the estimator addressed in the supplementary material. Lastly, Section 8 describes various experiments that illustrate the estimator’s power and competitiveness. For space considerations, we relegate all the proofs to the appendix.

## 5 Smooth properties

Many natural properties, including all those mentioned in the introduction, satisfy some basic smoothness conditions. For $h\in(0,1]$, consider the Lipschitz-type parameter

$$\ell_f(h)\stackrel{\text{def}}{=}\max_{x}\max_{u,v\in[0,1]:\,\max\{u,v\}\ge h}\frac{|f_x(u)-f_x(v)|}{|u-v|},$$

and the second-order smoothness parameter, resembling the modulus of continuity in approximation theory [approx; approxconst],

$$\omega^2_f(h)\stackrel{\text{def}}{=}\max_{x}\max_{u,v\in[0,1]:\,|u-v|\le 2h}\left\{\left|\frac{f_x(u)+f_x(v)}{2}-f_x\left(\frac{u+v}{2}\right)\right|\right\}.$$

We consider properties satisfying the following conditions: (1) for all $x\in\mathcal{X}$, $f_x(0)=0$; (2) $\ell_f(h)$ grows at most polylogarithmically in $1/h$ for $h\in(0,1]$; (3) $\omega^2_f(h)\le S_f\cdot h$ for some absolute constant $S_f$.

Note that the first condition, $f_x(0)=0$, entails no loss of generality. The second condition implies that $f_x$ is continuous over $[0,1]$, and in particular right-continuous at 0 and left-continuous at 1. It is easy to see that continuity is also essential for consistent estimation. Observe also that these conditions are more general than assuming that $f_x$ is Lipschitz, as can be seen for entropy where $\ell_f(h)$ is unbounded as $h\to 0$, and that all seven properties described earlier satisfy these three conditions. Finally, to ensure that $L_1$ distance satisfies these conditions, we let $f_x(p_x)=|p_x-q_x|-q_x$.
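To get a feel for the two smoothness parameters, the following sketch approximates them on a uniform grid for the entropy function $f_x(u)=-u\log u$. The grid search is an assumption made for illustration: it returns a lower bound on each maximum in general, and it requires $h$ to be a multiple of the grid spacing.

```python
import math

def entropy_term(x):
    return -x * math.log(x) if x > 0 else 0.0

def lipschitz_param(f, h, grid=200):
    """Grid approximation of max |f(u)-f(v)| / |u-v| over pairs with max{u,v} >= h."""
    pts = [i / grid for i in range(grid + 1)]
    h_idx = int(round(h * grid))
    best = 0.0
    for i in range(grid + 1):
        for j in range(i + 1, grid + 1):
            if j >= h_idx:  # j > i, so max{u, v} is pts[j]
                slope = abs(f(pts[i]) - f(pts[j])) / (pts[j] - pts[i])
                best = max(best, slope)
    return best

def second_order_param(f, h, grid=200):
    """Grid approximation of max |(f(u)+f(v))/2 - f((u+v)/2)| over |u-v| <= 2h."""
    pts = [i / grid for i in range(grid + 1)]
    window = int(round(2 * h * grid))
    best = 0.0
    for i in range(grid + 1):
        for j in range(i, min(grid, i + window) + 1):
            mid = (pts[i] + pts[j]) / 2
            best = max(best, abs((f(pts[i]) + f(pts[j])) / 2 - f(mid)))
    return best

for h in (0.05, 0.1, 0.2):
    print(h, lipschitz_param(entropy_term, h), second_order_param(entropy_term, h))
```

For entropy, the printed second-order values scale linearly in $h$, while the Lipschitz-type values grow as $h$ shrinks, which matches the remark that the conditions are weaker than Lipschitzness.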

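## 6 The Estimator $f^*$

Given the sample size $n$, define an amplification parameter $t>1$, and let $N''\sim\mathrm{Poi}(nt)$ be the amplified sample size. Generate a sample sequence $X^{N''}$ independently from $p$, and let $N''_x$ denote the number of times symbol $x$ appeared in $X^{N''}$. The empirical estimate of $f(p)$ with $\mathrm{Poi}(nt)$ samples is then

$$f^E(X^{N''})=\sum_{x\in\mathcal{X}}f_x\left(\frac{N''_x}{N''}\right).$$

Our objective is to construct an estimator $f^*$ that approximates $f^E(X^{N''})$ for large $t$ using just $\mathrm{Poi}(2n)$ samples.

Since $N''$ sharply concentrates around $nt$, we can show that $f^E(X^{N''})$ can be approximated by the modified empirical estimator,

$$f^{ME}(X^{N''})\stackrel{\text{def}}{=}\sum_{x\in\mathcal{X}}f_x\left(\frac{N''_x}{nt}\right),$$

where each $f_x$ is extended beyond $[0,1]$ so that $f_x(N''_x/(nt))$ is defined even when $N''_x>nt$.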

Since large probabilities are easier to estimate, it is natural to set a threshold parameter $s$ and rewrite the modified estimator as a separate sum over small and large probabilities,

$$f^{ME}(X^{N''})=\sum_{x\in\mathcal{X}}f_x\left(\frac{N''_x}{nt}\right)\mathbb{1}_{p_x\le s}+\sum_{x\in\mathcal{X}}f_x\left(\frac{N''_x}{nt}\right)\mathbb{1}_{p_x>s}.$$

Note however that we do not know the exact probabilities. Instead, we draw two independent sample sequences $X^N$ and $X^{N'}$ from $p$, each of an independent $\mathrm{Poi}(n)$ size, and let $N_x$ and $N'_x$ be the number of occurrences of $x$ in the first and second sample sequence respectively. We then set a small/large-probability threshold $s_0$ and classify a probability $p_x$ as large or small according to $N'_x$:

$$f^{ME}_S(X^{N''},X^{N'})\stackrel{\text{def}}{=}\sum_{x\in\mathcal{X}}f_x\left(\frac{N''_x}{nt}\right)\mathbb{1}_{N'_x\le s_0}$$

is the modified small-probability empirical estimator, and

$$f^{ME}_L(X^{N''},X^{N'})\stackrel{\text{def}}{=}\sum_{x\in\mathcal{X}}f_x\left(\frac{N''_x}{nt}\right)\mathbb{1}_{N'_x>s_0}$$

is the modified large-probability empirical estimator. We rewrite the modified empirical estimator as

$$f^{ME}(X^{N''})=f^{ME}_S(X^{N''},X^{N'})+f^{ME}_L(X^{N''},X^{N'}).$$
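The split of the modified empirical estimator by an independent pilot sample can be sketched as below. The counts, the threshold value, and the clamping of arguments above 1 are illustrative assumptions; only the structure (classify each symbol by its pilot count, then sum the two groups separately) follows the text.

```python
import math

def entropy_term(x):
    x = min(x, 1.0)  # assumption: arguments above 1 are clamped to 1
    return -x * math.log(x) if x > 0 else 0.0

def modified_empirical(counts, n, t, f):
    """Sum of f(N''_x / (n t)) over all symbols."""
    return sum(f(c / (n * t)) for c in counts)

def split_modified_empirical(counts, pilot_counts, n, t, s0, f):
    """Split the sum by whether the pilot count N'_x is at most s0 (small) or above (large)."""
    small = sum(f(c / (n * t)) for c, c1 in zip(counts, pilot_counts) if c1 <= s0)
    large = sum(f(c / (n * t)) for c, c1 in zip(counts, pilot_counts) if c1 > s0)
    return small, large

counts = [40, 7, 3, 0, 150]       # hypothetical N''_x from a Poi(n t)-size sample
pilot_counts = [12, 2, 0, 1, 48]  # hypothetical N'_x from an independent Poi(n)-size sample
n, t, s0 = 50, 4, 3
small, large = split_modified_empirical(counts, pilot_counts, n, t, s0, entropy_term)
print(small, large, modified_empirical(counts, n, t, entropy_term))
```

Because the classification uses a sample independent of the one being summed, the indicator and the summand are independent, which is what later allows their expectations to factor.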

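Correspondingly, we express our estimator as a combination of small- and large-probability estimators,

$$f^*(X^N,X^{N'})\stackrel{\text{def}}{=}f^*_S(X^N,X^{N'})+f^*_L(X^N,X^{N'}).$$

The large-probability estimator approximates $f^{ME}_L(X^{N''},X^{N'})$ as

$$f^*_L(X^N,X^{N'})\stackrel{\text{def}}{=}f^{ME}_L(X^N,X^{N'})=\sum_{x\in\mathcal{X}}f_x\left(\frac{N_x}{n}\right)\mathbb{1}_{N'_x>s_0}.$$

Note that we replaced the length-$\mathrm{Poi}(nt)$ sample sequence $X^{N''}$ by the independent length-$\mathrm{Poi}(n)$ sample sequence $X^N$, normalizing the counts by $n$ accordingly. We can do so as large probabilities are well estimated from fewer samples.

The small-probability estimator $f^*_S$ approximates $f^{ME}_S(X^{N''},X^{N'})$ and is more involved. We outline its construction below; details can be found in Appendix G. The expected value of $f^{ME}_S$ for the small probabilities is

$$\mathbb{E}\left[f^{ME}_S(X^{N''},X^{N'})\right]=\sum_{x\in\mathcal{X}}\mathbb{E}\left[\mathbb{1}_{N'_x\le s_0}\right]\mathbb{E}\left[f_x\left(\frac{N''_x}{nt}\right)\right].$$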

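Let $\lambda_x\stackrel{\text{def}}{=}np_x$ be the expected number of times symbol $x$ will be observed in $X^N$, and define

$$g_x(v)\stackrel{\text{def}}{=}f_x\left(\frac{v}{nt}\right)\left(\frac{t}{t-1}\right)^{v}.$$

Then, since $N''_x\sim\mathrm{Poi}(\lambda_x t)$ and $f_x(0)=0$,

$$\mathbb{E}\left[f_x\left(\frac{N''_x}{nt}\right)\right]=\sum_{v=0}^{\infty}e^{-\lambda_x t}\frac{(\lambda_x t)^{v}}{v!}f_x\left(\frac{v}{nt}\right)=e^{-\lambda_x}\sum_{v=1}^{\infty}e^{-\lambda_x(t-1)}\frac{(\lambda_x(t-1))^{v}}{v!}g_x(v).$$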
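The change of Poisson parameter in the last identity can be checked numerically. The sketch below compares the two sides for the entropy function; the values of $n$, $t$, and $\lambda_x$, the truncation point `vmax`, and the clamping of arguments above 1 are assumptions chosen for illustration.

```python
import math

def entropy_term(x):
    x = min(x, 1.0)  # assumption: arguments above 1 are clamped to 1
    return -x * math.log(x) if x > 0 else 0.0

def poisson_pmf(lam, v):
    """P(Poi(lam) = v), computed in log space for numerical stability."""
    return math.exp(-lam + v * math.log(lam) - math.lgamma(v + 1))

def direct_expectation(f, lam, n, t, vmax=200):
    # E[f(N'' / (n t))] with N'' ~ Poi(lam * t), truncated at vmax terms.
    return sum(poisson_pmf(lam * t, v) * f(v / (n * t)) for v in range(vmax))

def reweighted_expectation(f, lam, n, t, vmax=200):
    # e^{-lam} * sum_{v>=1} P(Poi(lam (t-1)) = v) * g(v), with g(v) = f(v/(nt)) (t/(t-1))^v.
    def g(v):
        return f(v / (n * t)) * (t / (t - 1)) ** v
    total = sum(poisson_pmf(lam * (t - 1), v) * g(v) for v in range(1, vmax))
    return math.exp(-lam) * total

n, t, lam = 50, 2.0, 3.0
print(direct_expectation(entropy_term, lam, n, t))
print(reweighted_expectation(entropy_term, lam, n, t))
```

The two printed numbers agree up to floating-point error, because $e^{-\lambda t}(\lambda t)^v=e^{-\lambda}\,e^{-\lambda(t-1)}(\lambda(t-1))^v\,(t/(t-1))^v$ term by term.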

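As explained in Appendix G.1, the sum beyond a truncation threshold

$$u_{\max}\stackrel{\text{def}}{=}2s_0t+2s_0-1$$

is small, hence it suffices to consider the truncated sum

$$e^{-\lambda_x}\sum_{v=1}^{u_{\max}}e^{-\lambda_x(t-1)}\frac{(\lambda_x(t-1))^{v}}{v!}g_x(v).$$

Applying the polynomial smoothing technique in [pnas], Appendix G approximates the above summation by

$$e^{-\lambda_x}\sum_{v=1}^{\infty}h_{x,v}\lambda_x^{v},$$

where

$$h_{x,v}=(t-1)^{v}\sum_{u=1}^{(u_{\max}\wedge v)}\frac{g_x(u)(-1)^{v-u}}{(v-u)!\,u!}\left(1-e^{-r}\sum_{j=0}^{v+u}\frac{r^{j}}{j!}\right),$$

and

$$r\stackrel{\text{def}}{=}10s_0t+10s_0.$$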

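Observe that $1-e^{-r}\sum_{j=0}^{v+u}\frac{r^{j}}{j!}$ is the tail probability of a $\mathrm{Poi}(r)$ distribution that diminishes rapidly beyond $r$. Hence $r$ determines which summation terms will be attenuated, and serves as a smoothing parameter.

Since $\mathbb{E}[v!\cdot\mathbb{1}_{N_x=v}]=e^{-\lambda_x}\lambda_x^{v}$ for $N_x\sim\mathrm{Poi}(\lambda_x)$, an unbiased estimator of $e^{-\lambda_x}\sum_{v=1}^{\infty}h_{x,v}\lambda_x^{v}$ is

$$\sum_{v=1}^{\infty}h_{x,v}\,v!\cdot\mathbb{1}_{N_x=v}=h_{x,N_x}\cdot N_x!.$$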
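The unbiasedness of this construction rests on the identity $\mathbb{E}[v!\cdot\mathbb{1}_{N=v}]=e^{-\lambda}\lambda^{v}$ for $N\sim\mathrm{Poi}(\lambda)$. The sketch below verifies it exactly for a finite coefficient sequence; the coefficients `h` are arbitrary placeholders, not the $h_{x,v}$ of the paper.

```python
import math

def poisson_pmf(lam, v):
    return math.exp(-lam) * lam ** v / math.factorial(v)

def expected_estimate(h, lam):
    """E[h_N * N!] for N ~ Poi(lam), with h_v taken as 0 for v >= len(h)."""
    return sum(poisson_pmf(lam, v) * h[v] * math.factorial(v) for v in range(len(h)))

def target(h, lam):
    """The quantity being estimated: e^{-lam} * sum_v h_v lam^v."""
    return math.exp(-lam) * sum(h[v] * lam ** v for v in range(len(h)))

h = [0.0, 0.5, -0.25, 0.125, 0.03]  # hypothetical coefficients with h_0 = 0
for lam in (0.3, 1.7, 4.0):
    print(lam, expected_estimate(h, lam), target(h, lam))
```

The factorial in the estimator cancels the $1/v!$ of the Poisson mass function, so any power series in $\lambda$ weighted by $e^{-\lambda}$ has an unbiased estimator that is linear in the indicator of the observed count.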

Finally, the small-probability estimator is

$$f^*_S(X^N,X^{N'})\stackrel{\text{def}}{=}\sum_{x\in\mathcal{X}}h_{x,N_x}\cdot N_x!\cdot\mathbb{1}_{N'_x\le s_0}.$$

## 7 Extensions

In Theorem 1, for fixed $n$, as $\epsilon\to 0$, the final slack term $1/\log^{\epsilon}n$ approaches a constant. For certain properties it can be improved. For normalized support size, normalized support coverage, and distance to uniformity, a more involved estimator improves this term to

$$C_{f,\gamma}\min\left\{\frac{k}{n\log^{1-\epsilon}n}+\frac{1}{n^{1-\gamma}},\frac{1}{\log^{1+\epsilon}n}\right\},$$

for any fixed constant $\gamma$.

For Shannon entropy, correcting the estimator's bias [emiller] and further dividing the probability regions reduces the slack term even more, to

$$C_{f,\gamma}\min\left\{\frac{k^{2}}{n^{2}\log^{2-\epsilon}n}+\frac{1}{n^{1-\gamma}},\frac{1}{\log^{2+2\epsilon}n}\right\}.$$

Finally, the theorem compares the performance of $f^*$ with $2n$ samples to that of $f^E$ with $n\log^{\frac{1}{2}-\epsilon}n$ samples. As shown in the next section, the performance is often comparable to that of $f^E$ with $n\log n$ samples. It would be interesting to prove a competitive result that enlarges the amplification to $n\log^{1-\epsilon}n$ or even $n\log n$. This would be essentially the best possible, as it can be shown that for the symmetric properties mentioned in the introduction, amplification cannot exceed $O(n\log n)$.

## 8 Experiments

We evaluated the new estimator $f^*$ by comparing its performance to several recent estimators [mmentro; mmsize; mmcover; pnas; jvhw]. To ensure robustness of the results, we performed the comparisons for all the symmetric properties described in the introduction: entropy, support size, support coverage, power sums, and distance to uniformity. For each property, we considered six underlying distributions: uniform, Dirichlet-drawn, Zipf, binomial, Poisson, and geometric. The results for the first three properties are shown in Figures 1 to 3; the plots for the final two properties can be found in Appendix I. For nearly all tested properties and distributions, $f^*$ achieved state-of-the-art performance.

As Theorem 1 implies, for all five properties, with just $n$ (not even $2n$) samples, $f^*$ performed as well as the empirical estimator $f^E$ with roughly $n\sqrt{\log n}$ samples. Interestingly, in most cases $f^*$ performed even better, similar to $f^E$ with $n\log n$ samples.

Relative to previous estimators, depending on the property and distribution, different previous estimators were best. But in essentially all experiments, $f^*$ was either comparable to or outperformed the best previous estimator. The only exception was PML, which attempts to smooth the estimate and hence performed better on uniform and near-uniform Dirichlet-drawn distributions for several properties.

Two additional advantages of $f^*$ may be worth noting. First, underscoring its competitive performance for each distribution, the more skewed the distribution the better is its relative efficacy. This is because most other estimators are optimized for the worst distribution, and work less well for skewed ones.

Second, by its simple nature, the empirical estimator $f^E$ is very stable. Designed to emulate $f^E$ for more samples, $f^*$ is therefore stable as well. Note also that $f^E$ is not always the best estimator choice. For example, it always underestimates the distribution's support size. Yet even for normalized support size, Figure 2 shows that $f^*$ outperforms other estimators including those designed specifically for this property (except as above for PML on near-uniform distributions).

The next subsection describes the experimental settings. Additional details and further interpretation of the observed results can be found in Appendix I.

### Experimental settings

We tested the five properties on the following distributions: uniform distribution; a distribution randomly generated from a Dirichlet prior with parameter 2; Zipf distribution with a fixed power; binomial distribution with a fixed success probability; Poisson distribution with a fixed mean; geometric distribution with a fixed success probability.

With the exception of normalized support coverage, all other properties were tested on distributions of a common support size $k$. The geometric, Poisson, and Zipf distributions were truncated and re-normalized. The number of samples, $n$, was varied over a range shown logarithmically on the horizontal axis. Each experiment was repeated 100 times and the reported results, shown on the vertical axis, reflect their mean squared error (MSE).

We compared the estimator's performance with $n$ samples to that of four other recent estimators as well as the empirical estimator with $n$, $n\sqrt{\log n}$, and $n\log n$ samples. We chose the amplification parameter $t$ through a tuning exponent that was selected based on independent data, and similarly for $s_0$. Since $f^*$ performed even better than Theorem 1 guarantees, the exponent ended up between 0 and 0.3 for all properties, indicating amplification even beyond $n\sqrt{\log n}$. The graphs denote $f^*$ by NEW, $f^E$ with $n$ samples by Empirical, $f^E$ with $n\sqrt{\log n}$ samples by Empirical+, $f^E$ with $n\log n$ samples by Empirical++, the pattern maximum likelihood estimator in [mmcover] by PML, the Shannon-entropy estimator in [jvhw] by JVHW, the normalized-support-size estimator in [mmsize] and the entropy estimator in [mmentro] by WY, and the smoothed Good-Toulmin estimator for normalized support coverage estimation [pnas], slightly modified to account for previously-observed elements that may appear in the subsequent sample, by SGT.

While the empirical and the new estimators have the same form for all properties, as noted in the introduction, the recent estimators are property-specific, and each was derived for a subset of the properties. In the experiments we applied these estimators to all the properties for which they were derived. Also, additional estimators [ventro; pentro; mentro; gsize; ccover; cacover; jcover] for various properties were compared in [mmentro; mmsize; pnas; jvhw] and found to perform similarly to or worse than recent estimators, hence we do not test them here.

## 9 Conclusion

In this paper, we considered the fundamental learning problem of estimating properties of discrete distributions. The best-known distribution-property estimation technique is the "empirical estimator" that takes the data's empirical frequency and plugs it into the property functional. We designed a general estimator that, for a wide class of properties, uses only $2n$ samples to achieve the same accuracy as the plug-in estimator with $n\sqrt{\log n}$ samples. This provides an off-the-shelf method for amplifying the data available relative to traditional approaches. For all the properties and distributions we have tested, the proposed estimator performed as well as the best estimator(s). A meaningful future research direction would be to verify the optimality of our results: the amplification factor and the slack terms. There are also several important properties that are not included in our paper, for example, Rényi entropy [jrenyi] and the generalized distance to uniformity [yi2018; batu17]. It would be interesting to determine whether data amplification could be obtained for these properties as well.

## Appendix A Smooth properties

Theorem 1 holds for a wide class of properties $f$. For $h\in(0,1]$, consider the Lipschitz-type parameter

$$\ell_f(h)\stackrel{\text{def}}{=}\max_{x}\max_{u,v\in[0,1]:\,\max\{u,v\}\ge h}\frac{|f_x(u)-f_x(v)|}{|u-v|},$$

and the second-order smoothness parameter, resembling similar approximation-theory terms [approx; approxconst],

$$\omega^2_f(h)\stackrel{\text{def}}{=}\max_{x}\max_{u,v\in[0,1]:\,|u-v|\le 2h}\left\{\left|\frac{f_x(u)+f_x(v)}{2}-f_x\left(\frac{u+v}{2}\right)\right|\right\}.$$

We assume that $f$ satisfies the following conditions:

• for all $x\in\mathcal{X}$, $f_x(0)=0$;

• $\ell_f(h)$ grows at most polylogarithmically in $1/h$ for $h\in(0,1]$;

• $\omega^2_f(h)\le S_f\cdot h$ for some absolute constant $S_f$.

Note that the first condition, $f_x(0)=0$, entails no loss of generality. The second condition implies that $f_x$ is continuous over $[0,1]$, and in particular right-continuous at 0 and left-continuous at 1. It is easy to see that continuity is also essential for consistent estimation. Observe also that these conditions are more general than assuming that $f_x$ is Lipschitz, as can be seen for entropy where $\ell_f(h)$ is unbounded as $h\to 0$, and that all seven properties described earlier satisfy these three conditions. Finally, to ensure that $L_1$ distance satisfies these conditions, we let $f_x(p_x)=|p_x-q_x|-q_x$.

For normalized support size, we modify our estimator $f^*$ as follows: in one parameter regime we apply the estimator $f^*$ itself, and in the complementary regime we apply the corresponding min-max estimator [mmsize]. However, for experiments shown in Section I, the original estimator $f^*$ is used without such modification.

Table 1 below summarizes the results on the quantities $\ell_f(h)$ and $S_f$ for different properties. Note that for a given property, $\ell_f(h)$ is unique while $S_f$ is not.

For simplicity, we denote the partial expectation , and . To simplify our proofs and expressions, we assume that the number of samples , the amplification parameter , and . Without loss of generality, we also assume that , and are integers. Finally, set and , where and are fixed constants such that and .

## Appendix B Outline

The rest of the appendix is organized as follows.

In Section C.1, we present a few concentration inequalities for Poisson and Binomial random variables that will be used in subsequent proofs. In Section C.2, we analyze the performance of the modified empirical estimator $f^{ME}$ that estimates $p_x$ by $N_x/n$ instead of $N_x/N$. We show that $f^{ME}$ performs nearly as well as the original empirical estimator $f^E$, but is significantly easier to analyze.

In Section D, we partition the loss of our estimator, $L_{f^*}(p,2n)$, into three parts: $\mathbb{E}[A^2]$, $\mathbb{E}[B^2]$, and $\mathbb{E}[C^2]$, corresponding to a quantity which is roughly $L_{f^E}(p,nt)$, the loss incurred by $f^*_L$, and the loss incurred by $f^*_S$, respectively.

In Section E, we bound $\mathbb{E}[A^2]$ by roughly $L_{f^E}(p,nt)$. In Section F, we bound $\mathbb{E}[B^2]$: in Sections F.1 and F.2, we bound the squared bias and variance of $f^*_L$ respectively.

In Section G.1, we partition the series to be estimated in into and , and show that it suffices to estimate the quantity . In Section G.2, we outline how we construct the linear estimator based on . Then, we bound term : in Section G.3 and G.4, we bound the variance and squared bias of respectively. In Section G.5, we derive a tight bound on .

In Section H, we prove Theorem 1 based on our previous results.

In Section I, we demonstrate the practical advantages of our methods through experiments on different properties and distributions. We show that our estimator can even match the performance of the $n\log n$-sample empirical estimator in estimating various properties.

## Appendix C Preliminary Results

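### C.1 Concentration Inequalities for Poisson and Binomial

The following lemma gives tight tail probability bounds for Poisson and Binomial random variables.

###### Lemma 1.

[concen] Let $X$ be a Poisson or Binomial random variable with mean $\mu$. Then for any $\delta>0$,

$$\mathbb{P}(X\ge(1+\delta)\mu)\le\left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{\mu}\le e^{-(\delta^{2}\wedge\delta)\mu/3},$$

and for any $\delta\in(0,1)$,

$$\mathbb{P}(X\le(1-\delta)\mu)\le\left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{\mu}\le e^{-\delta^{2}\mu/2}.$$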
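The two chains of inequalities in Lemma 1 can be checked numerically in the Poisson case. The sketch below computes exact Poisson tails by direct summation and compares them to both bounds; the chosen means and deviations and the number of summed terms are illustrative assumptions.

```python
import math

def poisson_pmf(mu, v):
    return math.exp(-mu + v * math.log(mu) - math.lgamma(v + 1))

def upper_tail(mu, a, terms=500):
    """P(X >= a) for X ~ Poi(mu), summed over `terms` values starting at ceil(a)."""
    start = math.ceil(a)
    return sum(poisson_pmf(mu, v) for v in range(start, start + terms))

def lower_tail(mu, a):
    """P(X <= a) for X ~ Poi(mu)."""
    return sum(poisson_pmf(mu, v) for v in range(0, math.floor(a) + 1))

def chernoff_upper(mu, delta):
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

def chernoff_lower(mu, delta):
    return (math.exp(-delta) / (1 - delta) ** (1 - delta)) ** mu

for mu in (4, 16, 64):
    for delta in (0.25, 0.5, 0.75):
        up = (upper_tail(mu, (1 + delta) * mu), chernoff_upper(mu, delta),
              math.exp(-min(delta ** 2, delta) * mu / 3))
        low = (lower_tail(mu, (1 - delta) * mu), chernoff_lower(mu, delta),
               math.exp(-delta ** 2 * mu / 2))
        print(mu, delta, up, low)
```

In each printed triple the values are non-decreasing from left to right: exact tail, Chernoff-type bound, simplified exponential bound.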

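We have the following corollary by choosing different values of $\delta$.

###### Lemma 2.

Let $X$ be a Poisson or Binomial random variable with mean $\mu$. Then

$$\mathbb{P}\left(X\le\tfrac{1}{2}\mu\right)\le e^{-0.15\mu},\quad\mathbb{P}\left(X\le\tfrac{1}{3}\mu\right)\le e^{-0.30\mu},\quad\mathbb{P}\left(X\le\tfrac{1}{5}\mu\right)\le e^{-0.478\mu},\quad\text{and}\quad\mathbb{P}\left(X\le\tfrac{1}{16}\mu\right)\le e^{-0.76\mu}.$$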
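The constants in Lemma 2 follow from the first lower-tail bound of Lemma 1 with $\delta=1-c$ for $c\in\{\frac12,\frac13,\frac15,\frac1{16}\}$. A quick sketch of that computation (the helper name is illustrative):

```python
import math

def lower_tail_exponent(delta):
    """Return c(delta) such that (e^{-delta} / (1-delta)^{1-delta})^mu = exp(-c(delta) * mu)."""
    return delta + (1 - delta) * math.log(1 - delta)

claimed = [(1 / 2, 0.15), (1 / 3, 0.30), (1 / 5, 0.478), (1 / 16, 0.76)]
for fraction, constant in claimed:
    exponent = lower_tail_exponent(1 - fraction)
    print(f"P(X <= {fraction:.4f} mu): exponent {exponent:.4f} >= claimed {constant}")
```

Each computed exponent is slightly larger than the rounded constant stated in the lemma, so the stated bounds are valid.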
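
###### Lemma 3.

Let $N\sim\mathrm{Poi}(n)$. Then

$$\mathbb{E}\left[\sqrt{\frac{n}{N}}\,\middle|\,N\ge 1\right]\le 1+\frac{3}{n}.$$

###### Proof.

For $N\ge 1$,

$$\frac{n}{N}\le\frac{n}{N+1}+\frac{3n}{(N+1)(N+2)},$$

hence,

$$\begin{aligned}\mathbb{E}\left[\frac{n}{N}\,\middle|\,N\ge 1\right]&\le\mathbb{E}\left[\frac{n}{N+1}\,\middle|\,N\ge 1\right]+\mathbb{E}\left[\frac{3n}{(N+1)(N+2)}\,\middle|\,N\ge 1\right]\\&\le\mathbb{E}\left[\frac{n}{N+1}\right]+\mathbb{E}\left[\frac{3n}{(N+1)(N+2)}\right]\\&=\mathbb{P}[N\ge 1]+\frac{3}{n}\mathbb{P}[N\ge 2]\\&\le 1+\frac{3}{n},\end{aligned}$$

where the second inequality follows from the fact that $\frac{n}{N+1}$ and $\frac{3n}{(N+1)(N+2)}$ decrease with $N$, and the equality follows as $N\sim\mathrm{Poi}(n)$. The bound for $\sqrt{n/N}$ then follows by Jensen's inequality, since $\sqrt{1+3/n}\le 1+3/n$. ∎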
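The bound established in the proof, $\mathbb{E}[n/N\mid N\ge 1]\le 1+3/n$, can be checked by summing the Poisson mass function directly. The truncation point of the sum and the tested values of $n$ below are illustrative assumptions; truncation can only lower the computed value slightly.

```python
import math

def conditional_mean_ratio(n, vmax=None):
    """E[n / N | N >= 1] for N ~ Poi(n), by direct summation of the pmf."""
    if vmax is None:
        vmax = int(n + 20 * math.sqrt(n) + 50)
    pmf = math.exp(-n)  # P(N = 0)
    p_zero = pmf
    total = 0.0
    for v in range(1, vmax + 1):
        pmf *= n / v            # recurrence P(N = v) = P(N = v - 1) * n / v
        total += pmf * n / v
    return total / (1 - p_zero)

for n in (1, 2, 5, 10, 50, 200):
    print(n, conditional_mean_ratio(n), 1 + 3 / n)
```

For moderately large $n$ the computed ratio sits close to $1+1/n$, so the constant 3 in the lemma leaves comfortable slack.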

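### C.2 The Modified Empirical Estimator

The modified empirical estimator

$$f^{ME}(X^N)=\sum_{x\in\mathcal{X}}f_x\left(\frac{N_x}{n}\right)$$

estimates the probability of a symbol not by the fraction $N_x/N$ of times it appeared, but by $N_x/n$, where $n$ is the parameter of the Poisson sampling distribution.

We show that the original and modified empirical estimators have very similar performance.

###### Lemma 4.

For all $n\ge 1$,

$$\mathbb{E}\left[\left(f^E(X^N)-f^{ME}(X^N)\right)^2\right]\le\frac{\ell_f^{2}(1/n)}{n}.$$

###### Proof.

By the definition of $\ell_f$, if $N_x\ge 1$,

$$\left|f_x\left(\frac{N_x}{n}\right)-f_x\left(\frac{N_x}{N}\right)\right|\le\ell_f\left(\frac{1}{n}\right)\left|\frac{N_x}{n}-\frac{N_x}{N}\right|=\ell_f\left(\frac{1}{n}\right)\frac{N_x}{N}\frac{|N-n|}{n},$$

and if $N_x=0$,

$$\left|f_x\left(\frac{N_x}{n}\right)-f_x\left(\frac{N_x}{N}\right)\right|=0\le\ell_f\left(\frac{1}{n}\right)\frac{N_x}{N}\frac{|N-n|}{n}.$$

Therefore,

$$\begin{aligned}\mathbb{E}\left[\left(\sum_{x\in\mathcal{X}}\left(f_x\left(\frac{N_x}{n}\right)-f_x\left(\frac{N_x}{N}\right)\right)\right)^2\right]&\le\mathbb{E}\left[\left(\sum_{x\in\mathcal{X}}\ell_f\left(\frac{1}{n}\right)\frac{N_x}{N}\frac{|N-n|}{n}\right)^2\right]\\&\le\mathbb{E}\left[\left(\ell_f\left(\frac{1}{n}\right)\frac{|N-n|}{n}\right)^2\right]\\&=\frac{\ell_f^{2}(1/n)}{n^{2}}\mathbb{E}\left[(N-n)^2\right]\\&=\frac{\ell_f^{2}(1/n)}{n},\end{aligned}$$

where the last step follows as $N\sim\mathrm{Poi}(n)$ and hence $\mathbb{E}[(N-n)^2]=\mathrm{Var}(N)=n$. ∎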

## Appendix D Large and Small Probabilities

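Recall that $f^*$ has the following form

$$f^*(X^N,X^{N'})=f^*_S(X^N,X^{N'})+f^*_L(X^N,X^{N'}).$$

We can rewrite the property as follows

$$f(p)=f(p)-\mathbb{E}\left[f^{ME}(X^{N''})\right]+\mathbb{E}\left[f^{ME}_S(X^{N''},X^{N'})\right]+\mathbb{E}\left[f^{ME}_L(X^{N''},X^{N'})\right].$$

The difference between $f^*$ and the actual value $f(p)$ can be partitioned into three terms

$$f^*(X^N,X^{N'})-f(p)=A+B+C,$$

where

$$A\stackrel{\text{def}}{=}\mathbb{E}\left[f^{ME}(X^{N''})\right]-f(p)$$

is the bias of the modified empirical estimator with $\mathrm{Poi}(nt)$ samples,

$$B\stackrel{\text{def}}{=}f^*_L(X^N,X^{N'})-\mathbb{E}\left[f^{ME}_L(X^{N''},X^{N'})\right]$$

corresponds to the loss incurred by the large-probability estimator $f^*_L$, and

$$C\stackrel{\text{def}}{=}f^*_S(X^N,X^{N'})-\mathbb{E}\left[f^{ME}_S(X^{N''},X^{N'})\right]$$

corresponds to the loss incurred by the small-probability estimator $f^*_S$.

By the Cauchy-Schwarz inequality, upper bounds on $\mathbb{E}[A^2]$, $\mathbb{E}[B^2]$, and $\mathbb{E}[C^2]$ suffice to also upper bound the estimation loss $L_{f^*}(p,2n)$.

In the next section, we bound the squared bias term $\mathbb{E}[A^2]$. In Section F and Section G, we bound the large- and small-probability terms $\mathbb{E}[B^2]$ and $\mathbb{E}[C^2]$, respectively.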

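## Appendix E Squared Bias: $\mathbb{E}[A^2]$

We relate $\mathbb{E}[A^2]$ to $L_{f^E}(p,nt)$ through the following inequality.

###### Lemma 5.

Let $T(n)$ be a positive function of $n$. Then

$$\mathbb{E}[A^2]\le\frac{1+T(n)}{nt}\ell_f^{2}\left(\frac{1}{nt}\right)+\left(1+\frac{1}{T(n)}\right)L_{f^E}(p,nt).$$

###### Proof.

We upper bound $\mathbb{E}[A^2]$ in terms of $L_{f^E}(p,nt)$ using the Cauchy-Schwarz inequality and Lemma 4.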

 E[A2] =(∑x∈X(E[fx(N′′xnt)]−fx(px))