Distribution-robust mean estimation via smoothed random perturbations

06/25/2019 ∙ by Matthew J. Holland, et al. ∙ Osaka University

We consider the problem of mean estimation assuming only finite variance. We study a new class of mean estimators constructed by integrating over random noise applied to a soft-truncated empirical mean estimator. For appropriate choices of noise, we show that this can be computed in closed form, and utilizing relative entropy inequalities, these estimators enjoy deviations with exponential tails controlled by the second moment of the underlying distribution. We consider both additive and multiplicative noise, and several noise distribution families in our analysis. Furthermore, we empirically investigate the sensitivity to the mean-standard deviation ratio for numerous concrete manifestations of the estimator class of interest. Our main take-away is that an inexpensive new estimator can achieve nearly sub-Gaussian performance for a wide variety of data distributions.


1 Introduction

In this work, we consider the problem of mean estimation under very weak assumptions on the underlying distribution; all we assume is that the variance is finite. This problem is important because the practitioner often will not know a priori whether or not the underlying data distribution is sub-Gaussian in nature or whether it is heavy-tailed with infinite higher-order moments [7]. Furthermore, procedures which provide strong statistical estimation guarantees in this setting have been shown to have many applications in modern machine learning tasks, where off-sample generalization performance is measured using a risk (expected loss) function to be estimated empirically as a core feedback mechanism, with more robust feedback leading to provably stronger learning guarantees [3, 6, 11].

Given a sample of $n$ independent random variables $x_1, \ldots, x_n$ taking values in $\mathbb{R}$ with common distribution $P$, the traditional approach for estimation of the mean $\mu = \mathbf{E} x$ is to use the empirical mean $\bar{x} = n^{-1} \sum_{i=1}^{n} x_i$, for which many optimality properties are well-known. For example, if the data is Normally distributed with variance $\sigma^2$, then confidence intervals for the deviations can be computed exactly, and with probability no less than $1 - \delta$, one has

$$\bar{x} - \mu \leq \frac{\sigma \, \Phi^{-1}(1 - \delta)}{\sqrt{n}},$$

where $\Phi$ denotes the standard Normal cumulative distribution function. Even without Normal assumptions, since we have finite variance, asymptotically the central limit theorem tells us that the same kinds of guarantees are possible: as $n \to \infty$ we have

$$\mathbf{P}\left\{ \bar{x} - \mu \leq \frac{\sigma \, \Phi^{-1}(1 - \delta)}{\sqrt{n}} \right\} \to 1 - \delta.$$

This deviation bound is a natural benchmark against which to compare other estimators, since in the Normal case the empirical mean is essentially optimal, in the following sense [4]. Let $\mathcal{P}$ be a family of distributions including all Normal distributions with some finite variance $\sigma^2$. Take any estimator $\hat{x} = \hat{x}(x_1, \ldots, x_n)$, and denote any valid one-sided deviation bounds by $\varepsilon(\delta)$, where

$$\mathbf{P}\{\hat{x} - \mu > \varepsilon(\delta)\} \leq \delta$$

for $P \in \mathcal{P}$ and $\delta \in (0,1)$. Then, regardless of the construction of $\hat{x}$, for any confidence level $\delta$, there always exists an unlucky Normal distribution $P \in \mathcal{P}$ with variance $\sigma^2$ such that

$$\varepsilon(\delta) \geq \frac{\sigma \, \Phi^{-1}(1 - \delta)}{\sqrt{n}}.$$

Analogous statements hold for the lower tail, meaning that the benchmark set by the empirical mean in the case of Normal data is essentially the best we can expect of any estimator, in terms of dependence on $n$ and $\delta$, for guarantees that hold uniformly over the model $\mathcal{P}$.

With this natural benchmark in mind, Devroye et al. [7] did an in-depth study of so-called sub-Gaussian estimators, namely any estimator $\hat{x}$ which satisfies

$$\mathbf{P}\left\{ |\hat{x} - \mu| > \frac{L \sigma_P \sqrt{\log(2/\delta)}}{\sqrt{n}} \right\} \leq \delta \quad \text{for all } P \in \mathcal{P}, \qquad (1)$$

where $\mathcal{P}$ is typically a large, non-parametric class of distributions, $\sigma_P^2$ is the variance of $P$, and $L$ is some distribution-free constant. Since $\Phi^{-1}(1-\delta) \leq \sqrt{2\log(1/\delta)}$, this is a slight weakening of the benchmark given above, but fundamentally captures the same phenomena. One important fact is that assuming only finite variance, namely the model of all distributions with finite variance, the empirical mean is not a sub-Gaussian estimator. As a salient example, Catoni [4] shows that one can always create a distribution with variance $\sigma^2$ such that with probability at least $\delta$, the empirical mean deviates from $\mu$ by at least a quantity of order $\sigma / \sqrt{n\delta}$, namely a lower bound which says that the guarantees provided by Chebyshev's inequality, with polynomial, rather than logarithmic dependence on $1/\delta$, are essentially tight for the finite-variance model.

Given that the empirical mean is not distribution-robust in the sub-Gaussian sense just stated, it is natural to ask whether, under such weak assumptions, there exist sub-Gaussian estimators at all. The answer is affirmative, although the construction of $\hat{x}$ depends on the desired confidence level $\delta$. This is indeed necessary, as Devroye et al. [7] prove that over the finite-variance model one cannot construct a sub-Gaussian estimator whose design is free of $\delta$. One of the most lucid examples is the median-of-means estimator [13], which partitions the index set $\{1, \ldots, n\}$ into $k$ disjoint subsets, computes a sample mean $\bar{x}_j$ on each subset, and finally returns the median $\hat{x} = \mathrm{med}\{\bar{x}_1, \ldots, \bar{x}_k\}$. With the right number of partitions (e.g., $k$ on the order of $\log(1/\delta)$), the estimator is sub-Gaussian. Another lucid example is that of M-estimators with variance-controlled scaling [4], taking the form

$$\hat{x} = \operatorname*{arg\,min}_{\theta \in \mathbb{R}} \sum_{i=1}^{n} \rho\big(s(x_i - \theta)\big),$$

where $\rho$ is an appropriate convex function, and the parameter $s$ scales as $\sqrt{\log(1/\delta)/(n\sigma^2)}$. Furthermore, one of the main results of Devroye et al. [7] is a novel estimator construction technique which is provably sub-Gaussian with nearly optimal constants assuming only finite variance, but their procedure is not computationally tractable.
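To make the median-of-means construction concrete, here is a minimal Python sketch (NumPy only). The default number of blocks, roughly $\log(1/\delta)$ and capped at $n$, is one standard choice used for illustration, not necessarily the exact rule analyzed in [7].

```python
import numpy as np

def median_of_means(x, delta=0.01, k=None):
    """Median-of-means estimate of the mean of the sample x.

    Splits x into k disjoint blocks, averages each block, and returns the
    median of the block means; k defaults to roughly log(1/delta).
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if k is None:
        k = max(1, min(n, int(np.ceil(np.log(1.0 / delta)))))
    blocks = np.array_split(np.random.permutation(x), k)
    block_means = np.array([b.mean() for b in blocks])
    return np.median(block_means)

# Example: heavy-tailed (log-Normal) data with true mean exp(0.5).
sample = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)
print(median_of_means(sample, delta=0.01))
```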

Our contributions

All known sub-Gaussian estimators require solving some sub-problem whose solution is implicitly defined; even the most practical choices such as the median-of-means and M-estimator approaches amount to minimizing a data-dependent convex function. In this work, we consider a new class of estimators which can be computed directly and precisely, without any iterative sub-routines. The cost for this is that we give up sub-Gaussianity; we show that the estimators of interest satisfy deviation bounds of the form

$$\mathbf{P}\left\{ |\hat{x} - \mu| > c \sqrt{\frac{\mathbf{E}[x^2] \log(2/\delta)}{n}} \right\} \leq \delta,$$

for an appropriate constant $c$ which only depends on quantities under our control, namely the noise distribution and the scaling, and not on the underlying $P$. The basic form is the same as the sub-Gaussian condition of Devroye et al. [7], but because the second moment $\mathbf{E}[x^2]$ controls the bound instead of the variance, it becomes sensitive to the absolute value of the mean $|\mu|$, though as we demonstrate shortly, a simple sample-splitting technique works well to mitigate this sensitivity both in theory and in practice. The estimators we construct are built by first considering the application of multiplicative or additive noise to a soft-truncated version of the empirical mean, and then smoothing out these effects by taking expectation with respect to these random perturbations, whose distribution we control. We consider noise from the Bernoulli, Normal, Weibull, and Student-t families to provide some concrete examples of the estimator class of interest. In particular, the Bernoulli variety is computationally simplest and performs best in our experimental setting, offering a new alternative to the sub-Gaussian mean estimators cited earlier, with the appeal of no computational error and more transparent analysis.

The rest of the paper is structured as follows. In section 2 we introduce the estimator class of interest, and highlight some basic statistical and computational principles which will be of use for subsequent analysis. Our main theoretical results are organized in section 3, where we prove statistical error bounds and derive closed-form expressions for concrete examples from the class of interest. In section 4 we conduct a series of controlled experiments in which we analyze performance, evaluating in particular the sensitivity to the size of the mean relative to the standard deviation. A brief discussion and concluding remarks close the paper in section 5.

2 Estimator class of interest

Our aim is to study the behavior of a class of new estimators which use moment-dependent scaling, smoothed noise (both additive and multiplicative), and bounded soft truncation. Let $s > 0$ denote a generic scaling parameter to be determined shortly, and let $\epsilon_1, \ldots, \epsilon_n$ denote independent copies of a noise random variable $\epsilon \sim \nu$. Both $s$ and $\nu$ are assumed to be under our control. For simplicity, we focus on the following two main types of estimators:

  1. Multiplicative: Assuming the noise mean $\mathbf{E}\epsilon$ is non-zero, construct the estimator $\hat{x}_{\mathrm{mult}}$ by averaging the smoothed, truncated products $\mathbf{E}_{\nu}\,\psi(s\,\epsilon_i x_i)$ over $i = 1, \ldots, n$ and rescaling, as

    (2)
  2. Additive: Under an analogous restriction on the noise distribution (discussed after (12) below), construct the estimator $\hat{x}_{\mathrm{add}}$ by averaging the smoothed, truncated sums $\mathbf{E}_{\nu}\,\psi(s(x_i + \epsilon_i))$ over $i = 1, \ldots, n$ and rescaling, as

    (3)

First, regarding the soft truncation function $\psi$, any differentiable, odd, non-decreasing function is plausible, but for concreteness we use the convenient sigmoid function of Catoni and Giulini [5], given in (4) and pictured in Figure 1.

$$\psi(u) = \begin{cases} 2\sqrt{2}/3, & u > \sqrt{2} \\ u - u^3/6, & -\sqrt{2} \leq u \leq \sqrt{2} \\ -2\sqrt{2}/3, & u < -\sqrt{2} \end{cases} \qquad (4)$$

Second, the expectation $\mathbf{E}_{\nu}$ is taken with respect to the product measure induced by the noise sample $\epsilon_1, \ldots, \epsilon_n$. As we shall see in section 3, with proper choice of $\nu$, using the convenient polynomial form of $\psi$, we can often compute $\hat{x}_{\mathrm{mult}}$ and $\hat{x}_{\mathrm{add}}$ directly.

Figure 1: Graph of $\psi$ (in green), along with upper and lower bounds given in (10).
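Before deriving closed forms, the construction is easy to illustrate numerically. The Python sketch below implements the truncation in (4) and Monte Carlo stand-ins for $\hat{x}_{\mathrm{mult}}$ and $\hat{x}_{\mathrm{add}}$, replacing the noise integral with sampling; the rescaling shown (by $s$, and by the noise mean in the multiplicative case) is an illustrative assumption standing in for the exact normalization elided in (2) and (3).

```python
import numpy as np

ROOT2 = np.sqrt(2.0)

def psi(u):
    """Soft truncation of Catoni and Giulini: cubic inside [-sqrt(2), sqrt(2)], flat outside."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= ROOT2, u - u**3 / 6.0, np.sign(u) * 2.0 * ROOT2 / 3.0)

def xhat_mult_mc(x, s, noise_sampler, m=10000):
    """Monte Carlo stand-in for the multiplicative estimator (2)."""
    x = np.asarray(x, dtype=float)
    eps = noise_sampler(size=(m, x.size))          # m draws of the noise vector
    smoothed = psi(s * eps * x).mean(axis=0)       # approximates E_nu psi(s * eps * x)
    return smoothed.sum() / (x.size * s * eps.mean())  # rescaling (assumed form)

def xhat_add_mc(x, s, noise_sampler, m=10000):
    """Monte Carlo stand-in for the additive estimator (3)."""
    x = np.asarray(x, dtype=float)
    eps = noise_sampler(size=(m, x.size))
    smoothed = psi(s * (x + eps)).mean(axis=0)     # approximates E_nu psi(s * (x + eps))
    return smoothed.sum() / (x.size * s)           # rescaling (assumed form)

# Example: Bernoulli(0.5) multiplicative noise on heavy-tailed data.
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.25, size=500)
print(xhat_mult_mc(data, s=0.1, noise_sampler=lambda size: rng.binomial(1, 0.5, size=size)))
```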

2.1 Computation of new estimators

Here we consider some general principles which will aid us in computing the estimators $\hat{x}_{\mathrm{mult}}$ and $\hat{x}_{\mathrm{add}}$ just introduced. To begin, note that the piecewise function $\psi$ can be written explicitly using indicator functions as

$$\psi(u) = \left(u - \frac{u^3}{6}\right) I\{|u| \leq \sqrt{2}\} + \frac{2\sqrt{2}}{3}\,\mathrm{sign}(u)\, I\{|u| > \sqrt{2}\}. \qquad (5)$$

Let $\xi$ denote an arbitrary random variable. We consider computation of $\mathbf{E}\,\psi(a + b\xi)$, where expectation is taken with respect to $\xi$, and $a$ and $b$ are respectively shift and scale parameters. To streamline implementation, for integer $m \geq 0$ and input $u$, we introduce the notation

(6)
(7)

Some care is required with the sign of the scale parameter $b$. For the case of $b > 0$, the quantities just introduced can be applied immediately. When $b < 0$, then we need to pay attention to the sign (as an obvious example, say $\xi$ follows an asymmetric distribution; unless the law of $\xi$ is symmetric about zero, taking expectation over $b\xi$ and $-b\xi$ will respectively lead to different results), computing as

(8)

This equality follows from straightforward rearrangements. When $b > 0$, note

for any choice of $a$. When $b < 0$, note that

from which the second case in (8) is obtained. Assuming then that evaluating the quantities in (6)–(7) is tractable, obtaining $\mathbf{E}\,\psi(a + b\xi)$ can be reduced to direct computations as

(9)

where we have

The above follows from straightforward algebra, using the convenient form (5). In practice, then, we will need to evaluate these quantities only for degrees $m \leq 3$, since $\psi$ is cubic on its non-constant region. Since the actual computations depend completely on the distribution of $\xi$ here, detailed derivations for the different distribution families will be given in section 3.
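The decomposition just described is easy to mirror in code. Below is a minimal Python sketch that assembles $\mathbf{E}\,\psi(a + b\xi)$ from truncated moments $\mathbf{E}[\xi^m I\{l \leq \xi \leq u\}]$, $m \leq 3$, supplied as a user-defined callable; the Monte Carlo version of that callable shown here is only a placeholder for the closed forms derived in section 3, and the decomposition assumes the truncation (4).

```python
import numpy as np

ROOT2 = np.sqrt(2.0)

def smoothed_psi(a, b, trunc_moment, tail_prob):
    """E psi(a + b*xi), with psi cubic on [-sqrt(2), sqrt(2)] and flat outside.

    trunc_moment(m, lo, hi) must return E[xi^m * 1{lo <= xi <= hi}],
    and tail_prob(lo, hi) must return (P(xi < lo), P(xi > hi)).
    Assumes b > 0; for b < 0, flip the sign of the noise first.
    """
    lo, hi = (-ROOT2 - a) / b, (ROOT2 - a) / b   # region where |a + b*xi| <= sqrt(2)
    M = [trunc_moment(m, lo, hi) for m in range(4)]
    # E[(a + b*xi) 1{...}] and E[(a + b*xi)^3 1{...}] via binomial expansion.
    lin = a * M[0] + b * M[1]
    cub = a**3 * M[0] + 3 * a**2 * b * M[1] + 3 * a * b**2 * M[2] + b**3 * M[3]
    p_below, p_above = tail_prob(lo, hi)
    return lin - cub / 6.0 + (2.0 * ROOT2 / 3.0) * (p_above - p_below)

# Placeholder truncated moments via Monte Carlo, e.g. for standard Normal noise.
rng = np.random.default_rng(1)
draws = rng.standard_normal(200000)

def mc_moment(m, lo, hi):
    mask = (draws >= lo) & (draws <= hi)
    return np.mean(np.where(mask, draws**m, 0.0))

def mc_tails(lo, hi):
    return np.mean(draws < lo), np.mean(draws > hi)

print(smoothed_psi(a=0.3, b=0.5, trunc_moment=mc_moment, tail_prob=mc_tails))
```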

2.2 Two general-purpose deviation bounds

Recall the setting described above, in which we have iid data $x_1, \ldots, x_n \sim P$ and iid "strategic noise" $\epsilon_1, \ldots, \epsilon_n \sim \nu$. Inspired by the approach of Catoni and Giulini [5], we may obtain convenient inequalities depending on the noise distribution $\nu$, which is a "posterior" in the sense that it may depend on the sample, and a "prior" distribution $\nu_0$, which must be specified in advance. The fundamental underlying property we use is illustrated in Lemma 1 below. (See Holland [8] for an elementary proof.)

Lemma 1.

Fix an arbitrary prior distribution $\nu_0$ on $\mathbb{R}$, and consider a function $f$ of the noise and the data, assumed to be bounded and measurable. It follows that with probability no less than $1 - \delta$ over the random draw of the sample, we have

uniform in the choice of the posterior $\nu$, where expectation on the left-hand side is over the noise sample.

In the context of $\hat{x}_{\mathrm{mult}}$ and $\hat{x}_{\mathrm{add}}$ defined in section 2, we can obtain convenient upper bounds through special cases of Lemma 1. The key property of the truncation function (4) is that over its entire domain, the following upper and lower bounds hold (see Figure 1 for an illustration):

$$-\log\left(1 - u + \frac{u^2}{2}\right) \leq \psi(u) \leq \log\left(1 + u + \frac{u^2}{2}\right), \quad u \in \mathbb{R}. \qquad (10)$$

As such, setting $f$ in Lemma 1 appropriately for the case of $\hat{x}_{\mathrm{mult}}$ and for the case of $\hat{x}_{\mathrm{add}}$, then using the bounds (10), it follows that each of these estimators enjoys the following upper bounds, each of which holds with probability at least $1 - \delta$, uniform in the choice of noise distribution $\nu$:

(11)
(12)

We emphasize that the randomness in the estimators $\hat{x}_{\mathrm{mult}}$ and $\hat{x}_{\mathrm{add}}$ is exclusively due to the data sample, since we are integrating out any randomness due to the noise. The reason we restrict the definition of $\hat{x}_{\mathrm{mult}}$ to the case of noise with non-zero mean is purely because we are interested in deviation bounds. For any noise distribution with zero mean and finite variance, the first term on the right-hand side of (11) evaluates to

meaning that the bounds in this case are always free of the mean $\mu$, which spoils this approach for seeking bounds on the deviations $\hat{x}_{\mathrm{mult}} - \mu$. On the other hand, the reason for the restriction in the definition of $\hat{x}_{\mathrm{add}}$ is to prevent an artefact in the upper bound on the deviations.
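Since the bounds above are uniform in $\nu$, applying them to a particular noise family only requires the relative entropy between the chosen noise distribution and the prior. As a small illustration, the sketch below computes the Gaussian case, $\mathrm{KL}(N(m_1, s_1^2)\,\|\,N(m_2, s_2^2))$, which is the kind of quantity implicitly appearing in the Normal-noise results of section 3.2; the function name and example parameters are ours, not the paper's.

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """Relative entropy KL(N(m1, s1^2) || N(m2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2.0 * s2**2) - 0.5

# Example: posterior noise N(1, 0.5^2) against a standard Normal prior.
print(kl_gauss(1.0, 0.5, 0.0, 1.0))
```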

3 Theoretical analysis

Here we consider a number of different noise distribution families for constructing the estimators (2) and (3) introduced in the previous section. For each distribution, we seek both statistical error guarantees, as well as explicit forms for efficiently computing the estimators.

3.1 Bernoulli noise

Perhaps the simplest choice of noise distribution is that of randomly deleting observations with a fixed probability, namely the case of multiplicative Bernoulli noise $\epsilon_i \sim \mathrm{Bernoulli}(p)$, which retains the term $s x_i$ with probability $p$ and replaces it with zero otherwise (recall that $\psi(0) = 0$). The following result makes this concrete.

Proposition 2 (Deviation bounds and estimator computation).

Consider noise $\epsilon_i \sim \mathrm{Bernoulli}(p)$ for some $p \in (0, 1]$, and a Bernoulli prior $\nu_0$. The estimator $\hat{x}_{\mathrm{mult}}$ in (2) takes the form

and satisfies the following deviation bound,

with probability no less than $1 - \delta$.
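Although the exact closed form of Proposition 2 is not reproduced here, the computational mechanism is transparent: for Bernoulli noise the smoothing integral is a two-point sum, and since $\psi(0) = 0$, only the retained term survives, i.e. $\mathbf{E}_{\nu}\,\psi(s \epsilon x) = p\,\psi(s x)$. The sketch below checks this identity by Monte Carlo; the function names are ours, and the truncation is redefined locally so the snippet is self-contained.

```python
import numpy as np

ROOT2 = np.sqrt(2.0)

def psi(u):
    """Catoni-Giulini soft truncation, as in (4)."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= ROOT2, u - u**3 / 6.0, np.sign(u) * 2.0 * ROOT2 / 3.0)

def smoothed_bernoulli(x, s, p):
    """E over eps ~ Bernoulli(p) of psi(s * eps * x); psi(0) = 0 kills the deleted term."""
    return p * psi(s * x)

# Monte Carlo check of the two-point identity.
rng = np.random.default_rng(2)
x, s, p = 3.7, 0.25, 0.6
eps = rng.binomial(1, p, size=200000)
print(np.mean(psi(s * eps * x)), smoothed_bernoulli(x, s, p))
```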

Remark 3 (Centered estimates).

To reduce the dependence of the above estimator on the second moment of the underlying distribution, use of an ancillary mean estimator to approximately center the data is a useful strategy. For example, from the full sample of size $n$, let the first $k$ observations be used to construct such an ancillary estimator, denoted $\tilde{x}$,

where $k < n$. Then shift the remaining data points as $x_i' = x_i - \tilde{x}$, for each $i = k+1, \ldots, n$. Writing the upper bound in Proposition 2 for a sample of size $k$ as $\varepsilon_k$, the second moment of the shifted data can be readily bounded above by $\sigma^2 + \varepsilon_k^2$ on the event that the ancillary estimate is $\varepsilon_k$-accurate. Using this upper bound to scale $\hat{x}_{\mathrm{mult}}$, this time applied to the centered dataset of size $n - k$, write $\hat{x}'$ for the resulting estimator. Shifting this back into the original position, we have our final output, namely $\hat{x}' + \tilde{x}$, which enjoys variance-dependent deviation tails of the form
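A minimal sketch of this sample-splitting strategy follows; the base estimator is passed in as a callable, so the sketch does not commit to the particular Bernoulli-noise form of Proposition 2, and both the split fraction and the use of a plain sample mean as the ancillary estimate are arbitrary choices of ours for brevity.

```python
import numpy as np

def centered_estimate(x, base_estimator, split_frac=0.5):
    """Center with an ancillary estimate on the first block, estimate on the rest, shift back.

    base_estimator: callable mapping a 1-d array to a scalar location estimate.
    """
    x = np.asarray(x, dtype=float)
    k = max(1, int(split_frac * x.size))
    ancillary = x[:k].mean()                    # ancillary centering estimate
    shifted = x[k:] - ancillary                 # approximately centered data
    return base_estimator(shifted) + ancillary  # shift back to the original location

# Example with the sample median as a stand-in base estimator.
rng = np.random.default_rng(3)
data = 10.0 + rng.standard_t(df=3, size=2000)   # heavy-ish tails, large |mean|
print(centered_estimate(data, np.median))
```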

3.2 Normal noise

Proposition 4 (Deviation bounds).

Additive case: consider Normal noise and a Normal prior, with the scale $s$ and the noise parameters set appropriately. Then, we have

with probability no less than $1 - \delta$ over the draw of the sample.

Multiplicative case: consider Normal noise and a Normal prior, with the scale $s$ and the noise parameters set appropriately. Then, we have

with probability no less than $1 - \delta$ over the draw of the sample.

Remark 5 (Comparison of additive and multiplicative noise).

The key difference in terms of the performance guarantees available for $\hat{x}_{\mathrm{add}}$ and $\hat{x}_{\mathrm{mult}}$ under the Normal noise setting is that in one regime of the underlying second moment the additive case has tighter bounds, while in the complementary regime the multiplicative case has tighter bounds. A much smaller technical difference is the confidence level attached to the resulting intervals in the two cases.

Next, let us consider computation of the estimators under Normal noise.

Proposition 6 (Estimator computation, [5]).

For Normal noise, the key quantities are computed as

where $\Phi$ denotes the standard Normal CDF.

With Proposition 6 in hand, computation is very straightforward using the general form

recalling the definitions in (9). To implement the special case for which the bounds of Proposition 4 hold, this amounts to

Remark 7 (Convenient computation).

For the case of $x_i < 0$, note that using the value of $|x_i|$ rather than the signed $x_i$ is computationally convenient, because then we only need to consider the positive case in evaluating (8). Of course, the quantities being computed in either case are equivalent. To see this, just note that since the noise is Normal and we have conditioned on any $x_i$,

where equality here means equality in distribution. The validity of this statement follows from the symmetry of the Normal distribution.
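For standard Normal noise, the truncated moments needed in (9) have closed forms in terms of the Normal pdf $\phi$ and CDF $\Phi$, which is the essence of Proposition 6. The sketch below assembles $\mathbf{E}\,\psi(a + b\xi)$ for $\xi \sim N(0,1)$ and $b > 0$ from these closed forms, assuming the truncation (4); it mirrors the generic decomposition of section 2.1 rather than reproducing the paper's exact expressions.

```python
import numpy as np
from scipy.stats import norm

ROOT2 = np.sqrt(2.0)

def gauss_trunc_moments(lo, hi):
    """E[xi^m 1{lo <= xi <= hi}] for xi ~ N(0,1) and m = 0,...,3."""
    pl, ph = norm.pdf(lo), norm.pdf(hi)
    m0 = norm.cdf(hi) - norm.cdf(lo)
    m1 = pl - ph
    m2 = m0 + lo * pl - hi * ph
    m3 = 2.0 * m1 + lo**2 * pl - hi**2 * ph
    return m0, m1, m2, m3

def smoothed_psi_gauss(a, b):
    """E psi(a + b*xi) for xi ~ N(0,1), b > 0, with psi as in (4)."""
    lo, hi = (-ROOT2 - a) / b, (ROOT2 - a) / b
    m0, m1, m2, m3 = gauss_trunc_moments(lo, hi)
    lin = a * m0 + b * m1
    cub = a**3 * m0 + 3 * a**2 * b * m1 + 3 * a * b**2 * m2 + b**3 * m3
    tails = (1.0 - norm.cdf(hi)) - norm.cdf(lo)
    return lin - cub / 6.0 + (2.0 * ROOT2 / 3.0) * tails

# Quick Monte Carlo sanity check.
rng = np.random.default_rng(4)
xi = rng.standard_normal(400000)
a, b = 0.4, 0.8
u = a + b * xi
mc = np.mean(np.where(np.abs(u) <= ROOT2, u - u**3 / 6.0, np.sign(u) * 2 * ROOT2 / 3.0))
print(smoothed_psi_gauss(a, b), mc)
```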

3.3 Weibull noise

The Weibull distribution with shape $k > 0$ and scale $\lambda > 0$ is defined by the distribution function

$$F(u) = 1 - \exp\left(-(u/\lambda)^{k}\right), \quad u \geq 0.$$

The corresponding density function is

$$f(u) = \frac{k}{\lambda}\left(\frac{u}{\lambda}\right)^{k-1} \exp\left(-(u/\lambda)^{k}\right), \quad u \geq 0.$$

The relative entropy can be computed in a straightforward manner, as is shown in a technical note by Bauckhage [2]. The general form for the relative entropy between $\mathrm{Weibull}(k_1, \lambda_1)$ and $\mathrm{Weibull}(k_2, \lambda_2)$ is

$$\mathrm{KL} = \log\frac{k_1}{\lambda_1^{k_1}} - \log\frac{k_2}{\lambda_2^{k_2}} + (k_1 - k_2)\left(\log\lambda_1 - \frac{\gamma}{k_1}\right) + \left(\frac{\lambda_1}{\lambda_2}\right)^{k_2}\Gamma\left(\frac{k_2}{k_1} + 1\right) - 1, \qquad (13)$$

where $\gamma \approx 0.5772$ is the Euler–Mascheroni constant.

Proposition 8 (Estimator computation).

Let $\epsilon \sim \mathrm{Weibull}(k, \lambda)$. Then, following our notation in (6)–(8) and (9), we have

for $m \leq 3$, where $\gamma(\cdot,\cdot)$ is the unnormalized (lower) incomplete Gamma function. (Popular numerical computation libraries almost always include efficient implementations of the incomplete gamma function; for example, gsl_sf_gamma_inc in the GNU Scientific Library, and special.gammainc in SciPy. See Abramowitz and Stegun [1] for more background.)
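The appearance of the incomplete Gamma function is natural: substituting $t = (u/\lambda)^k$ in the truncated moment $\mathbf{E}[\epsilon^m I\{\epsilon \leq u\}]$ for $\epsilon \sim \mathrm{Weibull}(k, \lambda)$ gives $\lambda^m \gamma(1 + m/k, (u/\lambda)^k)$. The sketch below evaluates this, noting that SciPy's special.gammainc is the regularized lower incomplete Gamma, so it must be multiplied by $\Gamma(a)$ to recover the unnormalized version; this is our own illustration of the footnoted computation, not the exact expressions of Proposition 8.

```python
import numpy as np
from scipy.special import gamma, gammainc

def lower_incomplete_gamma(a, x):
    """Unnormalized lower incomplete Gamma: gamma(a, x) = Gamma(a) * P(a, x)."""
    return gamma(a) * gammainc(a, x)

def weibull_trunc_moment(m, u, shape, scale):
    """E[eps^m * 1{eps <= u}] for eps ~ Weibull(shape, scale)."""
    return scale**m * lower_incomplete_gamma(1.0 + m / shape, (u / scale)**shape)

# Monte Carlo sanity check.
rng = np.random.default_rng(5)
shape, scale, u, m = 1.5, 2.0, 2.5, 2
eps = scale * rng.weibull(shape, size=400000)
print(weibull_trunc_moment(m, u, shape, scale), np.mean(np.where(eps <= u, eps**m, 0.0)))
```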

Thanks to Proposition 8, we know that we can compute and under Weibull noise. Now we look at the statistical guarantees that are available for such an estimation procedure.

Proposition 9 (Deviation bounds).

Additive case: consider Weibull noise with a Weibull prior, setting the scale $s$ appropriately. Then we have

with probability no less than $1 - \delta$ over the draw of the sample.

Multiplicative case: consider Weibull noise with a Weibull prior, under an appropriate parameter restriction. To keep the notation clean, we write

and have that, setting the scale $s$ as

it follows that

with probability no less than $1 - \delta$ over the draw of the sample.

Using Propositions 8–9 and equation (9), computation is done using the general form

which specializes to

Note that unlike our implementation in the previous section 3.2, here the scale factor entering (8) can be both positive and negative. The need for this arises naturally as the Weibull distribution is asymmetric, and thus, conditioned on $x_i$, we cannot in general say that $x_i \epsilon_i$ and $|x_i| \epsilon_i$ have the same distribution. As such, both cases in (8) will be utilized for computations in the Weibull case.

Remark 10 (Special case).

The Weibull distribution includes many other well-known distributions as special cases. In particular, setting shape $k = 1$ and scale $\lambda = 1/\beta$ for $\beta > 0$ yields an Exponential distribution with rate parameter $\beta$.

3.4 Student-t noise

We say a random variable has the "Student-t" distribution with $k$ degrees of freedom when it has the following probability density function:

$$f(u) = \frac{\Gamma((k+1)/2)}{\sqrt{k\pi}\,\Gamma(k/2)}\left(1 + \frac{u^2}{k}\right)^{-(k+1)/2}, \qquad (14)$$

where the integer $k \geq 1$ is called the "degrees of freedom" parameter of the distribution, and $\Gamma$ is the usual gamma function of Euler [1]. Throughout this section, we shall write $A_k = \Gamma((k+1)/2)/(\sqrt{k\pi}\,\Gamma(k/2))$ for the normalization constant.

The relative entropy computation for the Student case cannot be given in closed form; if we want to compute it directly for two Student distributions with degrees of freedom $k_1$ and $k_2$, we have

noting that the final summand requires numerical integration to evaluate. We write $\Psi$ to denote the derivative of the log-Gamma function, often called the digamma function. (There are many references for computing the Student relative entropy; see, e.g., Villa and Rubio [15] and the papers cited within for a recent reference. Numerical integration is required for the last term. The digamma function is tractable; see Abramowitz and Stegun [1] for more details. Example implementations are the gsl_sf_psi function in the GNU Scientific Library, and special.digamma in SciPy.)
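As a small computational aside, the two ingredients needed here are both standard: the sketch below evaluates the digamma function via SciPy and checks the normalization constant of (14) by numerically integrating the density, with quadrature standing in for the kind of integral term mentioned above (the exact relative entropy expression is not reproduced here).

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.integrate import quad

def student_t_logpdf(u, k):
    """Log-density of the Student-t distribution with k degrees of freedom, as in (14)."""
    log_norm = gammaln((k + 1) / 2.0) - gammaln(k / 2.0) - 0.5 * np.log(k * np.pi)
    return log_norm - 0.5 * (k + 1) * np.log1p(u**2 / k)

k = 3
print(digamma(k / 2.0))  # a digamma evaluation of the kind appearing in the KL expression
total, err = quad(lambda u: np.exp(student_t_logpdf(u, k)), -np.inf, np.inf)
print(total)  # should be ~1.0, confirming the normalization constant in (14)
```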

Proposition 11 (Estimator computation).

Let $\epsilon$ follow a Student-t distribution with $k$ degrees of freedom, for $k$ large enough that the required moments exist. Then, following our notation in (6) and (9), we have

where the relevant quantities are evaluated under a Student-t distribution with the indicated degrees of freedom.

Computation of $\hat{x}_{\mathrm{mult}}$ and $\hat{x}_{\mathrm{add}}$ under Student-t noise is slightly more complicated than in the previous cases seen above, but as shown in Proposition 11, it is tractable. Next we look at guarantees on the statistical accuracy.

Proposition 12 (Deviation bounds).

In both the additive and multiplicative cases below, the relative entropy can be bounded above by

where $b$ is a free parameter specified below, and $k$ is the degrees of freedom parameter of the underlying Student-t distribution.

Additive case: consider Student-t noise shifted by an arbitrary constant $b$, with a Student-t prior. Setting the scaling parameter $s$ as

we have with probability at least $1 - \delta$ that

Multiplicative case: consider Student-t noise shifted by an arbitrary constant $b$, with a Student-t prior. Setting the scaling parameter $s$ as

we have with probability at least $1 - \delta$ that

With Propositions 11–12, via (9) we can compute the estimators using the general form

and, with the degrees of freedom large enough (as specified in Proposition 11), we specialize as

Once again we remark that the computations can be done with either $x_i$ or $|x_i|$, since by symmetry the resulting noise distribution conditioned on $x_i$ is the same, just as in the Normal case (section 3.2).

4 Empirical analysis

In this section, we use controlled simulations to investigate the behavior of the estimators $\hat{x}_{\mathrm{mult}}$ and $\hat{x}_{\mathrm{add}}$ under various types of noise distributions, and compare this with other benchmark estimators. In addition to comparing the actual distribution of the deviations with the confidence intervals derived in section 3, we also pay particular attention to how performance depends on the underlying distribution, the mean to standard deviation ratio, and the sample size. (Full implementations of all the experiments carried out in this work are made available at the following online repository: https://github.com/feedbackward/1dim.git.)

4.1 Experimental setup

Data generation

For each experimental setting and each independent trial, we generate a sample of size $n$, compute some estimator $\hat{x}$, and record the deviation $\hat{x} - \mu$. The sample sizes range over a fixed grid, and the number of independent trials is fixed across settings. We draw data from two distribution families: the Normal family, with mean $\mu$ and variance $\sigma^2$, and the log-Normal family, with log-mean and log-variance parameters, under multiple parameter settings. Regarding the variance, we have "low," "mid," and "high" settings for both the Normal and log-Normal cases. Over all settings, the log-location parameter of the log-Normal data is fixed.

Moment control

Of particular interest here is the impact of different mean to standard deviation ratios: we test the ratio $r = \mu/\sigma$ over a grid of values. The standard deviation is fixed as just described, and the mean value is determined automatically as $\mu = r\sigma$ for each value of $r$ to be tested. Shifting the Normal data is trivially accomplished by taking the desired $\mu$ as the location parameter. Shifting the log-Normal data is accomplished by subtracting the true mean (pre-shift), equal to $\exp(\mu_{\log} + \sigma_{\log}^2/2)$, to center the data, and subsequently adding the desired location.
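A minimal sketch of this data-generation scheme follows; the particular parameter values are placeholders rather than the settings used in the paper's experiments.

```python
import numpy as np

def make_sample(n, ratio, family="normal", sd=1.0, sigma_log=0.5, rng=None):
    """Draw n points whose mean equals ratio times the standard deviation.

    Normal case: standard deviation sd, mean ratio * sd.
    log-Normal case: log-location fixed at 0, sigma_log controls the spread;
    the raw draw is re-centered by subtracting its true pre-shift mean
    exp(sigma_log**2 / 2) and then shifted to the desired location.
    """
    rng = np.random.default_rng() if rng is None else rng
    if family == "normal":
        return rng.normal(loc=ratio * sd, scale=sd, size=n)
    raw = rng.lognormal(mean=0.0, sigma=sigma_log, size=n)
    pre_shift_mean = np.exp(sigma_log**2 / 2.0)
    lognorm_sd = np.sqrt(np.exp(sigma_log**2) - 1.0) * pre_shift_mean
    return raw - pre_shift_mean + ratio * lognorm_sd

sample = make_sample(1000, ratio=2.0, family="lognormal", rng=np.random.default_rng(6))
print(sample.mean(), sample.std())
```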

Methods being tested

We compare canonical location estimators with numerous examples from the robust estimator class described and analyzed in sections 1–3. The methods being compared are organized in Table 1. Essentially, we are comparing the empirical mean and median with different manifestations of the estimators $\hat{x}_{\mathrm{mult}}$ and $\hat{x}_{\mathrm{add}}$. As additional reference, we also compare with two sub-Gaussian estimators introduced in section 1, namely the median-of-means estimator and an M-estimator constructed in the style of Catoni [4] (a code sketch of the latter is given after Table 1). For the former, we partition the sample into equal-sized subsets, with the number of subsets determined by the confidence level. Regarding bound computations, from Theorem 4.1 of Devroye et al. [7], the median-of-means estimator satisfies (1) with a known constant, as long as the sample size is large enough relative to the number of subsets. In the event this is not satisfied, we just return the sample mean. For the latter, as an influence function we use the Gudermannian function $u \mapsto 2\arctan(\exp(u)) - \pi/2$, with the scaling set using the variance and the confidence level. Using the results of Catoni [4], it is then straightforward to show that the resulting estimator satisfies

with probability no less than .

Since all non-classical estimators here depend on a confidence level parameter, for all experiments we set a common value of $\delta$. Furthermore, for precise control of the experimental conditions, we use the true variance and/or second moments of the data distribution for scaling $\hat{x}_{\mathrm{mult}}$ and $\hat{x}_{\mathrm{add}}$ as specified in section 3; this is known since we are in control of the underlying data distribution. The empirical median is computed (after sorting) as the middle point when $n$ is odd, or the average of the two middle points when $n$ is even. For $\hat{x}_{\mathrm{mult}}$ under Weibull noise, the shape parameter is fixed and the scale is determined automatically. For $\hat{x}_{\mathrm{add}}$ under Weibull noise, both the shape and scale parameters are fixed. For both $\hat{x}_{\mathrm{mult}}$ and $\hat{x}_{\mathrm{add}}$ under Student-t noise, the shift parameter and the degrees of freedom parameter are fixed.

mean: Empirical mean.
med: Empirical median.
mom: Median-of-means [7].
mest: M-estimator [4].
mult_b: $\hat{x}_{\mathrm{mult}}$ under Bernoulli noise (sec. 3.1).
mult_bc: centered version of mult_b (sec. 3.1).
mult_g: $\hat{x}_{\mathrm{mult}}$ under Normal noise (sec. 3.2).
mult_w: $\hat{x}_{\mathrm{mult}}$ with Weibull noise (sec. 3.3).
mult_s: $\hat{x}_{\mathrm{mult}}$ with Student-t noise (sec. 3.4).
add_g: $\hat{x}_{\mathrm{add}}$ with Normal noise (sec. 3.2).
add_w: $\hat{x}_{\mathrm{add}}$ with Weibull noise (sec. 3.3).
add_s: $\hat{x}_{\mathrm{add}}$ with Student-t noise (sec. 3.4).
Table 1: List of location estimators being evaluated in sections 4.2–4.4.
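For reference, the sketch below implements the mest baseline in the Gudermannian style described above; the root-finding bracket and the scaling rule $s = \sqrt{2\log(1/\delta)/(n\sigma^2)}$ are our own choices for illustration, not necessarily the exact constants used in the experiments.

```python
import numpy as np
from scipy.optimize import brentq

def gudermannian(u):
    """Gudermannian influence function: odd, bounded, increasing."""
    return 2.0 * np.arctan(np.exp(u)) - np.pi / 2.0

def m_estimate(x, s):
    """Catoni-style M-estimator: the root theta of sum_i gd(s * (x_i - theta)) = 0."""
    x = np.asarray(x, dtype=float)
    f = lambda theta: np.sum(gudermannian(s * (x - theta)))
    lo, hi = x.min() - 1.0, x.max() + 1.0   # f(lo) > 0 > f(hi), so a root is bracketed
    return brentq(f, lo, hi)

# Example with an assumed scaling rule (variance treated as known, as in the experiments).
rng = np.random.default_rng(7)
data = rng.lognormal(mean=0.0, sigma=1.25, size=1000)
n, delta, var = data.size, 0.01, data.var()
s = np.sqrt(2.0 * np.log(1.0 / delta) / (n * var))   # assumed Catoni-type scaling
print(m_estimate(data, s))
```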

4.2 Impact of mean-SD ratio

Figure 2: Deviations averaged over all trials, plotted as a function of the ratio $\mu/\sigma$. The sample size is fixed, and the variance level is low. Left: Normal data. Right: log-Normal data.

In Figure 2, we look at how the mean to standard deviation ratio impacts the performance of different estimators. Clearly, the dependence of $\hat{x}_{\mathrm{mult}}$ and $\hat{x}_{\mathrm{add}}$ on the second moment, as suggested by the bounds derived in section 3, is not vacuous. A remarkably large difference in sensitivity appears between different manifestations of these new estimators, however. In order of increasing sensitivity: Bernoulli, Weibull, Gaussian, Student. Estimators with a higher sensitivity to the absolute value of the mean are more strongly biased toward a particular location value. We also note that, observing analogous plots as the variance level increases from low to mid to high, all of the $\hat{x}_{\mathrm{add}}$ instances (the add_* methods) converge to behave identically to the Bernoulli-type estimator mult_b. The reason for this is simple. As the variance grows, so does the second moment used to set the scaling in Propositions 2, 4, 9, and 12. Since we are not modifying the noise distribution in a data-driven fashion, this means that the variance of the scaled noise decreases, taking each summand of the form $\psi(s(x_i + \epsilon_i))$ progressively closer to the Bernoulli-like case of $\psi(s x_i)$. This is in stark contrast to the multiplicative case $\psi(s \epsilon_i x_i)$, in which the noise and the data are coupled.

Overall, the Bernoulli case is least sensitive among the uncentered estimators, and considering the ease of computation, is clearly a strong choice. Furthermore, the centered version mult_bc is distinctly less sensitive to the mean-SD ratio, which is precisely what we would expect given the discussion in section 3.1. In the log-Normal case, since the dependence on the parameters being controlled is asymmetric, the nature of the underlying distribution differs for positive and negative values of the ratio $\mu/\sigma$, leading to the asymmetric curves observed.

4.3 Impact of sample size

Figure 3: Deviations averaged over all trials, plotted as a function of the sample size $n$. The mean to standard deviation ratio is fixed, and the variance level is low. Left: Normal data. Right: log-Normal data.

In Figure 3, we examine how performance improves as the sample size increases; at the same time, we may observe how performance deteriorates when data is scarce. For readability,