1 Introduction
In this work, we consider the problem of mean estimation under very weak assumptions on the underlying distribution; all we assume is that the variance is finite. This problem is important because the practitioner often will not know a priori whether or not the underlying data distribution is subGaussian in nature or whether it is heavytailed with infinite higherorder moments [7]
. Furthermore, procedures which provide strong statistical estimation guarantees in this setting have been shown to have many applications in modern machine learning tasks, where offsample generalization performance is measured using a risk (expected loss) function to be estimated empirically as a core feedback mechanism, with more robust feedback leading to provably stronger learning guarantees
[3, 6, 11].Given a sample of
independent random variables taking values in
with common distribution , the traditional approach for estimation of the mean is to use the empirical mean, for which many optimality properties are wellknown. For example, if the data is Normally distributed with variance
, then confidence intervals for the deviations can be computed exactly, and with probability no less than
, one haswhere
denotes the Normal cumulative distribution function. Even without Normal assumptions, since we have finite variance, asymptotically the central limit theorem tells us that the same kinds of guarantees are possible, where as
we haveThis deviation bound is a natural benchmark against which to compare other estimators, since in the Normal case, the empirical mean is essentially optimal, in the following sense [4]. Let be a family of distributions including all Normal distributions with some finite variance . Take any estimator , and denote any valid onesided deviation bounds by , where
for and . Then, regardless of the construction of , for any confidence level , there always exists an unlucky Normal distribution with such that
Analogous statements hold for the lower tail, meaning that the benchmark set by the empirical mean in the case of Normal data is essentially the best we can expect of any estimator, in terms of dependence on and and guarantees that hold uniformly over the model .
With this natural benchmark in mind, Devroye et al. [7] did an indepth study of socalled subGaussian estimators, namely any estimator which satisfies
(1) 
where is typically a large, nonparametric class of distributions, is the variance of , and is some distributionfree constant. Since , this is a slight weakening of the benchmark given above, but fundamentally captures the same phenomena. One important fact is that assuming only finite variance, namely the model , the empirical mean is not a subGaussian estimator. As a salient example, Catoni [4] shows that one can always create a distribution such that with probability at least , one has
namely a lower bound which says that the guarantees provided by Chebyshev’s inequality, with polynomial, rather than logarithmic dependence on , are essentially tight for the finitevariance model .
Given that the empirical mean is not distributionrobust in the subGaussian sense just stated, it is natural to ask whether, under such weak assumptions, there exist subGaussian estimators at all. The answer in this case is affirmative, although the construction of depends on the desired confidence level . This is indeed necessary, as Devroye et al. [7] prove that for one cannot construct a subGaussian estimator whose design is free of . One of the most lucid examples is the medianofmeans estimator [13], which partitions into disjoint subsets, computes a sample mean on each , and finally returns the median . With the right number of partitions (e.g., ), the estimator is subGaussian. Another lucid example is that of Mestimators with variancecontrolled scaling [4], taking the form
where is an appropriate convex function, and the parameter scales as . Furthermore, one of the main results of Devroye et al. [7] is a novel estimator construction technique which is provably subGaussian with nearly optimal constants assuming only finite variance, but their procedure is not computationally tractable.
Our contributions
All known subGaussian estimators require solving some subproblem whose solution is implicitly defined; even the most practical choices such as the medianofmeans and Mestimator approaches amount to minimizing a datadependent convex function. In this work, we consider a new class of estimators which can be computed directly and precisely, without any iterative subroutines. The cost for this is we give up subGaussianity; we show that the estimators of interest satisfy deviation bounds of the form
for an appropriate constant which only depends on and . The basic form is the same as the subGaussian condition of Devroye et al. [7], but because the second moment controls the bound instead of the variance, it becomes sensitive to the absolute value of the mean , though as we demonstrate shortly, a simple samplesplitting technique works well to mitigate this sensitivity both in theory and in practice. The estimators we construct are built by first considering the application of multiplicative or additive noise to a softtruncated version of the empirical mean, and then smoothing out these effects by taking expectation with respect to these random perturbations, whose distribution we control. We consider adding noise from Bernoulli, Normal, Weibull, and Studentt families to provide some concrete examples of the estimator class of interest. In particular, the Bernoulli variety is computationally simplest and performs best in our experimental setting, offering a new alternative to the subGaussian mean estimators cited earlier, with the appeal of no computational error and more transparent analysis.
The rest of the paper is structured as follows. In section 2 we introduce the estimator class of interest, and highlight some basic statistical and computational principles which will be of use for subsequent analysis. Our main theoretical results are organized in section 3, where we prove statistical error bounds and derive closedform expressions for concrete examples from the class of interest. In section 4 we conduct a series of controlled experiments in which we analyze performance, evaluating in particular the sensitivity to the size of the mean relative to the standard deviation. A brief discussion and concluding remarks close the paper in section 5.
2 Estimator class of interest
Our aim is to study the behavior of a class of new estimators which use momentdependent scaling, smoothed noise (both additive and multiplicative), and bounded soft truncation. Let denote a generic scaling parameter to be determined shortly, and let denote independent copies of a noise random variable . Both and are assumed to be under our control. For simplicity, we focus on the following two main types of estimators:

Multiplicative: Assuming , construct estimator as
(2) 
Additive: Assuming , construct estimator as
(3)
First, regarding the soft truncation function
, any differentiable, odd, nondecreasing function is plausible, but for concreteness we use the convenient sigmoid function of
Catoni and Giulini [5], given in (4) and pictured in Figure 1.(4) 
Second, the expectation is taken with respect to the product measure induced by the sample . As we shall see in section 3, with proper choice of , using the convenient polynomial form of , we can often compute and directly.
2.1 Computation of new estimators
Here we consider some general principles which will aid us in computing the estimators and just introduced. To begin, note that the piecewise function can be written explicitly using indicator functions as
(5) 
Let denote an arbitrary random variable. We consider computation of , where expectation is taken with respect to , and and are respectively shift and scale parameters. To streamline implementation, for integer and input , we introduce the notation
(6)  
(7) 
Some care is required with the value of . For the case of , it follows immediately that . When , then we need to pay attention to the sign,^{1}^{1}1As an obvious example, say . Unless , taking expectation over and will respectively lead to different results. computing as
(8) 
This equality follows from straightforward rearrangements. When , note
for any choice of . When , note that
from which the second case in (8) is obtained. Assuming then that evaluating is tractable, obtaining can be reduced to direct computations as
(9) 
where we have
The above follows from straightforward algebra, using the convenient form (5). In practice then, we will need to evaluate for degrees . Since the actual computations depend completely on the distribution of here, detailed derivations for different distribution families will be given in section 3.
2.2 Two generalpurpose deviation bounds
Recall the setting described above, in which we have iid data and iid “strategic noise” , respectively distributed as and , inspired by the approach of Catoni and Giulini [5], we may obtain convenient inequalities depending on the noise distribution , which is “posterior” in the sense that it may depend on the sample, and a “prior” distribution , which must be specified in advance. The fundamental underlying property we use is illustrated in Lemma 1 below.^{2}^{2}2See Holland [8] for an elementary proof.
Lemma 1.
Fix an arbitrary prior distribution on , and consider , assumed to be bounded and measurable. It follows that with probability no less than over the random draw of the sample, we have
uniform in the choice of , where expectation on the lefthand side is over the noise sample.
In the context of and defined in section 2, we can obtain convenient upper bounds through special cases of Lemma 1. The key property of the truncation function (4) is that over its entire domain, the following upper and lower bounds hold (see Figure 1 for an illustration):
(10) 
As such, when we set for the case of , and for the case of , then using the bounds (10), it follows that each of these estimators enjoy the following upper bounds, each of which holds with probability at least , uniform in the choice of noise distribution :
(11)  
(12) 
We emphasize that the randomness in the estimators and is exclusively due to the data sample, since we are integrating out any randomness due to the noise. The reason we restrict the definition of to the case of noise with nonzero mean is purely because we are interested in deviation bounds. For any noise distribution with and , the first term on the righthand side of (11) evaluates to
meaning that the bounds on in this case are always free of , which spoils this approach for seeking bounds on . On the other hand, the reason for restricting in the definition of is to prevent an artefact in the upper bound on deviations that does not depend on .
3 Theoretical analysis
Here we consider a number of different noise distribution families for constructing the estimators (2) and (3) introduced in the previous section. For each distribution, we seek both statistical error guarantees, as well as explicit forms for efficiently computing the estimators.
3.1 Bernoulli noise
Perhaps the simplest choice of noise distribution is that of randomly deleting observations with a fixed probability, namely the case of Bernoulli noise . The following result makes this concrete.
Proposition 2 (Deviation bounds and estimator computation).
Consider noise for some , and prior . The estimator in (2) takes the form
and satisfies the following deviation bound,
with probability no less than .
Remark 3 (Centered estimates).
To reduce the dependence of the above estimator on the second moments of the underlying distribution, use of an ancillary mean estimator to approximately center the data is a useful strategy. For example, from the full sample of , let the first observations be used to construct such an ancillary estimator, denoted
where . Then shift the remaining data points from as , for each . Writing the upper bound in Proposition 2 depending on the full sized sample as , the second moment of the shifted data can be readily bounded above by . Using this upper bound to scale , this time applied to the centered dataset of size , write for the resulting estimator. Shifting this back into the original position, we have our final output, namely , which enjoys variancedependent deviation tails of the form
3.2 Normal noise
Proposition 4 (Deviation bounds).
Additive case: consider noise and prior , setting and . Then, we have
with probability no less than over the draw of the sample.
Multiplicative case: consider noise and prior , setting and . Then, we have
with probability no less than over the draw of the sample.
Remark 5 (Comparison of additive and multiplicative noise).
The key difference in terms of the performance guarantees available for and under the Normal noise setting is that when , the additive case has tighter bounds, and when , the multiplicative case has tighter bounds. A much smaller technical difference is that we have confidence intervals for the multiplicative case, but intervals for the additive case.
Next, let us consider computation of the estimators under Normal noise.
Proposition 6 (Estimator computation, [5]).
For , the key quantities are computed as
where denotes the standard Normal CDF.
With Proposition 6 in hand, computation is very straightforward using the general form
recalling the definition of in (9). To implement the special case for which the bounds of Proposition 4 hold, this amounts to
Remark 7 (Convenient computation).
For the case of , note that using the value of rather than the signed is computationally convenient because then we only need to consider the positive case in evaluating (8). Of course, the quantities being computed in either case are equivalent. To see this, just note that since and we have conditioned on any ,
where equality here means equality in distribution. The validity of this statement follows from the symmetry of the Normal distribution, since both and have the same distribution, namely .
3.3 Weibull noise
with shape and scale is defined by
The corresponding density function is
The relative entropy can be computed in a straightforward manner, as is shown in a technical note by Bauckhage [2]. The general form for the relative entropy between and is
(13) 
where is the EulerMascheroni constant.
Proposition 8 (Estimator computation).
Let . Then following our notation in (6)–(8) and (9), we have
for , where is the unnormalized incomplete Gamma function.^{3}^{3}3Popular numerical computation libraries almost always include efficient implementations of the incomplete gamma function. For example, gsl_sf_gamma_inc in the GNU Scientific Library, and special.gammainc in SciPy. See Abramowitz and Stegun [1] for more background.
Thanks to Proposition 8, we know that we can compute and under Weibull noise. Now we look at the statistical guarantees that are available for such an estimation procedure.
Proposition 9 (Deviation bounds).
Additive case: consider noise with prior , setting . Then we have
with probability no less than over the draw of the sample.
Multiplicative case: consider noise with prior , where . To keep the notation clean, we write
and have that setting
it follows that
with probability no less than over the draw of the sample.
Using Propositions 8–9 and equation (9), computation is done using the general form
which specializes to
Note that unlike our implementation in the previous section 3.2, here can be both positive and negative. The need for this arises naturally as the Weibull distribution is asymmetric, and thus conditioned on , we cannot in general say that and have the same distribution. As such, both cases in (8) will be utilized for computations in the Weibull case.
Remark 10 (Special case).
The Weibull distribution includes many other wellknown distributions as special cases. In particular, setting shape and scale for
yields an Exponential distribution with rate parameter
.3.4 Studentt noise
We say a random variable has the “Studentt” distribution with degrees of freedom when it has the following probability density function:
(14) 
where the integer is called the “degrees of freedom” parameter of the distribution, and is the usual gamma function of Euler [1]. Throughout this section, for we shall write for the normalization constant.
The relative entropy computation for the Student case cannot be given in closed form; if we want to compute it directly for two Student distributions with degrees of freedom and , we have
noting that the final summand requires numerical integration to evaluate. We write to denote the derivative of the logGamma function, often called the digamma function.^{4}^{4}4There are many references for computing the Student relative entropy, e.g. Villa and Rubio [15] and the papers cited within for a recent reference. Numerical integration is required for the last term. The digamma function is tractable, see Abramowitz and Stegun [1] for more details. Example implementations are the gsl_sf_psi function in the GNU Scientific Library, and special.digamma in SciPy.
Proposition 11 (Estimator computation).
Computation of and under Studentt noise is slightly more complicated than in the previous cases seen above, but as shown in Proposition 11, it is tractable. Next we look at guarantees on the statistical accuracy.
Proposition 12 (Deviation bounds).
In both the additive and multiplicative cases below, the relative entropy can be bounded above by
where is a free parameter specified below, and is the degrees of freedom parameter of the underlying Studentt distribution.
Additive case: consider noise with prior for an arbitrary constant . Setting the scaling parameter as
we have with probability at least that
Multiplicative case: consider noise for an arbitrary constant , with prior . Setting the scaling parameter as
we have with probability at least that
With Propositions 11–12, via (9) we can compute using the general form
and with for large enough (as specified in Proposition 11), we specialize as
Once again we remark that the computations can be done with either or , since by symmetry the resulting noise distribution conditioned on is the same, just as in the Normal case (section 3.2).
4 Empirical analysis
In this section, we use controlled simulations to investigate the behavior of the estimators and under various types of noise distributions, and compare this with other benchmark estimators. In addition to comparing the actual distribution of the deviations with confidence intervals derived in section 3, we also pay particular attention to how performance depends on the underlying distribution, the mean to standard deviation ratio, and the sample size.^{5}^{5}5Full implementations of all the experiments carried out in this work are made available at the following online repository: https://github.com/feedbackward/1dim.git.
4.1 Experimental setup
Data generation
For each experimental setting and each independent trial, we generate a sample of size , compute some estimator , and record the deviation . The sample sizes range over , and the number of trials is . We draw data from two distribution families: the Normal family with mean and variance , and the logNormal family, with logmean and logvariance , under multiple parameter settings. Regarding the variance, we have “low,” “mid,” and “high” settings, which correspond to in the Normal case, and in the logNormal case. Over all settings, the loglocation parameter of the logNormal data is fixed at .
Moment control
Of particular interest here is the impact of different mean to standard deviation ratios: we test ranging over . The standard deviation is fixed as just described, and the mean value is determined automatically as for each value of to be tested. Shifting the Normal data is trivially accomplished by taking the desired . Shifting the logNormal data is accomplished by subtracting the true mean (preshift) equal to to center the data, and subsequently adding the desired location.
Methods being tested
We compare canonical location estimators with numerous examples from the robust estimator class described and analyzed in sections 1–3. The methods being compared are organized in Table 1. Essentially, we are comparing the empirical mean and median with different manifestations of the estimators and . As additional reference, we also compare with two subGaussian estimators introduced in 1, namely the medianofmeans estimator and an Mestimator constructed in the style of Catoni [4]. For the former, we partition into equalsized subsets, with . Regarding bound computations, from Theorem 4.1 of Devroye et al. [7], the medianofmeans estimator satisfies (1) with constant , as long as . In the event this is not satisfied, we just return the sample mean. For the latter, as an influence function we use the Gudermannian function , with scaling as . Using the results of Catoni [4] it is then straightforward to show that the resulting estimator satisfies
with probability no less than .
Since all nonclassical estimators here depend on a confidence level parameter, for all experiments we set . Furthermore, for precise control of the experimental conditions, we use the true variance and/or second moments of the data distribution for scaling and as specified in section 3, which is known since we are in control of the underlying data distribution. The empirical median is computed (after sorting) as the middle point when is odd, or the average of the two middle points when is even. For under Weibull noise, and is determined automatically. For under Weibull noise, and . For both and under Studentt noise, shift parameter is , and degrees of freedom parameter is .
mean  Empirical mean. 

med  Empirical median. 
mom  Medianofmeans [7]. 
mest  Mestimator [4]. 
mult_b  under Bernoulli noise (sec. 3.1). 
mult_bc  centered version of mult_b (sec. 3.1). 
mult_g  under Normal noise (sec. 3.2). 
mult_w  with Weibull noise (sec. 3.3). 
mult_s  with Student noise (sec. 3.4). 
add_g  with Normal noise (sec. 3.2). 
add_w  with Weibull noise (sec. 3.3). 
add_s  with Studentt noise (sec. 3.4). 
4.2 Impact of meanSD ratio
In Figure 2, we look at how the mean to standard deviation ratio impacts the performance of different estimators. Clearly, the dependence of and on the second moment, as suggested by the bounds derived in section 3, is not vacuous. A remarkably large difference in sensitivity appears between different manifestations of these new estimators, however. In order of increasing sensitivity: Bernoulli, Weibull, Gaussian, Student. Estimators with a higher sensitivity to the absolute value of the mean are more strongly biased toward a particular location value. We also note that in observing analogous plots as the variance level increases from low mid high, all of the instances (the add_* methods) converge to behave identically to the Bernoulli type estimator mult_b. The reason for this is simple. As grows, so does the value of used in Propositions 2, 4, 9, and 12. Since we are not modifying the noise distribution in a datadriven fashion, this means that the variance of scaled noise decreases, taking each summand in of the form progressively closer to the Bernoulli case of . This is in stark contrast to the multiplicative case , in which the noise and the data are coupled.
Overall, the Bernoulli case is least sensitive among the uncentered estimators, and considering the ease of computation, is clearly a strong choice. Furthermore, the centered version mult_bc is distinctly much less sensitive to the meanSD ratio, which is precisely what we would expect given the discussion in section 3.1. In the logNormal case, since the dependence on the parameters being controlled is asymmetric, the nature of the underlying distribution differs for positive and negative values of , leading to the asymmetric curves observed.
4.3 Impact of sample size
In Figure 3, we examine how performance improves as the sample size increases; at the same time, we may observe how performance deteriorates when data is scarce. For readability, here we only compare the two classical estimators and two known subGaussian estimators with the Bernoullitype estimators from our class of interest. In addition to the consistency of (and its centered version) that the deviation bounds of section 3 imply, we are able to confirm that is a very close competitor to the variancedependent Mestimator of Catoni [4]. Without any iterative subroutine,
does well under the outlierprone logNormal case, and yet has small enough bias so as to effectively mimic the sample mean in the Normal case. Performance is well above the medianofmeans estimator, but both
and the Mestimator make use of moment information in these tests, so definitive statements about relative superiority remain difficult to make.