 # Robust estimation of the mean with bounded relative standard deviation

Many randomized approximation algorithms operate by giving a procedure for simulating a random variable X which has mean μ equal to the target answer, and a relative standard deviation bounded above by a known constant c. Examples of this type of algorithm includes methods for approximating the number of satisfying assignments to 2-SAT or DNF, the volume of a convex body, and the partition function of a Gibbs distribution. Because the answer is usually exponentially large in the problem input size, it is typical to require an estimate μ̂ satisfy P(|μ̂/μ - 1| > ϵ) ≤δ, where ϵ and δ are user specified nonnegative parameters. The current best algorithm uses 2c^2ϵ^-2(1+ϵ)^2 ln(2/δ) samples to achieve such an estimate. By modifying the algorithm in order to balance the tails, it is possible to improve this result to 2(c^2ϵ^-2 + 1)/(1-ϵ^2)ln(2/δ) samples. Aside from the theoretical improvement, we also consider how to best implement this algorithm in practice. Numerical experiments show the behavior of the estimator on distributions where the relative standard deviation is unknown or infinite.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Suppose we are interested in approximating a target value . Then many randomized approximation algorithms work by constructing a random variable such that and for a known constant . The randomized algorithm then simulates as independent identically distributed (iid) draws from . Finally, the values are input into a function to give an estimate for .

Examples of this type of algorithm include when is the number of solutions to a logic formula in Disjunctive Normal Form (DNF) karpl1983 , the volume of a convex body dyerfk1991 , and the partition function of a Gibbs distribution huber2015a . For all of these problems, the random variable

used is nonnegative with probability 1, and so we shall only consider this case for the rest of the paper.

For these types of problems, generating the samples from a high dimensional distribution is the most computationally intensive part of the algorithm. Hence we will measure the running time of the algorithm by the number of samples from that must be generated.

The answer for these problems typically grows exponentially quickly in the size of the problem input size. Therefore, it is usual to desire an approximation that is accurate when measured by relative error. We use as our bound on the relative error, and as the bound on the probability that the relative error restriction is violated.

###### Definition 1

Say is an -randomized approximation scheme (-ras) if

 P(∣∣∣^μμ−1∣∣∣>ϵ)≤δ.

This is equivalent to considering a loss function

that is , and requiring that the expected loss be no more than .

Note that this question is slightly different than the problem most statistical estimates are designed to handle. For instance, the classic work of huber1964

is trying to minimize the asymptotic variance of the estimator, not determine how often the relative error is at most

.

The first approach to this problem is often the standard sample average

 Sn=X1+⋯+Xnn.

For this estimator, only knowing that , the best we can say about the probability that the relative error is large comes from Chebyshev’s inequality tchebichef1867 .

 P(|Sn−μ|>ϵμ)≤Var(X)nϵ2μ2≤c2ϵ−2n−1.

In particular, setting gives an -ras. There are simple examples where Chebyshev’s inequality is tight.

While this bound works for all distributions, for most distributions in practice the probability of error will go down exponentially (and not just polynomially) in . We desire an estimate that matches this speed of convergence.

For example, suppose that is a normal random variable with mean and variance (write

. Then the sample average is normally distributed as well. To be precise,

, and it is straightforward to show that

 P(|Sn−μ|>ϵμ)=P(|Z|≥ϵ√n/c),

where is a standard normal random variable. This gives

 n=2c2ϵ−2[ln(2/δ)−o(δ)]. (1)

samples being necessary and sufficient to achieve an -ras.

We did this calculation for normally distributed samples, but in fact this gives a lower bound on the number of samples needed. For any with mean and variance , and an estimate ,

 P(|^θ−μ|>ϵμ)≥(1/2)P(|Z|≥ϵ√n/c).

See, for instance, Proposition 6.1 of catoni2012 . Hence the minimum number of samples required for all instances is

 n≥2c2ϵ−2[ln(1/δ)−o(δ)].

The method presented in hubertoappearb comes close to this lower bound, requiring

 n=⌈2c2ϵ−2(1+ϵ)2ln(2/δ)⌉

samples. So it is larger than optimal by a factor of .

Our main result is to modify the estimator slightly. This has two beneficial effects.

1. The nuisance factor is reduced from first order in to second order. To be precise, the new nuisance factor is .

2. It is possible to solve exactly for the value of the -estimator (using square and cube roots) rather than through numerical appoximation if so desired.

###### Theorem 1.1

Suppose that , , and the standard deviation of is at most , where . Then there exists an -ras where at most

 ⌈2(c2ϵ−2+1)(1−ϵ2)−1ln(2/δ)⌉

draws from are used.

Ignoring the ceiling function, the new method uses a number of samples bounded by the old method times a factor of

 1+ϵ2/c2(1+ϵ)2(1−ϵ2).

For instance, when and , this factor is , and so we obtain an improvement of over 14% in the running time. At first, this might seem slight, but remember that the best improvement we can hope to make based on normal random variables is , or a bit less than 18%. Therefore, this does not quite obtain the maximum improvement of , but does obtain a factor of .

The remainder of this paper is organized as follows. In Section 2, we describe what -estimators are, and show how to down weight samples that are far away from the mean. Section 3 shows how to find the estimator both approximately and exactly. In Section 4 we consider using these estimators on some small examples and see how they behave numerically.

## 2 Ψ-estimators

The median is a robust centrality estimate, but it is easier to build a random variable whose mean is the target value. P.J. Huber first showed in huber1964 how to build an estimate that has the robust nature of the median estimate while still converging to the mean.

Consider a set of real numbers . The sample average of the is the point where the sum of the distances to points to the right of equals the sum of the distances to points to the left of .

This idea can be generalized as follows. First, begin with a function . Second, using , form the function

 Ψ(m)=n∑i=1ψ(xi,m).

For convenience, we suppress the dependence of on in the notation.

Consider the set of zeros of . These zeros form the set of -type -estimators for the center of the points .

For example, suppose our is defined as

 f(x,m)={(x/m)−1m≠00x=m=0.

Then consider

 m∑i=1f(xi,m)=0.

For and with the same units, the right hand side is unitless. If the are nonnegative and at least one is positive,

 n∑i=1f(xi,m)=0⇔m=∑ni=1xin.

That is, the unique -estimator using as our function is the sample average.

Now consider

 g(x,m)=\mathbbold1(f(x,m)>0)−\mathbbold1(f(x,m)<0).

Then we wish to find such that

 n∑i=1g(xi,m)=0.

Summing adds 1 when greater than , and subtracts 1 when is . Then the sum of the is zero exactly when there are an equal number of that are above and below . When the number of distinct

values is odd, then

has a unique solution equal to the sample median.

The sample median has the advantage of being robust to large changes in the values, but this will converge to the median of the distribution (under mild conditions) rather than the mean. The sample average actually converges to the mean when applied to iid draws from

, and this is the value we care about finding. However, the sample average can be badly thrown off by a single large outlier. Our goal is to create a

function that combines the good qualities of each while avoiding the bad qualities.

Both and can be expressed using weighted differences. Let and be defined as

 df(u) =u dg(u) =\mathbbold1(u>0)−\mathbbold1(u<0).

Let for , and for . Then and .

Catoni catoni2012 improved upon this -estimator by using a function that approximated for , and approximated for , and could be analyzed using the Chernoff bound approach chernoff1952 . Catoni and Guillini then created an easier to use version of the -estimator in catonig2017 .

The following is a modification of the Catoni and Guillini -estimator. Unlike catonig2017 , the weighted difference here does not have square roots in the constants, which makes them slightly easier to work with from a computational standpoint.

 dh(u) =(56)\mathbbold1(u>1)+(u−u36)\mathbbold1(u∈[−1,1])−(56)\mathbbold1(u<−1).

Then define

 h(xi,m)=dh(u(xi,m)).

The function behaves like the mean weight for values near 0, but like the median weights for values far away from 0. See Figure 1

. In order to link this function to the first and second moments of the random variable, the following links

to weighted differences introduced in catoni2012 . See Figure 2.

###### Lemma 1

Let

 dL(u) =−ln(1−u+u2/2) dU(u) =ln(1+u+u2/2).

Then for all ,

 dL(u)≤dh(u)≤dU(u)
###### Proof

Note . This derivative is positive over so is increasing in this region. Since , over .

Similarly, for , and so it is decreasing in this region, and so in . Finally, inside , . So a minimum of occurs either at , , or a critical point where . The unique critical point in is at , and . This means for all .

The lower bound follows from .

It helps to introduce a scale factor that allows us to extend the effective range where . Let

 hλ(xi,m)=λ−1dh(λ(xi/m−1)).

Then is a parameter that can be chosen by the user ahead of running the algorithm based upon and .

The following was shown as Lemma 13 of hubertoappearb .

###### Lemma 2

For given , let and . Set

 n=⌈2c2ϵ−2ln(2/δ)(1+ϵ)2⌉.

For iid , form the function

 Ψλ(m)=1nn∑i=1λ−1dh(λu(Xi,m)).

Let be any value of such that Then is an -ras for .

To improve upon this result and obtain Theorem 1.1, we must be more careful about our choice of .

To meet the requirement of an -ras, we must show that has all its zeros in the interval with probability at least . Given that is continuous and decreasing, it suffices to show that and .

###### Lemma 3

Let be iid . Then for ,

 P(Ψλ((1+ϵ)μ)≥0) ≤[1+E[uϵ]+12E[u2ϵ]]n, P(Ψλ((1−ϵ)μ)≤0) ≤[1−E[u−ϵ]+12E[u2−ϵ]]n.

where

 uϵ=X(1+ϵ)μ−1.
###### Proof

Note that

 P(Ψλ((1+ϵ)μ)≥0) =P(exp(λΨλ((1+ϵ)μ)≥1) ≤E[exp(λΨλ((1+ϵ)μ))]

by Markov’s inequality. Note

 exp(λΨλ(m))=n∏i=1exp(λdh(Xi/m−1)).

Each term in the product is independent, therefore the mean of the product is the product of the means (which are identical.)

Let . Then

 P(Ψλ((1+ϵ)μ)≥0)≤[E(exp(λdh(u)))]n

From the previous lemma,

 exp(dh(u)) ≤exp(ln(1+u+u2/2)) =1+u+u2/2,

Hence

 E(exp(dh(u)))=1+E(u)+E(u2/2).

Putting into this expression then gives the first inequality.

For the second inequality, the steps are nearly identical, but we begin by multiplying by . This completes the proof.

In particular, if we choose so that , and , then by the union bound the probability that has a root in is at least .

As is well known, for ,

 (1−g)n≤exp(−gn),

and so if ,

 (1−g)n≤δ/2.

We refer to as the gap. Since the number of samples needed is inversely proportional to the gap, we wish the gap to be as large as possible. We can lower bound the gap given by the previous lemma in terms of and .

###### Lemma 4

For , and ,

 1+E[u]+E[u2/2]≤1+λμma(m)+12(λμm)2[c2+a(m)2]. (2)

Similarly,

 1+E[u]+E[u2/2]≤1−λμma(m)+12(λμm)2[c2+a(m)2]. (3)
###### Proof

From linearity of expectations,

 E[u]=λ(μm−1)=λμma(m).

The second moment is the sum of the variance plus the square of the first moment. Using , we get

 E[u2] =Var(u)+E[u]2 =(λm)2Var(X)+[λμma(m)]2.

Since ,

 E[u2] ≤(λμm)2[c2+a(m)2],

giving the first result. The proof of the second statement is similar.

###### Lemma 5

Let and . Then

 pℓ ≤[1−λϵ1−ϵ+12(λ1−ϵ)2(c2+ϵ2)]n, pr ≤[1−λϵ1+ϵ+12(λ1+ϵ)2(c2+ϵ2)]n.
###### Proof

Note

 a((1−ϵ)μ)=1−(1−ϵ)μ/μ=ϵ.

Combine Lemmas 34, and 6 with equal to to get the first inequality. Then set and use to get the second.

Our goal is to use as few samples as possible, which means simultaneously minimizing the quantities inside the brackets in Lemma 5. Both of the upper bounds are upward facing parabolas, and they have a unique minimum value.

###### Lemma 6

The minimum value of is

 1−a212a2,

at

###### Proof

Complete the square.

However, because the coefficients in the quadratic upper bounds are different, it is not possible to simultaneously minimize these bounds with the same choice of .

What can be said is that these bounds are quadratic with positive coefficient on , and so the best we can do is to choose a such that the upper bounds are equal to one another. This gives us our choice of .

###### Lemma 7

Let

 λ=ϵc2+ϵ2(1−ϵ2).

Then

 max{pℓ,pr}≤[1−12⋅ϵ2c2+ϵ2(1−ϵ2)]n.
###### Proof

Follows directly from Lemma 5.

This makes the inverse gap

 ϵ−2(c2+ϵ2)(1−ϵ2)−1=(c2ϵ−2+1)(1−ϵ−2)−1

and immediately gives Theorem 1.1.

## 3 Computation

Set . Consider how to locate any root of for a given set of . As before, we assume that the are nonnegative and not all identically zero. The function is continuous and decreasing, although not necessarily strictly decreasing. Therefore, it might have a set of zeros that form a closed interval.

### 3.1 An approximate method

Suppose that the points are sorted into their order statistics,

 X(1)≤X(2)≤⋯X(n).

Then and , so there exists some such that and . Since any particular value of requires time to compute, this index can be found using binary search in time.

At this point we can switch from a discrete binary search over to a continuous binary search over the interval . This allows us to quickly find a root to any desired degree of accuracy.

This is the method that would most likely be used in practice.

### 3.2 An exact method

Although the approximation procedure is what would be used in practice because of its speed, there does exist a polynomial time exact method for this problem. For , and a value , suppose that the points of that fall into the interval are exactly Then say that .

That is, define the set for as follows.

 m(i,j)={m:X(k)∈[m−λ,m+λ]⇔k∈{i,i+1,…,j}}.

See Figure 3. Figure 3: Since the interval [m−λ,m+λ] includes X(2), X(3), and X(4), m∈m(2,4).

Note that

 (∀m∈[X(1),X(n)])(∃i

There are at most choose 2 such where is nonempty. In fact, since as we slide from to , each point can enter or leave the interval exactly once. Therefore there are at most pairs where is nonempty.

That means that to find a root of we need merely check if there is a root for all such that .

For each , the contribution of to is , and the contribution of is . Hence is a zero of if an only if and

 n−j−(i−1)+j∑k=i(rX(k)−1)−(rX(k)−1)3/6=0.

This last equation is a cubic equation in , and so the value of that satisfies it (assuming such exists) can be determined exactly using the cubic formula.

## 4 Numerical Experiments

The

-estimator presented here can be thought of as a principled interpolation between the sample mean and the sample median. Because for

small , as , the estimator converges to the sample mean.

At the other extreme, as , all of the will evaluate to either 0, 1, or -1. Hence the estimator converges to the sample median.

When , we can precisely bound the chance of the relative error being greater than . However, the estimator can be used for any value of .

For instance, Table 1 records the result of using the estimator for 100 exponential random variables with mean 2 and median .

When is small, the result is nearly identical to the sample mean. As increases, the result moves towards the median value.

Unlike the sample average, however, this

-estimator will always converge to a value, even when the mean does not exist. Consider the following draws from the absolute value of a Cauchy distribution. The mean of these random variables is infinite so the sample average will not converge. The median of this distribution is 1.

Even for values such as , the result is fairly close to the median. For any , the -estimator will not go to infinity as the sample average does, but instead to converge to a fixed value as the number of samples goes to infinity.

### 4.1 Timings

To test the time required to create the new estimates, the algorithm presented here together with the algorithm from hubertoappearb were implemented in R. Table 3 shows the results of running both algorithms using an Euler-Maruyama simulation of a simple SDE as a test case.

As can be seen from the data, the relative change is near , and the relative change gets closer to the smaller becomes.

## 5 Conclusion

The modified Catoni -estimator presented here gives a means of interpolating between the sample mean and the sample average. The estimator is designed for the output of Monte Carlo simulations where the distribution is usually unknown, but often it is possible to compute a bound on the relative standard deviation. It is fast to calculate in practice and can be computed exactly in terms of square and cube roots in polynomial time. The estimator has a parameter which controls how close the estimate is to the sample mean or sample median.

Given a known upper bound on the relative standard deviation of the output, can be chosen as to yield an -randomized approximation scheme that uses a number of samples (to first order) equal to that if the data was normally distributed. Even if is unknown (or infinite), the estimator will still converge to a fixed measure of centrality for any choice of .

## 6 Code

The following code was used to create the tables in Section 4.

The function psi_cg implements the method from Catoni & Guilini :

[] psi_cg <- function(m, x, lambda) { # Calculates the psi function for a piece of data u <- lambda * (x / m - 1) a1 <- 2 * sqrt(2) / 3 a2 <- sqrt(2) d <- a1 * (u > a2) - a2 * (u < -a2) + (u - u^3 / 6) * (u >= -1) * (u <= 1) return(sum(d)) }
[] psi_h <- function(m, x, lambda) { # Calculates the psi function for a piece of data u <- lambda * (x / m - 1) d <- (5 / 6) * (u > 1) - (5 / 6) * (u < -1) + (u - u^3 / 6) * (u >= -1) * (u <= 1) return(sum(d)) }
[] m <- function(x, lambda, estimate = psi_h, tol = 10^(-12)) { x <- sort(x) # Put everything in order a <- x b <- x[length(x)] while ((b - a) > tol) { c <- (a + b) / 2 psi.c <- estimate(c, x, lambda) if (psi.c < 0) b <- c else a <- c } return((a + b) / 2) }

### 6.1 Numerical Experiments

Now we are ready to undertake the numerical experiments.

[] # install.packages("tidyverse") library(tidyverse)

Create Table 1 (accuracy new method):

[] # Generate data to work with set.seed(123456) v <- rexp(100 * 5, rate = 1 / 2) r <- c(rep(1, 100), rep(2, 100), rep(3, 100), rep(4, 100), rep(5, 100)) df <- tibble(run = r, rv = v) df_sum <- df group_by(run) summarize( Mean = mean(rv), Median = median(rv), "\\lambda = m(rv, lambda = 0.1), "\\lambda = m(rv, lambda = 1), "\\lambda = m(rv, lambda = 5), ) select(-run) kable(df_sum, digits = 2)

Create Table 2 (accuracy new method):

[] # Generate data to work with set.seed(123456) v <- abs(rcauchy(100 * 5)) r <- c(rep(1, 100), rep(2, 100), rep(3, 100), rep(4, 100), rep(5, 100)) df <- tibble(run = r, rv = v) df_sum <- df group_by(run) summarize( Mean = mean(rv), Median = median(rv), "\\lambda = m(rv, lambda = 0.1), "\\lambda = m(rv, lambda = 1), "\\lambda = m(rv, lambda = 5), ) select(-run) kable(df_sum, digits = 2)

Toy example of process where we know ratio of the variance to square of mean, Brownian motion created through an inefficient process.

[] # Euler-Maruyama for SDE: dS_t = dt + dW_t bm <- function(t, h) { z <- rnorm(t / h) s <- 0 for (i in 1:length(z)) s <- s + h * 1 + sqrt(h) * z[i] return(s) }

Time test for old:

[] time_psi_cg <- function(epsilon, delta, c) { start_time <- as.numeric(as.numeric(Sys.time())*1000, digits=15) n <- ceiling(2 * c^2 * epsilon^(-2) * log(2 / delta) * (1 + epsilon)^2) lambda <- epsilon / (1 + epsilon) / c^2 r <- replicate(n, bm(1, 0.001)) hatm <- m(r, lambda, psi_cg) end_time <- as.numeric(as.numeric(Sys.time())*1000, digits=15) return(end_time - start_time) }

Time test for new:

[] time_psi_h <- function(epsilon, delta, c) { start_time <- as.numeric(as.numeric(Sys.time())*1000, digits=15) n <- ceiling(2 * (c^2 * epsilon^(-2) + 1) * (1 + epsilon^2)^(-1) * log(2 / delta)) lambda <- epsilon * (1 - epsilon^2) / (c^2 + epsilon^2) r <- replicate(n, bm(1, 0.001)) hatm <- m(r, lambda, psi_h) end_time <- as.numeric(as.numeric(Sys.time())*1000, digits=15) return(end_time - start_time) }

Timings using an Intel(R) Core(TM) i7-6700 CPT @ 3.40GHz.

[] set.seed(123456) ep <- c(0.1, 0.05) del <- c(10^(-6), 10^(-6)) r <- rep(0, 2) s <- rep(0, 2) for (i in 1:2) { r[i] <- mean(replicate(100, time_psi_cg(ep[i], del[i], 1))) s[i] <- mean(replicate(100, time_psi_h(ep[i], del[i], 1))) } df <- tibble(epsilon = ep, delta = del, CG = r, New = s) mutate(improvement = (s - r) / r) kable(df)

###### Acknowledgements.
This work supported by National Science Foundation grant DMS-1418495.