 # A New Confidence Interval for the Mean of a Bounded Random Variable

We present a new method for constructing a confidence interval for the mean of a bounded random variable from samples of the random variable. We conjecture that the confidence interval has guaranteed coverage, i.e., that it contains the mean with high probability for all distributions on a bounded interval, for all sample sizes, and for all confidence levels. This new method provides confidence intervals that are competitive with those produced using Student's $t$-statistic, but does not rely on normality assumptions. In particular, its only requirement is that the distribution be bounded on a known finite interval.


## 1 Introduction

Consider one of the fundamental problems in statistics: how to use samples of a real-valued random variable to obtain a confidence interval on its mean. Methods for constructing such confidence intervals are used across all branches of science. In the natural and social sciences, the confidence interval based on Student's $t$-statistic (Student, 1908) is one of the standard tools for quantifying uncertainty about the results of an empirical study. In theoretical work, concentration inequalities like Hoeffding's inequality (Hoeffding, 1963) are often used to analyze properties of algorithms in machine learning, data science, and other areas. Providing methods for obtaining tighter confidence intervals from fewer samples is critical to scientific advancement, enabling stronger conclusions to be drawn from the same experimental data.

In particular, there is a practical need today for confidence intervals that hold for small sample sizes. Since the confidence interval produced using Student's $t$-statistic, which we refer to hereafter as the Student-$t$ interval, relies on the (near) normality of the sample mean, it is recommended that sample sizes be at least 30 for it to be used, unless there is a specific reason to believe that the population distribution is approximately normal. While other confidence intervals that hold for small sample sizes exist (such as Anderson's (1969)), they produce intervals that are so wide as to be of little use in practice. This leaves the practitioner with the following choices:

• use methods, such as bootstrap methods, with no performance guarantees;

• use methods with unrealistic assumptions, such as the Student-$t$ interval;

• use valid but weak methods, such as Hoeffding's or Anderson's inequalities, that provide little information about the mean;

• abandon the idea of obtaining useful confidence intervals from the data.

In this paper, we introduce a new confidence interval for bounded distributions that is much tighter than other confidence intervals that come with guarantees. We conjecture that it holds for all bounded distributions, all sample sizes, and all confidence levels. We suggest that for many applications, this is the first practical confidence interval for sample sizes less than 30. We now offer a formal statement of the problem we are addressing.

### 1.1 Problem Statement

Let $X_1, \ldots, X_n$ be independent and identically distributed real-valued random variables. Let each $X_i$ take values in the interval $[0,1]$ and have expected value $\mu$. For now we focus on a high-confidence upper bound, i.e., we desire a function $m_\alpha$ such that, for all distributions of $X_i$, all sample sizes $n$, and all confidence levels $1-\alpha$:

$$\Pr\big(m_\alpha(X_1, \ldots, X_n) \ge \mu\big) \ge 1 - \alpha. \tag{1}$$

That is, with probability at least $1-\alpha$, $m_\alpha(X_1, \ldots, X_n)$ should be greater than or equal to the mean. Critically, in this statement the random quantity is the high-confidence upper bound, not the mean. Any definition of $m_\alpha$ that satisfies (1) for all bounded distributions, all sample sizes $n$, and all $\alpha \in (0,1)$ is said to have guaranteed coverage.

In this paper we present a new method for constructing confidence intervals on the mean: a new $m_\alpha$. We conjecture that our function satisfies (1), i.e., that it has guaranteed coverage. If our conjecture holds, this is the first confidence interval with tightness comparable to the Student-$t$ interval but with guaranteed coverage in this setting. We prove in Section 8 that it dominates several other known confidence intervals with guaranteed coverage. That is, for every possible sample $z$, it produces a confidence interval with width less than or equal to that of these previous methods, often with a much smaller width. This makes our confidence interval suitable for small sample sizes where other methods are not practical.

After defining $m_\alpha(z)$ in the next section, we present the following results:

• a proof of (1) for a class of distributions that includes Bernoulli distributions, for all sample sizes $n$, and for all confidence levels $1-\alpha$;

• the sketch of a proof that our intervals are always at least as tight as those provided by Anderson (1969), which, in turn, are strictly tighter than those of Hoeffding (1963);

• results of extensive simulations on a wide variety of distributions that are consistent with (1) for many sample sizes and confidence levels;

• empirical comparisons (through Monte Carlo simulations) with previous methods, demonstrating that the confidence intervals produced by $m_\alpha$ are consistently tighter than or as tight as the intervals produced by existing methods.

## 2 A New Confidence Interval for the Mean

In this section we present our new confidence interval. We also present our conjecture that it holds for all distributions bounded on $[0,1]$, for all sample sizes, and for all confidence levels.

Let $X = (X_1, \ldots, X_n)$. Let $Z = (Z_1, \ldots, Z_n)$ be the order statistics of $X$, i.e., $Z$ is a vector containing the sorted values of $X$ such that $Z_1 \le Z_2 \le \cdots \le Z_n$. Let $x$ denote a particular sample of $X$ and $z$ a sample of $Z$. For notational convenience, we alternate between viewing $m_\alpha$ as a function of $x$ or $z$. So, when we write $m_\alpha(z)$ subsequently, this corresponds to a definition of $m_\alpha(x)$ where $z$ are the order statistics of $x$.

Let $U = (U_1, \ldots, U_n)$ be the order statistics of a sample of size $n$ from the continuous uniform distribution on $[0,1]$, with $u$ being a particular sample of $U$. Since $U$ are order statistics, $U_1 \le U_2 \le \cdots \le U_n$. We define a function of two ordered vectors:

$$m(z, u) \stackrel{\text{def}}{=} 1 - \sum_{i=1}^{n} u_i (z_{i+1} - z_i), \tag{2}$$

where $z_{n+1} \stackrel{\text{def}}{=} 1$. Let $Q(1-\alpha, Y)$ be the quantile function of the scalar random variable $Y$, i.e.,

$$Q(1-\alpha, Y) \stackrel{\text{def}}{=} \inf\{y \in \mathbb{R} : F_Y(y) \ge 1-\alpha\}, \tag{3}$$

where $F_Y$ is the cumulative distribution function (CDF) of $Y$.

Consider the random quantity $m(z, U)$, which depends upon a fixed (non-random) sample $z$ and also on the random vector $U$. We define $m_\alpha(z)$ to be the $(1-\alpha)$-quantile of $m(z, U)$, i.e.,

$$m_\alpha(z) \stackrel{\text{def}}{=} Q(1-\alpha, m(z, U)). \tag{4}$$

We conjecture that this definition of $m_\alpha$ satisfies (1):

###### Conjecture 1.
Let $X_1, \ldots, X_n$ be independent and identically distributed random variables bounded in the interval $[0,1]$, each with mean $\mu$. Let $Z$ be the order statistics of $(X_1, \ldots, X_n)$. Then for all $\alpha \in (0,1)$:

$$\Pr\big(m_\alpha(Z) \ge \mu\big) \ge 1 - \alpha, \tag{5}$$

where $m_\alpha$ is defined in (4).

Several extensions of this conjecture are apparent. First, by applying the bound to the transformed samples $1 - X_i$ (which are also bounded in $[0,1]$), this conjecture implies a corresponding $1-\alpha$ confidence lower bound on $\mu$. Second, if our main conjecture holds, we further conjecture that the assumption that the random variables are in $[0,1]$ can be extended to $(-\infty, 1]$ (or $[0, \infty)$ for the high-confidence lower bound). Furthermore, the deterministic upper bound of 1 can be loosened to only require an almost-sure upper bound of 1. Although these extensions may be important for some applications, hereafter we focus on the basic setting introduced previously.

## 3 Understanding $m_\alpha(z)$

Our high-confidence bound (for brevity, hereafter we refer to it as simply our bound) is given by the function $m_\alpha$ defined above. In this section we introduce the following concepts, which provide intuition for $m_\alpha$:

• ordered CDF pairs,

• the conservative completion of a set of ordered CDF pairs,

• the induced mean of a set of ordered CDF pairs, via conservative completion.

### 3.1 Ordered CDF pairs

For any order statistic vector $z$, each element of $z$ can be paired with an element from a non-decreasing sequence of numbers, $u = (u_1, \ldots, u_n)$, to form $n$ pairs:

$$(z_1, u_1), (z_2, u_2), \ldots, (z_n, u_n). \tag{6}$$

Assuming the $u_i$'s are all in the interval $[0,1]$ (as is the case if $u$ is a sample of $U$), these pairs can be viewed as points on a CDF $F$, i.e., $F(z_i) = u_i$. For this reason, we refer to these pairs as ordered CDF pairs, and write $(z, u)$ to denote such a set of ordered CDF pairs. We say that a set of ordered CDF pairs $(z, u)$ is consistent with a CDF $F$ if $F(z_i) = u_i$ for all $i$. Notice that a set of ordered CDF pairs is consistent with many (usually infinitely many) different CDFs: all non-decreasing functions on the interval $[0,1]$ that pass through these points (see Figure 1).

Figure 1: Given a sample z and a vector u of sorted uniform samples, the ordered CDF pairs (black points) are compatible with a large family of CDFs. Two of the CDFs compatible with these points are shown, a smooth orange one and a stairstep blue one. The blue one represents the CDF F(z,u)(x), which has the greatest mean among all such CDFs, since it puts mass "as far right" as possible in a way that is still compatible with the ordered CDF pairs. We refer to this CDF as the conservative completion of the ordered CDF pairs (z,u).

### 3.2 Conservative Completion of Ordered CDF Pairs

Given a set of ordered CDF pairs, one may ask: among the (usually infinitely many) CDFs consistent with the ordered CDF pairs, which represents the distribution with the greatest mean, and is this CDF unique? This CDF is indeed unique, and we refer to it as the conservative completion of the ordered CDF pairs. The mean of the distribution characterized by the conservative completion is thus an upper bound on the mean of any distribution consistent with the set of ordered CDF pairs.

The conservative completion for a set of ordered CDF pairs, $(z, u)$, is illustrated in Figure 1. It is given by the CDF:

$$F_{(z,u)}(x) \stackrel{\text{def}}{=} \begin{cases} 0 & \text{if } x < z_1, \\ u_i & \text{if } z_i \le x < z_{i+1},\ i \in \{1, \ldots, n\}, \\ 1 & \text{if } x \ge z_{n+1} = 1. \end{cases} \tag{7}$$

Figure 2: For CDFs defined on the interval [0,1], the mean of the distribution characterized by F(z,u)(x) is given by the area of the region above the CDF (blue), or one minus the area of the region below the CDF (pink).

### 3.3 The Induced Mean, m(z,u)

We introduce $m(z, u)$ to represent the mean of the distribution characterized by $F_{(z,u)}$. This quantity is, for distributions over $[0,1]$, equivalent to the area of the region above the CDF, as depicted in Figure 2. The geometry of this figure suggests two methods for calculating this mean. The first, derived by decomposing the blue region in Figure 2 into a set of horizontal strips, is

$$m(z, u) = \sum_{i=1}^{n+1} z_i (u_i - u_{i-1}), \tag{8}$$

where $u_0 \stackrel{\text{def}}{=} 0$, $u_{n+1} \stackrel{\text{def}}{=} 1$, and $z_{n+1} \stackrel{\text{def}}{=} 1$. Another formula is given by dividing the pink region into vertical strips:

$$m(z, u) = 1 - \sum_{i=1}^{n} u_i (z_{i+1} - z_i). \tag{9}$$
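The equivalence of (8) and (9) can be checked numerically. A minimal Python sketch (function names are our own, for illustration):

```python
import random

def induced_mean_horizontal(z, u):
    """Eq. (8): area above the CDF via horizontal strips.
    Uses the conventions u_0 = 0, u_{n+1} = 1, z_{n+1} = 1."""
    n = len(z)
    zz = list(z) + [1.0]            # z_{n+1} = 1
    uu = [0.0] + list(u) + [1.0]    # u_0 = 0, u_{n+1} = 1
    return sum(zz[i - 1] * (uu[i] - uu[i - 1]) for i in range(1, n + 2))

def induced_mean_vertical(z, u):
    """Eq. (9): one minus the area below the CDF via vertical strips."""
    n = len(z)
    zz = list(z) + [1.0]
    return 1.0 - sum(u[i] * (zz[i + 1] - zz[i]) for i in range(n))

random.seed(0)
z = sorted(random.random() for _ in range(6))
u = sorted(random.random() for _ in range(6))
m_h = induced_mean_horizontal(z, u)
m_v = induced_mean_vertical(z, u)
```

Since (8) and (9) describe the same area, the two functions agree up to floating-point error on any sorted inputs.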

Next, we consider the distribution of such induced means obtained by allowing $u$ to vary in a particular fashion.

### 3.4 A Distribution of Induced Means

Recall that $U$ is a random vector containing $n$ samples from the continuous uniform distribution on $[0,1]$, sorted such that $U_1 \le U_2 \le \cdots \le U_n$. We now consider the distribution of induced means obtained by replacing the fixed $u$ in $m(z, u)$ with the random vector $U$ to form a new scalar random variable $m(z, U)$.

Recall the definition of the quantile function $Q$ from (3). We define $m_\alpha(z)$ to be the $(1-\alpha)$-quantile of the random variable $m(z, U)$, i.e.,

$$m_\alpha(z) \stackrel{\text{def}}{=} Q(1-\alpha, m(z, U)). \tag{10}$$

Thus, $m_\alpha(z)$ considers the set of all $u$'s that can be used to form ordered CDF pairs with a particular sample $z$ and chooses the $(1-\alpha)$-quantile of the resulting induced means. As we shall see, this turns out to be just "conservative enough" to provide a valid high-confidence bound for Bernoulli-like distributions, and appears to be looser for distributions that are not Bernoulli-like.

## 4 The Order Statistic Simplex and Feasible Set

In this section, we define the order statistic simplex and the notion of feasible and infeasible sets of samples of order statistics. These definitions will be used in Section 5 to prove that our bound holds for all Bernoulli distributions and also for a more general set of Bernoulli-like distributions.

### 4.1 A Conditional Analysis

Our general method of proof (in the next section) will rely on a conditional analysis for a specific set of distributions. In particular, we will analyze our bound specifically for

• a fixed sample size, $n$,

• a subset of the distributions with a specific mean, $\mu$,

• a specific confidence level, $1-\alpha$ (or equivalently, a specific failure rate, $\alpha$).

If we can show that the bound, (1), holds for each tuple $(n, \mu, 1-\alpha)$, then we have a complete proof for the set of distributions under consideration.

Before proceeding with this method of proof, we define a few necessary terms.

### 4.2 The Order Statistic Simplex

Consider the order statistic simplex in $n$ dimensions: the set of all possible order statistic vectors, which forms a polytope of dimension $n$ with $n+1$ vertices, i.e., a simplex. For distributions on $[0,1]$, we define the order statistic simplex as:

$$Z \stackrel{\text{def}}{=} \{z = (z_1, z_2, \ldots, z_n) : 0 \le z_1 \le z_2 \le \cdots \le z_n \le 1\}. \tag{11}$$

The order statistic simplex for $n = 2$ is depicted by the blue region in Figure 3.

Figure 3: The order statistic simplex in n=2 dimensions. The upper left region (blue) shows the region of possible (valid) order statistics for a sample of size n=2. For other n, this region is defined by a polytope of n dimensions and n+1 vertices, i.e., a simplex. We refer to this as the order statistic simplex in n dimensions.

Figure 4: Feasible and infeasible regions for n=2. For various values of the confidence level 1−α and the mean μ, we show the feasible regions in blue (for which the bound is greater than or equal to the mean) and the infeasible regions in green (for which the bound is less than the mean).

### 4.3 The Infeasible Set of z’s

Let the sample size, $n$, be fixed. For distributions with a specific mean, $\mu$, and for a specific confidence level, $1-\alpha$, let $Z^\mu_\alpha$ denote the set of $z$'s for which $m_\alpha(z) \ge \mu$. We refer to $Z^\mu_\alpha$ as the feasible set, and say that $z$ satisfies the bound if $z \in Z^\mu_\alpha$. Let $\bar{Z}^\mu_\alpha$ denote the complement of this set, the set of $z$'s for which $m_\alpha(z) < \mu$. We refer to $\bar{Z}^\mu_\alpha$ as the infeasible set of $z$'s, and say that $z$ does not satisfy the bound if it is in $\bar{Z}^\mu_\alpha$. Note that, given the mean, $\mu$, of a distribution, the feasible and infeasible sets have no other dependency on the unknown distribution of $Z$.

These ideas are illustrated in Figure 4. For sample size $n = 2$, each plot shows the feasible (blue) and infeasible (green) regions for given confidence levels $1-\alpha$ and means $\mu$. Each row shows results for the same $\mu$ and different confidence levels. Note that in some cases the entire order statistic simplex is feasible (there are no green pixels).

## 5 Bernoulli and Half-Bernoulli Distributions

In this section, we present a proof of our conjecture for Bernoulli distributions and for a generalization of Bernoullis that we refer to as half-Bernoulli distributions. While Bernoulli distributions have point masses on both 0 and 1, half-Bernoullis can have point masses at two positions: $k$ and 1, where $0 \le k < 1$. Thus, they are a generalization of Bernoullis that allows the lower value to be any non-negative value less than 1. Let $H_{k,\mu}$ be the half-Bernoulli distribution where the point masses are at $k$ and 1, and the mean is $\mu$. With $k$ and $\mu$ specified by $H_{k,\mu}$, the probability of sampling $k$ is given by

$$p_k = \frac{1-\mu}{1-k}. \tag{12}$$

Bernoullis (and half-Bernoullis) are prime candidates for distributions for which the bound will fail (violate (1)), since it is not uncommon to have a sample of all 0's (or all $k$'s), despite the distribution having a relatively large mean. For example, the probability of obtaining a sample of all 0's from a Bernoulli distribution with parameter $\mu$ is $(1-\mu)^n$. If the probability of getting this sample is greater than $\alpha$, the bound must produce a result greater than $\mu$ for this sample at the confidence level $1-\alpha$. As we demonstrate below, the bound holds for all Bernoulli and half-Bernoulli distributions.
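Sampling from a half-Bernoulli is straightforward; the following sketch (names are illustrative) draws from $H_{k,\mu}$ using (12) and confirms that the empirical mean matches $\mu$:

```python
import random

def half_bernoulli_sample(k, mu, rng):
    """Draw from H_{k,mu}: returns k with probability p_k = (1-mu)/(1-k)
    (eq. 12), and 1 otherwise, so the expected value is mu."""
    p_k = (1.0 - mu) / (1.0 - k)
    return k if rng.random() < p_k else 1.0

rng = random.Random(1)
k, mu = 0.25, 0.7   # here p_k = 0.3 / 0.75 = 0.4
xs = [half_bernoulli_sample(k, mu, rng) for _ in range(200_000)]
emp_mean = sum(xs) / len(xs)
```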

### 5.1 Finding Worst-Case Half-Bernoullis for $(n, \mu, 1-\alpha)$

Our approach will be to find "worst case" distributions among the set of half-Bernoulli distributions. By worst case, we mean that the probability of drawing a sample for which the bound fails is as high as possible. We will show that for the worst-case half-Bernoulli distributions (there can be more than one of these for each tuple $(n, \mu, 1-\alpha)$), the probability of drawing a sample in $\bar{Z}^\mu_\alpha$ is no more than $\alpha$. Since the bound holds for the worst-case half-Bernoullis, it must hold for all half-Bernoullis.

#### 5.1.1 Enumerating possible z’s

Consider a half-Bernoulli distribution with probability masses at $k$ and 1, and with mean $\mu$. For a given $n$, there are $n+1$ possible order statistic vectors, $z$:

$$[k, k, \ldots, k, k] \tag{13}$$
$$[k, k, \ldots, k, 1] \tag{14}$$
$$[k, k, \ldots, 1, 1] \tag{15}$$
$$\vdots \tag{16}$$
$$[k, 1, \ldots, 1, 1] \tag{17}$$
$$[1, 1, \ldots, 1, 1]. \tag{18}$$

Let $z_{j,n}$ be the sample with $j$ out of $n$ values equal to $k$, i.e., for $j \in \{0, 1, \ldots, n\}$:

$$z_{j,n} \stackrel{\text{def}}{=} [\underbrace{k, \ldots, k}_{j}, \underbrace{1, \ldots, 1}_{n-j}]. \tag{19}$$

#### 5.1.2 Monotonicity of mα(z)

Notice that $m_\alpha$ is monotonic in the following sense. For two samples $y$ and $z$:

$$\text{If } \forall i,\ y_i \le z_i \text{ then } m_\alpha(y) \le m_\alpha(z). \tag{20}$$

It follows from (20) that $m_\alpha(z_{j,n}) \le m_\alpha(z_{j',n})$ whenever $j \ge j'$.

#### 5.1.3 An expression for $\Pr(Z \in \bar{Z}^\mu_\alpha)$

For any half-Bernoulli distribution $H_{k,\mu}$, sample size $n$, and confidence level $1-\alpha$, let

$$j_{\min}(H_{k,\mu}, 1-\alpha, n) = \min\{j \in \{0, 1, \ldots, n\} : z_{j,n} \in \bar{Z}^\mu_\alpha\}, \tag{21}$$

where we define $j_{\min} = n+1$ if no such $j$ exists (i.e., the bound holds for every possible sample).

For example, if $z_{3,n} \in \bar{Z}^\mu_\alpha$ but $z_{2,n} \notin \bar{Z}^\mu_\alpha$, then $j_{\min} = 3$, where here and in the following the arguments of $j_{\min}$ are implicit. By the monotonicity of the bound (see (20)), all of the samples $z_{j,n}$ with $j \ge j_{\min}$ will be in $\bar{Z}^\mu_\alpha$.

For a given half-Bernoulli distribution $H_{k,\mu}$ we can now write an expression for the probability that $Z$ will not satisfy the bound:

$$\Pr_{H_{k,\mu}}(Z \in \bar{Z}^\mu_\alpha) = \sum_{i=j_{\min}}^{n} \Pr_{H_{k,\mu}}(Z = z_{i,n}) \tag{22}$$
$$= \sum_{i=j_{\min}}^{n} \operatorname{Binomial}(i; n, p_k) \tag{23}$$
$$= \beta_{\text{cdf}}(p_k; j_{\min}, n - j_{\min} + 1), \tag{24}$$

where $\beta_{\text{cdf}}(x; a, b)$ is the CDF of a beta distribution with parameters $a$ and $b$. The above derivation uses the property that the number of $k$'s in each $Z$ can be viewed as a sample from a binomial distribution, and in the last step we use a well-known identity that relates the sum of binomial probabilities to the CDF of a beta distribution.
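The identity used in the last step, $\Pr(\operatorname{Binomial}(n, p) \ge j) = \beta_{\text{cdf}}(p; j, n-j+1)$, can be checked numerically. In this sketch the beta CDF is approximated by direct midpoint-rule integration of its density (an illustration, not an efficient implementation):

```python
import math

def beta_cdf(x, a, b, steps=20_000):
    """Regularized incomplete beta I_x(a, b), approximated by midpoint-rule
    integration of the Beta(a, b) density (adequate for small integer a, b)."""
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    h = x / steps
    return sum(coef * ((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1) * h
               for i in range(steps))

def binomial_tail(j, n, p):
    """Pr(Binomial(n, p) >= j)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(j, n + 1))

n, j, p = 7, 3, 0.35
tail = binomial_tail(j, n, p)
beta = beta_cdf(p, j, n - j + 1)
```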

#### 5.1.4 Simplification of $m(z_{j,n})$ due to the simple structure of $z_{j,n}$

Before continuing with the derivation of the $p_k$ that maximizes the failure rate of the bound, we show how $m(z, U)$ simplifies for samples from half-Bernoulli distributions.

Recall that our bound is a quantile of the function $m(z, U)$ with respect to the uniform order statistic vector $U$. For samples of the form $z_{j,n}$, this function reduces to a particularly simple form:

$$m(z_{j,n}, U) = 1 - \sum_{i=1}^{n} U_i \big((z_{j,n})_{i+1} - (z_{j,n})_i\big) \tag{25}$$
$$= 1 - [0, \ldots, 0, 1-k, 0, \ldots, 0] \cdot U \tag{26}$$
$$= 1 - (1-k) U_j. \tag{27}$$

That is, with the exception of the $i = j$ term, all of the successive differences of $z_{j,n}$ are 0,¹ leaving us with a simple function of the $j$th order statistic, $U_j$. Later it will be useful to note that the $j$th order statistic of a sample of size $n$ from a uniform distribution is beta distributed with parameters $j$ and $n - j + 1$ (Casella and Berger, 2002, Example 5.4.5).

¹ We can ignore the case where $j = 0$, since the bound is trivial for $z_{0,n}$, a sample of all 1's.
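The beta-distribution property of uniform order statistics is also easy to check empirically; here we compare the sample mean of the $j$th order statistic against the Beta$(j, n-j+1)$ mean, $j/(n+1)$ (a quick simulation, with illustrative constants):

```python
import random

rng = random.Random(2)
n, j = 5, 2
trials = 100_000
acc = 0.0
for _ in range(trials):
    u = sorted(rng.random() for _ in range(n))
    acc += u[j - 1]          # j-th order statistic (1-indexed)
emp_mean = acc / trials
beta_mean = j / (n + 1)      # mean of Beta(j, n - j + 1)
```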

#### 5.1.5 Choosing $p_k$ to maximize the failure rate

For a fixed $j$, consider the set of half-Bernoulli distributions, $H_{k,\mu}$, with $j_{\min}(H_{k,\mu}, 1-\alpha, n) = j$. We are interested in the $H_{k,\mu}$ that maximizes $\Pr_{H_{k,\mu}}(Z \in \bar{Z}^\mu_\alpha)$. Since we are only considering half-Bernoulli distributions with a particular mean, $\mu$, the entire distribution is specified if $p_k$ is specified, and so we solve for the $p_k$ that maximizes $\Pr_{H_{k,\mu}}(Z \in \bar{Z}^\mu_\alpha)$:

$$\operatorname*{arg\,max}_{p_k : j_{\min} = j} \Pr_{H_{k,\mu}}(Z \in \bar{Z}^\mu_\alpha) \stackrel{(a)}{=} \operatorname*{arg\,max}_{p_k : j_{\min} = j} \beta_{\text{cdf}}(p_k; j, n - j + 1) \tag{28}$$
$$\stackrel{(b)}{=} \operatorname*{arg\,max}_{p_k : j_{\min} = j} p_k. \tag{29}$$

Step (a) follows from (24). Step (b) follows since all beta CDFs are monotonically increasing in their first argument. In other words, within the set of $H_{k,\mu}$ that have the same $\mu$ and $j_{\min}$, the failure rate of the bound (the probability that $Z \in \bar{Z}^\mu_\alpha$) is monotonic in $p_k$.

Although the failure rate is monotonic in $p_k$, this does not mean that the worst case is when $p_k = 1$, since this monotonicity result is restricted to the set of half-Bernoulli distributions with $j_{\min} = j$. We therefore now solve for the maximum $p_k$ such that $j_{\min} = j$ in order to obtain the half-Bernoulli distribution with mean $\mu$ that maximizes the failure rate:

$$\max\{p_k \in [0,1] : j_{\min} = j\} \tag{30}$$
$$= \max\{p_k \in [0,1] : \{z_{1,n}, z_{2,n}, \ldots, z_{j-1,n}\} \subseteq Z^\mu_\alpha,\ \{z_{j,n}, z_{j+1,n}, \ldots, z_{n,n}\} \subseteq \bar{Z}^\mu_\alpha\} \tag{32}$$
$$\stackrel{(a)}{=} \max\{p_k \in [0,1] : z_{j,n} \in \bar{Z}^\mu_\alpha\} \tag{33}$$
$$= \max\{p_k \in [0,1] : \Pr_U(m(z_{j,n}, U) < \mu) \ge 1-\alpha\} \tag{34}$$
$$\stackrel{(b)}{=} \max\{p_k \in [0,1] : \Pr_U(1 - (1-k)U_j < \mu) \ge 1-\alpha\} \tag{35}$$
$$= \max\left\{p_k \in [0,1] : \Pr_U\!\left(U_j > \tfrac{1-\mu}{1-k}\right) \ge 1-\alpha\right\} \tag{36}$$
$$= \max\{p_k \in [0,1] : \Pr_U(U_j > p_k) \ge 1-\alpha\} \tag{37}$$
$$= \max\{p_k \in [0,1] : 1 - \Pr_U(U_j \le p_k) \ge 1-\alpha\} \tag{38}$$
$$\stackrel{(c)}{=} \max\{p_k \in [0,1] : 1 - \beta_{\text{cdf}}(p_k; j, n-j+1) \ge 1-\alpha\} \tag{39}$$
$$= \max\{p_k \in [0,1] : \beta_{\text{cdf}}(p_k; j, n-j+1) \le \alpha\} \tag{40}$$
$$= \max\{p_k \in [0,1] : p_k \le \beta^{-1}_{\text{cdf}}(\alpha; j, n-j+1)\} \tag{41}$$
$$= \beta^{-1}_{\text{cdf}}(\alpha; j, n-j+1). \tag{42}$$

Step (a) follows from the monotonicity of the bound (Section 5.1.2). Step (b) uses the result of (27). Step (c) uses the fact that the $j$th order statistic of a uniform sample of size $n$ is beta distributed with parameters $j$ and $n - j + 1$.

#### 5.1.6 Bringing the pieces together

We have established that, for a given $j$, of all half-Bernoulli distributions with mean $\mu$ and $j_{\min} = j$, the one that maximizes the failure rate of the bound has $p_k = \beta^{-1}_{\text{cdf}}(\alpha; j, n-j+1)$. Plugging this into (24), we have that

$$\max_{H_{k,\mu} : j_{\min} = j} \Pr_{H_{k,\mu}}(Z \in \bar{Z}^\mu_\alpha) = \beta_{\text{cdf}}\big(\beta^{-1}_{\text{cdf}}(\alpha; j, n-j+1); j, n-j+1\big) \tag{43}$$
$$= \alpha. \tag{44}$$

Thus, we have seen that even when maximizing the probability that $Z \in \bar{Z}^\mu_\alpha$, i.e., maximizing the probability that $m_\alpha(Z) < \mu$, the probability of violation is at most $\alpha$, so the bound holds with probability at least $1-\alpha$. Since this is true for all half-Bernoulli distributions, for all values of $j$, and for arbitrary tuples $(n, \mu, 1-\alpha)$, it is true for all half-Bernoullis, all sample sizes, and all confidence levels.

## 6 Computing $m_\alpha$

In this section, we discuss two methods for computing our bound, $m_\alpha(z)$, for a particular sample $z$. The first is based upon a geometric analysis of the bound and the second uses a Monte Carlo sampling technique.

### 6.1 Geometric Computation of the Bound

Recall that the random vector $U$ represents the order statistics of a uniform sample, and hence lies in the order statistic simplex defined in Section 4.2. Figure 5 shows the order statistic simplex for $n = 2$. Note that the order statistic simplex of dimension $n$ has volume $1/n!$.

Let $s$, with $s_i = z_{i+1} - z_i$, be the spacings of the sample, so that $m(z, u) = 1 - u \cdot s$. Figure 5 shows an example of a spacings vector $s$ for $n = 2$.

Figure 5: The figure shows several quantities related to the geometric computation of our bound. The upper left triangle represents the order statistic simplex. The point s represents the spacings of the sample z. The pink region, which we define later to be ULMT, is a section of the order statistic simplex cut by the hyperplane l, which represents the set of vectors u for which u⋅s is greater than or equal to some value. The goal is to find the maximum such value, and thereby the minimum $\hat{\mu}$ such that the volume of the pink region is 100(1−α)% of the volume of the order statistic simplex.

Starting from the definition of our bound and expanding the definition of the quantile function, we have

$$m_\alpha(z) = \inf\{\hat{\mu} \in \mathbb{R} : \Pr(m(z, U) \le \hat{\mu}) \ge 1-\alpha\} \tag{45}$$
$$= \inf\{\hat{\mu} \in \mathbb{R} : n!\,\operatorname{Volume}(\{u : m(z, u) \le \hat{\mu}\}) \ge 1-\alpha\} \tag{46}$$
$$= \inf\{\hat{\mu} \in \mathbb{R} : n!\,\operatorname{Volume}(\{u : 1 - u \cdot s \le \hat{\mu}\}) \ge 1-\alpha\} \tag{47}$$
$$= \inf\{\hat{\mu} \in \mathbb{R} : n!\,\operatorname{Volume}(\{u : u \cdot s \ge 1 - \hat{\mu}\}) \ge 1-\alpha\}. \tag{48}$$

This final expression has a clear geometric interpretation. The set of points $u$ such that $u \cdot s$ is greater than or equal to some value is the upper right region of the order statistic simplex, depicted by the pink region in Figure 5. The bound is defined to be the least value of $\hat{\mu}$ such that the volume of the pink region is a fraction $1-\alpha$ of the simplex volume.

This value of $\hat{\mu}$ can be found by evaluating the volume of the section of the order statistic simplex above the hyperplane $l$, a hyperplane orthogonal to the spacings vector $s$. Thus, we seek the smallest value of $\hat{\mu}$ such that the volume of this section is $100(1-\alpha)\%$ of the volume of the order statistic simplex. This value of $\hat{\mu}$ is our bound.

Closed-form expressions for sections of simplexes cut by hyperplanes have been published by several authors, including Lasserre (2015). These expressions lead to efficient calculations of the bound in most cases. However, these formulas have singularities that cause problems for certain samples , such as samples with repeated values. Thus, we explore a more reliable Monte Carlo approach for computing our bound below.

### 6.2 Monte Carlo Estimate of the Bound

Since the bound is defined in terms of a quantile of a function that depends upon a uniform random vector, it is simple to develop a Monte Carlo estimate. This is provided in Algorithm 1.
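Algorithm 1 itself is not reproduced here; the following minimal Python sketch follows the definition in (4) directly, estimating the $(1-\alpha)$-quantile of $m(z, U)$ from sorted uniform samples (the function name and quantile-index convention are our own choices):

```python
import random

def m_alpha_mc(z, alpha, n_mc=100_000, rng=None):
    """Monte Carlo estimate of m_alpha(z) (eq. 4): the (1-alpha)-quantile of
    m(z, U) over sorted uniform vectors U.  z must be sorted, in [0, 1]."""
    rng = rng or random.Random(0)
    n = len(z)
    zz = list(z) + [1.0]                             # z_{n+1} = 1
    spacings = [zz[i + 1] - zz[i] for i in range(n)]
    means = []
    for _ in range(n_mc):
        u = sorted(rng.random() for _ in range(n))
        means.append(1.0 - sum(ui * si for ui, si in zip(u, spacings)))
    means.sort()
    return means[min(n_mc - 1, int((1.0 - alpha) * n_mc))]

# For z of all zeros, m(z, U) = 1 - U_(n), so m_alpha has the closed form
# 1 - alpha**(1/n), which gives a convenient sanity check.
est = m_alpha_mc([0.0] * 5, alpha=0.1)
```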

Algorithm 1 can be implemented more efficiently if $m_\alpha$ will be estimated multiple times for the same $n$, since the samples of $U$ can be computed and sorted a single time. Also, notice that the required number of Monte Carlo samples, $N_{\text{mc}}$, does not scale poorly for distributions with rare values, since we are estimating the $(1-\alpha)$-quantile of $m(z, U)$, which is robust to outliers (e.g., $\alpha = 0.5$ makes this the median, which is well known to be robust to outliers).

In practice, we find that a moderate number of Monte Carlo samples tends to provide a reasonable approximation of $m_\alpha(z)$ at common confidence levels. Note that (1) may not hold when using this Monte Carlo estimate of $m_\alpha(z)$ due to error in the estimate. In practice this can be remedied by increasing $N_{\text{mc}}$, or by incorporating high-probability bounds on the error in the Monte Carlo estimate into the bound. Also, note that as $\alpha$ decreases, $N_{\text{mc}}$ should be increased.

## 7 Related Work

In this section, we review other methods for computing high-confidence upper bounds on the mean of a random variable from samples (several of which we compare to in the subsequent numerical analysis section). Although some of these methods extend to more general settings (e.g., Hoeffding's inequality does not require identically distributed samples, and Anderson's inequality does not require a lower bound on the random variable), here we consider only the standard setting that we have discussed in this paper, wherein the samples are i.i.d. and the random variable always takes values in the interval $[0,1]$. We divide this section into two parts: prior methods that provide guaranteed coverage, and prior methods that do not. We present these prior methods as functions, $m^{\text{Hoeffding}}_\alpha$, $m^{\text{Anderson}}_\alpha$, etc., each of which provides an alternative to $m_\alpha$.

### 7.1 Prior Methods with Guaranteed Coverage

The methods presented in this subsection have guaranteed coverage in the setting that we have described: they satisfy (1) if used in place of $m_\alpha$.

Using Hoeffding's inequality (Hoeffding, 1963) to construct a high-confidence upper bound on $\mu$ is perhaps the best known, and simplest, prior method with guaranteed coverage:

$$m^{\text{Hoeffding}}_\alpha(x) \stackrel{\text{def}}{=} \bar{x} + \sqrt{\frac{\ln(1/\alpha)}{2n}}, \tag{49}$$

where $\bar{x}$ is the sample mean, i.e., $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. In cases where the variance of the random variable is significantly less than one, the upper bounds provided by Maurer and Pontil's empirical Bernstein bound (Maurer and Pontil, 2009) can be tighter than those produced by Hoeffding's inequality. This is achieved by leveraging not just the sample mean, $\bar{x}$, but also the sample variance, $\widehat{\operatorname{Var}}(x)$:

$$m^{\text{Maurer\&Pontil}}_\alpha(x) \stackrel{\text{def}}{=} \bar{x} + \sqrt{\frac{2\widehat{\operatorname{Var}}(x)\ln(2/\alpha)}{n}} + \frac{7\ln(2/\alpha)}{3(n-1)}. \tag{50}$$

Going one step further, Anderson's inequality provides high-confidence upper bounds on the mean by using the entire sample CDF (rather than only the sample mean and variance):

$$m^{\text{Anderson}}_\alpha(z) \stackrel{\text{def}}{=} m(z, u^{\text{DKW}}), \tag{51}$$

where, for $i \in \{1, \ldots, n\}$,

$$u^{\text{DKW}}_i \stackrel{\text{def}}{=} \max\left\{0,\ \frac{i}{n} - \sqrt{\frac{\ln(1/\alpha)}{2n}}\right\} \tag{52}$$

is a vector that Anderson derived from the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (Dvoretzky et al., 1956). Note that the form we present for Anderson's inequality uses the tight constant for the DKW inequality found by Massart (1990), which relies on the assumption that $\alpha \le 0.5$. This restriction is mild because high-confidence bounds are typically applied with small values of $\alpha$.
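The three bounds above take only a few lines each to implement. The following sketch (our own illustration; we take $\widehat{\operatorname{Var}}$ in (50) to be the unbiased sample variance) also checks that Anderson's bound never exceeds Hoeffding's on a random sample, as proved in Section 8:

```python
import math, random

def hoeffding_upper(x, alpha):
    """Eq. (49): sample mean plus Hoeffding's deviation term."""
    n = len(x)
    return sum(x) / n + math.sqrt(math.log(1 / alpha) / (2 * n))

def maurer_pontil_upper(x, alpha):
    """Eq. (50): empirical Bernstein bound using the sample variance."""
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / (n - 1)
    return (mean + math.sqrt(2 * var * math.log(2 / alpha) / n)
                 + 7 * math.log(2 / alpha) / (3 * (n - 1)))

def anderson_upper(z, alpha):
    """Eqs. (51)-(52): induced mean m(z, u^DKW) with the DKW envelope."""
    n = len(z)
    zz = sorted(z) + [1.0]                        # z_{n+1} = 1
    eps = math.sqrt(math.log(1 / alpha) / (2 * n))
    u = [max(0.0, (i + 1) / n - eps) for i in range(n)]
    return 1.0 - sum(u[i] * (zz[i + 1] - zz[i]) for i in range(n))

rng = random.Random(3)
x = sorted(rng.random() for _ in range(20))
alpha = 0.05
h, mp, a = hoeffding_upper(x, alpha), maurer_pontil_upper(x, alpha), anderson_upper(x, alpha)
```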

Following these three methods, several alternatives have been proposed, including other approaches that rely only on the sample mean (Chen, 2008), methods that extend Maurer and Pontil's empirical Bernstein bound to provide tighter bounds for random variables with long tails (Bubeck et al., 2012; Thomas et al., 2015a), and alternatives to Anderson's inequality that use different ways of defining inclusion envelopes for a distribution's CDF (Learned-Miller and DeStefano, 2008; Diouf and Dufour, 2005).

### 7.2 Prior Methods without Guaranteed Coverage

The methods presented in this subsection do not have guaranteed coverage in the setting that we have described, but they are often used to compute high-confidence upper bounds on the mean.

Perhaps the most common method for constructing high-confidence upper bounds is based on Student's $t$-statistic (Student, 1908):

$$m^{\text{Student}}_\alpha(x) \stackrel{\text{def}}{=} \bar{x} + \sqrt{\frac{\widehat{\operatorname{Var}}(x)}{n}}\, t_{1-\alpha, n-1}, \tag{53}$$

where $t_{1-\alpha, n-1}$ denotes the $100(1-\alpha)$ percentile of Student's $t$ distribution with $n-1$ degrees of freedom. We refer to this confidence interval as the Student-$t$ interval. If $X_i$ is normally distributed, then $m^{\text{Student}}_\alpha$ does provide guaranteed coverage. The central limit theorem implies that the distribution of the sample mean tends towards a normal distribution as $n$ increases, and so this method is often applied in scientific research when $n \ge 30$, even though this does not provide coverage guarantees.
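A sketch of the Student-$t$ bound follows. Since computing the $t$ critical value requires a table or a statistics library, it is passed in explicitly here; the value 1.833 for $t_{0.95, 9}$ is taken from standard tables (the data are illustrative):

```python
import math

def student_t_upper(x, t_crit):
    """Eq. (53): sample mean plus t_crit times the standard error of the mean.
    t_crit should be t_{1-alpha, n-1}, supplied by the caller."""
    n = len(x)
    mean = sum(x) / n
    var = sum((xi - mean) ** 2 for xi in x) / (n - 1)   # unbiased sample variance
    return mean + math.sqrt(var / n) * t_crit

# n = 10 illustrative draws in [0, 1]; t_{0.95, 9} is approximately 1.833.
x = [0.2, 0.4, 0.1, 0.5, 0.3, 0.6, 0.2, 0.4, 0.3, 0.5]
ub = student_t_upper(x, t_crit=1.833)
```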

Bootstrap methods tend to provide the tightest confidence intervals for the mean. However, this comes at a large cost: they do not have guaranteed coverage, even with normality assumptions. Despite concerns about their reliability, bootstrap methods remain in common use due to their tight confidence intervals and tendency to produce error rates roughly around $\alpha$ for many common distributions (Hanna et al., 2017; Thomas et al., 2015b). Since bootstrap methods are not easily expressed as closed-form alternatives to $m_\alpha$, we refer the reader to the work of Efron and Tibshirani (1993) for details on these approaches. The two that we focus on in our subsequent experiments are the most common, the percentile bootstrap, and one of the most sophisticated, the bias-corrected and accelerated (BCa) bootstrap.

One limitation of BCa, the more sophisticated bootstrap method, is that it is not defined in some cases (e.g., if all of the samples take the same value) and can encounter numerical issues in other cases. In our implementations, whenever numerical issues are detected, the method automatically reverts to the percentile bootstrap.

## 8 Theoretical Analysis

In this section, we provide an analytic comparison of our bound to two prior methods that provide guaranteed coverage: Hoeffding's inequality and Anderson's inequality. We will show that, for any sample, our high-confidence upper bound is never greater than that of Anderson's inequality, which in turn is never greater than that resulting from Hoeffding's inequality. We break this result into two components. First we show that for all $z$, $m_\alpha(z) \le m^{\text{Anderson}}_\alpha(z)$. Second, we show that for all $z$, $m^{\text{Anderson}}_\alpha(z) \le m^{\text{Hoeffding}}_\alpha(z)$, which implies that $m_\alpha(z) \le m^{\text{Hoeffding}}_\alpha(z)$, where the latter inequalities are strict if $z_1 > 0$.

### 8.1 Theoretical Comparison to Anderson’s Inequality

In this section we compare $m_\alpha$ to $m^{\text{Anderson}}_\alpha$.

###### Theorem 1.

For all possible values of $z$ and all $\alpha \in (0, 0.5]$,

$$m_\alpha(z) \le m^{\text{Anderson}}_\alpha(z). \tag{54}$$
###### Proof.

We present a sketch of the proof. Consider the diagram in Figure 6. This figure depicts, for $n = 2$, the space of possible vectors $u$, which are sorted uniform samples. The marked point represents $u^{\text{DKW}}$, defined in (52). The blue region denotes the set of vectors that are element-wise greater than $u^{\text{DKW}}$. It follows from the DKW inequality, with the tight constant found by Massart (1990), that the probability that $U$ is in the blue region is at least $1-\alpha$. The pink region is a set of $u$'s that result in the lowest induced means, $m(z, u)$, while ensuring that the probability that $U$ is in the pink region is precisely $1-\alpha$. Note that any point that is not contained within the pink region must represent a vector $u$ that results in an induced mean, $m(z, u)$, which is greater than the induced mean of any point in the pink region. Our bound is effectively the maximum over induced means of points in the pink region and Anderson's bound is the maximum over induced means of points in the blue region. Since the probability that $U$ is in the blue region cannot be less than the probability that $U$ is in the pink region, and the pink region contains the vectors that minimize the induced mean, our bound cannot be larger than Anderson's. ∎

Figure 6: Diagram comparing our bound and Anderson's.

### 8.2 Analytic Comparison to Hoeffding’s Inequality

In this section we prove the following theorem:

###### Theorem 2.

For all possible values of $z$ and all $\alpha \in (0, 0.5]$,

$$m^{\text{Anderson}}_\alpha(z) \le m^{\text{Hoeffding}}_\alpha(z), \tag{55}$$

where the inequality is strict if $z_1 > 0$.

###### Proof.

We begin with $m^{\text{Anderson}}_\alpha(z)$ and present a sequence of inequalities that conclude with $m^{\text{Hoeffding}}_\alpha(z)$, where the final inequality is strict if $z_1 > 0$:

$$m^{\text{Anderson}}_\alpha(z) = m(z, u^{\text{DKW}}) \tag{56}$$
$$= 1 - \sum_{i=1}^{n} (z_{i+1} - z_i)\, u^{\text{DKW}}_i \tag{57}$$
$$= 1 - \sum_{i=1}^{n} (z_{i+1} - z_i) \max\left\{0,\ \frac{i}{n} - \sqrt{\frac{\ln(1/\alpha)}{2n}}\right\} \tag{58}$$
$$\le 1 - \sum_{i=1}^{n} (z_{i+1} - z_i)\left(\frac{i}{n} - \sqrt{\frac{\ln(1/\alpha)}{2n}}\right). \tag{59}$$

The inequality in (59) holds because, for any $i$ such that $i/n - \sqrt{\ln(1/\alpha)/(2n)} < 0$, we have

$$\max\left\{0,\ \frac{i}{n} - \sqrt{\frac{\ln(1/\alpha)}{2n}}\right\} > \frac{i}{n} - \sqrt{\frac{\ln(1/\alpha)}{2n}}. \tag{60}$$

Continuing, we have:

 mAndersonα(z)≤ 1−n∑i=1(zi+1−zi)in+n∑i=1(zi+1−zi)√ln(1/α)2n (61) = 1+1n(n∑i=1izi−n∑i=1izi+1)+(zn+1−z1)√ln(1/α)2n (62) = 1+1n(n∑i=1izi−(n∑i=2(i−1)zi)−n)+(1−z1)√ln(1/α)2n (63) = 1nn∑i=1zi+(1−z1)√ln(1/α)2n (64) ≤ 1nn∑i=1zi+√ln(1/α)2n (65) = mHoeffdingα(z). (66)

Notice that (64) provides an expression similar to Hoeffding's inequality, but where the lower bound on the random variable (in our case, zero) is replaced by the smallest observed sample, $z_1$. This presents a tighter variant of Hoeffding's inequality that holds when the random variables are bounded in $[0,1]$ and are i.i.d. (the general form of Hoeffding's inequality holds for independent random variables that are not necessarily identically distributed). ∎
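To make the chain of inequalities concrete, the following sketch evaluates both sides numerically. The function names (`anderson_upper`, `hoeffding_upper`) are ours, and the code assumes samples in $[0,1]$ with the convention $z_{n+1} = 1$ used above.

```python
import math

def anderson_upper(z, alpha):
    """Anderson-style upper bound m^Anderson_alpha(z) for samples z in [0, 1].

    Uses u_i = max(0, i/n - sqrt(ln(1/alpha)/(2n))), the one-sided DKW band
    with Massart's tight constant, and the convention z_{n+1} = 1.
    """
    n = len(z)
    z = sorted(z) + [1.0]                     # z[0..n-1] sorted, z[n] plays z_{n+1} = 1
    shift = math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    total = 0.0
    for i in range(1, n + 1):                 # sum of (z_{i+1} - z_i) * u_i
        u_i = max(0.0, i / n - shift)
        total += (z[i] - z[i - 1]) * u_i
    return 1.0 - total

def hoeffding_upper(z, alpha):
    """Hoeffding's upper bound: sample mean plus sqrt(ln(1/alpha)/(2n))."""
    n = len(z)
    return sum(z) / n + math.sqrt(math.log(1.0 / alpha) / (2.0 * n))

# Example: per Theorem 2, the Anderson-style bound is never larger.
z = [0.1, 0.2, 0.3, 0.4]
a_bound = anderson_upper(z, alpha=0.05)
h_bound = hoeffding_upper(z, alpha=0.05)
```

On this example the max in (58) is active for the two smallest indices, which is exactly the source of slack that the proof exploits.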

It then follows from Theorem 2 that our bound is always at least as tight as Hoeffding's inequality, and is strictly tighter if $\alpha < e^{-2/n}$:

###### Corollary 1.

For all possible values of $z$ and all $\alpha \in (0,1)$,

$$m_\alpha(z) \leq m^{\text{Hoeffding}}_\alpha(z), \qquad (67)$$

where the inequality is strict if $\alpha < e^{-2/n}$.

###### Proof.

This follows immediately from Theorems 1 and 2. ∎

## 9 Numerical Analysis

In this section we present results from a numerical analysis of our bound. These empirical results aim to answer the following research questions:

1. For a variety of distributions, confidence levels, and numbers of samples, are results consistent with (1)?

2. For a variety of distributions that resemble common use-cases, how do the confidence intervals produced by our bound compare to those of previous methods that have guaranteed coverage (i.e., those that satisfy (1))?

3. This question is the same as RQ2, but for methods that do not have guaranteed coverage.

4. Can our bound provide confidence intervals that are practical for scientific experiments with fewer than 30 samples?

### 9.1 Numerical Studies on Guaranteed Coverage

In this subsection we study RQ1 with experiments whose results are consistent with the conjecture that $m_\alpha(z)$, as we have defined it, has guaranteed coverage (satisfies (1)). Although these experiments show that (1) appears to hold for a variety of settings, this does not imply that there are no settings under which (1) fails to hold.

To study RQ1, we selected a variety of distributions (uniform, beta, and Bernoulli, each with various parameters), confidence levels $1-\alpha$, and numbers of samples, $n$. For each such (distribution, $\alpha$, $n$) tuple, we collected many samples of $z$, computed $m_\alpha(z)$, and checked whether $m_\alpha(z) \geq \mu$, where $\mu$ is the true mean. From these tests, we estimated the coverage, i.e., the probability that the bound holds: we divided the number of samples of $z$ for which $m_\alpha(z) \geq \mu$ by the total number of samples collected.
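The coverage-estimation procedure can be sketched as follows, here auditing Hoeffding's inequality. The helper names, the Beta(2,2) test distribution, and the trial count are illustrative choices, not the paper's exact experimental setup.

```python
import math
import random

def hoeffding_upper(samples, alpha):
    """High-confidence upper bound from Hoeffding's inequality, samples in [0, 1]."""
    n = len(samples)
    return sum(samples) / n + math.sqrt(math.log(1.0 / alpha) / (2.0 * n))

def estimate_coverage(draw, true_mean, alpha, n, trials, seed=0):
    """Fraction of trials in which the upper bound covers the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        z = [draw(rng) for _ in range(n)]
        if hoeffding_upper(z, alpha) >= true_mean:
            hits += 1
    return hits / trials

# Beta(2, 2) has mean 0.5; guaranteed coverage means the estimate
# should stay at or above 1 - alpha = 0.95.
cov = estimate_coverage(lambda r: r.betavariate(2, 2), true_mean=0.5,
                        alpha=0.05, n=30, trials=2000)
```

For Hoeffding's inequality the estimated coverage typically sits far above $1-\alpha$, which is exactly the conservatism discussed below.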

Although our goal here is to study RQ1, to facilitate the interpretation of the presented results, we also provide results using two prior methods that exhibit the different types of behavior that our bound might produce. This comparison also provides some insight into RQ2 (via the comparison to Hoeffding's inequality) and RQ3 (via the comparison to the Student-$t$ interval). However, note that subsequent experiments study these two research questions further.

First consider Figure 7, which presents results using Hoeffding's inequality. The top left plot shows the coverage for a variety of beta distributions, with $n$ fixed. To interpret this plot, consider a curve at a given position, $1-\alpha$, on the horizontal axis. This position indicates that we requested an upper bound that holds with probability at least $1-\alpha$. Since, at this horizontal position, the estimated coverage (red curve) lies above $1-\alpha$ (blue curve), the upper bound produced by Hoeffding's inequality held with probability greater than $1-\alpha$. Hence, a curve remaining above the blue line indicates that the desired confidence level was achieved. However, notice that the curve is far above the blue line; this indicates that Hoeffding's inequality was overly conservative. It provided a high-confidence upper bound that was greater than or equal to the mean far more often than requested. This means that for this distribution the confidence interval provided by Hoeffding's inequality is not tight, and could be improved.

The three other plots in Figure 7 are similar, but use different parameters. The top row presents results for beta distributions, while the bottom row presents results for Bernoulli distributions. The left column presents results as the parameters of the distributions are varied, while the right column presents results as the number of samples is varied. Overall, Figure 7 shows the behavior that we would expect of a bound that has guaranteed coverage, but which provides loose high-confidence upper bounds.

Figure 7: Estimated probability that the high-confidence upper bound produced using Hoeffding's inequality is greater than or equal to the true mean.

Now consider Figure 8, which is identical to Figure 7, except that it uses the Student-$t$ interval instead of Hoeffding's inequality. These plots show very different behavior: the curves tend to be much lower, indicating tighter confidence intervals around the sample mean. However, the curves often cross the blue line, indicating that in these settings the Student-$t$ interval does not provide guaranteed coverage: if you ask for an upper bound that holds with probability $1-\alpha$, you may receive one that holds with probability less than $1-\alpha$. Hence, Figure 8 shows the behavior that we would expect of a bound that does not have guaranteed coverage, but which provides tight “high-confidence” upper bounds.

Figure 8: Estimated probability that the high-confidence upper bound produced using the Student-$t$ interval is greater than or equal to the true mean.

The desired behavior of a high-confidence upper bound would blend the desirable properties of Hoeffding's inequality and the Student-$t$ interval. In these plots, this would result in curves that always remain above the blue curve (guaranteed coverage), but are otherwise as low as possible (tight). Figure 9 presents the results of this same experiment, conducted using our bound. It achieves this desired behavior: it always remains above the blue line (consistent with guaranteed coverage), but tends to be significantly lower than Hoeffding's inequality.

Figure 9: Estimated probability that the high-confidence upper bound produced using our bound is greater than or equal to the true mean.

### 9.2 Numerical Comparison to Previous Methods

In this subsection we focus on RQ2 and RQ3 with experiments that compare the tightness of our bound to that of previous methods (both with and without guaranteed coverage). A variety of statistics can be used to capture how tight the high-confidence upper bounds produced by a method are, including the mean upper bound, i.e., $\mathbf{E}[m_\alpha(z)]$, and the median upper bound. Here we report the mean upper bound: we repeatedly gather samples of $z$ from a distribution, compute the upper bounds produced by our bound and several previous methods, and report the sample mean of the upper bounds for each method. For simplicity, here we vary the distribution and $n$ but fix $\alpha$ to obtain $(1-\alpha)$-confidence upper bounds.

Figure 10: These plots depict the mean upper bounds (over 1,000 trials) for various distributions (the titles on the plots describe the distribution) and using various methods; all plots share a common legend.

First compare the blue curve (our bound) to the black curves (previous methods with guaranteed coverage), noting that the horizontal axis uses a logarithmic scale. In every case, the blue curve remains strictly below the black curves, indicating that in every setting our bound produces lower values on average. Notice that our bound frequently obtains mean upper bounds that previous methods require an order of magnitude more samples to achieve, indicating that our bound is a drastic improvement in tightness and/or data efficiency.

Next, compare the blue curve (our bound) to the red curves (previous methods that do not have guaranteed coverage). The two bootstrap methods do not provide guaranteed coverage, even with normality assumptions. So, although they produce tight confidence intervals (as is evident in these plots), the high-confidence bounds that they produce cannot be relied upon.
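For reference, a percentile-bootstrap upper bound of the kind compared against here can be sketched as follows. This is the standard construction; the paper's exact bootstrap variants may differ.

```python
import random

def bootstrap_percentile_upper(samples, alpha, resamples=2000, seed=0):
    """Percentile-bootstrap upper confidence bound on the mean.

    Resamples with replacement, recomputes the mean each time, and returns
    the (1 - alpha) empirical quantile of the resampled means.  There is no
    coverage guarantee: the quantile is estimated from the data themselves.
    """
    rng = random.Random(seed)
    n = len(samples)
    means = sorted(
        sum(rng.choice(samples) for _ in range(n)) / n
        for _ in range(resamples)
    )
    index = min(resamples - 1, int((1.0 - alpha) * resamples))
    return means[index]
```

Because the bound is derived entirely from the empirical distribution, it is tight but can undercover for small $n$ or skewed distributions, which is the failure mode visible in the red curves.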

Next consider the Student-$t$ interval: for the uniform distribution, it produces high-confidence upper bounds that are similar to those produced by our bound. When the Student-$t$ interval is computed from normally distributed data, it produces a tighter high-confidence upper bound than our bound. However, when the sampling distribution is right-skewed, the Student-$t$ interval tends to be overly optimistic: it does not have guaranteed coverage (as is evident in Figure 8). Hence, although the upper bounds of the Student-$t$ interval tend to be lower than those produced by our method for such distributions, they cannot be relied upon. On the other hand, when the sampling distribution is left-skewed, the Student-$t$ interval is overly conservative (like Hoeffding's inequality), and the corresponding plots in Figure 10 indicate that our bound is tighter than the Student-$t$ interval. This is further evidence that our bound combines the desirable properties of Hoeffding's inequality and the Student-$t$ interval: it roughly preserves the tightness of the Student-$t$ interval, except in the cases where the Student-$t$ interval is too tight to provide guaranteed coverage (in which case our bound is sufficiently looser to provide guaranteed coverage).

### 9.3 Numerical Support for Practical Use with n<30

Of the many potential uses of our bound, one stands out: it provides a valid method for constructing confidence intervals for scientific studies with fewer than 30 samples. Even though the Student-$t$ interval does not have guaranteed coverage when the sampling distribution is not normal, the central limit theorem tells us that the distribution of the sample mean tends towards a normal distribution as $n$ increases. Hence, the Student-$t$ interval becomes reasonable when $n$ is large. (Notice that even with arbitrarily large $n$, the Student-$t$ interval may not have guaranteed coverage, so here saying that the Student-$t$ interval is “reasonable” does not mean that it has guaranteed coverage.) However, without knowing the sampling distribution, it is not clear how large $n$ must be for the Student-$t$ interval to be reasonable. A common rule of thumb used in current scientific research is that $n$ must be at least 30.

This raises the question: what should one do when fewer than 30 samples are available? Our bound provides an answer (assuming our conjecture is true), as it provides confidence intervals of comparable tightness, but with guaranteed coverage for any $n$ and without any normality assumptions. The only requirement is the ability to identify limits on the support of the distribution. To answer RQ4, we present an experiment that shows how our bound can be used to obtain confidence intervals based on fewer than 30 empirical measurements.
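Concretely, any bound defined for samples in $[0,1]$ extends to a known support $[a,b]$ by affine rescaling. The sketch below uses Hoeffding's bound as a stand-in for our bound (which would be applied the same way); the age values are made up for illustration.

```python
import math

def hoeffding_upper01(samples01, alpha):
    """High-confidence upper bound for samples in [0, 1] (a stand-in here
    for our bound, which would be substituted in the same place)."""
    n = len(samples01)
    return sum(samples01) / n + math.sqrt(math.log(1.0 / alpha) / (2.0 * n))

def upper_bound_on_support(samples, a, b, alpha):
    """Rescale samples from known support [a, b] to [0, 1], bound the mean
    there, and map the bound back to the original scale."""
    scaled = [(x - a) / (b - a) for x in samples]
    return a + (b - a) * hoeffding_upper01(scaled, alpha)

# Ages bounded in [0, 84], as in the census experiment (values illustrative).
ages = [23, 35, 4, 61, 47, 18, 72, 9, 30, 55]
u = upper_bound_on_support(ages, 0.0, 84.0, alpha=0.05)
```

The affine map preserves coverage because the event "bound covers the mean" is unchanged by rescaling.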

Specifically, we used data from the United States Census from the year 2000 to obtain an estimate of the distribution of people's ages, considering only people zero to 84 years old. We then consider the problem of obtaining a tight high-confidence upper bound on the mean age of people ages zero to 84 based on a small number of samples. The results of this experiment are presented in Figure 11, which has a similar form to Figure 10 (but without the logarithmic horizontal axis). The key observations from this plot are: 1) previous methods with guaranteed coverage are too loose to provide useful high-confidence bounds with so few samples, 2) the Student-$t$ interval is sufficiently tight, but it cannot be applied responsibly with such a small $n$, and 3) our bound produces high-confidence bounds that are comparable to the Student-$t$ interval (while maintaining guaranteed coverage, if our conjecture holds).

Figure 11: The mean upper bounds (over 1,000 trials) produced by various methods when $n$ is varied and the sampling distribution is an approximation of the distribution of ages (bounded in [0, 84]) in the United States in the year 2000.

## 10 Acknowledgements

This work benefitted significantly from conversations with Vince Lysinski and George Bissias, as well as from discussions with Andrew McGregor, Don Towsley, Berthold Horn, Archan Ray, Justin Domke, Gary Huang, Dan Sheldon, Luc Rey-Bellet, and Markos Katsoulakis. Some of this work grew out of early efforts to improve Anderson’s bound by Benjamin Mears while at the University of Massachusetts, Amherst.

## References

• Anderson (1969) T. W. Anderson. Confidence limits for the value of an arbitrary bounded random variable with a continuous distribution function. Bulletin of the International Statistical Institute, 43:249–251, 1969.
• Bubeck et al. (2012) S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. arXiv preprint arXiv:1209.1727, 2012.
• Casella and Berger (2002) G. Casella and R. L. Berger. Statistical Inference, volume 2. Duxbury, Pacific Grove, CA, 2002.
• Chen (2008) X. Chen. Confidence interval for the mean of a bounded random variable and its applications in point estimation. arXiv preprint arXiv:0802.3458, 2008.
• Diouf and Dufour (2005) M. A. Diouf and J. M. Dufour. Improved nonparametric inference for the mean of a bounded random variable with application to poverty measures. 2005.
• Dvoretzky et al. (1956) A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of a sample distribution function and of the classical multinomial estimator. Annals of Mathematical Statistics, 27:642–669, 1956.
• Efron and Tibshirani (1993) B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, London, 1993.
• Hanna et al. (2017) J. P. Hanna, P. Stone, and S. Niekum. Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, pages 538–546. International Foundation for Autonomous Agents and Multiagent Systems, 2017.
• Hoeffding (1963) W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
• Lasserre (2015) J. B. Lasserre. Volume of slices and sections of the simplex in closed form. Optimization Letters, 9(7):1263–1269, 2015.
• Learned-Miller and DeStefano (2008) E. Learned-Miller and J. DeStefano. A probabilistic upper bound on differential entropy. IEEE Transactions on Information Theory, 54(11):5223–5230, 2008.
• Massart (1990) P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, 1990.
• Maurer and Pontil (2009) A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings of the Twenty-Second Annual Conference on Learning Theory, pages 115–124, 2009.
• Student (1908) Student. The probable error of a mean. Biometrika, pages 1–25, 1908.
• Thomas et al. (2015a) P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015a.
• Thomas et al. (2015b) P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, 2015b.