Log In Sign Up

A Causal Bootstrap

The bootstrap, introduced by Efron (1982), has become a very popular method for estimating variances and constructing confidence intervals. A key insight is that one can approximate the properties of estimators by using the empirical distribution function of the sample as an approximation for the true distribution function. This approach views the uncertainty in the estimator as coming exclusively from sampling uncertainty. We argue that for causal estimands the uncertainty arises entirely, or partially, from a different source, corresponding to the stochastic nature of the treatment received. We develop a bootstrap procedure that accounts for this uncertainty, and compare its properties to that of the classical bootstrap.


page 1

page 2

page 3

page 4


Asymptotics of the Empirical Bootstrap Method Beyond Asymptotic Normality

One of the most commonly used methods for forming confidence intervals f...

Neural Bootstrapper

Bootstrapping has been a primary tool for uncertainty quantification, an...

Network resampling for estimating uncertainty

With network data becoming ubiquitous in many applications, many models ...

A Comparison of the Delta Method and the Bootstrap in Deep Learning Classification

We validate the recently introduced deep learning classification adapted...

Differentially Private Bootstrap: New Privacy Analysis and Inference Strategies

Differential private (DP) mechanisms protect individual-level informatio...

Multiplier bootstrap for Bures-Wasserstein barycenters

Bures-Wasserstein barycenter is a popular and promising tool in analysis...

1. Introduction

The bootstrap, introduced by Efron (1982)

, has become a very popular method for constructing hypothesis tests or confidence intervals. This popularity stems in part from the fact that it provides approximations to the distribution of an estimator or statistic that are in certain cases superior to those obtained from using a Gaussian asymptotic approximation together with estimated standard errors (asymptotic refinement). While the classical bootstrap is designed to approximate distributions that result from repeated sampling from a large population, this paper shows how to adapt the bootstrap principle when the estimand of interest is a causal parameter, and the data is generated by a randomized experiment.

Permutation tests, such as Fisher’s exact test (see e.g. Imbens and Rubin (2015)), can yield exact p-values under the auxiliary hypothesis that treatment effects are constant across units, however we argue below that those methods are not suitable for forming confidence intervals for parameters describing the distribution of causal effects in a given population. For the average treatment effect, causal standard errors have been proposed by Aronow, Green, and Lee (2014) as well as Abadie, Athey, Imbens, and Wooldridge (2017). These methods impose no restrictions on treatment effect heterogeneity but their use generally relies on a Gaussian limiting approximation. We propose a bootstrap approach to causal inference which also does not restrict treatment heterogeneity, but improves on the Gaussian asymptotic approximation in samples of small or moderate size.

Using the potential outcome framework, e.g., Imbens and Rubin (2015)

, we are interested in the average causal effect of a binary variable

(the “treatment”) on an outcome variable whose potential outcomes we denote with , for a population of units . Implicitly we assume that the potential outcomes for unit do not vary with the treatment status assigned to other units, known as the Stable Unit Treatment Value Assumption (SUTVA, Rubin (1978)). For all units in the population we observe the treatment and the realized outcome . One common estimand is the average effect for the units in the population:


We assume that the data arise from a completely randomized experiment, where units are selected at random from the population as experimental subjects, of which units are then randomly assigned to receive the treatment, and the remaining units are assigned to the control group. We let denote an indicator whether the th unit is included in the sample.

Specifically, for and we have

1.1. Sampling Uncertainty and Design Uncertainty

We wish to distinguish between two types of uncertainty in estimators, sampling uncertainty arising from the stochastic nature of , and design uncertainty arising from the stochastic nature of .

To characterize sampling uncertainty we postulate the existence of a large, possibly infinite, population. We draw a random sample from this population, and observe for each unit in this sample a set of values, say, a pair . We may be interested in the difference between the population averages of for the subpopulations with and . We can estimate this object using the difference in average outcomes by values in the sample. This estimator differs from the target because we do not observe all units in the population. Had we drawn a different random sample, with different units, the value of the estimator would have been different. See Table 1, where is the sampling indicator, equal to 1 for sampled units and 0 otherwise. The uncertainty arising from the randomness in is captured by the conventional standard error.

Actual Alternative Alternative
Unit Sample Sample I Sample II
1 0 0
0 0 0
0 1 1
0 1 0
1 0 0
Table 1. : Sampling-based Uncertainty ( is observed, is missing)

In a randomized experiment the uncertainty is not necessarily of this sampling variety. Instead we can think of the uncertainty arising from the stochastic nature of the assignment, . For units with we observe , and for units with we observe . In our sample units have a particular set of assignments. In a repeated sampling thought experiment the units in the sample would have remained the same, but their assignments would might been different, leading to a different value for the estimator. See Table 2.

Actual Alternative Alternative
Unit Sample Sample I Sample II
1 1 ? 0
0 0 ? 0
0 1 ? 1
0 0 ? 1
1 0 ? 0
Table 2. : Design-based Uncertainty ( is observed, is missing)

1.2. Notation

In the following, we denote the distribution of potential outcomes in the population with and the size of that population with . The distribution in the sample of size is denoted with . is not quite the empirical distribution because we only observe one of the values in each pair . We use superscripts throughout to indicate population quantities, and superscripts to denote their sample analogs. The number of treated units in the sample is denoted with , the number of control units with , and the respective shares of treated and control units with , so that . We also define the empirical c.d.f. for either potential outcome given the randomized treatment as and .

2. The Causal Bootstrap for Average Treatment Effects

In this section we consider causal bootstrap inference for the population average treatment effect defined in (1.1). The estimator we use is the difference in sample averages by treatment status:


The repeated sampling perspective we take is one where the potential outcomes are fixed for all units in the population. The stochastic properties of the estimator arise from the stochastic nature of the assignment and sampling, which are both sources of randomness in the average of realized outcomes by treatment status, where we regard as fixed.

2.1. The True Variance of the Estimator for the Average Treatment Effect

Here we present the true variance of the estimator under random assignment of the treatment. From the experimental subjects, are selected at random to receive the active treatment, and the remainder are assigned to the control group. Define


Then the exact variance of , over the randomization distribution, is

See, for example, Neyman (1923,1990), Aronow, Green, and Lee (2014), Ding (2017), and Abadie, Athey, Imbens, and Wooldridge (2017).

2.2. An Analytical Variance Estimator


Then the standard variance estimator is

This estimator ignores the third term in the variance, which is negative, so in general overestimates the true variance. It is possible to give sharp bounds for given the respective marginal distributions of and . Aronow, Green, and Lee (2014) proposed a consistent estimator for the resulting bounds on that can be expressed as

where is an estimator of the sharp lower bound for .111Such an estimator is

where is a consistent estimator for the upper bound for , see equation (8) of their paper.

2.3. The Classical Bootstrap

The classical bootstrap corresponds to the case where the uncertainty is purely sampling uncertainty. The bootstrap approximates the cumulative distribution function of the pairs

, in the population by the empirical distribution function , where

It then calculates properties of the estimator given that approximate distribution

. One can interpret the standard bootstrap as imputing all the missing values of

in the population by replicates of the observed values, and thus constructing an artificial population from which we then draw random samples. This perspective is helpful to contrast the different approach underlying the causal bootstrap.

2.4. The Causal Bootstrap

Here we initially take the perspective that the uncertainty is solely arising from the stochastic nature of the assignment, as in Table 2. In the spirit of the above interpretation of the standard bootstrap, we use the observed data to impute all the missing values in the population. Then we simulate the estimator using this partly imputed population.

The difference with the standard bootstrap is in the nature of the missing data process, and how we impute them. Consider unit 1 in Table 2. In the actual sample this unit receives the active treatment, and so we observe , but we do not know the value of the control outcome for this unit, .

A natural approach is to impute the missing value of using one of the observed values for , that is, one of the realized values of for control units. The question is which one to use. It turns out that it matters how we choose to impute the missing values from the observed values. This issue is related to the term in the true variance of the estimator for the average treatment effect, the term that is not consistently estimable, and which we typically ignore in practice.

To frame this question, it is useful to start with the joint distribution function of the pairs of potential outcomes in the population,

The average treatment effect, and any other causal parameters of interest, can be written as a functional of this distribution,

Given , the assignment mechanism completely determines the distribution of any estimator, for example the difference in averages by treatment status, . This is similar to the way in which in the sampling case knowledge of the joint population distribution allows us to deduce the properties of any estimator.

The problem, and the main difference with the sampling case is that for each unit in the population, at most one of the two potential outcomes and is observed so that there is no consistent estimator for : In general, the joint distribution of potential values can be written as

where the copula is a function that is nondecreasing in either argument for each value of . By Sklar’s theorem (e.g. stated as Theorem 2.3.3 in Nelsen (2006)), such a copula exists even though it need not be unique unless the marginal distributions are continuous. In the following, we let

denote the set of all possible copulae.

It is important to note that although the marginal distributions can be estimated consistently from a completely randomized experiment as sample size grows, the data on realized treatments and outcomes impose no empirical restrictions on the copula for the joint distribution of . Hence, neither the parameter nor the distribution of an estimator need in general not be point-identified.

In the spirit of the variance estimator in Aronow, Green, and Lee (2014), we address this challenge by simulating the distribution of using an estimator for the population distribution that is conservative with respect to the copula in a sense to be made more precise below. To illustrate the broader conceptual idea, consider an estimator

for a general functional of the distribution of potential values. Under regularity conditions,222See e.g. Bloznelis and Götze (2001) for regularity conditions for finite-population expansions of this type. such an estimator admits a stochastic expansion of the form

where . The first-order “bias” term

and the scale parameter

are deterministic functions of the unknown distribution , and the limit for the asymptotic variance is taken as and grow large. The second-order approximation error

is a tight random variable whose distribution also depends on


If the functional is not point-identified, then may take values in a set whose bounds may be characterized in terms of the marginal distributions . Specifically, given the marginal distributions we have sharp bounds of the form

that are generally available, see e.g. Heckman, Smith, and Clements (1997) and Manski (1997). Similarly we can form bounds for the variance,

For a given inference problem, the bootstrap has to estimate these quantities conservatively with respect to the unknown copula , which can be done iteratively as follows: we first need to determine which couplings attain the value of which is least favorable for the inference problem at hand. Within the (not necessarily singleton) set of such couplings, we then determine the least-favorable value of for . We can apply this principle recursively either until the resulting set contains a unique copula, or until we reach the order of approximation desired for formal results regarding the bootstrap procedure. This results in an estimate for the population distribution that is conservative regarding the inference task at hand. The causal bootstrap then approximates the distribution of by sampling and randomization from a population using the known sampling and assignment mechanism.

2.5. Least Favorable Coupling for the Average Treatment Effect

In this paper, we consider the special case of two-sided confidence intervals based on a t-ratio for the sample average treatment effect. The case of the average treatment effect has been the main focus of the previous literature. It is a special case for our problem in that the copula does not matter for estimation - by inspection, the functional

does not depend on the copula, and the default estimator

is known to be unbiased for under any coupling so that for each . In order to ensure that the estimand is well-defined and satisfies other regularity conditions for the bootstrap, we make the following assumptions:

Assumption 2.1.

The first four moments of

and are bounded.

Also for a two-sided confidence interval constructed from inverting a t-test based on

, the least favorable coupling must attain the upper bound for the asymptotic variance,

We next show that is uniquely maximized at the joint distribution corresponding to the isotone assignment which matches values of to values of while preserving their respective marginal distributions. More formally, the joint distribution of the potential outcomes under the isotone coupling is characterized by the copula

We find that the upper bound on the variance is in fact uniquely attained at the isotone coupling. Therefore an estimator for the distribution of which assumes the isotone coupling is asymptotically conservative at any order of approximation.

Proposition 2.1.

(Least Favorable Coupling for the ATE). Suppose that Assumption 2.1 holds. Then, given the marginal distributions , the variance bound is uniquely attained at

where is the joint distribution corresponding to the isotone coupling.

The fact that the variance bound is attained at the isotone coupling is widely known (see e.g. Becker (1973), Fan and Park (2010), Stoye (2010), and Aronow, Green, and Lee (2014)), for expositional purposes we provide a proof in the appendix. We establish the slightly stronger conclusion that the distribution under the isotone coupling is in fact maximal with respect to second-order stochastic dominance. For our approach it is also important to establish that this maximum is unique in the sense that the joint distribution resulting from any other coupling yields a variance that is strictly lower than . In particular, for confidence intervals based on the Gaussian asymptotic distribution, the isotone coupling does indeed constitute the least favorable coupling.

We also want to stress that there are other causal estimands of interest (including the distribution of treatment effects and its quantile) for which the isotone assignment is not the least favorable coupling (see e.g.

Heckman, Smith, and Clements (1997), Fan and Wu (2010), Fan and Park (2010), Lu, Ding, and Dasgupta (2018)). Cambanis, Simmons, and Stout (1976) give conditions under which the isotone assignment does in fact constitute the least favorable bound for a functional of the joint distribution.

2.6. Related Literature

Worst case bounds on the distributions of potential outcomes and treatment effects and their quantiles have been analyzed by Heckman, Smith, and Clements (1997), Manski (1997), Firpo and Ridder (2008), Fan and Park (2010), Fan and Park (2012), Fan and Wu (2010), and Lu, Ding, and Dasgupta (2018). This literature uses theoretical results on dependency bounds for functions of several random variables which were developed among others by Cambanis, Simmons, and Stout (1976), Makarov (1982), Frank, Nelsen, and Schweizer (1987), and Williamson and Downs (1990). Stoye (2010) establishes that a class of spread parameters is monotone with respect to conventional stochastic orders of distribution, and shows how to derive parameter bounds for causal inference. Several of these studies also propose inference procedures that account for sampling uncertainty rather than randomization error. In contrast, for our problem we need to explicitly construct the respective couplings that achieve the lower and upper bounds to the parameter, and in addition the largest randomization variance for an estimator of either bound.

Robins (1988) proposes a confidence interval for a causal parameter based on the least-favorable coupling for a binary outcome variable. Aronow, Green, and Lee (2014) propose an estimator of the sharp upper bound for the randomization variance of the average treatment effect in completely randomized experiments. Our approach of embedding the finite-population randomization distribution into an asymptotic sequence of sampling experiments closely follows Abadie, Athey, Imbens, and Wooldridge (2017). Our results make use of a finite-population CLT for the empirical process developed by Bickel (1969)

for the two-sample problem. Finite-sample central limit theorems for randomization inference were also provided by

Li and Ding (2017). Bootstrap methods for sampling from finite populations (without replacement) have been proposed by Bickel and Freedman (1984) and Booth, Butler, and Hall (1994). For this problem the main challenge in generating the finite bootstrap population is that the size of the super-population may be a non-integer multiple of . We propose a new alternative for estimating the potential outcome distribution for a super-population of exact size and for which the marginal distributions coincide with their empirical analogs up to rounding error.

2.7. Comparison to Fisher’s Exact Test

Bootstrap inference on the average treatment effect as proposed in this paper bears some conceptual similarities with Fisher’s exact test of the sharp null of no unit-level treatment effect (see e.g. Rosenbaum (2002), Imbens and Rubin (2015), Ding (2017)),

with probability 1. One important distinction is that the justification for our procedure is only asymptotic, whereas the Fisher exact test is valid in finite samples.

Furthermore, the Fisher exact test evaluates the randomization distribution of the estimated ATE under the sharp null of no or a constant unit-level treatment effect. The sharp null not only implies that the joint distribution of and corresponds to the isotone assignment, but also equality of the marginal distributions , which may in fact be rejected by the data under the null of a zero average treatment effect. In that case even a conservative estimator of the randomization variance may in fact be smaller than that implied by zero, or constant, unit-level effects. More generally, when Fisher’s sharp null fails and , the bootstrap estimate of the randomization variance can in several important scenarios be smaller than that implicit in Fisher’s exact test, in which case our procedure is asymptotically more powerful.

Specifically, standard variance calculations (see e.g. Ding (2017)) imply that the implicit variance estimate for Fisher’s exact test under the null of no average effect is

We can compare this to the actual variance stated in Section 2.1,

Our bootstrap procedure implies a conservative estimate, i.e. a sharp lower bound for from the isotone coupling of the potential outcomes, which is strictly positive whenever the marginal distributions of and are not the same. The comparison between the terms and is generally ambiguous - Ding (2017) describes several cases in which the randomization variance implied by Fisher’s test is strictly larger, and his conclusions carry over to the bootstrap procedure in this paper. On the other hand it is important to note that when

, Fisher’s exact test over-rejects under the null hypothesis of no average treatment effect, so that this potential power advantage for the Fisher test only arises in situations in which the exact test does not provide a valid test of that null. We illustrate this possibility using Monte Carlo simulations in Section


The relationship between Fisher’s sharp null and Neyman’s null hypothesis of no average effect is clarified in Ding (2017), who also shows that Neyman’s test of the null of no average effect is weakly more powerful against alternatives than Fisher’s exact test. Fisher’s exact null also implies that the distribution of is degenerate at a constant, however the power comparison for the ATE does not carry over to set-identified objects like quantiles or the c.d.f. of since the bounds for the identified set are typically not attained at the isotone coupling that is implied by the sharp null.

3. Bootstrap Procedure

This section describes the bootstrap procedure for confidence intervals for the average treatment effect, in which case the least favorable coupling is the isotone (rank-preserving) assignment by Proposition 2.1. The method allows for sampling and randomization uncertainty, where we consider a sampling experiment under which the researcher observes units that are selected at random out of a population of units. For the purposes of asymptotic approximations, we assume that the population of interest in turn consists of i.i.d. draws from an encompassing distribution .

Assumption 3.1.

(Sampling Experiment) The population consists of units with potential values which are i.i.d. draws from the distribution . The observed units are sampled at random and without replacement from the population,

where we denote .

We assume throughout that the treatment is binary, and that the outcome for unit does not vary with the treatment status assigned to other units. The latter requirement is also known as individualistic treatment response, or Stable Unit Treatment Value Assumption (SUTVA). We assume furthermore that the experiment is completely randomized:

Assumption 3.2.

(Complete Randomization) Treatment assignment is completely randomized, that is for each unit with we have

where for units selected at random and without replacement from the observations with , and the propensity score satisfies .

For greater clarity of exposition we also assume that the researcher observes no further covariate information. The approach of this paper can be generalized to observational studies under unconfoundedness, and experiments with imperfect compliance for which unconfoundedness fails, but intention to treat is (conditionally) independent of potential outcomes and can serve as an instrumental variable to identify causal effects on a population of compliers.

Given a sample generated according to Assumptions 3.1 and 3.2, we denote the point estimate for the average treatment effect

and the upper variance bound

For the purposes of this paper, the main target of interest for the causal bootstrap is the distribution of the t-ratio

3.1. Bootstrap Algorithm

The proposed bootstrap algorithm proceeds in four main steps:

  1. We obtain nonparametric estimates of the potential outcome distributions and from the units for which (, respectively) in the actual experiment.

  2. We create an empirical population of size , by generating an appropriate number of replicas of the sample of draws for . If the sample is the population, , we can skip this step.

  3. We then impute potential values for each unit , where and is obtained from the estimated potential outcome distributions and the least-favorable copula for the parameter of interest.

  4. Finally, we simulate the randomization distribution by repeatedly drawing units out of that empirical population without replacement and generating randomization draws . We then evaluate the sample average treatment effect for the bootstrap sample obtained using the imputed potential outcomes.

Given the simulated randomization distribution for the estimated bounds, we can estimate the percentiles of the t-ratios that are needed to construct confidence intervals for the functional. We next describe each of these steps in greater detail.

3.2. Generating the Empirical Population

To obtain the empirical population of size , we generate replicates of the observed units, however not necessarily of the same number for each observation when is not an integer multiple of . We propose the following procedure for doing so:

  • We create the samples of values for for the units with , and with values for the units with . We assume that each sample is ordered, for all , and the rank variable for .

  • Let and . We generate the empirical population by including copies of with and copies of with .

Since the respective maxima of are equal to for either of the two strata (corresponding to and , respectively), so that this procedure ensures that the empirical population has size equal to . Also, for and sufficiently large, the respective empirical distributions of among units with and among units with are, up to an approximation error of the order , equal to and , respectively.

3.3. Imputing Missing Counterfactuals

For the specific case of two-sided inference for the average treatment effect, Proposition 2.1 shows that the least favorable coupling corresponds to the isotone assignment . For other inference problems, the missing counterfactuals would have to be imputed by drawing from the appropriate least-favorable coupling, following the strategy outlined in Section 2.4.

In order to generate an empirical population with joint distribution , we can simply impute the missing counterfactuals according to:


For functionals of the potential outcome distribution other that the average treatment effect, or inference problems other than two-sided confidence intervals, the least favorable coupling will be of a different form, so this step would have to be replaced by a procedure imputing the missing counterfactuals from a different coupling.

3.4. Resampling Algorithm

For the th bootstrap replication, we initially draw units from the empirical population at random and without replacement.

Given a known propensity score , for the th bootstrap replication we can generate as independent Bernoulli draws with success probability and obtain the bootstrap sample , where .

We can then compute the bootstrap analogs of the estimated c.d.f.s and , the corresponding estimates of the average treatment effect, and the variance bound,

We then record the studentized values of the bootstrap estimates,

Repeating the resampling step times, we obtain a sample that constitutes independent draws from the bootstrap estimator of the randomization distribution and can be used to construct critical values for tests or confidence intervals.

3.5. Confidence Intervals

We consider confidence intervals constructed by inverting a t-test based on the point estimate and given the variance bound introduced before. The proposed confidence intervals for are then of the form


We use bootstrap approximations to the randomization distribution of the t-ratio under the least favorable coupling in order to determine the critical values. Specifically, let denote the empirical distribution for the bootstrap samples obtained from the previous step. We then estimate the critical values using and .

4. Monte Carlo Simulations

We next compare the performance of this causal bootstrap with the standard bootstrap and other alternative methods based on sampling or randomization designs. Specifically, we consider confidence intervals using Gaussian critical values with the respective analytic estimators of the sampling variance and the causal variance given in Section 2.2. We also consider Gaussian inference using the variance estimators and obtained from the classical (sampling) bootstrap, and the causal bootstrap proposed in this paper. We compare these to confidence intervals from inverting Fisher’s exact test, and confidence intervals from the standard and the causal bootstrap for the t-statistic based on either sampling or causal variance estimate. Throughout we will restrict our attention to the case , i.e. when the full population of interest is observed.

We first consider three different simulation designs to illustrate the main points of comparison between the causal bootstrap and the main alternatives for causal inference.

  • Design I sets and draws potential outcomes according to , . In this setting, treatment effects are constant at and the marginal distributions , so that all procedures should be expected to do well.

  • For Design II we again have , but generate potential outcomes as , and . In that case, the marginal distributions and are different, so that causal standard errors and the causal bootstrap should do better than their sampling analogs.

  • Design III replicates Design II at a smaller sample size, where , and , .

  • For Design IV, , and we generate non-Gaussian potential outcomes where and is a mixture that is drawn from with probability 0.9, and from with probability 0.1. This design highlights the difference between the bootstrap and Gaussian inference, which is no longer exact for this design.

Simulation results are shown in Table 3, where we compare coverage rates of nominal 95% confidence intervals, and the corresponding standard errors for each of the three designs. If a particular method does not directly calculate standard errors, we calculate the standard errors by taking the ratio of the difference between the upper and lower limit of the confidence interval and dividing by 2 times 1.96. We also report the nominal confidence level and theoretical causal standard error for each design in the last row (“Target”). With the exception of Fisher’s exact test, all methods rely on asymptotics, and should therefore not be expected to achieve exact coverage at the nominal level.

Under the first design, causal and sampling standard errors coincide, and the exact distribution of is Gaussian. Furthermore, Fisher’s exact null holds true, so all methods should work well. Under the second design, the exact distribution of is again Gaussian but a conservative estimate for the variance is strictly positive, so the causal standard error is strictly smaller than the sampling-based standard error. Hence for Design II, inference based on sampling based standard errors or the standard bootstrap should be expected to be conservative, whereas inference using causal standard errors or the causal bootstrap may still be conservative but coverage should be closer to the nominal level than for sampling-based methods.

Design III repeats Design II for a smaller sample size, which reveals a modest downward bias in the (bootstrap or analytical) causal standard errors, resulting in rejection rates exceeding the nominal level. Such a bias should be expected since the causal standard error corresponds to the plug-in estimator , where under regularity conditions is a smooth but nonlinear functional of the marginal distributions . While the bootstrap should not be expected to provide refinements for estimating the standard error (a non-pivotal quantity), a refinement for inference based on the studentized estimator corrects for the leading term of that bias and therefore result in rejection rates closer to the desired level.

Design IV has heterogeneous treatment effects and non-Gaussian marginal distributions for potential outcomes, where the tails of the marginal distribution of are thicker than for the Gassian distribution. In this setting sampling-based methods should be more conservative than their causal analogs, and the pivotal bootstrap may provide refinements over Gaussian inference with causal standard errors. Since sample size for the third design is fairly small and all methods except for Fisher’s exact test rely on asymptotics, all inference methods exhibit modest size distortions. However simulated coverage rates for the pivotal bootstrap are close to the nominal level, and simulation results in Tables 4 and 5 confirm that coverage rates approach the desired nominal level when sample sizes get sufficiently large.

Design I Design II Design III Design IV
Variance Bootstrap Pivotal Cov Med Cov Med Cov Med Cov Med
Estimator Version Statistic Rate s.e. Rate s.e. Rate s.e. Rate s.e.
N/A No 0.9536 0.1412 0.9950 0.0999 0.9870 0.2218 0.9776 0.3330
N/A No 0.9528 0.1404 0.9524 0.0706 0.9334 0.1568 0.9116 0.2354
Standard No 0.9518 0.1405 0.9944 0.0994 0.9850 0.2162 0.9744 0.3245
Causal No 0.9512 0.1400 0.9494 0.0704 0.9302 0.1548 0.9084 0.2325
Fisher’s Exact Test 0.9766 0.1411 0.9630 0.0999 0.9626 0.2219 0.9698 0.3332
Standard Yes 0.9534 0.1421 0.9954 0.1012 0.9900 0.2404 0.9838 0.3865
Standard Yes 0.9528 0.1433 0.9954 0.1012 0.9900 0.2404 0.9838 0.3865
Causal Yes 0.9526 0.1414 0.9510 0.0715 0.9446 0.1681 0.9434 0.2802
Causal Yes 0.9530 0.1419 0.9510 0.0715 0.9446 0.1681 0.9434 0.2802
Target 0.9500 0.1414 0.9500 0.0707 0.9500 0.1581 0.9500 0.2500

Table 3. 95% Confidence Intervals And Standard Errors

We next illustrate the role of the coupling of the potential values where we draw

from a bivariate Gaussian distribution with variances

and and correlation coefficient of the two potential values, . From our theoretical results, we should expect Gaussian inference using causal standard errors and the causal bootstrap to have asymptotically exact coverage under the isotonic coupling and be conservative when . Furthermore, for any coupling this design implies heterogeneous treatment effects, so that Fisher’s exact test does not in general control nominal confidence size for the average treatment effect. Given the calculations in Section 2.7 we designed the experiment deliberately to illustrate the potential of Fisher’s exact procedure to underestimate the spread of the randomization distribution, where and