Sampling-based randomized designs for causal inference under the potential outcomes framework

by Zach Branson, et al.
Harvard University

We establish the inferential properties of the mean-difference estimator for the average treatment effect in randomized experiments where each unit in a population of interest is randomized to one of two treatments and then units within treatment groups are randomly sampled. The properties of this estimator are well-understood in the experimental design scenario where first units are randomly sampled and then treatment is randomly assigned, but this is not the case for the aforementioned scenario where the sampling and treatment assignment stages are reversed. We find that the mean-difference estimator under this experimental design scenario is more precise than under the sample-first-randomize-second design, but only when there is treatment effect heterogeneity in the population. We also explore to what extent pre-treatment measurements can be used to improve upon the mean-difference estimator for this experimental design.






1 Introduction: The Potential Outcomes Framework for Randomized Experiments

In most two-armed randomized experiments conducted to compare the causal effects of two treatments (or one treatment and one control), a finite population of experimental units is considered. These units receive one of the treatments through a randomized assignment mechanism, and the observed outcomes from the two treatment groups are compared to draw inference on causal estimands of interest. Such causal estimands can be defined in terms of potential outcomes, and the framework for drawing inference in the setup described above is the well-known Neyman-Rubin causal model (Sekhon 2008) or simply the Rubin causal model (Holland 1986). In particular, this model posits that each unit $i$ has two fixed potential outcomes, $Y_i(1)$ and $Y_i(0)$, denoting unit $i$'s outcome under treatment and control, respectively, and thus the only stochasticity in a randomized experiment is units' assignment to treatment groups.

The most common estimand in the causal inference literature is the finite-population average treatment effect (ATE), defined as $\tau = \bar{Y}(1) - \bar{Y}(0)$, where $\bar{Y}(t) = N^{-1}\sum_{i=1}^{N} Y_i(t)$. In a randomized experiment, ultimately each unit can only be assigned to treatment or control—never both—and thus only $Y_i(1)$ or $Y_i(0)$ is observed for each unit. As a result, $\tau$ must be estimated. The most common estimator for $\tau$ is the mean-difference estimator, i.e., the difference between the mean of the observed treatment outcomes and the mean of the observed control outcomes. In this finite-population setup where the potential outcomes are assumed fixed, the mean-difference estimator is unbiased for $\tau$, and its sampling variance was derived by Neyman (1923). Using this estimator and sampling variance, a normal approximation can be used to draw inference on $\tau$. If additional covariates are available for each unit, covariate adjustment can be performed to obtain more precise estimators for $\tau$, such as through post-stratification (Miratrix et al. 2013) or regression (Lin 2013).
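As a hedged illustration of this classical setup, the following Python sketch builds a hypothetical finite population with fixed potential outcomes (all sizes and parameter values are invented for illustration), repeatedly applies complete randomization, and checks that the mean-difference estimator averages to the ATE and that its empirical variance matches Neyman's formula $S_1^2/N_1 + S_0^2/N_0 - S_\tau^2/N$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population (all values invented for illustration).
N, N1 = 100, 50                           # N1 units treated, N - N1 control
Y0 = rng.normal(0.0, 1.0, N)              # fixed potential outcomes under control
Y1 = Y0 + 2.0 + rng.normal(0.0, 0.5, N)   # under treatment; effects heterogeneous
tau = Y1.mean() - Y0.mean()               # finite-population ATE

def mean_difference(rng):
    """One complete randomization: treat exactly N1 units, observe one
    potential outcome per unit, return the mean-difference estimate."""
    treated = np.zeros(N, dtype=bool)
    treated[rng.choice(N, size=N1, replace=False)] = True
    return Y1[treated].mean() - Y0[~treated].mean()

estimates = np.array([mean_difference(rng) for _ in range(20_000)])

# Neyman's sampling variance of the mean-difference estimator for this design.
neyman_var = (Y1.var(ddof=1) / N1 + Y0.var(ddof=1) / (N - N1)
              - (Y1 - Y0).var(ddof=1) / N)
```

Averaging over the 20,000 randomizations illustrates both unbiasedness and the variance formula; note that the only randomness here is the assignment, since the potential outcomes are held fixed.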

However, in many practical situations, it may not be possible to observe a response for all units due to resource constraints. A natural way to address this limitation is to first randomly sample units from the population and then randomly assign the two treatments to the sampled units. For example, this is a common procedure in political science and other social sciences, where an experiment is conducted on a representative sample of some larger population, often through a well-designed online survey (Mutz 2011; Coppock 2018; Miratrix et al. 2018). In this case, the mean-difference estimator is still unbiased for the ATE, and its sampling variance has been studied in works such as Imbens (2004) and Imbens and Rubin (2015, Chapter 6). Then, one can still use the mean-difference estimator and its sampling variance to draw inference on $\tau$ via a normal approximation. This scenario—where $n$ units are sampled from a population of $N$ units, and then an experiment is conducted on these sampled units—is shown in the left panel of Figure 1.

(a) Design 1: Sampling in first stage and randomization in second stage.
(b) Design 2: Randomization in first stage and sampling in second stage.
Figure 1: Two sampling-based designs for estimation of ATE.

This article is motivated by a similar limitation as stated in the previous paragraph, but the experimental design is different in the sense that the order of sampling and randomization is reversed. That is, first, each of the $N$ units is assigned to one of the two treatments. Then, a subset of units is sampled from the units exposed to each treatment, and the response is measured for each sampled unit. This experimental design scenario is shown in the right panel of Figure 1. Such a strategy may be useful if sampling and measurement of the response is more expensive and/or complex than treatment assignment. Examples of experiments that involve this type of design can be found in materials science, biomedicine, and the social sciences. For example, in an experiment conducted to assess the difference between two types of oxidation procedures with respect to their impact on the dimensions of nanotubes, scientists typically split a population (a container of nanotubes) into two subpopulations (two smaller containers) and apply different oxidation procedures to the two subpopulations (Remillard et al. 2016, 2017). However, since measuring the dimensions of nanotubes is an expensive and time-consuming process, a sample of oxidized nanotubes is taken from each subpopulation, and the dimensions of the sampled nanotubes are measured. A similar procedure is followed when conducting stem cell experiments (Chung et al. 2005; Doi et al. 2009).

The main contribution of this article is deriving the sampling properties of the mean-difference estimator for the ATE under Design 2—we do this in Section 3 after setting up some notation in Section 2. We will find that these sampling properties are identical to those under Design 1, meaning that the ordering of the sampling and randomization stages is inconsequential for the precision of the mean-difference estimator. This finding will also bring some clarifications about inference of the ATE under Design 1, as we discuss in Section 3.1. In particular, the variance of the mean-difference estimator under Design 1 is often characterized by asymptotic results where the population and/or sample size tends to infinity (Imbens 2004; Imbens and Rubin 2015; Sekhon and Shem-Tov 2017), and our work illuminates the case where the population size and sample size are finite. Our work also gives insight into other sampling-based randomized designs. For example, Design 2 is similar—but not identical—to a cluster-randomized design (Campbell et al. 2007), where one cluster of size $N_1$ is assigned to treatment group 1 and another cluster of size $N_0$ is assigned to treatment group 0, and random samples are obtained within each cluster. Cluster-randomized designs are quite common in education (Hedges and Hedberg 2007), medicine (Eldridge et al. 2004), and psychology (Raudenbush 1997), and we discuss how Design 2 compares to cluster-randomized designs in Section 3.2. Then, in Section 4, we consider the case where baseline (pre-treatment) measurements of the response can be obtained for a sample of experimental units, and we explore the extent to which the mean-difference estimator can be improved under Design 2 by using such measurements. Unfortunately, we will find that pre-treatment measurements are often unhelpful in improving the precision of average treatment effect estimators under this design, unless a large number of pre-treatment measurements that are highly associated with the post-treatment measurements can be obtained.
In Section 5, we confirm these results by conducting a simulation study based on a real experiment in nanomaterials. In Section 6, we conclude.

2 Setup and Notations

Let the two treatments be denoted by 0 and 1, and for $i = 1, \ldots, N$, let $Y_i(0)$ and $Y_i(1)$ denote the potential outcomes for unit $i$ when exposed to treatments 0 and 1, respectively. The unit-level treatment effect is

$$\tau_i = Y_i(1) - Y_i(0), \quad i = 1, \ldots, N,$$

and the finite-population ATE is

$$\tau = \frac{1}{N}\sum_{i=1}^{N} \tau_i = \bar{Y}(1) - \bar{Y}(0),$$

where $\bar{Y}(1)$ and $\bar{Y}(0)$ denote the average potential outcomes for treatment groups 1 and 0, respectively. We also denote the variances of the potential outcomes for treatment group $t$ as follows:

$$S_t^2 = \frac{1}{N-1}\sum_{i=1}^{N}\big(Y_i(t) - \bar{Y}(t)\big)^2, \quad t = 0, 1, \tag{1}$$

and the covariance of potential outcomes for treatment groups 1 and 0 by

$$S_{10} = \frac{1}{N-1}\sum_{i=1}^{N}\big(Y_i(1) - \bar{Y}(1)\big)\big(Y_i(0) - \bar{Y}(0)\big). \tag{2}$$

Finally, the variance of the unit-level treatment effects is

$$S_\tau^2 = \frac{1}{N-1}\sum_{i=1}^{N}(\tau_i - \tau)^2 = S_1^2 + S_0^2 - 2S_{10}. \tag{3}$$

As in Design 2 shown in the right panel of Figure 1, $N_1$ and $N_0$ units are assigned to treatments 1 and 0 respectively, where $N_1$ and $N_0$ are predetermined and $N_1 + N_0 = N$. From these two groups, $n_1$ and $n_0$ units are sampled and their responses are observed. Let $\bar{y}_1$ and $\bar{y}_0$ denote the observed averages for treatment groups 1 and 0, respectively. Then, a natural estimator of $\tau$ is

$$\hat{\tau} = \bar{y}_1 - \bar{y}_0. \tag{4}$$
We will examine the sampling properties of $\hat{\tau}$ and compare them with those of a similar estimator obtained from Design 1 shown in Figure 1.
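To make Design 2 concrete before deriving its properties, here is a minimal Python sketch of a single run of this design (the population values and sizes below are hypothetical): all $N$ units are first randomly partitioned into treatment groups of fixed sizes, and simple random samples are then drawn within each group:

```python
import numpy as np

rng = np.random.default_rng(1)

def design2_estimate(Y1, Y0, N1, n1, n0, rng):
    """One run of Design 2: first randomly partition all N units into
    treatment groups of fixed sizes N1 and N - N1, then draw simple random
    samples of n1 and n0 units within the groups and compare sample means."""
    N = len(Y1)
    perm = rng.permutation(N)
    group1, group0 = perm[:N1], perm[N1:]
    sample1 = rng.choice(group1, size=n1, replace=False)
    sample0 = rng.choice(group0, size=n0, replace=False)
    return Y1[sample1].mean() - Y0[sample0].mean()

# Small hypothetical population with a constant unit-level effect of 3.
Y0 = np.arange(20, dtype=float)
Y1 = Y0 + 3.0
tau_hat = design2_estimate(Y1, Y0, N1=10, n1=4, n0=4, rng=rng)
```

A single run gives a noisy estimate; averaging `design2_estimate` over many runs recovers the ATE, which is the unbiasedness property established formally below.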

3 Sampling properties of the estimator of the ATE

Because the potential outcomes are assumed fixed, the sampling properties of $\hat{\tau}$ defined in (4) will be determined by the random variables associated with the randomization and sampling stages. We define two such variables now. For $i = 1, \ldots, N$, let $W_i$ denote a Bernoulli random variable indicating the random level of treatment (0 or 1) that unit $i$ receives. Recalling that the assignment mechanism essentially involves a random partitioning of the $N$ units into two groups of predetermined sizes $N_1$ and $N_0$, the properties of the assignment vector $(W_1, \ldots, W_N)$ are straightforward to establish and can be found in standard texts (e.g., Imbens and Rubin 2015).

Next, for $t = 0, 1$ and $i = 1, \ldots, N$, define $R_i(t)$ as an indicator random variable equaling 1 if the potential outcome $Y_i(t)$ is randomly sampled among the units assigned to treatment level $t$. Importantly, note that we know that $R_i(1) = 0$ conditional on $W_i = 0$; i.e., the $i$th unit will not be sampled from treatment group 1 if it was assigned to treatment group 0. As shown in the right panel of Figure 1, we assume that samples of size $n_0$ and $n_1$ are drawn from treatment groups 0 and 1, respectively. Properties of the $R_i(t)$ are crucial in the derivation of the sampling properties of $\hat{\tau}$, and are summarized in Lemma 1. The proofs are in Appendix A.

Lemma 1.

For $t, t' \in \{0, 1\}$ and $i, j \in \{1, \ldots, N\}$, the properties of the sampling indicators can be summarized as:

$$E[R_i(t)] = \frac{n_t}{N}, \tag{5}$$

$$\mathrm{Var}\big(R_i(t)\big) = \frac{n_t}{N}\left(1 - \frac{n_t}{N}\right), \tag{6}$$

$$\mathrm{Cov}\big(R_i(t), R_j(t')\big) = \begin{cases} \dfrac{n_t(n_t - 1)}{N(N-1)} - \dfrac{n_t^2}{N^2}, & i \neq j,\ t = t', \\[6pt] -\dfrac{n_1 n_0}{N^2}, & i = j,\ t \neq t', \\[6pt] \dfrac{n_1 n_0}{N(N-1)} - \dfrac{n_1 n_0}{N^2}, & i \neq j,\ t \neq t'. \end{cases} \tag{7}$$

Remark 1.

To provide some intuition for (7), note that we expect the quantity $\mathrm{Cov}(R_i(t), R_j(t))$ to be negative if $i \neq j$, because if we know that unit $i$ is sampled in treatment group $t$, then unit $j$ is less likely to be sampled in the same group. Similarly, we expect the quantity $\mathrm{Cov}(R_i(1), R_j(0))$ for $i \neq j$ to be positive, because if we know that unit $i$ is sampled in treatment group 1, then unit $j$ is more likely to be sampled in treatment group 0.
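These signs can be checked numerically. The following sketch (with small, arbitrary sizes) simulates the partition-then-sample mechanism many times and compares Monte Carlo covariances of the sampling indicators against closed-form expressions derived from that mechanism:

```python
import numpy as np

rng = np.random.default_rng(2)

# Small, arbitrary sizes for illustration.
N, N1, n1, n0 = 6, 3, 2, 2
N0 = N - N1

reps = 100_000
S1 = np.zeros((reps, N))   # S1[r, i] = 1 if unit i is sampled in group 1 in rep r
S0 = np.zeros((reps, N))
for r in range(reps):
    perm = rng.permutation(N)
    g1, g0 = perm[:N1], perm[N1:]
    S1[r, rng.choice(g1, size=n1, replace=False)] = 1
    S0[r, rng.choice(g0, size=n0, replace=False)] = 1

cov_same = np.cov(S1[:, 0], S1[:, 1])[0, 1]    # same group, i != j: negative
cov_cross = np.cov(S1[:, 0], S0[:, 1])[0, 1]   # opposite groups, i != j: positive

# Closed-form counterparts implied by the partition-then-sample mechanism.
exact_same = n1 * (n1 - 1) / (N * (N - 1)) - (n1 / N) ** 2
exact_cross = n1 * n0 / (N * (N - 1)) - n1 * n0 / N ** 2
```

The simulated covariances come out negative within a treatment group and positive across groups, matching the intuition above.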

We now represent the estimator $\hat{\tau}$ in (4) in terms of the potential outcomes and the random variables $W_i$ and $R_i(t)$ defined above. To do this, note that the average observed response in treatment group $t$ can be written as

$$\bar{y}_t = \frac{1}{n_t}\sum_{i=1}^{N} R_i(t)\, Y_i(t), \quad t = 0, 1. \tag{8}$$
By combining (4) and (8) and using Lemma 1, we now derive the sampling properties of the estimator and summarize them in Theorem 1. The proof of Theorem 1 is in Appendix B.

Theorem 1.

The estimator $\hat{\tau}$ given by (4) satisfies the following properties:

  1. $\hat{\tau}$ is an unbiased estimator of the ATE $\tau$, i.e., $E(\hat{\tau}) = \tau$.

  2. The sampling variance of $\hat{\tau}$ is given by

$$\mathrm{Var}(\hat{\tau}) = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} - \frac{S_\tau^2}{N}, \tag{9}$$

    where $S_1^2$ and $S_0^2$ are given by (1) and $S_\tau^2$ is given by (3).

3.1 Comparison with two other sampling-based randomized designs

We now compare the sampling properties of $\hat{\tau}$ as derived in Theorem 1 with the unbiased estimators of $\tau$ obtained from two other designs: (i) a design where responses for all $N$ units are observed, that is, no sampling is involved, and (ii) a design described as Design 1 in the left panel of Figure 1, where $n_1 + n_0$ units are first sampled from the $N$ units in the population and subsequently exposed to treatments. We denote the estimator of $\tau$ from (i) by $\hat{\tau}_{\mathrm{full}}$ and that from (ii) by $\hat{\tau}_1$. Both $\hat{\tau}_{\mathrm{full}}$ and $\hat{\tau}_1$ are unbiased estimators of $\tau$. Neyman (1923) derived the following result on the sampling variance of $\hat{\tau}_{\mathrm{full}}$:

$$\mathrm{Var}(\hat{\tau}_{\mathrm{full}}) = \frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} - \frac{S_\tau^2}{N}. \tag{10}$$

As a result, since $n_t \leq N_t$ for $t = 0, 1$, we obtain the inequality $\mathrm{Var}(\hat{\tau}) \geq \mathrm{Var}(\hat{\tau}_{\mathrm{full}})$, reflecting the price one has to pay for sampling from the finite population.

Meanwhile, to our knowledge, Imbens and Rubin (2015, Chapter 6, Appendix B) were the first to derive the expression of the finite-population sampling variance of $\hat{\tau}_1$, although others have discussed the infinite-population case where $N \to \infty$ (e.g., Imbens 2004). However, there was a slight error in the derivation in Imbens and Rubin (2015, Chapter 6, Appendix B): the covariance between the sampling indicators was misstated; the correct expression follows from the $i \neq j$, $t = t'$ case in Lemma 1 and other works on survey sampling (e.g., Lohr 2009, Page 53). As we show in Appendix C, by making this correction, and then following the derivation of Imbens and Rubin (2015, Chapter 6, Appendix B), we have the following expression of the sampling variance of $\hat{\tau}_1$:

$$\mathrm{Var}(\hat{\tau}_1) = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} - \frac{S_\tau^2}{N}. \tag{11}$$

Thus, the precision of the mean-difference estimator under Design 1 is identical to that under Design 2; in other words, the order of the sampling and randomization stages is inconsequential in terms of the precision of the mean-difference estimator. This finding also clarifies some discussions in the causal inference literature that consider a super-population framework. When discussing causal inference under a super-population framework, many works implicitly assume that the population is infinite or that the sampled fraction is negligible, and thus the third term in (11), $S_\tau^2 / N$, is often ignored (Imbens 2004; Imbens and Rubin 2015, Chapter 6; Ding et al. 2017; Sekhon and Shem-Tov 2017). As we will see in Section 3.3, this term is also ignored in the estimation of $\mathrm{Var}(\hat{\tau})$ or $\mathrm{Var}(\hat{\tau}_1)$, because the observed data do not provide any information about $S_\tau^2$, since none of the individual $\tau_i$ are ever observed. However, the above derivation emphasizes that this term nonetheless exists in the true variance of the mean-difference estimator under Design 1 and Design 2 when the super-population that is sampled from is finite.
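The equivalence of the two designs can also be seen by simulation. In the following sketch (hypothetical population and sizes), the empirical variances of the mean-difference estimator under Design 1 (sample first, then randomize) and Design 2 (randomize first, then sample) are compared with each other and with the common formula $S_1^2/n_1 + S_0^2/n_0 - S_\tau^2/N$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical finite population with heterogeneous treatment effects.
N, N1, n1, n0 = 40, 20, 8, 8
Y0 = rng.normal(0.0, 1.0, N)
Y1 = Y0 + 1.0 + rng.normal(0.0, 1.0, N)

def design1(rng):
    """Sample n1 + n0 units from the population, then randomize them."""
    units = rng.permutation(rng.choice(N, size=n1 + n0, replace=False))
    return Y1[units[:n1]].mean() - Y0[units[n1:]].mean()

def design2(rng):
    """Randomize all N units into groups of N1 and N - N1, then sample."""
    perm = rng.permutation(N)
    g1, g0 = perm[:N1], perm[N1:]
    return (Y1[rng.choice(g1, size=n1, replace=False)].mean()
            - Y0[rng.choice(g0, size=n0, replace=False)].mean())

reps = 50_000
v1 = np.var([design1(rng) for _ in range(reps)])
v2 = np.var([design2(rng) for _ in range(reps)])

# The common finite-population variance formula for both designs.
theory = (Y1.var(ddof=1) / n1 + Y0.var(ddof=1) / n0
          - (Y1 - Y0).var(ddof=1) / N)
```

Within Monte Carlo error, both empirical variances agree with each other and with the formula, including the $-S_\tau^2/N$ term that survives only when the population is finite.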

3.2 Comparison with cluster-randomized designs

As discussed in Section 1, Design 2 is similar but not identical to a cluster-randomized design, where treatment is assigned among clusters of units but response is measured at the unit level. The most important distinction between Design 2 and a cluster-randomized design is that, in Design 2, $N_1$ and $N_0$ (the numbers of units assigned to treatment groups 1 and 0, respectively) are fixed, whereas in a cluster-randomized design these quantities are stochastic. For example, consider a cluster-randomized design with two clusters, where Cluster A has $N_A$ units and Cluster B has $N_B$ units, and one cluster is randomly assigned to treatment group 0 and the other to treatment group 1. There is a 0.5 probability that $N_1 = N_A$ and a 0.5 probability that $N_1 = N_B$, and $N_0$ is analogously stochastic.

This stochasticity causes many complications to inference—for example, the mean-difference estimator is often biased in cluster-randomized designs, even when all units’ responses can be measured (Middleton 2008; Middleton and Aronow 2015). Consequently, there is not a straightforward analog to our results that will hold for cluster-randomized designs; indeed, our results will not even hold for the simple two-cluster example mentioned in the previous paragraph. Because we require that and are fixed, our results will hold for cluster-randomized designs if the number of units within each cluster is the same across clusters, but this is rarely the case. As discussed in Middleton (2008) and Middleton and Aronow (2015), cluster sizes and the covariance between treatment group sizes and treatment effects are important quantities for deriving inferential properties of ATE estimators in cluster-randomized designs, and likely these quantities would be similarly important in deriving analogous results when there is a sampling stage after treatment assignment at the cluster level. We leave this for future work.

3.3 Estimation of sampling variance and approximate confidence intervals



denote the sample variances of observed responses for treatment groups 0 and 1, respectively. From standard sampling theory (e.g., Lohr 2009, Pages 52-54), it follows that and are unbiased estimators of and , respectively. Consequently,


is a natural Neymanian-style estimator of and , which, as seen in Section 3.1, has an upward bias of unless strict additivity holds. This estimator—originating in Neyman (1923)—is by far the most common estimator for the variance of the ATE in randomized experiments (Rubin 1990; Imbens 2004; Miratrix et al. 2013; Imbens and Rubin 2015; Ding et al. 2017), and thus it is reassuring that it can be used for and , i.e., it can be used under Design 1 or Design 2.

Estimator (13) can be used to obtain approximate $100(1-\alpha)\%$ confidence intervals for $\tau$ of the form

$$\hat{\tau} \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{Var}}(\hat{\tau})},$$

where $z_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$ quantile of the standard normal distribution. The asymptotic normality of $\hat{\tau}$ is based on the finite-population central limit theorem (Hájek 1960) and its application in the context of randomized experiments by Li and Ding (2017).
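A hedged Python sketch of this interval (the function name is ours; it relies only on NumPy and the standard library):

```python
import numpy as np
from statistics import NormalDist

def neyman_ci(y1_obs, y0_obs, alpha=0.05):
    """Mean-difference estimate with the Neymanian variance estimate
    s1^2/n1 + s0^2/n0 and a normal-approximation confidence interval."""
    n1, n0 = len(y1_obs), len(y0_obs)
    tau_hat = y1_obs.mean() - y0_obs.mean()
    var_hat = y1_obs.var(ddof=1) / n1 + y0_obs.var(ddof=1) / n0
    z = NormalDist().inv_cdf(1 - alpha / 2)    # e.g. about 1.96 for alpha = 0.05
    half_width = z * np.sqrt(var_hat)
    return tau_hat, (tau_hat - half_width, tau_hat + half_width)

# Toy example: observed samples from the two treatment groups.
est, (lo, hi) = neyman_ci(np.array([1.0, 2.0, 3.0]), np.array([0.0, 1.0, 2.0]))
```

Because the variance estimate is upwardly biased by $S_\tau^2/N$ under treatment effect heterogeneity, the resulting interval is conservative, which is the standard Neymanian trade-off.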

4 Can the estimator be improved using pre-treatment measurements?

Now assume that the response can be measured for each experimental unit prior to application of the treatments. Let $x_1, \ldots, x_N$ denote these measurements for the $N$ units, which are fixed quantities like the potential outcomes. The unit-level differences $Y_i(1) - x_i$ and $Y_i(0) - x_i$ for $i = 1, \ldots, N$ are referred to as "gain scores" in the psychology (Rumrill and Bellini 2017) and education (McGowen and Davis 2002) literature. While the unit-level gain scores or their averages $\bar{Y}(1) - \bar{x}$ and $\bar{Y}(0) - \bar{x}$ are of interest in psychology and education (Hake 1998), they are not referred to as causal estimands or causal effects (Rubin et al. 2004; Imbens and Rubin 2015). However, it is easy to see that the unit-level and average treatment effects can respectively be expressed as the unit-level and average differences of the gain scores, i.e., $\tau_i = (Y_i(1) - x_i) - (Y_i(0) - x_i)$ and $\tau = (\bar{Y}(1) - \bar{x}) - (\bar{Y}(0) - \bar{x})$. In spite of this connection, several experts in education and psychology have recommended avoiding the use of gain scores when estimating treatment effects in experiments and observational studies. Campbell and Erlebacher (1970) claimed, "gain scores are in general a treacherous quicksand," and Cronbach and Furby (1970) recommended that researchers "frame their questions in other ways." Despite some recent interest in utilizing gain scores to identify causal effects, there appears to be a general aversion in the causal inference community towards the use of gain scores. Here we explore whether, in the current setup, a proper design and analysis of the experiment using gain scores can potentially lead to more precise estimation of treatment effects under certain assumptions.

In the setup of Design 2 in Figure 1, where measuring the pre-treatment response for each unit is not feasible, it is common practice to measure the pre-treatment response for a random sample of size $m$ in order to estimate the average gain scores $\bar{Y}(1) - \bar{x}$ and $\bar{Y}(0) - \bar{x}$. Denoting the corresponding sampling indicator by $R_i^{\mathrm{pre}}$, the estimators of the two average gain scores are $\bar{y}_1 - \bar{x}_{\mathrm{pre}}$ and $\bar{y}_0 - \bar{x}_{\mathrm{pre}}$, respectively, where $\bar{x}_{\mathrm{pre}} = m^{-1}\sum_{i=1}^{N} R_i^{\mathrm{pre}} x_i$. While the sampling properties of these estimators can be readily obtained, they do not help in increasing the precision of the estimator of $\tau$, because the average pre-treatment scores cancel out in the difference of the gain-score estimators. However, if samples of $m_1$ and $m_0$ pre-treatment observations are taken independently from treatment groups 1 and 0 after assignment but before administration of the treatments, it is possible to obtain a different estimator of the ATE. For $t = 0, 1$, let $R_i^*(t)$ denote the sampling indicator associated with the random sampling of $x_i$ among the $N_t$ units assigned to treatment $t$. Then

$$\bar{x}_t = \frac{1}{m_t}\sum_{i=1}^{N} R_i^*(t)\, x_i, \quad t = 0, 1, \tag{14}$$

are the observed sample averages of pre-treatment responses for the two treatment groups. Then we can define the following estimator of the ATE:

$$\hat{\tau}^* = (\bar{y}_1 - \bar{x}_1) - (\bar{y}_0 - \bar{x}_0), \tag{15}$$

where $\bar{x}_1$ and $\bar{x}_0$ are given by (14). The sampling properties of $\hat{\tau}^*$ depend on the distribution of the $R_i^*(t)$ and the joint distribution of the $R_i(t)$ and $R_j^*(t')$. These properties are summarized in the following two lemmas.

Lemma 2.

The properties of the indicators $R_i^*(t)$ are exactly the same as those of the $R_i(t)$ stated in Lemma 1, just replacing $n_1$ by $m_1$ and $n_0$ by $m_0$.

Lemma 3.

For $t, t' \in \{0, 1\}$ and $i, j \in \{1, \ldots, N\}$, the covariance between the indicators $R_i(t)$ and $R_j^*(t')$ is given by:

$$\mathrm{Cov}\big(R_i(t), R_j^*(t')\big) = \begin{cases} \dfrac{n_t m_t}{N N_t} - \dfrac{n_t m_t}{N^2}, & i = j,\ t = t', \\[6pt] \dfrac{(N_t - 1)\, n_t m_t}{N(N-1) N_t} - \dfrac{n_t m_t}{N^2}, & i \neq j,\ t = t', \\[6pt] -\dfrac{n_t m_{t'}}{N^2}, & i = j,\ t \neq t', \\[6pt] \dfrac{n_t m_{t'}}{N(N-1)} - \dfrac{n_t m_{t'}}{N^2}, & i \neq j,\ t \neq t'. \end{cases}$$

The proof for Lemma 3 is in Appendix D. Using Lemmas 2 and 3, we arrive at the following result:

Theorem 2.

The estimator $\hat{\tau}^*$ given by (15) satisfies the following properties:

  1. $\hat{\tau}^*$ is an unbiased estimator of the ATE $\tau$.

  2. The sampling variance of $\hat{\tau}^*$ is given by

$$\mathrm{Var}(\hat{\tau}^*) = \mathrm{Var}(\hat{\tau}) + S_x^2\left(\frac{1}{m_1} + \frac{1}{m_0}\right) - 2\left(\frac{S_{1x}}{N_1} + \frac{S_{0x}}{N_0}\right),$$

    where $\mathrm{Var}(\hat{\tau})$ is given by (9) and

$$S_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2, \qquad S_{tx} = \frac{1}{N-1}\sum_{i=1}^{N}\big(Y_i(t) - \bar{Y}(t)\big)(x_i - \bar{x}), \quad t = 0, 1.$$
The proof of Theorem 2 is in Appendix E.

Remark 2.

From Theorem 2, it follows that $\hat{\tau}^*$ is a more efficient estimator than $\hat{\tau}$ if and only if

$$2\left(\frac{\rho_1 S_1}{N_1} + \frac{\rho_0 S_0}{N_0}\right) > S_x\left(\frac{1}{m_1} + \frac{1}{m_0}\right), \tag{19}$$

where $\rho_t = S_{tx}/(S_t S_x)$ denotes the finite-sample correlation coefficient between the potential outcomes $Y_i(t)$ and the pre-treatment measurements $x_i$. In order for condition (19) to be achieved, pre-treatment measurements need to be highly predictive of the outcome, post-treatment outcomes need to be substantially variable relative to the pre-treatment measurements, and pre-treatment measurements need to be obtained for a large portion of the population. For example, consider the case where we have a balanced design (i.e., $N_1 = N_0 = N/2$) and a balanced pre-treatment sample size ($m_1 = m_0 = m/2$). In this case, even if the pre-treatment measurements are perfectly correlated with the outcomes (i.e., $\rho_1 = \rho_0 = 1$ and $S_1 = S_0 = S_x$), in order for (19) to be achieved, we need $m > N/2$, and usually $m$ will be small due to resource constraints. Similarly, if the pre-treatment measurements are moderately correlated with the outcomes (e.g., $\rho_1 = \rho_0 = 0.5$) and all units' pre-treatment measurements are observed (i.e., $m = N$), we still need $(S_1 + S_0)/2 > S_x$, i.e., the average standard deviation of the post-treatment outcomes needs to be larger than the standard deviation of the pre-treatment measurements. Cases where $m > N/2$ or $(S_1 + S_0)/2 > S_x$ are indeed quite extreme and unrealistic conditions, and thus Theorem 2 in general gives credence to the skepticism many causal inference experts have about using gain scores for estimation of treatment effects in the context of the experimental design discussed here.

5 Simulation Study: Considering an Experiment in Nanomaterials

To assess how the findings from Theorems 1 and 2 are informative for real experiments, we now conduct a simulation study based on a real experiment in nanomaterials (Remillard et al. 2017). One purpose of this experiment was to assess how the dimensions of carbon nanotubes change as they undergo different processing procedures. One such procedure is sonication, the primary purpose of which is to disperse the nanotubes. However, sonication may also result in the undesirable outcome of breaking the nanotubes, possibly causing changes to their length. The experiment considered two types of sonication—bath sonication and probe sonication—which are the treatments in this application. Bath sonication is a more gentle procedure than probe sonication, so it was hypothesized that bath sonication would not decrease the length of the carbon nanotubes as much as probe sonication.

Because of practical constraints, this experiment was conducted using Design 2 in Figure 1. Because the nanotubes are so small, it was infeasible to select individual nanotubes for treatment; instead, a container of nanotubes was evenly divided into two smaller containers, and each of these containers underwent bath sonication or probe sonication. After sonication, nanotubes were randomly selected from each container and their length was measured. Furthermore, the length of a random sample of nanotubes was measured before treatment. Primary interest was in the average difference in log-length between bath sonication and probe sonication, i.e., the ATE; the log-length was used due to skewness in the length distribution across carbon nanotubes.

In this paper, we have discussed two ATE estimators under Design 2: the mean-difference estimator $\hat{\tau}$ defined in (4), and the estimator $\hat{\tau}^*$ defined in (15) that incorporates pre-treatment information. In the real experiment conducted in Remillard et al. (2017), we could only implement Design 2 once, and thus we could not observe the behavior of these estimators across different randomizations and random samples of carbon nanotubes. In order to understand the behavior of these estimators across many implementations of Design 2, we will conduct a simulation study mimicking this experiment.[2]

[2] In actuality, the Remillard et al. (2017) experiment considered several treatment factors, such as oxidation as well as sonication, and it considered several types of carbon nanotubes. For ease of exposition, the simulation data discussed here are based on the subset of the experiment that considered the carbon nanotube called "D15L1-5" in Remillard et al. (2017). We chose this type of carbon nanotube because it was already oxidized, and thus only sonication (i.e., one treatment factor) was used in the actual experiment for this type of carbon nanotube.

Consider the following implementation of Design 2 for this experiment: the $N$ carbon nanotubes will be randomly divided into treatment groups of sizes $N_1$ (probe sonication) and $N_0$ (bath sonication). Random samples of sizes $m_1$ and $m_0$ will be obtained before treatment, and random samples of sizes $n_1$ and $n_0$ will be obtained after treatment; these were approximately the sample sizes that were used in the actual experiment. The individual log-length of each carbon nanotube will be measured for these samples.

First, the population of pre-treatment measurements $x_i$ as well as post-treatment measurements $Y_i(0)$ and $Y_i(1)$ was generated using the following trivariate normal model:

$$\big(x_i, Y_i(0), Y_i(1)\big) \overset{\mathrm{iid}}{\sim} N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad i = 1, \ldots, N. \tag{20}$$

As discussed in Remillard et al. (2017), normality was a reasonable distributional assumption to place on the log-length of carbon nanotubes. The mean parameters and variance parameters, as well as the ATE parameter $\tau$, were set to the sample estimates observed in the Remillard et al. (2017) experiment.[3] The treatment effect heterogeneity parameter (the correlation between $Y_i(1)$ and $Y_i(0)$) and the association parameter (the correlation between the pre- and post-treatment measurements) could not be estimated in the experiment, because none of the pairs $(Y_i(1), Y_i(0))$ could ever be jointly observed. In this simulation study, we set the heterogeneity parameter to induce strong treatment effect heterogeneity (Ding et al. 2016) and consider various values for the association parameter.

[3] The parameter values were set to the sample estimates from the experiment, where the units are on the log scale.

After the population-level triplets $(x_i, Y_i(0), Y_i(1))$ were generated once, the following randomization and sampling procedures (i.e., Design 2) were repeated 10,000 times:

  1. Each unit was randomly assigned to treatment group 1 or treatment group 0 with equal probability, such that $N_1 = N/2$ and $N_0 = N/2$.

  2. Within each treatment group $t$, $m_t$ units were randomly sampled and their $x_i$ recorded as an observed pre-treatment measurement.

  3. Within each treatment group $t$, $n_t$ units were randomly sampled and their $Y_i(t)$ recorded as an observed post-treatment measurement.

For each of the 10,000 replications, we recorded the ATE estimators $\hat{\tau}$ defined in (4) and $\hat{\tau}^*$ defined in (15). Recall that $\hat{\tau}$ is the mean-difference estimator, and its properties under this design are established by Theorem 1; meanwhile, $\hat{\tau}^*$ uses the pre-treatment measurements to alter $\hat{\tau}$, and its properties are established by Theorem 2.

First, let us consider one simulation setting to confirm the results in Theorems 1 and 2. Figure 2 shows the empirical distributions of $\hat{\tau}$ and $\hat{\tau}^*$ after 10,000 replications of the above procedure when the pre- and post-treatment measurements are moderately associated. There are two observations that can be made from Figure 2. First, the normal distributions for $\hat{\tau}$ and $\hat{\tau}^*$ (constructed using Theorems 1 and 2, respectively) fit the empirical distributions of these estimators; this confirms our theoretical results, including the comparison of these estimators given in Remark 2. Second, under this setting, $\hat{\tau}^*$ is slightly more dispersed than $\hat{\tau}$, suggesting that using the pre-treatment measurements in this case inflates the variance of the ATE estimator.

Figure 2: 10,000 replications of $\hat{\tau}$ (top) and $\hat{\tau}^*$ (bottom) when the pre- and post-treatment measurements are moderately associated and there is strong treatment effect heterogeneity. The lines denote normal densities, where the mean is the true treatment effect and the variance is given by Theorem 1 for $\hat{\tau}$ and Theorem 2 for $\hat{\tau}^*$.

This second observation begs the question: in this scenario, when will using the pre-treatment measurements improve estimation of the ATE? To address this question, we considered various values for the association parameter as well as the sample sizes $n_t$ and $m_t$ for $t = 0, 1$. Figure 3 displays a heatmap of the empirical ratio of the variance of $\hat{\tau}^*$ to the variance of $\hat{\tau}$ for various values of the association parameter and the sample size. Looking at the bottom row of this heatmap, we can see that for the sample size of 150 that was actually feasible for this experiment, even a high association would not have led to $\hat{\tau}^*$ being more precise than $\hat{\tau}$. For this experiment, the sample size would have had to increase significantly, while keeping the association moderately high, in order for the pre-treatment measurements to be useful in increasing the precision of the ATE estimator. This echoes the observation made at the end of Section 4 that the use of gain scores is often not beneficial in improving ATE estimators.

Figure 3: Heatmap of the empirical ratio of the variance of $\hat{\tau}^*$ to the variance of $\hat{\tau}$ across the 10,000 replications, for various values of the association parameter in (20) and the sample sizes $n_t$ and $m_t$ for $t = 0, 1$. Blue shades suggest $\hat{\tau}^*$ is more precise; red shades suggest $\hat{\tau}$ is more precise.

In summary, the above simulation study confirms that Theorems 1 and 2 correctly establish the behavior of the ATE estimators $\hat{\tau}$ and $\hat{\tau}^*$ under Design 2. However, for these data that are based on the experiment from Remillard et al. (2017), an infeasibly large number of pre-treatment measurements would need to be obtained in order for $\hat{\tau}^*$ to be more precise than $\hat{\tau}$, and these measurements would need to be at least moderately associated with the post-treatment measurements. As noted in Remark 2, this will also be the case for other experiments, unless the post-treatment measurements are substantially more variable than the pre-treatment measurements, which was not the case for the Remillard et al. (2017) experiment.

6 Discussion

Many causal inference works have focused on experimental settings where the outcomes for all units in the experiment can be measured. In some settings, it is too expensive to conduct an experiment on all units of interest, and instead an experiment is conducted on a random sample of units. Texts such as Imbens and Rubin (2015) have shown that the inferential properties of common treatment effect estimators in these settings can be established by first accounting for the stochasticity of the sampling stage and then accounting for the stochasticity of the randomization stage. However, inferential properties under the experimental design scenario where the ordering of the sampling and randomization stages is reversed had not been established. Forms of this experimental design have become increasingly common in the physical, medical, and social sciences, and so it is important to understand statistical inference in this case.

We established the inferential properties of the mean-difference estimator under this experimental design scenario, and we compared our findings to results for other experimental designs. We found that the inferential properties of the mean-difference estimator under this experimental design scenario are identical to those under the sample-first-randomize-second design, which is the more common experimental design discussed in the literature. Thus, the ordering of the randomization and sampling stages is inconsequential for inference of average treatment effects. We also assessed if pre-treatment measurements of units’ outcomes can be used to improve upon the mean-difference estimator for this experimental design scenario. We found that this is only the case if (1) the pre-treatment measurements are highly predictive of the outcome, (2) the post-treatment outcomes are substantially variable relative to the pre-treatment measurements, and (3) pre-treatment measurements are obtained for a large portion of the population. We also conducted a simulation study based on an experiment in nanomaterials (Remillard et al. 2017) and found that these results hold for realistic applications.

A recent strand of causal inference literature has elucidated and leveraged the connection between experimental design and finite-population sampling to refine theory and methodology for randomized experiments. This includes theory on design-based estimators for treatment effects (Samii and Aronow 2012; Aronow and Middleton 2013), properties of covariate-adjustment in randomized experiments (Freedman 2008; Lin 2013; Miratrix et al. 2013), and methods for estimating treatment effects in complex experimental settings such as cluster-randomized experiments (Middleton and Aronow 2015), experiments with interference (Aronow and Samii 2013), and experiments with multiple treatments (Mukerjee et al. 2018). This work continues this trend of using experimental design and finite-population sampling techniques to characterize treatment effect estimation in randomized experiments. As we discussed in Section 3.2, a promising line for future work is to establish analogous results for other types of experiments, such as cluster-randomized designs, where there are multiple stages of stochasticity through sampling, randomization, and other mechanisms.

Appendix A Proof of Lemma 1

To prove (5), note that

$$E[R_i(t)] = \Pr\big(R_i(t) = 1\big) = \Pr(W_i = t)\,\Pr\big(R_i(t) = 1 \mid W_i = t\big) = \frac{N_t}{N} \cdot \frac{n_t}{N_t} = \frac{n_t}{N}.$$

Next, because $R_i(t)$ is a Bernoulli random variable, we have that

$$\mathrm{Var}\big(R_i(t)\big) = \frac{n_t}{N}\left(1 - \frac{n_t}{N}\right),$$

which proves (6).

To prove (7), first we consider the case where $i \neq j$ and $t = t'$:

$$E\big[R_i(t) R_j(t)\big] = \frac{N_t(N_t - 1)}{N(N-1)} \cdot \frac{n_t(n_t - 1)}{N_t(N_t - 1)} = \frac{n_t(n_t - 1)}{N(N-1)}.$$

Next, for the case where $i = j$ and $t \neq t'$, we have that $R_i(t) R_i(t') = 0$, because unit $i$ can be assigned to only one treatment group. Thus, it follows that

$$\mathrm{Cov}\big(R_i(t), R_i(t')\big) = 0 - \frac{n_t}{N} \cdot \frac{n_{t'}}{N} = -\frac{n_1 n_0}{N^2},$$

after a little algebra.

Finally, for the case where $i \neq j$ and $t \neq t'$, we have that:

$$E\big[R_i(t) R_j(t')\big] = \frac{N_t N_{t'}}{N(N-1)} \cdot \frac{n_t}{N_t} \cdot \frac{n_{t'}}{N_{t'}} = \frac{n_1 n_0}{N(N-1)}.$$

Appendix B Proof of Theorem 1

To prove the first part, note that for $t = 0, 1$, $\bar{y}_t$ defined in (8) is an unbiased estimator of $\bar{Y}(t)$ by (5). Consequently, from (4), $\hat{\tau}$ is an unbiased estimator of $\tau$. To prove the second part, we need the following three lemmas:

Lemma 4.

For $t = 0, 1$,

$$\mathrm{Var}(\bar{y}_t) = \left(\frac{1}{n_t} - \frac{1}{N}\right) S_t^2.$$

By (1), for $t = 0, 1$,

Lemma 5.

For $t = 0, 1$,

$$E[\bar{y}_t] = \bar{Y}(t).$$

By (8), for $t = 0, 1$,

Lemma 6.

The covariance between $\bar{y}_1$ and $\bar{y}_0$ is given by:

$$\mathrm{Cov}(\bar{y}_1, \bar{y}_0) = -\frac{S_{10}}{N}.$$