1 Introduction: The Potential Outcomes Framework for Randomized Experiments
In most two-armed randomized experiments conducted to compare the causal effects of two treatments (or one treatment and one control), a finite population of experimental units is considered. These units receive one of the treatments through a randomized assignment mechanism, and the observed outcomes from the two treatment groups are compared to draw inference on causal estimands of interest. Such causal estimands can be defined in terms of potential outcomes, and the framework for drawing inference in the setup described above is the well-known Neyman-Rubin causal model (Sekhon 2008) or simply the Rubin causal model (Holland 1986). In particular, this model posits that each unit $i$ has two fixed potential outcomes, $Y_i(1)$ and $Y_i(0)$, denoting unit $i$'s outcome under treatment and control, respectively, and thus the only stochasticity in a randomized experiment is units' assignment to treatment groups.
The most common estimand in the causal inference literature is the finite-population average treatment effect (ATE), defined as $\tau = \bar{Y}(1) - \bar{Y}(0)$, the difference in average potential outcomes under treatment and control. In a randomized experiment, ultimately each unit can only be assigned to treatment or control, never both, and thus only $Y_i(1)$ or $Y_i(0)$ is observed for each unit. As a result, $\tau$ must be estimated. The most common estimator for $\tau$ is the mean-difference estimator, i.e., the mean difference in the observed treatment outcomes and observed control outcomes. In this finite-population setup where the potential outcomes are assumed fixed, the mean-difference estimator is unbiased for $\tau$, and its sampling variance was derived by Neyman (1923). Using this estimator and sampling variance, a normal approximation can be used to draw inference on $\tau$. If additional covariates are available for each unit, covariate adjustment can be performed to obtain more precise estimators for $\tau$, such as through poststratification (Miratrix et al. 2013) or regression (Lin 2013).

However, in many practical situations, it may not be possible to observe a response for all units due to resource constraints. A natural way to address this limitation is to first randomly sample units from the population and then randomly assign the two treatments to the sampled units. For example, this is a common procedure in political science and other social sciences, where an experiment is conducted on a representative sample of some larger population, often through a well-designed online survey (Mutz 2011; Coppock 2018; Miratrix et al. 2018). In this case, the mean-difference estimator is still unbiased for the ATE, and its sampling variance has been studied in works such as Imbens (2004) and Imbens and Rubin (2015, Chapter 6). Then, one can still use the mean-difference estimator and its sampling variance to draw inference on $\tau$ via a normal approximation. This scenario, where $n$ units are sampled from a population of $N$ units and an experiment is then conducted on the $n$ sampled units, is shown in the left panel of Figure 1.
This article is motivated by a similar limitation as stated in the previous paragraph, but the experimental design is different in the sense that the order of sampling and randomization is reversed. That is, first, each of the $N$ units is assigned to one of the two treatments. Then, a subset of units is sampled from the units exposed to each treatment, and the response is measured for each sampled unit. This experimental design scenario is shown in the right panel of Figure 1. Such a strategy may be useful if sampling and measurement of the response is more expensive and/or complex compared to treatment assignment. Examples of experiments that involve this type of design can be found in materials science, biomedicine, and the social sciences. For example, in an experiment conducted to assess the difference between two types of oxidation procedures with respect to their impact on the dimension of nanotubes, scientists typically split a population (a container of nanotubes) into two subpopulations (two smaller containers) and apply different oxidation procedures to the two subpopulations (Remillard et al. 2016, 2017). However, since measuring dimensions of nanotubes is an expensive and time-consuming process, a sample of oxidized nanotubes is taken from each subpopulation, and the dimensions of the sampled nanotubes are measured. A similar procedure is followed when conducting stem cell experiments (Chung et al. 2005; Doi et al. 2009).
The main contribution of this article is deriving the sampling properties of the mean-difference estimator for the ATE under Design 2; we do this in Section 3 after setting up some notation in Section 2. We will find that these sampling properties are identical to those under Design 1, meaning that the ordering of the sampling and randomization stages is inconsequential for the precision of the mean-difference estimator. This finding will also bring some clarifications about inference for the ATE under Design 1, as we discuss in Section 3.1. In particular, the variance of the mean-difference estimator under Design 1 is often characterized by asymptotic results in which the population size or the sample size grows to infinity (Imbens 2004; Imbens and Rubin 2015; Sekhon and Shem-Tov 2017), and our work illuminates the case where the population size $N$ and sample size $n$ are finite. Our work also gives insight into other sampling-based randomized designs: for example, Design 2 is similar, but not identical, to a cluster-randomized design (Campbell et al. 2007), where one cluster is assigned to treatment group 1 and another cluster is assigned to treatment group 0, and random samples are obtained within each cluster. Cluster-randomized designs are quite common in education (Hedges and Hedberg 2007), medicine (Eldridge et al. 2004), and psychology (Raudenbush 1997), and we discuss how Design 2 compares to cluster-randomized designs in Section 3.2. Then, in Section 4, we consider the case where baseline (pretreatment) measurements of the response can be obtained for a sample of experimental units, and we explore the extent to which the mean-difference estimator can be improved under Design 2 by using such measurements. Unfortunately, we will find that pretreatment measurements are often unhelpful in improving the precision of average treatment effect estimators under this design, unless a large number of pretreatment measurements that are highly associated with the posttreatment measurements can be obtained.
In Section 5, we confirm these results by conducting a simulation study based on a real experiment in nanomaterials. In Section 6, we conclude.
2 Setup and Notations
Let the two treatments be denoted by 0 and 1, and for $i = 1, \ldots, N$, let $Y_i(0)$ and $Y_i(1)$ denote the potential outcomes for unit $i$ when exposed to treatments 0 and 1, respectively. The unit-level treatment effect is
$$\tau_i = Y_i(1) - Y_i(0),$$
and the finite-population ATE is
$$\tau = \frac{1}{N}\sum_{i=1}^N \tau_i = \bar{Y}(1) - \bar{Y}(0),$$
where $\bar{Y}(1)$ and $\bar{Y}(0)$ denote the average potential outcomes for the treatment groups 1 and 0, respectively. We also denote the variances of the potential outcomes for treatment group $t \in \{0, 1\}$ as follows:
$$S_t^2 = \frac{1}{N-1}\sum_{i=1}^N \left(Y_i(t) - \bar{Y}(t)\right)^2, \qquad (1)$$
and the covariance of potential outcomes for treatment groups 1 and 0 by
$$S_{1,0} = \frac{1}{N-1}\sum_{i=1}^N \left(Y_i(1) - \bar{Y}(1)\right)\left(Y_i(0) - \bar{Y}(0)\right). \qquad (2)$$
Finally, the variance of the unit-level treatment effects is
$$S_\tau^2 = \frac{1}{N-1}\sum_{i=1}^N \left(\tau_i - \tau\right)^2. \qquad (3)$$
As in Design 2 shown in the right panel of Figure 1, $N_1$ and $N_0$ units are assigned to treatments 1 and 0 respectively, where $N_1$ and $N_0$ are predetermined. From these two groups, $n_1$ and $n_0$ units are sampled and their responses are observed. Let $\bar{y}_1$ and $\bar{y}_0$ denote the observed averages for the treatment groups 1 and 0, respectively. Then, a natural estimator of $\tau$ is
$$\hat{\tau}_2 = \bar{y}_1 - \bar{y}_0. \qquad (4)$$
We will examine the sampling properties of $\hat{\tau}_2$ and compare them with those of a similar estimator obtained from Design 1 shown in Figure 1.
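To make Design 2 concrete, the following sketch simulates one realization of the design for a small, fixed set of potential outcomes. The function name, population, and sample sizes are illustrative choices, not taken from the article:

```python
import numpy as np

def design2_estimate(y1, y0, N1, n1, n0, rng):
    """One realization of Design 2 for fixed potential-outcome vectors y1, y0:
    randomly assign N1 of the N units to treatment 1 (the rest to treatment 0),
    then sample n1 and n0 responses within the two arms and return the
    mean difference of the observed responses."""
    N = len(y1)
    perm = rng.permutation(N)
    treat, ctrl = perm[:N1], perm[N1:]
    obs1 = rng.choice(treat, size=n1, replace=False)  # responses measured in arm 1
    obs0 = rng.choice(ctrl, size=n0, replace=False)   # responses measured in arm 0
    return y1[obs1].mean() - y0[obs0].mean()

rng = np.random.default_rng(0)
y0 = np.arange(10.0)      # fixed potential outcomes under control
y1 = y0 + 2.0             # constant unit-level effect of 2, so the true ATE is 2
est = design2_estimate(y1, y0, N1=5, n1=3, n0=3, rng=rng)
```

Averaging this estimate over many simulated realizations recovers the true ATE, reflecting the unbiasedness established in Section 3.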
3 Sampling properties of the estimator of the ATE
Because the potential outcomes are assumed fixed, the sampling properties of $\hat{\tau}_2$ defined in (4) will be determined by the random variables associated with the randomization and sampling stages. We define two such variables now. For $i = 1, \ldots, N$, let $T_i$ denote a Bernoulli random variable indicating the random level of treatment (0 or 1) that unit $i$ receives. Recalling that the assignment mechanism essentially involves a random partitioning of the $N$ units into two groups of predetermined sizes $N_1$ and $N_0$, the properties of the assignment vector $(T_1, \ldots, T_N)$ are straightforward to establish and can be found in standard texts (e.g., Imbens and Rubin 2015). Next, define $S_i(t)$ as an indicator random variable equaling 1 if the potential outcome $Y_i(t)$ is randomly sampled among the units assigned to treatment level $t$. Importantly, note that we know that $S_i(1) = 0$ conditional on $T_i = 0$; i.e., the $i$th unit will not be sampled from treatment group 1 if it was assigned to treatment group 0. As shown in the right panel of Figure 1, we assume that samples of size $n_0$ and $n_1$ are sampled from treatment groups 0 and 1, respectively. Properties of the $S_i(t)$ are crucial in the derivation of sampling properties of $\hat{\tau}_2$, and are summarized in Lemma 1. The proofs are in Appendix A.

Lemma 1.
The properties of the sampling indicators $S_i(t)$ can be summarized as:
$$E[S_i(t)] = \frac{n_t}{N}, \qquad (5)$$
$$\mathrm{Var}\left(S_i(t)\right) = \frac{n_t}{N}\left(1 - \frac{n_t}{N}\right), \qquad (6)$$
$$\mathrm{Cov}\left(S_i(t), S_j(t')\right) = \begin{cases} -\dfrac{n_1 n_0}{N^2} & \text{if } i = j,\ t \neq t', \\[1.5ex] -\dfrac{n_t (N - n_t)}{N^2 (N - 1)} & \text{if } i \neq j,\ t = t', \\[1.5ex] \dfrac{n_1 n_0}{N^2 (N - 1)} & \text{if } i \neq j,\ t \neq t'. \end{cases} \qquad (7)$$
Remark 1.
To provide some intuition for (7), note that we expect the quantity $\mathrm{Cov}(S_i(t), S_j(t))$ to be negative if $i \neq j$, because if we know that unit $i$ is sampled in treatment group $t$, then unit $j$ is less likely to be sampled in the same group. Similarly, we expect the quantity $\mathrm{Cov}(S_i(1), S_j(0))$ to be positive, because if we know that unit $i$ is sampled in treatment group 1, then unit $j$ is more likely to be sampled in treatment group 0.
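These signs can also be checked by brute-force simulation; the population and sample sizes below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
N, N1, n1, n0 = 8, 4, 2, 2
reps = 30_000

# S1[r, i] (S0[r, i]) indicates whether Y_i(1) (Y_i(0)) is sampled in replication r.
S1 = np.zeros((reps, N))
S0 = np.zeros((reps, N))
for r in range(reps):
    perm = rng.permutation(N)               # random partition into the two arms
    treat, ctrl = perm[:N1], perm[N1:]
    S1[r, rng.choice(treat, n1, replace=False)] = 1
    S0[r, rng.choice(ctrl, n0, replace=False)] = 1

cov_same_arm = np.cov(S1[:, 0], S1[:, 1])[0, 1]  # units i != j, same arm
cov_diff_arm = np.cov(S1[:, 0], S0[:, 1])[0, 1]  # units i != j, different arms
```

With these sizes the empirical covariances come out negative and positive, respectively, matching the intuition above.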
We now represent the estimator $\hat{\tau}_2$ in (4) in terms of the potential outcomes and the random variables $T_i$ and $S_i(t)$ defined above. To do this, note that the average observed response in treatment group $t$ can be written as
$$\bar{y}_t = \frac{1}{n_t}\sum_{i=1}^N S_i(t)\, Y_i(t). \qquad (8)$$
By combining (4) and (8) and using Lemma 1, we now derive the sampling properties of the estimator $\hat{\tau}_2$ and summarize them in Theorem 1. The proof of Theorem 1 is in Appendix B.
Theorem 1.
The estimator $\hat{\tau}_2$ given by (4) satisfies the following properties:
$$E[\hat{\tau}_2] = \tau, \qquad \mathrm{Var}(\hat{\tau}_2) = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} - \frac{S_\tau^2}{N}. \qquad (9)$$
3.1 Comparison with two other sampling-based randomized designs
We now compare the sampling properties of $\hat{\tau}_2$ as derived in Theorem 1 with the unbiased estimators of $\tau$ obtained from two other designs: (i) a design where responses for all $N$ units are observed, that is, no sampling is involved, and (ii) a design described as Design 1 in the left panel of Figure 1, where $n = n_1 + n_0$ units are first sampled from the $N$ units in the population and subsequently exposed to treatments. We denote the estimator of $\tau$ from (i) by $\hat{\tau}_{\mathrm{all}}$ and that from (ii) by $\hat{\tau}_1$. Both $\hat{\tau}_{\mathrm{all}}$ and $\hat{\tau}_1$ are unbiased estimators of $\tau$. Neyman (1923) derived the following result on the sampling variance of $\hat{\tau}_{\mathrm{all}}$:
$$\mathrm{Var}(\hat{\tau}_{\mathrm{all}}) = \frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} - \frac{S_\tau^2}{N}. \qquad (10)$$
As a result, since $n_1 \leq N_1$ and $n_0 \leq N_0$, we obtain the inequality $\mathrm{Var}(\hat{\tau}_2) \geq \mathrm{Var}(\hat{\tau}_{\mathrm{all}})$, reflecting the price one has to pay for sampling from the finite population.
Meanwhile, to our knowledge, Imbens and Rubin (2015, Chapter 6, Appendix B) were the first to derive the expression of the finite-population sampling variance of $\hat{\tau}_1$, although others have discussed the infinite-population case where $N \rightarrow \infty$ (e.g., Imbens 2004). However, there was a slight error in the derivation in Imbens and Rubin (2015, Chapter 6, Appendix B): the covariance between the sampling indicators was misstated; for a sample of size $n$ its correct value is $-n(N-n)/\{N^2(N-1)\}$, as can be seen from the $i \neq j$, $t = t'$ case in Lemma 1 and other works on survey sampling (e.g., Lohr 2009, Page 53). As we show in Appendix C, by making this correction, and then following the derivation of Imbens and Rubin (2015, Chapter 6, Appendix B), we have the following expression of the sampling variance of $\hat{\tau}_1$:
$$\mathrm{Var}(\hat{\tau}_1) = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} - \frac{S_\tau^2}{N}. \qquad (11)$$
Thus, the precision of the mean-difference estimator under Design 1 is identical to that under Design 2; in other words, the order of the sampling and randomization stages is inconsequential in terms of the precision of the mean-difference estimator. This finding also clarifies some discussions in the causal inference literature that consider a superpopulation framework. When discussing causal inference under a superpopulation framework, many works implicitly assume that the population is infinite or that the sampling fraction is negligible, and thus the third term in (11), $S_\tau^2/N$, is often ignored (Imbens 2004; Imbens and Rubin 2015, Chapter 6; Ding et al. 2017; Sekhon and Shem-Tov 2017). As we will see in Section 3.3, this term is also ignored in the estimation of $\mathrm{Var}(\hat{\tau}_1)$ or $\mathrm{Var}(\hat{\tau}_2)$, because the observed data do not provide any information about $S_\tau^2$, since none of the individual treatment effects $\tau_i$ are ever observed. However, the above derivation emphasizes that this term nonetheless exists in the true variance of the mean-difference estimator under Design 1 and Design 2 when the superpopulation that is sampled from is finite.
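This equality of precision can be checked numerically. The sketch below simulates both designs on one arbitrary finite population and compares the Monte Carlo variances with the common variance formula; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, N1, n1, n0 = 12, 6, 3, 3
y0 = np.linspace(0.0, 5.0, N)         # fixed potential outcomes under control
y1 = y0 + rng.normal(2.0, 1.0, N)     # heterogeneous unit-level effects

def design1(rng):
    """Sample n1 + n0 units from the population, then randomize them."""
    s = rng.choice(N, n1 + n0, replace=False)
    return y1[s[:n1]].mean() - y0[s[n1:]].mean()

def design2(rng):
    """Randomize all N units first, then sample within each arm."""
    perm = rng.permutation(N)
    t = rng.choice(perm[:N1], n1, replace=False)
    c = rng.choice(perm[N1:], n0, replace=False)
    return y1[t].mean() - y0[c].mean()

reps = 40_000
v1 = np.var([design1(rng) for _ in range(reps)])
v2 = np.var([design2(rng) for _ in range(reps)])
# Common variance formula: S1^2/n1 + S0^2/n0 - S_tau^2/N
theory = y1.var(ddof=1) / n1 + y0.var(ddof=1) / n0 - (y1 - y0).var(ddof=1) / N
```

Both empirical variances agree with the formula up to Monte Carlo error, regardless of the ordering of the two stages.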
3.2 Comparison with cluster-randomized designs
As discussed in Section 1, Design 2 is similar but not identical to a cluster-randomized design, where treatment is assigned among clusters of units but the response is measured at the unit level. The most important distinction between Design 2 and a cluster-randomized design is that, in Design 2, $N_1$ and $N_0$ (the number of units assigned to treatment groups 1 and 0, respectively) are fixed, whereas in a cluster-randomized design these quantities are stochastic. For example, consider a cluster-randomized design with two clusters of unequal sizes, Cluster A and Cluster B, where one cluster is randomly assigned to treatment group 0 and the other to treatment group 1. There is a 0.5 probability that $N_1$ equals the size of Cluster A and a 0.5 probability that it equals the size of Cluster B, and $N_0$ is analogously stochastic. This stochasticity causes many complications for inference; for example, the mean-difference estimator is often biased in cluster-randomized designs, even when all units' responses can be measured (Middleton 2008; Middleton and Aronow 2015). Consequently, there is not a straightforward analog to our results that will hold for cluster-randomized designs; indeed, our results will not even hold for the simple two-cluster example mentioned above. Because we require that $N_1$ and $N_0$ are fixed, our results will hold for cluster-randomized designs if the number of units within each cluster is the same across clusters, but this is rarely the case. As discussed in Middleton (2008) and Middleton and Aronow (2015), cluster sizes and the covariance between treatment group sizes and treatment effects are important quantities for deriving inferential properties of ATE estimators in cluster-randomized designs, and these quantities would likely be similarly important in deriving analogous results when there is a sampling stage after treatment assignment at the cluster level. We leave this for future work.
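The complication from stochastic group sizes can be seen in a toy two-cluster example; the cluster sizes and outcomes below are arbitrary inventions for illustration:

```python
import numpy as np

# Toy cluster-randomized design: Cluster A (3 units), Cluster B (7 units);
# one whole cluster gets treatment, the other control, with probability 1/2 each.
yA1, yA0 = np.array([10., 11., 12.]), np.array([8., 9., 10.])   # effect 2 per unit
yB1 = np.array([1., 2., 3., 4., 5., 6., 7.])
yB0 = np.array([0., 1., 2., 3., 4., 5., 6.])                     # effect 1 per unit

y1 = np.concatenate([yA1, yB1])
y0 = np.concatenate([yA0, yB0])
ate = (y1 - y0).mean()                 # true finite-population ATE: 1.3

# The two equally likely randomizations:
est_A_treated = yA1.mean() - yB0.mean()   # 11 - 3 = 8
est_B_treated = yB1.mean() - yA0.mean()   # 4 - 9 = -5
expected_est = 0.5 * (est_A_treated + est_B_treated)   # 1.5
```

The estimator's expectation (1.5) differs from the true ATE (1.3), so with unequal clusters the mean-difference estimator is biased even before any sampling stage is introduced.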
3.3 Estimation of sampling variance and approximate confidence intervals
Let
$$s_t^2 = \frac{1}{n_t - 1}\sum_{i=1}^N S_i(t)\left(Y_i(t) - \bar{y}_t\right)^2, \quad t \in \{0, 1\}, \qquad (12)$$
denote the sample variances of observed responses for treatment groups 0 and 1, respectively. From standard sampling theory (e.g., Lohr 2009, Pages 52-54), it follows that $s_1^2$ and $s_0^2$ are unbiased estimators of $S_1^2$ and $S_0^2$, respectively. Consequently,
$$\widehat{\mathrm{Var}} = \frac{s_1^2}{n_1} + \frac{s_0^2}{n_0} \qquad (13)$$
is a natural Neymanian-style estimator of $\mathrm{Var}(\hat{\tau}_1)$ and $\mathrm{Var}(\hat{\tau}_2)$, which, as seen in Section 3.1, has an upward bias of $S_\tau^2/N$ unless strict additivity holds. This estimator, originating in Neyman (1923), is by far the most common estimator for the variance of the ATE in randomized experiments (Rubin 1990; Imbens 2004; Miratrix et al. 2013; Imbens and Rubin 2015; Ding et al. 2017), and thus it is reassuring that it can be used for $\mathrm{Var}(\hat{\tau}_1)$ and $\mathrm{Var}(\hat{\tau}_2)$, i.e., it can be used under Design 1 or Design 2.
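A minimal sketch of this Neymanian variance estimator, together with the normal-approximation confidence interval it is typically used for, is below; the function name is an illustrative choice:

```python
import numpy as np
from statistics import NormalDist

def neyman_ci(y_treat, y_ctrl, alpha=0.05):
    """Mean-difference estimate, conservative Neymanian variance estimate
    s1^2/n1 + s0^2/n0, and an approximate (1 - alpha) confidence interval."""
    y_treat = np.asarray(y_treat, dtype=float)
    y_ctrl = np.asarray(y_ctrl, dtype=float)
    tau_hat = y_treat.mean() - y_ctrl.mean()
    var_hat = (y_treat.var(ddof=1) / y_treat.size
               + y_ctrl.var(ddof=1) / y_ctrl.size)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # standard normal quantile
    half = z * var_hat ** 0.5
    return tau_hat, var_hat, (tau_hat - half, tau_hat + half)
```

The interval is conservative on average, since the variance estimator carries the upward bias noted above.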
Estimator (13) can be used to obtain approximate confidence intervals for $\tau$ as
$$\hat{\tau}_2 \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}},$$
where $z_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$ quantile of a standard normal distribution. The asymptotic normality of $\hat{\tau}_2$ is based on the finite population central limit theorem (Hájek 1960) and its application in the context of randomized experiments by Li and Ding (2017).

4 Can the estimator be improved using pretreatment measurements?
Now assume that the response can be measured for each experimental unit prior to application of the treatments. Let $x_1, \ldots, x_N$ denote these measurements for the $N$ units, which are fixed quantities like the potential outcomes. The unit-level differences $Y_i(1) - x_i$ and $Y_i(0) - x_i$ for $i = 1, \ldots, N$ are referred to as "gain scores" in the psychology (Rumrill and Bellini 2017) and education (McGowen and Davis 2002) literature. While the unit-level gain scores or their averages $\bar{Y}(1) - \bar{x}$ and $\bar{Y}(0) - \bar{x}$ are of interest in psychology and education (Hake 1998), they are not referred to as causal estimands or causal effects (Rubin et al. 2004; Imbens and Rubin 2015). However, it is easy to see that the unit-level and average treatment effects can respectively be expressed as the unit-level and average differences of the gain scores, i.e., $\tau_i = \{Y_i(1) - x_i\} - \{Y_i(0) - x_i\}$ and $\tau = \{\bar{Y}(1) - \bar{x}\} - \{\bar{Y}(0) - \bar{x}\}$. In spite of this connection, several experts in education and psychology have recommended avoiding the use of gain scores when estimating treatment effects in experiments and observational studies. Campbell and Erlebacher (1970) claimed that "gain scores are in general a treacherous quicksand," and Cronbach and Furby (1970) recommended that researchers "frame their questions in other ways." Despite some recent interest in utilizing gain scores to identify causal effects, there appears to be a general aversion in the causal inference community towards the use of gain scores. Here we explore whether, in the current setup, a proper design and analysis of the experiment using gain scores can potentially lead to more precise estimation of treatment effects under certain assumptions.
In the setup of Design 2 in Figure 1, where measuring the pretreatment response for each unit is not feasible, it is common practice to measure the pretreatment response for a random sample of units in order to estimate the average gain scores $\bar{Y}(1) - \bar{x}$ and $\bar{Y}(0) - \bar{x}$. If a single pretreatment sample is drawn from the whole population, the two gain-score estimators share the same estimated average pretreatment score. While the sampling properties of these estimators can be readily obtained, they do not help in increasing the precision of the estimator of $\tau$, because the average pretreatment scores cancel out in the difference of the gain-score estimators. However, if samples of $m_1$ and $m_0$ pretreatment observations are taken independently from the treatment groups 1 and 0 after assignment but before administration of the treatments, it is possible to obtain a different estimator of the ATE. For $t \in \{0, 1\}$, let $S_i^*(t)$ denote the sampling indicator associated with the random sampling of $x_i$ among the units assigned to treatment $t$. Then
$$\bar{x}_t = \frac{1}{m_t}\sum_{i=1}^N S_i^*(t)\, x_i, \quad t \in \{0, 1\}, \qquad (14)$$
are the observed sample averages of pretreatment responses for the two treatment groups. Then we can define the following estimator of the ATE:
$$\hat{\tau}_{\mathrm{pre}} = \left(\bar{y}_1 - \bar{x}_1\right) - \left(\bar{y}_0 - \bar{x}_0\right), \qquad (15)$$
where $\bar{x}_1$ and $\bar{x}_0$ are given by (14). The sampling properties of $\hat{\tau}_{\mathrm{pre}}$ depend on the distribution of the indicators $S_i^*(t)$ and the joint distribution of $S_i(t)$ and $S_j^*(t')$. These properties are summarized in the following two lemmas.

Lemma 2.
The properties of the indicators $S_i^*(t)$ are exactly the same as those of $S_i(t)$ stated in Lemma 1, just replacing $n_1$ by $m_1$ and $n_0$ by $m_0$.
Lemma 3.
For $i, j \in \{1, \ldots, N\}$ and $t, t' \in \{0, 1\}$, the covariance between the indicators $S_i(t)$ and $S_j^*(t')$ is given by:
$$\mathrm{Cov}\left(S_i(t), S_j^*(t')\right) = \begin{cases} \dfrac{n_t m_t (N - N_t)}{N^2 N_t} & \text{if } i = j,\ t = t', \\[1.5ex] -\dfrac{n_t m_{t'}}{N^2} & \text{if } i = j,\ t \neq t', \\[1.5ex] -\dfrac{n_t m_t (N - N_t)}{N^2 (N - 1) N_t} & \text{if } i \neq j,\ t = t', \\[1.5ex] \dfrac{n_t m_{t'}}{N^2 (N - 1)} & \text{if } i \neq j,\ t \neq t'. \end{cases}$$
Theorem 2.
The estimator $\hat{\tau}_{\mathrm{pre}}$ given by (15) is unbiased for $\tau$, and its sampling variance is
$$\mathrm{Var}(\hat{\tau}_{\mathrm{pre}}) = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} - \frac{S_\tau^2}{N} + S_x^2\left(\frac{1}{m_1} + \frac{1}{m_0}\right) - 2\left(\frac{S_{1,x}}{N_1} + \frac{S_{0,x}}{N_0}\right),$$
where $S_x^2$ denotes the finite-population variance of the pretreatment measurements $x_i$, and $S_{t,x}$ denotes the finite-population covariance between the potential outcomes $Y_i(t)$ and the $x_i$.
Remark 2.
From Theorem 2, it follows that $\hat{\tau}_{\mathrm{pre}}$ is a more efficient estimator than $\hat{\tau}_2$ if and only if
$$S_x\left(\frac{1}{m_1} + \frac{1}{m_0}\right) < 2\left(\frac{\rho_1 S_1}{N_1} + \frac{\rho_0 S_0}{N_0}\right), \qquad (19)$$
where $\rho_t$ denotes the finite-sample correlation coefficient between the potential outcomes $Y_i(t)$ and the pretreatment measurements $x_i$. In order for condition (19) to be achieved, the pretreatment measurements need to be highly predictive of the outcome, the posttreatment outcomes need to be substantially variable relative to the pretreatment measurements, and pretreatment measurements need to be obtained for a large portion of the population. For example, consider the case where we have a balanced design (i.e., $N_1 = N_0 = N/2$) and balanced sample sizes (i.e., $m_1 = m_0 = m$). In this case, even if the pretreatment measurements are perfectly correlated with the outcomes (i.e., $\rho_1 = \rho_0 = 1$), in order for (19) to be achieved we need $m$ to be a sizable fraction of $N$, and usually $m$ will be small due to resource constraints. Similarly, if the pretreatment measurements are moderately correlated with the outcomes (e.g., $\rho_1 = \rho_0 = 0.5$) and all units' pretreatment measurements are observed (i.e., $m_1 + m_0 = N$), we still need $(S_1 + S_0)/2 > S_x$
, i.e., the average standard deviation of the posttreatment outcomes needs to be larger than the standard deviation of the pretreatment measurements. Cases where $\rho_t = 1$ or $m_1 + m_0 = N$ are indeed quite extreme and unrealistic conditions, and thus Theorem 2 in general gives credence to the skepticism many causal inference experts have about using gain scores for estimation of treatment effects in the context of the experimental design discussed here.

5 Simulation Study: Considering an Experiment in Nanomaterials
To assess how the findings from Theorems 1 and 2 are informative for real experiments, we now conduct a simulation study based on a real experiment in nanomaterials (Remillard et al. 2017). One purpose of this experiment was to assess how the dimensions of carbon nanotubes change as they undergo different processing procedures. One such procedure is sonication, the primary purpose of which is to disperse the nanotubes. However, sonication may also result in the undesirable outcome of breaking the nanotubes, possibly causing changes to their length. The experiment considered two types of sonication—bath sonication and probe sonication—which are the treatments in this application. Bath sonication is a more gentle procedure than probe sonication, so it was hypothesized that bath sonication would not decrease the length of the carbon nanotubes as much as probe sonication.
Because of practical constraints, this experiment was conducted using Design 2 in Figure 1. Because the nanotubes are so small, it was infeasible to select individual nanotubes for treatment; instead, a container of nanotubes was evenly divided into two smaller containers, and each of these containers underwent bath sonication or probe sonication. After sonication, nanotubes were randomly selected from each container and their lengths were measured. Furthermore, the length of a random sample of nanotubes was measured before treatment. Primary interest was in the average difference in log-length between bath sonication and probe sonication, i.e., the ATE; the log-length was used due to skewness in the length distribution across carbon nanotubes.
In this article, we have discussed two ATE estimators under Design 2: the mean-difference estimator $\hat{\tau}_2$ defined in (4), and the estimator $\hat{\tau}_{\mathrm{pre}}$ defined in (15) that incorporates pretreatment information. In the real experiment conducted in Remillard et al. (2017), we could only implement Design 2 once, and thus we could not observe the behavior of these estimators across different randomizations and random samples of carbon nanotubes. In order to understand the behavior of these estimators across many implementations of Design 2, we will conduct a simulation study mimicking this experiment. (In actuality, the Remillard et al. (2017) experiment considered several treatment factors, such as oxidation as well as sonication, and it considered several types of carbon nanotubes. For ease of exposition, the simulation data discussed here are based on the subset of the experiment that considered the carbon nanotube called "D15L15" in Remillard et al. (2017). We chose this type of carbon nanotube because it was already oxidized, and thus only sonication, i.e., one treatment factor, was used in the actual experiment for this type of carbon nanotube.)
Consider the following implementation of Design 2 for this experiment: $N$ carbon nanotubes will be randomly divided into treatment groups of sizes $N_1$ (probe sonication) and $N_0$ (bath sonication). Random samples of sizes $m_1$ and $m_0$ will be obtained before treatment, and random samples of sizes $n_1$ and $n_0$ will be obtained after treatment; these sample sizes were set approximately to those used in the actual experiment. The individual log-length of each carbon nanotube will be measured for these samples.
First, the population of pretreatment measurements $x_i$ as well as posttreatment measurements $Y_i(0)$ and $Y_i(1)$ were generated using the following model:
(20)  
As discussed in Remillard et al. (2017), normality was a reasonable distributional assumption to place on the log-length of carbon nanotubes. The mean parameters, variance parameters, and ATE parameter were set to the sample estimates observed in the Remillard et al. (2017) experiment, on the log scale. The treatment effect heterogeneity parameter and the association parameter $\rho$ could not be estimated in the experiment, because $Y_i(1)$ and $Y_i(0)$ can never be jointly observed. In this simulation study, we set the heterogeneity parameter to induce strong treatment effect heterogeneity (Ding et al. 2016) and consider various values for $\rho$.
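A sketch of this generation step is below. All parameter values are placeholders rather than the estimates from the actual experiment, and the bivariate-normal construction used here is one simple way to induce an association $\rho$ between the pretreatment measurements and the potential outcomes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameters (NOT the estimates from the real experiment):
N = 10_000                 # population size
mu_x, sd_x = 6.0, 1.0      # mean / sd of pretreatment log-lengths
mu_y, sd_y = 6.0, 1.0      # mean / sd of control potential outcomes
tau, sd_tau = -0.5, 0.5    # ATE and treatment-effect heterogeneity
rho = 0.5                  # association between x_i and Y_i(0)

x = rng.normal(mu_x, sd_x, N)
# Conditional-normal construction gives corr(x, y0) = rho and sd(y0) = sd_y.
y0 = (mu_y + rho * (sd_y / sd_x) * (x - mu_x)
      + np.sqrt(1.0 - rho**2) * sd_y * rng.standard_normal(N))
y1 = y0 + rng.normal(tau, sd_tau, N)   # heterogeneous unit-level effects
```

The resulting triplet $(x_i, Y_i(0), Y_i(1))$ is then held fixed across all replications of the design.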
After the population-level triplets $(x_i, Y_i(0), Y_i(1))$ were generated once, the following randomization and sampling procedure (i.e., Design 2) was repeated 10,000 times:

1. Each unit was randomly assigned to treatment group 1 or treatment group 0 with equal probability, such that $N_1 = N_0$.

2. Within each treatment group $t$, $m_t$ units were randomly sampled and their $x_i$ recorded as observed pretreatment measurements.

3. Within each treatment group $t$, $n_t$ units were randomly sampled and their $Y_i(t)$ recorded as observed posttreatment measurements.
For each of the 10,000 replications, we recorded the ATE estimators $\hat{\tau}_2$ defined in (4) and $\hat{\tau}_{\mathrm{pre}}$ defined in (15). Recall that $\hat{\tau}_2$ is the mean-difference estimator, and its properties under this design are established by Theorem 1; meanwhile, $\hat{\tau}_{\mathrm{pre}}$ uses the pretreatment measurements to alter $\hat{\tau}_2$, and its properties are established by Theorem 2.
First, let us consider one simulation setting to confirm the results in Theorems 1 and 2. Figure 2 shows the empirical distributions of $\hat{\tau}_2$ and $\hat{\tau}_{\mathrm{pre}}$ after 10,000 replications of the above procedure when the pretreatment and posttreatment measurements are moderately associated. There are two observations that can be made from Figure 2. First, the normal distributions for $\hat{\tau}_2$ and $\hat{\tau}_{\mathrm{pre}}$ (constructed using Theorems 1 and 2, respectively) fit the empirical distributions of these estimators; this confirms our theoretical results, including the comparison of these estimators given in Remark 2. Second, under this setting, $\hat{\tau}_{\mathrm{pre}}$ is slightly more dispersed than $\hat{\tau}_2$, suggesting that using the pretreatment measurements in this case inflates the variance of the ATE estimator.
This second observation raises the question: in this scenario, when will using the pretreatment measurements improve estimation of the ATE? To address this question, we considered various values for the association parameter $\rho$ as well as the sample sizes $m_t$ and $n_t$ for $t \in \{0, 1\}$. Figure 3 displays a heatmap of the empirical ratio $\mathrm{Var}(\hat{\tau}_{\mathrm{pre}})/\mathrm{Var}(\hat{\tau}_2)$ for various values of $\rho$ and the sample size. Looking at the bottom row of this heatmap, we can see that, for the sample size of 150 that was actually feasible for this experiment, even a strong association between pretreatment and posttreatment measurements would not have led to $\hat{\tau}_{\mathrm{pre}}$ being more precise than $\hat{\tau}_2$. For this experiment, the sample size would have had to increase significantly, while keeping the association moderately high, in order for the pretreatment measurements to be useful in increasing the precision of the ATE estimator. This echoes the observation made at the end of Section 4 that the use of gain scores is often not beneficial in improving ATE estimators.
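A sketch of how such a variance-ratio comparison can be computed by Monte Carlo is below; the population, association, and sample sizes are illustrative placeholders rather than the actual experimental values:

```python
import numpy as np

rng = np.random.default_rng(0)
N, N1 = 200, 100
n1 = n0 = m1 = m0 = 30      # post- and pretreatment sample sizes (placeholders)
rho = 0.5

# One fixed finite population with association rho between x_i and Y_i(0).
x = rng.normal(0.0, 1.0, N)
y0 = rho * x + np.sqrt(1.0 - rho**2) * rng.normal(0.0, 1.0, N)
y1 = y0 + 2.0               # constant effect; the comparison concerns variances

def one_rep(rng):
    """One Design 2 replication: assignment, then independent pre/post samples."""
    perm = rng.permutation(N)
    treat, ctrl = perm[:N1], perm[N1:]
    t_post = rng.choice(treat, n1, replace=False)
    c_post = rng.choice(ctrl, n0, replace=False)
    t_pre = rng.choice(treat, m1, replace=False)
    c_pre = rng.choice(ctrl, m0, replace=False)
    tau2 = y1[t_post].mean() - y0[c_post].mean()
    tau_pre = tau2 - (x[t_pre].mean() - x[c_pre].mean())
    return tau2, tau_pre

draws = np.array([one_rep(rng) for _ in range(20_000)])
ratio = draws[:, 1].var() / draws[:, 0].var()   # Var(tau_pre) / Var(tau2)
```

With these modest pretreatment sample sizes the empirical ratio exceeds 1, i.e., the gain-score estimator is less precise, consistent with the heatmap discussion above.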
In summary, the above simulation study confirms that Theorems 1 and 2 correctly establish the behavior of the ATE estimators $\hat{\tau}_2$ and $\hat{\tau}_{\mathrm{pre}}$ under Design 2. However, for these data, which are based on the experiment from Remillard et al. (2017), an infeasibly large number of pretreatment measurements would need to be obtained in order for $\hat{\tau}_{\mathrm{pre}}$ to be more precise than $\hat{\tau}_2$, and these measurements would need to be at least moderately associated with the posttreatment measurements. As noted in Remark 2, this will also be the case for other experiments, unless the posttreatment measurements are substantially more variable than the pretreatment measurements, which was not the case for the Remillard et al. (2017) experiment.
6 Discussion
Many causal inference works have focused on experimental settings where the outcomes for all units in the experiment can be measured. In some settings, it is too expensive to conduct an experiment on all units of interest, and instead an experiment is conducted on a random sample of units. Texts such as Imbens and Rubin (2015) have shown that the inferential properties of common treatment effect estimators in these settings can be established by first accounting for the stochasticity of the sampling stage and then accounting for the stochasticity of the randomization stage. However, inferential properties under the experimental design scenario where the ordering of the sampling and randomization stages is reversed had not been established. Forms of this experimental design have become increasingly common in the physical, medical, and social sciences, and so it is important to understand statistical inference in this case.
We established the inferential properties of the mean-difference estimator under this experimental design scenario, and we compared our findings to results for other experimental designs. We found that the inferential properties of the mean-difference estimator under this experimental design scenario are identical to those under the sample-first-randomize-second design, which is the more common experimental design discussed in the literature. Thus, the ordering of the randomization and sampling stages is inconsequential for inference of average treatment effects. We also assessed whether pretreatment measurements of units' outcomes can be used to improve upon the mean-difference estimator for this experimental design scenario. We found that this is only the case if (1) the pretreatment measurements are highly predictive of the outcome, (2) the posttreatment outcomes are substantially variable relative to the pretreatment measurements, and (3) pretreatment measurements are obtained for a large portion of the population. We also conducted a simulation study based on an experiment in nanomaterials (Remillard et al. 2017) and found that these results hold for realistic applications.
A recent strand of the causal inference literature has elucidated and leveraged the connection between experimental design and finite-population sampling to refine theory and methodology for randomized experiments. This includes theory on design-based estimators for treatment effects (Samii and Aronow 2012; Aronow and Middleton 2013), properties of covariate adjustment in randomized experiments (Freedman 2008; Lin 2013; Miratrix et al. 2013), and methods for estimating treatment effects in complex experimental settings such as cluster-randomized experiments (Middleton and Aronow 2015), experiments with interference (Aronow and Samii 2013), and experiments with multiple treatments (Mukerjee et al. 2018). The present work continues this trend of using experimental design and finite-population sampling techniques to characterize treatment effect estimation in randomized experiments. As we discussed in Section 3.2, a promising line for future work is to establish analogous results for other types of experiments, such as cluster-randomized designs, where there are multiple stages of stochasticity through sampling, randomization, and other mechanisms.
Appendix A Proof of Lemma 1
To prove (5), note that
$$E[S_i(t)] = P\left(S_i(t) = 1\right) = P(T_i = t)\, P\left(S_i(t) = 1 \mid T_i = t\right) = \frac{N_t}{N} \cdot \frac{n_t}{N_t} = \frac{n_t}{N}.$$
Then (6) follows immediately, because $S_i(t)$ is a binary random variable and hence $\mathrm{Var}(S_i(t)) = E[S_i(t)]\left(1 - E[S_i(t)]\right)$.
To prove (7), first we consider the case where $i = j$ and $t \neq t'$: since a unit assigned to one treatment group cannot be sampled from the other, $S_i(1) S_i(0) = 0$, and therefore
$$\mathrm{Cov}\left(S_i(1), S_i(0)\right) = -E[S_i(1)]\, E[S_i(0)] = -\frac{n_1 n_0}{N^2}.$$
Next, for the case where $i \neq j$ and $t = t'$, we have that:
$$E\left[S_i(t) S_j(t)\right] = P(T_i = t, T_j = t)\, \frac{n_t (n_t - 1)}{N_t (N_t - 1)} = \frac{N_t (N_t - 1)}{N (N - 1)} \cdot \frac{n_t (n_t - 1)}{N_t (N_t - 1)} = \frac{n_t (n_t - 1)}{N (N - 1)}.$$
Thus, it follows that
$$\mathrm{Cov}\left(S_i(t), S_j(t)\right) = \frac{n_t (n_t - 1)}{N (N - 1)} - \frac{n_t^2}{N^2} = -\frac{n_t (N - n_t)}{N^2 (N - 1)}$$
after a little algebra. Finally, for the case where $i \neq j$ and $t \neq t'$, we have that:
$$E\left[S_i(1) S_j(0)\right] = P(T_i = 1, T_j = 0)\, \frac{n_1}{N_1} \cdot \frac{n_0}{N_0} = \frac{N_1 N_0}{N (N - 1)} \cdot \frac{n_1 n_0}{N_1 N_0} = \frac{n_1 n_0}{N (N - 1)}.$$
Consequently,
$$\mathrm{Cov}\left(S_i(1), S_j(0)\right) = \frac{n_1 n_0}{N (N - 1)} - \frac{n_1 n_0}{N^2} = \frac{n_1 n_0}{N^2 (N - 1)}.$$
Appendix B Proof of Theorem 1
To prove the first part, note that for $t \in \{0, 1\}$, $\bar{y}_t$ defined in (8) is an unbiased estimator of $\bar{Y}(t)$ by (5). Consequently, from (4), $\hat{\tau}_2$ is an unbiased estimator of $\tau$. To prove the second part, we need the following three lemmas:
Lemma 4.
For $t \in \{0, 1\}$, the sampling variance of $\bar{y}_t$ is
$$\mathrm{Var}(\bar{y}_t) = \left(\frac{1}{n_t} - \frac{1}{N}\right) S_t^2.$$
Proof.
Lemma 5.
For $t \in \{0, 1\}$,
Proof.
Lemma 6.
The covariance between $\bar{y}_1$ and $\bar{y}_0$ is given by:
$$\mathrm{Cov}\left(\bar{y}_1, \bar{y}_0\right) = -\frac{S_{1,0}}{N}.$$