## 1 Introduction

It is common practice in the design of experiments to use baseline covariates or data from past waves to inform sampling or treatment assignment. An example is stratification, in which units are grouped into blocks using baseline covariates, and then randomized to treatment or control separately within each block, thereby ensuring that the covariate distribution is “balanced” between the treatment and control groups. In a review of a selection of research articles using experiments in development economics, Bruhn and McKenzie (2009) report that a large share of these articles use some form of stratification. Further description and discussion of such designs are given in survey articles (Duflo et al., 2007) and textbooks (Imbens and Rubin, 2015; Rosenberger and Lachin, 2015). See also Bugni et al. (2018) for further references.

Such designs have received renewed interest in the theoretical literature, with several papers deriving asymptotic approximations to the sampling distribution of estimators and test statistics under such designs (see, among others, Bugni et al., 2018; Bai et al., 2021). One goal of this literature has been to design experiments that improve the asymptotic efficiency of estimators and tests. In the case of a binary treatment, the efficiency bound of Hahn (1998) gives a lower bound on the asymptotic performance of estimators and tests for the average treatment effect (ATE) under experimental designs that lead to independent and identically distributed (iid) data. A key finding is that one can use data from past waves to design an experiment that optimizes this bound, along with a subsequent estimator that achieves the optimized bound (Hahn et al., 2011; Tabord-Meehan, 2018; Cytrynbaum, 2021). However, the Hahn (1998) bound need not apply once one allows for randomization rules involving stratification on covariates or data from past waves. Can the optimized Hahn (1998) bound be further improved using stratification or other dependence-inducing experimental designs?

In this paper, we derive asymptotic efficiency bounds in a general setting that allows for such designs. Applied to the case of a binary treatment, our results show that the optimized Hahn (1998) bound indeed gives a lower bound for the performance of any estimator or test with data from any experimental design in this general setting. The key technical result is a likelihood expansion and local asymptotic normality theorem that applies to arbitrary experimental designs that assign treatment after observing the entire set of covariates and past outcome values for an independent sample from an infinite population. To derive these results, we apply techniques from the recent literature on asymptotic distributions of estimators in related settings (in particular, a martingale representation similar to those used in Abadie and Imbens, 2012) to a Le Cam style local expansion of the likelihood ratio. Applying these results to the least favorable submodels used to derive the corresponding bounds in the iid case then gives the efficiency bounds.

The rest of this paper is organized as follows. Section 2 gives an informal description of our results in a simple setting with a binary treatment and no constraints on the experimental design. Section 3 describes the formal setup, and includes our main technical results. Section 4 applies these results to provide a formal statement of the optimality result in the simple setting in Section 2. Section 5 considers a more general setting with multiple treatments and possible constraints on overall treatment and sampling.

## 2 Informal Description of Results in a Simple Case

Consider the case of a binary treatment. Unit $i$ has potential outcomes $Y_i(1)$ and $Y_i(0)$ under treatment and non-treatment. In addition, there is a vector of baseline covariates $X_i$ associated with unit $i$. We assume that $(X_i, Y_i(0), Y_i(1))$ are drawn iid from some population, and we are interested in the ATE $\tau = E[Y_i(1) - Y_i(0)]$ for this population. The researcher first observes the sample of baseline covariates $X_1, \ldots, X_n$. The researcher chooses a treatment assignment $D_{i,n} \in \{0, 1\}$ for each unit $i$, and observes the outcome $Y_i = Y_i(D_{i,n})$ for this unit. The treatment assignment $D_{i,n}$ can depend on the entire sample of baseline covariates, as well as past outcomes $Y_j$ for $j < i$.^{1}

^{1}We subscript the treatment assignment by $n$ as well as $i$ since the treatment assignment rule depends on the entire sample and can therefore vary arbitrarily with $n$; see Section 3 for a formal description of our notation.

One possible design is to assign treatment independently across $i$, with treatment probability $\pi(X_i)$. The conditional treatment probability $\pi(x)$ is referred to in the literature as the propensity score. This yields iid data, so that the semiparametric efficiency bound of Hahn (1998) applies, giving

$$V_\pi = E\left[\frac{\sigma_1^2(X_i)}{\pi(X_i)} + \frac{\sigma_0^2(X_i)}{1 - \pi(X_i)} + (\tau(X_i) - \tau)^2\right] \qquad (1)$$

as a bound for the asymptotic variance of an estimator of the ATE, where $\sigma_d^2(x) = \operatorname{var}(Y_i(d) \mid X_i = x)$ and $\tau(x) = E[Y_i(1) - Y_i(0) \mid X_i = x]$. We can choose the propensity score to minimize this bound by taking first order conditions: the optimal propensity score satisfies

$$\pi^*(x) = \frac{\sigma_1(x)}{\sigma_1(x) + \sigma_0(x)}. \qquad (2)$$

Following the literature, we refer to this as the Neyman allocation, after Neyman (1934).
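As a quick numerical check (a sketch with illustrative numbers, not taken from the paper), one can verify that the allocation in (2) minimizes the variance contribution in (1) pointwise in the covariate value:

```python
import numpy as np

# Conditional standard deviations at a fixed covariate value x
# (illustrative numbers, not from the paper).
sigma1, sigma0 = 2.0, 1.0

def v(pi):
    # Pointwise contribution of the variance bound (1) at x:
    # sigma1^2 / pi + sigma0^2 / (1 - pi).
    return sigma1**2 / pi + sigma0**2 / (1.0 - pi)

# Neyman allocation (2): treat in proportion to the conditional std dev.
pi_star = sigma1 / (sigma1 + sigma0)

# A grid search confirms that pi_star minimizes v(pi).
grid = np.linspace(0.01, 0.99, 9801)
pi_hat = grid[np.argmin(v(grid))]
print(pi_star, pi_hat)  # both close to 2/3
```

The same comparison at any other pair of conditional standard deviations gives the same conclusion, since the first order condition holds separately at each covariate value.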

Since $\pi^*$ requires knowledge of the unknown conditional variances $\sigma_0^2(x)$ and $\sigma_1^2(x)$, this design is not feasible, but a feasible design can be obtained by using a pilot study to estimate these variances (Hahn et al., 2011). Using data from this experimental design, one can achieve the semiparametric efficiency bound with an estimator that adjusts flexibly for covariates or uses a flexible estimate of the propensity score (Hahn et al., 2011). To avoid the additional complexity of such estimators, one can alternatively design the experiment using stratification on covariates, so that a simple estimator that weights by the (true) propensity score achieves the bound (Tabord-Meehan, 2018; Cytrynbaum, 2021).
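The two-stage logic can be illustrated with a small simulation (a hypothetical sketch: the data generating process, covariate cells, and sample sizes are all assumptions, not from the paper). A pilot with coin-flip assignment estimates the conditional standard deviations, the main wave is assigned with the estimated Neyman allocation, and a simple inverse-propensity-weighted estimator recovers the ATE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stylized data generating process (all numbers illustrative): binary
# covariate X, constant treatment effect 1, and Var(Y(1)|X) depending on X.
def draw(n, pi):
    x = rng.integers(0, 2, size=n)            # X_i in {0, 1}
    d = (rng.random(n) < pi[x]).astype(int)   # assign with propensity pi(x)
    sd = np.where(d == 1, np.where(x == 1, 3.0, 1.0), 1.0)
    y = d * 1.0 + sd * rng.standard_normal(n)
    return x, d, y

# Stage 1: pilot with coin-flip assignment, used to estimate sigma_d(x)
# and hence the Neyman allocation (2) within each covariate cell.
xp, dp, yp = draw(5_000, np.array([0.5, 0.5]))
pi_hat = np.empty(2)
for cell in (0, 1):
    s1 = yp[(xp == cell) & (dp == 1)].std()
    s0 = yp[(xp == cell) & (dp == 0)].std()
    pi_hat[cell] = s1 / (s1 + s0)

# Stage 2: main wave assigned with the estimated allocation; the ATE is
# estimated by inverse propensity weighting with the known pi_hat.
x, d, y = draw(200_000, pi_hat)
tau_hat = np.mean(d * y / pi_hat[x] - (1 - d) * y / (1 - pi_hat[x]))
print(round(tau_hat, 2))  # close to the true ATE of 1
```

The weighting estimator here uses the assigned (hence known) propensity, which is the feature that the stratification-based designs cited above exploit to keep estimation simple.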

Such designs, however, lead to dependent data, violating the assumptions used to derive the Hahn (1998) bound. Nonetheless, our results show that the bound continues to apply to these designs, as well as to any other experimental design that assigns treatment as a function of past outcomes and the entire vector of baseline covariates. Thus, the combinations of estimators and experimental designs in Hahn et al. (2011), Tabord-Meehan (2018) and Cytrynbaum (2021) are indeed asymptotically optimal among all such designs paired with any possible estimator.

Formally, semiparametric efficiency bounds amount to the statement that no uniform efficiency improvement is possible over a class of distributions that is rich enough to include a particular one-dimensional submodel, called a “least favorable submodel.” Our results show that this statement continues to hold for any experimental design in our setup, with the same least favorable submodel as in the iid case. Section 4 provides a formal statement for the binary setting considered here, and Section 5 generalizes this to multiple treatments and cost constraints. The next section describes the formal setup and derives the main technical results (likelihood expansion and local asymptotic normality) used in our bounds.

## 3 Setup and Main Results

This section presents our formal setup and main technical results. Section 3.1 describes notation and sampling assumptions. Section 3.2 states the assumptions on parametric submodels. Section 3.3 gives our main likelihood expansion and local asymptotic normality theorem.

### 3.1 Setup and Sampling Assumptions

We consider a setting in which baseline covariates $X_i$ and potential outcomes $\{Y_i(d)\}_{d \in \mathcal{D}}$ are associated with unit $i$, where $\mathcal{D}$ is a finite set of possible treatment assignments. We assume that $(X_i, \{Y_i(d)\}_{d \in \mathcal{D}})$ are drawn iid from some population. The researcher chooses a treatment assignment $D_{i,n} \in \mathcal{D}$ for each observation $i$, and observes $X_i$ and $Y_i = Y_i(D_{i,n})$ for each observation. In forming this assignment rule, the researcher first observes the entire sample of covariates $X_1, \ldots, X_n$. The rule is then allowed to depend sequentially on observed outcome variables: the treatment rule is given by $D_{i,n} = d_{i,n}(X_1, \ldots, X_n, Y_1, \ldots, Y_{i-1}, U)$, where $d_{i,n}$ is a measurable function of its arguments and $U$ is a random variable independent of the sample, which allows for randomized treatment rules. We will also allow for unit $i$ not to be assigned to any treatment group, in which case none of the outcomes are observed (we record this case with a placeholder value for the treatment and outcome). Based on this data, the researcher then forms an estimator or test for some parameter of the population distribution of $(X_i, \{Y_i(d)\}_{d \in \mathcal{D}})$.

###### Remark 3.1.

Our setup allows for experimental designs that use information on baseline covariates in essentially arbitrary ways. In particular, designs involving stratified randomization on covariates, including matched pairs, are allowed. Our setup also includes designs that use outcomes from a pilot study, by defining the initial observations as observations from this study (with the relative size of the pilot allowed to vary as $n \to \infty$).
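As an illustration of a design covered by this setup, the following sketch (with hypothetical covariates and sample size) implements matched-pairs randomization; the assignment depends on the entire covariate sample through the sort, plus independent randomness, which is exactly the structure permitted above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                         # even sample size (illustrative)
x = rng.standard_normal(n)     # covariates, observed before assignment

# Matched-pairs design: sort units by covariate, pair adjacent units,
# and randomize one unit of each pair into treatment.
order = np.argsort(x)
d = np.zeros(n, dtype=int)
for j in range(0, n, 2):
    pair = order[j:j + 2]
    d[pair[rng.integers(0, 2)]] = 1

print(d.sum())  # exactly n/2 treated, one per pair
```

Note that the realized assignment vector is a deterministic function of the full covariate sample and the auxiliary randomness, matching the role of $U$ in the formal treatment rule.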

###### Remark 3.2.

We follow much of the literature in assuming that our sample is taken independently from an infinite population. In particular, this assumption is made in papers deriving asymptotics for estimators and tests under stratified sampling, including Bugni et al. (2018) and Bai et al. (2021), and in papers on experimental design, including Imbens et al. (2009), Hahn et al. (2011), Tabord-Meehan (2018) and Cytrynbaum (2021). One can consider this an approximation to a setting where one samples from a large population of $N$ units. Formally, each unit $j$ in the larger population has covariates and outcomes $(\tilde{X}_j, \{\tilde{Y}_j(d)\}_{d \in \mathcal{D}})$, and we draw observation $i$ by drawing a random variable $J_i$ from the uniform distribution on $\{1, \ldots, N\}$, and then defining $X_i = \tilde{X}_{J_i}$ and $Y_i(d) = \tilde{Y}_{J_i}(d)$ for each $i$. This corresponds exactly to sampling from the larger population with replacement, which is a good approximation to sampling without replacement when $N$ is large.

Thus, our setup incorporates an assumption that the experimental design involves randomized sampling from a large population.^{2} Results that explicitly address the question of whether it is indeed optimal to randomly sample from a (possibly large) finite population include Savage (1972, Ch. 14, Section 8) and Blackwell and Girshick (1954, Section 8.7).^{3} We note that our results do allow for some statements about the optimal use of covariates for sampling a single outcome (by taking $\mathcal{D}$ to be a singleton and incorporating cost constraints, as in Section 5).

^{2}This also means that treatment assignments that assign units to treatment groups deterministically as a function of the index or covariates are still “randomized” in the sense that the subset of units in each treatment group is random as a subset of the larger population. For example, an assignment that treats the first half of the sampled units and leaves the rest untreated is “randomized” in the sense that the sample of treated units is a random subset of the population, as well as being a random subset of the sampled units (it is not a deterministic function of the set of sampled units).

^{3}The notion of “optimality” is slightly different in these references, since they consider finite-sample minimax over a fixed set of distributions, in contrast to the semiparametric results in the present paper, which correspond to asymptotic minimax bounds over a localized parameter space.
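The with-replacement sampling scheme described above can be sketched as follows (population and sample sizes are illustrative; the birthday-style bound is a standard calculation, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite population of N units with a covariate for each (illustrative).
N, n = 1_000_000, 100
x_tilde = rng.standard_normal(N)

# Draw J_1,...,J_n iid uniform on {1,...,N} and set X_i to the covariate of
# unit J_i: sampling with replacement, which the text treats as an
# approximation to an iid sample from the population.
j = rng.integers(0, N, size=n)
x = x_tilde[j]

# With- and without-replacement sampling differ only on the event that some
# index repeats; a union (birthday) bound gives P(any repeat) <= n(n-1)/(2N).
p_repeat_bound = n * (n - 1) / (2 * N)
print(len(x), p_repeat_bound)  # 100, 0.00495
```

When $N$ is large relative to $n^2$, the repeat probability is negligible, which is the sense in which the iid approximation is accurate.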

### 3.2 Parametric Submodel and Likelihood Ratio

We consider a finite dimensional parametric model indexed by $\theta$. We are interested in efficiency bounds at a particular value $\theta_0$. While our analysis allows us to consider parametric settings, we will be primarily interested in using least favorable submodels to derive semiparametric efficiency bounds in infinite dimensional settings, as in the ATE bound for a binary treatment described in Section 2. In cases where ambiguity may arise, we subscript expectations and probability statements by $\theta$ to indicate that the data are drawn from this model.

Let $f_\theta(x)$ denote the density of $X_i$ with respect to $\mu_X$, and let $f_\theta(y \mid x, d)$ denote the density of $Y_i(d)$ conditional on $X_i = x$ with respect to $\mu_Y$, where $\mu_X$ and $\mu_Y$ are measures that do not depend on $\theta$. Let $f_U$ denote the density of $U$ (which does not depend on $\theta$). The probability density of the data is

$$f_U(u) \prod_{i=1}^{n} f_\theta(x_i)\, f_\theta(y_i \mid x_i, d_{i,n}) \qquad (3)$$

where $d_{i,n} = d_{i,n}(x_1, \ldots, x_n, y_1, \ldots, y_{i-1}, u)$. The researcher makes a decision using the observed data, along with the treatment rule and the variable $U$, which determine the treatment assignments $D_{1,n}, \ldots, D_{n,n}$. Since the treatment rule is known once $U$ is given, we can take the observed data to be $X_1, \ldots, X_n, Y_1, \ldots, Y_n$ and $U$, so that the likelihood is given by (3).

Following the literature on asymptotic efficiency, we make a quadratic mean differentiability assumption on the model (see van der Vaart, 1998, Section 7.2, for a definition).

###### Assumption 3.1.

The family $\{f_\theta(x)\}$ is differentiable in quadratic mean (qmd) at $\theta_0$ with score function $s_X(x)$, and, for each $d \in \mathcal{D}$, the family $\{f_\theta(y \mid x, d)\}$ is qmd at $\theta_0$ with score function $s_d(y \mid x)$.

Here, the qmd condition for the conditional density $f_\theta(y \mid x, d)$ is taken to mean that the corresponding joint family is qmd when $X_i$ is distributed according to the density at $\theta_0$; i.e. the family $f_{\theta_0}(x) f_\theta(y \mid x, d)$ is qmd at $\theta_0$. Let $I_X$ denote the information for $X_i$, and let $I_d(x)$ and $I_d = E_{\theta_0} I_d(X_i)$ denote the conditional and unconditional information for $Y_i(d)$ for each $d \in \mathcal{D}$. Note that these are finite by Theorem 7.2 in van der Vaart (1998).

### 3.3 Likelihood Expansion and Local Asymptotic Normality

Consider a sequence $\theta_n = \theta_0 + h/\sqrt{n}$ where $h$ is given. To obtain efficiency bounds, we extend Le Cam’s result on the asymptotics of likelihood ratio statistics in parametric families (Theorem 7.2 in van der Vaart, 1998) to our setting, with the likelihood given in (3). Since $f_U(u)$ does not depend on $\theta$, this term drops out, and the log of the likelihood ratio for $\theta_n$ vs $\theta_0$ is given by the sum of the log likelihood ratio terms for the covariates and for the outcomes given the covariates and treatment assignments.

###### Theorem 3.1.

Under Assumption 3.1, the likelihood ratio satisfies

$$\log \prod_{i=1}^{n} \frac{f_{\theta_n}(X_i)\, f_{\theta_n}(Y_i \mid X_i, D_{i,n})}{f_{\theta_0}(X_i)\, f_{\theta_0}(Y_i \mid X_i, D_{i,n})} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} h' \left[ s_X(X_i) + s_{D_{i,n}}(Y_i \mid X_i) \right] - \frac{1}{2} h' \left( I_X + \frac{1}{n} \sum_{i=1}^{n} I_{D_{i,n}}(X_i) \right) h + o_{P_{\theta_0}}(1). \qquad (4)$$

###### Proof.

It is immediate from Theorem 7.2 in van der Vaart (1998) that the terms involving the covariate densities admit the desired expansion.

To prove (4), we obtain a similar decomposition for the terms involving the outcome densities. The qmd condition then implies a corresponding quadratic approximation. Note that

where the last equality uses a second order Taylor expansion of the logarithm. It follows immediately from the proof of Theorem 7.2 in van der Vaart (1998) that the remainder terms are negligible. Thus,

We will show that each of the terms

(5)

(6)

(7)

converge in probability to zero under $\theta_0$.

Let so that the summand in (5) is given by . For , let denote the sigma algebra generated by , and . Note that is measurable with respect to for , and that is measurable with respect to for . In addition, , where the last step uses the fact that is a score function conditional on . Thus, for ,

so that the expectation of the square of (5) is given by

by qmd, where the last inequality uses the fact that the summand equals a variable minus its expectation given the past.

For (6), note that

Thus, the expectation of the absolute value of (6) is bounded by times

Letting , this is bounded by

This converges to zero by qmd.

For (7), note that it is bounded by a term shown above to converge to zero, times a factor that is bounded in probability. Thus, to show that (7) converges in probability to zero under $\theta_0$, it suffices to show that the corresponding average converges in probability to zero under $\theta_0$. This follows by a law of large numbers for martingale difference arrays (Theorem 2 in Andrews, 1988), since the summand is a martingale difference array with respect to the filtration introduced above, and it is uniformly integrable under $\theta_0$ since it is bounded by a sequence that is iid and has finite mean. This completes the proof of (4). ∎
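The martingale difference structure invoked here can be illustrated with a toy simulation (a hypothetical example, not the paper's construction): even when a summand depends on past data through the design, it remains mean-zero given the past, so its average obeys a law of large numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# A martingale difference array: each z_i depends on past observations
# through a "design" d_i, yet E[z_i | past] = 0 by construction.
n = 100_000
eps = rng.standard_normal(n)          # iid shocks
d = np.zeros(n, dtype=int)
z = np.zeros(n)
for i in range(1, n):
    d[i] = int(eps[i - 1] > 0)        # design depends on the past
    z[i] = (2 * d[i] - 1) * eps[i]    # sign chosen by the design

print(abs(z.mean()))  # near zero by the martingale LLN
```

The point of using a martingale law of large numbers rather than an iid one is exactly that the summands here are dependent, but only through the past.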

We now use Theorem 3.1 to prove a local asymptotic normality result.

###### Corollary 3.1.

Suppose Assumption 3.1 holds and let $\theta_n = \theta_0 + h/\sqrt{n}$. Let $I^*$ be a positive definite symmetric matrix. If $I_X + \frac{1}{n} \sum_{i=1}^{n} I_{D_{i,n}}(X_i)$ converges in probability to $I^*$ under $\theta_0$, then the log likelihood ratio converges in distribution to a $N(-\frac{1}{2} h' I^* h,\, h' I^* h)$ law under $\theta_0$. If $I_X + \frac{1}{n} \sum_{i=1}^{n} I_{D_{i,n}}(X_i) \le I^*$ (where the inequality is in the positive definite sense), then one can define a probability space under each $\theta$ with an additional random variable (and with the marginal distribution of the data under $\theta$ unchanged) such that the likelihood ratio of the augmented data converges in distribution to a $N(-\frac{1}{2} h' I^* h,\, h' I^* h)$ law under $\theta_0$.

###### Proof.

We use a martingale representation similar to the one used for matching estimators by Abadie and Imbens (2012). The increments of the log likelihood ratio expansion in Theorem 3.1 form a martingale difference array with respect to the filtration in which the covariates are revealed first, followed sequentially by the outcomes. In the case where the information average converges in probability to $I^*$ under $\theta_0$, it then follows immediately from a central limit theorem for martingale arrays (Theorem 35.12 in Billingsley, 1995) that the log likelihood ratio converges to a $N(-\frac{1}{2} h' I^* h,\, h' I^* h)$ law under $\theta_0$ (the Lindeberg condition follows since the summands are each dominated by sequences of iid variables with finite second moment).

Now consider the case where the information average is bounded above by $I^*$. In this case, we augment the data with additional independent normal random variables whose information makes up the difference, so that the total information converges to $I^*$. Applying Theorem 3.1 to the augmented likelihood ratio, the resulting increments again form a martingale difference array with respect to the enlarged filtration, and they satisfy the Lindeberg condition by the arguments above together with uniform boundedness of the added terms. It therefore follows that the augmented log likelihood ratio converges in distribution under $\theta_0$ to a $N(-\frac{1}{2} h' I^* h,\, h' I^* h)$ law as claimed.

∎

According to Corollary 3.1, the model indexed by $\theta$ is locally asymptotically normal in the sense of Definition 7.14 in van der Vaart (1998). Therefore, the risk of any decision is bounded from below asymptotically by the risk from a decision in the limiting model, in which a single normal random variable is observed. Augmenting the data by the additional variables is a technical trick that appears to be needed to cover, for example, treatment rules that do not assign any treatment to some individuals, which is relevant in the setting in Section 5 with cost constraints. The bounds obtained from local asymptotic normality still apply to the original setting in which the additional variables are not observed, since the bound from the augmented model applies to decisions that do not use these variables.
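The local asymptotic normality conclusion can be checked numerically in a stripped-down example (an assumed Gaussian model with no covariates, not the paper's general setup): under a deterministic design that treats the first half of the sample, the simulated log likelihood ratio is approximately normal with mean equal to minus half its variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model: Y_i(d) ~ N(theta * d, 1), theta_0 = 0, local alternative
# theta_n = h / sqrt(n), and the first m = n/2 units treated.
n, h, reps = 100, 1.0, 100_000
theta_n = h / np.sqrt(n)
m = n // 2

# Under theta_0 = 0 the treated outcomes are N(0,1); the log likelihood
# ratio for theta_n vs theta_0 sums over treated units only:
#   sum_i [theta_n * y_i - theta_n**2 / 2].
y = rng.standard_normal((reps, m))
loglr = theta_n * y.sum(axis=1) - m * theta_n**2 / 2

# LAN prediction: log LR is approximately N(-v/2, v) with v = m * h**2 / n,
# so the simulated mean should be about minus half the simulated variance.
print(loglr.mean(), loglr.var())  # about -0.25 and 0.5
```

The mean-equals-minus-half-variance relation is the hallmark of a locally asymptotically normal experiment, and it is what delivers the limiting Gaussian shift model used for the bounds.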

## 4 Efficiency Bounds for Average Treatment Effect

We now apply these results to derive the asymptotic efficiency bound for estimation and inference on the average treatment effect (ATE) in the case of a binary treatment ($\mathcal{D} = \{0, 1\}$), as described in Section 2. Given a population distribution, the variance bound (1) corresponds to a least favorable one-dimensional submodel indexed by $\theta$, with $\theta = 0$ corresponding to the given population distribution. Thus, we consider the variance bound in (1) with $\sigma_d^2(x)$ and $\tau(x)$ evaluated under this distribution, and we define the Neyman allocation $\pi^*$ in (2) accordingly. We then consider a submodel through this distribution that corresponds to the least favorable submodel used to derive the bound in the iid case. Calculations in Hahn (1998, pp. 326-327) show that this submodel takes the form in Section 3, with densities given by

(8)

The score function for this submodel is the efficient influence function for the ATE, and the information is the variance bound (1). Furthermore, letting $\tau(\theta)$ denote the ATE under $\theta$ in this submodel, the calculations in Hahn (1998, pp. 326-327) show that $\tau(\theta)$ is differentiable at $\theta = 0$ in the sense of p. 363 of van der Vaart (1998), and that the derivative is characterized by the efficient influence function, so that