Balanced and robust randomized treatment assignments: the Finite Selection Model

05/19/2022
by   Ambarish Chattopadhyay, et al.
Comcast
Harvard University

The Finite Selection Model (FSM) was proposed and developed by Carl Morris in the 1970s for the experimental design of RAND's Health Insurance Experiment (HIE) (Morris 1979, Newhouse et al. 1993), one of the largest and most comprehensive social science experiments conducted in the U.S. The idea behind the FSM is that treatment groups take turns selecting units in a fair and random order to optimize a common criterion. At each of its turns, a treatment group selects the available unit that maximally improves the combined quality of its resulting group of units in terms of the criterion. Herein, we revisit, formalize, and extend the FSM as a general tool for experimental design. Leveraging the idea of D-optimality, we propose and evaluate a new selection criterion in the FSM. The FSM using the D-optimal selection function has no tuning parameters, is affine invariant, and achieves near-exact mean-balance on a class of covariate transformations. In addition, the FSM using the D-optimal selection function is shown to retrieve several classical designs such as randomized block and matched-pair designs. For a range of cases with multiple treatment groups, we propose algorithms to generate a fair and random selection order of treatments. We demonstrate FSM's performance in terms of balance and efficiency in a simulation study and a case study based on the HIE data. We recommend the FSM be considered in experimental design for its conceptual simplicity, practicality, and robustness.



1 Introduction

1.1 The RAND Health Insurance Experiment

In the 1970s, the challenge of financing and delivering high-quality and affordable health care to all Americans was at the center of national policy debate. At the time, two central questions were "How much more medical care would people use if it were provided free of charge?" and "What are the consequences of using more medical care on their health?" To address these and other related questions, an interdisciplinary team of researchers led by Joseph P. Newhouse at RAND designed and conducted the Health Insurance Experiment (HIE), a large-scale, multi-year, randomized public policy experiment developed and completed between 1971 and 1982. To this day, the HIE is one of the largest and most comprehensive social science experiments ever conducted in the U.S. Even now, four decades after its completion, evidence from the HIE is still fundamental to the national discussion on health care cost sharing and health care reform.

In the HIE, a representative sample of 2,750 families comprising more than 7,700 individuals were chosen from six urban and rural sites across the United States. At the beginning of the study, participants completed a baseline survey providing numerous demographic, medical, and socioeconomic measurements. Families were then assigned to health insurance plans that varied substantially in their coinsurance rates and out-of-pocket expenditure maxima, for a total of 13 possible treatment groups. The goal of the study was to estimate the marginal averages of utilization and health outcomes in each of the six sites under each plan. To make evidence on health utilization and outcomes as strong as possible, the study had to be randomized. However, achieving balance for numerous continuous and categorical covariates through randomization is challenging in contexts with so many treatment groups and implementation sites.

1.2 Toward balanced, efficient, and robust experimental designs

Randomized experiments are considered to be the gold standard for causal inference, as randomization provides an unequivocal basis for both inference and control. In randomized experiments, the act of randomization ensures balance on both observed and unobserved covariates on average. However, a given realization of the random assignment mechanism may produce substantial imbalances on one or more covariates. This imbalance problem can be exacerbated in settings like the HIE, where treatments are multi-valued and many baseline covariates exist, leading to loss in efficiency of the effect estimates.

A variety of methods have been proposed in the literature to address this problem, such as blocking (fisher1925statistical, fisher1935design, cochran1957experimental), optimal pair-matching (greevy2004optimal), greedy pair-switching (krieger2019nearly), and designs using mixed-integer programming (bertsimas2015power). In particular, rerandomization (morgan2012rerandomization) has gained popularity over the last few years and has become commonplace in experiments. However, rerandomization may not protect against chance imbalances in functions of the covariates that are not explicitly addressed by the rerandomization criterion (banerjee2017decision), especially in experiments with multi-valued (>2) treatments. Defining the rerandomization criterion also requires selecting a tuning parameter that governs the acceptable degree of imbalance, which may require iteration in practice. Moreover, rerandomization rules out imbalanced assignments ex post, which may complicate inference (athey2017econometrics).

To overcome these and other related challenges, we consider the Finite Selection Model (FSM) for experimental design. The original version of the FSM was proposed and developed by Carl Morris in the design of the HIE (morris1979finite, newhouse1993free, morris1993the). The idea behind the FSM is that each treatment group takes turns in a fair and random order to select units from a pool of available units such that, at each stage, each treatment group selects the unit that maximally improves the combined quality of its current group of units. The criterion for measuring quality is flexible, and in this paper, we develop a new criterion based on the concept of D-optimality, which does not require tuning parameters.

To illustrate, Figure 1 exhibits the performance of complete randomization, rerandomization, and the FSM in a version of the HIE data with four treatment groups and 20 covariates. For rerandomization we compute the maximum Mahalanobis distance (using the 20 covariates) across all possible pairs of treatment groups and accept the 0.1% of assignments with the smallest covariate distance (see Section 7.1 for details). The figure displays the distribution of absolute standardized mean differences (ASMD; rosenbaum1985constructing)111For a single covariate X, the absolute standardized mean difference between treatment groups t and t' is |X̄_t − X̄_t'| / {(s_t² + s_t'²)/2}^(1/2), where X̄_t and s_t² are the mean and variance of X in treatment group t, respectively. See rosenbaum1985constructing for details. in the covariates and their second-order transformations across multiple realizations of the randomization mechanisms for the three designs. Lower values of the ASMD indicate better balance on the covariates (or transformations thereof). We observe that rerandomization substantially outperforms complete randomization in terms of imbalances on the main covariates, but not in terms of their squares and interactions. In contrast, the FSM markedly outperforms both methods without requiring tuning parameters. This analysis reveals that, while rerandomization performs well by common standards (the majority of the ASMDs are smaller than 0.1), there is room for improvement. As we explain in Section 7, in experiments like the HIE the space of possible assignments is vast, and the FSM can improve the assignment of units into treatment groups to achieve better covariate balance. Better balance can improve the validity and credibility of a study, and it also translates into increased efficiency and robustness.
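As a computational aside, the ASMD for a single covariate can be computed in a few lines. This is a minimal sketch assuming the pooled-variance denominator of rosenbaum1985constructing; the function name is ours:

```python
import numpy as np

def asmd(x_t, x_c):
    """Absolute standardized mean difference of one covariate between
    two treatment groups, with denominator sqrt((s_t^2 + s_c^2) / 2)."""
    x_t = np.asarray(x_t, dtype=float)
    x_c = np.asarray(x_c, dtype=float)
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2.0)
    return abs(x_t.mean() - x_c.mean()) / pooled_sd
```

In the spirit of the figure, values below 0.1 are commonly read as adequate balance on that covariate.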

Figure 1: Distributions of ASMD for complete randomization, rerandomization, and the FSM, for 20 baseline covariates in the HIE data. Without tuning parameters, the FSM handles multiple (>2) treatment groups and substantially improves covariate balance and efficiency.

1.3 Contribution and outline

In this paper, we revisit, formalize, and extend the FSM for experimental design. We show that the FSM can be used for balanced, efficient, and robust random treatment assignment, outperforming common assignment methods on these three dimensions. In particular, we describe the FSM under the potential outcomes framework (neyman1923application, rubin1974estimating). We use the sequentially controlled Markovian random sampling (SCOMARS, morris1983sequentially) algorithm to determine the selection order of treatments for two-group experiments and develop its extensions to multi-group experiments. We propose a new selection criterion for treatments based on the idea of D-optimality and discuss its theoretical properties. In particular, we show that the FSM using the D-optimal selection function is affine invariant and achieves near-exact balance on a class of covariate transformations. The FSM using the D-optimal selection function is also shown to retrieve several classical designs such as randomized block and matched-pair designs. We analyze the FSM’s performance both theoretically and empirically and compare it to common assignment methods. We discuss model-based approaches to inference under the FSM and develop randomization-based alternatives. In addition, we discuss potential extensions of the FSM to more complex experimental design settings, such as stratified experiments and experiments with sequential arrival of units in batches. In an accompanying paper (chattopadhyay2021randomized), we describe how these methods can be implemented in the new FSM package for R, which is publicly available on CRAN.

The paper proceeds as follows. In Section 2 we describe the design of the RAND Health Insurance Experiment, focusing on the assignment of each family to a single one of 13 health insurance plans. In Section 3 we present the setup, notation, and main components of the FSM. In Section 4, we propose a selection criterion based on D-optimality and analyze its theoretical properties. In Section 5 we discuss inference under the FSM. In Section 6, we evaluate the performance of the FSM and compare it to standard methods such as complete randomization and rerandomization. In Section 7, we perform a similar comparison using the HIE data. Finally, in Section 8 we consider extensions of the FSM to other settings such as multi-group, stratified, and sequential experiments. In Section 9 we conclude with a summary and remarks.

2 Design of the Health Insurance Experiment

In the HIE, families were assigned to different health insurance plans using the original version of the FSM (morris1979finite). Initially, assignments were made in each of the six HIE sites to 12 or 13 fee-for-service plans with varying combinations of coinsurance (cost sharing) rates and income-related deductibles. Coinsurance plans had rates of 0% (free care), 25%, 50%, or 95%, plus a plan with mixed coinsurance rates and an individual deductible plan. Within the cost sharing plans, families were further assigned to different out-of-pocket maxima, with out-of-pocket expenditures capped at 5%, 10%, or 15% of family income, up to an annual maximum of $1,000 (brook2006health). To ensure that the treatment groups across the insurance plans were balanced relative to the overall population, the FSM considered a discard group of study non-participants as an additional treatment group in its assignment process.

The HIE spanned six U.S. sites tracked over several years, listed here in chronological order of study initiation: Dayton, OH; Seattle, WA; Fitchburg, MA; Franklin County, MA; Charleston, SC; and Georgetown County, SC. The FSM was used, independently in each of the sites, to make random assignments to improve balance on up to 22 family-level baseline covariates across treatment groups. In each of the first two sites, the FSM was used multiple times for separate independent subsets of families to maintain baseline data schedules. In addition to estimating overall marginal effects of health insurance plan design on healthcare utilization and outcomes, the HIE team also sought to understand how particular design choices would affect experimental results. Specifically, in each HIE site, the team conducted four additional randomized sub-experiments to estimate the impact of alternative choices addressing the following questions: (i) which families would undergo shorter enrollment durations (three years or five); (ii) which would receive participation incentives (yes or no); (iii) which would receive pre-experimental physician visits (yes or no); (iv) and which would have higher interviewing frequency (weekly or biweekly) (newhouse1993free). For each of these four sub-experiments, after the insurance treatments were determined, families were randomized to the sub-treatment groups using the FSM.

3 Foundations and overview of the FSM

3.1 Setup and notation

Consider a sample of n units indexed by i = 1, ..., n. Each of these units is to be assigned to one of g treatment groups labelled t = 1, ..., g, with g ≥ 2. Write n_t for the pre-specified size of group t. Denote Z_i as the assigned treatment group label of unit i and Z = (Z_1, ..., Z_n) as the vector of treatment group labels. Following the potential outcomes framework for causal inference (neyman1923application; rubin1974estimating), each unit i has a potential outcome Y_i(t) under each treatment t, t = 1, ..., g, but only one of these outcomes is observed: Y_i = Y_i(Z_i). Denote Y(t) = (Y_1(t), ..., Y_n(t)) as the vector of potential outcomes under treatment t. Each unit i has a vector of k observed covariates, X_i. We write X for the n × k matrix of observed covariates and X̄ and S for the mean vector and covariance matrix of these covariates in the full sample, respectively. For reference, in Table A1 of the Online Supplementary Materials we provide a list of the notation used in this paper.

Based on this notation, Y_i(t) − Y_i(t') is the causal effect of treatment t relative to treatment t' for unit i. We are interested in estimating the sample average treatment effect, the average of Y_i(t) − Y_i(t') over the n units in the sample, and the population average treatment effect, its expectation in the population from which the sample is drawn. For this, we will randomly assign the n units into the g treatment groups using the FSM.

3.2 Components of the FSM

In the FSM, the treatment groups take turns to select units in a random but controlled order while optimizing a certain criterion. This is accomplished by means of a selection order matrix (SOM), which determines the order in which the treatment groups select the units, and a selection function, which provides the optimality criterion. A good SOM guarantees that the selection of units is fair, so that no single treatment group selects all the units of a given type, and random, so that both observed and unobserved covariates are balanced in expectation and there is a basis for inference. A good selection function will produce efficient and robust inferences under a wide class of possible outcome functions.

To illustrate, Table 1(a) presents an example data set with 12 observations and one covariate, age. We consider assigning these 12 units into two groups of equal sizes using the FSM. Table 1(b) shows an example of an SOM in this setting. The SOM determines the order in which each treatment selects a unit at each stage. In the example, treatment group 2 selects first in stage 1, treatment group 1 selects in stage 2, and so on. Treatment groups select units based on the selection function.

In general, it is crucial that the order of selection is random, but also that no group chooses in a disproportionate manner. For two treatment groups of arbitrary sizes, this can be accomplished by means of the Sequentially Controlled Markovian Random Sampling (SCOMARS) algorithm (morris1983sequentially). In the FSM, SCOMARS specifies the probability of a treatment group selecting at stage j (j = 1, ..., n), conditional on the number of selections made by that group up to stage j − 1. See the Online Supplementary Materials for a formal description of the algorithm. SCOMARS satisfies the sequentially controlled condition (morris1983sequentially), which requires the deviation of the observed number of selections made by a treatment group up to any stage from its expectation to be strictly less than one. Intuitively, this condition ensures that, throughout the selection process, no treatment group departs too much from its expected fair share of choices. Moreover, SCOMARS is Markovian because, for each group, the probability of selection at stage j depends solely on the number of selections made up to stage j − 1.222In fact, SCOMARS is the unique randomized algorithm for generating an SOM that is both Markovian and sequentially controlled. For two groups of equal sizes (as in the example in Table 1), generating an SOM under SCOMARS boils down to successively generating independent random permutations of the treatment labels (1, 2). In Section 8.1 we describe this and other extensions of SCOMARS to multi-group experiments. Unless otherwise specified, in the rest of the paper we will use SCOMARS to generate the SOM for experiments with two treatment groups.
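For the equal-sized two-group case, where SCOMARS reduces to stacking independent random permutations of the two labels, generating a selection order can be sketched as follows (the function name is ours, not from the FSM package for R):

```python
import random

def som_two_equal_groups(n, seed=None):
    """Selection order for two treatment groups of size n/2 each:
    concatenate n/2 independent random permutations of (1, 2).
    Within every consecutive pair of stages each group selects
    exactly once, so no group ever strays more than one selection
    from its expected share (the sequentially controlled condition)."""
    assert n % 2 == 0, "both groups must have size n/2"
    rng = random.Random(seed)
    order = []
    for _ in range(n // 2):
        pair = [1, 2]
        rng.shuffle(pair)
        order.extend(pair)
    return order  # order[k] = treatment group that selects at stage k+1
```

For unequal group sizes the full SCOMARS algorithm is needed; this special case only illustrates the fairness property.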

The selection function gives a value to each of the units available for selection at each stage. This value depends on the characteristics of each available unit in addition to those already assigned to the choosing treatment group. In principle, any criterion can be used in the selection function. For example, if the selection function is constant, units are randomly assigned. Alternatively, the selection function can compute the contribution of each unit to a measure of accuracy of the estimators. In this spirit, we propose the D-optimal selection function, which, at each stage, minimizes the generalized variance of the estimated regression coefficients in a linear potential outcome model.

To build intuition, we discuss the special case of a single covariate. With the D-optimal selection function, the choosing group, in its first choice, selects the unit whose covariate value is farthest from the full-sample mean of the covariate; in its subsequent choices, it selects the unit whose covariate value is farthest from the current mean of the covariate in that group. In the example in Table 1, treatment 2 selects unit 1 with age 24, the farthest age from the full-sample mean of 43. In the next stage, treatment 1 selects unit 12 with age 60, the farthest age from 43 among the remaining units. Next, treatment 1 selects unit 2 with age 30, the farthest age from its current mean age of 60. The process continues until all 12 units are selected.

Index Age
1 24
2 30
3 34
4 36
5 40
6 41
7 45
8 46
9 50
10 54
11 56
12 60
Mean 43
(a) Data set
Selection order matrix Unit selected
Stage Treatment Index Age
1 2 1 24
2 1 12 60
3 1 2 30
4 2 11 56
5 1 3 34
6 2 10 54
7 1 9 50
8 2 4 36
9 1 5 40
10 2 8 46
11 2 6 41
12 1 7 45
(b) Selection order matrix and assignment
Table 1: (a) Example data set; (b) selection order matrix and an assignment using the FSM.
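The selections in Table 1(b) can be reproduced with a short sketch of the single-covariate rule described above: a group's first pick is the unit farthest from the full-sample mean, and each later pick is the unit farthest from that group's current mean. Function and variable names are ours; ties, which do not occur in this data set, would be broken at random:

```python
def greedy_fsm_one_covariate(ages, som):
    """Run the single-covariate FSM selection rule for a given
    selection order `som` (som[k] = group choosing at stage k+1).
    Returns the (group, unit index) pairs in order of selection."""
    full_mean = sum(ages) / len(ages)
    remaining = dict(enumerate(ages, start=1))   # unit index -> age
    groups = {t: [] for t in set(som)}
    picks = []
    for t in som:
        chosen = groups[t]
        # first pick: farthest from the full-sample mean;
        # later picks: farthest from the group's current mean
        target = sum(chosen) / len(chosen) if chosen else full_mean
        idx = max(remaining, key=lambda i: abs(remaining[i] - target))
        chosen.append(remaining.pop(idx))
        picks.append((t, idx))
    return picks

ages = [24, 30, 34, 36, 40, 41, 45, 46, 50, 54, 56, 60]
som = [2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1]   # SOM of Table 1(b)
```

Running `greedy_fsm_one_covariate(ages, som)` reproduces the "Unit selected" column of Table 1(b).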

4 The D-optimal selection function

4.1 Definition and behavior

In this section, we formally define the D-optimal selection function and provide an equivalent, closed-form characterization that explains how this selection function governs the selection of units at each stage. Without loss of generality, we assume that treatment 1 gets to select at the jth stage, j = 1, ..., n. Let n1 be the number of units already belonging to treatment group 1 after the (j−1)th stage. We denote U as the remaining set of unselected units after the (j−1)th stage. Let x̄1 and S1 be the mean vector and covariance matrix of the covariates in treatment group 1, respectively, after the (j−1)th stage. Also, let X1 be the design matrix in treatment 1 after the (j−1)th stage.333The design matrix includes a column of all 1’s (corresponding to the intercept) and one column per covariate. Finally, let X̃ be the design matrix in the full sample. We assume that X̃ has full column rank.

To define the selection function, we implicitly consider a linear potential outcome model of the outcome on the covariates, i.e., Y_i(1) = β0 + β⊤X_i + ε_i, where ε_i is an error term with mean zero.444More generally, one can consider a linear model of the outcome on a vector of basis functions of the covariates. For each unit i in U, let X1,i be the resulting design matrix in treatment group 1 if unit i is selected. We first consider the case where n1 ≥ 1 (i.e., treatment 1 has made at least one selection) and X1⊤X1 is invertible. The D-optimal selection function selects the unit i in U that maximizes det(X1,i⊤X1,i).555Ties in the values of the objective function are resolved randomly. In other words, at the jth stage, the D-optimal selection function chooses the unit among U that optimally decreases the generalized variance of the estimated regression coefficients of the fitted linear model in treatment 1. In Lemma A1 in the Online Supplementary Materials, we show that maximizing det(X1,i⊤X1,i) is equivalent to maximizing x_i⊤(X1⊤X1)⁻¹x_i, where x_i is the design (covariate) vector of the ith unit in U. In stages where X1⊤X1 is not invertible, we augment it by a scalar multiple of X̃⊤X̃ (akin to ridge augmentation) and consider the objective function x_i⊤(X1⊤X1 + κX̃⊤X̃)⁻¹x_i, where κ > 0 is a fixed constant. Finally, when n1 = 0 (i.e., treatment 1 has not made any selections yet), the objective function takes the form x_i⊤(X̃⊤X̃)⁻¹x_i.

In Theorem 4.1, we provide an equivalent characterization of the D-optimal selection function that gives more insight into the selection made by the choosing treatment group at each stage.

Theorem 4.1.

Let treatment 1 be the choosing group at the jth stage. The D-optimal selection function chooses the unit in U whose covariate vector maximizes a Mahalanobis distance of the form (x_i − μ)⊤Σ⁻¹(x_i − μ), where the center μ and scatter Σ correspond to the covariate distribution in the full sample when treatment group 1 is empty, to a mixture (with mixing rate governed by κ) of the covariate distributions in treatment group 1 and the full sample when X1⊤X1 is not invertible, and to the covariate distribution in treatment group 1 otherwise.

Theorem 4.1 shows that, at every stage, the D-optimal selection function selects the unit in the remaining pool whose covariate vector maximizes a type of Mahalanobis distance. In its first choice, treatment 1 aims to maximize the Mahalanobis distance from the covariate distribution in the full sample (in particular, from the full-sample mean), thereby choosing the most outlying available unit in the full sample. In the subsequent stages where X1⊤X1 is not invertible, treatment 1 aims to maximize the Mahalanobis distance from a mixture covariate distribution between treatment group 1 and the full sample, where κ determines the mixing rate. Finally, the later selections by treatment 1 aim at maximizing the Mahalanobis distance from the covariate distribution in treatment group 1. Therefore, with every selection, treatment 1 maximizes the overall separation of the covariates from its current mean, which helps increase the efficiency of the estimated regression coefficients.
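In the regression formulation, the equivalence established in Lemma A1 gives a direct way to compute each pick: the choosing group selects the remaining unit that maximizes the quadratic form of its design vector in the inverse Gram matrix of the group, with a ridge-type augmentation by the full-sample Gram matrix while the group's Gram matrix is singular. A sketch under those assumptions (names are ours; design rows include the intercept):

```python
import numpy as np

def d_optimal_pick(X_group, X_remaining, X_full, kappa=0.01):
    """Return the row index of X_remaining that the D-optimal
    selection function picks. X_group holds the design rows already
    in the choosing group; all design matrices include an intercept
    column. If the group's Gram matrix is singular (e.g., the group
    is empty or too small), augment it by kappa * X_full^T X_full."""
    p = X_full.shape[1]
    M = X_group.T @ X_group if X_group.shape[0] > 0 else np.zeros((p, p))
    if np.linalg.matrix_rank(M) < p:
        M = M + kappa * (X_full.T @ X_full)
    Minv = np.linalg.inv(M)
    # score_i = x_i^T M^{-1} x_i for each remaining row x_i
    scores = np.einsum('ij,jk,ik->i', X_remaining, Minv, X_remaining)
    return int(np.argmax(scores))
```

With one covariate this reduces to the farthest-from-the-current-mean rule of the age example: for a group holding ages 60 and 30 (mean 45), the function picks age 34 over age 54.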

4.2 Properties

By definition, the D-optimal selection function improves the estimation accuracy of the fitted linear model in each treatment group by sequentially minimizing the generalized variance of the estimated regression coefficients. With the D-optimal selection function, we can also establish several additional desirable properties of the FSM. In particular, Theorem 4.2 leverages the connection between D-optimality and Mahalanobis distance (as in Theorem 4.1) and presents two key properties of the FSM with the D-optimal selection function.

Theorem 4.2.

(a) The FSM with the D-optimal selection function is invariant under affine transformations of the covariate vector.

(b) For continuous, symmetrically distributed covariates and two groups of equal size, the FSM with the D-optimal selection function almost always produces exact mean-balance on all even transformations of the centered covariate vector.

It follows from Theorem 4.2(a) that, for any SOM, the choices made by each treatment group remain unchanged even if the covariate vectors are transformed via an affine transformation (e.g., by changing the units of measurement of the covariates). Therefore, the FSM with the D-optimal selection function self-standardizes the covariates, and without loss of generality we can assume that the covariates have mean zero in the full sample. In addition, if the covariate vector is symmetrically distributed in the sample, then by Theorem 4.2(b) the FSM exactly balances even transformations of the centered covariates, such as their second and fourth order moments and their pairwise products. An implication of Theorem 4.2(b) is that, for covariates drawn from symmetric continuous distributions (such as the Normal, t, and Laplace distributions), the FSM tends to balance all of these transformations because of the approximate symmetry of the covariates in the sample. The choice of the D-optimal selection function is thus robust in the sense that it allows the FSM to balance a family of transformations of the covariate vector by design, without explicitly including them in the assumed linear model and without requiring the specification of tuning parameters.

The FSM with the D-optimal selection function is also attractive because it can encompass several classical designs, such as randomized block and matched-pair designs. Theorem 4.3 formalizes this result. In the traditional randomized block design (RBD), the units are grouped into blocks of size g according to a categorical blocking variable, and each treatment is randomly applied to exactly one unit within each block (see, e.g., cox2000theory, Section 3.4). Here we consider a more general version of an RBD where the blocks are of size gm (where m is a fixed positive integer) and each treatment is applied to m units within each block. This is a special case of a stratified randomized experiment with strata of equal size and equal allocation among treatments per stratum. In a matched-pair design with two treatments, similar units are grouped into pairs and each treatment is randomly applied to one unit within each pair. This is also a special case of a stratified randomized experiment with equal allocation per stratum, where the size of each stratum equals two.

Theorem 4.3.

(a) Consider a setting where units belonging to blocks of equal size gm are to be randomly assigned into g treatment groups of equal size, where m is a fixed positive integer. Then, if the linear model in the FSM consists of an intercept and indicators of the levels of the blocking variable, the FSM with the D-optimal selection function produces the same assignment mechanism as an RBD.

(b) Consider a setting where identical pairs of units in terms of the baseline covariates are to be assigned into two treatment groups of equal size. Assume the covariate vector is drawn from a continuous distribution. Then, if the linear model in the FSM consists of the intercept and the covariates, the FSM almost surely produces the same assignment mechanism as a matched-pair design.

In the first setting, Theorem 4.3(a) states that, by including the levels of a blocking variable as regressors, the FSM with the D-optimal selection function automatically blocks on that variable. Thus, the FSM retrieves an RBD without explicitly performing separate randomizations within each block. In the second setting, Theorem 4.3(b) states that, by including the covariates as regressors, the FSM with the D-optimal selection function produces the same assignment as a matched-pair experiment, without explicitly performing separate randomizations within each pair. This phenomenon is particularly useful when the sample consists of near-identical twins that are difficult to identify a priori due to the presence of multiple covariates.

4.3 Connection to A-optimality

The original FSM used a criterion based on A-optimality as the selection function (see morris1979finite). In this section, we compare the A- and D-optimal selection functions. The A-optimal selection function requires prespecifying a policy matrix A and a corresponding vector of policy weights w. Here, A transforms the original vector of regression coefficients into a vector of linear combinations that are of policy interest, and w assigns a weight to each combination according to its importance.

If treatment 1 gets to choose at the jth stage, then the A-optimality criterion selects the unit that minimizes the resulting weighted sum of variances of the estimated linear combinations of the regression coefficients, with weights given by w.666For ease of exposition, we only discuss the case where X1⊤X1 is invertible. Proposition 4.4 shows an equivalent characterization of the A-optimal selection function.

Proposition 4.4.

Let treatment 1 be the choosing group at the jth stage. Assume that treatment 1 has made at least one selection and that X1⊤X1 is invertible. Then the A-optimal selection function chooses the unit in U whose covariate vector maximizes a quadratic form determined by A, w, and X1⊤X1.

The A-optimality criterion provides a family of selection functions depending on A and w. In general, the A-optimality criterion is not invariant with respect to affine transformations of the covariate vector: some choices of A and w produce selection functions that are not affine invariant, while other choices yield affine invariant ones. For an affine invariant choice, the A-optimal selection function is closely related to the D-optimal selection function. To see this, consider a case where, during the selection process, the design matrices in each treatment group scale similarly relative to the design matrix in the full sample; in particular, for treatment 1 (the choosing group at stage j), X1⊤X1 ≈ c X̃⊤X̃ for some constant c > 0. In this case, the A-optimal selection function chooses the same unit as the D-optimal selection function. Hence, in this case, the FSM under the D-optimal and A-optimal selection functions makes similar choices of units.

5 Inference under the FSM

Using the FSM we can make both model- and randomization-based inferences. Both modes of inference are feasible for any selection function and any randomized SOM. In model-based inference, the sample is typically assumed to be drawn randomly from some superpopulation, and inference for the PATE is conducted by modeling the observed outcome distribution conditional on the treatment indicators and the covariates. For instance, let the potential outcome model under treatment t be Y_i(t) = β_t⊤φ(X_i) + ε_{t,i}, where φ(X_i) is a vector of basis functions of the covariates (including the constant) and the ε_{t,i} are mutually independent mean-zero errors, independent of the covariates. Under this model, the PATE between treatments t and t' can be unbiasedly estimated by the average over the sample of β̂_t⊤φ(X_i) − β̂_{t'}⊤φ(X_i), where β̂_t is the OLS estimator of β_t obtained by fitting a linear regression of the observed outcomes on φ(X_i) in treatment group t. We call this the regression imputation estimator of the PATE. The standard error of this estimator and the corresponding confidence interval can be obtained using standard OLS theory. We note that, in model-based inference, the standard errors and confidence intervals do not take into account the randomness stemming from the assignment mechanism. Moreover, the regression models proposed at the design stage are often misspecified and are later modified at the analysis stage by, e.g., incorporating covariates (or transformations thereof) that are deemed important predictors of the outcome. Due to the balancing properties of the FSM, the regression imputation estimators tend to exhibit sufficient precision even when the model posited by the FSM is misspecified (see Sections 4.1, 6, and 7).
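Concretely, the regression imputation estimator can be sketched as follows. This is a simplified two-group version with our own names; X already holds the basis functions of the covariates, including the intercept:

```python
import numpy as np

def regression_imputation_effect(X, y, z, t1, t0):
    """Fit OLS of the observed outcome on X separately within
    groups t1 and t0, impute both potential outcomes for every
    unit in the sample, and average the imputed differences."""
    def group_coef(t):
        mask = (z == t)
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        return beta
    b1, b0 = group_coef(t1), group_coef(t0)
    return float(np.mean(X @ b1 - X @ b0))
```

When the within-group regressions are correctly specified, averaging the imputed differences over the full sample estimates the average effect; the balancing of the FSM is what keeps this estimator precise under misspecification.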

In randomization-based inference, the potential outcomes and the covariates are typically considered as fixed and the assignment mechanism is the only source of randomness (see Chapter 2 of rosenbaum2002observational and Chapters 5–7 of imbens2015causal for overviews). Inference for causal effects can be done via exact randomization tests of sharp null hypotheses on unit-level causal effects (fisher1935design), or via estimation under Neyman's repeated sampling approach (neyman1923application).

Under the FSM, randomization tests for sharp null hypotheses can be performed by approximating the distribution of the test statistic through repeated realizations of the FSM. To illustrate, consider testing the sharp null hypothesis of zero unit-level causal effects, i.e., $H_0: Y_i(1) = Y_i(0)$ for all $i = 1, \ldots, n$, at level $\alpha$ using the FSM. While any choice of test statistic preserves the validity of the test, a common choice is the absolute difference-in-means statistic $T = |\bar{Y}^{\mathrm{obs}}_1 - \bar{Y}^{\mathrm{obs}}_0|$. Large values of $T$ are considered evidence against $H_0$. Under $H_0$, $Y_i(1) = Y_i(0)$, and the vectors of potential outcomes under treatment and control are known and fixed. The $p$-value of the test is given by $P(T \geq T^{\mathrm{obs}})$, where $T^{\mathrm{obs}}$ is the value of the test statistic for the observed realization of the assignment under the FSM. We can compute this $p$-value by Monte Carlo approximation: we generate $M$ independent vectors of assignments using the FSM and approximate the $p$-value as $\frac{1}{M}\sum_{m=1}^{M} \mathbb{1}(T^{(m)} \geq T^{\mathrm{obs}})$. We reject $H_0$ at level $\alpha$ if the $p$-value is at most $\alpha$.
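The Monte Carlo approximation of the randomization $p$-value can be sketched as follows. For brevity this Python sketch redraws complete randomizations; under the FSM one would instead redraw FSM assignments of the same units. The data and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative observed data for a two-group experiment of 100 units.
n = 100
Y_obs = rng.normal(size=n)
Z_obs = rng.permutation(np.repeat([0, 1], n // 2))

def t_stat(Y, Z):
    """Absolute difference-in-means statistic."""
    return abs(Y[Z == 1].mean() - Y[Z == 0].mean())

# Under the sharp null, the outcomes are fixed and only the assignment
# is random, so we redraw assignments and recompute the statistic.
T_obs = t_stat(Y_obs, Z_obs)
M = 2000
T_sim = np.array([t_stat(Y_obs, rng.permutation(Z_obs)) for _ in range(M)])
p_value = (T_sim >= T_obs).mean()
print(p_value)
```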

Similar tests can be applied for more general sharp hypotheses of treatment effects (e.g., dilated and tobit effects; rosenbaum2002observational; rosenbaum2010design2). We can invert these tests to obtain a confidence interval for the hypothesized effect (rosenbaum2002observational, Section 2.6.1). Moreover, we can get a point estimate of the effect by solving a Hodges-Lehmann estimating equation corresponding to these tests (rosenbaum2002observational, Section 2.7.2). Finally, under Neyman’s approach, we can estimate the sample average treatment effect by the difference-in-means statistic. In particular, for groups of equal size, this difference-in-means statistic is unbiased for under the FSM (see Proposition A1 for a proof).

6 A simulation study

6.1 Setup

We now compare the performance of the FSM to complete randomization and rerandomization in a simulation study. Here the sample consists of $n = 120$ units, assigned to two treatment groups of 60 units each, with $p = 6$ covariates. The covariates are generated following the design of hainmueller2012balancing:

(4) [covariate-generating model of hainmueller2012balancing]

In this design, three of the six covariates are mutually independent and separately independent of the remaining three. We draw a sample of 120 units once from the data generating mechanism in (4). Conditional on this sample, we compare four different assignment methods, namely a completely randomized design (CRD), rerandomization with a 0.01 acceptance rate (RR 0.01), rerandomization with a 0.001 acceptance rate (RR 0.001), and the FSM. Both RR 0.01 and RR 0.001 use as the rerandomization criterion the Mahalanobis distance between the two treatment groups on the original covariates. The FSM uses a linear potential outcome model on the original covariates and the D-optimal selection function. For each design we draw 800 independent assignments. The assignments under the FSM are generated using the open source R package FSM available on CRAN. The total runtime of the FSM for the 800 simulated experiments was about one and a half minutes on a Windows 64-bit computer with an Intel(R) Core i7 processor. See chattopadhyay2021randomized for detailed step-by-step instructions and vignettes on the use of the FSM package.
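For reference, the rerandomization benchmark can be sketched as follows. This Python sketch maps the acceptance rate to a threshold via an empirical quantile of simulated distances, which is an assumption of this illustration rather than the exact procedure of morgan2012rerandomization; the covariates are simulated stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 120, 6
X = rng.normal(size=(n, p))          # stand-in for the simulated covariates

def mahalanobis_between_groups(X, Z):
    """Mahalanobis distance between the two group means on the covariates."""
    d = X[Z == 1].mean(axis=0) - X[Z == 0].mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    n1, n0 = (Z == 1).sum(), (Z == 0).sum()
    return (n1 * n0 / (n1 + n0)) * d @ S_inv @ d

def rerandomize(X, acceptance_rate=0.01, n_draws=2000):
    """Draw complete randomizations and return the first draw whose
    distance falls below the threshold implied by the acceptance rate."""
    Z0 = np.repeat([0, 1], len(X) // 2)
    draws = [rng.permutation(Z0) for _ in range(n_draws)]
    dists = np.array([mahalanobis_between_groups(X, Z) for Z in draws])
    threshold = np.quantile(dists, acceptance_rate)
    return draws[int(np.argmax(dists <= threshold))]

Z = rerandomize(X)
print(mahalanobis_between_groups(X, Z))   # an accepted, well-balanced draw
```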

6.2 Balance

We evaluate balance on the main and transformed covariates. Figures 2(a) and 2(b) show density plots of the Absolute Standardized Mean Differences (ASMD; rosenbaum1985constructing, stuart2010matching) of the six main covariates and their second-order transformations (including squares and pairwise products), respectively. A smaller ASMD for a covariate indicates better mean-balance on that covariate between the two treatment groups. Figure 2(a) indicates that both rerandomization methods improve balance on the means of the original covariates over CRD. As expected, the ASMD distribution under RR 0.001 is more concentrated than that under RR 0.01, with a 32% smaller mean ASMD. The FSM and RR 0.001 have similar ASMD distributions, with the FSM having a moderately (9%) smaller mean ASMD. See Table A3 in the Online Supplementary Materials for a comparison of the average ASMD of each covariate.
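The ASMD used in these comparisons is straightforward to compute. In this Python sketch we standardize by the pooled standard deviation of the two groups, one common convention; the standardization choice and the toy data are assumptions of the illustration.

```python
import numpy as np

def asmd(x, Z):
    """Absolute standardized mean difference of one covariate between
    two groups, standardized by the pooled standard deviation."""
    x1, x0 = x[Z == 1], x[Z == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return abs(x1.mean() - x0.mean()) / pooled_sd

rng = np.random.default_rng(3)
x = rng.normal(size=100)
Z = rng.permutation(np.repeat([0, 1], 50))
print(asmd(x, Z))   # typically well below 1 for a random split
```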

(a) Main covariates
(b) Squares and pairwise products
Figure 2: Distributions of absolute standardized mean differences (ASMD) of the main covariates and all their second-order transformations. In the top right corners the legends present the average ASMD across simulations for the four methods. On average, the FSM achieves better covariate balance. In terms of the main covariates, the FSM marginally outperforms RR 0.001. In terms of the second-order transformations, the FSM substantially outperforms RR 0.001.

Figure 2(b) shows that the imbalances of covariate transformations are substantially smaller with the FSM than with CRD, RR 0.01, and RR 0.001. In fact, the FSM achieves a 70% reduction in the mean ASMD with respect to RR 0.001. Thus, although the FSM and RR 0.001 exhibit comparable balance on the main covariates, the FSM balances these transformations of the covariates much better than RR 0.001. This highlights the improved robustness of the FSM against model misspecification, as discussed previously in the context of Theorem 4.2(b). (For the FSM, the implicit potential outcome model is the same as the model used to specify the D-optimal selection function. Although rerandomization does not explicitly model the potential outcomes, an implicit model can be conceptualized from the covariates, or transformations thereof, used to construct the Mahalanobis distance.) Moreover, reducing the tuning parameter of rerandomization from 0.01 to 0.001 yields only a 2% improvement in the mean ASMD. (In fact, for some covariate transformations, reducing this tuning parameter exacerbates imbalance; see Table A4 in the Online Supplementary Materials.) In Figure 2(b), both RR 0.01 and RR 0.001 often produce ASMDs larger than 0.1, and in some cases larger than 0.5, indicative of substantial imbalances on these covariate transformations.

Under rerandomization, balance on the squares and pairwise products of the covariates can be improved by explicitly incorporating these transformations in the Mahalanobis distance. For instance, with $p$ continuous covariates, the Mahalanobis distance needs to include $p(p+3)/2$ variables to control the imbalances on the means of all the covariates and their squares and pairwise products. However, when this number of variables is large, calculating the Mahalanobis distance becomes computationally expensive and, in the extreme case (when it exceeds the sample size), infeasible. The FSM, by contrast, only requires the $p$ main covariates for computing the D-optimality criterion (see Theorem 4.1) to produce adequate balance on these transformations.

For each method, we also compare balance in the overall correlation structure of the covariates. Let $R_t$ denote the sample correlation matrix of the covariates in group $t$, $t = 1, 2$. As a measure of imbalance, we consider the Frobenius norm of $R_1 - R_2$, denoted by $\|R_1 - R_2\|_F$. (The Frobenius norm of a matrix is the square root of the sum of squares of all its elements.) Smaller values of $\|R_1 - R_2\|_F$ are indicative of better balance on the correlation matrix of the covariates between the two groups. Figure 3 shows boxplots of the distributions of $\|R_1 - R_2\|_F$. The FSM outperforms the other three designs with an at least 75% smaller average $\|R_1 - R_2\|_F$. In particular, among the 800 randomizations, the highest value of $\|R_1 - R_2\|_F$ under the FSM is smaller than the corresponding lowest value under the other three designs. In other words, in terms of the correlation structure (and hence the interactions) of the covariates, the least balanced of the 800 FSM realizations exhibits better balance than the best balanced realization of the 800 complete randomizations and rerandomizations.
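This imbalance measure on the correlation structure is also simple to compute; a Python sketch with stand-in data follows.

```python
import numpy as np

def correlation_imbalance(X, Z):
    """Frobenius norm of the difference between the two groups'
    sample correlation matrices of the covariates."""
    C1 = np.corrcoef(X[Z == 1], rowvar=False)
    C0 = np.corrcoef(X[Z == 0], rowvar=False)
    return np.linalg.norm(C1 - C0, ord="fro")

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 6))
Z = rng.permutation(np.repeat([0, 1], 60))
print(correlation_imbalance(X, Z))
```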

Figure 3: Distributions of discrepancies between the correlation matrices of the covariates in the treatment and control groups (as measured by the Frobenius norm) across 800 randomizations. The FSM produces substantially lower discrepancies than the other three methods, indicating markedly improved balance on the correlations of the covariates.

Finally, we evaluate balance on the joint distribution of the covariates. To this end, we use two recently proposed non-parametric graph-based tests for equality of multivariate distributions (agarwal2019distribution). In Table A5 of the Online Supplementary Materials, we show the average p-values of the two tests for each design. Since each method ensures covariate balance in expectation, the average p-values for both tests are all substantially greater than the typical 0.05 level. Nevertheless, the average p-value for the FSM is the highest among the four designs, indicating improved balance in aggregate on the joint distribution of the covariates.

6.3 Efficiency

We now compare the efficiency of the methods under both model- and randomization-based approaches to inference. Under the model-based approach, we consider one potential outcome model that is linear in the main covariates (Model A1) and another that is linear in the main covariates and all their second-order transformations (Model A2). For each potential outcome model, we fit the corresponding observed outcome model by OLS and estimate the PATE using the regression imputation method described in Section 5. Tables 2(a) and 2(b) show the average and maximum model-based standard error (SE) of the regression imputation estimator relative to the FSM across 800 randomizations under the two models.

Designs
CRD RR 0.01 RR 0.001 FSM
Average SE 1.03 1.00 1.00 1.00
Maximum SE 1.13 1.00 1.00 1.00
(a) Model A1
Designs
CRD RR 0.01 RR 0.001 FSM
Average SE 1.39 1.27 1.26 1.00
Maximum SE 3.61 1.97 1.80 1.00
(b) Model A2
Table 2: Average and maximum model-based standard errors relative to the FSM across randomizations. Under Model A1 (linear model on the main covariates), the FSM and RR exhibit similar performance, improving over CRD. Under Model A2 (linear model on the main covariates and their second-order transformations), the FSM is considerably more efficient than both CRD and RR.

Under Model A1, since both rerandomization and the FSM are able to adequately balance the means of the original covariates, they lead to lower SEs (hence, higher efficiency) than CRD. Across randomizations, the worst-case SE under CRD is 13% larger than under RR 0.01, RR 0.001, and the FSM. Under Model A1, the FSM has model-based SEs similar to the two rerandomization methods. However, under Model A2, the FSM uniformly outperforms the other three designs: the average and maximum SEs under RR 0.001 are 26% and 80% larger than under the FSM. This improvement in efficiency can be attributed to the balance achieved by the FSM on the main covariates and their squares and pairwise products. In sum, when the model assumed at the design stage is correct and is used at the analysis stage, the FSM is as efficient as the two rerandomization methods for estimating the treatment effect. However, when the model assumed at the design stage is misspecified and later corrected by augmenting transformations of the covariates (e.g., squares and pairwise products), the FSM is considerably more efficient and robust than the other designs.

Under the randomization-based approach, we compare the standard errors of the difference-in-means statistic under each design. Following hainmueller2012balancing, the potential outcomes are generated using two models: one linear in the main covariates (Model B1) and one linear in the squares and pairwise products of the covariates (Model B2), with independent normal errors. Both generative models satisfy the sharp null hypothesis of zero treatment effect for every unit, and hence the SATE is zero. Conditional on these potential outcomes, the SATE is estimated under each design using the standard difference-in-means estimator. The corresponding randomization-based SE of this estimator is obtained by generating 800 randomizations of the design and computing the standard deviation of the difference-in-means estimator across these 800 randomizations. Table 3 shows the randomization-based SE of the difference-in-means statistic under each model.

Designs
CRD RR 0.01 RR 0.001 FSM
SE 2.72 1.26 1.08 1.00
(a) Model B1
Designs
CRD RR 0.01 RR 0.001 FSM
SE 5.69 4.56 4.47 1.00
(b) Model B2
Table 3: Average randomization-based standard errors relative to the FSM. The standard error for the FSM is 0.2 under Model B1 (linear model on the main covariates) and 0.43 under Model B2 (linear model on the main covariates and their second-order transformations). Especially under Model B2, the FSM is considerably more efficient than both CRD and RR.

Under Model B1, the potential outcomes depend linearly on the covariates, and therefore balancing the means of the covariates improves efficiency. This is reflected in Table 3: the FSM has the smallest SE, closely followed by RR 0.001. Under Model B2, the potential outcomes depend linearly on the squares and pairwise products of the covariates. By better balancing these transformations, the FSM yields a considerably smaller SE than the other designs. In particular, under Model B2, the SE under the FSM is 67% smaller than the SE under RR 0.001. Therefore, as in the model-based approach, under the randomization-based approach the FSM exhibits comparable efficiency to rerandomization under correct specification of the outcome model, and considerable robustness under model misspecification.

7 The Health Insurance Experiment

7.1 Data

We evaluate and compare the performance of the FSM with standard designs using the baseline data of the HIE. To this end, we consider a version of the HIE data presented in aron2013rand. This version includes data on the six cost sharing plans described in Section 2. To make the group sizes more homogeneous, we combine the groups with 25%, 50%, and mixed coinsurance plans. Thus, in our analysis, we have four treatment groups: treatment 1, "free care"; treatment 2, "25%, 50%, or mixed coinsurance"; treatment 3, "95% coinsurance"; and treatment 4, "individual deductible". We consider assigning all families to the four treatment groups and hence do not consider a discard group of non-participants. Moreover, in this version, the units (i.e., families) across five of the six sites are pooled, and we consider randomly assigning all the families in this pooled set to the four treatment groups at once. Due to loss of data, the Dayton site is excluded from this analysis.

We consider a set of family-level baseline covariates comprising scaled covariates, binary covariates, and binary covariates indicating missing data (see Table A6 for a description of each baseline covariate). Using these data, we compare complete randomization, rerandomization, and the FSM in terms of balance and efficiency. For the FSM, we generate the SOM by first using SCOMARS on combined groups of treatments, and then using SCOMARS again to split each combined group into its component groups. For rerandomization, we use two balance criteria, one based on Wilks' lambda statistic (lock2011rerandomization, Section 5.2) and the other on the maximum pairwise Mahalanobis distance between any two treatment groups (morgan2012rerandomization). For each design, we draw 800 independent assignments. The runtime of each of these assignments with the FSM was less than one minute on a Windows 64-bit laptop computer with an Intel(R) Core i7 processor.

7.2 Balance

Figure 4 displays the ASMD distributions across randomizations for the main covariates and the second-order transformations of the scaled covariates in the HIE data. In both cases we see that the FSM outperforms complete randomization and rerandomization. While rerandomization balances the main covariates better than complete randomization, this advantage is less marked than in the previous simulation study and disappears for the transformations of the covariates. In fact, the average imbalances for these transformations are very similar between complete randomization (0.055) and rerandomization (0.052), and with both methods it is common to see imbalances greater than 0.1 ASMD. With the FSM, however, the average imbalance (0.02) is less than half of those under CRD and RR, and extreme imbalances are essentially absent.

(a) Main covariates
(b) Squares and pairwise products
Figure 4: Distributions of absolute standardized mean differences (ASMD) of the main covariates and their second-order transformations in the HIE data. The legends present the average ASMD across simulations for the four methods. On average, the FSM substantially outperforms CRD and RR in terms of both the main covariates and their second-order transformations.

A related question is how well the methods balance all second-order features of the joint distribution of the covariates. Figure 5 provides an answer in the boxplots of the discrepancies between correlation matrices across randomizations. As with the aforementioned second-order transformations, complete randomization and rerandomization perform similarly, and the FSM improves on both considerably, with a median discrepancy about three times smaller. Arguably, one could improve the performance of rerandomization, for example by restricting imbalances on these transformations via the rerandomization criterion; however, unlike with the FSM, this may incur increased computational cost and may require additional tuning parameters. Moreover, these transformations can also be included in the FSM model, which would then also improve balance on higher-order transformations of them.

Figure 5: Distributions of discrepancies between the correlation matrices of the covariates in the treatment groups of the HIE data across randomizations. The discrepancies are measured by $\|R_t - R_{t'}\|_F$, where $R_t$ is the sample correlation matrix of the covariates in treatment group $t$ and $\|\cdot\|_F$ is the Frobenius norm. The FSM systematically produces lower discrepancies than the other methods, exhibiting substantially improved balance on the correlations of the covariates.

7.3 Efficiency

As in the simulation study, we evaluate efficiency under model- and randomization-based approaches to inference. The main difference between the model- and randomization-based standard errors is that in the model-based approach the variance calculation does not explicitly take into account the variability arising from the randomization distribution, whereas in the randomization-based approach it does. For illustration, here we consider estimating the average treatment effect of treatment 3 relative to treatment 2, i.e., $\mathrm{SATE}(3, 2)$ and $\mathrm{PATE}(3, 2)$.

Under the model-based approach, we consider two potential outcome models, one that is linear in the main covariates (Model A3), and another that is linear in the main covariates and the second-order transformations of the scaled covariates (Model A4). The results are summarized in Table 4. While the performance of the three methods is similar under Model A3, under Model A4 there are substantial differences, with the FSM outperforming both complete randomization and rerandomization. In fact, under Model A4, the average SEs under CRD and RR are 13-15% larger, and the maximum SEs 50-69% larger, than under the FSM.

Designs
CRD RR Wilks RR Mahalanobis FSM
Average SE 1.02 1.01 1.01 1.00
Maximum SE 1.04 1.02 1.02 1.00
(a) Model A3
Designs
CRD RR Wilks RR Mahalanobis FSM
Average SE 1.15 1.13 1.13 1.00
Maximum SE 1.69 1.50 1.55 1.00
(b) Model A4
Table 4: Average and maximum model-based standard errors relative to the FSM across randomizations. Under Model A3 (linear model on the covariates), the FSM is slightly more efficient than RR and CRD. Under Model A4 (linear model on the covariates and their second-order transformations), the FSM is considerably more efficient than CRD and RR.

Under the randomization-based approach, we consider a generative model that is linear in the main covariates (Model B3) and one that is linear in the main covariates and the second-order transformations of the scaled covariates (Model B4), both with independent normal errors. As in the simulation study, both generative models satisfy the sharp null hypothesis of zero treatment effect for every unit, and hence the SATE is zero. Under each design, the SATE is estimated using the standard difference-in-means estimator, and the corresponding randomization-based SE is obtained by generating 800 randomizations and computing the standard deviation of the estimator across these 800 randomizations. The results are summarized in Table 5. In terms of efficiency, we see again a clear advantage of the FSM. Under Model B3, the average standard errors of the two rerandomization methods are 73% and 83% larger than that of the FSM. Under Model B4, this difference is accentuated, and the average standard errors of rerandomization are 242% and 236% larger.

Designs
CRD RR Wilks RR Mahalanobis FSM
SE 2.36 1.73 1.83 1
(a) Model B3
Designs
CRD RR Wilks RR Mahalanobis FSM
SE 3.95 3.42 3.36 1
(b) Model B4
Table 5: Randomization-based standard errors relative to the FSM. The standard error for the FSM is 0.12 under Model B3 (linear model on the covariates) and 0.67 under Model B4 (linear model on the covariates and their second-order transformations). Under both models, the FSM is considerably more efficient than both CRD and RR.

7.4 Explanation

This analysis illustrates some important differences between the FSM and RR. First is the criterion for assignment. While RR uses the Mahalanobis distance, the FSM uses the D-optimality criterion, which, coupled with a suitable SOM, leads to robust assignments under a more general class of potential outcome models. Second is the use of this criterion. While RR essentially “constrains” the allowable treatment assignments, the FSM “optimizes” them toward the criterion. In essence, RR solves a feasibility problem whereas the FSM solves a maximization problem. Furthermore, the feasibility problem solved by RR depends on the balance threshold, which may be difficult to choose in practice. A very high value of this threshold bears the risk of accepting an assignment with poor covariate balance, whereas a very low value can be computationally onerous.

Third is the step-wise assignment of units into treatment groups. While RR assigns all units in one step and then discards imbalanced assignments, the FSM assigns units one at a time in a random but optimal fashion governed by the selection order and the selection criterion. This difference is crucial because in experiments like the HIE, with several treatment groups and many covariates, the space of possible treatment assignments is vast. As shown in our analysis, optimally selecting among these assignments in a step-wise manner can make a substantial difference in terms of balance, efficiency, computational time, and, ultimately, the use of ever-scarce resources available for experimentation. (Figures 1 and 4 show that, although RR does well under common balance standards, with mean differences systematically lower than the typical threshold of 0.1 ASMD, there is room to select better, more balanced, random treatment assignments, which is what the FSM achieves.)

8 Practical considerations and extensions

8.1 Multi-group experiments

As discussed, the FSM can readily handle experiments with multiple treatment groups. In so doing, the key methodological consideration is the choice of the SOM. As in two-group experiments, with multiple groups it helps to generate an SOM that is randomized and sequentially controlled, so that at every stage of the random selection process the number of selections made by each treatment group up to that stage is close to its fair share. Formally, we say that an SOM is sequentially controlled if $|n_t(j) - E\{n_t(j)\}| < 1$ for every group $t$ and stage $j$, where $n_t(j)$ is the number of selections made by group $t$ up to stage $j$ and $E\{n_t(j)\}$ is the corresponding expected number of selections. While the construction of a sequentially controlled SOM for multi-group experiments with arbitrary group sizes is an open problem, such constructions are possible for several practically relevant configurations of the group sizes. For example, in multi-group experiments with groups of equal size, we can generate a sequentially controlled SOM by successively generating random permutations of the group labels $1, 2, \ldots, g$. See Definition 1 and Proposition A2 for a formal statement of this algorithm and a proof.
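The equal-group-size construction can be sketched directly: concatenating independent random permutations of the group labels guarantees that, after every block of $g$ stages, each group has made exactly one more selection. A minimal Python sketch (function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

def som_equal_groups(g, n_per_group):
    """Selection order for g equal-sized groups: a concatenation of
    successive independent random permutations of the labels 1..g."""
    blocks = [rng.permutation(np.arange(1, g + 1)) for _ in range(n_per_group)]
    return np.concatenate(blocks)

order = som_equal_groups(g=3, n_per_group=4)
print(order)   # length 12; each label appears exactly once per block of 3
```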

In multi-group experiments with groups having one of two distinct sizes, we can generate a sequentially controlled SOM by combining the groups of equal size, then using SCOMARS on the combined group labels and randomly permuting the component group labels within each combined group (see Theorem A3 for details). For example, if the groups of the first size jointly contain 20 units and the groups of the second size jointly contain 63 units, this algorithm first generates an SOM at the level of the two combined groups (of sizes 20 and 63), and then splits each combined group into its component groups. Experiments with multiple groups of two sizes arise, for example, in clinical settings where an exploratory treatment is evaluated with greater precision in a larger group and conventional treatments are applied to smaller groups of equal size.

Finally, in multi-group experiments with groups of more than two distinct sizes such that, when combined by groups of equal size, the combined groups have the same total size, we can again obtain a sequentially controlled SOM by randomly permuting across and within the combined group labels. See Theorem A4 for details. For example, with group sizes such that the equal-size groups combine into three groups of 60 units each, this algorithm first generates an SOM at the level of the three combined groups (each of size 60), and then splits each combined group into its component groups. In practice, for more general group size configurations, one strategy for generating an SOM is to first identify one of the previous three configurations that is structurally similar to the configuration at hand, and then use the corresponding SOM-generating algorithm. The resulting SOM may not always be sequentially controlled, but it is still likely to produce a well-controlled randomized selection order.

8.2 Stratified experiments

In stratified experiments, units are grouped into two or more predefined strata, and within each stratum units are randomly assigned to treatment groups. By Theorem 4.3, when the strata are of equal size, the FSM with the stratification variables as covariates automatically retrieves a stratified randomized experiment, without explicitly randomizing units within each stratum. However, in more general stratified experiments, the FSM needs to be extended to explicitly account for possibly varying stratum sizes and shares of the treatment groups within each stratum. Here we discuss a family of such extensions of the FSM, which we term the stratified FSM.

We consider stratified experiments where the treatment group sizes within each stratum are set by the investigator beforehand. To accommodate the FSM to such experiments, we again need to carefully construct an SOM. In the stratified FSM, we append the SOM with an additional column of stratum labels, indicating which stratum the treatment group selects from at each stage of the selection process. This column of stratum labels is specified in such a way that by construction, the resulting SOM is consistent with the pre-fixed treatment group sizes within each stratum. To this end, we discuss two potential approaches below.

Conceptually, the most straightforward approach is to generate a separate SOM for each stratum. This is equivalent to setting the column of stratum labels to a run of 1's of length $N_1$, followed by a run of 2's of length $N_2$, and so on, where $N_s$ is the size of the $s$th stratum, $s = 1, \ldots, S$, and $S$ is the number of strata. This approach is easy to implement and can be useful if, e.g., data on each stratum become available at different stages of the experiment, akin to a sequential experiment (see Section 8.3). However, in this approach the treatment groups only get to explore the covariate space of a single stratum for a number of successive stages of selection and hence may not make the most efficient choices. We address this issue with an alternative approach. For ease of exposition, we consider two strata, 1 and 2. Let $n_{t1}$ and $n_{t2}$ be the (fixed) sizes of treatment group $t$ in strata 1 and 2, respectively, so that $n_t = n_{t1} + n_{t2}$. In this approach, we first generate a usual SOM with group sizes $n_1, \ldots, n_g$. For each treatment $t$, we then select the order of the strata that treatment $t$ chooses from by running a SCOMARS algorithm with group sizes $n_{t1}$ and $n_{t2}$. By allowing the treatment groups to select units from different strata in a balanced manner, this approach mimics the unstratified FSM, where the covariate space of the entire sample is explored when choosing units. Also, by design, this approach satisfies the size requirement of each treatment group within each stratum.
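The stratum-at-a-time label column of the first approach can be written down directly (a Python sketch; the function name and sizes are illustrative):

```python
import numpy as np

def stratum_label_column(stratum_sizes):
    """Stratum labels for the stratum-at-a-time stratified FSM:
    stratum s contributes a consecutive run of s's of length N_s."""
    return np.concatenate(
        [np.full(n_s, s + 1) for s, n_s in enumerate(stratum_sizes)]
    )

print(stratum_label_column([4, 6]))   # [1 1 1 1 2 2 2 2 2 2]
```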

8.3 Sequential experiments

Sequential experiments are experiments where units arrive sequentially in batches, possibly of varying sizes. In this section, we discuss extensions of the FSM to sequential experiments, which we term the batched FSM. For simplicity of exposition, we consider assigning units into equal-sized groups. Let $n^{(b)}$ denote the size of the $b$th batch. Using random permutations of the group labels, we can generate a sufficiently large SOM and use the first $n^{(1)}$ rows as an SOM for the first batch, the next $n^{(2)}$ rows as an SOM for the second batch, and so on. Given a new batch and its corresponding SOM, the simplest approach is to treat the batch as a fresh new sample and assign units using the usual FSM. However, this approach completely ignores the covariate information of the units already assigned. Therefore, in general, this approach fails to correct for any existing covariate imbalances among the treatment groups. To alleviate this, we consider an alternative approach that explicitly takes into account the covariate information of both the current and the previous batches. Given a new batch and its corresponding SOM, we run the FSM as if the new batch were a continuation of the previous batch. In other words, we use the design matrix based on all the units already assigned to the choosing treatment group to evaluate the D-optimal selection function for each unit in the new batch, and select the unit that maximizes the selection function. By carrying over the existing design matrix to the new batch, this approach tends to correct for any existing covariate imbalances.
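The carry-over idea can be sketched with the D-optimal criterion of Lemma A1: when a batch arrives, the choosing group scores each candidate using the design matrix of everything it has already selected. The ridge term and all names below are assumptions of this Python sketch, not part of the paper's exact formulation.

```python
import numpy as np

def d_optimal_pick(B, candidates, ridge=1e-6):
    """Greedy D-optimal choice: among candidate rows x, pick the one
    maximizing det(B'B + xx') = det(B'B) * (1 + x'(B'B)^{-1}x).
    The small ridge keeps B'B invertible in early stages."""
    M_inv = np.linalg.inv(B.T @ B + ridge * np.eye(B.shape[1]))
    gains = np.einsum("ij,jk,ik->i", candidates, M_inv, candidates)
    return int(np.argmax(gains))

rng = np.random.default_rng(7)
B = rng.normal(size=(10, 3))         # units already assigned to the group
new_batch = rng.normal(size=(5, 3))  # freshly arrived batch of units
# Carrying B over across batches lets the new selection correct
# imbalances accumulated in earlier batches.
print(d_optimal_pick(B, new_batch))
```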

An important practical consideration for batched FSM is the size of the batches. For a fixed total number of units enrolled in the experiment, increasing the batch size (and hence, reducing the number of batches) tends to increase the overall balance and efficiency properties of the FSM. On one extreme, a batched FSM with a single batch containing all the units enrolled in the experiment reduces to the usual FSM, which tends to have the highest efficiency among all possible batched FSMs. On the other extreme, a batched FSM with multiple batches of size one each is equivalent to CRD, which ignores the covariate information. How to optimally determine the batch sizes for the FSM is an important open question for practice.

9 Summary and remarks

We revisited, formalized, and extended the FSM for experimental design. We proposed a new selection function based on D-optimality that requires no tuning parameters. We showed that, equipped with this selection function, the FSM has a number of appealing properties. First, the FSM is affine invariant and hence, it self-standardizes covariates with possibly different units of measurements. Second, for approximately symmetric data, the FSM yields near-exact balance on a large class of covariate transformations, including transformations that are not part of the assumed linear model under the FSM. Third, the FSM produces randomized block designs without explicitly randomizing in each block. Fourth, the FSM also produces matched-pair designs without explicitly constructing the matched pairs beforehand and randomizing within each pair. We described how both model-based and randomization-based inference on treatment effects can be conducted using the FSM. For a range of practically relevant configurations of group sizes in multi-group experiments, we proposed new algorithms to generate a fair and random selection order of treatments under the FSM. We also discussed potential extensions of the FSM to stratified and sequential experiments. In a simulation study and a case study on the RAND Health Insurance Experiment, we showed that the FSM is a robust approach to randomization, exhibiting better performance than complete randomization and rerandomization in terms of balance and efficiency.

While there are settings where complete randomization may perform better than the FSM in terms of efficiency, such settings are less common and involve jagged, i.e., highly non-smooth, potential outcome models. In practice, where investigators believe there is reasonable smoothness in the models, the FSM is expected to perform well. Overall, through our extensive explorations with real and simulated experimental data, the FSM has consistently stood out as a robust design that can handle multiple treatment groups and a fairly large number of categorical and continuous covariates, without requiring tuning parameters and without the need to coarsen covariates. We recommend giving strong consideration to the FSM in experimental design for its conceptual simplicity, practicality, and robustness.

References

Supplementary Materials

A Notation and estimands

Full sample size
Index of unit,
Number of treatments
Index of treatment group,
Size of treatment group
Number of baseline covariates
Observed vector of baseline covariates of unit
matrix of covariates in the full sample
design matrix in the full sample
vector of means of the baseline covariates in the full sample
covariance matrix of the baseline covariates in the full sample
Potential outcome of unit under treatment
Vector of potential outcomes under treatment ,
Treatment assignment indicator of unit ,
Vector of treatment assignment indicators,
Observed outcome of unit ,
Table A1: Notation
Unit level causal effect of treatment relative to treatment for unit ;
, the Sample Average Treatment Effect of treatment relative to treatment
, the Population Average Treatment Effect of treatment relative to treatment
Table A2: Estimands

B Proofs of theoretical results

Lemma A1.

Let treatment 1 be the choosing group at the $j$th stage, and let $B$ be the design matrix of treatment group 1 after the $(j-1)$th stage, with one row per unit selected so far. For an available unit $i$, let $b_i$ denote its covariate (basis) vector. The D-optimal selection function chooses the unit $i^*$ with covariate vector $b_{i^*}$, where

$i^* = \arg\max_i \, b_i^\top (B^\top B)^{-1} b_i.$   (5)

Proof.

We follow the notation outlined in Section 4.1. At the $j$th stage, the D-optimal selection function selects the unit $i^*$ that maximizes $\det(B^\top B + b_i b_i^\top)$ over the available units $i$. Now, for each available unit $i$,

$\det(B^\top B + b_i b_i^\top) = \det(B^\top B)\,\det\{I + (B^\top B)^{-1} b_i b_i^\top\}$   (6)
$= \det(B^\top B)\,\det\{1 + b_i^\top (B^\top B)^{-1} b_i\}$   (7)
$= \det(B^\top B)\,\{1 + b_i^\top (B^\top B)^{-1} b_i\},$   (8)

where the second equality holds since, for an $m \times n$ matrix $U$ and an $n \times m$ matrix $V$, $\det(I_m + UV) = \det(I_n + VU)$. Equation (8) implies that the selected unit maximizes $b_i^\top (B^\top B)^{-1} b_i$. This completes the proof.
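The determinant identity used in this proof, $\det(B^\top B + b b^\top) = \det(B^\top B)\{1 + b^\top (B^\top B)^{-1} b\}$, is easy to verify numerically; the following Python check uses arbitrary random inputs.

```python
import numpy as np

rng = np.random.default_rng(8)
B = rng.normal(size=(10, 4))   # any full-column-rank design matrix
b = rng.normal(size=4)         # candidate covariate vector

M = B.T @ B
lhs = np.linalg.det(M + np.outer(b, b))
rhs = np.linalg.det(M) * (1 + b @ np.linalg.inv(M) @ b)
print(abs(lhs - rhs) / abs(rhs))   # relative error near machine precision
```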

Proof of Theorem 4.1

Proof.

We use the notation in Section 3.1 and Table A1. We first consider the case where