# Optimal Rerandomization via a Criterion that Provides Insurance Against Failed Experiments

We present an optimized rerandomization design procedure for a non-sequential treatment-control experiment. Randomized experiments are the gold standard for finding causal effects in nature. But sometimes random assignments result in unequal partitions of the treatment and control group, visibly seen as imbalanced observed covariates, increasing estimator error. There is also imbalance on unobserved covariates which likewise increase estimator error. Rerandomization can throw away poor assignments only in the observed covariates by limiting the imbalance to a prespecified threshold. Limiting this threshold too much can increase the risk of having error due to unobserved covariates. We introduce a criterion that gauges errors due to both imbalance in the observed and the risk of imbalance in the unobserved covariates. We then use this criterion to locate the optimal rerandomization threshold based on the practitioner's level of desired insurance. We also provide an open source R package available on CRAN named OptimalRerandExpDesigns which generates designs according to our algorithm.

## Authors

• 7 publications
• 4 publications
• 3 publications
• 10 publications
09/06/2021

### Rerandomization with Diminishing Covariate Imbalance and Diverging Number of Covariates

Completely randomized experiments have been the gold standard for drawin...
08/13/2020

### Improving the Power of the Randomization Test

We consider the problem of evaluating designs for a two-arm randomized e...
12/06/2020

### Better Experimental Design by Hybridizing Binary Matching with Imbalance Optimization

We present a new experimental design procedure that divides a set of exp...
07/14/2020

### Network Flow Methods for the Minimum Covariates Imbalance Problem

The problem of balancing covariates arises in observational studies wher...
11/06/2019

### A Comparison of Methods of Inference in Randomized Experiments from a Restricted Set of Allocations

Rerandomization is a strategy of increasing efficiency as compared to co...
08/18/2019

### Spectral inference for large Stochastic Blockmodels with nodal covariates

In many applications of network analysis, it is important to distinguish...
08/14/2018

### Ridge Rerandomization: An Experimental Design Strategy in the Presence of Collinearity

Randomization ensures that observed and unobserved covariates are balanc...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction and Background

We consider a classic problem: a treatment-control experiment with the subjects (individuals, participants or units) and one clearly defined outcome of interest (also called the response or endpoint) for each subject denoted . We scope our discussion to the response being continuous and uncensored with incidence and survival being further research.

Each subject is assigned to the treatment group and a control group denoted and and referred to as the two arms. Here we consider the settings where all subjects along with their observed subject-specific covariates (measurements or characteristics) denoted known beforehand and considered fixed. This non-sequential setting was studied by Fisher (1925) when assigning treatments to agricultural plots and is still of great importance today. In fact, they occur in clinical trials as “many phase I studies use ‘banks’ of healthy volunteers … [and] … in most cluster randomised trials, the clusters are identified before treatment is started” (Senn, 2013, page 1440). We discuss in our concluding section about how are work can extend to the widely-used sequential setting where subjects enter one-by-one and must be immediately assigned to or .

Synonymously referred to as a randomization, an allocation or an assignment

is a vector

whose entries indicate whether the subject received (coded as +1) or (coded as -1) and thus this vector must be an element in .

The goals of such experiments are to estimate and test a population average treatment effect (PATE) denoted where (the multiplicative factor of 2 would not be present if was coded with 0 and 1).

After the sample is provided, the only degree of control the experimenter has is to choose the entries of . This choice of division of the subjects to the two arms we term a design, strategy, algorithm, method or procedure

and denote it as the random variable

, a generalized multivariate Bernoulli mass function, and support , termed the allocation space. If all realizations satisfy , i.e. all of the assignments feature an equal number of subjects assigned to the treatment group and the control group, , we term a forced balance procedure following Rosenberger  Lachin (2016, Chapter 3.3).

The question then becomes: what is the best design strategy? This has been debated for over 100 years with loosely two main camps: those that optimize assignments and those that randomize assignments. The latter, first advocated by Fisher (1925), has prevailed, certainly in the context of causal inference and clinical trials. We show here that the design being “best” is dependent on assumptions of the response model, the degree to which observed covariates matter relative to the unobserved covariates and the fundamental assumption of where the randomness comes from. Then it depends further on choices: the choice of estimator for and the choice of metric to gauge optimality. There is also the additional concern with how to draw inference which will be different depending on these assumptions and choices.

To build intuition about the complexity at hand, consider the following simplistic model

 y=βTw+x+z (1)

where comes from a forced balance procedure and will be explained shortly. To make the expressions even simpler, we assume

is centered and scaled so its entries have average zero and standard deviation one.

We denote the responses coming from random variable but there are loosely two perspectives as to the source of this response randomness. Our work focuses on the randomization model which we defend after introducing both perspectives. To simplify our discussion in this section, we consider estimation only (not hypothesis testing).

### 1.1 Population Model Perspective

In the population model (Rosenberger  Lachin, 2016, Chapter 6.2), subjects in the treatment group are sampled at random from an infinitely-sized superpopulation and thus the responses in the treatment group are considered independent and identically distributed with density . Subjects in the control group are sampled analogously from a different superpopulation density .

In the simple model of Equation 1, the population model can be created via the assumption that is a random draw from such that and its entries are independent. The law of iterated expectation yields that the distribution of the are mean-centered. Each experiment gets one draw from which is usually referred to as “noise” or “measurement error”. To make our expressions simpler, we also assume homoskedasticity so that each of the

’s share variance denoted

.

It is straightforward to demonstrate that the strategy’s mass function is ancillary to the likelihood of the responses and thus the randomization procedure can be ignored during estimation.

To investigate optimality in this perspective, we now need an estimator and a criterion. We consider for now two widely-used estimators. First is the simple differences-in-means estimator, where is the random variable of the average response in the treatment group and is the random variable of the average response in the treatment group. If the model were to be unknown,

would be the uniformly minimum variance unbiased estimator for the PATE

(Lehmann  Casella, 1998, Example 4.7). Second is the covariate-adjusted regression estimator, defined as the slope coefficient of in an ordinarly least squares fit of using and . If is known to be linear in , then is known to be the best linear unbiased estimator for the PATE.111The employment of in the realistic setting where the linear model assumption is untestable and most likely wrong is a subject of a large debate (Freedman, 2008; Berk ., 2010; Lin, 2013). In this work, we do not take sides in this debate and offer optimal designs for both estimators. Since was standardized, the intercept was omitted in this example.

Table 1

provides the expression of the estimator, its expectation, variance and mean squared error (MSE) for both the differences-in-means and linear regression estimators. The expectations are taken over the distribution of

, as this is the source of randomness in the response and thus the expressions are conditional on and . To simplify the expressions we denote the imbalance in as and the imbalance in as where the bar notation denotes sample average and the subscript indicates arm and , a covariance-like term measuring the relatedness of the observed and unobserved measurements.

If the optimal design is selected by the MSE, then the best assignment is the one that minimizes , the imbalance in the observed covariate. This is the main justification used to advocate against randomization and instead use deterministic optimal designs.

Here, the strategy is degenerate where for , the MSE-optimal assignment. To find in practice is more difficult. One can formulate the procedure as a binary integer programming problem. If forced balance is required, the partitioning problem is NP hard, meaning that there does not exist a known algorithm that can find the optimal allocation in polynomial time. There are approximations that run in polynomial time that are usually close enough for practical purposes, e.g. branch and bound (Land  Doig, 1960).

Tangentially, we also observe that if the model is linear, then is more efficient than as its variance is when using two terms in a geometric series approximation222And thus is only appropriate when . versus the variance of the simple estimator . This will become important when we discuss our methodology in sec. 2

Once again, those that assume the population model prefer optimal deterministic design.

### 1.2 Randomization Model Perspective

In the randomization model (Rosenberger  Lachin, 2016, Chapter 6.3) also called the “Fisher model” or “Neyman model”, the source of the randomness is in the treatment assignments . “The subjects are the population of interest; they are not assumed to be randomly drawn from a superpopulation” (Lin, 2013, page 297). Table 2 provides the expression of the estimator, its expectation, variance and mean squared error (MSE) for both the differences-in-means and linear regression estimators. The expectations are taken over the distribution of , as this is the source of randomness in the response and thus the expressions are conditional on and .

The response model of Equation 1 under the randomization model perspective was studied by Kapelner . (2019) and the expressions in Table 2 can be found therein. To derive these expressions, there was one additional assumption placed on that we will make precise in Section 2: for any assignment , the assignment where the treatment group subjects are swapped for the control groups subjects, , is equally likely. Also note that for , no closed form expressions are available so they are approximated to the third order in a geometric series.

The term denotes , the variance-covariance matrix of the strategy that features as ones along the diagonal and off-diagonals that gauge the covariance between subject ’s assignment and subject ’s assignment. is defined to be the component of orthogonal to . It makes sense that this term is important in the error of the linear regression estimator as the linear regression can only adjust for the component of it has access to via .

The most striking observation is that the optimal design strategy is not clear from the MSE expressions as they were in the population model perspective. Our response model was deliberately picked to be the most simplistic and our estimators were chosen to be the most popular. Further, if optimal design were to be defined by minimal MSE, it cannot be resolved as is unknown, a practical problem addressed this problem in Section 2.

We note that there is finite sample bias in the linear regression estimator because the regression assumed a response model that omitted the additive term. However, this bias is small as noted by Freedman (2008) and Lin (2013).

### 1.3 Our Assumption: the Randomization Model Perspective

The debate between the population model and randomization model is long Reichardt  Gollob (see for instance 1999, pages 125-127) and there are compelling reasons for the population model as well. In this work, we employ the latter perspective following Rosenberger  Lachin (2016); Freedman (2008); Lin (2013) and others.

The population model requires the subjects to truly be sampled at random from a superpopulation. However, in most experiments, subjects are recruited from nonrandom sources in nonrandom locations. Experimental settings are frequently selected because of expertise, an ability to recruit subjects, and their budgetary requirements (Rosenberger  Lachin, 2016, page 99). In the context of clinical trials, Lachin (1988, page 296) states rather harshly that, “the invocation of a population model for the analysis of a clinical trial becomes a matter of faith that is based upon assumptions that are inherently untestable”.

Further, there is another reason we would like to point out. Since is deterministic under the randomization model, this implies is fixed and we believe it is realistic to believe so. Consider a clinical trial where subject gender, height and weight are assessed, the observed covariates. It is not difficult to accept that these measurement values are affixed to the subjects. However, there are many other measurements that are important for the endpoint in the study such as presence or absence of rare diseases, presence or absense of genetic SNP’s, etc. These unobserved measurements combine together to form the ’s and are just as fixed to the subjects as the observed measurements.

### 1.4 The Rerandomization Design

Assigning each individual to treatment by an independent fair coin flip we will term a the Bernoulli Trial (Imbens  Rubin, 2015, Chapter 4.2). In the Bernoulli Trial, all assignments are independent and thus where the latter notation indicates the identity matrix. Any other design is termed a restricted randomization because the permissible allocations are restricted to a subset of all possible allocations. A very weak restriction is that of forced balance. If each forced balance allocation is equally likely, we term this as the balanced completely randomized design (BCRD) as in Wu (1981). Here, there is a slight, but vanishing negative covariance between the assignments and thus the off-diagonal elements of are all .

However, there is a large problem with employing bernoulli or BCRD assignments that was identified at the outset: under some unlucky random assignments there are unluckily large differences in the distribution of observed covariates between the two groups. Running the experiment under a particular unlucky assignment is destructive to both perspectives. This can be seen in our simple model of Equation 1, the estimator features an additive term. In the population model perspective, the MSE of suffers an additive penalty of and the MSE of suffers a multiplicative penalty of derived from the specific unlucky assignment (see Table 1). In the randomization model perspective, the MSE of suffers an additive penalty of and the approximate MSE of also suffers an additive penalty but it is more difficult to see as it is buried in the quartic form (see Table 2). These penalties are due to the presence of many unlucky assignments in .

Thus, in the randomization model, the MSE for both estimators can be diminished if these unlucky assignments are eliminated from , the main reason restricted randomization has been employed for the past 100 years. An algorithm that eliminates such ’s can go as follows. (1) Realize a from BCRD and measure (2) If this is less than a predetermined threshold , then retain and run the experiment otherwise return to step (1). Here, the variance-covariance matrix of the strategy will be dependent on both and threshold .

This rerandomization procedure is not new. Student (1938, page 366) wrote that after an unlucky, highly imbalanced randomization, “it would be pedantic to [run the experiment] with [an assignment] known beforehand to be likely to lead to a misleading conclusion”. His solution is for “common sense [to prevail] and chance [be] invoked a second time”. Although this rerandomization design is classic, it has been rigorously investigated only recently (Morgan  Rubin, 2012).

However, the selection of the threshold “remains an open problem” (Li ., 2018, page 9162). This threshold is a critical quantity because it controls the degree of randomness in the design. A miniscule will demand the optimal, deterministic design; a large would allow for complete randomization. Hence, solving this problem can bridge the intellectual gap between those that deterministically design experiments via optimization and those that design experiments via randomization.

Herein, we provide an algorithm to find this threshold for both the differences-in-means estimator and the covariate adjusted estimator for a realistic response model.

## 2 Methodology

We assume covariates measured for each subject and collect them into a row vector . We denote as the matrix that stacks the vectors for each subject row-wise. Without loss of generality, we assume that each column in is centered and scaled. We examine a general response model

 y=βTw+Xβ+z. (2)

Here, are the weights for each covariate as a column vector of length . And is the fixed contribution of unobserved measurements for the subjects as in the model of Equation 1 plus the misspecification errors of the true response function minus the linear component, i.e. . Because each column in is standardized, the linear component of this model does not require the additive intercept term.

In order to simplify our expressions, we assume has the mirror property (Assumption 2.1

) whereby the treatment group and control group can be swapped without changing the probability of the assignment.

###### Assumption 2.1 (Mirror Property).

For all , .

In Appendix 5.1 we show that in the randomization model the MSE of is

 MSEw[^βDMT|z,X;β]=1n2(Xβ+z)⊤ΣW(Xβ+z) (3)

and in Appendix 5.3 we show that the MSE of can be approximated to the third order as

 MSEw[^βLRT|z,X,β]≈1n2z⊤(G+2nD)z (4)

where where is the standard orthogonal projection matrix onto the column space of and , an expectation of a quartic form that cannot be simplified. Table 2 shows these expressions for the special case of covariate. Remarkably, this expression is independent of the unknown coefficients, a fact that will allow us to evaluate explicit designs in the following sections.

In theory, the optimal rerandomization strategy is to locate , the threshold corresponding to the minimum of Equation 3 or 4 over . As is varied, the determining matrix of the quadratic form varies as is a function of and . However, in practice this is impossible as (and in the case of ) is unknown.

We now discuss three criterions that remove this dependence on the fixed set of .

### 2.1 Criterions of Design Optimality

#### 2.1.1 The Minimax Design

If we assume nature to be adversarial, we can examine the MSE in the case of the supremum over the MSE, i.e. the worst possible finite vector of values . The optimal minimax strategy would then be the bernoulli trial (Kapelner ., 2019, Theorem 2.1) for and would asymptotically be the bernoulli trial333Not even the weak restriction of the BCRD is permitted (“no free lunch” whatsoever). for , corresponding to . This result is not only trivial but would also correspond to the sitation where is aligned with one arbitrary direction in , a punitively conservative situation.

#### 2.1.2 The Mean Unobserved Design

Can we consider the case of an “average ”? This would be incoherent in the randomization model where is fixed. Regardless, we consider being drawn from a continuous “faux measure” on . This measure is not necessarily believed in reality; rather, it is a useful mathematical device that yields a sensible criterion. We emphasize that the faux measure is not necessary or limiting to the inference we discuss in Section 2.3.

We assume it is mean centered in 2.2 and 2.3, and independence among the subjects’ unobserved covariates. And, only to make our expressions simpler, we assume 2.4, homoskedasticity in .

.

is diagonal.

###### Assumption 2.4 (Faux Measure Homoskedasticity).

for all .

In Appendix 5.2, we derive the mean criterion for ,

 Ez[MSEw[^βDMT|z,X,β]]=σ2zn+1n2β⊤X⊤ΣWXβ (5)

and in Appendix 5.3 we derive the MSE for approximated to the third order as

 Ez[MSEw[^βLRT|z,X,β]]≈σ2zn+σ2zn2tr[X⊤⊥ΣWX⊥] (6)

where we denote as the orthogonalized matrix with columns .

The optimal design for would be the single vector that minimizes . Note that this assumes knowledge of which is unknown in practice. Thus we leave this design problem for future work. The optimal design for would be the single vector that minimizes which remarkably does not depend upon . This is similar to the optimal design under the population model setting as explained in Section 1.1.

Since there are an exponential number of vectors, the corresponding rerandomization threshold will be . Finding this vector is practically impossible as there is no known polynomial-time algorithm. This is not a practical way of selecting a rerandomization threshold. Even if it were, there would be inferential complications with such a deterministic design that we explain in Section 2.3.

#### 2.1.3 The Tail Unobserved Design

The supremum criterion is too conservative and the mean criterion does not incorporate the ruinous effect of vectors that can be near the supremum. Following Kapelner . (2019, Section 2.2.6), we consider a tail criterion, for both the estimators. This criterion gauges the average experimental error at the worst percent of ’s. For example, a would be a criterion that considers the “5% worst worlds”.

As increases, the practitioner takes out insurance on settings that are more and more improbable. As , the only way to insure against every event is more randomness in (and more imbalance) as explained in Section 2.1.1. As decreases towards the quantile corresponding to the mean, minimizing imbalance in becomes more important at the expense of less randomization as seen in Section 2.1.2. Hence the tail provides a tradeoff between the two competing interests of optimality and randomization. We will understand the tradeoff explicitly in Section 2.2.3 but we first describe the algorithm.

### 2.2 Algorithms to Optimize the Tail Criterion

Our Algorithm 1 is an exhaustive search to locate a minimal MSE design for . An analogous algorithm to locate the minimal MSE design for we leave to future work as it depends on , an unknown quantity. We first outline two inputs to the algorithm besides the raw data that are required.

The algorithm takes as input a collection of assignments where is large, so that loosely approximates the full space . Since “rerandomization is simply a tool that allows us to draw from some predetermined set of acceptable randomizations” (Morgan  Rubin, 2012, page 1267), there is no theory we know of that specifies . Thus, following Morgan  Rubin (2012) and Li . (2018), we consider the initial set to be draws from BCRD. In practice, this decision can limit our algorithm’s performance in certain situations that will become apparent in the simulation study in Section 3 and strategies to mitigate these limiations are discussed in Section 4.

We then sort the assignments based on imbalance in . The imbalance calculation requires the second input: a metric of imbalance, of which our algorithm is agnostic. An imbalance metric takes and as input and outputs a non-negative scalar. An output of zero indicates perfect balance between the treatment and control groups. One such metric is Mahalanobis distance, a collinearity adjusted squared sum of average differences in each dimension. This metric has nice properties; for instance it is an “affinely invariant scalar measure” Morgan  Rubin (2012, page 1271). In our notation, it can be compactly written as . We discuss other choices of metrics in Section 4.

Let be the vector with the smallest imbalance in the set and be the largest, i.e. . The closest vector to , the assignment that truly minimizes imbalance, in our small initial subset will be .

Beginning with w.p. 1 , we compute , the quantile tail criterion at for both the estimators. In order to calculate the quantile, we require the inverse CDF function in for Equations 3 and 4. In general the closed form of the density functions are unknown asymptotically (see e.g. Götze  Tikhomirov, 2002). For both estimators, we present three procedures in the next three sections but first finish explaining the algorithm.

We then proceed to the strategy w.p. 0.5 and w.p. 0.5 and recalculate . We repeat this procedure until the th iteration where , the finite approximation of BCRD. Over all iterations, there is a minimum quantile corresponds to a an iteration number , defining the approximate optimal strategy and optimal threshold corresponding to the imbalance .

Our exhaustive search algorithm is computationally intense but can be approximated by skip-stepping through the iterations. Also, we conjecture that is convex in . Thus, we provide a binary search option in our implementation.

We now discuss three strategies to compute based on assumptions about the measure on . These strategies will be different for both estimators. Algorithm 1 thus takes an argument that specifies one of three different tail computation functions found in Algorithm 2 for .

#### 2.2.1 Provide a Distribution Explicitly

If the distribution of is provided explicitly, then in each iteration of Algorithm 1, the empirical quantile of the MSE of (Equation 4) can be approximated as the mean over many draws of . As this average will be noisy, the criterion can be smoothed over by use of cubic smoothing splines (Hastie  Tibshirani, 1990, Chapter 2.2). The optimal threshold will be invariant to shifts or scales of the MSE and thus the distribution can be standardized without changing the result. An analogous algorithm for the MSE of (Equation 3) is deferred to future work as there is an additional complication due to the dependence on the unknown . Note that assuming independence and homoskedasticity (Assumptions 2.3 and 2.4) are unnecessary in this section.

#### 2.2.2 Assume Normality and Use a CDF Approximation

In addition to assuming independence and homoskedasticity (Assumptions 2.3 and 2.4), we make the assumption that is Gaussian, then the distribution of the quadratic form of the MSE of (Equation 4) is known explicitly. Since the optimal threshold will be invariant to shifts or scales of the MSE, we assume a standard Gaussian. Since the determining matrix

is positive definite and symmetric, the distribution of this quadratic form is a standard result — it is a weighted sum of standard chi-squared distributions,

 Z⊤RZ∼n∑i=1λiχ21 (7)

where the

’s are the eigenvalues of

. Quantiles of this distribution are unknown in closed form but can be computed by numerical integration using the characteristic function. Fast approximations have been studied since the 1930’s.

Bodenham  Adams (2016) compared many approximations on accuracy and computation time and recommend the Hall-Buckley-Eagleson method (Buckley  Eagleson, 1988) which has error that is . This is the approximation used in our R package.

As for the MSE of (Equation 3), there is once again the additional complication due to the dependence on the unknown .

#### 2.2.3 Assume an Excess Kurtosis and Use a Double Approximation

The quantile criterion can be expressed as

 Q:=Ez[MSEw[^βT|z,X,β]]+c×SEz[M%SEw[^βT|z,X,β]] (8)

where is one-to-one with given . For example, if and is normal, , the Gaussian quantile. The MSE for both estimators is conjectured to be asymptotically the form of Equation 7

which is well-approximated by the normal distribution as long as (1) there are many non-zero eigenvalues relative to

and (2) these eigenvalues are fairly uniformly distributed. These assumptions will be true as long as

remains fairly random i.e. this approximation will work for rerandomization thresholds that are not too small and thus maintain randomness in . The simulations of Kapelner . (2019, Section 3) demonstrate that the Gaussian quantiles are approximately correct in many situations. This is the approximation employed in this section.

In order to derive the standard error of the MSE’s for both expressions, we need to assume

2.5, that the

’s have a finite fourth moment and

2.6 that the third and fourth moments are independent of .

for all .

###### Assumption 2.6 (Faux Measure has Higher Moments Independent of Observed Covariates).

and are independent of for all .

In Appendix 5.5, we derive the standard error for the MSE of the differences-in-means estimator,

 SEz[MSEw[^βDMT|z,X,β]]=σ2zn2(nκz+2||ΣW||2F+4σ2zβ⊤XTΣ2WXβ)12 (9)

where

is the excess kurtosis in

and denotes the Frobenius norm. The advantage of this method is that we can find approximately optimal rerandomizations designs for general distributions by merely specifying the excess kurtosis.

Putting Equations 5 and 9 together and recalling that the optimal threshold will be invariant to shifts or scales of the MSE, we can locate a tail criterion objective function that we denote ,

 (10)

where the quantitites in grey are independent of the choice of design given the normal approximation. The terms measure observed imbalance. Over the range of possible values, the term ranges from in the case of BCRD with down to , i.e. effectively zero, in the case of the deterministic minimal imbalance (Kallus, 2018, Section 3.3). For BCRD, . We have the general upper bound . The term measures randomness. Over the range of possible values, the term ranges from in the case of BCRD all the way to in the case of the deterministic minimal imbalance. The tradeoff between imbalance and randomness is now clearly seen in the tail criterion. The inequality above is proven in Appendix 5.5.

In Appendix 5.7, we derive an approximation for the standard error for the MSE in the linear regression estimator,

 SEz[MSEw[^βLRT|z,X,β]]≈σ2zn2(2||(I−P)ΣW||2F+8ntr[GD]+8n2tr[D2]+κzSS)12 (11)

where . Putting Equations 6 and 11 together and dropping additive and multiplicative constants, we find an approximate tail criterion objective function,

 Q′^βLRT≈%tr[X⊤⊥ΣWX⊥]BAL+c(2||(I−P)ΣW||2FRAND2+8% tr[GD]n≤RAND×BAL+[rgb]0.7,0.7,0.78tr[D2]n2≤BAL2+κzSS)12 (12)

and once again, quantitites in grey are independent of the choice of design and the tradeoff between imbalance and randomness is clearly observed in as the quantiles of errors as well. The inequalities above are proven in Appendix 5.7.

As for the MSE of (Equation 3), there is the additional complication due to the dependence on the unknown . Thus, we leave algorithms that minimize this MSE to future work.

### 2.3 Frequentist Inference in Our Designs

To run the experiment, we first realize one from our optimal strategy and these treatments are administered to the subjects. Then, responses for the treatment group and for the control group are collected. Using these responses, we wish to test theories about

and produce a confidence interval for

.

Under the randomization model, the only randomness is in and thus a randomization test is most appropriate; this was Fisher’s “reasoned basis” for inference. The randomization test “can incorporate whatever rerandomization procedure was used, will preserve the significance level of the test and works for any estimator” (Morgan  Rubin, 2012, p. 1268)

. This test requires a sharp null hypothesis whereby all subjects’ response will be exactly equal under both treatment and control conditions.

First, using the experimental we compute for the simple estimator or the linear regression estimator depending on the choice of analysis. A null distribution can then created by using every other , the curated set of assignments thresholded by the optimal , and computing the same estimate (since these estimates were computed by labeling responses to treatment and control in a way unrelated to the design used in the actual experiment). “One should analyze as one designs” (Rosenberger  Lachin, 2016, p. 105), so that replicates can only be drawn from the same restricted allocation set where originated.

If, is large, the null distribution can be approximated with replicates where controls the precision of the -value. Each of the replicates involves drawing one and recomputing the estimates. Note that we cannot draw any , the unrestricted set if was a restricted randomization.

A properly-sized but -level two-sided test can be run by assessing if the experimental falls in the retainment region bounded by the and the quantiles of the replicate set . By the duality of confidence intervals and hypothesis tests, one can alter the sharp null by adding (or subtracting) and rerun the test. The set of values of where the test rejects constitues an approximate confidence interval for the average treatment effect. See Garthwaite (1996) for an efficient algorithm.

If the optimal threshold is small, our design “must leave enough acceptable randomizations to perform a randomization test” (Morgan  Rubin, 2012, page 1268) and there may not be enough assignments of the that satisfy this threshold. The smallest imbalance values of, i.e. the left tail of the distribution , will be in the case of the being sampled from an iid Gaussian generating process and is the Mahalanobis distance. Kallus (2018, Section 3.3) proved that the imbalance for the optimal assignment in this case is . Thus, BCRD will not be able to locate vectors far into this left tail in a reasonable amount of time.

Additionally, estimating the terms in our criterion of Equation 12 may suffer from smaller and smaller sets of assignments. The RAND term for instance relies on an accurate estimate of a function of the variance-covariance matrix and the matrix . These matrices have an number of parameters and require many samples for high accuracy and a bias correction.

We now discuss a means to supplement draws from BCRD to provide many vectors with small imbalance. We consider Krieger . (, Algorithm 1)

, a greedy heuristic that begins with BCRD and in each iteration find the switch of treatment-control subject pair that is optimal until no more minimization is possible. This algorithm finds assignments that are

. A nice property is that it converges quickly requiring very few switches and thus their assignments are nearly as random as BCRD. Thus, including assignments from their algorithm are unlikely to affect the terms but greatly decrease the terms in our criterion . Since the imbalance of vectors of BCRD is known, the greedy pair-switching algorithm can be used for stratified sampling for imbalance distributions upper bounded at small .

## 3 Simulations for the Linear Regression Estimator

### 3.1 The Three Strategies in Different Settings

We first simulate a case of and where the entries of are standard normal realizations. We begin with of size sampled from BCRD. The baseline imbalance are given in Fig 1. It is the job of our algorithms to threshold this distribution at .

For the linear estimator, we run Algorithm 1 to find the optimal threshold for many settings and plot the 95% tail criterion over in Figure 2.

There are many takeaways from this illustration. First, the criterion as a function of appears long and flat from BCRD (where ) until relatively restricted strategies (where ) for all strategies and settings. This means the gains to MSE of the estimator due to improving imbalance a priori are small when a posteriori covariate adjustment is employed. However, it does recommend a fair degree of rerandomization however small the benefits.

The three tail strategies of Sections 2.2.1, 2.2.2 and 2.2.3 are shown in green, red and blue where the exact strategy was simulated with 1,000 standard normal draws for each and the tail approximation used (paralleling the Gaussian setting). The ’s found by the two approximations are nearly identical meaning the gaussian tail approximation is close to the sum of scaled chi-squared approximation of Hall-Buckey-Eagleson. The from the exact is a little bit larger. This can be due to the inaccuracy of the approximations and sampling variation in the exact strategy (even with smoothing attempting to attenuate it). Given the flatness of the relative tail criterion in this region, this small difference will not effect MSE performance.

We then wished to investigate how robust these approximations of the optimal rerandomization threshold are to deviations from normality. The orange line shows the exact strategy for the standard laplace distribution, a distribution similar to the normal in shape but with fatter tails and due to the exponential tail in absolute distance from the mean versus squared distance from the mean. Again, due to the flatness of the criterion in this region, the difference in to the red and blue lines will not make a difference in practice. To ensure the tail approximation can be adjusted for estimates of excess kurtosis, the purple line shows the tail approximation with the argument set to 3. The found matches the exact simulation nearly perfectly (the purple and orange lines coincide). We then investigate a very fat-tailed measure, Student’s distribution with two degrees of freedom (in black). Even an extreme distribution such as this one does not substantially alter the threshold . If the red and blue approximations were naively used, performance would not likely suffer as the criterion is fairly flat in this region.

### 3.2 Visualizing the Optimal MSE Tail

We now simulate response models of Equation 2 under different scenarios to demonstrate that our algorithm finds the optimal threshold for a tail quantile. We retain the same from before with and . To draw responses, we set and draw from . The entire simulation is fixed upon these values.

We then employed five design strategies: (BCRD) balanced complete randomization with vectors, (GOOD) The 5,000 least imbalanced vectors of the , (OPT) the optimal design according to the weighted chi-squared approximation algorithm of Section 2.2.2, (DET) the deterministic design of the best vector out of the vectors and (BAD) the 5,000 most imbalanced vectors of the .

For 3,000 iterations, we draw a from where was computed so its ratio with the standard error of was 3:1 thereby emulating a typical clinical case where . For 1,000 nested iterations, we draw a from the strategy. If the strategy is DET, we skip this nested loop as there is only one allocation . We then compute via Equation 2 and from , and and then we record the squared error . The average over all 1,000 different ’s is recorded as an . Over the 3,000 draws of , the MSE density is estimated. Figure 3 shows the density estimates for all five strategies. The expectation of the MSE is estimated by an average over all 3,000 draws (the vertical solid lines) and the 95% empirical quantile estimates the tail criterion (the vertical dashed lines).

There are many observations from this figure. First, as Equation 6 predicts, DET has the lowest expected MSE as shown as the solid blue line. However, note the shape of the MSE density for DET with its long right tail. The tail shape can be understand by examining the first term in the MSE expression (Equation 4) that can be written as where denotes the component of orthogonal to the column space of . In the case of DET, the quadratic form becomes . This implies that for ’s that interact poorly with the single, lowest imbalance allocation, there is a lot of error. In the case above, the 95%ile of the MSE (the dotted lines) is not even displayed for DET as its value is 2.23.

Second, the OPT design of the best 1,102 vectors of the provides the best insurance against the 95% worst MSE (the green dotted line is the lowest of all dotted lines) for only a marginal loss in expected behavior (the green solid line is only nominally more than the solid blue line).

Third, there is a lot of leeway in finding the optimal design. The GOOD strategy of the top 5,000 assignments performs about the same as OPT. This is due to the flatness of the tail criterion among designs that neither too random or too deterministic.

Fourth, BCRD performs somewhat worse, but not terrible when compared to OPT. This is related to the fifth observation: the BAD strategy of the worst 5,000 assignments performs only 18% worse on average than the optimal deterministic design. This is a testament to the advantage of the covariate adjusted linear estimator — the contribution of a priori imbalance is attenuated by a factor of , a rapid rate of vanishing (Equation 6). This is another reason that the error inherent in our approximation strategies of Sections 2.2.2 and 2.2.3 are unlikely to matter in practice.

### 3.3 Optimal Rerandomization Threshold Dependence on p

We now investigate how the optimal threshold depends on the number of observed covariates . We fix and iteratively add more columns to varying where new column entries are standard normal realizations. We then recompute the optimal threshold via the weighted chi-squared approximation algorithm of Section 2.2.2.

Figure 3(b) shows the log optimal threshold increasing linearly in log . Even though the is increasing and thus the optimal strategy appears to converge to BCRD, this can be due to the fact that the Mahalanobis metric increases as increases. We examine this threshold as the quantile of all vectors conditional on in Figure 3(b). This figure shows that although the threshold is increasing, the relative threshold is converging towards the deterministic design. This illustration is consonant with out intuition: as there are more observed covaraites, a priori restrictions on the randomization to decrease imbalance becomes more important to the resultant MSE. It is possible that these illustrations will be different if was increased by many orders of magnitude.

## 4 Discussion

We have offered an algorithm that can provide the optimal rerandomization threshold for the randomization model when inferring the average treatment effect using the simple differences-in-means estimator and the covariate-adjusted linear regression estimator. Our algorithm is much simpler and requires less choices for the practitioner when employing the linear regression estimator. The MSE of this estimator is also substantially lower in simulations and thus we recommend it over the differences-in-means estimator. Our simulations also demonstrate that to get insurance on the 95%ile of most adversarial draws from the non-linear and unobserved component of the response function, relatively little rerandomization is needed as the thresholds are not too low when is small relative to , the prevalent setting in experimental practice. This underscores the robustness of complete randomization coupled with linear adjustment as a design-estimation approach in experimentation.

There are other imbalance metrics and the choice seems arbitrary or one of convenience. Recently, Kallus (2018, Section 2.3) proved that the optimal imbalance metric is dependent on the choice of norm for the response model. For instance, if the norm is Lipschitz, then the optimal would capture the deviation from a paire matching allocation, such as pairwise Mahalanobis distance (ibid, Theorem 4). If the norm is the sup norm, then the optimal is one that captures deviation from an incomplete blocking design (ibid, Theorem 3). If the squared norm is , i.e. a measure of the signal in the linear model, then the optimal is the Mahalanobis distance (ibid, Section 2.3.3). If the response function is represented by a reproducing Kernel Hilbert Space, then the optimal would be where is the gram matrix, a matrix whose entries tally the kernel distance from to (ibid, Equation 4.2).

These kernel spaces can be infinite in dimension and thereby can model non-parametric response functions. The exponential kernel with kernel distance function can be seen as modeling all polynomials (with infinite degree) and the Gaussian kernel with kernel distance function is especially flexible (these were two examples given in ibid, Section 2.4).

Since the true model is unknown, one would want to employ an sufficiently flexible to not suffer poor performance if the model deviates from their arbitrary choice. Thus, the Gaussian kernel is recommended in practice (ibid, Section 6). Our software contains options for a wide variety of choices for .

Additionally, as mentioned in Section 2.2, there is no theory we know of that specifies how should be sampled from the full space . We have employed BCRD here but there may be other choices for instance those that provide more vectors with exponentially smaller imbalance. Our algorithm would proceed identically, but early simulation suggests the optimal designs can change drastically and we leave this for future research.

### Acknowledgements

We thank Nathan Kallus for helpful discussions. Adam Kapelner acknowledges a PSC-CUNY grant and Abba Krieger acknowledges a Simons Foundation grant for support.

### Replication

The figures and values given in Section 3 can be duplicated by running the code in https://github.com/kapelner/CovBalAndRandExpDesign/blob/master/package_testing.R. The code for our R package OptimalRerandExpDesigns can be found in the root of that repository as well.

## 5 Appendix

### 5.1 The MSE for ^βDMT

This section is heavily adapted from Kapelner . (2019, Appendices 6.2–6.3).

We first show that the estimator is unbiased i.e. . Using the model given by Equation 2:

 Ew[^βDMT|z,X;β] = 1nEw[w⊤(βTw+Xβ+z)|z,X;β] = 1n(Ew[βTw⊤w]+Ew[w⊤Xβ|X;β]+Ew[w⊤z|z]) = 1n(βTEw[w⊤w]+Ew[w⊤Xβ|X;β]+Ew[w]⊤z) = βT+1n(Ew[w⊤Xβ|X;β]+Ew[w]⊤z)

Since then, . The assumption that is a mirror strategy (Assumption 2.1) implies that and

 Ew[w⊤Xβ|X,β]=∑w∈WwTXβP(W=w | X)=0 (13)

since each cancels out with the summand with with shared probability.

This unbiasedness implies that the MSE equals the variance,

 Varw[^βT|z,X;β] = Ew[^βDM2T|z,X;β]−β2T = E[(w⊤y/n)2|z,X;β]−β2T = 1n2E[(w⊤(βTw+Xβ+z))2∣∣z,X;β]−β2T = 1n2E[(βTw⊤w+w⊤(Xβ+z))2∣∣z,X;β]−β2T = 1n2E[2nβTw⊤(Xβ+z)+(w⊤(Xβ+z))2∣∣z,X;β]

where the last line follows from , simplification and canceling out the constant . By the same arguments of Equation 13, leaving us with

 = 1n2E[(w⊤(Xβ+z))2∣∣z,X;β] = 1n2E[(Xβ+z)⊤ww⊤(Xβ+z