Harmonizing Fully Optimal Designs with Classic Randomization in Fixed Trial Experiments

by Adam Kapelner, et al.

There is a movement in design of experiments away from the classic randomization put forward by Fisher, Cochran and others to one based on optimization. In fixed-sample trials comparing two groups, measurements of subjects are known in advance and subjects can be divided optimally into two groups based on a criterion of homogeneity or "imbalance" between the two groups. These designs are far from random. This paper seeks to understand the benefits and the costs of such designs relative to classic randomization in the context of different performance criteria such as Efron's worst-case analysis. In the criterion that we motivate, randomization beats optimization. However, the optimal design is shown to lie between these two extremes. Much-needed further work will provide a procedure to find this optimal design in different scenarios in practice. Until then, it is best to randomize.




1 Introduction

In this short survey, we wish to investigate performance differences between completely random experimental designs and non-random designs that optimize for observed covariate imbalance. We demonstrate that the optimal strategy changes depending on how we wish to evaluate our estimator. We motivate a performance criterion under which neither of the two is the better choice; instead, the best design is a harmony between them. We demonstrate our claim through simulation and a heuristic argument. Our observations open the door to future fundamental research on this age-old debate.

1.1 Background

We consider a classic problem: a two-arm, fixed, non-sequential experiment whose goal is to estimate and test the treatment effect. This experiment has a clearly defined outcome of interest (also called the response or endpoint) and we scope our discussion to the response being continuous and uncensored.

Synonymously referred to as a design, a randomization, an allocation or an assignment, and constructed via a strategy, algorithm, method or procedure, is the division of individuals (subjects, participants or units) into a treatment group and a control group ( and ), the two arms. Historically, standard Bernoulli draws for each individual are termed complete randomization, which is sometimes called the “gold standard”. Any other design is termed a restricted randomization because it is restricted to a subset of all possible allocations.

Why has complete randomization been given this high distinction? Cornfield (1959, p. 245) gives two reasons: (1) known and unknown covariate differences among the groups will be small and bounded and (2) it forms a basis for inference. However, there is a large problem which was identified at the inception of experimentation: sometimes differences are exhibited in the distribution of observed covariates among individuals within the two groups under some of the assignments produced by complete randomization. The amount of covariate difference we term observed imbalance. Through an abuse of terminology, the literature denotes this as simply imbalance, but this is an ambiguous term since it usually ignores the state of imbalance in the unobserved covariates. Observed imbalance is summed up in a numerical metric that can be defined a number of different ways. By convention we consider “larger” imbalance values as worse.

Mitigating the chance of large observed imbalances is the predominant reason for a priori restrictions on the randomized allocation. Fisher (1925, p. 251) wrote “it is still possible to eliminate much of the …heterogeneity, and so increase the accuracy of our [estimator], by laying restrictions on the order in which the strips are arranged”. Here, he introduced blocking, a restricted design still popular today. Student (1938, p. 366) wrote that after an unlucky, highly imbalanced randomization, “it would be pedantic to [run the experiment] with [an allocation] known beforehand to be likely to lead to a misleading conclusion”. His solution is for “common sense [to prevail] and chance [be] invoked a second time”. In doing this rerandomization, all allocations above a predetermined threshold of observed imbalance are eliminated, a strategy that has been rigorously investigated only recently (Morgan and Rubin, 2012; Li et al., 2016). Another idea is to allocate treatment and control among similar subjects by using a matching algorithm (Greevy et al., 2004), a tool by and large popularized in the field of observational studies to minimize confounding. Additionally, once the imbalance metric is explicitly defined, one can formulate the procedure as a binary integer programming problem, a form of numerical optimization (Bertsimas et al., 2015; Kallus, 2018). This is frequently solved via branch and bound (Land and Doig, 1960) which can provide near optimal designs. One can also employ other heuristics that optimize but simultaneously preserve randomness (e.g. Krieger et al., 2016).

The above gives a short introduction to the wide range of design ideas. At the “extremes” of this range, there is complete randomization and the optimized observed imbalance design. We now seek to compare them. We begin with a simple, intuition-building scenario where complete randomization provides superior performance over the optimized imbalance design.

1.2 A Simple Illustration and the Efron Bound

Consider a fixed trial experiment with

subjects comparing a treatment group to a control group. There is one known covariate vector

and one unknown covariate vector and a population average treatment effect (PATE) of 1 which we denote . Consider the additive treatment effect model, where denotes the allocation vector with entries +1 for treatment or -1 for control.

For the purposes of illustration, consider the scenario where the subjects are grouped into pairs where the two subjects within each pair share the same value but differ by a value denoted and . The value of the observed covariate is the same for the two observations in any pair. The value in pair is . We will see shortly that this structure is the most adversarial if we optimize for observed imbalance.

We now wish to compare the completely randomized design to the optimal design. To follow assumptions introduced in the next sections, we will limit our discussion here to forced balance designs (Rosenberger and Lachin, 2016, chap 3.3) where the treatment group and the control group are both coerced to have equal numbers of subjects.

Considering the standard imbalance metric of Mahalanobis distance between the sample average measurements in the treatment group and the sample average measurements in the control group, the optimal allocation in this case can be found by a priori matching (Greevy et al., 2004). In this case, the optimal design is to match the two people in each pair. The member of the pair that is given the treatment is randomly chosen.

In this matched pairs allocation procedure, there are only allocations, a restricted subset of allocations under complete randomization with forced balance (CRFB) where there are possible choices. Parenthetically, the optimal design in this case will have many allocation vectors but if the covariates were truly continuous, there would be only one unique optimal partition.

We then employ the simple differences-in-means estimator, where the division by two is only due to the fact that has entries -1 and +1 (not 0 and 1). The criterion for the performance of under the different designs we choose to be mean squared error (MSE), an average over all possible experimental replicates given these subjects with their particular and constant.

The ratio of the standard deviation of the observed covariate to the standard deviation of the unobserved covariate

is a pivotal quantity in this illustration. Using this equivalent parameterization, . Tab. 1 shows (a) the mean observed imbalance as measured by the squared difference in averages of the two groups i.e. , (b) the mean unobserved imbalance i.e. and (c) the MSE of the difference-in-means estimator for both the restricted and CRFB procedures.

mean observed mean unobserved MSE of the
imbalance imbalance treatment estimator
random (CRFB)
restricted (matching) 0
Table 1: Metrics for a general adversarial example for two designs.

The problem is calibrated so that when , the observed and unobserved variables carry the same weight in determining the response. Note that the smaller the (i.e., the less important the observed covariate), the more our estimator favors randomization over matching. In fact, randomization is preferred so long as For example, the case of , and (implying ) is shown below in tab. 2.

                        mean observed    mean unobserved    MSE of the
                        imbalance        imbalance          treatment estimator
random (CRFB)           0.53             1.80               0.58
restricted (matching)   0.00             3.00               0.75
Table 2: Results for a specific adversarial example.

We observe that (1) the mean observed imbalance in the optimally designed experiments is zero, as expected (any imbalance metric could have been chosen here, since the distributions of the observed covariate in the treatment group and the control group are identical); (2) the optimal imbalance procedure has worse imbalance on the unobserved covariate than the CRFB procedure; and (3) the estimation accuracy is worse under the optimal imbalance procedure even though the observed imbalance is optimal. Why is the variance of the treatment effect estimator lower under randomization than under matching when the ratio of the observed covariate's standard deviation to the unobserved covariate's standard deviation is small? Because that is when the unobserved covariate is more important in determining the response, and it is intuitive that imbalancing it will adversely affect performance. Note that (2) and (3) illustrate the first reason for randomization provided by Cornfield (1959) that we cited above.
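The comparison in this illustration can be checked by brute-force enumeration on a small sample. The sketch below is not the authors' code; it assumes a hypothetical response model y_i = beta_t * w_i + x_i + z_i (unit coefficient on the observed covariate), unobserved values of +a and -a within each pair, and illustrative names (`paired_example`, `x_pairs`, `a`) that do not appear in the paper.

```python
from itertools import combinations, product


def paired_example(x_pairs, a, beta_t=1.0):
    """Exact design metrics for CRFB and matched pairs on an adversarial paired sample.

    Assumed model (hypothetical): y_i = beta_t * w_i + x_i + z_i, with w_i in {-1, +1}.
    Pair k holds two subjects sharing x_pairs[k] whose unobserved values are +a and -a.
    Returns, per design, (mean observed imbalance, mean unobserved imbalance, MSE).
    """
    m = len(x_pairs)
    n_total = 2 * m
    x = [xk for xk in x_pairs for _ in range(2)]
    z = [a, -a] * m

    def metrics(allocations):
        obs = unobs = mse = 0.0
        for w in allocations:
            # Under forced balance, mean_T(v) - mean_C(v) = (2 / n_total) * sum_i w_i v_i.
            dx = 2.0 * sum(wi * xi for wi, xi in zip(w, x)) / n_total
            dz = 2.0 * sum(wi * zi for wi, zi in zip(w, z)) / n_total
            y = [beta_t * wi + xi + zi for wi, xi, zi in zip(w, x, z)]
            # Difference-in-means estimator divided by two: (ybar_T - ybar_C) / 2.
            est = sum(wi * yi for wi, yi in zip(w, y)) / n_total
            obs += dx ** 2
            unobs += dz ** 2
            mse += (est - beta_t) ** 2
        k = len(allocations)
        return obs / k, unobs / k, mse / k

    # CRFB: every allocation with exactly m treated subjects.
    crfb = []
    for treated in combinations(range(n_total), m):
        w = [-1] * n_total
        for i in treated:
            w[i] = 1
        crfb.append(w)
    # Matching: one treated and one control subject inside each pair.
    matching = [[s for sk in signs for s in (sk, -sk)]
                for signs in product((1, -1), repeat=m)]
    return {"crfb": metrics(crfb), "matching": metrics(matching)}


results = paired_example(x_pairs=[-0.75, -0.25, 0.25, 0.75], a=1.0)
for design, (obs, unobs, mse) in results.items():
    print(design, round(obs, 4), round(unobs, 4), round(mse, 4))
```

With these illustrative inputs the observed covariate varies little relative to the unobserved one, so matching attains zero observed imbalance but a larger unobserved imbalance and a larger MSE than CRFB, which is the qualitative pattern of the tables above.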

The question is how to formalize the harm due to imbalancing the unobserved covariates into a criterion. One of the first attempts was introduced by Efron (1971, sec. 5). He explains that when testing the null of the average treatment effect being zero, there is an inferential penalty incurred when the “accidental bias” is non-zero (these expressions are generalized and explained in detail in the next section). When considering replications of the experiment with different allocations (the set of which is defined by the procedure), he derives the increase in variance of the simple mean differences treatment effect estimator to be a quadratic form in the unobserved covariates whose matrix is the variance-covariance matrix of the distribution of the allocation vectors from the procedure.

A trivial bound (eq. 1) then follows: for a normalized vector of unobserved covariates, this increase in variance cannot exceed the largest eigenvalue of the variance-covariance matrix of the allocation vectors (ibid, Equation 5.4; Lachin, 1988, p. 320; Rosenberger and Lachin, 2016, chap 4). Thus, the worst case is when the vector of unobserved covariates is the eigenvector corresponding to the largest eigenvalue (modulo a scaling), and this is exactly the vector we have adversarially demonstrated in this example.
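The eigenvalue bound can be verified numerically on a small sample. The sketch below is an illustration under stated assumptions, not the paper's code: it builds the allocation variance-covariance matrix by enumeration for matched pairs and for CRFB, then evaluates the ratio of the quadratic form to the squared norm for the adversarial vector with values +1 and -1 inside each pair. All function and variable names are hypothetical.

```python
from itertools import combinations, product


def allocation_covariance(allocations):
    """Sigma[i][j] = average of w_i * w_j over the design.

    This equals the variance-covariance matrix because every design used here
    contains each allocation together with its mirror image, so E[w] = 0.
    """
    n = len(allocations[0])
    k = len(allocations)
    return [[sum(w[i] * w[j] for w in allocations) / k for j in range(n)]
            for i in range(n)]


def rayleigh(sigma, z):
    """Quadratic form z' Sigma z divided by z' z."""
    n = len(z)
    quad = sum(z[i] * sigma[i][j] * z[j] for i in range(n) for j in range(n))
    return quad / sum(zi * zi for zi in z)


m = 3                      # number of pairs (kept small so enumeration is instant)
n_total = 2 * m
crfb = [[1 if i in treated else -1 for i in range(n_total)]
        for treated in combinations(range(n_total), m)]
matching = [[s for sk in signs for s in (sk, -sk)]
            for signs in product((1, -1), repeat=m)]

z_adv = [1.0, -1.0] * m    # adversarial unobserved covariate: opposite values within each pair

sigma_matching = allocation_covariance(matching)
sigma_crfb = allocation_covariance(crfb)
ratio_matching = rayleigh(sigma_matching, z_adv)
ratio_crfb = rayleigh(sigma_crfb, z_adv)
print(ratio_matching, ratio_crfb)
```

Under matching, the matrix is block diagonal with blocks [[1, -1], [-1, 1]], so the adversarial vector is an eigenvector with eigenvalue 2 and the bound is attained. Under CRFB the off-diagonal entries are -1/(n_total - 1), so the same vector only reaches n_total/(n_total - 1).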

2 Some Different Criteria

The example of the previous section is compelling and possibly a justification for randomization in and of itself. However, it is cynically adversarial, as such a vector of unobserved covariates is extremely unlikely to occur in practice. Imagine fixed observed covariates but random unobserved covariates (e.g. the standard assumption of iid sampling of their entries). A realization of the unobserved covariates that is non-trivially parallel to the worst eigenvector of the allocation variance-covariance matrix (an arbitrary direction in the space of subject-level vectors) is a low probability event whose probability shrinks exponentially as the sample size grows. This criterion thus may not be the right choice for a practitioner. We now explore different criteria. But first we make clear our problem setup and assumptions.

2.1 Problem Setup

Assume a trial with a fixed number of subjects where each subject has fixed observed measurements ascertained before the study begins (assumption A1). We denote as the measurements where the th row represents the covariate of subject , and it is denoted by , . All results and expressions to follow are conditional on and thus the notation is dropped.

The experiment begins when one allocation, a vector of manipulations denoted by , is administered to all subjects. The allocation is drawn from the design space, denoted , where the restriction is most often made on the basis of the observed covariates. We assume that allocations have the “mirror property” where treatments and controls can be switched with equal probability, i.e. every allocation in the design space is exactly as likely as its mirror image with all treatment and control labels swapped (assumption A2).

The experiment ends when we assess the outcome . We assume an additive effect model, i.e. , , where is the treatment effect and is some unknown function (assumption A3). Defining, ; we obtain our main model,


By the law of iterated expectation, is mean-centered.

We employ the simple mean differences estimator to infer . Sec. 6.1 proves that is unbiased.

2.2 A Worst-Case Criterion

Conditional on a given realization of , the variance, equal to the mean squared error, can be expressed as:


according to sec. 6.2, where , the variance-covariance matrix for . This criterion represents the average error for a design, where the design is specified by the sufficient parameter , conditional on one set of subjects (one ).

Consider the minimax optimal design, one that minimizes the MSE based on the worst i.e.

where the condition of being bounded is required to avoid a trivial infinity (the specific bound used is needed for the result below to hold). The design space is defined to be all variance-covariance matrices of a generalized

-dimensional Bernoulli distributions. The worst

is where is the scaled eigenvector corresponding to the largest eigenvalue of denoted i.e. when Efron’s bound is tight (eq. 1). Specifically, if we choose , where is the eigenvector corresponding to the largest eigenvalue and then (i.e., belongs to the set over which the supremum is taken) and (i.e., it is the scaled eigenvector corresponding to the largest eigenvalue). Note that there is nothing special about the constant 2, it can be replaced by any number greater than 1.

When taking the infimum over designs, we can only minimize as and are fixed. The solution is complete randomization where and (see Rosenberger and Lachin, 2016, Problem 4.3). Thus complete randomization would be minimax optimal for a specific and even if is unknown.
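A short argument shows why no design can push the largest eigenvalue below one. The notation is assumed for illustration: $\Sigma$ is the $N \times N$ variance-covariance matrix of the allocation vector and $\lambda_1, \ldots, \lambda_N$ are its eigenvalues. Each entry of the allocation is $\pm 1$ with mean zero under the mirror property, so every diagonal entry of $\Sigma$ equals one and its trace is $N$.

```latex
\lambda_{\max}(\Sigma)
  \;\ge\; \frac{1}{N}\sum_{i=1}^{N} \lambda_i
  \;=\; \frac{\operatorname{tr}(\Sigma)}{N}
  \;=\; 1,
\qquad \text{with equality if and only if } \lambda_1 = \cdots = \lambda_N = 1,
\ \text{i.e. } \Sigma = I_N .
```

The identity matrix is the variance-covariance matrix of independent fair Bernoulli assignments, which is why complete randomization attains the minimum.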

This result is similar to the “no free lunch” theorem proved by Kallus (2018, sec. 2.1). Here, there is no free lunch in the sense that any design that balances observed covariates via a restriction of the sample space can inadvertently trigger higher variance in due to adversarial unobserved covariates . Knowledge of does not provide assistance.

2.3 A Mean Criterion

Conditioning on a single is a limiting assumption since there are infinitely many states of the unknown covariates . Why not examine the mean MSE over all possible instead of only the worst? Sec. 6.3 shows that


if we assume homoskedasticity in the unobserved covariates, i.e. (assumption A4). Note that eliminating the homoskedasticity assumption (while retaining independence of the ’s) will not substantively change the interpretation of the result which follows.

When taking the of the above, we minimize the first term that represents imbalance. The second term signifies a fundamental estimation error that cannot be reduced. Thus, the optimal design corresponds to optimal balance, a result noted by Kallus (2018, p. 90).
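The two-term structure follows from the standard identity for the expectation of a quadratic form. The symbols below are assumptions made for illustration and may differ from the paper's: $\mathbf{v}$ is the random non-treatment part of the response, $\boldsymbol{\mu} = \mathbb{E}[\mathbf{v}]$ is its component determined by the observed covariates, $\Sigma$ is the allocation variance-covariance matrix with unit diagonal, $N$ is the number of subjects, and the leading constant of eq. 4 is omitted.

```latex
\mathbb{E}\!\left[\mathbf{v}^{\top} \Sigma\, \mathbf{v}\right]
  = \boldsymbol{\mu}^{\top} \Sigma\, \boldsymbol{\mu}
    + \operatorname{tr}\!\left(\Sigma \operatorname{Var}[\mathbf{v}]\right)
  = \underbrace{\boldsymbol{\mu}^{\top} \Sigma\, \boldsymbol{\mu}}_{\text{imbalance, depends on the design}}
    + \underbrace{\sigma^{2} N}_{\text{irreducible}}
\qquad \text{when } \operatorname{Var}[\mathbf{v}] = \sigma^{2} I_N .
```

The second equality uses $\operatorname{tr}(\Sigma) = N$, which holds for every design considered here, so only the first term can be influenced by the choice of design.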

Minimizing the first term under unknown has been addressed by the same author (ibid, sec. 2.2) who allows nature to choose the response function adversarially in a set of functions , a normed vector space with norm denoted by . Then, he follows another minimax approach, finding the infimum over designs under the supremum of . Once this supremum is evaluated, the first term of eq. 4 will be an objective function (with the known inputs ) that can be minimized.

Quite shockingly, assumptions about imply different well-known objective functions. For example, if was the Lipschitz norm with respect to distance metric between and , then the objective function to minimize would be the pairwise matching objective with distance (ibid, Theorem 4). If one assumes a linear model, then the objective function would be the Mahalanobis distance between the treatment group average and the control group average (ibid, sec. 2.3.3). Further, he shows that the first term of eq. 4 features an extremely rapid vanishing rate of under optimal allocation for parametric response functions (ibid, sec. 3.3). For most objective functions, the problem of finding such an optimal allocation is NP-hard and thus the rate is slower in practice as either approximate polynomial or heuristic methods must be employed.

2.4 A Tail Criterion

However, considering the average may be imprudent when we know the worst case , is unquestionably ruinous. Moreover, minimizing eq. 4 boils down to minimizing the first term since the second term does not depend on . While the first term can become exponentially small, the second term would remain as . This means that the design that minimizes eq. 4 gains very little against other simpler alternatives, e.g. pairwise matching. As we show below, this marginal improvement in the mean criterion comes at the expense of much higher variance yielding a long right tail of the MSE distribution.

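Parenthetically, we note a further weakness of the previous criterion. The unconditional MSE, the expectation over the unobserved covariates of the conditional variance of the estimator, is one term in the law of total variance formula. The other term, the variance over the unobserved covariates of the conditional expectation of the estimator, is zero since the estimator is conditionally unbiased (and thus its conditional expectation is a constant). This implies that the criterion of eq. 4 is equivalent to the unconditional variance of the estimator. This corresponds to a worldview where replicates consist of both the allocation and the unobserved covariates being jointly realized together. This is similar to treating the unobserved covariates as “noise” that will be different for each allocation for the same subjects. This perspective is substantially at odds with our assumption that the unobserved component is affixed to the subjects and is not changed with different allocations.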

To remedy both of these problems (but not be as conservative as Efron in assuming the worst case), we propose to optimize the design for the performance of the worst percent of ’s (i.e. the tail events where is large),


Note that complete randomization, the result of sec. 2.2, can be recovered if we require the worst tail event, i.e. .

Once again, we are considering the conditional MSE as a random variable in the unobserved covariates. To solve for the optimal design, the inverse cumulative distribution function (CDF) of this MSE must be evaluated at the chosen quantile. However, the expression of eq. 3 is a quadratic form with associated matrix properties that unfortunately are not amenable to current asymptotic distributional results (see e.g. Götze and Tikhomirov, 2002). If the unobserved covariates were normal, the distribution is known, but it is not amenable to locating the optimal design (see sec. 6.4).

3 Optimal Designs

The worst-case criterion of sec. 2.2 yields complete randomization as the optimal design. The mean criterion of sec. 2.3 yields perfect balance as the optimal design. The criterion that seeks a combination of worst-case and average performance, the tail criterion of sec. 2.4, is less clear. Without the inverse CDF and an explicit means to find the infimum over designs, we cannot provide a procedure to locate the optimal design.

However, we can prove our eponymous claim, i.e. that the best design is between complete randomization and complete optimization. Denoting as the quantile in Expression 5

, note that the quantile expression can be expressed as the mean plus a number of standard errors,


where the quantile of interest and the number of standard errors exist in a one-to-one relationship. We are interested in high quantiles, which correspond to large ’s.

The expectation term was found in eq. 4 and the standard error term calculation is found in sec. 6.5. Putting both the expectation and standard error together we have


where , which is zero if

is normally distributed and

is the squared Frobenius norm of . To compute , we both need to assume

has a finite fourth moment (assumption A5) and that its third and fourth moments are independent of

(assumption A6). The two gray terms are constants unaffected by our choice of design. There is another term in the standard error that is eliminated via assuming forced balance designs between the number of treatments and controls (assumption A7).

There is a further complication. Given a desired quantile , the constant will be a function of (although we will see in sec. 4 that it changes only slightly). There are a couple of default values of we can employ. First, we can use the Chebyshev constant, , thus our criterion becomes a bound on the quantile and not the quantile itself. Second, we can unconditionally employ as in the Gaussian setting, a value that is shown to be approximately true for a large set of continuous distributions (Sharpe, 1970). Any choice of here will not affect the asymptotic argument we now turn to.
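The candidate multipliers can be computed directly for a given quantile. This is a minimal sketch and the function name `tail_constants` is hypothetical. Because the text does not say whether the two-sided or the one-sided (Cantelli) form of Chebyshev's inequality is meant, both are shown, alongside the Gaussian quantile as a reference point.

```python
from math import sqrt
from statistics import NormalDist


def tail_constants(q):
    """Multipliers c such that mean + c * sd bounds (or approximates) the q-th quantile."""
    return {
        # Two-sided Chebyshev: P(|X - mu| >= c * sd) <= 1 / c^2.
        "chebyshev_two_sided": sqrt(1.0 / (1.0 - q)),
        # One-sided Chebyshev (Cantelli): P(X - mu >= c * sd) <= 1 / (1 + c^2).
        "cantelli_one_sided": sqrt(q / (1.0 - q)),
        # Exact multiplier if the quantity were normally distributed.
        "gaussian": NormalDist().inv_cdf(q),
    }


constants = tail_constants(0.95)
for name, c in constants.items():
    print(f"{name}: {c:.4f}")
```

The inequality-based constants are distribution-free and therefore conservative, which is why using them turns the criterion into a bound on the quantile rather than the quantile itself.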

To prove our claim, we asymptotically examine the three terms affected by the design, denoted and as the first two measure imbalance in the portion of the response that is observed and the last loosely measures the degree of randomness, serving as a “regularization” term in a sense as it “shrinks” the design back towards randomization. We do this analysis for complete randomization and for perfect optimization.

For complete randomization with forced balance, and thus the and terms are equal to because and the term is corresponding to

For near perfect balance (PB) via optimization, the resultant (where is the optimal allocation) and has -1’s and +1’s in the off-diagonal. The term is exactly since has only one non-zero eigenvalue:

. The asymptotic analysis of the

and terms uses the result from Kallus (2018, sec. 3.3) twice. First, the term is . Second, the term is bounded by so it behaves at most like . Thus,

When considering both analyses, the two balance terms vary between order of down to , but the randomness term varies between order of to . Therefore, while the perfect optimization obtains very small balance terms and , the gain is undone by the large randomness term . We conclude that the optimal design must feature balance terms with order slower than to guarantee the randomness term does not grow as fast as . This design must be between complete randomization and perfect balance.
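The contrast in the randomness term can be made concrete by computing the squared Frobenius norm of the allocation variance-covariance matrix under the two extreme designs. The sketch below rests on assumptions: `n_total` (the number of subjects) and the function names are hypothetical, the CRFB matrix uses the standard closed form with off-diagonal entries -1/(n_total - 1), and the perfect-balance matrix is the outer product of a placeholder allocation with itself, since only the plus/minus-one structure matters for the norm.

```python
def frobenius_sq(sigma):
    """Squared Frobenius norm: the sum of squared matrix entries."""
    return sum(v * v for row in sigma for v in row)


def crfb_sigma(n_total):
    """Variance-covariance matrix of a forced-balance, otherwise uniform allocation."""
    off = -1.0 / (n_total - 1)
    return [[1.0 if i == j else off for j in range(n_total)] for i in range(n_total)]


def pb_sigma(w_star):
    """Design that uses only w_star and its mirror image, each with probability 1/2."""
    return [[wi * wj for wj in w_star] for wi in w_star]


def mat_vec(sigma, v):
    return [sum(sij * vj for sij, vj in zip(row, v)) for row in sigma]


n_total = 20
w_star = [1, -1] * (n_total // 2)  # placeholder for the optimal allocation

frob_crfb = frobenius_sq(crfb_sigma(n_total))   # n_total^2 / (n_total - 1)
frob_pb = frobenius_sq(pb_sigma(w_star))        # n_total^2
eigen_check = mat_vec(pb_sigma(w_star), w_star) # equals n_total * w_star
print(frob_crfb, frob_pb)
```

The perfect-balance matrix has rank one, with its only non-zero eigenvalue equal to the number of subjects, so its squared Frobenius norm grows quadratically while the CRFB norm grows only linearly.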

Note that the constant that figures most prominently into our tail criterion of eq. 7 is . Smaller values indicate that the observed covariates explain more of the spread in the response (i.e. is larger).

4 Simulations

In the model of eq. 2, we assume , one covariate and let (the simplest case) and . We generate one fixed whose values are iid realizations from . This simulation exercise is conditional on those values. We then generate all forced balance vectors in (i.e. million vectors).

Since the model is linear, the appropriate imbalance function to be minimized is the Mahalanobis distance between the average observed covariates in the treatment group and the average of the observed covariates in the control group. With one covariate, the imbalance objective reduces to . The optimal vector is then located and saved.

We consider three designs: (CRFB) complete randomization with forced balance, (PB) perfect balance using the optimal vector and its mirror image, and (PM) the pairwise matching algorithm of Greevy et al. (2004). The third is to demonstrate a design between complete randomization and perfect balance. The choice of matching does not suggest that we view it as the true optimal design; it is merely an illustration. To do the pairwise matching, the covariate values are sorted and each consecutive set of two becomes a pair.

In each iteration we first draw one whose entries are iid where the choice of variance was made so that of the model was approximately 35% for the observed covariates (). We then repeat the following 2,000 times for each design.

(CRFB) We choose at random 300 ’s from to simulate the CRFB design. For each, we compute and then . Over these 300, we estimate by taking the sample average over these 300. (PB) We compute for the allocations and and then for both. The average of both values is the estimate of . (PM) We choose at random 300 ’s by randomly permuting within the 10 pairs of subjects. The procedure for CRFB is then repeated.
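The simulation loop can be sketched compactly in Python. This is not the authors' R script: it assumes a linear response with a hypothetical coefficient `beta_x` and noise scale `sigma_z`, and it replaces the 300 sampled allocations per design with the exact average over the design, which has a closed form for each of the three designs under forced balance. All names are illustrative.

```python
import random
from itertools import combinations


def optimal_allocation(x):
    """Brute-force forced-balance allocation minimising |sum_T x - sum_C x|."""
    n_total, total = len(x), sum(x)
    best, best_gap = None, float("inf")
    for treated in combinations(range(n_total), n_total // 2):
        gap = abs(2.0 * sum(x[i] for i in treated) - total)
        if gap < best_gap:
            best, best_gap = treated, gap
    chosen = set(best)
    return [1 if i in chosen else -1 for i in range(n_total)]


def mse_crfb(v):
    """Average of (w.v / n)^2 over all forced-balance allocations."""
    n = len(v)
    vbar = sum(v) / n
    return sum((vi - vbar) ** 2 for vi in v) / (n * (n - 1))


def mse_pb(v, w_star):
    """Average of (w.v / n)^2 over w_star and its mirror image."""
    return (sum(wi * vi for wi, vi in zip(w_star, v)) / len(v)) ** 2


def mse_pm(v, pairs):
    """Average of (w.v / n)^2 when one member of each pair is treated at random."""
    return sum((v[i] - v[j]) ** 2 for i, j in pairs) / len(v) ** 2


def simulate(n_total=20, beta_x=1.0, sigma_z=1.0, n_draws=2000, seed=1):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n_total)]      # fixed observed covariate
    w_star = optimal_allocation(x)
    order = sorted(range(n_total), key=lambda i: x[i])
    pairs = [(order[k], order[k + 1]) for k in range(0, n_total, 2)]
    draws = {"CRFB": [], "PB": [], "PM": []}
    for _ in range(n_draws):
        # v is the non-treatment part of the response for one draw of the unobserved part.
        v = [beta_x * xi + rng.gauss(0, sigma_z) for xi in x]
        draws["CRFB"].append(mse_crfb(v))
        draws["PB"].append(mse_pb(v, w_star))
        draws["PM"].append(mse_pm(v, pairs))
    summary = {}
    for name, vals in draws.items():
        vals.sort()
        summary[name] = {
            "mean": sum(vals) / len(vals),
            "q95": vals[int(0.95 * len(vals)) - 1],  # empirical 95th percentile
            "max": vals[-1],
        }
    return summary


if __name__ == "__main__":
    for design, stats in simulate().items():
        print(design, {k: round(s, 4) for k, s in stats.items()})
```

The three summaries per design correspond to the mean, tail and worst-case criteria discussed above. The numbers depend on the assumed `beta_x`, `sigma_z` and seed, so they should not be expected to reproduce the paper's figures.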

Fig. 1 illustrates for all three designs the distribution of over the 2,000 draws from the distribution.

Figure 1: Density estimates for the MSE distribution for CRFB (red), perfect balance (green) and pairwise matching (blue) over the 2,000 simulation replicates explained in the text. Solid vertical lines are the estimated mean MSE and dashed vertical lines are the estimated 95th percentile of the MSE. Note that the mean MSEs for perfect balance and pairwise matching are nearly identical, with perfect balance ahead by a small margin because the gains from over-optimizing the balance term are slim.

There are many observations from this illustration. First, the solid vertical lines indicate the estimated mean MSE, the mean criterion of sec. 2.3. Here, the perfect balance design is optimal (matching is a close runner-up) as explained in the beginning of sec. 3. Second, the worst-case criterion of sec. 2.2 is assessed by the maximum values which are CRFB: 0.36, PB: 1.65 (note the figure only shows up to 0.6!), PM: 0.38. This value is smallest for the CRFB design as explained in the beginning of sec. 3. Last, the dashed vertical lines illustrate the estimated 95th percentile of the MSE, our recommended criterion of sec. 2.4. Here, perfect balance performs poorly and CRFB is clearly better. Pairwise matching is an example design that is between these two extremes and beats both of them. However, the true optimal design, an elusive harmony between randomization and optimization, will perform better than even matching, as the asymptotic analyses of sec. 3 demonstrate. This optimal design is “closer” to CRFB than to PB.

Additionally, there is the concern of the constant in eq. 7 varying with fixed . In the above simulation, the values of for CRFB, PM and PB are 1.75, 1.84 and 1.85. If were set to be 2, the story would not differ.

We also demonstrate two more extreme examples. The first is where there is no effect of observed covariates (accomplished by allowing ). The second is where , i.e. the unobserved covariates contribute very little to the response (accomplished by setting ). Fig. 3 and fig. 2 illustrate the results.

Figure 2: Same as fig. 1 except the observed covariates and PATE account for 90% of the variance in the response.

Fig. 3 illustrates the case of no effect of the observed covariates. Here, the mean MSE is the same for PB, PM and CRFB procedures. This is expected as the balance term () in eq. 4 is 0 (since ) hence the mean MSE is regardless of the procedure. However, the 95% quantile of MSE is minimized with CRFB. This is because any restriction on the allocation will increase the value of the term in eq. 7 with no corresponding decrease in the and terms and hence CRFB is optimal here.

Fig. 2 illustrates the case where the observed covariates are the significant driver of the response. We note that the optimal design still lies between CRFB and PB when considering the case of the 95th percentile tail criterion (as in fig. 1). In contrast, (a) the mean criterion more clearly illustrates the dominance of PB designs over randomized designs and (b) the optimal design, still a harmony of PB and CRFB, is now “closer” to PB than to CRFB (since the small contribution of the unobserved covariates makes the balance terms dominate over the randomness term in eq. 7).

Figure 3: Same as fig. 1 except the observed covariates have no effect on the response.

Taking the last scenario to its limit where the unobserved covariates are omitted, we will find that optimization is the best design; and randomization is inferior by a large margin. This can be seen in eq. 7. Since , there is no effect of any term except the balance term.

In our final simulation, we duplicate the first scenario with a larger sample size to illustrate that our titular claim holds at sample sizes common in real-world studies. There is one complication. When the sample size was 20, the optimal allocation was found by brute force; at the larger sample size, this is no longer possible. To approximate the optimal allocation, we use the numerical method of Krieger et al. (2016). Their algorithm switches pairs of subjects greedily to minimize observed imbalance until a local optimum is found. Using the R package GreedyExperimentalDesign, we repeat their algorithm 20,000 times and keep the allocation with the minimum imbalance. Here, is on the order of and matching is on the order of .
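The greedy pair-switching idea described above can be sketched as follows. This is a simplified illustration for a single covariate, not the implementation in the GreedyExperimentalDesign package: the imbalance metric (absolute difference in group sums), the function names and the restart wrapper are assumptions made for this example.

```python
import random


def imbalance(x, w):
    """|sum_T x - sum_C x|, proportional to the difference in group means under forced balance."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)))


def greedy_pair_switch(x, rng, tol=1e-12):
    """Start from a random forced-balance allocation and repeatedly apply the single
    treated/control swap that reduces imbalance the most, until no swap helps."""
    n_total = len(x)  # assumed even
    w = [1] * (n_total // 2) + [-1] * (n_total // 2)
    rng.shuffle(w)
    while True:
        s = sum(wi * xi for wi, xi in zip(w, x))
        current = abs(s)
        best_pair, best_val = None, current - tol
        for i in range(n_total):
            if w[i] != 1:
                continue
            for j in range(n_total):
                if w[j] != -1:
                    continue
                # Moving i to control and j to treatment changes the signed sum by -2 x_i + 2 x_j.
                val = abs(s - 2.0 * x[i] + 2.0 * x[j])
                if val < best_val:
                    best_pair, best_val = (i, j), val
        if best_pair is None:
            return w, current  # local optimum: no single swap improves the imbalance
        i, j = best_pair
        w[i], w[j] = -1, 1


def best_of_restarts(x, n_restarts, seed=0):
    """Repeat the greedy search from random starts and keep the best local optimum."""
    rng = random.Random(seed)
    best_w, best_val = None, float("inf")
    for _ in range(n_restarts):
        w, val = greedy_pair_switch(x, rng)
        if val < best_val:
            best_w, best_val = list(w), val
    return best_w, best_val


if __name__ == "__main__":
    data_rng = random.Random(42)
    covariate = [data_rng.gauss(0, 1) for _ in range(100)]
    allocation, value = best_of_restarts(covariate, n_restarts=50)
    print(value)
```

Each restart ends at a local optimum only, which is why many restarts are needed and why the result remains an approximation to the truly optimal allocation.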

The results are displayed in fig. 4 and we observe the same results as previously. The results for the optimal design were robust to different approximations to ranging from up to . For the true optimal (which is impossible to find), the results will be very similar. Also, the values of for CRFB, PM and PB were 1.72, 1.76 and 2.04. Again, if were set to be 2, the story would not differ.

Figure 4: Same as fig. 1 except the sample size is now .

5 Discussion

We assume a two-arm randomized fixed trial with a response model that can be decomposed into a sum of an observed measurement component, a treatment effect and an unobserved measurement component. Using the differences-in-means estimator and worrying about a tail of extreme events in its mean squared error, we have shown that CRFB is too conservative and optimizing the observed measurements’ balance is too aggressive. The optimal design lies somewhere between these two extremes.

To create an algorithm to find this optimal design will require an explicit minimum of eq. 7. Since the response function is unknown, a supremum over its worst case is likely prudent. The best design will also depend on constants that must be estimated a priori, the most important being the proportion of variance in the response explained by the observed covariates. The less the covariates matter in this respect, the closer the optimal design will be to CRFB; the more they matter, the closer the optimal design will be to optimization. (Footnote 1: Even if the optimal variance-covariance matrix can be computed, it is known that this matrix does not fully specify the distribution of the allocation vectors (Teugels, 1990). Thus, we would need to employ a numeric algorithm to draw allocation vectors whose entries have the correct covariances.)

The criterion suggested in this paper is a quantile of the distribution of the mean squared error of the mean-differences estimator. A common alternate estimator in practice is the ordinary least squares regression estimator for the treatment effect. The mathematics herein can be redone for this estimator. Unshown simulations reveal the same story as in Figures 1 to 4.

Additionally, reductions in MSE of the estimator are usually only important insofar as they increase power. More work needs to be done with power simulations. This brings up hypothesis testing, a topic avoided in this work. Hypothesis testing is complicated in restricted randomization as the finite-sample distribution of the estimator is conditional upon the randomization procedure.

Given the simulation study herein and an asymptotic analysis, complete randomization with forced balance is still likely the best policy pending further necessary research.


We thank Michael Sklar and Bracha Blau for helpful discussions.



All figures and tables can be reproduced by running the R code found at https://github.com/kapelner/harmonizing_designs/blob/master/paper_duplication.R.

6 APPENDIX: Proofs

6.1 Proof of Unbiasedness

We now show that . The law of iterated expectation implies that . We solve the inner expectation below using the model given by Equation 2:


Since then, . The mirror property (A2) implies that , since

for all that we consider. Then by 2.1, each summand corresponding to certain cancels out with the summand with . Therefore, .

Also by the mirror property, . This leaves us with:

The unconditional expectation is equivalent.

6.2 Derivation of the Conditional MSE

The unbiasedness of (proven in sec. 6.1) implies that where the expectation is taken over . Recall that so that

Note that , since . After canceling out the constant , we are left with:

By the same arguments of sec. 6.1, leaving us with

6.3 Derivation of the Mean MSE

We wish to find . Using the result from sec. 6.2,

Since by construction, the second term is zero.

The third term is the expectation of a quadratic form, which equals the trace of the associated matrix times the variance-covariance matrix of the vector plus the quadratic form of the expectation vector and the associated matrix. Since (by 2.1), we only need to consider the first term. Assuming homoskedasticity (2.3), and , the expression evaluates to since the diagonal entries are identically one. The heteroskedastic case will not be substantively different. Thus,
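For reference, the quadratic-form fact invoked for the third term is the standard identity below. The notation is generic and assumed for illustration; it is not the paper's own: z is a random vector with mean vector \mu and variance-covariance matrix \Sigma, and A is a fixed matrix.

```latex
\mathbb{E}\left[ z^\top A z \right] = \operatorname{tr}(A \Sigma) + \mu^\top A \mu
```

When \mu = 0 only the trace term survives, which is the reduction used above.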

6.4 Distribution of the MSE Under Normality

If we assume , then . We now examine the distribution of the quadratic form, where has properties outlined in the text. Baldessari (1967) proves that this quadratic form is distributed as

where and the ’s are the unique eigenvalues and eigenvectors of , is the multiplicity of the eigenvalue and denotes a noncentral chi-squared random variable with degrees of freedom and noncentrality parameter . Thus, it is a sum of scaled noncentral chi-squared random variables.

Since the distribution is parameterized by the eigendecomposition of , it would be very difficult to optimize the inverse CDF over the space of legal matrices.
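Although optimizing over the eigendecomposition is hard, evaluating a quantile for one fixed set of eigenvalues is straightforward by simulation. The sketch below is a hypothetical illustration, not part of the paper's code. It assumes the scales, multiplicities and noncentrality parameters are already known, uses made-up values for them, and takes the noncentrality parameter to be the sum of squared means of the underlying normals.

```python
import math
import random

def draw_weighted_sum(components, rng):
    """Draw one value of sum_i lam_i * X_i, with X_i noncentral chi-squared.

    components is a list of (lam, m, delta) triples: scale, degrees of freedom
    (a positive integer) and noncentrality parameter. Assumed convention:
    delta is the sum of squared means of the underlying unit-variance normals.
    """
    total = 0.0
    for lam, m, delta in components:
        # Place the whole noncentrality on the first normal; the law of X_i
        # depends on the means only through delta.
        x = (rng.gauss(0.0, 1.0) + math.sqrt(delta)) ** 2
        x += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m - 1))
        total += lam * x
    return total

def monte_carlo_quantile(components, q, n_draws=20000, seed=0):
    """Estimate the q-th quantile by sorting simulated draws; also return the draws."""
    rng = random.Random(seed)
    draws = sorted(draw_weighted_sum(components, rng) for _ in range(n_draws))
    index = min(n_draws - 1, int(q * n_draws))
    return draws[index], draws

# Hypothetical (scale, multiplicity, noncentrality) triples, for illustration only.
components = [(2.0, 1, 0.5), (0.5, 3, 0.0)]
q95, draws = monte_carlo_quantile(components, 0.95)

# Sanity check: the mean of a noncentral chi-squared variable is m + delta.
exact_mean = sum(lam * (m + delta) for lam, m, delta in components)
print("simulated mean:", sum(draws) / len(draws), "exact mean:", exact_mean)
print("estimated 95% quantile:", q95)
```

A search over designs would need to repeat such an evaluation for every candidate matrix, which illustrates why the optimization is expensive.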

6.5 Derivation of the Variance of the MSE

We wish to derive an expression for the variance of the expected squared loss function, where the expectation is taken over all randomizations and the variance is taken over all realizations of the unobserved covariates, i.e.


From sec. 6.2, we learned that . This is the variance of a quadratic form, which can be calculated via Petersen and Pedersen (2012, eq. 319) when assuming that the conditional third and fourth moments of are finite (3) and do not depend on (3). Thus we have,


where and . We prove in sec. 6.6 that the last expression in eq. 11 above is zero and therefore eq. 7 follows.

6.6 A Proof that the Last MSE Term is Zero

We wish to demonstrate that . First, we reiterate that by the assumption of forced balance (3). This also means that since every is balanced. Then, .

Note that where the and ’s are its eigenvalues and eigenvectors respectively. Since is a variance-covariance matrix, it is symmetric and positive semidefinite, implying that its eigenvalues are all non-negative. We can then write . This means that for all . In order for this to be true, for every either or .

We now examine just the term which can be written as . For all either or is zero rendering the “middle” irrelevant. Thus and .
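The argument can be checked exactly on a small case. The sketch below rests on assumptions that are not taken from the paper's notation: allocation vectors have entries +1 and -1, the design is uniform over all balanced allocations (CRFB), and B is an arbitrary stand-in for the "middle" matrix. It verifies with exact rational arithmetic that the variance-covariance matrix annihilates the vector of ones, so any quadratic form sandwiched between the two products vanishes.

```python
from fractions import Fraction
from itertools import combinations

def balanced_allocations(n):
    """Yield every vector in {-1, +1}^n with equally many +1 and -1 entries."""
    for treated in combinations(range(n), n // 2):
        chosen = set(treated)
        yield [1 if i in chosen else -1 for i in range(n)]

def allocation_covariance(n):
    """Exact variance-covariance matrix of w under the uniform balanced design.

    Each unit is treated in exactly half of the allocations, so E[w] = 0
    and the covariance matrix equals E[w w^T].
    """
    allocations = list(balanced_allocations(n))
    count = len(allocations)
    return [[Fraction(sum(w[i] * w[j] for w in allocations), count)
             for j in range(n)] for i in range(n)]

def mat_vec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

n = 6
sigma = allocation_covariance(n)
ones = [1] * n
sigma_ones = mat_vec(sigma, ones)  # Sigma times the vector of ones

# Arbitrary "middle" matrix: by symmetry of Sigma,
# 1^T Sigma B Sigma 1 = (Sigma 1)^T B (Sigma 1).
B = [[Fraction(i + 2 * j + 1) for j in range(n)] for i in range(n)]
last_term = sum(a * b for a, b in zip(sigma_ones, mat_vec(B, sigma_ones)))

print("Sigma 1 =", sigma_ones)
print("1^T Sigma B Sigma 1 =", last_term)
```

In this example the off-diagonal covariances equal -1/(n-1), so each row of the matrix sums to zero, which is the numerical counterpart of the eigenvalue argument above.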


  • Baldessari (1967) Baldessari, B. (1967). The distribution of a quadratic form of normal random variables. The Annals of Mathematical Statistics 38(6), 1700–1704.
  • Bertsimas et al. (2015) Bertsimas, D., M. Johnson, and N. Kallus (2015). The power of optimization over randomization in designing experiments involving small samples. Operations Research 63(4), 868–876.
  • Cornfield (1959) Cornfield, J. (1959). Principles of research. American Journal of Mental Deficiency 64, 240–252.
  • Efron (1971) Efron, B. (1971). Forcing a sequential experiment to be balanced. Biometrika 58(3), 403–417.
  • Fisher (1925) Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.
  • Götze and Tikhomirov (2002) Götze, F. and A. Tikhomirov (2002). Asymptotic distribution of quadratic forms and applications. Journal of Theoretical Probability 15(2), 423–475.
  • Greevy et al. (2004) Greevy, R., B. Lu, J. H. Silber, and P. Rosenbaum (2004). Optimal multivariate matching before randomization. Biostatistics 5(2), 263–275.
  • Kallus (2018) Kallus, N. (2018). Optimal a priori balance in the design of controlled experiments. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(1), 85–112.
  • Krieger et al. (2016) Krieger, A. M., D. Azriel, and A. Kapelner (2016). Nearly random designs with greatly improved balance. arXiv preprint arXiv:1612.02315.
  • Lachin (1988) Lachin, J. M. (1988). Properties of simple randomization in clinical trials. Controlled Clinical Trials 9(4), 312–326.
  • Land and Doig (1960) Land, A. H. and A. G. Doig (1960). An automatic method of solving discrete programming problems. Econometrica: Journal of the Econometric Society, 497–520.
  • Li et al. (2016) Li, X., P. Ding, and D. B. Rubin (2016). Asymptotic theory of rerandomization in treatment-control experiments. arXiv preprint arXiv:1604.00698.
  • Morgan and Rubin (2012) Morgan, K. L. and D. B. Rubin (2012). Rerandomization to improve covariate balance in experiments. The Annals of Statistics, 1263–1282.
  • Petersen and Pedersen (2012) Petersen, K. B. and M. S. Pedersen (2012, nov). The matrix cookbook. Version 20121115.
  • Rosenberger and Lachin (2016) Rosenberger, W. F. and J. M. Lachin (2016). Randomization in clinical trials: theory and practice (Second ed.). John Wiley & Sons.
  • Sharpe (1970) Sharpe, K. (1970). Robustness of normal tolerance intervals. Biometrika 57(1), 71–78.
  • Student (1938) Student (1938). Comparison between balanced and random arrangements of field plots. Biometrika, 363–378.
  • Teugels (1990) Teugels, J. L. (1990). Some representations of the multivariate Bernoulli and binomial distributions. Journal of Multivariate Analysis 32(2), 256–268.