1 After Randomization: To Adjust or Not To Adjust?
Randomized experiments are often considered the “gold standard” of statistical inference because randomization balances the covariate distributions of the treatment and control groups on average, which limits confounding between treatment effects and covariate effects. However, it is possible that a particular treatment assignment from a randomized experiment yields covariate imbalances that researchers wish to address. One option is to employ experimental design strategies such as blocking or rerandomization (Morgan & Rubin, 2012), which prevent substantial covariate imbalance from occurring before the experiment is conducted. When these strategies are not employed, covariate imbalance must be addressed in the analysis stage rather than the design stage. The analyst of such experiments must make a choice: to adjust or not to adjust for the covariate imbalance realized by a particular randomization. If adjustment is done, it is typically done via statistical models (e.g., regression adjustment); however, the results from such adjustment may be biased and/or sensitive to model specification (Imai et al., 2008; Freedman, 2008; Aronow & Middleton, 2013). Meanwhile, unadjusted estimators—though unbiased across randomizations—could be confounded by the realized covariate imbalance at hand. Lin (2013) rigorously investigated these tradeoffs between unadjusted and adjusted estimators, noting that biases due to regression are often minimal, but also that unadjusted estimators are appealing for their simplicity and transparency. Regardless of where these works fall on the “to adjust or not to adjust” spectrum, they all agree that accounting for covariate balance is a key concern in randomized experiments.
1.1 Accounting for Covariate Balance in Randomization Tests
In addition to model-based testing, one can use randomization tests to account for covariate balance in experiments. Randomization tests are often considered minimal-assumption approaches in that they usually only require assuming a probability distribution on treatment assignment rather than structural modeling assumptions or central limit theorems (Rosenbaum, 2002b). In particular, a randomization test requires specifying only (1) the assumed assignment mechanism and (2) the test statistic. In this context, one can account for covariate balance by making particular choices for the assignment mechanism or the test statistic, but most have focused on the choice of the latter. For example, many have found that using model-adjusted estimators as test statistics to address covariate imbalances can result in statistically powerful randomization tests (Raz, 1990; Rosenbaum, 2002a; Rosenbaum, 2002b, Chapter 2; Hernández et al., 2004; Imbens & Rubin, 2015, Chapter 5). Meanwhile, practitioners typically use the assignment mechanism that was actually used in the design of the experiment when conducting a randomization test (e.g., if units were assigned completely at random, then this same assignment mechanism is used during the randomization test). However, by considering other choices for the assignment mechanism, one can also account for covariate balance.
In particular, a small strand of literature has explored randomization tests that restrict the assignment mechanism to only consider treatment assignments that are similar to the observed one in terms of covariate balance, even if such an assignment mechanism was not explicitly specified by design. This literature has focused on cases where all covariates are categorical, and thus treatment assignment is characterized by permutations within covariate strata. For example, Rosenbaum (1984)
proposed a conditional permutation test for observational studies that permutes the treatment indicator within groups of units with the same covariate values. This test assumes (1) the treatment assignment is strongly ignorable, (2) the true propensity score model is a logistic regression model, and (3) the collection of covariates is sufficient for the logistic regression model. More recently, Hennessy et al. (2016) proposed a conditional randomization test for randomized experiments that is similar to Rosenbaum (1984) in that it also permutes within groups of units with the same covariate values, but it does not require any kind of model specification. Rosenbaum (1984) and Hennessy et al. (2016) only consider cases with categorical covariates, and they make connections between their randomization tests and adjustment methods for categorical covariates, such as post-stratification (Miratrix et al., 2013).
1.2 Our Contribution: Considering Non-Categorical Covariates
We develop a randomization test that conditions on the realized covariate balance of an experiment for the more general case where covariates may be non-categorical. We demonstrate that our randomization test is more powerful than randomization tests that do not condition on covariate balance and is comparable to randomization tests that use model-adjusted estimators as test statistics. In general, we recommend the use of randomization tests that either condition on covariate balance through the assignment mechanism or utilize model-adjusted test statistics, instead of an unconditional randomization test that uses an unadjusted test statistic.
Our main contribution is outlining a randomization test that conditions on covariate balance through the assignment mechanism for the general case of non-categorical covariates. Unlike the case where only categorical covariates are present, samples from the conditional randomization distribution cannot be obtained via permutations of the treatment indicator when there are non-categorical covariates. In response to this complication, we develop a rejection-sampling algorithm to sample from the conditional randomization distribution.
We find that our conditional randomization test appears to be equivalent to randomization tests that use regression-based test statistics. This contribution is particularly notable because most have characterized the choice of test statistic as the main avenue for increasing the power of a randomization test and for adjusting for imbalance in an experiment. Our work suggests how the choice of assignment mechanism can be an analogous avenue for obtaining statistically powerful randomization tests that appropriately adjust for imbalance. Furthermore, through simulation, we also find that our conditional randomization test is valid across randomizations conditional on a particular level of covariate balance, while unconditional randomization tests are often not valid across such randomizations. This suggests that our conditional randomization test can be used to ensure that statistical inferences are valid for the observed data at hand; meanwhile, unconditional randomization tests do not provide this benefit. Overall this suggests that practitioners using randomization tests should either condition on observed imbalance or use adjusted test statistics rather than the traditional randomization procedures usually seen in the literature.
To build intuition for our conditional randomization test, in Section 2 we review randomization tests for Fisher’s Sharp Null and review the conditional randomization test of Hennessy et al. (2016). In Section 3 we outline our conditional randomization test, which can flexibly condition on multiple levels of balance for non-categorical covariates. In Section 4 we provide simulation evidence that our conditional randomization test (1) is more powerful than unconditional and other conditional randomization tests, and (2) is approximately equivalent to an unconditional randomization test that uses a regression-based test statistic. In Section 5
we conclude by discussing how confidence intervals can be constructed from our conditional randomization test and the extent to which our conditional randomization test can be used for observational studies.
2 Review of Randomization Tests for Fisher’s Sharp Null
We focus on randomization tests for Fisher’s Sharp Null. While conclusions from such tests are limited—the only conclusion that can be made is whether or not there is any treatment effect among the experimental units—in Section 5 we discuss how such tests can be inverted to yield uncertainty intervals as well.
First we review a general framework for randomization tests for Fisher’s Sharp Null. We then review the unconditional randomization test typically discussed in the literature under this framework. Finally, we review the conditional randomization test of Hennessy et al. (2016) that conditions on categorical covariate balance.
2.1 Setup and Randomization Test Procedure
Consider $N$ units to be allocated to treatment and control in a randomized experiment. Following Rubin (1974), let $Y_i(1)$ and $Y_i(0)$ denote the treatment and control potential outcomes, respectively, for unit $i$, and let $\mathbf{x}_i$ denote a $d$-dimensional vector of pre-treatment covariates. Let $W_i = 1$ if unit $i$ is assigned to treatment and 0 otherwise. Furthermore, define $\mathbf{X} \equiv (\mathbf{x}_1, \dots, \mathbf{x}_N)^T$ and $\mathbf{W} \equiv (W_1, \dots, W_N)^T$ as the covariate matrix and vector of treatment assignments, respectively. The observed outcomes are $y_i \equiv W_i Y_i(1) + (1 - W_i) Y_i(0)$. Importantly, the potential outcomes and covariates are fixed; the only stochastic element of the observed outcomes is the treatment assignment $\mathbf{W}$.
Throughout, we assume a completely randomized experiment, where the true distribution of the treatment assignment is

$$P(\mathbf{W} = \mathbf{w}) = \binom{N}{N_T}^{-1} \text{ for all } \mathbf{w} \in \{0,1\}^N \text{ such that } \textstyle\sum_{i=1}^N w_i = N_T, \qquad (1)$$

with the number of treated units, $N_T$, fixed. Many causal estimands can be considered in this framework, but we focus on the average treatment effect

$$\tau \equiv \frac{1}{N} \sum_{i=1}^N \big[ Y_i(1) - Y_i(0) \big] \qquad (2)$$

because it is the most common estimand in the causal inference literature. The potential outcomes $Y_i(1)$ and $Y_i(0)$ are never both observed, so (2) needs to be estimated. One common estimator is the mean-difference estimator

$$\hat{\tau} \equiv \frac{\sum_{i=1}^N W_i y_i}{N_T} - \frac{\sum_{i=1}^N (1 - W_i) y_i}{N_C}, \qquad (3)$$

where $N_T$ and $N_C = N - N_T$ are the number of units that receive treatment and control, respectively.
A common test for assessing if an estimate for the average treatment effect is statistically significant is to test for Fisher’s Sharp Null:

$$H_0 : Y_i(1) = Y_i(0) \quad \text{for all } i = 1, \dots, N, \qquad (4)$$

which states that there is no treatment effect for any of the units. A rejection of Fisher’s Sharp Null implies that a treatment effect is present. We focus on testing Fisher’s Sharp Null because it is the most common hypothesis to assess using randomization tests in the causal inference literature (Rosenbaum, 2002b; Imbens & Rubin, 2015). See Ding (2017) and the ensuing comments for a discussion of how testing Fisher’s Sharp Null compares to testing Neyman’s Weak Null within the context of randomization-based causal inference.
Under Fisher’s Sharp Null, the outcomes for any particular randomization will be equal to the observed outcomes; i.e., the observed outcomes $\mathbf{y}$ will be the same across all realizations of $\mathbf{W}$ under the Sharp Null. Thus, under $H_0$, the value of any test statistic $t(\mathbf{W}, \mathbf{y})$ can be computed for any particular realization of the treatment assignment $\mathbf{W}$. A common choice of test statistic is the mean-difference estimator $\hat{\tau}$. Our framework can incorporate any test statistic that differentiates between treatment and control response; for now we will focus on the test statistic $\hat{\tau}$, and later we will discuss model-adjusted test statistics. See Rosenbaum (2002b, Chapter 2) for further discussion on choices of test statistics for randomization tests.
To test Fisher’s Sharp Null, one compares the observed value of the test statistic, $t^{obs} \equiv t(\mathbf{W}^{obs}, \mathbf{y})$, to the randomization distribution of the test statistic under the Sharp Null. Importantly, the randomization distribution of the test statistic depends on the set of treatment assignments that one considers possible within the randomization test.
We follow the notation of Imbens & Rubin (2015, Chapter 4) and define $\mathbb{W}$ as the set of treatment assignments with positive probability within a given randomization test. Given any test statistic $t(\mathbf{W}, \mathbf{y})$, the two-sided randomization test $p$-value for Fisher’s Sharp Null is

$$p \equiv \sum_{\mathbf{w} \in \mathbb{W}} \mathbb{I}\big( |t(\mathbf{w}, \mathbf{y})| \geq |t^{obs}| \big) P(\mathbf{W} = \mathbf{w}). \qquad (5)$$

In other words, the $p$-value (5) is the probability that a test statistic at least as extreme as the observed one would have occurred under the Sharp Null, given the assignment mechanism $P(\mathbf{W})$.
Thus, testing Fisher’s Sharp Null is a three-step procedure (Good, 2013):
1. Specify the distribution $P(\mathbf{W})$ (and, consequently, $\mathbb{W}$) to be used within the randomization test.
2. Choose a test statistic $t(\mathbf{W}, \mathbf{y})$.
3. Compute or approximate the $p$-value (5).
In the remainder of this section we will discuss two randomization tests: one that does not condition on covariate balance and one that does. The only difference between the two tests is the first step in the procedure above, i.e., the choice of the assignment mechanism .
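As a concrete illustration of this three-step procedure, the following sketch approximates the two-sided $p$-value by Monte Carlo under complete randomization with the mean-difference statistic. All function names here are ours, not the paper's, and the sketch assumes `w_obs` is a 0/1 numpy array:

```python
import numpy as np

def difference_in_means(w, y):
    """Mean-difference test statistic: treated mean minus control mean."""
    return y[w == 1].mean() - y[w == 0].mean()

def randomization_test(w_obs, y_obs, n_draws=10_000, seed=0):
    """Monte Carlo approximation of the two-sided Fisher p-value under
    complete randomization (same number treated as observed)."""
    rng = np.random.default_rng(seed)
    t_obs = difference_in_means(w_obs, y_obs)
    # Under the Sharp Null, y_obs is fixed across randomizations,
    # so we only re-draw the assignment vector.
    t_null = np.array([
        difference_in_means(rng.permutation(w_obs), y_obs)
        for _ in range(n_draws)
    ])
    return np.mean(np.abs(t_null) >= np.abs(t_obs))
```

Permuting the observed assignment vector preserves the number of treated units, so each draw is a valid complete randomization.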
2.2 Unconditional Randomization Tests
The most common randomization test in the literature utilizes the same assignment mechanism used to design the experiment, i.e., the completely randomized assignment mechanism defined in (1). A completely randomized assignment mechanism assumes that $\mathbb{W} = \{\mathbf{w} : \sum_{i=1}^N w_i = N_T\}$, i.e., it only considers assignments where $N_T$ units are assigned to treatment. Hennessy et al. (2016) call randomization tests that assume a completely randomized assignment mechanism “unconditional randomization tests” because they do not condition on forms of covariate balance. Once $P(\mathbf{W})$ and a test statistic are specified, the randomization test follows the three-step procedure from Section 2.1. This test is also called a permutation test because random samples from $P(\mathbf{W})$ can be obtained by randomly permuting the observed treatment assignment $\mathbf{W}^{obs}$.
Instead of using the completely randomized assignment mechanism $P(\mathbf{W})$ in the randomization test procedure, Hennessy et al. (2016) proposed using an assignment mechanism that conditions on covariate balance.
2.3 Conditional Randomization Tests
Because the number of treated units is prespecified as part of the design of a completely randomized experiment, the unconditional randomization test in Section 2.2 follows the typical recommendation to “analyze as you randomize.” However, many have recommended conditioning on the observed number of treated units even when the number of treated units was not specified by design (Hansen & Bowers, 2008; Zheng & Zelen, 2008; Miratrix et al., 2013; Rosenberger & Lachin, 2015). The goal of conditional inference in general (and conditional randomization tests specifically) is to focus inference on experiments that are most relevant to the data at hand by conditioning on pertinent statistics such as the number of treated units or forms of covariate balance. As we show through simulation in Section 4, conditional randomization tests can have the benefit of being valid conditional on the data as well as being valid unconditionally, whereas unconditional randomization tests are only valid unconditionally.
To formalize this idea of conditioning on pertinent statistics, define a criterion $\phi$ that is a function of the treatment assignment and pre-treatment covariates:

$$\phi(\mathbf{W}, \mathbf{X}) = \begin{cases} 1 & \text{if } \mathbf{W} \text{ is an acceptable treatment assignment} \\ 0 & \text{otherwise.} \end{cases}$$

This notation mimics that of Morgan & Rubin (2012), who use $\phi$ to define treatment assignments that are desirable for an experimental design, and that of Branson & Bind (2018), who were the first to introduce such notation for randomization tests. The unconditional randomization test in Section 2.2 inherently defines $\phi(\mathbf{W}, \mathbf{X}) = 1$ if $\sum_{i=1}^N W_i = N_T$ and 0 otherwise. In general, conditional randomization tests involve sampling from the conditional distribution $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ rather than the unconditional distribution $P(\mathbf{W})$ in Section 2.2.
Hennessy et al. (2016) focus on criteria $\phi$ that indicate some specified degree of categorical covariate balance. Assume there are $S$ covariate strata specified by the researcher such that each unit belongs to only one stratum, and define $s_i = s$ if the $i$th unit belongs to the $s$th stratum. The strata may be defined using all of the covariates or some subset of them. Then, Hennessy et al. (2016) define the criterion as

$$\phi(\mathbf{W}, \mathbf{X}) = \begin{cases} 1 & \text{if } \sum_{i : s_i = s} W_i = \sum_{i : s_i = s} W_i^{obs} \text{ for all } s = 1, \dots, S \\ 0 & \text{otherwise.} \end{cases} \qquad (8)$$

(Hennessy et al. (2016) use slightly different notation, instead defining a balance function $B(\mathbf{W}, \mathbf{X})$ and conditioning on the balance function being equal to some prespecified value. The more general notation that uses $\phi$ will become helpful in our discussion of continuous covariate balance.)
In other words, each stratum is treated as a completely randomized experiment. Hennessy et al. (2016) assume that the conditional distribution $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ is uniform, i.e.,

$$P(\mathbf{W} = \mathbf{w} \mid \phi(\mathbf{W}, \mathbf{X}) = 1) = \begin{cases} \frac{1}{|\mathbb{W}_\phi|} & \text{if } \phi(\mathbf{w}, \mathbf{X}) = 1 \\ 0 & \text{otherwise,} \end{cases}$$

where $\mathbb{W}_\phi \equiv \{\mathbf{w} : \phi(\mathbf{w}, \mathbf{X}) = 1\}$ denotes the set of acceptable assignments. Random samples from $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ can be obtained by randomly permuting the observed treatment assignment $\mathbf{W}^{obs}$ within the covariate strata. Once a test statistic is specified, the conditional randomization test follows the three-step procedure in Section 2.1, but using $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ instead of $P(\mathbf{W})$.
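Sampling from this conditional distribution amounts to an independent shuffle inside each stratum. A minimal sketch (the helper name is ours), where `strata` holds each unit's stratum label:

```python
import numpy as np

def permute_within_strata(w_obs, strata, rng):
    """Draw one assignment from the conditional distribution by permuting
    the observed treatment indicator within each stratum, which preserves
    each stratum's observed number of treated units."""
    w_new = w_obs.copy()
    for s in np.unique(strata):
        idx = np.where(strata == s)[0]
        w_new[idx] = rng.permutation(w_obs[idx])
    return w_new
```

Because the shuffle never moves units across strata, every draw satisfies the criterion (each stratum keeps its observed number of treated units) without any rejection step.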
Hennessy et al. (2016) showed via simulation that this conditional randomization test using the test statistic $\hat{\tau}$ is more powerful than the unconditional randomization test in Section 2.2 using $\hat{\tau}$. Furthermore, they found that this conditional randomization test using $\hat{\tau}$ is comparable to the unconditional randomization test using the post-stratification test statistic

$$\hat{\tau}_{ps} \equiv \sum_{s=1}^S \frac{N_s}{N} \hat{\tau}_s,$$

where $\hat{\tau}_s$ is the mean-difference estimator within stratum $s$ and $N_s$ is the number of units in stratum $s$ (Miratrix et al., 2013).
Note that the set of possible treatment assignments $\mathbb{W}$ must be large enough to perform a powerful randomization test. For example, if $|\mathbb{W}| < 20$, then it is impossible to obtain a randomization test $p$-value less than 0.05. It may be surprising that conditional randomization tests can be more powerful than unconditional randomization tests, because the former utilizes fewer treatment assignments than the latter. However, these fewer treatment assignments are more relevant to the observed treatment assignment in terms of covariate balance, which leads to more powerful inference, as discussed by works such as Rosenbaum (1984) and Hennessy et al. (2016).
When the criterion $\phi$ is defined as in (8), $|\mathbb{W}_\phi| = \prod_{s=1}^S \binom{N_s}{N_{T_s}}$, where $N_{T_s}$ is the observed number of treated units in stratum $s$; this is typically large. Furthermore, assuming that $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ is uniform, random samples from this distribution can be obtained directly, and thus implementation of the conditional randomization test is straightforward. However, this approach is less straightforward when $\mathbf{X}$ contains non-categorical covariates, because the covariate space is no longer composed of strata where there are treatment and control units in each stratum. One option is to coarsen $\mathbf{X}$ into strata and then use the conditional randomization test of Hennessy et al. (2016). Instead of throwing away information via coarsening, we propose a criterion $\phi$ that incorporates covariate balance for non-categorical covariates. We define $\phi$ such that $|\mathbb{W}_\phi|$ is large enough while still sufficiently conditioning on covariate balance. Furthermore, as we discuss below, random samples from $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ will no longer be equivalent to random permutations of $\mathbf{W}^{obs}$; thus, we develop an algorithm to obtain random samples from this conditional distribution.
3 A Conditional Randomization Test for the Case of Non-Categorical Covariates
The conditional randomization test discussed in Section 2.3 is equivalent to a permutation test within strata. This is analogous to analyzing a completely randomized experiment as if it were a blocked randomized experiment. We follow this intuition by proposing a conditional randomization test that is analogous to analyzing a completely randomized experiment as if it were a rerandomized experiment, where the rerandomization scheme incorporates a general form of covariate balance.
Rerandomization involves randomly allocating units to treatment and control until a certain level of prespecified covariate balance is achieved. Thus, rerandomization requires specifying a metric for covariate balance. We first consider an omnibus measure of covariate balance and the corresponding conditional randomization test. We then extend this conditional randomization test to flexibly incorporate multiple measures of covariate balance, rather than a single omnibus measure, which we find yields more powerful randomization tests.
3.1 Conditional Randomization Test Using An Omnibus Measure of Covariate Balance
The most common covariate balance metric used in the rerandomization literature is the Mahalanobis distance (Mahalanobis, 1936), which is defined as

$$M \equiv (\bar{\mathbf{x}}_T - \bar{\mathbf{x}}_C)^T \big[ \text{cov}(\bar{\mathbf{x}}_T - \bar{\mathbf{x}}_C) \big]^{-1} (\bar{\mathbf{x}}_T - \bar{\mathbf{x}}_C) = \left( \frac{1}{N_T} + \frac{1}{N_C} \right)^{-1} (\bar{\mathbf{x}}_T - \bar{\mathbf{x}}_C)^T \mathbf{S}_{\mathbf{x}}^{-1} (\bar{\mathbf{x}}_T - \bar{\mathbf{x}}_C), \qquad (12)$$

where $\bar{\mathbf{x}}_T$ and $\bar{\mathbf{x}}_C$ are $d$-dimensional vectors of the covariate means in the treatment and control groups, respectively, and $\mathbf{S}_{\mathbf{x}}$ is the sample covariance matrix of $\mathbf{X}$, which is fixed across randomizations. The derivation for the equality in (12) can be found in Morgan & Rubin (2012). Note that $\bar{\mathbf{x}}_T$ and $\bar{\mathbf{x}}_C$ depend on $\mathbf{W}$, and so $M$ is stochastic through $\mathbf{W}$.
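The Mahalanobis balance measure just described is a scaled quadratic form in the treatment-control covariate mean difference; a minimal sketch (the helper name is ours, and we assume the sample covariance of the covariate matrix is invertible):

```python
import numpy as np

def mahalanobis_balance(w, X):
    """Mahalanobis distance between treatment and control covariate means.
    X is an (N, d) covariate matrix; w is a 0/1 assignment vector. The
    covariance of X is fixed across randomizations; only the mean
    difference varies with w."""
    n_t, n_c = w.sum(), (1 - w).sum()
    diff = X[w == 1].mean(axis=0) - X[w == 0].mean(axis=0)
    S = np.cov(X, rowvar=False)            # sample covariance of X, fixed
    scale = 1.0 / (1.0 / n_t + 1.0 / n_c)
    return scale * diff @ np.linalg.solve(S, diff)
```

Note the distance depends on the assignment only through the mean difference, and it is symmetric: swapping the treatment and control labels leaves it unchanged, which is exactly why the criterion below adds a separate sign condition.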
We focus on using the Mahalanobis distance for our conditional randomization test because of its widespread use in measuring covariate balance for non-categorical covariates. Note that the Mahalanobis distance is an omnibus measure for balance among the individual covariates as well as their interactions (see, e.g., Stuart (2010)). Following Hennessy et al. (2016), we define a criterion $\phi$ such that:
1. It is asymmetric in treatment and control. (In particular, we would like the criterion to be able to distinguish between assignments where treated units have higher covariate values and assignments where control units have higher covariate values. As discussed in Hennessy et al. (2016), this can be useful information to condition on during a randomization test. In contrast, the Mahalanobis distance is symmetric in treatment and control.) Asymmetry allows us to condition on the direction of the mean imbalance between treatment and control, in addition to the degree of covariate overlap between the two groups (as measured by the Mahalanobis distance).
2. It conditions on the covariate balance being similar to the observed balance for a particular randomization.
To fulfill these two desires, we consider the following criterion for our conditional randomization test:

$$\phi(\mathbf{W}, \mathbf{X}) = \begin{cases} 1 & \text{if } M_a \leq M \leq M_b \text{ and } \text{sign}(\bar{x}_{Tj} - \bar{x}_{Cj}) = \text{sign}(\bar{x}_{Tj}^{obs} - \bar{x}_{Cj}^{obs}) \text{ for all } j = 1, \dots, d \\ 0 & \text{otherwise.} \end{cases} \qquad (13)$$

The equality of signs for all covariate mean differences addresses the first item above—in particular, it recognizes whether the treatment or control group has higher covariate values—while the bounds $M_a \leq M \leq M_b$ address the second item.
The criterion (13) only considers randomizations that correspond to covariate balance similar to the observed balance. Restricting $M$ to be within the bounds $[M_a, M_b]$ is analogous to stratifying the Mahalanobis distance and restricting $M$ to be in the same stratum as the observed $M^{obs}$. Now we outline two procedures for selecting $(M_a, M_b)$ for our conditional randomization test.
3.1.1 How to Choose the Bounds
To gain some intuition for how to choose the bounds, note that the interval $[M_a, M_b]$ should be narrow enough around the observed $M^{obs}$ that the corresponding $\phi$ sufficiently conditions on the observed covariate balance, but also wide enough that a powerful randomization test can still be performed. For example, consider the narrowest interval possible, $M_a = M_b = M^{obs}$. In this case, there may be only a single randomization such that $\phi(\mathbf{W}, \mathbf{X}) = 1$ (i.e., $\mathbb{W}_\phi = \{\mathbf{W}^{obs}\}$), and thus our conditional randomization test completely loses its power, even though it fully conditions on the observed covariate balance.
We will consider two ways to pick $(M_a, M_b)$, presented as Procedures 1 and 2 below. Procedure 1 selects the bounds unconditionally of $M^{obs}$, while Procedure 2 selects them conditional on $M^{obs}$. In Section 3.3 we establish that Procedure 1 yields a valid randomization test, and we also discuss the extent to which Procedure 2 yields a valid randomization test.
Procedure 1 for Selecting $(M_a, M_b)$: Bin the Mahalanobis Distance
1. Approximate the sign-constrained randomization distribution of the Mahalanobis distance by generating randomizations $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(K)}$ whose covariate mean differences have the same signs as the observed ones, and computing the corresponding Mahalanobis distances $M^{(1)}, \dots, M^{(K)}$.
2. Before observing $M^{obs}$, bin the aforementioned randomization distribution into $B$ categories. Denote the cutoff points for these bins as $b_0 < b_1 < \dots < b_B$, where $b_0 = 0$ and $b_B = \infty$.
3. After observing $M^{obs}$, set $M_a = b_{k-1}$ and $M_b = b_k$ for the $k$ such that $b_{k-1} \leq M^{obs} \leq b_k$.
Procedure 2 for Selecting $(M_a, M_b)$: Build a Neighborhood around $M^{obs}$
1. Approximate the sign-constrained randomization distribution of the Mahalanobis distance by generating randomizations $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(K)}$ whose covariate mean differences have the same signs as the observed ones, and computing the corresponding Mahalanobis distances $M^{(1)}, \dots, M^{(K)}$.
2. Specify an acceptance probability $p_a$ that denotes the proportion of the aforementioned randomization distribution to be included in $[M_a, M_b]$.
3. After observing $M^{obs}$, let $\mathcal{M}_a$ be the set of $\frac{p_a K}{2}$ Mahalanobis distances that are immediately below $M^{obs}$, and let $\mathcal{M}_b$ be the set of $\frac{p_a K}{2}$ Mahalanobis distances that are immediately above $M^{obs}$. Then, set $M_a = \min(\mathcal{M}_a)$ and $M_b = \max(\mathcal{M}_b)$.
   - If there are fewer than $\frac{p_a K}{2}$ Mahalanobis distances immediately below $M^{obs}$, set $\mathcal{M}_a$ as the set of all Mahalanobis distances below $M^{obs}$, and set $\mathcal{M}_b$ as the set of Mahalanobis distances immediately above $M^{obs}$ such that $|\mathcal{M}_a| + |\mathcal{M}_b| = p_a K$.
   - If there are fewer than $\frac{p_a K}{2}$ Mahalanobis distances immediately above $M^{obs}$, set $\mathcal{M}_b$ as the set of all Mahalanobis distances above $M^{obs}$, and set $\mathcal{M}_a$ as the set of Mahalanobis distances immediately below $M^{obs}$ such that $|\mathcal{M}_a| + |\mathcal{M}_b| = p_a K$.
Procedure 1 categorizes the Mahalanobis distance and then sets $(M_a, M_b)$ according to the category that $M^{obs}$ falls into. Procedure 2 sets $(M_a, M_b)$ according to the Mahalanobis distances that are immediately around $M^{obs}$, such that approximately $100 p_a \%$ of the Mahalanobis distances are contained in $[M_a, M_b]$, with $M^{obs}$ being the median of $[M_a, M_b]$ (except for the two corner cases noted in the final step of Procedure 2). Furthermore, one can use rejection sampling to generate the randomizations in Step 1 of Procedures 1 and 2: generate a complete randomization $\mathbf{w}$ from $P(\mathbf{W})$ defined in (1), and only keep $\mathbf{w}$ if the sign constraint is fulfilled. In the simulation study discussed in Section 4, we focus on Procedure 2, because it ensures that the hypothetical randomizations used during the conditional randomization test are the randomizations most similar to the observed one in terms of covariate balance.
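Under one concrete reading of Procedure 2, the bounds are order statistics of the simulated sign-constrained distribution around the observed distance. A sketch with our own names (`m_draws` for the simulated distances, `p_a` for the acceptance probability), ignoring ties and applying the two corner-case rules:

```python
import numpy as np

def neighborhood_bounds(m_draws, m_obs, p_a):
    """Procedure-2-style bounds: pick (M_a, M_b) so that roughly a fraction
    p_a of the simulated sign-constrained Mahalanobis distances fall in
    [M_a, M_b], centered (when possible) at the observed distance."""
    m = np.sort(np.asarray(m_draws))
    k = int(round(p_a * len(m)))            # total distances to keep
    below = m[m < m_obs]
    above = m[m >= m_obs]
    half = k // 2
    n_below = min(half, len(below))         # corner case: too few below
    n_above = min(k - n_below, len(above))
    n_below = min(k - n_above, len(below))  # corner case: too few above
    kept = np.concatenate([below[len(below) - n_below:], above[:n_above]])
    return kept.min(), kept.max()
```

Because `below[...]` and `above[...]` are adjacent slices of the sorted draws, the kept values form a contiguous neighborhood of `m_obs`.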
3.1.2 Rejection-Sampling Approach for Performing the Conditional Randomization Test
Although $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ is uniformly distributed, random samples from this conditional distribution no longer correspond to random permutations of $\mathbf{W}^{obs}$ as in the unconditional randomization test in Section 2.2 or the conditional randomization test in Section 2.3. Similar to how the randomizations in Step 1 of Procedures 1 and 2 can be generated, we propose a simple rejection-sampling algorithm to generate a random draw from $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$:
1. Generate a random draw $\mathbf{w}$ from $P(\mathbf{W})$ defined in (1).
2. Accept $\mathbf{w}$ if $\phi(\mathbf{w}, \mathbf{X}) = 1$; otherwise, repeat Step 1.
Note that, as the interval $[M_a, M_b]$ gets smaller, it will be more computationally intensive to generate random samples from $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$, but the test corresponds to more precisely conditioning on the observed covariate balance. If generating random samples from $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ via rejection sampling is computationally intensive, one can use an alternative approach proposed by Branson & Bind (2018), which uses importance sampling to approximate randomization test $p$-values at a lower computational cost than rejection sampling.
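The rejection sampler is a propose-and-check loop. A minimal sketch (helper names ours), where `criterion(w)` returns `True` exactly when the criterion equals 1 for the proposed assignment:

```python
import numpy as np

def draw_conditional(w_obs, criterion, rng, max_tries=100_000):
    """Rejection sampling from the conditional assignment distribution:
    propose complete randomizations (permutations of w_obs, which keeps
    the number treated fixed) until one satisfies the criterion."""
    for _ in range(max_tries):
        w = rng.permutation(w_obs)   # a draw from the design in (1)
        if criterion(w):             # i.e., phi(w, X) == 1
            return w
    raise RuntimeError("acceptance region too small; widen the bounds")
```

Because accepted draws are complete randomizations filtered by the criterion, they are uniform over the acceptable set, matching the assumed conditional distribution; the expected number of proposals per draw is the reciprocal of the acceptance probability.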
In Section 4 we show via simulation that this conditional randomization test is more powerful than the standard unconditional randomization test, because the former conditions on a measure of covariate balance. However, the criterion (13) uses an omnibus measure of covariate balance, which may not sufficiently condition on the observed randomization if the number of covariates is large. We now extend this procedure to more precisely condition on the observed covariate balance for a given randomization by incorporating multiple measures of covariate balance. We show in Section 4 that this extension results in a further gain in statistical power.
3.2 Conditional Randomization Test Using Multiple Measures of Covariate Balance
Consider $T$ tiers (or sets) of covariates that are of interest as specified by the researcher. Let $\mathbf{X}_t$ denote the covariates in tier $t = 1, \dots, T$, where each covariate only appears in one of the tiers. Then, define

$$M_t \equiv \left( \frac{1}{N_T} + \frac{1}{N_C} \right)^{-1} (\bar{\mathbf{x}}_{T,t} - \bar{\mathbf{x}}_{C,t})^T \mathbf{S}_{\mathbf{x}_t}^{-1} (\bar{\mathbf{x}}_{T,t} - \bar{\mathbf{x}}_{C,t})$$

as the Mahalanobis distance for the covariates in tier $t$, where $\bar{\mathbf{x}}_{T,t}$ and $\bar{\mathbf{x}}_{C,t}$ are the treatment and control covariate means in tier $t$ and $\mathbf{S}_{\mathbf{x}_t}$ is the sample covariance matrix of $\mathbf{X}_t$. This setup of dividing covariates into tiers is similar to Morgan & Rubin (2015), who developed a rerandomization framework that forces each $M_t$ to be sufficiently small by design. Note that the setup in Section 3.1 corresponds to $T = 1$ tier.
Our proposed conditional randomization test follows a procedure similar to that in Section 3.1, but within each tier $t$. Define the criterion

$$\phi_t(\mathbf{W}, \mathbf{X}) = \begin{cases} 1 & \text{if } M_{a_t} \leq M_t \leq M_{b_t} \text{ and the covariate mean-difference signs in tier } t \text{ equal the observed signs} \\ 0 & \text{otherwise} \end{cases}$$

for some lower and upper bounds $M_{a_t}$ and $M_{b_t}$ for each tier $t$. Then, define the overall criterion

$$\phi(\mathbf{W}, \mathbf{X}) = \prod_{t=1}^T \phi_t(\mathbf{W}, \mathbf{X}). \qquad (15)$$
The bounds $(M_{a_t}, M_{b_t})$ are chosen separately for each tier using the procedures discussed in Section 3.1.1. This requires choosing an acceptance probability $p_{a_t}$ for each tier. Because a smaller $p_{a_t}$ corresponds to more stringent conditional inference, tiers with covariates that are believed to be most relevant to the outcomes should be assigned smaller $p_{a_t}$. However, recall that a smaller $p_{a_t}$ corresponds to more computational time required to obtain draws from $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ via our rejection-sampling algorithm discussed in Section 3.1.2.
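The overall criterion is 1 only if every tier's check passes, so in code it is a conjunction of per-tier checks. A sketch under our reconstruction (the names and the `tiers` data layout are ours): each tier supplies its column indices, its bounds, and the observed sign pattern of its mean differences:

```python
import numpy as np

def tiered_criterion(w, X, tiers):
    """Overall criterion: every tier's Mahalanobis distance must fall in
    that tier's bounds AND its covariate mean differences must match the
    observed signs. `tiers` is a list of dicts with keys 'cols' (column
    indices), 'bounds' (lo, hi), and 'signs' (observed sign pattern)."""
    diff_all = X[w == 1].mean(axis=0) - X[w == 0].mean(axis=0)
    n_t, n_c = w.sum(), (1 - w).sum()
    scale = 1.0 / (1.0 / n_t + 1.0 / n_c)
    for tier in tiers:
        d = diff_all[tier["cols"]]
        S = np.cov(X[:, tier["cols"]], rowvar=False).reshape(len(d), len(d))
        m = scale * d @ np.linalg.solve(S, d)
        lo, hi = tier["bounds"]
        if not (lo <= m <= hi and np.all(np.sign(d) == tier["signs"])):
            return False
    return True
```

A function like this can be passed directly as the `criterion` argument of the rejection sampler sketched in Section 3.1.2; the `.reshape` call handles the one-covariate-tier case, where `np.cov` returns a scalar.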
3.3 The Validity of Conditional Randomization Tests
A test is valid if $P(p \leq \alpha \mid H_0) \leq \alpha$ for all $\alpha \in (0, 1)$, where $H_0$ is the Sharp Null Hypothesis and $p$ is the calculated $p$-value. In our context, $p$ is a function of the observed assignment and the testing procedure, and the probability is taken over the true assignment mechanism with the potential outcomes held fixed. For our conditional tests, the $p$-value is calculated as the probability of observing a test statistic more extreme than the observed one across randomizations such that $\phi(\mathbf{W}, \mathbf{X}) = 1$, for a specific $\phi$ determined by the observed assignment and covariates. Thus, the validity of our conditional randomization test depends on the criterion $\phi$, which—as shown in (15)—is defined by the bounds in each tier and the covariate sign constraints. In Section 3.1.1, Procedure 1 defines the bounds before randomization, whereas Procedure 2 defines the bounds based on $M^{obs}$ after randomization. This latter case introduces complications for establishing validity that we believe have not been previously discussed in the literature. In what follows, we discuss why exact validity may not necessarily hold for the conditional randomization test that uses Procedure 2, and establish validity for the test that uses Procedure 1.
Define $\mathcal{B}$ as the set of possible bounds and $\mathcal{S}$ as the set of possible covariate signs across all randomizations, and define $\mathbb{W}_{b,s}$ as the set of all randomizations that lead to particular bounds $b \in \mathcal{B}$ and signs $s \in \mathcal{S}$. The collection of $\mathbb{W}_{b,s}$ partition $\mathbb{W}$ into non-overlapping sets. (Recall that $\mathbb{W}$ is the set of treatment assignments with positive probability; see Section 2.) The overall probability of our conditional randomization test falsely rejecting the null can then be decomposed as

$$P(p \leq \alpha \mid H_0) = \sum_{b \in \mathcal{B}} \sum_{s \in \mathcal{S}} P(p \leq \alpha \mid \mathbf{W} \in \mathbb{W}_{b,s}, H_0) \, P(\mathbf{W} \in \mathbb{W}_{b,s}).$$

Given the above, a sufficient condition for establishing validity is that $P(p \leq \alpha \mid \mathbf{W} \in \mathbb{W}_{b,s}, H_0) \leq \alpha$ for all $b \in \mathcal{B}$ and $s \in \mathcal{S}$.
A given $b$ and $s$ pair specify a specific conditioning function $\phi_{b,s}$. Let $\mathbb{W}_{\phi_{b,s}}$ be the set of randomizations satisfying a given function $\phi_{b,s}$. Then, our calculated $p$-value, conditioned on an observed randomization, consequent $\phi_{b,s}$, and outcome $\mathbf{y}$, will be

$$p = \frac{1}{|\mathbb{W}_{\phi_{b,s}}|} \sum_{\mathbf{w} \in \mathbb{W}_{\phi_{b,s}}} \mathbb{I}\big( |t(\mathbf{w}, \mathbf{y})| \geq |t^{obs}| \big).$$
Under the null, $\mathbf{y}$ and $\mathbf{X}$ are both invariant to random assignment, making our test statistic solely a function of $\mathbf{W}$. Under the null, then, let

$$T_{b,s} \equiv t(\mathbf{W}', \mathbf{y})$$

be a random variable whose distribution is that of the test statistic, where $\mathbf{W}'$ is uniformly distributed across the elements of $\mathbb{W}_{b,s}$, and let $T_{\phi_{b,s}}$ be analogously defined for $\mathbb{W}_{\phi_{b,s}}$. (Note that $\mathbf{W}^{obs} \in \mathbb{W}_{b,s}$, because the realized $b$ and $s$ are specified by $\mathbf{W}^{obs}$.) Now consider our conditional probability for some specific $b$ and $s$. Given this conditioning, our original test statistic is distributed as $T_{b,s}$. Regardless of the observed value of our test statistic, our reference distribution will be that of $T_{\phi_{b,s}}$, for our given $\phi_{b,s}$. Thus, our $p$-value, conditioned on our original randomization giving us our given $b$, $s$ pair, will then be the upper tail of our reference distribution, calculated as

$$p = 1 - F_{\phi_{b,s}}(T_{b,s}), \qquad (19)$$

where $F_{\phi_{b,s}}$ is the cumulative distribution function of $T_{\phi_{b,s}}$. Here $p$ is a function of $T_{b,s}$ and $\phi_{b,s}$, given the potential outcomes and covariates.
Typically, validity of a randomization test is proven by arguing that $p$-values of the form (19) are uniformly distributed, by applying the probability integral transform (for an example of this method of proof, see Hennessy et al. 2016, Section 2). When Procedure 1 is used to select the bounds, $\mathbb{W}_{\phi_{b,s}} = \mathbb{W}_{b,s}$; i.e., all of the assignments used in the conditional randomization test are the same assignments that would lead to the realized $b$ and $s$. Therefore, $T_{b,s}$ and $T_{\phi_{b,s}}$ have the same distribution under Procedure 1, and validity immediately follows from (19). However, because Procedure 2 specifies the bounds as a neighborhood around $M^{obs}$, the conditional randomization test under Procedure 2 uses randomizations that may not have led to the realized $b$ and $s$. As a result, $\mathbb{W}_{\phi_{b,s}}$ and $\mathbb{W}_{b,s}$ will differ, and $T_{b,s}$ and $T_{\phi_{b,s}}$ will not necessarily have the same distribution. Consequently, our conditional randomization test that uses Procedure 2 for selecting the bounds is not necessarily valid. Nonetheless, in Section 4 and the Appendix, we find that our conditional randomization test using Procedure 2 is empirically valid under a wide variety of scenarios. This in part stems from the centering of our reference distribution around the observed test statistic; by contrast, if we had always selected reference distributions less extreme than the observed statistic, we could induce invalidity. We leave investigating when validity formally holds when randomization test $p$-values are of the form (19) for two differing distributions as a promising line for future research.
4 Simulation Study: Conditional and Unconditional Performance of Conditional and Unconditional Randomization Tests
We now conduct a simulation study to explore the statistical power of the unconditional randomization test from Section 2.2, our conditional randomization tests from Sections 3.1 and 3.2, and another conditional randomization test inspired by Coarsened Exact Matching (CEM). CEM was designed for observational studies to find a subset of treatment and control units that match exactly on a coarsened covariate space (Iacus et al., 2011, 2012). Even though CEM was developed for observational studies and not randomization tests, we include it in our comparison because—as we noted at the end of Section 2—coarsening into strata is one option for performing a conditional randomization test in the face of continuous covariates. Thus, it is the most natural test to compare to our conditional randomization test.
In what follows, we find that our conditional randomization test using is more powerful than the unconditional randomization test using as well as the CEM-based tests. Furthermore, we find that our test is comparable to an unconditional randomization test using a regression-based test statistic. Finally, we find that the conditional randomization tests and an unconditional randomization test using a regression-based test statistic are all valid both unconditionally and conditional on the data, whereas the unconditional randomization test that uses an unadjusted test statistic is only valid unconditionally.
4.1 Simulation Procedure
Consider units whose potential outcomes are generated according to the following model:
where and are independently and randomly sampled from a distribution. The parameters and take on values and across simulations. As increases, the covariates become more associated with the outcome; as increases, the treatment effect increases and thus should be easier to detect.
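A generic linear potential-outcome model consistent with this description can be written as follows; the symbols and the standard-normal draws are our notation, not necessarily the paper's exact specification of model (20):

```latex
Y_i(w) \;=\; \beta \sum_{j=1}^{4} x_{ij} \;+\; \tau w \;+\; \varepsilon_i,
\qquad x_{ij},\, \varepsilon_i \overset{\text{iid}}{\sim} N(0,1),
```

where $\beta$ governs how strongly the covariates are associated with the outcome and $\tau$ governs the size of the treatment effect, matching the description above.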
Once the potential outcomes are generated, units are randomized to treatment and control such that units receive treatment and units receive control; in other words, units are assigned according to the completely randomized assignment mechanism (1). This is repeated such that 1,000 randomizations are produced using the same fixed potential outcomes. In the Appendix we also consider an unbalanced design where an unequal number of units are assigned to treatment and control; however, the results for that scenario are largely the same as the results presented here, where .
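For concreteness, the data-generating and randomization steps can be sketched as follows; the covariate dimension and coefficient values here are illustrative stand-ins for model (20), not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_rand = 100, 1000

# Fixed covariates and potential outcomes from a linear model;
# beta and tau are illustrative stand-ins for the paper's parameters.
X = rng.normal(size=(n, 4))
beta, tau = 1.0, 0.5
Y0 = X @ np.full(4, beta) + rng.normal(size=n)  # outcome under control
Y1 = Y0 + tau                                    # outcome under treatment

# Completely randomized assignment: exactly n/2 treated units,
# repeated n_rand times with the potential outcomes held fixed.
W = np.zeros((n_rand, n), dtype=int)
for i in range(n_rand):
    W[i, rng.choice(n, n // 2, replace=False)] = 1

# Each randomization reveals one observed outcome vector.
Y_obs = W * Y1 + (1 - W) * Y0
print(W.sum(axis=1)[:5])  # every row has exactly 50 treated units
```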
For each randomization, five separate randomization tests were performed:
Conditional Randomization Test: The procedure described in Section 3.2 using the criterion (16), which requires specifying the number of covariate tiers and the acceptance probability in Procedure 2 for selecting the bounds within each tier. We consider number of tiers and acceptance probabilities . The case corresponds to the procedure described in Section 3.1. For , the first two covariates are in one tier while the last two are in another tier; for , all covariates are in their own tier. For each tier, we choose by setting all tier-level acceptance probabilities to be equal, where the overall acceptance probability is . (This equality holds only because the covariates in each tier are independent.) Thus, for all tiers . We use the test statistic .
Unconditional Randomization (with model-adjusted test statistic): The procedure described in Section 2.2, using the test statistic , which is defined as the estimated coefficient for from the linear regression of on , , and . This test statistic was discussed in Lin (2013), but within the context of Neymanian inference rather than randomization tests.
Coarsened Exact Matching (Prespecified Groups): Each covariate is coarsened into groups according to the quantiles of its distribution, thus coarsening the covariate space into strata. Then, to perform a randomization test using as a test statistic, the vector of treatment assignments is permuted many times within each stratum. We consider number of groups .
Coarsened Exact Matching (Automatic Groups): The same as the previous test, but the groups are chosen automatically by the R function cem.
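The regression-adjusted test statistic in the second procedure can be computed directly with least squares. The sketch below is our own minimal implementation with hypothetical variable names; following Lin (2013), the covariates are centered so that the treatment coefficient estimates the average treatment effect.

```python
import numpy as np

def lin_adjusted_estimate(y, w, X):
    """Treatment coefficient from the OLS regression of y on the
    treatment indicator, centered covariates, and their interactions
    (the specification studied by Lin 2013)."""
    Xc = X - X.mean(axis=0)                       # center the covariates
    design = np.column_stack([np.ones(len(y)),    # intercept
                              w,                   # treatment indicator
                              Xc,                  # covariate main effects
                              w[:, None] * Xc])    # treatment interactions
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]                                 # treatment coefficient

rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 2))
w = np.zeros(n, dtype=int)
w[rng.choice(n, n // 2, replace=False)] = 1
y = 2.0 + 0.5 * w + X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=n)
est = lin_adjusted_estimate(y, w, X)
print(est)  # close to the true treatment effect of 0.5
```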
Our motivation for including the third randomization test in our comparison is that Hennessy et al. (2016) found that their conditional randomization test using is comparable to the unconditional randomization test using defined in (10), and that is equivalent to when covariates are categorical (Lin, 2013). We also considered our conditional randomization test using instead of , and found that the power results for that test are essentially the same as those for the unconditional randomization test using ; we relegate those results to the Appendix.
Meanwhile, the last two procedures utilize CEM. These conditional tests are identical to the test of Hennessy et al. (2016) using the assignment mechanism (9), where the strata are chosen via CEM. In the CEM (Prespecified Groups) procedure, the strata are specified according to the quantiles of the known distributions of the covariates. Meanwhile, in the CEM (Automatic Groups) procedure, the strata are automatically specified according to Sturges’ rule, which uses the range of the covariates and is the default option in the cem R package (Iacus et al., 2009). Details about this procedure and other automated procedures in the context of CEM are discussed in Iacus et al. (2012).
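A within-strata permutation test of the kind underlying these CEM-based procedures can be sketched as follows. This is a minimal illustration that assumes every stratum contains both treated and control units; in actual CEM, strata lacking one of the two arms are discarded first.

```python
import numpy as np

rng = np.random.default_rng(3)

def stratified_permutation_p(y, w, strata, n_perm=1000):
    """Conditional randomization test in the style of Hennessy et al.
    (2016): permute treatment labels independently within each stratum,
    holding the stratum-level balance fixed."""
    t_obs = y[w == 1].mean() - y[w == 0].mean()
    ref = np.empty(n_perm)
    for b in range(n_perm):
        w_star = w.copy()
        for s in np.unique(strata):
            idx = np.where(strata == s)[0]
            w_star[idx] = rng.permutation(w_star[idx])
        ref[b] = y[w_star == 1].mean() - y[w_star == 0].mean()
    return (1 + (ref >= t_obs).sum()) / (1 + n_perm)

# Toy data: 5 strata of 20 units, 10 treated per stratum, and a
# treatment effect of 1.0 (values chosen purely for illustration).
strata = np.repeat(np.arange(5), 20)
w = np.tile(np.r_[np.ones(10, dtype=int), np.zeros(10, dtype=int)], 5)
y = rng.normal(size=100) + 1.0 * w
p = stratified_permutation_p(y, w, strata)
print(p)  # small p-value: the effect is detected
```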
4.2 Simulation Results: Unconditional Performance
We first assess statistical power, which corresponds to how often each randomization test rejected Fisher's Sharp Null across the 1,000 complete randomizations when . The average rejection rates for the unconditional randomization tests using and as well as our conditional randomization test are presented in Figure 1 for various values of and . Figure 1(a) displays results for a fixed acceptance probability and different numbers of tiers, while Figure 1(b) displays results for a fixed number of tiers and different acceptance probabilities.
Several conclusions can be made from Figure 1. First, when (i.e., when the covariates are not associated with the outcome), all of the randomization tests are essentially equivalent. When the covariates are associated with the outcome, our conditional randomization test is more powerful than the unconditional randomization test that uses . Furthermore, the power of our conditional randomization test increases as the acceptance probability decreases and/or the number of tiers increases; this is expected, because a lower acceptance probability and a higher number of tiers correspond to more stringent conditioning.
Figure 1(a) suggests that practitioners can increase power by increasing the number of tiers without any additional computational cost (i.e., without decreasing the acceptance probability). Furthermore, Figure 1(b) suggests that the additional gain in power decreases as decreases, which echoes the observation made by Li et al. (2018) in the rerandomization literature that the marginal benefit of decreasing decreases as decreases. Analogous figures for the and cases are in the Appendix; by comparing those figures with Figure 1(b), it can be seen that the additional gain in power from decreasing increases as increases. This observation emphasizes the benefits of conditioning on multiple measures of covariate balance rather than a single omnibus measure. Further discussion of this point is in the Appendix.
Meanwhile, Figure 1 also shows that the unconditional randomization test using was more powerful than all of the conditional and unconditional randomization tests using . However, as gets smaller and gets larger—i.e., as conditioning becomes more stringent—the performance of our conditional randomization test appears to approach that of the unconditional randomization test that uses . This reinforces the claim made by Li et al. (2018) that—in a Neymanian inference context— under complete randomization is equivalent to under very stringent rerandomization. However, Li et al. (2018) made this claim about the rerandomization scheme that uses an omnibus measure of covariate balance; our findings suggest that this claim should be qualified to state that the equivalence between under complete randomization and under rerandomization holds when the rerandomization scheme incorporates separate measures of balance for each covariate used in , rather than a single omnibus measure.
Here, is correctly specified because the potential outcomes are generated from a linear model, and one may wonder how the unconditional randomization test using performs when this model is misspecified. We consider this in the Appendix and obtain findings very similar to those presented here. In particular, for the simulation settings considered, we find that it is still beneficial to use the unconditional randomization test with or our conditional randomization test with in the misspecified case as long as the functions of the covariates used in the regression to construct are correlated with the response; when they are not correlated, these tests are essentially equivalent. In the Appendix we also explore a variety of additional simulation scenarios—when the covariates have positive and negative effects on the potential outcomes, when there are heterogeneous treatment effects, and when the covariates are not normally distributed—and we again find results that are very similar to the results presented here. This suggests that these results hold under a wide variety of scenarios.
Now we assess the performance of the conditional randomization tests that use CEM. Figure 2 shows the average rejection rate of Fisher's Sharp Null for the CEM-based randomization tests. To anchor our comparison, Figure 2 also includes the results for the unconditional randomization tests using and (i.e., the same results presented in Figure 1). When the covariate space for each covariate is coarsened into groups, these conditional randomization tests are more powerful than the unconditional randomization test using when , although they are not as powerful as our conditional randomization test or the unconditional randomization test using . When the number of groups for each covariate is increased, the power of the conditional randomization tests tends to decrease, especially for the CEM procedure that specifies strata according to the quantiles of the known covariate distributions. At first this finding may be surprising, because more groups should correspond to more stringent conditioning and thus possibly higher power. However, as the number of groups increases, there are fewer strata with both treatment and control units, and thus more units are discarded and there are fewer possible randomizations used during the randomization test. For example, for the CEM (Prespecified Groups) procedure, when there were groups, on average 4 of the 100 units were discarded across the 1,000 randomizations; when , on average 54 of the 100 units were discarded; and when , on average 88 of the 100 units were discarded. In the most extreme case, if we let the number of groups go to infinity—i.e., not coarsen the continuous covariate space at all—there would not be any treatment and control units with the same covariate values, and thus all units would be discarded.
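The discarding behavior described above is easy to reproduce. The sketch below coarsens each covariate into quantile bins and counts the units that fall in strata lacking either treated or control units; it is a simplified stand-in for the cem package, not its actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = rng.normal(size=(n, 4))
w = np.zeros(n, dtype=int)
w[rng.choice(n, n // 2, replace=False)] = 1

def discarded_after_coarsening(X, w, n_groups):
    # Coarsen each covariate into n_groups quantile bins, form strata
    # from the bin combinations, and drop strata lacking both arms.
    qs = np.linspace(0, 1, n_groups + 1)[1:-1]
    bins = np.stack([np.searchsorted(np.quantile(X[:, j], qs), X[:, j])
                     for j in range(X.shape[1])], axis=1)
    strata = {}
    for i, key in enumerate(map(tuple, bins)):
        strata.setdefault(key, []).append(i)
    kept = [i for idx in strata.values()
            if len({w[j] for j in idx}) == 2 for i in idx]
    return len(w) - len(kept)

# More groups means finer strata, hence more discarded units.
results = {g: discarded_after_coarsening(X, w, g) for g in (2, 3, 5)}
print(results)
```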
Meanwhile, there is not a clear winner between the CEM (Prespecified Groups) and CEM (Automatic Groups) procedures, although the CEM (Automatic Groups) procedure is not as severely underpowered for the case as the CEM (Prespecified Groups) procedure. In their development of CEM, Iacus et al. (2009, 2011, 2012) recommend researchers use context-specific knowledge for specifying strata rather than automated procedures, but it is unclear if this should be the recommendation when using CEM for conditional randomization tests. Indeed, the CEM (Prespecified Groups) procedure uses the most context-specific knowledge possible (the actual data-generating process for the covariates), but it does not necessarily perform as well as the automated procedure.
In summary, these findings suggest that it is beneficial to condition on forms of covariate balance that account for continuous covariates, rather than condition on a coarsened version of the continuous covariate space. Furthermore, our methodology allows researchers to condition on the data at hand in a way that increases the power of randomization tests, while coarsening the covariate space may lead to a lack of possible treatment assignments to perform a powerful randomization test.
4.3 Simulation Results: Conditional Performance
We next examine the performance of the five tests across randomizations that are particularly balanced or imbalanced. First, we generated the potential outcomes using model (20) with (which corresponds to no treatment effect) and (which corresponds to a strong association between the covariates and potential outcomes). Then, we generated 10,000 randomizations and divided these randomizations into 10 groups according to quantiles of the Mahalanobis distance. Thus, the first group consists of the 1,000 best randomizations according to the Mahalanobis distance, while the tenth group consists of the 1,000 worst randomizations. Now we consider whether the five randomization tests are valid (i.e., reject Fisher’s Sharp Null when it is true 5% of the time) for randomizations conditional on a particular level of covariate balance. Conditional validity assesses to what extent these tests are valid across randomizations that are similar to the observed randomization.
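The balance measure and quantile grouping used here can be sketched as follows; the n/4 scaling assumes a balanced design with n/2 units per arm, following the form of the Mahalanobis balance criterion in Morgan & Rubin (2012), and the simulation sizes mirror those above.

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_rand = 100, 10000
X = rng.normal(size=(n, 4))
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis_balance(X, w, S_inv):
    # Mahalanobis distance between treated and control covariate means;
    # the n/4 factor is the scaling for a balanced design.
    d = X[w == 1].mean(axis=0) - X[w == 0].mean(axis=0)
    return (len(w) / 4) * d @ S_inv @ d

M = np.empty(n_rand)
for i in range(n_rand):
    w = np.zeros(n, dtype=int)
    w[rng.choice(n, n // 2, replace=False)] = 1
    M[i] = mahalanobis_balance(X, w, S_inv)

# Split the randomizations into 10 quantile groups, from the most
# balanced (group 0) to the least balanced (group 9).
group = np.searchsorted(np.quantile(M, np.linspace(0.1, 0.9, 9)), M)
print(np.bincount(group))  # roughly 1,000 randomizations per group
```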
Figure 3 displays the average rejection rate of each randomization test for each of the 10 quantile groups of the Mahalanobis distance. For the CEM-based tests, we display results for groups, because this resulted in the most power in Section 4.2; the conditional performance for higher numbers of groups is similar. Our conditional randomization test that uses and the unconditional randomization test that uses both exhibit average rejection rates close to the 5% level across all quantile groups, which suggests that both tests are conditionally valid across randomizations of any particular balance level. The story is quite different for the unconditional randomization test that uses : for low levels of covariate imbalance, the average rejection rate is below the 5% level, while for high levels of covariate imbalance the average rejection rate is notably above the 5% level. These rejection rates average out to 5%—as can be seen in Figure 1—and thus the unconditional randomization test that uses is unconditionally valid, but—as can be seen in Figure 3—it is not valid conditional on a particular balance level. In particular, the false rejection rate for the unconditional randomization test that uses appears to be monotonically increasing in covariate imbalance, which is intuitive given that treatment effects will be increasingly confounded with covariate effects as covariate imbalance increases. Meanwhile, the false rejection rate for the CEM-based tests also appears to be monotonically increasing in covariate imbalance according to the Mahalanobis distance, but to a much less severe degree. This is likely because these tests condition on balance for a coarsened version of the covariate space instead of balance for the continuous covariate space as measured by the Mahalanobis distance. In short, they are conditionally valid for the coarsened covariate space but not the continuous covariate space.
In summary, statistically powerful randomization tests can be constructed by conditioning on covariate balance through the assignment mechanism or by using a model-adjusted test statistic; either option will result in a more powerful test than an unconditional randomization test that uses an unadjusted test statistic. We also find that our conditional randomization test using unadjusted test statistics and unconditional randomization tests using model-adjusted test statistics appear to be approximately equivalent, both across complete randomizations as well as across randomizations of a particular balance level. Furthermore, we find that our conditional randomization test that directly conditions on group-level balance for continuous covariates is more powerful than other conditional randomization tests that condition on a coarsened version of the covariate space. Finally, it is particularly important to condition on group-level covariate balance or use a model-adjusted test statistic to ensure validity across randomizations of a particular balance level, because covariate imbalances can break the conditional validity of unconditional randomization tests that use unadjusted test statistics.
5 Discussion and Conclusion
Hennessy et al. (2016) outlined a conditional randomization test that conditions on the covariate balance observed after an experiment has been conducted, and showed that these tests are more powerful than standard unconditional randomization tests and comparable to randomization tests that use model-adjusted estimators, such as the post-stratified estimator in Miratrix et al. (2013). However, Hennessy et al. (2016) focused on the case when there are only categorical covariates. Here we proposed a methodology for conducting a randomization test that conditions on a form of covariate balance that allows for non-categorical covariates.
Through simulation, we found that our conditional randomization test is more powerful than unconditional randomization tests that use unadjusted test statistics as well as other conditional randomization tests inspired by the observational study literature, and that it is approximately equivalent to an unconditional randomization test that uses a regression-based test statistic. We also found that the conditional randomization tests and the unconditional randomization tests that use adjusted test statistics appear valid conditional on the observed covariate balance; the more traditional unconditional randomization tests that use unadjusted test statistics, however, are clearly not.
The above findings hold under a variety of data-generating scenarios, such as ones with treatment effect heterogeneity or model misspecification. Most of the literature has focused on increasing the power of randomization tests through the choice of the test statistic; to our knowledge, we are the first to do the same through the choice of the assignment mechanism for the general case when non-categorical covariates are present. Furthermore, we found evidence that these two avenues for constructing randomization tests are approximately equivalent in terms of statistical power. Thus, our methodology can achieve the power of model-adjustment while preserving the transparency of an unadjusted treatment effect estimate, thereby taking advantage of the benefits of both adjusted and unadjusted estimators as discussed by Lin (2013). Relatedly, we also discussed how this finding suggests connections between regression-based estimators after complete randomization and unadjusted estimators after rerandomization, which refines observations previously made by Li et al. (2018).
We focused on randomization tests for randomized experiments, but we believe that this work has implications beyond tests and experiments. Randomization tests can be inverted to yield confidence intervals for treatment effects (Rosenbaum, 2002b; Imbens & Rubin, 2015), and thus our method can go beyond testing the presence of a treatment effect. Some have criticized such randomization-based confidence intervals because they commonly make the assumption of a constant treatment effect for all units. However, recent works have suggested how to incorporate treatment effect heterogeneity in randomization tests (e.g., Ding et al. 2016; Caughey et al. 2016), and our work adds to this literature by suggesting how forms of covariate balance can be incorporated in randomization tests as well. An interesting line of future work would be to combine our conditional randomization test with these works to conduct randomization-based inference that incorporates both treatment effect heterogeneity and covariate balance.
Furthermore, most work on randomization tests for observational studies has focused on cases where only categorical covariates are present (Rosenbaum, 1984, 1988, 2002a, 2002b). Our work suggests a way to conduct randomization-based inference for observational studies when non-categorical covariates are present. However, because the assignment mechanism in an observational study is unknown, researchers need to determine when certain assignment mechanisms can be assumed within an observational study before conducting randomization-based inference. See Branson (2018) for a framework for how to conduct conditional randomization-based inference in this context.
6 Appendix: Additional Simulation Results
Here we present further power results of randomization tests similar to those presented in Section 4. All of the following sections and figures discuss the average rejection rate of Fisher’s Sharp Null for various randomization tests. In Section 6.1, we consider the same setup discussed in Section 4 and present results for our conditional randomization test for various acceptance probabilities and one or two tiers (instead of four tiers), as well as results for our conditional randomization test using the regression-adjusted test statistic (instead of ). Then, in Sections 6.3 and 6.4 we consider other data-generating processes not explored in Section 4, including:
when some covariate effects are positive and some are negative,
when there is treatment effect heterogeneity,
when there are non-normal covariates,
when the linear regression used in is misspecified.
6.1 Simulation Results for One and Two Tiers and for Conditional Randomization using
Consider the same simulation setup as Section 4, where the potential outcomes for units are generated using the model (20). In Section 4.2, we examined the power of our conditional randomization test for various acceptance probabilities for a fixed number of four tiers. Figure 4 shows the same results for one and two tiers, respectively. In other words, Figure 4 is analogous to Figure 1(b), but for one or two tiers instead of four. The results are quite similar to those presented in Figure 1(b): the power of our conditional randomization test increases as the acceptance probability decreases. Furthermore, by comparing Figures 1(b) and 4, one can see that the additional benefit of decreasing the acceptance probability increases with the number of tiers. This emphasizes the benefit of conditioning on multiple measures of balance, rather than just a single measure.
Furthermore, in Section 4 we focused on our conditional randomization test using the simple mean-difference test statistic . Figure 5 presents the unconditional and conditional performance of our conditional randomization test using the regression-adjusted test statistic . In other words, Figures 5(a) and 5(b) are the same as Figures 1 and 3, respectively, except we use instead of for our conditional randomization test. We find that the power results for our conditional randomization test using are essentially the same as those using , and thus there does not appear to be an additional benefit to using a conditional randomization distribution if a model-adjusted test statistic is used (or vice versa).
6.2 Simulation Results for Unbalanced Designs
Consider the same simulation setup as Section 4, where the potential outcomes for units are generated using the model (20). In Section 4, we considered balanced designs, where an equal number of units are assigned to treatment and control (i.e., ). Here we consider an unbalanced design, where and . Otherwise, the simulation setup discussed here is identical to the one discussed in Section 4. The results for this unbalanced design scenario are essentially identical to the results discussed in Section 4.
Figure 6 shows the power results of (1) the unconditional randomization test using ; (2) the unconditional randomization test using ; and (3) our conditional randomization test using . In other words, Figure 6 is analogous to Figure 1, except the results are for an unbalanced design where and instead of a balanced design where . The power of all three tests is slightly lower in this case than for the balanced design, but otherwise the results from Figure 6 mirror those from Figure 1: Our conditional randomization test is more powerful than the unconditional randomization test using , and the results of our conditional randomization test approach those of the unconditional randomization test using as the number of tiers increases or the acceptance probability decreases.
Meanwhile, Figure 7 shows the power results of the CEM-based randomization tests discussed in Section 4. In other words, Figure 7 is analogous to Figure 2, except the results are for the unbalanced design instead of the balanced design. For this unbalanced design scenario, we were only able to obtain results for and groups for the CEM-based tests, because there were fewer treated units in this unbalanced design, and thus fewer opportunities for CEM to find matches across many strata. This is the same issue discussed in Section 4, where the CEM-based tests discard more and more units as the number of groups (or coarsened strata) increases. This again emphasizes the benefit of conditioning on forms of covariate balance that account for continuous covariates, instead of conditioning on a coarsened version of the covariate space. Otherwise, the results from Figure 7 are identical to those from Figure 2: These CEM-based tests tend to be more powerful than the unconditional randomization test that uses but not as powerful as our conditional randomization test, and their power tends to decrease as increases.
Similar to Section 4.3, we also examined the conditional performance of these randomization tests for this unbalanced design scenario. After the potential outcomes were generated from (20) for and , we simulated 10,000 randomizations (where and ) and computed the Mahalanobis distance for each randomization. Then, we divided these randomizations into 10 groups according to the 10 quantiles of the 10,000 Mahalanobis distances. Figure 8 shows the rejection rate of each of the five randomization tests for each quantile group of the Mahalanobis distance. In other words, Figure 8 is analogous to Figure 3, except for the unbalanced design instead of the balanced design. The results are again largely the same as those presented in Section 4.3: The unconditional randomization test using and the conditional randomization test using are conditionally valid across quantile groups, while the unconditional randomization test using is not conditionally valid and its rejection rate is monotonically increasing in covariate imbalance. Meanwhile, similar to Section 4.3, the false rejection rate for the CEM-based tests also appears to be monotonically increasing in covariate imbalance according to the Mahalanobis distance, but to a much less severe degree, suggesting that these tests are approximately conditionally valid.
6.3 Simulation Results for Alternative Data-Generating Linear Models
In Section 4, the potential outcomes were generated using the linear model (20) where all the covariates had positive effects on the outcomes, were unrelated to the treatment effect, and were normally distributed. Here we consider alternative linear models for the potential outcomes and compare power results for the unconditional randomization tests using and as well as our conditional randomization test using for these alternative models. We examine the performance of the randomization tests under each of the following models:
Positive/Negative Covariate Effects
Heterogeneous Treatment Effects
where . Following Ding et al. (2016), we set to induce strong treatment effect heterogeneity.
Different Covariate Distributions
where , and .
Similar to Section 4, the parameters and take on values and across simulations for the above models.
Figure 9 shows the power results of the randomization tests when the potential outcomes were generated from the above models. Figure 9 is analogous to Figure 1, except the potential outcomes were generated from models (21), (22), or (23) instead of model (20) used in Section 4. The results are largely the same: The conditional randomization test is more powerful than the unconditional randomization test that uses the unadjusted test statistic ; furthermore, as the number of tiers increases, the conditional randomization test approaches the unconditional randomization test that uses the regression-adjusted test statistic.
Similar to Section 4.3, we also examined the conditional performance of the randomization tests when the potential outcomes were generated from the above models. After the potential outcomes were generated for and for each of the three models, we simulated 10,000 randomizations and computed the Mahalanobis distance for each randomization. Then, we divided these randomizations into 10 groups according to the 10 quantiles of the 10,000 Mahalanobis distances. Figure 10 shows the rejection rate of each randomization test for each quantile group for each of the three potential outcome models. Figure 10 is analogous to Figure 3, except the potential outcomes were generated from models (21), (22), or (23) instead of model (20). The results are again largely the same as those presented in Section 4.3: The unconditional randomization test using and the conditional randomization test using are conditionally valid across quantile groups, while the unconditional randomization test using is not conditionally valid and its rejection rate is monotonically increasing in covariate imbalance. In short, Figures 9 and 10 suggest that the results found in Section 4 hold across many data-generating processes.