1 Matching Approximates Randomized Experiments, But What Kind of Randomized Experiments?
Randomized experiments are often considered the “gold standard” of causal inference because they on average achieve balance on all covariates, both observed and unobserved, across treatment groups (Rubin, 2008a). For this reason, the simple mean difference in outcomes between treatment and control groups is an unbiased estimator of the average treatment effect in randomized experiments. However, in observational studies, pretreatment covariates often affect units’ probability of receiving treatment. As a result, covariate distributions across treatment groups can be very different, which can lead to biased estimates of treatment effects. Furthermore, treatment effect estimates will be more sensitive to model specification, and thus inferences from statistical models that adjust for covariate imbalances may still be unreliable (Rubin, 2007). Methods must be employed to address this systematic covariate imbalance.
One popular method is matching, which is a preprocessing step that produces a subset of treatment and control units for which it is reasonable to assume that units are as-if randomized to treatment or control (Ho et al., 2007). This preprocessing step is often called the design stage of an observational study, because the goal is to obtain a dataset whose treatment assignment mechanism approximates the design of a randomized experiment (Rubin, 2007, 2008b; Rosenbaum, 2010). The analysis stage of a matched dataset mimics that of a randomized experiment: e.g., the mean-difference estimator and/or statistical models can be employed after matching, and their validity holds to the extent that the matched dataset approximates what could plausibly be produced by a randomized experiment (Ho et al., 2007; Stuart, 2010; Iacus et al., 2012; Diamond & Sekhon, 2013).
The credibility of the assumption that treatment is “as-if randomized” is usually supported by demonstrations of covariate balance. Common diagnostics are tables of standardized covariate mean differences between the treatment and control groups and Love plots of these standardized mean differences (Stuart, 2010; Zubizarreta, 2012), as well as significance tests such as $t$-tests and KS-tests (Imai et al., 2008). For example, a rule-of-thumb is that the standardized covariate mean differences of a matched dataset should be below 0.1 (Normand et al., 2001; Zubizarreta, 2012; Resa & Zubizarreta, 2016). However, most authors recommend ensuring even tighter covariate balance if possible: Stuart (2010) recommends choosing the matching algorithm “that yields the smallest standardized difference of means across the largest number of covariates,” and Imai et al. (2008) recommend that “imbalance with respect to observed pretreatment covariates…should be minimized without limit where possible.”
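The balance diagnostic described above can be sketched in a few lines. This is a minimal illustration, not the computation used by any particular package: the function name `standardized_mean_differences` is hypothetical, and the pooled standard deviation (the square root of the average of the two group variances) is one common convention assumed here; other conventions exist.

```python
import numpy as np

def standardized_mean_differences(X, w):
    """Standardized covariate mean differences, treated (w == 1) minus control (w == 0).

    Assumption: each difference is scaled by sqrt((s_T^2 + s_C^2) / 2).
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w).astype(bool)
    diff = X[w].mean(axis=0) - X[~w].mean(axis=0)
    pooled_sd = np.sqrt((X[w].var(axis=0, ddof=1) + X[~w].var(axis=0, ddof=1)) / 2.0)
    return diff / pooled_sd

# Toy data (hypothetical): 200 units, 4 covariates, 100 treated at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
w = rng.permutation(np.repeat([1, 0], 100))
smd = standardized_mean_differences(X, w)

# The 0.1 rule-of-thumb as a yes/no check on the matched dataset.
passes_rule_of_thumb = bool(np.all(np.abs(smd) < 0.1))
```

The result is one number per covariate, which is what a balance table or Love plot displays.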
Thus, it is common practice to iteratively run many matching algorithms in search of the matched dataset that exhibits the best covariate balance (Dehejia & Wahba, 2002; Ho et al., 2007; Caliendo & Kopeinig, 2008; Harder et al., 2010). Furthermore, with new computational methods, matching algorithms are becoming increasingly able to provide a matched dataset with strong covariate balance. For example, Zubizarreta (2012) develops an optimization approach to matching, which allows researchers to fulfill numerous balance constraints, such as restricting the covariate mean differences between the treatment and control groups to be nearly zero across many covariates. In short, current matching methodologies algorithmically instigate strong covariate balance by design, whether it is through a systematic search or an optimization approach.
However, standard analyses for matched datasets do not condition on these types of strong covariate balance. Instead, standard analyses follow the analysis of completely randomized, blocked, or matched-pair experiments, which do not incorporate the types of covariate balance constraints fulfilled by matching algorithms. For example, a randomized experiment is unlikely to produce covariate mean differences that are nearly zero across many covariates. Is it appropriate to analyze a matched dataset as if it were from a randomized experiment, even if it is unlikely that a randomized experiment would produce such a dataset?
Our answer is no. Our view is that if researchers are algorithmically searching for matched datasets that fulfill covariate balance constraints, then they should condition on this fact during the analysis stage of the resulting matched dataset. In this paper we propose an approach for designing and analyzing observational studies that incorporates this type of conditioning. Developing this approach involves three contributions. First, we formulate an “as-if randomized” assumption that, unlike previous works, can incorporate any assignment mechanism of interest, including mechanisms that ensure covariate balance by design, as in matching algorithms. Second, we develop a valid randomization test for this as-if randomized assumption, which allows researchers to determine the assignment mechanism that their matched dataset best approximates. Thus, this test encapsulates the design stage of an observational study. Third, we provide a treatment effect estimation strategy that uses the same assignment mechanism that was determined during the design stage, thereby unifying the design and analysis stages of the observational study, similar to how they are unified in the design and analysis of randomized experiments.
The remainder of this paper is as follows. In Section 2, we formalize the as-if randomized assumption and demonstrate how a randomization test for this assumption can be used to assess if a matched dataset exhibits adequate covariate balance according to an assignment mechanism of interest. In Section 3, we outline our methodology for analyzing a matched dataset assuming the as-if randomized assumption is correct. In Section 4, we present simulation results which show how our approach can yield more precise statistical inferences for matched datasets than standard methodologies. In Section 5, we conclude with discussions about how our approach can be extended beyond matching methods.
2 The Design Stage: A Randomization Test for Covariate Balance
2.1 Setup and Notation
Consider a matched dataset containing $N$ units with covariates $\mathbf{X} \equiv (\mathbf{x}_1, \dots, \mathbf{x}_N)$ and treatment assignment $\mathbf{W} \equiv (W_1, \dots, W_N)$. Each $\mathbf{x}_i$ is a $K$-dimensional vector, and each $W_i$ takes on values 1 or 0, denoting assignment to treatment and control, respectively. Each unit has potential outcomes $(Y_i(1), Y_i(0))$, denoting unit $i$’s outcome under treatment and control. However, only one of the potential outcomes is observed for each unit. We follow the “Rubin causal model” (Holland, 1986) and assume that the covariates and potential outcomes are fixed, and thus only the assignment mechanism is stochastic. The treatment and control units in the matched dataset may or may not be explicitly paired or grouped.
We assume that the causal estimand of interest is the average treatment effect (ATE) only for these $N$ units, defined as $\tau \equiv \frac{1}{N} \sum_{i=1}^{N} [Y_i(1) - Y_i(0)]$. The ATE for the units in the matched dataset may differ from the ATE in the full, pre-matched dataset. However, the composition of the pre-matched dataset and the actual matching algorithm that was used to generate the $N$ units are irrelevant for estimating $\tau$, although they are relevant for defining it. Causal inferences about the units at hand can be generalized to some larger population to the extent that they are representative of that larger population.
Because $\mathbf{W}$ is the only stochastic element in the matched dataset, it is essential to understand its distribution. This section outlines a test to assess if the observed treatment assignment $\mathbf{W}^{obs}$ comes from a known distribution. For example, researchers may want to assess if $\mathbf{W}^{obs}$ could reasonably be an assignment under complete randomization.
Most generally, the distribution of $\mathbf{W}$ may depend on the potential outcomes $\mathbf{Y}(1)$ and $\mathbf{Y}(0)$ and the covariates $\mathbf{X}$. However, neither $\mathbf{Y}(1)$ nor $\mathbf{Y}(0)$ is ever completely observed. We employ two assumptions to constrain the distribution of $\mathbf{W}$ to depend only on observed values: Strong Ignorability and the Stable-Unit Treatment Value Assumption (SUTVA). Strong Ignorability asserts that there is a nonzero probability of each unit receiving treatment, and that, conditional on covariates, the treatment assignment is independent of the potential outcomes (Rosenbaum & Rubin, 1983). SUTVA asserts that the potential outcomes of any unit $i$ depend on the treatment assignment only through $W_i$ and not through the treatment assignment of any other unit (Rubin, 1980). Strong Ignorability and SUTVA are commonly employed in observational studies, especially in the context of propensity score analyses (Dehejia & Wahba, 2002; Sekhon, 2009; Stuart, 2010; Austin, 2011).
Strong Ignorability is an untestable assumption, because violations of strong ignorability depend on unobserved covariates being related to treatment assignment. However, researchers can conduct sensitivity analyses to assess if treatment effect estimates are sensitive to violations of Strong Ignorability; see Rosenbaum (2002) for a review of this idea. Furthermore, see Sobel (2006), Hudgens & Halloran (2008), and Tchetgen & VanderWeele (2012) for a review of methodologies that address cases when SUTVA does not hold.
2.2 A Test for Strongly Ignorable Assignment Mechanisms
We now introduce some notation to discuss assignment mechanisms that incorporate covariate balance. Let $\mathbb{W} \equiv \{0, 1\}^N$ denote the set of all possible treatment assignments, and let $P(\mathbf{W})$ denote the probability measure on $\mathbb{W}$. Given $P(\mathbf{W})$, we want to focus on assignments that exhibit the types of covariate balance we would expect from a given matching algorithm. For example, if $P(\mathbf{W})$ follows complete randomization, we may want to only consider randomizations where the standardized covariate mean differences are less than 0.1, because these are the types of assignments we focus on during the matching procedure.
To generalize this idea, let $\phi(\mathbf{W}, \mathbf{X})$ be a criterion defined to be 1 if a treatment assignment is acceptable and 0 otherwise. For example, $\phi(\mathbf{W}, \mathbf{X})$ may be defined to be 1 only if a particular assignment leads to the covariate mean differences being below a prespecified threshold. Such a criterion has been previously used for rerandomization schemes (Morgan & Rubin, 2012; Branson et al., 2016) and randomization tests (Branson & Bind, 2018; Branson & Miratrix, 2018). The following assumption asserts that the assignment mechanism for the $N$ units conditional on the covariates $\mathbf{X}$ can be characterized by the pair $P(\mathbf{W})$ and $\phi(\mathbf{W}, \mathbf{X})$.
Generalized As-If Randomization (GAR): Consider a probability measure $P(\mathbf{W})$ on the set of possible randomizations $\mathbb{W}$ and consider some criterion $\phi(\mathbf{W}, \mathbf{X})$ defined as 1 for acceptable randomizations and 0 otherwise. Then, for units $i = 1, \dots, N$, the treatment assignment for the units follows the distribution $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$.
GAR asserts that the assignment mechanism for the units in the matched dataset is fully characterized by the probability measure $P(\mathbf{W})$ and the criterion $\phi(\mathbf{W}, \mathbf{X})$. Thus, we will write GAR$(P, \phi)$ to denote a specific instance of this assumption for a particular assignment mechanism. Any strongly ignorable assignment mechanism can be characterized by GAR$(P, \phi)$ for some $P$ and $\phi$. For example, complete randomization for a number of treated units $N_T$ defines $P(\mathbf{W})$ to be uniform on $\mathbb{W}$, with $\phi(\mathbf{W}, \mathbf{X})$ defined to be 1 only if $\sum_{i=1}^{N} W_i = N_T$.
GAR can also incorporate pairs or blocks of units by defining $\phi(\mathbf{W}, \mathbf{X})$ to be 1 only if the number of treated units is fixed within each pair or block. Many matching algorithms pair or block units; e.g., optimal matching (Rosenbaum, 1989) does the former, while coarsened exact matching (Iacus et al., 2011, 2012) does the latter. There is an ongoing debate as to whether or not the pairwise or blockwise structure of such matched datasets should be ignored during the analysis stage. For example, some argue that pairwise matching is only a conduit to obtain treatment and control groups that are similar at the aggregate level, and once this is done, the matched dataset can be analyzed as if it were from a completely randomized experiment rather than a paired experiment (Ho et al., 2007; Schafer & Kang, 2008; Stuart, 2010). This distinction can be consequential; e.g., Gagnon-Bartsch & Shem-Tov (2016) discuss a case where covariate balance diagnostics yield different conclusions depending on whether a matched dataset is viewed as a blocked randomized experiment or as a completely randomized experiment.
GAR can be used to address these kinds of debates. For example, say GAR$(P_1, \phi_1)$ and GAR$(P_2, \phi_2)$ characterize the assignment mechanisms for complete randomization and block randomization, respectively. Then, researchers can assess whether GAR$(P_1, \phi_1)$ or GAR$(P_2, \phi_2)$ is a more plausible assignment mechanism for a given matched dataset. This allows researchers to determine if they should view the matched dataset as a completely randomized or block randomized experiment. We demonstrate this idea in Section 4 and show how this distinction affects the analysis stage of an observational study. Now we provide a procedure for testing GAR.
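$\alpha$-level Randomization Test for GAR: Specify a probability measure $P(\mathbf{W})$ on $\mathbb{W}$ and some binary criterion $\phi(\mathbf{W}, \mathbf{X})$. Define a test statistic $t(\mathbf{W}, \mathbf{X})$. Generate random draws $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)} \sim P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$. Compute the following (approximate) randomization-based two-sided $p$-value:
$$p = \frac{1 + \sum_{m=1}^{M} \mathbb{I}\left( |t(\mathbf{w}^{(m)}, \mathbf{X})| \geq |t^{obs}| \right)}{M + 1}, \quad \text{where } t^{obs} \equiv t(\mathbf{W}^{obs}, \mathbf{X}). \tag{1}$$
The random draws can be generated via rejection-sampling by generating $\mathbf{w} \sim P(\mathbf{W})$ and only accepting it if $\phi(\mathbf{w}, \mathbf{X}) = 1$. If $\phi(\mathbf{W}, \mathbf{X})$ is particularly stringent, it can be computationally intensive to generate these random draws via rejection-sampling. For example, there may only be a small subset of randomizations such that the standardized covariate mean differences are below a threshold. Branson & Bind (2018) proposed an importance-sampling approach as an alternative to rejection-sampling for approximating $p$-values for conditional randomization tests; this approach can be used for the above randomization test if rejection-sampling is computationally intensive.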
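The test and its rejection-sampling step can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the names `max_abs_smd`, `draw_acceptable`, and `gar_randomization_test` are hypothetical, the probability measure is taken to be complete randomization with a fixed number of treated units (so draws are permutations of the observed assignment), and the $p$-value uses the add-one Monte Carlo convention.

```python
import numpy as np

def max_abs_smd(w, X):
    """Test statistic: largest absolute standardized covariate mean difference."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w).astype(bool)
    diff = X[w].mean(axis=0) - X[~w].mean(axis=0)
    sd = np.sqrt((X[w].var(axis=0, ddof=1) + X[~w].var(axis=0, ddof=1)) / 2.0)
    return float(np.max(np.abs(diff / sd)))

def draw_acceptable(w_obs, X, phi, rng, max_tries=100_000):
    """Rejection sampling: permute w_obs until the criterion phi(w, X) accepts."""
    for _ in range(max_tries):
        w = rng.permutation(w_obs)
        if phi(w, X):
            return w
    raise RuntimeError("criterion too stringent for rejection sampling")

def gar_randomization_test(w_obs, X, stat, phi, n_draws=1000, seed=0):
    """Approximate two-sided p-value for GAR with a covariate-only statistic."""
    rng = np.random.default_rng(seed)
    t_obs = stat(w_obs, X)
    t_draws = np.array(
        [stat(draw_acceptable(w_obs, X, phi, rng), X) for _ in range(n_draws)]
    )
    # Assumption: add-one Monte Carlo p-value.
    return float((1 + np.sum(np.abs(t_draws) >= abs(t_obs))) / (n_draws + 1))

# Toy example: the first 20 of 40 units are treated and the single covariate is
# increasing in the unit index, so the observed assignment is badly imbalanced.
X = np.arange(40, dtype=float).reshape(-1, 1)
w_obs = np.repeat([1, 0], 20)
complete_randomization = lambda w, X: True  # phi accepts every permutation
p = gar_randomization_test(w_obs, X, max_abs_smd, complete_randomization, n_draws=200)
```

In this toy example essentially no permutation is as imbalanced as the observed assignment, so the $p$-value sits at its smallest attainable value and complete randomization would be rejected. Swapping in a stricter `phi`, such as one that requires `max_abs_smd(w, X)` to fall below a threshold, turns the same function into a test of a constrained assignment mechanism.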
Importantly, the test statistic is not a function of the outcomes. This reflects the idea that the design of an experiment or observational study should not involve the outcomes in order to prevent researchers from biasing results (Rubin, 2007, 2008b). In general, the test statistic should be some measure of covariate balance, which allows researchers to test if the observed covariate balance is similar to the covariate balance one would expect from a given assignment mechanism. Some examples of covariate balance measures that are commonly used in observational studies are the Mahalanobis distance (Mahalanobis, 1936; Rosenbaum & Rubin, 1985; Gu & Rosenbaum, 1993; Diamond & Sekhon, 2013), standardized covariate mean differences (Stuart, 2010; Zubizarreta, 2012), and significance tests (Imai et al., 2008; Cattaneo et al., 2015). See Rosenbaum (2002) and Imbens & Rubin (2015) for comprehensive discussions of choices of test statistics for randomization tests.
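As one concrete choice of balance measure, the Mahalanobis distance between the treated and control covariate means can be computed as below. This is a sketch: the function name `mahalanobis_balance` is hypothetical, and the scaling by $N_T N_C / N$ with the full-sample covariance matrix follows the form used in the rerandomization literature, which is assumed here rather than taken from the text.

```python
import numpy as np

def mahalanobis_balance(w, X):
    """Mahalanobis distance between treated and control covariate means.

    Assumption: M = (N_T * N_C / N) * d' S^{-1} d, where d is the difference
    in covariate means and S is the sample covariance of all units.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w).astype(bool)
    n_t, n_c = int(w.sum()), int((~w).sum())
    d = X[w].mean(axis=0) - X[~w].mean(axis=0)
    S = np.atleast_2d(np.cov(X, rowvar=False))
    return float(n_t * n_c / (n_t + n_c) * d @ np.linalg.solve(S, d))
```

Unlike a vector of standardized mean differences, this yields a single scalar that is unchanged by invertible affine transformations of the covariates, which makes it convenient as the statistic in a randomization test. It depends only on the assignment and the covariates, never on the outcomes.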
Our test for GAR is similar to other tests for assessing if treatment and control covariate distributions are equal. For example, Hansen & Bowers (2008) and Hansen (2008) proposed a permutation test using the Mahalanobis distance as a test statistic, and Gagnon-Bartsch & Shem-Tov (2016) proposed a permutation test using machine-learning methods to construct a test statistic. These permutation tests are a special case of our randomization test, where random draws from $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ correspond to random permutations of $\mathbf{W}^{obs}$. Another test is the Cross-Match test (Rosenbaum, 2005), which focuses on the pairwise nature of matched datasets. Related ideas are also in recent works on regression discontinuity designs. For example, Cattaneo et al. (2015), Li et al. (2015), and Mattei & Mealli (2016) assume that the assignment mechanism of units within a window around the cutoff in a regression discontinuity design follows independent Bernoulli trials. Cattaneo et al. (2015) and Mattei & Mealli (2016) use permutation tests to test this assumption, while Li et al. (2015) use a Bayesian hierarchical model.
However, our randomization test for GAR differs from previous works in two key ways. One, these previous works rely on assignment mechanisms characterized by permutations either across units or within blocks or pairs of units. In contrast, our randomization test can incorporate any assignment mechanism of interest, including mechanisms that ensure covariate balance constraints are fulfilled, as in matching algorithms. We demonstrate the advantage of this flexibility in Section 4. Two, unlike common rule-of-thumb diagnostics, our randomization test is a valid test specifically for the assignment mechanism of interest and the units in the matched dataset. The following theorem establishes the validity of our randomization test; its proof is in the Appendix.
Theorem 2.1.
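Define a probability measure $P(\mathbf{W})$ on $\mathbb{W}$ and a binary criterion $\phi(\mathbf{W}, \mathbf{X})$ such that the number of assignments $\mathbf{w} \in \mathbb{W}$ where $\phi(\mathbf{w}, \mathbf{X}) = 1$ is greater than or equal to . Let $H_0$ be the hypothesis that $\mathbf{W} \sim P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$, i.e., that GAR holds. Then, the $\alpha$-level Randomization Test for GAR is a valid test, in the sense that
$$P(p \leq \alpha \mid H_0) \leq \alpha, \tag{2}$$
where $p$ is the $p$-value from the $\alpha$-level Randomization Test for GAR defined in (1).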
3 The Analysis Stage: After Assuming GAR
Once GAR is assumed for a particular matched dataset, analyses of causal effects become relatively straightforward, to the extent that they are straightforward for a randomized experiment that uses the assignment mechanism characterized by GAR. There are randomization-based, Neymanian, and Bayesian modes of inference for analyzing such randomized experiments.
Randomization-based inference focuses on testing the Sharp Null Hypothesis, which states that $Y_i(1) = Y_i(0)$ for all $i = 1, \dots, N$. Under the Sharp Null, the potential outcomes for any treatment assignment are known; thus, any test statistic can be computed for any assignment $\mathbf{w} \in \mathbb{W}$ under the Sharp Null. Researchers can also invert sharp hypotheses to obtain point estimates and uncertainty intervals. One possible sharp hypothesis is that the treatment effect is additive, i.e., that $Y_i(1) = Y_i(0) + \tau$ for some $\tau \in \mathbb{R}$ for all $i = 1, \dots, N$. A natural uncertainty interval is the set of $\tau$ such that one fails to reject this sharp hypothesis. Likewise, a natural point estimate is the $\tau$ that yields the highest $p$-value for this sharp hypothesis (Hodges Jr & Lehmann, 1963; Rosenbaum, 2002). See Rosenbaum (2002) and Imbens & Rubin (2015) for a general discussion of randomization-based inference, and see Branson & Bind (2018) for a discussion of randomization-based inference conditional on criteria of the form $\phi(\mathbf{W}, \mathbf{X})$.
The Neymanian mode of inference involves normal approximations for treatment effect estimators and conservative estimators for their variances, which depend on the particular assignment mechanism $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$. There are well-known Neymanian treatment effect estimators and corresponding variance estimators for many experimental designs. See Imbens & Rubin (2015) for a comprehensive review of the Neymanian mode of inference for randomized experiments and the particulars of many experimental designs. For example, Miratrix et al. (2013) discuss the Neymanian mode of inference for blocked experiments, while Imai (2008) does the same for matched-pair experiments. See Pashley & Miratrix (2017) for a discussion of the differences in variance estimation between these two designs as well as results for variance estimation for hybrid designs that involve blocks and pairs. Yet other experimental designs for which there are well-known Neymanian modes of inference are factorial designs (Dasgupta et al., 2015) and rerandomized experiments (Li et al., 2016). See Ding (2017) for a comparison of randomization-based and Neymanian modes of inference for blocked, matched-pair, and factorial designs.
Finally, the Bayesian mode of inference for estimating causal effects in randomized experiments was first formalized in Rubin (1978). Under the Bayesian paradigm, the three quantities that make up the data (covariates, treatment assignment, and outcomes) are treated as unknown, and models for these quantities must be specified. Under GAR, the distribution of $\mathbf{W}$ is known, and thus the remaining work for a Bayesian analysis is to specify statistical models for the covariates and outcomes. The Bayesian mode of inference can be particularly useful for incorporating uncertainty in many complex data scenarios, such as noncompliance (Frangakis et al., 2002), missing data (Rubin, 1996), their combination (Barnard et al., 2002), and multisite trials (Dehejia, 2003). See Imbens (2004) and Heckman et al. (2014) for general discussions of Bayesian approaches for estimating the average treatment effect in randomized experiments.
In what follows, we focus on the randomization-based mode of inference for testing GAR and analyzing a matched dataset under GAR.
4 Simulations
4.1 Simulation Setup
We follow the simulation setup of Austin (2009) and Resa & Zubizarreta (2016), which was used to evaluate different matching methodologies. Consider a dataset with 250 treated units and a pool of control units. Each unit has eight covariates, four of which are Normally distributed and four of which are Bernoulli distributed. These eight covariates are generated such that the true standardized difference in means between the treatment and control groups is 0.2 for half of the Normal and Bernoulli covariates and 0.5 for the other half of the covariates. Furthermore, there are three outcomes. The first outcome is linearly related with the covariates, and the other two outcomes are nonlinearly related with the covariates, with the third outcome being a more complex function of the covariates than the second outcome. For each outcome, there is an additive treatment effect of one. See Resa & Zubizarreta (2016) for details about how these covariates and outcomes are generated.
We repeated this data-generating process 1,000 times. As we will see in the next section, there are severe covariate imbalances between treatment and control for these datasets. Treatment effect estimators will be confounded by these imbalances, which motivates matching methods that address covariate imbalance.
4.2 Illustration: Testing GAR After Propensity Score Matching
To illustrate our randomization test for GAR, first we will consider one of the most basic forms of matching: optimal matching using the propensity score. We implemented this procedure using the MatchIt R package (Stuart et al., 2011).
Let us focus on the first of the 1,000 datasets in our simulation. Figure 0(a) shows the absolute standardized mean differences for each covariate before and after propensity score matching. Clearly there are severe imbalances before matching. However, are the covariates adequately balanced after matching?
According to the 0.1 rule-of-thumb, the answer is no. However, what magnitudes of absolute standardized covariate mean differences would we expect from a completely randomized experiment? To answer this question, we can test for GAR under complete randomization. To do this, we permuted $\mathbf{W}^{obs}$ 1,000 times and computed the absolute standardized covariate mean differences for each permutation. Thus, the standardized covariate mean differences act as the test statistics for our randomization test for GAR.
Figure 0(a) shows the 95% quantiles of each absolute standardized covariate mean difference across these permutations, which we call the complete-randomization quantiles. Each point along these quantiles corresponds to a 0.05-level test for GAR under complete randomization using each standardized covariate mean difference as a test statistic. According to the complete-randomization quantiles, propensity score matching for this dataset somewhat adequately balanced the covariates, but there are still some significantly large standardized covariate mean differences. Notably, the complete-randomization quantiles are substantially larger than the 0.1 rule-of-thumb. This suggests that this rule-of-thumb is anticonservative given the distribution of the covariates in the matched dataset, assuming that the goal is to approximate a completely randomized experiment; however, given another dataset, this rule-of-thumb may well have been conservative instead of anticonservative. In contrast to these kinds of rule-of-thumb diagnostics, our test for GAR accounts for the actual units and covariates we have at hand when attempting to approximate an experimental design for those units.
Using our test for GAR, we conclude that propensity score matching does not balance the covariates adequately enough to assume complete randomization. Now we will turn to more advanced matching methods. We will see that not only will we be able to achieve adequate covariate balance to assume complete randomization, but also we will be able to achieve such strong covariate balance that we can assume an assignment mechanism that is more restrictive than complete randomization. This will lead to more precise statistical inferences than standard approaches for analyzing matched datasets.
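The complete-randomization quantiles can be sketched as follows. This is an illustration under stated assumptions, not the code behind the figure: the names `abs_smd` and `complete_randomization_quantiles` are hypothetical, the covariates are simulated stand-ins, and the 250 treated and 250 control units are an assumed matched-sample size.

```python
import numpy as np

def abs_smd(w, X):
    """Absolute standardized mean difference for each covariate."""
    w = np.asarray(w).astype(bool)
    diff = X[w].mean(axis=0) - X[~w].mean(axis=0)
    sd = np.sqrt((X[w].var(axis=0, ddof=1) + X[~w].var(axis=0, ddof=1)) / 2.0)
    return np.abs(diff / sd)

def complete_randomization_quantiles(w_obs, X, n_perm=1000, q=0.95, seed=0):
    """Per-covariate q-quantile of |SMD| across random permutations of w_obs."""
    rng = np.random.default_rng(seed)
    draws = np.array([abs_smd(rng.permutation(w_obs), X) for _ in range(n_perm)])
    return np.quantile(draws, q, axis=0)

# Hypothetical matched dataset: 250 treated and 250 control units, 4 covariates.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
w_obs = np.repeat([1, 0], 250)
quantiles = complete_randomization_quantiles(w_obs, X)

# A covariate whose observed |SMD| exceeds its quantile is "surprisingly large"
# relative to what complete randomization would produce for these units.
flagged = abs_smd(w_obs, X) > quantiles
```

Because the quantiles are computed from the units at hand, they change with the sample size and the covariate distributions, which is why a fixed cutoff such as 0.1 can be either too strict or too lenient for a given dataset.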
4.3 Standard Statistical Inference for Matched Datasets with Strong Covariate Balance is Unnecessarily Conservative
Now we focus on matching methodologies that instigate strong covariate balance between the treatment and control groups. In their simulation study, Resa & Zubizarreta (2016) compared nearest-neighbor matching using the propensity score, optimal subset matching (Rosenbaum, 2012), and cardinality matching (Zubizarreta et al., 2014). Cardinality matching finds the largest dataset within an observational study such that prespecified covariate balance constraints are achieved. Cardinality matching is similar to optimal subset matching in that it may discard some treated units in the name of achieving better covariate balance; however, it differs from optimal subset matching in that it ensures group-level covariate balance directly, rather than performing pairwise matching as a conduit to achieve group-level balance. Resa & Zubizarreta (2016) found that cardinality matching performs better than nearest-neighbor matching and optimal subset matching in terms of covariate balance, sample size, bias, and RMSE. Thus, we focus on cardinality matching, and defer to Resa & Zubizarreta (2016) and other simulation studies (e.g., Austin (2009) and Austin (2014)) for a comparison of other methods.
Cardinality matching can incorporate many types of covariate balance constraints, such as mean balance (Zubizarreta, 2012), fine balance (Rosenbaum et al., 2007), and strength balance (Hsu et al., 2015). We focus on constraining the absolute standardized covariate mean differences to be less than some threshold. Similar to Resa & Zubizarreta (2016), we consider three threshold values, ranging from loose to stringent. A more stringent threshold results in a matched dataset that exhibits better covariate balance but also a smaller sample size.
Similar to the previous section, as an illustration we first implement cardinality matching for a single dataset. Analogous to Figure 0(a), Figure 0(b) shows the absolute standardized covariate mean differences before and after cardinality matching for the various thresholds. However, unlike Figure 0(a), Figure 0(b) shows both the 5% and 95% complete-randomization quantiles of the standardized covariate mean differences. The way to interpret these quantiles is that any standardized covariate mean difference to the right of the rightmost quantile is surprisingly large, and any to the left of the leftmost quantile is surprisingly small. We can see that the standardized covariate mean differences produced by cardinality matching under the more stringent thresholds are surprisingly small if we assume units are completely randomized to treatment and control.
Having covariate balance of this nature is promising in terms of bias; if covariate imbalances are minimal, so should be bias of treatment effect estimates. However, as we originally suggested in Section 1, ideally the uncertainty intervals for treatment effect estimators condition on the fact that we ensured strong covariate balance from the outset by design. If instead uncertainty intervals only assume that units are completely randomized, then they will not take advantage of this strong covariate balance and will be unnecessarily conservative. We now demonstrate this idea by examining the inferential properties of cardinality matching across the 1,000 simulated datasets.
The purpose of cardinality matching is to provide the largest matched dataset with a given level of covariate balance. So, first we will discuss the sample size and covariate balance achieved by cardinality matching. Sample sizes varied across the 1,000 matched datasets from cardinality matching: under the loosest threshold, sample sizes ranged from 442 to 500; under the intermediate threshold, they ranged from 392 to 494; and under the most stringent threshold, they ranged from 388 to 490. When the sample size was 500, each of the 250 treated units was matched to a control unit; when the sample size was less than 500, some treated units were discarded in the name of achieving stronger covariate balance. As discussed in Resa & Zubizarreta (2016), discarding treated units changes the causal estimand. However, this is not problematic when the treatment effect is homogeneous (as is the case in this simulation setting).
Furthermore, finding a matched dataset that fulfills strong covariate balance criteria may either be computationally intensive or even impossible. Thus, similar to Resa & Zubizarreta (2016), when implementing cardinality matching using the designmatch R package (Zubizarreta & Kilcioglu, 2016), we relaxed the covariate balance constraints to be fulfilled approximately rather than strictly. This ensured that the matching problem could be solved in polynomial time. Under the loosest threshold, the actual standardized covariate mean differences ranged from 0.109 to 0.131 across the 1,000 matched datasets; under the intermediate threshold, they ranged from 0.022 to 0.037; and under the most stringent threshold, they ranged from 0.016 to 0.031.
Now we discuss the inferential properties of cardinality matching. Table 1 shows the bias, variance, RMSE, and coverage for the mean-difference estimator across the 1,000 matched datasets. The confidence intervals were computed as $\hat{\tau} \pm 1.96 \sqrt{s_T^2 / N_T + s_C^2 / N_C}$, where $\hat{\tau}$ is the mean-difference estimator, $s_T^2$ and $s_C^2$ are the sample variances in treatment and control, respectively, and $N_T$ and $N_C$ are the numbers of treated and control units. Under the loosest threshold, there is substantial bias from covariate imbalances, which in turn results in confidence intervals substantially undercovering. However, under the two more stringent thresholds, the bias is negligible, and the confidence intervals tend to overcover. Furthermore, the overcoverage is most prominent for the first outcome, followed by the second, and finally by the third. This is, unsurprisingly, ordered by how “well-specified” cardinality matching was, which only attempted to balance the raw covariates (which define the first outcome) and not the nonlinear functions of the covariates that define the second and third outcomes (as noted in Section 4.1). In particular, these standard confidence intervals overcover because they do not condition on the strong level of covariate balance achieved by matching.
Table 1: Bias, variance, RMSE, and coverage of the mean-difference estimator across the 1,000 matched datasets. Within each outcome, rows are ordered from the loosest to the most stringent balance threshold.
Outcome/Dataset  Bias  Variance  RMSE  Coverage
First Outcome
Loosest threshold  0.95  0.05  0.97  67.3%
Intermediate threshold  0.11  0.04  0.22  100%
Most stringent threshold  0.03  0.04  0.19  100%
Second Outcome
Loosest threshold  2.17  0.61  2.30  55.4%
Intermediate threshold  0.55  0.55  0.92  99.1%
Most stringent threshold  0.38  0.54  0.83  99.7%
Third Outcome
Loosest threshold  0.81  1.60  1.50  91.5%
Intermediate threshold  0.09  1.41  1.19  97.8%
Most stringent threshold  0.01  1.40  1.18  98.2%
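The standard interval evaluated in Table 1 combines the mean-difference estimator with the treated and control sample variances. A minimal sketch, assuming a 95% normal critical value of 1.96 and using the hypothetical function name `standard_ci`:

```python
import numpy as np

def standard_ci(y, w, z=1.96):
    """Mean-difference estimate and its normal-approximation confidence interval.

    Assumption: standard error sqrt(s_T^2 / N_T + s_C^2 / N_C), with sample
    variances computed using the n - 1 denominator.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w).astype(bool)
    tau_hat = y[w].mean() - y[~w].mean()
    se = np.sqrt(y[w].var(ddof=1) / w.sum() + y[~w].var(ddof=1) / (~w).sum())
    return tau_hat, (tau_hat - z * se, tau_hat + z * se)

# Tiny worked example: treated outcomes (3, 5), control outcomes (1, 3).
tau_hat, (lower, upper) = standard_ci([3.0, 5.0, 1.0, 3.0], [1, 1, 0, 0])
```

This standard error uses only the outcome variances within each group. It has no term that reflects how tightly the covariates were balanced by design, which is the source of the overcoverage discussed above.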
4.4 More Precise Statistical Inferences for Matched Datasets with Strong Covariate Balance
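Now we demonstrate how we can use the approach discussed in Section 3 to conduct statistical inferences for matched datasets that are less conservative than the standard approach presented in the previous section. For each matched dataset in our simulation, consider testing the sharp hypothesis $H_0^{\tau}: Y_i(1) = Y_i(0) + \tau$ for all $i = 1, \dots, N$. Define the 95% randomization-based confidence interval as the set of $\tau$ such that we fail to reject $H_0^{\tau}$ at the $\alpha = 0.05$ level, and define the randomization-based point estimate as the $\tau$ that yields the highest $p$-value for testing $H_0^{\tau}$. We will consider a grid of $\tau$ values when testing $H_0^{\tau}$, which more than encompasses the range of confidence intervals produced in the previous section.
We will test $H_0^{\tau}$ using a randomization test, which follows this five-step procedure:
1. Specify an assignment mechanism $P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$ and a test statistic $t$.
2. Generate random draws $\mathbf{w}^{(1)}, \dots, \mathbf{w}^{(M)} \sim P(\mathbf{W} \mid \phi(\mathbf{W}, \mathbf{X}) = 1)$.
3. Define the potential outcomes assuming $H_0^{\tau}$ is true: $Y_i(1) = Y_i^{obs} + (1 - W_i^{obs}) \tau$ and $Y_i(0) = Y_i^{obs} - W_i^{obs} \tau$.
4. Compute the (approximate) randomization-based two-sided $p$-value $p$ by comparing the observed test statistic with the test statistics computed under the $M$ draws.
5. Reject $H_0^{\tau}$ if $p \leq \alpha$.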
We use the mean-difference estimator as the test statistic, and we consider two types of assignment mechanisms: (1) complete randomization, where random draws from correspond to permutations of ; and (2) constrained randomization, where random draws from correspond to permutations of for which all of the absolute standardized covariate mean differences are less than some threshold .
The above procedure is a valid test for if units are assigned using the corresponding , following well-known results about randomization tests for randomized experiments (e.g., Imbens & Rubin 2015, Chapter 5). Thus, the validity of this procedure for a given dataset holds to the extent that GAR holds for that dataset.
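The five-step procedure with both assignment mechanisms can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the sharp hypothesis is taken to be a constant additive effect `tau0`, standardized differences use a pooled standard deviation recomputed for each draw, constrained randomization is drawn by rejection sampling of permutations, and the two-sided p-value compares statistics centered at `tau0`. All function and argument names are hypothetical.

```python
import numpy as np

def mean_diff(y, w):
    """Mean-difference test statistic: treated mean minus control mean."""
    return y[w == 1].mean() - y[w == 0].mean()

def max_abs_smd(x, w):
    """Largest absolute standardized covariate mean difference.

    Pooled-SD scaling is an assumption of this sketch.
    """
    x_t, x_c = x[w == 1], x[w == 0]
    pooled_sd = np.sqrt((x_t.var(axis=0, ddof=1) + x_c.var(axis=0, ddof=1)) / 2)
    return np.max(np.abs(x_t.mean(axis=0) - x_c.mean(axis=0)) / pooled_sd)

def randomization_p_value(y, w, x=None, tau0=0.0, threshold=None,
                          n_draws=1000, max_tries=200_000, seed=None):
    """Approximate two-sided p-value for the sharp null of a constant effect tau0.

    threshold=None gives complete randomization (all permutations of w);
    otherwise only permutations whose max |SMD| is below threshold are kept.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=int)
    if threshold is not None:
        x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    y_control = y - tau0 * w              # control outcomes implied by the null
    t_obs = mean_diff(y, w)
    draws, tries = [], 0
    while len(draws) < n_draws:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("threshold too strict for rejection sampling")
        w_star = rng.permutation(w)
        if threshold is not None and max_abs_smd(x, w_star) >= threshold:
            continue                      # draw violates the balance constraint
        y_star = y_control + tau0 * w_star  # outcomes imputed under the null
        draws.append(mean_diff(y_star, w_star))
    draws = np.asarray(draws)
    # Small tolerance so ties with the observed statistic count as "as extreme".
    return float(np.mean(np.abs(draws - tau0) >= np.abs(t_obs - tau0) - 1e-12))
```

Rejection sampling is simple but can be slow when the threshold is tight; the `max_tries` guard stops the loop rather than letting it run indefinitely.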
In particular, GAR under constrained randomization plausibly holds for the matched datasets from cardinality matching, because cardinality matching constrains the absolute standardized covariate mean differences to be less than some threshold by construction. Thus, another way to interpret cardinality matching with covariate balance constraints is that it yields the largest dataset such that a variant of GAR that incorporates these covariate balance constraints plausibly holds. Making this connection can result in more precise and valid statistical inferences for such matched datasets.
To demonstrate, we will assume GAR under constrained randomization for the matched datasets with . Recall that the actual standardized covariate mean differences across the 1,000 matched datasets ranged from 0.016 to 0.031, so constrained randomization with thresholds or is a reasonable assumption for these datasets. Table 2 compares the performance of the randomization-based confidence intervals under complete randomization and constrained randomization to that of the standard confidence interval. The randomization-based confidence intervals under complete randomization are nearly identical to the standard confidence intervals, which emphasizes the point that standard practice is to analyze matched datasets as if they were from completely randomized experiments. Meanwhile, the randomization-based confidence intervals under constrained randomization are narrower than the standard confidence intervals, and they narrow further as the randomization distribution becomes more constrained.
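| CI Method | First Outcome: Average Width | First Outcome: Coverage | Second Outcome: Average Width | Second Outcome: Coverage | Third Outcome: Average Width | Third Outcome: Coverage |
|---|---|---|---|---|---|---|
| Standard | 2.14 | 100% | 4.55 | 99.7% | 5.09 | 98.2% |
| Complete Rand. | 2.13 | 100% | 4.55 | 99.6% | 5.02 | 97.9% |
| Constrained Rand. () | 1.36 | 99.9% | 3.41 | 96.3% | 4.61 | 96.8% |
| Constrained Rand. () | 0.96 | 98.2% | 2.93 | 91.2% | 4.46 | 96.0% |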
Let us focus on the results in Table 2 for the randomization-based confidence intervals under constrained randomization with threshold . The coverage for these confidence intervals is closer to the nominal level than the standard approach, and they are substantially narrower: on average, they are 45% of the width for the first outcome, 64% of the width for the second outcome, and 88% of the width for the third outcome. Thus, our randomization-based approach can result in more precise inference, especially if matching successfully balances on functions of the covariates that are related to outcomes of interest.
However, our approach is not a panacea: In Table 2, we can see that our randomization-based confidence intervals using constrained randomization with resulted in undercoverage for the second outcome. This is likely because of the non-negligible bias of the treatment effect estimator, as seen in Table 1. Thus, in some sense it was lucky that the standard confidence interval achieved overcoverage for the second outcome despite this bias. Furthermore, we can see that our confidence intervals exhibited overcoverage for the first outcome. This is likely due to the fact that cardinality matching was actually able to achieve absolute standardized covariate mean differences that were less than , and thus one could still consider these confidence intervals conservative, but not as conservative as the standard confidence intervals.
In short, many matching algorithms naturally suggest variants of GAR that are reasonable to assume for certain matched datasets. By leveraging a variant of GAR that accounts for the types of covariate balance that modern matching algorithms instigate by design, our randomization-based approach can provide causal inferences that are more precise than standard approaches. This additional precision is especially substantial if researchers are able to match on functions of the covariates that are relevant to outcomes of interest.
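As a check on these percentages, the ratios can be recomputed from Table 2 by dividing the numeric entries in the last constrained-randomization row by those in the standard row. This reads the numeric columns of Table 2 as average interval widths, which is how the surrounding text describes them.

```latex
\[
\frac{0.96}{2.14} \approx 0.45, \qquad
\frac{2.93}{4.55} \approx 0.64, \qquad
\frac{4.46}{5.09} \approx 0.88 .
\]
```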
5 Discussion and Conclusion
Covariate imbalance is one of the principal problems of causal inferences in observational studies. To tackle this problem, matching algorithms can produce datasets that exhibit strong covariate balance. However, few have discussed how to best analyze these matched datasets. In particular, causal inferences for matched datasets can be unnecessarily conservative if they do not account for the strong covariate balance achieved in the dataset.
We developed a causal inference framework that explicitly addresses covariate balance in both the design and analysis stages of an observational study. In the design stage, we proposed a randomization test for assessing if the covariates in a matched dataset are adequately balanced according to what we call Generalized As-If Randomization (GAR), which encapsulates any assignment mechanism of interest. We proved that this randomization test, unlike common rule-of-thumb balance diagnostics, is valid for assessing if the treatment assignment in a matched dataset follows a particular assignment mechanism.
Once researchers determine a plausible assignment mechanism for a matched dataset, the resulting analysis for obtaining point estimates and confidence intervals is relatively straightforward. Through simulation, we found that our approach utilizing GAR yields more precise statistical inferences than standard approaches by conditioning on the covariate balance in a given matched dataset.
Our framework more comprehensively unifies the design and analysis stages of an observational study, which opens many avenues for future research. These avenues broadly fall into three topics: model adjustment, multiple experimental designs, and quasi-experiments.
We focused on the mean-difference estimator for estimating the average treatment effect in an observational study after matching. However, matching in combination with model adjustment can be quite powerful (Rubin & Thomas, 2000). Indeed, regression and other statistical models can be used as test statistics in our randomization test presented in Section 3. Standard analyses for matched datasets that utilize statistical models still use uncertainty intervals that are meant for completely randomized experimental designs. Thus, if the randomization distribution of a model-based estimator under GAR for a particular assignment mechanism is more constrained than its distribution under complete randomization, our methodology will yield more precise causal inferences. Furthermore, possible advantages and complications with using model-adjusted estimators in randomized experiments, as discussed in Freedman (2008), Imbens & Wooldridge (2009), and Lin (2013), also hold for their use in observational studies to the extent that GAR holds for the dataset to be analyzed. A future line of research would be to compare different combinations of estimators and variations of GAR to assess which combinations yield the most precise statistical inferences for a given matched dataset.
Furthermore, one could also consider different variations of GAR even for the same estimator. It is common practice to apply many matching algorithms and settle on the algorithm that yields the best covariate balance. However, one matching algorithm may yield a dataset that is adequately balanced according to one version of GAR and another algorithm will yield a dataset that is adequately balanced according to a different version of GAR. One example of this idea is presented in Bind & Rubin (2017), who considered six different types of hypothetical experiments that could be considered for an observational study. We are currently working on a methodology that combines the analyses of multiple matched datasets that approximate different variations of GAR.
Finally, our approach can be applied to settings beyond matching for observational studies. For example, recent works in the regression discontinuity design literature utilize the assumption that units near the discontinuity are as-if randomized to treatment and control (Li et al., 2015; Cattaneo et al., 2015; Mattei & Mealli, 2016). These assumptions can be considered special cases of GAR. We are currently working on extending our framework to regression discontinuity designs and other quasi-experiments that utilize as-if randomized assumptions.
6 Appendix: Proof of Theorem 2.1
Let be the null hypothesis that , i.e., that GAR holds. The true p-value for the level randomization test for GAR is
(3) 
where and . In other words, , where . As stated in Theorem 2.1, we require that because otherwise there is zero probability that .
Our randomization test uses the unbiased estimator defined in (1) instead of the true p-value defined in (3). A well-known result about randomization tests is that if validity holds for the test that rejects when , then the test that rejects when is also valid, where is unbiased for (e.g., Edgington & Onghena 2007, Chapter 3). Thus, we will prove the validity of the test that rejects when , and the validity of our test immediately follows.
Let be a random variable whose distribution is that of where , and let be the CDF of . Under , , and thus . Because under , under . By the probability integral transform, is only nearly uniformly distributed due to the discreteness of . Nonetheless, is stochastically dominated by the uniform distribution, and thus
(4)
where , which concludes the proof of Theorem 2.1.
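The role of discreteness in this argument can be illustrated with a generic toy example, which is not the test statistic used in Theorem 2.1. For a statistic with finite support, the exact p-value p(t) = P(T ≥ t) takes only finitely many values, so the probability of rejecting at level α equals α only when α is an attainable p-value and is smaller otherwise. The names `exact_p_values` and `rejection_probability` are hypothetical, and the six-sided die is purely an assumption for illustration.

```python
from fractions import Fraction

def exact_p_values(pmf):
    """Map each support point t to p(t) = P(T >= t) for a finite discrete T."""
    return {t: sum(q for s, q in pmf.items() if s >= t) for t in pmf}

def rejection_probability(pmf, alpha):
    """P(p(T) <= alpha): chance of rejecting when T really follows pmf."""
    p_of = exact_p_values(pmf)
    return sum(q for t, q in pmf.items() if p_of[t] <= alpha)

# Toy null distribution: a fair six-sided die (exact arithmetic via Fraction).
die = {t: Fraction(1, 6) for t in range(1, 7)}
```

With this toy distribution, the rejection probability never exceeds α for any level, which is the conservative behavior that validity requires.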
References
 Austin (2009) Austin, P. C. (2009). Some methods of propensity-score matching had superior performance to others: results of an empirical investigation and Monte Carlo simulations. Biometrical journal, 51(1), 171–184.
 Austin (2011) Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate behavioral research, 46(3), 399–424.
 Austin (2014) Austin, P. C. (2014). A comparison of 12 algorithms for matching on the propensity score. Statistics in medicine, 33(6), 1057–1069.

 Barnard et al. (2002) Barnard, J., Frangakis, C., Hill, J., & Rubin, D. B. (2002). School choice in NY City: A Bayesian analysis of an imperfect randomized experiment. In Case studies in Bayesian statistics, (pp. 3–97). Springer.
 Bind & Rubin (2017) Bind, M.-A. C., & Rubin, D. B. (2017). Bridging observational studies and randomized experiments by embedding the former in the latter. Statistical methods in medical research, (p. 0962280217740609).
 Branson & Bind (2018) Branson, Z., & Bind, M.-A. (2018). Randomization-based inference for Bernoulli trial experiments and implications for observational studies. Statistical Methods in Medical Research, 0(0). PMID: 29451089.
 Branson et al. (2016) Branson, Z., Dasgupta, T., Rubin, D. B., et al. (2016). Improving covariate balance in 2^K factorial designs via rerandomization with an application to a New York City Department of Education high school study. The Annals of Applied Statistics, 10(4), 1958–1976.
 Branson & Miratrix (2018) Branson, Z., & Miratrix, L. (2018). Randomization tests that condition on noncategorical covariate balance. arXiv preprint arXiv:1802.01018.
 Caliendo & Kopeinig (2008) Caliendo, M., & Kopeinig, S. (2008). Some practical guidance for the implementation of propensity score matching. Journal of economic surveys, 22(1), 31–72.
 Cattaneo et al. (2015) Cattaneo, M. D., Frandsen, B. R., & Titiunik, R. (2015). Randomization inference in the regression discontinuity design: An application to party advantages in the US Senate. Journal of Causal Inference, 3(1), 1–24.
 Dasgupta et al. (2015) Dasgupta, T., Pillai, N. S., & Rubin, D. B. (2015). Causal inference from 2^K factorial designs by using potential outcomes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4), 727–753.
 Dehejia (2003) Dehejia, R. H. (2003). Was there a Riverside miracle? A hierarchical framework for evaluating programs with grouped data. Journal of Business & Economic Statistics, 21(1), 1–11.
 Dehejia & Wahba (2002) Dehejia, R. H., & Wahba, S. (2002). Propensity score-matching methods for nonexperimental causal studies. The review of economics and statistics, 84(1), 151–161.
 Diamond & Sekhon (2013) Diamond, A., & Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics, 95(3), 932–945.
 Ding (2017) Ding, P. (2017). A paradox from randomization-based causal inference. Statistical science, 32(3), 331–345.
 Edgington & Onghena (2007) Edgington, E., & Onghena, P. (2007). Randomization tests. CRC Press.

 Frangakis et al. (2002) Frangakis, C. E., Rubin, D. B., & Zhou, X.-H. (2002). Clustered encouragement designs with individual noncompliance: Bayesian inference with randomization, and application to advance directive forms. Biostatistics, 3(2), 147–164.
 Freedman (2008) Freedman, D. A. (2008). On regression adjustments to experimental data. Advances in Applied Mathematics, 40(2), 180–193.
 Gagnon-Bartsch & Shem-Tov (2016) Gagnon-Bartsch, J., & Shem-Tov, Y. (2016). The classification permutation test: A nonparametric test for equality of multivariate distributions. arXiv preprint arXiv:1611.06408.
 Gu & Rosenbaum (1993) Gu, X. S., & Rosenbaum, P. R. (1993). Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2(4), 405–420.
 Hansen (2008) Hansen, B. B. (2008). The essential role of balance tests in propensity-matched observational studies: Comments on ‘A critical appraisal of propensity-score matching in the medical literature between 1996 and 2003’ by Peter Austin, Statistics in Medicine. Statistics in medicine, 27(12), 2050–2054.
 Hansen & Bowers (2008) Hansen, B. B., & Bowers, J. (2008). Covariate balance in simple, stratified and clustered comparative studies. Statistical Science, (pp. 219–236).
 Harder et al. (2010) Harder, V. S., Stuart, E. A., & Anthony, J. C. (2010). Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychological methods, 15(3), 234.
 Heckman et al. (2014) Heckman, J. J., Lopes, H. F., & Piatek, R. (2014). Treatment effects: A Bayesian perspective. Econometric reviews, 33(1–4), 36–67.
 Ho et al. (2007) Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political analysis, 15(3), 199–236.
 Hodges Jr & Lehmann (1963) Hodges Jr, J. L., & Lehmann, E. L. (1963). Estimates of location based on rank tests. The Annals of Mathematical Statistics, (pp. 598–611).
 Holland (1986) Holland, P. (1986). Statistics and causal inference. Journal of the American Statistical Association, (pp. 945–960).
 Hsu et al. (2015) Hsu, J. Y., Zubizarreta, J. R., Small, D. S., & Rosenbaum, P. R. (2015). Strong control of the familywise error rate in observational studies that discover effect modification by exploratory methods. Biometrika, 102(4), 767–782.
 Hudgens & Halloran (2008) Hudgens, M. G., & Halloran, M. E. (2008). Toward causal inference with interference. Journal of the American Statistical Association, 103(482), 832–842.
 Iacus et al. (2011) Iacus, S. M., King, G., & Porro, G. (2011). Multivariate matching methods that are monotonic imbalance bounding. Journal of the American Statistical Association, 106(493), 345–361.
 Iacus et al. (2012) Iacus, S. M., King, G., & Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political analysis, 20(1), 1–24.
 Imai (2008) Imai, K. (2008). Variance identification and efficiency analysis in randomized experiments under the matched-pair design. Statistics in medicine, 27(24), 4857–4873.
 Imai et al. (2008) Imai, K., King, G., & Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the royal statistical society: series A (statistics in society), 171(2), 481–502.
 Imbens (2004) Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and statistics, 86(1), 4–29.
 Imbens & Rubin (2015) Imbens, G. W., & Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
 Imbens & Wooldridge (2009) Imbens, G. W., & Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation. Journal of economic literature, 47(1), 5–86.
 Li et al. (2015) Li, F., Mattei, A., Mealli, F., et al. (2015). Evaluating the causal effect of university grants on student dropout: evidence from a regression discontinuity design using principal stratification. The Annals of Applied Statistics, 9(4), 1906–1931.
 Li et al. (2016) Li, X., Ding, P., & Rubin, D. B. (2016). Asymptotic theory of rerandomization in treatment-control experiments. arXiv preprint arXiv:1604.00698.
 Lin (2013) Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. The Annals of Applied Statistics, 7(1), 295–318.
 Mahalanobis (1936) Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Sciences (Calcutta), 2, 49–55.
 Mattei & Mealli (2016) Mattei, A., & Mealli, F. (2016). Regression discontinuity designs as local randomized experiments. Observational Studies, (2), 156–173.
 Miratrix et al. (2013) Miratrix, L. W., Sekhon, J. S., & Yu, B. (2013). Adjusting treatment effect estimates by post-stratification in randomized experiments. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(2), 369–396.
 Morgan & Rubin (2012) Morgan, K. L., & Rubin, D. B. (2012). Rerandomization to improve covariate balance in experiments. The Annals of Statistics, 40(2), 1263–1282.
 Normand et al. (2001) Normand, S.-L. T., Landrum, M. B., Guadagnoli, E., Ayanian, J. Z., Ryan, T. J., Cleary, P. D., & McNeil, B. J. (2001). Validating recommendations for coronary angiography following acute myocardial infarction in the elderly: a matched analysis using propensity scores. Journal of clinical epidemiology, 54(4), 387–398.
 Pashley & Miratrix (2017) Pashley, N. E., & Miratrix, L. W. (2017). Insights on variance estimation for blocked and matched pairs designs. arXiv preprint arXiv:1710.10342.
 Resa & Zubizarreta (2016) Resa, M., & Zubizarreta, J. R. (2016). Evaluation of subset matching methods and forms of covariate balance. Statistics in medicine, 35(27), 4961–4979.
 Rosenbaum (2010) Rosenbaum, P. (2010). Design of observational studies.
 Rosenbaum (1989) Rosenbaum, P. R. (1989). Optimal matching for observational studies. Journal of the American Statistical Association, 84(408), 1024–1032.
 Rosenbaum (2002) Rosenbaum, P. R. (2002). Observational studies. In Observational studies, (pp. 1–17). Springer.
 Rosenbaum (2005) Rosenbaum, P. R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(4), 515–530.
 Rosenbaum (2012) Rosenbaum, P. R. (2012). Optimal matching of an optimally chosen subset in observational studies. Journal of Computational and Graphical Statistics, 21(1), 57–71.
 Rosenbaum et al. (2007) Rosenbaum, P. R., Ross, R. N., & Silber, J. H. (2007). Minimum distance matched sampling with fine balance in an observational study of treatment for ovarian cancer. Journal of the American Statistical Association, 102(477), 75–83.
 Rosenbaum & Rubin (1983) Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.
 Rosenbaum & Rubin (1985) Rosenbaum, P. R., & Rubin, D. B. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1), 33–38.
 Rubin (1978) Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. The Annals of statistics, (pp. 34–58).
 Rubin (1980) Rubin, D. B. (1980). Comment. Journal of the American Statistical Association, 75(371), 591–593.

 Rubin (1996) Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American statistical Association, 91(434), 473–489.
 Rubin (2007) Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Statistics in medicine, 26(1), 20–36.
 Rubin (2008a) Rubin, D. B. (2008a). Comment: The design and analysis of gold standard randomized experiments. Journal of the American Statistical Association, 103(484), 1350–1353.
 Rubin (2008b) Rubin, D. B. (2008b). For objective causal inference, design trumps analysis. The Annals of Applied Statistics, (pp. 808–840).
 Rubin & Thomas (2000) Rubin, D. B., & Thomas, N. (2000). Combining propensity score matching with additional adjustments for prognostic covariates. Journal of the American Statistical Association, 95(450), 573–585.
 Schafer & Kang (2008) Schafer, J. L., & Kang, J. (2008). Average causal effects from nonrandomized studies: a practical guide and simulated example. Psychological methods, 13(4), 279.
 Sekhon (2009) Sekhon, J. S. (2009). Opiates for the matches: Matching methods for causal inference. Annual Review of Political Science, 12, 487–508.
 Sobel (2006) Sobel, M. E. (2006). What do randomized studies of housing mobility demonstrate? causal inference in the face of interference. Journal of the American Statistical Association, 101(476), 1398–1407.
 Stuart (2010) Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical science: a review journal of the Institute of Mathematical Statistics, 25(1), 1.
 Stuart et al. (2011) Stuart, E. A., King, G., Imai, K., & Ho, D. (2011). MatchIt: nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, 42(8).
 Tchetgen & VanderWeele (2012) Tchetgen, E. J. T., & VanderWeele, T. J. (2012). On causal inference in the presence of interference. Statistical methods in medical research, 21(1), 55–75.
 Zubizarreta & Kilcioglu (2016) Zubizarreta, J., & Kilcioglu, C. (2016). Designmatch: Construction of optimally matched samples for randomized experiments and observational studies that are balanced by design. R package version 0.1, 1, 187.
 Zubizarreta (2012) Zubizarreta, J. R. (2012). Using mixed integer programming for matching in an observational study of kidney failure after surgery. Journal of the American Statistical Association, 107(500), 1360–1371.
 Zubizarreta et al. (2014) Zubizarreta, J. R., Paredes, R. D., & Rosenbaum, P. R. (2014). Matching for balance, pairing for heterogeneity in an observational study of the effectiveness of for-profit and not-for-profit high schools in Chile. The Annals of Applied Statistics, (pp. 204–231).