## 1 Introduction

It is a long-standing idea in statistics that the design of an experiment should inform its analysis. Fisher placed the physical act of randomization at the center of his inferential theory, enshrining it as “the reasoned basis” for inference (Fisher, 1935). Building on these insights, Kempthorne (1955) proposed a randomization theory of inference from experiments, in which inference follows from the precise randomization mechanism used in the design. This approach has gained popularity in the causal inference literature because it relies on very few assumptions (Splawa-Neyman et al., 1990; Imbens and Rubin, 2015).

Yet the injunction to ‘analyze as you randomize’ is not always followed in practice, as noted by Senn (2004) who argues that in clinical trials the analysis does not always follow strictly from the randomization performed. For instance, a Bernoulli randomized experiment might be analyzed as if it were a completely randomized experiment, or we might analyze a completely randomized experiment as if it had been stratified.

This paper studies such as-if analyses in detail in the context of Neymanian inference, and makes three contributions. First we formalize the notion of as-if analyses, motivating their usefulness and proposing a rigorous validity criterion (Section 2). Our framework is grounded in the randomization-based approach to inference. In the two examples we described above, the analysis conditions on some aspect of the observed assignment; for instance, in the first example, the complete randomization is obtained by fixing the number of treated units to its observed value. The idea that inference should be conditional on quantities that affect the precision of estimation is not new in the experimental design literature (e.g., Cox, 1958, 2009) or the larger statistical inference literature (e.g., Särndal et al., 1989; Sundberg, 2003), and it has been reaffirmed recently in the causal inference literature (Branson and Miratrix, 2019; Hennessy et al., 2016). Our second contribution is to show that in our setting, conditioning leads to valid as-if analyses. We also warn against a dangerous pitfall: some as-if analyses look conditional on the surface, but are in fact neither conditional nor valid. This is the case, for instance, of analyzing a completely randomized experiment by conditioning on the covariate balance being no worse than that of the observed assignment (Section 3). Our third contribution is to show how our ideas can be used to suggest new methods (Section 4) and also show how they can be used to evaluate existing methods (Section 5).

## 2 As-if confidence procedures

### 2.1 Setup

Consider $N$ units and let $Z_i \in \{0, 1\}$ be a binary treatment indicator for unit $i$. We adopt the potential outcomes framework (Rubin, 1974; Splawa-Neyman et al., 1990), where, under the Stable Unit Treatment Value Assumption (Rubin, 1980), each unit has two potential outcomes, one under treatment, $Y_i(1)$, and one under control, $Y_i(0)$, and the observed response is $Y_i^{\mathrm{obs}} = Z_i Y_i(1) + (1 - Z_i) Y_i(0)$. We denote by $Z$, $Y(1)$, and $Y(0)$ the vectors of binary treatment assignments, treatment potential outcomes, and control potential outcomes for all $N$ units. Let $\theta$ be our estimand of interest; in most of the examples we take $\theta$ to be the average treatment effect $\tau = N^{-1} \sum_{i=1}^{N} \{Y_i(1) - Y_i(0)\}$, but our results apply more generally to any estimand that is a function of the potential outcome vectors $Y(1)$ and $Y(0)$. An estimator $\hat\theta$ is a function of the assignment vector $Z$ and the observed outcomes. For clarity, we will generally write $\hat\theta(Z)$ to emphasize the dependence of $\hat\theta$ on $Z$, but keep the dependence on the potential outcomes implicit. Denote by $\eta_0$ the design describing how treatment is allocated, so for any particular assignment $Z$, $\eta_0(Z)$ gives the probability of observing $Z$ under design $\eta_0$. In randomization-based inference, we consider the potential outcomes as fixed and partially unknown quantities; the randomness comes exclusively from the assignment vector $Z$ following the distribution $\eta_0$. The estimand $\theta$ is therefore fixed because it is a function of the potential outcomes only, while the observed outcomes and the estimator $\hat\theta(Z)$ are random because they are functions of the random assignment vector $Z$.

Our focus is on the construction of confidence intervals for the estimand $\theta$ under the randomization-based perspective. We define a confidence procedure as a function $C$ mapping any assignment $Z \in \mathcal{Z}_{\eta_0}$ and associated vector of observed outcomes, $Y^{\mathrm{obs}}$, to an interval in $\mathbb{R}$, where $\mathcal{Z}_{\eta_0} = \{Z : \eta_0(Z) > 0\}$ is the support of the randomization distribution. Standard confidence intervals are examples of confidence procedures, and are usually based on an approximate distribution of $\hat\theta(Z) - \theta$. For careful choices of $\hat\theta$ and $\eta_0$, the random variable $\hat\theta(Z) - \theta$, standardized by the standard deviation $\sqrt{\mathrm{var}_{\eta_0}(\hat\theta)}$ induced by the design $\eta_0$, is asymptotically standard normal (see Li and Ding, 2017). We can then construct an interval

$$C(Z; \eta_0) = \left[\hat\theta(Z) - 1.96\sqrt{\hat V},\ \hat\theta(Z) + 1.96\sqrt{\hat V}\right] \tag{1}$$

where $\hat V$ is an estimator of $\mathrm{var}_{\eta_0}(\hat\theta)$. Discussing the validity of these kinds of confidence procedures is difficult for two reasons. First, they are generally based on asymptotics, so validity in finite populations can only be approximate. Second, they use the square root of estimates of the variance which, in practice, tend to be biased. These two issues obscure the conceptual underpinnings of as-if analyses. We circumvent these issues by focusing instead on oracle confidence procedures, which are based on the true quantiles of the distribution of $\hat\theta(Z) - \theta$ induced by the design $\eta_0$. Specifically, we consider $1 - \gamma$ level confidence intervals ($0 < \gamma < 1$) of the form

$$C(Z; \eta_0) = \left[\hat\theta(Z) - U_{\gamma/2}(\eta_0),\ \hat\theta(Z) - L_{\gamma/2}(\eta_0)\right] \tag{2}$$

where $U_{\gamma/2}(\eta_0)$ and $L_{\gamma/2}(\eta_0)$ are the upper and lower $\gamma/2$ quantiles, respectively, of the distribution of $\hat\theta(Z) - \theta$ under design $\eta_0$. Because they do not depend on $Z$, the quantiles $U_{\gamma/2}(\eta_0)$ and $L_{\gamma/2}(\eta_0)$ are fixed. The confidence procedure in Equation (2) is said to be an oracle procedure because unlike the interval in Equation (1), it cannot be computed from observed data. Oracle procedures allow us to set aside the practical issues of approximation and estimability to focus on the essence of the problem. We discuss some of the practical issues that occur without oracles in supplementary material B.
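The oracle construction of Equation (2) can be sketched by exact enumeration on a toy problem. The sketch below assumes a small completely randomized design and a difference-in-means estimator; the potential outcomes, the level, and all function names are hypothetical choices made for illustration. The quantiles are defined conservatively (the largest support point with at most $\gamma/2$ mass strictly below it, and the smallest with at most $\gamma/2$ mass strictly above it) so that coverage is at least nominal for a discrete distribution.

```python
from fractions import Fraction
from itertools import combinations

def diff_in_means(z, y1, y0):
    """Difference in means using only the outcomes revealed by assignment z."""
    treated = [y1[i] for i, zi in enumerate(z) if zi == 1]
    control = [y0[i] for i, zi in enumerate(z) if zi == 0]
    return Fraction(sum(treated), len(treated)) - Fraction(sum(control), len(control))

def oracle_quantiles(design, estimator, theta, gamma):
    """Quantiles (L, U) of D = estimator(Z) - theta when Z is drawn from `design`.
    L: largest support point with pr(D < L) <= gamma/2.
    U: smallest support point with pr(D > U) <= gamma/2."""
    mass = {}
    for z, p in design.items():
        d = estimator(z) - theta
        mass[d] = mass.get(d, 0) + p
    values = sorted(mass)
    lower, below = values[0], 0
    for v in values:                 # `below` is pr(D < v)
        if below <= gamma / 2:
            lower = v
        below += mass[v]
    upper, above = values[-1], 0
    for v in reversed(values):       # `above` is pr(D > v)
        if above <= gamma / 2:
            upper = v
        above += mass[v]
    return lower, upper

def complete_randomization(n, n_treated):
    """Uniform design over assignments with exactly n_treated treated units."""
    support = [tuple(int(i in chosen) for i in range(n))
               for chosen in combinations(range(n), n_treated)]
    return {z: Fraction(1, len(support)) for z in support}

# Hypothetical potential outcomes; only an oracle sees both vectors.
y0 = [0, 1, 2, 3, 5, 8]
y1 = [1, 3, 2, 5, 6, 8]
theta = Fraction(sum(a - b for a, b in zip(y1, y0)), len(y0))  # average effect
gamma = Fraction(1, 5)

eta0 = complete_randomization(6, 3)
estimator = lambda z: diff_in_means(z, y1, y0)
L, U = oracle_quantiles(eta0, estimator, theta, gamma)

def oracle_interval(z):
    """Interval [theta_hat(z) - U, theta_hat(z) - L]; its length U - L is fixed."""
    return estimator(z) - U, estimator(z) - L

# Exact coverage under the design: total probability of intervals containing theta.
coverage = sum(p for z, p in eta0.items()
               if oracle_interval(z)[0] <= theta <= oracle_interval(z)[1])
print(float(coverage))
```

Exact rational arithmetic is used so that ties in the randomization distribution are handled without floating-point ambiguity.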

### 2.2 As-if confidence procedures

Given data from an experiment, it is natural to consider the confidence procedure $C(Z; \eta_0)$ constructed with the design $\eta_0$ that was actually used to randomize the treatment assignment. Consider, however, the oracle procedure $C(Z; \eta)$ based on the distribution of $\hat\theta(Z) - \theta$ induced by some other design $\eta$ that assigns positive probability to the observed assignment. In this case we say that the experiment is analyzed as-if it were randomized according to $\eta$. We generalize this idea further by allowing the design used in the oracle procedure to vary depending on the observed assignment. This can be formalized with the concept of a design map.

###### Definition 1 (Design map).

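Let $\mathcal{D}$ be the set of designs, or probability distributions, with support in $\mathcal{Z}_{\eta_0}$, the support of the original design $\eta_0$. A function $H: \mathcal{Z}_{\eta_0} \to \mathcal{D}$ which maps each assignment $Z$ to a design $H(Z)$ is a design map.

A confidence procedure can then be constructed using design map $H$ as follows:

$$C(Z; H) = \left[\hat\theta(Z) - U_{\gamma/2}\{H(Z)\},\ \hat\theta(Z) - L_{\gamma/2}\{H(Z)\}\right],$$

where $U_{\gamma/2}\{H(Z)\}$ and $L_{\gamma/2}\{H(Z)\}$ are the upper and lower $\gamma/2$ quantiles of the distribution of $\hat\theta - \theta$ induced by the design $H(Z)$.

This is an instance of as-if analysis, in which the design used to analyze the data depends on the observed assignment. That is, while we traditionally have one rule for how to create a confidence interval, we may now have many rules, possibly as many as $|\mathcal{Z}_{\eta_0}|$, specified via the design map. In the special case in which the design map is constant, i.e. $H(Z) = \eta$ for all $Z$, we write $C(Z; \eta)$ instead of $C(Z; H)$, with a slight abuse of notation. Note that the design map function itself is fixed before observing the treatment assignment.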

###### Example 1.

Consider an experiment run according to a Bernoulli design with probability of treatment $\pi$, where, to ensure that our estimator is defined, we remove assignments with no treated or control units. That is, $\mathcal{Z}_{\eta_0}$ is the set of all assignments such that at least one unit receives treatment and one unit receives control, and $\eta_0$ is the Bernoulli distribution renormalized over this set. Let $\eta_{N_1}$ ($N_1 = 1, \dots, N - 1$) be the completely randomized design with $N_1$ treated units. We use the design map $H$ where $H(Z) = \eta_{N_1(Z)}$ and $N_1(Z) = \sum_{i=1}^{N} Z_i$ is the observed number of treated units. That is, we analyze the Bernoulli design as if it were completely randomized, with the observed number of treated units assigned to treatment. Consider the concrete case where $N = 10$. Suppose that we observe assignment $Z$ with $N_1(Z) = 3$. The design $H(Z) = \eta_3$ corresponds to complete randomization with 3 units treated out of 10, and the confidence procedure is constructed using the distribution of $\hat\theta - \theta$ induced by $\eta_3$. We analyze as if a completely randomized design with 3 treated units had actually been run. If instead we observe $Z'$ with a different number of treated units $N_1(Z')$, then the confidence procedure would be constructed by considering the distribution of $\hat\theta - \theta$ induced by $\eta_{N_1(Z')}$.

###### Example 2.

Let our units have a single categorical covariate $X_i$. Categorical covariates form blocks: they partition the sample based on covariate value. Assume that the actual experiment was run using complete randomization with fixed number of treated units $N_1$ but discarding assignments where there is not at least one treated unit and one control unit in each block. That is, $\mathcal{Z}_{\eta_0}$ is the set of all assignments with $N_1$ treated units such that at least one unit receives treatment and one unit receives control in each block, and $\eta_0$ is uniform over this set. This restriction on the assignment space is to account for the associated blocked estimator being undefined. However, with moderate size blocks we can ignore this nuisance event due to its low probability. For vector $\mathbf{N}_1$ whose $k$th entry is an integer strictly less than the size of the $k$th block and strictly greater than 0, let $\eta_{\mathbf{N}_1}$ be a block randomized design with the number of treated units in each block corresponding to the numbers given in vector $\mathbf{N}_1$. We use the design map $H(Z) = \eta_{\mathbf{N}_1(Z)}$, where $\mathbf{N}_1(Z)$ is the vector that gives the number of treated units within each block and $\eta_{\mathbf{N}_1(Z)}$ is the blocked design corresponding to $\mathbf{N}_1(Z)$. We use post-stratification (Holt and Smith, 1979; Miratrix et al., 2013) to analyze this completely randomized design as if it were block randomized.

###### Example 3.

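Assume that the actual experiment was run using complete randomization with exactly $N_1$ units treated. That is, $\mathcal{Z}_{\eta_0}$ is the set of all assignments such that $N_1$ units receive treatment, and $\eta_0$ is uniform over this set. Let $X_i$ be a continuous covariate for each unit $i$. Define a covariate balance measure $\Delta_X(Z)$, e.g. the absolute difference in covariate means between the two arms,

$$\Delta_X(Z) = \left| \frac{1}{N_1} \sum_{i=1}^{N} Z_i X_i - \frac{1}{N - N_1} \sum_{i=1}^{N} (1 - Z_i) X_i \right|.$$

For any assignment $Z$, define $A(Z) = \{Z' \in \mathcal{Z}_{\eta_0} : \Delta_X(Z') \le \Delta_X(Z)\}$, which will give us the set of assignments with covariate balance better than or equal to the observed covariate balance. We use the design map $H(Z)(Z') = \mathrm{pr}_{\eta_0}\{Z' \mid Z' \in A(Z)\}$. Then this design map leads to analyzing the completely randomized design as if it were rerandomized (see Morgan and Rubin, 2012, for more on rerandomization) with acceptable covariate balance cut-off equal to that of the observed covariate balance.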

###### Example 4.

Assume that the actual experiment was run using block randomization where $\mathbf{N}_1$ is a fixed vector, and hence we can drop the $(Z)$ in the notation used in Example 2, that gives the number of treated units within each block. That is, $\mathcal{Z}_{\eta_0}$ is the set of all assignments such that the number of treated units in each block is given by $\mathbf{N}_1$, and $\eta_0$ is uniform over this set. Let $\eta_{N_1}$ correspond to a completely randomized design, as laid out in Example 2, with $N_1$, the total number of units treated across all blocks, treated. We use the constant design map $H(Z) = \eta_{N_1}$. This corresponds to analyzing this block randomized design as if it were completely randomized.

Throughout the paper, we focus on settings in which the same estimator is used in the original analysis and in the as-if analysis. In practice, the two analyses might employ different estimators. For instance, in Example 2, we might analyze the completely randomized experiment with a difference-in-means estimator, but use the standard blocking estimator to analyze the as-if stratified experiment. We discuss this point further in supplementary material B but in the rest of this article, we fix the estimator and focus on the impact of changing only the design.

### 2.3 Validity, relevance and conditioning

We have formalized the concept of an as-if analysis, but we have not yet addressed an important question: why should we even consider such analyses instead of simply analyzing the way we randomize? Before we answer this question, we first introduce a minimum validity criterion for as-if procedures.

###### Definition 2 (Valid confidence procedure).

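Fix $\gamma \in (0, 1)$. Let $\eta_0$ be the design used in the original experiment and let $H$ be a design map on $\mathcal{Z}_{\eta_0}$, the support of $\eta_0$. The confidence procedure $C(Z; H)$ is said to be valid with respect to $\eta_0$, or $\eta_0$-valid, at level $1 - \gamma$ if $\mathrm{pr}_{\eta_0}\{\theta \in C(Z; H)\} \ge 1 - \gamma$. When a procedure is valid at all levels, we simply say that it is $\eta_0$-valid.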

This criterion is intuitive: a confidence procedure is valid if its coverage is as advertised over the original design. The following simple result formalizes the popular injunction to “analyze the way you randomize” (Lachin, 1988, p. 317):

###### Proposition 1.

The procedure $C(Z; \eta_0)$ is $\eta_0$-valid.

Given that the procedure $C(Z; \eta_0)$ is $\eta_0$-valid, why should we consider alternative as-if analyses, even valid ones? That is, having observed $Z$, why should we use a design $H(Z)$ to perform the analysis, instead of the original $\eta_0$? A natural, but only partially correct, answer would be that the goal of an as-if analysis is to increase the precision of our estimator and obtain smaller confidence intervals while maintaining validity. After all, this is the purpose of restricted randomization approaches when considered at the design stage. For instance, if we have reasons to believe that a certain factor might affect the responses of experimental units, stratifying on this factor will reduce the variance of our estimator. This analogy, however, is misleading. The primary goal of an as-if analysis is not to increase the precision of the analysis but to increase its relevance. In fact, we argue heuristically in supplementary material B that an as-if analysis will not increase precision on average, over the original assignment distribution. Rather, it is frequently the change of estimator that has created this impression.

Informally, an observable quantity is relevant if a reasonable person cannot tell if a confidence interval will be too long or too short given that quantity. The concept of relevance captures the idea that our inference should be an accurate reflection of our uncertainty given the observed information from the realized randomization; our confidence intervals should be more precise if our uncertainty is lower and less precise if our uncertainty is higher. Defining the concept of relevance formally is difficult. See Buehler (1959) and Robinson (1979) for a more formal treatment. Our supplementary material C gives a precise discussion in the context of betting games, following Buehler (1959). We will not attempt a formal definition here and instead, following Liu and Meng (2016), we will illustrate its essence with a simple example.

Consider the Bernoulli design scenario of Example 1. From Equation (2), the oracle interval constructed from the original design $\eta_0$ has the same length regardless of the assignment vector actually observed. Yet, intuitively, an observed assignment vector with 1 treated unit and 99 control units should lead to less precise inference, and therefore wider confidence intervals than a balanced assignment vector with 50 treated units and 50 control units. In a sense, the confidence interval is too narrow if the observed assignment is severely imbalanced, too wide if it is well balanced, but right overall. Let $\mathcal{Z}_1$ be the set of all assignments with a single treated unit and $\mathcal{Z}_{50}$ the set of all assignments with 50 treated units. If the confidence interval has level $1 - \gamma$, we expect

$$\mathrm{pr}_{\eta_0}\{\theta \in C(Z; \eta_0) \mid Z \in \mathcal{Z}_1\} \le 1 - \gamma \le \mathrm{pr}_{\eta_0}\{\theta \in C(Z; \eta_0) \mid Z \in \mathcal{Z}_{50}\},$$

where, in general, the inequalities are strict. See the supplementary material C for a proof in a concrete setting. More formally, we say that the procedure $C(Z; \eta_0)$ is valid marginally, but is not valid conditional on the number of treated units. This example is illustrated in Figure 1, which shows that the coverage is below 0.95 if the proportion of treated units is not close to 0.5, and above 0.95 if the proportion is around 0.5. To remedy this, we should use wider confidence intervals in the first case, and narrower ones in the second. The right panel of Figure 1 shows that in this case, a large fraction of assignments have a proportion of treated units close to 0.5, therefore the confidence interval will be too large for many realizations of the design $\eta_0$.
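The distinction between marginal and conditional coverage can be explored numerically. The following sketch assumes a Bernoulli design with treatment probability 1/2 on 8 units (with the all-treated and all-control assignments removed) and hypothetical potential outcomes; it computes the exact coverage of the fixed-length oracle interval separately for each observed number of treated units. It is an illustration under these assumptions, not a reproduction of Figure 1.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def diff_in_means(z, y1, y0):
    """Difference in means using only the outcomes revealed by assignment z."""
    treated = [y1[i] for i, zi in enumerate(z) if zi == 1]
    control = [y0[i] for i, zi in enumerate(z) if zi == 0]
    return Fraction(sum(treated), len(treated)) - Fraction(sum(control), len(control))

def oracle_quantiles(design, estimator, theta, gamma):
    """L: largest support point of D = estimator(Z) - theta with pr(D < L) <= gamma/2;
    U: smallest support point with pr(D > U) <= gamma/2."""
    mass = {}
    for z, p in design.items():
        d = estimator(z) - theta
        mass[d] = mass.get(d, 0) + p
    values = sorted(mass)
    lower, below = values[0], 0
    for v in values:
        if below <= gamma / 2:
            lower = v
        below += mass[v]
    upper, above = values[-1], 0
    for v in reversed(values):
        if above <= gamma / 2:
            upper = v
        above += mass[v]
    return lower, upper

# Hypothetical potential outcomes for N = 8 units.
n = 8
y0 = [0, 1, 1, 2, 3, 5, 8, 13]
y1 = [1, 1, 3, 2, 5, 6, 8, 15]
theta = Fraction(sum(a - b for a, b in zip(y1, y0)), n)
gamma = Fraction(1, 10)
estimator = lambda z: diff_in_means(z, y1, y0)

# Bernoulli(1/2) restricted to assignments with both arms non-empty is uniform.
support = [z for z in product((0, 1), repeat=n) if 0 < sum(z) < n]
eta0 = {z: Fraction(1, len(support)) for z in support}

# One pair of quantiles for every assignment: the original, marginal analysis.
L, U = oracle_quantiles(eta0, estimator, theta, gamma)

hits, total = defaultdict(Fraction), defaultdict(Fraction)
for z, p in eta0.items():
    covered = L <= estimator(z) - theta <= U
    total[sum(z)] += p            # pr(N1 = n1)
    hits[sum(z)] += p * covered   # pr(N1 = n1 and theta is covered)

marginal = sum(hits.values())
print("marginal coverage:", float(marginal))
for n1 in sorted(total):
    print(n1, "treated: conditional coverage", float(hits[n1] / total[n1]))
```

The marginal coverage is guaranteed to reach the nominal level by construction; how the conditional coverages spread around it depends on the potential outcomes supplied.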

Our confidence interval should be relevant to the assignment vector actually observed, and reflect the appropriate level of uncertainty. In the context of randomization-based inference, this takes the form of an as-if analysis.

The concept of relevance and its connection to conditioning has a long history in statistics. Cox (1958) gives a dramatic example of a scenario in which two measuring instruments have widely different precisions, and one of them is chosen at random to measure a quantity of interest. He argues that the relevant measure of uncertainty is that of the instrument actually used, not that obtained by marginalizing over the choice of the instrument. In other words, the analysis should be conditional on the instrument actually chosen. This is an illustration of the conditionality principle (Birnbaum, 1962). In the context of randomization-based inference, this conditioning argument leads to valid as-if analyses, as we show in Section 3. An important complication, explored in Section 3, is that conditional as-if analyses are only a subset of possible as-if analyses, and while the former are guaranteed to be valid, the latter enjoy no such guarantees.

## 3 Conditional as-if analyses

### 3.1 Conditional design maps

We define a conditional as-if analysis as an analysis conducted with a conditional design map as defined below.

###### Definition 3.

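[Conditional design map] Consider an experiment with design $\eta_0$. Take any function $w: \mathcal{Z}_{\eta_0} \to \Omega$, for some set $\Omega$, and for $\omega \in \Omega$ define the design $\eta_\omega$ as $\eta_\omega(Z') = \mathrm{pr}_{\eta_0}\{Z = Z' \mid w(Z) = \omega\}$. Then a function $H$ such that $H(Z) = \eta_{w(Z)}$ is called a conditional design map.

It is easy to verify that a conditional design map also satisfies Definition 1. For $Z \in \mathcal{Z}_{\eta_0}$, $H(Z)$ is a design, not the probability of $Z$ under some design. The probability of any assignment $Z'$ under design $H(Z)$ is $H(Z)(Z') = \mathrm{pr}_{\eta_0}\{Z' \mid w(Z') = w(Z)\}$, where $Z$ is fixed and the probability is that induced by $\eta_0$. We introduced the shorthand $\eta_{w(Z)}$ in order to ease the notation.

For an alternative perspective on conditional design maps, notice that any function $w$ induces a partition $\{\mathcal{Z}_\omega\}_{\omega \in \Omega}$ of the support $\mathcal{Z}_{\eta_0}$, where $\mathcal{Z}_\omega = \{Z : w(Z) = \omega\}$. The corresponding conditional design map would then, for a given $Z$, restrict and renormalize the original $\eta_0$ to the $\mathcal{Z}_\omega$ containing $Z$. An important note is that the mapping function, and therefore the partitioning of the assignment space, must be fixed before observing the treatment assignment.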

###### Example 1 (cont.).

The design map in Example 1 is a conditional design map, with $w(Z) = N_1(Z)$, the number of treated units. Here we partition the assignments by the number of treated units.

###### Example 2 (cont.).

The design map in Example 2 is a conditional design map, with $w(Z) = \mathbf{N}_1(Z)$. Here we partition the assignments by the vector of the number of treated units in each block, $\mathbf{N}_1(Z)$.

While Definition 3 implies Definition 1, the converse is not true: some design maps are not conditional. For instance, the design maps we consider in Example 3 and Example 4 are not conditional, as will be discussed in Section 3.2. We can now state our main validity result.

###### Theorem 1.

Consider a design $\eta_0$ and a function $w: \mathcal{Z}_{\eta_0} \to \Omega$. Then an oracle procedure built with the conditional design map $H(Z) = \eta_{w(Z)}$ is $\eta_0$-valid.

Proof of Theorem 1 is provided in supplementary material A. In fact, the intervals obtained are not just valid marginally; they are also conditionally valid within each $\mathcal{Z}_\omega = \{Z : w(Z) = \omega\}$, in the sense that

$$\mathrm{pr}_{\eta_0}\{\theta \in C(Z; H) \mid w(Z) = \omega\} \ge 1 - \gamma$$

for any $\omega \in \Omega$. Conditional validity implies unconditional validity because if we have valid inference for each partition of the assignment space then we will have validity over all partitions. Conditional validity is good: it implies increased relevance, at least with respect to function $w$. We discuss this connection more in supplementary material C in the context of betting games.

###### Corollary 1.

Let $\{\mathcal{Z}_1, \dots, \mathcal{Z}_K\}$ be a partition of the set of all possible assignments, and

$$w(Z) = \sum_{k=1}^{K} k \, \mathbb{1}\{Z \in \mathcal{Z}_k\}.$$

Then $w(Z)$ indexes the partition that $Z$ is in. Using $H(Z) = \eta_{w(Z)}$ gives an $\eta_0$-valid procedure, as a consequence of Theorem 1.

Details are provided in supplementary material A. Corollary 1 states that any partition of the support induces a valid oracle confidence procedure; having observed assignment $Z$, one simply needs to identify the unique element $\mathcal{Z}_k$ containing $Z$ and construct an oracle interval using the design obtained by restricting $\eta_0$ to the set $\mathcal{Z}_k$.
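The recipe described by Corollary 1 (identify the cell containing the observed assignment, restrict and renormalize the original design to that cell, then build the oracle interval) can be sketched as follows. The example assumes a Bernoulli(1/2) design on 6 units partitioned by the number of treated units, as in Example 1; the potential outcomes and all function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def diff_in_means(z, y1, y0):
    """Difference in means using only the outcomes revealed by assignment z."""
    treated = [y1[i] for i, zi in enumerate(z) if zi == 1]
    control = [y0[i] for i, zi in enumerate(z) if zi == 0]
    return Fraction(sum(treated), len(treated)) - Fraction(sum(control), len(control))

def oracle_quantiles(design, estimator, theta, gamma):
    """L: largest support point of D = estimator(Z) - theta with pr(D < L) <= gamma/2;
    U: smallest support point with pr(D > U) <= gamma/2."""
    mass = {}
    for z, p in design.items():
        d = estimator(z) - theta
        mass[d] = mass.get(d, 0) + p
    values = sorted(mass)
    lower, below = values[0], 0
    for v in values:
        if below <= gamma / 2:
            lower = v
        below += mass[v]
    upper, above = values[-1], 0
    for v in reversed(values):
        if above <= gamma / 2:
            upper = v
        above += mass[v]
    return lower, upper

def conditional_design_map(eta0, w):
    """Return H with H(z) equal to eta0 restricted to {z': w(z') = w(z)}, renormalized."""
    cells = defaultdict(dict)
    for z, p in eta0.items():
        cells[w(z)][z] = p
    designs = {}
    for omega, cell in cells.items():
        norm = sum(cell.values())
        designs[omega] = {z: p / norm for z, p in cell.items()}
    return lambda z: designs[w(z)]

# Hypothetical potential outcomes for N = 6 units.
n = 6
y0 = [0, 1, 2, 3, 5, 8]
y1 = [1, 3, 2, 5, 6, 8]
theta = Fraction(sum(a - b for a, b in zip(y1, y0)), n)
gamma = Fraction(1, 5)
estimator = lambda z: diff_in_means(z, y1, y0)

support = [z for z in product((0, 1), repeat=n) if 0 < sum(z) < n]
eta0 = {z: Fraction(1, len(support)) for z in support}

w = sum                                  # partition by number of treated units
H = conditional_design_map(eta0, w)

def covers(z, design):
    """True when the oracle interval built from `design` at z contains theta."""
    lower, upper = oracle_quantiles(design, estimator, theta, gamma)
    return lower <= estimator(z) - theta <= upper

hits, total = defaultdict(Fraction), defaultdict(Fraction)
for z, p in eta0.items():
    total[w(z)] += p
    hits[w(z)] += p * covers(z, H(z))

conditional = {k: hits[k] / total[k] for k in total}   # coverage within each cell
marginal = sum(hits.values())                          # coverage under eta0
print({k: float(v) for k, v in sorted(conditional.items())}, float(marginal))
```

Because the partition function is fixed in advance, coverage reaches the nominal level inside every cell, and therefore also marginally.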

An additional benefit of using conditional design maps is replicability. Consider Example 1, and the corresponding discussion of relevance with Bernoulli designs in Section 2.3. Under the original analysis for the Bernoulli design we would expect that the estimates for the bad randomizations with an extreme proportion of treated units will be far from the truth, on average. But if we do not adjust the estimated precision of our estimators based on this information, we may not only have an estimate that is far from the truth but our confidence intervals will imply confidence in that poor estimate. Although our conditional analysis will cover the truth the same proportion of the time as the original analysis, we would expect the length of our confidence intervals to reflect less certainty when we have a poor randomization. In terms of reproducibility, this means that we are less likely to end up being confident in an extreme result.

### 3.2 Non-conditional design maps

Theorem 1 states that a sufficient condition for an as-if procedure to be valid is that it be a conditional as-if procedure. Although this condition is not necessary, we will now show that some non-conditional as-if analyses can have arbitrarily poor properties. Example 3 provides a sharp illustration of this phenomenon and, although it is an edge case, it helps build intuition for why some design maps are not valid.

###### Example 3 (cont.).

The design map introduced in Example 3 is not a conditional design map. This can be seen by noticing that the sets $A(Z)$, where $Z \in \mathcal{Z}_{\eta_0}$, do not form a partition of $\mathcal{Z}_{\eta_0}$.

This example is particularly deceptive because the design map does involve a conditional distribution. And yet, it is not a conditional design map in the sense of Definition 3 because it does not partition the space of assignments; each assignment $Z$, except for the assignments with the very worst balance, will belong to multiple sets $A(Z')$. Therefore Theorem 1 does not apply.

Consider the special case where covariates are drawn from a continuous distribution and (). We are interested in the average treatment effect, which is the difference in mean potential outcomes under treatment versus control. Suppose that assignments are balanced such that half the units are assigned to treatment and half are assigned to control. Then given any , with probability one there are only two assignments with exactly the same value for , and the assignment ; see supplementary material A for proof of this statement. By construction, then, our assignment is one of two worst case assignments in terms of balance for the set . Under the model , so the observed difference, will be the most extreme in and thus would lie outside the oracle confidence interval if, as is typical, , where is the size of set . Thus this design map would lead to poor coverage. In fact, we show in supplementary material A that if we instead make the inequality strict and take , the as-if procedure of Example 3 has a coverage of . Intuitively, this is because the observed assignment always has the worst covariate balance of all assignments within the support . Although this is an extreme example, it illustrates the fact that as-if analyses are not guaranteed to be valid if they are not conditional, and can indeed be extremely ill-behaved.
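The failure described in this example can be checked by exact enumeration on a small case. The sketch below assumes 6 units with 3 treated, hypothetical covariate values chosen so that no two non-complementary assignments share the same imbalance, outcomes equal to the covariate under both arms, and the absolute difference in covariate means as the balance measure. The quantile convention (largest support point with at most $\gamma/2$ mass strictly below it, and symmetrically above) is also an assumption of the sketch.

```python
from fractions import Fraction
from itertools import combinations

def diff_in_means(z, y1, y0):
    """Difference in means using only the outcomes revealed by assignment z."""
    treated = [y1[i] for i, zi in enumerate(z) if zi == 1]
    control = [y0[i] for i, zi in enumerate(z) if zi == 0]
    return Fraction(sum(treated), len(treated)) - Fraction(sum(control), len(control))

def oracle_quantiles(design, estimator, theta, gamma):
    """L: largest support point of D = estimator(Z) - theta with pr(D < L) <= gamma/2;
    U: smallest support point with pr(D > U) <= gamma/2."""
    mass = {}
    for z, p in design.items():
        d = estimator(z) - theta
        mass[d] = mass.get(d, 0) + p
    values = sorted(mass)
    lower, below = values[0], 0
    for v in values:
        if below <= gamma / 2:
            lower = v
        below += mass[v]
    upper, above = values[-1], 0
    for v in reversed(values):
        if above <= gamma / 2:
            upper = v
        above += mass[v]
    return lower, upper

n, n1 = 6, 3
x = [1, 2, 4, 8, 16, 32]      # hypothetical covariates with distinct subset sums
y0 = y1 = x                   # outcomes equal the covariate, so the true effect is 0
theta = Fraction(0)
gamma = Fraction(1, 5)

support = [tuple(int(i in c) for i in range(n)) for c in combinations(range(n), n1)]
eta0 = {z: Fraction(1, len(support)) for z in support}   # complete randomization
estimator = lambda z: diff_in_means(z, y1, y0)
imbalance = lambda z: abs(diff_in_means(z, x, x))        # assumed balance measure

def as_if_rerandomized(z):
    """Non-conditional design map: uniform over assignments balanced at least as well as z."""
    accepted = [zp for zp in support if imbalance(zp) <= imbalance(z)]
    return {zp: Fraction(1, len(accepted)) for zp in accepted}

def covers(z, design):
    """True when the oracle interval built from `design` at z contains theta."""
    lower, upper = oracle_quantiles(design, estimator, theta, gamma)
    return lower <= estimator(z) - theta <= upper

coverage = sum(p * covers(z, as_if_rerandomized(z)) for z, p in eta0.items())
print(coverage, "versus nominal", 1 - gamma)
```

With these inputs the exact coverage is 2/5 against a nominal level of 4/5: the observed assignment always sits at an extreme of its own reference distribution, so it is covered only when its acceptance set has fewer than $2/\gamma = 10$ elements.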

###### Example 4 (cont.).

The design map introduced in Example 4, in which we analyze a blocked design as if it were completely randomized, is also not a conditional design map. This can be seen by noticing that the complete randomization does not partition the blocked design but rather the blocked design is a subset, or a single element of a partition, of the completely randomized design.

This implies that Example 4 can also lead to invalid analysis; if the blocked design originally used is a particularly bad partition of the completely randomized design, in the sense of having wider conditional intervals, we will not have guaranteed validity using a completely randomized design for analysis. See Pashley and Miratrix (2020) for further discussion on when a blocked design can result in higher variance of estimators than an unblocked design.

### 3.3 How to build a better conditional analysis

The original goal of the as-if analysis of Example 3 was to incorporate the observed covariate balance in the analysis to increase relevance. We have shown that the design map originally proposed was not a conditional design map. We now show how to construct a conditional design map, and therefore a valid procedure, for this problem. The idea is to partition the support into sets of assignments with similar covariate balance and then use the induced conditional design map, as prescribed by Corollary 1. Let $\mathcal{B}$ be the set of all possible covariate imbalance values achievable by the design $\eta_0$, and $\{B_1, \dots, B_K\}$ be a partition of that set into $K$ ordered elements. That is, for any $k, k'$ with $k < k'$, we have $b < b'$ for all $b \in B_k$ and $b' \in B_{k'}$. This induces a partition $\{\mathcal{Z}_1, \dots, \mathcal{Z}_K\}$ of $\mathcal{Z}_{\eta_0}$, where

$$\mathcal{Z}_k = \{Z \in \mathcal{Z}_{\eta_0} : \Delta_X(Z) \in B_k\}.$$

Now we can directly apply the results of Corollary 1. This approach is similar in spirit to the conditional randomization tests proposed by Rosenbaum (1984); see also Branson and Miratrix (2019) and Hennessy et al. (2016). The resulting as-if analysis improves on the original analysis under $\eta_0$ by increasing its relevance. Indeed, suppose that the observed assignment $Z$ has covariate balance $\Delta_X(Z)$. Then the confidence interval constructed using $\eta_0$ will involve all of the assignments in $\mathcal{Z}_{\eta_0}$, including some whose covariate balances differ sharply from $\Delta_X(Z)$. In contrast, the procedure we just introduced restricts the randomization distribution to a subset of assignments containing only assignments with balance close to $\Delta_X(Z)$.
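A binned version of the balance-based analysis can be sketched on a toy case: 6 units with 3 treated, and outcomes equal to the covariate, a setting in which conditioning on "balance no worse than observed" performs poorly. The covariate values, the bin boundaries `cuts`, and the use of the absolute difference in covariate means as the balance measure are all hypothetical choices; the one requirement taken from the text is that the bins are fixed before the assignment is observed.

```python
from bisect import bisect_right
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

def diff_in_means(z, y1, y0):
    """Difference in means using only the outcomes revealed by assignment z."""
    treated = [y1[i] for i, zi in enumerate(z) if zi == 1]
    control = [y0[i] for i, zi in enumerate(z) if zi == 0]
    return Fraction(sum(treated), len(treated)) - Fraction(sum(control), len(control))

def oracle_quantiles(design, estimator, theta, gamma):
    """L: largest support point of D = estimator(Z) - theta with pr(D < L) <= gamma/2;
    U: smallest support point with pr(D > U) <= gamma/2."""
    mass = {}
    for z, p in design.items():
        d = estimator(z) - theta
        mass[d] = mass.get(d, 0) + p
    values = sorted(mass)
    lower, below = values[0], 0
    for v in values:
        if below <= gamma / 2:
            lower = v
        below += mass[v]
    upper, above = values[-1], 0
    for v in reversed(values):
        if above <= gamma / 2:
            upper = v
        above += mass[v]
    return lower, upper

n, n1 = 6, 3
x = [1, 2, 4, 8, 16, 32]      # hypothetical covariates
y0 = y1 = x                   # outcomes equal the covariate, so the true effect is 0
theta, gamma = Fraction(0), Fraction(1, 5)
cuts = [4, 8, 12]             # hypothetical bin boundaries, fixed before seeing Z

support = [tuple(int(i in c) for i in range(n)) for c in combinations(range(n), n1)]
eta0 = {z: Fraction(1, len(support)) for z in support}   # complete randomization
estimator = lambda z: diff_in_means(z, y1, y0)
imbalance = lambda z: abs(diff_in_means(z, x, x))
w = lambda z: bisect_right(cuts, imbalance(z))           # index of the bin containing z

cells = defaultdict(list)
for z in support:
    cells[w(z)].append(z)     # every assignment falls in exactly one cell

def H(z):
    """Conditional design map: uniform over the cell that contains z."""
    cell = cells[w(z)]
    return {zp: Fraction(1, len(cell)) for zp in cell}

def covers(z, design):
    """True when the oracle interval built from `design` at z contains theta."""
    lower, upper = oracle_quantiles(design, estimator, theta, gamma)
    return lower <= estimator(z) - theta <= upper

conditional = {k: Fraction(sum(covers(z, H(z)) for z in cell), len(cell))
               for k, cell in cells.items()}
marginal = sum(p * covers(z, H(z)) for z, p in eta0.items())
print({k: float(v) for k, v in sorted(conditional.items())}, float(marginal))
```

Coarser bins move the analysis back toward the original design, while finer bins increase relevance at the cost of smaller reference sets.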

This does not, however, completely solve the original problem. Suppose, for instance, that by chance, . By definition, the randomization distribution of the as-if analyses we introduced above will include the assignment such that , but not such that even though might be more relevant to than , in the sense that we may have . This issue does not affect validity, but it raises concerns about relevance when the observed assignment is close to the boundary of a set . Informally, we would like to choose in such a way that the observed assignment is at the center of the set, as measured by covariate balance. For instance, fixing , we would like to construct an as-if procedure that randomizes within a set of the form , rather than . A naive approach would be to use the design mapping , but this is not a conditional design mapping. Branson and Miratrix (2019) discussed a similar approach in the context of randomization tests and also noted that it was not guaranteed to be valid.

Let’s explore further why does not have guaranteed validity. In this case, each assignment vector has an interval or window of acceptable covariate balances centered around it. The confidence interval for a given is guaranteed to have % coverage over all assignments within the window of covariate balances defined by set . So, if we built a confidence interval for each assignment in , using the design conditioning on , % of those intervals would cover the truth. However, we can only ever observe these intervals for those assignments with exactly the same covariate balance as . Furthermore, there are no guarantees about which assignments will result in a confidence interval covering the truth. Over the smaller subset of assignments with exactly the same covariate balance as , which lead to the same design over , the coverage may be less than %.

To build a better solution, we need more flexible tools. The following section will discuss how we can be more flexible, while still guaranteeing validity, by introducing some randomness.

## 4 Stochastic conditional as-if

The setting of Example 3 posed a problem of how to build valid procedures that allow the design mapping to vary based on the assignment. That is, we want to avoid making a strict partition of the assignment space but still guarantee validity. We can do this by introducing some randomness into our design map.

###### Definition 4.

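[Stochastic conditional design map] Consider an experiment with design $\eta_0$. For observed assignment $Z$, draw $\omega \sim m(\omega \mid Z)$ from a given distribution $m$, indexed by $Z$ with support on some set $\Omega$, and consider the design

$$\eta_\omega(Z') = \mathrm{pr}(Z' \mid \omega) = \frac{m(\omega \mid Z') \, \eta_0(Z')}{\sum_{Z'' \in \mathcal{Z}_{\eta_0}} m(\omega \mid Z'') \, \eta_0(Z'')}.$$

The mapping $H(Z) = \eta_\omega$, with $Z \in \mathcal{Z}_{\eta_0}$ and $\omega \sim m(\omega \mid Z)$, is called a stochastic design mapping.

This is our bit of randomness that will allow us to blend our conditional maps to regain validity. In the special case where the distribution $m(\omega \mid Z)$ degenerates into $\mathbb{1}\{\omega = w(Z)\}$, Definition 4 is equivalent to Definition 3. When $m$ is non-degenerate, the stochastic design map becomes a random function.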

Before stating our theoretical result for stochastic design maps, we first examine how the added flexibility that these maps afford can be put to use in the context of Example 3. Let , , and define

Our selects a near the observed imbalance. Having observed and drawn , we then consider the design

with normalizing factor . In other words, we analyze the experiment by restricting the randomization to a set . Comparing to our original randomization set , we see that while is a set of imbalances centered on observed , is only centered on on average over draws of . The following theorem guarantees that this stochastic procedure is valid. The proof is in supplementary material A.

###### Theorem 2.

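Consider a design $\eta_0$ and a variable $\omega$, with conditional distribution $m(\omega \mid Z)$. Then an oracle procedure built at level $1 - \gamma$ with the stochastic conditional design map $H$, which draws $\omega \sim m(\omega \mid Z)$ and maps $Z$ to $\mathrm{pr}(Z \mid \omega)$, is $\eta_0$-valid at level $1 - \gamma$.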
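The validity claim can be checked by exact enumeration for one particular, hypothetical choice of the conditional distribution. In the sketch below, the random draw picks one of two offset grids of width `c` with equal probability and reports which bin of that grid contains the observed imbalance; the reference design given the draw is then obtained by Bayes' rule. The covariates, potential outcomes, balance measure, and grid construction are all assumptions made for illustration and are not taken from the paper.

```python
from fractions import Fraction
from itertools import combinations
from math import floor

def diff_in_means(z, y1, y0):
    """Difference in means using only the outcomes revealed by assignment z."""
    treated = [y1[i] for i, zi in enumerate(z) if zi == 1]
    control = [y0[i] for i, zi in enumerate(z) if zi == 0]
    return Fraction(sum(treated), len(treated)) - Fraction(sum(control), len(control))

def oracle_quantiles(design, estimator, theta, gamma):
    """L: largest support point of D = estimator(Z) - theta with pr(D < L) <= gamma/2;
    U: smallest support point with pr(D > U) <= gamma/2."""
    mass = {}
    for z, p in design.items():
        d = estimator(z) - theta
        mass[d] = mass.get(d, 0) + p
    values = sorted(mass)
    lower, below = values[0], 0
    for v in values:
        if below <= gamma / 2:
            lower = v
        below += mass[v]
    upper, above = values[-1], 0
    for v in reversed(values):
        if above <= gamma / 2:
            upper = v
        above += mass[v]
    return lower, upper

n, n1, c = 6, 3, 4
x = [1, 2, 4, 8, 16, 32]          # hypothetical covariates
y0 = [0, 1, 3, 6, 20, 30]         # hypothetical potential outcomes
y1 = [2, 1, 5, 9, 20, 35]
theta = Fraction(sum(a - b for a, b in zip(y1, y0)), n)
gamma = Fraction(1, 5)

support = [tuple(int(i in s) for i in range(n)) for s in combinations(range(n), n1)]
eta0 = {z: Fraction(1, len(support)) for z in support}   # complete randomization
estimator = lambda z: diff_in_means(z, y1, y0)
imbalance = lambda z: abs(diff_in_means(z, x, x))

def m(z):
    """Hypothetical m(. | z): choose grid shift 0 or 1 with probability 1/2 each,
    then report (shift, bin) for the width-c bin containing the imbalance of z."""
    return {(s, floor((imbalance(z) + s * Fraction(c, 2)) / c)): Fraction(1, 2)
            for s in (0, 1)}

def design_given(omega):
    """Bayes' rule: pr(z | omega) is proportional to m(omega | z) * eta0(z)."""
    weights = {z: m(z).get(omega, 0) * p for z, p in eta0.items()}
    norm = sum(weights.values())
    return {z: wgt / norm for z, wgt in weights.items() if wgt > 0}

def covers(z, design):
    """True when the oracle interval built from `design` at z contains theta."""
    lower, upper = oracle_quantiles(design, estimator, theta, gamma)
    return lower <= estimator(z) - theta <= upper

# Coverage over the joint randomness of the assignment and the auxiliary draw.
coverage = sum(p * q * covers(z, design_given(omega))
               for z, p in eta0.items() for omega, q in m(z).items())
print(float(coverage), "versus nominal", float(1 - gamma))
```

The guarantee here is over both sources of randomness: conditional on the draw, the assignment follows exactly the reference design used to build the interval.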

Stochastic conditional design maps mirror conditioning mechanisms introduced by Basse et al. (2019) in the context of randomization tests. Inference is also stochastic here in the sense that a single draw of $\omega$ determines the final reference distribution to calculate the confidence interval. We discuss some practical challenges with this approach in supplementary material B.

The randomness intrinsic to this method is similar to the introduction of randomness into uniformly most powerful (UMP) tests (see Lehmann and Romano, 2005). Instead of having one’s inference depend on a single draw of $\omega$, one can use fuzzy confidence intervals (see Geyer and Meeden, 2005) to marginalize over the distribution $m(\omega \mid Z)$. In a fuzzy interval, similar to fuzzy sets, membership in the interval is not binary but rather allowed to take values in $[0, 1]$. We discuss this further in supplementary material D.

## 5 Discussion: Implications for Matching

We next illustrate how the as-if framework and theory we have developed can be applied to evaluate a particular analysis method. As an example, we consider analyzing data matched post-randomization as if it were pair randomized. Matching is a powerful tool for extracting quasi-randomized experiments from observational studies (Ho et al., 2007; Rubin, 2007; Stuart, 2010). To highlight the conceptual difficulty with post-matching analysis, we consider the idealized setting where treatment is assigned according to a known Bernoulli randomization mechanism, $\eta_0$, and matching is performed subsequently. Specifically, units are assigned to treatment independently with probability $\pi(X_i)$, where $X_i$ is a unit-level covariate vector. In addition, for simplicity, we focus on pair matching, where treated units are paired to control units with similar covariate values. One way to analyze the pairs is as if the randomization had been a matched pairs experiment. Although this method of analysis has already received scrutiny in the literature (see, e.g., Schafer and Kang, 2008; Stuart, 2010), it is worth asking: is this a conditional design map with guaranteed validity?

If we can exactly match on $X_i$, then the situation is identical to that of Example 2; the as-if pair randomized design map is a conditional design map, and the procedure is therefore guaranteed to be valid. Exact matching is, however, often hard to achieve in practice. Instead, we generally rely on approximate matching, in which the covariate distance between the units within a pair is small, but not zero. Unfortunately, we will show that with approximate matching, the as-if pair randomized design map is not a conditional design map. To make this formal, let $M$ be a matching algorithm which, given an assignment and fixed covariates, returns a set of pairs. Assuming a deterministic matching algorithm, let $M(Z)$ be the matching obtained from an observed assignment $Z$. Treating the matched data as a matched pairs experiment implies analyzing over all assignment vectors that permute the treatment assignment within the pairs. Let $Z'$ be an assignment obtained by such a permutation. A necessary condition for pairwise randomization to be a conditional procedure is that $M(Z') = M(Z)$; that is, for any permutation of treatment within pairs, the matching algorithm must return the original matches.

This condition for a valid conditional procedure is not guaranteed. To illustrate, consider the first three steps inside the light grey box in Figure 2, in which we consider a greedy matching algorithm. If we analyze the matched data as if it were a matched pairs design, then the permutation shown is allowable by the design. However, we see in the dotted rectangle of Figure 2 that if we had observed that permutation as the treatment assignment, we would have ended up matching the units differently, and therefore would have conducted a different analysis. This is essentially the issue we encountered earlier in Section 4; we have not created a partition of our space.
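A small numerical case makes the failure of this condition concrete. The sketch below uses a hypothetical greedy nearest-neighbor matcher (treated units are processed in index order and each takes the closest unmatched control) with four units and one scalar covariate; neither the matcher nor the covariate values are taken from Figure 2.

```python
def greedy_match(z, x):
    """Greedy 1:1 matching: each treated unit, in index order, takes the nearest
    control that has not been matched yet. Returns a set of unordered pairs."""
    treated = [i for i, zi in enumerate(z) if zi == 1]
    controls = [i for i, zi in enumerate(z) if zi == 0]
    pairs = set()
    for t in treated:
        c = min(controls, key=lambda j: abs(x[t] - x[j]))
        controls.remove(c)
        pairs.add(frozenset((t, c)))
    return pairs

def flip_pair(z, pair):
    """Swap treatment and control labels within a single matched pair."""
    return tuple(1 - zi if i in pair else zi for i, zi in enumerate(z))

x = [0.0, 1.0, 1.9, 3.0]          # hypothetical covariate values
z_obs = (0, 1, 0, 1)              # units 1 and 3 treated

matches_obs = greedy_match(z_obs, x)
# One assignment that a matched-pairs analysis would place in its reference set.
z_perm = flip_pair(z_obs, frozenset((0, 3)))
matches_perm = greedy_match(z_perm, x)

print(sorted(sorted(p) for p in matches_obs))    # [[0, 3], [1, 2]]
print(sorted(sorted(p) for p in matches_perm))   # [[0, 2], [1, 3]]
print(matches_perm == matches_obs)               # False
```

Here the permuted assignment is allowed by the as-if matched pairs design, yet rerunning the matcher on it produces different pairs, so the sets of within-pair permutations do not partition the assignment space.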

The upshot is that when matching is not exact, analyzing the data as if it came from a paired-randomized study cannot be justified by a conditioning argument. A proper conditional analysis would need to take into account the matching algorithm. Specifically, let be the set of all treatment assignment vectors that are permutations of treatment within a set of matches and let be the set of assignments that would lead to matches using algorithm . Then