Companies and charitable institutions may offer a treatment to their most deserving subjects as ranked by a running variable of interest. Subjects get the treatment if and only if their running variable $x$ satisfies $x \ge t$, for some threshold $t$.
To estimate the causal effect of the treatment, the investigators can use a regression discontinuity design (RDD) (Thistlethwaite and Campbell, 1960). Unfortunately, an RDD analysis can have very high variance (Goldberger, 1972; Jacob et al., 2012; Gelman and Imbens, 2017). In settings where experimenters can control the assignment, they can instead opt for a tie-breaker design (Lipsey et al., 1981; Trochim and Cappelleri, 1992). There, top ranked subjects get the treatment, the lowest ranked ones are placed in the control group, and a cohort in the middle is randomized to treatment or control. This hybrid between an RDD and a fully randomized controlled trial (RCT) allows the experimenter to trade off statistical efficiency against a preference to treat the more deserving subjects.
Past settings for tie-breaker designs have included the offer of remedial English to incoming university students based on their high school English proficiency (Aiken et al., 1998) and scholarship offers for two and four year colleges based on a judgment of the applicants' needs and academic strengths (Abdulkadiroglu et al., 2017; Angrist et al., 2020). We expect a growing number of settings where the experimenter can control the assignment. An airline that offers a perk such as a seat upgrade or a lounge pass to its most loyal customers could easily decide instead to randomize near the assignment threshold. Now that so many companies interact with their customers through digital interfaces, opportunities to use a tie-breaker design should increase greatly.
A simple model in which to compare an RDD to an RCT is a two-line regression on a running variable, where both slope and intercept vary with the treatment level. Owen and Varian (2020) study the tie-breaker design for this model using a three level design with 0, 50 and 100 percent treatment allocations. They find that statistical efficiency increases monotonically with the size of the group receiving randomized treatment, the RCT being most efficient and the RDD least efficient. The extreme efficiency ratios between the RCT and the RDD were found earlier by Goldberger (1972) and Jacob et al. (2012).
While an RCT might be most statistically efficient, it is not attractive in our motivating settings. Whether students are ranked by ability or financial need or some combination of those, it makes sense to give the higher ranked students a greater chance of getting a scholarship. In commercial settings, the incremental value to the company from offering a perk would often increase with customer loyalty, making the RCT economically unattractive.
Choosing the amount of randomness in a tie-breaker design presents an exploration-exploitation tradeoff, where more randomness increases statistical efficiency (exploration) at the expense of short-term value (exploitation). Owen and Varian (2020, Section 7) studied this tradeoff. One of their findings was that the entire optimal exploration-exploitation tradeoff curve could be obtained from a three level design with a treatment probability that was always $0$, $1/2$ or $1$. In particular, there was no benefit to using a sliding scale for the treatment probability. That work considered both a uniform and a Gaussian distribution for the running variable and, more importantly, it required half of the subjects to receive the treatment. In this article we show strong advantages to moving away from that three level design under more general conditions. We consider a setting where the running variables $x_i$ are identically distributed from any mean-centered distribution with positive variance. Then we show how to optimize the entire exploration-exploitation tradeoff curve, given any desired fraction of treated subjects.
The precise design problem a user faces in our setting is to choose treatment probabilities $p_i$ for subjects $i = 1, \dots, n$ based on their running variables $x_i$. We will require $p_i = p(x_i)$ for some function $p(\cdot)$. This has the effect of forcing $p_i = p_j$ whenever $x_i = x_j$. The user has to make this decision before observing the response values $y_i$. Many popular efficiency metrics are convex in the $p_i$, in which case the optimal $p_i$ can be found numerically via convex optimization, as Morrison and Owen (2022) do for vector valued running variables. However, our setting with univariate $x_i$ is tractable enough to provide an explicit characterization of the optimal $p_i$, even if the efficiency metric is non-convex. Given equality constraints on the expected proportion of subjects to be treated and on the short-term value, our optimal treatment probabilities are piecewise constant in $x$ with a small number of distinct treatment probability levels. If we further impose a monotonicity constraint requiring treatment probabilities to be non-decreasing in $x$, then there always exists an optimal design with just two treatment levels when the running variable distribution $F$ is continuous; general $F$ may require an additional level at a single running variable value. Such a constraint prevents, for example, more qualified students from having a lower chance of getting a scholarship than less qualified ones. We give precise equations and efficient methods to compute the levels and cut points. These operationally simple designs yield substantial efficiency improvements over the standard three level tie-breaker design, without sacrificing short-term value.
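The efficiency comparison motivating these designs can be sketched numerically. The computation below is illustrative rather than taken from the paper: we assume a $\mathcal{U}(-1,1)$ running variable, treatment coded $z = \pm 1$, and regressors $(1, x, z, xz)$, and compute the limiting scaled variance of the interaction coefficient under an RCT, a three level tie-breaker, and a sharp RDD.

```python
import numpy as np

# Midpoint grid approximating a U(-1, 1) running variable.
N = 200_000
dx = 2.0 / N
x = -1.0 + (np.arange(N) + 0.5) * dx
w = np.full(N, dx / 2.0)           # density 1/2 on (-1, 1); weights sum to 1

def beta3_variance(p):
    """Limiting scaled variance proxy for the interaction coefficient:
    the (4,4) element of the inverse expected information matrix for
    regressors (1, x, z, xz), using E[z | x] = 2 p(x) - 1."""
    e = 2.0 * p - 1.0              # conditional mean of z given x
    Ez, Ex = np.sum(w * e), np.sum(w * x)
    Ex2 = np.sum(w * x**2)
    Exz, Ex2z = np.sum(w * x * e), np.sum(w * x**2 * e)
    M = np.array([[1.0, Ex,   Ez,   Exz],
                  [Ex,  Ex2,  Exz,  Ex2z],
                  [Ez,  Exz,  1.0,  Ex],
                  [Exz, Ex2z, Ex,   Ex2]])
    return np.linalg.inv(M)[3, 3]

v_rct = beta3_variance(np.full(N, 0.5))             # RCT, p = 1/2 everywhere
v_rdd = beta3_variance((x >= 0).astype(float))      # sharp RDD at 0
p_tb = np.where(x < -0.5, 0.0, np.where(x > 0.5, 1.0, 0.5))
v_tb = beta3_variance(p_tb)                         # three level tie-breaker

print(v_rct, v_tb, v_rdd)
```

All three designs treat half the subjects; the printed ordering, RCT smallest and RDD largest, matches the monotone efficiency finding cited above, while the specific values depend on the assumed running variable distribution.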
The results apply to the two-line regression model mentioned above, which takes the form
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i + \varepsilon_i, \qquad i = 1, \dots, n. \qquad (1)$$
Here $y_i$ is the response of interest, $x_i$ is the running variable, $z_i \in \{-1, 1\}$ is the treatment indicator (with $z_i = 1$ indicating treatment), $\beta = (\beta_0, \beta_1, \beta_2, \beta_3)^\top$ is the parameter vector to be estimated, $\varepsilon_i$ is homoskedastic noise, and $i$ indexes the subjects. The incremental value of the treatment is described by the parameters $\beta_2$ and $\beta_3$, which we therefore consider to be the most important.
When the responses $y_i$ are available, the investigator has many more choices than the model above. Indeed, it is now popular for the RDD to be studied with kernel regression methods; see for instance Hahn et al. (2001), Calonico et al. (2014) and Imbens and Kalyanaraman (2012). Kluger and Owen (2021) find that in a locally weighted version of equation (1), estimation of the causal effect at the threshold is more efficient under a tie-breaker design than under an RDD, for any choice of kernel bandwidth. The strategy for allocating treatments must however be chosen prior to measuring the $y_i$, and for that we use (1) as a working model. When the $x_i$ but not the $y_i$ are available, Owen and Varian (2020, Section 8) present some numerical methods to study the efficiency tradeoff. In the present work, we stick to the two-line model for simplicity. We suppose that the investigator designs the study with a prior idea of the range of running variables over which linearity is reasonable. If the analysis ends up as a local linear regression with a rectangular or boxcar kernel over a subset of that range, then the design they chose may not be optimal for that range but will generally still be better than an RDD (Kluger and Owen, 2021).
An outline of this paper is as follows. Section 2 provides some notation and assumptions, and explains our measures of statistical efficiency and short-term value. Section 3 shows that the optimal exploration-exploitation trade-off can always be attained by a convex combination of two simple, essentially deterministic treatment assignments. “Essentially” here means that for distributions $F$ with atoms, the optimal designs might randomize treatment at one or two of the atoms. We repeat this analysis after imposing a monotonicity constraint in Section 4, under which the optimal trade-off can be attained using an assignment with just two different treatment probability levels, one for $x < t$ and one for $x \ge t$ for some threshold $t$, when $F$ is continuous. Armed with these designs, Section 5 describes the shape of the optimal exploration-exploitation trade-off curve, and then Section 6 illustrates this curve for some different distributions $F$ to demonstrate the efficiency gains over the standard three level tie-breaker. We also provide a fixed-$x$ example of this curve based on data from Head Start, a government assistance program for low-income children. Section 7 provides a summary and discussion of the main results. There are appendices with some of the proofs as well as some R code to compute the optimal two level monotone design that we think most investigators will want to use.
2 Setup and notation
Our formulation of the optimal design problem under the two-line model (1) is based on the classical literature on optimal design in multiple linear regression models, e.g., St. John and Draper (1975). In the simplest setting, the user fits a standard linear model with the goal of selecting covariate values to optimize some criterion based on the information matrix $X^\top X$, where $X$ is the design matrix whose $i$-th row contains the covariates of subject $i$. Perhaps the most common such criterion is D-optimality, which corresponds to maximizing $\det(X^\top X)$. Another popular choice is C-optimality, which minimizes $c^\top (X^\top X)^{-1} c$ for some choice of vector $c$. This can be interpreted as $\operatorname{Var}(c^\top \hat\beta)/\sigma^2$, where $\hat\beta$ is the ordinary least squares estimator.
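Both criteria are easy to compute for a concrete design matrix. The sketch below uses an arbitrary simulated design (our own illustration, not the paper's setting) with the regressors $(1, x, z, xz)$ of model (1), and a $c$ vector that targets the last (interaction) coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-1, 1, n)
z = np.where(x >= 0, 1.0, -1.0)                  # a sharp RDD assignment
X = np.column_stack([np.ones(n), x, z, x * z])   # rows (1, x_i, z_i, x_i z_i)

XtX = X.T @ X
d_crit = np.linalg.det(XtX)           # D-optimality: larger is better

c = np.array([0.0, 0.0, 0.0, 1.0])    # single out the interaction coefficient
c_crit = c @ np.linalg.inv(XtX) @ c   # C-optimality: Var(c' beta_hat) / sigma^2

print(d_crit, c_crit)
```

Choosing a different assignment rule for $z$ changes both criteria, which is exactly the degree of freedom studied in this paper.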
Such optimality criteria are often relaxed via design measures. A design measure $\xi$ is a probability distribution on the space of covariates, and the relaxed optimal design problem corresponds to optimizing the desired functional of the information matrix $M(\xi) = \int v v^\top \,\mathrm{d}\xi(v)$ over some space $\Xi$ of design measures. For instance, a design measure $\xi$ is D-optimal (for the relaxed problem) if it maximizes $\det(M(\xi))$ over $\Xi$, and C-optimal if it minimizes $c^\top M(\xi)^{-1} c$. This relaxation avoids cumbersome details arising from the finiteness of $n$. The original problem corresponds to restricting $\Xi$ to only consist of discrete probability distributions supported on at most $n$ distinct points with probabilities that are multiples of $1/n$ (Kiefer, 1959).
Our setting is akin to the relaxed optimal design problem in that the user has full control over the conditional distribution of the treatment indicator $z$ given the running variable $x$. However, standard results, such as the general equivalence theorem of Kiefer and Wolfowitz (1960), do not apply directly, since we do not have control over the entire joint distribution of $(x, z)$, which determines that of the covariate vector $(1, x, z, xz)^\top$. Instead, the running variables are externally determined; we assume they are identically distributed samples from some known distribution $F$. This assumption encompasses the fixed-$x$ setting, since in that case we can take $F$ to be the empirical distribution of the $x_i$. Then expectations under sampling from $F$ are the same as expectations with the $x_i$ fixed. We take $F$ to be any distribution with mean 0 and positive, finite variance. Through treatment assignment, the experimenter can control the conditional distribution of $z$ given $x$. We defer further discussion of efficiency to Section 2.1, which shows that D-optimality and C-optimality for our problem coincide with the efficiency metric used by Owen and Varian (2020).
With $x$ viewed as random, we are no longer optimizing a finite set of treatment probabilities $p_1, \dots, p_n$, but rather specifying a design function $p(\cdot)$ to assign treatment probabilities via $\Pr(z_i = 1 \mid x_i) = p(x_i)$. All statements involving equality or uniqueness of design functions are assumed to hold on a set of $x$ values with probability 1 under $F$, and all design functions are assumed to be measurable. We introduce some notation for certain forms of the design function $p$. We will commonly encounter designs of the form
$$p(x) = \mathbf{1}\{x \in S\} \qquad (2)$$
for a set $S \subseteq \mathbb{R}$. Another important special case consists of two level designs
$$p(x) = p_1 \mathbf{1}\{x < t\} + p_2 \mathbf{1}\{x \ge t\} \qquad (3)$$
for treatment probabilities $p_1, p_2 \in [0, 1]$ and a threshold $t$. For example, a sharp RDD with threshold $t$ corresponds to $p_1 = 0$ and $p_2 = 1$, while $p_1 = p_2 = p$ gives an RCT with treatment probability $p$, as does the same choice for any other threshold $t$.
The condition $p_1 \le p_2$ ensures that $p(x)$ is nondecreasing in $x$; we refer to such designs as monotone. In a monotone design, a subject cannot have a lower treatment probability than another subject with a lower running variable. We also define a symmetric design to be one for which $p(-x) = 1 - p(x)$; for instance, the three level tie-breaker of Owen and Varian (2020) is both monotone and symmetric and defined for $\Delta \in [0, 1]$ by
$$p(x) = \begin{cases} 0, & x < -\Delta \\ 1/2, & -\Delta \le x \le \Delta \\ 1, & x > \Delta \end{cases} \qquad (4)$$
for $x$ transformed to be $\mathcal{U}(-1, 1)$. Note that $\Delta = 0$ gives a sharp RDD and $\Delta = 1$ gives an RCT.
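These design function forms are simple to state in code. A minimal sketch, with function names of our own choosing, checking the monotone and symmetric properties on a grid:

```python
import numpy as np

def two_level(x, p1, p2, t=0.0):
    """Two level design: probability p1 below the threshold t, p2 at or above."""
    x = np.asarray(x, dtype=float)
    return np.where(x < t, p1, p2)

def three_level(x, delta):
    """Owen-Varian style three level tie-breaker on a U(-1,1) running variable:
    control below -delta, randomize with probability 1/2 in the middle,
    treat above delta."""
    x = np.asarray(x, dtype=float)
    return np.where(x < -delta, 0.0, np.where(x > delta, 1.0, 0.5))

x = np.linspace(-1, 1, 1001)
p = three_level(x, 0.4)
assert np.all(np.diff(p) >= 0)                     # monotone in x
assert np.allclose(three_level(-x, 0.4), 1 - p)    # symmetric: p(-x) = 1 - p(x)
assert np.all(two_level(x, 0.0, 1.0) == (x >= 0))  # sharp RDD as a special case
print("all design property checks passed")
```

The asserts simply confirm the definitions: a three level tie-breaker is both monotone and symmetric, and a two level design with levels 0 and 1 reduces to a sharp RDD.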
Combining all three parts of our model, namely the two-line regression model (1), the distribution $F$, and the design function $p$, we have identically distributed triples $(x_i, z_i, y_i)$ which we use to estimate the coefficient vector $\beta$ with the ordinary least squares estimator $\hat\beta$. The experimenter's choice of $p$ influences statistical efficiency, but there are other considerations. In our motivating contexts we take $y$ to be something like economic value or student success, in which case larger $y$ is better. In many but not all of these settings we will have $\beta_3 > 0$. Recalling that $\mathbb{E}(x) = 0$, the average value of $y$ per customer under (1) and design function $p$ is
$$\mathbb{E}_p(y) = \beta_0 + \beta_2 \mathbb{E}_p(z) + \beta_3 \mathbb{E}_p(xz). \qquad (5)$$
Implicit in these expectations is a dependence on $F$. The subscript $p$ indicates the dependence on the design function. We omit the subscript when the expectation does not involve $z$ or when it is understood which $p$ is relevant.
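The decomposition of the average value can be checked by simulation. Under model (1) with a mean-zero running variable, $\mathbb{E}(y) = \beta_0 + \beta_2 \mathbb{E}(z) + \beta_3 \mathbb{E}(xz)$, so the $\beta_1$ term drops out. A quick Monte Carlo sketch with hypothetical coefficient values and a design function chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
b0, b1, b2, b3 = 1.0, 2.0, 0.5, 0.25          # hypothetical coefficients

x = rng.uniform(-1, 1, n)                     # mean-zero running variable
p = np.clip(0.5 + 0.5 * x, 0, 1)              # an illustrative design function
z = np.where(rng.random(n) < p, 1.0, -1.0)    # treatment coded z = +/- 1
y = b0 + b1 * x + b2 * z + b3 * x * z + rng.normal(size=n)

Ez, Exz = np.mean(z), np.mean(x * z)
lhs = np.mean(y)
rhs = b0 + b2 * Ez + b3 * Exz                 # the b1 term drops out
print(lhs, rhs)
```

The two printed values agree up to Monte Carlo error, illustrating that only the terms involving $z$ are under the experimenter's control.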
Equation (5) shows that our design choice is unaffected by $\beta_1$. We assume that the proportion of treated subjects is fixed by an external budget, which is equivalent to an equality constraint $\mathbb{E}_p(z) = \bar z$ for some $\bar z \in (-1, 1)$. For instance, there would ordinarily be a set number of scholarships or perks to be given out. The only term in (5) affected by the design is then the short-term gain $\mathbb{E}_p(xz)$, as in Owen and Varian (2020). For $\beta_3 > 0$, the short-term average value per customer grows with $\mathbb{E}_p(xz)$, and we would want that value to be large. Even if $\beta_3 \le 0$, $\mathbb{E}_p(xz)$ is still a natural metric to quantify the short-term value from treating subjects with high values of $x$, as it is the covariance between $x$ and the treatment indicator $z$. Thus, to find the optimal exploration-exploitation trade-off curve, we seek a solution to the following problem:
$$\max_{p \in \mathcal{P}} \ \mathrm{Eff}(p) \quad \text{subject to} \quad \mathbb{E}_p(z) = \bar z \quad \text{and} \quad \mathbb{E}_p(xz) = \gamma. \qquad (6)$$
Here $\mathrm{Eff}(p)$ is the statistical efficiency defined in Section 2.1 and $\mathcal{P}$ is a collection of measurable design functions $p: \mathbb{R} \to [0, 1]$. For our purposes $\mathcal{P}$ is either the set of all such functions (Section 3) or the subset of all monotone design functions (Section 4). We specify lower and upper bounds on $\gamma$ (which depend on $\bar z$) in Section 2.2.
Finally, we note that the common finite variance $\sigma^2$ of the errors $\varepsilon_i$ cancels in variance ratios. As a result, we lose no generality in assuming $\sigma^2 = 1$ throughout.
2.1 D-optimal and C-optimal design
Let $v_i = (1, x_i, z_i, x_i z_i)^\top$, so that the design matrix $X$ of the two-line model (1) satisfies
$$\mathbb{E}\Bigl(\frac{1}{n} X^\top X\Bigr) = \mathbb{E}(v v^\top) \equiv \mathcal{I} = \begin{pmatrix} A & B \\ B & A \end{pmatrix}, \qquad A = \begin{pmatrix} 1 & \mathbb{E}(x) \\ \mathbb{E}(x) & \mathbb{E}(x^2) \end{pmatrix}, \quad B = \begin{pmatrix} \mathbb{E}(z) & \mathbb{E}(xz) \\ \mathbb{E}(xz) & \mathbb{E}(x^2 z) \end{pmatrix},$$
where we have used $z^2 = 1$ and dropped the subscript $i$ inside the expectations. We emphasize that $\mathcal{I}$ depends on $F$ and the design function $p$, and that $\mathcal{I}$ corresponds exactly to the fixed-$x$ “non-relaxed” information matrix $\frac{1}{n} X^\top X$ if the $x_i$ are centered and $F$ is their empirical distribution. Thus, the relaxed D-optimal design problem in our random-$x$ setting corresponds to finding $p$ to maximize $\det(\mathcal{I})$, given $F$. Exact D-optimality for fixed $x_i$ is the same if we take $F$ to be the empirical distribution of the $x_i$.

Since we have assumed that $\mathbb{E}(x^2) > 0$, the matrix $A$ is invertible and then
$$\det(\mathcal{I}) = \det(A)\,\det(A - B A^{-1} B).$$
Because $\mathcal{I}$ is the expected value of the positive semi-definite rank one matrix $v v^\top$, it is also positive semi-definite. Therefore the Schur complement $S \equiv A - B A^{-1} B$ must have nonnegative determinant, which is strictly positive iff $\mathcal{I}$ is invertible. Additionally, since $\det(A)$ depends only on $F$, D-optimality is equivalent to maximizing $\det(S)$.
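The block determinant identity underlying this reduction is generic linear algebra and can be verified numerically. In the sketch below, the moment values are illustrative (they correspond to a three level tie-breaker with randomization half-width $0.5$ on a $\mathcal{U}(-1,1)$ running variable), and the check is that $\det(M) = \det(A)\det(D - B^\top A^{-1} B)$ for the $2 \times 2$ blocks of a $4 \times 4$ symmetric matrix.

```python
import numpy as np

# Information matrix for regressors (1, x, z, xz); illustrative moment values.
M = np.array([[1.0,   0.0,   0.0,   0.375],
              [0.0,   1/3,   0.375, 0.0  ],
              [0.0,   0.375, 1.0,   0.0  ],
              [0.375, 0.0,   0.0,   1/3  ]])

A = M[:2, :2]       # block for (1, x): fixed by the running variable alone
B = M[:2, 2:]       # cross moments involving z, controlled by the design
D = M[2:, 2:]       # block for (z, xz); equals A here since z^2 = 1

schur = D - B.T @ np.linalg.inv(A) @ B
# det(M) = det(A) * det(Schur complement). With det(A) fixed by F,
# D-optimality reduces to maximizing det(schur).
print(np.linalg.det(M), np.linalg.det(A) * np.linalg.det(schur))
```

The two printed numbers agree to machine precision, and only the Schur factor depends on the design.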
We now observe that this D-optimality criterion corresponds exactly with the efficiency metric of Owen and Varian (2020), who consider the asymptotic variance of $\hat\beta_3$. If $\mathcal{I}$ is invertible, then additionally assuming the $(x_i, z_i, \varepsilon_i)$ are independent, by the law of large numbers
$$\frac{1}{n} X^\top X \to \mathcal{I} \quad \text{almost surely as } n \to \infty.$$
In particular, $n (X^\top X)^{-1} \to \mathcal{I}^{-1}$. Since
$$\mathcal{I}^{-1} = \begin{pmatrix} S^{-1} & -A^{-1} B S^{-1} \\ -S^{-1} B A^{-1} & S^{-1} \end{pmatrix}, \qquad S = A - B A^{-1} B,$$
by standard block matrix inversion formulas, we see that
$$n \operatorname{Var}(\hat\beta_3 \mid x_1, z_1, \dots, x_n, z_n) \to [\mathcal{I}^{-1}]_{44} = \frac{S_{11}}{\det(S)} = \frac{1}{\mathrm{Eff}(p)}, \qquad (10)$$
where $\mathrm{Eff}(p)$ is the statistical efficiency of the design $p$. If $\mathbb{E}(x^4) < \infty$, then these convergences occur at the rate $O_p(n^{-1/2})$. Avoiding degenerate cases, we say that a valid design has $S_{11} > 0$. We will show in Corollary 1 that any design satisfying the constraints in (6) is valid, so $\mathrm{Eff}(p) = \det(S)/S_{11}$ is always well-defined and nonnegative. Additionally, since $S_{11} = 1 - \mathbb{E}_p(z)^2 - \mathbb{E}_p(xz)^2/\mathbb{E}(x^2)$ only depends on $p$ through $\mathbb{E}_p(z)$ and $\mathbb{E}_p(xz)$, which are fixed in our optimization problem, maximizing $\mathrm{Eff}(p)$ is equivalent to maximizing $\det(S)$, i.e., D-optimality.
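The interpretation of $1/\mathrm{Eff}$ as a limiting scaled variance can be checked by simulation. The Monte Carlo sketch below uses entirely hypothetical settings (a three level tie-breaker with randomization half-width $0.5$, a $\mathcal{U}(-1,1)$ running variable, unit noise, and arbitrary coefficients) and estimates $n \operatorname{Var}(\hat\beta_3)$ across replications.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, delta = 2000, 500, 0.5

b3_hats = []
for _ in range(reps):
    x = rng.uniform(-1, 1, n)
    p = np.where(x < -delta, 0.0, np.where(x > delta, 1.0, 0.5))
    z = np.where(rng.random(n) < p, 1.0, -1.0)
    # Model (1) with arbitrary coefficients and unit noise variance.
    y = 1.0 + 2.0 * x + 0.5 * z + 0.25 * x * z + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x, z, x * z])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    b3_hats.append(beta_hat[3])

mc_var = n * np.var(b3_hats)   # Monte Carlo estimate of n Var(beta3_hat)
print(mc_var)
```

For this design the limiting value, computed from the inverse expected information matrix, is roughly $5.2$; the Monte Carlo estimate lands near it, between the RCT value of about $3$ and the sharp RDD value of about $12$.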
Equation (10) shows that $1/\mathrm{Eff}(p)$ is essentially an asymptotic C-optimality criterion with $c = (0, 0, 0, 1)^\top$. The caveat is that $1/\mathrm{Eff}(p)$ corresponds to a limiting conditional variance of $\sqrt{n}\,\hat\beta_3$, given both the $x_i$ and the $z_i$. As the design is chosen before $z$ (and $x$, under the random-$x$ framework) is observed, it would be desirable if $1/\mathrm{Eff}(p)$ could be interpreted as a limiting unconditional variance. We note that so long as the $\sqrt{n}\,\hat\beta_3$ have finite variance, $\operatorname{Var}(\sqrt{n}\,\hat\beta_3) = \mathbb{E}[n \operatorname{Var}(\hat\beta_3 \mid x, z)]$, since $\hat\beta_3$ is conditionally unbiased given the $x_i$ and $z_i$. If we additionally assume the $n \operatorname{Var}(\hat\beta_3 \mid x, z)$ are uniformly integrable, $1/\mathrm{Eff}(p)$ is then precisely the limiting unconditional variance of $\sqrt{n}\,\hat\beta_3$ as $n \to \infty$.
The D-optimality interpretation of $\mathrm{Eff}(p)$ avoids such assumptions and asymptotics while unifying the fixed- and random-$x$ settings. The scaling of $\hat\beta_3$ by $\sqrt{n}$ is nevertheless convenient to make $1/\mathrm{Eff}(p)$ interpretable as an asymptotic variance.
There are other useful optimality criteria that may be of interest. For instance, the (asymptotic) variance of $\hat\beta_2$ is of interest, because $2\beta_2$ is the causal effect of the treatment at $x = 0$ under the model (1). Our methods can be easily extended to any continuous functional of $\mathcal{I}$; we briefly elaborate in Section 7. For concreteness, we focus on $\mathrm{Eff}(p)$ for the remainder of our discussion.
2.2 Bounds on short-term gain
Before studying optimal designs, we impose lower and upper bounds on the possible short-term gain constraints $\gamma$ to consider, for each possible $\bar z$. For an upper bound we use $\gamma_{\max}(\bar z)$, defined as the maximum $\mathbb{E}_p(xz)$ that can be attained by any design $p$ with $\mathbb{E}_p(z) = \bar z$. It turns out that this upper bound is uniquely attained for any $F$. If $F$ is continuous, it is uniquely attained by a sharp RDD. We remind the reader that uniqueness of a design function satisfying some property means that for any two design functions $p$ and $p'$ with that property, we must have $p = p'$ with probability 1 under $F$. In general, whether two design functions are equivalent depends on the choice of $F$.
We frequently use the identity
$$\mathbb{E}_p\bigl(h(x)\,z\bigr) = \mathbb{E}\bigl(h(x)\,(2p(x) - 1)\bigr) \qquad (11)$$
for any measurable $h$ with $\mathbb{E}|h(x)| < \infty$. Taking $h(x) = 1$ we have $\mathbb{E}_p(z) = g(\mathbb{E}(p(x)))$, where $g$ is the invertible mapping
$$g(u) = 2u - 1. \qquad (12)$$
In particular, $\mathbb{E}_p(z)$ is an increasing linear function of the expected proportion $\mathbb{E}(p(x))$ of treated subjects.
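The identity rests on $\mathbb{E}(z \mid x) = 2p(x) - 1$ when the treatment is coded $z = \pm 1$. A quick Monte Carlo check with an arbitrary logistic design function (chosen only for illustration) and several choices of $h$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
x = rng.normal(size=n)                       # any mean-zero running variable
p = 1.0 / (1.0 + np.exp(-2.0 * x))           # an arbitrary design function p(x)
z = np.where(rng.random(n) < p, 1.0, -1.0)   # sample z given x

diffs = []
for h in (lambda t: np.ones_like(t), lambda t: t, lambda t: t * t):
    lhs = np.mean(h(x) * z)                  # Monte Carlo E[h(x) z]
    rhs = np.mean(h(x) * (2.0 * p - 1.0))    # E[h(x) (2 p(x) - 1)]
    diffs.append(abs(lhs - rhs))
print(max(diffs))
```

The printed maximum discrepancy is small, on the order of the Monte Carlo error; the cases $h(x) = 1$ and $h(x) = x$ are exactly the quantities constrained in the optimization problem.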
Lemma 1. For any $\bar z \in (-1, 1)$ and running variable distribution $F$, there exists a unique design $p^*$ satisfying
$$\mathbb{E}_{p^*}(z) = \bar z \qquad (13)$$
and, for some threshold $t$,
$$p^*(x) = \mathbf{1}\{x > t\}, \qquad x \ne t. \qquad (14)$$
Any $p$ that satisfies (13) has
$$\mathbb{E}_p(xz) \le \mathbb{E}_{p^*}(xz),$$
with equality if and only if $p = p^*$ under $F$.
Notice that equation (14) does not specify $p^*$ at $x = t$. If $F$ is continuous, then any value for $p^*(t)$ yields an equivalent design function, but if $F$ has an atom at $t$ then we will require a specific value for $p^*(t)$. We allow $F$ to have atoms, to include empirical distributions. While we specify $p^*(t)$ in the proof of Lemma 1 below, for brevity later results of this type do not give the values of design functions at such discontinuity points. The threshold $t$ in (14) is essentially unique: if there is an interval of zero probability under $F$, then all step locations in that interval provide equivalent designs.
Proof. If $\bar z \in \{-1, 1\}$ then the only design functions satisfying (13) (again, up to uniqueness w.p.1 under $F$) are the constant functions $p \equiv 1$ and $p \equiv 0$, and the result holds trivially. Thus, we can assume that $\bar z \in (-1, 1)$. The existence of $p^*$ follows by choosing $t$ so that $F$ assigns the required probability to $\{x > t\}$, randomizing at an atom if necessary. By (11) and (13), for any $p$ satisfying (13),
$$\mathbb{E}_p(xz) - \mathbb{E}_{p^*}(xz) = 2\,\mathbb{E}\bigl[(x - t)\bigl(p(x) - p^*(x)\bigr)\bigr] \le 0,$$
with equality iff $(x - t)(p(x) - p^*(x)) = 0$ for a set of $x$ with probability one under $F$, i.e., iff $p$ satisfies (14) with probability one under $F$. ∎
Note that if $F$ is continuous, then $\gamma_{\max}(\bar z) > 0$ for all $\bar z \in (-1, 1)$. Next, by symmetry, the design that minimizes $\mathbb{E}_p(xz)$ over all designs with $\mathbb{E}_p(z) = \bar z$ is $\tilde p(x) = \mathbf{1}\{x < \tilde t\}$ (for $x \ne \tilde t$) for an appropriate threshold $\tilde t$. Notice that the minimal value is $-\gamma_{\max}(-\bar z)$. We will impose a lower bound of $0$ on the short-term gain of our designs. Designs with $\mathbb{E}_p(xz) < 0$ exist for all $\bar z$ but would not be practically relevant for our setting, as they represent scenarios where subjects with smaller $x$ are preferentially treated. We don't need designs with $\mathbb{E}_p(xz) < 0$ because $\mathbb{E}_p(xz) = 0$ can always be attained by the RCT $p(x) \equiv (1 + \bar z)/2$. We hence define the feasible input space to be
$$\mathcal{D} = \bigl\{(\bar z, \gamma) : -1 < \bar z < 1,\ 0 \le \gamma \le \gamma_{\max}(\bar z)\bigr\}.$$
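The content of Lemma 1, namely that among designs treating a fixed expected fraction of subjects, treating exactly the top quantile maximizes the short-term gain $\mathbb{E}(xz)$, is easy to see numerically. A sketch with simulated data (all numbers illustrative), comparing a sharp RDD against randomly generated designs with the same treated fraction:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=10_000)
x -= x.mean()                   # center the running variable, as assumed for F
target = 0.3                    # desired expected proportion treated, E[p(x)]

# Sharp RDD: treat exactly the top 30 percent of the running variable.
t = np.quantile(x, 1.0 - target)
p_rdd = (x >= t).astype(float)

def gain(p):
    # E[xz] via the conditional expectation E[z | x] = 2 p(x) - 1.
    return np.mean(x * (2.0 * p - 1.0))

# Random feasible competitors with the same treated fraction on average.
best_other = -np.inf
for _ in range(200):
    p = rng.random(x.size)
    p = np.clip(p * target / p.mean(), 0, 1)   # rescale so E[p(x)] = target
    best_other = max(best_other, gain(p))

print(gain(p_rdd), best_other)
```

The sharp RDD gain is far larger than that of any of the random competitors, which hover near zero because their treatment probabilities are unrelated to $x$.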
Our next result shows that $S_{11} = 1 - \bar z^2 - \gamma^2/\mathbb{E}(x^2) > 0$ for any design satisfying the constraints in (6) for some $(\bar z, \gamma) \in \mathcal{D}$. Consequently, $\mathrm{Eff}(p)$ is well-defined and nonnegative for any such design.
Corollary 1. For any $(\bar z, \gamma) \in \mathcal{D}$, $1 - \bar z^2 - \gamma^2/\mathbb{E}(x^2) > 0$.
Proof. See Appendix A. ∎
3 Optimal design characterizations
We are now ready to solve the optimization problem (6) in the case that $\mathcal{P}$ is the set of all design functions. Fixing $(\bar z, \gamma) \in \mathcal{D}$, a feasible design is one that satisfies the equality constraints in (6). We begin with a sufficient condition for a feasible design to be optimal.
Lemma 2. If there exists a feasible $p$ with
$$\mathbb{E}_p(x^2 z) = -\frac{\bar z\,\gamma^2}{1 - \bar z^2},$$
then $p$ solves (6). Furthermore, this target value is $\gamma^2 g(\bar z)$ for some decreasing function $g$.
In the case $\bar z = 0$, Lemma 2 shows that any feasible design with $\mathbb{E}_p(x^2 z) = 0$ is optimal. This recovers the result of Owen and Varian (2020) that when half of the subjects are to be treated and $F$ is symmetric, any symmetric design, including the three level tie-breaker in (4), is optimal. Lemma 2 attempts to generalize this result to any $\bar z$ and any $F$. However, in many cases, no feasible $p$ attains the target value of $\mathbb{E}_p(x^2 z)$. The second part of Lemma 2 and the following result enable us to resolve this issue.
Lemma 3. Suppose that feasible designs $p_0$ and $p_1$ satisfy $\mathbb{E}_{p_0}(x^2 z) = m_0$ and $\mathbb{E}_{p_1}(x^2 z) = m_1$. Then if $m \in [m_0, m_1]$, there exists a feasible $p$ with $\mathbb{E}_p(x^2 z) = m$.

Proof. If $m_0 = m_1$ then either of them is a suitable $p$. Otherwise define
$$p_\lambda = \lambda p_1 + (1 - \lambda) p_0, \qquad \lambda = \frac{m - m_0}{m_1 - m_0},$$
so that $\lambda \in [0, 1]$. Then $p_\lambda$ is feasible with $\mathbb{E}_{p_\lambda}(x^2 z) = \lambda m_1 + (1 - \lambda) m_0 = m$. ∎
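The convex combination argument can be illustrated numerically: mixing two designs with the same treated fraction leaves $\mathbb{E}(z)$ unchanged while $\mathbb{E}(xz)$ (and likewise $\mathbb{E}(x^2 z)$) interpolates linearly in the mixing weight. A sketch on a grid standing in for a $\mathcal{U}(-1,1)$ running variable:

```python
import numpy as np

x = np.linspace(-1, 1, 100_001)       # grid standing in for U(-1, 1)
p0 = np.full(x.size, 0.5)             # RCT, treats half on average
p1 = (x >= 0).astype(float)           # sharp RDD; same treated fraction

def Ez(p):
    return np.mean(2 * p - 1)         # E[z] under design p

def gain(p):
    return np.mean(x * (2 * p - 1))   # E[xz] under design p

for lam in (0.0, 0.25, 0.5, 1.0):
    p = lam * p1 + (1 - lam) * p0
    # E[z] stays fixed while the gain is linear in lam.
    print(lam, Ez(p), gain(p))
```

Because every moment constraint is linear in the design function, the mixture $p_\lambda$ inherits feasibility from its endpoints exactly, not just approximately.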
Lemma 3 shows that the set of values $\mathbb{E}_p(x^2 z)$ attainable by feasible designs is a closed interval. This suggests the following strategy for solving (6). Suppose design functions $p_+$ and $p_-$ are solutions to the following optimization problems:
$$p_+ \in \operatorname*{arg\,max}_{p \text{ feasible}} \mathbb{E}_p(x^2 z) \quad \text{and} \quad p_- \in \operatorname*{arg\,min}_{p \text{ feasible}} \mathbb{E}_p(x^2 z). \qquad (19)$$
Writing $m_\pm = \mathbb{E}_{p_\pm}(x^2 z)$ and letting $m^*$ denote the target value from Lemma 2, the design
$$\lambda^* p_+ + (1 - \lambda^*) p_-, \qquad \lambda^* = \min\Bigl(\max\Bigl(\frac{m^* - m_-}{m_+ - m_-},\, 0\Bigr),\, 1\Bigr), \qquad (20)$$
solves (6). If the denominator $m_+ - m_-$ in (20) equals zero, then all feasible designs have the same efficiency, so any one of them is optimal.
Thus, we’ve reduced our original optimization problem (6) to two optimization problems (19). We will now show that solutions to (19) always exist and are unique, for any . The argument uses extensions of the Neyman-Pearson lemma (Neyman and Pearson, 1933) in hypothesis testing found by Dantzig and Wald (1951). We adapt these results to our setting with a formulation by Lehmann and Romano (2005):
Lemma 4. Consider any measurable functions $f_1, \dots, f_{m+1}$ with $\mathbb{E}|f_j(x)| < \infty$. Define $M \subseteq \mathbb{R}^m$ to be the set of all points of the form
$$\bigl(\mathbb{E}[p(x) f_1(x)], \dots, \mathbb{E}[p(x) f_m(x)]\bigr) \qquad (21)$$
for some $p \in \mathcal{C}$, where $\mathcal{C}$ is some collection of measurable functions from $\mathbb{R}$ into $[0, 1]$. For each $c \in M$ let $\mathcal{C}_c$ be the set of all $p \in \mathcal{C}$ for which (21) equals $c$. If $\mathcal{C}$ is such that $M$ is closed and convex and $c$ is an interior point of $M$, then:

1. There exist $p^* \in \mathcal{C}_c$ and $k_1, \dots, k_m \in \mathbb{R}$ such that
$$p^*(x) = \begin{cases} 1, & f_{m+1}(x) > \sum_{j=1}^m k_j f_j(x) \\ 0, & f_{m+1}(x) < \sum_{j=1}^m k_j f_j(x). \end{cases} \qquad (22)$$
2. A design $p \in \mathcal{C}_c$ maximizes $\mathbb{E}[p(x) f_{m+1}(x)]$ over $\mathcal{C}_c$ if and only if $p$ satisfies (22) for some $k_1, \dots, k_m \in \mathbb{R}$.

Proof. Claim 1 follows from Theorem 3.6.1 of Lehmann and Romano (2005). Necessity of (22) in claim 2 uses the fact that $M$ is closed and convex to construct a separating hyperplane in $\mathbb{R}^{m+1}$. Sufficiency of (22) in claim 2 follows from part (ii) of that theorem, and is often called the method of undetermined multipliers. ∎
In our setting we will have constraints given by $f_1(x) = 1$ and $f_2(x) = x$. Our objective function will then be $f_3(x) = x^2$. Theorem 1 below uses Lemma 4 to solve (19), and hence the original optimization problem (6) by (20). For continuous $F$ the solutions take the forms $p_+(x) = \mathbf{1}\{x \notin (a, b)\}$ and $p_-(x) = \mathbf{1}\{x \in (a, b)\}$ for intervals $(a, b)$ (which have finite length so long as $\gamma < \gamma_{\max}(\bar z)$). When $F$ is not continuous, then one or more of the interval endpoints above may be atoms of $F$ at which the design takes a value strictly between 0 and 1.
Theorem 1. For any $(\bar z, \gamma) \in \mathcal{D}$, there exist unique solutions $p_+$ and $p_-$ to the optimization problems (19). These solutions are characterized by
$$p_+(x) = \mathbf{1}\{x \notin (a_+, b_+)\} \quad \text{and} \quad p_-(x) = \mathbf{1}\{x \in (a_-, b_-)\} \qquad (23)$$
for some endpoints $a_\pm \le b_\pm$ which depend on $(\bar z, \gamma)$ and can be infinite if $\gamma = \gamma_{\max}(\bar z)$.
Proof. If $\gamma = \gamma_{\max}(\bar z)$, then the theorem follows by Lemma 1 by taking $a_+ = -\infty$, $b_+ = t$ and $a_- = t$, $b_- = \infty$, where $t$ is the threshold from (14). Thus we can assume that $\gamma < \gamma_{\max}(\bar z)$. We give the proof for $p_+$ in detail. The argument for $p_-$ is completely symmetric.
As noted above, we are in the setting of Lemma 4 with $m = 2$, $f_1(x) = 1$, $f_2(x) = x$, and $f_3(x) = x^2$. The collection $\mathcal{C}$ here is the set of all measurable functions from $\mathbb{R}$ into $[0, 1]$, so the corresponding set $M$ is closed and convex, as shown in part (iv) of Theorem 3.6.1 in Lehmann and Romano (2005). By (11), the constraints in (19) correspond to the point
$$c = \Bigl(\frac{1 + \bar z}{2},\ \frac{\gamma}{2}\Bigr),$$
obtained from $(\bar z, \gamma)$ via the invertible mapping in (12) and its analogue for $\mathbb{E}(x\,p(x))$. Hence our previous assumption $\gamma < \gamma_{\max}(\bar z)$ ensures that $c$ is in the interior of $M$.
The optimality condition (22) here states that, for some constants $k_1$ and $k_2$,
$$p(x) = \begin{cases} 1, & x^2 > k_1 + k_2 x \\ 0, & x^2 < k_1 + k_2 x \end{cases} \qquad (24)$$
(cf. part (ii) of Theorem 3.6.1 in Lehmann and Romano (2005)). If the quadratic $x^2 - k_2 x - k_1$ has no real roots then $p(x) = 1$ for all $x$, contradicting $\mathbb{E}(p(x)) = (1 + \bar z)/2 < 1$. Thus we can factor $x^2 - k_2 x - k_1 = (x - a)(x - b)$ for some (real) $a \le b$, showing that (24) is equivalent to (23). We can now conclude, by the second claim in Lemma 4, that the set of optimal solutions to (19) consists of precisely those feasible designs satisfying (23). Furthermore, the first claim of Lemma 4 ensures that such a design must exist.
Now we show uniqueness for $p_+$. The same argument shows uniqueness for $p_-$ and hence for (19). Suppose that $p(x) = \mathbf{1}\{x \notin (a, b)\}$ and $p'(x) = \mathbf{1}\{x \notin (a', b')\}$ are both solutions for $p_+$. By symmetry we can assume that either $a < a'$, or both $a = a'$ and $b \le b'$. Since $p$ and $p'$ are feasible for (19), we must have $\mathbb{E}(p(x)) = \mathbb{E}(p'(x))$ and $\mathbb{E}(x\,p(x)) = \mathbb{E}(x\,p'(x))$, in view of (11). We show that $p = p'$ w.p.1 under $F$. Note that we can assume without loss of generality that $F$ puts positive probability on $(a, a + \epsilon)$ for any $\epsilon > 0$, because otherwise we could increase $a$ without changing $p$ on a set of positive probability. We can similarly assume that $F$ puts positive probability on $(b - \epsilon, b)$ for any $\epsilon > 0$. Finally, we impose these two canonicalizing conditions on $a'$ and $b'$ as well.
Assume first that . Then we cannot have because we would then need either or with and to enforce and this would cause . We similarly cannot have with both and . Therefore after canonicalizing, we know that both and are equivalent to designs of the form given with and along with the analogous conditions and . Then our canonicalized and satisfy for all and so in particular .
It remains to handle the case where . We then have since . If then the support of is completely to the right of that of which violates . We can similarly rule out . As a result must have