Airlines, hotels and other companies may offer incentives such as free upgrades to their most loyal customers in the expectation that those customers will respond favorably with future business. The companies wish to measure the impact of those incentives while also trying to get the greatest benefit from them. An e-commerce company might want to offer some analytic tools to the customers most likely to benefit from them, while also measuring the impact of offering those tools.
These companies can rank their customers, offer the incentive to the highest ranked ones, and then measure impact with a regression discontinuity design (RDD). Or they can run a randomized controlled experiment (RCT) and measure impact by comparing results from customers with and without the incentive. The RDD is expected to have the greatest immediate payoff while the RCT is known to be more statistically efficient.
This tradeoff is naturally handled in a tie-breaker design. For a running variable $x$, subjects in a tie-breaker design are allocated to a control condition if $x \le A$, to a test condition if $x \ge B$ and their treatment (test or control) is randomized if $A < x < B$. If $A = B$ then no subjects are randomized and the data follow an RDD as introduced by Thistlethwaite and Campbell (1960). At the other extreme, if all the $x$ values are above $A$ and below $B$, then the design is an RCT as described in texts on causal inference (Imbens and Rubin, 2015) or on experimental design (Box et al., 1978; Wu and Hamada, 2011). Tie-breaker designs are also called cutoff designs (Cappelleri and Trochim, 2003) and the running variable is also called an assignment variable or a forcing variable. Sometimes we refer to subjects getting the treatment or not, in place of getting test and control levels of the treatment.
Angrist et al. (2014) use a tie-breaker design to evaluate the effects of post-secondary aid in Nebraska. In that setting, $x$ was a student ranking. Students were triaged into top, middle and bottom groups. The top students received aid, the bottom ones did not, and those in the middle group were randomized to receive aid or not. Aiken et al. (1998) report on a study about allocation of students to remedial English classes where the running variable is a measure of students’ reading ability before they matriculate.
Our interest is in optimizing the size of the RCT within a tie-breaker experiment. The RCT is well known to be more statistically efficient than the RDD. See for instance Jacob et al. (2012a, Section 6). However the positive impact from the test condition is ordinarily going to be better in an RDD. Companies may have more to gain by increasing business from their best customers. Similarly, merit-based scholarships are used when one wants to get academically stronger students into a class. There is thus an exploration-exploitation tradeoff here; the RCT is better for measuring impact while the RDD is expected to have more positive impact on the subjects under study.
It is possible to study this tradeoff via extensive Monte Carlo simulations or similar numerical exploration. While that approach can be used with very detailed assumptions about the distribution of $x$ and flexible models for the response of interest, it does not provide much insight into the general nature of the tradeoff. We consider a special case where the running variable has been rescaled to have a symmetric distribution centered at $0$, and the experimental range is from $-\Delta$ to $\Delta$. We will use a linear regression model for the response with a separate slope and intercept for test and control. In this setting, Jacob et al. (2012a, Section 6) found the RCT to be four times as efficient as the RDD when the running variable is uniformly distributed.
Figure 1 illustrates tie-breaker designs for four values of $\Delta$. The assignment variable there has a Gaussian distribution, which we assume has been centered and scaled. The outcome variable is simulated from a linear model with a constant treatment effect. For instance, in the third panel, the top customers get the treatment, the bottom customers do not, and a fraction of the data in the middle have randomized allocation. For a Gaussian allocation variable, the experimental region in the middle of the data is where the data are most densely packed, which will typically be desirable.
This paper is organized as follows. Section 2 introduces a two-line regression relating an outcome to the assignment variable. The slope and intercept vary between treatment and control. The assignment variable will not always be Gaussian, but we can always rank order it, so that section is based on the ranks. Section 3 shows that the statistical efficiency of incorporating experimentation at level $\Delta$ versus the plain regression discontinuity design is $1 + 6\Delta^2 - 3\Delta^4$. Thus, statistical efficiency is a monotone increasing function of the amount of experimentation. At the extreme, a pure RCT with $\Delta = 1$ is four times as efficient as the RDD. We ordinarily expect that our outcome variable will show the greatest gains if we give the treatment to the highest ranked customers. Section 4 quantifies that cost in the two-line regression model and trades it off against statistical efficiency. The optimal $\Delta$ is then dependent on the ratio between the value per customer of the short term return and the value of the information per customer that we get for a given $\Delta$. Although an experiment might be designed for a linear model, once the data are collected there may be nonlinearities that warrant a more flexible model. Section 5 repeats our analysis of the linear model for a pair of quadratic regression models. In this case, the regression discontinuity design has a much higher variance than the experiment does. This is in line with recent findings of Gelman and Imbens (2017). Section 6 revisits the Gaussian case that we illustrate in Figure 1. It is qualitatively similar to the uniform case. Here a full RCT is about $2.75$ times as efficient as the RDD. Section 7 describes a numerical version of our approach that does not require a simplistic regression model. One can always use brute force optimization of a Monte Carlo simulation. We show how to replace the simulation inner loop by matrix algebra, allowing faster and more thorough optimization. The tie-breaker literature has emphasized experiments in the middle range of the running variable $x$. Section 8 looks at off center experiments, such as experimenting in just the second decile from the top. In our motivating applications, the incentive might only be offered to a small fraction of customers. Section 9 contains a short discussion of how to use the findings.
We close this introduction with some additional references. Since Thistlethwaite and Campbell (1960), there have been many applications of regression discontinuity designs, particularly in economics and political science. Textbook treatments and surveys may be found in Angrist and Pischke (2009), Angrist and Pischke (2014), Jacob et al. (2012b), Imbens and Lemieux (2008), Jacob et al. (2012a), Klaauw (2008), and Lee and Lemieux (2010).
It is well known in the literature that experiments are more efficient than regression discontinuity designs. Section 6 of Jacob et al. (2012a) discusses this point in depth. They include the four-fold efficiency improvement we get for uniformly distributed running variables and a factor of about $2.75$ for normal running variables. The latter goes back to Goldberger (1972).
For an historical note, a tradeoff of this kind appeared in the Lanarkshire milk experiment, described by Student (1931). The goal was to measure the effect of a daily ration of milk on the health of school children. Among many complications was the fact that some of the schools chose to give the rations to the students that they thought needed it most. While that may have been the most beneficial way to allocate the school’s milk, it was very damaging to the process of learning the causal impact of the milk rations. A tie-breaker experiment might have been a good compromise.
We begin with a simple setting where there are an even number $n$ of customers, and exactly $n/2$ of them will receive the treatment. There is an “assignment variable” $x_i$ that measures the suitability of customer $i$ for the program. The assignment variable might be the output of a statistical machine learning model based on multiple variables, or it could be based on a subjective judgment of one or more experts or stakeholders.
We will simplify the problem by transforming $x$ to be equispaced in the interval $[-1, 1]$. That is, after sorting the customers in increasing order of $x$, we make a rank transformation to $x_i = (2i - n - 1)/n$. If , the assignment variable is . Let $z_i \in \{-1, 1\}$ indicate the treatment status; subjects that receive the treatment have $z_i = 1$ and subjects that do not receive the treatment have $z_i = -1$.
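We denote the experimental interval by $(-\Delta, \Delta)$ for $\Delta$ in $[0, 1]$. In our hybrid design the treatment assignment takes the form:
$$
z_i = \begin{cases}
1, & x_i \ge \Delta, \\
\pm 1 \text{ each with probability } 1/2, & |x_i| < \Delta, \\
-1, & x_i \le -\Delta.
\end{cases} \tag{1}
$$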
If $\Delta = 0$, then we have a classic RDD with the discontinuity at $x = 0$. If $\Delta = 1$, then we have a classic RCT. If $0 < \Delta < 1$, then we have a tie-breaker design with $\Delta$ measuring the amount of randomization.
The random allocation in equation (1) will make half of the $z_i$ for $x_i \in (-\Delta, \Delta)$ equal $1$ and the other half will be $-1$. One way to do this is to choose $z_i = 1$ for a simple random sample of half of the elements in $\{i : |x_i| < \Delta\}$. Stratified schemes, setting $z_i = 1$ for exactly one random member of each consecutive pair of indices in that set, are also easy to implement.
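The allocation rule just described can be sketched in a few lines of Python. This is an illustration rather than the authors' code: the function names `rank_scores` and `tie_breaker_assign` are hypothetical, the rank scores are assumed to take the form $(2i-n-1)/n$, and the window is split by a simple random sample of half of its members.

```python
import numpy as np

def rank_scores(n):
    """Equispaced scores in (-1, 1) for customers sorted by increasing x.
    Assumed form: x_i = (2i - n - 1) / n for i = 1, ..., n."""
    i = np.arange(1, n + 1)
    return (2 * i - n - 1) / n

def tie_breaker_assign(x, delta, rng):
    """Return z in {-1, +1}: +1 when x >= delta, -1 when x <= -delta,
    and a balanced random split among customers with |x| < delta."""
    x = np.asarray(x, dtype=float)
    z = np.where(x >= delta, 1, -1)
    inside = np.flatnonzero(np.abs(x) < delta)
    k = len(inside) // 2
    # Half of the window gets +1, the rest -1, in random order.
    labels = np.repeat([1, -1], [k, len(inside) - k])
    z[inside] = rng.permutation(labels)
    return z

rng = np.random.default_rng(0)
x = rank_scores(100)
z = tie_breaker_assign(x, delta=0.5, rng=rng)
print(int((z == 1).sum()), int((z == -1).sum()))
```

Setting `delta=0` in this sketch reproduces the sharp RDD rule and `delta=1` randomizes every customer.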
The impact of the treatment is measured by a scalar outcome $Y_i$ where $Y_i$ is a measure of the benefit derived from customer $i$. We suppose that the delay time between setting $z_i$ and observing $Y_i$ is long enough to make bandit methods (see for instance, Scott (2015)) unsuitable. We will instead compare experimental designs using the following two-line regression model:
$$
Y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i + \varepsilon_i, \tag{2}
$$
where the $\varepsilon_i$ are IID random variables with mean $0$ and finite variance $\sigma^2$. Our analysis is based on the regression model (2) instead of the randomization because the treatment for subjects with $x_i$ outside $(-\Delta, \Delta)$ is not random.
The effect of the treatment averaged over customers is $2\beta_2$. The factor of $2$ comes from comparing $z_i = 1$ to $z_i = -1$. We can also estimate whether the effect increases or decreases with $x$, through the coefficient $\beta_3$. The quantity $2\beta_2$ is also the magnitude of the treatment effect on a (hypothetical) average customer with $x = 0$.
Under model (2), we can distinguish customers for whom the treatment is effective from those for whom it is not. Suppose that is the incremental cost of offering the treatment to one customer. If , then there is a cutpoint with for customers with . If then the treatment either pays off on average at all , or pays off on average for no . If , then the treatment only pays off for customers with . We discuss that case further in Section 4.
3 Efficiency in the two-line model
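We will analyze the data for a given $\Delta$ by fitting model (2) by least squares. The parameter of interest is $\beta = (\beta_0, \beta_1, \beta_2, \beta_3)^\mathsf{T}$ and we assume that the $\varepsilon_i$ are independent random variables with $\mathrm{Var}(\varepsilon_i) = \sigma^2$. The design matrix is $X \in \mathbb{R}^{n \times 4}$ with $i$’th row $(1, x_i, z_i, x_i z_i)$, and $\mathrm{Var}(\hat\beta) = (X^\mathsf{T}X)^{-1}\sigma^2$. Because $X$ does not depend on $\sigma$, we can compare designs assuming that $\sigma^2 = 1$.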
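As a concrete illustration of the least squares fit, the sketch below simulates data from a two-line model and recovers the coefficients. It is a minimal example under stated assumptions: the helper names `design_matrix` and `fit_two_line` are hypothetical, the predictors are taken in the order $(1, x, z, xz)$, the coefficient values are made up, and the window uses independent coin flips rather than an exactly balanced split.

```python
import numpy as np

def design_matrix(x, z):
    """Rows (1, x_i, z_i, x_i * z_i) of the two-line regression."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    return np.column_stack([np.ones_like(x), x, z, x * z])

def fit_two_line(x, z, y):
    """Least squares estimate and the unit-variance covariance (X'X)^{-1}."""
    X = design_matrix(x, z)
    beta_hat, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    unit_cov = np.linalg.inv(X.T @ X)
    return beta_hat, unit_cov

rng = np.random.default_rng(1)
n = 2000
x = (2 * np.arange(1, n + 1) - n - 1) / n        # assumed rank scores
# Coin flips inside |x| < 0.4, deterministic treatment outside.
z = np.where(np.abs(x) < 0.4, rng.choice([-1, 1], size=n), np.sign(x))
beta_true = np.array([1.0, 0.5, 0.2, 0.3])       # made-up coefficients
y = design_matrix(x, z) @ beta_true + rng.normal(scale=1.0, size=n)

beta_hat, unit_cov = fit_two_line(x, z, y)
print(np.round(beta_hat, 3))
print(np.round(n * np.diag(unit_cov), 3))        # n * Var / sigma^2
```

Multiplying `unit_cov` by an estimate of the error variance would give the usual covariance estimate for the fitted coefficients.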
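Next, we look at how $X^\mathsf{T}X$ depends on $\Delta$. For large $n$ we can replace $\frac{1}{n}\sum_{i} x_i^2$ by $\frac{1}{2}\int_{-1}^{1} x^2 \,\mathrm{d}x = \frac{1}{3}$. Similar integral approximations yield
$$
\frac{1}{n} X^\mathsf{T}X \approx
\begin{pmatrix}
1 & 0 & 0 & \phi \\
0 & 1/3 & \phi & 0 \\
0 & \phi & 1 & 0 \\
\phi & 0 & 0 & 1/3
\end{pmatrix} \tag{3}
$$
where $\phi = (1 - \Delta^2)/2$ is the average value of $x_i z_i$ over the design. We let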
We can reorder the rows and columns of (3) to make it block diagonal,
where the labels on the matrix above refer to the variables that the multiply and . It follows that
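Thus the variances scale by $1/(1 - 3\phi^2)$. The individual component variances are $\mathrm{Var}(\hat\beta_0) = \mathrm{Var}(\hat\beta_2) = \sigma^2/(n(1 - 3\phi^2))$ and $\mathrm{Var}(\hat\beta_1) = \mathrm{Var}(\hat\beta_3) = 3\sigma^2/(n(1 - 3\phi^2))$. These variances are smallest for small values of $\phi$, corresponding to large values of $\Delta$. That is, the more randomized experimentation there is in the data, the less variance there is in the estimates. Therefore, the regression discontinuity design is worst and the randomized experiment is best. Larger values of $\phi$ also induce stronger correlations among the $\hat\beta_j$.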
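The estimated gain from the intervention for a customer with a given $x$ is $2(\hat\beta_2 + \hat\beta_3 x)$. Next
$$
\mathrm{Var}\bigl(2(\hat\beta_2 + \hat\beta_3 x)\bigr) = \frac{4\sigma^2 (1 + 3x^2)}{n (1 - 3\phi^2)} \tag{6}
$$
after some algebra, where as above $\phi$ denotes the average value of $x_i z_i$ over the design. The relative efficiency of the experiment ($\phi = 0$) versus regression discontinuity ($\phi = 1/2$) is
$$
\frac{1 - 3 \cdot 0^2}{1 - 3 (1/2)^2} = 4 \tag{7}
$$
for all $x$. That is, the randomized experiment with $n$ observations is as informative as the regression discontinuity with $4n$ observations and this holds uniformly over all levels of the assignment variable $x$. This factor of $4$ is given by Jacob et al. (2012a).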
Figure 2 shows the variance of the treatment effect parameters as a function of $\Delta$. Some values from the plot are shown in Table 1. The regression discontinuity design has four times the variance of the experiment as we saw in equation (7). The slope coefficient for treatment always has three times the variance of the intercept coefficient as follows from (5). Figure 3 shows the variance of the estimated impact versus $x$ for several choices of $\Delta$.
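The dependence of the coefficient variances on the amount of experimentation can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: the function name `unit_variances` is hypothetical, the rank scores are assumed to be $(2i-n-1)/n$, predictors are ordered $(1, x, z, xz)$, and the randomization inside the window is averaged out using $E(z_i) = 0$ there together with $z_i^2 = 1$.

```python
import numpy as np

def unit_variances(delta, n=20000):
    """n * Var(beta_hat_j) / sigma^2 for the two-line model, averaging
    over the randomization inside the window |x| < delta."""
    x = (2 * np.arange(1, n + 1) - n - 1) / n
    # E(z_i): zero inside the window, sign(x_i) outside it.
    ez = np.where(np.abs(x) < delta, 0.0, np.sign(x))
    F = np.column_stack([np.ones(n), x])     # columns (1, x)
    A = F.T @ F / n                          # also the z-block, as z_i^2 = 1
    B = (F * ez[:, None]).T @ F / n          # cross block between F and z * F
    M = np.block([[A, B], [B, A]])           # predictor order: 1, x, z, xz
    return np.diag(np.linalg.inv(M))

for delta in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(delta, np.round(unit_variances(delta), 3))
```

With `delta=0` the four entries come out four times as large as with `delta=1`, and at every `delta` the entry for the $xz$ coefficient is three times the entry for the $z$ coefficient, in line with the discussion above.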
4 Cost of experimentation
We ordinarily expect the value of the incentive to increase with the variable $x$. In that case the greatest return on the customers in the experiment arises from the regression discontinuity design with $\Delta = 0$. The information gain from $\Delta > 0$ comes at some cost in the present sample. This section quantifies that cost.
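For a deterministic allocation of $z_i = 1$ or $z_i = -1$ we have $E(Y_i) = \beta_0 + \beta_1 x_i + z_i(\beta_2 + \beta_3 x_i)$. When $z_i$ is chosen randomly with $\Pr(z_i = 1) = 1/2$, then $E(Y_i) = \beta_0 + \beta_1 x_i$. It follows that the expected gain per customer in the hybrid design is
$$
\beta_0 + \beta_3 \, \frac{1 - \Delta^2}{2}.
$$
Neither $\beta_1$ nor $\beta_2$ appears in this gain and the value of $\beta_0$ does not affect our choice of $\Delta$. Only $\beta_3$, which models how the payoff from the incentive varies with the assignment variable, makes a difference. Compared to the regression discontinuity design with $\Delta = 0$, the cost of incorporating experimentation is
$$
\frac{\beta_3 \Delta^2}{2},
$$
which grows slowly as $\Delta$ increases from zero and then rapidly as $\Delta$ approaches one.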
If $\beta_3 > 0$, then as expected, we gain the most from the regression discontinuity design and the least from the experiment. This is a classic exploration-exploitation tradeoff.
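The cost of experimentation can also be computed directly by averaging the expected outcome over customers. The following sketch assumes a two-line model with coefficients ordered as intercept, slope in $x$, treatment effect, and treatment-by-$x$ interaction; the function name `expected_gain`, the rank scores $(2i-n-1)/n$, and the coefficient values are all assumptions made for illustration.

```python
import numpy as np

def expected_gain(delta, beta, n=20000):
    """Average of E(Y_i) over customers, with E(z_i) = 0 inside the
    window |x| < delta and z_i = sign(x_i) outside it."""
    b0, b1, b2, b3 = beta
    x = (2 * np.arange(1, n + 1) - n - 1) / n
    ez = np.where(np.abs(x) < delta, 0.0, np.sign(x))
    return np.mean(b0 + b1 * x + ez * (b2 + b3 * x))

beta = (1.0, 0.5, 0.2, 0.3)      # made-up values with a positive interaction
base = expected_gain(0.0, beta)  # regression discontinuity design
for delta in (0.0, 0.25, 0.5, 0.75, 1.0):
    # Cost of experimentation relative to the RDD.
    print(delta, round(base - expected_gain(delta, beta), 4))
```

In this sketch the cost is zero for the RDD, increases with the size of the window, and does not change when the slope in $x$ or the main treatment effect is altered.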
It is also possible that some settings have . This might happen if the incentive is additional free tutoring in the educational context, or if it is advice on how to best use an e-commerce company’s products in a context where higher performing customers already knew about the advice. In these cases the minimal cost is to give the incentive to the bottom customers and not the top customers. The analysis of this paper goes through by reversing the customer ranking, thereby replacing by and also changing the sign of .
Now we turn to optimizing the choice of $\Delta$ given some assumptions on the relative value of the information in the data for future decisions and the expected gain on the experiment. The precision (inverse variance) of our estimate of $\beta_3$ is a linear function of $n$ and so is the expected gain. We can therefore trade off precision per customer with gain per customer. We think that $\beta_3$ is the most important parameter so we take the precision gain per customer to be
Alternatively, we could focus on $2\beta_2$ which is both the average gain per customer and the gain for the customer at $x = 0$. The precision for $\hat\beta_2$ turns out to be three times that for $\hat\beta_3$, so it is perfectly aligned with precision on $\hat\beta_3$. More generally the gain from the incentive at any specific $x$ has a variance given by (6). Any weighted average of precision of over points is a scalar multiple of from (8).
We trade off gain per customer and precision per customer with the value function
where measures the value for future decisions of having greater precision on .
Let be given by equation (9) with and . Then the maximum of over occurs at
Let . We will first maximize over , where does not depend on . Now has a unique maximum over at . The maximizing is when , it is when and it is when . Equation (10) translates these results back to the optimal . ∎
We see from equation (10) that the decision depends on the critical ratio . The numerator reflects the value of more efficient allocation and the denominator captures the value of improved information gathering. When then the discontinuity design with is optimal. The full experiment, , is never optimal unless or the value of information to be used in future decisions is infinite.
Figure 4 shows the value from equation (10) versus the ratio of the short term to long term value coefficients. The function is nearly equal to near the origin and has negative curvature on . If future uses are important enough that , then one should use . That is, when the future is very important the optimal hybrid is very close to an RCT.
5 Quadratic regression
A quadratic regression model
allows a richer exploration of the treatment effect. For instance, model (11) allows for the possibility that the treatment pays off if and only if $x$ is in some interval. It also allows for a situation where the payoff only comes outside of some interval. This model has even (symmetric) predictors $1$, $x^2$, $xz$ and odd (antisymmetric) predictors $x$, $z$, $x^2 z$. As in the linear case, the even and odd predictors are orthogonal to each other.
Now is a block diagonal matrix. Some of the entries are
as well as from Section 3 that we call here. We find that
Once again we get a block diagonal pattern with two identical blocks. This is a consequence of $z_i^2 = 1$, and it will happen for more general models with odd and even predictors.
For , let be given by (12). Then
for a symmetric matrix
and a determinant
Multiplying above by the upper left submatrix in (12) yields times , after some lengthy manipulations. ∎
Figure 5 shows the variance of the estimated impact versus $x$ for several choices of $\Delta$. Notice that the variance is given on a logarithmic scale there. The regression discontinuity design, in the top curve there, has extremely large variances especially where is close to . The randomized design at the bottom has much smaller variance. Even the maximum variance in the experiment (at ) is smaller than the minimum variance in the regression discontinuity model (at ).
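The contrast between the two designs under a quadratic model can be explored numerically. The sketch below is hedged accordingly: it assumes a six-parameter model with predictors ordered $(1, x, x^2, z, xz, x^2z)$, treatment coded as $\pm 1$ so that the impact at $x_0$ is twice the treatment part of the fit, rank scores $(2i-n-1)/n$, and a hypothetical function name `impact_variance`.

```python
import numpy as np

def impact_variance(x0, delta, n=20000):
    """n * Var(estimated impact at x0) / sigma^2 in the quadratic model,
    where the impact is 2 * (treatment coefficients) . (1, x0, x0^2)."""
    x = (2 * np.arange(1, n + 1) - n - 1) / n
    ez = np.where(np.abs(x) < delta, 0.0, np.sign(x))   # E(z_i)
    F = np.column_stack([np.ones(n), x, x**2])
    A = F.T @ F / n                       # also the z-block, since z_i^2 = 1
    B = (F * ez[:, None]).T @ F / n
    M = np.block([[A, B], [B, A]])
    C = np.linalg.inv(M)[3:, 3:]          # block for the treatment terms
    f = np.array([1.0, x0, x0**2])
    return 4.0 * f @ C @ f

for x0 in (0.0, 0.5, 1.0):
    rdd = impact_variance(x0, delta=0.0)
    rct = impact_variance(x0, delta=1.0)
    print(x0, round(rdd, 2), round(rct, 2), round(rdd / rct, 1))
```

Under these assumptions the RDD to RCT variance ratio is about 4 at $x_0 = 0$ and about 49 at $x_0 = 1$, which illustrates how quickly the RDD degrades away from the cutoff once curvature is allowed.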
6 Gaussian case
The original assignment variable might have a nearly Gaussian distribution. Or we might believe that the two-line linear model fits better if we have transformed the assignment variable rank to normal scores $x_i = \Phi^{-1}((i - 1/2)/n)$, where $\Phi$ is the cumulative distribution function of the $\mathcal{N}(0, 1)$ distribution.
We will experiment on the central data with choosing to get a fraction of data in the experiment. That leads to . After reordering the variables we find in this case that
Compared to the uniform scores case, the diagonal has changed from to . The value of from the uniform case changes to
For this Gaussian case, all estimated coefficients have the same variance, equal to $\sigma^2/(n(1 - \phi^2))$. The variances for uniform assignment variables were not all the same. The difference stems from the $x$ points having variance $1/3$ in the uniform case instead of variance $1$ here. As before as increases, also increases and so decreases.
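Now we work out the efficiency of the RCT compared to the RDD. For the RCT, $\Delta = \infty$ yields $\phi = 0$ and then $\mathrm{Var}(\hat\beta_j) = \sigma^2/n$. For the RDD, $\Delta = 0$ yields $\phi = \sqrt{2/\pi}$ and then $\mathrm{Var}(\hat\beta_j) = \sigma^2/(n(1 - 2/\pi))$. Thus the efficiency of the RCT compared to the RDD is
$$
\frac{1}{1 - 2/\pi} = \frac{\pi}{\pi - 2} \approx 2.75
$$
as reported by Goldberger (1972). This is somewhat less than the efficiency gain of $4$ in the uniform case. The efficiency versus $\Delta$ (not shown) has a qualitatively similar shape to the black curve for the coefficient of in the uniform case (Figure 2).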
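The Gaussian efficiency figure can be reproduced with a short calculation. This sketch assumes a standard normal running variable, a central window $|x| < \Delta$ with balanced randomization, and that the only quantity needed is the average of $xz$, which for this design equals twice the standard normal density at $\Delta$. The function name `gaussian_unit_variance` is hypothetical.

```python
import math

def gaussian_unit_variance(delta):
    """n * Var(beta_hat_j) / sigma^2 for x ~ N(0, 1) and window |x| < delta.
    Assumes E(xz) = 2 * pdf(delta), the mean of |x| outside the window."""
    phi = 2.0 * math.exp(-0.5 * delta**2) / math.sqrt(2.0 * math.pi)
    return 1.0 / (1.0 - phi**2)

rdd = gaussian_unit_variance(0.0)        # no randomization
rct = gaussian_unit_variance(math.inf)   # everyone randomized
print(rdd / rct)                         # equals pi / (pi - 2)

for delta in (0.0, 0.5, 1.0, 2.0):
    # Efficiency of the hybrid design relative to the RDD.
    print(delta, round(rdd / gaussian_unit_variance(delta), 3))
```

The loop shows the efficiency relative to the RDD rising from one toward $\pi/(\pi-2)$ as the window widens.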
7 General numerical approach
The two line model for a running variable with a symmetric distribution made it simple to study central experimental windows of the form $(-\Delta, \Delta)$. In that setting the means of $x$ and $z$ were both zero, and the variance of parameter estimates depended simply on just one quantity $\phi$. We may want to use a more general regression model, allow experimental windows that are not centered around the middle value of $x$, have $x$ values that are not uniform or Gaussian, and we might also want to use models other than two regression lines. There might even be more than one running variable as in Abdulkadiroglu et al. (2017). The price for this flexibility is high; users have to answer some hard questions about their goals, and then do numerical optimization over parameters with a potentially expensive Monte Carlo inner loop. In this section we show that the inner loop can be done algebraically.
We suppose that prior to treatment assignment, customer $i$ has a known feature vector $\boldsymbol{x}_i$ which includes an intercept variable equal to $1$, but not the treatment variable $z_i$. For instance in the linear and quadratic models, the features are $(1, x_i)$ and $(1, x_i, x_i^2)$, respectively. In the regression model
we have for the treated customers and for the others. Here models the effect of treatment.
The generalized tie-breaker study works with a vector and sets
In the random case, we suppose that $z_i$ is $1$ with probability $p$ and is $-1$ with probability $1 - p$ where $p$ need not be $1/2$. Because $\boldsymbol{x}_i$ contains an intercept term, the experimental window need not be centered on a central value of . The analyst must now choose , and .
The analogue of our previous approach is to find the matrix where
The lower right corner of is because it is using . Averaging over the outcomes of this way is statistically reasonable when . If are independent with mean zero and variance , then
This averages over the outcomes so that they do not have to be simulated.
One can now do brute force numerical search for good values of and and . A good choice would yield a favorably small . A bad choice will yield a larger variance covariance matrix. A very bad choice would lead to singular and one would of course reject the corresponding triple . For instance, such a singularity would happen if which is an obviously poor choice because then no customers would be in the treatment group.
Using a formula for the inverse of a block matrix we get
and . In an RCT with we have . For certain components of become nonzero (they were values off the main diagonal in the two line regression) increasing and hence increasing .
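The idea of replacing the Monte Carlo inner loop by matrix algebra can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the names `expected_gram`, `unit_covariance`, `score`, `eta`, `delta` and `p` are hypothetical, the window is taken to be the set where a linear score has absolute value below `delta`, treatment is coded $\pm 1$, and customers whose score is exactly zero outside the window would need separate handling.

```python
import numpy as np

def expected_gram(F, score, delta, p=0.5):
    """Average Gram matrix for predictors (F, z * F) with the treatment
    randomization integrated out. Inside the window |score| < delta,
    z = +1 with probability p, so E(z) = 2p - 1; outside, z = sign(score)."""
    n = F.shape[0]
    ez = np.where(np.abs(score) < delta, 2.0 * p - 1.0, np.sign(score))
    A = F.T @ F / n                       # also the z-block, since z^2 = 1
    B = (F * ez[:, None]).T @ F / n
    return np.block([[A, B], [B, A]])

def unit_covariance(F, score, delta, p=0.5):
    """n * Var(coefficients) / sigma^2, or None when the design is singular."""
    M = expected_gram(F, score, delta, p)
    if np.linalg.matrix_rank(M) < M.shape[0]:
        return None
    return np.linalg.inv(M)

# Example: two-line features with an off-center window.
n = 10000
x = (2 * np.arange(1, n + 1) - n - 1) / n
F = np.column_stack([np.ones(n), x])
eta = np.array([-0.3, 1.0])               # hypothetical: window around x = 0.3
cov = unit_covariance(F, F @ eta, delta=0.2, p=0.5)
print(np.round(np.diag(cov), 3))
```

A brute force search would call `unit_covariance` over a grid of candidate settings and discard any that return `None`, such as a design in which no customer can receive the treatment.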
8 Non-central experimental regions
Our treatment of the two line model assumed that the experimental region was in the center of the range of the running variable. For a loyalty program one might prefer instead to allocate the benefit in a different way. Perhaps the top % get the benefit, and the next % are randomized to receive the benefit or not, while the bottom % do not get the benefit. For a less expensive incentive, the company might want to offer it to the top % of customers and then randomize it to the bottom %. We can model these options by taking
Let the running variable be random with . Let be random with a finite value of . Let with probability and otherwise. Then letting be the design matrix in the two line regression, and noting that , we have
under random sampling of and given for . The error holds because . The error could be less than if is a simple enough function to make stratification tractable.
We can center so that and then
We can scale to get so that . We retain more general scaling because has and rescaling would require working with the less convenient distribution .
We need the inverse of a block diagonal matrix containing just two unique square blocks. The following proposition specializes block matrix inversion to our case.
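Let $A$ be an invertible matrix and $B$ be a square matrix with the same dimensions as $A$. If $A - BA^{-1}B$ is invertible, then
$$
\begin{pmatrix} A & B \\ B & A \end{pmatrix}^{-1} = \begin{pmatrix} C & D \\ D & C \end{pmatrix}
$$
for $C = (A - BA^{-1}B)^{-1}$ and $D = -A^{-1}BC$.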
Now and . ∎
Using Proposition 3 we get
Our primary interest is in , for the coefficient of . This is the lower right element of . Now
The asymptotic value of depends on certain integrals. For the case of primary interest to us with , and in the experimental region, these are
[Table 2. Rows include: Skew RDD (90th), Skew RDD (80th).]
Table 2 shows for various designs. The first two are the full experiment and the RDD discussed previously. Next is an experiment on just the bottom half of . This strategy is inadmissible by our criteria. It has more variance than the RDD and also lower allocation efficiency.
Next, the table shows for an experiment on just the second $10$% of the running variable, from the $80$th to the $90$th percentiles of the distribution. Just below it is an equal sized experiment in the middle. We see that experimenting in the middle is much more informative. Shifting the experimental region to one side reduces the sample size for either the treatment or control level of $z$. It also affects the correlations among predictors in the two line model.
The variance for experimenting on the second decile looks large compared to the central experiments. It has within it a central experiment on just the middle third of the data from the th to the th percentiles of . Experimenting on the middle third of involves taking and which yields . However if we had only experimented over the range to (with cut points at and ) then would be only times as large as it is in the second decile experiment. Furthermore, reducing the range of by a factor of multiplies by and by . To adjust for these factors we divide by and get . As a result doing the experiment on the second % really is better than just doing a central experiment on the top %.
One tiny experiment involves just randomizing for one percent of the data centered on the median of . We get a variance of for this compared to for the RDD, so the tiny experiment is almost identical to the RDD. We can move the location of the tiny experiment. Table 2 shows the results for a tiny experiment near the ’th and ’th percentiles of . These are quite similar to skewed RDDs where the cutpoint is off center.
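Comparisons of this kind can be reproduced with a small numerical sketch. It rests on several assumptions: a running variable uniform on $(-1, 1)$ represented by rank scores $(2i-n-1)/n$, a two-line model with predictors ordered $(1, x, z, xz)$, balanced randomization inside the window, and variance reported for the coefficient of $xz$ in units of $\sigma^2/n$. The function name `slope_variance` and the design labels are hypothetical, and the output is not claimed to match Table 2.

```python
import numpy as np

def slope_variance(lo, hi, n=20000):
    """n * Var(coefficient of xz) / sigma^2 when customers with
    lo < x < hi are randomized (p = 1/2), those with x >= hi are
    treated and those with x <= lo are not."""
    x = (2 * np.arange(1, n + 1) - n - 1) / n
    ez = np.where(x >= hi, 1.0, np.where(x <= lo, -1.0, 0.0))   # E(z_i)
    F = np.column_stack([np.ones(n), x])
    A = F.T @ F / n                       # also the z-block, since z_i^2 = 1
    B = (F * ez[:, None]).T @ F / n
    M = np.block([[A, B], [B, A]])
    return np.linalg.inv(M)[3, 3]

# Windows in x units; percentile q corresponds to x = 2q - 1.
designs = {
    "full experiment": (-1.0, 1.0),
    "RDD at the median": (0.0, 0.0),
    "bottom half": (-1.0, 0.0),
    "80th to 90th percentile": (0.6, 0.8),
    "central 10%": (-0.1, 0.1),
}
v = {name: slope_variance(lo, hi) for name, (lo, hi) in designs.items()}
for name, value in v.items():
    print(f"{name:26s} {value:8.2f}")
```

In this sketch the bottom-half experiment has a larger variance than the median RDD, and the off-center 10% window is far less informative than a central window of the same size.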
9 Discussion
In an incentive plan, a regression discontinuity design rewards the a priori best customers but it has severe disadvantages if one wants to follow up with regression models to measure impact. There is a tradeoff between estimation efficiency and allocation efficiency. Proposition 2 provides a principled way to translate estimates or educated guesses about the present value of the incentives and future value of information into a choice of $\Delta$ in a hybrid experiment.
In industrial settings, the incentive under study will change over time. Experience with similar though perhaps not identical prior incentive plans then gives some guidance for making the tradeoff.
We have examined a simple linear model because it is easiest to work with and is a reasonable starting point in many contexts. Analysts have many more models at their disposal when the data come in. Section 5 on the quadratic model provides a warning: the RDD becomes very unreliable already with this model which is only slightly more complicated than the two-line model.
In some applications, the allocation variable may be the output of a scoring model based on many customer variables. We expect that incorporating randomness into the design will give better data for refitting such an underlying scoring model, but following up that point is outside the scope of this article. The effects are likely to vary considerably from problem to problem.
References
- Abdulkadiroglu et al. (2017) Atila Abdulkadiroglu, Joshua D Angrist, Yusuke Narita, and Parag A Pathak. Impact evaluation in matching markets with general tie-breaking. Technical report, National Bureau of Economic Research, 2017. URL http://www.nber.org/papers/w24172.
- Aiken et al. (1998) Leona S Aiken, Stephen G West, David E Schwalm, James L Carroll, and Shenghwa Hsiung. Comparison of a randomized and two quasi-experimental designs in a single outcome evaluation: Efficacy of a university-level remedial writing program. Evaluation Review, 22(2):207–244, 1998.
- Angrist et al. (2014) Joshua Angrist, Sally Hudson, and Amanda Pallais. Leveling up: Early results from a randomized evaluation of post-secondary aid. Technical report, National Bureau of Economic Research, 2014. URL http://www.nber.org/papers/w20800.pdf.
- Angrist and Pischke (2009) Joshua D. Angrist and Jorn-Steffen Pischke. Mostly Harmless Econometrics. Princeton University Press, Princeton, 2009.
- Angrist and Pischke (2014) Joshua D. Angrist and Jorn-Steffen Pischke. Mastering Metrics. Princeton University Press, Princeton, 2014.
- Box et al. (1978) George E. P. Box, William Gordon Hunter, and J. Stuart Hunter. Statistics for experimenters. John Wiley and Sons, New York, 1978.
- Cappelleri and Trochim (2003) Joseph C. Cappelleri and William M. K. Trochim. Cutoff designs. In Marcel Dekker, editor, Encyclopedia of Biopharmaceutical Statistics. CRC Press, 2003. doi: 10.1081/E-EBS12000734. URL https://www.socialresearchmethods.net/research/Cutoff%20Designs%202003.pdf.
- Gelman and Imbens (2017) Andrew Gelman and Guido Imbens. Why high-order polynomials should not be used in regression discontinuity designs. Journal of Business & Economic Statistics, 0(0), 2017. URL http://www.nber.org/papers/w20405.
- Goldberger (1972) A. S. Goldberger. Selection bias in evaluating treatment effects: Some formal illustrations. Technical Report Discussion paper 128–72, Institute for Research on Poverty, University of Wisconsin–Madison, 1972.
- Imbens and Lemieux (2008) Guido Imbens and Thomas Lemieux. Regression discontinuity designs: a guide to practice. Journal of Econometrics, 142(2):615–635, 2008. URL www.nber.org/papers/w13039.pdf.
- Imbens and Rubin (2015) Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
- Jacob et al. (2012a) Robin Jacob, Pei Zhu, Marie-Andrée Somers, and Howard Bloom. A practical guide to regression discontinuity. MDRC, 2012a.
- Jacob et al. (2012b) Robin Tepper Jacob, Pei Zhu, Marie-Andrée Somers, and Howard Bloom. A practical guide to regression discontinuity. MDRC Publications, July 2012b. URL https://www.mdrc.org/publication/practical-guide-regression-discontinuity.
- Klaauw (2008) Wilbert Van Der Klaauw. Regression–discontinuity analysis: A survey of recent developments in economics. LABOUR, 22(2):219–245, 2008. URL http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.466.956&rep=rep1&type=pdf.
- Lee and Lemieux (2010) David S. Lee and Thomas Lemieux. Regression discontinuity designs in economics. Journal of Economic Literature, 48:281–355, June 2010. URL https://www.princeton.edu/~davidlee/wp/RDDEconomics.pdf.
- Scott (2015) Steven L. Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015.
- Student (1931) Student. The Lanarkshire milk experiment. Biometrika, 23(2/3):398–406, 1931.
- Thistlethwaite and Campbell (1960) D. L. Thistlethwaite and D. T. Campbell. Regression-discontinuity analysis: An alternative to the ex post facto experiment. Journal of Educational Psychology, 51(6):309, 1960.
- Wu and Hamada (2011) C. F. Jeff Wu and Michael S. Hamada. Experiments: planning, analysis, and optimization. John Wiley & Sons, New York, 2011.