Learning causal effects from many randomized experiments using regularized instrumental variables

01/04/2017 ∙ by Alexander Peysakhovich, et al.

Scientific and business practices are increasingly resulting in large collections of randomized experiments. Analyzed together, these collections can tell us things that individual experiments in the collection cannot. We study how to learn causal relationships between variables from the kinds of collections faced by modern data scientists: the number of experiments is large, many experiments have very small effects, and the analyst lacks metadata (e.g., descriptions of the interventions). Here we use experimental groups as instrumental variables (IV) and show that a standard method (two-stage least squares) is biased even when the number of experiments is infinite. We show how a sparsity-inducing $\ell_0$ regularization can, in a reversal of the standard bias–variance tradeoff in regularization, reduce bias (and thus error) of interventional predictions. Because we are interested in interventional loss minimization, we also propose a modified cross-validation procedure (IVCV) to feasibly select the regularization parameter. We show, using a trick from Monte Carlo sampling, that IVCV can be done using summary statistics instead of raw data. This makes our full procedure simple to use in many real-world applications.




1 Introduction

Randomized experiments (i.e. A/B tests, randomized controlled trials) are a popular practice in medicine, business, and public policy (Banerjee & Duflo, 2012; Kohavi et al., 2013). When decision-makers employ experimentation they have a far greater chance of learning true causal relationships and making good decisions than via observation alone (LaLonde, 1986; Meyer, 2015; Hemkens et al., 2016). However, a single experiment is often insufficient to learn about the causal mechanisms linking multiple variables — which in turn can be important for theory building and/or decision-making.

Consider the situation of an internet service for watching videos. The firm is interested in how watching different types of videos (e.g., funny vs. serious, short vs. long) affects user behaviors (e.g., by increasing time spent on the site, inducing subscriptions, etc.). This will inform decisions about content recommendation or content acquisition. Even though the firm can measure all relevant variables, learning a model on observational data will likely be misleading. For example, existing content recommendation systems and heterogeneous user dispositions will produce strong correlations between exposure to many video types and time spent or subscription, but the magnitude of this correlation is not the response that the company can expect if it intervenes and changes the promotion or availability of videos. Thus, we are interested not just in prediction but in prediction under intervention (Bottou et al., 2013; Bottou, 2014; Pearl, 2009).

The standard solution here is to run a randomized experiment exposing some users to more of some type of video. However, a single A/B test will likely change many things in the complex system. It is hard to change the number of views of funny videos without affecting the number of views of serious videos or short videos. This problem is sometimes called ‘fat hand’ interventions because we touch multiple causal variables at once. This means the firm likely cannot learn a vector of causal effects (one for each video type) in such a simple manner. Thus, the company would need to use multiple A/B tests together (e.g., in a factorial design).

However, because routine product experimentation is common in internet companies (Bakshy et al., 2014; Varian, 2016; Kohavi et al., 2013), this firm has likely already run many A/B tests, including on the video recommendation algorithm. The method proposed in this paper can either be applied to a new set of experiments run explicitly to learn a causal effect vector (as in, e.g., Eckles et al., 2016), or can be applied to repurpose already run tests by treating them as random perturbations injected into the system and using that randomness in a smart way.

Our contributions arise from adapting the econometric method of instrumental variables (IV; Wright, 1928; Reiersöl, 1945; Angrist et al., 1996) to this setting. It is well known that a standard IV estimator, two-stage least squares (TSLS), is biased in finite samples (Stock et al., 2012; Angrist & Pischke, 2008). For our case, it also has asymptotic bias. We show that this bias depends on the distribution of the treatment effects in the set of experiments under consideration.

Our main technical contribution is to introduce a multivariate regularization into the first stage of the TSLS procedure and show that it can reduce the bias and MSE of estimated causal effects. Because in finite samples this regularization procedure reduces bias but adds variance, we introduce a method to select this regularization parameter which we call instrumental variables cross-validation (IVCV). In an empirical evaluation that combines simulation and data from hundreds of real randomized experiments, we show that the regularization with IVCV outperforms TSLS and a Bayesian random effects model.

Finally, we show how to perform this estimation in a computationally and practically efficient way. Like standard TSLS, our regularization and cross-validation procedures only require summary statistics at the level of experimental groups. This is advantageous when using raw data is computationally or practically burdensome, e.g., in the case of internet companies. This means the computational and data storage complexities of the method are actually quite low. In addition, standard A/B testing platforms (Bakshy et al., 2014; Xu et al., 2015) should already compute and store all the required statistics, so the method here can be thought of as an “upcycling” of existing statistics.

2 Confounding and the Basic IV Model

Suppose we have some (potentially vector-valued) random variable $X$ and a scalar-valued outcome variable $Y$. We want to ask: what happens to $Y$ if I change some component of $X$ by one unit, holding the rest constant? Formally, we study a linear structural (i.e. data generating) equation pair for $X$ and $Y$ in which the confounder $U$ and the two noise terms $\epsilon_X$ and $\epsilon_Y$ are independent random variables with mean 0, without loss of generality. Note that in A/B testing we are often interested in relatively small changes to the system, and thus we can just think about locally linear approximations to the true function. We can also consider basis expansions. We refer to $X$ as the causal variables (in our motivating example this would be a vector of time spent on each video type), $Y$ as the outcome variables (here overall user satisfaction), $U$ as the unobserved confounders, $\epsilon$ as noise, and $\beta$ as the causal effects.
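One way to write a structural pair that matches this description is sketched below. The coefficient symbols $\psi$ and $\gamma$ on the confounder, and the subscripts on the noise terms, are notation assumed here for illustration.

```latex
\begin{aligned}
X &= U\psi + \epsilon_X, \\
Y &= X\beta + U\gamma + \epsilon_Y .
\end{aligned}
```

Under this form with scalar variables, the observational regression slope of $Y$ on $X$ is $\beta + \gamma\psi\,\mathrm{Var}(U) / (\psi^2\,\mathrm{Var}(U) + \mathrm{Var}(\epsilon_X))$, so the sign and size of $\gamma\psi$ decide whether the naive estimate overshoots, undershoots, or flips sign.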

In general, we are interested in estimating the causal effect $\beta$ because we are interested in intervention, e.g., one which will change our data-generating model.

In the presence of unobserved confounders, $\beta$ is not identified and trying to learn causal relationships using predictive models naively can lead us astray (Bottou et al., 2013; Bottou, 2014; Shalit et al., 2016; Pearl, 2009). Suppose that we have observational data of the form $(X, Y)$ with $U$ completely unobserved. If we use this data to estimate the causal effect we can, due to the influence of the unobserved confounder, get an estimate that is (even in infinite samples) larger, smaller or even the opposite sign of the true causal effect (we describe this more fully in the Supplemental Material). Thus, the best predictor of $Y$ given $X$ may not lead to a good estimate of what would happen to $Y$ if we intervened.

We now discuss the instrumental variable (IV) estimator as a method for learning the causal effects. Suppose that we have some variable $Z$ that has two properties. First, $Z$ is not caused by anything in the system; that is, $Z$ is as good as randomly assigned. Second, $Z$ affects $Y$ only via $X$. This latter assumption is known as an exclusion restriction or complete mediation assumption. Formally, this adds a term in $Z$ to the structural equation for $X$ (see the Supplemental Material for the DAG representation).

The standard IV estimator for $\beta$ is two-stage least squares (TSLS) and works off the principle that the variance in $X$ can be broken down into two components. The first component is confounded with the true causal effect (i.e. comes from $U$). The second component, on the other hand, is independent of $U$. Thus, if we could regress $Y$ only on the random component, we could recover the causal effect $\beta$. Knowing $Z$ allows us to do exactly this (i.e. by using only the variation in $X$ caused by $Z$, not $U$).

TSLS can be thought of as follows: in the first stage we regress $X$ on $Z$. We then replace $X$ by the predicted values from the regression. In the second stage, we regress $Y$ on these fitted values.[1] It is straightforward to show that as the sample size approaches infinity this estimator converges to the true causal effect (Wooldridge, 2010, Theorem 5.1).

[1] We make an additional assumption: in order to estimate the effect of each variable in $X$ on $Y$ with the other $X$'s held constant it must be the case that $Z$ is such that it causes independent variation in all dimensions of $X$. This means that we must, at least, have as many instruments as the dimension of $X$ for TSLS to work.
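The two stages can be sketched for scalar variables in plain Python. The data-generating values here (`beta = 2.0`, `gamma = 3.0`, unit variances, a continuous instrument) are illustrative assumptions, not settings from the paper.

```python
import random


def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def tsls(z, x, y):
    """Two-stage least squares for a scalar instrument z and scalar x, y."""
    n = len(z)
    mz, mx = sum(z) / n, sum(x) / n
    a = slope(z, x)                            # first stage: regress x on z
    x_hat = [mx + a * (zi - mz) for zi in z]   # fitted values of x
    return slope(x_hat, y)                     # second stage: regress y on x_hat


# Simulated data with an unobserved confounder u (assumed values).
rng = random.Random(0)
beta, gamma, n = 2.0, 3.0, 20000
z = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [beta * xi + gamma * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

beta_ols = slope(x, y)      # confounded: drifts away from beta
beta_tsls = tsls(z, x, y)   # uses only the variation in x induced by z
print(beta_ols, beta_tsls)
```

In this setup the observational slope absorbs the confounder's contribution, while the TSLS estimate stays close to the assumed `beta`.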

3 IV with Test Groups without Metadata

In our setting of interest, randomly assigned groups from a large collection of experiments are the instruments. That is, the IV is a categorical variable indicating which of $K$ test groups a unit (e.g., user) was assigned to in one of many experiments. For simplicity of notation, we assume that each treatment group has exactly $N$ units assigned to it at random.

3.1 Computational Properties

The way to represent the first stage regression of the TSLS is to use the one-hot representation (or dummy-variable encoding) of the group which each unit is assigned to, such that $Z$ is a $K$-dimensional vector of 0s and a single 1 indicating the randomly assigned group.

In this setup the TSLS estimator has a very convenient form. The first stage regression of $X$ on $Z$ simply yields estimates that are group level means of $X$ in each group. This means that if each group has the same number of units (e.g., users) and the same error variance, the second stage has a convenient form as well: we can recover $\beta$ by simply regressing group level averages of $Y$ on group level averages of $X$ (Angrist & Pischke, 2008, section 4.1.3).

Thus, to estimate causal effects from large meta-analyses practitioners do not need to retain or compute with the raw data (which can span millions or billions of rows in the context of A/B testing at a medium or large internet company), but rather can retain and compute with sample means of $X$ and $Y$ in each A/B test group (this is now just thousands of rows of data). These are quantities that are already recorded in most automated A/B testing systems (Bakshy et al., 2014; Xu et al., 2015). Working with summary statistics simplifies computation enormously and allows us to reuse existing data.
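A small check of this equivalence, assuming equal group sizes and scalar variables: the second stage fitted on unit-level rows (with each unit's $X$ replaced by its group mean) gives the same slope as a regression on the $K$ group means. All numeric settings below are illustrative assumptions.

```python
import random


def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(1)
K, N, beta = 50, 40, 1.5          # groups, units per group, causal effect
group, x, y = [], [], []
for k in range(K):
    mu_k = rng.gauss(0, 1)        # group k's true first-stage effect
    for _ in range(N):
        u = rng.gauss(0, 1)       # unobserved confounder
        group.append(k)
        x.append(mu_k + u)
        y.append(beta * x[-1] + 2.0 * u + rng.gauss(0, 1))

# First stage on a one-hot Z: the fitted value is the unit's group mean.
# Units were appended group by group, so each group is a contiguous slice.
x_bar = [sum(x[k * N:(k + 1) * N]) / N for k in range(K)]
y_bar = [sum(y[k * N:(k + 1) * N]) / N for k in range(K)]
x_hat = [x_bar[g] for g in group]

beta_unit_level = slope(x_hat, y)      # second stage on K * N raw rows
beta_group_level = slope(x_bar, y_bar)  # same fit on K summary rows
print(beta_unit_level, beta_group_level)
```

The two estimates agree up to floating-point error, which is why only the group means need to be stored.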

3.2 Asymptotic Bias in the Grouped IV Estimator

There are now multiple ways to think about the asymptotic properties of this “groups as IVs” estimator. Either we increase the size of each experiment ($N \to \infty$) or we get more experiments ($K \to \infty$). The former is the standard asymptotic sequence, but for meta-analysis of a growing collection of experiments, the latter is the more natural asymptotic sequence, so we fix $N$ but we raise $K$.

We fix ideas with the case where $X$ and $Y$ are scalar. We denote the group level means of our variables with bars (e.g., $\bar{X}$ is the random variable that is the group-level mean of $X$). Recall that our TSLS is, in the group case, a regression of $\bar{Y}$ on $\bar{X}$.

Decompose the causal variable group level average $\bar{X}$ into the true first stage of the IV model (i.e. what we are trying to learn in the first stage of the TSLS) plus group-level noise. In the case of experiments as instruments the first-stage term has a nice interpretation: it is the true average value of the causal variables when assigned to that experimental group.

While we are not considering asymptotic sequences where $N$ goes to infinity, $N$ will generally be large enough that we can use the normality of sample means guaranteed by the central limit theorem. Thus, the group-level noise terms in $\bar{X}$ and $\bar{Y}$ are approximately normal with mean 0 and variance proportional to $1/N$.

With finite $N$ we can show that, even as $K \to \infty$, TSLS will be biased (cf. Bekker, 1994; Angrist & Krueger, 1995). Suppose for intuition that the true first stage has mean 0 and finite variance; this bias then has a closed form (see Supplemental Materials for a derivation of the general form).

To understand where this bias comes from, think about the case where the true first stage is always 0. The instrument does nothing; however, the group-level averages still include group-level confounding noise, which for finite $N$ has positive variance. Thus, we simply recover the original observational estimate that we have already discussed as including omitted variable bias. When the first stage is not degenerate, $\bar{X}$ and $\bar{Y}$ include variation from both the instrument and the confounder. As $N$ increases the influence of the confounding noise decreases and so TSLS is consistent for $\beta$ in that limit.[2]

[2] While in many cases, where variation induced by instrumental variables is large, this bias can be safely ignored, in the case of online A/B testing this is likely not the case. Since much of online experimentation involves hill climbing and small improvements (on the order of a few percent or less) that add up, the TSLS estimator can be quite biased in practice (more on this below).
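The bias can be checked numerically on simulated group means. The sketch assumes scalar variables with $\bar{X}_k = \mu_k + \bar{u}_k$ and $\bar{Y}_k = \beta \bar{X}_k + \bar{v}_k$, where $\mu_k$ is the true first stage and $(\bar{u}_k, \bar{v}_k)$ are group means of correlated unit-level noise. These symbols and all numeric values are assumptions made for illustration. Under them the large-$K$ limit of the group-level slope is $\beta + \sigma_{uv} / (N\sigma_\mu^2 + \sigma_u^2)$, which may be written differently in the Supplemental Materials.

```python
import random


def slope(x, y):
    """OLS slope of y on x, with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def predicted_bias(cov_uv, var_u, var_mu, n):
    """Large-K bias of grouped TSLS under the assumed scalar model."""
    return cov_uv / (n * var_mu + var_u)


rng = random.Random(2)
beta, N, K = 1.0, 100, 20000
sd_mu = 0.1                       # small true effects, as in many A/B tests
x_bar, y_bar = [], []
for _ in range(K):
    mu = rng.gauss(0, sd_mu)
    # Group means of unit-level noise: var(u) = var(v) = 1, cov(u, v) = 0.8.
    u_bar = rng.gauss(0, 1 / N ** 0.5)
    v_bar = 0.8 * u_bar + rng.gauss(0, 0.6 / N ** 0.5)
    xb = mu + u_bar
    x_bar.append(xb)
    y_bar.append(beta * xb + v_bar)

beta_hat = slope(x_bar, y_bar)
beta_limit = beta + predicted_bias(0.8, 1.0, sd_mu ** 2, N)
print(beta_hat, beta_limit)
```

With these settings the effect variance times $N$ is about the same size as the noise variance, so the estimate stays far from `beta` no matter how many groups are added.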

4 Bias-Reducing Regularization

We now introduce a regularization procedure that can decrease bias in the TSLS estimator. We show that, in this setting, an $\ell_0$-regularized first stage is computationally feasible and can help reduce this bias under some conditions on the distribution of the latent treatment effects.

4.1 Intuition via a Mixture Model

There are many types of A/B tests conducted: some are micro-optimizations at the margin and some are larger explorations of the action space. Consider the stylized case with two types of tests, calling the smaller variance type ‘weak’ tests and the larger variance type ‘strong’ tests, where the type gives the distribution from which a test's treatment effects are drawn. That is, the true first-stage effect of each group is drawn from a two-component mixture model: with some probability it comes from the low-variance (weak) component, and otherwise it comes from the high-variance (strong) component.

Notice that if we ran TSLS using only groups whose first-stage effect is drawn from a single component, then our estimator would converge to a limit whose bias shrinks as that component's variance grows. Because the strong component has the larger variance, the strong-only estimator is less biased than the weak-only estimator. If we don’t know which test is of which type and simply run a TSLS on the full data set, we will get some estimator that will be a weighted combination of these two quantities. Thus, with a sufficient number of groups, we can actually improve our causal estimate by using less data (i.e. only the strong tests). Of course when the number of tests is finite we face a bias–variance tradeoff.

Within this discrete mixture model, we are limited in how much we can reduce bias (since the variance of the strong component is finite). However, suppose that the treatment effects are drawn from a distribution which is an infinite mixture of normals that has full support on normals of all variances, such as a $t$ distribution. Then we can asymptotically (in the large $K$ sense) reduce the bias below any $\epsilon > 0$ by using only observations which come from components with arbitrarily large variances. We now introduce a regularization procedure to do this.
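The ordering of the weak-only, pooled, and strong-only limits can be worked out with a few lines of arithmetic. This sketch reuses an assumed scalar bias expression, $\sigma_{uv} / (N\sigma_\mu^2 + \sigma_u^2)$, where $\sigma_\mu^2$ is the variance of the true first-stage effects among the groups used. The variances and mixture weight are hypothetical.

```python
def asymptotic_bias(var_mu, n, var_u=1.0, cov_uv=0.8):
    """Large-K bias of grouped TSLS when effects have variance var_mu."""
    return cov_uv / (n * var_mu + var_u)


N = 100                 # units per group (assumed)
var_weak = 0.001        # effect variance of the weak component (assumed)
var_strong = 0.1        # effect variance of the strong component (assumed)
p_weak = 0.9            # share of groups that are weak tests (assumed)

bias_weak = asymptotic_bias(var_weak, N)
bias_strong = asymptotic_bias(var_strong, N)

# Pooling all groups: a zero-mean mixture has the weighted-average variance.
var_pooled = p_weak * var_weak + (1 - p_weak) * var_strong
bias_pooled = asymptotic_bias(var_pooled, N)

print(bias_weak, bias_pooled, bias_strong)
```

The pooled bias sits between the two single-component values, so discarding the weak tests lowers the limit even though it uses fewer groups.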

4.2 Formalizing First Stage Regularization

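Consider a data set of vectors $(\bar{X}_k, \bar{Y}_k)$ of group-level averages. Let $p_k$ be the $p$-value for the group-level observation $\bar{X}_k$ under a ‘no intervention’ null in which the group's true first-stage effect is zero. These are straightforward to compute from the observational (i.e., within control condition) variance (or covariance matrix) of $X$. For a given threshold, the regularized first stage keeps $\bar{X}_k$ for groups whose $p$-value falls below the threshold and sets the first-stage estimate to zero for all other groups.

We then define the regularized IV estimator as the second-stage regression of $\bar{Y}$ on these thresholded first-stage estimates. Thus, this procedure is equivalent to an $\ell_0$ regularization in the first stage of the TSLS regression. In particular, when $\bar{X}$ has a normal distribution, as in the present case, this is equivalent to $\ell_0$-regularized least squares.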

Recall that in the binary mixture example above, this regularization would preferentially retain groups that come from the higher variance (strong) component. This extends to infinite mixtures, such as the $t$, where this procedure will preferentially set the first-stage estimate to zero for groups whose effect is drawn from a lower variance component.

So far we have focused on scalar $X$. This procedure naturally extends to multidimensional settings. Compute the $p$-value for the whole vector $\bar{X}_k$ and simultaneously threshold all dimensions of the experimental group $k$; that is, if this probability is above a threshold we set the whole vector to 0. This is thus a group-$\ell_0$ regularizer.[3]

[3] We note that this group-$\ell_0$ regularization is inefficient if treatment effects are such that each A/B test only moves a single dimension of $X$ (i.e. ‘skinny hand’ interventions). In our evaluation we see that it works in real world applications; however, it is an interesting question for future research to learn its limitations. See the Supplemental Material for additional simulations and discussion.
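A minimal scalar sketch of the thresholded estimator on group summaries follows. It assumes a known within-control standard deviation for $X$, a two-sided normal $p$-value, and a no-intercept second stage (so zeroed groups simply drop out of the sums). The helper names, the threshold `0.001`, and the simulation settings are hypothetical.

```python
import math
import random


def two_sided_p(x_bar, se):
    """p-value of a group mean under the null of a zero true effect."""
    return math.erfc(abs(x_bar) / (se * math.sqrt(2.0)))


def thresholded_iv(x_bar, y_bar, se, q):
    """Zero the first stage for groups with p > q, then regress y_bar on it."""
    sxx = sxy = 0.0
    kept = 0
    for x, y in zip(x_bar, y_bar):
        if two_sided_p(x, se) <= q:   # group survives the threshold
            sxx += x * x
            sxy += x * y
            kept += 1
    return sxy / sxx, kept


def student_t(rng, df):
    """Student-t draw as a normal divided by a scaled chi-square root."""
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)


rng = random.Random(3)
beta, N, K = 1.0, 100, 20000
se = 1.0 / math.sqrt(N)               # sd of a group mean, unit-level sd = 1
x_bar, y_bar = [], []
for _ in range(K):
    mu = 0.1 * student_t(rng, 3)      # heavy-tailed true effects (assumed)
    u_bar = rng.gauss(0.0, se)
    v_bar = 0.8 * u_bar + rng.gauss(0.0, 0.6 * se)
    x_bar.append(mu + u_bar)
    y_bar.append(beta * (mu + u_bar) + v_bar)

b_tsls, n_all = thresholded_iv(x_bar, y_bar, se, 1.0)     # keeps every group
b_reg, n_kept = thresholded_iv(x_bar, y_bar, se, 0.001)   # strong threshold
print(b_tsls, n_all, b_reg, n_kept)
```

For vector-valued $X$ the same idea would apply with a joint test statistic per group in place of the scalar $p$-value.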

5 Causal Cross-Validation

We now turn to an important practical question: because there is a bias–variance tradeoff, how should one set the regularization parameter when $K$ is finite to optimize for prediction under intervention?

First, let us suppose that we have access to the raw data, where each row $(X_i, Y_i, Z_i)$ records a unit $i$'s causal variables, outcome, and treatment assignment. We propose a procedure to set our hyperparameter, the $p$-value threshold. We describe the 2-fold version as it conveys the full intuition, but extension to $k$ folds is straightforward.

Instrumental variables cross-validation algorithm (IVCV):

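  1. Split each treatment in the data set into 2 folds, giving two new data sets.

  2. Compute treatment level averages of $X$ and $Y$ within each fold, as described above, indexed by experimental group.

  3. Compute the regularized estimate $\hat{\beta}$ for a variety of thresholds using one fold.

  4. For each threshold, compute treatment level predictions of $\bar{Y}$ from one fold's $\bar{X}$ and the corresponding $\hat{\beta}$.

  5. Choose the threshold which minimizes the squared error between these predictions and the treatment level averages of $Y$ in the other fold.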

The intuition behind IVCV is similar to the main idea behind IV in general. Recall that our objective is to use variation in $X$ that is not caused by $U$. The IVCV algorithm uses the $X$ value from one fold and compares the prediction to the $Y$ value in the other fold, because the two folds share a $Z$ but differ in $U$ (since $U$ is independent across units but $Z$ is the same within group). This intuition has been exploited in split-sample based estimators (Angrist & Krueger, 1995; Imbens et al., 1999; Hansen & Kozbur, 2014).
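The five steps can be sketched on fold-level group means for scalar $X$. Which fold supplies which quantity is an assumption here: the estimate and the $\bar{X}$ used for prediction come from fold 1, and the loss is measured against $\bar{Y}$ from fold 2. The function names, threshold grid, and simulation settings are hypothetical.

```python
import math
import random


def two_sided_p(x_bar, se):
    """p-value of a group mean under the null of a zero true effect."""
    return math.erfc(abs(x_bar) / (se * math.sqrt(2.0)))


def iv_estimate(x_bar, y_bar, se, q):
    """Thresholded, no-intercept IV fit; assumes some group has p <= q."""
    kept = [(x, y) for x, y in zip(x_bar, y_bar) if two_sided_p(x, se) <= q]
    return sum(x * y for x, y in kept) / sum(x * x for x, _ in kept)


def ivcv(x1, y1, y2, se, grid):
    """Pick the threshold whose fold-1 predictions best match fold-2 y."""
    best = None
    for q in grid:
        b = iv_estimate(x1, y1, se, q)                    # step 3
        loss = sum((yb - b * xa) ** 2                     # steps 4 and 5
                   for xa, yb in zip(x1, y2)) / len(x1)
        if best is None or loss < best[0]:
            best = (loss, q, b)
    return best[1], best[2]


def student_t(rng, df):
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)


rng = random.Random(4)
beta, n_fold, K = 1.0, 50, 20000      # 50 units per fold in every group
se = 1.0 / math.sqrt(n_fold)
x1, y1, y2 = [], [], []
for _ in range(K):
    mu = 0.1 * student_t(rng, 3)      # shared by both folds of a group
    u1, u2 = rng.gauss(0.0, se), rng.gauss(0.0, se)   # fold-specific noise
    v1 = 0.8 * u1 + rng.gauss(0.0, 0.6 * se)
    v2 = 0.8 * u2 + rng.gauss(0.0, 0.6 * se)
    x1.append(mu + u1)
    y1.append(beta * (mu + u1) + v1)
    y2.append(beta * (mu + u2) + v2)

grid = [1.0, 0.1, 0.01, 0.001, 0.0001]
q_star, b_star = ivcv(x1, y1, y2, se, grid)
b_tsls = iv_estimate(x1, y1, se, 1.0)
print(q_star, b_star, b_tsls)
```

Because fold 2's confounding noise is independent of fold 1's, an estimate inflated by fold-1 confounding is penalized rather than rewarded. Fold 1's $\bar{X}$ still carries its own noise, so this loss only approximates the causal loss.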

Figure 1: Comparison of stagewise vs. IVCV method. X-axis is the strength of regularization (lower $p$-value implies stronger regularization). Optimizing for stagewise loss would imply using almost no regularization whereas optimizing for IVCV loss implies strong regularization. Causal loss coincides much more with IVCV loss than stagewise loss.

We can demonstrate the importance of using the full causal loss by comparing the IVCV procedure to two other candidates. The first is simply applying naive CV in the second stage (i.e., splitting each group into 2, training a model on fold 1, and computing the CV loss naively from how well it predicts $\bar{Y}$ from $\bar{X}$ in fold 2). The second is stagewise, in which the regularization parameter is chosen to minimize MSE in the first stage, and then the second stage is fit conditional on the selected model (as in Belloni et al., 2012; Hartford et al., 2016). We compare these approaches in a simple linear model with scalar $X$, with treatment effects drawn from a $t$ distribution with 3 degrees of freedom.

Figure 1 shows naive (second stage) CV loss, first stage CV loss, true causal loss, and IVCV loss as a function of the first stage regularization parameter, averaged over simulations of the model above. We see that both the first stage loss curve and the naive CV loss curve look very different from the causal loss curve. However, the IVCV loss curve matches almost exactly. Thus, minimizing the error of either stage naively yields a very different objective function from minimizing the causal error. In particular, we see that the bias–variance tradeoff made for the first stage need not coincide with a desirable bias–variance tradeoff for causal inference.

The $\ell_0$-regularized IV estimator only requires summary statistics per experimental group that are already routinely computed in the course of running A/B tests. However, IVCV as specified above uses raw data. In the Supplemental Material we show that IVCV can also be implemented using only summary statistics. This is because the distribution of two normal random variables, conditional on the normal random variable they sum to, has a closed form from which it is easy to sample. Thus, the full procedure is implementable using a highly compressed form of the original data.
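One way to see how fold means can be simulated from a stored group mean is the scalar sketch below. It assumes approximately normal fold means, a known unit-level standard deviation `sd`, and two equal folds of $n/2$ units. Under those assumptions each fold mean equals the group mean plus or minus a deviation drawn from $N(0, \mathrm{sd}^2/n)$. The function name is hypothetical, and the Supplemental Material's procedure may differ in detail.

```python
import math
import random


def sample_fold_means(group_mean, sd, n, rng):
    """Draw two half-sample means consistent with an observed group mean.

    With fold means A and B each averaging n / 2 units, (A + B) / 2 is the
    group mean and (A - B) / 2 is independent of it with variance sd^2 / n.
    """
    d = rng.gauss(0.0, sd / math.sqrt(n))
    return group_mean + d, group_mean - d


rng = random.Random(5)
group_mean, sd, n = 0.25, 2.0, 100      # stored summary statistics (assumed)

fold_a, fold_b = sample_fold_means(group_mean, sd, n, rng)
print(fold_a, fold_b)

# Repeated draws recover the implied variance of the fold deviation.
draws = [sample_fold_means(group_mean, sd, n, rng)[0] - group_mean
         for _ in range(20000)]
var_d = sum(v * v for v in draws) / len(draws)
print(var_d, sd ** 2 / n)
```

For the joint $(\bar{X}, \bar{Y})$ case the deviation would be drawn from a bivariate normal with the within-group covariance matrix divided by $n$, so that the $X$ and $Y$ fold means stay correlated.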

6 Evaluation

We now evaluate these procedures empirically. True causal effects in real data are generally unobservable, so comparisons of methods usually lack a gold standard.[4] On the other hand, simulations allow us to know the true causal effects, but can lack realism. We strike a middle ground by using simulations where we set the causal effects ourselves but other joint distributions are determined by a collection of real randomized experiments. In the model used for these simulations, all the variance in $X$ that is not driven by our instruments is confounding variance.

[4] Examples of the kinds of evaluations usually done include: comparing different observational procedures to what is estimated by an experiment or comparing different procedures and showing that one yields estimates which are more ‘reasonable.’

6.1 Data

The multivariate case is made difficult and interesting when $U$ has a non-diagonal covariance matrix and the true first stage has some unknown underlying distribution, so we generate these distributions from real data derived from 798 randomly assigned test groups from a sample of Facebook A/B tests.[5] We define our endogenous, causal $X$s as 7 key performance indicators (i.e. intermediate outcomes examined by decision-makers and analysts); we standardize these to have mean 0 and variance 1. As the distribution of $U$ we use the estimated covariance matrix among these outcomes in observational data. Finally, we take the experiment-level empirical means of the $X$s as the true first stage, to which we add the confounding noise according to the distribution of $U$.

[5] Note that we use the collection of A/B tests only to generate a distribution for our first stage. In the Supplemental Material we also consider the IVCV procedure in several completely synthetic data sets. The synthetic data allows us to elucidate the important assumptions for our procedure to work while the main evaluation shows that these assumptions are indeed satisfied in real world conditions.

Figure 2: A) Two dimensions of the multivariate means for sampled test groups. B) QQ-plots for the dimensions of the sampled test groups. The marginal distributions are notably non-normal.

We show a projection of these onto two of the dimensions in Figure 2(A). We see that the A/B tests appear to have correlated effects but do span both dimensions independently, many groups are retained even with strong first stage regularization, and the distribution has much more pronounced extremes than would be expected under a Gaussian model. Figure 2(B) compares the observed and Gaussian quantiles, illustrating that all dimensions are notably non-normal (Shapiro–Wilk tests reject normality for every dimension).

We set $\beta$ as the vector of ones and the confounding coefficients as a diagonal matrix with elements of alternating sign, so that there is both positive and negative confounding. For each simulated data set, we compute the causal mean squared error for $\hat{\beta}$; that is, the expected risk from intervening on one of the causal variables at random. If $\hat{\beta}$ is our estimated vector then this is the squared error $(\hat{\beta}_j - \beta_j)^2$ averaged over the dimensions $j$ of $X$.

Figure 3: A) Causal error (relative to a naive observational estimator) for the full $\ell_0$-regularization path (solid black), TSLS (solid red), IVCV selected parameters (dashed purple) and Bayesian random effects model (dashed teal). IVCV outperforms all other estimation techniques. B) Error in estimating causal effects for varying numbers of test groups $K$. IVCV is useful even with a relatively small meta-analysis, while TSLS exhibits asymptotic bias. With a very small number of test groups, the Oracle can actually underperform TSLS because of near collinearity.

6.2 Results

In addition to the $\ell_0$-regularized IV method and TSLS, we examine a Bayesian random effects model, as in Chamberlain & Imbens (2004) but with a $t$, rather than Gaussian, distribution for the instruments, using a standard prior from the literature. We also give the model the true covariance matrix for $U$. To fit the model we use Stan (Carpenter et al., 2016). We compare the Bayesian random effects model and our regularized IV model to the infeasible Oracle estimator where the estimate of the first stage is known with certainty.

Figure 3(A) shows the results for various dimensions of $X$ for 1,000 simulations. Because of the high level of confounding in the observational data, the observational (OLS) estimates of the causal effect are highly biased, such that even the standard TSLS substantially decreases our causal MSE. We see that the $\ell_0$-regularization path (black line) reduces error compared with TSLS and, with high regularization, approaches the Oracle estimator. Furthermore, feasible selection of this hyperparameter using IVCV leads to near optimal performance (purple line). The Bayesian random effects model can reduce bias, but substantially increases variance and thus MSE.

We also look at how large the collection of experimental groups needs to be to see advantages of a regularized estimator relative to a TSLS procedure. We repeat the TSLS, Oracle, and $\ell_0$-regularization with IVCV analyses in 100 simulations with smaller $K$ (Figure 3(B)) for a single fixed dimension of $X$. Intuitively, what is important is the relative size of the tails of the distribution of the latent treatment effects. As the tails get fatter, fewer experiments are required to get draws from the more extreme components of the mixture. We see that in this realistic case where the first stage is determined using a sampled set of Facebook A/B tests, feasible selection of the $\ell_0$-regularization hyperparameter using IVCV outperforms TSLS substantially for many values of $K$. Thus, meta-analyses of even relatively small collections of experiments can be improved by the first-stage regularization.

7 Conclusion

Most analyses of randomized experiments, whether in academia, business, or public policy, tend to look at each trial in isolation. When meta-analyses of experiments are conducted, these usually either pool data about multiple instances of the same intervention or look for heterogeneity in the effects of interventions across settings or methods (e.g., Hemkens et al., 2016). We instead propose that combining many experiments can help us learn richer causal relationships that are not identified by any single experiment. IV models give a way of doing this pooling. We have shown that in such situations using easily implemented regularization reduces bias and total error in estimating causal effects, and thus produces better predictions about interventions, than using standard TSLS methods.

We expand on the literature which uses multi-condition experiments as instruments (Eckles et al., 2016; Goldman & Rao, 2014). Such analyses feature a smaller number of experimental groups and a single causal variable. Our work is also related to research on IV estimation with weak instruments (Stock et al., 2012; Staiger & Stock, 1997; Stock & Yogo, 2005). In addition, we contribute to existing research on regularized IV estimation (Belloni et al., 2012; Hansen & Kozbur, 2014; Chamberlain & Imbens, 2004). Our application domain motivates introducing a group-$\ell_0$ regularization and a feasible and data-efficient cross-validation procedure, while previous techniques have used naive stagewise cross-validation.

The present work is part of a growing literature on machine learning techniques and causality (Bottou, 2014), much of which has focused on learning causal graphs (Pearl, 2009), observational causal inference (Shalit et al., 2016), heterogeneous treatment effects (Grimmer et al., 2014; Athey & Imbens, 2016; Peysakhovich & Lada, 2016), or contextual bandit problems (Agarwal et al., 2014; Dudík et al., 2014; Swaminathan & Joachims, 2015), but only more recently on instrumental variables methods (Hartford et al., 2016; Peters et al., 2016).


  • Agarwal et al. (2014) Agarwal, Alekh, Hsu, Daniel, Kale, Satyen, Langford, John, Li, Lihong, and Schapire, Robert. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646, 2014.
  • Angrist & Krueger (1995) Angrist, Joshua D and Krueger, Alan B. Split-sample instrumental variables estimates of the return to schooling. Journal of Business & Economic Statistics, 13(2):225–235, 1995.
  • Angrist & Pischke (2008) Angrist, Joshua D and Pischke, Jörn-Steffen. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton university press, 2008.
  • Angrist et al. (1996) Angrist, Joshua D, Imbens, Guido W, and Rubin, Donald B. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455, 1996.
  • Athey & Imbens (2016) Athey, Susan and Imbens, Guido. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
  • Bakshy et al. (2014) Bakshy, E., Eckles, D., and Bernstein, M. S. Designing and deploying online field experiments. In Proceedings of the 23rd ACM conference on the World Wide Web. ACM, 2014.
  • Banerjee & Duflo (2012) Banerjee, Abhijit and Duflo, Esther. Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty. PublicAffairs, 2012.
  • Bekker (1994) Bekker, Paul A. Alternative approximations to the distributions of instrumental variable estimators. Econometrica: Journal of the Econometric Society, pp. 657–681, 1994.
  • Belloni et al. (2012) Belloni, Alexandre, Chen, Daniel, Chernozhukov, Victor, and Hansen, Christian. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429, 2012.
  • Bottou (2014) Bottou, Léon. From machine learning to machine reasoning. Machine Learning, 94(2):133–149, 2014.
  • Bottou et al. (2013) Bottou, Léon, Peters, Jonas, Candela, Joaquin Quinonero, Charles, Denis Xavier, Chickering, Max, Portugaly, Elon, Ray, Dipankar, Simard, Patrice Y, and Snelson, Ed. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(1):3207–3260, 2013.
  • Carpenter et al. (2016) Carpenter, Bob, Gelman, Andrew, Hoffman, Matt, Lee, Daniel, Goodrich, Ben, Betancourt, Michael, Brubaker, Michael A, Guo, Jiqiang, Li, Peter, and Riddell, Allen. Stan: A probabilistic programming language. Journal of Statistical Software, 2016.
  • Chamberlain & Imbens (2004) Chamberlain, Gary and Imbens, Guido. Random effects estimators with many instrumental variables. Econometrica, 72(1):295–306, 2004.
  • Dudík et al. (2014) Dudík, Miroslav, Erhan, Dumitru, Langford, John, Li, Lihong, et al. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.
  • Eckles et al. (2016) Eckles, Dean, Kizilcec, René F, and Bakshy, Eytan. Estimating peer effects in networks with peer encouragement designs. Proceedings of the National Academy of Sciences, 113(27):7316–7322, 2016.
  • Goldman & Rao (2014) Goldman, Mathew and Rao, Justin M. Experiments as instruments: Heterogeneous position effects in sponsored search auctions. Available at SSRN 2524688, 2014.
  • Grimmer et al. (2014) Grimmer, Justin, Messing, Solomon, and Westwood, Sean J. Estimating heterogeneous treatment effects and the effects of heterogeneous treatments with ensemble methods. Unpublished manuscript, Stanford University, Stanford, CA, 2014.
  • Hansen & Kozbur (2014) Hansen, Christian and Kozbur, Damian. Instrumental variables estimation with many weak instruments using regularized JIVE. Journal of Econometrics, 182(2):290–308, 2014.
  • Hartford et al. (2016) Hartford, Jason, Lewis, Greg, Leyton-Brown, Kevin, and Taddy, Matt. Counterfactual prediction with deep instrumental variables networks. arXiv preprint arXiv:1612.09596, 2016.
  • Hemkens et al. (2016) Hemkens, Lars G, Contopoulos-Ioannidis, Despina G, and Ioannidis, John PA. Agreement of treatment effects for mortality from routinely collected data and subsequent randomized trials: Meta-epidemiological survey. British Medical Journal, 352, 2016.
  • Imbens et al. (1999) Imbens, Guido, Angrist, Joshua, and Krueger, Alan. Jackknife instrumental variables estimation. Journal of Applied Econometrics, 14(1), 1999.
  • Kohavi et al. (2013) Kohavi, Ron, Deng, Alex, Frasca, Brian, Walker, Toby, Xu, Ya, and Pohlmann, Nils. Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1168–1176. ACM, 2013.
  • LaLonde (1986) LaLonde, Robert J. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pp. 604–620, 1986.
  • Meyer (2015) Meyer, Michelle N. Two cheers for corporate experimentation: The A/B illusion and the virtues of data-driven innovation. J. on Telecomm. & High Tech. L., 13:273, 2015.
  • Owen (2016) Owen, Art B. Monte Carlo Theory, Methods and Examples. 2016. URL http://statweb.stanford.edu/~owen/mc/.
  • Pearl (2009) Pearl, Judea. Causality. Cambridge University Press, 2009.
  • Peters et al. (2016) Peters, Jonas, Bühlmann, Peter, and Meinshausen, Nicolai. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012, 2016.
  • Peysakhovich & Lada (2016) Peysakhovich, Alexander and Lada, Akos. Combining observational and experimental data to find heterogeneous treatment effects. arXiv preprint arXiv:1611.02385, 2016.
  • Peysakhovich & Naecker (forthcoming) Peysakhovich, Alexander and Naecker, Jeffrey. Machine learning and behavioral economics: Evaluating models of choice under risk and ambiguity. Journal of Economic Behavior and Organization, forthcoming.
  • Reiersöl (1945) Reiersöl, Olav. Confluence analysis by means of instrumental sets of variables. PhD thesis, Stockholm College, 1945.
  • Shalit et al. (2016) Shalit, Uri, Johansson, Fredrik, and Sontag, David. Bounding and minimizing counterfactual error. arXiv preprint arXiv:1606.03976, 2016.
  • Staiger & Stock (1997) Staiger, Douglas and Stock, James H. Instrumental variables regression with weak instruments. Econometrica, pp. 557–586, 1997.
  • Stock & Yogo (2005) Stock, James H and Yogo, Motohiro. Testing for weak instruments in linear IV regression. In Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg, pp. 80–108. Cambridge University Press, 2005.
  • Stock et al. (2012) Stock, James H, Wright, Jonathan H, and Yogo, Motohiro. A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics, 2012.
  • Swaminathan & Joachims (2015) Swaminathan, Adith and Joachims, Thorsten. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML, pp. 814–823, 2015.
  • Varian (2016) Varian, Hal. Intelligent technology. Finance and Development, 53(3), 2016.
  • Wooldridge (2010) Wooldridge, Jeffrey M. Econometric Analysis of Cross Section and Panel Data. MIT Press, 2010.
  • Wright (1928) Wright, Philip Green. The Tariff on Animal and Vegetable Oils. The Macmillan Co., 1928.
  • Xu et al. (2015) Xu, Ya, Chen, Nanyu, Fernandez, Addrian, Sinno, Omar, and Bhasin, Anmol. From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2227–2236. ACM, 2015.

8 Supplemental Material

8.1 Confounding in a Linear Model

Figure 4: DAG representing our structural equations, in which the relationship between $X$ and $Y$ is confounded by $U$, and including the instrumental variable $Z$. Crosses represent causal relationships that are ruled out by the IV assumptions.

Consider the linear structural equation pair from the main text:

where these variables have mean 0 and finite variances and .

Suppose that we only observe $(X, Y)$, where both are scalar. Since the underlying model is linear, we can try to estimate it using a linear regression. However, not including the confounder $U$ in the regression yields the estimator:


When all variables are scalar, algebra yields
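As a numerical check of the scalar case, the following is a minimal sketch that assumes the structural pair $X = U + \varepsilon_X$, $Y = \beta X + \gamma U + \varepsilon_Y$ with independent, mean-zero Gaussian terms. This parameterization, the function names, and all parameter values are assumptions made for illustration. Under it, the regression of $Y$ on $X$ alone converges to $\beta + \gamma \sigma_U^2 / (\sigma_U^2 + \sigma_{\varepsilon_X}^2)$ rather than to $\beta$.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from regressing y on x with an intercept: Cov(x, y) / Var(x)."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    return float(x_c @ y_c / (x_c @ x_c))


def simulate_confounded_ols(beta=1.0, gamma=2.0, sd_u=1.0, sd_ex=1.5,
                            n=400_000, seed=0):
    """Simulate the assumed scalar model and return (OLS slope, closed-form limit)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, sd_u, n)                         # unobserved confounder U
    x = u + rng.normal(0.0, sd_ex, n)                    # assumed: X = U + eps_X
    y = beta * x + gamma * u + rng.normal(0.0, 1.0, n)   # assumed: Y = beta X + gamma U + eps_Y
    estimate = ols_slope(x, y)                           # regression that omits U
    limit = beta + gamma * sd_u**2 / (sd_u**2 + sd_ex**2)
    return estimate, limit


estimate, limit = simulate_confounded_ols()
print(f"OLS slope omitting U: {estimate:.3f}; closed-form limit: {limit:.3f}")
```

With these illustrative values the slope settles well above the true $\beta = 1$, and drawing more samples does not shrink the gap, because the bias comes from the omitted confounder rather than from sampling noise.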

8.2 Derivation of the Group IV Bias

Let us use the convention from the main text and denote by $\bar{X}$ the group-level mean of variable $X$. This means we get

Since the TSLS estimator in this case is a regression of $\bar{Y}$ on $\bar{X}$, we can use the equation derived above for the scalar case to rewrite
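The group-level argument can be illustrated with a small simulation. This is a sketch under assumptions that are not taken from the text: each unit in group $j$ follows $X = \mu_j + U + \varepsilon_X$ and $Y = \beta X + \gamma U + \varepsilon_Y$, first-stage effects $\mu_j$ are Gaussian with variance `var_mu`, and every group has `n` units. Group means are drawn directly from their Gaussian distributions instead of simulating raw units. Applying the scalar confounding formula to group means gives the limit $\beta + \gamma (\sigma_U^2/n) / (\sigma_\mu^2 + (\sigma_U^2 + \sigma_{\varepsilon_X}^2)/n)$.

```python
import numpy as np


def group_mean_slope_limit(beta, gamma, var_u, var_ex, var_mu, n):
    """Scalar confounding formula applied to group means (assumed model)."""
    return beta + gamma * (var_u / n) / (var_mu + (var_u + var_ex) / n)


def simulate_group_mean_slope(beta=1.0, gamma=2.0, var_u=1.0, var_ex=1.0,
                              var_ey=1.0, var_mu=0.01, n=100,
                              n_groups=200_000, seed=0):
    """Regress group-mean Y on group-mean X across many experimental groups."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, np.sqrt(var_mu), n_groups)        # first-stage effects
    u_bar = rng.normal(0.0, np.sqrt(var_u / n), n_groups)   # group mean of U
    ex_bar = rng.normal(0.0, np.sqrt(var_ex / n), n_groups)
    ey_bar = rng.normal(0.0, np.sqrt(var_ey / n), n_groups)
    x_bar = mu + u_bar + ex_bar
    y_bar = beta * x_bar + gamma * u_bar + ey_bar
    x_c = x_bar - x_bar.mean()
    y_c = y_bar - y_bar.mean()
    return float(x_c @ y_c / (x_c @ x_c))


slope = simulate_group_mean_slope()
limit = group_mean_slope_limit(1.0, 2.0, 1.0, 1.0, 0.01, 100)
print(f"group-mean slope: {slope:.3f}; closed-form limit: {limit:.3f}")
```

The values are chosen so that first-stage effects are small relative to within-group sampling noise. In that regime the slope stays far from $\beta$ however many groups are added, because the confounder's group mean still covaries with both $\bar{X}$ and $\bar{Y}$.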

8.3 IVCV With Only Summary Statistics

The $\ell_0$-regularized IV estimator only requires the kinds of per-group summary statistics that are already recorded in the course of running A/B tests, which has practical and computational utility. However, the cross-validation procedure above requires the raw data. We now turn to the following question: if the raw data are unavailable but summary statistics are, can we use these summary statistics to choose the regularization threshold?

Suppose that we have access to summary means for each treatment group and to the covariance matrix of the variables conditional on treatment assignment. We note that this covariance matrix can be estimated very precisely from observational data or, in the case of an experimental meta-analysis, by just looking at covariances among known control groups. We assume that group sizes are large enough that the distributions of the group-level means are well approximated by a Gaussian.

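To perform IVCV under these assumptions, we use a result from the literature on Monte Carlo (Owen, 2016, ch. 8). If some vector $W$ is distributed multivariate normal, $W \sim \mathcal{N}(m, \Omega)$, then any linear combination $AW$ has a normal distribution. Moreover, conditional on $AW = a$, the distribution of $W$ is normal with mean $m + \Omega A^\top (A \Omega A^\top)^{-1} (a - Am)$ and covariance matrix $\Omega - \Omega A^\top (A \Omega A^\top)^{-1} A \Omega$.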

This means that if we know the observational covariance matrix, then for every group we can take the group-level averages and sample from the conditional distribution above to obtain simulated averages for each split of the group, constrained so that they aggregate back to the observed group-level averages. Since, by the central limit theorem, the generating Gaussian model is approximately correct, this procedure simulates the split required by IVCV without access to the raw data.
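The sampling step can be sketched for the simplest case of splitting each group into two equal halves. The two-way equal split, the function name, and the example numbers are assumptions for illustration. If each half-sample mean is Gaussian with covariance $2\Sigma/n$ around a common mean, where $\Sigma$ is the within-group covariance of a single unit and $n$ the group size, then the Gaussian conditioning result implies that the first half's mean, given the full-group mean $\bar{m}$, is normal with mean $\bar{m}$ and covariance $\Sigma/n$. The second half is then pinned down by the constraint that the two halves average to $\bar{m}$.

```python
import numpy as np


def split_group_means(group_means, sigma, n, rng):
    """Simulate two equal-size half-sample means for every group.

    group_means: (K, d) array of observed full-group means.
    sigma:       (d, d) within-group covariance of a single unit.
    n:           units per group (assumed equal across groups).
    Returns (half_a, half_b), each of shape (K, d), with
    (half_a + half_b) / 2 == group_means.
    """
    d = sigma.shape[0]
    # Conditional on the full-group mean, half A deviates from it with covariance sigma / n.
    noise = rng.multivariate_normal(np.zeros(d), sigma / n, size=group_means.shape[0])
    half_a = group_means + noise
    half_b = 2.0 * group_means - half_a   # enforce the averaging constraint
    return half_a, half_b


rng = np.random.default_rng(0)
sigma = np.array([[2.0, 1.0],
                  [1.0, 3.0]])            # illustrative covariance of (X, Y)
means = np.array([[0.10, 0.20],
                  [-0.05, 0.00],
                  [0.30, 0.45]])          # illustrative group means of (X, Y)
half_a, half_b = split_group_means(means, sigma, n=1000, rng=rng)
print(np.allclose((half_a + half_b) / 2.0, means))
```

Each call produces a fresh simulated split, so the procedure can be repeated to average out the Monte Carlo noise it introduces.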

This gives us a summary-statistics-based IVCV algorithm:

Summary statistics instrumental variables cross-validation algorithm (sIVCV):

  1. Start with data consisting of treatment group means.

  2. Use the covariance matrix to perform Monte Carlo sampling to simulate the split groups.

  3. Use the IVCV algorithm to set the hyperparameter using the simulated splits.

  4. Estimate the causal effect using the selected hyperparameter on the full data set.
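The four steps can be strung together as follows. This is a rough sketch and not the procedure from the main text: `thresholded_iv` is a hypothetical hard-threshold estimator (keep groups whose mean treatment exceeds the threshold in absolute value, then regress through the origin), the held-out squared-error loss is a stand-in for the IVCV loss, and the two-way split, threshold grid, and synthetic inputs are all assumptions.

```python
import numpy as np


def thresholded_iv(x_bar, y_bar, threshold):
    """Hypothetical hard-threshold IV estimate from group means (centered at control)."""
    keep = np.abs(x_bar) > threshold
    if not keep.any():
        return 0.0
    return float(x_bar[keep] @ y_bar[keep] / (x_bar[keep] @ x_bar[keep]))


def simulate_halves(means, sigma, n, rng):
    """Step 2: simulate two half-sample means per group from summary statistics."""
    noise = rng.multivariate_normal(np.zeros(2), sigma / n, size=len(means))
    half_a = means + noise
    return half_a, 2.0 * means - half_a


def sivcv(means, sigma, n, thresholds, n_draws=20, seed=0):
    """Steps 1 to 4: choose a threshold on simulated splits, then refit on all data.

    means: (K, 2) array of group means, columns (X, Y).
    """
    rng = np.random.default_rng(seed)
    losses = np.zeros(len(thresholds))
    for _ in range(n_draws):
        half_a, half_b = simulate_halves(means, sigma, n, rng)
        for train, test in ((half_a, half_b), (half_b, half_a)):
            for i, t in enumerate(thresholds):
                beta = thresholded_iv(train[:, 0], train[:, 1], t)
                # Stand-in held-out loss on the other half's group means.
                losses[i] += np.mean((test[:, 1] - beta * test[:, 0]) ** 2)
    best = thresholds[int(np.argmin(losses))]
    return best, thresholded_iv(means[:, 0], means[:, 1], best)


# Synthetic summary statistics (illustrative values only).
rng = np.random.default_rng(1)
n_groups, n, beta_true = 2000, 1000, 1.0
sigma = np.array([[2.0, 3.0],
                  [3.0, 6.0]])                      # within-group covariance of (X, Y)
mu = 0.02 * rng.standard_t(3, size=n_groups)        # mostly small first-stage effects
noise = rng.multivariate_normal(np.zeros(2), sigma / n, size=n_groups)
means = np.column_stack([mu, beta_true * mu]) + noise

thresholds = np.linspace(0.0, 0.2, 11)
best, beta_hat = sivcv(means, sigma, n, thresholds)
print(f"selected threshold: {best:.2f}; estimate on full data: {beta_hat:.3f}")
```

One caveat of this sketch is that half-sample means are noisier than full-group means, so a threshold tuned on halves may not transfer unchanged to the full data. The sketch ignores that adjustment.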

8.4 Synthetic IVCV Experiments

In addition to the real data analyzed in the main text, we also evaluate the IVCV procedure on several fully synthetic data sets. This lets us elucidate the assumptions that our procedure needs in order to work, while the main experiment shows that these assumptions are indeed satisfied under real-world conditions.

We consider exactly the same model as in the main text, except that we generate the first stage effects from a known parametric distribution and let the remaining random terms be normal. First, we consider the case where the treatment effect in each dimension is drawn from an independent $t$ distribution with a fixed number of degrees of freedom. Second, we consider the case where the effects are drawn from a multivariate $t$ distribution with a covariance matrix drawn from an inverse Wishart (a conjugate prior for covariance matrices and a standard way of generating covariance matrices) whose degrees of freedom are set in terms of the dimension. Note that in the former case effects are axis aligned, while in the latter case larger values on one dimension can predict more extreme values on another dimension.

Finally, we consider a model where we first draw a variance from an inverse gamma distribution and then draw each component from an independent normal distribution with that variance. This means that components are uncorrelated in their means, but when one component's value is extreme, it is more likely that other components' values are extreme. This is the multivariate analog of our motivating example where some A/B tests are strong explorations of the parameter space and others are micro-optimizations at the margin. Note that the marginal distribution for each dimension is, just like in the first example, a $t$ distribution (since the $t$ can be written as a mixture of normals with variances drawn from the inverse gamma).
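The three first-stage data generating processes described above can be sketched with NumPy alone. The degrees of freedom (`df = 5`), the inverse-Wishart degrees of freedom (`dim + 2`), the identity scale matrix, and the function names are hypothetical choices, since the values used in the experiments are not stated here. The last lines check the qualitative claim that sharing a variance makes components extreme together even though they are uncorrelated.

```python
import numpy as np


def independent_t(n_groups, dim, df, rng):
    """Each component drawn from an independent t distribution."""
    return rng.standard_t(df, size=(n_groups, dim))


def inverse_wishart(dim, iw_df, rng):
    """One inverse-Wishart draw with identity scale: invert a Wishart draw G'G."""
    g = rng.normal(size=(iw_df, dim))
    cov = np.linalg.inv(g.T @ g)
    return (cov + cov.T) / 2.0              # symmetrize against round-off


def wishart_t(n_groups, dim, df, iw_df, rng):
    """Multivariate t whose covariance is drawn from an inverse Wishart."""
    cov = inverse_wishart(dim, iw_df, rng)
    z = rng.multivariate_normal(np.zeros(dim), cov, size=n_groups)
    w = rng.chisquare(df, size=(n_groups, 1))
    return z / np.sqrt(w / df)              # normal divided by a shared chi-square scale


def shared_variance(n_groups, dim, df, rng):
    """Draw one inverse-gamma variance per group, then independent normals.

    With shape df/2 and scale df/2, each component is marginally t with df
    degrees of freedom.
    """
    var = 1.0 / rng.gamma(shape=df / 2.0, scale=2.0 / df, size=(n_groups, 1))
    return np.sqrt(var) * rng.normal(size=(n_groups, dim))


def abs_corr(effects):
    """Correlation between the magnitudes of the first two components."""
    return float(np.corrcoef(np.abs(effects[:, 0]), np.abs(effects[:, 1]))[0, 1])


rng = np.random.default_rng(0)
n_groups, dim, df = 200_000, 2, 5           # hypothetical settings
indep = independent_t(n_groups, dim, df, rng)
wish = wishart_t(n_groups, dim, df, dim + 2, rng)
shared = shared_variance(n_groups, dim, df, rng)
print(f"magnitude correlation, independent t:   {abs_corr(indep):.3f}")
print(f"magnitude correlation, shared variance: {abs_corr(shared):.3f}")
```

In the independent case the magnitudes of different components are unrelated, while in the shared-variance case they are positively correlated, which is the property the correlated-variances experiment relies on.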

Figure 5 replicates the key main-text figure using the data generating processes above (left = independent $t$, middle = Wishart $t$, right = correlated variances). We restrict to a single setting because it is sufficient to illustrate our main points. We see that in the independent case the IVCV procedure (and indeed our multivariate regularization) can underperform the Bayesian random effects model and fail to substantially improve on TSLS. This happens because in the independent case there is a high probability that a single dimension is extreme enough to pass the regularization threshold, and thus even strong regularization does not necessarily remove bias. On the other hand, when outcomes are correlated (or their variances are), we see that multivariate IVCV performs well because being extreme in one component predicts having extreme outcomes in other components. This raises the interesting question of whether there is a more efficient regularization design.

Figure 5: Performance of various IV estimation techniques under various first stage data generating assumptions (left = independent $t$, right = Wishart $t$, bottom = correlated variances). We see that when the induced components are independent, the regularization performs less well, even for moderate dimensionality. However, as soon as there is any correlation, the IVCV procedure performs much better than TSLS and can either underperform or outperform the Bayesian random effects model. In the main text we see that on a real distribution IVCV does indeed beat the Bayesian model.