# Machine Learning Estimation of Heterogeneous Treatment Effects with Instruments

We consider the estimation of heterogeneous treatment effects with arbitrary machine learning methods in the presence of unobserved confounders with the aid of a valid instrument. Such settings arise in A/B tests with an intent-to-treat structure, where the experimenter randomizes over which user will receive a recommendation to take an action, and we are interested in the effect of the downstream action. We develop a statistical learning approach to the estimation of heterogeneous effects, reducing the problem to the minimization of an appropriate loss function that depends on a set of auxiliary models (each corresponding to a separate prediction task). The reduction enables the use of all recent algorithmic advances (e.g. neural nets, forests). We show that the estimated effect model is robust to estimation errors in the auxiliary models, by showing that the loss satisfies a Neyman orthogonality criterion. Our approach can be used to estimate projections of the true effect model on simpler hypothesis spaces. When these spaces are parametric, then the parameter estimates are asymptotically normal, which enables construction of confidence sets. We applied our method to estimate the effect of membership on downstream webpage engagement on TripAdvisor, using as an instrument an intent-to-treat A/B test among 4 million TripAdvisor users, where some users received an easier membership sign-up process. We also validate our method on synthetic data and on public datasets for the effects of schooling on income.

## Authors

• 35 publications
• 1 publication
• 5 publications
• 2 publications
• 2 publications
• 9 publications
• ### Valid Causal Inference with (Some) Invalid Instruments

Instrumental variable methods provide a powerful approach to estimating ...
06/19/2020 ∙ by Jason Hartford, et al. ∙ 8

• ### Assumption-Lean Analysis of Cluster Randomized Trials in Infectious Diseases for Intent-to-Treat Effects and Spillover Effects Among A Vulnerable Subpopulation

Cluster randomized trials (CRTs) are a popular design to study the effec...
12/27/2020 ∙ by Chan Park, et al. ∙ 0

• ### A Robust Method for Estimating Individualized Treatment Effect

We consider an additive model with a main effect and effects from multip...
04/21/2020 ∙ by Haomiao Meng, et al. ∙ 0

• ### Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments

We propose strategies to estimate and make inference on key features of ...
12/13/2017 ∙ by Victor Chernozhukov, et al. ∙ 0

• ### Instrument Validity for Heterogeneous Causal Effects

This paper provides a general framework for testing instrument validity ...
09/04/2020 ∙ by Zhenting Sun, et al. ∙ 0

• ### Heterogeneous Demand Effects of Recommendation Strategies in a Mobile Application: Evidence from Econometric Models and Machine-Learning Instruments

In this paper, we examine the effectiveness of various recommendation st...
02/20/2021 ∙ by Panagiotis Adamopoulos, et al. ∙ 0

• ### Combining observational and experimental data to find heterogeneous treatment effects

Every design choice will have different effects on different units. Howe...
11/08/2016 ∙ by Alexander Peysakhovich, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A/B testing is the gold standard of causal inference. But even when A/B testing is feasible, estimating the effect of a treatment on an outcome might not be a straightforward task. One major difficulty is non-compliance: even if we randomize what treatment to recommend to a subject, the subject might not comply with the recommendation due to unobserved factors and follow the alternate action. The impact that unobserved factors might have on the measured outcome is a source of endogeneity and can lead to biased estimates of the effect. This problem arises in large scale data problems in the digital economy; when optimizing a digital service, we might often want to estimate the effect of some action taken by our users on downstream metrics. However, the service cannot force users to comply, but can only find means of incentivizing or recommending the action. The unobserved factors of compliance can lead to biased estimates if we consider the takers and not takers as exogenously assigned and employ machine learning approaches to estimate the potentially heterogeneous effect of the action on the downstream metric.

The problem can be solved by using the technique of instrumental variable (IV) regression: as long as the recommendation increases the probability of taking the treatment, then we know that there is at least some fraction of users that were assigned the treatment “exogeneously”. IV regression parses out this population of “exogenously treated” users and estimates an effect based solely on them.

Most classical IV approaches estimate a constant average treatment effect. However, to make personalized policy decisions (an emerging trend in most digital services) one might want to estimate a heterogeneous effect based on observable characteristics of the user. The latter is a daunting task, as we seek to estimate a function of observable characteristics as opposed to a single number. Hence, statistical power is at stake. Even estimating an ATE is non-trivial when effect and compliance are correlated through observables. The emergence of large data-sets in the digital economy alleviates this concern; with A/B tests running on millions of users it is possible to estimate complex heterogeneous effect models, even if compliance levels are relatively weak. Moreover, as we control for more and more observable features of the user, we also reduce the risk that correlation between effect and compliance is stemming from unobserved factors.

This leads to the question this work seeks to answer: how can we blend the power of modern machine learning approaches (e.g. random forests, gradient boosting, penalized regressions, neural networks) with instrumental variable methods, so as to estimate complex heterogeneous effect rules. Recent work at the intersection of machine learning and econometrics has proposed powerful methods for estimating the effect of a treatment on an outcome, while using machine learning methods for learning nuisance models that help de-bias the final effect rule. However, the majority of the work has either focused on 1) estimating average treatment effects or low dimensional parametric effect models (e.g. the double machine learning approach of

Chernozhukov2018

), 2) developing new algorithms for estimating fully non-parametric models of the effect (e.g. the IV forest method of

athey2019generalized , the DeepIV method of pmlr-v70-hartford17a ), 3) assuming that the treatment is exogenous once we condition on the observable features and reducing the problem to an appropriate square loss minimization framework (see e.g. nie2017quasi ; kunzel2017meta ).

Nevertheless, a general reduction of IV based machine learning estimation of heterogeneous effects to a more standard statistical learning problem that can incorporate existing algorithms in a black-box manner has not been formulated in prior work. In fact, the work of nie2017quasi

leaves this as an open question. Such a reduction can help us leverage the recent algorithmic advances in statistical learning theory so as to work with large data-sets. Our work proposes the reduction of heterogeneous effects estimation via instruments to a square loss minimization problem over a hypothesis space. This enables us to learn not only the true heterogeneous effect model, but also the projections of the true model in simpler hypothesis spaces for interpretability. Moreover, our work leverages recent advances in statistical learning with nuisance functions

chernozhukov2018plug ; foster2019orthogonal , to show that the mean squared error (MSE) of the learned model is robust to the estimation error of auxiliary models that need to be estimated (as is standard in IV regression). Thus we achieve MSE rates where the leading term depends only on the sample complexity of the hypothesis space of the heterogeneous effect model.

Some advantages of reducing our problem to a set of standard regression problems include being able to use existing algorithms and implementations, as well as recent advances of interpretability in machine learning. For instance, in our application we deploy the SHAP framework SHAP1 ; SHAP2 to interpret random forest based models of the heterogeneous effect. Furthermore, when the hypothesis space is low dimensional and parametric then our approach falls in the setting studied by prior work of Chernozhukov2018

and, hence, not only MSE rates but also confidence interval construction is relatively straightforward. This enables hypothesis testing on parametric projections of the true effect model.

We apply our approach to an intent-to-treat A/B test among 4 million users on a major travel webpage so as to estimate the effect of membership on downstream engagement. We identify sources of heterogeneity that have policy implications on which users the platform should engage more and potentially how to re-design the recommendation to target users with large effects. We validate the findings on a different cohort in a separate experiment among 10 million users on the same platform. Even though the new experiment was deployed on a much broader and different cohort, we identify common leading factors of effect heterogeneity, hence confirming our findings. As a robustness check we create semi-synthetic data with similar features and marginal distributions of variables as the real data, but where we know the ground truth. We find that our method performs well both in terms of MSE, identifying the relevant factors and coverage of the confidence intervals.

Finally, we apply our method to a more traditional IV application: estimating the effect of schooling on wages. We use a well studied public data set and observe that our approach automatically identifies sources of heterogeneity that were previously uncovered using more structural approaches. We also validate our method in this application on semi-synthetic data that emulate the true data.

## 2 Estimation of Heterogeneous Treatment Effects with Instruments

We consider estimation of heterogeneous treatment effects with respect to a set of features , of an endogenous treatment on an outcome with an instrument . For simplicity of exposition, we will restrict attention to the case where and are scalar variables, but several of our results extend to the case of multi-dimensional treatments and instruments.

is an instrumental variable if it has an effect on the treatment but does not have a direct effect on the outcome other than through the treatment. More formally, we assume the following moment condition:

 E[Y−θ0(X)T−f0(X)∣Z,X]=0 (1)

Equivalently we assume that: , with . We allow for the presence of confounders, i.e. could be correlated with via some unobserved common factor . However, our exclusion restriction on the instrument implies that the residual is mean zero conditional on the instrument. Together with the fact that the instrument also has an effect on the treatment at any value of the feature , i.e.: , allows us to identify the heterogeneous effect function . We focus on the case where the effect is linear in the treatment , which is wlog in the binary treatment setting, which is our main application, and since our goal is to focus on the non-linearity wrt (this greatly simplifies our problem, see Chen2015 ; ChenPouzo ; NeweyPowell ; hartford2017deep ).

Given i.i.d. samples from the data generating process, our goal is to estimate a model that achieves small expected mean-squared-error, i.e.: . Since the true function can be very complex and difficult to estimate in finite samples, we are also interested in estimating projections of the true on simpler hypothesis spaces . Projections are also useful for interpretability: one might want to understand what is the best linear projection of on , i.e. . In this case we will denote with the projection of on , i.e. and our goal would be to achieve small mean squared error with respect to . When is a low dimensional parametric class (e.g. a linear function on a low-dimensional feature space or a constant function), we are also interested in performing inference; i.e. constructing confidence intervals that asymptotically contain the correct parameter with probability equal to some target confidence level.

Our primary technical development is to formulate a loss minimization based approach that enables estimation of at MSE rates with respect to , that are primarily driven by the sample complexity of the function space . The latter is not trivial, since estimating essentially requires first estimating , which depends on the sample complexity of the true function space. Moreover, estimating also depends on estimating and . These functions could be in turn much more complex than . To achieve the robustness property, we will use the recent framework of orthogonal statistical learning of foster2019orthogonal , and formulate a loss function for learning an estimate of , that is orthogonal with respect to the nuisance functions that also need to be estimated from the data. Moreover, we want our loss function to be strongly convex with respect to so that fast estimation rates can be achieved. Then we can invoke the results of foster2019orthogonal to get fast convergence rates for several function classes of interest.

### 2.1 Warm-Up: Estimating the Average Treatment Effect (ATE)

For estimation of the average treatment effect (ATE), assuming that either there is no effect heterogeneity with respect to or there is no heterogeneous compliance with respect to , Chernozhukov2018 propose a method for estimating the ATE that solves the empirical analogue of the following moment equation:

 E[(Y−E[Y∣X]−θ(T−E[T∣X]))(Z−E[Z∣X])]=0 (2)

This moment function is orthogonal to all the functions , and that also need to be estimated from data. This moment avoids the estimation of the expected conditional on and satisfies an orthogonality condition that enables robustness of the estimate , to errors in the nuisance estimates and

. The estimate is asymptotically normal with variance equal to the variance of the method if the estimates were the correct ones, assuming that the mean squared error of these estimates decays at least at a rate of

(see Chernozhukov2018 for more details). This result requires that the nuisance estimates are fitted in a cross-fitting manner, i.e. we use half of the data to fit a model for each of these functions and then predict the values of the model on the other half of the samples. We refer to this algorithm as DMLATEIV.111For Double Machine Learning ATE estimation with Instrumental Variables.

DMLATEIV is algorithmically equivalent to a two stage algorithm, where we run a linear regression to predict

from and then running a linear regression between and the predicted . The coefficient in the final regression is exactly equal to the estimate (by simple algebraic manipulations).

Inconsistency under Effect and Compliance Heterogeneity The above estimate is a consistent estimate of the average treatment effect as long as there is either no effect heterogeneity with respect to or there is no heterogeneous compliance (i.e. the effect of the instrument on the treatment) with respect to . Otherwise it is inconsistent. The reason is that, if we let and , then the population quantity: is a function of . If we also have effect heterogeneity, then we are solving for a constant that in the limit satisfies: , where . On the other hand the true heterogeneous model satisfies the equation: . In the limit, the two quantities are related via the equation: . Then the constant effect that we estimate converges to the quantity: . If is not independent with , then is a re-weighted version of the true average treatment effect , re-weighted by the heterogeneous compliance. To account for this heterogeneous compliance we need to change our moment equation so as to re-weight based on , which is unknown and also needs to be estimated from data. Given that this function could be arbitrarily complex, we want our final estimate to be robust to estimation errors of . We can achieve this by considering a doubly robust approach to estimating . Suppose that we had some other method of computing an estimate of the heterogeneous treatment effect , then we can combine both estimates to get a more robust method for the ATE, e.g.:

 ^θDR=E[^θ(X)+(~Y−^θ(X)~T)~Z^β(X)] (3)

This approach has been analyzed in okui2012doubly in the case of constant treatment effects and an analogue of this average effect was also used by athey2017efficient in a policy learning problem as opposed to an estimation problem. In particular, the quantity is known as the compliance score ABADIE2003231 ; aronow2013beyond . Our methodological contribution in the next two sections is two-fold: i) first we propose a model-based stable approach for estimating a preliminary estimate , which does not necessarily require that everywhere (an assumption that is implicit in the latter method), ii) second we show that this doubly robust quantity can be used as a regression target and minimizing the square loss with respect to this target, corresponds to an orthogonal loss, as defined in chernozhukov2018plug ; foster2019orthogonal . In fact we show that this loss is orthogonal, even if the hypothesis space that we use in the final model does not contain the true effect function.

## 3 Preliminary Estimate of Conditional Average Treatment Effect (CATE)

Let and , as in the previous section. Then observe that we can re-write the moment condition as:

 E[Y−θ0(X)h0(Z,X)−f0(X)∣Z,X]=0. (4)

Moreover, observe that the functions and are related via: . Thus we can further re-write the moment condition in terms of instead of :

 E[Y−q0(X)−θ0(X)(h0(Z,X)−p0(X))∣Z,X]=0. (5)

Moreover, we can identify with the following subset of conditional moments, where the conditioning of is removed:

 E[(Y−q0(X)−θ(X)(h0(Z,X)−p0(X)))(h0(Z,X)−p0(X))∣X]=0. (6)

Equivalently, is a minimizer of the square loss:

 L1(θ;q0,h0,p0):=E[(Y−q0(X)−θ(X)(h0(Z,X)−p0(X)))2] (7)

since the derivative of this loss with respect to is equal to the moment equation and, thus, the first order condition for the loss minimization problem is satisfied by the true model . Moreover, if the loss function satisfies a functional analogue of strong convexity, then any minimizer of the loss achieves small mean squared error with respect to . This leads to the following approach:

This method is an extension of the classical two-stage-least-squares (2SLS) approach angrist2008mostly to allow for arbitrary machine learning models; ignoring the residualization part (i.e. if for instance ), then it boils down to: 1) predict the mean treatment from the instrument and with an arbritrary regression/classification method, 2) predict the outcome from the predicted treatment multiplied by the heterogeneous effect model . Residualization helps us remove the dependence of the mean squared error on the complexity of the baseline function . We achieve this by showing that this loss is orthogonal with respect to (see foster2019orthogonal for the definition of an orthogonal loss). However, orthogonality does not hold with respect to . This finding is reasonable since we are using as our regressor. Hence, any error in the measurement of the regressor can directly propagate to an error in . This is the same reason why in classical IV regression one cannot ignore the variance from the first stage of 2SLS when calculating confidence intervals.

###### Lemma 1.

The loss function is orthogonal to the nuisance functions , but not .

Strong convexity and overlap. Note that both the empirical loss and the population loss are convex in the prediction, which typically implies computational stability. Moreover, the second order directional derivative of the population loss in any functional direction is:

 E[(^h(Z,X)−^p(X))2(θ(X)−θ0(X))2] (9)

and let:

 V(X):=E[(^h(Z,X)−^p(X))2∣X] (10)

To be able to achieve mean-squared-error rates based on our loss minimization, we need the population version of the loss function to satisfy a functional analogue of -strong convexity:

 ∀θ∈Θ:E[V(X)⋅(θ(X)−θ0(X))2]≥λE[(θ(X)−θ0(X))2] (11)

This setting falls under the “single-index” setup of foster2019orthogonal . Using arguments from Lemma 1 of foster2019orthogonal , if:

 ∀θ∈Θ:E[V0(X)⋅(θ(X)−θ0(X))2]≥λ0E[(θ(X)−θ0(X))2] (12)

where

 V0(X):=E[(h0(Z,X)−p0(X))2∣X]=Var(E[T∣Z,X]∣X), (13)

then . A sufficient condition is that for all . This is a standard "overlap" condition that the instrument is exogenously varying at any and has a direct effect on the treatment at any . DMLIV only requires an "average" overlap condition, tailored particularly to the hypothesis space , hence it could handle settings where the instrument is weak for some subset of the population. For instance, if is a linear function class: , then for the oracle strong convexity to hold it suffices that: . Lemma 1, combined with the above discussion and the results of foster2019orthogonal yields:222This corollary follows by small modifications of the proofs of Theorem 1 and Theorem 3 of foster2019orthogonal that accounts for the non-orthogonality w.r.t. , so we omit its proof.

###### Corollary 2.

Assume all random variables are bounded. Suppose that in the final stage of DMLIV we use any algorithm that achieves expected generalization error

w.r.t. loss , i.e.:

 E[L1(^θDR;^q,^h,^p)−infθ∈ΘL1(θ;^q,^h,^p)]≤R2n (14)

Moreover, suppose that the nuisance estimates satisfy and and for all : . Then returned by DMLIV satisfies:

 E[(^θ(X)−θ0(X))2]≤O(R2n+h2n+g4nλ0) (15)

If empirical risk minimization is used in the final stage, i.e.:

 ^θ=argminθ∈ΘL1n(θ;^q,^h,^p) (16)

then , where is the critical radius of the hypothesis space as defined via the localized Rademacher complexity koltchinskii2011introduction .

We note that due to the non-orthogonality of the loss w.r.t. to , the error in appears in the same order in the final MSE rate. However, since we are only using DMLIV as a preliminary estimator, this result is good enough for our purposes of achieving a preliminary MSE rate. Computational considerations. The empirical loss is not a standard square loss. However, we can re-write it as . Thus the problem is equivalent to a standard square loss minimization with label and sample weights . Thus we can use any out-of-the-box machine learning method that accepts sample weights, such as stochastic gradient based regression methods and gradient boosted or random forests. Alternatively, if we assume a linear representation of the effect function , then the problem is equivalent to regressing on the scaled features , and again any method for fitting linear models can be invoked.

## 4 DRIV: Orthogonal Loss for IV Estimation of CATE and Projections

We now present the main estimation algorithm that combines the doubly robust approach presented for ATE estimation with the preliminary estimator of the CATE to obtain a fully orthogonal and strongly convex loss. This method achieves a second order effect from all nuisance estimation errors and enables oracle rates for the target effect class and asymptotically valid inference for low dimensional target effect classes. In particular, given access to a first stage model of heterogeneous effects (such as the one produced by DMLIV), we can estimate a more robust model of heterogeneous effects via minimizing a square loss that treats the doubly robust quantity used in Equation (3) as the label:

 minθ∈ΘπL2(θ;θpre,β,p,q,r):=E⎡⎣(θpre(X)+(~Y−θpre(X)~T)~Zβ(X)−θ(X))2⎤⎦ (17)

We allow for a model space that is not necessarily equal to . The solution in Equation (3) is a special case of this minimization problem where the space contains only constant functions. Our main result shows that this loss is orthogonal to all nuisance functions , , , , . Moreover, it is strongly convex in the prediction , since conditional on all the nuisance estimates it is a standard square loss. Moreover, we show that the loss is orthogonal irrespective of what the model space , even if , as long as the preliminary estimate is consistent with respect to the true CATE (i.e. fit a flexible preliminary CATE and use it to project to a simpler hypothesis space).

###### Lemma 3.

The loss is orthogonal with respect to the nuisance functions , , , and .

This leads to the following algorithm for heterogeneous treatment effects:

If we use DMLIV for , even though DMLIV has a first order impact from the error of , the second stage estimate has a second order impact, since it has a second order impact from the first stage CATE error. Lemma 3 together with the results of foster2019orthogonal implies the following corollary:

###### Corollary 4.

Assume all random variables are bounded and for all . Suppose that in the final stage of DRIV we use any algorithm that achieves expected generalization error with respect to loss over hypothesis space , i.e.:

 E[L2(^θDR;θpre,^β,^p,^q,^r)−infθ∈ΘL2(θ;θpre,^β,^p,^q,^r)]≤R2n (18)

Moreover, suppose that each nuisance estimate , and the hypothesis space is convex. Then returned by DRIV satisfies:

 E[∥^θDR−θ∗∥22]≤O(R2n+g4n), (19)

where

 θ∗=argminθ∈ΘπL2(θ;θ0,β0,p0,q0,r0)≡argminθ∈ΘπE[(θ0(X)−θ(X))2] (20)

If empirical risk minimization is used in the final stage, i.e.:

 ^θDR=argminθ∈ΘπL2n(θ;θpre,^β,^p,^q,^r) (21)

then , where is the critical radius of the hypothesis space as defined via the localized Rademacher complexity koltchinskii2011introduction .

For the special case where is high-dimensional sparse linear and contains the true CATE, then it suffices to require only an rate for the nuisance functions, as opposed to an . This corollary stems from verifying that all the assumptions of chernozhukov2018plug hold for our loss function:

###### Corollary 5.

If is high-dimensional sparse linear, i.e. with , and , and the model is well-specified, i.e. , with , then if -penalized square loss minimization is used in the final step of DRIV, i.e.:

 ^θDR=argminξ∈RpL2n(ξ;θpre,^β,^p,^q,^r)+λ∥ξ∥1 (22)

it suffices that to get: for .

Interpretability through projections. The fact that our loss function can be used with any target allows us to perform inference on the projection of on a simple space

(e.g. decision trees, linear functions) for interpretability purposes. If we let

the label in the final regression of DRIV, then observe that when the nuisance estimates take their true values then , since the second part of has mean zero. Hence:

 L2(θ;θ0,β0,p0,q0,r0)=Var(θDR(X))+E[(θ0(X)−θ(X))2]. (23)

The first part is independent of and hence minimizing the oracle is equivalent to minimizing over , which is exactly the projection of on . One version of an interpretable model is estimating the CATE with respect to a subset of the variables, i.e.: (e.g. how treatment effect varies with a single feature). This boils down to setting some space of functions of .

If is a low dimensional set of features and is a the space of linear functions of , i.e. , then the first order condition of our loss is equal to the moment condition . Then orthogonality of our loss implies that DRIV is equivalent to an orthogonal moment estimation method Chernozhukov2018 . Thus using the results of Chernozhukov2018 we get the corollary:

###### Corollary 6 (Confidence Intervals).

The estimate returned by DRIV with for a constant independent of , is asymptotically normal with asymptotic variance equal to the hypothetical variance of as if the nuisance estimates had their true values.

Hence, we can use out-of-the-box packages for calculating CIs of an OLS regression to get -values on the coefficients.

### 4.1 Variance Reduction with Re-Weighting under Well-Specification

DRIV is orthogonal irrespective of whether contains the true CATE model and provides estimation rates for the projection of CATE on the space . However, it does suffer from a high variance, since we are dividing by the conditional co-variance . Hence, if the instrument is weak in some region of ’s then the method can suffer from instability. Moreover, the variance of the approach does not adapt to the model space , i.e. it could be that some ’s have a zero co-variance but the model is identified by the remainder of the ’s.

Unlike DRIV, DMLIV does not suffer from these drawbacks. However, as pointed out earlier, DMLIV is not orthogonal with respect to the nuisance function . We now show how we can combine the best of both worlds by simply re-weighting the DRIV loss, so as to put less weight on high variance regions. However, unlike DRIV, the loss that we construct is simply orthogonal and not universally orthogonal. In particular, since is a reasonable proxy on the magnitude of the variance of the regression target , we will re-weight the loss by :

 L2rw(θ;θpre,β,p,q,r)=E[(~Y~Z−θpre(X)(~T~Z−β(X))+θ(X)β(X))2] (24)

In other words we are re-weighting the DRIV loss to give less weight on samples that have high variance and solely using the samples with low variance to identify our true model. We can then show the following theorem:

###### Lemma 7.

Assuming , then the loss is orthogonal with respect to the nuisance functions , , , and .

We will refer to latter algorithm as DRIV-RW (for re-weighted DRIV). Similar to DRIV we can then get the following corollary:

###### Corollary 8.

Assume all random variables are bounded, and suppose that in the final stage of DRIV-RW we use any algorithm that achieves expected generalization error with respect to loss over hypothesis space , i.e.:

 E[L2rw(^θDR;θpre,^β,^p,^q,^r)−infθ∈ΘL2rw(θ;θpre,^β,^p,^q,^r)]≤R2n (25)

Moreover, suppose that each nuisance estimate , and for all : . Then returned by DRIV-RW satisfies:

 E[(^θDR(X)−θ0(X))2]≤O(R2n+g4nλ0), (26)

We can also make statements analogous to Corollary 4 for the case where empirical risk minimization is used in the final stage or -penalized empirical risk minimization over linear functions. Moreover, observe that even if the on-average overlap condition does not hold, we are still recovering a model that converges to the true one in terms of the re-weighted MSE, where we re-weight based on a measure of strength of the instrument (i.e. we will be predicting better in regions where the instrument is strong). This is a good fall-back guarantee to have in cases where the instrument happens to be weak.

Finally, in the case where the instrument is multi-dimensional, then we can follow an approach similar to DMLIV, where we construct an “optimal” instrument of the form . Then we can consider the analogue of DRIV-RW, but where takes the place of , i.e. let:

 ~Zπ=h(X,Z)−p(X) (27)

Then observe that the analogue of for the instrument is equal to as defined in the DMLIV section, i.e.

 βπ(X)=E[~T~Zπ∣X]=E[~Z2π∣X]=Var(E[T∣Z,X]∣X)=V(X)

then the loss function takes the form:

 L2π,rw(θ;θpre,V,p,q,h)=E[(~Y~Zπ−θpre(X)(~T~Zπ−V(X))+θ(X)V(X))2] (28)
###### Lemma 9.

Assuming that there exists , such that:

 E[(Y−q0(X)−θ0(X)(T−p0(X)))(h0(Z,X)−r0(X))∣X]=0 (29)

and that the preliminary estimate converges in mean-squared-error to then the loss is orthogonal with respect to the nuisance functions , , , , .

Observe that if we use DMLIV as the preliminary estimator, then because DMLIV solely uses the latter set of moment restrictions, it will converge in MSE to a function that satisfies the moment condition. Thus it will satisfy the requirement. We will refer to this algorithm as Projected-DRIV-RW.

###### Corollary 10.

Assume all random variables are bounded, and suppose that in the final stage of Projected-DRIV-RW we use any algorithm that achieves expected generalization error with respect to loss over hypothesis space , i.e.:

 E[L2π,rw(^θDR;θpre,^V,^p,^q,^h)−infθ∈ΘL2π,rw(θ;θpre,^V,^p,^q,^h)]≤R2n (30)

Moreover, suppose that each nuisance estimate , and for all : . Then returned by Projected-DRIV-RW satisfies:

 E[(^θDR(X)−θ0(X))2]≤O(R2n+g4nλ0), (31)

## 5 Example: Randomized Trials with Non-Compliance

Given that our main application is a special case of our framework where the instrument is the assignment in a randomized control trial with non-compliance, in this section we show how our algorithms simplify for this setting.

Randomized control trials with non-compliance, or equivalently intent-to-treat A/B tests, are the special case where the instrument and the treatment are binary, i.e. , and the instrument is fully exogenous. For simplicity, we assume that for all (i.e. a balanced test). In this case, the nuisance components can all be expressed as a function of the following quantity (which is typically referred to as the compliance score):

 Δ(X)= (2Z−1)Pr[T=1∣Z=1,X]−Pr[T=1∣Z=0,X]2 (32)

Then, after straightforward algebraic manipulations, we have:

 β(X)= (2Z−1)Δ(X)2 h(Z,X)−p(X)= Δ(X) V(X)= Δ(X)2

and the loss functions simplify to:

 L1(θ;q,Δ)= E[(Y−q(X)−θ(X)Δ(X))2] L2(θ;θpre,q,p,Δ)= E⎡⎣(θpre(X)+Y−q(X)−θpre(X)(T−p(X))Δ(X)−θ(X))2⎤⎦ L2rw(θ;θpre,q,p,Δ)= 14E[(Y−q(X)−θpre(X)(T−p(X))+Δ(X)(θpre(X)−θ(X)))2]

Moreover, observe that the loss is equivalent to the loss with , i.e. a zero preliminary estimator of the treatment effect. Thus in this case, we essentially need to estimate three nuisance components, i.e. . We can estimate by simply estimating and setting , which boils down to a classification problem among the treated and control populations for each value of . In fact, once we have estimated these two nuisance functions, we also know that: . Thus it all boils down to estimating and .

## 6 Estimating Effects of Membership on TripAdvisor

We apply our methods to estimate the treatment effect of membership on the number of days a user visits a leading travel assistance website, W. The instrument used was a 14-day intent-to-treat A/B test run during 2018, where users in group A received a new, easier membership sign-up process, while the users in group B did not. The treatment is whether a user became a member or not. Becoming a member and logging into TripAdvisor gives users exclusive access to trip planning tools, special deals and price alerts, and personalized ideas and travel advice. Our data consists of 4,606,041 total users in a 50:50 A/B test. For each user, we have a 28-day pre-experiment summary about their browsing and purchasing activity on TripAdvisor (see Sec. B.2). The instrument significantly increased the rate of treatment, and is assumed to satisfy the exclusion restriction.

We applied two sets of nuisance estimation models with different complexity characteristics: LASSO regression and logistic regression with an L2 penalty (LM); and gradient boosting regression and classification (GB). The only exception was

, where we used a fixed estimate of since the instrument was a large randomized experiment. See Sec. B.1 for details.333We attempted to use the R implementation of Generalized Random Forests (GRF)athey2019generalized to compare with our results. However, we could not fit due to the size of the data and insufficient memory errors (with 64GB RAM).

We estimate the ATE using DRIV projected onto a constant (Table 1). Using linear nuisance models results in very similar ATE estimates between DMLATEIV and DRIV. We compare the co-variate associations for both heterogeneity and compliance under DRIV to understand why. If there are co-variates with significant non-zero associations in both heterogeneity and compliance, this could lead to different estimates between DRIV and DMLATEIV (and vice versa). Replacing the CATE projection model with a linear regression, we obtain valid inferences for the co-variates associated with treatment effect heterogeneity (Figure 1). For compliance, we run a linear regression of the estimated quantity on , to assess its association with each of the features (see Sec. B.1

for details). Comparing treatment and compliance coefficients, os_type_linux and revenue_pre are the only coefficients substantially different from 0 in both. However, only a very small proportion of users in the experiment are Linux users, and the distribution of revenue is very positively skewed. This justifies the minor difference between the DMLATEIV and DRIV estimates. Moreover, we fit a shallow, heavily regularized random forest and interpret it using Shapley Additive Explanations (SHAP)

NIPS2017_7062 . SHAP gave directionally similar impact of each feature on the effect (Figure 1). However, since we constrained the model to have depth at most one, it essentially gives the features in order of importance if we were to split the population based on a single feature. This justifies why the order of importance of features in the forest is not in the same order as the magnitude of the rank in the linear model, since they have different interpretations. The features picked up by the forest intuitively make sense since an already highly engaged member of W, or a user who has recently made a booking, is less likely to further increase their visits to W.

Using gradient boosting nuisance models, we show that many inferences remain similar (Figure 2 in Appendix). The most notable changes in heterogeneity were for features which have a highly skewed distribution (e.g. visits to specific pages on W), or which appear rarely in the data (e.g. Linux users). This suggests the more powerful models are better able to fit the data for these smaller sub-populations, leading to different results. This explanation is supported by considering the differences between using a linear regression versus random forest for the CATE projection model across residualization models. The linear CATE projection model coefficients are largely similar for both residualization models (except the Linux operating system feature, which appears rarely in the data). Moving to a random forest for the CATE projection model with SHAP presents greater differences, especially for the highly skewed features.

While the point estimates of the ATE under DRIV and DMLATEIV are similar using gradient boosting residualization models, the confidence interval from DRIV is wider than under DMLATEIV. This highlights the potential advantages of DRIV over DMLATEIV in avoiding stronger conclusions about the ATE than could be warranted by the data.

Similar instrument from a recent experiment A recent 2019 A/B test of the same membership sign-up process provided another viable instrument. This 21-day A/B test included a much larger, more diverse population of users than in 2018 due to fewer restrictions for eligibility (see Sec. B.2 for details). We apply DRIV with gradient boosting residualization models and a linear projection of the CATE. The CATE distribution has generally higher values compared to the 2018 experiment which reflects the different experimental population. In particular, users in the 2018 experiment had much higher engagement and significantly higher revenue in the pre-experiment period. This was largely because users were only included in the 2018 experiment on their second visit. The higher baseline naturally makes it more difficult to achieve high treatment effects, explaining the generally lower CATE distribution in the 2018 experiment. We note that, unlike in 2018, the revenue coefficient is no longer significant. We again attribute this to the much higher revenue baseline in 2018. Despite the population differences, however, we observe "days_visited_vrs_pre" continues to have a very significant positive association. "days_visited_exp_pre" now also appears to have a significantly positive association, as does the iPhone device (which was not a feature in the 2018 experiment). The inclusion of iPhone users is another big domain shift in the two experiments.

Policy recommendations for W Our results offer several policy implications for W. Firstly, encourage iPhone users, and users who frequent vacation rentals pages to sign-up for membership. These users exhibited high treatment effects from membership. For frequent visitors to vacation rentals pages, this effect was robust across residualization models, CATE projections, and even different instruments (e.g. by providing stronger encouragements for sign-up on particular sub-pages). Secondly, encourage users who have yet to make a booking (low revenue_pre) and who are perhaps in the planning and research stage) to sign-up for membership so they can make use of trip-planning tools and personalized travel advice. Second, find ways to improve the membership offering for users who are already engaged: e.g. recently made a booking (high revenue_pre), were already frequent visitors (high days_visited_free_pre).

Validation on Semi-Synthetic Data In Appendix C, we validate the correctness of ATE and CATE from DRIV, by creating a semi-synthetic dataset with the same variables and such that the marginal distribution of each variable looks similar to the TripAdvisor data, but where we know the true effect model. We find that DRIV recovers a good estimate of the ATE. The CATE of DRIV with linear regression as final stage also recovers the true coefficients, and a random forest final stage picks the correct factors of heterogeneity as most important features. Moreover, coverage of DRIV ATE confidence intervals is almost nominal at 94%, while DMLATEIV can be very biased and have 0 coverage. we evaluate the empirical coverage of the confidence intervals across many monte arlo experiments (with fewer samples and stronger instrument) and find that DMLATEIV is so biased that the intervals have 0 coverage, while DRIV ATE intervals have almost nominal coverage of 94%

## 7 Estimating the Effect of Schooling on Wages

The causal impact of schooling on wages has been studied at length in Economics (see griliches1977estimating , card1993using , card2001estimating , hudson2011parental ), and although it is generally agreed that there is a positive impact, it is difficult to obtain a consistent estimate of the effect due to self-selection into education levels. To account for this endogeneity, Card (card1993using ) proposes using proximity to a 4-year college as an IV for schooling. We analyze Card’s data from the Nat. Long. Survey of Young Men (NLSYM, 1966) to estimate the ATE of education on wages and find sources of heterogeneity. To validate our findings, we use the NLSYM data to create a semi-synthetic dataset with known treatment effect and show that we correctly recover the true ATE and underlying heterogeneity. We describe the NLSYM data in depth in Appendix D. At high level, the data contains 3,010 rows with 22 mostly binary covariates , log wages (), years of schooling (), and 4-year college proximity indicator ().

We apply DMLATEIV and DRIV with linear (LM) or gradient boosted (GBM) nuisance models to estimate the ATE (Table 2). While the DMLATEIV results are consistent with Card’s ( CI), this estimate is likely biased in the presence of compliance and effect heterogeneity (see Sec. 2.1). The DRIV ATE estimates, albeit lower, still lie within the 95% CI of the DML ATE.

We study effect heterogeneity with a shallow random forest an the last stage of DRIV. Fig. 3

depicts the spread of treatment effects, and the important features selected. Most effects (89%) are positive, with very few very negative outliers. The heterogeneity is driven mainly by parental education variables. We project the DRIV treatment effect on the mother’s education variable to study this effect. In fig.

3, we note that treatment effects are highest among children of less educated mothers. This pattern has also been observed in card1993using and hudson2011parental .444See Notebook for LM nuisance models and Notebook for GBM nuisance models for these results. Card (card1993using ) posits that children with less educated parents are more likely to stop their schooling too soon due to the high costs of education. Considering that the marginal impact of an additional year of schooling is higher in earlier education, the effect will be larger for men in this group.

Semi-synthetic Data Results. We created semi-synthetic data from the NLSYM covariates and instrument , with generated treatments and outcomes based on known compliance and treatment functions (see Appx. D for details). We create a realistic treatment effect that depends on the mother’s education () and single mother at age 14 (): . The DGP is described in full in Appendix D. In Table 2, we see that DMLATEIV ATE (true ATE=0.609) is upwards biased and has poor coverage over 100 runs, whereas the DRIV ATE is less biased and has overall good coverage. With DRIV, we also cover the correct with the coefficient CIs when the final stage is the space of linear models on the relevant variables; with linear nuisance models (including products of features): vs , vs , and vs ; and with gradient boosted forest nuisance models: vs , vs , and vs . (these are the coefficients after variables where pre-processed and normalized to mean zero and variance ) 5 The large CI in the last parameter is due to being an order of magnitude smaller than .