The problem of heterogeneous treatment effect estimation in observational studies arises in a wide variety application areas (Athey, 2017), ranging from personalized medicine (Obermeyer and Emanuel, 2016) to offline evaluation of bandits (Dudík et al., 2011), and is also a key component of several proposals for learning decision rules (Athey and Wager, 2017; Hirano and Porter, 2009). There has been considerable interest in developing flexible and performant methods for heterogeneous treatment effect estimation. Some notable recent advances include proposals based on the lasso (Imai and Ratkovic, 2013), recursive partitioning (Athey and Imbens, 2016; Su et al., 2009), BART (Hahn, Murray, and Carvalho, 2017; Hill, 2011)2018; Wager and Athey, 2017), boosting (Powers et al., 2018), neural networks (Shalit, Johansson, and Sontag, 2017), etc., as well as combinations thereof (Künzel et al., 2017; Luedtke and van der Laan, 2016); see Dorie et al. (2017) for a recent survey and comparisons.
However, although this line of work has led to many promising methods, the literature does not yet provide a comprehensive answer as to how machine learning methods should be adapted for treatment effect estimation. First of all, there is no definitive guidance on how to turn a good generic predictor into a good treatment effect estimator that is robust to confounding. The process of developing “causal” variants of machine learning methods is still a fairly labor intensive process, effectively requiring the involvement of specialized researchers. Second, with some exceptions, the above methods are mostly justified via numerical experiments, and come with no formal convergence guarantees or regret bounds proving that the methods in fact succeed in isolating causal effects.
In this paper, we discuss a new approach to estimating heterogeneous treatment effects that addresses both of these concerns. Our framework allows for fully automatic specification of heterogeneous treatment effect estimators in terms of arbitrary loss minimization procedures. Moreover, we show how the resulting methods can achieve comparable regret bounds to oracle methods that know everything about the data-generating distribution except the treatment effects.
We formalize our problem in terms of the potential outcomes framework (Neyman, 1923; Rubin, 1974). The analyst has access to independent and identically distributed examples , , where denotes per-person features, is the observed outcome, and is the treatment assignment. We posit the existence of potential outcomes corresponding to the outcome we would have observed given the treatment assignment or respectively, such that , and seek to estimate the conditional average treatment effect (CATE) function
In order to identify , we assume unconfoundedness, i.e., the treatment assignment is as good as random once we control for the features (Rosenbaum and Rubin, 1983).
The treatment assignment is unconfounded, .
We write the treatment propensity as and the conditional response surfaces as for (throughout this paper, we use -superscripts to denote unknown population quantities). Under unconfoundedness, we can check that where . A few more lines of algebra then imply that the CATE function can be expressed as follows in terms of the propensity and the mean outcome :
This decomposition was originally used by Robinson (1988) to estimate parametric components in partially linear models, and has regularly been discussed in both statistics and econometrics ever since. For example, Zhao, Small, and Ertefaie (2017) use Robinson’s transfomation for post-selection inference on effect modification; Athey, Tibshirani, and Wager (2018) rely on it to grow a causal forest that is robust to confounding; Robins (2004) builds on it in developing -estimation for sequential trials, and Chernozhukov et al. (2017) present it as a leading example on how machine learning methods can be put to good use in estimating nuisance components for semiparametric inference.
With the representation (2), an oracle who knew both the functions and a priori could estimate the heterogeneous treatment effect function by simple loss minimization:
where the term is interpreted as a regularizer on the complexity of the function. In practice, this regularization could be explicit as in penalized regression, or implicit, e.g., as provided by a carefully designed deep neural network.
The difficulty, however, is that we never know the weighted main effect function and usually don’t know the treatment propensities either, and so the estimator (3) is not feasible. In this paper, we study the following class of two-step estimators motivated by the oracle procedure (3):
Fit and via appropriate methods tuned for optimal predictive accuracy, then
Estimate treatment effects via a plug-in version of (3), where , etc., denote held-out predictions, i.e., predictions made without using the -th training example,111Using hold-out prediction for nuisance components, also known as cross-fitting, is an increasingly popular approach for making machine learning methods usable in classical semiparametrics (Athey and Wager, 2017; Chernozhukov et al., 2017; Schick, 1986; van der Laan and Rose, 2011; Wager et al., 2016).
In other words, the first step learns an approximation for the oracle objective, and the second step optimizes it. We refer to this approach as the -learner in recognition of the work of Robinson (1988), and also to emphasize the role of residualization. We will also refer to the squared loss in (4) as the -loss.
This approach has several advantages over existing, more ad hoc proposals. The steps 1 and 2 capture the conceptual aspects of heterogeneous treatment effect estimation in a general way, and so implementing new variants of these methods reduces to routine work. Moreover, the key distinctive step using (4) is simply a loss minimization problem, and so can be efficiently solved via off-the-shelf software such as glmnet for high-dimensional regression (Friedman, Hastie, and Tibshirani, 2010), XGboost for boosting (Chen and Guestrin, 2016), or TensorFlow
for deep learning(Abadi et al., 2016). Finally, it is straightforward to tune the regularizer by simply cross validating the squared-error loss in (4), which avoids the use of more sophisticated model-assisted cross-validation procedures as developed in Athey and Imbens (2016) or Powers et al. (2018).
This paper makes the following contributions. First, we implement variants of our method based on penalized regression and boosting. In each case, we find that the -learner exhibits promising performance relative to existing proposals. Second, we prove that—in the case of penalized kernel regression—regret bounds for the feasible estimator for asymptotically match the best available bounds for the oracle method . Crucially, this quasi-oracle behavior holds even if the first-step predictors and are up to an order of magnitude less accurate than itself. Both results highlight the promise of the -learner as a general purpose framework for estimating heterogeneous treatment effects in observational studies.
1.1 Related Work
Under unconfoundedness (Assumption 1), the CATE function can be written as
As a consequence of this representation, it may be tempting to first estimate on the treated and control samples separately, and then set . This approach, however, is often not robust: Because and are not trained together, their difference may be unstable. As an example, consider fitting the lasso (Tibshirani, 1996) to estimate and in the following high-dimensional linear model, with , and . A naive approach would fit two separate lassos to the treated and control samples,
and then use it to deduce a treatment effect function, . However, the fact that both and are regularized towards 0 separately may inadvertently regularize the treatment effect estimate away from 0, even when everywhere. This problem is especially acute when the treated and control samples are of different sizes; see Künzel et al. (2017) for some striking examples.
The recent literature on heterogeneous treatment effect estimation has proposed several ideas on how to avoid such “regularization bias”. In particular, several recent papers have proposed structural changes to various machine learning methods aimed at focusing on accurate estimation of (Athey and Imbens, 2016; Hahn et al., 2017; Imai and Ratkovic, 2013; Powers et al., 2018; Shalit et al., 2017; Su et al., 2009; Wager and Athey, 2017). For example, with the lasso, Imai and Ratkovic (2013) advocate replacing (6) with a single lasso as follows,
where then . This approach always correctly regularizes towards a sparse
-vector for treatment heterogeneity. The other approaches cited above present variants and improvements of similar ideas in the context of more sophisticated machine learning methods; see, for example, Figure 1 ofShalit et al. (2017) for a neural network architecture designed to highlight treatment effect heterogeneity without being affected by confounders.
Another trend in the literature, closer to our paper, has focused on meta-learning approaches that are not closely tied to any specific machine learning method. Künzel, Sekhon, Bickel, and Yu (2017) proposed two approaches to heterogeneous treatment effect estimation via generic machine learning methods. One, called the -learner, first estimates via appropriate non-parametric regression methods. Then, on the treated observations, it defines pseudo-effects , and uses them to fit via a non-parametric regression. Another estimator is obtained analogously (see Künzel et al. (2017) for details), and the two treatment effect estimators are aggregated as
Another method, called the -learner, starts by noticing that
and then fitting on using any off-the-shelf method. Athey and Imbens (2016) propose a “transformed outcome” approach that involves regressing on to estimate .
In our experiments, we compare our method at length to those of Künzel et al. (2017). Relative to this line of work, our main contribution is our method, the -learner, which provides meaningful improvements over baselines in a variety of settings, and our analysis, which provides the first “quasi-oracle” regret bound we are aware of for non-parametric regression, i.e., where the regret of may decay faster than that of or .
Conceptually, we draw from the literature on semiparametric efficiency and constructions of orthogonal moments includingRobinson (1988) and, more broadly, Belloni et al. (2017), Bickel et al. (1998), Newey (1994), Robins (2004), Robins and Rotnitzky (1995), Robins et al. (2017), Tsiatis (2007), van der Laan and Rose (2011), etc., that aim at -rate estimation of a target parameter in the presence of nuisance components that cannot be estimated at a rate. Algorithmically, our approach has a close connection to targeted maximum likelihood estimation (Scharfstein et al., 1999; Van Der Laan and Rubin, 2006), which starts by estimating nuisance components non-parametrically, and then uses these first stage estimates to define a likelihood function that is optimized in a second step.
The main difference between this literature and our results is that existing results typically focus on estimating a single (or low-dimensional) target parameter, whereas we seek to estimate an object that may also be quite complicated itself. Another research direction that also use ideas from semiparametrics to estimate complex objects is centered on estimating optimal treatment allocation rules (Athey and Wager, 2017; Dudík et al., 2011; Luedtke and van der Laan, 2016; Zhang et al., 2012; Zhao et al., 2012). This problem is closely related to, but subtly different from the problem of estimating under squared-error loss; see Kitagawa and Tetenov (2018), Manski (2004) and Murphy (2005) for a discussion. We also note the work of van der Laan, Dudoit, and van der Vaart (2006), who consider non-parametric estimation by empirical minimization over a discrete grid.
2 A Motivating Example
To see how the -learner works in practice, we consider an example motivated by Arceneaux, Gerber, and Green (2006), who studied the effect of paid get-out-the-vote calls on voter turnout. A common difficulty in comparing the accuracy of heterogeneous treatment effect estimators on real data is that we do not have access to the ground truth. From this perspective, a major advantage of this application is that Arceneaux et al. (2006) found no effect of get-out-the-vote calls on voter turnout, and so we know what the correct answer is. We then “spike” the original dataset with a synthetic treatment effect such as to make the task of estimating heterogeneous treatment effects non-trivial. In other words, both the baseline signal and propensity scores are from real data; however, is chosen by us, and so we can check whether different methods in fact succeed in recovering it.
The dataset of Arceneaux et al. (2006)
has many covariates that are highly predictive of turnout, and the original study assigned people to the treatment and control condition with variable probabilities, resulting in a non-negligible amount of confounding.222The design of Arceneaux et al. (2006) was randomized separately by state and competitiveness of the election. Although the randomization probabilities were known to the experimenters, we here hide them from our algorithm, and require it to learn a model for the treatment propensities. We also note that, in the original data, not all voters assigned to be contacted could in fact answer the phone call, meaning that all effects should be interpreted as intent to treat effects. A naive analysis (without correcting for variable treatment propensities) estimates the average intent to treat effect of a single get-out-the-vote call on turnout as 4%; however, an appropriate analysis finds with high confidence that any treatment effect must be smaller than 1% in absolute value. The full sample has observations, of which were treated. We focus on covariates (including state, county, age, gender, etc.). Both the outcome and treatment are binary.
As discussed above, we assume that the treatment effect in the original data is 0, and spike in a synthetic treatment
where indicates whether
the -th unit voted in the year 2000, and is their age. Because
the outcomes are binary, we add in the synthetic treatment effect by
strategically flipping some outcome labels.333Denote the original unflipped outcomes as . To add in a treatment
effect , we first draw Bernoulli random variables
, we first draw Bernoulli random variableswith probability . Then, if , we set , whereas if , we set to or depending on whether or respectively. Finally, we set . As is typical in causal inference applications, the treatment heterogeneity here is quite subtle, with
, and so a large sample size is needed in order to reject a null hypothesis of no treatment heterogeneity. For our analysis, we focus on a subset ofsamples containing all the treated units and a random subset of the controls (thus, 2/5 of our analysis sample was treated). We further divide this sample into a training set of size 100,000, a test set of size 25,000, and a holdout set with the rest.
|Histogram of CATE||-learner estimates of CATE|
To use the -learner, we first estimate and to form the
-loss function in (4). To do so, we fit models for the nuisance components via both boosting and the lasso (both with tuning parameters selected via cross-validation), and pick the model that minimized cross-validated error. Perhaps unsurprisingly noting the large sample size, this criterion leads us to choose boosting for both and . Another option would have been to try to combine predictions from the lasso and boosting models, as advocated by Van der Laan, Polley, and Hubbard (2007).
Next, we optimize the -loss function. We again try methods based on both the lasso and boosting. This time, the lasso achieves a slightly lower training set cross-validated -loss than boosting, namely 0.1816 versus 0.1818. Because treatment effects are so weak (and so there is potential to overfit even in cross-validation), we have also examined -loss on the holdout set. The lasso again comes out ahead on the holdout set, and the improvement in -loss is stable, 0.1781 versus 0.1783.444Although the improvement in -loss is stable, the loss itself is somewhat different between the training and holdout samples. This appears to be due to the term induced by irreducible outcome noise. This term is large and noisy in absolute terms; however, it gets canceled out when comparing the accuracy of two models. This phenomenon plays a key role in the literature on model selection via cross-validation (Yang, 2007). We thus choose the lasso-based fit as our final model for .
Because we know the true CATE function in our semi-synthetic data generative distribution, we can evaluate the oracle test set mean-squared error, . Here, it is clear that the lasso does substantially better than boosting, achieving a mean-sqaured error of versus that corresponds to a factor 2.6 improvement in efficiency. The right panel of Figure 1 compares estimates from minimizing the
-loss using the lasso and boosting respectively. We see that the lasso is somewhat biased, but boosting is noisy, and the bias-variance trade-off favors the lasso. With a larger sample size, we’d expect boosting to prevail.
We also compared our approach to both the single lasso approach (7), and a popular non-parametric approach to heterogeneous treatment effect estimation via BART (Hill, 2011), with the estimated propensity score added in as a feature following the recommendation of Hahn, Murray, and Carvalho (2017). The single lasso got an oracle test set error of , whereas BART got . It thus appears that, in this example, there is value in using a non-parametric method for estimating and , but then using the simpler lasso for
. In contrast, the single lasso approach uses linear modeling everywhere (thus leading to inefficiencies and potential confounding), whereas BART uses non-parametric modeling everywhere (thus making it difficult to obtain a stablefit). Section 4 has a more comprehensive simulation evaluation of the -learner relative to several baselines, including the meta-learners of Künzel et al. (2017).
3 Asymptotic Theory
In order to illustrate formal aspects of the -learner, we study a variant of (4) based on penalized kernel regression. The problem of regularized kernel learning covers a broad class of methods that have been thoroughly studied in the statistical learning literature (see, e.g., Bartlett and Mendelson, 2006; Caponnetto and De Vito, 2007; Cucker and Smale, 2002; Steinwart and Christmann, 2008; Mendelson and Neeman, 2010), and thus provides an ideal case study for highlighting the promise of the -learner. Throughout this section, we will seek to derive performance guarantees for the -learner that match the best available guarantees for the infeasible oracle learner (3). We conduct our analysis in the following setting.
We study -penalized kernel regression, where is a reproducing kernel Hilbert space (RKHS) with a continuous, positive semi-definite kernel function . Let be a non-negative measure over the compact metric space , and let be a kernel with respect to . Let be defined as . By Mercer’s theorem (Cucker and Smale, 2002)
, there is an orthonormal basis of eigenfunctionsof
with corresponding eigenvaluessuch that
Consider the function defined by . Following Mendelson and Neeman (2010), we define the RKHS to be the image of : For every , define the corresponding element in by with the induced inner product .
Without loss of generality, we assume for all . We assume that the eigenvalues satisfy for some constant , and that the orthonormal eigenfunctions with are uniformly bounded, i.e., . Finally, we assume that the outcomes are almost surely bounded, .
The oracle CATE function satisfies for some .
We emphasize that, above, we do not assume that has a finite -norm; rather, we only assume that
we can make it have a finite -norm after a sufficient amount of smoothing.
More concretely, with , would be the identity operator, and so this assumption would be
equivalent to the strongest possible assumption that itself.
Then, as grows, this assumption gets progressively weaker, and at it would devolve
to simply asking that belong to the space of square-integrable
functions.555To see this, let for some , in which case . We also note that by taking where is 1 at the -th position and 0 otherwise, we have . Then,
Given this setup, we study oracle penalized regression rules of the following form,
where is a mapping from the sample indices to evenly sized data folds, such that and are each trained without considering observations in the -th data fold; typically we set to 5 or 10. Adding the upper bound (or, in fact, any finite upper bound on ) enables us to rule out some pathological behaviors.
We seek to characterize the accuracy of our estimator by bounding its regret ,
Note that if we have overlap, i.e., there is an such that for all , then following from , we have
meaning that regret bounds directly translate into squared-error loss bounds for , and vice-versa.
where the -notation hides logarithmic factors. In the case where is within the RKHS used for penalization, we recover the more familiar rate established by Caponnetto and De Vito (2007). Again, our goal is to establish regret bounds for our feasible estimator that match those for .
3.1 Fast Rates and Isomorphic Coordinate Projections
In order to establish regret bounds for , we first need to briefly review the proof techniques underlying (14). The argument of Mendelson and Neeman (2010) relies on the following quasi-isomorphic coordinate projection lemma of Bartlett (2008). To state this result, write
for the radius- ball of capped by , let denote the best approximation to within , and define -regret over . We also define the estimated and oracle -regret functions
is the oracle loss function on the samples used for empirical minimization, and
is the feasible cross-fitted loss function. is not actually observable in practice as it depends on ; however, this does not hinder us from establishing high-probability bounds for it. The lemma below is adapted from Bartlett (2008).
Let be any loss function, and be the associated regret. Let be a continuous positive function that is increasing in . Suppose that, for every and some , the following inequality holds:
Then, writing and , any solution to the empirical minimization problem with regularizer ,
also satisfies the following risk bound:
In other words, the above lemma reduces the problem of deriving regret bounds to establishing quasi-isomorphisms as in (19) (and any with-high-probability quasi-isomorphism guarantee yields a with-high-probability regret bound). In order to prove a regret bound (14) following the approach of Mendelson and Neeman (2010), we first establish a quasi-isomorphism in terms of the oracle empirical regret,
and then get a regret bound in terms of the regularized risk as in (21). Mendelson and Neeman (2010) then conclude their argument by noting that, for any (see their Corollary 2.7 for details 666Corollary 2.7 in Mendelson and Neeman (2010) proves (23) when . To show (23) with the extra constraint for , it directly follows from the original proof with the added constraint.),
and that by optimizing this bound (up to logarithmic factors) with and a specific choice of , and by setting to be , we recover the regret bound (14) for the oracle learner. Their argument for choosing the optimal value of to use in (23) leverages the approximation error bound
For our purposes, the upshot is that if we can match the strength of the quasi-isomorphism bounds (22) with our feasible loss function, then we can also match the rate of any regret bounds proved using Lemma 1. We do so via the following result.
Given the conditions in Lemma 1, suppose that the propensity estimate is uniformly consistent,
and the errors converge at rate
for some sequence such that
Suppose, moreover, that we have overlap, i.e., for some , and that Assumptions 2 and 3 hold. Then, for any , there is a constant such that the regret functions induced by the oracle learner (10) and the feasible learner (11) are coupled as
simultaneously for all with and , with probability at least .
This result implies that we can turn any quasi-isomorphism for the oracle learner (22) with error into a quasi-isomoprhism bound for with error inflated by (28). Thus, given any regret bound for the oracle learner built using Lemma 1, we can also get an analogous regret bound for the feasible learner provided we regularize just a little bit more. The following result makes this formal.
In other words, we have found that with penalized kernel regression, the -learner can match the best available performance guarantees available for the oracle learner (10) that knows everything about the data generating distribution except the true treatment effect function—both the feasible and the oracle learner satisfy
As we approach the semiparametric case, i.e., , we recover the well-known result from the semiparametric inference literature that, in order to get -consistent inference for a single target parameter, we need 4-th root consistent nuisance parameter estimates (see Robinson (1988), Chernozhukov et al. (2017), and references therein).
We emphasize that our quasi-oracle result depends on a local robustness property of the -loss function, and does not hold for general meta-learners; for example, it does not hold for the -learner of Künzel et al. (2017). To see this, pick such that , and modify the nuisance components used to form the -learner in (8) such that and . Recall that the -learner fits by minimizing , and fits by solving an analogous problem on the controls. Combining the estimates from these two loss functions, we see by inspection that its final estimate of the treatment effect is also shifted by . The perturbations are vanishingly small on the scale, and so would not affect conditions analogous to those of Theorem 3; yet they have a big enough effect on to break any convergence results on the scale of (30). In other words, while the R-learner can achieve the oracle regret bound with its nuisance parameters and learned at a slower rate (by Theorem 3), the X-learner’s regret bound is capped at the rate at which its nuisance parameters and are estimated, and thus does not exhibit the same kind of quasi-oracle property.777Künzel et al. (2017) do have some quasi-oracle type results; however, they only focus on the case where the number of control units grows much faster than the number of treated units . In this case, they show that the -learner performs as well as an oracle who already knew the mean response function for the controls, . Intriguingly, in this special case, we have and , and so the -learner as in (11) is roughly equivalent to the -learner procedure (8). Thus, at least qualitatively, we can interpret the result of Künzel et al. (2017) as a special case of our result in the case where the number of controls dominates the number of treated units (or vice-versa).
4 Simulation Experiments
As discussed several times already, our approach to heterogeneous treatment effect estimation via learning objectives can be implemented using any method that is framed as a loss minimization problem, such as boosting, decision trees, etc. In this section, we focus on simulation experiments using the-learner, a direct implementation of (4) based on both the lasso and boosting.
We consider the following methods for heterogeneous treatment effect estimation as baselines. The -learner fits a single model for , and then estimates ; the -learner fits the functions separately for , and then estimates ; the -learner and -learner are as described in Section 1.1.888The -, -, -. and -learners are named following the nomenclature of Künzel et al. (2017). In addition, for the boosting-based experiments, we consider the causal boosting algorithm (denoted by in Section 4.2) proposed by Powers et al. (2018).
Finally, for the lasso-based experiments, we consider an additional variant of our method, the -learner, that combines the spirit of - and -learners by adding an additional term in the loss function: using with
Heuristically, one may hope that the -learner may be more robust, as it has a “second chance” to eliminate confounders.
In all simulations, we generate data as follows: for different choices of -distribution indexed by dimension , noise level , propensity function , baseline main effect , and treatment effect function :
We consider the following specific setups.
Difficult nuisance components and an easy treatment effect function. We use the scaled Friedman (1991) function for the baseline main effect , along with , and , where .
Randomized trial. Here, for all , so it is possible to be accurate without explicitly controlling for confounding. We take , , and .
Easy propensity score and a difficult baseline. In this setup, there is strong confounding, but the propensity score is much easier to estimate than the baseline: , , , and the treatment effect is constant, .
Unrelated treatment and control arms, with data generated as , , , and . Here, and are uncorrelated, and so there is no upside to learning them jointly.
4.1 Lasso-based experiments
In this section, we compare -, -, -, -, and our - and -learners implemented via the lasso on simulated designs. For the -learner, we follow Imai and Ratkovic (2013) using (7), while for the -learner, we use (6). For the -, -, and -learners, we use
-penalized logistic regression to estimate propensity, and the lasso for all other regression estimates.
For all estimators, we run the lasso on the pairwise interactions of a natural spline basis expansion with 7 degrees of freedom on. We generate data points as the training set and generate a separate test set also with data points, and the reported the mean-squared error is on the test set. All lasso regularization penalties are chosen from 10-fold cross validation. For the - and -learners, we use 10-fold cross-fitting on and