1 Introduction
The problem of heterogeneous treatment effect estimation in observational studies arises in a wide variety of application areas (Athey, 2017), ranging from personalized medicine (Obermeyer and Emanuel, 2016) to offline evaluation of bandits (Dudík et al., 2011), and is also a key component of several proposals for learning decision rules (Athey and Wager, 2017; Hirano and Porter, 2009). There has been considerable interest in developing flexible and performant methods for heterogeneous treatment effect estimation. Some notable recent advances include proposals based on the lasso (Imai and Ratkovic, 2013), recursive partitioning (Athey and Imbens, 2016; Su et al., 2009), BART (Hahn, Murray, and Carvalho, 2017; Hill, 2011), forests (Athey, Tibshirani, and Wager, 2018; Wager and Athey, 2017), boosting (Powers et al., 2018), neural networks (Shalit, Johansson, and Sontag, 2017), etc., as well as combinations thereof (Künzel et al., 2017; Luedtke and van der Laan, 2016); see Dorie et al. (2017) for a recent survey and comparisons. However, although this line of work has led to many promising methods, the literature does not yet provide a comprehensive answer as to how machine learning methods should be adapted for treatment effect estimation. First, there is no definitive guidance on how to turn a good generic predictor into a good treatment effect estimator that is robust to confounding: developing "causal" variants of machine learning methods remains fairly labor intensive, effectively requiring the involvement of specialized researchers. Second, with some exceptions, the above methods are mostly justified via numerical experiments, and come with no formal convergence guarantees or regret bounds proving that they in fact succeed in isolating causal effects.
In this paper, we discuss a new approach to estimating heterogeneous treatment effects that addresses both of these concerns. Our framework allows for fully automatic specification of heterogeneous treatment effect estimators in terms of arbitrary loss minimization procedures. Moreover, we show how the resulting methods can achieve regret bounds comparable to those of oracle methods that know everything about the data-generating distribution except the treatment effects.
We formalize our problem in terms of the potential outcomes framework (Neyman, 1923; Rubin, 1974). The analyst has access to $n$ independent and identically distributed examples $(X_i, Y_i, W_i)$, $i = 1, \ldots, n$, where $X_i \in \mathcal{X}$ denotes per-person features, $Y_i \in \mathbb{R}$ is the observed outcome, and $W_i \in \{0, 1\}$ is the treatment assignment. We posit the existence of potential outcomes $\{Y_i(0), Y_i(1)\}$ corresponding to the outcome we would have observed given the treatment assignment $W_i = 0$ or $1$ respectively, such that $Y_i = Y_i(W_i)$, and seek to estimate the conditional average treatment effect (CATE) function
$$\tau^*(x) = \mathbb{E}\left[ Y(1) - Y(0) \,\middle|\, X = x \right]. \qquad (1)$$
In order to identify $\tau^*(\cdot)$, we assume unconfoundedness, i.e., the treatment assignment is as good as random once we control for the features $X_i$ (Rosenbaum and Rubin, 1983).
Assumption 1.
The treatment assignment is unconfounded, $\{Y_i(0), Y_i(1)\} \perp W_i \mid X_i$.
We write the treatment propensity as $e^*(x) = \mathbb{P}\left[ W = 1 \mid X = x \right]$ and the conditional response surfaces as $\mu^*_{(w)}(x) = \mathbb{E}\left[ Y(w) \mid X = x \right]$ for $w \in \{0, 1\}$ (throughout this paper, we use $*$-superscripts to denote unknown population quantities). Under unconfoundedness, we can check that $\varepsilon_i := Y_i - \mu^*_{(W_i)}(X_i)$ satisfies $\mathbb{E}\left[ \varepsilon_i \mid X_i, W_i \right] = 0$. A few more lines of algebra then imply that the CATE function can be expressed as follows in terms of the propensity $e^*(\cdot)$ and the conditional mean outcome $m^*(x) = \mathbb{E}\left[ Y \mid X = x \right]$:
$$Y_i - m^*(X_i) = \left( W_i - e^*(X_i) \right) \tau^*(X_i) + \varepsilon_i. \qquad (2)$$
This decomposition was originally used by Robinson (1988) to estimate parametric components in partially linear models, and has regularly been discussed in both statistics and econometrics ever since. For example, Zhao, Small, and Ertefaie (2017) use Robinson's transformation for post-selection inference on effect modification; Athey, Tibshirani, and Wager (2018) rely on it to grow a causal forest that is robust to confounding; Robins (2004) builds on it in developing g-estimation for sequential trials; and Chernozhukov et al. (2017) present it as a leading example of how machine learning methods can be put to good use in estimating nuisance components for semiparametric inference.
With the representation (2), an oracle who knew both the functions $m^*(\cdot)$ and $e^*(\cdot)$ a priori could estimate the heterogeneous treatment effect function $\tau^*(\cdot)$ by simple loss minimization:
$$\tilde{\tau}(\cdot) = \operatorname{argmin}_{\tau} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left[ \left( Y_i - m^*(X_i) \right) - \left( W_i - e^*(X_i) \right) \tau(X_i) \right]^2 + \Lambda_n\left( \tau(\cdot) \right) \right\}, \qquad (3)$$
where the term $\Lambda_n(\tau(\cdot))$ is interpreted as a regularizer on the complexity of the $\tau(\cdot)$ function. In practice, this regularization could be explicit as in penalized regression, or implicit, e.g., as provided by a carefully designed deep neural network.
The difficulty, however, is that we never know the main effect function $m^*(\cdot)$ and usually do not know the treatment propensities $e^*(\cdot)$ either, and so the estimator (3) is not feasible. In this paper, we study the following class of two-step estimators motivated by the oracle procedure (3):

1. Fit $\hat{m}(\cdot)$ and $\hat{e}(\cdot)$ via appropriate methods tuned for optimal predictive accuracy, then

2. Estimate treatment effects via a plug-in version of (3), where $\hat{m}^{(-i)}(X_i)$, $\hat{e}^{(-i)}(X_i)$, etc., denote held-out predictions, i.e., predictions made without using the $i$-th training example:^{1}

$$\hat{\tau}(\cdot) = \operatorname{argmin}_{\tau} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left[ \left( Y_i - \hat{m}^{(-i)}(X_i) \right) - \left( W_i - \hat{e}^{(-i)}(X_i) \right) \tau(X_i) \right]^2 + \Lambda_n\left( \tau(\cdot) \right) \right\}. \qquad (4)$$

^{1} Using held-out prediction for nuisance components, also known as cross-fitting, is an increasingly popular approach for making machine learning methods usable in classical semiparametrics (Athey and Wager, 2017; Chernozhukov et al., 2017; Schick, 1986; van der Laan and Rose, 2011; Wager et al., 2016).
In other words, the first step learns an approximation for the oracle objective, and the second step optimizes it. We refer to this approach as the R-learner, in recognition of the work of Robinson (1988) and also to emphasize the role of residualization. We will also refer to the squared loss in (4) as the R-loss.
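The two-step recipe above can be sketched end-to-end in a few dozen lines. The following is a minimal illustration in pure Python, with our own toy design (linear propensity, sinusoidal baseline, constant treatment effect), not one from the paper: nuisance components are estimated by cross-fitted bin averages, and the R-loss (4) is then minimized over a constant effect, for which the minimizer has a closed form.

```python
import random
import math

random.seed(0)
n = 4000
tau_star = 1.0                                  # true (constant) treatment effect

def e_star(x):  return 0.3 + 0.4 * x            # true propensity (toy choice)
def mu0(x):     return math.sin(2 * x)          # baseline outcome (toy choice)

X = [random.random() for _ in range(n)]
W = [1 if random.random() < e_star(x) else 0 for x in X]
Y = [mu0(x) + w * tau_star + random.gauss(0, 0.5) for x, w in zip(X, W)]

# Step 1: cross-fitted nuisance estimates via held-out bin averages.
K, B = 5, 10
def bin_of(x): return min(int(x * B), B - 1)

m_hat, e_hat = [0.0] * n, [0.0] * n
for k in range(K):
    sums_y, sums_w, cnt = [0.0] * B, [0.0] * B, [0] * B
    for i in range(n):                           # accumulate on the training folds
        if i % K != k:
            b = bin_of(X[i])
            sums_y[b] += Y[i]
            sums_w[b] += W[i]
            cnt[b] += 1
    for i in range(n):                           # predict on the held-out fold
        if i % K == k:
            b = bin_of(X[i])
            m_hat[i] = sums_y[b] / cnt[b]
            e_hat[i] = sums_w[b] / cnt[b]

# Step 2: minimize the R-loss over a constant tau (closed-form least squares).
num = sum((Y[i] - m_hat[i]) * (W[i] - e_hat[i]) for i in range(n))
den = sum((W[i] - e_hat[i]) ** 2 for i in range(n))
tau_hat = num / den                              # should be close to tau_star
```

In practice the second step would minimize (4) over a flexible function class via off-the-shelf software; the constant fit above is only meant to make the residual-on-residual structure of the R-loss explicit.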
This approach has several advantages over existing, more ad hoc proposals. Steps 1 and 2 capture the conceptual aspects of heterogeneous treatment effect estimation in a general way, and so implementing new variants of these methods reduces to routine work. Moreover, the key distinctive step (4) is simply a loss minimization problem, and so can be efficiently solved via off-the-shelf software such as glmnet for high-dimensional regression (Friedman, Hastie, and Tibshirani, 2010), XGBoost for boosting (Chen and Guestrin, 2016), or TensorFlow for deep learning (Abadi et al., 2016). Finally, it is straightforward to tune the regularizer $\Lambda_n(\cdot)$ by simply cross-validating the squared-error loss in (4), which avoids the use of more sophisticated model-assisted cross-validation procedures as developed in Athey and Imbens (2016) or Powers et al. (2018).

This paper makes the following contributions. First, we implement variants of our method based on penalized regression and boosting. In each case, we find that the R-learner exhibits promising performance relative to existing proposals. Second, we prove that—in the case of penalized kernel regression—regret bounds for the feasible estimator $\hat{\tau}(\cdot)$ asymptotically match the best available bounds for the oracle method $\tilde{\tau}(\cdot)$. Crucially, this quasi-oracle behavior holds even if the first-step predictors $\hat{m}(\cdot)$ and $\hat{e}(\cdot)$ are up to an order of magnitude less accurate than $\hat{\tau}(\cdot)$ itself. Both results highlight the promise of the R-learner as a general-purpose framework for estimating heterogeneous treatment effects in observational studies.
1.1 Related Work
Under unconfoundedness (Assumption 1), the CATE function $\tau^*(\cdot)$ can be written as
$$\tau^*(x) = \mu^*_{(1)}(x) - \mu^*_{(0)}(x). \qquad (5)$$
As a consequence of this representation, it may be tempting to first estimate $\hat{\mu}_{(0)}(\cdot)$ and $\hat{\mu}_{(1)}(\cdot)$ on the control and treated samples separately, and then set $\hat{\tau}(x) = \hat{\mu}_{(1)}(x) - \hat{\mu}_{(0)}(x)$. This approach, however, is often not robust: because $\hat{\mu}_{(0)}(\cdot)$ and $\hat{\mu}_{(1)}(\cdot)$ are not trained together, their difference may be unstable. As an example, consider fitting the lasso (Tibshirani, 1996) to estimate $\mu^*_{(0)}(\cdot)$ and $\mu^*_{(1)}(\cdot)$ in a high-dimensional linear model, with $\mu^*_{(w)}(x) = x \beta^*_{(w)}$, $x, \beta^*_{(w)} \in \mathbb{R}^d$, and $Y_i = X_i \beta^*_{(W_i)} + \varepsilon_i$. A naive approach would fit two separate lassos to the treated and control samples,
$$\hat{\beta}_{(w)} = \operatorname{argmin}_{\beta} \left\{ \sum_{\{i : W_i = w\}} \left( Y_i - X_i \beta \right)^2 + \lambda_w \lVert \beta \rVert_1 \right\}, \quad w \in \{0, 1\}, \qquad (6)$$
and then use them to deduce a treatment effect function, $\hat{\tau}(x) = x \left( \hat{\beta}_{(1)} - \hat{\beta}_{(0)} \right)$. However, the fact that both $\hat{\beta}_{(0)}$ and $\hat{\beta}_{(1)}$ are regularized towards 0 separately may inadvertently regularize the treatment effect estimate $\hat{\tau}(\cdot)$ away from 0, even when $\tau^*(x) = 0$ everywhere. This problem is especially acute when the treated and control samples are of different sizes; see Künzel et al. (2017) for some striking examples.
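The shrinkage artifact is easy to reproduce numerically. The sketch below substitutes ridge for lasso shrinkage (our simplification, so each fit has a closed form); the mechanism, separate regularization toward zero on arms of very different sizes, is the same. The slope is identical in both arms, so the true treatment effect is zero, yet the fitted difference is far from zero.

```python
import random

random.seed(0)
beta = 2.0            # same slope in both arms => true treatment effect is 0
lam = 100.0           # shared ridge penalty
n1, n0 = 50, 1000     # imbalanced treated / control sample sizes

def ridge_slope(n):
    """Closed-form ridge fit of y = x * beta + noise on n points."""
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [x * beta + random.gauss(0, 1) for x in xs]
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

b1 = ridge_slope(n1)  # heavily shrunk: sum of x^2 is ~50 against penalty 100
b0 = ridge_slope(n0)  # barely shrunk:  sum of x^2 is ~1000 against penalty 100
tau_gap = b1 - b0     # spurious "treatment effect" slope; should be 0 but is not
```

Here the small treated arm is shrunk roughly three times harder than the large control arm, so the plug-in difference of the two fits invents a strong, purely artificial effect-modification signal.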
The recent literature on heterogeneous treatment effect estimation has proposed several ideas on how to avoid such "regularization bias". In particular, several recent papers have proposed structural changes to various machine learning methods aimed at focusing on accurate estimation of $\tau^*(\cdot)$ (Athey and Imbens, 2016; Hahn et al., 2017; Imai and Ratkovic, 2013; Powers et al., 2018; Shalit et al., 2017; Su et al., 2009; Wager and Athey, 2017). For example, with the lasso, Imai and Ratkovic (2013) advocate replacing (6) with a single lasso as follows,
$$(\hat{\beta}, \hat{\delta}) = \operatorname{argmin}_{\beta, \delta} \left\{ \sum_{i=1}^{n} \left( Y_i - X_i \beta - (2W_i - 1)\, X_i \delta / 2 \right)^2 + \lambda_\beta \lVert \beta \rVert_1 + \lambda_\delta \lVert \delta \rVert_1 \right\}, \qquad (7)$$
where then $\hat{\tau}(x) = x \hat{\delta}$. This approach always correctly regularizes $\hat{\tau}(\cdot)$ towards a sparse vector for treatment heterogeneity. The other approaches cited above present variants and improvements of similar ideas in the context of more sophisticated machine learning methods; see, for example, Figure 1 of Shalit et al. (2017) for a neural network architecture designed to highlight treatment effect heterogeneity without being affected by confounders.

Another trend in the literature, closer to our paper, has focused on meta-learning approaches that are not closely tied to any specific machine learning method. Künzel, Sekhon, Bickel, and Yu (2017) proposed two approaches to heterogeneous treatment effect estimation via generic machine learning methods. One, called the X-learner, first estimates $\hat{\mu}_{(0)}(\cdot)$ and $\hat{\mu}_{(1)}(\cdot)$ via appropriate nonparametric regression methods. Then, on the treated observations, it defines pseudo-effects $\tilde{D}_i = Y_i - \hat{\mu}_{(0)}(X_i)$, and uses them to fit $\hat{\tau}_{(1)}(\cdot)$ via a nonparametric regression. Another estimator $\hat{\tau}_{(0)}(\cdot)$ is obtained analogously (see Künzel et al. (2017) for details), and the two treatment effect estimators are aggregated as
$$\hat{\tau}(x) = \hat{e}(x)\, \hat{\tau}_{(0)}(x) + \left( 1 - \hat{e}(x) \right) \hat{\tau}_{(1)}(x). \qquad (8)$$
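A compact sketch of the X-learner pipeline, on a toy randomized design of our own choosing: histogram-bin averages stand in for the generic nonparametric regressions, and the second-stage fits are constants for simplicity.

```python
import random

random.seed(0)
n = 10000
tau_true = 1.0

# Toy design (our own choice): randomized trial with a linear baseline.
X = [random.random() for _ in range(n)]
W = [1 if random.random() < 0.5 else 0 for _ in range(n)]
Y = [x + w * tau_true + random.gauss(0, 0.5) for x, w in zip(X, W)]

B = 10  # histogram-bin regression as a stand-in nonparametric learner
def fit_bins(idx):
    sums, cnt = [0.0] * B, [0] * B
    for i in idx:
        b = min(int(X[i] * B), B - 1)
        sums[b] += Y[i]
        cnt[b] += 1
    return lambda x: sums[min(int(x * B), B - 1)] / max(cnt[min(int(x * B), B - 1)], 1)

treated = [i for i in range(n) if W[i] == 1]
control = [i for i in range(n) if W[i] == 0]
mu0 = fit_bins(control)   # outcome model fit on controls
mu1 = fit_bins(treated)   # outcome model fit on treated

# Pseudo-effects, then (constant) second-stage fits on each arm.
tau1 = sum(Y[i] - mu0(X[i]) for i in treated) / len(treated)
tau0 = sum(mu1(X[i]) - Y[i] for i in control) / len(control)

e_hat = len(treated) / n  # constant propensity estimate, matching the design
tau_x = e_hat * tau0 + (1 - e_hat) * tau1
```

In a real application the second-stage fits would themselves be flexible regressions of the pseudo-effects on $x$; the constant fits above only illustrate the plumbing of (8).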
Another method, called the U-learner, starts by noticing that
$$\tau^*(x) = \mathbb{E}\left[ \frac{Y_i - m^*(X_i)}{W_i - e^*(X_i)} \,\middle|\, X_i = x \right], \qquad (9)$$
and then fitting $\hat{\tau}(\cdot)$ by regressing $\hat{U}_i = \left( Y_i - \hat{m}^{(-i)}(X_i) \right) \big/ \left( W_i - \hat{e}^{(-i)}(X_i) \right)$ on $X_i$ using any off-the-shelf method. Athey and Imbens (2016) propose a "transformed outcome" approach that involves regressing $Y_i \left( W_i - \hat{e}(X_i) \right) \big/ \left( \hat{e}(X_i) \left( 1 - \hat{e}(X_i) \right) \right)$ on $X_i$ to estimate $\tau^*(\cdot)$.
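As a quick sanity check on (9), the following sketch forms the U-learner pseudo-outcomes with known nuisance components in a toy randomized design (our own choice, not from the paper): their average recovers the constant treatment effect, although the division by $W_i - e^*(X_i)$ makes the individual pseudo-outcomes quite noisy.

```python
import random

random.seed(1)
n = 20000
tau, e, sigma = 0.5, 0.5, 0.3   # constant effect, known propensity, noise level
m = 0.0 + e * tau               # m*(x) = mu0(x) + e * tau, with mu0 = 0 here

W = [1 if random.random() < e else 0 for _ in range(n)]
Y = [w * tau + random.gauss(0, sigma) for w in W]

# U-learner pseudo-outcomes: E[U | X] = tau, but each U_i is inflated by
# the factor 1 / (W_i - e), i.e., +/- 2 here, so Var(U) is large.
U = [(y - m) / (w - e) for y, w in zip(Y, W)]
tau_u = sum(U) / n
```

With the propensity near 0 or 1 the denominator becomes tiny for some observations, which is exactly the instability that motivates preferring the multiplicative residual-on-residual form of the R-loss over direct division.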
In our experiments, we compare our method at length to those of Künzel et al. (2017). Relative to this line of work, our main contribution is our method, the R-learner, which provides meaningful improvements over baselines in a variety of settings, and our analysis, which provides the first "quasi-oracle" regret bound we are aware of for nonparametric treatment effect estimation, i.e., one in which the regret of $\hat{\tau}(\cdot)$ may decay faster than that of the nuisance estimates $\hat{m}(\cdot)$ or $\hat{e}(\cdot)$.
Conceptually, we draw from the literature on semiparametric efficiency and constructions of orthogonal moments, including Robinson (1988) and, more broadly, Belloni et al. (2017), Bickel et al. (1998), Newey (1994), Robins (2004), Robins and Rotnitzky (1995), Robins et al. (2017), Tsiatis (2007), van der Laan and Rose (2011), etc., that aim at $\sqrt{n}$-rate estimation of a target parameter in the presence of nuisance components that cannot themselves be estimated at a $\sqrt{n}$ rate. Algorithmically, our approach has a close connection to targeted maximum likelihood estimation (Scharfstein et al., 1999; van der Laan and Rubin, 2006), which starts by estimating nuisance components nonparametrically, and then uses these first-stage estimates to define a likelihood function that is optimized in a second step.

The main difference between this literature and our results is that existing results typically focus on estimating a single (or low-dimensional) target parameter, whereas we seek to estimate an object $\tau^*(\cdot)$ that may itself be quite complicated. Another research direction that also uses ideas from semiparametrics to estimate complex objects is centered on estimating optimal treatment allocation rules (Athey and Wager, 2017; Dudík et al., 2011; Luedtke and van der Laan, 2016; Zhang et al., 2012; Zhao et al., 2012). This problem is closely related to, but subtly different from, the problem of estimating $\tau^*(\cdot)$ under squared-error loss; see Kitagawa and Tetenov (2018), Manski (2004) and Murphy (2005) for a discussion. We also note the work of van der Laan, Dudoit, and van der Vaart (2006), who consider nonparametric estimation by empirical loss minimization over a discrete grid.
2 A Motivating Example
To see how the R-learner works in practice, we consider an example motivated by Arceneaux, Gerber, and Green (2006), who studied the effect of paid get-out-the-vote calls on voter turnout. A common difficulty in comparing the accuracy of heterogeneous treatment effect estimators on real data is that we do not have access to the ground truth. From this perspective, a major advantage of this application is that Arceneaux et al. (2006) found no effect of get-out-the-vote calls on voter turnout, and so we know what the correct answer is. We then "spike" the original dataset with a synthetic treatment effect so as to make the task of estimating heterogeneous treatment effects non-trivial. In other words, both the baseline signal and the propensity scores are from real data; however, $\tau^*(\cdot)$ is chosen by us, and so we can check whether different methods in fact succeed in recovering it.
The dataset of Arceneaux et al. (2006) has many covariates that are highly predictive of turnout, and the original study assigned people to the treatment and control conditions with variable probabilities, resulting in a non-negligible amount of confounding.^{2}
^{2} The design of Arceneaux et al. (2006) was randomized separately by state and competitiveness of the election. Although the randomization probabilities were known to the experimenters, we here hide them from our algorithm, and require it to learn a model for the treatment propensities. We also note that, in the original data, not all voters assigned to be contacted could in fact answer the phone call, meaning that all effects should be interpreted as intent-to-treat effects.

A naive analysis (without correcting for variable treatment propensities) estimates the average intent-to-treat effect of a single get-out-the-vote call on turnout as 4%; however, an appropriate analysis finds with high confidence that any treatment effect must be smaller than 1% in absolute value. The full sample has observations, of which were treated. We focus on covariates (including state, county, age, gender, etc.). Both the outcome and the treatment are binary.

As discussed above, we assume that the treatment effect in the original data is 0, and spike in a synthetic treatment effect $\tau^*(X_i)$ that depends on whether the $i$-th unit voted in the year 2000 and on their age. Because the outcomes are binary, we add in the synthetic treatment effect by strategically flipping some outcome labels.^{3}

^{3} Denote the original unflipped outcomes as $\tilde{Y}_i$. To add in a treatment effect $\tau(\cdot)$, we first draw Bernoulli random variables $A_i$ with probability $|\tau(X_i)|$. Then, if $A_i = 0$, we set $Y_i(0) = Y_i(1) = \tilde{Y}_i$, whereas if $A_i = 1$, we set $Y_i(w)$ to $w$ or $1 - w$ depending on whether $\tau(X_i) > 0$ or $\tau(X_i) < 0$ respectively. Finally, we set $Y_i = Y_i(W_i)$.

As is typical in causal inference applications, the treatment heterogeneity here is quite subtle, and so a large sample size is needed in order to reject a null hypothesis of no treatment heterogeneity. For our analysis, we focus on a subset of samples containing all the treated units and a random subset of the controls (thus, 2/5 of our analysis sample was treated). We further divide this sample into a training set of size 100,000, a test set of size 25,000, and a holdout set with the rest.

[Figure 1: histogram of the true CATE values; R-learner estimates of the CATE.]
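Under our reading of the label-flipping scheme in the footnote, one can verify numerically that flipping with probability $|\tau(x)|$ induces exactly the desired treatment effect on binary outcomes. The sketch below uses a constant $\tau = 0.1$ and a randomized design; both are illustrative choices of ours, not the paper's.

```python
import random

random.seed(0)
n = 100000
tau = 0.1   # desired (constant) treatment effect to spike in

y1_sum = y0_sum = n1 = n0 = 0
for _ in range(n):
    y_base = 1 if random.random() < 0.4 else 0   # original label, no effect
    w = 1 if random.random() < 0.5 else 0        # randomized treatment
    if random.random() < abs(tau):               # flip event A_i = 1
        y = w if tau > 0 else 1 - w              # push the label toward sign(tau)
    else:
        y = y_base                               # keep the original label
    if w:
        y1_sum += y; n1 += 1
    else:
        y0_sum += y; n0 += 1

ate_hat = y1_sum / n1 - y0_sum / n0              # should be close to tau
```

The check works because, conditionally on a flip, the treated and control labels differ by exactly $\operatorname{sign}(\tau)$, and flips occur with probability $|\tau|$, so the induced effect is $\tau$ regardless of the baseline label distribution.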
To use the R-learner, we first estimate $\hat{m}(\cdot)$ and $\hat{e}(\cdot)$ to form the R-loss function in (4). To do so, we fit models for the nuisance components via both boosting and the lasso (both with tuning parameters selected via cross-validation), and pick the model that minimizes cross-validated error. Perhaps unsurprisingly given the large sample size, this criterion leads us to choose boosting for both $\hat{m}(\cdot)$ and $\hat{e}(\cdot)$. Another option would have been to try to combine predictions from the lasso and boosting models, as advocated by van der Laan, Polley, and Hubbard (2007).

Next, we optimize the R-loss function. We again try methods based on both the lasso and boosting. This time, the lasso achieves a slightly lower training-set cross-validated R-loss than boosting, namely 0.1816 versus 0.1818. Because treatment effects are so weak (and so there is potential to overfit even in cross-validation), we have also examined the R-loss on the holdout set. The lasso again comes out ahead on the holdout set, and the improvement in R-loss is stable, 0.1781 versus 0.1783.^{4} We thus choose the lasso-based fit as our final model for $\hat{\tau}(\cdot)$.

^{4} Although the improvement in R-loss is stable, the R-loss itself is somewhat different between the training and holdout samples. This appears to be due to the term induced by irreducible outcome noise. This term is large and noisy in absolute terms; however, it gets canceled out when comparing the accuracy of two models. This phenomenon plays a key role in the literature on model selection via cross-validation (Yang, 2007).
Because we know the true CATE function in our semi-synthetic data-generating distribution, we can evaluate the oracle test-set mean-squared error, $\frac{1}{|\text{test}|} \sum_{i \in \text{test}} \left( \hat{\tau}(X_i) - \tau^*(X_i) \right)^2$. Here, it is clear that the lasso does substantially better than boosting, achieving a mean-squared error that corresponds to a factor 2.6 improvement in efficiency. The right panel of Figure 1 compares estimates obtained by minimizing the R-loss using the lasso and boosting respectively. We see that the lasso is somewhat biased but boosting is noisy, and the bias-variance tradeoff favors the lasso. With a larger sample size, we would expect boosting to prevail.
We also compared our approach to both the single-lasso approach (7) and a popular nonparametric approach to heterogeneous treatment effect estimation via BART (Hill, 2011), with the estimated propensity score added in as a feature following the recommendation of Hahn, Murray, and Carvalho (2017). Both attained a larger oracle test-set error than our method. It thus appears that, in this example, there is value in using a nonparametric method for estimating $\hat{m}(\cdot)$ and $\hat{e}(\cdot)$, but then using the simpler lasso for $\hat{\tau}(\cdot)$. In contrast, the single-lasso approach uses linear modeling everywhere (thus leading to inefficiencies and potential confounding), whereas BART uses nonparametric modeling everywhere (thus making it difficult to obtain a stable $\hat{\tau}(\cdot)$ fit). Section 4 has a more comprehensive simulation evaluation of the R-learner relative to several baselines, including the meta-learners of Künzel et al. (2017).

3 Asymptotic Theory
In order to illustrate formal aspects of the R-learner, we study a variant of (4) based on penalized kernel regression. The problem of regularized kernel learning covers a broad class of methods that have been thoroughly studied in the statistical learning literature (see, e.g., Bartlett and Mendelson, 2006; Caponnetto and De Vito, 2007; Cucker and Smale, 2002; Steinwart and Christmann, 2008; Mendelson and Neeman, 2010), and thus provides an ideal case study for highlighting the promise of the R-learner. Throughout this section, we seek to derive performance guarantees for the R-learner that match the best available guarantees for the infeasible oracle learner (3). We conduct our analysis in the following setting.
We study penalized kernel regression, where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) with a continuous, positive semi-definite kernel function $K(\cdot, \cdot)$. Let $\mathbb{P}$ be a non-negative measure over the compact metric space $\mathcal{X}$, and let $K$ be a kernel with respect to $\mathbb{P}$. Let the operator $T_K : L^2(\mathbb{P}) \to L^2(\mathbb{P})$ be defined as $(T_K f)(x) = \int_{\mathcal{X}} K(x, x') f(x') \, d\mathbb{P}(x')$. By Mercer's theorem (Cucker and Smale, 2002), there is an orthonormal basis of eigenfunctions $(\psi_j)_{j \geq 1}$ of $T_K$ with corresponding eigenvalues $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$ such that $K(x, x') = \sum_{j=1}^{\infty} \sigma_j \psi_j(x) \psi_j(x')$. Following Mendelson and Neeman (2010), we define the RKHS $\mathcal{H}$ to be the image of $T_K^{1/2}$: for every $f = \sum_j f_j \psi_j \in L^2(\mathbb{P})$, define the corresponding element in $\mathcal{H}$ by $h = T_K^{1/2} f = \sum_j \sqrt{\sigma_j}\, f_j \psi_j$, with the induced inner product $\langle h, h' \rangle_{\mathcal{H}} = \sum_j f_j f'_j$.
Assumption 2.
Without loss of generality, we assume that $K(x, x) \leq 1$ for all $x \in \mathcal{X}$. We assume that the eigenvalues satisfy $\sigma_j \leq C j^{-1/p}$ for some constants $C > 0$ and $0 < p < 1$, and that the orthonormal eigenfunctions $\psi_j$ with $\lVert \psi_j \rVert_{L^2(\mathbb{P})} = 1$ are uniformly bounded, i.e., $\sup_j \lVert \psi_j \rVert_\infty \leq A < \infty$. Finally, we assume that the outcomes are almost surely bounded, $|Y_i| \leq M$.
Assumption 3.
The oracle CATE function satisfies $\lVert T_K^{\alpha/2} \tau^* \rVert_{\mathcal{H}} < \infty$ for some $0 \leq \alpha \leq 1$.
We emphasize that, above, we do not assume that $\tau^*(\cdot)$ has a finite $\mathcal{H}$-norm; rather, we only assume that we can make it have a finite $\mathcal{H}$-norm after a sufficient amount of smoothing. More concretely, with $\alpha = 0$, $T_K^{\alpha/2}$ would be the identity operator, and so this assumption would be equivalent to the strongest possible assumption that $\tau^* \in \mathcal{H}$ itself. Then, as $\alpha$ grows, this assumption gets progressively weaker, and at $\alpha = 1$ it would devolve to simply asking that $\tau^*(\cdot)$ belong to the space of square-integrable functions.^{5}

^{5} To see this, let $\tau^* = \sum_j \tau_j \psi_j$ with $\sum_j \tau_j^2 < \infty$, in which case $\lVert T_K^{1/2} \tau^* \rVert_{\mathcal{H}}^2 = \sum_j \tau_j^2 < \infty$. We also note that the eigenfunctions $\psi_j$ have unit $L^2(\mathbb{P})$-norm but $\mathcal{H}$-norm $\sigma_j^{-1/2}$, which grows without bound, so membership in $\mathcal{H}$ itself is a strictly stronger requirement.
Given this setup, we study oracle penalized regression rules of the following form,
$$\tilde{\tau}(\cdot) = \operatorname{argmin}_{\tau \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left[ \left( Y_i - m^*(X_i) \right) - \left( W_i - e^*(X_i) \right) \tau(X_i) \right]^2 + \Lambda_n\left( \lVert \tau \rVert_{\mathcal{H}} \right) \right\}, \qquad (10)$$
as well as feasible analogues obtained by cross-fitting (Chernozhukov et al., 2017; Schick, 1986):
$$\hat{\tau}(\cdot) = \operatorname{argmin}_{\tau \in \mathcal{H},\, \lVert \tau \rVert_\infty \leq 2M} \left\{ \frac{1}{n} \sum_{i=1}^{n} \left[ \left( Y_i - \hat{m}^{(-q(i))}(X_i) \right) - \left( W_i - \hat{e}^{(-q(i))}(X_i) \right) \tau(X_i) \right]^2 + \Lambda_n\left( \lVert \tau \rVert_{\mathcal{H}} \right) \right\}, \qquad (11)$$
where $q(\cdot)$ is a mapping from the sample indices $\{1, \ldots, n\}$ to $Q$ evenly sized data folds, such that $\hat{m}^{(-q(i))}(\cdot)$ and $\hat{e}^{(-q(i))}(\cdot)$ are each trained without considering observations in the $q(i)$-th data fold; typically we set $Q$ to 5 or 10. Adding the upper bound $\lVert \tau \rVert_\infty \leq 2M$ (or, in fact, any finite upper bound on $\lVert \tau \rVert_\infty$) enables us to rule out some pathological behaviors.
We seek to characterize the accuracy of our estimator by bounding its regret $\mathcal{R}(\hat{\tau})$,
$$\mathcal{R}(\tau) = L(\tau) - L(\tau^*), \quad \text{where} \quad L(\tau) = \mathbb{E}\left[ \left( \left( Y - m^*(X) \right) - \left( W - e^*(X) \right) \tau(X) \right)^2 \right]. \qquad (12)$$
Note that if we have overlap, i.e., there is an $\alpha > 0$ such that $\alpha \leq e^*(x) \leq 1 - \alpha$ for all $x \in \mathcal{X}$, then it follows from $\operatorname{Var}\left[ W \mid X = x \right] = e^*(x)\left( 1 - e^*(x) \right)$ that
$$\alpha (1 - \alpha)\, \mathbb{E}\left[ \left( \tau(X) - \tau^*(X) \right)^2 \right] \leq \mathcal{R}(\tau) \leq \frac{1}{4}\, \mathbb{E}\left[ \left( \tau(X) - \tau^*(X) \right)^2 \right], \qquad (13)$$
meaning that regret bounds directly translate into squared-error loss bounds for $\hat{\tau}(\cdot)$, and vice versa.
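The link between regret and squared error follows from a short calculation, sketched here in our notation for the decomposition (2): write $\varepsilon = (Y - m^*(X)) - (W - e^*(X))\tau^*(X)$, which has mean zero conditionally on $(X, W)$, and let $\mathcal{R}(\tau)$ denote the excess of the oracle loss $L(\tau) = \mathbb{E}[((Y - m^*(X)) - (W - e^*(X))\tau(X))^2]$ over $L(\tau^*)$. Then

```latex
\begin{aligned}
\mathcal{R}(\tau)
  &= \mathbb{E}\Big[ \big( \varepsilon + (W - e^*(X))\, (\tau^*(X) - \tau(X)) \big)^2 \Big]
     - \mathbb{E}\big[ \varepsilon^2 \big] \\
  &= \mathbb{E}\Big[ (W - e^*(X))^2 \, \big( \tau(X) - \tau^*(X) \big)^2 \Big],
\end{aligned}
```

because the cross term $2\, \mathbb{E}\left[ \varepsilon\, (W - e^*(X)) (\tau^*(X) - \tau(X)) \right]$ vanishes after conditioning on $(X, W)$. Conditioning on $X$ and using $\mathbb{E}[(W - e^*(X))^2 \mid X] = e^*(X)(1 - e^*(X)) \in [\alpha(1 - \alpha),\, 1/4]$ under overlap then sandwiches $\mathcal{R}(\tau)$ between constant multiples of $\mathbb{E}[(\tau(X) - \tau^*(X))^2]$.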
The sharpest regret bounds for (10) given Assumptions 2 and 3 are due to Mendelson and Neeman (2010) (see also Steinwart, Hush, and Scovel (2009)), and scale as
(14) 
where the $\tilde{O}$ notation hides logarithmic factors. In the case where $\tau^*(\cdot)$ is within the RKHS $\mathcal{H}$ used for penalization, we recover the more familiar rate established by Caponnetto and De Vito (2007). Again, our goal is to establish regret bounds for our feasible estimator $\hat{\tau}(\cdot)$ that match those for the oracle $\tilde{\tau}(\cdot)$.
3.1 Fast Rates and Isomorphic Coordinate Projections
In order to establish regret bounds for $\hat{\tau}(\cdot)$, we first need to briefly review the proof techniques underlying (14). The argument of Mendelson and Neeman (2010) relies on the following quasi-isomorphic coordinate projection lemma of Bartlett (2008). To state this result, write
$$\mathcal{H}_R^M = \left\{ \tau \in \mathcal{H} : \lVert \tau \rVert_{\mathcal{H}} \leq R, \ \lVert \tau \rVert_\infty \leq 2M \right\} \qquad (15)$$
for the radius-$R$ ball of $\mathcal{H}$ capped by $2M$, let $\tau_R^*$ denote the best approximation to $\tau^*$ within $\mathcal{H}_R^M$, and define the regret $\mathcal{R}_R(\tau) = L(\tau) - L(\tau_R^*)$ over $\mathcal{H}_R^M$. We also define the estimated and oracle empirical regret functions
$$\hat{\mathcal{R}}_n(\tau) = \hat{L}_n(\tau) - \hat{L}_n(\tau_R^*), \qquad \tilde{\mathcal{R}}_n(\tau) = \tilde{L}_n(\tau) - \tilde{L}_n(\tau_R^*), \qquad (16)$$
where
$$\tilde{L}_n(\tau) = \frac{1}{n} \sum_{i=1}^{n} \left[ \left( Y_i - m^*(X_i) \right) - \left( W_i - e^*(X_i) \right) \tau(X_i) \right]^2 \qquad (17)$$
is the oracle loss function on the samples used for empirical minimization, and
$$\hat{L}_n(\tau) = \frac{1}{n} \sum_{i=1}^{n} \left[ \left( Y_i - \hat{m}^{(-q(i))}(X_i) \right) - \left( W_i - \hat{e}^{(-q(i))}(X_i) \right) \tau(X_i) \right]^2 \qquad (18)$$
is the feasible cross-fitted loss function. The regret function $\hat{\mathcal{R}}_n(\cdot)$ is not actually observable in practice as it depends on the unknown benchmark $\tau_R^*$; however, this does not hinder us from establishing high-probability bounds for it. The lemma below is adapted from Bartlett (2008).
Lemma 1.
Let $\ell$ be any loss function, and let $\mathcal{R}(\cdot)$ be the associated regret. Let $\epsilon_n(R)$ be a continuous positive function that is increasing in $R$. Suppose that, for every $R \geq 1$ and some constant $c > 1$, the following inequality holds:
$$\frac{1}{c}\, \mathcal{R}(\tau) - \epsilon_n(R) \leq \hat{\mathcal{R}}_n(\tau) \leq c\, \mathcal{R}(\tau) + \epsilon_n(R) \quad \text{for all } \tau \in \mathcal{H}_R^M. \qquad (19)$$
Then, writing $\hat{L}_n(\cdot)$ for the empirical loss and $\lambda(\cdot)$ for the regularizer, any solution to the empirical minimization problem
$$\hat{\tau} = \operatorname{argmin}_{\tau \in \mathcal{H}} \left\{ \hat{L}_n(\tau) + \lambda\left( \lVert \tau \rVert_{\mathcal{H}} \right) \right\} \qquad (20)$$
also satisfies the following risk bound:
(21) 
In other words, the above lemma reduces the problem of deriving regret bounds to establishing quasi-isomorphisms as in (19) (and any with-high-probability quasi-isomorphism guarantee yields a with-high-probability regret bound). In order to prove a regret bound (14) following the approach of Mendelson and Neeman (2010), we first establish a quasi-isomorphism in terms of the oracle empirical regret,
(22) 
and then get a regret bound in terms of the regularized risk as in (21). Mendelson and Neeman (2010) then conclude their argument by noting that, for any choice of radius (see their Corollary 2.7 for details^{6}),

^{6} Corollary 2.7 in Mendelson and Neeman (2010) proves (23) in the unconstrained case; the statement with the extra sup-norm constraint follows directly from the original proof with the added constraint.
(23) 
and that by optimizing this bound (up to logarithmic factors) over the radius $R$, with a suitable choice of the regularizer, we recover the regret bound (14) for the oracle learner. Their argument for choosing the optimal value of $R$ to use in (23) leverages the approximation error bound
(24) 
For our purposes, the upshot is that if we can match the strength of the quasi-isomorphism bounds (22) with our feasible loss function, then we can also match the rate of any regret bounds proved using Lemma 1. We do so via the following result.
Lemma 2.
Given the conditions in Lemma 1, suppose that the propensity estimate $\hat{e}(\cdot)$ is uniformly consistent,
$$\sup_{x \in \mathcal{X}} \left| \hat{e}(x) - e^*(x) \right| \to_p 0, \qquad (25)$$
and the errors converge at rate
(26) 
for some sequence $a_n$ such that
(27) 
Suppose, moreover, that we have overlap, i.e., that $\alpha \leq e^*(x) \leq 1 - \alpha$ for some $\alpha > 0$, and that Assumptions 2 and 3 hold. Then, for any $\delta > 0$, there is a constant $C$ such that the regret functions induced by the oracle learner (10) and the feasible learner (11) are coupled as
(28) 
simultaneously for all $\tau$ with $\lVert \tau \rVert_{\mathcal{H}} \leq R$ and $\lVert \tau \rVert_\infty \leq 2M$, with probability at least $1 - \delta$.
This result implies that we can turn any quasi-isomorphism (22) for the oracle learner into a quasi-isomorphism bound for the feasible learner, with the error inflated as in (28). Thus, given any regret bound for the oracle learner built using Lemma 1, we can also get an analogous regret bound for the feasible learner, provided we regularize just a little bit more. The following result makes this formal.
Theorem 3.
In other words, we have found that with penalized kernel regression, the R-learner can match the best available performance guarantees for the oracle learner (10), which knows everything about the data-generating distribution except the true treatment effect function—both the feasible and the oracle learner satisfy
(30) 
As we approach the semiparametric case, we recover the well-known result from the semiparametric inference literature that, in order to get consistent inference for a single target parameter, we need fourth-root consistent nuisance parameter estimates (see Robinson (1988), Chernozhukov et al. (2017), and references therein).
We emphasize that our quasi-oracle result depends on a local robustness property of the R-loss function, and does not hold for general meta-learners; for example, it does not hold for the X-learner of Künzel et al. (2017). To see this, pick a slowly decaying sequence $\delta_n \to 0$, and modify the nuisance components used to form the X-learner in (8) such that $\hat{\mu}_{(0)}(x)$ is shifted to $\hat{\mu}_{(0)}(x) - \delta_n$ and $\hat{\mu}_{(1)}(x)$ to $\hat{\mu}_{(1)}(x) + \delta_n$. Recall that the X-learner fits $\hat{\tau}_{(1)}(\cdot)$ by minimizing a loss built from the pseudo-effects $Y_i - \hat{\mu}_{(0)}(X_i)$ on the treated, and fits $\hat{\tau}_{(0)}(\cdot)$ by solving an analogous problem on the controls. Combining the estimates from these two loss functions, we see by inspection that its final estimate of the treatment effect is also shifted by $\delta_n$. The perturbations are vanishingly small, and so would not affect conditions analogous to those of Theorem 3; yet they have a big enough effect on $\hat{\tau}(\cdot)$ to break any convergence results on the scale of (30). In other words, while the R-learner can achieve the oracle regret bound with its nuisance parameters $\hat{m}(\cdot)$ and $\hat{e}(\cdot)$ learned at a slower rate (by Theorem 3), the X-learner's regret is capped at the rate at which its nuisance parameters $\hat{\mu}_{(0)}(\cdot)$ and $\hat{\mu}_{(1)}(\cdot)$ are estimated, and thus does not exhibit the same kind of quasi-oracle property.^{7}

^{7} Künzel et al. (2017) do have some quasi-oracle type results; however, they only focus on the case where the number of control units grows much faster than the number of treated units. In this case, they show that the X-learner performs as well as an oracle who already knew the mean response function for the controls, $\mu^*_{(0)}(\cdot)$. Intriguingly, in this special case, we have $\hat{e}(x) \approx 0$ and $\hat{m}(x) \approx \hat{\mu}_{(0)}(x)$, and so the R-learner as in (11) is roughly equivalent to the X-learner procedure (8). Thus, at least qualitatively, we can interpret the result of Künzel et al. (2017) as a special case of our result in the case where the number of controls dominates the number of treated units (or vice versa).
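The local robustness of the R-loss can also be seen numerically: perturbing the outcome model by a sizable constant barely moves the R-learner's estimate, because the perturbation is nearly orthogonal to the residualized treatment $W_i - e^*(X_i)$, whereas the same perturbation shifts an X-learner-style pseudo-effect average one-for-one. The design below is our own toy example with a known propensity.

```python
import random

random.seed(2)
n = 20000
tau, e = 1.0, 0.5                        # true effect; known propensity (toy)
W = [1 if random.random() < e else 0 for _ in range(n)]
Y = [0.2 + w * tau + random.gauss(0, 0.5) for w in W]
m = 0.2 + e * tau                        # true m*(x), constant in this design

def r_learner(m_hat):
    # Closed-form R-loss minimizer over a constant effect.
    num = sum((y - m_hat) * (w - e) for y, w in zip(Y, W))
    den = sum((w - e) ** 2 for w in W)
    return num / den

delta = 0.3                              # a sizable shift in the outcome model
shift_r = r_learner(m - delta) - r_learner(m)   # ~0: orthogonal direction

# X-learner-style pseudo-effect on the treated: shifting mu0_hat by -delta
# shifts the pseudo-effect average by exactly +delta.
mu0 = 0.2
t1 = sum(y - mu0 for y, w in zip(Y, W) if w == 1) / sum(W)
t1_shifted = sum(y - (mu0 - delta) for y, w in zip(Y, W) if w == 1) / sum(W)
shift_x = t1_shifted - t1
```

The R-learner's shift is only $\delta \sum_i (W_i - e) / \sum_i (W_i - e)^2$, a mean-zero $O_P(n^{-1/2})$ quantity, while the pseudo-effect estimate absorbs the full perturbation $\delta$; this is the orthogonality argument of the preceding paragraph in miniature.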
4 Simulation Experiments
As discussed several times already, our approach to heterogeneous treatment effect estimation via the R-loss can be implemented using any method that is framed as a loss minimization problem, such as boosting, decision trees, etc. In this section, we focus on simulation experiments using the R-learner, a direct implementation of (4) based on both the lasso and boosting.

We consider the following methods for heterogeneous treatment effect estimation as baselines. The S-learner fits a single model $\hat{\mu}(x, w)$ for $\mathbb{E}[Y \mid X = x, W = w]$, and then estimates $\hat{\tau}(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0)$; the T-learner fits the functions $\hat{\mu}_{(0)}(\cdot)$ and $\hat{\mu}_{(1)}(\cdot)$ separately on the control and treated samples, and then estimates $\hat{\tau}(x) = \hat{\mu}_{(1)}(x) - \hat{\mu}_{(0)}(x)$; the X-learner and U-learner are as described in Section 1.1.^{8} In addition, for the boosting-based experiments, we consider the causal boosting algorithm proposed by Powers et al. (2018), as described in Section 4.2.

^{8} The S-, T-, X-, and U-learners are named following the nomenclature of Künzel et al. (2017).
Finally, for the lasso-based experiments, we consider an additional variant of our method, the RS-learner, which combines the spirit of the R- and S-learners by adding an additional term to the loss function:
(31) 
Heuristically, one may hope that the RS-learner is more robust, as it has a "second chance" to eliminate confounders.
In all simulations, we generate data as follows, for different choices of a distribution $\mathcal{P}_d$ indexed by dimension $d$, a noise level $\sigma$, a propensity function $e(\cdot)$, a baseline main effect $b(\cdot)$, and a treatment effect function $\tau(\cdot)$:
$$X_i \sim \mathcal{P}_d, \quad W_i \mid X_i \sim \mathrm{Bernoulli}\left( e(X_i) \right), \quad \varepsilon_i \mid X_i \sim \mathcal{N}(0, 1), \quad Y_i = b(X_i) + \left( W_i - 0.5 \right) \tau(X_i) + \sigma \varepsilon_i. \qquad (32)$$
We consider the following specific setups.
Setup A
Difficult nuisance components and an easy treatment effect function. We use the scaled Friedman (1991) function for the baseline main effect $b(\cdot)$, together with a confounded propensity $e(\cdot)$ and a simple treatment effect $\tau(\cdot)$.
Setup B
Randomized trial. Here, $e(x) = 1/2$ for all $x$, so it is possible to be accurate without explicitly controlling for confounding.
Setup C
Easy propensity score and a difficult baseline. In this setup, there is strong confounding, but the propensity score is much easier to estimate than the baseline $b(\cdot)$, and the treatment effect is constant, $\tau(x) = 1$.
Setup D
Unrelated treatment and control arms. Here, the response surfaces $\mu^*_{(0)}(\cdot)$ and $\mu^*_{(1)}(\cdot)$ are uncorrelated, and so there is no upside to learning them jointly.
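The common sampling scheme, covariates drawn from $\mathcal{P}_d$, a Bernoulli treatment given the propensity, Gaussian noise, and a treatment indicator centered at $1/2$, can be sketched as follows. The particular functions $b$, $e$, and $\tau$ below are placeholder choices of ours, not any of Setups A through D.

```python
import random
import math

random.seed(0)
d, sigma, n = 6, 1.0, 5000

# Placeholder design functions -- illustrative only, not the paper's setups.
def e(x):   return 1.0 / (1.0 + math.exp(-x[0]))   # propensity in (0, 1)
def b(x):   return max(x[0] + x[1], 0.0)           # baseline main effect
def tau(x): return 0.5 * x[0]                      # treatment effect

data = []
for _ in range(n):
    x = [random.gauss(0, 1) for _ in range(d)]     # X_i ~ P_d (Gaussian here)
    w = 1 if random.random() < e(x) else 0         # W_i | X_i ~ Bernoulli(e(X_i))
    y = b(x) + (w - 0.5) * tau(x) + sigma * random.gauss(0, 1)
    data.append((x, w, y))

treated_fraction = sum(w for _, w, _ in data) / n  # ~E[e(X)] = 0.5 by symmetry
```

Keeping the generator parameterized by $(d, \sigma, e, b, \tau)$ makes it easy to swap in each setup's functions and reuse the same evaluation harness across all methods.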
4.1 Lasso-based experiments
In this section, we compare the S-, T-, X-, and U-learners, and our R- and RS-learners, all implemented via the lasso, on simulated designs. For the S-learner, we follow Imai and Ratkovic (2013) using (7), while for the T-learner, we use (6). For the X-, U-, R-, and RS-learners, we use penalized logistic regression to estimate the propensity $\hat{e}(\cdot)$, and the lasso for all other regression estimates.

For all estimators, we run the lasso on the pairwise interactions of a natural spline basis expansion with 7 degrees of freedom on the features. We generate data points as the training set and generate a separate test set, also with data points; the reported mean-squared error is on the test set. All lasso regularization penalties are chosen via 10-fold cross-validation. For the R- and RS-learners, we use 10-fold cross-fitting on $\hat{m}(\cdot)$ and $\hat{e}(\cdot)$.