
Recursive Partitioning for Heterogeneous Causal Effects

In this paper we study the problems of estimating heterogeneity in causal effects in experimental or observational studies and conducting inference about the magnitude of the differences in treatment effects across subsets of the population. In applications, our method provides a data-driven approach to determine which subpopulations have large or small treatment effects and to test hypotheses about the differences in these effects. For experiments, our method allows researchers to identify heterogeneity in treatment effects that was not specified in a pre-analysis plan, without concern about invalidating inference due to multiple testing. In most of the literature on supervised machine learning (e.g. regression trees, random forests, LASSO, etc.), the goal is to build a model of the relationship between a unit's attributes and an observed outcome. A prominent role in these methods is played by cross-validation which compares predictions to actual outcomes in test samples, in order to select the level of complexity of the model that provides the best predictive power. Our method is closely related, but it differs in that it is tailored for predicting causal effects of a treatment rather than a unit's outcome. The challenge is that the "ground truth" for a causal effect is not observed for any individual unit: we observe the unit with the treatment, or without the treatment, but not both at the same time. Thus, it is not obvious how to use cross-validation to determine whether a causal effect has been accurately predicted. We propose several novel cross-validation criteria for this problem and demonstrate through simulations the conditions under which they perform better than standard methods for the problem of causal effects. We then apply the method to a large-scale field experiment re-ranking results on a search engine.






1 The Problem

1.1 The Set Up

We consider a setup where there are $N$ units, indexed by $i = 1, \ldots, N$. We postulate the existence of a pair of potential outcomes for each unit, $(Y_i(0), Y_i(1))$ (following the potential outcome or Rubin Causal Model [19], [11], [14]), with the unit-level causal effect defined as the difference in potential outcomes, $\tau_i = Y_i(1) - Y_i(0)$. Let $W_i \in \{0, 1\}$ be the binary indicator for the treatment, with $W_i = 0$ indicating that unit $i$ received the control treatment, and $W_i = 1$ indicating that unit $i$ received the active treatment. The realized outcome for unit $i$ is the potential outcome corresponding to the treatment received:

$$Y_i^{obs} = Y_i(W_i) = \begin{cases} Y_i(0) & \text{if } W_i = 0, \\ Y_i(1) & \text{if } W_i = 1. \end{cases}$$

Let $X_i$ be a $K$-component vector of features, covariates or pretreatment variables, known not to be affected by the treatment. Our data consist of the triple $(Y_i^{obs}, W_i, X_i)$, for $i = 1, \ldots, N$, which are regarded as an i.i.d. sample drawn from a large population. Expectations and probabilities will refer to the distribution induced by the random sampling, or by the (conditional) random assignment of the treatment. We assume that observations are exchangeable, and that there is no interference (the stable unit treatment value assumption, or SUTVA [20]). This assumption may be violated in settings where some units are connected through networks. Let $p = \mathrm{pr}(W_i = 1)$ be the marginal treatment probability, and let $e(x) = \mathrm{pr}(W_i = 1 \mid X_i = x)$ be the conditional treatment probability (the “propensity score” as defined by [17]). In a randomized experiment with constant treatment assignment probabilities, $e(x) = p$ for all values of $x$.

1.2 Unconfoundedness

Throughout the paper, we maintain the assumption of randomization conditional on the covariates, or “unconfoundedness” ([17]), formalized as:

Assumption 1. $W_i \perp (Y_i(0), Y_i(1)) \mid X_i$.


This assumption, sometimes referred to as “selection on observables” in the econometrics literature, is satisfied in a randomized experiment without conditioning on covariates, but also may be justified in observational studies if the researcher is able to observe all the variables that affect the unit’s receipt of treatment and are associated with the potential outcomes.

To simplify exposition, in the main body of the paper we maintain the stronger assumption of complete randomization, whereby $W_i \perp (Y_i(0), Y_i(1), X_i)$. Later we show that by using propensity score weighting [19], we can adapt all of the methods to the more general case of unconfoundedness.

1.3 Conditional Average Treatment Effects and Partitioning

Define the conditional average treatment effect (CATE)

$$\tau(x) \equiv \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i = x].$$

A large part of the causal inference literature (e.g. [14], [15]) is focused on estimating the population (marginal) average treatment effect $\mathbb{E}[Y_i(1) - Y_i(0)]$. The main focus of the current paper is on obtaining accurate estimates of and inferences for the conditional average treatment effect $\tau(x)$. We are interested in estimators $\hat\tau(\cdot)$ that are based on partitioning the feature space, and do not vary within the partitions.

2 Honest Inference for Population Averages

Our approach departs from conventional classification and regression trees (CART) in two fundamental ways. First, we focus on estimating conditional average treatment effects rather than predicting outcomes. This creates complications for conventional methods because we do not observe unit-level causal effects for any unit. Second, we impose a separation between constructing the partition and estimating effects within leaves of the partition, using separate samples for the two tasks, in what we refer to as “honest” estimation. We contrast “honest” estimation with “adaptive” estimation used in conventional CART, where the same data is used to build the partition and estimate leaf effects. In this section we introduce the changes induced by honest estimation in the context of the conventional prediction setting; in the next section we consider causal effects. In the discussion in this section we observe for each unit a pair of variables $(Y_i, X_i)$, with the interest in the conditional expectation $\mu(x) \equiv \mathbb{E}[Y_i \mid X_i = x]$.

2.1 Set Up

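We begin by defining key concepts and functions. First, a tree or partitioning $\Pi$ corresponds to a partitioning of the feature space $\mathbb{X}$, with $\#(\Pi)$ the number of elements in the partition. We write

$$\Pi = \{\ell_1, \ldots, \ell_{\#(\Pi)}\}, \quad \text{with } \bigcup_{j=1}^{\#(\Pi)} \ell_j = \mathbb{X}.$$

Let $\mathbb{P}$ denote the space of partitions. Let $\ell(x; \Pi)$ denote the leaf $\ell \in \Pi$ such that $x \in \ell$. Let $\mathbb{S}$ be the space of data samples from a population. Let $\pi: \mathbb{S} \mapsto \mathbb{P}$ be an algorithm that on the basis of a sample $S \in \mathbb{S}$ constructs a partition. As a very simple example, suppose the feature space is $\mathbb{X} = \{L, R\}$. In this case there are two possible partitions, $\Pi_N = \{\{L, R\}\}$ (no split), or $\Pi_S = \{\{L\}, \{R\}\}$ (full split), and so the space of trees is $\mathbb{P} = \{\Pi_N, \Pi_S\}$. Given a sample $S$, the average outcomes in the two subsamples are $\bar Y_L$ and $\bar Y_R$. A simple example of an algorithm is one that splits if the difference in average outcomes exceeds a threshold $c$:

$$\pi(S) = \begin{cases} \{\{L, R\}\} & \text{if } \bar Y_L - \bar Y_R \le c, \\ \{\{L\}, \{R\}\} & \text{if } \bar Y_L - \bar Y_R > c. \end{cases}$$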

The potential bias in leaf estimates from adaptive estimation can be seen in this simple example. While $\bar Y_L - \bar Y_R$ is in general an unbiased estimator for the difference in the population conditional means $\mu(L) - \mu(R)$, if we condition on finding that $\bar Y_L - \bar Y_R > c$ in a particular sample, we expect that $\bar Y_L - \bar Y_R$ is larger than the population analog.
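A small Monte Carlo sketch can make this selection effect concrete. It is illustrative only: the number of units per leaf, the normal noise, the threshold value, and the function name are assumptions chosen for the example, not part of the paper's simulations. The population means in the two leaves are set equal, so the true difference is zero.

```python
import random
import statistics

def simulate_differences(n_reps=2000, n_per_leaf=50, threshold=0.2, seed=0):
    """Simulate the two-leaf example with equal population means.

    Returns (all_diffs, diffs_given_split): the sample difference in average
    outcomes between leaves L and R in every replication, and the same
    difference restricted to replications where the algorithm splits.
    """
    rng = random.Random(seed)
    all_diffs, diffs_given_split = [], []
    for _ in range(n_reps):
        # Outcomes in both leaves are N(0, 1), so the true difference is 0.
        y_left = [rng.gauss(0.0, 1.0) for _ in range(n_per_leaf)]
        y_right = [rng.gauss(0.0, 1.0) for _ in range(n_per_leaf)]
        diff = statistics.fmean(y_left) - statistics.fmean(y_right)
        all_diffs.append(diff)
        if diff > threshold:  # the algorithm splits only in this case
            diffs_given_split.append(diff)
    return all_diffs, diffs_given_split

all_diffs, diffs_given_split = simulate_differences()
print("mean difference, all samples:  ", statistics.fmean(all_diffs))
print("mean difference, given a split:", statistics.fmean(diffs_given_split))
```

Across all replications the average difference is close to zero, while the average over replications that led to a split necessarily exceeds the threshold. Leaf estimates computed on the sample that chose the split inherit this upward bias.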

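Given a partition $\Pi$, define the conditional mean function $\mu(x; \Pi)$ as

$$\mu(x; \Pi) \equiv \mathbb{E}[Y_i \mid X_i \in \ell(x; \Pi)],$$

which can be viewed as a step-function approximation to $\mu(x)$. Given a sample $S$ the estimated counterpart is

$$\hat\mu(x; S, \Pi) \equiv \frac{1}{\#(\{i \in S : X_i \in \ell(x; \Pi)\})} \sum_{i \in S : X_i \in \ell(x; \Pi)} Y_i,$$

which is unbiased for $\mu(x; \Pi)$. We index this estimator by the sample because we need to be precise about which sample is used for estimation of the regression function.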
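The step-function estimator can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the partition is represented by a hypothetical function `leaf_of` that maps a feature value to a leaf label, and the toy data are made up.

```python
import statistics
from collections import defaultdict

def fit_leaf_means(sample, leaf_of):
    """Average outcome in each leaf of the partition.

    sample  : list of (x, y) pairs (the sample used for estimation)
    leaf_of : function mapping a feature value x to a leaf label
              (a hypothetical representation of the partition)
    """
    outcomes_by_leaf = defaultdict(list)
    for x, y in sample:
        outcomes_by_leaf[leaf_of(x)].append(y)
    return {leaf: statistics.fmean(ys) for leaf, ys in outcomes_by_leaf.items()}

def predict(x, leaf_means, leaf_of):
    """Evaluate the step function at x: the mean of the leaf containing x."""
    return leaf_means[leaf_of(x)]

def split_at_zero(x):
    """Illustrative partition of the real line into two leaves."""
    return "left" if x < 0 else "right"

estimation_sample = [(-1.5, 1.0), (-0.2, 3.0), (0.4, 5.0), (2.0, 9.0)]
leaf_means = fit_leaf_means(estimation_sample, split_at_zero)
print(predict(-0.7, leaf_means, split_at_zero), predict(1.3, leaf_means, split_at_zero))
```

Keeping the sample as an explicit argument mirrors the point made above: the same partition yields different estimates depending on which sample fills the leaves.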

2.2 The Honest Target

A central concern in this paper is the criterion used to compare alternative estimators; following much of the literature, we focus on mean-squared error (MSE) criteria, but we will modify these criteria in a variety of ways. For the prediction case, we adjust the MSE by $\mathbb{E}[Y_i^2]$; since this does not depend on an estimator, subtracting it does not affect how the criterion ranks estimators. Given a partition $\Pi$, define the mean squared error, where we average over a test sample $S^{te}$ and the conditional mean is estimated on an estimation sample $S^{est}$, as

$$\mathrm{MSE}_\mu(S^{te}, S^{est}, \Pi) \equiv \frac{1}{\#(S^{te})} \sum_{i \in S^{te}} \left\{ \left( Y_i - \hat\mu(X_i; S^{est}, \Pi) \right)^2 - Y_i^2 \right\}.$$

The (adjusted) expected mean squared error is the expectation of $\mathrm{MSE}_\mu(S^{te}, S^{est}, \Pi)$ over the test sample and the estimation sample:

$$\mathrm{EMSE}_\mu(\Pi) \equiv \mathbb{E}_{S^{te}, S^{est}} \left[ \mathrm{MSE}_\mu(S^{te}, S^{est}, \Pi) \right],$$

where the test and estimation samples are independent. In the algorithms we consider, we will consider a variety of estimators for the (adjusted) EMSE, all of which take the form of MSE estimators $\mathrm{MSE}_\mu(S^{te}, S^{est}, \Pi)$, evaluated at the units in sample $S^{te}$, with the estimates based on sample $S^{est}$ and the tree $\Pi$. For brevity in this paper we will henceforth omit the term “adjusted” and abuse terminology slightly by referring to these objects as MSE functions.

Our ultimate goal is to construct and assess algorithms $\pi(\cdot)$ that maximize the “honest” criterion

$$Q^H(\pi) \equiv -\mathbb{E}_{S^{te}, S^{est}, S^{tr}} \left[ \mathrm{MSE}_\mu(S^{te}, S^{est}, \pi(S^{tr})) \right].$$

Note that throughout the paper we focus on maximizing criterion functions, which typically involve the negative of mean-squared-error expressions.

2.3 The Adaptive Target

In the conventional CART approach the target is slightly different:

$$Q^C(\pi) \equiv -\mathbb{E}_{S^{te}, S^{tr}} \left[ \mathrm{MSE}_\mu(S^{te}, S^{tr}, \pi(S^{tr})) \right],$$

where the same training sample is used to construct and estimate the tree. Compared to our target $Q^H(\pi)$ the difference is that in our approach different samples $S^{tr}$ and $S^{est}$ are used for construction of the tree and estimation of the conditional means respectively. We refer to the conventional CART approach as “adaptive,” and our approach as “honest.”

In practice there will be costs and benefits of the honest approach relative to the adaptive approach. The cost is sample size; given a data set, putting some data in the estimation sample leaves fewer units for the training data set. The advantage of honest estimation is that it avoids a problem of adaptive estimation, which is that spurious extreme values of $Y_i$ are likely to be placed into the same leaf as other extreme values by the algorithm $\pi(\cdot)$, and thus the sample means (in sample $S^{tr}$) of the elements of $\pi(S^{tr})$ are more extreme than they would be in an independent sample.

2.4 The Implementation of CART

There are two distinct parts of the conventional CART algorithm, initial tree building and cross-validation to select a complexity parameter used for pruning. Each part of the algorithm relies on a criterion function based on mean-squared error. In this paper we will take as given the overall structure of the CART algorithm (e.g., [4], [9]), and our focus will be on modifying the criteria.

In the tree-building phase, CART recursively partitions the observations of the training sample. For each leaf, the algorithm evaluates all candidate splits of that leaf (which induce alternative partitions $\Pi$) using a “splitting” criterion that we refer to as the “in-sample” goodness of fit criterion $-\mathrm{MSE}_\mu(S^{tr}, S^{tr}, \Pi)$. It is well-understood that the conventional criterion leads to “over-fitting,” a problem that is solved by cross-validation to select a penalty on tree depth. The in-sample goodness of fit criterion will always improve with additional splits, even though additional refinements of a partition $\Pi$ might in fact increase the expected mean squared error, especially when the leaf sizes become small. The reason is that the criterion ignores the fact that smaller leaves lead to higher-variance estimates of leaf means.

To account for this factor, the conventional approach to avoiding “overfitting” is to add a penalty term to the criterion that is equal to a constant times the number of splits, so that essentially we only consider splits where the improvement in a goodness-of-fit criterion is above some threshold. The penalty term is chosen to maximize a goodness of fit criterion in cross-validation samples. In the conventional cross-validation the training sample is repeatedly split into two subsamples, the $S^{tr,tr}$ sample that is used to build a new tree as well as estimate the conditional means and the $S^{tr,cv}$ sample that is used to evaluate the estimates. We “prune” the tree using a penalty parameter that represents the cost of a leaf. We choose the optimal penalty parameter by evaluating the trees associated with each value of the penalty parameter. The goodness of fit criterion for cross-validation can be written as $-\mathrm{MSE}_\mu(S^{tr,cv}, S^{tr,tr}, \Pi)$. Note that the cross-validation criterion directly addresses the issue we highlighted with the in-sample goodness of fit criterion, since $S^{tr,cv}$ is independent of $S^{tr,tr}$, and thus too-extreme estimates of leaf means will be penalized. The issue that smaller leaves lead to noisier estimates of leaf means is implicitly incorporated by the fact that a smaller leaf penalty will lead to deeper trees and thus smaller leaves, and the noisier estimates will lead to larger average MSE across the cross-validation samples.

2.5 Honest Splitting

In our honest estimation algorithm, we modify CART in two ways. First, we use an independent sample $S^{est}$ instead of $S^{tr}$ to estimate leaf means. Second (and closely related), we modify our splitting and cross-validation criteria to incorporate the fact that we will generate unbiased estimates using $S^{est}$ for leaf estimation (eliminating one aspect of over-fitting), where $S^{est}$ is treated as a random variable in the tree building phase. In addition, we explicitly incorporate the fact that finer partitions generate greater variance in leaf estimates.

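To begin developing our criteria, let us expand $\mathrm{EMSE}_\mu(\Pi)$:

$$-\mathrm{EMSE}_\mu(\Pi) = -\mathbb{E}_{(Y_i, X_i), S^{est}} \left[ (Y_i - \mu(X_i; \Pi))^2 - Y_i^2 \right] - \mathbb{E}_{X_i, S^{est}} \left[ \left( \hat\mu(X_i; S^{est}, \Pi) - \mu(X_i; \Pi) \right)^2 \right]$$

$$= \mathbb{E}_{X_i} \left[ \mu^2(X_i; \Pi) \right] - \mathbb{E}_{S^{est}, X_i} \left[ \mathbb{V}\left( \hat\mu(X_i; S^{est}, \Pi) \right) \right],$$

where we exploit the equality $\mathbb{E}_{S}[\hat\mu(x; S, \Pi)] = \mu(x; \Pi)$.

We wish to estimate $-\mathrm{EMSE}_\mu(\Pi)$ on the basis of the training sample $S^{tr}$ and knowledge of the sample size of the estimation sample $N^{est}$. To construct an estimator for the second term, observe that within each leaf of the tree there is an unbiased estimator for the variance of the estimated mean in that leaf. Specifically, to estimate the variance of $\hat\mu(x; S^{est}, \Pi)$ on the training sample we can use

$$\hat{\mathbb{V}}\left( \hat\mu(x; S^{est}, \Pi) \right) \equiv \frac{S^2_{S^{tr}}(\ell(x; \Pi))}{N^{est}(\ell(x; \Pi))},$$

where $S^2_{S^{tr}}(\ell)$ is the within-leaf variance, to estimate the variance. We then weight this by the leaf shares to estimate the expected variance. Assuming the leaf shares are approximately the same in the estimation and training sample, we can approximate this variance estimator by

$$\hat{\mathbb{E}}\left[ \mathbb{V}\left( \hat\mu(X_i; S^{est}, \Pi) \right) \right] \equiv \frac{1}{N^{est}} \sum_{\ell \in \Pi} S^2_{S^{tr}}(\ell).$$

To estimate the average of the squared outcome $\mu^2(x; \Pi)$ (the first term of the target criterion), we can use the square of the estimated means in the training sample $\hat\mu^2(x; S^{tr}, \Pi)$, minus an estimate of its variance,

$$\hat{\mathbb{E}}\left[ \mu^2(x; \Pi) \right] = \hat\mu^2(x; S^{tr}, \Pi) - \frac{S^2_{S^{tr}}(\ell(x; \Pi))}{N^{tr}(\ell(x; \Pi))}.$$

Combining these estimators leads to the following unbiased estimator for $-\mathrm{EMSE}_\mu(\Pi)$, denoted $-\widehat{\mathrm{EMSE}}_\mu(S^{tr}, N^{est}, \Pi)$:

$$-\widehat{\mathrm{EMSE}}_\mu(S^{tr}, N^{est}, \Pi) \equiv \frac{1}{N^{tr}} \sum_{i \in S^{tr}} \hat\mu^2(X_i; S^{tr}, \Pi) - \left( \frac{1}{N^{tr}} + \frac{1}{N^{est}} \right) \sum_{\ell \in \Pi} S^2_{S^{tr}}(\ell).$$

In practice we use the same sample size for the estimation sample and the training sample, so we use as the estimator

$$-\widehat{\mathrm{EMSE}}_\mu(S^{tr}, \Pi) \equiv \frac{1}{N^{tr}} \sum_{i \in S^{tr}} \hat\mu^2(X_i; S^{tr}, \Pi) - \frac{2}{N^{tr}} \sum_{\ell \in \Pi} S^2_{S^{tr}}(\ell).$$

Comparing this to the criterion used in the conventional CART algorithm, which can be written as

$$-\mathrm{MSE}_\mu(S^{tr}, S^{tr}, \Pi) = \frac{1}{N^{tr}} \sum_{i \in S^{tr}} \hat\mu^2(X_i; S^{tr}, \Pi),$$

the difference comes from the terms involving the variance. In the prediction setting the adjustment makes very little difference. Because of the form of the within-leaf sample variances, it follows that the gain from a particular split according to the unadjusted criterion $-\mathrm{MSE}_\mu(S^{tr}, S^{tr}, \Pi)$ is proportional to the gain based on $-\widehat{\mathrm{EMSE}}_\mu(S^{tr}, \Pi)$, with the constant of proportionality a function of the leaf size. Thus, in contrast to the treatment effect case discussed below, the variance adjustment does not matter much here.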
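The two splitting criteria can be compared on a toy training sample. The sketch below assumes the honest criterion is the average squared leaf mean minus (1/N_tr + 1/N_est) times the sum of within-leaf sample variances, and that the conventional criterion is the average squared leaf mean alone. The function names and the `leaf_of` representation of a partition are hypothetical.

```python
import statistics
from collections import defaultdict

def outcomes_by_leaf(sample, leaf_of):
    """Group the outcomes of (x, y) pairs by the leaf containing x."""
    groups = defaultdict(list)
    for x, y in sample:
        groups[leaf_of(x)].append(y)
    return groups

def cart_splitting_criterion(train, leaf_of):
    """Conventional in-sample criterion: average of squared leaf means."""
    groups = outcomes_by_leaf(train, leaf_of)
    total = sum(len(ys) * statistics.fmean(ys) ** 2 for ys in groups.values())
    return total / len(train)

def honest_splitting_criterion(train, leaf_of, n_est):
    """Honest criterion: fit term minus a penalty on within-leaf variance.

    Every leaf needs at least two training units so that the sample
    variance is defined.
    """
    groups = outcomes_by_leaf(train, leaf_of)
    fit = cart_splitting_criterion(train, leaf_of)
    variance_sum = sum(statistics.variance(ys) for ys in groups.values())
    return fit - (1.0 / len(train) + 1.0 / n_est) * variance_sum

def by_covariate(x):
    """Illustrative partition: one leaf per value of a binary covariate."""
    return x

train = [(0, 1.0), (0, 3.0), (1, 5.0), (1, 9.0)]
print(cart_splitting_criterion(train, by_covariate))
print(honest_splitting_criterion(train, by_covariate, n_est=len(train)))
```

The honest value is always weakly below the conventional one, and the gap grows as leaves become noisier or the samples become smaller.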

2.6 Honest Cross-validation

Even though $-\widehat{\mathrm{EMSE}}_\mu(S^{tr}, \Pi)$ is approximately unbiased as an estimator of our ideal criterion $-\mathrm{EMSE}_\mu(\Pi)$ for a fixed $\Pi$, it is not unbiased when we use it repeatedly to evaluate splits using recursive partitioning on the training data $S^{tr}$. The reason is that initial splits tend to group together observations with similar, extreme outcomes. So after the training data has been divided once, the sample variance of observations in the training data within a given leaf is on average lower than the sample variance would be in a new, independent sample. Thus, $-\widehat{\mathrm{EMSE}}_\mu(S^{tr}, \Pi)$ is likely to overstate goodness of fit as we grow a deeper and deeper tree, implying that cross-validation can still play an important role with our honest estimation approach, though perhaps less so than in the conventional CART.

Because the conventional CART cross-validation criterion does not account for honest estimation we consider the analogue of our unbiased estimate of the criterion, which accounts for honest estimation by evaluating a partition $\Pi$ using only outcomes for units from the cross-validation sample $S^{tr,cv}$:

$$-\widehat{\mathrm{EMSE}}_\mu(S^{tr,cv}, N^{est}, \Pi).$$

This estimator for the honest criterion is unbiased, although it may have higher variance than $\mathrm{MSE}_\mu(S^{tr,cv}, S^{tr,tr}, \Pi)$ due to the small sample size of the cross-validation sample.

3 Honest Inference for Treatment Effects

In this section we change the focus to estimating conditional average treatment effects instead of estimating conditional population means. We refer to the estimators developed in this section as “Causal Tree” (CT) estimators. The setting with treatment effects creates some specific problems because we do not observe the value of the treatment effect whose conditional mean we wish to estimate. This complicates the calculation of the criteria we introduced in the previous section. However, a key point of this paper is that we can estimate these criteria and use those estimates for splitting and cross-validation.

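We now observe in each sample the triple $(Y_i^{obs}, X_i, W_i)$. For a sample $S$ let $S_{treat}$ and $S_{control}$ denote the subsamples of treated and control units respectively, with cardinality $N_{treat}$ and $N_{control}$ respectively, and let $p = N_{treat}/N$ be the share of treated units. The concept of a tree remains the same as in the previous section. Given a tree $\Pi$, define for all $x$ and both treatment levels $w$ the population average outcome

$$\mu(w, x; \Pi) \equiv \mathbb{E}[Y_i(w) \mid X_i \in \ell(x; \Pi)],$$

and the average causal effect

$$\tau(x; \Pi) \equiv \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i \in \ell(x; \Pi)] = \mu(1, x; \Pi) - \mu(0, x; \Pi).$$

The estimated counterparts are

$$\hat\mu(w, x; S, \Pi) \equiv \frac{1}{\#(\{i \in S_w : X_i \in \ell(x; \Pi)\})} \sum_{i \in S_w : X_i \in \ell(x; \Pi)} Y_i^{obs}, \qquad \hat\tau(x; S, \Pi) \equiv \hat\mu(1, x; S, \Pi) - \hat\mu(0, x; S, \Pi),$$

where $S_1 = S_{treat}$ and $S_0 = S_{control}$.

Define the mean-squared error for treatment effects as

$$\mathrm{MSE}_\tau(S^{te}, S^{est}, \Pi) \equiv \frac{1}{\#(S^{te})} \sum_{i \in S^{te}} \left\{ \left( \tau_i - \hat\tau(X_i; S^{est}, \Pi) \right)^2 - \tau_i^2 \right\},$$

and define $\mathrm{EMSE}_\tau(\Pi)$ to be its expectation over the estimation and test samples,

$$\mathrm{EMSE}_\tau(\Pi) \equiv \mathbb{E}_{S^{te}, S^{est}} \left[ \mathrm{MSE}_\tau(S^{te}, S^{est}, \Pi) \right].$$

A key challenge is that the workhorse mean-squared error function $\mathrm{MSE}_\tau(S^{te}, S^{est}, \Pi)$ is infeasible, because we do not observe the $\tau_i$. However, we show below that we can estimate it.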

3.1 Modifying Conventional CART for Treatment Effects

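Consider first modifying conventional (adaptive) CART to estimate heterogeneous treatment effects. Note that in the prediction case, using the fact that $\hat\mu$ is constant within each leaf, we can write

$$\mathrm{MSE}_\mu(S^{te}, S^{est}, \Pi) = -\frac{2}{N^{te}} \sum_{i \in S^{te}} \hat\mu(X_i; S^{te}, \Pi) \cdot \hat\mu(X_i; S^{est}, \Pi) + \frac{1}{N^{te}} \sum_{i \in S^{te}} \hat\mu^2(X_i; S^{est}, \Pi).$$

In the treatment effect case we can use the fact that

$$\mathbb{E}_{S^{te}}\left[ \tau_i \mid i \in S^{te} : X_i \in \ell(x; \Pi) \right] = \mathbb{E}_{S^{te}}\left[ \hat\tau(x; S^{te}, \Pi) \right]$$

to construct an unbiased estimator of $\mathrm{MSE}_\tau(S^{te}, S^{est}, \Pi)$:

$$\widehat{\mathrm{MSE}}_\tau(S^{te}, S^{est}, \Pi) \equiv -\frac{2}{N^{te}} \sum_{i \in S^{te}} \hat\tau(X_i; S^{te}, \Pi) \cdot \hat\tau(X_i; S^{est}, \Pi) + \frac{1}{N^{te}} \sum_{i \in S^{te}} \hat\tau^2(X_i; S^{est}, \Pi).$$

This leads us to propose, by analogy to CART’s in-sample mean-squared error criterion $-\mathrm{MSE}_\mu(S^{tr}, S^{tr}, \Pi)$,

$$-\widehat{\mathrm{MSE}}_\tau(S^{tr}, S^{tr}, \Pi) = \frac{1}{N^{tr}} \sum_{i \in S^{tr}} \hat\tau^2(X_i; S^{tr}, \Pi),$$

as an estimator for the infeasible in-sample goodness of fit criterion.

For cross-validation we used in the prediction case $-\mathrm{MSE}_\mu(S^{tr,cv}, S^{tr,tr}, \Pi)$. Again, the treatment effect analog is infeasible, but we can use an unbiased estimate of it, which leads to $-\widehat{\mathrm{MSE}}_\tau(S^{tr,cv}, S^{tr,tr}, \Pi)$.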
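As a sketch of how the feasible estimator might be computed, the code below assumes it takes the form of minus twice the average product of test-sample and estimation-sample leaf effects, plus the average squared estimation-sample leaf effect. Units are `(x, w, y)` triples; `leaf_of` and the function names are hypothetical.

```python
import statistics
from collections import defaultdict

def leaf_effects(sample, leaf_of):
    """Difference in treated and control mean outcomes within each leaf.

    Every leaf must contain at least one treated and one control unit.
    """
    treated, control = defaultdict(list), defaultdict(list)
    for x, w, y in sample:
        (treated if w == 1 else control)[leaf_of(x)].append(y)
    return {
        leaf: statistics.fmean(treated[leaf]) - statistics.fmean(control[leaf])
        for leaf in treated.keys() & control.keys()
    }

def mse_tau_hat(test, est, leaf_of):
    """Feasible estimate of the treatment-effect MSE (up to the tau_i^2 term)."""
    tau_test = leaf_effects(test, leaf_of)
    tau_est = leaf_effects(est, leaf_of)
    n_te = len(test)
    cross = sum(tau_test[leaf_of(x)] * tau_est[leaf_of(x)] for x, _, _ in test)
    square = sum(tau_est[leaf_of(x)] ** 2 for x, _, _ in test)
    return -2.0 * cross / n_te + square / n_te

def by_covariate(x):
    """Illustrative partition: one leaf per value of a binary covariate."""
    return x

train = [(0, 1, 3.0), (0, 1, 5.0), (0, 0, 1.0), (0, 0, 1.0),
         (1, 1, 2.0), (1, 1, 2.0), (1, 0, 0.0), (1, 0, 4.0)]
# Adaptive splitting criterion: the same sample plays both roles.
print(-mse_tau_hat(train, train, by_covariate))
```

When the same sample is passed twice, the expression collapses to the average squared leaf effect, which is the adaptive in-sample criterion.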

3.2 Modifying the Honest Approach

The honest approach described in the previous section for prediction problems also needs to be modified for the treatment effect setting. Using the same expansion as before, now applied to the treatment effect setting, we find

$$-\mathrm{EMSE}_\tau(\Pi) = \mathbb{E}_{X_i}\left[ \tau^2(X_i; \Pi) \right] - \mathbb{E}_{S^{est}, X_i}\left[ \mathbb{V}\left( \hat\tau(X_i; S^{est}, \Pi) \right) \right].$$

For splitting we can estimate both components of this expectation using only the training sample. This leads to an estimator for the infeasible criterion that depends only on $S^{tr}$:

$$-\widehat{\mathrm{EMSE}}_\tau(S^{tr}, N^{est}, \Pi) \equiv \frac{1}{N^{tr}} \sum_{i \in S^{tr}} \hat\tau^2(X_i; S^{tr}, \Pi) - \left( \frac{1}{N^{tr}} + \frac{1}{N^{est}} \right) \sum_{\ell \in \Pi} \left( \frac{S^2_{S^{tr}_{treat}}(\ell)}{p} + \frac{S^2_{S^{tr}_{control}}(\ell)}{1 - p} \right).$$

For cross-validation we use the same expression, now with the cross-validation sample: $-\widehat{\mathrm{EMSE}}_\tau(S^{tr,cv}, N^{est}, \Pi)$.

These expressions are directly analogous to the criteria we proposed for the honest version of CART in the prediction case. The criteria reward a partition for finding strong heterogeneity in treatment effects, and penalize a partition that creates variance in leaf estimates. One difference with the prediction case, however, is that in the prediction case, the two terms are proportional; whereas for the treatment effect case they are not. It is possible to reduce the variance of a treatment effect estimator by introducing a split, even if both child leaves have the same average treatment effect, if a covariate affects the mean outcome but not treatment effects. In such a case, the split results in more homogeneous leaves, and thus lower-variance estimates of the means of the treatment group and control group outcomes. Thus, the distinction between the adaptive and honest splitting criteria will be more pronounced in this case.

The cross-validation criterion estimates treatment effects within leaves using the $S^{tr,cv}$ sample rather than $S^{tr,tr}$, to account for the fact that leaf estimates will subsequently be constructed using an estimation sample that is independent of the training sample.
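The honest criterion for treatment effects can be sketched as follows. The code assumes the criterion is the average squared leaf effect minus (1/N_tr + 1/N_est) times the sum over leaves of the treated variance divided by p plus the control variance divided by (1 - p), with p taken as the treated share in the training sample. Names and data are illustrative.

```python
import statistics
from collections import defaultdict

def honest_causal_criterion(train, leaf_of, n_est):
    """Honest splitting criterion for treatment effects on (x, w, y) triples.

    Each leaf needs at least two treated and two control units so that
    both within-leaf sample variances are defined.
    """
    treated, control = defaultdict(list), defaultdict(list)
    for x, w, y in train:
        (treated if w == 1 else control)[leaf_of(x)].append(y)
    n_tr = len(train)
    p = sum(len(ys) for ys in treated.values()) / n_tr  # share of treated units
    fit = 0.0
    variance_sum = 0.0
    for leaf in treated:
        y1, y0 = treated[leaf], control[leaf]
        tau_hat = statistics.fmean(y1) - statistics.fmean(y0)
        fit += (len(y1) + len(y0)) * tau_hat ** 2
        variance_sum += statistics.variance(y1) / p + statistics.variance(y0) / (1.0 - p)
    return fit / n_tr - (1.0 / n_tr + 1.0 / n_est) * variance_sum

def by_covariate(x):
    """Candidate split: one leaf per value of a binary covariate."""
    return x

def single_leaf(x):
    """No split: all units in one leaf."""
    return 0

train = [(0, 1, 3.0), (0, 1, 5.0), (0, 0, 1.0), (0, 0, 1.0),
         (1, 1, 2.0), (1, 1, 2.0), (1, 0, 0.0), (1, 0, 4.0)]
print("with split:   ", honest_causal_criterion(train, by_covariate, n_est=8))
print("without split:", honest_causal_criterion(train, single_leaf, n_est=8))
```

In this toy sample the average squared leaf effect alone favors the split, while the honest criterion favors leaving the leaf intact because the variance penalty outweighs the apparent heterogeneity.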

4 Four Partitioning Estimators for Causal Effects

In this section we briefly summarize our CT estimator, and then describe three alternative types of estimators. We compare CT to the alternatives theoretically and through simulations. For each of the four types there is an adaptive version and an honest version, where the latter takes into account that estimation will be done on a sample separate from the sample used for constructing the partition, leading to a total of eight estimators. Note that further variations are possible; for example, one could use adaptive splitting and cross-validation methods to construct a tree, but still perform honest estimation on a separate sample. We do not consider those variations in this paper.

4.1 Causal Trees (CT)

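The discussion above developed our preferred estimator, Causal Trees. To summarize, for the adaptive version of causal trees, denoted CT-A, we use for splitting the objective $-\widehat{\mathrm{MSE}}_\tau(S^{tr}, S^{tr}, \Pi)$. For cross-validation we use the same objective function, but evaluated at the samples $S^{tr,cv}$ and $S^{tr,tr}$, namely $-\widehat{\mathrm{MSE}}_\tau(S^{tr,cv}, S^{tr,tr}, \Pi)$. For the honest version, CT-H, the splitting objective function is $-\widehat{\mathrm{EMSE}}_\tau(S^{tr}, N^{est}, \Pi)$. For cross-validation we use the same objective function, but now evaluated at the cross-validation sample, $-\widehat{\mathrm{EMSE}}_\tau(S^{tr,cv}, N^{est}, \Pi)$.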

4.2 Transformed Outcome Trees (TOT)

Our first alternative method is based on the insight that by using a transformed version of the outcome, $Y_i^* = Y_i \cdot (W_i - p) / (p \cdot (1 - p))$, it is possible to use off-the-shelf regression tree methods to focus splitting and cross-validation on treatment effects rather than outcomes. Similar approaches are used in [2], [6], [22], and [29]. Because $\mathbb{E}[Y_i^* \mid X_i = x] = \tau(x)$, off-the-shelf CART methods can be used directly, where estimates of the sample average of $Y_i^*$ within each leaf can be interpreted as estimates of treatment effects. This ease of application is the key attraction of this method. The main drawback (relative to CT-A) is that in general it is not efficient because it does not use the information in the treatment indicator beyond the construction of the transformed outcome. For example, the sample average in $S^{tr}$ of $Y_i^*$ within a given leaf will only be equal to $\hat\tau(x; S^{tr}, \Pi)$ if the fraction of treated observations within the leaf is exactly equal to $p$. Since this method is primarily considered as a benchmark, in simulations we focus only on an adaptive version that can use existing learning methods entirely off-the-shelf. The adaptive version of the transformed outcome tree estimator we consider, TOT-A, uses the conventional CART algorithm with the transformed outcome replacing the original outcome. The honest version, TOT-H, uses the same splitting and cross-validation criteria, so that it builds the same trees; it differs only in that a separate estimation sample is used to construct the leaf estimates. The treatment effect estimator within a leaf is the same as the adaptive method, that is, the sample mean of $Y_i^*$ within the leaf.
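The dependence on the within-leaf treated share can be checked numerically. The sketch below assumes the transformed outcome is Y times (W - p) divided by p(1 - p), with p = 0.5 chosen for the example. The function names and the two toy leaves are made up.

```python
import statistics

def transformed_outcome(y, w, p):
    """Transformed outcome: y * (w - p) / (p * (1 - p))."""
    return y * (w - p) / (p * (1.0 - p))

def leaf_summary(leaf_units, p):
    """Return (mean transformed outcome, treated mean minus control mean).

    leaf_units is a list of (w, y) pairs for the units in one leaf.
    """
    y_star = [transformed_outcome(y, w, p) for w, y in leaf_units]
    treated = [y for w, y in leaf_units if w == 1]
    control = [y for w, y in leaf_units if w == 0]
    return statistics.fmean(y_star), statistics.fmean(treated) - statistics.fmean(control)

# Leaf whose treated share equals p = 0.5: the two estimates agree.
balanced = [(1, 3.0), (1, 5.0), (0, 1.0), (0, 1.0)]
# Leaf whose treated share is 0.75: the two estimates differ.
unbalanced = [(1, 3.0), (1, 5.0), (1, 4.0), (0, 1.0)]

print(leaf_summary(balanced, p=0.5))
print(leaf_summary(unbalanced, p=0.5))
```

In the unbalanced leaf the mean transformed outcome moves away from the difference in means even though the treated and control averages are the same as in the balanced leaf. This is the extra noise that the text attributes to ignoring the realized treatment shares.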

4.3 Fit-based Trees (F)

We consider two additional alternative methods for constructing trees, based on suggestions in the literature. In the first of these alternatives the choice of which feature to split on, and at what value of the feature to split, is based on comparisons of the goodness-of-fit of the outcome rather than the treatment effect. In standard CART of course goodness-of-fit of outcomes is also the split criterion, but here we estimate a model for treatment effects within each leaf. Specifically, we have a linear model with an intercept and an indicator for the treatment as the regressors, rather than only an intercept as in standard CART. This approach is used in [30], who consider building general models at the leaves of the trees. Treatment effect estimation is a special case of their framework. [30] propose using statistical tests based on improvements in goodness-of-fit to determine when to stop growing the tree, rather than relying on cross-validation, but for ease of comparison to CART, in this paper we will stay closer to traditional CART in terms of growing deep trees and pruning them. We modify the mean-squared error function:

$$\mathrm{MSE}_{\mu,W}(S^{te}, S^{est}, \Pi) \equiv \frac{1}{\#(S^{te})} \sum_{i \in S^{te}} \left\{ \left( Y_i - \hat\mu(W_i, X_i; S^{est}, \Pi) \right)^2 - Y_i^2 \right\}.$$

For the adaptive version F-A we follow conventional CART, using the criterion $-\mathrm{MSE}_{\mu,W}(S^{tr}, S^{tr}, \Pi)$ in place of $-\mathrm{MSE}_\mu(S^{tr}, S^{tr}, \Pi)$ for splitting, and the analog of $-\mathrm{MSE}_\mu(S^{tr,cv}, S^{tr,tr}, \Pi)$ with $\mathrm{MSE}_{\mu,W}$ in place of $\mathrm{MSE}_\mu$ for cross-validation. For the honest version we use the analogs of $-\widehat{\mathrm{EMSE}}_\mu(S^{tr}, N^{est}, \Pi)$ and $-\widehat{\mathrm{EMSE}}_\mu(S^{tr,cv}, N^{est}, \Pi)$, with $\mathrm{MSE}_{\mu,W}$ in place of $\mathrm{MSE}_\mu$, for splitting and cross-validation. Similar to the prediction case, the variance term in the honest splitting criterion does not make much of a difference for the choice of splits. An advantage of the fit-based tree approach is that it is a straightforward extension of conventional CART methods. In particular, the mean-squared error criterion is feasible, since $Y_i$ is observed. To highlight the disadvantages of the F approach, consider a case where two splits improve the fit to an equal degree. In one case, the split leads to variation in average treatment effects, and in the other case it does not. The first split would be better from the perspective of estimating heterogeneous treatment effects, but the fit criterion would view the two splits as equally attractive.

4.4 Squared T-statistic Trees (TS)

For the last estimator we look for splits with the largest value for the square of the t-statistic for testing the null hypothesis that the average treatment effect is the same in the two potential leaves. This estimator was proposed by [21]. If the two leaves are denoted $L$ (Left) and $R$ (Right), with $\bar Y_{Lw}$, $\bar Y_{Rw}$ the average outcomes and $N_{Lw}$, $N_{Rw}$ the numbers of units with treatment status $w$ in each leaf, the square of the t-statistic is

$$T^2 \equiv \frac{\left( (\bar Y_{L1} - \bar Y_{L0}) - (\bar Y_{R1} - \bar Y_{R0}) \right)^2}{S^2/N_{L1} + S^2/N_{L0} + S^2/N_{R1} + S^2/N_{R0}},$$

where $S^2$ is the conditional sample variance given the split. At each leaf, successive splits are determined by selecting the split that maximizes $T^2$. The concern with this criterion is that it places no value on splits that improve the fit. While such splits do not deserve as much weight as the fit criterion puts on them, they do have some value.
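A squared t-statistic of this kind can be sketched as below. Two choices here are assumptions rather than details taken from the text: the statistic is the squared difference in leaf-level treatment effect estimates divided by an estimate of its variance, and the conditional sample variance is computed as a pooled within-cell variance over the four leaf-by-treatment cells. The function name is hypothetical.

```python
import statistics

def squared_t_statistic(left, right):
    """Squared t-statistic for equal treatment effects in two leaves.

    left, right : lists of (w, y) pairs for the units in each candidate leaf.
    Each of the four leaf-by-treatment cells must be non-empty, and there
    must be more units than cells.
    """
    cells = {
        (name, w): [y for wi, y in units if wi == w]
        for name, units in (("L", left), ("R", right))
        for w in (0, 1)
    }
    n_total = sum(len(ys) for ys in cells.values())
    # Pooled within-cell variance (assumed reading of the variance term).
    sum_sq = sum(
        sum((y - statistics.fmean(ys)) ** 2 for y in ys) for ys in cells.values()
    )
    s2 = sum_sq / (n_total - len(cells))
    tau_left = statistics.fmean(cells[("L", 1)]) - statistics.fmean(cells[("L", 0)])
    tau_right = statistics.fmean(cells[("R", 1)]) - statistics.fmean(cells[("R", 0)])
    variance = s2 * sum(1.0 / len(ys) for ys in cells.values())
    return (tau_left - tau_right) ** 2 / variance

left = [(1, 3.0), (1, 5.0), (0, 1.0), (0, 1.0)]
right = [(1, 2.0), (1, 2.0), (0, 0.0), (0, 4.0)]
print(squared_t_statistic(left, right))
```

A tree builder using this rule would evaluate the function for every candidate split of a leaf and keep the split with the largest value.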

Both the adaptive and honest versions of the TS approach use $T^2$ as the splitting criterion. For cross-validation and pruning, it is less obvious how to proceed. [30] suggests that when using a statistical test for splitting, if it is desirable in an application to grow deep trees and then cross-validate to determine depth, then one can use a standard goodness of fit measure for pruning and cross-validation. However, this could undermine the key advantage of TS, to focus on heterogeneous treatment effects. For this reason, we instead propose to use the CT-A and CT-H criteria for cross-validation for TS-A and TS-H, respectively.

4.5 Comparison of the Causal Trees, the Fit Criterion, and the Squared t-statistic Criterion

It is useful to compare our proposed criterion to the F and TS criteria in a simple setting to gain insight into the relative merits of the three approaches. We do so here focusing on a decision whether to proceed with a single possible split, based on a binary covariate $X_i \in \{L, R\}$. Let $\Pi_N$ and $\Pi_S$ denote the trees without and with the split, and let $\bar Y_w$, $\bar Y_{Lw}$ and $\bar Y_{Rw}$ denote the average outcomes for units with treatment status $W_i = w$. Let $N_w$, $N_{Lw}$, and $N_{Rw}$ be the sample sizes for the corresponding subsamples. Let $S^2$ be the sample variance of the outcomes given a split, and let $\tilde S^2$ be the sample variance without a split. Define the squared t-statistics for testing that the average outcomes for control (treated) units in both leaves are identical,

$$T_0^2 \equiv \frac{(\bar Y_{L0} - \bar Y_{R0})^2}{S^2/N_{L0} + S^2/N_{R0}}, \qquad T_1^2 \equiv \frac{(\bar Y_{L1} - \bar Y_{R1})^2}{S^2/N_{L1} + S^2/N_{R1}}.$$

Then we can write the improvement in goodness of fit from splitting the single leaf into two leaves as

Ignoring degrees-of-freedom corrections, the change in our proposed criterion for the honest version of the causal tree in this simple setting can be written as a combination of the F and TS criteria:

Our criterion focuses primarily on $T^2$. Unlike the TS approach, however, it incorporates the benefits of splits due to improvement in the fit.

5 Inference

Given the estimated conditional average treatment effect we also would like to do inference. Once constructed, the tree is a function of covariates, and if we use a distinct sample to conduct inference, then the problem reduces to that of estimating treatment effects in each member of a partition of the covariate space. For this problem, standard approaches are therefore valid for the estimates obtained via honest estimation, and in particular, no assumptions about model complexity are required. For the adaptive methods standard approaches to confidence intervals are not generally valid for the reasons discussed above, and below we document through simulations that this can be important in practice.
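As one example of a standard approach, a normal-approximation confidence interval for the treatment effect in a single leaf can be computed from the estimation sample alone. The sketch assumes the usual variance estimate for a difference in means (treated variance over treated count plus control variance over control count). The function name and data are illustrative.

```python
import math
import statistics

def leaf_confidence_interval(treated, control, level=0.90):
    """Normal-approximation interval for a leaf's average treatment effect.

    treated, control : outcomes of estimation-sample units in the leaf,
    by treatment status (at least two units in each group).
    """
    tau_hat = statistics.fmean(treated) - statistics.fmean(control)
    std_err = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    # Two-sided critical value of the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    return tau_hat - z * std_err, tau_hat + z * std_err

treated = [3.0, 5.0, 4.0, 6.0]
control = [1.0, 1.0, 2.0, 0.0]
print(leaf_confidence_interval(treated, control))
```

The interval is only meaningful under honest estimation, where the units passed in played no role in choosing the leaf boundaries.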

6 A Simulation Study

To assess the relative performance of the proposed algorithms we carried out a small simulation study with three distinct designs. In Table 1 we report a number of summary statistics from the simulations. We report averages; results for medians are similar. We report results for $N^{tr} = N^{est}$ with either 500 or 1000 observations. When comparing adaptive to honest approaches, we report the ratio of the $\mathrm{MSE}_\tau$ for adaptive estimation with $N^{tr} = 1000$ to $\mathrm{MSE}_\tau$ for honest estimation with $N^{tr} = N^{est} = 500$, in order to highlight the tradeoff between sample size and bias reduction that arises with honest estimation. We evaluate $\mathrm{MSE}_\tau$ using a large, separate test sample in order to minimize the sampling variance in our simulation results.

In all designs, the marginal treatment probability is . denotes the number of features. In each design, we have a model for the mean effect and for the treatment effect. Then, the potential outcomes are written

where , and the are independent of and one another, and . The designs are summarized as follows:

In each design, there are some covariates that affect both treatment effects and mean outcomes; some covariates that enter the mean outcome but not the treatment effect; and some covariates that do not affect outcomes at all (“noise” covariates). Design 1 does not have noise covariates. In Designs 2 and 3, the first few covariates enter the treatment effect, but only when their signs are positive, while they affect the mean outcome throughout their range. Different criteria will thus lead to different optimal splits, even within a covariate; F will focus more on splits when the covariates are negative.

The first panel of Table 1 compares the number of leaves in different designs and different sample sizes. Recalling that TOT-A and TOT-H have the same splitting method, we see that it tends to build shallow trees. The failure to control for the realized value of $W_i$ leads to additional noise in estimates, which tends to lead to aggressive pruning. For the other estimators, the adaptive versions lead to shallower trees than the honest versions, as the honest versions correct for overfitting, and the main cost of small leaf size is high variance in leaf estimates. F-A and F-H are very similar; as discussed above, the splitting criteria are very similar, and further, the F estimators are less prone to overfitting treatment effects, because they split based upon overall model fit. We also observe that the F estimators build the deepest trees; they reward splitting on covariates that affect mean outcomes as well as treatment effects.

The second panel of Table 1 examines the performance of the alternative honest estimators, as evaluated by the infeasible criterion $\mathrm{MSE}_\tau$. We report the average of the ratio of $\mathrm{MSE}_\tau$ for a given estimator to $\mathrm{MSE}_\tau$ for our preferred estimator, CT-H. The TOT-H estimator performs well in Designs 2 and 3, but suffers in Design 1. In Design 1, the variance of $Y_i$ conditional on $(W_i, X_i)$ is very low, and so the failure of TOT to account for the realization of $W_i$ results in a noticeable loss of performance. The F-H estimator suffers in all 3 designs; all designs give the F-H criterion attractive opportunities to split based on covariates that do not enter the treatment effect. F-H would perform better in alternative designs where the mean effect and the treatment effect coincide; F-H also does well at avoiding splits on noise covariates. The TS-H estimator performs well in Design 1, where a covariate affects the mean outcome and the treatment effect in the same way, so that the CT-H criterion is aligned with TS-H. Design 3 is more complex, and the ideal splits from the perspective of balancing overall mean-squared error of treatment effects (including variance reduction) are different from those favored by TS-H. Thus, TS performs worse, and the difference is exacerbated with larger sample size, where there are more opportunities for the estimators to build deeper trees and thus to make different choices. We also calculate comparisons based on a feasible criterion, the average squared difference between the transformed outcome $Y_i^*$ and the estimated treatment effect $\hat\tau_i$. For details of this comparison see the SI Appendix. In general the results are consistent with those from the infeasible criterion.

The third panel of Table 1 explores the costs and benefits of honest estimation. The table reports the ratio of to for each estimator. The adaptive version uses the union of the training and estimation samples for tree-building, cross-validation, and leaf estimation. Thus it has double the sample size (1000 observations) at each step, while the honest version uses 500 of the observations in training and cross-validation, with the complement used for estimating treatment effects within leaves. The results show that there is a cost to honest estimation in terms of , varying by design and estimator.
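The sample split behind honest estimation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `honest_split`, `leaf_effects`, and the `leaf_of` partition function are hypothetical names, and the tree-building step itself (which would use only the training half) is not shown.

```python
import random


def honest_split(n_units, seed=0):
    """Randomly split unit indices into equal training and estimation halves."""
    indices = list(range(n_units))
    random.Random(seed).shuffle(indices)
    half = n_units // 2
    return indices[:half], indices[half:]


def leaf_effects(leaf_of, xs, ws, ys, sample):
    """Treated-minus-control mean outcome per leaf, using only units in `sample`.

    leaf_of: hypothetical function mapping a covariate value to a leaf label;
    in an honest procedure it is built from the training half only.
    """
    totals = {}  # leaf -> [sum_y_treated, n_treated, sum_y_control, n_control]
    for i in sample:
        t = totals.setdefault(leaf_of(xs[i]), [0.0, 0, 0.0, 0])
        if ws[i] == 1:
            t[0] += ys[i]
            t[1] += 1
        else:
            t[2] += ys[i]
            t[3] += 1
    # A leaf without both treated and control units has no estimate.
    return {
        leaf: (t[0] / t[1] - t[2] / t[3]) if t[1] and t[3] else None
        for leaf, t in totals.items()
    }
```

With 1000 observations, `honest_split(1000)` yields 500 training and 500 estimation indices. An adaptive procedure would instead pass all 1000 indices to every step, including `leaf_effects`.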

The final two panels of Table 1 show the coverage rate for 90% confidence intervals. The honest methods achieve nominal coverage rates in all designs, whereas the adaptive methods have coverage rates substantially below nominal. Thus, our simulations bear out the tradeoff: honest estimation sacrifices some goodness of fit (of treatment effects) in exchange for valid confidence intervals.
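A coverage rate of this kind can be computed as in the sketch below. It assumes normal-approximation intervals of the form estimate ± z × standard error and that the true effects are known, as they are in a simulation. The function name and arguments are illustrative assumptions.

```python
from statistics import NormalDist


def coverage_rate(estimates, std_errors, truths, level=0.90):
    """Share of intervals estimate +/- z * se that contain the true value."""
    # Two-sided critical value, e.g. about 1.645 for a 90% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    covered = sum(
        1
        for est, se, truth in zip(estimates, std_errors, truths)
        if abs(est - truth) <= z * se
    )
    return covered / len(estimates)
```

A rate close to `level` across simulation draws indicates that the intervals are valid, while a rate well below it indicates that they are too narrow.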

7 Observational Studies with Unconfoundedness

The discussion so far has focused on the setting where the assignment to treatment is randomized. The proposed methods can be adapted to observational studies under the assumption of unconfoundedness. In that case we need to modify the estimates within leaves to remove the bias from simple comparisons of treated and control units. There is a large literature on methods for doing so, e.g., [14]. For example, as in [10] we can do so by propensity score weighting. Efficiency will improve if we renormalize the weights within each leaf and within the treatment and control group when estimating treatment effects. [5] propose approaches to trimming observations with extreme values for the propensity score to improve robustness. Note that there are some additional conditions required to establish asymptotic normality of treatment effect estimates when propensity score weighting is used (see, e.g., [10]); these results apply without modification to the estimation phase of honest partitioning algorithms.
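The within-leaf weighting described above can be sketched as follows. This is a minimal illustration that assumes the propensity scores have already been estimated and that leaf labels come from a previously built tree. The function name `leaf_effects_ipw` and its arguments are hypothetical, and trimming of extreme propensity scores is not shown.

```python
from collections import defaultdict


def leaf_effects_ipw(leaf_ids, ws, ys, es):
    """Normalized inverse-propensity-weighted treatment effect per leaf.

    leaf_ids: leaf label of each unit; ws: treatment indicators (1 or 0);
    ys: outcomes; es: estimated propensity scores, strictly inside (0, 1).
    Weights are renormalized within each leaf and within each treatment arm.
    """
    # leaf -> [sum w*y treated, sum w treated, sum w*y control, sum w control]
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    for leaf, w, y, e in zip(leaf_ids, ws, ys, es):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        s = sums[leaf]
        if w == 1:
            weight = 1.0 / e
            s[0] += weight * y
            s[1] += weight
        else:
            weight = 1.0 / (1.0 - e)
            s[2] += weight * y
            s[3] += weight
    effects = {}
    for leaf, (ty, tw, cy, cw) in sums.items():
        # A leaf lacking treated or control units has no estimate.
        effects[leaf] = (ty / tw - cy / cw) if tw > 0 and cw > 0 else None
    return effects
```

When the propensity score is constant, the weights cancel and the estimate reduces to the simple difference in mean outcomes between treated and control units in the leaf.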

8 The Literature

A small but growing literature seeks to apply supervised machine learning techniques to the problem of estimating heterogeneous treatment effects. Beyond those previously discussed, [23] transform the features rather than the outcomes and then apply LASSO to the model with the original outcome and the transformed features. [7] estimate for using random forests, then calculate . They then use machine learning algorithms to estimate as a function of the units’ attributes, . Our approach differs in that we apply machine learning methods directly to the treatment effect in a single stage procedure. [13] use LASSO to estimate the effects of both treatments and attributes, but with different penalty terms for the two types of features to allow for the possibility that the treatment effects are present but the magnitudes of the interactions are small. Their approach is similar to ours in that they distinguish between the estimation of treatment effects and the estimation of the impact of other attributes of units. [25] consider a model with the outcome linear in the covariates and the interaction with the treatment variable. Using Bayesian nonparametric methods with Dirichlet priors, they project their estimates of heterogeneous treatment effects down onto the feature space using LASSO-type regularization methods to get low-dimensional summaries of the heterogeneity. [6] and [2] propose a related approach for finding the optimal treatment policy that combines inverse propensity score methods with “direct methods” (e.g. the “single tree” approach considered above) that predict the outcome as a function of the treatment and the unit attributes. The methods can be used to evaluate the average difference in outcomes from any two policies that map attributes to treatments, as well as to select the optimal policy function. They do not focus on hypothesis testing for heterogeneous treatment effects, and they use conventional approaches for cross-validation.
Also related is the work on Targeted Learning [27], which modifies the loss function to increase the weight on the parts of the likelihood that concern the parameters of interest.

9 Conclusion

In this paper we introduce new methods for constructing trees for causal effects that allow us to do valid inference for the causal effects in randomized experiments and in observational studies satisfying unconfoundedness, without restrictions on the number of covariates or the complexity of the data generating process. Our methods partition the feature space into subspaces. The output of our method is a set of treatment effects and confidence intervals for each subspace.

A potentially important application of the techniques is to “data-mining” in randomized experiments. Our method can be used to explore any previously conducted randomized controlled trial, for example, medical studies or field experiments in development economics. A researcher can apply our methods and discover subpopulations with lower-than-average or higher-than-average treatment effects, and can report confidence intervals for these estimates without concern about multiple testing.


  • [1] A. Abadie and G. Imbens, Large Sample Properties of Matching Estimators for Average Treatment Effects, Econometrica, 74(1), (2006), 235-267.
  • [2] A. Beygelzimer and J. Langford, The Offset Tree for Learning with Partial Labels, (2009).
  • [3] L. Breiman, Random forests, Machine Learning, 45, (2001), 5-32.
  • [4] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, (1984), Wadsworth.
  • [5] R. Crump, J. Hotz, G. Imbens, and O. Mitnik, Nonparametric Tests for Treatment Effect Heterogeneity, Review of Economics and Statistics, 90(3), (2008), 389-405.
  • [6] M. Dudik, J. Langford and L. Li, Doubly Robust Policy Evaluation and Learning, Proceedings of the 28th International Conference on Machine Learning (ICML-11), (2011).
  • [7] J. Foster, J. Taylor and S. Ruberg, Subgroup Identification from Randomized Clinical Data, Statistics in Medicine, 30, (2010), 2867-2880.
  • [8] D. Green and H. Kern, Detecting Heterogeneous Treatment Effects in Large-Scale Experiments Using Bayesian Additive Regression Trees, Unpublished Manuscript, Yale University, (2010).
  • [9] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, (2011), Springer.
  • [10] K. Hirano, G. Imbens and G. Ridder, Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score, Econometrica, 71 (4), (2003), 1161-1189.
  • [11] P. Holland, Statistics and Causal Inference (with discussion), Journal of the American Statistical Association, 81, (1986), 945-970.
  • [12] D. Horvitz and D. Thompson, A generalization of sampling without replacement from a finite universe, Journal of the American Statistical Association, 47, (1952), 663-685.
  • [13] K. Imai and M. Ratkovic, Estimating Treatment Effect Heterogeneity in Randomized Program Evaluation, Annals of Applied Statistics, 7(1), (2013), 443-470.
  • [14] G. Imbens and D. Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge University Press, (2015).
  • [15] J. Pearl, Causality: Models, Reasoning and Inference, Cambridge University Press, (2000).
  • [16] P. Rosenbaum, Observational Studies, (2002), Springer.
  • [17] P. Rosenbaum and D. Rubin, The Central Role of the Propensity Score in Observational Studies for Causal Effects, Biometrika, 70, (1983), 41-55.
  • [18] M. Rosenblum and M. Van Der Laan, Optimizing Randomized Trial Designs to Distinguish which Subpopulations Benefit from Treatment, Biometrika, 98(4), (2011), 845-860.
  • [19] D. Rubin, Estimating Causal Effects of Treatments in Randomized and Non-randomized Studies, Journal of Educational Psychology, 66, (1974), 688-701.
  • [20] D. Rubin, Bayesian Inference for Causal Effects: The Role of Randomization, Annals of Statistics, 6, (1978), 34-58.
  • [21] X. Su, C. Tsai, H. Wang, D. Nickerson, and B. Li, Subgroup Analysis via Recursive Partitioning, Journal of Machine Learning Research, 10, (2009), 141-158.
  • [22] J. Signovitch, Identifying informative biological markers in high-dimensional genomic data and clinical trials, PhD Thesis, Department of Biostatistics, Harvard University, (2007).
  • [23] L. Tian, A. Alizadeh, A. Gentles, and R. Tibshirani, A Simple Method for Estimating Interactions Between a Treatment and a Large Number of Covariates, Journal of the American Statistical Association, 109(508), (2014) 1517-1532.
  • [24] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society. Series B (Methodological), Volume 58, Issue 1. (1996), 267-288.
  • [25] M. Taddy, M. Gardner, L. Chen, and D. Draper, Heterogeneous Treatment Effects in Digital Experimentation, Unpublished Manuscript, (2015), arXiv:1412.8563.
  • [26] V. Vapnik, Statistical Learning Theory, Wiley, (1998).
  • [27] M. Van Der Laan and S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data, Springer, (2011).
  • [28] S. Wager and S. Athey, Estimation and Inference of Heterogeneous Treatment Effects using Random Forests, (2015).
  • [29] H. Weisburg and V. Pontes, Post hoc subgroups in Clinical Trials: Anathema or Analytics?, Clinical Trials, (June 2015).
  • [30] A. Zeileis, T. Hothorn, and K. Hornik, Model-based recursive partitioning. Journal of Computational and Graphical Statistics, 17(2), (2008), 492-514.