In many applications, we want to use data to draw inferences about the causal effect of a treatment: Examples include medical studies about the effect of a drug on health outcomes, studies of the impact of advertising or marketing offers on consumer purchases, evaluations of the effectiveness of government programs or public policies, and “A/B tests” (large-scale randomized experiments) commonly used by technology firms to select algorithms for ranking search results or making recommendations. Historically, most datasets have been too small to meaningfully explore heterogeneity of treatment effects beyond dividing the sample into a few subgroups. Recently, however, there has been an explosion of empirical settings where it is potentially feasible to customize estimates for individuals.
An impediment to exploring heterogeneous treatment effects is the fear that researchers will iteratively search for subgroups with high treatment levels, and then report only the results for subgroups with extreme effects, thus highlighting heterogeneity that may be purely spurious (Assmann et al., 2000; Cook et al., 2004). For this reason, protocols for clinical trials must specify in advance which subgroups will be analyzed, and other disciplines such as economics have instituted protocols for registering pre-analysis plans for randomized experiments or surveys. However, such procedural restrictions can make it difficult to discover strong but unexpected treatment effect heterogeneity. In this paper, we seek to address this challenge by developing a powerful, nonparametric method for heterogeneous treatment effect estimation that yields valid asymptotic confidence intervals for the true underlying treatment effect.
Classical approaches to nonparametric estimation of heterogeneous treatment effects include nearest-neighbor matching, kernel methods, and series estimation; see, e.g., Crump et al. (2008), Lee (2009), and Willke et al. (2012)
. These methods perform well in applications with a small number of covariates, but quickly break down as the number of covariates increases. In this paper, we explore the use of ideas from the machine learning literature to improve the performance of these classical methods with many covariates. We focus on the family of random forest algorithms introduced byBreiman (2001a), which allow for flexible modeling of interactions in high dimensions by building a large number of regression trees and averaging their predictions. Random forests are related to kernels and nearest-neighbor methods in that they make predictions using a weighted average of “nearby” observations; however, random forests differ in that they have a data-driven way to determine which nearby observations receive more weight, something that is especially important in environments with many covariates or complex interactions among covariates.
Despite their widespread success at prediction and classification, there are important hurdles that need to be cleared before random forests are directly useful to causal inference. Ideally, an estimator should be consistent with a well-understood asymptotic sampling distribution, so that a researcher can use it to test hypotheses and establish confidence intervals. For example, when deciding to use a drug for an individual, we may wish to test the hypothesis that the expected benefit from the treatment is less than the treatment cost. Asymptotic normality results are especially important in the causal inference setting, both because many policy applications require confidence intervals for decision-making, and because it can be difficult to directly evaluate the model’s performance using, e.g., cross validation, when estimating causal effects. Yet, the asymptotics of random forests have been largely left open, even in the standard regression or classification contexts.
This paper addresses these limitations, developing a forest-based method for treatment effect estimation that allows for a tractable asymptotic theory and valid statistical inference. Following Athey and Imbens (2016), our proposed forest is composed of causal trees that estimate the effect of the treatment at the leaves of the trees; we thus refer to our algorithm as a causal forest.
In the interest of generality, we begin our theoretical analysis by developing the desired consistency and asymptotic normality results in the context of regression forests. We prove these results for a particular variant of regression forests that uses subsampling to generate a variety of different trees, while it relies on deeply grown trees that satisfy a condition we call “honesty” to reduce bias. An example of an honest tree is one where the tree is grown using one subsample, while the predictions at the leaves of the tree are estimated using a different subsample. We also show that the heuristically motivated infinitesimal jackknife for random forests developed byEfron (2014) and Wager et al. (2014)
is consistent for the asymptotic variance of random forests in this setting. Our proof builds on classical ideas fromEfron and Stein (1981), Hájek (1968), and Hoeffding (1948), as well as the adaptive nearest neighbors interpretation of random forests of Lin and Jeon (2006). Given these general results, we next show that our consistency and asymptotic normality results extend from the regression setting to estimating heterogeneous treatment effects in the potential outcomes framework with unconfoundedness (Neyman, 1923; Rubin, 1974).
Although our main focus in this paper is causal inference, we note that there are a variety of important applications of the asymptotic normality result in a pure prediction context. For example, Kleinberg et al. (2015)
seek to improve the allocation of medicare funding for hip or knee replacement surgery by detecting patients who had been prescribed such a surgery, but were in fact likely to die of other causes before the surgery would have been useful to them. Here we need predictions for the probability that a given patient will survive for more than, say, one year that come with rigorous confidence statements; our results are the first that enable the use of random forests for this purpose.
Finally, we compare the performance of the causal forest algorithm against classical -nearest neighbor matching using simulations, finding that the causal forest dominates in terms of both bias and variance in a variety of settings, and that its advantage increases with the number of covariates. We also examine coverage rates of our confidence intervals for heterogeneous treatment effects.
1.1 Related Work
There has been a longstanding understanding in the machine learning literature that prediction methods such as random forests ought to be validated empirically (Breiman, 2001b): if the goal is prediction, then we should hold out a test set, and the method will be considered as good as its error rate is on this test set. However, there are fundamental challenges with applying a test set approach in the setting of causal inference. In the widely used potential outcomes framework we use to formalize our results (Neyman, 1923; Rubin, 1974), a treatment effect is understood as a difference between two potential outcomes, e.g., would the patient have died if they received the drug vs. if they didn’t receive it. Only one of these potential outcomes can ever be observed in practice, and so direct test-set evaluation is in general impossible.222Athey and Imbens (2016) have proposed indirect approaches to mimic test-set evaluation for causal inference. However, these approaches require an estimate of the true treatment effects and/or treatment propensities for all the observations in the test set, which creates a new set of challenges. In the absence of an observable ground truth in a test set, statistical theory plays a more central role in evaluating the noise in estimates of causal effects. Thus, when evaluating estimators of causal effects, asymptotic theory plays a much more important role than in the standard prediction context.
From a technical point of view, the main contribution of this paper is an asymptotic normality theory enabling us to do statistical inference using random forest predictions. Recent results by Biau (2012); Meinshausen (2006); Mentch and Hooker (2016); Scornet et al. (2015), and others have established asymptotic properties of particular variants and simplifications of the random forest algorithm. To our knowledge, however, we provide the first set of conditions under which predictions made by random forests are both asymptotically unbiased and Gaussian, thus allowing for classical statistical inference; the extension to the causal forests proposed in this paper is also new. We review the existing theoretical literature on random forests in more detail in Section 3.1.
A small but growing literature, including Green and Kern (2012), Hill (2011) and Hill and Su (2013), has considered the use of forest-based algorithms for estimating heterogeneous treatment effects. These papers use the Bayesian Additive Regression Tree (BART) method of Chipman et al. (2010)2011) use regression forests to estimate the effect of covariates on outcomes in treated and control groups separately, and then take the difference in predictions as data and project treatment effects onto units’ attributes using regression or classification trees (in contrast, we modify the standard random forest algorithm to focus on directly estimating heterogeneity in causal effects). A limitation of this line of work is that, until now, it has lacked formal statistical inference results.
We view our contribution as complementary to this literature, by showing that forest-based methods need not only be viewed as black-box heuristics, and can instead be used for rigorous asymptotic analysis. We believe that the theoretical tools developed here will be useful beyond the specific class of algorithms studied in our paper. In particular, our tools allow for a fairly direct analysis of variants of the method ofFoster et al. (2011). Using BART for rigorous statistical analysis may prove more challenging since, although BART is often successful in practice, there are currently no results guaranteeing posterior concentration around the true conditional mean function, or convergence of the MCMC sampler in polynomial time. Advances of this type would be of considerable interest.
Several papers use tree-based methods for estimating heterogeneous treatment effects. In growing trees to build our forest, we follow most closely the approach of Athey and Imbens (2016), who propose honest, causal trees, and obtain valid confidence intervals for average treatment effects for each of the subpopulations (leaves) identified by the algorithm. (Instead of personalizing predictions for each individual, this approach only provides treatment effect estimates for leaf-wise subgroups whose size must grow to infinity.) Other related approaches include those of Su et al. (2009) and Zeileis et al. (2008), which build a tree for treatment effects in subgroups and use statistical tests to determine splits; however, these papers do not analyze bias or consistency properties.
Finally, we note a growing literature on estimating heterogeneous treatment effects using different machine learning methods. Imai and Ratkovic (2013), Signorovitch (2007), Tian et al. (2014) and Weisberg and Pontes (2015) develop lasso-like methods for causal inference in a sparse high-dimensional linear setting. Beygelzimer and Langford (2009), Dudík et al. (2011), and others discuss procedures for transforming outcomes that enable off-the-shelf loss minimization methods to be used for optimal treatment policy estimation. In the econometrics literature, Bhattacharya and Dupas (2012); Dehejia (2005); Hirano and Porter (2009); Manski (2004)
estimate parametric or semi-parametric models for optimal policies, relying on regularization for covariate selection in the case ofBhattacharya and Dupas (2012). Taddy et al. (2016) use Bayesian nonparametric methods with Dirichlet priors to flexibly estimate the data-generating process, and then project the estimates of heterogeneous treatment effects down onto the feature space using regularization methods or regression trees to get low-dimensional summaries of the heterogeneity; but again, there are no guarantees about asymptotic properties.
2 Causal Forests
2.1 Treatment Estimation with Unconfoundedness
Suppose we have access to independent and identically distributed training examples labeled
, each of which consists of a feature vector, a response , and a treatment indicator . Following the potential outcomes framework of Neyman (1923) and Rubin (1974) (see Imbens and Rubin (2015) for a review), we then posit the existence of potential outcomes and corresponding respectively to the response the -th subject would have experienced with and without the treatment, and define the treatment effect at as
Our goal is to estimate this function . The main difficulty is that we can only ever observe one of the two potential outcomes for a given training example, and so cannot directly train machine learning methods on differences of the form .
In general, we cannot estimate simply from the observed data without further restrictions on the data generating distribution. A standard way to make progress is to assume unconfoundedness (Rosenbaum and Rubin, 1983), i.e., that the treatment assignment is independent of the potential outcomes for conditional on :
The motivation behind unconfoundedness is that, given continuity assumptions, it effectively implies that we can treat nearby observations in -space as having come from a randomized experiment; thus, nearest-neighbor matching and other local methods will in general be consistent for .
An immediate consequence of unconfoundedness is that
is the propensity of receiving treatment at . Thus, if we knew
, we would have access to a simple unbiased estimator for; this observation lies at the heart of methods based on propensity weighting (e.g., Hirano et al., 2003). Many early applications of machine learning to causal inference effectively reduce to estimating
using, e.g., boosting, a neural network, or even random forests, and then transforming this into an estimate forusing (3) (e.g., McCaffrey et al., 2004; Westreich et al., 2010). In this paper, we take a more indirect approach: We show that, under regularity assumptions, causal forests can use the unconfoundedness assumption (2) to achieve consistency without needing to explicitly estimate the propensity .
2.2 From Regression Trees to Causal Trees and Forests
At a high level, trees and forests can be thought of as nearest neighbor methods with an adaptive neighborhood metric. Given a test point , classical methods such as -nearest neighbors seek the closest points to according to some pre-specified distance measure, e.g., Euclidean distance. In contrast, tree-based methods also seek to find training examples that are close to
, but now closeness is defined with respect to a decision tree, and the closest points toare those that fall in the same leaf as it. The advantage of trees is that their leaves can be narrower along the directions where the signal is changing fast and wider along the other directions, potentially leading a to a substantial increase in power when the dimension of the feature space is even moderately large.
In this section, we seek to build causal trees that resemble their regression analogues as closely as possible. Suppose first that we only observe independent samples , and want to build a CART regression tree. We start by recursively splitting the feature space until we have partitioned it into a set of leaves , each of which only contains a few training samples. Then, given a test point , we evaluate the prediction by identifying the leaf containing and setting
Heuristically, this strategy is well-motivated if we believe the leaf to be small enough that the responses inside the leaf are roughly identically distributed. There are several procedures for how to place the splits in the decision tree; see, e.g., Hastie et al. (2009).
In the context of causal trees, we analogously want to think of the leaves as small enough that the pairs corresponding to the indices for which act as though they had come from a randomized experiment. Then, it is natural to estimate the treatment effect for any as
In the following sections, we will establish that such trees can be used to grow causal forests that are consistent for .333The causal tree algorithm presented above is a simplification of the method of Athey and Imbens (2016). The main difference between our approach and that of Athey and Imbens (2016) is that they seek to build a single well-tuned tree; to this end, they use fairly large leaves and apply a form propensity weighting based on (3) within each leaf to correct for variations in inside the leaf. In contrast, we follow Breiman (2001a) and build our causal forest using deep trees. Since our leaves are small, we are not required to apply any additional corrections inside them. However, if reliable propensity estimates are available, using them as weights for our method may improve performance (and would not conflict with the theoretical results).
Finally, given a procedure for generating a single causal tree, a causal forest generates an ensemble of such trees, each of which outputs an estimate . The forest then aggregates their predictions by averaging them: . We always assume that the individual causal trees in the forest are built using random subsamples of training examples, where ; for our theoretical results, we will assume that for some . The advantage of a forest over a single tree is that it is not always clear what the “best” causal tree is. In this case, as shown by Breiman (2001a), it is often better to generate many different decent-looking trees and average their predictions, instead of seeking a single highly-optimized tree. In practice, this aggregation scheme helps reduce variance and smooths sharp decision boundaries (Bühlmann and Yu, 2002).
2.3 Asymptotic Inference with Causal Forests
Our results require some conditions on the forest-growing scheme: The trees used to build the forest must be grown on subsamples of the training data, and the splitting rule must not “inappropriately” incorporate information about the outcomes as discussed formally in Section 2.4. However, given these high level conditions, we obtain a widely applicable consistency result that applies to several different interesting causal forest algorithms.
Our first result is that causal forests are consistent for the true treatment effect . To achieve pointwise consistency, we need to assume that the conditional mean functions and are both Lipschitz continuous. To our knowledge, all existing results on pointwise consistency of regression forests (e.g., Biau, 2012; Meinshausen, 2006) require an analogous condition on . This is not particularly surprising, as forests generally have smooth response surfaces (Bühlmann and Yu, 2002). In addition to continuity assumptions, we also need to assume that we have overlap, i.e., for some and all ,
This condition effectively guarantees that, for large enough , there will be enough treatment and control units near any test point for local methods to work.
Beyond consistency, in order to do statistical inference on the basis of the estimated treatment effects , we need to understand their asymptotic sampling distribution. Using the potential nearest neighbors construction of Lin and Jeon (2006) and classical analysis tools going back to Hoeffding (1948) and Hájek (1968), we show that—provided the sub-sample size scales appropriately with —the predictions made by a causal forest are asymptotically Gaussian and unbiased. Specifically, we show that
under the conditions required for consistency, provided the subsample size scales as for some
Moreover, we show that the asymptotic variance of causal forests can be accurately estimated. To do so, we use the infinitesimal jackknife for random forests developed by Efron (2014) and Wager et al. (2014), based on the original infinitesimal jackknife procedure of Jaeckel (1972). This method assumes that we have taken the number of trees to be large enough that the Monte Carlo variability of the forest does not matter; and only measures the randomness in due to the training sample.
To define the variance estimates, let be the treatment effect estimate given by the -th tree, and let indicate whether or not the -th training example was used for the -th tree.444For double-sample trees defined in Procedure 1, if the -th example appears in either the -sample or the -sample. Then, we set
where the covariance is taken with respect to the set of all the trees used in the forest. The term is a finite-sample correction for forests grown by subsampling without replacement; see Proposition 10. We show that this variance estimate is consistent, in the sense that .
2.4 Honest Trees and Forests
In our discussion so far, we have emphasized the flexible nature of our results: for a wide variety of causal forests that can be tailored to the application area, we achieve both consistency and centered asymptotic normality, provided the sub-sample size scales at an appropriate rate. Our results do, however, require the individual trees to satisfy a fairly strong condition, which we call honesty: a tree is honest if, for each training example , it only uses the response to estimate the within-leaf treatment effect using (5) or to decide where to place the splits, but not both. We discuss two causal forest algorithms that satisfy this condition.
Our first algorithm, which we call a double-sample tree, achieves honesty by dividing its training subsample into two halves and . Then, it uses the -sample to place the splits, while holding out the -sample to do within-leaf estimation; see Procedure 1 for details. In our experiments, we set the minimum leaf size to . A similar family of algorithms was discussed in detail by Denil et al. (2014), who showed that such forests could achieve competitive performance relative to standard tree algorithms that do not divide their training samples. In the semiparametric inference literature, related ideas go back at least to the work of Schick (1986).
We note that sample splitting procedures are sometimes criticized as inefficient because they “waste” half of the training data at each step of the estimation procedure. However, in our case, the forest subampling mechanism enables us to achieve honesty without wasting any data in this sense, because we re-randomize the -data splits over each subsample. Thus, although no data point can be used for split selection and leaf estimation in a single tree, each data point will participate in both and samples of some trees, and so will be used for both specifying the structure and treatment effect estimates of the forest. Although our original motivation for considering double-sample trees was to eliminate bias and thus enable centered confidence intervals, we find that in practice, double-sample trees can improve upon standard random forests in terms of mean-squared error as well.
Another way to build honest trees is to ignore the outcome data when placing splits, and instead first train a classification tree for the treatment assignments (Procedure 2). Such propensity trees can be particularly useful in observational studies, where we want to minimize bias due to variation in . Seeking estimators that match training examples based on estimated propensity is a longstanding idea in causal inference, going back to Rosenbaum and Rubin (1983).
For completeness, we briefly outline the motivation for the splitting rule of Athey and Imbens (2016) we use for our double-sample trees. This method is motivated by an algorithm for minimizing the squared-error loss in regression trees. Because regression trees compute predictions by averaging training responses over leaves, we can verify that
Thus, finding the squared-error minimizing split is equivalent to maximizing the variance of for ; note that for all trees, and so maximizing variance is equivalent to maximizing the sum of the . In Procedure 1, we emulate this algorithm by picking splits that maximize the variance of for .555Athey and Imbens (2016) also consider “honest splitting rules” that anticipate honest estimation, and correct for the additional sampling variance in small leaves using an idea closely related to the penalty of Mallows (1973). Although it could be of interest for further work, we do not study the effect of such splitting rules here.
In Appendix B
, we present evidence that adaptive forests with small leaves can overfit to outliers in ways that make them inconsistent near the edges of sample space. Thus, the forests ofBreiman (2001a) need to be modified in some way to get pointwise consistency results; here, we use honesty following, e.g., Wasserman and Roeder (2009). We note that there have been some recent theoretical investigations of non-honest forests, including Scornet et al. (2015) and Wager and Walther (2015). However, Scornet et al. (2015) do not consider pointwise properties of forests; whereas Wager and Walther (2015) show consistency of adaptive forests with larger leaves, but their bias bounds decay slower than the sampling variance of the forests and so cannot be used to establish centered asymptotic normality.
3 Asymptotic Theory for Random Forests
In order to use random forests to provide formally valid statistical inference, we need an asymptotic normality theory for random forests. In the interest of generality, we first develop such a theory in the context of classical regression forests, as originally introduced by Breiman (2001a). In this section, we assume that we have training examples for , a test point , and we want to estimate true conditional mean function
We also have access to a regression tree which can be used to get estimates of the conditional mean function at of the form , where is a source of auxiliary randomness. Our goal is to use this tree-growing scheme to build a random forest that can be used for valid statistical inference about .
We begin by precisely describing how we aggregate individual trees into a forest. For us, a random forest is an average of trees trained over all possible size- subsamples of the training data, marginalizing over the auxiliary noise . In practice, we compute such a random forest by Monte Carlo averaging, and set
where is drawn without replacement from , is a random draw from , and is the number of Monte Carlo replicates we can afford to perform. The formulation (12) arises as the limit of (11); thus, our theory effectively assumes that is large enough for Monte Carlo effects not to matter. The effects of using a finite are studied in detail by Mentch and Hooker (2016); see also Wager et al. (2014), who recommend taking on the order of .
The random forest with base learner and subsample size is
Next, as described in Section 2, we require that the trees in our forest be honest. Double-sample trees, as defined in Procedure 1, can always be used to build honest trees with respect to the -sample. In the context of causal trees for observational studies, propensity trees (Procedure 2) provide a simple recipe for building honest trees without sample splitting.
A tree grown on a training sample is honest if (a) (standard case) the tree does not use the responses in choosing where to place its splits; or (b) (double sample case) the tree does not use the -sample responses for placing splits.
In order to guarantee consistency, we also need to enforce that the leaves of the trees become small in all dimensions of the feature space as gets large.666Biau (2012) and Wager and Walther (2015) consider the estimation of low-dimensional signals embedded in a high-dimensional ambient space using random forests; in this case, the variable selection properties of trees also become important. We leave a study of asymptotic normality of random forests in high dimensions to future work. Here, we follow Meinshausen (2006), and achieve this effect by enforcing some randomness in the way trees choose the variables they split on: At each step, each variable is selected with probability at least for some (for example, we could satisfy this condition by completely randomizing the splitting variable with probability
). Formally, the randomness in how to pick the splitting features is contained in the auxiliary random variable.
A tree is a random-split tree if at every step of the tree-growing procedure, marginalizing over , the probability that the next split occurs along the -th feature is bounded below by for some , for all .
The remaining definitions are more technical. We use regularity to control the shape of the tree leaves, while symmetry is used to apply classical tools in establishing asymptotic normality.
A tree predictor grown by recursive partitioning is -regular for some if either (a) (standard case) each split leaves at least a fraction of the available training examples on each side of the split and, moreover, the trees are fully grown to depth for some , i.e., there are between and observations in each terminal node of the tree; or (b) (double sample case) if the predictor is a double-sample tree as in Procedure 1, the tree satisfies part (a) for the sample.
A predictor is symmetric if the (possibly randomized) output of the predictor does not depend on the order () in which the training examples are indexed.
Finally, in the context of classification and regression forests, we estimate the asymptotic variance of random forests using the original infinitesimal jackknife of Wager et al. (2014), i.e.,
where is the estimate for given by a single regression tree. We note that the finite-sample correction did not appear in Wager et al. (2014), as their paper focused on subsampling with replacement, whereas this correction is only appropriate for subsampling without replacement.
Given these preliminaries, we can state our main result on the asymptotic normality of random forests. As discussed in Section 2.3, we require that the conditional mean function be Lipschitz continuous. The asymptotic normality result requires for the subsample size to scale within the bounds given in (14). If the subsample size grows slower than this, the forest will still be asymptotically normal, but the forest may be asymptotically biased. For clarity, we state the following result with notation that makes the dependence of and on explicit; in most of the paper, however, we drop the subscripts to and when there is no risk of confusion.
Suppose that we have independent and identically distributed training examples . Suppose moreover that the features are independently and uniformly distributed
. Suppose moreover that the features are independently and uniformly distributed777The result also holds with a density that is bounded away from 0 and infinity; however, we assume uniformity for simpler exposition. , that and are Lipschitz-continuous, and finally that and for some constants , uniformly over all . Given this data-generating process, let be an honest, -regular with , and symmetric random-split tree in the sense of Definitions 2, 3, 4, and 5, and let be the estimate for given by a random forest with base learner and a subsample size . Finally, suppose that the subsample size scales as
Then, random forest predictions are asymptotically Gaussian:
Moreover, the asymptotic variance can be consistently estimated using the infinitesimal jackknife (8):
Remark 3 (binary classification).
We note that Theorem 1 also holds for binary classification forests with leaf size , as is default in the R-package randomForest (Liaw and Wiener, 2002). Here, we treat the output of the random forests as an estimate for the probability ; Theorem 1 then lets us construct valid confidence intervals for this probability. For classification forests with , the proof of Theorem 1 still holds if the individual classification trees are built by averaging observations within a leaf, but not if they are built by voting. Extending our results to voting trees is left as further work.
The proof of this result is organized as follows. In Section 3.2, we provide bounds for the bias of random forests, while Section 3.3 studies the sampling distributions of and establishes Gaussianity. Given a subsampling rate satisfying (14), the bias decays faster than the variance, thus allowing for (15). Before beginning the proof, however, we relate our result to existing results about random forests in Section 3.1.
3.1 Theoretical Background
There has been considerable work in understanding the theoretical properties of random forests. The convergence and consistency properties of trees and random forests have been studied by, among others, Biau (2012), Biau et al. (2008), Breiman (2004), Breiman et al. (1984), Meinshausen (2006), Scornet et al. (2015), Wager and Walther (2015), and Zhu et al. (2015). Meanwhile, their sampling variability has been analyzed by Duan (2011), Lin and Jeon (2006), Mentch and Hooker (2016), Sexton and Laake (2009), and Wager et al. (2014). However, to our knowledge, our Theorem 1 is the first result establishing conditions under which predictions made by random forests are asymptotically unbiased and normal.
Probably the closest existing result is that of Mentch and Hooker (2016), who showed that random forests based on subsampling are asymptotically normal under substantially stronger conditions than us: they require that the subsample size grows slower than , i.e., that . However, under these conditions, random forests will not in general be asymptotically unbiased. As a simple example, suppose that , that , and that we evaluate an honest random forest at . A quick calculation shows that the bias of the random forest decays as , while its variance decays as . If , the squared bias decays slower than the variance, and so confidence intervals built using the resulting Gaussian limit distribution will not cover . Thus, although the result of Mentch and Hooker (2016) may appear qualitatively similar to ours, it cannot be used for valid asymptotic statistical inference about .
The variance estimator was studied in the context of random forests by Wager et al. (2014), who showed empirically that the method worked well for many problems of interest. Wager et al. (2014) also emphasized that, when using in practice, it is important to account for Monte Carlo bias. Our analysis provides theoretical backing to these results, by showing that is in fact a consistent estimate for the variance of random forest predictions. The earlier work on this topic (Efron, 2014; Wager et al., 2014) had only motivated the estimator by highlighting connections to classical statistical ideas, but did not establish any formal justification for it.
Instead of using subsampling, Breiman originally described random forests in terms of bootstrap sampling, or bagging (Breiman, 1996). Random forests with bagging, however, have proven to be remarkably resistant to classical statistical analysis. As observed by Buja and Stuetzle (2006), Chen and Hall (2003), Friedman and Hall (2007) and others, estimators of this form can exhibit surprising properties even in simple situations; meanwhile, using subsampling rather than bootstrap sampling has been found to avoid several pitfalls (e.g., Politis et al., 1999). Although they are less common in the literature, random forests based on subsampling have also been occasionally studied and found to have good practical and theoretical properties (e.g., Bühlmann and Yu, 2002; Mentch and Hooker, 2016; Scornet et al., 2015; Strobl et al., 2007).
Finally, an interesting question for further theoretical study is to understand the optimal scaling of the subsample size for minimizing the mean-squared error of random forests. For subsampled nearest-neighbors estimation, the optimal rate for is (Biau et al., 2010; Samworth, 2012). Here, our specific value for depends on the upper bounds for bias developed in the following section. Now, as shown by Biau (2012), under some sparsity assumptions on , it is possible to get substantially stronger bounds for the bias of random forests; thus, it is plausible that under similar conditions we could push back the lower bound on the growth rate of the subsample size.
3.2 Bias and Honesty
We start by bounding the bias of regression trees. Our approach relies on showing that as the sample size available to the tree gets large, its leaves get small; Lipschitz-continuity of the conditional mean function and honesty then let us bound the bias. In order to state a formal result, define the diameter of a leaf as the length of the longest segment contained inside , and similarly let denote the length of the longest such segment that is parallel to the -th axis. The following lemma is a refinement of a result of Meinshausen (2006), who showed that for regular trees.
Let be a regular, random-split tree and let denote its leaf containing . Suppose that independently. Then, for any , and for large enough ,
This lemma then directly translates into a bound on the bias of a single regression tree. Since a forest is an average of independently-generated trees, the bias of the forest is the same as the bias of a single tree.
Under the conditions of Lemma 2, suppose moreover that is Lipschitz continuous and that the trees in the random forest are honest. Then, provided that , the bias of the random forest at is bounded by
the constant in the -bound is given in the proof.
3.3 Asymptotic Normality of Random Forests
Our analysis of the asymptotic normality of random forests builds on ideas developed by Hoeffding (1948) and Hájek (1968) for understanding classical statistical estimators such as -statistics. We begin by briefly reviewing their results to give some context to our proof. Given a predictor and independent training examples , …, , the Hájek projection of is defined as
In other words, the Hájek projection of captures the first-order effects in . Classical results imply that , and further:
Since the Hájek projection is a sum of independent random variables, we should expect it to be asymptotically normal under weak conditions. Thus whenever the ratio of the variance of to that of tends to 1, the theory of Hájek projections almost automatically guarantees that will be asymptotically normal.888 The moments defined in (
The moments defined in (17) depend on the data-generating process for the , and so cannot be observed in practice. Thus, the Hájek projection is mostly useful as an abstract theoretical tool. For a review of classical projection arguments, see Chapter 11 of Van der Vaart (2000).
If is a regression tree, however, the condition from (18) does not apply, and we cannot use the classical theory of Hájek projections directly. Our analysis is centered around a weaker form of this condition, which we call -incrementality. With our definition, predictors to which we can apply the argument (18) directly are 1-incremental.
The predictor is -incremental at if
where is the Hájek projection of (17). In our notation,
Our argument proceeds in two steps. First we establish lower bounds for the incrementality of regression trees in Section 3.3.1. Then, in Section 3.3.2 we show how we can turn weakly incremental predictors into 1-incremental ensembles by subsampling (Lemma 7), thus bringing us back into the realm of classical theory. We also establish the consistency of the infinitesimal jackknife for random forests. Our analysis of regression trees is motivated by the “potential nearest neighbors” model for random forests introduced by Lin and Jeon (2006); the key technical device used in Section 3.3.2 is the ANOVA decomposition of Efron and Stein (1981). The discussion of the infinitesimal jackknife for random forest builds on results of Efron (2014) and Wager et al. (2014).
3.3.1 Regression Trees and Incremental Predictors
Analyzing specific greedy tree models such as CART trees can be challenging. We thus follow the lead of Lin and Jeon (2006), and analyze a more general class of predictors—potential nearest neighbors predictors—that operate by doing a nearest-neighbor search over rectangles; see also Biau and Devroye (2010). The study of potential (or layered) nearest neighbors goes back at least to Barndorff-Nielsen and Sobel (1966).
Consider a set of points and a fixed . A point is a potential nearest neighbor (PNN) of if the smallest axis-aligned hyperrectangle with vertices and contains no other points . Extending this notion, a PNN -set of is a set of points of size such that there exists an axis aligned hyperrectangle containing , , and no other training points. A training example is called a -PNN of if there exists a PNN -set of containing . Finally, a predictor is a -PNN predictor over if, given a training set
and a test point , always outputs the average of the responses over a -PNN set of .
This formalism allows us to describe a wide variety of tree predictors. For example, as shown by Lin and Jeon (2006), any decision tree that makes axis-aligned splits and has leaves of size between and is a -PNN predictor. In particular, the base learners originally used by Breiman (2001a), namely CART trees grown up to a leaf size (Breiman et al., 1984), are -PNN predictors. Predictions made by -PNN predictors can always be written as
where is a selection variable that takes the value for indices in the selected leaf-set and 0 for all other indices. If the tree is honest, we know in addition that, for each , is independent of conditional on .
An important property of -PNN predictors is that we can often get a good idea about whether is non-zero even if we only get to see ; more formally, as we show below, the quantity cannot get too small. Establishing this fact is a key step in showing that -PNNs are incremental. In the following result, can be an arbitrary symmetric -PNN predictor.
Suppose that the observations are independent and identically distributed on with a density that is bounded away from infinity, and let be any symmetric -PNN predictor. Then, there is a constant depending only on and such that, as gets large,
where is the indicator for whether the observation is selected in the subsample. When is uniform over , the bound holds with .
When we see that, marginally, and so ; more generally, a similar calculation shows that . Thus, (20) can be interpreted as a lower bound on how much information contains about the selection event .
Thanks to this result, we are now ready to show that all honest and regular random-split trees are incremental. Notice that any symmetric -regular tree following Definition 4 is also a symmetric -PNN predictor.
Suppose that the conditions of Lemma 4 hold and that is an honest -regular symmetric tree in the sense of Definitions 2 (part a), 4 (part a), and 5. Suppose moreover that the conditional moments and are both Lipschitz continuous at . Finally, suppose that . Then is -incremental at with
where is the constant from Lemma 4.
Finally, the result of Theorem 5 also holds for double-sample trees of the form described in Procedure 1. To establish the following result, we note that a double-sample tree is an honest, symmetric -PNN predictor with respect to the -sample, while all the data in the -sample can be folded into the auxiliary noise term ; the details are worked out in the proof.
3.3.2 Subsampling Incremental Base Learners
In the previous section, we showed that decision trees are -incremental, in that the Hájek projection of preserves at least some of the variation of . In this section, we show that randomly subsampling -incremental predictors makes them 1-incremental; this then lets us proceed with a classical statistical analysis. The following lemma, which flows directly from the ANOVA decomposition of Efron and Stein (1981), provides a first motivating result for our analysis.
leads to an asymptotic Gaussianity result; from a technical point of view, it suffices to check Lyapunov-style conditions for the central limit theorem.
Finally, we end this section by motivating the finite sample correction appearing in (13) by considering the simple case where we have trivial trees that do not make any splits: . In this case, we can verify that the full random forest is nothing but , and the standard variance estimator
is well-known to be unbiased for . We show below that, for trivial trees , implying that our correction makes exactly unbiased in finite samples for trivial trees. Of course, , and so Theorem 9 would hold even without this finite-sample correction; however, we find it to substantially improve the performance of our method in practice.
For trivial trees , the variance estimate (13) is equivalent to the standard variance estimator , and .
4 Inferring Heterogeneous Treatment Effects
We now return to our main topic, namely estimating heterogeneous treatment effects using random forests in the potential outcomes framework with unconfoundedness, and adapt our asymptotic theory for regression forests to the setting of causal inference. Here, we again work with training data consisting of tuples for , where is a feature vector, is the response, and is the treatment assignment. Our goal is to estimate the conditional average treatment effect at a pre-specified test point . By analogy to Definition 1, we build our causal forest by averaging estimates for obtained by training causal trees over subsamples:
We seek an analogue to Theorem 1 for such causal forests.
Most of the definitions used to state Theorem 1 apply directly to this context; however, the notions of honesty and regularity need to be adapted slightly. Specifically, an honest causal tree is not allowed to look at the responses when making splits but can look at the treatment assignments . Meanwhile, a regular causal tree must have at least examples from both treatment classes in each leaf; in other words, regular causal trees seek to act as fully grown trees for the rare treatment assignment, while allowing for more instances of the common treatment assignment.
A causal tree grown on a training sample , …, is honest if (a) (standard case) the tree does not use the responses in choosing where to place its splits; or (b) (double sample case) the tree does not use the -sample responses for placing splits.
A causal tree grown by recursive partitioning is -regular at for some if either: (a) (standard case) (1) Each split leaves at least a fraction of the available training examples on each side of the split, (2) The leaf containing has at least observations from each treatment group () for some , and (3) The leaf containing has either less than observations with or observations with ; or (b) (double-sample case) for a double-sample tree as defined in Procedure 1, (a) holds for the sample.
Given these assumptions, we show a close analogue to Theorem 1, given below. The main difference relative to our first result about regression forests is that we now rely on unconfoundedness and overlap to achieve consistent estimation of . To see how these assumptions enter the proof, recall that an honest causal tree uses the features and the treatment assignments in choosing where to place its splits, but not the responses . Writing and for the indices of the treatment and control units in the leaf around , we then find that after the splitting stage
where the second equality follows by unconfoundedness (2). Thus, it suffices to show that the two above terms are consistent for estimating and . To do so, we can essentially emulate the argument leading to Theorem 1, provided we can establish an analogue to Lemma 2 and give a fast enough decaying upper bound to the diameter of ; this is where we need the overlap assumption. A proof of Theorem 11 is given in the appendix.
Suppose that we have independent and identically distributed training examples . Suppose, moreover, that the treatment assignment is unconfounded (2) and has overlap (6). Finally, suppose that both potential outcome distributions and satisfy the same regularity assumptions as the pair did in the statement of Theorem 1. Given this data-generating process, let be an honest, -regular with , and symmetric random-split causal forest in the sense of Definitions 0b, 3, 0b, and 5, and let be the estimate for given by a causal forest with base learner and a subsample size scaling as in (14). Then, the predictions are consistent and asymptotically both Gaussian and centered, and the variance of the causal forest can be consistently estimated using the infinitesimal jackknife for random forests, i.e., (7) holds.
(testing at many points) We note that it is not in general possible to construct causal trees that are regular in the sense of Definition 0b for all simultaneously. As a simple example, consider the situation where , and ; then, the tree can have at most 1 leaf for which it is regular. In the proof of Theorem 11, we avoided this issue by only considering a single test point , as it is always possible to build a tree that is regular at a single given point . In practice, if we want to build a causal tree that can be used to predict at many test points, we may need to assign different trees to be valid for different test points. Then, when predicting at a specific , we treat the set of trees that were assigned to be valid at that as the relevant forest and apply Theorem 11 to it.
5 Simulation Experiments
In observational studies, accurate estimation of heterogeneous treatment effects requires overcoming two potential sources of bias. First, we need to identify neighborhoods over which the actual treatment effect is reasonably stable and, second, we need to make sure that we are not biased by varying sampling propensities . The simulations here aim to test the ability of causal forests to respond to both of these factors.
Since causal forests are adaptive nearest neighbor estimators, it is natural to use a non-adaptive nearest neighborhood method as our baseline. We compare our method to the standard nearest neighbors (-NN) matching procedure, which estimates the treatment effect as
where and are the nearest neighbors to in the treatment () and control () samples respectively. We generate confidence intervals for the -NN method by modeling as Gaussian with mean and variance , where is the sample variance for .
The goal of this simulation study is to verify that forest-based methods can be used build rigorous, asymptotically valid confidence intervals that improve over non-adaptive methods like -NN in finite samples. The fact that forest-based methods hold promise for treatment effect estimation in terms of predictive error has already been conclusively established elsewhere; for example, BART methods following Hill (2011) won the recent Causal Inference Data Analysis Challenge at the 2016 Atlantic Causal Inference Conference. We hope that the conceptual tools developed in this paper will prove to be helpful in analyzing a wide variety of forest-based methods.
5.1 Experimental Setup
We describe our experiments in terms of the sample size , the ambient dimension , as well as the following functions:
In all our examples, we respect unconfoundedness (2), use , and have homoscedastic noise . We evaluate performance in terms of expected mean-squared error for estimating at a random test example , as well as expected coverage of with a target coverage rate of 0.95.
In our first experiment, we held the treatment effect fixed at , and tested the ability of our method to resist bias due to an interaction between and . This experiment is intended to emulate the problem that in observational studies, a treatment assignment is often correlated with potential outcomes, creating bias unless the statistical method accurately adjusts for covariates. -NN matching is a popular approach for performing this adjustment in practice. Here, we set
where is the -density with shape parameters and . We used samples and varied between 2 and 30. Since our goal is accurate propensity matching, we use propensity trees (Procedure 2) as our base learner; we grew trees with .
For our second experiment, we evaluated the ability of causal forests to adapt to heterogeneity in