LocalLinearForest
A python implementation of local linear forests (https://arxiv.org/pdf/1807.11408.pdf) based on sklearn
view repo
Random forests are a powerful method for non-parametric regression, but are limited in their ability to fit smooth signals, and can show poor predictive performance in the presence of strong, smooth effects. Taking the perspective of random forests as an adaptive kernel method, we pair the forest kernel with a local linear regression adjustment to better capture smoothness. The resulting procedure, local linear forests, enables us to improve on asymptotic rates of convergence for random forests with smooth signals, and provides substantial gains in accuracy on both real and simulated data.
READ FULL TEXT VIEW PDFA python implementation of local linear forests (https://arxiv.org/pdf/1807.11408.pdf) based on sklearn
heterogeneous treatment effect estimation with causal forests
Python implementation of Local Linear Forest (https://arxiv.org/pdf/1807.11408.pdf)
None
Random forests (Breiman, 2001) are a popular method for non-parametric regression that have proven effective across many application areas (Cutler, Edwards Jr, Beard, Cutler, Hess, Gibson, and Lawler, 2007; Díaz-Uriarte and De Andres, 2006; Svetnik, Liaw, Tong, Culberson, Sheridan, and Feuston, 2003). A major weakness of random forests, however, is their inability to exploit smoothness in the regression surface they are estimating. As an example, consider the following setup with a smooth trend in the conditional response surface: We simulate
independently from the uniform distribution on
, with responses(1) |
and our goal is to estimate . The left panel of Figure 1 shows a set of predictions on this data from a random forest. The forest is unable to exploit strong local trends and, as a result, fits the target function using qualitatively the wrong shape: The prediction surface resembles a step function as opposed to a smooth curve.
In order to address this weakness, we take the perspective of random forests as an adaptive kernel method. This interpretation follows work by Athey, Tibshirani, and Wager (2018), Hothorn, Lausen, Benner, and Radespiel-Troger (2004), and Meinshausen (2006), and complements the traditional view of forests as an ensemble method (i.e., an average of predictions made by individual trees). Specifically, random forest predictions can be written as
(2) |
where the weights defined in (4) encode the weight given by the forest to the -th training example when predicting at . Now, as is well-known in the literature on non-parametric regression, if we want to fit smooth signals without some form of neighborhood averaging (e.g., kernel regression, -NN, or matching for causal inference), it is helpful to use a local regression adjustment to correct for potential misalignment between a test point and its neighborhood (Abadie and Imbens, 2011; Cleveland and Devlin, 1988; Fan and Gijbels, 1996; Heckman, Ichimura, and Todd, 1998; Loader, 1999; Newey, 1994; Stone, 1977; Tibshirani and Hastie, 1987). These types of adjustments are particularly important near boundaries, where neighborhoods are asymmetric by necessity, but with many covariates, the adjustments are also important away from boundaries given that local neighborhoods are often unbalanced due to sampling variation.
The goal of this paper is to improve the accuracy of forests on smooth signals using regression adjustments, potentially in many dimensions. By using the local regression adjustment, it is possible to adjust for asymmetries and imbalances in the set of nearby points used for prediction, ensuring that the weighted average of the feature vector of neighboring points is approximately equal to the target feature vector, and that predictions are centered. The improvement to forests from the regression adjustment is most likely to be large in cases where some features have strong effects with moderate curvature, so that regression adjustments are both effective and important. Many datasets have this characteristic; for example, labor market outcomes tend to improve with parents’ educational and labor market attainment, but there are diminishing returns.
In their simplest form, local linear forests take the forest weights , and use them for local regression:
(3) |
Here estimates the conditional mean function , and corrects for the local trend in . The ridge penalty prevents overfitting to the local trend, and plays a key role both in simulation experiments and asymptotic convergence results. Then, as discussed in Section 2.1, we can improve the performance of local linear forests by modifying the tree-splitting procedure used to get the weights , and making it account for the fact that we will use local regression to estimate . We choose all tuning parameters, including leaf size and , via cross-validation. As a first encouraging result we note that, in the motivating example from Figure 1, local linear forests have improved upon the bias of standard forests.
These improvements extend to many other types of forests, such as quantile regression forests
(Meinshausen, 2006) or, more broadly, generalized random forests (Athey, Tibshirani, and Wager, 2018). An extension of primary interest is the orthogonalized causal forests proposed by Athey, Tibshirani, and Wager (2018), which we discuss in Section 3 in detail; other cases are analogous.Theorem 1 gives a Central Limit Theorem for the predictions from a local linear forest at a given test point , specifying the asymptotic convergence rate and its dependence on subsampling and smoothness of . This allows us to build pointwise Gaussian confidence intervals, giving practitioners applicable uncertainty quantification. Observe that in Figure 1, the bias from regression forest predictions affects not only the prediction curve but also the corresponding confidence intervals, which are not centered on the true function. Local linear forests, in addressing this bias issue, improve over regression forests in both predictive performance and confidence interval coverage. Strikingly, our local linear forest confidence intervals simultaneously achieve better coverage and are shorter than those built using regression forests.
A simple form of (3), without regularization or modified tree-splitting procedures, was also considered in a recent paper by Bloniarz, Talwalkar, Yu, and Wu (2016). However, Bloniarz, Talwalkar, Yu, and Wu (2016) only report modest performance improvements over basic regression forests; for example, on the “Friedman function” they report roughly a 5% reduction in mean-squared error. In contrast, we find fairly large, systematic improvements from local linear forests; see, e.g., Figure 4 for corresponding results on the same Friedman function. It thus appears that our algorithmic modifications via regularization and optimized splitting play a qualitatively important role in getting local linear forests to work well. These empirical findings are also mirrored in our theory. For example, in order to prove rates of convergence for local linear forests that can exploit smoothness of and improve over corresponding rates available for regression forests, we need an appropriate amount of regularization in (3).
Finally, we note that one can also motivate local linear forests from the starting point of local linear regression. Despite working well in low dimensions, classical approaches to local linear regression are not applicable to even moderately high-dimensional problems.^{1}^{1}1
The curse of dimensionality for kernel weighting is well known. The popular core
R function loess (R Core Team, 2018) allows only 1-4 predictors, while locfit (Loader, 2013) crashes on the simulation from (1) with . In contrast, random forests are adept at fitting high-dimensional signals, both in terms of their stability and computational efficiency. From this perspective, random forests can be seen as an effective way of producing weights to use in local linear regression. In other words, local linear forests aim to combine the strength of random forests in fitting high dimensional signals, and the ability of local linear regression to capture smoothness.An implementation of local linear forests, compliant with the assumptions detailed in Section 4, is available in the R package grf.
Random forests were first introduced by Breiman (2001), building on the work of Breiman, Friedman, Stone, and Olshen (1984) on recursive partitioning (CART), Breiman (1996) on bagging, and Amit and Geman (1997) on randomized trees. Bühlmann and Yu (2002) shows how the bagging makes forests smoother than single trees, while Biau (2012) and Scornet, Biau, and Vert (2015) establishes asymptotic risk consistency of random forests under specific assumptions. More sophisticated tree-based ensembles motivated by random forests have been proposed by Basu, Kumbier, Brown, and Yu (2018), who iteratively grow feature-weighted tree ensembles that perform especially well for discovering interactions, Zhou and Hooker (2018), who consider a hybrid between random forests and boosting, and Zhu, Zeng, and Kosorok (2015), who do deeper search during splitting to mitigate the greediness of CART. Linero and Yang (2018) propose a Bayesian regression tree ensemble tailored to learning smooth, sparse signals and prove posterior minimaxity under certain conditions, highlighting the promise of tree-based methods that can adapt to smoothness.
The idea of considering random forests as an adaptive kernel method has been proposed by several papers. Hothorn, Lausen, Benner, and Radespiel-Troger (2004) suggest using weights from survival trees and gives compelling simulation results, albeit to our knowledge no theoretical guarantees. Meinshausen (2006) proposes this technique for quantile regression forests and gives asymptotic consistency of the resulting predictions. Athey, Tibshirani, and Wager (2018) leverage this idea to present generalized random forests as a method for solving heterogeneous estimating equations. They derive an asymptotic distribution and confidence intervals for the resulting predictions. Local linear forests build on this literature; the difference being that we use the kernel-based perspective on forests to exploit smoothness of rather than to target more complicated estimands (such as a quantile).
Early versions of confidence intervals for random forests, backed by heuristic arguments and empirical evidence, were proposed by
Sexton and Laake (2009) and Wager, Hastie, and Efron (2014). Mentch and Hooker (2016) then established asymptotic normality of random forests where each tree depends on a small subsample of training examples (so that there may be asymptotic bias), while Wager and Athey (2018) provide a characterization of forests that allows for larger subsamples, deriving both asymptotic normality and valid confidence intervals. The confidence intervals proposed here are motivated by the algorithm of Sexton and Laake (2009), and build on the random forest delta method developed by Athey, Tibshirani, and Wager (2018), taking advantage of improved subsampling rates for improved coverage.As mentioned in the introduction, a predecessor to this work is a paper by Bloniarz, Talwalkar, Yu, and Wu (2016), who consider local linear regression with supervised weighting functions, including ones produced by a forest. The main differences between our method and that of Bloniarz, Talwalkar, Yu, and Wu (2016) is that they do not adapt the tree-splitting procedure to account for the local linear correction, and do not consider algorithmic features—such as ridge penalization—that appear to be needed to achieve good performance both in theory and in practice. Additionally, our method is flexible to forests targeting any heterogeneous estimating equation, and in particular to causal forests. On the formal side, Bloniarz, Talwalkar, Yu, and Wu (2016) prove consistency of their method; however, they do not establish rates of convergence and thus, unlike in our Theorem 1, they cannot use smoothness of to improve convergence properties of the forest. They also do not provide a central limit theorem or confidence intervals.
More broadly, there is an extensive body of work on model-based trees that explores different combinations of local regression and trees. Torgo (1997) and Gama (2004) study functional models for tree leaves, fitting models instead of local averages at each node. Karalič (1992) suggests fitting a local linear regression in each leaf, and Torgo (1997) highlights the performance of kernel methods in general for MOB tree methods. Menze et al. (2011)
propose oblique random forests that learn split directions using the results from ridge regression, similar to our work developing splitting rules for local linear forests but more in the spirit of linear discriminant analysis (LDA). Case-specific random forests, introduced by
Xu, Nettleton, and Nordman (2016), use local information to upweight training samples not at the prediction step, but during the bootstrap to generate datasets for each tree. Zeileis, Hothorn, and Hornik (2008), and later Rusch and Zeileis (2013), propose not only prediction, but recursive partitioning via fitting a separate model in each leaf, similar to the residual splitting strategy of local linear forests. Local linear forests complement this literature; they differ, however, in treating forests as a kernel method. The leaf nodes in a local linear forest serve to provide neighbor information, and not local predictions.Our work is motivated by the literature on local linear regression and maximum likelihood estimation (Abadie and Imbens, 2011; Cleveland and Devlin, 1988; Fan and Gijbels, 1996; Heckman, Ichimura, and Todd, 1998; Loader, 1999; Newey, 1994; Stone, 1977; Tibshirani and Hastie, 1987). Stone (1977) introduces local linear regression and gives asymptotic consistency properties. Cleveland (1979) expands on this by introducing robust locally weighted regression, and Fan and Gijbels (1992) give a variable bandwidth version. Cleveland and Devlin (1988) explore further uses of locally weighted regression. Local linear regression has been particularly well-studied for longitudinal data, as in Li and Hsing (2010) and Yao, Muller, and Wang (2005). Cheng, Fan, and Marron (1997) use local polynomials to estimate the value of a function at the boundary of its domain. Abadie and Imbens (2011) show how incorporating a local linear correction improves nearest neighbor matching procedures.
Local linear forests use a random forest to generate weights that can then serve as a kernel for local linear regression. Suppose we have training data with , for . Consider using a random forest to estimate the conditional mean function at a fixed test point . Traditionally, random forests are viewed as an ensemble method, where tree predictions are averaged to obtain the final estimate. Specifically, for each tree in a forest of trees, we find the leaf with predicted response , which is simply the average response of all training data points assigned to . We then predict the average .
An alternate angle, advocated by Athey, Tibshirani, and Wager (2018), Hothorn, Lausen, Benner, and Radespiel-Troger (2004), and Meinshausen (2006), entails viewing random forests as adaptive weight generators, as follows. Equivalently write as
where the forest weight is
(4) |
Notice that by construction, and for each , . Athey, Tibshirani, and Wager (2018) use this perspective to harness random forests for solving weighted estimating equations, and give asymptotic guarantees on the resulting predictions.
Local linear forests solve the locally weighted least squares problem (3) with weights (4). Note that equation (3) has a closed-form solution, given below, following the closed-form solutions for ridge regression and classical local linear regression. Throughout this paper, we let be the diagonal matrix with , and let denote the diagonal matrix with and , so as to not penalize the intercept. We define , the centered regression matrix with intercept, as and . Then the local linear forest estimator can be explicitly written as
(5) |
Qualitatively, we can think of local linear regression as a weighting estimator with a modulated weighting function whose
-moments are better aligned with the test point
: with and , where the last relation would be exact without a ridge penalty (i.e., with ). Explicitly, , where is a vector of zeroes with 1 in the -th column.With the perspective of generating a kernel for local linear regression in mind, we move to discuss the appropriate splitting rule for local linear forests.
Random forests traditionally use Classification and Regression Trees (CART) from Breiman, Friedman, Stone, and Olshen (1984) splits, which proceed as follows. We consider a parent node with observations . For each candidate pair of child nodes , we take the mean value of inside each child, and . Then we choose to minimize the sum of squared errors
Knowing that we will use the forest weights to perform a local regression, we neither need nor want to use the forest to model strong, smooth signals; the final regression step can model the strong global effects. Instead, in the parent node , we run a ridge regression to predict from .
(6) |
We then run a standard CART split on the residuals , modeling local effects in the forest and regressing global effects back in at prediction. Observe that, much like the CART splitting rule, an appropriate software package can enforce that a forest using this splitting rule splits on every variable and gives balanced splits; hence this splitting rule may be used to grow honest and regular trees.
To explore the effects of CART and residual splitting rules, we consider this simulation first introduced by Friedman (1991). Generate independently and identically distributed and model from
(7) |
for . This model has become a popular study for evaluating nonparametric regression methods; see for example Chipman, George, and McCulloch (2010) and Taddy, Chen, Yu, and Wyle (2015). It is a natural setup to test how well an algorithm handles interactions , its ability to pick up a quadratic signal , and how it simultaneously models strong linear signals .
Figure 2 displays the split frequencies from an honest random forest (left) using standard CART splits, and a local linear forest (right). The x-axis is indexed by variable, here 1 through 5, and the y-axis gives tree depth for the first 4 levels of tree splits. Tiles are darkened according to how often trees in the forest split on that variable; a darker tile denotes more splits at that tree depth. CART splits very frequently on variable 4, which contributes the strongest linear signal, especially at the top of the tree but consistently throughout levels. Local linear forests rarely split on either of the strong linear signals, instead spending splits on the three that are more difficult to model.
Consider the following experiment, which highlights the benefit of the proposed splitting rule. We generate independently and uniformly over . We hold a cubic signal constant across simulations, and on each run increase the dimension and add another linear signal. Formally, we let and generate responses
(8) |
For example, at simulation we have and hence we model . RMSE is displayed in Figure 3.
In low dimension and with few linear signals, all three methods are comparable. However, they begin to differ quickly. Random forests are not designed for models with so many global linear signals, and hence their RMSE increases dramatically with . Moreover, as we add more linear effects, the gap between the two candidate splitting rules grows; heuristically, it becomes more important not to waste splits, and the residual splitting rule gives greater improvements. Note that at a certain point, however, the gap between splitting rules stays constant. Once the forests simply cannot fit a more complex linear function with a fixed amount of data, the marginal benefits of the residual splitting rule level out. We show this to emphasize the contexts in which this splitting rule meaningfully affects the results.
Unless noted otherwise, all random forests used in this paper are grown using a type of sub-sample splitting called “honesty”, used by Wager and Athey (2018) to derive the asymptotic distribution of random forest prediction. As outlined in Procedure 1 of Wager and Athey (2018), each tree in an honest forest is grown using two non-overlapping subsamples of the training data, call them and . We first choose a tree structure using only the data in , and write as the boolean indicator for whether the points and fall into the same leaf of . Then, in a second step, we define the set of neighbors of as ; this neighborhood function is what we then use to define the forest weights in (4). Note that we do not use the observed outcomes form sample to select split points; but, to ensure that each node has a certain fraction of observations from its parent, we may use the covariates from to select splits. This modification allows us to grow honest forests that comply with assumption 1, so that our theory is consistent and matches the implementation available online.
This type of subsample-splitting lets us control for potential overfitting when growing the tree , because the samples which are in the neighborhood were held out when growing . We note that, despite considerable interest in the literature, there are no available consistency results for random forests with fully grown trees that do not use honesty. Biau (2012) uses a different type of sample splitting, Biau, Devroye, and Lugosi (2008) and Wager and Walther (2015) rely on large leaves, while the results of Scornet, Biau, and Vert (2015) on fully grown trees rely on an unchecked high-level assumption. Thus, we choose to build our forests with honesty on by default.
Empirically, honesty can improve or worsen predictions; when honesty impairs prediction, regression forest do not have enough expressive power to counteract it, and correspondingly they suffer. Local linear corrections can help mitigate this problem without sacrificing honesty. For a detailed treatment of practical honest forests, see Section 5.
We recommend selecting ridge penalties by cross-validation, which can be done automatically in the R package. It is often reasonable to choose different values of for forest training and for local linear prediction. During forest growth, equation (6) gives ridge regression coefficients in each parent leaf. As trees are grown on subsamples of data, over-regularization at this step is a danger even in large leaves. Consequently, small values of are advisable for penalization on regressions during forest training. Furthermore, as we move to small leaves, computing meaningful regression coefficients becomes more difficult; the ridge regression can begin to mask signal instead of uncovering it. A heuristic that performs well in practice is to store the regression estimates on parent leaves . When the child leaf size shrinks below a cutoff, we use from the parent node to calculate ridge residual pseudo-outcomes, instead of estimating them from unstable regression coefficients on the small child dataset. In practice, this helps to avoid the pitfalls of over-regularizing and of regressing on a very small dataset when growing the forest. At the final regression prediction step (5
), however, a larger ridge penalty can control the variance and better accommodate noisy data.
With increasingly high-dimensional data, feature selection before prediction can significantly reduce error and decrease computation time. Often, a dataset will contain only a small number of features with strong global signals. In other cases, a researcher will know in advance which variables are consistently predictive or of special interest. In these cases, it is reasonable to run the regression prediction step on this smaller subset of predictors expected to contribute overarching trends. Such covariates, if they are not already known, can be chosen by a stepwise regression or lasso, or any other technique for automatic feature selection. Last, it is worth noting that these tuning suggestions are pragmatic in nature; the theoretical guarantees provided in Section
4 are for local linear forests trained without these heuristics.For conciseness, the majority of this paper focuses on local linear forests for non-parametric regression; however, a similar local linear correction can also be applied to quantile regression forests (Meinshausen, 2006), causal forests (Wager and Athey, 2018) or, more broadly, to any instance of generalized random forests (Athey, Tibshirani, and Wager, 2018). To highlight this potential, we detail the method and discuss an example for heterogeneous treatment effect estimation using local linear causal forests.
As in Athey, Tibshirani, and Wager (2018), we frame our discussion in terms of the Neyman-Rubin causal model (Imbens and Rubin, 2015). Suppose we have data , where are covariates, is the response, and is the treatment. In order to define the causal effect of the treatment , we posit potential outcomes for individual , and , corresponding to the response the subject would have experienced in the control and treated conditions respectively; we then observe . We seek to estimate the conditional average treatment effect (CATE) of , namely . Assuming uncounfoundedness (Rosenbaum and Rubin, 1983),
(9) |
the CATE is in general identified via non-parametric methods. We assume unconfoundedness when discussing treatment effect estimation through this paper. Wager and Athey (2018) proposed an extension of random forests for estimating CATEs, and Athey, Tibshirani, and Wager (2018) improved on the method by making it locally robust to confounding using the transformation of Robinson (1988). Here, we propose a local linear correction to the method of Athey, Tibshirani, and Wager (2018), the orthogonalized causal forest, to strengthen its performance when is smooth.
Local linear causal forests start as orthogonalized causal forests do, by estimating the nuisance components
(10) |
using a local linear forest. We then add a regularized adjustment to the standard causal forest equation,
(11) |
where the -superscript denotes leave-one-out predictions from the nuisance models. If nuisance estimates are accurate, the intercept should be 0; however, we leave it in the optimization for robustness. We cross-validate local linear causal forests (including and ) by minimizing the -learning criterion recommended by Nie and Wager (2017):
(12) |
To illustrate the value of the local linear causal forests, we consider a popular dataset from the General Social Survey (GSS) that explores how word choice reveals public opinions about welfare (Smith, Davern, Freese, and Hout, 2018). Individuals filling out the survey from 1986 to 2010 answered whether they believe the government spends too much, too little, or the right amount on the social safety net. ^{2}^{2}2The authors acknowledge that NORC at the University of Chicago, and all parties associated with the General Social Survey (including the principal investigators), offer the data and documentation “‘as is” with no warranty and assume no legal liability or responsibility for the completeness, accuracy, or usefulness of the data, or fitness for a particular purpose. GSS randomly assigned the wording of this question, such that the social safety net was either described as “welfare” or “assistance to the poor”. This change had a well-documented effect on responses due to the negative perception many Americans have about welfare; moreover, there is evidence of heterogeneity in the CATE surface (Green and Kern, 2012).
Here, we write if the -th sample received the “welfare” treatment, and define if the -th response was that the government spends too much on the social safety net. Thus, a positive treatment effect indicates that, conditionally on , using the phrase “welfare” as opposed to “assistance to the poor” increases the likelihood that the -th subject says the government spends too much on the social safety net. We base our analysis on covariates, including income, political views, age, and number of children. The full dataset has observations; here, to make the problem interesting, we test our method on smaller subsamples of the data.
In order to compare the performance of both methods, we use the transformed outcome metric of Athey and Imbens (2016). Noting that , they suggest examining the following test set error criterion
(13) |
If we can estimate , then (13
) gives an unbiased estimate of the mean-squared error of
. Here, we estimate via out-of-bag estimation on the full dataset with , assuming that a local linear forest with such a large sample size has negligible error.Table 1 has error estimates for both types of forests using (13), and verifies that using the local linear correction improves empirical performance across different subsample sizes. Practically, we can view this change as enabling us to get good predictions on less data, a powerful improvement in cases like survey sampling where data can be expensive and difficult to attain. Section 5 contains a more detailed simulation study of local linear causal forests, comparing them with a wider array of baseline methods.
Subsample size | 200 | 400 | 800 | 1200 | 1500 | 2000 |
---|---|---|---|---|---|---|
CF | 0.035 | 0.021 | 0.015 | 0.014 | 0.011 | 0.007 |
LLCF | 0.027 | 0.017 | 0.013 | 0.013 | 0.011 | 0.006 |
Returning to the regression case, before we delve into the main result and its proof, we briefly discuss why the asymptotic behavior of local linear forests cannot be directly derived from the existing results of Athey, Tibshirani, and Wager (2018). This is due to a key difference in the dependence structure of the forest. In the regression case, a random forest prediction at is , where, due to honesty, is independent of given . This conditional independence plays a key role in the argument of Wager and Athey (2018). Analogously to , we can write the local linear forest prediction as a weighted sum,
(14) |
where we use notation from (5). At a first glance, indeed looks like the output of a regression forest trained on observations . However, the dependence structure of this object is different. In a random forest, we make and independent by conditioning on . For a local linear forest, however, conditioning on will not guarantee that and are independent, thus breaking a key component in the argument of Wager and Athey (2018).
We now give a Central Limit Theorem for local linear forest predictions, beginning by stating assumptions on the forest.
(Regular Trees) We assume that the forest grows regular trees: that the trees are symmetric in permutations of training data index, split on every variable with probability bounded from below by
, and the trees are grown to depth for some ; and the trees and are balanced in that each split puts at least a fraction of parent observations into each child node.(Honest Forests) We assume that the forest is honest as described in Section 2.3, meaning that two distinct and independent subsamples are selected for each tree. Only the outcome values from one subsample are used to select the splits, and only those from the other to estimate parameters in the nodes.
Subsampling plays a central role in our asymptotic theory, as this is what allows us to prove asymptotic normality by building on the work of Efron and Stein (1981). Moreover, subsampling is what we use to tune the bias-variance trade-off of the forest: Forests whose trees are grown on small subsamples have lower bias but higher variance (and vice-versa).
In order to establish asymptotic unbiasedness of forests, Wager and Athey (2018) require a subsample size of at least , with
(15) |
This convergence rate of a traditional honest random forest, however, does not improve when is smooth. Here, we show that by using a local regression adjustment and assuming smoothness of , we can grow trees on smaller subsamples of size (16) without sacrificing asymptotic variance control. This allows us to decrease the bias (and improve the accuracy) of our estimates.
Our main result establishes asymptotic normality of local linear forest predictions, and gives this improved subsampling rate. The rate given depends on our bounds on the squared radius of a leaf, detailed in Appendix B in the statement and proof of Lemma 6.
Suppose we have training data identically and independently distributed on , where the density of is bounded away from infinity. Suppose furthermore that is twice continuously differentiable and that . Last, say that our trees are grown according to Assumptions 1 and 2, with and subsamples of size with , for
(16) |
and that the ridge regularization parameter grows at rate
Then for each fixed test point , there is a sequence such that
We remark that the rate of convergence is improved compared to the rate (15) given in Athey, Tibshirani, and Wager (2018), since we have assumed and consequently exploited smoothness.
This section complements our main result, as the Central Limit Theorem becomes far more useful when we have valid standard error estimates. Following
Athey, Tibshirani, and Wager (2018), we use the random forest delta method to develop pointwise confidence intervals for local linear forest predictions.The random forest delta method starts from a solution to a local estimating equation with random forest weights :
(17) |
Athey, Tibshirani, and Wager (2018) then propose estimating the error of these estimates as
(18) |
where is the slope of the expected estimating equation at the optimum, and
(19) |
The upshot is that measures the variance of an (infeasible) regression forest with response depending on the score function at the optimal parameter values, and that we can in fact estimate using tools originally developed for variance estimation with regression forests. Meanwhile, can be estimated directly using standard methods.
With local linear forests, solve (17) with score function
(20) |
where we again use notation defined in (5) and (14). First, we note that we have access to a simple and explicit estimator for : Twice differentiating (3) with respect to the parameters gives
(21) |
which we can directly read off of the forest. In this paper, we are only interested in confidence intervals for , which can now be represented in terms of :
(22) |
Next, we follow Athey, Tibshirani, and Wager (2018), and proceed using the bootstrap of little bags construction of Sexton and Laake (2009). At a high level this method is a computationally efficient half-sampling estimator. For any half sample , let be the average of the empirical scores averaged over trees that only use data from the half-sample :
(23) |
where is the set of trees that only use data from the half-sample , and contains neighbors of in the -th tree (throughout, we assume that the subsample used to grow each tree has less than samples). Then, a standard half-sampling estimator would simply use (Efron, 1982)
(24) |
Now, carrying out the full computation in (24) is impractical, and naive Monte Carlo approximations suffer from bias. However, as discussed in Athey, Tibshirani, and Wager (2018) and Sexton and Laake (2009), bias-corrected randomized algorithms are available and perform well. Here, we do not discuss these Monte Carlo bias corrections, and instead refer to Section 4.1 of Athey, Tibshirani, and Wager (2018) for details. Simulation results on empirical confidence interval performance are given in Section 5.3.
In this section, we compare local linear forests, random forests, BART (Chipman, George, and McCulloch, 2010)
, and gradient boosting
(Friedman, 2001). We also include a lasso-random forest baseline for local linear forests: on half of the training data, run a lasso (Tibshirani, 1996) regression; on the second half, use a random forest to model the corresponding residuals. Like local linear forests, this method combines regression and forests, making it a natural comparison; it is similar in spirit to the tree-augmented Cox model of Su and Tsai (2005), who combine pruned CART trees with proportional hazards regression.Random forests are trained using the R package grf (Tibshirani et al., 2018), and are cross-validated via the default parameter tuning in grf, which selects values for mtry, minimum leaf size, sample fraction, and two parameters (alpha and imbalance penalty) that control split balance. Local linear forests are tuned equivalently with additional cross-validation for regularization parameters. Variables for the regression at prediction are selected via the lasso. Because existing theoretical results for random forests rely on honesty, all random forests are built with the honest construction. All lasso models are implemented via glmnet (Friedman, Hastie, and Tibshirani, 2010) and cross-validated with their automatic cross-validation feature. Local linear regression is not included in these comparisons, since the implementations loess and locfit both fail for on this simulation; in Appendix A, we include Table 6, which compares local linear regression with this set of methods on lower dimensional linear models. Unless otherwise specified, all reported errors are root mean-squared error (RMSE) on 1000 test points averaged over 50 simulation runs.
Gradient boosted trees are implemented by the R package XGBoost (Chen et al., 2019), cross-validated on a grid over number of rounds, step size, and maximum depth. BART for treatment effect estimation is implemented following Hill (2011). As is standard, we use the BART package (McCulloch et al., 2019) without any additional tuning. The motivation for not tuning is that if we want to interpret the BART posterior in a Bayesian sense (as is often done), then cross-validating on the prior is hard to justify; and in fact most existing papers do not cross-validate BART.
The first design we study is Friedman’s example from equation (7). Figure 4 shows errors at fixed, with dimension varying from 10 to 50. There are two plots shown, to highlight the differences between error variance and . Table 4 reports a grid of errors for dimensions 10, 30, and 50, with and , and taking values of 5 and 20. The second design we consider is given in Section 1, as in equation (1). Again we test on a grid, letting dimension take values in 5 and 50, either 1000 or 5000, and at 0.1, 1, and 2. Errors are reported in Table 5.
The third simulation is designed to test how local linear forests perform in a more adversarial setting, where we expect random forests to outperform. We simulate i.i.d. and model responses as
(25) |
Here we test dimension and values of . For this simulation, we compare only honest random forests and local linear forests, in order to compare confidence intervals; we compute out of bag RMSE and average confidence interval coverage and length. Results are reported in Table 2.
Figure 4 shows RMSE from equation 7 at (left) and (right). In Section 2, we showed that local linear forests and standard regression forests split on very different variables when generating weights. Our intuition is that these are splits we have saved; we model the strong linear effects at the end with the local regression, and use the forest splits to capture more nuanced local relationships for the weights. Local linear forests consistently perform well as we vary the parameters, lending this credibility. The lasso-random forest baseline lines up closely with adaptive random forests in both simulations; both methods perform well, with lasso-random forest outperforming all other methods for the lowest dimension and . BART and random forests form the next tier of methods on the low noise case; in the high noise case, honest random forest are clustered with the other methods. Gradient boosting consistently reports higher RMSE. Table 4 in Appendix A shows the fuller RMSE comparison from Friedman’s model.
varied from 10 to 50. Plots is shown for error standard deviation
(left) and (right). Error was calculated in increments of 10, and averaged over 50 runs per method at each step. Methods evaluated are random forests (RF) local linear forests (LLF), lasso and random forests, boosted trees, and BART.We move to the second simulation setup, equation 1, meant to evaluate how methods perform in cases with a strong linear trend in the mean. Tree-based methods will be prone to bias on this setup, as the forests cannot always split on , and because the signal is global and smooth. Full error results on the range of competing methods are given in Appendix A in Table 5. Local linear forests do quite well here; they detect the strong linear signal in the tail, as we saw in Figure 1, and model it successfully throughout the range of the feature space. Gradient boosted trees perform very well in the low-noise case, but their performance sharply declines when we increase .
We also examine the behavior of our confidence intervals in each of the given simulation setups, shown here in Table 2. We give average coverage of confidence intervals from 50 repetitions of random forests and local linear forests on draws from the simulation setups in equations 1, 7, and 25, as well as average confidence interval length and RMSE. On equation 1, local linear forest confidence intervals are consistently shorter and closer to coverage, with correspondingly lower mean squared error. Here, both random forests and local linear forests achieve fairly low RMSE and coverage at or above . For the setup in equation 7, on the other hand, neither method achieves higher than coverage, and the local linear forest confidence intervals are longer than the random forest confidence intervals. This is an encouraging result, indicating that local linear forests confidence intervals are more adaptable to the context of the problem; we would hope for long confidence intervals when detection is difficult. Moreover, the poor coverage we see sometimes across both methods is likely because the confidence intervals are built on asymptotic results, which may not apply in these relatively low settings.
We include the approximate step function in equation 25 to highlight a favorable example for random forests. Local linear forests see equivalent or better coverage on this setup, although at the cost of longer confidence intervals in low dimensions. Especially on small training datasets, local linear forests also improve on random forest predictions in RMSE.
Setup | d | n | Coverage | Length | RMSE | |||
---|---|---|---|---|---|---|---|---|
Equation 1 | RF | LLF | RF | LLF | RF | LLF | ||
5 | 500 | 0.90 | 0.94 | 2.40 | 2.35 | 0.63 | 0.55 | |
5 | 2000 | 0.97 | 0.96 | 2.23 | 1.85 | 0.43 | 0.35 | |
5 | 10000 | 0.97 | 0.98 | 1.41 | 2.20 | 0.28 | 0.42 | |
20 | 500 | 0.88 | 0.92 | 2.23 | 2.13 | 0.68 | 0.55 | |
20 | 2000 | 0.89 | 0.96 | 2.14 | 2.12 | 0.17 | 0.09 | |
20 | 10000 | 0.97 | 0.99 | 1.23 | 0.89 | 0.24 | 0.13 | |
Equation 7 | RF | LLF | RF | LLF | RF | LLF | ||
5 | 500 | 0.54 | 0.65 | 3.56 | 3.82 | 2.36 | 2.03 | |
5 | 2000 | 0.63 | 0.69 | 3.17 | 3.21 | 1.77 | 1.58 | |
5 | 10000 | 0.70 | 0.75 | 2.75 | 2.77 | 1.32 | 1.18 | |
20 | 500 | 0.45 | 0.59 | 4.06 | 4.61 | 8.82 | 4.85 | |
20 | 2000 | 0.57 | 0.64 | 3.55 | 4.22 | 5.25 | 3.20 | |
20 | 10000 | 0.52 | 0.70 | 2.28 | 2.89 | 1.83 | 1.46 | |
Equation 25 | RF | LLF | RF | LLF | RF | LLF | ||
5 | 500 | 0.85 | 0.89 | 3.26 | 3.46 | 1.50 | 0.90 | |
5 | 2000 | 0.90 | 0.92 | 2.54 | 2.82 | 0.52 | 0.45 | |
5 | 10000 | 0.82 | 0.92 | 1.47 | 1.36 | 0.40 | 0.3 | |
20 | 500 | 0.85 | 0.89 | 3.36 | 3.19 | 1.50 | 0.98 | |
20 | 2000 | 0.90 | 0.90 | 2.71 | 2.36 | 0.62 | 0.46 | |
20 | 10000 | 0.87 | 0.92 | 1.94 | 1.55 | 0.58 | 0.37 |
As a real-data example, we include the well-known California housing data. The data contain summary statistics for housing prices in a California district from 1990, first introduced in a paper from Pace and Barry (1997) and downloaded from Kaggle. To make the problem more difficult (and more relevant to our method), we examine randomly chosen subsamples of size and compare errors of predicting median house income. Boxplots in Figure 5 show squared error (scaled down by due to the large units of the outcome variable) of all methods described here.
In Section 3, we introduced a real-data example where the local linear extension of causal forests naturally applies. Evaluating errors empirically, however, is difficult, so we supplement that with a simulation also used by Wager and Athey (2018) in evaluating causal forests and Künzel, Sekhon, Bickel, and Yu (2017), used to evaluate their meta-learner called the X-learner. Here we let . We fix the propensity and , and generate a causal effect from each
(26) | |||
(27) |
We will assume unconfoundedness (Rosenbaum and Rubin, 1983); therefore, because we hold propensity fixed, this is a randomized controlled trial.
We compare local linear forests, causal forests, and X-BART, which is the X-learner using BART as a base-learner. Causal forests as implemented by grf are tuned via the automatic self-tuning feature. As in the prediction simulation studies, we do not cross-validate X-BART because the authors recommend X-BART specifically for when a user does not want to carefully tune. We acknowledge that this may hinder its performance. Local linear causal forests are tuned via cross-validation. On these simulations, consider relatively small sample sizes ranging from to with dimension . The goal of this simulation is to evaluate how effectively we can learn a smooth heterogeneous treatment effect in the presence of many noise covariates. Wager and Athey (2018) emphasize equation 26 as a simulation that demonstrates how forests can suffer on the boundary of a feature space, because there is a spike near . RMSE over 100 repetitions is reported in Table 3. We can see that in these simulations, local linear forests give a significant improvement over causal forests. We can also observe how local linear forests make up for smaller training datasets by adding additional regularization, which shrinks as we gain more training data. Both of these setups are reasonable tests for how a method can learn heterogeneity, and demonstrate potential for meaningful improvement with thoughtful variable selection and robustness to smooth heterogeneous signals.
Simulation 1 (equation 26) | Simulation 2 (equation 27) | |||||
n | X-BART | CF | LLCF | X-BART | CF | LLCF |
200 | 1.01 | 0.94 | 0.80 | 0.67 | 0.77 | 0.71 |
400 | 0.76 | 0.50 | 0.47 | 0.56 | 0.55 | 0.50 |
600 | 0.61 | 0.39 | 0.35 | 0.50 | 0.41 | 0.38 |
800 | 0.55 | 0.35 | 0.31 | 0.46 | 0.34 | 0.32 |
1000 | 0.50 | 0.33 | 0.30 | 0.44 | 0.29 | 0.28 |
1200 | 0.48 | 0.32 | 0.28 | 0.42 | 0.27 | 0.26 |
In this paper, we proposed local linear forests as a modification of random forests equipped to model smooth signals and fix boundary bias issues. We presented asymptotic theory showing that, if we can assume smoother signals, we can get better rates of convergence as compared to generalized random forests. We showed on the welfare dataset that local linear forests can effectively model smooth heterogeneous causal effects, and illustrated when and why they outperform competing methods. We also gave confidence intervals from the delta method for the regression prediction case, and demonstrated their effectiveness in simulations.
The regression adjustments in local linear forests prove especially useful when some covariates have strong global effects with moderate curvature. Furthermore, the adjustment provides centered predictions, adjusting for errors due to an asymmetric set of neighbors. Likely there is a useful polynomial basis corresponding to every situation where local linear forests perform well, but this requires hand-tuning the functional form for competitive performance, and is not automatically suited to a mix of smooth and non-smooth signals. For a departure from frequentist techniques, BART and Gaussian processes are both hierarchical Bayesian methods; BART can be viewed as a form of Gaussian process with a flexible prior, making BART the preferred baseline.
There remains room for meaningful future work on this topic. In some applications, we may be interested in estimating the slope parameter , rather than merely accounting for it to improve the precision of . While local linear forests may be an appropriate method for doing so, we have not yet explored this topic and think it could be of significant interest. We have also not considered the theoretical or empirical improvements that could arise from assuming higher order smoothness in the functions we are estimating; searching for additional optimality results in this setting could be another interesting research question.
Consistency of random forests and other averaging classifiers.
JMLR, 9:2015–2033, 2008.Locally weighted regression: An approach to regression analysis by local fitting.
Journal of the American Statistical Association, 83(403):596–610, 1988. doi: 10.1080/01621459.1988.10478639.Probability inequalities for sums of bounded random variables.
American Statistical Association Journal, March 1963.Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data.
Ann. Statist., 38(6):3321–3351, 12 2010. doi: 10.1214/10-AOS813.We include first Table 4, giving a full error comparison of the lasso-random forest baseline, BART, boosting, random forests, and local linear forests, on equation 7. Errors are reported on dimension ranging from 10 to 50, from 5 to 20, and and , averaged over 50 training runs.
RF | Lasso-honest RF | LLF | BART | XGBoost | |||
10 | 1000 | 5 | 2.33 | 2.12 | 2.03 | 2.49 | 3.04 |
10 | 5000 | 5 | 1.90 | 1.48 | 1.57 | 1.51 | 2.65 |
30 | 1000 | 5 | 2.82 | 2.41 | 2.11 | 2.84 | 3.13 |
30 | 5000 | 5 | 2.08 | 1.61 | 1.73 | 2.03 | 2.53 |
50 | 1000 | 5 | 3.00 | 2.48 | 2.12 | 2.40 | 3.17 |
50 | 5000 | 5 | 2.18 | 1.82 | 1.80 | 2.106 | 2.636 |
1000 | 10 | 20 | 3.19 | 3.41 | 3.40 | 6.45 | 9.91 |
5000 | 10 | 20 | 2.43 | 2.35 | 2.29 | 3.85 | 9.48 |
1000 | 30 | 20 | 4.17 | 3.98 | 3.68 | 7.60 | 9.58 |
5000 | 30 | 20 | 2.97 | 2.66 | 2.40 | 4.78 | 9.18 |
1000 | 50 | 20 | 4.25 | 4.45 | 3.88 | 8.05 | 8.78 |
5000 | 50 | 20 | 3.16 | 2.67 | 2.35 | 4.95 | 8.80 |
We include next Table 5, again giving a more complete error comparison of the lasso-random forest baseline, BART, boosting, random forests, and local linear forests, on equation 1. Errors are reported on dimension ranging from 5 to 20, from 0.1 to 2, and and , averaged over 50 training runs.
RF | Lasso- RF | LLF | BART | XGBoost | |||
---|---|---|---|---|---|---|---|
5 | 1000 | 0.1 | 0.10 | 0.06 | 0.02 | 0.27 | 0.07 |
5 | 5000 | 0.1 | 0.06 | 0.02 | 0.02 | 0.22 | 0.06 |
50 | 1000 | 0.1 | 0.29 | 0.18 | 0.11 | 0.52 | 0.07 |
50 | 5000 | 0.1 | 0.18 | 0.10 | 0.07 | 0.62 | 0.06 |
5 | 1000 | 1 | 0.21 | 0.24 | 0.14 | 0.47 | 0.56 |
5 | 5000 | 1 | 0.15 | 0.11 | 0.09 | 0.26 | 0.52 |
50 | 1000 | 1 | 0.41 | 0.39 | 0.20 | 0.82 | 0.53 |
50 | 5000 | 1 | 0.23 | 0.21 | 0.10 | 0.57 | 0.52 |
5 | 1000 | 2 | 0.31 | 0.55 | 0.26 | 0.69 | 1.21 |
5 | 5000 | 2 | 0.25 | 0.28 | 0.21 | 0.40 | 1.18 |
50 | 1000 | 2 | 0.47 | 0.27 | 0.24 | 0.89 | 1.22 |
50 | 5000 | 2 | 0.33 | 0.27 | 0.15 | 0.70 | 0.96 |
To close this section, we consider some basic linear and polynomial models in low dimensions, in order to effectively compare local linear forests with local linear regression. We simulate and model responses from two models,
(28) | ||||
(29) |
where and . RMSE on the truth is reported, averaged over 50 runs, for lasso, local linear regression, BART, random forests, adaptive random forests, and local linear forests. In the simple linear case in equation 28, we see that lasso outperforms the other methods, as we would expect; in the polynomial given in equation 29, local linear regression performs the best, followed by BART ( case) and local linear forests ( cases).
Setup | Lasso | LLR | BART | RF | LLF | |
---|---|---|---|---|---|---|
Equation 28 | 1 | 0.12 | 0.15 | 0.48 | 0.73 | 0.22 |
5 | 0.39 | 0.92 | 1.27 | 1.25 | 0.96 | |
10 | 0.70 | 1.70 | 2.37 | 1.76 | 1.56 | |
Equation 29 | 1 | 1.55 | 0.22 | 0.50 | 0.86 | 0.69 |
5 | 1.55 | 0.92 | 1.31 | 1.32 | 1.28 | |
10 | 1.66 | 1.44 | 1.83 | 1.70 | 1.68 |
We now give a proof of Theorem 1, beginning by decomposing into bias and variance , the latter of which we will approximate by a regression forest. Define the diameter (and corresponding radius) of a tree leaf as the length of the longest line segment that can fit completely inside of the leaf. Thanks to our assumed uniform bound on the second derivative of , a Taylor expansion of around yields the following decomposition,
(30) |
Here is the average squared radius of leaves in the forest, where the diameter of a leaf is defined as the length of the longest line segment contained inside of it. We have isolated two error terms
(31) | ||||
(32) |
For simplicity, moving forward we will write , dropping the written dependence on .
To control the radius of a typical leaf containing (and thus the Taylor error in (30)), we use the following bound. Suppose independently, and let be any regular, random-split tree and let be the radius of its leaf containing the test point . Consider Lemma 2 of Wager and Athey [2018], which states the following: let be a regular, random-split tree and let denote its leaf containing . Suppose that independently. Then, for any , and for large enough ,
(33) |
Choose , so that , and . Consequently, by setting in the above bound, for sufficiently large we have
(34) | ||||
Next, to control the behavior of , we show that concentrates around its expectation. The proof of Lemma 2 uses concentration bounds for -statistics given by Hoeffding [1963].
Let be independent and identically distributed on . Let be forest weights from trees grown on subsamples of size and radius bounded by