Local Linear Forests

Rina Friedberg et al. (Stanford University), July 30, 2018

Random forests are a powerful method for non-parametric regression, but are limited in their ability to fit smooth signals, and can show poor predictive performance in the presence of strong, smooth effects. Taking the perspective of random forests as an adaptive kernel method, we pair the forest kernel with a local linear regression adjustment to better capture smoothness. The resulting procedure, local linear forests, enables us to improve on asymptotic rates of convergence for random forests with smooth signals, and provides substantial gains in accuracy on both real and simulated data.


1 Introduction

Random forests (Breiman, 2001) are a popular method for non-parametric regression that have proven effective across many application areas (Cutler, Edwards Jr, Beard, Cutler, Hess, Gibson, and Lawler, 2007; Díaz-Uriarte and De Andres, 2006; Svetnik, Liaw, Tong, Culberson, Sheridan, and Feuston, 2003). A major weakness of random forests, however, is their inability to exploit smoothness in the regression surface they are estimating. As an example, consider the following setup with a smooth trend in the conditional response surface: we simulate covariates $X_i$ independently from the uniform distribution, with responses

$$Y_i = \mu(X_i) + \varepsilon_i, \qquad (1)$$

where $\mu$ contains a strong, smooth trend and the errors $\varepsilon_i$ are independent mean-zero Gaussian noise with standard deviation $\sigma$, and our goal is to estimate $\mu(x) = \mathbb{E}[Y \mid X = x]$. The left panel of Figure 1 shows a set of predictions on this data from a random forest. The forest is unable to exploit strong local trends and, as a result, fits the target function using qualitatively the wrong shape: the prediction surface resembles a step function as opposed to a smooth curve.

Figure 1: Example 95% confidence intervals from generalized random forests (left) and local linear forests (right) on out-of-bag predictions from equation (1). Forests were trained using the R package grf (Tibshirani et al., 2018) and tuned via cross-validation. The true signal is shown as a smooth curve, dots correspond to forest predictions, and the upper and lower bounds of the pointwise confidence intervals are connected by dashed lines.

In order to address this weakness, we take the perspective of random forests as an adaptive kernel method. This interpretation follows work by Athey, Tibshirani, and Wager (2018), Hothorn, Lausen, Benner, and Radespiel-Troger (2004), and Meinshausen (2006), and complements the traditional view of forests as an ensemble method (i.e., an average of predictions made by individual trees). Specifically, random forest predictions can be written as

$$\hat\mu(x_0) = \sum_{i=1}^{n} \alpha_i(x_0)\, Y_i, \qquad (2)$$

where the weights $\alpha_i(x_0)$ defined in (4) encode the weight given by the forest to the $i$-th training example when predicting at $x_0$. Now, as is well known in the literature on non-parametric regression, if we want to fit smooth signals with some form of neighborhood averaging (e.g., kernel regression, $k$-NN, or matching for causal inference), it is helpful to use a local regression adjustment to correct for potential misalignment between a test point and its neighborhood (Abadie and Imbens, 2011; Cleveland and Devlin, 1988; Fan and Gijbels, 1996; Heckman, Ichimura, and Todd, 1998; Loader, 1999; Newey, 1994; Stone, 1977; Tibshirani and Hastie, 1987). These types of adjustments are particularly important near boundaries, where neighborhoods are asymmetric by necessity; but with many covariates the adjustments also matter away from boundaries, since local neighborhoods are often unbalanced due to sampling variation.
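To see concretely why such an adjustment helps near a boundary, consider the following toy sketch (ours, not from the paper): we compare a plain k-nearest-neighbor average with a locally weighted linear fit at the right edge of the support of a one-dimensional linear signal.

```r
# Toy illustration (not from the paper): neighborhood averaging vs. a local
# linear adjustment at the boundary of the support, for mu(x) = 2 * x on [0, 1].
set.seed(1)
n  <- 500
x  <- runif(n)
y  <- 2 * x + rnorm(n, sd = 0.1)
x0 <- 1                                  # predict at the right boundary

k    <- 50
nbrs <- order(abs(x - x0))[1:k]          # indices of the k nearest neighbors

# Plain neighborhood average: biased downward, since all neighbors lie below x0.
mu_avg <- mean(y[nbrs])

# Local linear adjustment on the same neighborhood (no ridge penalty here).
fit    <- lm(y[nbrs] ~ I(x[nbrs] - x0))
mu_loc <- unname(coef(fit)[1])           # intercept equals the fit at x0

c(truth = 2 * x0, average = mu_avg, local_linear = mu_loc)
```

The plain average is pulled toward the interior because its neighborhood is one-sided, while the local linear fit recenters the neighborhood at the test point.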

The goal of this paper is to improve the accuracy of forests on smooth signals using regression adjustments, potentially in many dimensions. By using the local regression adjustment, it is possible to adjust for asymmetries and imbalances in the set of nearby points used for prediction, ensuring that the weighted average of the feature vector of neighboring points is approximately equal to the target feature vector, and that predictions are centered. The improvement to forests from the regression adjustment is most likely to be large in cases where some features have strong effects with moderate curvature, so that regression adjustments are both effective and important. Many datasets have this characteristic; for example, labor market outcomes tend to improve with parents’ educational and labor market attainment, but there are diminishing returns.

In their simplest form, local linear forests take the forest weights $\alpha_i(x_0)$ and use them for local regression:

$$\bigl(\hat\mu(x_0),\, \hat\theta(x_0)\bigr) = \operatorname*{argmin}_{\mu,\,\theta} \left\{ \sum_{i=1}^{n} \alpha_i(x_0) \bigl(Y_i - \mu - (X_i - x_0)^\top \theta\bigr)^2 + \lambda \lVert \theta \rVert_2^2 \right\}. \qquad (3)$$

Here $\hat\mu(x_0)$ estimates the conditional mean function $\mu(x_0) = \mathbb{E}[Y \mid X = x_0]$, and $\hat\theta(x_0)$ corrects for the local trend in the covariates. The ridge penalty $\lambda \lVert \theta \rVert_2^2$ prevents overfitting to the local trend, and plays a key role both in simulation experiments and in the asymptotic convergence results. Then, as discussed in Section 2.1, we can improve the performance of local linear forests by modifying the tree-splitting procedure used to obtain the weights $\alpha_i(x_0)$, making it account for the fact that we will use local regression to estimate $\mu(x_0)$. We choose all tuning parameters, including the leaf size and $\lambda$, via cross-validation. As a first encouraging result we note that, in the motivating example from Figure 1, local linear forests have improved upon the bias of standard forests.

These improvements extend to many other types of forests, such as quantile regression forests (Meinshausen, 2006) or, more broadly, generalized random forests (Athey, Tibshirani, and Wager, 2018). An extension of primary interest is the orthogonalized causal forests proposed by Athey, Tibshirani, and Wager (2018), which we discuss in Section 3 in detail; other cases are analogous.

Theorem 1 gives a Central Limit Theorem for the predictions of a local linear forest at a given test point $x_0$, specifying the asymptotic convergence rate and its dependence on subsampling and on the smoothness of $\mu$. This allows us to build pointwise Gaussian confidence intervals, giving practitioners readily applicable uncertainty quantification. Observe that in Figure 1, the bias of the regression forest predictions affects not only the prediction curve but also the corresponding confidence intervals, which are not centered on the true function. Local linear forests, in addressing this bias issue, improve over regression forests in both predictive performance and confidence interval coverage. Strikingly, our local linear forest confidence intervals simultaneously achieve better coverage and are shorter than those built using regression forests.

A simple form of (3), without regularization or modified tree-splitting procedures, was also considered in a recent paper by Bloniarz, Talwalkar, Yu, and Wu (2016). However, Bloniarz, Talwalkar, Yu, and Wu (2016) only report modest performance improvements over basic regression forests; for example, on the “Friedman function” they report roughly a 5% reduction in mean-squared error. In contrast, we find fairly large, systematic improvements from local linear forests; see, e.g., Figure 4 for corresponding results on the same Friedman function. It thus appears that our algorithmic modifications via regularization and optimized splitting play a qualitatively important role in getting local linear forests to work well. These empirical findings are also mirrored in our theory. For example, in order to prove rates of convergence for local linear forests that can exploit smoothness of $\mu$ and improve over corresponding rates available for regression forests, we need an appropriate amount of regularization in (3).

Finally, we note that one can also motivate local linear forests from the starting point of local linear regression. Despite working well in low dimensions, classical approaches to local linear regression are not applicable to even moderately high-dimensional problems.[1] In contrast, random forests are adept at fitting high-dimensional signals, both in terms of their stability and computational efficiency. From this perspective, random forests can be seen as an effective way of producing weights to use in local linear regression. In other words, local linear forests aim to combine the strength of random forests in fitting high-dimensional signals with the ability of local linear regression to capture smoothness.

[1] The curse of dimensionality for kernel weighting is well known. The popular core R function loess (R Core Team, 2018) allows only 1-4 predictors, while locfit (Loader, 2013) crashes on the simulation from (1) at the dimensions considered here.

An implementation of local linear forests, compliant with the assumptions detailed in Section 4, is available in the R package grf.
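As a usage sketch (assuming a recent version of grf in which ll_regression_forest is exported; arguments and defaults may differ across versions), fitting a local linear forest and comparing it with a standard regression forest looks roughly as follows.

```r
# Sketch of fitting a local linear forest with grf on toy data (the signal
# below is illustrative, not equation (1) from the paper).
library(grf)

set.seed(1)
n <- 1000; d <- 20
X <- matrix(runif(n * d, -1, 1), n, d)
Y <- 2 * X[, 1] + rnorm(n)

llf <- ll_regression_forest(X, Y)            # local linear forest
rf  <- regression_forest(X, Y)               # standard honest regression forest

X_test    <- matrix(runif(200 * d, -1, 1), 200, d)
llf_preds <- predict(llf, X_test)$predictions
rf_preds  <- predict(rf, X_test)$predictions
```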

1.1 Related Work

Random forests were first introduced by Breiman (2001), building on the work of Breiman, Friedman, Stone, and Olshen (1984) on recursive partitioning (CART), Breiman (1996) on bagging, and Amit and Geman (1997) on randomized trees. Bühlmann and Yu (2002) show how bagging makes forests smoother than single trees, while Biau (2012) and Scornet, Biau, and Vert (2015) establish asymptotic risk consistency of random forests under specific assumptions. More sophisticated tree-based ensembles motivated by random forests have been proposed by Basu, Kumbier, Brown, and Yu (2018), who iteratively grow feature-weighted tree ensembles that perform especially well for discovering interactions; Zhou and Hooker (2018), who consider a hybrid between random forests and boosting; and Zhu, Zeng, and Kosorok (2015), who search more deeply during splitting to mitigate the greediness of CART. Linero and Yang (2018) propose a Bayesian regression tree ensemble tailored to learning smooth, sparse signals and prove posterior minimaxity under certain conditions, highlighting the promise of tree-based methods that can adapt to smoothness.

The idea of treating random forests as an adaptive kernel method has been proposed in several papers. Hothorn, Lausen, Benner, and Radespiel-Troger (2004) suggest using weights from survival trees and give compelling simulation results, albeit, to our knowledge, no theoretical guarantees. Meinshausen (2006) proposes this technique for quantile regression forests and gives asymptotic consistency of the resulting predictions. Athey, Tibshirani, and Wager (2018) leverage this idea to present generalized random forests as a method for solving heterogeneous estimating equations, and derive an asymptotic distribution and confidence intervals for the resulting predictions. Local linear forests build on this literature; the difference is that we use the kernel-based perspective on forests to exploit smoothness of $\mu$, rather than to target more complicated estimands (such as a quantile).

Early versions of confidence intervals for random forests, backed by heuristic arguments and empirical evidence, were proposed by Sexton and Laake (2009) and Wager, Hastie, and Efron (2014). Mentch and Hooker (2016) then established asymptotic normality of random forests in which each tree depends on a small subsample of training examples (so that there may be asymptotic bias), while Wager and Athey (2018) provide a characterization of forests that allows for larger subsamples, deriving both asymptotic normality and valid confidence intervals. The confidence intervals proposed here are motivated by the algorithm of Sexton and Laake (2009), and build on the random forest delta method developed by Athey, Tibshirani, and Wager (2018), taking advantage of improved subsampling rates to obtain better coverage.

As mentioned in the introduction, a predecessor to this work is a paper by Bloniarz, Talwalkar, Yu, and Wu (2016), who consider local linear regression with supervised weighting functions, including ones produced by a forest. The main differences between our method and that of Bloniarz, Talwalkar, Yu, and Wu (2016) are that they do not adapt the tree-splitting procedure to account for the local linear correction, and do not consider algorithmic features, such as ridge penalization, that appear to be needed to achieve good performance both in theory and in practice. Additionally, our method is flexible enough to cover forests targeting any heterogeneous estimating equation, and in particular causal forests. On the formal side, Bloniarz, Talwalkar, Yu, and Wu (2016) prove consistency of their method; however, they do not establish rates of convergence and thus, unlike in our Theorem 1, they cannot use smoothness of $\mu$ to improve the convergence properties of the forest. They also do not provide a central limit theorem or confidence intervals.

More broadly, there is an extensive body of work on model-based trees that explores different combinations of local regression and trees. Torgo (1997) and Gama (2004) study functional models for tree leaves, fitting models instead of local averages at each node. Karalič (1992) suggests fitting a local linear regression in each leaf, and Torgo (1997) highlights the performance of kernel methods in general for model-based tree methods. Menze et al. (2011) propose oblique random forests that learn split directions using the results from ridge regression, similar to our work developing splitting rules for local linear forests but more in the spirit of linear discriminant analysis (LDA). Case-specific random forests, introduced by Xu, Nettleton, and Nordman (2016), use local information to upweight training samples not at the prediction step, but during the bootstrap used to generate datasets for each tree. Zeileis, Hothorn, and Hornik (2008), and later Rusch and Zeileis (2013), propose not only prediction, but recursive partitioning via fitting a separate model in each leaf, similar to the residual splitting strategy of local linear forests. Local linear forests complement this literature; they differ, however, in treating forests as a kernel method: the leaf nodes in a local linear forest serve to provide neighborhood information, not local predictions.

Our work is motivated by the literature on local linear regression and maximum likelihood estimation (Abadie and Imbens, 2011; Cleveland and Devlin, 1988; Fan and Gijbels, 1996; Heckman, Ichimura, and Todd, 1998; Loader, 1999; Newey, 1994; Stone, 1977; Tibshirani and Hastie, 1987). Stone (1977) introduces local linear regression and gives asymptotic consistency properties. Cleveland (1979) expands on this by introducing robust locally weighted regression, and Fan and Gijbels (1992) give a variable bandwidth version. Cleveland and Devlin (1988) explore further uses of locally weighted regression. Local linear regression has been particularly well-studied for longitudinal data, as in Li and Hsing (2010) and Yao, Muller, and Wang (2005). Cheng, Fan, and Marron (1997) use local polynomials to estimate the value of a function at the boundary of its domain. Abadie and Imbens (2011) show how incorporating a local linear correction improves nearest neighbor matching procedures.

2 Local Linear Forests

Local linear forests use a random forest to generate weights that can then serve as a kernel for local linear regression. Suppose we have training data $(X_i, Y_i) \in \mathbb{R}^d \times \mathbb{R}$, for $i = 1, \dots, n$. Consider using a random forest to estimate the conditional mean function $\mu(x) = \mathbb{E}[Y \mid X = x]$ at a fixed test point $x_0$. Traditionally, random forests are viewed as an ensemble method, where tree predictions are averaged to obtain the final estimate. Specifically, for each tree $b = 1, \dots, B$ in the forest, we find the leaf $L_b(x_0)$ containing $x_0$, whose predicted response is simply the average response of all training data points assigned to that leaf. We then predict with the average of these $B$ per-tree predictions.

An alternate angle, advocated by Athey, Tibshirani, and Wager (2018), Hothorn, Lausen, Benner, and Radespiel-Troger (2004), and Meinshausen (2006), entails viewing random forests as adaptive weight generators, as follows. Equivalently write the forest prediction as

$$\hat\mu(x_0) = \sum_{i=1}^{n} \alpha_i(x_0)\, Y_i,$$

where the forest weight is

$$\alpha_i(x_0) = \frac{1}{B} \sum_{b=1}^{B} \frac{\mathbf{1}\{X_i \in L_b(x_0)\}}{\lvert L_b(x_0) \rvert}. \qquad (4)$$

Notice that $\alpha_i(x_0) \geq 0$ by construction, and that for each $x_0$, $\sum_{i=1}^{n} \alpha_i(x_0) = 1$. Athey, Tibshirani, and Wager (2018) use this perspective to harness random forests for solving weighted estimating equations, and give asymptotic guarantees on the resulting predictions.
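The weights in (4) can be computed directly from leaf memberships. The following base-R sketch (with hypothetical inputs of our own naming) takes, for each tree, the leaf identifier of every training point and of the test point, and returns the vector of forest weights.

```r
# Compute the forest weights alpha_i(x0) of equation (4). leaf_ids is an
# n x B matrix whose (i, b) entry is the id of the leaf of tree b containing
# training point i; x0_leaf[b] is the id of the leaf of tree b containing x0.
# With honest forests, only estimation-sample points should appear here.
forest_weights <- function(leaf_ids, x0_leaf) {
  n <- nrow(leaf_ids); B <- ncol(leaf_ids)
  alpha <- rep(0, n)
  for (b in 1:B) {
    in_leaf <- leaf_ids[, b] == x0_leaf[b]     # 1{X_i in L_b(x0)}
    if (any(in_leaf)) {
      alpha <- alpha + in_leaf / sum(in_leaf)  # divide by |L_b(x0)|
    }
  }
  alpha / B                                    # average over the B trees
}
```

By construction the returned weights are non-negative and sum to one, provided every tree places at least one training point in the leaf containing x0.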

Local linear forests solve the locally weighted least squares problem (3) with weights (4). Note that equation (3) has a closed-form solution, given below, following the closed-form solutions for ridge regression and classical local linear regression. Throughout this paper, we let $A$ denote the $n \times n$ diagonal matrix with $A_{ii} = \alpha_i(x_0)$, and let $J$ denote the $(d+1) \times (d+1)$ diagonal matrix with $J_{11} = 0$ and $J_{jj} = 1$ for $j \geq 2$, so as not to penalize the intercept. We define $\Delta$, the centered regression matrix with intercept, by $\Delta_{i1} = 1$ and $\Delta_{i,j+1} = X_{ij} - x_{0j}$ for $j = 1, \dots, d$. Then the local linear forest estimator can be written explicitly as

$$\begin{pmatrix} \hat\mu(x_0) \\ \hat\theta(x_0) \end{pmatrix} = \left( \Delta^\top A \Delta + \lambda J \right)^{-1} \Delta^\top A\, Y. \qquad (5)$$

Qualitatively, we can think of local linear regression as a weighting estimator with a modulated weighting function $\gamma_i(x_0)$ whose first moments are better aligned with the test point $x_0$: the prediction is $\hat\mu(x_0) = \sum_{i=1}^{n} \gamma_i(x_0) Y_i$ with $\sum_{i=1}^{n} \gamma_i(x_0) = 1$ and $\sum_{i=1}^{n} \gamma_i(x_0) X_i \approx x_0$, where the last relation would be exact without a ridge penalty (i.e., with $\lambda = 0$). Explicitly, $\gamma(x_0)^\top = e_1^\top \left( \Delta^\top A \Delta + \lambda J \right)^{-1} \Delta^\top A$, where $e_1$ is the vector of zeroes with a 1 in the first coordinate.
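The closed form (5) is straightforward to evaluate given the weights; the base-R sketch below (using the notation reconstructed above) also returns the induced weights γ_i(x0), so that one can verify that they sum to one and, with λ = 0, exactly recenter the covariates at x0.

```r
# Closed-form local linear forest prediction, equation (5), given forest
# weights alpha = (alpha_1(x0), ..., alpha_n(x0)) and ridge penalty lambda.
ll_predict <- function(x0, X, Y, alpha, lambda) {
  d     <- ncol(X)
  Delta <- cbind(1, sweep(X, 2, x0))            # centered design with intercept
  A     <- diag(alpha)                          # diag(alpha_i(x0))
  J     <- diag(c(0, rep(1, d)))                # do not penalize the intercept
  M     <- solve(t(Delta) %*% A %*% Delta + lambda * J)
  theta <- M %*% t(Delta) %*% A %*% Y
  gamma <- as.vector(M[1, , drop = FALSE] %*% t(Delta) %*% A)
  list(mu.hat = theta[1], theta.hat = theta[-1], gamma = gamma)
}

# Example usage with uniform weights (a forest would supply alpha instead).
set.seed(1)
X <- matrix(runif(200 * 3), 200, 3); Y <- X[, 1] + rnorm(200, sd = 0.1)
out <- ll_predict(x0 = rep(0.5, 3), X, Y, alpha = rep(1 / 200, 200), lambda = 0.1)
sum(out$gamma)                                  # equals 1, up to numerical error
```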

With the perspective of generating a kernel for local linear regression in mind, we move to discuss the appropriate splitting rule for local linear forests.

2.1 Splitting for Local Regression

Random forests traditionally use the Classification and Regression Trees (CART) splitting rule of Breiman, Friedman, Stone, and Olshen (1984), which proceeds as follows. We consider a parent node $P$ containing a set of observations $\{(X_i, Y_i) : X_i \in P\}$. For each candidate pair of child nodes $(C_1, C_2)$, we take the mean response $\bar Y_{C_1}$ and $\bar Y_{C_2}$ inside each child. Then we choose the split minimizing the sum of squared errors

$$\sum_{i : X_i \in C_1} \bigl(Y_i - \bar Y_{C_1}\bigr)^2 + \sum_{i : X_i \in C_2} \bigl(Y_i - \bar Y_{C_2}\bigr)^2.$$

Knowing that we will use the forest weights to perform a local regression, we neither need nor want to use the forest to model strong, smooth signals; the final regression step can model the strong global effects. Instead, in the parent node $P$, we run a ridge regression to predict $Y_i$ from $X_i$,

$$\bigl(\hat\mu_P,\, \hat\theta_P\bigr) = \operatorname*{argmin}_{\mu,\,\theta} \left\{ \sum_{i : X_i \in P} \bigl(Y_i - \mu - X_i^\top \theta\bigr)^2 + \lambda \lVert \theta \rVert_2^2 \right\}. \qquad (6)$$

We then run a standard CART split on the residuals $Y_i - \hat\mu_P - X_i^\top \hat\theta_P$, modeling local effects in the forest and regressing global effects back in at prediction time. Observe that, much like with the CART splitting rule, an appropriate software package can enforce that a forest using this splitting rule splits on every variable with positive probability and gives balanced splits; hence this splitting rule may be used to grow honest and regular trees.
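A sketch of this residual splitting step, with function and variable names of our own choosing (not grf internals): fit the ridge regression (6) in the parent node, form pseudo-outcomes from its residuals, and score candidate splits with the usual CART criterion on those residuals.

```r
# Ridge regression in a parent node P (equation (6)) and the resulting
# pseudo-outcomes used for splitting. X_P and Y_P hold the observations in P.
ridge_residuals <- function(X_P, Y_P, lambda) {
  Z     <- cbind(1, X_P)                         # intercept plus covariates
  J     <- diag(c(0, rep(1, ncol(X_P))))         # do not penalize the intercept
  coefs <- solve(t(Z) %*% Z + lambda * J, t(Z) %*% Y_P)
  as.vector(Y_P - Z %*% coefs)                   # pseudo-outcomes for splitting
}

# CART-style score of a candidate split (variable j at cutpoint c), evaluated
# on the ridge residuals; assumes both children are non-empty.
split_score <- function(X_P, resid, j, c) {
  left <- X_P[, j] <= c
  sum((resid[left]  - mean(resid[left]))^2) +
    sum((resid[!left] - mean(resid[!left]))^2)
}
```

The split actually chosen is the one minimizing this score over candidate variables and cutpoints, exactly as in CART, but applied to the residuals rather than the raw outcomes.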

To explore the effects of the CART and residual splitting rules, we consider a simulation first introduced by Friedman (1991). Generate $X_i$ independently and identically distributed on $[0,1]^d$ and model $Y_i$ from

$$Y_i = 10 \sin(\pi X_{i1} X_{i2}) + 20 (X_{i3} - 0.5)^2 + 10 X_{i4} + 5 X_{i5} + \varepsilon_i, \qquad (7)$$

with independent errors $\varepsilon_i \sim N(0, \sigma^2)$. This model has become a popular test case for evaluating nonparametric regression methods; see for example Chipman, George, and McCulloch (2010) and Taddy, Chen, Yu, and Wyle (2015). It is a natural setup to test how well an algorithm handles interactions ($X_1 X_2$), its ability to pick up a quadratic signal ($X_3$), and how it simultaneously models strong linear signals ($X_4$ and $X_5$).

Figure 2: Split frequency plot for CART splits from an honest random forest (left) and residual splits from a local linear forest (right). Each forest was trained on observations from the data-generating process in equation (7). Variables 1 through 5 are on the x-axis, and the y-axis gives tree depth, starting with depth 1 at the top of the plot. Tile color encodes split frequency, so variables on which the forest splits frequently at a given depth have a dark tile in the corresponding row.

Figure 2 displays the split frequencies from an honest random forest (left) using standard CART splits, and a local linear forest (right). The x-axis is indexed by variable, here 1 through 5, and the y-axis gives tree depth for the first 4 levels of tree splits. Tiles are darkened according to how often trees in the forest split on that variable; a darker tile denotes more splits at that tree depth. CART splits very frequently on variable 4, which contributes the strongest linear signal, especially at the top of the tree but consistently throughout levels. Local linear forests rarely split on either of the strong linear signals, instead spending splits on the three that are more difficult to model.

2.2 The Value of Local Linear Splitting

Consider the following experiment, which highlights the benefit of the proposed splitting rule. We generate covariates independently and uniformly. We hold a cubic signal constant across simulations, and on each run increase the dimension $d$ and add another linear signal to the response. Formally, we generate responses

(8)

so that each additional dimension contributes one more linear term on top of the fixed cubic signal. RMSE is displayed in Figure 3.

Figure 3: Results from testing different splitting rules on data generated from equation (8). Here the x-axis is the dimension $d$, varying from 2 to 20, and we plot the RMSE of prediction from random forests and from local linear forests with CART splits and with the ridge residual splits. Results are averaged over repeated runs on held-out test points for each value of $d$.

In low dimension and with few linear signals, all three methods are comparable. However, they begin to differ quickly. Random forests are not designed for models with so many global linear signals, and hence their RMSE increases dramatically with $d$. Moreover, as we add more linear effects, the gap between the two candidate splitting rules grows; heuristically, it becomes more important not to waste splits, and the residual splitting rule gives greater improvements. Note that at a certain point, however, the gap between splitting rules stays constant: once the forests simply cannot fit a more complex linear function with a fixed amount of data, the marginal benefits of the residual splitting rule level out. We show this to emphasize the contexts in which this splitting rule meaningfully affects the results.

2.3 Honest Forests

Unless noted otherwise, all random forests used in this paper are grown using a type of sub-sample splitting called “honesty”, used by Wager and Athey (2018) to derive the asymptotic distribution of random forest predictions. As outlined in Procedure 1 of Wager and Athey (2018), each tree $T_b$ in an honest forest is grown using two non-overlapping subsamples of the training data, call them $S_1$ and $S_2$. We first choose a tree structure using only the data in $S_1$. Then, in a second step, we define the set of neighbors of a test point $x_0$ as the training points in $S_2$ that fall into the same leaf of $T_b$ as $x_0$; this neighborhood function is what we then use to define the forest weights in (4). Note that we do not use the observed outcomes from sample $S_2$ to select split points; but, to ensure that each node has a certain fraction of the observations in its parent, we may use the covariates from $S_2$ to select splits. This modification allows us to grow honest forests that comply with Assumption 1, so that our theory is consistent with, and matches, the implementation available online.

This type of subsample splitting lets us control for potential overfitting when growing the tree $T_b$, because the samples in the neighborhood of $x_0$ were held out when $T_b$ was grown. We note that, despite considerable interest in the literature, there are no available consistency results for random forests with fully grown trees that do not use honesty. Biau (2012) uses a different type of sample splitting, Biau, Devroye, and Lugosi (2008) and Wager and Walther (2015) rely on large leaves, while the results of Scornet, Biau, and Vert (2015) on fully grown trees rely on an unchecked high-level assumption. Thus, we choose to build our forests with honesty on by default.

Empirically, honesty can improve or worsen predictions; when honesty impairs prediction, regression forests do not have enough expressive power to counteract it, and correspondingly they suffer. Local linear corrections can help mitigate this problem without sacrificing honesty. For a detailed treatment of practical honest forests, see Section 5.

2.4 Tuning a Local Linear Forest

We recommend selecting ridge penalties by cross-validation, which can be done automatically in the R package. It is often reasonable to choose different values of $\lambda$ for forest training and for local linear prediction. During forest growth, equation (6) gives ridge regression coefficients in each parent leaf. As trees are grown on subsamples of the data, over-regularization at this step is a danger even in large leaves. Consequently, small values of $\lambda$ are advisable for the penalized regressions during forest training. Furthermore, as we move to small leaves, computing meaningful regression coefficients becomes more difficult; the ridge regression can begin to mask signal instead of uncovering it. A heuristic that performs well in practice is to store the regression estimates $\hat\theta_P$ computed on parent leaves. When the child leaf size shrinks below a cutoff, we use $\hat\theta_P$ from the parent node to calculate the ridge residual pseudo-outcomes, instead of estimating them from unstable regression coefficients on the small child dataset. In practice, this helps avoid the pitfalls of over-regularizing and of regressing on a very small dataset when growing the forest. At the final regression prediction step (5), however, a larger ridge penalty can control the variance and better accommodate noisy data.
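A small sketch of the fallback heuristic just described, with an illustrative cutoff (not grf's internal value): when a child node is too small to support a stable ridge fit, reuse the parent node's stored coefficients to form the pseudo-outcomes.

```r
# Pseudo-outcomes in a child node: refit the ridge regression when the node is
# large enough, otherwise fall back to the parent's stored coefficients.
# The cutoff min_node_size is illustrative, not grf's internal default.
child_pseudo_outcomes <- function(X_C, Y_C, parent_coefs, lambda,
                                  min_node_size = 20) {
  Z <- cbind(1, X_C)
  if (nrow(X_C) >= min_node_size) {
    J     <- diag(c(0, rep(1, ncol(X_C))))
    coefs <- solve(t(Z) %*% Z + lambda * J, t(Z) %*% Y_C)
  } else {
    coefs <- parent_coefs                  # reuse the more stable parent fit
  }
  as.vector(Y_C - Z %*% coefs)
}
```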

With increasingly high-dimensional data, feature selection before prediction can significantly reduce error and decrease computation time. Often, a dataset will contain only a small number of features with strong global signals. In other cases, a researcher will know in advance which variables are consistently predictive or of special interest. In these cases, it is reasonable to run the regression prediction step on this smaller subset of predictors expected to contribute overarching trends. Such covariates, if they are not already known, can be chosen by a stepwise regression or lasso, or any other technique for automatic feature selection. Last, it is worth noting that these tuning suggestions are pragmatic in nature; the theoretical guarantees provided in Section 4 are for local linear forests trained without these heuristics.
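A sketch of the lasso-based selection step using cv.glmnet; the returned column indices can then serve as the linear-correction variables in the prediction-step regression.

```r
# Select covariates for the local regression step with a cross-validated lasso.
library(glmnet)

select_linear_covariates <- function(X, Y) {
  cv_fit <- cv.glmnet(X, Y)                 # cross-validated lasso path
  beta   <- as.matrix(coef(cv_fit, s = "lambda.min"))
  which(beta[-1, 1] != 0)                   # drop the intercept row, keep nonzeros
}

# Toy example: only the first two columns carry signal.
set.seed(1)
X <- matrix(rnorm(500 * 10), 500, 10)
Y <- X[, 1] + 0.5 * X[, 2] + rnorm(500)
select_linear_covariates(X, Y)
```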

3 Extension to Causal Forests

For conciseness, the majority of this paper focuses on local linear forests for non-parametric regression; however, a similar local linear correction can also be applied to quantile regression forests (Meinshausen, 2006), causal forests (Wager and Athey, 2018) or, more broadly, to any instance of generalized random forests (Athey, Tibshirani, and Wager, 2018). To highlight this potential, we detail the method and discuss an example for heterogeneous treatment effect estimation using local linear causal forests.

As in Athey, Tibshirani, and Wager (2018), we frame our discussion in terms of the Neyman-Rubin causal model (Imbens and Rubin, 2015). Suppose we have data $(X_i, Y_i, W_i)$, where $X_i \in \mathbb{R}^d$ are covariates, $Y_i \in \mathbb{R}$ is the response, and $W_i \in \{0, 1\}$ is the treatment. In order to define the causal effect of the treatment, we posit potential outcomes $Y_i(0)$ and $Y_i(1)$ for individual $i$, corresponding to the response the subject would have experienced in the control and treated conditions respectively; we then observe $Y_i = Y_i(W_i)$. We seek to estimate the conditional average treatment effect (CATE), namely $\tau(x) = \mathbb{E}[Y_i(1) - Y_i(0) \mid X_i = x]$. Assuming unconfoundedness (Rosenbaum and Rubin, 1983),

$$\{Y_i(0), Y_i(1)\} \perp W_i \mid X_i, \qquad (9)$$

the CATE is in general identified via non-parametric methods. We assume unconfoundedness when discussing treatment effect estimation throughout this paper. Wager and Athey (2018) proposed an extension of random forests for estimating CATEs, and Athey, Tibshirani, and Wager (2018) improved on the method by making it locally robust to confounding using the transformation of Robinson (1988). Here, we propose a local linear correction to the method of Athey, Tibshirani, and Wager (2018), the orthogonalized causal forest, to strengthen its performance when $\tau(\cdot)$ is smooth.
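For reference, the orthogonalized causal forest point estimate that local linear causal forests start from can be computed from forest weights and nuisance residuals as in the following simplified base-R sketch; this omits the additional local linear adjustment of equation (11) below.

```r
# Orthogonalized (Robinson-style) causal forest estimate at x0, given forest
# weights alpha and leave-one-out nuisance estimates m.hat and e.hat. This is
# the un-adjusted estimate; equation (11) adds a regularized local correction.
orthogonalized_tau <- function(alpha, Y, W, m.hat, e.hat) {
  Y.res <- Y - m.hat                   # outcome residuals  Y_i - m^(-i)(X_i)
  W.res <- W - e.hat                   # treatment residuals W_i - e^(-i)(X_i)
  sum(alpha * W.res * Y.res) / sum(alpha * W.res^2)
}
```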

Local linear causal forests start as orthogonalized causal forests do, by estimating the nuisance components

$$e(x) = \mathbb{P}[W_i = 1 \mid X_i = x] \quad \text{and} \quad m(x) = \mathbb{E}[Y_i \mid X_i = x] \qquad (10)$$

using a local linear forest. We then add a regularized adjustment to the standard causal forest equation,

(11)

where the $(-i)$-superscript denotes leave-one-out predictions from the nuisance models. If the nuisance estimates are accurate, the intercept should be 0; however, we leave it in the optimization for robustness. We cross-validate local linear causal forests (including the ridge penalties) by minimizing the $R$-learning criterion recommended by Nie and Wager (2017):

$$\frac{1}{n} \sum_{i=1}^{n} \Bigl( \bigl(Y_i - \hat m^{(-i)}(X_i)\bigr) - \bigl(W_i - \hat e^{(-i)}(X_i)\bigr)\, \hat\tau(X_i) \Bigr)^2. \qquad (12)$$
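A sketch of evaluating this cross-validation score, given held-out CATE predictions and out-of-bag nuisance estimates; candidate tuning parameters are compared by this value, smaller being better.

```r
# R-learner cross-validation score of equation (12): tau.hat, m.hat, and e.hat
# are vectors of held-out (out-of-bag / leave-one-out) predictions at each X_i.
r_learner_score <- function(Y, W, tau.hat, m.hat, e.hat) {
  mean(((Y - m.hat) - (W - e.hat) * tau.hat)^2)
}
```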

To illustrate the value of local linear causal forests, we consider a popular dataset from the General Social Survey (GSS) that explores how word choice reveals public opinions about welfare (Smith, Davern, Freese, and Hout, 2018). Individuals filling out the survey from 1986 to 2010 answered whether they believe the government spends too much, too little, or the right amount on the social safety net.[2] GSS randomly assigned the wording of this question, such that the social safety net was either described as “welfare” or “assistance to the poor”. This change had a well-documented effect on responses, due to the negative perception many Americans have about welfare; moreover, there is evidence of heterogeneity in the CATE surface (Green and Kern, 2012).

[2] The authors acknowledge that NORC at the University of Chicago, and all parties associated with the General Social Survey (including the principal investigators), offer the data and documentation “as is” with no warranty and assume no legal liability or responsibility for the completeness, accuracy, or usefulness of the data, or fitness for a particular purpose.

Here, we write $W_i = 1$ if the $i$-th sampled individual received the “welfare” wording, and define $Y_i = 1$ if the $i$-th response was that the government spends too much on the social safety net. Thus, a positive treatment effect indicates that, conditionally on covariates, using the phrase “welfare” as opposed to “assistance to the poor” increases the likelihood that the $i$-th subject says the government spends too much on the social safety net. We base our analysis on covariates including income, political views, age, and number of children. To make the problem interesting, we test our method on smaller subsamples of the full dataset.

In order to compare the performance of the two methods, we use the transformed outcome metric of Athey and Imbens (2016). Noting that the transformed outcome has conditional expectation equal to the treatment effect, they suggest examining the following test set error criterion:

(13)

If we can estimate the quantities entering this criterion, then (13) gives an unbiased estimate of the mean-squared error of the estimated treatment effects. Here, we form the required estimates via out-of-bag estimation on the full dataset, assuming that a local linear forest trained on such a large sample has negligible error.

Table 1 has error estimates for both types of forests using (13), and verifies that using the local linear correction improves empirical performance across different subsample sizes. Practically, we can view this change as enabling us to get good predictions on less data, a powerful improvement in cases like survey sampling where data can be expensive and difficult to attain. Section 5 contains a more detailed simulation study of local linear causal forests, comparing them with a wider array of baseline methods.

Subsample size 200 400 800 1200 1500 2000
CF 0.035 0.021 0.015 0.014 0.011 0.007
LLCF 0.027 0.017 0.013 0.013 0.011 0.006
Table 1: Estimated in-sample MSE (13) of estimating the treatment effect on subsampled welfare data, averaged over 200 runs at each subsample size. We show estimated error from local linear causal forests (LLCF) and standard causal forests (CF). Tuning parameters were selected via cross-validation using the R-learner objective.

4 Asymptotic Theory

Returning to the regression case, before we delve into the main result and its proof, we briefly discuss why the asymptotic behavior of local linear forests cannot be directly derived from the existing results of Athey, Tibshirani, and Wager (2018). This is due to a key difference in the dependence structure of the forest. In the regression case, a random forest prediction at $x_0$ is the weighted sum $\sum_{i=1}^{n} \alpha_i(x_0) Y_i$, where, due to honesty, $\alpha_i(x_0)$ is independent of $Y_i$ given $X_i$. This conditional independence plays a key role in the argument of Wager and Athey (2018). Analogously to (2), we can write the local linear forest prediction as a weighted sum,

$$\hat\mu(x_0) = \sum_{i=1}^{n} \gamma_i(x_0)\, Y_i, \qquad (14)$$

where the weights $\gamma_i(x_0)$ use notation from (5). At first glance, (14) indeed looks like the output of a regression forest trained on the observations $(X_i, Y_i)$. However, the dependence structure of this object is different. In a random forest, we make $\alpha_i(x_0)$ and $Y_i$ independent by conditioning on the features $X_i$. For a local linear forest, however, conditioning on $X_i$ will not guarantee that $\gamma_i(x_0)$ and $Y_i$ are independent, thus breaking a key component in the argument of Wager and Athey (2018).

We now give a Central Limit Theorem for local linear forest predictions, beginning by stating assumptions on the forest.

Assumption 1.

(Regular Trees) We assume that the forest grows regular trees: the trees are symmetric in permutations of the training data indices, split on every variable with probability bounded from below, and are grown to a depth governed by the subsample size; moreover, the trees are balanced, in that each split puts at least a fixed fraction of the parent's observations into each child node.

Assumption 2.

(Honest Forests) We assume that the forest is honest as described in Section 2.3, meaning that two distinct and independent subsamples are selected for each tree. Only the outcome values from one subsample are used to select the splits, and only those from the other to estimate parameters in the nodes.

Subsampling plays a central role in our asymptotic theory, as this is what allows us to prove asymptotic normality by building on the work of Efron and Stein (1981). Moreover, subsampling is what we use to tune the bias-variance trade-off of the forest: Forests whose trees are grown on small subsamples have lower bias but higher variance (and vice-versa).

In order to establish asymptotic unbiasedness of forests, Wager and Athey (2018) require a lower bound on the rate at which the subsample size used to grow each tree grows with $n$, given by

(15)

This convergence rate of a traditional honest random forest, however, does not improve when $\mu$ is smooth. Here, we show that by using a local regression adjustment and assuming smoothness of $\mu$, we can grow trees on smaller subsamples, of the size given in (16), without sacrificing asymptotic variance control. This allows us to decrease the bias (and improve the accuracy) of our estimates.

Our main result establishes asymptotic normality of local linear forest predictions, and gives this improved subsampling rate. The rate given depends on our bounds on the squared radius of a leaf, detailed in Appendix B in the statement and proof of Lemma 6.

Theorem 1.

Suppose we have training data $(X_i, Y_i)$ independently and identically distributed, where the density of $X_i$ is bounded away from infinity. Suppose furthermore that $\mu(x) = \mathbb{E}[Y \mid X = x]$ is twice continuously differentiable and that the conditional variance of $Y$ given $X$ is bounded. Last, say that our trees are grown according to Assumptions 1 and 2, on subsamples whose size grows as $n^{\beta}$ with $\beta$ bounded from below as in

(16)

and that the ridge regularization parameter $\lambda$ grows at an appropriate rate with $n$. Then for each fixed test point $x_0$, there is a sequence $\sigma_n(x_0)$ such that

$$\bigl(\hat\mu(x_0) - \mu(x_0)\bigr) \big/ \sigma_n(x_0) \Rightarrow \mathcal{N}(0, 1).$$

We remark that the rate of convergence is improved compared to the rate (15) given in Athey, Tibshirani, and Wager (2018), since we have assumed and consequently exploited smoothness.

4.1 Pointwise Confidence Intervals

This section complements our main result, as the Central Limit Theorem becomes far more useful when we have valid standard error estimates. Following Athey, Tibshirani, and Wager (2018), we use the random forest delta method to develop pointwise confidence intervals for local linear forest predictions.

The random forest delta method starts from a solution $\hat\theta(x_0)$ to a local estimating equation with random forest weights $\alpha_i(x_0)$:

$$\sum_{i=1}^{n} \alpha_i(x_0)\, \psi_{\hat\theta(x_0)}\bigl(Y_i, X_i\bigr) = 0. \qquad (17)$$

Athey, Tibshirani, and Wager (2018) then propose estimating the error of these estimates as

(18)

where $V(x_0)$ is the slope of the expected estimating equation at the optimum, and

(19)

The upshot is that the quantity defined in (19) measures the variance of an (infeasible) regression forest whose response depends on the score function at the optimal parameter values, and that we can in fact estimate it using tools originally developed for variance estimation with regression forests. Meanwhile, $V(x_0)$ can be estimated directly using standard methods.

With local linear forests, the estimates $\bigl(\hat\mu(x_0), \hat\theta(x_0)\bigr)$ solve (17) with score function

(20)

where we again use notation defined in (5) and (14). First, we note that we have access to a simple and explicit estimator for $V(x_0)$: twice differentiating (3) with respect to the parameters gives

(21)

which we can directly read off of the forest. In this paper, we are only interested in confidence intervals for $\hat\mu(x_0)$, which can now be represented in terms of these quantities:

(22)

Next, we follow Athey, Tibshirani, and Wager (2018), and proceed using the bootstrap of little bags construction of Sexton and Laake (2009). At a high level, this method is a computationally efficient half-sampling estimator. For any half-sample $\mathcal{H}$ of the training data, let $\bar\Psi_{\mathcal{H}}$ be the average of the empirical scores over the trees that only use data from the half-sample $\mathcal{H}$:

(23)

where the sum ranges over the set of trees that only use data from the half-sample $\mathcal{H}$, and each term involves the neighbors of $x_0$ in the corresponding tree (throughout, we assume that the subsample used to grow each tree contains fewer than half of the training examples). Then, a standard half-sampling estimator would simply use (Efron, 1982)

(24)

Now, carrying out the full computation in (24) is impractical, and naive Monte Carlo approximations suffer from bias. However, as discussed in Athey, Tibshirani, and Wager (2018) and Sexton and Laake (2009), bias-corrected randomized algorithms are available and perform well. Here, we do not discuss these Monte Carlo bias corrections, and instead refer to Section 4.1 of Athey, Tibshirani, and Wager (2018) for details. Simulation results on empirical confidence interval performance are given in Section 5.3.
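As a toy illustration of the half-sampling idea only (for a simple sample mean, not the forest-specific, bias-corrected bootstrap of little bags used in grf), the spread of estimates computed on random half-samples already tracks the sampling variance of the full-sample estimate.

```r
# Toy Monte Carlo half-sampling sketch for a sample mean. For half-samples
# drawn without replacement, the variance of the half-sample mean across
# half-samples is approximately S^2 / n, i.e. the variance of the full mean.
set.seed(1)
n <- 400
y <- rnorm(n, mean = 1, sd = 2)

half_sample_means <- replicate(2000, mean(y[sample(n, n %/% 2)]))

var_hat <- var(half_sample_means)        # half-sampling variance estimate
c(half_sampling = var_hat, true_variance = 2^2 / n)
```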

5 Simulation Study

5.1 Methods

In this section, we compare local linear forests, random forests, BART (Chipman, George, and McCulloch, 2010), and gradient boosting (Friedman, 2001). We also include a lasso-random forest baseline for local linear forests: on half of the training data, run a lasso (Tibshirani, 1996) regression; on the second half, use a random forest to model the corresponding residuals. Like local linear forests, this method combines regression and forests, making it a natural comparison; it is similar in spirit to the tree-augmented Cox model of Su and Tsai (2005), who combine pruned CART trees with proportional hazards regression.

Random forests are trained using the R package grf (Tibshirani et al., 2018), and are cross-validated via the default parameter tuning in grf, which selects values for mtry, minimum leaf size, sample fraction, and two parameters (alpha and imbalance penalty) that control split balance. Local linear forests are tuned equivalently, with additional cross-validation for the regularization parameters. Variables for the regression at prediction are selected via the lasso. Because existing theoretical results for random forests rely on honesty, all random forests are built with the honest construction. All lasso models are implemented via glmnet (Friedman, Hastie, and Tibshirani, 2010) and cross-validated with its automatic cross-validation feature. Local linear regression is not included in these comparisons, since the implementations loess and locfit both fail on this simulation at the dimensions considered; in Appendix A, we include Table 6, which compares local linear regression with this set of methods on lower-dimensional linear models. Unless otherwise specified, all reported errors are root mean-squared error (RMSE) on 1000 test points, averaged over 50 simulation runs.

Gradient boosted trees are implemented via the R package xgboost (Chen et al., 2019), cross-validated on a grid over the number of rounds, step size, and maximum depth. BART for treatment effect estimation is implemented following Hill (2011). As is standard, we use the BART package (McCulloch et al., 2019) without any additional tuning. The motivation for not tuning is that if we want to interpret the BART posterior in a Bayesian sense (as is often done), then cross-validating on the prior is hard to justify; and in fact most existing papers do not cross-validate BART.

5.2 Simulation Design

The first design we study is Friedman’s example from equation (7). Figure 4 shows errors with the sample size held fixed and the dimension varying from 10 to 50; two plots are shown, corresponding to two levels of the error variance. Table 4 reports a grid of errors for dimensions 10, 30, and 50, over two training set sizes, with the error standard deviation $\sigma$ taking values of 5 and 20. The second design we consider is given in Section 1, as in equation (1). Again we test on a grid, letting the dimension take values 5 and 50, the sample size either 1000 or 5000, and $\sigma$ the values 0.1, 1, and 2. Errors are reported in Table 5.

The third simulation is designed to test how local linear forests perform in a more adversarial setting, where we expect random forests to outperform. We simulate covariates i.i.d. and model responses as

(25)

We test this setup over a range of dimensions and sample sizes. For this simulation, we compare only honest random forests and local linear forests, in order to compare confidence intervals; we compute out-of-bag RMSE and average confidence interval coverage and length. Results are reported in Table 2.

5.3 Results

Figure 4 shows RMSE from equation (7) at the two noise levels (left and right). In Section 2, we showed that local linear forests and standard regression forests split on very different variables when generating weights. Our intuition is that these are splits we have saved; we model the strong linear effects at the end with the local regression, and use the forest splits to capture more nuanced local relationships for the weights. Local linear forests consistently perform well as we vary the parameters, lending this intuition credibility. The lasso-random forest baseline lines up closely with adaptive random forests in both simulations; both methods perform well, with lasso-random forest outperforming all other methods at the lowest dimension and noise level. BART and random forests form the next tier of methods in the low-noise case; in the high-noise case, honest random forests are clustered with the other methods. Gradient boosting consistently reports higher RMSE. Table 4 in Appendix A shows the fuller RMSE comparison from Friedman’s model.

Figure 4: RMSE of predictions on test samples from equation (7), with the sample size held fixed and the dimension varied from 10 to 50. Plots are shown for two values of the error standard deviation (left and right). Error was calculated in increments of 10 and averaged over 50 runs per method at each step. Methods evaluated are random forests (RF), local linear forests (LLF), lasso plus random forests, boosted trees, and BART.

We move to the second simulation setup, equation (1), meant to evaluate how methods perform in cases with a strong linear trend in the mean. Tree-based methods will be prone to bias on this setup, as the forests cannot always split on the signal variable, and because the signal is global and smooth. Full error results for the range of competing methods are given in Appendix A in Table 5. Local linear forests do quite well here; they detect the strong linear signal in the tail, as we saw in Figure 1, and model it successfully throughout the range of the feature space. Gradient boosted trees perform very well in the low-noise case, but their performance sharply declines when we increase the noise level.

We also examine the behavior of our confidence intervals in each of the given simulation setups, shown here in Table 2. We give average coverage of confidence intervals from 50 repetitions of random forests and local linear forests on draws from the simulation setups in equations 1, 7, and 25, as well as average confidence interval length and RMSE. On equation 1, local linear forest confidence intervals are consistently shorter and closer to nominal coverage, with correspondingly lower mean-squared error. Here, both random forests and local linear forests achieve fairly low RMSE and coverage at or near the nominal level. For the setup in equation 7, on the other hand, neither method comes close to nominal coverage, and the local linear forest confidence intervals are longer than the random forest confidence intervals. This is an encouraging result, indicating that local linear forest confidence intervals adapt to the context of the problem; we would hope for long confidence intervals when detection is difficult. Moreover, the poor coverage we sometimes see across both methods is likely because the confidence intervals are built on asymptotic results, which may not apply at these relatively small sample sizes.

We include the approximate step function in equation 25 to highlight a favorable example for random forests. Local linear forests see equivalent or better coverage on this setup, although at the cost of longer confidence intervals in low dimensions. Especially on small training datasets, local linear forests also improve on random forest predictions in RMSE.

Setup d n Coverage Length RMSE
Equation 1 RF LLF RF LLF RF LLF
5 500 0.90 0.94 2.40 2.35 0.63 0.55
5 2000 0.97 0.96 2.23 1.85 0.43 0.35
5 10000 0.97 0.98 1.41 2.20 0.28 0.42
20 500 0.88 0.92 2.23 2.13 0.68 0.55
20 2000 0.89 0.96 2.14 2.12 0.17 0.09
20 10000 0.97 0.99 1.23 0.89 0.24 0.13
Equation 7 RF LLF RF LLF RF LLF
5 500 0.54 0.65 3.56 3.82 2.36 2.03
5 2000 0.63 0.69 3.17 3.21 1.77 1.58
5 10000 0.70 0.75 2.75 2.77 1.32 1.18
20 500 0.45 0.59 4.06 4.61 8.82 4.85
20 2000 0.57 0.64 3.55 4.22 5.25 3.20
20 10000 0.52 0.70 2.28 2.89 1.83 1.46
Equation 25 RF LLF RF LLF RF LLF
5 500 0.85 0.89 3.26 3.46 1.50 0.90
5 2000 0.90 0.92 2.54 2.82 0.52 0.45
5 10000 0.82 0.92 1.47 1.36 0.40 0.3
20 500 0.85 0.89 3.36 3.19 1.50 0.98
20 2000 0.90 0.90 2.71 2.36 0.62 0.46
20 10000 0.87 0.92 1.94 1.55 0.58 0.37
Table 2: Average coverage and length of confidence intervals from honest random forests (RF) and local linear forests (LLF), along with RMSE on the same out-of-bag (OOB) predictions. OOB coverage is averaged over 50 runs of the simulation setups in equations 1, 7, and 25, and reported for the given values of the dimension $d$ and the number of training points $n$. We hold the noise level constant for equations 1 and 7, and train on a fixed subsampling fraction.

5.4 Real Data Example

As a real-data example, we include the well-known California housing data. The data contain summary statistics for housing prices in California districts from 1990, first introduced in a paper by Pace and Barry (1997) and downloaded from Kaggle. To make the problem more difficult (and more relevant to our method), we examine randomly chosen subsamples of fixed size and compare errors in predicting median house income. Boxplots in Figure 5 show squared error (rescaled due to the large units of the outcome variable) for all methods described here.

Figure 5: MSE of predictions on test samples from the California housing data, with the subsample size held fixed. Boxplots show errors over 50 simulation runs. Methods evaluated are honest and adaptive random forests (RF), local linear forests (LLF), lasso plus random forests, boosted trees, and BART.

5.5 Local Linear Causal Forests

In Section 3, we introduced a real-data example where the local linear extension of causal forests naturally applies. Evaluating errors empirically, however, is difficult, so we supplement that example with a simulation used by Wager and Athey (2018) in evaluating causal forests and by Künzel, Sekhon, Bickel, and Yu (2017) in evaluating their meta-learner, the X-learner. We fix the propensity and the main effect, and generate a heterogeneous causal effect from each of

(26)
(27)

We will assume unconfoundedness (Rosenbaum and Rubin, 1983); therefore, because we hold propensity fixed, this is a randomized controlled trial.

We compare local linear forests, causal forests, and X-BART, which is the X-learner using BART as the base learner. Causal forests as implemented by grf are tuned via the automatic self-tuning feature. As in the prediction simulation studies, we do not cross-validate X-BART, because the authors recommend X-BART specifically for when a user does not want to carefully tune; we acknowledge that this may hinder its performance. Local linear causal forests are tuned via cross-validation. In these simulations, we consider relatively small sample sizes ranging from 200 to 1200. The goal of this simulation is to evaluate how effectively we can learn a smooth heterogeneous treatment effect in the presence of many noise covariates. Wager and Athey (2018) emphasize equation (26) as a simulation that demonstrates how forests can suffer on the boundary of a feature space, because the treatment effect has a spike near the boundary. RMSE over 100 repetitions is reported in Table 3. We can see that in these simulations, local linear forests give a significant improvement over causal forests. We can also observe how local linear forests make up for smaller training datasets by adding additional regularization, which shrinks as we gain more training data. Both of these setups are reasonable tests of how a method can learn heterogeneity, and demonstrate potential for meaningful improvement with thoughtful variable selection and robustness to smooth heterogeneous signals.

Simulation 1 (equation 26) Simulation 2 (equation 27)
n X-BART CF LLCF X-BART CF LLCF
200 1.01 0.94 0.80 0.67 0.77 0.71
400 0.76 0.50 0.47 0.56 0.55 0.50
600 0.61 0.39 0.35 0.50 0.41 0.38
800 0.55 0.35 0.31 0.46 0.34 0.32
1000 0.50 0.33 0.30 0.44 0.29 0.28
1200 0.48 0.32 0.28 0.42 0.27 0.26
Table 3: Average RMSE of predicting the heterogeneous treatment effect over 100 repetitions of the simulations given in equations (26) and (27). We vary the sample size from 200 to 1200 in increments of 200, always testing on a held-out test set. We report errors from local linear causal forests (LLCF), causal forests (CF), and the X-learner with BART as base learner (X-BART).

6 Discussion

In this paper, we proposed local linear forests as a modification of random forests equipped to model smooth signals and fix boundary bias issues. We presented asymptotic theory showing that, if we can assume smoother signals, we can get better rates of convergence as compared to generalized random forests. We showed on the welfare dataset that local linear forests can effectively model smooth heterogeneous causal effects, and illustrated when and why they outperform competing methods. We also gave confidence intervals from the delta method for the regression prediction case, and demonstrated their effectiveness in simulations.

The regression adjustments in local linear forests prove especially useful when some covariates have strong global effects with moderate curvature. Furthermore, the adjustment provides centered predictions, adjusting for errors due to an asymmetric set of neighbors. There is likely a useful polynomial basis corresponding to every situation where local linear forests perform well, but finding it requires hand-tuning the functional form for competitive performance, and such an approach is not automatically suited to a mix of smooth and non-smooth signals. For a departure from frequentist techniques, BART and Gaussian processes are both hierarchical Bayesian methods; BART can be viewed as a form of Gaussian process with a flexible prior, making BART our preferred Bayesian baseline.

There remains room for meaningful future work on this topic. In some applications, we may be interested in estimating the slope parameter $\theta(x_0)$ itself, rather than merely accounting for it to improve the precision of $\hat\mu(x_0)$. While local linear forests may be an appropriate method for doing so, we have not yet explored this topic and think it could be of significant interest. We have also not considered the theoretical or empirical improvements that could arise from assuming higher-order smoothness in the functions we are estimating; searching for additional optimality results in this setting could be another interesting research question.

References

  • Abadie and Imbens [2011] Alberto Abadie and Guido W Imbens. Bias-corrected matching estimators for average treatment effects. Journal of Business & Economic Statistics, 29(1):1–11, 2011.
  • Amit and Geman [1997] Yali Amit and Donald Geman. Shape quantization and recognition with randomized trees. Neural Comput., 9(7):1545–1588, October 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.7.1545.
  • Athey and Imbens [2016] Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
  • Athey et al. [2018] Susan Athey, Julie Tibshirani, and Stefan Wager. Generalized random forests. Annals of Statistics, forthcoming, 2018.
  • Basu et al. [2018] Sumanta Basu, Karl Kumbier, James B. Brown, and Bin Yu. Iterative random forests to discover predictive and stable high-order interactions. Proceedings of the National Academy of Sciences, 2018. ISSN 0027-8424. doi: 10.1073/pnas.1711236115. URL http://www.pnas.org/content/early/2018/01/17/1711236115.
  • Biau [2012] Gérard Biau. Analysis of a random forests model. J. Mach. Learn. Res., 13(1):1063–1095, April 2012. ISSN 1532-4435.
  • Biau et al. [2008] Gérard Biau, Luc Devroye, and Gábor Lugosi. Consistency of random forests and other averaging classifiers. JMLR, 9:2015–2033, 2008.
  • Bloniarz et al. [2016] Adam Bloniarz, Ameet Talwalkar, Bin Yu, and Christopher Wu. Supervised neighborhoods for distributed nonparametric regression. In Artificial Intelligence and Statistics, pages 1450–1459, 2016.
  • Breiman et al. [1984] L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen. Classification and Regression Trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis, 1984. ISBN 9780412048418.
  • Breiman [1996] Leo Breiman. Bagging predictors. Mach. Learn., 24(2):123–140, August 1996. ISSN 0885-6125. doi: 10.1023/A:1018054314350.
  • Breiman [2001] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, Oct 2001. ISSN 1573-0565. doi: 10.1023/A:1010933404324.
  • Bühlmann and Yu [2002] Peter Bühlmann and Bin Yu. Analyzing bagging. The Annals of Statistics, 30(4):927–961, 2002.
  • Chen et al. [2019] Tianqi Chen, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, Rory Mitchell, Ignacio Cano, Tianyi Zhou, Mu Li, Junyuan Xie, Min Lin, Yifeng Geng, and Yutian Li. xgboost: Extreme Gradient Boosting, 2019. URL https://CRAN.R-project.org/package=xgboost. R package version 0.82.1.
  • Cheng et al. [1997] Ming-Yen Cheng, Jianqing Fan, and J. S. Marron. On automatic boundary corrections. The Annals of Statistics, 25(4):1691–1708, 1997. ISSN 00905364.
  • Chipman et al. [2010] Hugh A. Chipman, Edward I. George, and Robert E. McCulloch. Bart: Bayesian additive regression trees. Ann. Appl. Stat., 4(1):266–298, 03 2010. doi: 10.1214/09-AOAS285.
  • Cleveland [1979] William S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:829–836, 1979.
  • Cleveland and Devlin [1988] William S. Cleveland and Susan J. Devlin. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596–610, 1988. doi: 10.1080/01621459.1988.10478639.
  • Cutler et al. [2007] D Richard Cutler, Thomas C Edwards Jr, Karen H Beard, Adele Cutler, Kyle T Hess, Jacob Gibson, and Joshua J Lawler. Random forests for classification in ecology. Ecology, 88(11):2783–2792, 2007.
  • Díaz-Uriarte and De Andres [2006] Ramón Díaz-Uriarte and Sara Alvarez De Andres. Gene selection and classification of microarray data using random forest. BMC bioinformatics, 7(1):3, 2006.
  • Efron [1982] Bradley Efron. The jackknife, the bootstrap, and other resampling plans, volume 38. Siam, 1982.
  • Efron and Stein [1981] Bradley Efron and Charles Stein. The jackknife estimate of variance. The Annals of Statistics, pages 586–596, 1981.
  • Fan and Gijbels [1992] Jianqing Fan and Irene Gijbels. Variable bandwidth and local linear regression smoothers. Ann. Statist., 20(4):2008–2036, 12 1992. doi: 10.1214/aos/1176348900.
  • Fan and Gijbels [1996] Jianqing Fan and Irène Gijbels. Local polynomial modelling and its applications. Number 66 in Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1996. ISBN 0412983214.
  • Friedman et al. [2010] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010. URL http://www.jstatsoft.org/v33/i01/.
  • Friedman [1991] Jerome H. Friedman. Multivariate adaptive regression splines. Ann. Statist., 19(1):1–67, 03 1991. doi: 10.1214/aos/1176347963.
  • Friedman [2001] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Ann. Statist., 29(5):1189–1232, 10 2001. doi: 10.1214/aos/1013203451. URL https://doi.org/10.1214/aos/1013203451.
  • Gama [2004] João Gama. Functional trees. Mach. Learn., 55(3):219–250, June 2004. ISSN 0885-6125. doi: 10.1023/B:MACH.0000027782.67192.13. URL http://dx.doi.org/10.1023/B:MACH.0000027782.67192.13.
  • Green and Kern [2012] Donald P. Green and Holger L. Kern. Modeling heterogeneous treatment effects in survey experiments with bayesian additive regression trees. Public Opinion Quarterly, 76(3):491–511, 2012. doi: 10.1093/poq/nfs036.
  • Heckman et al. [1998] James J Heckman, Hidehiko Ichimura, and Petra Todd. Matching as an econometric evaluation estimator. The review of economic studies, 65(2):261–294, 1998.
  • Hill [2011] Jennifer L Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
  • Hoeffding [1963] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • Hothorn et al. [2004] Torsten Hothorn, Berthold Lausen, Axel Benner, and Martin Radespiel-Troger. Bagging survival trees. Statistics in Medicine, 23:77–91, 2004.
  • Imbens and Rubin [2015] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
  • Karalič [1992] Aram Karalič. Employing linear regression in regression tree leaves. In Proceedings of the 10th European Conference on Artificial Intelligence, ECAI ’92, pages 440–441, New York, NY, USA, 1992. John Wiley & Sons, Inc. ISBN 0-471-93608-1. URL http://dl.acm.org/citation.cfm?id=145448.146775.
  • Künzel et al. [2017] Soren R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning. ArXiv e-prints, June 2017.
  • Li and Hsing [2010] Yehua Li and Tailen Hsing. Uniform convergence rates for nonparametric regression and principal component analysis in functional/longitudinal data. Ann. Statist., 38(6):3321–3351, 12 2010. doi: 10.1214/10-AOS813.
  • Linero and Yang [2018] Antonio Ricardo Linero and Yun Yang. Bayesian Regression Tree Ensembles that Adapt to Smoothness and Sparsity. Journal of the Royal Statistical Society, Series B, 80(5):1087–1110, 2018.
  • Loader [2013] Catherine Loader. locfit: Local Regression, Likelihood and Density Estimation., 2013. URL https://CRAN.R-project.org/package=locfit. R package version 1.5-9.1.
  • Loader [1999] Clive Loader. Local regression and likelihood. New York: Springer-Verlag, 1999. ISBN 0-387-9877.
  • McCulloch et al. [2019] Robert McCulloch, Rodney Sparapani, Robert Gramacy, Charles Spanbauer, and Matthew Pratola. BART: Bayesian Additive Regression Trees, 2019. URL https://CRAN.R-project.org/package=BART. R package version 2.4.
  • Meinshausen [2006] Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7(Jun):983–999, 2006.
  • Mentch and Hooker [2016] Lucas Mentch and Giles Hooker. Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J. Mach. Learn. Res., 17(1):841–881, January 2016. ISSN 1532-4435.
  • Menze et al. [2011] Bjoern H. Menze, B. Michael Kelm, Daniel N. Splitthoff, Ullrich Koethe, and Fred A. Hamprecht. On oblique random forests. In Dimitrios Gunopulos, Thomas Hofmann, Donato Malerba, and Michalis Vazirgiannis, editors, Machine Learning and Knowledge Discovery in Databases, pages 453–469, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg. ISBN 978-3-642-23783-6.
  • Newey [1994] Whitney K. Newey. Kernel estimation of partial means and a general variance estimator. Econometric Theory, 10(2):233–253, 1994. ISSN 02664666, 14694360.
  • Nie and Wager [2017] Xinkun Nie and Stefan Wager. Learning objectives for treatment effect estimation. arXiv preprint arXiv:1712.04912, 2017.
  • Pace and Barry [1997] Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, 1997. URL https://EconPapers.repec.org/RePEc:eee:stapro:v:33:y:1997:i:3:p:291-297.
  • R Core Team [2018] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2018. URL https://www.R-project.org/.
  • Robinson [1988] Peter M Robinson. Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, pages 931–954, 1988.
  • Rosenbaum and Rubin [1983] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983. doi: 10.1093/biomet/70.1.41.
  • Rusch and Zeileis [2013] Thomas Rusch and Achim Zeileis. Gaining insight with recursive partitioning of generalized linear models. Journal of Statistical Computation and Simulation, 83(7):1301–1315, 2013. doi: 10.1080/00949655.2012.658804. URL https://doi.org/10.1080/00949655.2012.658804.
  • Scornet et al. [2015] Erwan Scornet, Gérard Biau, and Jean-Philippe Vert. Consistency of random forests. Ann. Statist., 43(4):1716–1741, 08 2015. doi: 10.1214/15-AOS1321.
  • Sexton and Laake [2009] Joseph Sexton and Petter Laake. Standard errors for bagged and random forest estimators. Computational Statistics & Data Analysis, 53(3):801–811, 2009.
  • Smith et al. [2018] Tom W. Smith, Michael Davern, Jeremy Freese, and Michael Hout. General social surveys, 1972-2016 [machine-readable data file]. /Principal Investigator, Smith, Tom W.; Co-Principal Investigators, Peter V. Marsden and Michael Hout; Sponsored by National Science Foundation. –NORC ed.– Chicago: NORC: NORC at the University of Chicago [producer and distributor]. Data accessed from the GSS Data Explorer website at gssdataexplorer.norc.org, 2018.
  • Stone [1977] Charles J. Stone. Consistent nonparametric regression. The Annals of Statistics, 5:595–620, 1977.
  • Su and Tsai [2005] Xiaogang Su and Chih-Ling Tsai. Tree-augmented Cox proportional hazards models. Biostatistics, 6(3):486–499, 04 2005. ISSN 1465-4644. doi: 10.1093/biostatistics/kxi024. URL https://doi.org/10.1093/biostatistics/kxi024.
  • Svetnik et al. [2003] Vladimir Svetnik, Andy Liaw, Christopher Tong, J Christopher Culberson, Robert P Sheridan, and Bradley P Feuston. Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6):1947–1958, 2003.
  • Taddy et al. [2015] M. Taddy, C.-S. Chen, J. Yu, and M. Wyle. Bayesian and empirical Bayesian forests. ArXiv e-prints, February 2015.
  • Tibshirani et al. [2018] Julie Tibshirani, Susan Athey, Stefan Wager, Marvin Wright, and all contributors to the included version of Eigen. grf: Generalized Random Forests (Beta), 2018. URL https://github.com/swager/grf. R package version 0.9.3.
  • Tibshirani [1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
  • Tibshirani and Hastie [1987] Robert Tibshirani and Trevor Hastie. Local likelihood estimation. Journal of the American Statistical Association, 82(398):559–567, 1987.
  • Torgo [1997] Luís Torgo. Functional models for regression tree leaves. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML ’97, pages 385–393, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc. ISBN 1-55860-486-3. URL http://dl.acm.org/citation.cfm?id=645526.657280.
  • Wager and Athey [2018] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
  • Wager and Walther [2015] Stefan Wager and Guenther Walther. Adaptive concentration of regression trees, with application to random forests. arXiv preprint arXiv:1503.06388, 2015.
  • Wager et al. [2014] Stefan Wager, Trevor Hastie, and Bradley Efron. Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. J. Mach. Learn. Res., 15(1):1625–1651, January 2014. ISSN 1532-4435.
  • Xu et al. [2016] Ruo Xu, Dan Nettleton, and Daniel J. Nordman. Case-specific random forests. Journal of Computational and Graphical Statistics, 25(1):49–65, 2016. doi: 10.1080/10618600.2014.983641. URL https://doi.org/10.1080/10618600.2014.983641.
  • Yao et al. [2005] Fang Yao, Hans-Georg Muller, and Jane-Ling Wang. Functional linear regression analysis for longitudinal data. Ann. Statist., 33(6):2873–2903, 12 2005. doi: 10.1214/009053605000000660.
  • Zeileis et al. [2008] Achim Zeileis, Torsten Hothorn, and Kurt Hornik. Model-based recursive partitioning. Journal of Computational and Graphical Statistics, 17(2):492–514, 2008. doi: 10.1198/106186008X319331. URL https://doi.org/10.1198/106186008X319331.
  • Zhou and Hooker [2018] Yichen Zhou and Giles Hooker. Boulevard: Regularized stochastic gradient boosted trees and their limiting distribution. arXiv preprint arXiv:1806.09762, 2018.
  • Zhu et al. [2015] Ruoqing Zhu, Donglin Zeng, and Michael R Kosorok. Reinforcement learning trees. Journal of the American Statistical Association, 110(512):1770–1784, 2015.

Appendix A Remaining Simulation Results

We include first Table 4, giving a full error comparison of the lasso-random forest baseline, BART, boosting, random forests, and local linear forests on equation 7. Errors are reported for dimensions d ranging from 10 to 50, error standard deviations σ from 5 to 20, and sample sizes n = 1000 and n = 5000, averaged over 50 training runs.

  d      n   σ     RF  Lasso-RF    LLF   BART  XGBoost
 10   1000   5   2.33      2.12   2.03   2.49     3.04
 10   5000   5   1.90      1.48   1.57   1.51     2.65
 30   1000   5   2.82      2.41   2.11   2.84     3.13
 30   5000   5   2.08      1.61   1.73   2.03     2.53
 50   1000   5   3.00      2.48   2.12   2.40     3.17
 50   5000   5   2.18      1.82   1.80   2.106    2.636
 10   1000  20   3.19      3.41   3.40   6.45     9.91
 10   5000  20   2.43      2.35   2.29   3.85     9.48
 30   1000  20   4.17      3.98   3.68   7.60     9.58
 30   5000  20   2.97      2.66   2.40   4.78     9.18
 50   1000  20   4.25      4.45   3.88   8.05     8.78
 50   5000  20   3.16      2.67   2.35   4.95     8.80
Table 4: RMSE from simulations on equation 7. We vary the dimension d from 10 to 50 predictors in increments of 20 and consider error standard deviations σ of 5 and 20, covering a range of signal-to-noise ratios; approximated over 10,000 Monte Carlo repetitions, the smaller noise level corresponds to a signal-to-noise ratio of about 23.8 and the larger to a substantially lower one. We train on n = 1000 and n = 5000 points and report test errors from predicting on a held-out test set. All errors reported are averaged over 50 runs, and the methods are cross-validated as described in Section 5.1. Minimizing errors are reported in bold.
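
To make the comparison protocol concrete, the following R sketch runs one train/test draw for three of the methods above. It is illustrative only: the signal mu() is a placeholder rather than the paper's equation 7, the tuning is simplified relative to the cross-validation described in Section 5.1, and ll_regression_forest() assumes a recent release of grf. BART and the lasso-random forest baseline would be added analogously.

```r
# One Monte Carlo draw of a Table 4-style comparison (illustrative only).
library(grf)      # regression_forest(), ll_regression_forest()
library(xgboost)  # gradient-boosting baseline

set.seed(1)
n <- 1000; d <- 10; sigma <- 5
mu <- function(x) 10 * sin(pi * x[1] * x[2]) + 20 * (x[3] - 0.5)^2  # placeholder smooth signal

X  <- matrix(runif(n * d), n, d)
Y  <- apply(X, 1, mu) + sigma * rnorm(n)
Xt <- matrix(runif(1000 * d), 1000, d)
truth <- apply(Xt, 1, mu)
rmse  <- function(pred) sqrt(mean((pred - truth)^2))

rf  <- regression_forest(X, Y)
llf <- ll_regression_forest(X, Y)   # local linear forest (recent grf versions)
xgb <- xgboost(data = X, label = Y, nrounds = 500, max_depth = 4, eta = 0.1,
               verbose = 0)

c(RF  = rmse(predict(rf,  Xt)$predictions),
  LLF = rmse(predict(llf, Xt)$predictions),
  XGB = rmse(predict(xgb, Xt)))
```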

We include next Table 5, again giving a more complete error comparison of the lasso-random forest baseline, BART, boosting, random forests, and local linear forests on equation 1. Errors are reported for dimensions d of 5 and 50, noise levels σ from 0.1 to 2, and sample sizes n = 1000 and n = 5000, averaged over 50 training runs.

  d      n    σ     RF  Lasso-RF    LLF   BART  XGBoost
  5   1000  0.1   0.10      0.06   0.02   0.27     0.07
  5   5000  0.1   0.06      0.02   0.02   0.22     0.06
 50   1000  0.1   0.29      0.18   0.11   0.52     0.07
 50   5000  0.1   0.18      0.10   0.07   0.62     0.06
  5   1000    1   0.21      0.24   0.14   0.47     0.56
  5   5000    1   0.15      0.11   0.09   0.26     0.52
 50   1000    1   0.41      0.39   0.20   0.82     0.53
 50   5000    1   0.23      0.21   0.10   0.57     0.52
  5   1000    2   0.31      0.55   0.26   0.69     1.21
  5   5000    2   0.25      0.28   0.21   0.40     1.18
 50   1000    2   0.47      0.27   0.24   0.89     1.22
 50   5000    2   0.33      0.27   0.15   0.70     0.96
Table 5: RMSE from simulations on equation 1 for random forests, lasso-random forest, local linear forests, BART, and boosting. We vary the sample size n, the noise level σ, and the ambient dimension d, and report test error on a held-out test set. The signal variance was estimated over 10,000 Monte Carlo repetitions, so the signal-to-noise ratio varies accordingly with σ. All errors are averaged over 50 runs, and minimizing errors are in bold.
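
The averaging over 50 runs reported in these tables can be organized as below. This is a minimal sketch with a placeholder signal mu() (not the paper's equation 1) and only a single method, a grf regression forest, shown for brevity; the other methods would be averaged in the same way.

```r
# Sketch of the replication protocol behind Tables 4 and 5: average test RMSE
# over 50 independent train/test draws for one method.
library(grf)

mu <- function(x) 5 * x[1] + sin(2 * pi * x[2])   # placeholder smooth signal

one_run <- function(n, d, sigma, n_test = 1000) {
  X  <- matrix(runif(n * d), n, d)
  Y  <- apply(X, 1, mu) + sigma * rnorm(n)
  Xt <- matrix(runif(n_test * d), n_test, d)
  forest <- regression_forest(X, Y)
  sqrt(mean((predict(forest, Xt)$predictions - apply(Xt, 1, mu))^2))
}

# Average over 50 Monte Carlo repetitions, as in the tables.
mean(replicate(50, one_run(n = 1000, d = 10, sigma = 1)))
```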

To close this section, we consider some basic linear and polynomial models in low dimensions, in order to compare local linear forests directly with local linear regression. We simulate covariates uniformly at random and generate responses from two models,

(28)
(29)

where the errors are mean-zero Gaussian with standard deviation σ and independent of the covariates. RMSE against the true signal is reported, averaged over 50 runs, for lasso, local linear regression, BART, random forests, and local linear forests. In the simple linear case of equation 28, lasso outperforms the other methods, as we would expect; for the polynomial in equation 29, local linear regression performs best, followed by BART (in the σ = 1 case) and local linear forests (in the σ = 5 and σ = 10 cases).

 Setup          σ   Lasso    LLR   BART     RF    LLF
 Equation 28    1    0.12   0.15   0.48   0.73   0.22
 Equation 28    5    0.39   0.92   1.27   1.25   0.96
 Equation 28   10    0.70   1.70   2.37   1.76   1.56
 Equation 29    1    1.55   0.22   0.50   0.86   0.69
 Equation 29    5    1.55   0.92   1.31   1.32   1.28
 Equation 29   10    1.66   1.44   1.83   1.70   1.68
Table 6: RMSE from simulations on equations 28 and 29 for lasso, local linear regression (LLR), BART, random forests (RF), and local linear forests (LLF). We vary the noise level σ from 1 to 10 and hold the sample size and dimension fixed. All errors are averaged over 50 runs, and minimizing errors are in bold.
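
A sketch of this low-dimensional comparison might look as follows. The polynomial signal, dimension, and noise level are illustrative placeholders (equations 28 and 29 are not reproduced here), base-R loess() with degree = 1 stands in for the locfit-based local linear regression, and ll_regression_forest() again assumes a recent grf release.

```r
# Illustrative low-dimensional comparison: lasso vs. local linear regression
# vs. local linear forest on a placeholder polynomial signal.
library(glmnet)
library(grf)

set.seed(1)
n <- 1000; sigma <- 1
mu <- function(x) 4 * (x[1] - 0.5)^2 + 2 * x[2]   # placeholder polynomial signal
X  <- matrix(runif(n * 2), n, 2)
Y  <- apply(X, 1, mu) + sigma * rnorm(n)

Xt    <- matrix(runif(1000 * 2), 1000, 2)
truth <- apply(Xt, 1, mu)
rmse  <- function(pred) sqrt(mean((pred - truth)^2))

lasso <- cv.glmnet(X, Y)
llr   <- loess(Y ~ x1 + x2, data = data.frame(x1 = X[, 1], x2 = X[, 2], Y = Y),
               degree = 1, control = loess.control(surface = "direct"))
llf   <- ll_regression_forest(X, Y)

c(Lasso = rmse(predict(lasso, newx = Xt, s = "lambda.min")),
  LLR   = rmse(predict(llr, newdata = data.frame(x1 = Xt[, 1], x2 = Xt[, 2]))),
  LLF   = rmse(predict(llf, Xt)$predictions))
```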

Appendix B Remaining Proofs

Summary of Supporting Results

We now give a proof of Theorem 1, beginning by decomposing the error into a bias term and a variance term, the latter of which we will approximate by a regression forest. Define the diameter of a tree leaf (and the corresponding radius) as the length of the longest line segment that can fit completely inside the leaf. Thanks to our assumed uniform bound on the second derivative of the conditional mean function, a Taylor expansion around the test point yields the following decomposition,

(30)

Here the leading term involves the average squared radius of the leaves in the forest. We have isolated two error terms

(31)
(32)

For simplicity, moving forward we suppress the explicit dependence of the forest weights on the test point.
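
For concreteness, here is a schematic version of the Taylor bound that such a decomposition relies on. It assumes, as the text does, that the second derivatives of the conditional mean μ are uniformly bounded, say by a constant C, and it writes α_i for the forest weight on the i-th observation; this is a sketch of the standard argument, not a reproduction of equation (30).

```latex
% Schematic second-order Taylor bound; C is the assumed uniform bound on the
% second derivatives of the conditional mean mu, and alpha_i are forest weights.
\begin{align*}
\bigl|\mu(X_i) - \mu(x) - (X_i - x)^\top \nabla\mu(x)\bigr|
  &\le \tfrac{1}{2}\, C \,\|X_i - x\|_2^2, \\
\Bigl|\sum_{i=1}^n \alpha_i \bigl\{\mu(X_i) - \mu(x) - (X_i - x)^\top \nabla\mu(x)\bigr\}\Bigr|
  &\le \tfrac{1}{2}\, C \sum_{i=1}^n \alpha_i \,\|X_i - x\|_2^2 .
\end{align*}
```

Since the forest weights are supported on leaves containing the test point, the right-hand side is of the order of the average squared leaf radius, which is the quantity controlled next.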

To control the radius of the leaf containing the test point (and thus the Taylor error in (30)), we use Lemma 2 of Wager and Athey [2018], which states the following: let a regular, random-split tree be grown on a subsample of independently drawn training points, and consider the radius of its leaf containing the test point. Then, for any choice of the lemma's constants and for a large enough subsample size,

(33)

Choosing the constants in this bound appropriately and substituting them above, we obtain, for a sufficiently large subsample size,

(34)

Next, to control the variance term, we show that it concentrates around its expectation. The proof of Lemma 2 uses concentration bounds for U-statistics given by Hoeffding [1963].
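
For reference, the basic inequality from Hoeffding [1963] reads as follows for a sum of independent random variables Z_i with a_i ≤ Z_i ≤ b_i; the same paper also gives the U-statistic version relevant for subsample-based estimators.

```latex
% Hoeffding's inequality for sums of independent bounded random variables.
P\!\left(\Bigl|\sum_{i=1}^n Z_i - \mathbb{E}\Bigl[\sum_{i=1}^n Z_i\Bigr]\Bigr| \ge t\right)
  \;\le\; 2\exp\!\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).
```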

Lemma 2.

Let the training observations be independent and identically distributed. Let the forest weights be computed from trees grown on subsamples of the training data, with leaf radius bounded as in (34). Then,