Random Forests of Interaction Trees for Estimating Individualized Treatment Effects in Randomized Trials

09/14/2017 ∙ by Xiaogang Su, et al. ∙ 0

Assessing heterogeneous treatment effects has become a growing interest in advancing precision medicine. Individualized treatment effects (ITE) play a critical role in such an endeavor. Concerning experimental data collected from randomized trials, we put forward a method, termed random forests of interaction trees (RFIT), for estimating ITE on the basis of interaction trees (Su et al., 2009). To this end, we first propose a smooth sigmoid surrogate (SSS) method, as an alternative to greedy search, to speed up tree construction. RFIT outperforms the traditional `separate regression' approach in estimating ITE. Furthermore, standard errors for the estimated ITE via RFIT can be obtained with the infinitesimal jackknife method. We assess and illustrate the use of RFIT via both simulation and the analysis of data from an acupuncture headache trial.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Precision medicine aims to optimize the delivery of stratified or individualized therapies by integrating comprehensive patient data. This emerging approach has become a growing interest in many biomedical applications. To advance precision medicine, it is crucial to understand the differential effects of a treatment as opposed to its overall main effect in the conventional practice of medical decisions.

There are many approaches in this endeavor; see  Lipkovich, Dmitrienko, and D’Agostino (2016)

for a recent survey. Among them, tree-based methods are dominant for several reasons. Built simply on the basis of a two-sample test statistic, trees facilitate a powerful comprehensive modeling by recursively grouping data. Differential treatment effects essentially involve treatment-by-covariates interactions, which may be of nonlinear forms and of high orders. Trees excel in dealing with complex interactions. Tree models are capable of handling high-dimensional covariates of mixed types and an off-the-shelf tool in the sense that minimal data preparation is required.

Interaction trees (IT; Su et al., 2009) extend tree procedures to subgroup analysis by explicitly assessing the treatment-by-covariate interaction. Foster, Taylor, Ruberg (2011) identifies subgroups by estimating the potential outcomes, which they rebranded as ‘virtual twins’. Another approach, SIDES (Subgroup Identification based on Differential Effect Search) proposed by Lipkovich et al. (2011), addresses issues such as subgroups with enhanced treatment effects, taking into account both efficacy and toxicity. QUINT (QUalitative INteraction Trees; Dusseldorp and van Mechelen, 2014) focuses on qualitative interactions. Loh, He, and Man (2015) proposes a tree procedure for identifying subgroups that is less prone to biased variable selection. The optimal treatment regime (Murphy, 2003) offers an alternative way of looking at the problem. Along this direction, tree-based approaches are also common; see, e.g., (Zhang et al., 2012) and (Laber and Zhao, 2015).

There are typically two types of precision medicine: stratified medicine and personalized medicine. The aforementioned methods belong to the former scope, with focus on stratified treatment effects or regimes where groups of individuals showing homogeneous treatment effects are sought. That said, individualized treatment effects (ITE) are of key importance in deploying tailored treatment plans as part of personalized medicine. The ITE also affords deeper study of treatment efficacy. Furthermore, ITE estimation is a necessary first step for a number of methods used in stratified medicine and optimal treatment regime; see, e.g., Foster, Taylor, Ruberg (2011), Zhang et al. (2012), and Laber and Zhao (2015).

The focus of this article is on the estimation of ITE with data collected from randomized trials. We examine an ensemble learning approach that we coin as RFIT for random forests on the basis of interaction trees (Su et al., 2009). Our methodological contribution is twofold: first, we introduce a faster alternative splitting method, called smooth sigmoid surrogate (SSS), to speed up IT; second, we extend the infinitesimal jackknife method (Efron, 2014) to compute the standard errors for ITE estimates. Moreover, we compare our proposed approach to the commonly applied separate regression (SR). RFIT is superior to SR by working on an easier problem. We demonstrate the outperformance of RFIT over SR in estimating ITE via extensive numerical experiments.

The remainder of the article is organized as follows. In Section 2, we first introduce the concept of ITE within Rubin’s causal model framework. RFIT with SSS splitting for estimating ITE and the standard error formula for estimated ITE are then presented in detail. Section 3 contains simulation experiments that are designed to investigate the performance of SSS in splitting data, to compare RFIT with the conventional separate regression approach, and to demonstrate the validity of the SE formula. We illustrate our proposed RFIT approach with data from an acupuncture headache trial in Section 4.

2 Random Forests of Interaction Trees (RFIT)

Concerning randomized trials, consider data consisting of IID copies of where is the continuous response or outcome for the -th subject, is the binary treatment assignment indicator: 1 for the treated group and 0 for control, and is a

-dimensional covariate vector of mixed types.

The Neyman–Rubin causal model (see, e.g., Neyman, 1923; Rubin, 1974 & 2005) provides a way of finely calibrating the causal effect of treatment on the response via the concept of potential outcomes. Let and denote the response values for a subject when assigned to the treated and the control group, respectively. Either or , but not both, can be observed. The observed outcome is given by Within this framework, the treatment effect can be evaluated at three levels: the population level , the subpopulation level for a subset and the unit or subject level These three levels form a hierarchy of causal inference in increasing order of strength, in the sense that a lower-level inference can be obtained from that of an upper-level inference, but not vice versa. Let be a generic notation for treatment effect.

Definition 1.

The individualized treatment effect (ITE) is defined as

Note that is different from the (random) unit-level effect . Strictly speaking, is a subpopulation-level effect among individuals with Nevertheless, is the finest approximation to the unit-level inference that is possibly available in practice.

Causal inference is essentially concerned with estimating at different levels through the available data The difficulty in causal inference stems primarily from the convoluted roles (e.g., confounder, effect modifier or moderator, or mediator) played by each covariate in For experimental data from trials with random treatment assignment mechanisms, is independent of other variables. As a result, the unconfoundedness condition (Rubin, 2005), being sufficient for obtaining population-level inference from , is trivially met. Randomization renders the confounding issue of little concern; however, covariate modification to the treatment effects remains across both the subpopulation and unit levels.

Interaction trees (IT; Su et al., 2009) seek subgroups with heterogeneous treatment effects by following the paradigm of CART (Breiman et al., 1984); hence IT supplies causal inference at the subpopulation level. Nevertheless, results from IT can be building blocks for inferences at other levels: one has the flexibility to move backward to the overall effect estimation by integration and move forward to ITE via ensemble learning. The main objective of this article is to examine the use of random forests of interaction trees (RFIT) in estimating Random forests (Breiman, 2001) is an ensemble learning method, constructing a collection of tree models and integrating results across the tree models. Among its many merits, RF is among the top-performers in predictive modeling and provides many useful features such as proximity matrix, variable importance ranking, and partial dependence plot (Liaw and Wiener, 2002).

2.1 SSS for Identifying the Best Cutoff Point

To extend random forests on the basis of interaction trees, one essential ingredient is the splitting statistics. IT bisects data by maximizing the difference in treatment effects between two child nodes. A split on data is induced by a binary variable of general form

that applies a threshold on covariate at cutoff point ; recall that nominal variables can be made ordinal by sorting its levels according to the treatment effect estimate at each level (see Appendix A of Su et al., 2009). Any binary split results in the following table, where denotes the number of treated subjects in the left child node; denotes the sample mean response for treated subjects in the left child node; and similarly for other notations.

Child Node
Treatment Left Right
0 (, ) (, )
1 (, ) (, )

The splitting statistic in IT can be based on the Wald test for in the interaction model:


where The least squares estimate of is given by , corresponding to the concept of ‘difference in differences’ (DID). The resultant Wald test statistic amounts to




is the pooled estimator of measures the difference in treatment effects between two child nodes. With the conventional greedy search (GS) approach, the best cutoff point for is It is worth noting that minimizing the least squares (LS) criterion with Model (1) does not serve well in IT. A cutoff point can yield the minimum LS criterion merely for its strong additive effect (i.e., associated with ).

GS evaluates the splitting measure at every possible cutoff point for . This can be slow when the number of cutoff points to be evaluated is large, even though GS can be implemented by updating the computation of for neighboring ’s. Furthermore, this discrete optimization procedure yields erratic measures, as exemplified by the orange line in Figure 1(b). As a result, GS may mistakenly select a local spike due to large variation. These deficiencies motivate us to consider a smooth alternative to GS. Noting that the discreteness steps from the threshold indicator function involved in many components of the splitting statistics, our approach is to approximate

with a smooth sigmoid function. For this reason, we call the method ‘smooth sigmoid surrogate’ or SSS in short. While many sigmoid functions can be used, it is natural to consider the logistic or expit function


where is the cutoff point and is a shape or scale parameter. Figure 1(a) depicts the expit function at for different values.

To approximate , we start with approximating with for and as follows:

where approximates , is the total number of treated individuals, and is the total number of untreated individuals. Let denote the associated sum of responses values, which can be approximated in a similar manner:

where is the sum of response values for all treated individuals and similarly for the untreated. Note that quantities , , , and do not involve the split variable and can be computed beforehand. It follows that for and . Next, bringing into (3) yields its approximation Finally, plugging all the approximated quantities into in (2) yields


Now is a smooth objective function for only and can be directly maximized to obtain the best cutoff point

Besides , there is a scale parameter involved in given by (5). As shown by simulation in Section 3, the performance of the SSS method is quite robust with respect to the choice of for a wide range of values. Thus can be fixed a priori. In order to do so, we standardize the predictor , where

denote the sample mean and standard deviation of variable

, respectively. For standardized covariates, we recommend fixing a value in [10, 50]. With fixed , the best cutoff point can be obtained by maximizing with respect to and then transformed back to the original data scale for interpretability. This is a one-dimensional smooth optimization problem, which can be conveniently solved by many standard optimization routines. We use the Brent (1973) method available in the R (R Core Team, 2017) function optimize in our implementation. Given the nonconcave nature of the maximization problem, further techniques such as multi-start or partitioning the search range may be used in combination with Brent’s method. However, as shown in our numerical studies, a plain application of Brent’s method, without further efforts for locating the global optimum, works quite effectively in estimating the cutpoint.

SSS smooths out local spikes in GS splitting measures and hence helps signify the true cutoff point; see Figure 1(b) for one example. Our simulation in Section 3 shows that SSS outperforms GS in most scenarios, especially when dealing with weak signals. Another immediate advantage of SSS over GS is computational efficiency. The following proposition provides an asymptotic quantification of the computational complexity involved in both GS and SSS splitting.

Proposition 2.1.

Consider a typical data set of size in the interaction tree setting, where both GS and SSS are used to find the best cutoff point for a continuous predictor with distinct values. In terms of computation complexity, GS is at best with the updating scheme and without the updating scheme while SSS is , where is the number of iterations in Brent’s method.

A proof of Proposition 2.1 is relegated to the Supplementary Materials. Implementation of tree methods benefits from incremental updating; see, e.g., LeBlanc and Crowley (1993) and Utgoff, Berkman, and Clouse (1997). However, it is a common wrong impression that the GS splitting with updating is only of order Updating the IT splitting statistic entails sorting the response values according to the values within both treatment groups. It turns out that this sorting step would dominate the algorithm in complexity asymptotically with a rate of Comparatively, SSS depends on the number of iterations in Brent’s method, Although the number of iterations is affected by the convergence criterion and the desired accuracy, is generally small since Brent’s method has guaranteed convergence at a superlinear rate. Based on our limited numerical experience, rarely gets over 15 even for large . In other words, the rate for SSS essentially amounts to the linear rate

2.2 Estimating ITE via RFIT

RFIT follows the standard paradigm of random forests (Breiman, 2001). Take a bootstrap sample from data and then construct an IT using To split a node, a subset of covariates are randomly selected and the best cut for each covariate is identified and compared to determine the best split of data. The step is iterated till a large tree is grown. Each terminal node in is summarized with an estimated treatment effect , which is simply the difference in mean response between treated and untreated individuals falling into , i.e.,

where is the number of treated individuals in that fall into and for the untreated.

The entire tree construction procedure is then repeated on a number of bootstrap samples, which results in a sequence of bootstrap trees For each tree , an individual with covariate vector would fall into one and only one of its terminal node, which we denote as Letting the ITE for this individual can then be estimated as


Efron (2014) discussed methods for computing standard errors for bootstrap-based estimators and advocated the use of infinitesimal jackknife (IJ). The IJ approach is found preferable in random forests, as further explored by Wager, Hastie, and Efron (2014). Proposition 2.2 applies the IJ method to obtain a standard error formula for estimated ITE Its proof is outlined in the Supplementary Materials.

Proposition 2.2.

The IJ estimate of variance of

is given by


where and with being the number of times that the -th observation appears in the -th bootstrap resample. In other words, the quantity is the bootstrap covariance between and In practice, is biased upwards, especially for small or moderate . A bias-corrected version is given by


Further assuming approximate independence of and , another bias-corrected version is given by


which is easier to compute than (8).

The validity of these SE formulas will be investigated by simulation in Section 3. The bias-corrected SE formulas in (8) and (9) generally yield very similar results with superior performance to the uncorrected version (7). Note that computing (8) entails evaluation of the matrix at each different . Therefore, the SE given in (9) is recommended for its enhanced computational efficiency.

2.3 Comparison with SR

Under the potential outcome framework, separate regression (SR) is conventionally used to estimate ; see, e.g., van der Laan, Polley, Hubbard (2006) and Foster, Taylor, Ruberg (2011). The basic idea is to build a model based on data for treated individuals only to estimate and build a model based on data for untreated individuals only to estimate Let and denote the resultant estimates of and , respectively. Then ITE can be estimated as


Since SR essentially involves predictive modeling, random forests (Breiman, 2001) are commonly used in the literature.

We would like to argue that RFIT is superior to SR. This is primarily because RFIT works on a simpler problem. To explain, consider the model form where Functions and may involve different sets of covariates. In the clinical setting, covariates showing up in only are called prognostic factors while covariates showing up in are called predictive factors (see, e.g., Ballman, 2015). In other words, predictive factors interact with the treatment and hence cause differential treatment effects. In SR, both and have to be estimated to have the difference ; thus it has to take both prognostic and predictive factors into consideration. Comparatively, RFIT estimates directly by focusing on predictive factors only. This is because a prognostic factor won’t cause a difference in differences, referring to its splitting statistic in (2). In the following, we introduce a performance measure for RFIT and SE in estimating ITE and a theoretical understanding of the measure is attempted.

Both RFIT and SR take the bootstrap-based ensemble learning approach; the ITE estimates in (6) and in (10) involve randomness owing to bootstrap resampling, the current data , and the point at which the estimation is made. To compare RFIT with SR, we consider an average mean squares error (AMSE) measure defined by


where the expectation is taken with respect to the bootstrap distribution given the current data , the sampling distribution of data , and then the distribution of



where is the RFIT estimate of obtained with perfect bootstrap or and is the perfect bootstrap RFIT estimate if, furthermore, we are allowed to recollect data freely. Similarly, we define on the basis of and on the basis of in SR. In addition, define


and similarly Proposition 2.3 provides a decomposition of AMSE for the ITE estimate by RFIT and for by SR.

Proposition 2.3.

For the RFIT estimate in (6),


For the SR estimate in (10),

AMSE (15)

The first term of AMSE in (14) corresponds to Monte Carlo variation resulted from using a finite number of bootstrap samples. The second term represents the sampling variation owing to lack of endless supply of training data in reality. The third term is the bias. Similar interpretation holds true for the terms in (15), yet with an additional covariance term

Ensemble learners such as RF and bagging aim for variance reduction by imitating the endless supply of replicate data via bootstrap resampling. This is why we have the additional decomposition

in (14); similarly for and in (15). However, ensemble learning has little effect on the bias term in (14); similarly for the two bias terms in (15) as well as the covariance term The bias problem for ensemble learners such as random forests has been noted by Breiman (1999) and others. From another perspective, RF facilitates a smoothing procedure by averaging data over an adaptive neighborhood; as a result, it cuts the hill and fills the valley.

While both RFIT and SR would suffer from certain bias, the AMSE in SR tends to be larger than that of RFIT in general as we shall demonstrate numerically in Section 3. Numerical evidence shows that SR is more prone to the bias problem because it tends to underestimate a large ITE and overestimate a small ITE. In fact, such a bias also has an effect on the last covariance term in (15). A large ITE occurs when is large and/or is small. The smoothing effect yields with cut hills and with filled valleys. Thus tends to be negative. A similar observation holds for a small ITE, which occurs when is small and/or is large. As a result, the last term in (15) tends to be negative, leading to a more inflated AMSE for SR.

3 Simulation Studies

This section presents results from simulation studies designed to compare the smooth sigmoid surrogate (SSS) splitting method with greedy search (GS) in finding the best cutoff point; compare RFIT with separate regression (SR) in estimating the individualized treatment effects (ITE); and investigate the standard error (SE) formulas for the estimated ITE.

3.1 Comparison of SSS versus GS

To compare SSS with GS, we generated data from model


where , , and We considertwo sample sizes and , corresponding to relatively fewer and larger numbers of observations in a node. For each simulated data, both GS and SSS are used to identify the best cutoff point Different values are used in SSS. For each model configuration, 500 simulation runs are made.

Figure 2 presents the empirical density and the MSE measure, defined as , for the estimated cutoff point by SSS and GS. It can be seen that SSS compares favorably to GS in terms of MSE. From the empirical density plots, it can be seen that one important contribution made by SSS is the reduced variation as compared to GS. In most scenarios, SSS shows considerable stability with respect to the choice of the shape parameter , especially with relatively larger values. Too small an value can result in deteriorated performance and hence is not advisable. It seems desirable to balance a more accurate approximation to the indicator function with a relatively larger value and the more smooth objective function for optimization with a relatively smaller value. We would like to comment that estimating is not a good idea in the tree setting for several reasons: tree models seek threshold effects which entail a relatively large value; estimating unnecessarily slows down the computation at each node split; having a different for each covariate will make the results less comparable across covariates in order to find the best split. Recall that SSS works with standardized and transfers back to its original scale; we recommend fixing in SSS for standardized based on our more extensive numerical explorations that are not presented here. Henceforth, SSS with is used by default in the RFIT implementation.

To compare the computing time, we generated data from the same model; a slight modification was made in the way of simulating the covariate:

follows a discrete uniform distribution over

so that has a total of distinct values. This allows us to investigate the computing time with different . The choices for and are and Table 1 tabulates the computing time in seconds for GS and SSS splitting, averaged over 10 simulation runs for each setting. It can be seen that SSS is superior to GS in terms of computational efficiency. As expected, it takes longer for both GS and SSS as increases in general. It takes longer for GS as increases; but this is not the case for SSS.

3.2 Comparison of RFIT versus SR

We compare SSS to SR in estimating ITE. The data are generated with the following scheme: first simulate five () predictors for ; then we generate with a nonlinear polynomial

and and being independent from next, we generate where and is independent of both and . A random effect term is introduced to mimic some common characteristics shared by repeated measures and taken from the same subject. The unit-level effect equals , where represents additional random errors that can not be accounted for by covaraites Four models (I)–(IV) are considered for the ITE , as given below:

Model I: (17)
Model II: (18)
Model III: (19)
Model IV: (20)

Model I exemplifies a linear ITE; Model II represents a tree-structured model; Model III & IV are derived from two nonlinear models in Friedman (1991). Finally, we simulate the randomized treatment assignment variable independently from Bernoulli(0.5) and hence the observed response

For each training data set , both RFIT and SR are used to learn a model on ITE. In order to evaluate their performance, a test sample of size is generated beforehand. The ITE models trained with RFIT and SR in each simulation are applied to estimate the ITE for and a mean square error (MSE) measure is computed. Two sample sizes and are considered for the training data and a total of 200 simulation runs is used for each simulation setting.

Figure 3 plots the parallel boxes of MSE measures from 200 simulation runs for RFIT and SR. The averages are highlighted with blue bars, corresponding to estimates of the AMSE in (11). It can be seen that RFIT outperforms SR consistently in all the scenarios considered here. Again, the superiority of RFIT can be explained by the fact that it works on an easier task than SR by estimating directly. Additional numerical insight into the bias problem is provided by plotting the estimated ITE (averaged over 200 simulated runs) versus the actual ITE . See Section B of the Supplementary Material.

3.3 Standard Error Formulas

To investigate the validity and performance of the standard error (SE) formulas for estimated ITE, we generated training data sets of size from Model III in (19) and one test data set of size For each training data set , bootstrap samples is used to train RFIT and then the trained RFIT is applied to estimate ITE for each observation in together with standard errors. We repeat the experiment for 200 simulation runs. At the end of the experiment, we have 200 predicted ITE for each observation in , together with 200 SEs. Accordingly, we compute the standard deviation (SD) of these ITE estimates and average the SE values. If the SE formula works well, the SE values should be close to their corresponding SD values.

Figure 4 plots the averaged SE versus SD for each observation in the test sample . It can be seen that the uncorrected standard errors are overly conservative. After the bias-correction, they become reasonably close to the SD values. The bias-corrected SE presented here is computed from (9). The other version (8) that is somewhat harder to compute provides very similar results, which have been omitted from the plotting.

We experimented with other models in Section 3.2 and similar results were obtained. One issue pertains to the number of bootstrap samples needed. According to Efron (2014), a large , e.g., is needed to guarantee the validity of IJ-based standard errors. We experimented with different values. Generally speaking, ITE estimation stabilizes quickly even with a small , e.g., ; however, negative values may frequently occur to the bias-corrected variance estimates in both (8) and (9) when is small or moderate, e.g., Thus a large number of bootstrap samples are needed to have sensible results for the SE formulas.

4 Application: Acupuncture Trial

For further illustration of RFIT, we consider data collected from a acupuncture headache trial (Vickers et al., 2004), available at


In this randomized study, 401 patients with chronic headache, predominantly migraine, were randomly assigned either to receive up to 12 acupuncture treatments over three months or to a control intervention offering usual care. Among many other measurements, the primary end point of the trial is the change in headache severity score from baseline to 12 months since study entry. The acupuncture treatment was concluded effective overall by significantly bringing down the headache score and other outcome measures. More details of the trial and the results are reported in Vickers et al. (2004).

To apply RFIT, we consider only the 301 participants who completed the trial. The response variable is taken as the difference in headache severity score between baseline and 12 months, while the score at baseline is treated as a covariate. A total of 18 covariates are included in the analysis, which are essentially demographic, medical, or treatment variables measured at baseline. See Table

I in the Supplementary Materials for a brief variable description.

A total of trees are used to build RFIT, where the scale parameter is set as in SSS splitting. ITE is estimated for each individual in the same data set and the IJ-based standard error (SE) with bias correction is also computed. Figure 5 provides a bar plot of the estimated ITE, plus and minus one SE, sorted by ITE. It can be seen that a majority of ITE are above 0, indicating the effectiveness of acupuncture. Overall speaking, the treatment effects in this trial show certain heterogeneity, but not by much. It is interesting to note that the averaged ITE is 3.9. Comparatively, the unadjusted mean difference in headache score is 6.5 while the adjusted effect from ANCOVA is 4.6, as reported in Table 2 of Vickers et al. (2004). Figure 5 also shows many individuals, for whom the acupuncture treatment did not help much, including two individuals, the 44-th (with patient ID 222) and the 224th (with patient ID 630). Both are female patients aged 60 and 58, suffer migraine, and were assigned to the control group, but surprisingly achieved a reduction of 36 and 29.75 in severity score, respectively. Their initial severity scores are also relatively similar: 44.25 and 37. Their estimated ITEs turn out to be and , indicating a detrimental effect from acupuncture. Although the performances of these two patients are quite unusual relative the the remainder of the patients, they may indicate a small subgroup that is worth further investigation.

5 Discussion

We have tackled the problem of estimating individualized treatment effect (ITE) by using the random forests of interaction trees (RFIT). Smooth sigmoid surrogate (SSS) splitting is introduced to speed up RFIT and possibly improve its performance. We have also applied the infinitesimal jackknife method to derive a standard error for the estimated ITE. Altogether, RFIT provides enlightening results in deploying personalized medicine by informing a new patient about the potential efficacy of the treatment on him/her.

According to our numerical experiments, RFIT outperforms the commonly used separate regression (SR) approach for estimating ITE. SR estimates the potential outcomes separately and then takes difference. In RFIT, however, we first group individuals so that those with similar treatment effects are put together and then estimate the treatment effect by taking differences within each group. Comparatively, RFIT focuses on predictive covariates and estimation of ITE directly while SR has to deal with prognostic covariates and works on a harder problem. Since SR has been widely used as an intermediary step in other causal inference procedures, our method might contribute to their improvement as well.

To conclude, we identify several avenues for future research. First of all, our discussion has been restricted to data from randomized experiments. Assessing treatment effects with data from observational data can be very different, entailing adjustment for potential confounders. See, e.g., Su et al. (2012) and Wager and Athey (2016). Secondly, the standard error formula provides some assessment for precision in estimating ITE; however, issues such as consistency of RFIT, asymptotic normality of estimated ITE (see comments in Efron, 2014) and multiplicity have not been thoroughly addressed as of yet. Thirdly, besides variable importance ranking, several other useful features from random forests including partial dependence plots and proximity matrix (Liaw and Wiener, 2002) have yet to be explored for RFIT.


XS was partially supported by NIMHD grant 2G12MD007592 from NIH; LL was partially supported by AHRQ grant HS 020263; RL was supported in part by NSF grant 1633130.


  • Ballman (2015) Ballman, K. V. (2015). Biomarker: Predictive or Prognostic? Journal of Clinical Oncology, 33: 3968–3971.
  • Breiman (1999) Breiman, L. (1999). Using adaptive bagging to debias regressions. Technical Report # 547, Department of Statistics, University of California at Berkely.
  • Breiman (2001) Breiman, L. (2001). Random Forests. Machine Learning, 45: 5–32.
  • Breiman et al. (1984) Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth International Group.
  • Brent (1973) Brent, R. (1973). Algorithms for Minimization without Derivatives. Englewood Cliffs, NJ: Prentice-Hall.
  • Dusseldorp and van Mechelen (2014) Dusseldorp, E. and van Mechelen, I. (2014). Qualitative interaction trees: a tool to identify qualitative treatment-subgroup interactions. Statistics in Medicine, 33: 219–237.
  • Efron (1982) Efron, B. (1982). The jackknife, the bootstrap and ohter resampling plans. CBMS-NSF Regional COnference Series in Applied Mathematics 38. Philadelphia, PA: Society for Industrial and Applied Mathematics (SIAM).
  • Efron (2014) Efron, B. (2014). Estimation and accuracy after model selection (with discussion). Journal of the American Statistical Association, 109: 991–1007.
  • Foster, Taylor, Ruberg (2011) Foster, J. C., Taylor, J. M. C., and Ruberg, S. J. (2011). Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30: 2867–2880.
  • Friedman (1991)

    Friedman, J. H. (1991). Multivariate adaptive regression splines.

    Annals of Statistics, 19: 1–67.
  • Imai and Ratkovic (2013) Imai, K. and Ratkovic, M. (2013). Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7: 443–470.
  • Laber and Zhao (2015) Laber, E. B. and Zhao, Y. Q. (2015). Tree-based methods for individualized treatment regimes. Biometrika, 102: 501–514.
  • LeBlanc and Crowley (1993) LeBlanc, M. and Crowley, J. (1993). Survival trees by goodness of split. Journal of the American Statistical Association, 88: 457–467.
  • Liaw and Wiener (2002) Liaw, A. and Wiener, M. (2002). Classification and Regression by randomForest. R News, 2/3: 18–22.
  • Lipkovich, Dmitrienko, and D’Agostino (2016) Lipkovich, I., Dmitrienko, A., D’Agostino, R. B. (2017). Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in Medicine, 36: 136–196.
  • Lipkovich et al. (2011) Lipkovich, I., Dmitrienko, A., Denne, J., and Enas, G. (2011). Subgroup identification based on differential effect search (SIDES): a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in Medicine, 30: 2601–2621.
  • Loh, He, and Man (2015) Loh, W.-Y., He, X., and Man, M. (2015). A regression tree approach to identifying subgroups with differential treatment effects. Statistics in Medicine, 34: 1818–1833.
  • Murphy (2003) Murphy, S. A. (2003). Optimal dynamic treatment regimes (with discussion). Journal of the Royal Statistical Society, Series B, 65: 331–366.
  • Neyman (1923)

    Neyman, J. (1923). On the application of probability theory to agricultural experiments.

    Essay on Principles, Section 9. Statistical Science, 5: 465–472, 1990. Translated by Dorota M. Dabrowska and Terence P. Speed.
  • R Core Team (2017) R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
  • Rosenbaum and Rubin (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70: 41–55.
  • Rubin (1974) Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology, 66: 688–701.
  • Rubin (2005) Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100: 322–331.
  • Shen and He (2015) Shen, J. and He, X. (2015). Inference for Subgroup Analysis with a Structured Logistic-Normal Mixture Model. Journal of the American Statistical Association, 110: 303–312.
  • Su et al. (2012) Su, X., Kang, J., Fan, J., Levine, R., and Yan, X. (2012). Facilitating score and causal inference trees for large observational data. Journal of Machine Learning Research, 13: 2955–2994.
  • Su et al. (2009) Su, X., Tsai, C.-L., Wang, H., Nickerson, D. M., and Li, B. (2009). Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10: 141–158.
  • Utgoff, Berkman, and Clouse (1997)

    Utgoff, P. E., Berkman, N. C., and Clouse, J. A. (1997). Decision tree induction based on efficient tree restructuring.

    Machine Learning, 29: 5–44.
  • Wager, Hastie, and Efron (2014)

    Wager, S., Hastie, T., and Efron, B. (2014). Confidence intervals for random forests: The jackknife and the infinitesimal jackknife.

    Journal of the Machine Learning Research, 15: 1625–1651.
  • Wager and Athey (2016) Wager, S. and Athey, S. (2016). Estimation and inference of heterogeneous treatment effects using random forests. arXiv preprint, arXiv:1510.04342v3.
  • van der Laan, Polley, Hubbard (2006) van der Laan, M., Polley, E., and Hubbard, A. (2007). Super Learner. Statistical Applications in Genetics and Molecular Biology, 6(1).
  • Vickers et al. (2004) Vickers, A. J., Rees, R. W., Zollman, C. E., McCarney, R., Smith, C., Ellis, N., Fisher, P., and Van Haselen, R. (2004). Acupuncture for chronic headache in primary care: large, pragmatic, randomised trial. British Medical Journal, Primary Care, 328: 744–749.
  • Zhang et al. (2012) Zhang, B., Tsiatis, A. A., Davidian, M., Zhang, M., and Laber, E. (2012). Estimating optimal treatment regimes from a classification perspective. STAT, 1: 103–114.

Appendix A Proofs

This section contains proofs to the propositions.

a.1 Proof of Proposition 2.1

When the GS splitting is conducted without updating, GS evaluates the splitting criterion for every distinct splitting points of . For each , computation of is Note that extracting the unique values of is in general. Therefore, the total complexity amounts to in this case, where is the number of distinct values of Since is continuous with the complexity becomes

GS can be alternatively done via an updating formula. This entails sorting the and values in the ascending (or descending) order of distinct values so that the value at cutoff point can be conveniently obtained by utilizing its previous value at the neighboring cutoff point. The computation involved in the updating step itself is negligible and does not escalate the computational complexity level asymptotically. Nevertheless, a stable algorithm for general-purpose sorting is at best For example, the sort function in R is either if Shellsort is used or if the Quicksort method is used.

For SSS, each iterative step in Brent’s method involves evaluation of , which is SSS requires standardization of and transformation of back to the original scale, both operations being . Put together, its complexity is , where is the number of iterative steps in the optimization algorithm.

a.2 Proof of Proposition 2.2

The proof essentially follows Efron (2014) and Wager, Hastie, and Efron (2014), with more details added and some rewriting mainly for our own understanding. The arguments are based on the the ‘ideal bootstrap’ and treat the current data as fixed, where corresponds to all possible choices when taking a bootstrap sample from

Let denote the -th re-sample for Introduce random vector where counts the frequency that the -th observation in shows up in with It follows that with Fixing , completely determines Thus, the ITE estimate based on can be written as a function of The RFIT estimate with ideal bootstrap is the expectation of , i.e., Since the distribution of is fully determined by the probability vector , rewrite as a function of if more generally . The symbol is used to denote this function, for ensemble learners such as are essentially obtained via ‘bootstrap smoothing’.

Define the influence function


for , where with 1 in the th place and 0 elsewhere is a special case of that assigns mass 1 to the -th observation. Namely, the influence function is the directional derivative of at in the direction of , which essentially inspects the effect of an infinitesimal contamination at the th observation on the estimator.

The infinitesimal jackknife method (Efron, 1982) provides an estimate of variance for given by


Thus it remains to compute

For a multinomial probability , consider


where is the ratio of the product of probabilities in to that in Note that the multinomial coefficient does not show up in the summand of (A.3) since all choices of bootstrap samples are listed in the sum, in which case some of the bootstrap samples can be identical when ignoring the order of observations.


with in the -th position and elsewhere, we have, letting ,

where the second step uses the fact for and any constant obtained from the bionomial series and the fourth step ignores a second-order term of Bringing into (A.3), it follows that

using the fact that for any real-valued pairs According to its definition in (A.1), it is clear that Bringing it back into (A.2) yields the needed result for in (7). An alternative way of deriving is via Hájek projection as in Wager, Hastie, and Efron (2014).

In practice, a total of finite bootstrap sample is taken instead, which makes subject to additional Monte Carlo noise. Following the bias correction step suggested by Efron (2014), let denote the value obtained from bootstrap samples, as opposed to obtained from the ideal bootstrap. Thus has bootstrap mean and bootstrap variance where denote the bootstrap variance of It follows that

where denotes bootstrap expectation. Therefore, a bias-corrected version for is given by


where is replaced with its estimate This gives the expression in (8).

Wager, Hastie, and Efron (2014) further assumed that and in are approximately independent. Under this assumption, we have