1 Introduction
In many scientific problems, a primary goal is to discover structures which allow the problem to be described parsimoniously. For example, one may wish to find a small subset of candidate variables that are predictive of a response of interest; this structure is referred to as sparsity. Another structure is interaction (or additive) structure. An extreme case of additive structure is a generalized additive model (see, e.g., Hastie, 2017), where the effects of the predictors combine additively without any interactions. Teasing out additive structures can be valuable because it can substantially simplify the interpretation of a model. For example, if a given predictor does not interact with other predictors then it can be interpreted in isolation without reference to the values of other predictors. When predictors do interact, interpretation of the interactions is typically simplified whenever the interactions are of loworder. We consider the nonparametric regression problem , , where is a response of interest and
is a vector of predictors, however the methods we develop here can be easily extended to many other settings. The variables
and are said to interact if cannot be written as where and do not depend on and respectively. One can define higher order interactions similarly: a group of variables is said to have a way interaction if cannot be decomposed as a sum of or fewer functions, each of which depends on fewer than of the variables.Methods which estimate
using an ensemble of Bayesian decision trees have proven useful in a number of statistical problems. Beginning with the seminal work of Chipman et al. (2010), Bayesian additive regression trees (BART) have been successfully applied in a diverse range of settings including survival analysis (Sparapani et al., 2016), causal inference (Hahn et al., 2017), variable selection in high dimensional settings (Linero, 2016; Bleich et al., 2014), loglinear models (Murray, 2017), and analysis of functional data (Starling et al., 2018). A key motivating factor for the use of BART is precisely that it is designed to taking advantage of loworder interactions in the data. Indeed, Linero and Yang (2017) and Rockova and van der Pas (2017) illustrate theoretically that the presence of loworder interactions is precisely the type of structure which BART excels at capturing. Hence BART appears to be an ideal tool for extracting loworder and potentially nonlinear interactions.Surprisingly, we show that, despite the ability of BART to capture loworder interactions for prediction
purposes, it is nonetheless not suitable for conducting fullyBayesian inference for the
selection task of interaction detection. When taken at face value as a Bayesian model, we show empirically that BART generally leads to the detection of spurious interaction effects. This is not contradictory because optimal prediction accuracy is generally not sufficient to guarantee consistency in variable selection (see, e.g., Wang et al., 2007).We discuss the general problem which leads to the detection of spurious interactions; while this development is couched in the BART framework, we believe that the fundamental issues also occur for other decision tree ensembling methods. Specifically, the problem is that there is no penalty associated to including spurious interaction terms in the model. We then introduce a suitable modification to the BART framework which addresses this problem and allows BART detect interactions in a fullyBayesian fashion. We accomplish this by clustering the trees into nonoverlapping groups. Intuitively, the shallow trees comprising each cluster work together to learn a single loworder interaction. To bypass the need to specify the number of clusters, we induce the clustering through a Dirichlet process prior (Ferguson, 1973). We refer to the ensemble constructed in this fashion as a Dirichlet Process Forest (DPForest).
1.1 A Simple Example
To motivate the problem, we consider a simulated data example of Vo and Pati (2016). This example takes , , , and . We compare the DPForest we propose to a variant of BART referred to as SBART (Linero and Yang, 2017)
which can accommodate sparsity in variable selection. We also consider the recently proposed iterative random forests algorithm of
Basu et al. (2018), selecting interactions whose stability score is higher than 0.5. In Figure 1 we display the interaction structure detected by each method on this data; while we considered only one iteration of this experiment here, these results are typical of replications of the experiment.Here, SBART detects a spurious edge between and . This occurs because BART, despite its fundamentally additive nature, does not include any penalization which discourages unnecessary interactions from being included. On the contrary, BART expect interactions to occur between relevant predictions; considering a draw from a BART prior such that and are included in the model, an interaction between these variables is apriori likely. Adapting Bayesian decision tree ensembles to interaction detection then requires a prior which discourages the inclusion of weak interactions. The iRF similarly detects two spurious interactions and misses a relevant interaction between and .
1.2 Related Work
Recent work has studied the theoretical properties of BART. Linero and Yang (2017) and Rockova and van der Pas (2017) show that certain variants of BART are capable of adaptively attaining nearminimaxoptimal rates of posterior concentration when can be expressed as a sum of loworder interaction terms with each depending on a small subset of of the predictors. In view this, one might conclude that no modification to BART is needed. This is true if one cares only about the mean integrated squared error where . Optimal prediction performance, however, does not imply that variable selection and interaction detection are being performed adequately. If is the true interaction structure of the data is an estimate of , then attaining the minimax estimation rate for in terms of prediction error typically only guarantees that (not ).
Several other methods have been recently proposed in the literature specifically for the task of interaction detection. We offer a noncomprehensive review. For a recent review, see Bien et al. (2013). Lim and Hastie (2015) proposed a hierarchical grouplasso which enforces the constraint that the presence of a given interaction implies the presence of the associated main effects; a similar approach is given by Bien et al. (2013). A potential shortcoming of these approaches is that they focus on linear models and allow only pairwise interactions. Radchenko and James (2010) propose the VANISH algorithm, which allows for nonlinear effects through the use of basis function expansions, but again limits to pairwise interactions. Several decisiontree based methods have also been proposed. The additive groves procedure of Sorokina et al. (2008) uses an adaptive boostingtype algorithm to sequentially test for the presence of interactions between variables after performing a variable screening step. Basu et al. (2018) propose the iterative random forest (iRF) algorithm which flags “stable” interaction effects as those which appear consistently in many trees in a certain random forest.
2 Bayesian Tree Ensembles
2.1 The BART Prior
Our starting point is the Bayesian additive regression trees (BART) framework of Chipman et al. (2010), which treats the function as the realization of a sum of random decision trees
where denotes the tree structure (including the decision rules) of the tree and denotes the parameters associated to the leaf nodes; here, denotes the collection of leaf nodes of . Let denote the event that the point is associated to leaf in tree . The function then returns whenever occurs.
We follow Chipman et al. (2010) and specify a branching process prior for the tree structure . A sample from the prior for is generated iteratively, starting from a tree with a single node of depth
; this is made a branch with two children with probability
, and is made a leaf node otherwise. We repeat this process independently for all nodes of depth until all nodes at depth are leaves. After the structure of the tree is generated, each branch is associated with a decision rule of the form . The coordinate used to construct the decision rule is sampled with probability where is a probability vector. The splitting proportion will play a key role later as an avenue for inducing sparsity in the regression function. Finally, we generate where is the hyperrectangle corresponding to the values of that lead to branch . We remark that this choice for differs from the scheme used by other BART implementations; we adopt it to simplify the full conditionals we derive in Section 3.For the prior on we set conditional on and
. By taking the variance to be
we ensure that the prior level of signal is constant as increases. The normal prior is selected for its conjugacy; we note, however, that any prior for with mean and variance leads to the approximationby the central limit theorem. We fix
and ; we refer readers to Linero and Yang (2017) for further details regarding prior specification, and to Chipman et al. (2013) and Linero (2017) for detailed reviews of Bayesian decision tree methods.2.2 Leveraging Structural Information
Several recent developments have extended the BART methodology to take advantage of structural information. Linero (2016) noted that sparsity in can be accommodated automatically by setting . Recall here that
denotes the prior probability that, for a fixed branch, coordinate
will be used to construct a split at a that branch. Hence, if is nearlysparse with nonsparse entries, the prior will encourage realizations from the prior to include only the predictors with nonsparse entries. Linero and Yang (2017) showed that this prior for induces highly desirable posterior concentration properties; in particular, the posterior of concentrates at close to the oracle minimax rate if we had known the relevant predictors beforehand.Linero and Yang (2017) also introduce the SBART model, which uses soft decision trees (Irsoy et al., 2012)
which effectively replace the decision boundaries of BART with smooth sigmoid functions. This allows the SBART model to adapt to the smoothness level of
; consequently, if is assumed to be Hölder, the posterior for the SBART model concentrates around at close to the oracle minimax rate obtainable when the smoothness level is known apriori. While the methodology we develop applies to the usual BART models, we will use the SBART model with the sparsityinducing Dirichlet prior in all of our illustrations.3 DpForests
The distribution of in the BART model is parameterized by the splitting proportions , leaf variance , and tree topology parameters . To encourage a small number of loworder interactions, we specify a prior which clusters the trees into nonoverlapping groups such that each cluster constructs splits using different subsets of the predictors. A schematic is given in Figure 2 with . In this figure we see that the first two trees are dedicated to learning a main effect for while the second two trees are dedicated to learning an interaction between and .
We induce a clustering by using treespecific splitting proportions and using a Dirichlet process prior on (Ferguson, 1973). Specifically, we let conditional on and let where is a distribution and denotes the precision parameter of the Dirichlet process. Using the latentcluster interpretation of the Dirichlet process (see, .e.g, Teh et al., 2006) this can be approximated by the following generative model:

Draw for large .

Draw .

Draw where .

For , draw as described in Section 2 with .
The ’s cluster trees such that the trees within each group capture a single loworder interaction. Note that the use of the the sparsity inducing prior in step 3 above ensures that each will be nearlysparse, and hence the trees with will split on only a small subset of the predictors. The role played by this weight vector is to encourage a subset of the predictors to appear in multiple different interactions. For example, if there are interactions and we do not want to encourage an additional interaction. A large value of allows for this by encouraging to appear in several interactions.
3.1 Properties of the Prior
The degree of sparsity within each cluster of trees, as well as the overall number of clusters used, are determined by the hyperparameters
and . These hyperparameters are key in determining the interaction structures that the prior favors. To help anchor intuition we first consider several special cases of the DPForests model. First, we consider the behavior of the prior as with fixed. In this case, with high probability each will have only one nonsparse entry. Consequently, each tree in the ensemble will split on at most one predictor. Because the trees are composed additively, this implies that none of the variables interact, and hence the prior concentrates on a sparse generalized additive model (SPAM, Ravikumar et al., 2007). On the other hand, as we see that so that the prior reverts to original BART model with splitting proportions given by described by Bleich et al. (2014).We can conduct a similar analysis with fixed and with . As , each tree will be associated to a unique . As , on the other hand, all of the trees share the same so that the model collapses to the Dirichlet additive regression trees model described by Linero (2016).
The key difference between BART and a DPForest is that, once two variables are included, BART does not penalize interactions. Let and denote the event that variable and are included in the model, let denote the event that variables and interact, and let denote the joint prior distribution for . We study the prior on the interaction structure by examining the probabilities and In words, is the probability that interact given that both variables are relevant, while represents the probability that interact given that and interact. Additionally, we examine the relationship between the average number of twoway interactions included in the model and the number of variables included.
Figure 3 shows several relationships between these quantities as varies for both SBART and DPForests. We see that is quite large for all values of with SBART, implying that the prior expects any variables included in the model to interact; the trend is decreasing in only because a larger number of predictors will be included in the model, causing variables to compete for branches in the ensemble. DPForests do not encourage the inclusion of interactions, particularly when is small. Next, we see that is also uniformly large for SBART. This implies that the prior does not encourage interaction structures like the truth from Figure 1, while a DPForest with a small choice of does.
3.2 Default Prior Settings
A benefit of the BART framework is the existence of default priors which require minimal tuning from users. Where applicable, we do not stray from the defaults recommended in Section 2. Specific to DPForests, the key parameter controlling the behavior of the model is . On the basis of Figure 3 we recommend choosing to be small; we have found setting with mean to work well. Conversely, in our illustrations the results for the DPForest model do not depend strongly on , and we set . This leaves the weight vector to be specified. In our illustrations, we first run a screening step which removes irrelevant predictors. In principle any method can be used for screening; in our illustrations, we use SBART to screen variables which have posterior inclusion probability below , and set . A more principled alternative is to use another sparsityinducing prior on but we do not pursue this strategy here.
3.3 Computation and Inference
Inference for the DPForest model can be carried out using a Gibbs sampler with the Bayesian backfitting approach of Chipman et al. (2010). The Gibbs sampler operates on the state space . We use standard MetropoliswithinGibbs proposals to update and ; see Kapelner and Bleich (2016) and Pratola (2016) for details. The parameters , , , and can all be updated easily using the slice sampling algorithm of Neal (2003). Finally, , and all have conjugate fullconditional distributions:
Full conditional for : Note that is conditionally independent of all parameters given . By conjugacy of the Dirichlet distribution to multinomial sampling we have the full conditional where .
Full conditional for : The conjugacy of the Dirichlet prior to multinomial sampling implies a Dirichlet fullconditional when a single is used. To account for the clustering, we only consider the branches associated to trees with , giving the full conditional where is the number of branches associated to cluster which split on predictor .
Full conditional for : Let denote the full conditional for . The term comes in only through the factors (the prior probability of ) and where is the number of branches of tree which split on predictor (the likelihood of tree having split on the predictors that it has, give ). Hence .
Putting these pieces together, we arrive at Algorithm 1, which describes a single iteration of the Gibbs sampler.
4 Experiments
We now compare DPForests to existing methods on a number of synthetic datasets. We consider the following methods in addition to DPForests and SBART.
Additive groves: The additive groves procedure of Sorokina et al. (2008). Because tuning of the additive groves algorithm is computeintensive, we ran several pilot studies to choose appropriate tuning parameters which perform well for the given simulation settings.
Hierarchical group lasso: The hierarchical group lasso proposed by Lim and Hastie (2015) for interaction detection; we abbreviate this method by HL. This procedure was designed with linearity of in mind. Tuning parameters are selected by crossvalidation.
Hierarchical group lasso, least squares: HL is used to select the interactions and main effects, while the coefficients are estimated by least squares; we abbreviate this method by HLLS. Tuning parameters are selected by cross validation.
Iterative random forests: The iterative random forests (iRF) procedure proposed by Basu et al. (2018) as implemented in the iRF package on CRAN. We use the default trees and 10 iterations of the iRF algorithm.
Our simulation settings are borrowed from several existing works; we do not compare our methods to these other works due to a lack of publicly available software.

(Radchenko and James, 2010) We generate where , , and . We let be
where , , , , and . Each is further centered and scaled so that and .

(Vo and Pati, 2016) We generate with , , and . We let

Same as (S2), but without the interaction effects.

(Friedman, 1991) A common test case for BART, we generate with , and . We set
Each of these scenarios was replicated times. We evaluate each method according to the average number of false positives (FPs), false negatives (FNs), score, and integrated rootmean squared error . The score is a commonly used measure of overall accuracy that balances false positives against false negatives in variable selection tasks; see, for example, Zhang and Yang (2015).
Results for interaction detection are given in Figure 4. We omit the results for HL because HLLS performs uniformly better. Under all simulation settings, DPForests perform better than all other methods according to score. SBART is also competitive with other procedures on many of the datasets. As expected, the primary problem with SBART is that it has a relatively large number of false positives, i.e. it is susceptible to detecting spurious interactions. This issue is most pronounced on (S2) and (S3), with SBART detecting between 1.5 and 2 spurious interactions.
Additive groves and iterative random forests generally perform worse than SBART. In addition to having a larger false positives rate, these procedures are also prone to false negatives under simulation (S2). With the exception of (S1), the hierarchical grouplasso (HLLS) performs worse than the other methods. Under (S1), HLLS has reasonable performance as each component of can be reasonably wellapproximated by the assumed linear model. HLLS also appears to perform well under (S3); this, however, is due to the fact that HLLS typically misses several main effects, which is a substantially worse outcome than detecting a spurious interaction. The nonlinearities under (S2) and (S4) also create problems for HLLS.
All methods perform better for detecting the main effects. SBART and DPForests give identical results for the main effects due to the use of SBART in screening for DPForests. (S1) is the easiest setting, with all methods having very few false negatives and HLLS the only method having nonnegligible falsepositives. Under (S2), the nonBayesian procedures all have nonnegligible false negatives, and iRF and HLLS are additionally prone to false positives; the story is similar under (S3), with HLLS performing better in terms of false positives but worse in terms of false negatives. All methods perform well in terms of false positives under (S4), however iRF and HLLS also suffer from many false negatives.
Results for assessing prediction performance in terms of integrated root meansquared error (RMSE) are given in Figure 6. SBART and DPForests perform very similarly in terms of RMSE. All other methods perform substantially worse under all settings. This is likely due to a multitude of factors. First, any false negatives will contribute to poor predictive performance. Second, SBART and DPForests are able to take advantage of underlying smoothness in the response function which additive groves and iterative random forests cannot, while HL and HLLS suffer from an incorrect model specification.
SBART and DPForests are competitive in terms of runtime. For example, on a single replicate of (S4), SBART and DPForests took 118 seconds and 241 seconds respectively to obtain 40,000 samples from the posterior. By comparison, iRF took 279 second, HLLS took 91 seconds, and additive groves took 4966 seconds. Additive groves was by far the slowest procedure, due to the fact that recursive feature elimination is used. We conclude that, under these settings, DPForests outperform all competitors are a competitive computational budget.
We also consider the publicly available Boston housing dataset of Harrison and Rubinfeld (1978). Analysis of the interaction structures present in this dataset was previously undertaken by Radchenko and James (2010) and Vo and Pati (2016). This dataset consists of predictors and neighborhoods, and a continuous response corresponding to the median house value in a given neighborhood.
Method  RMSE 

DPForests  1.00 
iRF  1.22 
HL  1.18 
Additive Groves  1.16 
We compare the methods in terms of goodnessoffit, which is evaluated using a 5fold cross validated estimate of root mean squared prediction error. Results are given in Table 1. For prediction, the DPForest and SBART outperform the competing methods.
The DPForest includes most of the predictors in the model. This can be contrasted with the fit of a sparse additive model (SPAM) Ravikumar et al. (2007) and the fit of the VANISH model reported by Radchenko and James (2010), which include only a small number of predictors. Like the VANISH algorithm, the DPForest selects one interaction: there is strong evidence of an interaction between DIS (distance to an employment center in Boston) and LSTAT (the proportion of individuals in a neighborhood who are lowerstatus). This interaction was highly stable, and was selected by every fit to the data during crossvalidation; additionally, this interaction was selected by additive groves in 4 out of 5 folds during crossvalidation. Interestingly, this interaction was reportedly not selected by VANISH, which instead selects an interaction between the variables NOX (nitrusoxide concentration) and LSTAT.
Figure 7 gives a visualization of the LSTATDIS interaction. To summarize the interaction we use a “fitthefit” strategy and fit a generalized additive model to the fittedvalues of the DPForest with a thin plate spline term for the interaction (Wood, 2003). The plot then displays the LSTATspecific effect of DIS for the quantiles of LSTAT. This GAM nearly reproduces the fitted values from the DPForest and is easier to visualize. We see in Figure 7 a clear interaction between DIS and LSTAT. Intuitively, one expects that the closer a neighborhood is to an industry center the more expensive the housing will be. This is correct for areas with fewer lowerstatus individuals; however, this trend does not hold when there is a higher percentage of lowerstatus individuals. We remark also that the data is well supported near for all values of LSTAT, so that this behavior is unlikely to be due to extrapolation, though extrapolation may be an issue for large values of both LSTAT and DIS.
5 Discussion
We have introduced Dirichlet process forests (DPForests) and applied them to the problem of interaction detection. We demonstrated on both synthetic and real data that DPForests lead to improved interaction detection. Additionally, we demonstrated that DPForests are highly competitive with commonly used machine learning techniques for detecting loworder interactions.
There are a number of modifications one might make to improve performance further. One possibility is to allow to also vary by mixture component. This would allow different mixture components to have different signal levels; for example, under simulation (S4), we would expect that a smaller value of is appropriate for the mixture component responsible for relative to . The proposed DPForests model captures this feature only indirectly through the number of trees assigned to each mixture component.
Additionally, it would be interesting to quantify the improvement in performance of DPForests over SBART theoretically. It is unknown whether SBART is variableselection consistent, and establishing theoretically that DPForests are consistent for interaction detection while SBART is not remains an open problem.
References
 Basu et al. (2018) Basu, S., Kumbier, K., Brown, J. B., and Yu, B. (2018). iterative random forests to discover predictive and stable highorder interactions. Proceedings of the National Academy of Sciences, page 201711236.
 Bien et al. (2013) Bien, J., Taylor, J., and Tibshirani, R. (2013). A lasso for hierarchical interactions. Annals of statistics, 41(3):1111.
 Bleich et al. (2014) Bleich, J., Kapelner, A., George, E. I., and Jensen, S. T. (2014). Variable selection for BART: An application to gene regulation. The Annals of Applied Statistics, 8(3):1750–1781.
 Chipman et al. (2013) Chipman, H., George, E. I., Gramacy, R. B., and McCulloch, R. (2013). Bayesian treed response surface models. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3(4):298–305.
 Chipman et al. (2010) Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298.
 Ferguson (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230.
 Friedman (1991) Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67.
 Hahn et al. (2017) Hahn, P. R., Murray, J. S., and Carvalho, C. (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. arXiv preprint arXiv:1706.09523.
 Harrison and Rubinfeld (1978) Harrison, D. and Rubinfeld, D. L. (1978). Hedonic prices and demand for clean air. Journal of Environmental Economics and Management, 5:81–102.
 Hastie (2017) Hastie, T. J. (2017). Generalized additive models. In Statistical models in S, pages 249–307. Routledge.

Irsoy et al. (2012)
Irsoy, O., Yildiz, O. T., and Alpaydin, E. (2012).
Soft decision trees.
In
Proceedings of the International Conference on Pattern Recognition
.  Kapelner and Bleich (2016) Kapelner, A. and Bleich, J. (2016). bartMachine: Machine learning with Bayesian additive regression trees. Journal of Statistical Software, 70(4):1–40.
 Lim and Hastie (2015) Lim, M. and Hastie, T. (2015). Learning interactions via hierarchical grouplasso regularization. Journal of Computational and Graphical Statistics, 24(3):627–654.
 Linero (2016) Linero, A. R. (2016). Bayesian regression trees for high dimensional prediction and variable selection. Journal of the American Statistical Association. To appear.
 Linero (2017) Linero, A. R. (2017). A review of treebased Bayesian methods. Communications for Statistical Applications and Methods, 24(6):543–559.
 Linero and Yang (2017) Linero, A. R. and Yang, Y. (2017). Bayesian regression tree ensembles that adapt to smoothness and sparsity. arXiv preprint arXiv:1707.09461.
 Murray (2017) Murray, J. S. (2017). Loglinear Bayesian additive regression trees for categorical and count responses. arXiv preprint arXiv:1701.01503.
 Neal (2003) Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31:705–767.
 Pratola (2016) Pratola, M. (2016). Efficient MetropolisHastings proposal mechanisms for Bayesian regression tree models. Bayesian Analysis, 11(3):885–911.
 Radchenko and James (2010) Radchenko, P. and James, G. M. (2010). Variable selection using adaptive nonlinear interaction structures in high dimensions. Journal of the American Statistical Association, 105(492):1541–1553.
 Ravikumar et al. (2007) Ravikumar, P., Liu, H., Lafferty, J., and Wasserman, L. (2007). SPAM: Sparse additive models. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 1201–1208.
 Rockova and van der Pas (2017) Rockova, V. and van der Pas, S. (2017). Posterior concentration for Bayesian regression trees and their ensembles. arXiv preprint arXiv:1078.08734.
 Sorokina et al. (2008) Sorokina, D., Caruana, R., Riedewald, M., and Fink, D. (2008). Detecting statistical interactions with additive groves of trees. In Proceedings of the 25th international conference on Machine learning, pages 1000–1007. ACM.
 Sparapani et al. (2016) Sparapani, R. A., Logan, B. R., McCulloch, R. E., and Laud, P. W. (2016). Nonparametric survival analysis using Bayesian additive regression trees (BART). Statistics in medicine.
 Starling et al. (2018) Starling, J. E., Murray, J. S., Carvalho, C. M., Bukowski, R., and Scott, J. G. (2018). Functional response regression with funbart: an analysis of patientspecific stillbirth risk. arXiv preprint arXiv:1805.07656.
 Teh et al. (2006) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
 Vo and Pati (2016) Vo, G. and Pati, D. (2016). Sparse additive Gaussian process with soft interactions. arXiv preprint arXiv:1607.02670.
 Wang et al. (2007) Wang, H., Li, R., and Tsai, C.L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3):553–568.
 Wood (2003) Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):95–114.
 Zhang and Yang (2015) Zhang, Y. and Yang, Y. (2015). Crossvalidation for selecting a model selection procedure. Journal of Econometrics, 187(1):95–112.
References
 Basu et al. (2018) Basu, S., Kumbier, K., Brown, J. B., and Yu, B. (2018). iterative random forests to discover predictive and stable highorder interactions. Proceedings of the National Academy of Sciences, page 201711236.
 Bien et al. (2013) Bien, J., Taylor, J., and Tibshirani, R. (2013). A lasso for hierarchical interactions. Annals of statistics, 41(3):1111.
 Bleich et al. (2014) Bleich, J., Kapelner, A., George, E. I., and Jensen, S. T. (2014). Variable selection for BART: An application to gene regulation. The Annals of Applied Statistics, 8(3):1750–1781.
 Chipman et al. (2013) Chipman, H., George, E. I., Gramacy, R. B., and McCulloch, R. (2013). Bayesian treed response surface models. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3(4):298–305.
 Chipman et al. (2010) Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298.
 Ferguson (1973) Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230.
 Friedman (1991) Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67.
 Hahn et al. (2017) Hahn, P. R., Murray, J. S., and Carvalho, C. (2017). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. arXiv preprint arXiv:1706.09523.
 Harrison and Rubinfeld (1978) Harrison, D. and Rubinfeld, D. L. (1978). Hedonic prices and demand for clean air. Journal of Environmental Economics and Management, 5:81–102.
 Hastie (2017) Hastie, T. J. (2017). Generalized additive models. In Statistical models in S, pages 249–307. Routledge.

Irsoy et al. (2012)
Irsoy, O., Yildiz, O. T., and Alpaydin, E. (2012).
Soft decision trees.
In
Proceedings of the International Conference on Pattern Recognition
.  Kapelner and Bleich (2016) Kapelner, A. and Bleich, J. (2016). bartMachine: Machine learning with Bayesian additive regression trees. Journal of Statistical Software, 70(4):1–40.
 Lim and Hastie (2015) Lim, M. and Hastie, T. (2015). Learning interactions via hierarchical grouplasso regularization. Journal of Computational and Graphical Statistics, 24(3):627–654.
 Linero (2016) Linero, A. R. (2016). Bayesian regression trees for high dimensional prediction and variable selection. Journal of the American Statistical Association. To appear.
 Linero (2017) Linero, A. R. (2017). A review of treebased Bayesian methods. Communications for Statistical Applications and Methods, 24(6):543–559.
 Linero and Yang (2017) Linero, A. R. and Yang, Y. (2017). Bayesian regression tree ensembles that adapt to smoothness and sparsity. arXiv preprint arXiv:1707.09461.
 Murray (2017) Murray, J. S. (2017). Loglinear Bayesian additive regression trees for categorical and count responses. arXiv preprint arXiv:1701.01503.
 Neal (2003) Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31:705–767.
 Pratola (2016) Pratola, M. (2016). Efficient MetropolisHastings proposal mechanisms for Bayesian regression tree models. Bayesian Analysis, 11(3):885–911.
 Radchenko and James (2010) Radchenko, P. and James, G. M. (2010). Variable selection using adaptive nonlinear interaction structures in high dimensions. Journal of the American Statistical Association, 105(492):1541–1553.
 Ravikumar et al. (2007) Ravikumar, P., Liu, H., Lafferty, J., and Wasserman, L. (2007). SPAM: Sparse additive models. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 1201–1208.
 Rockova and van der Pas (2017) Rockova, V. and van der Pas, S. (2017). Posterior concentration for Bayesian regression trees and their ensembles. arXiv preprint arXiv:1078.08734.
 Sorokina et al. (2008) Sorokina, D., Caruana, R., Riedewald, M., and Fink, D. (2008). Detecting statistical interactions with additive groves of trees. In Proceedings of the 25th international conference on Machine learning, pages 1000–1007. ACM.
 Sparapani et al. (2016) Sparapani, R. A., Logan, B. R., McCulloch, R. E., and Laud, P. W. (2016). Nonparametric survival analysis using Bayesian additive regression trees (BART). Statistics in medicine.
 Starling et al. (2018) Starling, J. E., Murray, J. S., Carvalho, C. M., Bukowski, R., and Scott, J. G. (2018). Functional response regression with funbart: an analysis of patientspecific stillbirth risk. arXiv preprint arXiv:1805.07656.
 Teh et al. (2006) Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
 Vo and Pati (2016) Vo, G. and Pati, D. (2016). Sparse additive Gaussian process with soft interactions. arXiv preprint arXiv:1607.02670.
 Wang et al. (2007) Wang, H., Li, R., and Tsai, C.L. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94(3):553–568.
 Wood (2003) Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):95–114.
 Zhang and Yang (2015) Zhang, Y. and Yang, Y. (2015). Crossvalidation for selecting a model selection procedure. Journal of Econometrics, 187(1):95–112.