randomness-as-regularization
This contains codes for simulations in the RaR paper
view repo
Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided.
READ FULL TEXT VIEW PDFThis contains codes for simulations in the RaR paper
Despite being proposed nearly two decades ago, random forests (Breiman, 2001)
remain among the most popular and successful off-the-shelf supervised learning methods with demonstrated success in numerous scientific domains from ecology
(Prasad et al., 2006; Cutler et al., 2007; Coleman et al., 2017) to image recognition (Bernard et al., 2007; Huang et al., 2010; Guo et al., 2011; Fanelli et al., 2013) to bioinformatics (Díaz-Uriarte and De Andres, 2006; Mehrmohamadi et al., 2016). In comparison with many competing methods, random forests are generally quite competitive in terms of predictive accuracy despite being largely considered simpler and more computationally tractable. A recent large-scale empirical study by Fernández-Delgado et al. (2014)found them to be the top performing classifier when compared across 100’s of alternatives across 100’s of datasets. Though Breiman once famously crowned Adaboost
(Freund et al., 1996) the best “off-the-shelf classifier in the world”, random forests have since taken that spot in the minds of many.The sustained success random forests has led naturally to the desire to better understand the statistical and mathematical properties of the procedure. Lin and Jeon (2006) introduced the potential nearest neighbor framework and Biau and Devroye (2010) later established related consistency properties. Biau et al. (2008), Biau (2012), and Klusowski (2019b)
analyzed convergence properties and rates for idealized versions of the procedure. Methods for obtaining estimates of the standard errors of random forest predictions were developed in
Sexton and Laake (2009) and Wager et al. (2014) while algorithmic convergence studies determining how many trees are necessary for stabilization were provided by Lopes et al. (2019a) and Lopes et al. (2019b). The out-of-bagg variable importance measures originally suggested by Breiman (2001) have since been analyzed thoroughly and shown to be biased towards correlated features and those with many categories (Strobl et al., 2007, 2008; Toloşi and Lengauer, 2011; Nicodemus et al., 2010). Hooker and Mentch (2019)recently provided an explanation for this kind of behavior. The core random forest methodology has been extended to numerous frameworks including quantile regression
(Meinshausen, 2006)(Zhu et al., 2015), and survival analysis (Hothorn et al., 2005; Ishwaran et al., 2008; Cui et al., 2017; Steingrimsson et al., 2019). For a more comprehensive overview of general random-forest-related research, we refer the reader to a recent review paper from Biau and Scornet (2016).In the last several years, a number of important statistical properties of random forests have also been established whenever base learners are constructed with subsamples rather than bootstrap samples. Scornet et al. (2015) provided the first consistency result for Breiman’s original random forest algorithm whenever the true underlying regression function is assumed to be additive. Mentch and Hooker (2016)
provided the first distributional results for random forests, demonstrating that predictions are asymptotically normal under various regularity conditions and proposing accompanying procedures for producing confidence intervals and hypothesis tests for feature importance.
Mentch and Hooker (2017) extended this idea to grid-structured testing to allow for hypothesis tests for additivity in the regression function. Coleman et al. (2019) proposed a more computationally efficient permutation-based hypothesis test allowing such procedures to scale easily to big data settings where large test sets may be utilized. Wager and Athey (2018) established both consistency and asymptotic normality for honest and regular trees whenever many trees are constructed. Peng et al. (2019) weakened the conditions necessary for asymptotic normality in a more general setting and provided Berry-Esseen Theorems describing their rate of convergence.Despite the impressive volume of research from the past two decades and the exciting recent progress in establishing their statistical properties, a satisfying explanation for the sustained empirical success of random forests has yet to be provided. In their recent review, Biau and Scornet (2016) note that research investigating the properties of random forest tuning parameters is “unfortunately rare” and that “present results are insufficient to explain in full generality the remarkable behavior of random forests.” Indeed, while numerous studies have experimented with various tuning parameter setups and strategies (Genuer et al., 2008; Bernard et al., 2009; Genuer et al., 2010; Duroux and Scornet, 2016; Scornet, 2017; Probst and Boulesteix, 2017; Probst et al., 2019)
, results from these works have largely been limited to high-level, somewhat heuristic takeaways with most concluding only that building more trees helps to stabilize predictions and that the other tuning parameters can have a substantial effect on accuracy. Other work has attempted to garner insight by drawing connections between tree-based ensembles and other related procedures such as kernel methods
(Scornet, 2016; Olson and Wyner, 2018)(Welbl, 2014; Biau et al., 2016). However, to our knowledge, besides the original motivation provided by Breiman (2001), the only other work devoted to attempting to explain the success of random forests was provided recently by Wyner et al. (2017) who hypothesized that their strong performance was due to their behavior as “self-averaging interpolators.” Belkin et al. (2019) very recently wrote in support of this notion. However, as expanded upon in detail in the following section, in our view, these existing explanations fall short on a number of fronts. In general, the arguments provided apply equally well to other learning methods not generally thought to have strong and robust empirical performance, fail to provide insight into the role played by the randomness, and apply only to particular problem setups.In contrast, the work presented here offers a concrete explanation for the role played by the randomness most commonly injected into random forest procedures. Instead of assuming that random forests simply do “work well”, we take a more principled approach in trying to isolate the effects of that randomness and determine when its inclusion results in improved accuracy over a baseline approach like bagging. In particular, we argue that the randomness serves to regularize the procedure, making it highly advantageous in low signal-to-noise ratio settings. To drive home this point, we further demonstrate that incorporating similar randomness into alternative (non tree-based) model selection procedures can result in improved predictive accuracy over existing methods in exactly the settings where such improvements would be expected. This approach is inspired largely by recent work on degrees of freedom for model selection by Tibshirani (2015) and Hastie et al. (2017).
The remainder of this paper is laid out as follows. In Section 2 we formalize the random forest procedure and continue the above discussion, providing a more detailed overview of the shortcomings of existing explanations for their success. In Section 3 we discuss degrees of freedom for model selection procedures and demonstrate that within a traditional random forest context, more randomness results in procedures with fewer degrees of freedom. We emphasize and build upon this finding in Section 4 by demonstrating in numerous settings using both real and synthetic data that the relative improvement in accuracy offered by random forests appears directly related to the relative amount of signal contained within the data. Finally, in Section 5 we develop linear model forward-selection-style analogues for both bagging and random forests and produce near identical results, finding in particular that in highly noisy low-dimensional settings, injecting randomness into the selection procedure can outperform even highly competitive explicit regularization methods such as the lasso.
We begin by formalizing the random forest framework in which we will work in the following sections. Unless otherwise noted, throughout the remainder of this paper we will consider a general regression framework in which we observe (training) data of the form where each ,
denotes a vector of
features, denotes the response, and the variables have a general relationship of the form(1) |
where
is often assumed to be independent noise with mean 0 and variance
. For a given point , a random forest prediction at takes the form(2) |
where the base-learners are typically tree-based models constructed on some resample of the original data . In the work that follows we assume these base models are regression trees built according to the standard CART criterion (Breiman et al., 1984) whereby an internal node is split in an axis-aligned fashion into left and right daughter nodes of the form and whenever the decision is made to split the feature at . The particular variable and split location are chosen from among those available as that pair which minimizes the resulting within-node variance of the offspring. For a more detailed discussion of this procedure, we refer the reader to recent work by Scornet et al. (2015) or Klusowski (2019a).
Throughout the literature on random forests, it is common to succinctly contain all randomness in the single term as in equation (2) above. We note however that for our purposes below, it may be convenient to consider this more explicitly as . Written in this form, serves to select the resample of the original data utilized in the tree. While much recent work has focused on subsampled random forests, here we consider the resamples to be bootstrap samples as originally put forth in Breiman (2001). The second randomization component then determines the eligible features to be split at each node in the tree. As is standard, we assume that at each internal node, the split must be be chosen from among only candidate features and that such candidates are selected uniformly at random without replacement. When , the procedure reduces to bagging (Breiman, 1996).
The original reasoning behind random forests provided by Breiman (2001) was based on an extension of the randomized tree analysis given in Amit and Geman (1997). Breiman showed that the accuracy of any randomized ensemble depends on two components: the strength (accuracy) of the individual base-learners and the amount of dependence between them. Thus, the original motivation for a procedure like random forests might be seen from a statistical perspective as akin to the classic bias-variance tradeoff. In the same way that some procedures (e.g. the lasso (Tibshirani, 1996; Chen et al., 2001)) consider sacrificing a small amount of bias in exchange for a large reduction in variance, random forest ensembles might be seen as sacrificing a small amount of accuracy at the base-learner level (by injecting the extra randomness) for a large reduction in between-tree correlation. Hastie et al. (2009) provide a thorough, high-level discussion of this effect in showing that the mtry parameter serves to reduce the variance of the ensemble.
However, this discussion from Breiman (2001) might be better seen as motivation for why a randomized ensemble could potentially improve accuracy rather than an explanation as to why random forests in particular do seem to work well. Breiman himself experiments with different kinds of randomness in the original manuscript and suggests that in practice users can also experiment with different forms to try and determine what works best in particular settings. Furthermore, in his concluding remarks, Breiman notes that while the additional randomness at the base learner level helps to reduce the variance of the ensemble, the magnitude of improvement often seen with random forests suggested to him that perhaps it somehow also “act[s] to reduce bias” but that ultimately “the mechanism for this [was] not obvious.” In the years since, it has been shown quite clearly that the benefits sometimes seen with random forests are the result of variance reduction alone; see Hastie et al. (2009) for a more complete discussion.
In recent work, Wyner et al. (2017) take a more definitive stance, conjecturing that both random forests and AdaBoost (Freund et al., 1996) work well because both procedures are “self-averaging interpolators” that fit the training data perfectly while retaining some degree of smoothness due to the averaging. The key to their success, they argue, is that in practice, datasets often contain only small amounts of noise and these algorithms are able to mitigate the effects of noisy data points by localizing their effect so as to not disturb the larger regions where the data consists mostly of signal. Indeed, the authors acknowledge that the procedures “do in fact overfit the noise – but only the noise. They do not allow the overfit to metastasize to modestly larger neighborhoods around the errors.”
As the central claim of Wyner et al. (2017) is that random forests “work not in spite, but because of interpolation” we now make this notion and argument explicit.
A classifier (or regressor) is said to be an interpolating classifier (regressor) if for every training point , .
This definition of an interpolating classifier is taken directly from Wyner et al. (2017); for completeness and because it will be directly relevant to the immediate conversation, we expand the definition so as to apply in the same fashion to regression contexts.
Consider first the classification setting wherein the response and we seek to utilize the training data to construct a classifier . The random forest procedure laid out above for regression can be easily amended to perform classification – indeed, the original procedure provided in Breiman (2001) was discussed largely for classification problems. Here, split locations within trees are chosen as those which provide the greatest empirical reduction in Gini impurity and final predictions are taken as the majority vote – the class predicted most often – across all trees in the forest.
Now consider a particular observation . In order to be more explicit and without loss of generality, suppose that . Given resamples of the original data , whenever trees are fully grown so that each terminal node contains only a single observation, it must necessarily be the case that (i.e. the tree predicts the correct class for ) whenever . Thus, in order for the random forest classifier to select the correct class for , (i.e. ) it suffices to ensure that is selected in a plurality (or simple majority in the case of binary classification) of the resamples. When bootstrap samples are used, it is well known that the probability of each observation appearing is approximately 0.632. Thus, given bootstrap samples, the probability that the observation appears in at least half of these resamples is approximately
where denotes the binomial cdf evaluated at the point with parameters (the number of trials) and (the probability of success in each trial). For even moderately large , this probability is quite large; see the left plot in Figure 1. Thus, for binary classification problems, the probability of interpolating any given training observation is large whenever is also moderately large.
Note however that according to the definition above, a classifier is only designated as an interpolator if it interpolates all training observations. While an exact calculation for the probability of all points appearing in at least half of the bootstrap samples is somewhat involved, serves as a simple upper bound. Plots of these upper bounds on the interpolation probability are shown in the right plot in Figure 1 across a range of sample sizes for . In each case, for fixed , the probability that the classifier is an interpolator tends to 0 as and thus, in order for random forest classifiers to necessarily interpolate, the number of bootstrap replicates must be treated as an increasing function of . Though perhaps obvious, we stress that this is not generally the manner in which such estimators are constructed. In all software with which we are familiar, default values of are set to a fixed moderate size, usually on the order of a few hundred, independent of the size of the training set. Recent work from Lopes et al. (2019b) has also provided a means for estimating the algorithmic variance of random forest classifiers and shown that such variance sometimes vanishes quite quickly after relatively few bootstrap samples. Thus, while it’s possible to construct random forests in such a way that they necessarily interpolate with high probability in classification settings, it is not clear that random forests would generally be constructed in this fashion in practice and thus it is also not clear that the interpolation-based explanation offered by Wyner et al. (2017) is sufficient to explain the strong performance of random forests, even in specific contexts. It is also worth noting that on certain datasets where random forests happen to produce good models with low generalization error, they may likely also fit quite well on the training data, perhaps even nearly interpolating. This, however, is certainly possible for any modeling procedure and thus in no way would aid in explaining the particular success of random forests.
Before continuing with our critique, it’s worth pausing to note where these existing explanations are in agreement. Both Breiman (2001) and Wyner et al. (2017) seem to largely agree on the following points:
Points 1 and 4 are largely irrelevant to the discussion in the remainder of this paper as we focus exclusively on random forests; we include these points here only in the interest of completeness. Point 3 has been alluded to in numerous papers throughout the years and has likely been key to the sustained popularity of the random forest procedure.
We take slight issue with the now popular wisdom in point 2, that random forests simply “are better” than bagging or other similar randomized approaches. While this does seem to be the case surprisingly often in practice on real-world datasets (see, for example, the recent large-scale comparison form Fernández-Delgado et al. (2014) discussed in the introduction) it is certainly not a universal truth and in our view, is a potentially naive foundation on which to build a theory for explaining their success. As discussed above, Breiman (2001) does provide some motivation for why a randomized ensemble might sometimes outperform its nonrandomized counterpart in showing that the generalization error of a classifier can be bounded above by a function of base-learner accuracy and correlation. Breiman stops short, however, of providing any more explicit explanation for the role played by the randomness or in what situations that randomness might be expected to help the most. Wyner et al. (2017), on the other hand, seem to largely ignore the role of randomness altogether. The explanation the authors provide for random forest success would seem to apply equally well to bagging. In the sections below, we focus our attention heavily on determining when the inclusion of such randomness provides the greatest benefit and provide an explicit characterization of the role it plays.
It is also important to stress that the theories offered by Breiman (2001) and Wyner et al. (2017) pertain only to the classification setting, whereas our focus is primarily on regression. The interpolation hypothesis put forth by Wyner et al. depends on an even stricter setup whereby trees are built to full depth, bootstrapping (or at least subsampling without replacement at a rate of at least ) is used to generate resamples, and the number of trees grows with at a sufficiently fast rate. Previous work, however, has repeatedly shown that random forests can still achieve a high degree of predictive accuracy when trees are not built to full depth (Duroux and Scornet, 2016), and/or are constructed via subsampling (Zaman and Hirose, 2009; Mentch and Hooker, 2016; Wager and Athey, 2018), and/or when relatively few trees are built (Lopes et al., 2019b).
To see why random forests cannot be considered interpolators in a regression setting, even when individual trees are built to full depth, note that the final regression estimate is taken as the average of predictions across all trees rather than the majority vote. While it’s certainly clear that an average of interpolators is itself an interpolator since for every training point
the bootstrapping mechanism in play with random forests precludes the possibility of interpolating on with exceedingly high probability. As noted above, each bootstrap sample will omit, on average, 36.8% of the original observations and thus individual trees, even if fully grown, will not, in general, fit perfectly on that out-of-sample (out-of-bagg) data. In other words, while a fully-grown tree will necessarily interpolate the observations selected within its corresponding bootstrap sample, it will generally not fit perfectly to all training observations. For a given point in the training data, random forest regression estimates at are therefore a weighted average of and the other response values observed, and hence with exceedingly high probability, the random forest will not interpolate.
Figure 2 demonstrates this effect clearly. The left panel shows two generic interpolating functions – Interpolator A and Interpolator B – on a simple toy dataset while the right panel shows one of the same interpolators along with the predicted regression function resulting from bagging with 100 regression trees, each grown to full depth. In this toy example (motivated by the examples shown in Section 3.2 of Wyner et al. (2017)), our training data consists of 11 points each of the form for except for where we instead have the observation . Denote the location of this point by . Wyner et al. (2017) contend that if the observed response is considered “unusually noisy” at , then interpolating estimators can perform well by “localizing” the effect of this noise as is seen in the left panel of Figure 2. Indeed, we can see from this plot that both interpolators still fit the remaining data perfectly well despite the presence of the noisy observation. However, as can be seen in the right-hand panel, whenever we treat this as a regression problem and build trees to full depth, the random forest (in this case, simple bagging since we have only 1 feature) does not interpolate, but instead looks to be attempting to smooth-out the effect of the outlying observation.
This, however, stands in opposition to the reasoning provided in Wyner et al. (2017). Here the authors are highly critical of the traditional statistical notion of signal and noise and seem to take some issue with the general regression setup given in (1). To computer scientists, they claim, in many problems “there is no noise in the classical sense. Instead there are only complex signals. There are residuals, but these do not represent irreducible random errors.” But if the widespread empirical success of random forests is really “not in spite, but because of interpolation” as claimed, then one must believe that real-world data is generally low-noise, a claim argued against firmly by, for example, Hastie et al. (2017). Crucially, this means that not only are data largely free of what scientists may think of as classical kinds of noise like measurement error, but also that all sources of variation in the response
can be explained almost entirely by the available set of predictor variables
.Perhaps most importantly, if the success of random forests is the result of interpolation and interpolation is beneficial because real-world data has very high signal-to-noise ratios (SNRs), then random forests ought not perform well at all on datasets with low SNRs. Consider again the plot on the left-hand-side of Figure 2. If the outlying point at is actually the only “good signal” while the rest of the data are noisy, then the interpolators shown would be isolating signal rather than noise and hence be performing quite poorly.
But this is exactly the opposite of what we see with random forests in practice. In the following sections, we show repeatedly on both real and synthetic data that relative to non-randomized procedures like bagging, the benefit of the additional randomness is most apparent in low SNR settings. In Section 3 we show that the mtry parameter has a direct effect on the number of degrees of freedom (dof) associated with the procedure, with low values of mtry (i.e. more randomness) corresponding to the least flexible model forms with the fewest dof. Given this, in Section 4 we go on to show that as expected, the advantage offered by random forests is most dramatic at low SNRs, and that this advantage is eventually lost to bagging at high SNRs. We also consider the problem from a slightly different perspective and show that the optimal value of mtry is almost perfectly (positively) correlated with the SNR. We posit that this behavior is due to a regularizing effect caused by the randomness and bolster this claim by demonstrating the same, relatively surprising results hold in simpler linear model setups where randomness is injected into a forward selection process.
Recall from the previous section that we assume data of the form where each , denotes a vector of features, denotes the response, and the variables have a general relationship of the form . Assume further that the errors are uncorrelated with mean 0 and (common) variance . Given a particular regression estimate that produces fitted values , the degrees of freedom (Efron, 1986; Efron and Tibshirani, 1990; Tibshirani, 2015) of is defined as
(3) |
The degrees of freedom (dof) of a particular estimator is generally understood as a measure of its flexibility; estimators with high dof depend more heavily on the particular values observed in the original data and hence are higher variance. Understanding the dof associated with various estimation procedures can provide valuable insights into their behavior as well as the situations when they might be expected to perform better or worse relative to a set of alternative methods. Tibshirani (2015) took an important step in this regard, showing that adaptive procedures like best subset selection (BSS) and forward stepwise selection (FS), even when selecting a model with terms, had more than dof because of the increased dependence on the data incurred through the selection aspect. These additional dof were coined the search degrees of freedom. More recently, Hastie et al. (2017) provided a thorough collection of simulations to demonstrate the predictive advantages of regularized procedures like the lasso and relaxed lasso over BSS and FS, especially in low signal-to-noise ratio (SNR) settings where the SNR is defined as
Much of the work in the following sections was inspired by the approach taken in Hastie et al. (2017) and various portions of the work below follow closely to the setups considered there.
We begin our work by estimating the dof of random forests under various values of mtry. In linear model contexts, the dof for different estimators is generally shown by plotting the estimated dof against the average number of nonzero coefficients in the selected models. In our context with tree-based estimators, we use maxnodes – the maximum number of terminal nodes that any tree within the forest can have – as an analogue. For any given forest with fixed value of mtry, we should expect that as trees are allowed to grow deeper (i.e. maxnodes takes larger values), the estimators should become more sensitive to the data and hence incur higher dof.
We consider two general model forms: a linear model
and the model
which we refer to as ‘MARSadd’ as it is additive and first appeared in the work on Multivariate Adaptive Regression Splines (MARS) by
Friedman (1991). Features in the MARSadd model are sampled independently from . For the linear model, in line with Hastie et al. (2017), we consider three different settings:Low: , ,
Medium: , ,
High-10: , ,
where is the total (training) sample size, denotes the total number of features generated, and is the number of features with a nonzero coefficient thus considered signal. Rows of are independently drawn from , where has entry = . We take and set the first components of equal to 1 with the rest set to 0. This setup corresponds to the general sampling scheme and ‘beta-type 2’ setting from Hastie et al. (2017). For both models here as well as throughout the majority of the remainder of this paper, we consider sampling the noise as where is chosen to produce a corresponding SNR level , so that, for example, in the linear model case, we take
Finally, in most previous literature, denotes the number of features eligible for splitting at each node. Here and throughout the remainder of the paper, we adopt a slightly different (but equivalent) convention by defining mtry as the proportion of eligible features so that . This is purely for readability and ease of interpretation so that readers may more immediately see whether the number of available features is large or small (relative to ) regardless of the particular model setup being considered.
Results are shown in Figure 3. In each of the four model setups we take the SNR to be equal to 3.52 and estimate the dof for random forests with mtry equal to 0.1, 0.33, 0.67, and 1. Note that when the model reduces to bagging (Breiman, 1996) and corresponds to the standard choice of eligible features at each node in regression settings, as is the default in most software. The forests are constructed using the randomForest package in R (Liaw et al., 2002) with the default settings for all arguments except for mtry and maxnodes. Each point in each plot in Figure 3 corresponds to a Monte Carlo estimate of the dof formula given in (3) evaluated over 500 iterations.
Several clear patterns are apparent in Figure 3. First and perhaps most obviously, as conjectured above, in each case we see that the dof increases as maxnodes increases and trees can be grown to a greater depth. Each plot also shows the same general concave increasing shape for each forest. Furthermore, the estimated dof function for each forest lies above the diagonal (shown as a dotted line in each plot), supporting the general notion formalized in Tibshirani (2015) that adaptive procedures like the tree-based models employed here incur additional dof as a result of this search.
More importantly for our purposes, in each plot in Figure 3 we see that at every fixed level of maxnodes, the dof increases with mtry. In particular, bagging () always contains more dof than the default implementation for random forest regression (). Finally, we note that the patterns seen in these plots also hold for numerous other regression functions and SNRs that were experimented with; Figure 9 in Appendix A shows the results of the same experiments above carried out at a much lower SNR of 0.09 and the findings are nearly identical.
The empirical results in the preceding section suggest that the mtry parameter in random forests is directly tied to its dof with larger values resulting in estimators with higher dof and more flexibility. Based on this intuition and the results for linear estimators provided in Hastie et al. (2017), we should therefore expect that random forests with smaller values of mtry to perform well in noisy settings while bagging should perform best – potentially even better than random forests – at high SNRs. We now investigate this more formally.
We begin by comparing the relative performance of random forests and bagging on simulated data across a range of plausible SNRs. In addition to the linear model setup described in the previous section, we now include the additional regression function
which we refer to as ‘MARS’ because like the additive model used previously, it first appeared in the MARS paper (Friedman, 1991), though note that unlike the previous model, it contains explicit interactions between the features. This particular MARS model has proven popular in random forest publications in recent years, appearing, for example, in Biau (2012) and Mentch and Hooker (2016). In the simulation setups described below, we consider the medium setting for the linear model (, , ) and for the MARS model, we take and consider sample sizes of and 10000. Features for the linear model are generated in the same fashion as above and those in the MARS model are sampled independently from .
As in the previous section, the variance of the noise term is chosen so as to induce particular SNRs. Here, following Hastie et al. (2017), we consider 10 SNR values equally spaced between 0.05 and 6 on the log scale. Forests are again constructed using the randomForest package with the default settings except in the case of bagging where the default value of mtry is changed so that all features are available at each split.
Here, in comparing the performance of “random forests” against “bagging”, we stress that we are merely assessing the difference in predictive accuracies between forests constructed with (traditional random forests) versus those built with (bagging). To compare the relative performance for a fixed model setup with fixed SNR, we first generate a training dataset and then evaluate the mean squared error (MSE) for both bagging and random forests on an independently generated test dataset consisting of 1000 observations. This entire process is then repeated 500 times for each setting and we record the average difference in accuracy () across these repititions.
Results are shown in Figure 4. Each plot shows the average difference in test errors versus the SNR; note that positive differences indicate that random forests () are outperforming bagging (). In each of the four regression setups shown in Figure 4, the improvement in accuracy seen with random forests decreases as the SNR increases. For large SNR values, bagging eventually begins to outperform the traditional random forests. Note also that for the MARS function, random forests seem to retain their relative improvement longer (i.e. for larger SNRs) with larger training samples. Given these results, the conventional wisdom that random forests simply “are better” than bagging seems largely unfounded; rather, the optimal value of mtry seems to be a function of the SNR.
The results above indicate that random forests () generally seem to outperform bagging () unless the SNR is large. We now reverse the direction of this investigation and estimate the optimal value of mtry across various SNR levels.
Here, as above, we consider both the MARS model and a linear model. In the same fashion as in previous simulations, the errors are sampled from a where is chosen to produce a particular SNR and we consider the same 10 SNR values as above. For the MARS model, we take and generate features independently from and for the linear model, we take and with features drawn from where the entry of is given by with . The first coefficients in the linear model are set equal to 1 with the rest set equal to 0.
For both models, we consider (training) sample sizes of both and and generate an independent test set of the same size. We construct forests using all possible values of mtry with the remaining options at the default settings in randomForest. The entire process is repeated 500 times and the mtry value corresponding to the forest with the lowest average test error for each setting is selected. The results are shown in Figure 5. As expected, corroborating the findings above, the optimal value of mtry increases with the SNR and the same general pattern emerges for both models and sample sizes. Figure 10 in Appendix A shows a slightly different calculation where the optimal mtry value on each of the 500 iterations is determined and the overall mean is then calculated. Here too we see exactly the same general pattern in that as the SNR increases, so does the optimal value of mtry.
The work above presents strong empirical evidence that the relative improvement in predictive accuracy seen with random forests is largest at low SNRs and more generally, that the optimal value of mtry appears to be a direct (increasing) function of the SNR. These results, however, pertain only to those particular simulation settings that some may argue are highly idealized. Real-world data may contain far more complex relationships and thus we now explore whether the same general findings above also appear in more natural contexts.
Dataset | ||
---|---|---|
Abalone Age [abalone] (Waugh, 1995) | 8 | 4177 |
Bike Sharing [bike] (Fanaee-T and Gama, 2014) | 11 | 731 |
Bioston Housing [boston] (Harrison Jr and Rubinfeld, 1978) | 13 | 506 |
Concrete Compressive Strength [concrete] (Yeh, 1998) | 8 | 1030 |
CPU Performance [cpu] (Ein-Dor and Feldmesser, 1987) | 7 | 209 |
Conventional and Social Movie [csm] (Ahmed et al., 2015) | 10 | 187 |
Facebook Metrics [fb] (Moro et al., 2016) | 7 | 499 |
Parkinsons Telemonitoring [parkinsons] (Tsanas et al., 2009) | 20 | 5875 |
Servo System [servo] (Quinlan, 1993) | 4 | 167 |
Solar Flare [solar] (Li et al., 2000) | 10 | 1066 |
Aquatic Toxicity [AquaticTox] (He and Jurs, 2005) | 468 | 322 |
Molecular Descriptor Influencing Melting Point [mtp2] (Bergström et al., 2003) | 1142 | 274 |
Weighted Holistic Invariant Molecular Descriptor [pah] (Todeschini et al., 1995) | 112 | 80 |
Adrenergic Blocking Potencies [phen] (Cammarata, 1972) | 110 | 22 |
PDGFR Inhibitor [pdgfr] (Guha and Jurs, 2004) | 320 | 79 |
To investigate this, we utilize 10 datasets intended for regression from the UCI Machine Learning Repository (Dua and Graff, 2017). Because most of these datasets are low-dimensional, five additional high-dimensional datasets were also included, four of which were downloaded from openml.org (Vanschoren et al., 2013) with the other (AquaticTox) taken from the R package QSARdata. Summaries of these datasets are provided in Table 1. For datasets containing missing values (csm and fb), the corresponding rows of data were removed. Here we do not know the true SNR and thus to compare the relative performance of bagging and random forests, we inject additional random noise into the response, where each and is chosen as some proportion
of the variance of the original response variable. We consider
where corresponds to the case where no additional noise is added and performance is thus compared on the original data. To compare performance, we measure the relative test error (RTE)(4) |
where and denote the 10-fold cross-validation error on bagging and random forests, respectively, and is the empirical variance of the original response. For each setting on each dataset, the process of adding additional random noise is replicated 500 times and the results are averaged. Once again, forests are constructed using the randomForest package with the default settings except for fixing for bagging and for random forests.
Results are shown in Figure 6 with low-dimensional datasets shown in the left plot and high-dimensional datasets shown on the right. Note that to aid in presentation, these display the shifted RTE rather than the raw calculation in (4). For a given proportion of additional noise , let RTE() denote the corresponding relative test error. The shifted RTE at noise level is then defined as RTE() - RTE(0). This ensures that the relative error for each dataset begins at the origin thereby allowing us to present all results in a single easy-to-interpret plot.
In both plots in Figure 6, the same general pattern appears as has been seen in the subsections above: as we increase the amount of additional noise inserted into the models, the relative improvement in predictive accuracy seen with random forests becomes more pronounced. The only slight exceptions are seen on the low-dimensional csm dataset and the high-dimensional pah dataset were the relative error appears to decrease by a small amount before eventually coming back up when large amounts of noise are added. It is not clear why the initial temporary drops occur in these two datasets, though it is worth noting that even the largest magnitudes of drops are quite small at approximately 0.2% and 0.13% in the csm and pah datasets, respectively.
The results from the previous sections suggest that the optimal value of mtry for random forests is highly data-dependent, with smaller values preferred in noisy settings and bagging () being preferred when very little noise is present. These findings are much in line with the classic understanding of random forests as a means by which the variance of the ensemble is reduced as a by-product of reducing the correlation between trees. The benefits of this variance reduction are most apparent at low SNRs.
In our view, however, this remains only a partial explanation. While randomizing the collection of features eligible for splitting at each node is one way to reduce between-tree correlation, it is certainly not the only means by which this can be accomplished. Breiman (2001) experimented with alternative implementations finding that even very naive approaches like adding random noise to the outputs of each tree could sometimes be beneficial. But as discussed in the opening sections, Breiman (2001), like many others after, also noted that the particular approach where features are randomly precluded from splitting at each node seemed to produce substantially more accurate predictions than other strategies for reducing between-tree correlation. Why this was the case, however, was not clear and has remained a subject of speculation in the nearly two decades following the inception of the procedure.
As already briefly mentioned above, Hastie et al. (2017) recently provided an extended comparison of several variable selection procedures including the lasso (Tibshirani, 1996; Chen et al., 2001), relaxed lasso (Meinshausen, 2007), forward stepwise selection (FS), and best subset selection (BSS). The relaxed lasso estimator utilized in Hastie et al. (2017) takes the form
where denotes the vector of coefficient estimates obtained via least squares (LS) when computed on only those variables selected via the lasso and filled in with 0 for variables not selected. Perhaps the most striking takeaway from their study is that in low-dimensional settings where , the more aggressive procedures (FS and BSS) with higher dof are generally not competitive with a regularized approach like the lasso at low SNRs but eventually produce more accurate predictions when the SNR becomes large. Relaxed lasso, taking the weighted average of lasso and LS-after-lasso estimates, possesses the ability to effectively trade-off the amount of regularization needed depending on the SNR and always seems to perform well.
For these kinds of estimators whose inner-workings are better understood, the reasoning behind the results observed is relatively straightforward. In low SNR settings, procedures like the lasso and relaxed lasso that explicitly regularize the problem can prevent overfitting to the noise by applying shrinkage to the coefficient estimates of the selected features. Given that we see the same general pattern here – random forests () outperforming bagging () except at high SNRs – it is reasonable to suspect that the additional randomness in random forests is playing a similar regularization role. Indeed, by randomly not allowing certain features to be split, random forests may be seen as effectively shrinking the potential influence of features, with the amount of shrinking being proportional to the amount of additional randomness added (with smaller values of mtry inducing more randomness). Speculation to this effect is described in Hastie et al. (2009).
Importantly however, if this is the kind of underlying effect that allows random forests to perform well in practice, such an effect should not be limited to tree-based ensembles. Indeed, if regression trees are seen as merely a complex form of forward selection, then if we were to create bagged and randomized versions of a standard forward selection procedure – analogues to the traditional tree-based versions of bagging and random forests – we should expect to see the same general patterns of relative improvement. In the following sections, we explore this idea and confirm that not only does a similar pattern emerge, but these ensemble-ized extensions of standard forward selection procedures exhibit surprisingly strong performance relative to alternative procedures.
Algorithm 1 formalizes the notion of randomized forward selection (RandFS), which can be seen as a random forest analogue to traditional forward stepwise selection (FS). Note that for any subset of the feature indices , we define as the matrix of feature values whose index is in and as the projection matrix onto the column span of . To carry out RandFS, we begin by drawing bootstrap samples from the original data and performing forward selection on each. However, like random forests, at each step, only a random selection of features are eligible for being included in the model. The process continues until the desired model size (depth) is obtained and the final predictions are taken as an average over the predictions generated by each individual model. When so that all features are eligible at each step, we refer to the procedure as bagged forward selection (BaggFS).
We begin by estimating the dof for RandFS as well as for FS, lasso, and relaxed lasso. Here we follow the same initial setup utilized in Hastie et al. (2017) where we assume a linear model with , , and . Rows of are sampled independently from , where the entry of takes the form with and errors are sampled independently from with chosen to satisfy a particular SNR, in this case 0.7. The first components of are set equal to 1 with the rest equal to 0.
The plots in Figure 7 show the estimated dof for the various methods against the number of nonzero coefficient estimates produced. Each point in each plot corresponds to a Monte Carlo estimate of the dof formula given in (3) evaluated over 500 iterations. As expected, the plot on the right shows quite clearly that as with random forests, larger values of mtry produce RandFS procedures with more dof. More surprising is the relative relationship of the RandFS models to the more classical procedures. The plot on the left shows that the dof for BaggFS is almost identical to that of standard FS. The standard mtry value of 0.33 produces a RandFS model with dof very similar to that of the relaxed lasso with , corresponding to least squares after lasso. Smaller values of mtry appear to offer even more regularization, producing dof similar to relaxed lasso with larger values, corresponding to an estimate where more weight is put on the original lasso coefficient estimates.
Building on the intuition from previous sections as well as that provided in Hastie et al. (2017), the dof results in Figure 7 suggest that we should expect to see RandFS models with small values of mtry have a potential advantage in predictive accuracy relative to FS and BaggFS at low SNRs. We now compare the performance of RandFS relative to BaggFS and the more classical procedures.
Here we consider several linear model setups taken directly from Hastie et al. (2017) and similar to those considered in Section 3. We consider four settings:
Low: , ,
Medium: , ,
High-5: , ,
High-10: , , .
As above, rows of are independently drawn from , where has entry = with and where we set the first components of equal to 1 with the rest set to 0, corresponding to the beta-type 2 setting in Hastie et al. (2017). Noise is once again sampled from where is chosen to produce a corresponding SNR level and we consider the same 10 values utilized above.
Tuning of the FS, lasso, and relaxed lasso procedures is done in exactly the same fashion as in Hastie et al. (2017). In all cases, tuning parameters are optimized on an independent validation set of size . For the low setting, the lasso shrinkage parameter follows the default glmnet settings being tuned across 50 values ranging from small to large fractions of . For relaxed lasso, is tuned across the same 50 values and the parameter that weights the average of the lasso and LS-after-lasso estimates is chosen from 10 equally spaced values between 0 and 1. The depth of the models in FS, BaggFS, and RandFS are tuned across . Note that for BaggFS and RandFS, a selected depth of means that each individual model is built to a depth of and the final average is then taken; since different individual models will generally select different features, the final averaged model will generally contain more than features. In addition to considering the default RandFS () and BaggFS (), we also consider tuning the mtry in RandFS across 10 equally spaced values between 0.1 and 1. In the medium, high-5, and high-10 settings, the parameter in lasso and relaxed lasso is tuned across 100 values rather than 50 and model depths for FS, BaggFS, and RandFS are tuned across ; all other setups remain the same.
To measure performance, we again follow the lead of Hastie et al. (2017) and calculate the test error relative to the Bayes error rate. Specifically, given a test point with and , the relative test error (RTE) to Bayes of a regression estimate is given by
Results are shown in Figure 8; the same plots with error bars corresponding to standard deviation are shown in Appendix A. Each point in each plot corresponds to an average over 100 replications. As expected, the explicit regularizers (lasso and relaxed lasso) perform well in the high-dimensional settings and at low SNRs. In the high-dimensional settings in particular, most methods perform similarly at low SNRs but for larger SNRs, relaxed lasso begins to perform substantially better.
In the low and medium settings, BaggFS largely performs as expected with respect to classical FS. At low SNRs, the variance stabilization offered by BaggFS allows it to outperform FS but that advantage dies out at medium SNRs and both procedures appear similarly optimal relative to the others at high SNRs. It can also be seen in these settings that as originally hypothesized above, RandFS with fixed outperforms BaggFS at low SNRs but loses the advantage at high SNRs.
The performance of RandFS in these settings with respect to the other procedures, however, is what is perhaps most surprising. Note that in the low setting, the RandFS procedure with tuned mtry outperforms all other methods – including lasso and relaxed lasso – until the mid-range SNR values. In the medium setting, RandFS exhibits a similar property to that of relaxed lasso, performing very well at low SNRs but adapting (likely selecting larger values of mtry) to also perform very well at large SNRs. Even more remarkable is the performance of RandFS when and is not tuned – in both the low and medium settings, even this default RandFS outperforms the lasso across all SNRs.
Before concluding out work, we provide some additional intuition into the implicit regularization that appears to be taking place with the randomized ensembles (random forests and RandFS) studied in previous sections. This is more apparent and easily described in the more classical linear model forward selection setting with RandFS, though the same kind of effect is likely present with random forests, which might be seen as simply a more complex form of randomized forward selection.
As above, suppose we have data of the form where each , denotes a vector of features, denotes the response, and the variables have a general relationship of the form . Now suppose that we obtain a regression estimate by averaging over models, each built via randomized forward selection on to a depth of . Note that this is identical to the RandFS procedure described above except that for technical reasons, models are built on the original data each time rather than on bootstrap samples. Each of the individual models produces an estimate of the form
where
is the feature selected at the
step in the model and is the corresponding coefficient estimate. Now consider a particular feature, say, and suppose that the ordinary least squares (OLS) estimator exists and that the OLS coefficient estimate for
is given by . More generally, given an active set containing a subset of feature indices, let denote the OLS estimate of the coefficient for when calculated over only the features with indices in .Given an orthogonal design matrix, for any indexing set , is equal to whenever and equal to 0 otherwise. Thus, if denotes the indices of those features selected for inclusion in the model, then if is selected (i.e. ) and otherwise. The final coefficient estimate produced by RandFS is thus of the form
where denotes the proportion of models in which appears. In practice, for each feature , the selection proportion will depend on the particular data observed, the value of mtry, the depth of each model, and the importance of relative to the other features. In particular though, so long as is relatively large, should still be expected to be somewhat large even for moderately small values of mtry whenever is a relatively important feature because it will have a very good chance of being included in the model if it is made eligible at any step. Thus, in this sense, not only does RandFS have a shrinkage effect on each variable, but variables that appear more important by virtue of being included in many models will be shrunken by less than those included only occasionally. In a very recent study released almost simultaneously to this work, LeJeune et al. (2019) demonstrate the same kind of shrinkage effect for randomly subsampled OLS ensemble estimators.
The results in the previous sections provide substantial evidence that the mtry parameter – the distinguishing feature of random forests – has an implicit regularization effect on the procedure. Much like the tuning parameter in explicit regularization procedures like lasso and ridge regression, mtry serves to mitigate overfitting to particular features. Speculation along these lines is discussed informally in Hastie et al. (2009) who point out that random forests seem to behave similarly to ridge regression. Thus, contrary to conventional wisdom, random forests are not simply “better” than bagging, but rather, the relative success of random forests depends heavily on the amount of noise present in the problem at hand. At low to moderate SNRs, random forests are seen to perform substantially better than bagging whereas bagging becomes far more competitive in high SNR regimes. This suggests further that at least in regression contexts, the success of random forests is not in fact due to any kind of potential interpolation as was recently hypothesized in Wyner et al. (2017).
The obvious question is then, “Why do random forests appear to work so well in practice on real datasets?” As discussed in the introduction, there is certainly considerable evidence that random forests do often perform very well in a variety of settings across numerous scientific domains. In our view, the clear answer is that in practice, many datasets simply are quite noisy. Hastie et al. (2017) provide a thorough discussion along these lines, arguing that although commonly employed in simulations, on real data, SNRs as large as 5 or 6 are extremely rare. If this is indeed the case, it would explain why random forests are so often viewed as inherently superior to bagging.
Finally, we note that our findings are very much in line both with previous findings and with common practice. In practical applications, random forests are generally used “off-the-shelf” without any tuning of the mtry parameter. The plots corresponding to the low and medium settings in Figure 8 may shed some light on this: though not always optimal, the procedure with a fixed mtry value of 0.33 generally performs quite well. On the other hand, the results provided above do strongly support the notion discussed in many previous studies (Díaz-Uriarte and De Andres, 2006; Genuer et al., 2008; Bernard et al., 2009; Genuer et al., 2010; Probst et al., 2019) that the mtry parameter can significantly influence performance. Just as with explicit regularization procedures, the results above suggest that the mtry parameter in random forests can be thought of as controlling the amount of shrinkage and regularization and thus is best tuned in practice.
This research was supported in part by the University of Pittsburgh Center for Research Computing through the resources provided.
Influence of hyperparameters on random forest accuracy.
In International Workshop on Multiple Classifier Systems, pages 171–180. Springer, 2009.Journal of Multivariate Analysis
, 101(10):2499–2518, 2010.An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization.
Machine learning, 40(2):139–157, 2000.Progress in Artificial Intelligence
, 2(2-3):113–127, 2014.International Journal of Computer Vision
, 101(3):437–458, 2013.Relevance of airborne lidar and multispectral image data for urban scene classification using random forests.
ISPRS Journal of Photogrammetry and Remote Sensing, 66(1):56–66, 2011.Head pose estimation based on random forests for multiclass classification.
In 2010 20th International Conference on Pattern Recognition, pages 934–937. IEEE, 2010.In Section 3, Figure 3 shows the estimated degrees of freedom for random forests across various values of mtry in four different regression setups at a fixed SNR of 3.52. In each case we see that the dof increases with both maxnodes and mtry and we note that the same relationships were seen to also emerge in alternative setups. Figure 9 below shows the same experiments carried out with the SNR set to the much smaller level of 0.09 and indeed the findings remain the same.
In Section 4.2 we considered the problem of estimating the optimal value of mtry for random forests for both the linear and MARS models at various SNR levels. In the main text, Figure 5 shows plots of the optimal mtry measured as that which obtains the lowest average test error over 500 iterations. In contrast, Figure 10 here calculates the optimal mtry value on each of the 500 iterations and then takes the empirical mean. The results in Figure 10 show the same general pattern as seen in Figure 5. As the SNR increases, so does the optimal value of mtry.