1 Aggregation, two kinds of randomization and resampling
As clearly shown by Bia_Sco:2016:TEST, two key ingredients of random forests are the aggregation
process (several trees are combined to get the final estimator) and the diversity of the trees that are aggregated. We can distinguish two sources of diversity:
randomization of the partition,
randomization of the labels, that is, of the predicted value in each cell of the partition, given the partition.
In Breiman’s random forests, resampling acts on the two kinds of randomization, while the random choice of candidate split coordinates (the mtry step) at each node of each tree acts on the randomization of the partitions only. In purely random forests, partitions are built independently from the data, so resampling (if any) only acts on the randomization of the labels. Therefore, the role of each kind of randomization is easier to understand, and to quantify separately, for purely random forests. In this section, we propose to do so for the one-dimensional toy forest introduced by Arl_Gen:2014.
1.1 Toy forest
We first define the toy forest model, assuming that . Let be some integers. The partition associated to each tree is given by
with . Then, for each tree, a subsample of size is chosen (uniformly over the subsamples of size ), independently from . The tree estimator is defined as usual: the predicted value at is the average of the such that belongs to , the cell of the partition that contains . Finally, the forest estimator is obtained by averaging trees, where the trees are independent conditionally on .
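The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: we take the covariates in [0, 1), a regular partition into cells of width 1/k shifted by a uniform random offset (our reading of the toy model of Arl_Gen:2014), and we predict 0 on an empty cell. The names `toy_tree_predict`, `toy_forest_predict`, `k`, `a` and `q` are ours.

```python
import random

def toy_tree_predict(x, xs, ys, k, a, rng):
    """One toy tree: randomly shifted regular partition, labels from a subsample of size a."""
    t = rng.random()  # assumed: shift drawn uniformly on [0, 1), independently of the data

    def cell(u):
        # Index of the cell [(j - t) / k, (j + 1 - t) / k) that contains u.
        return int(k * u + t)

    idx = rng.sample(range(len(xs)), a)  # subsample without replacement
    labels = [ys[i] for i in idx if cell(xs[i]) == cell(x)]
    # Assumed convention: predict 0 when the cell contains no subsampled point.
    return sum(labels) / len(labels) if labels else 0.0

def toy_forest_predict(x, xs, ys, k, a, q, seed=0):
    """Average of q toy trees, drawn independently given the data."""
    rng = random.Random(seed)
    return sum(toy_tree_predict(x, xs, ys, k, a, rng) for _ in range(q)) / q
```

Taking `a = len(xs)` removes subsampling. Fixing the shift `t` to a constant instead of drawing it would remove the randomization of the partitions.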
Assume that , the are independent with the same distribution, the noise-level is constant, is of class , and . Let (to avoid border effects, see Arl_Gen:2014 for details) be fixed. Then, Table 1 provides the order of magnitude of
the quadratic risk at of a toy forest with trees, in various situations: with or without aggregation ( or ), with or without randomization of the partitions (we remove randomization of partitions by putting for all trees). Subsampling can also be removed by taking in all formulas. Note that in Table 1, the risk is written as the sum of an approximation error and an estimation error, see Appendix A. The main lines of the proof are given in Appendix B.
| | Single tree | Infinite forest |
|---|---|---|
| No randomization of partitions | | |
| Randomization of partitions | | |
Table 1 allows us to quantify the impact of aggregation, randomization of the partitions, randomization of the labels, and their combinations, on the performance of toy forests:
aggregation: comparing the two columns of Table 1 shows that aggregation always improves the performance (which is true for any forest model, by Jensen’s inequality). The improvement can be huge: when partitions and labels are randomized, and , both approximation and estimation errors decrease by an order of magnitude.
randomization of the partitions: comparing the two rows of Table 1 shows that randomizing the partitions strongly improves the performance of the infinite forest (there is no change for a single tree, of course). The approximation error decreases by an order of magnitude, as previously shown by Arl_Gen:2014. The estimation error also decreases (as shown by Gen:2012 for another pure forest), but only by a constant factor.
randomization of the labels: comparing with in the formulas of Table 1 shows the influence of subsampling, which is the only randomization mechanism for the labels. Single trees perform worse when (as expected since the sample size is lowered). The performance of infinite forests does not change with subsampling, which might seem a bit surprising given several results mentioned by Bia_Sco:2016:TEST. This phenomenon corresponds to the fact that subagging does not improve a stable estimator (Buh_Yu:2002), and that a regular histogram is stable. Section 1.2 below explains why there is no contradiction with the random forests literature.
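The Jensen argument invoked for aggregation can be written in one line. The notation here is ours and only illustrative: the $\hat{s}_j$ are the $q$ individual tree estimators and $s$ is the regression function. By convexity of the square,

```latex
\left( \frac{1}{q} \sum_{j=1}^{q} \hat{s}_j(x) - s(x) \right)^{2}
\;\le\;
\frac{1}{q} \sum_{j=1}^{q} \bigl( \hat{s}_j(x) - s(x) \bigr)^{2} .
```

Taking expectations on both sides, and using that the trees have the same distribution, the quadratic risk of the forest at $x$ is at most the quadratic risk of a single tree at $x$.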
Section 1.1 sheds some light on previous theoretical results on random forests, and suggests a few conjectures which deserve to be investigated.
Parametrization of the trees.
The end of Section 3.1 of Biau and Scornet’s survey might seem contradictory with the above results for the toy forest. According to most papers in the literature, “random forests reduce the estimation error of a single tree, while maintaining the same approximation error”. Moreover, an infinite forest can be consistent even when a single tree (grown with a sample of size ) is not. Table 1 precisely shows the opposite situation: the estimation error is almost the same for a single tree and for an infinite forest, while the approximation error is dramatically smaller for an infinite forest. In addition, when an infinite forest is consistent, hence a single tree trained with points is also consistent.
The point is that these results and ours consider different parametrizations of the trees. In Section 1.1, trees are parametrized by the number of leaves ; so, when comparing a tree with a forest, we think it fair to compare (i) a tree of leaves trained with data points, with (ii) a forest where each tree has leaves and is trained with data points. In the literature, trees are often parametrized by the number of data points per cell. Then, comparisons are done between (i) a tree of leaves trained with data points, and (ii) a forest where each tree has leaves and is trained with data points. So, if we take , the two approaches consider (approximately) the same forest (ii), but the reference trees (i) are quite different.
We do not mean that one of these two parametrizations is definitely better than the other: is a natural parameter for Breiman’s random forests, while toy forests are naturally parametrized by their number of leaves. Nevertheless, one must keep in mind that any comparison between a forest and a single tree trained with the full sample does depend on the parametrization.
The parametrization by can also hide some difficulties. For the toy forest model, has to be chosen, and this is not an easy task. One could think that this problem is solved by taking the parametrization with, say, . This is wrong because we then have to choose the subsample size , which is equivalent to the original problem since .
What about Breiman’s forests?
Section 1.1 suggests that for the toy forest, the most important ingredient in the tree diversity is the randomization of the partitions. We conjecture that this holds true for general random forests.
Nevertheless, we do not mean that the resampling step should always be discarded, since for Breiman’s random forests (for instance), resampling also acts on the randomization of the partitions. A key open problem is to quantify the relative roles of resampling and of the choice of on the randomization of the partitions. Section 2 below shows that “hold-out random forests” can be a good playground for such investigations.
Bootstrap or subsampling?
Another important question is the choice between the bootstrap and subsampling, which remains an open problem according to Bia_Sco:2016:TEST.
We conjecture that Table 1 is also valid for the out of bootstrap, which would mean that bootstrap and subsampling are fully equivalent with respect to the randomization of the labels, with no impact on the performance of pure forests. Assuming this holds true, subsampling (with or without replacement) with remains interesting for reducing the computational cost—Table 1 only requires that .
A key open problem remains: compare bootstrap and subsampling with respect to the randomization of the partitions for Breiman’s random forests. Again, “hold-out random forests” described in Section 2 should be a good starting point for such a comparison.
2 Hold-out random forests
We now consider a more complex forest model, called hold-out random forests, which is close to Breiman’s random forests while being simpler to analyze. Hold-out random forests have been proposed by Bia:2012 and appear in the experiments of Arl_Gen:2014.
Hold-out random forests can be defined as follows. First, the data set is split, once and for all, into two subsamples and , of respective sizes and , satisfying . This split is done independently from . Then, conditionally on , the trees are built independently as follows. The partition associated with the -th tree is built as for Breiman’s random forests with as data set. In other words, it is the partition defined by Bia_Sco:2016:TEST with training set as an input. The -th tree estimate at is defined by
where is the cell of this partition that contains and is the number of points such that . Finally, the hold-out forest estimate is defined by
In the definition of , the building of the partitions depends on the same parameters as Breiman’s random forests: , , the fact that resampling can be done with or without replacement, and the resample size . It is also possible to add another resampling step when assigning labels to the leaves of each tree with ; we do not consider it here since Section 1 suggests that this would not change the performance much (at least for forests).
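The definition above can be illustrated by the following minimal sketch in plain Python. It is not the randomForest implementation: as simplifying assumptions, the partitions are grown by a depth-limited CART-like procedure with a squared-error split criterion, resampling (if any) is a bootstrap of the second subsample, and an empty cell predicts 0. All function and parameter names (`grow`, `leaf_id`, `holdout_forest`, `depth`, ...) are hypothetical.

```python
import random

def sse(vals):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def grow(xs, ys, mtry, depth, rng):
    """Partition grown on (xs, ys). A leaf is None; a node is (coord, threshold, left, right)."""
    if depth == 0 or len(xs) < 2:
        return None
    best = None
    for j in rng.sample(range(len(xs[0])), mtry):  # candidate coordinates redrawn at each node
        vals = sorted(set(x[j] for x in xs))
        for lo, hi in zip(vals, vals[1:]):
            thr = (lo + hi) / 2
            cost = (sse([y for x, y in zip(xs, ys) if x[j] <= thr])
                    + sse([y for x, y in zip(xs, ys) if x[j] > thr]))
            if best is None or cost < best[0]:
                best = (cost, j, thr)
    if best is None:  # every candidate coordinate is constant on this node
        return None
    _, j, thr = best
    li = [i for i, x in enumerate(xs) if x[j] <= thr]
    ri = [i for i, x in enumerate(xs) if x[j] > thr]
    return (j, thr,
            grow([xs[i] for i in li], [ys[i] for i in li], mtry, depth - 1, rng),
            grow([xs[i] for i in ri], [ys[i] for i in ri], mtry, depth - 1, rng))

def leaf_id(tree, x):
    """Identify the cell containing x by the sequence of left/right moves."""
    path = ()
    while tree is not None:
        j, thr, left, right = tree
        go_left = x[j] <= thr
        path += (go_left,)
        tree = left if go_left else right
    return path

def holdout_forest(x1, y1, x2, y2, mtry, depth, q, bootstrap=True, seed=0):
    """Partitions are built from (x2, y2) only; labels are averages over (x1, y1) only."""
    rng = random.Random(seed)
    trees = []
    for _ in range(q):
        idx = [rng.randrange(len(x2)) for _ in x2] if bootstrap else range(len(x2))
        trees.append(grow([x2[i] for i in idx], [y2[i] for i in idx], mtry, depth, rng))

    def predict(x):
        preds = []
        for t in trees:
            cell = leaf_id(t, x)
            labels = [y for xi, y in zip(x1, y1) if leaf_id(t, xi) == cell]
            preds.append(sum(labels) / len(labels) if labels else 0.0)  # assumed convention
        return sum(preds) / q

    return predict
```

Since the partitions never see the first subsample, the resulting forest is purely random in the sense used in this section, while the partitions can still adapt to the data through the second subsample.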
A key property of hold-out random forests is that they are purely random: the subsamples and are independent, hence the partitions associated with the trees are independent from . Note however that each partition still depends on some data (through ), hence it could adapt to some features of the data, such as the “sparsity” of the regression function, the non-uniformity over of the smoothness of , or the non-uniformity of the distribution of . Therefore, hold-out random forests can capture much of the complexity of Breiman’s random forests, while being easier to analyze since they are purely random. In particular, we can apply the results proved in Appendix A (where our is written , and our is hidden in the relationship between and the partition of the -th tree). Then, the quadratic risk of can be (approximately) decomposed into the sum of an approximation error and an estimation error, and these two terms can be studied separately. For instance, the results of Arl_Gen:2014 can be applied in order to analyze the approximation error.
For now, we only study the behaviour of these two terms in a short numerical experiment. The results are summarized by Table 2, where estimated values of the approximation and estimation errors are reported as a function of , and of the parameters of the partition building process (, and bootstrap).
|Single tree||Large forest|
Detailed information about this experiment can be found in Appendix C. Let us emphasize here that we consider a single data generation setting, hence these results must be interpreted with care.
Based on Table 2 and our experience about Breiman’s random forests, we can make the following comments.
Choice of .
As illustrated in Table 2, choosing instead of decreases the risk of an infinite forest. When there is no bootstrap, the performance gain is significant because this choice is then the only source of randomization of the partitions. Even in the presence of bootstrap, it slightly reduces the approximation error. In the same experiments with , the gain of decreasing in the bootstrap case is larger (see the supplementary material).
Our belief is that when there is some bootstrap, the additional randomization given by taking can reduce the risk in some cases, where typically (which holds true in our experiments). This is supported by the experiments of Gen_Pog_Tul:2008, where small values of mtry give significantly lower risk than for some classification problems. For regression, Gen_Pog_Tul:2008 obtain similar performance when decreasing mtry, which is consistent with Table 2 since these experiments are done in the bootstrap case.
When and only a small proportion of the coordinates of are informative, we conjecture that the optimal is close to (provided that there is some bootstrap step for randomizing the partitions). Indeed, if mtry is significantly smaller than , then the probability of choosing at least one informative coordinate in is not close to , hence the randomization of the partitions might be too strong.
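The probability mentioned in this argument is easy to compute when the candidate coordinates are drawn uniformly without replacement, which we assume here. The sketch below uses hypothetical names: `p` for the total number of coordinates, `p_informative` for the number of informative ones, and `mtry` for the number of candidates drawn at a node.

```python
from math import comb

def prob_informative(p, p_informative, mtry):
    """P(at least one informative coordinate among mtry drawn without replacement from p)."""
    # Complement of the event "all mtry candidates are non-informative".
    return 1.0 - comb(p - p_informative, mtry) / comb(p, mtry)
```

For instance, with 5 informative coordinates out of 10, a single candidate per node is informative with probability one half, so half of the splits would be made along a non-informative coordinate.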
Bootstrap, and randomization of the partitions.
When , according to Table 2, the bootstrap helps to significantly reduce the risk, compared with the no randomization case. Overall, we get a significantly smaller risk when there is at least one source of randomization of the partitions.
Comparing the three combinations of parameters (bootstrap, , or both) for which the partitions are randomized is more difficult: the differences observed in Table 2 might not be significant. Nevertheless, Table 2 suggests that the lowest risk might be obtained when two sources of randomization are present ( and bootstrap). And if we have to choose only one source of randomization, it seems that randomizing with only yields a smaller risk than bootstrapping only.
Appendix A Approximation and estimation errors
We state a general decomposition of the risk of a forest having the -property (that is, when partitions are built independently from ), that we need for proving the results of Section 1, but can be useful more generally. We assume that for all .
For any random forest having the -property, following Bia_Sco:2016:TEST, we can write
is the number of times appears in the -th resample, is the cell containing in the -th tree, and
Now, let us define
By definition of the conditional expectation, we can decompose the risk of at into three terms
In the fixed-design regression setting (where the are deterministic), is called approximation error, , and is called estimation error. Things are a bit more complicated in the random-design setting—when are independent and identically distributed—since in general. Up to minor differences related to how is defined on empty cells, is still the approximation error, and the estimation error is .
Let us finally assume that are independent and define
Then, since the weights only depend on through , we have the following formula for the estimation error
For instance, in the homoscedastic case, and
Appendix B Analysis of the toy forest: proofs
We prove the results stated in Section 1 for the one-dimensional toy forest.
Since we assume is of class , we can use the results of Arl_Gen:2014 for the approximation error (up to minor differences in the definition of , due to the event where is empty, which has a small probability since ). We assume that and for simplicity, so the quantities appearing in Table 1 indeed provide the order of magnitude of .
The middle term in decomposition (2) is negligible compared with for a single tree, which can be proved using results from Arl:2008a, as soon as and . We assume that it can also be neglected for an infinite forest.
For the estimation error, we can use Eq. (3) and the following arguments. First, for every , belongs to with probability . Combined with the subsampling process, we get that
is close to its expectation with probability almost one if . Assuming that this holds simultaneously for a huge fraction of the subsamples, we get the approximation
Now, we note that conditionally to , the variables ,
are independent and follow a Bernoulli distribution with the same parameter
Similar arguments apply for justifying the top line of Table 1, where almost surely.
Note that we have not given a fully rigorous proof of the results shown in Table 1, because of the approximation (4) and of the term that we have neglected. We are convinced that the parts of the proof that we have skipped would only require adding some technical assumptions, which would not help to reach our goal of better understanding random forests in general.
Appendix C Details about the experiments
This section describes the experiments whose results are shown in Section 2.
Data generation process.
We take , with . Table 2 only shows the results for . Results for are shown in supplementary material.
The data are independent with the same distribution: , with independent from , , and the regression function is defined by
The function is proportional to the Friedman1 function which was introduced by Fri:1991. Note that when , only depends on the first coordinates of .
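For reference, the standard Friedman1 function can be written as follows. The proportionality constant used in our experiments is not restated here, so the `scale` argument below is a placeholder with an arbitrary default of 1; the function name and signature are illustrative.

```python
from math import pi, sin

def friedman1(x, scale=1.0):
    """Friedman1 regression function times `scale`; only the first five coordinates of x matter."""
    return scale * (10 * sin(pi * x[0] * x[1])
                    + 20 * (x[2] - 0.5) ** 2
                    + 10 * x[3]
                    + 5 * x[4])
```

Any coordinate of `x` beyond the fifth is ignored, which matches the remark that only the first coordinates are informative in higher dimension.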
Then, the two subsamples are defined by and .
We always take and .
Trees and forests.
For each , each experimental condition (bootstrap or not, or ), we build some hold-out random trees and forests as defined in Section 2. These are built with the randomForest R package (Lia_Wie:2002; R_Core:2014), with appropriate parameters ( is controlled by maxnodes, while ).
Resampling within (when there is some resampling) is done with a bootstrap sample of size (that is, with replacement and ).
“Large” forests are made of trees, a number of trees suggested by Arl_Gen:2014.
Estimates of approximation and estimation error.
Estimating approximation and estimation errors (as defined by Eq. (2)) requires estimating some expectations over (which includes the randomness of as well as the randomness of the choice of bootstrap subsamples of and of the repeated choices of a subset ). This is done with a Monte-Carlo approximation, with replicates for trees and replicates for forests. This number might seem small, but we observe that large forests are quite stable, hence expectations can be evaluated precisely from a small number of replicates.
We estimate the approximation error (integrated over ) as follows. For each partition that we build, we compute the corresponding “ideal” tree, which maps each piece of the partition to the average of over it (this average can be computed almost exactly from the definition of ). Then, to each forest we associate the “ideal” forest which is the average of the ideal trees. We can thus compute for any , and estimate its expectation with respect to . Averaging these estimates over uniform random points provides our estimate of the approximation error.
We estimate the estimation error (integrated over ) from Eq. (3); since is known, we focus on the remaining term. Given some hold-out random forest, for any and , we can compute
Then, averaging over several replicate trees/forests and over uniform random points , we get an estimate of the estimation error (divided by ).
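As an illustration of this step, the sketch below computes the weights that a forest puts on the labelled points and the sum of their squares. We assume that the quantity of interest is this sum of squared weights, which is the variance of a weighted average of independent labels with unit noise variance when the weights do not depend on the labels. The representation of a tree by a function mapping a point to a cell identifier, and the names `forest_weights` and `sum_sq_weights`, are hypothetical.

```python
def forest_weights(x, points, cell_fns):
    """Weight of each labelled point in the forest prediction at x.

    cell_fns holds one function per tree, mapping a point to the identifier of its cell.
    """
    q = len(cell_fns)
    w = [0.0] * len(points)
    for cell in cell_fns:
        target = cell(x)
        members = [i for i, xi in enumerate(points) if cell(xi) == target]
        for i in members:
            # Each tree averages the labels of its cell; the forest averages the q trees.
            w[i] += 1.0 / (len(members) * q)
    return w

def sum_sq_weights(x, points, cell_fns):
    """Sum of squared forest weights at x."""
    return sum(wi ** 2 for wi in forest_weights(x, points, cell_fns))
```

Averaging `sum_sq_weights` over replicate forests and over random query points then gives a Monte-Carlo estimate of the kind described above.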
Summarizing the results in Table 2.
Given the estimates of the (integrated) approximation and estimation errors that we obtain for every , we plot each kind of error as a function of (in log-log scale for the approximation error), and we fit a simple linear model (with an intercept). The estimated parameters of the model directly give the results shown in Table 2 (in which the value of the intercept for the estimation error is omitted for simplicity). The corresponding graphs are shown in supplementary material.
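The fitting step can be sketched as an ordinary least-squares fit, applied to logarithms for the approximation error (assuming a log-log scale, so that the slope is the exponent of a power law) and to the raw values for the estimation error. The helper names `fit_line` and `fit_power_law` are ours, and natural logarithms are an arbitrary choice that does not affect the slope.

```python
from math import log

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def fit_power_law(ks, errors):
    """Fit log(error) = slope * log(k) + intercept, i.e. error ~ exp(intercept) * k ** slope."""
    return fit_line([log(k) for k in ks], [log(e) for e in errors])
```

For example, errors decreasing like the inverse square of the number of leaves would give a fitted slope close to -2.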
The research of the authors is partly supported by the French Agence Nationale de la Recherche (ANR 2011 BS01 010 01 projet Calibration). S. Arlot is also partly supported by Institut des Hautes Études Scientifiques (IHES, Le Bois-Marie, 35, route de Chartres, 91440 Bures-Sur-Yvette, France).
Appendix D Supplementary Material
|Single tree||Large forest|