1 Theoretical Background
We consider a supervised learning setting in which we assume that training and test points are drawn i.i.d. according to some distribution $\mathcal{D}$ over the input space $\mathcal{X}$ and labels $\mathcal{Y}$. For training we are given a labeled sample $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$ is a $d$-dimensional feature vector and $y_i \in \mathcal{Y}$ is the corresponding target. For regression problems we have $\mathcal{Y} = \mathbb{R}$. For binary classification we use $\mathcal{Y} = \{-1, +1\}$, and for multiclass problems with $C$ classes we encode each label as a one-hot vector $y \in \{0, 1\}^C$ which contains a '1' at coordinate $c$ for label $c$.

In this paper we are interested in the overfitting behavior of ensembles, and more specifically of Random Forests. In this context we refer to overfitting as the u-shaped curve depicted in Figure 1, in which the test error increases again after a certain complexity, and not to the fact that in many practical applications there is a gap between test and training error. We assume a convex combination of $M$ classifiers $h_1, \ldots, h_M \in \mathcal{H}$, each scaled by its respective weight $w_i$:

$f(x) = \sum_{i=1}^{M} w_i h_i(x)$ with $w_i \ge 0$ and $\sum_{i=1}^{M} w_i = 1$.

For concreteness, we assume that $\mathcal{H}$ is the model class of axis-aligned decision trees (DTs) and $f$ is a random forest. An axis-aligned DT partitions the input space into increasingly smaller $d$-dimensional hypercubes called leaves and uses independent predictions for each leaf node in the tree. Formally, we represent a tree as a directed graph with a root node and two children per node. Each node in the tree belongs to a subcube, and the children of each node recursively partition the region of their parent node into non-overlapping smaller cubes. Inner nodes perform an axis-aligned split $x_k \le t$, where $k$ is a feature index and $t$ is a split threshold. Each node is associated with a split function $s(x)$ that is '1' if $x$ belongs to the hypercube of that node and '0' otherwise. To compute $k$ and $t$, the gini score (CART algorithm [3]) or the information gain (ID3 algorithm [17]) is minimized, both of which measure the impurity of a split. The induction starts with the root node and the entire dataset. Then the optimal split is computed and the training data is divided into a left part ($x_k \le t$) and a right part ($x_k > t$). This splitting is repeated recursively until either a node is 'pure' (it contains only examples from one class) or another abort criterion, e.g. a maximum number of leaf nodes, is reached. The prediction of a leaf node is computed by estimating the class probabilities from all training observations in that leaf. To classify a new example one starts at the root node and traverses the tree according to the comparison $x_k \le t$ in each inner node. Let $L$ be the total number of leaf nodes in the tree; then the prediction function of a tree is given by

$h(x) = \sum_{l=1}^{L} \hat{y}_l \, s_l(x)$

where $\hat{y}_l$ is the (constant) prediction value per leaf and $s_l$ are the split functions of the individual hypercubes.
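To make the traversal concrete, here is a minimal, illustrative implementation of such a tree; the `Node` class and the toy values are ours, not the code used in the experiments:

```python
# A minimal sketch of an axis-aligned decision tree: inner nodes test
# x[feature] <= threshold, leaves return a constant class-probability vector.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # split feature index k (inner nodes only)
        self.threshold = threshold  # split threshold t (inner nodes only)
        self.left = left            # subtree for x[k] <= t
        self.right = right          # subtree for x[k] > t
        self.value = value          # constant leaf prediction (None for inner nodes)

def predict(node, x):
    """Traverse from the root to a leaf and return its constant prediction."""
    while node.value is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# Toy tree predicting class probabilities for a 2-dimensional input.
tree = Node(feature=0, threshold=0.5,
            left=Node(value=[0.9, 0.1]),
            right=Node(feature=1, threshold=2.0,
                       left=Node(value=[0.2, 0.8]),
                       right=Node(value=[0.5, 0.5])))

print(predict(tree, [0.3, 1.0]))  # -> [0.9, 0.1]
```

Only the leaf reached by the traversal contributes to the prediction, which is exactly the sum over the split functions $s_l$ above: all other indicators evaluate to zero.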
A Random Forest (RF) extends a single DT by training a set of $M$ axis-aligned decision trees on bootstrap samples, using randomly sampled features at each split, and weighting the trees equally with $w_i = 1/M$. Algorithm 1 summarizes the RF algorithm.
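The RF recipe can be sketched as follows. To keep the sketch short, each "tree" is a decision stump (`fit_stump`), which is an illustrative stand-in for CART; the bagging structure — bootstrap samples, random feature subsets, equal-weight majority vote — is the point here:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y, feature_ids):
    """Stand-in for CART on a feature subset: best single axis-aligned split by accuracy."""
    maj = int(np.bincount(y, minlength=2).argmax())
    best = (int(feature_ids[0]), float(X[:, feature_ids[0]].mean()), maj, maj)
    best_acc = float((y == maj).mean())
    for k in feature_ids:
        for t in np.unique(X[:, k]):
            mask = X[:, k] <= t
            if mask.all() or not mask.any():
                continue
            lc = int(np.bincount(y[mask], minlength=2).argmax())   # left majority class
            rc = int(np.bincount(y[~mask], minlength=2).argmax())  # right majority class
            acc = float((np.where(mask, lc, rc) == y).mean())
            if acc > best_acc:
                best, best_acc = (int(k), float(t), lc, rc), acc
    return best

def fit_forest(X, y, n_trees=25, n_feats=1):
    n, d = X.shape
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, n)                         # bootstrap sample (with replacement)
        feats = rng.choice(d, size=n_feats, replace=False)  # random feature subset
        forest.append(fit_stump(X[idx], y[idx], feats))
    return forest

def predict_forest(forest, x):
    votes = [lc if x[k] <= t else rc for (k, t, lc, rc) in forest]
    return max(set(votes), key=votes.count)                 # equal weights: majority vote

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [2., 2.], [3., 3.]])
y = np.array([0, 0, 0, 0, 1, 1])
forest = fit_forest(X, y)
print(predict_forest(forest, np.array([3., 3.])), predict_forest(forest, np.array([0., 0.])))
```

Individual stumps may be fit on unlucky bootstrap samples, but the equal-weight vote averages these errors out — the mechanism the analysis below makes precise.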
2 There is no double-descent in RF
In statistical learning theory the generalization error of a model $f$ is bounded in terms of its empirical error, given some loss function $\ell$ and a complexity measure for the trained model. For concreteness, consider a binary classification problem with $\mathcal{Y} = \{-1, +1\}$, let $f$ be a prediction model and let $\rho \ge 0$ be the classification margin. We denote the binary classification error of $f$ with respect to $\mathcal{D}$ with $\mathcal{E}(f)$ and the empirical margin error of $f$ with respect to $\rho$ on $S$ with $\widehat{\mathcal{E}}_\rho(f)$:

(1) $\mathcal{E}(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\mathbb{1}\{y f(x) \le 0\}\right]$

(2) $\widehat{\mathcal{E}}_\rho(f) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{y_i f(x_i) \le \rho\}$

Intuitively, a large margin indicates how convinced we are in our predictions. For example, consider a weakly positive and a strongly positive prediction, both of which can be considered to be class '+1'. Choosing $\rho$ larger than the weak prediction means that we only accept predictions above $\rho$ as '+1', so that the weak prediction is regarded as wrong in any case, regardless of the actual label $y$. Choosing $\rho = 0$ indicates that we do not care whether a prediction is barely or strongly positive; both are considered equally to belong to class '+1'. The following theorem bounds the generalization error of a convex combination of classifiers in terms of their individual Rademacher complexities.
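The margin error of Eq. (2) is straightforward to compute; the scores and thresholds below are illustrative values of ours:

```python
import numpy as np

def margin_error(scores, labels, rho):
    """Fraction of examples whose margin y * f(x) does not exceed rho (Eq. 2)."""
    margins = labels * scores
    return float(np.mean(margins <= rho))

scores = np.array([0.1, 0.5, -0.2, 0.9])   # f(x) for four examples
labels = np.array([1, 1, -1, -1])          # true labels in {-1, +1}

print(margin_error(scores, labels, 0.0))   # rho = 0: counts only sign mistakes
print(margin_error(scores, labels, 0.25))  # rho = 0.25: also penalizes the weak +0.1 score
```

With $\rho = 0$ only the mislabeled example (score $0.9$ with label $-1$) counts as an error; raising $\rho$ to $0.25$ additionally penalizes the correct but unconvinced predictions.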
Theorem 1 (Convex combination of classifiers [7]).
Let $\mathcal{H}$ denote a set of base classifiers and let $f(x) = \sum_{i=1}^{M} w_i h_i(x)$ with $w_i \ge 0$, $\sum_i w_i = 1$ be the convex combination of classifiers $h_i \in \mathcal{H}$. Furthermore, let $\mathfrak{R}(h_i)$ be the Rademacher complexity of the $i$-th classifier. Then, for a fixed margin $\rho > 0$ and for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample of size $N$ drawn i.i.d. according to $\mathcal{D}$, the following inequality holds:

$\mathcal{E}(f) \le \widehat{\mathcal{E}}_\rho(f) + \frac{4}{\rho} \sum_{i=1}^{M} w_i \mathfrak{R}(h_i) + C(\rho, M, N)$

where $C(\rho, M, N)$ is a constant depending on $\rho$ which tends to $0$ for $N \to \infty$ for any fixed $\rho$ and $M$.
Theorem 1 offers two interesting insights: First, the Rademacher complexity of a convex combination of classifiers does not increase with the number of classifiers, and thus an ensemble is not more likely to overfit than its individual base learners. Second, the individual Rademacher complexities of the base learners are scaled by their respective weights. For uniform weights $w_i = 1/M$, where all classifiers have the same complexity, this bound recovers the well-known result from [11].
The key question now becomes how to compute the Rademacher complexity of the trees inside the forest. It is well-known that the Rademacher complexity is related to the VC-dimension $d_{VC}$ via

(3) $\mathfrak{R}(\mathcal{H}) \le \sqrt{\frac{2\, d_{VC} \ln\left(e N / d_{VC}\right)}{N}}$
Interestingly, the exact VC-dimension of decision trees is unknown. Aslan et al. performed an exhaustive search to compute the VC-dimension of trees of small depth in [aslan/etal/2009], but so far no general formula has been discovered. However, there exist some useful bounds. A decision tree with $n$ nodes trained on $d$ binary features has a VC-dimension of at most order [12]:

(4) $d_{VC} = \mathcal{O}(n \log_2 d)$
Leboeuf et al. extend this bound to continuous features in [leboeuf/etal/2020] by introducing the concept of partition functions into the VC framework. They are able to show that the VC-dimension of a decision tree with $n$ nodes trained on $d$ continuous features is of order $\mathcal{O}(n \log(n d))$. Unfortunately, the expression discovered by the authors is computationally expensive, so that experiments with larger trees are impractical (the authors provide a simplified version of their expression which works well for trees with a moderate number of leaf nodes on our test system, but anything beyond that would take too long). For our analysis in this paper we are interested in the asymptotic behavior of Decision Trees and Random Forests. Hence, we use the following asymptotic Rademacher complexity:

(5) $\mathfrak{R}_{\text{asym}}(h) = \sqrt{\frac{n \log(n d)}{N}}$
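The capacity measure of Eq. (5) and its averaging over a forest can be sketched as follows; the assumed VC order and the dropped constants and log bases are our simplification, not an exact formula:

```python
import math

def asymptotic_rademacher(n_nodes, d, N):
    """Eq. (5): plug d_VC of order n*log(n*d) into R ~ sqrt(d_VC / N)."""
    d_vc = n_nodes * math.log2(n_nodes * d)   # assumed order, constants dropped
    return math.sqrt(d_vc / N)

def forest_complexity(node_counts, d, N):
    """An equally weighted forest averages the complexities of its trees."""
    return sum(asymptotic_rademacher(n, d, N) for n in node_counts) / len(node_counts)

# Illustrative numbers: trees with 127 nodes, d = 10 features, N = 19,019 examples.
single = asymptotic_rademacher(127, 10, 19_019)
print(single)
print(forest_complexity([127] * 256, 10, 19_019))  # equal trees: same average complexity
```

Note that adding identical trees leaves the average unchanged, which is exactly the first observation below: the complexity of an RF does not grow with the number of trees.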
The previous discussion highlights two things: First, the complexity of an RF does not increase when adding more trees; it averages. Second, the complexity of a tree largely depends on the number of features and the total number of nodes (ignoring any constant factors). Belkin et al. empirically showed in [belkin/etal/2019] that Random Forests exhibit a double-descent curve. Similar to our discussion here, the authors introduce the number of nodes as a measure of complexity for single trees, but then use the total number of nodes in the forest throughout their discussion. While we acknowledge that this is a very intuitive definition of complexity, it is not consistent with our discussion above and the results in learning theory. Hence, we propose to use the average (asymptotic) Rademacher complexity as a capacity measure. We argue that with this adapted definition there is no double-descent occurring in Random Forests, but rather a single descent in which we fit the training data better and better the more capacity is given to the model.
An RF shows a single-descent curve
We validate our hypothesis experimentally. To do so, we train various RF models with different complexities and compare their overfitting behavior on the five datasets depicted in Table 1. By today's standards these datasets are of small to medium size, which allows us to quickly train and evaluate different configurations, but they are large enough to train large trees. The code for our experiments is available at https://github.com/sbuschjaeger/rfdoubledescent.
Dataset  N  C  d 

Adult  32 562  2  108 
Bank  45 211  2  51 
EEG  14 980  2  14 
Magic  19 019  2  10 
Nomao  34 465  2  174 
Table 1: The datasets used in our experiments with $N$ examples, $C$ classes and $d$ features. Missing values were imputed and categorical features one-hot encoded. Each dataset is available at https://archive.ics.uci.edu/ml/datasets.

Our experimental protocol is as follows: Oshiro et al. showed in [14] empirically, on a variety of datasets, that the prediction of an RF stabilizes beyond a certain number of trees and that adding more trees to the ensemble does not yield significantly better results. Hence, we train the 'base' Random Forests with a sufficiently large, fixed number of trees. To control the complexity of a Random Forest, we limit the maximum number of leaf nodes of the individual trees. In all our experiments we perform a 5-fold cross-validation and report the average error across these runs.
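The protocol can be sketched with scikit-learn as follows; the synthetic data, the small forest size and the capacity grid are placeholders of ours for the actual setup on the UCI datasets:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Sweep the per-tree capacity via max_leaf_nodes and record the 5-fold CV error.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # simple synthetic labels

errors = {}
for max_leaves in [2, 8, 32, 128]:
    rf = RandomForestClassifier(n_estimators=32, max_leaf_nodes=max_leaves, random_state=0)
    errors[max_leaves] = 1.0 - cross_val_score(rf, X, y, cv=5).mean()

print(errors)  # CV error per capacity level
```

Plotting these errors over the leaf-node budget (on a log axis) produces curves of the kind shown in Figure 1(a).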


Figure 1(a) shows the results of this experiment. Solid lines show the test error and dashed lines the training error; note the logarithmic scale on the x-axis. It can clearly be seen that for both RF and DT the training error decreases towards 0 as the leaf-node budget grows. On the adult, bank, magic and nomao datasets we see the 'classic' u-shaped overfitting curve for a DT, in which the error first improves and then suddenly increases again. On the EEG dataset the DT shows a single descent in which it never overfits, but its test error is much higher than that of an RF. On all datasets, the DT seems to reach a plateau after a certain number of maximum leaf nodes. Looking at the RF, we see a single-descent curve on all but the adult dataset, in which the RF fits the data better and better with a larger leaf-node budget. Only on the adult dataset are there slight signs of overfitting for the RF.
If there is no double-descent in Random Forests, then why do they perform better than single trees? Interestingly, the above discussion already offers a reasonable explanation of this behavior. First, a Random Forest uses both feature sampling and bootstrapping for training new trees. When done with care (in scikit-learn [16] the implementation may evaluate more than the sampled features if no sufficient split has been found), feature sampling reduces the number of features considered per split, so that the VC-dimension and thereby the Rademacher complexity also become smaller. Second, bootstrap sampling samples data points with replacement. Given a dataset with $N$ observations, there are only $(1 - 1/e)\,N$ unique data points per individual bootstrap sample in the limit. Thus, the effective size of each bootstrap sample reduces to roughly $0.632\,N$, which can lead to smaller trees because the entire training set is smaller and easier to learn due to duplicate observations. Last, and maybe most important, tree induction algorithms such as CART or ID3 are adaptive in the sense that the tree structure is data-dependent. In the worst case, a complete tree is built in which single observations are isolated in the leaf nodes, so that every leaf node contains exactly one example. However, it is impossible to grow a tree beyond isolating single observations because there simply is no data left to split. Consequently, the Rademacher complexity cannot grow beyond this point and is limited by an inherent, data-dependent limit. We summarize these arguments in the following hypothesis.
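The $(1 - 1/e)$ argument about bootstrap samples is easy to verify empirically:

```python
import numpy as np

# A bootstrap sample of size N contains about (1 - 1/e) * N ~ 63.2% unique
# points as N grows, since each point is missed with probability (1 - 1/N)^N.
rng = np.random.default_rng(0)
N = 100_000
sample = rng.integers(0, N, N)                 # bootstrap: N draws with replacement
unique_fraction = len(np.unique(sample)) / N

print(unique_fraction)   # close to 1 - 1/e
print(1 - 1 / np.e)      # ~0.6321
```

Each tree in the forest therefore effectively sees only about two thirds of the distinct training points, which limits how deep the greedy induction can usefully grow it.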
The maximum Rademacher complexity of RF and DT is bounded by the data
Figure 1(b) shows the Rademacher complexity of the DT and the RF for the previous experiment. As one can see, the Rademacher complexity of both models on all datasets steeply increases until both converge towards a maximum, at which they plateau. So indeed, both models have an inherent maximum Rademacher complexity given by the data, as expected. Contrary to the above discussion, however, the RF has a larger Rademacher complexity than the DT on all but the magic dataset. For a better understanding, we look at the average height of the trees produced by the two algorithms in Figure 1(c). Here we can see that the RF, on average, grows larger trees than the DT given the same number of maximum leaf nodes. We hypothesize that due to the feature and bootstrap sampling, suboptimal features are chosen during the splits. Hence, an RF requires more splits in total to achieve a small loss, leading to larger trees with larger Rademacher complexities.
Combining both experiments leads to a mixed explanation of why RFs seem to be so resilient to overfitting: For trees trained via greedy algorithms such as CART, one cannot (freely) overparameterize the final model because its complexity is inherently bounded by the provided data. Even if one allows for more leaf nodes, the algorithm simply cannot make use of more parameters. A similar argument holds for a Random Forest: Adding more trees does not increase the Rademacher complexity, as implied by Theorem 1. Thus, one can add more and more trees without the risk of overfitting. Similarly, increasing the leaf-node budget only increases the Rademacher complexity up to the inherent limit given by the data. Thus, even if one allows for more parameters, an RF cannot make use of them; its Rademacher complexity is inherently bounded by the data. However, as shown in Figures 1(a)–1(c), there does not seem to be a direct, data-independent connection between the maximum number of leaf nodes and this inherent maximum Rademacher complexity.
3 Complexity does not predict the performance of an RF
The above discussion already shows that the Rademacher complexity of a forest does not seem to be an accurate predictor of the generalization error of the ensemble. In this section we further challenge the notion that complexity predicts the performance of a tree ensemble, and construct ensembles with large complexities that do not overfit as well as trees with small complexities that do overfit. For example, we could conceive a very complex tree by simply introducing unnecessary comparisons, e.g. by comparing a feature against infinity, $x_k \le \infty$. This comparison is always true, effectively adding a useless decision node to the tree, and a forest of such trees would have a huge Rademacher complexity while having the same performance as before. Clearly, such trees would neither be in the spirit of DT learning nor really useful in practice.
Hence, we will now look at a small variation of this idea. We study the performance of a DT which approximates the decision boundary of a good, not overfitted RF, and similarly we study an RF which approximates the decision boundary of a bad, overfitted DT. It is conceivable that the DT 'inherits' the positive properties of the RF, and likewise that such an RF inherits the negative properties of the original DT. Algorithm 2 summarizes this approach. We first train a regular reference model, e.g. an RF or a DT, on the given data. Then we sample points along the decision boundary of this model by using augmented copies of the training data. Specifically, we copy the training data several times and add Gaussian noise to the observations in these copies. Then we apply the reference model (RF or DT) to the augmented data and use its predictions as the new labels for fitting the actual model.
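The augmentation step can be sketched as follows; the `augment` helper, its parameters and the toy reference model are illustrative, not the paper's implementation:

```python
import numpy as np

def augment(X, y, reference_predict, n_copies=5, sigma=0.1, seed=0):
    """Copy the data n_copies times, jitter with Gaussian noise, relabel with the reference model."""
    rng = np.random.default_rng(seed)
    X_aug = np.concatenate([X + rng.normal(0.0, sigma, X.shape) for _ in range(n_copies)])
    y_aug = reference_predict(X_aug)   # the reference model's labels, not the true ones
    return X_aug, y_aug

# Toy reference model: labels points by the sign of the first coordinate.
ref = lambda X: (X[:, 0] > 0).astype(int)

X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 0])
X_aug, y_aug = augment(X, y, ref, n_copies=3, sigma=0.05)
print(X_aug.shape, y_aug.shape)   # (6, 2) (6,)
```

The student model is then fit on `(X_aug, y_aug)`, i.e. on the reference model's decision boundary rather than on the original labels.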
An RF that approximates a DT does not overfit


We repeat the above experiments with data-augmentation training. Again, we limit the maximum number of leaf nodes of the individual trees. We train an RF and approximate it with a DT; we call this algorithm DT with Data Augmentation (DADT). Similarly, we train a single DT and approximate it with an RF, denoted as RF with Data Augmentation (DARF). Figure 2(a) shows the error curves for this experiment; again, note the logarithmic scale on the x-axis. First, we see that the training error approaches zero for larger leaf-node budgets for both models, as expected. Second, we see that the decision tree DADT, despite fitting the decision boundary of an RF, shows clear signs of overfitting. Third, and maybe even more remarkable, the forest DARF, trained via data augmentation on the bad, overfitted labels from the DT, still does not overfit but also shows a single descent. To gain a better picture, we can again look at the Rademacher complexities of these two models in Figure 2(b). Similar to before, there is a steep increase for both models. However, the DADT now converges towards a smaller Rademacher complexity than the DARF, which has a much larger Rademacher complexity across all datasets despite DARF having a better test error. The forest does not overfit in a u-shaped curve but shows a single descent, whereas the DT still overfits in a u-shape similar to before. For a better comparison between the individual methods, we combine them in a single plot. Figure 2(c) shows the asymptotic Rademacher complexity over the test and train error of all methods. The dashed lines depict the training error, whereas the solid lines are the test error. Note that some curves stop early because their respective Rademacher complexities are not large enough to fill the entire plot. As one can see, DT and RF have a comparably small maximum Rademacher complexity.
The RF seems to minimize the training error more aggressively and reaches a smaller error with smaller complexities, whereas the DT starts to overfit comparably early. DADT has the smallest Rademacher complexity but also overfits the most on some datasets (e.g. adult or magic). DARF has the largest complexity but does not seem to overfit at all. It slowly converges towards the original RF's performance, except on the adult dataset, where it also does not overfit. Both DT and DADT show a u-shaped curve, whereas RF and DARF show a single descent in most cases. Clearly, the Rademacher complexity fails to explain the performance of the data-augmented trees and forests. We argue that the reason for this lies in the algorithm used to train the trees and not in the model class itself.
4 Negative Correlation Forests
The previous section implies that the model alone does not fully explain its performance. We argue that the learning algorithm also plays a crucial role in the generalization capabilities of the model, and more specifically that the trade-off between bias and diversity is decisive. We show that there is a large region of different diversity levels which are all equally good, and that it does not matter which specific trade-off is achieved as long as it falls into this region.
More formally, the bias-variance decomposition of the mean squared error states that the expected error of a model $f_\theta$ can be decomposed into its bias and variance [13, 9]:

$\mathbb{E}_{\theta}\left[(f_\theta(x) - y)^2\right] = \left(\mathbb{E}_{\theta}[f_\theta(x)] - y\right)^2 + \mathbb{E}_{\theta}\left[\left(f_\theta(x) - \mathbb{E}_{\theta}[f_\theta(x)]\right)^2\right]$

where $\theta$ is a random process (e.g. due to bootstrap sampling) induced by the algorithm that generates $f_\theta$, the first term is the (squared) bias of the algorithm and the second term is its variance. Considering the ensemble $f(x) = \frac{1}{M}\sum_{i=1}^{M} f_i(x)$, the variance term can be further decomposed into a variance and a covariance part (see e.g. [5] for more details):

(6) $\operatorname{Var}(f) = \frac{1}{M}\,\overline{\operatorname{Var}} + \left(1 - \frac{1}{M}\right)\overline{\operatorname{Cov}}$

where we dropped the arguments for better readability, $\overline{\operatorname{Var}}$ is the average variance of the individual predictions and $\overline{\operatorname{Cov}}$ is the average covariance of the predictions across the ensemble.
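The decomposition in Eq. (6) can be checked numerically; the correlated Gaussian "predictions" below are synthetic stand-ins for ensemble members:

```python
import numpy as np

# For an equally weighted ensemble f = (1/M) sum_i f_i:
#   Var(f) = (1/M) * avg. Var(f_i) + (1 - 1/M) * avg. Cov(f_i, f_j), i != j.
rng = np.random.default_rng(0)
M, T = 5, 50_000                                      # members, Monte-Carlo draws
preds = rng.normal(size=(M, T)) + rng.normal(size=(1, T))  # shared noise -> correlation

ens_var = preds.mean(axis=0).var(ddof=1)              # Var(f)
C = np.cov(preds)                                     # (M, M) sample covariance matrix
avg_var = np.mean(np.diag(C))                         # average member variance
avg_cov = (C.sum() - np.trace(C)) / (M * (M - 1))     # average pairwise covariance

rhs = avg_var / M + (1 - 1 / M) * avg_cov
print(ens_var, rhs)   # the two sides agree up to floating-point error
```

For fixed member variances, the ensemble variance shrinks as the covariance becomes smaller, which is exactly the lever that diversity-promoting training exploits.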
This decomposition does not directly relate the training error of a model to its generalization capabilities, but it shows how the individual training and testing losses are structured [6]. Although suspected for some time and exploited in numerous ensembling algorithms, the exact connection between the diversity of an ensemble and its generalization error was only established relatively recently. Germain et al. showed in [10] that the diversity of an ensemble and its generalization error can be connected in a PAC-style bound, shown in Theorem 2.
Theorem 2 (PAC-Style C-Bound [10]).
Let $\mathcal{H}$ denote a set of base classifiers and let $f(x) = \frac{1}{M}\sum_{i=1}^{M} h_i(x)$ be the ensemble. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample of size $N$ drawn i.i.d. according to $\mathcal{D}$, the following inequality holds:

$\mathcal{E}(f) \le 1 - \frac{\left(1 - 2 \cdot \frac{1}{M}\sum_{i=1}^{M} \widehat{\mathcal{E}}(h_i)\right)^2}{1 - 2\, \widehat{\operatorname{Cov}}_S} + C(M, N, \delta)$

where $C(M, N, \delta)$ tends to $0$ for $N \to \infty$ and any fixed $\delta > 0$, and where $\widehat{\operatorname{Cov}}_S$ is the covariance of the ensemble evaluated on the sample $S$.
Intuitively, this result shows that an ensemble of powerful learners with a small bias that sometimes disagree will be better than an ensemble with a comparable bias in which all models agree.
While the original RF algorithm produces accurate ensembles, its diversity is implicitly determined by the bootstrap and feature samples. Hence, it is difficult to precisely control its diversity. For direct control over the diversity we now introduce the Negative Correlation Forest. The bound in Theorem 2 is stated for the 0-1 loss, which makes its direct minimization difficult, as noted in [10]. Luckily, minimizing the bias-variance decomposition of the MSE is much more approachable and has already been studied in the form of Negative Correlation Learning (NCL). NCL offers fine-grained control over the diversity of an ensemble of neural networks by minimizing the following objective (see [5, 19, 6] for more details):

(7) $\frac{1}{M} \sum_{i=1}^{M} \left(f_i(x) - y\right)^2 - \frac{\lambda}{2M} \sum_{i=1}^{M} \left(f_i(x) - f(x)\right)^T D \left(f_i(x) - f(x)\right)$

where $f(x) = \frac{1}{M}\sum_{i=1}^{M} f_i(x)$, $D = 2I$ is the identity matrix with $2$ on the main diagonal and $\lambda$ is the regularization strength. For $\lambda = 0$ this trains the classifiers independently and no further diversity among the ensemble members is enforced; for $\lambda > 0$ more diversity is enforced during training; and for $\lambda < 0$ diversity is discouraged. While a large diversity helps to reduce the overall error, it also implies that some trees must be wrong on some data points, which means that their respective bias increases again. Finding a good balance between bias and diversity is therefore crucial for good performance.
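A minimal sketch of such an NCL objective, written here in scalar per-member form; the `ncl_loss` helper and the toy values are ours:

```python
import numpy as np

def ncl_loss(preds, y, lam):
    """Average member MSE minus a lambda-weighted diversity (disagreement) term.

    preds: (M, N) member predictions; y: (N,) targets.
    lam = 0 trains members independently; lam > 0 rewards disagreement with the mean.
    """
    fbar = preds.mean(axis=0)                    # ensemble prediction
    member_err = ((preds - y) ** 2).mean()       # average individual squared error
    diversity = ((preds - fbar) ** 2).mean()     # average squared deviation from the mean
    return member_err - lam * diversity

preds = np.array([[0.8, 0.2],
                  [1.2, -0.2]])                  # two members, two examples
y = np.array([1.0, 0.0])

print(ncl_loss(preds, y, lam=0.0))  # plain average MSE of the members
print(ncl_loss(preds, y, lam=1.0))  # diversity term subtracted
```

In this toy case the two members err symmetrically around the targets, so their mean is exact: with full diversity credit (`lam=1.0`) the objective drops to the ensemble's own MSE of zero.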
NCL was first introduced to train ensembles of neural networks because these can easily be optimized via gradient-based algorithms minimizing Eq. 7. We adapt this approach to RF by training an initial RF with Algorithm 1, which is then refined by optimizing the NCL objective. Recall that DTs use a series of axis-aligned splits of the form $x_k \le t$, where $k$ is a feature index and $t$ is a threshold determining the leaf nodes, and that each leaf node has a (constant) prediction associated with it. Let $\theta_i$ be the parameter vector of tree $i$ (e.g. containing split values, feature indices and leaf predictions) and let $\theta = (\theta_1, \ldots, \theta_M)$ be the parameter vector of the entire ensemble. Then our goal is to solve

(8) $\arg\min_{\theta} \frac{1}{M} \sum_{i=1}^{M} \left(f_{\theta_i}(x) - y\right)^2 - \frac{\lambda}{2M} \sum_{i=1}^{M} \left(f_{\theta_i}(x) - f_\theta(x)\right)^T D \left(f_{\theta_i}(x) - f_\theta(x)\right)$

for a given trade-off $\lambda$. We propose to minimize this objective via stochastic gradient descent (SGD). SGD is an iterative algorithm which, in each iteration $t$, takes a small step in the negative direction of an estimate of the true gradient:

(9) $\theta^{(t+1)} = \theta^{(t)} - \alpha\, g_{\mathcal{B}}\!\left(\theta^{(t)}\right)$

where

(10) $g_{\mathcal{B}}(\theta) = \frac{1}{|\mathcal{B}|} \sum_{(x, y) \in \mathcal{B}} \nabla_\theta\, \ell\!\left(f_\theta(x), y\right)$

is the gradient of the objective $\ell$ with respect to $\theta$ computed on a mini-batch $\mathcal{B}$, and $\alpha$ is the step size. Unfortunately, the axis-aligned splits of a DT are not differentiable, so it is difficult to refine them further with gradient-based approaches. However, the leaf predictions are simple constants that can easily be updated via SGD. Formally, we keep the split structure fixed and use $h_i(x) = \sum_{l=1}^{L_i} \hat{y}_{i,l}\, s_{i,l}(x)$, leading to

(11) $\nabla_{\hat{y}_{i,l}}\, h_i(x) = s_{i,l}(x)$

so that only the leaf predictions $\hat{y}_{i,l}$ of each tree are refined.
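This refinement can be sketched as follows under a plain MSE objective (the NCL diversity term is omitted for brevity; all names are illustrative). Because the split structure is frozen, the tree is linear in its leaf predictions, and the gradient of Eq. (11) is just the leaf-membership indicator:

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 64, 4                                  # examples, leaves of one tree
routes = np.eye(L)[rng.integers(0, L, N)]     # (N, L) one-hot leaf membership, i.e. s_l(x)
y = rng.normal(size=N)
leaf_values = np.zeros(L)                     # the only trainable parameters

mse_before = float(((routes @ leaf_values - y) ** 2).mean())

lr = 0.1
for _ in range(500):
    batch = rng.integers(0, N, 16)            # mini-batch as in Eq. (9)/(10)
    residual = routes[batch] @ leaf_values - y[batch]
    grad = 2 * routes[batch].T @ residual / len(batch)   # chain rule through s_l(x)
    leaf_values -= lr * grad

mse_after = float(((routes @ leaf_values - y) ** 2).mean())
print(mse_before, mse_after)                  # training MSE decreases
```

Each leaf value drifts towards the mean target of the examples routed into it, which for plain MSE recovers the usual leaf estimate; the NCL term additionally couples the trees through the ensemble mean.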
Algorithm 3 summarizes the NCForest algorithm. First, an initial RF is trained with Algorithm 1 using a maximum number of leaf nodes and randomly sampled features. Then, the leaf predictions are extracted from the forest and refined via SGD. Given our previous experiments, the learning algorithm seems to play a crucial role in the performance of a model, and Theorem 2 implies that a diverse ensemble should generalize better. However, too much diversity will also likely hurt the performance of the forest because the bias then increases.
There is an optimal tradeoff between bias and diversity
Again, we validate our hypothesis experimentally. We train an initial RF with a fixed number of trees and a maximum number of leaf nodes. Due to the bootstrap sampling and the feature sampling, this initial RF already has some diversity. Hence, we use negative $\lambda$ values to de-emphasize diversity and positive $\lambda$ values to emphasize diversity. Last, we noticed that beyond a certain $\lambda$ there is a steep increase in the diversity because it starts to dominate the optimization (c.f. [5], which reports a similar effect for neural networks). Hence, we vary $\lambda$ in this range and minimize the NCL objective over several epochs using the Adam optimizer implemented in PyTorch [15]. We also experimented with more leaf nodes, different $\lambda$ values and more epochs, but the test error would not improve further with different parameters.

As seen in Figure 3(a) and Figure 3(b), our NCForest method indeed allows for fine control over the diversity of the ensemble. Increasing $\lambda$ first leads to a larger bias and more diversity in the ensemble, while the overall ensemble loss remains nearly constant, as expected. Increasing $\lambda$ further leads to a steep increase in both, where the ensemble loss also increases because the diversity starts to dominate the optimization.
Looking at Figure 3(c), we can see the average training and testing error of the trees in the ensemble as well as the test and train error of the entire ensemble. Again, dashed lines depict the train error and solid lines the test error. Moreover, we mark the performance of a DT and an RF for better comparability (by definition, a single DT does not have any diversity; for presentational purposes we assign a small nonzero diversity to it in order to not break the logarithmic axis). We can see two effects here. First, there seems to be a large region in which the diversity does not affect the ensemble error, but only the individual errors of the trees. In this region the performance of each individual tree changes, but the overall ensemble error remains nearly constant. The corresponding plots are akin to a bathtub: if the diversity is too small or too large, then performance suffers; only in the middle between both extremes do we find a good trade-off. For example, on the EEG dataset an intermediate range of diversities gives a near-optimal trade-off with similar performance, even though the average test error of the individual trees steeply increases for larger diversities. Second, we find that an RF seems to achieve a good balance between both quantities with its default parameters, although minor improvements are possible, e.g. on the adult or the EEG dataset. We conclude that a larger diversity does not necessarily result in a better forest, but can hurt the performance at some point. Likewise, too little diversity also leads to bad performance, and a good balance between both quantities must be achieved. Interestingly, there seems to be a comparably large region of similar performances where the exact trade-off between bias and diversity does not matter. Hence, the diversity does have some effect on the final model, but its effect might be overrated once this region is found by the algorithm.


5 Conclusion
In this paper we revisited explanations for the success of Random Forests and showed that for most of these explanations an experiment can be constructed in which they do not hold or only offer little insight. First, given a proper definition of the complexity of a forest, an RF does not exhibit a double descent, but rather a single descent in which it simply fits the data better with increasing complexity. Second, a DT shows the 'classic' u-shaped curve in which it starts to overfit at some point. We continued to show that an RF does not 'inherit' these bad properties of a DT and retains a single-descent curve even if the RF is fitted on a ground truth from an overfitted DT. Similarly, a DT does not 'inherit' the good properties of an RF, but keeps its u-shaped overfitting curve when fitted on the ground truth of a good, not overfitted RF. In all these experiments the Rademacher complexity did not accurately predict the performance of each classifier. In fact, DTs trained via data augmentation had a smaller complexity than their RF counterparts but a much worse test error. Hence, we argue that the training algorithm plays a crucial role in the performance of a model. We introduced the Negative Correlation Forest (NCForest), which refines the leaf nodes of a forest to explicitly control the diversity among the trees. We hypothesized that an ensemble of diverse trees with sufficiently small bias should have a better generalization error than a homogeneous forest. And indeed, there seems to be a bathtub-like relationship between the diversity and the test error: too little or too much diversity hurts the performance, whereas the right balance of both quantities maximizes it. Luckily, there seems to be a comparably large region of different diversities which achieve this trade-off, and an RF finds this region in most cases, although some improvements are possible when the trade-off is further refined.
Acknowledgments
Part of the work on this paper has been supported by the Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", DFG project number 124020371, SFB project A1, http://sfb876.tu-dortmund.de. Part of the work on this research has been funded by the Federal Ministry of Education and Research of Germany as part of the competence center for machine learning ML2R (01—18038A), https://www.ml2r.de/. We thank Mirko Bunse and Lukas Pfahler for their helpful comments on this paper.
References
 [1] G. Biau and E. Scornet (2016) A random forest guided tour. Test 25 (2), pp. 197–227.
 [2] G. Biau (2012) Analysis of a random forests model. Journal of Machine Learning Research 13 (Apr), pp. 1063–1095.
 [3] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone (1984) Classification and regression trees. Taylor & Francis. ISBN 9780412048418.
 [4] L. Breiman (2000) Some infinity theory for predictor ensembles. Technical Report 579, Statistics Dept., UCB.
 [5] G. Brown, J. L. Wyatt and P. Tiňo (2005) Managing diversity in regression ensembles. Journal of Machine Learning Research 6, pp. 1621–1650.
 [6] S. Buschjäger, L. Pfahler and K. Morik (2020) Generalized negative correlation learning for deep ensembling. arXiv preprint arXiv:2011.02952.
 [7] C. Cortes, M. Mohri and U. Syed (2014) Deep boosting. In International Conference on Machine Learning, pp. 1179–1187.
 [8] M. Denil, D. Matheson and N. de Freitas (2014) Narrowing the gap: random forests in theory and in practice. In International Conference on Machine Learning (ICML).
 [9] S. Geman, E. Bienenstock and R. Doursat (1992) Neural networks and the bias/variance dilemma. Neural Computation 4 (1).
 [10] P. Germain, A. Lacasse, F. Laviolette, M. Marchand and J.-F. Roy (2015) Risk bounds for the majority vote: from a PAC-Bayesian analysis to a learning algorithm. Journal of Machine Learning Research 16 (26), pp. 787–860.
 [11] V. Koltchinskii and D. Panchenko (2002) Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics 30 (1), pp. 1–50.
 [12] Y. Mansour (1997) Pessimistic decision tree pruning based on tree size. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 195–201.
 [13] H. Markowitz (1952) The utility of wealth. Journal of Political Economy.
 [14] T. M. Oshiro, P. S. Perez and J. A. Baranauskas (2012) How many trees in a random forest? In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 154–168.
 [15] A. Paszke et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems.
 [16] F. Pedregosa et al. (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
 [17] J. R. Quinlan (1986) Induction of decision trees. Machine Learning 1 (1), pp. 81–106.
 [18] S. Shalev-Shwartz and S. Ben-David (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press.
 [19] A. M. Webb et al. (2019) Joint training of neural network ensembles. arXiv preprint arXiv:1902.04422.
 [20] C. Zhang, S. Bengio, M. Hardt, B. Recht and O. Vinyals (2021) Understanding deep learning (still) requires rethinking generalization. Communications of the ACM 64 (3), pp. 107–115.