There is no Double-Descent in Random Forests

11/08/2021
by   Sebastian Buschjäger, et al.
TU Dortmund

Random Forests (RFs) are among the state-of-the-art in machine learning and offer excellent performance with nearly zero parameter tuning. Remarkably, RFs seem to be impervious to overfitting even though their basic building blocks are well-known to overfit. Recently, a broadly received study argued that a RF exhibits a so-called double-descent curve: First, the model overfits the data in a u-shaped curve and then, once a certain model complexity is reached, it suddenly improves its performance again. In this paper, we challenge the notion that model capacity is the correct tool to explain the success of RF and argue that the algorithm which trains the model plays a more important role than previously thought. We show that a RF does not exhibit a double-descent curve but rather has a single descent. Hence, it does not overfit in the classic sense. We further present a RF variation that also does not overfit although its decision boundary approximates that of an overfitted DT. Similarly, we show that a DT which approximates the decision boundary of a RF will still overfit. Last, we study the diversity of an ensemble as a tool to estimate its performance. To do so, we introduce the Negative Correlation Forest (NCForest), which allows for precise control over the diversity in the ensemble. We show that the diversity and the bias indeed have a crucial impact on the performance of the RF: Having too low a diversity collapses the performance of the RF to that of a single tree, whereas having too much diversity means that most trees do not produce correct outputs anymore. However, in between these two extremes we find a large range of different trade-offs with roughly equal performance. Hence, the specific trade-off between bias and diversity does not matter as long as the algorithm reaches this good trade-off regime.


1 Theoretical Background

We consider a supervised learning setting in which we assume that training and test points are drawn i.i.d. according to some distribution $\mathcal{D}$ over the input space $\mathcal{X}$ and the labels $\mathcal{Y}$. For training, we are given a labeled sample $S = \{(x_1, y_1), \dots, (x_N, y_N)\}$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$ is a $d$-dimensional feature vector and $y_i \in \mathcal{Y}$ is the corresponding target vector. For regression problems we have $\mathcal{Y} = \mathbb{R}$. For binary classification we use $\mathcal{Y} = \{-1, +1\}$, and for multiclass problems with $C$ classes we encode each label as a one-hot vector $y \in \{0, 1\}^C$ which contains a ‘1’ at coordinate $c$ for label $c$. In this paper we are interested in the overfitting behavior of ensembles and, more specifically, of Random Forests. In this context we refer to overfitting as the u-shaped curve depicted in Figure 1, in which the test error increases again after a certain complexity is reached, and not to the fact that in many practical applications there is a gap between the test and the training error. We assume a convex combination of $M$ classifiers $h_i \in \mathcal{H}$, each scaled by its respective weight $w_i$:

$$f(x) = \sum_{i=1}^{M} w_i h_i(x) \quad \text{with} \quad w_i \ge 0, \;\; \sum_{i=1}^{M} w_i = 1.$$

For concreteness, we assume that $\mathcal{H}$ is the model class of axis-aligned decision trees and $f$ is a random forest. An axis-aligned DT partitions the input space into increasingly smaller $d$-dimensional hypercubes called leaves and uses independent predictions for each leaf node in the tree. Formally, we represent a tree as a directed graph with a root node and two children per node. Each node in the tree belongs to a sub-cube, and the children of each node recursively partition the region of their parent node into non-overlapping smaller cubes. Inner nodes perform an axis-aligned split $x_k \le t$, where $k$ is a feature index and $t$ is a split threshold. Each node is associated with a split function $s(x)$ that is ‘1’ if $x$ belongs to the hypercube of that node and ‘0’ if not. To compute $k$ and $t$, the Gini score (CART algorithm [3]) or the information gain (ID3 algorithm [17]) is minimized, both of which measure the impurity of a split. The induction starts with the root node and the entire dataset. Then the optimal split is computed and the training data is split into a left part $\{(x, y) \in S \mid x_k \le t\}$ and a right part $\{(x, y) \in S \mid x_k > t\}$. This splitting is repeated recursively until either a node is ‘pure’ (it contains only examples from one class) or another abort criterion, e.g. a maximum number of leaf nodes, is reached. The predictions in the leaf nodes are computed by estimating the class probabilities of all observations in that specific leaf. To classify a new example, one starts at the root node and traverses the tree according to the comparison $x_k \le t$ in each inner node. Let $L$ be the total number of leaf nodes in the tree and let $s_l(x) \in \{0, 1\}$ indicate whether $x$ reaches leaf $l$, depending on the outcomes of these comparisons; then the prediction function of a tree is given by

$$h(x) = \sum_{l=1}^{L} \hat{y}_l \, s_l(x)$$

where $\hat{y}_l$ is the (constant) prediction value per leaf and $s_l$ are the split functions of the individual hypercubes.

A Random Forest extends a single DT by training a set of $M$ axis-aligned decision trees on bootstrap samples of the training data, using randomly sampled features for each split, and weighting the trees equally. Algorithm 1 summarizes the RF algorithm.

Algorithm 1: Random Forest algorithm (for each of the $M$ trees: draw a bootstrap sample and greedily grow the tree, choosing each split from a random subset of features, until the maximum number of leaf nodes is reached).
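Since the original pseudocode did not survive extraction, the following is a minimal sketch of the training loop described above, assuming scikit-learn's DecisionTreeClassifier as the base learner; the function names (train_random_forest, predict_forest) and defaults are illustrative, not the paper's notation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, n_trees=256, max_leaf_nodes=None, seed=0):
    """Sketch of bagged training of axis-aligned trees (cf. Algorithm 1)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample: draw n_samples indices with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        tree = DecisionTreeClassifier(
            max_leaf_nodes=max_leaf_nodes,
            max_features="sqrt",  # consider a random subset of features per split
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    # Equally weighted convex combination of the per-tree class probabilities.
    # Assumes every class appears in every bootstrap sample.
    probs = np.mean([t.predict_proba(X) for t in trees], axis=0)
    return probs.argmax(axis=1)
```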

2 There is no double-descent in RF

In statistical learning theory the generalization error of a model $f$ is bounded in terms of its empirical error on the sample $S$, given some loss function $\ell$ and a complexity measure of the trained model. For concreteness, consider a binary classification problem with $\mathcal{Y} = \{-1, +1\}$, let $f$ be a prediction model, and let $\rho \ge 0$ be the classification margin. We denote the binary classification error of $f$ with respect to $\mathcal{D}$ by $L_{\mathcal{D}}(f)$ and the empirical margin error of $f$ with respect to $\rho$ on $S$ by $L_{S,\rho}(f)$:

$$L_{\mathcal{D}}(f) = \Pr_{(x, y) \sim \mathcal{D}}\big[\, y f(x) \le 0 \,\big] \quad (1)$$

$$L_{S,\rho}(f) = \frac{1}{N} \sum_{(x_i, y_i) \in S} \mathbb{1}\big[\, y_i f(x_i) \le \rho \,\big] \quad (2)$$
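As a concrete illustration of Eq. (2), a minimal helper (an illustrative sketch, not taken from the paper) simply counts the fraction of examples whose margin $y f(x)$ falls below $\rho$:

```python
import numpy as np

def empirical_margin_error(f_values, y, rho=0.0):
    """Eq. (2): fraction of examples with margin y * f(x) <= rho."""
    return float(np.mean(y * f_values <= rho))

# Predictions in [-1, 1] for labels in {-1, +1}.
y = np.array([+1, +1, -1, -1])
f_values = np.array([0.9, 0.1, -0.8, 0.2])
print(empirical_margin_error(f_values, y, rho=0.0))  # 0.25 (only the last example is wrong)
print(empirical_margin_error(f_values, y, rho=0.5))  # 0.5  (the small-margin example is also counted)
```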

Intuitively, a large margin indicates how convinced we are in our predictions. For example, consider two predictions, say $f(x_1) = 0.1$ and $f(x_2) = 0.9$, which can both be considered to belong to class ‘+1’. Using $\rho = 0.5$ means that we only accept predictions greater than $0.5$ as ‘+1’, so that the prediction for $x_1$ will be regarded as wrong in any case (regardless of the actual label $y_1$). Using $\rho = 0$ indicates that we do not care whether the prediction is $0.1$ or $0.9$ – both are considered equally to belong to class ‘+1’. The following theorem bounds the generalization error of a convex combination of classifiers in terms of their individual Rademacher complexities.

Theorem 1 (Convex combination of classifiers [7]).

Let $\mathcal{H}_1, \dots, \mathcal{H}_M$ denote sets of base classifiers and let $f = \sum_{i=1}^{M} w_i h_i$ with $h_i \in \mathcal{H}_i$, $w_i \ge 0$ and $\sum_{i} w_i = 1$ be a convex combination of classifiers. Furthermore, let $\mathcal{R}(\mathcal{H}_i)$ be the Rademacher complexity of the $i$-th base classifier's hypothesis class. Then, for a fixed margin $\rho > 0$ and for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample $S$ of size $N$ drawn i.i.d. according to $\mathcal{D}$, the following inequality holds:

$$L_{\mathcal{D}}(f) \le L_{S,\rho}(f) + \frac{4}{\rho} \sum_{i=1}^{M} w_i \, \mathcal{R}(\mathcal{H}_i) + C(N, M, \rho, \delta)$$

where $C(N, M, \rho, \delta)$ is a constant depending on $N$, $M$, $\rho$ and $\delta$ which tends to $0$ for $N \to \infty$ for any fixed $\rho > 0$ and any $\delta > 0$.

Theorem 1 offers two interesting insights: First, the Rademacher complexity of a convex combination of classifiers does not increase with the number of classifiers and thus an ensemble is not more likely to overfit than its individual base learners. Second, the individual Rademacher complexities of each base learner are scaled by their respective weights. For $\mathcal{H}_1 = \dots = \mathcal{H}_M$, where all classifiers have the same complexity, this bound recovers the well-known result from [11].

The key question now becomes how to compute the Rademacher complexity of the trees inside the forest. It is well known that the Rademacher complexity is related to the VC-dimension, e.g. via

$$\mathcal{R}(\mathcal{H}) \le \sqrt{\frac{2\, \mathrm{VC}(\mathcal{H}) \log\!\big(e N / \mathrm{VC}(\mathcal{H})\big)}{N}} \quad (3)$$

Interestingly, the exact VC-dimension of decision trees is unknown. Aslan et al. performed an exhaustive search to compute the VC dimension of trees up to a small fixed depth (asian/etal/2009), but so far no general formula has been discovered. However, there exist some useful bounds. A decision tree with $n$ nodes trained on $d$ binary features has a VC-dimension of order at most [12]:

$$\mathrm{VC}(h) = \mathcal{O}\big(n \log_2 d\big) \quad (4)$$

Leboeuf et al. extend this bound to continuous features in leboeuf/etal/2020 by introducing the concept of partition functions into the VC framework. They are able to show that the VC-dimension of a decision tree trained on continuous features is of order $\mathcal{O}\big(L \log(L d)\big)$, where $L$ is the number of leaves. Unfortunately, the exact expression derived by the authors is computationally expensive, so that experiments with larger trees are impractical (the authors provide a simplified version of their expression which works well for moderately sized trees on our test system, but anything beyond that would take too long). For our analysis in this paper we are interested in the asymptotic behavior of Decision Trees and Random Forests. Hence we use the following asymptotic Rademacher complexity:

$$\mathcal{R}(h) := \sqrt{\frac{n_h \log(n_h \, d)}{N}} \quad (5)$$

where $n_h$ denotes the number of nodes of the tree $h$, $d$ the number of features and $N$ the sample size.

The previous discussion highlights two things: First, the complexity of a RF does not increase when adding more trees; it is the average of the individual tree complexities. Second, the complexity of a tree largely depends on the number of features and the total number of nodes (ignoring constant and logarithmic factors). Belkin et al. empirically showed in belkin/etal/2019 that Random Forests exhibit a double-descent curve. Similar to our discussion here, the authors introduce the number of nodes as a measure of complexity for single trees, but then use the total number of nodes in the forest throughout their discussion. While we acknowledge that this is a very intuitive definition of complexity, it is not consistent with our above discussion and the results in learning theory. Hence, we propose to use the average (asymptotic) Rademacher complexity as a capacity measure. We argue that, with this adapted definition, there is no double descent occurring in Random Forests but rather a single descent in which we fit the training data better and better the more capacity is given to the model.
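Under the reconstructed form of Eq. (5), the average forest complexity used below can be estimated directly from a trained scikit-learn forest; constants and exact log factors are ignored, so the value is only meaningful for relative comparisons, and the function name is an illustrative choice.

```python
import numpy as np

def avg_asymptotic_rademacher(forest, n_samples, n_features):
    """Average the per-tree asymptotic complexity of Eq. (5) over a fitted forest."""
    per_tree = []
    for tree in forest.estimators_:
        n_nodes = tree.tree_.node_count
        per_tree.append(np.sqrt(n_nodes * np.log(n_nodes * n_features) / n_samples))
    return float(np.mean(per_tree))
```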

A RF shows a single descent curve

We validate our hypothesis experimentally. To do so, we train various RF models with different complexities and compare their overfitting behavior on the five datasets depicted in Table 1. By today's standards these datasets are of small to medium size, which allows us to quickly train and evaluate different configurations, but they are large enough to train large trees. The code for our experiments is available at https://github.com/sbuschjaeger/rf-double-descent.

Dataset   N        C   d
Adult     32 562   2   108
Bank      45 211   2   51
EEG       14 980   2   14
Magic     19 019   2   10
Nomao     34 465   2   174
Table 1: Datasets used for our experiments (N: number of examples, C: number of classes, d: number of features). We performed minimal pre-processing on each dataset, removing instances which contain NaN values and computing a one-hot encoding for categorical features. Each dataset is available under https://archive.ics.uci.edu/ml/datasets.

Our experimental protocol is as follows: Oshiro et al. showed in [14] empirically on a variety of datasets that the prediction of a RF stabilizes once a moderate number of trees is reached and that adding more trees to the ensemble does not yield significantly better results. Hence, we train the ‘base’ Random Forests with a fixed number of trees in this regime. To control the complexity of a Random Forest, we limit the maximum number of leaf nodes $n_l$ of the individual trees. In all our experiments we perform a 5-fold cross validation and report the average error across these runs.
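A minimal sketch of this protocol, assuming scikit-learn models; the grid of maximum leaf node values and the number of trees below are illustrative, since the paper's exact settings are not recoverable from this copy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

LEAF_GRID = [16, 64, 256, 1024, 4096]  # illustrative values for n_l

def run_protocol(X, y, n_trees=256):
    """Train DT and RF for each n_l with 5-fold CV and record train/test errors."""
    results = {}
    for n_l in LEAF_GRID:
        for name, model in [
            ("DT", DecisionTreeClassifier(max_leaf_nodes=n_l)),
            ("RF", RandomForestClassifier(n_estimators=n_trees, max_leaf_nodes=n_l)),
        ]:
            cv = cross_validate(model, X, y, cv=5, return_train_score=True)
            results[(name, n_l)] = {
                "train_error": 1.0 - cv["train_score"].mean(),
                "test_error": 1.0 - cv["test_score"].mean(),
            }
    return results
```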

Figure 2: Test and training error of RF and DT (first column), the average Rademacher complexity (second column) and the average height of the trees (third column) over the maximum number of leaf nodes $n_l$, for the Adult, Bank, EEG, Magic and Nomao datasets. Each row depicts one dataset. Results are averaged over a 5-fold cross validation. Solid lines are the test error and dashed lines are the training error. Best viewed in color.

Figure 2(a) shows the results of this experiment. Solid lines show the test error and dashed lines the training error. Note the logarithmic scale on the x-axis. It can be clearly seen that for both RF and DT the training error decreases towards 0 for larger $n_l$. On the adult, bank, magic and nomao datasets we see the ‘classic’ u-shaped overfitting curve for a DT, in which the error first improves and then suddenly increases again. On the EEG dataset the DT shows a single descent in which it never overfits, but its test error is much higher than that of a RF. On all datasets, the DT seems to reach a plateau after a certain number of maximum leaf nodes. Looking at the RF we see a single-descent curve on all but the adult dataset, in which the RF fits the data better and better with larger $n_l$. Only on the adult dataset are there signs of slight overfitting for the RF.

When there is no double-descent in Random Forests, then why do they perform better than single trees? Interestingly, the above discussion already offers a reasonable explanation of this behavior. First, a Random Forest uses both feature sampling and bootstrapping for training new trees. When done with care (in scikit-learn [16] the implementation may evaluate more than the sampled features if no sufficient split has been found), feature sampling reduces the number of features considered per split, which in turn reduces the VC-dimension and thereby the Rademacher complexity of each tree. Second, bootstrap sampling samples data points with replacement. Given a dataset with $N$ observations, a bootstrap sample contains only about $(1 - 1/e) \cdot N \approx 0.632 \cdot N$ unique data points in the limit. Thus, the effective size of each bootstrap sample reduces to roughly $63\%$ of the original data, which can lead to smaller trees because the effective training set is smaller and easier to learn due to duplicate observations. Last, and maybe most important, tree induction algorithms such as CART or ID3 are adaptive in the sense that the tree structure is data-dependent. In the worst case, a complete tree is built in which single observations are isolated in the leaf nodes, so that every leaf node contains exactly one example. However, it is impossible to grow a tree beyond isolating single observations, because there simply is no data left to split. Consequently, the Rademacher complexity cannot grow beyond this point and is limited by an inherent, data-dependent limit. We summarize these arguments into the following hypothesis.

The maximum Rademacher complexity of RF and DT is bounded by the data

Figure 2(b) shows the Rademacher complexity of the DT and the RF for the previous experiment. As one can see, the Rademacher complexity of both models steeply increases on all datasets until it converges against a maximum and then plateaus. So indeed, both models have an inherent maximum Rademacher complexity given by the data, as expected. Contrary to the above discussion, however, the RF has a larger Rademacher complexity than the DT on all but the magic dataset. For a better understanding we look at the average height of the trees produced by the two algorithms in Figure 2(c). Here we can see that the RF, on average, has larger trees than the DT given the same maximum number of leaf nodes. We hypothesize that due to the feature and bootstrap sampling, sub-optimal features are chosen during the splits. Hence, a RF requires more splits in total to achieve a small loss, leading to larger trees with larger Rademacher complexities.

Combining both experiments leads to a mixed explanation of why RF seems to be so resilient to overfitting: For trees trained via greedy algorithms such as CART, one cannot (freely) over-parameterize the final model because its complexity is inherently bounded by the provided data. Even if one allows for more leaf nodes, the algorithm simply cannot make use of the additional parameters. A similar argument holds for a Random Forest: Adding more trees does not increase the Rademacher complexity, as implied by Theorem 1. Thus one can add more and more trees without the risk of overfitting. Similarly, increasing $n_l$ only increases the Rademacher complexity up to the inherent limit given by the data. Thus, even if one allows for more parameters, a RF cannot make use of them; its Rademacher complexity is inherently bounded by the data. However, as shown in Figures 2(a)-2(c), there does not seem to be a direct, data-independent connection between the maximum number of leaf nodes and this inherent maximum Rademacher complexity.

3 Complexity does not predict the performance of a RF

The above discussion already shows that the Rademacher complexity of a forest does not seem to be an accurate predictor of the generalization error of the ensemble. In this section we further challenge the notion that complexity predicts the performance of a tree ensemble and construct ensembles with large complexities that do not overfit as well as trees with small complexities that do overfit. For example, we could conceive a very complex tree by simply introducing unnecessary comparisons, e.g. by comparing against infinity ($x_k \le \infty$). Such a comparison is always true and effectively adds a useless decision node to the tree, and a forest of such trees would have a huge Rademacher complexity while having the same performance as before. Clearly, such trees would neither be in the spirit of DT learning nor really useful in practice.

Hence, we will now look at a small variation of this idea. We study the performance of a DT which approximates the decision boundary of a good, not overfitted RF, and similarly we study a RF which approximates the decision boundary of a bad, overfitted DT. It is conceivable that the DT ‘inherits’ the positive properties of the RF and, likewise, that such a RF has all the negative properties of the original DT. Algorithm 2 summarizes this approach. We first train a regular reference model, e.g. a RF or a DT, on the given data. Then, we sample points along the decision boundary of this model by using augmented copies of the training data: we copy the training data several times and add Gaussian noise to the observations in these copies. Finally, we apply the reference model (RF or DT) to the augmented data and use its predictions as the new labels for fitting the actual model.

Algorithm 2: Training with Data Augmentation (train a reference DT or RF on the original data; generate augmented training data by copying the observations and adding Gaussian noise; apply the original model to label the augmented data; finally, train the actual RF or DT on the relabeled data).
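A minimal sketch of Algorithm 2, assuming scikit-learn models; the function name as well as the number of copies and the noise level are illustrative, not the paper's settings.

```python
import numpy as np

def fit_with_data_augmentation(reference, student, X, y, n_copies=5, sigma=0.1, seed=0):
    """Fit `student` on noisy copies of X that are labeled by the fitted `reference`."""
    rng = np.random.default_rng(seed)
    reference.fit(X, y)
    X_aug = [X] + [X + rng.normal(scale=sigma, size=X.shape) for _ in range(n_copies)]
    X_aug = np.vstack(X_aug)
    y_aug = reference.predict(X_aug)  # the reference model provides the new labels
    student.fit(X_aug, y_aug)
    return student

# DA-DT: a single tree that mimics the decision boundary of a forest, e.g.
#   fit_with_data_augmentation(RandomForestClassifier(), DecisionTreeClassifier(), X, y)
```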

A RF that approximates a DT does not overfit

Figure 3: Test and training error of DA-RF and DA-DT (first column), the average Rademacher complexity (second column) and the test and training error of all methods (third column), for the Adult, Bank, EEG, Magic and Nomao datasets. Each row depicts one dataset. Results are averaged over a 5-fold cross validation. Solid lines are the test error and dashed lines are the training error. Best viewed in color.

We repeat the above experiments with data augmentation training. Again, we limit the maximum number of leaf nodes $n_l$. We train a RF and approximate it with a DT via Algorithm 2; we call this algorithm DT with Data Augmentation (DA-DT). Similarly, we train a single DT and approximate it with a RF, denoted as RF with Data Augmentation (DA-RF). Figure 3(a) shows the error curves for this experiment. Again, note the logarithmic scale on the x-axis. First, we see that the training error approaches zero for larger $n_l$ for both models, as expected. Second, we see that the decision tree DA-DT, despite fitting the decision boundary of a RF, shows clear signs of overfitting. Third, and maybe even more remarkable, the forest DA-RF, trained via data augmentation on the bad, overfitted labels from the DT, still does not overfit but also shows a single descent. To get a better picture we can again look at the Rademacher complexities of these two models in Figure 3(b). Similar to before, there is a steep increase for both models. However, the DA-DT now converges against a smaller Rademacher complexity than the DA-RF, which now has a much larger Rademacher complexity across all datasets despite the fact that DA-RF has a better test error. The forest does not overfit in a u-shaped curve, but again shows a single descent, whereas the DT still overfits in a u-shape similar to before. For a better comparison between the individual methods we combine them in a single plot. Figure 3(c) shows the test and training error of all methods over the asymptotic Rademacher complexity. The dashed lines depict the training error, whereas the solid lines are the test error. Note that some curves stop early because their respective Rademacher complexities are not large enough to fill the entire plot. As one can see, DT and RF have a comparably small maximum Rademacher complexity. RF seems to minimize the training error more aggressively and reaches a smaller error with smaller complexities, whereas the DT starts to overfit comparably early. DA-DT seems to have the smallest Rademacher complexity but also overfits the most on some datasets (e.g. adult or magic). DA-RF has the largest complexity but does not seem to overfit at all. It slowly converges against the original RF's performance, except on the adult dataset, where it also does not overfit. Both DT and DA-DT show a u-shaped curve, whereas RF and DA-RF show a single descent in most cases. Clearly, the Rademacher complexity fails to explain the performance of the data-augmented trees and forests. We argue that the reason for this lies in the algorithm used to train the trees and not in the model class itself.

4 Negative Correlation Forests

The previous section implies that the model class alone does not fully explain a model's performance. We argue that the learning algorithm also plays a crucial role in the generalization capabilities of the model, and more specifically that the trade-off between bias and diversity is crucial. We show that there is a large region of different diversity levels which are all equally good, so it does not matter which specific trade-off is achieved as long as it falls into this region.

More formally, the bias-variance decomposition of the mean squared error states that the expected error of a model can be decomposed into its bias and its variance [13, 9]:

$$\mathbb{E}_{\Theta}\!\left[\big(f(x) - y\big)^2\right] = \underbrace{\big(\mathbb{E}_{\Theta}[f(x)] - y\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_{\Theta}\!\left[\big(f(x) - \mathbb{E}_{\Theta}[f(x)]\big)^2\right]}_{\text{variance}}$$

where $\Theta$ is a random process (e.g. due to bootstrap sampling) induced by the algorithm that generates $f$, the first term is the (squared) bias of the algorithm and the second term is its variance. Considering the ensemble $f(x) = \frac{1}{M}\sum_{i=1}^{M} h_i(x)$, the variance term can be further decomposed into a variance and a co-variance part (see e.g. [5] for more details):

$$\mathrm{Var}_{\Theta}\big(f(x)\big) = \frac{1}{M}\,\overline{\mathrm{Var}} + \left(1 - \frac{1}{M}\right)\overline{\mathrm{Cov}} \quad (6)$$

where we dropped the dependence on $x$ for better readability, $\overline{\mathrm{Var}}$ is the average variance of the individual ensemble members and $\overline{\mathrm{Cov}}$ is the average co-variance of the predictions across the ensemble.
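Eq. (6) can be checked numerically. The following small sketch, assuming equal weights and sample-based estimates of the variance and co-variance terms, verifies the decomposition on simulated, correlated member predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
M, R = 5, 200_000  # M ensemble members, R draws of the random process
h = rng.normal(size=(R, M)) @ rng.normal(size=(M, M))  # correlated member predictions
f = h.mean(axis=1)                                      # ensemble prediction

avg_var = h.var(axis=0).mean()                          # average member variance
cov = np.cov(h, rowvar=False, bias=True)                # M x M covariance matrix
avg_cov = (cov.sum() - np.trace(cov)) / (M * (M - 1))   # average pairwise co-variance

# Both sides of Eq. (6) agree up to floating point error.
print(f.var(), avg_var / M + (1 - 1 / M) * avg_cov)
```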

This decomposition does not directly relate the training error of a model to its generalization capabilities, but it shows how the individual training and testing losses are structured [6]. Although suspected for some time and exploited in numerous ensembling algorithms, the exact connection between the diversity of an ensemble and its generalization error was only established relatively recently. Germain et al. showed in germain/etal/15 that the diversity of an ensemble and its generalization error can be connected in a PAC-style bound shown in Theorem 2.

Theorem 2 (PAC-Style C-Bound [10]).

Let $\{h_1, \dots, h_M\}$ denote a set of base classifiers and let $f$ be the ensemble. Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of a sample $S$ of size $N$ drawn i.i.d. according to $\mathcal{D}$, the following inequality holds:

$$L_{\mathcal{D}}(f) \le 1 - \frac{\big(\mu_1(f, S)\big)^2}{\mu_2(f, S)} + C(N, \delta)$$

where $C(N, \delta) \to 0$ for $N \to \infty$ and any $\delta > 0$, $\mu_1(f, S)$ is the first moment of the margin of the ensemble on the sample $S$, and $\mu_2(f, S)$ is the second margin moment, which captures the co-variance of the ensemble evaluated on the sample $S$.

Intuitively, this result shows that an ensemble of powerful learners with a small bias that sometimes disagree will be better than an ensemble with a comparable bias in which all models agree.

While the original RF algorithm produces accurate ensembles, its diversity is implicitly determined by the bootstrap and feature sampling. Hence, it is difficult to precisely control its diversity. For direct control over the diversity we will now introduce the Negative Correlation Forest. The bound in Theorem 2 is stated for the 0-1 loss, which makes its direct minimization difficult, as noted in [10]. Luckily, minimizing the bias-variance decomposition of the MSE is much more approachable and has already been studied in the form of Negative Correlation Learning (NCL). NCL offers fine-grained control over the diversity of an ensemble of neural networks by minimizing the following objective (see [5, 19, 6] for more details):

$$\frac{1}{N} \sum_{(x, y) \in S} \left[ \frac{1}{M} \sum_{i=1}^{M} \big(h_i(x) - y\big)^2 \;-\; \frac{\lambda}{2M} \sum_{i=1}^{M} \big(h_i(x) - f(x)\big)^{\top} D \,\big(h_i(x) - f(x)\big) \right] \quad (7)$$

where $D = 2 \cdot I$, $I$ is the identity matrix (so $D$ has $2$ on its main diagonal), and $\lambda$ is the regularization strength. For $\lambda = 0$ this trains each classifier independently and no further diversity among the ensemble members is enforced; for $\lambda > 0$ more diversity is enforced during training, and for $\lambda < 0$ diversity is discouraged. While a large diversity helps to reduce the overall error, it also implies that some trees must be wrong on some data points, which means that their respective bias increases again. Finding a good balance between the bias and the diversity is therefore crucial for good performance.
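To make the objective concrete, the following minimal sketch evaluates Eq. (7) for one mini-batch of per-tree predictions; the tensor layout and the fact that the constant factors (such as $1/(2M)$ and $D$) are absorbed into the trade-off parameter are my own simplifications.

```python
import torch

def ncl_loss(h, y, lam):
    """NCL objective of Eq. (7) for one mini-batch (constants absorbed into lam).

    h:   per-tree predictions, shape (batch, M, C)
    y:   one-hot targets,      shape (batch, C)
    lam: diversity trade-off (0 = independent training, >0 encourages diversity)
    """
    f = h.mean(dim=1)  # ensemble prediction, (batch, C)
    per_tree_mse = ((h - y.unsqueeze(1)) ** 2).sum(-1).mean()
    diversity = ((h - f.unsqueeze(1)) ** 2).sum(-1).mean()
    return per_tree_mse - lam * diversity
```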

NCL was first introduced to train ensembles of neural networks because these can easily be optimized via gradient-based algorithms minimizing Eq. 7. We adapt this approach to RF by training an initial RF with Algorithm 1, which is then refined by optimizing the NCL objective. Recall that DTs use a series of axis-aligned splits of the form $x_k \le t$, where $k$ is a feature index and $t$ is a split threshold, to determine the leaf nodes, and that each leaf node has a (constant) prediction associated with it. Let $\theta_i$ be the parameter vector of tree $h_i$ (e.g. containing split values, feature indices and leaf predictions) and let $\theta = (\theta_1, \dots, \theta_M)$ be the parameter vector of the entire ensemble $f$. Then our goal is to solve

$$\theta^{\ast} = \arg\min_{\theta} \mathcal{L}(\theta) \quad (8)$$

where $\mathcal{L}$ is the NCL objective of Eq. 7 for a given trade-off $\lambda$. We propose to minimize this objective via stochastic gradient descent (SGD). SGD is an iterative algorithm which takes a small step into the negative direction of the gradient in each iteration

$$\theta \leftarrow \theta - \alpha \, g_{\mathcal{B}}(\theta) \quad (9)$$

by using an estimate of the true gradient

$$g_{\mathcal{B}}(\theta) = \frac{1}{|\mathcal{B}|} \sum_{(x, y) \in \mathcal{B}} \nabla_{\theta}\, \mathcal{L}_{(x,y)}(\theta) \quad (10)$$

which is the gradient of $\mathcal{L}$ with respect to $\theta$ computed on a mini-batch $\mathcal{B}$, where $\mathcal{L}_{(x,y)}$ denotes the per-example NCL loss and $\alpha$ is the step size.

Unfortunately, the axis-aligned splits of a DT are not differentiable and thus it is difficult to refine them further with gradient-based approaches. However, the leaf predictions are simple constants that can easily be updated via SGD. Formally, we therefore restrict the trainable parameters of each tree to its leaf predictions, $\theta_i = (\hat{y}_{i,1}, \dots, \hat{y}_{i,L_i})$, leading to

$$h_i(x) = \sum_{l=1}^{L_i} \hat{y}_{i,l}\, s_{i,l}(x) \quad (11)$$

where the split functions $s_{i,l}$ are kept fixed, so that the gradient of Eq. 7 with respect to $\theta_i$ is straightforward to compute.
Algorithm 3: Negative Correlation Forest (NCForest) (train an initial RF with Algorithm 1 using constant weights; initialize the leaf predictions from the trained trees; then, for each received mini-batch, perform an SGD update of the leaf predictions of every tree using Eq. 9 and Eq. 11).

Algorithm 3 summarizes the NCForest algorithm. First, an initial RF is trained with $M$ trees using at most $n_l$ leaf nodes and a random subset of features per split. Then, the leaf predictions are extracted from the forest and refined via SGD on the NCL objective. Given our previous experiments, the learning algorithm seems to play a crucial role in the performance of a model, and Theorem 2 implies that a diverse ensemble should generalize better. However, too much diversity will likely hurt the performance of the forest, because then the bias increases.
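A rough, self-contained sketch of this refinement step, assuming a fitted scikit-learn RandomForestClassifier and one-hot targets; the helper name ncforest_refine, its defaults, and the reliance on the sklearn tree internals (tree_.value, apply) are illustrative choices, and the constant factors of Eq. 7 are again absorbed into lam.

```python
import numpy as np
import torch

def ncforest_refine(forest, X, Y, lam=0.0, epochs=10, lr=1e-2, batch_size=128):
    """Refine the leaf predictions of a fitted forest with Adam on the NCL objective.

    X: numpy array (N, d), Y: one-hot numpy array (N, C),
    forest: fitted sklearn RandomForestClassifier.
    """
    # Per-tree leaf tables as trainable tensors; rows of inner nodes are never selected.
    leaf_tables = []
    for t in forest.estimators_:
        counts = t.tree_.value[:, 0, :]
        table = counts / counts.sum(axis=1, keepdims=True)  # class probabilities per node
        leaf_tables.append(torch.tensor(table, dtype=torch.float32, requires_grad=True))

    # Pre-compute the leaf index of every example for every tree, shape (N, M).
    leaf_idx = torch.tensor(
        np.stack([t.apply(X) for t in forest.estimators_], axis=1), dtype=torch.long)
    Yt = torch.tensor(Y, dtype=torch.float32)
    opt = torch.optim.Adam(leaf_tables, lr=lr)

    for _ in range(epochs):
        perm = torch.randperm(X.shape[0])
        for start in range(0, X.shape[0], batch_size):
            idx = perm[start:start + batch_size]
            # Per-tree predictions for this batch: (batch, M, C).
            h = torch.stack(
                [table[leaf_idx[idx, i]] for i, table in enumerate(leaf_tables)], dim=1)
            f = h.mean(dim=1)                                 # ensemble prediction
            mse = ((h - Yt[idx].unsqueeze(1)) ** 2).sum(-1).mean()
            div = ((h - f.unsqueeze(1)) ** 2).sum(-1).mean()  # diversity term
            loss = mse - lam * div                            # NCL objective, cf. Eq. 7
            opt.zero_grad()
            loss.backward()
            opt.step()
    return leaf_tables
```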

There is an optimal trade-off between bias and diversity

Again we validate our hypothesis experimentally. We train an initial RF with a fixed number of trees and a fixed maximum number of leaf nodes. Due to the bootstrap sampling and the feature sampling, this initial RF already has some diversity. Hence, we use negative $\lambda$ values to de-emphasize diversity and positive $\lambda$ values to emphasize it. Last, we noticed that beyond a certain $\lambda$ there is a steep increase in the diversity because it starts to dominate the optimization (cf. [5], which reports a similar effect for neural networks). Hence, we vary $\lambda$ across this range and minimize the NCL objective for a fixed number of epochs using the Adam optimizer with a fixed step size and batch size, implemented in PyTorch [15]. We also experimented with more leaf nodes, different $\lambda$ values and more epochs, but the test error would not improve further with different parameters.

As seen in Figure 4(a) and Figure 4(b), our NCForest method indeed allows for fine control over the diversity in the ensemble. Increasing $\lambda$ over moderate values leads to a larger bias and more diversity in the ensemble while the overall ensemble loss remains nearly constant, as expected. Increasing $\lambda$ further leads to a steep increase in both, where the ensemble loss also increases because the diversity starts to dominate the optimization.

Looking at Figure 4(c) we can see the average training and testing error of the individual trees in the ensemble as well as the test and train error of the entire ensemble. Again, dashed lines depict the train error and solid lines the test error. Moreover, we marked the performance of a DT and a RF for better comparability (by definition a single DT does not have any diversity; for presentational purposes we assigned a small non-zero diversity to it in order to not break the logarithmic axis). We can see two effects here. First, there seems to be a large region in which the diversity does not affect the ensemble error, but only the individual errors of the trees. In this region the performance of each individual tree changes, but the overall ensemble error remains nearly constant. The corresponding plots are akin to a bathtub: If the diversity is too small or too large, then performance suffers. Only in the middle between both extremes do we find a good trade-off. For example, on the EEG dataset a broad range of intermediate diversities offers a near-perfect trade-off with similar performance, even though the average test error of the individual trees steeply increases for larger diversities. Second, we find that a RF seems to achieve a good balance between both quantities with its default parameters, although minor improvements are possible, e.g. on the adult or the EEG dataset. We conclude that a larger diversity does not necessarily result in a better forest, but can hurt the performance at some point. Likewise, not enough diversity also leads to bad performance, and a good balance between both quantities must be achieved. Interestingly, there seems to be a comparably large region of similar performances where the exact trade-off between bias and diversity does not matter. Hence, the diversity does have some effect on the final model, but its effect might be overrated once this region is found by the algorithm.

Figure 4: Mean squared error (first column; second column shows a zoomed-in view) over different $\lambda$ values, and the test and train error (third column) over different diversities, for the Adult, Bank, EEG, Magic and Nomao datasets. Results are averaged over a 5-fold cross validation. Best viewed in color.

5 Conclusion

In this paper we revisited explanations for the success of Random Forests and showed that for most of these explanations an experiment can be constructed in which they do not work or offer only little insight. First, given a proper definition of the complexity of a forest, a RF does not exhibit a double descent, but rather a single descent in which it simply fits the data better with increasing complexity. Second, a DT shows the ‘classic’ u-shaped curve in which it starts to overfit at some point. We further showed that a RF does not ‘inherit’ these bad properties of a DT and retains a single-descent curve even if it is fitted on the ground truth of an overfitted DT. Similarly, a DT does not ‘inherit’ the good properties of a RF, but keeps its u-shaped overfitting curve when fitted on the ground truth of a good, not overfitted RF. In all these experiments the Rademacher complexity did not accurately predict the performance of each classifier. In fact, DTs trained via data augmentation had a smaller complexity than their RF counterparts but a much worse test error. Hence, we argue that the training algorithm plays a crucial role in the performance of a model. We introduced the Negative Correlation Forest (NCForest), which refines the leaf nodes of a forest to explicitly control the diversity among the trees. We hypothesized that an ensemble of diverse trees with sufficiently small bias should have a better generalization error than a homogeneous forest. And indeed, there seems to be a bathtub-like relationship between the diversity and the test error: Having too little or too much diversity hurts the performance, whereas the right balance of both quantities maximizes it. Luckily, there seems to be a comparably large region of different diversities which achieve this trade-off, and a RF seems to find this region in most cases, although some improvements are possible when this trade-off is further refined.

Acknowledgments

Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Analysis”, DFG project number 124020371, SFB project A1, http://sfb876.tu-dortmund.de. Part of the work on this research has been funded by the Federal Ministry of Education and Research of Germany as part of the competence center for machine learning ML2R (01IS18038A), https://www.ml2r.de/. We thank Mirko Bunse and Lukas Pfahler for their helpful comments on this paper.

References

  • [1] G. Biau and E. Scornet (2016) A random forest guided tour. Test 25 (2), pp. 197–227.
  • [2] G. Biau (2012) Analysis of a random forests model. Journal of Machine Learning Research 13 (Apr), pp. 1063–1095.
  • [3] L. Breiman, J. Friedman, C.J. Stone, and R.A. Olshen (1984) Classification and regression trees. Taylor & Francis.
  • [4] L. Breiman (2000) Some infinity theory for predictor ensembles. Technical Report 579, Statistics Dept. UCB.
  • [5] G. Brown, J. L. Wyatt, and P. Tino (2005) Managing diversity in regression ensembles. Journal of Machine Learning Research 6, pp. 1621–1650.
  • [6] S. Buschjäger, L. Pfahler, and K. Morik (2020) Generalized negative correlation learning for deep ensembling. arXiv preprint arXiv:2011.02952.
  • [7] C. Cortes, M. Mohri, and U. Syed (2014) Deep boosting. In International Conference on Machine Learning, pp. 1179–1187.
  • [8] M. Denil, D. Matheson, and N. De Freitas (2014) Narrowing the gap: random forests in theory and in practice. In International Conference on Machine Learning (ICML).
  • [9] S. Geman, E. Bienenstock, and R. Doursat (1992) Neural networks and the bias/variance dilemma. Neural Computation, Vol. 4.
  • [10] P. Germain, A. Lacasse, F. Laviolette, M. Marchand, and J. Roy (2015) Risk bounds for the majority vote: from a PAC-Bayesian analysis to a learning algorithm. Journal of Machine Learning Research 16 (26), pp. 787–860.
  • [11] V. Koltchinskii and D. Panchenko (2002) Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics 30 (1), pp. 1–50.
  • [12] Y. Mansour (1997) Pessimistic decision tree pruning based on tree size. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 195–201.
  • [13] H. Markowitz (1952) The utility of wealth. Journal of Political Economy.
  • [14] T. M. Oshiro, P. S. Perez, and J. A. Baranauskas (2012) How many trees in a random forest?. In International Workshop on Machine Learning and Data Mining in Pattern Recognition, pp. 154–168.
  • [15] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems.
  • [16] F. Pedregosa et al. (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
  • [17] J. R. Quinlan (1986) Induction of decision trees. Machine Learning 1 (1), pp. 81–106.
  • [18] S. Shalev-Shwartz and S. Ben-David (2014) Understanding machine learning: from theory to algorithms. Cambridge University Press.
  • [19] A. M. Webb, C. Reynolds, D. Iliescu, H. Reeve, M. Lujan, and G. Brown (2019) Joint training of neural network ensembles. arXiv preprint arXiv:1902.04422.
  • [20] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2021) Understanding deep learning (still) requires rethinking generalization. Communications of the ACM 64 (3), pp. 107–115.