Ensemble methods are machine learning methods in which each learner provides an estimate of the target variable and all estimates are then combined in some fashion, with the aim of reducing the generalization error compared to a single learner. Random forest, devised by Breiman in the early 2000s (Breiman, 2001), is among the most successful ensemble methods currently applied to a wide range of prediction problems. By combining several randomized decision trees during the training phase and aggregating their predictions by averaging or voting, this supervised learning procedure has shown high-quality performance in settings where the number of variables involved is much larger than the number of observations. Moreover, various versions of random forest have been applied in a large number of fields, including bioinformatics (Díaz-Uriarte and Andrés, 2006), survival analysis (Ishwaran et al., 2008), 3D face analysis (Fanelli et al., 2013), cancer detection (Paul et al., 2018), stereo matching (Park and Yoon, 2018; Kim et al., 2017), head and body orientation (Lee et al., 2017), head and neck CT images for radiotherapy planning (Wang et al., 2018), 3D human shape tracking (Huang et al., 2018), gaze redirection in images (Kononenko et al., 2017), salient object detection and segmentation (Song et al., 2017), organ segmentation (Farag et al., 2017), visual attribute prediction (Li et al., 2016), and scene labeling (Cordts et al., 2017), which further demonstrates the practical effectiveness of the algorithm. These applications employ splitting criteria that utilize sample information, such as those based on information gain (Quinlan, 1986), information gain ratio (Quinlan, 1993) and the Gini index (Breiman et al., 1984). However, in general, from the statistical perspective, those variants of random forest classifiers are not universally consistent. For example, Biau et al. (2008) construct a specific distribution as a counterexample where the classifier does not converge due to the nature of the Gini index criterion.
Because of its wide applications, considerable effort has been devoted in the random forest community to investigating various versions of random forests from the theoretical perspective. For instance, Arlot and Genuer (2014) study the approximation error of some purely random forest models and show that the error of an infinite forest decreases at a faster rate (with respect to the size of each tree) than that of a single tree. Furthermore, concerning classification problems, Biau et al. (2008) established weak convergence of the purely random forest, which is the simplest version of random forest classifiers (Breiman, 2000), while Biau (2012) shows that Breiman’s procedure is consistent and adapts to sparsity, in the sense that the rate of convergence depends only on the number of strong/active features, and Scornet et al. (2015) prove the consistency of Breiman’s original algorithm in the context of additive models. Moreover, Wager and Walther (2015) build an adaptive concentration framework for describing the statistical properties of adaptively grown trees by viewing the training of trees as occurring in two stages. Finally, Jaouad Mourtada and Scornet (2017) establish the consistency of modified Mondrian Forests (Balaji Lakshminarayanan and Teh, 2014) that can be implemented online while achieving the minimax rate for the estimation of Lipschitz continuous functions.
In this paper, we propose a random forest algorithm named best-scored random forest, which not only achieves almost optimal convergence rates but also performs well on several benchmark data sets. The main contributions of this paper are twofold:
(i) Concerning the theoretical analysis, we establish almost optimal convergence rates for the best-scored random trees under certain restrictions and extend them to the case of the best-scored random forest. The convergence analysis is conducted under the framework of regularized empirical risk minimization. Just as traditional learning theory analysis attaches great significance to decomposing the error term into bias and variance terms, we proceed by decomposing the error term into a data-free and a data-dependent error term. More precisely, the data-free error term can be dealt with by applying techniques from approximation theory, whereas the data-dependent error term can be handled by exploiting arguments from empirical process theory, such as oracle inequalities for regularized empirical risk minimization in our case. In order to gain a more rigorous understanding of the consistency of random forests, we present a counterexample which explicitly demonstrates that every dimension must have a positive probability of being split in order to achieve consistency.
(ii) Concerning the numerical experiments, we first improve the original random splitting criterion by proposing an adaptive random partition method, which differs from the purely random partition in the way the node selection is conducted. Specifically, at each step, we randomly select a sample point from the training data set, and the node to be split is the one in which this point falls. The idea is that, when sample points are picked at random from the whole training data set, nodes with more samples are more likely to be selected, while nodes with fewer samples are less likely to be chosen. Consequently, we are more likely to obtain cells whose sample sizes are evenly distributed. Empirical experiments further show that the adaptive/recursive method enhances the efficiency of the algorithm, since it increases the effective number of splits compared to the original purely random partition method.
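The difference between the two node-selection rules can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments: the helper names (`find_leaf`, `purely_random_select`, `adaptive_select`), the representation of a leaf as a list of `(lo, hi)` intervals, and the toy data are all hypothetical.

```python
import random

def find_leaf(leaves, x):
    """Index of the first axis-aligned box [(lo, hi), ...] that contains x."""
    for idx, box in enumerate(leaves):
        if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, box)):
            return idx
    raise ValueError("x lies outside every leaf")

def purely_random_select(leaves, rng):
    """Purely random rule: every leaf is equally likely to be split next."""
    return rng.randrange(len(leaves))

def adaptive_select(leaves, samples, rng):
    """Adaptive rule: draw a training point and split the leaf it falls in."""
    return find_leaf(leaves, rng.choice(samples))

# Toy setting (hypothetical): two leaves on [0, 1], nine points left, one right.
leaves = [[(0.0, 0.5)], [(0.5, 1.0)]]
samples = [(0.05 * i,) for i in range(1, 10)] + [(0.9,)]

rng = random.Random(0)
draws = [adaptive_select(leaves, samples, rng) for _ in range(2000)]
left_fraction = draws.count(0) / len(draws)
print(f"adaptive rule picks the crowded leaf {left_fraction:.0%} of the time")
```

Under the purely random rule both leaves would be picked with probability one half, whereas the adaptive rule picks a leaf with probability proportional to the number of training points it contains.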
The rest of this paper is organized as follows. Section 2 introduces some fundamental notations and definitions related to the best-scored random forest. Section 3 presents our main results on the oracle inequalities and learning rates of the best-scored random forest, together with some comments and discussions. The main analysis on bounding the error terms is provided in Section 4. A counterexample that gives a more rigorous understanding of the consistency of random forests is presented in Section 5. Numerical experiments comparing the best-scored random forest with other classification methods are given in Section 6. All the proofs for Section 3 and Section 4 can be found in Section 7. Finally, we conclude this paper with a brief discussion in Section 8.
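Throughout this paper, we suppose that the given data set $D := ((X_1, Y_1), \ldots, (X_n, Y_n))$ consists of independent and identically distributed $\mathcal{X} \times \mathcal{Y}$-valued random variables with the same distribution as the generic pair $(X, Y)$, where $X$ is the feature vector while $Y$ is the binary label. The joint distribution $\mathrm{P}$ of $(X, Y)$ is determined by the marginal distribution $\mathrm{P}_X$ of $X$ and the a posteriori probability $\eta : \mathcal{X} \to [0, 1]$ defined by
$$\eta(x) := \mathrm{P}(Y = 1 \mid X = x). \tag{2.1}$$
The learning goal of binary classification is to find a decision function $f$ such that for the new data $(X, Y)$, we have $f(X) = Y$ with high probability.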
In order to precisely describe our learning goal, and for convenience, we suppose that and . Under that condition, it is legitimate to consider the classification loss defined by . We define the risk of the decision function by
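$$\mathcal{R}_{L,\mathrm{P}}(f) := \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(x)) \, d\mathrm{P}(x, y)$$
and the empirical risk by
$$\mathcal{R}_{L,\mathrm{D}}(f) := \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)),$$
where $\mathrm{D} := \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}$ denotes the average of Dirac measures at $(x_i, y_i)$. The smallest possible risk $\mathcal{R}^*_{L,\mathrm{P}} := \inf_f \mathcal{R}_{L,\mathrm{P}}(f)$, where the infimum is taken over all measurable functions, is called the Bayes risk, and a measurable function $f^*_{L,\mathrm{P}}$ such that $\mathcal{R}_{L,\mathrm{P}}(f^*_{L,\mathrm{P}}) = \mathcal{R}^*_{L,\mathrm{P}}$ holds is called a Bayes decision function. By simple calculation, we obtain the Bayes decision function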
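The quantities introduced above can be made concrete with a small numerical sketch. It assumes, for illustration only, that labels take values in {-1, +1}, that the classification loss is the usual 0-1 loss on the sign of the prediction, and that the Bayes decision function predicts +1 exactly when the a posteriori probability of the label +1 is at least 1/2. The function names and the toy data are hypothetical.

```python
def sign(t):
    """Sign convention assumed here: sign(0) = +1."""
    return 1 if t >= 0 else -1

def classification_loss(y, t):
    """0-1 loss: 1 if the sign of the prediction t disagrees with the label y."""
    return 1.0 if y * sign(t) <= 0 else 0.0

def empirical_risk(f, data):
    """Average classification loss of the decision function f over the sample."""
    return sum(classification_loss(y, f(x)) for x, y in data) / len(data)

def bayes_decision(eta):
    """Plug-in Bayes rule for a given posterior eta(x) = P(Y = 1 | X = x)."""
    return lambda x: sign(2.0 * eta(x) - 1.0)

# Toy one-dimensional sample (hypothetical) with assumed posterior eta(x) = x.
data = [(0.1, -1), (0.2, -1), (0.8, 1), (0.9, -1)]
f_star = bayes_decision(lambda x: x)
risk = empirical_risk(f_star, data)
print(risk)
```

Even the Bayes rule misclassifies the noisy point at 0.9, which is why the Bayes risk is in general nonzero.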
2.2 Purely Random Tree and Forest
Since the purely random forest put forward by Breiman (2000) plays a fundamental role in theory and practice, we base our analysis on this specific kind of random forest. To illustrate one possible construction process of a tree in the forest, some randomizing variables are needed to describe the selection of the node, the coordinate and the cut point at each step. Therefore, we introduce a random vector which describes the splitting rule at the -th step of tree construction. To be specific, denotes the randomly chosen leaf to be split at the -th step from the candidates, which are defined to be all the leaves present at the -step; thus the leaf selection proceeds recursively. The random variable denotes the dimension chosen to be split from for the leaf. The random variables , are independent and identically multinomially distributed, with each dimension having equal probability of being chosen. The random variables , are independent and identically distributed from . These variables are proportional factors representing the ratio between the length of the newly generated leaf in the -th dimension after completing the -th split and the length of the leaf being cut in the -th dimension. In other words, the length of the newly generated leaf in the -th dimension is obtained by multiplying the length of the leaf being cut in the -th dimension by the proportional factor .
The statements above give a quantitative description of the splitting process of the purely random decision tree. For a clearer understanding of this splitting approach, we give a simple example where we develop a partition on . To begin with, we randomly choose a dimension out of candidates and split uniformly at random along the chosen dimension, so that is split into two leaves, and , respectively. Secondly, a leaf is chosen uniformly at random, say , and we once again randomly pick the dimension and perform the split on that leaf, which leads to a partition consisting of . Thirdly, a leaf is randomly selected from all three leaves, say , and the third split is conducted on that leaf with the dimension and cut point chosen in the same way as before, which leads to a partition consisting of . The construction continues in this manner until the desired number of splits is reached. Note that the above process can be extended to the general case where the construction is conducted on the feature space . All of the above procedures lead to a so-called partition variable which is defined by and takes values in space . The probability measure of is denoted as .
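The construction described above can be sketched in a few lines. The sketch below is an illustration under assumptions, not the authors' code: it takes the unit cube as the initial cell and draws the proportional factor from the uniform distribution on (0, 1), and the names `purely_random_partition` and `volume` are hypothetical.

```python
import random

def purely_random_partition(d, n_splits, rng):
    """Grow a purely random partition of [0, 1]^d (assumed domain) with n_splits splits."""
    leaves = [[(0.0, 1.0)] * d]                          # start from the whole cube
    for _ in range(n_splits):
        leaf = leaves.pop(rng.randrange(len(leaves)))    # leaf chosen uniformly
        dim = rng.randrange(d)                           # coordinate chosen uniformly
        lo, hi = leaf[dim]
        cut = lo + rng.random() * (hi - lo)              # proportional factor ~ U(0, 1)
        left, right = list(leaf), list(leaf)
        left[dim], right[dim] = (lo, cut), (cut, hi)
        leaves.extend([left, right])                     # one leaf becomes two
    return leaves

def volume(box):
    """Lebesgue volume of an axis-aligned box given as [(lo, hi), ...]."""
    v = 1.0
    for lo, hi in box:
        v *= hi - lo
    return v

partition = purely_random_partition(d=2, n_splits=3, rng=random.Random(0))
for cell in partition:
    print(cell)
```

Each split replaces one leaf by two, so a tree with a given number of splits always has one more leaf than splits, and the cells always tile the initial cube.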
Assume that any specific partition variable can be treated as a latent splitting policy. Considering a partition with splits, we denote and define the resulting collection of cells as which is a partition of , where . If we focus on certain sample point , then the corresponding cell in which that point falls is defined by . Here, we introduce the random tree decision rule, that is a map defined by
Assume that our random forest is determined by the latent splitting policies consisting of independent and identically distributed random variables drawn from , and the number of splits for trees are presented as . As usual, the random forest decision rule can be defined by
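A tree rule on a fixed partition and the resulting forest rule can be sketched as below. The sketch assumes labels in {-1, +1}, a tree rule that takes the majority label of the training points falling in the same cell, and a forest rule that takes the majority vote of the trees, with ties broken in favor of +1. These conventions and all names (`cell_index`, `tree_rule`, `forest_rule`) are assumptions made for illustration.

```python
def cell_index(leaves, x):
    """Index of the first box [(lo, hi), ...] of the partition that contains x."""
    for idx, box in enumerate(leaves):
        if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, box)):
            return idx
    raise ValueError("x lies outside the partition")

def tree_rule(leaves, data):
    """Majority label per cell (assumed form of the tree decision rule)."""
    votes = [0] * len(leaves)
    for x, y in data:
        votes[cell_index(leaves, x)] += y
    return lambda x: 1 if votes[cell_index(leaves, x)] >= 0 else -1

def forest_rule(trees):
    """Majority vote over the individual tree rules."""
    return lambda x: 1 if sum(t(x) for t in trees) >= 0 else -1

# Toy one-dimensional data and three hand-picked partitions (hypothetical).
data = [((0.2,), -1), ((0.3,), -1), ((0.7,), 1), ((0.9,), 1)]
partitions = [
    [[(0.0, 0.5)], [(0.5, 1.0)]],
    [[(0.0, 0.4)], [(0.4, 1.0)]],
    [[(0.0, 1.0)]],                  # a tree with no split
]
trees = [tree_rule(p, data) for p in partitions]
forest = forest_rule(trees)
print(forest((0.1,)), forest((0.8,)))
```

The unsplit third tree predicts +1 everywhere, yet the forest still recovers the correct label on the left because the other two trees outvote it.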
2.3 Best-scored Random Forest
Considering that the preliminary work of analysis of a random forest is to focus on how to give appropriate partitions to several independent trees, we first define a function set which contains all the possible partitions as follows:
In this paper, without loss of generality, we only consider cells with the shape of . To be specific, choosing as the number of splits, the resulting leaves, presented as , in fact form a partition of with splits. It is also important to note that is the value of leaf , and thus the set contains all the potential decision rules. Moreover, for fixed , we denote the collection of trees with number of splits as
Let be a fixed number to be chosen later and assume that the forest consists of trees. For , suppose that are independent and identically distributed random variables drawn from , which are also the splitting policies of those trees. For , we might as well derive a random function set induced by that specific splitting policy after steps as
which is a subset of . We also denote .
Having found an appropriate random tree decision rule under policy denoted as , we are supposed to scrutinize the convergence properties of that rule. To this end, we need to introduce the framework of regularized empirical risk minimization, see also Definition 7.18 in Steinwart and Christmann (2008). Let be the set of measurable functions on , be a non-empty set,
be a loss function, and be a function. A learning method (see e.g. Definition 6.1 in Steinwart and Christmann (2008)) whose decision function satisfies
for all and is called regularized empirical risk minimization.
Note that we propose to penalize the number of splits . The penalization on is necessary not only because it significantly reduces the amount of computation, since the number of splits is then bounded and the function set has a finite VC dimension, but more importantly because it prevents overfitting. With the same data set , the above regularized empirical risk minimization problem with respect to each function space turns into
It is noteworthy that the regularized empirical risk minimization under any policy can be bounded simply by considering the situation where no split is applied to . As a consequence, the optimization problem can be represented as
where denotes the empirical risk for taking for all with . Thus, it is apparent that the best number of splits is upper bounded by , which leads to a capacity reduction of the underlying function set. Therefore, all of the following function spaces are subject to the extra condition .
where is the number of splits of the decision function . Its population version can be denoted by
Directly aggregating all random trees at hand is not always sensible, since some of them may not be able to classify the data properly. For this reason, we advocate a new method named best-scored random forest. Every tree in the random forest is chosen from candidates, and the main principle is to retain only the tree yielding the minimal regularized empirical risk, which is
where is the number of splits of and . Clearly, is the regularized empirical risk minimizer with respect to the random function space . In other words, is the solution to the regularized empirical risk minimization problem
Similarly, is denoted as the solution of the population version of regularized minimization problem in the space
Again, is the corresponding number of splits of .
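The selection principle can be sketched as follows. The sketch is hedged in two respects: the penalty is taken to be `lam * num_splits ** 2`, which is an illustrative assumption about the form of the regularization term, and each candidate tree is represented simply as a hypothetical pair `(num_splits, predict)` rather than as a fitted partition.

```python
def empirical_risk(predict, data):
    """Fraction of sample points that the tree misclassifies."""
    return sum(predict(x) != y for x, y in data) / len(data)

def regularized_risk(num_splits, predict, data, lam):
    """Penalty on the number of splits (assumed quadratic) plus empirical risk."""
    return lam * num_splits ** 2 + empirical_risk(predict, data)

def best_scored(candidates, data, lam):
    """Keep only the candidate with minimal regularized empirical risk."""
    return min(candidates, key=lambda c: regularized_risk(c[0], c[1], data, lam))

# Toy data and three candidate trees (hypothetical).
data = [(0.1, -1), (0.3, -1), (0.6, 1), (0.8, -1)]
lookup = dict(data)
candidates = [
    (0, lambda x: 1),                      # no split: constant prediction
    (1, lambda x: 1 if x > 0.5 else -1),   # one split at 0.5
    (5, lambda x: lookup[x]),              # many splits: fits the sample exactly
]

for lam in (0.02, 0.001):
    splits, _ = best_scored(candidates, data, lam)
    print(f"lam={lam}: selected tree has {splits} split(s)")
```

With the larger penalty the single-split tree wins despite one training error, while the smaller penalty lets the tree that interpolates the sample win, which illustrates how the regularization parameter trades off fit against the number of splits.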
3 Main Results and Statements
In this section, we present the main results on the oracle inequalities and learning rates for the best-scored random tree and forest. More precisely, Section 3.1 gives the fundamental assumptions for the analysis of our classification algorithm. Section 3.2 is devoted to the oracle inequality for the best-scored random trees. Then, in Section 3.3, we use the established oracle inequalities to derive learning rates. Building on the results for those base classifiers, learning rates of the ensemble forest are established in Section 3.4. Finally, Section 3.5 presents some comments and discussions concerning the obtained main results.
3.1 Fundamental Assumptions
To present our main results, we need to make assumptions on the behavior of
in the vicinity of the decision boundary by means of the posterior probability defined as in (2.1). To this end, we write
and define the distance to the decision boundary by
where , see also Definition 8.5 in Steinwart and Christmann (2008). In the following, the distance always denotes the -norm if not mentioned otherwise.
The distribution on is said to have noise exponent if there exists a constant such that
The notion of the noise exponent can be traced back to Tsybakov (2004). Tsybakov’s noise assumption is extensively used in the literature on dynamical systems (Hang and Steinwart, 2017), Neyman-Pearson classification (Zhao et al., 2016), bipartite ranking (Agarwal, 2014), etc. Assumption 3.1 is intrinsically related to the analysis of the estimation error. Note that for any , if is close to , then the amount of noise will be large in the labeling process at . From that perspective, this assumption measures the size of the set of points whose noise is high during the labeling process. As is widely acknowledged, points whose a posteriori probability is far away from are favorable since they provide a clear choice of labels. Consequently, Assumption 3.1 is used to guarantee that the probability that points with high noise occur is low. In that case, there are fewer samples which are less useful for classification, which leads to a better analysis of the data-based error, that is, the estimation error.
The distribution on has margin-noise exponent if there exists a constant such that
The margin-noise exponent was put forward by Steinwart and Scovel (2007) where the relationship between margin-noise exponent and noise exponent is analyzed as well. Assumption 3.2 measures the size of the set of points, denoted as , which are close to the opposite class and the integral is with respect to the measure . This geometric noise assumption is essentially related to the approximation error, since in the context of random tree construction, only those cells which intersect the decision boundary contribute to the approximation error. Therefore, the concentration of mass near the decision boundary determines the approximation ability to some extent.
The last assumption describes the relation between the distance to the decision boundary and the discrepancy of the posterior probability to level , see also Definition 8.16 in Steinwart and Christmann (2008).
Let be a metric space, be a distribution on , and be a version of its posterior probability. We say that the associated distance to the decision boundary defined by (3.1) controls the noise by the exponent if there exists a constant such that
holds for -almost all .
Note that since for all , the above assumption becomes trivial whenever . Consequently, the assumption only considers points with sufficiently small distance to the opposite class. In short, it states that is close to the level of complete noise if approaches the decision boundary.
3.2 Oracle Inequality for Best-scored Random Tree
We now establish an oracle inequality for the best-scored random tree.
Theorem 3.1 (Oracle inequality for best-scored random trees).
with probability at least , where is a constant depending on and is a constant depending on , and which will be specified later in the proof.
3.3 Learning Rates for Best-scored Random Trees
Based on the established oracle inequality, we now state our main results on the convergence rates for best-scored random trees.
Let be the classification loss, be the noise exponent as in Assumption 3.1, , be the margin-noise exponent as in Assumption 3.2, be the number of candidate trees. Then, for all , all , and all , with probability at least , the -th tree in the best-scored random forest learns with , the rate
where , is a constant which will be specified in the proof later, independent of and only depending on constants , , , , , and .
Let us briefly discuss the constants and . If we consider the deterministic binary tree, then for a tree with number of splits , the effective number of splits for each dimension is approximately . However, we need to take randomness into consideration, which leads to a decrease of the effective number of splits, written as with . From the proof of this theorem later, we see that the constant decreases to some constant when the number of candidate trees increases. Moreover, note that the constant can be taken as small as possible. Therefore, under Assumptions 3.1 and 3.2, the learning rates (3.3) are close to
which is optimal only when both and converge to infinity simultaneously. However, as the following example shows, if belongs to certain Hölder spaces , and cannot converge to infinity simultaneously. Note that if is the Lebesgue measure on and the Bayes boundary has nonzero -dimensional Hausdorff measure, then Binev et al. (2014) shows that the constraint must hold.
Lemma A.2 in Blaschzyk and Steinwart (2018) shows that if is Hölder-continuous with exponent , then controls the noise from above with exponent , i.e., there exists a constant such that holds for -almost all . This together with Lemma 8.23 in Steinwart and Christmann (2008) implies that has margin exponent and margin-noise exponent . Consider the Lipschitz continuous space where , then the restriction implies that and consequently the convergence rate becomes
which is obviously slower than the minimax rate .
Let be the classification loss, be the noise exponent as in Assumption 3.1, , the associated distance to the decision boundary defined by (3.1) controls the noise by the exponent as in Assumption 3.3, be the number of candidate trees. Then, for and all , with probability at least , the -th tree in the best-scored random forest learns with , the rate
where and is a constant which will be specified in the proof later, independent of and only depending on constants , , , and .
Note that the constant can be taken as small as possible. Therefore, if the noise exponent (and thus ) is sufficiently large, the learning rate (3.4) is close to . In other words, we achieve asymptotically the optimal rate, see e.g. Tsybakov (2004); Steinwart and Scovel (2007). Moreover, if or is large, the rate is rather insensitive to the input dimension .
3.4 Learning Rates for Ensemble Forest
Inspired by the classical random forest, we propose a diverse and thus more accurate version of random forest. Here, we intend to construct our best-scored random forest based on the majority vote of best-scored trees, each of which is generated according to the procedure in (2.9).
Let , be the best-scored classification trees determined by the criterion mentioned above. As usual, we perform majority voting to make the final decision
Let be the classification loss, be the noise exponent as in Assumption 3.1, , be the margin-noise exponent as in Assumption 3.2, be the number of best-scored trees in the forest and be the number of candidate trees. Then, for and all , with probability at least , learns with , the rate
where and is a constant which will be specified in the proof later, independent of and only depending on , , , , , and .
Let be the classification loss, be the noise exponent as in Assumption 3.1, , the associated distance to the decision boundary defined by (3.1) controls the noise by the exponent as in Assumption 3.3, be the number of best-scored trees in the forest and be the number of candidate trees. Then, for and all , with probability at least , the best-scored random forest learns with , the rate
where and is a constant which will be specified in the proof later, independent of and only depending on constants , , , and .
Again, if the noise exponent is sufficiently large, we achieve asymptotically the optimal rate for the best-scored random forest. Moreover, if or is large, the rate is rather insensitive to the input dimension as well.
3.5 Comments and Discussions
This section presents some comments and discussions on the obtained theoretical results on the oracle inequalities, convergence rates for best-scored random trees and the learning rates for the ensemble forest.
From the theoretical perspective we notice that, on the one hand, rather than giving extra assumptions on the capacity of the function space, the size of function space in our algorithm is completely decided by the regularization term, while on the other hand, the approximation error is strictly calculated step by step according to the purely random scheme. Under certain assumptions in Section 3.1, when go to infinity, we establish the asymptotically optimal learning rates (3.3) for the best-scored random trees, which is close to . Elementary analysis shows that asymptotically optimal rates for ensemble random forest, that is, , can be achieved.
As already mentioned in the introduction, considerable effort has been devoted in the literature to deriving learning rates for various kinds of random forests. Similar to our algorithm, Genuer (2012) and Arlot and Genuer (2014) analyze purely random partitions independent of the data using a bias-variance decomposition. More precisely, in the context of one-dimensional regression problems where target functions are Lipschitz continuous, Genuer (2012) shows that purely uniformly random trees/forests, where the partitions are obtained by drawing thresholds at random in , can both achieve minimax convergence rates . Based on these models and their analysis, Arlot and Genuer (2014) obtain optimal convergence rates over twice continuously differentiable functions in the one-dimensional case for purely uniformly random forests and toy purely random partitions, where the individual partitions correspond to random shifts of the regular partition of in intervals. Note that in the latter work, boundaries and high-dimensional cases are not considered. Concerning high-dimensional cases, under a sparsity assumption, Biau (2012) proves the convergence rate where denotes the number of active/strong variables. This rate is strictly faster than the usual -dimensional optimal rate if and strictly slower than the -dimensional optimal rate . Furthermore, Gey and Mary-Huard (2014) propose a penalized criterion and derive a risk bound inequality for the tree classifier generated by CART, and thus obtain convergence rates which become asymptotically if . Moreover, based on assumptions with respect to the noise exponent and the smoothness of the target function , Binev et al. (2014) derive learning rates for certain recursive trees which are of order when and . This rate can never be better than . Last but not least, if the target functions are Lipschitz continuous, Jaouad Mourtada and Scornet (2017) recently establish the -dimensional optimal rate for online Mondrian Forests.
4 Error Analysis
In this section, we conduct error analysis by bounding the approximation error term and the sample error term respectively.
4.1 Bounding the Approximation Error Term
The following new result on bounding the approximation error term, which plays a key role in the learning theory analysis, shows that under certain assumptions on the amount of noise, the regularized approximation error possesses a polynomial decay with respect to the regularization parameter .
Let be the classification loss, be the margin-noise exponent as in Assumption 3.2 and be the number of candidate trees. Then for any fixed and , with probability at least , there holds for the -th tree in the best-scored random forest that
where and is a constant depending on and .
4.2 Bounding the Sample Error Term
In learning theory, the sample error can be bounded by means of the Rademacher average and Talagrand inequality. Let be a hypothesis space, the Rademacher average is defined as follows, see e.g., Definition 7.18 in Steinwart and Christmann (2008):
Definition 4.1 (Rademacher Average).
Let , , be a Rademacher sequence with respect to some distribution , that is, a sequence of i.i.d. random variables such that . The -th empirical Rademacher average of is defined as
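For a finite function class and a small sample, the empirical Rademacher average can be computed exactly by enumerating all sign patterns. The sketch below assumes the common form of the definition, namely the expectation over the Rademacher signs of the supremum over the class of the absolute value of the signed sample mean. The function name and the two toy classes are hypothetical.

```python
from itertools import product

def empirical_rademacher(functions, points):
    """Exact empirical Rademacher average of a finite class on a small sample.

    Assumed definition: E_eps sup_h | (1/n) * sum_i eps_i * h(z_i) |.
    """
    n = len(points)
    values = [[h(z) for z in points] for h in functions]
    total = 0.0
    for eps in product((-1, 1), repeat=n):          # all 2^n sign patterns
        total += max(
            abs(sum(e * v for e, v in zip(eps, row))) / n for row in values
        )
    return total / 2 ** n

points = [0.2, 0.7]
constant = [lambda z: 1.0]
thresholds = [lambda z, a=a: 1.0 if z <= a else -1.0 for a in (0.0, 0.5, 1.0)]

rad_constant = empirical_rademacher(constant, points)
rad_thresholds = empirical_rademacher(thresholds, points)
print(rad_constant, rad_thresholds)
```

The richer threshold class can correlate with every sign pattern on these two points, so its empirical Rademacher average is larger than that of the single constant function.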
Recall the function space defined as in (2.4). In the following analysis, for the sake of convenience, we need to reformulate the definition of . Let be fixed. Let be a partition of with number of splits and denote the family of all partitions . Furthermore, we define
Then, for all , there exists some such that can be written as . Therefore, can be equivalently defined as
To establish the bounds on the sample error, we are encouraged to give a description of the capacity of the function space.
Definition 4.2 (VC dimension).
Let be a class of subsets of and be a finite set. The trace of on is defined by . Its cardinality is denoted by . We say that shatters if , that is, if for every , there exists a such that . For , let
Then, the set is a Vapnik-Chervonenkis (VC) class if there exists such that , and the minimal such is called the VC dimension of , abbreviated as .
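The notions of trace and shattering can be checked by brute force on small examples. The sketch below represents each set by a membership predicate and uses two standard classes on the real line, half-lines and closed intervals, restricted to a finite grid of endpoints. The grid, the classes and the function names are illustrative assumptions.

```python
def trace(sets, points):
    """Distinct subsets of `points` cut out by the sets (given as predicates)."""
    return {frozenset(p for p in points if contains(p)) for contains in sets}

def shatters(sets, points):
    """True if every subset of the (distinct) points appears in the trace."""
    return len(trace(sets, points)) == 2 ** len(points)

# Endpoints on a grid from -0.1 to 1.1 (hypothetical discretization).
grid = [i / 10 for i in range(-1, 12)]
half_lines = [lambda x, a=a: x <= a for a in grid]
intervals = [lambda x, a=a, b=b: a <= x <= b for a in grid for b in grid]

print(shatters(half_lines, [0.5]), shatters(half_lines, [0.3, 0.7]))
print(shatters(intervals, [0.3, 0.7]), shatters(intervals, [0.2, 0.5, 0.8]))
```

Half-lines shatter one point but not two, and intervals shatter two points but not three, which matches the well-known facts that these classes shatter at most one and at most two points, respectively.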
The VC dimension of in (4.1) can be upper bounded by .
Definition 4.3 (Covering Number).
Let be a metric space, and . We call an -net of if for all there exists an such that . Moreover, the -covering number of is defined as
where denotes the closed ball in centered at with radius .
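For a finite set of points, a net can be built greedily, and its size gives an upper bound on the covering number (not necessarily the minimum). The sketch below is a minimal illustration under that caveat. The function name `greedy_net`, the choice of the absolute value as metric and the grid of points are assumptions.

```python
def greedy_net(points, eps, dist):
    """Greedy eps-net: keep a point as a new center if no center is within eps."""
    centers = []
    for p in points:
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

# Eleven equally spaced points in [0, 1] with the usual distance (hypothetical).
points = [i / 10 for i in range(11)]
dist = lambda a, b: abs(a - b)
net = greedy_net(points, eps=0.25, dist=dist)
print(net)
```

By construction every point is either a center or lies within the given radius of an earlier center, so the closed balls around the centers cover the whole set.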
Let be a class of subsets of , denote as the collection of the indicator functions of all , that is, . Moreover, as usual, for any probability measure , is denoted as the space with respect to equipped with the norm .
Definition 4.4 (Entropy Number).
Let be a metric space, and be an integer. The -th entropy number of is defined as
For , denote
where denotes the classification loss of , that is, .
Let be defined as in (4.6). Then, for all , the -th entropy number of satisfies
where is a constant depending on and .
Having established the entropy number of the function set , its Rademacher average can be bounded as follows:
5 A Counterexample
Here, we present a counterexample to illustrate that, in order to ensure the consistency of the forest, all dimensions must have the chance to be split. In other words, if the splits only occur on some predefined dimensions, this may lead to the inconsistency of the forest. We mention that the following counterexample is merely a simple case where we only consider a predefined dimension of size one.
First of all, we construct a special sample distribution which can be described by the following feature vector
where are unit vectors, and can thus be viewed as feature points located at the vertices of a -dimensional cube on which all probability mass is assumed to be concentrated. Points are then labeled as follows:
It can be seen that we have the Bayes risk in this case. Samples are chosen in the form of pair uniformly at random with replacement from the above mentioned sample distribution and therefore form a data set of size .
Secondly, we scrutinize results of performing splits only from one dimension to form a decision tree. If we project all data to that dimension, say the th dimension, from which we will perform the splits, then the classification rule will be based on the projection data. Therefore, in this case, no matter where the chosen cut point is, the classification result of any point can be obtained by
which is named as the classifier of the th dimension. Without loss of generality, we assume that samples located on different sides of the splits are classified into different classes, i.e. . For other cases, the analysis can be conducted in the same way.
The above process provides us a way to construct one decision tree and therefore we are able to develop a forest with each tree built in that way. The constructed -tree forest consists of classifiers of different dimensions where the number of classifiers from the same dimension, say the dimension, is denoted as , so that we have . As a result, the forest classifier is still the majority vote of trees and presented as
However, we illustrate that the above forest classifier cannot be consistent.
On one hand, if is even, for any feature point , we assume that there exists an even number such that can be presented by