Two-stage Best-scored Random Forest for Large-scale Regression

05/09/2019, by Hanyuan Hang, et al.

We propose a novel method designed for large-scale regression problems, namely the two-stage best-scored random forest (TBRF). "Best-scored" means selecting the regression tree with the best empirical performance out of a certain number of purely random regression tree candidates, and "two-stage" means dividing the original random tree splitting procedure into two stages: in stage one, the feature space is partitioned into non-overlapping cells; in stage two, child trees grow separately on these cells. The strengths of this algorithm can be summarized as follows. First, the pure randomness in TBRF leads to almost optimal learning rates and also makes ensemble learning possible, which resolves the boundary discontinuities that have long plagued existing algorithms. Second, the two-stage procedure paves the way for parallel computing, leading to computational efficiency. Last but not least, TBRF can serve as an inclusive framework in which different mainstream regression strategies, such as linear predictors and least squares support vector machines (LS-SVMs), can be incorporated as value assignment approaches on the leaves of the child trees, depending on the characteristics of the underlying data sets. Numerical comparisons with other state-of-the-art methods on several large-scale real data sets validate the promising prediction accuracy and high computational efficiency of our algorithm.

1 Introduction

The ever-increasing scale of modern scientific and technological data sets raises urgent requirements for learning algorithms that not only maintain desirable prediction accuracy but also have high computational efficiency (Wen et al., 2018; Guo et al., 2018; Thomann et al., 2017; Hsieh et al., 2014). A major challenge, however, is that data analysis and learning algorithms suitable for modest-sized data sets often encounter difficulties with, or even become infeasible for, large-volume data sets, which leads to the currently popular research direction named large-scale regression (Collobert and Bengio, 2001; Raskutti and Mahoney, 2016). In the literature, efforts have been made to tackle large-scale regression problems, and each method has its own merits in its own regime. Typically, the mainstream solutions come in two flavors: horizontal methods and vertical methods. The essence of horizontal methods, also called distributed learning, is to partition the data set into subsets, store them on multiple machines, and allow these machines to train in parallel, with each machine processing its local data to produce a local predictor. The local predictors are then synthesized to give a final predictor (Zhang et al., 2013, 2015; Lin et al., 2017; Guo et al., 2017). Nevertheless, horizontal methods face their own problems. Specifically, what is originally needed is a global predictor defined on the whole feature space, trained on all the training data via the chosen regression algorithm. However, the local predictors, although also defined on the whole feature space, are actually trained only on the information provided by the data subsets. Consequently, each local predictor may differ considerably from the desired global predictor, let alone the synthesized final predictor.

The other category of methods for large-scale regression problems is the vertical methods. Their main idea is to first partition the whole feature space (i.e. the input domain) into multiple non-overlapping cells, where different partition methods (Suykens et al., 2002; Espinoza et al., 2006; Bennett and Blue, 1998; Wu et al., 1999; Chang et al., 2010) can be employed. Then, for each of the resulting cells, a predictor is trained on the samples falling into that cell via regression strategies such as Gaussian process regression (Park et al., 2011; Park and Huang, 2016; Park and Apley, 2018), support vector machines (Meister and Steinwart, 2016; Thomann et al., 2017), etc. However, the long-standing boundary discontinuity problem has always been a headache for vertical methods, since it degrades regression accuracy, and the literature has devoted considerable effort to settling it. For example, Park et al. (2011) first apply Gaussian process regression to each cell of the decomposed domain with equal boundary constraints imposed merely at a finite number of locations. After finding that this method cannot essentially solve the boundary discontinuities, they propose a solution specifically for this issue in Park and Huang (2016), which constrains the predictions of local regressions to share a common boundary. To further mitigate the boundary discontinuities, Park and Apley (2018) recently propose Patch Kriging (PK), which improves on previous work by adding pseudo-observations to the boundaries. However, the boundaries where two adjacent Gaussian processes are joined are artificially chosen, which may have a great impact on the final predictor. Moreover, their approach is fundamentally different in algorithmic structure from the original Gaussian process, which is a global method. Additionally, their method may not be well suited to parallel computing. Another vertical method, the Voronoi partition support vector machine (VP-SVM) (Meister and Steinwart, 2016), is amenable to parallel computing, but the boundary discontinuities are not demonstrably solved. Besides, their method also no longer shares the same spirit as the original global algorithm, LS-SVMs (Suykens et al., 2002). To the best of our knowledge, there is thus far no algorithm that not only overcomes the boundary discontinuity problem long plaguing the vertical methods, but also takes full advantage of the huge parallel computing resources brought by the big data era to obtain results that are both efficient and effective.

Aiming to solve these tough problems, in this paper we propose a novel vertical algorithm named the two-stage best-scored random forest, which is well suited to large-scale regression problems. To be specific, in stage one the feature space is partitioned, following an adaptive random splitting criterion, into a number of cells, which paves the way for parallel computing. In stage two, splits continue to be conducted on each cell separately following a purely random splitting criterion. Due to the inherent randomness of this splitting criterion, for each cell we are able to establish different regression trees under different partitions and then pick the one with the best empirical performance as the child best-scored random tree of that cell. Accordingly, we name this selection strategy the "best-scored" method. Subsequently, the concatenation of the child best-scored random trees from all cells forms a parent best-scored random tree. By repeating the above construction procedure, we are able to establish a number of different parent best-scored random trees, whose ensemble is exactly the two-stage best-scored random forest. The prominent strengths of our algorithm over other vertical methods can be demonstrated from the following perspectives:

(i) In most existing vertical methods, the feature space is usually partitioned artificially into different non-overlapping cells, and the original algorithm is then applied to each of these regions separately. In the original algorithm, the prediction at any point in the feature space is influenced by the information of all the sample points, whereas in the corresponding vertical methods, the prediction at any point may be affected only by the information of the sample points in its own cell. This usually leads to an essential change of the algorithm structure; accordingly, the global smoothness of the original method is jeopardized and only the smoothness within each cell can be guaranteed, often resulting in the boundary discontinuity problem. In contrast, this is never a problem for our two-stage best-scored random forest (TBRF) method, since a random forest (RF) is intrinsically an ensemble method, which brings asymptotic smoothness. As for our two-stage random forest method, we only divide the whole original splitting process of one tree into two stages for the sake of parallelism. This does not change the nature of TBRF as an RF method.

(ii)

Owing to the two-stage structure of our proposed algorithm and the architecture of random forests, the TBRF achieves satisfying performance in terms of computational efficiency and prediction accuracy, both of which have great significance in the big data era. Specifically, the computational efficiency is twofold. First, the algorithm can be significantly sped up by leveraging parallel computing in both stages. Since the parent trees in the forest require different adaptive random partitions of the feature space, which are conducted in stage one, we can assign each adaptive partition to a different core for acceleration. This is a direct advantage of the parallelism brought by the ensemble learning residing in the random forest. Moreover, the establishment of the child best-scored random trees, whose total number equals the total number of cells across all parent trees, can also be assigned to different cores, so that the computational burden is decentralized. Second, the adaptive random partition in stage one is completely data-driven, and this splitting mechanism makes the number of samples falling into each cell more evenly distributed. Therefore, it increases the number of effective splits and further reduces the training time under parallel computing. When it comes to prediction accuracy, we manage to incorporate some existing mainstream regression algorithms as value assignment methods into our random forest architecture. In addition to assigning only a constant to each terminal node of the trees, we employ a few alternatives, such as fitting linear regression functions for low-dimensional data and utilizing a Gaussian kernel for high-dimensional data, owing to their different performance on data of different dimensionality (see the sketch after this list). Numerical experiments further demonstrate the effectiveness of choosing appropriate assignment strategies for different data. Moreover, the asymptotic smoothness brought by the ensemble learning and the availability of many tunable hyperparameters further contribute to the improvement of accuracy.

(iii) The satisfactory performance of the two-stage best-scored random forest is supported by compact theoretical analysis under the framework of regularized empirical risk minimization. To be specific, by decomposing the error term into data-free and data-dependent error terms which are dealt with by techniques from approximation theory and empirical process theory, respectively, we establish the almost optimal learning rates for both parent best-scored random trees and their ensemble forest under certain mild assumptions on the smoothness of the target functions.
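To make the value-assignment idea in (ii) concrete, here is a minimal sketch, assuming scikit-learn estimators as stand-ins for the constant, linear, and Gaussian-kernel strategies; the function name, thresholds, and estimator choices are illustrative and not the authors' implementation.

```python
# Sketch of per-leaf value assignment: each leaf of a child tree may hold a
# constant, a linear model, or a Gaussian-kernel model fit only on the samples
# falling into that leaf. Estimator choices and thresholds are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

def fit_leaf_model(X_leaf, y_leaf, strategy="constant"):
    """Fit one value-assignment model on the samples of a single leaf."""
    if strategy == "constant" or len(y_leaf) < 5:
        c = float(np.mean(y_leaf))             # plain average, as in a standard tree
        return lambda X: np.full(len(X), c)
    if strategy == "linear":                   # suited to low-dimensional data
        model = LinearRegression().fit(X_leaf, y_leaf)
        return model.predict
    if strategy == "gaussian":                 # suited to higher-dimensional data
        model = SVR(kernel="rbf").fit(X_leaf, y_leaf)
        return model.predict
    raise ValueError(f"unknown strategy: {strategy}")
```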

The paper is organized as follows: Section 2 is dedicated to the explanation of the algorithm architecture. We present the main results and statements on the almost optimal learning rates in Section 3, with the corresponding error analysis demonstrated in Section 4. An architecture analysis and empirical comparisons between different vertical methods on real data sets are provided in Sections 5 and 6. For the sake of clarity, all the proofs of Sections 3 and 4 are presented in Section 7. Finally, we conclude the paper in Section 8.

2 Establishment of the Main Algorithm

In this section, we propose a new random forest method for regression which gathers the advantages of vertical methods and ensemble learning. A lucid illustration requires breaking the algorithm down into four steps. First, in stage one we adopt an adaptive random partition method to split the feature space into several cells. Second, by building a best-scored random tree for regression on each cell in stage two and gathering them together, we obtain a parent random tree. Third, due to the intrinsic randomness of the partition method, we are able to establish a certain number of parent random trees under different partitions of the feature space. Last but not least, by combining these parent random trees into an ensemble, we obtain the Two-stage Best-scored Random Forest.
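The four steps above can be summarized in the following sketch; `adaptive_random_partition`, `build_best_scored_child_tree`, and the cell/tree objects are hypothetical stand-ins for the components defined in the rest of this section, not the authors' code.

```python
# High-level sketch of TBRF training and prediction. Stage one partitions the
# feature space; stage two grows a best-scored child tree on each cell; the
# forest averages several such parent trees built under different partitions.
import numpy as np

def train_tbrf(X, y, n_parent_trees, n_candidates, n_cells, rng):
    forest = []
    for _ in range(n_parent_trees):
        # Stage one: adaptive random partition into cells; each cell records
        # the indices of the training samples falling into it.
        cells = adaptive_random_partition(X, n_cells, rng)
        parent_tree = []
        for cell in cells:
            idx = cell.sample_idx
            # Stage two: keep the best of n_candidates purely random trees.
            child = build_best_scored_child_tree(X[idx], y[idx], n_candidates, rng)
            parent_tree.append((cell, child))
        forest.append(parent_tree)
    return forest

def predict_tbrf(forest, x):
    # Each parent tree delegates x to the child tree of the cell containing x;
    # the forest prediction is the average over parent trees.
    preds = []
    for parent_tree in forest:
        cell, child = next((c, t) for c, t in parent_tree if c.contains(x))
        preds.append(child.predict(x))
    return float(np.mean(preds))
```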

2.1 Notations

The goal in a supervised learning problem is to predict the value of an unobserved output variable $Y$ after observing the value of an input variable $X$. To be exact, we need to derive a predictor $f : \mathcal{X} \to \mathcal{Y}$ which maps the observed input value of $X$ to a prediction $f(X)$ of the unobserved output value of $Y$. The choice of predictor should be based on the training data $D = ((x_1, y_1), \ldots, (x_n, y_n))$ of i.i.d. observations, which have the same distribution as the generic pair $(X, Y)$, drawn from an unknown probability measure $\mathrm{P}$ on $\mathcal{X} \times \mathcal{Y}$. We assume that $\mathcal{X} \subset \mathbb{R}^d$ is non-empty, that $\mathcal{Y} \subset [-M, M]$ for some $M > 0$, and that $\mathrm{P}_X$ denotes the marginal distribution of $X$.

According to the learning target, it is legitimate to consider the least squares loss $L(x, y, t) := (y - t)^2$. Then, for a measurable decision function $f : \mathcal{X} \to \mathbb{R}$, the risk is defined by
\[
\mathcal{R}_{L, \mathrm{P}}(f) := \int_{\mathcal{X} \times \mathcal{Y}} L(x, y, f(x)) \, d\mathrm{P}(x, y),
\]
and the empirical risk is defined by
\[
\mathcal{R}_{L, \mathrm{D}}(f) := \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, f(x_i)),
\]
where $\mathrm{D} := \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}$ is the empirical measure associated to the data and $\delta_{(x_i, y_i)}$ is the Dirac measure at $(x_i, y_i)$. The Bayes risk, which is the minimal risk with respect to $\mathrm{P}$ and $L$, is given by
\[
\mathcal{R}^{*}_{L, \mathrm{P}} := \inf \bigl\{ \mathcal{R}_{L, \mathrm{P}}(f) \;:\; f : \mathcal{X} \to \mathbb{R} \text{ measurable} \bigr\}.
\]
In addition, a measurable function $f^{*}_{L, \mathrm{P}}$ with $\mathcal{R}_{L, \mathrm{P}}(f^{*}_{L, \mathrm{P}}) = \mathcal{R}^{*}_{L, \mathrm{P}}$ is called a Bayes decision function. By minimizing the risk, the Bayes decision function is
\[
f^{*}_{L, \mathrm{P}}(x) = \mathbb{E}(Y \mid X = x),
\]
which is a $\mathrm{P}_X$-almost surely $[-M, M]$-valued function.

In order to achieve our two-stage random forest for regression, we first consider the development of the parent random tree under one specific feature space partition. Therefore, we assume that $\{A_j\}_{j=1}^{m}$ is a partition of $\mathcal{X}$ such that none of its cells is empty, that is, $A_j \neq \emptyset$ for every $j = 1, \ldots, m$. To present our approach in a clear and rigorous mathematical expression, we need to introduce some more definitions and notations. First of all, the index set is defined as
\[
I_j := \{ i \in \{1, \ldots, n\} : x_i \in A_j \},
\]
which indicates the samples of $D$ contained in $A_j$, together with the corresponding data set
\[
D_j := \{ (x_i, y_i) \in D : i \in I_j \}.
\]
Additionally, for every $j = 1, \ldots, m$, the loss on the corresponding cell is defined by
\[
L_{A_j}(x, y, t) := \mathbf{1}_{A_j}(x) \, L(x, y, t),
\]
where $L$ is the least squares loss for our regression problem.
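As a small numerical illustration of the cell-restricted loss, the sketch below computes the plain empirical least squares risk and its cell-restricted counterpart, assuming the indicator-based definition above; the cell, data and function names are purely illustrative.

```python
# Empirical least squares risk and its restriction to one cell A_j: the loss
# is counted only for samples whose inputs fall into the cell, while the
# average is still taken over the full sample size n.
import numpy as np

def empirical_risk(y, y_pred):
    return np.mean((y - y_pred) ** 2)

def cell_empirical_risk(X, y, y_pred, in_cell):
    mask = in_cell(X)                       # boolean vector: is x_i in A_j ?
    return np.sum(mask * (y - y_pred) ** 2) / len(y)

# Toy example with the cell A_j = [0, 0.5) x [0, 1] in two dimensions.
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
y_pred = np.zeros(100)
print(empirical_risk(y, y_pred))
print(cell_empirical_risk(X, y, y_pred, lambda X: X[:, 0] < 0.5))
```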

2.2 Best-scored Random Trees

One crucial step of the two-stage best-scored random forest algorithm is building parent best-scored random trees under certain partitions of the feature space. Therefore, we first focus on the development of one parent tree, which is the summation of its child trees. An appropriate approach for splitting the feature space is indispensable for establishing the tree, so we introduce a random partition method for our case.

2.2.1 Purely Random Partition

The purely random forest, put forward by Breiman (2000), is an algorithm parallel to forests based on well-known splitting criteria such as information gain (Quinlan, 1986), the information gain ratio (Quinlan, 1993), and the Gini index (Breiman et al., 1984). Since it is widely acknowledged that forests established by the latter three criteria are not universally consistent, while consistency can be obtained with the purely random splitting criterion, we base our forest on the purely random criterion.

A clear illustration of the splitting mechanism at the $t$-th step of one possible random tree construction requires a random triplet. The first term of the triplet denotes the leaf to be split at the $t$-th step, chosen uniformly from all the leaves present at the $(t-1)$-th step. The second term of the triplet represents the dimension chosen to be split for that leaf; these dimensions are i.i.d. multinomial random variables, with all dimensions having equal probability of being chosen. The third term is a proportional factor, namely the ratio of the length, in the chosen dimension, of the newly generated leaf after the $t$-th split to the length of the chosen leaf in that dimension. In this manner, the length of the newly generated leaf in the chosen dimension can be calculated by multiplying the length of the chosen leaf in that dimension by the proportional factor. We mention here that these proportional factors are independently and identically drawn from the uniform distribution.

To provide more insight into the above mathematical formulation of the splitting process of the purely random tree, we take the tree construction on the feature space as a simple example; the construction on a cell proceeds in the same way. One specific construction procedure is shown in Figure 1. First of all, we pick one dimension out of the candidates at random, and then split uniformly at random along that dimension. The resulting split, a hyperplane parallel to the axes, partitions the space into two leaves. Next, one of these leaves is chosen uniformly at random, and we go on picking the dimension and the cut-point uniformly at random to implement the second split, which yields a partition into three leaves. When conducting the third split, we again randomly select one of the leaves present after the last step, and the third split is once more conducted on it as before, resulting in a partition into four leaves. The above recursive process does not stop until the number of splits is satisfactory. Further scrutiny shows that the splitting procedure induces a partition variable taking values in the space of possible partitions; from now on, we denote the probability measure of this partition variable accordingly.

Figure 1: Possible construction procedures of -split axis-parallel purely random partitions in a -dimensional space.
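The splitting mechanism just illustrated can be sketched as follows, with cells stored as axis-parallel boxes; the uniform choice of leaf, dimension, and cut proportion mirrors the description above, though the data structures are only illustrative.

```python
# Purely random partition: at each step choose a leaf uniformly at random,
# then a dimension uniformly at random, then cut the leaf at a uniformly
# drawn proportion of its side length in that dimension.
import numpy as np

def purely_random_partition(d, p, rng, lower=None, upper=None):
    """Perform p purely random splits of the box [lower, upper] in R^d."""
    lower = np.zeros(d) if lower is None else np.asarray(lower, dtype=float)
    upper = np.ones(d) if upper is None else np.asarray(upper, dtype=float)
    leaves = [(lower.copy(), upper.copy())]
    for _ in range(p):
        i = rng.integers(len(leaves))        # leaf to be split, chosen uniformly
        dim = rng.integers(d)                # dimension, chosen uniformly
        s = rng.uniform()                    # proportional factor in (0, 1)
        lo, up = leaves.pop(i)
        cut = lo[dim] + s * (up[dim] - lo[dim])
        left_up, right_lo = up.copy(), lo.copy()
        left_up[dim], right_lo[dim] = cut, cut
        leaves.extend([(lo, left_up), (right_lo, up)])
    return leaves

# Example: three splits of the unit square, in the spirit of the construction above.
cells = purely_random_partition(d=2, p=3, rng=np.random.default_rng(1))
```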

It is legitimate to assume that any specific partition variable can be recognized as a latent splitting criterion. To be specific, if we consider a -split procedure carried out by following , then the collection of the resulting non-overlapping leaves can be defined by , and further abbreviated as . Now, if we focus on the partition of a certain cell , for example, then we have . Moreover, any point is bound to fall into a certain cell, which can then be denoted by .

Here, we introduce a map defined by

(1)

where the event set is defined by

Formula (1) is called the random tree decision rule for regression on .

2.2.2 Child Best-scored Random Tree

In this subsection, we consider the procedure for establishing a child best-scored random tree on the feature space. Specifically, the child random tree is originally developed on its cell and then extended to the whole feature space. Since the performance of a tree obtained by conducting the random partition only once may not be desirable, we improve on this by choosing the tree with the best performance out of several candidates on the cell. The tree picked out is then called the child best-scored random tree. Therefore, when analyzing the behaviors of the candidate trees on the cell, we suppose that the splitting procedures they follow can be represented by independent and identically distributed random variables drawn from the corresponding probability measure.
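The selection of a child best-scored random tree can be sketched as below, scoring each candidate by a regularized empirical risk that penalizes the squared number of splits, in line with the criterion formalized later in this subsection; `grow_purely_random_tree` is a hypothetical helper.

```python
# Best-scored selection: grow k candidate purely random trees on the cell's
# data and keep the one with the smallest regularized empirical risk.
import numpy as np

def best_scored_child_tree(X_cell, y_cell, k, lam, rng):
    best_tree, best_score = None, np.inf
    for _ in range(k):
        tree = grow_purely_random_tree(X_cell, y_cell, rng)   # one candidate
        risk = np.mean((y_cell - tree.predict(X_cell)) ** 2)  # empirical risk
        score = lam * tree.n_splits ** 2 + risk               # regularized risk
        if score < best_score:
            best_tree, best_score = tree, score
    return best_tree
```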

For a clearer illustration of the theoretical analysis, we first give the definitions of some function sets. We assume that is a function set containing all the possible partitions of a random tree over , which is defined as follows:

(2)

Here, we choose as the number of splits; the resulting leaves, presented as , actually form a -split partition of . It is important to note that is the value of leaf . Without loss of generality, in this paper we only consider cells with the shape of . Moreover, for , we derive the function set induced by the splitting policy as

(3)

where represents the resulting -split partition of by following the splitting policy . Note that is a subset of .

However, we should notice that every function is only defined on while a random tree function from to is finally needed. To this end, for every , we define the zero-extension by

(4)

which is equipped with the same number of splits as the decision tree it extends. Then, the function set defined only on the cell can also be extended to the whole feature space, that is,

(5)

Moreover, the extension of the function set can also be obtained in the same manner, which is

(6)

Furthermore, we denote .
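For readability, the zero-extension in (4) can be written out as below; this is a hedged reconstruction in generic symbols (a function $f$ on a cell $A_j$ and its extension $\hat{f}$ on $\mathcal{X}$), since the original notation was lost in extraction.

\[
  \hat{f}(x) \;:=\; f(x)\,\mathbf{1}_{A_j}(x)
  \;=\;
  \begin{cases}
    f(x), & x \in A_j,\\
    0,    & x \in \mathcal{X} \setminus A_j,
  \end{cases}
\]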

In order to find an appropriate random tree decision rule under a given splitting policy, we need to solve an optimization problem. To this end, we conduct our analysis under the framework of regularized empirical risk minimization. To begin with, regularized empirical risk minimization is a learning method that prepares us for the more involved analysis of our specific random forest. Let be a loss, let be a non-empty subset of the set of measurable functions on , and let be a function. A learning method whose decision function satisfies

for all and is called regularized empirical risk minimization.

In this paper, we propose penalizing the number of splits. By penalizing it, we are able to constrain the complexity of the function set so that the set has a finite VC dimension (Vapnik and Chervonenkis, 1971), which makes the algorithm PAC learnable (Valiant, 1984). Besides, it also keeps the learning results from overfitting. With the data set , the above regularized empirical risk minimization problem with respect to each function set turns into

(7)

It is worth mentioning that, since the exponent of the number of splits does not influence the performance of the selection procedure, we penalize its square to obtain better convergence properties.

Observe that the regularized empirical risk minimization under any policy can be bounded simply by considering the case where no split is applied to the cell. Consequently, we present the optimization problem as follows:

where  stands for the empirical risk obtained when no split is applied. Therefore, from the above inequality, the number of splits is upper bounded accordingly. Consequently, the capacity of the underlying function set can be largely reduced, and here and subsequently, the function sets will all be equipped with this extra condition on the number of splits.
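As a hedged sketch in generic notation (the exact symbols were lost in extraction), the cell-wise regularized objective of (7) and the no-split comparison just described can be written as

\[
  f_{\mathrm{D},j} \in \operatorname*{arg\,min}_{f}\; \lambda\, p(f)^2 + \mathcal{R}_{L_{A_j},\mathrm{D}}(f),
  \qquad
  \lambda\, p(f_{\mathrm{D},j})^2
  \;\le\; \lambda\, p(f_{\mathrm{D},j})^2 + \mathcal{R}_{L_{A_j},\mathrm{D}}(f_{\mathrm{D},j})
  \;\le\; \mathcal{R}_{L_{A_j},\mathrm{D}}(f_0),
\]

so that $p(f_{\mathrm{D},j}) \le \sqrt{\mathcal{R}_{L_{A_j},\mathrm{D}}(f_0)/\lambda}$, where $f_0$ denotes the tree with zero splits.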

To establish the random tree decision rule for regression on , we zero-extend (1) to the whole feature space. It can be readily observed that our random tree decision rule on  induced by  is the solution to the optimization problem (7), and it can be further denoted by

(8)

where is the number of splits of the decision function . Its population version is presented by

It is necessary to note that our primary idea is to conduct the regularized empirical risk minimization problem using and , which is

(9)

It can be observed that when we take , the solution of the optimization problem (9) coincides with (8) on . Since the following analysis will be carried out on , we can directly optimize (8). Furthermore, it is easy to verify that if a Bayes decision function w.r.t.  and  exists, it is additionally a Bayes decision function w.r.t.  and .

Now, we focus on establishing the best-scored random tree on induced by , also called the child best-scored random tree, which is chosen from candidates. The main principle is to retain only the tree yielding the minimal regularized empirical risk, which is

(10)

where is the number of splits of and . Apparently, is the regularized empirical risk minimizer with respect to the random function set

(11)

Put another way, is the solution to the regularized empirical risk minimization problem

Similarly, we denote by  the solution of the population version of the regularized minimization problem in the set

(12)

We mention here that is the corresponding number of splits of .

2.2.3 Parent Best-scored Random Tree

In this subsection, we first build the parent random tree by adding up all the child trees. After that, in order to show that our parent random tree is indeed a solution of a usual random tree algorithm on the feature space, we need to consider the indicator function sets, defined on the whole feature space, of a child random tree, as well as direct sums of the indicator function sets of several trees.

First of all, adding all child best-scored random trees generated by (10) together leads to the parent best-scored random tree, which is defined by

(13)

where denotes the splitting criteria on .
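The action of the parent tree in (13) on a point can be sketched as follows; since each zero-extended child tree vanishes outside its own cell, the sum reduces to evaluating the child tree of the cell containing the point. The box representation and `child.predict` interface are illustrative assumptions.

```python
# Parent-tree prediction as a sum of zero-extended child trees: only the child
# tree whose cell contains x contributes to the sum.
import numpy as np

def parent_tree_predict(cells, child_trees, x):
    """cells: list of (lower, upper) boxes; child_trees: matching child trees."""
    total = 0.0
    for (lo, up), child in zip(cells, child_trees):
        if np.all(lo <= x) and np.all(x < up):   # indicator of the cell
            total += child.predict(x)            # zero outside its own cell
    return total
```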

Recall that we have mentioned the process of extending the indicator function set of a tree on a cell to an indicator function set on the whole feature space in (4) and (5); we now give a formal description of it in the following proposition.

Let and be an indicator function space of the form (2) on . Denote by the zero-extension of to defined by

Then, the set  is still an indicator function set on . We define the number of splits of the decision tree on  to be the same as the number of splits on , which is

(14)

Based on this proposition, we are now able to construct an indicator function set by a direct sum of indicator function sets and with and .

For  such that  and , let  and  be indicator function sets of the form (2) on  and , respectively. Furthermore, let  and  be the indicator function sets of all functions of  and  extended to  in the sense of the proposition above. Let  and , given by (14), be the associated numbers of splits. Then  and hence the direct sum

exists. The direct sum is also an indicator function set of random trees. For  and , let  and  be the unique functions such that . Then, we define the number of splits on the direct sum space by

To relate the two propositions above with (13), we need to introduce more notation. For pairwise disjoint  with , let  be the best-scored function space (11) induced by  for every , in the sense of the first proposition above. A joined indicator function space of  can therefore be designed analogously to the second proposition. Specifically, for an arbitrary index set  and a vector , the direct sum

where , is still an indicator function space of random tree with squared number of splits

(15)

If , we simply write . Note that it contains, inter alia, the parent tree given by (13).

Here, we briefly investigate the regularized empirical risk of . For arbitrary , we have

(16)

The first equality follows from Meister and Steinwart (2016). The second equality holds because the risk of  on  equals that of . The inequality is a direct result of (10), where the number of splits for arbitrary , according to the proposition above, is defined by , and  is the corresponding number of splits of  on . The last two equalities hold in the same way as the first two.

Judging from (16),  is the random tree function with respect to  and , as well as the regularization parameter . In other words, the best-scored random tree derived from the joined space equals our parent best-scored random tree (13).

For the sake of clarity, we summarize some assumptions for the joined best-scored function sets as follows:

[Joined best-scored decision tree spaces] For pairwise disjoint subsets of , let be the best-scored random tree function sets induced by . Consequently, for , we define the joined best-scored function space and equip it with the number of splits (15).

2.3 Two-stage Best-scored Random Forest

Having developed the parent random tree under one specific partition of the feature space, it is legitimate to ponder whether we can devise an ensemble of trees by injecting randomness into the feature space partition in stage one. To fulfill this idea, we propose a data splitting approach named the adaptive random partition and establish the Two-stage Best-scored Random Forest by ensemble learning.

2.3.1 Adaptive Random Partition of the Feature Space

To describe the above two-stage random forest algorithm, it suffices for the stage-one partition to be some partition of the feature space. Nevertheless, in view of the theoretical investigation of the learning rates of our new algorithm, we need to further specify the partition. For this purpose, we denote a series of balls with radius  and mutually distinct centers  by

where is the Euclidean norm in . Furthermore, we can choose and such that .

Considering how large the sample size can be and how the sample density may vary across the feature space, we propose an adaptive random partition approach. This method serves as a preprocessing step that partitions the feature space into cells containing fewer data points, which facilitates the subsequent regression on each cell. Moreover, owing to the randomness residing in the partition, it paves the way for the ensemble. A considerable advantage of this proposal over the purely random partition is that it efficiently takes the sample information into consideration. To be precise, since the construction of the purely random partition is independent of the data set, it may suffer from over-splitting in sample-sparse areas and under-splitting in sample-dense areas. The adaptive random partition, however, is wiser in that it utilizes the sample information in a relatively simple way while still fulfilling the objective of dividing the space into small cells. The specific partition procedure is similar to the process proposed in Section 2.2.1, with the difference lying in how the to-be-split cell is chosen.

In the purely random partition, the first term of the random vector denotes the randomly chosen cell to be split at each step of the tree construction. Here, we propose that when choosing a to-be-split cell, we first randomly select a number of sample points from the training data set, which are then labeled by the cells they belong to. We then choose the cell receiving the majority of these labels as the cell to be split. This idea follows from the fact that when randomly picking sample points from the whole training data set, cells with more samples are more likely to be selected, while cells with fewer samples are less likely to be chosen. In this manner, we may obtain feature space partitions where the sample sizes of the resulting cells are more evenly distributed.
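A minimal sketch of this majority-vote selection is given below, assuming a hypothetical `cell_index_of` helper that returns the index of the cell currently containing a point; the batch size plays the role of the number of randomly selected sample points.

```python
# Adaptive choice of the to-be-split cell: draw a random batch of training
# points, label each by the cell it lies in, and split the cell with the most
# votes, so that sample-dense cells are split more often.
import numpy as np
from collections import Counter

def choose_cell_to_split(X, cells, batch_size, rng):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    votes = Counter(cell_index_of(X[i], cells) for i in idx)
    return votes.most_common(1)[0][0]            # index of the majority-voted cell
```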

2.3.2 Ensemble Forest

We now construct the two-stage best-scored random forest based on averaging the parent best-scored random trees. Due to the intrinsic randomness residing in the partition method, we are able to construct several different parent best-scored random trees under different partitions of the feature space. To be specific, each of these trees is generated according to the procedure in (13) under a different input partition. To clarify, the splitting criterion for each tree in the forest consists of the splitting criteria corresponding to its child best-scored random trees on their respective cells. Moreover, we denote the parent best-scored trees in the forest accordingly. As usual, we average them to obtain the two-stage best-scored random forest decision rule

(17)

where denotes the collection of all splitting criteria of trees in the forest. Finally, we establish our large-scale regression predictor, the two-stage best-scored random forest .
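Finally, the forest prediction in (17) simply averages the parent trees; a sketch under the same illustrative representation as before (each parent tree stored as its cells and child trees, with `parent_tree_predict` from the earlier sketch) is:

```python
# Two-stage best-scored random forest prediction: average the predictions of
# the parent best-scored random trees, each built under its own stage-one
# partition of the feature space.
import numpy as np

def tbrf_predict(parent_trees, x):
    """parent_trees: list of (cells, child_trees) pairs, one per parent tree."""
    preds = [parent_tree_predict(cells, children, x)
             for cells, children in parent_trees]
    return float(np.mean(preds))
```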

3 Main Results and Statements

In this section, we present the main results on the oracle inequalities and learning rates for the random trees and forests.

3.1 Fundamental Assumption

In this paper, we are interested in the ground-truth functions that satisfy the following restrictions on their smoothness:

The Bayes decision function is -Hölder continuous with respect to -norm . That is, there exists a constant such that

3.2 Oracle Inequality for Parent Best-scored Random Trees

We now establish an oracle inequality for parent best-scored random trees based on the least squares loss and best-scored function space.

Let for , be the least squares loss, be the probability measure on and be the probability measure induced by the splitting criterion . Then for all , and , the parent best-scored random tree (13) satisfies

with probability at least , where  is a constant depending on ,  and . The result holds for every parent best-scored random tree splitting criterion .

3.3 Learning Rates for Parent Best-scored Random Trees

We now state our main result on the learning rates for parent best-scored random trees based on the established oracle inequality.

Let be the least squares loss, be the probability measure on and be the probability measure induced by the splitting criterion . Let be a partition of and be the number of candidate trees on . Suppose that the Bayes decision function satisfies Assumption 3.1 with exponent . Then for all and , with probability at least , there holds for the parent best-scored random tree (13) that

where  and  depend on  and .

3.4 Learning Rates for Two-stage Best-scored Random Forest

We now present the main result on the learning rates for the two-stage best-scored random forest in (17). This diverse and accurate ensemble forest is based on the collection of parent best-scored random trees generated by different feature space partitions.

Let be the least squares loss, be the probability measure on and be the probability measure induced by the splitting criterion . Let the collection of different partitions that generate the ensemble be and be the number of candidate trees on . Suppose that the Bayes decision function satisfies Assumption 3.1 with exponent . Then, for all and , with probability at least , there holds

where  depends on  and .

According to the proof of Theorem 3.4, we find that the coefficient may decrease as the number of trees in the forest increases. In other words, in theory, more trees may lead to a smoother forest predictor and, therefore, better learning rates. Moreover, this phenomenon is also supported by the experimental results shown later in Figure 4, where the predictor becomes smoother and fits better as the number of trees increases.

3.5 Comments and Discussions

In this subsection, we present some comments and discussions on the obtained theoretical results: the oracle inequality and the learning rates for the parent random trees and for the two-stage best-scored random forest.

We highlight that our two-stage best-scored random forest algorithm aims at dealing with regression problems involving enormous amounts of data. To begin with, vertical methods for large-scale regression have gained popularity in the literature owing to their capability for parallel computing. In this paper, we adopt a decision-tree-like feature space splitting criterion, named the adaptive random partition, which defines the partition in stage one. Moreover, the subsequent partitions for growing random trees on the cells resulting from stage one are called the partitions in stage two, and they follow a purely random splitting criterion. In the literature, classical splitting criteria such as information gain, the information gain ratio and the Gini index have been scrutinized mostly from the perspective of experimental performance, while only a few works address theoretical learning rates, such as Biau (2012) and Scornet et al. (2015). However, the conditions under which their learning rates are derived are too strong to be verified in practice. Compared to these classical splitting criteria, our purely random splitting criterion achieves satisfactory learning rates with only mild assumptions on the smoothness of the Bayes decision functions.

Second, we propose a novel idea for our model selection process, which we call the best-scored method. To clarify, choosing the random tree with the best regression performance out of several candidates helps to improve the accuracy of the base predictors. For a given order of the number of splits, when the number of candidates is large enough, the function space generated by those trees will also be large enough to cover sufficiently many possible partition results. Consequently, the probability is high that we choose a random tree with very good performance, which leads to a remarkably small approximation error.

Third, the learning rate of one parent best-scored random tree and the learning rate of the two-stage best-scored random forest are of the same order. Here, we should notice that, due to the intrinsic randomness of our splitting criterion, for a -split random tree, the effective number of splits in each dimension is approximately  rather than . Moreover, since  is concerned with the capacity of the partition function space and our function space is not that large, we can take  as small as possible, even close to .

In the machine learning literature, various vertical and horizontal regression methods have been studied extensively and are well understood. For example, a vertical-like method mixing nearest neighbors and SVMs for regression is theoretically scrutinized by Hable (2013). In that paper, for every test point, the global SVM is applied to its nearest neighbors instead of to the whole training data, and universal risk-consistency is established. In Meister and Steinwart (2016), the authors derive the learning rate of the localized SVM when the Bayes decision function lies in a Besov-like space with a certain degree of smoothness. As for large-scale regression with horizontal methods, Zhang et al. (2015) propose a divide-and-conquer kernel ridge regression and provide learning rates with respect to different kernels. With the Bayes decision function in the corresponding reproducing kernel Hilbert space (RKHS), they obtain one learning rate for kernels of finite rank and another for kernels in a Sobolev space with a certain degree of smoothness, both of which are minimax-optimal. Lin et al. (2017) conduct distributed learning with the least squares regularization scheme in an RKHS and obtain almost optimal learning rates in expectation; these rates are established under a smoothness assumption with respect to a power of the integral operator and a related capacity assumption. Guo et al. (2017) focus on distributed regression with a bias-corrected regularization kernel network and also obtain learning rates whose order depends on a capacity-related parameter. According to the above analysis, the work presented in our study offers not only innovations but also complete theoretical support.

4 Error Analysis

In this section, we give error analysis by bounding the approximation error term and the sample error term, respectively.

4.1 Bounding the Approximation Error Term

Denote the population version of the parent best-scored random tree as

with as in (12). The following theoretical result on bounding the approximation error term shows that, under smoothness assumptions for the Bayes decision function, the regularized approximation error possesses a polynomial decay with respect to each regularization parameter .

Let be the least squares loss, be the probability measure on with marginal distribution , be the probability measure induced by the splitting criterion . Assume that is a partition of and is the number of candidate trees on each . Suppose that the Bayes decision function satisfies Assumption 3.1 with exponent . Then, for any fixed and , with probability at least , there holds that

where is a constant depending on and , and is a universal constant.

4.2 Bounding the Sample Error Term

To establish the bounds on the sample error, we give four descriptions of the capacity of the function set in the definitions below. Then, we analyze the complexity of the regression function set so as to derive the sample error bounds. More specifically, the complexity of the random forest function set comes from two aspects: one induced by the feature space partition and the other induced by the value assignment.

Firstly, we consider the complexity induced by the partition. In this case, we may scrutinize the situation where there is a binary value assignment, i.e. . In particular, we need to focus on its VC dimension, covering numbers and entropy numbers (see the lemmas below). Secondly, there exists a relationship, in terms of the empirical Rademacher average, between the complexity induced by binary value assignment and the complexity induced by continuous value assignment. Therefore, we are able to derive the empirical Rademacher average for regression in the corresponding lemma.

[VC dimension] Let be a class of subsets of and be a finite set. The trace of on is defined by . Its cardinality is denoted by . We say that shatters if , that is, if for every , there exists a such that . For , let

Then, the set  is a Vapnik-Chervonenkis class if there exists  such that , and the minimal such  is called the VC dimension of , abbreviated as .

[Covering Numbers] Let  be a metric space,  and . We call  an -net of  if for all  there exists an  such that . Moreover, the -covering number of  is defined as

where denotes the closed ball in centered at with radius .

[Entropy Numbers] Let be a metric space, and