Random forests are non-parametric ensembles of regression or classification trees that rely heavily on bagging and feature sub-spacing during tree construction. The aim is to obtain highly predictive models by averaging (for continuous outcomes) or taking majority votes (for categorical outcomes) over CART trees constructed on bootstrapped samples. At each node of a tree, the best cut is selected by optimizing a CART-split criterion such as the Gini impurity (for classification) or the squared prediction error (for regression) over a random subset of the features. This methodology has been shown to work well in predicting new outcomes, as first demonstrated in breiman2001random . Closely related to the prediction of a new instance, however, is the question of how reliable that prediction is. For example, in khalilia2011predicting , random forest models were used to predict disease risk in highly imbalanced data. Beyond point estimates, however, little was known about the dispersion of the disease risk predictions. In fact, estimating residual variance with machine learning techniques has received far less attention than pure prediction. One exception is mendez2011estimating , where bootstrap-corrected residual variance estimators are proposed. They are analyzed in a simulation study for regression problems, but no theoretical guarantees such as consistency have been proven. A similar observation holds for the jackknife-type sampling variance estimators given in wager2014confidence . In the present paper we close this gap by investigating the theoretical properties of a new residual variance estimator within the random forest framework. The estimator is inspired by the one proposed in mendez2011estimating and is shown to be consistent for the residual variance if the random forest estimate of the regression function is $L_2$-consistent.
Our theoretical derivations build upon existing results in the literature.
First theoretical properties of the random forest method such as ($L_2$-)consistency have already been proven in breiman2001random , while connections to layered nearest neighbors were made in lin2006random and biau2010layered . The early consistency results were later extended by several authors (meinshausen2006quantile; biau2008consistency; wager2014confidence; scornet2015consistency; scornet2016asymptotics), in particular allowing for stronger results (such as central limit theorems) or for a more realistic mathematical model that better approximates the random forest as applied in practice.
The interplay of several mechanisms, namely feature sub-spacing, bagging and the tree construction process, makes the analysis of the random forest as applied in practice very complicated.
In the current work, we therefore build upon the mathematical description of the random forest method given in scornet2015consistency . This makes our estimator applicable to a wide range of functional relationships while also incorporating relevant features of the algorithm such as the split criterion.
Our paper is structured as follows. In the next section, we give a brief overview of the random forest, state the model framework and recall consistency results. In the third section, we provide a residual variance estimate and prove its consistency in the $L_1$ sense. Furthermore, bias-corrected residual variance estimators are proposed. All proofs can be found in the appendix.
2 Model Framework and Random Forest
Our framework is the regression estimation setting in which the covariable vector $X$ is assumed to lie in the $p$-dimensional unit cube, i.e. $X \in [0,1]^p$. Of primary interest in the current paper is the estimation of the residual variance $\sigma^2$ in a functional relation of the form
$$Y = m(X) + \epsilon. \tag{1}$$
Here, $Y \in \mathbb{R}$, $m : [0,1]^p \to \mathbb{R}$, and $\epsilon$ with $E[\epsilon] = 0$ and $\operatorname{Var}(\epsilon) = \sigma^2 \in (0,\infty)$ is independent of $X$. Given a training set
$$\mathcal{D}_n = \{(X_i, Y_i) : i = 1, \dots, n\} \tag{2}$$
of i.i.d. pairs $(X_i, Y_i)$, $i = 1, \dots, n$, we aim to deliver an estimate of $\sigma^2$ that is at least $L_1$-consistent. The construction of this estimate will be based on the random forest estimate $m_n$ approximating the regression function $m$. In the sequel, we adopt the notation of scornet2015consistency and briefly introduce the random forest model and the mechanisms involved in it.
The random forest model for regression is a collection of $M$ regression trees, where for each tree a bootstrap sample is taken from $\mathcal{D}_n$, either with or without replacement. This choice is referred to as the resampling strategy. Sampling strategies other than these two have been considered within the random forest model in ramosaj2017wins , for example. Furthermore, at each node of the tree, feature sub-spacing is conducted by selecting $m_{\mathrm{try}}$ features as possible split directions. Denote by $\Theta$ the generic random variable responsible for both the bootstrap sample construction and the feature sub-spacing procedure. Then $\Theta_1, \dots, \Theta_M$ are assumed to be independent copies of $\Theta$, with $\Theta_j$ governing this random process in the $j$-th tree, independent of $\mathcal{D}_n$. The trees are combined through averaging, i.e.
$$m_{M,n}(x; \Theta_1, \dots, \Theta_M, \mathcal{D}_n) = \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, \mathcal{D}_n),$$
which is referred to as the finite forest estimate of $m$. As explained in scornet2015consistency , the strong law of large numbers (for $M \to \infty$) allows one to study $E_\Theta[m_n(x; \Theta, \mathcal{D}_n)]$ instead of $m_{M,n}$. Hence, we set
$$m_n(x) = m_n(x; \mathcal{D}_n) = E_\Theta[m_n(x; \Theta, \mathcal{D}_n)]. \tag{3}$$
Similar to scornet2015consistency , we characterize the random forest algorithm by three parameters responsible for the tree construction:
$m_{\mathrm{try}} \in \{1, \dots, p\}$, the number of pre-selected directions for splitting,
$a_n \in \{1, \dots, n\}$, the number of sampled points in the bootstrap step and
$t_n \in \{1, \dots, a_n\}$, the number of leaves in each tree.
Let be a sequence of generic cells in obtained at tree depth , and denote by the number of observations falling in , where we set . Here, we denote a cut as the pair , where represents the selected variable in which its domain is cut at . Furthermore, let be the set of all possible cuts in . It should be noted that the restriction of the feature domain to the -dimensional unit-cube is no restriction since the random forest is invariant under monotone transformations.
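To make the notion of a cut concrete, the following minimal sketch selects the best cut within a single cell for regression. It is an illustration under stated assumptions, not the algorithm of scornet2015consistency: the names `sse`, `best_cut` and `mtry` are hypothetical, candidate cut positions are taken to be midpoints between consecutive observed feature values, and the post-split sum of squared errors is minimized, which is equivalent to maximizing the decrease in squared prediction error.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty cell)."""
    if not values:
        return 0.0
    center = sum(values) / len(values)
    return sum((v - center) ** 2 for v in values)


def best_cut(X, y, mtry, rng):
    """Return (score, j, z): the cut (j, z) with the smallest post-split SSE.

    X    : list of feature vectors falling in the current cell
    y    : list of responses for those points
    mtry : number of randomly pre-selected split directions (assumed name)
    rng  : a random.Random instance driving the feature sub-spacing
    """
    p = len(X[0])
    directions = rng.sample(range(p), mtry)  # feature sub-spacing
    best = None
    for j in directions:
        values = sorted(set(row[j] for row in X))
        # Candidate positions: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            z = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < z]
            right = [yi for row, yi in zip(X, y) if row[j] >= z]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, z)
    return best
```

With `mtry` equal to the number of features, every direction is searched; smaller values restrict the search to a random subset, which is the feature sub-spacing described above.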
Then formally, the random forest algorithm constructs decision trees resulting in regression estimators according to the following algorithm:
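In order to establish $L_1$-consistency of the residual variance estimate, we require at least $L_2$-consistency of the random forest method. That is,
$$\lim_{n \to \infty} E\big[(m_n(X) - m(X))^2\big] = 0, \tag{4}$$
where the expectation is taken with respect to $X$ and $\mathcal{D}_n$. Here, $X$ is an independent copy of $X_i$ for $i = 1, \dots, n$.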
Several authors have attempted to prove that (4) is valid, i.e. that random forests are consistent in the $L_2$ sense. For example, biau2008consistency considered a simplified version of the random forest in which cuts are made in a purely random fashion, independently of the response variable. scornet2015consistency established consistency of the original random forest under the assumption that $m$ is an additive expansion of continuous functions on the unit cube. Therein, proofs are provided for fully grown trees ($t_n = a_n$) and for not fully grown trees, under additional assumptions on the asymptotic relation between $t_n$ and $a_n$. For example, Theorem 1 in scornet2015consistency guarantees condition (4) for additive Gaussian regression models provided that $a_n \to \infty$ and $t_n \to \infty$, $t_n (\log a_n)^9 / a_n \to 0$, with the resampling strategy restricted to sampling without replacement. In this context it should be noted that assumption (4) does not automatically imply pointwise consistency; the latter is rather hard to prove for random forest models, and counterexamples exist for the original random forest model, as mentioned in wager2014asymptotic .
Predictions for points in the training set are usually obtained from Out-Of-Bag (OOB) subsamples. That is, averaging is not carried out over all trees but only over those trees whose resampled data set did not contain the corresponding data point during tree construction. This way, one aims to deliver unbiased estimators for predicted values. In addition, OOB samples have the advantage of delivering internal accuracy estimates without separating the sample into a training and a test set, so that the training sample size can be kept sufficiently large. From a mathematical perspective, OOB estimators of the random forest have the nice property that independence between observed responses and predicted remains valid for . This is because the prediction of is based on samples not containing the point for fixed . Thus, the independence property directly results from the independence assumption given in (2). However, the justification for analyzing infinite forests instead of finite forests as in (3) is unclear for OOB estimates, since one does not average over all decision trees but rather over a random subset of them, depending on the data point one aims to predict. If we denote with the OOB prediction of , for and the corresponding finite forest estimate, then our first result justifies considering infinite forests even for OOB samples.
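The OOB averaging described above can be sketched in a few lines. This is a toy illustration under stated assumptions rather than a random forest implementation: the base learner `fit_stump` is a hypothetical one-split regression tree on a single covariate, `a_n` is an assumed name for the subsample size, and subsamples are drawn without replacement.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Toy base learner (assumption): one cut at the median, cell means as predictions."""
    cut = sorted(xs)[len(xs) // 2]
    left = [y for x, y in zip(xs, ys) if x < cut]
    right = [y for x, y in zip(xs, ys) if x >= cut]
    overall = mean(ys)
    left_mean = mean(left) if left else overall
    right_mean = mean(right) if right else overall
    return lambda x: left_mean if x < cut else right_mean


def oob_predictions(xs, ys, n_trees, a_n, rng):
    """Average, for each training point, only the trees that did not see it.

    Returns (predictions, counts); a prediction is None if the point was
    in the subsample of every tree.
    """
    n = len(xs)
    sums = [0.0] * n
    counts = [0] * n
    for _ in range(n_trees):
        in_bag = rng.sample(range(n), a_n)  # sampling without replacement
        in_bag_set = set(in_bag)
        tree = fit_stump([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        for i in range(n):
            if i not in in_bag_set:  # point i is out-of-bag for this tree
                sums[i] += tree(xs[i])
                counts[i] += 1
    preds = [s / c if c > 0 else None for s, c in zip(sums, counts)]
    return preds, counts
```

The returned `counts` make explicit that each training point is predicted from a different, random subset of the trees, which is the reason the usual infinite-forest argument needs separate justification for OOB estimates.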
Lemma 1. Under Model (1), OOB predictions of finite forests are consistent, that is, for all
Corollary 1. The consistency assumption in (4) implies the consistency of the corresponding OOB estimate. That is:
These preliminary results allow the construction of a consistent residual variance estimator based on OOB samples.
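The mechanism behind the first result can be checked numerically. Under sampling without replacement, a fixed observation is left out of a subsample of size `a_n` drawn from `n` points with probability `1 - a_n / n`, so the number of trees not containing it is binomially distributed and, by the strong law of large numbers, the fraction of such trees stabilizes as the number of trees grows. The sketch below is a simulation under these assumptions; the function name `oob_fraction` and its parameter names are hypothetical.

```python
import random


def oob_fraction(n, a_n, n_trees, i, rng):
    """Fraction of n_trees subsamples (size a_n, without replacement) that omit point i."""
    omitted = sum(
        1 for _ in range(n_trees) if i not in rng.sample(range(n), a_n)
    )
    return omitted / n_trees


if __name__ == "__main__":
    rng = random.Random(0)
    for n_trees in (10, 100, 1000, 10000):
        frac = oob_fraction(n=20, a_n=5, n_trees=n_trees, i=3, rng=rng)
        # The limiting value under these assumptions is 1 - 5/20 = 0.75.
        print(n_trees, round(frac, 4))
```

Because the limiting fraction is strictly positive whenever the subsample is smaller than the training set, the number of trees available for each OOB prediction grows without bound along with the total number of trees.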
3 Residual Variance Estimation
We estimate the residuals based on OOB samples, i.e. we set for
which we denote as OOB-estimated residuals. Their sample variance
or OOB-estimated residual variance is our proposed estimator. Here, denotes the mean of . A similar estimator has been proposed in mendez2011estimating , where it was studied in simulations for several functional relationships between covariates and response. The next result guarantees asymptotic unbiasedness and consistency of under Assumption (4).
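Given OOB predictions for the training points, the proposed estimator is simply the sample variance of the OOB-estimated residuals. The following sketch illustrates the computation; the function name `oob_residual_variance` is hypothetical, and the normalization is exposed through an assumed `ddof` parameter (`ddof=0` divides by the sample size, `ddof=1` by the sample size minus one) because the two choices differ only by a factor that vanishes asymptotically.

```python
def oob_residual_variance(ys, oob_preds, ddof=0):
    """Sample variance of the OOB-estimated residuals y_i - (OOB prediction of y_i).

    ys        : observed responses
    oob_preds : OOB predictions for the same training points
    ddof      : 0 divides by n, 1 divides by n - 1 (assumed convention)
    """
    residuals = [y - p for y, p in zip(ys, oob_preds)]
    n = len(residuals)
    center = sum(residuals) / n  # mean of the OOB-estimated residuals
    return sum((r - center) ** 2 for r in residuals) / (n - ddof)
```

Centering the residuals by their mean, rather than assuming it is zero, matches the definition above as a sample variance.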
Remark 1 (Key Assumptions and Other Machine Learning Techniques).
(a) Beyond Assumption (4) the structure of the random forest is only used to prove (5) and to maintain that the error variables are independent from and to have , both for all fixed . Thus, the results can be extended to all methods guaranteeing these assumptions.
(b) Moreover, carefully checking the proof of Theorem 1, the independence of towards can also be substituted by , while still maintaining the consistency result.
3.1 Bias-corrected Estimation
As explained in mendez2011estimating , the estimator (7) may be biased for finite sample size . To address this, mendez2011estimating proposed a bias-corrected version of via parametric bootstrapping. Their idea is as follows: Given the data , generate i.i.d. parametric bootstrap residuals independent from , , with mean and variance . The bias-corrected bootstrap version of is given by
Here, is the OOB-estimation of using the tree structure of and feeding it with the bootstrapped sample in which terminal node values are substituted with corresponding ’s where .
In the following, we provide two important results regarding the bias-corrected version of . In Theorem 2, we prove that the bias-corrected estimator in (3.1) is $L_1$-consistent. This guarantees that the proposed bootstrapping scheme does not systematically inflate our estimate. However, the bootstrap correction comes with additional computational cost. Therefore, in Theorem 3, we provide an asymptotic lower bound which enables a fast, bias-corrected estimation of for finite sample sizes.
Consider the parametric bootstrapping scheme as described for the estimate in (3.1). Then, for the random forest model, the following inequality holds almost surely conditional on as
The result in Theorem 3 leads to a residual variance estimate that is computationally cheaper than the corresponding bootstrapped version, i.e. one can consider
instead of , while saving considerable memory and computational time costs. This will lead to almost surely.
The random forest is known as a powerful tool in applied data analysis for classification, regression and variable selection (liaw2002classification; lunetta2004screening; diaz2006gene; strobl2007bias; genuer2010variable; khalilia2011predicting). Beyond its practical use, corresponding theoretical properties have been investigated under various conditions (breiman2001random; biau2008consistency; biau2010layered; wager2014asymptotic; scornet2015consistency), covering topics such as the $L_2$-consistent estimation of the regression function. However, a comprehensive treatment of how to estimate corresponding dispersion parameters such as the variance is hardly to be found in the literature.
An exception is given by the residual variance estimators proposed and examined in simulations in mendez2011estimating . In the present paper, we complement their analyses by theoretically investigating residual variance estimators in regression models. To this end, we first show that analyzing the infinite forest estimate is legitimate, even when switching to OOB samples. This allows us to prove consistency of the OOB errors' sample variance in the $L_1$ sense if the random forest regression function estimate is assumed to be $L_2$-consistent. In addition, we give some theoretical insight into the bias-corrected residual variance estimate for finite samples proposed in mendez2011estimating .
As the structure of the random forest is only needed to maintain the independence property in OOB samples, the current approach is also valid for any method that provides $L_2$-consistent regression function estimates.
Proof of Lemma 1.
Let be fixed and be an arbitrary and fixed point in the unit cube. Consider . If we denote with the number of the regression trees not containing the -th observation, then it follows
such that . Since , where , it follows by the strong law of large numbers for fixed that as . Hence , as for fixed . For given , this justifies the consideration of
Proof of Corollary 1.
Proof of Theorem 1.
by Corollary 1 as . The second and last equality follows from the identical distribution of the sequences resp. .
Furthermore, let . Then using the Cauchy-Schwarz inequality we obtain
by Corollary 1 as which completes the proof.
Proof of Theorem 2.
To be mathematically precise, let and
be defined on some probability space and let the parametric bootstrap variables be defined on another probability space . Then, all random variables can be defined (via projections) on the joint product space , which explains the assumption that the random variables are independent from and i.i.d. generated from a distribution with finite second moment with and .
Within this framework consider and denote with the set of the -th bootstrapped sample for . Then the sequence of sets is independent. In particular, conditioned on , forms a sequence of i.i.d. random variables.
Now, note that random forest models are the weighted sum of the response variable. Hence, denoting with the hyper-rectangle obtained after constructing one random decision tree with seed parameter containing , then the infinite random forest model can be rewritten as
see, e.g., the proof of Theorem 2 in scornet2015consistency for a similar observation. Here, holds almost surely and the weights are defined as
where is the number of data points falling in . Further let be the event that both points, and , fall in the same cell under the tree constructed by . Due to sampling without replacement, there are choices to pick a fixed observation . Therefore, we obtain
Setting and , we obtain the following result for every fixed using the Cauchy-Schwarz inequality:
In order to prove -consistency of the bootstrapped corrected estimate, based on (5), we only need to show that and as . Now, note that almost surely. Conditioning on , we know that and are independent for such that almost surely. Combining these two results, we obtain with (11):