1 Introduction
Random forest models are nonparametric regression and classification methods that rely heavily on the ideas of bagging and feature subspacing during tree construction. One aims to construct highly predictive models by averaging (for continuous outcomes) or taking majority votes (for categorical outcomes) over CART trees constructed on bootstrapped samples. At each node of a tree, the best cut is selected by optimizing a CART-split criterion such as the Gini impurity (for classification) or the squared prediction error (for regression) over a subsample of the feature space. This methodology has been proven to work well in predicting new outcomes, as first shown in breiman2001random . Closely related to the prediction of a new instance, however, is the question of how reliable this prognosis is. For example, in khalilia2011predicting , random forest models have been used to predict disease risk in highly imbalanced data. Beyond point estimators, however, little was known about the dispersion of the disease risk prediction. In fact, estimating the residual variance with machine learning techniques has received far less attention than the extensive investigations on pure prediction. One exception is given in mendez2011estimating , where bootstrap-corrected residual variance estimators are proposed. They are analyzed in a simulation study for regression problems, but no theoretical guarantees such as consistency have been proven. A similar observation holds for the jackknife-type sampling variance estimators given in wager2014confidence . In the present paper we close this gap by investigating the theoretical properties of a new residual variance estimator within the random forest framework. The estimator is inspired by the one proposed in mendez2011estimating and is shown to be consistent for estimating the residual variance whenever the random forest estimate of the regression function is consistent.
Our theoretical derivations thereby build upon existing results in the literature.
First theoretical properties of the random forest method, such as consistency, were already studied in breiman2001random , while connections to layered nearest neighbors were made in lin2006random and biau2010layered . The early consistency results were later extended by several authors (meinshausen2006quantile, ; biau2008consistency, ; wager2014confidence, ; scornet2015consistency, ; scornet2016asymptotics, ), in particular allowing for stronger results (such as central limit theorems) or for a more realistic mathematical model that better approximates the true random forest approach. Indeed, interacting mathematical forces such as feature subspacing, bagging and the tree construction process make the analysis of the true random forest as applied in practice very complicated.
In the current work, we therefore build upon the mathematical description of the random forest method as given in scornet2015consistency . This ensures applicability of our estimator to a wide range of functional relationships while also incorporating relevant features of the algorithm such as the split criterion.
Our paper is structured as follows. In the next section, we give a brief overview of the random forest method and state the model framework. In addition, consistency results are stated. In the third section, we propose a residual variance estimator and prove its consistency in the $L^1$ sense. Furthermore, bias-corrected residual variance estimators are proposed. Note that all proofs can be found in the appendix.
2 Model Framework and Random Forest
Our framework is the standard regression estimation setting, in which the covariate vector $\mathbf{X}$ is assumed to lie in the $p$-dimensional unit cube, i.e. $\mathbf{X} \in [0,1]^p$. Of primary interest in the current paper is the estimation of the residual variance $\sigma^2$ in a functional relation of the form
(1) $Y = m(\mathbf{X}) + \epsilon.$
Here, $m: [0,1]^p \to \mathbb{R}$ is the unknown regression function, and $\epsilon$ is an error term with $E(\epsilon) = 0$ and $Var(\epsilon) = \sigma^2 \in (0, \infty)$ which is independent of $\mathbf{X}$. Given a training set
(2) $\mathcal{D}_n = \{(\mathbf{X}_1, Y_1), \dots, (\mathbf{X}_n, Y_n)\}$
of i.i.d. pairs $(\mathbf{X}_i, Y_i) \sim (\mathbf{X}, Y)$, $i = 1, \dots, n$, we aim to deliver an estimate $\hat{\sigma}_n^2$ of $\sigma^2$ that is at least $L^1$-consistent. The construction of $\hat{\sigma}_n^2$ will be based on the random forest estimate approximating the regression function $m$. In the sequel, we stick to the notation given in scornet2015consistency and briefly introduce the random forest model and the corresponding mathematical forces involved in it.
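For concreteness, data from the model (1) with training set (2) can be simulated as follows. The particular regression function and error distribution below are illustrative choices of ours; the framework only fixes the additive structure with centered errors independent of the covariates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative regression function m on the unit cube (p = 2 here);
# the framework only prescribes Y = m(X) + eps, not a specific m.
def m(X):
    return np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2

n, p, sigma2 = 500, 2, 0.25
X = rng.uniform(0.0, 1.0, size=(n, p))          # covariates in [0, 1]^p
eps = rng.normal(0.0, np.sqrt(sigma2), size=n)  # centered errors, Var = sigma2,
                                                # drawn independently of X
Y = m(X) + eps                                  # i.i.d. training responses
```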
The random forest model for regression is a collection of $M$ regression trees, where for each tree a bootstrap sample is taken from $\mathcal{D}_n$ using a with- or without-replacement procedure; this is denoted as the resampling strategy. Other sampling strategies within the random forest model have been considered in ramosaj2017wins , for example. Furthermore, at each node of a tree, feature subspacing is conducted, selecting $\texttt{mtry}$ features as possible split directions. Denote with $\Theta$
the generic random variable responsible for both the bootstrap sample construction and the feature subspacing procedure. Then $\Theta_1, \dots, \Theta_M$
are assumed to be independent copies of $\Theta$, responsible for this random process in the $j$-th tree and independent of $\mathcal{D}_n$. The combination of the $M$ trees is conducted through averaging, i.e.
(3) $m_{M,n}(\mathbf{x}; \Theta_1, \dots, \Theta_M, \mathcal{D}_n) = \frac{1}{M} \sum_{j=1}^{M} m_n(\mathbf{x}; \Theta_j, \mathcal{D}_n),$
and is referred to as the finite forest estimate of $m$. As explained in scornet2015consistency , the strong law of large numbers (for $M \to \infty$) allows to study the infinite forest estimate instead of the finite one. Hence, we set
(3') $m_{\infty,n}(\mathbf{x}; \mathcal{D}_n) = E_{\Theta}\left[ m_n(\mathbf{x}; \Theta, \mathcal{D}_n) \right].$
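The averaging in (3) can be sketched numerically. The following is a deliberately minimal sketch of ours, not the full algorithm: each tree is reduced to a depth-one CART stump fitted on a subsample drawn without replacement, with feature subspacing over `mtry` coordinates; all function names are our own.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(X, Y, mtry, rng):
    """One depth-one CART tree: choose the best squared-error cut (j, z)
    over a random subset of mtry coordinates (feature subspacing)."""
    n, p = X.shape
    feats = rng.choice(p, size=mtry, replace=False)
    best = None
    for j in feats:
        order = np.argsort(X[:, j])
        xs, ys = X[order, j], Y[order]
        for k in range(1, n):
            if xs[k] == xs[k - 1]:
                continue
            left, right = ys[:k], ys[k:]
            sse = ((left - left.mean()) ** 2).sum() + \
                  ((right - right.mean()) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, (xs[k] + xs[k - 1]) / 2,
                        left.mean(), right.mean())
    return best[1:]  # (split variable j, cut point z, left mean, right mean)

def predict_stump(tree, X):
    j, z, mean_left, mean_right = tree
    return np.where(X[:, j] <= z, mean_left, mean_right)

def finite_forest(X, Y, x_new, M, a_n, mtry, rng):
    """Eq. (3): average M randomized trees, each grown on a_n points
    drawn without replacement (Theta_j = subsample + feature subset)."""
    preds = np.zeros(len(x_new))
    for _ in range(M):
        idx = rng.choice(len(Y), size=a_n, replace=False)
        preds += predict_stump(fit_stump(X[idx], Y[idx], mtry, rng), x_new)
    return preds / M
```

By the strong law of large numbers in $\Theta$, the average stabilizes as $M$ grows, which is exactly what justifies passing from (3) to (3').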
Similar to scornet2015consistency , we refer to the random forest algorithm by identifying three parameters responsible for the random forest tree construction:

- $\texttt{mtry} \in \{1, \dots, p\}$, the number of preselected directions for splitting,

- $a_n \in \{1, \dots, n\}$, the number of sampled points in the bootstrap step, and

- $t_n \in \{1, \dots, a_n\}$, the number of leaves in each tree.
Let $(A_k)_k$ be a sequence of generic cells in $[0,1]^p$ obtained at tree depth $k$, and denote by $N_n(A_k)$ the number of observations falling in $A_k$, where we set $A_0 = [0,1]^p$. Here, we denote a cut as the pair $(j, z)$, where $j \in \{1, \dots, p\}$ represents the selected variable whose domain is cut at position $z$. Furthermore, let $\mathcal{C}_A$ be the set of all possible cuts in a cell $A$. It should be noted that the restriction of the feature domain to the $p$-dimensional unit cube is no actual restriction, since the random forest is invariant under monotone transformations of the features.
Then, formally, the random forest algorithm constructs $M$ decision trees resulting in randomized regression estimators according to the following algorithm:
In order to establish consistency of the residual variance estimate $\hat{\sigma}_n^2$, we require at least $L^2$-consistency of the random forest method. That is,
(4) $E\left[ m_{\infty,n}(\mathbf{X}; \mathcal{D}_n) - m(\mathbf{X}) \right]^2 \longrightarrow 0 \quad \text{as } n \to \infty,$
where the expectation is taken with respect to $\mathbf{X}$ and $\mathcal{D}_n$. Here, $\mathbf{X}$ is an independent copy of $\mathbf{X}_i$ for $i = 1, \dots, n$.
Several authors attempted to prove that (4) is valid, i.e. that random forests are consistent in the $L^2$ sense. biau2008consistency , for example, assumed a simplified version of the random forest in which cuts happen independently of the response variable in a purely random fashion. scornet2015consistency established $L^2$-consistency of the original random forest by assuming that $m$ is an additive expansion of continuous functions on the unit cube. Therein, proofs have been provided for fully grown trees ($t_n = a_n$) as well as for not fully grown trees under additional assumptions on the asymptotic relation between $t_n$, $a_n$ and $n$. For example, Theorem 1 in scornet2015consistency guarantees condition (4) for additive Gaussian regression models provided that $a_n \to \infty$ and $t_n \to \infty$ with $t_n (\log a_n)^9 / a_n \to 0$, where the resampling strategy is restricted to sampling without replacement. In this context it should be noted that assumption (4) does not automatically lead to pointwise consistency, since the latter is rather hard to prove for random forest models, and counterexamples for the original random forest model exist, as mentioned in wager2014asymptotic .
Anyhow, predicting outcomes among the training set using the random forest is usually done with Out-Of-Bag (OOB) subsamples. That is, averaging does not happen over all $M$ trees but only over those trees that did not contain the corresponding data point in their resampled data set during tree construction. This way, one aims to deliver unbiased estimators of predicted values. In addition, OOB samples have the advantage of delivering internal accuracy estimates without separating the sample into a training and a test set; hence, the training sample size can be kept sufficiently large. From a mathematical perspective, OOB estimators of random forests have the nice property that the independence between observed responses and predicted values remains valid for every fixed $i$. This is because the prediction at $\mathbf{X}_i$ is based on samples not containing the point $(\mathbf{X}_i, Y_i)$, so the independence property directly results from the i.i.d. assumption in (2). However, the justification to analyze infinite forests instead of finite forests as in (3) is unclear for OOB estimates, since one does not average over all $M$ decision trees, but rather over a random subset of them, depending on the data point one aims to predict. Denoting with $m_{M,n}^{(OOB)}(\mathbf{X}_i)$ the finite forest OOB prediction of $Y_i$ and with $m_{\infty,n}^{(OOB)}(\mathbf{X}_i)$ its infinite forest analogue, we provide our first result proving the justification of considering infinite forests even for OOB samples.
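The OOB averaging mechanism described above can be sketched as follows. To keep the sketch self-contained, each tree is replaced by a 1-nearest-neighbour rule on its subsample, a crude stand-in for a fully grown tree (which likewise predicts with a single close-by point); all names and the base learner are our own simplifications.

```python
import numpy as np

rng = np.random.default_rng(2)
n, M, a_n = 200, 300, 100

X = rng.uniform(size=(n, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.5, size=n)

# Theta_j here is just a subsample drawn without replacement; the base
# learner is a 1-nearest-neighbour rule on that subsample.
subsets = [rng.choice(n, size=a_n, replace=False) for _ in range(M)]

def predict_one(idx, x):
    j = idx[np.argmin(np.abs(X[idx, 0] - x))]
    return Y[j]

oob_pred = np.full(n, np.nan)
for i in range(n):
    # OOB averaging: use only those "trees" whose resample misses point i,
    # so the prediction for (X_i, Y_i) never uses Y_i itself.
    votes = [predict_one(idx, X[i, 0]) for idx in subsets if i not in idx]
    if votes:
        oob_pred[i] = np.mean(votes)
```

Since each resample omits point $i$ with probability $1 - a_n/n$, roughly $M(1 - a_n/n)$ trees contribute to each OOB prediction, which is why the number of averaged trees is itself random.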
Lemma 1.
Under Model (1), OOB predictions of finite forests converge to their infinite forest analogues, that is, for all $i = 1, \dots, n$, $m_{M,n}^{(OOB)}(\mathbf{X}_i) \to m_{\infty,n}^{(OOB)}(\mathbf{X}_i)$ almost surely as $M \to \infty$.
The $L^2$-consistency assumption in (4) implies the consistency of the corresponding OOB estimate. That is:
Corollary 1.
These preliminary results allow the construction of a consistent residual variance estimator based on OOB samples.
3 Residual Variance Estimation
We estimate the residuals based on OOB samples, i.e. we set for $i = 1, \dots, n$
(6) $\hat{\epsilon}_i = Y_i - m_{\infty,n}^{(OOB)}(\mathbf{X}_i),$
which we denote as OOB-estimated residuals. Their sample variance
(7) $\hat{\sigma}_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( \hat{\epsilon}_i - \bar{\epsilon} \right)^2,$
or OOB-estimated residual variance, is our proposed estimator. Here, $\bar{\epsilon}$ denotes the mean of the OOB-estimated residuals $\hat{\epsilon}_1, \dots, \hat{\epsilon}_n$. A similar estimator has been proposed in mendez2011estimating , for which simulation studies covering several functional relationships between $\mathbf{X}$ and $Y$ were conducted for practical evaluation. The next result guarantees asymptotic unbiasedness and $L^1$-consistency of $\hat{\sigma}_n^2$ under Assumption (4).
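A compact numerical sketch of (6) and (7), again with a subsampled 1-nearest-neighbour ensemble standing in for the forest (our simplification, not the actual algorithm): with true $\sigma^2 = 0.25$, the estimate lands in the right region, though with this crude base learner it tends to overshoot for moderate $n$, illustrating the finite-sample bias addressed in Section 3.1.

```python
import numpy as np

rng = np.random.default_rng(4)
n, M, a_n, sigma2 = 500, 300, 250, 0.25

X = rng.uniform(size=n)                       # p = 1 for simplicity
Y = np.sin(2 * np.pi * X) + rng.normal(0, np.sqrt(sigma2), size=n)

# Subsampled 1-NN ensemble as a stand-in forest (our simplification).
subsets = [rng.choice(n, size=a_n, replace=False) for _ in range(M)]

def oob_prediction(i):
    # Average over resamples that omit point i, as in Section 2.
    votes = [Y[idx[np.argmin(np.abs(X[idx] - X[i]))]]
             for idx in subsets if i not in idx]
    return np.mean(votes)

# (6): OOB-estimated residuals; (7): their sample variance.
resid = np.array([Y[i] - oob_prediction(i) for i in range(n)])
sigma2_hat = resid.var(ddof=1)
```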
Theorem 1.
Remark 1 (Key Assumptions and Other Machine Learning Techniques).
(a) Beyond Assumption (4), the structure of the random forest is only used to prove (5) and to maintain that
the error variables are independent of the covariates and fulfill the moment conditions of Model (1), both for all fixed $n$. Thus, the results can be extended to all methods guaranteeing these assumptions.
(b) Moreover, carefully checking the proof of Theorem 1, the independence of the errors from the covariates can also be substituted by a weaker condition, while still maintaining the consistency result.
3.1 Bias-corrected Estimation
As explained in mendez2011estimating , the estimator (7) may be biased for finite sample sizes $n$. To this end, mendez2011estimating proposed a bias-corrected version of $\hat{\sigma}_n^2$ via parametric bootstrapping. Their idea is as follows: given the data, generate i.i.d. parametric bootstrap residuals $\epsilon_1^*, \dots, \epsilon_n^*$, independent of $\mathcal{D}_n$, with mean $0$ and variance $\hat{\sigma}_n^2$,
from a parametric distribution with finite second moment, e.g. the normal distribution. Then, a
bias-corrected bootstrap version of $\hat{\sigma}_n^2$ is given by
(8)
Here, the bootstrap analogue of the OOB prediction is obtained by using the tree structure constructed from $\mathcal{D}_n$ and feeding it with the bootstrapped sample, in which terminal node values are substituted with the corresponding bootstrap responses.
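The generic parametric-bootstrap bias correction behind this scheme can be sketched as follows. To keep the example fast, we replace the OOB forest by a simple moving-average smoother and use the classical correction (twice the original estimate minus the mean of the bootstrap replicates); the estimator (8) follows the same logic but re-feeds the fitted tree structure with the bootstrap responses, so this is an illustration of the principle, not of (8) itself. All names are ours.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, sigma2 = 300, 10, 1.0

X = np.sort(rng.uniform(size=n))
Y = np.sin(2 * np.pi * X) + rng.normal(0, np.sqrt(sigma2), size=n)

def fit_and_sigma2(Y_resp):
    """Toy pipeline standing in for the OOB forest: a moving-average
    smoother fitted on the responses, followed by the naive residual
    variance as in (7)."""
    m_hat = np.array([Y_resp[max(0, i - k): i + k + 1].mean()
                      for i in range(n)])
    resid = Y_resp - m_hat
    return m_hat, resid.var(ddof=1)

m_hat, s2 = fit_and_sigma2(Y)   # finite-sample biased plug-in estimate

# Parametric bootstrap: draw residuals with mean 0 and variance s2,
# re-run the whole pipeline on m_hat + eps*, and remove the estimated
# bias via 2 * s2 - mean(bootstrap estimates).
B = 100
boot = [fit_and_sigma2(m_hat + rng.normal(0, np.sqrt(s2), size=n))[1]
        for _ in range(B)]
s2_bc = 2 * s2 - np.mean(boot)
```

The bootstrap world treats $s2$ as the truth, so the average deviation of the replicates from $s2$ estimates the pipeline's bias, which is then subtracted.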
In the following, we provide two important results regarding the bias-corrected version of $\hat{\sigma}_n^2$. In Theorem 2, we prove that the bias-corrected estimator in (8) is consistent. This guarantees that the proposed bootstrapping scheme does not systematically inflate our estimate. However, it comes with additional computational cost. Therefore, in Theorem 3, we provide an asymptotic lower bound which enables a fast bias-corrected estimation of $\sigma^2$ for finite sample sizes.
Theorem 2.
Theorem 3.
Consider the parametric bootstrapping scheme as described for the estimate in (8). Then, for the random forest model, the following inequality holds almost surely, conditionally on $\mathcal{D}_n$, as $n \to \infty$:
The result in Theorem 3 leads to a residual variance estimate that is computationally cheaper than the corresponding bootstrapped version, i.e. one can consider
(9)
instead of the bootstrap-based estimator, while saving considerable memory and computational time. The resulting estimate then satisfies the inequality of Theorem 3 almost surely.
4 Conclusion
The random forest is known as a powerful tool in applied data analysis for classification, regression and variable selection (liaw2002classification, ; lunetta2004screening, ; diaz2006gene, ; strobl2007bias, ; genuer2010variable, ; khalilia2011predicting, ). Beyond its practical use, corresponding theoretical properties have been investigated under various conditions (breiman2001random, ; biau2008consistency, ; biau2010layered, ; wager2014asymptotic, ; scornet2015consistency, ), covering topics such as the consistent estimation of the regression function. However, a comprehensive treatment of how to estimate corresponding dispersion parameters such as the variance is largely missing from the literature.
An exception is given by the residual variance estimators proposed and examined in simulations in mendez2011estimating . In the present paper, we complement their analyses by theoretically investigating residual variance estimators in regression models. To this end, we first show that analyzing the infinite forest estimate is legitimate even when switching to OOB samples. This allows us to prove consistency of the OOB errors' sample variance in the $L^1$ sense, provided the random forest estimate of the regression function is $L^2$-consistent. In addition, we give some theoretical insight on the bias-corrected residual variance estimate for finite samples as proposed in mendez2011estimating .
As the structure of the random forest is only needed to maintain the independence property in OOB samples, the current approach remains valid for any method that provides consistent regression function estimates.
5 Appendix
Proof of Lemma 1.
Let $n$ be fixed and let $\mathbf{x}$ be an arbitrary, fixed point in the unit cube. Consider the corresponding OOB prediction. If we denote with $M_i$ the number of regression trees not containing the $i$-th observation, then it follows
such that the OOB prediction is the average over these $M_i$ trees. Since, conditionally on $\mathcal{D}_n$, the involved tree predictions are i.i.d., it follows by the strong law of large numbers for fixed $n$ that this average converges almost surely as $M \to \infty$. Hence, the finite forest OOB prediction converges to its infinite forest analogue as $M \to \infty$ for fixed $n$. For given $i$, this justifies the consideration of
∎
Proof of Corollary 1.
Proof of Theorem 1.
Consider $\hat{\epsilon}_i$ and $\hat{\sigma}_n^2$ from (6)–(7). Using Corollary 1 and the independence of the errors and the covariates for all $i$, it follows that
by Corollary 1 as $n \to \infty$. The second and last equalities follow from the identical distribution of the respective sequences.
Furthermore, using the Cauchy-Schwarz inequality, we obtain
by Corollary 1 as $n \to \infty$, which completes the proof.
∎
Proof of Theorem 2.
To be mathematically precise, let $\mathcal{D}_n$ and $\Theta_1, \dots, \Theta_M$
be defined on some probability space,
and let the parametric bootstrap variables be defined on another probability space. Then, all random variables can be defined (via projections) on the joint product space; this explains the assumption that the bootstrap residuals are independent of $\mathcal{D}_n$ and i.i.d. generated from a distribution with finite second moment, with mean $0$ and variance $\hat{\sigma}_n^2$. Within this framework, consider the sets given by the $j$-th bootstrapped sample, $j = 1, \dots, M$. Then this sequence of sets is independent; in particular, conditioned on $\mathcal{D}_n$, it forms a sequence of i.i.d. random variables.
Now, note that random forest models are weighted sums of the response variables. Hence, denoting with $A_n(\mathbf{x}; \Theta)$ the hyperrectangle containing $\mathbf{x}$ that is obtained after constructing one random decision tree with seed parameter $\Theta$, the infinite random forest model can be rewritten as
(10) 
see, e.g., the proof of Theorem 2 in scornet2015consistency for a similar observation. Here, the representation holds almost surely and the weights are defined as
where the normalizing quantity is the number of data points falling in the corresponding cell. Further, let $B$ denote the event that both points fall in the same cell of the tree constructed by $\Theta$. Due to sampling without replacement, there are $\binom{n-1}{a_n - 1}$ possible subsamples containing a fixed observation. Therefore, we obtain
(11) 
Setting and , we obtain the following result for every fixed using the Cauchy-Schwarz inequality:
(12) 
In order to prove consistency of the bootstrap-corrected estimate, based on (5), we only need to show that and as . Now, note that almost surely. Conditioning on $\mathcal{D}_n$, we know that and are independent for , such that almost surely. Combining these two results, we obtain with (11):
(13)  
(14) 
where the inequality results from applying (11) to the weights and the last equality from the fact that , the convergence from Theorem 1 and . Furthermore, by applying Jensen's inequality, we obtain