
# Consistent Estimation of Residual Variance with Random Forest Out-Of-Bag Errors

The estimation of residual variance in regression models has received relatively little attention in the machine learning community. However, the estimate is of primary interest in many practical applications, e.g. as a first step towards the construction of prediction intervals. Here, we consider this issue for the random forest. Therein, the functional relationship between covariates and response variable is modeled by a weighted sum of the latter. The dependence structure is, however, hidden in the weights, which are built during the tree construction process, making the model difficult to analyze mathematically. Restricting to $L_2$-consistent random forest models, we provide random forest based residual variance estimators and prove their consistency.


## 1 Introduction

Random forest models are non-parametric regression or classification methods that rely heavily on bagging and feature sub-spacing during tree construction. The aim is to construct highly predictive models by averaging (for continuous outcomes) or taking majority votes (for categorical outcomes) over CART trees constructed on bootstrapped samples. At each node of a tree, the best cut is selected by optimizing a CART-split criterion such as the Gini impurity (for classification) or the squared prediction error (for regression) over a subsample of the feature space. This methodology has been shown to work well in predicting new outcomes, as first demonstrated in breiman2001random . Closely related to the prediction of a new instance is the question of how reliable this prediction is. For example, in khalilia2011predicting , random forest models were used to predict disease risk in highly imbalanced data. Beyond point estimators, however, little was known about the dispersion of the disease risk prediction. In fact, estimating residual variance with machine learning techniques has received less attention than the extensive investigations on pure prediction. One exception is mendez2011estimating , where bootstrap-corrected residual variance estimators are proposed. They are analyzed in a simulation study for regression problems, but no theoretical guarantees such as consistency have been proven. A similar observation holds for the jackknife-type sampling variance estimators given in wager2014confidence . In the present paper we close this gap by investigating the theoretical properties of a new residual variance estimator within the random forest framework. The estimator is inspired by the one proposed in mendez2011estimating and is shown to be consistent for estimating residual variance if the random forest estimate of the regression function is $L_2$-consistent.
Our theoretical derivations build upon existing results in the literature.

First theoretical properties of the random forest method such as ($L_2$-)consistency have already been proven in breiman2001random , while connections to layered nearest neighbors were made in lin2006random and biau2010layered . The early consistency results were later extended by several authors (meinshausen2006quantile; biau2008consistency; wager2014confidence; scornet2015consistency; scornet2016asymptotics), in particular allowing for stronger results (such as central limit theorems) or a more reasonable mathematical model that better approximates the true random forest approach. The interplay of mechanisms such as feature sub-spacing, bagging and the tree construction process makes the analysis of the true random forest as applied in practice very complicated.

In the current work, we therefore build upon the mathematical description of the random forest method given in scornet2015consistency . This makes our estimator applicable to a wide range of functional relationships while also incorporating relevant features of the algorithm such as the split criterion.

Our paper is structured as follows. In the next section, we give a brief overview of the random forest, state the model framework and recall consistency results. In the third section, we provide a residual variance estimate and prove its consistency in the $L_1$-sense. Furthermore, bias-corrected residual variance estimators are proposed. All proofs can be found in the appendix.

## 2 Model Framework and Random Forest

Our framework is regression estimation in which the covariable vector $X$ is assumed to lie in the $p$-dimensional unit cube, i.e. $X \in [0,1]^p$. Of primary interest in the current paper is the estimation of the residual variance $\sigma^2$ in a functional relation of the form

$$ Y = m(X) + \epsilon. \tag{1} $$

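Here, $Y \in \mathbb{R}$, $m : [0,1]^p \to \mathbb{R}$, and $\epsilon$ with $E[\epsilon] = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2 \in (0, \infty)$ is independent of $X$. Given a training set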

$$ \mathcal{D}_n = \{(X_i^\top, Y_i) \in [0,1]^p \times \mathbb{R} : i = 1, \dots, n\}, \tag{2} $$

of i.i.d. pairs $(X_i^\top, Y_i)$, $i = 1, \dots, n$, we aim to deliver an estimate of $\sigma^2$ that is at least $L_1$-consistent. Its construction will be based on the random forest estimate $m_n$ approximating the regression function $m$. In the sequel, we stick to the notation of scornet2015consistency and briefly introduce the random forest model and the mechanisms involved in it.

The random forest model for regression is a collection of $M$ regression trees, where for each tree a sample is drawn from $\mathcal{D}_n$ either with or without replacement. This choice is referred to as the resampling strategy. Sampling strategies other than these two have been considered within the random forest model in ramosaj2017wins , for example. Furthermore, at each node of the tree, feature sub-spacing is conducted by selecting $m_{try}$ features as possible split directions. Denote by $\Theta$ the generic random variable responsible for both the bootstrap sample construction and the feature sub-spacing procedure. Then, $\Theta_1, \dots, \Theta_M$ are assumed to be independent copies of $\Theta$, with $\Theta_j$ governing this random process in the $j$-th tree, independent of $\mathcal{D}_n$. The trees are combined through averaging, i.e.

$$ m_{M,n}(x; \Theta_1, \dots, \Theta_M, \mathcal{D}_n) = \frac{1}{M} \sum_{j=1}^{M} m_n(x; \Theta_j, \mathcal{D}_n) \tag{3} $$

and is referred to as the finite forest estimate of $m$. As explained in scornet2015consistency , the strong law of large numbers (for $M \to \infty$) allows us to study $m_n$ instead of $m_{M,n}$. Hence, we set

$$ m_n(x) = m_n(x; \mathcal{D}_n) = E_\Theta[m_n(x; \Theta, \mathcal{D}_n)]. \tag{3'} $$

Similar to scornet2015consistency , we refer to the random forest algorithm by identifying three parameters responsible for the random forest tree construction:

• $m_{try} \in \{1, \dots, p\}$, the number of pre-selected directions for splitting,

• $a_n \in \{1, \dots, n\}$, the number of sampled points in the bootstrap step and

• $t_n \in \{1, \dots, a_n\}$, the number of leaves in each tree.

Let $A$ denote a generic cell in $[0,1]^p$ obtained at a given tree depth, and denote by $N_n(A)$ the number of observations falling in $A$. We denote a cut as a pair $(j, z)$, where $j$ is the selected variable whose domain is cut at position $z$. Furthermore, let $\mathcal{C}_A$ be the set of all possible cuts in $A$. Note that restricting the feature domain to the $p$-dimensional unit cube entails no loss of generality, since the random forest is invariant under monotone transformations.

Then formally, the random forest algorithm constructs decision trees resulting in regression estimators according to the following algorithm:
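As a rough illustration of the node-splitting step in such a procedure, the following Python sketch searches one cell for the cut (a split variable `j` and a position `z`) that minimizes the within-cell squared error over `mtry` randomly chosen directions. It is a simplified stand-in under stated assumptions: cut candidates are midpoints between consecutive distinct values, and the names `best_cut` and `_sse` are hypothetical. It is not the exact algorithm of scornet2015consistency .

```python
import random


def _sse(values):
    """Sum of squared deviations from the mean (within-cell squared error)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_cut(X, y, mtry, rng):
    """Return (j, z, sse) for the best CART-style cut in one cell.

    X is a list of feature rows, y the responses in the cell, mtry the number
    of randomly pre-selected split directions. Returns None if no cut exists.
    """
    p = len(X[0])
    directions = rng.sample(range(p), mtry)  # feature sub-spacing
    best = None
    for j in directions:
        values = sorted(set(row[j] for row in X))
        # Assumption: candidate cuts are midpoints between distinct values.
        for lo, hi in zip(values, values[1:]):
            z = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] < z]
            right = [yi for row, yi in zip(X, y) if row[j] >= z]
            sse = _sse(left) + _sse(right)
            if best is None or sse < best[2]:
                best = (j, z, sse)
    return best
```

A full tree would apply such a search recursively to the resulting cells until the desired number of leaves is reached.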

In order to establish $L_1$-consistency of the residual variance estimate, we require at least $L_2$-consistency of the random forest method. That is,

$$ \lim_{n \to \infty} E[(m_n(X) - m(X))^2] = 0, \tag{4} $$

where the expectation is taken with respect to $X$ and $\mathcal{D}_n$. Here, $X$ is an independent copy of $X_i$ for $i = 1, \dots, n$.

Several authors have attempted to prove that (4) is valid, i.e. that random forests are consistent in the $L_2$-sense. biau2008consistency , for example, considered a simplified version of the random forest in which cuts happen independently of the response variable in a purely random fashion. scornet2015consistency established consistency of the original random forest by assuming that $m$ is an additive expansion of continuous functions on the unit cube. Therein, proofs are provided for fully grown trees ($t_n = a_n$) and for not fully grown trees, under additional assumptions on the asymptotic relation between $a_n$ and $t_n$. For example, Theorem 1 in scornet2015consistency guarantees condition (4) for additive Gaussian regression models provided that $a_n \to \infty$ and $t_n \to \infty$, $t_n (\log a_n)^9 / a_n \to 0$, where the resampling strategy is restricted to sampling without replacement. In this context it should be noted that assumption (4) does not automatically lead to pointwise consistency, since the latter is rather hard to prove for random forest models and counterexamples exist for the original random forest model, as mentioned in wager2014asymptotic .

Predictions for points of the training set are usually obtained from Out-Of-Bag (OOB) subsamples. That is, averaging does not happen over all trees but only over those trees that did not have the corresponding data point in their resampled data set during tree construction. This way, one aims to deliver unbiased estimators for predicted values. In addition, OOB samples have the advantage of delivering internal accuracy estimates without separating the sample into a training and a test set, so the training sample size can be kept sufficiently large. From a mathematical perspective, OOB estimators of random forests have the useful property that independence between observed responses and their predictions remains valid for every $i = 1, \dots, n$. This is because the prediction at $X_i$ is based on samples not containing the point $(X_i^\top, Y_i)$ for fixed $i$. Thus, the independence property follows directly from the independence assumption given in (2). However, the justification for analyzing infinite forests instead of finite forests as in (3) is unclear for OOB estimates, since one does not average over all $M$ decision trees but rather over a random subset of them, depending on the data point one aims to predict. If we denote by $m^{OOB}_n(X_i)$ the OOB prediction at $X_i$, for $i = 1, \dots, n$, and by $m^{OOB}_{M,n}(X_i; \Theta_1, \dots, \Theta_M)$ the corresponding finite forest estimate, then our first result justifies considering infinite forests even for OOB samples.

###### Lemma 1.

Under Model (1), OOB predictions of finite forests are consistent, that is, for all $x \in [0,1]^p$,

$$ m^{OOB}_{M,n}(x; \Theta_1, \dots, \Theta_M) \longrightarrow m^{OOB}_n(x), \quad P_\Theta\text{-a.s. as } M \to \infty. $$

The consistency assumption in (4) implies the consistency of the corresponding OOB-estimate. That is:

###### Corollary 1.

For every fixed $i \in \{1, \dots, n\}$ under Model (1) and assuming (4), OOB estimators based on random forests are $L_2$-consistent in the following sense:

$$ \lim_{n \to \infty} E[(m^{OOB}_n(X_i) - m(X_i))^2] = 0. \tag{5} $$

These preliminary results allow the construction of a consistent residual variance estimator based on OOB samples.
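To make the OOB mechanism concrete, the sketch below computes OOB predictions for a toy ensemble under subsampling without replacement. It rests on several simplifying assumptions that are not taken from the paper: the covariate is scalar, each "tree" is a one-split stump cut at the subsample median, and the names `fit_stump` and `oob_predictions` are hypothetical. The point illustrated is only the averaging rule: the prediction for observation `i` uses exactly those trees whose subsample did not contain `i`.

```python
import random


def fit_stump(xs, ys):
    """One-split regression 'tree' on scalar x: cut at the median, predict leaf means."""
    z = sorted(xs)[len(xs) // 2]
    left = [y for x, y in zip(xs, ys) if x < z]
    right = [y for x, y in zip(xs, ys) if x >= z]
    overall = sum(ys) / len(ys)
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return lambda x: left_mean if x < z else right_mean


def oob_predictions(xs, ys, n_trees, a_n, seed=0):
    """Average, for each i, the predictions of trees that did not see observation i."""
    rng = random.Random(seed)
    n = len(xs)
    sums = [0.0] * n
    counts = [0] * n
    for _ in range(n_trees):
        in_bag = set(rng.sample(range(n), a_n))  # subsampling without replacement
        idx = sorted(in_bag)
        tree = fit_stump([xs[k] for k in idx], [ys[k] for k in idx])
        for i in range(n):
            if i not in in_bag:  # observation i is out-of-bag for this tree
                sums[i] += tree(xs[i])
                counts[i] += 1
    # None marks points that were in-bag for every tree (no OOB prediction).
    return [s / c if c > 0 else None for s, c in zip(sums, counts)]
```

With many trees, the number of out-of-bag trees per observation grows, which is the situation addressed by Lemma 1.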

## 3 Residual Variance Estimation

We estimate the residuals based on OOB samples, i.e. we set for $i = 1, \dots, n$

$$ \hat{\epsilon}_i = Y_i - m^{OOB}_n(X_i), \tag{6} $$

which we call OOB-estimated residuals. Their sample variance

$$ \hat{\sigma}^2_{RF} = \frac{1}{n} \sum_{i=1}^{n} (\hat{\epsilon}_i - \bar{\epsilon}_\cdot)^2 \tag{7} $$

or OOB-estimated residual variance, is our proposed estimator. Here, $\bar{\epsilon}_\cdot = \frac{1}{n} \sum_{i=1}^{n} \hat{\epsilon}_i$ denotes the mean of $\hat{\epsilon}_1, \dots, \hat{\epsilon}_n$. A similar estimator has been proposed in mendez2011estimating , where simulation studies on several functional relationships between covariates and response were carried out. The next result guarantees asymptotic unbiasedness and consistency of $\hat{\sigma}^2_{RF}$ under Assumption (4).
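The estimator in (6) and (7) is straightforward to compute once OOB predictions are available. A minimal sketch, assuming the OOB predictions are supplied as a plain list (the function name `oob_residual_variance` is hypothetical):

```python
def oob_residual_variance(y, oob_pred):
    """Sample variance (with 1/n normalization) of the OOB-estimated residuals."""
    n = len(y)
    residuals = [yi - mi for yi, mi in zip(y, oob_pred)]  # equation (6)
    mean_res = sum(residuals) / n
    return sum((r - mean_res) ** 2 for r in residuals) / n  # equation (7)
```

Because the residual mean is subtracted, adding a constant to all OOB predictions leaves the estimate unchanged.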

###### Theorem 1.

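Assume regression Model (1) and that (4) is valid. Then the residual variance estimate $\hat{\sigma}^2_{RF}$ given in (7) is asymptotically unbiased and $L_1$-consistent as $n \to \infty$, i.e.

$$ \hat{\sigma}^2_{RF} \overset{L_1}{\longrightarrow} \sigma^2 \quad \text{as } n \to \infty. $$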
###### Remark 1 (Key Assumptions and Other Machine Learning Techniques).

(a) Beyond Assumption (4) the structure of the random forest is only used to prove (5) and to maintain that the error variables are independent from and to have , both for all fixed . Thus, the results can be extended to all methods guaranteeing these assumptions.
(b) Moreover, carefully checking the proof of Theorem 1, the independence of towards can also be substituted by , while still maintaining the consistency result.

### 3.1 Bias-corrected Estimation

As explained in mendez2011estimating , the estimator (7) may be biased for finite sample size $n$. To this end, mendez2011estimating proposed a bias-corrected version of $\hat{\sigma}^2_{RF}$ via parametric bootstrapping. Their idea is as follows: Given the data $\mathcal{D}_n$, generate i.i.d. parametric bootstrap residuals $\epsilon^*_{i,b}$, $i = 1, \dots, n$, $b = 1, \dots, B$, with mean $0$ and variance $\hat{\sigma}^2_{RF}$ from a parametric distribution with finite second moment, e.g. the normal distribution. Then, a bias-corrected bootstrap version of $\hat{\sigma}^2_{RF}$ is given by

$$ \hat{\sigma}^2_{RFboot} = \hat{\sigma}^2_{RF} - \frac{1}{nB} \sum_{b=1}^{B} \sum_{i=1}^{n} \big(m^{OOB}_{n,b}(X_i) - m^{OOB}_n(X_i)\big)^2 =: \hat{\sigma}^2_{RF} - \hat{R}_B(m_n). \tag{8} $$

Here, $m^{OOB}_{n,b}(X_i)$ is the OOB estimate at $X_i$ obtained by using the tree structure of $m^{OOB}_n$ and feeding it with the $b$-th bootstrapped sample, in which terminal node values are substituted with the corresponding $Y^*_{i,b}$'s, where $Y^*_{i,b} = m^{OOB}_n(X_i) + \epsilon^*_{i,b}$.
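The bootstrap correction can be sketched in a few lines if one assumes that the fitted forest is summarized by a matrix of OOB weights, so that each OOB prediction is a weighted sum of responses (the representation used later in the appendix). In the sketch below, the weight matrix `W`, the choice of a normal distribution for the bootstrap residuals, and the function name `bootstrap_bias_corrected_variance` are all assumptions made for illustration; this is not the reference implementation of mendez2011estimating .

```python
import random


def bootstrap_bias_corrected_variance(W, oob_pred, sigma2_rf, B, seed=0):
    """Return (corrected estimate, estimated correction term) for a parametric bootstrap.

    W[i][j] is the (assumed given) OOB weight of response j in the prediction at X_i,
    oob_pred[i] the OOB prediction at X_i, sigma2_rf the uncorrected estimate.
    """
    rng = random.Random(seed)
    n = len(oob_pred)
    sd = sigma2_rf ** 0.5
    total = 0.0
    for _ in range(B):
        # Bootstrap responses: OOB prediction plus a normal residual
        # with mean 0 and variance sigma2_rf.
        y_star = [m + rng.gauss(0.0, sd) for m in oob_pred]
        for i in range(n):
            # Re-predict with the tree structure (weights) held fixed.
            m_b = sum(W[i][j] * y_star[j] for j in range(n))
            total += (m_b - oob_pred[i]) ** 2
    r_hat = total / (n * B)
    return sigma2_rf - r_hat, r_hat
```

Since the correction term is a mean of squares, the corrected estimate never exceeds the uncorrected one.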

In the following, we provide two important results regarding the bias-corrected version of $\hat{\sigma}^2_{RF}$. In Theorem 2, we prove that the bias-corrected estimator in (8) is $L_1$-consistent. This guarantees that the proposed bootstrapping scheme does not systematically inflate our estimate. However, $\hat{\sigma}^2_{RFboot}$ comes with additional computational costs. Therefore, in Theorem 3, we provide an asymptotic lower bound which enables a fast, bias-corrected estimation of $\sigma^2$ for finite sample sizes.

###### Theorem 2.

Consider the random forest based parametric bootstrapping scheme as described for the estimation of (8). Assume further that the resampling strategy is restricted to sampling without replacement. Assume Model (1) and condition (4) with $a_n^2 / n \to 0$ as $n \to \infty$. Then $\hat{\sigma}^2_{RFboot}$ is $L_1$-consistent, that is

$$ \hat{\sigma}^2_{RFboot} \overset{L_1}{\longrightarrow} \sigma^2 \quad \text{as } n \to \infty. $$
###### Theorem 3.

Consider the parametric bootstrapping scheme as described for the estimate in (8). Then, for the random forest model, the following inequality holds almost surely conditional on $\mathcal{D}_n$ as $B \to \infty$:

$$ \hat{R}_B(m_n) \ge \frac{\hat{\sigma}^2_{RF}}{a_n^2}. $$

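The result in Theorem 3 leads to a residual variance estimate that is computationally cheaper than the corresponding bootstrapped version, i.e. one can consider

$$ \hat{\sigma}^2_{RFfast} = \hat{\sigma}^2_{RF} \Big(1 - \frac{1}{a_n^2}\Big) \tag{9} $$

instead of $\hat{\sigma}^2_{RFboot}$, while saving considerable memory and computation time. By (8) and Theorem 3, this leads to $\hat{\sigma}^2_{RFboot} \le \hat{\sigma}^2_{RFfast}$ almost surely.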
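The fast correction in (9) needs only the uncorrected estimate and the subsample size. A minimal sketch (the function name `fast_bias_corrected_variance` is hypothetical):

```python
def fast_bias_corrected_variance(sigma2_rf, a_n):
    """Apply the multiplicative correction (1 - 1/a_n^2) from equation (9)."""
    if a_n < 1:
        raise ValueError("a_n must be at least 1")
    return sigma2_rf * (1.0 - 1.0 / a_n ** 2)
```

For example, with a subsample size of 10 the estimate is shrunk by one percent, so the correction becomes negligible as the subsample size grows.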

## 4 Conclusion

The random forest is known as a powerful tool in applied data analysis for classification, regression and variable selection (liaw2002classification; lunetta2004screening; diaz2006gene; strobl2007bias; genuer2010variable; khalilia2011predicting). Beyond its practical use, corresponding theoretical properties have been investigated under various conditions (breiman2001random; biau2008consistency; biau2010layered; wager2014asymptotic; scornet2015consistency), covering topics such as the $L_2$-consistent estimation of the regression function. However, a comprehensive treatment of how to estimate corresponding dispersion parameters such as the variance is hardly to be found in the literature.

An exception is given by the residual variance estimators proposed and examined in simulations in mendez2011estimating . In the present paper, we complement their analyses by theoretically investigating residual variance estimators in regression models. To this end, we first show that analyzing the infinite forest estimate is legitimate, even when switching to OOB samples. This allows us to prove consistency of the OOB errors' sample variance in the $L_1$-sense if the random forest regression function estimate is assumed to be $L_2$-consistent. In addition, we give some theoretical insight into the bias-corrected residual variance estimate for finite samples as proposed in mendez2011estimating .

As the structure of the random forest is only needed to maintain the independence property in OOB samples, the current approach is also valid for any method that provides $L_2$-consistent regression function estimates.

## 5 Appendix.

In this section we state the proofs for Lemma 1, Corollary 1, Theorem 1, Theorem 2 and Theorem 3.

###### Proof of Lemma 1.

Let $i \in \{1, \dots, n\}$ be fixed and let $x$ be an arbitrary fixed point in the unit cube. If we denote by $Z_i$ the number of regression trees not containing the $i$-th observation, then it follows that

$$ Z_i \sim \mathrm{Bin}(M, p_n) \quad \text{where} \quad p_n = \begin{cases} 1 - a_n/n & \text{for subsampling,} \\ (1 - 1/n)^n & \text{for bootstrapping with replacement,} \end{cases} $$

such that . Since , where , it follows by the strong law of large numbers for fixed that as . Hence , as for fixed . For given , this justifies the consideration of
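The out-of-bag probability in the binomial model above is easy to evaluate and to simulate. The following sketch computes $p_n$ for both resampling schemes as stated in the display and draws the number of out-of-bag trees for one observation; the function names `oob_probability` and `simulate_oob_tree_count` are hypothetical.

```python
import random


def oob_probability(n, a_n=None, with_replacement=False):
    """Probability p_n that a fixed observation is out-of-bag for one tree."""
    if with_replacement:
        return (1.0 - 1.0 / n) ** n  # approaches exp(-1) for large n
    return 1.0 - a_n / n  # subsampling a_n of n points without replacement


def simulate_oob_tree_count(M, p_n, seed=0):
    """Draw Z_i ~ Bin(M, p_n): number of trees, out of M, not containing observation i."""
    rng = random.Random(seed)
    return sum(1 for _ in range(M) if rng.random() < p_n)
```

For a fixed sample size, the simulated count grows roughly linearly in the number of trees, which is why the finite OOB average stabilizes as the forest grows.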

###### Proof of Corollary 1.

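Let $i \in \{1, \dots, n\}$ be fixed. Define the reduced sample $\mathcal{D}_n \setminus \{(X_i^\top, Y_i)\}$ as the OOB sample of the $i$-th observation. It then follows from the independence assumption in (2) that $(X_i^\top, Y_i)$ is independent of this reduced sample. Hence, $X_i$ can be treated as an independent copy of $X$ in (4). The result thus follows immediately from (4). ∎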

###### Proof of Theorem 1.

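Consider $\hat{\epsilon}_i$ and $\hat{\sigma}^2_{RF}$ from (6)–(7). Using Corollary 1 and the independence of $\epsilon_i$ and $m^{OOB}_n(X_i)$ for all $i = 1, \dots, n$, it follows that

$$
\begin{aligned}
E\Big[\frac{1}{n} \sum_{i=1}^{n} \hat{\epsilon}_i^2\Big]
&= \frac{1}{n} \sum_{i=1}^{n} E\big[\{(Y_i - m(X_i)) + (m(X_i) - m^{OOB}_n(X_i))\}^2\big] \\
&= E[(Y_1 - m(X_1))^2] + \frac{1}{n} \sum_{i=1}^{n} \big\{ 2 E[(Y_i - m(X_i))(m(X_i) - m^{OOB}_n(X_i))] + E[(m(X_i) - m^{OOB}_n(X_i))^2] \big\} \\
&= \sigma^2 + \frac{1}{n} \sum_{i=1}^{n} \big\{ 2 \big( E[m(X_i) E[Y_i | X_i]] - E[Y_i m^{OOB}_n(X_i)] - E[m(X_i)^2] + E[m(X_i) E[m^{OOB}_n(X_i) | X_i]] \big) + E[(m(X_i) - m^{OOB}_n(X_i))^2] \big\} \\
&= \sigma^2 + \frac{1}{n} \sum_{i=1}^{n} \big\{ 2 \big( E[m(X_i) E[Y_i | X_i]] - E[m(X_i) m^{OOB}_n(X_i)] - E[\epsilon_i m^{OOB}_n(X_i)] - E[m(X_i)^2] + E[m(X_i) E[m^{OOB}_n(X_i) | X_i]] \big) + E[(m(X_i) - m^{OOB}_n(X_i))^2] \big\} \\
&= \sigma^2 + \frac{1}{n} \sum_{i=1}^{n} \big\{ 2 \big( E[m(X_i)^2] - E[m(X_i) E[m^{OOB}_n(X_i) | X_i]] - E[m(X_i)^2] + E[m(X_i) E[m^{OOB}_n(X_i) | X_i]] \big) + E[(m(X_i) - m^{OOB}_n(X_i))^2] \big\} \\
&= \sigma^2 + E[(m(X_1) - m^{OOB}_n(X_1))^2] \longrightarrow \sigma^2
\end{aligned}
$$

by Corollary 1 as $n \to \infty$. The second and last equalities follow from the identical distribution of the sequences $\{Y_i - m(X_i)\}_{i}$ resp. $\{m(X_i) - m^{OOB}_n(X_i)\}_{i}$.

Furthermore, let $\Delta_n(X_i) = m(X_i) - m^{OOB}_n(X_i)$, so that $\hat{\epsilon}_i = \Delta_n(X_i) + \epsilon_i$. Then, using the Cauchy-Schwarz inequality, we obtain

$$
\begin{aligned}
0 \le E\Big[\Big(\frac{1}{n} \sum_{i=1}^{n} \hat{\epsilon}_i\Big)^2\Big]
&= \frac{1}{n^2} E\Big[\Big(\sum_{i=1}^{n} (\Delta_n(X_i) + \epsilon_i)\Big)^2\Big] \\
&= \frac{1}{n^2} E\Big[\sum_{i=1}^{n} (\Delta_n(X_i) + \epsilon_i)^2\Big] + \frac{1}{n^2} E\Big[\sum_{i \neq j} (\Delta_n(X_i) + \epsilon_i)(\Delta_n(X_j) + \epsilon_j)\Big] \\
&= \frac{1}{n^2} \sum_{i=1}^{n} E[\Delta_n(X_i)^2 + 2 \epsilon_i \Delta_n(X_i) + \epsilon_i^2] + \frac{1}{n^2} \sum_{i \neq j} E[\Delta_n(X_i) \Delta_n(X_j) + \Delta_n(X_i) \epsilon_j + \Delta_n(X_j) \epsilon_i] \\
&\le \frac{\sigma^2}{n} + \frac{1}{n^2} \sum_{i=1}^{n} E[\Delta_n(X_i)^2] + \frac{1}{n^2} \sum_{i \neq j} \Big\{ \sqrt{E[\Delta_n(X_i)^2] E[\Delta_n(X_j)^2]} + \sigma \sqrt{E[\Delta_n(X_i)^2]} + \sigma \sqrt{E[\Delta_n(X_j)^2]} \Big\} \\
&\overset{id}{=} \frac{\sigma^2}{n} + \frac{E[\Delta_n(X_1)^2]}{n} + \Big(1 - \frac{1}{n}\Big) \Big( E[\Delta_n(X_1)^2] + 2 \sigma \sqrt{E[\Delta_n(X_1)^2]} \Big) \longrightarrow 0
\end{aligned}
$$

by Corollary 1 as $n \to \infty$, which completes the proof.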
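For orientation, the two limits above connect to the estimator in (7) through the usual decomposition of a sample variance into a mean of squares minus a squared mean. The display below is a short worked step written in the notation of (6) and (7); it covers only the asymptotic unbiasedness part of the statement.

```latex
\hat{\sigma}^2_{RF}
  = \frac{1}{n} \sum_{i=1}^{n} (\hat{\epsilon}_i - \bar{\epsilon}_\cdot)^2
  = \frac{1}{n} \sum_{i=1}^{n} \hat{\epsilon}_i^2 - \bar{\epsilon}_\cdot^2,
\qquad
E[\hat{\sigma}^2_{RF}]
  = E\Big[\frac{1}{n} \sum_{i=1}^{n} \hat{\epsilon}_i^2\Big]
    - E\Big[\Big(\frac{1}{n} \sum_{i=1}^{n} \hat{\epsilon}_i\Big)^2\Big]
  \longrightarrow \sigma^2 - 0 = \sigma^2 .
```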

###### Proof of Theorem 2.

To be mathematically precise, let and

be defined on some probability space

and let the parametric bootstrap variables be defined on another probability space . Then, all random variables can be defined (via projections) on the joined product space ; explaining the assumption that the random variables are independent from and i.i.d. generated from a distribution with finite second moment with and .

Within this framework consider and denote with the set of the -th bootstrapped sample for . Then the sequence of sets is independent. In particular, conditioned on , forms a sequence of i.i.d. random variables.

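Now, note that random forest models are weighted sums of the response variable. Hence, denoting by $A_n(X_i; \Theta)$ the hyper-rectangle containing $X_i$ that is obtained after constructing one random decision tree with seed parameter $\Theta$, the infinite random forest model can be rewritten as

$$ m^{OOB}_n(X_i) = \sum_{j=1}^{n} W^{OOB}_{n,j}(X_i) Y_j, \tag{10} $$

see, e.g., the proof of Theorem 2 in scornet2015consistency for a similar observation. Here, $\sum_{j=1}^{n} W^{OOB}_{n,j}(X_i) = 1$ holds almost surely and the weights are defined as

$$ W^{OOB}_{n,j}(X_i) = E_\Theta\left[ \frac{\mathbb{1}\{X_j \in A_n(X_i; \Theta)\}}{N_n(A_n(X_i; \Theta))} \right], $$

where $N_n(A_n(X_i; \Theta))$ is the number of data points falling in $A_n(X_i; \Theta)$. Further, let $\{X_j \overset{\Theta}{\leftrightarrow} X_i\}$ be the event that both points, $X_j$ and $X_i$, fall in the same cell under the tree constructed by $\Theta$. Due to sampling without replacement, there are $\binom{n-2}{a_{n-1}-1}$ choices to pick a fixed observation $X_j$. Therefore, we obtain

$$ W^{OOB}_{n,j}(X_i) \le \max_{1 \le i \le n} P_\Theta\big(X_j \overset{\Theta}{\leftrightarrow} X_i\big) \le \frac{\binom{n-2}{a_{n-1}-1}}{\binom{n-1}{a_{n-1}}} \le \frac{a_{n-1}}{n-1}. \tag{11} $$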
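The last step in (11) rests on a ratio of binomial coefficients: among all subsamples of size $a$ drawn without replacement from $n-1$ points, the share containing one fixed point is $a/(n-1)$. A quick numerical check, writing the subsample size as a generic `a` (the function name `same_cell_upper_bound` is hypothetical):

```python
from math import comb


def same_cell_upper_bound(n, a):
    """Share of size-a subsamples of n-1 points that contain one fixed point."""
    return comb(n - 2, a - 1) / comb(n - 1, a)
```

The ratio simplifies exactly to `a / (n - 1)`, so the bound on the weights shrinks whenever the subsample size grows more slowly than the sample size.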

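Setting $A_{n,i} = \sum_{j=1}^{n} W^{OOB}_{n,j}(X_i) \big(m^{OOB}_n(X_j) - m(X_j)\big)$ and $B_{n,i} = \sum_{j=1}^{n} W^{OOB}_{n,j}(X_i) (\epsilon^*_j - \epsilon_j)$, we obtain the following result for every fixed $i$ using the Cauchy-Schwarz inequality:

$$
\begin{aligned}
E\big[(m^{OOB}_{n,b}(X_i) - m^{OOB}_n(X_i))^2\big]
&= E\Big[\Big(\sum_{j=1}^{n} W^{OOB}_{n,j}(X_i) Y^*_j - m^{OOB}_n(X_i)\Big)^2\Big] \\
&= E\Big[\Big(\sum_{j=1}^{n} W^{OOB}_{n,j}(X_i) \cdot (Y^*_j - Y_j)\Big)^2\Big] \\
&= E[(A_{n,i} + B_{n,i})^2] \\
&\le E[A_{n,i}^2] + 2 \sqrt{E[A_{n,i}^2] E[B_{n,i}^2]} + E[B_{n,i}^2].
\end{aligned}
\tag{12}
$$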

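In order to prove $L_1$-consistency of the bootstrap-corrected estimate, based on (5), we only need to show that $E[A_{n,i}^2] \to 0$ and $E[B_{n,i}^2] \to 0$ as $n \to \infty$. Now, note that $E[(\epsilon^*_j - \epsilon_j)^2 | \mathcal{D}_n] = \hat{\sigma}^2_{RF} + \epsilon_j^2$ almost surely. Conditioning on $\mathcal{D}_n$, we know that $\epsilon^*_j$ and $\epsilon^*_\ell$ are independent for $j \neq \ell$, such that $E[(\epsilon^*_j - \epsilon_j)(\epsilon^*_\ell - \epsilon_\ell) | \mathcal{D}_n] = \epsilon_j \epsilon_\ell$ almost surely. Combining these two results, we obtain with (11):

$$
\begin{aligned}
E[B_{n,i}^2]
&= \sum_{j=1}^{n} E\big[W^{OOB}_{n,j}(X_i)^2 (\epsilon^*_j - \epsilon_j)^2\big] + \sum_{j \neq \ell} E\big[W^{OOB}_{n,j}(X_i) W^{OOB}_{n,\ell}(X_i) (\epsilon^*_j - \epsilon_j)(\epsilon^*_\ell - \epsilon_\ell)\big] \\
&= \sum_{j=1}^{n} E\big[W^{OOB}_{n,j}(X_i)^2 E[(\epsilon^*_j - \epsilon_j)^2 | \mathcal{D}_n]\big] + \sum_{j \neq \ell} E\big[W^{OOB}_{n,j}(X_i) W^{OOB}_{n,\ell}(X_i) E[(\epsilon^*_j - \epsilon_j)(\epsilon^*_\ell - \epsilon_\ell) | \mathcal{D}_n]\big] \qquad (13) \\
&= \sum_{j=1}^{n} E\big[W^{OOB}_{n,j}(X_i)^2 (\hat{\sigma}^2_{RF} + \epsilon_j^2)\big] + \sum_{j \neq \ell} E\big[W^{OOB}_{n,j}(X_i) W^{OOB}_{n,\ell}(X_i) \epsilon_j \epsilon_\ell\big] \\
&\le \frac{a_{n-1}^2}{n-1} \cdot \frac{n}{n-1} \big(E[\hat{\sigma}^2_{RF}] + \sigma^2\big) + \frac{a_{n-1}^2}{(n-1)^2} \sum_{j \neq \ell} E[\epsilon_j \epsilon_\ell] \\
&= \frac{a_{n-1}^2}{n-1} \cdot \frac{n}{n-1} \big(E[\hat{\sigma}^2_{RF}] + \sigma^2\big) \longrightarrow 0, \quad n \to \infty, \qquad (14)
\end{aligned}
$$

where the inequality results from applying (11) to the weights, and the last equality from the fact that $E[\epsilon_j \epsilon_\ell] = 0$ for $j \neq \ell$, the convergence from Theorem 1 and $a_{n-1}^2 / (n-1) \to 0$. Furthermore, by applying Jensen’s inequality, we obtain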

 E[A2n,i] ≤E[n∑j=1WOOBnj(Xi)(mOOBn(Xj)−m(Xj))2] =E[n∑j=1j≠iWOOBnj(Xi)(mOOBn(Xj)−m(Xj))2]+ +E[EΘ[Nn(An(Xi))−1](mOOBn(Xi)−m(Xi))