Complete Analysis of a Random Forest Model

05/07/2018
by Jason M. Klusowski, et al.

Random forests have become an important tool for improving accuracy in regression problems since their popularization by (Breiman, 2001) and others. In this paper, we revisit a random forest model originally proposed by (Breiman, 2004) and later studied by (Biau, 2012), where a feature is selected at random and the split occurs at the midpoint of the block containing the chosen feature. If the regression function is sparse and depends only on a small, unknown subset of S out of d features, we show that, given n observations, this random forest model outputs a predictor that has a mean-squared prediction error of order (n √(log^(S-1) n))^(-1/(S log 2 + 1)). When S ≤ 0.72d, this rate is better than the minimax optimal rate n^(-2/(d+2)) for d-dimensional, Lipschitz function classes. As a consequence of our analysis, we show that the variance of the forest decays with the depth of the tree at a rate that is independent of the ambient dimension, even when the trees are fully grown. In particular, if ℓ_avg (resp. ℓ_max) is the average (resp. maximum) number of observations per leaf node, we show that the variance of this forest is Θ(ℓ_avg^(-1) (√(log n))^(-(S-1))), which, for the case S = d, is similar in form to the lower bound Ω(ℓ_max^(-1) (log n)^(-(d-1))) of (Lin and Jeon, 2006) for any random forest model with a nonadaptive splitting scheme. We also show that the bias is tight for any linear model with a nonzero parameter vector. Thus, we completely characterize the fundamental limits of this random forest model. Our new analysis also implies that better theoretical performance can be achieved if the trees are grown less aggressively (i.e., grown to a shallower depth) than previous work would otherwise recommend.
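To make the model under study concrete, here is a minimal Python sketch (not code from the paper) of the midpoint-split forest described above: at each internal node a feature is chosen uniformly at random and the node's rectangular cell is split at its midpoint along that feature; the forest averages the predictions of independently randomized trees. The names grow_tree, predict_tree, and random_forest_predict, and the assumption that the data lie in [0,1]^d, are illustrative only.

import numpy as np

def grow_tree(X, y, lower, upper, depth, rng):
    # Recursively build a midpoint-split tree over the cell [lower, upper].
    if depth == 0 or len(y) <= 1:
        # Leaf: predict the average response of observations falling in the cell.
        return ("leaf", y.mean() if len(y) else 0.0)
    j = rng.integers(X.shape[1])       # feature selected uniformly at random
    mid = 0.5 * (lower[j] + upper[j])  # split at the midpoint of the cell
    left = X[:, j] <= mid
    left_upper, right_lower = upper.copy(), lower.copy()
    left_upper[j], right_lower[j] = mid, mid
    return ("node", j, mid,
            grow_tree(X[left], y[left], lower, left_upper, depth - 1, rng),
            grow_tree(X[~left], y[~left], right_lower, upper, depth - 1, rng))

def predict_tree(tree, x):
    # Walk from the root to the leaf cell containing x.
    while tree[0] == "node":
        _, j, mid, left, right = tree
        tree = left if x[j] <= mid else right
    return tree[1]

def random_forest_predict(X, y, x, n_trees=100, depth=8, seed=0):
    # Average the predictions of independently randomized midpoint-split trees.
    rng = np.random.default_rng(seed)
    lower, upper = np.zeros(X.shape[1]), np.ones(X.shape[1])  # data assumed in [0,1]^d
    trees = [grow_tree(X, y, lower, upper, depth, rng) for _ in range(n_trees)]
    return np.mean([predict_tree(t, x) for t in trees])

As a side note on the constant 0.72: ignoring logarithmic factors, the forest's rate improves on the minimax rate n^(-2/(d+2)) exactly when 1/(S log 2 + 1) > 2/(d+2), i.e., when S < d/(2 log 2) ≈ 0.72d, which appears to be the source of the threshold quoted in the abstract.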
