Extrapolated cross-validation for randomized ensembles

02/27/2023
by   Jin-Hong Du, et al.

Ensemble methods such as bagging and random forests are ubiquitous in fields ranging from finance to genomics. However, the efficient tuning of ensemble parameters has received relatively little attention. In this paper, we propose a cross-validation method, ECV (Extrapolated Cross-Validation), for tuning the ensemble and subsample sizes of randomized ensembles. Our method builds on two main ingredients: initial risk estimators for small ensemble sizes based on out-of-bag errors, and a novel risk extrapolation technique that leverages the structure of the prediction risk decomposition. By establishing uniform consistency over ensemble and subsample sizes, we show that ECV yields δ-optimal (with respect to the oracle-tuned risk) ensembles for squared prediction risk. Our theory accommodates general ensemble predictors, requires only mild moment assumptions, and allows for high-dimensional regimes where the feature dimension grows with the sample size. As an illustrative example, we employ ECV with random forests to predict surface protein abundances from gene expression in single-cell multiomics. Compared to sample-split cross-validation and K-fold cross-validation, ECV achieves higher accuracy by avoiding sample splitting, and its computational cost is considerably lower owing to the risk extrapolation technique. Further numerical results demonstrate the finite-sample accuracy of ECV for several common ensemble predictors.
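To give a flavor of the extrapolation idea, here is a minimal, self-contained sketch. It assumes (as the abstract's risk decomposition suggests for squared prediction risk) that the risk of an M-ensemble takes the form R_M = a + b/M, so that out-of-bag risk estimates at the two smallest ensemble sizes, R_1 and R_2, pin down a = 2·R_2 − R_1 and b = 2·(R_1 − R_2) and let us extrapolate to any M. The least-squares base learner, the toy data, and the subsample size are illustrative choices, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: linear signal plus noise.
n, p = 200, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

def fit_base(Xs, ys):
    """Hypothetical base learner: least squares on a subsample."""
    coef, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return coef

# Fit just two base learners on random subsamples of size k,
# tracking which observations are out-of-bag (OOB) for each.
k = 100
coefs, oob = [], []
for _ in range(2):
    idx = rng.choice(n, size=k, replace=False)
    mask = np.ones(n, dtype=bool)
    mask[idx] = False
    coefs.append(fit_base(X[idx], y[idx]))
    oob.append(mask)

# OOB risk estimate at ensemble size 1: average squared error
# of the individual base learners on their own OOB points.
R1 = np.mean([np.mean((y[m] - X[m] @ c) ** 2) for c, m in zip(coefs, oob)])

# OOB risk estimate at ensemble size 2: squared error of the averaged
# prediction on points that are OOB for both learners.
both = oob[0] & oob[1]
pred2 = 0.5 * (X[both] @ coefs[0] + X[both] @ coefs[1])
R2 = np.mean((y[both] - pred2) ** 2)

def ecv_extrapolate(R1, R2, M):
    """Extrapolate risk to ensemble size M under R_M = a + b/M."""
    a = 2.0 * R2 - R1
    b = 2.0 * (R1 - R2)
    return a + b / M

# Predicted risk of a size-50 ensemble without ever fitting 50 learners.
print(ecv_extrapolate(R1, R2, 50))
```

By construction the extrapolation interpolates the two measured points (M = 1 recovers R_1, M = 2 recovers R_2), which is why only two small ensembles need to be fit before tuning over a grid of ensemble sizes; the paper's actual estimators and consistency guarantees are of course more involved than this sketch.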


