Efficient estimation and correction of selection-induced bias with order statistics

09/07/2023
by Yann McLatchie, et al.

Model selection aims to identify a sufficiently well-performing model that is possibly simpler than the most complex model among a pool of candidates. However, the decision-making process itself can inadvertently introduce non-negligible bias when the cross-validation estimates of predictive performance are marred by excessive noise. In finite data regimes, cross-validated estimates can encourage the statistician to select one model over another when it is not actually better for future data. While this bias remains negligible when there are few models, as the pool of candidates grows and model selection decisions are compounded (as in forward search), the expected magnitude of selection-induced bias is likely to grow too. This paper introduces an efficient approach to estimate and correct selection-induced bias based on order statistics. Numerical experiments demonstrate the reliability of our approach in estimating both selection-induced bias and over-fitting along compounded model selection decisions, with specific application to forward search. This work represents a lightweight alternative to more computationally expensive approaches to correcting selection-induced bias, such as nested cross-validation and the bootstrap. Our approach rests on several theoretical assumptions, and we provide a diagnostic to help understand when these may not be valid and when to fall back on safer, albeit more computationally expensive, approaches. The accompanying code facilitates its practical implementation and fosters further exploration in this area.
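
To make the role of order statistics concrete, the following is a minimal illustrative sketch and not the paper's method or its accompanying code. It assumes a hypothetical setting in which K candidate models are all equally good, so each cross-validation estimate is the true performance plus independent Gaussian noise with standard deviation `sigma`. Selecting the best-looking model then reports the maximum of K noise draws, whose expectation is the selection-induced bias. The function names are invented for this example, and the closed-form value uses Blom's approximation to the expected largest normal order statistic.

```python
import random
from statistics import NormalDist, mean


def expected_max_std_normal(k, alpha=0.375):
    """Blom's approximation to E[max of k iid N(0, 1) draws]."""
    return NormalDist().inv_cdf((k - alpha) / (k - 2 * alpha + 1))


def simulated_selection_bias(k, sigma, n_rep=5000, seed=0):
    """Monte Carlo estimate of E[max of k iid N(0, sigma^2) draws].

    Assumption for this sketch: all k models have identical true
    performance, so any apparent winner is pure noise.
    """
    rng = random.Random(seed)
    return mean(
        max(rng.gauss(0.0, sigma) for _ in range(k)) for _ in range(n_rep)
    )


if __name__ == "__main__":
    sigma = 1.0  # hypothetical noise level of the CV estimates
    for k in (2, 10, 100):
        approx = sigma * expected_max_std_normal(k)
        sim = simulated_selection_bias(k, sigma)
        print(f"K={k:4d}  order-statistic approx={approx:.3f}  simulated={sim:.3f}")
```

Under these assumptions the bias is zero for a single model and grows with the number of candidates, which matches the abstract's point that selection-induced bias is negligible for few models and increases as the pool grows.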

research
03/09/2022

Cross validation for model selection: a primer with examples from ecology

The growing use of model-selection principles in ecology for statistical...
research
01/25/2019

Rescaling and other forms of unsupervised preprocessing introduce bias into cross-validation

Cross-validation of predictive models is the de-facto standard for model...
research
11/25/2020

Surrogate-based Bayesian Comparison of Computationally Expensive Models: Application to Microbially Induced Calcite Precipitation

Geochemical processes in subsurface reservoirs affected by microbial act...
research
01/09/2018

Test Error Estimation after Model Selection Using Validation Error

When performing supervised learning with the model selected using valida...
research
09/27/2019

Bootstrap Cross-validation Improves Model Selection in Pharmacometrics

Cross-validation assesses the predictive ability of a model, allowing on...
research
01/28/2019

Inference after black box selection

We consider the problem of inference for parameters selected to report o...
research
12/30/2016

Adaptive Lambda Least-Squares Temporal Difference Learning

Temporal Difference learning or TD(λ) is a fundamental algorithm in the ...