Correction of overfitting bias in regression models
Regression analysis with many covariates is becoming increasingly common. When the number p of covariates is of the same order as the number N of observations, statistical inference such as maximum likelihood estimation of the regression and nuisance parameters becomes unreliable due to overfitting, which typically induces systematic biases in (some of) the estimators. Several methods have been proposed in the literature to overcome overfitting bias or to adjust estimates. The vast majority focus on the regression parameters only, either via empirical regularization methods or via expansions for small ratios p/N. Failing to correct the nuisance parameters as well may lead to significant errors in outcome predictions. In this paper we study the overfitting bias of maximum likelihood estimators of the regression and nuisance parameters in parametric regression models in the overfitting regime (p/N < 1). We compute the asymptotic characteristic function of the maximum likelihood estimators, and show how it can be used to estimate their overfitting biases by solving a small set of non-linear equations. These estimated biases enable us to correct the estimators and render them asymptotically unbiased. To illustrate the theory we performed simulation studies for multiple parametric regression models; in all cases we find excellent agreement between theory and simulations.
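The nuisance-parameter bias described above already appears in the simplest setting, linear regression with Gaussian noise, where the ML estimator of the noise variance is biased downward by exactly a factor (1 - p/N). The sketch below (an illustrative simulation, not the paper's general correction method; all parameter values are arbitrary choices) demonstrates this bias and the elementary degrees-of-freedom correction available in this special case:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 1000, 300          # p/N = 0.3: overfitting regime, p of the same order as N
sigma2 = 2.0              # true noise variance (the nuisance parameter)

# Simulate a linear regression model y = X beta + noise
X = rng.normal(size=(N, p))
beta = rng.normal(size=p) / np.sqrt(p)
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=N)

# Maximum likelihood fit of the regression parameters (least squares)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ beta_hat) ** 2)

# ML estimator of the nuisance parameter: biased by a factor (1 - p/N)
sigma2_mle = rss / N
# Corrected estimator: divide by the residual degrees of freedom instead
sigma2_corrected = rss / (N - p)

print(f"true sigma^2:       {sigma2:.3f}")
print(f"ML estimate:        {sigma2_mle:.3f}")   # close to sigma2 * (1 - p/N) = 1.4
print(f"corrected estimate: {sigma2_corrected:.3f}")
```

In more general parametric regression models no such closed-form correction exists, which is where the characteristic-function approach of the paper comes in.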