Linear Regression, Covariate Selection and the Failure of Modelling
It is argued that all model based approaches to the selection of covariates in linear regression have failed. This applies to frequentist approaches based on P-values and to Bayesian approaches although for different reasons. In the first part of the paper 13 model based procedures are compared to the model-free Gaussian covariate procedure in terms of the covariates selected and the time required. The comparison is based on four data sets and two simulations. There is nothing special about these data sets which are often used as examples in the literature. All the model based procedures failed. In the second part of the paper it is argued that the cause of this failure is the very use of a model. If the model involves all the available covariates standard P-values can be used. The use of P-values in this situation is quite straightforward. As soon as the model specifies only some unknown subset of the covariates the problem being to identify this subset the situation changes radically. There are many P-values, they are dependent and most of them are invalid. The Bayesian paradigm also assumes a correct model but although there are no conceptual problems with a large number of covariates there is a considerable overhead causing computational and allocation problems even for moderately sized data sets. The Gaussian covariate procedure is based on P-values which are defined as the probability that a random Gaussian covariate is better than the covariate being considered. These P-values are exact and valid whatever the situation. The allocation requirements and the algorithmic complexity are both linear in the size of the data making the procedure capable of handling large data sets. It outperforms all the other procedures in every respect.
READ FULL TEXT