Variable selection using pseudo-variables
Penalized regression has become a standard tool for model building across a wide range of application domains. Common practice is to tune the amount of penalization to trade off bias and variance or to optimize some other measure of performance of the estimated model. An advantage of such automated model-building procedures is that their operating characteristics are well-defined, i.e., completely data-driven, and can therefore be studied systematically. However, in many applications it is desirable to incorporate domain knowledge into the model-building process. One way to do this is to characterize each model along the solution path of a penalized regression estimator in terms of an operating characteristic that is meaningful within the domain context, and then to allow domain experts to choose among these models using that operating characteristic together with other factors not available to the estimation algorithm. We derive an estimator of the false selection rate for each model along the solution path using a novel variable addition method. The proposed estimator applies to both fixed and random designs and allows for p ≫ n. It can be used to select a model with a pre-specified false selection rate or can be overlaid on the solution path to facilitate interactive model exploration. We characterize the asymptotic behavior of the proposed estimator only in the case of a linear model under a fixed design; however, simulation experiments show that it provides consistently more accurate estimates of the false selection rate than competing methods across a wide range of models.
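As a rough illustration of the variable addition idea, the sketch below appends permuted copies of real predictors (pseudo-variables, which are null by construction), fits a lasso at each penalty value, and uses the number of pseudo-variables selected as a proxy for the number of false selections among the real predictors. This is a generic pseudo-variable heuristic and not the estimator derived in the paper. The function names `lasso_cd` and `estimate_fsr_path`, the permutation scheme, the scaling `n_fake * p / k`, and the ratio used as the false selection rate are all assumptions made for illustration.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            # Soft-thresholding update; yields exact zeros.
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            if new != beta[j]:
                resid -= X[:, j] * (new - beta[j])
                beta[j] = new
    return beta


def estimate_fsr_path(X, y, lambdas, n_pseudo=None, seed=0):
    """Return (lam, n_real, n_fake, fsr_hat) for each penalty in `lambdas`.

    Assumption (illustrative only): the count of selected pseudo-variables,
    rescaled to p candidates, stands in for the number of falsely selected
    real variables. This slightly overstates the null count when some real
    variables are true signals.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = p if n_pseudo is None else n_pseudo

    # Pseudo-variables: row-permuted copies of randomly chosen real columns.
    # Permuting breaks any association with y while keeping the marginals.
    cols = rng.choice(p, size=k, replace=True)
    Z = np.column_stack([rng.permutation(X[:, c]) for c in cols])

    # Standardize the augmented design and center the response.
    XZ = np.hstack([X, Z])
    XZ = (XZ - XZ.mean(axis=0)) / XZ.std(axis=0)
    yc = y - y.mean()

    results = []
    for lam in lambdas:
        beta = lasso_cd(XZ, yc, lam)
        n_real = int(np.count_nonzero(beta[:p]))
        n_fake = int(np.count_nonzero(beta[p:]))
        est_false = n_fake * p / k
        fsr_hat = min(1.0, est_false / max(n_real, 1))
        results.append((lam, n_real, n_fake, fsr_hat))
    return results
```

The returned tuples could be overlaid on a solution-path plot, or scanned for the smallest penalty whose estimated rate stays below a chosen target, which mirrors the two uses described above.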