Bayesian subset selection and variable importance for interpretable prediction and classification

04/20/2021
by Daniel R. Kowal, et al.

Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often eschewed due to selection instability, computational bottlenecks, and lack of post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model ℳ, we elicit predictively competitive subsets using linear decision analysis. The approach is customizable for (local) prediction or classification and provides interpretable summaries of ℳ. A key quantity is the acceptable family of subsets, which leverages the predictive distribution from ℳ to identify subsets that offer near-optimal prediction. The acceptable family spawns new (co-)variable importance metrics based on whether variables (co-)appear in all, some, or no acceptable subsets. Crucially, the linear coefficients for any subset inherit regularization and predictive uncertainty quantification via ℳ. The proposed approach exhibits excellent prediction, interval estimation, and variable selection for simulated data, including p=400 > n. These tools are applied to a large education dataset with highly correlated covariates, where the acceptable family is especially useful. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and features highly competitive prediction with remarkable stability.
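
To make the idea of an acceptable family more concrete, the following is a minimal, simplified sketch rather than the authors' implementation. It assumes that draws from the predictive distribution of ℳ are already available as an array, enumerates small subsets exhaustively, fits each subset's linear coefficients to the predictive mean, and calls a subset "acceptable" when its squared-error loss falls within a margin of the best subset's loss in a sufficient fraction of predictive draws. The function names, the in-sample evaluation, and the thresholds `eta` and `eps` are all illustrative assumptions, not details taken from the abstract.

```python
import itertools
import numpy as np


def acceptable_family(X, y_draws, max_size=2, eta=0.10, eps=0.10):
    """Illustrative acceptable family (hypothetical helper, not the paper's code).

    X        : (n, p) covariate matrix
    y_draws  : (S, n) predictive draws, assumed to come from some model M
    eta, eps : assumed margin and probability thresholds
    """
    n, p = X.shape
    y_hat = y_draws.mean(axis=0)  # predictive mean under M
    subsets, losses = [], []
    for k in range(1, max_size + 1):
        for subset in itertools.combinations(range(p), k):
            Xs = X[:, subset]
            # Linear summary of M: least squares fit to the predictive mean
            beta, *_ = np.linalg.lstsq(Xs, y_hat, rcond=None)
            # Squared-error loss of this summary against each predictive draw
            losses.append(((y_draws - Xs @ beta) ** 2).mean(axis=1))
            subsets.append(subset)
    losses = np.array(losses)  # shape (num_subsets, S)
    best = losses.mean(axis=1).argmin()
    # Fraction of draws in which a subset is within (1 + eta) of the best subset
    prob_close = (losses <= (1 + eta) * losses[best]).mean(axis=1)
    return [s for s, prob in zip(subsets, prob_close) if prob >= eps]


def variable_importance(family, p):
    """Share of acceptable subsets that contain each variable."""
    counts = np.zeros(p)
    for subset in family:
        counts[list(subset)] += 1
    return counts / len(family)
```

Under these assumptions, an importance of 1 means a variable appears in every acceptable subset, 0 means it appears in none, and intermediate values flag variables that are interchangeable with others, which is the situation one would expect with highly correlated covariates. Exhaustive enumeration is only feasible for small p; a search strategy would be needed at the scale the abstract mentions.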
