On the early solution path of best subset selection
The early solution path, which tracks the first few variables that enter the model of a selection procedure, is of profound importance to scientific discovery. In practice, it is often statistically intangible to identify all the important features with no false discovery, let alone the intimidating expense of experiments to test their significance. Such realistic limitation calls for statistical guarantee for the early discovery of a model selector to navigate scientific adventure on the sea of big data. In this paper, we focus on the early solution path of best subset selection (BSS), where the sparsity constraint is set to be lower than the true sparsity. Under a sparse high-dimensional linear model, we establish the sufficient and (near) necessary condition for BSS to achieve sure early selection, or equivalently, zero false discovery throughout its entire early path. Essentially, this condition boils down to a lower bound of the minimum projected signal margin that characterizes the fundamental gap in signal capturing between sure selection models and those with spurious discovery. Defined through projection operators, this margin is independent of the restricted eigenvalues of the design, suggesting the robustness of BSS against collinearity. On the numerical aspect, we choose CoSaMP (Compressive Sampling Matching Pursuit) to approximate the BSS solutions, and we show that the resulting early path exhibits much lower false discovery rate (FDR) than LASSO, MCP and SCAD, especially in presence of highly correlated design. Finally, we apply CoSaMP to perform preliminary feature screening for the knockoff filter to enhance its power.
READ FULL TEXT