On the early solution path of best subset selection

by Ziwei Zhu et al.

The early solution path, which tracks the first few variables that enter the model of a selection procedure, is of profound importance to scientific discovery. In practice, it is often statistically intractable to identify all the important features with no false discovery, let alone the intimidating expense of experiments to test their significance. Such a practical limitation calls for statistical guarantees on the early discoveries of a model selector when navigating large-scale data. In this paper, we focus on the early solution path of best subset selection (BSS), where the sparsity constraint is set below the true sparsity. Under a sparse high-dimensional linear model, we establish a sufficient and (nearly) necessary condition for BSS to achieve sure early selection, or equivalently, zero false discovery throughout its entire early path. Essentially, this condition boils down to a lower bound on the minimum projected signal margin, which characterizes the fundamental gap in signal capturing between sure-selection models and those with spurious discoveries. Because it is defined through projection operators, this margin does not depend on the restricted eigenvalues of the design, suggesting the robustness of BSS against collinearity. On the numerical side, we choose CoSaMP (Compressive Sampling Matching Pursuit) to approximate the BSS solutions, and we show that the resulting early path exhibits a much lower false discovery rate (FDR) than LASSO, MCP, and SCAD, especially in the presence of highly correlated designs. Finally, we apply CoSaMP to perform preliminary feature screening for the knockoff filter to enhance its power.
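The abstract's numerical approach relies on CoSaMP to approximate the sparsity-constrained least-squares problem behind BSS. As a point of reference, here is a minimal sketch of the standard CoSaMP iteration (Needell and Tropp, 2009) in NumPy; it is not the authors' implementation, and the stopping rule and iteration cap are illustrative choices.

```python
import numpy as np

def cosamp(X, y, s, max_iter=50, tol=1e-10):
    # CoSaMP: greedily approximates min ||y - X b||_2 subject to ||b||_0 <= s.
    n, p = X.shape
    x = np.zeros(p)
    r = y.copy()
    for _ in range(max_iter):
        proxy = X.T @ r                                # correlation proxy for the residual
        omega = np.argsort(np.abs(proxy))[-2 * s:]     # 2s largest proxy coordinates
        T = np.union1d(omega, np.flatnonzero(x))       # merge with current support
        b = np.zeros(p)
        b[T] = np.linalg.lstsq(X[:, T], y, rcond=None)[0]  # least squares on merged support
        keep = T[np.argsort(np.abs(b[T]))[-s:]]        # prune back to s largest entries
        x = np.zeros(p)
        x[keep] = b[keep]
        r = y - X @ x                                  # update residual
        if np.linalg.norm(r) <= tol * max(np.linalg.norm(y), 1e-30):
            break
    return x

# Illustrative noiseless example (hypothetical data, not from the paper):
rng = np.random.default_rng(0)
n, p = 100, 20
beta = np.zeros(p)
beta[3], beta[7] = 2.0, -1.5
X = rng.standard_normal((n, p))
y = X @ beta
x_hat = cosamp(X, y, s=2)
support = np.flatnonzero(np.abs(x_hat) > 1e-6)
```

Running CoSaMP with the sparsity parameter `s` set below the true sparsity, as in the paper's early-path experiments, simply amounts to calling `cosamp(X, y, s)` with a smaller `s`.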


