1 Introduction
Sparse approximate solutions to linear systems are desirable for providing interpretable results that succinctly identify important features. For a predictor matrix $X$ and response vector $y$, $L_0$ regularization (Eq. 1), called “best subset selection,” is a natural way to achieve sparsity by directly penalizing nonzero elements of the coefficient vector $\beta$. We consider the Lagrange form of subset selection; since the problem is nonconvex, this is weaker than the constrained form, meaning that all solutions of the Lagrange problem are solutions to a constrained problem. The intuition behind $L_0$ regularization is fortified by theoretical justification. Foster and George (1994) demonstrate that for a general predictor matrix $X$, $L_0$ regularization achieves the asymptotic minimax rate of risk inflation. Unfortunately, it is well known that $L_0$ regularization is nonconvex and NP-hard (Natarajan, 1995).
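(1) $\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_0$
(2) $\min_{\beta} \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$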
Despite the computational difficulty, the optimality of $L_0$ regularization has motivated approximation methods such as that of Zhang (2009), who provides a Forward-Backward greedy algorithm with asymptotic guarantees. Additionally, integer programming has been used to find solutions for problems of bounded size (Konno and Yamamoto, 2009; Miyashiro and Takano, 2015; Gatu and Kontoghiorghes, 2012).
Instead of $L_0$ regularization, it is common to use $L_1$ regularization (Eq. 2), known as the Lasso (Tibshirani, 1996). This convex relaxation of $L_0$ regularization achieves sparse solutions which are model consistent and unique under regularity conditions, which, among other things, limit the correlations between columns of $X$ (Zhao and Yu, 2006; Tibshirani et al., 2013). Additionally, $L_1$ regularization is a reasonable substitute for $L_0$ regularization because the $L_1$ norm is the best convex approximation to the $L_0$ norm (Ramirez et al., 2013). However, on real-world data sets, $L_1$ regularization tends to select incorrect models since the $L_1$ norm shrinks all coefficients, including those which are in the active set (Friedman, 2012; Mazumder et al., 2011). This bias can be particularly troublesome in very sparse settings, where the predictive risk of $L_1$ can be arbitrarily worse than that of $L_0$ (Lin et al., 2008).
In order to take advantage of the computational tractability of $L_1$ regularization and the optimality of $L_0$, we develop the Lass0 (“Lass-zero”), a method which uses an $L_1$ solution as an initialization step to find a locally optimal $L_0$ solution. At each computationally efficient step, the Lass0 improves upon the $L_0$ objective, often finding sparser solutions without sacrificing prediction accuracy.
Previous literature, such as SparseNet (Mazumder et al., 2011), also explored the relationship between $L_1$ and $L_0$ solutions. Yet unlike our approach, SparseNet reparameterizes the problem with the MC+ loss and solves a generalized soft-thresholding problem at each iteration, requiring a large number of problems to be solved to reach $L_0$. Alternatively, Johnson et al. (2015) use the $L_0$ objective as a criterion to select among different models from the LARS (Efron et al., 2004) solution set. Additionally, they compare forward stepwise regression to $L_1$ regression. However, in neither case do they improve upon the $L_1$ results by optimizing $L_0$ directly, as in our work.
In the remainder of this paper, Section 2 details the Lass0 algorithm. Section 3 provides theoretical guarantees for convergence in the orthogonal case and elimination of redundant variables in the general case. Section 4 presents empirical results on synthetic and real-world data. Finally, we conclude in Section 5 with directions for future work in the general context of nonconvex optimization.
2 Lass0
We propose a new method for finding sparse approximate solutions to linear problems, which we call the Lass0. The full pseudocode of the Lass0 algorithm is presented in Algorithm 1, and we refer to its lines throughout this section. The method is initialized by a solution to $L_1$ regularization, $\beta^{L_1}$, given a particular $\lambda$. The Lass0 then uses an iterative algorithm to find a locally optimal solution that minimizes the objective function of the $L_0$ regularization (Eq. 3).
(3) $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_0$
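If $\mathrm{supp}(\cdot)$ indicates the support, the first step in the Lass0 is to compute the ordinary least squares solution constrained such that every zero entry of $\beta^{L_1}$ must remain zero. It is formally defined as
(4) $\arg\min_{\beta \,:\, \mathrm{supp}(\beta) \subseteq \mathrm{supp}(\beta^{L_1})} \|y - X\beta\|_2^2$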
For each entry $i$ of the resulting vector, we compute the effect of individually adding it to or removing it from the support in Lines 6 and 7. Note that by adding an entry to the support, we increase the penalty by $\lambda$, but potentially create a better estimate for $y$, resulting in a lower loss term. Similarly, the opposite may be true when removing an entry from the support set. This procedure yields a new candidate solution vector for each $i$. The candidate which minimizes the objective function is selected in Line 8. Then, in Line 9, we accept it only if it is strictly better than the solution we began with. The iterative algorithm terminates whenever there is no improvement.
This procedure is equivalent to greedy coordinate minimization where we warm-start the optimization procedure with the $L_1$ regularization solution. Additionally, we note that any $L_q$ regularization with $q < 1$ is nonconvex. While the present work focuses on $L_0$ regularization, the Lass0 can be applied to approximate solutions to any other nonconvex regularization with minimal changes.
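The local search described in this section can be sketched in a few lines of NumPy. This is a minimal illustration written from the prose above, not a transcription of Algorithm 1: the function names (`objective`, `ols_on_support`, `lass0`) are hypothetical, the objective is assumed to be scaled as $\tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_0$, and `beta_init` stands in for an $L_1$ solution obtained from any Lasso solver.

```python
import numpy as np

def objective(X, y, beta, lam):
    """Assumed L0 objective: 0.5 * ||y - X beta||^2 + lam * ||beta||_0."""
    resid = y - X @ beta
    return 0.5 * resid @ resid + lam * np.count_nonzero(beta)

def ols_on_support(X, y, support):
    """Least squares using only the columns flagged in `support`; other entries stay zero."""
    beta = np.zeros(X.shape[1])
    idx = np.flatnonzero(support)
    if idx.size:
        beta[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
    return beta

def lass0(X, y, beta_init, lam, max_iter=1000):
    """Greedy search over supports, warm-started from the support of beta_init."""
    support = np.asarray(beta_init) != 0
    beta = ols_on_support(X, y, support)
    best = objective(X, y, beta, lam)
    for _ in range(max_iter):
        candidates = []
        for i in range(X.shape[1]):
            trial = support.copy()
            trial[i] = not trial[i]  # add entry i if absent, remove it if present
            b = ols_on_support(X, y, trial)
            candidates.append((objective(X, y, b, lam), trial, b))
        val, trial, b = min(candidates, key=lambda c: c[0])
        if val >= best:  # accept only a strict improvement, otherwise stop
            break
        support, beta, best = trial, b, val
    return beta

# Tiny demo on an orthonormal design, starting from an all-zero initial solution.
X = np.eye(4)
y = np.array([3.0, 1.5, 0.5, -2.5])
beta_hat = lass0(X, y, beta_init=np.zeros(4), lam=4.0)
```

Each pass refits least squares once per candidate support, so a pass costs one small regression per feature; the loop stops as soon as no single addition or removal lowers the objective.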
3 Theoretical properties
Theorem 1
Assuming that $X$ is orthogonal, the Lass0 solution is the $L_0$ regularization solution.
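Proof. Recall that Lass0 is initialized with the $L_1$ regularization solution. With an orthogonal set of covariates, it is well known that the solution to $L_1$ regularization, $\beta^{L_1}$, is soft-thresholding of the components of $X^\top y$ at level $\lambda$ (Eq. 5). Additionally, it is well known that in this case the solution to $L_0$ regularization, $\beta^{L_0}$, is hard-thresholding of the components of $X^\top y$ at level $\sqrt{2\lambda}$ (Eq. 6).
(5) $\beta^{L_1}_i = \operatorname{sign}(X_i^\top y)\,\big(|X_i^\top y| - \lambda\big)_+$
(6) $\beta^{L_0}_i = X_i^\top y \cdot \mathbb{1}\{|X_i^\top y| > \sqrt{2\lambda}\}$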
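The two thresholding rules can be recovered by a short coordinate-wise calculation. The sketch below assumes the objectives are scaled as $\tfrac{1}{2}\|y - X\beta\|_2^2$ plus $\lambda$ times the penalty and that $X^\top X = I$; the shorthand $z = X^\top y$ is introduced here for convenience, and a different constant in front of the loss would rescale both thresholds.

```latex
\begin{align*}
\tfrac12\|y - X\beta\|_2^2
  &= \tfrac12\|y\|_2^2 - z^\top\beta + \tfrac12\|\beta\|_2^2
   = \text{const} + \tfrac12\sum_i (z_i - \beta_i)^2,
   \qquad z = X^\top y, \\
(L_1)\quad \min_{\beta_i}\ \tfrac12 (z_i-\beta_i)^2 + \lambda|\beta_i|
  &\;\Longrightarrow\; \beta_i = \operatorname{sign}(z_i)\,\big(|z_i|-\lambda\big)_+ , \\
(L_0)\quad \min_{\beta_i}\ \tfrac12 (z_i-\beta_i)^2 + \lambda\,\mathbb{1}[\beta_i\neq 0]
  &\;\Longrightarrow\; \beta_i = z_i\,\mathbb{1}\big[\tfrac12 z_i^2 > \lambda\big]
   = z_i\,\mathbb{1}\big[|z_i| > \sqrt{2\lambda}\big].
\end{align*}
```

In the $L_0$ line, keeping coordinate $i$ costs $\lambda$ and setting it to zero costs $\tfrac12 z_i^2$, so the coordinate is kept exactly when $\tfrac12 z_i^2 > \lambda$.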
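We will prove that the Lass0 solution, $\beta^{\text{Lass0}}$, equals $\beta^{L_0}$. Since the solutions to $L_1$ and $L_0$ regularization depend on $\lambda$, the proof is divided in three cases to cover all possible values of $\lambda$, and we use the same regularization parameter for both algorithms.

Case $\lambda = \sqrt{2\lambda}$: Since $\lambda = \sqrt{2\lambda}$, therefore $\mathrm{supp}(\beta^{L_1}) = \mathrm{supp}(\beta^{L_0})$. Note that in the orthogonal case, the least squares solution is $X^\top y$. In the first step of the Lass0 algorithm we find the constrained least squares solution (Eq. 4), which corresponds to setting its $i$-th entry to $X_i^\top y$ if $|X_i^\top y| > \lambda$, and to $0$ otherwise. Therefore, in the first step the algorithm will reach the hard-thresholding at level $\sqrt{2\lambda}$ and terminate.

Case $\lambda > \sqrt{2\lambda}$: Since $\lambda > \sqrt{2\lambda}$, then $\mathrm{supp}(\beta^{L_1}) \subseteq \mathrm{supp}(\beta^{L_0})$. Let $i \in \mathrm{supp}(\beta^{L_1})$. The Lass0 will only choose to remove element $i$ from the support if
(7) $\tfrac{1}{2}(X_i^\top y)^2 < \lambda$.
Yet such an inequality will never hold, since $\lambda > \sqrt{2\lambda}$ and it would imply
(8) $|X_i^\top y| < \sqrt{2\lambda} < \lambda$,
which contradicts $i \in \mathrm{supp}(\beta^{L_1})$, that is, $|X_i^\top y| > \lambda$. Thus Lass0 will never remove an element from $\mathrm{supp}(\beta^{L_1})$.
Similarly, if we let $j \notin \mathrm{supp}(\beta^{L_1})$, Lass0 will only choose to add element $j$ to the support if
(9) $\tfrac{1}{2}(X_j^\top y)^2 > \lambda$.
Thus Lass0 will add element $j$ to the support if and only if $|X_j^\top y| > \sqrt{2\lambda}$. Therefore, $\mathrm{supp}(\beta^{\text{Lass0}}) = \mathrm{supp}(\beta^{L_0})$. Furthermore, since $\beta^{\text{Lass0}}$ is optimized by OLS, $\beta^{\text{Lass0}} = \beta^{L_0}$.

Case $\lambda < \sqrt{2\lambda}$: Since $\lambda < \sqrt{2\lambda}$, then $\mathrm{supp}(\beta^{L_0}) \subseteq \mathrm{supp}(\beta^{L_1})$. The result that $\beta^{\text{Lass0}} = \beta^{L_0}$ follows from an analogous argument to the above, omitted for the sake of brevity.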
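The statement of Theorem 1 can be checked numerically on a small example. The sketch below is plain Python and works directly with $z = X^\top y$, assuming $X^\top X = I$ and the $\tfrac{1}{2}$-scaled objective, so that (up to a constant) a support $S$ has objective $-\tfrac{1}{2}\sum_{i \in S} z_i^2 + \lambda|S|$. The function names are hypothetical and the values of `z` are made up for illustration.

```python
import math

def soft_threshold(z, t):
    """Shrink each component toward zero by t (the L1 solution under orthonormality)."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in z]

def hard_threshold(z, t):
    """Keep components whose magnitude exceeds t (the L0 solution under orthonormality)."""
    return [v if abs(v) > t else 0.0 for v in z]

def lass0_orthonormal(z, lam):
    """Local search over supports, warm-started from the soft-thresholding support."""
    support = {i for i, v in enumerate(soft_threshold(z, lam)) if v != 0.0}

    def obj(s):
        # Objective of support s, dropping the constant 0.5 * ||y||^2.
        return -0.5 * sum(z[i] ** 2 for i in s) + lam * len(s)

    while True:
        # One candidate per index: add it if absent, remove it if present.
        flips = [support ^ {i} for i in range(len(z))]
        best = min(flips, key=obj)
        if obj(best) >= obj(support):  # stop when no strict improvement exists
            break
        support = best
    # With orthonormal columns, OLS on the support simply copies z there.
    return [z[i] if i in support else 0.0 for i in range(len(z))]

z = [3.0, 1.5, 0.5, -2.5]
results = {lam: lass0_orthonormal(z, lam) for lam in (0.5, 1.2, 2.0, 4.0)}
```

The four values of `lam` cover a soft threshold below, equal to, and above the hard threshold $\sqrt{2\lambda}$, including one case where the search has to remove an entry and one where it has to add one.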
For sparse solutions, it is important to know how a given algorithm will behave when faced with strongly correlated features. For example, the elastic net (Zou and Hastie, 2005) assigns identical coefficients to identical variables. In contrast, $L_1$ regularization picks one of the strongly correlated features. The latter behavior is desirable in situations where including both variables in the support would be considered redundant. We now prove that when two variables are strongly correlated, Lass0 behaves similarly to $L_1$ regularization: it only selects one among a group of strongly correlated features.
Theorem 2
Assume that , then either or (or both).
Proof. Let be the solution at any step of Lass0. We will prove that if both indices , meaning and , at least one of them will become zero in the solution.
Without loss of generality, let be the least squares solution that preserves all the constraints of and also enforces . Let , then , implying .
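As an illustration of the mechanism behind this argument, consider the special case of two identical columns, $X_i = X_j$. This special case, the $\tfrac{1}{2}$-scaled $L_0$ objective (written $f$ below), and $\lambda > 0$ are assumptions made here for concreteness; the vector $\tilde\beta$ is introduced only for this sketch.

```latex
\begin{align*}
\tilde\beta_k &=
  \begin{cases}
    0 & k = i,\\
    \beta_i + \beta_j & k = j,\\
    \beta_k & \text{otherwise,}
  \end{cases}
\qquad\Longrightarrow\qquad
X\tilde\beta = X\beta - X_i\beta_i + X_j\beta_i = X\beta, \\
f(\tilde\beta)
  &= \tfrac12\|y - X\tilde\beta\|_2^2 + \lambda\|\tilde\beta\|_0
   \le \tfrac12\|y - X\beta\|_2^2 + \lambda\big(\|\beta\|_0 - 1\big)
   = f(\beta) - \lambda < f(\beta).
\end{align*}
```

Moving the weight of column $i$ onto column $j$ leaves the fit unchanged while removing at least one nonzero entry, so dropping $i$ from the support (followed by a least squares refit, which can only lower the loss further) is a strictly improving move. Under these assumptions the search cannot terminate with both $i$ and $j$ in the support.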
4 Experimental results
We generate synthetic data from a linear model in which the samples are generated with high correlation among the features. The coefficients are generated with sparsity enforced by setting some of them to zero. We compare the supports of the $L_1$ and Lass0 solutions against the true underlying support. We use 10-fold cross-validation (CV) testing and report the average Hamming distance between the estimated and true supports. Figure 1 shows Hamming distances over different levels of sparsity in the true support. The Lass0 consistently yields models which are closer to the true support than the optimally chosen $L_1$ model.
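The Hamming distance between two supports counts the coordinates on which the estimated and true supports disagree. A minimal sketch in plain Python follows; the function names and the example vectors are hypothetical, and the optional tolerance `tol` is an assumption for handling near-zero estimates.

```python
def support(beta, tol=0.0):
    """Boolean support indicator: True where the coefficient magnitude exceeds tol."""
    return [abs(b) > tol for b in beta]

def hamming_distance(beta_est, beta_true, tol=0.0):
    """Number of coordinates where the two supports differ."""
    est, true = support(beta_est, tol), support(beta_true, tol)
    return sum(a != b for a, b in zip(est, true))

# One false positive (index 2) and one false negative (index 3).
d = hamming_distance([0.5, 0.0, -1.2, 0.0], [1.0, 0.0, 0.0, 2.0])
```

A distance of zero means the estimated model selects exactly the true features, regardless of the coefficient values.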
We evaluate the Lass0 on nine real-world data sets sourced from publicly available repositories (Valdar et al., 2006; Lichman, 2013). Table 1 shows the mean and standard deviation for the normalized root mean squared error (NRMSE) and the cardinality of the support of the estimated $\beta$. For all data sets, both regularization methods produce very similar NRMSE values. However, in most cases the Lass0 reduced the size of the active set, in several cases by more than half. Combined with the above results showing that the Lass0 yields models closer to the true sparse synthetic model, we see that the Lass0 tends to produce sparser, more faithful models than $L_1$ regularization.

Table 1
Data               NRMSE $L_1$     NRMSE Lass0     Features   Support size $L_1$   Support size Lass0
Pyrimidines        101.4 ± 47.5    103.1 ± 42.3    28         16.1 ± 5.5           7.6 ± 5.1
Ailerons           42.2 ± 1.8      42.5 ± 1.9      41         24 ± 3.3             6.9 ± 1.1
Triazines          98.9 ± 14.5     97.5 ± 19.7     61         17.9 ± 9.2           7.3 ± 6.6
Airplane stocks    36.1 ± 5.2      36.4 ± 5.6      10         8.5 ± 0.5            7.8 ± 0.9
Pole Telecomm      73.3 ± 2.1      73.4 ± 2.1      49         22.7 ± 1.1           24.5 ± 0.9
Bank domains       69.9 ± 3        70.7 ± 3.1      33         9.2 ± 2.7            5.2 ± 8.2
Pumadyn domains    89 ± 2.2        88.9 ± 2.4      33         5.2 ± 8.2            1 ± 0
Breast Cancer      93.5 ± 15.3     96.7 ± 19.7     33         16 ± 8.7             18.8 ± 5.7
Mice               103.5 ± 5       105 ± 6.6       100        17 ± 6.6             6.3 ± 4.3
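The size of the reduction in the active set can be read off Table 1 directly. The sketch below uses only the mean support sizes reported in the table (the $L_1$ value first, the Lass0 value second) and computes the relative reduction per data set; the variable names are illustrative and the standard deviations are ignored.

```python
# Mean support sizes from Table 1: (L1, Lass0).
support_sizes = {
    "Pyrimidines": (16.1, 7.6),
    "Ailerons": (24.0, 6.9),
    "Triazines": (17.9, 7.3),
    "Airplane stocks": (8.5, 7.8),
    "Pole Telecomm": (22.7, 24.5),
    "Bank domains": (9.2, 5.2),
    "Pumadyn domains": (5.2, 1.0),
    "Breast Cancer": (16.0, 18.8),
    "Mice": (17.0, 6.3),
}

# Relative reduction of the mean support size; negative values mean Lass0 selected more features.
reduction = {name: 1.0 - lass0 / l1 for name, (l1, lass0) in support_sizes.items()}

smaller = sorted(name for name, r in reduction.items() if r > 0.0)
halved = sorted(name for name, r in reduction.items() if r > 0.5)

for name in support_sizes:
    print(f"{name:16s} {100 * reduction[name]:6.1f}%")
```

Comparing means alone ignores the variability across folds, so these percentages are only a rough summary of the table.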
5 Future work
We intend to build upon Theorem 1 to support our empirical observations. Additionally, we expect that this paper’s general approach can be applied to other nonconvex optimization problems. While convex relaxations may yield interesting problems in their own right, they are often good approximations to nonconvex solutions. Using convex results to initialize an efficient search for a locally optimal nonconvex solution can combine the strengths of convex and nonconvex formulations.
Acknowledgments
We thank Ryan Tibshirani for his advice and fruitful suggestions. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE 1252522 and the National Science Foundation award No. IIS0953330.
References
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2):407–499.
Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. The Annals of Statistics, pages 1947–1975.
Friedman, J. H. (2012). Fast sparse regression and classification. International Journal of Forecasting, 28(3):722–738.
Gatu, C. and Kontoghiorghes, E. J. (2012). Branch-and-bound algorithms for computing the best-subset regression models. Journal of Computational and Graphical Statistics.
Johnson, K. D., Lin, D., Ungar, L. H., Foster, D. P., and Stine, R. A. (2015). A risk ratio comparison of l0 and l1 penalized regression. arXiv preprint arXiv:1510.06319.
Konno, H. and Yamamoto, R. (2009). Choosing the best set of variables in regression analysis using integer programming. Journal of Global Optimization, 44(2):273–282.
Lichman, M. (2013). UCI machine learning repository.
Lin, D., Pitler, E., Foster, D. P., and Ungar, L. H. (2008). In defense of l0. In Workshop on Feature Selection (ICML 2008).
Mazumder, R., Friedman, J. H., and Hastie, T. (2011). SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495).
Miyashiro, R. and Takano, Y. (2015). Subset selection by Mallows' Cp: A mixed integer programming approach. Expert Systems with Applications, 42(1):325–331.
Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234.
Ramirez, C., Kreinovich, V., and Argaez, M. (2013). Why l1 is a good approximation to l0: A geometric explanation. Journal of Uncertain Systems, 7.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
Tibshirani, R. J. et al. (2013). The lasso problem and uniqueness. Electronic Journal of Statistics, 7:1456–1490.
Valdar, W., Solberg, L. C., Gauguier, D., Burnett, S., Klenerman, P., Cookson, W. O., Taylor, M. S., Rawlins, J. N. P., Mott, R., and Flint, J. (2006). Genome-wide genetic association of complex traits in heterogeneous stock mice. Nature Genetics, 38(8):879–887.
Zhang, T. (2009). Adaptive forward-backward greedy algorithm for sparse learning with linear models. In Advances in Neural Information Processing Systems, pages 1921–1928.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.