As technology rapidly advances, modern statistical analysis often encounters regressions with a large number of explanatory variables (also known as features). Hence, sparse regression and variable selection have been studied intensively in the past decades. As an alternative to ordinary linear regression, ridge regression, first proposed in [18], shrinks the coefficients through a squared penalty; however, the resulting estimator is neither sparse nor applicable for variable selection.
To overcome these issues, we consider the sparse ridge regression problem as below:
denotes the response vector, represents the model matrix, is the vector of regression coefficients (i.e., the estimand), and is a positive tuning parameter for the ridge penalty (i.e., the squared penalty). Here, the ℓ0 norm counts the number of nonzero entries of a vector, and the value of represents the number of features to be chosen. The formulation (F0) seeks a best -sparse estimator, i.e., one that minimizes the least squares error plus a squared penalty subject to the cardinality constraint. It is easy to see that when the ridge parameter is zero, (F0) reduces to a special case known as sparse regression. Note that in the signal processing literature (cf. ), the formulation (F0) also coincides with sparse signal recovery. Without loss of generality, let us assume that and , and that the data are normalized such that for all .
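To make the objective of (F0) concrete, the following minimal sketch (our illustration, with assumed names `X`, `y`, `beta`, `lam` for the model matrix, response, coefficients, and ridge tuning parameter) evaluates the penalized least squares objective and checks the cardinality constraint:

```python
def f0_objective(X, y, beta, lam):
    """Least-squares error plus the squared (ridge) penalty in (F0)."""
    resid = [yi - sum(xij * bj for xij, bj in zip(row, beta))
             for row, yi in zip(X, y)]
    return sum(r * r for r in resid) + lam * sum(b * b for b in beta)

def is_k_sparse(beta, k, tol=1e-12):
    """The cardinality (l0) constraint: at most k nonzero entries."""
    return sum(abs(b) > tol for b in beta) <= k
```

For example, with `X = [[1, 0], [0, 1]]`, `y = [1, 2]`, and `beta = [1, 0]`, the objective is `4.5` when `lam = 0.5`, and `beta` is 1-sparse.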
scenarios with equal probability, where the th scenario set is for . The constraint means that at most a fraction of the scenarios can be violated. Hence, we can reformulate (F0) as the following CCP:
where denotes the indicator function. In Section 2, we investigate how applicable recent progress on CCPs (e.g., [1, 26, 31]) is to solving (F0-CCP). It appears that many existing approaches do not work well, either due to scalability issues or because they yield trivial solutions. In Section 4, we propose two novel scalable algorithms, and their integration, to solve sparse ridge regression with theoretical guarantees.
Relevant Literature. Ridge regression has been extensively studied in statistics [13, 27, 41]. However, despite its many desirable properties, the ridge estimator is typically not sparse, since it is computed using the smooth squared penalty. Inducing sparsity in regression has also attracted a significant amount of work, including the LASSO with the ℓ1 penalty, the Bridge estimator, the SCAD with a non-convex penalty, and the MCP with the minimax concave penalty, among many others. Several excellent and comprehensive reviews of sparse regression are also available. In particular, it is worth mentioning that Zou and Hastie proposed the well-known "elastic net" approach, which integrates the ridge penalty (i.e., the squared penalty) and the ℓ1 penalty into the ordinary least squares objective to obtain a sparse estimator. However, similar to the LASSO, the elastic net might not consistently find the true sparse solution. In contrast, we introduce a cardinality constraint in (F0), which strictly enforces the sparsity of the estimator and therefore yields a best -sparse estimator.
There have been various effective approximation algorithms and heuristics introduced to solve sparse regression [11, 12, 16, 24, 25, 29]. For example, Das and Kempe studied the greedy approach (or forward stepwise selection method) and proved its approximation guarantee when the covariance matrix is nearly the identity and has constant bandwidth. However, the greedy approach has been found prohibitively expensive when the number of features becomes large. Recently, Hazimeh and Mazumder integrated a coordinate descent method with local combinatorial search, and reported that the proposed method outperforms existing ones. However, this method does not provide any provable guarantee on solution quality. Many researchers have also attempted to solve sparse regression by developing exact algorithms (e.g., branch and cut) or by using mixed integer program (MIP) solvers. It has been shown that for certain large-sized instances with large signal-to-noise ratios, the proposed MIP approaches with a warm start (a good initial solution) work quite well and can yield very high-quality solutions [3, 4, 28]. In particular, Bertsimas and Van Parys also studied sparse ridge regression and developed a branch and cut algorithm. However, our numerical study shows that these exact approaches can only solve medium-sized instances to near-optimality, and their performance relies heavily on the speed of commercial solvers and can vary significantly from one dataset to another. In this work, our emphasis is on developing fast approximation algorithms with attractive scalability and theoretical performance guarantees.
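As an illustration of the forward stepwise idea discussed above (a generic sketch under assumed names `X`, `y`, `k`, `lam`, not the specific algorithm of any cited work), one can greedily add the feature whose inclusion most reduces the ridge objective, solving the small restricted ridge system in closed form at each step:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for small dense systems."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge_obj_on(X, y, S, lam):
    """Ridge objective restricted to feature subset S (closed-form inner solve)."""
    S = list(S)
    n = len(y)
    # Normal equations of the restricted ridge problem.
    G = [[sum(X[i][a] * X[i][b] for i in range(n)) + (lam if a == b else 0.0)
          for b in S] for a in S]
    rhs = [sum(X[i][a] * y[i] for i in range(n)) for a in S]
    beta = solve(G, rhs)
    resid = [y[i] - sum(X[i][j] * bj for j, bj in zip(S, beta)) for i in range(n)]
    return sum(r * r for r in resid) + lam * sum(b * b for b in beta)

def greedy_forward(X, y, k, lam):
    """Add, one at a time, the feature that most reduces the ridge objective."""
    S, p = [], len(X[0])
    for _ in range(k):
        j_best = min((j for j in range(p) if j not in S),
                     key=lambda j: ridge_obj_on(X, y, S + [j], lam))
        S.append(j_best)
    return sorted(S)
```

Each of the `k` rounds solves at most `p` small ridge systems; the paper's Section 4 develops a more efficient variant of this selection scheme.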
Our Approaches and Contributions. In this work, we study the sparse ridge regression (F0) from the angle of the chance-constrained program (F0-CCP). We first investigate various existing approaches for CCPs to solve (F0-CCP). One particular approach, which has been used to solve sparse regressions, is to introduce one binary variable for each indicator function in (F0-CCP) and linearize it with a big-M coefficient. Oftentimes, such a method can be very slow, in particular for large-scale datasets. To overcome this challenge, we develop a big-M free mixed integer second order conic (MISOC) reformulation of (F0-CCP). We further show that its continuous relaxation is equivalent to that of a mixed integer convex (MIC) formulation in [4, 11]. Moreover, these two formulations motivate us to construct a greedy approach (i.e., forward selection) that is much more efficient than those in the literature, and we establish its performance guarantee. A randomized algorithm is also studied by investigating the continuous relaxation of the proposed MISOC formulation. Our numerical study shows that the proposed methods work quite well; in particular, the greedy approach outperforms the other methods in both running time and accuracy of variable selection. The contributions are summarized below:
We investigate theoretical properties of three existing approaches for CCPs applied to (F0-CCP), i.e., the big-M method, the conditional value-at-risk (CVaR) approach, and a recent heuristic algorithm, and shed some light on why those methods are not well suited to solving the sparse ridge regression (F0).
We establish a mixed integer second order conic (MISOC) reformulation of (F0-CCP) based on the perspective formulation, and prove that its continuous relaxation is equivalent to that of the mixed integer convex formulation in [4, 11]. We also show that the proposed MISOC formulation can be stronger than the naive big-M formulation.
Based on the reformulations, we develop an efficient greedy approach for solving (F0-CCP), and prove its performance guarantee under a mild condition. From our analysis, the proposed greedy approach is theoretically sound and computationally efficient.
By establishing a relationship between the continuous relaxation value of the MISOC formulation and the optimal value of (F0-CCP), we develop a randomized algorithm based on the optimal continuous relaxation solution of the MISOC formulation and derive its theoretical properties. Such a continuous relaxation solution can help reduce the number of potential features and thus can be integrated with the greedy approach.
The remainder of the paper is organized as follows.
Section 2 investigates the applicability of several existing approaches of CCP to the sparse ridge regression (F0).
Section 3 develops two big-M free mixed integer convex program formulations and proves their equivalence.
Section 4 proposes and analyzes two scalable algorithms and proves their performance guarantees.
The numerical studies of the proposed scalable algorithms are presented in Section 5. We conclude this work with some discussion in Section 6.
The following notation is used throughout the paper. We use bold letters (e.g., ) to denote vectors or matrices, and corresponding non-bold letters to denote their components. Given a positive integer , we let and let denote the identity matrix. Given a subset , we let denote the subvector of with entries from subset , and a matrix with a subset of columns from . For a matrix , we let
denote its smallest and largest singular values, respectively. Given a vector , we let be a diagonal matrix with diagonal entries from . For a matrix , we let denote its th column. Given a set , we let denote its convex hull. Given a finite set , we let denote its cardinality. Given two sets , we let denote the set of elements in but not in , let denote the union of and , and let be their symmetric difference, i.e., .
2 Existing Solution Approaches
In this section, we investigate three commonly used solution approaches to solve (F0-CCP).
2.1 Big-M Method
One typical method for a CCP is to formulate it as a mixed integer program (MIP) by introducing a binary variable for each scenario , i.e., , and then using the big-M method to linearize it: supposing that with a sufficiently large positive number , then is equivalent to . Therefore, (F0-CCP) can be reformulated as the following MIP:
The advantage of (F0-big-M) is that it can be directly solved by off-the-shelf solvers (e.g., CPLEX, Gurobi). However, one has to choose the vector properly.
There are many ways to choose the big-M coefficients. One typical way is, for each , to let be equal to the largest value of given that the optimal value of (F0-CCP) is bounded by , i.e., to let be equal to the larger of the optimal values of the following two convex quadratic programs:
To solve (1), one needs to compute an upper bound of the optimal value. Note that the objective value of any feasible solution to (F0-CCP) suffices; a naive upper bound is the objective value at the all-zero solution, which is feasible to (F0-CCP). On the other hand, to obtain the vector , one has to solve the two convex quadratic programs in (1) for each , which can be very time-consuming, in particular when the number of features is large.
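One simple closed-form bound follows from the ridge penalty alone: any solution with objective value at most v̄ satisfies λβ_j² ≤ λ‖β‖² ≤ v̄, so |β_j| ≤ √(v̄/λ), and the feasible point β = 0 supplies v̄ = ‖y‖². A sketch under these assumptions (symbol names `y`, `lam` are ours; the paper's proposition also derives a second, data-dependent bound that we omit here):

```python
import math

def naive_big_m(y, lam):
    """Closed-form big-M: any solution with objective value at most
    v_bar satisfies lam * beta_j**2 <= lam * ||beta||**2 <= v_bar,
    where v_bar = ||y||**2 is the objective value at beta = 0."""
    v_bar = sum(yi * yi for yi in y)   # naive upper bound on the optimal value
    return math.sqrt(v_bar / lam)      # valid bound on |beta_j| for every j
```

For instance, `naive_big_m([3.0, 4.0], 1.0)` returns `5.0`.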
Here we instead derive a slightly weaker but closed-form choice of the vector . Note that all the convex programs in (1) share the same constraint. Thus, the key proof idea is to relax this constraint into a weaker one, which is more amenable to a closed-form upper bound on the vector . This result is summarized below.
Suppose that is an upper bound to , then vector can be chosen as
In (1), it is sufficient to find an upper bound of the vector for any feasible solution satisfying
First, the above constraint implies that
where the first inequality is due to the inequality . Therefore, we have
where the first inequality is due to . Since , we have
for each .
On the other hand, we note that
Thus, another upper bound can be developed by letting , which implies that for each . ∎
It is known that the MIP (F0-big-M) with big-M coefficients typically has a very weak continuous relaxation. Consequently, there has been significant research on improving the big-M coefficients of (F0-big-M), for example, [1, 3, 32, 33, 37]. However, the tightening procedures tend to be time-consuming, in particular for large-scale datasets. In Section 3, we will derive two big-M free MIP formulations, whose continuous relaxations can be proven to be stronger than that of (F0-big-M).
2.2 Conditional Value-at-Risk (CVaR) Approximation
Another well-known approximation of a CCP is the so-called conditional value-at-risk (CVaR) approximation (see for details), which replaces the nonconvex probabilistic constraint by a convex one. For the sparse ridge regression in (F0-CCP), the resulting formulation is
The only feasible solution to (3) is , i.e., .
Therefore, the CVaR approach yields a trivial solution to (F0-CCP). Hence, it is not a desirable approach, and alternatives are preferred.
2.3 A Heuristic Algorithm
In recent work, Ahmed et al. proposed a heuristic algorithm for CCPs with discrete distributions. It was reported that such a method can solve most of their numerical instances to near-optimality (i.e., within a 4% optimality gap). The key idea of the heuristic algorithm is to minimize the sum of infeasibilities over all scenarios when the objective value is bounded from above. Specifically, they considered the following optimization problem:
Let be an optimal solution to (4) given an upper bound . The heuristic algorithm decreases the value of if , and increases it otherwise. This bisection procedure terminates after a finite number of iterations; the detailed procedure is described in Algorithm 1. Let denote the output solution of Algorithm 1. Then clearly,
It is worth remarking that, for any given upper bound , the formulation (4) is similar to the Dantzig selector. The difference between Algorithm 1 and LASSO is that this iterative procedure simultaneously guarantees sparsity and reduces the regression error, while LASSO seeks a trade-off between the error and the penalty. We also note that Algorithm 1 might not be computationally efficient, since it requires solving (4) multiple times. To the best of our knowledge, there is no known performance guarantee for Algorithm 1.
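The control flow of this heuristic can be sketched as a bisection loop. In the sketch below, the inner problem (4) is replaced by a hypothetical oracle `count_violations` (a stand-in for solving (4) and counting violated scenarios), so this shows the search structure only, not the paper's implementation:

```python
def bisection_heuristic(count_violations, max_violations, lo, hi, tol=1e-6):
    """Bisection on the objective upper bound eps: shrink eps while the
    inner problem's solution stays feasible (few enough violated scenarios),
    grow it otherwise. Returns the smallest feasible eps found."""
    best = hi
    while hi - lo > tol:
        eps = 0.5 * (lo + hi)
        if count_violations(eps) <= max_violations:
            best = eps      # feasible: try to tighten the bound
            hi = eps
        else:
            lo = eps        # infeasible: relax the bound
    return best

# Toy oracle: feasible iff eps >= 3.7 (monotone, like the true inner problem).
found = bisection_heuristic(lambda eps: 0 if eps >= 3.7 else 99, 0, 0.0, 10.0)
```

The loop runs for a number of iterations logarithmic in the initial interval width divided by `tol`, which matches the finite-termination remark above.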
3 Two Reformulations of Sparse Ridge Regression: Big-M Free
Note that the big-M formulation (F0-big-M) is quite compact, since it involves only a modest number of variables. However, it is usually a weak formulation in the sense that the continuous relaxation value of (F0-big-M) can be quite far from the optimal value. In this section, we propose two big-M free reformulations of (F0-CCP) from two distinct perspectives and prove their equivalence.
3.1 Mixed Integer Second Order Conic (MISOC) Formulation
In this subsection, we will derive an MISOC formulation and its analytical properties. To begin with, we first make an observation from the perspective formulation in . Let us consider a nonconvex set
for each . The results in show that the convex hull of , denoted , can be characterized as below.
(Lemma 3.1 in ) For each , the convex hull of the set is
Lemma 1 suggests an extended formulation for (F0-CCP) without big-M coefficients. To achieve this, we first introduce a variable as an upper bound of for each , together with a binary variable ; thus, (F0-CCP) is equal to
which can be equivalently reformulated as
The above result is summarized in the following theorem.
The formulation (F0-CCP) is equivalent to
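On our reading, the key ingredient of (F0-MISOC) is the standard perspective constraint: for each coordinate, β² ≤ t·z with z ∈ [0, 1]. When z is binary, this enforces the on/off logic without any big-M, as the following small membership check illustrates (the constraint form is our assumption about the lemma's convex hull):

```python
def perspective_feasible(beta_j, t_j, z_j):
    """Membership test for the perspective constraint beta^2 <= t*z, 0 <= z <= 1."""
    return 0.0 <= z_j <= 1.0 and beta_j * beta_j <= t_j * z_j

# With z_j = 0 the constraint forces beta_j = 0 (no big-M needed):
assert not perspective_feasible(0.5, 100.0, 0.0)
assert perspective_feasible(0.0, 0.0, 0.0)
# With z_j = 1 it reduces to the epigraph condition t_j >= beta_j**2:
assert perspective_feasible(2.0, 4.0, 1.0)
assert not perspective_feasible(2.0, 3.9, 1.0)
```

The constraint β² ≤ t·z is a rotated second-order cone, which is why the resulting formulation is an MISOC and can be handled by conic MIP solvers.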
Next, we show that the convex hull of the feasible region of (F0-MISOC) is equal to that of its continuous relaxation. This suggests that we cannot improve the formulation by simply strengthening the constraint system of (F0-MISOC). For notational convenience, let denote the feasible region of (F0-MISOC), i.e.,
We show that the continuous relaxation of the set is equivalent to , i.e.,
Let denote the feasible region of (F0-MISOC). Then
Let be the continuous relaxation set of , i.e.,
We would like to show that .
It is clear that .
To prove , we only need to show that for any given point , we have . Since , which is an integral polytope, there exist integral extreme points such that with for all and . Now we construct for each as follows:
First of all, we claim that for all . Indeed, for any ,
As , thus, for each , we have
Finally, we remark that if an upper bound of is known, then (F0-MISOC) can be further strengthened by adding the constraints for each . This result is summarized in the following corollary.
3.2 Mixed Integer Convex (MIC) Formulation
In this subsection, we derive an MIC formulation equivalent to (F0-CCP). The main idea is to separate the optimization in (F0-CCP) into two steps: (i) we first optimize over the coefficients given a fixed support of size at most , and (ii) we then select the best support of size at most . After the first step, it turns out that we arrive at a convex integer program that is big-M free. We would like to acknowledge that this result has been independently observed in recent work; for completeness, we present a different proof here.
The formulation (F0-CCP) is equivalent to
We first reformulate (F0-CCP) as a combinatorial optimization problem. Let ; then we can reformulate (F0-CCP) as
where is a submatrix of with columns from the subset . Note that the inner minimization has a closed-form solution. Hence, (F0-CCP) is equivalent to
For any given with , it remains to show that
By letting the binary variable if and 0 otherwise, the formulation in (9) is equivalent to the following mixed integer convex program:
Note that Bertsimas and Van Parys proposed a branch and cut algorithm to solve (F0-MIC), which was shown to be effective on some large-sized instances. In the next subsection, we show that the continuous relaxation of (F0-MIC) is equivalent to that of (F0-MISOC). Therefore, it can be more appealing to solve (F0-MISOC) directly with MISOC solvers (e.g., CPLEX, Gurobi). Indeed, we numerically compare the branch and cut algorithm with directly solving (F0-MISOC) in Section 5.
Finally, we remark that given the set of selected features , its corresponding estimator can be computed by the following formula:
where denotes a sub-vector of with entries from subset .
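For a single selected feature, the closed-form restricted estimator reduces to a scalar, β̂ = xᵀy / (xᵀx + λ). The following quick sanity check (with assumed symbols `x`, `y`, `lam`) confirms that this value minimizes the restricted ridge objective:

```python
def restricted_objective(x, y, b, lam):
    """One-feature ridge objective: ||y - b*x||^2 + lam * b^2."""
    return sum((yi - b * xi) ** 2 for xi, yi in zip(x, y)) + lam * b * b

def closed_form_beta(x, y, lam):
    """Scalar instance of the closed form (X_S' X_S + lam I)^{-1} X_S' y."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

x, y, lam = [1.0, 0.0, 1.0], [2.0, 0.0, 2.0], 0.1
b_star = closed_form_beta(x, y, lam)
# The closed form beats small perturbations in either direction:
assert restricted_objective(x, y, b_star, lam) <= restricted_objective(x, y, b_star + 0.01, lam)
assert restricted_objective(x, y, b_star, lam) <= restricted_objective(x, y, b_star - 0.01, lam)
```

The general subset case replaces the scalar division with a small linear solve of the same normal equations.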
3.3 Formulation Comparisons
In this subsection, we compare (F0-big-M), (F0-MISOC), (F0-MISOC-M) and (F0-MIC) according to their continuous relaxation bounds. First, let denote the continuous relaxations of (F0-big-M), (F0-MISOC), (F0-MISOC-M) and (F0-MIC), respectively, i.e.,
Next, in the following theorem, we compare the proposed formulations (F0-big-M), (F0-MISOC), (F0-MISOC-M) and (F0-MIC). In particular, we prove that the continuous relaxation bounds of (F0-MISOC) and (F0-MIC) coincide. In addition, we show that by adding the big-M constraints to (F0-MISOC), we arrive at a relaxation bound tighter than that of (F0-big-M).