1 Introduction
As technology rapidly advances, modern statistical analysis often encounters regressions with a large number of explanatory variables (also known as features). Hence, sparse regression and variable selection have been studied intensively in the past decades. As an alternative to ordinary linear regression, ridge regression, first proposed by [22], has several desirable advantages, including a stable solution, reduced estimator variance, and efficient computation. However, its solution is usually neither sparse [18] nor suitable for variable selection. To overcome these issues, we consider the sparse ridge regression problem below:
(F0) $\min_{\boldsymbol{\beta}} \left\{ \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 : \|\boldsymbol{\beta}\|_0 \leq k \right\}$
where $\boldsymbol{y} \in \mathbb{R}^n$ denotes the response vector, $\boldsymbol{X} \in \mathbb{R}^{n \times p}$ represents the model matrix, $\boldsymbol{\beta} \in \mathbb{R}^p$ is the vector of regression coefficients (i.e., the estimand), and $\lambda > 0$ is a positive tuning parameter for the ridge penalty (i.e., squared $\ell_2$ penalty). Here, $\|\boldsymbol{\beta}\|_0$ is the $\ell_0$ norm, which counts the number of nonzero entries of the vector $\boldsymbol{\beta}$. The value of $k$ represents the number of features to be chosen. The aim of (F0) is to find a best sparse estimator, which minimizes the least squares error with a squared $\ell_2$ penalty. It is easy to see that when $\lambda = 0$, (F0) reduces to a special case known as sparse regression. Note that in the signal processing literature (cf. [40]), the formulation (F0) also coincides with sparse signal recovery. Without loss of generality, let us assume that $k \leq p$ and that the data are normalized such that $\|\boldsymbol{X}_i\|_2 = 1$ for all $i \in [p]$, where $\boldsymbol{X}_i$ denotes the $i$th column of $\boldsymbol{X}$.
It is noted that the sparse ridge regression (F0) can be reformulated as a chance constrained program (CCP) with finite support [1, 26]. That is, we consider $p$ scenarios with equal probability $1/p$, where the $i$th scenario set is $\{\boldsymbol{\beta} : \beta_i = 0\}$ for $i \in [p]$. The constraint $\|\boldsymbol{\beta}\|_0 \leq k$ means that at most a $k/p$ portion of the scenarios can be violated. Hence, we can reformulate (F0) as the CCP below:
(F0CCP) $\min_{\boldsymbol{\beta}} \left\{ \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 : \frac{1}{p} \sum_{i \in [p]} \mathbb{I}(\beta_i \neq 0) \leq \frac{k}{p} \right\}$
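To make the scenario interpretation concrete, the following small numerical sketch (illustrative only; the helper names are ours) checks that the $\ell_0$ constraint is exactly the requirement that at most a $k/p$ fraction of the $p$ scenarios is violated:

```python
import numpy as np

def l0_norm(beta):
    """Number of nonzero entries of beta (the l0 'norm')."""
    return int(np.count_nonzero(beta))

def chance_constraint_holds(beta, k):
    """Scenario view: scenario i is violated when beta_i != 0;
    at most a k/p fraction of the p scenarios may be violated."""
    p = beta.size
    violated_fraction = np.mean(beta != 0)
    return violated_fraction <= k / p

beta = np.array([0.0, 1.3, 0.0, -0.2, 0.0])
k = 2
# The two views agree: ||beta||_0 <= k  iff  the violated fraction <= k/p.
assert (l0_norm(beta) <= k) == chance_constraint_holds(beta, k)
```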
where $\mathbb{I}(\cdot)$ denotes the indicator function. In Section 2, we will investigate how applicable the recent progress on CCPs (e.g., [1, 26, 31]) is to solving (F0CCP). It appears that many existing approaches may not work well, either due to scalability issues or because they yield trivial solutions. In Section 4, we propose two novel scalable algorithms, as well as their integration, to solve the sparse ridge regression with theoretical guarantees.
Relevant Literature. Ridge regression has been extensively studied in statistics [13, 27, 41]. However, despite its many desirable properties, the ridge estimator is often not sparse, as it is computed using the smooth squared $\ell_2$ penalty. Enforcing sparsity in regression has also attracted a significant amount of work, including the LASSO using the $\ell_1$ penalty [38], the Bridge estimator using an $\ell_q$ penalty [23], the SCAD using a nonconvex penalty [43], and the MCP using a minimax concave penalty [44], among many others. Several excellent and comprehensive reviews of sparse regression can be found in [5], [19], and [14]. In particular, it is worth mentioning that in [46], Zou and Hastie proposed the well-known “elastic net” approach, which integrates the ridge penalty (i.e., squared $\ell_2$ penalty) and the $\ell_1$ penalty into the ordinary least-squares objective to obtain a sparse estimator. However, similar to the LASSO method, the elastic net might not consistently find the true sparse solution. In contrast, we introduce the constraint $\|\boldsymbol{\beta}\|_0 \leq k$ in (F0), which strictly enforces the sparsity of $\boldsymbol{\beta}$ and therefore yields a best sparse estimator.
Note that the exact sparse linear regression (i.e., using the $\ell_0$ norm) has been proven NP-hard (cf. [30]), and so is the sparse ridge regression (F0). Various effective approximation algorithms and heuristics have been introduced to solve sparse regression [11, 12, 16, 24, 25, 29]. For example, in [9], Das and Kempe studied the greedy approach (or forward stepwise selection method) and proved its approximation guarantee when the covariance matrix is nearly the identity and has constant bandwidth. However, the greedy approach has been found prohibitively expensive when the number of features (i.e., $p$) becomes large [20]. Recently, Hazimeh and Mazumder [21] integrated a coordinate descent method with local combinatorial search and reported that the proposed method outperforms existing ones. However, this method does not provide any provable guarantee on the solution quality. Many researchers have also attempted to solve sparse regression by developing exact algorithms (e.g., branch and cut) or using mixed integer program (MIP) solvers. It has been shown that for certain large-sized instances with large signal-to-noise ratios, the proposed MIP approaches with a warm start (a good initial solution) work quite well and can yield very high-quality solutions [3, 4, 28]. In particular, in [4], Bertsimas and Van Parys also studied sparse ridge regression and developed a branch and cut algorithm. However, our numerical study shows that these exact approaches can only solve medium-sized instances to near optimality, and their performance relies heavily on the speed of commercial solvers and can vary significantly from one dataset to another. In this work, our emphasis is on developing fast approximation algorithms with attractive scalability and theoretical performance guarantees.
Our Approaches and Contributions. In this work, we will study the sparse ridge regression (F0) from the angle of the chance constrained program (F0CCP). We will first investigate various existing CCP approaches to solve (F0CCP). One particular approach, which has been used to solve sparse regressions [3]
, is to introduce one binary variable for each indicator function in (F0CCP) and linearize it with a big-M coefficient. Oftentimes, such a method can be very slow in computation, in particular for large-scale datasets. To overcome this challenge, we develop a big-M free mixed integer second order conic (MISOC) reformulation of (F0CCP). We further show that its continuous relaxation is equivalent to that of a mixed integer convex (MIC) formulation in [4, 11]. Moreover, these two formulations motivate us to construct a greedy approach (i.e., forward selection) in a much more efficient way than those in the literature. The performance guarantee of our greedy approach is also established. A randomized algorithm is studied by investigating the continuous relaxation of the proposed MISOC formulation. Our numerical study shows that the proposed methods work quite well; in particular, the greedy approach outperforms the other methods in both running time and accuracy of variable selection. The contributions are summarized below:
We investigate theoretical properties of three existing CCP approaches to solve (F0CCP), i.e., the big-M method, the conditional value-at-risk (CVaR) approach, and the heuristic algorithm from [1], and shed some light on why those methods are not well suited to solving the sparse ridge regression (F0).

We establish a mixed integer second order conic (MISOC) reformulation of (F0CCP) based on the perspective formulation [17] and prove that its continuous relaxation is equivalent to that of the mixed integer convex formulation in [4, 11]. We also show that the proposed MISOC formulation can be stronger than the naive big-M formulation.

Based on these reformulations, we develop an efficient greedy approach for solving (F0CCP) and prove its performance guarantee under a mild condition. Our analysis shows that the proposed greedy approach is both theoretically sound and computationally efficient.

By establishing a relationship between the continuous relaxation value of the MISOC formulation and the optimal value of (F0CCP), we develop a randomized algorithm based on the optimal continuous relaxation solution of the MISOC formulation and derive its theoretical properties. Such a continuous relaxation solution can help reduce the number of potential features and thus can be integrated with the greedy approach.
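As a preview of the greedy (forward selection) idea, the following is a minimal sketch, not the exact algorithm analyzed later: it repeatedly adds the feature that most decreases the sparse ridge objective, evaluating each candidate support $S$ via the well-known closed form $\boldsymbol{y}^\top(\boldsymbol{I} + \boldsymbol{X}_S\boldsymbol{X}_S^\top/\lambda)^{-1}\boldsymbol{y}$ of the ridge-restricted objective. All function names and the synthetic data are our own illustration.

```python
import numpy as np

def ridge_objective(y, X, S, lam):
    """Optimal value of min_b ||y - X_S b||^2 + lam ||b||^2 over support S,
    via the closed form y^T (I + X_S X_S^T / lam)^{-1} y."""
    n = y.size
    if not S:
        return float(y @ y)
    XS = X[:, sorted(S)]
    K = np.eye(n) + XS @ XS.T / lam
    return float(y @ np.linalg.solve(K, y))

def greedy_selection(y, X, k, lam):
    """Forward selection: add the feature giving the largest decrease of the
    sparse ridge objective, up to k features."""
    S = set()
    p = X.shape[1]
    for _ in range(k):
        best = min((i for i in range(p) if i not in S),
                   key=lambda i: ridge_objective(y, X, S | {i}, lam))
        S.add(best)
    return sorted(S)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
X /= np.linalg.norm(X, axis=0)          # normalize columns as assumed in (F0)
beta_true = np.zeros(8)
beta_true[[1, 4]] = [2.0, -3.0]
y = X @ beta_true + 0.01 * rng.standard_normal(30)
print(greedy_selection(y, X, k=2, lam=0.1))   # expected to recover features 1 and 4
```

With a strong signal and near-orthogonal columns, the sketch recovers the true support; Section 4 gives the precise conditions under which such a guarantee holds.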
The remainder of the paper is organized as follows.
Section 2 investigates the applicability of several existing approaches of CCP to the sparse ridge regression (F0).
Section 3 develops two big-M free mixed integer convex program formulations and proves their equivalence.
Section 4 proposes and analyzes two scalable algorithms and proves their performance guarantees.
The numerical studies of the proposed scalable algorithms are presented in Section 5. We conclude this work with some discussion in Section 6.
The following notation is used throughout the paper. We use bold letters (e.g., $\boldsymbol{x}$, $\boldsymbol{A}$) to denote vectors or matrices, and use the corresponding non-bold letters to denote their components. Given a positive integer $n$, we let $[n] := \{1, \ldots, n\}$ and let $\boldsymbol{I}_n$ denote the $n \times n$ identity matrix. Given a subset $S \subseteq [p]$, we let $\boldsymbol{\beta}_S$ denote the subvector of $\boldsymbol{\beta}$ with entries from subset $S$, and let $\boldsymbol{X}_S$ be the matrix consisting of the columns of $\boldsymbol{X}$ indexed by $S$. For a matrix $\boldsymbol{A}$, we let $\sigma_{\min}(\boldsymbol{A})$ and $\sigma_{\max}(\boldsymbol{A})$ denote its smallest and largest singular values, respectively. Given a vector $\boldsymbol{a}$, we let $\mathrm{Diag}(\boldsymbol{a})$ be the diagonal matrix with diagonal entries from $\boldsymbol{a}$. For a matrix $\boldsymbol{A}$, we let $\boldsymbol{A}_i$ denote its $i$th column. Given a set $S$, we let $\mathrm{conv}(S)$ denote its convex hull. Given a finite set $S$, we let $|S|$ denote its cardinality. Given two sets $A, B$, we let $A \setminus B$ denote the set of elements in $A$ but not in $B$, let $A \cup B$ denote their union, and let $A \Delta B$ be their symmetric difference, i.e., $A \Delta B = (A \setminus B) \cup (B \setminus A)$.
2 Existing Solution Approaches
In this section, we investigate three commonly used solution approaches to solve (F0CCP).
2.1 Big-M Method
One typical method for a CCP is to formulate it as a mixed integer program (MIP) by introducing a binary variable $z_i$ for each scenario $i \in [p]$, i.e., $z_i = \mathbb{I}(\beta_i \neq 0)$, and then using the big-M method to linearize it: supposing that $|\beta_i| \leq M_i$ for a large positive number $M_i$, the constraint $\mathbb{I}(\beta_i \neq 0) \leq z_i$ is equivalent to $-M_i z_i \leq \beta_i \leq M_i z_i$. Therefore, (F0CCP) can be reformulated as the following MIP:
(F0bigM) $\min_{\boldsymbol{\beta}, \boldsymbol{z}} \left\{ \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 : -M_i z_i \leq \beta_i \leq M_i z_i, \forall i \in [p], \sum_{i \in [p]} z_i \leq k, \boldsymbol{z} \in \{0,1\}^p \right\}$
The advantage of (F0bigM) is that it can be directly solved by off-the-shelf solvers (e.g., CPLEX, Gurobi). However, one has to choose the vector $\boldsymbol{M}$ properly.
There are many ways to choose the big-M coefficients (i.e., $\boldsymbol{M}$). One typical way is, for each $i \in [p]$, to let $M_i$ be equal to the largest possible value of $|\beta_i|$ given that the optimal value of (F0CCP) is bounded by $U$, i.e., to let $M_i$ be equal to the larger optimal value of the following two convex quadratic programs:
(1) $M_i = \max\left\{ \max_{\boldsymbol{\beta}} \left\{ \beta_i : \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \leq U \right\}, \max_{\boldsymbol{\beta}} \left\{ -\beta_i : \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \leq U \right\} \right\}$
To solve (1), one needs to compute an upper bound $U$ of the optimal value of (F0CCP). Note that the objective value of any feasible solution to (F0CCP) suffices. A naive upper bound is $U = \|\boldsymbol{y}\|_2^2$, since $\boldsymbol{\beta} = \boldsymbol{0}$ is feasible to (F0CCP). On the other hand, to obtain the vector $\boldsymbol{M}$, one has to solve the two convex quadratic programs in (1) for each $i \in [p]$, which can be very time-consuming, in particular when $p$ is large.
Here we derive a slightly weaker but closed-form vector $\boldsymbol{M}$. Note that all the convex programs in (1) share the same constraint $\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \leq U$. Thus, the key proof idea is to relax this constraint into a weaker one that is more amenable to a closed-form upper bound of the vector $\boldsymbol{M}$. This result is summarized below.
Proposition 1.
Suppose that $U$ is an upper bound of the optimal value of (F0CCP); then the vector $\boldsymbol{M}$ can be chosen as
(2) 
Proof.
In (1), it is sufficient to find an upper bound of the vector $\boldsymbol{\beta}$ over all feasible $\boldsymbol{\beta}$ satisfying $\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \leq U$.
First, the above constraint implies that
where the first inequality is due to the inequality . Therefore, we have
where the first inequality is due to . Since , thus we have
for each .
On the other hand, we note that $\lambda \|\boldsymbol{\beta}\|_2^2 \leq \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \leq U$. Thus, another upper bound can be developed by letting $M_i = \sqrt{U/\lambda}$, which implies that $|\beta_i| \leq \|\boldsymbol{\beta}\|_2 \leq \sqrt{U/\lambda} = M_i$ for each $i \in [p]$. ∎
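Assuming $U$ is any upper bound on the optimal value (e.g., the objective at $\boldsymbol{\beta} = \boldsymbol{0}$, which is $\|\boldsymbol{y}\|_2^2$), any feasible $\boldsymbol{\beta}$ satisfies $\lambda\|\boldsymbol{\beta}\|_2^2 \leq U$, which gives the closed-form choice $M_i = \sqrt{U/\lambda}$. A small sketch (illustrative helper name):

```python
import numpy as np

def closed_form_big_m(y, lam):
    """Closed-form big-M: any feasible beta satisfies lam * ||beta||_2^2 <= U,
    hence |beta_i| <= sqrt(U / lam) for every i. Here U = ||y||_2^2, the
    objective value of the feasible point beta = 0."""
    U = float(y @ y)               # naive upper bound on the optimal value
    return np.sqrt(U / lam)        # a valid M_i for every coordinate i

y = np.array([1.0, -2.0, 2.0])
M = closed_form_big_m(y, lam=4.0)  # U = 9, so M = sqrt(9/4) = 1.5
assert abs(M - 1.5) < 1e-12
```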
It is known that the MIP (F0bigM) with big-M coefficients typically has a very weak continuous relaxation value. Consequently, there has been significant research on improving the big-M coefficients of (F0bigM); see, for example, [1, 3, 32, 33, 37]. However, such tightening procedures tend to be time-consuming, in particular for large-scale datasets. In Section 3, we will derive two big-M free MIP formulations, whose continuous relaxations can be proven stronger than that of (F0bigM).
2.2 CVaR Approximation
Another well-known approximation of a CCP is the so-called conditional value-at-risk (CVaR) approximation (see [31] for details), which replaces the nonconvex probabilistic constraint with a convex constraint. For the sparse ridge regression in (F0CCP), the resulting formulation is
(3) 
where $(a)_+ := \max\{a, 0\}$. It is seen that (3) is a convex optimization problem and that any of its feasible solutions is feasible to (F0CCP); thus, its optimal value provides an upper bound of that of (F0CCP). However, we observe that the only feasible solution to (3) is $\boldsymbol{\beta} = \boldsymbol{0}$.
Proposition 2.
The only feasible solution to (3) is $\boldsymbol{\beta} = \boldsymbol{0}$, i.e., the optimal value of (3) equals $\|\boldsymbol{y}\|_2^2$.
Proof.
Therefore, the CVaR approach yields a trivial solution to (F0CCP). Hence, it is not a desirable approach, and other alternatives are preferred.
2.3 Heuristic Algorithm in [1]
In the recent work [1], Ahmed et al. proposed a heuristic algorithm for CCPs with discrete distributions. It was reported that this method can solve most of their numerical instances to near optimality (i.e., within a 4% optimality gap). The key idea of the heuristic algorithm in [1] is to minimize the sum of infeasibilities over all scenarios when the objective value is upper bounded by a given threshold $\epsilon$. Specifically, they considered the following optimization problem:
(4) 
Let $\hat{\boldsymbol{\beta}}$ be an optimal solution to (4) given an upper bound $\epsilon$ of the optimal value of (F0CCP). The heuristic algorithm decreases the value of $\epsilon$ if $\hat{\boldsymbol{\beta}}$ is feasible to (F0CCP), and increases it otherwise. This search (i.e., bisection) procedure terminates after a finite number of iterations. The detailed procedure is described in Algorithm 1. Let $\bar{\boldsymbol{\beta}}$ denote the output solution of Algorithm 1. Then clearly,
Proposition 3.
Algorithm 1 terminates after a finite number of iterations, and its output solution $\bar{\boldsymbol{\beta}}$ is feasible to (F0CCP).
Proof.

To prove the first part, note that by the description of Algorithm 1, it terminates if and only if the length of the search interval falls below the given tolerance. After each iteration, the length of the interval is halved; thus, Algorithm 1 terminates after a finite number of steps (logarithmic in the initial interval length divided by the tolerance).

For the second part, we start with a feasible solution to (F0CCP). Algorithm 1 then maintains a feasible solution from iteration to iteration. Thus, the output of Algorithm 1 is feasible to (F0CCP).
∎
It is worth remarking that for any given upper bound $\epsilon$, the formulation (4) is similar to the Dantzig selector proposed by [6]. The difference between Algorithm 1 and the LASSO is that this iterative procedure simultaneously guarantees sparsity and reduces the regression error, while the LASSO seeks a trade-off between the error and a penalty on $\boldsymbol{\beta}$. We also note that Algorithm 1 might not be computationally efficient, since it requires solving (4) multiple times. To the best of our knowledge, there is no known performance guarantee for Algorithm 1.
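The bisection logic of Algorithm 1 can be sketched as follows; `solve_infeasibility_problem` stands in for an oracle solving (4) and is a hypothetical placeholder, as are the tolerance and the initial bracket:

```python
def bisection_heuristic(solve_infeasibility_problem, lo, hi, tol=1e-6):
    """Sketch of the bisection in Algorithm 1: given an objective bound eps,
    the oracle returns a solution of (4) and whether it is feasible to
    (F0CCP); we shrink eps when feasible and grow it otherwise."""
    best = None
    while hi - lo > tol:
        eps = (lo + hi) / 2.0
        solution, feasible = solve_infeasibility_problem(eps)
        if feasible:
            best, hi = solution, eps   # try a smaller objective bound
        else:
            lo = eps                   # bound too tight; relax it
    return best, hi

# Toy oracle: pretend (4) yields a feasible solution exactly when eps >= 3
# (purely hypothetical, for demonstration of the search only).
oracle = lambda eps: ("beta(eps)", eps >= 3.0)
sol, eps = bisection_heuristic(oracle, lo=0.0, hi=10.0)
assert abs(eps - 3.0) < 1e-5
```

The interval is halved at every step, which is exactly the finite-termination argument in the proof of Proposition 3.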
3 Two Reformulations of Sparse Ridge Regression: Big-M Free
Note that the big-M formulation (F0bigM) is quite compact, since it only involves $2p$ variables (i.e., $\boldsymbol{\beta}$ and $\boldsymbol{z}$). However, it is usually a weak formulation in the sense that the continuous relaxation value of (F0bigM) can be quite far away from the optimal value of (F0CCP). In this section, we propose two big-M free reformulations of (F0CCP) from two distinct perspectives and prove their equivalence.
3.1 Mixed Integer Second Order Conic (MISOC) Formulation
In this subsection, we will derive an MISOC formulation and its analytical properties. To begin with, we first make an observation based on the perspective formulation in [17]. Let us consider the nonconvex set
(5) $Q_i = \left\{ (\beta_i, z_i, t_i) \in \mathbb{R} \times \{0,1\} \times \mathbb{R}_+ : \beta_i^2 \leq t_i, \beta_i (1 - z_i) = 0 \right\}$
for each $i \in [p]$. The results in [17] show that the convex hull of $Q_i$, denoted by $\mathrm{conv}(Q_i)$, can be characterized as below.
Lemma 1.
(Lemma 3.1 in [17]) For each $i \in [p]$, the convex hull of the set $Q_i$ is
(6) $\mathrm{conv}(Q_i) = \left\{ (\beta_i, z_i, t_i) : \beta_i^2 \leq t_i z_i, 0 \leq z_i \leq 1, t_i \geq 0 \right\}$
Lemma 1 suggests an extended formulation for (F0CCP) without big-M coefficients. To achieve this goal, we first introduce a variable $t_i$ as an upper bound of $\beta_i^2$ and a binary variable $z_i = \mathbb{I}(\beta_i \neq 0)$ for each $i \in [p]$; thus, (F0CCP) is equal to
$\min_{\boldsymbol{\beta}, \boldsymbol{z}, \boldsymbol{t}} \left\{ \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{i \in [p]} t_i : \beta_i^2 \leq t_i, z_i = \mathbb{I}(\beta_i \neq 0), \forall i \in [p], \sum_{i \in [p]} z_i \leq k \right\},$
which can be equivalently reformulated as
(7) $\min_{\boldsymbol{\beta}, \boldsymbol{z}, \boldsymbol{t}} \left\{ \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{i \in [p]} t_i : (\beta_i, z_i, t_i) \in Q_i, \forall i \in [p], \sum_{i \in [p]} z_i \leq k \right\}.$
Note that (i) in (7), we replace $\beta_i^2$ by $t_i$ in the ridge penalty and enforce $z_i$ to be binary for each $i \in [p]$; and (ii) from Lemma 1, $\mathrm{conv}(Q_i)$ can be described by (6).
The above result is summarized in the following theorem.
Theorem 1.
The formulation (F0CCP) is equivalent to
(F0MISOC) $\min_{\boldsymbol{\beta}, \boldsymbol{z}, \boldsymbol{t}} \left\{ \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{i \in [p]} t_i : \beta_i^2 \leq t_i z_i, t_i \geq 0, \forall i \in [p], \sum_{i \in [p]} z_i \leq k, \boldsymbol{z} \in \{0,1\}^p \right\}$
This formulation (F0MISOC) introduces more variables than (F0bigM), but it does not require any big-M coefficients.
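The reason no big-M is needed can be checked directly on the perspective constraint $\beta_i^2 \leq t_i z_i$: with binary $z_i$, it forces $\beta_i = 0$ whenever $z_i = 0$, while for $z_i = 1$ it reduces to $\beta_i^2 \leq t_i$. A small numerical illustration (not part of the formulation):

```python
def in_perspective_cone(beta_i, z_i, t_i):
    """Membership in the constraint beta_i^2 <= t_i * z_i with t_i >= 0."""
    return t_i >= 0 and beta_i ** 2 <= t_i * z_i

# z_i = 0 forces beta_i = 0, no matter how large t_i is ...
assert not in_perspective_cone(beta_i=0.5, z_i=0, t_i=1e9)
assert in_perspective_cone(beta_i=0.0, z_i=0, t_i=7.0)
# ... while z_i = 1 simply requires t_i to dominate beta_i^2.
assert in_perspective_cone(beta_i=2.0, z_i=1, t_i=4.0)
assert not in_perspective_cone(beta_i=2.0, z_i=1, t_i=3.9)
```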
Next, we show that the convex hull of the feasible region of (F0MISOC) is equal to the feasible region of its continuous relaxation. This suggests that we might not be able to improve the formulation by simply exploring the constraint system of (F0MISOC). For notational convenience, let us denote by $Q$ the feasible region of (F0MISOC), i.e.,
(8) $Q = \left\{ (\boldsymbol{\beta}, \boldsymbol{z}, \boldsymbol{t}) : \beta_i^2 \leq t_i z_i, t_i \geq 0, \forall i \in [p], \sum_{i \in [p]} z_i \leq k, \boldsymbol{z} \in \{0,1\}^p \right\}.$
We show that the continuous relaxation of the set $Q$ describes exactly $\mathrm{conv}(Q)$, i.e.,
Proposition 4.
Let $Q$ denote the feasible region of (F0MISOC), and let $\hat{Q}$ denote its continuous relaxation, obtained by replacing $\boldsymbol{z} \in \{0,1\}^p$ with $\boldsymbol{z} \in [0,1]^p$. Then $\mathrm{conv}(Q) = \hat{Q}$.
Proof.
Let $\hat{Q}$ be the continuous relaxation set of $Q$, i.e., the set obtained from (8) by replacing $\boldsymbol{z} \in \{0,1\}^p$ with $\boldsymbol{z} \in [0,1]^p$.
We would like to show that $\mathrm{conv}(Q) = \hat{Q}$.

It is clear that $\mathrm{conv}(Q) \subseteq \hat{Q}$.

To prove $\hat{Q} \subseteq \mathrm{conv}(Q)$, we only need to show that for any given point $(\boldsymbol{\beta}, \boldsymbol{z}, \boldsymbol{t}) \in \hat{Q}$, we have $(\boldsymbol{\beta}, \boldsymbol{z}, \boldsymbol{t}) \in \mathrm{conv}(Q)$. Since $\boldsymbol{z}$ belongs to $\{\boldsymbol{z} \in [0,1]^p : \sum_{i \in [p]} z_i \leq k\}$, which is an integral polytope, there exist integral extreme points $\boldsymbol{z}^1, \ldots, \boldsymbol{z}^m$ such that $\boldsymbol{z} = \sum_{j \in [m]} \mu_j \boldsymbol{z}^j$ with $\mu_j \geq 0$ for all $j \in [m]$ and $\sum_{j \in [m]} \mu_j = 1$. Now we construct points $(\boldsymbol{\beta}^j, \boldsymbol{z}^j, \boldsymbol{t}^j)$ for each $j \in [m]$ as follows:
First of all, we claim that for all . Indeed, for any ,
As , thus, for each , we have
Thus, .
∎
Finally, we remark that if an upper bound $\boldsymbol{M}$ of $|\boldsymbol{\beta}|$ is known, then (F0MISOC) can be further strengthened by adding the constraints $-M_i z_i \leq \beta_i \leq M_i z_i$ for each $i \in [p]$. This result is summarized in the following proposition.
Proposition 5.
The formulation (F0CCP) is equivalent to
(F0MISOCM) $\min_{\boldsymbol{\beta}, \boldsymbol{z}, \boldsymbol{t}} \left\{ \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{i \in [p]} t_i : (\boldsymbol{\beta}, \boldsymbol{z}, \boldsymbol{t}) \in Q, -M_i z_i \leq \beta_i \leq M_i z_i, \forall i \in [p] \right\},$
where the vector $\boldsymbol{M}$ can be chosen according to Proposition 1 and the set $Q$ is defined in (8).
3.2 Mixed Integer Convex (MIC) Formulation
In this subsection, we will derive an equivalent MIC formulation of (F0CCP). The main idea is to separate the optimization in (F0CCP) into two steps: (i) first, we optimize over $\boldsymbol{\beta}$ with its set of nonzero entries fixed to a subset of size at most $k$, and (ii) then we select the best such subset of nonzero entries. After the first step, it turns out that we arrive at a convex integer program, which is big-M free. We would like to acknowledge that this result has been independently observed in the recent work of [4] and [11]. For the completeness of this paper, we also present a different proof here.
Proposition 6.
The formulation (F0CCP) is equivalent to
(F0MIC) $\min_{\boldsymbol{z}} \left\{ \boldsymbol{y}^\top \left( \boldsymbol{I}_n + \frac{1}{\lambda} \sum_{i \in [p]} z_i \boldsymbol{X}_i \boldsymbol{X}_i^\top \right)^{-1} \boldsymbol{y} : \sum_{i \in [p]} z_i \leq k, \boldsymbol{z} \in \{0,1\}^p \right\}$
Proof.
We first reformulate (F0CCP) as a combinatorial optimization problem. Let $S := \{i \in [p] : \beta_i \neq 0\}$ denote the support of $\boldsymbol{\beta}$, and we reformulate (F0CCP) as
$\min_{S \subseteq [p], |S| \leq k} \min_{\boldsymbol{\beta}_S} \|\boldsymbol{y} - \boldsymbol{X}_S \boldsymbol{\beta}_S\|_2^2 + \lambda \|\boldsymbol{\beta}_S\|_2^2,$
where $\boldsymbol{X}_S$ is a submatrix of $\boldsymbol{X}$ with columns from subset $S$. Note that the inner minimization has the closed-form solution $\boldsymbol{\beta}_S = (\boldsymbol{X}_S^\top \boldsymbol{X}_S + \lambda \boldsymbol{I}_{|S|})^{-1} \boldsymbol{X}_S^\top \boldsymbol{y}$. Hence, (F0CCP) is equivalent to
(9) $\min_{S \subseteq [p], |S| \leq k} \boldsymbol{y}^\top \left( \boldsymbol{I}_n + \frac{1}{\lambda} \boldsymbol{X}_S \boldsymbol{X}_S^\top \right)^{-1} \boldsymbol{y}.$
For any given $S$ with $|S| \leq k$, the identity
$\boldsymbol{y}^\top \boldsymbol{y} - \boldsymbol{y}^\top \boldsymbol{X}_S (\boldsymbol{X}_S^\top \boldsymbol{X}_S + \lambda \boldsymbol{I}_{|S|})^{-1} \boldsymbol{X}_S^\top \boldsymbol{y} = \boldsymbol{y}^\top \left( \boldsymbol{I}_n + \frac{1}{\lambda} \boldsymbol{X}_S \boldsymbol{X}_S^\top \right)^{-1} \boldsymbol{y}$
follows from the Woodbury matrix identity. By letting the binary variable $z_i = 1$ if $i \in S$ and 0 otherwise, and noting that $\boldsymbol{X}_S \boldsymbol{X}_S^\top = \sum_{i \in [p]} z_i \boldsymbol{X}_i \boldsymbol{X}_i^\top$, the formulation in (9) is equivalent to the mixed integer convex program (F0MIC). ∎
Note that in [4], Bertsimas and Van Parys proposed a branch and cut algorithm to solve (F0MIC), which was shown to be effective in solving some large-sized instances. In the next subsection, we will show that the continuous relaxation of (F0MIC) is equivalent to that of (F0MISOC). Therefore, it can be more appealing to solve (F0MISOC) directly with MISOC solvers (e.g., CPLEX, Gurobi). Indeed, we numerically compare the branch and cut algorithm with directly solving (F0MISOC) in Section 5.
Finally, we remark that given the set of selected features $S$, its corresponding estimator $\hat{\boldsymbol{\beta}}$ can be computed by the following formula:
(10) $\hat{\boldsymbol{\beta}}_S = (\boldsymbol{X}_S^\top \boldsymbol{X}_S + \lambda \boldsymbol{I}_{|S|})^{-1} \boldsymbol{X}_S^\top \boldsymbol{y}, \quad \hat{\beta}_i = 0, \forall i \notin S,$
where $\hat{\boldsymbol{\beta}}_S$ denotes the subvector of $\hat{\boldsymbol{\beta}}$ with entries from subset $S$.
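The restricted ridge estimator described above can be computed directly; a minimal numpy sketch, assuming $S$ indexes columns of $\boldsymbol{X}$ (function name and data are our own illustration):

```python
import numpy as np

def restricted_ridge_estimator(y, X, S, lam):
    """Ridge estimator on the selected features S:
    beta_S = (X_S^T X_S + lam * I)^{-1} X_S^T y, and zero on the
    remaining coordinates."""
    S = sorted(S)
    XS = X[:, S]
    beta_S = np.linalg.solve(XS.T @ XS + lam * np.eye(len(S)), XS.T @ y)
    beta = np.zeros(X.shape[1])
    beta[S] = beta_S
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
beta = restricted_ridge_estimator(y, X, S={0, 3}, lam=0.5)
assert np.all(beta[[1, 2, 4]] == 0)    # unselected features stay zero
```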
3.3 Formulation Comparisons
In this subsection, we will focus on comparing (F0bigM), (F0MISOC), (F0MISOCM), and (F0MIC) according to their continuous relaxation bounds. First, let us denote the continuous relaxation values of (F0bigM), (F0MISOC), (F0MISOCM), and (F0MIC), respectively, as follows:
(11a)  
(11b)  
(11c)  
(11d) 
Next, in the following theorem, we will compare the proposed formulations (F0bigM), (F0MISOC), (F0MISOCM), and (F0MIC). In particular, we prove that the continuous relaxation bounds of (F0MISOC) and (F0MIC) coincide. In addition, we show that by adding the big-M constraints $-M_i z_i \leq \beta_i \leq M_i z_i$ for each $i \in [p]$ to (F0MISOC), we arrive at a relaxation bound tighter than that of (F0bigM).