The CCP Selector: Scalable Algorithms for Sparse Ridge Regression from Chance-Constrained Programming

06/11/2018 · Weijun Xie et al. · Virginia Polytechnic Institute and State University

Sparse regression and variable selection for large-scale data have developed rapidly over the past decades. This work focuses on sparse ridge regression, which uses the exact L_0 norm to enforce sparsity. We lay out a theoretical foundation to understand why many existing approaches may not work well for this problem, in particular on large-scale datasets. Inspired by reformulating the problem as a chance-constrained program, we derive a novel mixed integer second order conic (MISOC) reformulation and prove that its continuous relaxation is equivalent to that of the convex integer formulation proposed in a recent work. Based upon these two formulations, we develop two new scalable algorithms, a greedy and a randomized algorithm, for sparse ridge regression with desirable theoretical properties. The proposed algorithms are proved to yield near-optimal solutions under mild conditions. For problems of much larger dimension, we propose to integrate the greedy algorithm with the randomized algorithm, which greedily searches the features within the nonzero subset identified by the continuous relaxation of the MISOC formulation. The merits of the proposed methods are demonstrated through a set of numerical examples in comparison with several existing methods.


1 Introduction

As technology rapidly advances, modern statistical analysis often encounters regressions with a large number of explanatory variables (also known as features). Hence, sparse regression and variable selection have been studied intensively over the past decades. As an alternative to ordinary linear regression, ridge regression, first proposed by [22], has several desirable advantages, including a stable solution, reduced estimator variance, and efficient computation. However, its solution is usually neither sparse [18] nor applicable for variable selection.

To overcome these issues, we consider the sparse ridge regression problem as below:

(F0) $w^* := \min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_2^2 : \|\boldsymbol{\beta}\|_0 \le k \right\},$

where $\mathbf{y} \in \mathbb{R}^n$ denotes the response vector, $\mathbf{X} \in \mathbb{R}^{n \times p}$ represents the model matrix, $\boldsymbol{\beta} \in \mathbb{R}^p$ is the vector of regression coefficients (i.e., the estimand), and $\lambda > 0$ is a positive tuning parameter for the ridge penalty (i.e., the squared $\ell_2$ penalty). Here, $\|\boldsymbol{\beta}\|_0$ is the $\ell_0$ norm, which counts the number of nonzero entries of the vector $\boldsymbol{\beta}$, and the integer $k$ represents the number of features to be chosen. Formulation (F0) aims to find a best $k$-sparse estimator, which minimizes the least squares error with a squared $\ell_2$ penalty. It is easy to see that when $\lambda = 0$, (F0) reduces to a well-known special case, sparse regression. Note that in the signal processing literature (cf. [40]), the formulation (F0) also coincides with sparse signal recovery. Without loss of generality, we assume that $k < p$ and that the data are normalized.
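To make the formulation concrete, here is a minimal brute-force sketch (in Python with NumPy) that solves (F0) exactly for a tiny number of features by enumerating all supports of size at most $k$ and applying the closed-form ridge solution on each; the synthetic data and the value of $\lambda$ are illustrative only, and the enumeration is exponential in $p$, which is precisely why the scalable algorithms developed later are needed.

```python
import itertools
import numpy as np

def ridge_on_support(X, y, S, lam):
    """Closed-form ridge estimator restricted to the columns indexed by S."""
    XS = X[:, S]
    return np.linalg.solve(XS.T @ XS + lam * np.eye(len(S)), XS.T @ y)

def brute_force_sparse_ridge(X, y, k, lam):
    """Solve (F0) exactly by enumerating all supports of size at most k."""
    p = X.shape[1]
    best_val, best_S, best_beta = np.inf, None, None
    for size in range(1, k + 1):
        for S in itertools.combinations(range(p), size):
            S = list(S)
            b = ridge_on_support(X, y, S, lam)
            val = np.sum((y - X[:, S] @ b) ** 2) + lam * np.sum(b ** 2)
            if val < best_val:
                best_val, best_S, best_beta = val, S, b
    return best_val, best_S, best_beta

# Illustrative data: 50 samples, 10 features, choose k = 3 of them.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
print(brute_force_sparse_ridge(X, y, k=3, lam=0.1))
```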

It is noted that the sparse ridge regression (F0) can be reformulated as a chance-constrained program (CCP) with finite support [1, 26]. That is, we consider $p$ scenarios with equal probability $1/p$, where the $i$th scenario set is $\{\boldsymbol{\beta} : \beta_i = 0\}$ for $i \in [p]$. The constraint $\|\boldsymbol{\beta}\|_0 \le k$ means that at most a $k/p$ portion of the scenarios can be violated. Hence, we can reformulate (F0) as a CCP below:

(F0-CCP) $w^* = \min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_2^2 : \sum_{i \in [p]} \mathbb{I}(\beta_i \ne 0) \le k \right\},$

where $\mathbb{I}(\cdot)$ denotes the indicator function. In Section 2, we will investigate how applicable the recent progress on CCP (e.g., [1, 26, 31]) is to solving (F0-CCP). It appears that many existing approaches may not work well, either due to scalability issues or because they yield trivial solutions. In Section 4, we propose two novel scalable algorithms, and their integration, to solve the sparse ridge regression with theoretical guarantees.

Relevant Literature. Ridge regression has been extensively studied in statistics [13, 27, 41]. However, despite its many desirable properties, the ridge estimator is often not sparse since it is computed using the smooth squared $\ell_2$ penalty. Enabling sparsity in regression has also attracted a significant amount of work, including the LASSO using the $\ell_1$ penalty [38], the Bridge estimator using the $\ell_q$ penalty [23], the SCAD using a non-convex penalty [43], and the MCP using the minimax concave penalty [44], among many others. Several excellent and comprehensive reviews of sparse regression can be found in [5], [19], and [14]. In particular, it is worth mentioning that in [46], Zou and Hastie proposed the well-known "elastic net" approach, which integrates the ridge penalty (i.e., the squared $\ell_2$ penalty) and the $\ell_1$ penalty into the ordinary least-squares objective to obtain a sparse estimator. However, similar to the LASSO method, the elastic net might not consistently find the true sparse solution. On the contrary, we introduce the constraint $\|\boldsymbol{\beta}\|_0 \le k$ in (F0), which strictly enforces sparsity on $\boldsymbol{\beta}$ and therefore can obtain a best $k$-sparse estimator.

Note that the exact sparse linear regression (i.e., with the $\ell_0$ norm) has been proven to be NP-hard (cf. [30]), and so is the sparse ridge regression (F0). Various effective approximation algorithms and heuristics have been introduced to solve sparse regression [11, 12, 16, 24, 25, 29]. For example, in [9], Das and Kempe studied the greedy approach (or forward stepwise selection method) and proved its approximation guarantee when the covariance matrix is nearly the identity and has constant bandwidth. However, the greedy approach has been found prohibitively expensive when the number of features (i.e., $p$) becomes large [20]. Recently, Hazimeh and Mazumder [21] integrated a coordinate descent method with local combinatorial search, and reported that the proposed method outperforms existing ones. However, this method does not provide any provable guarantee on the solution quality. Many researchers have also attempted to solve sparse regression by developing exact algorithms (e.g., branch and cut) or by using mixed integer program (MIP) solvers. It has been shown that for certain large-sized instances with large signal-to-noise ratios, the proposed MIP approaches with a warm start (a good initial solution) work quite well and can yield very high-quality solutions [3, 4, 28]. In particular, in [4], Bertsimas and Van Parys also studied sparse ridge regression and developed a branch-and-cut algorithm. However, our numerical study shows that these exact approaches can only solve medium-sized instances to near-optimality, and their performance relies heavily on the speed of commercial solvers and can vary significantly from one dataset to another. In this work, our emphasis is on developing fast approximation algorithms with attractive scalability and theoretical performance guarantees.

Our Approaches and Contributions. In this work, we focus on studying the sparse ridge regression (F0) from the angle of the chance-constrained program (F0-CCP). We first investigate various existing CCP approaches to solve (F0-CCP). One particular approach, which has been used to solve sparse regression [3], is to introduce one binary variable for each indicator function in (F0-CCP) and linearize it with a big-M coefficient. Oftentimes, such a method can be very slow in computation, in particular for large-scale datasets. To overcome this challenge, we develop a big-M free mixed integer second order conic (MISOC) reformulation for (F0-CCP). We further show that its continuous relaxation is equivalent to that of a mixed integer convex (MIC) formulation in [4, 11]. Moreover, these two formulations motivate us to construct a greedy approach (i.e., forward selection) in a much more efficient way than those in the literature. The performance guarantee of our greedy approach is also established. A randomized algorithm is studied by investigating the continuous relaxation of the proposed MISOC formulation. Our numerical study shows that the proposed methods work quite well; in particular, the greedy approach outperforms the other methods both in running time and in accuracy of variable selection. The contributions are summarized below:

  1. We investigate theoretical properties of three existing CCP approaches to solving (F0-CCP), i.e., the big-M method, the conditional value-at-risk (CVaR) approach, and the heuristic algorithm from [1], and shed some light on why those methods may not be suitable for solving the sparse ridge regression (F0).

  2. We establish a mixed integer second order conic (MISOC) reformulation for (F0-CCP) based on the perspective formulation [17] and prove that its continuous relaxation is equivalent to that of the mixed integer convex formulation in [4, 11]. We also show that the proposed MISOC formulation can be stronger than the naive big-M formulation.

  3. Based on these reformulations, we develop an efficient greedy approach for solving (F0-CCP) and prove its performance guarantee under a mild condition. Our analysis shows that the proposed greedy approach is theoretically sound and computationally efficient.

  4. By establishing a relationship between the continuous relaxation value of the MISOC formulation and the optimal value $w^*$ of (F0-CCP), we develop a randomized algorithm based on the optimal continuous relaxation solution of the MISOC formulation and derive its theoretical properties. Such a continuous relaxation solution can help reduce the number of potential features and thus can be integrated with the greedy approach.

The remainder of the paper is organized as follows. Section 2 investigates the applicability of several existing approaches of CCP to the sparse ridge regression (F0). Section 3 develops two big-M free mixed integer convex program formulations and proves their equivalence. Section 4 proposes and analyzes two scalable algorithms and proves their performance guarantees. The numerical studies of the proposed scalable algorithms are presented in Section 5. We conclude this work with some discussion in Section 6.

The following notation is used throughout the paper. We use bold letters (e.g., $\mathbf{x}$, $\mathbf{A}$) to denote vectors or matrices, and the corresponding non-bold letters to denote their components. Given a positive integer $n$, we let $[n] := \{1, \ldots, n\}$ and let $\mathbf{I}_n$ denote the $n \times n$ identity matrix. Given a subset $S \subseteq [p]$, we let $\boldsymbol{\beta}_S$ denote the subvector of $\boldsymbol{\beta}$ with entries from subset $S$, and $\mathbf{X}_S$ denote the matrix with the subset of columns of $\mathbf{X}$ indexed by $S$. For a matrix $\mathbf{A}$, we let $\sigma_{\min}(\mathbf{A})$ and $\sigma_{\max}(\mathbf{A})$ denote its smallest and largest singular values, respectively. Given a vector $\mathbf{x}$, we let $\mathrm{Diag}(\mathbf{x})$ be the diagonal matrix with diagonal entries from $\mathbf{x}$. For a matrix $\mathbf{A}$, we let $\mathbf{A}_i$ denote its $i$th column. Given a set $T$, we let $\mathrm{conv}(T)$ denote its convex hull. Given a finite set $S$, we let $|S|$ denote its cardinality. Given two sets $S_1$ and $S_2$, we let $S_1 \setminus S_2$ denote the set of elements in $S_1$ but not in $S_2$, let $S_1 \cup S_2$ denote their union, and let $S_1 \triangle S_2$ be their symmetric difference, i.e., $S_1 \triangle S_2 = (S_1 \setminus S_2) \cup (S_2 \setminus S_1)$.

2 Existing Solution Approaches

In this section, we investigate three commonly used solution approaches to solve (F0-CCP).

2.1 Big-M Method

One typical method for a CCP is to formulate it as a mixed integer program (MIP) by introducing a binary variable $z_i$ for each scenario $i \in [p]$, i.e., $z_i = \mathbb{I}(\beta_i \ne 0)$, and then using the big-M method to linearize the indicators: supposing that $|\beta_i| \le M_i$ for a large positive number $M_i$, the indicator constraint is equivalent to $|\beta_i| \le M_i z_i$ with $z_i \in \{0,1\}$. Therefore, (F0-CCP) can be reformulated as the following MIP:

(F0-big-M) $\min_{\boldsymbol{\beta}, \mathbf{z}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_2^2 : |\beta_i| \le M_i z_i\ \forall i \in [p],\ \sum_{i \in [p]} z_i \le k,\ \mathbf{z} \in \{0,1\}^p \right\}$

The advantage of (F0-big-M) is that it can be directly solved by off-the-shelf MIP solvers (e.g., CPLEX, Gurobi). However, one has to choose the vector $\mathbf{M}$ properly.
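As an illustration, the following is a minimal sketch of the big-M model as written above, assuming the CVXPY modeling package and a mixed-integer-capable solver (e.g., Gurobi or MOSEK) are installed; the data, $\lambda$, and the big-M vector M are inputs, and all names are illustrative.

```python
import cvxpy as cp

def solve_big_m(X, y, k, lam, M):
    """Sketch of (F0-big-M): big-M linearization of the indicator constraints."""
    p = X.shape[1]
    beta = cp.Variable(p)
    z = cp.Variable(p, boolean=True)
    constraints = [
        cp.abs(beta) <= cp.multiply(M, z),  # |beta_i| <= M_i * z_i
        cp.sum(z) <= k,                     # at most k features selected
    ]
    objective = cp.Minimize(cp.sum_squares(y - X @ beta) + lam * cp.sum_squares(beta))
    prob = cp.Problem(objective, constraints)
    prob.solve()  # pass a mixed-integer QP solver, e.g., solver=cp.GUROBI, if needed
    return beta.value, z.value, prob.value
```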

There are many ways to choose the big-M coefficients (i.e., the vector $\mathbf{M}$). One typical way is, for each $i \in [p]$, to let $M_i$ be the largest value of $|\beta_i|$ over all $\boldsymbol{\beta}$ whose objective value in (F0-CCP) is bounded by an upper bound $\nu$ of $w^*$, i.e., to let $M_i$ be the larger optimal value of the following two convex quadratic programs:

(1) $\max_{\boldsymbol{\beta}} \left\{ \beta_i : \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_2^2 \le \nu \right\}, \qquad \max_{\boldsymbol{\beta}} \left\{ -\beta_i : \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_2^2 \le \nu \right\}.$

To solve (1), one needs to compute an upper bound $\nu$ of $w^*$. Note that the objective value of any feasible solution to (F0-CCP) suffices. A naive upper bound is $\nu = \|\mathbf{y}\|_2^2$, since $\boldsymbol{\beta} = \mathbf{0}$ is feasible to (F0-CCP). On the other hand, to obtain the vector $\mathbf{M}$, one has to solve the two convex quadratic programs in (1) for each $i \in [p]$, which can be very time-consuming, in particular when $p$ is large.
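For completeness, here is a sketch of this exact (but expensive) computation of $\mathbf{M}$ via the two convex quadratic programs in (1), again assuming CVXPY; the bound $\nu$ defaults to the naive upper bound $\|\mathbf{y}\|_2^2$ mentioned above, and the function name is illustrative.

```python
import cvxpy as cp
import numpy as np

def big_m_via_qps(X, y, lam, nu=None):
    """For each i, M_i = max |beta_i| subject to the ridge objective being <= nu."""
    p = X.shape[1]
    if nu is None:
        nu = float(y @ y)          # naive upper bound: beta = 0 is feasible
    M = np.zeros(p)
    beta = cp.Variable(p)
    bound = cp.sum_squares(y - X @ beta) + lam * cp.sum_squares(beta) <= nu
    for i in range(p):
        for sign in (+1.0, -1.0):  # the two convex QPs in (1)
            prob = cp.Problem(cp.Maximize(sign * beta[i]), [bound])
            prob.solve()
            M[i] = max(M[i], prob.value)
    return M
```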

Here we derive a slightly weaker but closed-form choice of the vector $\mathbf{M}$. Note that all the convex programs in (1) share the same constraint $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda\|\boldsymbol{\beta}\|_2^2 \le \nu$. Thus, the key proof idea is to relax this constraint into a weaker one, which is more amenable to a closed-form upper bound on the vector $\mathbf{M}$. This result is summarized below.

Proposition 1.

Suppose that $\nu$ is an upper bound of $w^*$; then the vector $\mathbf{M}$ can be chosen as

(2)
Proof.

In (1), it suffices to find an upper bound of $|\beta_i|$ for any feasible $\boldsymbol{\beta}$ satisfying

First, the above constraint implies that

where the first inequality is due to the inequality . Therefore, we have

where the first inequality is due to . Since , thus we have

for each .

On the other hand, we note that

Thus, another upper bound can be developed by letting , which implies that for each . ∎

It is known that the MIP (F0-big-M) with big-M coefficients typically has a very weak continuous relaxation value. Consequently, there has been significant research on improving the big-M coefficients of (F0-big-M), for example, [1, 3, 32, 33, 37]. However, the tightening procedures tend to be time-consuming, in particular for large-scale datasets. In Section 3, we will derive two big-M free MIP formulations, whose continuous relaxations can be proven to be stronger than that of (F0-big-M).

2.2 CVaR Approximation

Another well-known approximation of a CCP is the so-called conditional value-at-risk (CVaR) approximation (see [31] for details), which replaces the nonconvex probabilistic constraint by a convex constraint. For the sparse ridge regression in (F0-CCP), the resulting formulation is

(3)

where $(\cdot)_+ := \max\{\cdot, 0\}$. It is seen that (3) is a convex optimization problem and any of its feasible solutions is feasible to (F0-CCP); thus its optimal value provides an upper bound on $w^*$. However, we observe that the only feasible solution to (3) is $\boldsymbol{\beta} = \mathbf{0}$.

Proposition 2.

The only feasible solution to (3) is $\boldsymbol{\beta} = \mathbf{0}$, i.e., the optimal value of (3) is $\|\mathbf{y}\|_2^2$.

Proof.

We first observe that the infimum in (3) must be attained. Indeed, the function inside the infimum is continuous and convex in $t$, and it tends to infinity as $t \to \pm\infty$. Therefore, the infimum in (3) must be achieved. Hence, in (3), we can replace the infimum by a minimum as below:

Since and , therefore, , i.e.

which implies that $t = 0$ and $\beta_i = 0$ for each $i \in [p]$, i.e., $\boldsymbol{\beta} = \mathbf{0}$. ∎

Therefore, the CVaR approach yields only the trivial solution for (F0-CCP). Hence, it is not a desirable approach, and other alternatives are preferred.

2.3 Heuristic Algorithm in [1]

In the recent work [1], Ahmed et al. proposed a heuristic algorithm for CCPs with discrete distributions. It was reported that such a method can solve most of their numerical instances to near-optimality (i.e., within a 4% optimality gap). The key idea of the heuristic algorithm in [1] is to minimize the sum of infeasibilities over all scenarios when the objective value is upper bounded by a given value $u$. Specifically, they considered the following optimization problem:

(4)

Let $\boldsymbol{\beta}^u$ be an optimal solution to (4) given an upper bound $u$ of $w^*$. The heuristic algorithm decreases the value of $u$ if $\boldsymbol{\beta}^u$ is feasible to (F0-CCP), and increases it otherwise. This search (i.e., bisection) procedure terminates after a finite number of iterations. The detailed procedure is described in Algorithm 1. Let $\hat{\boldsymbol{\beta}}$ denote the output solution from Algorithm 1. Then clearly, its objective value provides an upper bound on $w^*$.

Proposition 3.

For Algorithm 1, the following two properties hold:

  1. It terminates within at most $\lceil \log_2((u_0 - l_0)/\epsilon) \rceil$ iterations, where $u_0$ and $l_0$ denote the initial upper and lower bounds and $\epsilon$ is the stopping tolerance; and

  2. It generates a feasible solution $\hat{\boldsymbol{\beta}}$ to (F0-CCP), i.e., $\|\hat{\boldsymbol{\beta}}\|_0 \le k$.

Proof.
  1. To prove the first part, note that according to the description of Algorithm 1, it terminates if and only if $u - l \le \epsilon$, and after each iteration the difference between $u$ and $l$ is halved. Thus, after at most $\lceil \log_2((u_0 - l_0)/\epsilon) \rceil$ iterations, we must have

    $u - l \le (u_0 - l_0)/2^{\lceil \log_2((u_0 - l_0)/\epsilon) \rceil} \le \epsilon$,

    i.e., the algorithm terminates.

  2. For the second part, we start with an upper bound $u_0$ corresponding to a feasible solution to (F0-CCP). Thus, clearly, Algorithm 1 keeps track of a feasible solution from iteration to iteration. Therefore, the output $\hat{\boldsymbol{\beta}}$ of Algorithm 1 is feasible to (F0-CCP), i.e., $\|\hat{\boldsymbol{\beta}}\|_0 \le k$. ∎

1:Let $l$ and $u$ be known lower and upper bounds for (F0-CCP), and let $\epsilon > 0$ be the stopping tolerance parameter.
2:while  $u - l > \epsilon$ do
3:     $m \leftarrow (l + u)/2$.
4:     Let $\boldsymbol{\beta}^m$ be an optimal solution of (4) with upper bound $m$, and set $z_i \leftarrow \mathbb{I}(\beta^m_i \ne 0)$ for all $i \in [p]$.
5:     if  $\sum_{i \in [p]} z_i \le k$ then
6:         $u \leftarrow m$.
7:     else
8:         $l \leftarrow m$.
9:     end if
10:end while
11:Output the best feasible solution found, $\hat{\boldsymbol{\beta}}$.
Algorithm 1 Heuristic Algorithm in [1]

It is worth remarking that for any given upper bound $u$, the formulation (4) is similar to the Dantzig selector proposed by [6]. The difference between Algorithm 1 and the LASSO is that this iterative procedure simultaneously guarantees the sparsity and reduces the regression error, while the LASSO seeks a trade-off between the error and the penalty on $\boldsymbol{\beta}$. We also note that Algorithm 1 might not be computationally efficient since it requires solving (4) multiple times. To the best of our knowledge, there is no known performance guarantee for Algorithm 1.
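To make the procedure concrete, below is a minimal sketch of Algorithm 1's bisection loop, assuming (consistently with the Dantzig-selector analogy above, but not stated explicitly here) that subproblem (4) minimizes the sum of scenario infeasibilities $\sum_i |\beta_i|$ subject to the ridge objective being at most $u$; the zero tolerance and all names are illustrative.

```python
import cvxpy as cp
import numpy as np

def solve_subproblem(X, y, lam, u):
    """Hypothetical subproblem (4): minimize the sum of scenario infeasibilities
    (here taken as |beta_i|) subject to the ridge objective being at most u."""
    p = X.shape[1]
    beta = cp.Variable(p)
    cons = [cp.sum_squares(y - X @ beta) + lam * cp.sum_squares(beta) <= u]
    cp.Problem(cp.Minimize(cp.norm1(beta)), cons).solve()
    return beta.value  # None if the subproblem is infeasible

def heuristic_bisection(X, y, lam, k, lower, upper, eps=1e-3, zero_tol=1e-6):
    """Sketch of Algorithm 1: bisect on the objective bound u."""
    best = None
    while upper - lower > eps:
        mid = (lower + upper) / 2.0
        beta = solve_subproblem(X, y, lam, mid)
        if beta is None:              # bound too small; relax it
            lower = mid
            continue
        support = int(np.sum(np.abs(beta) > zero_tol))
        if support <= k:              # feasible to (F0-CCP): tighten the bound
            best, upper = beta, mid
        else:                         # infeasible: relax the bound
            lower = mid
    return best
```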

3 Two Reformulations of Sparse Ridge Regression: Big-M Free

Note that the big-M formulation (F0-big-M) is quite compact since it only involves $O(p)$ variables (i.e., $\boldsymbol{\beta}$ and $\mathbf{z}$). However, it is usually a weak formulation in the sense that the continuous relaxation value of (F0-big-M) can be quite far from the optimal value $w^*$. In this section, we propose two big-M free reformulations of (F0-CCP) from distinct perspectives and prove their equivalence.

3.1 Mixed Integer Second Order Conic (MISOC) Formulation

In this subsection, we will derive an MISOC formulation and establish its analytical properties. To begin with, we first make an observation based on the perspective formulation in [17]. Let us consider the nonconvex set

(5) $Q_i := \left\{ (\beta_i, s_i, z_i) \in \mathbb{R} \times \mathbb{R}_+ \times \{0,1\} : \beta_i^2 \le s_i,\ \beta_i = 0 \text{ if } z_i = 0 \right\}$

for each $i \in [p]$. The results in [17] show that the convex hull of $Q_i$, denoted $\mathrm{conv}(Q_i)$, can be characterized as below.

Lemma 1.

(Lemma 3.1 in [17]) For each $i \in [p]$, the convex hull of the set $Q_i$ is

(6) $\mathrm{conv}(Q_i) = \left\{ (\beta_i, s_i, z_i) \in \mathbb{R} \times \mathbb{R}_+ \times [0,1] : \beta_i^2 \le s_i z_i \right\}.$

Lemma 1 suggests an extended formulation for (F0-CCP) without big-M coefficients. To achieve this goal, we first introduce a variable $s_i$ to be an upper bound of $\beta_i^2$ and a binary variable $z_i = \mathbb{I}(\beta_i \ne 0)$ for each $i \in [p]$; thus, (F0-CCP) is equal to

$\min\left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{i \in [p]} s_i : \beta_i^2 \le s_i\ \forall i \in [p],\ \sum_{i \in [p]} \mathbb{I}(\beta_i \ne 0) \le k \right\},$

which can be equivalently reformulated as

(7) $\min\left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{i \in [p]} s_i : (\beta_i, s_i, z_i) \in \mathrm{conv}(Q_i)\ \forall i \in [p],\ \mathbf{z} \in \{0,1\}^p,\ \sum_{i \in [p]} z_i \le k \right\}.$

Note that (i) in (7), we replace $\mathbb{I}(\beta_i \ne 0)$ by $z_i$ and enforce $z_i$ to be binary for each $i \in [p]$; and (ii) from Lemma 1, $\mathrm{conv}(Q_i)$ can be described by (6).

The above result is summarized in the following theorem.

Theorem 1.

The formulation (F0-CCP) is equivalent to

(F0-MISOC) $\min_{\boldsymbol{\beta}, \mathbf{s}, \mathbf{z}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{i \in [p]} s_i : \beta_i^2 \le s_i z_i\ \forall i \in [p],\ \sum_{i \in [p]} z_i \le k,\ \mathbf{s} \in \mathbb{R}_+^p,\ \mathbf{z} \in \{0,1\}^p \right\}.$

This formulation (F0-MISOC) introduces more variables than (F0-big-M), but it does not require any big-M coefficients.
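As an illustration, here is a CVXPY sketch of (F0-MISOC) as written above, where each conic constraint $\beta_i^2 \le s_i z_i$ is expressed through the perspective function quad_over_lin; a mixed-integer SOCP-capable solver (e.g., Gurobi or MOSEK) is assumed, and all names are illustrative.

```python
import cvxpy as cp

def solve_misoc(X, y, k, lam):
    """Sketch of (F0-MISOC): perspective (rotated-cone) reformulation, no big-M."""
    p = X.shape[1]
    beta = cp.Variable(p)
    s = cp.Variable(p, nonneg=True)      # s_i upper-bounds beta_i^2
    z = cp.Variable(p, boolean=True)
    constraints = [cp.sum(z) <= k]
    for i in range(p):
        # beta_i^2 <= s_i * z_i, written as quad_over_lin(beta_i, z_i) <= s_i
        constraints.append(cp.quad_over_lin(beta[i], z[i]) <= s[i])
    objective = cp.Minimize(cp.sum_squares(y - X @ beta) + lam * cp.sum(s))
    prob = cp.Problem(objective, constraints)
    prob.solve()  # pass a mixed-integer SOCP solver, e.g., solver=cp.GUROBI, if needed
    return beta.value, z.value, prob.value
```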

Next, we show that the convex hull of the feasible region of (F0-MISOC) is equal to its continuous relaxation. This suggests that we might not be able to improve the formulation by simply exploring the constraint system of (F0-MISOC). For notational convenience, let us denote by $\mathcal{Q}$ the feasible region of (F0-MISOC), i.e.,

(8) $\mathcal{Q} := \left\{ (\boldsymbol{\beta}, \mathbf{s}, \mathbf{z}) \in \mathbb{R}^p \times \mathbb{R}_+^p \times \{0,1\}^p : \beta_i^2 \le s_i z_i\ \forall i \in [p],\ \sum_{i \in [p]} z_i \le k \right\}.$

We show that the continuous relaxation of the set $\mathcal{Q}$ is equivalent to $\mathrm{conv}(\mathcal{Q})$, i.e.,

Proposition 4.

Let $\mathcal{Q}$ denote the feasible region of (F0-MISOC) as defined in (8), and let $\widehat{\mathcal{Q}}$ denote its continuous relaxation (i.e., with $\mathbf{z} \in [0,1]^p$). Then $\mathrm{conv}(\mathcal{Q}) = \widehat{\mathcal{Q}}$.

Proof.

Let $\widehat{\mathcal{Q}}$ be the continuous relaxation set of $\mathcal{Q}$, i.e., the set obtained from (8) by replacing $\mathbf{z} \in \{0,1\}^p$ with $\mathbf{z} \in [0,1]^p$.

We would like to show that $\mathrm{conv}(\mathcal{Q}) = \widehat{\mathcal{Q}}$.

  1. It is clear that $\mathrm{conv}(\mathcal{Q}) \subseteq \widehat{\mathcal{Q}}$.

  2. To prove $\widehat{\mathcal{Q}} \subseteq \mathrm{conv}(\mathcal{Q})$, we only need to show that for any given point $(\boldsymbol{\beta}, \mathbf{s}, \mathbf{z}) \in \widehat{\mathcal{Q}}$, we have $(\boldsymbol{\beta}, \mathbf{s}, \mathbf{z}) \in \mathrm{conv}(\mathcal{Q})$. Since $\mathbf{z}$ belongs to an integral polytope, there exist integral extreme points $\mathbf{z}^1, \ldots, \mathbf{z}^m$ such that $\mathbf{z} = \sum_{j} \mu_j \mathbf{z}^j$ with $\mu_j \ge 0$ for all $j$ and $\sum_j \mu_j = 1$. Now we construct the corresponding points $(\boldsymbol{\beta}^j, \mathbf{s}^j)$ for each $j$ as follows:

    First of all, we claim that for all . Indeed, for any ,

    As , thus, for each , we have

    Thus, .

Finally, we remark that if an upper bound $\nu$ of $w^*$ is known, then (F0-MISOC) can be further strengthened by adding the big-M constraints $|\beta_i| \le M_i z_i$ for each $i \in [p]$. This result is summarized in the following proposition.

Proposition 5.

The formulation (F0-CCP) is equivalent to

(F0-MISOC-M) $\min_{\boldsymbol{\beta}, \mathbf{s}, \mathbf{z}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \sum_{i \in [p]} s_i : (\boldsymbol{\beta}, \mathbf{s}, \mathbf{z}) \in \mathcal{Q},\ |\beta_i| \le M_i z_i\ \forall i \in [p] \right\},$

where the vector $\mathbf{M}$ can be chosen according to Proposition 1 and the set $\mathcal{Q}$ is defined in (8).

3.2 Mixed Integer Convex (MIC) Formulation

In this subsection, we will derive an MIC formulation equivalent to (F0-CCP). The main idea is to separate the optimization in (F0-CCP) into two steps: (i) first, we optimize over $\boldsymbol{\beta}$ for a fixed support of at most $k$ nonzero entries, and (ii) then we select the best subset of nonzero entries with size at most $k$. After the first step, it turns out that we arrive at a convex integer program, which is big-M free. We would like to acknowledge that this result has been independently observed by the recent works [4] and [11]. For completeness, we also present a different proof here.

Proposition 6.

The formulation (F0-CCP) is equivalent to

(F0-MIC) $\min_{\mathbf{z}} \left\{ \mathbf{y}^\top \Big( \mathbf{I}_n + \frac{1}{\lambda} \sum_{i \in [p]} z_i \mathbf{X}_i \mathbf{X}_i^\top \Big)^{-1} \mathbf{y} : \sum_{i \in [p]} z_i \le k,\ \mathbf{z} \in \{0,1\}^p \right\}.$
Proof.

We first reformulate (F0-CCP) as a combinatorial optimization problem. Let $S := \{i \in [p] : \beta_i \ne 0\}$ denote the support of $\boldsymbol{\beta}$, and reformulate (F0-CCP) as

$\min_{S \subseteq [p],\, |S| \le k}\ \min_{\boldsymbol{\beta}_S}\ \|\mathbf{y} - \mathbf{X}_S\boldsymbol{\beta}_S\|_2^2 + \lambda\|\boldsymbol{\beta}_S\|_2^2,$

where $\mathbf{X}_S$ is a submatrix of $\mathbf{X}$ with columns from subset $S$. Note that the inner minimization has the closed-form solution $\boldsymbol{\beta}_S^* = (\mathbf{X}_S^\top\mathbf{X}_S + \lambda\mathbf{I}_{|S|})^{-1}\mathbf{X}_S^\top\mathbf{y}$. Hence, (F0-CCP) is equivalent to

(9) $\min_{S \subseteq [p],\, |S| \le k}\ \mathbf{y}^\top\Big(\mathbf{I}_n + \frac{1}{\lambda}\mathbf{X}_S\mathbf{X}_S^\top\Big)^{-1}\mathbf{y},$

where, for any given $S$ with $|S| \le k$, the value of the inner minimum follows from the closed-form solution together with the Sherman–Morrison–Woodbury identity. By letting the binary variable $z_i = 1$ if $i \in S$ and 0 otherwise, the formulation in (9) is equivalent to the mixed integer convex program (F0-MIC). ∎

Note that in [4], Bertsimas and Van Parys proposed a branch-and-cut algorithm to solve (F0-MIC), which was shown to be effective in solving some large-sized instances. In the next subsection, we will show that the continuous relaxation of (F0-MIC) is equivalent to that of (F0-MISOC). Therefore, it can be more appealing to solve (F0-MISOC) directly with MISOC solvers (e.g., CPLEX, Gurobi). Indeed, we numerically compare the branch-and-cut algorithm with directly solving (F0-MISOC) in Section 5.

Finally, we remark that given the set of selected features $S$, its corresponding estimator can be computed by the following formula:

(10) $\hat{\boldsymbol{\beta}}_S = \left( \mathbf{X}_S^\top \mathbf{X}_S + \lambda \mathbf{I}_{|S|} \right)^{-1} \mathbf{X}_S^\top \mathbf{y}, \qquad \hat{\beta}_i = 0\ \ \forall i \notin S,$

where $\hat{\boldsymbol{\beta}}_S$ denotes the sub-vector of $\hat{\boldsymbol{\beta}}$ with entries from subset $S$.
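The following NumPy sketch computes the estimator in (10) for a given feature subset $S$ and, as a sanity check, verifies that its objective value matches the closed-form expression $\mathbf{y}^\top(\mathbf{I}_n + \lambda^{-1}\mathbf{X}_S\mathbf{X}_S^\top)^{-1}\mathbf{y}$ used in the derivation of (F0-MIC) above; the data, the subset $S$, and $\lambda$ are illustrative.

```python
import numpy as np

def restricted_ridge(X, y, S, lam):
    """Estimator (10): ridge regression restricted to the columns indexed by S."""
    XS = X[:, S]
    return np.linalg.solve(XS.T @ XS + lam * np.eye(len(S)), XS.T @ y)

# Illustrative data and feature subset.
rng = np.random.default_rng(1)
X, y, lam = rng.standard_normal((30, 8)), rng.standard_normal(30), 0.5
S = [0, 3, 5]

beta_S = restricted_ridge(X, y, S, lam)
obj = np.sum((y - X[:, S] @ beta_S) ** 2) + lam * np.sum(beta_S ** 2)

# Closed-form value of the inner minimization used for (F0-MIC).
XS = X[:, S]
closed_form = y @ np.linalg.solve(np.eye(len(y)) + XS @ XS.T / lam, y)
print(np.isclose(obj, closed_form))   # True, up to numerical tolerance
```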

3.3 Formulation Comparisons

In this subsection, we will focus on comparing (F0-big-M), (F0-MISOC), (F0-MISOC-M) and (F0-MIC) according to their continuous relaxation bounds. First, let the continuous relaxations of (F0-big-M), (F0-MISOC), (F0-MISOC-M) and (F0-MIC) be given by (11a), (11b), (11c) and (11d), respectively, with optimal values $v_{\text{big-M}}$, $v_{\text{MISOC}}$, $v_{\text{MISOC-M}}$ and $v_{\text{MIC}}$, i.e.,

(11a)
(11b)
(11c)
(11d)

Next, in the following theorem, we compare the proposed formulations, i.e., (F0-big-M), (F0-MISOC), (F0-MISOC-M) and (F0-MIC). In particular, we prove that $v_{\text{MISOC}} = v_{\text{MIC}}$, i.e., the continuous relaxation bounds of (F0-MISOC) and (F0-MIC) coincide. In addition, we show that by adding the big-M constraints $|\beta_i| \le M_i z_i$ for each $i \in [p]$ into (F0-MISOC), we arrive at a tighter relaxation bound than that of (F0-big-M), i.e., $v_{\text{big-M}} \le v_{\text{MISOC-M}}$.

Theorem 2.

Let $v_{\text{big-M}}$, $v_{\text{MISOC}}$, $v_{\text{MISOC-M}}$ and $v_{\text{MIC}}$ denote the optimal values of (11a), (11b), (11c) and (11d), respectively. Then

  1. $v_{\text{MISOC}} = v_{\text{MIC}}$; and

  2. $v_{\text{big-M}} \le v_{\text{MISOC-M}}$.

Proof.
  1. By Lemma A.1 in [34], we note that (11c) is equivalent to

    s.t.

    where by default, we let . Now let and introduce a new variable to denote for each . Then the above formulation is equivalent to

    s.t.

    Finally, in the above formulation, replace . Then we arrive at (11b).

  2. Note that the set of constraints in (11b) is a subset of those in (11c). Thus, $v_{\text{MISOC}} \le v_{\text{MISOC-M}}$.

  3. Let be optimal solution of (11c). Clearly, we must have for each , otherwise, suppose that there exists such that , then the objective value of (11c) can be strictly less than by letting and for each