A Safe Screening Rule for Sparse Logistic Regression

07/16/2013 · by Jie Wang, et al.

The ℓ1-regularized logistic regression (or sparse logistic regression) is a widely used method for simultaneous classification and feature selection. Although many recent efforts have been devoted to its efficient implementation, its application to high-dimensional data still poses significant challenges. In this paper, we present a fast and effective sparse logistic regression screening rule (Slores) to identify the 0 components in the solution vector, which may lead to a substantial reduction in the number of features to be entered into the optimization. An appealing feature of Slores is that the data set needs to be scanned only once to run the screening, and its computational cost is negligible compared to that of solving the sparse logistic regression problem. Moreover, Slores is independent of solvers for sparse logistic regression, and thus it can be integrated with any existing solver to improve efficiency. We have evaluated Slores using high-dimensional data sets from different applications. Extensive experimental results demonstrate that Slores outperforms existing state-of-the-art screening rules and that the efficiency of solving sparse logistic regression is improved by one order of magnitude in general.


1 Introduction

Logistic regression (LR) is a popular and well-established classification method that has been widely used in many domains such as machine learning [5, 8], text mining [4, 9], image processing [10, 17], bioinformatics [1, 15, 22, 30, 31], medical and social sciences [2, 19], etc. When the number of feature variables is large compared to the number of training samples, logistic regression is prone to over-fitting. To reduce over-fitting, regularization has been shown to be a promising approach; typical examples include ℓ1 and ℓ2 regularization. Although ℓ1 regularized LR is more challenging to solve than ℓ2 regularized LR, it has received much attention in the last few years and the interest in it is growing [23, 27, 31] due to the increasing prevalence of high-dimensional data. The most appealing property of ℓ1 regularized LR is the sparsity of the resulting models, which is equivalent to feature selection.

In the past few years, many algorithms have been proposed to efficiently solve the ℓ1 regularized LR [6, 14, 13, 20]. However, for large-scale problems, solving the ℓ1 regularized LR with high accuracy remains challenging. One promising approach is “screening”, that is, to first identify the “inactive” features, which have 0 coefficients in the solution, and then discard them from the optimization. This results in a reduced feature matrix and substantial savings in computational cost and memory. In [7], El Ghaoui et al. proposed novel screening rules, called “SAFE”, to accelerate the optimization for a class of ℓ1 regularized problems, including LASSO [25], ℓ1 regularized LR and ℓ1 regularized support vector machines. Inspired by SAFE, Tibshirani et al. [24] proposed “strong rules” for a large class of ℓ1 regularized problems, including LASSO, elastic net, ℓ1 regularized LR and more general convex problems. In [29, 28], Xiang et al. proposed “DOME” rules to further improve SAFE rules for LASSO, based on the observation that SAFE rules can be understood as a special case of the general “sphere test”. Although both strong rules and the sphere tests are more effective in discarding features than SAFE for solving LASSO, it is worthwhile to mention that strong rules may mistakenly discard features that have non-zero coefficients in the solution, and the sphere tests are not easy to generalize to the ℓ1 regularized LR. To the best of our knowledge, the SAFE rule is the only screening test for the ℓ1 regularized LR that is “safe”, that is, it only discards features that are guaranteed to be absent from the resulting models.

Figure 1: Comparison of Slores, strong rule and SAFE on the prostate cancer data set.

In this paper, we develop novel screening rules, called “Slores”, for the ℓ1 regularized LR. The proposed screening tests detect inactive features by estimating an upper bound of the inner product between each feature vector and the “dual optimal solution” of the ℓ1 regularized LR, which is unknown. The more accurate the estimation is, the more inactive features can be detected. An accurate estimation of such an upper bound turns out to be quite challenging. Indeed, most of the key ideas behind existing “safe” screening rules for LASSO rely heavily on the least squares loss and are therefore not applicable to the ℓ1 regularized LR due to the presence of the logistic loss. To this end, we propose a novel framework to accurately estimate an upper bound. Our key technical contribution is to formulate the estimation of the upper bound of the inner product as a constrained convex optimization problem and to show that it admits a closed form solution, so the bound can be computed efficiently. Our extensive experiments show that Slores discards far more features than SAFE yet requires much less computational effort. In contrast to strong rules, Slores is “safe”, i.e., it never discards features which have non-zero coefficients in the solution. To illustrate the effectiveness of Slores, we compare Slores, the strong rule and SAFE on a prostate cancer data set along a sequence of parameters equally spaced on the λ/λ_max scale, where λ is the parameter for the ℓ1 penalty and λ_max is the smallest tuning parameter [12] such that the solution of the ℓ1 regularized LR is 0 [please refer to Eq. (1)]. The data matrix contains the patients as rows and the features as columns. To measure the performance of different screening rules, we compute the rejection ratio, which is the ratio between the number of features discarded by a screening rule and the number of features with 0 coefficients in the solution. Therefore, the larger the rejection ratio is, the more effective the screening rule is. The results are shown in Fig. 1. Clearly, Slores discards far more features than SAFE over most of the range of λ/λ_max, while the strong rule is not applicable over part of that range. We present more experimental results and discussions to demonstrate the effectiveness of Slores in Section 6.
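For concreteness, the rejection ratio used in Fig. 1 can be computed as in the following sketch (the variable names are ours, not the paper's):

  import numpy as np

  # Sketch of the rejection ratio: `discarded` marks the features removed by a
  # screening rule, `beta` is the solution of the l1-regularized LR at the same
  # value of lambda.
  def rejection_ratio(discarded: np.ndarray, beta: np.ndarray) -> float:
      n_zero = np.sum(beta == 0)             # features absent from the model
      n_discarded = np.sum(discarded)        # features removed by screening
      return n_discarded / max(n_zero, 1)    # at most 1 for a safe rule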

2 Basics and Motivations

In this section, we briefly review the basics of the ℓ1 regularized LR and then motivate the general screening rules via the KKT conditions. Suppose we are given a set of training samples {x_i : i = 1, …, m} and the associated labels b ∈ ℝ^m, where x_i ∈ ℝ^p and b_i ∈ {1, −1} for all i. The ℓ1 regularized logistic regression problem is:

(LRP)   min_{β,c}  (1/m) Σ_{i=1}^{m} log(1 + exp(−⟨x̄_i, β⟩ − b_i c)) + λ‖β‖₁,

where β ∈ ℝ^p and c ∈ ℝ are the model parameters to be estimated, x̄_i = b_i x_i, and λ > 0 is the tuning parameter. Let the data matrix X̄ be the m × p matrix whose i-th row is x̄_i and whose j-th column is x̄^j.
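As a quick way to experiment with (LRP), one can use an off-the-shelf solver. The sketch below uses scikit-learn's liblinear backend; the mapping C ≈ 1/(mλ) assumes the averaged loss written above, and both the solver choice and that mapping are our assumptions rather than part of the paper.

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  # Sketch: fit the l1-regularized LR (LRP) with scikit-learn.
  # X is m x p, b has entries in {1, -1}, and lam is the tuning parameter lambda.
  # With the averaged logistic loss, scikit-learn's C corresponds roughly to
  # 1 / (m * lam); this mapping is an assumption, not taken from the paper.
  def fit_sparse_lr(X: np.ndarray, b: np.ndarray, lam: float) -> np.ndarray:
      m = X.shape[0]
      clf = LogisticRegression(penalty="l1", C=1.0 / (m * lam),
                               solver="liblinear", fit_intercept=True)
      clf.fit(X, b)
      return clf.coef_.ravel()   # zeros mark the inactive features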

Let and for . The dual problem of (LRP) (please refer to the supplement) is given by

(LRD)

To simplify notation, we denote the feasible set of problem (LRD) by ℱ, and let (β*_λ, c*_λ) and θ*_λ be the optimal solutions of problems (LRP) and (LRD) respectively. In [12], the authors have shown that for some special choice of the tuning parameter λ, both (LRP) and (LRD) have closed form solutions. In fact, let 𝒫 = {i : b_i = 1} and 𝒩 = {i : b_i = −1}, and let m⁺ and m⁻ be the cardinalities of 𝒫 and 𝒩 respectively. We define

(1)   λ_max = (1/m) ‖X̄ᵀ θ*_{λ_max}‖_∞,

where

(2)   [θ*_{λ_max}]_i = m⁻/m if b_i = 1,   and   [θ*_{λ_max}]_i = m⁺/m if b_i = −1

([·]_i denotes the i-th component of a vector). Then it is known [12] that β*_λ = 0 and θ*_λ = θ*_{λ_max} whenever λ ≥ λ_max. When λ ∈ (0, λ_max], it is known that (LRD) has a unique optimal solution. (For completeness, we include the proof in the supplement.)
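Under this formulation (averaged logistic loss, labels in {1, −1}, and x̄_i = b_i x_i), λ_max and θ*_{λ_max} can be computed directly. The following sketch assumes exactly that formulation and is only meant to illustrate Eqs. (1) and (2) as reconstructed here.

  import numpy as np

  # Sketch: compute lambda_max and the corresponding dual optimum theta,
  # assuming the averaged-loss form of (LRP) with labels in {1, -1}.
  def lambda_max(X: np.ndarray, b: np.ndarray):
      m = b.size
      m_pos, m_neg = np.sum(b == 1), np.sum(b == -1)
      theta = np.where(b == 1, m_neg / m, m_pos / m)   # Eq. (2), as reconstructed
      Xbar = X * b[:, None]                            # rows are b_i * x_i
      lam_max = np.abs(Xbar.T @ theta).max() / m       # Eq. (1), as reconstructed
      return lam_max, theta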

We can now write the KKT conditions of problems (LRP) and (LRD) as

(3)   ⟨x̄^j, θ*_λ⟩ = mλ · sign([β*_λ]_j) if [β*_λ]_j ≠ 0,   and   |⟨x̄^j, θ*_λ⟩| ≤ mλ if [β*_λ]_j = 0,   for j = 1, …, p.

In view of Eq. (3), we can see that

(R1)   If |⟨x̄^j, θ*_λ⟩| < mλ, then [β*_λ]_j = 0.

In other words, if |⟨x̄^j, θ*_λ⟩| < mλ, then the KKT conditions imply that the coefficient of x̄^j in the solution is 0, and thus the j-th feature can be safely removed from the optimization of (LRP). However, for the general case in which λ < λ_max, (R1) is not applicable, since it assumes knowledge of θ*_λ. Although θ*_λ is unknown, we can still estimate a region 𝒜 which contains θ*_λ. As a result, if max_{θ∈𝒜} |⟨x̄^j, θ⟩| < mλ, we can also conclude that [β*_λ]_j = 0 by (R1). In other words, (R1) can be relaxed as

(R1′)   If max_{θ∈𝒜} |⟨x̄^j, θ⟩| < mλ, then [β*_λ]_j = 0.

In this paper, (R1′) serves as the foundation for constructing our screening rules, Slores. From (R1′), it is easy to see that screening rules with a smaller upper bound are more aggressive in discarding features. To obtain a tight upper bound, we need to make the region 𝒜 which contains θ*_λ as small as possible. In Section 3, we show that the estimation of the upper bound can be obtained by solving a convex optimization problem. We show in Section 4 that the convex optimization problem admits a closed form solution, and we derive Slores in Section 5 based on (R1′).
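Rule (R1′) translates directly into code once an upper bound on |⟨x̄^j, θ⟩| over the region 𝒜 is available for every feature. The sketch below keeps that bound abstract: upper_bound is a hypothetical stand-in for the Slores bound of Theorem 8, and the m·λ threshold follows the form of (R1′) above.

  import numpy as np
  from typing import Callable

  # Sketch of rule (R1'): discard feature j whenever its upper bound falls below m*lam.
  # `upper_bound` is a hypothetical stand-in for the Slores bound of Theorem 8.
  def screen(Xbar: np.ndarray, lam: float,
             upper_bound: Callable[[np.ndarray], float]) -> np.ndarray:
      m, p = Xbar.shape
      keep = np.ones(p, dtype=bool)
      for j in range(p):
          if upper_bound(Xbar[:, j]) < m * lam:   # (R1'): feature j must be inactive
              keep[j] = False
      return keep                                 # True marks features kept for the solver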

3 Estimating the Upper Bound via Solving a Convex Optimization Problem

In this section, we present a novel framework to estimate an upper bound of |⟨x̄^j, θ*_λ⟩|. In the subsequent development, we assume that a parameter λ₀ and the corresponding dual optimal solution θ*_{λ₀} are given. In our Slores rule to be presented in Section 5, we set λ₀ and θ*_{λ₀} to be λ_max and θ*_{λ_max} given in Eqs. (1) and (2). We formulate the estimation of the upper bound as a constrained convex optimization problem in this section, which will be shown to admit a closed form solution in Section 4.

The Hessian of the dual function g is a diagonal matrix, and it satisfies ∇²g(θ) ⪰ μI for a constant μ > 0, where I is the identity matrix. Thus, g is strongly convex with modulus μ [18]. Rigorously, we have the following lemma.

Lemma 1.

Let and , then

(4)

b). If , the inequality in (4) becomes a strict inequality, i.e., “” becomes “”.
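For reference, the standard strong convexity inequality on which Lemma 1 builds (our paraphrase, with μ the modulus from above and θ₁, θ₂ feasible for (LRD)) has the form

g(θ₁) ≥ g(θ₂) + ⟨∇g(θ₂), θ₁ − θ₂⟩ + (μ/2) ‖θ₁ − θ₂‖₂².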

Given λ and λ₀, it is easy to see that both θ*_λ and θ*_{λ₀} belong to ℱ. Therefore, Lemma 1 is a useful tool to bound θ*_λ given knowledge of θ*_{λ₀}. In fact, we have the following theorem.

Theorem 2.

Let , then the following holds:

(5)

b). If , the inequality in (5) becomes a strict inequality, i.e., “” becomes “”.

Proof.

a). It is easy to see that , and . Therefore, both of and belong to the set . By Lemma 1, we have

(6)

Let . It is easy to see that

Therefore, we can see that and thus

Then the inequality in (6) becomes

(7)

On the other hand, by noting that (LRD) is feasible, we can see that Slater's condition holds and thus the KKT conditions [21] lead to:

(8)

where , and is the normal cone of at [21]. Because and is an open set, is an interior point of and thus [21]. Therefore, Eq. (8) becomes:

(9)

Let , and . We can see that . By the complementary slackness condition, if , we have . Therefore,

Similarly, we have

Recalling (7), the inequality in (5) follows.

b). The proof is the same as part a) by noting part b) of Lemma 1. ∎

Theorem 2 implies that θ*_λ is inside a ball centred at θ*_{λ₀} with radius

(10)

Recall that, to make our screening rules more aggressive in discarding features, we need a tight upper bound of |⟨x̄^j, θ*_λ⟩| [please see (R1′)]. Thus, it is desirable to further restrict the possible region of θ*_λ. Clearly, we can see that

(11)

since θ*_λ is feasible for problem (LRD). On the other hand, we call the set defined in the proof of Theorem 2 the “active set”. In fact, we have the following lemma for the active set.

Lemma 3.

Given the optimal solution of problem (LRD), the active set is not empty if .

Since , we can see that is not empty by Lemma 3. We pick and set

(12)

It follows that . Due to the feasibility of for problem (LRD), satisfies

(13)

As a result, Theorem 2 and Eqs. (11) and (13) imply that θ*_λ is contained in the following set:

Since θ*_λ belongs to this set, (R1′) implies that if the optimal value of the problem

(UBP)

is smaller than mλ, we can conclude that [β*_λ]_j = 0 and the j-th feature can be discarded from the optimization of (LRP). Notice that we replace the earlier notations with ones that emphasize their dependence on λ. Clearly, as long as we can solve problem (UBP), (R1′) is an applicable screening rule for discarding features which have 0 coefficients in β*_λ. We give a closed form solution of problem (UBP) in the next section.
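Because the closed form is deferred to Section 4, a numerical sanity check can be helpful. The sketch below solves a UBP-type problem, maximizing a linear function over the intersection of a ball and a half-space, with a generic convex solver; the ball centre, radius and half-space used here are placeholders, not the exact constraint set from Theorem 2 and Eqs. (11) and (13).

  import cvxpy as cp
  import numpy as np

  # Sketch: numerically maximize |<a, theta>| over the intersection of a ball
  # ||theta - c|| <= r and a half-space <v, theta> <= d.  All parameters are
  # placeholders standing in for the quantities that define the set above.
  def ubp_numeric(a, c, r, v, d):
      theta = cp.Variable(a.size)
      constraints = [cp.norm(theta - c, 2) <= r,
                     cp.sum(cp.multiply(v, theta)) <= d]
      best = -np.inf
      for sign in (1.0, -1.0):                  # handle the absolute value
          objective = cp.Maximize(sign * cp.sum(cp.multiply(a, theta)))
          best = max(best, cp.Problem(objective, constraints).solve())
      return best

When the half-space constraint is inactive at the optimum, the value returned should agree with the ball-only closed form ⟨a, c⟩ + r‖a‖₂.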

4 Solving the Convex Optimization Problem (UBP)

In this section, we show how to solve the convex optimization problem (UBP) based on the standard Lagrangian multiplier method. We first transform problem (UBP) into a pair of convex minimization problems via Eq. (15) and then show in Lemma 6 that strong duality holds, which guarantees the applicability of the Lagrangian multiplier method. We then give the closed form solution in Theorem 8. Once these sub-problems are solved, it is straightforward to compute the solution of problem (UBP) via Eq. (15).

Before we solve (UBP) for the general case, it is worthwhile to mention a special case. Here, the operator involved is the projection which maps a vector onto the orthogonal complement of the space spanned by the response vector. In fact, we have the following theorem.

Theorem 4.

Let , and assume is known. For , if , then .

Because of (R1), we immediately have the following corollary.

Corollary 5.

Let and . If , then .
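To make the projection operator mentioned above concrete, the following sketch builds the projector onto the orthogonal complement of the span of a vector v (standing in for the response vector); this is generic linear algebra, not the paper's specific construction.

  import numpy as np

  # Sketch: projection onto the orthogonal complement of span{v}.
  # P @ u removes from u its component along v, so P @ u is orthogonal to v.
  def proj_orth_complement(v: np.ndarray) -> np.ndarray:
      v = v / np.linalg.norm(v)
      return np.eye(v.size) - np.outer(v, v)   # P = I - v v^T / ||v||^2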

For the general case in which , let

(14)

Clearly, we have

(15)

Therefore, we can solve problem (UBP) by solving the two sub-problems in (14).

Let . Then problems in (14) can be written uniformly as

(UBP)

To make use of the standard Lagrangian multiplier method, we transform problem (UBP) to the following minimization problem:

(UBP)

by noting that .

Lemma 6.

Let and assume is known. The strong duality holds for problem (UBP). Moreover, problem (UBP) admits an optimal solution in .

Because strong duality holds for problem (UBP) by Lemma 6, the Lagrangian multiplier method is applicable to (UBP). In general, we need to first solve the dual problem and then recover the optimal solution of the primal problem via the KKT conditions. Recall that the relevant quantities are defined by Eqs. (10) and (12) respectively. Lemma 7 derives the dual problems of (UBP) for the different cases.
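Before stating Lemma 7, it may help to recall how the Lagrangian multiplier method yields a closed form in the simplest, ball-only instance; this is our own simplified illustration, not the paper's derivation. Consider max_θ ⟨w, θ⟩ subject to ‖θ − c‖₂ ≤ r with w ≠ 0. The Lagrangian is

L(θ, u) = ⟨w, θ⟩ − (u/2)(‖θ − c‖₂² − r²),   u ≥ 0.

Setting ∇_θ L = w − u(θ − c) = 0 gives θ = c + w/u; since w ≠ 0, complementary slackness forces ‖θ − c‖₂ = r, hence u = ‖w‖₂/r and

θ* = c + r w/‖w‖₂,   with optimal value ⟨w, c⟩ + r‖w‖₂.

With the additional constraints of (UBP), the analysis is more involved, which leads to the different cases in Lemma 7 and Theorem 8.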

Lemma 7.

Let and assume is known. For and , let . Denote

a). If , the dual problem of (UBP) is equivalent to:

(UBD)

Moreover, attains its maximum in .

b). If , the dual problem of (UBP) is equivalent to:

(UBD)

We can now solve problem (UBP) in the following theorem.

Theorem 8.

Let , and assume is known. For and , let .

  1. If then

    (16)
  2. If then

    (17)

    where

    (18)

Notice that although the dual problems of (UBP) in Lemma 7 are different, the resulting upper bound can be given by Theorem 8 in a uniform way. The tricky part is how to deal with the extremal cases. To avoid a lengthy discussion, we omit the proof of Theorem 8 in the main text and include the details in the supplement.

5 The Proposed Slores Rule for ℓ1 Regularized Logistic Regression

Using (R1′), we are now ready to construct the screening rules for the ℓ1 regularized logistic regression. By Corollary 5, we can see that the orthogonality between a feature and the response vector implies the absence of that feature from the resulting model. For the general case, (R1′) implies that if the upper bound given by Theorem 8 is smaller than mλ, then the corresponding feature can be discarded from the optimization of (LRP). Rigorously, we have the following theorem.

Theorem 9 (Slores).

Let and assume is known.

  1. If , then ;

  2. If and either of the following holds:

    1. ,

    2. ,

    then .

Based on Theorem 9, we construct the Slores rule as summarized below in Algorithm 1.

  Initialize ;
  if  then
     set ;
  else
     for  to  do
         
         if  then
            remove from ;
         else if  then
            remove from ;
         end if
     end for
  end if
  Return:
Algorithm 1
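To make the control flow of Algorithm 1 concrete, here is a sketch in Python. The two tests of Theorem 9 are abstracted into a single hypothetical slores_bounds callable (their closed forms come from Theorem 8 and the supplement), the m·λ threshold follows the reconstructed rule (R1′), and the early exit for λ ≥ λ_max reflects the fact that β*_λ = 0 in that regime.

  import numpy as np
  from typing import Callable, Tuple

  # Sketch of the Slores screening loop.  `slores_bounds(j)` is a hypothetical
  # callable returning the two quantities tested in Theorem 9 for feature j.
  def slores(Xbar: np.ndarray, lam: float, lam_max: float,
             slores_bounds: Callable[[int], Tuple[float, float]]) -> np.ndarray:
      m, p = Xbar.shape
      if lam >= lam_max:
          return np.array([], dtype=int)       # beta* = 0: every feature is discarded
      keep = np.ones(p, dtype=bool)
      for j in range(p):
          t1, t2 = slores_bounds(j)            # the two cases tested in Theorem 9
          if t1 < m * lam or t2 < m * lam:     # either test certifies [beta*]_j = 0
              keep[j] = False
      return np.flatnonzero(keep)              # indices of features entering the solver

The returned indices can then be used to form the reduced feature matrix that is passed to the solver, as discussed next.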

Notice that the output of Slores is the set of indices of the features that need to be entered into the optimization. As a result, given the output of Algorithm 1, we can substitute the full data matrix in problem (LRP) with the sub-matrix consisting of the retained columns.