# Linear convergence of SDCA in statistical estimation

In this paper, we consider stochastic dual coordinate ascent (SDCA) without the strong convexity assumption, or even the convexity assumption. We show that SDCA converges linearly under a mild condition termed restricted strong convexity. This covers a wide array of popular statistical models, including Lasso, group Lasso, logistic regression with ℓ_1 regularization, corrected Lasso, and linear regression with the SCAD regularizer, and significantly improves previous convergence results on SDCA for problems that are not strongly convex. As a by-product, we derive a dual-free form of SDCA that can handle a general regularization term, which is of interest in its own right.


## 1 Introduction

First-order methods have again become a central focus of research in optimization, and particularly in machine learning, in recent years, thanks to their ability to address very large-scale empirical risk minimization problems that are ubiquitous in machine learning, a task that is often challenging for other algorithms such as interior-point methods. The randomized (dual) coordinate version of a first-order method samples one data point and updates the objective at each time step, which avoids computing the full gradient and further improves the speed. Related methods have been implemented in various software packages Vedaldi and Fulkerson (2008). In particular, the randomized dual coordinate method considers the following problem:

$$\min_{w\in\Omega} F(w):=\frac{1}{n}\sum_{i=1}^n f_i(w)+\lambda g(w)=f(w)+\lambda g(w), \qquad (1)$$

where $f_i$ is a convex loss function of each sample, $g$ is the regularizer, and $\Omega$ is a convex compact set. Instead of directly solving the primal problem, the method looks at the dual problem

$$D(\alpha)=\frac{1}{n}\sum_{i=1}^n-\psi_i^*(-\alpha_i)-\lambda g^*\!\left(\frac{1}{\lambda n}\sum_{i=1}^n X_i\alpha_i\right),$$

where it is assumed that the loss function has the form $f_i(w)=\psi_i(X_i^\top w)$. Shalev-Shwartz and Zhang (2013) consider this dual form and prove linear convergence of the stochastic dual coordinate ascent method (SDCA) when $g(w)=\frac{1}{2}\|w\|_2^2$. They further extend the result to a general form, which allows the regularizer to be a general strongly convex function Shalev-Shwartz and Zhang (2014). Qu and Richtárik (2015) propose AdaSDCA, an adaptive variant of SDCA, which adaptively changes the probability distribution over the dual variables through the iterative process. Their experimental results outperform the non-adaptive methods.

Dünner et al. (2016) consider a primal-dual framework and extend SDCA to the non-strongly convex case with a sublinear rate. Allen-Zhu and Yuan (2015a) further improve the convergence speed using a novel non-uniform sampling that selects each coordinate with probability proportional to the square root of its smoothness parameter. Other acceleration techniques Qu and Richtárik (2014); Qu et al. (2015); Nesterov (2012), as well as mini-batch and distributed variants of coordinate methods Liu and Wright (2015); Zhao et al. (2014); Jaggi et al. (2014); Mahajan et al. (2014), have been studied in the literature. See Wright (2015) for a review of coordinate methods.

Our goal is to investigate how to extend SDCA to non-convex statistical problems. Non-convex optimization problems have attracted fast-growing attention due to the rise of numerous applications, notably non-convex M-estimators (e.g., SCAD Fan and Li (2001), MCP Zhang (2010)), deep learning Goodfellow et al. (2016), and robust regression (e.g., corrected Lasso in Loh and Wainwright, 2011). Shalev-Shwartz (2016) proposes a dual-free version of SDCA and proves its linear convergence, addressing the case where each individual loss function may be non-convex but their sum is strongly convex. This extends the applicability of SDCA. From a technical perspective, due to the non-convexity of the individual losses, that paper avoids explicitly using the dual (hence the name "dual free") by introducing pseudo-dual variables.

In this paper, we consider using SDCA to solve M-estimators that are not strongly convex, or even non-convex. We show that under the restricted strong convexity condition, SDCA converges linearly. This setup includes well-known formulations such as Lasso, group Lasso, logistic regression with $\ell_1$ regularization, linear regression with the SCAD regularizer Fan and Li (2001), and corrected Lasso Loh and Wainwright (2011), to name a few. This significantly improves upon existing theoretical results Shalev-Shwartz and Zhang (2013, 2014); Shalev-Shwartz (2016), which only established sublinear convergence on convex objectives.

To this end, we first adapt SDCA Shalev-Shwartz and Zhang (2014) into a generalized dual-free form in Algorithm 1. This is because, to apply SDCA in our setup, we need to introduce a non-convex term and thus require a dual-free analysis. We remark that the theory of dual-free SDCA established in Shalev-Shwartz (2016) does not apply to our setup for the following reasons:

1. In Shalev-Shwartz (2016), $F$ needs to be strongly convex, so the result does not apply to Lasso or non-convex M-estimators.

2. Shalev-Shwartz (2016) only studies the special case where $g(w)=\frac{1}{2}\|w\|_2^2$, while the M-estimators we consider include non-smooth regularizers such as the $\ell_1$ norm or group norms.

We illustrate the update rule of Algorithm 1 using an example. While our main focus is on problems that are not strongly convex, we start with a strongly convex example to obtain some intuition. Suppose $g(w)=\frac{1}{2}\|w\|_2^2$ and $\Omega=\mathbb{R}^d$; then $g^*(v)=\frac{1}{2}\|v\|_2^2$, and the proximal step admits a simple closed form.

It is then clear that to apply SDCA, $g$ needs to be strongly convex; otherwise the proximal step becomes ill-defined: the conjugate may be infinite (if $\Omega$ is unbounded) or the maximizer may not be unique. This observation motivates the following preprocessing step: if $g$ is not strongly convex, for example in Lasso where $f_i(w)=\frac{1}{2}\left(y_i-\langle x_i,w\rangle\right)^2$ and $g(w)=\|w\|_1$, we redefine the formulation by adding a strongly convex term to the regularizer and subtracting it from the losses. More precisely, for some $\tilde\lambda>0$, we define $\tilde g(w)=\frac{1}{2}\|w\|_2^2+\frac{\lambda}{\tilde\lambda}\|w\|_1$ and apply Algorithm 1 to the resulting formulation (which is equivalent to Lasso), where the value of $\tilde\lambda$ will be specified later. Our analysis is thus focused on this alternative representation. The main challenge arises from the non-convexity of the newly defined $\phi_{n+1}$, which precludes the use of the dual method as in Shalev-Shwartz and Zhang (2014). While Shalev-Shwartz (2016) proposes and analyzes a dual-free SDCA algorithm, the results do not apply to our setting for the reasons discussed above.
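To make the preprocessing step concrete, here is a minimal numerical sketch (our illustration, not the paper's code; `lam_t` plays the role of $\tilde\lambda$) checking that the reformulated objective coincides with the original Lasso objective:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
lam, lam_t = 0.1, 0.05            # regularization lambda and added weight lambda-tilde

def f(w):
    # smooth part: f(w) = (1/2n) * ||y - Xw||^2 (average squared loss)
    return 0.5 / n * np.sum((y - X @ w) ** 2)

def F(w):
    # original Lasso objective: f(w) + lam * ||w||_1
    return f(w) + lam * np.sum(np.abs(w))

def F_reformulated(w):
    # phi_i = (n+1)/n * f_i for i <= n, so their sum is (n+1) * f(w);
    # phi_{n+1} = -lam_t * (n+1)/2 * ||w||^2 removes the added curvature
    phi_sum = (n + 1) * f(w) - lam_t * (n + 1) / 2 * np.sum(w ** 2)
    # g_tilde = ||w||^2 / 2 + (lam / lam_t) * ||w||_1 is 1-strongly convex
    g_tilde = 0.5 * np.sum(w ** 2) + (lam / lam_t) * np.sum(np.abs(w))
    return phi_sum / (n + 1) + lam_t * g_tilde

w = rng.standard_normal(d)
print(np.isclose(F(w), F_reformulated(w)))   # True: the two objectives agree
```

The cancellation of the $\pm\frac{\tilde\lambda}{2}\|w\|_2^2$ terms is what keeps the reformulation equivalent while giving the regularizer the 1-strong convexity the proximal step needs.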

Our contributions are two-fold. 1. We prove linear convergence of SDCA for a class of problems that are not strongly convex, or even non-convex, making use of the concept of restricted strong convexity (RSC). To the best of our knowledge, this is the first work to prove linear convergence of SDCA in this setting, which includes several statistical models such as Lasso, group Lasso, logistic regression, linear regression with SCAD regularization, and corrected Lasso, to name a few. 2. As a by-product, we derive a dual-free form of SDCA that extends the work of Shalev-Shwartz (2016) to account for a more general regularizer $g$.

Related work.   Agarwal et al. (2010) prove linear convergence of the batch gradient and composite gradient methods for Lasso and the low-rank matrix completion problem using the RSC condition. Loh and Wainwright (2013) extend this to a class of non-convex M-estimators. In spirit, our work can be thought of as a stochastic counterpart. Recently, Qu et al. (2016, 2017) consider SVRG and SAGA in a setting similar to ours, but they work with the primal problem. Shalev-Shwartz (2016) considers dual-free SDCA, but the analysis does not apply to the case where $F$ is not strongly convex. Similarly, Allen-Zhu and Yuan (2015b) consider the non-strongly convex setting in SVRG, where $f$ is convex and each individual $f_i$ may be non-convex, but only establish sublinear convergence. Recently, Li et al. (2016) consider SVRG with a zero-norm constraint and prove linear convergence for this specific formulation. In comparison, our results hold more generally, covering not only the sparsity model but also corrected Lasso with noisy covariates, the group sparsity model, and beyond.

## 2 Problem setup

In this paper we consider two setups, namely, (1) convex but not strongly convex $F(w)$ and (2) non-convex $F(w)$. For the convex case, we consider

$$\min_{g(w)\le\rho} F(w):=f(w)+\lambda g(w):=\frac{1}{n}\sum_{i=1}^n f_i(w)+\lambda g(w), \qquad (2)$$

where $\rho$ is some pre-defined radius, $f_i$ is the loss function for sample $i$, and $g$ is a norm. Here we assume each $f_i$ is convex and $L$-smooth.

For the non-convex $F(w)$ we consider

$$\min_{d_\lambda(w)\le\rho} F(w):=f(w)+d_{\lambda,\mu}(w)=\frac{1}{n}\sum_{i=1}^n f_i(w)+d_{\lambda,\mu}(w), \qquad (3)$$

where $\rho$ is some pre-defined radius, each $f_i$ is convex and smooth, and $d_{\lambda,\mu}$ is a non-convex regularizer depending on a tuning parameter $\lambda$ and a parameter $\mu$ explained in Section 2.3. This M-estimator includes a side constraint depending on $d_\lambda$, which needs to be a convex function admitting a lower bound, and is closed with respect to $w$; more details are deferred to Section 2.3.

We list some examples that belong to these two setups.

• Lasso: $F(w)=\frac{1}{2n}\|y-Xw\|_2^2+\lambda\|w\|_1$.

• Logistic regression with $\ell_1$ penalty:

$$F(w)=\sum_{i=1}^n\log\left(1+\exp(-y_ix_i^\top w)\right)+\lambda\|w\|_1.$$
• Group Lasso: $F(w)=\frac{1}{2n}\|y-Xw\|_2^2+\lambda\sum_{i=1}^{N_G}\|w_{G_i}\|_2$.

• Corrected Lasso Loh and Wainwright (2011): $F(w)=\frac{1}{2}w^\top\hat\Gamma w-\hat\gamma^\top w+\lambda\|w\|_1$, where $\hat\Gamma$ is an estimator of some positive definite matrix.

• Linear regression with the SCAD regularizer Fan and Li (2001): $F(w)=\frac{1}{2n}\|y-Xw\|_2^2+\sum_{j=1}^d\mathrm{SCAD}_{\lambda,\zeta}(w_j)$.

The first three examples belong to the first setup, while the last two belong to the second.
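As a quick reference, the first two objectives in the list can be sketched as follows (our illustration, not the paper's code; labels $y_i\in\{-1,+1\}$ for the logistic case):

```python
import numpy as np

def lasso_objective(w, X, y, lam):
    # F(w) = (1/2n) * ||y - Xw||^2 + lam * ||w||_1
    n = X.shape[0]
    return 0.5 / n * np.sum((y - X @ w) ** 2) + lam * np.sum(np.abs(w))

def logistic_l1_objective(w, X, y, lam):
    # F(w) = sum_i log(1 + exp(-y_i <x_i, w>)) + lam * ||w||_1
    margins = -y * (X @ w)
    return np.sum(np.logaddexp(0.0, margins)) + lam * np.sum(np.abs(w))

rng = np.random.default_rng(0)
n, d = 10, 4
X = rng.standard_normal((n, d))
y = np.where(rng.standard_normal(n) > 0, 1.0, -1.0)
# at w = 0 the logistic loss is exactly n * log(2)
print(np.isclose(logistic_l1_objective(np.zeros(d), X, y, 0.1), n * np.log(2)))
```

`np.logaddexp(0, t)` computes $\log(1+e^t)$ in a numerically stable way, which matters for large margins.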

### 2.1 Restricted Strong Convexity

Restricted strong convexity (RSC) was proposed in Negahban et al. (2009); Van De Geer et al. (2009) and has been explored in several other works Agarwal et al. (2010); Loh and Wainwright (2013). We say the loss function $f$ satisfies the RSC condition with curvature $\kappa$ and tolerance parameter $\tau$ with respect to the norm $g$ if

$$\Delta f(w_1,w_2)\triangleq f(w_1)-f(w_2)-\langle\nabla f(w_2),w_1-w_2\rangle\ge\frac{\kappa}{2}\|w_1-w_2\|_2^2-\tau g^2(w_1-w_2). \qquad (4)$$

When $f$ is strongly convex, it is RSC with $\kappa$ equal to the strong convexity parameter and $\tau=0$. However, in many cases $f$ may not be strongly convex, especially in the high-dimensional setting where the ambient dimension $d$ exceeds the sample size $n$. On the other hand, RSC is easier to satisfy. Take Lasso as an example: under some mild conditions, it is shown in Negahban et al. (2009) that $\Delta f(w_1,w_2)\ge\kappa_1\|w_1-w_2\|_2^2-\kappa_2\frac{\log d}{n}\|w_1-w_2\|_1^2$, where $\kappa_1,\kappa_2$ are some positive constants. Besides Lasso, the RSC condition holds for a large range of statistical models, including the log-linear model, the group sparsity model, and the low-rank model. See Negahban et al. (2009) for more detailed discussion.
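The following toy check (our illustration, with arbitrary constants) shows the point of RSC for least squares when $d\gg n$: the Bregman divergence vanishes along null-space directions of $X$, so strong convexity fails, yet the RSC lower bound can still hold because such directions have a large $\ell_1$-norm, which the $\tau$-term absorbs:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 200                        # high-dimensional: d >> n
X = rng.standard_normal((n, d))

def bregman(delta):
    # For f(w) = ||Xw - y||^2 / (2n), Delta f(w1, w2) depends only on delta = w1 - w2
    return np.sum((X @ delta) ** 2) / (2 * n)

# f is NOT strongly convex: pick a unit direction in the null space of X
null_dir = np.linalg.svd(X)[2][-1]    # last right singular vector, ||.||_2 = 1
print(bregman(null_dir) < 1e-8)       # True: zero curvature in this direction

# Yet kappa/2 * ||delta||_2^2 - tau * ||delta||_1^2 <= Delta f can still hold:
# a generic null-space direction is dense, so ||delta||_1 is large and the
# tau-term compensates (illustrative constants, not the paper's):
kappa, tau = 0.5, np.log(d) / n
rhs = kappa / 2 - tau * np.sum(np.abs(null_dir)) ** 2   # since ||delta||_2 = 1
print(bregman(null_dir) >= rhs)       # True: RSC bound not violated here
```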

### 2.2 Assumption on convex regularizer g(w)

Decomposability is the other ingredient needed to analyze the algorithm.

Definition: A regularizer $g$ is decomposable with respect to a pair of subspaces $(\mathcal{A},\mathcal{B})$ with $\mathcal{A}\subseteq\mathcal{B}$ if

$$g(\alpha+\beta)=g(\alpha)+g(\beta)\quad\text{for all }\alpha\in\mathcal{A},\ \beta\in\mathcal{B}^\perp,$$

where $\mathcal{B}^\perp$ denotes the orthogonal complement of $\mathcal{B}$.

A concrete example is the $\ell_1$ regularization for sparse vectors supported on a subset $S\subseteq\{1,\dots,d\}$. We define the subspace pair with respect to the subset $S$ as $\mathcal{A}(S)=\{w\in\mathbb{R}^d\mid w_j=0\text{ for all }j\notin S\}$ and $\mathcal{B}(S)=\mathcal{A}(S)$. The decomposability is thus easy to verify. Other widely used examples include non-overlapping group norms, such as $\|\cdot\|_{G,q}$, and the nuclear norm Negahban et al. (2009). In the rest of the paper, we denote by $w_{\mathcal{A}}$ the projection of $w$ onto the subspace $\mathcal{A}$.

#### Subspace compatibility constant

For any subspace $\mathcal{A}$ of $\mathbb{R}^d$, the subspace compatibility constant with respect to the pair $(g,\|\cdot\|_2)$ is given by

$$\Psi(\mathcal{A})=\sup_{u\in\mathcal{A}\setminus\{0\}}\frac{g(u)}{\|u\|_2}.$$

That is, it is the Lipschitz constant of the regularizer restricted to $\mathcal{A}$. For example, for the above-mentioned sparse vectors with cardinality $s$, $\Psi(\mathcal{A})=\sqrt{s}$ for $g(\cdot)=\|\cdot\|_1$.
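A quick numerical sanity check (ours) of the claim $\Psi(\mathcal{A})=\sqrt{s}$ for the $\ell_1$ norm on an $s$-sparse subspace: the ratio $\|u\|_1/\|u\|_2$ never exceeds $\sqrt{s}$ on the subspace, and a sign vector on the support attains it:

```python
import numpy as np

rng = np.random.default_rng(2)
d, s = 100, 10
support = rng.choice(d, size=s, replace=False)

ratios = []
for _ in range(1000):
    u = np.zeros(d)
    u[support] = rng.standard_normal(s)   # u lies in the subspace A(S)
    ratios.append(np.sum(np.abs(u)) / np.linalg.norm(u))

print(max(ratios) <= np.sqrt(s) + 1e-12)  # Cauchy-Schwarz: ||u||_1 <= sqrt(s)*||u||_2

u = np.zeros(d)
u[support] = 1.0                          # a sign vector attains the supremum
print(np.isclose(np.sum(np.abs(u)) / np.linalg.norm(u), np.sqrt(s)))
```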

### 2.3 Assumption on non-convex regularizer dλ,μ(w)

We consider regularizers that are separable across coordinates, i.e., $d_{\lambda,\mu}(w)=\sum_{j=1}^d d_{\lambda,\mu}(w_j)$. Besides separability, we make further assumptions on the univariate function $d_{\lambda,\mu}(t)$:

1. $d_{\lambda,\mu}(\cdot)$ satisfies $d_{\lambda,\mu}(0)=0$ and is symmetric about zero (i.e., $d_{\lambda,\mu}(t)=d_{\lambda,\mu}(-t)$).

2. On the nonnegative real line, $d_{\lambda,\mu}(t)$ is nondecreasing.

3. For $t>0$, $d_{\lambda,\mu}(t)/t$ is nonincreasing in $t$.

4. $d_{\lambda,\mu}(t)$ is differentiable for all $t\neq 0$ and subdifferentiable at $t=0$, with $\lim_{t\to 0^+}d'_{\lambda,\mu}(t)=\lambda$.

5. $d_{\lambda,\mu}(t)+\frac{\mu}{2}t^2$ is convex.

For instance, SCAD satisfies these assumptions. The SCAD regularizer takes the form

$$d_{\lambda,\mu}(t)=\begin{cases}\lambda|t|, & \text{for }|t|\le\lambda,\\ -\left(t^2-2\zeta\lambda|t|+\lambda^2\right)/\left(2(\zeta-1)\right), & \text{for }\lambda<|t|\le\zeta\lambda,\\ (\zeta+1)\lambda^2/2, & \text{for }|t|>\zeta\lambda,\end{cases}$$

where $\zeta>2$ is a fixed parameter. It satisfies the assumptions with $\mu=\frac{1}{\zeta-1}$ Loh and Wainwright (2013).
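The SCAD penalty is straightforward to implement from the displayed cases; the sketch below (ours) also verifies numerically that adding $\frac{\mu}{2}t^2$ with $\mu=\frac{1}{\zeta-1}$ restores convexity, in line with assumption 5:

```python
import numpy as np

def scad(t, lam=1.0, zeta=3.7):
    # SCAD penalty of Fan and Li (2001); zeta > 2 is a fixed parameter
    a = abs(t)
    if a <= lam:
        return lam * a
    if a <= zeta * lam:
        return -(a ** 2 - 2 * zeta * lam * a + lam ** 2) / (2 * (zeta - 1))
    return (zeta + 1) * lam ** 2 / 2

lam, zeta = 1.0, 3.7
print(np.isclose(scad(lam), lam ** 2))                          # continuous at |t| = lam
print(np.isclose(scad(zeta * lam), (zeta + 1) * lam ** 2 / 2))  # and at |t| = zeta*lam

# scad(t) + mu/2 * t^2 with mu = 1/(zeta - 1) is convex: discrete second
# differences on a grid are (numerically) nonnegative
mu = 1 / (zeta - 1)
ts = np.linspace(-8, 8, 2001)
vals = np.array([scad(t) + mu / 2 * t ** 2 for t in ts])
second_diff = vals[:-2] - 2 * vals[1:-1] + vals[2:]
print(second_diff.min() >= -1e-9)
```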

### 2.4 Applying SDCA

#### 2.4.1 Convex F(w)

Following a similar line as for Lasso, to apply the SDCA algorithm we define $\phi_i(w)=\frac{n+1}{n}f_i(w)$ for $i=1,\dots,n$, $\phi_{n+1}(w)=-\frac{\tilde\lambda(n+1)}{2}\|w\|_2^2$, and $\tilde g(w)=\frac{1}{2}\|w\|_2^2+\frac{\lambda}{\tilde\lambda}g(w)$. Correspondingly, the smoothness parameters become $\frac{n+1}{n}L$ for $\phi_1,\dots,\phi_n$ and $\tilde\lambda(n+1)$ for $\phi_{n+1}$. Problem (2) is thus equivalent to the following:

$$\min_{w\in\Omega}\frac{1}{1+n}\sum_{i=1}^{n+1}\phi_i(w)+\tilde\lambda\tilde g(w). \qquad (5)$$

This enables us to apply Algorithm 1 to the problem defined by the functions $\phi_1,\dots,\phi_{n+1}$, the regularizer $\tilde g$, and the parameter $\tilde\lambda$. In particular, while $\phi_{n+1}$ is not convex, $\tilde g$ is still convex (indeed 1-strongly convex). We exploit this property in the proof and define $\tilde g^*(v)=\max_{w\in\Omega}\left(\langle v,w\rangle-\tilde g(w)\right)$, where $\Omega$ is a convex compact set. Since $\tilde g$ is 1-strongly convex, $\tilde g^*$ is 1-smooth (Theorem 1 of Nesterov, 2005).
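Concretely, for the $\ell_1$ case with $\Omega=\mathbb{R}^d$ (our simplification, so the constraint is inactive), the maximizer defining $\nabla\tilde g^*$ is soft-thresholding. The brute-force check below (ours) confirms the closed form coordinate-wise:

```python
import numpy as np

def soft_threshold(v, c):
    # argmax_w <v,w> - (||w||^2/2 + c*||w||_1)  =  sign(v) * max(|v| - c, 0)
    return np.sign(v) * np.maximum(np.abs(v) - c, 0.0)

rng = np.random.default_rng(3)
v = rng.standard_normal(4)
c = 0.3

# the objective is separable, so maximize each coordinate on a fine grid
grid = np.linspace(-5.0, 5.0, 100001)
brute = np.array([grid[np.argmax(vi * grid - 0.5 * grid ** 2 - c * np.abs(grid))]
                  for vi in v])
print(np.allclose(brute, soft_threshold(v, c), atol=1e-3))   # True
```

Without the added $\frac{1}{2}\|w\|_2^2$ term, the maximization over $w$ would be unbounded for large $\|v\|$, which is exactly the ill-definedness discussed above.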

#### 2.4.2 Non-convex F(w)

Similarly, we define

$$\phi_i(w)=\frac{n+1}{n}f_i(w)\ \ \text{for }i=1,\dots,n,\qquad \phi_{n+1}(w)=-\frac{(\tilde\lambda+\mu)(n+1)}{2}\|w\|_2^2,\qquad \tilde g(w)=\frac{1}{2}\|w\|_2^2+\frac{\lambda}{\tilde\lambda}d_\lambda(w),$$

and then apply Algorithm 1 on

$$\min_{w\in\Omega}\frac{1}{1+n}\sum_{i=1}^{n+1}\phi_i(w)+\tilde\lambda\tilde g(w).$$

The update rules of the proximal step for various regularizers $d_{\lambda,\mu}$ (such as SCAD and MCP) can be found in Loh and Wainwright (2013).

## 3 Theoretical Result

In this section, we present the main theoretical results, together with corollaries that instantiate them in several well-known statistical models.

### 3.1 Convex F(w)

To begin with, we define several terms related to the algorithm.

• $w^*$ is the true unknown parameter, and $g^*$ is the dual norm of $g$. The conjugate function is $\tilde g^*(v)=\max_{w\in\Omega}\left(\langle v,w\rangle-\tilde g(w)\right)$, where $\Omega$ is the feasible set.

• $\hat w$ is the optimal solution to Problem (2), and we assume $\hat w$ is in the interior of the feasible set w.l.o.g. by choosing $\rho$ large enough.

• $(\hat w,\hat\alpha)$ is an optimal solution pair satisfying the first-order optimality condition $\hat\alpha_i=-\nabla\phi_i(\hat w)$ for each $i$.

• The potentials $A_t$ and $B_t$, where $A_t$ measures the distance of the pseudo-dual variables $\alpha^t$ to $\hat\alpha$ and $B_t$ measures the distance of the primal iterate $w^t$ to $\hat w$.

We remark that our definition of the first potential is the same as in Shalev-Shwartz (2016), while the second one is different. If $g(w)=\frac{1}{2}\|w\|_2^2$, our definition of $B_t$ reduces to that in Shalev-Shwartz (2016). We then define another potential $C_t$, which is a combination of $A_t$ and $B_t$. Notice that, using the smoothness of the $\phi_i$ and Lemmas 1 and 2 in the appendix, it is not hard to upper bound the objective gap $F(w^t)-F(\hat w)$ in terms of $C_t$; thus if $C_t$ converges, so does the objective gap.

For notational simplicity, we define the following two terms used in the theorem.

• Effective RSC parameter $\tilde\kappa$: the RSC curvature $\kappa$ discounted by a term proportional to $\tau\Psi^2(\mathcal{B})$.

• Tolerance $\delta$: a term proportional to $\tau g^2(\hat w-w^*)$, where $c$ is a universal positive constant.

###### Theorem 1.

Assume each $f_i$ is smooth and convex, $f$ satisfies the RSC condition with parameters $(\kappa,\tau)$, $w^*$ is feasible, and the regularizer $g$ is decomposable with respect to a pair of subspaces $(\mathcal{A},\mathcal{B})$ such that $w^*\in\mathcal{A}$. Suppose Algorithm 1 is run with an appropriate step size $\eta$, where $\tilde\lambda$ is chosen such that $0<\tilde\lambda\le\tilde\kappa$. If we choose the regularization parameter $\lambda$ such that $\lambda\ge c\,g^*(\nabla f(w^*))$, where $c$ is some universal positive constant, then we have

$$\mathbb{E}(C_t)\le(1-\eta\tilde\lambda)^tC_0,$$

until the objective gap $F(w^t)-F(\hat w)$ reaches the tolerance $\delta$, where the expectation is over the randomness of the sampling of $i$ in the algorithm.

Some remarks are in order for interpreting the theorem.

1. In several statistical models, the requirement $0<\tilde\lambda\le\tilde\kappa$ is easy to satisfy under mild conditions. For instance, in Lasso we have $\tau=c\frac{\log d}{n}$, and $\kappa$ is bounded below by a constant depending on $\Sigma$ if the feature vectors are sampled from $N(0,\Sigma)$. Thus, if $n$ is large enough that $\frac{s\log d}{n}$ is small, we have $\tilde\kappa>0$.

2. In some models, we can choose the pair of subspaces such that $w^*\in\mathcal{A}$, and the tolerance simplifies accordingly. In Lasso, as we mentioned above, $\tau=c\frac{\log d}{n}$, and thus $\delta$ scales as $\frac{s\log d}{n}\|\hat w-w^*\|_2^2$, i.e., this tolerance is dominated by the statistical error if $\frac{s\log d}{n}$ is small.

3. Since $C_t$ dominates a weighted combination of $A_t$ and $B_t$, if $C_t$ converges, so do $A_t$ and $B_t$. When $C_t$ is small enough, using Lemma 5 in the supplementary material, it is easy to bound the objective gap $F(w^t)-F(\hat w)$.

Combining these remarks, Theorem 1 states that the optimization error decreases geometrically until it achieves the tolerance $\delta$, which is dominated by the statistical error $\|\hat w-w^*\|_2^2$ and thus can be ignored from the statistical point of view.
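As a purely illustrative experiment (ours; this runs generic proximal gradient descent, not Algorithm 1), one can watch the Lasso objective gap shrink geometrically on a well-conditioned synthetic instance before flattening out near numerical precision:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, s = 200, 50, 5
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:s] = 1.0
y = X @ w_star + 0.1 * rng.standard_normal(n)
lam = 0.1

def F(w):
    return 0.5 / n * np.sum((y - X @ w) ** 2) + lam * np.sum(np.abs(w))

def soft(v, c):
    return np.sign(v) * np.maximum(np.abs(v) - c, 0.0)

step = n / np.linalg.norm(X, 2) ** 2      # 1/L for the smooth part
w = np.zeros(d)
history = []
for _ in range(300):
    w = soft(w - step * (X.T @ (X @ w - y) / n), step * lam)
    history.append(F(w))

gaps = np.array(history) - min(history)   # use the best value as a proxy optimum
print(gaps[200] < 1e-2 * (gaps[0] + 1e-12))  # large geometric decrease by iter 200
```

Here $n>d$, so the smooth part is strongly convex and linear convergence is expected; the theorem's point is that a similar geometric phase persists even when $d\gg n$, up to the statistical tolerance.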

If $g$ in Problem (1) is indeed 1-strongly convex, we have the following proposition, which extends dual-free SDCA Shalev-Shwartz (2016) to the general regularization case. Notice that we now directly apply Algorithm 1 to Problem (1) and change the definitions of the potentials $A_t$ and $B_t$ correspondingly, while $C_t$ is still defined in the same way, i.e., as a combination of $A_t$ and $B_t$.

###### Proposition 1.

Suppose each $f_i$ is smooth, $f$ is convex, $g$ is 1-strongly convex, and $\hat w$ is the optimal solution of Problem (1). Then we have: (I) if each $f_i$ is convex, we run Algorithm 1 with an appropriate step size $\eta$ and obtain a geometric decrease of $\mathbb{E}(C_t)$; (II) otherwise, we run Algorithm 1 with a smaller step size and obtain a similar guarantee with a worse rate. In both cases the contraction factor lies in $(0,1)$, and thus $C_t$ decreases geometrically.

In the following we present several corollaries that instantiate Theorem 1 in concrete statistical models. This essentially requires choosing the appropriate subspace pair $(\mathcal{A},\mathcal{B})$ in these models and checking the RSC condition.

#### 3.1.1 Sparse linear regression

Our first example is Lasso, where $f_i(w)=\frac{1}{2}\left(y_i-\langle x_i,w\rangle\right)^2$ and $g(w)=\|w\|_1$. We assume each feature vector $x_i$ is generated from the normal distribution $N(0,\Sigma)$, and the true parameter $w^*$ is sparse with cardinality $s$. The observation is generated by $y_i=\langle x_i,w^*\rangle+\xi_i$, where $\xi_i$ is Gaussian noise with mean $0$ and variance $\sigma^2$. We denote the data matrix by $X\in\mathbb{R}^{n\times d}$, where $X_j$ is the $j$th column of $X$. Without loss of generality, we assume $X$ is column-normalized, i.e., $\|X_j\|_2/\sqrt{n}\le 1$ for all $j=1,\dots,d$. We denote by $\sigma_{\min}(\Sigma)$ the smallest eigenvalue of $\Sigma$ and by $\sigma_{\max}(\Sigma)$ the largest.

###### Corollary 1.

Assume $w^*$ is the true parameter supported on a subset $S$ with cardinality at most $s$, and we choose the parameters $\lambda$ and $\tilde\lambda$ such that $\lambda\ge c_1\sigma\sqrt{\frac{\log d}{n}}$ and $0<\tilde\lambda\le\tilde\kappa$ hold, where $\tilde\kappa=\sigma_{\min}(\Sigma)-c_2\,\sigma_{\max}(\Sigma)\frac{s\log d}{n}$ and $c_1,c_2$ are some universal positive constants. Then we run Algorithm 1 with an appropriate step size $\eta$ and have

$$\mathbb{E}(C_t)\le(1-\eta\tilde\lambda)^tC_0$$

with probability at least $1-c_3/d$, until the objective gap reaches the tolerance $\delta$, where $\delta$ scales as $\frac{s\log d}{n}\|\hat w-w^*\|_2^2$ and $c_3$ is a universal positive constant.

The requirement $n\gtrsim s\log d$ is documented in the literature Negahban et al. (2009) to ensure that Lasso is statistically consistent, and $0<\tilde\lambda\le\tilde\kappa$ is needed for fast convergence of optimization algorithms, similar to the condition proposed in Agarwal et al. (2010) for batch optimization algorithms. When $\frac{s\log d}{n}=o(1)$, which is necessary for the statistical consistency of Lasso, we have $\tilde\kappa>0$, which guarantees the existence of $\tilde\lambda$. Also notice that under this condition, $\delta$ is of lower order than $\|\hat w-w^*\|_2^2$. Using remark 3 after Theorem 1, the remaining error is dominated by the statistical error and hence can be ignored from the statistical perspective. To sum up, Corollary 1 states that the optimization error decreases geometrically until it achieves the statistical limit of Lasso.

#### 3.1.2 Group Sparsity Model

Yuan and Lin (2006) introduced the group Lasso to allow predefined groups of covariates to be selected into or out of a model together. In the following, we define group sparsity formally. We assume the groups $G_1,\dots,G_{N_G}$ are disjoint, i.e., $G_i\cap G_j=\emptyset$ for $i\ne j$, and their union covers $\{1,\dots,d\}$. The regularizer that encourages group sparsity is $\|w\|_{G,q}=\sum_{i=1}^{N_G}\|w_{G_i}\|_q$. When $q=2$, it reduces to the commonly used group Lasso Yuan and Lin (2006); another popularly used case is $q=\infty$ Turlach et al. (2005); Quattoni et al. (2009). We require the following condition, which generalizes the column normalization condition in the Lasso case. Given a group $G_i$ of size $m$, the associated operator norm satisfies

$$\frac{\left|\left|\left|X_{G_i}\right|\right|\right|_{q\to 2}}{\sqrt{n}}\le 1\quad\text{for all }i=1,2,\dots,N_G.$$

The condition reduces to the column normalization condition when each group contains only one feature (i.e., Lasso).

We now define the subspace pair in the group sparsity model. For a subset $S_G\subseteq\{1,\dots,N_G\}$ with cardinality $s_G$, we define the subspace

$$\mathcal{A}(S_G)=\{w\mid w_{G_i}=0\ \text{for all }i\notin S_G\},$$

and $\mathcal{B}(S_G)=\mathcal{A}(S_G)$. The orthogonal complement is

$$\mathcal{B}^\perp(S_G)=\{w\mid w_{G_i}=0\ \text{for all }i\in S_G\}.$$

We can easily verify that

$$\|\alpha+\beta\|_{G,q}=\|\alpha\|_{G,q}+\|\beta\|_{G,q},$$

for any $\alpha\in\mathcal{A}(S_G)$ and $\beta\in\mathcal{B}^\perp(S_G)$.
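This decomposability is mechanical to verify numerically; the sketch below (ours) builds $\alpha$ supported on the active groups and $\beta$ on the complement and checks additivity of the group norm:

```python
import numpy as np

rng = np.random.default_rng(5)
m, N_G = 3, 4                          # 4 disjoint groups of size 3, d = 12
groups = [list(range(i * m, (i + 1) * m)) for i in range(N_G)]
S_G = {0, 1}                           # indices of the active groups

def group_norm(w, q=2):
    # ||w||_{G,q} = sum_i ||w_{G_i}||_q
    return sum(np.linalg.norm(w[g], q) for g in groups)

alpha = np.zeros(m * N_G)
beta = np.zeros(m * N_G)
for i, g in enumerate(groups):
    if i in S_G:
        alpha[g] = rng.standard_normal(m)    # alpha in A(S_G)
    else:
        beta[g] = rng.standard_normal(m)     # beta in B-perp(S_G)

print(np.isclose(group_norm(alpha + beta),
                 group_norm(alpha) + group_norm(beta)))   # True
```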

In the following corollary, we use $q=2$, i.e., group Lasso, as an example. We assume the observation is generated by $y_i=\langle x_i,w^*\rangle+\xi_i$, where $x_i\sim N(0,\Sigma)$ and $\xi_i\sim N(0,\sigma^2)$.

###### Corollary 2 (Group Lasso).

Assume $w^*$ is supported on the group subset $S_G$, and each group has $m$ parameters, i.e., $d=mN_G$. Denote by $s_G$ the cardinality of the set of non-zero groups, and choose the parameters such that

$$\lambda\ge\max\left(4\sigma\left(\sqrt{\frac{m}{n}}+\sqrt{\frac{\log N_G}{n}}\right),\ c_1\rho\,\sigma_2(\Sigma)\left(\sqrt{\frac{m}{n}}+\sqrt{\frac{3\log N_G}{n}}\right)^2\right);\quad\text{and}\quad 0<\tilde\lambda\le\tilde\kappa,\ \text{where }\tilde\kappa=\sigma_1(\Sigma)-c_2\,\sigma_2(\Sigma)\,s_G\left(\sqrt{\frac{m}{n}}+\sqrt{\frac{3\log N_G}{n}}\right)^2;$$

where $c_1$ and $c_2$ are positive constants depending only on $\Sigma$. If we run Algorithm 1 with an appropriate step size $\eta$, then we have

$$\mathbb{E}(C_t)\le(1-\eta\tilde\lambda)^tC_0$$

with probability at least $1-c_3/N_G$, until the objective gap reaches the tolerance $\delta$, where $\delta$ scales as $s_G\left(\sqrt{\frac{m}{n}}+\sqrt{\frac{3\log N_G}{n}}\right)^2\|\hat w-w^*\|_2^2$.

We offer some discussion to interpret the corollary. To satisfy the requirement $\tilde\kappa>0$, it suffices to have

$$s_G\left(\sqrt{\frac{m}{n}}+\sqrt{\frac{3\log N_G}{n}}\right)^2=o(1).$$

This is a mild condition, as it is needed to guarantee the statistical consistency of group Lasso Negahban et al. (2009). Notice that the condition is easily satisfied when $m$ and $s_G$ are small compared to $n$. Under this same condition, we conclude that $\delta$ is dominated by $\|\hat w-w^*\|_2^2$. Again, it implies that the optimization error decreases geometrically up to a scale dominated by the statistical error of the model.

#### 3.1.3 Extension to generalized linear model

We consider the generalized linear model of the following form,

$$\min_{w\in\Omega}\frac{1}{n}\sum_{i=1}^n\left(\Phi(\langle w,x_i\rangle)-y_i\langle w,x_i\rangle\right)+\lambda\|w\|_1,$$

which covers such cases as Lasso (where $\Phi(t)=\frac{1}{2}t^2$) and logistic regression (where $\Phi(t)=\log(1+\exp(t))$). In this model, we have

$$\Delta f(w_1,w_2)=\frac{1}{n}\sum_{i=1}^n\Phi''(\langle w_t,x_i\rangle)\langle x_i,w_1-w_2\rangle^2,$$

where $w_t=tw_1+(1-t)w_2$ for some $t\in[0,1]$. The RSC condition is thus equivalent to:

$$\frac{1}{n}\sum_{i=1}^n\Phi''(\langle w_t,x_i\rangle)\langle x_i,w_1-w_2\rangle^2\ge\frac{\kappa}{2}\|w_1-w_2\|_2^2-\tau g^2(w_1-w_2)\quad\text{for }w_1,w_2\in\Omega. \qquad (6)$$

Here we require $\Omega$ to be a bounded set Loh and Wainwright (2013). This requirement is essential, since in some generalized linear models $\Phi''(t)$ approaches zero as $|t|$ diverges. For instance, in logistic regression, $\Phi''(t)=\frac{e^t}{(1+e^t)^2}$, which tends to zero as $|t|\to\infty$. For a broad class of generalized linear models, RSC holds with $\tau$ proportional to $\frac{\log d}{n}$, and thus the same result as that of Lasso holds, modulo a change of constants.
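The vanishing curvature of the logistic model is easy to see numerically; the sketch below (ours) evaluates $\Phi''(t)=\frac{e^t}{(1+e^t)^2}$ in the numerically stable form $p(1-p)$, with $p$ the sigmoid:

```python
import numpy as np

def phi_second(t):
    # Phi(t) = log(1 + e^t)  =>  Phi''(t) = e^t / (1 + e^t)^2 = p * (1 - p)
    p = 1.0 / (1.0 + np.exp(-t))
    return p * (1.0 - p)

print(np.isclose(phi_second(0.0), 0.25))   # maximum curvature at t = 0
print(phi_second(20.0) < 1e-8)             # curvature vanishes for large |t|
print(np.isclose(phi_second(5.0), phi_second(-5.0)))  # symmetric in t
```

This is why the boundedness of $\Omega$ matters: on a bounded set, $\Phi''(\langle w,x_i\rangle)$ stays bounded away from zero for typical data, preserving the curvature needed for RSC.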

### 3.2 Non-convex F(w)

In the non-convex case, we assume the following RSC condition:

$$\Delta f(w_1,w_2)\ge\frac{\kappa}{2}\|w_1-w_2\|_2^2-\tau\|w_1-w_2\|_1^2$$

with $\tau=c\frac{\log d}{n}$ for some constant $c$. We again define the potentials $A_t$, $B_t$, and $C_t$ in the same way as in the convex case. The main difference is that now $\tilde g(w)=\frac{1}{2}\|w\|_2^2+\frac{\lambda}{\tilde\lambda}d_\lambda(w)$ and the effective RSC parameter is different. The necessary notation for presenting the theorem is listed below:

• $w^*$ is the unknown true parameter, which is assumed to be $s$-sparse. The conjugate function is $\tilde g^*(v)=\max_{w\in\Omega}\left(\langle v,w\rangle-\tilde g(w)\right)$, where $\Omega$ is the feasible set. Note that $\tilde g$ is convex due to the convexity of $d_{\lambda,\mu}(w)+\frac{\mu}{2}\|w\|_2^2$.

• $\hat w$ is the global optimum of Problem (3); we assume it is in the interior of the feasible set w.l.o.g.

• $(\hat w,\hat\alpha)$ is an optimal solution pair satisfying the first-order optimality condition.