SAGA and Restricted Strong Convexity

SAGA is a fast incremental gradient method for the finite sum problem, and its effectiveness has been demonstrated in a vast array of applications. In this paper, we analyze SAGA on a class of non-strongly convex and non-convex statistical problems such as Lasso, group Lasso, logistic regression with ℓ_1 regularization, linear regression with SCAD regularization, and corrected Lasso. We prove that SAGA enjoys a linear convergence rate up to the statistical estimation accuracy, under the assumption of restricted strong convexity (RSC). This significantly extends the applicability of SAGA in convex and non-convex optimization.


1 Introduction

We study the finite sum problem in the following forms:

• Convex:

 min_{ψ(θ)≤ρ} G(θ) ≜ f(θ) + λψ(θ) = (1/n)∑_{i=1}^n f_i(θ) + λψ(θ), (1)

where f(θ) is a convex loss, ψ is a norm, and ρ is some predefined radius. We denote the dual norm of ψ by ψ* and assume that each f_i is smooth.

• Non-convex:

 min_{g_λ(θ)≤ρ} G(θ) ≜ f(θ) + g_{λ,μ}(θ) = (1/n)∑_{i=1}^n f_i(θ) + g_{λ,μ}(θ), (2)

where f is convex and smooth, and g_{λ,μ} is a non-convex regularizer closely related to g_λ; we defer the formal definitions to Section 2.3.

Such a finite sum structure is common in machine learning problems, particularly in the empirical risk minimization (ERM) setting. To solve the above problem, the standard proximal full gradient (FG) method updates iteratively by

 θ^{k+1} = prox_{γλψ}(θ^k − γ∇f(θ^k)).

It is well known that FG enjoys fast linear convergence under smoothness and strong convexity assumptions. However, this result may be less appealing when n is large, since the cost of computing the full gradient scales with n. The stochastic gradient (SG) method remedies this issue but only possesses a sub-linear convergence rate.
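As a concrete illustration, one Prox-FG step on the Lasso instance of (1) can be sketched as follows, where the proximal operator of the ℓ₁ norm is coordinatewise soft-thresholding; the step size, penalty level, and data below are illustrative choices, not the paper's settings.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (coordinatewise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_fg_step(theta, X, y, lam, gamma):
    """One Prox-FG step on the Lasso objective
    (1/2n)||y - X theta||_2^2 + lam*||theta||_1."""
    n = X.shape[0]
    grad = X.T @ (X @ theta - y) / n   # full gradient: cost scales with n
    return soft_threshold(theta - gamma * grad, gamma * lam)

# toy run: with a small enough step size the objective is non-increasing
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
y = rng.standard_normal(50)
obj = lambda th: 0.5 * np.mean((y - X @ th) ** 2) + 0.1 * np.abs(th).sum()
theta, vals = np.zeros(20), []
for _ in range(200):
    theta = prox_fg_step(theta, X, y, lam=0.1, gamma=0.01)
    vals.append(obj(theta))
```

Each step touches all n rows of X, which is exactly the per-iteration cost the stochastic methods below avoid.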

Recently, a set of stochastic algorithms including SVRG Johnson and Zhang (2013); Xiao and Zhang (2014), SAGA Defazio et al. (2014), SAG Schmidt et al. (2013), SDCA Shalev-Shwartz and Zhang (2014) and many others Harikandeh et al. (2015); Qu et al. (2015); Zhang and Lin (2015) have been proposed to exploit the finite sum structure; they enjoy linear convergence under smoothness and strong convexity assumptions on f. We study SAGA in this paper. From a high level, SAGA is a midpoint between SAG and SVRG; see the discussion in Defazio et al. (2014) for more details. Different from SVRG, it is a fully incremental gradient method. Compared with SAG, it uses an unbiased estimator of the gradient, which results in an easier proof, among other things. In fact, to the best of our knowledge, the analysis of SAG has not yet been extended to a proximal-operator version.

A second trendy topic in optimization and statistical estimation is the study of non-convex problems, due to a vast array of applications such as SCAD Fan and Li (2001), MCP Zhang and Zhang (2012), robust regression (corrected Lasso Loh and Wainwright (2011)) and deep learning Goodfellow et al. (2016). Some previous works have established fast convergence for batch gradient methods without assuming strong convexity or even convexity: Xiao and Zhang (2013) proposed a homotopy method to solve Lasso under the RIP condition. Agarwal et al. (2010) analyzed the convergence rate of the batch composite gradient method on several models, such as Lasso, logistic regression with ℓ_1 regularization, and noisy matrix decomposition, and showed that the convergence is linear under mild conditions on the solution (sparse or low rank). Loh and Wainwright (2011, 2013) extended the above work to the non-convex case.

These two lines of research motivate this work: we investigate whether SAGA enjoys a linear convergence rate without strong convexity, or even for non-convex problems. Specifically, we prove that under the restricted strong convexity (RSC) assumption, SAGA converges linearly up to the fundamental statistical precision of the model, which covers the five statistical models mentioned above, among others. At a high level, our result is a stochastic counterpart of the work in Loh and Wainwright (2013), albeit with more involved analysis due to the stochastic nature of SAGA.

We list some notable non-strongly convex and non-convex problems in the following. Indeed, our work proves that SAGA converges linearly for all these models. Note that the first three belong to the non-strongly convex category (particularly in the high-dimensional regime) and the last two are non-convex.

1. Lasso: f_i(θ) = ½(y_i − ⟨x_i, θ⟩)² and ψ(θ) = ∥θ∥₁.

2. Group Lasso: f_i(θ) = ½(y_i − ⟨x_i, θ⟩)² and ψ(θ) = ∑_i ∥θ_{G_i}∥₂.

3. Logistic regression with ℓ₁ regularization: f_i(θ) = log(1 + exp(−y_i⟨x_i, θ⟩)) and ψ(θ) = ∥θ∥₁.

4. Corrected Lasso Loh and Wainwright (2011): min_{∥θ∥₁≤ρ} ½θ^⊤Γ̂θ − γ̂^⊤θ + λ∥θ∥₁, where the correction matrix Σ_w entering Γ̂ is some positive definite matrix (see Section 3.2.2).

5. Regression with SCAD regularizer Fan and Li (2001): the least-squares loss with the non-convex SCAD penalty defined in Section 2.3.

Very recently, Qu et al. (2016) explored a similar idea, the restricted strong convexity (RSC) condition Negahban et al. (2009), for SVRG and proved that under this condition a class of ERM problems converges linearly even without the strong convexity, or even the convexity, assumption. From a high-level perspective, our work can be thought of as being of similar spirit, but for the SAGA algorithm. We believe analyzing the SAGA algorithm is indeed important, as SAGA enjoys certain advantages compared to SVRG. As discussed above, SVRG is not a completely incremental algorithm, since it needs to calculate the full gradient in every epoch, while SAGA avoids the computation of the full gradient by keeping a table of gradients. Moreover, although in general SAGA costs O(np) storage (which is inferior to SVRG), in many scenarios the storage requirement can be reduced to O(n). For example, many loss functions take the form f_i(θ) = ℓ(⟨x_i, θ⟩) for a data vector x_i, and since x_i is a constant we just need to store the scalar ℓ′(⟨x_i, θ⟩) for each i rather than the full gradient. When this scenario is possible, SAGA can perform similarly to, or even better than, SVRG. In addition, SVRG has an extra parameter to tune besides the step size, namely the number of iterations per inner loop. To conclude, both SVRG and SAGA can be more suitable for some problems, and hence it is useful to understand the performance of SAGA for non-strongly convex or non-convex setups. Finally, the proof steps are very different: in particular, we define a Lyapunov function for SAGA and prove it converges geometrically until the optimality gap achieves the statistical tolerance, while Qu et al. (2016) directly analyze the evolution of the optimality gap.
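The O(n) storage trick can be sketched for the least-squares loss, where ∇f_i(θ) = (⟨x_i, θ⟩ − y_i)x_i and only the scalar residual per sample is stored. This is an illustrative sketch under those assumptions, not the paper's Algorithm 1; the step size and penalty level are arbitrary choices.

```python
import numpy as np

def saga_least_squares(X, y, lam, gamma, iters, seed=0):
    """Prox-SAGA sketch for f_i(theta) = 0.5*(<x_i,theta> - y_i)^2 with an
    l1 penalty. Since nabla f_i = (<x_i,theta> - y_i) x_i, only the scalar
    residual per sample is stored: O(n) extra memory instead of O(np)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    alpha = X @ theta - y          # stored scalars l'(<x_i, theta>)
    avg_grad = X.T @ alpha / n     # table average (1/n) sum_i alpha_i x_i
    for _ in range(iters):
        i = rng.integers(n)
        new_alpha = X[i] @ theta - y[i]
        g = (new_alpha - alpha[i]) * X[i] + avg_grad   # unbiased estimate
        avg_grad += (new_alpha - alpha[i]) * X[i] / n  # maintain the average
        alpha[i] = new_alpha
        z = theta - gamma * g                          # gradient step
        theta = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)  # prox
    return theta

# toy run on a well-conditioned sparse problem
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
theta_true = np.zeros(10); theta_true[:3] = 1.0
y = X @ theta_true + 0.1 * rng.standard_normal(100)
theta_hat = saga_least_squares(X, y, lam=0.01, gamma=0.01, iters=20000)
```

Note the table average is maintained incrementally in O(p) per step; only `alpha[i]` changes, so no full pass over the data is ever needed after initialization.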

1.1 Related work

There is a plethora of works on the finite sum problem, and we review those most closely related to ours. Li et al. (2016) consider SVRG in a non-convex sparse linear regression setting different from ours, where the loss is convex and the non-convexity comes from the hard-thresholding operator. We focus on non-convex regularizers such as SCAD and the corrected Lasso. In addition, we consider a unified framework for SAGA, so our work covers not only the sparse linear model but also group sparsity and other models satisfying our assumptions. Karimi et al. (2016); Reddi et al. (2016); Hajinezhad et al. (2016) proved global linear convergence of SVRG and SAGA on non-convex problems by revisiting the Polyak-Łojasiewicz inequality or equivalent ideas such as the error bound condition. We emphasize that our work looks at the problem from a different perspective. In particular, our theory asserts that the algorithm converges faster for sparser θ*, while their results are independent of the sparsity. Empirical observation seems to agree with our theorem: indeed, when θ* is dense enough, a phase transition from a linear rate to a sublinear rate occurs (also observed in Qu et al. (2016)), which agrees with the prediction of our theorem. Furthermore, their work requires the epigraph of the regularizer to be a polyhedral set, which limits its applicability; for instance, the popular group Lasso does not satisfy such an assumption. Other non-convex stochastic variance reduction works include Shalev-Shwartz (2016); Shamir (2015) and Allen-Zhu and Hazan (2016): Shalev-Shwartz (2016) considers the setting where the sum is strongly convex but each individual f_i is non-convex. Shamir (2015) discusses a projection version of non-convex SVRG and its specific application to PCA. Allen-Zhu and Hazan (2016) consider a general non-convex problem, and only achieve a sublinear convergence rate.

2 Preliminaries

2.1 Restricted Strong Convexity

As mentioned in the abstract, restricted strong convexity (RSC) is the key assumption underlying our results; we therefore define it formally. We say a function f satisfies RSC w.r.t. a norm ψ with parameters (σ, τ_σ) if the following holds:

 f(θ₂) − f(θ₁) − ⟨∇f(θ₁), θ₂ − θ₁⟩ ≥ (σ/2)∥θ₂ − θ₁∥₂² − τ_σ ψ²(θ₂ − θ₁). (3)

We remark that we assume f satisfies RSC, rather than the individual loss functions f_i; indeed, the f_i typically do not satisfy RSC in practice. Note that when f is strongly convex, (3) obviously holds with τ_σ = 0. For more discussion of RSC, we refer the reader to Negahban et al. (2009).
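To make the definition concrete, the following sketch numerically checks inequality (3) for the least-squares loss, whose left-hand side equals ∥XΔ∥²/(2n), over random sparse directions. The constants σ and τ_σ here are illustrative choices, not the sharp ones from the analysis.

```python
import numpy as np

# Illustrative check of the RSC inequality (3) for the least-squares loss
# f(theta) = ||X theta - y||^2 / (2n): its Bregman gap at direction d
# equals ||X d||^2 / (2n), independent of y.
rng = np.random.default_rng(1)
n, p = 200, 500
X = rng.standard_normal((n, p))
sigma, tau = 0.25, np.log(p) / n     # illustrative constants

ok = True
for _ in range(100):
    d = np.zeros(p)
    support = rng.choice(p, size=10, replace=False)   # sparse direction
    d[support] = rng.standard_normal(10)
    lhs = np.linalg.norm(X @ d) ** 2 / (2 * n)
    rhs = 0.5 * sigma * np.linalg.norm(d) ** 2 - tau * np.abs(d).sum() ** 2
    ok = ok and (lhs >= rhs)
```

For isotropic Gaussian rows, lhs concentrates around ∥d∥²/2, while the ℓ₁ correction term makes the right-hand side small (or negative) for sparse d, which is exactly why RSC is a useful surrogate for strong convexity in high dimensions.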

2.2 Assumptions for the Convex regularizer Ψ(θ)

2.2.1 Decomposability of Ψ(θ)

Given a pair of subspaces M ⊆ M̄ in ℝ^p, the orthogonal complement of M̄ is

 M̄⊥ = {v ∈ ℝ^p | ⟨u, v⟩ = 0 for all u ∈ M̄}.

M is known as the model subspace, while M̄⊥ is called the perturbation subspace, representing deviations from the model subspace. A regularizer ψ is decomposable w.r.t. (M, M̄⊥) if

 ψ(θ + β) = ψ(θ) + ψ(β)

for all θ ∈ M and β ∈ M̄⊥. A concrete example is ℓ₁ regularization for sparse vectors supported on a subset S. We define the subspace pair with respect to S as M(S) = {θ ∈ ℝ^p | θ_j = 0 for all j ∉ S} and M̄(S) = M(S); the decomposability is then easy to verify. Other widely used examples include non-overlapping group norms and the nuclear norm Negahban et al. (2009). In the rest of the paper, we denote by θ_M the projection of θ onto the subspace M.

2.2.2 Subspace compatibility

Given the regularizer ψ, the subspace compatibility is given by

 H(M̄) = sup_{θ∈M̄∖{0}} ψ(θ)/∥θ∥₂.

In other words, it is the Lipschitz constant of the regularizer restricted to M̄. For instance, in the above-mentioned sparse vector example with cardinality s, H(M̄) = √s.
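The following sketch checks decomposability of the ℓ₁ norm and the compatibility constant H(M̄) = √s numerically for the sparse-vector example; the dimension and support set chosen here are arbitrary.

```python
import numpy as np

# Decomposability of the l1 norm over (M(S), M_bar_perp(S)) and the
# subspace compatibility H(M_bar) = sqrt(s) for the sparse-vector example.
rng = np.random.default_rng(2)
p, S = 20, [1, 4, 7]                      # support set with cardinality s = 3
s = len(S)

theta = np.zeros(p); theta[S] = rng.standard_normal(s)   # theta in M(S)
beta = rng.standard_normal(p); beta[S] = 0.0             # beta in M_perp(S)
# decomposability: psi(theta + beta) = psi(theta) + psi(beta)
decomp_gap = abs(np.abs(theta + beta).sum()
                 - (np.abs(theta).sum() + np.abs(beta).sum()))

# compatibility: ||v||_1 / ||v||_2 for v in M_bar never exceeds sqrt(s),
# with equality at a sign vector supported on S
ratios = []
for _ in range(1000):
    v = np.zeros(p); v[S] = rng.standard_normal(s)
    ratios.append(np.abs(v).sum() / np.linalg.norm(v))
```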

2.3 Assumptions for the Nonconvex regularizer gλ,μ(θ)

In the non-convex case, we consider regularizers that are separable across coordinates, i.e., g_{λ,μ}(θ) = ∑_{j=1}^p g_{λ,μ}(θ_j). Besides separability, we make additional assumptions on the univariate function g_λ(t):

1. g_λ satisfies g_λ(0) = 0 and is symmetric around zero. That is, g_λ(t) = g_λ(−t).

2. On the nonnegative real line, g_λ is nondecreasing.

3. For t > 0, g_λ(t)/t is nonincreasing in t.

4. g_λ is differentiable at all t ≠ 0 and subdifferentiable at t = 0, with lim_{t→0⁺} g_λ′(t) = λL for a constant L.

5. g_{λ,μ}(t) ≜ g_λ(t) + (μ/2)t² is convex.

We provide two examples satisfying the above assumptions.

 (1) SCAD_{λ,ζ}(t) ≜ ⎧ λ|t|, for |t| ≤ λ; −(t² − 2ζλ|t| + λ²)/(2(ζ−1)), for λ < |t| ≤ ζλ; (ζ+1)λ²/2, for |t| > ζλ, ⎫

where ζ > 2 is a fixed parameter. It satisfies the assumptions with L = 1 and μ = 1/(ζ−1) Loh and Wainwright (2013).

 (2) MCP_{λ,b}(t) ≜ sign(t) λ ∫₀^{|t|} (1 − z/(λb))₊ dz,

where b > 0 is a fixed parameter. MCP satisfies the assumptions with L = 1 and μ = 1/b Loh and Wainwright (2013).
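A quick sanity check of these examples is to implement both penalties and verify the breakpoint values and the weak-convexity property, i.e., that g_λ(t) + (μ/2)t² is convex on a grid. The parameter values below are arbitrary illustrative choices.

```python
import numpy as np

def scad(t, lam, zeta):
    """SCAD penalty (zeta > 2), piecewise as in the text."""
    a = np.abs(t)
    if a <= lam:
        return lam * a
    if a <= zeta * lam:
        return -(a * a - 2 * zeta * lam * a + lam * lam) / (2 * (zeta - 1))
    return (zeta + 1) * lam * lam / 2

def mcp(t, lam, b):
    """MCP penalty: lam * integral_0^|t| (1 - z/(lam*b))_+ dz; the sign(t)
    factor in the displayed formula cancels, so the penalty is symmetric."""
    a = np.abs(t)
    if a <= lam * b:
        return lam * a - a * a / (2 * b)
    return lam * lam * b / 2

# weak convexity: g(t) + (mu/2) t^2 should be convex on a grid
# (mu = 1/(zeta - 1) for SCAD, mu = 1/b for MCP)
lam, zeta, b = 1.0, 3.0, 2.0
ts = np.linspace(-5.0, 5.0, 2001)
scad_conv = np.array([scad(t, lam, zeta) + t * t / (2 * (zeta - 1)) for t in ts])
mcp_conv = np.array([mcp(t, lam, b) + t * t / (2 * b) for t in ts])
```

Discrete second differences of a convex function are nonnegative, which gives a simple grid test of assumption 5.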

2.4 Implementation of the algorithm

For the convex case, we directly apply the Algorithm 1. As to the non-convex case, we essentially solve the following equivalent problem

 Minimize:gλ(θ)≤ρ(f(θ)−μ2∥θ∥22)+λgλ(θ).

We define and . To implement Algorithm 1 on non-convex , we replace and in the algorithm by and . Remark that according to the assumptions on in Section 2.3, is convex thus the proximal step is well-defined. The update rule of proximal operator on several (such as SCAD) can be found in Loh and Wainwright (2013) .
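Because g_{λ,μ} is convex and separable, the proximal step can always be evaluated coordinatewise by a one-dimensional convex minimization, even when no closed form is at hand. The following is a generic numerical sketch; the search bracket [-10, 10] and the SCAD parameters are illustrative assumptions.

```python
def prox_1d(v, gamma, g, lo=-10.0, hi=10.0, iters=200):
    """prox_{gamma g}(v) = argmin_t 0.5*(t - v)^2 + gamma*g(t) for a convex
    univariate g, computed by ternary search (valid because the objective
    is strictly convex; the bracket [lo, hi] must contain the minimizer)."""
    obj = lambda t: 0.5 * (t - v) ** 2 + gamma * g(t)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if obj(m1) <= obj(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

# example: the convex surrogate built from SCAD (lam=1, zeta=3, mu=1/(zeta-1))
lam, zeta = 1.0, 3.0
mu = 1.0 / (zeta - 1.0)

def scad_pen(t):
    a = abs(t)
    if a <= lam:
        return lam * a
    if a <= zeta * lam:
        return -(a * a - 2 * zeta * lam * a + lam * lam) / (2 * (zeta - 1))
    return (zeta + 1) * lam * lam / 2

g = lambda t: scad_pen(t) + 0.5 * mu * t * t   # convex by assumption 5
t_star = prox_1d(2.5, gamma=0.5, g=g)
```

In practice one would use the closed-form updates from Loh and Wainwright (2013); the numeric version is only meant to show that the step is well-defined.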

3 Main result

In this section, we present the main theoretical results, along with corollaries that instantiate them in several well-known statistical models.

3.1 Convex G(θ)

We first present the results for convex G(θ). In particular, we prove that a Lyapunov function converges geometrically until it achieves a certain tolerance. To this end, we first define the Lyapunov function

 T^k ≜ (1/n)∑_{i=1}^n (f_i(ϕ_i^k) − f_i(θ̂) − ⟨∇f_i(θ̂), ϕ_i^k − θ̂⟩) + (c + α)∥θ^k − θ̂∥₂² + b(G(θ^k) − G(θ̂)),

where θ̂ is the optimal solution of problem (1), and c, α, b are positive constants that will be specified later in the theorems. Notice our definition differs slightly from the one in the original SAGA paper Defazio et al. (2014): in particular, we have an additional term b(G(θ^k) − G(θ̂)) and choose different values of c and α, which helps us utilize the idea of RSC.

We list some notation used in the following theorems and corollaries.

• θ* is the unknown true parameter; θ̂ is the optimal solution of (1).

• ψ* is the dual norm of ψ.

• Modified restricted strong convexity parameter:

 σ̄ = σ − 64τ_σ H²(M̄).

• Tolerance:

 δ = 24τ_σ (8H(M̄)∥θ̂ − θ*∥₂ + 8ψ(θ*_{M⊥}))².
Theorem 1.

Assume each f_i is smooth and convex, f satisfies the RSC condition with parameters σ and τ_σ such that σ̄ > 0, θ* is feasible, and the regularizer ψ is decomposable w.r.t. (M, M̄⊥). If we choose the regularization parameter λ large enough (where c denotes a universal positive constant), then with suitable choices of the step size and of the constants κ, c, α, b, we have

 ET^k ≤ (1 − 1/κ)^k T^0,

until ET^k achieves the tolerance δ, where the expectation is with respect to the random sampling of i in the algorithm.

Some remarks are in order.

• The requirement σ̄ > 0 is easy to satisfy in some popular statistical models. Take Lasso as an example, where τ_σ scales as log p/n and H²(M̄) = r for an r-sparse model; hence when n is sufficiently large relative to r log p, we have σ̄ > 0.

• Since κ depends on σ̄, the convergence rate is indeed affected by the sparsity (in Lasso, for example), as we mentioned in the introduction. In particular, a sparser θ* leads to a larger σ̄ and a faster convergence rate.

• In some models, we can choose the subspace pair such that ψ(θ*_{M⊥}) = 0, so the tolerance simplifies to δ = 24τ_σ(8H(M̄)∥θ̂ − θ*∥₂)². In Lasso, as mentioned above, this means the tolerance is dominated by the statistical error ∥θ̂ − θ*∥₂².

• When T^k ≤ δ, using modified restricted strong convexity (Lemma 5 in the appendix), it is easy to derive a corresponding bound on ∥θ^k − θ̂∥₂².

Combining all the remarks, the theorem says the Lyapunov function decreases geometrically until it achieves the tolerance δ. This tolerance is dominated by the statistical error ∥θ̂ − θ*∥₂², and thus can be ignored from the statistical perspective.

3.1.1 Sparse linear regression

The first model we consider is Lasso, where f_i(θ) = ½(y_i − ⟨x_i, θ⟩)² and ψ(θ) = ∥θ∥₁. More concretely, we consider a model where each data point x_i is i.i.d. sampled from a zero-mean normal distribution, i.e., x_i ∼ N(0, Σ). We denote the data matrix by X and the smallest eigenvalue of Σ by σ_min(Σ). The observation is generated by y = Xθ* + ε, where ε is zero-mean sub-Gaussian noise with variance ς². We use X_j to denote the j-th column of X. Without loss of generality, we require X to be column-normalized, i.e., ∥X_j∥₂/√n ≤ 1. Here, the constant 1 is chosen arbitrarily to simplify the exposition, as we can always rescale the data.
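A generator for synthetic data following this setup might look as follows; the dimensions, sparsity level, and noise scale are illustrative defaults, and Σ is taken to be the identity for simplicity.

```python
import numpy as np

def make_lasso_data(n=200, p=500, r=10, noise=0.5, seed=0):
    """Synthetic sparse regression data matching the setup in the text:
    rows x_i ~ N(0, I), columns rescaled so ||X_j||_2 / sqrt(n) <= 1,
    and y = X theta* + eps with (sub-)Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # column normalization (the constant 1 is a convenient rescaling)
    X /= np.maximum(np.linalg.norm(X, axis=0) / np.sqrt(n), 1.0)
    theta_star = np.zeros(p)
    support = rng.choice(p, size=r, replace=False)   # r-sparse truth
    theta_star[support] = rng.standard_normal(r)
    y = X @ theta_star + noise * rng.standard_normal(n)
    return X, y, theta_star
```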

Corollary 1.

Assume θ* is the true parameter, supported on a subset with cardinality at most r, and we choose λ such that the conditions of Theorem 1 hold; then, with suitable choices of the step size and of the constants κ, c, α, b, we have

 ET^k ≤ (1 − 1/κ)^k T^0,

with high probability, until ET^k achieves the tolerance δ. Here the c_i denote universal positive constants.

We offer some discussion of this corollary.

• The choice of λ is known to play an important role in proving bounds on the statistical error of Lasso; see Negahban et al. (2009) and references therein for further details.

• The sample-size requirement guarantees the fast global convergence of the algorithm, and is similar to the requirement for its batch counterpart Agarwal et al. (2010).

• When r log p/n is small and n is large, which is necessary for the statistical consistency of Lasso, we obtain σ̄ > 0, which guarantees the existence of κ. Under this condition, the tolerance is dominated by the statistical error ∥θ̂ − θ*∥₂².

3.1.2 Group Sparse model

The group sparsity model aims to find a regressor such that predefined groups of covariates are selected into or out of the model together. The most commonly used regularization to encourage group sparsity is the ℓ₁/ℓ_q group norm. Formally, we are given a collection of disjoint groups of features G₁, ..., G_{N_G} with ∪_i G_i = {1, ..., p}. The regularization term is ∥θ∥_{G,q} = ∑_{i=1}^{N_G} ∥θ_{G_i}∥_q. When q = 2, it reduces to the popular group Lasso Yuan and Lin (2006), while another widely used case is q = ∞ Turlach et al. (2005); Quattoni et al. (2009).
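For concreteness, the group norm and the proximal operator of t·∥·∥_{G,2} (blockwise soft-thresholding, which is what proximal algorithms apply for group Lasso) can be sketched as follows; the groups and values are illustrative.

```python
import numpy as np

def group_norm(theta, groups, q=2):
    """||theta||_{G,q} = sum_i ||theta_{G_i}||_q over disjoint groups."""
    return sum(np.linalg.norm(theta[g], ord=q) for g in groups)

def group_soft_threshold(v, groups, t):
    """Prox of t*||.||_{G,2}: blockwise shrinkage that zeros small groups."""
    out = v.copy()
    for g in groups:
        nrm = np.linalg.norm(v[g])
        out[g] = 0.0 if nrm <= t else (1 - t / nrm) * v[g]
    return out

groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]
v = np.array([3.0, 0.0, 0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0])
th = group_soft_threshold(v, groups, t=1.0)   # second group is zeroed out
```

The whole second group is eliminated at once, which is the "in or out together" selection behavior described above.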

We now define the subspace pair in the group sparsity model. For a subset S_G ⊆ {1, ..., N_G} with cardinality s_G, we define the subspace

 M(S_G) = {θ | θ_{G_i} = 0 for all i ∉ S_G},

and M̄(S_G) = M(S_G). The orthogonal complement is

 M̄⊥(S_G) = {θ | θ_{G_i} = 0 for all i ∈ S_G}.

We can easily verify that

 ∥α + β∥_{G,q} = ∥α∥_{G,q} + ∥β∥_{G,q},

for any α ∈ M(S_G) and β ∈ M̄⊥(S_G).

We mainly focus on the case q = 2, i.e., group Lasso. We require the following condition, which generalizes the column normalization condition of the Lasso case: given a group G_i of size m, the associated operator norm satisfies

 |||X_{G_i}|||_{q→2}/√n ≤ 1 for all i = 1, 2, ..., N_G.

This condition reduces to the column normalization condition when each group contains only one feature (i.e., Lasso).

In the following corollary, we use q = 2, i.e., group Lasso, as an example. We assume the observation is generated by y = Xθ* + ε, where ε is zero-mean sub-Gaussian noise with variance ς².

Corollary 2.

(Group Lasso) Assume each group has m parameters, i.e., |G_i| = m. Denote the cardinality of the set of non-zero groups by s_G, and choose the parameter λ such that

 λ ≥ max(4ς(√(m/n) + √(log N_G/n)), c₁ρσ²(Σ)(√(m/n) + √(3 log N_G/n))²);

then, with suitable choices of the step size and of the constants κ, c, α, b, we have

 ET^k ≤ (1 − 1/κ)^k T^0

with high probability, until ET^k achieves the tolerance δ, where the relevant constants depend only on Σ and ς, and the c_i are universal positive constants.

We offer some discussion to put the above corollary into context.

• To satisfy the requirement on σ̄, it suffices to have n sufficiently large relative to s_G(m + log N_G). This is also the condition that guarantees the statistical consistency of group Lasso Negahban et al. (2009).

• m and N_G affect the speed of convergence; in particular, smaller m and N_G lead to a larger σ̄ and thus faster convergence.

• The requirement on λ is similar to that for the batch gradient method in Agarwal et al. (2010).

3.2 Non-convex G(θ)

The definition of the Lyapunov function in the non-convex case is the same as in the convex one, i.e.,

 T^k ≜ (1/n)∑_{i=1}^n (f_i(ϕ_i^k) − f_i(θ̂) − ⟨∇f_i(θ̂), ϕ_i^k − θ̂⟩) + (c + α)∥θ^k − θ̂∥₂² + b(G(θ^k) − G(θ̂)).

Note that θ̂ is the global optimum of problem (2) and each f_i is convex, thus T^k is always nonnegative. In the non-convex case, we require f to satisfy the RSC condition with parameters σ and τ, where τ is some positive constant.

We list some notation used in the following theorem and its corollaries.

• θ̂ is the global optimum of problem (2), and θ* is the unknown true parameter with cardinality (number of non-zeros) r.

• Modified restricted strong convexity parameter:

 σ̄ = σ − 64rτ (log p)/n − μ.

Recall that μ is defined in Section 2.3 and represents the degree of non-convexity.

• Tolerance δ, whose definition involves a universal positive constant.

Theorem 2.

Suppose θ* is r-sparse, θ̂ is the global optimum of Problem (2), each f_i is L-smooth and convex, f satisfies the RSC condition with σ̄ > 0, g_λ satisfies the assumptions in Section 2.3, and λ is chosen large enough (where c is some positive constant); then, with suitable choices of the step size and of the constants κ, c, α, b, we have

 ET^k ≤ (1 − 1/κ)^k T^0,

until ET^k achieves the tolerance δ, where the expectation is with respect to the random sampling of i in the algorithm.

• Notice that we require σ̄ > 0, that is, μ < σ − 64rτ(log p)/n. Thus, to satisfy this requirement, the non-convexity parameter μ cannot be large.

• The tolerance δ is dominated by the statistical error ∥θ̂ − θ*∥₂² when the model is sparse (r is small) and n is large.

• When T^k ≤ δ, using the modified restricted strong convexity for non-convex G (Lemma 10 in the appendix), we obtain a corresponding bound on ∥θ^k − θ̂∥₂².

• The requirement on λ is similar to that for the batch gradient algorithm Loh and Wainwright (2013).

Again, the theorem says the Lyapunov function decreases geometrically until it achieves the tolerance δ, and this tolerance can be ignored from the statistical perspective.

3.2.1 Linear regression with SCAD regularization

The first non-convex model we consider is linear regression with SCAD regularization. The loss function is f_i(θ) = ½(y_i − ⟨x_i, θ⟩)², and the regularizer is SCAD with parameters λ and ζ. The data are generated in a similar way to the Lasso case.

Corollary 3.

(Linear regression with SCAD regularization) Suppose θ* is the true parameter, supported on a subset with cardinality at most r, θ̂ is the global optimum, and we choose λ appropriately; then, with suitable choices of the step size and of the constants κ, c, α, b, we have

 ET^k ≤ (1 − 1/κ)^k T^0,

with high probability, until ET^k achieves the tolerance δ. Here the c_i are universal positive constants.

We remark that to satisfy the requirement σ̄ > 0, we need the non-convexity parameter μ to be small, the model to be sparse (r is small), and the number of samples to be large.

3.2.2 Linear regression with noisy covariates

The corrected Lasso was proposed by Loh and Wainwright (2011). Suppose data are generated according to the linear model y_i = ⟨x_i, θ*⟩ + ε_i, where ε_i is random zero-mean sub-Gaussian noise with variance ς². The observations of x_i are corrupted by additive noise; in particular, we observe z_i = x_i + w_i, where w_i is a random vector independent of x_i, with zero mean and known covariance matrix Σ_w. Define Γ̂ ≜ Z^⊤Z/n − Σ_w and γ̂ ≜ Z^⊤y/n. Our goal is to estimate θ* based on y and Z (but not X, which is not observable), and the corrected Lasso proposes to solve the following:

 θ̂ ∈ argmin_{∥θ∥₁≤ρ} ½θ^⊤Γ̂θ − γ̂^⊤θ + λ∥θ∥₁.

Equivalently, it solves

 min_{∥θ∥₁≤ρ} (1/2n)∑_{i=1}^n (y_i − θ^⊤z_i)² − ½θ^⊤Σ_w θ + λ∥θ∥₁.

Notice that due to the term −½θ^⊤Σ_w θ, the optimization problem is non-convex.
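The two formulations agree up to the constant ∥y∥²/(2n). The sketch below builds Γ̂ and γ̂ from simulated noisy covariates and checks this; the dimensions and noise level are illustrative choices.

```python
import numpy as np

# Corrected-Lasso surrogates from noisy covariates Z = X + W:
# Gamma_hat = Z^T Z / n - Sigma_w, gamma_hat = Z^T y / n.
rng = np.random.default_rng(3)
n, p = 300, 20
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:3] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)
sigma_w = 0.5
Z = X + sigma_w * rng.standard_normal((n, p))   # observed noisy covariates
Sigma_w = sigma_w ** 2 * np.eye(p)              # known noise covariance

Gamma_hat = Z.T @ Z / n - Sigma_w
gamma_hat = Z.T @ y / n

def quad_form(theta):
    """Corrected-Lasso smooth part: 0.5 theta^T Gamma_hat theta - gamma_hat^T theta."""
    return 0.5 * theta @ Gamma_hat @ theta - gamma_hat @ theta

def residual_form(theta):
    """(1/2n)||y - Z theta||^2 - 0.5 theta^T Sigma_w theta, up to a constant."""
    return 0.5 * np.mean((y - Z @ theta) ** 2) - 0.5 * theta @ Sigma_w @ theta

theta = rng.standard_normal(p)
const = 0.5 * np.mean(y ** 2)   # the two forms differ by ||y||^2 / (2n)
```

Subtracting Σ_w removes the bias E[Z^⊤Z/n] − Σ = Σ_w, but it can also push Γ̂ away from positive semidefiniteness, which is the source of the non-convexity noted above.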

3.3 Corrected Lasso

We consider a model where each data point x_i is i.i.d. sampled from a zero-mean normal distribution, i.e., x_i ∼ N(0, Σ). We denote the data matrix by X, the smallest eigenvalue of Σ by σ_min(Σ), and the largest eigenvalue by σ_max(Σ). We observe Z, which is corrupted by additive noise, i.e., Z = X + W, where each row w_i is a random vector independent of x_i, with zero mean and known covariance matrix Σ_w.

Corollary 4.

(Corrected Lasso) Suppose we are given i.i.d. observations from the linear model with additive noise, θ* is r-sparse, and λ is chosen appropriately. Let θ̂ be the global optimum. Then, with suitable choices of the step size and of the constants κ, c, α, b, we have

 ET^k ≤ (1 − 1/κ)^k T^0,

with high probability, until ET^k achieves the tolerance δ, where the c_i are universal positive constants.

Some remarks are listed below.

• The result can easily be extended to more general settings.

• To satisfy the requirement σ̄ > 0, we need

 γ ≤ (1/4)(½σ_min(Σ) − c₁σ_min(Σ) max{((σ_max(Σ) + γ_w)/σ_min(Σ))², 1} (r log p)/n).

A similar requirement is needed for the batch gradient method Loh and Wainwright (2013).

• The requirement on λ is similar to that of the batch gradient method Loh and Wainwright (2013).

3.4 Extension to Generalized linear model

The results on Lasso and group Lasso readily extend to generalized linear models, where we consider the model

 θ̂ = argmin_{θ∈Ω′} {(1/n)∑_{i=1}^n (Φ(⟨θ, x_i⟩) − y_i⟨θ, x_i⟩) + λ∥θ∥₁},

where Ω′ is a bounded feasible set whose radius is a universal constant Loh and Wainwright (2013). This requirement is essential: for instance, for the logistic function, the Hessian Φ″ approaches zero as its argument diverges. Notice that when Φ(u) = u²/2, the problem reduces to Lasso. The RSC condition admits the form

 (1/n)∑_{i=1}^n Φ″(⟨θ_t, x_i⟩)(⟨x_i, θ − θ′⟩)² ≥ (σ/2)∥θ − θ′∥₂² − τ_σ∥θ − θ′∥₁², for all θ, θ′ ∈ Ω′.

For a broad class of log-linear models, the RSC condition holds with τ_σ on the order of log p/n. Therefore, we obtain the same results as those for Lasso, modulo changes of constants. For more details on RSC conditions in generalized linear models, we refer the readers to Negahban et al. (2009).
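For the logistic case, Φ and its second derivative can be written down directly; the sketch below illustrates why the side constraint on Ω′ is needed, namely that Φ″ vanishes as its argument grows. The data in the usage check are synthetic.

```python
import numpy as np

def Phi(u):
    """Log-partition of the logistic model, Phi(u) = log(1 + exp(u)),
    computed in a numerically stable way."""
    return np.logaddexp(0.0, u)

def Phi2(u):
    """Phi''(u) = sigmoid(u) * (1 - sigmoid(u)); it vanishes as |u| grows,
    so curvature is only available on a bounded set Omega'."""
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)

def glm_objective(theta, X, y, lam):
    """(1/n) sum_i [Phi(<theta, x_i>) - y_i <theta, x_i>] + lam*||theta||_1."""
    u = X @ theta
    return np.mean(Phi(u) - y * u) + lam * np.abs(theta).sum()
```

At θ = 0 the data-fitting term equals Φ(0) = log 2 regardless of the data, which makes a convenient sanity check when wiring up such an objective.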

4 Empirical Result

We report experimental results in this section to validate our theory that SAGA enjoys a linear convergence rate without strong convexity, or even without convexity. We ran experiments on both synthetic and real datasets and compared SAGA with several candidate algorithms. The experimental setup is similar to Qu et al. (2016). Due to space constraints, some additional simulation results are presented in the appendix. The algorithms tested are Prox-SVRG Xiao and Zhang (2014), Prox-SAG (a proximal version of the algorithm in Schmidt et al. (2013)), proximal stochastic gradient (Prox-SGD), the regularized dual averaging method (RDA) Xiao (2010), and the proximal full gradient method (Prox-GD) Nesterov (2013). For the algorithms with a constant learning rate (i.e., SAGA, Prox-SAG, Prox-SVRG, Prox-GD), we tune the learning rate over an exponential grid and choose the one with the best performance. Below are some remarks on the candidate algorithms.

• The linear convergence of SVRG in our setting has been proved in Qu et al. (2016).

• We adapt SAG to a proximal version. To the best of our knowledge, the convergence of Prox-SAG has not been established, although it works well in our experiments.

• The step size in Prox-SGD is