SAGA and Restricted Strong Convexity

SAGA is a fast incremental gradient method for the finite sum problem, and its effectiveness has been demonstrated in a vast array of applications. In this paper, we analyze SAGA on a class of non-strongly convex and non-convex statistical problems such as Lasso, group Lasso, logistic regression with ℓ_1 regularization, linear regression with SCAD regularization, and corrected Lasso. We prove that SAGA enjoys a linear convergence rate up to the statistical estimation accuracy, under the assumption of restricted strong convexity (RSC). This significantly extends the applicability of SAGA in convex and non-convex optimization.


1 Introduction

We study the finite sum problem in the following forms:

• Convex:

 min_{ψ(θ)≤ρ} G(θ) ≜ f(θ) + λψ(θ) = (1/n)∑_{i=1}^n f_i(θ) + λψ(θ), (1)

where f(θ) is a convex loss, ψ is a norm, and ρ is some predefined radius. We denote the dual norm of ψ by ψ* and assume that each f_i is smooth.

• Non-convex:

 min_{g_λ(θ)≤ρ} G(θ) ≜ f(θ) + g_{λ,μ}(θ) = (1/n)∑_{i=1}^n f_i(θ) + g_{λ,μ}(θ), (2)

where f is convex and smooth, and g_{λ,μ} is a non-convex regularizer closely related to g_λ; we defer the formal definitions to Section 2.3.

Such a finite sum structure is common in machine learning problems, particularly in the empirical risk minimization (ERM) setting. To solve the above problem, the standard proximal full gradient (FG) method updates iteratively by

 θ^{k+1} = prox_{γλψ}(θ^k − γ∇f(θ^k)).

It is well known that FG enjoys fast linear convergence under smoothness and strong convexity assumptions. However, this result may be less appealing when n is large, since the cost of computing the full gradient scales with n. The stochastic gradient (SG) method remedies this issue but only possesses a sub-linear convergence rate.
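As a concrete illustration, one Prox-FG step on the Lasso instance of (1) can be sketched as follows, where the proximal operator of the ℓ₁ norm is coordinatewise soft-thresholding; the step size, penalty level, and data below are illustrative choices, not the paper's settings.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1 (coordinatewise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_fg_step(theta, X, y, lam, gamma):
    """One Prox-FG step on the Lasso objective
    (1/2n)||y - X theta||_2^2 + lam*||theta||_1."""
    n = X.shape[0]
    grad = X.T @ (X @ theta - y) / n   # full gradient: cost scales with n
    return soft_threshold(theta - gamma * grad, gamma * lam)

# toy run: with a small enough step size the objective is non-increasing
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
y = rng.standard_normal(50)
obj = lambda th: 0.5 * np.mean((y - X @ th) ** 2) + 0.1 * np.abs(th).sum()
theta, vals = np.zeros(20), []
for _ in range(200):
    theta = prox_fg_step(theta, X, y, lam=0.1, gamma=0.01)
    vals.append(obj(theta))
```

Each step touches all n rows of X, which is exactly the per-iteration cost the stochastic methods below avoid.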

Recently, a set of stochastic algorithms including SVRG Johnson and Zhang (2013); Xiao and Zhang (2014), SAGA Defazio et al. (2014), SAG Schmidt et al. (2013), SDCA Shalev-Shwartz and Zhang (2014) and many others Harikandeh et al. (2015); Qu et al. (2015); Zhang and Lin (2015) have been proposed to exploit the finite sum structure; they enjoy linear convergence under smoothness and strong convexity assumptions on f. We study SAGA in this paper. From a high level, SAGA is a midpoint between SAG and SVRG; see the discussion in Defazio et al. (2014) for more details. Different from SVRG, it is a fully incremental gradient method. Compared with SAG, it uses an unbiased estimator of the gradient, which results in an easier proof, among other things. In fact, to the best of our knowledge, the analysis of SAG has not yet been extended to a proximal-operator version.

A second trendy topic in optimization and statistical estimation is the study of non-convex problems, due to a vast array of applications such as SCAD Fan and Li (2001), MCP Zhang and Zhang (2012), robust regression (corrected Lasso Loh and Wainwright (2011)) and deep learning Goodfellow et al. (2016). Some previous works have established fast convergence for batch gradient methods without assuming strong convexity or even convexity: Xiao and Zhang (2013) proposed a homotopy method to solve Lasso under the RIP condition. Agarwal et al. (2010) analyzed the convergence rate of the batch composite gradient method on several models, such as Lasso, logistic regression with ℓ_1 regularization, and noisy matrix decomposition, and showed that the convergence is linear under mild conditions on the solution (sparse or low rank). Loh and Wainwright (2011, 2013) extended the above work to the non-convex case.

These two lines of research motivate this work: we investigate whether SAGA enjoys a linear convergence rate without strong convexity, or even for non-convex problems. Specifically, we prove that under the restricted strong convexity (RSC) assumption, SAGA converges linearly up to the fundamental statistical precision of the model, which covers the five statistical models mentioned above, among others. At a high level, our result is a stochastic counterpart of the work in Loh and Wainwright (2013), albeit with more involved analysis due to the stochastic nature of SAGA.

We list some notable non-strongly convex and non-convex problems in the following. Indeed, our work proves that SAGA converges linearly for all these models. Note that the first three belong to the non-strongly convex category (particularly in the high-dimensional regime) and the last two are non-convex.

1. Lasso: f_i(θ) = ½(y_i − ⟨x_i, θ⟩)² and ψ(θ) = ∥θ∥₁.

2. Group Lasso: f_i(θ) = ½(y_i − ⟨x_i, θ⟩)² and ψ(θ) = ∑_i ∥θ_{G_i}∥₂.

3. Logistic regression with ℓ₁ regularization: f_i(θ) = log(1 + exp(−y_i⟨x_i, θ⟩)) and ψ(θ) = ∥θ∥₁.

4. Corrected Lasso Loh and Wainwright (2011): min_{∥θ∥₁≤ρ} ½θ^⊤Γ̂θ − γ̂^⊤θ + λ∥θ∥₁, where the correction matrix Σ_w entering Γ̂ is some positive definite matrix (see Section 3.2.2).

5. Regression with SCAD regularizer Fan and Li (2001): the least-squares loss with the non-convex SCAD penalty defined in Section 2.3.

Very recently, Qu et al. (2016) explored a similar idea, the restricted strong convexity (RSC) condition Negahban et al. (2009), for SVRG and proved that under this condition a class of ERM problems converges linearly even without the strong convexity, or even the convexity, assumption. From a high-level perspective, our work can be thought of as being of similar spirit, but for the SAGA algorithm. We believe analyzing the SAGA algorithm is indeed important, as SAGA enjoys certain advantages compared to SVRG. As discussed above, SVRG is not a completely incremental algorithm, since it needs to calculate the full gradient in every epoch, while SAGA avoids the computation of the full gradient by keeping a table of gradients. Moreover, although in general SAGA costs O(np) storage (which is inferior to SVRG), in many scenarios the storage requirement can be reduced to O(n). For example, many loss functions take the form f_i(θ) = ℓ(⟨x_i, θ⟩) for a data vector x_i, and since x_i is a constant we just need to store the scalar ℓ′(⟨x_i, θ⟩) for each i rather than the full gradient. When this scenario is possible, SAGA can perform similarly to, or even better than, SVRG. In addition, SVRG has an extra parameter to tune besides the step size, namely the number of iterations per inner loop. To conclude, both SVRG and SAGA can be more suitable for some problems, and hence it is useful to understand the performance of SAGA for non-strongly convex or non-convex setups. Finally, the proof steps are very different: in particular, we define a Lyapunov function for SAGA and prove it converges geometrically until the optimality gap achieves the statistical tolerance, while Qu et al. (2016) directly analyze the evolution of the optimality gap.
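The O(n) storage trick can be sketched for the least-squares loss, where ∇f_i(θ) = (⟨x_i, θ⟩ − y_i)x_i and only the scalar residual per sample is stored. This is an illustrative sketch under those assumptions, not the paper's Algorithm 1; the step size and penalty level are arbitrary choices.

```python
import numpy as np

def saga_least_squares(X, y, lam, gamma, iters, seed=0):
    """Prox-SAGA sketch for f_i(theta) = 0.5*(<x_i,theta> - y_i)^2 with an
    l1 penalty. Since nabla f_i = (<x_i,theta> - y_i) x_i, only the scalar
    residual per sample is stored: O(n) extra memory instead of O(np)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    alpha = X @ theta - y          # stored scalars l'(<x_i, theta>)
    avg_grad = X.T @ alpha / n     # table average (1/n) sum_i alpha_i x_i
    for _ in range(iters):
        i = rng.integers(n)
        new_alpha = X[i] @ theta - y[i]
        g = (new_alpha - alpha[i]) * X[i] + avg_grad   # unbiased estimate
        avg_grad += (new_alpha - alpha[i]) * X[i] / n  # maintain the average
        alpha[i] = new_alpha
        z = theta - gamma * g                          # gradient step
        theta = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)  # prox
    return theta

# toy run on a well-conditioned sparse problem
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
theta_true = np.zeros(10); theta_true[:3] = 1.0
y = X @ theta_true + 0.1 * rng.standard_normal(100)
theta_hat = saga_least_squares(X, y, lam=0.01, gamma=0.01, iters=20000)
```

Note the table average is maintained incrementally in O(p) per step; only `alpha[i]` changes, so no full pass over the data is ever needed after initialization.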

1.1 Related work

There is a plethora of works on the finite sum problem, and we review those most closely related to ours. Li et al. (2016) consider SVRG in a non-convex sparse linear regression setting different from ours, where the loss is convex and the non-convexity comes from the hard-thresholding operator. We focus on non-convex regularizers such as SCAD and the corrected Lasso. In addition, we consider a unified framework for SAGA, so our work covers not only the sparse linear model but also group sparsity and other models satisfying our assumptions. Karimi et al. (2016); Reddi et al. (2016); Hajinezhad et al. (2016) proved global linear convergence of SVRG and SAGA on non-convex problems by revisiting the Polyak-Łojasiewicz inequality or equivalent ideas such as the error bound condition. We emphasize that our work looks at the problem from a different perspective. In particular, our theory asserts that the algorithm converges faster for sparser θ*, while their results are independent of the sparsity. Empirical observation seems to agree with our theorem: indeed, when θ* is dense enough, a phase transition from a linear rate to a sublinear rate occurs (also observed in Qu et al. (2016)), which agrees with the prediction of our theorem. Furthermore, their work requires the epigraph of the regularizer to be a polyhedral set, which limits its applicability; for instance, the popular group Lasso does not satisfy such an assumption. Other non-convex stochastic variance reduction works include Shalev-Shwartz (2016); Shamir (2015) and Allen-Zhu and Hazan (2016): Shalev-Shwartz (2016) considers the setting where the sum is strongly convex but each individual f_i is non-convex. Shamir (2015) discusses a projection version of non-convex SVRG and its specific application to PCA. Allen-Zhu and Hazan (2016) consider a general non-convex problem, and only achieve a sublinear convergence rate.

2 Preliminaries

2.1 Restricted Strong Convexity

As mentioned in the abstract, restricted strong convexity (RSC) is the key assumption underlying our results; we therefore define it formally. We say a function f satisfies RSC w.r.t. a norm ψ with parameters (σ, τ_σ) if the following holds:

 f(θ₂) − f(θ₁) − ⟨∇f(θ₁), θ₂ − θ₁⟩ ≥ (σ/2)∥θ₂ − θ₁∥₂² − τ_σ ψ²(θ₂ − θ₁). (3)

We remark that we assume f satisfies RSC, rather than the individual loss functions f_i; indeed, the f_i typically do not satisfy RSC in practice. Note that when f is strongly convex, (3) obviously holds with τ_σ = 0. For more discussion of RSC, we refer the reader to Negahban et al. (2009).
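To make the definition concrete, the following sketch numerically checks inequality (3) for the least-squares loss, whose left-hand side equals ∥XΔ∥²/(2n), over random sparse directions. The constants σ and τ_σ here are illustrative choices, not the sharp ones from the analysis.

```python
import numpy as np

# Illustrative check of the RSC inequality (3) for the least-squares loss
# f(theta) = ||X theta - y||^2 / (2n): its Bregman gap at direction d
# equals ||X d||^2 / (2n), independent of y.
rng = np.random.default_rng(1)
n, p = 200, 500
X = rng.standard_normal((n, p))
sigma, tau = 0.25, np.log(p) / n     # illustrative constants

ok = True
for _ in range(100):
    d = np.zeros(p)
    support = rng.choice(p, size=10, replace=False)   # sparse direction
    d[support] = rng.standard_normal(10)
    lhs = np.linalg.norm(X @ d) ** 2 / (2 * n)
    rhs = 0.5 * sigma * np.linalg.norm(d) ** 2 - tau * np.abs(d).sum() ** 2
    ok = ok and (lhs >= rhs)
```

For isotropic Gaussian rows, lhs concentrates around ∥d∥²/2, while the ℓ₁ correction term makes the right-hand side small (or negative) for sparse d, which is exactly why RSC is a useful surrogate for strong convexity in high dimensions.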

2.2 Assumptions for the Convex regularizer Ψ(θ)

2.2.1 Decomposability of Ψ(θ)

Given a pair of subspaces M ⊆ M̄ in ℝ^p, the orthogonal complement of M̄ is

 M̄⊥ = {v ∈ ℝ^p | ⟨u, v⟩ = 0 for all u ∈ M̄}.

M is known as the model subspace, while M̄⊥ is called the perturbation subspace, representing deviations from the model subspace. A regularizer ψ is decomposable w.r.t. (M, M̄⊥) if

 ψ(θ + β) = ψ(θ) + ψ(β)

for all θ ∈ M and β ∈ M̄⊥. A concrete example is ℓ₁ regularization for sparse vectors supported on a subset S. We define the subspace pair with respect to S as M(S) = {θ ∈ ℝ^p | θ_j = 0 for all j ∉ S} and M̄(S) = M(S); the decomposability is then easy to verify. Other widely used examples include non-overlapping group norms and the nuclear norm Negahban et al. (2009). In the rest of the paper, we denote by θ_M the projection of θ onto the subspace M.

2.2.2 Subspace compatibility

Given the regularizer ψ, the subspace compatibility is given by

 H(M̄) = sup_{θ∈M̄∖{0}} ψ(θ)/∥θ∥₂.

In other words, it is the Lipschitz constant of the regularizer restricted to M̄. For instance, in the above-mentioned sparse vector example with cardinality s, H(M̄) = √s.
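The following sketch checks decomposability of the ℓ₁ norm and the compatibility constant H(M̄) = √s numerically for the sparse-vector example; the dimension and support set chosen here are arbitrary.

```python
import numpy as np

# Decomposability of the l1 norm over (M(S), M_bar_perp(S)) and the
# subspace compatibility H(M_bar) = sqrt(s) for the sparse-vector example.
rng = np.random.default_rng(2)
p, S = 20, [1, 4, 7]                      # support set with cardinality s = 3
s = len(S)

theta = np.zeros(p); theta[S] = rng.standard_normal(s)   # theta in M(S)
beta = rng.standard_normal(p); beta[S] = 0.0             # beta in M_perp(S)
# decomposability: psi(theta + beta) = psi(theta) + psi(beta)
decomp_gap = abs(np.abs(theta + beta).sum()
                 - (np.abs(theta).sum() + np.abs(beta).sum()))

# compatibility: ||v||_1 / ||v||_2 for v in M_bar never exceeds sqrt(s),
# with equality at a sign vector supported on S
ratios = []
for _ in range(1000):
    v = np.zeros(p); v[S] = rng.standard_normal(s)
    ratios.append(np.abs(v).sum() / np.linalg.norm(v))
```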

2.3 Assumptions for the Nonconvex regularizer gλ,μ(θ)

In the non-convex case, we consider regularizers that are separable across coordinates, i.e., g_{λ,μ}(θ) = ∑_{j=1}^p g_{λ,μ}(θ_j). Besides separability, we make additional assumptions on the univariate function g_λ(t):

1. g_λ satisfies g_λ(0) = 0 and is symmetric around zero. That is, g_λ(t) = g_λ(−t).

2. On the nonnegative real line, g_λ is nondecreasing.

3. For t > 0, g_λ(t)/t is nonincreasing in t.

4. g_λ is differentiable at all t ≠ 0 and subdifferentiable at t = 0, with lim_{t→0⁺} g_λ′(t) = λL for a constant L.

5. g_{λ,μ}(t) ≜ g_λ(t) + (μ/2)t² is convex.

We provide two examples satisfying the above assumptions.

 (1) SCAD_{λ,ζ}(t) ≜ ⎧ λ|t|, for |t| ≤ λ; −(t² − 2ζλ|t| + λ²)/(2(ζ−1)), for λ < |t| ≤ ζλ; (ζ+1)λ²/2, for |t| > ζλ, ⎫

where ζ > 2 is a fixed parameter. It satisfies the assumptions with L = 1 and μ = 1/(ζ−1) Loh and Wainwright (2013).

 (2) MCP_{λ,b}(t) ≜ sign(t) λ ∫₀^{|t|} (1 − z/(λb))₊ dz,

where b > 0 is a fixed parameter. MCP satisfies the assumptions with L = 1 and μ = 1/b Loh and Wainwright (2013).
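A quick sanity check of these examples is to implement both penalties and verify the breakpoint values and the weak-convexity property, i.e., that g_λ(t) + (μ/2)t² is convex on a grid. The parameter values below are arbitrary illustrative choices.

```python
import numpy as np

def scad(t, lam, zeta):
    """SCAD penalty (zeta > 2), piecewise as in the text."""
    a = np.abs(t)
    if a <= lam:
        return lam * a
    if a <= zeta * lam:
        return -(a * a - 2 * zeta * lam * a + lam * lam) / (2 * (zeta - 1))
    return (zeta + 1) * lam * lam / 2

def mcp(t, lam, b):
    """MCP penalty: lam * integral_0^|t| (1 - z/(lam*b))_+ dz; the sign(t)
    factor in the displayed formula cancels, so the penalty is symmetric."""
    a = np.abs(t)
    if a <= lam * b:
        return lam * a - a * a / (2 * b)
    return lam * lam * b / 2

# weak convexity: g(t) + (mu/2) t^2 should be convex on a grid
# (mu = 1/(zeta - 1) for SCAD, mu = 1/b for MCP)
lam, zeta, b = 1.0, 3.0, 2.0
ts = np.linspace(-5.0, 5.0, 2001)
scad_conv = np.array([scad(t, lam, zeta) + t * t / (2 * (zeta - 1)) for t in ts])
mcp_conv = np.array([mcp(t, lam, b) + t * t / (2 * b) for t in ts])
```

Discrete second differences of a convex function are nonnegative, which gives a simple grid test of assumption 5.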

2.4 Implementation of the algorithm

For the convex case, we directly apply the Algorithm 1. As to the non-convex case, we essentially solve the following equivalent problem

 Minimize:gλ(θ)≤ρ(f(θ)−μ2∥θ∥22)+λgλ(θ).

We define and . To implement Algorithm 1 on non-convex , we replace and in the algorithm by and . Remark that according to the assumptions on in Section 2.3, is convex thus the proximal step is well-defined. The update rule of proximal operator on several (such as SCAD) can be found in Loh and Wainwright (2013) .
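Because g_{λ,μ} is convex and separable, the proximal step can always be evaluated coordinatewise by a one-dimensional convex minimization, even when no closed form is at hand. The following is a generic numerical sketch; the search bracket [-10, 10] and the SCAD parameters are illustrative assumptions.

```python
def prox_1d(v, gamma, g, lo=-10.0, hi=10.0, iters=200):
    """prox_{gamma g}(v) = argmin_t 0.5*(t - v)^2 + gamma*g(t) for a convex
    univariate g, computed by ternary search (valid because the objective
    is strictly convex; the bracket [lo, hi] must contain the minimizer)."""
    obj = lambda t: 0.5 * (t - v) ** 2 + gamma * g(t)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if obj(m1) <= obj(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

# example: the convex surrogate built from SCAD (lam=1, zeta=3, mu=1/(zeta-1))
lam, zeta = 1.0, 3.0
mu = 1.0 / (zeta - 1.0)

def scad_pen(t):
    a = abs(t)
    if a <= lam:
        return lam * a
    if a <= zeta * lam:
        return -(a * a - 2 * zeta * lam * a + lam * lam) / (2 * (zeta - 1))
    return (zeta + 1) * lam * lam / 2

g = lambda t: scad_pen(t) + 0.5 * mu * t * t   # convex by assumption 5
t_star = prox_1d(2.5, gamma=0.5, g=g)
```

In practice one would use the closed-form updates from Loh and Wainwright (2013); the numeric version is only meant to show that the step is well-defined.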

3 Main result

In this section, we present the main theoretical results, along with corollaries that instantiate them in several well-known statistical models.

3.1 Convex G(θ)

We first present the results for convex G(θ). In particular, we prove that a Lyapunov function converges geometrically until it achieves a certain tolerance. To this end, we first define the Lyapunov function

 T^k ≜ (1/n)∑_{i=1}^n (f_i(ϕ_i^k) − f_i(θ̂) − ⟨∇f_i(θ̂), ϕ_i^k − θ̂⟩) + (c + α)∥θ^k − θ̂∥₂² + b(G(θ^k) − G(θ̂)),

where θ̂ is the optimal solution of problem (1), and c, α, b are positive constants that will be specified later in the theorems. Notice our definition differs slightly from the one in the original SAGA paper Defazio et al. (2014): in particular, we have an additional term b(G(θ^k) − G(θ̂)) and choose different values of c and α, which helps us utilize the idea of RSC.

We list some notation used in the following theorems and corollaries.

• θ* is the unknown true parameter; θ̂ is the optimal solution of (1).

• ψ* is the dual norm of ψ.

• Modified restricted strong convexity parameter:

 σ̄ = σ − 64τ_σ H²(M̄).

• Tolerance:

 δ = 24τ_σ (8H(M̄)∥θ̂ − θ*∥₂ + 8ψ(θ*_{M⊥}))².
Theorem 1.

Assume each f_i is smooth and convex, f satisfies the RSC condition with parameters σ and τ_σ such that σ̄ > 0, θ* is feasible, and the regularizer ψ is decomposable w.r.t. (M, M̄⊥). If we choose the regularization parameter λ large enough (where c denotes a universal positive constant), then with suitable choices of the step size and of the constants κ, c, α, b, we have

 ET^k ≤ (1 − 1/κ)^k T^0,

until ET^k achieves the tolerance δ, where the expectation is with respect to the random sampling of i in the algorithm.

Some remarks are in order.

• The requirement σ̄ > 0 is easy to satisfy in some popular statistical models. Take Lasso as an example, where τ_σ scales as log p/n and H²(M̄) = r for an r-sparse model; hence when n is sufficiently large relative to r log p, we have σ̄ > 0.

• Since κ depends on σ̄, the convergence rate is indeed affected by the sparsity (in Lasso, for example), as we mentioned in the introduction. In particular, a sparser θ* leads to a larger σ̄ and a faster convergence rate.

• In some models, we can choose the subspace pair such that ψ(θ*_{M⊥}) = 0, so the tolerance simplifies to δ = 24τ_σ(8H(M̄)∥θ̂ − θ*∥₂)². In Lasso, as mentioned above, this means the tolerance is dominated by the statistical error ∥θ̂ − θ*∥₂².

• When T^k ≤ δ, using modified restricted strong convexity (Lemma 5 in the appendix), it is easy to derive a corresponding bound on ∥θ^k − θ̂∥₂².

Combining all the remarks, the theorem says the Lyapunov function decreases geometrically until it achieves the tolerance δ. This tolerance is dominated by the statistical error ∥θ̂ − θ*∥₂², and thus can be ignored from the statistical perspective.

3.1.1 Sparse linear regression

The first model we consider is Lasso, where f_i(θ) = ½(y_i − ⟨x_i, θ⟩)² and ψ(θ) = ∥θ∥₁. More concretely, we consider a model where each data point x_i is i.i.d. sampled from a zero-mean normal distribution, i.e., x_i ∼ N(0, Σ). We denote the data matrix by X and the smallest eigenvalue of Σ by σ_min(Σ). The observation is generated by y = Xθ* + ε, where ε is zero-mean sub-Gaussian noise with variance ς². We use X_j to denote the j-th column of X. Without loss of generality, we require X to be column-normalized, i.e., ∥X_j∥₂/√n ≤ 1. Here, the constant 1 is chosen arbitrarily to simplify the exposition, as we can always rescale the data.
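A generator for synthetic data following this setup might look as follows; the dimensions, sparsity level, and noise scale are illustrative defaults, and Σ is taken to be the identity for simplicity.

```python
import numpy as np

def make_lasso_data(n=200, p=500, r=10, noise=0.5, seed=0):
    """Synthetic sparse regression data matching the setup in the text:
    rows x_i ~ N(0, I), columns rescaled so ||X_j||_2 / sqrt(n) <= 1,
    and y = X theta* + eps with (sub-)Gaussian noise."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # column normalization (the constant 1 is a convenient rescaling)
    X /= np.maximum(np.linalg.norm(X, axis=0) / np.sqrt(n), 1.0)
    theta_star = np.zeros(p)
    support = rng.choice(p, size=r, replace=False)   # r-sparse truth
    theta_star[support] = rng.standard_normal(r)
    y = X @ theta_star + noise * rng.standard_normal(n)
    return X, y, theta_star
```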

Corollary 1.

Assume θ* is the true parameter, supported on a subset with cardinality at most r, and we choose λ such that the conditions of Theorem 1 hold; then, with suitable choices of the step size and of the constants κ, c, α, b, we have

 ET^k ≤ (1 − 1/κ)^k T^0,

with high probability, until ET^k achieves the tolerance δ. Here the c_i denote universal positive constants.

We offer some discussion of this corollary.

• The choice of λ is known to play an important role in proving bounds on the statistical error of Lasso; see Negahban et al. (2009) and references therein for further details.

• The sample-size requirement guarantees the fast global convergence of the algorithm, and is similar to the requirement for its batch counterpart Agarwal et al. (2010).

• When r log p/n is small and n is large, which is necessary for the statistical consistency of Lasso, we obtain σ̄ > 0, which guarantees the existence of κ. Under this condition, the tolerance is dominated by the statistical error ∥θ̂ − θ*∥₂².

3.1.2 Group Sparse model

The group sparsity model aims to find a regressor such that predefined groups of covariates are selected into or out of the model together. The most commonly used regularization to encourage group sparsity is the ℓ₁/ℓ_q group norm. Formally, we are given a collection of disjoint groups of features G₁, ..., G_{N_G} with ∪_i G_i = {1, ..., p}. The regularization term is ∥θ∥_{G,q} = ∑_{i=1}^{N_G} ∥θ_{G_i}∥_q. When q = 2, it reduces to the popular group Lasso Yuan and Lin (2006), while another widely used case is q = ∞ Turlach et al. (2005); Quattoni et al. (2009).
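For concreteness, the group norm and the proximal operator of t·∥·∥_{G,2} (blockwise soft-thresholding, which is what proximal algorithms apply for group Lasso) can be sketched as follows; the groups and values are illustrative.

```python
import numpy as np

def group_norm(theta, groups, q=2):
    """||theta||_{G,q} = sum_i ||theta_{G_i}||_q over disjoint groups."""
    return sum(np.linalg.norm(theta[g], ord=q) for g in groups)

def group_soft_threshold(v, groups, t):
    """Prox of t*||.||_{G,2}: blockwise shrinkage that zeros small groups."""
    out = v.copy()
    for g in groups:
        nrm = np.linalg.norm(v[g])
        out[g] = 0.0 if nrm <= t else (1 - t / nrm) * v[g]
    return out

groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]
v = np.array([3.0, 0.0, 0.0, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0])
th = group_soft_threshold(v, groups, t=1.0)   # second group is zeroed out
```

The whole second group is eliminated at once, which is the "in or out together" selection behavior described above.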

We now define the subspace pair in the group sparsity model. For a subset S_G ⊆ {1, ..., N_G} with cardinality s_G, we define the subspace

 M(S_G) = {θ | θ_{G_i} = 0 for all i ∉ S_G},

and M̄(S_G) = M(S_G). The orthogonal complement is

 M̄⊥(S_G) = {θ | θ_{G_i} = 0 for all i ∈ S_G}.

We can easily verify that

 ∥α + β∥_{G,q} = ∥α∥_{G,q} + ∥β∥_{G,q},

for any α ∈ M(S_G) and β ∈ M̄⊥(S_G).

We mainly focus on the case q = 2, i.e., group Lasso. We require the following condition, which generalizes the column normalization condition of the Lasso case: given a group G_i of size m, the associated operator norm satisfies

 |||X_{G_i}|||_{q→2}/√n ≤ 1 for all i = 1, 2, ..., N_G.

This condition reduces to the column normalization condition when each group contains only one feature (i.e., Lasso).

In the following corollary, we use q = 2, i.e., group Lasso, as an example. We assume the observation is generated by y = Xθ* + ε, where ε is zero-mean sub-Gaussian noise with variance ς².

Corollary 2.

(Group Lasso) Assume each group has m parameters, i.e., |G_i| = m. Denote the cardinality of the set of non-zero groups by s_G, and choose the parameter λ such that

 λ ≥ max(4ς(√(m/n) + √(log N_G/n)), c₁ρσ²(Σ)(√(m/n) + √(3 log N_G/n))²);

then, with suitable choices of the step size and of the constants κ, c, α, b, we have

 ET^k ≤ (1 − 1/κ)^k T^0

with high probability, until ET^k achieves the tolerance δ, where the relevant constants depend only on Σ and ς, and the c_i are universal positive constants.

We offer some discussion to put the above corollary into context.

• To satisfy the requirement on σ̄, it suffices to have n sufficiently large relative to s_G(m + log N_G). This is also the condition that guarantees the statistical consistency of group Lasso Negahban et al. (2009).

• m and N_G affect the speed of convergence; in particular, smaller m and N_G lead to a larger σ̄ and thus faster convergence.

• The requirement on λ is similar to that for the batch gradient method in Agarwal et al. (2010).

3.2 Non-convex G(θ)

The definition of the Lyapunov function in the non-convex case is the same as in the convex one, i.e.,

 T^k ≜ (1/n)∑_{i=1}^n (f_i(ϕ_i^k) − f_i(θ̂) − ⟨∇f_i(θ̂), ϕ_i^k − θ̂⟩) + (c + α)∥θ^k − θ̂∥₂² + b(G(θ^k) − G(θ̂)).

Note that θ̂ is the global optimum of problem (2) and each f_i is convex, thus T^k is always nonnegative. In the non-convex case, we require f to satisfy the RSC condition with parameters σ and τ, where τ is some positive constant.

We list some notation used in the following theorem and its corollaries.

• θ̂ is the global optimum of problem (2), and θ* is the unknown true parameter with cardinality (number of non-zeros) r.

• Modified restricted strong convexity parameter:

 σ̄ = σ − 64rτ (log p)/n − μ.

Recall that μ is defined in Section 2.3 and represents the degree of non-convexity.

• Tolerance δ, whose definition involves a universal positive constant.

Theorem 2.

Suppose θ* is r-sparse, θ̂ is the global optimum of Problem (2), each f_i is L-smooth and convex, f satisfies the RSC condition with σ̄ > 0, g_λ satisfies the assumptions in Section 2.3, and λ is chosen large enough (where c is some positive constant); then, with suitable choices of the step size and of the constants κ, c, α, b, we have

 ET^k ≤ (1 − 1/κ)^k T^0,

until ET^k achieves the tolerance δ, where the expectation is with respect to the random sampling of i in the algorithm.

• Notice that we require σ̄ > 0, that is, μ < σ − 64rτ(log p)/n. Thus, to satisfy this requirement, the non-convexity parameter μ cannot be large.

• The tolerance δ is dominated by the statistical error ∥θ̂ − θ*∥₂² when the model is sparse (r is small) and n is large.

• When T^k ≤ δ, using the modified restricted strong convexity for non-convex G (Lemma 10 in the appendix), we obtain a corresponding bound on ∥θ^k − θ̂∥₂².

• The requirement on λ is similar to that for the batch gradient algorithm Loh and Wainwright (2013).

Again, the theorem says the Lyapunov function decreases geometrically until it achieves the tolerance δ, and this tolerance can be ignored from the statistical perspective.

3.2.1 Linear regression with SCAD regularization

The first non-convex model we consider is linear regression with SCAD regularization. The loss function is f_i(θ) = ½(y_i − ⟨x_i, θ⟩)², and the regularizer is SCAD with parameters λ and ζ. The data are generated in a similar way to the Lasso case.

Corollary 3.

(Linear regression with SCAD regularization) Suppose θ* is the true parameter, supported on a subset with cardinality at most r, θ̂ is the global optimum, and we choose λ appropriately; then, with suitable choices of the step size and of the constants κ, c, α, b, we have

 ET^k ≤ (1 − 1/κ)^k T^0,

with high probability, until ET^k achieves the tolerance δ. Here the c_i are universal positive constants.

We remark that to satisfy the requirement σ̄ > 0, we need the non-convexity parameter μ to be small, the model to be sparse (r is small), and the number of samples to be large.

3.2.2 Linear regression with noisy covariates

The corrected Lasso was proposed by Loh and Wainwright (2011). Suppose data are generated according to the linear model y_i = ⟨x_i, θ*⟩ + ε_i, where ε_i is random zero-mean sub-Gaussian noise with variance ς². The observations of x_i are corrupted by additive noise; in particular, we observe z_i = x_i + w_i, where w_i is a random vector independent of x_i, with zero mean and known covariance matrix Σ_w. Define Γ̂ ≜ Z^⊤Z/n − Σ_w and γ̂ ≜ Z^⊤y/n. Our goal is to estimate θ* based on y and Z (but not X, which is not observable), and the corrected Lasso proposes to solve the following:

 θ̂ ∈ argmin_{∥θ∥₁≤ρ} ½θ^⊤Γ̂θ − γ̂^⊤θ + λ∥θ∥₁.

Equivalently, it solves

 min_{∥θ∥₁≤ρ} (1/2n)∑_{i=1}^n (y_i − θ^⊤z_i)² − ½θ^⊤Σ_w θ + λ∥θ∥₁.

Notice that due to the term −½θ^⊤Σ_w θ, the optimization problem is non-convex.
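The two formulations agree up to the constant ∥y∥²/(2n). The sketch below builds Γ̂ and γ̂ from simulated noisy covariates and checks this; the dimensions and noise level are illustrative choices.

```python
import numpy as np

# Corrected-Lasso surrogates from noisy covariates Z = X + W:
# Gamma_hat = Z^T Z / n - Sigma_w, gamma_hat = Z^T y / n.
rng = np.random.default_rng(3)
n, p = 300, 20
X = rng.standard_normal((n, p))
theta_star = np.zeros(p); theta_star[:3] = 1.0
y = X @ theta_star + 0.1 * rng.standard_normal(n)
sigma_w = 0.5
Z = X + sigma_w * rng.standard_normal((n, p))   # observed noisy covariates
Sigma_w = sigma_w ** 2 * np.eye(p)              # known noise covariance

Gamma_hat = Z.T @ Z / n - Sigma_w
gamma_hat = Z.T @ y / n

def quad_form(theta):
    """Corrected-Lasso smooth part: 0.5 theta^T Gamma_hat theta - gamma_hat^T theta."""
    return 0.5 * theta @ Gamma_hat @ theta - gamma_hat @ theta

def residual_form(theta):
    """(1/2n)||y - Z theta||^2 - 0.5 theta^T Sigma_w theta, up to a constant."""
    return 0.5 * np.mean((y - Z @ theta) ** 2) - 0.5 * theta @ Sigma_w @ theta

theta = rng.standard_normal(p)
const = 0.5 * np.mean(y ** 2)   # the two forms differ by ||y||^2 / (2n)
```

Subtracting Σ_w removes the bias E[Z^⊤Z/n] − Σ = Σ_w, but it can also push Γ̂ away from positive semidefiniteness, which is the source of the non-convexity noted above.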

3.3 Corrected Lasso

We consider a model where each data point x_i is i.i.d. sampled from a zero-mean normal distribution, i.e., x_i ∼ N(0, Σ). We denote the data matrix by X, the smallest eigenvalue of Σ by σ_min(Σ), and the largest eigenvalue by σ_max(Σ). We observe Z, which is corrupted by additive noise, i.e., Z = X + W, where each row w_i is a random vector independent of x_i, with zero mean and known covariance matrix Σ_w.

Corollary 4.

(Corrected Lasso) Suppose we are given i.i.d. observations from the linear model with additive noise, θ* is r-sparse, and λ is chosen appropriately. Let θ̂ be the global optimum. Then, with suitable choices of the step size and of the constants κ, c, α, b, we have

 ET^k ≤ (1 − 1/κ)^k T^0,

with high probability, until ET^k achieves the tolerance δ, where the c_i are universal positive constants.

Some remarks are listed below.

• The result can easily be extended to more general settings.

• To satisfy the requirement σ̄ > 0, we need

 γ ≤ (1/4)(½σ_min(Σ) − c₁σ_min(Σ) max{((σ_max(Σ) + γ_w)/σ_min(Σ))², 1} (r log p)/n).

A similar requirement is needed for the batch gradient method Loh and Wainwright (2013).

• The requirement on λ is similar to that of the batch gradient method Loh and Wainwright (2013).

3.4 Extension to Generalized linear model

The results on Lasso and group Lasso readily extend to generalized linear models, where we consider the model

 θ̂ = argmin_{θ∈Ω′} {(1/n)∑_{i=1}^n (Φ(⟨θ, x_i⟩) − y_i⟨θ, x_i⟩) + λ∥θ∥₁},

where Ω′ is a bounded feasible set whose radius is a universal constant Loh and Wainwright (2013). This requirement is essential: for instance, for the logistic function, the Hessian Φ″ approaches zero as its argument diverges. Notice that when Φ(u) = u²/2, the problem reduces to Lasso. The RSC condition admits the form

 (1/n)∑_{i=1}^n Φ″(⟨θ_t, x_i⟩)(⟨x_i, θ − θ′⟩)² ≥ (σ/2)∥θ − θ′∥₂² − τ_σ∥θ − θ′∥₁², for all θ, θ′ ∈ Ω′.

For a broad class of log-linear models, the RSC condition holds with τ_σ on the order of log p/n. Therefore, we obtain the same results as those for Lasso, modulo changes of constants. For more details on RSC conditions in generalized linear models, we refer the readers to Negahban et al. (2009).
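For the logistic case, Φ and its second derivative can be written down directly; the sketch below illustrates why the side constraint on Ω′ is needed, namely that Φ″ vanishes as its argument grows. The data in the usage check are synthetic.

```python
import numpy as np

def Phi(u):
    """Log-partition of the logistic model, Phi(u) = log(1 + exp(u)),
    computed in a numerically stable way."""
    return np.logaddexp(0.0, u)

def Phi2(u):
    """Phi''(u) = sigmoid(u) * (1 - sigmoid(u)); it vanishes as |u| grows,
    so curvature is only available on a bounded set Omega'."""
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)

def glm_objective(theta, X, y, lam):
    """(1/n) sum_i [Phi(<theta, x_i>) - y_i <theta, x_i>] + lam*||theta||_1."""
    u = X @ theta
    return np.mean(Phi(u) - y * u) + lam * np.abs(theta).sum()
```

At θ = 0 the data-fitting term equals Φ(0) = log 2 regardless of the data, which makes a convenient sanity check when wiring up such an objective.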

4 Empirical Result

We report experimental results in this section to validate our theory that SAGA enjoys a linear convergence rate without strong convexity, or even without convexity. We ran experiments on both synthetic and real datasets and compared SAGA with several candidate algorithms. The experimental setup is similar to Qu et al. (2016). Due to space constraints, some additional simulation results are presented in the appendix. The algorithms tested are Prox-SVRG Xiao and Zhang (2014), Prox-SAG (a proximal version of the algorithm in Schmidt et al. (2013)), proximal stochastic gradient (Prox-SGD), the regularized dual averaging method (RDA) Xiao (2010), and the proximal full gradient method (Prox-GD) Nesterov (2013). For the algorithms with a constant learning rate (i.e., SAGA, Prox-SAG, Prox-SVRG, Prox-GD), we tune the learning rate over an exponential grid and choose the one with the best performance. Below are some remarks on the candidate algorithms.

• The linear convergence of SVRG in our setting has been proved in Qu et al. (2016).

• We adapt SAG to a proximal version. To the best of our knowledge, the convergence of Prox-SAG has not been established, although it works well in our experiments.

• The step size in Prox-SGD is