First-order methods have again become a central focus of research in optimization, and particularly in machine learning, in recent years, thanks to their ability to address the very large scale empirical risk minimization problems that are ubiquitous in machine learning, a task that is often challenging for other algorithms such as interior point methods. The randomized (dual) coordinate version of the first-order method samples one data point and updates the objective function at each time step, which avoids computing the full gradient and substantially reduces the per-iteration cost. Related methods have been implemented in various software packages Vedaldi and Fulkerson (2008). In particular, the randomized dual coordinate method considers the following problem.
is a convex loss function of each sample, and is the regularizer; is a convex compact set. Instead of directly solving the primal problem, it looks at the dual problem
where it assumes the loss function has the form . Shalev-Shwartz and Zhang (2013) consider this dual form and prove linear convergence of the stochastic dual coordinate ascent method (SDCA) when . They further extend the result to a more general form, which allows the regularizer to be a general strongly convex function Shalev-Shwartz and Zhang (2014). Qu and Richtárik (2015)
propose AdaSDCA, an adaptive variant of SDCA that adaptively changes the probability distribution over the dual variables during the iterative process. Their experimental results show that it outperforms non-adaptive methods. Dünner et al. (2016) consider a primal-dual framework and extend SDCA to the non-strongly convex case with a sublinear rate. Allen-Zhu and Yuan (2015a) further improve the convergence speed using a novel non-uniform sampling scheme that selects each coordinate with probability proportional to the square root of its smoothness parameter. Other acceleration techniques Qu and Richtárik (2014); Qu et al. (2015); Nesterov (2012), as well as mini-batch and distributed variants of the coordinate method Liu and Wright (2015); Zhao et al. (2014); Jaggi et al. (2014); Mahajan et al. (2014), have been studied in the literature. See Wright (2015) for a review of the coordinate method.
Our goal is to investigate how to extend SDCA to non-convex statistical problems. Non-convex optimization problems have attracted fast-growing attention due to the rise of numerous applications, notably non-convex M-estimators (e.g., SCAD Fan and Li (2001), MCP Zhang (2010)) and robust regression (e.g., corrected Lasso in Loh and Wainwright, 2011). Shalev-Shwartz (2016) proposes a dual-free version of SDCA and proves its linear convergence, addressing the case where each individual loss function may be non-convex but their sum is strongly convex. This extends the applicability of SDCA. From a technical perspective, due to the non-convexity of , the paper avoids explicitly using its dual (hence the name “dual free”) by introducing pseudo-dual variables.
In this paper, we consider using SDCA to solve M-estimators that are not strongly convex, or even non-convex. We show that, under the restricted strong convexity condition, SDCA converges linearly. This setup includes well-known formulations such as Lasso, Group Lasso, logistic regression with regularization, linear regression with the SCAD regularizer Fan and Li (2001), and corrected Lasso Loh and Wainwright (2011), to name a few. This significantly improves upon existing theoretical results Shalev-Shwartz and Zhang (2013, 2014); Shalev-Shwartz (2016), which only established sublinear convergence on convex objectives.
To this end, we first adapt SDCA Shalev-Shwartz and Zhang (2014) into a generalized dual-free form in Algorithm 1. This is because, to apply SDCA in our setup, we need to introduce a non-convex term and thus require a dual-free analysis. We remark that the theory on dual-free SDCA established in Shalev-Shwartz (2016) does not apply to our setup, for the following reasons:
We illustrate the update rule of Algorithm 1 using an example. While our main focus is on non-strongly convex problems, we start with a strongly convex example to build intuition. Suppose and . It is easy to see and
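To make the strongly convex case concrete, a dual-free SDCA iteration in the spirit of Shalev-Shwartz (2016) can be sketched in a few lines. The ridge-regression loss, step size, and synthetic data below are our own illustrative assumptions, not the paper's exact setting; the update maintains pseudo-dual vectors whose average recovers the primal iterate.

```python
import numpy as np

def dual_free_sdca(X, y, lam, eta, epochs, seed=0):
    """Dual-free SDCA sketch for ridge regression:
    min_w (1/n) sum_i 0.5*(x_i^T w - y_i)^2 + (lam/2)*||w||^2.
    Pseudo-dual vectors alpha_i are kept so that
    w = (1/(lam*n)) * sum_i alpha_i holds at every step."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros((n, d))
    w = alpha.sum(axis=0) / (lam * n)
    for _ in range(epochs * n):
        i = rng.integers(n)                       # sample one data point
        grad_i = (X[i] @ w - y[i]) * X[i]         # gradient of i-th loss
        delta = -eta * lam * n * (grad_i + alpha[i])
        alpha[i] += delta                         # pseudo-dual update
        w += delta / (lam * n)                    # keep primal-dual relation
    return w

# small synthetic check against the closed-form ridge solution
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.standard_normal(50)
lam = 0.1
w_sdca = dual_free_sdca(X, y, lam, eta=0.01, epochs=300)
w_exact = np.linalg.solve(X.T @ X / 50 + lam * np.eye(5), X.T @ y / 50)
print(np.linalg.norm(w_sdca - w_exact))
```

No full gradient is ever computed: each iteration touches a single sample, yet the pseudo-dual bookkeeping gives the variance reduction that enables a linear rate in the strongly convex regime.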
It is then clear that to apply SDCA needs to be strongly convex, or otherwise the proximal step becomes ill-defined: may be infinite (if is unbounded) or non-unique. This observation motivates the following preprocessing step: if is not strongly convex, for example in Lasso where and , we redefine the formulation by adding a strongly convex term to and subtracting it from . More precisely, for , we define and apply Algorithm 1 to (which is equivalent to Lasso), where the value of will be specified later. Our analysis is thus focused on this alternative representation. The main challenge arises from the non-convexity of the newly defined , which precludes the use of the dual method as in Shalev-Shwartz and Zhang (2014). While Shalev-Shwartz (2016) proposes and analyzes a dual-free SDCA algorithm, the results do not apply to our setting for the reasons discussed above.
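The preprocessing step for Lasso can be sanity-checked numerically. The direction of the shift (a quadratic added to the regularizer and subtracted from each loss) and all numbers below are illustrative assumptions consistent with the description above; the point is only that the overall objective is unchanged.

```python
import numpy as np

# Splitting trick sketch: Lasso has no strongly convex regularizer, so
# move a quadratic (sigma/2)*||w||^2 from the losses into the regularizer.
sigma, lam = 0.5, 0.1
w = np.array([0.3, -1.2, 0.0])
X = np.array([[1.0, 0.0, 2.0], [0.5, -1.0, 1.0]])
y = np.array([1.0, -0.5])
n = len(y)

def f_i(i, w):            # original smooth loss for sample i
    return 0.5 * (X[i] @ w - y[i]) ** 2

def f_i_tilde(i, w):      # shifted loss: may become non-convex
    return f_i(i, w) - 0.5 * sigma * w @ w

def g_tilde(w):           # regularizer: now sigma-strongly convex
    return lam * np.abs(w).sum() + 0.5 * sigma * w @ w

orig = np.mean([f_i(i, w) for i in range(n)]) + lam * np.abs(w).sum()
new = np.mean([f_i_tilde(i, w) for i in range(n)]) + g_tilde(w)
print(abs(orig - new))  # equal up to floating-point error
```

The reformulated problem hands SDCA a strongly convex regularizer for its proximal step, at the price of per-sample losses that are no longer convex, which is exactly why a dual-free analysis is needed.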
Our contributions are two-fold. 1. We prove linear convergence of SDCA for a class of problems that are not strongly convex, or even non-convex, using the concept of restricted strong convexity (RSC). To the best of our knowledge, this is the first work to prove linear convergence of SDCA in this setting, which includes several statistical models such as Lasso, Group Lasso, logistic regression, linear regression with SCAD regularization, and corrected Lasso, to name a few. 2. As a byproduct, we derive a dual-free form of SDCA that extends the work of Shalev-Shwartz (2016) to more general regularizers .
Related work. Agarwal et al. (2010) prove linear convergence of the batch gradient and composite gradient methods for Lasso and the low-rank matrix completion problem using the RSC condition. Loh and Wainwright (2013) extend this to a class of non-convex M-estimators. In spirit, our work can be thought of as a stochastic counterpart. Recently, Qu et al. (2016, 2017) consider SVRG and SAGA in a similar setting to ours, but they look at the primal problem. Shalev-Shwartz (2016) considers dual-free SDCA, but the analysis does not apply to the case where is not strongly convex. Similarly, Allen-Zhu and Yuan (2015b) consider the non-strongly convex setting in SVRG (), where is convex and each individual may be non-convex, but only establish sublinear convergence. Recently, Li et al. (2016) consider SVRG with a zero-norm constraint and prove linear convergence for this specific formulation. In comparison, our results hold more generally, covering not only the sparsity model but also corrected Lasso with noisy covariates, the group sparsity model, and beyond.
2 Problem setup
In this paper we consider two setups: (1) convex but not strongly convex , and (2) non-convex . For the convex case, we consider
where is some pre-defined radius, is the loss function for sample , and is a norm. Here we assume each is convex and -smooth.
For the non-convex case, we consider
where is some pre-defined radius, is convex and smooth, and is a non-convex regularizer depending on a tuning parameter and a parameter explained in Section 2.3. This M-estimator includes a side constraint depending on , which needs to be a convex function admitting a lower bound , and closed w.r.t. ; more details are deferred to Section 2.3.
We list some examples that belong to these two setups.
The first three examples belong to the first setup, while the last two belong to the second.
2.1 Restricted Strong Convexity
Restricted strong convexity (RSC) was proposed in Negahban et al. (2009); Van De Geer et al. (2009) and has been explored in several other works Agarwal et al. (2010); Loh and Wainwright (2013). We say the loss function satisfies the RSC condition with curvature and tolerance parameter with respect to the norm if
When is strongly convex, it is RSC with and . However, in many cases may not be strongly convex, especially in the high-dimensional setting where the ambient dimension . On the other hand, RSC is easier to satisfy. Take Lasso as an example: under some mild conditions, it is shown in Negahban et al. (2009) that , where , are some positive constants. Beyond Lasso, the RSC condition holds for a wide range of statistical models, including the log-linear model, the group sparsity model, and the low-rank model. See Negahban et al. (2009) for more detailed discussion.
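For concreteness, the RSC inequality in the form used by Negahban et al. (2009) and Agarwal et al. (2010) can be written as below; the symbols $\gamma$, $\tau$, $R(\cdot)$, and $c$ are generic placeholders, and the paper's own notation and constants may differ.

```latex
% RSC with curvature gamma and tolerance tau, w.r.t. regularizer norm R(.)
f(w + \Delta) - f(w) - \langle \nabla f(w), \Delta \rangle
  \;\ge\; \frac{\gamma}{2}\,\lVert\Delta\rVert_2^2 \;-\; \tau\, R^2(\Delta),
  \qquad \forall\, \Delta .
% For least squares (Lasso) this specializes, with high probability, to
\frac{1}{2n}\lVert X\Delta\rVert_2^2 \;\ge\;
  \frac{\gamma}{2}\,\lVert\Delta\rVert_2^2
  \;-\; c\,\frac{\log d}{n}\,\lVert\Delta\rVert_1^2 .
```

Strong convexity is the special case with zero tolerance; RSC instead allows a small violation measured in the regularizer norm, which shrinks as the sample size grows.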
2.2 Assumption on convex regularizer
Decomposability is the other ingredient needed to analyze the algorithm.
Definition: A regularizer is decomposable with respect to a pair of subspaces if
where means the orthogonal complement.
A concrete example is regularization for sparse vectors supported on a subset. We define the subspace pairs with respect to the subset , and The decomposability is then easy to verify. Other widely used examples include non-overlapping group norms such as , and the nuclear norm Negahban et al. (2009). In the rest of the paper, we denote by the projection of onto the subspace .
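For the sparse-support example, decomposability of the regularizer can be checked numerically; the dimension and support set below are arbitrary illustrative choices.

```python
import numpy as np

# Decomposability check for the l1 norm w.r.t. a support set S:
# ||u + v||_1 = ||u||_1 + ||v||_1 whenever u lives on S and v on its
# complement, since the two vectors have disjoint supports.
rng = np.random.default_rng(0)
d = 10
S = np.array([1, 4, 7])
mask = np.zeros(d, dtype=bool)
mask[S] = True
u = np.where(mask, rng.standard_normal(d), 0.0)   # supported on S
v = np.where(mask, 0.0, rng.standard_normal(d))   # supported on complement
lhs = np.abs(u + v).sum()
rhs = np.abs(u).sum() + np.abs(v).sum()
print(lhs, rhs)
```

The same disjoint-support argument underlies decomposability of non-overlapping group norms, with coordinates replaced by groups.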
Subspace compatibility constant
For any subspace of , the subspace compatibility constant with respect to the pair is given by
That is, it is the Lipschitz constant of the regularizer restricted to . For example, for the above-mentioned sparse vectors with cardinality , for .
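The square-root-of-sparsity value for this example can be verified empirically; the ambient dimension, support size, and Gaussian sampling below are illustrative assumptions.

```python
import numpy as np

# For vectors supported on a fixed set S with |S| = s, the l1/l2
# compatibility constant equals sqrt(s) (Cauchy-Schwarz), attained by
# the all-ones vector on S.
rng = np.random.default_rng(0)
s, d = 5, 100
S = rng.choice(d, size=s, replace=False)
ratios = []
for _ in range(2000):
    u = np.zeros(d)
    u[S] = rng.standard_normal(s)
    ratios.append(np.abs(u).sum() / np.linalg.norm(u))
print(max(ratios))            # never exceeds sqrt(s)
u = np.zeros(d)
u[S] = 1.0                    # equality case: constant on the support
print(np.abs(u).sum() / np.linalg.norm(u))
```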
2.3 Assumption on non-convex regularizer
We consider regularizers that are separable across coordinates, i.e., . Besides the separability, we make further assumptions on the univariate function :
satisfies and is symmetric about zero (i.e., ).
On the nonnegative real line, is nondecreasing.
For , is nonincreasing in t.
is differentiable for all and subdifferentiable at , with .
For instance, SCAD satisfies these assumptions:
where is a fixed parameter. It satisfies the assumptions with and Loh and Wainwright (2013).
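A sketch of the SCAD penalty together with a quick check of the assumptions above; the values lam = 1 and a = 3.7 are the conventional choices from Fan and Li (2001), used here only as illustrative defaults.

```python
import numpy as np

def scad(t, lam=1.0, a=3.7):
    """SCAD penalty: linear near zero, quadratic transition, then flat.
    lam and a are illustrative defaults, not values fixed by the paper."""
    t = np.abs(t)
    return np.where(
        t <= lam, lam * t,
        np.where(t <= a * lam,
                 (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
                 (a + 1) * lam ** 2 / 2))

# quick checks of the stated assumptions
ts = np.linspace(-10.0, 10.0, 2001)
vals = scad(ts)
assert scad(0.0) == 0.0                        # q(0) = 0
assert np.allclose(vals, vals[::-1])           # symmetric about zero
assert np.all(np.diff(vals[1000:]) >= -1e-9)   # nondecreasing for t >= 0
print("ok")
```

The flat tail beyond the transition region is what removes the bias that the l1 penalty places on large coefficients, at the cost of non-convexity.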
2.4 Applying SDCA
Following a similar line as for Lasso, to apply the SDCA algorithm we define , . Correspondingly, the new smoothness parameters are and . Problem (2) is thus equivalent to the following
This enables us to apply Algorithm 1 to the problem with , , , . In particular, while is not convex, is still convex (1-strongly convex). We exploit this property in the proof and define , where is a convex compact set. Since is -strongly convex, is -smooth (Theorem 1 of Nesterov, 2005).
3 Theoretical Result
In this section, we present the main theoretical results, and some corollaries that instantiate the main results in several well known statistical models.
To begin with, we define several terms related to the algorithm.
is the true unknown parameter. is the dual norm of . The conjugate function is , where .
is the optimal solution to Problem (2), and we assume w.l.o.g. that lies in the interior of , by choosing large enough.
is an optimal solution pair satisfying .
We remark that our definition of the first potential is the same as in Shalev-Shwartz (2016), while the second one is different. If , our definition of reduces to that in Shalev-Shwartz (2016), i.e., . To see this, when , and . We then define another potential , which is a combination of and . Notice that using the smoothness of and Lemmas 1 and 2 in the appendix, it is not hard to show . Thus if converges, so does .
For notational simplicity, we define the following two terms used in the theorem.
Effective RSC parameter:
Tolerance: , where and is a universal positive constant.
Assume each is smooth and convex, satisfies the RSC condition with parameter , is feasible, and the regularizer is decomposable w.r.t. such that . Suppose Algorithm 1 is run with , where and is chosen such that . If we choose the regularization parameter such that , where is some universal positive constant, then we have
until , where the expectation is over the randomness of the sampling of in the algorithm.
Some remarks are in order for interpreting the theorem.
In several statistical models, the requirement on is easy to satisfy under mild conditions. For instance, in Lasso we have , and , if the feature vectors are sampled from . Thus, if , we have .
In some models, we can choose the pair of subspaces such that and thus . In Lasso, as mentioned above, , and thus , i.e., this tolerance is dominated by the statistical error if is small and is large.
We know and , thus ; therefore, if converges, so does . When , using Lemma 5 in the supplementary material, it is easy to get
Combining these remarks, Theorem 1 states that the optimization error decreases geometrically until it reaches the tolerance , which is dominated by the statistical error ; it can thus be ignored from the statistical viewpoint.
If in Problem (1) is indeed 1-strongly convex, we have the following proposition, which extends dual-free SDCA Shalev-Shwartz (2016) to the general regularization case. Notice that we now directly apply Algorithm 1 to Problem (1) and change the definitions of and accordingly. In particular, , , where . is still defined in the same way, i.e.,
In the following we present several corollaries that instantiate Theorem 1 on several concrete statistical models. This essentially requires choosing an appropriate subspace pair for each model and checking the RSC condition.
3.1.1 Sparse linear regression
Our first example is Lasso, where and . We assume each feature vector is generated from the normal distribution , and the true parameter is sparse with cardinality . The observation is generated by , where is Gaussian noise with mean and variance . We denote the data matrix by , and is the th column of . Without loss of generality, we assume is column-normalized, i.e., for all . We denote by the smallest eigenvalue of , and .
Assume is the true parameter supported on a subset with cardinality at most , and we choose the parameters such that and hold, where and are some universal positive constants. Then, running Algorithm 1 with , we have
with probability at least , until , where and are some universal positive constants.
The requirement is documented in the literature Negahban et al. (2009) to ensure that Lasso is statistically consistent. The requirement is needed for fast convergence of optimization algorithms, and is similar to the condition proposed in Agarwal et al. (2010) for batch optimization algorithms. When , which is necessary for statistical consistency of Lasso, we have , which guarantees the existence of . Also notice that under this condition, is of lower order than . Using Remark 3 of Theorem 1, we have , which is dominated by the statistical error and hence can be ignored from the statistical perspective. To sum up, Corollary 1 states that the optimization error decreases geometrically until it reaches the statistical limit of Lasso.
3.1.2 Group Sparsity Model
Yuan and Lin (2006) introduce the group Lasso to allow predefined groups of covariates to be selected into or out of a model together. The most commonly used regularizer to encourage group sparsity is . In the following, we define group sparsity formally. We assume the groups are disjoint, i.e., and . The regularizer is . When , it reduces to the commonly used group Lasso Yuan and Lin (2006); another popular case is Turlach et al. (2005); Quattoni et al. (2009). We require the following condition, which generalizes the column normalization condition of the Lasso case. Given a group of size and , the associated operator norm satisfies
This condition reduces to the column normalization condition when each group contains only one feature (i.e., Lasso).
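The group regularizer itself is straightforward to compute: sum the per-group norms over a disjoint partition of the coordinates. The partition into three groups of two coordinates below is an illustrative assumption.

```python
import numpy as np

def group_norm(w, groups, p=2):
    """Sum of per-group l_p norms over a disjoint partition of coordinates.
    p = 2 gives the standard group Lasso penalty; use ord=np.inf for the
    l1/l-infinity variant."""
    return sum(np.linalg.norm(w[g], ord=p) for g in groups)

w = np.array([1.0, -2.0, 0.0, 0.0, 3.0, 4.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
val = group_norm(w, groups, p=2)
print(val)  # sqrt(5) + 0 + 5
```

Note that the middle group contributes zero: zeroing a whole group zeroes its term, which is exactly how the penalty encourages group-level sparsity.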
We now define the subspace pair in the group sparsity model. For a subset with cardinality , we define the subspace
and . The orthogonal complement is
We can easily verify that
for any and .
In the following corollary, we use , i.e., group Lasso, as an example. We assume the observation is generated by , where , and .
Corollary 2 (Group Lasso).
Assume and each group has parameters, i.e., . Denote by the cardinality of non-zero groups, and choose the parameters such that
where and are positive constants depending only on . If we run Algorithm 1 with , then we have
with probability at least , until , where .
We offer some discussion to interpret the corollary. To satisfy the requirement , it suffices to have
This is a mild condition, as it is needed to guarantee the statistical consistency of group Lasso Negahban et al. (2009). Notice that the condition is easily satisfied when and are small. Under this same condition, since , we conclude that is dominated by . Again, this implies the optimization error decreases geometrically up to the scale , which is dominated by the statistical error of the model.
3.1.3 Extension to generalized linear model
We consider the generalized linear model of the following form,
which covers such cases as Lasso (where ) and logistic regression (where ). In this model, we have
where for some . The RSC condition is thus equivalent to:
Here we require to be a bounded set Loh and Wainwright (2013). This requirement is essential, since in some generalized linear models approaches zero as diverges. For instance, in logistic regression , which tends to zero as . For a broad class of generalized linear models, RSC holds with ; thus the same result as for Lasso holds, modulo a change of constants.
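The logistic-regression instance of this phenomenon is easy to see numerically: the curvature of the logistic loss decays to zero for large inputs, so without a bounded domain no uniform positive curvature constant exists. The evaluation points below are illustrative.

```python
import numpy as np

def curvature(t):
    """Second derivative of t -> log(1 + exp(t)), i.e. the curvature of
    the logistic loss: sigma(t) * (1 - sigma(t))."""
    s = 1.0 / (1.0 + np.exp(-t))
    return s * (1.0 - s)

print(curvature(0.0))         # maximum curvature, 1/4
print(curvature(20.0))        # essentially zero far from the origin
```

On a bounded set the curvature is bounded away from zero, which is what lets the RSC constant stay positive for this model.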
In the non-convex case, we assume the following RSC condition:
with for some constant . We again define the potentials , and in the same way as in the convex case. The main difference is that now we have and the effective RSC parameter is different. The necessary notation for presenting the theorem is listed below:
is the unknown true parameter, which is -sparse. The conjugate function is , where . Note that is convex due to the convexity of .
is the global optimum of Problem (3); we assume w.l.o.g. that it lies in the interior of .
is an optimal solution pair satisfying