Linear Convergence of SVRG in Statistical Estimation

11/07/2016, by Chao Qu et al.

SVRG and its variants are among the state-of-the-art optimization algorithms for large-scale machine learning problems. It is well known that SVRG converges linearly when the objective function is strongly convex. However, this setup can be restrictive, and it does not include several important formulations such as Lasso, group Lasso, logistic regression, and some non-convex models including corrected Lasso and SCAD. In this paper, we prove that, for a class of statistical M-estimators covering the examples mentioned above, SVRG solves the formulation with a linear convergence rate without strong convexity or even convexity. Our analysis makes use of restricted strong convexity, under which we show that SVRG converges linearly to the fundamental statistical precision of the model, i.e., the difference between the true unknown parameter θ^* and the optimal solution θ̂ of the model.


1 Introduction

In this paper we establish a fast convergence rate of the stochastic variance reduced gradient (SVRG) method for a class of problems motivated by applications in high-dimensional statistics, where the problems are not strongly convex, or even non-convex. High-dimensional statistics has achieved remarkable success in the last decade, including results on consistency and rates for various estimators under non-asymptotic high-dimensional scaling, especially when the problem dimension p is larger than the sample size n (e.g., Negahban et al., 2009; Candès and Recht, 2009; Candes et al., 2006; Wainwright, 2006; Chen et al., 2011, among many others). It is now well known that while this setup appears ill-posed, estimation or recovery is indeed possible by exploiting the underlying structure of the parameter space; notable examples include sparse vectors, low-rank matrices, and structured regression functions, among others. Recently, estimators leading to non-convex optimization problems have gained fast-growing attention. Not only do they typically have better statistical properties in the high-dimensional regime, but also, in contrast to common belief, in many cases there exist efficient algorithms that provably find near-optimal solutions (Loh and Wainwright, 2011; Zhang and Zhang, 2012; Loh and Wainwright, 2013).

Computational challenges of statistical estimators and machine learning algorithms have been an active area of study, thanks to countless applications involving big data, i.e., datasets where both the sample size n and the dimension p are large. In particular, there is renewed interest in first-order methods to solve the following class of optimization problems:

    min_{θ ∈ Ω}  F(θ) := (1/n) Σ_{i=1}^n f_i(θ) + g(θ).     (1)

Problem (1) naturally arises in statistics and machine learning. In supervised learning, we are given a sample of n training data points (x_i, y_i), and f_i(θ) is the corresponding loss, e.g., the squared loss f_i(θ) = ½(y_i − ⟨x_i, θ⟩)²; Ω is a convex set corresponding to the hypothesis class, and g is the (possibly non-convex) regularization. Many widely applied statistical formulations are examples of Problem (1). A partial list includes:

  • Lasso: f_i(θ) = ½(y_i − ⟨x_i, θ⟩)² and g(θ) = λ‖θ‖₁.

  • Group Lasso: f_i(θ) = ½(y_i − ⟨x_i, θ⟩)², g(θ) = λ Σ_g ‖θ_{G_g}‖₂, where the G_g are disjoint groups of coordinates.

  • Logistic regression with ℓ1 regularization: f_i(θ) = log(1 + exp(−y_i ⟨x_i, θ⟩)) and g(θ) = λ‖θ‖₁.

  • Corrected Lasso (Loh and Wainwright, 2011): the squared loss is corrected for covariate noise by subtracting a quadratic term ½ θᵀΣ_w θ, where Σ_w is some positive definite matrix, and g(θ) = λ‖θ‖₁.

  • Regression with SCAD regularizer (Fan and Li, 2001): f_i(θ) = ½(y_i − ⟨x_i, θ⟩)² and g(θ) is the SCAD penalty.

In the first three examples, the objective functions are not strongly convex when p > n. Example 4 is non-convex when the corrected quadratic form is not positive semidefinite, which is typically the case in the high-dimensional regime, and the last example is non-convex due to the SCAD regularizer.
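
To make the template concrete, the following Python sketch spells out the Lasso instance of Problem (1); the function names and shapes are our own illustrative choices, not notation from the paper.

```python
import numpy as np

# Minimal sketch (illustrative, not the authors' code): the Lasso instance of
# Problem (1), F(theta) = (1/n) * sum_i f_i(theta) + g(theta), with
# f_i(theta) = 0.5 * (y_i - <x_i, theta>)^2 and g(theta) = lam * ||theta||_1.

def f_i(theta, x_i, y_i):
    """Per-sample squared loss f_i(theta)."""
    return 0.5 * (y_i - x_i @ theta) ** 2

def grad_f_i(theta, x_i, y_i):
    """Gradient of f_i at theta: -(y_i - <x_i, theta>) * x_i."""
    return -(y_i - x_i @ theta) * x_i

def g_l1(theta, lam):
    """l1 regularizer g(theta) = lam * ||theta||_1."""
    return lam * np.abs(theta).sum()

def F(theta, X, y, lam):
    """Objective (1): average loss plus regularizer."""
    residuals = y - X @ theta
    return 0.5 * np.mean(residuals ** 2) + g_l1(theta, lam)
```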

The projected gradient method, the proximal gradient method, the dual averaging method (Nesterov, 2009), and several variants of them have been proposed to solve Problem (1). However, at each step, these batch gradient descent methods need to evaluate all n derivatives, one for each f_i, which can be expensive for large n. Accordingly, stochastic gradient descent (SGD) methods have gained attention because of their significantly lighter computational load per iteration: at iteration t, only one data point, sampled from {1, ..., n} and indexed by i_t, is used to update the parameter according to

    θ^{t+1} = Π_Ω( θ^t − η_t ∇f_{i_t}(θ^t) ),

or its proximal counterpart (w.r.t. the regularization function g)

    θ^{t+1} = prox_{η_t g}( θ^t − η_t ∇f_{i_t}(θ^t) ),

where Π_Ω denotes the Euclidean projection onto Ω, prox_{η g}(v) = argmin_u { ½‖u − v‖₂² + η g(u) }, and η_t is the step size.
Although the computational cost of each step is low, SGD often suffers from slow convergence, i.e., a sub-linear convergence rate even under strong assumptions (strong convexity and smoothness). Recently, a state-of-the-art technique to improve the convergence of SGD, called variance reduction, has been proposed (Johnson and Zhang, 2013; Xiao and Zhang, 2014). As the name suggests, it devises a better unbiased estimator of the stochastic gradient whose variance diminishes as the iterates approach the optimum. In particular, in SVRG and its variants, the algorithm keeps a snapshot θ̃ after every m SGD iterations and calculates the full gradient ∇F̄(θ̃) = (1/n) Σ_{i=1}^n ∇f_i(θ̃) just for this snapshot; the variance-reduced gradient at an inner iterate θ_k is then computed by

    v_k = ∇f_{i_k}(θ_k) − ∇f_{i_k}(θ̃) + ∇F̄(θ̃).

It is shown in Johnson and Zhang (2013) that when the objective is strongly convex and each f_i is smooth, SVRG and its variants enjoy linear convergence, i.e., O(log(1/ε)) stages suffice to obtain an ε-optimal solution. Equivalently, the gradient complexity (i.e., the number of gradient evaluations needed) is O((n + L/γ) log(1/ε)), where L is the smoothness parameter of the f_i and γ is the strong convexity parameter of the objective.
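
The following short Python sketch shows the variance-reduced gradient estimator described above; `grad_f_i` is a per-sample gradient function as in the earlier sketch (our naming), and the estimator is unbiased because the two snapshot terms cancel in expectation over the random index.

```python
import numpy as np

def svrg_direction(theta, theta_snap, full_grad_snap, grad_f_i, X, y, i):
    """SVRG variance-reduced gradient (sketch, not the authors' code):

        v = grad f_i(theta) - grad f_i(theta_snap) + full_grad_snap,

    where full_grad_snap = (1/n) * sum_j grad f_j(theta_snap). Taking the
    expectation over the uniformly random index i recovers the full average
    gradient at theta, and the variance shrinks as theta and theta_snap
    approach the optimum.
    """
    return (grad_f_i(theta, X[i], y[i])
            - grad_f_i(theta_snap, X[i], y[i])
            + full_grad_snap)
```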

What if the objective is not strongly convex, or not even convex? As discussed above, many popular machine learning models belong to this category. When the objective is not strongly convex, existing theory only guarantees that SVRG converges sub-linearly. A folklore remedy is to add a small strongly convex dummy term, such as (ε/2)‖θ‖₂², to the objective function and then apply the algorithm (Shalev-Shwartz and Zhang, 2014; Allen-Zhu and Yuan, 2015). This undermines the performance of the model, particularly its ability to recover sparse solutions. One may attempt to reduce the added term to zero in the hope of reproducing the optimal solution of the original formulation, but the convergence will still be sub-linear via this approach. As for the non-convex case, to the best of our knowledge, no work provides linear convergence guarantees for the above-mentioned examples using SVRG.

Contribution of the paper

We show that for a class of problems, SVRG achieves linear convergence without strong convexity, or even convexity, assumptions. In particular, we prove that the gradient complexity of SVRG is O((n + L/ν̄) log(1/ε)) when the target accuracy ε is larger than the statistical tolerance, where ν̄ is the modified restricted strong convexity parameter defined in Theorem 1 and Theorem 2. Notice that if we replace the modified restricted strong convexity parameter by the strong convexity parameter, the above result becomes the standard result for SVRG. Indeed, the main effort in the proof is to replace strong convexity by restricted strong convexity (RSC). Our analysis is general and covers many formulations of interest, including all examples mentioned above. Notice that RSC is known to hold with high probability for a broad class of statistical models, including the sparse linear model, the group sparsity model, and the low-rank matrix model. Furthermore, the batch gradient method under the RSC assumption analyzed by Loh and Wainwright (2013) has gradient complexity O(nL/ν̄ log(1/ε)). Thus our result is better than the batch one, especially when the problem is ill-conditioned (i.e., when L/ν̄ is large).

We also remark that while we present the analysis for the vanilla SVRG (Xiao and Zhang, 2014), the analysis for variants of SVRG (Nitanda, 2014; Harikandeh et al., 2015) is similar, and indeed such extensions are straightforward.

Related work

There is a line of work establishing fast convergence rates without strong convexity assumptions for batch gradient methods. Xiao and Zhang (2013) proposed a homotopy method to solve Lasso under a RIP condition. Agarwal et al. (2010) analyzed the convergence rate of the batch composite gradient method on several models, such as Lasso, logistic regression with ℓ1 regularization, and noisy matrix decomposition, and showed that the convergence is linear under mild conditions (sparsity or low rank). Loh and Wainwright (2011, 2013) extended the above work to the non-convex case. Conceptually, our work can be thought of as the stochastic counterpart of this line of work, albeit with a more involved analysis due to the stochastic nature of SVRG.

In general, when the objective is not strongly convex, stochastic variance-reduction type methods have only been shown to converge at a sub-linear rate: SVRG (Johnson and Zhang, 2013), SAG (Mairal, 2013), MISO (Mairal, 2015), and SAGA (Defazio et al., 2014) are shown to converge for non-strongly convex functions at a sub-linear rate of O(1/T). Allen-Zhu and Yuan (2015) propose SVRG++, which solves the non-strongly convex problem with gradient complexity O(n log(1/ε) + L/ε). Shalev-Shwartz (2016) analyzed SDCA, another stochastic gradient type algorithm with variance reduction, and established similar results; he allowed each f_i to be non-convex but needs the average to be strongly convex for linear convergence to hold. None of these works establishes linear convergence for the above-mentioned examples, especially when the objective is non-convex.

Recently, several papers revisit an old idea called the Polyak-Łojasiewicz inequality and use it in place of the strong convexity assumption to establish fast rates (Karimi et al., 2016; Reddi et al., 2016; Gong and Ye, 2014). They established linear convergence of SVRG without strong convexity for Lasso and logistic regression. The contribution of our work differs from theirs in two aspects. First, the linear convergence rate they established does not depend on the sparsity of the solution, which does not agree with empirical observation. We report simulation results on solving Lasso using SVRG in the appendix, which show a phase transition in the rate: when the solution is dense enough, the rate becomes sub-linear. A careful analysis of their result shows that the convergence rate obtained via the P-L inequality depends on a so-called Hoffman parameter. Unfortunately, it is not clear how to characterize or bound the Hoffman parameter, although from the simulation results it is conceivable that this parameter must be correlated with the sparsity level. In contrast, our results state that the algorithm converges faster for sparser solutions and that a phase transition happens when the solution is dense enough, which fits the empirical observation better. Second, their results require the epigraph of the regularizer to be a polyhedral set, and thus are not applicable to popular models such as group Lasso.

Li et al. (2016) consider the sparse linear problem with an ℓ0-norm constraint and solve it using a stochastic variance reduced gradient hard thresholding algorithm (SVR-GHT), where the proof also uses the idea of RSC. In contrast, we establish a unified framework that provides more general results, covering not only sparse linear regression but also the group sparsity model, the corrupted data model (corrected Lasso), and SCAD, among others.

2 Problem Setup and Notations

In this paper, we consider two setups, namely the convex but not strongly convex case, and the non-convex case. For the first one, we consider the following form:

    θ̂ ∈ argmin_{θ: g(θ) ≤ ρ}  { F(θ) := (1/n) Σ_{i=1}^n f_i(θ) + λ g(θ) },     (2)

where ρ is a pre-defined radius and the regularization function g is a norm. The functions f_i, and consequently F, are convex; yet neither the f_i nor F is necessarily strongly convex. We remark that the side constraint in (2) is included without loss of generality: it is easy to see that for the unconstrained case, the optimal solution satisfies g(θ̂) ≤ ρ for a radius ρ determined by λ and a quantity that lower bounds the f_i for all i.

For the second case we consider the following non-convex estimator:

    θ̂ ∈ argmin_{θ: g̃(θ) ≤ ρ}  { F(θ) := (1/n) Σ_{i=1}^n f_i(θ) + g_{λ,μ}(θ) },     (3)

where each f_i is convex and g_{λ,μ} is a non-convex regularizer depending on a tuning parameter λ and a parameter μ explained in Section 2.3. This M-estimator also includes a side constraint defined by a function g̃, which needs to be convex and to admit a suitable lower bound; g̃ is closely related to g_{λ,μ}, and we defer the details to Section 2.3. As in the first case, the side constraint is added without loss of generality.

2.1 Restricted strong convexity (RSC)

A central concept we use in this paper is restricted strong convexity (RSC), initially proposed in Negahban et al. (2009) and further explored in Agarwal et al. (2010); Loh and Wainwright (2013). A function F̄ satisfies restricted strong convexity with respect to a norm g and with parameters (ν, τ) over a set Ω′ if for all θ′, θ ∈ Ω′,

    F̄(θ′) − F̄(θ) − ⟨∇F̄(θ), θ′ − θ⟩ ≥ (ν/2) ‖θ′ − θ‖₂² − τ g²(θ′ − θ),     (4)

where the second term on the right-hand side is called the tolerance, which essentially measures how far F̄ deviates from being strongly convex. Clearly, when τ = 0, the RSC condition reduces to strong convexity. However, strong convexity can be restrictive in some cases. For example, it is well known that strong convexity does not hold for Lasso or logistic regression in the high-dimensional regime where the dimension p is larger than the number of data points n. In contrast, in many such problems RSC holds with a relatively small tolerance. Recall that F̄ is convex, which implies that the left-hand side of (4) is nonnegative. We remark that in our analysis, we only require RSC to hold for the average loss F̄ = (1/n) Σ_{i=1}^n f_i, rather than for the individual loss functions f_i. This agrees with practice, where RSC does not hold for the individual f_i in general.
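
For intuition, for the least-squares loss with rows x_i drawn i.i.d. from N(0, Σ), results in the RSC literature cited above show that, with high probability, a bound of the following form holds (stated here with generic constants c₁, c₂ rather than the exact values used in the paper):

```latex
\frac{\|X\Delta\|_2^2}{2n}
  \;\ge\; c_1\,\lambda_{\min}(\Sigma)\,\|\Delta\|_2^2
  \;-\; c_2\,\frac{\log p}{n}\,\|\Delta\|_1^2
  \qquad\text{for all } \Delta \in \mathbb{R}^p,
```

which is exactly condition (4) for the squared loss with g the ℓ1 norm, ν ≍ λ_min(Σ), and τ ≍ log p / n.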

2.2 Assumptions on the regularizer g

RSC is a useful property because for many formulations, the tolerance term is small along certain directions. To formalize this, we need the concept of decomposable regularizers. Given a pair of subspaces M ⊆ M̄ in R^p, the orthogonal complement of M̄ is

    M̄^⊥ = { v ∈ R^p : ⟨u, v⟩ = 0 for all u ∈ M̄ }.

M is known as the model subspace, while M̄^⊥ is called the perturbation subspace, representing deviations from the model subspace. A regularizer g is decomposable w.r.t. (M, M̄^⊥) if

    g(θ + γ) = g(θ) + g(γ)

for all θ ∈ M and γ ∈ M̄^⊥. Given the regularizer g, the subspace compatibility of a subspace M̄ is given by

    Ψ(M̄) = sup_{u ∈ M̄ \ {0}} g(u) / ‖u‖₂.

For more discussion and intuition on decomposable regularizers, we refer the reader to Negahban et al. (2009). Some examples of decomposable regularizers are in order.

ℓ1 norm regularization

The ℓ1 norm is widely used as a regularizer to encourage sparse solutions. In this case, the subspaces are chosen according to r-sparse vectors in p-dimensional space. Specifically, given a subset S ⊆ {1, ..., p} with cardinality r, we let

    M(S) = { θ ∈ R^p : θ_j = 0 for all j ∉ S }.

In this case, we let M̄(S) = M(S), and it is easy to see that

    ‖θ + γ‖₁ = ‖θ‖₁ + ‖γ‖₁ for all θ ∈ M(S) and γ ∈ M̄^⊥(S),

which implies that the ℓ1 norm is decomposable with respect to (M(S), M̄^⊥(S)); moreover, Ψ(M̄(S)) = √r.

Group sparsity regularization

Group sparsity extends the concept of sparsity and has found a wide variety of applications (Yuan and Lin, 2006). For simplicity, we consider the case of non-overlapping groups. Suppose all p features are grouped into N_G disjoint blocks, say G = {G_1, ..., G_{N_G}}. The grouped (1, q) norm is defined as

    ‖θ‖_{G,q} = Σ_{g=1}^{N_G} ‖θ_{G_g}‖_q,

where q ≥ 1. Notice that group Lasso is the special case with q = 2. Since the blocks are disjoint, we can define the subspaces in the following way. For a subset S_G ⊆ {1, ..., N_G} with cardinality s_G, we define the subspace

    M(S_G) = { θ ∈ R^p : θ_{G_g} = 0 for all g ∉ S_G }.

Similar to Lasso, we have M̄(S_G) = M(S_G). The orthogonal complement is

    M̄^⊥(S_G) = { θ ∈ R^p : θ_{G_g} = 0 for all g ∈ S_G }.

It is not hard to see that

    ‖θ + γ‖_{G,q} = ‖θ‖_{G,q} + ‖γ‖_{G,q}

for any θ ∈ M(S_G) and γ ∈ M̄^⊥(S_G).

2.3 Assumptions on the non-convex regularizer

In the non-convex case, we consider regularizers that are separable across coordinates, i.e., g_{λ,μ}(θ) = Σ_{j=1}^p ḡ_{λ,μ}(θ_j). Besides separability, we make additional assumptions on g_{λ,μ}. For the univariate function ḡ_{λ,μ}(t), we assume:

  1. ḡ_{λ,μ} satisfies ḡ_{λ,μ}(0) = 0 and is symmetric around zero (i.e., ḡ_{λ,μ}(t) = ḡ_{λ,μ}(−t)).

  2. On the nonnegative real line, ḡ_{λ,μ} is nondecreasing.

  3. For t > 0, the function ḡ_{λ,μ}(t)/t is nonincreasing in t.

  4. ḡ_{λ,μ} is differentiable at all t ≠ 0 and subdifferentiable at t = 0, with lim_{t→0+} ḡ′_{λ,μ}(t) = λ L_g for a constant L_g.

  5. The function ḡ_{λ,μ}(t) + (μ/2) t² is convex.

We provide two examples satisfying the above assumptions.

SCAD penalty:

    ḡ_{λ,a}(t) = λ|t| for |t| ≤ λ;  (2aλ|t| − t² − λ²) / (2(a − 1)) for λ < |t| ≤ aλ;  (a + 1)λ²/2 for |t| > aλ,

where a > 2 is a fixed parameter. It satisfies the assumptions with L_g = 1 and μ = 1/(a − 1) (Loh and Wainwright, 2013).

MCP penalty:

    ḡ_{λ,b}(t) = λ ∫_0^{|t|} (1 − z/(λb))_+ dz,

where b > 0 is a fixed parameter. MCP satisfies the assumptions with L_g = 1 and μ = 1/b (Loh and Wainwright, 2013).
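
For concreteness, the following Python sketch evaluates the two penalties in their usual closed forms (our function names; the default parameter values are illustrative choices):

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty (Fan and Li, 2001), applied coordinate-wise; requires a > 2."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
            (a + 1) * lam ** 2 / 2,
        ),
    )

def mcp_penalty(t, lam, b=3.0):
    """MCP penalty, applied coordinate-wise; requires b > 0."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= b * lam, lam * t - t ** 2 / (2 * b), b * lam ** 2 / 2)

# Both penalties agree with lam * |t| near the origin and flatten out for
# large |t|, which is the source of their non-convexity.
print(scad_penalty([0.1, 1.0, 10.0], lam=0.5))
print(mcp_penalty([0.1, 1.0, 10.0], lam=0.5))
```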

3 Main Results

In this section, we present our main theorems, which assert linear convergence of SVRG under RSC, for both the convex and the non-convex setup. We then instantiate them on the sparsity model, the group sparsity model, linear regression with corrupted covariates, and linear regression with the SCAD regularizer. All proofs are deferred to the appendix.

We analyze the vanilla proximal SVRG (see Algorithm 1) proposed in Xiao and Zhang (2014) to solve Problem (2). We remark that our proof can easily be adapted to other accelerated versions of SVRG, e.g., those with non-uniform sampling. The algorithm contains an inner loop and an outer loop. We use superscripts to denote steps of the outer iteration and subscripts to denote steps of the inner iteration throughout the paper. For the non-convex problem (3), we adapt SVRG into Algorithm 2. The idea of Algorithm 2 is to rewrite the objective as the sum of a smooth (possibly non-convex) part and a convex regularizer,

    (1/n) Σ_{i=1}^n f_i(θ) + g_{λ,μ}(θ) = [ (1/n) Σ_{i=1}^n f_i(θ) − (μ/2)‖θ‖₂² ] + [ g_{λ,μ}(θ) + (μ/2)‖θ‖₂² ],

and to apply proximal SVRG to this decomposition. Since g_{λ,μ}(θ) + (μ/2)‖θ‖₂² is convex, the proximal step in the algorithm is well defined. Also notice that the next snapshot θ̃ is randomly picked from the inner iterates rather than being their average.

  Input: update frequency m, stepsize η, initialization θ̃^0
  for s = 1, 2, ... do
      θ̃ = θ̃^{s−1},  θ_0 = θ̃,  compute ∇F̄(θ̃) = (1/n) Σ_{i=1}^n ∇f_i(θ̃)
      for k = 0 to m − 1 do
         Pick i_k uniformly at random from {1, ..., n}
         v_k = ∇f_{i_k}(θ_k) − ∇f_{i_k}(θ̃) + ∇F̄(θ̃)
         θ_{k+1} = argmin_{g(θ) ≤ ρ} { (1/2)‖θ − (θ_k − η v_k)‖₂² + η λ g(θ) }
      end for
      θ̃^s = (1/m) Σ_{k=1}^m θ_k
  end for
Algorithm 1 Convex Proximal SVRG
  Input: update frequency m, stepsize η, initialization θ̃^0
  for s = 1, 2, ... do
      θ̃ = θ̃^{s−1},  θ_0 = θ̃,  compute ∇F̄(θ̃) = (1/n) Σ_{i=1}^n ∇f_i(θ̃)
      for k = 0 to m − 1 do
         Pick i_k uniformly at random from {1, ..., n}
         v_k = ∇f_{i_k}(θ_k) − ∇f_{i_k}(θ̃) + ∇F̄(θ̃) − μ θ_k
         θ_{k+1} = argmin_{g̃(θ) ≤ ρ} { (1/2)‖θ − (θ_k − η v_k)‖₂² + η ( g_{λ,μ}(θ) + (μ/2)‖θ‖₂² ) }
      end for
      θ̃^s = θ_k for a uniformly randomly chosen k ∈ {1, ..., m}
  end for
Algorithm 2 Non-Convex Proximal SVRG
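
To complement the pseudocode, here is a minimal, self-contained Python sketch of Algorithm 1 specialized to the Lasso instance: the proximal step becomes soft-thresholding, and the side constraint of Problem (2) is omitted, assuming ρ is large enough to be inactive. The parameter values in the demo are our own illustrative choices, not the paper's.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Proximal operator of thresh * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def prox_svrg_lasso(X, y, lam, eta, m, n_stages, seed=0):
    """Sketch of Algorithm 1 (convex proximal SVRG) for
    (1/n) * sum_i 0.5 * (y_i - <x_i, theta>)^2 + lam * ||theta||_1."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    theta_tilde = np.zeros(p)
    for _ in range(n_stages):
        full_grad = X.T @ (X @ theta_tilde - y) / n   # full gradient at the snapshot
        theta = theta_tilde.copy()
        running_sum = np.zeros(p)
        for _ in range(m):
            i = rng.integers(n)
            g_cur = (X[i] @ theta - y[i]) * X[i]            # grad f_i at theta_k
            g_snap = (X[i] @ theta_tilde - y[i]) * X[i]     # grad f_i at the snapshot
            v = g_cur - g_snap + full_grad                  # variance-reduced gradient
            theta = soft_threshold(theta - eta * v, eta * lam)  # proximal step
            running_sum += theta
        theta_tilde = running_sum / m   # next snapshot: average of inner iterates
    return theta_tilde

# Illustrative demo on synthetic sparse data.
rng = np.random.default_rng(1)
n, p, r = 200, 50, 5
theta_star = np.zeros(p)
theta_star[:r] = 1.0
X = rng.standard_normal((n, p))
y = X @ theta_star + 0.1 * rng.standard_normal(n)
theta_hat = prox_svrg_lasso(X, y, lam=0.1, eta=0.005, m=2 * n, n_stages=30)
print("estimated support:", np.flatnonzero(np.abs(theta_hat) > 0.1))
```

As reconstructed above, the non-convex variant (Algorithm 2) would differ mainly in the μ-shifted gradient, the convex surrogate used in the proximal step, and the random choice of the next snapshot.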

3.1 Results for convex F

To avoid notational clutter, we define the following terms that appear frequently in our theorem and corollaries.

Definition 1 (List of notations).
  • Dual norm of g: g*(v) = sup_{g(u) ≤ 1} ⟨u, v⟩.

  • Unknown true parameter: θ*.

  • Optimal solution of Problem (2): θ̂.

  • Modified restricted strong convexity parameter ν̄: the RSC parameter ν of F̄ reduced by a term proportional to τ Ψ²(M̄).

  • Contraction factor: α = 1/(ν̄ η (1 − 4Lη) m) + 4Lη (m + 1)/((1 − 4Lη) m), where η is the step size, m is the update frequency, and L is the smoothness parameter of the f_i.

  • Statistical tolerance: Λ, a term proportional to the RSC tolerance τ and to the squared statistical error between θ̂ and θ*.

The main theorem bounds the optimality gap F(θ̃^s) − F(θ̂).

Theorem 1.

In Problem (2), suppose each f_i is L-smooth, the true parameter θ* is feasible, i.e., g(θ*) ≤ ρ, the empirical loss F̄ = (1/n) Σ_{i=1}^n f_i satisfies RSC with parameters (ν, τ), the regularizer g is decomposable w.r.t. a subspace pair (M, M̄^⊥) such that θ* ∈ M, and suppose ν̄ > 0 and the step size η and update frequency m are chosen so that α < 1. Consider any regularization parameter λ satisfying λ ≥ c g*(∇F̄(θ*)) for some constant c. Then for any tolerance ε ≥ Λ, if

    s ≥ log( (F(θ̃^0) − F(θ̂)) / ε ) / log(1/α),

then F(θ̃^s) − F(θ̂) ≤ ε with probability at least 1 − c′, where c′ < 1 is a universal positive constant.

To put Theorem 1 in context, some remarks are in order.

  1. Compared with the result for standard SVRG (with strong convexity) in Xiao and Zhang (2014), the difference is that we use the modified restricted strong convexity parameter ν̄ rather than the strong convexity parameter. Indeed, the high-level idea of the proof is to replace strong convexity by RSC. Setting η = c₀/L, where c₀ is a universal positive constant, as in Xiao and Zhang (2014), and m = O(L/ν̄) such that α < 1, we obtain the gradient complexity O((n + L/ν̄) log(1/ε)) when ε ≥ Λ (2m gradients in the inner loop and n gradients for the outer loop).

  2. In many statistical models (see the corollaries for concrete examples), we can choose a suitable subspace pair (M, M̄^⊥), step size η, and update frequency m to obtain α < 1 and ν̄ bounded away from zero. For instance, in Lasso, the RSC tolerance satisfies τ ≍ log p / n and Ψ²(M̄) = r (suppose the feature vectors are sampled from a Gaussian distribution), so when θ* is sparse (i.e., r is small), ν̄ stays bounded away from zero whenever r log p / n is sufficiently small.

  3. A smaller tolerance τ (e.g., a sparser θ*) leads to a larger ν̄, and thus to a smaller contraction factor α and a smaller statistical tolerance Λ, which leads to faster convergence.

  4. In terms of the tolerance, notice that in cases like sparse regression we can choose the subspace pair such that θ* ∈ M, and hence the tolerance is controlled by τ Ψ²(M̄) ‖θ̂ − θ*‖₂². Under the settings in remarks 1 and 2, and combined with the fact that ‖θ̂ − θ*‖₂² = O(r log p / n) in Lasso, we have Λ = o(‖θ̂ − θ*‖₂²), i.e., the tolerance is dominated by the statistical error of the model.

Therefore, Theorem 1 indeed states that the optimality gap decreases geometrically until it reaches the statistical tolerance. Moreover, this statistical tolerance is dominated by the statistical error ‖θ̂ − θ*‖₂², and thus can be ignored from a statistical perspective when solving formulations such as sparse regression via Lasso. It is instructive to instantiate the above general result on several concrete statistical models, by choosing an appropriate subspace pair and checking the RSC condition, which we do in the following subsections.

3.1.1 Sparse regression

The first model we consider is Lasso, where f_i(θ) = ½(y_i − ⟨x_i, θ⟩)² and g(θ) = ‖θ‖₁. More concretely, we consider a model where each data point x_i ∈ R^p is i.i.d. sampled from a zero-mean normal distribution, i.e., x_i ∼ N(0, Σ). We denote the data matrix by X ∈ R^{n×p} and the smallest eigenvalue of Σ by σ_min(Σ). The observation is generated by y_i = ⟨x_i, θ*⟩ + ξ_i, where ξ_i is zero-mean sub-Gaussian noise with variance σ². We use X_j to denote the j-th column of X. Without loss of generality, we require X to be column-normalized, i.e., ‖X_j‖₂/√n ≤ 1 for all j = 1, ..., p. Here, the constant 1 is chosen arbitrarily to simplify the exposition, as we can always rescale the data.

Corollary 1 (Lasso).

Suppose θ* is supported on a subset of cardinality at most r, and we choose the regularization parameter λ such that λ ≥ c₁ σ √(log p / n), where c₁ is a universal positive constant. Then for any tolerance ε ≥ Λ, we have

    F(θ̃^s) − F(θ̂) ≤ ε

with probability at least 1 − c₂, for

    s ≥ log( (F(θ̃^0) − F(θ̂)) / ε ) / log(1/α),

where c₂ < 1 is a universal positive constant.

We offer some discussion to put this corollary in context. To achieve statistical consistency for Lasso, it is necessary to have n ≳ r log p (Negahban et al., 2009). Under such a condition, the term τ Ψ²(M̄) ≍ r log p / n is small, which implies that ν̄ is bounded away from zero. Moreover, if we set m = 2n following the standard practice of SVRG (Johnson and Zhang, 2013; Xiao and Zhang, 2014) and set η = c₀/L, then α < 1, which guarantees the convergence of the algorithm. The requirement λ ≥ c₁ σ √(log p / n) is commonly used to prove the statistical properties of Lasso (Negahban et al., 2009). Further notice that under this setting, the statistical tolerance Λ is of lower order than ‖θ̂ − θ*‖₂², which is the statistical error of the optimal solution of Lasso; hence it can be ignored from the statistical viewpoint. Combining these together, Corollary 1 states that the objective gap decreases geometrically until it achieves the fundamental statistical limit of Lasso.
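
For reference, the Lasso statistical error rate that this discussion compares against is the standard bound from Negahban et al. (2009): under the stated design and noise assumptions, with λ ≍ σ√(log p / n) one has, with high probability and up to constants depending on Σ,

```latex
\|\hat\theta - \theta^*\|_2^2 \;\lesssim\; \frac{\sigma^2\, r \log p}{n}.
```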

3.1.2 Group sparsity model

In many applications, we need to consider group sparsity, i.e., a group of coefficients is set to zero simultaneously. We assume the p features are partitioned into N_G disjoint groups, i.e., G = {G_1, ..., G_{N_G}}, and the regularization is the grouped norm g(θ) = ‖θ‖_{G,q} = Σ_{g=1}^{N_G} ‖θ_{G_g}‖_q with q ≥ 1. For example, group Lasso corresponds to q = 2. Other choices of q include q = ∞, which is suggested in Turlach et al. (2005).

Besides the RSC condition, we need the following group counterpart of the column normalization condition: given a group G_g of size m_g, let X_{G_g} ∈ R^{n×m_g} denote the corresponding submatrix of X, and define the associated operator norm ‖X_{G_g}‖_{q→2} = sup_{‖u‖_q ≤ 1} ‖X_{G_g} u‖₂; we require that

    ‖X_{G_g}‖_{q→2} / √n ≤ 1 for all g = 1, ..., N_G.

Observe that when all groups are singletons, this condition reduces to the column normalization condition. We assume the data generation model y_i = ⟨x_i, θ*⟩ + ξ_i, with x_i ∼ N(0, Σ) and ξ_i zero-mean sub-Gaussian noise.

We discuss the case of q = 2, i.e., group Lasso, in the following.

Corollary 2.

Suppose the dimension of θ is p and each group has m₀ parameters, i.e., p = N_G m₀; s_G is the number of non-zero groups; and ξ_i is zero-mean sub-Gaussian noise with variance σ². Suppose the regularization parameter is chosen such that λ ≥ c₁ σ ( √(m₀/n) + √(log N_G / n) ), and the step size η and update frequency m are chosen such that α < 1, where the relevant constants are strictly positive numbers that depend only on Σ. Then for any tolerance ε ≥ Λ, we have

    F(θ̃^s) − F(θ̂) ≤ ε

with high probability, for

    s ≥ log( (F(θ̃^0) − F(θ̂)) / ε ) / log(1/α).     (5)

Notice that to ensure ν̄ > 0, it suffices to have n ≳ s_G (m₀ + log N_G). This is a mild condition, as it is needed to guarantee the statistical consistency of group Lasso (Negahban et al., 2009). In practice, this condition is not hard to satisfy when s_G and m₀ are small, and we can then easily adjust η and m to make α < 1. Since ‖θ̂ − θ*‖₂² is of the order (s_G m₀ + s_G log N_G)/n, if we set m = 2n and η = c₀/L, we have Λ = o(‖θ̂ − θ*‖₂²). Thus, similar to the case of Lasso, the objective gap decreases geometrically up to the scale Λ, which is dominated by the statistical error of the model.
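
For the grouped norm with q = 2, the inner proximal step of Algorithm 1 reduces to block soft-thresholding; the following Python sketch (our own naming, groups assumed disjoint) shows that operation.

```python
import numpy as np

def prox_group_l2(v, groups, thresh):
    """Block soft-thresholding: prox of thresh * sum_g ||theta_{G_g}||_2.

    groups: list of index arrays forming a disjoint partition of {0, ..., p-1}.
    Each block is shrunk as theta_{G_g} = max(0, 1 - thresh/||v_{G_g}||_2) * v_{G_g}.
    """
    theta = np.zeros_like(v)
    for idx in groups:
        block_norm = np.linalg.norm(v[idx])
        if block_norm > thresh:
            theta[idx] = (1.0 - thresh / block_norm) * v[idx]
    return theta

# Example with 6 coordinates split into 3 groups of size 2.
v = np.array([3.0, 4.0, 0.1, -0.1, -2.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(prox_group_l2(v, groups, thresh=1.0))   # -> [2.4, 3.2, 0.0, 0.0, -1.0, 0.0]
```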

3.1.3 Extension to generalized linear models

The results on Lasso and group Lasso readily extend to generalized linear models, where we consider the loss

    f_i(θ) = Φ(⟨x_i, θ⟩) − y_i ⟨x_i, θ⟩,

with g(θ) = ‖θ‖₁ and an additional side constraint ‖θ‖₂ ≤ c₀, where c₀ is a universal constant (Loh and Wainwright, 2013). This requirement is essential: for instance, for the logistic link, the Hessian of the loss approaches zero as its argument diverges. Notice that when Φ(t) = t²/2, the problem reduces to Lasso. The RSC condition admits the form

    F̄(θ′) − F̄(θ) − ⟨∇F̄(θ), θ′ − θ⟩ ≥ (ν/2) ‖θ′ − θ‖₂² − τ ‖θ′ − θ‖₁².

For a broad class of log-linear models, the RSC condition holds with τ ≍ log p / n. Therefore, we obtain the same results as those of Lasso, modulo changes of constants. For more details on RSC conditions in generalized linear models, we refer the reader to Negahban et al. (2009).
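
As a concrete GLM instance, the logistic-regression loss with Φ(t) = log(1 + e^t) and labels y_i ∈ {0, 1}, together with its per-sample gradient, can be written as follows (an illustrative sketch with our own naming):

```python
import numpy as np

def logistic_f_i(theta, x_i, y_i):
    """Per-sample GLM loss Phi(<x_i, theta>) - y_i * <x_i, theta>,
    with Phi(t) = log(1 + exp(t)) and y_i in {0, 1}."""
    t = x_i @ theta
    return np.logaddexp(0.0, t) - y_i * t

def logistic_grad_f_i(theta, x_i, y_i):
    """Gradient: (sigmoid(<x_i, theta>) - y_i) * x_i. Note that
    Phi''(t) = sigmoid(t) * (1 - sigmoid(t)) tends to zero as |t| grows,
    which is why the side constraint on ||theta||_2 is needed."""
    sigmoid = 1.0 / (1.0 + np.exp(-(x_i @ theta)))
    return (sigmoid - y_i) * x_i
```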

3.2 Results for non-convex F

We define the following notations.

  • Modified restricted strong convexity parameter ν̄: the RSC parameter ν reduced both by the non-convexity parameter μ of the regularizer and by a term cτr, where c is a constant and r is the cardinality of the support of θ*.

  • Contraction factor:

    α = 1/(ν̄ η (1 − 4Lη) m) + 4Lη (m + 1)/((1 − 4Lη) m).     (6)

  • Statistical tolerance:

    Λ,     (7)–(8)

    a term proportional to the RSC tolerance τ and to the squared statistical error between θ̂ and θ*.

Theorem 2.

In Problem (3), suppose each f_i is L-smooth and convex, the true parameter θ* is feasible, the regularizer g_{λ,μ} satisfies the assumptions in Section 2.3, the empirical loss F̄ satisfies RSC with parameters (ν, τ), ν̄ > 0, and the step size η and update frequency m are chosen so that α < 1. Suppose θ̂ is the global optimum of (3), and consider any choice of the regularization parameter λ such that λ ≥ c g̃*(∇F̄(θ*)) for some positive constant c. Then for any tolerance ε ≥ Λ, if

    s ≥ log( (F(θ̃^0) − F(θ̂)) / ε ) / log(1/α),

then F(θ̃^s) − F(θ̂) ≤ ε with probability at least 1 − c′, where c′ < 1 is a universal positive constant.

We provide some remarks to make the theorem more interpretable.

  1. We require the RSC parameter ν to dominate the non-convexity parameter μ to ensure ν̄ > 0; in other words, μ cannot be larger than ν. In particular, if μ ≥ ν, then ν̄ ≤ 0 and it is not possible to obtain α < 1 by tuning m and the learning rate η.

  2. We consider a concrete case to get a sense of the values of the different terms we defined. Suppose ν is bounded away from zero and μ is small, and we set m = 2n and η = c₀/L, which is typical for SVRG; then the contraction factor satisfies α < 1. Furthermore, the statistical tolerance satisfies Λ = o(‖θ̂ − θ*‖₂²) when the model is sparse, i.e., the tolerance is dominated by the statistical error of the model.

3.2.1 Linear regression with SCAD

The first non-convex model we consider is linear regression with the SCAD regularizer. That is, f_i(θ) = ½(y_i − ⟨x_i, θ⟩)² and g_{λ,μ} is the SCAD penalty with parameters λ and a, so that μ = 1/(a − 1). The data are generated in the same way as in the Lasso example.

Corollary 3 (Linear regression with SCAD).

Suppose we have n i.i.d. observations (x_i, y_i), θ* is supported on a subset of cardinality at most r, θ̂ is the global optimum of (3), the regularization parameter satisfies λ ≥ c σ √(log p / n) for some positive constant c, and m and η in the algorithm are chosen such that α < 1. Then for any tolerance ε ≥ Λ,

where α and Λ are defined in (6) and (7)–(8) with μ = 1/(a − 1), if

    s ≥ log( (F(θ̃^0) − F(θ̂)) / ε ) / log(1/α),

then F(θ̃^s) − F(θ̂) ≤ ε with probability at least 1 − c′, where c′ < 1 is a universal positive constant.

As a concrete setting, suppose n ≳ r log p, the SCAD parameter a is a fixed constant larger than 2, m = 2n, and η = c₀/L; then ν̄ > 0 and α < 1. Notice that in this setting the statistical tolerance is again dominated by the statistical error of the model, as in the convex examples.