1 Introduction
Stochastic optimization [2] is a class of powerful optimization tools for solving large-scale problems in machine learning, pattern recognition and computer vision. For example, stochastic gradient descent (SGD [2]) is an efficient method for solving the following optimization problem, which is fundamental to machine learning:

(1)  min_x (1/n) Σ_{i=1}^n f_i(x) + g(x),

where f_i denotes the loss function on the i-th sample, and g denotes the regularization function. The problem (1) includes many useful models such as the support vector machine (SVM), logistic regression and neural networks. When the sample size n
is large, even first-order methods become computationally burdensome, since their per-iteration complexity is O(n). SGD instead computes the gradient of only one sample per iteration, so its per-iteration complexity is only O(1). Despite this scalability, due to the variance of the stochastic process, the stochastic gradient is much noisier than the batch gradient. The step size therefore has to be decreased gradually as stochastic learning proceeds, leading to slower convergence than the batch method. Recently, a number of accelerated algorithms have been proposed to reduce this variance. For example, stochastic average gradient (SAG [3]) obtains a fast convergence rate by incorporating the old gradients estimated in previous iterations. Stochastic dual coordinate ascent (SDCA [4]) performs stochastic coordinate ascent on the dual problem and also obtains a fast convergence rate. Moreover, the accelerated randomized proximal coordinate gradient (APCG [5]) method accelerates SDCA by using Nesterov's acceleration [6]. However, these fast methods require considerable space to store the old gradients or dual variables. Thus, stochastic variance reduced gradient (SVRG [7, 8]) methods were proposed, which enjoy a fast convergence rate with no extra space for the intermediate gradients or dual variables. Moreover, [9] proposes the SAGA method, which extends SAG and enjoys a better theoretical convergence rate than both SAG and SVRG. Recently, [10] presents an accelerated SVRG using Nesterov's acceleration technique [6]. Moreover, [11] proposes a momentum-accelerated SVRG method (Katyusha) that exploits the strong convexity parameter and reaches a faster convergence rate. In addition, [12] proposes a class of stochastic composite optimization methods for sparse learning, where the regularizer is sparsity-inducing, such as the ℓ1 norm and the nuclear norm.

TABLE I
Methods | Convergence rate | Problems

Nonconvex incremental ADMM [13] | Unknown | ✓
NESTT [14] | | ✓
Nonconvex mini-batch stochastic ADMM (ours) | | ✓ ✓
Nonconvex SVRG-ADMM (ours and [15]) | | ✓ ✓
Nonconvex SAGA-ADMM (ours) | | ✓ ✓
Though the above methods can effectively solve many problems in machine learning, they still struggle with some complicated problems whose regularization function is nonseparable and nonsmooth:

(2)  min_x (1/n) Σ_{i=1}^n f_i(x) + g(Ax),
where A is a given matrix, f_i denotes the loss function, and g denotes the regularization function. With regard to g, we are interested in sparsity-inducing regularization functions, e.g., the ℓ1 norm and the nuclear norm. The problem (2) includes the graph-guided fused Lasso [16], the overlapping group Lasso [17], and the generalized Lasso [18]. It is well known that the alternating direction method of multipliers (ADMM [19, 20, 21]) is an efficient optimization method for the problem (2). Specifically, we can introduce an auxiliary variable y = Ax to put the problem (2) into the general ADMM form. When the sample size n is large, the offline or batch ADMM is unsuitable for large-scale learning problems, because it computes the empirical risk over all training samples at each iteration. Thus, online and stochastic versions of ADMM [22, 23, 24] have been developed for large-scale problems. Due to the variance of the stochastic process, these stochastic ADMMs also suffer from a slow convergence rate. Recently, some accelerated stochastic ADMMs have been proposed to reduce this variance. For example, SAG-ADMM [25] additionally uses the previously estimated gradients. An accelerated stochastic ADMM [26] uses Nesterov's acceleration [6]. SDCA-ADMM [27] obtains a linear convergence rate for the strongly convex problem by solving its dual problem. SCAS-ADMM [28] and SVRG-ADMM [29] reach fast convergence rates with no extra space for the previous gradients or dual variables. Moreover, [30] proposes an accelerated SVRG-ADMM using the momentum acceleration technique. More recently, [31] proposes a fast stochastic ADMM that achieves a nonergodic convergence rate for the convex problem. In addition, an adaptive stochastic ADMM [32] uses adaptive gradients.
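The auxiliary-variable splitting mentioned above can be sketched as follows (a standard reformulation, with f_i, g and A as in problem (2)):

```latex
% Problem (2) rewritten in the general ADMM form via the auxiliary variable y = Ax
\min_{x}\ \frac{1}{n}\sum_{i=1}^{n} f_i(x) + g(Ax)
\quad\Longleftrightarrow\quad
\min_{x,y}\ \frac{1}{n}\sum_{i=1}^{n} f_i(x) + g(y)
\quad\text{s.t.}\quad Ax - y = 0.
```

For the graph-guided fused Lasso, for instance, g is an ℓ1 norm and the rows of A encode the edge differences of a given graph.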
Since the penalty parameter in ADMM can affect convergence [33], another adaptive stochastic ADMM [34] uses adaptive penalty parameters.
So far, the above studies of stochastic optimization methods rely heavily on strong convexity or convexity. However, there exist many useful nonconvex models in machine learning, such as nonconvex empirical risk minimization [35] and deep learning [36]. Thus, the study of nonconvex optimization methods is much needed. Recently, some works have focused on stochastic gradient methods for large-scale nonconvex optimization. For example, [37, 38] established the iteration complexity of SGD for obtaining a stationary solution of nonconvex problems. [39, 40, 41] proved that variance-reduced stochastic gradient methods such as the nonconvex SVRG and SAGA reach a lower iteration complexity. At the same time, [42] proved that variance-reduced stochastic gradient methods reach a similarly low iteration complexity for nonconvex nonsmooth composite problems. More recently, [43] proposes a faster nonconvex stochastic optimization method (Natasha) that exploits the strong nonconvexity parameter. [44] proposes a faster gradient-based nonconvex optimization method by using the catalyst approach of [45]. Similarly, the above nonconvex methods struggle with some complicated nonconvex problems, such as nonconvex graph-guided regularized risk minimization [1] and tensor decomposition
[46]. Recently, some works [47, 48, 49, 50, 46] have begun to study ADMM for nonconvex optimization, but they focus only on deterministic ADMMs. Because they compute the empirical loss over all training examples at each iteration, these nonconvex ADMMs are not well suited to large-scale learning problems. Recently, [13] proposed a distributed, asynchronous and incremental algorithm based on ADMM for large-scale nonconvex problems, but this method cannot handle the nonconvex problem (2) with nonseparable and nonsmooth regularizers such as the graph-guided fused Lasso and the overlapping group Lasso. A nonconvex primal-dual splitting (NESTT [14]) method was proposed for distributed and stochastic optimization, but it also cannot handle the nonconvex problem (2). More recently, our initial manuscript [1] proposed stochastic ADMMs with variance reduction (e.g., nonconvex SVRG-ADMM and nonconvex SAGA-ADMM) for optimizing nonconvex problems with complicated structured regularizers such as the graph-guided fused Lasso, the overlapping group Lasso, and sparse-plus-low-rank penalties. In addition, our initial manuscript [1] and Zheng and Kwok's paper [15] simultaneously propose the nonconvex SVRG-ADMM method.^1 At present, to our knowledge, there still exist two important problems to be addressed:

^1 The first version of our manuscript [1] (https://arxiv.org/abs/1610.02758v1) proposes both the nonconvex SVRG-ADMM and SAGA-ADMM; it went online on Oct. 10, 2016. The first version of [15] (https://arxiv.org/abs/1604.07070v1) proposes only the convex SVRG-ADMM; it went online on Apr. 24, 2016 under the name 'Fast-and-Light Stochastic ADMM'. The second version of [15] (https://arxiv.org/abs/1604.07070v2) adds the nonconvex SVRG-ADMM; it went online on Oct. 12, 2016, renamed 'Stochastic Variance-Reduced ADMM'.
Is the general stochastic ADMM, without the VR technique, convergent for nonconvex optimization?

If convergent, what is the convergence rate of the general stochastic ADMM for nonconvex optimization?
In this paper, we provide positive answers to both questions by developing a class of mini-batch stochastic ADMMs for nonconvex optimization. Specifically, we study mini-batch stochastic ADMMs for optimizing the nonconvex nonsmooth problem below:

(3)  min_{x,y} f(x) + g(y),  s.t. Ax + By = c,

where f(x) = (1/n) Σ_{i=1}^n f_i(x), each f_i is a nonconvex and smooth loss function, g is nonsmooth and possibly nonconvex, and A, B and c denote the given matrices and vector, respectively. The problem (3) is inspired by structural risk minimization in machine learning [51]. In summary, our main contributions are fourfold:

We propose the mini-batch stochastic ADMM for nonconvex nonsmooth optimization. Moreover, we prove that, given an appropriate mini-batch size, the mini-batch stochastic ADMM reaches a fast convergence rate for obtaining a stationary point.

We extend the mini-batch stochastic gradient method to both the nonconvex SVRG-ADMM and the nonconvex SAGA-ADMM, proposed in our initial manuscript [1]. Moreover, we prove that these stochastic ADMMs reach the same convergence rate without any condition on the mini-batch size.

We provide a specific parameter selection for the step size of the stochastic gradients and the penalty parameter of the augmented Lagrangian function.

Numerical experiments demonstrate the effectiveness of the proposed algorithms.
In addition, Table I summarizes the convergence rates of the stochastic/incremental ADMMs for optimizing nonconvex problems.
1.1 Notations
∥·∥ denotes the Euclidean norm of a vector or the spectral norm of a matrix. I_d denotes a d-dimensional identity matrix. G ≻ 0 denotes a positive definite matrix G, and ∥x∥²_G = xᵀGx. Let A⁺ denote the generalized inverse of a matrix A, and let σ_min(A) denote the smallest eigenvalue of the matrix AᵀA. λ_max(G) and λ_min(G) denote the largest and smallest eigenvalues of a positive definite matrix G, respectively.

2 Nonconvex Mini-batch Stochastic ADMM without VR
In this section, we propose a mini-batch stochastic ADMM to optimize the nonconvex problem (3), and we study its convergence. In particular, we prove that, given an appropriate mini-batch size, it reaches a fast convergence rate.
First, we review the deterministic ADMM for solving the problem (3). The augmented Lagrangian function of (3) is defined as follows:

(4)  L_ρ(x, y, λ) = f(x) + g(y) − ⟨λ, Ax + By − c⟩ + (ρ/2)∥Ax + By − c∥²,

where λ is the Lagrange multiplier, and ρ > 0 is the penalty parameter. At the k-th iteration, the ADMM executes the following updates:

(5)  y_{k+1} = arg min_y L_ρ(x_k, y, λ_k);
(6)  x_{k+1} = arg min_x L_ρ(x, y_{k+1}, λ_k);
(7)  λ_{k+1} = λ_k − ρ(Ax_{k+1} + By_{k+1} − c).
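As a concrete illustration of the updates (5)-(7), the following minimal sketch runs the deterministic ADMM on a toy lasso-type instance of problem (3) with f(x) = ½∥Dx − e∥², g(y) = τ∥y∥₁, B = −I and c = 0. All names (D, e, tau, the problem sizes) are made-up toy data, not from the paper:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
n, d, m = 40, 10, 8
D = rng.standard_normal((n, d))      # data matrix for f(x) = 0.5*||Dx - e||^2
e = rng.standard_normal(n)
A = rng.standard_normal((m, d))      # structure matrix in the constraint Ax - y = 0
tau, rho = 0.1, 1.0                  # l1 weight and ADMM penalty parameter

x, y, lam = np.zeros(d), np.zeros(m), np.zeros(m)
H = D.T @ D + rho * A.T @ A          # the x-subproblem reduces to a linear solve
for _ in range(1000):
    # (5): y-update is a soft-thresholding (proximal) step
    y = soft_threshold(A @ x - lam / rho, tau / rho)
    # (6): x-update exactly minimizes the augmented Lagrangian in x
    x = np.linalg.solve(H, D.T @ e + A.T @ lam + rho * A.T @ y)
    # (7): dual update on the multiplier
    lam = lam - rho * (A @ x - y)

print(np.linalg.norm(A @ x - y))     # primal residual, should be tiny
```

Since this toy instance is convex, the iterates converge and the primal residual ∥Ax − y∥ vanishes; the nonconvex analysis in this paper covers the much harder case where the f_i are nonconvex.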
Next, we give a mild assumption, used also in general stochastic optimization [37, 38] and in the initial convex stochastic ADMM [24].
Assumption 1.
For the smooth function f, there exists a stochastic first-order oracle that returns a noisy estimate G(x; ξ) of the gradient ∇f(x), and the noisy estimate satisfies

(8)  E[G(x; ξ)] = ∇f(x),
(9)  E∥G(x; ξ) − ∇f(x)∥² ≤ σ²,

where the expectation is taken with respect to the random variable ξ. Let b = |I_k| be the size of the mini-batch I_k, where I_k indexes a set of b i.i.d. random variables {ξ_i}, and let the stochastic gradient be given by

∇f_{I_k}(x) = (1/b) Σ_{i∈I_k} G(x; ξ_i).

Clearly, we have

(10)  E[∇f_{I_k}(x)] = ∇f(x),
(11)  E∥∇f_{I_k}(x) − ∇f(x)∥² ≤ σ²/b.
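The unbiasedness of the mini-batch gradient and the 1/b variance scaling in (10)-(11) can be checked numerically. This sketch assumes toy least-squares losses f_i(x) = ½(a_iᵀx − b_i)²; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, b = 200, 5, 10                      # samples, dimension, mini-batch size
Adata = rng.standard_normal((n, d))
bdata = rng.standard_normal(n)
x = rng.standard_normal(d)

# per-sample gradients of f_i(x) = 0.5*(a_i' x - b_i)^2, stacked into shape (n, d)
G = Adata * (Adata @ x - bdata)[:, None]
full_grad = G.mean(axis=0)                # gradient of f = (1/n) sum_i f_i

# sanity check against the closed-form full gradient
print(np.allclose(full_grad, Adata.T @ (Adata @ x - bdata) / n))

# (11): for i.i.d. uniform sampling with replacement, the variance of the
# mini-batch gradient is the per-sample variance divided by b
sigma2 = np.mean(np.sum((G - full_grad) ** 2, axis=1))
draws = rng.integers(0, n, size=(20000, b))   # 20000 simulated mini-batches
mb = G[draws].mean(axis=1)                    # mini-batch gradients, shape (20000, d)
emp_var = np.mean(np.sum((mb - full_grad) ** 2, axis=1))
print(emp_var, sigma2 / b)                    # the two values should be close
```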
In the stochastic ADMM algorithm, we can update y and λ by (5) and (7), respectively, as in the deterministic ADMM. However, to update the variable x, we define an approximated function of the form:

(12)  ⟨∇f_{I_k}(x_k), x − x_k⟩ + (1/(2η))∥x − x_k∥²_G − ⟨λ_k, Ax + By_{k+1} − c⟩ + (ρ/2)∥Ax + By_{k+1} − c∥²,

where G ≻ 0 and η > 0 is a step size. Minimizing (12) with respect to x yields

x_{k+1} = (G/η + ρAᵀA)⁻¹ (G x_k/η − ∇f_{I_k}(x_k) + Aᵀλ_k − ρAᵀ(By_{k+1} − c)).

When the dimension is large, computing the inverse (G/η + ρAᵀA)⁻¹ is expensive, and storing this matrix may still be problematic. To avoid both issues, we can use the inexact Uzawa method [52] to linearize the last term in (12). In other words, we set G = rI_d − ηρAᵀA with r ≥ ηρλ_max(AᵀA) + 1 to ensure G ⪰ I_d. Then we have

(13)  x_{k+1} = x_k − (η/r)(∇f_{I_k}(x_k) − Aᵀλ_k + ρAᵀ(Ax_k + By_{k+1} − c)).
Finally, we summarize the algorithmic framework of the mini-batch stochastic ADMM (STOC-ADMM) in Algorithm 1.
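The inexact Uzawa linearization can be sanity-checked numerically: assuming the standard choice G = rI − ηρAᵀA, we have G/η + ρAᵀA = (r/η)I, so the exact minimizer of the quadratic model (12) coincides with the cheap gradient-type step (13). A minimal sketch with made-up toy data (v stands in for a mini-batch gradient):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 6, 4
A = rng.standard_normal((m, d))
B = -np.eye(m)                        # splitting with B = -I, c = 0
c = np.zeros(m)
x_k = rng.standard_normal(d)
y_next = rng.standard_normal(m)
lam = rng.standard_normal(m)
v = rng.standard_normal(d)            # stands in for the mini-batch gradient
eta, rho = 0.1, 1.0
r = eta * rho * np.linalg.norm(A.T @ A, 2) + 1.0   # ensures G >= I
G = r * np.eye(d) - eta * rho * A.T @ A            # inexact Uzawa choice

# minimizer of the quadratic model: (G/eta + rho*A'A) x = rhs
rhs = G @ x_k / eta - v + A.T @ lam - rho * A.T @ (B @ y_next - c)
x_exact = np.linalg.solve(G / eta + rho * A.T @ A, rhs)

# linearized update: a plain gradient-type step, no matrix inverse needed
x_cheap = x_k - (eta / r) * (v + rho * A.T @ (A @ x_k + B @ y_next - c) - A.T @ lam)

print(np.linalg.norm(x_exact - x_cheap))   # ~0: the two updates coincide
```

This is exactly why the linearized step avoids forming or inverting any d × d matrix.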
2.1 Convergence Analysis of Nonconvex Mini-batch STOC-ADMM
In this subsection, we study the convergence and iteration complexity of the nonconvex mini-batch STOC-ADMM. First, we give some mild assumptions:
Assumption 2.
For each smooth function f_i, its gradient is Lipschitz continuous with constant L, i.e.,

(14)  ∥∇f_i(x) − ∇f_i(y)∥ ≤ L∥x − y∥, ∀x, y,

which implies that

(15)  f_i(x) ≤ f_i(y) + ⟨∇f_i(y), x − y⟩ + (L/2)∥x − y∥².
Assumption 3.
f(x) and g(y) are both lower bounded; denote f* = inf_x f(x) and g* = inf_y g(y).
Assumption 4.
g is a proper lower semicontinuous function.
Assumption 5.
The matrix A has full row rank.
Assumption 2 has been widely used in the convergence analysis of nonconvex algorithms [39, 40]. Assumptions 3 and 4 have been used in the study of ADMM for nonconvex optimization [46]. Assumption 5 has been used in the convergence analysis of ADMM [53]. Next, we define the stationary point of the nonconvex problem (3):
Definition 1.
Note that the above inequalities (16) are equivalent to requiring that the gradient of the Lagrangian function of (3) vanishes. In the following, based on the above assumptions and definition, we study the convergence and iteration complexity of the mini-batch stochastic ADMM.
Lemma 1.
A detailed proof of Lemma 1 is provided in Appendix A.1; Lemma 1 gives an upper bound used in the analysis. Given the sequence generated by Algorithm 1, we define a useful sequence as follows:
(17) 
For notational simplicity, let , and .
Lemma 2.
Suppose that the sequence is generated by Algorithm 1, and suppose the step size and the penalty parameter, respectively, satisfy
(18) 
Then it holds that
(19) 
A detailed proof of Lemma 2 is provided in Appendix A.2. Lemma 2 gives a key property of the sequence defined in (17). Moreover, (18) provides a specific selection of the step size and the penalty parameter.
Lemma 3.
A detailed proof of Lemma 3 is provided in Appendix A.3. Lemma 3 gives a lower bound on the sequence defined in (17).
Theorem 1.
A detailed proof of Theorem 1 is provided in Appendix A.4. Theorem 1 shows that, given an appropriate mini-batch size, the mini-batch stochastic ADMM converges at a fast rate to a stationary point of the nonconvex problem (3). Moreover, when the sample size n is sufficiently large, the mini-batch stochastic ADMM needs a lower IFO (Incremental First-order Oracle [40]) complexity than the deterministic proximal ADMM [46] to obtain a stationary point.
3 Nonconvex Mini-batch SVRG-ADMM
In this section, we propose a mini-batch nonconvex stochastic variance reduced gradient ADMM (SVRG-ADMM) to solve the problem (3), which uses a multi-stage strategy to progressively reduce the variance of the stochastic gradients.
Algorithm 2 gives the algorithmic framework of the mini-batch SVRG-ADMM for nonconvex optimization. In Algorithm 2, the stochastic gradient is unbiased. In the following, we give an upper bound on the variance of this stochastic gradient.
Lemma 4.
A detailed proof of Lemma 4 is provided in Appendix B.1.
Lemma 4 shows that the variance of the stochastic gradient has an upper bound. As the number of iterations increases, both the current iterate and the snapshot point approach the same stationary point, so the variance of the stochastic gradient vanishes; in other words, the variance is progressively reduced.
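The variance-reduction mechanism can be illustrated with a small numeric sketch. It assumes toy least-squares components f_i(x) = ½(a_iᵀx − b_i)² (all names illustrative): the SVRG estimator combines a fresh per-sample gradient with the stored snapshot gradients.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 4
Adata = rng.standard_normal((n, d))
bdata = rng.standard_normal(n)

def grads(x):
    # per-sample gradients of f_i(x) = 0.5*(a_i' x - b_i)^2, shape (n, d)
    return Adata * (Adata @ x - bdata)[:, None]

x_tilde = rng.standard_normal(d)               # snapshot point
mu = grads(x_tilde).mean(axis=0)               # full gradient at the snapshot
x = x_tilde + 1e-3 * rng.standard_normal(d)    # current iterate, near the snapshot

Gx, Gt = grads(x), grads(x_tilde)
full = Gx.mean(axis=0)
# SVRG estimator for a uniformly drawn index i: g_i(x) - g_i(x_tilde) + mu;
# here we enumerate all n possible estimates at once
svrg = Gx - Gt + mu

# unbiased: averaging over all i recovers the full gradient
print(np.allclose(svrg.mean(axis=0), full))

# the variance shrinks as x approaches x_tilde, unlike plain SGD
var_svrg = np.mean(np.sum((svrg - full) ** 2, axis=1))
var_sgd = np.mean(np.sum((Gx - full) ** 2, axis=1))
print(var_svrg < var_sgd)
```

Because both the iterate and the snapshot approach the same point, the term g_i(x) − g_i(x̃) vanishes, which is exactly the progressive variance reduction described above.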
3.1 Convergence Analysis of Nonconvex Mini-batch SVRG-ADMM
In this subsection, we study the convergence and iteration complexity of the mini-batch nonconvex SVRG-ADMM. First, we give an upper bound in the following lemma.
Lemma 5.
Suppose the sequence is generated by Algorithm 2. The following inequality holds
A detailed proof of Lemma 5 is provided in Appendix B.2. Given the sequence generated by Algorithm 2, we define a useful sequence as follows:
(22) 
where is a positive sequence.
Lemma 6.
Suppose the sequence is generated by Algorithm 2, and suppose the positive sequence satisfies
(23) 
and suppose the step size and the penalty parameter, respectively, satisfy
(24) 
Then the sequence defined by
(25)
is positive, and the sequence defined in (22) monotonically decreases.
A detailed proof of Lemma 6 is provided in Appendix B.3. Lemma 6 shows that the sequence defined in (22) monotonically decreases. Moreover, (24) provides a specific selection of the step size and the penalty parameter in Algorithm 2.
Lemma 7.
Lemma 7 shows that the sequence defined in (22) has a lower bound. The proof of Lemma 7 is the same as that of Lemma 3. We define a useful variable as follows:
(26) 
In the following, we analyze the convergence and iteration complexity of the nonconvex SVRG-ADMM based on the above lemmas.
Theorem 2.
A detailed proof of Theorem 2 is provided in Appendix B.4. Theorem 2 shows that the mini-batch SVRG-ADMM for nonconvex optimization has a fast convergence rate. Moreover, when the sample size n is large, the mini-batch SVRG-ADMM needs a lower IFO complexity than the deterministic ADMM.
Since the mini-batch SVRG-ADMM uses the VR technique, its convergence does not depend on the mini-batch size. In particular, with a mini-batch size of one, the mini-batch nonconvex SVRG-ADMM reduces to the initial nonconvex SVRG-ADMM in [1], which has the same convergence rate. However, by Lemma 4, the variance of the stochastic gradient in the mini-batch SVRG-ADMM decreases faster than that in the initial nonconvex SVRG-ADMM.
4 Nonconvex Mini-batch SAGA-ADMM
In this section, we propose a mini-batch nonconvex stochastic average gradient ADMM (SAGA-ADMM), which additionally uses the old gradients estimated in previous iterations, inspired by the SAGA method [9].
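The SAGA-style gradient estimator maintains a table of old per-sample gradients instead of periodic snapshots. A minimal sketch of one estimator evaluation (assumed toy least-squares losses; all names illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 50, 3
Adata = rng.standard_normal((n, d))
bdata = rng.standard_normal(n)

def grad_i(x, i):
    # gradient of the toy loss f_i(x) = 0.5*(a_i' x - b_i)^2
    return Adata[i] * (Adata[i] @ x - bdata[i])

x_old = rng.standard_normal(d)        # point at which the table was last filled
x = rng.standard_normal(d)            # current iterate

table = np.stack([grad_i(x_old, i) for i in range(n)])  # stored old gradients
table_avg = table.mean(axis=0)

def saga_estimator(x, i):
    # v = g_i(x) - (stored g_i) + average of stored gradients
    g_new = grad_i(x, i)
    v = g_new - table[i] + table_avg
    # in the algorithm, slot i is then refreshed:
    #   table_avg += (g_new - table[i]) / n;  table[i] = g_new
    return v

# unbiasedness: averaging the estimator over all i recovers the full gradient
full = np.stack([grad_i(x, i) for i in range(n)]).mean(axis=0)
ests = np.stack([saga_estimator(x, i) for i in range(n)])
print(np.allclose(ests.mean(axis=0), full))  # True
```

Compared with SVRG, this trades O(n) extra storage for the gradient table against avoiding periodic full-gradient computations.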