Stochastic optimization is an efficient method for solving the following optimization problem, which is fundamental to machine learning,
where  denotes the loss function. When the sample size is large, even first-order methods become computationally burdensome due to their per-iteration complexity of . SGD instead computes the gradient of only one sample rather than all samples in each iteration, so its per-iteration complexity is only . Despite this scalability, owing to the variance inherent in the stochastic process, the stochastic gradient is much noisier than the batch gradient. Thus, the step size has to be decreased gradually as stochastic learning proceeds, leading to slower convergence than the batch method. Recently, a number of accelerated algorithms have been proposed to reduce this variance. For example, the stochastic average gradient (SAG 
) obtains a fast convergence rate by incorporating the old gradients estimated in previous iterations. Stochastic dual coordinate ascent (SDCA) performs stochastic coordinate ascent on the dual problem and also obtains a fast convergence rate. Moreover, the accelerated randomized proximal coordinate gradient (APCG ) method speeds up SDCA by using Nesterov's accelerated method . However, these fast methods require considerable space to store old gradients or dual variables. Thus, stochastic variance reduced gradient (SVRG [7, 8]) methods were proposed, which enjoy a fast convergence rate with no extra space for intermediate gradients or dual variables. The SAGA method  extends SAG and enjoys a better theoretical convergence rate than both SAG and SVRG. Recently, an accelerated SVRG  was presented by using Nesterov's acceleration technique . Moreover, a momentum-accelerated SVRG method (Katyusha)  exploits the strong convexity parameter to reach a faster convergence rate. In addition, a class of stochastic composite optimization methods  has been specially proposed for sparse learning, when  is a sparsity-inducing regularizer such as the ℓ1-norm or the nuclear norm.
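To make the per-iteration trade-off concrete, the following minimal numpy sketch (the least-squares loss and all names are our own illustration, not the paper's setup) contrasts the full-batch gradient with a single-sample SGD step whose step size decays over time:

```python
import numpy as np

def batch_grad(w, X, y):
    # Full (batch) gradient of the least-squares loss: touches all n
    # samples, hence the O(n*d) per-iteration cost.
    return X.T @ (X @ w - y) / len(y)

def sgd_step(w, X, y, lr, rng):
    # SGD: gradient of a single random sample, O(d) per iteration,
    # but noisy, so the step size lr must decay over time.
    i = rng.integers(len(y))
    g = X[i] * (X[i] @ w - y[i])
    return w - lr * g

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
w_true = rng.standard_normal(5)
y = X @ w_true                 # noiseless targets for this toy run
w = np.zeros(5)
for t in range(2000):
    w = sgd_step(w, X, y, lr=0.1 / (1 + 0.01 * t), rng=rng)
```

With a constant step size the iterates would hover around the solution at a noise floor; the decaying schedule trades this off against a slower approach, which is exactly the motivation for the variance-reduction methods discussed above.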
| Nonconvex incremental ADMM | ✓, Unknown | |
| Nonconvex mini-batch stochastic ADMM (ours) | ✓, | ✓, |
| Nonconvex SVRG-ADMM (ours and ) | ✓, | ✓, |
| Nonconvex SAGA-ADMM (ours) | ✓, | ✓, |
Though the above methods can effectively solve many machine learning problems, they still struggle with complicated problems involving a nonseparable and nonsmooth regularization function of the following form
where  is a given matrix,  denotes the loss function, and  denotes the regularization function. With regard to , we are interested in sparsity-inducing regularization functions, e.g., the ℓ1-norm and the nuclear norm. Problem (2) includes the graph-guided fused Lasso , the overlapping group Lasso, and the generalized Lasso . It is well known that the alternating direction method of multipliers (ADMM [19, 20, 21]) is an efficient optimization method for problem (2). Specifically, an auxiliary variable can be introduced to put problem (2) into the general ADMM form. When the sample size is large, the offline or batch ADMM is unsuitable for large-scale learning problems, since it must evaluate the empirical risk on all training samples at each iteration. Thus, online and stochastic versions of ADMM [22, 23, 24] have been developed for large-scale problems. Owing to the variance in the stochastic process, these stochastic ADMMs also suffer from a slow convergence rate. Recently, several accelerated stochastic ADMMs have been proposed to reduce this variance. For example, SAG-ADMM  additionally uses the gradients estimated in previous iterations. An accelerated stochastic ADMM  applies Nesterov's accelerated method . SDCA-ADMM  obtains a linear convergence rate for the strongly convex problem by solving its dual problem. SCAS-ADMM  and SVRG-ADMM  reach a fast convergence rate with no extra space for previous gradients or dual variables. Moreover, an accelerated SVRG-ADMM  uses a momentum acceleration technique. More recently, a fast stochastic ADMM  achieves a non-ergodic convergence rate of  for the convex problem. In addition, an adaptive stochastic ADMM  is built on adaptive gradients. Because the penalty parameter in ADMM can affect convergence , another adaptive stochastic ADMM  uses adaptive penalty parameters.
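As an illustration of the splitting idea behind ADMM for problem (2), the sketch below (hypothetical names; a least-squares loss with an ℓ1 generalized-Lasso penalty is assumed) introduces the auxiliary variable y = A w and runs the batch ADMM updates:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_generalized_lasso(X, b, A, lam, rho=1.0, iters=200):
    # Batch ADMM for  min_w  (1/2n)||X w - b||^2 + lam * ||A w||_1,
    # after introducing the auxiliary variable y = A w to split the
    # nonseparable penalty from the loss.
    n, d = X.shape
    w = np.zeros(d)
    y = np.zeros(A.shape[0])
    u = np.zeros(A.shape[0])                  # scaled dual variable
    H = X.T @ X / n + rho * A.T @ A           # w-update system matrix
    for _ in range(iters):
        rhs = X.T @ b / n + rho * A.T @ (y - u)
        w = np.linalg.solve(H, rhs)           # exact w-update
        y = soft_threshold(A @ w + u, lam / rho)   # prox of the penalty
        u = u + A @ w - y                     # dual ascent step
    return w
```

With A the identity this reduces to the ordinary Lasso; a graph-guided or overlapping-group choice of A couples coordinates, which is exactly why the prox must act on y = A w rather than on w directly.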
So far, the above studies on stochastic optimization methods rely heavily on the problems being convex or strongly convex. However, there exist many useful nonconvex models in machine learning, such as nonconvex empirical risk minimization models
and deep learning. Thus, the study of nonconvex optimization methods is much needed. Recently, some works have focused on stochastic gradient methods for large-scale nonconvex optimization. For example, [37, 38] established an iteration complexity of  for SGD to obtain an -stationary solution of nonconvex problems. [39, 40, 41] proved that variance reduced stochastic gradient methods such as nonconvex SVRG and SAGA reach an iteration complexity of . At the same time,  proved that variance reduced stochastic gradient methods also reach an iteration complexity of  for nonconvex nonsmooth composite problems. More recently, a faster nonconvex stochastic optimization method (Natasha)  exploits the strong nonconvexity parameter, and  obtains a faster gradient-based nonconvex optimization method by using the Catalyst approach in .
Similarly, the above nonconvex methods struggle with some complicated nonconvex problems, such as nonconvex graph-guided regularized risk minimization
and tensor decomposition. Recently, some works [47, 48, 49, 50, 46] have begun to study ADMM for nonconvex optimization, but they focus only on deterministic ADMMs. Since these nonconvex ADMMs must evaluate the empirical loss on all training examples at each iteration, they are not yet well suited to large-scale learning problems. Recently,  proposed a distributed, asynchronous and incremental algorithm based on ADMM for large-scale nonconvex problems, but this method has difficulty with the nonconvex problem (2) under nonseparable and nonsmooth regularizers such as the graph-guided fused Lasso and the overlapping group Lasso. A nonconvex primal dual splitting (NESTT ) method was proposed for distributed and stochastic optimization, but it also has difficulty with the nonconvex problem (2). More recently, our initial manuscript  proposed stochastic ADMMs with variance reduction (e.g., nonconvex SVRG-ADMM and nonconvex SAGA-ADMM) for optimizing these nonconvex problems with complicated structured regularizers such as the graph-guided fused Lasso, the overlapping group Lasso, and sparse plus low-rank penalties. In addition, our initial manuscript  and Zheng and Kwok's paper  simultaneously proposed the nonconvex SVRG-ADMM method. The first version of our manuscript (https://arxiv.org/abs/1610.02758v1) proposes both the nonconvex SVRG-ADMM and SAGA-ADMM, and became available online on Oct. 10, 2016. The first version of  (https://arxiv.org/abs/1604.07070v1) only proposes the convex SVRG-ADMM; it became available online on Apr. 24, 2016 under the name 'Fast-and-Light Stochastic ADMM'. The second version of  (https://arxiv.org/abs/1604.07070v2) adds the nonconvex SVRG-ADMM; it became available online on Oct. 12, 2016 and was renamed 'Stochastic Variance-Reduced ADMM'. At present, to our knowledge, two important problems still need to be addressed:
Is the general stochastic ADMM without the VR technique convergent for nonconvex optimization?
If it is convergent, what is the convergence rate of the general stochastic ADMM for nonconvex optimization?
In this paper, we provide positive answers to both questions by developing a class of mini-batch stochastic ADMMs for nonconvex optimization. Specifically, we study the mini-batch stochastic ADMMs for the nonconvex nonsmooth problem below:
where , each  is a nonconvex and smooth loss function,  is nonsmooth and possibly nonconvex, and ,  and  denote the given matrices and vector, respectively. Problem (3) is inspired by structural risk minimization in machine learning . In summary, our main contributions are four-fold:
We propose the mini-batch stochastic ADMM for nonconvex nonsmooth optimization. Moreover, we prove that, given an appropriate mini-batch size, the mini-batch stochastic ADMM reaches a fast convergence rate of  for obtaining a stationary point.
We extend the mini-batch stochastic gradient method to both the nonconvex SVRG-ADMM and the nonconvex SAGA-ADMM, proposed in our initial manuscript . Moreover, we prove that these stochastic ADMMs also reach a convergence rate of  without any condition on the mini-batch size.
We provide a specific parameter selection for the step size of the stochastic gradients and the penalty parameter of the augmented Lagrangian function.
Numerical experiments demonstrate the effectiveness of the proposed algorithms.
In addition, Table I summarizes the convergence rates of the stochastic/incremental ADMMs for optimizing nonconvex problems.
 denotes the Euclidean norm of a vector or the spectral norm of a matrix.  denotes an -dimensional identity matrix.  denotes a positive definite matrix , and . Let  denote the generalized inverse of the matrix .  denotes the smallest eigenvalue of the matrix .  and  denote the largest and smallest eigenvalues of the positive definite matrix , respectively. The other notations used in this paper are summarized as follows: , , , and .
2 Nonconvex Mini-batch Stochastic ADMM without VR
In this section, we propose a mini-batch stochastic ADMM to optimize the nonconvex problem (3), and we study its convergence. In particular, we prove that, given an appropriate mini-batch size, it reaches a convergence rate of .
where  is the Lagrange multiplier and  is the penalty parameter. At the -th iteration, ADMM performs the following updates:
For the smooth function , there exists a stochastic first-order oracle that returns a noisy estimate of the gradient of , and this noisy estimate satisfies
where the expectation is taken with respect to the random variable.
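As a toy illustration of such an oracle (the least-squares loss and all names below are our own assumptions, not the paper's setting), the following numpy snippet draws mini-batches of sample gradients and checks empirically that their average matches the full gradient, i.e., the estimate is unbiased:

```python
import numpy as np

def minibatch_grad(w, X, b, idx):
    # Stochastic first-order oracle: averages per-sample least-squares
    # gradients over the i.i.d. index set idx.
    Xm = X[idx]
    return Xm.T @ (Xm @ w - b[idx]) / len(idx)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4)); b = rng.standard_normal(500)
w = rng.standard_normal(4)
full = X.T @ (X @ w - b) / 500          # exact gradient
# Averaging many independent oracle calls approximates the full
# gradient, illustrating the unbiasedness E[G(w, xi)] = grad f(w).
est = np.mean([minibatch_grad(w, X, b, rng.integers(500, size=10))
               for _ in range(20000)], axis=0)
```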
Let  be the size of the mini-batch , let  denote a set of i.i.d. random variables, and let the stochastic gradient be given by
Clearly, we have
where , and . Minimizing (2) with respect to the variable , we have
When  is large, computing  is expensive, and storing this matrix may also be problematic. To avoid both issues, we can use the inexact Uzawa method  to linearize the last term in (2). In other words, we set  with
to ensure . Then we have
Finally, we give the algorithmic framework of the mini-batch stochastic ADMM (STOC-ADMM) in Algorithm 1.
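Combining the mini-batch oracle with the linearized (inexact Uzawa) update, the main loop of a STOC-ADMM-style method can be sketched as follows; the least-squares loss, the ℓ1 penalty, and all parameter values are illustrative assumptions, not the paper's algorithm verbatim:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def stoc_admm(X, b, A, lam, eta=0.05, rho=1.0, M=10, iters=500, seed=0):
    # Mini-batch stochastic ADMM sketch for
    #   min_w (1/2n)||X w - b||^2 + lam * ||y||_1   s.t.  y = A w.
    # The w-update is linearized in the inexact-Uzawa spirit: rather
    # than solving a system involving rho * A^T A, it takes a single
    # gradient-type step of size eta, avoiding the matrix inversion.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    y = np.zeros(A.shape[0])
    lmbd = np.zeros(A.shape[0])                  # Lagrange multiplier
    for _ in range(iters):
        idx = rng.integers(n, size=M)            # i.i.d. mini-batch
        Xm = X[idx]
        g = Xm.T @ (Xm @ w - b[idx]) / M         # stochastic gradient
        aug = rho * A.T @ (A @ w - y + lmbd / rho)
        w = w - eta * (g + aug)                  # linearized w-update
        y = soft_threshold(A @ w + lmbd / rho, lam / rho)
        lmbd = lmbd + rho * (A @ w - y)          # multiplier update
    return w, y, lmbd
```

Compared with the batch ADMM, each iteration here touches only M samples and never factors a matrix, which is the source of the per-iteration savings discussed in this section.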
2.1 Convergence Analysis of Nonconvex Mini-batch STOC-ADMM
In this subsection, we study the convergence and iteration complexity of the nonconvex mini-batch STOC-ADMM. First, we give some mild assumptions:
For smooth function , its gradient is Lipschitz continuous with the constant , such that
and this is equivalent to
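In standard notation (the symbols $f_i$ and $L$ below are our reconstruction of the elided formulas, not a verbatim copy), the Lipschitz-gradient condition and the quadratic bound it yields read:

```latex
\|\nabla f_i(x) - \nabla f_i(y)\| \le L\,\|x - y\| \quad \forall x, y
\;\;\Longrightarrow\;\;
\bigl| f_i(y) - f_i(x) - \langle \nabla f_i(x),\, y - x \rangle \bigr|
\le \tfrac{L}{2}\,\|y - x\|^2 .
```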
and  are all lower bounded, and we denote  and .
is a proper lower semi-continuous function.
Matrix has full row rank.
Assumption 2 has been widely used in the convergence analysis of nonconvex algorithms [39, 40]. Assumptions 3-4 have been used in the study of ADMM for nonconvex optimization . Assumption 5 has been used in the convergence analysis of ADMM . Next, we define the -stationary point of the nonconvex problem (3):
Note that the above inequalities (16) are equivalent to , where
where  is the Lagrangian function of (3). In the following, based on the above assumptions and definition, we study the convergence and iteration complexity of the mini-batch stochastic ADMM.
Suppose the sequence is generated by Algorithm 1. The following inequality holds
where and .
For notational simplicity, let , and .
Suppose that the sequence is generated by Algorithm 1. Let , , and
and suppose the parameters and , respectively, satisfy
Then we have , and it holds that
A detailed proof of Lemma 2 is provided in Appendix A.2. Lemma 2 gives a property of the sequence . Moreover, (18) provides a specific selection of the step size  and the penalty parameter , in which the choice of step size depends on the parameter .
A detailed proof of Theorem 1 is provided in Appendix A.4. Theorem 1 shows that, given a mini-batch size , the mini-batch stochastic ADMM has a convergence rate of  for obtaining an -stationary point of the nonconvex problem (3). Moreover, the IFO (Incremental First-order Oracle ) complexity of the mini-batch stochastic ADMM is  for obtaining an -stationary point, whereas the IFO complexity of the deterministic proximal ADMM  is . When , the mini-batch stochastic ADMM needs lower IFO complexity than the deterministic ADMM.
3 Nonconvex Mini-batch SVRG-ADMM
In the subsection, we propose a mini-batch nonconvex stochastic variance reduced gradient ADMM (SVRG-ADMM) to solve the problem (3), which uses a multi-stage strategy to progressively reduce the variance of stochastic gradients.
Algorithm 2 gives an algorithmic framework of the mini-batch SVRG-ADMM for nonconvex optimization. In Algorithm 2, the stochastic gradient  is unbiased, i.e., . In the following, we give an upper bound on the variance of the stochastic gradient .
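The variance-reduced gradient used here can be sketched as follows (hypothetical least-squares setting; `w_snap` plays the role of the snapshot point and `full_snap` its full gradient). By construction the estimator is unbiased, and it is exact when evaluated at the snapshot itself:

```python
import numpy as np

def svrg_grad(w, w_snap, full_snap, X, b, idx):
    # SVRG estimator over a mini-batch idx:
    #   v = grad_idx(w) - grad_idx(w_snap) + full_grad(w_snap).
    # Unbiased, and its variance shrinks as w and the snapshot w_snap
    # approach the same (stationary) point.
    Xm = X[idx]
    g = Xm.T @ (Xm @ w - b[idx]) / len(idx)
    g_snap = Xm.T @ (Xm @ w_snap - b[idx]) / len(idx)
    return g - g_snap + full_snap
```

At w = w_snap the two mini-batch terms cancel exactly and the estimator returns the full gradient, which is the zero-variance limit behind the multi-stage strategy.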
In Algorithm 2, set , then it holds
where denotes variance of the stochastic gradient .
Lemma 4 shows that the variance of the stochastic gradient has the upper bound . Since , both  and  approach the same stationary point as the number of iterations increases, so the variance of the stochastic gradient vanishes. In other words, the variance of the stochastic gradient is progressively reduced.
3.1 Convergence Analysis of Nonconvex Mini-batch SVRG-ADMM
In this subsection, we study the convergence and iteration complexity of the mini-batch nonconvex SVRG-ADMM. First, we give an upper bound on .
Suppose the sequence is generated by Algorithm 2. The following inequality holds
where is a positive sequence.
Suppose the sequence is generated from Algorithm 2, and suppose the positive sequence satisfies, for
where . Let , , , and
and suppose the parameters and , respectively, satisfy
where . Then it holds that the sequence is positive, defined by
and the sequence monotonically decreases.
A detailed proof of Lemma 6 is provided in Appendix B.3. Lemma 6 shows that the sequence monotonically decreases. Moreover, (24) provides a specific parameter selection on the step size and the penalty parameter in Algorithm 2.
In the following, we will analyze the convergence and iteration complexity of the nonconvex SVRG-ADMM based on the above lemmas.
A detailed proof of Theorem 2 is provided in Appendix B.4. Theorem 2 shows that the mini-batch SVRG-ADMM for nonconvex optimization has a convergence rate of . Moreover, the IFO complexity of the mini-batch SVRG-ADMM is . When , the mini-batch SVRG-ADMM needs lower IFO complexity than the deterministic ADMM.
Since the mini-batch SVRG-ADMM uses the VR technique, its convergence does not depend on the mini-batch size . In other words, when , the mini-batch nonconvex SVRG-ADMM reduces to the initial nonconvex SVRG-ADMM in , which also has a convergence rate of . However, by Lemma 4, the variance of the stochastic gradient in the mini-batch SVRG-ADMM decreases faster than that in the initial nonconvex SVRG-ADMM.
4 Nonconvex Mini-batch SAGA-ADMM
In this subsection, we propose a mini-batch nonconvex stochastic average gradient ADMM (SAGA-ADMM), inspired by the SAGA method , which additionally uses the old gradients estimated in previous iterations.
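A minimal sketch of such a SAGA-style gradient estimator (hypothetical least-squares setting and names; the per-sample gradient table below is the extra-memory cost that distinguishes SAGA from SVRG):

```python
import numpy as np

class SagaGradient:
    # SAGA-style estimator: keeps a table of the most recently computed
    # gradient of every sample and combines a fresh per-sample gradient
    # with the stored one and the running table average.
    def __init__(self, X, b):
        self.X, self.b = X, b
        n, d = X.shape
        self.table = np.zeros((n, d))   # old per-sample gradients
        self.avg = np.zeros(d)          # running average of the table

    def step(self, w, i):
        # Unbiased estimate: over a uniformly random i, the stored
        # term and the table average cancel in expectation.
        g_new = self.X[i] * (self.X[i] @ w - self.b[i])
        v = g_new - self.table[i] + self.avg
        self.avg = self.avg + (g_new - self.table[i]) / len(self.b)
        self.table[i] = g_new
        return v
```

After every sample has been visited once at a fixed point w, the table average coincides with the full gradient at w, so later estimates are strongly anchored to recent gradient information.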