# Mini-Batch Stochastic ADMMs for Nonconvex Nonsmooth Optimization

In this paper, we study mini-batch stochastic ADMMs (alternating direction method of multipliers) for nonconvex nonsmooth optimization. We prove that, given an appropriate mini-batch size, the mini-batch stochastic ADMM without the variance reduction (VR) technique is convergent and reaches a convergence rate of O(1/T) for obtaining a stationary point of the nonconvex optimization, where T denotes the number of iterations. Moreover, we extend the mini-batch stochastic gradient method to both the nonconvex SVRG-ADMM and SAGA-ADMM proposed in our initial paper [1], and prove that these mini-batch stochastic ADMMs also reach the convergence rate of O(1/T) without any condition on the mini-batch size. In particular, we provide a specific parameter selection for the step size η of the stochastic gradients and the penalty parameter ρ of the augmented Lagrangian function. Finally, some experimental results demonstrate the effectiveness of our algorithms.


## 1 Introduction

Stochastic optimization [2] is a class of powerful optimization tools for solving large-scale problems in machine learning, pattern recognition and computer vision. For example, stochastic gradient descent (SGD [2]) is an efficient method for solving the following optimization problem, which is fundamental to machine learning,

$$\min_{x\in\mathbb{R}^d}\; f(x)+g(x) \qquad (1)$$

where $f(x)=\frac{1}{n}\sum_{i=1}^{n}f_i(x)$ denotes the loss function over $n$ samples, and $g(x)$ denotes the regularization function. The problem (1) includes many useful models such as support vector machines (SVM), logistic regression and neural networks. When the sample size $n$ is large, even the first-order methods become computationally burdensome due to their per-iteration complexity of $O(n)$. SGD, in contrast, computes the gradient of only one sample instead of all samples in each iteration, and thus has a per-iteration complexity of only $O(1)$. Despite its scalability, the stochastic gradient is much noisier than the batch gradient because of the variance introduced by random sampling. Thus, the step size has to be decreased gradually as stochastic learning proceeds, leading to slower convergence than the batch method. Recently, a number of accelerated algorithms have been proposed to reduce this variance. For example, stochastic average gradient (SAG [3]) obtains a fast convergence rate by incorporating the old gradients estimated in previous iterations. Stochastic dual coordinate ascent (SDCA [4]) performs stochastic coordinate ascent on the dual problem and also obtains a fast convergence rate. Moreover, the accelerated randomized proximal coordinate gradient (APCG [5]) method accelerates SDCA by using Nesterov's acceleration [6]. However, these fast methods require considerable space to store old gradients or dual variables. Thus, stochastic variance reduced gradient (SVRG [7, 8]) methods were proposed, which enjoy a fast convergence rate with no extra space to store intermediate gradients or dual variables. Moreover, [9] proposes the SAGA method, which extends the SAG method and enjoys a better theoretical convergence rate than both SAG and SVRG. Recently, [10] presents an accelerated SVRG that uses Nesterov's acceleration technique [6]. Moreover, [11] proposes a momentum-accelerated SVRG method (Katyusha) that exploits the strong convexity parameter and reaches a faster convergence rate. In addition, [12] proposes a class of stochastic composite optimization methods for sparse learning, where $g(x)$ is a sparsity-inducing regularizer such as the $\ell_1$-norm or the nuclear norm.

Though the above methods can effectively solve many problems in machine learning, they still struggle with more complicated problems whose regularization function is nonseparable and nonsmooth, as follows:

$$\min_{x\in\mathbb{R}^d}\; f(x)+g(Ax) \qquad (2)$$
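For intuition, one illustrative instance of (2) is graph-guided regularization, where $A$ encodes graph edges and $g(\cdot)$ is an $\ell_1$-norm, so $g(Ax)$ penalizes differences of coefficients connected in a graph. The following minimal Python sketch only illustrates this structure; the chain graph, the least-squares loss and the regularization weight are hypothetical choices, not taken from the paper:

```python
import numpy as np

# Hypothetical 3-node chain graph: edges (0,1) and (1,2).
# Each row of A encodes one edge difference x_i - x_j.
A = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])

def f(x, X, y):
    """Smooth loss f(x) = (1/n) * sum_i (X_i^T x - y_i)^2."""
    return np.mean((X @ x - y) ** 2)

def g(z, nu=0.1):
    """Nonsmooth regularizer g(z) = nu * ||z||_1, applied to z = Ax."""
    return nu * np.sum(np.abs(z))

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
x = rng.normal(size=3)
print("objective f(x) + g(Ax) =", f(x, X, y) + g(A @ x))
```

Because $g$ is composed with $A$, the proximal operator of $g(Ax)$ has no simple closed form, which is what motivates the ADMM reformulation studied below.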

So far, the above studies on stochastic optimization methods rely heavily on strong convexity or convexity of the problems. However, there exist many useful nonconvex models in machine learning, such as nonconvex empirical risk minimization models [35, 36]. Thus, the study of nonconvex optimization methods is much needed. Recently, some works have focused on stochastic gradient methods for large-scale nonconvex optimization. For example, [37, 38] established the iteration complexity of $O(1/\epsilon^2)$ for SGD to obtain an $\epsilon$-stationary solution of nonconvex problems. [39, 40, 41] proved that variance reduced stochastic gradient methods such as the nonconvex SVRG and SAGA reach the iteration complexity of $O(1/\epsilon)$. At the same time, [42] proved that variance reduced stochastic gradient methods also reach the iteration complexity of $O(1/\epsilon)$ for nonconvex nonsmooth composite problems. More recently, [43] proposed a faster nonconvex stochastic optimization method (Natasha) that uses the strongly nonconvex parameter, and [44] proposed a faster gradient-based nonconvex optimization method based on the catalyst approach in [45].

Similarly, the above nonconvex methods struggle with some complicated nonconvex problems, such as nonconvex graph-guided regularized risk minimization [1] and tensor decomposition. A natural tool for such structured problems is the stochastic ADMM; however, it has remained unclear whether the general stochastic ADMM without the VR technique converges in the nonconvex setting. This raises two questions:

• Is the general stochastic ADMM without the VR technique convergent for nonconvex optimization?

• If it is convergent, what is the convergence rate of the general stochastic ADMM for nonconvex optimization?

In this paper, we provide positive answers to these questions by developing a class of mini-batch stochastic ADMMs for nonconvex optimization. Specifically, we study the mini-batch stochastic ADMMs for optimizing the following nonconvex nonsmooth problem:

$$\min_{x\in\mathbb{R}^d,\,y\in\mathbb{R}^p}\; f(x)+g(y) \quad \text{s.t.}\;\; Ax+By=c, \qquad (3)$$

where $f(x)=\frac{1}{n}\sum_{i=1}^{n}f_i(x)$, each $f_i(x)$ is a nonconvex and smooth loss function, $g(y)$ is nonsmooth and possibly nonconvex, and $A\in\mathbb{R}^{l\times d}$, $B\in\mathbb{R}^{l\times p}$ and $c\in\mathbb{R}^{l}$ denote the given matrices and vector, respectively. The problem (3) is inspired by structural risk minimization in machine learning [51]. In summary, our main contributions are four-fold:

• We propose the mini-batch stochastic ADMM for nonconvex nonsmooth optimization. Moreover, we prove that, given an appropriate mini-batch size, the mini-batch stochastic ADMM reaches a convergence rate of $O(1/T)$ for obtaining a stationary point.

• We extend the mini-batch stochastic gradient method to both the nonconvex SVRG-ADMM and SAGA-ADMM proposed in our initial manuscript [1]. Moreover, we prove that these stochastic ADMMs also reach a convergence rate of $O(1/T)$ without any condition on the mini-batch size.

• We provide a specific parameter selection for the step size $\eta$ of the stochastic gradients and the penalty parameter $\rho$ of the augmented Lagrangian function.

• Some numerical experiments demonstrate the effectiveness of the proposed algorithms.

In addition, Table I summarizes the convergence rates of the stochastic/incremental ADMMs for optimizing nonconvex problems.

### 1.1 Notations

$\|\cdot\|$ denotes the Euclidean norm of a vector or the spectral norm of a matrix. $I_d$ denotes a $d$-dimensional identity matrix. $H\succ 0$ denotes a positive definite matrix $H$, and $\|x\|^2_H=x^T H x$. Let $A^{+}$ denote the generalized inverse of the matrix $A$. $\phi^A_{\min}$ denotes the smallest eigenvalue of the matrix $AA^T$, and $\phi^H_{\max}$ and $\phi^H_{\min}$ denote the largest and smallest eigenvalues of the positive definite matrix $H$, respectively. The other shorthand used in this paper, such as $\tilde{L}$, $\phi_H$, $\zeta$ and $\zeta_1$, denotes constants arising in the analysis.

## 2 Nonconvex Mini-batch Stochastic ADMM without VR

In this section, we propose a mini-batch stochastic ADMM to optimize the nonconvex problem (3). Moreover, we study the convergence of the mini-batch stochastic ADMM. In particular, we prove that, given an appropriate mini-batch size, it reaches the convergence rate of $O(1/T)$.

First, we review the deterministic ADMM for solving the problem (3). The augmented Lagrangian function of (3) is defined as follows:

$$\mathcal{L}_\rho(x,y,\lambda)=f(x)+g(y)-\langle\lambda,\,Ax+By-c\rangle+\frac{\rho}{2}\|Ax+By-c\|^2, \qquad (4)$$

where $\lambda$ is the Lagrange multiplier, and $\rho>0$ is the penalty parameter. At the $t$-th iteration, the ADMM executes the updates:

$$y_{t+1}=\arg\min_{y}\;\mathcal{L}_\rho(x_t,y,\lambda_t), \qquad (5)$$
$$x_{t+1}=\arg\min_{x}\;\mathcal{L}_\rho(x,y_{t+1},\lambda_t), \qquad (6)$$
$$\lambda_{t+1}=\lambda_t-\rho(Ax_{t+1}+By_{t+1}-c). \qquad (7)$$

Next, we give a mild assumption, as used in general stochastic optimization [37, 38] and in the initial convex stochastic ADMM [24].

###### Assumption 1.

For the smooth function $f(x)$, there exists a stochastic first-order oracle that returns a noisy estimate $G(x,\xi)$ of the gradient $\nabla f(x)$, and the noisy estimate satisfies

$$\mathbb{E}[G(x,\xi)]=\nabla f(x), \qquad (8)$$
$$\mathbb{E}[\|G(x,\xi)-\nabla f(x)\|^2]\le\sigma^2, \qquad (9)$$

where the expectation is taken with respect to the random variable $\xi$.

Let $M$ be the size of the mini-batch $\mathcal{I}$, let $\xi_{\mathcal{I}}=\{\xi_i\}_{i\in\mathcal{I}}$ denote a set of i.i.d. random variables, and define the mini-batch stochastic gradient by

$$G(x,\xi_{\mathcal{I}})=\frac{1}{M}\sum_{i\in\mathcal{I}}G(x,\xi_i).$$

Clearly, we have

$$\mathbb{E}[G(x,\xi_{\mathcal{I}})]=\nabla f(x), \qquad (10)$$
$$\mathbb{E}[\|G(x,\xi_{\mathcal{I}})-\nabla f(x)\|^2]\le\frac{\sigma^2}{M}. \qquad (11)$$
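As a concrete illustration of this mini-batch oracle, the short Python sketch below averages per-sample gradients drawn uniformly at random; the least-squares loss, the helper names and the sampling scheme are assumptions of the example, not prescriptions of the paper:

```python
import numpy as np

def minibatch_gradient(grad_i, x, n, M, rng):
    """Average M per-sample gradients G(x, xi_i); unbiased for the full
    gradient, with variance reduced by the factor 1/M (cf. (10)-(11))."""
    idx = rng.integers(0, n, size=M)          # i.i.d. uniform sampling
    return np.mean([grad_i(x, i) for i in idx], axis=0)

# Example with a least-squares loss f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d, M = 1000, 5, 32
A_data, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad_i = lambda x, i: (A_data[i] @ x - b[i]) * A_data[i]

x = np.zeros(d)
g_batch = minibatch_gradient(grad_i, x, n, M, rng)
g_full = A_data.T @ (A_data @ x - b) / n      # exact gradient, for comparison
print("estimation error:", np.linalg.norm(g_batch - g_full))
```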

In the stochastic ADMM algorithm, we can update the variables $y$ and $\lambda$ by (5) and (7), respectively, as in the deterministic ADMM. However, to update the variable $x$, we define an approximated function of the form:

$$\tilde{\mathcal{L}}_\rho\big(x;y_{t+1},\lambda_t,x_t,G(x_t,\xi_{\mathcal{I}_t})\big)=f(x_t)+G(x_t,\xi_{\mathcal{I}_t})^T(x-x_t)+\frac{1}{2\eta}\|x-x_t\|^2_H-\langle\lambda_t,\,Ax+By_{t+1}-c\rangle+\frac{\rho}{2}\|Ax+By_{t+1}-c\|^2, \qquad (12)$$

where $\eta>0$ is a step size and $H$ is a positive definite matrix. By minimizing (12) with respect to the variable $x$, we have

$$x_{t+1}=\Big(\frac{H}{\eta}+\rho A^TA\Big)^{-1}\Big[\frac{H}{\eta}x_t-G(x_t,\xi_{\mathcal{I}_t})-\rho A^T\Big(By_{t+1}-c-\frac{\lambda_t}{\rho}\Big)\Big].$$

When $d$ is large, computing the inverse of $\frac{H}{\eta}+\rho A^TA$ is expensive, and storing this matrix may also be problematic. To avoid this, we can use the inexact Uzawa method [52] to linearize the last term in (12). In other words, we set $H=rI_d-\eta\rho A^TA$ with

$$r\ge r_{\min}\equiv\eta\rho\|A^TA\|+1$$

to ensure $H\succeq I_d$. Then we have

$$x_{t+1}=x_t-\frac{\eta}{r}\Big[G(x_t,\xi_{\mathcal{I}_t})+\rho A^T\Big(Ax_t+By_{t+1}-c-\frac{\lambda_t}{\rho}\Big)\Big]. \qquad (13)$$

Finally, we give the algorithmic framework of the mini-batch stochastic ADMM (STOC-ADMM) in Algorithm 1.
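Since Algorithm 1 is not reproduced here, the sketch below shows what one mini-batch STOC-ADMM iteration could look like under the linearized update (13), assuming for illustration that $g$ is an $\ell_1$-norm so the $y$-subproblem (5) reduces to soft-thresholding, and that $B=-I$, $c=0$; the proximal choice, function names and parameters are assumptions of this sketch, not prescriptions from the paper:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (example choice for g)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def stoc_admm_step(x, y, lam, A, stoch_grad, eta, rho, nu):
    """One illustrative mini-batch STOC-ADMM iteration for
    min_x f(x) + nu*||y||_1  s.t.  Ax - y = 0   (i.e. B = -I, c = 0)."""
    # y-update (5): closed form via soft-thresholding.
    y = soft_threshold(A @ x - lam / rho, nu / rho)
    # x-update (13): linearized step with H = r*I_d - eta*rho*A^T A.
    r = eta * rho * np.linalg.norm(A.T @ A, 2) + 1.0
    x = x - (eta / r) * (stoch_grad(x) + rho * A.T @ (A @ x - y - lam / rho))
    # lambda-update (7).
    lam = lam - rho * (A @ x - y)
    return x, y, lam
```

With $H=rI_d-\eta\rho A^TA$, the $x$-update never forms or inverts $\frac{H}{\eta}+\rho A^TA$, which is exactly the point of the inexact Uzawa linearization above.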

### 2.1 Convergence Analysis of Nonconvex Mini-batch STOC-ADMM

In this subsection, we study the convergence and iteration complexity of the nonconvex mini-batch STOC-ADMM. First, we give some mild assumptions as follows:

###### Assumption 2.

For the smooth function $f(x)$, its gradient is Lipschitz continuous with constant $L>0$, i.e.,

$$\|\nabla f(x_1)-\nabla f(x_2)\|\le L\|x_1-x_2\|, \quad \forall x_1,x_2\in\mathbb{R}^d, \qquad (14)$$

which implies that

$$f(x_1)\le f(x_2)+\nabla f(x_2)^T(x_1-x_2)+\frac{L}{2}\|x_1-x_2\|^2. \qquad (15)$$
###### Assumption 3.

$f(x)$ and $g(y)$ are both lower bounded; denote $f^{*}=\inf_x f(x)$ and $g^{*}=\inf_y g(y)$.

###### Assumption 4.

$g(y)$ is a proper lower semi-continuous function.

###### Assumption 5.

The matrix $A$ has full row rank.

Assumption 2 has been widely used in the convergence analysis of nonconvex algorithms [39, 40]. Assumptions 3-4 have been used in the study of ADMM for nonconvex optimization [46]. Assumption 5 has been used in the convergence analysis of ADMM [53]. Next, we define the $\epsilon$-stationary point of the nonconvex problem (3):

###### Definition 1.

For $\epsilon>0$, the point $(x^{*},y^{*},\lambda^{*})$ is said to be an $\epsilon$-stationary point of the nonconvex problem (3) if it holds that

$$\begin{cases}\mathbb{E}\|Ax^{*}+By^{*}-c\|^2\le\epsilon,\\ \mathbb{E}\|\nabla f(x^{*})-A^T\lambda^{*}\|^2\le\epsilon,\\ \mathbb{E}\big[\mathrm{dist}\big(B^T\lambda^{*},\partial g(y^{*})\big)^2\big]\le\epsilon,\end{cases} \qquad (16)$$

where $\mathrm{dist}(y_0,\partial g(y))=\inf\{\|y_0-z\|:z\in\partial g(y)\}$, and $\partial g(y)$ denotes the subdifferential of $g(y)$. If $\epsilon=0$, the point $(x^{*},y^{*},\lambda^{*})$ is said to be a stationary point of (3).

Note that the above inequalities (16) are equivalent to $\mathbb{E}\big[\mathrm{dist}\big(0,\partial L(x^{*},y^{*},\lambda^{*})\big)^2\big]\le\epsilon$, where

$$\partial L(x,y,\lambda)=\begin{bmatrix}\partial L(x,y,\lambda)/\partial x\\ \partial L(x,y,\lambda)/\partial y\\ \partial L(x,y,\lambda)/\partial\lambda\end{bmatrix},$$

and $L(x,y,\lambda)=f(x)+g(y)-\langle\lambda,Ax+By-c\rangle$ is the Lagrangian function of (3). In the following, based on the above assumptions and definition, we study the convergence and iteration complexity of the mini-batch stochastic ADMM.
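Before turning to the analysis, note that the residuals in Definition 1 can be checked numerically. The sketch below evaluates the three quantities in (16) for the special case $g(y)=\nu\|y\|_1$, whose subdifferential has a simple closed form; the choice of $g$ and all function names are hypothetical and serve only as an illustration:

```python
import numpy as np

def dist_to_l1_subdiff(v, y, nu):
    """Squared distance from v to the subdifferential of nu*||.||_1 at y.
    Componentwise: {nu*sign(y_i)} if y_i != 0, else the interval [-nu, nu]."""
    d = np.where(y != 0.0,
                 np.abs(v - nu * np.sign(y)),
                 np.maximum(np.abs(v) - nu, 0.0))
    return np.sum(d ** 2)

def stationarity_residuals(x, y, lam, A, B, c, grad_f, nu):
    """The three quantities bounded by eps in (16) (single-sample version)."""
    r1 = np.linalg.norm(A @ x + B @ y - c) ** 2
    r2 = np.linalg.norm(grad_f(x) - A.T @ lam) ** 2
    r3 = dist_to_l1_subdiff(B.T @ lam, y, nu)
    return r1, r2, r3
```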

###### Lemma 1.

Suppose that the sequence $\{(x_t,y_t,\lambda_t)\}$ is generated by Algorithm 1. Then the following inequality holds:

$$\mathbb{E}\|\lambda_{t+1}-\lambda_t\|^2\le\zeta\|x_t-x_{t-1}\|^2+\zeta_1\mathbb{E}\|x_{t+1}-x_t\|^2+\frac{10\sigma^2}{M\phi^A_{\min}},$$

where $\zeta$ and $\zeta_1$ are constants arising in the analysis.

A detailed proof of Lemma 1 is provided in Appendix A.1. Lemma 1 gives an upper bound of $\mathbb{E}\|\lambda_{t+1}-\lambda_t\|^2$. Given a sequence $\{(x_t,y_t,\lambda_t)\}$ generated by Algorithm 1, we define a useful sequence as follows:

$$\Psi_t=\mathbb{E}\Big[\mathcal{L}_\rho(x_t,y_t,\lambda_t)+\frac{\zeta}{\rho}\|x_t-x_{t-1}\|^2\Big]. \qquad (17)$$

For notational simplicity, we also use the shorthand constants $\tilde{L}$ and $\phi_H$ in the results below.

###### Lemma 2.

Suppose that the sequence $\{(x_t,y_t,\lambda_t)\}$ is generated by Algorithm 1. Let $\gamma$, $\triangle$, $\varphi$ and $\rho^{*}$ denote the constants used in (18)-(19) below, let

$$\rho_0=\frac{10\phi^H_{\max}\Big(\tilde{L}\phi^H_{\max}+\sqrt{\tilde{L}^2(\phi^H_{\max})^2+2L^2\phi_H}\Big)}{\phi^A_{\min}\phi_H},$$

and suppose the parameters $\eta$ and $\rho$, respectively, satisfy

$$\begin{cases}\eta\in\Big(\dfrac{\phi^H_{\min}-\sqrt{\triangle}}{\varphi},\ \dfrac{\phi^H_{\min}+\sqrt{\triangle}}{\varphi}\Big), & \rho\in(\rho_0,\rho^{*});\\[6pt] \eta\in\Big(\dfrac{10(\phi^H_{\max})^2}{\rho\phi^A_{\min}\phi^H_{\min}},\ \dfrac{r-1}{\rho\|A^TA\|}\Big], & \rho=\rho^{*};\\[6pt] \eta\in\Big(\dfrac{\phi^H_{\min}-\sqrt{\triangle}}{\varphi},\ \dfrac{r-1}{\rho\|A^TA\|}\Big], & \rho\in(\rho^{*},+\infty).\end{cases} \qquad (18)$$

Then we have $\gamma>0$, and it holds that

$$\Psi_{t+1}-\Psi_t\le-\gamma\|x_{t+1}-x_t\|^2+\frac{(\phi^A_{\min}\rho+20)\sigma^2}{2\phi^A_{\min}\rho M}. \qquad (19)$$

A detailed proof of Lemma 2 is provided in Appendix A.2. Lemma 2 gives a decrease property of the sequence $\{\Psi_t\}$. Moreover, (18) provides a specific parameter selection for the step size $\eta$ and the penalty parameter $\rho$, in which the selection of the step size $\eta$ depends on the penalty parameter $\rho$.

###### Lemma 3.

Suppose that the sequence $\{(x_t,y_t,\lambda_t)\}$ is generated by Algorithm 1. Under the same conditions as in Lemma 2, the sequence $\{\Psi_t\}$ has a lower bound.

A detailed proof of Lemma 3 is provided in Appendix A.3. Lemma 3 gives a lower bound of the sequence $\{\Psi_t\}$.

###### Theorem 1.

Suppose that the sequence $\{(x_t,y_t,\lambda_t)\}$ is generated by Algorithm 1. Let $\kappa_1$, $\kappa_2$, $\kappa_3$ and $\kappa_4$ denote the constants used in (20) below, and let

$$M\ge\frac{2\sigma^2}{\epsilon}\max\Big\{\frac{\kappa_1}{\kappa_4}+3,\ \frac{\kappa_2}{\kappa_4}+\frac{10}{\phi^A_{\min}\rho^2},\ \frac{\kappa_3}{\kappa_4}\Big\}, \qquad (20)$$

$$T=\frac{\max\{\kappa_1,\kappa_2,\kappa_3\}}{\epsilon\gamma}(\Psi_1-\Psi^{*}),$$

where $\Psi^{*}$ is a lower bound of the sequence $\{\Psi_t\}$. Define $\theta_t$ as in the proof (Appendix A.4), and let $t^{*}=\arg\min_{1\le t\le T}\theta_t$; then $(x_{t^{*}},y_{t^{*}},\lambda_{t^{*}})$ is an $\epsilon$-stationary point of the problem (3).

A detailed proof of Theorem 1 is provided in Appendix A.4. Theorem 1 shows that, given a mini-batch size $M$ satisfying (20), the mini-batch stochastic ADMM has the convergence rate of $O(1/T)$ for obtaining an $\epsilon$-stationary point of the nonconvex problem (3). Moreover, since $M=O(1/\epsilon)$ and $T=O(1/\epsilon)$, the IFO (Incremental First-order Oracle [40]) complexity of the mini-batch stochastic ADMM is $O(1/\epsilon^2)$ for obtaining an $\epsilon$-stationary point, while the IFO complexity of the deterministic proximal ADMM [46] is $O(n/\epsilon)$. When $n\ge O(1/\epsilon)$, the mini-batch stochastic ADMM needs less IFO complexity than the deterministic ADMM.
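A minimal numeric comparison, under the $O(1/\epsilon^2)$ versus $O(n/\epsilon)$ bounds discussed above (constants suppressed; the sample size and accuracy are hypothetical values chosen only for illustration):

```python
n, eps = 10**6, 1e-3
stoc_admm_ifo = 1 / eps**2          # mini-batch stochastic ADMM: O(1/eps^2)
det_admm_ifo = n / eps              # deterministic proximal ADMM: O(n/eps)
print(stoc_admm_ifo, det_admm_ifo)  # 1e6 vs 1e9: stochastic wins when n >= 1/eps
```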

In the convergence analysis, given an appropriate mini-batch size $M$ satisfying the condition (20), the step size $\eta$ only needs to satisfy the condition (18), instead of the diminishing step size used in the convex stochastic ADMM [24].

## 3 Nonconvex Mini-batch SVRG-ADMM

In this section, we propose a mini-batch nonconvex stochastic variance reduced gradient ADMM (SVRG-ADMM) to solve the problem (3), which uses a multi-stage strategy to progressively reduce the variance of the stochastic gradients.

Algorithm 2 gives the algorithmic framework of the mini-batch SVRG-ADMM for nonconvex optimization. In Algorithm 2, the stochastic gradient is unbiased, i.e., its expectation equals the full gradient $\nabla f(x^{s+1}_t)$. In the following, we give an upper bound on the variance of this stochastic gradient.

###### Lemma 4.

In Algorithm 2, let $\Delta^{s+1}_t$ denote the difference between the stochastic gradient and the full gradient $\nabla f(x^{s+1}_t)$; then it holds that

$$\mathbb{E}\|\Delta^{s+1}_t\|^2\le\frac{L^2}{M}\|x^{s+1}_t-\tilde{x}^s\|^2, \qquad (21)$$

where $\mathbb{E}\|\Delta^{s+1}_t\|^2$ is the variance of the stochastic gradient.

A detailed proof of Lemma 4 is provided in Appendix B.1.

Lemma 4 shows that the variance of the stochastic gradient has the upper bound $\frac{L^2}{M}\|x^{s+1}_t-\tilde{x}^s\|^2$. Since the snapshot point $\tilde{x}^s$ is refreshed at the end of each epoch, as the number of iterations increases both $x^{s+1}_t$ and $\tilde{x}^s$ approach the same stationary point, and thus the variance of the stochastic gradient vanishes. In other words, the variance of the stochastic gradient is progressively reduced.
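To make the variance-reduction mechanism concrete, the sketch below implements the standard mini-batch SVRG gradient estimator (per-sample gradients at the current iterate, corrected by the per-sample gradients at the snapshot plus the full gradient at the snapshot); the function names and the least-squares per-sample gradient are assumptions of this illustration, not the paper's notation:

```python
import numpy as np

def svrg_gradient(grad_i, x, x_snap, full_grad_snap, idx):
    """Mini-batch SVRG estimator:
    (1/M) * sum_{i in idx} [grad_i(x) - grad_i(x_snap)] + full_grad(x_snap).
    Unbiased, and its variance shrinks as x approaches x_snap (cf. Lemma 4)."""
    correction = np.mean([grad_i(x, i) - grad_i(x_snap, i) for i in idx], axis=0)
    return correction + full_grad_snap

# Example with least-squares per-sample gradients.
rng = np.random.default_rng(0)
n, d, M = 200, 4, 10
A_data, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad_i = lambda x, i: (A_data[i] @ x - b[i]) * A_data[i]

x_snap = rng.normal(size=d)
full_grad_snap = A_data.T @ (A_data @ x_snap - b) / n   # computed once per epoch
x = x_snap + 0.01 * rng.normal(size=d)                   # current inner iterate
v = svrg_gradient(grad_i, x, x_snap, full_grad_snap, rng.integers(0, n, M))
```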

### 3.1 Convergence Analysis of Nonconvex Mini-batch SVRG-ADMM

In this subsection, we study the convergence and iteration complexity of the mini-batch nonconvex SVRG-ADMM. First, we give an upper bound of $\mathbb{E}\|\lambda^{s+1}_{t+1}-\lambda^{s+1}_t\|^2$.

###### Lemma 5.

Suppose that the sequence $\{(x^s_t,y^s_t,\lambda^s_t)\}$ is generated by Algorithm 2. Then the following inequality holds:

$$\mathbb{E}\|\lambda^{s+1}_{t+1}-\lambda^{s+1}_t\|^2\le\frac{5L^2}{\phi^A_{\min}M}\mathbb{E}\|x^{s+1}_t-\tilde{x}^s\|^2+\frac{5L^2}{\phi^A_{\min}M}\|x^{s+1}_{t-1}-\tilde{x}^s\|^2+\zeta\|x^{s+1}_t-x^{s+1}_{t-1}\|^2+\zeta_1\mathbb{E}\|x^{s+1}_{t+1}-x^{s+1}_t\|^2.$$

A detailed proof of Lemma 5 is provided in Appendix B.2. Given a sequence $\{(x^s_t,y^s_t,\lambda^s_t)\}$ generated by Algorithm 2, we define a useful sequence as follows:

$$\Phi^s_t=\mathbb{E}\Big[\mathcal{L}_\rho(x^s_t,y^s_t,\lambda^s_t)+h^s_t\big(\|x^s_t-\tilde{x}^{s-1}\|^2+\|x^s_{t-1}-\tilde{x}^{s-1}\|^2\big)+\frac{\zeta}{\rho}\|x^s_t-x^s_{t-1}\|^2\Big], \qquad (22)$$

where $\{h^s_t\}$ is a positive sequence.

###### Lemma 6.

Suppose that the sequence $\{(x^s_t,y^s_t,\lambda^s_t)\}$ is generated by Algorithm 2, and suppose the positive sequence $\{h^s_t\}$ satisfies

$$h^s_t=\begin{cases}(2+\beta)h^s_{t+1}+\dfrac{(10+\phi^A_{\min}\rho)L^2}{2\rho\phi^A_{\min}M}, & 1\le t\le m-1,\\[6pt] \dfrac{10L^2}{\phi^A_{\min}\rho M}, & t=m,\end{cases} \qquad (23)$$

where $\beta>0$, and $\hat{h}$, $\triangle_1$, $\varphi_1$ and $\rho^{*}$ denote constants arising in the analysis. Let

$$\rho_0=\frac{10\phi^H_{\max}\Big((\tilde{L}+2\hat{h})\phi^H_{\max}+\sqrt{(\tilde{L}+2\hat{h})^2(\phi^H_{\max})^2+2L^2\phi_H}\Big)}{\phi^A_{\min}\phi_H},$$

and suppose the parameters $\eta$ and $\rho$, respectively, satisfy

$$\begin{cases}\eta\in\Big(\dfrac{\phi^H_{\min}-\sqrt{\triangle_1}}{\varphi_1},\ \dfrac{\phi^H_{\min}+\sqrt{\triangle_1}}{\varphi_1}\Big), & \rho\in(\rho_0,\rho^{*});\\[6pt] \eta\in\Big(\dfrac{10(\phi^H_{\max})^2}{\rho\phi^A_{\min}\phi^H_{\min}},\ \dfrac{r-1}{\rho\|A^TA\|}\Big], & \rho=\rho^{*};\\[6pt] \eta\in\Big(\dfrac{\phi^H_{\min}-\sqrt{\triangle_1}}{\varphi_1},\ \dfrac{r-1}{\rho\|A^TA\|}\Big], & \rho\in(\rho^{*},+\infty).\end{cases} \qquad (24)$$

Then the sequence $\{\Gamma^s_t\}$, defined by

$$\Gamma^s_t=\begin{cases}\dfrac{\phi^H_{\min}}{\eta}+\dfrac{\phi^A_{\min}\rho}{2}-\dfrac{\tilde{L}}{2}-\dfrac{\zeta+\zeta_1}{\rho}-\Big(1+\dfrac{1}{\beta}\Big)h^s_{t+1}, & 1\le t\le m-1;\\[6pt] \dfrac{\phi^H_{\min}}{\eta}+\dfrac{\phi^A_{\min}\rho}{2}-\dfrac{\tilde{L}}{2}-\dfrac{\zeta+\zeta_1}{\rho}-h^{s+1}_1, & t=m,\end{cases} \qquad (25)$$

is positive, and the sequence $\{\Phi^s_t\}$ monotonically decreases.

A detailed proof of Lemma 6 is provided in Appendix B.3. Lemma 6 shows that the sequence $\{\Phi^s_t\}$ monotonically decreases. Moreover, (24) provides a specific parameter selection for the step size $\eta$ and the penalty parameter $\rho$ in Algorithm 2.

###### Lemma 7.

Suppose that the sequence $\{(x^s_t,y^s_t,\lambda^s_t)\}$ is generated by Algorithm 2. Under the same conditions as in Lemma 6, the sequence $\{\Phi^s_t\}$ has a lower bound.

Lemma 7 shows that the sequence $\{\Phi^s_t\}$ has a lower bound. The proof of Lemma 7 is the same as that of Lemma 3. We define a useful variable as follows:

$$\hat{\theta}^s_t=\|x^s_t-\tilde{x}^{s-1}\|^2+\|x^s_{t-1}-\tilde{x}^{s-1}\|^2+\|x^s_{t+1}-x^s_t\|^2+\|x^s_t-x^s_{t-1}\|^2. \qquad (26)$$

In the following, we will analyze the convergence and iteration complexity of the nonconvex SVRG-ADMM based on the above lemmas.

###### Theorem 2.

Suppose that the sequence $\{(x^s_t,y^s_t,\lambda^s_t)\}$ is generated by Algorithm 2, and let $\kappa_1$, $\kappa_2$, $\kappa_3$ and $\tau$ denote constants arising in the analysis. Let

$$mS=T=\frac{\max\{\kappa_1,\kappa_2,\kappa_3\}}{\tau\epsilon}(\Phi^1_1-\Phi^{*}), \qquad (27)$$

where $\Phi^{*}$ is a lower bound of the sequence $\{\Phi^s_t\}$. Let

$$(t^{*},s^{*})=\arg\min_{1\le t\le m,\ 1\le s\le S}\hat{\theta}^s_t;$$

then $(x^{s^{*}}_{t^{*}},y^{s^{*}}_{t^{*}},\lambda^{s^{*}}_{t^{*}})$ is an $\epsilon$-stationary point of the problem (3).

A detailed proof of Theorem 2 is provided in Appendix B.4. Theorem 2 shows that the mini-batch SVRG-ADMM for nonconvex optimization has a convergence rate of $O(1/T)$. Moreover, the IFO complexity of the mini-batch SVRG-ADMM is . When , the mini-batch SVRG-ADMM needs less IFO complexity than the deterministic ADMM.

Since the mini-batch SVRG-ADMM uses the VR technique, its convergence does not depend on the mini-batch size $M$. In particular, when $M=1$, the mini-batch nonconvex SVRG-ADMM reduces to the initial nonconvex SVRG-ADMM in [1], which also has a convergence rate of $O(1/T)$. However, by Lemma 4, the variance of the stochastic gradient in the mini-batch SVRG-ADMM decreases faster than that in the initial nonconvex SVRG-ADMM.
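To summarize how the pieces fit together, the sketch below outlines the two-loop structure that an SVRG-ADMM of this kind follows: an outer loop that recomputes the full gradient at a snapshot point, and an inner loop that runs the linearized ADMM step (13) with the variance-reduced mini-batch gradient. It assumes the same illustrative problem as the earlier sketches ($g=\nu\|\cdot\|_1$, $B=-I$, $c=0$) and hypothetical helper names; it is not a verbatim transcription of Algorithm 2:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def svrg_admm(grad_i, full_grad, A, x0, n, S=20, m=50, M=10,
              eta=0.1, rho=1.0, nu=0.1, seed=0):
    """Illustrative mini-batch SVRG-ADMM for min_x f(x) + nu*||Ax||_1,
    written as min f(x) + nu*||y||_1  s.t.  Ax - y = 0 (B = -I, c = 0)."""
    rng = np.random.default_rng(seed)
    r = eta * rho * np.linalg.norm(A.T @ A, 2) + 1.0   # H = r*I - eta*rho*A^T A
    x = x0.copy()
    x_snap = x0.copy()
    y = A @ x
    lam = np.zeros(A.shape[0])
    for s in range(S):                                  # outer loop (epochs)
        g_snap = full_grad(x_snap)                      # full gradient at snapshot
        for t in range(m):                              # inner loop
            idx = rng.integers(0, n, size=M)            # mini-batch of size M
            # variance-reduced stochastic gradient (cf. Lemma 4)
            v = np.mean([grad_i(x, i) - grad_i(x_snap, i) for i in idx],
                        axis=0) + g_snap
            # y-update (5): soft-thresholding, since g = nu*||.||_1
            y = soft_threshold(A @ x - lam / rho, nu / rho)
            # x-update: linearized step (13) with the VR gradient v
            x = x - (eta / r) * (v + rho * A.T @ (A @ x - y - lam / rho))
            # lambda-update (7)
            lam = lam - rho * (A @ x - y)
        x_snap = x.copy()                               # refresh the snapshot
    return x, y, lam
```

The epoch lengths are chosen so that the total inner-iteration count $mS$ plays the role of $T$ in (27); the snapshot refresh at the end of each epoch is what drives the vanishing variance bound of Lemma 4.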