Mini-Batch Stochastic ADMMs for Nonconvex Nonsmooth Optimization

02/08/2018 · by Feihu Huang et al., Nanjing University of Aeronautics and Astronautics

In this paper, we study the mini-batch stochastic ADMMs (alternating direction method of multipliers) for nonconvex nonsmooth optimization. We prove that, given an appropriate mini-batch size, the mini-batch stochastic ADMM without the variance reduction (VR) technique is convergent and reaches a convergence rate of O(1/T) for obtaining a stationary point of the nonconvex optimization, where T denotes the number of iterations. Moreover, we extend the mini-batch stochastic gradient method to both the nonconvex SVRG-ADMM and SAGA-ADMM proposed in our initial paper [1], and prove that these mini-batch stochastic ADMMs also reach the convergence rate of O(1/T) without any condition on the mini-batch size. In particular, we provide a specific parameter selection for the step size η of the stochastic gradients and the penalty parameter ρ of the augmented Lagrangian function. Finally, experimental results demonstrate the effectiveness of our algorithms.


1 Introduction

Stochastic optimization [2] is a class of powerful optimization tools for solving large-scale problems in machine learning, pattern recognition and computer vision. For example, stochastic gradient descent (SGD [2]) is an efficient method for solving the following optimization problem, which is fundamental to machine learning,

\min_{x \in \mathbb{R}^d} \ \frac{1}{n}\sum_{i=1}^{n} f_i(x) + g(x),   (1)

where f_i(x) denotes the loss function on the i-th sample, and g(x) denotes the regularization function. The problem (1) includes many useful models such as support vector machines (SVM), logistic regression and neural networks. When the sample size n is large, even the first-order methods become computationally burdensome due to their per-iteration complexity of O(n). In contrast, SGD computes the gradient of only one sample instead of all n samples in each iteration, so its per-iteration complexity is only O(1). Despite its scalability, due to the variance introduced by random sampling, the stochastic gradient is much noisier than the batch gradient. Thus, the step size has to be decreased gradually as stochastic learning proceeds, leading to slower convergence than the batch method. Recently, a number of accelerated algorithms have been proposed to reduce this variance. For example, stochastic average gradient (SAG [3]) obtains a fast convergence rate by incorporating the old gradients estimated in previous iterations. Stochastic dual coordinate ascent (SDCA [4]) performs stochastic coordinate ascent on the dual problem and also obtains a fast convergence rate. Moreover, the accelerated randomized proximal coordinate gradient (APCG [5]) method accelerates SDCA by using Nesterov's acceleration [6]. However, these fast methods require much space to store the old gradients or dual variables. Thus, stochastic variance reduced gradient (SVRG [7, 8]) methods were proposed, which enjoy a fast convergence rate with no extra space for storing intermediate gradients or dual variables. Moreover, [9] proposes the SAGA method, which extends the SAG method and enjoys a better theoretical convergence rate than both SAG and SVRG. Recently, [10] presents an accelerated SVRG by using Nesterov's acceleration technique [6]. Moreover, [11] proposes a momentum-accelerated SVRG method (Katyusha) that exploits the strong convexity parameter and reaches a faster convergence rate. In addition, [12] proposes a class of stochastic composite optimization methods for sparse learning, where g(x) is a sparsity-inducing regularizer such as the ℓ1-norm or the nuclear norm.
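To make the per-iteration cost contrast concrete, here is a minimal Python sketch (our own illustration; the per-sample gradient function grad_i is an assumed user-supplied callable, not part of the paper): a mini-batch step touches only b samples, whereas a full-batch step touches all n.

import numpy as np

def minibatch_sgd_step(x, grad_i, n, eta, b=1, rng=np.random):
    """One mini-batch SGD step: per-iteration cost O(b) instead of O(n)."""
    idx = rng.choice(n, size=b, replace=False)            # sample a mini-batch
    g = np.mean([grad_i(x, i) for i in idx], axis=0)      # noisy gradient estimate
    return x - eta * g

def batch_gradient_step(x, grad_i, n, eta):
    """One full-batch gradient step: touches all n samples, per-iteration cost O(n)."""
    g = np.mean([grad_i(x, i) for i in range(n)], axis=0)
    return x - eta * g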

Methods | Nonconvex smooth problem (1) | Nonconvex nonsmooth problem (2) | Convergence rate
Nonconvex incremental ADMM [13] | ✓ |  | Unknown
NESTT [14] | ✓ |  | O(1/T)
Nonconvex mini-batch stochastic ADMM (ours) | ✓ | ✓ | O(1/T)
Nonconvex SVRG-ADMM (ours and [15]) | ✓ | ✓ | O(1/T)
Nonconvex SAGA-ADMM (ours) | ✓ | ✓ | O(1/T)
TABLE I: Summary of the existing stochastic (incremental) ADMMs for nonconvex optimization. ✓ denotes that the corresponding method can optimize the corresponding class of nonconvex problems.

Though the above methods can effectively solve many problems in machine learning, they still struggle with more complicated problems that involve a nonseparable and nonsmooth regularization function of the following form:

\min_{x \in \mathbb{R}^d} \ \frac{1}{n}\sum_{i=1}^{n} f_i(x) + g(Ax),   (2)

where A ∈ R^{p×d} is a given matrix, f_i(x) denotes the loss function on the i-th sample, and g(·) denotes the regularization function. With regard to g, we are interested in sparsity-inducing regularization functions, e.g., the ℓ1-norm and the nuclear norm. The problem (2) includes the graph-guided fused Lasso [16], the overlapping group Lasso [17], and the generalized Lasso [18]. It is well known that the alternating direction method of multipliers (ADMM [19, 20, 21]) is an efficient optimization method for the problem (2). Specifically, we can introduce an auxiliary variable y = Ax to rewrite the problem (2) in the general ADMM form. When the sample size n is large, the offline or batch ADMM is unsuitable for large-scale learning problems, because it needs to compute the empirical risk loss on all training samples at each iteration. Thus, online and stochastic versions of ADMM [22, 23, 24] have been developed for large-scale problems. Due to the variance in the stochastic process, these stochastic ADMMs also suffer from a slow convergence rate. Recently, some accelerated stochastic ADMMs have been proposed to reduce this variance. For example, SAG-ADMM [25] additionally uses the gradients estimated in previous iterations. An accelerated stochastic ADMM [26] is obtained by using Nesterov's acceleration [6]. SDCA-ADMM [27] obtains a linear convergence rate for strongly convex problems by solving the dual problem. SCAS-ADMM [28] and SVRG-ADMM [29] reach a fast convergence rate with no extra space for previous gradients or dual variables. Moreover, [30] proposes an accelerated SVRG-ADMM by using a momentum acceleration technique. More recently, [31] proposes a fast stochastic ADMM, which achieves a non-ergodic O(1/T) convergence rate for convex problems. In addition, an adaptive stochastic ADMM [32] is proposed by using adaptive gradients. Since the penalty parameter in ADMM can affect convergence [33], another adaptive stochastic ADMM [34] is proposed by using adaptive penalty parameters.
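For concreteness, a minimal sketch of this variable-splitting step (our own illustration of the standard reformulation, not an equation quoted from the paper): introducing the auxiliary variable y = Ax turns (2) into a linearly constrained two-block problem,

\min_{x \in \mathbb{R}^d,\, y \in \mathbb{R}^p} \ \frac{1}{n}\sum_{i=1}^{n} f_i(x) + g(y) \quad \mathrm{s.t.}\ \ Ax - y = 0,

which matches the general ADMM template and is a special case of problem (3) below with B = −I and c = 0.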

So far, the above studies of stochastic optimization methods rely heavily on the problems being convex or strongly convex. However, there exist many useful nonconvex models in machine learning, such as nonconvex empirical risk minimization models [35] and deep learning [36]. Thus, the study of nonconvex optimization methods is much needed. Recently, some works have focused on stochastic gradient methods for large-scale nonconvex optimization. For example, [37, 38] have established an iteration complexity of O(1/ε²) for SGD to obtain an ε-stationary solution of nonconvex problems. [39, 40, 41] have proved that variance reduced stochastic gradient methods such as the nonconvex SVRG and SAGA reach an improved iteration complexity of O(1/ε). At the same time, [42] has proved that variance reduced stochastic gradient methods also reach the iteration complexity of O(1/ε) for nonconvex nonsmooth composite problems. More recently, [43] proposes a faster nonconvex stochastic optimization method (Natasha) that exploits the strong nonconvexity parameter. [44] proposes a faster gradient-based nonconvex optimization method by using the Catalyst approach of [45].

Similarly, the above nonconvex methods still struggle with some complicated nonconvex problems, such as nonconvex graph-guided regularized risk minimization [1] and tensor decomposition [46]. Recently, some works [47, 48, 49, 50, 46] have begun to study the ADMM method for nonconvex optimization, but they only focus on deterministic ADMMs. Because they need to compute the empirical loss on all training examples at each iteration, these nonconvex ADMMs are not well suited to large-scale learning problems. Recently, [13] proposed a distributed, asynchronous and incremental algorithm based on the ADMM method for large-scale nonconvex problems, but this method has difficulty with the nonconvex problem (2), whose regularizers (such as the graph-guided fused Lasso and the overlapping group Lasso) are nonseparable and nonsmooth. A nonconvex primal-dual splitting (NESTT [14]) method has been proposed for distributed and stochastic optimization, but it also has difficulty with the nonconvex problem (2). More recently, our initial manuscript [1] proposes stochastic ADMMs with variance reduction (i.e., nonconvex SVRG-ADMM and nonconvex SAGA-ADMM) for optimizing nonconvex problems with complicated structured regularizers such as the graph-guided fused Lasso, the overlapping group Lasso, and sparse-plus-low-rank penalties. In addition, our initial manuscript [1] and Zheng and Kwok's paper [15] simultaneously propose the nonconvex SVRG-ADMM method.¹

¹ The first version of our manuscript [1] (https://arxiv.org/abs/1610.02758v1) proposes both the nonconvex SVRG-ADMM and the nonconvex SAGA-ADMM, and has been available online since Oct. 10, 2016. The first version of [15] (https://arxiv.org/abs/1604.07070v1) proposes only the convex SVRG-ADMM, has been available online since Apr. 24, 2016, and is titled 'Fast-and-Light Stochastic ADMM'. The second version of [15] (https://arxiv.org/abs/1604.07070v2) adds the nonconvex SVRG-ADMM, has been available online since Oct. 12, 2016, and is retitled 'Stochastic Variance-Reduced ADMM'.

At present, to our knowledge, there still exist two important questions to be addressed:

  • Is the general stochastic ADMM without the VR technique convergent for nonconvex optimization?

  • If it is convergent, what is the convergence rate of the general stochastic ADMM for nonconvex optimization?

In this paper, we provide positive answers to both questions by developing a class of mini-batch stochastic ADMMs for nonconvex optimization. Specifically, we study the mini-batch stochastic ADMMs for optimizing the nonconvex nonsmooth problem below:

\min_{x \in \mathbb{R}^d,\, y \in \mathbb{R}^q} \ f(x) + g(y) \quad \mathrm{s.t.}\ \ Ax + By = c,   (3)

where f(x) = (1/n) Σ_{i=1}^n f_i(x), each f_i(x) is a nonconvex and smooth loss function, g(y) is nonsmooth and possibly nonconvex, and A ∈ R^{p×d}, B ∈ R^{p×q} and c ∈ R^p denote the given matrices and vector, respectively. The problem (3) is inspired by the structural risk minimization in machine learning [51]. In summary, our main contributions are four-fold as follows:

  • We propose the mini-batch stochastic ADMM for nonconvex nonsmooth optimization. Moreover, we prove that, given an appropriate mini-batch size, the mini-batch stochastic ADMM reaches a convergence rate of O(1/T) for obtaining a stationary point.

  • We extend the mini-batch stochastic gradient method to both the nonconvex SVRG-ADMM and SAGA-ADMM proposed in our initial manuscript [1]. Moreover, we prove that these stochastic ADMMs also reach a convergence rate of O(1/T) without any condition on the mini-batch size.

  • We provide a specific parameter selection for the step size η of the stochastic gradients and the penalty parameter ρ of the augmented Lagrangian function.

  • Some numerical experiments demonstrate the effectiveness of the proposed algorithms.

In addition, Table I summarizes the convergence rates of the stochastic/incremental ADMMs for optimizing nonconvex problems.

1.1 Notations

‖·‖ denotes the Euclidean norm of a vector or the spectral norm of a matrix. I_p denotes a p-dimensional identity matrix. G ≻ 0 denotes a positive definite matrix G, and ‖x‖²_G := xᵀGx. A⁺ denotes the generalized inverse of a matrix A. σ_min(M) denotes the smallest eigenvalue of a matrix M; φ_max(G) and φ_min(G) denote the largest and smallest eigenvalues of a positive definite matrix G, respectively. The remaining shorthand notation used in this paper is introduced in the analysis where it first appears.

2 Nonconvex Mini-batch Stochastic ADMM without VR

In this section, we propose a mini-batch stochastic ADMM to optimize the nonconvex problem (3), and we study its convergence. In particular, we prove that, given an appropriate mini-batch size, it reaches the convergence rate of O(1/T).

First, we review the deterministic ADMM for solving the problem (3). The augmented Lagrangian function of (3) is defined as follows:

\mathcal{L}_\rho(x, y, \lambda) = f(x) + g(y) - \langle \lambda,\, Ax + By - c \rangle + \frac{\rho}{2}\|Ax + By - c\|^2,   (4)

where λ is the Lagrange multiplier and ρ > 0 is the penalty parameter. At the t-th iteration, the ADMM executes the following updates:

y_{t+1} = \arg\min_y \mathcal{L}_\rho(x_t, y, \lambda_t),   (5)
x_{t+1} = \arg\min_x \mathcal{L}_\rho(x, y_{t+1}, \lambda_t),   (6)
\lambda_{t+1} = \lambda_t - \rho(Ax_{t+1} + By_{t+1} - c).   (7)
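As a concrete illustration (our own worked example under the sign convention of (4), not a formula quoted from the paper): when g(y) = τ‖y‖₁, B = −I and c = 0, the y-subproblem (5) is a proximal step with a closed-form soft-thresholding solution,

y_{t+1} = \arg\min_y \ \tau\|y\|_1 + \langle \lambda_t, y\rangle + \frac{\rho}{2}\|y - Ax_t\|^2 = \mathrm{soft}_{\tau/\rho}\Big(Ax_t - \frac{\lambda_t}{\rho}\Big), \qquad \mathrm{soft}_{\kappa}(z)_i = \mathrm{sign}(z_i)\max(|z_i| - \kappa, 0),

so the y-update costs only one matrix-vector product and one thresholding pass.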

Next, we give a mild assumption, which is standard in stochastic optimization [37, 38] and in the initial convex stochastic ADMM [24].

Assumption 1.

For the smooth function f(x), there exists a stochastic first-order oracle that returns a noisy estimate G(x, ξ) of the gradient ∇f(x), and this noisy estimate satisfies

\mathbb{E}_\xi[G(x, \xi)] = \nabla f(x),   (8)
\mathbb{E}_\xi\|G(x, \xi) - \nabla f(x)\|^2 \le \delta^2,   (9)

where the expectation is taken with respect to the random variable ξ, and δ > 0 is the variance bound.

Let M be the size of the mini-batch I_t = {ξ_1, ξ_2, …, ξ_M}, a set of i.i.d. random variables, and define the mini-batch stochastic gradient as

\nabla f_{\mathcal{I}_t}(x) = \frac{1}{M}\sum_{i=1}^{M} G(x, \xi_i).

Clearly, we have

\mathbb{E}[\nabla f_{\mathcal{I}_t}(x)] = \nabla f(x),   (10)
\mathbb{E}\|\nabla f_{\mathcal{I}_t}(x) - \nabla f(x)\|^2 \le \frac{\delta^2}{M}.   (11)
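A quick numerical sketch of this estimator (our own illustrative NumPy code, assuming additive Gaussian oracle noise; the names true_grad and noisy_grad are ours): averaging M oracle calls leaves the estimate unbiased while shrinking its mean squared error by a factor of M, in line with (10)-(11).

import numpy as np

rng = np.random.default_rng(0)

def true_grad(x):
    return 2.0 * x                            # gradient of f(x) = ||x||^2, a stand-in smooth loss

def noisy_grad(x, delta=1.0):
    return true_grad(x) + delta * rng.standard_normal(x.shape)   # oracle G(x, xi)

def minibatch_grad(x, M):
    return np.mean([noisy_grad(x) for _ in range(M)], axis=0)    # (1/M) * sum of oracle calls

x = np.ones(5)
for M in (1, 10, 100):
    errs = [np.sum((minibatch_grad(x, M) - true_grad(x)) ** 2) for _ in range(2000)]
    print(M, np.mean(errs))                   # roughly delta^2 * dim / M, i.e., decays like 1/M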
1:  Input: number of iterations T, mini-batch size M, step size η and penalty parameter ρ;
2:  Initialize: x_0, y_0 and λ_0;
3:  for t = 0, 1, …, T−1 do
4:      Uniformly randomly pick a mini-batch I_t of size M from {1, 2, …, n};
5:      y_{t+1} = argmin_y L_ρ(x_t, y, λ_t);
6:      Update x_{t+1} by minimizing the approximated function (12), i.e., via the linearized update (13);
7:      λ_{t+1} = λ_t − ρ(Ax_{t+1} + By_{t+1} − c);
8:  end for
9:  Output: (x*, y*, λ*) chosen uniformly at random from {(x_t, y_t, λ_t)}_{t=1}^T.
Algorithm 1 Mini-batch Stochastic ADMM (STOC-ADMM) for Nonconvex Nonsmooth Optimization

In the stochastic ADMM algorithm, we can update y_{t+1} and λ_{t+1} by (5) and (7), respectively, as in the deterministic ADMM. However, to update the variable x, we define an approximated function of the form:

\hat{\mathcal{L}}_\rho(x, y_{t+1}, \lambda_t) = f(x_t) + \nabla f_{\mathcal{I}_t}(x_t)^T (x - x_t) + \frac{1}{2\eta}\|x - x_t\|_Q^2 - \langle \lambda_t,\, Ax + By_{t+1} - c\rangle + \frac{\rho}{2}\|Ax + By_{t+1} - c\|^2,   (12)

where ∇f_{I_t}(x_t) is the mini-batch stochastic gradient, η > 0 is the step size, and Q ≻ 0. By minimizing (12) with respect to the variable x, we have

x_{t+1} = \Big(\frac{Q}{\eta} + \rho A^T A\Big)^{-1}\Big(\frac{Q}{\eta}x_t - \nabla f_{\mathcal{I}_t}(x_t) + A^T\lambda_t - \rho A^T(By_{t+1} - c)\Big).

When d is large, computing this inverse is expensive, and storing the matrix may also be problematic. To avoid both, we can use the inexact Uzawa method [52] to linearize the last term in (12). In other words, we set Q = rI_d − ηρAᵀA with

r \ge \eta\rho\|A^T A\| + 1

to ensure Q ⪰ I_d. Then we have

x_{t+1} = x_t - \frac{\eta}{r}\Big(\nabla f_{\mathcal{I}_t}(x_t) - A^T\lambda_t + \rho A^T(Ax_t + By_{t+1} - c)\Big).   (13)

Finally, we give the algorithmic framework of the mini-batch stochastic ADMM (STOC-ADMM) in Algorithm 1.
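For illustration, the following NumPy sketch shows one possible instantiation of Algorithm 1 for the ℓ1-regularized case g(y) = τ‖y‖₁ with B = −I and c = 0 (our own assumptions; the subproblem solvers and the linearized x-update follow the reconstruction of (12)-(13) above and may differ in detail from the paper's implementation). The per-sample gradient grad_i(x, i) is assumed to return ∇f_i(x).

import numpy as np

def soft_threshold(z, kappa):
    return np.sign(z) * np.maximum(np.abs(z) - kappa, 0.0)

def stoc_admm(grad_i, A, n, d, tau=0.1, rho=1.0, eta=0.1, M=32, T=1000, seed=0):
    """Mini-batch stochastic ADMM sketch for min_x (1/n) sum_i f_i(x) + tau * ||A x||_1."""
    rng = np.random.default_rng(seed)
    p = A.shape[0]
    x, y, lam = np.zeros(d), np.zeros(p), np.zeros(p)
    r = eta * rho * np.linalg.norm(A.T @ A, 2) + 1.0            # inexact-Uzawa constant
    iterates = []
    for t in range(T):
        idx = rng.choice(n, size=M, replace=False)              # step 4: pick a mini-batch
        y = soft_threshold(A @ x - lam / rho, tau / rho)        # step 5: closed-form y-update (B = -I, c = 0)
        g = np.mean([grad_i(x, i) for i in idx], axis=0)        # mini-batch stochastic gradient
        x = x - (eta / r) * (g - A.T @ lam + rho * A.T @ (A @ x - y))   # step 6: linearized x-update (13)
        lam = lam - rho * (A @ x - y)                           # step 7: dual update
        iterates.append((x.copy(), y.copy(), lam.copy()))
    return iterates[rng.integers(len(iterates))]                # output: uniformly random iterate

For example, one could plug in the per-sample gradient of a nonconvex sigmoid loss for grad_i to obtain a graph-guided, nonconvex binary classification instance of problem (3).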

2.1 Convergence Analysis of Nonconvex Mini-batch STOC-ADMM

In this subsection, we study the convergence and iteration complexity of the nonconvex mini-batch STOC-ADMM. First, we give some mild assumptions:

Assumption 2.

For the smooth function f(x), its gradient is Lipschitz continuous with constant L > 0, such that

\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|, \quad \forall x, y \in \mathbb{R}^d,   (14)

which implies

f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d.   (15)
Assumption 3.

f(x) and g(y) are both lower bounded; denote f* = inf_x f(x) and g* = inf_y g(y).

Assumption 4.

g(y) is a proper lower semi-continuous function.

Assumption 5.

The matrix A has full row rank.

Assumption 2 has been widely used in the convergence analysis of nonconvex algorithms [39, 40]. Assumptions 3-4 have been used in the study of ADMM for nonconvex optimization [46]. Assumption 5 has been used in the convergence analysis of ADMM [53]. Next, we define the ε-stationary point of the nonconvex problem (3):

Definition 1.

For ε > 0, the point (x*, y*, λ*) is said to be an ε-stationary point of the nonconvex problem (3) if it holds that

\mathbb{E}\|\nabla f(x^*) - A^T \lambda^*\|^2 \le \epsilon, \quad \mathbb{E}\big[\mathrm{dist}(B^T\lambda^*, \partial g(y^*))^2\big] \le \epsilon, \quad \mathbb{E}\|Ax^* + By^* - c\|^2 \le \epsilon,   (16)

where dist(u, S) := inf_{v∈S} ‖u − v‖, and ∂g denotes the subgradient (subdifferential) of g. If ε = 0, the point (x*, y*, λ*) is said to be a stationary point of (3).

Note that the above inequalities (16) are equivalent to \mathbb{E}[\mathrm{dist}(0, \partial \mathcal{L}(x^*, y^*, \lambda^*))^2] \le \epsilon, where

\mathcal{L}(x, y, \lambda) = f(x) + g(y) - \langle \lambda,\, Ax + By - c\rangle

is the Lagrangian function of (3). In the following, based on the above assumptions and definition, we study the convergence and iteration complexity of the mini-batch stochastic ADMM.
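As a quick illustration of Definition 1 (our own sketch for the special case g(y) = τ‖y‖₁, where the subdifferential distance has a simple coordinate-wise form; this is not code from the paper), one can monitor the three stationarity residuals of an iterate as follows.

import numpy as np

def l1_subdiff_dist(v, y, tau):
    """Distance from v to the subdifferential of tau * ||.||_1 at y, computed coordinate-wise."""
    d = np.where(y != 0,
                 np.abs(v - tau * np.sign(y)),       # subgradient set is {tau * sign(y_i)}
                 np.maximum(np.abs(v) - tau, 0.0))   # subgradient set is [-tau, tau]
    return np.linalg.norm(d)

def stationarity_residuals(x, y, lam, grad_f, A, B, c, tau):
    r1 = np.linalg.norm(grad_f(x) - A.T @ lam) ** 2  # gradient condition on x
    r2 = l1_subdiff_dist(B.T @ lam, y, tau) ** 2     # subgradient condition on y
    r3 = np.linalg.norm(A @ x + B @ y - c) ** 2      # feasibility of the linear constraint
    return r1, r2, r3                                # all three are at most eps at an eps-stationary point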

Lemma 1.

Suppose the sequence {(x_t, y_t, λ_t)} is generated by Algorithm 1. Then E‖λ_{t+1} − λ_t‖² is upper bounded in terms of the successive primal iterate differences and the mini-batch gradient variance δ²/M.

A detailed proof of Lemma 1 is provided in Appendix A.1. Lemma 1 gives the upper bound of E‖λ_{t+1} − λ_t‖². Given a sequence {(x_t, y_t, λ_t)} generated by Algorithm 1, we define a useful Lyapunov-type sequence (Φ_t) as follows:

(17)

For notational simplicity, we also introduce several shorthand constants that are used in the statements below.

Lemma 2.

Suppose that the sequence {(x_t, y_t, λ_t)} is generated by Algorithm 1, and suppose the step size η and the penalty parameter ρ, respectively, satisfy

(18)

Then the sequence (Φ_t) defined in (17) satisfies

(19)

A detailed proof of Lemma 2 is provided in Appendix A.2. Lemma 2 gives a property of the sequence (Φ_t). Moreover, (18) provides a specific parameter selection for the step size η and the penalty parameter ρ, in which the selection of the step size depends on the other parameters of the problem.

Lemma 3.

Suppose the sequence {(x_t, y_t, λ_t)} is generated by Algorithm 1. Under the same conditions as in Lemma 2, the sequence (Φ_t) has a lower bound.

A detailed proof of Lemma 3 is provided in Appendix A.3. Lemma 3 gives a lower bound of the sequence (Φ_t).

Theorem 1.

Suppose the sequence {(x_t, y_t, λ_t)} is generated by Algorithm 1, and let the mini-batch size M satisfy

(20)

where Φ* denotes a lower bound of the sequence (Φ_t). Then, choosing the number of iterations T of order 1/ε, the output (x*, y*, λ*) of Algorithm 1 is an ε-stationary point of the problem (3).

A detailed proof of Theorem 1 is provided in Appendix A.4. Theorem 1 shows that, given an appropriate mini-batch size M, the mini-batch stochastic ADMM has a convergence rate of O(1/T) for obtaining an ε-stationary point of the nonconvex problem (3). Moreover, since the required mini-batch size scales as M = O(T) = O(1/ε), the IFO (Incremental First-order Oracle [40]) complexity of the mini-batch stochastic ADMM is O(1/ε²) for obtaining an ε-stationary point, while the IFO complexity of the deterministic proximal ADMM [46] is O(n/ε). When n ≥ O(1/ε), the mini-batch stochastic ADMM needs less IFO complexity than the deterministic ADMM.
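A short back-of-the-envelope version of this comparison (our own arithmetic, under the assumption that the mini-batch size must scale as M = O(T) so that the sampling variance δ²/M stays of order 1/T):

\mathrm{IFO}_{\mathrm{stoc}} = T \cdot M = O(T^2) = O(1/\epsilon^2), \qquad \mathrm{IFO}_{\mathrm{det}} = T \cdot n = O(n/\epsilon), \qquad O(1/\epsilon^2) \le O(n/\epsilon) \iff n \ge 1/\epsilon.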

In the convergence analysis, given an appropriate mini-batch size M satisfying the condition (20), the step size η only needs to satisfy the condition (18), rather than the decaying step size used in the convex stochastic ADMM [24].

3 Nonconvex Mini-batch SVRG-ADMM

In this section, we propose a mini-batch nonconvex stochastic variance reduced gradient ADMM (SVRG-ADMM) to solve the problem (3), which uses a multi-stage strategy to progressively reduce the variance of the stochastic gradients.

Algorithm 2 gives the algorithmic framework of the mini-batch SVRG-ADMM for nonconvex optimization. In Algorithm 2, the stochastic gradient v_t^s is unbiased, i.e., E[v_t^s] = ∇f(x_t^s). In the following, we give an upper bound on the variance of the stochastic gradient v_t^s.

Lemma 4.

In Algorithm 2, set v_t^s = (1/M) Σ_{i∈I_t} [∇f_i(x_t^s) − ∇f_i(x̃^{s−1})] + ∇f(x̃^{s−1}); then it holds that

\mathbb{E}\|v_t^s - \nabla f(x_t^s)\|^2 \le \frac{L^2}{M}\,\mathbb{E}\|x_t^s - \tilde{x}^{s-1}\|^2,   (21)

where E‖v_t^s − ∇f(x_t^s)‖² denotes the variance of the stochastic gradient v_t^s.

A detailed proof of Lemma 4 is provided in Appendix B.1.

1:  Input: mini-batch size M, epoch length m, number of epochs S, step size η and penalty parameter ρ;
2:  Initialize: x̃^0, y_m^0 and λ_m^0;
3:  for s = 1, 2, …, S do
4:      x_0^s = x̃^{s−1}, y_0^s = y_m^{s−1} and λ_0^s = λ_m^{s−1};
5:      ∇f(x̃^{s−1}) = (1/n) Σ_{i=1}^n ∇f_i(x̃^{s−1});
6:     for t = 0, 1, …, m−1 do
7:         Uniformly randomly pick a mini-batch I_t from {1, 2, …, n};
8:         v_t^s = (1/M) Σ_{i∈I_t} [∇f_i(x_t^s) − ∇f_i(x̃^{s−1})] + ∇f(x̃^{s−1});
9:         y_{t+1}^s = argmin_y L_ρ(x_t^s, y, λ_t^s);
10:         Update x_{t+1}^s by minimizing the approximated function (12) with v_t^s in place of ∇f_{I_t}(x_t), i.e., via the linearized update (13);
11:         λ_{t+1}^s = λ_t^s − ρ(Ax_{t+1}^s + By_{t+1}^s − c);
12:     end for
13:      x̃^s = x_m^s;
14:  end for
15:  Output: (x*, y*, λ*) chosen uniformly at random from {(x_t^s, y_t^s, λ_t^s)}.
Algorithm 2 Mini-batch SVRG-ADMM for Nonconvex Nonsmooth Optimization

Lemma 4 shows that the variance of the stochastic gradient v_t^s has the upper bound (L²/M) E‖x_t^s − x̃^{s−1}‖². As the number of iterations increases, both x_t^s and x̃^{s−1} approach the same stationary point, so this upper bound shrinks and the variance of the stochastic gradient vanishes. In other words, the variance of the stochastic gradient is progressively reduced.
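To make the variance-reduction mechanism concrete, here is a minimal NumPy sketch of the mini-batch SVRG gradient estimator used in step 8 of Algorithm 2 (our own illustration; grad_i denotes an assumed user-supplied per-sample gradient):

import numpy as np

def svrg_gradient(x, x_snapshot, full_grad_snapshot, grad_i, batch_idx):
    """Mini-batch SVRG estimator: unbiased, variance bounded by (L^2/M) * ||x - x_snapshot||^2."""
    correction = np.mean([grad_i(x, i) - grad_i(x_snapshot, i) for i in batch_idx], axis=0)
    return correction + full_grad_snapshot   # (1/M) sum_i [grad_i(x) - grad_i(x_tilde)] + grad f(x_tilde)

# Usage inside one epoch (sketch):
# full_grad_snapshot = np.mean([grad_i(x_tilde, i) for i in range(n)], axis=0)   # step 5
# batch_idx = rng.choice(n, size=M, replace=False)                               # step 7
# v = svrg_gradient(x, x_tilde, full_grad_snapshot, grad_i, batch_idx)           # step 8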

3.1 Convergence Analysis of Nonconvex Mini-batch SVRG-ADMM

In this subsection, we study the convergence and iteration complexity of the mini-batch nonconvex SVRG-ADMM. First, we give an upper bound on E‖λ_{t+1}^s − λ_t^s‖².

Lemma 5.

Suppose the sequence {(x_t^s, y_t^s, λ_t^s)} is generated by Algorithm 2. Then E‖λ_{t+1}^s − λ_t^s‖² is upper bounded in terms of the successive primal iterate differences and the variance bound of Lemma 4.

A detailed proof of Lemma 5 is provided in Appendix B.2. Given the sequence {(x_t^s, y_t^s, λ_t^s)} generated by Algorithm 2, we define a useful Lyapunov-type sequence (Ψ_t^s) as follows:

(22)

where (c_t) is a positive sequence.

Lemma 6.

Suppose the sequence {(x_t^s, y_t^s, λ_t^s)} is generated by Algorithm 2, and suppose the positive sequence (c_t) satisfies, for t = 0, 1, …, m−1,

(23)

and suppose the step size η and the penalty parameter ρ, respectively, satisfy

(24)

Then the sequence defined by

(25)

is positive, and the sequence (Ψ_t^s) monotonically decreases.

A detailed proof of Lemma 6 is provided in Appendix B.3. Lemma 6 shows that the sequence (Ψ_t^s) monotonically decreases. Moreover, (24) provides a specific parameter selection for the step size η and the penalty parameter ρ in Algorithm 2.

Lemma 7.

Suppose the sequence {(x_t^s, y_t^s, λ_t^s)} is generated by Algorithm 2. Under the same conditions as in Lemma 6, the sequence (Ψ_t^s) has a lower bound.

Lemma 7 shows that the sequence (Ψ_t^s) has a lower bound. The proof of Lemma 7 is the same as that of Lemma 3. We define a useful variable as follows:

(26)

In the following, we will analyze the convergence and iteration complexity of the nonconvex SVRG-ADMM based on the above lemmas.

Theorem 2.

Suppose the sequence {(x_t^s, y_t^s, λ_t^s)} is generated by Algorithm 2, and let the parameters satisfy

(27)

where Ψ* denotes a lower bound of the sequence (Ψ_t^s). Then, choosing the total number of iterations T = Sm of order 1/ε, the output (x*, y*, λ*) of Algorithm 2 is an ε-stationary point of the problem (3).

A detailed proof of Theorem 2 is provided in Appendix B.4. Theorem 2 shows that the mini-batch SVRG-ADMM for nonconvex optimization has a convergence rate of O(1/T). Moreover, when the epoch length and the mini-batch size are chosen appropriately, the mini-batch SVRG-ADMM needs less IFO complexity than the deterministic ADMM.

Since the mini-batch SVRG-ADMM uses the VR technique, its convergence does not depend on the mini-batch size M. In other words, when M = 1, the mini-batch nonconvex SVRG-ADMM reduces to the initial nonconvex SVRG-ADMM in [1], which also has a convergence rate of O(1/T). However, by Lemma 4, the variance of the stochastic gradient in the mini-batch SVRG-ADMM decreases faster than that in the initial nonconvex SVRG-ADMM.

4 Nonconvex Mini-batch SAGA-ADMM

In this section, we propose a mini-batch nonconvex stochastic average gradient ADMM (SAGA-ADMM) that additionally uses the old gradients estimated in previous iterations, which is inspired by the SAGA method [9].

The algorithmic framework of the SAGA-ADMM is given in Algorithm 3. In Algorithm 3, the stochastic gradient