Stochastic Alternating Direction Method of Multipliers with Variance Reduction for Nonconvex Optimization

10/10/2016 ∙ by Feihu Huang et al., Nanjing University of Aeronautics and Astronautics, Simon Fraser University

In this paper, we study the stochastic alternating direction method of multipliers (ADMM) for nonconvex optimization, and propose three classes of nonconvex stochastic ADMM with variance reduction, based on different variance-reduced stochastic gradients. Specifically, the first class, the nonconvex stochastic variance reduced gradient ADMM (SVRG-ADMM), uses a multi-stage scheme to progressively reduce the variance of the stochastic gradients. The second is the nonconvex stochastic average gradient ADMM (SAG-ADMM), which additionally uses the old gradients estimated in previous iterations. The third, SAGA-ADMM, is an extension of the SAG-ADMM method. Moreover, under some mild conditions, we establish an iteration complexity bound of O(1/ϵ) for the proposed methods to obtain an ϵ-stationary solution of the nonconvex problems. In particular, we provide a general framework to analyze the iteration complexity of these nonconvex stochastic ADMM methods with variance reduction. Finally, numerical experiments demonstrate the effectiveness of our methods.


1 Introduction

Stochastic optimization methods are a class of powerful tools for solving large-scale problems in machine learning. For example, stochastic gradient descent (SGD) (Bottou, 2004) is an efficient method for solving the finite-sum optimization problem, which is fundamental to machine learning. Specifically, SGD computes the gradient of only one sample instead of visiting all samples in each iteration. Despite its scalability, due to the variance introduced by stochastic sampling, SGD has a slower convergence rate than the batch gradient method. Recently, many accelerated versions of SGD have been proposed to reduce this variance and obtain better convergence rates. For example, the stochastic average gradient (SAG) method (Roux et al., 2012) obtains a fast convergence rate by incorporating the old gradients estimated in previous iterations. The stochastic dual coordinate ascent (SDCA) method (Shalev-Shwartz and Zhang, 2013) performs stochastic coordinate ascent on the dual problem and also obtains a fast convergence rate. Moreover, the accelerated randomized proximal coordinate gradient method (APCG) (Lin et al., 2015) accelerates SDCA by using Nesterov's acceleration technique (Nesterov, 2004). However, these accelerated methods obtain faster convergence rates than standard SGD at the cost of requiring considerable space to store old gradients or dual variables. To resolve this dilemma, the stochastic variance reduced gradient (SVRG) methods (Johnson and Zhang, 2013; Xiao and Zhang, 2014) were proposed, which enjoy a fast convergence rate with no extra space for intermediate gradients or dual variables. Moreover, Defazio et al. (2014) proposed a method called SAGA, which extends the SAG method and enjoys a better theoretical convergence rate than both SAG and SVRG.
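To make the variance-reduction idea concrete, the following sketch (our own toy least-squares components; all names here are ours, not from the paper) builds the SVRG-style estimator ∇f_i(x) − ∇f_i(x̃) + ∇f(x̃), checks that it is unbiased, and checks that its variance shrinks as the snapshot x̃ approaches the iterate x:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
A = rng.normal(size=(n, p))
b = rng.normal(size=n)

def grad_i(x, i):
    # gradient of the toy component f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):
    return A.T @ (A @ x - b) / n

def svrg_grad(x, x_snap, i):
    # SVRG estimator: grad f_i(x) - grad f_i(x_snap) + grad f(x_snap)
    return grad_i(x, i) - grad_i(x_snap, i) + full_grad(x_snap)

x = rng.normal(size=p)
near = x + 0.1 * rng.normal(size=p)        # snapshot close to the iterate

# Unbiasedness: averaging over all indices recovers the full gradient.
v_mean = np.mean([svrg_grad(x, near, i) for i in range(n)], axis=0)
assert np.allclose(v_mean, full_grad(x))

# Variance shrinks as the snapshot approaches the iterate.
def est_var(snap):
    g = full_grad(x)
    return np.mean([np.sum((svrg_grad(x, snap, i) - g) ** 2) for i in range(n)])

far = x + 10.0 * rng.normal(size=p)        # stale snapshot far from the iterate
assert est_var(near) < est_var(far)
```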

Though the above gradient-based methods can effectively solve many problems in machine learning, they still struggle with some complicated problems, such as the graph-guided SVM (Ouyang et al., 2013) and latent variable graphical models (Ma et al., 2013). It is well known that the alternating direction method of multipliers (ADMM) (Gabay and Mercier, 1976; Boyd et al., 2011) has been advocated as an efficient optimization method in many application fields such as machine learning (Danaher et al., 2014) and statistics (Fang et al., 2015). However, the offline or batch ADMM needs to compute an empirical risk loss over all training samples at each iteration, which makes it unsuitable for large-scale learning problems. Thus, online or stochastic versions of ADMM (Wang and Banerjee, 2012; Suzuki, 2013; Ouyang et al., 2013) have been developed for large-scale/stochastic optimization. Due to the variance of the stochastic gradients, these initial stochastic ADMM methods also suffer from slow convergence rates. Recently, some accelerated stochastic ADMM methods have been proposed to efficiently solve large-scale learning problems. For example, a fast stochastic ADMM (Zhong and Kwok, 2014) was proposed by incorporating previously estimated gradients. Azadi and Sra (2014) proposed an accelerated stochastic ADMM by using Nesterov's accelerated method (Nesterov, 2004). Moreover, an adaptive stochastic ADMM (Zhao et al., 2015a) was proposed by using adaptive stochastic gradients. The stochastic dual coordinate ascent ADMM (Suzuki, 2014) obtains a fast convergence rate by solving the dual problem. More recently, scalable stochastic ADMMs (Zhao et al., 2015b; Zheng and Kwok, 2016) were developed, which obtain fast convergence rates with no extra space for previous gradients or dual variables.

So far, the above studies of stochastic optimization methods rely heavily on strongly convex or convex objective functions. However, there exist many useful nonconvex models in machine learning, such as nonconvex robust empirical risk minimization models (Aravkin and Davis, 2016) and deep learning (LeCun et al., 2015). Thus, the study of stochastic optimization methods for nonconvex problems is much needed. Recently, several works have studied stochastic gradient methods for nonconvex optimization. For example, Ghadimi and Lan (2016) and Ghadimi et al. (2016) established an iteration complexity of O(1/ϵ²) for SGD to obtain an ϵ-stationary solution of a nonconvex problem. Allen-Zhu and Hazan (2016) and Reddi et al. (2016a, b) proved that both the nonconvex SVRG and SAGA methods obtain an iteration complexity of O(1/ϵ). In particular, Li et al. (2016) studied the stochastic gradient method for nonconvex sparse learning via variance reduction, which reaches an asymptotically linear convergence rate by exploiting the properties of specific problems. Moreover, Reddi et al. (2016c) and Aravkin and Davis (2016) studied variance-reduced stochastic methods for nonconvex nonsmooth composite problems, and proved that they have an iteration complexity of O(1/ϵ). At the same time, Hajinezhad et al. (2016) proposed a distributed and stochastic primal-dual splitting method for nonconvex nonsmooth problems and proved that it also has an iteration complexity of O(1/ϵ) to obtain an ϵ-stationary solution.

Similarly, the above nonconvex methods are hardly competent for some complicated nonconvex problems, such as graph-guided regularized risk minimization and tensor decomposition (Jiang et al., 2016). Fortunately, the ADMM method can handle these complicated nonconvex problems well, though it may sometimes fail to converge. Recently, several works (Wang et al., 2015; Yang et al., 2015; Hong et al., 2016; Jiang et al., 2016) have been devoted to the study of the ADMM method for nonconvex optimization. However, they mainly focus on the deterministic ADMM for nonconvex problems. Since they compute the empirical loss on all training examples at each iteration, these nonconvex ADMMs are not well suited to large-scale learning problems. Though Hong (2014) proposed a distributed, asynchronous and incremental algorithm based on ADMM for large-scale nonconvex problems, the proposed method is still hardly competent for complicated nonconvex problems such as graph-guided models, and its iteration complexity is not provided. At present, to the best of our knowledge, there exist few studies of stochastic ADMM for nonconvex optimization. In this paper, we therefore study stochastic ADMM methods for solving nonconvex nonsmooth stochastic optimization problems of the form:

min_{x, y} f(x) + g(y), s.t. Ax + By = c,   (1)

where f(x) := 𝔼_ξ[f(x, ξ)] is a nonconvex and smooth function; ξ is a random vector; g(y) is nonsmooth and possibly nonconvex; x ∈ ℝ^p, y ∈ ℝ^q, A ∈ ℝ^{d×p}, B ∈ ℝ^{d×q}, and c ∈ ℝ^d. Problem (1) is inspired by the structural risk minimization in machine learning (Vapnik, 2013). Here, the random vector ξ obeys a fixed but unknown distribution, from which we are able to draw a set of n i.i.d. samples. In general, it is difficult to evaluate 𝔼_ξ[f(x, ξ)] exactly, so we use the sample average approximation to approximate it. Throughout the paper, let f(x) = (1/n) Σ_{i=1}^n f_i(x) denote the average of the n nonconvex and smooth component functions f_i(x), i = 1, …, n.
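As a concrete (hypothetical) instance of such a structured problem, the sketch below sets up a graph-guided penalty of the form min_x (1/n)Σ_i f_i(x) + λ‖y‖₁ subject to Ax − y = 0, where A is an edge-incidence matrix we construct ourselves; this only illustrates the constraint structure of (1), and is not the paper's experimental setup:

```python
import numpy as np

# Graph-guided sparse penalty: each row of A x is a difference x_u - x_v
# along one edge of a feature graph, and g(y) = lam * ||y||_1 penalizes
# large jumps across edges, coupled to x via the constraint A x - y = 0.
p = 4
edges = [(0, 1), (1, 2), (2, 3)]            # a chain graph over the 4 features
A = np.zeros((len(edges), p))
for k, (u, v) in enumerate(edges):
    A[k, u], A[k, v] = 1.0, -1.0            # edge-incidence row for (u, v)

x = np.array([1.0, 1.0, 0.0, 0.0])
y = A @ x                                    # feasible y for the constraint
lam = 0.1
penalty = lam * np.sum(np.abs(y))            # g(y) on the edge differences

assert np.allclose(A @ x - y, 0.0)           # constraint holds
assert np.isclose(penalty, 0.1)              # only edge (1, 2) has a jump
```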

Moreover, we propose three classes of nonconvex stochastic ADMM with variance reduction for problem (1), based on different variance-reduced stochastic gradients. Specifically, the first class, SVRG-ADMM, uses a multi-stage scheme to progressively reduce the variance of the stochastic gradients. The second, SAG-ADMM, reduces the variance of the stochastic gradient by additionally using the old gradients estimated in previous iterations. The third, SAGA-ADMM, is an extension of SAG-ADMM that uses an unbiased stochastic gradient, like the SVRG-ADMM. In summary, our main contributions are threefold:

  • We propose three classes of nonconvex stochastic ADMM with variance reduction, based on different variance-reduced stochastic gradients.

  • We study the convergence of the proposed methods, and prove that they have an iteration complexity bound of O(1/ϵ) to obtain an ϵ-stationary solution of the nonconvex problems. In particular, we provide a general framework to analyze the iteration complexity of the nonconvex stochastic ADMM with variance reduction.

  • Finally, some numerical experiments demonstrate the effectiveness of the proposed methods.

1.1 Organization

The paper is organized as follows. In Section 2, we propose three classes of stochastic ADMM with variance reduction, based on different variance-reduced stochastic gradients. In Section 3, we study the convergence and iteration complexity of the proposed methods. Section 4 presents numerical experiments whose results back up the effectiveness of our methods. Section 5 gives some conclusions. Most details of the theoretical analysis and proofs are relegated to the Appendices.

1.2 Notations

‖·‖ denotes the Euclidean norm of a vector or the spectral norm of a matrix. ∂g(y) denotes the subgradient of the function g(y). Q ≻ 0 implies that the matrix Q is positive definite. Let ‖x‖²_Q = x^T Q x. Let A^+ denote the generalized inverse of the matrix A. For a nonempty closed set C, dist(x, C) = inf_{z ∈ C} ‖x − z‖ denotes the distance from x to C.

2 Stochastic ADMM methods for the Nonconvex Optimizations

In this section, we study stochastic ADMM methods for solving the nonconvex problem (1). First, we propose a simple nonconvex stochastic ADMM as a baseline, in which the variance of the stochastic gradients is left uncontrolled. It is difficult to guarantee the convergence of this simple stochastic ADMM, and it only obtains a slow convergence rate. Thus, we propose three classes of stochastic ADMM with variance reduction, based on different variance-reduced stochastic gradients.

First, we review the standard ADMM for solving problem (1) when f(x) is deterministic. The augmented Lagrangian function of (1) is defined as

L_ρ(x, y, λ) = f(x) + g(y) − ⟨λ, Ax + By − c⟩ + (ρ/2)‖Ax + By − c‖²,   (2)

where λ is a Lagrange multiplier and ρ > 0 is a penalty parameter. At the t-th iteration, the ADMM executes the following updates:

y_{t+1} = argmin_y L_ρ(x_t, y, λ_t),   (3)
x_{t+1} = argmin_x L_ρ(x, y_{t+1}, λ_t),   (4)
λ_{t+1} = λ_t − ρ(A x_{t+1} + B y_{t+1} − c).   (5)

When f(x) = 𝔼_ξ[f(x, ξ)] is stochastic, we can still update the variables y and λ by (3) and (5), respectively. However, to update the variable x, we need to define an approximated function of the form

ℓ(x; x_t, y_{t+1}, λ_t) = v_t^T (x − x_t) + (1/(2η))‖x − x_t‖²_Q − ⟨λ_t, Ax + B y_{t+1} − c⟩ + (ρ/2)‖Ax + B y_{t+1} − c‖²,   (6)

where v_t denotes a stochastic estimate of the gradient ∇f(x_t), η > 0 is a step size, and Q ⪰ 0. By minimizing (6) over the variable x, we have

x_{t+1} = (Q/η + ρ A^T A)^{−1} (Q x_t/η + A^T λ_t − ρ A^T (B y_{t+1} − c) − v_t).

When the dimension of x is large, computing the inverse of (Q/η + ρ A^T A) is expensive. To avoid it, we can use the inexact Uzawa method (Zhang et al., 2011) to linearize the last term in (6), choosing Q = rI − ηρ A^T A with r > 0 large enough that Q ⪰ I. With this choice, minimizing (6) over the variable x reduces to the simple step

x_{t+1} = x_t − (η/r)(v_t − A^T λ_t + ρ A^T (A x_t + B y_{t+1} − c)).
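The linearized x-update can be sketched in a few lines; the function below is our own minimal implementation under assumed names (v is the stochastic gradient estimate, and eta absorbs the step-size scaling), not the paper's code:

```python
import numpy as np

def linearized_x_update(x, y, lam, v, A, B, c, rho, eta):
    # One gradient-style step on the linearized augmented Lagrangian:
    #   x+ = x - eta * ( v - A^T lam + rho * A^T (A x + B y - c) )
    residual = A @ x + B @ y - c
    return x - eta * (v - A.T @ lam + rho * A.T @ residual)

# Smoke test: at a feasible point with zero gradient estimate and zero
# multiplier, the update leaves x unchanged.
A, B, c = np.eye(2), -np.eye(2), np.zeros(2)
x = np.array([0.3, -0.7])
y = x.copy()                                 # feasible: A x + B y - c = 0
x_new = linearized_x_update(x, y, np.zeros(2), np.zeros(2), A, B, c,
                            rho=1.0, eta=0.1)
assert np.allclose(x_new, x)
```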

1:  Input: number of iterations T, step size η > 0, penalty parameter ρ > 0;
2:  Initialize: x_0, y_0 and λ_0;
3:  for t = 0, 1, …, T − 1 do
4:      Uniformly randomly pick i_t from {1, 2, …, n};
5:      y_{t+1} = argmin_y L_ρ(x_t, y, λ_t);
6:      x_{t+1} = argmin_x of the approximated function (6) with v_t = ∇f_{i_t}(x_t);
7:      λ_{t+1} = λ_t − ρ(A x_{t+1} + B y_{t+1} − c);
8:  end for
9:  Output: (x, y, λ) chosen uniformly at random from {(x_t, y_t, λ_t)}_{t=1}^T.
Algorithm 1 S-ADMM for Nonconvex Optimization

Like the initial stochastic ADMM (Ouyang et al., 2013) for convex problems, we propose a simple stochastic ADMM (S-ADMM) as a baseline for problem (1). The algorithmic framework of S-ADMM is given in Algorithm 1. Though the stochastic gradient is unbiased, i.e., 𝔼[∇f_{i_t}(x_t)] = ∇f(x_t), its variance does not vanish. To guarantee convergence, we must choose a time-varying step size in (6), as in (Ouyang et al., 2013). However, as stochastic learning proceeds, the gradually decreasing step size generally leads to a slow convergence rate. In the following, therefore, we propose three classes of stochastic ADMM with variance reduction for problem (1), based on different variance-reduced stochastic gradients.
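The whole S-ADMM loop can be sketched on a toy sparse regression instance with A = I, B = −I, c = 0, so the y-step is soft-thresholding; all names (eta, rho, the scaled multiplier u) are our own choices, and for simplicity we keep a fixed small step size rather than the decaying one that the guarantees require:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
Adata = rng.normal(size=(n, p))
x_true = np.array([1.0, 0.0, -2.0])
b = Adata @ x_true                          # noiseless targets (interpolating)
lam_reg, rho, eta, T = 0.01, 1.0, 0.05, 500

x, y, u = np.zeros(p), np.zeros(p), np.zeros(p)   # u: scaled multiplier
for t in range(T):
    # y-step: prox of lam_reg * ||.||_1, i.e. soft-thresholding
    y = np.sign(x + u) * np.maximum(np.abs(x + u) - lam_reg / rho, 0.0)
    # x-step: one linearized step with a single-sample stochastic gradient
    i = rng.integers(n)
    v = Adata[i] * (Adata[i] @ x - b[i])
    x = x - eta * (v + rho * (x - y + u))
    # dual step on the scaled multiplier
    u = u + (x - y)

assert np.linalg.norm(x - y) < 0.5          # near-feasible at the end
assert np.linalg.norm(x - x_true) < 0.5     # near the planted solution
```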

2.1 Nonconvex SVRG-ADMM

1:  Input: epoch length m, number of epochs S, step size η > 0, penalty parameter ρ > 0;
2:  Initialize: x̃¹ = x_0¹, y_0¹ and λ_0¹;
3:  for s = 1, 2, …, S do
4:      x_0^s = x_m^{s−1}, y_0^s = y_m^{s−1} and λ_0^s = λ_m^{s−1} for s > 1;
5:      g̃^s = (1/n) Σ_{i=1}^n ∇f_i(x̃^s);
6:     for t = 0, 1, …, m − 1 do
7:         Uniformly randomly pick i_t from {1, 2, …, n};
8:         v_t^s = ∇f_{i_t}(x_t^s) − ∇f_{i_t}(x̃^s) + g̃^s;
9:         y_{t+1}^s = argmin_y L_ρ(x_t^s, y, λ_t^s);
10:         x_{t+1}^s = argmin_x of the approximated function (6) with the stochastic gradient v_t^s;
11:         λ_{t+1}^s = λ_t^s − ρ(A x_{t+1}^s + B y_{t+1}^s − c);
12:     end for
13:      x̃^{s+1} = x_m^s;
14:  end for
15:  Output: (x, y, λ) chosen uniformly at random from {(x_t^s, y_t^s, λ_t^s)}_{t=0,…,m−1; s=1,…,S}.
Algorithm 2 SVRG-ADMM for Nonconvex Optimization

In this subsection, we propose a nonconvex SVRG-ADMM, which uses a multi-stage scheme to progressively reduce the variance of the stochastic gradients. The framework of the SVRG-ADMM method is given in Algorithm 2. Specifically, in Algorithm 2, the stochastic gradient v_t^s = ∇f_{i_t}(x_t^s) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s) is unbiased, i.e., 𝔼[v_t^s] = ∇f(x_t^s), and its variance is progressively reduced by computing the full gradient once at the snapshot point of each outer loop. In the following, we give an upper bound on the variance of the stochastic gradient v_t^s.

Lemma 1. In Algorithm 2, the following inequality holds:

𝔼‖v_t^s − ∇f(x_t^s)‖² ≤ L² 𝔼‖x_t^s − x̃^s‖²,   (7)

where 𝔼‖v_t^s − ∇f(x_t^s)‖² denotes the variance of the stochastic gradient v_t^s.

A detailed proof of Lemma 1 is provided in Appendix A. Lemma 1 shows that the variance of the stochastic gradient has an upper bound of L² 𝔼‖x_t^s − x̃^s‖². As the number of iterations increases, both x_t^s and the snapshot x̃^s approach the same stationary point, so the variance of the stochastic gradient vanishes. Note that for a nonconvex problem the stochastic ADMM can hardly be guaranteed to converge to the global solution x*, so we bound the variance by ‖x_t^s − x̃^s‖² rather than the quantity f(x̃^s) − f(x*) popularly used in the convex case.

2.2 Nonconvex SAG-ADMM

1:  Input: number of iterations T, step size η > 0, penalty parameter ρ > 0;
2:  Initialize: x_0, y_0, λ_0, and z_j^0 = x_0, g_j^0 = ∇f_j(x_0) for j = 1, 2, …, n;
3:  for t = 0, 1, …, T − 1 do
4:      Uniformly randomly pick i_t from {1, 2, …, n};
5:      v_t = (1/n)(∇f_{i_t}(x_t) − g_{i_t}^t) + (1/n) Σ_{j=1}^n g_j^t;
6:      y_{t+1} = argmin_y L_ρ(x_t, y, λ_t);
7:      x_{t+1} = argmin_x of the approximated function (6) with the stochastic gradient v_t;
8:      λ_{t+1} = λ_t − ρ(A x_{t+1} + B y_{t+1} − c);
9:      z_{i_t}^{t+1} = x_t and g_{i_t}^{t+1} = ∇f_{i_t}(x_t);
10:      z_j^{t+1} = z_j^t and g_j^{t+1} = g_j^t for j ≠ i_t;
11:  end for
12:  Output: (x, y, λ) chosen uniformly at random from {(x_t, y_t, λ_t)}_{t=1}^T.
Algorithm 3 SAG-ADMM for Nonconvex Optimization

In this subsection, we propose a nonconvex SAG-ADMM, which additionally uses the old gradients estimated in previous iterations. The framework of the SAG-ADMM method is given in Algorithm 3. In Algorithm 3, though the stochastic gradient v_t is biased, i.e., 𝔼[v_t] ≠ ∇f(x_t) in general, its variance is progressively reduced. In the following, we give an upper bound on the variance of the stochastic gradient v_t.

Lemma 2. In Algorithm 3, with the SAG-style stochastic gradient v_t = (1/n)(∇f_{i_t}(x_t) − ∇f_{i_t}(z_{i_t}^t)) + (1/n) Σ_{j=1}^n ∇f_j(z_j^t), the following inequality holds:

𝔼‖v_t − 𝔼[v_t]‖² ≤ (L²/n²) · (1/n) Σ_{j=1}^n 𝔼‖x_t − z_j^t‖²,   (8)

where z_j^t denotes the point at which the j-th stored gradient was evaluated, and 𝔼‖v_t − 𝔼[v_t]‖² denotes the variance of the stochastic gradient v_t.

A detailed proof of Lemma 2 is provided in Appendix B. Lemma 2 shows that the variance of the stochastic gradient v_t is bounded by the average squared distance between the current iterate x_t and the stored points z_j^t. As the number of iterations increases, both x_t and the stored points z_j^t approach the same stationary point, so the variance of the stochastic gradient progressively vanishes.
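A minimal sketch of the SAG-style estimator (toy quadratic components, all names ours): a table of stale per-sample gradients is averaged, with the freshly recomputed entry down-weighted by 1/n; averaging the estimate over the sampled index shows that it does not equal the full gradient, i.e., the estimator is biased:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
A = rng.normal(size=(n, p))
b = rng.normal(size=n)
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])   # grad of f_i(x) = 0.5*(a_i^T x - b_i)^2
full_grad = lambda x: A.T @ (A @ x - b) / n

x_old = rng.normal(size=p)
g = np.array([grad_i(x_old, i) for i in range(n)])   # stale gradient table
x = x_old + 0.05 * rng.normal(size=p)                # current iterate

# SAG estimate for index i: (grad f_i(x) - g[i]) / n + mean(g).
# Its expectation over a uniformly random i is NOT the full gradient:
sag_mean = np.mean(
    [(grad_i(x, i) - g[i]) / n + g.mean(axis=0) for i in range(n)], axis=0
)
assert not np.allclose(sag_mean, full_grad(x))       # biased in general
```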

2.3 Nonconvex SAGA-ADMM

1:  Input: number of iterations T, step size η > 0, penalty parameter ρ > 0;
2:  Initialize: x_0, y_0, λ_0, and z_j^0 = x_0, g_j^0 = ∇f_j(x_0) for j = 1, 2, …, n;
3:  for t = 0, 1, …, T − 1 do
4:      Uniformly randomly pick i_t from {1, 2, …, n};
5:      v_t = ∇f_{i_t}(x_t) − g_{i_t}^t + (1/n) Σ_{j=1}^n g_j^t;
6:      y_{t+1} = argmin_y L_ρ(x_t, y, λ_t);
7:      x_{t+1} = argmin_x of the approximated function (6) with the stochastic gradient v_t;
8:      λ_{t+1} = λ_t − ρ(A x_{t+1} + B y_{t+1} − c);
9:      z_{i_t}^{t+1} = x_t and g_{i_t}^{t+1} = ∇f_{i_t}(x_t);
10:      z_j^{t+1} = z_j^t and g_j^{t+1} = g_j^t for j ≠ i_t;
11:  end for
12:  Output: (x, y, λ) chosen uniformly at random from {(x_t, y_t, λ_t)}_{t=1}^T.
Algorithm 4 SAGA-ADMM for Nonconvex Optimization
Methods | stochastic gradient v_t | upper bound of variance | (un)biased
S-ADMM | ∇f_{i_t}(x_t) | unknown | unbiased
SVRG-ADMM | ∇f_{i_t}(x_t) − ∇f_{i_t}(x̃^s) + ∇f(x̃^s) | L² 𝔼‖x_t − x̃^s‖² | unbiased
SAG-ADMM | (1/n)(∇f_{i_t}(x_t) − ∇f_{i_t}(z_{i_t}^t)) + ḡ_t | (L²/n²) · (1/n) Σ_j 𝔼‖x_t − z_j^t‖² | biased
SAGA-ADMM | ∇f_{i_t}(x_t) − ∇f_{i_t}(z_{i_t}^t) + ḡ_t | L² · (1/n) Σ_j 𝔼‖x_t − z_j^t‖² | unbiased
Table 1: Summary of the stochastic gradients used in the proposed methods. Note that ∇f(x̃^s) = (1/n) Σ_{i=1}^n ∇f_i(x̃^s) and ḡ_t = (1/n) Σ_{j=1}^n ∇f_j(z_j^t).

In this subsection, we propose a nonconvex SAGA-ADMM, which is an extension of the SAG-ADMM and uses an unbiased stochastic gradient, like the SVRG-ADMM. The framework of the SAGA-ADMM method is given in Algorithm 4. In Algorithm 4, the stochastic gradient v_t is unbiased, i.e., 𝔼[v_t] = ∇f(x_t), and its variance is progressively reduced by also using the old gradients estimated in previous iterations. Similarly, we give an upper bound on the variance of the stochastic gradient v_t.

Lemma 3. In Algorithm 4, with the stochastic gradient v_t = ∇f_{i_t}(x_t) − ∇f_{i_t}(z_{i_t}^t) + (1/n) Σ_{j=1}^n ∇f_j(z_j^t), the following inequality holds:

𝔼‖v_t − ∇f(x_t)‖² ≤ L² · (1/n) Σ_{j=1}^n 𝔼‖x_t − z_j^t‖²,   (9)

where z_j^t denotes the point at which the j-th stored gradient was evaluated, and 𝔼‖v_t − ∇f(x_t)‖² denotes the variance of the stochastic gradient v_t.

A detailed proof of Lemma 3 is provided in Appendix C. Lemma 3 shows that the variance of the stochastic gradient v_t is again bounded by the average squared distance between x_t and the stored points z_j^t. Similarly, both x_t and the stored points z_j^t approach the same stationary point as the number of iterations increases, so the variance of the stochastic gradient progressively vanishes. Note that the upper bound (9) is looser than the upper bound (8), a price paid for using an unbiased stochastic gradient in the SAGA-ADMM.

To further clarify the differences among the proposed methods, we summarize the stochastic gradients they use in Table 1. From Table 1, we can see that the SAG-ADMM uses a biased stochastic gradient, while the others use unbiased stochastic gradients. In particular, the SAG-ADMM can reduce the variance of the stochastic gradient faster than the SAGA-ADMM, at the expense of using a biased stochastic gradient.
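The bias column of Table 1 can be checked empirically with toy quadratic components (our own notation): the SAGA-style estimate ∇f_i(x) − g_i + ḡ is unbiased over the random index, while the SAG-style estimate (∇f_i(x) − g_i)/n + ḡ is not:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 25, 3
A = rng.normal(size=(n, p))
b = rng.normal(size=n)
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])   # grad of f_i(x) = 0.5*(a_i^T x - b_i)^2
full_grad = lambda x: A.T @ (A @ x - b) / n

# Stale gradient table: each entry evaluated at its own old point.
g = np.array([grad_i(rng.normal(size=p), i) for i in range(n)])
gbar = g.mean(axis=0)
x = rng.normal(size=p)

# Expectations over the uniformly random index i:
saga_mean = np.mean([grad_i(x, i) - g[i] + gbar for i in range(n)], axis=0)
sag_mean = np.mean([(grad_i(x, i) - g[i]) / n + gbar for i in range(n)], axis=0)

assert np.allclose(saga_mean, full_grad(x))      # SAGA: unbiased
assert not np.allclose(sag_mean, full_grad(x))   # SAG: biased
```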

3 Convergence Analysis

In this section, we analyze the convergence and iteration complexity of the proposed methods. First, we state some mild assumptions regarding problem (1):

Assumption 1

For i = 1, 2, …, n, the gradient of the function f_i(x) is Lipschitz continuous with constant L > 0, such that

‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖, ∀ x, y ∈ ℝ^p,   (10)

which implies that

f_i(x) ≤ f_i(y) + ∇f_i(y)^T (x − y) + (L/2)‖x − y‖².   (11)
Assumption 2

f(x) and g(y) are both lower bounded; denote f* = inf_x f(x) > −∞ and g* = inf_y g(y) > −∞.

Assumption 3

g(y) is a proper lower semi-continuous function.

Assumption 4

The matrix A has full row rank.

Since f(x) = (1/n) Σ_{i=1}^n f_i(x), Assumption 1 implies that the full gradient ∇f(x) is also Lipschitz continuous with constant L and satisfies the corresponding descent inequality. Assumption 1 has been widely used in the convergence analysis of nonconvex algorithms (Allen-Zhu and Hazan, 2016; Reddi et al., 2016a). Assumptions 2-3 have been used in the study of ADMM for nonconvex problems (Jiang et al., 2016). Assumption 4 has been used in the convergence analysis of ADMM (Deng and Yin, 2016).

Throughout the paper, let σ_A denote the smallest eigenvalue of the matrix AA^T, and let λ_min(Q) and λ_max(Q) denote the smallest and largest eigenvalues of the positive definite matrix Q, respectively. In the following, we define the ϵ-stationary point of the nonconvex problem (1):

For ϵ > 0, the point (x*, y*, λ*) is said to be an ϵ-stationary point of problem (1) if it holds that

𝔼‖∇f(x*) − A^T λ*‖² ≤ ϵ,   (12)
𝔼[dist(B^T λ*, ∂g(y*))²] ≤ ϵ,   (13)
𝔼‖A x* + B y* − c‖² ≤ ϵ,   (14)

where dist(y_0, ∂g(y)) = inf{‖y_0 − z‖ : z ∈ ∂g(y)}. If ϵ = 0, the point (x*, y*, λ*) is said to be a stationary point of problem (1). Note that the inequalities (12)-(14) can be summarized by the single stationarity measure 𝔼[θ(x*, y*, λ*)], where

θ(x, y, λ) = ‖∇f(x) − A^T λ‖² + dist(B^T λ, ∂g(y))² + ‖Ax + By − c‖².
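The three residuals in the definition above can be evaluated directly; the sketch below (our own names, with the smooth choice g(y) = ½‖y‖² so that ∂g(y) = {y} and the distance term is a plain norm) verifies that they all vanish at an exact stationary point of a tiny problem:

```python
import numpy as np

def stationarity_residuals(x, y, lam, A, B, c, grad_f):
    # The three quantities from the epsilon-stationarity definition, with
    # g(y) = 0.5*||y||^2 so dist(B^T lam, dg(y)) reduces to ||B^T lam - y||.
    r1 = np.sum((grad_f(x) - A.T @ lam) ** 2)   # x-stationarity
    r2 = np.sum((B.T @ lam - y) ** 2)           # y-stationarity
    r3 = np.sum((A @ x + B @ y - c) ** 2)       # feasibility
    return max(r1, r2, r3)

# Tiny problem: min 0.5*||x - d||^2 + 0.5*||y||^2  s.t.  x - y = 0
# (A = I, B = -I, c = 0). Optimality gives x = y = d/2 and lam = -d/2.
d = np.array([2.0, -4.0])
A, B, c = np.eye(2), -np.eye(2), np.zeros(2)
x = y = d / 2
lam = -d / 2
res = stationarity_residuals(x, y, lam, A, B, c, lambda x: x - d)
assert np.isclose(res, 0.0)
```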

Next, based on the above assumptions and definition, we study the convergence and iteration complexity of the proposed methods. In particular, we provide a general framework to analyze the convergence and iteration complexity of the stochastic ADMM methods with variance reduction. Specifically, the basic procedure is as follows:

  • First, we design a new sequence based on the sequence generated by the algorithm. For example, we design the sequence in (15) for the SVRG-ADMM, the sequence in (21) for the SAG-ADMM, and the sequence in (27) for the SAGA-ADMM.

  • Second, we prove that the designed sequence is monotonically decreasing, and has a lower bound.

  • Third, we define a new variable for each algorithm. For example, we define the variable in (19) for the SVRG-ADMM, the variable in (25) for the SAG-ADMM, and the variable in (31) for the SAGA-ADMM. Then, we prove that this variable has an upper bound, based on the above results.

  • Finally, we prove that the expected stationarity measure is bounded by the variable defined above.

3.1 Convergence Analysis of Nonconvex SVRG-ADMM

In this subsection, we study the convergence and iteration complexity of the SVRG-ADMM. First, given the sequence generated by Algorithm 2, we define a useful sequence as follows:

(15)

where the positive sequence satisfies the recursion (16) below.

Next, we present three important lemmas: the first gives an upper bound used in the sequel; the second shows that the designed sequence is monotonically decreasing; the third gives a lower bound of the sequence.

Lemma 5. Suppose the sequence is generated by Algorithm 2. Then the following inequality holds,

where σ_A denotes the smallest eigenvalue of the matrix AA^T, and λ_max(Q) denotes the largest eigenvalue of the positive definite matrix Q. A detailed proof of Lemma 5 is provided in Appendix D; it establishes the upper bound used in the subsequent lemmas.

Lemma 6. Suppose that the sequence is generated by Algorithm 2, and suppose further that the positive sequence satisfies

(16)

for all t. Denoting

(17)

and letting the step size η, the penalty parameter ρ and the matrix Q be chosen appropriately, then the designed sequence is monotonically decreasing.

A detailed proof of Lemma 6 is provided in Appendix E. Next, we further clarify the choice of the above parameters. We first define an auxiliary function of the step size, which attains its largest value at a suitable choice of η,

where κ_A denotes the condition number of the matrix AA^T. It then follows that the penalty parameter ρ should satisfy the following inequality:

(18)

Lemma 7. Suppose the sequence is generated by Algorithm 2; then under the same conditions as in Lemma 6, the sequence has a lower bound.

A detailed proof of Lemma 7 is provided in Appendix F. Next, based on the above lemmas, we analyze the convergence and iteration complexity of the SVRG-ADMM. We first define a useful variable as follows:

(19)

Theorem 8. Suppose the sequence is generated by Algorithm 2, with the parameters chosen as in Lemma 6. Letting

(20)

where the bound involves a lower bound of the monotonically decreasing sequence defined in (15), then the output of Algorithm 2 is an ϵ-stationary point of problem (1).

A detailed proof of Theorem 8 is provided in Appendix G. Theorem 8 shows that the SVRG-ADMM is convergent and has an iteration complexity of O(1/ϵ) to reach an ϵ-stationary point, i.e., it attains a convergence rate of O(1/T). From Theorem 8, we can see that the SVRG-ADMM ensures its convergence by progressively reducing the variance of the stochastic gradients.

3.2 Convergence Analysis of Nonconvex SAG-ADMM

In this subsection, we study the convergence and iteration complexity of the SAG-ADMM. First, given the sequence generated by Algorithm 3, we define a useful sequence as follows:

(21)

where the positive sequence satisfies the condition (22).

Next, we present three important lemmas: the first gives an upper bound used in the sequel; the second shows that the designed sequence is monotonically decreasing; the third gives a lower bound of the sequence.

Suppose the sequence is generated by Algorithm 3; then the following inequality holds