1 Introduction
Stochastic optimization methods are a class of powerful tools for solving large-scale problems in machine learning. For example, stochastic gradient descent (SGD)
(Bottou, 2004) is an efficient method for solving the finite-sum optimization problem, which is fundamental to machine learning. Specifically, SGD computes the gradient of only one sample, instead of visiting all samples, in each iteration. Despite its scalability, due to the variance introduced by the stochastic process, SGD has a slower convergence rate than the batch gradient method. Recently, many accelerated versions of SGD have been proposed to reduce this variance and obtain better convergence rates. For example, the stochastic average gradient (SAG) method (Roux et al., 2012) obtains a fast convergence rate by incorporating the old gradients estimated in previous iterations. The stochastic dual coordinate ascent (SDCA) method (Shalev-Shwartz and Zhang, 2013) performs stochastic coordinate ascent on the dual problem and also obtains a fast convergence rate. Moreover, the accelerated randomized proximal coordinate gradient (APCG) method (Lin et al., 2015) accelerates the SDCA method by using Nesterov's acceleration technique (Nesterov, 2004). However, these accelerated methods obtain faster convergence rates than the standard SGD at the cost of much extra space to store the old gradients or dual variables. To deal with this dilemma, the stochastic variance reduced gradient (SVRG) methods (Johnson and Zhang, 2013; Xiao and Zhang, 2014) were proposed; they enjoy a fast convergence rate with no extra space for the intermediate gradients or dual variables. Moreover, Defazio et al. (2014) proposed a novel method called SAGA, which extends the SAG method and enjoys a better theoretical convergence rate than both the SAG and SVRG methods. Though the above gradient-based methods can effectively solve many problems in machine learning, they are still ill-suited to some complicated problems, such as the graph-guided SVM (Ouyang et al., 2013) and the latent variable graphical models (Ma et al., 2013).
It is well known that the alternating direction method of multipliers (ADMM) (Gabay and Mercier, 1976; Boyd et al., 2011) has been advocated as an efficient optimization method in many application fields, such as machine learning (Danaher et al., 2014) and statistics (Fang et al., 2015). However, the offline or batch ADMM needs to compute an empirical risk loss function on all training samples at each iteration, which makes it unsuitable for large-scale learning problems. Thus, online or stochastic versions of ADMM (Wang and Banerjee, 2012; Suzuki, 2013; Ouyang et al., 2013) have been developed for large-scale and stochastic optimization. Due to the variance introduced by the stochastic process, these initial stochastic ADMM methods also suffer from slow convergence rates. Recently, some accelerated stochastic ADMM methods have been proposed to efficiently solve large-scale learning problems. For example, a fast stochastic ADMM (Zhong and Kwok, 2014) was proposed by incorporating the previously estimated gradients. Azadi and Sra (2014) proposed an accelerated stochastic ADMM by using Nesterov's acceleration method (Nesterov, 2004). Moreover, an adaptive stochastic ADMM (Zhao et al., 2015a) was proposed by using adaptive stochastic gradients. The stochastic dual coordinate ascent ADMM (Suzuki, 2014) obtains a fast convergence rate by solving the dual problem. More recently, the scalable stochastic ADMMs (Zhao et al., 2015b; Zheng and Kwok, 2016) were developed, and obtain fast convergence rates with no extra space for the previous gradients or dual variables.

So far, the above studies of stochastic optimization methods rely heavily on strongly convex or convex objective functions. However, there exist many useful nonconvex models in machine learning, such as the nonconvex robust empirical risk minimization models (Aravkin and Davis, 2016)
and deep learning
(LeCun et al., 2015). Thus, the study of stochastic optimization methods for nonconvex problems is much needed. Recently, several works have focused on stochastic gradient methods for nonconvex optimization. For example, Ghadimi and Lan (2016) and Ghadimi et al. (2016) established an iteration complexity of $O(1/\epsilon^2)$ for the SGD to obtain an $\epsilon$-stationary solution of nonconvex problems. Allen-Zhu and Hazan (2016) and Reddi et al. (2016a,b) proved that both the nonconvex SVRG and SAGA methods obtain an improved iteration complexity for nonconvex optimization. In particular, Li et al. (2016) studied the stochastic gradient method for nonconvex sparse learning via variance reduction, which reaches an asymptotically linear convergence rate by exploiting the properties of the specific problems. Moreover, Reddi et al. (2016c) and Aravkin and Davis (2016) studied variance-reduced stochastic methods for nonconvex nonsmooth composite problems and established their iteration complexity. At the same time, Hajinezhad et al. (2016) proposed a distributed and stochastic primal-dual splitting method for nonconvex nonsmooth problems and proved that it attains a comparable iteration complexity to obtain an $\epsilon$-stationary solution.

Similarly, the above nonconvex methods are hard-pressed to handle some complicated nonconvex problems, such as the graph-guided regularized risk minimizations and tensor decomposition
(Jiang et al., 2016). Fortunately, the ADMM method has been found to handle these complicated nonconvex problems well, though it may sometimes fail to converge. Recently, several works (Wang et al., 2015; Yang et al., 2015; Hong et al., 2016; Jiang et al., 2016) have begun to study the ADMM method for nonconvex optimization. However, they mainly focus on the deterministic ADMM for nonconvex optimization. Since they compute the empirical loss function on all training examples at each iteration, these nonconvex ADMMs do not scale well to large learning problems. Though Hong (2014) proposed a distributed, asynchronous and incremental algorithm based on the ADMM method for large-scale nonconvex problems, the proposed method still has difficulty with complicated nonconvex problems such as the graph-guided models, and its iteration complexity is not provided. At present, to the best of our knowledge, there are still few studies of the stochastic ADMM for nonconvex optimization. Thus, in this paper, we study stochastic ADMM methods for solving nonconvex nonsmooth stochastic optimization problems of the form

(1)   $\min_{x, y} \ f(x) + g(y), \quad \text{s.t.} \ Ax + By = c,$
where $f(x) := \mathbb{E}_\xi[f(x, \xi)]$ is a nonconvex and smooth function; $\xi$ is a random vector; $g(y)$ is nonsmooth and possibly nonconvex; $x \in \mathbb{R}^p$, $y \in \mathbb{R}^q$, $A \in \mathbb{R}^{d \times p}$, $B \in \mathbb{R}^{d \times q}$, and $c \in \mathbb{R}^d$. The problem (1) is inspired by structural risk minimization in machine learning (Vapnik, 2013). Here, the random vector $\xi$ obeys a fixed but unknown distribution, from which we are able to draw a set of $n$ i.i.d. samples. In general, it is difficult to evaluate $\mathbb{E}_\xi[f(x, \xi)]$ exactly, so we use the sample average approximation to approximate it. Throughout the paper, let $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ denote the average of $n$ nonconvex and smooth component functions $f_i(x)$, $i = 1, \ldots, n$.

Moreover, we propose three classes of nonconvex stochastic ADMM with variance reduction for the problem (1), based on different variance-reduced stochastic gradients. Specifically, the first class, called SVRG-ADMM, uses a multi-stage scheme to progressively reduce the variance of the stochastic gradients. The second, called SAG-ADMM, reduces the variance of the stochastic gradient by additionally using the old gradients estimated in previous iterations. The third, called SAGA-ADMM, is an extension of the SAG-ADMM, which uses an unbiased stochastic gradient as the SVRG-ADMM does. In summary, our main contributions are threefold:

We propose three classes of nonconvex stochastic ADMM with variance reduction, based on different variance-reduced stochastic gradients.

We study the convergence of the proposed methods, and prove that they attain an iteration complexity bound of $O(1/\epsilon)$ to obtain an $\epsilon$-stationary solution of the nonconvex problems. In particular, we provide a general framework to analyze the iteration complexity of nonconvex stochastic ADMM with variance reduction.

Finally, numerical experiments demonstrate the effectiveness of the proposed methods.
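To make the setup of problem (1) concrete, the sketch below casts a hypothetical graph-guided fused-lasso instance into the two-block form $\min f(x) + g(y)$ s.t. $Ax + By = c$. All data, names, and the logistic choice of the components $f_i$ are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Hypothetical instance of problem (1): graph-guided fused lasso,
#   min_x (1/n) * sum_i f_i(x) + lam * ||A x||_1,
# rewritten in the two-block form  min f(x) + g(y)  s.t.  A x - y = 0,
# so that B = -I and c = 0. Here f_i is a (smooth) logistic loss.

rng = np.random.default_rng(0)
n, p = 200, 10
Adata = rng.normal(size=(n, p))          # design matrix (rows a_i)
b = rng.choice([-1.0, 1.0], size=n)      # labels
A = np.diff(np.eye(p), axis=0)           # difference operator in the constraint

def f_i(x, i):
    # one smooth component function f_i(x) = log(1 + exp(-b_i * a_i^T x))
    return np.log1p(np.exp(-b[i] * Adata[i] @ x))

def f(x):
    # sample average approximation of E_xi[f(x, xi)]
    return np.mean([f_i(x, i) for i in range(n)])

def g(y, lam=0.1):
    # nonsmooth term acting on the split variable y = A x
    return lam * np.abs(y).sum()

x = np.zeros(p)
y = A @ x
# the constraint residual A x + B y - c (B = -I, c = 0) vanishes at this split
assert np.allclose(A @ x - y, 0.0)
print(round(f(x), 4))   # prints 0.6931, i.e. log(2) at x = 0
```

The split variable `y` carries the nonsmooth regularizer, so the $x$-subproblem only involves the smooth finite sum, which is what makes ADMM attractive for such graph-guided models.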
1.1 Organization
The paper is organized as follows: In Section 2, we propose three classes of stochastic ADMM with variance reduction, based on different variance-reduced stochastic gradients. In Section 3, we study the convergence and iteration complexity of the proposed methods. Section 4 presents some numerical experiments, whose results back up the effectiveness of our methods. In Section 5, we give some conclusions. Most details of the theoretical analysis and proofs are relegated to the following appendices.
1.2 Notations
$\|\cdot\|$ denotes the Euclidean norm of a vector or the spectral norm of a matrix. $\partial g(y)$ denotes the subgradient of the function $g(y)$. $G \succ 0$ implies that the matrix $G$ is positive definite. Let $\|x\|_G^2 = x^T G x$. Let $A^+$ denote the generalized inverse of the matrix $A$. For a nonempty closed set $\mathcal{C}$, $\mathrm{dist}(x, \mathcal{C}) = \inf_{y \in \mathcal{C}} \|x - y\|$ denotes the distance from $x$ to $\mathcal{C}$.
2 Stochastic ADMM Methods for Nonconvex Optimization
In this section, we study stochastic ADMM methods for solving the nonconvex problem (1). First, we propose a simple nonconvex stochastic ADMM as a baseline, in which the variance of the stochastic gradients is left unreduced. However, it is difficult to guarantee the convergence of this simple stochastic ADMM, and it only obtains a slow convergence rate. Thus, we propose three classes of stochastic ADMM with variance reduction, based on different variance-reduced stochastic gradients.
First, we review the standard ADMM for solving the problem (1), when $f(x)$ is deterministic. The augmented Lagrangian function of (1) is defined as

(2)   $L_\rho(x, y, \lambda) = f(x) + g(y) - \langle \lambda, Ax + By - c \rangle + \frac{\rho}{2}\|Ax + By - c\|^2,$

where $\lambda$ is a Lagrange multiplier, and $\rho > 0$ is a penalty parameter. At the $k$-th iteration, the ADMM executes the following updates:
(3)   $y_{k+1} = \arg\min_y L_\rho(x_k, y, \lambda_k),$

(4)   $x_{k+1} = \arg\min_x L_\rho(x, y_{k+1}, \lambda_k),$

(5)   $\lambda_{k+1} = \lambda_k - \rho(Ax_{k+1} + By_{k+1} - c).$
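The updates (3)-(5) can be sketched on a toy convex instance with closed-form subproblems: $f(x) = \frac{1}{2}\|x - u\|^2$, $g(y) = \tau\|y\|_1$, and constraint $x - y = 0$ (so $A = I$, $B = -I$, $c = 0$). This is an illustrative sketch only; the data $u$ and the quadratic choice of $f$ are assumptions for the demo.

```python
import numpy as np

# Deterministic ADMM (3)-(5) on a toy instance of (1):
#   f(x) = 0.5*||x - u||^2,  g(y) = tau*||y||_1,  s.t.  x - y = 0.
# The overall minimizer is the soft-thresholding of u.

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

u = np.array([3.0, -0.5, 1.2, 0.05])
tau, rho = 1.0, 1.0
x = np.zeros_like(u); y = np.zeros_like(u); lam = np.zeros_like(u)

for _ in range(200):
    # (3) y-update: argmin_y tau*||y||_1 + lam^T y + (rho/2)*||x - y||^2
    y = soft(x - lam / rho, tau / rho)
    # (4) x-update: argmin_x 0.5*||x - u||^2 - lam^T x + (rho/2)*||x - y||^2
    x = (u + lam + rho * y) / (1.0 + rho)
    # (5) dual update on the residual A x + B y - c = x - y
    lam = lam - rho * (x - y)

assert np.allclose(y, soft(u, tau), atol=1e-6)   # converged to soft-threshold of u
assert np.allclose(x, y, atol=1e-6)              # primal feasibility
```

Each subproblem here is a proximal step; in the general problem (1), the $x$-subproblem (4) involves the full finite sum $f$, which is exactly what the stochastic variants below avoid.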
When $\xi$ is a random variable, we can still update the variables $y$ and $\lambda$ by (3) and (5), respectively. However, to update the variable $x$, we need to define an approximated function of the form:

(6)   $\hat{L}_\rho(x, y_{k+1}, \lambda_k) = f(x_k) + \hat{\nabla} f(x_k)^T (x - x_k) - \lambda_k^T (Ax + By_{k+1} - c) + \frac{\rho}{2}\|Ax + By_{k+1} - c\|^2 + \frac{1}{2\eta}\|x - x_k\|_G^2,$

where $\eta > 0$ is a step size, $G \succeq I$, and $\hat{\nabla} f(x_k)$ denotes a stochastic gradient estimate of $\nabla f(x_k)$. By minimizing (6) with respect to the variable $x$, we have

$x_{k+1} = \Big(\tfrac{1}{\eta} G + \rho A^T A\Big)^{-1}\Big(\tfrac{1}{\eta} G x_k - \hat{\nabla} f(x_k) + A^T \lambda_k - \rho A^T (By_{k+1} - c)\Big).$
When $p$ is large, computing the inversion of $\tfrac{1}{\eta} G + \rho A^T A$ is expensive. To avoid it, we can use the inexact Uzawa method (Zhang et al., 2011) to linearize the last term in (6), and choose $G = rI - \eta\rho A^T A$ with $r \geq \eta\rho\|A^T A\| + 1$, so that $G \succeq I$. With this choice, minimizing (6) with respect to the variable $x$ yields

$x_{k+1} = x_k - \frac{\eta}{r}\Big(\hat{\nabla} f(x_k) - A^T \lambda_k + \rho A^T (Ax_k + By_{k+1} - c)\Big).$
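Assuming the approximation (6) takes the standard linearized form, with quadratic term $\frac{1}{2\eta}\|x - x_k\|_G^2$ (a reconstruction; the exact constants may differ in the original), the following sketch verifies numerically that the inexact-Uzawa choice of $G$ turns the $x$-update into an explicit, inversion-free step.

```python
import numpy as np

# With G = r*I - eta*rho*A^T A, the matrix (G/eta + rho*A^T A) collapses to
# (r/eta)*I, so minimizing the linearized approximation needs no inversion.
# All data below are random placeholders for the quantities in the x-update.
rng = np.random.default_rng(1)
d, p = 3, 5
A = rng.normal(size=(d, p))
By_c = rng.normal(size=d)               # stands for B y_{k+1} - c
lam = rng.normal(size=d)
xk = rng.normal(size=p)
grad = rng.normal(size=p)               # stochastic gradient estimate at x_k
eta, rho = 0.5, 1.0
r = eta * rho * np.linalg.norm(A.T @ A, 2) + 1.0   # ensures G >= I
G = r * np.eye(p) - eta * rho * A.T @ A

# generic minimizer: (G/eta + rho*A^T A) x = G x_k/eta - grad + A^T lam - rho*A^T(By - c)
lhs = G / eta + rho * A.T @ A
rhs = G @ xk / eta - grad + A.T @ lam - rho * A.T @ By_c
x_generic = np.linalg.solve(lhs, rhs)

# inexact-Uzawa form: an explicit step scaled by eta/r, no matrix inversion
x_uzawa = xk - (eta / r) * (grad - A.T @ lam + rho * A.T @ (A @ xk + By_c))

assert np.allclose(x_generic, x_uzawa)
```

The explicit form costs only matrix-vector products with $A$ and $A^T$ per iteration, which is the practical payoff of the inexact Uzawa linearization.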
Like the initial stochastic ADMM (Ouyang et al., 2013) for convex problems, we propose a simple stochastic ADMM (SADMM) as a baseline for the problem (1). The algorithmic framework of the SADMM is given in Algorithm 1. Though the stochastic gradient is unbiased, i.e., $\mathbb{E}[\hat{\nabla} f(x_k)] = \nabla f(x_k)$, there still exists variance in the stochastic process. To guarantee its convergence, we should choose a time-varying step size $\eta_k$ in (6), as in (Ouyang et al., 2013). However, as stochastic learning proceeds, the gradual decrease of the step size generally leads to a slow convergence rate. Thus, in the following, we propose three classes of stochastic ADMM with variance reduction for the problem (1), based on different variance-reduced stochastic gradients.
2.1 Nonconvex SVRG-ADMM
In this subsection, we propose a nonconvex SVRG-ADMM, which uses a multi-stage scheme to progressively reduce the variance of the stochastic gradients. The framework of the SVRG-ADMM method is given in Algorithm 2. Specifically, in Algorithm 2, the stochastic gradient $\hat{\nabla} f(x_k)$ is unbiased, i.e., $\mathbb{E}[\hat{\nabla} f(x_k)] = \nabla f(x_k)$, and its variance is progressively reduced by computing the gradients of all samples once in each outer loop. In the following, we give an upper bound on the variance of the stochastic gradient $\hat{\nabla} f(x_k)$.

Lemma 1  In Algorithm 2, set $\hat{\nabla} f(x_k) = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \nabla f(\tilde{x})$, where $i_k$ is uniformly sampled from $\{1, \ldots, n\}$ and $\tilde{x}$ denotes the snapshot point of the current outer loop; then the following inequality holds

(7)   $\Delta_k := \mathbb{E}\|\hat{\nabla} f(x_k) - \nabla f(x_k)\|^2 \leq L^2\, \mathbb{E}\|x_k - \tilde{x}\|^2,$

where $\Delta_k$ denotes the variance of the stochastic gradient $\hat{\nabla} f(x_k)$.
A detailed proof of Lemma 1 is provided in Appendix A. Lemma 1 shows that the variance of the stochastic gradient has an upper bound of $L^2\, \mathbb{E}\|x_k - \tilde{x}\|^2$. As the number of iterations increases, both $x_k$ and $\tilde{x}$ approach the same stationary point, and thus the variance of the stochastic gradient vanishes. Note that the stochastic ADMM for solving the nonconvex problem can hardly be guaranteed to converge to the global solution $x^*$, so we bound the variance with $\|x_k - \tilde{x}\|^2$ rather than the objective gap popularly used in the convex case.
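A minimal sketch of an SVRG-style estimator of the kind used in Lemma 1, on hypothetical least-squares components $f_i(x) = \frac{1}{2}(a_i^T x - b_i)^2$ (an illustrative choice, not the paper's setting). It checks unbiasedness exactly and a $\|x_k - \tilde{x}\|^2$-type variance bound with a crude per-sample Lipschitz constant.

```python
import numpy as np

# SVRG-style gradient estimator:
#   v_i = grad_i(x) - grad_i(x_snap) + full_grad(x_snap),  i uniform.
# Unbiased, with variance controlled by ||x - x_snap||^2.
rng = np.random.default_rng(2)
n, p = 50, 4
A = rng.normal(size=(n, p)); b = rng.normal(size=n)

def grad_i(x, i):                      # gradient of f_i(x) = 0.5*(a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - b) / n

x = rng.normal(size=p)
x_snap = x + 0.1 * rng.normal(size=p)  # snapshot point of the outer loop
v = np.array([grad_i(x, i) - grad_i(x_snap, i) + full_grad(x_snap)
              for i in range(n)])

# unbiased: averaging the estimator over all indices recovers the full gradient
assert np.allclose(v.mean(axis=0), full_grad(x))

# variance is bounded by max_i L_i^2 * ||x - x_snap||^2 (L_i = ||a_i||^2 here)
var = np.mean(np.sum((v - full_grad(x))**2, axis=1))
Lmax2 = (A**2).sum(axis=1).max()**2
assert var <= Lmax2 * np.sum((x - x_snap)**2) + 1e-12
```

As `x` approaches `x_snap`, the bound (and hence the variance) shrinks to zero, which is the mechanism behind the multi-stage scheme.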
2.2 Nonconvex SAG-ADMM
In this subsection, we propose a nonconvex SAG-ADMM by additionally using the old gradients estimated in previous iterations. The framework of the SAG-ADMM method is given in Algorithm 3. In Algorithm 3, though the stochastic gradient $\hat{\nabla} f(x_k)$ is biased, i.e., $\mathbb{E}[\hat{\nabla} f(x_k)] \neq \nabla f(x_k)$, its variance is progressively reduced. In the following, we give an upper bound on the variance of the stochastic gradient $\hat{\nabla} f(x_k)$.
Lemma 2  In Algorithm 3, set $\hat{\nabla} f(x_k) = \frac{1}{n}\big(\nabla f_{i_k}(x_k) - \nabla f_{i_k}(z_k^{i_k})\big) + \frac{1}{n}\sum_{i=1}^n \nabla f_i(z_k^i)$, where $i_k$ is uniformly sampled from $\{1, \ldots, n\}$; then the following inequality holds

(8)   $\Delta_k \leq \frac{L^2}{n^3}\sum_{i=1}^n \mathbb{E}\|x_k - z_k^i\|^2,$

where $z_k^i$ denotes the stored point for the $i$-th sample, and $\Delta_k$ denotes the variance of the stochastic gradient $\hat{\nabla} f(x_k)$.
A detailed proof of Lemma 2 is provided in Appendix B. Lemma 2 shows that the variance of the stochastic gradient has the upper bound given in (8). As the number of iterations increases, both $x_k$ and the stored points $\{z_k^i\}_{i=1}^n$ approach the same stationary point, so the variance of the stochastic gradient is progressively reduced.
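A matching sketch of a SAG-style estimator, i.e., the table-of-old-gradients correction damped by $1/n$ (an assumed form, since the display above was lost in extraction), on the same hypothetical least-squares components. It exhibits both the bias and the strongly damped spread.

```python
import numpy as np

# SAG-style estimator: with a table of old gradients g_tab[i] = grad_i(z^i),
#   v_i = (grad_i(x) - g_tab[i]) / n + mean(g_tab).
# Its average over i is NOT grad f(x) in general (biased), but the 1/n
# damping keeps its spread small.
rng = np.random.default_rng(3)
n, p = 50, 4
A = rng.normal(size=(n, p)); b = rng.normal(size=n)

def grad_i(x, i):                      # gradient of f_i(x) = 0.5*(a_i^T x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - b) / n

x = rng.normal(size=p)
z = [x + 0.1 * rng.normal(size=p) for _ in range(n)]   # stored points z_k^i
g_tab = np.array([grad_i(z[i], i) for i in range(n)])
v = np.array([(grad_i(x, i) - g_tab[i]) / n + g_tab.mean(axis=0)
              for i in range(n)])

# biased: averaging over all indices does not recover the full gradient
assert not np.allclose(v.mean(axis=0), full_grad(x), atol=1e-8)

# but the spread around its own mean is tiny, thanks to the 1/n damping
spread = np.mean(np.sum((v - v.mean(axis=0))**2, axis=1))
sgd = np.array([grad_i(x, i) for i in range(n)])
sgd_spread = np.mean(np.sum((sgd - full_grad(x))**2, axis=1))
assert spread < sgd_spread / n
```

The bias disappears asymptotically as the stored points catch up with the iterate, which is why the analysis can tolerate it.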
2.3 Nonconvex SAGA-ADMM
Table 1: Stochastic gradients used in the proposed methods.

| Methods | stochastic gradient $\hat{\nabla} f(x_k)$ | upper bound of variance | (un)biased |
|---|---|---|---|
| SADMM | $\nabla f_{i_k}(x_k)$ | unknown | unbiased |
| SVRG-ADMM | $\nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}) + \nabla f(\tilde{x})$ | $L^2\|x_k - \tilde{x}\|^2$ | unbiased |
| SAG-ADMM | $\frac{1}{n}\big(\nabla f_{i_k}(x_k) - \nabla f_{i_k}(z_k^{i_k})\big) + \frac{1}{n}\sum_i \nabla f_i(z_k^i)$ | $\frac{L^2}{n^3}\sum_i \|x_k - z_k^i\|^2$ | biased |
| SAGA-ADMM | $\nabla f_{i_k}(x_k) - \nabla f_{i_k}(z_k^{i_k}) + \frac{1}{n}\sum_i \nabla f_i(z_k^i)$ | $\frac{L^2}{n}\sum_i \|x_k - z_k^i\|^2$ | unbiased |
In this subsection, we propose a nonconvex SAGA-ADMM, which is an extension of the SAG-ADMM and, like the SVRG-ADMM, uses an unbiased stochastic gradient. The framework of the SAGA-ADMM method is given in Algorithm 4. In Algorithm 4, the stochastic gradient $\hat{\nabla} f(x_k)$ is unbiased, i.e., $\mathbb{E}[\hat{\nabla} f(x_k)] = \nabla f(x_k)$, and its variance is progressively reduced by also using the old gradients estimated in previous iterations. Similarly, we give an upper bound on the variance of the stochastic gradient $\hat{\nabla} f(x_k)$.
Lemma 3  In Algorithm 4, set $\hat{\nabla} f(x_k) = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(z_k^{i_k}) + \frac{1}{n}\sum_{i=1}^n \nabla f_i(z_k^i)$, where $i_k$ is uniformly sampled from $\{1, \ldots, n\}$; then the following inequality holds

(9)   $\Delta_k \leq \frac{L^2}{n}\sum_{i=1}^n \mathbb{E}\|x_k - z_k^i\|^2,$

where $z_k^i$ denotes the stored point for the $i$-th sample, and $\Delta_k$ denotes the variance of the stochastic gradient $\hat{\nabla} f(x_k)$.
A detailed proof of Lemma 3 is provided in Appendix C. Lemma 3 shows that the variance of the stochastic gradient has the upper bound given in (9). Similarly, both $x_k$ and the stored points $\{z_k^i\}_{i=1}^n$ approach the same stationary point as the number of iterations increases, and thus the variance of the stochastic gradient is progressively reduced. Note that the upper bound (9) loses a coefficient relative to the upper bound (8), due to using an unbiased stochastic gradient in the SAGA-ADMM.
To further clarify the differences among the proposed methods, we summarize the stochastic gradients used in them in Table 1. From Table 1, we can find that the SAG-ADMM uses a biased stochastic gradient, while the others use unbiased stochastic gradients. In particular, the SAG-ADMM reduces the variance of the stochastic gradient faster than the SAGA-ADMM, at the expense of using a biased stochastic gradient.
3 Convergence Analysis
In this section, we analyze the convergence and iteration complexity of the proposed methods. First, we give some mild assumptions regarding problem (1) as follows:
Assumption 1
For $i = 1, 2, \ldots, n$, the gradient of the function $f_i(x)$ is Lipschitz continuous with the constant $L > 0$, such that

(10)   $\|\nabla f_i(x) - \nabla f_i(y)\| \leq L\|x - y\|, \quad \forall x, y \in \mathbb{R}^p,$

and this is equivalent to

(11)   $f_i(x) \leq f_i(y) + \nabla f_i(y)^T (x - y) + \frac{L}{2}\|x - y\|^2.$
Assumption 2
$f(x)$ and $g(y)$ are lower bounded; denote $f^* = \inf_x f(x)$ and $g^* = \inf_y g(y)$.
Assumption 3
$g(y)$ is a proper lower semicontinuous function.
Assumption 4
The matrix $A$ has full row rank.
Given Assumption 1, since $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$, we have $\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$ and $f(x) \leq f(y) + \nabla f(y)^T (x - y) + \frac{L}{2}\|x - y\|^2$. Assumption 1 has been widely used in the convergence analysis of nonconvex algorithms (Allen-Zhu and Hazan, 2016; Reddi et al., 2016a). Assumptions 2-3 have been used in the study of ADMM for nonconvex problems (Jiang et al., 2016). Assumption 4 has been used in the convergence analysis of ADMM (Deng and Yin, 2016).
Throughout the paper, let $\sigma_{\min}^A$ denote the smallest eigenvalue of the matrix $A^T A$, and let $\phi_{\min}$ and $\phi_{\max}$ denote the smallest and largest eigenvalues of the positive definite matrix $G$, respectively. In the following, we define the $\epsilon$-stationary point of the nonconvex problem (1): For $\epsilon > 0$, the point $(x^*, y^*, \lambda^*)$ is said to be an $\epsilon$-stationary point of the problem (1) if it holds that

(12)   $\mathbb{E}\|Ax^* + By^* - c\|^2 \leq \epsilon,$

(13)   $\mathbb{E}\|\nabla f(x^*) - A^T \lambda^*\|^2 \leq \epsilon,$

(14)   $\mathbb{E}\big[\mathrm{dist}(B^T \lambda^*, \partial g(y^*))^2\big] \leq \epsilon,$

where $\mathrm{dist}(B^T \lambda^*, \partial g(y^*)) = \inf_{v \in \partial g(y^*)} \|B^T \lambda^* - v\|$. If $\epsilon = 0$, the point $(x^*, y^*, \lambda^*)$ is said to be a stationary point of the problem (1). Note that the above inequalities (12)-(14) can be combined into the single condition $\mathbb{E}[\theta] \leq \epsilon$, where $\theta$ denotes the sum of the three residuals above.
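For a concrete $g$, the three stationarity residuals of the form (12)-(14) can be evaluated directly. The sketch below does so for $g(y) = \tau\|y\|_1$ (an illustrative choice, for which the subdifferential distance has a closed form; all names are hypothetical) and verifies that the residuals vanish at a known optimum of a toy instance.

```python
import numpy as np

def stationarity_residuals(x, y, lam, A, B, c, grad_f, tau):
    # primal feasibility residual: ||A x + B y - c||^2
    r_feas = np.sum((A @ x + B @ y - c)**2)
    # x-optimality residual: ||grad f(x) - A^T lam||^2
    r_x = np.sum((grad_f(x) - A.T @ lam)**2)
    # dist(B^T lam, d(tau*||.||_1)(y))^2: coordinates with y_j != 0 force the
    # subgradient tau*sign(y_j); the others allow any value in [-tau, tau]
    s = B.T @ lam
    d = np.where(y != 0, s - tau * np.sign(y),
                 np.sign(s) * np.maximum(np.abs(s) - tau, 0.0))
    return r_feas, r_x, np.sum(d**2)

# sanity check on  min 0.5*||x-u||^2 + tau*||y||_1  s.t. x - y = 0
# (A = I, B = -I, c = 0); the optimum is the soft-thresholding of u,
# with multiplier lam* = x* - u from the x-optimality condition
u = np.array([3.0, -0.5, 1.2]); tau = 1.0
xs = np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)
res = stationarity_residuals(xs, xs.copy(), xs - u, np.eye(3), -np.eye(3),
                             np.zeros(3), lambda x: x - u, tau)
assert max(res) < 1e-12   # all three residuals vanish at the optimum
```

In practice, such a helper is what one would monitor along the iterates to decide when an $\epsilon$-stationary point has been reached.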
Next, based on the above assumptions and definition, we study the convergence and iteration complexity of the proposed methods. In particular, we provide a general framework to analyze the convergence and iteration complexity of stochastic ADMM methods with variance reduction. Specifically, the basic procedure is as follows:

First, we design a new sequence based on the sequence generated by the algorithm. For example, we design the sequence in (15) for the SVRG-ADMM, the sequence in (21) for the SAG-ADMM, and the sequence in (27) for the SAGA-ADMM.

Second, we prove that the designed sequence is monotonically decreasing, and has a lower bound.

Third, we define a new variable for the algorithm. For example, we define the variable in (19) for the SVRG-ADMM, the variable in (25) for the SAG-ADMM, and the variable in (31) for the SAGA-ADMM. Then, we prove that this variable has an upper bound, based on the above results.

Finally, we prove that the stationarity measure is bounded by the above-defined variable.
3.1 Convergence Analysis of Nonconvex SVRG-ADMM
In this subsection, we study the convergence and iteration complexity of the SVRG-ADMM. First, given the sequence generated by Algorithm 2, we define a useful sequence as follows:
(15) 
where the positive sequence satisfies the recursion given in (16).
Next, we consider three important lemmas: the first gives an upper bound used in the analysis; the second demonstrates that the designed sequence is monotonically decreasing; and the third gives a lower bound of the sequence.
Suppose the sequence is generated by Algorithm 2. Then the following inequality holds

where $\sigma_{\min}^A$ denotes the smallest eigenvalue of the matrix $A^T A$, and $\phi_{\max}$ denotes the largest eigenvalue of the positive definite matrix $G$. A detailed proof of Lemma 5 is provided in Appendix D. Lemma 5 gives this upper bound.
Suppose that the sequence is generated by Algorithm 2. Further suppose the positive sequence satisfies
(16) 
for all $k$. Denoting
(17) 
and letting the parameters of Algorithm 2 be chosen such that
then the sequence is monotonically decreasing.
A detailed proof of Lemma 6 is provided in Appendix E. Lemma 6 shows that the designed sequence is monotonically decreasing. Next, we further clarify the choice of the above parameters. We first define an auxiliary function, which attains its largest value at an appropriate choice of its argument,

where $\kappa_A$ denotes the condition number of the matrix $A$. Consequently, the penalty parameter $\rho$ should satisfy the following inequality
(18) 
Suppose the sequence is generated by Algorithm 2, and given the same conditions as in Lemma 6, the sequence has a lower bound.
A detailed proof of Lemma 7 is provided in Appendix F. Next, based on the above lemmas, we analyze the convergence and iteration complexity of the SVRG-ADMM. We first define a useful variable as follows:
(19) 
Suppose the sequence is generated by Algorithm 2, with the constants defined under the conditions of Lemma 6. Letting
(20) 
where the constant in (20) is a lower bound of the designed sequence (see Lemma 7), and denoting

then the output of Algorithm 2 is an $\epsilon$-stationary point of the problem (1).
A detailed proof of Theorem 8 is provided in Appendix G. Theorem 8 shows that the SVRG-ADMM is convergent and has an iteration complexity of $O(1/\epsilon)$ to reach an $\epsilon$-stationary point, i.e., it obtains a convergence rate of $O(1/T)$. From Theorem 8, we can see that the SVRG-ADMM ensures its convergence by progressively reducing the variance of the stochastic gradients.
3.2 Convergence Analysis of Nonconvex SAG-ADMM
In this subsection, we study the convergence and iteration complexity of the SAG-ADMM. First, given the sequence generated by Algorithm 3, we define a useful sequence as follows:
(21) 
where the positive sequence satisfies the equality (22).
Next, we consider three important lemmas: the first gives an upper bound used in the analysis; the second demonstrates that the designed sequence is monotonically decreasing; and the third gives a lower bound of the sequence.
Suppose the sequence is generated by Algorithm 3; then the following inequality holds