In this paper, we consider a class of composite convex optimization problems

$$\min_{x\in\mathbb{R}^{d}}\; f(x) + h(Ax), \qquad (1)$$

where $A$ is a given matrix, $f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, each $f_i(x)$ is a convex function, and $h(Ax)$ is convex but possibly non-smooth. With regard to $h(\cdot)$, we are interested in a sparsity-inducing regularizer, e.g. the $\ell_1$-norm, group Lasso and nuclear norm. When $A$ is an identity matrix, i.e. $A = I$, the above formulation (1) arises in many places in machine learning, statistics, and operations research [Bubeck2015]. In the large-sample regime, even first-order batch methods, e.g. FISTA [Beck and Teboulle2009], become computationally burdensome due to their per-iteration complexity of $O(nd)$. As a result, stochastic gradient descent (SGD), with a per-iteration complexity of $O(d)$, has witnessed tremendous progress in recent years. In particular, a number of stochastic variance reduced gradient methods such as SAG [Roux, Schmidt, and Bach2012], SDCA [Shalev-Shwartz and Zhang2013] and SVRG [Johnson and Zhang2013] have been proposed to address the high variance of the gradient estimate in ordinary SGD, resulting in a linear convergence rate (for strongly convex problems) as opposed to the sub-linear rates of SGD. More recently, Nesterov's acceleration technique [Nesterov2004] was introduced in [Allen-Zhu2016, Hien et al.2016] to further speed up the stochastic variance-reduced algorithms, which results in the best known convergence rates for both strongly convex and general convex problems. This motivates us to integrate the momentum acceleration trick into the stochastic alternating direction method of multipliers (ADMM) below.
When $A$ is a more general matrix, i.e. $A \neq I$, the formulation (1) covers many more complicated problems arising in machine learning, e.g. graph-guided fused Lasso [Kim, Sohn, and Xing2009] and generalized Lasso [Tibshirani and Taylor2011]. To solve this class of composite optimization problems with an auxiliary variable $y = Ax$, which are a special case of the general ADMM form,

$$\min_{x,\,y}\; f(x) + h(y), \quad \text{s.t.}\;\; Ax + By = c, \qquad (2)$$

ADMM is an effective optimization tool [Boyd et al.2011], and has shown attractive performance in a wide range of real-world problems, such as big data classification [Nie et al.2014]. To tackle the high per-iteration complexity of batch (deterministic) ADMM (a popular first-order optimization method), [Wang and Banerjee2012], [Suzuki2013] and [Ouyang et al.2013] proposed online or stochastic ADMM algorithms. However, all these variants only achieve convergence rates of $O(1/\sqrt{T})$ for general convex problems and $O(\log T/T)$ for strongly convex problems, respectively, as compared with the $O(1/T^2)$ and linear convergence rates of accelerated batch algorithms [Nesterov1983], e.g. FISTA, where $T$ is the number of iterations. Several accelerated and faster converging versions of stochastic ADMM, all based on variance reduction techniques, have since been proposed, e.g. SAG-ADMM [Zhong and Kwok2014b], SDCA-ADMM [Suzuki2014] and SVRG-ADMM [Zheng and Kwok2016]. With regard to strongly convex problems, [Suzuki2014] and [Zheng and Kwok2016] proved that linear convergence can be obtained for the special ADMM form (i.e. $B = -I$ and $c = 0$) and the general ADMM form, respectively. In SAG-ADMM and SVRG-ADMM, an $O(1/T)$ convergence rate can be guaranteed for general convex problems, which implies that there still remains a gap in convergence rates between stochastic ADMM and accelerated batch algorithms.
To bridge this gap, we integrate the momentum acceleration trick in [Tseng2010] for deterministic optimization into the stochastic variance reduced gradient (SVRG) based stochastic ADMM (SVRG-ADMM). The proposed method has the same low per-iteration time complexity as existing stochastic ADMM algorithms and, like SCAS-ADMM [Zhao, Li, and Zhou2015] and SVRG-ADMM [Zheng and Kwok2016], does not require the storage of all gradients (or dual variables), as shown in Table 1. We summarize our main contributions below.
We also prove that ASVRG-ADMM has a convergence rate of $O(1/T^2)$ for non-strongly convex problems, which is a factor of $T$ faster than SAG-ADMM and SVRG-ADMM, whose convergence rates are $O(1/T)$.
Our experimental results further verify that our ASVRG-ADMM method performs much better than the state-of-the-art stochastic ADMM methods.
Table 1: Convergence rates (general convex and strongly convex problems) and space requirements of stochastic ADMM algorithms.
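Introducing $y = Ax$, problem (1) becomes

$$\min_{x,\,y}\; f(x) + h(y), \quad \text{s.t.}\;\; Ax - y = 0. \qquad (3)$$

Although (3) is only a special case of the general ADMM form (2), when $B = -I$ and $c = 0$, the stochastic (or online) ADMM algorithms and theoretical results in [Wang and Banerjee2012, Ouyang et al.2013, Zhong and Kwok2014b, Zheng and Kwok2016] and this paper are all for the more general problem (2). To minimize (2), together with the dual variable $\lambda$, the update steps of batch ADMM are

$$y_k = \arg\min_{y}\; h(y) + \frac{\beta}{2}\|Ax_{k-1} + By - c + \lambda_{k-1}\|^2, \qquad (4)$$

$$x_k = \arg\min_{x}\; f(x) + \frac{\beta}{2}\|Ax + By_k - c + \lambda_{k-1}\|^2, \qquad (5)$$

$$\lambda_k = \lambda_{k-1} + Ax_k + By_k - c, \qquad (6)$$

where $\beta > 0$ is a penalty parameter.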
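As an illustration of the batch ADMM steps for the special form with constraint $Ax - y = 0$, the following is a minimal sketch rather than the authors' implementation. It assumes a least-squares data term $f(x) = \frac{1}{2n}\|Dx - b\|^2$ and $h(y) = \lambda\|y\|_1$, uses the scaled dual variable, and all names (`batch_admm`, `soft_threshold`, `D`, `lam`) are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def batch_admm(D, b, A, lam, beta=1.0, iters=200):
    """Scaled-form batch ADMM for
        min_x (1/(2n)) ||D x - b||^2 + lam * ||y||_1   s.t.  A x - y = 0.
    The least-squares f and the l1 regularizer are assumptions made for
    this sketch; u plays the role of the (scaled) dual variable."""
    n, d = D.shape
    x = np.zeros(d)
    y = np.zeros(A.shape[0])
    u = np.zeros(A.shape[0])
    # The x-subproblem is a linear system with a fixed matrix.
    H = D.T @ D / n + beta * (A.T @ A)
    Dtb = D.T @ b / n
    for _ in range(iters):
        # y-update: prox of h at A x + u
        y = soft_threshold(A @ x + u, lam / beta)
        # x-update: exact minimization of f(x) + (beta/2)||A x - y + u||^2
        x = np.linalg.solve(H, Dtb + beta * (A.T @ (y - u)))
        # dual update
        u = u + A @ x - y
    return x, y, u
```

Each iteration touches all $n$ samples through `D`, which is the per-iteration cost that the stochastic variants discussed next try to avoid.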
To extend the batch ADMM to the online and stochastic settings, the update steps for $y_k$ and $\lambda_k$ remain unchanged. In [Wang and Banerjee2012, Ouyang et al.2013], the update step of $x_k$ is approximated as follows:

$$x_k = \arg\min_{x}\; x^T\nabla f_{i_k}(x_{k-1}) + \frac{1}{2\eta_k}\|x - x_{k-1}\|_G^2 + \frac{\beta}{2}\|Ax + By_k - c + \lambda_{k-1}\|^2, \qquad (7)$$

where we draw $i_k$ uniformly at random from $[n] := \{1,\ldots,n\}$, $\eta_k$ is the step-size, and $\|z\|_G^2 = z^T G z$ with a given positive semi-definite matrix $G$ (e.g. as chosen in [Ouyang et al.2013]). Analogous to SGD, the stochastic ADMM variants use an unbiased estimate of the gradient at each iteration. However, all those algorithms have much slower convergence rates than their batch counterpart, as mentioned above. This barrier is mainly due to the variance introduced by the stochasticity of the gradients. Besides, to guarantee convergence, they employ a decaying sequence of step sizes, which in turn impacts the rates.
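More recently, a number of variance reduced stochastic ADMM methods (e.g. SAG-ADMM, SDCA-ADMM and SVRG-ADMM) have been proposed and have made exciting progress, such as linear convergence rates. SVRG-ADMM in [Zheng and Kwok2016] is particularly attractive here because of its low storage requirement compared with the algorithms in [Zhong and Kwok2014b, Suzuki2014]. Within each epoch of SVRG-ADMM, the full gradient $\nabla f(\tilde{x})$ is first computed, where $\tilde{x}$ is the average point of the previous epoch. Then $\nabla f_{i_k}(x_{k-1})$ and $\eta_k$ in (7) are replaced by

$$\tilde{\nabla} f_{I_k}(x_{k-1}) = \frac{1}{|I_k|}\sum_{i\in I_k}\big(\nabla f_i(x_{k-1}) - \nabla f_i(\tilde{x})\big) + \nabla f(\tilde{x})$$

and a constant step-size $\eta$, respectively, where $I_k \subset [n]$ is a mini-batch of size $b$ (which is a useful technique to reduce the variance). In fact, $\tilde{\nabla} f_{I_k}(x_{k-1})$ is an unbiased estimator of the gradient $\nabla f(x_{k-1})$, i.e. $\mathbb{E}[\tilde{\nabla} f_{I_k}(x_{k-1})] = \nabla f(x_{k-1})$.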
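The unbiasedness of the variance reduced gradient can be checked numerically. The sketch below assumes logistic losses $f_i(x) = \log(1 + \exp(-b_i a_i^T x))$ purely for illustration; the function names and the snapshot variable `x_snap` are hypothetical, not taken from the paper.

```python
import numpy as np

def logistic_grad(x, a, b):
    """Gradient of log(1 + exp(-b * a^T x)) with respect to x."""
    return -b * a / (1.0 + np.exp(b * (a @ x)))

def full_grad(x, feats, labels):
    """Average gradient over all n samples."""
    return np.mean([logistic_grad(x, a, b) for a, b in zip(feats, labels)], axis=0)

def svrg_grad(x, x_snap, mu_snap, feats, labels, batch):
    """Variance reduced mini-batch gradient:
    mean over the batch of (grad_i(x) - grad_i(x_snap)), plus the full
    gradient mu_snap computed once at the snapshot point x_snap."""
    corr = np.mean(
        [logistic_grad(x, feats[i], labels[i]) - logistic_grad(x_snap, feats[i], labels[i])
         for i in batch],
        axis=0,
    )
    return corr + mu_snap
```

Averaging `svrg_grad` over all single-index batches reproduces `full_grad(x, ...)` exactly, which is the unbiasedness property; the correction term also vanishes as `x` approaches `x_snap`, which is why a constant step-size becomes viable.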
Accelerated Variance Reduced Stochastic ADMM
In this section, we design an accelerated variance reduced stochastic ADMM method for both strongly convex and general convex problems. We first make the following assumptions: (i) each convex $f_i(\cdot)$ is $L$-smooth, i.e. there exists a constant $L > 0$ such that $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$ for all $x, y$; (ii) $f(\cdot)$ is $\mu$-strongly convex, i.e. there is $\mu > 0$ such that $f(x) \ge f(y) + \nabla f(y)^T(x - y) + \frac{\mu}{2}\|x - y\|^2$ for all $x, y$; (iii) the matrix $A$ has full row rank. The first two assumptions are common in the analysis of first-order optimization methods, while the last one has been used in the convergence analysis of batch ADMM [shang:rpca, Nishihara et al.2015, Deng and Yin2016] and stochastic ADMM [Zheng and Kwok2016].
The Strongly Convex Case
In this part, we consider the case of (2) when each $f_i(\cdot)$ is convex and $L$-smooth, and $f(\cdot)$ is $\mu$-strongly convex. Recall that this class of problems includes graph-guided logistic regression and SVM as notable examples. To efficiently solve this class of problems, we incorporate both the momentum acceleration and variance reduction techniques into stochastic ADMM. Our algorithm is divided into epochs, and each epoch consists of $m$ stochastic updates, where $m$ is usually chosen to be $\Theta(n)$ as in [Johnson and Zhang2013].
Let be an important auxiliary variable, its update rule is given as follows. Similar to [Zhong and Kwok2014b, Zheng and Kwok2016], we also use the inexact Uzawa method [Zhang, Burger, and Osher2011] to approximate the sub-problem (7), which can avoid computing the inverse of the matrix . Moreover, the momentum weight (the update rule for is provided below) is introduced into the proximal term similar to that of (7), and then the sub-problem with respect to is formulated as follows:
is the spectral norm, i.e. the largest singular value of the matrix. Furthermore, the update rule foris given by
where is the key momentum term (similar to those in accelerated batch methods [Nesterov2004]), which helps accelerate our algorithm by using the iterate of the previous epoch, i.e. . Similar to , . Moreover, can be set to a constant in all epochs of our algorithm, which must satisfy , where , and is defined below. The optimal value of is provided in Proposition 1 below. The detailed procedure is shown in Algorithm 1, where we adopt the same initialization technique for as in [Zheng and Kwok2016], and is the pseudo-inverse. Note that, when , ASVRG-ADMM degenerates to SVRG-ADMM in [Zheng and Kwok2016].
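To make the role of the inexact Uzawa (linearization) step concrete, here is a minimal sketch for the constraint $Ax - y = 0$ that sets the momentum weight aside, i.e. it corresponds to the non-accelerated, SVRG-ADMM-style update. It assumes the proximal matrix $G = \gamma I - \eta\beta A^T A$ with $\gamma = \eta\beta\|A^T A\|_2 + 1$, under which the sub-problem has a closed-form solution without any matrix inversion. The function name and argument names are hypothetical.

```python
import numpy as np

def linearized_x_update(x_prev, grad, A, y, lam, eta, beta):
    """Closed-form minimizer of
        <grad, x> + (1/(2*eta)) * ||x - x_prev||_G^2
                  + (beta/2) * ||A x - y + lam||^2
    with G = gamma * I - eta * beta * A^T A and
    gamma = eta * beta * ||A^T A||_2 + 1 (an assumed choice for this sketch).
    With this G the quadratic terms in A^T A cancel, so no inverse is needed."""
    spec = np.linalg.norm(A.T @ A, 2)   # spectral norm: largest singular value
    gamma = eta * beta * spec + 1.0
    residual = A @ x_prev - y + lam     # constraint residual at the previous iterate
    x_new = x_prev - (eta / gamma) * (grad + beta * (A.T @ residual))
    return x_new, gamma
```

The update is a gradient-like step with effective step-size $\eta/\gamma$, so its cost is dominated by two matrix-vector products with $A$. The spectral norm depends only on $A$ and can be computed once outside the loop.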
The Non-Strongly Convex Case
In this part, we consider general convex problems of the form (2) when each is convex, -smooth, and is not necessarily strongly convex (but possibly non-smooth). Different from the strongly convex case, the momentum weight is required to satisfy the following inequalities:
where is a decreasing function with respect to the mini-batch size . The condition (20) allows the momentum weight to decrease, but not too fast, similar to the requirement on the step-size in classical SGD and stochastic ADMM [tseng:sgd]. Unlike batch acceleration methods, the weight must satisfy both inequalities in (20).
For the special case of , we have and , while (i.e. batch version), and . Since is decreasing, then is satisfied. The detailed procedure is shown in Algorithm 2, which has many slight differences in the initialization and output of each epoch from Algorithm 1. In addition, the key difference between them is the update rule for the momentum weight . That is, in Algorithm 1 can be set to a constant, while that in Algorithm 2 is adaptively adjusted as in (12).
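A decreasing momentum weight of this kind can be illustrated with a recursion that is standard in accelerated batch methods in the style of [Tseng2010]: $\theta_s = \big(\sqrt{\theta_{s-1}^4 + 4\theta_{s-1}^2} - \theta_{s-1}^2\big)/2$, which satisfies $(1-\theta_s)/\theta_s^2 = 1/\theta_{s-1}^2$ with equality. Whether this coincides exactly with the update rule used in Algorithm 2, and how it interacts with the mini-batch dependent bound, is an assumption of this sketch; the function names are hypothetical.

```python
import math

def next_theta(theta):
    """Positive root of t^2 + theta^2 * t - theta^2 = 0, i.e. the value t
    with (1 - t) / t^2 == 1 / theta^2 (a Tseng-style recursion, assumed here)."""
    t2 = theta * theta
    return (math.sqrt(t2 * t2 + 4.0 * t2) - t2) / 2.0

def theta_schedule(theta0, epochs):
    """Momentum weights theta_0, theta_1, ..., theta_epochs."""
    thetas = [theta0]
    for _ in range(epochs):
        thetas.append(next_theta(thetas[-1]))
    return thetas
```

Starting from $\theta_0 = 1$, the sequence is strictly decreasing and bounded above by $2/(s+2)$, so it decays at the $O(1/s)$ speed that underlies $O(1/T^2)$-type rates, without decaying faster than the equality condition permits.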
This section provides the convergence analysis of our ASVRG-ADMM algorithms (i.e. Algorithms 1 and 2) for strongly convex and general convex problems, respectively. Following [Zheng and Kwok2016], we first introduce the following function as a convergence criterion, where denotes the (sub)gradient of at . Indeed, for all . In the following, we give the intermediate key results for our analysis.
Our first main result is the following theorem which gives the convergence rate of Algorithm 1.
The proof of Theorem 1 is provided in the Supplementary Material. From Theorem 1, one can see that ASVRG-ADMM achieves linear convergence, which is consistent with that of SVRG-ADMM, while SCAS-ADMM has only an $O(1/T)$ convergence rate.
Theorem 1 shows that our result improves slightly upon the rate in [Zheng and Kwok2016] with the same and . Specifically, as shown in (13), consists of three components, corresponding to those of Theorem 1 in [Zheng and Kwok2016]. In Algorithm 1, recall that here and is defined in (9). Thus, both the first and third terms in (13) are slightly smaller than those of Theorem 1 in [Zheng and Kwok2016]. In addition, one can set (i.e. ) and . Thus, the second term in (13) equals to , while that of SVRG-ADMM is approximately equal to . In summary, the convergence bound of SVRG-ADMM can be slightly improved by ASVRG-ADMM.
Selecting Scheme of $\theta$
The rate in (13) of Theorem 1 can be expressed as the function with respect to the parameters and with given . Similar to [Nishihara et al.2015, Zheng and Kwok2016], one can obtain the optimal parameter , which produces a smaller rate . In addition, as shown in (13), all the three terms are with respect to the weight . Therefore, we give the following selecting scheme for .
Given , and let , we set and , where . Then the optimal of Algorithm 1 is given by
The proof of Proposition 1 is provided in the Supplementary Material.
Convergence Rate of $O(1/T^2)$
We first assume that , where is a convex compact set with diameter , and the dual variable is also bounded with . For Algorithm 2, we give the following result.
Using the same notation as in Lemma 4 with , then we have
The proof of Theorem 2 is provided in the Supplementary Material. Theorem 2 shows that the convergence bound consists of the three components, which converge as , and , respectively, while the three components of SVRG-ADMM converge as , and . Clearly, ASVRG-ADMM achieves the convergence rate of as opposed to of SVRG-ADMM and SAG-ADMM (). All the components in the convergence bound of SCAS-ADMM converge as . Thus, it is clear from this comparison that ASVRG-ADMM is a factor of faster than SAG-ADMM, SVRG-ADMM and SCAS-ADMM.
Connections to Related Work
Our algorithms and convergence results can be extended to the following settings. When the mini-batch size and , then , that is, the first term of (14) vanishes, and ASVRG-ADMM degenerates to the batch version. Its convergence rate becomes (which is consistent with the optimal result for accelerated deterministic ADMM methods [Goldstein et al.2014, Lu et al.2016]), where . Many empirical risk minimization problems can be viewed as the special case of (1) when . Thus, our method can be extended to solve them, and has an rate, which is consistent with the best known result as in [Allen-Zhu2016, Hien et al.2016].
In this section, we use our ASVRG-ADMM method to solve the general convex graph-guided fused Lasso, strongly convex graph-guided logistic regression and graph-guided SVM problems. We compare ASVRG-ADMM with the following state-of-the-art methods: STOC-ADMM [Ouyang et al.2013], OPG-ADMM [Suzuki2013], SAG-ADMM [Zhong and Kwok2014b], SCAS-ADMM [Zhao, Li, and Zhou2015] and SVRG-ADMM [Zheng and Kwok2016]. All methods were implemented in MATLAB, and the experiments were performed on a PC with an Intel i5-2400 CPU and 16GB RAM.
Graph-Guided Fused Lasso
We first evaluate the empirical performance of the proposed method for solving the graph-guided fused Lasso problem:

$$\min_{x}\; \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \lambda_1\|Ax\|_1,$$

where $f_i$ is the logistic loss function on the feature-label pair $(a_i, b_i)$, i.e. $f_i(x) = \log(1 + \exp(-b_i a_i^T x))$, and $\lambda_1 \ge 0$ is the regularization parameter. Here, we set $A = [G; I]$ as in [Ouyang et al.2013, Zhong and Kwok2014b, Azadi and Sra2014, Zheng and Kwok2016], where $G$ is the sparsity pattern of the graph obtained by sparse inverse covariance selection [Banerjee, Ghaoui, and d’Aspremont2008]. We used four publicly available data sets (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) in our experiments, as listed in Table 2. Note that except STOC-ADMM, all the other algorithms adopted the linearization of the penalty term to avoid a matrix inversion at each iteration, which can be computationally expensive for large matrices. The parameters of ASVRG-ADMM are set as follows: and as in [Zhong and Kwok2014b, Zheng and Kwok2016], as well as and .
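For reference, the training objective of the graph-guided fused Lasso with logistic loss can be written out as below. This is a minimal sketch: the stacking $A = [G; I]$ follows the convention of [Ouyang et al.2013] as assumed here, and the names `build_A`, `gfl_objective`, `feats`, `labels` and `lam1` are hypothetical.

```python
import numpy as np

def logistic_loss(x, feats, labels):
    """Average of log(1 + exp(-b_i * a_i^T x)) over all samples,
    computed with logaddexp for numerical stability."""
    margins = labels * (feats @ x)
    return np.mean(np.logaddexp(0.0, -margins))

def build_A(G):
    """Stack the graph pattern G on top of an identity block, A = [G; I]
    (assumed convention for this sketch)."""
    d = G.shape[1]
    return np.vstack([G, np.eye(d)])

def gfl_objective(x, feats, labels, G, lam1):
    """Graph-guided fused Lasso objective: logistic loss + lam1 * ||A x||_1."""
    A = build_A(G)
    return logistic_loss(x, feats, labels) + lam1 * np.sum(np.abs(A @ x))
```

The "training error" reported in the experiments is this objective value minus its minimum, so a helper like `gfl_objective` is the quantity one would track per epoch.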
Figure 1 shows the training error (i.e. the training objective value minus the minimum) and testing loss of all the algorithms for the general convex problem on the four data sets. SAG-ADMM could not generate experimental results on the HIGGS data set because it ran out of memory. These figures clearly indicate that the variance reduced stochastic ADMM algorithms (including SAG-ADMM, SCAS-ADMM, SVRG-ADMM and ASVRG-ADMM) converge much faster than those without variance reduction techniques, e.g. STOC-ADMM and OPG-ADMM. Notably, ASVRG-ADMM consistently outperforms all other algorithms in terms of the convergence rate under all settings, which empirically verifies our theoretical result that ASVRG-ADMM has a faster convergence rate of $O(1/T^2)$, as opposed to the best known rate of $O(1/T)$.
Graph-Guided Logistic Regression
Due to limited space and similar experimental phenomena on the four data sets, we only report the experimental results on the a9a and w8a data sets in Figure 2, from which we observe that SVRG-ADMM and ASVRG-ADMM achieve comparable performance, and they significantly outperform the other methods in terms of the convergence rate, which is consistent with their linear (geometric) convergence guarantees. Moreover, ASVRG-ADMM converges slightly faster than SVRG-ADMM, which shows the effectiveness of the momentum trick to accelerate variance reduced stochastic ADMM, as we expected.
Finally, we evaluate the performance of ASVRG-ADMM for solving the graph-guided SVM problem,
where is the non-smooth hinge loss. To effectively solve problem (17), we used the smooth Huberized hinge loss in [Rosset and Zhu2007] to approximate the hinge loss. For the 20newsgroups dataset (http://www.cs.nyu.edu/~roweis/data.html), we randomly divide it into an 80% training set and a 20% test set. Following [Ouyang et al.2013], we set , and use the one-vs-rest scheme for multi-class classification.
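As an illustration of a smooth surrogate for the hinge loss, the sketch below implements one commonly cited Huberized (squared) hinge form as a function of the margin $z = b_i a_i^T x$: zero for $z > 1$, quadratic $(1-z)^2$ for $\delta < z \le 1$, and linear with matching slope for $z \le \delta$. Whether the experiments used exactly this parameterization is an assumption, and the knot $\delta$ (default $-1$ here) is a hypothetical choice.

```python
import numpy as np

def huberized_hinge(z, delta=-1.0):
    """Huberized squared hinge as a function of the margin z:
        (1 - delta)^2 + 2 * (1 - delta) * (delta - z)   if z <= delta
        (1 - z)^2                                       if delta < z <= 1
        0                                               if z > 1
    The form and the default delta are assumptions for this sketch."""
    z = np.asarray(z, dtype=float)
    linear = (1.0 - delta) ** 2 + 2.0 * (1.0 - delta) * (delta - z)
    quad = (1.0 - z) ** 2
    return np.where(z <= delta, linear, np.where(z <= 1.0, quad, 0.0))

def huberized_hinge_grad(z, delta=-1.0):
    """Derivative with respect to z; continuous at both knots z = delta and z = 1."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= delta, -2.0 * (1.0 - delta),
                    np.where(z <= 1.0, -2.0 * (1.0 - z), 0.0))
```

Because the derivative is continuous and piecewise linear, the loss is smooth with a bounded Lipschitz constant for its gradient, which is what the smoothness assumption on each $f_i$ requires and what the plain hinge loss lacks.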
shows the average prediction accuracies and standard deviations of testing accuracies over 10 different runs. Since STOC-ADMM, OPG-ADMM, SAG-ADMM and SCAS-ADMM consistently perform worse than SVRG-ADMM and ASVRG-ADMM in all settings, we only report the results of STOC-ADMM. We observe that SVRG-ADMM and ASVRG-ADMM consistently outperform the classical SVM and STOC-ADMM. Moreover, ASVRG-ADMM performs much better than the other methods in all settings, which again verifies the effectiveness of our ASVRG-ADMM method.
In this paper, we proposed an accelerated stochastic variance reduced ADMM (ASVRG-ADMM) method, in which we combined both the momentum acceleration trick for batch optimization and the variance reduction technique. We designed two different momentum term update rules for the strongly convex and general convex cases, respectively. Moreover, we theoretically analyzed the convergence properties of ASVRG-ADMM, from which it is clear that ASVRG-ADMM achieves linear convergence and $O(1/T^2)$ rates for the two cases, respectively. In particular, ASVRG-ADMM is at least a factor of $T$ faster than existing stochastic ADMM methods for general convex problems.
We thank the reviewers for their valuable comments. The authors are supported by the Hong Kong GRF 2150851 and 2150895, and Grants 3132964 and 3132821 funded by the Research Committee of CUHK.
- [Allen-Zhu2016] Allen-Zhu, Z. 2016. Katyusha: Accelerated variance reduction for faster sgd. arXiv:1603.05953v4.
- [Azadi and Sra2014] Azadi, S., and Sra, S. 2014. Towards an optimal stochastic alternating direction method of multipliers. In Proc. 31st Int. Conf. Mach. Learn. (ICML), 620–628.
- [Baldassarre and Pontil2013] Baldassarre, L., and Pontil, M. 2013. Advanced topics in machine learning part II: 5. Proximal methods. University Lecture.
- [Banerjee, Ghaoui, and d’Aspremont2008] Banerjee, O.; Ghaoui, L. E.; and d’Aspremont, A. 2008. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. J. Mach. Learn. Res. 9:485–516.
- [Beck and Teboulle2009] Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1):183–202.
- [Boyd et al.2011] Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1):1–122.
- [Bubeck2015] Bubeck, S. 2015. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn. 8:231–358.
- [Deng and Yin2016] Deng, W., and Yin, W. 2016. On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 66:889–916.
- [Goldstein et al.2014] Goldstein, T.; O’Donoghue, B.; Setzer, S.; and Baraniuk, R. 2014. Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3):1588–1623.
- [Hien et al.2016] Hien, L. T. K.; Lu, C.; Xu, H.; and Feng, J. 2016. Accelerated stochastic mirror descent algorithms for composite non-strongly convex optimization. arXiv:1605.06892v2.
- [Johnson and Zhang2013] Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), 315–323.
- [Kim, Sohn, and Xing2009] Kim, S.; Sohn, K. A.; and Xing, E. P. 2009. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics 25:i204–i212.
- [Konečný et al.2016] Konečný, J.; Liu, J.; Richtárik, P.; and Takáč, M. 2016. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Sign. Proces. 10(2):242–255.
- [Lan2012] Lan, G. 2012. An optimal method for stochastic composite optimization. Math. Program. 133:365–397.
- [Lu et al.2016] Lu, C.; Li, H.; Lin, Z.; and Yan, S. 2016. Fast proximal linearized alternating direction method of multiplier with parallel splitting. In Proc. 30th AAAI Conf. Artif. Intell., 739–745.
- [Nesterov1983] Nesterov, Y. 1983. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady 27(2):372–376.
- [Nesterov2004] Nesterov, Y. 2004. Introductory Lectures on Convex Optimization: A Basic Course. Boston: Kluwer Academic Publ.
- [Nie et al.2014] Nie, F.; Huang, Y.; Wang, X.; and Huang, H. 2014. Linear time solver for primal SVM. In Proc. 31st Int. Conf. Mach. Learn. (ICML), 505–513.
- [Nishihara et al.2015] Nishihara, R.; Lessard, L.; Recht, B.; Packard, A.; and Jordan, M. I. 2015. A general analysis of the convergence of ADMM. In Proc. 32nd Int. Conf. Mach. Learn. (ICML), 343–352.
- [Ouyang et al.2013] Ouyang, H.; He, N.; Tran, L. Q.; and Gray, A. 2013. Stochastic alternating direction method of multipliers. In Proc. 30th Int. Conf. Mach. Learn. (ICML), 80–88.
- [Rosset and Zhu2007] Rosset, S., and Zhu, J. 2007. Piecewise linear regularized solution paths. Ann. Statist. 35(3):1012–1030.
- [Roux, Schmidt, and Bach2012] Roux, N. L.; Schmidt, M.; and Bach, F. 2012. A stochastic gradient method with an exponential convergence rate for finite training sets. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2672–2680.
- [Shalev-Shwartz and Zhang2013] Shalev-Shwartz, S., and Zhang, T. 2013. Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14:567–599.
- [Suzuki2013] Suzuki, T. 2013. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proc. 30th Int. Conf. Mach. Learn. (ICML), 392–400.
- [Suzuki2014] Suzuki, T. 2014. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proc. 31st Int. Conf. Mach. Learn. (ICML), 736–744.
- [Tibshirani and Taylor2011] Tibshirani, R. J., and Taylor, J. 2011. The solution path of the generalized lasso. Annals of Statistics 39(3):1335–1371.
- [Tseng2010] Tseng, P. 2010. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125:263–295.
- [Wang and Banerjee2012] Wang, H., and Banerjee, A. 2012. Online alternating direction method. In Proc. 29th Int. Conf. Mach. Learn. (ICML), 1119–1126.
- [Zhang, Burger, and Osher2011] Zhang, X.; Burger, M.; and Osher, S. 2011. A unified primal-dual algorithm framework based on Bregman iteration. J. Sci. Comput. 46(1):20–46.
- [Zhao, Li, and Zhou2015] Zhao, S.-Y.; Li, W.-J.; and Zhou, Z.-H. 2015. Scalable stochastic alternating direction method of multipliers. arXiv:1502.03529v3.
- [Zheng and Kwok2016] Zheng, S., and Kwok, J. T. 2016. Fast-and-light stochastic ADMM. In Proc. 25th Int. Joint Conf. Artif. Intell., 2407–2613.
- [Zhong and Kwok2014a] Zhong, L. W., and Kwok, J. T. 2014a. Accelerated stochastic gradient method for composite regularization. In Proc. 17th Int. Conf. Artif. Intell. Statist., 1086–1094.
- [Zhong and Kwok2014b] Zhong, L. W., and Kwok, J. T. 2014b. Fast stochastic alternating direction method of multipliers. In Proc. 31st Int. Conf. Mach. Learn. (ICML), 46–54.
Supplementary Materials for “Accelerated Variance Reduced Stochastic ADMM”
In this supplementary material, we give the detailed proofs for two important lemmas (i.e., Lemmas 1 and 2), two key theorems (i.e., Theorems 1 and 2) and a proposition (i.e., Proposition 1).
Proof of Lemma 1:
Our convergence analysis will use a bound on the variance term , as shown in Lemma 1. Before giving the proof of Lemma 1, we first give the following lemma.
Since each is convex, -smooth (), then the following holds
where , and .
This result follows immediately from Theorem 2.1.5 in [Nesterov2004]. ∎
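The inequality from Theorem 2.1.5 of [Nesterov2004] that underlies this step states that a convex, $L$-smooth function $f$ satisfies $\|\nabla f(x) - \nabla f(y)\|^2 \le 2L\,[f(x) - f(y) - \langle\nabla f(y), x - y\rangle]$. The sketch below checks it numerically for a single logistic loss, for which $L = \|a\|^2/4$ is a valid smoothness constant. The exact statement of the lemma is not reproduced in the text above, so the form tested here is an assumption, and the function names are hypothetical.

```python
import numpy as np

def logistic(x, a, b):
    """Logistic loss log(1 + exp(-b * a^T x)) for one sample (a, b)."""
    return np.logaddexp(0.0, -b * (a @ x))

def logistic_grad(x, a, b):
    """Gradient of the logistic loss with respect to x."""
    return -b * a / (1.0 + np.exp(b * (a @ x)))

def smoothness_gap(x, y, a, b):
    """Right-hand side minus left-hand side of
        ||grad f(x) - grad f(y)||^2 <= 2 L [f(x) - f(y) - <grad f(y), x - y>]
    with L = ||a||^2 / 4; the result is non-negative when the inequality holds."""
    L = (a @ a) / 4.0
    lhs = np.sum((logistic_grad(x, a, b) - logistic_grad(y, a, b)) ** 2)
    rhs = 2.0 * L * (logistic(x, a, b) - logistic(y, a, b)
                     - logistic_grad(y, a, b) @ (x - y))
    return rhs - lhs
```

Applying the inequality with $y$ set to the snapshot point and averaging over $i$ is the usual route to bounding the variance of the variance reduced gradient by function-value gaps.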
Proof of Lemma 1:
. Taking expectation over the random choice of , we have
where the first inequality follows from the fact that , and the second inequality is due to Lemma 3 given above. Note that the similar result for (19) was also proved in [Allen-Zhu2016] (see Lemma 3.4 in [Allen-Zhu2016]). Next, we extend the result to the mini-batch setting.
Proof of Lemma 2:
Given any , then we have