We consider the following minimization problem characterized by a separable objective function with linear constraints:
where , , ; and are nonempty convex sets; and and are both convex functions. The function is assumed to be of the form , where is the sample size and is the loss incurred on
Many algorithms have been developed over the last thirty years to solve the convex minimization problem in (1), starting with the Peaceman-Rachford splitting method (PRSM)  and the Douglas-Rachford splitting method (DRSM) . Applying DRSM to the dual of (1), one gets the popular optimization method called the Alternating Direction Method of Multipliers (ADMM) [9, 8, 3, 2], which is fast in practice and easy to implement. Convergence rates for ADMM, in both the ergodic and non-ergodic sense, have been studied recently. For example, [13, 19, 25] showed that ADMM has ergodic rate, where stands for the number of iterations, while  established that non-ergodic convergence rate is tight for general DRSM. Applying Nesterov’s extrapolation to accelerate ADMM, one gets convergence rate . With additional assumptions on the objective function, such as strong convexity, the convergence rate can be further strengthened [6, 14].
In addition to these theoretical developments, many new variants of ADMM have appeared, including both batch and stochastic versions of the algorithm.  proposed a stochastic ADMM iteration scheme and then showed its good performance on a large-scale problem by using the first order approximation to the Lagrangian. However, because of the noisy gradient and inexact approximation to the stochastic function , the stochastic ADMM can only attain ergodic convergence rate. Recently, a number of accelerated stochastic versions of ADMM that incorporate variance reduction techniques (see [15, 5, 26, 23]) were proposed with better convergence results: SDCA-ADMM , SAG-ADMM , SVRG-ADMM , and ASVRG-ADMM . SVRG-ADMM and SAG-ADMM enjoy ergodic convergence rate and non-ergodic convergence rate. ASVRG-ADMM, which makes each iteration step expensive and needs a large number of inner iterations, can have ergodic rate under general convex assumptions and a linear convergence rate under a strong convexity assumption. These developments effectively removed the convergence rate gap between stochastic ADMM and batch ADMM.
On the other hand, the development of PRSM and its variants has not been as fast as that of DRSM. Although PRSM always converges faster in experiments, whenever it converges, the main difficulty is that the sequence generated by PRSM is not strictly contractive with respect to the solution set of Problem (1) .  proposed a method, called strictly contractive PRSM (SC-PRSM), to overcome this difficulty by attaching an underdetermined relaxation factor to the penalty parameter when updating the Lagrange multiplier. After this paper, mirroring the evolution of DRSM, some new variants of SC-PRSM were developed. [10, 20] used two different relaxation factors and showed the flexibility of this setting.  showed SC-PRSM can attain the worst-case convergence rate in both ergodic and non-ergodic sense. With the exception of , the development of stochastic algorithms based on PRSM is lacking, even though PRSM outperforms ADMM in many numerical experiments.  developed an algorithm called Stochastic Semi-Proximal-Based SC-PRSM (SSPB-SCPRSM), which contains the Stochastic SC-PRSM as a special case, but with a convergence rate of just in ergodic sense, the same as stochastic ADMM. Therefore the gap between batch PRSM and stochastic PRSM still exists.
In this paper, we bridge the gap by developing a new accelerated stochastic algorithm based on SC-PRSM, called Stochastic Scalable PRSM (SS-PRSM). Compared to SC-PRSM, we use two different relaxation factors, and , to make it more flexible. We borrow the general iteration structure from  and , but accelerate the iteration for . This adjustment helps us achieve ergodic rate, improving the rate in  and matching ADMM-based stochastic algorithms. Finally, we illustrate superiority over ADMM-based stochastic algorithms in numerical experiments, mirroring the batch case. Our contributions in this paper are:
Theoretically, we prove ergodic convergence rate for the proposed algorithm. This bridges the ergodic convergence rate gap between stochastic PRSM and batch PRSM (note that the non-ergodic rate is still an open problem);
Compared with related stochastic ADMM-based algorithms, we add two proximal terms in the iterations of and , which improves flexibility;
Compared with SVRG-ADMM , we accelerate only the iteration to get the same convergence rate;
Our algorithm is very flexible, leading to different new stochastic algorithms by setting , , , properly.
The remainder of the paper is organized as follows. In section 2, we provide background, discuss related work in more detail, show fundamental iteration schemes, and introduce the notation used throughout. In section 3, we introduce our algorithm. Theoretical convergence analysis is given in section 4. Extensive numerical experiments that illustrate the performance are in section 5. Finally, section 6 concludes the paper and provides directions for future research.
Notations. Throughout the paper, we will use the following notation:
, define ; ;
; ; ; ;
Define ; given ,
is the vector-valued soft-thresholding operator, which is defined as ; define .
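As an illustration of the soft-thresholding operator mentioned above, the following is a minimal sketch. It assumes the standard block form, in which a vector is shrunk toward zero by a threshold `kappa` along its own direction, together with the usual entry-wise variant that arises for the ℓ1 penalty. The function names and the symbol `kappa` are hypothetical and chosen only for this example.

```python
import numpy as np

def soft_threshold(a, kappa):
    """Block soft-thresholding: scale the whole vector a by (1 - kappa/||a||)_+ ."""
    a = np.asarray(a, dtype=float)
    norm = np.linalg.norm(a)
    if norm <= kappa:
        # The vector is entirely inside the threshold ball, so it maps to zero.
        return np.zeros_like(a)
    return (1.0 - kappa / norm) * a

def soft_threshold_entrywise(a, kappa):
    """Entry-wise soft-thresholding: sign(a_j) * max(|a_j| - kappa, 0)."""
    a = np.asarray(a, dtype=float)
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)
```

The block form is the proximal map of a scaled Euclidean norm (as used per group in group Lasso), while the entry-wise form is the proximal map of a scaled ℓ1 norm (as used in Lasso).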
Modern data sets are getting ever larger, which drives the development of efficient and scalable optimization algorithms. One promising direction is the development of mini-batch and stochastic algorithms; the latter can be thought of as a special case of mini-batching where one sample point is involved in updating the parameters. Another direction is online algorithms that process data arriving in streams [29, 11, 25]. Our focus in this paper is on developing a scalable, stochastic algorithm for solving the large-scale optimization problems in (1). In a stochastic algorithm, there is a trade-off between computation speed and convergence speed. When a single sample is randomly chosen to get an approximate descent direction, the computation is fast but the convergence is slow. On the other hand, batch algorithms use all the samples to find the exact descent direction, which results in a faster convergence rate, but each iteration is slower to compute and may even be infeasible to implement. Our proposed algorithm balances the two aspects by first computing the batch gradient and then, in an iteration, adjusting the gradient direction based on the current sample. Our final algorithm is comparable with batch algorithms such as ADMM and SC-PRSM in the ergodic sense. In addition, we add two pivotal proximal terms to make its implementation flexible.
We briefly introduce two fundamental stochastic iteration schemes that are useful for the development of our algorithm. The first scheme is used in the stochastic ADMM , where the noisy gradient is used to approximate the augmented Lagrangian as:
Here, can be seen as the selector of a sample that will be used in computing a subgradient of in the th iteration. In our setting , and we can take (we use to denote a uniform distribution among discrete indices). More generally, if , we can take for some fixed distribution . Furthermore, is the Lagrangian multiplier, is the predefined penalty parameter, and is the time-varying step size. As iteration number goes up, we expect to find a triplet such that , , , we have
Details of stochastic ADMM are given in Algorithm 1, which successively updates , , .
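To make the successive updates concrete, here is a minimal sketch of a linearized stochastic ADMM iteration on a Lasso instance written as min (1/2n)||Dx - c||^2 + mu||y||_1 subject to x - y = 0. This is an illustration under stated assumptions rather than a transcription of Algorithm 1: the scaled multiplier `u`, the step size 1/sqrt(k), and all function and variable names are choices made for this example.

```python
import numpy as np

def lasso_objective(D, c, x, mu):
    """(1/2n)||D x - c||^2 + mu * ||x||_1."""
    r = D @ x - c
    return 0.5 * np.mean(r ** 2) + mu * np.sum(np.abs(x))

def stochastic_admm_lasso(D, c, mu, beta=1.0, n_iter=5000, seed=0):
    """Linearized stochastic ADMM sketch with constraint x - y = 0."""
    rng = np.random.default_rng(seed)
    n, p = D.shape
    x, y, u = np.zeros(p), np.zeros(p), np.zeros(p)  # u is the scaled multiplier
    for k in range(1, n_iter + 1):
        i = rng.integers(n)              # sample selector, uniform over the rows
        g = D[i] * (D[i] @ x - c[i])     # noisy gradient of the smooth loss at x
        eta = 1.0 / np.sqrt(k)           # decaying step size (assumed schedule)
        # x-update: closed-form minimizer of
        #   g^T x + (beta/2)||x - y + u||^2 + ||x - x_k||^2 / (2 eta)
        x = (beta * (y - u) + x / eta - g) / (beta + 1.0 / eta)
        # y-update: entry-wise soft-thresholding with threshold mu / beta
        v = x + u
        y = np.sign(v) * np.maximum(np.abs(v) - mu / beta, 0.0)
        # multiplier update on the constraint residual x - y
        u = u + x - y
    return x, y
```

Each iteration touches a single row of `D`, which is what makes the per-iteration cost low; the price is the decaying step size needed to control the gradient noise.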
The second iteration scheme is used in SSPB-SCPRSM  and is shown in Algorithm 2. The additional parameters make Algorithm 2 more flexible. The feasible range for and is and . Setting results in stochastic SC-PRSM. When goes to , the algorithm reduces to stochastic PRSM. Unlike stochastic ADMM, it creates an intermediate iterate between and , which always appears in PRSM-based algorithms. Without the additional parameters , this step can still make PRSM outperform ADMM in experiments, but it also results in the loss of the strict contraction property [12, 10].
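The role of the intermediate multiplier update can be seen in a small deterministic example. The sketch below is a batch (not stochastic) PRSM-type iteration with two relaxation factors on a Lasso instance with constraint x - y = 0; it omits the proximal terms and is not a transcription of Algorithm 2. The names `alpha`, `gamma`, `lam_half` and the default values are assumptions made for this illustration.

```python
import numpy as np

def prsm_lasso(D, c, mu, beta=1.0, alpha=0.9, gamma=0.9, n_iter=300):
    """Batch PRSM-type sketch for min (1/2n)||Dx - c||^2 + mu||y||_1, x - y = 0."""
    n, p = D.shape
    H = D.T @ D / n
    q = D.T @ c / n
    M = H + beta * np.eye(p)
    x, y, lam = np.zeros(p), np.zeros(p), np.zeros(p)
    for _ in range(n_iter):
        # x-update: minimize f(x) - lam^T (x - y) + (beta/2)||x - y||^2
        x = np.linalg.solve(M, q + lam + beta * y)
        # intermediate multiplier update (the step absent from ADMM)
        lam_half = lam - alpha * beta * (x - y)
        # y-update: entry-wise soft-thresholding
        v = x - lam_half / beta
        y = np.sign(v) * np.maximum(np.abs(v) - mu / beta, 0.0)
        # second multiplier update with its own relaxation factor
        lam = lam_half - gamma * beta * (x - y)
    return x, y, lam
```

Dropping the `lam_half` line (equivalently, setting `alpha = 0`) and taking `gamma = 1` gives back a standard ADMM iteration, which is one way to see what the extra relaxation factor controls.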
3 Stochastic Scalable PRSM
In this section, we describe our proposed stochastic scalable PRSM algorithm, which improves over SC-PRSM. A gradient estimate based on a single sample has a large variance, and one requires the time-varying step size to decay to zero in order to ensure convergence. The order of this decay is essential for obtaining a good convergence rate. Here, we use a variance reduction trick (inspired by ) to incorporate information from all the samples in a stochastic setting.
Instead of using a first order approximated Lagrangian, we minimize the following augmented Lagrangian:
Compared to Algorithm 2, SS-PRSM only adjusts the iteration for . Suppose that at iteration , we have . We define a gradient function associated with as
where is a predefined matrix. Note that if , the proximal term plays the same role as the term in (2), but we do not need . Note that we can use , which changes with iterations, but for simplicity and implementability we will fix . In our algorithm, stochastic variance reduced gradient descent is used to minimize . Algorithm 3 summarizes the steps used to update .
The iteration schemes for and are the same as in Algorithm 2 (lines 4-6). Note that using or for updating will result in the same update. The term is crucial here because it involves , which collects the "information" we get from other samples that are not selected in the stochastic update. With this choice of we have that . Finally, we note that saving all from 0 to is not necessary, as an incremental vector can be used to obtain .
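The variance-reduced inner loop can be sketched as follows. This is an illustrative simplification, not a transcription of Algorithm 3: it assumes a least-squares loss, an identity constraint matrix, and a quadratic coupling term (beta/2)||x - z||^2 in which `z` stands in for the contribution of the other variable and the multiplier. The snapshot `x_tilde`, the running average, and all names are assumptions of this example.

```python
import numpy as np

def svrg_gradient(D, c, x, x_tilde, full_grad, i):
    """Variance-reduced estimate: grad_i(x) - grad_i(x_tilde) + full gradient at x_tilde."""
    gi_x = D[i] * (D[i] @ x - c[i])
    gi_snap = D[i] * (D[i] @ x_tilde - c[i])
    return gi_x - gi_snap + full_grad

def svrg_x_update(D, c, x_tilde, z, beta, eta, m, rng):
    """Approximately minimize (1/2n)||Dx - c||^2 + (beta/2)||x - z||^2 by SVRG steps."""
    n = D.shape[0]
    full_grad = D.T @ (D @ x_tilde - c) / n   # batch gradient, computed once
    x = x_tilde.copy()
    x_avg = np.zeros_like(x)
    for s in range(m):
        i = rng.integers(n)                   # one sample per inner step
        v = svrg_gradient(D, c, x, x_tilde, full_grad, i)
        x = x - eta * (v + beta * (x - z))
        # incremental running mean, so the inner iterates need not be stored
        x_avg += (x - x_avg) / (s + 1)
    return x_avg
```

Averaging the estimate over all indices `i` recovers the exact batch gradient at `x`, and its variance shrinks as `x` approaches the snapshot, which is why a constant inner step size `eta` can be used here.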
Computationally, one iteration of SS-PRSM is slower than one iteration of Algorithms 1 and 2, because we still need to compute the batch gradient. However, the inner iteration is fast and comparable with that of ADMM-based algorithms [28, 27, 18], while the overall number of iterations is lower. Also, our algorithm shows that accelerating only one variable is enough to attain rate, without involving and the Lagrange multiplier . In the following section, we present our main theoretical result.
4 Convergence Analysis
In this section, we study the convergence rate of SS-PRSM and establish ergodic rate. To measure the convergence rate, we will use the following criterion function, which comes from the variational inequality: . We will show that where and , we have
The above criterion is equivalent to for , which was used in [21, 26, 18]. Furthermore, since our algorithm only modifies the updating scheme for , we borrow part of the arguments given in . We start by listing our fairly mild assumptions:
A1 and are two convex functions with .
A2 For all , is -smooth — , Furthermore, we let .
A3 We assume . Otherwise, we only solve a minimization in a large enough but finite space.
Note that the assumptions are quite general. In particular, we do not need smoothness of . A3 is necessary for all ADMM and PRSM methods to guarantee convergence.
Next, we present two lemmas which will be used to prove the main result. The first lemma measures the decrease for in one step.
For fixed, suppose we have (so ), then after getting by doing SS-PRSM for one step we will have ,
Let us focus on the term on the left-hand side, which is an approximation to . To show the lower bound for the decrease of , we need an upper bound for this expectation. The following lemma gives this upper bound.
Define for , then we have
Based on these two lemmas, we are ready to present our main result. The choices of the inner-iteration step size and the number of inner iterations are pivotal. We will show in the proof that we can simply set and for some constant to attain global convergence. Because converges to 0 as increases, fewer and fewer inner iterations are needed as the algorithm progresses.
Let the sequence be generated by SS-PRSM under the above setting for and , and let and . Then we have
Further, , we have
where is the true solution.
We give a brief proof sketch. First, because our iteration schemes for and are the same as in , we modify their results to fit our algorithm. Combining part of their results with our Lemma 4.1, we obtain the following argument, which measures the total decrease of the criterion function. For completeness, we show how to prove this argument in our supplementary material.
Argument: Let the sequence be generated by SS-PRSM. If and , then we have
where , for , and .
Note that . So summing from to and taking expectation on both sides, we can show
So, under our setting for and , we obtain our main theorem. For the second part of the theorem, because is random in the above inequality, we consider to be where for any . Plugging into the above inequality and using the following fact
we can show the convergence order of SS-PRSM under the second criterion. The detailed proof is given in the supplementary material.
From the above theorem, we see that SS-PRSM has ergodic convergence rate. So, we have shown that, by adjusting only the iteration in , we can attain the same convergence rate as most ADMM-based stochastic algorithms, such as SVRG-ADMM (see ).
5 Numerical Experiments
In this section, we run extensive numerical experiments to test the performance of our algorithm. Since the goal is to bridge the convergence rate gap between stochastic and batch algorithms based on PRSM, we will also use SC-PRSM as one of the competitors. Specifically, we use the following algorithms for comparison:
We use both simulated and real data and focus on three optimization problems arising in machine learning: Lasso, group Lasso, and sparse logistic regression. These problems can be easily formulated as optimization programs of the form (1), as we show below. We try a wide range of settings for our parameters. We use to denote the sample matrix where is the number of predictors; is the response vector; and denote one row of and respectively; is the target parameter; and is the number of nonzero coefficients. Simulation results are averaged over independent runs.
5.1 Lasso
The Lasso problem can be formulated as
We set , , and . We construct by drawing each entry independently from and then normalizing each row as . The parameter is generated by uniformly drawing indices from 1 to and setting . Finally, where . The regularization parameter is set as . Since is sparse we assume when setting and . The settings for the other parameters are tabulated in Table 1. We only consider setting , while can be set arbitrarily. This is because other forms of might make the subproblem for finding hard to solve. The detailed iteration scheme is given in Algorithm 4 in the supplementary material. The loss decay plots are shown in Figure 1(a) and Figure 1(b). From the plots, we see that S-ADMM and SSPB-SCPRSM require many more passes over the data. Our SS-PRSM is competitive with the state-of-the-art stochastic ADMM-based algorithms.
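A simulation of this kind can be sketched as follows. The sizes, the standard normal entries, the value assigned to the nonzero coefficients, and the noise level below are placeholders assumed for illustration only; they are not the settings used in the experiments, and the function and argument names are hypothetical.

```python
import numpy as np

def simulate_lasso(n=500, p=50, s=5, sigma=0.1, seed=0):
    """Generate a row-normalized design, a sparse parameter, and a noisy response."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((n, p))                  # assumed entry distribution
    D /= np.linalg.norm(D, axis=1, keepdims=True)    # normalize each row to unit norm
    x_true = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)   # uniformly drawn support
    x_true[support] = 1.0                            # assumed nonzero value
    c = D @ x_true + sigma * rng.standard_normal(n)  # response with Gaussian noise
    return D, c, x_true
```

Fixing the seed makes independent runs reproducible: averaging over runs then amounts to calling the function with different `seed` values.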
5.2 Group Lasso
The group Lasso problem can be formulated as
where is the number of groups; and is the parameter vector in each group. Let .
We set , , , and . is created as in the Lasso case. denotes the parameter vector in the th group. We let and for . Finally, with and . The iteration scheme is the same as in the Lasso case except that the th row of Algorithm 4 (see supplement) is replaced by
where (different from in updating ) and are the parameter and Lagrange multiplier for the th group, respectively. Here, with . The loss decay plot is in Figure 1(c). We see that SS-PRSM and SVRG-ADMM are the two fastest algorithms, with SS-PRSM being slightly better than SVRG-ADMM. We also note the slow convergence of SSPB-SCPRSM and S-ADMM.
5.3 Sparse logistic regression
Sparse logistic regression can be formulated as
We set , , and . The sample matrix is generated as follows: each sample point has twenty nonzero entries independently drawn from with the indices generated uniformly from 1 to ; is simulated as in the Lasso case and where and is computed entry-wise. For the regularization parameter we use the setting in  where . The intercept term is dropped in this setting for simplicity, which implies that , as assumed in our simulation setting. The iteration scheme is detailed in Algorithm 5 in the supplementary material. The loss decay plot is in Figure 1(d), which shows that our algorithm still outperforms the other algorithms. Note that SVRG-ADMM also performs well in this case without the need to set , , , ; however, these parameters are easy to set and SS-PRSM seems to be robust to their specification.
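For reference, the smooth part of a sparse logistic regression objective and its gradient can be computed as in the sketch below. It assumes labels coded as 0/1 and no intercept; the label coding and the function name are assumptions of this example, and the ℓ1 penalty would be handled separately by soft-thresholding.

```python
import numpy as np

def logistic_loss_and_grad(D, c, x):
    """Average logistic loss and gradient for labels c in {0, 1}, no intercept."""
    z = D @ x
    # log(1 + exp(z)) - c * z, computed stably via logaddexp
    loss = np.mean(np.logaddexp(0.0, z) - c * z)
    prob = 0.5 * (1.0 + np.tanh(0.5 * z))   # numerically stable sigmoid
    grad = D.T @ (prob - c) / len(c)
    return loss, grad
```

A stochastic method would evaluate the same expression on a single row of `D` to obtain the per-sample gradient.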
5.4 Real Data
We investigate four real datasets to illustrate the efficiency of our algorithm: i) the communities and crime data set (download from http://archive.ics.uci.edu/ml/datasets/communities+and+crime), ii) E2006-tfidf (download from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html), iii) sido (download from http://www.causality.inf.ethz.ch/data/SIDO.html), and iv) the NSL-KDD data set (details at http://www.unb.ca/cic/datasets/index.html; download from https://github.com/defcom17/NSL_KDD). Table 2 summarizes the different settings and parameters used. We briefly introduce the last data set, NSL-KDD. Recently, machine learning methods have been widely used in abnormal flow detection, which has increased dramatically with the upgrading of computer networks, hardware, and software. ADMM-based frameworks have good parallel performance and can effectively cope with large data (see ). We test the efficiency of our proposed algorithm (which is PRSM based). We fit a sparse logistic regression for KDD because is large and our response is a binary variable characterizing the connection type: normal (1), doubtful (0).
The loss decay plots are in Figure 2. SSPB-SCPRSM and S-ADMM always converge more slowly than the other stochastic algorithms. SVRG-ADMM and SCAS-ADMM perform similarly, while our SS-PRSM converges faster than both.
In this paper, we propose a new stochastic algorithm, SS-PRSM. The resulting algorithm has ergodic convergence rate, which matches the rate of many state-of-the-art variants of stochastic ADMM. We bridge the ergodic convergence rate gap between batch and stochastic algorithms for PRSM theoretically. Furthermore, we show that our algorithm outperforms ADMM-based algorithms when fitting statistical models such as Lasso, group Lasso, and sparse logistic regression.
The theoretical analysis of the convergence rate in the non-ergodic sense for SS-PRSM is still an open problem and is much more difficult than for an ADMM-based algorithm because of the adjustment of the iteration scheme. Meanwhile, we can also try to incorporate other tricks, such as conjugate gradient and Nesterov’s extrapolation, for the iteration of . We think they can also attain ergodic convergence rate. On the other hand, the effectiveness of PRSM in the nonconvex setting, for both batch and stochastic versions, remains unexplored. SVRG-ADMM has good convergence properties in the nonconvex setting, as shown in . However, a general theoretical convergence result for PRSM is still unknown, which is also an interesting direction for future work.
[1] Seyed Mojtaba Hosseini Bamakan, Huadong Wang, and Yong Shi. Ramp loss k-support vector classification-regression; a robust and sparse multi-class approach to the intrusion detection problem. Knowledge-Based Systems, 126:113–126, 2017.
[2] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
[3] Tony Fan C Chan and Roland Glowinski. Finite element approximation and iterative solution of a class of mildly non-linear elliptic equations. Computer Science Department, Stanford University Stanford, 1978.
[4] Damek Davis and Wotao Yin. Convergence rate analysis of several splitting schemes. In Splitting Methods in Communication, Imaging, Science, and Engineering, pages 115–163. Springer, 2016.
[5] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
[6] Wei Deng and Wotao Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889–916, 2016.
[7] Jim Douglas and Henry H Rachford. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American Mathematical Society, 82(2):421–439, 1956.
[8] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
[9] Roland Glowinski and A Marroco. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de dirichlet non linéaires. Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique, 9(R2):41–76, 1975.
[10] Yan Gu, Bo Jiang, and Deren Han. A semi-proximal-based strictly contractive peaceman-rachford splitting method. arXiv preprint arXiv:1506.02221, 2015.
[11] Elad Hazan, Alexander Rakhlin, and Peter L Bartlett. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, pages 65–72, 2008.
[12] Bingsheng He, Han Liu, Zhaoran Wang, and Xiaoming Yuan. A strictly contractive peaceman–rachford splitting method for convex programming. SIAM Journal on Optimization, 24(3):1011–1040, 2014.
[13] Bingsheng He and Xiaoming Yuan. On the o(1/n) convergence rate of the douglas–rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
[14] Mingyi Hong and Zhi-Quan Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162(1-2):165–199, 2017.
[15] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[16] Kwangmoo Koh, Seung-Jean Kim, and Stephen Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8(Jul):1519–1555, 2007.
[17] Huan Li and Zhouchen Lin. Optimal nonergodic convergence rate: When linearized adm meets nesterov’s extrapolation. arXiv preprint arXiv:1608.06366, 2016.
[18] Yuanyuan Liu, Fanhua Shang, and James Cheng. Accelerated variance reduced stochastic admm. In AAAI, pages 2287–2293, 2017.
[19] Renato DC Monteiro and Benar F Svaiter. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization, 23(1):475–507, 2013.
[20] Sen Na and Cho-Jui Hsieh. Sparse learning with semi-proximal-based strictly contractive peaceman-rachford splitting method. arXiv preprint arXiv:1612.09357, 2016.
[21] Hua Ouyang, Niao He, Long Tran, and Alexander Gray. Stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pages 80–88, 2013.
[22] Donald W Peaceman and Henry H Rachford, Jr. The numerical solution of parabolic and elliptic differential equations. Journal of the Society for Industrial and Applied Mathematics, 3(1):28–41, 1955.
[23] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
[24] Taiji Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In International Conference on Machine Learning, pages 736–744, 2014.
[25] Huahua Wang and Arindam Banerjee. Online alternating direction method (longer version). arXiv preprint arXiv:1306.3721, 2013.
[26] Shen-Yi Zhao, Wu-Jun Li, and Zhi-Hua Zhou. Scalable stochastic alternating direction method of multipliers. arXiv preprint arXiv:1502.03529, 2015.
[27] Shuai Zheng and James T Kwok. Fast-and-light stochastic admm. In IJCAI, pages 2407–2413, 2016.
[28] Wenliang Zhong and James Kwok. Fast stochastic alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 46–54, 2014.
[29] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
Stochastic Approaches for Peaceman-Rachford Splitting Method
Proof of Lemma 4.1
We only consider one more step after getting . Based on our Algorithm 3, , we have
Let’s deal with first. Note that