1 Introduction
We consider the following minimization problem characterized by a separable objective function with linear constraints:
s.t.  (1) 
where , , ; and are nonempty convex sets; and and are both convex functions. The function is assumed to be of the form , where is the sample size and is the loss incurred on the th sample. This setting is flexible enough to incorporate a number of problems arising in machine learning and statistics, including the Lasso, group Lasso, and logistic regression.
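For concreteness, the standard two-block template that such problems fit into can be written as follows. The symbols below are a generic reconstruction (the paper's exact notation did not survive in this copy), with a separable objective and a linear coupling constraint:

```latex
\min_{x \in \mathcal{X},\; y \in \mathcal{Y}} \; f(x) + g(y)
\quad \text{s.t.} \quad Ax + By = b,
\qquad \text{where } f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x).
```

Here $f_i$ is the loss on the $i$th sample, so $f$ is the empirical risk, and $g$ typically carries a regularizer such as the $\ell_1$ norm in the Lasso.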
Many algorithms have been developed over the last thirty years to solve the convex minimization problem (1), starting with the Peaceman–Rachford splitting method (PRSM) [22] and the Douglas–Rachford splitting method (DRSM) [7]. Applying DRSM to the dual of (1) yields the popular Alternating Direction Method of Multipliers (ADMM) [9, 8, 3, 2], which is fast in practice and easy to implement. Convergence rates for ADMM, in both the ergodic and nonergodic sense, have been studied recently. For example, [13, 19, 25] showed that ADMM has an O(1/N) ergodic rate, where N stands for the number of iterations, while [4] established that the O(1/√N) nonergodic convergence rate is tight for general DRSM. Applying Nesterov's extrapolation to accelerate ADMM yields an improved nonergodic convergence rate [17]. With additional assumptions on the objective function, such as strong convexity, the convergence rate can be further strengthened [6, 14].
In addition to these theoretical developments, many new variants of ADMM have appeared, including both batch and stochastic versions of the algorithm. [21] proposed a stochastic ADMM iteration scheme and showed its good performance on large-scale problems by using a first-order approximation to the Lagrangian. However, because of the noisy gradient and the inexact approximation to the stochastic function, stochastic ADMM can only attain an O(1/√N) ergodic convergence rate. Recently, a number of accelerated stochastic versions of ADMM that incorporate variance reduction techniques (see [15, 5, 26, 23]) were proposed with better convergence results: SDCA-ADMM [24], SAG-ADMM [28], SVRG-ADMM [27], and ASVRG-ADMM [18]. SVRG-ADMM and SAG-ADMM enjoy an O(1/N) ergodic convergence rate and a corresponding nonergodic convergence rate. ASVRG-ADMM, which makes each iteration step expensive and needs a large number of inner iterations, can attain an accelerated ergodic rate under general convexity assumptions and a linear convergence rate under the strong convexity assumption. These developments effectively removed the convergence rate gap between stochastic ADMM and batch ADMM.
On the other hand, the development of PRSM and its variants has not been as fast as that of DRSM. Although PRSM converges faster in experiments whenever it converges, the main difficulty for PRSM is that the sequence it generates is not strictly contractive with respect to the solution set of problem (1) [12]. [12] proposed a method, called strictly contractive PRSM (SCPRSM), to overcome this difficulty by attaching an underdetermined relaxation factor to the penalty parameter when updating the Lagrange multiplier. Since then, mirroring the evolution of DRSM, several new variants of SCPRSM have been developed. [10, 20] used two different relaxation factors and showed the flexibility of this setting. [12] showed that SCPRSM attains an O(1/N) worst-case convergence rate in both the ergodic and nonergodic sense. With the exception of [20], the development of stochastic algorithms based on PRSM is lacking, even though PRSM outperforms ADMM in many numerical experiments. [20] developed an algorithm called Stochastic Semi-Proximal-Based SCPRSM (SSPB-SCPRSM), which contains stochastic SCPRSM as a special case, but with a convergence rate of just O(1/√N) in the ergodic sense, the same as stochastic ADMM. Therefore, the gap between batch PRSM and stochastic PRSM still exists.
In this paper, we bridge this gap by developing a new accelerated stochastic algorithm based on SCPRSM, called Stochastic Scalable PRSM (SSPRSM). Compared to SCPRSM, we use two different relaxation factors to make the method more flexible. We borrow the general iteration structure from [10] and [20], but accelerate the iteration for . This adjustment helps us achieve an O(1/N) ergodic rate, improving on the O(1/√N) rate in [20] and matching ADMM-based stochastic algorithms. Finally, we illustrate superiority over ADMM-based stochastic algorithms in numerical experiments, mirroring the batch case. Our contributions in this paper are:

Theoretically, we prove an O(1/N) ergodic convergence rate for the proposed algorithm. This bridges the ergodic convergence rate gap between stochastic PRSM and batch PRSM (note that the nonergodic rate is still an open problem);

Compared with related stochastic ADMM based algorithms, we add two proximal terms in the iterations for and , which improves flexibility;

Compared with SVRG-ADMM [27], we only accelerate the iteration to obtain the same convergence rate;

Our algorithm is very flexible, leading to different new stochastic algorithms when its parameters are set properly.
The remainder of the paper is organized as follows. In section 2, we provide background, discuss related work in more detail, show the fundamental iteration schemes, and introduce the notation used throughout. In section 3, we introduce our algorithm. The theoretical convergence analysis is given in section 4. Extensive numerical experiments illustrating the performance are in section 5. Finally, section 6 concludes the paper and provides directions for future research.
Notations. Throughout the paper, we will use the following notation:

, define ; ;

; ; ; ;
2 Background
Modern data sets are getting ever larger, which drives the development of efficient and scalable optimization algorithms. One promising direction is the development of mini-batch and stochastic algorithms; a stochastic algorithm can be thought of as a special case of mini-batching in which a single sample point is used to update the parameters. Another direction is online algorithms that process data arriving in streams [29, 11, 25]. Our focus in this paper is the development of a scalable, stochastic algorithm for solving large-scale optimization problems of the form (1). In a stochastic algorithm, there is a trade-off between computation speed and convergence speed. When a single sample is randomly chosen to obtain an approximate descent direction, each iteration is fast but the convergence is slow. On the other hand, batch algorithms use all the samples to find the exact descent direction, which results in a faster convergence rate, but each iteration is slower and possibly infeasible at large scale. Our proposed algorithm balances the two aspects by first computing the batch gradient and then, within an iteration, adjusting the gradient direction based on the current sample. Our final algorithm is comparable with batch algorithms such as ADMM and SCPRSM in the ergodic sense. In addition, we add two pivotal proximal terms to make its implementation flexible.
We briefly introduce two fundamental stochastic iteration schemes that are useful for the development of our algorithm. The first scheme is used in stochastic ADMM [21], where the noisy gradient is used to approximate the augmented Lagrangian as:
Here, can be seen as the selector of the sample used to compute a subgradient of in the th iteration. In our setting, we can take it to be uniformly distributed over the discrete sample indices; more generally, one can sample from some fixed distribution. Furthermore, is the Lagrange multiplier, is the predefined penalty parameter, and is the time-varying step size. As the iteration number grows, we expect to find a triplet satisfying the optimality conditions of (1). Details of stochastic ADMM are given in Algorithm 1, which successively updates , , .
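The scheme above can be sketched in code. The block below is our toy instantiation of stochastic (linearized) ADMM on a Lasso-type consensus problem, not the exact scheme of [21]: each step uses one sample's gradient, a decaying step size, and a closed-form soft-threshold update.

```python
import numpy as np

def soft_threshold(v, t):
    """Closed-form proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def stochastic_admm(X, y, lam=0.1, rho=1.0, n_iter=200, seed=0):
    """Sketch of stochastic ADMM for the Lasso in consensus form:
        min (1/2n)||Xb - y||^2 + lam*||z||_1  s.t.  b - z = 0.
    Each outer step linearizes the loss with ONE sample's gradient
    (a noisy gradient) and uses a time-varying step size."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    b, z, u = np.zeros(d), np.zeros(d), np.zeros(d)  # u = scaled dual
    for k in range(1, n_iter + 1):
        i = rng.integers(n)                    # select one sample
        g = (X[i] @ b - y[i]) * X[i]           # noisy gradient at b
        eta = 0.1 / np.sqrt(k)                 # O(1/sqrt(k)) step size
        # exact minimizer of the linearized augmented Lagrangian
        # plus a (1/2eta)||. - b||^2 proximal term
        b = (b / eta + rho * (z - u) - g) / (1.0 / eta + rho)
        z = soft_threshold(b + u, lam / rho)   # exact z-update (prox)
        u = u + b - z                          # multiplier update
    return z
```

The decaying step size is what limits stochastic ADMM to the slower ergodic rate: the noisy gradient never vanishes, so the steps must shrink.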
The second iteration scheme is used in SSPB-SCPRSM [20] and is shown in Algorithm 2. The additional parameters make Algorithm 2 more flexible. The feasible ranges for and are and , respectively. Setting recovers stochastic SCPRSM, and when goes to , the algorithm reduces to stochastic PRSM. Different from stochastic ADMM, it creates an intermediate dual iterate between the updates of and , which always appears in PRSM-based algorithms. Without the additional relaxation parameters, this step can still make PRSM outperform ADMM in experiments, but it also results in the loss of the strict contraction property [12, 10].
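The intermediate dual update with relaxation factors can be illustrated on a toy problem where both subproblems have closed-form solutions. The quadratic instance below is our illustration (not the paper's scheme): the multiplier is updated twice per iteration, once between the two subproblems and once after, each time damped by a relaxation factor.

```python
def scprsm_1d(a, c, rho=1.0, alpha=0.9, beta=0.9, n_iter=100):
    """Sketch of strictly contractive PRSM with two relaxation
    factors (alpha, beta) on the toy problem
        min 0.5*(x-a)^2 + 0.5*(y-c)^2  s.t.  x - y = 0.
    alpha = beta = 1 recovers the original PRSM; alpha = beta in
    (0, 1) gives SCPRSM; alpha != beta is the two-factor variant."""
    x = y = lam = 0.0
    for _ in range(n_iter):
        x = (a + lam + rho * y) / (1.0 + rho)   # x-subproblem
        lam = lam - alpha * rho * (x - y)       # intermediate dual step
        y = (c - lam + rho * x) / (1.0 + rho)   # y-subproblem
        lam = lam - beta * rho * (x - y)        # final dual step
    return x, y, lam
```

On this instance the iterates converge to the consensus solution x = y = (a + c)/2; the relaxation factors damp the two dual steps that distinguish PRSM from ADMM (which performs only one, undamped, dual step).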
3 Stochastic Scalable PRSM
In this section, we describe our proposed stochastic scalable PRSM algorithm, which improves over SCPRSM. A gradient estimate based on a single sample has large variance, and one requires the time-varying step size to decay to zero in order to ensure convergence. The order of this decay is essential for obtaining a good convergence rate [20]. Here, we use a variance reduction trick (inspired by [15]) to incorporate information from all the samples in a stochastic setting. Instead of using a first-order approximation to the Lagrangian, we minimize the following augmented Lagrangian:
Compared to Algorithm 2, SSPRSM only adjusts the iteration for . Suppose at iteration we have . We define a gradient function associated with as
(2) 
where is a predefined matrix. Note that if , the proximal term plays the same role as the corresponding term in (2), but we do not need . Note that we could use a matrix that changes with the iterations, but for simplicity and ease of implementation we keep it fixed. In our algorithm, stochastic variance-reduced gradient descent is used to minimize . Algorithm 3 summarizes the steps used to update .
The iteration schemes for and are the same as in Algorithm 2 (lines 4–6). Note that using or for updating results in the same update. The term is crucial here because it involves , which collects the "information" obtained from the samples that are not selected in the stochastic update. With this choice we have that . Finally, we note that saving all the iterates from 0 to is not necessary, as an incremental vector can be used to obtain .
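The variance-reduced gradient described above follows the SVRG construction of [15]: the single-sample gradient is corrected by the same sample's gradient at a snapshot point plus the full batch gradient at that snapshot. The sketch below uses a toy finite sum of our own choosing to show the construction; the estimator is unbiased, and its magnitude shrinks as both iterate and snapshot approach the optimum, which is why no decaying step size is needed.

```python
import numpy as np

def svrg_gradient(grad_i, x, x_snap, mu_snap, i):
    """Variance-reduced gradient estimate:
        g = grad_i(x) - grad_i(x_snap) + mu_snap,
    where mu_snap is the full batch gradient at the snapshot x_snap.
    E_i[g] equals the true gradient at x (unbiased)."""
    return grad_i(x, i) - grad_i(x_snap, i) + mu_snap

# Toy finite sum: f(x) = (1/n) sum_i 0.5*(x - a_i)^2
a = np.array([1.0, 2.0, 3.0, 6.0])

def grad_i(x, i):
    return x - a[i]

x_snap = 0.0                        # snapshot point
mu_snap = np.mean(x_snap - a)       # full batch gradient at snapshot
g = svrg_gradient(grad_i, 1.5, x_snap, mu_snap, 2)
```

Averaging the estimator over all sample indices recovers the exact batch gradient, which is the unbiasedness property the convergence analysis relies on.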
Computationally, one iteration of SSPRSM is slower than one iteration of Algorithm 1 or 2, because we still need to compute the batch gradient. However, the inner iteration is fast and comparable with that of ADMM-based algorithms [28, 27, 18], while the overall number of iterations is lower. Also, our algorithm shows that accelerating only one variable is enough to attain the O(1/N) rate, without also involving and the Lagrange multiplier . In the following section, we present our main theoretical result.
4 Convergence Analysis
In this section, we study the convergence rate of SSPRSM and establish an O(1/N) ergodic rate. To measure the convergence rate, we use the following criterion function, which comes from a variational inequality: . We will show that, with and , we have
(3) 
The above criterion is equivalent to for , which was used in [21, 26, 18]. Furthermore, since our algorithm only modifies the updating scheme for , we borrow part of the arguments given in [20]. We start by listing our fairly mild assumptions:
A1 and are two convex functions with .
A2 For all , is smooth: . Furthermore, we let .
A3 We assume . Otherwise, one can simply restrict the minimization to a large enough but bounded set.
Note that the assumptions are quite general. In particular, we do not need smoothness of . A3 is necessary for all ADMM and PRSM methods to guarantee convergence.
Next, we present two lemmas that will be used to prove the main result. The first lemma measures the decrease of in one step.
Lemma 4.1.
For fixed , suppose we have (so ); then, after obtaining by running SSPRSM for one step, we have, for all ,
where .
Let us focus on the term on the left-hand side; it is an approximation to . To show the lower bound on the decrease of , we need an upper bound for this expectation. The following lemma provides it.
Lemma 4.2.
Define for , then we have
Based on these two lemmas, we are ready to present our main result. The inner-iteration step size and the number of inner iterations are pivotal, however. We will show in the proof that we can simply set and for some constant to attain global convergence. Because converges to 0 as increases, fewer and fewer inner iterations are needed as the algorithm progresses.
Theorem 4.3.
Let the sequence be generated by SSPRSM. Under the above setting for and , and with and , we have
Furthermore, for every , we have
where is the true solution.
We give a brief proof sketch. First, because our iteration schemes for and are the same as in [20], we modify their results to fit our algorithm. Combining part of their results with our Lemma 4.1, we obtain the following argument, which measures the total decrease of the criterion function. For self-containedness, a proof of this argument is given in the supplementary material.
Argument: Let the sequence be generated by SSPRSM. If and , then we have
where , for , and .
Note that . Summing from to and taking expectations on both sides, we can show
Thus, under our setting for and , we obtain the main theorem. For the second part of the theorem, because is random in the above inequality, we take to be , where for any . Plugging into the above inequality and using the following fact,
we obtain the convergence order of SSPRSM under the second criterion. The detailed proof is provided in the supplementary material.
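The "fact" invoked above is presumably the standard convexity (Jensen) bound for an ergodic average, which in a generic notation (ours, not the paper's) reads:

```latex
\bar{w}_N := \frac{1}{N}\sum_{k=1}^{N} w_k
\quad\Longrightarrow\quad
F(\bar{w}_N) \;\le\; \frac{1}{N}\sum_{k=1}^{N} F(w_k)
\qquad \text{for any convex } F.
```

This is what lets a bound on the sum of per-iteration criterion values be transferred to the averaged iterate, yielding the ergodic rate.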
From the above theorem, we see that SSPRSM has an O(1/N) ergodic convergence rate. Thus, we have shown that by adjusting only the iteration for , we gain the same convergence rate as most ADMM-based stochastic algorithms, such as SVRG-ADMM (see [27]).
5 Numerical Experiments
In this section, we run extensive numerical experiments to test the performance of our algorithm. Since the goal is to bridge the convergence rate gap between stochastic and batch algorithms based on PRSM, we also use SCPRSM as one of the competitors. Specifically, we use the following algorithms for comparison:
We use both simulated and real data and focus on three optimization problems arising in machine learning: the Lasso, the group Lasso, and sparse logistic regression. These problems can easily be formulated as optimization programs of the form (1), as we show below. We try a wide range of settings for our parameters. We use to denote the sample matrix, where is the number of predictors; is the response vector; and denote one row of and the corresponding entry of , respectively; is the target parameter; and is the number of nonzero coefficients. Simulation results are averaged over independent runs.
5.1 Lasso
The Lasso problem can be formulated as
s.t. 
We set , , and . We construct by drawing each entry independently from and then normalizing each row. The parameter is generated by uniformly drawing indices from 1 to and setting . Finally, , where . The regularization parameter is set as . Since is sparse, we assume when setting and . The settings for the other parameters are tabulated in Table 1. We only consider setting , while can be set arbitrarily; other forms of might make the subproblem for hard to solve. The detailed iteration scheme is given in Algorithm 4 in the supplementary material. The loss decay plots are shown in Figures 1(a) and 1(b). From the plots, we see that SADMM and SSPB-SCPRSM require many more passes over the data. Our SSPRSM is competitive with the state-of-the-art stochastic ADMM-based algorithms.
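The simulated design described above can be sketched as follows. The exact row normalization and signal values were lost in this copy, so the sketch assumes unit-l2-norm rows and +/-1 signal entries; treat both as illustrative assumptions.

```python
import numpy as np

def make_lasso_data(n=100, d=500, s=10, sigma=0.1, seed=0):
    """Simulated Lasso data in the spirit of Section 5.1.
    Assumptions (ours): rows normalized to unit l2 norm and the s
    active coefficients set to +/-1."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))               # N(0,1) entries
    X /= np.linalg.norm(X, axis=1, keepdims=True) # row normalization
    beta = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)  # s active indices
    beta[support] = rng.choice([-1.0, 1.0], size=s)
    y = X @ beta + sigma * rng.standard_normal(n)   # noisy response
    return X, y, beta
```

Any of the compared algorithms can then be run on the returned pair (X, y), with beta kept aside as the ground truth for evaluating recovery.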
Model  

Lasso  1  0.9  0.1  
Lasso  1  0.9  0.1  
group Lasso  1  0.9  0.1  
sparse logistic  1  0.5  0.3 
5.2 Group Lasso
The group Lasso problem can be formulated as
s.t. 
where is the number of groups and is the parameter vector in each group. Let .
We set , , , and . is created as in the Lasso case. denotes the parameter vector in the th group. We let and for . Finally, with and . The iteration scheme is the same as in the Lasso case except that the th row of Algorithm 4 (see supplement) is replaced by
where (different from the used in updating ) and are the parameter and Lagrange multiplier for the th group, respectively. Here, with . The loss decay plot is in Figure 1(c). We see that SSPRSM and SVRG-ADMM are the two fastest algorithms, with SSPRSM slightly better than SVRG-ADMM. We also note the slow convergence of SSPB-SCPRSM and SADMM.
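The group-wise update replacing the scalar soft-threshold is the proximal operator of the group norm, which shrinks each block toward zero as a whole. A minimal sketch of this block shrinkage:

```python
import numpy as np

def group_soft_threshold(v, t):
    """Proximal operator of t*||.||_2 applied to one group: shrink
    the block's norm by t, zeroing the block if its norm is below t.
    This is the closed-form group-wise update in the group Lasso."""
    nrm = np.linalg.norm(v)
    if nrm <= t:
        return np.zeros_like(v)
    return (1.0 - t / nrm) * v
```

Applied group by group, this either kills a whole group of coefficients or shrinks it uniformly, which is exactly the structured sparsity the group Lasso is designed to produce.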
5.3 Sparse logistic regression
Sparse logistic regression can be formulated as
s.t. 
We set , , and . The sample matrix is generated as follows: each sample point has twenty nonzero entries independently drawn from , with indices generated uniformly from 1 to ; is simulated as in the Lasso case; and , where and is computed entrywise. For the regularization parameter we use the setting in [16], where . The intercept term is dropped for simplicity, which implies that , as assumed in our simulation setting. The iteration scheme is detailed in Algorithm 5 in the supplementary material. The loss decay plot is in Figure 1(d), which shows that our algorithm still outperforms the others. Note that SVRG-ADMM also performs well without having to set , , , in this case; however, these parameters are easy to set, and SSPRSM appears to be robust to their specification.
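The smooth part of the objective in this experiment is the average logistic loss, whose value and gradient are what the stochastic and batch updates consume; the l1 penalty is handled separately by the proximal update. A minimal sketch with labels in {-1, +1}:

```python
import numpy as np

def logistic_loss_grad(w, X, y):
    """Average logistic loss and its gradient for labels y in {-1,+1}.
    The l1 regularizer is NOT included here; in the splitting methods
    of this paper it is handled by the other block's proximal update."""
    z = y * (X @ w)                                   # margins
    loss = np.mean(np.log1p(np.exp(-z)))              # stable log(1+e^-z)
    grad = X.T @ (-y / (1.0 + np.exp(z))) / len(y)    # d(loss)/dw
    return loss, grad
```

A single-sample version of the same gradient (one row of X, one label) is what the variance-reduced inner loop would call.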
5.4 Real Data
We investigate four real data sets to illustrate the efficiency of our algorithm: i) the communities and crime data set (http://archive.ics.uci.edu/ml/datasets/communities+and+crime); ii) E2006-tfidf (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html); iii) sido (http://www.causality.inf.ethz.ch/data/SIDO.html); and iv) the NSL-KDD data set (details at http://www.unb.ca/cic/datasets/index.html; download from https://github.com/defcom17/NSL_KDD). Table 2 summarizes the settings and parameters used. We briefly introduce the NSL-KDD data. Recently, machine learning methods have been widely used in abnormal network-flow detection, where data volumes have increased dramatically with upgrades to computer networks, hardware, and software. ADMM-based frameworks have good parallel performance for coping with such large data (see [1]), and we test the efficiency of our proposed PRSM-based algorithm in this setting. We fit a sparse logistic regression for NSL-KDD because is large and the response is a binary variable characterizing the connection type: normal (1) or doubtful (0).
The loss decay plots are in Figure 2. SSPB-SCPRSM and SADMM always converge more slowly than the other stochastic algorithms. SVRG-ADMM and SCAS-ADMM are similar, while our SSPRSM converges faster than both.
Data  Model  
crime  Lasso  1994  122  0.02  5  0.8  0.3  
E2006-tfidf  Lasso  16,087  150,360  0.0001  0.8  0.9  0.1  
sido  Sparse Logistic  12,678  4,932  0.01  1  0.5  0.3  
NSL-KDD  Sparse Logistic  125,973  115  0.01  1  0.9  0.3
6 Conclusion
In this paper, we propose a new stochastic algorithm, SSPRSM. The resulting algorithm has an O(1/N) ergodic convergence rate, which matches the rate of many state-of-the-art variants of stochastic ADMM. We thereby bridge the ergodic convergence rate gap between batch and stochastic algorithms for PRSM. Furthermore, we show that our algorithm outperforms ADMM-based algorithms when fitting statistical models such as the Lasso, the group Lasso, and sparse logistic regression.
The theoretical analysis of the convergence rate in the nonergodic sense for SSPRSM is still an open problem, and it is considerably more difficult than for ADMM-based algorithms because of the adjustment of the iteration scheme. Meanwhile, one could also incorporate other techniques, such as conjugate gradient and Nesterov's extrapolation, into the iteration for ; we believe they can also attain an O(1/N) ergodic convergence rate. On the other hand, the effectiveness of PRSM in the nonconvex setting, for both the batch and stochastic versions, remains unexplored. SVRG-ADMM has good convergence properties in the nonconvex setting, as shown in [27]. However, a general theoretical convergence result for PRSM is still unknown, which is also an interesting direction.
References
 [1] Seyed Mojtaba Hosseini Bamakan, Huadong Wang, and Yong Shi. Ramp loss K-support vector classification-regression; a robust and sparse multi-class approach to the intrusion detection problem. Knowledge-Based Systems, 126:113–126, 2017.
 [2] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
 [3] Tony Fan C Chan and Roland Glowinski. Finite element approximation and iterative solution of a class of mildly nonlinear elliptic equations. Computer Science Department, Stanford University Stanford, 1978.
 [4] Damek Davis and Wotao Yin. Convergence rate analysis of several splitting schemes. In Splitting Methods in Communication, Imaging, Science, and Engineering, pages 115–163. Springer, 2016.
 [5] Aaron Defazio, Francis Bach, and Simon LacosteJulien. Saga: A fast incremental gradient method with support for nonstrongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
 [6] Wei Deng and Wotao Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889–916, 2016.
 [7] Jim Douglas and Henry H Rachford. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American Mathematical Society, 82(2):421–439, 1956.
 [8] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
 [9] Roland Glowinski and A Marroco. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisationdualité d’une classe de problèmes de dirichlet non linéaires. Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique, 9(R2):41–76, 1975.
 [10] Yan Gu, Bo Jiang, and Deren Han. A semi-proximal-based strictly contractive Peaceman–Rachford splitting method. arXiv preprint arXiv:1506.02221, 2015.
 [11] Elad Hazan, Alexander Rakhlin, and Peter L Bartlett. Adaptive online gradient descent. In Advances in Neural Information Processing Systems, pages 65–72, 2008.
 [12] Bingsheng He, Han Liu, Zhaoran Wang, and Xiaoming Yuan. A strictly contractive Peaceman–Rachford splitting method for convex programming. SIAM Journal on Optimization, 24(3):1011–1040, 2014.
 [13] Bingsheng He and Xiaoming Yuan. On the O(1/n) convergence rate of the Douglas–Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.
 [14] Mingyi Hong and Zhi-Quan Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162(1–2):165–199, 2017.

 [15] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
 [16] Kwangmoo Koh, Seung-Jean Kim, and Stephen Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8(Jul):1519–1555, 2007.
 [17] Huan Li and Zhouchen Lin. Optimal nonergodic convergence rate: When linearized ADM meets Nesterov's extrapolation. arXiv preprint arXiv:1608.06366, 2016.
 [18] Yuanyuan Liu, Fanhua Shang, and James Cheng. Accelerated variance reduced stochastic ADMM. In AAAI, pages 2287–2293, 2017.
 [19] Renato DC Monteiro and Benar F Svaiter. Iterationcomplexity of blockdecomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization, 23(1):475–507, 2013.
 [20] Sen Na and Cho-Jui Hsieh. Sparse learning with semi-proximal-based strictly contractive Peaceman–Rachford splitting method. arXiv preprint arXiv:1612.09357, 2016.
 [21] Hua Ouyang, Niao He, Long Tran, and Alexander Gray. Stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pages 80–88, 2013.
 [22] Donald W Peaceman and Henry H Rachford, Jr. The numerical solution of parabolic and elliptic differential equations. Journal of the Society for Industrial and Applied Mathematics, 3(1):28–41, 1955.
 [23] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1–2):83–112, 2017.
 [24] Taiji Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In International Conference on Machine Learning, pages 736–744, 2014.
 [25] Huahua Wang and Arindam Banerjee. Online alternating direction method (longer version). arXiv preprint arXiv:1306.3721, 2013.
 [26] ShenYi Zhao, WuJun Li, and ZhiHua Zhou. Scalable stochastic alternating direction method of multipliers. arXiv preprint arXiv:1502.03529, 2015.
 [27] Shuai Zheng and James T Kwok. Fast-and-light stochastic ADMM. In IJCAI, pages 2407–2413, 2016.
 [28] Wenliang Zhong and James Kwok. Fast stochastic alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 46–54, 2014.
 [29] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.
Supplementary Materials:
Stochastic Approaches for Peaceman–Rachford Splitting Method
Proof of Lemma 4.1
We only consider one more step after obtaining . Based on our Algorithm 3, for all , we have
(1) 
Let us first deal with the term . Note that