Introduction
In this paper, we consider a class of composite convex optimization problems
(1) $\min_{x} \; f(x) + h(Ax),$
where $A$ is a given matrix, $f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, each $f_i$ is a convex function, and $h$ is convex but possibly nonsmooth. For $h$, we are interested in sparsity-inducing regularizers, e.g. the $\ell_1$-norm, group Lasso and nuclear norm. When $A$ is an identity matrix, i.e. $A = I$, formulation (1) arises in many places in machine learning, statistics, and operations research
[Bubeck2015], such as logistic regression, Lasso and support vector machine (SVM). We mainly focus on the large sample regime. In this regime, even first-order batch methods, e.g. FISTA [Beck and Teboulle2009], become computationally burdensome because their per-iteration cost grows linearly with the number of samples $n$. As a result, stochastic gradient descent (SGD), whose per-iteration cost is independent of $n$, has witnessed tremendous progress in recent years. In particular, a number of stochastic variance reduced gradient methods such as SAG [Roux, Schmidt, and Bach2012], SDCA [ShalevShwartz and Zhang2013] and SVRG [Johnson and Zhang2013] have been proposed to address the high variance of the gradient estimate in ordinary SGD, resulting in a linear convergence rate (for strongly convex problems) as opposed to the sublinear rates of SGD. More recently, Nesterov's acceleration technique [Nesterov2004] was introduced in [AllenZhu2016, Hien et al.2016] to further speed up the stochastic variance-reduced algorithms, which results in the best known convergence rates for both strongly convex and general convex problems. This motivates us to integrate the momentum acceleration trick into the stochastic alternating direction method of multipliers (ADMM) below.

When $A$ is a more general matrix, i.e. $A \neq I$, formulation (1) covers many more complicated problems arising in machine learning, e.g. graph-guided fused Lasso [Kim, Sohn, and Xing2009] and generalized Lasso [Tibshirani and Taylor2011]. To solve this class of composite optimization problems with an auxiliary variable $y = Ax$, which is a special case of the general ADMM form,
(2) $\min_{x,\,y} \; f(x) + h(y), \quad \text{s.t.} \; Ax + By = c,$
the ADMM is an effective optimization tool [Boyd et al.2011], and has shown attractive performance in a wide range of real-world problems, such as big data classification [Nie et al.2014]. To tackle the high per-iteration complexity of batch (deterministic) ADMM (a popular first-order optimization method), [Wang and Banerjee2012], [Suzuki2013] and [Ouyang et al.2013] proposed online or stochastic ADMM algorithms. However, all these variants only achieve convergence rates of $O(1/\sqrt{T})$ for general convex problems and $O(\log T/T)$ for strongly convex problems, respectively, as compared with the $O(1/T^2)$ and linear convergence rates of accelerated batch algorithms [Nesterov1983], e.g. FISTA, where $T$ is the number of iterations. By now several accelerated and faster converging versions of stochastic ADMM, all based on variance reduction techniques, have been proposed, e.g. SAG-ADMM [Zhong and Kwok2014b], SDCA-ADMM [Suzuki2014] and SVRG-ADMM [Zheng and Kwok2016]. With regard to strongly convex problems, [Suzuki2014] and [Zheng and Kwok2016] proved that linear convergence can be obtained for the special ADMM form (i.e. $B = -I$ and $c = 0$) and the general ADMM form, respectively. In SAG-ADMM and SVRG-ADMM, an $O(1/T)$ convergence rate can be guaranteed for general convex problems, which implies that there still remains a gap in convergence rates between stochastic ADMM and accelerated batch algorithms.
To bridge this gap, we integrate the momentum acceleration trick in [Tseng2010] for deterministic optimization into the stochastic variance reduced gradient (SVRG) based stochastic ADMM (SVRG-ADMM). Naturally, the proposed method has the same low per-iteration time complexity as existing stochastic ADMM algorithms, and does not require the storage of all gradients (or dual variables), as in SCAS-ADMM [Zhao, Li, and Zhou2015] and SVRG-ADMM [Zheng and Kwok2016], as shown in Table 1. We summarize our main contributions below.

We propose an accelerated variance reduced stochastic ADMM (ASVRG-ADMM) method, which integrates both the momentum acceleration trick in [Tseng2010] for batch optimization and the variance reduction technique of SVRG [Johnson and Zhang2013].

We prove that ASVRG-ADMM achieves a linear convergence rate for strongly convex problems, which is consistent with the best known result in SDCA-ADMM [Suzuki2014] and SVRG-ADMM [Zheng and Kwok2016].

We also prove that ASVRG-ADMM has a convergence rate of $O(1/T^2)$ for non-strongly convex problems, which is a factor of $T$ faster than SAG-ADMM and SVRG-ADMM, whose convergence rates are $O(1/T)$.

Our experimental results further verify that our ASVRG-ADMM method performs much better than the state-of-the-art stochastic ADMM methods.
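Table 1
Method | General convex | Strongly convex | Space requirement
SAG-ADMM | $O(1/T)$ | unknown |
SDCA-ADMM | unknown | linear rate |
SCAS-ADMM | $O(1/T)$ | $O(1/T)$ |
SVRG-ADMM | $O(1/T)$ | linear rate |
ASVRG-ADMM | $O(1/T^2)$ | linear rate |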
Related Work
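Introducing $y = Ax$, problem (1) becomes
(3) $\min_{x,\,y} \; f(x) + h(y), \quad \text{s.t.} \; Ax - y = 0.$
Although (3) is only a special case of the general ADMM form (2), obtained when $B = -I$ and $c = 0$, the stochastic (or online) ADMM algorithms and theoretical results in [Wang and Banerjee2012, Ouyang et al.2013, Zhong and Kwok2014b, Zheng and Kwok2016] and this paper are all for the more general problem (2). To minimize (2), together with the dual variable $\lambda$, the update steps of batch ADMM are
(4) $y_k = \arg\min_{y} \; h(y) + \frac{\beta}{2}\|Ax_{k-1} + By - c + \lambda_{k-1}\|^2,$
(5) $x_k = \arg\min_{x} \; f(x) + \frac{\beta}{2}\|Ax + By_k - c + \lambda_{k-1}\|^2,$
(6) $\lambda_k = \lambda_{k-1} + Ax_k + By_k - c,$
where $\beta > 0$ is a penalty parameter.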
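To make the three batch ADMM steps concrete, the following minimal Python sketch applies them to a toy scalar problem, minimize 0.5(x - a)^2 + lam|y| subject to x - y = 0, for which every subproblem has a closed form. The problem instance, the function names and the use of a scaled dual variable `u` are assumptions made purely for illustration; they are not taken from the experiments in this paper.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |.| (scalar soft-thresholding)."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def batch_admm_scalar(a, lam, beta=1.0, iters=200):
    """Toy batch ADMM for: min 0.5*(x - a)**2 + lam*|y|  s.t.  x - y = 0.

    Illustrative assumptions: scalar variables, the constraint x - y = 0,
    and a scaled dual variable u; beta is the penalty parameter.
    """
    x, y, u = 0.0, 0.0, 0.0
    for _ in range(iters):
        # y-step: minimise lam*|y| + (beta/2)*(x - y + u)**2
        y = soft_threshold(x + u, lam / beta)
        # x-step: minimise 0.5*(x - a)**2 + (beta/2)*(x - y + u)**2
        x = (a + beta * (y - u)) / (1.0 + beta)
        # dual step: accumulate the constraint residual
        u = u + x - y
    return x, y, u


x, y, u = batch_admm_scalar(a=3.0, lam=1.0)
print(x, y)  # both approach soft_threshold(3.0, 1.0) = 2.0
```

In this toy case the x-step is cheap, but for a finite-sum loss over n samples it requires a pass over all the data, which is the cost the stochastic variants aim to avoid.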
To extend the batch ADMM to the online and stochastic settings, the update steps for $y$ and $\lambda$ remain unchanged. In [Wang and Banerjee2012, Ouyang et al.2013], the update step of $x$ is approximated as follows:
(7) 
where we draw $i_k$ uniformly at random from $\{1, \ldots, n\}$, $\eta_k$ is the step-size, and $\|z\|_G^2 = z^\top G z$ with a given positive semidefinite matrix $G$ (see, e.g., [Ouyang et al.2013]). Analogous to SGD, the stochastic ADMM variants use an unbiased estimate of the gradient at each iteration. However, all those algorithms have much slower convergence rates than their batch counterpart, as mentioned above. This barrier is mainly due to the variance introduced by the stochasticity of the gradients. Besides, to guarantee convergence, they employ a decaying sequence of step sizes $\eta_k$, which in turn impacts the rates.

More recently, a number of variance reduced stochastic ADMM methods (e.g. SAG-ADMM, SDCA-ADMM and SVRG-ADMM) have been proposed and have made exciting progress, such as linear convergence rates. SVRG-ADMM in [Zheng and Kwok2016] is particularly attractive here because of its low storage requirement compared with the algorithms in [Zhong and Kwok2014b, Suzuki2014]. Within each epoch of SVRG-ADMM, the full gradient $\nabla f(\tilde{x})$ is first computed, where $\tilde{x}$ is the average point of the previous epoch. Then the stochastic gradient and the step-size in (7) are replaced by
(8) $\tilde{\nabla} f_{I_k}(x_{k-1}) = \frac{1}{|I_k|}\sum_{i \in I_k}\big(\nabla f_i(x_{k-1}) - \nabla f_i(\tilde{x})\big) + \nabla f(\tilde{x})$
and a constant step-size $\eta$, respectively, where $I_k$ is a mini-batch of size $b$ (which is a useful technique to reduce the variance). In fact, $\tilde{\nabla} f_{I_k}(x_{k-1})$ is an unbiased estimator of the gradient $\nabla f(x_{k-1})$, i.e. $\mathbb{E}[\tilde{\nabla} f_{I_k}(x_{k-1})] = \nabla f(x_{k-1})$.
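The unbiasedness of the SVRG-style gradient estimator can be checked numerically. The sketch below is a small pure-Python illustration that assumes least-squares components f_i(x) = 0.5(a_i·x - b_i)^2 and made-up data; the helper names (`grad_i`, `full_grad`, `svrg_grad`) are hypothetical and chosen only for this example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def grad_i(x, a_i, b_i):
    """Gradient of the i-th least-squares term 0.5 * (a_i . x - b_i)**2."""
    r = dot(a_i, x) - b_i
    return [r * a for a in a_i]


def full_grad(x, A, b):
    """Average of all component gradients."""
    n = len(A)
    g = [0.0] * len(x)
    for a_i, b_i in zip(A, b):
        g = [gj + cj / n for gj, cj in zip(g, grad_i(x, a_i, b_i))]
    return g


def svrg_grad(x, x_snap, mu_snap, A, b, batch):
    """Variance-reduced estimate: mini-batch average of
    grad_i(x) - grad_i(x_snap), plus the snapshot full gradient mu_snap."""
    g = list(mu_snap)
    for i in batch:
        gi = grad_i(x, A[i], b[i])
        gs = grad_i(x_snap, A[i], b[i])
        g = [gj + (p - q) / len(batch) for gj, p, q in zip(g, gi, gs)]
    return g


# Made-up data, for illustration only.
A = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0], [-1.0, 1.0]]
b = [1.0, 0.0, 2.0, -1.0]
x_snap, x = [0.0, 0.0], [0.3, -0.2]
mu_snap = full_grad(x_snap, A, b)

# Averaging the estimator over every possible single index recovers the
# full gradient at x, which is exactly the unbiasedness property.
estimates = [svrg_grad(x, x_snap, mu_snap, A, b, [i]) for i in range(len(A))]
mean_est = [sum(col) / len(A) for col in zip(*estimates)]
print(mean_est, full_grad(x, A, b))
```

Note also that when `x` equals the snapshot point the correction term vanishes for every index, so the estimator has zero variance there; this is the intuition for why a constant step-size becomes admissible.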
Accelerated Variance Reduced Stochastic ADMM
In this section, we design an accelerated variance reduced stochastic ADMM method for both strongly convex and general convex problems. We first make the following assumptions: each convex $f_i$ is $L_i$-smooth, i.e. there exists a constant $L_i > 0$ such that $\|\nabla f_i(x) - \nabla f_i(y)\| \le L_i \|x - y\|$, $\forall x, y$, and $L := \max_i L_i$; $f$ is $\mu$-strongly convex, i.e. there is $\mu > 0$ such that $f(x) \ge f(y) + \nabla f(y)^\top (x - y) + \frac{\mu}{2}\|x - y\|^2$ for all $x, y$; the matrix $A$ has full row rank. The first two assumptions are common in the analysis of first-order optimization methods, while the last one has been used in the convergence analysis of batch ADMM [shang:rpca, Nishihara et al.2015, Deng and Yin2016] and stochastic ADMM [Zheng and Kwok2016].
The Strongly Convex Case
In this part, we consider the case of (2) when each $f_i$ is convex and smooth, and $f$ is strongly convex. Recall that this class of problems includes graph-guided logistic regression and SVM as notable examples. To efficiently solve this class of problems, we incorporate both the momentum acceleration and variance reduction techniques into stochastic ADMM. Our algorithm is divided into epochs, and each epoch consists of $m$ stochastic updates, where $m$ is usually chosen to be $\Theta(n)$ as in [Johnson and Zhang2013].
Let $z$ be an important auxiliary variable; its update rule is given as follows. Similar to [Zhong and Kwok2014b, Zheng and Kwok2016], we also use the inexact Uzawa method [Zhang, Burger, and Osher2011] to approximate the subproblem (7), which avoids computing a matrix inverse. Moreover, the momentum weight $\theta_s$ (whose update rule is provided below) is introduced into the proximal term similar to that of (7), and then the subproblem with respect to $z$ is formulated as follows:
(9) 
where $\tilde{\nabla} f_{I_k}$ is defined in (8), , and with to ensure that similar to [Zheng and Kwok2016], where $\|\cdot\|_2$ is the spectral norm, i.e. the largest singular value of the matrix. Furthermore, the update rule for $x$ is given by
(10) 
where is the key momentum term (similar to those in accelerated batch methods [Nesterov2004]), which helps accelerate our algorithm by using the iterate of the previous epoch, i.e. . Similar to , . Moreover, $\theta_s$ can be set to a constant in all epochs of our algorithm, which must satisfy , where , and is defined below. The optimal value of $\theta$ is provided in Proposition 1 below. The detailed procedure is shown in Algorithm 1, where we adopt the same initialization technique for as in [Zheng and Kwok2016], and is the pseudo-inverse. Note that, when $\theta_s = 1$, ASVRG-ADMM degenerates to SVRG-ADMM in [Zheng and Kwok2016].
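Since the proximal parameter above is tied to a spectral norm, it can be useful to estimate the largest singular value of a matrix without a full singular value decomposition. The following is a minimal sketch using power iteration on A^T A in pure Python; the function name `spectral_norm` and the fixed iteration count are assumptions for illustration, and the uniform starting vector would fail in the unlucky case where it is orthogonal to the leading right singular vector.

```python
import math


def spectral_norm(A, iters=200):
    """Estimate the largest singular value of A (a list of rows)
    by power iteration on A^T A."""
    rows, cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        # w = A^T (A v)
        Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(A[i][j] * Av[i] for i in range(rows)) for j in range(cols)]
        norm_w = math.sqrt(sum(t * t for t in w))
        if norm_w == 0.0:
            return 0.0  # the current vector lies in the null space of A
        v = [t / norm_w for t in w]
    # ||A v|| for the (approximately) leading right singular vector v
    Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(t * t for t in Av))


print(spectral_norm([[3.0, 0.0], [0.0, 1.0]]))  # close to 3.0
```

For large sparse matrices one would normally rely on a numerical library instead; the sketch only shows that the quantity is cheap to approximate with matrix-vector products.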
The Non-Strongly Convex Case
In this part, we consider general convex problems of the form (2) when each $f_i$ is convex and smooth, and $h$ is not necessarily strongly convex (but possibly nonsmooth). Different from the strongly convex case, the momentum weight $\theta_s$ is required to satisfy the following inequalities:
(11) 
where is a decreasing function with respect to the mini-batch size $b$. The condition (11) allows the momentum weight to decrease, but not too fast, similar to the requirement on the step-size in classical SGD and stochastic ADMM [tseng:sgd]. Unlike batch acceleration methods, the weight must satisfy both inequalities in (11).
Motivated by the momentum acceleration techniques in [Tseng2010, Nesterov2004] for batch optimization, we give the update rule of the weight for the mini-batch case:
(12) 
For the special case of , we have and , while (i.e. batch version), and . Since is decreasing, then is satisfied. The detailed procedure is shown in Algorithm 2, which differs slightly from Algorithm 1 in the initialization and output of each epoch. In addition, the key difference between them is the update rule for the momentum weight $\theta_s$. That is, $\theta_s$ in Algorithm 1 can be set to a constant, while that in Algorithm 2 is adaptively adjusted as in (12).
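To illustrate how an adaptively decreasing momentum weight behaves, the sketch below generates a weight sequence with the classical recursion used in accelerated batch methods such as [Tseng2010], in which each new weight is the positive root of theta^2 = theta_prev^2 (1 - theta). Treating this recursion as the form of update (12), and starting from `theta0 = 1.0`, are assumptions made for illustration; the function names are hypothetical.

```python
import math


def next_theta(theta_prev):
    """Positive root of theta**2 = theta_prev**2 * (1 - theta)."""
    t2 = theta_prev ** 2
    return (math.sqrt(t2 * t2 + 4.0 * t2) - t2) / 2.0


def theta_schedule(theta0, num_epochs):
    """Return [theta_0, theta_1, ..., theta_num_epochs]."""
    thetas = [theta0]
    for _ in range(num_epochs):
        thetas.append(next_theta(thetas[-1]))
    return thetas


# theta0 = 1.0 is a placeholder starting value for this illustration.
thetas = theta_schedule(theta0=1.0, num_epochs=50)
for s in (0, 1, 10, 50):
    print(s, thetas[s], 2.0 / (s + 2))
```

By construction, consecutive weights satisfy (1 - theta_s) / theta_s^2 = 1 / theta_{s-1}^2, so the weights decrease but no faster than roughly 2/(s + 2). This "decreasing, but not too fast" behaviour is what yields squared-inverse rates in accelerated schemes.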
Convergence Analysis
This section provides the convergence analysis of our ASVRG-ADMM algorithms (i.e. Algorithms 1 and 2) for strongly convex and general convex problems, respectively. Following [Zheng and Kwok2016], we first introduce the following function as a convergence criterion, where denotes the (sub)gradient of at . Indeed, for all . In the following, we give the intermediate key results for our analysis.
Lemma 1.
where .
Lemma 2.
Linear Convergence
Our first main result is the following theorem which gives the convergence rate of Algorithm 1.
Theorem 1.
Using the same notation as in Lemma 4 with given , and suppose is strongly convex and smooth, and is sufficiently large so that
(13) 
where
is the smallest eigenvalue of the positive semidefinite matrix
, and is defined in (9). Then
The proof of Theorem 1 is provided in the Supplementary Material. From Theorem 1, one can see that ASVRG-ADMM achieves linear convergence, which is consistent with that of SVRG-ADMM, while SCAS-ADMM has only an $O(1/T)$ convergence rate.
Remark 1.
Theorem 1 shows that our result improves slightly upon the rate in [Zheng and Kwok2016] with the same and . Specifically, as shown in (13), consists of three components, corresponding to those of Theorem 1 in [Zheng and Kwok2016]. In Algorithm 1, recall that here and is defined in (9). Thus, both the first and third terms in (13) are slightly smaller than those of Theorem 1 in [Zheng and Kwok2016]. In addition, one can set (i.e. ) and . Thus, the second term in (13) equals , while that of SVRG-ADMM is approximately equal to . In summary, the convergence bound of SVRG-ADMM can be slightly improved by ASVRG-ADMM.
Selecting Scheme of $\theta$
The rate in (13) of Theorem 1 can be expressed as a function of the parameters and with given . Similar to [Nishihara et al.2015, Zheng and Kwok2016], one can obtain the optimal parameter , which produces a smaller rate . In addition, as shown in (13), all three terms depend on the weight $\theta$. Therefore, we give the following selecting scheme for $\theta$.
Proposition 1.
Given , and let , we set and , where . Then the optimal of Algorithm 1 is given by
The proof of Proposition 1 is provided in the Supplementary Material.
Convergence Rate of $O(1/T^2)$
We first assume that , where is a convex compact set with diameter , and the dual variable is also bounded with . For Algorithm 2, we give the following result.
Theorem 2.
Using the same notation as in Lemma 4 with , then we have
(14) 
The proof of Theorem 2 is provided in the Supplementary Material. Theorem 2 shows that the convergence bound consists of three components, which converge as , and , respectively, while the three components of SVRG-ADMM converge as , and . Clearly, ASVRG-ADMM achieves a convergence rate of $O(1/T^2)$, as opposed to $O(1/T)$ for SVRG-ADMM and SAG-ADMM (). All the components in the convergence bound of SCAS-ADMM converge as $O(1/T)$. Thus, it is clear from this comparison that ASVRG-ADMM is a factor of $T$ faster than SAG-ADMM, SVRG-ADMM and SCAS-ADMM.
Connections to Related Work
Our algorithms and convergence results can be extended to the following settings. When the mini-batch size $b = n$ and , then , that is, the first term of (14) vanishes, and ASVRG-ADMM degenerates to the batch version. Its convergence rate becomes (which is consistent with the optimal result for accelerated deterministic ADMM methods [Goldstein et al.2014, Lu et al.2016]), where . Many empirical risk minimization problems can be viewed as a special case of (1) when $A = I$. Thus, our method can be extended to solve them, and has an $O(1/T^2)$ rate, which is consistent with the best known result as in [AllenZhu2016, Hien et al.2016].
Experiments
In this section, we use our ASVRG-ADMM method to solve the general convex graph-guided fused Lasso, strongly convex graph-guided logistic regression and graph-guided SVM problems. We compare ASVRG-ADMM with the following state-of-the-art methods: STOC-ADMM [Ouyang et al.2013], OPG-ADMM [Suzuki2013], SAG-ADMM [Zhong and Kwok2014b], SCAS-ADMM [Zhao, Li, and Zhou2015] and SVRG-ADMM [Zheng and Kwok2016]. All methods were implemented in MATLAB, and the experiments were performed on a PC with an Intel i5-2400 CPU and 16GB RAM.
Graph-Guided Fused Lasso
We first evaluate the empirical performance of the proposed method for solving the graph-guided fused Lasso problem:
(15) $\min_{x} \; \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \lambda_1 \|Ax\|_1,$
where $f_i$ is the logistic loss function on the feature-label pair $(a_i, b_i)$, i.e., $f_i(x) = \log(1 + \exp(-b_i a_i^\top x))$, and $\lambda_1 \ge 0$ is the regularization parameter. Here, we set $A = [G; I]$ as in [Ouyang et al.2013, Zhong and Kwok2014b, Azadi and Sra2014, Zheng and Kwok2016], where $G$ is the sparsity pattern of the graph obtained by sparse inverse covariance selection [Banerjee, Ghaoui, and d’Aspremont2008]. We used four publicly available data sets (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) in our experiments, as listed in Table 2. Note that except STOC-ADMM, all the other algorithms adopted the linearization of the penalty term to avoid a matrix inversion at each iteration, which can be computationally expensive for large matrices. The parameters of ASVRG-ADMM are set as follows: and as in [Zhong and Kwok2014b, Zheng and Kwok2016], as well as and .

Table 2
Data sets | training | test | mini-batch | $\lambda_1$ | $\lambda_2$
a9a | 16,281 | 16,280 | 20 | 1e-5 | 1e-2
w8a | 32,350 | 32,350 | 20 | 1e-5 | 1e-2
SUSY | 3,500,000 | 1,500,000 | 100 | 1e-5 | 1e-2
HIGGS | 7,700,000 | 3,300,000 | 150 | 1e-5 | 1e-2
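For reference, the graph-guided fused Lasso training objective (average logistic loss plus an l1 penalty on Ax) can be evaluated as in the following pure-Python sketch. The tiny data set, the matrix `A` (one hypothetical graph edge stacked on an identity block) and the function names are assumptions made for illustration only; they are not the data or code used in the experiments.

```python
import math


def logistic_loss(x, a_i, b_i):
    """log(1 + exp(-b_i * <a_i, x>)), computed in a numerically stable way."""
    m = b_i * sum(aj * xj for aj, xj in zip(a_i, x))
    if m > 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))


def graph_guided_objective(x, features, labels, A, lam):
    """Average logistic loss plus lam * ||A x||_1."""
    n = len(features)
    loss = sum(logistic_loss(x, a, b) for a, b in zip(features, labels)) / n
    Ax = [sum(aj * xj for aj, xj in zip(row, x)) for row in A]
    return loss + lam * sum(abs(v) for v in Ax)


# Made-up data: three samples with two features, labels in {+1, -1}.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [1, -1, 1]
# Hypothetical A: one graph edge (x_0 - x_1) stacked on the identity.
A = [[1.0, -1.0], [1.0, 0.0], [0.0, 1.0]]
value_at_zero = graph_guided_objective([0.0, 0.0], features, labels, A, 1e-5)
print(value_at_zero)  # equals log(2) because every margin is zero
```

The training error reported in the figures corresponds to this kind of objective value minus its minimum over x.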
Figure 1 shows the training error (i.e. the training objective value minus the minimum) and testing loss of all the algorithms for the general convex problem on the four data sets. SAG-ADMM could not generate experimental results on the HIGGS data set because it ran out of memory. These figures clearly indicate that the variance reduced stochastic ADMM algorithms (including SAG-ADMM, SCAS-ADMM, SVRG-ADMM and ASVRG-ADMM) converge much faster than those without variance reduction techniques, e.g. STOC-ADMM and OPG-ADMM. Notably, ASVRG-ADMM consistently outperforms all other algorithms in terms of the convergence rate under all settings, which empirically verifies our theoretical result that ASVRG-ADMM has a faster convergence rate of $O(1/T^2)$, as opposed to the best known rate of $O(1/T)$.
Graph-Guided Logistic Regression
We further discuss the performance of ASVRG-ADMM for solving the strongly convex graph-guided logistic regression problem [Ouyang et al.2013, Zhong and Kwok2014a]:
(16) 
Due to limited space and similar experimental phenomena on the four data sets, we only report the experimental results on the a9a and w8a data sets in Figure 2, from which we observe that SVRG-ADMM and ASVRG-ADMM achieve comparable performance, and they significantly outperform the other methods in terms of the convergence rate, which is consistent with their linear (geometric) convergence guarantees. Moreover, ASVRG-ADMM converges slightly faster than SVRG-ADMM, which shows the effectiveness of the momentum trick to accelerate variance reduced stochastic ADMM, as we expected.
Graph-Guided SVM
Finally, we evaluate the performance of ASVRG-ADMM for solving the graph-guided SVM problem,
(17) 
where is the nonsmooth hinge loss. To effectively solve problem (17), we used the smooth Huberized hinge loss in [Rosset and Zhu2007] to approximate the hinge loss. For the 20newsgroups data set (http://www.cs.nyu.edu/~roweis/data.html), we randomly divide it into an 80% training set and a 20% test set. Following [Ouyang et al.2013], we set , and use the one-vs-rest scheme for the multi-class classification.
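The idea of smoothing the hinge loss can be sketched as follows. The code uses one common parameterization of a Huberized hinge loss, with a quadratic zone of width `delta` joined to a linear tail; the exact form and parameter value used in [Rosset and Zhu2007] and in our experiments may differ, so both the formula and the default `delta = 0.5` are assumptions for illustration.

```python
def huberized_hinge(margin, delta=0.5):
    """Smooth surrogate of max(0, 1 - margin); delta sets the quadratic zone."""
    if margin >= 1.0:
        return 0.0
    if margin > 1.0 - delta:
        return (1.0 - margin) ** 2 / (2.0 * delta)
    return 1.0 - margin - delta / 2.0


def huberized_hinge_grad(margin, delta=0.5):
    """Derivative with respect to the margin; continuous everywhere."""
    if margin >= 1.0:
        return 0.0
    if margin > 1.0 - delta:
        return -(1.0 - margin) / delta
    return -1.0


for m in (2.0, 0.75, 0.5, 0.0):
    print(m, huberized_hinge(m), huberized_hinge_grad(m))
```

Because the derivative is continuous and bounded, the smoothed loss satisfies the smoothness assumption required by gradient-based stochastic updates, which the plain hinge loss does not.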
Figure 3 shows the average prediction accuracies and standard deviations of testing accuracies over 10 different runs. Since STOC-ADMM, OPG-ADMM, SAG-ADMM and SCAS-ADMM consistently perform worse than SVRG-ADMM and ASVRG-ADMM in all settings, we only report the results of STOC-ADMM. We observe that SVRG-ADMM and ASVRG-ADMM consistently outperform the classical SVM and STOC-ADMM. Moreover, ASVRG-ADMM performs much better than the other methods in all settings, which again verifies the effectiveness of our ASVRG-ADMM method.
Conclusions
In this paper, we proposed an accelerated stochastic variance reduced ADMM (ASVRG-ADMM) method, in which we combined both the momentum acceleration trick for batch optimization and the variance reduction technique. We designed two different momentum term update rules for the strongly convex and general convex cases, respectively. Moreover, we theoretically analyzed the convergence properties of ASVRG-ADMM, from which it is clear that ASVRG-ADMM achieves linear convergence and $O(1/T^2)$ rates for the two cases, respectively. In particular, ASVRG-ADMM is at least a factor of $T$ faster than existing stochastic ADMM methods for general convex problems.
Acknowledgements
We thank the reviewers for their valuable comments. The authors are supported by the Hong Kong GRF 2150851 and 2150895, and Grants 3132964 and 3132821 funded by the Research Committee of CUHK.
References
 [AllenZhu2016] Allen-Zhu, Z. 2016. Katyusha: Accelerated variance reduction for faster SGD. arXiv:1603.05953v4.
 [Azadi and Sra2014] Azadi, S., and Sra, S. 2014. Towards an optimal stochastic alternating direction method of multipliers. In Proc. 31st Int. Conf. Mach. Learn. (ICML), 620–628.
 [Baldassarre and Pontil2013] Baldassarre, L., and Pontil, M. 2013. Advanced topics in machine learning part II: 5. Proximal methods. University Lecture.
 [Banerjee, Ghaoui, and d’Aspremont2008] Banerjee, O.; Ghaoui, L. E.; and d’Aspremont, A. 2008. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. J. Mach. Learn. Res. 9:485–516.
 [Beck and Teboulle2009] Beck, A., and Teboulle, M. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1):183–202.
 [Boyd et al.2011] Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; and Eckstein, J. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1):1–122.
 [Bubeck2015] Bubeck, S. 2015. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn. 8:231–358.
 [Deng and Yin2016] Deng, W., and Yin, W. 2016. On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 66:889–916.
 [Goldstein et al.2014] Goldstein, T.; O'Donoghue, B.; Setzer, S.; and Baraniuk, R. 2014. Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3):1588–1623.
 [Hien et al.2016] Hien, L. T. K.; Lu, C.; Xu, H.; and Feng, J. 2016. Accelerated stochastic mirror descent algorithms for composite non-strongly convex optimization. arXiv:1605.06892v2.
 [Johnson and Zhang2013] Johnson, R., and Zhang, T. 2013. Accelerating stochastic gradient descent using predictive variance reduction. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), 315–323.
 [Kim, Sohn, and Xing2009] Kim, S.; Sohn, K. A.; and Xing, E. P. 2009. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics 25:i204–i212.
 [Koneeny et al.2016] Konečný, J.; Liu, J.; Richtárik, P.; and Takáč, M. 2016. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Signal Process. 10(2):242–255.
 [Lan2012] Lan, G. 2012. An optimal method for stochastic composite optimization. Math. Program. 133:365–397.
 [Lu et al.2016] Lu, C.; Li, H.; Lin, Z.; and Yan, S. 2016. Fast proximal linearized alternating direction method of multiplier with parallel splitting. In Proc. 30th AAAI Conf. Artif. Intell., 739–745.
 [Nesterov1983] Nesterov, Y. 1983. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady 27(2):372–376.
 [Nesterov2004] Nesterov, Y. 2004. Introductory Lectures on Convex Optimization: A Basic Course. Boston: Kluwer Academic Publ.
 [Nie et al.2014] Nie, F.; Huang, Y.; Wang, X.; and Huang, H. 2014. Linear time solver for primal SVM. In Proc. 31st Int. Conf. Mach. Learn. (ICML), 505–513.
 [Nishihara et al.2015] Nishihara, R.; Lessard, L.; Recht, B.; Packard, A.; and Jordan, M. I. 2015. A general analysis of the convergence of ADMM. In Proc. 32nd Int. Conf. Mach. Learn. (ICML), 343–352.
 [Ouyang et al.2013] Ouyang, H.; He, N.; Tran, L. Q.; and Gray, A. 2013. Stochastic alternating direction method of multipliers. In Proc. 30th Int. Conf. Mach. Learn. (ICML), 80–88.
 [Rosset and Zhu2007] Rosset, S., and Zhu, J. 2007. Piecewise linear regularized solution paths. Ann. Statist. 35(3):1012–1030.
 [Roux, Schmidt, and Bach2012] Roux, N. L.; Schmidt, M.; and Bach, F. 2012. A stochastic gradient method with an exponential convergence rate for finite training sets. In Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2672–2680.
 [ShalevShwartz and Zhang2013] Shalev-Shwartz, S., and Zhang, T. 2013. Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14:567–599.
 [Suzuki2013] Suzuki, T. 2013. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proc. 30th Int. Conf. Mach. Learn. (ICML), 392–400.
 [Suzuki2014] Suzuki, T. 2014. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proc. 31st Int. Conf. Mach. Learn. (ICML), 736–744.
 [Tibshirani and Taylor2011] Tibshirani, R. J., and Taylor, J. 2011. The solution path of the generalized lasso. Annals of Statistics 39(3):1335–1371.
 [Tseng2010] Tseng, P. 2010. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program. 125:263–295.
 [Wang and Banerjee2012] Wang, H., and Banerjee, A. 2012. Online alternating direction method. In Proc. 29th Int. Conf. Mach. Learn. (ICML), 1119–1126.
 [Zhang, Burger, and Osher2011] Zhang, X.; Burger, M.; and Osher, S. 2011. A unified primal-dual algorithm framework based on Bregman iteration. J. Sci. Comput. 46(1):20–46.
 [Zhao, Li, and Zhou2015] Zhao, S.-Y.; Li, W.-J.; and Zhou, Z.-H. 2015. Scalable stochastic alternating direction method of multipliers. arXiv:1502.03529v3.
 [Zheng and Kwok2016] Zheng, S., and Kwok, J. T. 2016. Fast-and-light stochastic ADMM. In Proc. 25th Int. Joint Conf. Artif. Intell., 2407–2613.
 [Zhong and Kwok2014a] Zhong, L. W., and Kwok, J. T. 2014a. Accelerated stochastic gradient method for composite regularization. In Proc. 17th Int. Conf. Artif. Intell. Statist., 1086–1094.
 [Zhong and Kwok2014b] Zhong, L. W., and Kwok, J. T. 2014b. Fast stochastic alternating direction method of multipliers. In Proc. 31st Int. Conf. Mach. Learn. (ICML), 46–54.
Supplementary Materials for “Accelerated Variance Reduced Stochastic ADMM”
In this supplementary material, we give the detailed proofs for two important lemmas (i.e., Lemmas 1 and 2), two key theorems (i.e., Theorems 1 and 2) and a proposition (i.e., Proposition 1).
Proof of Lemma 1:
Our convergence analysis will use a bound on the variance term , as shown in Lemma 1. Before giving the proof of Lemma 1, we first give the following lemma.
Lemma 3.
Since each $f_i$ is convex and $L_i$-smooth ($i = 1, \ldots, n$), the following holds
(18) 
where , and .
Proof.
This result follows immediately from Theorem 2.1.5 in [Nesterov2004]. ∎
Proof of Lemma 1:
Proof.
. Taking expectation over the random choice of , we have
(19) 
where the first inequality follows from the fact that , and the second inequality is due to Lemma 3 given above. Note that a similar result to (19) was also proved in [AllenZhu2016] (see Lemma 3.4 in [AllenZhu2016]). Next, we extend the result to the mini-batch setting.
Let be the size of minibatch . We prove the result of Lemma 1 for the minibatch case, i.e. .
where the second equality follows from Lemma 4 in [Koneeny et al.2016], and the inequality holds due to the result in (19). ∎
Proof of Lemma 2:
Before proving the key Lemma 2, we first give the following property [Baldassarre and Pontil2013, Lan2012], which is useful for the convergence analysis of ASVRG-ADMM.
Property 1.
Given any , then we have