1 Introduction
Empirical risk minimization arises frequently in machine learning applications, where the objective function is the average of losses computed at different data points. As data sets grow, distributed computing architectures are needed to meet scalability requirements in both computation power and storage space by distributing the learning task over multiple computing nodes. In addition, distributed frameworks are suitable for problems where there are privacy concerns about transmitting and storing all the data in a central location, a scenario related to the nascent field of federated learning [11]. It is, therefore, necessary to develop distributed optimization frameworks that are tailored to solving large-scale empirical risk minimization problems with desirable communication-computation trade-offs, where the data are stored disjointly over different machines.
Due to its low per-iteration cost, a popular solution is distributed stochastic gradient descent (SGD) [19], where the parameter server aggregates gradients from each worker and performs mini-batch gradient updates. However, distributed SGD is not communication-efficient and requires many communication rounds to converge, which partially diminishes the benefit of distribution. Many deterministic optimization methods have been developed to achieve communication efficiency, including but not limited to DANE [25], AIDE [20], DiSCo [34], GIANT [29], CoCoA [26], and one-shot averaging [39, 35].
Recent breakthroughs in developing stochastic variance reduced methods such as SAG [21], SAGA [6], SVRG [10], SDCA [23], MiG [38], Katyusha [3], Catalyst [14], SCOPE [37, 36], SARAH [17], SPIDER [7], SpiderBoost [30], and many more, allow achieving fast convergence and small per-iteration cost at the same time. Yet, distributed schemes of such variance reduced methods that are both practical and theoretically sound are much less developed.
This paper focuses on a general framework of distributed stochastic variance reduced methods presented in Alg. 1, which is natural and easy to implement. At a high level, SVRG-type algorithms contain inner loops for parameter updates via variance-reduced SGD, and outer loops for global gradient and parameter updates. Alg. 1 assigns outer loops to the parameter server, and inner loops to worker machines. The parameter server collects gradients from worker machines and then distributes the global gradient to each machine. Each worker machine then runs the inner loop independently in parallel using variance reduction techniques, and returns the updates to the parameter server at the end. Per iteration, Alg. 1 requires two communication rounds: one communication round is used to average the parameter estimates, and the other is used to average the gradients, which is the same as distributed synchronous SGD. A distributed SVRG method under this framework has been proposed in several works under different scenarios [11, 5, 20] with great empirical success. Surprisingly, a complete theoretical understanding is still largely missing. Moreover, distributed variants using accelerated variance reduction methods have not been developed. The main analysis difficulty is that the variance-reduced gradient of each worker is no longer an unbiased gradient estimator when sampling from reused local data.
On the other hand, several variants of distributed SVRG, e.g. [12, 24, 28], have been proposed with performance guarantees. They try to bypass the biased gradient estimation issue by simulating the process of i.i.d. sampling from all the data, so a random data reallocation is needed before any batch sample is used again. Such algorithmic steps, i.e., sampling extra data with or without replacement, can be cumbersome and difficult to implement in practice.
1.1 Contributions of This Paper
This paper provides a convergence analysis of a family of naturally distributed stochastic variance reduced methods under the framework described in Alg. 1. By using different variance reduction schemes at the worker machines, we study distributed variants of three representative algorithms in this paper: DSVRG [10], DSARAH with recursive gradient updates [17, 18], and DMiG with accelerated gradient updates [38]. The contributions of this paper are summarized below.

We suggest a simple and intuitive metric called distributed smoothness to gauge data balancedness among workers, defined as the smoothness of the difference between the local loss function and the global loss function . The metric is deterministic, easy to compute, and applies to arbitrary dataset splitting. We establish the linear convergence of DSVRG, DSARAH, and DMiG under strongly convex losses, as long as the distributed smoothness parameter is smaller than a constant fraction of the strong convexity parameter , e.g. , where the fraction might change for different algorithms.

Under appropriate distributed smoothness, we show that, to reach accuracy, DSVRG and DSARAH (resp. DMiG) achieve (resp. ) time complexity with rounds of communication, where is the total number of input data points, is the number of worker machines and is the condition number of the global loss function . Compared to the time complexity of the original centralized SVRG and SARAH (resp. MiG), (resp. ), this leads to a speed up in by the number of machines, if the local data size is sufficiently large. Furthermore, our bounds capture the phenomenon that the convergence rate improves as the local loss functions become more similar to the global loss function, by reducing the distributed smoothness parameter.

When local data are highly unbalanced, the distributed smoothness parameter becomes large, which might lead to algorithm divergence. We suggest regularization as an effective way to handle this situation, and show that by adding larger regularization to machines that are less smooth, one can still ensure linear convergence in a regularized version of DSVRG, called DRSVRG, though at a slower rate of convergence.

More generally, the notion of distributed smoothness can also be used to establish the convergence under possibly nonsmooth and nonconvex losses. Under mild conditions, we show that DSARAH achieves accuracy in a time complexity of with rounds of communication.
1.2 Related Work
Many algorithms have been proposed for distributed (stochastic) optimization. For conciseness, Table 1 summarizes the ones most relevant to the current paper. Note that previous works considering variance reduction are all on distributed SVRG and its variants. The general framework of distributed variance-reduced methods covering SARAH and MiG and various loss settings in this paper has not been studied before.
Algorithm | Rounds | Runtime | Assumptions
DSVRG [12] | | | sampling extra data
DASVRG [12] | | | sampling extra data
Dist. Acc. Grad | | | none
ADMM | | | none
SCOPE [37] | | | uniform reg.
pSCOPE [36] | | | good partition
DSVRG* | | | dist. smoothness
DSARAH* | | | dist. smoothness
DMiG* | | | dist. smoothness
DRSVRG* | | | large regularization
DRSVRG* | | | small regularization
The DSVRG algorithm, presented in Alg. 2, has been empirically studied before in [20, 11] without a theoretical convergence analysis. The pSCOPE algorithm [36] is also a variant of distributed SVRG, and its convergence is studied under an assumption called good data partition in [36]. The SCOPE algorithm [37] is similar to the regularized algorithm DRSVRG under large regularization; however, our analysis is much more refined: it adds different regularizations to different local workers, characterizes the amount of regularization necessary to ensure convergence with respect to the distributed smoothness of local data, and gracefully degenerates to the unregularized case when distributed smoothness is benign.
There have also been many recent efforts to reduce the communication cost of distributed GD/SGD by gradient quantization [1, 4, 22, 32] and by gradient compression and sparsification [2, 13, 15, 31, 27]. In comparison, we communicate the exact gradient, and it might be an interesting future direction to combine gradient compression schemes with distributed variance reduced stochastic gradient methods.
Paper Organization.
The rest of this paper is organized as follows. Section 2 presents the problem setup and a general framework of distributed stochastic optimization. Section 3 presents the convergence guarantees of DSVRG, DSARAH and DMiG under appropriate distributed smoothness assumptions. Section 4 introduces regularization to DSVRG to handle unbalanced data when distributed smoothness does not hold. Section 5 presents extensions to nonsmooth and nonconvex losses. Section 6 presents numerical experiments for validation. Finally, we conclude in Section 7.
2 Problem Setup
Suppose we have a data set , where for , that contains data points. In particular, we do not make any assumptions on their statistical distribution. We consider the following empirical risk minimization problem
(1) 
where is the parameter to be optimized and is the sample loss function. For brevity, we use to denote throughout the paper.
In a distributed setting, where the data are distributed to machines or workers, we define a partition of the data set as , where , . The th worker, correspondingly, is in possession of the data subset , . We assume there is a parameter server (PS) that coordinates the parameter sharing among the workers. The size of the data held by each worker machine is . When the data is split equally, we have . The original problem (1) can be rewritten as minimizing the following objective function:
(2) 
where is the local loss function at the th worker machine.
Footnote 1: It is straightforward to state our results under unequal data splitting with proper rescaling.
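For concreteness, problems (1) and (2) can be written out in one plausible notation. The symbols below are assumptions made for illustration and may differ from the paper's own: $N$ data points $z_i$, $n$ workers, parameter $x \in \mathbb{R}^d$, sample loss $\ell$, and local data subsets $\mathcal{M}_k$ of equal size $N/n$.

```latex
% (1) empirical risk minimization over all N data points (assumed notation)
\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{N} \sum_{i=1}^{N} \ell(x; z_i)

% (2) the same objective, grouped by worker under an equal split
f(x) = \frac{1}{n} \sum_{k=1}^{n} f_k(x),
\qquad
f_k(x) := \frac{1}{N/n} \sum_{z \in \mathcal{M}_k} \ell(x; z)
```

With an equal split, the factor $\frac{1}{n} \cdot \frac{n}{N}$ in (2) reduces to $\frac{1}{N}$, so the two forms agree term by term.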
Alg. 1 presents a general framework for distributed stochastic variance reduced methods, which assigns the outer loops to PS and the inner loops to local workers. By using different variance reduction schemes at the worker machines (i.e. LocalUpdate), we obtain distributed variants of different algorithms, as described in later sections.
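As a rough illustration of this framework, the following Python sketch runs the outer loop at the PS with a pluggable local routine. It is a minimal sketch under stated assumptions, not the paper's pseudocode: the parameter is one-dimensional to keep the example dependency-free, the names `run_framework`, `local_grads`, and `local_update` are hypothetical, workers run sequentially rather than in parallel, and the PS averages the returned estimates.

```python
def run_framework(local_grads, local_update, x0, rounds):
    """Outer loop at the parameter server (PS); inner loops run at the workers.

    local_grads[k](x) returns worker k's full local gradient at x.
    local_update(k, x_ref, g_ref) is a hypothetical stand-in for LocalUpdate.
    """
    x_ref = x0
    n = len(local_grads)
    for _ in range(rounds):
        # Communication round 1: the PS averages the local gradients.
        g_ref = sum(g(x_ref) for g in local_grads) / n
        # Each worker runs its inner loop (sequentially here, in parallel in practice).
        returned = [local_update(k, x_ref, g_ref) for k in range(n)]
        # Communication round 2: the PS averages the returned parameter estimates.
        x_ref = sum(returned) / n
    return x_ref


# Toy problem: f_k(x) = 0.5 * (x - c_k)^2, so the global minimizer is the mean of c_k.
centers = [1.0, 2.0, 6.0]
local_grads = [lambda x, c=c: x - c for c in centers]


def plain_step(k, x_ref, g_ref):
    """Placeholder LocalUpdate: a single gradient step with the global gradient."""
    return x_ref - 0.5 * g_ref


x_hat = run_framework(local_grads, plain_step, x0=0.0, rounds=40)
```

Swapping `plain_step` for a variance-reduced inner loop gives the distributed variants discussed in later sections; the outer loop itself does not change.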
Throughout, we invoke one or several of the following standard assumptions on the loss functions.
Assumption 1.
The sample loss function is smooth for all .
Assumption 2.
The sample loss function is convex for all .
Assumption 3.
The empirical risk is strongly convex.
When is strongly convex, the condition number of is defined as . Denote the minimizer of as and the corresponding optimal value as .
As it turns out, the smoothness of the deviation between the local loss function and the global loss function plays a key role in the convergence analysis, as it measures the balancedness between local data in a simple and intuitive manner. We refer to this as the “distributed smoothness”, which is central in our analysis. In some cases, a weaker notion called restricted smoothness is sufficient, which is defined below.
Definition 1 (Restricted smoothness).
A differentiable function is called restricted smooth with regard to if , .
The restricted smoothness, compared to standard smoothness, fixes one of the arguments to , and is therefore a much weaker requirement. The following assumption quantifies the distributed smoothness using either restricted smoothness or standard smoothness.
Assumption 4.a.
The deviation is restricted smooth with regard to (cf. Definition 1) for all .
Assumption 4.b.
The deviation is smooth for all .
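For intuition, consider one-dimensional least-squares sample losses of the form 0.5 * (a * x - b)^2, an example chosen here for illustration and not taken from the paper. The local loss on a worker then has constant curvature equal to the mean of a^2 over its data, so the deviation between the global and the local loss is a quadratic whose smoothness constant is the absolute difference of the two curvatures. The sketch below computes this quantity per worker; the function name and the equal-split assumption are hypothetical.

```python
def distributed_smoothness(local_features):
    """Smoothness of (global loss - local loss) for 1-D least squares.

    local_features[k] lists the features a_i held by worker k.
    Assumes equal-sized local data sets, so the global curvature is the
    plain average of the local curvatures.
    """
    h_local = [sum(a * a for a in feats) / len(feats) for feats in local_features]
    h_global = sum(h_local) / len(h_local)
    return [abs(h_global - h_k) for h_k in h_local]


# Two workers holding statistically identical data: the deviation vanishes.
balanced = distributed_smoothness([[1.0, 2.0], [2.0, 1.0]])
# Two workers holding very different data: the deviation is large.
unbalanced = distributed_smoothness([[1.0, 1.0], [3.0, 3.0]])
```

In the second case the global curvature, which equals the strong convexity parameter of this quadratic, is 5 while each deviation is 4, so the data split is far from balanced in the sense above.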
3 Convergence in the Strongly Convex Case
In this section, we describe and analyze three variance-reduced routines for LocalUpdate in Alg. 1, namely SVRG [10], SARAH with recursive gradient updates [17, 18], and MiG with accelerated gradient updates [38] when is strongly convex. We establish the convergence guarantee for each algorithm, respectively.
3.1 Distributed SVRG (DSVRG)
The LocalUpdate of DSVRG is described in Alg. 2. Theorem 1 provides the convergence guarantee of DSVRG as long as the distributed smoothness parameter is small enough.
Theorem 1 (DSVRG).
Theorem 1 establishes the linear convergence of function values in expectation for DSVRG, as long as the parameter is sufficiently small. With proper and , the convergence rate can be set as . From the time complexity and the rate expressions, we can see that the smaller , the faster DSVRG converges. When is set such that is bounded above by a constant smaller than , the time complexity becomes , which improves the convergence rate of SVRG in the centralized setting.
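The following Python sketch shows what an SVRG-style LocalUpdate can look like inside the two-round outer loop. It is a minimal sketch under stated assumptions and not a transcription of Alg. 2: the parameter is one-dimensional, the sample loss is the least-squares loss 0.5 * (a * x - b)^2, each worker returns its last iterate, the PS averages the returned iterates, and all function names and the toy data are hypothetical.

```python
import random


def grad(x, a, b):
    """Gradient of the sample loss 0.5 * (a * x - b) ** 2 with respect to x."""
    return a * (a * x - b)


def svrg_local_update(data, x_ref, g_ref, step, m, rng):
    """m variance-reduced steps using samples drawn from LOCAL data only."""
    x = x_ref
    for _ in range(m):
        a, b = rng.choice(data)
        # Local stochastic gradient, corrected by the reference point and global gradient.
        v = grad(x, a, b) - grad(x_ref, a, b) + g_ref
        x -= step * v
    return x  # returning the last iterate is an assumption of this sketch


def dsvrg(workers, x0, step, m, rounds, seed=0):
    rng = random.Random(seed)
    x_ref = x0
    for _ in range(rounds):
        # Round 1: average the full local gradients into the global gradient.
        local = [sum(grad(x_ref, a, b) for a, b in d) / len(d) for d in workers]
        g_ref = sum(local) / len(local)
        # Round 2: average the iterates returned by the workers.
        outs = [svrg_local_update(d, x_ref, g_ref, step, m, rng) for d in workers]
        x_ref = sum(outs) / len(outs)
    return x_ref


# Toy data generated with b = 2 * a, so every sample loss is minimized at x = 2.
workers = [[(1.0, 2.0), (1.2, 2.4)], [(0.9, 1.8), (1.1, 2.2)]]
x_hat = dsvrg(workers, x0=0.0, step=0.1, m=20, rounds=30)
```

The two workers in this toy problem have similar curvature, which is the well-balanced regime in which the text expects fast convergence.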
3.2 Distributed SARAH (DSARAH)
The LocalUpdate of DSARAH is also described in Alg. 2. It differs from SVRG in the update of the stochastic gradient , which uses a recursive formula. Theorem 2 provides the convergence guarantee of DSARAH.
Theorem 2 (DSARAH).
Theorem 2 establishes the linear convergence of the gradient norm in expectation for DSARAH, as long as the parameter is small enough. With proper and , the rate can be set as . Similar to DSVRG, a smaller leads to faster convergence of DSARAH. When is set such that is bounded above by a constant smaller than , the time complexity becomes , which improves the convergence rate of SARAH in the centralized setting. In particular, Theorem 2 suggests that DSARAH may allow a larger , compared with DSVRG, to guarantee convergence.
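The recursive gradient update can be illustrated with a short Python sketch. As a sketch under stated assumptions rather than the paper's Alg. 2, it uses a one-dimensional parameter, the least-squares sample loss 0.5 * (a * x - b)^2, the standard SARAH recursion in which the estimator is initialized with the global gradient and then updated from consecutive iterates, a last-iterate return value, and averaging at the PS. All names and the toy data are hypothetical.

```python
import random


def grad(x, a, b):
    """Gradient of the sample loss 0.5 * (a * x - b) ** 2 with respect to x."""
    return a * (a * x - b)


def sarah_local_update(data, x_ref, g_ref, step, m, rng):
    """m steps with the recursive estimator v_j = g_i(x_j) - g_i(x_{j-1}) + v_{j-1}."""
    x_prev = x_ref
    v = g_ref                      # v_0 is the global gradient at the reference point
    x = x_prev - step * v
    for _ in range(m - 1):
        a, b = rng.choice(data)    # sample from local data only
        v = grad(x, a, b) - grad(x_prev, a, b) + v
        x_prev, x = x, x - step * v
    return x  # returning the last iterate is an assumption of this sketch


def dsarah(workers, x0, step, m, rounds, seed=0):
    rng = random.Random(seed)
    x_ref = x0
    for _ in range(rounds):
        local = [sum(grad(x_ref, a, b) for a, b in d) / len(d) for d in workers]
        g_ref = sum(local) / len(local)
        outs = [sarah_local_update(d, x_ref, g_ref, step, m, rng) for d in workers]
        x_ref = sum(outs) / len(outs)
    return x_ref


# Toy data generated with b = 2 * a, so every sample loss is minimized at x = 2.
workers = [[(1.0, 2.0), (1.2, 2.4)], [(0.9, 1.8), (1.1, 2.2)]]
x_hat = dsarah(workers, x0=0.0, step=0.1, m=20, rounds=30)
```

Unlike the SVRG estimator, the recursion never revisits the reference point inside the inner loop; it only compares the two most recent iterates.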
3.3 Distributed MiG (DMiG)
The LocalUpdate of DMiG is described in Alg. 3, which is inspired by the inner loop of the MiG algorithm [38], a recently proposed accelerated variance-reduced algorithm. Theorem 3 provides the convergence guarantee of DMiG.
Theorem 3 (DMiG).
Theorem 3 establishes the linear convergence of DMiG under standard smoothness of , in order to fully harness the power of acceleration. While we do not make it explicit in the theorem statement, the time complexity of DMiG also decreases as gets smaller. Furthermore, the time complexity of DMiG is smaller than that of DSVRG/DSARAH when .
4 Adding Regularization Helps Unbalanced Data
So far, we have established the convergence when the distributed smoothness is not too large. Under i.i.d. and balanced data, this requirement can be justified for large enough data size, see Appendix E. However, when such conditions are violated, the algorithms might diverge. In this situation, adding a regularization term might ensure the convergence, at the cost of possibly slowing down the convergence rate. We consider regularizing the local gradient update of SVRG in Alg. 2 as
(3) 
where the last regularization term penalizes the deviation of the current iterate from the reference point . We have the following theorem.
Theorem 4 (Distributed Regularized SVRG (DRSVRG)).
Compared with Theorem 1, Theorem 4 relaxes the assumption to , which means that by inserting a larger regularization to local workers that are not distributed smooth, i.e. those with large , one can still guarantee the convergence of DRSVRG. However, increasing leads to a slower convergence rate: a large leads to an iteration complexity , similar to gradient descent. Compared with SCOPE [37] which requires a uniform regularization , our analysis applies tailored regularization to local workers, and potentially allows much smaller regularization to guarantee the convergence, since ’s can be much smaller than the smoothness parameter .
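The effect of worker-specific regularization can be seen on a deliberately unbalanced toy problem. The Python sketch below is an illustration under stated assumptions, not the paper's update (3) verbatim: it assumes the regularization enters the local gradient as `mu * (x - x_ref)`, uses a one-dimensional least-squares loss, returns the last iterate, and averages at the PS. The function names, the data, and the value 8.0 for the regularization weight are hypothetical.

```python
import random


def grad(x, a, b):
    """Gradient of the sample loss 0.5 * (a * x - b) ** 2 with respect to x."""
    return a * (a * x - b)


def regularized_local_update(data, x_ref, g_ref, step, m, mu, rng):
    x = x_ref
    for _ in range(m):
        a, b = rng.choice(data)
        v = grad(x, a, b) - grad(x_ref, a, b) + g_ref
        v += mu * (x - x_ref)  # assumed form of the regularization term
        x -= step * v
    return x


def run(workers, mus, x0=0.0, step=0.1, m=50, rounds=10, seed=0):
    rng = random.Random(seed)
    x_ref = x0
    for _ in range(rounds):
        local = [sum(grad(x_ref, a, b) for a, b in d) / len(d) for d in workers]
        g_ref = sum(local) / len(local)
        outs = [regularized_local_update(d, x_ref, g_ref, step, m, mu, rng)
                for d, mu in zip(workers, mus)]
        x_ref = sum(outs) / len(outs)
    return x_ref


# Local curvatures are 1 and 9, the global curvature is 5; all losses are minimized at x = 2.
workers = [[(1.0, 2.0)], [(3.0, 6.0)]]
x_plain = run(workers, mus=[0.0, 0.0])  # no regularization
x_reg = run(workers, mus=[8.0, 0.0])    # regularize only the low-curvature worker
```

In this example the unregularized run moves away from the minimizer, because the low-curvature worker overshoots when it follows the much steeper global gradient, while regularizing only that worker restores convergence.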
5 Convergence in the Nonconvex Case
In this section, we extend the convergence analysis of distributed SARAH to nonconvex loss functions, since SARAH-type algorithms were recently shown to achieve near-optimal performance for nonconvex problems [30, 18, 7]. Our result is summarized in the theorem below.
Theorem 5 (DSARAH for nonconvex losses).
Suppose that Assumption 1 and Assumption 4.b hold with . With step size , DSARAH satisfies
where is the agent index selected in the th round for parameter update, i.e. (cf. line 8 of Alg. 1). The time complexity of finding an optimal solution is no worse than with proper .
Footnote 2: In the nonconvex case, we make every agent return .
Theorem 5 suggests that DSARAH converges as long as the step size is small enough. Furthermore, a smaller allows a larger step size , and hence faster convergence. To gain further insights, assuming i.i.d. data, by concentration inequalities under mild conditions [16], and consequently, the iteration complexity of finding an accurate solution using DSARAH is . This is comparable to the best known result for the centralized SARAH-type algorithms [18, 7, 30].
6 Numerical Experiments
Though our focus in this paper is on the convergence analysis, we illustrate the performance of the proposed distributed variance reduced algorithms in various settings as a proof of concept.
Logistic regression.
Consider the regularized logistic regression, where the sample loss is defined as , with the data . We evaluate the performance on the gisette dataset [8], by splitting the data equally among all workers. We scale the data according to , so that the smoothness is estimated as . We choose , and to illustrate the performance under different condition numbers. We use the optimality gap, defined as , to illustrate convergence.
For DSVRG and DSARAH, the step size is set as . For DMiG, although the choice of in the theory requires knowledge of , we simply ignore it and set , and the step size to reflect the robustness of the practical performance to parameters. We further use at the PS for better empirical performance. For DAGD, the step size is set as and the momentum parameter is set as . Following [10], which sets the iterations of inner loops as , we set to ensure the same number of total inner iterations. We note that such parameters can be further tuned to achieve a better trade-off between communication cost and computation cost in practice.
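For readers who want to reproduce the setup, a regularized logistic sample loss and its gradient can be sketched in Python as follows. The exact form is an assumption made here: a label y in {-1, +1}, a feature vector a, the loss log(1 + exp(-y * <a, x>)), and a squared Euclidean penalty with weight `lam`. The function name and argument layout are hypothetical.

```python
import math


def logistic_loss_and_grad(x, a, y, lam):
    """Loss and gradient of log(1 + exp(-y * <a, x>)) + 0.5 * lam * ||x||^2."""
    margin = y * sum(ai * xi for ai, xi in zip(a, x))
    # Evaluate log(1 + exp(-margin)) and sigmoid(-margin) without overflow.
    if margin > 0:
        e = math.exp(-margin)
        loss = math.log1p(e)
        sig = e / (1.0 + e)
    else:
        e = math.exp(margin)
        loss = -margin + math.log1p(e)
        sig = 1.0 / (1.0 + e)
    loss += 0.5 * lam * sum(xi * xi for xi in x)
    grad = [-y * ai * sig + lam * xi for ai, xi in zip(a, x)]
    return loss, grad


# At x = 0 the margin is zero, so the loss is log(2) and the gradient is -y * a / 2.
loss0, grad0 = logistic_loss_and_grad([0.0, 0.0], [1.0, -2.0], 1, 0.1)
```

When every feature vector has Euclidean norm at most one, the logistic part of this loss has smoothness at most 1/4, since its second derivative along any direction is bounded by sigmoid * (1 - sigmoid) times the squared norm of a.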
Fig. 1 illustrates the optimality gap of various algorithms with respect to the number of communication rounds with 4 local workers under different conditioning, and Fig. 2 shows the corresponding result with different numbers of local workers when . The distributed variance-reduced algorithms outperform the distributed AGD, and DMiG outperforms DSVRG and DSARAH when the condition number is large.
Dealing with Unbalanced data.
We justify the benefit of regularization by evaluating the proposed algorithms under unbalanced data allocation. We assign 50%, 30%, 19.9%, and 0.1% of the data to four workers, respectively, and set in the logistic regression loss. To deal with unbalanced data, we perform the regularized update, given in (3), on the worker with the least amount of data, and keep the update on the rest of the workers unchanged. A similar regularized update can be conceived for DSARAH and DMiG, resulting in regularized variants, DRSARAH and DRMiG. While our theory does not cover them, we still evaluate their numerical performance. We properly set according to the amount of data on this worker as . We set the number of iterations at workers to on all agents. For DAGD we set the momentum parameter to , which is a common choice in the convex setting. Fig. 3 shows the optimality gap with respect to the number of communication rounds for all algorithms. It can be seen that unregularized DSVRG and DSARAH fail to converge, while the regularized algorithms still converge, verifying the role of regularization in addressing unbalanced data. It is interesting to note that DMiG still manages to converge even with highly unbalanced data, a phenomenon not fully covered by the current theory and worth further investigation. In this case, the regularization slightly slows down the convergence speed. It is also worth mentioning that the regularization can be flexibly imposed depending on the local data size, rather than homogeneously across all workers.
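The unbalanced allocation described above can be produced with a few lines of Python. This is a minimal sketch: the function name, the shuffling step, the seed, and the data set size of 1000 are assumptions for illustration, not details of the experiment.

```python
import random


def split_by_fractions(num_points, fractions, seed=0):
    """Shuffle indices 0..num_points-1 and cut them into consecutive parts."""
    idx = list(range(num_points))
    random.Random(seed).shuffle(idx)
    # Cumulative cut points; rounding keeps sizes close to the requested fractions.
    bounds, acc = [0], 0.0
    for frac in fractions:
        acc += frac
        bounds.append(round(acc * num_points))
    bounds[-1] = num_points  # guard against floating-point drift in the last cut
    return [idx[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


parts = split_by_fractions(1000, [0.5, 0.3, 0.199, 0.001])
sizes = [len(p) for p in parts]
```

With 1000 points the last worker holds a single sample, which is the kind of extreme imbalance that the regularized update is meant to handle.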
7 Conclusions
In this paper, we have developed a convergence theory for a family of distributed stochastic variance reduced methods without sampling extra data, under a mild distributed smoothness assumption that measures the discrepancy between the local and global loss functions. Convergence guarantees are obtained for distributed stochastic variance reduced methods using accelerations and recursive gradient updates, and for minimizing both strongly convex and nonconvex losses. We also suggest regularization as a means of ensuring convergence when the local data are less balanced. We believe the analysis framework is useful for studying distributed variants of other stochastic variance-reduced methods such as Katyusha [3], and proximal variants such as [33].
Acknowledgements
The work of S. Cen was partly done when visiting MSRA. The work of S. Cen and Y. Chi is supported in part by National Science Foundation under the grant CCF-1806154, Office of Naval Research under the grants N00014-18-1-2142 and N00014-19-1-2404, and Army Research Office under the grant W911NF-18-1-0303.
References
[1] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
[2] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5973–5983, 2018.
[3] Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
[4] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. Signsgd: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 559–568, 2018.
[5] S. De and T. Goldstein. Efficient distributed sgd with variance reduction. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 111–120. IEEE, 2016.
[6] A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
[7] C. Fang, C. J. Li, Z. Lin, and T. Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 687–697, 2018.
[8] I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the nips 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545–552, 2005.
[9] B. Hu, S. Wright, and L. Lessard. Dissipativity theory for accelerating stochastic variance reduction: A unified analysis of svrg and katyusha using semidefinite programs. International Conference on Machine Learning (ICML), 2018.
[10] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
[11] J. Konečnỳ, B. McMahan, and D. Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.
[12] J. D. Lee, Q. Lin, T. Ma, and T. Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. The Journal of Machine Learning Research, 18(1):4404–4446, 2017.
[13] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
[14] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.
[15] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.
[16] S. Mei, Y. Bai, A. Montanari, et al. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774, 2018.
[17] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621, 2017.
[18] L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, and J. R. Kalagnanam. Finite-sum smooth optimization with sarah. arXiv preprint arXiv:1901.07648, 2019.
[19] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 693–701, 2011.
[20] S. J. Reddi, J. Konečnỳ, P. Richtárik, B. Póczós, and A. Smola. Aide: fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.
[21] M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
[22] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[23] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.
[24] O. Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 46–54, 2016.
[25] O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate newton-type method. In International Conference on Machine Learning, pages 1000–1008, 2014.
[26] V. Smith, S. Forte, M. Chenxin, M. Takáč, M. I. Jordan, and M. Jaggi. Cocoa: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18:230, 2018.
[27] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu. D²: Decentralized training over decentralized data. In International Conference on Machine Learning, pages 4855–4863, 2018.
[28] J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In Conference on Learning Theory, pages 1882–1919, 2017.
[29] S. Wang, F. Roosta-Khorasani, P. Xu, and M. W. Mahoney. Giant: Globally improved approximate newton method for distributed optimization. In Advances in Neural Information Processing Systems, pages 2338–2348, 2018.
[30] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690, 2018.
[31] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.
[32] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1509–1519, 2017.
[33] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[34] Y. Zhang and X. Lin. DiSCO: Distributed Optimization for Self-Concordant Empirical Loss. In International Conference on Machine Learning, pages 362–370, 2015.
[35] Y. Zhang, M. J. Wainwright, and J. C. Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510, 2012.
[36] S. Zhao, G.-D. Zhang, M.-W. Li, and W.-J. Li. Proximal scope for distributed sparse learning. In Advances in Neural Information Processing Systems, pages 6552–6561, 2018.
[37] S.-Y. Zhao, R. Xiang, Y.-H. Shi, P. Gao, and W.-J. Li. Scope: scalable composite optimization for learning on spark. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[38] K. Zhou, F. Shang, and J. Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In International Conference on Machine Learning, pages 5975–5984, 2018.
[39] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595–2603, 2010.
Appendix A Useful Lemma
We first establish the following lemma, which will be useful in the proofs for DSVRG and DMiG.
Lemma 1.
Proof.
Given is smooth and convex, the Bregman divergence is smooth and convex as a function of . When Assumptions 1 and 2 hold, we have
Averaging the inequality over gives
(4) 
Assumption 4.a allows us to compare and :
Following similar arguments, using Assumption 4.b we obtain a tighter bound: . Combining this with the estimate in (4) proves the lemma. ∎
Appendix B Proof for distributed SVRG and its regularized variant
In this section, we outline the convergence of DSVRG and DRSVRG in various setups. We adapt the dissipativity theory in [9] to the analysis of DSVRG, which is briefly reviewed in Section B.1.
B.1 Dissipativity Theory for First-Order Methods
Consider the following linear time-invariant system:
where is the state and is the input. Dissipativity theory characterizes how the input forces drive the internal energy stored in the states via an energy function and a supply rate . The theory aims to build the following exponential dissipation inequality:
where . The inequality indicates that at least a fraction of the internal energy will dissipate at every iteration. With an energy function and supply rates , we have
(5) 
as long as there exist a positive semidefinite matrix and nonnegative scalars such that
(6) 
In fact, by left multiplying and right multiplying to (6), we recover (5).
In the following analysis, we simplify the notation by dropping the superscript and the subscript when the meaning is clear, since it is sufficient to analyze the convergence of a specific agent during a single round. Set and we can write the distributed SVRG update as the following linear time-invariant system [9]:
or equivalently,
(7) 
where is selected uniformly at random from local data points and , ,
Recall that the supply rate is defined as
Since and , we will write as , where .
B.2 Proof of Theorem 1
Following [9], we consider the following three supply rates:
(8) 
We have the following lemma and corollary, which are proved in Appendices B.6 and B.7, respectively.
Corollary 1.
As a theoretical evaluation of time complexity, when , by choosing , , a convergence rate no more than is obtained. Therefore, as long as , we have
So we need communication rounds to obtain an optimal solution. Note that in the distributed setting, calculating the full-batch gradient only needs time instead of time due to parallelized computation, so the overall time complexity to find an optimal solution is
where .
B.3 Convergence of DSVRG without Assumption 2
When Assumption 2 does not hold, we can still use arguments similar to those in Appendix B.2 to establish convergence, though at a slower rate. Using the same supply rates (8), Lemma 2 can be modified as below.
Proof.
With smoothness of we have the following estimate:
So we have
and
The estimate of is identical to that in Lemma 2. ∎