Empirical risk minimization arises frequently in machine learning applications, where the objective function is the average of losses computed at different data points. As data sets grow, distributed computing architectures are needed to meet scalability requirements in both computation power and storage space, by distributing the learning task over multiple computing nodes. In addition, distributed frameworks are suitable for problems where there are privacy concerns about transmitting and storing all the data in a central location, a scenario related to the nascent field of federated learning. It is, therefore, necessary to develop distributed optimization frameworks that are tailored to solving large-scale empirical risk minimization problems with desirable communication-computation trade-offs, where the data are stored disjointly over different machines.
Due to its low per-iteration cost, a popular solution is distributed stochastic gradient descent (SGD), where the parameter server aggregates gradients from each worker and performs mini-batch gradient updates. However, distributed SGD is not communication-efficient and requires many communication rounds to converge, which partially diminishes the benefit of distribution. Many deterministic optimization methods have been developed to achieve communication efficiency, including but not limited to DANE, AIDE, DiSCO, GIANT, CoCoA, and one-shot averaging [39, 35].
Recent breakthroughs in developing stochastic variance reduced methods such as SAG, SAGA, SVRG, SDCA, MiG, Katyusha, Catalyst, SCOPE [37, 36], SARAH, SPIDER, SpiderBoost, and many more, allow achieving fast convergence and small per-iteration cost at the same time. Yet, distributed schemes of such variance reduced methods that are both practical and theoretically sound are much less developed.
This paper focuses on a general framework of distributed stochastic variance reduced methods presented in Alg. 1, which is natural and easy to implement. At a high level, SVRG-type algorithms contain inner loops for parameter updates via variance-reduced SGD, and outer loops for global gradient and parameter updates. Alg. 1 assigns outer loops to the parameter server, and inner loops to worker machines. The parameter server collects gradients from worker machines and then distributes the global gradient to each machine. Each worker machine then runs the inner loop independently in parallel using variance reduction techniques, and returns the updates to the parameter server at the end. Per iteration, Alg. 1 requires two communication rounds: one communication round is used to average the parameter estimates, and the other is used to average the gradients, which is the same as distributed synchronous SGD. A distributed SVRG method under this framework has been proposed in several works under different scenarios [11, 5, 20] with great empirical success. Surprisingly, a complete theoretical understanding is still largely missing. Moreover, distributed variants using accelerated variance reduction methods have not been developed. The main difficulty in the analysis is that the variance-reduced gradient of each worker is no longer an unbiased gradient estimator when sampling from re-used local data.
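The two-round structure described above can be sketched in a few lines. This is a minimal illustration, not the paper's Alg. 1: it assumes scalar quadratic sample losses of the form 0.5 * a * (x - b)^2, equal local data sizes, an SVRG-style inner loop, and plain averaging of the worker outputs at the parameter server. All function and variable names are hypothetical.

```python
import random

def sample_grad(x, z):
    """Gradient of the toy sample loss 0.5 * a * (x - b)^2 (an assumption)."""
    a, b = z
    return a * (x - b)

def full_grad(x, samples):
    """Average gradient over a list of samples."""
    return sum(sample_grad(x, z) for z in samples) / len(samples)

def local_update_svrg(x_ref, g_ref, local_data, step, m, rng):
    """Inner loop at one worker: m variance-reduced steps on local data only."""
    y = x_ref
    for _ in range(m):
        z = rng.choice(local_data)  # sample from the re-used local data
        v = sample_grad(y, z) - sample_grad(x_ref, z) + g_ref
        y -= step * v
    return y

def distributed_vr(workers, x0, step, m, rounds, seed=0):
    """Outer loop at the parameter server (workers are simulated sequentially)."""
    rng = random.Random(seed)
    x = x0
    for _ in range(rounds):
        # Communication round 1: average the local full gradients.
        g = sum(full_grad(x, w) for w in workers) / len(workers)
        # Workers run their inner loops independently (in parallel in practice).
        ys = [local_update_svrg(x, g, w, step, m, rng) for w in workers]
        # Communication round 2: average the parameter estimates.
        x = sum(ys) / len(ys)
    return x

# Three workers with three (a, b) samples each; curvatures a lie in [1.5, 2.5].
workers = [
    [(1.5, 0.0), (2.5, 1.0), (2.0, 2.0)],
    [(2.0, -1.0), (1.5, 3.0), (2.5, 0.5)],
    [(2.5, 1.5), (2.0, 0.0), (1.5, 2.5)],
]
all_samples = [z for w in workers for z in w]
x_star = sum(a * b for a, b in all_samples) / sum(a for a, _ in all_samples)
x_hat = distributed_vr(workers, x0=0.0, step=0.1, m=20, rounds=15)
print(abs(x_hat - x_star))
```

In this toy setting the local curvatures are close to the global one, so every round contracts the error regardless of which local samples are drawn. With strongly dissimilar workers the same loop is not guaranteed to converge, which is the situation the analysis in this paper addresses.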
On the other hand, several variants of distributed SVRG, e.g. [12, 24, 28], have been proposed with performance guarantees. They try to bypass the biased gradient estimation issue by simulating the process of i.i.d. sampling from all the data, so a random data re-allocation is needed before any batch sample is used again. Such algorithmic steps, i.e., sampling extra data with or without replacement, can be cumbersome and difficult to implement in practice.
1.1 Contributions of This Paper
This paper provides a convergence analysis of a family of naturally distributed stochastic variance reduced methods under the framework described in Alg. 1. By using different variance reduction schemes at the worker machines, we study distributed variants of three representative algorithms in this paper: D-SVRG, D-SARAH with recursive gradient updates [17, 18], and D-MiG with accelerated gradient updates. The contributions of this paper are summarized below.
We suggest a simple and intuitive metric called distributed smoothness to gauge data balancedness among workers, defined as the smoothness of the difference between the local loss function and the global loss function. The metric is deterministic, easy to compute, and applies to arbitrary dataset splitting. We establish the linear convergence of D-SVRG, D-SARAH, and D-MiG under strongly convex losses, as long as the distributed smoothness parameter is smaller than a constant fraction of the strong convexity parameter , e.g. , where the fraction might change for different algorithms.
Under appropriate distributed smoothness, we show that, to reach -accuracy, D-SVRG and D-SARAH (resp. D-MiG) achieve (resp. ) time complexity with rounds of communication, where is the total number of input data points, is the number of worker machines and is the condition number of the global loss function . Compared to the time complexity of the original centralized SVRG and SARAH (resp. MiG), (resp. ), this leads to a speed-up by the number of machines, if the local data size is sufficiently large. Furthermore, our bounds capture the phenomenon that the convergence rate improves as the local loss functions become more similar to the global loss function, i.e., as the distributed smoothness parameter decreases.
When local data are highly unbalanced, the distributed smoothness parameter becomes large, which might lead to algorithm divergence. We suggest regularization as an effective way to handle this situation, and show that by adding larger regularization to machines that are less smooth, one can still ensure linear convergence in a regularized version of D-SVRG, called D-RSVRG, though at a slower rate of convergence.
More generally, the notion of distributed smoothness can also be used to establish the convergence under possibly nonsmooth and nonconvex losses. Under mild conditions, we show that D-SARAH achieves -accuracy in a time complexity of with rounds of communication.
1.2 Related Work
Many algorithms have been proposed for distributed (stochastic) optimization. For conciseness, Table 1 summarizes the most relevant ones to the current paper. Note that previous works considering variance reduction are all on distributed SVRG and its variants. The general framework of distributed variance-reduced methods covering SARAH and MiG and various loss settings in this paper has not been studied before.
|DSVRG ||sampling extra data|
|DASVRG ||sampling extra data|
|Dist. Acc. Grad||none|
|SCOPE ||uniform reg.|
|pSCOPE ||good partition|
The D-SVRG algorithm, presented in Alg. 2, has been empirically studied before in [20, 11] without a theoretical convergence analysis. The pSCOPE algorithm is also a variant of distributed SVRG, and its convergence has been studied under an assumption called good data partition. The SCOPE algorithm is similar to the regularized algorithm D-RSVRG under large regularization. However, our analysis is much more refined: it adds different regularizations to different local workers, characterizes the amount of regularization necessary to ensure convergence with respect to the distributed smoothness of local data, and gracefully degenerates to the unregularized case when distributed smoothness is benign.
There have also been many recent efforts on reducing the communication cost of distributed GD/SGD by gradient quantization [1, 4, 22, 32] and by gradient compression and sparsification [2, 13, 15, 31, 27]. In comparison, we communicate the exact gradient, and it might be an interesting future direction to combine gradient compression schemes with distributed variance reduced stochastic gradient methods.
The rest of this paper is organized as follows. Section 2 presents the problem setup and a general framework of distributed stochastic optimization. Section 3 presents the convergence guarantees of D-SVRG, D-SARAH and D-MiG under appropriate distributed smoothness assumptions. Section 4 introduces regularization to D-SVRG to handle unbalanced data when distributed smoothness does not hold. Section 5 presents extensions to nonsmooth and nonconvex losses. Section 6 presents numerical experiments for validation. Finally, we conclude in Section 7.
2 Problem Setup
Suppose we have a data set , where for , that contains data points. In particular, we do not make any assumptions on their statistical distribution. We consider the following empirical risk minimization problem
where is the parameter to be optimized and is the sample loss function. For brevity, we use to denote throughout the paper.
In a distributed setting, where the data are distributed to machines or workers, we define a partition of the data set as , where , . The th worker, correspondingly, is in possession of the data subset , . We assume there is a parameter server (PS) that coordinates the parameter sharing among the workers. The size of the data held by each worker machine is . When the data is split equally, we have . The original problem (1) can be rewritten as minimizing the following objective function:
where is the local loss function at the th worker machine. (Footnote 1: It is straightforward to state our results under unequal data splitting with proper rescaling.)
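The rewriting of the global objective as an average of local objectives under an equal split can be checked numerically. The sketch below assumes a toy scalar sample loss 0.5 * (x - z)^2 and a contiguous equal-size partition; the helper names are hypothetical.

```python
def sample_loss(x, z):
    """Toy sample loss, chosen only for illustration."""
    return 0.5 * (x - z) ** 2

def empirical_risk(x, data):
    """Average of the sample losses over `data`."""
    return sum(sample_loss(x, z) for z in data) / len(data)

def split_equally(data, n_workers):
    """Disjoint partition into n_workers contiguous blocks of equal size."""
    assert len(data) % n_workers == 0
    size = len(data) // n_workers
    return [data[k * size:(k + 1) * size] for k in range(n_workers)]

data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
parts = split_equally(data, 4)

x = 1.5
global_loss = empirical_risk(x, data)
avg_local_loss = sum(empirical_risk(x, p) for p in parts) / len(parts)
print(global_loss, avg_local_loss)
```

The two printed values agree because every local average carries the same weight when all workers hold the same number of points. With unequal local sizes the local losses would have to be weighted by their share of the data.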
Alg. 1 presents a general framework for distributed stochastic variance reduced methods, which assigns the outer loops to PS and the inner loops to local workers. By using different variance reduction schemes at the worker machines (i.e. LocalUpdate), we obtain distributed variants of different algorithms, as described in later sections.
Throughout, we invoke one or several of the following standard assumptions on the loss functions.
Assumption 1. The sample loss function is -smooth for all .
Assumption 2. The sample loss function is convex for all .
Assumption 3. The empirical risk is -strongly convex.
When is strongly convex, the condition number of is defined as . Denote the minimizer of as and the corresponding optimal value as .
As it turns out, the smoothness of the deviation between the local loss function and the global loss function plays a key role in the convergence analysis, as it measures the balancedness between local data in a simple and intuitive manner. We refer to this as the “distributed smoothness”, which is central in our analysis. In some cases, a weaker notion called restricted smoothness is sufficient, which is defined below.
Definition 1 (Restricted smoothness).
A differentiable function is called -restricted smooth with regard to if , .
The restricted smoothness, compared to standard smoothness, fixes one of the arguments to , and is therefore a much weaker requirement. The following assumption quantifies the distributed smoothness using either restricted smoothness or standard smoothness.
Assumption 4.a. The deviation is -restricted smooth with regard to (cf. Definition 1) for all .
Assumption 4.b. The deviation is -smooth for all .
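For intuition, the distributed smoothness of each worker can be computed in closed form for scalar quadratic sample losses 0.5 * a * (x - b)^2, a toy model assumed here only for illustration. In that case the gradient of the deviation between the global and the local loss is linear in x, and its Lipschitz constant is the gap between the global and the local mean curvature. The names below are hypothetical.

```python
def grad_avg(x, samples):
    """Average gradient of 0.5 * a * (x - b)^2 over (a, b) samples."""
    return sum(a * (x - b) for a, b in samples) / len(samples)

def deviation_grad(x, all_samples, local_samples):
    """Gradient of (global loss - local loss) at x."""
    return grad_avg(x, all_samples) - grad_avg(x, local_samples)

def distributed_smoothness(all_samples, local_samples):
    """Smoothness constant of the deviation for scalar quadratics."""
    a_global = sum(a for a, _ in all_samples) / len(all_samples)
    a_local = sum(a for a, _ in local_samples) / len(local_samples)
    return abs(a_global - a_local)

workers = [
    [(1.0, 0.0), (3.0, 1.0)],   # local mean curvature 2.0
    [(2.0, -1.0), (2.0, 2.0)],  # local mean curvature 2.0
    [(1.0, 0.5), (6.0, 1.5)],   # local mean curvature 3.5
]
all_samples = [s for w in workers for s in w]  # global mean curvature 2.5

c = [distributed_smoothness(all_samples, w) for w in workers]
strong_convexity = sum(a for a, _ in all_samples) / len(all_samples)
print(c, strong_convexity)
```

Here the third worker holds data whose curvature differs most from the global one, so it has the largest constant. Comparing each constant with the strong convexity parameter indicates which workers are least similar to the global problem.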
3 Convergence in the Strongly Convex Case
In this section, we describe and analyze three variance-reduced routines for LocalUpdate in Alg. 1, namely SVRG, SARAH with recursive gradient updates [17, 18], and MiG with accelerated gradient updates, when is strongly convex. We establish the convergence guarantee for each algorithm, respectively.
3.1 Distributed SVRG (D-SVRG)
Theorem 1 (D-SVRG).
Theorem 1 establishes the linear convergence of function values in expectation for D-SVRG, as long as the parameter is sufficiently small. With proper and , the convergence rate can be set as . From the time complexity and the rate expressions, we can see that the smaller , the faster D-SVRG converges. When is set such that is bounded above by a constant smaller than , the time complexity becomes , which improves the convergence rate of SVRG in the centralized setting.
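As a rough illustration of how a linear rate translates into communication rounds and total time, the sketch below uses the generic fact that a per-round contraction factor `rate` reduces an initial gap to `eps` after ceil(log(gap / eps) / log(1 / rate)) rounds. The cost model (one unit per sample gradient, a local full gradient plus the inner iterations per round) and all names are assumptions made for illustration, not the constants of Theorem 1.

```python
import math

def rounds_to_accuracy(rate, eps, initial_gap=1.0):
    """Smallest T with initial_gap * rate**T <= eps, for 0 < rate < 1."""
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must lie strictly between 0 and 1")
    if eps >= initial_gap:
        return 0
    return math.ceil(math.log(initial_gap / eps) / math.log(1.0 / rate))

def total_time(rounds, local_size, inner_iters):
    """Per-worker cost: a local full gradient plus inner steps, every round."""
    return rounds * (local_size + inner_iters)

T = rounds_to_accuracy(rate=0.5, eps=1e-3)
cost = total_time(T, local_size=8000 // 4, inner_iters=2000)
print(T, cost)
```

Under this model the full-gradient term scales with the local data size rather than the total data size, which is where the speed-up over a single machine comes from.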
3.2 Distributed SARAH (D-SARAH)
The LocalUpdate of D-SARAH is also described in Alg. 2. It differs from SVRG in that the stochastic gradient is updated using a recursive formula. Theorem 2 provides the convergence guarantee of D-SARAH.
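The difference between the two gradient estimators can be made concrete with a short sketch. It assumes the standard single-machine SVRG and SARAH recursions and a toy scalar sample loss 0.5 * a * (x - b)^2. Function names and the way the reference gradient is passed in are hypothetical, and Alg. 2 may differ in details.

```python
def sample_grad(x, z):
    """Gradient of the toy sample loss 0.5 * a * (x - b)^2."""
    a, b = z
    return a * (x - b)

def svrg_estimator(y_t, y_ref, g_ref, z):
    """SVRG: correct the sample gradient using the fixed reference point."""
    return sample_grad(y_t, z) - sample_grad(y_ref, z) + g_ref

def sarah_estimator(y_t, y_prev, v_prev, z):
    """SARAH: correct the sample gradient recursively using the previous step."""
    return sample_grad(y_t, z) - sample_grad(y_prev, z) + v_prev

z = (2.0, 1.0)
v_svrg = svrg_estimator(y_t=3.0, y_ref=1.0, g_ref=0.5, z=z)
v_sarah = sarah_estimator(y_t=3.0, y_prev=2.5, v_prev=0.5, z=z)
print(v_svrg, v_sarah)
```

SVRG always compares against the reference point of the current outer round, while SARAH compares against the previous inner iterate and accumulates the correction in the running estimate. The two coincide on the first inner step, when the previous iterate is the reference point and the running estimate is the reference gradient.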
Theorem 2 (D-SARAH).
Theorem 2 establishes the linear convergence of the gradient norm in expectation for D-SARAH, as long as the parameter is small enough. With proper and , the rate can be set as . Similar to D-SVRG, a smaller leads to faster convergence of D-SARAH. When is set such that is bounded above by a constant smaller than , the time complexity becomes , which improves the convergence rate of SARAH in the centralized setting. In particular, Theorem 2 suggests that D-SARAH may allow a larger , compared with D-SVRG, to guarantee convergence.
3.3 Distributed MiG (D-MiG)
The LocalUpdate of D-MiG is described in Alg. 3, which is inspired by the inner loop of the MiG algorithm , a recently proposed accelerated variance-reduced algorithm. Theorem 3 provides the convergence guarantee of D-MiG.
Theorem 3 (D-MiG).
Theorem 3 establishes the linear convergence of D-MiG under standard smoothness of , in order to fully harness the power of acceleration. While we do not make it explicit in the theorem statement, the time complexity of D-MiG also decreases as gets smaller. Furthermore, the time complexity of D-MiG is smaller than that of D-SVRG/D-SARAH when .
4 Adding Regularization Helps Unbalanced Data
So far, we have established the convergence when the distributed smoothness is not too large. Under i.i.d. and balanced data, this requirement can be justified for large enough data size, see Appendix E. However, when such conditions are violated, the algorithms might diverge. In this situation, adding a regularization term might ensure the convergence, at the cost of possibly slowing down the convergence rate. We consider regularizing the local gradient update of SVRG in Alg. 2 as
where the last regularization term penalizes the distance between the current iterate and the reference point . We have the following theorem.
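One plausible reading of such a regularized update is sketched below. It assumes that the regularization enters the SVRG direction as a term proportional to the offset between the current iterate and the reference point, with a worker-specific weight. The exact form of the update is not reproduced in this text, so the formula, the name `reg_k`, and the toy scalar loss are all assumptions.

```python
def sample_grad(x, z):
    """Gradient of the toy sample loss 0.5 * a * (x - b)^2."""
    a, b = z
    return a * (x - b)

def regularized_svrg_direction(y, y_ref, g_ref, z, reg_k):
    """SVRG direction plus a pull of weight reg_k toward the reference point."""
    svrg_part = sample_grad(y, z) - sample_grad(y_ref, z) + g_ref
    return svrg_part + reg_k * (y - y_ref)

z = (2.0, 1.0)
plain = regularized_svrg_direction(y=3.0, y_ref=1.0, g_ref=0.5, z=z, reg_k=0.0)
pulled = regularized_svrg_direction(y=3.0, y_ref=1.0, g_ref=0.5, z=z, reg_k=1.5)
print(plain, pulled)
```

Setting the weight to zero recovers the unregularized direction. A positive weight enlarges the step component that points back toward the reference point, which limits how far a worker with atypical data can drift within one round.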
Theorem 4 (Distributed Regularized SVRG (D-RSVRG)).
Compared with Theorem 1, Theorem 4 relaxes the assumption to , which means that by adding larger regularization to local workers that are not distributed smooth, i.e. those with large , one can still guarantee the convergence of D-RSVRG. However, increasing leads to a slower convergence rate: a large leads to an iteration complexity , similar to gradient descent. Compared with SCOPE, which requires a uniform regularization , our analysis applies tailored regularization to local workers, and potentially allows much smaller regularization to guarantee convergence, since ’s can be much smaller than the smoothness parameter .
5 Convergence in the Nonconvex Case
In this section, we extend the convergence analysis of distributed SARAH to nonconvex loss functions, since SARAH-type algorithms have recently been shown to achieve near-optimal performance for nonconvex problems [30, 18, 7]. Our result is summarized in the theorem below.
Theorem 5 (D-SARAH for nonconvex losses).
where is the agent index selected in the th round for parameter update, i.e. (c.f. line 8 of Alg. 1). The time complexity of finding an -optimal solution is no worse than with proper .
Theorem 5 suggests that D-SARAH converges as long as the step size is small enough. Furthermore, a smaller allows a larger step size , and hence faster convergence. To gain further insights, assuming i.i.d. data, by concentration inequalities under mild conditions , and consequently, the iteration complexity of finding an -accurate solution using D-SARAH is . This is comparable to the best known result for the centralized SARAH-type algorithms [18, 7, 30].
6 Numerical Experiments
Though our focus in this paper is on the convergence analysis, we illustrate the performance of the proposed distributed variance reduced algorithms in various settings as a proof-of-concept.
-regularized logistic regression, where the sample loss is defined as , with the data . We evaluate the performance on the gisette dataset by splitting the data equally among all workers. We scale the data according to , so that the smoothness is estimated as . We choose , and to illustrate the performance under different condition numbers. We use the optimality gap, defined as , to illustrate convergence.
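For readers who want to reproduce a comparable setup, a regularized logistic sample loss and its gradient can be written as follows. The sketch assumes the common form log(1 + exp(-b * <a, x>)) + (lam / 2) * ||x||^2 with a label b in {-1, +1} and a squared-norm regularizer. The formula used in the experiments is not legible in this text, so this form and all names are assumptions.

```python
import math

def logistic_loss(x, a, b, lam):
    """Regularized logistic loss for one sample (a, b), with b in {-1, +1}."""
    margin = b * sum(ai * xi for ai, xi in zip(a, x))
    # log(1 + exp(-margin)), evaluated without overflow for large |margin|.
    softplus = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return softplus + 0.5 * lam * sum(xi * xi for xi in x)

def logistic_grad(x, a, b, lam):
    """Gradient of logistic_loss with respect to x."""
    margin = b * sum(ai * xi for ai, xi in zip(a, x))
    # s = 1 / (1 + exp(margin)), evaluated without overflow.
    if margin >= 0:
        e = math.exp(-margin)
        s = e / (1.0 + e)
    else:
        s = 1.0 / (1.0 + math.exp(margin))
    return [-b * s * ai + lam * xi for ai, xi in zip(a, x)]

x = [0.3, -0.2]
a, b, lam = [1.0, 2.0], -1.0, 0.1
value = logistic_loss(x, a, b, lam)
grad = logistic_grad(x, a, b, lam)
print(value, grad)
```

The regularization weight `lam` sets the strong convexity of the objective, so varying it changes the condition number of the problem.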
For D-SVRG and D-SARAH, the step size is set as . For D-MiG, although the choice of in the theory requires knowledge of , we simply ignore it and set , and the step size to reflect the robustness of the practical performance to parameters. We further use at the PS for better empirical performance. For D-AGD, the step size is set as and the momentum parameter is set as . Following , which sets the iterations of inner loops as , we set to ensure the same number of total inner iterations. We note that such parameters can be further tuned to achieve a better trade-off between communication cost and computation cost in practice.
Fig. 1 illustrates the optimality gap of various algorithms with respect to the number of communication rounds with 4 local workers under different conditioning, and Fig. 2 shows the corresponding result with different numbers of local workers when . The distributed variance-reduced algorithms outperform the distributed AGD, and D-MiG outperforms D-SVRG and D-SARAH when the condition number is large.
Dealing with Unbalanced data.
We justify the benefit of regularization by evaluating the proposed algorithms under unbalanced data allocation. We assign 50%, 30%, 19.9%, and 0.1% of the data to four workers, respectively, and set in the logistic regression loss. To deal with unbalanced data, we perform the regularized update, given in (3), on the worker with the least amount of data, and keep the update on the rest of the workers unchanged. A similar regularized update can be conceived for D-SARAH and D-MiG, resulting in regularized variants, D-RSARAH and D-RMiG. While our theory does not cover them, we still evaluate their numerical performance. We set according to the amount of data on this worker as . We set the number of iterations at workers to on all agents. For D-AGD we set the momentum parameter to , which is a common choice in the convex setting. Fig. 3 shows the optimality gap with respect to the number of communication rounds for all algorithms. It can be seen that unregularized D-SVRG and D-SARAH fail to converge, while the regularized algorithms still converge, verifying the role of regularization in addressing unbalanced data. It is interesting to note that D-MiG still manages to converge even with highly unbalanced data, a phenomenon not fully covered by the current theory and worth further investigation. In this case, the regularization slightly slows down the convergence speed. It is also worth mentioning that the regularization can be flexibly imposed depending on the local data size, rather than homogeneously across all workers.
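The unbalanced allocation can be reproduced with a small helper. The sketch below expresses the four shares in units of 0.1% to avoid floating-point rounding and splits a data set into consecutive blocks. The data set size of 6000 and the helper name are hypothetical choices for illustration.

```python
def split_by_shares(data, permille_shares):
    """Split `data` into consecutive blocks with sizes given in 1/1000 units."""
    assert sum(permille_shares) == 1000
    n_total = len(data)
    sizes = [n_total * s // 1000 for s in permille_shares]
    sizes[0] += n_total - sum(sizes)  # any rounding remainder goes to worker 0
    parts, start = [], 0
    for size in sizes:
        parts.append(data[start:start + size])
        start += size
    return parts

data = list(range(6000))  # placeholder sample indices
parts = split_by_shares(data, [500, 300, 199, 1])  # 50%, 30%, 19.9%, 0.1%
sizes = [len(p) for p in parts]
print(sizes)
```

With these shares the last worker holds only a handful of points, so its local loss can differ substantially from the global one. That worker is the natural candidate for the regularized update.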
7 Conclusions
In this paper, we have developed a convergence theory for a family of distributed stochastic variance reduced methods without sampling extra data, under a mild distributed smoothness assumption that measures the discrepancy between the local and global loss functions. Convergence guarantees are obtained for distributed stochastic variance reduced methods using acceleration and recursive gradient updates, and for minimizing both strongly convex and nonconvex losses. We also suggest regularization as a means of ensuring convergence when the local data are less balanced. We believe the analysis framework is useful for studying distributed variants of other stochastic variance-reduced methods such as Katyusha, as well as proximal variants.
The work of S. Cen was partly done when visiting MSRA. The work of S. Cen and Y. Chi is supported in part by National Science Foundation under the grant CCF-1806154, Office of Naval Research under the grants N00014-18-1-2142 and N00014-19-1-2404, and Army Research Office under the grant W911NF-18-1-0303.
-  D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
-  D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5973–5983, 2018.
-  Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1200–1205. ACM, 2017.
-  J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. Signsgd: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 559–568, 2018.
-  S. De and T. Goldstein. Efficient distributed sgd with variance reduction. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 111–120. IEEE, 2016.
-  A. Defazio, F. Bach, and S. Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, pages 1646–1654, 2014.
-  C. Fang, C. J. Li, Z. Lin, and T. Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 687–697, 2018.
-  I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the nips 2003 feature selection challenge. In Advances in neural information processing systems, pages 545–552, 2005.
-  B. Hu, S. Wright, and L. Lessard. Dissipativity theory for accelerating stochastic variance reduction: A unified analysis of svrg and katyusha using semidefinite programs. International Conference on Machine Learning (ICML), 2018.
-  R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
-  J. Konečnỳ, B. McMahan, and D. Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.
-  J. D. Lee, Q. Lin, T. Ma, and T. Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. The Journal of Machine Learning Research, 18(1):4404–4446, 2017.
-  X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
-  H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.
-  Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.
-  S. Mei, Y. Bai, A. Montanari, et al. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774, 2018.
-  L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pages 2613–2621, 2017.
-  L. M. Nguyen, M. van Dijk, D. T. Phan, P. H. Nguyen, T.-W. Weng, and J. R. Kalagnanam. Finite-sum smooth optimization with sarah. arXiv preprint arXiv:1901.07648, 2019.
-  B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
-  S. J. Reddi, J. Konečnỳ, P. Richtárik, B. Póczós, and A. Smola. Aide: fast and communication efficient distributed optimization. arXiv preprint arXiv:1608.06879, 2016.
-  M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
-  F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
-  S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.
-  O. Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 46–54, 2016.
-  O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate newton-type method. In International conference on machine learning, pages 1000–1008, 2014.
-  V. Smith, S. Forte, M. Chenxin, M. Takáč, M. I. Jordan, and M. Jaggi. Cocoa: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18:230, 2018.
-  H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu. $D^2$: Decentralized training over decentralized data. In International Conference on Machine Learning, pages 4855–4863, 2018.
-  J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In Conference on Learning Theory, pages 1882–1919, 2017.
-  S. Wang, F. Roosta-Khorasani, P. Xu, and M. W. Mahoney. Giant: Globally improved approximate newton method for distributed optimization. In Advances in Neural Information Processing Systems, pages 2338–2348, 2018.
-  Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. Spiderboost: A class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690, 2018.
-  J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.
-  W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017.
-  L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
-  Y. Zhang and X. Lin. DiSCO: Distributed Optimization for Self-Concordant Empirical Loss. International Conference on Machine Learning, pages 362–370, 2015.
-  Y. Zhang, M. J. Wainwright, and J. C. Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510, 2012.
-  S. Zhao, G.-D. Zhang, M.-W. Li, and W.-J. Li. Proximal scope for distributed sparse learning. In Advances in Neural Information Processing Systems, pages 6552–6561, 2018.
-  S.-Y. Zhao, R. Xiang, Y.-H. Shi, P. Gao, and W.-J. Li. Scope: scalable composite optimization for learning on spark. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
-  K. Zhou, F. Shang, and J. Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In International Conference on Machine Learning, pages 5975–5984, 2018.
-  M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.
Appendix A Useful Lemma
We first establish the following lemma, which will be useful in the proofs for D-SVRG and D-MiG.
Averaging the inequality over gives
Assumption 4.a allows us to compare and :
Appendix B Proof for distributed SVRG and its regularized variant
B.1 Dissipativity Theory for First-Order Methods
Consider the following linear time-invariant system:
where is the state and is the input. Dissipativity theory characterizes how the input forces drive the internal energy stored in the states via an energy function and a supply rate . The theory aims to build the following exponential dissipation inequality:
where . The inequality indicates that at least a fraction of the internal energy will dissipate at every iteration. With an energy function and supply rates , we have
as long as there exist a positive semidefinite matrix and non-negative scalars such that
In the following analysis, we simplify the notation by dropping the superscript and the subscript when the meaning is clear, since it is sufficient to analyze the convergence of a specific agent during a single round. Set and we can write the distributed SVRG update as the following linear time-invariant system :
where is selected uniformly at random from local data points and , ,
Recall that the supply rate is defined as
Since and , we will write as , where .
B.2 Proof of Theorem 1
Following , we consider the following three supply rates:
As a theoretical evaluation of time complexity, when , by choosing , , a convergence rate no more than is obtained. Therefore, as long as , we have
So we need communication rounds to obtain an -optimal solution. Note that in the distributed setting, calculating the full-batch gradient only needs time instead of time due to parallelized computation, so the overall time complexity to find an -optimal solution is
B.3 Convergence of D-SVRG without Assumption 2
When Assumption 2 does not hold, we can still use similar arguments as Appendix B.2 and establish convergence, though at a slower rate. Using the same supply rates (8), Lemma 2 can be modified as below.
With -smoothness of we have the following estimate:
So we have
The estimate of is identical to that in Lemma 2. ∎
By summing the inequality and letting , , we have
Therefore, with the following convergence result can be established:
By choosing , , we get a convergence rate no more than