Stochastic Gradient Descent (SGD) is a widely used algorithm to minimize a convex or non-convex function $f$ in which model parameters are updated iteratively as follows:
$$\mathbf{x}^{t+1} = \mathbf{x}^{t} - \eta_t \hat{\mathbf{g}}^{t},$$
where $\hat{\mathbf{g}}^{t}$ is a stochastic gradient of $f$ at $\mathbf{x}^{t}$ and $\eta_t$ is the learning rate. This algorithm can be naively parallelized by adding more workers that independently compute a gradient and then average them at each step to reduce the variance in the estimate of the true gradient dekel2012optimal . This method requires each worker to share its computed gradient with the others at every iteration.
However, it is widely acknowledged that communication is a major bottleneck of this method for large-scale optimization applications mcmahan2016communication ; konevcny2016federated ; lin2017deep . Often, mini-batch parallel SGD is suggested to address this issue by increasing the computation-to-communication ratio. Nonetheless, too large a mini-batch size might degrade performance lin2018don . Along the same lines of increasing the computation-to-communication ratio, local SGD has been proposed to reduce communication mcmahan2016communication ; dieuleveut2019communication . In this method, workers compute (stochastic) gradients and update their parameters locally, and communicate only once in a while to obtain the average of their parameters. Local SGD improves communication efficiency not only by reducing the number of communication rounds, but also by alleviating the synchronization delay caused by waiting for slow workers and evening out the variations in workers’ computing time wang2018cooperative .
On the other hand, since individual gradients of each worker are calculated at different points, this method introduces residual error as opposed to fully synchronous SGD. Therefore, there is a trade-off between having fewer communication rounds and introducing additional errors to the gradient estimates.
The idea of making local updates is not new and has been used in practice for a while konevcny2016federated . However, until recently, there have been few successful efforts to analyze Local SGD theoretically, and it is therefore not yet fully understood. The paper zhang2016parallel shows that for quadratic functions, when the variance of the noise is higher far from the optimum, frequent averaging leads to faster convergence. One of the main questions we want to ask is: how many communication rounds are needed for Local SGD to have the same convergence rate as synchronized parallel SGD while achieving performance that improves linearly in the number of workers?
stich2018local was among the earlier works that tried to answer this question for general strongly convex and smooth functions and showed that the communication rounds can be reduced up to a factor of , without affecting the asymptotic convergence rate (up to constant factors), where is the total number of iterations and is number of parallel workers.
Focusing on smooth and possibly non-convex functions which satisfy a Polyak-Lojasiewicz condition, haddadpour2019local demonstrates that only communication rounds are sufficient to achieve asymptotic performance that scales proportionately to .
More recently, khaled2019tighter and stich2019error improve upon the previous works by showing linear-speed up for Local SGD with only communication rounds when data is identically distributed among workers and is strongly convex. Their works also consider the cases when is not necessarily strongly-convex as well as the case of data being heterogeneously distributed among workers in khaled2019tighter .
In this work, we focus on smooth and strongly-convex functions with a very general noise model. The main contribution of this paper is to propose a communication strategy which requires only communication rounds to achieve performance that scales as in the number of workers. To the best of the authors’ knowledge, this is the only work to show this result (without additional poly-logarithmic terms and constants). Our analysis can also recover some of the best known rates for special cases, e.g., when is constant, where is defined as the length of intercommunication intervals. A summary of our results compared to the available literature can be found in Table 1.
The rest of this paper is organized as follows. In the following subsection we outline the related literature and ongoing works. In Section 2 we define the main problem and state our assumptions. We present our theoretical findings in Section 3 and a sketch of the proofs in Section 4, followed by numerical experiments in Section 5 and concluding remarks in Section 6.
1.1 Related Works
There has been a lot of effort in recent research to account for communication delays and training time when designing faster algorithms mcdonald2010distributed ; zhang2015deep ; bijral2016data ; kairouz2019advances . See tang2020communication for a comprehensive survey of communication-efficient distributed training algorithms considering both system-level and algorithm-level optimizations.
Many works study the communication complexity of distributed methods for convex optimization arjevani2015communication ; woodworth2020local and statistical estimation zhang2013information . woodworth2020local presents a rigorous comparison of Local SGD with local steps and mini-batch SGD with times larger mini-batch size and the same number of communication rounds (we will refer to such a method as large mini-batch SGD) and shows regimes in which each algorithm performs better: Local SGD is strictly better than large mini-batch SGD when the functions are quadratic. Moreover, they prove a lower bound on the worst-case error of Local SGD that is higher than the worst-case error of large mini-batch SGD in a certain regime. zhang2013information studies the minimum amount of communication required to achieve centralized minimax-optimal rates by establishing lower bounds on minimax risks for distributed statistical estimation under a communication budget.
A parallel line of work studies the convergence of Local SGD with non-convex functions zhou2017convergence . yu2019parallel was among the first works to present provable guarantees of Local SGD with linear speed up. wang2018cooperative and koloskova2020unified present unified frameworks for analyzing decentralized SGD with local updates, elastic averaging or changing topology. The follow-up work wang2018adaptive presents ADACOMM, an adaptive communication strategy that starts with infrequent averaging and then increases the communication frequency in order to achieve a low error floor. They analyze the error-runtime trade-off of Local SGD with nonconvex functions and propose communication times to achieve faster runtime.
In One-Shot Averaging (OSA), workers perform local updates with no communication during the optimization until the end, when they average their parameters. This method can be seen as an extreme case of Local SGD with , on the opposite end of synchronous SGD mcdonald2009efficient ; zinkevich2010parallelized ; zhang2013communication ; rosenblatt2016optimality ; godichon2017rates . dieuleveut2019communication provides a non-asymptotic analysis of mini-batch SGD and one-shot averaging as well as regimes in which mini-batch SGD could outperform one-shot averaging.
Another line of work reduces the communication by compressing the gradients and hence limiting the number of bits transmitted in every message between workers lin2017deep ; alistarh2017qsgd ; wangni2018gradient ; stich2018sparsified ; stich2019error .
Asynchronous methods have been studied widely due to their advantages over synchronous methods, which suffer from synchronization delays caused by slower workers olshevsky2018robust . wang2019matcha studies the error-runtime trade-off in decentralized optimization and proposes MATCHA, an algorithm which parallelizes inter-node communication by decomposing the topology into matchings. hendrikx2019accelerated provides an accelerated stochastic algorithm for decentralized optimization of finite-sum objective functions that, by carefully balancing the ratio between communications and computations, matches the rates of the best known sequential algorithms while having the network scaling of optimal batch algorithms. However, these methods are relatively more involved and they often require full knowledge of the network, solving a semi-definite program and/or calculating communication probabilities (schedules).
For a positive integer $s$, we define $[s] := \{1, \ldots, s\}$. We use bold letters to represent vectors. We denote the vectors of all $0$s and all $1$s by $\mathbf{0}$ and $\mathbf{1}$, respectively. We use $\|\cdot\|$ for the Euclidean norm.
2 Problem Formulation
Suppose there are workers , trying to minimize in parallel. We assume all workers have access to through noisy gradients. In Local SGD, workers perform local gradient steps and occasionally calculate the average of all workers’ iterates.
Having access to the same objective function is of special interest if the data is stored in one place accessible to all machines or is distributed identically among workers with no memory constraints. We hope that results presented here can be extended to applications with heterogeneous data distributions khaled2019tighter .
We will make the following additional assumptions.
Function is differentiable, -strongly convex and -smooth for . In particular,
We define to be the condition number of .
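The displayed inequalities of Assumption 1 are not reproduced above. As a reminder only, and writing $\mu$ and $L$ for the strong-convexity and smoothness constants (symbol names assumed here), the standard definitions of these two properties require, for all $\mathbf{x}, \mathbf{y}$:

```latex
\frac{\mu}{2}\,\|\mathbf{y}-\mathbf{x}\|^2
\;\le\;
f(\mathbf{y}) - f(\mathbf{x}) - \langle \nabla f(\mathbf{x}),\, \mathbf{y}-\mathbf{x} \rangle
\;\le\;
\frac{L}{2}\,\|\mathbf{y}-\mathbf{x}\|^2 .
```

With these definitions the condition number is the ratio $L/\mu$, which is at least $1$.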
We make the following assumption on the noise of the stochastic gradients.
Each worker has access to a gradient oracle which returns an unbiased estimate of the true gradient in the form, such that is a zero-mean conditionally independent random noise with its expected squared norm error bounded as
where are constants.
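The display of Assumption 2 is missing above. One common way to write a noise model of the kind described in the text, with a uniformly bounded part and a strong-growth part, is sketched below. The symbols $\boldsymbol{\epsilon}_i$, $c$ and $\sigma^2$ are assumed names for the noise and the two constants, not necessarily those of the original statement.

```latex
\hat{\mathbf{g}}_i(\mathbf{x}) = \nabla f(\mathbf{x}) + \boldsymbol{\epsilon}_i,
\qquad
\mathbb{E}\!\left[\boldsymbol{\epsilon}_i \mid \mathbf{x}\right] = \mathbf{0},
\qquad
\mathbb{E}\!\left[\|\boldsymbol{\epsilon}_i\|^2 \mid \mathbf{x}\right]
\;\le\; c\,\|\nabla f(\mathbf{x})\|^2 + \sigma^2 .
```

In this form, $c = 0$ gives the uniformly bounded case and $\sigma = 0$ gives a strong-growth condition.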
To save space, we define as the stochastic gradient of node at iteration , and as the true gradient at the same point.
The noise model of Assumption 2 is very general and it includes the common case with uniformly bounded squared norm error when . As it is noted by zhang2016parallel , the advantage of periodic averaging compared to one-shot averaging only appears when is large. Therefore, to study Local SGD, it is important to consider a noise model as in Assumption 2 to capture the effects of frequent averaging. Among the related works mentioned in Table 1, only stich2019error and haddadpour2019local analyze this noise model while the rest study the special case with . SGD under this noise model with and was first studied in schmidt2013fast under the name strong-growth condition. Therefore we refer to the noise model considered in this work as uniform with strong-growth.
In Local SGD, each worker holds a local parameter at iteration and a set of communication times, and performs the following update:
When , we recover the fully synchronized parallel SGD, while recovers one-shot averaging. The pseudo code for Local SGD is provided as Algorithm 1.
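The update described above can be sketched in a few lines. This is a minimal illustration rather than the paper's Algorithm 1: it uses a scalar parameter, and the names `local_sgd`, `make_noisy_oracle`, the quadratic objective, the constants `c` and `sigma`, and the step-size schedule are all assumptions made for the example.

```python
import random


def local_sgd(grad_oracle, x0, num_workers, num_iters, comm_times, step_size):
    """Scalar Local SGD sketch: local steps, averaging at iterations in comm_times."""
    comm_times = set(comm_times)
    xs = [x0] * num_workers  # one local iterate per worker, identical start
    for t in range(num_iters):
        eta = step_size(t)
        # Every worker takes one local stochastic gradient step.
        xs = [x - eta * grad_oracle(x) for x in xs]
        # At a communication time, all workers replace their iterate by the average.
        if (t + 1) in comm_times:
            avg = sum(xs) / num_workers
            xs = [avg] * num_workers
    return sum(xs) / num_workers  # averaged iterate after the last step


def make_noisy_oracle(rng, curvature=1.0, c=0.5, sigma=1.0):
    """Gradient of f(x) = curvature * x**2 / 2 plus zero-mean Gaussian noise
    with variance c * grad**2 + sigma**2 (hypothetical constants)."""
    def oracle(x):
        grad = curvature * x
        std = (c * grad * grad + sigma * sigma) ** 0.5
        return grad + rng.gauss(0.0, std)
    return oracle


if __name__ == "__main__":
    rng = random.Random(0)
    oracle = make_noisy_oracle(rng)
    step = lambda t: 2.0 / (t + 10.0)  # decaying step size, chosen for illustration
    # Averaging after every step versus averaging only once at the end.
    x_sync = local_sgd(oracle, 5.0, 8, 2000, range(1, 2001), step)
    x_osa = local_sgd(oracle, 5.0, 8, 2000, [2000], step)
    print("synchronous:", x_sync, "one-shot:", x_osa)
```

Passing every iteration as a communication time corresponds to the fully synchronized case, and passing only the final iteration corresponds to one-shot averaging.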
The main goal of this paper is to study the effect of communication times on the convergence of the Local SGD and provide better theoretical guarantees. In what follows, we claim that by carefully choosing the step size, linear speed-up of parallel SGD can be attained with only a small number of communication instances.
3 Convergence Results
In this section we present our convergence results for Local SGD. In the following theorem, we show an upper bound for the sub-optimality error, in the sense of function value, for any choice of communication times .
Before proceeding with our results, let us introduce some notation. Let be the communication times. Define , as the length of the -th inter-communication interval, for . Moreover, define as the average of the iterates of all workers. Notice that for .
The main results of this paper will be obtained by specializing the following bound.
The last term in Equation (3) is due to the disagreement between workers (consensus error), introduced by local computations without any communication. As the inter-communication intervals become larger, becomes larger as well and increases the overall optimization error. This term explains the trade-off between communication efficiency and the optimization error.
Theorem 1 not only bounds the optimization error, but also introduces a methodological approach to selecting the communication times to achieve smaller errors. In scenarios where the user can afford a certain number of communications, they can select to minimize the last term in (3).
We next discuss the implications of Theorem 1 under various conditions.
3.1 Fixed-Length Intervals
A simple way to select the communication times is to split the whole training time into intervals of length at most . Then we can use the following bound in Equation (3),
We state this result formally in the following corollary.
Suppose assumptions of Theorem 1 hold and in addition, workers communicate at least once every iterations. Then,
Setting we achieve linear-speed up in the number of workers, which is equivalent to a communication complexity of . To the best of the authors’ knowledge, this is the tightest communication complexity that is shown to achieve linear speed-up. khaled2019tighter and stich2019error have shown a similar communication complexity, however with slightly higher degrees of dependence on , e.g., in khaled2019tighter .
Recovering Synchronized SGD.
When , the last term in (4) disappears and we recover the convergence rate of parallel SGD, albeit with a worse dependence on .
3.2 Varying Intervals
In the previous subsection, we observed that with our current analysis, having fixed-length inter-communication intervals, linear speed-up can be achieved with only rounds of communications. A natural question that might arise is whether we can improve the result above even further.
Let us allow consecutive inter-communication intervals, i.e., , grow linearly, where are the communication times. The following theorem presents a performance guarantee for this choice of communication times.
The choice of communication times in Theorem 2 aligns with the intuition that workers need to communicate more frequently at the beginning of the optimization. As the step-sizes become smaller and workers’ local parameters get closer to the global minimum, they diverge more slowly from each other and, hence, less communication is required to re-align them.
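The two schedules discussed in Sections 3.1 and 3.2 can be generated as follows. This is a sketch under stated assumptions: the function names and the integer `growth` factor are hypothetical, the `i`-th interval is taken to have length `growth * i`, and a final communication at the last iteration is always added. The exact constants of Theorem 2 are not reproduced here.

```python
def fixed_interval_times(num_iters, interval):
    """Communicate every `interval` iterations, and always at the final iteration."""
    times = list(range(interval, num_iters + 1, interval))
    if not times or times[-1] != num_iters:
        times.append(num_iters)
    return times


def growing_interval_times(num_iters, growth):
    """Communication times whose i-th interval has length growth * i (linear growth)."""
    times, t, i = [], 0, 1
    while True:
        t += growth * i          # the next interval is longer than the previous one
        if t >= num_iters:
            break
        times.append(t)
        i += 1
    times.append(num_iters)      # final averaging step
    return times


if __name__ == "__main__":
    print(fixed_interval_times(100, 10))
    print(growing_interval_times(100, 2))
```

With linearly growing intervals the communication times are spaced quadratically, so early iterations, where step-sizes are large, receive most of the communication rounds.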
Choosing communication rounds , we achieve an error that scales as in the number of workers when . This is the main result of this paper: it shows that we can get a linear speedup in the number of workers by simply increasing the number of iterations while keeping the total number of communications bounded.
4 Sketch of Proof
Here we give an outline of the proofs for the results presented in this paper. The proofs of the following lemmas are left to the Appendix.
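A common approach in analyzing parallel algorithms such as Local SGD is to study the evolution of the sequence of averaged iterates $\{\bar{\mathbf{x}}^{t}\}$. We have,
$$\bar{\mathbf{x}}^{t+1} = \bar{\mathbf{x}}^{t} - \eta_t \hat{\mathbf{g}}^{t},$$
where $\hat{\mathbf{g}}^{t} = \frac{1}{N}\sum_{i=1}^{N} \hat{\mathbf{g}}_i^{t}$ is the average of the stochastic gradient estimates of all workers.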
Let us define to be the optimality error. The following lemma, which is similar to a part of the proof found in haddadpour2019local , bounds the optimality error at each iteration recursively.
Equipped with Lemma 1, we can bound the consensus error () as well as the term in the following lemmas.
In the following lemmas, we utilize the structure of the problem to bound the consensus error recursively.
This lemma bounds how much the consensus error grows at each iteration. Of course, when workers communicate, this error resets to zero and thus we can calculate an upper bound for the consensus error, given the last iteration at which communication occurred and the step-size sequence. The following lemma takes care of that. Before stating it, let us define .
Let assumptions of Theorem 1 hold. Then,
Our next lemma bounds .
Under Assumption 2 we have,
5 Numerical Experiments
To verify our findings and compare different communication strategies in Local SGD, we performed the following numerical experiments.
5.1 Quadratic Function With Strong-Growth Condition
As discussed in zhang2016parallel ; dieuleveut2019communication , under uniformly bounded variance, one-shot averaging performs asymptotically as well as mini-batch SGD. Therefore, to fully capture the importance of the choice of communication times , we design a hard problem, where noise variance is uniform with strong-growth condition, defined in Assumption 2. Let us define where,
, where and ,, each worker samples a and uses as a stochastic estimate of . It is easy to verify that is -strongly convex, and , where and .
We use Local SGD to minimize using different communication strategies. We select , machines and iterations and the step-size sequence with . We start each simulation from the initial point of and repeat each simulation times. The average of the results is reported in Figures 1(a) and 1(b). Moreover, the average performance of Local SGD with different numbers of workers and the communication strategy proposed in this paper with is shown in Figure 1(c) along with the respective convergence rate of .
Figure 1(a) shows that the method with increasing communication intervals () proposed in this paper performs better than all the other communication strategies in the transient time as well as in the final error, while requiring far fewer communication rounds. In particular, the method with the same number of communications but fixed intervals () has both higher transient error and final error. This affirms the advantages of having more frequent communication at the beginning of the optimization. Indeed, observe that in Figure 1(a), the only method which outperforms the method we propose is the one that communicates at every step.
Figure 1(b) reveals the effectiveness of each communication round in different methods. We observe that there’s an initial spike in the initial communications in methods and . This is mainly because these two methods have more frequent communications at the beginning of the training, where the step-sizes are larger. Other methods experience this increase as well, however since they communicate later, it’s not observed in this figure. Indeed, observe that the only method which makes better use of communication periods than our method in Figure 1(b) is one-shot averaging, which is not competitive in terms of its final error.
Figure 1(c) verifies that linear-speed up in the number of workers can be achieved with only communication rounds. Moreover, it shows that Local SGD achieves the optimal convergence rate of asymptotically.
5.2 Regularized Logistic Regression
We also performed additional numerical experiments with regularized logistic regression using two real data sets. Due to space constraints, the results are presented in the supplementary information.
6 Conclusion
We have presented a new analysis of Local SGD and studied the effect of choice of communication times on the final optimality error. We proposed a communication strategy which achieves linear speed-up in the number of workers with only communication rounds, independent of the total number of iterations . Numerical experiments further confirmed our theoretical findings, and showed that our method achieves smaller error than previous methods using fewer communications.
The results presented in this paper could help speed up training in many machine learning applications. The potential broader impacts are therefore somewhat generic for machine learning: this research could amplify all the benefits ML can bring by making it cheaper in terms of computational cost, while simultaneously amplifying all the ways ML could be misused.
- [AGL17] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
- [AS15] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in neural information processing systems, pages 1756–1764, 2015.
- [BSS16] Avleen S Bijral, Anand D Sarwate, and Nathan Srebro. On data dependence in distributed stochastic optimization. arXiv preprint arXiv:1603.04379, 2016.
- [CL11] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- [DGBSX12] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
- [DP19] Aymeric Dieuleveut and Kumar Kshitij Patel. Communication trade-offs for local-sgd with large step size. In Advances in Neural Information Processing Systems, pages 13579–13590, 2019.
- [GBS20] Antoine Godichon-Baggioni and Sofiane Saadane. On the rates of convergence of parallelized averaged stochastic gradient algorithms. Statistics, pages 1–18, 2020.
- [HBM19] Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. An accelerated decentralized stochastic proximal algorithm for finite sums. In Advances in Neural Information Processing Systems, pages 952–962, 2019.
- [HKMC19] Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Local sgd with periodic averaging: Tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems, pages 11080–11092, 2019.
- [KLB20] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian U Stich. A unified theory of decentralized sgd with changing topology and local updates. arXiv preprint arXiv:2003.10422, 2020.
- [KMA19] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
- [KMR20] A Khaled, K Mishchenko, and P Richtárik. Tighter theory for local sgd on identical and heterogeneous data. In The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020), 2020.
- [KMY16] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
- [LHM18] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.
- [LSPJ18] Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large mini-batches, use local sgd. arXiv preprint arXiv:1808.07217, 2018.
- [MHM10] Ryan McDonald, Keith Hall, and Gideon Mann. Distributed training strategies for the structured perceptron. In Human language technologies: The 2010 annual conference of the North American chapter of the association for computational linguistics, pages 456–464. Association for Computational Linguistics, 2010.
- [MMR17] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, 2017.
- [MMS09] Ryan Mcdonald, Mehryar Mohri, Nathan Silberman, Dan Walker, and Gideon S Mann. Efficient large-scale distributed training of conditional maximum entropy models. In Advances in neural information processing systems, pages 1231–1239, 2009.
- [RN16] Jonathan D Rosenblatt and Boaz Nadler. On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA, 5(4):379–404, 2016.
- [SCJ18] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. In Advances in Neural Information Processing Systems, pages 4447–4458, 2018.
- [SK19] Sebastian U Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for sgd with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.
- [SOP20] Artin Spiridonoff, Alex Olshevsky, and Ioannis Ch Paschalidis. Robust asynchronous stochastic gradient-push: asymptotically optimal and network-independent performance for strongly convex functions. Journal of Machine Learning Research, 2020.
- [SR13] Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370, 2013.
- [Sti19] Sebastian U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2019.
- [TSC20] Zhenheng Tang, Shaohuai Shi, Xiaowen Chu, Wei Wang, and Bo Li. Communication-efficient distributed deep learning: A comprehensive survey. arXiv preprint arXiv:2003.06307, 2020.
- [WJ18a] Jianyu Wang and Gauri Joshi. Adaptive communication strategies to achieve the best error-runtime trade-off in local-update sgd. Systems for ML, 2018.
- [WJ18b] Jianyu Wang and Gauri Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576, 2018.
- [WPS20] Blake Woodworth, Kumar Kshitij Patel, Sebastian U Stich, Zhen Dai, Brian Bullins, H Brendan McMahan, Ohad Shamir, and Nathan Srebro. Is local sgd better than minibatch sgd? arXiv preprint arXiv:2002.07839, 2020.
- [WSY19] Jianyu Wang, Anit Kumar Sahu, Zhouyi Yang, Gauri Joshi, and Soummya Kar. Matcha: Speeding up decentralized sgd via matching decomposition sampling. arXiv preprint arXiv:1905.09435, 2019.
- [WWLZ18] Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.
- [YYZ19] Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5693–5700, 2019.
- [ZC18] Fan Zhou and Guojing Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 3219–3227. AAAI Press, 2018.
- [ZCL15] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging sgd. In Advances in neural information processing systems, pages 685–693, 2015.
- [ZDJW13] Yuchen Zhang, John Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pages 2328–2336, 2013.
- [ZDSMR16] Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Parallel sgd: When does averaging help? arXiv preprint arXiv:1606.07365, 2016.
- [ZDW13] Yuchen Zhang, John C Duchi, and Martin J Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14(1):3321–3363, 2013.
- [ZWLS10] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.
Appendix A Missing Proofs
Let us define the following notations used in the proofs presented here.
Moreover, define .
Proof of Lemma 1.
We state an important identity in the following lemma.
Let be arbitrary vectors. Define . Then,
Proof of Lemma 2.
By -smoothness of ,
Moreover, by -strong convexity of ,
Now, consider the second term on the right hand side of (12). We have,
where are defined at the beginning of this section and and we used Lemma 5 in the third equation and the conditional independence of to use in the last equality. Taking full expectation of the two relations above with respect to and combining them with (12) completes the proof. ∎
Before proving this lemma, let us state and prove the following lemma.
Let be integers. Define . We then have
where we used the inequality as well as the standard technique of viewing as a Riemann sum for and observing that the Riemann sum overstates the integral. Exponentiating both sides now implies the lemma. ∎