1 Introduction
Big data optimization problems arising in machine learning and statistics, such as the training of supervised learning models, are routinely solved in a distributed manner on a cluster of compute nodes
Ben-Nun and Hoefler [2018]. Distributed optimization algorithms are typically iterative methods that alternate between local computations performed on the nodes and expensive communication steps involving all or a subset of the nodes. Due to the need to solve such problems more efficiently, there has been a lot of recent interest in understanding the trade-offs between communication and computation, a concern which is particularly important in the federated learning setting; see Konečný et al. [2016], Caldas et al. [2018], McMahan et al. [2016].

Minibatch SGD. A popular method for solving unconstrained smooth optimization problems of the form
(1) $\min_{x \in \mathbb{R}^d} f(x)$
in situations when the computation of the gradient of $f$ is expensive is minibatch SGD Dekel et al. [2010], Gower et al. [2019]:
(2) $x_{k+1} = x_k - \gamma_k \frac{1}{M} \sum_{m=1}^{M} g_k^m$
Here $\gamma_k$ is the stepsize used at time $k$ and $g_k^m$ is an unbiased estimator of the gradient: $\mathbb{E}\left[g_k^m \mid x_k\right] = \nabla f(x_k)$. In a typical parameter server setup, stochastic gradients $g_k^1, \dots, g_k^M$ are computed in parallel by all (or a subset of) nodes $m = 1, \dots, M$, communicated to a parameter server, which subsequently performs the update (2) and communicates it to the nodes, and the process is repeated. As $M$ grows, the variance of $\frac{1}{M}\sum_{m=1}^{M} g_k^m$ as an estimator of the gradient decreases, which leads to a decrease in the overall number of communications needed to obtain a solution of sufficient quality.

Local SGD. Note that (2) can equivalently be written in the form
$x_{k+1} = \frac{1}{M} \sum_{m=1}^{M} \left( x_k - \gamma_k g_k^m \right),$
which leads to the alternative interpretation of minibatch SGD as averaging the results of a single SGD step performed by all nodes, initiated from the same starting point $x_k$. This simple observation immediately leads to the natural question: can we gain by performing more than a single step of SGD on each node before averaging? By performing what we hope will be useful additional computation locally on the nodes before the expensive aggregation is done, we hope to decrease the number of communication rounds needed. We have just described the local SGD method, formalized as Algorithm 1.
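To make the scheme concrete, below is a minimal serial simulation of local SGD in Python: every node performs H local SGD steps and the local models are then averaged. This is our own illustrative sketch, not the paper's code; the names local_sgd, grad_oracle, gamma and H are ours, and a fixed stepsize and a fixed synchronization period are assumed.

```python
import numpy as np

def local_sgd(grad_oracle, x0, gamma, M, T, H, rng):
    """Serial simulation of local SGD: M nodes, T total steps, averaging every H steps."""
    x = [x0.copy() for _ in range(M)]                      # local iterates, all start at x0
    for k in range(1, T + 1):
        for m in range(M):
            x[m] = x[m] - gamma * grad_oracle(x[m], rng)   # one local SGD step on node m
        if k % H == 0:                                     # communication round: average and broadcast
            avg = sum(x) / M
            x = [avg.copy() for _ in range(M)]
    return sum(x) / M

# Toy usage: minimize f(x) = 0.5 * ||x||^2 with noisy gradients g = x + noise.
rng = np.random.default_rng(0)
noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
print(local_sgd(noisy_grad, np.ones(3), gamma=0.1, M=4, T=200, H=10, rng=rng))
```

Setting H = 1 in this sketch recovers minibatch SGD (2).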
2 Contributions
While local SGD has been popular among practitioners for a long time Coppola [2015], McDonald et al. [2010], its theoretical understanding has remained elusive until very recently Zhou and Cong [2018], Stich [2018], Yu et al. [2018], Wang and Joshi [2018], Jiang and Agrawal [2018], Basu et al. [2019] (see Table 1). The history of the method goes back to the convergence proof in the early work of Mangasarian [1995], but a tight convergence rate has been missing ever since. Although most existing works focus on analyzing local SGD for smooth and nonconvex $f$, there are no analyses specialized to the smooth convex case, and only two papers which provide bounds in the smooth strongly convex case.
In this paper we obtain the first result explicitly covering the convex case, and improve dramatically upon the best known communication complexity result in the strongly convex case (see the last row of Table 1). Moreover, unlike previous results in the strongly convex case that depend on a restrictive gradient boundedness assumption, our results do not have this flaw.
Table 1: Overview of related work on local stochastic gradient methods, in chronological order.

| Reference | Date |
| --- | --- |
| Zhou and Cong [2018] | 8/2017 |
| Stich [2018] | 5/2018 |
| Yu et al. [2018] | 7/2018 |
| Wang and Joshi [2018] | 8/2018 |
| Jiang and Agrawal [2018] | 12/2018 |
| Basu et al. [2019] | 6/2019 |
| THIS WORK | 9/2019 |
An overview of related work on local stochastic gradient methods is given in Table 1.
2.1 Setting and Contributions
In this work we consider minimization problem (1) under the following assumptions:
Assumption 1 (Smoothness and convexity).
We assume $f$ is $L$-smooth and $\mu$-strongly convex (we allow $\mu = 0$). That is, for all $x, y \in \mathbb{R}^d$ we have
$\frac{\mu}{2}\left\|x - y\right\|^2 \le f(x) - f(y) - \left\langle \nabla f(y), x - y \right\rangle \le \frac{L}{2}\left\|x - y\right\|^2.$
Assumption 2.
The stochastic gradients $g_k^m$ are unbiased estimates of the true gradient with uniformly bounded variance when conditioned on $x_k^m$:
$\mathbb{E}\left[g_k^m \mid x_k^m\right] = \nabla f(x_k^m), \qquad \mathbb{E}\left[\left\|g_k^m - \nabla f(x_k^m)\right\|^2 \mid x_k^m\right] \le \sigma^2.$
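As an illustration of Assumption 2, the following sketch (our own construction, not part of the paper) builds a toy single-sample least-squares gradient oracle and empirically checks unbiasedness and the variance at a fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)

full_grad = lambda x: A.T @ (A @ x - b) / n        # gradient of f(x) = (1/2n) ||Ax - b||^2
def stoch_grad(x):
    i = rng.integers(n)                            # sample one data point uniformly
    return A[i] * (A[i] @ x - b[i])                # single-sample gradient estimate

x = rng.standard_normal(d)
g = np.array([stoch_grad(x) for _ in range(50000)])
print(np.linalg.norm(g.mean(axis=0) - full_grad(x)))   # close to 0: unbiasedness
print(((g - full_grad(x)) ** 2).sum(axis=1).mean())    # empirical variance at this point
```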
Note that Assumption 2 is less restrictive than the bounded gradients assumption ($\mathbb{E}\left\|g_k^m\right\|^2 \le G^2$ for some constant $G$) used in several previous analyses, as shown in Table 1. Under this setting, the main contributions of this paper are:
- If $f$ is strongly convex, then by properly choosing the stepsizes and taking a suitable average of the local iterates, we can obtain a point with expected functional suboptimality at most $\epsilon$ when the total number of iterations $T$ and the total number of communication rounds satisfy:
(3)
where polylogarithmic factors are possibly ignored. This tightens the previous analysis of Stich [2018], where more communication rounds were required.
- Furthermore, if $f$ is (possibly weakly) convex, then we can guarantee expected functional suboptimality at most $\epsilon$ provided that:
(4)
- We support our analysis by experiments illustrating the behavior of the algorithm.
3 Convergence Theory
We denote the sequence of time stamps when synchronization happens as $t_1 < t_2 < \dots$. The average of all local iterates is $\bar{x}_k := \frac{1}{M}\sum_{m=1}^{M} x_k^m$ and that of the gradients is $\bar{g}_k := \frac{1}{M}\sum_{m=1}^{M} g_k^m$.
Lemma 1.
Theorem 1.
Corollary 1.
Choosing the stepsizes appropriately and taking $T$ steps, then substituting into (6) and applying some algebraic manipulation, we can conclude that
(7)
(8)
where logarithmic factors are ignored. We see that an appropriate choice of the synchronization interval $H$ recovers the same convergence rate as minibatch SGD up to polylogarithmic factors, with a correspondingly reduced number of communications.
Using similar proof techniques, we can show the following result for weakly convex functions:
Theorem 2.
Corollary 2.
Assume that the conditions of Theorem 2 hold. Choosing the stepsize appropriately and substituting into (9), we have
(10)
We see that with an appropriate choice of the synchronization interval $H$ we recover the same convergence rate as minibatch SGD, with a correspondingly reduced number of communication steps.
4 Experiments
We run experiments on an $\ell_2$-regularized logistic regression problem with $M$ nodes, each with an Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz core. We use the 'a9a' dataset from the LIBSVM library Chang and Lin [2011] and set the penalty as a function of the dataset size $n$. The code was written in Python using MPI Dalcin et al. [2011]. We ran two experiments, with two different stepsizes and minibatch size equal to 1. In both cases we observe convergence to a neighborhood of the solution, although of a different radius. Since we run the experiments on a single machine, communication is very cheap and there is little gain in the time required for convergence. However, the advantage in terms of required communication rounds is self-evident and can lead to significant time improvement under slow communication networks.
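For reference, below is a hedged single-process sketch of this experimental setup: local SGD on $\ell_2$-regularized logistic regression over the a9a data. It assumes scikit-learn is available with an 'a9a' file in the working directory; the penalty, the values of M, H, T and the stepsize are illustrative placeholders rather than the paper's settings, and communication is only simulated by in-memory averaging instead of MPI.

```python
import numpy as np
from sklearn.datasets import load_svmlight_file    # assumes scikit-learn and a local 'a9a' file

X, y = load_svmlight_file("a9a")                   # labels in {-1, +1}
X = X.toarray()
n, d = X.shape
lam = 1.0 / n                                      # illustrative penalty, scaled with dataset size
rng = np.random.default_rng(0)

def stoch_grad(w):
    i = rng.integers(n)                            # minibatch size 1
    margin = y[i] * (X[i] @ w)
    return -y[i] * X[i] / (1.0 + np.exp(margin)) + lam * w

M, H, T, gamma = 8, 20, 10000, 0.05                # placeholders, not the paper's settings
w = [np.zeros(d) for _ in range(M)]
for k in range(1, T + 1):
    for m in range(M):
        w[m] = w[m] - gamma * stoch_grad(w[m])
    if k % H == 0:                                 # communication round: average local models
        avg = sum(w) / M
        w = [avg.copy() for _ in range(M)]
w_final = sum(w) / M
```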
References
- Basu et al. [2019] Debraj Basu, Deepesh Data, Can Karakus, and Suhas Diggavi. Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification, and Local Computations. arXiv:1906.02367, 2019.
- Ben-Nun and Hoefler [2018] Tal Ben-Nun and Torsten Hoefler. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. arXiv:1802.09941, 2018.
- Caldas et al. [2018] Sebastian Caldas, Jakub Konečný, H. Brendan McMahan, and Ameet Talwalkar. Expanding the Reach of Federated Learning by Reducing Client Resource Requirements. arXiv:1812.07210, 2018.
- Chang and Lin [2011] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
- Coppola [2015] Gregory F. Coppola. Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing. PhD thesis, University of Edinburgh, UK, 2015.
- Dalcin et al. [2011] Lisandro D. Dalcin, Rodrigo R. Paz, Pablo A. Kler, and Alejandro Cosimo. Parallel distributed computing using Python. Advances in Water Resources, 34(9):1124–1139, 2011.
- Dekel et al. [2010] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal Distributed Online Prediction using Mini-Batches. arXiv:1012.1367, 2010.
- Gower et al. [2019] Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richtárik. SGD: General Analysis and Improved Rates. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5200–5209, Long Beach, California, USA, 09–15 Jun 2019. PMLR.
- Jiang and Agrawal [2018] Peng Jiang and Gagan Agrawal. A Linear Speedup Analysis of Distributed Deep Learning with Sparse and Quantized Communication. Advances in Neural Information Processing Systems 31, (NeurIPS):2530–2541, 2018.
- Konečný et al. [2016] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated Learning: Strategies for Improving Communication Efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016.
- Mangasarian [1995] O. L. Mangasarian. Parallel Gradient Distribution in Unconstrained Optimization. SIAM Journal on Control and Optimization, 33(6):1916–1925, 1995.
- McDonald et al. [2010] Ryan McDonald, Keith Hall, and Gideon Mann. Distributed Training Strategies for the Structured Perceptron. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 456–464, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. ISBN 1-932432-65-5.
- McMahan et al. [2016] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP volume 54, 2016.
- Stich [2018] Sebastian U. Stich. Local SGD Converges Fast and Communicates Little. arXiv:1805.09767, 2018.
- Wang and Joshi [2018] Jianyu Wang and Gauri Joshi. Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms. arXiv:1808.07576, 2018.
- Yu et al. [2018] Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning. arXiv:1807.06629, 2018.
- Zhou and Cong [2018] Fan Zhou and Guojing Cong. On the Convergence Properties of a K-step Averaging Stochastic Gradient Descent Algorithm for Nonconvex Optimization. In IJCAI International Joint Conference on Artificial Intelligence, volume 2018-July, pages 3219–3227, 2018.
Appendix A Basic Facts and Notation
We denote the sequence of time stamps when synchronization happens as $t_1 < t_2 < \dots$. Given stochastic gradients $g_k^1, \dots, g_k^M$ at time $k$, we define $\bar{g}_k := \frac{1}{M}\sum_{m=1}^{M} g_k^m$.
Throughout the proofs, we will use the variance decomposition that holds for any random vector $X$ with finite second moment:
$\mathbb{E}\left\|X - \mathbb{E}X\right\|^2 = \mathbb{E}\left\|X\right\|^2 - \left\|\mathbb{E}X\right\|^2.$
In particular, its version for a random vector taking each of the values $X_1, \dots, X_M$ with probability $\frac{1}{M}$ gives
(11) $\frac{1}{M}\sum_{m=1}^{M}\left\|X_m - \bar{X}\right\|^2 = \frac{1}{M}\sum_{m=1}^{M}\left\|X_m\right\|^2 - \left\|\bar{X}\right\|^2, \qquad \bar{X} := \frac{1}{M}\sum_{m=1}^{M} X_m.$
As a consequence of (11) we have
$\frac{1}{M}\sum_{m=1}^{M}\left\|X_m - \bar{X}\right\|^2 \le \frac{1}{M}\sum_{m=1}^{M}\left\|X_m\right\|^2.$
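A quick numerical sanity check of the finite-sample identity (11) and its consequence, in the form reconstructed above (the identity is standard; the exact display is our reconstruction):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))                        # M = 8 equally likely vector values
Xbar = X.mean(axis=0)

lhs = ((X - Xbar) ** 2).sum(axis=1).mean()             # (1/M) sum_m ||X_m - Xbar||^2
rhs = (X ** 2).sum(axis=1).mean() - (Xbar ** 2).sum()  # (1/M) sum_m ||X_m||^2 - ||Xbar||^2
print(np.isclose(lhs, rhs))                            # True: identity (11)
print(lhs <= (X ** 2).sum(axis=1).mean())              # True: the consequence of (11)
```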
Proposition 1 (Jensen’s inequality).
For any convex function $h \colon \mathbb{R}^d \to \mathbb{R}$ and any vectors $x_1, \dots, x_M \in \mathbb{R}^d$ we have
$h\left(\frac{1}{M}\sum_{m=1}^{M} x_m\right) \le \frac{1}{M}\sum_{m=1}^{M} h(x_m).$
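A small numerical check of Proposition 1, using log-sum-exp as an arbitrary convex function (our choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.standard_normal((10, 4))                          # M = 10 vectors in R^4
h = lambda v: np.log(np.exp(v).sum())                      # log-sum-exp is convex
print(h(xs.mean(axis=0)) <= np.mean([h(x) for x in xs]))   # True by Jensen's inequality
```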
We denote the Bregman divergence associated with a function $f$ and arbitrary $x, y \in \mathbb{R}^d$ as
$D_f(x, y) := f(x) - f(y) - \left\langle \nabla f(y), x - y \right\rangle.$
Proposition 2.
If $f$ is $L$-smooth and convex, then for any $x$ and $y$ it holds that
(12) $\left\|\nabla f(x) - \nabla f(y)\right\|^2 \le 2L\, D_f(x, y).$
If $f$ satisfies Assumption 1, then
(13)
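Inequality (12), in the form reconstructed above, can be verified numerically; the sketch below uses a convex quadratic $f(x) = \frac{1}{2} x^\top A x$ (our choice), for which the Bregman divergence and the smoothness constant are explicit.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((6, 6))
A = B.T @ B                                        # PSD Hessian: f(x) = 0.5 x^T A x is convex
L = np.linalg.eigvalsh(A).max()                    # smoothness constant of f

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x, y = rng.standard_normal(6), rng.standard_normal(6)
D = f(x) - f(y) - grad(y) @ (x - y)                # Bregman divergence D_f(x, y)
print(np.sum((grad(x) - grad(y)) ** 2) <= 2 * L * D + 1e-9)   # inequality (12) holds
```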
Appendix B Proof of Lemma 1
Proof.
Let be such that . Recall that for a time such that we have and . Hence for the expectation conditional on we have:
Averaging both sides and letting , we have
(14)
Now note that by expanding the square we have,
(15)
We decompose the first term in the last equality again by expanding the square,
Plugging this into (15) we have,
Now average over :
where we used that by definition . Hence,
(16)
Now note that for the first term in (16) we have by Assumption 2,
(17)
For the second term in (16) we have
Averaging over ,
where we used the fact that , which comes from the linearity of expectation. Now we bound in the last inequality by smoothness and then use that Jensen’s inequality implies ,
(18)
Plugging in (18) and (17) into (16) we have,
(19)
Plugging (19) into (14), we get
(20)
Using that we can conclude,
Taking expectations and iterating the above inequality,
It remains to notice that by assumption we have . ∎
Appendix C Two More Lemmas
Lemma 2.
Proof.
This is Lemma 3.1 in Stich [2018]. ∎
Lemma 3.
Suppose that Assumption 2 holds. Then
$\mathbb{E}\left\|\frac{1}{M}\sum_{m=1}^{M}\left(g_k^m - \nabla f(x_k^m)\right)\right\|^2 \le \frac{\sigma^2}{M}.$
Proof.
This is Lemma 3.2 in Stich [2018]. Because the stochastic gradients are independent, the variance of their sum is the sum of their variances, hence
$\mathbb{E}\left\|\frac{1}{M}\sum_{m=1}^{M}\left(g_k^m - \nabla f(x_k^m)\right)\right\|^2 = \frac{1}{M^2}\sum_{m=1}^{M}\mathbb{E}\left\|g_k^m - \nabla f(x_k^m)\right\|^2 \le \frac{\sigma^2}{M}.$
∎
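A short simulation of the conclusion of Lemma 3 as reconstructed above: averaging M independent stochastic gradients, each with variance at most $\sigma^2$, reduces the variance of the averaged estimator to $\sigma^2 / M$. The Gaussian noise model below is our own illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, d, trials = 1.0, 5, 20000
for M in (1, 4, 16):
    # g_k^m - grad f(x_k^m): independent zero-mean noise with E||noise||^2 = sigma^2
    noise = rng.normal(scale=np.sqrt(sigma2 / d), size=(trials, M, d))
    avg = noise.mean(axis=1)                       # (1/M) sum_m (g_k^m - grad f(x_k^m))
    print(M, (avg ** 2).sum(axis=1).mean())        # approximately sigma^2 / M
```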