It is often important to leverage parallelism in order to tackle large scale stochastic optimization problems. A prime example is the task of minimizing the loss of machine learning models with millions or billions of parameters over enormous training sets.
One popular distributed approach is local stochastic gradient descent (SGD) (Zinkevich et al., 2010; Coppola, 2015; Zhou and Cong, 2018; Stich, 2018), also known as “parallel SGD” or “Federated Averaging” (McMahan et al., 2016), which is commonly applied to large scale convex and non-convex stochastic optimization problems, including in data center and “Federated Learning” settings (Kairouz et al., 2019). [Footnote 1: Federated Averaging is a specialization of local SGD to the federated setting, where (a) data is assumed to be heterogeneous (not i.i.d.) across workers, (b) only a handful of clients are used in each round, and (c) updates are combined with a weighted average to accommodate unbalanced datasets.] Local SGD uses $M$ parallel workers which, in each of $R$ rounds, independently execute $K$ steps of SGD starting from a common iterate, and then communicate and average their iterates to obtain the common iterate from which the next round begins. Overall, each machine computes $KR$ stochastic gradients and executes $KR$ SGD steps locally, for a total of $KRM$ overall stochastic gradients computed (and so $KRM$ samples used), with $R$ rounds of communication (every $K$ steps of computation).
Given the appeal and usage of local SGD, there is significant value in understanding its performance and limitations theoretically, and in comparing it to other alternatives and baselines that have the same computation and communication structure. That is, other methods that are distributed across $M$ machines and compute $K$ gradients per round of communication for $R$ rounds, for a total of $KR$ gradients per machine and $R$ communication steps. This structure can also be formalized through the graph oracle model of Woodworth et al. (2018, see also Section 2).
So, how does local SGD compare to other algorithms with the same computation and communication structure? Is local SGD (or perhaps an accelerated variant) optimal in the same way that (accelerated) SGD is optimal in the sequential setting? Is it better than baselines?
A natural alternative and baseline is minibatch SGD (Dekel et al., 2012; Cotter et al., 2011; Shamir and Srebro, 2014), a simple method for which we have a complete and tight theoretical understanding. Within the same computation and communication structure, minibatch SGD can be implemented as follows: Each round, calculate the $K$ stochastic gradient estimates (at the current iterate) on each machine, and then average all $KM$ estimates to obtain a single gradient estimate. That is, we can implement minibatch SGD that takes $R$ stochastic gradient steps, with each step using a minibatch of size $KM$. This is the fair and correct minibatch SGD to compare to, and when we refer to “minibatch SGD” we refer to this implementation ($R$ steps with minibatch size $KM$).
Local SGD seems intuitively better than minibatch SGD, since even when the workers are not communicating, they are making progress towards the optimum. In particular, local SGD performs $K$ times more updates over the course of optimization, and can be thought of as computing gradients at less “stale” and more “updated” iterates. For this reason, it has been argued that local SGD is at least as good as minibatch SGD, especially in convex settings where averaging iterates cannot hurt you. But can we capture this advantage theoretically to understand how and when local SGD is better than minibatch SGD? Or even just establish that local SGD is at least as good?
However, a satisfying analysis has so far proven elusive. In fact, every analysis that we are aware of for local SGD in the general convex (or strongly convex) case with a typical noise scaling (e.g. as arising from supervised learning) not only does not improve over minibatch SGD, but is actually strictly dominated by minibatch SGD! But is this just a deficiency of these analyses, or is local SGD actually not better, and perhaps worse, than minibatch SGD? In this paper, we show that the answer to this question is “sometimes.” There is a regime in which local SGD indeed matches or improves upon minibatch SGD, but perhaps surprisingly, there is also a regime in which local SGD really is strictly worse than minibatch SGD.
In Section 3, we start with the special case of quadratic objectives and show that, at least in this case, local SGD is strictly better than minibatch SGD in the worst case, and that an accelerated variant is even minimax optimal.
We then turn to general convex objectives. In Section 4 we prove the first error upper bound on the performance of local SGD which is not dominated by minibatch SGD’s upper bound with a typical noise scaling. In doing so, we identify a regime (where $M$ is large and $K \gtrsim R$) in which local SGD performs strictly better than minibatch SGD in the worst case. However, our upper bound does not show that local SGD is always as good or better than minibatch SGD. In Section 5, we show that this is not just a failure of our analysis. We prove a lower bound on the worst-case error of local SGD that is higher than the worst-case error of minibatch SGD in a certain regime!
We demonstrate this behaviour empirically, using a logistic regression problem where local SGD indeed behaves much worse than minibatch SGD in the theoretically-predicted problematic regime.
Thus, while local SGD is frequently better than minibatch SGD—and we can now see this both in theory and in practice (see experiments by e.g. Zhang et al., 2016; Lin et al., 2018; Zhou and Cong, 2018)—our work identifies regimes in which users should be wary of using local SGD without considering alternatives like minibatch SGD, and might want to seek alternative methods that combine the best of both, and attain optimal performance in all regimes.
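We consider the stochastic convex optimization problem:

$$\min_{x} F(x) := \mathbb{E}_{z \sim \mathcal{D}}\, f(x; z). \tag{1}$$

We will study distributed first-order algorithms that compute stochastic gradient estimates at a point $x$ via $g(x; z) = \nabla f(x; z)$ based on a sample $z \sim \mathcal{D}$. Our focus is on objectives $F$ that are $H$-smooth, either (general) convex or $\lambda$-strongly convex, with a minimizer $x^*$ with $\|x^*\| \le B$. [Footnote 2: An $H$-smooth and $\lambda$-strongly convex function satisfies $\frac{\lambda}{2}\|x - y\|^2 \le F(y) - F(x) - \langle \nabla F(x), y - x \rangle \le \frac{H}{2}\|x - y\|^2$. We allow $\lambda = 0$, in which case $F$ is general convex.] We consider $g$ which has uniformly bounded variance, i.e. $\sup_x \mathbb{E}_{z \sim \mathcal{D}} \|g(x; z) - \nabla F(x)\|^2 \le \sigma^2$. We use $\mathcal{F}(H, \lambda, B, \sigma^2)$ to refer to the set of all pairs $(f, \mathcal{D})$ which satisfy these properties. All of the analysis in this paper can be done either for general convex or strongly convex functions, and we prove all of our results for both cases. For conciseness and clarity, when discussing the results in the main text, we will focus on the general convex case. However, the picture in the strongly convex case is mostly the same.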
An important instance of (1) is a supervised learning problem where is the loss on a single sample. When (referring to derivatives w.r.t. the first argument), then and also . Thus, assuming that the upper bounds on are comparable, the relative scaling of parameters we consider as most “natural” is .
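For simplicity, we consider initializing all algorithms at zero. Then, local SGD with $M$ machines, $K$ stochastic gradients per round, and $R$ rounds of communication calculates its $t$th iterate on the $m$th machine for $t \in [KR]$ via

$$x_t^m = \begin{cases} x_{t-1}^m - \eta_t\, g(x_{t-1}^m; z_t^m) & K \nmid t \\ \frac{1}{M} \sum_{m'=1}^{M} \left( x_{t-1}^{m'} - \eta_t\, g(x_{t-1}^{m'}; z_t^{m'}) \right) & K \mid t \end{cases}$$

where $z_t^m \sim \mathcal{D}$ i.i.d., and $K \mid t$ refers to $K$ dividing $t$. For each $r \in [R]$, minibatch SGD calculates its $r$th iterate via

$$x_r = x_{r-1} - \frac{\eta_r}{KM} \sum_{i=1}^{KM} g(x_{r-1}; z_r^i).$$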
We also introduce another strawman baseline, which we will refer to as “thumb-twiddling” SGD. In thumb-twiddling SGD, each machine computes just one (rather than $K$) stochastic gradient per round of communication and “twiddles its thumbs” for the remaining $K - 1$ computational steps, resulting in $R$ minibatch SGD steps, but with a minibatch size of only $M$ (instead of $KM$, i.e. as if we used $K = 1$). This is a silly algorithm that is clearly strictly worse than minibatch SGD, and we would certainly expect any reasonable algorithm to beat it. But as we shall see, previous work has actually struggled to show that local SGD even matches, let alone beats, thumb-twiddling SGD. In fact, we will show in Section 5 that, in certain regimes, local SGD truly is worse than thumb-twiddling.
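The three procedures described in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch under our own assumptions: a constant stepsize `lr` stands in for the stepsize schedule, and `noisy_grad` is a hypothetical stochastic gradient oracle for the toy objective $\frac{1}{2}\|x - 1\|^2$ with unit-variance Gaussian noise, which does not appear in the paper.

```python
import numpy as np

def local_sgd(grad, x0, M, K, R, lr, rng):
    """Local SGD: in each of R rounds, M workers take K SGD steps, then average."""
    x = np.array(x0, dtype=float)
    for _ in range(R):
        workers = np.tile(x, (M, 1))      # every worker starts from the common iterate
        for _ in range(K):
            for m in range(M):
                workers[m] -= lr * grad(workers[m], rng)
        x = workers.mean(axis=0)          # communicate and average the iterates
    return x

def minibatch_sgd(grad, x0, M, K, R, lr, rng):
    """Minibatch SGD: R steps, each averaging K*M gradients at the current iterate."""
    x = np.array(x0, dtype=float)
    for _ in range(R):
        g = np.mean([grad(x, rng) for _ in range(K * M)], axis=0)
        x = x - lr * g
    return x

def thumb_twiddling_sgd(grad, x0, M, K, R, lr, rng):
    """Thumb-twiddling SGD: minibatch SGD that uses only one gradient per machine per round."""
    return minibatch_sgd(grad, x0, M, 1, R, lr, rng)

def noisy_grad(x, rng):
    # Hypothetical oracle: gradient of 0.5 * ||x - 1||^2 plus standard Gaussian noise.
    return (x - 1.0) + rng.normal(size=x.shape)

rng = np.random.default_rng(0)
args = dict(x0=np.zeros(2), M=8, K=5, R=40, lr=0.1, rng=rng)
x_local = local_sgd(noisy_grad, **args)
x_mb = minibatch_sgd(noisy_grad, **args)
x_tt = thumb_twiddling_sgd(noisy_grad, **args)
print(x_local, x_mb, x_tt)
```

All three calls use the same communication budget ($R$ rounds). Local SGD and minibatch SGD both consume $KRM$ stochastic gradients, while thumb-twiddling SGD consumes only $RM$.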
For a particular algorithm , we define its worst-case performance with respect to as:
Knowing whether an algorithm like local or minibatch SGD is “optimal” in the worst case requires understanding the minimax error, i.e. the best error that any algorithm with the requisite computation and communication structure can guarantee in the worst case. This requires formalizing the set of allowable algorithms. One possible formalization is the graph oracle model of Woodworth et al. (2018), which focuses on the dependence structure between different stochastic gradient computations resulting from the communication pattern. Using this method, Woodworth et al. prove lower bounds which are applicable to our setting. Minibatch SGD does not match these lower bounds (nor does accelerated minibatch SGD, see Cotter et al. (2011)), but these lower bounds are not known to be tight, so the minimax complexity and minimax optimal algorithm are not yet known.
Existing analysis of local SGD
Table 1 summarizes the best existing analyses of local SGD that we are aware of that can be applied to our setting. We present the upper bounds as they would apply in our setting, and after optimizing over the stepsize and other parameters. A detailed derivation of these upper bounds from the explicitly-stated theorems in other papers is provided in Appendix A. As we can see from the table, in the natural scaling , every previous upper bound is strictly dominated by minibatch SGD. Worse, these upper bounds can be worse than even thumb-twiddling SGD when (although they are sometimes better). In particular, the first term of each previous upper bound (in terms of ) is never better than (the optimization term of minibatch and thumb-twiddling SGD), and can be much worse.
We should note that in an extremely low noise regime , the bound of Khaled et al. (2019) can sometimes improve over minibatch SGD. However, this only happens when steps of sequential SGD is better than minibatch SGD—i.e. when you are better off ignoring of the machines and just doing serial SGD on a single machine (such an approach would have error ). This is a trivial regime in which every update for any of these algorithms is essentially an exact gradient descent step, thus there is no need for parallelism in the first place. See Appendix A.3 for further details. The upper bound we develop in Section 4, in contrast, dominates their guarantee and shows an improvement over minibatch that cannot be achieved on a single machine (i.e. without leveraging any parallelism). Furthermore, this improvement can occur even in the natural scaling and even when minibatch SGD is better than serial SGD on one machine.
We emphasize that Table 1 lists the guarantees specialized to our setting; some of the bounds are presented under slightly weaker assumptions, or with a more detailed dependence on the noise: Stich and Karimireddy (2019) analyze local SGD assuming not-quite-convexity; and Wang and Joshi (2018); Dieuleveut and Patel (2019) derive guarantees under both multiplicative and additive bounds on the noise. [Footnote 4: Haddadpour et al. (2019a) analyze local SGD under the Polyak-Łojasiewicz condition, however their main Theorem is incorrect as stated. See Appendix A.4 for details.] Dieuleveut and Patel (2019) analyze local SGD with the additional assumption of a bounded third derivative, but even with this assumption do not improve over minibatch SGD. Numerous works study local SGD in the non-convex setting (see e.g. Zhou and Cong, 2018; Yu et al., 2019; Wang et al., 2017; Stich and Karimireddy, 2019; Haddadpour et al., 2019b). Although their bounds would apply in our convex setting, due to the much weaker assumptions they are understandably much worse than minibatch SGD. There is also a large body of work studying the special case $R = 1$, i.e. where the iterates are averaged just one time at the end (Zinkevich et al., 2010; Zhang et al., 2012; Li et al., 2014; Rosenblatt and Nadler, 2016; Godichon-Baggioni and Saadane, 2017; Jain et al., 2017). However, these analyses do not easily extend to multiple rounds, and the $R = 1$ constraint can provably harm performance (see Shamir et al., 2014). Finally, local SGD has been studied with heterogeneous data, i.e. where each machine receives stochastic gradients from different distributions; see Kairouz et al. (2019, Sec. 3.2) for a recent survey.
3 Good News: Quadratic Objectives
As we have seen, existing analyses of local SGD are no better than that of minibatch SGD. In the special case where $F$ is quadratic, we will now show that not only is local SGD sometimes as good as minibatch SGD, but it is always as good as minibatch SGD, and sometimes better. In fact, an accelerated variant of local SGD is minimax optimal for quadratic objectives. More generally, we show that the local SGD analogue for a large family of serial first-order optimization algorithms enjoys an error guarantee which depends only on the product $KR$ and not on $K$ or $R$ individually. In particular, we consider the following family of linear update algorithms:
Definition 1 (Linear update algorithm).
We say that a first-order optimization algorithm is a linear update algorithm if, for fixed linear functions , the algorithm generates its st iterate according to
This family captures many standard first-order methods including SGD, which corresponds to the linear mappings and . Another notable algorithm in this class is AC-SA (Ghadimi and Lan, 2013), an accelerated variant of SGD which also has linear updates. Some important non-examples, however, are adaptive gradient methods like AdaGrad (McMahan and Streeter, 2010; Duchi et al., 2011)—these have linear updates, but the linear functions are data-dependent.
For a linear update algorithm , we will use local- to denote the local SGD analogue with replacing SGD. That is, during each round of communication, each machine independently executes iterations of and then the resulting iterates are averaged. For quadratic objectives, we show that this approach inherits the guarantee of with the benefit of variance reduction:
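Theorem 1. Let $\mathcal{A}$ be a linear update algorithm which, when executed for $T$ iterations on any quadratic objective in $\mathcal{F}(H, \lambda, B, \sigma^2)$, guarantees $\mathbb{E} F(x_T) - F^* \le \epsilon(T, \sigma^2)$. Then, local-$\mathcal{A}$’s averaged final iterate $\bar{x}_{KR}$ will satisfy $\mathbb{E} F(\bar{x}_{KR}) - F^* \le \epsilon(KR, \sigma^2 / M)$.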
We prove this in Appendix B by showing that the average iterate is updated according to —even in the middle of rounds of communication when is not explicitly computed. In particular, we first show that
Then, by the linearity of and , we prove
and its variance is reduced to . Therefore, ’s guarantee carries over while still benefitting from the lower variance.
To rephrase Theorem 1, on quadratic objectives, local-$\mathcal{A}$ is in some sense equivalent to $KR$ iterations of $\mathcal{A}$ with the gradient variance reduced by a factor of $M$. Furthermore, this guarantee depends only on the product $KR$, and not on $K$ or $R$ individually. Thus, averaging the $T$th iterate of $M$ independent executions of $\mathcal{A}$, sometimes called “one-shot averaging,” enjoys the same error upper bound as $T$ iterations of size-$M$ minibatch-$\mathcal{A}$.
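The mechanism behind this equivalence can be checked numerically for plain SGD. The sketch below assumes a hypothetical quadratic $F(x) = \frac{1}{2} x^\top A x - b^\top x$ with additive gradient noise and a constant stepsize (both are our illustrative choices, not the paper's setup). It compares the average of $M$ independent SGD runs with a single SGD run whose gradient noise at each step is the average of the $M$ machines' noise, and so has variance reduced by a factor of $M$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, K, lr = 3, 4, 10, 0.1

# Hypothetical quadratic F(x) = 0.5 * x^T A x - b^T x with a symmetric positive definite A.
Q = rng.normal(size=(d, d))
A = Q @ Q.T / d + np.eye(d)
b = rng.normal(size=d)
noise = rng.normal(size=(K, M, d))   # noise[t, m]: additive gradient noise on machine m at step t

# M independent SGD runs from a common starting point, averaged after K steps.
workers = np.zeros((M, d))
for t in range(K):
    grads = workers @ A - b + noise[t]   # row m equals A x_m - b + noise (A is symmetric)
    workers = workers - lr * grads
averaged = workers.mean(axis=0)

# A single SGD run whose step-t noise is the average of the M machines' noise.
x = np.zeros(d)
for t in range(K):
    x = x - lr * (A @ x - b + noise[t].mean(axis=0))

gap = np.max(np.abs(averaged - x))
print(gap)
```

Because the SGD update is linear in the iterate when the gradient is linear, the two trajectories coincide up to floating-point error. For a non-quadratic objective the same check would generally fail.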
Nevertheless, it is important to highlight the boundaries of Theorem 1. Firstly, ’s error guarantee must not rely on any particular structure of the stochastic gradients themselves, as this structure might not hold for the implicit updates of local-. Furthermore, even if some structure of the stochastic gradients is maintained for local-, the particular iterates generated by local- will generally vary with and (even holding constant). Thus, Theorem 1 does not guarantee that local- with two different values of and would perform the same on any particular instance. We have merely proven matching upper bounds on their worst-case performance.
For any quadratic , there are constants and such that local-SGD returns a point such that
and local-AC-SA returns a point such that
In particular, local-AC-SA is minimax optimal for quadratic objectives.
Comparing the bound above for local SGD with the bound for minibatch SGD (5), we see that the local SGD bound is strictly better, due to the first term scaling as as opposed to . We note that minibatch SGD can also be accelerated (Cotter et al., 2011), leading to a bound with better dependence on , but this is again outmatched by the bound for the (accelerated) local-AC-SA algorithm above.
Prior Work in the Quadratic Setting
Local SGD and related methods have been previously analyzed for quadratic objectives, but in slightly different settings. Jain et al. (2017) study a similar setting and analyze our “minibatch SGD” for and fixed , but varying and . They show that when is sufficiently small relative to , then minibatch SGD can compete with steps of serial SGD. They also show that for fixed and , when is sufficiently small then the average of independent runs of minibatch SGD with steps and minibatch size can compete with steps of minibatch SGD with minibatch size . These results are qualitatively similar to ours, but they analyze a specific algorithm while we are able to provide a guarantee for a broader class of algorithms. Dieuleveut and Patel (2019) analyze local SGD on quadratic objectives and show a result analogous to our Theorem 1. However, their result only holds when is sufficiently small relative to and . Finally, there is a literature on “one-shot-averaging” for quadratic objectives, which corresponds to an extreme where the outputs of an algorithm applied to several different training sets are averaged, (e.g. Zhang et al., 2013a, b). These results also highlight similar phenomena, but they do not apply as broadly as Theorem 1 and they do not provide as much insight into local SGD specifically.
4 More Good News: General Convex Objectives
In this section, we present the first analysis of local SGD for general convex objectives that is not dominated by minibatch SGD. For the first time, we can identify a regime of , , and in which local SGD provably performs better than minibatch SGD in the worst case. Furthermore, our analysis dominates all existing upper bounds for local SGD (at least in the natural scaling ).
Let . Then, for local SGD with decaying stepsize , the averaged iterate has expected error at most
For general convex , applying local SGD to for an optimally chosen ensures
This is proven in Appendix C. We use a similar approach as Stich (2018), who analyzes the behavior of the averaged iterate , even when it is not explicitly computed. They show, in particular, that the averaged iterate evolves almost according to size-$M$ minibatch SGD updates, up to a term proportional to the dispersion of the individual machines’ iterates . Stich bounds this with , but this bound is too pessimistic: in particular, it holds even if the gradients are replaced by arbitrary vectors of norm . In Lemma 2, we improve this bound to , which allows for our improved guarantee. [Footnote 5: In recent work, Stich and Karimireddy (2019) present a new analysis of local SGD which, in the general convex case, is of the form . As stated, this is strictly worse than minibatch SGD. However, we suspect that this bound should hold for any because, intuitively, having more machines should not hurt you. If this is true, then optimizing their bound over yields a similar result as Theorem 2.] Our approach resembles that of Khaled et al. (2019), which we became aware of in the process of preparing this manuscript; however, our analysis is more refined. In particular, we optimize more carefully over the stepsize so that our analysis applies for any , , and (rather than just ) and shows an improvement over minibatch SGD in a significantly broader regime, including when (see Appendix A.3 for additional details).
Comparison of our bound with minibatch SGD
We now compare the upper bound from Theorem 2 with the guarantee of minibatch SGD. For clarity, and in order to highlight the role of , , and in the convergence rate, we will compare rates for general convex objectives when , and we will also ignore numerical constants and the logarithmic factor in Theorem 2. In this setting, the worst-case error of minibatch SGD is:
Our guarantee for local SGD from Theorem 2 reduces to:
These guarantees have matching statistical terms of , which cannot be improved by any first-order algorithm (Nemirovsky and Yudin, 1983). Therefore, in the regime where the statistical term dominates both rates, i.e. and , both algorithms will have similar worst-case performance. When we leave this noise-dominated regime, we see that local SGD’s guarantee is better than minibatch SGD’s when and is worse when . This makes sense intuitively: minibatch SGD benefits from computing very precise gradient estimates, but pays for it by taking fewer gradient steps; conversely, each local SGD update is much noisier, but local SGD is able to make times more updates.
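To make the comparison concrete, the sketch below evaluates both guarantees numerically. It assumes, with $H = B = \sigma^2 = 1$ and constants and logarithmic factors dropped, that the minibatch SGD rate takes the form $1/R + 1/\sqrt{MKR}$ and that the local SGD guarantee takes the form $1/(K^{1/3} R^{2/3}) + 1/\sqrt{MKR}$. These forms are our assumption, chosen to match the shared statistical term and the crossover at $K \approx R$ described above, and the function names are hypothetical.

```python
import math

def minibatch_bound(M, K, R):
    # Assumed form: optimization term 1/R plus statistical term 1/sqrt(MKR).
    return 1.0 / R + 1.0 / math.sqrt(M * K * R)

def local_sgd_bound(M, K, R):
    # Assumed form: optimization term 1/(K^(1/3) R^(2/3)) plus the same statistical term.
    return 1.0 / (K ** (1 / 3) * R ** (2 / 3)) + 1.0 / math.sqrt(M * K * R)

M = 10**6  # many machines, so the statistical term is small in both bounds
for K, R in [(1000, 10), (10, 1000)]:
    mb, loc = minibatch_bound(M, K, R), local_sgd_bound(M, K, R)
    winner = "local SGD" if loc < mb else "minibatch SGD"
    print(f"K={K:5d} R={R:5d} minibatch={mb:.5f} local={loc:.5f} smaller bound: {winner}")
```

Under these assumed forms, the local SGD bound is smaller when $K$ is much larger than $R$, and the minibatch SGD bound is smaller when $K$ is much smaller than $R$.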
This establishes that for general convex objectives in the large- and large- regime, local SGD will strictly outperform minibatch SGD. However, in the large- and small- regime, we are only comparing upper bounds, so it is not clear that local SGD will in fact perform worse than minibatch SGD. Nevertheless, it raises the question of whether this is the best we can hope for from local SGD. Is local SGD truly better than minibatch SGD in some regimes but worse in others? Or, should we believe the intuitive argument suggesting that local SGD is always at least as good as minibatch SGD?
5 Bad News: Minibatch SGD Can Outperform Local SGD
In Section 3, we saw that when the objective is quadratic, local SGD is strictly better than minibatch SGD, and enjoys an error guarantee that depends only on and not or individually. In Section 4, we analyzed local SGD for general convex objectives and showed that local SGD sometimes outperforms minibatch SGD. However, we did not show that it always does, nor that it is always even competitive with minibatch SGD. We will now show that this is not simply a failure of our analysis—in a certain regime, local SGD really is inferior (in the worst-case) to minibatch SGD, and even to thumb-twiddling SGD. We show this by constructing a simple, smooth piecewise-quadratic objective in three dimensions, on which local SGD performs poorly. We define this hard instance as
where and .
For , there exists such that for any and , local SGD initialized at with any fixed stepsize, will output a point such that for a universal constant
We defer a detailed proof of the Theorem to Appendix D. Intuitively, it relies on the fact that for non-quadratic functions, the SGD updates are no longer linear as in Section 3, and the local SGD dynamics introduce an additional bias term which does not depend on $M$, and scales poorly with $K$ and $R$. [Footnote 6: To see this, consider for example the univariate function where $z$ is some zero-mean bounded random variable. It is easy to verify that even if we have infinitely many machines ($M = \infty$), running local SGD for a few iterations starting from the global minimum $x^*$ of $F$ will generally return a point bounded away from $x^*$. In contrast, minibatch SGD under the same conditions will remain at $x^*$.] In fact, this phenomenon is not unique to our construction, and can be expected to exist for any “sufficiently” non-quadratic function. With our construction, the proof proceeds by showing that the suboptimality is large unless but local SGD introduces a bias which causes to “drift” in the negative direction by an amount proportional to the stepsize. On the other hand, optimizing the first term of the objective requires the stepsize to be relatively large. Combining these yields the first term of the lower bound. The second term is classical and holds even for first-order algorithms that compute stochastic gradients sequentially (Nemirovsky and Yudin, 1983).
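The bias described here is easy to observe in simulation. The sketch below uses our own hypothetical univariate instance, not necessarily the one the authors had in mind: $f(x; z) = h(x) + zx$ with $h(x) = \frac{1}{2}x^2$ for $x < 0$ and $h(x) = 2x^2$ for $x \ge 0$, where $z$ is uniform on $\{-1, +1\}$. The expected objective is minimized at $0$. A very large number of machines approximates the $M = \infty$ case.

```python
import numpy as np

def grad(x, z):
    # h'(x) = x for x < 0 and 4x for x >= 0 (a smooth piecewise quadratic), plus additive noise z.
    return np.where(x < 0, x, 4.0 * x) + z

rng = np.random.default_rng(0)
M, K, lr = 200_000, 2, 0.2   # many machines stand in for M = infinity

# Local SGD: each machine runs K steps independently from x = 0, then the iterates are averaged.
x_workers = np.zeros(M)
for _ in range(K):
    z = rng.choice([-1.0, 1.0], size=M)
    x_workers = x_workers - lr * grad(x_workers, z)
local_avg = x_workers.mean()

# Minibatch SGD: K steps, each using the average of M gradients at the shared iterate.
x_mb = 0.0
for _ in range(K):
    z = rng.choice([-1.0, 1.0], size=M)
    x_mb = x_mb - lr * grad(np.full(M, x_mb), z).mean()

print(local_avg, x_mb)
```

For this instance a short hand calculation gives an expected two-step local SGD average of $-0.06$: after one step the iterates sit at $\pm 0.2$, and the steeper right branch pulls the positive iterates back harder than the left branch pulls the negative ones. The minibatch SGD iterate stays near $0$ because the averaged noise nearly cancels at the shared iterate.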
In order to compare this lower bound with Theorem 2 and with minibatch SGD, we again consider the general convex setting with . Then, the lower bound reduces to . Comparing this to Theorem 2, we see that our upper bound is tight up to a factor of in the optimization term. Furthermore, comparing this to the worst-case error of minibatch SGD (11), we see that local SGD is indeed worse than minibatch SGD in the worst case when is small enough relative to . The cross-over point is somewhere between and ; for smaller , minibatch SGD is better than local SGD in the worst case, for larger , local SGD is better in the worst case. Since the optimization terms of minibatch SGD and thumb-twiddling SGD are identical, this further indicates that local SGD is even outperformed by thumb-twiddling SGD in the small and large regime.
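The location of the cross-over can be illustrated numerically. The sketch below assumes, with $H = B = \sigma^2 = 1$ and constants dropped, that the optimization term of the local SGD lower bound scales as $1/(K^{2/3} R^{2/3})$ and that of minibatch SGD as $1/R$. These forms are our assumption, chosen to be consistent with the $\sqrt{R}$ to $R$ cross-over range discussed above, and the function names are hypothetical.

```python
def minibatch_opt_term(K, R):
    # Assumed optimization term of the minibatch SGD guarantee (also that of thumb-twiddling SGD).
    return 1.0 / R

def local_sgd_lower_opt_term(K, R):
    # Assumed optimization term of the local SGD lower bound.
    return 1.0 / (K ** (2 / 3) * R ** (2 / 3))

R = 10**6  # so sqrt(R) = 1000
for K in [10, 100, 1000, 10**4, 10**5]:
    lower, mb = local_sgd_lower_opt_term(K, R), minibatch_opt_term(K, R)
    verdict = "local SGD provably worse" if lower > mb else "lower bound is inconclusive"
    print(f"K={K:6d} local lower bound={lower:.2e} minibatch={mb:.2e} -> {verdict}")
```

Under these assumed forms the lower bound exceeds the minibatch SGD term exactly when $K < \sqrt{R}$. For larger $K$ the lower bound alone no longer separates the two methods.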
Finally, it is interesting to note that in the strongly convex case (where ), the gap between local SGD and minibatch SGD can be even more dramatic: In that case, the optimization term of minibatch SGD scales as (see Stich (2019) and references therein), while our theorem implies that local SGD cannot obtain a term better than . This implies an exponentially worse dependence on in that term, and a worse bound as long as .
In order to prove Theorem 3 we constructed an artificial, but easily analyzable, situation where we could prove analytically that local SGD is worse than minibatch SGD. In Figure 1, we also demonstrate the behaviour empirically on a logistic regression task, by plotting the suboptimality of local SGD, minibatch SGD, and thumb-twiddling SGD iterates with optimally tuned stepsizes. As predicted by Theorem 3, local SGD goes from performing worse than minibatch SGD in the small-$K$ regime to improving relative to the other algorithms as $K$ increases to and then , when local SGD is far superior to minibatch SGD. For each fixed $K$, increasing $M$ causes thumb-twiddling SGD to improve relative to minibatch SGD, but does not have a significant effect on local SGD, which is consistent with local SGD introducing a bias which depends on $K$ but not on $M$. This highlights that the “problematic regime” for local SGD is the regime with a relatively small number of iterations per round.
6 Future work
In this paper, we provided the first analysis of local SGD showing improvement over minibatch SGD in a natural setting, but also demonstrated that local SGD can sometimes be worse than minibatch SGD, and is certainly not optimal.
As can be seen from Table 1, our upper and lower bounds for local SGD are still not tight. The first term depends on $K^{1/3}$ versus $K^{2/3}$; we believe the correct behaviour might be in between, namely $K^{1/2}$, matching the bias of $K$-step SGD. The exact worst case behaviour of local SGD is therefore not yet resolved.
But beyond obtaining a precise analysis of local SGD, our paper highlights a more important challenge: we see that local SGD is definitely not optimal, and does not even always improve over minibatch SGD. Can we suggest an optimal algorithm in this setting? Or at least a method that combines the advantages of both local SGD and minibatch SGD and enjoys guarantees that dominate both? Our work motivates developing such an algorithm, which might also have benefits in regimes where local SGD is already better than minibatch SGD.
To answer this question will require new upper bounds and perhaps also new lower bounds. Looking to the analysis of local AC-SA for quadratic objectives in Corollary 1, we might hope to design an algorithm which achieves error
for general convex objectives. That is, an algorithm which combines the optimization term for steps of accelerated gradient descent with the optimal statistical term. If this were possible, it would match the lower bound of Woodworth et al. (2018) and therefore be optimal with respect to this communication structure.
This work is partially supported by NSF-CCF/BSF award 1718970/2016741, NSF-DMS 1547396, and a Google Faculty Research Award. BW is supported by a Google PhD Fellowship. Part of this work was done while NS was visiting Google. Work by SS was done while visiting TTIC.
- Coppola (2015) Gregory Francis Coppola. Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing. PhD thesis, The University of Edinburgh, 2015.
- Cotter et al. (2011) Andrew Cotter, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems 24, 2011.
- Dekel et al. (2012) Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
- Dieuleveut and Patel (2019) Aymeric Dieuleveut and Kumar Kshitij Patel. Communication trade-offs for local-sgd with large step size. In Advances in Neural Information Processing Systems, pages 13579–13590, 2019.
- Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Ghadimi and Lan (2013) Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
- Godichon-Baggioni and Saadane (2017) Antoine Godichon-Baggioni and Sofiane Saadane. On the rates of convergence of parallelized averaged stochastic gradient algorithms. arXiv preprint arXiv:1710.07926, 2017.
- Haddadpour et al. (2019a) Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Local sgd with periodic averaging: Tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems, pages 11080–11092, 2019a.
- Haddadpour et al. (2019b) Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Trading redundancy for communication: Speeding up distributed sgd for non-convex optimization. In International Conference on Machine Learning, pages 2545–2554, 2019b.
- Jain et al. (2017) Prateek Jain, Praneeth Netrapalli, Sham M Kakade, Rahul Kidambi, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: mini-batching, averaging, and model misspecification. The Journal of Machine Learning Research, 18(1):8258–8299, 2017.
- Kairouz et al. (2019) Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D’Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning, 2019.
- Khaled et al. (2019) Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Better communication complexity for local sgd. arXiv preprint arXiv:1909.04746, 2019.
- Li et al. (2014) Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670. ACM, 2014.
- Lin et al. (2018) Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large mini-batches, use local sgd. arXiv preprint arXiv:1808.07217, 2018.
- McMahan and Streeter (2010) H. Brendan McMahan and Matthew J. Streeter. Adaptive bound optimization for online convex optimization. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 244–256, 2010.
- McMahan et al. (2016) H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
- Nemirovsky and Yudin (1983) Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.
- Rosenblatt and Nadler (2016) Jonathan D Rosenblatt and Boaz Nadler. On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA, 5(4):379–404, 2016.
- Shamir and Srebro (2014) Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 850–857. IEEE, 2014.
- Shamir et al. (2014) Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate newton-type method. In International conference on machine learning, pages 1000–1008, 2014.
- Simchowitz (2018) Max Simchowitz. On the randomized complexity of minimizing a convex quadratic function. arXiv preprint arXiv:1807.09386, 2018.
- Stich (2018) Sebastian U Stich. Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018. URL https://arxiv.org/abs/1805.09767.
- Stich (2019) Sebastian U Stich. Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232, 2019.
- Stich and Karimireddy (2019) Sebastian U Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for sgd with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.
- Wang et al. (2017) Jialei Wang, Weiran Wang, and Nathan Srebro. Memory and communication efficient distributed stochastic optimization with minibatch-prox. arXiv preprint arXiv:1702.06269, 2017. URL https://arxiv.org/abs/1702.06269.
- Wang and Joshi (2018) Jianyu Wang and Gauri Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576, 2018.
- Woodworth et al. (2018) Blake Woodworth, Jialei Wang, Brendan McMahan, and Nathan Srebro. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. arXiv preprint arXiv:1805.10222, 2018. URL https://arxiv.org/abs/1805.10222.
- Yu et al. (2019) Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5693–5700, 2019.
- Zhang et al. (2016) Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Parallel sgd: When does averaging help? arXiv preprint arXiv:1606.07365, 2016.
- Zhang et al. (2012) Yuchen Zhang, Martin J Wainwright, and John C Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510, 2012.
- Zhang et al. (2013a) Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression. In Conference on learning theory, pages 592–617, 2013a.
- Zhang et al. (2013b) Yuchen Zhang, John C Duchi, and Martin J Wainwright. Communication-efficient algorithms for statistical optimization. The Journal of Machine Learning Research, 14(1):3321–3363, 2013b.
- Zhou and Cong (2018) Fan Zhou and Guojing Cong. On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 3219–3227. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/447. URL https://doi.org/10.24963/ijcai.2018/447.
- Zinkevich et al. (2010) Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.
Appendix A Comparisons Between Existing Local SGD Analyses and Minibatch SGD
In this section, we describe the derivation of the entries in Table 1 for the cases in which it is not obvious. In particular, these previous analyses were stated based on different assumptions (stronger as well as weaker) which need to be reconciled with ours. Since local SGD is often analyzed in the strongly convex setting (or with weaker assumptions that are implied by strong convexity), we will make use of the following fact: If an algorithm guarantees error at most when applied to a -strongly convex function, then we can apply the algorithm to in order to ensure error . This applies for any , so we can actually infer that the algorithm, in fact, guarantees error at most .
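The reduction described above can be written out as a short derivation. This is a sketch under assumed notation, since the symbols are not fixed in this section: $\epsilon(\lambda)$ is the error guaranteed on $\lambda$-strongly convex objectives, $\hat{x}$ is the algorithm's output, $x_0$ is the initial point, $x^*$ is a minimizer of $F$ with $\|x_0 - x^*\| \le B$, and the regularizer $\frac{\lambda}{2}\|x - x_0\|^2$ is one standard choice. All inequalities are meant in expectation over the algorithm's randomness. For simplicity, the sketch ignores the small change in the smoothness constant caused by the regularizer.

```latex
\begin{align*}
F_\lambda(x) &:= F(x) + \frac{\lambda}{2}\|x - x_0\|^2
  && \text{($\lambda$-strongly convex whenever $F$ is convex)} \\
F(\hat{x}) - F(x^*)
  &\le F_\lambda(\hat{x}) - F(x^*)
  && \text{(the regularizer is nonnegative)} \\
  &= F_\lambda(\hat{x}) - F_\lambda(x^*) + \frac{\lambda}{2}\|x^* - x_0\|^2 \\
  &\le \Bigl(F_\lambda(\hat{x}) - \min_x F_\lambda(x)\Bigr) + \frac{\lambda B^2}{2} \\
  &\le \epsilon(\lambda) + \frac{\lambda B^2}{2}.
\end{align*}
% The bound holds for every \lambda > 0, so the best choice of \lambda gives:
\[
F(\hat{x}) - F(x^*) \;\le\; \min_{\lambda > 0}\Bigl\{\epsilon(\lambda) + \frac{\lambda B^2}{2}\Bigr\}.
\]
```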
Since our purpose is to show that these analyses are dominated by minibatch SGD, the entries in the table are, in some sense, the most optimistic interpretation of the bounds stated in the paper. For example, if error is guaranteed for strongly convex functions, we actually enter into the table, which is a lower bound on the actual guarantee.
|Reference|Setting|Best convergence rate (i.e., )|
|---|---|---|
|Stich and Karimireddy (2019)|SC| |
|Khaled et al. (2019)|SC| |
For reference, we restate the worst-case guarantee of minibatch SGD:
A.1 Stich (2018)
The paper makes the same assumptions as us but, in addition, assumes that the stochastic gradients are uniformly bounded, i.e. . We relax this assumption by noting the following,
In the last step we make the optimistic assumption that the iterates stray no farther from than they were at initialization, i.e. . This may not be true, so this bound is optimistic. On the other hand, it is clear that one cannot generally upper bound any tighter than this in our setting. Since our goal is in any case to show that the analysis of Stich (2018) is deficient, we continue using the bound (21). This immediately gives the result for the strongly convex setting in Table 2. For the non-strongly convex setting, we extend their result by optimizing each term separately as and ignore the constants.
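The relaxation of the bounded-gradient assumption can be sketched as follows. The notation here is assumed rather than taken from this section: $g(x;z)$ is an unbiased stochastic gradient with variance at most $\sigma^2$, $F$ is $H$-smooth with an unconstrained minimizer $x^*$ (so $\nabla F(x^*) = 0$), and $B \ge \|x_0 - x^*\|$. The last step uses the optimistic assumption described above. Whether this chain matches the referenced bound (21) term for term cannot be confirmed from the text of this section.

```latex
\begin{align*}
\mathbb{E}\,\|g(x;z)\|^2
  &= \mathbb{E}\,\|g(x;z) - \nabla F(x)\|^2 + \|\nabla F(x)\|^2
  && \text{(unbiasedness of $g$)} \\
  &\le \sigma^2 + \|\nabla F(x) - \nabla F(x^*)\|^2
  && \text{(variance bound, $\nabla F(x^*) = 0$)} \\
  &\le \sigma^2 + H^2 \|x - x^*\|^2
  && \text{($H$-smoothness)} \\
  &\le \sigma^2 + H^2 B^2
  && \text{(optimistic assumption $\|x - x^*\| \le B$)}.
\end{align*}
```

Under these assumptions, a uniform bound on the second moment of the stochastic gradients would be replaced by $\sigma^2 + H^2 B^2$.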
A.2 Stich and Karimireddy (2019)
The paper relaxes the convexity assumption by assuming F is -quasi convex, i.e., . This condition can also hold for certain non-convex functions and is implied by -strong convexity. In addition, they assume -smoothness of and multiplicative noise for the stochastic gradients, i.e., . The latter assumption is a relaxation of the uniform upper bound on the variance of the stochastic gradients, which we have assumed. Thus, to compare to their result, we set upper bounding the stochastic variance by and use the strong convexity constant instead of . For the non-strongly convex setting we use their rate, along with our uniform variance bound. They also use specific learning rate and averaging schedules to optimize their rates. Both these rates are given in Table 2. For the general convex setting, we believe their dependence on is poor and is improved upon by our upper bound in Section 4.
A.3 Khaled et al. (2019)
The relevant analysis from Khaled et al. (2019) is given in their Corollary 2, which is their only analysis that upper bounds the error in terms of the objective function suboptimality and in the setting where each machine receives i.i.d. stochastic gradients. Their Corollary 2 states that when , the error is bounded by [Footnote 7: There is a typo in their statement which omits the factor of ( in their notation) from the numerator of the first term.]
In the case where , it is clear that this is strictly worse than minibatch SGD since . However, consider the case of arbitrary , and and suppose Khaled et al. (2019)’s guarantee is less than , in which case
Consequently, (22) is either greater than or greater than . This does not mean that their upper bound is worse than minibatch SGD. However, it is worse than minibatch SGD unless .
If we interrogate what this regime corresponds to, we see that it is a trivial regime where steps of serial SGD, which achieves error , is better than minibatch SGD. That is, rather than implementing minibatch SGD distributed across the machines, we are better off just ignoring of the available machines and doing serial SGD. If this is really the right thing to do, then there was never any need for parallelism in the first place, and thus there is no reason to use local SGD, which performs no better than serial SGD in this case anyway.
A.4 Haddadpour et al. (2019a)
Haddadpour et al. (2019a) also analyze local SGD in a related setting (they assume the Polyak-Łojasiewicz condition, which is implied by strong convexity). However, in trying to adapt their Theorem 1 to our setting, it appears that there are some omitted conditions in the Theorem statement. In particular, choosing appears to be allowed by their hypotheses, yet this choice leads to an upper bound of when , which contradicts known lower bounds for deterministic first-order optimization. Since we are not sure what the actual requirements on are, we are unable to confirm what their analysis implies about our setting.
Appendix B Proofs from Section 3
We will show that the average of the iterates at any particular time evolves according to with a lower variance stochastic gradient, even though this average iterate is not explicitly computed by the algorithm at every step. It is easily confirmed from (6) that
where we used that is linear. We will now show that
is an unbiased estimate of with variance bounded by . Therefore, is updated exactly according to with a lower variance stochastic gradient.
By the linearity of and
Furthermore, since the on each machine are independent, and ,
It is easily confirmed that SGD and AC-SA (Ghadimi and Lan, 2013) are linear update algorithms, which allows us to apply Theorem 1. In addition, Simchowitz (2018) shows that any randomized algorithm that accesses a deterministic first-order oracle at most times will have error at least in the worst case for an -smooth, convex quadratic objective, for some universal constant . Therefore, the first term of local-AC-SA’s guarantee cannot be improved. The second term of the guarantee also cannot be improved (Nemirovsky and Yudin, 1983); in fact, this term cannot be improved even by an algorithm which is allowed to make sequential calls to a stochastic gradient oracle. ∎
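The key step in the proof of Theorem 1 above, that the averaged iterate of local SGD on a quadratic objective follows a single SGD trajectory driven by a lower variance gradient, can be checked numerically. The following is a minimal sketch, not the authors' code. It assumes a small hypothetical quadratic $F(x) = \frac{1}{2}x^\top A x - b^\top x$ with additive Gaussian gradient noise, and all names and parameter values (`M`, `K`, `R`, `eta`, `A`, `b`) are chosen for illustration only.

```python
import random


def stochastic_grad(A, b, x, noise):
    """Gradient of 0.5*x^T A x - b^T x at x, plus additive noise (linear in x)."""
    d = len(x)
    return [sum(A[i][j] * x[j] for j in range(d)) - b[i] + noise[i]
            for i in range(d)]


def mean_vectors(vectors):
    """Coordinate-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def sgd_step(A, b, x, noise, eta):
    """One SGD step from x using the noisy gradient."""
    g = stochastic_grad(A, b, x, noise)
    return [xi - eta * gi for xi, gi in zip(x, g)]


def max_gap_average_vs_reference(M=4, K=5, R=3, eta=0.1, seed=0):
    """Run local SGD on M machines and, in parallel, a single reference SGD
    that uses the machine-averaged noise at each step. Returns the largest
    coordinate-wise gap between the average local iterate and the reference."""
    rng = random.Random(seed)
    A = [[2.0, 0.5], [0.5, 1.0]]   # hypothetical positive definite matrix
    b = [1.0, -1.0]                # hypothetical linear term
    d = len(b)
    x_local = [[0.0] * d for _ in range(M)]  # common starting iterate
    x_ref = [0.0] * d
    max_gap = 0.0
    for _ in range(R):
        for _ in range(K):
            # Independent noise on each machine for this step.
            noises = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(M)]
            x_local = [sgd_step(A, b, x, z, eta) for x, z in zip(x_local, noises)]
            # Reference: one SGD step at x_ref with the averaged noise,
            # whose variance is 1/M of the per-machine variance.
            x_ref = sgd_step(A, b, x_ref, mean_vectors(noises), eta)
            x_bar = mean_vectors(x_local)
            gap = max(abs(u - v) for u, v in zip(x_bar, x_ref))
            max_gap = max(max_gap, gap)
        # Communication round: every machine restarts from the average.
        x_bar = mean_vectors(x_local)
        x_local = [list(x_bar) for _ in range(M)]
    return max_gap


if __name__ == "__main__":
    print(max_gap_average_vs_reference())
```

Because the gradient of a quadratic is linear in the iterate, averaging the per-machine gradients gives the gradient at the averaged iterate with averaged noise, so the two trajectories should agree up to floating-point rounding. The same check would fail for a non-quadratic objective, where the gradient does not commute with averaging.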
Appendix C Proof of Theorem 2
Before we prove Theorem 2, we will introduce some notation. Recall that the objective is of the form . Let denote the stepsize used for the th overall iteration. Let denote the th iterate on the th machine, and let denote the averaged th iterate. The vector may not actually be computed by the algorithm, but it will be central to our analysis. We will use to denote the stochastic gradient computed at by the th machine at iteration , and will denote the average of the stochastic gradients computed at time . Finally, let denote the average of the full gradients computed at the individual iterates.
Lemma 1 (see Lemma 3.1 of Stich (2018)).
Let be -smooth and -strongly convex, let , and let . Then the iterates of local SGD satisfy