1 Introduction
It is often important to leverage parallelism in order to tackle large scale stochastic optimization problems. A prime example is the task of minimizing the loss of machine learning models with millions or billions of parameters over enormous training sets.
One popular distributed approach is local stochastic gradient descent (SGD) (Zinkevich et al., 2010; Coppola, 2015; Zhou and Cong, 2018; Stich, 2018), also known as “parallel SGD” or “Federated Averaging” (McMahan et al., 2016; see Footnote 1), which is commonly applied to large scale convex and nonconvex stochastic optimization problems, including in data center and “Federated Learning” settings (Kairouz et al., 2019). Local SGD uses $M$ parallel workers which, in each of $R$ rounds, independently execute $K$ steps of SGD starting from a common iterate, and then communicate and average their iterates to obtain the common iterate from which the next round begins. Overall, each machine computes $KR$ stochastic gradients and executes $KR$ SGD steps locally, for a total of $N = KRM$ overall stochastic gradients computed (and so $N = KRM$ samples used), with $R$ rounds of communication (every $K$ steps of computation).
Footnote 1: Federated Averaging is a specialization of local SGD to the federated setting, where (a) data is assumed to be heterogeneous (not i.i.d.) across workers, (b) only a handful of clients are used in each round, and (c) updates are combined with a weighted average to accommodate unbalanced datasets.
Given the appeal and usage of local SGD, there is significant value in understanding its performance and limitations theoretically, and in comparing it to other alternatives and baselines that have the same computation and communication structure. That is, other methods that are distributed across $M$ machines and compute $K$ gradients per round of communication for $R$ rounds, for a total of $T = KR$ gradients per machine and $R$ communication steps. This structure can also be formalized through the graph oracle model of Woodworth et al. (2018, see also Section 2).
So, how does local SGD compare to other algorithms with the same computation and communication structure? Is local SGD (or perhaps an accelerated variant) optimal in the same way that (accelerated) SGD is optimal in the sequential setting? Is it better than baselines?
A natural alternative and baseline is minibatch SGD (Dekel et al., 2012; Cotter et al., 2011; Shamir and Srebro, 2014), a simple method for which we have a complete and tight theoretical understanding. Within the same computation and communication structure, minibatch SGD can be implemented as follows: each round, calculate $K$ stochastic gradient estimates (at the current iterate) on each machine, and then average all $KM$ estimates to obtain a single gradient estimate. That is, we can implement minibatch SGD that takes $R$ stochastic gradient steps, with each step using a minibatch of size $KM$. This is the fair and correct minibatch SGD to compare to, and when we refer to “minibatch SGD” we refer to this implementation ($R$ steps with minibatch size $KM$).
Local SGD seems intuitively better than minibatch SGD, since even when the workers are not communicating, they are making progress towards the optimum. In particular, local SGD performs $K$ times more updates over the course of optimization, and can be thought of as computing gradients at less “stale” and more “updated” iterates. For this reason, it has been argued that local SGD is at least as good as minibatch SGD, especially in convex settings where averaging iterates cannot hurt you. But can we capture this advantage theoretically to understand how and when local SGD is better than minibatch SGD? Or even just establish that local SGD is at least as good?
A string of recent papers has attempted to analyze local SGD for convex objectives (e.g. Stich, 2018; Stich and Karimireddy, 2019; Khaled et al., 2019; Dieuleveut and Patel, 2019). However, a satisfying analysis has so far proven elusive. In fact, every analysis that we are aware of for local SGD in the general convex (or strongly convex) case with a typical noise scaling (e.g. as arising from supervised learning) not only does not improve over minibatch SGD, but is actually strictly dominated by minibatch SGD! But is this just a deficiency of these analyses, or is local SGD actually not better, and perhaps worse, than minibatch SGD? In this paper, we show that the answer to this question is “sometimes.” There is a regime in which local SGD indeed matches or improves upon minibatch SGD, but perhaps surprisingly, there is also a regime in which local SGD really is strictly worse than minibatch SGD.
Our contributions
In Section 3, we start with the special case of quadratic objectives and show that, at least in this case, local SGD is strictly better than minibatch SGD in the worst case, and that an accelerated variant is even minimax optimal.
We then turn to general convex objectives. In Section 4 we prove the first error upper bound on the performance of local SGD which is not dominated by minibatch SGD’s upper bound with a typical noise scaling. In doing so, we identify a regime (where $M$ is large and $K \gtrsim R$) in which local SGD performs strictly better than minibatch SGD in the worst case. However, our upper bound does not show that local SGD is always as good or better than minibatch SGD. In Section 5, we show that this is not just a failure of our analysis. We prove a lower bound on the worst-case error of local SGD that is higher than the worst-case error of minibatch SGD in a certain regime!
We demonstrate this behaviour empirically, using a logistic regression problem where local SGD indeed behaves much worse than minibatch SGD in the theoretically-predicted problematic regime.
Thus, while local SGD is frequently better than minibatch SGD—and we can now see this both in theory and in practice (see experiments by e.g. Zhang et al., 2016; Lin et al., 2018; Zhou and Cong, 2018)—our work identifies regimes in which users should be wary of using local SGD without considering alternatives like minibatch SGD, and might want to seek alternative methods that combine the best of both, and attain optimal performance in all regimes.
2 Preliminaries
We consider the stochastic convex optimization problem:
(1) $\min_{x \in \mathbb{R}^d} F(x) := \mathbb{E}_{z \sim \mathcal{D}}\, f(x; z)$
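We will study distributed first-order algorithms that compute stochastic gradient estimates at a point $x$ via $\nabla f(x; z)$ based on a sample $z \sim \mathcal{D}$. Our focus is on objectives $F$ that are $H$-smooth, either (general) convex or $\lambda$-strongly convex (see Footnote 2), with a minimizer $x^* \in \arg\min_x F(x)$ with $\|x^*\| \le B$.
Footnote 2: An $H$-smooth and $\lambda$-strongly convex function satisfies $\frac{\lambda}{2}\|x - y\|^2 \le F(y) - F(x) - \langle \nabla F(x), y - x \rangle \le \frac{H}{2}\|x - y\|^2$. We allow $\lambda = 0$, in which case $F$ is general convex.
We consider $\nabla f$ which has uniformly bounded variance, i.e. $\sup_x \mathbb{E}_{z \sim \mathcal{D}} \|\nabla f(x; z) - \nabla F(x)\|^2 \le \sigma^2$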
. We use $\mathcal{F}(H, \lambda, B, \sigma^2)$ to refer to the set of all pairs $(f, \mathcal{D})$ which satisfy these properties. All of the analysis in this paper can be done either for general convex or strongly convex functions, and we prove all of our results for both cases. For conciseness and clarity, when discussing the results in the main text, we will focus on the general convex case. However, the picture in the strongly convex case is mostly the same.
An important instance of (1) is a supervised learning problem where is the loss on a single sample. When (referring to derivatives w.r.t. the first argument), then and also . Thus, assuming that the upper bounds on are comparable, the relative scaling of parameters we consider as most “natural” is .
For simplicity, we consider initializing all algorithms at zero. Then, local SGD with $M$ machines, $K$ stochastic gradients per round, and $R$ rounds of communication calculates its $t$th iterate on the $m$th machine for $t \in [KR]$ via
(2) $x_t^m = \begin{cases} x_{t-1}^m - \eta_{t-1} \nabla f(x_{t-1}^m; z_{t-1}^m) & K \nmid t \\ \frac{1}{M} \sum_{m'=1}^{M} \left( x_{t-1}^{m'} - \eta_{t-1} \nabla f(x_{t-1}^{m'}; z_{t-1}^{m'}) \right) & K \mid t \end{cases}$
where $z_t^m \sim \mathcal{D}$ i.i.d., and $K \mid t$ refers to $K$ dividing $t$. For each $r \in [R]$, minibatch SGD calculates its $r$th iterate via
(3) $x_r = x_{r-1} - \frac{\eta_{r-1}}{MK} \sum_{i=1}^{MK} \nabla f(x_{r-1}; z_{r-1}^{i})$
We also introduce another strawman baseline, which we will refer to as “thumb-twiddling” SGD. In thumb-twiddling SGD, each machine computes just one (rather than $K$) stochastic gradient per round of communication and “twiddles its thumbs” for the remaining $K - 1$ computational steps, resulting in $R$ minibatch SGD steps, but with a minibatch size of only $M$ (instead of $KM$, i.e. as if we used $K = 1$). This is a silly algorithm that is clearly strictly worse than minibatch SGD, and we would certainly expect any reasonable algorithm to beat it. But as we shall see, previous work has actually struggled to show that local SGD even matches, let alone beats, thumb-twiddling SGD. In fact, we will show in Section 5 that, in certain regimes, local SGD truly is worse than thumb-twiddling SGD.
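The three procedures described above can be sketched side by side. The following is a minimal illustration rather than the implementation used in the paper: the toy objective $F(x) = \frac{1}{2}\|x - x^*\|^2$, the Gaussian noise model, the constant stepsize `eta`, and all function names are assumptions made for this example. Each routine also returns the number of stochastic gradients it consumed, so the shared budget ($MKR$ for local and minibatch SGD, $MR$ for thumb-twiddling SGD) is visible.

```python
import numpy as np

# Hypothetical toy problem: F(x) = 0.5 * ||x - X_STAR||^2 with additive Gaussian gradient noise.
X_STAR = np.ones(3)

def noisy_grads(points, rng, sigma=1.0):
    """One stochastic gradient of the toy objective at each row of `points`."""
    return (points - X_STAR) + sigma * rng.standard_normal(points.shape)

def local_sgd(M, K, R, eta, rng):
    """R rounds; in each round, M workers take K local steps, then average their iterates."""
    x = np.zeros_like(X_STAR)               # all algorithms start at zero
    used = 0
    for _ in range(R):
        workers = np.tile(x, (M, 1))        # every worker starts from the common iterate
        for _ in range(K):
            workers = workers - eta * noisy_grads(workers, rng)
            used += M
        x = workers.mean(axis=0)            # communication: average the iterates
    return x, used

def minibatch_sgd(M, K, R, eta, rng):
    """R steps; each step averages M*K stochastic gradients at the current iterate."""
    x = np.zeros_like(X_STAR)
    used = 0
    for _ in range(R):
        batch = np.tile(x, (M * K, 1))      # all M*K gradients are evaluated at the same point
        x = x - eta * noisy_grads(batch, rng).mean(axis=0)
        used += M * K
    return x, used

def thumb_twiddling_sgd(M, K, R, eta, rng):
    """Minibatch SGD that uses only one gradient per machine per round (as if K = 1)."""
    return minibatch_sgd(M, 1, R, eta, rng)
```

With, say, `M=4, K=5, R=6`, the first two routines each consume 120 stochastic gradients, while thumb-twiddling SGD consumes only 24 and idles for the rest of each round.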
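For a particular algorithm $A$ with output $\hat{x}_A$, we define its worst-case performance with respect to $\mathcal{F}(H, \lambda, B, \sigma^2)$ as:
(4) $\epsilon_A = \max_{(f, \mathcal{D}) \in \mathcal{F}(H, \lambda, B, \sigma^2)} \mathbb{E} F(\hat{x}_A) - F(x^*)$
The worst-case performance of minibatch SGD for general convex objectives is tightly understood (Nemirovsky and Yudin, 1983; Dekel et al., 2012):
(5) $\epsilon_{\text{MB-SGD}} = \Theta\left( \frac{HB^2}{R} + \frac{\sigma B}{\sqrt{MKR}} \right)$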
Knowing whether an algorithm like local or minibatch SGD is “optimal” in the worst case requires understanding the minimax error, i.e. the best error that any algorithm with the requisite computation and communication structure can guarantee in the worst case. This in turn requires formalizing the set of allowable algorithms. One possible formalization is the graph oracle model of Woodworth et al. (2018), which focuses on the dependence structure between different stochastic gradient computations resulting from the communication pattern. Using this model, Woodworth et al. prove lower bounds which are applicable to our setting. Minibatch SGD does not match these lower bounds (nor does accelerated minibatch SGD, see Cotter et al. (2011)), but these lower bounds are not known to be tight, so the minimax complexity and minimax optimal algorithm are not yet known.
Existing analysis of local SGD
Table 1 summarizes the best existing analyses of local SGD that we are aware of that can be applied to our setting. We present the upper bounds as they would apply in our setting, and after optimizing over the stepsize and other parameters. A detailed derivation of these upper bounds from the explicitlystated theorems in other papers is provided in Appendix A. As we can see from the table, in the natural scaling , every previous upper bound is strictly dominated by minibatch SGD. Worse, these upper bounds can even be worse than even thumbtwiddling SGD when (although they are sometimes better). In particular, the first term of each previous upper bound (in terms of ) is never better than (the optimization term of minibatch and thumbtwiddling SGD), and can be much worse.
We should note that in an extremely low noise regime , the bound of Khaled et al. (2019) can sometimes improve over minibatch SGD. However, this only happens when steps of sequential SGD is better than minibatch SGD—i.e. when you are better off ignoring of the machines and just doing serial SGD on a single machine (such an approach would have error ). This is a trivial regime in which every update for any of these algorithms is essentially an exact gradient descent step, thus there is no need for parallelism in the first place. See Appendix A.3 for further details. The upper bound we develop in Section 4, in contrast, dominates their guarantee and shows an improvement over minibatch that cannot be achieved on a single machine (i.e. without leveraging any parallelism). Furthermore, this improvement can occur even in the natural scaling and even when minibatch SGD is better than serial SGD on one machine.
We emphasize that Table 1 lists the guarantees specialized to our setting; some of the bounds are presented under slightly weaker assumptions, or with a more detailed dependence on the noise: Stich and Karimireddy (2019) analyze local SGD assuming not-quite-convexity (see Footnote 4); and Wang and Joshi (2018) and Dieuleveut and Patel (2019) derive guarantees under both multiplicative and additive bounds on the noise. Dieuleveut and Patel (2019) analyze local SGD with the additional assumption of a bounded third derivative, but even with this assumption do not improve over minibatch SGD. Numerous works study local SGD in the non-convex setting (see e.g. Zhou and Cong, 2018; Yu et al., 2019; Wang et al., 2017; Stich and Karimireddy, 2019; Haddadpour et al., 2019b). Although their bounds would apply in our convex setting, due to the much weaker assumptions they are understandably much worse than minibatch SGD. There is also a large body of work studying the special case $R = 1$, i.e. where the iterates are averaged just one time at the end (Zinkevich et al., 2010; Zhang et al., 2012; Li et al., 2014; Rosenblatt and Nadler, 2016; Godichon-Baggioni and Saadane, 2017; Jain et al., 2017). However, these analyses do not easily extend to multiple rounds, and the constraint $R = 1$ can provably harm performance (see Shamir et al., 2014). Finally, local SGD has been studied with heterogeneous data, i.e. where each machine receives stochastic gradients from different distributions; see Kairouz et al. (2019, Sec. 3.2) for a recent survey.
Footnote 4: Haddadpour et al. (2019a) analyze local SGD under the Polyak-Łojasiewicz condition, however their main theorem is incorrect as stated. See Appendix A.4 for details.
3 Good News: Quadratic Objectives
As we have seen, existing analyses of local SGD are no better than that of minibatch SGD. In the special case where $F$ is quadratic, we will now show that not only is local SGD sometimes as good as minibatch SGD, but it is always as good as minibatch SGD, and sometimes better. In fact, an accelerated variant of local SGD is minimax optimal for quadratic objectives. More generally, we show that the local SGD analogue for a large family of serial first-order optimization algorithms enjoys an error guarantee which depends only on the product $KR$ and not on $K$ or $R$ individually. In particular, we consider the following family of linear update algorithms:
Definition 1 (Linear update algorithm).
We say that a firstorder optimization algorithm is a linear update algorithm if, for fixed linear functions , the algorithm generates its st iterate according to
(6) 
This family captures many standard firstorder methods including SGD, which corresponds to the linear mappings and . Another notable algorithm in this class is ACSA (Ghadimi and Lan, 2013), an accelerated variant of SGD which also has linear updates. Some important nonexamples, however, are adaptive gradient methods like AdaGrad (McMahan and Streeter, 2010; Duchi et al., 2011)—these have linear updates, but the linear functions are datadependent.
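For a linear update algorithm $A$, we will use local-$A$ to denote the local SGD analogue with $A$ replacing SGD. That is, during each round of communication, each machine independently executes $K$ iterations of $A$ and then the resulting iterates are averaged. For quadratic objectives, we show that this approach inherits the guarantee of $A$ with the benefit of variance reduction:
Theorem 1.
Let $A$ be a linear update algorithm which, when executed for $T$ iterations on any quadratic $(f, \mathcal{D}) \in \mathcal{F}(H, \lambda, B, \sigma^2)$, guarantees $\mathbb{E} F(x_T) - F^* \le \epsilon(T, \sigma^2)$. Then, local-$A$’s averaged final iterate $\bar{x}_{KR} = \frac{1}{M} \sum_{m=1}^{M} x_{KR}^m$ will satisfy $\mathbb{E} F(\bar{x}_{KR}) - F^* \le \epsilon(KR, \sigma^2 / M)$.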
We prove this in Appendix B by showing that the average iterate is updated according to —even in the middle of rounds of communication when is not explicitly computed. In particular, we first show that
(7) 
Then, by the linearity of and , we prove
(8) 
and its variance is reduced to . Therefore, ’s guarantee carries over while still benefitting from the lower variance.
To rephrase Theorem 1, on quadratic objectives, local-$A$ is in some sense equivalent to $KR$ iterations of $A$ with the gradient variance reduced by a factor of $M$. Furthermore, this guarantee depends only on the product $KR$, and not on $K$ or $R$ individually. Thus, averaging the $T$th iterate of $M$ independent executions of $A$, sometimes called “one-shot averaging,” enjoys the same error upper bound as $T$ iterations of size-$M$ minibatch-$A$.
Nevertheless, it is important to highlight the boundaries of Theorem 1. Firstly, $A$’s error guarantee must not rely on any particular structure of the stochastic gradients themselves, as this structure might not hold for the implicit updates of local-$A$. Furthermore, even if some structure of the stochastic gradients is maintained for local-$A$, the particular iterates generated by local-$A$ will generally vary with $K$ and $R$ (even holding $KR$ constant). Thus, Theorem 1 does not guarantee that local-$A$ with two different values of $K$ and $R$ would perform the same on any particular instance. We have merely proven matching upper bounds on their worst-case performance.
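The mechanism behind Theorem 1 can be checked numerically in a simple special case. The sketch below is an illustration under stated assumptions, not the proof from Appendix B: it takes $A$ to be plain SGD with a constant stepsize, uses a quadratic $F(x) = \frac{1}{2} x^\top A_{\text{mat}} x - b^\top x$, and assumes purely additive gradient noise that is drawn once and reused. All names (`averaged_local_sgd`, `A_mat`, `noise`) are hypothetical. Under these assumptions the averaged iterate follows the same recursion no matter how often the machines communicate, so different values of $K$ with the same total number of steps give the same average.

```python
import numpy as np

def averaged_local_sgd(A_mat, b, noise, K, eta):
    """Average of the machines' iterates after T local SGD steps on a quadratic.

    noise has shape (T, M, d): one additive noise vector per step and machine.
    Machines average their iterates after every K steps (T is assumed divisible by K).
    """
    T, M, d = noise.shape
    workers = np.zeros((M, d))                       # every machine starts at zero
    for t in range(T):
        grads = workers @ A_mat - b + noise[t]       # A_mat is symmetric, so rows are A x - b
        workers = workers - eta * grads
        if (t + 1) % K == 0:
            workers[:] = workers.mean(axis=0)        # communication round
    return workers.mean(axis=0)
```

Because the gradient of a quadratic is linear in $x$, averaging commutes with the update: the mean of the workers satisfies $\bar{x}_{t+1} = \bar{x}_t - \eta (A_{\text{mat}} \bar{x}_t - b + \bar{z}_t)$ regardless of `K`, where $\bar{z}_t$ is the average noise. This exact coincidence relies on the additive-noise assumption; as the surrounding text stresses, in general the iterates do vary with $K$ and $R$ and only the worst-case guarantee is shared.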
We apply Theorem 1 to yield error upper bounds for local-SGD, as well as local-AC-SA (based on the AC-SA algorithm of Ghadimi and Lan (2013)), which is minimax optimal:
Corollary 1.
For any quadratic $(f, \mathcal{D}) \in \mathcal{F}(H, \lambda = 0, B, \sigma^2)$, there are constants $c_1$ and $c_2$ such that local-SGD returns a point $\hat{x}$ such that
$\mathbb{E} F(\hat{x}) - F^* \le c_1 \left( \frac{HB^2}{KR} + \frac{\sigma B}{\sqrt{MKR}} \right)$
and local-AC-SA returns a point $\tilde{x}$ such that
$\mathbb{E} F(\tilde{x}) - F^* \le c_2 \left( \frac{HB^2}{K^2 R^2} + \frac{\sigma B}{\sqrt{MKR}} \right)$
In particular, local-AC-SA is minimax optimal for quadratic objectives.
Comparing the bound above for local SGD with the bound for minibatch SGD (5), we see that the local SGD bound is strictly better, due to the first term scaling as $(KR)^{-1}$ as opposed to $R^{-1}$. We note that minibatch SGD can also be accelerated (Cotter et al., 2011), leading to a bound with better dependence on $R$, but this is again outmatched by the bound for the (accelerated) local-AC-SA algorithm above.
Prior Work in the Quadratic Setting
Local SGD and related methods have been previously analyzed for quadratic objectives, but in slightly different settings. Jain et al. (2017) study a similar setting and analyze our “minibatch SGD” for and fixed , but varying and . They show that when is sufficiently small relative to , then minibatch SGD can compete with steps of serial SGD. They also show that for fixed and , when is sufficiently small then the average of independent runs of minibatch SGD with steps and minibatch size can compete with steps of minibatch SGD with minibatch size . These results are qualitatively similar to ours, but they analyze a specific algorithm while we are able to provide a guarantee for a broader class of algorithms. Dieuleveut and Patel (2019) analyze local SGD on quadratic objectives and show a result analogous to our Theorem 1. However, their result only holds when is sufficiently small relative to and . Finally, there is a literature on “oneshotaveraging” for quadratic objectives, which corresponds to an extreme where the outputs of an algorithm applied to several different training sets are averaged, (e.g. Zhang et al., 2013a, b). These results also highlight similar phenomena, but they do not apply as broadly as Theorem 1 and they do not provide as much insight into local SGD specifically.
4 More Good News: General Convex Objectives
In this section, we present the first analysis of local SGD for general convex objectives that is not dominated by minibatch SGD. For the first time, we can identify a regime of , , and in which local SGD provably performs better than minibatch SGD in the worst case. Furthermore, our analysis dominates all existing upper bounds for local SGD (at least in the natural scaling ).
Theorem 2.
Let . Then, for local SGD with decaying stepsize , the averaged iterate has expected error at most
(9) 
For general convex , applying local SGD to for an optimally chosen ensures
(10) 
This is proven in Appendix C. We use a similar approach as Stich (2018), who analyzes the behavior of the averaged iterate , even when it is not explicitly computed. They show, in particular, that the averaged iterate evolves almost according to sizeminibatch SGD updates, up to a term proportional to the dispersion of the individual machines’ iterates . Stich bounds this with
, but this bound is too pessimistic—in particular, it holds even if the gradients are replaced by arbitrary vectors of norm
. In Lemma 2, we improve this bound to which allows for our improved guarantee (see Footnote 5). Our approach resembles that of Khaled et al. (2019), which we became aware of in the process of preparing this manuscript; however, our analysis is more refined. In particular, we optimize more carefully over the stepsize so that our analysis applies for any $M$, $K$, and $R$ (rather than just ) and shows an improvement over minibatch SGD in a significantly broader regime, including when (see Appendix A.3 for additional details).
Footnote 5: In recent work, Stich and Karimireddy (2019) present a new analysis of local-SGD which, in the general convex case is of the form . As stated, this is strictly worse than minibatch SGD. However, we suspect that this bound should hold for any because, intuitively, having more machines should not hurt you. If this is true, then optimizing their bound over yields a similar result as Theorem 2.
Comparison of our bound with minibatch SGD
We now compare the upper bound from Theorem 2 with the guarantee of minibatch SGD. For clarity, and in order to highlight the role of $M$, $K$, and $R$ in the convergence rate, we will compare rates for general convex objectives when $H = B = \sigma^2 = 1$, and we will also ignore numerical constants and the logarithmic factor in Theorem 2. In this setting, the worst-case error of minibatch SGD is:
(11) $\epsilon_{\text{MB-SGD}} = \Theta\left( \frac{1}{R} + \frac{1}{\sqrt{MKR}} \right)$
Our guarantee for local SGD from Theorem 2 reduces to:
(12) $\epsilon_{\text{L-SGD}} \le O\left( \frac{1}{K^{1/3} R^{2/3}} + \frac{1}{\sqrt{MKR}} \right)$
These guarantees have matching statistical terms of $\frac{1}{\sqrt{MKR}}$, which cannot be improved by any first-order algorithm (Nemirovsky and Yudin, 1983). Therefore, in the regime where the statistical term dominates both rates, i.e. $M^3 K \lesssim R$ and $MK \lesssim R$, both algorithms will have similar worst-case performance. When we leave this noise-dominated regime, we see that local SGD’s guarantee is better than minibatch SGD’s when $K \gtrsim R$ and is worse when $K \lesssim R$. This makes sense intuitively: minibatch SGD benefits from computing very precise gradient estimates, but pays for it by taking fewer gradient steps; conversely, each local SGD update is much noisier, but local SGD is able to make $K$ times more updates.
This establishes that for general convex objectives in the large $M$ and large $K$ regime, local SGD will strictly outperform minibatch SGD. However, in the large $M$ and small $K$ regime, we are only comparing upper bounds, so it is not clear that local SGD will in fact perform worse than minibatch SGD. Nevertheless, it raises the question of whether this is the best we can hope for from local SGD. Is local SGD truly better than minibatch SGD in some regimes but worse in others? Or, should we believe the intuitive argument suggesting that local SGD is always at least as good as minibatch SGD?
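To make the comparison concrete, the two simplified rates can be evaluated directly. The sketch below assumes the normalized forms $\frac{1}{R} + \frac{1}{\sqrt{MKR}}$ for minibatch SGD (11) and $\frac{1}{K^{1/3} R^{2/3}} + \frac{1}{\sqrt{MKR}}$ for the local SGD upper bound (12), with $H = B = \sigma^2 = 1$ and all constants and logarithmic factors dropped. The function names are hypothetical, and the outputs are rates up to constants, not predicted errors.

```python
import math

def minibatch_rate(M, K, R):
    """Simplified worst-case rate of minibatch SGD, constants dropped (assumed form of (11))."""
    return 1.0 / R + 1.0 / math.sqrt(M * K * R)

def local_sgd_rate(M, K, R):
    """Simplified upper bound for local SGD, constants and logs dropped (assumed form of (12))."""
    return 1.0 / (K ** (1 / 3) * R ** (2 / 3)) + 1.0 / math.sqrt(M * K * R)

def local_bound_is_smaller(M, K, R):
    """True when the local SGD upper bound is below the minibatch SGD rate."""
    return local_sgd_rate(M, K, R) < minibatch_rate(M, K, R)
```

Since the statistical terms coincide, the comparison reduces to $K^{1/3} R^{2/3}$ versus $R$, i.e. to $K$ versus $R$: for example, `local_bound_is_smaller(100, 1000, 10)` holds while `local_bound_is_smaller(100, 10, 1000)` does not. Because (12) is only an upper bound, the second case does not by itself show that local SGD is actually worse.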
5 Bad News: Minibatch SGD Can Outperform Local SGD
In Section 3, we saw that when the objective is quadratic, local SGD is strictly better than minibatch SGD, and enjoys an error guarantee that depends only on $KR$ and not on $K$ or $R$ individually. In Section 4, we analyzed local SGD for general convex objectives and showed that local SGD sometimes outperforms minibatch SGD. However, we did not show that it always does, nor that it is always even competitive with minibatch SGD. We will now show that this is not simply a failure of our analysis: in a certain regime, local SGD really is inferior (in the worst case) to minibatch SGD, and even to thumb-twiddling SGD. We show this by constructing a simple, smooth piecewise-quadratic objective in three dimensions, on which local SGD performs poorly. We define this hard instance as
(13) 
where and .
Theorem 3.
For , there exists such that for any and , local SGD initialized at with any fixed stepsize, will output a point such that for a universal constant
(14) 
We defer a detailed proof of the theorem to Appendix D. Intuitively, it relies on the fact that for non-quadratic functions, the SGD updates are no longer linear as in Section 3, and the local SGD dynamics introduce an additional bias term which does not depend on $M$ (see Footnote 6), and scales poorly with $K$ and $R$. In fact, this phenomenon is not unique to our construction, and can be expected to exist for any “sufficiently” non-quadratic function. With our construction, the proof proceeds by showing that the suboptimality is large unless but local SGD introduces a bias which causes to “drift” in the negative direction by an amount proportional to the stepsize. On the other hand, optimizing the first term of the objective requires the stepsize to be relatively large. Combining these yields the first term of the lower bound. The second term is classical and holds even for first-order algorithms that compute stochastic gradients sequentially (Nemirovsky and Yudin, 1983).
Footnote 6: To see this, consider for example the univariate function where $z$ is some zero-mean bounded random variable. It is easy to verify that even if we have infinitely many machines ($M \to \infty$), running local SGD for a few iterations starting from the global minimum of $F$ will generally return a point bounded away from that minimum. In contrast, minibatch SGD under the same conditions will remain at the minimum.
In order to compare this lower bound with Theorem 2 and with minibatch SGD, we again consider the general convex setting with $H = B = \sigma^2 = 1$. Then, the lower bound reduces to $\Omega\left( \frac{1}{K^{2/3} R^{2/3}} + \frac{1}{\sqrt{MKR}} \right)$. Comparing this to Theorem 2, we see that our upper bound is tight up to a factor of $K^{1/3}$ in the optimization term. Furthermore, comparing this to the worst-case error of minibatch SGD (11), we see that local SGD is indeed worse than minibatch SGD in the worst case when $K$ is small enough relative to $R$. The crossover point is somewhere between $K = \sqrt{R}$ and $K = R$; for smaller $K$, minibatch SGD is better than local SGD in the worst case, and for larger $K$, local SGD is better in the worst case. Since the optimization terms of minibatch SGD and thumb-twiddling SGD are identical, this further indicates that local SGD is even outperformed by thumb-twiddling SGD in the small $K$ and large $M$ regime.
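The bias described in the footnote's thought experiment can be reproduced with a small simulation. The sketch below is only an illustration under assumptions: the univariate function $f(x; z) = \frac{1}{2} x^2 + \frac{1}{2} \max(x, 0)^2 + z x$ with $z = \pm 1$ equally likely is a stand-in chosen for this example (it is not claimed to be the function used in the paper), and the function names are hypothetical. Its expectation $F$ is piecewise quadratic with global minimum at $x = 0$.

```python
import numpy as np

def grad(x, z):
    """Gradient in x of the hypothetical f(x; z) = 0.5*x**2 + 0.5*max(x, 0)**2 + z*x."""
    return x + np.maximum(x, 0.0) + z

def two_local_steps(M, eta, rng):
    """M machines each take two local SGD steps from the minimum x = 0, then average."""
    x = np.zeros(M)
    for _ in range(2):
        z = rng.choice([-1.0, 1.0], size=M)   # independent zero-mean noise per machine
        x = x - eta * grad(x, z)
    return x.mean()

def two_minibatch_steps(M, eta, rng):
    """Two minibatch SGD steps from x = 0, each averaging M gradients at the shared iterate."""
    x = 0.0
    for _ in range(2):
        z = rng.choice([-1.0, 1.0], size=M)
        x = x - eta * grad(x, z).mean()
    return x
```

For this stand-in function, a short calculation gives an expected local SGD average of $-\eta^2 / 2$ after two steps, independently of `M`: machines that were pushed to the positive side see a steeper gradient on the way back than machines pushed to the negative side, so the average drifts negative. The minibatch iterate, by contrast, concentrates around the minimum as `M` grows, since the averaged noise vanishes.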
Finally, it is interesting to note that in the strongly convex case (where $\lambda > 0$), the gap between local SGD and minibatch SGD can be even more dramatic: in that case, the optimization term of minibatch SGD scales as (see Stich (2019) and references therein), while our theorem implies that local SGD cannot obtain a term better than . This implies an exponentially worse dependence on in that term, and a worse bound as long as .
In order to prove Theorem 3 we constructed an artificial, but easily analyzable, situation where we could prove analytically that local SGD is worse than minibatch. In Figure 1, we also demonstrate the behaviour empirically on a logistic regression task, by plotting the suboptimality of local SGD, minibatch SGD, and thumbtwiddling SGD iterates with optimally tuned stepsizes. As is predicted by Theorem 3, we see local SGD goes from performing worse than minibatch in the small regime, but improving relative to the other algorithms as increases to and then , when local SGD is far superior to minibatch. For each fixed , increasing causes thumbtwiddling SGD to improve relative to minibatch SGD, but does not have a significant effect on local SGD, which is consistent with local SGD introducing a bias which depends on but not on . This highlights that the “problematic regime” for local SGD is the regime with a relatively small number of iterations per round.
6 Future work
In this paper, we provided the first analysis of local SGD showing improvement over minibatch SGD in a natural setting, but also demonstrated that local SGD can sometimes be worse than minibatch SGD, and is certainly not optimal.
As can be seen from Table 1, our upper and lower bounds for local SGD are still not tight. The first term depends on versus —we believe the correct behaviour might be in between, namely , matching the bias of step SGD. The exact worst case behaviour of local SGD is therefore not yet resolved.
But beyond obtaining a precise analysis of local SGD, our paper highlights a more important challenge: we see that local SGD is definitely not optimal, and does not even always improve over minibatch SGD. Can we suggest an optimal algorithm in this setting? Or at least a method that combines the advantages of both local SGD and minibatch SGD and enjoys guarantees that dominate both? Our work motivates developing such an algorithm, which might also have benefits in regimes where local SGD is already better than minibatch SGD.
Answering this question will require new upper bounds and perhaps also new lower bounds. Looking to the analysis of local-AC-SA for quadratic objectives in Corollary 1, we might hope to design an algorithm which achieves error
(15) $O\left( \frac{HB^2}{K^2 R^2} + \frac{\sigma B}{\sqrt{MKR}} \right)$
for general convex objectives. That is, an algorithm which combines the optimization term for $KR$ steps of accelerated gradient descent with the optimal statistical term. If this were possible, it would match the lower bound of Woodworth et al. (2018) and therefore be optimal with respect to this communication structure.
Acknowledgements
This work is partially supported by NSFCCF/BSF award 1718970/2016741, NSFDMS 1547396, and a Google Faculty Research Award. BW is supported by a Google PhD Fellowship. Part of this work was done while NS was visiting Google. Work by SS was done while visiting TTIC.
References

 Coppola (2015) Greg Coppola. Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing. PhD thesis, The University of Edinburgh, 2015.
 Cotter et al. (2011) Andrew Cotter, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems 24, 2011.
 Dekel et al. (2012) Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
 Dieuleveut and Patel (2019) Aymeric Dieuleveut and Kumar Kshitij Patel. Communication trade-offs for local-SGD with large step size. In Advances in Neural Information Processing Systems, pages 13579–13590, 2019.
 Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 Ghadimi and Lan (2013) Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, ii: shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.
 Godichon-Baggioni and Saadane (2017) Antoine Godichon-Baggioni and Sofiane Saadane. On the rates of convergence of parallelized averaged stochastic gradient algorithms. arXiv preprint arXiv:1710.07926, 2017.
 Haddadpour et al. (2019a) Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Local sgd with periodic averaging: Tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems, pages 11080–11092, 2019a.
 Haddadpour et al. (2019b) Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Trading redundancy for communication: Speeding up distributed sgd for nonconvex optimization. In International Conference on Machine Learning, pages 2545–2554, 2019b.
 Jain et al. (2017) Prateek Jain, Praneeth Netrapalli, Sham M Kakade, Rahul Kidambi, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: minibatching, averaging, and model misspecification. The Journal of Machine Learning Research, 18(1):8258–8299, 2017.
 Kairouz et al. (2019) Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D’Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning, 2019.
 Khaled et al. (2019) Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Better communication complexity for local sgd. arXiv preprint arXiv:1909.04746, 2019.
 Li et al. (2014) Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient minibatch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670. ACM, 2014.
 Lin et al. (2018) Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large minibatches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.
 McMahan and Streeter (2010) H. Brendan McMahan and Matthew J. Streeter. Adaptive bound optimization for online convex optimization. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27–29, 2010, pages 244–256, 2010.
 McMahan et al. (2016) H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
 Nemirovsky and Yudin (1983) Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.
 Rosenblatt and Nadler (2016) Jonathan D Rosenblatt and Boaz Nadler. On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA, 5(4):379–404, 2016.
 Shamir and Srebro (2014) Ohad Shamir and Nathan Srebro. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 850–857. IEEE, 2014.
 Shamir et al. (2014) Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In International Conference on Machine Learning, pages 1000–1008, 2014.
 Simchowitz (2018) Max Simchowitz. On the randomized complexity of minimizing a convex quadratic function. arXiv preprint arXiv:1807.09386, 2018.
 Stich (2018) Sebastian U Stich. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018. URL https://arxiv.org/abs/1805.09767.
 Stich (2019) Sebastian U Stich. Unified optimal analysis of the (stochastic) gradient method. arXiv preprint arXiv:1907.04232, 2019.
 Stich and Karimireddy (2019) Sebastian U Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.
 Wang et al. (2017) Jialei Wang, Weiran Wang, and Nathan Srebro. Memory and communication efficient distributed stochastic optimization with minibatch-prox. arXiv preprint arXiv:1702.06269, 2017. URL https://arxiv.org/abs/1702.06269.
 Wang and Joshi (2018) Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.
 Woodworth et al. (2018) Blake Woodworth, Jialei Wang, Brendan McMahan, and Nathan Srebro. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. arXiv preprint arXiv:1805.10222, 2018. URL https://arxiv.org/abs/1805.10222.

 Yu et al. (2019) Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5693–5700, 2019.
 Zhang et al. (2016) Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365, 2016.
 Zhang et al. (2012) Yuchen Zhang, Martin J Wainwright, and John C Duchi. Communication-efficient algorithms for statistical optimization. In Advances in Neural Information Processing Systems, pages 1502–1510, 2012.

 Zhang et al. (2013a) Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression. In Conference on Learning Theory, pages 592–617, 2013a.
 Zhang et al. (2013b) Yuchen Zhang, John C Duchi, and Martin J Wainwright. Communication-efficient algorithms for statistical optimization. The Journal of Machine Learning Research, 14(1):3321–3363, 2013b.
 Zhou and Cong (2018) Fan Zhou and Guojing Cong. On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 3219–3227. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/447. URL https://doi.org/10.24963/ijcai.2018/447.
 Zinkevich et al. (2010) Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.
Appendix A Comparisons Between Existing Local SGD Analyses and Minibatch SGD
In this section, we describe the derivation of the entries in Table 1 for the cases in which it is not obvious. In particular, the previous analyses were stated under different assumptions (some stronger, some weaker than ours), which need to be reconciled with ours. Since local SGD is often analyzed in the strongly convex setting (or under weaker assumptions that are implied by strong convexity), we will make use of the following fact: If an algorithm guarantees error at most when applied to a strongly convex function, then we can apply the algorithm to in order to ensure error . This applies for any , so we can infer that the algorithm in fact guarantees error at most .
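As a sketch of why this reduction works, the following derivation uses illustrative notation that is assumed rather than taken from the text above: $\epsilon(\lambda)$ denotes the guarantee for $\lambda$-strongly convex objectives, $x_0$ is the initial point at which the added regularizer is assumed to be centered, $x^*$ is a minimizer of $F$, and $\hat{x}$ is the algorithm's output when run on the regularized objective $F_\lambda$.

```latex
\begin{align*}
F_\lambda(x) &:= F(x) + \frac{\lambda}{2}\,\|x - x_0\|^2, \\
F(\hat{x}) \;\le\; F_\lambda(\hat{x})
  &\le \min_x F_\lambda(x) + \epsilon(\lambda)
  \;\le\; F_\lambda(x^*) + \epsilon(\lambda)
  \;=\; F(x^*) + \frac{\lambda}{2}\,\|x_0 - x^*\|^2 + \epsilon(\lambda), \\
F(\hat{x}) - F(x^*) &\le \min_{\lambda > 0}\Big\{ \epsilon(\lambda) + \frac{\lambda}{2}\,\|x_0 - x^*\|^2 \Big\}.
\end{align*}
```

The last line follows because the preceding chain holds for every $\lambda > 0$. For guarantees that hold in expectation, the same chain applies after taking expectations. Note that $F_\lambda$ is smoother than $F$ by an additive $\lambda$ in the smoothness constant, so $\epsilon(\lambda)$ should be read as the guarantee evaluated with the parameters of $F_\lambda$.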
Since our purpose is to show that these analyses are dominated by minibatch SGD, the entries in the table are, in some sense, the most optimistic interpretation of the bounds stated in the paper. For example, if error is guaranteed for strongly convex functions, we actually enter into the table, which is a lower bound on the actual guarantee.
Reference  Setting  Best convergence rate (i.e., ) 
Stich (2018)  SC  
Stich (2018)  Non-SC  
Stich and Karimireddy (2019)  SC  
Stich and Karimireddy (2019)  Non-SC  
Khaled et al. (2019)  SC  
Khaled et al. (2019)  Non-SC 
For reference, we restate the worst-case guarantee of minibatch SGD:
(16) 
A.1 Stich (2018)
The paper makes the same assumptions as us but, in addition, assumes that the stochastic gradients are uniformly bounded, i.e. . We relax this assumption by noting the following,
(17)  
(18)  
(19)  
(20)  
(21) 
In the last step we make the optimistic assumption that the iterates stray no farther from than they were at initialization, i.e. . This may not hold, so the resulting bound is optimistic. On the other hand, it is clear that one cannot generally upper bound any tighter than this in our setting. Since our goal is in any case to show that the analysis of Stich (2018) is deficient, we continue using the bound (21). This immediately gives the result for the strongly convex setting in Table 2. For the non-strongly convex setting we extend their result by optimizing each term separately as and ignoring the constants.
A.2 Stich and Karimireddy (2019)
The paper relaxes the convexity assumption by assuming that F is quasi-convex, i.e., . This condition can also hold for certain nonconvex functions and is implied by strong convexity. In addition, they assume smoothness of and multiplicative noise for the stochastic gradients, i.e., . The latter assumption is a relaxation of the uniform upper bound on the variance of the stochastic gradients, which we have assumed. Thus, to compare to their result, we set , upper bounding the stochastic variance by , and use the strong convexity constant instead of . For the non-strongly convex setting we use their rate, along with our uniform variance bound. They also use specific learning rate and averaging schedules to optimize their rates. Both of these rates are given in Table 2. For the general convex setting, we believe their dependence on is poor and is improved upon by our upper bound in Section 4.
A.3 Khaled et al. (2019)
The relevant analysis from Khaled et al. (2019) is given in their Corollary 2, which is their only analysis that upper bounds the error in terms of the objective function suboptimality and in the setting where each machine receives i.i.d. stochastic gradients. Their Corollary 2 states that when , the error is bounded by the expression in (22). (Footnote 7: There is a typo in their statement, which omits the factor of ( in their notation) from the numerator of the first term.)
(22) 
In the case where , it is clear that this is strictly worse than minibatch SGD since . However, consider the case of arbitrary , and , and suppose the guarantee of Khaled et al. (2019) is less than , in which case
(23) 
Consequently, (22) is either greater than or greater than . This does not mean that their upper bound is worse than minibatch SGD. However, it is worse than minibatch SGD unless .
If we examine what this regime corresponds to, we see that it is actually a trivial regime where steps of serial SGD, which achieves error , is actually better than minibatch SGD. That is, rather than implementing minibatch SGD distributed across the machines, we are better off just ignoring of the available machines and doing serial SGD. If this is really the right thing to do, then there was never any need for parallelism in the first place, and thus there is no reason to use local SGD, which performs no better than serial SGD in this case anyway.
A.4 Haddadpour et al. (2019a)
Haddadpour et al. (2019a) also analyze local SGD in a related setting (they assume the Polyak-Łojasiewicz condition, which is implied by strong convexity). However, in trying to adapt their Theorem 1 to our setting, it appears that there are some omitted conditions in the Theorem statement. In particular, choosing appears to be allowed by their hypotheses, yet this choice leads to an upper bound of when , which contradicts known lower bounds for deterministic first-order optimization. Since we are not sure what the actual requirements on are, we are unable to confirm what their analysis implies about our setting.
Appendix B Proofs from Section 3
See Theorem 1.
Proof.
We will show that the average of the iterates at any particular time evolves according to with a lower variance stochastic gradient, even though this average iterate is not explicitly computed by the algorithm at every step. It is easily confirmed from (6) that
(24)  
(25) 
where we used that is linear. We will now show that is an unbiased estimate of with variance bounded by . Therefore, is updated exactly according to with a lower-variance stochastic gradient.
By the linearity of and
(26) 
Furthermore, since the on each machine are independent, and ,
(27) 
∎
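The argument above can be checked numerically in a simple special case. The sketch below is only an illustration under stated assumptions, not the construction used in the proof: it takes plain SGD with a constant step size as the linear update algorithm, a scalar quadratic objective 0.5*a*x^2 - b*x, and additive Gaussian gradient noise. The function names and all parameter values are hypothetical. It compares the average of the machines' final iterates with a single SGD run whose gradient noise at each step is the average of the machines' noise.

```python
import random


def local_sgd_average(a, b, x0, eta, noises):
    """Run SGD independently on each machine and average the final iterates.

    noises[m][k] is the additive gradient noise on machine m at step k.
    The objective is the scalar quadratic 0.5*a*x**2 - b*x (an assumption).
    """
    finals = []
    for machine_noise in noises:
        x = x0
        for xi in machine_noise:
            grad = a * x - b + xi  # stochastic gradient, linear in x
            x = x - eta * grad
        finals.append(x)
    return sum(finals) / len(finals)


def averaged_noise_sgd(a, b, x0, eta, noises):
    """Run one SGD sequence whose step-k noise is the machines' average noise."""
    num_machines = len(noises)
    num_steps = len(noises[0])
    x = x0
    for k in range(num_steps):
        xi_bar = sum(noises[m][k] for m in range(num_machines)) / num_machines
        x = x - eta * (a * x - b + xi_bar)
    return x


rng = random.Random(0)
M, K = 8, 50  # hypothetical number of machines and local steps
noises = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(M)]

avg_local = local_sgd_average(2.0, 1.0, 5.0, 0.1, noises)
single_run = averaged_noise_sgd(2.0, 1.0, 5.0, 0.1, noises)
gap = abs(avg_local - single_run)  # differs only by floating-point rounding
```

Because both the SGD update and the gradient of a quadratic are linear, the two quantities agree up to rounding error, and the averaged noise has variance smaller by a factor of M when the per-machine noise is independent.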
See Corollary 1.
Proof.
It is easily confirmed that SGD and AC-SA (Ghadimi and Lan, 2013) are linear update algorithms, which allows us to apply Theorem 1. In addition, Simchowitz (2018) shows that any randomized algorithm that accesses a deterministic first-order oracle at most times will have error at least in the worst case for an smooth, convex quadratic objective, for some universal constant . Therefore, the first term of local AC-SA’s guarantee cannot be improved. The second term of the guarantee also cannot be improved (Nemirovsky and Yudin, 1983); in fact, this term cannot be improved even by an algorithm which is allowed to make sequential calls to a stochastic gradient oracle. ∎
Appendix C Proof of Theorem 2
Before we prove Theorem 2, we will introduce some notation. Recall that the objective is of the form . Let denote the stepsize used for the th overall iteration. Let denote the th iterate on the th machine, and let denote the averaged th iterate. The vector may not actually be computed by the algorithm, but it will be central to our analysis. We will use to denote the stochastic gradient computed at by the th machine at iteration , and will denote the average of the stochastic gradients computed at time . Finally, let denote the average of the full gradients computed at the individual iterates.
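To make this notation concrete, the following sketch runs local SGD on a scalar objective and records the averaged iterate at every step, including steps at which the algorithm itself would not compute it. It is a minimal illustration under assumptions that are not from the text: a constant step size (the analysis allows a step size that depends on the iteration), additive Gaussian gradient noise, and a hypothetical smooth, strongly convex objective F(x) = log(1 + e^x) + 0.05*x^2. The names `local_sgd` and `grad_F` are illustrative.

```python
import math
import random


def local_sgd(grad, x0, M, K, R, eta, sigma, seed=0):
    """Local SGD with M machines, R rounds, and K local steps per round.

    Returns the final per-machine iterates and the list of averaged
    iterates for t = 0, ..., K*R. The average is only communicated
    when t is a multiple of K; in between it is a virtual quantity.
    """
    rng = random.Random(seed)
    xs = [x0] * M
    averaged = [x0]
    for t in range(1, K * R + 1):
        # Each machine takes one SGD step with its own independent noise.
        xs = [x - eta * (grad(x) + rng.gauss(0.0, sigma)) for x in xs]
        if t % K == 0:
            # Communication round: every machine restarts from the average.
            x_bar = sum(xs) / M
            xs = [x_bar] * M
        averaged.append(sum(xs) / M)
    return xs, averaged


def grad_F(x):
    """Gradient of the assumed objective log(1 + e^x) + 0.05 * x**2."""
    return 1.0 / (1.0 + math.exp(-x)) + 0.1 * x


xs, averaged = local_sgd(grad_F, x0=3.0, M=4, K=5, R=10, eta=0.1, sigma=0.5)
```

After the final communication round all machines hold the same iterate, while `averaged` traces the sequence that the analysis follows at every step.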
Lemma 1 (see Lemma 3.1 of Stich (2018)).
Let be smooth and strongly convex, let , and let , then the iterates of local SGD satisfy