Better Communication Complexity for Local SGD

We revisit the local Stochastic Gradient Descent (local SGD) method and prove new convergence rates. We close the gap in the theory by showing that it works under unbounded gradients, and extend its convergence to weakly convex functions. Furthermore, by changing the assumptions, we manage to get new bounds that explain in which regimes local SGD is faster than its non-local version. For instance, if the objective is strongly convex, we show that, up to constants, it is sufficient to synchronize M times in total, where M is the number of nodes. This improves upon the known requirement of Stich (2018) of √(TM) synchronizations in total, where T is the total number of iterations, and helps to explain the empirical success of local SGD.




1 Introduction

Big data optimization problems arising in machine learning and statistics, such as the training of supervised learning models, are routinely solved in a distributed manner on a cluster of compute nodes Ben-Nun and Hoefler [2018]. Distributed optimization algorithms are typically iterative methods alternating between local computations performed on the nodes and expensive communication steps involving all or a subset of the nodes. Due to the need to solve such problems more efficiently, there has been a lot of recent interest in understanding the trade-offs between communication and computation, a concern which is particularly important in the federated learning setting; see Konečný et al. [2016], Caldas et al. [2018], McMahan et al. [2016].

Minibatch SGD. A popular method for solving unconstrained smooth optimization problems of the form

$$\min_{x \in \mathbb{R}^d} f(x) \qquad (1)$$

in situations when the computation of the gradient of $f$ is expensive is minibatch SGD Dekel et al. [2010], Gower et al. [2019]:

$$x_{k+1} = x_k - \gamma_k \frac{1}{M} \sum_{m=1}^{M} g_k^m. \qquad (2)$$

Here $\gamma_k > 0$ is the stepsize used at time $k$, and each $g_k^m$ is an unbiased estimator of the gradient: $\mathbb{E}\left[g_k^m \mid x_k\right] = \nabla f(x_k)$. In a typical parameter server setup, the stochastic gradients $g_k^1, \dots, g_k^M$ are computed in parallel by all (or a subset of) nodes $m = 1, \dots, M$ and communicated to a parameter server, which subsequently performs the update (2) and communicates the result back to the nodes, and the process is repeated. As $M$ grows, the variance of $\frac{1}{M} \sum_{m=1}^{M} g_k^m$ as an estimator of the gradient decreases, which leads to a decrease in the overall number of communications needed to obtain a solution of sufficient quality.
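This variance reduction is easy to verify numerically. The sketch below is our own construction (a toy quadratic $f(x) = x^2/2$ with Gaussian gradient noise, not anything from the paper); it estimates the variance of the averaged estimator for several values of $M$ and compares it with the predicted $\sigma^2 / M$ scaling:

```python
# Sketch (synthetic example): the variance of the averaged estimator
# (1/M) * sum_m g^m decreases as 1/M.
import random

random.seed(0)

def noisy_grad(x, sigma=1.0):
    """Unbiased stochastic gradient of f(x) = x^2 / 2: true gradient x plus noise."""
    return x + random.gauss(0.0, sigma)

x = 1.0
for M in (1, 4, 16, 64):
    # Empirical variance of the minibatch estimator over 20000 trials.
    samples = [sum(noisy_grad(x) for _ in range(M)) / M for _ in range(20000)]
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    print(f"M={M:3d}  empirical variance ~ {var:.4f}  (theory: {1.0 / M:.4f})")
```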

Local SGD. Note that (2) can equivalently be written in the form

$$x_{k+1} = \frac{1}{M} \sum_{m=1}^{M} \left( x_k - \gamma_k g_k^m \right),$$

which leads to the alternative interpretation of minibatch SGD as averaging the results of a single SGD step performed by all $M$ nodes, each initiated from the same starting point $x_k$. This simple observation immediately leads to the natural question: can we gain by performing more than a single step of SGD on each node before averaging? By performing what we hope will be useful additional computation locally on the nodes before the expensive aggregation is done, we hope to decrease the number of communication rounds needed. We have just described the local SGD method, formalized as Algorithm 1.

0:  Stepsize $\gamma > 0$, initial vector $x_0^m = x_0$ for all $m \in [M]$, synchronization times $0 = t_0 \le t_1 \le t_2 \le \dots$.
1:  for $t = 0, 1, \dots$ do
2:     for $m = 1, \dots, M$ in parallel do
3:        Sample $g_t^m$ such that $\mathbb{E}\left[ g_t^m \mid x_t^m \right] = \nabla f(x_t^m)$.
4:        $x_{t+1}^m \leftarrow \frac{1}{M} \sum_{j=1}^{M} \left( x_t^j - \gamma g_t^j \right)$ if $t + 1 \in \{t_p\}_{p \ge 1}$, and $x_{t+1}^m \leftarrow x_t^m - \gamma g_t^m$ otherwise.
5:     end for
6:  end for
Algorithm 1 Local SGD
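To make the structure of Algorithm 1 concrete, here is a minimal simulation on the toy objective $f(x) = x^2/2$. This is our own sketch, not the authors' code; in particular, the name `sync_every` is ours and plays the role of the gap between consecutive synchronization times $t_p$:

```python
# Sketch of Algorithm 1 on f(x) = x^2/2 with additive Gaussian gradient
# noise; all names (local_sgd, sync_every, ...) are ours, not the paper's.
import random

random.seed(1)

def noisy_grad(x, sigma=0.1):
    # Unbiased stochastic gradient of f(x) = x^2 / 2 (cf. Assumption 2).
    return x + random.gauss(0.0, sigma)

def local_sgd(M=8, T=200, gamma=0.1, sync_every=10, x0=1.0):
    x = [x0] * M                       # x_0^m = x_0 for all nodes m
    for t in range(T):
        x = [xm - gamma * noisy_grad(xm) for xm in x]   # local SGD steps
        if (t + 1) % sync_every == 0:  # t+1 is a synchronization time
            avg = sum(x) / M
            x = [avg] * M              # replace local iterates by the average
    return sum(x) / M                  # final averaged iterate

print(local_sgd())  # lands in a small neighborhood of the minimizer x* = 0
```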

2 Contributions

While local SGD has been popular among practitioners for a long time Coppola [2015], McDonald et al. [2010], its theoretical understanding has remained elusive until very recently Zhou and Cong [2018], Stich [2018], Yu et al. [2018], Wang and Joshi [2018], Jiang and Agrawal [2018], Basu et al. [2019] (see Table 1). The history of the method goes back to the convergence proof in the early work of Mangasarian [1995], but a tight convergence rate has been missing since then. Although most existing works focus on analyzing local SGD for smooth and nonconvex objectives, there are no analyses specialized to the smooth convex case, and only two papers provide bounds in the smooth strongly convex case.

In this paper we obtain the first result explicitly covering the convex case, and improve dramatically upon the best known communication complexity result in the strongly convex case (see the last row of Table 1). Moreover, unlike previous results in the strongly convex case, which rely on a restrictive gradient boundedness assumption, our results do not have this flaw.

strongly convex / weakly convex
Zhou & Cong, 8/2017, Zhou and Cong [2018]
Stich, 5/2018, Stich [2018]
Yu et al., 7/2018, Yu et al. [2018]
Wang & Joshi, 8/2018, Wang and Joshi [2018]
Jiang & Agrawal, 12/2018, Jiang and Agrawal [2018]
Basu et al., 6/2019, Basu et al. [2019]
THIS WORK, 9/2019
Table 1: Existing theoretical bounds for local SGD. Each entry gives the minimum number of communication steps required over $T$ iterations to achieve a linear speedup in the number of nodes $M$.

An overview of related work on local stochastic gradient methods is given in Table 1.

2.1 Setting and Contributions

In this work we consider the minimization problem (1) under the following assumptions:

Assumption 1 (Smoothness and convexity).

We assume $f$ is $L$-smooth and $\mu$-strongly convex (we allow $\mu = 0$). That is, for all $x, y \in \mathbb{R}^d$ we have:

$$\frac{\mu}{2} \|x - y\|^2 \le f(x) - f(y) - \langle \nabla f(y), x - y \rangle \le \frac{L}{2} \|x - y\|^2.$$

Assumption 2.

The stochastic gradients are unbiased estimates of the true gradient with uniformly bounded variance when conditioned on $x_t^m$:

$$\mathbb{E}\left[ g_t^m \mid x_t^m \right] = \nabla f(x_t^m), \qquad \mathbb{E}\left[ \| g_t^m - \nabla f(x_t^m) \|^2 \mid x_t^m \right] \le \sigma^2.$$

Note that Assumption 2 is less restrictive than the bounded gradients assumption ($\mathbb{E}\|g_t^m\|^2 \le G^2$) used in several previous analyses, as shown in Table 1. Under this setting, the main contributions of this paper are:

  1. If $f$ is strongly convex ($\mu > 0$), then by properly choosing stepsizes and taking a suitable average $\hat{x}_T$ of the local iterates $x_t^m$, we can obtain $\mathbb{E}[f(\hat{x}_T)] - f(x_*) \le \epsilon$ with a total number of communication rounds that is, up to polylogarithmic factors, proportional to $M$ and independent of the total number of iterations $T$; here $\tilde{O}(\cdot)$ indicates possibly ignoring polylogarithmic factors. This tightens the previous analysis of Stich [2018], where $\Omega(\sqrt{TM})$ communication rounds were required.

  2. Furthermore, if $f$ is (possibly weakly) convex ($\mu = 0$), then we can guarantee $\mathbb{E}[f(\hat{x}_T)] - f(x_*) \le \epsilon$ provided that the synchronization intervals are chosen as in Theorem 2.

  3. We support our analysis by experiments illustrating the behavior of the algorithm.
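To get a feel for the size of this improvement, the following back-of-the-envelope computation (with illustrative values of $T$ and $M$; constants and logarithmic factors are ignored) compares the $\sqrt{TM}$ synchronizations of Stich [2018] with the $M$ synchronizations shown sufficient here:

```python
# Illustrative comparison of communication complexities (constants and
# logarithmic factors ignored; M and T are example values only).
import math

M = 16                                  # number of nodes (example value)
for T in (10**4, 10**6, 10**8):         # total number of iterations
    stich = math.isqrt(T * M)           # ~sqrt(T*M) rounds (Stich, 2018)
    ours = M                            # ~M rounds (this work)
    print(f"T = {T:>9}: sqrt(T*M) = {stich:>6}, M = {ours}")
```

The gap grows without bound as $T$ increases, since the new bound does not depend on $T$ at all.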

3 Convergence Theory

We denote the sequence of time stamps when synchronization happens as $0 = t_0 \le t_1 \le t_2 \le \dots$. The average of all local iterates is $\hat{x}_t := \frac{1}{M} \sum_{m=1}^{M} x_t^m$ and that of the gradients is $g_t := \frac{1}{M} \sum_{m=1}^{M} g_t^m$. We define the set $[M] := \{1, 2, \dots, M\}$.

Lemma 1.

Choose a sufficiently small stepsize $\gamma > 0$. Under Assumptions 1 and 2, we have for Algorithm 1,


where .

Theorem 1.

Suppose that Assumptions 1 and 2 hold with $\mu > 0$. Then, combining Lemma 1 with techniques from Stich [2018], we can conclude that for a sufficiently small constant stepsize $\gamma$ we have for Algorithm 1,


where .

Corollary 1.

Choosing the stepsize and the total number of steps $T$ appropriately, then substituting in (6) and after some algebraic manipulation, we can conclude that,


where $\tilde{O}(\cdot)$ ignores logarithmic factors. We see that a suitable choice of the synchronization interval recovers the convergence rate of minibatch SGD up to polylogarithmic factors, and the number of communications is then, up to logarithmic factors, proportional to $M$.

Using similar proof techniques, we can show the following result for weakly convex functions:

Theorem 2.

Suppose that Assumptions 1 and 2 hold with $\mu = 0$, that a sufficiently small constant stepsize $\gamma$ is chosen, and that Algorithm 1 is run with up to $H$ local iterations between synchronizations. Then,

Corollary 2.

Choosing the stepsize appropriately and substituting in (9), we have,


We see that with a suitable choice of the synchronization interval we recover the same convergence rate as minibatch SGD with a reduced number of communication steps.

4 Experiments

We run experiments on an $\ell_2$-regularized logistic regression problem with several nodes, each using an Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz core. We use the 'a9a' dataset from the LIBSVM library Chang and Lin [2011] and set the penalty as a function of the dataset size $n$. The code was written in Python using MPI Dalcin et al. [2011].

We ran two experiments, with two different stepsizes and minibatch size equal to 1. In both cases we observe convergence to a neighborhood of the solution, although of a different radius. Since we run the experiments on a single machine, communication is very cheap and there is little gain in the time required for convergence. However, the advantage in terms of required communication rounds is self-evident and can lead to a significant time improvement under slow communication networks.
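For readers who wish to replicate the qualitative behavior, here is a self-contained sketch of the setup on synthetic data. The paper itself uses the 'a9a' dataset and MPI; the dataset, penalty `lam`, stepsize, and all names below are our own illustrative choices:

```python
# Minimal sketch of local SGD on l2-regularized logistic regression with
# synthetic separable data; dataset, penalty and stepsize are illustrative.
import random, math

random.seed(2)
n, d, M = 200, 5, 4                     # samples, features, nodes
lam = 1.0 / n                           # illustrative l2 penalty

# Synthetic dataset: features in [-1, 1], labels from a random hyperplane.
w_true = [random.uniform(-1, 1) for _ in range(d)]
A = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
b = [1.0 if sum(a_j * w_j for a_j, w_j in zip(a, w_true)) > 0 else -1.0
     for a in A]

def grad(x, i):
    """Stochastic gradient at sample i of the regularized logistic loss."""
    z = b[i] * sum(a_j * x_j for a_j, x_j in zip(A[i], x))
    coef = -b[i] / (1.0 + math.exp(z))
    return [coef * a_j + lam * x_j for a_j, x_j in zip(A[i], x)]

def local_sgd(T=2000, gamma=0.5, sync_every=20):
    xs = [[0.0] * d for _ in range(M)]          # one local iterate per node
    for t in range(T):
        for x in xs:
            g = grad(x, random.randrange(n))    # minibatch size 1
            for j in range(d):
                x[j] -= gamma * g[j]
        if (t + 1) % sync_every == 0:           # synchronize: average nodes
            avg = [sum(x[j] for x in xs) / M for j in range(d)]
            xs = [avg[:] for _ in range(M)]
    return [sum(x[j] for x in xs) / M for j in range(d)]

x_hat = local_sgd()
acc = sum((sum(a_j * x_j for a_j, x_j in zip(a, x_hat)) > 0) == (bi > 0)
          for a, bi in zip(A, b)) / n
print(f"training accuracy ~ {acc:.2f}")
```

Varying `sync_every` reproduces the qualitative trade-off discussed above: more local steps mean fewer synchronizations for a comparable final accuracy.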

Figure 1: Results on regularized logistic regression, 'a9a' dataset, with the larger stepsize. All choices of the number of local iterations converge to a neighborhood within a small number of communication rounds due to the large stepsize.
Figure 2: Results on regularized logistic regression, 'a9a' dataset, with the smaller stepsize. With more local iterations, fewer communication rounds are required to get to a neighborhood of the solution.


Appendix A Basic Facts and Notation

We denote the sequence of time stamps when synchronization happens as $0 = t_0 \le t_1 \le t_2 \le \dots$. Given stochastic gradients $g_t^1, \dots, g_t^M$ at time $t$ we define their average $g_t := \frac{1}{M} \sum_{m=1}^{M} g_t^m$.

Throughout the proofs, we will use the variance decomposition that holds for any random vector $X$ with finite second moment:

$$\mathbb{E}\|X\|^2 = \|\mathbb{E} X\|^2 + \mathbb{E}\|X - \mathbb{E} X\|^2. \qquad (11)$$

In particular, its version for vectors $x_1, \dots, x_M$ with a finite number of values, each attained with probability $\frac{1}{M}$, gives

$$\frac{1}{M} \sum_{m=1}^{M} \|x_m\|^2 = \left\| \frac{1}{M} \sum_{m=1}^{M} x_m \right\|^2 + \frac{1}{M} \sum_{m=1}^{M} \left\| x_m - \frac{1}{M} \sum_{j=1}^{M} x_j \right\|^2.$$

As a consequence of (11) we have that,

$$\mathbb{E}\|X - \mathbb{E} X\|^2 \le \mathbb{E}\|X\|^2.$$
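The finite-valued version of this decomposition can be checked numerically; the following sketch (our own example with random vectors) verifies that the identity holds up to floating-point error:

```python
# Numerically check: (1/M) sum ||x_m||^2 = ||xbar||^2 + (1/M) sum ||x_m - xbar||^2.
import random

random.seed(3)
M, d = 5, 3
xs = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(M)]
xbar = [sum(x[j] for x in xs) / M for j in range(d)]

sq = lambda v: sum(c * c for c in v)   # squared Euclidean norm
lhs = sum(sq(x) for x in xs) / M
rhs = sq(xbar) + sum(sq([xj - bj for xj, bj in zip(x, xbar)]) for x in xs) / M
print(abs(lhs - rhs) < 1e-12)  # the identity holds exactly
```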

Proposition 1 (Jensen’s inequality).

For any convex function $h$ and any vectors $x_1, \dots, x_M$ we have

$$h\left( \frac{1}{M} \sum_{m=1}^{M} x_m \right) \le \frac{1}{M} \sum_{m=1}^{M} h(x_m).$$

We denote the Bregman divergence associated with a function $f$ and arbitrary $x, y \in \mathbb{R}^d$ as

$$D_f(x, y) := f(x) - f(y) - \langle \nabla f(y), x - y \rangle.$$

Proposition 2.

If $f$ is $L$-smooth and convex, then for any $x$ and $y$ it holds that

$$\|\nabla f(x) - \nabla f(y)\|^2 \le 2 L D_f(x, y).$$

If $f$ satisfies Assumption 1, then

$$D_f(x, y) \ge \frac{\mu}{2} \|x - y\|^2.$$


Appendix B Proof of Lemma 1


Let be such that . Recall that for a time such that we have and . Hence for the expectation conditional on we have:

Averaging both sides and letting , we have


Now note that by expanding the square we have,


We decompose the first term in the last equality again by expanding the square,

Plugging this into (15) we have,

Now average over :

where we used that by definition . Hence,


Now note that for the first term in (16) we have by Assumption 2,


For the second term in (16) we have

Averaging over ,

where we used the fact that , which comes from the linearity of expectation. Now we bound in the last inequality by smoothness and then use that Jensen’s inequality implies ,


Plugging in (18) and (17) into (16) we have,


Plugging (19) into (14), we get


Using that we can conclude,

Taking expectations and iterating the above inequality,

It remains to notice that by assumption we have . ∎

Appendix C Two More Lemmas

Lemma 2.

Stich [2018]. Let $x_t^m$ be the iterates generated by Algorithm 1. Suppose that $f$ satisfies Assumption 1 and that the stepsize $\gamma$ is sufficiently small. Then,


This is Lemma 3.1 in Stich [2018]. ∎

Lemma 3.

Suppose that Assumption 2 holds. Then,


This is Lemma 3.2 in Stich [2018]. Because the stochastic gradients are independent we have that the variance of their sum is the sum of their variances, hence

Appendix D Proof of Theorem 1


Combining Lemma 2 and Lemma 3, we have


Using Lemma 1 we can upper bound the term in :

Letting and we have,

Recursing the above inequality we have,

Using that we have,

which is the claim of this theorem. ∎

Appendix E Proof of Theorem 2


Let , then putting in Lemma 2 and combining it with Lemma 3, we have

Further using Lemma 1,

Rearranging we have,

Averaging the above inequality over the iterations $t$,


By Jensen’s inequality we have . Using this in (23) we have,

Dividing both sides by yields the theorem’s claim. ∎