# Better Communication Complexity for Local SGD

We revisit the local Stochastic Gradient Descent (local SGD) method and prove new convergence rates. We close a gap in the theory by showing that it works under unbounded gradients, and we extend its convergence to weakly convex functions. Furthermore, by changing the assumptions, we manage to get new bounds that explain in what regimes local SGD is faster than its non-local version. For instance, if the objective is strongly convex, we show that, up to constants, it is sufficient to synchronize M times in total, where M is the number of nodes. This improves upon the known requirement of Stich (2018) of √(TM) synchronization times in total, where T is the total number of iterations, and helps to explain the empirical success of local SGD.


### 1 Introduction

Big data optimization problems arising in machine learning and statistics, such as the training of supervised learning models, are routinely solved in a distributed manner on a cluster of compute nodes Ben-Nun and Hoefler [2018]. Distributed optimization algorithms are typically iterative methods alternating between local computations performed on the nodes and expensive communication steps involving all or a subset of the nodes. Due to the need to solve such problems more efficiently, there has been a lot of recent interest in understanding the trade-offs between communication and computation, a concern which is particularly important in the federated learning setting; see Konečný et al. [2016], Caldas et al. [2018], McMahan et al. [2016].

Minibatch SGD. A popular method for solving unconstrained smooth optimization problems of the form

 \min_{x \in \mathbb{R}^d} f(x) (1)

in situations when the computation of the gradient of f is expensive is minibatch SGD Dekel et al. [2010], Gower et al. [2019]:

 x_{t+1} = x_t - \frac{\gamma_t}{M}\sum_{m=1}^{M} g_t^m. (2)

Here γ_t is the stepsize used at time t and g_t^m is an unbiased estimator of the gradient: E[g_t^m] = ∇f(x_t). In a typical parameter server setup, stochastic gradients g_t^m are computed in parallel by all (or a subset of) nodes m ∈ [M], communicated to a parameter server, which subsequently performs the update (2) and communicates it to the nodes, and the process is repeated. As M grows, the variance of \frac{1}{M}\sum_{m=1}^{M} g_t^m as an estimator of the gradient decreases, which leads to a decrease in the overall number of communications needed to obtain a solution of sufficient quality.
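To illustrate this variance reduction numerically (a standalone sketch, not part of the paper; all names below are hypothetical), one can compare the squared error of a single stochastic gradient with that of an average of M independent ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, trials = 10, 50, 1000
true_grad = rng.standard_normal(d)  # stands in for the exact gradient
sigma = 1.0

def noisy_grads(size):
    # `size` independent unbiased gradient estimates: truth plus zero-mean noise.
    return true_grad + sigma * rng.standard_normal((size, d))

# Mean squared error of one stochastic gradient vs. the average of M of them.
single = np.mean([np.sum((noisy_grads(1)[0] - true_grad) ** 2) for _ in range(trials)])
batch = np.mean([np.sum((noisy_grads(M).mean(axis=0) - true_grad) ** 2) for _ in range(trials)])
print(single / batch)  # roughly M = 50
```

The error ratio concentrates around M, which is exactly why averaging over more nodes permits fewer communication rounds of the same quality.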

Local SGD. Note that (2) can equivalently be written in the form

 x_{t+1} = \frac{1}{M}\sum_{m=1}^{M}\left(x_t - \gamma_t g_t^m\right),

which leads to the alternative interpretation of minibatch SGD as averaging the results of a single SGD step performed by all nodes, initiated from the same starting point x_t. This simple observation immediately leads to the natural question: can we gain by performing more than a single step of SGD on each node before averaging? By performing what we hope will be useful additional computation locally on the nodes before expensive aggregation is done, we hope to decrease the number of communication rounds needed. We have just described the local SGD method, formalized as Algorithm 1.
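A minimal single-process sketch of this scheme (the gradient oracle, stepsize, and synchronization period below are illustrative placeholders, not the paper's implementation):

```python
import numpy as np

def local_sgd(grad_oracle, x0, gamma, T, M, H, rng):
    """Local SGD: M workers take independent SGD steps and average every H steps."""
    x = np.tile(x0, (M, 1))              # one local iterate per node
    for t in range(1, T + 1):
        for m in range(M):
            x[m] -= gamma * grad_oracle(x[m], rng)   # local stochastic step
        if t % H == 0:                   # communication round: synchronize
            x[:] = x.mean(axis=0)
    return x.mean(axis=0)                # final averaged iterate

# Toy strongly convex problem f(x) = ||x||^2 / 2 with noisy gradients x + noise.
rng = np.random.default_rng(1)
oracle = lambda x, r: x + 0.1 * r.standard_normal(x.shape)
x_hat = local_sgd(oracle, np.full(5, 10.0), gamma=0.1, T=200, M=8, H=10, rng=rng)
print(np.linalg.norm(x_hat))  # small: the iterates reach a noise-dominated neighborhood of 0
```

Setting H = 1 recovers minibatch SGD (2), while larger H trades communication for extra local computation.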

### 2 Contributions

While local SGD has been popular among practitioners for a long time Coppola [2015], McDonald et al. [2010], its theoretical understanding has remained elusive until very recently Zhou and Cong [2018], Stich [2018], Yu et al. [2018], Wang and Joshi [2018], Jiang and Agrawal [2018], Basu et al. [2019] (see Table 1). The history of the method goes back to the convergence proof in the early work of Mangasarian [1995], but a tight convergence rate has been missing since then. Although most existing works focus on analyzing local SGD for smooth and nonconvex f, there are no analyses specialized to the smooth convex case, and only two papers provide bounds in the smooth strongly convex case.

In this paper we obtain the first result explicitly covering the convex case, and improve dramatically upon the best known communication complexity result in the strongly convex case (see the last row of Table 1). Moreover, unlike previous results in the strongly convex case that depend on a restrictive gradient boundedness assumption, our results do not have this flaw.

An overview of related work on local stochastic gradient methods is given in Table 1.

#### 2.1 Setting and Contributions

In this work we consider minimization problem (1) under the following assumptions:

###### Assumption 1 (Smoothness and convexity).

We assume f is L-smooth and μ-strongly convex (we allow μ = 0, in which case f is merely convex). That is, for all x, y ∈ ℝ^d we have:

 f(y) + \langle \nabla f(y), x - y\rangle + \frac{\mu}{2}\|x - y\|^2 \le f(x) \le f(y) + \langle \nabla f(y), x - y\rangle + \frac{L}{2}\|x - y\|^2.
###### Assumption 2.

The stochastic gradients g_t^m are unbiased estimates of the true gradient with uniformly bounded variance when conditioned on x_t^m:

 E[g_t^m] = \nabla f(x_t^m) \quad\text{and}\quad E\left[\|g_t^m - \nabla f(x_t^m)\|^2\right] \le \sigma^2 \quad \text{for all } t \ge 0 \text{ and } m \in [M].

Note that Assumption 2 is less restrictive than the bounded gradients assumption (E[‖g_t^m‖²] ≤ G²) used in several previous analyses, as shown in Table 1. Under this setting, the main contributions of this paper are:

1. If f is strongly convex (μ > 0), then by properly choosing stepsizes and taking the average x̂_T of the local iterates, we can obtain E[‖x̂_T − x*‖²] ≤ ε when the total number of iterates T and the total number of communication rounds C satisfy:

 T = \tilde{\Omega}\left(\frac{\sigma^2}{\varepsilon M}\right) \quad\text{and}\quad C = \Omega(M), (3)

where \tilde{\Omega} indicates possibly ignoring polylogarithmic factors. This tightens the previous analysis of Stich [2018], where C = Ω(√(TM)) was required.

2. Furthermore, if f is (possibly weakly) convex (μ = 0), then we can guarantee E[f(x̄_T) − f(x*)] ≤ ε provided that:

 T = \Omega\left(\frac{\sigma^4}{M\varepsilon^2}\right) \quad\text{and}\quad C = \Omega\left(\sqrt{T}M^{3/2}\right). (4)
3. We support our analysis by experiments illustrating the behavior of the algorithm.

### 3 Convergence Theory

We denote the sequence of time stamps when synchronization happens as t_1 < t_2 < ⋯. The average of the local iterates is x̂_t := \frac{1}{M}\sum_{m=1}^{M} x_t^m and that of the stochastic gradients is g_t := \frac{1}{M}\sum_{m=1}^{M} g_t^m. We define the iterate variance V_t := \frac{1}{M}\sum_{m=1}^{M} \|x_t^m - \hat{x}_t\|^2.

###### Lemma 1.

Choose a stepsize γ > 0 such that γ ≤ \frac{1}{2L}. Under Assumptions 1 and 2 we have that for Algorithm 1,

 E[V_t] \le H\gamma^2\sigma^2, (5)

where H := \max_p (t_{p+1} - t_p) is the maximal number of local steps between consecutive synchronizations.

###### Theorem 1.

Suppose that Assumptions 1 and 2 hold with μ > 0. Then combining Lemma 1 with techniques from Stich [2018] we can conclude that for a constant stepsize γ such that γ ≤ \frac{1}{2L} we have for Algorithm 1 with at most H local steps between synchronizations,

 E\left[\|\hat{x}_T - x^*\|^2\right] \le (1-\gamma\mu)^T \|\hat{x}_0 - x^*\|^2 + \frac{\gamma\sigma^2}{\mu M} + \frac{2L\gamma^2 H\sigma^2}{\mu}, (6)

where x^* denotes the minimizer of f.

###### Corollary 1.

Choosing γ = \frac{2\log(a)}{\mu T} for a parameter a ≥ 2 satisfying a ≥ \frac{T}{2\log(a)}, with T large enough that γ ≤ \frac{1}{2L}, we take T steps. Then substituting in (6) and using that (1-\gamma\mu)^T \le \exp(-\gamma\mu T) = a^{-2} and some algebraic manipulation we can conclude that,

 E\left[\|\hat{x}_T - x^*\|^2\right] \le \frac{4\log^2(a)}{T^2}\|\hat{x}_0 - x^*\|^2 + \frac{2\sigma^2\log(a)}{\mu^2 MT} + \frac{8L\sigma^2 H\log^2(a)}{\mu^3 T^2} (7)
 = \tilde{O}\left(\frac{\|\hat{x}_0 - x^*\|^2}{T^2} + \frac{\sigma^2}{\mu^2 MT} + \frac{L\sigma^2 H}{\mu^3 T^2}\right), (8)

where \tilde{O} ignores logarithmic factors. We see that choosing H = O(T/M) recovers the same convergence rate as minibatch SGD up to polylogarithmic factors, and the number of communications is then C = T/H = Ω(M).

Using similar proof techniques, we can show the following result for weakly convex functions:

###### Theorem 2.

Suppose that Assumptions 1 and 2 hold with μ = 0, that a constant stepsize γ such that γ ≤ \frac{1}{2L} is chosen, and that Algorithm 1 is run with at most H local iterations between synchronizations. Then, for x̄_T := \frac{1}{T}\sum_{t=0}^{T-1}\hat{x}_t,

 E\left[f(\bar{x}_T) - f(x^*)\right] \le \frac{2}{\gamma T}\|x_0 - x^*\|^2 + \frac{2\gamma\sigma^2}{M} + 4\gamma^2 L H\sigma^2. (9)
###### Corollary 2.

Assume that T ≥ M. Choosing γ = \frac{\sqrt{M}}{4L\sqrt{T}} (which satisfies γ ≤ \frac{1}{2L}), then substituting in (9) we have,

 E\left[f(\bar{x}_T) - f(x^*)\right] \le \frac{8L\|x_0 - x^*\|^2}{\sqrt{MT}} + \frac{\sigma^2}{2L\sqrt{MT}} + \frac{\sigma^2 M H}{4LT}. (10)

We see that choosing H = O\left(\sqrt{T/M^3}\right) we recover the same convergence rate as minibatch SGD, and the number of communication steps is then C = T/H = \Omega\left(\sqrt{T}M^{3/2}\right).

### 4 Experiments

We run experiments on a regularized logistic regression problem with M nodes, each with one core of an Intel(R) Xeon(R) Gold 6146 CPU @ 3.20GHz. We use the 'a9a' dataset from the LIBSVM library Chang and Lin [2011], with an ℓ2 penalty chosen relative to the dataset size n. The code was written in Python using MPI Dalcin et al. [2011].

We ran two experiments, with two different stepsizes and minibatch size equal to 1. In both cases we observe convergence to a neighborhood of the solution, although of a different radius. Since we run the experiments on a single machine, communication is very cheap and there is little gain in the time required for convergence. However, the advantage in terms of required communication rounds is self-evident and can lead to a significant time improvement under slow communication networks.
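As a simplified, single-process stand-in for this setup (synthetic data instead of 'a9a', no MPI; all constants below are illustrative placeholders, not the paper's configuration), one can observe the same qualitative behavior:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, M, H, gamma = 2000, 20, 4, 8, 0.05
A = rng.standard_normal((n, d))
y = np.sign(A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))
lam = 1.0 / n  # illustrative regularization strength

def stoch_grad(x):
    i = rng.integers(n)  # minibatch of size 1
    margin = y[i] * (A[i] @ x)
    return -y[i] * A[i] / (1.0 + np.exp(margin)) + lam * x

def full_loss(x):
    return np.mean(np.log1p(np.exp(-y * (A @ x)))) + 0.5 * lam * (x @ x)

x = np.zeros((M, d))                     # one iterate per simulated node
for t in range(1, 2001):
    for m in range(M):
        x[m] -= gamma * stoch_grad(x[m])
    if t % H == 0:                       # communication round
        x[:] = x.mean(axis=0)
x_hat = x.mean(axis=0)
print(full_loss(np.zeros(d)), full_loss(x_hat))  # loss drops from log(2)
```

The averaged iterate settles into a stepsize-dependent neighborhood of the optimum, mirroring the behavior reported above.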

### Appendix A Basic Facts and Notation

We denote the sequence of time stamps when synchronization happens as t_1 < t_2 < ⋯. Given stochastic gradients g_t^1, …, g_t^M at time t we define

 g_t \overset{\text{def}}{=} \frac{1}{M}\sum_{m=1}^{M} g_t^m, \qquad \bar{g}_t^m \overset{\text{def}}{=} E[g_t^m] = \nabla f(x_t^m), \qquad \bar{g}_t \overset{\text{def}}{=} E[g_t].

Throughout the proofs, we will use the variance decomposition that holds for any random vector X with finite second moment:

 E[\|X\|^2] = E[\|X - E[X]\|^2] + \|E[X]\|^2.

In particular, its version for a vector chosen uniformly at random among a finite number of values x_1, \dots, x_M gives

 \frac{1}{M}\sum_{m=1}^{M} \|x_m - \bar{x}\|^2 = \frac{1}{M}\sum_{m=1}^{M} \|x_m\|^2 - \|\bar{x}\|^2, \quad\text{where } \bar{x} \overset{\text{def}}{=} \frac{1}{M}\sum_{m=1}^{M} x_m. (11)
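A quick numerical check of this finite version (a standalone sketch, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
M, d = 7, 4
xs = rng.standard_normal((M, d))
x_bar = xs.mean(axis=0)

# (1/M) * sum ||x_m - x_bar||^2  ==  (1/M) * sum ||x_m||^2 - ||x_bar||^2
lhs = np.mean(np.sum((xs - x_bar) ** 2, axis=1))
rhs = np.mean(np.sum(xs ** 2, axis=1)) - np.sum(x_bar ** 2)
print(np.isclose(lhs, rhs))  # True
```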

As a consequence of (11) we have the following:

###### Proposition 1 (Jensen’s inequality).

For any convex function f and any vectors x_1, …, x_M we have

 f\left(\frac{1}{M}\sum_{m=1}^{M} x_m\right) \le \frac{1}{M}\sum_{m=1}^{M} f(x_m).

We denote the Bregman divergence associated with a function f and arbitrary points x, y as

 D_f(x, y) \overset{\text{def}}{=} f(x) - f(y) - \langle \nabla f(y), x - y\rangle.

###### Proposition 2.

If f is L-smooth and convex, then for any x and y it holds that

 \|\nabla f(x) - \nabla f(y)\|^2 \le 2L\, D_f(x, y). (12)

If f satisfies Assumption 1, then

 f(y) + \langle \nabla f(y), x - y\rangle + \frac{\mu}{2}\|x - y\|^2 \le f(x) \quad \forall x, y \in \mathbb{R}^d. (13)

### Appendix B Proof of Lemma 1

###### Proof.

Let p be such that t_p ≤ t < t_{p+1}, so that no synchronization happens when moving from iteration t to t+1; hence x_{t+1}^m = x_t^m − γg_t^m and x̂_{t+1} = x̂_t − γg_t. For the expectation conditional on x_t^1, …, x_t^M we have:

 E\left[\|x_{t+1}^m - \hat{x}_{t+1}\|^2\right] = \|x_t^m - \hat{x}_t\|^2 + \gamma^2 E\left[\|g_t^m - g_t\|^2\right] - 2\gamma\langle x_t^m - \hat{x}_t, \nabla f(x_t^m)\rangle + 2\gamma\langle x_t^m - \hat{x}_t, \bar{g}_t\rangle.

Averaging both sides over m and using that \sum_{m=1}^{M}(x_t^m - \hat{x}_t) = 0, so that the last term vanishes, we have

 E[V_{t+1}] = V_t + \frac{\gamma^2}{M}\sum_{m=1}^{M} E\left[\|g_t^m - g_t\|^2\right] - \frac{2\gamma}{M}\sum_{m=1}^{M}\langle x_t^m - \hat{x}_t, \nabla f(x_t^m)\rangle. (14)

Now note that by expanding the square we have,

 E\left[\|g_t^m - g_t\|^2\right] = E\left[\|g_t^m - \bar{g}_t\|^2\right] + E\left[\|\bar{g}_t - g_t\|^2\right] + 2E\left[\langle g_t^m - \bar{g}_t, \bar{g}_t - g_t\rangle\right]. (15)

We decompose the first term in the last equality again by expanding the square, using that the cross term vanishes because E[g_t^m] = \bar{g}_t^m:

 E\left[\|g_t^m - \bar{g}_t\|^2\right] = E\left[\|g_t^m - \bar{g}_t^m\|^2\right] + \|\bar{g}_t^m - \bar{g}_t\|^2.

Plugging this into (15) we have,

 E\left[\|g_t^m - g_t\|^2\right] = E\left[\|g_t^m - \bar{g}_t^m\|^2\right] + \|\bar{g}_t^m - \bar{g}_t\|^2 + E\left[\|\bar{g}_t - g_t\|^2\right] + 2E\left[\langle g_t^m - \bar{g}_t, \bar{g}_t - g_t\rangle\right].

Now average over m:

 \frac{1}{M}\sum_m E\left[\|g_t^m - g_t\|^2\right] = \frac{1}{M}\sum_m E\left[\|g_t^m - \bar{g}_t^m\|^2\right] + \frac{1}{M}\sum_m \|\bar{g}_t^m - \bar{g}_t\|^2 + E\left[\|\bar{g}_t - g_t\|^2\right] - 2E\left[\|\bar{g}_t - g_t\|^2\right],

where we used that by definition \frac{1}{M}\sum_m g_t^m = g_t. Hence,

 \frac{1}{M}\sum_m E\left[\|g_t^m - g_t\|^2\right] = \frac{1}{M}\sum_m E\left[\|g_t^m - \bar{g}_t^m\|^2\right] + \frac{1}{M}\sum_m \|\bar{g}_t^m - \bar{g}_t\|^2 - E\left[\|\bar{g}_t - g_t\|^2\right] \le \frac{1}{M}\sum_m E\left[\|g_t^m - \bar{g}_t^m\|^2\right] + \frac{1}{M}\sum_m \|\bar{g}_t^m - \bar{g}_t\|^2. (16)

Now note that for the first term in (16) we have by Assumption 2,

 E\left[\|g_t^m - \bar{g}_t^m\|^2\right] \le \sigma^2. (17)

For the second term in (16), the variance decomposition (11) applied to the vectors \bar{g}_t^1, \dots, \bar{g}_t^M shifted by \nabla f(\hat{x}_t) gives

 \frac{1}{M}\sum_{m=1}^{M}\|\bar{g}_t^m - \bar{g}_t\|^2 = \frac{1}{M}\sum_m \|\bar{g}_t^m - \nabla f(\hat{x}_t)\|^2 - \|\bar{g}_t - \nabla f(\hat{x}_t)\|^2 \le \frac{1}{M}\sum_m \|\bar{g}_t^m - \nabla f(\hat{x}_t)\|^2,

where we used the fact that \bar{g}_t = \frac{1}{M}\sum_m \bar{g}_t^m, which comes from the linearity of expectation. Now we bound the right-hand side by smoothness via Proposition 2 and then use that Jensen's inequality implies f(\hat{x}_t) \le \frac{1}{M}\sum_m f(x_t^m):

 \frac{1}{M}\sum_m \|\bar{g}_t^m - \nabla f(\hat{x}_t)\|^2 = \frac{1}{M}\sum_m \|\nabla f(x_t^m) - \nabla f(\hat{x}_t)\|^2 \le \frac{1}{M}\sum_m 2L\left(f(\hat{x}_t) - f(x_t^m) - \langle \hat{x}_t - x_t^m, \nabla f(x_t^m)\rangle\right) \le \frac{2L}{M}\sum_m \langle x_t^m - \hat{x}_t, \nabla f(x_t^m)\rangle. (18)

Plugging (18) and (17) into (16) we have,

 \frac{1}{M}\sum_m E\left[\|g_t^m - g_t\|^2\right] \le \sigma^2 + \frac{2L}{M}\sum_m \langle x_t^m - \hat{x}_t, \nabla f(x_t^m)\rangle. (19)

Plugging (19) into (14), we get

 E[V_{t+1}] \le V_t + \gamma^2\sigma^2 - \frac{2\gamma(1-\gamma L)}{M}\sum_m \langle x_t^m - \hat{x}_t, \nabla f(x_t^m)\rangle \le (1 - \gamma(1-\gamma L)\mu)V_t + \gamma^2\sigma^2, (20)

where the second inequality holds because strong convexity and Jensen's inequality give \frac{1}{M}\sum_m \langle x_t^m - \hat{x}_t, \nabla f(x_t^m)\rangle \ge \frac{\mu}{2}V_t. Using that \gamma \le \frac{1}{2L}, so that 1 - \gamma L \ge \frac{1}{2}, we can conclude,

 E[V_{t+1}] \le \left(1 - \frac{\gamma\mu}{2}\right)V_t + \gamma^2\sigma^2 \le V_t + \gamma^2\sigma^2.

Taking expectations and iterating the above inequality from the last synchronization time t_p ≤ t,

 E[V_t] \le E[V_{t_p}] + \gamma^2\sigma^2(t - t_p) \le E[V_{t_p}] + \gamma^2\sigma^2 H.

It remains to notice that synchronization happens at time t_p, hence V_{t_p} = 0. ∎

### Appendix C Two More Lemmas

###### Lemma 2.

Stich [2018]. Let \{\hat{x}_t\} be the averaged iterates generated by Algorithm 1. Suppose that f satisfies Assumption 1 and that \gamma \le \frac{1}{2L}. Then,

 E\left[\|\hat{x}_{t+1} - x^*\|^2\right] \le (1-\gamma\mu)E\left[\|\hat{x}_t - x^*\|^2\right] + \gamma^2 E\left[\|g_t - \bar{g}_t\|^2\right] - \frac{\gamma}{2}E\left[D_f(\hat{x}_t, x^*)\right] + 2\gamma L\, E[V_t]. (21)
###### Proof.

This is Lemma 3.1 in Stich [2018]. ∎

###### Lemma 3.

Suppose that Assumption 2 holds. Then,

 E\left[\|g_t - \bar{g}_t\|^2\right] \le \frac{\sigma^2}{M}.
###### Proof.

This is Lemma 3.2 in Stich [2018]. Because the stochastic gradients are independent, the variance of their sum is the sum of their variances, hence

 E\left[\|g_t - \bar{g}_t\|^2\right] = \frac{1}{M^2}E\left[\left\|\sum_{m=1}^{M}\left(g_t^m - \bar{g}_t^m\right)\right\|^2\right] = \frac{1}{M^2}\sum_{m=1}^{M} E\left[\|g_t^m - \bar{g}_t^m\|^2\right] \le \frac{\sigma^2}{M}. ∎
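Numerically, this 1/M scaling can be checked with a quick Monte Carlo estimate (a standalone sketch, not part of the paper; for simplicity all nodes share the same mean gradient here):

```python
import numpy as np

rng = np.random.default_rng(4)
d, M, trials = 3, 25, 20000
grad = np.ones(d)   # stands in for the common mean gradient across nodes
sigma = 0.7

# Each trial draws M independent node gradients and averages them.
noise = sigma * rng.standard_normal((trials, M, d))
g = grad + noise.mean(axis=1)                    # averaged gradient g_t
emp = np.mean(np.sum((g - grad) ** 2, axis=1))   # estimate of E||g_t - mean||^2
print(emp, d * sigma**2 / M)  # the two values nearly agree
```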

### Appendix D Proof of Theorem 1

###### Proof.

Combining Lemma 2 and Lemma 3, we have

 E\left[\|\hat{x}_{t+1} - x^*\|^2\right] \le (1-\gamma\mu)E\left[\|\hat{x}_t - x^*\|^2\right] + \frac{\gamma^2\sigma^2}{M} - \frac{\gamma}{2}E\left[D_f(\hat{x}_t, x^*)\right] + 2\gamma L\, E[V_t]. (22)

Using Lemma 1 we can upper bound the term E[V_t] in (22):

 E\left[\|\hat{x}_{t+1} - x^*\|^2\right] \le (1-\gamma\mu)E\left[\|\hat{x}_t - x^*\|^2\right] + \frac{\gamma^2\sigma^2}{M} - \frac{\gamma}{2}E\left[D_f(\hat{x}_t, x^*)\right] + 2\gamma^3 L H\sigma^2.

Letting r_t = \hat{x}_t - x^* and dropping the (nonpositive) Bregman divergence term, we have

 E\left[\|r_{t+1}\|^2\right] \le (1-\gamma\mu)E\left[\|r_t\|^2\right] + \frac{\gamma^2\sigma^2}{M} + 2\gamma^3 L H\sigma^2.

Recursing the above inequality we have,

 E\left[\|r_T\|^2\right] \le (1-\gamma\mu)^T E\left[\|r_0\|^2\right] + \left(\sum_{t=0}^{T-1}(1-\gamma\mu)^t\right)\left(\frac{\gamma^2\sigma^2}{M} + 2\gamma^3 L H\sigma^2\right).

Using that \sum_{t=0}^{T-1}(1-\gamma\mu)^t \le \frac{1}{\gamma\mu} we have,

 E\left[\|r_T\|^2\right] \le (1-\gamma\mu)^T E\left[\|r_0\|^2\right] + \frac{\gamma\sigma^2}{\mu M} + \frac{2\gamma^2 L H\sigma^2}{\mu},

which is the claim of this theorem. ∎

### Appendix E Proof of Theorem 2

###### Proof.

Let r_t = \hat{x}_t - x^*. Putting \mu = 0 in Lemma 2 and combining it with Lemma 3, we have

 E\left[\|r_{t+1}\|^2\right] \le E\left[\|r_t\|^2\right] + \frac{\gamma^2\sigma^2}{M} - \frac{\gamma}{2}E\left[D_f(\hat{x}_t, x^*)\right] + 2\gamma L\, E[V_t].

Further using Lemma 1,

 E\left[\|r_{t+1}\|^2\right] \le E\left[\|r_t\|^2\right] + \frac{\gamma^2\sigma^2}{M} - \frac{\gamma}{2}E\left[D_f(\hat{x}_t, x^*)\right] + 2\gamma^3 L H\sigma^2.

Rearranging we have,

 \frac{\gamma}{2}E\left[D_f(\hat{x}_t, x^*)\right] \le E\left[\|r_t\|^2\right] - E\left[\|r_{t+1}\|^2\right] + \frac{\gamma^2\sigma^2}{M} + 2\gamma^3 L H\sigma^2.

Averaging the above inequality as t varies between 0 and T−1,

 \frac{\gamma}{2T}\sum_{t=0}^{T-1}E\left[D_f(\hat{x}_t, x^*)\right] \le \frac{1}{T}\sum_{t=0}^{T-1}\left(E\left[\|r_t\|^2\right] - E\left[\|r_{t+1}\|^2\right]\right) + \frac{\gamma^2\sigma^2}{M} + 2\gamma^3 L H\sigma^2 \le \frac{\|r_0\|^2}{T} + \frac{\gamma^2\sigma^2}{M} + 2\gamma^3 L H\sigma^2. (23)

Since \nabla f(x^*) = 0, we have D_f(\hat{x}_t, x^*) = f(\hat{x}_t) - f(x^*), and Jensen's inequality gives f(\bar{x}_T) \le \frac{1}{T}\sum_{t=0}^{T-1} f(\hat{x}_t). Using this in (23) we have,

 \frac{\gamma}{2}E\left[f(\bar{x}_T) - f(x^*)\right] \le \frac{\|r_0\|^2}{T} + \frac{\gamma^2\sigma^2}{M} + 2\gamma^3 L H\sigma^2.

Dividing both sides by \frac{\gamma}{2} yields the theorem's claim. ∎