# A Non-Asymptotic Analysis of Network Independence for Distributed Stochastic Gradient Descent

This paper is concerned with minimizing the average of n cost functions over a network, in which agents may communicate and exchange information with their peers. Specifically, we consider the setting where only noisy gradient information is available. To solve the problem, we study the standard distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, we not only show that DSGD asymptotically achieves the optimal network independent convergence rate of centralized stochastic gradient descent (SGD), but also explicitly identify the non-asymptotic convergence rate as a function of characteristics of the objective functions and the network. Furthermore, we derive the time needed for DSGD to approach the asymptotic convergence rate, which behaves as K_T = O(n^{16/15}/(1 − ρ_w)^{31/15}), where (1 − ρ_w) denotes the spectral gap of the mixing matrix of communicating agents.


## 1 Introduction

In this paper, we consider the distributed optimization problem where a group of n agents collaboratively look for an x ∈ R^p that minimizes the average of n cost functions:

 min_{x∈R^p} f(x) ( = (1/n) Σ_{i=1}^n f_i(x) ).  (1)

Each local cost function f_i is known by agent i only, and all the agents communicate and exchange information over a network. Problems in the form of (1) find applications in multi-agent target seeking [32, 8], distributed machine learning [13, 24, 10, 2, 46, 1, 4], and wireless networks [9, 20, 2], among other scenarios.

In order to solve (1), we assume each agent i is able to obtain noisy gradient samples g_i(x, ξ_i) satisfying the following assumption: for all i and all x, each random vector ξ_i is independent, and

 E_{ξ_i}[g_i(x, ξ_i) | x] = ∇f_i(x),  E_{ξ_i}[∥g_i(x, ξ_i) − ∇f_i(x)∥² | x] ≤ σ²  for some σ > 0.  (2)

This condition is satisfied for many distributed learning problems. For example, suppose f_i represents the expected loss function for agent i, whose gradients are evaluated on independent data samples gathered over time. Then for any such sample ξ_i and any x, g_i(x, ξ_i) is an unbiased estimator of ∇f_i(x) satisfying Assumption 1. For another example, suppose the overall goal is to minimize an expected risk function, and each agent holds a single data sample. Then, the expected risk function can be approximated by the empirical average in (1), where f_i is the loss associated with agent i's sample. In this setting, the gradient estimation of f_i can incur noise from various sources, such as approximation error and modeling and discretization errors.
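
As a small numerical sketch of the oracle model in (2), the following Python snippet builds a hypothetical least-squares local cost (the cost, dimensions, and noise level are illustrative assumptions, not from the paper) and checks that averaging many noisy gradient samples recovers the exact gradient, as unbiasedness requires:

```python
import numpy as np

rng = np.random.default_rng(0)
p, sigma = 5, 0.1

# Hypothetical local cost for one agent: f_i(x) = 0.5 * ||A x - b||^2.
A = rng.normal(size=(p, p))
b = rng.normal(size=p)

def exact_grad(x):
    return A.T @ (A @ x - b)

def noisy_grad(x):
    # g_i(x, xi) = exact gradient + zero-mean noise; unbiased, with
    # E||g_i - grad f_i||^2 = sigma^2, matching the two conditions in (2).
    return exact_grad(x) + (sigma / np.sqrt(p)) * rng.normal(size=p)

x = np.ones(p)
# Averaging many independent samples recovers the exact gradient (unbiasedness).
est = np.mean([noisy_grad(x) for _ in range(20000)], axis=0)
print(np.linalg.norm(est - exact_grad(x)))  # small
```

The sample average concentrates around ∇f_i(x) at rate σ/√(number of samples), which is exactly the mechanism that makes the variance bound in (2) useful in the analysis.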

Problem (1) has been studied extensively in the literature under various distributed algorithms [43, 25, 26, 19, 15, 16, 39, 11, 35, 23, 45, 34], among which the distributed gradient descent (DGD) method proposed in [25] has drawn the greatest attention. Recently, distributed implementations of stochastic gradient algorithms have received considerable interest [37, 41, 12, 3, 5, 42, 6, 7, 22, 17, 18, 30, 31, 38, 40, 14, 33, 29, 44, 1]. Several recent works [18, 30, 21, 31, 33, 29] have shown that distributed methods may perform comparably to their centralized counterparts under certain conditions. For instance, a recent paper [29] discussed a distributed stochastic gradient method that asymptotically performs as well as the best known bounds for centralized stochastic gradient descent (SGD).

In this work, we perform a non-asymptotic analysis for the standard distributed stochastic gradient descent (DSGD) method adapted from DGD. In addition to showing that the algorithm asymptotically achieves the optimal convergence rate enjoyed by a centralized scheme, we precisely identify its non-asymptotic convergence rate as a function of characteristics of the objective functions and the network (e.g., the spectral gap (1 − ρ_w) of the mixing matrix). Furthermore, we characterize the time needed for DSGD to achieve the optimal rate of convergence, demonstrated in the following corollary.

###### Corollary (Corollary 4.7).

It takes K_T = O(n^{16/15}/(1 − ρ_w)^{31/15}) time for DSGD to reach the asymptotic rate of convergence, i.e., when k ≥ K_T, we have E[∥x̄(k) − x*∥²] = O(1/(nk)).

Note that O(1/(nk)) is the asymptotic convergence rate for centralized SGD (see Theorem 3). Here ρ_w denotes the spectral norm of W − (1/n) 1 1⊺, with W being the mixing matrix for all the agents, x̄(k) is the average solution at time k, and x* is the optimal solution. Stepsizes are set to be α_k = θ/(μ(k + K)) for some appropriate constants θ and K (see (10)–(11)). These results are new to the best of our knowledge.

The rest of this paper is organized as follows. After introducing necessary notation in Section 1.1, we present the DSGD algorithm and some preliminary results in Section 2. In Section 3 we prove the sublinear convergence of the algorithm. Main convergence results and a comparison with centralized stochastic gradient method are demonstrated in Section 4. We conclude the paper in Section 5.

### 1.1 Notation

Vectors are column vectors unless otherwise specified. Each agent i holds a local copy of the decision vector, denoted by x_i ∈ R^p; its value at iteration/time k is written as x_i(k). Let

 x := [x_1, x_2, …, x_n]⊺ ∈ R^{n×p},  x̄ := (1/n) 1⊺x ∈ R^{1×p},

where 1 is the all-one vector. Define an aggregate objective function

 F(x) := Σ_{i=1}^n f_i(x_i),

and let

 ∇F(x) := [∇f_1(x_1), ∇f_2(x_2), …, ∇f_n(x_n)]⊺ ∈ R^{n×p},
 ∇̄F(x) := (1/n) 1⊺∇F(x).

In addition, denote

 ξ := [ξ_1, ξ_2, …, ξ_n]⊺ ∈ R^{n×p},
 g(x, ξ) := [g_1(x_1, ξ_1), g_2(x_2, ξ_2), …, g_n(x_n, ξ_n)]⊺ ∈ R^{n×p}.

In what follows we write ∇F(k) := ∇F(x(k)) and g(k) := g(x(k), ξ(k)) for short.

The inner product of two vectors a, b is written as ⟨a, b⟩. For two matrices A, B ∈ R^{n×p}, let ⟨A, B⟩ := Σ_{i=1}^n ⟨A_i, B_i⟩, where A_i (respectively, B_i) is the i-th row of A (respectively, B). We use ∥·∥ to denote the 2-norm of vectors and the Frobenius norm of matrices.

A graph G = (V, E) has a set V = {1, 2, …, n} of vertices (nodes) and a set E of edges connecting vertices. We consider agents that interact in an undirected graph, i.e., (i, j) ∈ E if and only if (j, i) ∈ E.

Denote the mixing matrix of the agents by W = [w_ij] ∈ R^{n×n}. Two agents i and j are connected if and only if w_ij, w_ji > 0 (w_ij = w_ji = 0 otherwise). Formally, we assume the following condition on the communication among agents: the graph G is undirected and connected (there exists a path between any two agents), and the mixing matrix W is nonnegative and doubly stochastic, i.e., W1 = 1 and 1⊺W = 1⊺. From Assumption 1.1, we have the following contraction property of W (see [35]):

###### Lemma 1

Let Assumption 1.1 hold, and let ρ_w denote the spectral norm of the matrix W − (1/n) 1 1⊺. Then ρ_w < 1 and

 ∥Wω − 1 ω̄∥ ≤ ρ_w ∥ω − 1 ω̄∥

for all ω ∈ R^{n×p}, where ω̄ := (1/n) 1⊺ω.
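
Lemma 1 is easy to verify numerically. The following minimal Python sketch assumes a hypothetical ring network with uniform weights (an illustrative choice; any doubly stochastic, connected topology would do), computes ρ_w, and checks the contraction:

```python
import numpy as np

n = 8
# Hypothetical ring network with uniform weights 1/3 to self and both
# neighbours: W is symmetric and doubly stochastic (Assumption 1.1).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

ones = np.ones((n, 1))
# rho_w: spectral norm (largest singular value) of W - (1/n) * 1 1^T.
rho_w = np.linalg.norm(W - ones @ ones.T / n, 2)

# Contraction check: ||W w - 1 w_bar|| <= rho_w * ||w - 1 w_bar||.
rng = np.random.default_rng(0)
omega = rng.normal(size=(n, 3))
omega_bar = ones.T @ omega / n               # row vector of column means
lhs = np.linalg.norm(W @ omega - ones @ omega_bar)
rhs = rho_w * np.linalg.norm(omega - ones @ omega_bar)
print(rho_w, lhs <= rhs + 1e-12)
```

Since W1 = 1, the consensus direction is invariant, and multiplication by W shrinks the disagreement component ω − 1ω̄ by at least the factor ρ_w.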

## 2 Distributed Stochastic Gradient Descent

We consider the following standard DSGD method: at each step k, every agent i independently performs the update

 x_i(k+1) = Σ_{j=1}^n w_ij (x_j(k) − α_k g_j(k)),  (3)

where {α_k} is a sequence of non-increasing stepsizes and g_j(k) := g_j(x_j(k), ξ_j(k)). The initial vectors x_i(0) are arbitrary for all i. We can rewrite (3) in the following compact form:

 x(k+1) = W(x(k) − α_k g(k)).  (4)
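
The compact update (4) can be sketched in a few lines of Python. The local costs, network, noise level, and stepsize sequence below are illustrative assumptions chosen so that the minimizer x* is known in closed form; the snippet is a minimal simulation, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 8, 4, 0.1

# Hypothetical local costs f_i(x) = 0.5 * ||x - c_i||^2 (mu = L = 1),
# so the global minimizer x* is the average of the targets c_i.
C = rng.normal(size=(n, p))
x_star = C.mean(axis=0)

# Ring mixing matrix, doubly stochastic as required by Assumption 1.1.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

def g(x):
    # Stacked noisy gradients: row i is g_i(x_i(k)) = x_i - c_i + noise.
    return x - C + sigma * rng.normal(size=x.shape)

x = rng.normal(size=(n, p))          # arbitrary initial vectors x_i(0)
for k in range(2000):
    alpha_k = 1.0 / (k + 10)         # a non-increasing stepsize sequence
    x = W @ (x - alpha_k * g(x))     # compact DSGD update (4)

x_bar = x.mean(axis=0)
opt_err = np.linalg.norm(x_bar - x_star) ** 2    # optimization error
cons_err = np.linalg.norm(x - x_bar) ** 2        # consensus error
print(opt_err, cons_err)
```

Both the optimization error of the average iterate and the consensus error shrink over time, the two quantities that the analysis below (Sections 2.1–4) bounds.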

Throughout the paper, we make the following standing assumption regarding the objective functions f_i (the assumption can be generalized to the case where the agents have different μ_i and L_i): each f_i is μ-strongly convex with L-Lipschitz continuous gradients, i.e., for any x, x′ ∈ R^p,

 ⟨∇f_i(x) − ∇f_i(x′), x − x′⟩ ≥ μ ∥x − x′∥²,  ∥∇f_i(x) − ∇f_i(x′)∥ ≤ L ∥x − x′∥.  (5)

Under Assumption 1, Problem (1) has a unique optimal solution x*, and the following result holds (see [35], Lemma 10).

###### Lemma 2

For any x ∈ R^p and α > 0, we have

 ∥x − α∇f(x) − x*∥ ≤ λ ∥x − x*∥,

where λ = max{|1 − αμ|, |1 − αL|}.
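
The contraction in Lemma 2 can be checked on a small quadratic, for which ∇f is linear and μ, L are eigenvalue bounds. The matrix, dimension, and stepsize below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 6
# Quadratic f(x) = 0.5 * x^T H x, so grad f(x) = H x and x* = 0;
# mu and L are the extreme eigenvalues of H.
eigs = np.linspace(0.5, 4.0, p)
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))
H = Q @ np.diag(eigs) @ Q.T
mu, L = eigs[0], eigs[-1]

alpha = 2.0 / (mu + L)                        # a common stepsize choice
lam = max(abs(1 - alpha * mu), abs(1 - alpha * L))

x = rng.normal(size=p)
step = x - alpha * (H @ x)                    # x - alpha * grad f(x)
lhs = np.linalg.norm(step)                    # distance to x* = 0 after one step
rhs = lam * np.linalg.norm(x)
print(lam, lhs <= rhs + 1e-12)
```

For this quadratic, I − αH has eigenvalues 1 − αλ_i(H), so the one-step map contracts distances to x* by at most λ = max{|1 − αμ|, |1 − αL|} < 1.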

Denote ḡ(k) := (1/n) 1⊺g(k). The following two lemmas will be useful for our analysis later.

###### Lemma 3

Under Assumption 1, for all k,

 E[∥ḡ(k) − ∇̄F(x(k))∥²] ≤ σ²/n.  (6)

###### Proof

By the definitions of ḡ(k) and ∇̄F(x(k)) and Assumption 1, we have

 E[∥ḡ(k) − ∇̄F(x(k))∥²] = E[∥(1/n) 1⊺g(k) − (1/n) 1⊺∇F(x(k))∥²] = (1/n²) Σ_{i=1}^n E[∥g_i(k) − ∇f_i(x_i(k))∥²] ≤ σ²/n.

###### Lemma 4

Under Assumption 1, for all k,

 ∥∇f(x̄(k)) − ∇̄F(x(k))∥ ≤ (L/√n) ∥x(k) − 1 x̄(k)∥.  (7)

###### Proof

By definition,

 ∥∇f(x̄(k)) − ∇̄F(x(k))∥ = ∥∇f(x̄(k)) − (1/n) 1⊺∇F(x(k))∥ = ∥(1/n) Σ_{i=1}^n ∇f_i(x̄(k)) − (1/n) Σ_{i=1}^n ∇f_i(x_i(k))∥ ≤ (L/n) Σ_{i=1}^n ∥x̄(k) − x_i(k)∥ ≤ (L/√n) ∥x(k) − 1 x̄(k)∥,

where the first inequality follows from Assumption 1 (Lipschitz continuous gradients) and the last relation follows from the Cauchy–Schwarz inequality.
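
A quick numerical check of inequality (7), using hypothetical quadratic local costs with heterogeneous curvatures L_i ≤ L (so each gradient is L-Lipschitz); all constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 3
# Hypothetical local costs f_i(x) = 0.5 * L_i * ||x - c_i||^2 with L_i <= L.
Ls = np.linspace(0.5, 2.0, n)
L = Ls.max()
C = rng.normal(size=(n, p))

def grad_i(i, v):
    return Ls[i] * (v - C[i])

x = rng.normal(size=(n, p))                  # stacked local iterates x(k)
x_bar = x.mean(axis=0)

grad_f_at_mean = np.mean([grad_i(i, x_bar) for i in range(n)], axis=0)
mean_of_grads = np.mean([grad_i(i, x[i]) for i in range(n)], axis=0)

lhs = np.linalg.norm(grad_f_at_mean - mean_of_grads)
rhs = (L / np.sqrt(n)) * np.linalg.norm(x - x_bar)   # Frobenius norm of consensus error
print(lhs <= rhs + 1e-12)
```

The left-hand side measures how far the average of local gradients is from the gradient of f at the average iterate; (7) says this gap is controlled by the consensus error.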

### 2.1 Preliminary Results

In this section, we present some preliminary results concerning E[∥x̄(k) − x*∥²] (expected optimization error) and E[∥x(k) − 1 x̄(k)∥²] (expected consensus error). Specifically, we bound the two terms by linear combinations of their values in the last iteration. Throughout the analysis we assume Assumptions 1, 1.1 and 1 hold.

###### Lemma 5

Under Algorithm (4), for all k, we have

 E[∥x̄(k+1) − x*∥² | x(k)] ≤ ∥x̄(k) − α_k∇f(x̄(k)) − x*∥² + (2α_k L/√n) ∥x̄(k) − α_k∇f(x̄(k)) − x*∥ ∥x(k) − 1 x̄(k)∥ + (α_k² L²/n) ∥x(k) − 1 x̄(k)∥² + α_k²σ²/n.  (8)

###### Proof

See Appendix A.1.

The next result is a corollary of Lemma 5.

###### Lemma 6

Under Algorithm (4), supposing α_k ≤ μ/(2L²), then

 E[∥x̄(k+1) − x*∥²] ≤ (1 − (3/2)α_k μ) E[∥x̄(k) − x*∥²] + (3α_k L²/(nμ)) E[∥x(k) − 1 x̄(k)∥²] + α_k²σ²/n.  (9)

###### Proof

See Appendix A.2.

Concerning the expected consensus error E[∥x(k) − 1 x̄(k)∥²], we have the following lemma.

###### Lemma 7

Under Algorithm (4), for all k,

 E[∥x(k+1) − 1 x̄(k+1)∥²] ≤ ((1 + ρ_w²)/2 + 2α_k ρ_w² L + 2α_k² ρ_w² L²) E[∥x(k) − 1 x̄(k)∥²] + ρ_w² [α_k² (4nL²/(1 − ρ_w²)) E[∥x̄(k) − x*∥²] + α_k² (4∥∇F(1x*)∥²/(1 − ρ_w²)) + α_k² n σ²].

###### Proof

See Appendix A.3.

## 3 Analysis

We are now ready to derive some preliminary convergence results for Algorithm (4). First, we provide a uniform bound on the iterates generated by Algorithm (4) (in expectation) for all k. Then, based on the lemmas established in Section 2.1, we prove the sublinear convergence rates E[∥x̄(k) − x*∥²] = O(1/k) and E[∥x(k) − 1 x̄(k)∥²] = O(1/k²).

From now on we consider the following stepsize policy:

 α_k := θ/(μ(k + K)),  ∀k,  (10)

where θ > 2, ⌈·⌉ denotes the ceiling function, and

 K := ⌈2θL²/μ²⌉.  (11)
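
The schedule (10)–(11) is straightforward to compute; the constants below are hypothetical values chosen only for illustration:

```python
import math

# Hypothetical problem constants, chosen only for illustration.
mu, L, theta = 1.0, 4.0, 2.5     # theta > 2, as the later analysis requires

K = math.ceil(2 * theta * L ** 2 / mu ** 2)   # eq. (11)

def alpha(k):
    return theta / (mu * (k + K))             # eq. (10)

# By construction alpha_0 = theta / (mu * K) <= mu / (2 * L**2), so the
# schedule starts small and then decays like Theta(1/k).
print(K, alpha(0), alpha(1000))
```

The offset K guarantees the initial stepsize already satisfies the small-stepsize requirements of the lemmas above, while the Θ(1/k) decay drives the O(1/k) optimization error rate.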

### 3.1 Uniform Bound

We derive a uniform bound on the iterates generated by Algorithm (4) (in expectation) for all k.

###### Lemma 8

For all k, we have

 E[∥x(k)∥²] ≤ max{∥x(0)∥², Σ_{i=1}^n R_i},  (12)

where

 R_i := max_{q∈X_i} {(1 − μ²/(2L²)) q + (μ/L²) ∥∇f_i(0)∥ √q + (μ²/(4L⁴)) (2∥∇f_i(0)∥² + σ²)},  (13)

and the sets X_i are defined in (32).

###### Proof

See Appendix B.1.

We can further bound R_i as follows. From the definition of X_i,

 max_{q∈X_i} q ≤ 8∥∇f_i(0)∥²/μ² + 3σ²/(4L²).

Hence

 R_i = max_{q∈X_i} {q − (μ/(2L²)) [μq − 2∥∇f_i(0)∥√q − (μ/(2L²))(2∥∇f_i(0)∥² + σ²)]}  (14)
  ≤ max_{q∈X_i} q − (μ/(2L²)) min_{q∈X_i} {μq − 2∥∇f_i(0)∥√q − (μ/(2L²))(2∥∇f_i(0)∥² + σ²)}
  ≤ 8∥∇f_i(0)∥²/μ² + 3σ²/(4L²) + (μ/(2L²)) [∥∇f_i(0)∥²/μ + (μ/(2L²))(2∥∇f_i(0)∥² + σ²)]
  ≤ 9∥∇f_i(0)∥²/μ² + σ²/L².

In light of Lemma 8 and inequality (14), and further noticing that the choice of the reference point 0 is arbitrary in the proof of Lemma 8, we obtain the following uniform bound for E[∥x(k) − 1x*∥²].

###### Lemma 9

Under Algorithm (4), for all k, we have

 E[∥x(k) − 1x*∥²] ≤ X̂ := max{∥x(0) − 1x*∥², 9 Σ_{i=1}^n ∥∇f_i(x*)∥²/μ² + nσ²/L²}.  (15)

### 3.2 Sublinear Rate

Denote

 U(k) := E[∥x̄(k) − x*∥²],  V(k) := E[∥x(k) − 1 x̄(k)∥²].  (16)

Using Lemma 6 and Lemma 7 from Section 2.1, we show below that Algorithm (4) enjoys the sublinear convergence rates U(k) = O(1/k) and V(k) = O(1/k²).

Define a Lyapunov function:

 W(k) := U(k) + ω(k)V(k),  ∀k,  (17)

where ω(k) > 0 is to be determined later.

For the ease of analysis, we define the shifted quantities Ũ(k) := U(k − K) and Ṽ(k) := V(k − K) for all k ≥ K. In addition, we denote

 k̃ := k + K.  (18)
###### Lemma 10

Let

 K₁ := ⌈24L²θ/((1 − ρ_w²)μ²)⌉,  (19)

and

 ω(k) := 12α_k L²/(nμ(1 − ρ_w²)).  (20)

Under Algorithm (4), for all k ≥ K₁, we have

 U(k) ≤ Ŵ/k̃,  (21)

where

 Ŵ := K₁X̂/n + (3/(4θ − 3)) (σ²θ²/(nμ²) + σ²ρ_w²θ²/(2μ²)) + 6∥∇F(1x*)∥²ρ_w²θ²/((4θ − 3)nμ²(1 − ρ_w²)).  (22)

In addition, for all k ≥ K₁,

 V(k) ≤ p₀^{k̃−K₁} X̂ + V₁/k̃² + V₂/k̃³,

where

 p₀ := (3 + ρ_w²)/4,  (23)

and

 V₁ := (8θ²ρ_w²/(μ²(1 − ρ_w²))) [4∥∇F(1x*)∥²/(1 − ρ_w²) + nσ²],  V₂ := (32θ²nL²ρ_w²/(μ²(1 − ρ_w²)²)) Ŵ.  (24)

###### Proof

See Appendix B.2.

Notice that the sublinear rates obtained in Lemma 10 are network dependent, since Ŵ, V₁ and V₂ all depend on the spectral gap (1 − ρ_w), a function of the mixing matrix W.

## 4 Main Results

In this section, we perform a non-asymptotic analysis of network independence for Algorithm (4). Specifically, in Theorem 1 and Corollary 1, we show that U(k) ≤ θ²σ²/((1.5θ − 1)nμ²k̃) + c₁/k̃², where the first term is network independent and the second (higher-order) term depends on (1 − ρ_w). In Theorem 2, we further improve the result and compare it with centralized stochastic gradient descent. We show that asymptotically, the two methods have the same convergence rate O(1/(nk)). In addition, it takes K_T = O(n^{16/15}/(1 − ρ_w)^{31/15}) time for Algorithm (4) to reach this asymptotic rate of convergence.

###### Lemma 11

For any γ > 0 and integers k > a > γ,

 ∏_{t=a}^{k−1} (1 − γ/t) ≤ a^γ/k^γ.

###### Proof

See Appendix C.1.
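
Lemma 11 follows from 1 − γ/t ≤ e^{−γ/t} and Σ_{t=a}^{k−1} 1/t ≥ ln(k/a); the inequality is also easy to probe numerically for sample parameters (the values of a, k, γ below are arbitrary illustrative choices with k > a > γ > 0):

```python
import numpy as np

# Check prod_{t=a}^{k-1} (1 - gamma/t) <= (a/k)^gamma for sample values
# with k > a > gamma > 0 (hypothetical parameters, for illustration only).
for gamma in (0.5, 1.5, 3.0):
    for a, k in ((5, 50), (10, 200)):
        prod = np.prod([1.0 - gamma / t for t in range(a, k)])
        bound = (a / k) ** gamma
        assert prod <= bound + 1e-12, (gamma, a, k)
print("ok")
```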

The following theorem demonstrates the asymptotic network independence property of Algorithm (4).

###### Theorem 1

Under Algorithm (4), suppose θ > 2 (this condition can be easily relaxed). We have for all k ≥ K₁,

 U(k) ≤ θ²σ²/((1.5θ − 1)nμ²k̃) + [3θ²(1.5θ − 1)σ²/((1.5θ − 2)nμ²) + 6θL²V₁/((1.5θ − 2)nμ²)] (1/k̃²) + 6θL²V₂/((1.5θ − 3)nμ²) (1/k̃³) + (K₁^{1.5θ}X̂/n + 6θL²K₁^{1.5θ−1}X̂/(nμ²(1 − p₀))) (1/k̃^{1.5θ}).  (25)

###### Proof

For k ≥ K₁, in light of Lemma 6,

 U(k+1) ≤ (1 − (3/2)α_k μ) U(k) + (3α_k L²/(nμ)) V(k) + α_k²σ²/n.

Recalling the definitions of Ũ(k) and Ṽ(k),

 Ũ(k+1) ≤ (1 − 3θ/(2k)) Ũ(k) + (3θL²/(nμ²)) (Ṽ(k)/k) + (θ²σ²/(nμ²)) (1/k²).

Therefore,

 Ũ(k) ≤ ∏_{t=K₁}^{k−1} (1 − 3θ/(2t)) Ũ(K₁) + Σ_{t=K₁}^{k−1} (∏_{j=t+1}^{k−1} (1 − 3θ/(2j))) ((θ²σ²/(nμ²)) (1/t²) + (3θL²/(nμ²)) (Ṽ(t)/t)).

From Lemma 11,

 Ũ(k) ≤ (K₁^{1.5θ}/k^{1.5θ}) Ũ(K₁) + Σ_{t=K₁}^{k−1} ((t+1)^{1.5θ}/k^{1.5θ}) (θ²σ²/(nμ²t²) + 3θL²Ṽ(t)/(nμ²t))
  = (1/k^{1.5θ}) (θ²σ²/(nμ²)) Σ_{t=K₁}^{k−1} (t+1)^{1.5θ}/t² + (K₁^{1.5θ}/k^{1.5θ}) Ũ(K₁) + Σ_{t=K₁}^{k−1} ((t+1)^{1.5θ}/k^{1.5θ}) (3θL²/(nμ²)) (Ṽ(t)/t).

In light of Lemma 10, when k ≥ K₁,

 Ṽ(k) ≤ p₀^{k−K₁} X̂ + V₁/k² + V₂/k³.

Hence,

 Ũ(k) − (1/k^{1.5θ}) (θ²σ²/(nμ²)) Σ_{t=K₁}^{k−1} (t+1)^{1.5θ}/t² − (K₁^{1.5θ}/k^{1.5θ}) Ũ(K₁)
  ≤ Σ_{t=K₁}^{k−1} ((t+1)^{1.5θ}/k^{1.5θ}) (3θL²/(nμ²)) (1/t) (p₀^{t−K₁} X̂ + V₁/t² + V₂/t³)
  = (1/k^{1.5θ}) (3θL²/(nμ²)) [V₁ Σ_{t=K₁}^{k−1} (t+1)^{1.5θ}/t³ + V₂ Σ_{t=K₁}^{k−1} (t+1)^{1.5θ}/t⁴ + X̂ Σ_{t=K₁}^{k−1} (t+1)^{1.5θ} p₀^{t−K₁}/t].

In addition, we have for any integers b ≥ a ≥ 1,

 Σ_{t=a}^{b} (t+1)^{1.5θ}/t² ≤ Σ_{t=a}^{b−2} [(t+1)^{1.5θ}/(t+1)² + 3(t+1)^{1.5θ}/(t+1)³] + b^{1.5θ}/(b−1)² + (b+1)^{1.5θ}/b² ≤ ∫_a^b (t^{1.5θ−2} + 3t^{1.5θ−3}) dt + 2(b+1)^{1.5θ}/b² ≤ b^{1.5θ−1}/(1.5θ − 1) + 3b^{1.5θ−2}/(1.5θ − 2) + 3b^{1.5θ−2},
 Σ_{t=a}^{b} (t+1)^{1.5θ}/t³ ≤ 2∫_a^b t^{1.5θ−3} dt ≤ 2b^{1.5θ−2}/(1.5θ − 2),  Σ_{t=a}^{b} (t+1)^{1.5θ}/t⁴ ≤ 2b^{1.5θ−3}/(1.5θ − 3),

and

 Σ_{t=K₁}^{k−1} (t+1)^{1.5θ} p₀^{t−K₁}/t ≤ 2∫_{K₁}^{∞} t^{1.5θ−1} p₀^{t−K₁} dt ≤ 2K₁^{1.5θ−1}/(1 − p₀).

It follows that

 Ũ(k) ≤ θ²σ²/((1.5θ − 1)nμ²k) + 3θ²(1.5θ − 1)σ²/((1.5θ − 2)nμ²) (1/k²) + (K₁^{1.5θ}/k^{1.5θ}) Ũ(K₁) + 6θL²V₁/((1.5θ − 2)nμ²) (1/k²) + 6θL²V₂/((1.5θ − 3)nμ²) (1/k³) + (3θL²X̂/(nμ²)) (2K₁^{1.5θ−1}/(1 − p₀)) (1/k^{1.5θ}).  (26)

Recalling Lemma 13 and the definition of Ũ(k) yields the desired result.

Next, we estimate the constants appearing in Theorem 1 and derive their dependency on the network size n and the spectral gap (1 − ρ_w).

###### Lemma 12

Suppose θ > 2. Then the constants appearing in Theorem 1 admit the estimates given in Appendix C.2.

###### Proof

See Appendix C.2.

In light of Lemma 9, Lemma 10 and Theorem 1, we have the following corollary.

###### Corollary 1

Under Algorithm (4) with stepsize policy (10), when k ≥ K₁,

 U(k) ≤ θ²σ²/((1.5θ − 1)nμ²k̃) + c₁/k̃²,  Ṽ(k) ≤ c₂/k²,

where

 c₁ = O(1/(1 − ρ_w)²),  c₂ = O(n/(1 − ρ_w)²).

###### Proof

See Appendix C.3.

We improve the result of Theorem 1 and Corollary 1 with further analysis.

###### Theorem 2

Under Algorithm (4) with stepsize policy (10), when k ≥ K₁,

 U(k)≤θ2σ2