A Non-Asymptotic Analysis of Network Independence for Distributed Stochastic Gradient Descent

06/06/2019 · by Alex Olshevsky, et al. · Boston University

This paper is concerned with minimizing the average of $n$ cost functions over a network in which agents may communicate and exchange information with their peers. Specifically, we consider the setting where only noisy gradient information is available. To solve the problem, we study the standard distributed stochastic gradient descent (DSGD) method and perform a non-asymptotic convergence analysis. For strongly convex and smooth objective functions, we not only show that DSGD asymptotically achieves the optimal network-independent convergence rate of centralized stochastic gradient descent (SGD), but also explicitly identify the non-asymptotic convergence rate as a function of characteristics of the objective functions and the network. Furthermore, we derive the time needed for DSGD to approach the asymptotic convergence rate, which behaves as $K_T = \mathcal{O}\big(n^{16/15}/(1-\rho_w)^{31/15}\big)$, where $(1-\rho_w)$ denotes the spectral gap of the mixing matrix of communicating agents.


1 Introduction

In this paper, we consider the distributed optimization problem where a group of $n$ agents collaboratively seek an $x \in \mathbb{R}^p$ that minimizes the average of $n$ cost functions:

$$\min_{x \in \mathbb{R}^p} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x). \tag{1}$$

Each local cost function $f_i$ is known to agent $i$ only, and all the agents communicate and exchange information over a network. Problems in the form of (1) find applications in multi-agent target seeking [32, 8], distributed machine learning [13, 24, 10, 2, 46, 1, 4], and wireless networks [9, 20, 2], among other scenarios.

In order to solve (1), we assume each agent $i$ is able to obtain noisy gradient samples satisfying the following assumption:

Assumption 1. For all $i \in \{1, 2, \ldots, n\}$ and all $k \ge 0$, each random vector $\xi_{i,k}$ is independent, and

$$\mathbb{E}\big[g_i(x, \xi_{i,k}) \,\big|\, x\big] = \nabla f_i(x), \qquad \mathbb{E}\big[\|g_i(x, \xi_{i,k}) - \nabla f_i(x)\|^2 \,\big|\, x\big] \le \sigma^2 \quad \text{for all } x \in \mathbb{R}^p, \tag{2}$$

for some $\sigma > 0$.

This condition is satisfied for many distributed learning problems. For example, suppose $f_i(x) := \mathbb{E}_{\xi_i}[F_i(x, \xi_i)]$ represents the expected loss function for agent $i$, where the $\xi_{i,k}$ are independent data samples gathered over time. Then for any $k$ and $x$, the sampled gradient $g_i(x, \xi_{i,k}) := \nabla F_i(x, \xi_{i,k})$ is an unbiased estimator of $\nabla f_i(x)$ satisfying Assumption 1. For another example, suppose the overall goal is to minimize an expected risk function $f(x) := \mathbb{E}_{\xi}[F(x, \xi)]$, and each agent $i$ has a single data sample $\xi_i$. Then the expected risk function can be approximated by $f(x) \simeq \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, where $f_i(x) := F(x, \xi_i)$. In this setting, the gradient estimate of each $f_i$ can incur noise from various sources, such as approximation error, modeling error, and discretization error.
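To make Assumption 1 concrete, here is a minimal sketch (our own illustration, not from the paper) of a noisy first-order oracle for a hypothetical quadratic loss; adding zero-mean Gaussian noise to the exact gradient yields an unbiased estimate with bounded variance, exactly as (2) requires:

```python
import numpy as np

class NoisyGradientOracle:
    """Noisy oracle for the hypothetical local cost f_i(x) = 0.5*||A x - b||^2."""

    def __init__(self, A, b, noise_std, seed=0):
        self.A, self.b = A, b
        self.noise_std = noise_std          # plays the role of sigma in (2)
        self.rng = np.random.default_rng(seed)

    def grad(self, x):
        # Exact gradient: A^T (A x - b).
        return self.A.T @ (self.A @ x - self.b)

    def noisy_grad(self, x):
        # Unbiased sample g_i(x, xi): mean equals grad f_i(x), and
        # E||g - grad f_i(x)||^2 = p * noise_std^2, a finite bound as in (2).
        return self.grad(x) + self.noise_std * self.rng.standard_normal(x.shape)
```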

Problem (1) has been studied extensively in the literature under various distributed algorithms [43, 25, 26, 19, 15, 16, 39, 11, 35, 23, 45, 34], among which the distributed gradient descent (DGD) method proposed in [25] has drawn the greatest attention. Recently, distributed implementations of stochastic gradient algorithms have received considerable interest [37, 41, 12, 3, 5, 42, 6, 7, 22, 17, 18, 30, 31, 38, 40, 14, 33, 29, 44, 1]. Several recent works [18, 30, 21, 31, 33, 29] have shown that distributed methods can perform comparably to their centralized counterparts under certain conditions. For instance, a recent paper [29] discussed a distributed stochastic gradient method that asymptotically performs as well as the best known bounds for centralized stochastic gradient descent (SGD).

In this work, we perform a non-asymptotic analysis for the standard distributed stochastic gradient descent (DSGD) method adapted from DGD. In addition to showing that the algorithm asymptotically achieves the optimal convergence rate enjoyed by a centralized scheme, we precisely identify its non-asymptotic convergence rate as a function of characteristics of the objective functions and the network (e.g., the spectral gap $(1-\rho_w)$ of the mixing matrix). Furthermore, we characterize the time needed for DSGD to achieve the optimal rate of convergence, as demonstrated in the following corollary.

Corollary (Corollary 4.7).

It takes $K_T = \mathcal{O}\big(n^{16/15}/(1-\rho_w)^{31/15}\big)$ time for DSGD to reach the asymptotic rate of convergence, i.e., when $k \ge K_T$, the expected optimization error $\mathbb{E}\big[\|\bar{x}_k - x^*\|^2\big]$ matches, up to lower-order terms, the asymptotic convergence rate of centralized SGD (see Theorem 3). Here $\rho_w$ denotes the spectral norm of $W - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$, with $W$ being the mixing matrix for all the agents, $\bar{x}_k$ is the average solution at time $k$, and $x^*$ is the optimal solution. Stepsizes are set as $\alpha_k = \theta/\big(\mu(k+K)\big)$ for some $\theta > 1$ (see the policy (10) below). These results are new to the best of our knowledge.
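To get a feel for the transient-time bound, the following back-of-the-envelope sketch (our own computation, not from the paper) evaluates $K_T$ for two standard topologies, using the well-known spectral-gap scalings $1-\rho_w = \Theta(1)$ for the complete graph and $\Theta(1/n^2)$ for the ring:

```python
# Scaling of K_T = O(n^{16/15} / (1 - rho_w)^{31/15}) for two topologies.
def transient_time(n, spectral_gap):
    return n ** (16 / 15) / spectral_gap ** (31 / 15)

for n in (10, 100, 1000):
    complete = transient_time(n, 1.0)        # complete graph: gap = Theta(1)
    ring = transient_time(n, 1.0 / n ** 2)   # ring: gap = Theta(1/n^2)
    print(f"n={n:4d}  complete ~ {complete:.2e}   ring ~ {ring:.2e}")
```

For the ring, the bound grows as $n^{16/15 + 62/15} = n^{78/15} \approx n^{5.2}$, so the transient phase can dominate for large, poorly connected networks.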

The rest of this paper is organized as follows. After introducing necessary notation in Section 1.1, we present the DSGD algorithm and some preliminary results in Section 2. In Section 3 we prove the sublinear convergence of the algorithm. Main convergence results and a comparison with the centralized stochastic gradient method are presented in Section 4. We conclude the paper in Section 5.

1.1 Notation

Vectors are column vectors unless otherwise specified. Each agent $i$ holds a local copy of the decision vector, denoted by $x_i \in \mathbb{R}^p$, and its value at iteration/time $k$ is written as $x_{i,k}$. Let

$$X := [x_1, x_2, \ldots, x_n]^{\top} \in \mathbb{R}^{n \times p}, \qquad \bar{x} := \frac{1}{n}\mathbf{1}^{\top} X,$$

where $\mathbf{1}$ is the all-one vector. Define an aggregate objective function

$$F(X) := \sum_{i=1}^{n} f_i(x_i),$$

and let

$$\nabla F(X) := \big[\nabla f_1(x_1), \nabla f_2(x_2), \ldots, \nabla f_n(x_n)\big]^{\top} \in \mathbb{R}^{n \times p}.$$

In addition, we denote

$$G(X, \boldsymbol{\xi}) := \big[g_1(x_1, \xi_1), g_2(x_2, \xi_2), \ldots, g_n(x_n, \xi_n)\big]^{\top}, \qquad \boldsymbol{\xi} := [\xi_1, \xi_2, \ldots, \xi_n]^{\top}.$$

In what follows we write $X_k := [x_{1,k}, x_{2,k}, \ldots, x_{n,k}]^{\top}$ and $\bar{x}_k := \frac{1}{n}\mathbf{1}^{\top} X_k$ for short.

The inner product of two vectors $a, b$ of the same dimension is written as $\langle a, b \rangle$. For two matrices $A, B \in \mathbb{R}^{n \times p}$, let $\langle A, B \rangle := \sum_{i=1}^{n} \langle A_i, B_i \rangle$, where $A_i$ (respectively, $B_i$) is the $i$-th row of $A$ (respectively, $B$). We use $\|\cdot\|$ to denote the $2$-norm of vectors and the Frobenius norm of matrices.
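In code, the stacked notation is simply an $n \times p$ array whose $i$-th row is $x_i^{\top}$; a tiny sketch (ours, not from the paper):

```python
import numpy as np

n, p = 4, 3
X = np.arange(n * p, dtype=float).reshape(n, p)  # row i holds x_i^T
x_bar = X.mean(axis=0)                           # x̄ = (1/n) 1^T X
consensus_dist = np.linalg.norm(X - x_bar)       # ||X - 1 x̄|| (Frobenius norm)
print(x_bar, consensus_dist)
```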

A graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$ has a set of vertices (nodes) $\mathcal{N} = \{1, 2, \ldots, n\}$ and a set of edges $\mathcal{E}$ connecting pairs of vertices. We consider agents that interact over an undirected graph, i.e., $(i,j) \in \mathcal{E}$ if and only if $(j,i) \in \mathcal{E}$.

Denote the mixing matrix of the agents by $W = [w_{ij}] \in \mathbb{R}^{n \times n}$. Two agents $i$ and $j$ are connected if and only if $w_{ij}, w_{ji} > 0$ ($w_{ij} = w_{ji} = 0$ otherwise). Formally, we assume the following condition on the communication among agents:

Assumption 2. The graph $\mathcal{G}$ is undirected and connected (there exists a path between any two agents). The mixing matrix $W$ is nonnegative and doubly stochastic, i.e., $W\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^{\top}W = \mathbf{1}^{\top}$.

From Assumption 2, we have the following contraction property of $W$ (see [35]):

Lemma 1

Let Assumption 2 hold, and let $\rho_w$ denote the spectral norm of the matrix $W - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$. Then $\rho_w < 1$ and

$$\|W\omega - \mathbf{1}\bar{\omega}\| \le \rho_w\, \|\omega - \mathbf{1}\bar{\omega}\|$$

for all $\omega \in \mathbb{R}^{n \times p}$, where $\bar{\omega} := \frac{1}{n}\mathbf{1}^{\top}\omega$.
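Lemma 1 is easy to verify numerically. The sketch below (our own construction, not from the paper) builds a doubly stochastic mixing matrix via lazy Metropolis weights on a ring of $n$ agents and checks both $\rho_w < 1$ and the contraction inequality:

```python
import numpy as np

def metropolis_ring(n):
    """Lazy Metropolis weights on a ring: w_ij = 1/3 for neighbors, rest on the diagonal."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3
        W[i, i] = 1 - W[i].sum()   # self-weight completes the row sum to 1
    return W

n, p = 8, 3
W = metropolis_ring(n)
rho_w = np.linalg.norm(W - np.ones((n, n)) / n, 2)  # spectral norm of W - (1/n)11^T

rng = np.random.default_rng(1)
omega = rng.standard_normal((n, p))
omega_bar = omega.mean(axis=0)
lhs = np.linalg.norm(W @ omega - omega_bar)         # ||W w - 1 w̄||
rhs = rho_w * np.linalg.norm(omega - omega_bar)     # rho_w * ||w - 1 w̄||
print(f"rho_w = {rho_w:.4f} (< 1), contraction holds: {lhs <= rhs + 1e-12}")
```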

2 Distributed Stochastic Gradient Descent

We consider the following standard DSGD method: at each step $k \ge 0$, every agent $i$ independently performs the update

$$x_{i,k+1} = \sum_{j=1}^{n} w_{ij}\big(x_{j,k} - \alpha_k\, g_j(x_{j,k}, \xi_{j,k})\big), \tag{3}$$

where $\{\alpha_k\}$ is a sequence of non-increasing stepsizes, and the initial vectors $x_{i,0}$ are arbitrary for all $i$. We can rewrite (3) in the following compact form:

$$X_{k+1} = W\big(X_k - \alpha_k\, G(X_k, \boldsymbol{\xi}_k)\big). \tag{4}$$
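For intuition, the following self-contained sketch (ours, using hypothetical quadratic costs $f_i(x) = \frac{1}{2}\|x - c_i\|^2$, for which $\mu = L = 1$ and $x^* = \frac{1}{n}\sum_i c_i$) simulates update (4) on a ring and reports the consensus and optimization errors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 8, 3, 0.1
c = rng.standard_normal((n, p))       # local targets c_i
x_star = c.mean(axis=0)               # unique minimizer of f

# Doubly stochastic mixing matrix: lazy Metropolis weights on a ring.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3
    W[i, i] = 1 / 3

X = rng.standard_normal((n, p))       # arbitrary initial iterates x_{i,0}
mu = 1.0
for k in range(2000):
    alpha = 2.0 / (mu * (k + 20))                        # stepsize in the spirit of (10)
    G = (X - c) + sigma * rng.standard_normal((n, p))    # noisy gradients per Assumption 1
    X = W @ (X - alpha * G)                              # one DSGD step, eq. (4)

x_bar = X.mean(axis=0)
print("consensus error   :", np.linalg.norm(X - x_bar))
print("optimization error:", np.linalg.norm(x_bar - x_star))
```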

Throughout the paper, we make the following standing assumption regarding the objective functions $f_i$ (the assumption can be generalized to the case where the agents have different $\mu_i$ and $L_i$):

Assumption 3. Each $f_i$ is $\mu$-strongly convex with $L$-Lipschitz continuous gradients, i.e., for any $x, x' \in \mathbb{R}^p$,

$$\langle \nabla f_i(x) - \nabla f_i(x'),\, x - x' \rangle \ge \mu \|x - x'\|^2, \qquad \|\nabla f_i(x) - \nabla f_i(x')\| \le L \|x - x'\|. \tag{5}$$

Under Assumption 3, Problem (1) has a unique optimal solution $x^*$, and the following result holds (see [35], Lemma 10).

Lemma 2

For any $x \in \mathbb{R}^p$ and $\alpha > 0$, we have

$$\|x - \alpha \nabla f(x) - x^*\| \le \lambda \|x - x^*\|,$$

where $\lambda = \max\{|1 - \alpha\mu|, |1 - \alpha L|\}$.
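Lemma 2 can be sanity-checked numerically. The sketch below (ours, not from the paper) uses a quadratic $f$ whose Hessian spectrum lies in $[\mu, L]$, for which the gradient map contracts by at most $\max\{|1-\alpha\mu|, |1-\alpha L|\}$:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5
eigs = np.linspace(0.5, 4.0, p)           # Hessian spectrum: mu = 0.5, L = 4.0
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
H = Q @ np.diag(eigs) @ Q.T               # f(x) = 0.5 x^T H x, so x* = 0
mu, L = eigs[0], eigs[-1]

x = rng.standard_normal(p)
for alpha in (0.1, 2 / (mu + L), 0.45):
    lam = max(abs(1 - alpha * mu), abs(1 - alpha * L))
    ratio = np.linalg.norm(x - alpha * (H @ x)) / np.linalg.norm(x)
    print(f"alpha={alpha:.3f}: ratio {ratio:.4f} <= lambda {lam:.4f}")
```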

Denote $\bar{h}_k := \frac{1}{n}\mathbf{1}^{\top}\nabla F(X_k)$, the average of the agents' local gradients. The following two lemmas will be useful for our analysis later.

Lemma 3

Under Assumption 3, for all $k \ge 0$,

$$\|\bar{h}_k - \nabla f(\bar{x}_k)\| \le \frac{L}{\sqrt{n}}\, \|X_k - \mathbf{1}\bar{x}_k\|. \tag{6}$$

Proof

By the definitions of $\bar{h}_k$ and $f$, and Assumption 3, we have

$$\|\bar{h}_k - \nabla f(\bar{x}_k)\| = \frac{1}{n}\Big\|\sum_{i=1}^{n}\big(\nabla f_i(x_{i,k}) - \nabla f_i(\bar{x}_k)\big)\Big\| \le \frac{L}{n}\sum_{i=1}^{n}\|x_{i,k} - \bar{x}_k\| \le \frac{L}{\sqrt{n}}\,\|X_k - \mathbf{1}\bar{x}_k\|.$$

Lemma 4

Under Assumption 3, for all $k \ge 0$,

(7)

Proof

By definition,

where the last relation follows from the Cauchy-Schwarz inequality.

2.1 Preliminary Results

In this section, we present some preliminary results concerning the expected optimization error $\mathbb{E}\big[\|\bar{x}_k - x^*\|^2\big]$ and the expected consensus error $\mathbb{E}\big[\|X_k - \mathbf{1}\bar{x}_k\|^2\big]$. Specifically, we bound the two terms by linear combinations of their values at the previous iteration. Throughout the analysis we assume Assumptions 1, 2, and 3 hold.

Lemma 5

Under Algorithm (4), for all $k \ge 0$, we have

(8)

Proof

See Appendix A.1.

The next result is a corollary of Lemma 5.

Lemma 6

Under Algorithm (4), if the stepsizes are sufficiently small, then

(9)

Proof

See Appendix A.2.

Concerning the expected consensus error $\mathbb{E}\big[\|X_k - \mathbf{1}\bar{x}_k\|^2\big]$, we have the following lemma.

Lemma 7

Under Algorithm (4), for all $k \ge 0$,

Proof

See Appendix A.3.

3 Analysis

We are now ready to derive some preliminary convergence results for Algorithm (4). First, we provide a uniform bound on the iterates generated by Algorithm (4) (in expectation) for all $k \ge 0$. Then, based on the lemmas established in Section 2.1, we prove the sublinear convergence rates $\mathbb{E}\big[\|\bar{x}_k - x^*\|^2\big] = \mathcal{O}(1/k)$ and $\mathbb{E}\big[\|X_k - \mathbf{1}\bar{x}_k\|^2\big] = \mathcal{O}(1/k^2)$.

From now on, we consider the following stepsize policy:

$$\alpha_k := \frac{\theta}{\mu(k + K)}, \qquad \forall k \ge 0, \tag{10}$$

where $\theta > 1$ and $K$ is a suitable nonnegative integer defined in (11) below (here $\lceil \cdot \rceil$ denotes the ceiling function):

(11)
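For concreteness, with hypothetical values $\theta = 2$, $\mu = 1$, and $K = 20$ (the actual $K$ is specified in (11)), policy (10) produces a non-increasing $\Theta(1/k)$ sequence:

```python
theta, mu, K = 2.0, 1.0, 20   # hypothetical values; K should be set as in (11)
alpha = [theta / (mu * (k + K)) for k in range(5)]
print(alpha)  # [0.1, 0.0952..., 0.0909..., 0.0869..., 0.0833...]
```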

3.1 Uniform Bound

We derive a uniform bound on the iterates generated by Algorithm (4) (in expectation) for all $k \ge 0$.

Lemma 8

For all $k \ge 0$, we have

(12)

where

(13)

and the sets are defined in (32).

Proof

See Appendix B.1.

We can further bound the constant defined in (13) as follows. From its definition,

Hence

(14)

In light of Lemma 8 and inequality (14), and further noticing that the choice made in the proof of Lemma 8 is arbitrary, we obtain the following uniform bound.

Lemma 9

Under Algorithm (4), for all $k \ge 0$, we have

(15)

3.2 Sublinear Rate

Denote

(16)

Using Lemma 6 and Lemma 7 from Section 2.1, we show below that Algorithm (4) enjoys sublinear convergence rates, i.e., $\mathbb{E}\big[\|\bar{x}_k - x^*\|^2\big] = \mathcal{O}(1/k)$ and $\mathbb{E}\big[\|X_k - \mathbf{1}\bar{x}_k\|^2\big] = \mathcal{O}(1/k^2)$.

Define a Lyapunov function:

(17)

where the weighting parameter is to be determined later.

For ease of analysis, we define the following auxiliary quantities for all $k \ge 0$. In addition, we denote

(18)
Lemma 10

Let

(19)

and

(20)

Under Algorithm (4), for all $k \ge 0$, we have

(21)

where

(22)

In addition,

where

(23)

and

(24)

Proof

See Appendix B.2.

Notice that the sublinear rates obtained in Lemma 10 are network dependent, since the constants involved depend on the spectral gap $(1-\rho_w)$, which is a function of the mixing matrix $W$.

4 Main Results

In this section, we perform a non-asymptotic analysis of network independence for Algorithm (4). Specifically, in Theorem 1 and Corollary 1, we show that $\mathbb{E}\big[\|\bar{x}_k - x^*\|^2\big]$ is bounded by the sum of two terms, where the first term is network independent and decays as $\mathcal{O}\big(\frac{1}{nk}\big)$, while the second (higher-order) term depends on $(1-\rho_w)$. In Theorem 2, we further improve this result and compare it with centralized stochastic gradient descent. We show that, asymptotically, the two methods have the same convergence rate $\mathcal{O}\big(\frac{1}{nk}\big)$. In addition, it takes time $K_T = \mathcal{O}\big(n^{16/15}/(1-\rho_w)^{31/15}\big)$ for Algorithm (4) to reach this asymptotic rate of convergence.

We start with a useful lemma.

Lemma 11

For any () and ,

Proof

See Appendix C.1.

The following theorem demonstrates the asymptotic network independence property of Algorithm (4).

Theorem 1

Under Algorithm (4), suppose the required parameter condition holds (the condition can be easily relaxed). We have, for all $k \ge 0$,

(25)

Proof

For all $k$, in light of Lemma 6,

Recalling the definitions above,

Therefore,

From Lemma 11,

In light of Lemma 10,

Hence,

However, we have, for any $k$,

and

Combining the above relations, we have

(26)

Recalling Lemma 13 and the definitions above yields the desired result.

Next, we estimate the constants appearing in Theorem 1 and derive their dependence on the network size $n$ and the spectral gap $(1-\rho_w)$.

Lemma 12

Suppose , . Then,

Proof

See Appendix C.2.

In light of Lemma 9, Lemma 10 and Theorem 1, we have the following corollary.

Corollary 1

Under Algorithm (4) with , when ,

where

Proof

See Appendix C.3.

We improve the result of Theorem 1 and Corollary 1 with further analysis.

Theorem 2

Under Algorithm (4) with , when ,