Push-Pull Gradient Methods for Distributed Optimization in Networks

In this paper, we focus on solving a distributed convex optimization problem in a network, where each agent has its own convex cost function and the goal is to minimize the sum of the agents' cost functions while obeying the network connectivity structure. In order to minimize the sum of the cost functions, we consider new distributed gradient-based methods where each node maintains two estimates, namely, an estimate of the optimal decision variable and an estimate of the gradient for the average of the agents' objective functions. From the viewpoint of an agent, the information about the decision variable is pushed to the neighbors, while the information about the gradients is pulled from the neighbors hence giving the name "push-pull gradient methods". This name is also due to the consideration of the implementation aspect: the push-communication-protocol and the pull-communication-protocol are respectively employed to implement certain steps in the numerical schemes. The methods utilize two different graphs for the information exchange among agents, and as such, unify the algorithms with different types of distributed architecture, including decentralized (peer-to-peer), centralized (master-slave), and semi-centralized (leader-follower) architecture. We show that the proposed algorithms and their many variants converge linearly for strongly convex and smooth objective functions over a network (possibly with unidirectional data links) in both synchronous and asynchronous random-gossip settings. We numerically evaluate our proposed algorithm for both static and time-varying graphs, and find that the algorithms are competitive as compared to other linearly convergent schemes.

1 Introduction

In this paper, we consider a system involving $n$ agents whose goal is to collaboratively solve the following problem:

$$\min_{x \in \mathbb{R}^p} f(x) := \sum_{i=1}^{n} f_i(x), \tag{1}$$

where $x$ is the global decision variable and each function $f_i: \mathbb{R}^p \to \mathbb{R}$ is convex and known by agent $i$ only. The agents are embedded in a communication network, and their goal is to obtain an optimal and consensual solution through local neighbor communications and information exchange. This local exchange is desirable in situations where the privacy of the agent data needs to be protected, or the exchange of a large amount of data is prohibitively expensive due to limited communication resources.

To solve problem (1) in a networked system of agents, many algorithms have been proposed under various assumptions on the objective functions and on the underlying networks/graphs. Static undirected graphs were extensively considered in the literature [30, 29, 19, 24, 27]. References [18, 39, 16] studied time-varying and/or stochastic undirected networks. Directed graphs were discussed in [13, 14, 40, 16, 35, 36]. Centralized (master-slave) algorithms were discussed in [2], where extensive applications in learning can be found. Parallel, coordinated, and asynchronous algorithms were discussed in [20] and the references therein. The reader is also referred to the recent paper [15] and the references therein for a comprehensive survey on distributed optimization algorithms.

In the first part of this paper, we introduce a novel gradient-based algorithm (Push-Pull) for distributed (consensus-based) optimization in directed graphs. Unlike the push-sum type protocol used in the previous literature [16, 36], our algorithm uses a row stochastic matrix for the mixing of the decision variables, while it employs a column stochastic matrix for tracking the average gradients. Although motivated by a fully decentralized scheme, we show that Push-Pull can work both in fully decentralized networks and in two-tier networks.

Gossip-based communication protocols are popular choices for distributed computation due to their low communication costs [1, 10, 8, 11]. In the second part of this paper, we consider a random-gossip push-pull algorithm (G-Push-Pull) where at each iteration, an agent wakes up uniformly randomly and communicates with one or two of its neighbors. Both Push-Pull and G-Push-Pull have different variants. We show that they all converge linearly to the optimal solution for strongly convex and smooth objective functions.

1.1 Related Work

Our emphasis in the literature review is on decentralized optimization, since our approach builds on a new understanding of the decentralized consensus-based methods for directed communication networks. Most references, including [30, 29, 12, 33, 19, 27, 32, 38, 17, 4, 28, 9], restrict the underlying network connectivity structure, or more commonly require doubly stochastic mixing matrices. The work in [30] was the first to demonstrate the linear convergence of an ADMM-based decentralized optimization scheme. Reference [29] uses a gradient difference structure in the algorithm to provide the first-order decentralized optimization algorithm which is capable of achieving the typical convergence rates of a centralized gradient method, while references [12, 33] deal with second-order decentralized methods. By using Nesterov's acceleration, reference [19] has obtained a method whose convergence time scales linearly in the number of agents $n$, which is the best scaling with $n$ currently known. More recently, for a class of so-termed dual friendly functions, papers [27, 32] have obtained an optimal decentralized consensus optimization algorithm whose dependency on the condition number $\kappa$ of the system's objective function achieves the best known scaling in the order of $O(\sqrt{\kappa})$. (Footnote 1: The condition number of a smooth and strongly convex function is the ratio of its gradient Lipschitz constant and its strong convexity constant.) Work in [28, 9] investigates proximal-gradient methods which can tackle (1) with proximal friendly component functions. Paper [34] extends the work in [30] to handle asynchrony and delays. References [21, 22] considered a stochastic variant of problem (1) in asynchronous networks.
A tracking technique has been recently employed to develop decentralized algorithms for tracking the average of the Hessian/gradient in second-order methods [33], allowing uncoordinated stepsize [38, 17], handling non-convexity [4], and achieving linear convergence over time-varying graphs [16].

For directed graphs, to eliminate the need of constructing a doubly stochastic matrix in reaching consensus, reference [7] proposes the push-sum protocol. (Footnote 2: Constructing a doubly stochastic matrix over a directed graph needs weight balancing which requires an independent iterative procedure across the network; consensus is a basic coordination technique in decentralized optimization.) Reference [31] was the first to propose a push-sum based distributed optimization algorithm for directed graphs. Then, based on the push-sum technique again, a decentralized subgradient method for time-varying directed graphs was proposed and analyzed in [13]. Aiming to improve convergence for a smooth objective function and a fixed directed graph, work in [35, 40] modifies the algorithm from [29] with the push-sum technique, thus providing a new algorithm which converges linearly for a strongly convex objective function on a static graph. However, the algorithm requires a careful selection of the stepsize, and a suitable stepsize may not even exist in some cases [35]. This stability issue has been resolved in [16] in a more general setting of time-varying directed graphs.

Simultaneously and independently, a paper [37] has proposed an algorithm that is similar to the synchronous variant proposed in this paper. However, the work in [37] does not show that the algorithm unifies different architectures. Moreover, asynchronous and time-varying cases were not discussed therein.

1.2 Main Contribution

The main contribution of this paper is threefold. First, we design new distributed optimization methods (Push-Pull and G-Push-Pull) and their many variants for directed graphs. These methods utilize two different graphs for the information exchange among agents, and as such, unify different computation and communication architectures, including decentralized (peer-to-peer), centralized (master-slave), and semi-centralized (leader-follower) architecture. To the best of our knowledge, these are the first algorithms in the literature that enjoy such property.

Second, we establish the linear convergence of the proposed methods in both synchronous (Push-Pull) and asynchronous random-gossip (G-Push-Pull) settings. In particular, G-Push-Pull is the first class of gossip-type algorithms for distributed optimization over directed graphs.

Finally, in our proposed methods each agent in the network is allowed to use a different nonnegative stepsize, and only one of such stepsizes needs to be positive. This is a unique feature compared to the existing literature (e.g., [16, 36]).

Some of the results related to a variant of Push-Pull can be found in [23]. In contrast, the current work analyzes a different, more communication-efficient variant of Push-Pull, adopts an uncoordinated stepsize policy which generalizes the scheme in [23], and additionally introduces G-Push-Pull.

1.3 Organization of the Paper

The structure of this paper is as follows. We first provide notation and state basic assumptions in Subsection 1.4. Then we introduce the push-pull gradient method in Section 2 along with the intuition of its design and some examples explaining how it relates to (semi-)centralized and decentralized optimization. We establish the linear convergence of the push-pull algorithm in Section 3. In Section 4 we introduce the random-gossip push-pull method (G-Push-Pull) and demonstrate its linear convergence in Section 5. In Section 6 we conduct numerical experiments to verify our theoretical claims. Concluding remarks are given in Section 7.

1.4 Notation and Assumption

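Throughout the paper, vectors default to columns if not otherwise specified. Let $\mathcal{N} = \{1, 2, \ldots, n\}$ be the set of agents. Each agent $i \in \mathcal{N}$ holds a local copy $x_i \in \mathbb{R}^p$ of the decision variable and an auxiliary variable $y_i \in \mathbb{R}^p$ tracking the average gradients, where their values at iteration $k$ are denoted by $x_{i,k}$ and $y_{i,k}$, respectively. Let

$$x := [x_1, x_2, \ldots, x_n]^\top \in \mathbb{R}^{n \times p}, \qquad y := [y_1, y_2, \ldots, y_n]^\top \in \mathbb{R}^{n \times p}.$$

Define $F(x)$ to be an aggregate objective function of the local variables, i.e., $F(x) := \sum_{i=1}^{n} f_i(x_i)$, and write

$$\nabla F(x) := [\nabla f_1(x_1), \nabla f_2(x_2), \ldots, \nabla f_n(x_n)]^\top \in \mathbb{R}^{n \times p}.$$

We use the symbol $\mathrm{tr}\{\cdot\}$ to denote the trace of a square matrix.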

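Definition 1

Given an arbitrary vector norm $\|\cdot\|$ on $\mathbb{R}^n$, for any $x \in \mathbb{R}^{n \times p}$, we define

$$\|x\| := \left\| \left[ \|x^{(1)}\|, \|x^{(2)}\|, \ldots, \|x^{(p)}\| \right] \right\|_2,$$

where $x^{(1)}, x^{(2)}, \ldots, x^{(p)} \in \mathbb{R}^n$ are columns of $x$, and $\|\cdot\|_2$ represents the $2$-norm.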

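We make the following assumption on the functions $f_i$ in (1).

Assumption 1

Each $f_i$ is $\mu$-strongly convex and its gradient is $L$-Lipschitz continuous, i.e., for any $x, x' \in \mathbb{R}^p$,

$$\langle \nabla f_i(x) - \nabla f_i(x'), x - x' \rangle \ge \mu \|x - x'\|_2^2, \qquad \|\nabla f_i(x) - \nabla f_i(x')\|_2 \le L \|x - x'\|_2.$$

Under Assumption 1, there exists a unique optimal solution $x^* \in \mathbb{R}^{1 \times p}$ to problem (1).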

We use directed graphs to model the interaction topology among agents. A directed graph (digraph) is a pair $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, where $\mathcal{N}$ is the set of vertices (nodes) and the edge set $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$ consists of ordered pairs of vertices. If there is a directed edge from node $i$ to node $j$ in $\mathcal{G}$, or $(i,j) \in \mathcal{E}$, then $i$ is defined as the parent node and $j$ is defined as the child node. Information can be transmitted from the parent node to the child node directly. A directed path in graph $\mathcal{G}$ is a sequence of edges $(i,j)$, $(j,k)$, $(k,l)$, $\ldots$. Graph $\mathcal{G}$ is called strongly connected if there is a directed path between any pair of distinct vertices. A directed tree is a digraph where every vertex, except for the root, has only one parent. A spanning tree of a digraph is a directed tree that connects the root to all other vertices in the graph. A subgraph $\mathcal{S}$ of graph $\mathcal{G}$ is a graph whose set of vertices and set of edges are all subsets of those of $\mathcal{G}$ (see [5]).

Given a nonnegative matrix $M = [m_{ij}] \in \mathbb{R}^{n \times n}$ (a matrix is nonnegative if all its elements are nonnegative), the digraph induced by the matrix $M$ is denoted by $\mathcal{G}_M = (\mathcal{N}_M, \mathcal{E}_M)$, where $\mathcal{N}_M = \{1, 2, \ldots, n\}$ and $(j,i) \in \mathcal{E}_M$ iff (if and only if) $m_{ij} > 0$. We let $\mathcal{R}_M$ be the set of roots of all possible spanning trees in the graph $\mathcal{G}_M$. For an arbitrary agent $i \in \mathcal{N}$, we define its in-neighbor set $\mathcal{N}_{M,i}^{in}$ as the collection of all individual agents that $i$ can actively and reliably pull data from; we also define its out-neighbor set $\mathcal{N}_{M,i}^{out}$ as the collection of all individual agents that can passively and reliably receive data from $i$. In the situation when the set is time-varying, we further add a subscript to indicate that it generates a sequence of sets. For example, $\mathcal{N}_{M,i,k}^{in}$ is the in-neighbor set of $i$ at time/iteration $k$.
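As an illustration of the induced digraph and its root set, the following minimal sketch (plain Python, not taken from the paper) assumes the convention that an edge $(j,i)$ exists iff $m_{ij} > 0$ and collects every node from which all other nodes are reachable. The function names `induced_edges` and `spanning_tree_roots`, the 0-based node indices, and both example matrices are hypothetical choices made for this example.

```python
def induced_edges(M):
    """Edges (j, i) of the digraph induced by nonnegative M: j -> i iff M[i][j] > 0."""
    n = len(M)
    return {(j, i) for i in range(n) for j in range(n) if M[i][j] > 0}


def spanning_tree_roots(M):
    """Nodes that can reach every other node, i.e. roots of some spanning tree."""
    n = len(M)
    children = {j: [] for j in range(n)}
    for j, i in induced_edges(M):
        children[j].append(i)
    roots = set()
    for r in range(n):
        seen, stack = {r}, [r]
        while stack:  # depth-first search from candidate root r
            node = stack.pop()
            for child in children[node]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        if len(seen) == n:
            roots.add(r)
    return roots


# Example 1: node 0 feeds nodes 1..3 and receives from nobody (0-based indices).
star = [[1.0, 0.0, 0.0, 0.0],
        [0.5, 0.5, 0.0, 0.0],
        [0.5, 0.0, 0.5, 0.0],
        [0.5, 0.0, 0.0, 0.5]]
# Example 2: directed ring, node i receives from node i-1.
ring = [[0.5 if j in (i, (i - 1) % 4) else 0.0 for j in range(4)] for i in range(4)]

star_roots = spanning_tree_roots(star)
ring_roots = spanning_tree_roots(ring)
print(star_roots, ring_roots)
```

In the star example only the center can be a root, while in the strongly connected ring every node is a root.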

2 A Push-Pull Gradient Method

To proceed, we first illustrate and highlight the proposed algorithm, which we call Push-Pull in the following.

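Algorithm 1: Push-Pull

Each agent $i$ chooses its local step size $\alpha_i \ge 0$,
 in-bound mixing/pulling weights $R_{ij} \ge 0$ for all $j \in \mathcal{N}_{R,i}^{in}$,
 and out-bound pushing weights $C_{li} \ge 0$ for all $l \in \mathcal{N}_{C,i}^{out}$;
Each agent $i$ initializes with any arbitrary $x_{i,0} \in \mathbb{R}^p$ and $y_{i,0} = \nabla f_i(x_{i,0})$;
for $k = 0, 1, \ldots$, do
  for each $i \in \mathcal{N}$,
  agent $i$ pulls $(x_{j,k} - \alpha_j y_{j,k})$ from each $j \in \mathcal{N}_{R,i}^{in}$ respectively;
  agent $i$ pushes $C_{li} y_{i,k}$ to each $l \in \mathcal{N}_{C,i}^{out}$ respectively;
  for each $i \in \mathcal{N}$,
  $x_{i,k+1} = \sum_{j=1}^{n} R_{ij} (x_{j,k} - \alpha_j y_{j,k})$;
  $y_{i,k+1} = \sum_{j=1}^{n} C_{ij} y_{j,k} + \nabla f_i(x_{i,k+1}) - \nabla f_i(x_{i,k})$;
end for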

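Algorithm 1 (Push-Pull) can be rewritten in the following aggregated form:

$$x_{k+1} = R(x_k - \alpha y_k), \tag{2a}$$
$$y_{k+1} = C y_k + \nabla F(x_{k+1}) - \nabla F(x_k), \tag{2b}$$

where $\alpha = \mathrm{diag}\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ is a nonnegative diagonal matrix and $y_0 = \nabla F(x_0)$. We make the following assumption on the matrices $R$ and $C$.

Assumption 2

The matrix $R \in \mathbb{R}^{n \times n}$ is nonnegative row-stochastic and $C \in \mathbb{R}^{n \times n}$ is nonnegative column-stochastic, i.e., $R\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^\top C = \mathbf{1}^\top$. In addition, the diagonal entries of $R$ and $C$ are positive, i.e., $R_{ii} > 0$ and $C_{ii} > 0$ for all $i \in \mathcal{N}$.

As a result of $C$ being column-stochastic, we have by induction that

$$\frac{1}{n}\mathbf{1}^\top y_k = \frac{1}{n}\mathbf{1}^\top \nabla F(x_k), \quad \forall k. \tag{3}$$

Relation (3) is critical for (a subset of) the agents to track the average gradient $\mathbf{1}^\top \nabla F(x_k)/n$ through the $y$-update.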
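The aggregated iteration can be sketched in a few lines of plain Python. The sketch below is not the authors' implementation: it assumes scalar decision variables ($p = 1$), reads the updates as $x_{k+1} = R(x_k - \alpha y_k)$ and $y_{k+1} = C y_k + \nabla F(x_{k+1}) - \nabla F(x_k)$, and uses hypothetical quadratic costs $f_i(x) = \tfrac{a_i}{2}(x - b_i)^2$ on a hypothetical four-node directed graph. The function name `push_pull` and all numerical values are illustrative assumptions.

```python
def push_pull(R, C, alpha, grad, x0, iters):
    """Scalar Push-Pull: x+ = R(x - alpha*y), y+ = C y + grad(x+) - grad(x)."""
    n = len(x0)
    x = list(x0)
    g = [grad(i, x[i]) for i in range(n)]
    y = list(g)  # initialization y_0 = grad F(x_0)
    for _ in range(iters):
        z = [x[j] - alpha[j] * y[j] for j in range(n)]  # quantity pulled by neighbors
        x_new = [sum(R[i][j] * z[j] for j in range(n)) for i in range(n)]
        g_new = [grad(i, x_new[i]) for i in range(n)]
        y = [sum(C[i][j] * y[j] for j in range(n)) + g_new[i] - g[i]
             for i in range(n)]
        x, g = x_new, g_new
    return x, y, g


# Hypothetical local costs f_i(x) = a_i/2 * (x - b_i)^2; the sum is minimized at 1.0.
a = [1.0, 2.0, 3.0, 4.0]
b = [4.0, -1.0, 2.0, 0.5]
grad = lambda i, x: a[i] * (x - b[i])
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)

third = 1.0 / 3.0
# Row-stochastic R: agent i pulls from agent i-1; agent 0 also pulls from agent 2.
R = [[third, 0.0, third, third],
     [0.5, 0.5, 0.0, 0.0],
     [0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 0.5, 0.5]]
# Column-stochastic C: agent j pushes to agent j+1; agent 0 also pushes to agent 2.
C = [[third, 0.0, 0.0, 0.5],
     [third, 0.5, 0.0, 0.0],
     [third, 0.5, 0.5, 0.0],
     [0.0, 0.0, 0.5, 0.5]]
alpha = [0.01, 0.0, 0.01, 0.01]  # uncoordinated step sizes, one of them zero

x, y, g = push_pull(R, C, alpha, grad, [0.0] * 4, 20000)
print(x, x_star)
```

Neither matrix is doubly stochastic here, yet all local copies approach the minimizer of the sum, and the sum of the $y_i$ stays equal to the sum of the local gradients, which is the tracking property in (3).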

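We now give the condition on the structures of graphs $\mathcal{G}_R$ and $\mathcal{G}_{C^\top}$ induced by matrices $R$ and $C^\top$, respectively. Note that $\mathcal{G}_{C^\top}$ is identical to the graph $\mathcal{G}_C$ with all its edges reversed.

Assumption 3

The graphs $\mathcal{G}_R$ and $\mathcal{G}_{C^\top}$ each contain at least one spanning tree. Moreover, there exists at least one node that is a root of spanning trees for both $\mathcal{G}_R$ and $\mathcal{G}_{C^\top}$, i.e., $\mathcal{R}_R \cap \mathcal{R}_{C^\top} \neq \emptyset$, where $\mathcal{R}_R$ (resp., $\mathcal{R}_{C^\top}$) is the set of roots of all possible spanning trees in the graph $\mathcal{G}_R$ (resp., $\mathcal{G}_{C^\top}$).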

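The spanning-tree condition above (Assumption 3) is weaker than requiring that both $\mathcal{G}_R$ and $\mathcal{G}_C$ are strongly connected, which was assumed in most previous works (e.g., [16, 36, 37]). This relaxation offers us more flexibility in designing graphs $\mathcal{G}_R$ and $\mathcal{G}_C$. For instance, suppose that we have a strongly connected communication graph $\mathcal{G}$. Then there are multiple ways to construct $\mathcal{G}_R$ and $\mathcal{G}_C$ satisfying Assumption 3. One trivial approach is to set $\mathcal{G}_R = \mathcal{G}_C = \mathcal{G}$. Another way is to pick at random $i_r \in \mathcal{N}$ and let $\mathcal{G}_R$ (resp., $\mathcal{G}_C$) be a spanning tree (resp., reversed spanning tree) contained in $\mathcal{G}$ with $i_r$ as its root. Once graphs $\mathcal{G}_R$ and $\mathcal{G}_C$ are established, matrices $R$ and $C$ can be designed accordingly.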
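One simple way to design the matrices once the graphs are fixed is to use uniform weights. The sketch below is only an assumption about how this could be done, not the paper's prescription: `row_stochastic` and `column_stochastic` are hypothetical helper names, nodes are 0-based, each agent keeps a self-loop so that the diagonal entries are positive, and the pull graph is a spanning tree rooted at node 0 while the push graph is the same tree reversed.

```python
def row_stochastic(n, edges):
    """R with uniform weights; an edge (j, i) means agent i pulls from agent j."""
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sources = sorted({j for (j, t) in edges if t == i} | {i})  # self-loop: R_ii > 0
        for j in sources:
            R[i][j] = 1.0 / len(sources)
    return R


def column_stochastic(n, edges):
    """C with uniform weights; an edge (i, l) means agent i pushes to agent l."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        targets = sorted({l for (s, l) in edges if s == i} | {i})  # self-loop: C_ii > 0
        for l in targets:
            C[l][i] = 1.0 / len(targets)
    return C


n = 4
tree = [(0, 1), (0, 2), (2, 3)]              # spanning tree rooted at node 0
reversed_tree = [(i, j) for (j, i) in tree]  # pushes flow back toward the root
R = row_stochastic(n, tree)
C = column_stochastic(n, reversed_tree)

row_sums = [sum(R[i]) for i in range(n)]
col_sums = [sum(C[i][j] for i in range(n)) for j in range(n)]
print(row_sums, col_sums)
```

With this choice every row of `R` and every column of `C` sums to one, and node 0 is a common root of both the pull graph and the reversed push graph.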

We have the following result from Assumption 2 and Assumption 3.

Lemma 1

Under Assumption 2 and Assumption 3, the matrix $R$ has a unique nonnegative left eigenvector $u^\top$ (w.r.t. eigenvalue $1$) with $u^\top \mathbf{1} = n$, and the matrix $C$ has a unique nonnegative right eigenvector $v$ (w.r.t. eigenvalue $1$) with $\mathbf{1}^\top v = n$ (see [6]). Moreover, eigenvector $u^\top$ (resp., $v$) is nonzero only on the entries associated with agents $i \in \mathcal{R}_R$ (resp., $j \in \mathcal{R}_{C^\top}$), and $u^\top v > 0$.

Proof

See Appendix A.1.
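The eigenvector claims can be checked numerically on a small case. The following sketch assumes the reading of Lemma 1 in which $u^\top R = u^\top$ with $u^\top \mathbf{1} = n$ and $Cv = v$ with $\mathbf{1}^\top v = n$, and approximates both vectors by power iteration. The three-node matrices are hypothetical: with 0-based indices, the roots of the pull graph are $\{0, 1\}$ and the roots of the reversed push graph are $\{1, 2\}$, so only node 1 is shared. The helper name `normalized_fixed_vector` is also an assumption.

```python
def normalized_fixed_vector(apply, n, iters=200):
    """Power iteration for the eigenvalue-1 eigenvector, rescaled to sum to n."""
    w = [1.0] * n
    for _ in range(iters):
        w = apply(w)
        s = sum(w)
        w = [n * wi / s for wi in w]
    return w


n = 3
t = 1.0 / 3.0
R = [[0.5, 0.5, 0.0],   # row-stochastic: node 2 pulls from 0 and 1, nobody pulls from 2
     [0.5, 0.5, 0.0],
     [t, t, t]]
C = [[t, 0.0, 0.0],     # column-stochastic: node 0 pushes to 1 and 2, nobody pushes to 0
     [t, 0.5, 0.5],
     [t, 0.5, 0.5]]

# Left eigenvector of R: iterate w <- R^T w.  Right eigenvector of C: iterate w <- C w.
u = normalized_fixed_vector(
    lambda w: [sum(R[i][j] * w[i] for i in range(n)) for j in range(n)], n)
v = normalized_fixed_vector(
    lambda w: [sum(C[i][j] * w[j] for j in range(n)) for i in range(n)], n)
uv = sum(ui * vi for ui, vi in zip(u, v))
print(u, v, uv)
```

Here `u` concentrates on nodes 0 and 1, `v` on nodes 1 and 2, and the inner product is positive only because node 1 belongs to both root sets.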

Finally, we assume the following condition regarding the step sizes $\{\alpha_i\}$.

Assumption 4

There is at least one agent $i \in \mathcal{R}_R \cap \mathcal{R}_{C^\top}$ whose step size $\alpha_i$ is positive.

Assumption 3 and Assumption 4 hint at the crucial role of the set $\mathcal{R}_R \cap \mathcal{R}_{C^\top}$. In what follows, we provide some intuition for the development of Push-Pull and an interpretation of the algorithm from another perspective. The discussions will shed light on the rationale behind the assumptions.

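To motivate the development of Push-Pull, let us consider the optimality condition for (1) in the following form:

$$x^* \in \mathrm{null}\{I - R\}, \tag{4a}$$
$$\mathbf{1}^\top \nabla F(x^*) = 0, \tag{4b}$$

where, with a slight abuse of notation, $x^*$ in (4) denotes the stacked matrix $\mathbf{1}x^* \in \mathbb{R}^{n \times p}$ and $R$ satisfies Assumption 2. Consider now the algorithm in (2). Suppose that the algorithm produces two sequences $\{x_k\}$ and $\{y_k\}$ converging to some points $x_\infty$ and $y_\infty$, respectively. Then from (2a) and (2b) we would have

$$(I - R)(x_\infty - \alpha y_\infty) + \alpha y_\infty = 0, \tag{5a}$$
$$(I - C) y_\infty = 0. \tag{5b}$$

If $\mathrm{span}\{I - R\}$ and $\alpha \cdot \mathrm{null}\{I - C\}$ are disjoint (Footnote 4: This is a consequence of Assumption 4 and the relation $u^\top v > 0$ from Lemma 1.), from (5) we would have $x_\infty \in \mathrm{null}\{I - R\}$ and $\alpha y_\infty = 0$. Hence $x_\infty$ satisfies the optimality condition in (4a). In light of (5b), Assumption 4, and Lemma 1, we have $y_\infty \in \mathrm{null}\{\alpha\} \cap \mathrm{null}\{I - C\} = \{0\}$. Then from (3) we know that $\mathbf{1}^\top \nabla F(x_\infty) = \mathbf{1}^\top y_\infty = 0$, which is exactly the optimality condition in (4b).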

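For another interpretation of Push-Pull, notice that under Assumptions 2 and 3, with linear rates of convergence,

$$\lim_{k \to \infty} R^k = \frac{\mathbf{1}u^\top}{n}, \qquad \lim_{k \to \infty} C^k = \frac{v\mathbf{1}^\top}{n}. \tag{6}$$

Thus with comparatively small step sizes, relation (6) together with (3) implies that $x_k \simeq \mathbf{1}u^\top x_{k-K}/n$ (for some fixed $K > 0$) and $y_k \simeq v\mathbf{1}^\top \nabla F(x_k)/n$. From the proof of Lemma 1, eigenvector $u^\top$ (resp., $v$) is nonzero only on the entries associated with agents $i \in \mathcal{R}_R$ (resp., $j \in \mathcal{R}_{C^\top}$). Hence $x_k \simeq \mathbf{1}u^\top x_{k-K}/n$ indicates that only the state information of agents $i \in \mathcal{R}_R$ is pulled by the entire network, and $y_k \simeq v\mathbf{1}^\top \nabla F(x_k)/n$ implies that only agents $j \in \mathcal{R}_{C^\top}$ are pushed and tracking the average gradients. This “push” and “pull” information structure gives the name of the algorithm. The assumption $\mathcal{R}_R \cap \mathcal{R}_{C^\top} \neq \emptyset$ essentially says at least one agent needs to be both “pulled” and “pushed”.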

The structure of the algorithm in (2) is similar to that of the DIGing algorithm proposed in [16] with the mixing matrices distorted (doubly stochastic matrices split into a row-stochastic matrix and a column-stochastic matrix). The $x$-update can be seen as an inexact gradient step with consensus, while the $y$-update can be viewed as a gradient tracking step. Such an asymmetric $R$-$C$ structure design has already been used in the literature of average consensus [3]. However, the proposed optimization algorithm cannot be interpreted as a linear dynamical system since it has nonlinear dynamics due to the gradient terms.

The above has explained mathematically why the use of row stochastic matrices and column stochastic matrices is reasonable. Now let us explain from the implementation aspect why this algorithm is called “Push-Pull” and why it is more feasible to implement it with “Push” and “Pull” at the same time. Although, up to now, the algorithm has been designed and analyzed in the set-up of a static (time-invariant) underlying network, we cannot expect that such an ideal environment exists or is efficient. The design of Algorithm 1 is actually motivated by the algorithms proposed in reference [16], which gives us some evidence to believe that “Push-Pull” would also work over a dynamic (time-varying) network. Imagine a dynamic network at iteration/time $k$. When the information across agents needs to be diffused/fused, either an agent needs to know what scaling weights it needs to put on the quantities it sends out to other agents, or it needs to know how to combine the incoming quantities with correct weights. Specific weight assignment strategies need to be imposed when an agent's in/out-neighbors can appear and disappear from time to time. In the following, we discuss ways to diffuse/fuse information correctly in such a situation.

  A) For the networked system to maintain the column-stochasticity of $C$, i.e., $\sum_{l=1}^{n} C_{li} = 1$ for every agent $i$, an apparently convenient way is to let agent $i$ scale its data by $C_{li}$ before sending/pushing out messages. This way, it becomes agent $i$'s responsibility to synchronize its out-neighbors' receptions of messages, and it is natural to employ a reliable push-communication-protocol to implement such operations. If we instead let a neighbor request/pull information from agent $i$, either this neighbor would not know the weights $C_{li}$ and thus would not know how to combine incoming data, or one side would have to wait for the other to repeatedly revise these weights due to synchronization.

  B) Unlike what happens in A), to maintain the row-stochasticity of $R$, i.e., $\sum_{j=1}^{n} R_{ij} = 1$ for every agent $i$, the only seemingly feasible way is to let the receiver perform the tasks of scaling and combination/addition, since it would be difficult for the sender to know the weights or adjust the weights accordingly when the network changes. We may still employ the push-communication-protocol and let all in-neighbors of agent $i$ actively send their messages to $i$. However, since $i$ is passively receiving information, it is not likely that a synchronization can be coordinated by $i$; rather, $i$ may need to “subjectively” judge that a former neighbor has just disappeared if a specific time has elapsed without hearing from this former neighbor. One can actually choose to use the pull-communication-protocol to allow agent $i$ to actively pull information from the current neighbors and effectively coordinate the synchronization.

To sum up, for the general implementation of Algorithm 1, the push-protocol is necessary; supporting the pull-protocol enhances the effectiveness of network operation; but it cannot work over a “Pull” only network.

2.1 Unifying Different Distributed Computational Architecture

We now demonstrate how the proposed algorithm (2) unifies different types of distributed architecture, including decentralized, centralized, and semi-centralized architecture. For the fully decentralized case, suppose we have a graph $\mathcal{G}$ that is undirected and connected. Then we can set $\mathcal{G}_R = \mathcal{G}_C = \mathcal{G}$ and let $R = C$ be symmetric matrices, in which case the proposed algorithm reduces to the one considered in [16, 38]; if the graph is directed and strongly connected, we can also let $\mathcal{G}_R = \mathcal{G}_C = \mathcal{G}$ and design the weights for $R$ and $C$ correspondingly.

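To illustrate the less straightforward situation of (semi)-centralized networks, let us give a simple example. Consider a four-node star network composed of nodes $\{1, 2, 3, 4\}$ where node $1$ is situated at the center and nodes $2$, $3$, and $4$ are (bidirectionally) connected with node $1$ but not connected to each other. In this case, the matrices $R$ and $C$ in our algorithm can be chosen as

$$R = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 \\ 0.5 & 0 & 0.5 & 0 \\ 0.5 & 0 & 0 & 0.5 \end{bmatrix}, \qquad C = \begin{bmatrix} 1 & 0.5 & 0.5 & 0.5 \\ 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0.5 & 0 \\ 0 & 0 & 0 & 0.5 \end{bmatrix}.$$

For a graphical illustration, the corresponding network topologies of $\mathcal{G}_R$ and $\mathcal{G}_C$ are shown in Fig. 1.

Figure 1: On the left is the graph $\mathcal{G}_R$ and on the right is the graph $\mathcal{G}_C$.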

The central node $1$ pushes (diffuses) information regarding $x_{1,k}$ to the neighbors (the entire network in this case) through $\mathcal{G}_R$, while the others can only passively infuse the information from node $1$. At the same time, node $1$ pulls (collects) information regarding $y_{i,k}$ ($i = 2, 3, 4$) from the neighbors through $\mathcal{G}_C$, while the other nodes can only actively comply with the request from node $1$. This motivates the algorithm's name, push-pull gradient method. Although nodes $2$, $3$, and $4$ are updating their $y_i$'s accordingly, these quantities do not have to contribute to the optimization procedure and will die out geometrically fast due to the weights in the last three rows of $C$. Consequently, in this special case, the local stepsize $\alpha_i$ for agents $2$, $3$, and $4$ can be set to $0$. Without loss of generality, suppose $f_1(x) = 0$ for all $x$. Then the algorithm becomes a typical centralized algorithm for minimizing $\sum_{i=2}^{4} f_i(x)$ where the master node $1$ utilizes the slave nodes $2$, $3$, and $4$ to compute the gradient information in a distributed way.
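The master-slave behaviour described above can be reproduced in a short simulation. The sketch below is illustrative only: it uses 0-based indices (node 0 is the center), assumes a star-shaped choice of weights in which the followers average their own state with the center's and the center pulls from nobody, takes the push matrix as the transpose of the pull matrix, and keeps a strongly convex cost at every node (hypothetical quadratics $f_i(x) = \tfrac{a_i}{2}(x - b_i)^2$) so that Assumption 1 is not violated. Only the center uses a positive step size.

```python
n = 4
a = [1.0, 2.0, 3.0, 4.0]            # hypothetical curvatures
b = [4.0, -1.0, 2.0, 0.5]           # hypothetical local minimizers
grad = lambda i, x: a[i] * (x - b[i])
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)  # minimizer of the sum (1.0)

# Assumed star weights: followers average with the center, the center keeps its own state.
R = [[1.0, 0.0, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.0],
     [0.5, 0.0, 0.5, 0.0],
     [0.5, 0.0, 0.0, 0.5]]
C = [[R[j][i] for j in range(n)] for i in range(n)]  # followers push half of y to the center
alpha = [0.005, 0.0, 0.0, 0.0]      # only the master takes gradient steps

x = [0.0] * n
g = [grad(i, x[i]) for i in range(n)]
y = list(g)                         # y_0 = grad F(x_0)
for _ in range(2000):
    z = [x[j] - alpha[j] * y[j] for j in range(n)]
    x_new = [sum(R[i][j] * z[j] for j in range(n)) for i in range(n)]
    g_new = [grad(i, x_new[i]) for i in range(n)]
    y = [sum(C[i][j] * y[j] for j in range(n)) + g_new[i] - g[i] for i in range(n)]
    x, g = x_new, g_new

print(x, y)
```

The followers' $y_i$ decay to zero while the center accumulates the network-wide gradient, so the center effectively runs gradient descent on the sum and the followers track its iterate.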

Taking the above as an example for explaining the semi-centralized case, it is worth noting that node $1$ can be replaced by a strongly connected subnet in $\mathcal{G}_R$ and $\mathcal{G}_C$, respectively. Correspondingly, nodes $2$, $3$, and $4$ can all be replaced by subnets as long as the information from the master layer in these subnets can be diffused to all the slave layer agents in $\mathcal{G}_R$, while the information from all the slave layer agents can be diffused to the master layer in $\mathcal{G}_C$. Specific requirements on connectivities of slave subnets can be understood by using the concept of rooted trees. We refer to the nodes as leaders if their roles in the network are similar to the role of node $1$; the other nodes are termed followers. Note that after the replacement of the individual nodes by subnets, the network structure within each subnet is decentralized, while the relationship between the leader subnet and the follower subnets is master-slave. This is why we refer to such an architecture as semi-centralized.

Remark 1 (A class of Push-Pull algorithms)

There can be multiple variants of the proposed algorithm depending on whether the Adapt-then-Combine (ATC) strategy [26] is used in the $x$-update and/or the $y$-update (see Remark 3 in [16] for more details). For readability, we only illustrate one algorithm in Algorithm 1 and call it Push-Pull in the above. We also generally use “Push-Pull” to refer to a class of algorithms regardless of whether the ATC structure is used, when this causes no confusion. Our forthcoming analysis can be adapted to these variants. Our numerical tests in Section 6 only involve some variants.

3 Convergence Analysis for Push-Pull

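In this section, we study the convergence properties of the proposed algorithm. We first define the following variables:

$$\bar{x}_k := \frac{1}{n} u^\top x_k, \qquad \bar{y}_k := \frac{1}{n} \mathbf{1}^\top y_k.$$

Our strategy is to bound $\|\bar{x}_{k+1} - x^*\|_2$, $\|x_{k+1} - \mathbf{1}\bar{x}_{k+1}\|_R$ and $\|y_{k+1} - v\bar{y}_{k+1}\|_C$ in terms of linear combinations of their previous values, where $\|\cdot\|_R$ and $\|\cdot\|_C$ are specific norms to be defined later. In this way we establish a linear system of inequalities which allows us to derive the convergence results. The proof technique was inspired by [24, 36].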

3.1 Preliminary Analysis

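From the algorithm (2) and Lemma 1, we have

$$\bar{x}_{k+1} = \frac{1}{n} u^\top R (x_k - \alpha y_k) = \bar{x}_k - \frac{1}{n} u^\top \alpha y_k, \tag{7}$$

and

$$\bar{y}_{k+1} = \frac{1}{n} \mathbf{1}^\top \left( C y_k + \nabla F(x_{k+1}) - \nabla F(x_k) \right) = \bar{y}_k + \frac{1}{n} \mathbf{1}^\top \left( \nabla F(x_{k+1}) - \nabla F(x_k) \right). \tag{8}$$

Let us further define $g_k := \frac{1}{n} \mathbf{1}^\top \nabla F(\mathbf{1}\bar{x}_k)$. Then, we obtain from relation (7) that

$$\bar{x}_{k+1} = \bar{x}_k - \frac{1}{n} u^\top \alpha (y_k - v\bar{y}_k + v\bar{y}_k) = \bar{x}_k - \alpha' g_k - \alpha' (\bar{y}_k - g_k) - \frac{1}{n} u^\top \alpha (y_k - v\bar{y}_k), \tag{9}$$

where

$$\alpha' := \frac{1}{n} u^\top \alpha v. \tag{10}$$

We will show later that Assumptions 3 and 4 ensure $\alpha' > 0$.

In view of (2) and Lemma 1, using (7) we have

$$x_{k+1} - \mathbf{1}\bar{x}_{k+1} = \left( R - \frac{\mathbf{1}u^\top}{n} \right)(x_k - \mathbf{1}\bar{x}_k) - \left( R - \frac{\mathbf{1}u^\top}{n} \right) \alpha y_k, \tag{11}$$

and from (8) we obtain

$$y_{k+1} - v\bar{y}_{k+1} = \left( C - \frac{v\mathbf{1}^\top}{n} \right)(y_k - v\bar{y}_k) + \left( I - \frac{v\mathbf{1}^\top}{n} \right)\left( \nabla F(x_{k+1}) - \nabla F(x_k) \right). \tag{12}$$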

3.2 Supporting Lemmas

Before proceeding to the main results, we state a few useful lemmas.

Lemma 2

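Under Assumption 1, there holds

$$\|\bar{y}_k - g_k\|_2 \le \frac{L}{\sqrt{n}} \|x_k - \mathbf{1}\bar{x}_k\|_2, \qquad \|g_k\|_2 \le L \|\bar{x}_k - x^*\|_2.$$

In addition, when $\alpha' \le 2/(\mu + L)$, we have

$$\|\bar{x}_k - \alpha' g_k - x^*\|_2 \le (1 - \alpha' \mu) \|\bar{x}_k - x^*\|_2, \quad \forall k.$$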

Proof

See Appendix A.2.
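The second part of Lemma 2 is, on our reading, the standard contraction of a gradient step on a $\mu$-strongly convex, $L$-smooth function: for a step size at most $2/(\mu + L)$, the distance to the optimum shrinks by a factor of at most $1 - \alpha'\mu$. The sketch below checks this numerically for a scalar example. The quadratic costs, the test points, and the step sizes are hypothetical, and `g` plays the role of the averaged gradient evaluated at a common point.

```python
a = [1.0, 2.0, 3.0, 4.0]      # hypothetical curvatures, so mu = 1 and L = 4
b = [4.0, -1.0, 2.0, 0.5]     # hypothetical local minimizers
n = len(a)
mu, L = min(a), max(a)
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)


def g(x):
    """Average of the local gradients, all evaluated at the same point x."""
    return sum(ai * (x - bi) for ai, bi in zip(a, b)) / n


checks = []
for step in (0.05, 0.2, 2.0 / (mu + L)):
    for x in (-3.0, 0.0, 0.9, 5.0):
        lhs = abs(x - step * g(x) - x_star)          # distance after one gradient step
        rhs = (1.0 - step * mu) * abs(x - x_star)    # claimed contraction bound
        checks.append(lhs <= rhs + 1e-12)

print(all(checks))
```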

Lemma 3

Suppose Assumptions 2-3 hold. Let $\rho_R$ and $\rho_C$ be the spectral radii of $R - \mathbf{1}u^\top/n$ and $C - v\mathbf{1}^\top/n$, respectively. Then, we have $\rho_R < 1$ and $\rho_C < 1$.

Proof

See Appendix A.3.

Lemma 4

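There exist matrix norms $\|\cdot\|_R$ and $\|\cdot\|_C$ such that $\sigma_R := \|R - \mathbf{1}u^\top/n\|_R < 1$, $\sigma_C := \|C - v\mathbf{1}^\top/n\|_C < 1$, and $\sigma_R$ and $\sigma_C$ are arbitrarily close to $\rho_R$ and $\rho_C$, respectively. In addition, given any diagonal matrix $W \in \mathbb{R}^{n \times n}$, we have $\|W\|_R = \|W\|_C = \|W\|_2$.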

Proof

See [6, Lemma 5.6.10] and the discussions thereafter.

In the rest of this paper, with a slight abuse of notation, we do not distinguish between the vector norms on $\mathbb{R}^n$ and their induced matrix norms.

Lemma 5

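Given an arbitrary norm $\|\cdot\|$, for any $W \in \mathbb{R}^{n \times n}$ and $x \in \mathbb{R}^{n \times p}$, we have $\|Wx\| \le \|W\| \|x\|$. For any $w \in \mathbb{R}^{n \times 1}$ and $x \in \mathbb{R}^{1 \times p}$, we have $\|wx\| = \|w\| \|x\|_2$.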

Proof

See Appendix A.4.

Lemma 6

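There exist constants $\delta_{C,R}, \delta_{C,2}, \delta_{R,C}, \delta_{R,2} > 0$ such that for all $x \in \mathbb{R}^{n \times p}$, we have $\|x\|_C \le \delta_{C,R} \|x\|_R$, $\|x\|_C \le \delta_{C,2} \|x\|_2$, $\|x\|_R \le \delta_{R,C} \|x\|_C$, and $\|x\|_R \le \delta_{R,2} \|x\|_2$. In addition, with a proper rescaling of the norms $\|\cdot\|_R$ and $\|\cdot\|_C$, we have $\|x\|_2 \le \|x\|_R$ and $\|x\|_2 \le \|x\|_C$ for all $x$.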

Proof

The above result follows from the equivalence relation of all norms on $\mathbb{R}^n$ and Definition 1.

3.3 Main Results

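The following lemma establishes a linear system of inequalities that bound $\|\bar{x}_{k+1} - x^*\|_2$, $\|x_{k+1} - \mathbf{1}\bar{x}_{k+1}\|_R$ and $\|y_{k+1} - v\bar{y}_{k+1}\|_C$.

Lemma 7

Under Assumptions 1-4, when $\alpha' \le 2/(\mu + L)$, we have the following linear system of inequalities:

$$\begin{bmatrix} \|\bar{x}_{k+1} - x^*\|_2 \\ \|x_{k+1} - \mathbf{1}\bar{x}_{k+1}\|_R \\ \|y_{k+1} - v\bar{y}_{k+1}\|_C \end{bmatrix} \le A \begin{bmatrix} \|\bar{x}_k - x^*\|_2 \\ \|x_k - \mathbf{1}\bar{x}_k\|_R \\ \|y_k - v\bar{y}_k\|_C \end{bmatrix}, \tag{13}$$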

where the inequality is taken component-wise, and elements of the transition matrix are given by:

where and .

Proof

See Appendix A.5.

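In light of Lemma 7, $\|\bar{x}_k - x^*\|_2$, $\|x_k - \mathbf{1}\bar{x}_k\|_R$ and $\|y_k - v\bar{y}_k\|_C$ all converge to $0$ linearly at rate $O(\rho(A)^k)$ if the spectral radius of $A$ satisfies $\rho(A) < 1$. The next lemma provides a necessary and sufficient condition for the relation $\rho(A) < 1$ to hold.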

Lemma 8

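[22, Lemma 5] Given a nonnegative, irreducible matrix $M = [m_{ij}] \in \mathbb{R}^{3 \times 3}$ with $m_{ii} < \lambda^*$ for some $\lambda^* > 0$ for all $i = 1, 2, 3$. A necessary and sufficient condition for $\rho(M) < \lambda^*$ is $\det(\lambda^* I - M) > 0$.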
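The determinant test can be compared against a direct estimate of the spectral radius. The sketch below assumes the lemma is read as: for a nonnegative, irreducible $3 \times 3$ matrix with diagonal entries below $\lambda^*$, the spectral radius is below $\lambda^*$ exactly when $\det(\lambda^* I - M) > 0$. The two example matrices and the helper names `det3`, `spectral_radius` and `passes_det_test` are hypothetical; the power iteration is adequate here because both matrices are entrywise positive.

```python
def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def spectral_radius(M, iters=2000):
    """Power-iteration estimate of the Perron root of an entrywise positive matrix."""
    w = [1.0 / 3.0] * 3
    rho = 0.0
    for _ in range(iters):
        Mw = [sum(M[r][c] * w[c] for c in range(3)) for r in range(3)]
        rho = sum(Mw)                  # w sums to one, so this is the growth factor
        w = [val / rho for val in Mw]
    return rho


def passes_det_test(M, lam=1.0):
    shifted = [[(lam if r == c else 0.0) - M[r][c] for c in range(3)] for r in range(3)]
    return det3(shifted) > 0


M1 = [[0.5, 0.2, 0.1],   # row sums below 1, so the spectral radius is below 1
      [0.1, 0.6, 0.2],
      [0.2, 0.1, 0.5]]
M2 = [[0.5, 0.6, 0.1],   # diagonal below 1, but the spectral radius exceeds 1
      [0.6, 0.5, 0.1],
      [0.1, 0.1, 0.5]]

rho1, rho2 = spectral_radius(M1), spectral_radius(M2)
print(rho1, passes_det_test(M1), rho2, passes_det_test(M2))
```

For both matrices the sign of the determinant agrees with whether the estimated spectral radius is below one, which is the property the convergence proof relies on.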

Now, we are ready to deliver our main convergence result for the Push-Pull algorithm in (2).

Theorem 1

Suppose Assumptions 1-2 hold, for some and

(14)

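where the constants are given in (17)-(19). Then, the quantities $\|\bar{x}_k - x^*\|_2$, $\|x_k - \mathbf{1}\bar{x}_k\|_R$ and $\|y_k - v\bar{y}_k\|_C$ all converge to $0$ at the linear rate $O(\rho(A)^k)$ with $\rho(A) < 1$, where $\rho(A)$ denotes the spectral radius of $A$.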

Proof

In light of Lemma 8, it suffices to ensure and , or

(15)

We now provide some sufficient conditions under which and (15) holds true. First, is ensured by choosing . let and . We get

(16)

Second, a sufficient condition for is to substitute (resp., ) in (15) by (resp., ) and take . We then have , where

(17)
(18)

and

(19)

Hence

(20)

Relations (16) and (20) yield the final bound on .

Remark 2

Note that . The condition is automatically satisfied for a fixed in various situations. For example, if (which is always true when both and are strongly connected), we can take for . For another example, if all are equal, then .

In general, the constant roughly measures the ratio of stepsizes used by agents in and by all the agents. According to condition (14) and definitions (17)-(19), smaller leads to a tighter upper bound on the maximum stepsize .

Remark 3

When the maximum stepsize is sufficiently small, it can be shown that $\rho(A) \simeq 1 - \alpha' \mu$, in which case the Push-Pull algorithm is comparable to the centralized gradient descent method with stepsize $\alpha'$.

4 A Gossip-Like Push-Pull Method (G-Push-Pull)

In this section, we introduce a generalized random-gossip push-pull algorithm. We call it G-Push-Pull and outline it in the following (Algorithm 2). (Footnote 5: In the algorithm description, the multiplication sign “$\cdot$” is added simply for avoiding visual confusion. It still represents the commonly recognized scalar-scalar or scalar-vector multiplication.)

Algorithm 2: G-Push-Pull

Each agent chooses its local step size ;
Each agent initializes with any arbitrary and ;
for time slot do
  agent is uniformly randomly “drawn/selected” from ;
  agent uniformly randomly chooses the set , a subset of
   its out-neighbors in at “time” ;
  agent sends to all members in ;
  every agent from generates/obtains ;
  agent uniformly randomly chooses the set