In this paper, we consider a system involving $n$ agents whose goal is to collaboratively solve the following problem:
$$\min_{x \in \mathbb{R}^p} f(x) := \sum_{i=1}^n f_i(x), \tag{1}$$
where $x \in \mathbb{R}^p$ is the global decision variable and each function $f_i : \mathbb{R}^p \to \mathbb{R}$ is convex and known by agent $i$ only. The agents are embedded in a communication network, and their goal is to obtain an optimal and consensual solution through local neighbor communications and information exchange. Such local exchange is desirable in situations where the privacy of agent data needs to be protected, or where exchanging a large amount of data is prohibitively expensive due to limited communication resources.
To solve problem (1) in a networked system of agents, many algorithms have been proposed under various assumptions on the objective functions and on the underlying networks/graphs. Static undirected graphs were extensively considered in the literature [30, 29, 19, 24, 27]. References [18, 39, 16] studied time-varying and/or stochastic undirected networks. Directed graphs were discussed in [13, 14, 40, 16, 35, 36]. Centralized (master-slave) algorithms were discussed in , where extensive applications in learning can be found. Parallel, coordinated, and asynchronous algorithms were discussed in  and the references therein. The reader is also referred to the recent paper  and the references therein for a comprehensive survey on distributed optimization algorithms.
In the first part of this paper, we introduce a novel gradient-based algorithm (Push-Pull) for distributed (consensus-based) optimization over directed graphs. Unlike the push-sum type protocols used in the previous literature [16, 36], our algorithm uses a row-stochastic matrix for mixing the decision variables, while it employs a column-stochastic matrix for tracking the average gradients. Although motivated by a fully decentralized scheme, we show that Push-Pull can work both in fully decentralized networks and in two-tier networks.
Gossip-based communication protocols are popular choices for distributed computation due to their low communication costs [1, 10, 8, 11]. In the second part of this paper, we consider a random-gossip push-pull algorithm (G-Push-Pull) where, at each iteration, an agent wakes up uniformly at random and communicates with one or two of its neighbors. Both Push-Pull and G-Push-Pull have several variants. We show that they all converge linearly to the optimal solution when the objective functions are strongly convex and smooth.
1.1 Related Work
Our emphasis in the literature review is on decentralized optimization, since our approach builds on a new understanding of decentralized consensus-based methods for directed communication networks. Most references, including [30, 29, 12, 33, 19, 27, 32, 38, 17, 4, 28, 9], restrict the underlying network connectivity structure or, more commonly, require doubly stochastic mixing matrices. The work in  has been the first to demonstrate the linear convergence of an ADMM-based decentralized optimization scheme. Reference  uses a gradient difference structure in the algorithm to provide the first first-order decentralized optimization algorithm capable of achieving the typical convergence rates of a centralized gradient method, while references [12, 33] deal with second-order decentralized methods. By using Nesterov's acceleration, reference  has obtained a method whose convergence time scales linearly in the number of agents $n$, which is the best scaling with $n$ currently known. More recently, for a class of so-termed dual-friendly functions, papers [27, 32] have obtained an optimal decentralized consensus optimization algorithm whose dependency on the condition number (the condition number of a smooth and strongly convex function is the ratio of its gradient Lipschitz constant and its strong convexity constant) of the system's objective function achieves the best known scaling, in the order of $\sqrt{\kappa}$. Work in [28, 9] investigates proximal-gradient methods which can tackle (1) with proximal-friendly component functions. Paper  extends the work in  to handle asynchrony and delays. References [21, 22] considered a stochastic variant of problem (1) in asynchronous networks. A tracking technique has recently been employed to develop decentralized algorithms for tracking the average of the Hessian/gradient in second-order methods , allowing uncoordinated step sizes [38, 17], handling non-convexity , and achieving linear convergence over time-varying graphs .
For directed graphs, to eliminate the need for constructing a doubly stochastic matrix in reaching consensus (constructing a doubly stochastic matrix over a directed graph requires weight balancing, which needs an independent iterative procedure across the network; consensus is a basic coordination technique in decentralized optimization), reference  proposes the push-sum protocol. Reference  has been the first to propose a push-sum based distributed optimization algorithm for directed graphs. Then, based on the push-sum technique again, a decentralized subgradient method for time-varying directed graphs has been proposed and analyzed in . Aiming to improve convergence for a smooth objective function and a fixed directed graph, work in [35, 40] modifies the algorithm from  with the push-sum technique, thus providing a new algorithm which converges linearly for a strongly convex objective function on a static graph. However, the algorithm requires a careful selection of the step size, which may even be non-existent in some cases . This stability issue has been resolved in  in the more general setting of time-varying directed graphs.
Simultaneously and independently, a paper  has proposed an algorithm that is similar to the synchronous variant proposed in this paper. By contrast, the work in  does not show that the algorithm unifies different architectures. Moreover, asynchronous and time-varying cases were not discussed therein.
1.2 Main Contribution
The main contribution of this paper is threefold. First, we design new distributed optimization methods (Push-Pull and G-Push-Pull) and their many variants for directed graphs. These methods utilize two different graphs for the information exchange among agents and, as such, unify different computation and communication architectures, including decentralized (peer-to-peer), centralized (master-slave), and semi-centralized (leader-follower) architectures. To the best of our knowledge, these are the first algorithms in the literature that enjoy such a property.
Second, we establish the linear convergence of the proposed methods in both synchronous (Push-Pull) and asynchronous random-gossip (G-Push-Pull) settings. In particular, G-Push-Pull is the first class of gossip-type algorithms for distributed optimization over directed graphs.
Finally, in our proposed methods each agent in the network is allowed to use a different nonnegative step size, and only one such step size needs to be positive. This is a unique feature compared to the existing literature (e.g., [16, 36]).
1.3 Organization of the Paper
The structure of this paper is as follows. We first provide notation and state basic assumptions in Subsection 1.4. Then we introduce the push-pull gradient method in Section 2 along with the intuition of its design and some examples explaining how it relates to (semi-)centralized and decentralized optimization. We establish the linear convergence of the push-pull algorithm in Section 3. In Section 4 we introduce the random-gossip push-pull method (G-Push-Pull) and demonstrate its linear convergence in Section 5. In Section 6 we conduct numerical experiments to verify our theoretical claims. Concluding remarks are given in Section 7.
1.4 Notation and Assumption
Throughout the paper, vectors default to columns if not otherwise specified. Let $\mathcal{N} = \{1, 2, \dots, n\}$ be the set of agents. Each agent $i$ holds a local copy $x_i \in \mathbb{R}^p$ of the decision variable and an auxiliary variable $y_i \in \mathbb{R}^p$ tracking the average gradients; their values at iteration $k$ are denoted by $x_{i,k}$ and $y_{i,k}$, respectively. Let
$$\mathbf{x} := [x_1, x_2, \dots, x_n]^\top \in \mathbb{R}^{n \times p}, \qquad \mathbf{y} := [y_1, y_2, \dots, y_n]^\top \in \mathbb{R}^{n \times p}.$$
Define $F(\mathbf{x}) := \sum_{i=1}^n f_i(x_i)$ to be an aggregate objective function of the local variables, and write $\nabla F(\mathbf{x}) := [\nabla f_1(x_1), \nabla f_2(x_2), \dots, \nabla f_n(x_n)]^\top \in \mathbb{R}^{n \times p}$.
We use the symbol $\operatorname{tr}(\cdot)$ to denote the trace of a square matrix.
Given an arbitrary vector norm $\|\cdot\|$ on $\mathbb{R}^n$, for any $\mathbf{x} \in \mathbb{R}^{n \times p}$, we define
$$\|\mathbf{x}\| := \left\| \left[ \|\mathbf{x}^{(1)}\|, \|\mathbf{x}^{(2)}\|, \dots, \|\mathbf{x}^{(p)}\| \right] \right\|_2,$$
where $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(p)}$ are the columns of $\mathbf{x}$, and $\|\cdot\|_2$ represents the $2$-norm.
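As a small illustrative sketch (the matrix values are our own, not from the paper), this mixed norm can be computed by applying the underlying vector norm to each column and then taking the 2-norm of the results; here we use the $\infty$-norm as the underlying vector norm:

```python
import numpy as np

def mixed_norm(X, vec_norm=lambda c: np.linalg.norm(c, np.inf)):
    """Norm of X in R^{n x p}: apply the given vector norm to each
    column of X, then take the 2-norm of the resulting p-vector."""
    col_norms = np.array([vec_norm(X[:, j]) for j in range(X.shape[1])])
    return np.linalg.norm(col_norms, 2)

X = np.array([[3.0, -1.0],
              [0.0,  4.0]])
# columns: [3, 0] and [-1, 4]; their inf-norms are 3 and 4; 2-norm of (3, 4) = 5
print(mixed_norm(X))  # 5.0
```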
We make the following assumption on the functions $f_i$ in (1). Each $f_i$ is $\mu$-strongly convex and its gradient is $L$-Lipschitz continuous, i.e., for any $x, x' \in \mathbb{R}^p$,
$$\langle \nabla f_i(x) - \nabla f_i(x'), x - x' \rangle \ge \mu \|x - x'\|_2^2, \qquad \|\nabla f_i(x) - \nabla f_i(x')\|_2 \le L \|x - x'\|_2.$$
We use directed graphs to model the interaction topology among agents. A directed graph (digraph) is a pair $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, where $\mathcal{N}$ is the set of vertices (nodes) and the edge set $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$ consists of ordered pairs of vertices. If there is a directed edge from node $i$ to node $j$ in $\mathcal{G}$, i.e., $(i, j) \in \mathcal{E}$, then $i$ is defined as the parent node and $j$ is defined as the child node. Information can be transmitted from the parent node to the child node directly. A directed path in graph $\mathcal{G}$ is a sequence of edges $(i_1, i_2), (i_2, i_3), \dots, (i_{K-1}, i_K)$. Graph $\mathcal{G}$ is called strongly connected if there is a directed path between any pair of distinct vertices. A directed tree is a digraph where every vertex, except for the root, has exactly one parent. A spanning tree of a digraph is a directed tree that connects the root to all other vertices in the graph. A subgraph of graph $\mathcal{G}$ is a graph whose sets of vertices and edges are subsets of those of $\mathcal{G}$ (see ).
Given a nonnegative matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$ (a matrix is nonnegative if all its elements are nonnegative), the digraph induced by the matrix $\mathbf{M}$ is denoted by $\mathcal{G}_{\mathbf{M}} = (\mathcal{N}, \mathcal{E}_{\mathbf{M}})$, where $(j, i) \in \mathcal{E}_{\mathbf{M}}$ iff (if and only if) $M_{ij} > 0$. We let $\mathcal{R}_{\mathbf{M}}$ be the set of roots of all possible spanning trees in the graph $\mathcal{G}_{\mathbf{M}}$. For an arbitrary agent $i$, we define its in-neighbor set $\mathcal{N}_i^{\text{in}}$ as the collection of all individual agents that $i$ can actively and reliably pull data from; we also define its out-neighbor set $\mathcal{N}_i^{\text{out}}$ as the collection of all individual agents that can passively and reliably receive data from $i$. When these sets are time-varying, we further add a subscript to indicate the resulting sequence of sets. For example, $\mathcal{N}_{i,k}^{\text{in}}$ is the in-neighbor set of agent $i$ at time/iteration $k$.
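As a sketch of these conventions, the induced edge set and the in/out-neighbor sets can be read directly off a nonnegative weight matrix. The matrix below and the convention that a positive entry in row $i$, column $j$ means agent $i$ receives from agent $j$ are our own illustrative assumptions:

```python
import numpy as np

M = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])  # hypothetical nonnegative mixing matrix

n = M.shape[0]
# Convention assumed here: M[i, j] > 0 (j != i) means agent i receives from j,
# i.e., j is an in-neighbor of i and i is an out-neighbor of j.
in_neighbors  = {i: {j for j in range(n) if j != i and M[i, j] > 0} for i in range(n)}
out_neighbors = {j: {i for i in range(n) if i != j and M[i, j] > 0} for j in range(n)}

print(in_neighbors[0], out_neighbors[0])
```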
2 A Push-Pull Gradient Method
To proceed, we first present and highlight the proposed algorithm, which we call Push-Pull, in the following.
Algorithm 1: Push-Pull
|Each agent $i$ chooses its local step size $\alpha_i \ge 0$,|
|in-bound mixing/pulling weights $R_{ij} \ge 0$ for all $j \in \mathcal{N}_{\mathbf{R},i}^{\text{in}}$,|
|and out-bound pushing weights $C_{li} \ge 0$ for all $l \in \mathcal{N}_{\mathbf{C},i}^{\text{out}}$;|
|Each agent $i$ initializes with an arbitrary $x_{i,0} \in \mathbb{R}^p$ and $y_{i,0} = \nabla f_i(x_{i,0})$;|
|for $k = 0, 1, 2, \dots$, do|
|for each $i \in \mathcal{N}$,|
|agent $i$ pulls $(x_{j,k} - \alpha_j y_{j,k})$ from each $j \in \mathcal{N}_{\mathbf{R},i}^{\text{in}}$, respectively;|
|agent $i$ pushes $C_{li} y_{i,k}$ to each $l \in \mathcal{N}_{\mathbf{C},i}^{\text{out}}$, respectively;|
|for each $i \in \mathcal{N}$,|
|$x_{i,k+1} = \sum_{j} R_{ij} \,(x_{j,k} - \alpha_j y_{j,k})$;|
|$y_{i,k+1} = \sum_{j} C_{ij} \, y_{j,k} + \nabla f_i(x_{i,k+1}) - \nabla f_i(x_{i,k})$;|
Algorithm 1 (Push-Pull) can be rewritten in the following aggregated form:
$$\mathbf{x}_{k+1} = \mathbf{R}(\mathbf{x}_k - \boldsymbol{\Lambda}\mathbf{y}_k), \tag{2a}$$
$$\mathbf{y}_{k+1} = \mathbf{C}\mathbf{y}_k + \nabla F(\mathbf{x}_{k+1}) - \nabla F(\mathbf{x}_k), \tag{2b}$$
where $\boldsymbol{\Lambda} = \operatorname{diag}(\alpha_1, \alpha_2, \dots, \alpha_n)$ is a nonnegative diagonal matrix and $\mathbf{y}_0 = \nabla F(\mathbf{x}_0)$. We make the following assumption on the matrices $\mathbf{R}$ and $\mathbf{C}$. The matrix $\mathbf{R}$ is nonnegative row-stochastic and $\mathbf{C}$ is nonnegative column-stochastic, i.e., $\mathbf{R}\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^\top\mathbf{C} = \mathbf{1}^\top$. In addition, the diagonal entries of $\mathbf{R}$ and $\mathbf{C}$ are positive, i.e., $R_{ii} > 0$ and $C_{ii} > 0$ for all $i \in \mathcal{N}$. As a result of $\mathbf{C}$ being column-stochastic, we have by induction that
$$\mathbf{1}^\top\mathbf{y}_k = \mathbf{1}^\top\nabla F(\mathbf{x}_k), \quad \forall k. \tag{3}$$
Relation (3) is critical for (a subset of) the agents to track the average gradient through the $\mathbf{y}$-update.
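As a concrete illustration, one variant of these aggregated updates can be simulated in a few lines of Python. The 3-agent directed ring, the quadratic objectives $f_i(x) = \tfrac{1}{2}(x - b_i)^2$, and the step size below are our own illustrative choices, not values from the paper:

```python
import numpy as np

# Directed 3-agent ring: R is row-stochastic (pull/mixing of decision variables),
# C is column-stochastic (push/average-gradient tracking).
R = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])
C = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])

b = np.array([1.0, 2.0, 6.0])           # f_i(x) = 0.5 * (x - b_i)^2, optimum = mean(b)
grad = lambda x: x - b                   # stacked local gradients
alpha = 0.05 * np.ones(3)                # uniform small step sizes

x = np.zeros(3)                          # local copies of the decision variable
y = grad(x)                              # y_0 = gradient at x_0 (tracking variable)
for _ in range(2000):
    x_new = R @ (x - alpha * y)          # pull step: mix (x_j - alpha_j * y_j)
    y = C @ y + grad(x_new) - grad(x)    # push step: track the average gradient
    x = x_new

print(x)  # all entries close to mean(b) = 3.0
```

Note how the column stochasticity of `C` preserves the invariant `sum(y) == sum(grad(x))` at every iteration, which is exactly relation (3).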
We now give the condition on the structures of the graphs $\mathcal{G}_{\mathbf{R}}$ and $\mathcal{G}_{\mathbf{C}^\top}$ induced by the matrices $\mathbf{R}$ and $\mathbf{C}$, respectively. Note that $\mathcal{G}_{\mathbf{C}^\top}$ is identical to $\mathcal{G}_{\mathbf{C}}$ with all its edges reversed. The graphs $\mathcal{G}_{\mathbf{R}}$ and $\mathcal{G}_{\mathbf{C}^\top}$ each contain at least one spanning tree. Moreover, there exists at least one node that is a root of spanning trees for both $\mathcal{G}_{\mathbf{R}}$ and $\mathcal{G}_{\mathbf{C}^\top}$, i.e., $\mathcal{R}_{\mathbf{R}} \cap \mathcal{R}_{\mathbf{C}^\top} \neq \emptyset$, where $\mathcal{R}_{\mathbf{R}}$ (resp., $\mathcal{R}_{\mathbf{C}^\top}$) is the set of roots of all possible spanning trees in the graph $\mathcal{G}_{\mathbf{R}}$ (resp., $\mathcal{G}_{\mathbf{C}^\top}$).
Assumption 2 is weaker than requiring that both $\mathcal{G}_{\mathbf{R}}$ and $\mathcal{G}_{\mathbf{C}}$ are strongly connected, which was assumed in most previous works (e.g., [16, 36, 37]). This relaxation offers us more flexibility in designing the graphs $\mathcal{G}_{\mathbf{R}}$ and $\mathcal{G}_{\mathbf{C}}$. For instance, suppose that we have a strongly connected communication graph $\mathcal{G}$. Then there are multiple ways to construct $\mathcal{G}_{\mathbf{R}}$ and $\mathcal{G}_{\mathbf{C}}$ satisfying Assumption 2. One trivial approach is to set $\mathcal{G}_{\mathbf{R}} = \mathcal{G}_{\mathbf{C}^\top} = \mathcal{G}$. Another way is to pick a node at random and let $\mathcal{G}_{\mathbf{R}}$ (resp., $\mathcal{G}_{\mathbf{C}^\top}$) be a spanning tree (resp., reversed spanning tree) contained in $\mathcal{G}$ with that node as its root. Once graphs $\mathcal{G}_{\mathbf{R}}$ and $\mathcal{G}_{\mathbf{C}}$ are established, the matrices $\mathbf{R}$ and $\mathbf{C}$ can be designed accordingly.
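For the trivial construction mentioned above (using the whole strongly connected graph for both roles), one simple weight choice — our own, purely for illustration — is uniform averaging over in-neighbors for the row-stochastic matrix and uniform splitting over out-neighbors for the column-stochastic matrix:

```python
import numpy as np

# A[i, j] = 1 means there is a directed communication link j -> i
# (agent i can receive from agent j). Hypothetical 4-node digraph,
# strongly connected: 0 -> 1 -> 2 -> 3 -> 0, plus 0 -> 2.
A = np.array([[0, 0, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]

# Row-stochastic R: agent i averages uniformly over itself and its in-neighbors
# (in-degree of i = number of ones in row i).
R = (A + np.eye(n)) / (A.sum(axis=1, keepdims=True) + 1.0)

# Column-stochastic C: agent j splits its push uniformly over itself and its
# out-neighbors (out-degree of j = number of ones in column j).
C = (A + np.eye(n)) / (A.sum(axis=0, keepdims=True) + 1.0)

assert np.allclose(R.sum(axis=1), 1.0)   # rows of R sum to 1
assert np.allclose(C.sum(axis=0), 1.0)   # columns of C sum to 1
assert np.all(np.diag(R) > 0) and np.all(np.diag(C) > 0)
print("R and C satisfy the stochasticity requirements")
```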
Under Assumption 2 and the stochasticity conditions on $\mathbf{R}$ and $\mathbf{C}$, the matrix $\mathbf{R}$ has a unique nonnegative left eigenvector $u^\top$ (w.r.t. eigenvalue $1$) with $u^\top \mathbf{1} = n$, and the matrix $\mathbf{C}$ has a unique nonnegative right eigenvector $v$ (w.r.t. eigenvalue $1$) with $\mathbf{1}^\top v = n$ (see ). Moreover, the eigenvector $u$ (resp., $v$) is nonzero only on the entries associated with agents in $\mathcal{R}_{\mathbf{R}}$ (resp., $\mathcal{R}_{\mathbf{C}^\top}$), and $u^\top v > 0$.
See Appendix A.1.
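These eigenvectors are easy to compute numerically. The sketch below (with hypothetical strongly connected $\mathbf{R}$ and $\mathbf{C}$ of our own choosing) checks the normalizations $u^\top \mathbf{1} = n$ and $\mathbf{1}^\top v = n$ and the key relation $u^\top v > 0$:

```python
import numpy as np

R = np.array([[0.6, 0.0, 0.4],
              [0.3, 0.7, 0.0],
              [0.0, 0.2, 0.8]])   # row-stochastic (hypothetical)
C = np.array([[0.5, 0.3, 0.0],
              [0.0, 0.7, 0.4],
              [0.5, 0.0, 0.6]])   # column-stochastic (hypothetical)
n = R.shape[0]

def eigvec_for_one(M):
    """Nonnegative right eigenvector of M w.r.t. eigenvalue 1, scaled to sum to n."""
    w, V = np.linalg.eig(M)
    vec = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return vec / vec.sum() * n

u = eigvec_for_one(R.T)   # u^T R = u^T (left eigenvector of R),  u^T 1 = n
v = eigvec_for_one(C)     # C v = v     (right eigenvector of C), 1^T v = n

assert np.allclose(u @ R, u) and np.allclose(C @ v, v)
assert np.all(u > 0) and np.all(v > 0) and u @ v > 0
print("u =", u, " v =", v, " u.v =", u @ v)
```

Since both induced graphs here are strongly connected, $u$ and $v$ are strictly positive; with tree-structured graphs they would vanish outside the respective root sets.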
Finally, we assume the following condition regarding the step sizes $\alpha_i$: there is at least one agent $i \in \mathcal{R}_{\mathbf{R}} \cap \mathcal{R}_{\mathbf{C}^\top}$ whose step size is positive.
The assumptions above hint at the crucial role of the set $\mathcal{R}_{\mathbf{R}} \cap \mathcal{R}_{\mathbf{C}^\top}$. In what follows, we provide some intuition for the development of Push-Pull and an interpretation of the algorithm from another perspective. The discussion will shed light on the rationale behind the assumptions.
To motivate the development of Push-Pull, let us consider the optimality condition for (1) in the following form: $\mathbf{x}^* \in \mathbb{R}^{n \times p}$ is an optimal solution if and only if
$$\mathbf{x}^* = \mathbf{1}\hat{x}^{*\top} \text{ for some } \hat{x}^* \in \mathbb{R}^p, \tag{4a}$$
$$\mathbf{1}^\top \nabla F(\mathbf{x}^*) = \mathbf{0}^\top. \tag{4b}$$
Consider now the algorithm in (2). Suppose that the algorithm produces two sequences $\{\mathbf{x}_k\}$ and $\{\mathbf{y}_k\}$ converging to some points $\mathbf{x}_\infty$ and $\mathbf{y}_\infty$, respectively. Then from (2a) and (2b) we would have
$$\mathbf{x}_\infty = \mathbf{R}(\mathbf{x}_\infty - \boldsymbol{\Lambda}\mathbf{y}_\infty), \tag{5a}$$
$$\mathbf{y}_\infty = \mathbf{C}\mathbf{y}_\infty, \tag{5b}$$
where $\boldsymbol{\Lambda}$ is the nonnegative diagonal matrix of step sizes.
By (5b) and Lemma 1, $\mathbf{y}_\infty = v c^\top$ for some $c \in \mathbb{R}^p$. Multiplying (5a) from the left by $u^\top$ and using $u^\top \mathbf{R} = u^\top$ gives $(u^\top \boldsymbol{\Lambda} v)\, c^\top = \mathbf{0}^\top$, where $u^\top \boldsymbol{\Lambda} v > 0$ as a consequence of the step size assumption and the relation $u^\top v > 0$ from Lemma 1. Hence $\mathbf{y}_\infty = \mathbf{0}$ and $\mathbf{x}_\infty = \mathbf{R}\mathbf{x}_\infty$, so $\mathbf{x}_\infty$ satisfies the optimality condition in (4a). In light of (5b), the column stochasticity of $\mathbf{C}$, and Lemma 1, we also have $\mathbf{1}^\top \mathbf{y}_\infty = \mathbf{0}^\top$. Then from (3) we know that $\mathbf{1}^\top \nabla F(\mathbf{x}_\infty) = \mathbf{0}^\top$, which is exactly the optimality condition in (4b).
Thus, with comparatively small step sizes, relation (6) together with (3) implies that $\mathbf{x}_k \approx \mathbf{1}\,\frac{u^\top \mathbf{x}_k}{n}$ and $\mathbf{y}_k \approx v\,\frac{\mathbf{1}^\top \nabla F(\mathbf{x}_k)}{n}$. From the proof of Lemma 1, the eigenvector $u$ (resp., $v$) is nonzero only on the entries associated with agents in $\mathcal{R}_{\mathbf{R}}$ (resp., $\mathcal{R}_{\mathbf{C}^\top}$). Hence the first relation indicates that only the state information of agents in $\mathcal{R}_{\mathbf{R}}$ is pulled by the entire network, while the second implies that only agents in $\mathcal{R}_{\mathbf{C}^\top}$ are pushed gradient information and thus track the average gradients. This "push" and "pull" information structure gives the algorithm its name. The step size assumption essentially says that at least one agent needs to be both "pulled" and "pushed".
The structure of the algorithm in (2) is similar to that of the DIGing algorithm proposed in , with the mixing matrices distorted (the doubly stochastic matrices are split into a row-stochastic matrix and a column-stochastic matrix). The $\mathbf{x}$-update can be seen as an inexact gradient step with consensus, while the $\mathbf{y}$-update can be viewed as a gradient tracking step. Such an asymmetric $\mathbf{R}$-$\mathbf{C}$ structure has already been used in the literature on average consensus . However, the proposed optimization algorithm cannot be interpreted as a linear dynamical system, since it has nonlinear dynamics due to the gradient terms.
The above discussion has mathematically explained why the use of a row-stochastic matrix and a column-stochastic matrix is reasonable. Now let us explain, from the implementation perspective, why this algorithm is called "Push-Pull" and why it is natural to implement it with "push" and "pull" operations at the same time. Although the algorithm is designed and analyzed in the setting of a static (time-invariant) underlying network, we cannot expect such an ideal environment to always exist or be efficient. The design of Algorithm 1 is in fact motivated by the algorithms proposed in reference , which gives us reason to believe that Push-Pull would also work over a dynamic (time-varying) network. Imagine a dynamic network at iteration/time $k$. When information needs to be diffused/fused across agents, either an agent needs to know what scaling weights to put on the quantities it sends out to other agents, or it needs to know how to combine the incoming quantities with the correct weights. Specific weight-assignment strategies are needed when an agent's in/out-neighbors can appear and disappear over time. In the following, we discuss ways to diffuse/fuse information correctly in such a situation.
A) For the networked system to maintain column stochasticity, a convenient way is to let agent $i$ scale its data by the pushing weights before sending/pushing out messages. This way, it becomes agent $i$'s responsibility to synchronize its out-neighbors' receptions of messages, and it is natural to employ a reliable push-communication protocol to implement such operations. If we instead let a neighbor request/pull information from $i$, either this neighbor would not know $i$'s pushing weights and thus would not know how to combine the incoming data, or it would need to wait for $i$ to repetitively revise the weights due to synchronization.
B) Unlike what happens in A), to maintain row stochasticity, the only seemingly feasible way is to let the receiver perform the tasks of scaling and combination/addition, since it would be difficult for the sender to know or adjust the weights when the network changes. We may still employ the push-communication protocol and let all in-neighbors of agent $i$ actively send their messages to $i$. However, since $i$ is passively receiving information, it is unlikely that synchronization can be coordinated by $i$; rather, $i$ may need to "subjectively" judge that a former neighbor has disappeared if a specific time has passed without hearing from it. Alternatively, one can use the pull-communication protocol to allow agent $i$ to actively pull information from its current in-neighbors and effectively coordinate the synchronization.
To sum up, for the general implementation of Algorithm 1, the push protocol is necessary; supporting the pull protocol enhances the effectiveness of network operation; however, the algorithm cannot work over a "pull-only" network.
2.1 Unifying Different Distributed Computational Architectures
We now demonstrate how the proposed algorithm (2) unifies different types of distributed architecture, including decentralized, centralized, and semi-centralized architectures. For the fully decentralized case, suppose we have a graph $\mathcal{G}$ that is undirected and connected. Then we can set $\mathcal{G}_{\mathbf{R}} = \mathcal{G}_{\mathbf{C}} = \mathcal{G}$ and let $\mathbf{R}$ and $\mathbf{C}$ be symmetric matrices, in which case the proposed algorithm reduces to the one considered in [16, 38]; if the graph is directed and strongly connected, we can also let $\mathcal{G}_{\mathbf{R}} = \mathcal{G}_{\mathbf{C}} = \mathcal{G}$ and design the weights for $\mathbf{R}$ and $\mathbf{C}$ correspondingly.
To illustrate the less straightforward situation of (semi-)centralized networks, let us give a simple example. Consider a four-node star network composed of nodes $1$, $2$, $3$, and $4$, where node $1$ is situated at the center and nodes $2$, $3$, and $4$ are (bidirectionally) connected with node $1$ but not connected to each other. In this case, the matrices $\mathbf{R}$ and $\mathbf{C}$ in our algorithm can be chosen, for example, as
$$\mathbf{R} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 \\ 0.5 & 0 & 0.5 & 0 \\ 0.5 & 0 & 0 & 0.5 \end{bmatrix}, \qquad \mathbf{C} = \begin{bmatrix} 1 & 0.5 & 0.5 & 0.5 \\ 0 & 0.5 & 0 & 0 \\ 0 & 0 & 0.5 & 0 \\ 0 & 0 & 0 & 0.5 \end{bmatrix}.$$
For a graphical illustration, the corresponding network topologies of $\mathcal{G}_{\mathbf{R}}$ and $\mathcal{G}_{\mathbf{C}}$ are shown in Fig. 1.
The central node $1$ pushes (diffuses) information regarding $x_{1,k}$ to the neighbors (the entire network in this case) through $\mathcal{G}_{\mathbf{R}}$, while the others can only passively infuse the information from node $1$. At the same time, node $1$ pulls (collects) information regarding the gradients (through $y_{j,k}$) from the neighbors through $\mathcal{G}_{\mathbf{C}}$, while the other nodes can only actively comply with the request from node $1$. This motivates the algorithm's name, push-pull gradient method. Although nodes $2$, $3$, and $4$ are updating their $y_i$'s accordingly, these quantities do not have to contribute to the optimization procedure and will die out geometrically fast due to the weights in the last three rows of $\mathbf{C}$. Consequently, in this special case, the local step sizes for agents $2$, $3$, and $4$ can be set to $0$. Without loss of generality, suppose $\alpha_1 > 0$. Then the algorithm becomes a typical centralized algorithm for minimizing $\sum_i f_i(x)$, where the master node $1$ utilizes the slave nodes $2$, $3$, and $4$ to compute the gradient information in a distributed way.
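To make this concrete, here is a small sketch of the star-network case in which only the center node uses a positive step size. The specific weights, quadratic objectives, and step size are our own illustrative assumptions:

```python
import numpy as np

# Star network: node 0 is the center (master), nodes 1-3 are leaves (slaves).
# R: leaves pull x from the center; C: leaves push y toward the center.
R = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0, 0.5]])   # row-stochastic
C = np.array([[1.0, 0.5, 0.5, 0.5],
              [0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.5]])   # column-stochastic

b = np.array([1.0, 2.0, 3.0, 6.0])     # f_i(x) = 0.5 * (x - b_i)^2
grad = lambda x: x - b
alpha = np.array([0.02, 0.0, 0.0, 0.0])  # only the center's step size is positive

x = np.zeros(4)
y = grad(x)                              # tracking variables, y_0 = gradient at x_0
for _ in range(5000):
    x_new = R @ (x - alpha * y)          # leaves pull the center's state
    y = C @ y + grad(x_new) - grad(x)    # leaves push gradient info to the center
    x = x_new

print(x)  # every node close to the optimum mean(b) = 3.0
```

Note that the leaves' tracking variables indeed die out (their rows of `C` have weight 0.5 on the diagonal), while the center accumulates the network-wide gradient information.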
Taking the above as an example for explaining the semi-centralized case, it is worth noting that node $1$ can be replaced by a strongly connected subnet in $\mathcal{G}_{\mathbf{R}}$ and $\mathcal{G}_{\mathbf{C}}$, respectively. Correspondingly, nodes $2$, $3$, and $4$ can all be replaced by subnets, as long as the information from the master layer in these subnets can be diffused to all the slave-layer agents in $\mathcal{G}_{\mathbf{R}}$, while the information from all the slave-layer agents can be diffused to the master layer in $\mathcal{G}_{\mathbf{C}}$. Specific requirements on the connectivity of slave subnets can be understood through the concept of rooted trees. We refer to nodes as leaders if their roles in the network are similar to the role of node $1$; the other nodes are termed followers. Note that after the replacement of the individual nodes by subnets, the network structure within each subnet is decentralized, while the relationship between the leader subnet and the follower subnets is master-slave. This is why we refer to such an architecture as semi-centralized.
Remark 1 (A class of Push-Pull algorithms)
There can be multiple variants of the proposed algorithm depending on whether the Adapt-then-Combine (ATC) strategy  is used in the $\mathbf{x}$-update and/or the $\mathbf{y}$-update (see Remark 3 in  for more details). For readability, we illustrate only one variant in Algorithm 1 and call it Push-Pull in the above. We also generally use "Push-Pull" to refer to this class of algorithms, regardless of whether the ATC structure is used, when no confusion arises. Our forthcoming analysis can be adapted to these variants. Our numerical tests in Section 6 involve only some of the variants.
3 Convergence Analysis for Push-Pull
In this section, we study the convergence properties of the proposed algorithm. We first define the following variables:
$$\bar{x}_k := \frac{1}{n} u^\top \mathbf{x}_k, \qquad \bar{y}_k := \frac{1}{n} \mathbf{1}^\top \mathbf{y}_k.$$
Our strategy is to bound $\|\bar{x}_k - x^*\|_2$, $\|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|_R$, and $\|\mathbf{y}_k - v\bar{y}_k\|_C$ in terms of linear combinations of their previous values, where $\|\cdot\|_R$ and $\|\cdot\|_C$ are specific norms to be defined later. In this way we establish a linear system of inequalities which allows us to derive the convergence results. The proof technique was inspired by [24, 36].
3.1 Preliminary Analysis
Let us further define $g_k := \frac{1}{n} \mathbf{1}^\top \nabla F(\mathbf{x}_k)$. Then, we obtain from relation (7) that
3.2 Supporting Lemmas
Before proceeding to the main results, we state a few useful lemmas.
Under Assumption 1, there holds
In addition, when , we have
See Appendix A.2.
See Appendix A.3.
There exist matrix norms $\|\cdot\|_R$ and $\|\cdot\|_C$ such that $\sigma_R := \|\mathbf{R} - \frac{\mathbf{1}u^\top}{n}\|_R < 1$, $\sigma_C := \|\mathbf{C} - \frac{v\mathbf{1}^\top}{n}\|_C < 1$, and $\sigma_R$ and $\sigma_C$ are arbitrarily close to the spectral radii $\rho(\mathbf{R} - \frac{\mathbf{1}u^\top}{n})$ and $\rho(\mathbf{C} - \frac{v\mathbf{1}^\top}{n})$, respectively. In addition, given any diagonal matrix $\mathbf{W} \in \mathbb{R}^{n \times n}$, we have $\|\mathbf{W}\|_R = \|\mathbf{W}\|_C = \max_i |W_{ii}|$.
See [6, Lemma 5.6.10] and the discussions thereafter.
In the rest of this paper, with a slight abuse of notation, we do not distinguish between the vector norms on and their induced matrix norms.
Given an arbitrary norm $\|\cdot\|$, for any $\mathbf{W} \in \mathbb{R}^{n \times n}$ and $\mathbf{x} \in \mathbb{R}^{n \times p}$, we have $\|\mathbf{W}\mathbf{x}\| \le \|\mathbf{W}\| \|\mathbf{x}\|$. For any $w \in \mathbb{R}^n$ and $x \in \mathbb{R}^p$, we have $\|w x^\top\| = \|w\| \|x\|_2$.
See Appendix A.4.
There exist constants $\delta_{2,R}, \delta_{2,C}, \delta_{R,2}, \delta_{C,2} > 0$ such that for all $\mathbf{x} \in \mathbb{R}^{n \times p}$, we have $\|\mathbf{x}\|_2 \le \delta_{2,R}\|\mathbf{x}\|_R$, $\|\mathbf{x}\|_2 \le \delta_{2,C}\|\mathbf{x}\|_C$, $\|\mathbf{x}\|_R \le \delta_{R,2}\|\mathbf{x}\|_2$, and $\|\mathbf{x}\|_C \le \delta_{C,2}\|\mathbf{x}\|_2$. In addition, with a proper rescaling of the norms $\|\cdot\|_R$ and $\|\cdot\|_C$, we have $\|\mathbf{x}\|_2 \le \|\mathbf{x}\|_R$ and $\|\mathbf{x}\|_2 \le \|\mathbf{x}\|_C$ for all $\mathbf{x}$.
The above result follows from the equivalence relation of all norms on and Definition 1.
3.3 Main Results
The following lemma establishes a linear system of inequalities that bounds $\|\bar{x}_{k+1} - x^*\|_2$, $\|\mathbf{x}_{k+1} - \mathbf{1}\bar{x}_{k+1}\|_R$, and $\|\mathbf{y}_{k+1} - v\bar{y}_{k+1}\|_C$.
See Appendix A.5.
In light of Lemma 7, $\|\bar{x}_k - x^*\|_2$, $\|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|_R$, and $\|\mathbf{y}_k - v\bar{y}_k\|_C$ all converge to $0$ linearly at the rate $O(\rho(\mathbf{A})^k)$ if the spectral radius of $\mathbf{A}$ satisfies $\rho(\mathbf{A}) < 1$. The next lemma provides some sufficient conditions for the relation $\rho(\mathbf{A}) < 1$ to hold.
[22, Lemma 5] Given a nonnegative, irreducible matrix $\mathbf{M}$ with $M_{ii} < \lambda^*$ for some $\lambda^* > 0$ and all $i$, a necessary and sufficient condition for $\rho(\mathbf{M}) < \lambda^*$ is $\det(\lambda^* \mathbf{I} - \mathbf{M}) > 0$.
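This lemma is easy to sanity-check numerically. The two $3 \times 3$ nonnegative irreducible matrices below (with $\lambda^* = 1$ and diagonal entries below $1$) are hypothetical examples of our own:

```python
import numpy as np

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

lam = 1.0  # lambda*

# Nonnegative irreducible matrices (nonzero pattern forms a cycle 0 -> 1 -> 2 -> 0)
# with diagonal entries strictly below lam:
M1 = np.array([[0.5, 0.1, 0.0],
               [0.0, 0.6, 0.1],
               [0.2, 0.0, 0.7]])   # small off-diagonal part: rho(M1) < 1
M2 = np.array([[0.5, 0.9, 0.0],
               [0.0, 0.6, 0.9],
               [0.9, 0.0, 0.7]])   # large off-diagonal part: rho(M2) > 1

for M in (M1, M2):
    lhs = spectral_radius(M) < lam
    rhs = np.linalg.det(lam * np.eye(3) - M) > 0
    assert lhs == rhs              # the equivalence stated in the lemma
    print(spectral_radius(M), rhs)
```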
Now, we are ready to deliver our main convergence result for the Push-Pull algorithm in (2).
In light of Lemma 8, it suffices to ensure that the diagonal entries of $\mathbf{A}$ are less than $1$ and that $\det(\mathbf{I} - \mathbf{A}) > 0$, or
We now provide some sufficient conditions under which and (15) holds true. First, is ensured by choosing . let and . We get
Note that . The condition is automatically satisfied for a fixed in various situations. For example, if (which is always true when both and are strongly connected), we can take for . For another example, if all are equal, then .
When is sufficiently small, it can be shown that , in which case the Push-Pull algorithm is comparable to the centralized gradient descent method with stepsize .
4 A Gossip-Like Push-Pull Method (G-Push-Pull)
In this section, we introduce a generalized random-gossip push-pull algorithm. We call it G-Push-Pull and outline it in the following (Algorithm 2). (In the algorithm description, the multiplication sign "$\times$" is added simply to avoid visual confusion; it still represents the commonly recognized scalar-scalar or scalar-vector multiplication.)
Algorithm 2: G-Push-Pull
|Each agent $i$ chooses its local step size $\alpha_i \ge 0$;|
|Each agent $i$ initializes with any arbitrary $x_{i,0} \in \mathbb{R}^p$ and $y_{i,0} = \nabla f_i(x_{i,0})$;|
|for time slot $k = 0, 1, 2, \dots$ do|
|agent $i_k$ is uniformly randomly "drawn/selected" from $\mathcal{N}$;|
|agent uniformly randomly chooses the set , a subset of|
|its out-neighbors in at “time” ;|
|agent sends to all members in ;|
|every agent from generates/obtains ;|
|agent uniformly randomly chooses the set|