1 Introduction
In this paper, we consider a system involving agents whose goal is to collaboratively solve the following problem:
$\min_{x \in \mathbb{R}^p} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x),$   (1)
where $x \in \mathbb{R}^p$ is the global decision variable and each function $f_i : \mathbb{R}^p \to \mathbb{R}$ is convex and known by agent $i$ only. The agents are embedded in a communication network, and their goal is to obtain an optimal and consensual solution through local neighbor communications and information exchange. This local exchange is desirable when the privacy of the agent data needs to be protected, or when exchanging a large amount of data is prohibitively expensive due to limited communication resources.
To solve problem (1) in a networked system of agents, many algorithms have been proposed under various assumptions on the objective functions and on the underlying networks/graphs. Static undirected graphs were extensively considered in the literature [30, 29, 19, 24, 27]. References [18, 39, 16] studied time-varying and/or stochastic undirected networks. Directed graphs were discussed in [13, 14, 40, 16, 35, 36]. Centralized (master-slave) algorithms were discussed in [2], where extensive applications in learning can be found. Parallel, coordinated, and asynchronous algorithms were discussed in [20] and the references therein. The reader is also referred to the recent paper [15] and the references therein for a comprehensive survey of distributed optimization algorithms.
In the first part of this paper, we introduce a novel gradient-based algorithm (Push-Pull) for distributed (consensus-based) optimization over directed graphs. Unlike the push-sum type protocols used in the previous literature [16, 36], our algorithm uses a row-stochastic matrix for mixing the decision variables, while it employs a column-stochastic matrix for tracking the average gradients. Although motivated by a fully decentralized scheme, we show that Push-Pull can work both in fully decentralized networks and in two-tier networks.
Gossip-based communication protocols are popular choices for distributed computation due to their low communication costs [1, 10, 8, 11]. In the second part of this paper, we consider a random-gossip push-pull algorithm (G-Push-Pull) where at each iteration an agent wakes up uniformly at random and communicates with one or two of its neighbors. Both Push-Pull and G-Push-Pull have different variants. We show that they all converge linearly to the optimal solution for strongly convex and smooth objective functions.
1.1 Related Work
Our emphasis in the literature review is on decentralized optimization, since our approach builds on a new understanding of decentralized consensus-based methods for directed communication networks. Most references, including [30, 29, 12, 33, 19, 27, 32, 38, 17, 4, 28, 9], restrict the underlying network connectivity structure, or more commonly require doubly stochastic mixing matrices. The work in [30] was the first to demonstrate the linear convergence of an ADMM-based decentralized optimization scheme. Reference [29] uses a gradient-difference structure in the algorithm to provide the first first-order decentralized optimization algorithm capable of achieving the typical convergence rates of a centralized gradient method, while references [12, 33] deal with second-order decentralized methods. By using Nesterov's acceleration, reference [19] obtained a method whose convergence time scales linearly in the number of agents $n$, which is the best currently known scaling with $n$. More recently, for a class of so-termed dual-friendly functions, papers [27, 32] obtained an optimal decentralized consensus optimization algorithm whose dependency on the condition number of the system's objective function (the ratio of its gradient Lipschitz constant to its strong convexity constant) achieves the best known scaling. The work in [28, 9] investigates proximal-gradient methods which can tackle (1) with proximal-friendly component functions. Paper [34] extends the work in [30] to handle asynchrony and delays. References [21, 22] consider a stochastic variant of problem (1) in asynchronous networks.
A tracking technique has recently been employed to develop decentralized algorithms for tracking the average of the Hessians/gradients in second-order methods [33], allowing uncoordinated step-sizes [38, 17], handling nonconvexity [4], and achieving linear convergence over time-varying graphs [16].
For directed graphs, to eliminate the need to construct a doubly stochastic matrix in reaching consensus (constructing a doubly stochastic matrix over a directed graph requires weight balancing, which in turn requires an independent iterative procedure across the network; consensus is a basic coordination technique in decentralized optimization), reference [7] proposes the push-sum protocol. Reference [31] was the first to propose a push-sum-based distributed optimization algorithm for directed graphs. Then, again based on the push-sum technique, a decentralized subgradient method for time-varying directed graphs was proposed and analyzed in [13]. Aiming to improve convergence for a smooth objective function and a fixed directed graph, the work in [35, 40] modifies the algorithm from [29] with the push-sum technique, providing a new algorithm which converges linearly for a strongly convex objective function on a static graph. However, the algorithm requires a careful selection of the step-size, which may not even exist in some cases [35]. This stability issue has been resolved in [16] in the more general setting of time-varying directed graphs.
Simultaneously and independently, the paper [37] proposed an algorithm that is similar to the synchronous variant proposed in this paper. By contrast, the work in [37] does not show that the algorithm unifies different architectures, nor does it discuss asynchronous or time-varying cases.
1.2 Main Contribution
The main contribution of this paper is threefold. First, we design new distributed optimization methods (Push-Pull and G-Push-Pull) and their many variants for directed graphs. These methods utilize two different graphs for the information exchange among agents and, as such, unify different computation and communication architectures, including decentralized (peer-to-peer), centralized (master-slave), and semi-centralized (leader-follower) architectures. To the best of our knowledge, these are the first algorithms in the literature that enjoy such a property.
Second, we establish the linear convergence of the proposed methods in both the synchronous (Push-Pull) and asynchronous random-gossip (G-Push-Pull) settings. In particular, G-Push-Pull is the first class of gossip-type algorithms for distributed optimization over directed graphs.
1.3 Organization of the Paper
The structure of this paper is as follows. We first provide notation and state basic assumptions in Subsection 1.4. We then introduce the push-pull gradient method in Section 2, along with the intuition behind its design and some examples explaining how it relates to (semi-)centralized and decentralized optimization. We establish the linear convergence of the push-pull algorithm in Section 3. In Section 4 we introduce the random-gossip push-pull method (G-Push-Pull), and we demonstrate its linear convergence in Section 5. In Section 6 we conduct numerical experiments to verify our theoretical claims. Concluding remarks are given in Section 7.
1.4 Notation and Assumptions
Throughout the paper, vectors default to columns if not otherwise specified. Let $\mathcal{N} = \{1, 2, \ldots, n\}$ be the set of agents. Each agent $i$ holds a local copy $x_i \in \mathbb{R}^p$ of the decision variable and an auxiliary variable $y_i \in \mathbb{R}^p$ tracking the average gradients; their values at iteration $k$ are denoted by $x_{i,k}$ and $y_{i,k}$, respectively. Let $\mathbf{x} := [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times p}$ and $\mathbf{y} := [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^{n \times p}$. Define $F(\mathbf{x})$ to be an aggregate objective function of the local variables, i.e., $F(\mathbf{x}) := \sum_{i=1}^n f_i(x_i)$, and write $\nabla F(\mathbf{x}) := [\nabla f_1(x_1), \nabla f_2(x_2), \ldots, \nabla f_n(x_n)]^T \in \mathbb{R}^{n \times p}$.
We use the symbol $\mathrm{tr}(\cdot)$ to denote the trace of a square matrix.
Definition 1
Given an arbitrary vector norm $\|\cdot\|$ on $\mathbb{R}^n$, for any $\mathbf{x} \in \mathbb{R}^{n \times p}$, we define
$\|\mathbf{x}\| := \left\| \left[ \|\mathbf{x}^{(1)}\|, \|\mathbf{x}^{(2)}\|, \ldots, \|\mathbf{x}^{(p)}\| \right] \right\|_2,$
where $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(p)} \in \mathbb{R}^n$ are the columns of $\mathbf{x}$, and $\|\cdot\|_2$ represents the 2-norm.
We make the following assumption on the functions $f_i$ in (1).

Assumption 1

Each $f_i$ is $\mu$-strongly convex and its gradient is $L$-Lipschitz continuous, i.e., for any $x, x' \in \mathbb{R}^p$,
$\langle \nabla f_i(x) - \nabla f_i(x'),\, x - x' \rangle \ge \mu \|x - x'\|_2^2, \qquad \|\nabla f_i(x) - \nabla f_i(x')\|_2 \le L \|x - x'\|_2.$
Under Assumption 1, there exists a unique optimal solution to problem (1).
We use directed graphs to model the interaction topology among agents. A directed graph (digraph) is a pair $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, where $\mathcal{N}$ is the set of vertices (nodes) and the edge set $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$ consists of ordered pairs of vertices. If there is a directed edge from node $i$ to node $j$ in $\mathcal{G}$, i.e., $(i, j) \in \mathcal{E}$, then $i$ is defined as the parent node and $j$ is defined as the child node. Information can be transmitted from the parent node to the child node directly. A directed path in graph $\mathcal{G}$ is a sequence of edges $(i_1, i_2), (i_2, i_3), \ldots, (i_{m-1}, i_m)$. Graph $\mathcal{G}$ is called strongly connected if there is a directed path between any pair of distinct vertices. A directed tree is a digraph in which every vertex, except for the root, has exactly one parent. A spanning tree of a digraph is a directed tree that connects the root to all other vertices in the graph. A subgraph of graph $\mathcal{G}$ is a graph whose set of vertices and set of edges are subsets of those of $\mathcal{G}$ (see [5]).

Given a nonnegative matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$ (a matrix is nonnegative if all its elements are nonnegative), the digraph induced by the matrix $\mathbf{M}$ is denoted by $\mathcal{G}_\mathbf{M} = (\mathcal{N}, \mathcal{E}_\mathbf{M})$, where $\mathcal{N} = \{1, 2, \ldots, n\}$ and $(j, i) \in \mathcal{E}_\mathbf{M}$ iff (if and only if) $M_{ij} > 0$. We let $\mathcal{R}_\mathbf{M}$ denote the set of roots of all possible spanning trees in the graph $\mathcal{G}_\mathbf{M}$. For an arbitrary agent $i$, we define its in-neighbor set $\mathcal{N}_i^{\mathrm{in}}$ as the collection of all agents that $i$ can actively and reliably pull data from; we also define its out-neighbor set $\mathcal{N}_i^{\mathrm{out}}$ as the collection of all agents that can passively and reliably receive data from $i$. When these sets are time-varying, we further add a subscript to indicate the corresponding sequence of sets. For example, $\mathcal{N}_{i,k}^{\mathrm{in}}$ is the in-neighbor set of agent $i$ at time/iteration $k$.
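Root sets of spanning trees play a central role in the assumptions that follow. As an illustration (our own sketch, not part of the paper), the following function computes the set of nodes from which every other node is reachable, which is exactly the set of roots of spanning trees of a digraph given as adjacency lists:

```python
def roots_of_spanning_trees(children):
    """Return the set of nodes that are roots of some spanning tree of a
    digraph, i.e., nodes from which every other node is reachable.

    children[u] lists the nodes reachable from u in one hop (u's children).
    """
    n = len(children)
    roots = set()
    for r in range(n):
        # depth-first search from candidate root r
        seen, stack = {r}, [r]
        while stack:
            u = stack.pop()
            for w in children[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        if len(seen) == n:  # r reaches every node, so r is a spanning-tree root
            roots.add(r)
    return roots

# A 4-node star with center 0: only the center is a root.
star_roots = roots_of_spanning_trees([[1, 2, 3], [], [], []])
# A directed 3-cycle is strongly connected: every node is a root.
cycle_roots = roots_of_spanning_trees([[1], [2], [0]])
```

For a strongly connected digraph the root set is the whole vertex set, while for a star only the center qualifies.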
2 A Push-Pull Gradient Method
To proceed, we first present the proposed algorithm, which we call Push-Pull, in the following.
Algorithm 1: Push-Pull
Each agent $i \in \mathcal{N}$ chooses its local step size $\alpha_i \ge 0$,
in-bound mixing/pulling weights $R_{ij} \ge 0$ for all $j \in \mathcal{N}_i^{\mathrm{in}}$,
and out-bound pushing weights $C_{li} \ge 0$ for all $l \in \mathcal{N}_i^{\mathrm{out}}$;
Each agent $i$ initializes with any arbitrary $x_{i,0} \in \mathbb{R}^p$ and $y_{i,0} = \nabla f_i(x_{i,0})$;
for $k = 0, 1, 2, \ldots$, do
for each $i \in \mathcal{N}$,
agent $i$ pulls $(x_{j,k} - \alpha_j y_{j,k})$ from each $j \in \mathcal{N}_i^{\mathrm{in}}$, respectively;
agent $i$ pushes $C_{li} y_{i,k}$ to each $l \in \mathcal{N}_i^{\mathrm{out}}$, respectively;
for each $i \in \mathcal{N}$,
$x_{i,k+1} = \sum_{j \in \mathcal{N}_i^{\mathrm{in}} \cup \{i\}} R_{ij} (x_{j,k} - \alpha_j y_{j,k})$;
$y_{i,k+1} = \sum_{j \in \mathcal{N}_i^{\mathrm{in}} \cup \{i\}} C_{ij} y_{j,k} + \nabla f_i(x_{i,k+1}) - \nabla f_i(x_{i,k})$;
end for
Algorithm 1 (Push-Pull) can be rewritten in the following aggregated form:

(2a) $\mathbf{x}_{k+1} = \mathbf{R}(\mathbf{x}_k - \boldsymbol{\alpha}\mathbf{y}_k),$
(2b) $\mathbf{y}_{k+1} = \mathbf{C}\mathbf{y}_k + \nabla F(\mathbf{x}_{k+1}) - \nabla F(\mathbf{x}_k),$
where $\boldsymbol{\alpha} := \mathrm{diag}\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ is a nonnegative diagonal matrix and $\mathbf{R} = [R_{ij}], \mathbf{C} = [C_{ij}] \in \mathbb{R}^{n \times n}$. We make the following assumption on the matrices $\mathbf{R}$ and $\mathbf{C}$.

Assumption 2

The matrix $\mathbf{R}$ is nonnegative row-stochastic and $\mathbf{C}$ is nonnegative column-stochastic, i.e., $\mathbf{R}\mathbf{1} = \mathbf{1}$ and $\mathbf{1}^T \mathbf{C} = \mathbf{1}^T$. In addition, the diagonal entries of $\mathbf{R}$ and $\mathbf{C}$ are positive, i.e., $R_{ii} > 0$ and $C_{ii} > 0$ for all $i \in \mathcal{N}$.

As a result of $\mathbf{C}$ being column-stochastic, we have by induction that
(3) $\mathbf{1}^T \mathbf{y}_k = \mathbf{1}^T \nabla F(\mathbf{x}_k), \qquad \forall k \ge 0.$
Relation (3) is critical for (a subset of) the agents to track the average gradient through the $\mathbf{y}$-update.
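As a purely illustrative sanity check (our own toy experiment; the graph, uniform weights, and step size are arbitrary choices, not prescribed by the paper), the following sketch runs the aggregated updates (2a)-(2b) on scalar quadratics $f_i(x) = (x - b_i)^2/2$, for which the tracking relation (3) can be verified along the way:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
b = rng.normal(size=n)        # f_i(x) = (x - b_i)^2 / 2, so x* = mean(b)
grad = lambda x: x - b        # stacked local gradients (one scalar per agent)

# Directed ring plus one extra edge; A[i, j] = 1 if agent i receives from j.
A = np.eye(n) + np.roll(np.eye(n), 1, axis=1)
A[0, 2] = 1.0
R = A / A.sum(axis=1, keepdims=True)   # row-stochastic (pull/mixing)
C = A / A.sum(axis=0, keepdims=True)   # column-stochastic (push/tracking)

alpha = 0.02 * np.ones(n)              # small uniform step sizes
x = rng.normal(size=n)
y = grad(x)                            # y_{i,0} is the local gradient at x_{i,0}

for _ in range(30000):
    g_old = grad(x)
    x = R @ (x - alpha * y)            # (2a): mix the pulled decision variables
    y = C @ y + grad(x) - g_old        # (2b): push and track the average gradient
```

On this problem the iterates reach consensus on $x^\star = \frac{1}{n}\sum_i b_i$, and $\mathbf{1}^T\mathbf{y}_k = \mathbf{1}^T\nabla F(\mathbf{x}_k)$ holds throughout, as relation (3) predicts.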
We now give the condition on the structures of the graphs $\mathcal{G}_\mathbf{R}$ and $\mathcal{G}_{\mathbf{C}^T}$ induced by the matrices $\mathbf{R}$ and $\mathbf{C}^T$, respectively. Note that $\mathcal{G}_{\mathbf{C}^T}$ is identical to the graph $\mathcal{G}_\mathbf{C}$ with all its edges reversed.

Assumption 3

The graphs $\mathcal{G}_\mathbf{R}$ and $\mathcal{G}_{\mathbf{C}^T}$ each contain at least one spanning tree. Moreover, there exists at least one node that is a root of spanning trees for both $\mathcal{G}_\mathbf{R}$ and $\mathcal{G}_{\mathbf{C}^T}$, i.e., $\mathcal{R}_\mathbf{R} \cap \mathcal{R}_{\mathbf{C}^T} \neq \emptyset$, where $\mathcal{R}_\mathbf{R}$ (resp., $\mathcal{R}_{\mathbf{C}^T}$) is the set of roots of all possible spanning trees in the graph $\mathcal{G}_\mathbf{R}$ (resp., $\mathcal{G}_{\mathbf{C}^T}$).
Assumption 3 is weaker than requiring that both $\mathcal{G}_\mathbf{R}$ and $\mathcal{G}_{\mathbf{C}^T}$ be strongly connected, which was assumed in most previous works (e.g., [16, 36, 37]). This relaxation offers more flexibility in designing the graphs $\mathcal{G}_\mathbf{R}$ and $\mathcal{G}_{\mathbf{C}^T}$. For instance, suppose that we have a strongly connected communication graph $\mathcal{G}$. Then there are multiple ways to construct $\mathcal{G}_\mathbf{R}$ and $\mathcal{G}_{\mathbf{C}^T}$ satisfying Assumption 3. One trivial approach is to set $\mathcal{G}_\mathbf{R} = \mathcal{G}_{\mathbf{C}^T} = \mathcal{G}$. Another way is to pick a node at random and let $\mathcal{G}_\mathbf{R}$ (resp., $\mathcal{G}_{\mathbf{C}^T}$) be a spanning tree (resp., reversed spanning tree) contained in $\mathcal{G}$ with this node as its root. Once the graphs $\mathcal{G}_\mathbf{R}$ and $\mathcal{G}_{\mathbf{C}^T}$ are established, the matrices $\mathbf{R}$ and $\mathbf{C}$ can be designed accordingly.
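For concreteness, here is a minimal sketch (our own; uniform weights are just one arbitrary valid choice) that builds a row-stochastic matrix and a column-stochastic matrix sharing the sparsity pattern of a given strongly connected graph, with self-loops added so the diagonal entries are positive:

```python
import numpy as np

def build_R_C(adj):
    """Build mixing matrices from a 0/1 adjacency matrix.

    adj[i][j] == 1 means agent i can receive data from agent j.  Self-loops
    are added so that the diagonal entries of R and C are positive, as the
    assumptions on the mixing matrices require.  Uniform weights are one
    arbitrary choice; any positive weights on the same sparsity pattern work.
    """
    A = np.array(adj, dtype=float)
    np.fill_diagonal(A, 1.0)                  # ensure positive diagonals
    R = A / A.sum(axis=1, keepdims=True)      # rows sum to 1 (pulling/mixing)
    C = A / A.sum(axis=0, keepdims=True)      # columns sum to 1 (pushing/tracking)
    return R, C

# Directed 3-cycle (strongly connected).
R, C = build_R_C([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
```

Any other positive weight assignment on the same edges would serve equally well; uniform weights merely keep the example short.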
Lemma 1
Under Assumptions 2 and 3, the matrix $\mathbf{R}$ has a unique nonnegative left eigenvector $\mathbf{u}^T$ (w.r.t. eigenvalue $1$) with $\mathbf{u}^T \mathbf{1} = n$, and the matrix $\mathbf{C}$ has a unique nonnegative right eigenvector $\mathbf{v}$ (w.r.t. eigenvalue $1$) with $\mathbf{1}^T \mathbf{v} = n$ (see [6]). Moreover, the eigenvector $\mathbf{u}$ (resp., $\mathbf{v}$) is nonzero only on the entries associated with the agents in $\mathcal{R}_\mathbf{R}$ (resp., $\mathcal{R}_{\mathbf{C}^T}$), and $\mathbf{u}^T \mathbf{v} > 0$.

Proof
See Appendix A.1.
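The eigenvectors in Lemma 1 are easy to inspect numerically. The sketch below (an illustration with an example graph of our own choosing, using the normalization $\mathbf{u}^T\mathbf{1} = \mathbf{1}^T\mathbf{v} = n$ assumed above) recovers $\mathbf{u}$ and $\mathbf{v}$ from an eigendecomposition and checks the claimed properties; since the example graph is strongly connected, all entries come out positive:

```python
import numpy as np

def perron_left(R):
    """Left eigenvector u of R for eigenvalue 1, scaled so that u^T 1 = n."""
    w, V = np.linalg.eig(R.T)
    u = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return u * (len(u) / u.sum())

def perron_right(C):
    """Right eigenvector v of C for eigenvalue 1, scaled so that 1^T v = n."""
    w, V = np.linalg.eig(C)
    v = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return v * (len(v) / v.sum())

# Directed 4-ring with self-loops plus one chord (strongly connected).
A = np.array([[1, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)
R = A / A.sum(axis=1, keepdims=True)   # row-stochastic
C = A / A.sum(axis=0, keepdims=True)   # column-stochastic
u, v = perron_left(R), perron_right(C)
```

For graphs that merely contain spanning trees (rather than being strongly connected), the same computation would show $\mathbf{u}$ and $\mathbf{v}$ vanishing off the respective root sets.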
Finally, we assume the following condition regarding the step sizes $\alpha_i$.

Assumption 4

There is at least one agent $i \in \mathcal{R}_\mathbf{R} \cap \mathcal{R}_{\mathbf{C}^T}$ whose step size is positive, i.e., $\alpha_i > 0$.
Assumptions 3 and 4 hint at the crucial role of the set $\mathcal{R}_\mathbf{R} \cap \mathcal{R}_{\mathbf{C}^T}$. In what follows, we provide some intuition for the development of Push-Pull and an interpretation of the algorithm from another perspective. These discussions will shed light on the rationale behind the assumptions.
To motivate the development of Push-Pull, let us consider the optimality condition for (1) in the following form:

(4a) $\mathbf{x}^* \in \mathrm{span}\{\mathbf{1}\},$
(4b) $\mathbf{1}^T \nabla F(\mathbf{x}^*) = \mathbf{0},$
where $\mathbf{x}^* \in \mathbb{R}^{n \times p}$ and $\boldsymbol{\alpha}$ satisfies Assumption 4. Consider now the algorithm in (2). Suppose that the algorithm produces two sequences $\{\mathbf{x}_k\}$ and $\{\mathbf{y}_k\}$ converging to some points $\mathbf{x}_\infty$ and $\mathbf{y}_\infty$, respectively. Then from (2a) and (2b) we would have

(5a) $(\mathbf{I} - \mathbf{R})\mathbf{x}_\infty + \mathbf{R}\boldsymbol{\alpha}\mathbf{y}_\infty = \mathbf{0},$
(5b) $\mathbf{y}_\infty = \mathbf{C}\mathbf{y}_\infty.$
If $\mathrm{span}\{(\mathbf{I} - \mathbf{R})\mathbf{x}_\infty\}$ and $\mathrm{span}\{\mathbf{R}\boldsymbol{\alpha}\mathbf{y}_\infty\}$ are disjoint (this is a consequence of Assumption 3 and the relation $\mathbf{u}^T \mathbf{v} > 0$ from Lemma 1), from (5) we would have $(\mathbf{I} - \mathbf{R})\mathbf{x}_\infty = \mathbf{0}$ and $\mathbf{R}\boldsymbol{\alpha}\mathbf{y}_\infty = \mathbf{0}$, so that $\mathbf{x}_\infty \in \mathrm{span}\{\mathbf{1}\}$. Hence $\mathbf{x}_\infty$ satisfies the optimality condition in (4a). In light of (5b), Assumption 4, and Lemma 1, we have $\mathbf{y}_\infty = \mathbf{0}$. Then from (3) we know that $\mathbf{1}^T \nabla F(\mathbf{x}_\infty) = \mathbf{0}$, which is exactly the optimality condition in (4b).
For another interpretation of Push-Pull, notice that under Assumptions 2 and 3, with linear rates of convergence,

(6) $\lim_{k \to \infty} \mathbf{R}^k = \frac{\mathbf{1}\mathbf{u}^T}{n}, \qquad \lim_{k \to \infty} \mathbf{C}^k = \frac{\mathbf{v}\mathbf{1}^T}{n}.$
Thus, with comparatively small step sizes, relation (6) together with (3) implies that $\mathbf{x}_k \approx \frac{1}{n}\mathbf{1}\mathbf{u}^T\mathbf{x}_{k'}$ (for some fixed $k'$) and $\mathbf{y}_k \approx \frac{1}{n}\mathbf{v}\mathbf{1}^T\mathbf{y}_k = \frac{1}{n}\mathbf{v}\mathbf{1}^T\nabla F(\mathbf{x}_k)$. From the proof of Lemma 1, the eigenvector $\mathbf{u}$ (resp., $\mathbf{v}$) is nonzero only on the entries associated with the agents in $\mathcal{R}_\mathbf{R}$ (resp., $\mathcal{R}_{\mathbf{C}^T}$). Hence the first relation indicates that only the state information of the agents in $\mathcal{R}_\mathbf{R}$ is pulled by the entire network, and the second implies that only the agents in $\mathcal{R}_{\mathbf{C}^T}$ are pushed to and are tracking the average gradients. This "push" and "pull" information structure gives the algorithm its name. Assumption 4 essentially says that at least one agent needs to be both "pulled" and "pushed".
The structure of the algorithm in (2) is similar to that of the DIGing algorithm proposed in [16], with the mixing matrices distorted (the doubly stochastic matrices are split into a row-stochastic matrix and a column-stochastic matrix). The $\mathbf{x}$-update can be seen as an inexact gradient step with consensus, while the $\mathbf{y}$-update can be viewed as a gradient tracking step. Such an asymmetric structure design has already been used in the literature on average consensus [3]. However, the proposed optimization algorithm cannot be interpreted as a linear dynamical system, since it has nonlinear dynamics due to the gradient terms.
The discussion above explains mathematically why the use of a row-stochastic matrix $\mathbf{R}$ and a column-stochastic matrix $\mathbf{C}$ is reasonable. Let us now explain, from the implementation perspective, why this algorithm is called "Push-Pull" and why it is natural to implement it with "push" and "pull" operations at the same time. Although the algorithm has so far been designed and analyzed for a static (time-invariant) underlying network, we cannot always expect such an ideal environment to exist or to be efficient. The design of Algorithm 1 is motivated by the algorithms proposed in [16], which gives us some evidence to believe that Push-Pull would also work over a dynamic (time-varying) network. Imagine a dynamic network at iteration/time $k$. When the information across agents needs to be diffused/fused, either an agent needs to know what scaling weights to put on the quantities it sends out to other agents, or it needs to know how to combine the incoming quantities with correct weights. Specific weight-assignment strategies need to be imposed when an agent's in/out-neighbors can appear and disappear from time to time. In the following, we discuss ways to diffuse/fuse information correctly in such a situation.

A) For the networked system to maintain the column stochasticity of $\mathbf{C}$, an apparently convenient way is to let each agent $i$ scale its data by the weights $C_{li}$ before sending/pushing out messages. This way, it becomes agent $i$'s responsibility to synchronize its out-neighbors' receptions of messages, and it is natural to employ a reliable push communication protocol to implement such operations. If we instead let a neighbor $l$ request/pull information from $i$, either this neighbor would not know $i$'s weights and thus would not know how to combine the incoming data, or $l$ would need to wait for $i$ to repetitively revise its weights due to synchronization.

B) Unlike what happens in A), to maintain the row stochasticity of $\mathbf{R}$, the only seemingly feasible way is to let the receiver perform the tasks of scaling and combination/addition, since it would be difficult for the sender to know the weights, or to adjust them accordingly, when the network changes. We may still employ the push communication protocol and let all in-neighbors of agent $i$ actively send their messages to $i$. However, since $i$ is passively receiving information, it is unlikely that synchronization can be coordinated by $i$; rather, $i$ may need to "subjectively" judge that a former neighbor has disappeared if a specific amount of time has passed without hearing from it. One can instead use a pull communication protocol to allow agent $i$ to actively pull information from its current neighbors and effectively coordinate the synchronization.
To sum up, for the general implementation of Algorithm 1, the push protocol is necessary, and supporting the pull protocol enhances the effectiveness of network operation; however, the algorithm cannot work over a pull-only network.
2.1 Unifying Different Distributed Computational Architectures
We now demonstrate how the proposed algorithm (2) unifies different types of distributed architectures, including decentralized, centralized, and semi-centralized architectures. For the fully decentralized case, suppose we have a graph $\mathcal{G}$ that is undirected and connected. Then we can set $\mathcal{G}_\mathbf{R} = \mathcal{G}_{\mathbf{C}^T} = \mathcal{G}$ and let $\mathbf{R}$ and $\mathbf{C}$ be symmetric matrices, in which case the proposed algorithm reduces to the one considered in [16, 38]; if the graph is directed and strongly connected, we can also let $\mathcal{G}_\mathbf{R} = \mathcal{G}_{\mathbf{C}^T} = \mathcal{G}$ and design the weights for $\mathbf{R}$ and $\mathbf{C}$ correspondingly.
To illustrate the less straightforward situation of (semi-)centralized networks, let us give a simple example. Consider a four-node star network composed of $\mathcal{N} = \{1, 2, 3, 4\}$, where node 1 is situated at the center and nodes 2, 3, and 4 are (bidirectionally) connected with node 1 but not connected to each other. In this case, the matrices $\mathbf{R}$ and $\mathbf{C}$ in our algorithm can be chosen, for example, as
$\mathbf{R} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 & 0 \\ \tfrac{1}{2} & 0 & \tfrac{1}{2} & 0 \\ \tfrac{1}{2} & 0 & 0 & \tfrac{1}{2} \end{bmatrix}, \qquad \mathbf{C} = \begin{bmatrix} 1 & \tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{2} \\ 0 & \tfrac{1}{2} & 0 & 0 \\ 0 & 0 & \tfrac{1}{2} & 0 \\ 0 & 0 & 0 & \tfrac{1}{2} \end{bmatrix}.$
For a graphical illustration, the corresponding network topologies of $\mathcal{G}_\mathbf{R}$ and $\mathcal{G}_{\mathbf{C}^T}$ are shown in Fig. 1.
The central node 1 pushes (diffuses) information regarding $x_{1,k}$ to the neighbors (the entire network in this case) through $\mathcal{G}_\mathbf{R}$, while the others can only passively infuse the information from node 1. At the same time, node 1 pulls (collects) information regarding $y_{i,k}$ ($\nabla f_i(x_{i,k})$) from the neighbors through $\mathcal{G}_{\mathbf{C}^T}$, while the other nodes can only actively comply with the request from node 1. This motivates the algorithm's name, the push-pull gradient method. Although nodes 2, 3, and 4 update their $y_{i,k}$'s accordingly, these quantities do not have to contribute to the optimization procedure and die out geometrically fast due to the weights in the last three rows of $\mathbf{C}$. Consequently, in this special case, the local step sizes for agents 2, 3, and 4 can be set to zero. Without loss of generality, suppose $\alpha_1 > 0$ and $\alpha_2 = \alpha_3 = \alpha_4 = 0$. Then the algorithm becomes a typical centralized algorithm for minimizing $\sum_{i=1}^n f_i(x)$, where the master node 1 utilizes the slave nodes 2, 3, and 4 to compute the gradient information in a distributed way.
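The reduction to a centralized method can be checked numerically. The sketch below (our own toy experiment, with one valid choice of star-network mixing matrices and quadratic local objectives) runs Push-Pull with only the master's step size positive; all agents end up agreeing on the minimizer of $\frac{1}{n}\sum_i f_i$:

```python
import numpy as np

n = 4
b = np.array([4.0, -2.0, 1.0, 5.0])    # f_i(x) = (x - b_i)^2 / 2
grad = lambda x: x - b

# Star-network mixing: node 1 (index 0) diffuses x and collects y.
R = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0, 0.5]])   # row-stochastic
C = np.array([[1.0, 0.5, 0.5, 0.5],
              [0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.5]])   # column-stochastic
alpha = np.array([0.05, 0.0, 0.0, 0.0])  # only the master takes gradient steps

x = np.zeros(n)
y = grad(x)                              # y_{i,0} is the local gradient
for _ in range(3000):
    g_old = grad(x)
    x = R @ (x - alpha * y)              # followers average toward the master
    y = C @ y + grad(x) - g_old          # master accumulates pushed gradients
```

The followers' $y$-variables decay geometrically (weight $1/2$ on the last three rows of $\mathbf{C}$), while the master's $y$ tracks the total gradient, so the scheme behaves like master-slave gradient descent.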
Taking the above as an example for explaining the semi-centralized case, it is worth noting that node 1 can be replaced by a strongly connected subnet in $\mathcal{G}_\mathbf{R}$ and $\mathcal{G}_{\mathbf{C}^T}$, respectively. Correspondingly, nodes 2, 3, and 4 can each be replaced by subnets, as long as the information from the master layer in these subnets can be diffused to all the slave-layer agents in $\mathcal{G}_\mathbf{R}$, while the information from all the slave-layer agents can be diffused to the master layer in $\mathcal{G}_{\mathbf{C}^T}$. Specific requirements on the connectivity of the slave subnets can be understood through the concept of rooted trees. We refer to nodes as leaders if their roles in the network are similar to the role of node 1, and we term the other nodes followers. Note that after replacing the individual nodes by subnets, the network structure within each subnet is decentralized, while the relationship between the leader subnet and the follower subnets is master-slave. This is why we refer to such an architecture as semi-centralized.
Remark 1 (A class of Push-Pull algorithms)
There can be multiple variants of the proposed algorithm depending on whether the Adapt-then-Combine (ATC) strategy [26] is used in the $\mathbf{x}$-update and/or the $\mathbf{y}$-update (see Remark 3 in [16] for more details). For readability, we illustrate only one algorithm (Algorithm 1) and call it Push-Pull in the above. We also generally use "Push-Pull" to refer to this class of algorithms, regardless of whether the ATC structure is used, when this causes no confusion. Our forthcoming analysis can be adapted to these variants. Our numerical tests in Section 6 involve only some of the variants.
3 Convergence Analysis for Push-Pull
In this section, we study the convergence properties of the proposed algorithm. We first define the following variables:
$\bar{x}_k := \frac{1}{n}\mathbf{u}^T\mathbf{x}_k \in \mathbb{R}^{1 \times p}, \qquad \bar{y}_k := \frac{1}{n}\mathbf{1}^T\mathbf{y}_k \in \mathbb{R}^{1 \times p}.$
Our strategy is to bound $\|\bar{x}_{k+1} - x^*\|_2$, $\|\mathbf{x}_{k+1} - \mathbf{1}\bar{x}_{k+1}\|_\mathbf{R}$, and $\|\mathbf{y}_{k+1} - \mathbf{v}\bar{y}_{k+1}\|_\mathbf{C}$ in terms of linear combinations of their previous values, where $\|\cdot\|_\mathbf{R}$ and $\|\cdot\|_\mathbf{C}$ are specific norms to be defined later. In this way we establish a linear system of inequalities, which allows us to derive the convergence results. The proof technique was inspired by [24, 36].
3.1 Preliminary Analysis
3.2 Supporting Lemmas
Before proceeding to the main results, we state a few useful lemmas.
Lemma 2
Proof
See Appendix A.2.
Lemma 3
Proof
See Appendix A.3.
Lemma 4
There exist matrix norms $\|\cdot\|_\mathbf{R}$ and $\|\cdot\|_\mathbf{C}$ such that $\sigma_\mathbf{R} := \|\mathbf{R} - \frac{\mathbf{1}\mathbf{u}^T}{n}\|_\mathbf{R} < 1$, $\sigma_\mathbf{C} := \|\mathbf{C} - \frac{\mathbf{v}\mathbf{1}^T}{n}\|_\mathbf{C} < 1$, and $\sigma_\mathbf{R}$ and $\sigma_\mathbf{C}$ are arbitrarily close to the spectral radii $\rho(\mathbf{R} - \frac{\mathbf{1}\mathbf{u}^T}{n})$ and $\rho(\mathbf{C} - \frac{\mathbf{v}\mathbf{1}^T}{n})$, respectively. In addition, given any diagonal matrix $\mathbf{D} \in \mathbb{R}^{n \times n}$, we have $\|\mathbf{D}\|_\mathbf{R} = \|\mathbf{D}\|_\mathbf{C} = \max_i |D_{ii}|$.
Proof
See [6, Lemma 5.6.10] and the discussions thereafter.
In the rest of this paper, with a slight abuse of notation, we do not distinguish between the vector norms on $\mathbb{R}^n$ and their induced matrix norms.
Lemma 5
Given an arbitrary norm $\|\cdot\|$, for any $\mathbf{W} \in \mathbb{R}^{n \times n}$ and $\mathbf{x} \in \mathbb{R}^{n \times p}$, we have $\|\mathbf{W}\mathbf{x}\| \le \|\mathbf{W}\| \|\mathbf{x}\|$. For any $\mathbf{w} \in \mathbb{R}^{n \times 1}$ and $x \in \mathbb{R}^{1 \times p}$, we have $\|\mathbf{w} x\| = \|\mathbf{w}\| \|x\|_2$.
Proof
See Appendix A.4.
Lemma 6
There exist constants $\delta_{\mathbf{C},\mathbf{R}}, \delta_{\mathbf{C},2}, \delta_{\mathbf{R},\mathbf{C}}, \delta_{\mathbf{R},2} > 0$ such that for all $\mathbf{x} \in \mathbb{R}^{n \times p}$, we have $\|\mathbf{x}\|_\mathbf{C} \le \delta_{\mathbf{C},\mathbf{R}} \|\mathbf{x}\|_\mathbf{R}$, $\|\mathbf{x}\|_\mathbf{C} \le \delta_{\mathbf{C},2} \|\mathbf{x}\|_2$, $\|\mathbf{x}\|_\mathbf{R} \le \delta_{\mathbf{R},\mathbf{C}} \|\mathbf{x}\|_\mathbf{C}$, and $\|\mathbf{x}\|_\mathbf{R} \le \delta_{\mathbf{R},2} \|\mathbf{x}\|_2$. In addition, with a proper rescaling of the norms $\|\cdot\|_\mathbf{R}$ and $\|\cdot\|_\mathbf{C}$, we have $\|\mathbf{x}\|_2 \le \|\mathbf{x}\|_\mathbf{R}$ and $\|\mathbf{x}\|_2 \le \|\mathbf{x}\|_\mathbf{C}$ for all $\mathbf{x}$.
Proof
The above result follows from the equivalence of all norms on $\mathbb{R}^n$ and Definition 1.
3.3 Main Results
The following lemma establishes a linear system of inequalities that bounds $\|\bar{x}_{k+1} - x^*\|_2$, $\|\mathbf{x}_{k+1} - \mathbf{1}\bar{x}_{k+1}\|_\mathbf{R}$, and $\|\mathbf{y}_{k+1} - \mathbf{v}\bar{y}_{k+1}\|_\mathbf{C}$.
Lemma 7
Proof
See Appendix A.5.
In light of Lemma 7, $\|\bar{x}_k - x^*\|_2$, $\|\mathbf{x}_k - \mathbf{1}\bar{x}_k\|_\mathbf{R}$, and $\|\mathbf{y}_k - \mathbf{v}\bar{y}_k\|_\mathbf{C}$ all converge to $0$ linearly at rate $\mathcal{O}(\rho(\mathbf{A})^k)$ if the spectral radius of the coefficient matrix $\mathbf{A}$ from Lemma 7 satisfies $\rho(\mathbf{A}) < 1$. The next lemma provides some sufficient conditions for the relation $\rho(\mathbf{A}) < 1$ to hold.
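To illustrate how a spectral radius below one yields linear decay (using a hypothetical nonnegative $3 \times 3$ coefficient matrix of our own choosing, not the one derived in Lemma 7), one can iterate the linear system of inequalities at equality:

```python
import numpy as np

# Hypothetical nonnegative coefficient matrix; every row sums to less than 1,
# so rho(A) <= ||A||_inf < 1.
A = np.array([[0.90, 0.04, 0.01],
              [0.10, 0.80, 0.05],
              [0.05, 0.05, 0.85]])
rho = max(abs(np.linalg.eigvals(A)))

z = np.ones(3)              # stand-in for the three error quantities
history = [z.copy()]
for _ in range(500):
    z = A @ z               # z_{k+1} = A z_k decays like rho(A)^k
    history.append(z.copy())
```

After 500 iterations the error vector has shrunk by roughly $\rho(\mathbf{A})^{500}$, mirroring the claimed linear (geometric) convergence rate.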
Lemma 8
[22, Lemma 5] Given a nonnegative, irreducible matrix $\mathbf{M} = [m_{ij}] \in \mathbb{R}^{3 \times 3}$ with $m_{ii} < \lambda^*$ for some $\lambda^* > 0$ for all $i = 1, 2, 3$, a necessary and sufficient condition for $\rho(\mathbf{M}) < \lambda^*$ is $\det(\lambda^*\mathbf{I} - \mathbf{M}) > 0$.
Now, we are ready to deliver our main convergence result for the Push-Pull algorithm in (2).
Theorem 1
Proof
Remark 2
Note that . The condition is automatically satisfied for a fixed in various situations. For example, if (which is always true when both and are strongly connected), we can take for . For another example, if all are equal, then .
Remark 3
When the step size is sufficiently small, it can be shown that the Push-Pull algorithm is comparable to the centralized gradient descent method with a corresponding step-size.
4 A Gossip-Like Push-Pull Method (G-Push-Pull)
In this section, we introduce a generalized random-gossip push-pull algorithm. We call it G-Push-Pull and outline it below (Algorithm 2). In the algorithm description, the multiplication sign "$\times$" is added simply to avoid visual confusion; it still represents the commonly recognized scalar-scalar or scalar-vector multiplication.
Algorithm 2: G-Push-Pull
Each agent $i$ chooses its local step size $\alpha_i \ge 0$;
Each agent $i$ initializes with any arbitrary $x_{i,0} \in \mathbb{R}^p$ and $y_{i,0} = \nabla f_i(x_{i,0})$;
for time slot $k = 0, 1, 2, \ldots$ do
an agent $i_k$ is uniformly randomly "drawn/selected" from $\mathcal{N}$;
agent $i_k$ uniformly randomly chooses a set, a subset of
its out-neighbors in $\mathcal{G}_\mathbf{R}$ at "time" $k$;
agent $i_k$ sends its information to all members of this set;
every agent in this set generates/obtains its update;
agent $i_k$ uniformly randomly chooses the set