In multi-agent consensus optimization, $n$ agents or nodes, as we will refer to them throughout this article, cooperate to solve an optimization problem. A local objective function $f_i : \mathbb{R}^d \to \mathbb{R}$ is associated with each node $i$, and the goal is for all nodes to find and agree on a minimizer of the average objective $\bar{f} = \frac{1}{n}\sum_{i=1}^{n} f_i$ in a decentralized way. Each node maintains its own copy $x_i$ of the optimization variable, and node $i$ only has direct access to information about its local objective $f_i$; for example, node $i$ may be able to calculate the gradient of $f_i$ evaluated at $x_i$. Throughout this article we will focus on the case where the functions $f_i$ are convex (so $\bar{f}$ is also convex) and where $\bar{f}$ has a non-empty set of minimizers, so that the problem is well-defined.
Because each node only has access to local information, the nodes must communicate over a network to find a minimizer of $\bar{f}$. Multi-agent consensus optimization algorithms are iterative, where each iteration typically involves some local computation followed by communication over the network. In most applications of interest, either it is not possible or one does not allow each node to communicate with every other node. The connectivity of the network (i.e., which pairs of nodes may communicate directly with each other) is represented as a graph with $n$ vertices and with an edge between a pair of vertices if they communicate directly. In general, the network connectivity may change from iteration to iteration. We will see that, indeed, the communication network topology plays a key role in the convergence theory of multi-agent optimization methods in that it may limit the flow of information between distant nodes and thereby hinder convergence.
During the past decade, multi-agent consensus optimization has been the subject of intense interest, motivated by a variety of applications.
I-A Motivating Applications
The general multi-agent optimization problem described above was originally introduced and studied in the 1980s in the context of parallel and distributed numerical methods [1, 2, 3]. The surge of interest in multi-agent convex optimization during the past decade has been fueled by a variety of applications where a network of autonomous agents must coordinate to achieve a common objective. We describe three such examples next.
I-A1 Decentralized estimation
Consider a wireless sensor network with $n$ nodes where node $i$ has a measurement $y_i$, which is modeled as a random variable with density $p_i(y_i \mid \theta)$ depending on an unknown parameter vector $\theta$. In many applications of sensor networks, uncertainty is primarily due to thermal measurement noise introduced at the sensor itself, and so it is reasonable to assume that the observations $y_i$ are conditionally independent given the model parameters $\theta$. In this case, the maximum likelihood estimate of $\theta$ can be obtained by solving
$$\min_{\theta} \ \frac{1}{n} \sum_{i=1}^{n} f_i(\theta), \qquad f_i(\theta) = -\log p_i(y_i \mid \theta),$$
which is exactly a problem of the form described above, with the negative log-likelihood at node $i$ playing the role of the local objective.
When the nodes communicate over a wireless network, whether or not a given pair of nodes can directly communicate is typically a function of their physical proximity as well as other factors (e.g., fading, shadowing) affecting the wireless channel, which may possibly result in time-varying and directed network connectivity.
I-A2 Big data and machine learning
Many methods for supervised learning (e.g., classification or regression) can also be formulated as fitting a model to data. This task may generally be expressed as finding model parameters $w$ by solving
$$\min_{w} \ \sum_{j=1}^{m} \ell(w; \xi_j), \qquad (1)$$
where the loss function $\ell(w; \xi_j)$ measures how well the model with parameters $w$ describes the $j$th training instance $\xi_j$, and the training data set contains $m$ instances in total. For many popular machine learning models—including linear regression, logistic regression, ridge regression, the LASSO, support vector machines, and their variants—the corresponding loss function $\ell$ is convex.
When $m$ is large, it may not be possible to store the training data on a single server, or it may be desirable for other reasons to partition the training data across multiple nodes (e.g., to speed up training by exploiting parallel computing resources, or because the data is being gathered and/or stored at geographically distant locations). In this case, the training task (1) can be solved using multi-agent optimization with local objectives of the form [8, 9, 10, 11, 12]
$$f_i(w) = \sum_{j \in S_i} \ell(w; \xi_j),$$
where $S_i$ is the set of indices of the training instances stored at node $i$.
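As a sanity check on this decomposition, the sketch below uses a hypothetical least-squares loss (the data, the partition `parts`, and all names are illustrative, not from the article) to verify that the gradients of the local objectives $f_i$ sum to the gradient of the full training objective.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 12, 3
X, y = rng.normal(size=(m, d)), rng.normal(size=m)

def full_grad(w):
    # Gradient of the full least-squares loss sum_j 0.5*(x_j^T w - y_j)^2.
    return X.T @ (X @ w - y)

def local_grad(w, idx):
    # Gradient of f_i(w): the same loss restricted to node i's instances S_i.
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ w - yi)

# Illustrative partition of the m = 12 instances across n = 3 nodes.
parts = [range(0, 4), range(4, 8), range(8, 12)]
w = rng.normal(size=d)
g_sum = sum(local_grad(w, list(p)) for p in parts)
```

Since the $S_i$ are disjoint and cover all instances, `g_sum` matches `full_grad(w)` up to floating-point round-off.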
In this setting, where nodes typically correspond to servers communicating over a wired network, it may be feasible for every node to send and receive messages from all other nodes. However, it is often still preferable, for a variety of reasons, to run multi-agent algorithms over a network with sparser connectivity. Communicating a message takes time, and reducing the number of edges in the communication graph at any iteration corresponds to reducing the number of messages to be transmitted. This results in iterations that take less time and consume less bandwidth.
Other architectures for distributed optimization have been studied, notably the master-worker architecture, where many workers perform calculations in parallel (e.g., of gradients or other step directions), and the master collects and applies these steps to the single copy of the optimization variable. In contrast to this architecture, which has a bottleneck and single point of failure at the master node, the multi-agent consensus approach can tolerate node or link failures—if one node fails, the remaining nodes in the network can continue to compute as before without any loss of information or the need to elect a new master node, since each has a local copy of the optimization variable.
I-A3 Multi-robot systems
Similar to the previous example, multi-agent methods have attracted attention in applications requiring the coordination of multiple robots because they naturally lead to decentralized solutions. One well-studied problem arising in such systems is that of rendezvous—collectively deciding on a meeting time and location. When the robots have different battery levels or are otherwise heterogeneous, it may be desirable to design a rendezvous time and place, and corresponding control trajectories, which minimize the energy to be expended collectively by all robots. This can be formulated as a multi-agent optimization problem where the local objective $f_i(x)$ at agent $i$ quantifies the energy to be expended by agent $i$, and the decision variable $x$ encodes the time and place for rendezvous [13, 14].
When robots communicate over a wireless network, the network connectivity will be dependent on the proximity of nodes as well as other factors affecting channel conditions, similar to the first example. Moreover, as the robots move, the network connectivity is likely to change. It may be desirable to ensure that a certain minimal level of network connectivity is maintained while the multi-robot system performs its task, and such requirements can be enforced by introducing constraints in the optimization formulation.
I-B Outline of the rest of the paper
The purpose of this article is to provide an overview of the main advances in this field, highlighting the state-of-the-art methods and their analyses, and pointing out open questions. During the past decade, a vast literature has amassed on multi-agent optimization and related methods, and we do not attempt to provide an exhaustive review (which, in any case, would not be feasible in the space of one article). Rather, in addition to describing the main advances and results leading to the current state-of-the-art, we also seek to provide an accessible survey of theoretical techniques arising in the analysis of multi-agent optimization methods.
Distributed averaging algorithms—where each node initially holds a number, and the aim is to calculate the average at every node—form a fundamental building block of multi-agent optimization methods. We review distributed averaging algorithms and their convergence theory in Sec. II. Sec. III discusses multi-agent optimization methods and theory in the setting of undirected communication networks. Sec. IV then describes how these methods can be extended to run over networks with directed connectivity via the push-sum approach. In both Secs. III and IV we limit our attention to methods for unconstrained optimization problems running in a synchronous manner. Sec. V discusses a variety of additional extensions, including how to incorporate various sorts of constraints and asynchronous methods, and it describes connections to various other branches of distributed optimization. We conclude in Sec. VI, mentioning open problems.
Before proceeding, we summarize some notation that is used throughout the rest of this paper.
A matrix $A$ is called stochastic if it is nonnegative and the sum of the elements in each row equals one. A matrix is called doubly stochastic if, additionally, the sum of the elements in each column equals one. To a stochastic matrix $A$, we will associate the directed graph $G(A)$ with vertex set $\{1, \ldots, n\}$ and edge set $E(A) = \{(i,j) \mid a_{ji} > 0\}$; that is, there is an edge from $i$ to $j$ whenever node $j$'s update places positive weight on node $i$'s value. Note that this graph might contain self-loops. For convenience, we will abuse notation slightly by using the matrix $A$ and the graph $G(A)$ interchangeably; for example, we will say that $A$ is strongly connected. Similarly, we will say that the matrix $A$ is undirected if $(i,j) \in E(A)$ implies $(j,i) \in E(A)$ (i.e., if the edge set is symmetric). Finally, we will use $[A]_{\eta}$ to denote a thresholded matrix obtained from $A$ by setting every element smaller than $\eta$ to zero.
Given a sequence of stochastic matrices $A(1), A(2), A(3), \ldots$, we will use $A(t:s)$ to denote the product of elements $s$ through $t$ inclusive, i.e.,
$$A(t:s) = A(t) A(t-1) \cdots A(s), \qquad t \ge s.$$
We will say that a matrix sequence $A(1), A(2), \ldots$ is $B$-strongly-connected if the graph with vertex set $\{1, \ldots, n\}$ and edge set
$$E_k = \bigcup_{t = kB + 1}^{(k+1)B} E\big(A(t)\big)$$
is strongly connected for each $k \ge 0$. Intuitively, we partition the iterations into consecutive blocks of length $B$, and the sequence is $B$-strongly-connected when the graph obtained by unioning the edges within each block is always strongly connected. When the graph sequence is undirected, we will simply say that it is $B$-connected.
The out-neighbors of node $i$ at iteration $t$ refers to the set of nodes that can receive messages from it,
$$N_i^{\mathrm{out}}(t) = \{ j \mid (i,j) \in E(A(t)) \},$$
and similarly, the in-neighbors of $i$ at iteration $t$ are the nodes from which $i$ receives messages,
$$N_i^{\mathrm{in}}(t) = \{ j \mid (j,i) \in E(A(t)) \}.$$
When the graph is not time-varying, we will simply refer to the out-neighbors $N_i^{\mathrm{out}}$ and in-neighbors $N_i^{\mathrm{in}}$. When the graph is undirected, the sets of in-neighbors and out-neighbors are identical, so we will simply refer to the neighbors $N_i$ of node $i$, or $N_i(t)$ in the time-varying setting. The out-degree of node $i$ at iteration $t$ is defined as the cardinality of $N_i^{\mathrm{out}}(t)$ and is denoted by $d_i^{\mathrm{out}}(t)$. Similarly, $d_i^{\mathrm{in}}(t)$, $d_i^{\mathrm{out}}$, $d_i^{\mathrm{in}}$, $d_i(t)$, and $d_i$ denote the cardinalities, respectively, of $N_i^{\mathrm{in}}(t)$, $N_i^{\mathrm{out}}$, $N_i^{\mathrm{in}}$, $N_i(t)$, and $N_i$.
II Consensus over Undirected and Directed Graphs
This section reviews methods for distributed averaging that will form a key building block in our subsequent discussion of methods for multi-agent optimization.
II-A Preliminaries: Results for averaging
We begin by examining the linear consensus process defined as
$$x(t+1) = A(t)\, x(t), \qquad (2)$$
where the matrices $A(t)$ are stochastic, and the initial vector $x(0)$ is given. Various forms of Eq. (2) can be implemented in a decentralized multi-agent setting, and these form the backbone of many distributed algorithms.
For example, consider a collection of $n$ nodes interconnected in a directed graph, and suppose node $i$ holds the $i$'th coordinate $x_i(t)$ of the vector $x(t)$. Consider the following update rule: at step $t$, node $i$ broadcasts the value $x_i(t)$ to its out-neighbors, receives values from its in-neighbors, and sets $x_i(t+1)$ to be the average of the messages it has received, so that
$$x_i(t+1) = \frac{1}{d_i^{\mathrm{in}}(t)} \sum_{j \in N_i^{\mathrm{in}}(t)} x_j(t).$$
This is sometimes called the equal neighbor iteration, and by stacking up the variables $x_i(t)$ into the vector $x(t)$ it can be written in the form of Eq. (2) with an appropriate choice for the matrix $A(t)$.
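A minimal simulation of the equal neighbor iteration follows (the four-node directed graph and initial values are illustrative, not from the article). Each node averages the values received from its in-neighbors, including its own value via a self-loop. The resulting matrix is row stochastic but generally not doubly stochastic, so the nodes agree on a common value that need not equal the initial average.

```python
import numpy as np

# In-neighbor lists (who node i listens to), each including a self-loop.
# The directed cycle 0 -> 1 -> 2 -> 3 -> 0 plus the extra edge 0 -> 2
# makes the graph strongly connected.
in_nbrs = {0: [0, 3], 1: [1, 0], 2: [2, 1, 0], 3: [3, 2]}
n = 4
A = np.zeros((n, n))
for i, nbrs in in_nbrs.items():
    for j in nbrs:
        A[i, j] = 1.0 / len(nbrs)  # equal weight on each received value

x0 = np.array([1.0, 2.0, 3.0, 4.0])
x = x0.copy()
for _ in range(300):
    x = A @ x  # the linear consensus process of Eq. (2)
```

After enough iterations the spread of `x` is essentially zero: the nodes reach consensus, even though here the limit is a weighted (not arithmetic) average of the initial values.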
Over undirected graphs, one popular choice of update rule is to set
$$x_i(t+1) = x_i(t) + \epsilon \sum_{j \in N_i(t)} \big( x_j(t) - x_i(t) \big),$$
where $\epsilon > 0$ is sufficiently small. Unfortunately, finding an appropriate choice of $\epsilon$ to guarantee convergence of this iteration can be bothersome (especially when the graphs are time-varying), and it generally requires knowing an upper bound on the degrees of nodes in the network.
Another possibility (when the underlying graphs are undirected) is the so-called Metropolis update
$$x_i(t+1) = x_i(t) + \sum_{j \in N_i(t)} \frac{x_j(t) - x_i(t)}{\max\big( d_i(t), d_j(t) \big) + 1}. \qquad (3)$$
The Metropolis update requires node $i$ to broadcast the value $x_i(t)$ and its degree $d_i(t)$ to its neighbors. As we will see later, this update possesses some nice convergence properties. Observe that the Metropolis update of Eq. (3) can be written in the form of Eq. (2) where the matrices $A(t)$ are doubly stochastic.
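The Metropolis weights can be assembled locally from neighbor degrees. The sketch below (the path graph and initial values are illustrative) builds the weight matrix of Eq. (3), checks that it is doubly stochastic, and iterates it to drive every node to the exact initial average.

```python
import numpy as np

# Undirected path graph on 5 nodes: edges {0-1, 1-2, 2-3, 3-4}.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
n = 5
deg = np.zeros(n, dtype=int)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

W = np.zeros((n, n))
for i, j in edges:
    w = 1.0 / (max(deg[i], deg[j]) + 1)  # Metropolis weight of Eq. (3)
    W[i, j] = W[j, i] = w
for i in range(n):
    W[i, i] = 1.0 - W[i].sum()           # self-weight makes each row sum to 1

x0 = np.array([4.0, 0.0, 1.0, 7.0, 3.0])
x = x0.copy()
for _ in range(400):
    x = W @ x
```

Because `W` is symmetric with unit row sums, it is doubly stochastic, and the iteration converges to the arithmetic mean of the initial values at every node.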
Under certain technical conditions, the iteration of Eq. (2) results in consensus, meaning that all of the $x_i(t)$ (for $i = 1, \ldots, n$) approach the same value as $t \to \infty$. We describe one such condition next. The key properties needed to ensure asymptotic consensus are that the matrices $A(t)$ should exhibit sufficient connectivity and aperiodicity (in the long term). In the following, we use the shorthand $G(t)$ for $G(A(t))$, the graph corresponding to the matrix $A(t)$. The starting point of our analysis is the following assumption.
Assumption 1. The sequence of directed graphs $G(t)$ is $B$-strongly-connected. Moreover, each graph $G(t)$ has a self-loop at every node.
As the next theorem shows, a variation on this assumption suffices to ensure that the update of Eq. (2) converges to consensus.
Theorem 1. Suppose the sequence of stochastic matrices $A(t)$ has the property that there exists an $\eta > 0$ such that the sequence of graphs $G([A(t)]_{\eta})$ satisfies Assumption 1. Then $x(t)$ converges to a limit of the form $c\mathbf{1}$, i.e., consensus is achieved, and the convergence is geometric. Moreover, if all the matrices $A(t)$ are doubly stochastic then for all $i$,
$$\lim_{t \to \infty} x_i(t) = \frac{1}{n} \sum_{j=1}^{n} x_j(0).$$
We now turn to the proof of this theorem, which, while being reasonably short, still builds on a sequence of preliminary lemmas and definitions which we present first. Given a sequence of directed graphs $G(s), G(s+1), \ldots$, we say that node $b$ is reachable from node $a$ in time period $s$ to $t$ if there exists a sequence of directed edges $e_s, e_{s+1}, \ldots, e_{t-1}$ such that: (i) $e_k$ is present in $G(k)$ for all $k = s, \ldots, t-1$, (ii) the origin of $e_s$ is $a$, (iii) the destination of $e_{t-1}$ is $b$, and the destination of each $e_k$ is the origin of $e_{k+1}$. Note that this is the same as stating that $[A(t-1) \cdots A(s)]_{ba} > 0$ if the matrices $A(k)$ are nonnegative with $[A(k)]_{ij} > 0$ if and only if $(j,i)$ belongs to $G(k)$. We use $R_a(s:t)$ to denote the set of nodes reachable from node $a$ in time period $s$ to $t$.
The first lemma discusses the implications of Assumption 1 for products of the matrices .
Lemma 2. Suppose $A(1), A(2), \ldots$ is a sequence of nonnegative matrices with the property that there exists $\eta > 0$ such that the sequence of graphs $G([A(t)]_{\eta})$ satisfies Assumption 1. Then for any integer $k \ge 0$, the product $A\big((k+1)nB : knB + 1\big)$ is a strictly positive matrix. In fact, every entry of it is at least $\eta^{nB}$.
Consider the set of nodes reachable from node $a$ in time period $knB+1$ to $t$ in the graph sequence $G([A(t)]_{\eta})$. Because each of these graphs has a self-loop at every node by Assumption 1, the reachable set never decreases, i.e.,
$$R_a(knB+1 : t) \subseteq R_a(knB+1 : t+1).$$
A further immediate consequence of Assumption 1 is that if $R_a(knB+1 : t)$ does not yet contain all nodes, then within the next $B$ time steps it becomes strictly larger, because during those times there is an edge in some $G([A(t)]_{\eta})$ leading from the set of nodes already reachable from $a$ to those not already reachable from $a$. Putting together these two properties, we obtain that from time $knB+1$ to $(k+1)nB$ every node is reachable from $a$, i.e.,
$$R_a\big(knB+1 : (k+1)nB\big) = \{1, \ldots, n\}.$$
But since every non-zero entry of $[A(t)]_{\eta}$ is at least $\eta$ by construction, this implies that every entry of $A\big((k+1)nB : knB+1\big)$ is at least $\eta^{nB}$, and the lemma is proved. ∎
Lemma 2 tells us that, over sufficiently long horizons, products of the matrices have entries bounded away from zero. The next lemma discusses what multiplication by such a matrix does to the spread of a vector.
Lemma 3. Suppose $A$ is a stochastic matrix, every entry of which is at least $\eta$. If $y = Ax$, then
$$\max_i y_i - \min_i y_i \le (1 - 2\eta)\left( \max_i x_i - \min_i x_i \right).$$
Without loss of generality, let us assume that the largest entry of $x$ is $1$ and the smallest entry of $x$ is $0$. Then, for each $i$,
$$\eta \le y_i = \sum_j a_{ij} x_j \le 1 - \eta,$$
so that for any pair of indices $i, i'$, we have
$$y_i - y_{i'} \le 1 - 2\eta. \qquad \square$$
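A quick numerical check of Lemma 3 (the matrix and vector are illustrative): every entry of the stochastic matrix below is at least $\eta = 0.1$, and multiplying by it shrinks the spread of a vector by at least the factor $1 - 2\eta$.

```python
import numpy as np

eta = 0.1
n = 4
# Stochastic matrix with every entry >= eta: mix the all-eta matrix with I.
A = eta * np.ones((n, n)) + (1 - n * eta) * np.eye(n)

rng = np.random.default_rng(1)
x = rng.normal(size=n)
y = A @ x

spread = lambda v: v.max() - v.min()
```

Here `spread(y) <= (1 - 2*eta) * spread(x)`, exactly as the lemma predicts.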
With these two lemmas in place, we are ready to prove Theorem 1.
Proof of Theorem 1.
Define the spread $S(t) = \max_i x_i(t) - \min_i x_i(t)$, and recall from Lemma 2 that every entry of $A\big((k+1)nB : knB+1\big)$ is at least $\eta^{nB}$ for all $k \ge 0$. Applying Lemma 3 gives that
$$S\big((k+1)nB + 1\big) \le \left(1 - 2\eta^{nB}\right) S\big(knB + 1\big).$$
Applying this recursively, we obtain that $S(knB + 1) \le \left(1 - 2\eta^{nB}\right)^{k} S(1)$ for all $k$, so the spread decays geometrically.
To obtain further that every $x_i(t)$ converges, it suffices to observe that $x_i(t+1)$ lies in the convex hull of the values $x_j(t)$ for $j = 1, \ldots, n$. Finally, since each $A(t)$ is doubly stochastic,
$$\mathbf{1}^{\top} x(t+1) = \mathbf{1}^{\top} A(t) x(t) = \mathbf{1}^{\top} x(t),$$
where $\mathbf{1}$ denotes a vector with all entries equal to one, and thus all $x_i(t)$ must converge to the initial average. ∎
A potential shortcoming of the proof of Theorem 1 is that the convergence time bounds it leads to tend to scale poorly in the number of nodes $n$. We can overcome this shortcoming as illustrated in the following propositions. These results apply to a much narrower class of scenarios, but they tend to provide more effective bounds when they are applicable.
The first step is to introduce a precise notion of convergence time. Let $T_n(\epsilon)$ denote the first time $t$ when
$$\left\| x(t) - \bar{x}\mathbf{1} \right\|_2 \le \epsilon \left\| x(0) - \bar{x}\mathbf{1} \right\|_2, \qquad \text{where } \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i(0).$$
In other words, the convergence time is defined as the time until the deviation from the mean shrinks by a factor of $\epsilon$. The convergence time is a function of the desired accuracy $\epsilon$ and of the underlying sequence of matrices. In particular, we emphasize the dependence on the number of nodes, $n$. When the sequence of matrices is clear from context, we will simply write $T_n(\epsilon)$.
Proposition 4. Suppose $x(t+1) = A(t) x(t)$, where each $A(t)$ is a doubly stochastic matrix. Then
$$\left\| x(t) - \bar{x}\mathbf{1} \right\|_2 \le \left( \sup_{s} \sigma_2\big(A(s)\big) \right)^{t} \left\| x(0) - \bar{x}\mathbf{1} \right\|_2,$$
where $\sigma_2(A)$ denotes the second-largest singular value of the matrix $A$.
We skip the proof, which follows quickly from the definition of singular values.
We adopt the slightly non-standard notation
$$\sigma = \sup_{t \ge 1} \sigma_2\big(A(t)\big), \qquad (4)$$
so that the previous proposition can be conveniently restated as
$$\left\| x(t) - \bar{x}\mathbf{1} \right\|_2 \le \sigma^{t} \left\| x(0) - \bar{x}\mathbf{1} \right\|_2.$$
Recalling that $\sigma \le 1$, a consequence of this equation is that
$$T_n(\epsilon) = O\left( \frac{\ln(1/\epsilon)}{1 - \sigma} \right),$$
so the number $\frac{1}{1-\sigma}$ provides an upper bound on the convergence rate of distributed averaging.
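The bound of Proposition 4 is easy to check numerically. The sketch below (the graph choice is illustrative) builds the Metropolis matrix of a path graph, computes its second-largest singular value $\sigma_2$, and verifies at every step that the deviation from the mean decays at least as fast as $\sigma_2^t$.

```python
import numpy as np

# Metropolis matrix of a path graph on n nodes (doubly stochastic).
n = 10
W = np.zeros((n, n))
deg = [1] + [2] * (n - 2) + [1]
for i in range(n - 1):
    w = 1.0 / (max(deg[i], deg[i + 1]) + 1)
    W[i, i + 1] = W[i + 1, i] = w
W += np.diag(1.0 - W.sum(axis=1))

sigma2 = np.linalg.svd(W, compute_uv=False)[1]  # second-largest singular value

rng = np.random.default_rng(2)
x0 = rng.normal(size=n)
dev0 = np.linalg.norm(x0 - x0.mean())

x = x0.copy()
ok = True
for t in range(1, 51):
    x = W @ x
    # Proposition 4: deviation from the mean shrinks by sigma2 per step.
    ok &= np.linalg.norm(x - x0.mean()) <= sigma2 ** t * dev0 + 1e-12
```

Since `W` is doubly stochastic, the mean of `x` is preserved exactly throughout.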
In general, there is no guarantee that $\sigma < 1$, and the equations we have derived may be vacuous. Fortunately, it turns out that for the lazy Metropolis matrices on connected graphs, it is true that $\sigma < 1$, and furthermore, for many families of undirected graphs it is possible to give order-accurate estimates on $\frac{1}{1-\sigma}$, which translate into estimates of convergence time. This is captured in the following proposition. Note that all of these bounds should be interpreted as scaling laws, explaining how the convergence time increases as the network size increases, when the graphs all come from the same family.
Proposition 5. If each $A(t)$ is the lazy Metropolis matrix on the …
…path graph, then .
…-dimensional grid, then .
…-dimensional grid, then
…star graph, then .
…two-star graphs (a two-star graph is composed of two star graphs with a link connecting their centers), then .
…complete graph, then .
…expander graph, then .
…Erdős-Rényi random graph (an Erdős-Rényi random graph with $n$ nodes and parameter $p$ has a symmetric adjacency matrix whose distinct off-diagonal entries are independent Bernoulli random variables taking the value 1 with probability $p$; in this article we focus on the case where $p = \frac{c \ln n}{n}$ with $c > 1$, for which it is known that the random graph is connected with high probability), then with high probability (a statement is said to hold "with high probability" if the probability of it holding approaches $1$ as $n \to \infty$; in this context, $n$ is the number of nodes of the underlying graph).
…geometric random graph (a geometric random graph is one where $n$ nodes are placed uniformly and independently in the unit square and two nodes are connected with an edge if their distance is at most $r$; in this article we focus on the case where $r$ is on the order of $\sqrt{\ln n / n}$, for which it is known that the random graph is connected with high probability), then with high probability.
…any connected undirected graph, then $\frac{1}{1-\sigma} = O(n^2)$.
Sketch of the proof.
The quantity $\frac{1}{1-\sigma}$ can be bounded in terms of the largest hitting time of the Markov chain whose probability transition matrix is the lazy Metropolis matrix. We thus only need to bound hitting times on the graphs in question, and these are now standard exercises. For example, the fact that the hitting time on the path graph is quadratic in the number of nodes is essentially the main finding of the standard "gambler's ruin" exercise. Corresponding results are available for 2-dimensional and 3-dimensional grids. Hitting times on star, two-star, and complete graphs are elementary exercises. The result for an expander graph is a consequence of Cheeger's inequality. For Erdős-Rényi graphs the result follows because such graphs are expanders with high probability. For geometric random graphs a bound can be obtained by partitioning the unit square into appropriately-sized regions, thereby reducing to the case of a 2-d grid. Finally, the bound for connected graphs is from [21, 18]. ∎
Fig. 1 depicts examples of some of the graphs discussed in Proposition 5. Clearly the network structure affects the time it takes information to diffuse across the graph. For graphs such as the path or 2-d grid, the dependence on $n$ is intuitively related to the long time it takes information to spread across the network. For other graphs, such as stars, the dependence is due to the central node (i.e., the "hub") becoming a bottleneck. For such graphs this dependence is strongly related to the fact that we have focused on the Metropolis scheme for designing the entries of the matrices $A(t)$. Because the hub has a much higher degree than the other nodes, the resulting Metropolis updates lead to very small changes and hence slower convergence (i.e., $A(t)$ is diagonally dominant); see Eq. (3). In general, for undirected graphs in which neighboring nodes may have very different degrees, it is known that faster rates can be achieved by using linear iterations of the form Eq. (2), where $A$ is optimized for the particular graph topology [22, 23]. However, unlike using the Metropolis weights—which can be implemented locally by having neighboring nodes exchange their degrees—determining the optimal matrices involves solving a separate network-wide optimization problem; a distributed approximation algorithm is also available.
On the other hand, the algorithm is evidently fast on certain graphs. For the complete graph (where every node is directly connected to every other node), this is not surprising—since in this case the Metropolis matrix equals $\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$, the average is computed exactly at every node after a single iteration. Expander graphs can be seen as sparse approximations of the complete graph (sparse here is in the sense of having many fewer edges) which approximately preserve the spectrum, and hence the hitting time. In applications where one has the option to design the network, expanders are of particular practical interest since they allow fast rates of convergence—hence, few iterations—while also having relatively few links—so each iteration requires few transmissions and is thus fast to implement.
II-B Worst-case scaling of distributed averaging
One might wonder about the worst-case complexity of average consensus: how long does it take to get close to the average on any graph? Initial bounds were exponential in the number of nodes [1, 2, 3, 25]. However, Proposition 5 tells us that the convergence time is at most $O\big(n^2 \ln(1/\epsilon)\big)$ using a Metropolis matrix. A recent result shows that if all the nodes know an upper bound on the total number of nodes which is reasonably accurate, this convergence time can be brought down by an order of magnitude. This is a consequence of the following theorem.
Suppose each node in an undirected connected graph implements the update
where $U$ is an upper bound on the number of nodes ($U \ge n$). Then
where $\bar{x}$ is the initial average.
Thus if every node knows the upper bound $U$, the above theorem tells us that the convergence time until every element of the vector $x(t)$ is at most $\epsilon$ away from the initial average is $O\big(U \ln(1/\epsilon)\big)$. In the event that $U$ is within a constant factor of $n$ (e.g., $U \le cn$ for some constant $c$), this turns out to be linear in the number of nodes $n$. One situation in which this is possible is if nodes precisely know the number of nodes in the network, in which case they can simply set $U = n$. However, this scheme is also useful in a number of settings where the exact number of nodes in the system is not known (e.g., if nodes fail) as long as approximate knowledge of the total number of nodes is available.
Intuitively, Eq. (7) takes a lazy Metropolis update and accelerates it by adding an extrapolation term. Strategies of this form are known as over-relaxation in the numerical analysis literature and as Nesterov acceleration in the optimization literature. On a non-technical level, the extrapolation speeds up convergence by reducing the inherent oscillations in the underlying sequence. A key feature, however, is that the degree of extrapolation must be carefully chosen, which is where knowledge of the bound $U$ is required. It is, at present, an open question whether any improvement on the quadratic convergence time of Proposition 5 is possible without such an additional assumption, or whether a linear convergence time scaling can be obtained for time-varying graphs.
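Since Eq. (7) is abbreviated above, the sketch below shows the general shape of such an accelerated scheme: a lazy Metropolis step followed by an extrapolation with momentum parameter $\beta = 1 - 2/(9U+1)$, the choice suggested in the linear-time consensus literature. The graph and values are illustrative; this is a sketch under those assumptions, not a verbatim transcription of Eq. (7).

```python
import numpy as np

# Lazy Metropolis matrix of a path graph on n nodes: W_lazy = (I + W)/2.
n = 15
W = np.zeros((n, n))
deg = [1] + [2] * (n - 2) + [1]
for i in range(n - 1):
    w = 1.0 / (max(deg[i], deg[i + 1]) + 1)
    W[i, i + 1] = W[i + 1, i] = w
W += np.diag(1.0 - W.sum(axis=1))
W_lazy = 0.5 * (np.eye(n) + W)

U = n                            # assume the bound on the node count is exact
beta = 1.0 - 2.0 / (9 * U + 1)   # extrapolation parameter (assumed form)

rng = np.random.default_rng(3)
x = rng.normal(size=n)
target = x.mean()
y_prev = x.copy()
for _ in range(3000):
    y = W_lazy @ x                   # plain lazy Metropolis step
    x = y + beta * (y - y_prev)      # extrapolation (momentum) step
    y_prev = y
```

Because the lazy Metropolis matrix is doubly stochastic and the extrapolation is initialized with `y_prev = x`, the network-wide average is preserved at every step, so the iterates converge to the exact initial average.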
III Decentralized optimization over undirected graphs
We now shift our attention from distributed averaging back to the problem of optimization. We begin by describing the (centralized) subgradient method, which is one of the most basic algorithms used in convex optimization.
III-A The subgradient method
To avoid confusion in the sequel when we discuss distributed optimization methods, here we consider a serial method for minimizing a convex function $f : \mathbb{R}^d \to \mathbb{R}$. A vector $g$ is called a subgradient of $f$ at the point $x$ if
$$f(y) \ge f(x) + g^{\top}(y - x) \qquad \text{for all } y.$$
If the function $f$ is continuously differentiable, then the gradient $\nabla f(x)$ is the only subgradient at $x$. However, subgradients are also well-defined over the entire domain of a convex function, even if it is not continuously differentiable, and in general, there are multiple subgradients at points where the function is not differentiable.
The subgradient method for minimizing the function $f$ is defined as the iteration
$$x(t+1) = x(t) - \alpha(t)\, g(t),$$
where $g(t)$ is a subgradient of the function $f$ at the point $x(t)$ and $\alpha(t) > 0$ is a step-size.
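As a concrete illustration (the objective and step-sizes are chosen for the example, not taken from the article), the sketch below runs the subgradient method on the nondifferentiable function $f(x) = |x - 3|$ with the decaying step-size $\alpha(t) = 1/(t+1)$; the iterate approaches the minimizer and oscillates around it with shrinking amplitude.

```python
import numpy as np

def subgrad(x):
    # A valid subgradient of f(x) = |x - 3|: sign(x - 3), with 0 allowed at x = 3.
    return np.sign(x - 3.0)

x = 0.0
for t in range(2000):
    alpha = 1.0 / (t + 1)  # square summable, but the sum itself diverges
    x = x - alpha * subgrad(x)
```

After the iterate first crosses the minimizer, its distance from $x^* = 3$ is bounded by the current step-size, which tends to zero.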
There are standard results on the convergence of the subgradient method. Intuitively, while the negative subgradient need not be a descent direction for $f$, a step in the direction opposite to the subgradient pulls the iterate closer to the set of global minimizers provided the step-size is small enough. If it is hard to know what step-size to take, then one can always take a step-size sequence that decays to zero at a suitable rate so that the method comes close to the minimizer after enough steps. This intuition is captured by the following theorem (see, e.g., standard convex optimization texts for a proof).
Theorem 7. Let $X^*$ be the set of minimizers of the function $f$. Assume that (i) $f$ is convex, (ii) $X^*$ is nonempty, (iii) $\|g\|_2 \le L$ for all subgradients $g$ of the function $f$, and (iv) the nonnegative step-size sequence $\alpha(t)$ is "square summable but not summable," i.e.,
$$\sum_{t=1}^{\infty} \alpha(t) = \infty, \qquad \sum_{t=1}^{\infty} \alpha(t)^2 < \infty.$$
Then, for any initial condition $x(0)$, we have
$$\lim_{t \to \infty} \operatorname{dist}\big( x(t), X^* \big) = 0.$$
Furthermore, if the subgradient method is run for $T$ steps under the above assumptions, the (constant) choice of stepsize $\alpha(t) = \frac{1}{\sqrt{T}}$ for all $t$ yields
$$f\left( \frac{1}{T} \sum_{t=1}^{T} x(t) \right) - \min_x f(x) \le \frac{\operatorname{dist}\big(x(0), X^*\big)^2 + L^2}{2\sqrt{T}}.$$
Subgradients have similar properties to gradients for convex functions, and, like gradients, they are additive. (This follows directly from the definition of a subgradient.) Thus, if $g_1$ is a subgradient of a function $f_1$ at $x$ and $g_2$ is a subgradient of $f_2$ at $x$, then $g_1 + g_2$ is a subgradient of $f_1 + f_2$ at $x$.
III-B Decentralizing the subgradient method
We now return to the problem of distributed optimization. To recap, we have $n$ nodes, interconnected in a time-varying network capturing which pairs of nodes can exchange messages. For now, assume that these networks are all undirected. (This is relaxed in Section IV, which considers directed graphs.) Node $i$ knows the convex function $f_i$ and the nodes would like to minimize the function
$$\bar{f}(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$$
in a distributed way. If all the functions $f_i$ were available at a single location, we could directly apply the subgradient method to their average $\bar{f}$:
$$x(t+1) = x(t) - \frac{\alpha(t)}{n} \sum_{i=1}^{n} g_i(t),$$
where $g_i(t)$ is a subgradient of the function $f_i$ at $x(t)$. Unfortunately, this is not a distributed method under our assumptions, since only node $i$ knows the function $f_i$, and thus only node $i$ can compute a subgradient of $f_i$.
A decentralized subgradient method solves this problem by interpolating between the subgradient method and an average consensus scheme from Section II. In this scheme, node $i$ maintains the variable $x_i(t)$, which is updated as
$$x_i(t+1) = \sum_{j=1}^{n} a_{ij}(t)\, x_j(t) - \alpha(t)\, g_i(t), \qquad (10)$$
where $g_i(t)$ is a subgradient of $f_i$ at $x_i(t)$. Note that this update is decentralized in the sense that node $i$ only requires local information to execute it. In the case when $d = 1$, the quantities $x_i(t)$ are scalars and we can write this as
$$x(t+1) = A(t)\, x(t) - \alpha(t)\, g(t),$$
where the vector $x(t)$ stacks up the $x_i(t)$ and $g(t)$ stacks up the $g_i(t)$. The weights $a_{ij}(t)$ should be chosen by each node in a distributed way. Later within this section, we will assume that the matrices $A(t)$ are doubly stochastic; perhaps the easiest way to achieve this is to use the Metropolis iteration of Eq. (3).
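The update of Eq. (10) is easy to simulate. In the sketch below the network, objectives, and step-sizes are all illustrative: five nodes on a path graph with Metropolis weights, and scalar quadratics $f_i(x) = \frac{1}{2}(x - \theta_i)^2$, whose average is minimized at the mean of the $\theta_i$. (Quadratics do not have globally bounded subgradients, so this illustrates the update itself rather than the assumptions of the theorem below.) Each node mixes with its neighbors and then takes a local gradient step.

```python
import numpy as np

# Metropolis weights on a path graph with 5 nodes (doubly stochastic).
n = 5
A = np.zeros((n, n))
deg = [1, 2, 2, 2, 1]
for i in range(n - 1):
    w = 1.0 / (max(deg[i], deg[i + 1]) + 1)
    A[i, i + 1] = A[i + 1, i] = w
A += np.diag(1.0 - A.sum(axis=1))

theta = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # node i holds f_i(x) = 0.5*(x - theta_i)^2
x = np.zeros(n)                               # same initial value at every node

for t in range(4000):
    grad = x - theta                      # local gradient of f_i at x_i
    x = A @ x - (1.0 / (t + 2)) * grad    # Eq. (10): mix with neighbors, then step
```

With the decaying step-size, the nodes simultaneously reach consensus and drive the common value toward the minimizer of the average objective, `theta.mean()`.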
Intuitively, Eq. (10) pulls the value at each node in two directions: on the one hand towards the minimizer (via the subgradient term), and on the other hand towards neighboring nodes (via the averaging term). The step-size $\alpha(t)$ should be chosen to decay to zero, so that in the limit the consensus term will prevail. However, if the rate at which it decays to zero is slow enough, then under appropriate conditions consensus will be achieved on the global minimizer of $\bar{f}$.
We now turn to the analysis of Eq. (10). For simplicity, we make the assumption that all the functions $f_i$ are from $\mathbb{R}$ to $\mathbb{R}$; this simplifies the presentation but otherwise has no effect on the results. The same analysis extends in a straightforward manner to functions on $\mathbb{R}^d$ with $d > 1$ but requires more cumbersome notation.
Theorem 8. Let $X^*$ denote the set of minimizers of the function $\bar{f}$. We assume that: (i) each $f_i$ is convex; (ii) $X^*$ is nonempty; (iii) each function $f_i$ has the property that its subgradients at any point are bounded by the constant $L$; (iv) the matrices $A(t)$ are doubly stochastic and there exists some $\eta > 0$ such that the graph sequence $G([A(t)]_{\eta})$ satisfies Assumption 1; and (v) the initial values are the same across all nodes (e.g., $x_i(0) = 0$ for all $i$). Then:
If the step-size sequence $\alpha(t)$ is "square summable but not summable," i.e.,
$$\sum_{t=1}^{\infty} \alpha(t) = \infty, \qquad \sum_{t=1}^{\infty} \alpha(t)^2 < \infty,$$
then for any initial condition, we have that for all $i$,
$$\lim_{t \to \infty} \operatorname{dist}\big( x_i(t), X^* \big) = 0.$$
If we run the protocol for $T$ steps with (constant) step-size $\alpha(t) = \frac{1}{\sqrt{T}}$, and with the notation $\widehat{y}(T) = \frac{1}{T}\sum_{t=1}^{T} \bar{x}(t)$, where $\bar{x}(t)$ denotes the average of the $x_i(t)$ over the nodes, then the suboptimality $\bar{f}\big(\widehat{y}(T)\big) - \min_x \bar{f}(x)$ obeys an $O(1/\sqrt{T})$ bound of the same form as in Theorem 7, inflated by a network-dependent factor involving the quantity $\sigma$ defined by Eq. (4).
We remark that the quantity $\widehat{y}(T)$ on which the suboptimality bound is proved can be computed via an average consensus protocol after the protocol is finished, if node $i$ keeps track of the running average $\frac{1}{T}\sum_{t=1}^{T} x_i(t)$.
Comparing Theorems 7 and 8, and ignoring the similar terms involving the initial conditions, we see that the convergence bound gets multiplied by a factor that grows with $\frac{1}{1-\sigma}$. This term may be thought of as measuring the "price of decentralization" resulting from having knowledge of the objective function distributed throughout the network rather than available entirely at one place.
We can use Proposition 5 to translate this into concrete convergence times on various families of graphs, as the next result shows. For $\epsilon > 0$, let us define the $\epsilon$-convergence time $T_n(\epsilon)$ to be the first time at which the suboptimality gap of Theorem 8 is guaranteed to be at most $\epsilon$. Naturally, the convergence time will depend on $\epsilon$ and on the underlying sequence of matrices/graphs.
where if the graphs are
…path graphs, then ;
…-dimensional grids, then ;
…-dimensional grids, then ;
…complete graphs, then ;
…expander graphs, then ;
…star graphs, then ;
…two-star graphs, then ;
…Erdős-Rényi random graphs, then ;
…geometric random graphs, then ;
…any connected undirected graph, then .
We remark that it is possible to decrease the scaling from to in the above corollary if the constant , the type of the underlying graph (e.g., star graph, path graph) and the number of nodes is known to all nodes. Indeed, this can be achieved by setting the stepsize and using a hand-optimized (which will depend on , , as well as the kind of underlying graphs). We omit the details but this is very similar to the optimization done in .
We now turn to the proof of Theorem 8. We will need two preliminary lemmas covering some background in optimization.
Suppose is a convex function such that has subgradients at the points , respectively, satisfying and . Then
Our overall proof strategy is to view the distributed optimization protocol of Eq. (10) as a kind of perturbed consensus process. To that end, the next lemma extends our previous analysis of the consensus process to deal with perturbations.
For convenience, let us introduce the notation
we have that
Now using the fact that the vectors and have mean zero, by Proposition 4 we have
This equation immediately implies the first claim of the lemma.
Proof of Theorem 8.
Recall that we are assuming, for simplicity, that the functions $f_i$ are from $\mathbb{R}$ to $\mathbb{R}$. As before, let $\bar{x}(t)$ be the average of the entries of the vector $x(t)$, i.e.,
$$\bar{x}(t) = \frac{1}{n} \sum_{i=1}^{n} x_i(t).$$
Since the matrices $A(t)$ are doubly stochastic, we have $\mathbf{1}^{\top} A(t) = \mathbf{1}^{\top}$, so that