I Introduction
In this paper we consider a class of algorithms for nonconvex optimization in distributed multiagent systems and prove convergence to the set of global optima. Recent years have seen a surge in research interest in nonconvex optimization, motivated, to a large degree, by emerging applications in machine learning and artificial intelligence. The majority of research in this area has focused on centralized computing frameworks in which memory and processing resources are either shared or coordinated by a central mechanism [1, 2, 3, 4, 5, 6, 7, 8, 9].
With the advent of the internet of things (IoT) and low-latency 5G communication networks, there is a growing trend towards storing and processing data at the “edge” of the network (e.g., directly on IoT devices) rather than processing data in the cloud. This necessitates algorithms that are able to operate robustly in ad hoc networked environments without centralized coordination. Beyond applications in IoT, distributed algorithms for nonconvex optimization also play an important role in other domains, including power systems [10], sensor networks [11], unmanned aerial vehicles [12], and wireless communications [13].
This paper considers the following distributed computation framework: A group of $N$ agents (or nodes) communicates over a (possibly random, possibly sparse) undirected communication graph $G$. Each agent $n$ has a local objective function $U_n : \mathbb{R}^d \to \mathbb{R}$. We are interested in distributed algorithms that optimize the sum function
(1) $U(x) = \sum_{n=1}^{N} U_n(x)$
using only local neighborhood information exchange between agents and without any centralized coordination.
As an example, in the context of distributed risk minimization or probably approximately correct (PAC) learning, e.g., [14], the $U_n$'s may correspond to the (expected) risk
$U_n(x) = \mathbb{E}_{y \sim \mathcal{D}_n}[\ell_n(x, y)]$,
where $\ell_n$ is the local loss function at agent $n$ and $\mathcal{D}_n$ is the local data distribution. The agents are interested in learning a common “hypothesis,” parameterized by $x$, using their collective data.
Distributed optimization algorithms have been studied extensively when the objective functions are convex [15, 16, 17, 18, 19, 20]. The nonconvex case is far less developed: the majority of current work in this area focuses on demonstrating convergence of distributed algorithms to critical points of $U$ (not necessarily to minima, local or global).
This motivates us to consider a class of distributed algorithms for computing the global optima of (1). Our algorithms take the form:
(2) $x_n(t+1) = x_n(t) - \beta_t \sum_{l \in \Omega_n(t)} \big(x_n(t) - x_l(t)\big)$
(3) $\qquad\qquad - \alpha_t \big(\nabla U_n(x_n(t)) + \zeta_n(t)\big) + \gamma_t w_n(t)$,
$n = 1, \ldots, N$, where $x_n(t) \in \mathbb{R}^d$ is the state of agent $n$ at iteration $t$, $\Omega_n(t)$ denotes the set of agents neighboring agent $n$ at time $t$ (per the communication graph), $\{\alpha_t\}$ and $\{\beta_t\}$ are sequences of decaying weight parameters, $\{\gamma_t\}$ is a sequence of decaying annealing weights, $\zeta_n(t)$ is a $d$-dimensional random variable (representing gradient noise), and $w_n(t)$ is a $d$-dimensional Gaussian noise (introduced for annealing). The algorithm is distributed since in (2) each agent only knows its local function and accesses information on the state of neighboring agents.
Algorithm (2) may be viewed as a distributed consensus + innovations algorithm [21]. The algorithm consists of the consensus term, $-\beta_t \sum_{l \in \Omega_n(t)} (x_n(t) - x_l(t))$, that encourages agreement among agents, and the innovation term, $-\alpha_t (\nabla U_n(x_n(t)) + \zeta_n(t))$, that encourages each agent to follow the gradient descent direction of its local objective function (with $\zeta_n(t)$ being zero-mean gradient noise). Finally, the term $\gamma_t w_n(t)$ is an annealing term that injects decaying Gaussian noise into the dynamics to destabilize local minima and saddle points. By appropriately controlling the decay rates of the parameter sequences, one can balance the various objectives of reaching consensus among agents, reaching a critical point of (1), and destabilizing local minima and saddle points (see Assumption 6).
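The update described above, a consensus term, a noisy gradient (innovation) term, and a decaying Gaussian annealing term, can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: scalar agent states, a fixed ring graph, a hypothetical double-well local objective, and hand-picked weight sequences. None of these specific choices come from the paper.

```python
import math
import random


def local_gradient(x, center):
    # Gradient of a hypothetical nonconvex local objective
    # U_n(x) = (x^2 - 1)^2 + 0.5 * (x - center)^2.
    return 4.0 * x * (x * x - 1.0) + (x - center)


def run_annealed_consensus(num_agents=5, steps=5000, seed=0):
    rng = random.Random(seed)
    # Fixed ring graph (an assumption; the text allows random, time-varying graphs).
    neighbors = [((n - 1) % num_agents, (n + 1) % num_agents) for n in range(num_agents)]
    centers = [0.5 + n / (num_agents - 1) for n in range(num_agents)]
    x = [rng.uniform(-1.5, 1.5) for _ in range(num_agents)]
    for t in range(1, steps + 1):
        s = t + 10  # shift so that log(log(s)) is positive
        alpha = 0.5 / s                                      # innovation weight (assumed form)
        beta = 0.3 / s ** 0.4                                # consensus weight (assumed form)
        gamma = 0.5 / math.sqrt(s * math.log(math.log(s)))   # annealing weight (assumed form)
        new_x = []
        for n in range(num_agents):
            consensus = sum(x[n] - x[l] for l in neighbors[n])
            noisy_grad = local_gradient(x[n], centers[n]) + 0.1 * rng.gauss(0.0, 1.0)
            annealing = gamma * rng.gauss(0.0, 1.0)
            new_x.append(x[n] - beta * consensus - alpha * noisy_grad + annealing)
        x = new_x  # synchronous update: every agent uses the previous states
    return x


final_states = run_annealed_consensus()
```

The consensus weight is chosen to decay more slowly than the innovation weight, so that agreement among agents eventually dominates the pull of the individual local objectives.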
Our main contribution is the following: We show that, under appropriate assumptions (outlined below), the distributed algorithm (2) converges in probability to the set of global minima of (1). More precisely, it will be shown that (i) agents reach consensus almost surely (a.s.), i.e., for each $n, l$, $\lim_{t \to \infty} \|x_n(t) - x_l(t)\| = 0$ a.s., and (ii) for each agent $n$, $x_n(t)$ converges in probability to the set of global minima of $U$. A precise statement of the main result is given in Theorem 2 at the end of Section IV.
Theorem 2 is proved under Assumptions 1–11. Assumptions 1–2 and 7–11 concern the agents’ objective functions, Assumption 3 concerns the time-varying communication graph, Assumptions 4–5 concern the gradient and annealing noise, and Assumption 6 concerns the weight parameter sequences.
Related Work. Distributed optimization with convex objectives has been studied extensively; for an overview of the expansive literature in this field we refer readers to [15, 16, 17, 18, 19, 20] and references therein.
Distributed algorithms for nonconvex optimization are a subject of more recent research focus. We briefly summarize related contributions here. Reference [13] considers an algorithm for nonconvex optimization (possibly constrained) over an undirected communication graph and shows convergence to KKT points. Relevant applications to wireless ad hoc networks are discussed. Reference [22] considers a distributed primal-dual algorithm for nonconvex optimization. The primal-dual algorithm solves an approximation to the original nonconvex problem. Reference [23] analyzes the alternating direction penalty method and method of multipliers in nonconvex problems and demonstrates convergence to primal feasible points under mild assumptions. Reference [24] considers a push-sum algorithm for distributed nonconvex optimization on time-varying directed graphs and demonstrates convergence to first-order stationary points. Reference [25] considers a distributed algorithm for nonconvex optimization with smooth objective and possibly nonsmooth regularizer and demonstrates convergence to stationary solutions. Our work differs from these primarily in that we study distributed algorithms for global optimization of a nonconvex function.
The key feature of this approach is the incorporation of decaying Gaussian noise that allows the algorithm to escape local minima. Such techniques were explored in [8] and later studied and successfully applied in various centralized settings; e.g., [26, 27, 28, 29, 30] and references therein. On the other hand, consensus + innovations techniques, such as those used in [21, 31], are used in distributed settings. In this paper we prove global optimal convergence for consensus + innovations techniques, with an appropriate annealing schedule, in nonconvex optimization. This results in a distributed equivalent of the centralized result in [8].
As noted in [32], the analysis techniques developed to study consensus + innovations algorithms contribute to the general theory of mixed-timescale stochastic approximation (SA) algorithms, e.g., [33]. In such algorithms, the right-hand side of the stochastic approximation difference equation contains two potentials decaying at different rates. The work [8] studies mixed-timescale SA algorithms in the context of simulated annealing. In [8], the term that serves a role analogous to our innovations potential is assumed to converge asymptotically to a martingale difference process. A key element of our analysis here is to characterize the rate at which the innovation potential converges to a martingale difference sequence in order to apply the results of [8].
Organization. The remainder of the paper is organized as follows. Section I-A introduces relevant notation. Section II formally introduces our distributed algorithm. Section III presents the assumptions used in our main result and some intermediate results, and it reviews a classical result in global optimization (Theorem 1) that will be used in the proof of our main result. Section IV proves the main result (Theorem 2).
I-A Notation
The set of reals is denoted by $\mathbb{R}$, whereas $\mathbb{R}_+$ denotes the nonnegative reals. For $a, b \in \mathbb{R}$, we will use the notations $a \vee b$ and $a \wedge b$ to denote the maximum and minimum of $a$ and $b$ respectively. We denote the $k$-dimensional Euclidean space by $\mathbb{R}^k$. The set of $k \times k$ real matrices is denoted by $\mathbb{R}^{k \times k}$. The $k \times k$ identity matrix is denoted by $I_k$, while $1_k$ and $0_k$ denote respectively the column vector of ones and zeros in $\mathbb{R}^k$. Often the symbol $0$ is used to denote the $k \times p$ zero matrix, the dimensions being clear from the context. The operator $\|\cdot\|$ applied to a vector denotes the standard Euclidean norm, while applied to matrices it denotes the induced norm, which is equivalent to the matrix spectral radius for symmetric matrices. The notation $A \otimes B$ is used for the Kronecker product of two matrices $A$ and $B$. We say that a function $f$ is of class $C^k$, $k \geq 0$, if $f$ is $k$ times continuously differentiable.
Given a set of elements $\{z_i\}_{i \in \mathcal{I}}$ belonging to some Euclidean space, we let $\mathrm{Vec}_{i \in \mathcal{I}}(z_i)$ denote the vector stacking these elements. To simplify notation, we sometimes suppress the interior brackets when the meaning is clear.
We assume there exists a rich enough probability space to carry out the constructions of the random objects defined in the paper. Unless stated otherwise, all (in)equalities involving random objects are to be interpreted almost surely (a.s.). We denote by $\mathbb{P}(\cdot)$ and $\mathbb{E}[\cdot]$ probability and expectation respectively. Given a measure $\pi$ on $\mathbb{R}^d$ and a (measurable) function $f : \mathbb{R}^d \to \mathbb{R}$, we let
(4) $\pi(f) = \int_{\mathbb{R}^d} f \, d\pi$
whenever the integral exists. For a stochastic process and a function , we let
(5) 
Spectral graph theory: The inter-agent communication topology may be described by an undirected graph $G = (V, E)$, with $V = \{1, \ldots, N\}$ and $E$ the set of agents (nodes) and communication links (edges), respectively. The unordered pair $(n, l) \in E$ if there exists an edge between nodes $n$ and $l$. We consider simple graphs, i.e., graphs devoid of self-loops and multiple edges. A graph is connected if there exists a path between each pair of nodes, where a path between nodes $n$ and $l$ of length $m$ is a sequence $(n = i_0, i_1, \ldots, i_m = l)$ of vertices such that $(i_k, i_{k+1}) \in E$ for all $0 \leq k \leq m - 1$. The neighborhood of node $n$ is
(6) $\Omega_n = \{ l \in V : (n, l) \in E \}$.
Node $n$ has degree $d_n = |\Omega_n|$ (the number of edges with $n$ as one end point). The structure of the graph can be described by the symmetric $N \times N$ adjacency matrix $A = [A_{nl}]$, with $A_{nl} = 1$ if $(n, l) \in E$ and $A_{nl} = 0$ otherwise. Let the degree matrix be the diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_N)$. The positive semidefinite matrix $L = D - A$ is the graph Laplacian matrix. The eigenvalues of $L$ can be ordered as $0 = \lambda_1(L) \leq \lambda_2(L) \leq \cdots \leq \lambda_N(L)$, the eigenvector corresponding to $\lambda_1(L)$ being $(1/\sqrt{N}) 1_N$. The multiplicity of the zero eigenvalue equals the number of connected components of the network; for a connected graph, $\lambda_2(L) > 0$. This second eigenvalue is the algebraic connectivity or the Fiedler value of the network; see [34] for detailed treatment of graphs and their spectral theory.
II Algorithm
Consider $N$ agents connected over a time-varying graph, with $L_t$ denoting the graph Laplacian at time $t$. Let $U_n : \mathbb{R}^d \to \mathbb{R}$, $n = 1, \ldots, N$, denote the objective function of agent $n$. Let $U$ be as defined in (1).
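As a concrete illustration of the graph Laplacian referred to above, the following minimal Python sketch builds the Laplacian (degree matrix minus adjacency matrix) of a small undirected graph and estimates its second-smallest eigenvalue, the Fiedler value, by power iteration. The 4-node path graph, the helper names, and the use of power iteration are illustrative assumptions and are not taken from the paper.

```python
import math


def laplacian(num_nodes, edges):
    """Graph Laplacian (degree minus adjacency) of a simple undirected graph."""
    lap = [[0.0] * num_nodes for _ in range(num_nodes)]
    for n, l in edges:
        lap[n][l] -= 1.0
        lap[l][n] -= 1.0
        lap[n][n] += 1.0
        lap[l][l] += 1.0
    return lap


def fiedler_value(lap, iters=500):
    """Estimate the second-smallest Laplacian eigenvalue by power iteration
    on (c*I - L), restricted to vectors orthogonal to the all-ones vector."""
    n = len(lap)
    c = 2.0 * max(lap[i][i] for i in range(n))   # Gershgorin bound on the largest eigenvalue
    v = [i - (n - 1) / 2.0 for i in range(n)]    # zero-mean start vector (hypothetical choice)
    for _ in range(iters):
        v = [c * v[i] - sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
        mean = sum(v) / n
        v = [vi - mean for vi in v]               # project out the all-ones direction
        norm = math.sqrt(sum(vi * vi for vi in v))
        v = [vi / norm for vi in v]
    lv = [sum(lap[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * lv[i] for i in range(n))    # Rayleigh quotient of the iterate


path_edges = [(0, 1), (1, 2), (2, 3)]
L_path = laplacian(4, path_edges)
lambda_2 = fiedler_value(L_path)
```

For the path graph the estimate is strictly positive, consistent with the graph being connected. The start vector must have a nonzero component along the Fiedler eigenvector, which holds for this example but is not guaranteed for an arbitrary graph.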
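The agents update their states in a distributed fashion according to (2) for all $t \geq 0$ with deterministic initial conditions $x_n(0) \in \mathbb{R}^d$, $n = 1, \ldots, N$. In (2), $\zeta_n(t)$ denotes gradient noise and $w_n(t)$ denotes a standard normal vector (introduced for annealing). In vector form, the update in (2) may be written as:
(7) $x_{t+1} = x_t - \beta_t (L_t \otimes I_d)\, x_t$
(8) $\qquad\qquad - \alpha_t \big(\nabla U(x_t) + \zeta_t\big) + \gamma_t w_t$,
where $x_t = \mathrm{Vec}(x_n(t))$, $\nabla U(x_t) = \mathrm{Vec}(\nabla U_n(x_n(t)))$, $\zeta_t = \mathrm{Vec}(\zeta_n(t))$, $w_t = \mathrm{Vec}(w_n(t))$, and $L_t$ denotes the (stochastic) undirected graph Laplacian.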
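The passage from the per-agent update to the stacked vector form rests on the fact that applying the Kronecker product of the Laplacian with the identity to the stacked state reproduces each agent's sum of differences with its neighbors. The short Python check below illustrates this on a hypothetical 3-agent path graph with 2-dimensional states. The graph, the state values, and the helper names are assumptions made for illustration.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a[i][j] * b[k][m] for j in range(len(a[0])) for m in range(len(b[0]))]
        for i in range(len(a))
        for k in range(len(b))
    ]


def matvec(mat, vec):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(row[j] * vec[j] for j in range(len(vec))) for row in mat]


# Path graph on N = 3 agents with d = 2 dimensional states (hypothetical values).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
lap = [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
identity_d = [[1, 0], [0, 1]]
states = [[1, 2], [4, 0], [-3, 5]]

# Per-agent consensus sums: for each agent n, sum over neighbors l of (x_n - x_l).
per_agent = []
for n in range(3):
    for k in range(2):
        per_agent.append(sum(states[n][k] - states[l][k] for l in neighbors[n]))

# Stacked form: (L kron I_d) applied to the stacked state vector.
stacked_state = [value for state in states for value in state]
stacked = matvec(kron(lap, identity_d), stacked_state)
```

Both computations give the same stacked vector, so the consensus term can be handled with matrix tools (eigenvalues of the Laplacian) rather than agent by agent.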
Remark 1.
In empirical risk minimization, agents optimize an empirical risk function using collected data, rather than optimizing the expected risk. In such problems, it is common to use stochastic gradient descent (SGD) techniques that mitigate computational burden by handling the data in batches. We note that our framework readily handles such SGD techniques, as the $\zeta_n(t)$ term can model independent gradient noise.
III Intermediate Results
This section presents some intermediate results. In Section III-A, we begin by presenting several technical lemmas. Subsequently, in Section III-B, we use these technical lemmas to prove that the algorithm (7) achieves asymptotic consensus (see Lemma 4). Finally, in Section III-C, we briefly review classical results in global optimization that will be used in the proof of our main result.
III-A Technical Results
We begin by making the following assumptions.
Assumption 1.
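The functions $U_n$ are $C^2$ with Lipschitz continuous gradients, i.e., there exists $L > 0$ such that
(9) $\|\nabla U_n(x) - \nabla U_n(y)\| \leq L \|x - y\|$
for all $x, y \in \mathbb{R}^d$ and all $n$.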
Assumption 2.
The functions $U_n$ satisfy the following bounded gradient-dissimilarity condition:
(10) 
Denote by $\{\mathcal{F}_t\}$ the natural filtration corresponding to the update process (2), i.e., for all $t$, the $\sigma$-algebra $\mathcal{F}_t$ is given by
(11) 
Assumption 3.
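The $\{\mathcal{F}_{t+1}\}$-adapted sequence of undirected graph Laplacians $\{L_t\}$ is independent and identically distributed (i.i.d.), with $L_t$ being independent of $\mathcal{F}_t$ for each $t$, and is connected in the mean, i.e., $\lambda_2(\bar{L}) > 0$ where $\bar{L} = \mathbb{E}[L_t]$.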
Assumption 4.
The sequence is adapted and there exists a constant such that
(12) 
for all .
Assumption 5.
For each , the sequence is a sequence of i.i.d. dimensional standard Gaussian vectors with covariance and with being independent of for all . Further, the sequences and are mutually independent for each pair with .
Assumption 6.
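The sequences $\{\alpha_t\}$, $\{\beta_t\}$, and $\{\gamma_t\}$ satisfy
(13) $\alpha_t = \dfrac{c_\alpha}{t}, \quad \beta_t = \dfrac{c_\beta}{t^{\tau_\beta}}, \quad \gamma_t = \dfrac{c_\gamma}{t^{1/2} \sqrt{\log \log t}}$, for $t$ large,
where $c_\alpha, c_\beta, c_\gamma > 0$ and $\tau_\beta \in (0, 1/2)$.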
The following lemma characterizes the decay rate of scaled gradient noise.
Lemma 1.
Let Assumption 4 hold. Then, for every , we have that a.s. as .
Proof.
Lemma 2 (Lemma 4.3 in [31]).
Let be an valued adapted process that satisfies
(17) 
In the above, is an adapted process, such that for all , satisfies and
(18) 
with and . The sequence is deterministic, valued and satisfies with and . Further, let and be valued adapted processes with a.s. The process is i.i.d. with independent of for each
and satisfies the moment condition
for some and a constant . Then, for every such that(19) 
we have a.s. as .
Remark 2.
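In $\mathbb{R}^{Nd}$, denote by $\mathcal{C}$ the consensus subspace,
(20) $\mathcal{C} = \{ x \in \mathbb{R}^{Nd} : x = 1_N \otimes a \text{ for some } a \in \mathbb{R}^d \}$,
and denote by $\mathcal{C}^{\perp}$ its orthogonal subspace in $\mathbb{R}^{Nd}$.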
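The decomposition into the consensus subspace and its orthogonal complement can be made concrete with a small numerical example. The Python sketch below splits a stacked state into the part in which every agent holds the network average and the remaining deviation, and checks that the two parts are orthogonal. The state values and the function name are hypothetical.

```python
def split_consensus(states):
    """Split stacked agent states into the consensus part (every agent holds the
    network average) and the orthogonal remainder (deviation from the average)."""
    num_agents, dim = len(states), len(states[0])
    average = [sum(s[k] for s in states) / num_agents for k in range(dim)]
    consensus_part = [average[k] for _ in range(num_agents) for k in range(dim)]
    stacked = [s[k] for s in states for k in range(dim)]
    orthogonal_part = [stacked[i] - consensus_part[i] for i in range(len(stacked))]
    return average, consensus_part, orthogonal_part


# Three agents with 2-dimensional states (hypothetical values).
average, in_c, in_c_perp = split_consensus([[1.0, 2.0], [4.0, 0.0], [-2.0, 4.0]])
inner = sum(a * b for a, b in zip(in_c, in_c_perp))
```

Here the consensus part is the stacked average and the inner product of the two parts is zero. Agents are in exact agreement precisely when the orthogonal part vanishes, which is why the consensus analysis tracks the component of the state in the orthogonal complement.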
Lemma 3 (Lemma 4.4 in [31]).
Let be an valued adapted process such that for all . Also, let be an i.i.d. sequence of Laplacian matrices as in Assumption 3 that satisfies
(21) 
with being adapted and independent of for all . Then there exists a measurable adapted valued process (depending on and ) and a constant , such that a.s. and
(22) 
with
(23) 
for all large enough, where the weight sequence and are defined in Assumption 6.
III-B Consensus
The following lemma shows that, a.s., the algorithm (7) obtains consensus asymptotically.
Lemma 4 (Convergence to Consensus Subspace).
Proof.
Noting that (by the properties of the undirected Laplacian), we have by (7),
(25) 
where
(26) 
Denote by the process, , for all , and note that
(27) 
since , where recall is the orthogonal complement of the consensus subspace (see (20)) and .
(28)  
(29) 
for all . (For convenience, we suppress the time index on the terms.) Now, consider the th component of the term ,
(30) 
and note that may be decomposed as
(31)  
(32)  
(33) 
For the second term on the R.H.S. of (31), note that, by Assumption 2, there exists a constant such that
(34) 
Finally, by the Lipschitz continuity of the gradients (see Assumption 1), we have, for a constant large enough,
(35) 
Hence, there exist constants such that
(36) 
For the term in (28), consider arbitrarily small. Consider the process , defined as for all , and note that by Lemma 1 we have as a.s. Since for all , we have (see Assumption 6)
(37) 
Similarly, note that,
(38)  
(39) 
Noting that has moments of all order (by the Gaussianity of the ’s), by (37)(38) we conclude that there exist valued adapted processes and such that
(40) 
with being bounded a.s. and possessing moments of all orders.
Since for all , by Lemma 3 there exists a adapted valued process and a constant such that a.s. and
(41) 
with
(42) 
for all large enough.
Thus, by (28), (36), (40), and (41) we obtain
(43)  
(44) 
for large. Denote by the process given by, for all , and note that, since , by (42) there exists a constant such that
(45) 
for all large enough. Noting that for all large, by (43) we have
(46) 
for large and a constant sufficiently large. By (45) and the above development, the recursion in (46) clearly falls under the purview of Lemma 2, and we conclude that (by taking and in Lemma 2) for all and such that
(47) 
we have a.s. as . By taking (since has moments of all orders) and , we conclude that a.s. as for all . ∎
III-C Classical Results: Recursive Algorithms for Global Optimization
We will now briefly review classical results on global optimization from [8] that will be used in the proof of our main result.
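Consider the following stochastic recursion in $\mathbb{R}^d$:
(48) $X_{t+1} = X_t - a_t \big(\nabla U(X_t) + \xi_t\big) + b_t W_t$,
where $U : \mathbb{R}^d \to \mathbb{R}$, $\{\xi_t\}$ is a sequence of $\mathbb{R}^d$-valued random variables, $\{W_t\}$ is a sequence of independent $d$-dimensional Gaussian random variables with mean zero and covariance $I_d$, and
(49) $a_t = \dfrac{A}{t}, \quad b_t^2 = \dfrac{B}{t \log \log t}$, for $t$ large,
where $A, B > 0$ are constants.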
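To give a feel for how slowly the annealing noise is switched off relative to the gradient step size, the sketch below evaluates one classical choice of schedule: a step size proportional to $1/t$ and a squared noise weight proportional to $1/(t \log\log t)$. This functional form, the constants `A` and `B`, and the function name are assumptions for illustration and should be checked against (49) and [8].

```python
import math


def annealing_schedule(t, A=1.0, B=4.0):
    """Return (a_t, b_t) with a_t = A / t and b_t^2 = B / (t * log(log(t))).

    Assumed form of the schedule; it is only defined here for t >= 3,
    where log(log(t)) is positive.
    """
    if t < 3:
        raise ValueError("schedule is only defined for t >= 3")
    a_t = A / t
    b_t = math.sqrt(B / (t * math.log(math.log(t))))
    return a_t, b_t


a_early, b_early = annealing_schedule(1000)
a_late, b_late = annealing_schedule(10**6)
```

Both weights decay, but the ratio of noise weight to step size grows with $t$, while the ratio of squared noise weight to step size decays only like $1/\log\log t$. This very slow decay is what lets the injected noise keep destabilizing local minima for a long time.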
Consider the following assumptions on , the gradient field , and noise :
Assumption 7.
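$U : \mathbb{R}^d \to [0, \infty)$ is a $C^2$ function such that
(i) $\min_x U(x) = 0$,
(ii) $U(x) \to \infty$ and $|\nabla U(x)| \to \infty$ as $|x| \to \infty$,
(iii) $\inf_x \big(|\nabla U(x)|^2 - \Delta U(x)\big) > -\infty$.
We note that, within the context of PAC learning, assumption (i) above corresponds to the “realizability” assumption, i.e., there exists a true (but unknown) hypothesis that accurately represents the data.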
Assumption 8.
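For $\varepsilon > 0$ let
$d\pi^{\varepsilon}(x) = \dfrac{1}{Z^{\varepsilon}} \exp\!\big(-2U(x)/\varepsilon^2\big)\, dx, \quad Z^{\varepsilon} = \int \exp\!\big(-2U(x)/\varepsilon^2\big)\, dx$.
$U$ is such that $\pi^{\varepsilon}$ has a weak limit $\pi$ as $\varepsilon \to 0$.
We note that $\pi$ is constructed so as to place mass 1 on the set of global minima of $U$. A discussion of simple conditions ensuring the existence of such a $\pi$ can be found in [35].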
Assumption 9.
, .
Assumption 10.
Assumption 11.
Let be the natural filtration generated by (48); that is, , is given by
Assumption 12.
There exists such that
with and .
Note that, in contrast to Assumption 4, Assumption 12 assumes the conditional mean may be nonzero (but decaying).
Finally, let be the constant as defined after (2.3) in [8].
IV Main Results
We will now prove the main result of the paper. We shall proceed as follows. We will first study the behavior of the $\mathbb{R}^d$-valued network-averaged process
(51) $\bar{x}_t = \dfrac{1}{N} \sum_{n=1}^{N} x_n(t)$.
Using Theorem 1, we will show that $\{\bar{x}_t\}$ converges to the set of global minima of $U$ (see Lemma 5). After proving Lemma 5 we will present Theorem 2, which is the main result of the paper. Theorem 2 follows as a straightforward consequence of Lemmas 4 and 5.
Note that, taking the average on both sides of (2) we obtain,
(52) 
where
(53) 
and and are given in (26).
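Averaging the per-agent update removes the consensus term, because on an undirected graph every difference between two neighboring states appears twice with opposite signs. The Python check below verifies this for one synchronous step with scalar states. The graph, the numerical values, and the helper names are hypothetical stand-ins, not quantities from the paper.

```python
def one_step(x, neighbors, grads, noise, alpha, beta, gamma):
    """One synchronous update of scalar agent states:
    consensus term, innovation (gradient) term, and annealing term."""
    return [
        x[n]
        - beta * sum(x[n] - x[l] for l in neighbors[n])
        - alpha * grads[n]
        + gamma * noise[n]
        for n in range(len(x))
    ]


def mean(values):
    return sum(values) / len(values)


neighbors = {0: [1], 1: [0, 2], 2: [1]}   # undirected path graph (hypothetical)
x = [1.0, 4.0, -2.0]
grads = [0.5, -1.0, 2.0]                    # stand-ins for noisy local gradients
noise = [0.1, -0.2, 0.3]                    # stand-ins for annealing noise samples
alpha, beta, gamma = 0.1, 0.2, 0.05

# Average of the full update versus the same expression without the consensus term.
averaged_update = mean(one_step(x, neighbors, grads, noise, alpha, beta, gamma))
consensus_free = mean(x) - alpha * mean(grads) + gamma * mean(noise)
```

The two quantities agree up to floating-point rounding, so the averaged process evolves like a centralized noisy gradient recursion driven by the average of the local gradients, each evaluated at the corresponding agent's own state.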
The following lemma shows that the network-averaged process $\{\bar{x}_t\}$ converges to the set of global minima of $U$.
Lemma 5.
Let satisfy the recursion (7) and let be given by (51), with initial condition . Let Assumptions 3–6 hold and Assume satisfies Assumptions 1–2 and 7–11. Further, suppose that and in Assumption 6 satisfy, , where is defined after Assumption 12. Then, for any bounded continuous function , we have that
(54) 
Proof.
The result will be proven by showing that the in (52) falls under the purview of Theorem 1, and, in particular, that Assumption 12 is satisfied. To do this, the key technical issue lies in handling the process . Specifically, we must restate the a.s. convergence obtained in Lemma 4 in terms of conditional expectations as required by Assumption 12.