# ASY-SONATA: Achieving Geometric Convergence for Distributed Asynchronous Optimization

Can one obtain a geometrically convergent algorithm for distributed asynchronous multi-agent optimization? This paper provides a positive answer to this open question. The proposed algorithm solves multi-agent (convex and nonconvex) optimization over static digraphs and it is asynchronous, in the following sense: i) agents can update their local variables as well as communicate with their neighbors at any time, without any form of coordination; and ii) they can perform their local computations using (possibly) delayed, out-of-sync information from the other agents. Delays need not obey any specific profile, and can also be time-varying (but bounded). The algorithm builds on a tracking mechanism that is robust against asynchrony (in the above sense), whose goal is to estimate locally the average of agents' gradients. When applied to strongly convex functions, we prove that it converges at an R-linear (geometric) rate as long as the step-size is sufficiently small. A sublinear convergence rate is proved, when nonconvex problems and/or diminishing, uncoordinated step-sizes are considered. Preliminary numerical results demonstrate the efficacy of the proposed algorithm and validate our theoretical findings.

## I Introduction

We study convex and nonconvex distributed optimization over a network of agents, modeled as a directed, fixed, graph. Agents aim at cooperatively solving the optimization problem

$$\min_{x\in\mathbb{R}^n} F(x)\triangleq\sum_{i=1}^{I} f_i(x) \tag{P}$$

where $f_i:\mathbb{R}^n\to\mathbb{R}$ is the cost function of agent $i$, assumed to be smooth (possibly nonconvex) and known only to agent $i$. In this setting, optimization has to be performed in a distributed, collaborative manner: agents can only receive/send information from/to their immediate neighbors. Instances of (P) that require distributed computing have found a wide range of applications in different areas, including network information processing, resource allocation in communication networks, swarm robotics, and machine learning, just to name a few.

Many of the aforementioned applications give rise to extremely large-scale problems and networks, which naturally call for asynchronous, parallel solution methods. In fact, an asynchronous modus operandi reduces the idle times of workers, mitigates communication and/or memory-access congestion, saves power (as agents need not perform computations and communications at every iteration), and makes algorithms more fault-tolerant. In this paper, we consider the following very general, abstract, asynchronous model [3]:

(i) Agents can perform their local computations as well as communicate (possibly in parallel) with their immediate neighbors at any time, without any form of coordination or centralized scheduling; and

(ii) when solving their local subproblems, agents can use outdated information from their neighbors.

In (ii) no constraint is imposed on the delay profiles: delays can be arbitrary (but bounded), time-varying, and (possibly) dependent on the specific activation rules adopted to wake up the agents in the network. This model captures in a unified fashion several forms of asynchrony: some agents execute more iterations than others; some agents communicate more frequently than others; and inter-agent communications can be unreliable and/or subject to unpredictable, time-varying delays.

Several forms of asynchrony have been studied in the literature; see Sec. I-A for an overview of related works. However, we are not aware of any distributed algorithm that is compliant with the asynchrony model (i)-(ii) and the distributed (nonconvex) setting above. Furthermore, when considering the special case of a strongly convex function $F$, it is not clear how to design a (first-order) distributed asynchronous algorithm (as specified above) that achieves a linear convergence rate. This paper provides a positive answer to these questions; see Sec. I-B and Table 1 for a summary of our contributions.

### I-A Literature Review

Since the seminal work [11], asynchronous parallelism has been applied to several centralized optimization algorithms, including block coordinate descent (e.g., [11, 12, 13]) and stochastic gradient (e.g., [14, 15, 16]) methods. However, these schemes are not applicable to the networked setup considered in this paper, because they would require the knowledge of the entire function from each agent. Some of these schemes were extended to hierarchical networks (e.g., master-slave architectures and star networks), see [17, 18], and references therein. However, they remain centralized, due to the use of a master (or cluster-head) node.

Distributed methods exploring (some form of) asynchrony over networks with no centralized node have been studied in [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 4, 5, 6, 7, 8, 9, 10]. We group next these works based upon the features (i)-(ii) above.

(a) Random activations and no delays [19, 20, 21, 22, 23]: These schemes considered distributed convex unconstrained optimization over undirected graphs. While substantially different in the form of the updates performed by the agents ([19, 21, 23] are instances of primal-dual (proximal-based) algorithms, [22] is an ADMM-type algorithm, while [20] is based on the distributed gradient tracking mechanism introduced in [30, 31, 32]), all these algorithms are asynchronous in the sense of feature (i) [but not (ii)]: at each iteration, a subset of agents [19, 21, 23] (or edge-connected agents [22, 20]), chosen at random, is activated and then performs its updates and communications with its immediate neighbors; between two activations, agents are assumed to be in idle mode (i.e., able to continuously receive information). However, no form of delays is allowed: every agent must perform its local computations/updates using the most updated information from its neighbors. This means that all the actions performed by the agent(s) in an activation must be completed before a new activation (agent) takes place (wakes up), which calls for some coordination among the agents. Finally, no convergence rate was provided for the aforementioned schemes, except for [22, 20].

(b) Synchronous activations and delays [24, 25, 26, 27, 28, 29]: These schemes considered distributed constrained convex optimization over undirected graphs. They study the impact of delayed gradient information [24, 25] or communication delays (fixed [26], uniform [29, 25] or time-varying [27, 28]) on the convergence rate of distributed gradient (proximal [24, 25] or projection-based [28, 29]) algorithms or distributed dual-averaging schemes [26, 27]. While these schemes are all synchronous [thus lacking feature (i)], they can tolerate communication delays [an instantiation of feature (ii)], converging at a sublinear rate to an optimal solution. Delays must be such that no losses occur: every agent's message eventually reaches its destination within a finite time.

(c) Random/cyclic activations and some form of delays [4, 5, 6, 7, 8, 9, 10]: The class of optimization problems along with the key features of the algorithms proposed in these papers are summarized in Table 1 and briefly discussed next. The majority of these works studied distributed (strongly) convex optimization over undirected graphs, with [5] assuming that all the functions have the same minimizer, [6] considering also nonconvex objectives, and [8] being implementable also over digraphs. The algorithms in [4, 5] are gradient-based schemes; [6] is a decentralized instance of ADMM; [9] applies an asynchronous parallel ADMM scheme to distributed optimization; and [10] builds on a primal-dual method. The schemes in [7, 8] instead build on (approximate) second-order information. All these algorithms are asynchronous in the sense of feature (i): [4, 5, 6, 9, 10] considered random activations of the agents (or edge-connected agents) while [7, 8] studied deterministic, uncoordinated activation rules. As far as feature (ii) is concerned, some form of delays is allowed. More specifically, [4, 5, 6, 8] can deal with packet losses: the information sent by an agent to its neighbors either gets lost or is received with no delay. They also assume that agents are always in idle mode between two activations. Closer to the proposed asynchronous framework are the schemes in [9, 10], wherein a probabilistic model is employed to describe the activation of the agents and the aged information used in their updates. The model requires that the random variables triggering the activation of the agents are i.i.d. and independent of the delay vector used by the agent to perform its update. While this assumption makes the convergence analysis possible, in reality, there is a strong dependence of the delays on the activation index; see [13] for a detailed discussion on this issue and several counterexamples. Other consequences of this model are: the schemes [9, 10] are not parallel (only one agent at a time can perform the update) and a random self-delay must be used in the update of each agent (even if agents have access to their most recent information). Finally, referring to the convergence rate, [9] is the only scheme with a provable convergence rate: when each $f_i$ is strongly convex and the graph is undirected, [9] converges linearly in expectation. No convergence rate is available in any of the aforementioned papers when $F$ is nonconvex.

### I-B Summary of Contributions

This paper proposes a general distributed, asynchronous algorithmic framework for (strongly) convex and nonconvex instances of Problem (P), over directed graphs. The algorithm leverages a perturbed “sum-push” mechanism that is robust against asynchrony, whose goal is to track locally the average of agents’ gradients; this scheme along with its convergence analysis are of independent interest. To the best of our knowledge, the proposed framework is the first scheme combining the following attractive features (cf. Table 1): (a) it is parallel and asynchronous [in the sense (i) and (ii)]: multiple agents can be activated at the same time (with no coordination) and/or outdated information can be used in the agents’ updates; our asynchronous setting (i) and (ii) is less restrictive than the one in [9, 10]; furthermore, in contrast with [9], our scheme avoids solving possibly complicated subproblems; (b) it is applicable to nonconvex problems; (c) it is implementable over digraphs; (d) it employs either a constant step-size or uncoordinated diminishing ones; (e) it converges at an R-linear rate (resp. sublinear) when $F$ is strongly convex (resp. nonconvex) and a constant (resp. diminishing, uncoordinated) step-size(s) is employed; this contrasts with [9], wherein each $f_i$ needs to be strongly convex; and (f) it is “protocol-free”, meaning that agents need not obey any specific communication protocols or asynchronous modus operandi (as long as delays are bounded and agents update/communicate uniformly infinitely often), which otherwise would impose some form of coordination.

On the technical side, convergence is studied introducing two techniques of independent interest, namely: i) the asynchronous agent system is reduced to a synchronous “augmented” one with no delays by adding virtual agents to the graph. While this idea was first explored in [33, 34], the proposed enlarged system differs from those used therein, which cannot deal with the general asynchronous model considered here; see Remark 12, Sec. VI; and ii) the rate analysis is carried out by putting forth a generalization of the small gain theorem (widely used in the literature [35] to analyze synchronous schemes), which is expected to be broadly applicable to other distributed algorithms.

### I-C Notation

Throughout the paper, we will use the following notation. Given the matrix , denotes its th element whereas and are its -th row vector and -th column vector, respectively. Given the matrix sequence , with , we define , if ; and , if . Given two matrices and of same size, by we mean that is a nonnegative matrix; the same notation will be used for vectors. We denote by the vector of all ones whereas is the -th canonical vector; the dimensions of and will be clear from the context. We use for both vectors and matrices; in the former case, represents the Euclidean norm whereas in the latter case it is the spectral norm. The indicator function of an event takes value when the event is true, and value otherwise. The set of nonnegative (resp. positive) integer is denoted by (resp. ). Finally, we use the convention and .

## II Problem Setup and Preliminaries

### II-A Problem Setup

We study Problem (P) under the following assumptions.

###### Assumption 1 (On the optimization problem).

1. Each is proper, closed and -Lipschitz differentiable;

2. $F$ is bounded from below.

Note that $F$ need not be convex. We also make the blanket assumption that each agent $i$ knows only its own $f_i$, but not $\sum_{j\neq i} f_j$. To state linear convergence, we will use the following extra condition on the objective function.

###### Assumption 2 (Strong convexity).

Assumption 1(i) holds and, in addition, $F$ is $\tau$-strongly convex.

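On the communication network: The communication network of the agents is modeled as a fixed, directed graph $\mathcal{G}\triangleq(\mathcal{V},\mathcal{E})$, where $\mathcal{V}\triangleq\{1,\ldots,I\}$ is the set of nodes (agents), and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the set of edges (communication links). If $(i,j)\in\mathcal{E}$, it means that agent $i$ can send information to agent $j$. We assume that the digraph does not have self-loops. We denote by $\mathcal{N}_i^{\rm in}$ the set of in-neighbors of node $i$, i.e., $\mathcal{N}_i^{\rm in}\triangleq\{j\in\mathcal{V}\mid(j,i)\in\mathcal{E}\}$, while $\mathcal{N}_i^{\rm out}\triangleq\{j\in\mathcal{V}\mid(i,j)\in\mathcal{E}\}$ is the set of out-neighbors of agent $i$. We make the following standard assumption on the graph connectivity.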

###### Assumption 3.

The graph $\mathcal{G}$ is strongly connected.

### II-B Preliminaries: The SONATA algorithm [36, 37]

The asynchronous, distributed framework we are going to introduce builds on the synchronous SONATA algorithm, proposed in [36, 37] to solve (nonconvex) multi-agent optimization problems over time-varying digraphs. This is motivated by the fact that SONATA has the unique property of being provably applicable to both convex and nonconvex problems, and it achieves linear convergence when applied to strongly convex objectives $F$. We thus begin by reviewing a special instance of SONATA, tailored to (P), and then generalize it to the asynchronous setting (cf. Sec. IV).

Every agent $i$ controls and iteratively updates the tuple $(x_i,y_i,z_i,\phi_i)$: $x_i$ is agent $i$’s copy of the shared variables $x$ in (P); $y_i$ acts as a local proxy of the sum-gradient $\nabla F$; and $z_i$ and $\phi_i$ are auxiliary variables instrumental to deal with communications over digraphs. Let $x_i^k$, $y_i^k$, $z_i^k$, and $\phi_i^k$ denote the value of the aforementioned variables at iteration $k$. The update of each agent $i$ reads:

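$$
\begin{aligned}
x_i^{k+1} &= \sum_{j\in\mathcal{N}_i^{\rm in}\cup\{i\}} w_{ij}\left(x_j^k-\alpha^k y_j^k\right), && (1)\\
z_i^{k+1} &= \sum_{j\in\mathcal{N}_i^{\rm in}\cup\{i\}} a_{ij}\,z_j^k+\nabla f_i(x_i^{k+1})-\nabla f_i(x_i^k), && (2)\\
\phi_i^{k+1} &= \sum_{j\in\mathcal{N}_i^{\rm in}\cup\{i\}} a_{ij}\,\phi_j^k, && (3)\\
y_i^{k+1} &= z_i^{k+1}/\phi_i^{k+1}, && (4)
\end{aligned}
$$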

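with $z_i^0=y_i^0=\nabla f_i(x_i^0)$ and $\phi_i^0=1$, for all $i\in\mathcal{V}$. In (1), $y_i^k$ is a local estimate of the average-gradient $(1/I)\sum_{i=1}^I\nabla f_i(x_i^k)$. Therefore, every agent $i$ first moves along the estimated gradient direction, generating $x_i^k-\alpha^k y_i^k$ ($\alpha^k$ is the step-size), and then performs a consensus step to force asymptotic agreement among the local variables $x_i$. Steps (2)-(4) represent a perturbed push-sum update, aiming at tracking the gradient [32, 37, 31]. The weight-matrices $\mathbf{W}\triangleq(w_{ij})_{i,j=1}^I$ and $\mathbf{A}\triangleq(a_{ij})_{i,j=1}^I$ satisfy the following standard assumptions.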
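The synchronous updates (1)-(4) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `sonata_sync`, the toy least-squares costs, the five-edge digraph, the uniform weights, the initialization $z_i^0=\nabla f_i(x_i^0)$, $\phi_i^0=1$, and the constant step-size `alpha=0.05` are all choices made here for illustration.

```python
import numpy as np

def sonata_sync(grads, W, A, x0, alpha, num_iters):
    """Run updates (1)-(4) with a constant step-size (illustrative sketch).

    grads : list of callables; grads[i](x) returns the gradient of f_i at x
    W, A  : (I, I) row-/column-stochastic weights; entry [i, j] > 0 iff j == i
            or agent j can send to agent i
    x0    : (I, n) array of initial local copies
    """
    num_agents = len(grads)
    x = np.array(x0, dtype=float)
    g = np.stack([grads[i](x[i]) for i in range(num_agents)])
    z = g.copy()                 # assumed initialization: z_i^0 = grad f_i(x_i^0)
    phi = np.ones(num_agents)    # assumed initialization: phi_i^0 = 1
    y = z / phi[:, None]
    for _ in range(num_iters):
        x = W @ (x - alpha * y)                                         # (1)
        g_new = np.stack([grads[i](x[i]) for i in range(num_agents)])
        z = A @ z + g_new - g                                           # (2)
        phi = A @ phi                                                   # (3)
        y = z / phi[:, None]                                            # (4)
        g = g_new
    return x

# Toy strongly convex instance: f_i(x) = 0.5 * ||U_i x - b_i||^2 (hypothetical data).
rng = np.random.default_rng(0)
num_agents, rows, dim = 4, 10, 2
U = rng.standard_normal((num_agents, rows, dim))
for i in range(num_agents):
    U[i] /= np.linalg.norm(U[i], 2)          # unit spectral norm per agent
b = rng.standard_normal((num_agents, rows))
grads = [lambda x, Ui=U[i], bi=b[i]: Ui.T @ (Ui @ x - bi) for i in range(num_agents)]

# Strongly connected digraph given as (sender, receiver) pairs, plus self-weights.
adj = np.eye(num_agents)
for sender, receiver in [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]:
    adj[receiver, sender] = 1.0
W = adj / adj.sum(axis=1, keepdims=True)     # row-stochastic
A = adj / adj.sum(axis=0, keepdims=True)     # column-stochastic

x_final = sonata_sync(grads, W, A, np.zeros((num_agents, dim)), alpha=0.05, num_iters=5000)
x_star = np.linalg.lstsq(U.reshape(-1, dim), b.reshape(-1), rcond=None)[0]
```

In this toy run every row of `x_final` is expected to agree with the centralized least-squares solution `x_star`; the step-size was picked by hand and is not the theoretical bound.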

###### Assumption 4.

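(On the weight-matrices) The weight-matrices $\mathbf{W}\triangleq(w_{ij})_{i,j=1}^I$ and $\mathbf{A}\triangleq(a_{ij})_{i,j=1}^I$ satisfy (we will write $\mathbf{M}\triangleq(m_{ij})_{i,j=1}^I$ to denote either $\mathbf{A}$ or $\mathbf{W}$):

1. $\exists\,\bar{m}>0$ such that $m_{ii}\geq\bar{m}$, for all $i\in\mathcal{V}$; and $m_{ij}\geq\bar{m}$, for all $(j,i)\in\mathcal{E}$; $m_{ij}=0$, otherwise;

2. $\mathbf{W}$ is row-stochastic, that is, $\mathbf{W}\mathbf{1}=\mathbf{1}$;

3. $\mathbf{A}$ is column-stochastic, that is, $\mathbf{A}^\top\mathbf{1}=\mathbf{1}$.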
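One simple way to obtain weight matrices with the sparsity and stochasticity properties listed above is to normalize the adjacency pattern (with self-weights added) by its row sums or column sums. The sketch below assumes uniform weights, which is only one admissible choice; the helper name `build_weights` and the example edge list are hypothetical.

```python
import numpy as np

def build_weights(edges, num_agents):
    """Uniform row-/column-stochastic weights from a digraph (illustrative).

    edges : iterable of (sender, receiver) pairs, without self-loops
    Returns (W, A) where entry [i, j] is positive iff j == i or j sends to i.
    """
    pattern = np.eye(num_agents)             # positive diagonal (self-weights)
    for sender, receiver in edges:
        pattern[receiver, sender] = 1.0
    W = pattern / pattern.sum(axis=1, keepdims=True)   # each row sums to one
    A = pattern / pattern.sum(axis=0, keepdims=True)   # each column sums to one
    return W, A

example_edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
W_example, A_example = build_weights(example_edges, num_agents=3)
```

Each agent can compute its own row of `W` from its in-degree and its own column of `A` from its out-degree, so no global information is needed for this particular choice.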

In [35], the authors proved that a special instance of SONATA, when applied to (P) with strongly convex $F$, converges at an R-linear rate. This result was further extended to constrained, nonsmooth, distributed optimization in [38].

A natural question is whether the SONATA algorithm works also in an asynchronous setting while still preserving a linear convergence rate. Naive modifications of the updates (1)-(4) to make them asynchronous, such as using uncoordinated activations and/or delayed information, would not work. For instance, the tracking (2)-(4) calls for the invariance of the averages, i.e., $\sum_{i=1}^I z_i^k=\sum_{i=1}^I\nabla f_i(x_i^k)$, for all $k$. It is not difficult to check that any perturbation injected in (2), e.g., in the form of delays or packet losses, jeopardizes this property.
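The invariance property mentioned above can be checked directly from (2). The short derivation below is a sketch that assumes the initialization $z_i^0=\nabla f_i(x_i^0)$ and uses only the column-stochasticity of the weights $a_{ij}$ (with $a_{ij}=0$ whenever $j\notin\mathcal{N}_i^{\rm in}\cup\{i\}$):

```latex
\begin{aligned}
\sum_{i=1}^{I} z_i^{k+1}
  &= \sum_{i=1}^{I}\sum_{j=1}^{I} a_{ij}\, z_j^{k}
     + \sum_{i=1}^{I}\nabla f_i(x_i^{k+1}) - \sum_{i=1}^{I}\nabla f_i(x_i^{k}) \\
  &= \sum_{j=1}^{I}\Big(\underbrace{\sum_{i=1}^{I} a_{ij}}_{=1}\Big) z_j^{k}
     + \sum_{i=1}^{I}\nabla f_i(x_i^{k+1}) - \sum_{i=1}^{I}\nabla f_i(x_i^{k}),
\end{aligned}
\qquad\Longrightarrow\qquad
\sum_{i=1}^{I} z_i^{k+1} - \sum_{i=1}^{I}\nabla f_i(x_i^{k+1})
  = \sum_{i=1}^{I} z_i^{k} - \sum_{i=1}^{I}\nabla f_i(x_i^{k})
  = \cdots
  = \sum_{i=1}^{I} z_i^{0} - \sum_{i=1}^{I}\nabla f_i(x_i^{0}) = 0 .
```

If a term $a_{ij}z_j^k$ is lost or replaced by an older value $a_{ij}z_j^{k-d}$, the second equality no longer telescopes, which is why the plain update (2) is not robust to delays or packet losses.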

To cope with the above challenges, a first step is robustifying the gradient tracking scheme. In Sec. III, we introduce P-ASY-SUM-PUSH–an asynchronous, perturbed, instance of the push-sum algorithm [39], which serves as a unified algorithmic framework to accomplish several tasks over digraphs in an asynchronous manner, such as solving the average consensus problem and tracking the average of agents’ time-varying signals. Building on P-ASY-SUM-PUSH, in Sec. IV, we finally present the proposed distributed asynchronous optimization framework, termed ASY-SONATA.

## III Perturbed Asynchronous Sum-Push

We present P-ASY-SUM-PUSH; the algorithm was first introduced in our conference paper [1, 2], which we refer to for details on the genesis of the scheme and intuitions; here we directly introduce the scheme and study its convergence.

Consider a general asynchronous setting wherein multiple agents compute and communicate independently without coordination. This implies that some agents can execute more iterations than others and, generally, they use outdated information from their neighbors; delays are possibly time-varying (but bounded). To deal with asynchrony, every agent maintains state variables , , , along with the following auxiliary variables: i) the cumulative-mass variables and , with , which capture the cumulative (sum) information generated by agent up to the current time and to be sent to agent ; consequently, and are received by from its in-neighbors ; and ii) the buffer variables and , with , which store the information sent from to and used by in its last update. Values of these variables at iteration are denoted by the same symbols adding the superscript “”. Note that, because of the asynchrony, each agent might have outdated and ; (resp. ) is a delayed version of the current (resp. ) owned by at time , where is the delay, with . Similarly, and might differ from the last information generated by for , because agent might not have received that information yet (due to delays) or never will (due to packet losses).

The proposed asynchronous algorithm, P-ASY-SUM-PUSH, is summarized in Algorithm 1 wherein a “global view” of agents’ actions is introduced. The global view allows us to abstract from specific computation-communication protocols and asynchronous modus operandi employed by the agents. A global iteration clock (not known to the agents) is introduced: is triggered when one agent, say , performs its updates throughout the following set of actions. (S.2): agent maintains a local variable , for each , which keeps track of the “age” (generated time) of the -variables that it has received from its in-neighbors and already used. If is larger than the current counter , indicating that the received -variables are newer than those currently stored, agent accepts and , and updates as ; otherwise, the variables will be discarded and remains unchanged. Note that (5) can be performed without any coordination. It is sufficient that each agent attaches a time-stamp to its produced information reflecting its local timing counter. We describe next the other steps, assuming that new information has come in to agent , that is, .
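The acceptance rule in (S.2) amounts to a monotone time-stamp check on the receiver side. The snippet below is a minimal sketch of that rule only; the class `InBuffer`, the function `receive`, and the field names are hypothetical and are not taken from Algorithm 1.

```python
from dataclasses import dataclass

@dataclass
class InBuffer:
    """What a receiver keeps for one in-neighbor (hypothetical layout)."""
    stamp: int = -1        # age (generation time) of the stored information
    rho_z: float = 0.0     # last accepted cumulative z-mass
    rho_phi: float = 0.0   # last accepted cumulative phi-mass

def receive(buf, msg_stamp, msg_rho_z, msg_rho_phi):
    """Accept a message only if it is newer than what is stored; else discard."""
    if msg_stamp > buf.stamp:
        buf.stamp, buf.rho_z, buf.rho_phi = msg_stamp, msg_rho_z, msg_rho_phi
        return True
    return False

# Messages arriving out of order: the second one is stale and is dropped.
buffer_from_j = InBuffer()
arrivals = [(2, 1.0, 0.5), (1, 0.4, 0.2), (5, 2.0, 1.0)]
accepted = [receive(buffer_from_j, *msg) for msg in arrivals]
```

Because the check compares time-stamps attached by the sender, the receiver needs no knowledge of a global clock.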

(S.3.1): In (6), agent builds the intermediate “mass” based upon its current information and , and the (possibly) delayed one from its in-neighbors, ; and is an exogenous perturbation (later this perturbation will be properly chosen to accomplish specific goals, see Sec. IV). Note that the way agent forms its own estimates is immaterial to the description of the algorithm. The local buffer stores the value of that agent used in its last update. Therefore, if the information in is not older than the one in , the difference in (6) will capture the sum of the ’s that have been generated by for up until and not used by agent yet. For instance, in a synchronous setting, one would have . (S.3.2): the generated is “pushed back” to agent itself and its out-neighbors. Specifically, out of the total mass generated, agent gets , determining the update while the remaining is allocated to the agents , with cumulating to the mass buffer and generating the update , to be sent to agent . (S.3.3): each local buffer variable is updated to account for the use of new information from . Finally, as in the push-sum algorithm, the final information is read on the normalized -variables [cf. (S.3.4)].

Note that the update described above is fully defined, once and are given. The selection in (S.1) is not performed by anyone; it is instead an a-posteriori description of agents’ actions: All agents act asynchronously and continuously; the agent completing the “push” step and updating its own variables triggers retrospectively the iteration counter and determines the pair along with all quantities involved in the other steps.
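The steps described above can be sketched as a small simulation. Since Algorithm 1 itself is not reproduced in this text, the code follows only the prose description of (S.2)-(S.3.4) and should be read as an assumption-laden illustration: the perturbation is set to zero, signals are scalar, delays are simulated by reading an older snapshot of the senders' cumulative masses, agents are activated cyclically with a reshuffle per round, and all names (`asy_sum_push`, `rho`, `buf`, `stamp`, `hist`) are hypothetical.

```python
import numpy as np

def asy_sum_push(z0, A, num_iters, max_delay, rng):
    """Cumulative-mass push-sum under bounded delays (error-free sketch).

    A[j, i] > 0 iff j == i or agent i can send to agent j; columns sum to one.
    Returns the normalized estimates y_i = z_i / phi_i.
    """
    num_agents = len(z0)
    z = np.array(z0, dtype=float)
    phi = np.ones(num_agents)
    rho = np.zeros((num_agents, num_agents, 2))   # rho[j, i]: (z, phi) mass i has sent to j so far
    buf = np.zeros((num_agents, num_agents, 2))   # buf[i, j]: value of rho[i, j] last used by i
    stamp = np.zeros((num_agents, num_agents), dtype=int)  # age of the info in buf[i, j]
    hist = [rho.copy()]                           # hist[t]: rho at the start of global iteration t
    for k in range(num_iters):
        if k % num_agents == 0:
            order = rng.permutation(num_agents)   # every agent wakes up once per round
        i = order[k % num_agents]
        mass = np.array([z[i], phi[i]])
        for j in np.flatnonzero(A[i] > 0):        # in-neighbors of i
            if j == i:
                continue
            t = max(k - rng.integers(0, max_delay + 1), 0)   # delayed read
            if t > stamp[i, j]:                   # accept only newer information
                mass += hist[t][i, j] - buf[i, j] # mass sent by j and not used yet
                buf[i, j] = hist[t][i, j]
                stamp[i, j] = t
        z[i], phi[i] = A[i, i] * mass             # share kept by agent i
        for j in np.flatnonzero(A[:, i] > 0):     # out-neighbors of i
            if j != i:
                rho[j, i] += A[j, i] * mass       # accumulate mass destined to j
        hist.append(rho.copy())
    return z / phi

rng = np.random.default_rng(1)
num_agents = 5
pattern = np.eye(num_agents)
for sender, receiver in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2), (3, 1)]:
    pattern[receiver, sender] = 1.0
A = pattern / pattern.sum(axis=0, keepdims=True)
z_init = rng.standard_normal(num_agents)
y_est = asy_sum_push(z_init, A, num_iters=5000, max_delay=3, rng=rng)
```

The quantity $\sum_i z_i$ plus the mass still in transit (sent but not yet used) stays constant in this sketch, which is the reason every `y_est[i]` is expected to approach the average of `z_init` despite the delays.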

Convergence is given under the following assumptions.

###### Assumption 5 (On the asynchronous model).

Suppose:

1. such that , for all ;

2. such that , for all and .

The next theorem studies convergence of P-ASY-SUM-PUSH, establishing geometric decay of the error , even in the presence of unknown (bounded) perturbations, where represents the “total mass” of the system at iteration .

###### Theorem 6.

Let be the sequence generated by Algorithm 1, under Assumption 5, and with satisfying Assumption 4 (i),(iii). Define There exist constants and , such that

$$\left\|y_i^{k+1}-(1/I)\cdot m_z^{k+1}\right\|\leq C_1\left(\rho^k\left\|z^0\right\|+\sum_{l=0}^{k}\rho^{k-l}\left\|\epsilon^l\right\|\right), \tag{9}$$

for all and .

Furthermore, .

###### Proof.

See Sec. VI. ∎

Discussion: Several comments are in order.

#### III-1 On the asynchronous model

Algorithm 1 represents a gamut of asynchronous parallel schemes and architectures, all captured by the mechanism of generation of the indices and delay vectors , which the agents need not know. The only conditions to be satisfied by are in Assumption 5: (i) controls the frequency of the updates whereas (ii) limits the age of the old information used in the computations. These assumptions are quite mild. For instance, (i) is automatically satisfied if each agent wakes up and performs an update whenever some internal clock ticks, without the need of any central clock or coordination with the others. (ii) imposes some conditions on the communications: the information used by any agent is outdated by at most units (with finite but arbitrarily large). This however does not enforce a-priori any specific protocol (on the activation/idle time/communication). For instance, i) agents need not perform the actions in Algorithm 1 sequentially or inside the same activation; ii) executing the “push” step does not mean that agents must broadcast their new variables in the same activation; this would simply result in a delay (or packet loss) in the communication.

#### III-2 Beyond average consensus

By choosing properly the perturbation signal , P-ASY-SUM-PUSH can solve different problems. Some examples are discussed next.
(i) Error free: $\epsilon^k=0$, for all $k$. P-ASY-SUM-PUSH solves the average consensus problem and (9) reads

$$\left\|y_i^{k+1}-(1/I)\cdot\sum_{j=1}^{I}z_j^0\right\|\leq C_1\,\rho^k\left\|z^0\right\|.$$

(ii) Vanishing error: . Using [32, Lemma 7(a)], (9) reads .

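(iii) Asynchronous tracking. Each agent $i$ owns a (time-varying) signal $u_i^k$; the average tracking problem consists in asymptotically tracking the average signal $\bar{u}^k\triangleq(1/I)\sum_{i=1}^I u_i^k$, that is,

$$\lim_{k\to\infty}\left\|y_i^{k+1}-\bar{u}^{k+1}\right\|=0,\quad\forall i\in\mathcal{V}. \tag{10}$$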

Under mild conditions on the signal, this can be accomplished in a distributed and asynchronous fashion, using P-ASY-SUM-PUSH, as formalized next.

###### Corollary 6.1.

Consider the following setting in P-ASY-SUM-PUSH: , for all ; , with

$$\tilde{u}_i^{k+1}=\begin{cases}u_i^{k+1} & \text{if } i=i^k;\\ \tilde{u}_i^{k} & \text{otherwise};\end{cases}\qquad \tilde{u}_i^0=u_i^0.$$

Then (9) holds, with . Furthermore, if , then (10) holds.

###### Proof.

See Appendix -E. ∎

This instance of P-ASY-SUM-PUSH will be used in Sec. IV to perform asynchronous gradient tracking inside ASY-SONATA.

###### Remark 7 (Comparison with [33, 40, 8]).

The use of counter variables [such as -variables in our scheme] was first introduced in [33] to design a synchronous average consensus algorithm robust to packet losses. In [40], this scheme was extended to deal with uncoordinated (deterministic) agents’ activations whereas [8] built on [40] to design, in the same setting, a distributed Newton-Raphson algorithm. There are important differences between P-ASY-SUM-PUSH and the aforementioned schemes, namely: i) none of them can deal with delays, only with packet losses; ii) [33] is synchronous; and iii) [40, 8] are not parallel schemes, as at each iteration only one agent is allowed to wake up and transmit information to its neighbors. For instance, [40, 8] cannot model synchronous parallel (Jacobi) updates. Hence, the convergence analysis of P-ASY-SUM-PUSH calls for a new line of proof, as introduced in Sec. VI.

## IV Asynchronous SONATA (ASY-SONATA)

We are now ready to introduce our distributed asynchronous algorithm, ASY-SONATA. The algorithm combines SONATA (cf. Sec. II-B) with P-ASY-SUM-PUSH (cf. Sec. III), the latter replacing the synchronous tracking scheme (2)-(4). The “global view” of the scheme is given in Algorithm 2.

In ASY-SONATA, agents continuously and with no coordination perform: i) their local computations [cf. (S.3)], possibly using an out-of-sync estimate of the average gradient; in (11), is a step-size (to be properly chosen); ii) a consensus step on the -variables, using possibly outdated information from their in-neighbors [cf. (S.4)]; and iii) gradient tracking [cf. (S.5)] to update the local estimate , based on the current cumulative mass variables , and buffer variables , .

Note that in Algorithm 1, the tracking variable is obtained rescaling by the factor . In Algorithm 2, we absorbed the scaling in the step size and use directly as a proxy of the average gradient, eliminating thus the -variables (and the related -, -variables). Also, for notational simplicity and without loss of generality, we assumed that the - and - variables are subject to the same delays (e.g., they are transmitted within the same packet); same convergence results hold if different delays are considered.

We study now convergence of the scheme, under a constant step-size or diminishing, uncoordinated ones.

### IV-A Constant Step-size

To measure the progresses of the algorithm towards optimality (or stationarity) and consensus, we use the merit function

$$M_F(x^k)\triangleq\max\left\{\left\|\nabla F(\bar{x}^k)\right\|^2,\ \left\|x^k-\mathbf{1}_I\otimes\bar{x}^k\right\|^2\right\}, \tag{12}$$

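where $x^k\triangleq[x_1^{k\top},\ldots,x_I^{k\top}]^\top$ and $\bar{x}^k\triangleq(1/I)\sum_{i=1}^I x_i^k$. Note that $M_F$ is a valid merit function, since it is continuous and $M_F(x)=0$ if and only if all $x_i$’s are consensual and optimal (resp. stationary solutions).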
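The merit function (12) is straightforward to evaluate offline, for instance when plotting convergence curves. The sketch below assumes the local copies are stored as the rows of an array, so that the consensus term equals the squared Frobenius norm of the deviations from the average; the helper name `merit` and the toy objective are hypothetical.

```python
import numpy as np

def merit(x, grad_F):
    """Evaluate (12): max of squared gradient norm at the average and squared consensus error.

    x      : (I, n) array, row i is agent i's local copy x_i
    grad_F : callable returning the gradient of F at a point in R^n
    """
    x_bar = x.mean(axis=0)                         # average of the local copies
    optimality = np.linalg.norm(grad_F(x_bar)) ** 2
    consensus = np.linalg.norm(x - x_bar) ** 2     # equals ||x - 1_I (kron) x_bar||^2
    return max(optimality, consensus)

# Toy check with F(x) = 0.5 * ||x||^2, whose gradient is x itself.
toy_grad = lambda v: v
value_disagree = merit(np.array([[1.0, 0.0], [-1.0, 0.0]]), toy_grad)
value_agree = merit(np.array([[1.0, 1.0], [1.0, 1.0]]), toy_grad)
value_optimal = merit(np.zeros((2, 2)), toy_grad)
```

The first toy value is driven purely by disagreement (the average is optimal), the second purely by suboptimality (the copies agree), and the third is zero.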

Theorem 8 below establishes linear convergence of ASY-SONATA when $F$ is strongly convex; and Theorem 9 addresses the case of convex and nonconvex $F$.

###### Theorem 8 (Geometric convergence).

Consider (P) under Assumption  2, and let denote its unique solution. Let be the sequence generated by Algorithm 2, under Assumption 5, and with weight-matrices and satisfying Assumption 4. Then, there exists a constant such that if , it holds

$$\left\|x_i^k-x^\star\right\|=\mathcal{O}(\lambda^k),\quad\forall i\in\mathcal{V}, \tag{13}$$

with given by

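$$\lambda=\begin{cases}1-\dfrac{\tau\,\bar{m}^{2K_1}\gamma}{2} & \text{if } \gamma\in(0,\hat{\gamma}_1],\\[2mm] \rho+\sqrt{J_1\gamma} & \text{if } \gamma\in(\hat{\gamma}_1,\hat{\gamma}_2),\end{cases} \tag{14}$$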

where and are some constants strictly smaller than , and (see Appendix -D for the explicit expression of the constants).

###### Proof.

See Sec. VII. ∎

###### Theorem 9 (Sublinear convergence).

Consider (P) under Assumption 1 (thus possibly nonconvex). Let be the sequence generated by Algorithm 2, in the same setting of Theorem 8. Given , let be the first iteration such that . Then, there exists a , such that if , . The values of the above constants are given in the proof.

###### Proof.

See Sec. VII. ∎

Theorem 8 states that both consensus and optimization errors of the sequence generated by ASY-SONATA vanish geometrically. Therefore, ASY-SONATA matches the performance of a centralized gradient method in a distributed, asynchronous computing environment. We are not aware of any other scheme enjoying such a property in the considered setting. Note that ASY-SONATA is globally convergent regardless of the initialization. This is a major difference with respect to the distributed algorithm proposed in [8] (also employing a robustification of the push-sum consensus, cf. Remark 7). Convergence therein is established assuming that all agents initialize their local copies to be almost consensual and in a neighborhood of the optimal solution.

Finally, Theorem 9 shows that for general, possibly nonconvex instances of Problem (P), both consensus and optimization errors of the sequence generated by ASY-SONATA vanish at sublinear rate.

### IV-B Uncoordinated diminishing step-sizes

While the use of a constant step-size is appealing to obtain strong convergence rate results, estimating its upper-bound expression in a distributed setting, as required in Theorems 8 and 9, is not practical. In fact, such a value depends on global network and optimization parameters, not available at the agents’ side. Furthermore, the theoretical values are quite conservative, meaning that they would lead to slow convergence in practice. This naturally suggests the use of a diminishing step-size strategy. However, because of the asynchronous distributed nature of the system, one cannot simply assume that the sequence in (11) is a classical diminishing step-size sequence. In fact, this would require each agent to know the global iteration counter , which is not realistic. Inspired by [41], we assume instead that each agent, independently and with no coordination with the others, draws the step-size from a local sequence , according to its local clock. The sequence in (11) will be thus the result of the “uncoordinated samplings” of the local out-of-sync sequences . Fig. 1 shows an example of how the resulting sequence is built, with three agents and .
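The “uncoordinated sampling” described above can be sketched as follows: every agent keeps its own counter and reads its local sequence at that counter whenever it wakes up, so the global step-size sequence is whatever results from interleaving these reads. The function name `global_step_sizes`, the activation order, and the local rule $1/(t+1)$ are hypothetical choices for illustration; they are not the rule or the example used in Fig. 1.

```python
def global_step_sizes(activations, local_rule):
    """Interleave local step-size sequences according to who wakes up.

    activations : sequence of agent indices, one per global iteration
    local_rule  : callable (agent, local_count) -> step-size
    """
    local_count = {}
    steps = []
    for agent in activations:
        t = local_count.get(agent, 0)       # how many updates this agent has done
        steps.append(local_rule(agent, t))  # no access to the global counter needed
        local_count[agent] = t + 1
    return steps

# Three agents waking up at different frequencies (hypothetical pattern).
wake_order = [0, 1, 0, 2, 0, 1]
gammas = global_step_sizes(wake_order, lambda agent, t: 1.0 / (t + 1))
```

Note that the resulting global sequence is not monotone, even though every local sequence is, which is why a classical diminishing step-size analysis does not apply directly.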

The next theorem shows that in this setting, ASY-SONATA converges sub-linearly for both convex and nonconvex objectives. To our knowledge, this is the first result of this genre.

###### Theorem 10.

Consider Problem (P) under Assumption 1 (thus possibly nonconvex). Let be the sequence generated by Algorithm 2, in the same setting of Theorem 8, but with the agents using a local step-size sequence satisfying and . Given , let be the first iteration such that . Then

$$T_\delta\leq\inf\left\{k\in\mathbb{N}_0\ \Big|\ \sum_{t=0}^{k}\gamma^t\geq c/\delta\right\}, \tag{15}$$

where $c$ is a positive constant.

###### Proof.

See Sec. VII. ∎
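The right-hand side of (15) is the first index at which the cumulative step-size reaches $c/\delta$, which can be evaluated numerically for any given sequence. In the sketch below the constant `c` is treated as a known input purely for illustration (the theorem only asserts its existence), and the helper name `iteration_bound` and the two example sequences are hypothetical.

```python
def iteration_bound(step_size, c, delta, max_iters=10_000_000):
    """Smallest k with sum_{t=0}^{k} step_size(t) >= c / delta, or None if not reached."""
    threshold = c / delta
    total = 0.0
    for k in range(max_iters):
        total += step_size(k)
        if total >= threshold:
            return k
    return None

# Constant sequence 0.5 with c / delta = 32, and harmonic sequence with c / delta = 3.
bound_constant = iteration_bound(lambda t: 0.5, c=8.0, delta=0.25)
bound_harmonic = iteration_bound(lambda t: 1.0 / (t + 1), c=3.0, delta=1.0)
```

For a constant sequence the bound grows like $1/\delta$, whereas for a harmonic-type sequence the partial sums grow only logarithmically, so the bound grows exponentially in $1/\delta$.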

## V Numerical Results

We test ASY-SONATA on the Least Squares (LS) problem, a strongly convex instance of Problem (P), over directed graphs. In the LS problem, each agent aims to estimate an unknown signal through linear measurements , where is the sensing matrix, and is the additive noise. The LS problem can be written in the form of (P), with each . We fix the unknown signal, with its elements being i.i.d. random variables drawn from the standard normal distribution. For each sensing matrix, we first generate all its elements as i.i.d. random variables drawn from the standard normal distribution, and then normalize the matrix by multiplying it by the reciprocal of its spectral norm. The elements of the additive noise are i.i.d. Gaussian distributed, with zero mean and variance equal to .
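The data-generation procedure above can be sketched as follows. The dimensions, the noise variance, and all function and variable names are placeholders, and the measurement model (sensing matrix times signal plus noise) is the standard reading of "linear measurements" assumed here.

```python
import numpy as np


def make_ls_data(num_agents, rows, dim, noise_var, seed=0):
    """Generate a synthetic least-squares instance, one data block per agent."""
    rng = np.random.default_rng(seed)
    x_true = rng.standard_normal(dim)  # unknown signal, i.i.d. N(0, 1)
    matrices, measurements = [], []
    for _ in range(num_agents):
        M = rng.standard_normal((rows, dim))  # i.i.d. N(0, 1) entries
        M /= np.linalg.norm(M, 2)             # divide by the spectral norm
        noise = np.sqrt(noise_var) * rng.standard_normal(rows)
        matrices.append(M)
        measurements.append(M @ x_true + noise)
    return x_true, matrices, measurements


# Placeholder sizes, chosen only to make the sketch runnable.
x_true, matrices, measurements = make_ls_data(
    num_agents=5, rows=20, dim=10, noise_var=0.01
)
```

After normalization every sensing matrix has unit spectral norm, so no single agent's data dominates the conditioning of the aggregate problem in this sketch.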

Agents are activated according to a cyclic rule where the order is randomly permuted at the beginning of each round. Once activated, every agent performs all the steps in Algorithm 2 and then sends its updates to all its out-neighbors. Each transmitted message has an (integer) traveling time drawn uniformly at random within the interval .
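A minimal sketch of this activation and delay model is given below. The delay interval is not reproduced here, so the range `[0, max_delay]` and the small example graph are assumptions made for illustration.

```python
import random


def simulate_schedule(out_neighbors, num_rounds, max_delay, seed=0):
    """Return a list of (sender, {receiver: delay}) activation events.

    Each round activates every agent once, in a freshly shuffled order.
    Each outgoing message gets an integer delay drawn uniformly from
    [0, max_delay] (assumed interval).
    """
    rng = random.Random(seed)
    agents = sorted(out_neighbors)
    schedule = []
    for _ in range(num_rounds):
        order = agents[:]
        rng.shuffle(order)  # new random permutation at the start of the round
        for sender in order:
            delays = {
                receiver: rng.randint(0, max_delay)  # endpoints included
                for receiver in out_neighbors[sender]
            }
            schedule.append((sender, delays))
    return schedule


# Hypothetical three-agent directed cycle.
graph = {0: [1], 1: [2], 2: [0]}
schedule = simulate_schedule(graph, num_rounds=4, max_delay=3)
```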

We set and for each agent . We simulate a network of agents and set . A strongly connected digraph is generated according to the following procedure: each agent has three out-neighbors; one of them belongs to a directed cycle graph connecting all the agents while the other two are picked uniformly at random. We test ASY-SONATA with a constant step size , and also a diminishing step-size rule with each agent updating its local step size according to and ; as a benchmark, we also simulate its synchronous instance, with step size . In Fig. 2, we plot versus the number of rounds (one round corresponds to one update of all the agents). The curves are averaged over Monte-Carlo simulations, with different graph and data instantiations. The plot clearly shows linear convergence of ASY-SONATA with a constant step-size.
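The graph-generation procedure can be sketched as follows. The network size is a placeholder, and excluding self-loops and duplicate edges when drawing the two random out-neighbors is an assumption, since the text does not say how such cases are handled.

```python
import random


def random_digraph(num_agents, seed=0):
    """Directed cycle plus two random extra out-neighbors per agent."""
    if num_agents < 4:
        raise ValueError("need at least 4 agents for 3 distinct out-neighbors")
    rng = random.Random(seed)
    out_neighbors = {}
    for i in range(num_agents):
        cycle_next = (i + 1) % num_agents  # edge of the directed cycle
        candidates = [j for j in range(num_agents) if j not in (i, cycle_next)]
        out_neighbors[i] = [cycle_next] + rng.sample(candidates, 2)
    return out_neighbors


def is_strongly_connected(out_neighbors):
    """Check reachability from node 0 in the graph and in its reverse."""
    def reaches_all(adj):
        seen, stack = {0}, [0]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return len(seen) == len(adj)

    reverse = {i: [] for i in out_neighbors}
    for i, outs in out_neighbors.items():
        for j in outs:
            reverse[j].append(i)
    return reaches_all(out_neighbors) and reaches_all(reverse)


graph = random_digraph(num_agents=10)  # placeholder network size
print(is_strongly_connected(graph))
```

The embedded directed cycle alone guarantees strong connectivity, so the check is only a sanity test of the construction.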

## VI Convergence Analysis of P-ASY-SUM-PUSH

We prove Theorem 6; we assume , without loss of generality. The proof is organized in the following two steps. Step 1: We first reduce the asynchronous agent system to a synchronous “augmented” one with no delays. This is done by adding virtual agents to the graph along with their state variables, so that P-ASY-SUM-PUSH can be rewritten as a (synchronous) perturbed push-sum algorithm on the augmented graph. While this idea was first explored in [33, 34], there are some important differences between the proposed enlarged system and those used therein; see Remark 12. Step 2: We conclude the proof by establishing convergence of the perturbed push-sum algorithm built in Step 1.

### VI-A Step 1: Reduction to a synchronous perturbed push-sum

#### VI-A1 The augmented graph

We begin by constructing the augmented graph, an enlarged agent system obtained by adding virtual agents to the original graph . Specifically, we associate to each edge an ordered set of virtual nodes (agents), one for each of the possible delay values, denoted with a slight abuse of notation by ; see Fig. 3. Roughly speaking, these virtual nodes store the “information on the fly” according to its associated delay, that is, the information that has been generated by for but not used (received) by yet. Adopting the terminology in [34], nodes in the original graph are termed computing agents while the virtual nodes are called noncomputing agents.

With a slight abuse of notation, we define the set of computing and noncomputing agents as , and its cardinality as . We now identify the neighbors of each agent in this augmented system. Computing agents no longer communicate among themselves; each can only send information to the noncomputing nodes , with . Each noncomputing agent can either send information to the next noncomputing agent, that is (if any), or to the computing agent ; see Fig. 3(b).
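The construction of the augmented graph can be sketched as below. Several details are assumptions made for illustration: delays are taken to range over `0, ..., max_delay`, a virtual node is represented by a tuple `(j, i, d)`, a computing agent feeds only the virtual node with `d = 0`, and "next" means the node with index `d + 1`.

```python
def build_augmented_graph(out_neighbors, max_delay):
    """Return out-neighbor lists of the augmented graph.

    Computing agents are the original integer labels; the noncomputing
    agent attached to edge (j, i) with delay index d is the tuple (j, i, d).
    """
    augmented = {j: [] for j in out_neighbors}
    for j, receivers in out_neighbors.items():
        for i in receivers:
            # Computing agent j only talks to the first virtual node of (j, i).
            augmented[j].append((j, i, 0))
            for d in range(max_delay + 1):
                targets = [i]  # deliver the stored mass to computing agent i
                if d < max_delay:
                    # or pass it on to the next virtual node on the same edge
                    targets.insert(0, (j, i, d + 1))
                augmented[(j, i, d)] = targets
    return augmented


# Hypothetical three-agent directed cycle with delays in {0, 1, 2}.
original = {0: [1], 1: [2], 2: [0]}
augmented = build_augmented_graph(original, max_delay=2)
print(len(augmented))
```

Under these assumptions the augmented system has one node per computing agent plus `max_delay + 1` nodes per edge, and no edge connects two computing agents directly.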

To describe the information stored by the agents in the augmented system at each iteration, let us first introduce the following quantities: is the set of global iteration indices at which the computing agent wakes up; and, given , let . It is not difficult to conclude from (7) and (8) that

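$$\rho^{k}_{ij}=\sum_{t\in\mathcal{T}^{k-1}_{j}} a_{ij}\,z^{t+1/2}_{j}\quad\text{and}\quad\tilde{\rho}^{k}_{ij}=\rho^{\tau^{k-1}_{ij}}_{ij},\qquad (j,i)\in\mathcal{E}. \tag{16}$$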

At iteration , every computing agent stores , whereas the noncomputing agents are initialized to At the beginning of iteration , every computing agent will store whereas every noncomputing agent , with , stores the mass (if any) generated by for at iteration (thus ), i.e., (cf. Step 3.2), and not been used by yet (thus ); otherwise it stores . Formally, we have

 zk(j,i)d≜ aijzt+1/2j ⋅1[t=k−d−1∈Tk−1