## 1 Introduction

Online optimization is a fundamental procedure for solving a wide range of machine learning problems [1, 2]. It can be formulated as a repeated game between a learner (algorithm) and an adversary. The learner receives a streaming data sequence, sequentially selects actions, and the adversary reveals the convex or nonconvex losses to the learner. A standard performance metric for an online algorithm is regret, which measures the performance of the algorithm versus a static benchmark [3, 2]. For example, the benchmark could be an optimal point of the online average of the loss (local cost) function, had the learner known all the losses in advance. In a broad sense, if the benchmark is a fixed sequence, the regret is called static. Recent work on online optimization has investigated the notion of dynamic regret [3, 4, 5]. Dynamic regret can take the form of the cumulative difference between the instantaneous loss and the minimum loss. For convex functions, previous studies have shown that the dynamic regret of online gradient-based methods can be upper bounded by , where is a measure of regularity of the comparator sequence or the function sequence [3, 4, 6]. This bound can be improved to [7, 8], when the cost function is strongly convex and smooth.

Decentralized nonlinear programming has received a lot of interest in diverse scientific and engineering fields [9, 10, 11, 12]. The key problem involves optimizing a cost function , where and each is only known to the individual agent in a connected network of agents. The agents collaborate by successively sharing information with other agents located in their neighborhood with the goal of jointly converging to the network-wide optimal argument [13]. Compared to optimization procedures involving a fusion center that collects data and performs the computation, decentralized nonlinear programming enjoys the advantage of scalability to the size of the network used, robustness to the network topology, and privacy preservation in data-sensitive applications.

A popular algorithm in decentralized optimization is gradient descent, which has been studied in [13, 14]. Convergence results for convex problems with bounded gradients are given in [15], while analogous results for nonconvex problems are given in [16]. Convergence is accelerated by using corrected update rules and momentum techniques [17, 18, 19, 20]. The primal-dual [21, 22], ADMM [23, 24] and zeroth-order [25] approaches are related to the dual decentralized gradient method [14]. We also point out the recent work [20] on a very efficient consensus-based decentralized stochastic gradient (DSGD) method for deep learning over fixed topology networks and the earlier work [26] on decentralized gradient methods for nonconvex deep learning problems. Further, under some mild assumptions, [26] showed that decentralized algorithms can be faster than their centralized counterparts for certain stochastic nonconvex loss functions.

### 1.1 Content and Contributions

In this paper, we develop and analyze a new consensus-based distributed adaptive moment estimation (DADAM) method that incorporates decentralized optimization and uses a variant of adaptive moment estimation methods [27, 31, 35]. Existing distributed stochastic and adaptive gradient methods for deep learning are mostly designed for a central network topology [36, 37]. The main bottleneck of such a topology lies in the communication overload on the central node, since all nodes need to communicate with it concurrently. Hence, performance can be significantly degraded when network bandwidth is limited. These considerations motivate us to study an adaptive algorithm for network topologies in which all nodes can only communicate with their neighbors and none of the nodes is designated as “central”. The proposed method is therefore suitable for large-scale machine learning problems, since it enables both data parallelization and decentralized computation.

Next, we briefly summarize the main technical contributions of the work.

• Our first main result (Theorem 5) provides guarantees of DADAM for constrained convex minimization problems defined over a closed convex set . We provide the convergence bound in terms of dynamic regret and show that when the data features are sparse and have bounded gradients, our algorithm’s regret bound can be considerably better than the ones provided by standard mirror descent and gradient descent methods [13, 4, 5]. It is worth mentioning that the regret bounds provided for adaptive gradient methods [27] are static and our results generalize them to dynamic settings.

• In Theorem 8, we give a new local regret analysis for distributed online gradient-based algorithms for constrained nonconvex minimization problems computed over a network of agents. Specifically, we prove that under certain regularity conditions, DADAM can achieve a local regret bound of order for nonconvex distributed optimization. To the best of our knowledge, rigorous extensions of existing adaptive gradient methods to the distributed nonconvex setting considered in this work do not seem to be available.

• In this paper, we also present regret analysis for distributed stochastic optimization problems computed over a network of agents. Theorems 6 and 10 provide regret bounds of DADAM for minimization problem (2) with stochastic gradients and indicate that the results of Theorems 5 and 8 hold true in expectation. Further, in Corollary 11 we show that DADAM can achieve a local regret bound of order for nonconvex distributed stochastic optimization where is an upper bound on the variance of the stochastic gradient. Hence, DADAM outperforms centralized adaptive algorithms such as ADAM for certain realistic classes of loss functions when is sufficiently large.

In summary, a distinguishing feature of this work is the incorporation of adaptive learning with data parallelization, as well as extension to the stochastic setting with both convex/nonconvex objective functions. Further, the established technical results exhibit differences from those in [13, 14, 15, 16] with the notion of adaptive constrained optimization in online and dynamic settings.

The remainder of the paper is organized as follows. Section 2 gives a detailed description of DADAM, while Section 3 establishes its theoretical results. Section 4 explains a network correction technique for our proposed algorithm. Section 5 illustrates the proposed framework on a number of synthetic and real data sets. Finally, Section 6 concludes the paper.

The detailed proofs of the main results are deferred to the Appendix.

### 1.2 Mathematical Preliminaries and Notation

Throughout the paper, denotes the -dimensional real space. For any pair of vectors indicates the standard Euclidean inner product. We denote the norm by , the infinity norm by , and the Euclidean norm by . The above norms reduce to the vector norms if is a vector. The diameter of the set $\mathcal{X}$ is given by

$$\gamma_\infty = \sup_{x,y\in\mathcal{X}} \|x-y\|_\infty. \tag{1}$$

Let be the set of all positive definite matrices. denotes the Euclidean projection of a vector onto for :

$$\Pi_{\mathcal{X},A}[x] = \operatorname*{argmin}_{y\in\mathcal{X}} \big\|A^{1/2}(x-y)\big\|.$$
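As a concrete illustration of the weighted projection above, the following is a minimal sketch for one special case only: a diagonal positive definite $A$ and a box constraint set $[\mathrm{lo}, \mathrm{hi}]^p$. Both restrictions, and the function names, are assumptions made here for illustration and are not taken from the paper. In this case the squared objective separates over coordinates, so the projection reduces to coordinate-wise clipping regardless of the weights.

```python
import numpy as np

def weighted_box_projection(x, a_diag, lo, hi):
    """Project x onto the box [lo, hi]^p in the norm ||A^{1/2}(.)||, A = diag(a_diag).

    Assumption: A is diagonal with strictly positive entries and X is a box.
    """
    a_diag = np.asarray(a_diag, dtype=float)
    if np.any(a_diag <= 0):
        raise ValueError("A must be positive definite (all diagonal entries > 0)")
    # sum_d a_d * (x_d - y_d)^2 separates over coordinates, so each coordinate
    # is clipped independently and the weights do not change the minimizer.
    return np.clip(np.asarray(x, dtype=float), lo, hi)

def weighted_distance(x, y, a_diag):
    """Return ||A^{1/2}(x - y)|| for A = diag(a_diag)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum(np.asarray(a_diag, dtype=float) * diff ** 2)))
```

For a non-diagonal $A$ or a non-separable set (for example a norm ball), clipping is no longer the projection and a dedicated solver would be needed.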

The subscript is often used to denote the time step while stands for the -th element of . Further, is given by

$$y_{i,1:t,d} = [y_{i,1,d},\, y_{i,2,d},\, \dots,\, y_{i,t,d}]^{\top}.$$

We let denote the gradient of at . The -th largest singular value of matrix is denoted by . We denote the element in the -th row and -th column of matrix by . In several theorems, we consider a connected undirected graph with nodes and edges . The matrix is often used to denote the adjacency matrix of graph .

The Hadamard (entrywise) and Kronecker product are denoted by and , respectively. Finally, the expectation operator is denoted by .

## 2 Problem Formulation and Algorithm

In this section, we propose a new online adaptive optimization method (DADAM) that employs data parallelization and decentralized computation over a network of agents. Given a connected undirected graph , we let each node at time hold its own measurement and training data , and set . We also let each agent hold a local copy of the global variable at time , which is denoted by . With this setup, we present a distributed adaptive gradient method for solving the minimization problem

$$\operatorname*{minimize}_{x\in\mathcal{X}}\; F(x) = \frac{1}{n}\sum_{t=1}^{T}\sum_{i=1}^{n} f_{i,t}(x), \tag{2}$$

where each $f_{i,t}$ is a continuously differentiable mapping on the convex set $\mathcal{X}$.

DADAM uses a new distributed adaptive gradient method in which a group of agents aims to solve a sequential version of problem (2). Here, we assume that each component function becomes available to agent only after the agent has made its decision at time . In the -th step, the -th agent chooses a point corresponding to what it considers a good selection for the network as a whole. After committing to this choice, the agent has access to a cost function , and the network cost is then given by . Note that this function is not known to any of the agents and is not available at any single location.

The procedure of our proposed method is outlined in Algorithm 1.

$$\hat{\upsilon}_{i,t} = \beta_3\,\upsilon_{i,t} + (1-\beta_3)\max(\hat{\upsilon}_{i,t-1},\, \upsilon_{i,t}),$$

for normalizing the running average of the gradient instead of in ADAM and in AMSGRAD. The decay parameter $\beta_3$ is an important component of the DADAM framework, since it enables us to develop a convergent adaptive method similar to AMSGRAD while maintaining the efficiency of ADAM.
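The listing of Algorithm 1 is not reproduced in this text, so the following is only a hedged sketch of what one synchronous round of a DADAM-like update could look like. It combines the normalizer above with the consensus-plus-projection step of (4). Several pieces are assumptions rather than the authors' implementation: ADAM-style exponential moving averages for the first and second moments (which appear consistent with the expansions used in the proof of Lemma 14), a step-size of the form $\alpha/\sqrt{t}$, a box constraint set so that the projection is a clip, and a small `eps` added for numerical safety. The function name and argument layout are hypothetical.

```python
import numpy as np

def dadam_like_step(x, m, v, v_hat, grads, W, t,
                    alpha=0.1, beta1=0.9, beta2=0.999, beta3=0.9,
                    lo=-1.0, hi=1.0, eps=1e-8):
    """One synchronous round for n agents; x, m, v, v_hat, grads have shape (n, p).

    W is an (n, n) mixing matrix. Not the paper's Algorithm 1; a sketch only.
    """
    # Assumed ADAM-style moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grads
    v = beta2 * v + (1.0 - beta2) * grads ** 2
    # Normalizer from the text: beta3 = 1 keeps only v (ADAM-like),
    # beta3 = 0 keeps only the running maximum (AMSGRAD-like).
    v_hat = beta3 * v + (1.0 - beta3) * np.maximum(v_hat, v)
    # Consensus step: each agent averages its neighbours' iterates.
    mixed = W @ x
    # Adaptive step followed by projection onto the assumed box.
    step = (alpha / np.sqrt(t)) * m / (np.sqrt(v_hat) + eps)
    x_new = np.clip(mixed - step, lo, hi)
    return x_new, m, v, v_hat
```

With `W` equal to the identity matrix, each agent runs the same adaptive update in isolation, which matches the remark in the experiments that the identity mixing matrix recovers centralized adaptive methods.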

Next, we introduce the measure of regret for assessing the performance of DADAM against a sequence of successive minimizers. In the framework of online convex optimization, the performance of algorithms is assessed by regret, which measures how competitive the algorithm is with respect to the best fixed solution [38, 2]. However, this static notion of regret fails to capture the performance of online algorithms in a dynamic setting. To overcome this issue, we consider a more stringent metric, dynamic regret [4, 5, 3], in which the cumulative loss of the learner is compared against the minimizer sequence , i.e.,

$$\mathrm{Reg}^{C}_{T} := \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T} f_{i,t}(x_{i,t}) - \sum_{t=1}^{T} f_t(x_t^{*}),$$

where .
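To make the dynamic regret concrete, here is a small sketch that evaluates it for scalar quadratic losses $f_{i,t}(x) = \tfrac12 (x - c_{i,t})^2$. The quadratic family is a toy choice made here. The sketch also assumes that $f_t$ is the network average of the $f_{i,t}$ and that $x_t^*$ is its minimizer, since the defining clause is missing from this text. The function name is hypothetical.

```python
import numpy as np

def dynamic_regret(centers, decisions):
    """Dynamic regret for f_{i,t}(x) = 0.5 * (x - centers[t, i])**2.

    centers, decisions: arrays of shape (T, n); decisions[t, i] is x_{i,t}.
    Assumes f_t = (1/n) * sum_i f_{i,t} and x*_t = argmin f_t.
    """
    centers = np.asarray(centers, dtype=float)
    decisions = np.asarray(decisions, dtype=float)
    n = centers.shape[1]
    # Learner term: (1/n) * sum_i sum_t f_{i,t}(x_{i,t}).
    learner = 0.5 * np.sum((decisions - centers) ** 2) / n
    # For these quadratics the minimizer of f_t is the mean of the centers.
    x_star = centers.mean(axis=1, keepdims=True)
    # Comparator term: sum_t f_t(x*_t).
    comparator = np.sum(0.5 * np.mean((x_star - centers) ** 2, axis=1))
    return float(learner - comparator)
```

If every agent plays $x_t^*$ exactly, the regret is zero. A constant offset $\delta$ from $x_t^*$ in every round costs $T\delta^2/2$ for this loss family.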

On the other hand, in the framework of nonconvex optimization, it is usual to state convergence guarantees of an algorithm towards an -approximate stationary point, that is, there exists some iterate for which . Influenced by [39], we provide the definition of the projected gradient and introduce local regret next, a new notion of regret that quantifies the moving average of gradients over a network.

###### Definition 1.

(Local Regret). Assume is a differentiable function on a closed convex set . Given a step-size , we define the projected gradient of at , by

$$G_{\mathcal{X}}(x, f_i, \alpha) = \frac{\sqrt{\hat{\upsilon}_i}}{\alpha}\,(x - x_i^{+}), \quad \forall i\in\mathcal{V}, \tag{3}$$

where

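$$x_i^{+} = \operatorname*{argmin}_{y\in\mathcal{X}} \Big\{ \Big\langle y, \frac{m_i}{\sqrt{\hat{\upsilon}_i}} \Big\rangle + \frac{1}{2\alpha}\Big\| y - \sum_{j=1}^{n}[W]_{ij}\,x_j \Big\|^2 \Big\}. \tag{4}$$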

Then, the local regret of an online algorithm is given by

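$$\mathrm{Reg}^{N}_{T} := \frac{1}{n}\sum_{i=1}^{n}\, \min_{t\in\{1,\dots,T\}} \big\| G_{\mathcal{X}}(x_{i,t}, \bar{f}_{i,t}, \alpha_t) \big\|^2,$$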

where is an aggregate loss.
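The minimization in (4) has a closed form that may help in reading Definition 1. The short derivation below is a sketch under two assumptions that are not stated explicitly in this text: the norm in (4) is the Euclidean norm, and the division and square root are applied entrywise. The shorthand $z_i$ and $g_i$ is introduced here for illustration only.

```latex
\begin{align*}
&\text{Let } z_i = \sum_{j=1}^{n} [W]_{ij}\, x_j
 \quad\text{and}\quad g_i = \frac{m_i}{\sqrt{\hat{\upsilon}_i}}. \\
&\langle y, g_i \rangle + \frac{1}{2\alpha}\,\| y - z_i \|^2
 = \frac{1}{2\alpha}\,\big\| y - (z_i - \alpha g_i) \big\|^2
 + \langle z_i, g_i \rangle - \frac{\alpha}{2}\,\| g_i \|^2 . \\
&\text{The last two terms do not depend on } y, \text{ hence }
 x_i^{+} = \Pi_{\mathcal{X}}\big[\, z_i - \alpha g_i \,\big]. \\
&\text{If } \mathcal{X} = \mathbb{R}^p \text{ and } x = z_i, \text{ then }
 G_{\mathcal{X}}(x, f_i, \alpha)
 = \frac{\sqrt{\hat{\upsilon}_i}}{\alpha} \odot \alpha g_i = m_i .
\end{align*}
```

Under these assumptions, the projected gradient reduces to the first-moment estimate $m_i$ when the constraint is inactive, which is why its norm serves as a stationarity measure.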

We analyze the convergence of DADAM as applied to minimization problem (2) using the regrets $\mathrm{Reg}^{C}_{T}$ and $\mathrm{Reg}^{N}_{T}$. It is worth mentioning that DADAM is initialized at to keep the presentation of the convergence analysis clear. In general, any initialization can be selected for implementation purposes.

## 3 Convergence Analysis

In this section, our aim is to establish convergence properties of DADAM under the following assumptions:

###### Assumption 2.

The adjacency matrix $W$ of graph is doubly stochastic with a positive diagonal. More specifically, the information received from agent , satisfies

$$\sum_{i=1}^{n}[W]_{ij} = \sum_{j=1}^{n}[W]_{ij} = 1, \qquad [W]_{jj} > 0. \tag{5}$$

###### Assumption 3.

For all and , the function $f_{i,t}$ is continuously differentiable over $\mathcal{X}$ and has a Lipschitz continuous gradient on this set, i.e., there exists a constant $\rho$ so that

$$\|\nabla f_{i,t}(x) - \nabla f_{i,t}(y)\| \le \rho\,\|x-y\|, \quad \forall x,y\in\mathcal{X}.$$

Further, $f_{i,t}$ is Lipschitz continuous on $\mathcal{X}$ with a uniform constant $L$, i.e.,

$$|f_{i,t}(x) - f_{i,t}(y)| \le L\,\|x-y\|, \quad \forall x,y\in\mathcal{X}. \tag{6}$$

Let denote the stochastic gradient observed by agent after computing the estimate . For decentralized stochastic tracking and learning, we need the following assumption, which guarantees that is unbiased and has a bounded second moment.

###### Assumption 4.

For all and , the stochastic gradient satisfies

 E[∥∇fi,t(xi,t)∥2∗∇fi,t(xi,t)∣∣Ft−1]=∇fi,t(xi,t),

where is the -field containing all information prior to the outset of round .

### 3.1 Convex Case

Next, we focus on the case where for all and , the agent at time has access to the exact gradient . The following results apply to convex problems, as well as their stochastic variants in a dynamic environment.

Theorems 5 and 6 characterize the hardness of the problem via a complexity measure that captures the pattern of the minimizer sequence , where . Subsequently, we would like to provide a regret bound in terms of

$$D_{T,d} = \sum_{t=1}^{T-1} \big|x^{*}_{t+1,d} - x^{*}_{t,d}\big| \quad \text{for } d\in\{1,\dots,p\}, \tag{7}$$

which represents the variations in .
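The quantity in (7) is a per-coordinate path length of the minimizer sequence. A minimal sketch of how it could be computed from a stored sequence of minimizers follows. The array layout (one row per time step) and the function name are assumptions made here.

```python
import numpy as np

def path_variation(minimizers):
    """Return D_{T,d} for every coordinate d.

    minimizers: array of shape (T, p) whose row t holds x*_t.
    The result has shape (p,).
    """
    minimizers = np.asarray(minimizers, dtype=float)
    # |x*_{t+1,d} - x*_{t,d}| summed over t = 1, ..., T-1, separately for each d.
    return np.abs(np.diff(minimizers, axis=0)).sum(axis=0)
```

A static comparator gives $D_{T,d} = 0$ for every $d$, in which case the dynamic bound falls back to the static setting.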

Further, these theorems establish a tight connection between the convergence rate of distributed adaptive methods and the spectral properties of the underlying network. The inverse dependence on the spectral gap $1-\sigma_2(W)$ is quite natural, and for many families of undirected graphs we can give order-accurate estimates of $1-\sigma_2(W)$ [40, Proposition 5], which translate into estimates of convergence time.
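To give a feel for how the spectral gap varies with topology, the sketch below builds a doubly stochastic mixing matrix with the standard Metropolis-Hastings rule and evaluates $1-\sigma_2(W)$ for a ring and a complete graph. The Metropolis-Hastings rule is one common way to satisfy Assumption 2 and is used here as an assumption. It is not necessarily the exact weight matrix used in the paper's experiments, and all function names are hypothetical.

```python
import numpy as np

def metropolis_weights(adjacency):
    """Metropolis-Hastings mixing matrix for an undirected graph (0/1 adjacency)."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    deg = A.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and A[i, j] > 0:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        # The diagonal absorbs the remaining mass, so every row sums to one.
        W[i, i] = 1.0 - W[i].sum()
    return W

def spectral_gap(W):
    """Return 1 - sigma_2(W), with sigma_2 the second largest singular value."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return float(1.0 - s[1])

def ring_adjacency(n):
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = 1.0
        A[(i + 1) % n, i] = 1.0
    return A

def complete_adjacency(n):
    return np.ones((n, n)) - np.eye(n)
```

On eight nodes, the ring has a much smaller gap than the complete graph, which illustrates why sparse topologies slow down consensus in bounds that scale with $1/(1-\sigma_2(W))$.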

###### Theorem 5.

Suppose that Assumption 2 holds and the parameters satisfy . Let and for all . Then, using a step-size for the sequence generated by Algorithm 1, we have

 RegCT ≤(1−β1)α√1+logT2√n√(1−β2)p∑d=1∥g1:T,d∥ +p∑d=1G∞γ∞(1−β1)(1−λ)+p∑d=1γ∞(γ∞+DT,d)√n(1−β1)α√T^υT,d +4α√1+logT∑pd=1∥g1:T,d∥(1−σ2(W))√(1−β1)√(1−η)√(1−β2).

Next, we analyze the stochastic convex setting and extend the result of Theorem 5 to the noisy case where agents have access to stochastic gradients of the objective function (2).

###### Theorem 6.

Suppose that Assumptions 2 and 4 hold. Further, the parameters satisfy . Let . Then, using a step-size for the sequence generated by Algorithm 1, we have

 E[RegCT]≤(1−β1)α√1+logT2√n√(1−β2)p∑d=1E[∥g1:T,d∥] +p∑d=1ξγ∞(1−β1)(1−λ)+p∑d=1γ∞(γ∞+DT,d)√n(1−β1)α√TE[√^υT,d] +4α√1+logT∑pd=1E[∥g1:T,d∥](1−σ2(W))√(1−β1)√(1−η)√(1−β2).
###### Remark 7.

We note that when the gradients are sparse, we have and [27]. Hence, similar to adaptive algorithms such as ADAM, ADAGRAD and AMSGRAD, the regret bound of DADAM can be considerably better than the ones provided by standard mirror descent and gradient descent methods in both centralized [3, 4, 5, 6] and decentralized [6] settings.

### 3.2 Nonconvex Case

In this section, we provide convergence guarantees for DADAM for the nonconvex minimization problem (2) defined over a closed convex set . To do so, we use the projection map instead of for updating parameters for (see Algorithm 1 for details).

To analyze the convergence of DADAM in the nonconvex setting, we assume that for all ,

$$\max_{i,t,d}\; g_{i,t,d} \le \bar{\upsilon}, \quad t\in\{1,\dots,T\},\; d\in\{1,\dots,p\}, \tag{8}$$

for some finite . This assumption has already been used by some authors [33, 41, 42] to establish the convergence of adaptive methods in the nonconvex setting. The property (8), together with the update rule of , which is an exponential moving average of , gives . On the other hand, the moment vector is non-decreasing, so that for some constant . Hence, for all , we have

$$\underline{\upsilon}^{2} \le \hat{\upsilon}_{i,t,d} \le \bar{\upsilon}^{2}, \quad t\in\{1,\dots,T\},\; d\in\{1,\dots,p\}. \tag{9}$$

The following theorem establishes the convergence rate of decentralized adaptive methods in the nonconvex setting.

###### Theorem 8.

Suppose Assumptions 2 and 3 hold. Further, the parameters satisfy . Let . Choose the positive sequence such that with for at least one . Then, for the sequence generated by Algorithm 1, we have

 RegNT ≤1ϑt[(2+logT)2L maxt∈{2,…,T}2√n(1−η)√(1−β2)t−1∑s=0αsσt−s−12(W) +T∑t=1αtβ1,t¯υ2(1−β1,t)(1−η)2(1−β2)], (10)

where .

The following corollary shows that DADAM using a certain step-size leads to a near optimal regret bound for nonconvex functions.

###### Corollary 9.

Under the same conditions of Theorem 8, using the step-sizes and for all , we have

 RegNT≤(2¯υ2(2−β1)(1−β1)(1−η)2(1−β2)(1−λ))1T +(16√n¯υL(2−β1)(1−η)√(1−β2)(1−σ2(W)))(2+logT)T. (11)

To complete the analysis of our algorithm in the nonconvex setting, we provide the regret bound for DADAM, when stochastic gradients are accessible to the learner.

###### Theorem 10.

Suppose Assumptions 2-4 hold. Further, the parameters satisfy . Let . Choose the positive sequence such that with for at least one . Then, for the sequence generated by Algorithm 1, we have

 E[RegNT] ≤1ϑt[(2+logT)2L maxt∈{2,…,T}2√n(1−η)√(1−β2)t−1∑s=0αsσt−s−12(W) +T∑t=1αtβ1,t¯υ2(1−β1,t)(1−η)2(1−β2) +¯υξ2υ––2(1−β1)T∑t=1αt], (12)

where .

We next theoretically justify the potential advantage of the proposed decentralized algorithm DADAM over centralized adaptive moment estimation methods such as ADAM. More specifically, the following corollary shows that when is sufficiently large, the term will be dominated by the term which leads to a convergence rate.

###### Corollary 11.

Suppose Assumptions 2-4 hold. Moreover, the parameters satisfy and . Choose the step-size sequence as with . Then, for the sequence generated by Algorithm 1, we have

 E[RegNT]T≤(8¯υαυ––2(1−β1))ξ2√nT+2(f1(x1)−f1(x∗1))1T, (13)

if the total number of time steps satisfies

 T ≥(I1+I2), (14a) T ≥max{4ρ2¯υ2nυ––4(2−β1)2,4¯υ2n(2−β1)2}, (14b)

where

 I1=υ––22(1−η)2(1−β2)(1−λ)ξ2, I2=16nL2υ––4(1−β1)2(1−η)2(1−β2)(1−σ2(W))2¯υ2ξ4.

Let an -approximate solution of (2) be defined by . Corollary 11 indicates that the total computational complexity of DADAM to achieve an -approximate solution is bounded by .

## 4 An Extension of DADAM with a Corrected Update Rule

Compared to classical centralized algorithms, decentralized algorithms encounter more restrictive assumptions and typically worse convergence rates. Recently, for time-invariant graphs, [17] introduced a corrected decentralized gradient method in order to cancel the steady state error in decentralized gradient descent and provided a linear rate of convergence if the objective function is strongly convex. Analogous convergence results are given in [18] even for the case of time-variant graphs. Similar to [17, 18], we provide next a corrected update rule for adaptive methods, given by

for all , and , where is generated by Algorithm 1 and .

We note that a C-DADAM update is a DADAM update with a cumulative correction term. The summation in (15) is necessary, since each individual term is asymptotically vanishing and the terms must work cumulatively.

## 5 Numerical Results

In this section, we evaluate the effectiveness of the proposed DADAM-type algorithms such as DADAGRAD, DADADELTA, DRMSPROP, and DADAM by comparing them with SGD [43], decentralized SGD (DSGD) [13, 20, 26] and corrected decentralized SGD (C-DSGD) [17, 19].

The corrected variants of DADAM-type algorithms are denoted by C-DADAGRAD, C-DADADELTA, C-DRMSPROP, and C-DADAM. We also note that if the mixing matrix in Algorithm 1 is chosen to be the identity matrix, then the above algorithms reduce to the centralized adaptive methods. These algorithms are implemented with their default settings.

All algorithms were run on a Mac machine equipped with a 1.8 GHz Intel Core i5 processor and 8 GB of 1600 MHz DDR3 memory. Code to reproduce the experiments is available at https://github.com/Tarzanagh/DADAM.

In our experiments, we use the Metropolis constant edge weight matrix , which is mentioned in Section 7.2.1 with [44]. The connected network is randomly generated with agents and connectivity ratio .

Next, we mainly focus on the convergence rate of the algorithms instead of the running time. This is because the implementation of DADAM-type algorithms is a minor change over the standard decentralized stochastic algorithms such as DSGD and C-DSGD, and thus they have almost the same running time to finish one epoch of training, and both are faster than the centralized stochastic algorithms such as ADAM and SGD. We note that with high network latency, if a decentralized algorithm (DADAM or DSGD) converges with a similar running time as the centralized algorithm, it can be up to one order of magnitude faster [19]. However, the convergence rate, which depends on the “adaptiveness”, differs between the two types of algorithms.

### 5.1 Regularized Finite-sum Minimization Problem

Consider the following online distributed learning setting: at each time , randomly generated data points are given to every agent in the form of . Our goal is to learn the model parameter by solving the regularized finite-sum minimization problem (2) with

$$f_{i,t}(x) = \frac{1}{m_i}\sum_{j=1}^{m_i} \mathcal{L}(x, y_{t,i,j}, z_{t,i,j}) + \nu\,\|x\|_2^2, \tag{16}$$

where is the loss function, and is the regularization parameter.
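As an illustration of the local cost in (16), the sketch below evaluates it for one agent's minibatch. The specific loss used in the experiments is not stated in this text, so the logistic loss is an assumption chosen here for concreteness. Treating the first data argument as a feature vector and the second as a label in $\{-1, +1\}$ is also an assumption. Function names are hypothetical.

```python
import numpy as np

def logistic_loss(x, y, z):
    """log(1 + exp(-z * <y, x>)), evaluated stably; y is a feature vector, z a label."""
    return float(np.logaddexp(0.0, -z * np.dot(y, x)))

def regularized_local_cost(x, Y, Z, nu, loss=logistic_loss):
    """(1/m) * sum_j loss(x, Y[j], Z[j]) + nu * ||x||_2^2 for one agent's minibatch."""
    x = np.asarray(x, dtype=float)
    losses = [loss(x, y_j, z_j) for y_j, z_j in zip(Y, Z)]
    return float(np.mean(losses) + nu * np.dot(x, x))
```

At $x = 0$ the regularizer vanishes and every logistic term equals $\log 2$, which gives a quick sanity check on the averaging.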

For , we consider the ball

when a sparse classifier is preferred.

From Theorem 5, we would choose a constant step-size and diminishing step-sizes , for in order to evaluate the adaptive strategies. All other parameters of the algorithms and problems are set as follows: , , . The minibatch size is set to 10, the regularization parameter and the dimension of model parameter .

The numerical results are illustrated in Figure 1 for the synthetic datasets. It can be seen that the distributed adaptive algorithms significantly outperform DSGD and its corrected variants.

### 5.2 Neural Networks

Next, we present the experimental results using the MNIST digit recognition task. The model for training a simple multilayer perceptron (MLP) on the MNIST dataset was taken from the Keras GitHub repository. In our implementation, the model function has 15 dense layers of size 64. Small regularization with regularization parameter 0.00001 is added to the weights of the network and the minibatch size is set to 32.

We compare the accuracy of DADAM with that of DSGD and the Federated Averaging (FedAvg) algorithm [45], which also performs data parallelization without decentralized computation. The parameters for DADAM are selected in a way similar to the previous experiments. In our implementation, we use the same number of agents and choose as the parameters in the FedAvg algorithm, since it is close to a connected topology scenario as considered in DADAM and ADAM. It can be seen from Figure 2 that DADAM achieves high accuracy in comparison with DSGD and FedAvg.

## 6 Conclusion

A decentralized adaptive moment estimation method (DADAM) was proposed for the distributed learning of deep networks based on adaptive estimates of first and second moments. Convergence properties of the proposed algorithm were established for convex and nonconvex functions in both stochastic and deterministic settings. Numerical results on synthetic and real datasets show the efficiency and effectiveness of the proposed method in practice.

## Acknowledgements

The second author is grateful for a discussion of decentralized methods with Davood Hajinezhad.

## References

• [1] S. Shalev-Shwartz et al., “Online learning and online convex optimization,” Foundations and Trends® in Machine Learning, vol. 4, no. 2, pp. 107–194, 2012.
• [2] E. Hazan et al., “Introduction to online convex optimization,” Foundations and Trends® in Optimization, vol. 2, no. 3-4, pp. 157–325, 2016.
• [3] M. Zinkevich, “Online convex programming and generalized infinitesimal gradient ascent,” 2003.
• [4] E. C. Hall and R. M. Willett, “Online convex optimization in dynamic environments,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 4, pp. 647–662, 2015.
• [5] O. Besbes, Y. Gur, and A. Zeevi, “Non-stationary stochastic optimization,” Operations Research, vol. 63, no. 5, pp. 1227–1244, 2015.
• [6] S. Shahrampour and A. Jadbabaie, “Distributed online optimization in dynamic environments using mirror descent,” IEEE Transactions on Automatic Control, vol. 63, no. 3, pp. 714–725, 2018.
• [7] A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro, “Online optimization in dynamic environments: Improved regret rates for strongly convex problems,” in Decision and Control (CDC), 2016 IEEE 55th Conference on, pp. 7195–7201, IEEE, 2016.
• [8] L. Zhang, T. Yang, J. Yi, J. Rong, and Z.-H. Zhou, “Improved dynamic regret for non-degenerate functions,” in Advances in Neural Information Processing Systems, pp. 732–741, 2017.
• [9] J. Tsitsiklis, D. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE transactions on automatic control, vol. 31, no. 9, pp. 803–812, 1986.
• [10] D. Li, K. D. Wong, Y. H. Hu, and A. M. Sayeed, “Detection, classification, and tracking of targets,” IEEE signal processing magazine, vol. 19, no. 2, pp. 17–29, 2002.
• [11] M. Rabbat and R. Nowak, “Distributed optimization in sensor networks,” in Proceedings of the 3rd international symposium on Information processing in sensor networks, pp. 20–27, ACM, 2004.
• [12] V. Lesser, C. L. Ortiz Jr, and M. Tambe, Distributed sensor networks: A multiagent perspective, vol. 9. Springer Science & Business Media, 2012.
• [13] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009.
• [14] J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed optimization: Convergence analysis and network scaling,” IEEE Transactions on Automatic control, vol. 57, no. 3, pp. 592–606, 2012.
• [15] K. Yuan, Q. Ling, and W. Yin, “On the convergence of decentralized gradient descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016.
• [16] J. Zeng and W. Yin, “On nonconvex decentralized gradient descent,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2834–2848, 2018.
• [17] W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact first-order algorithm for decentralized consensus optimization,” SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015.
• [18] A. Nedic, A. Olshevsky, and W. Shi, “Achieving geometric convergence for distributed optimization over time-varying graphs,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017.
• [19] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, “D$^2$: Decentralized training over decentralized data,” arXiv preprint arXiv:1803.07068, 2018.
• [20] Z. Jiang, A. Balu, C. Hegde, and S. Sarkar, “Collaborative deep learning in fixed topology networks,” in Advances in Neural Information Processing Systems, pp. 5904–5914, 2017.
• [21] T.-H. Chang, A. Nedić, and A. Scaglione, “Distributed constrained optimization by consensus-based primal-dual perturbation method,” IEEE Transactions on Automatic Control, vol. 59, no. 6, pp. 1524–1538, 2014.
• [22] G. Lan, S. Lee, and Y. Zhou, “Communication-efficient algorithms for decentralized and stochastic optimization,” arXiv preprint arXiv:1701.03961, 2017.
• [23] E. Wei and A. Ozdaglar, “Distributed alternating direction method of multipliers,” 2012.
• [24] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin, “On the linear convergence of the admm in decentralized consensus optimization,” IEEE Transactions on Signal Processing, vol. 62, pp. 1750–1761, 2014.
• [25] D. Hajinezhad, M. Hong, and A. Garcia, “Zeroth order nonconvex multi-agent optimization over networks,” arXiv preprint arXiv:1710.09997, 2017.
• [26] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, pp. 5330–5340, 2017.
• [27] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
• [28] M. R. Peyghami and D. A. Tarzanagh, “A relaxed nonmonotone adaptive trust region method for solving unconstrained optimization problems,” Computational Optimization and Applications, vol. 61, no. 2, pp. 321–341, 2015.
• [30] T. Tieleman and G. Hinton, “Divide the gradient by a running average of its recent magnitude. coursera: Neural networks for machine learning,” tech. rep., Technical Report. Available online: https://zh.coursera.org/learn/neuralnetworks/lecture/YQHki/rmsprop-divide-the-gradient-by-a-running-average-of-its-recent-magnitude (accessed on 21 April 2017).
• [31] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
• [32] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of adam and beyond,” in International Conference on Learning Representations, 2018.
• [33] R. Ward, X. Wu, and L. Bottou, “Adagrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization,” arXiv preprint arXiv:1806.01811, 2018.
• [34] M. Zaheer, S. Reddi, D. Sachan, S. Kale, and S. Kumar, “Adaptive methods for nonconvex optimization,” in Advances in Neural Information Processing Systems, pp. 9815–9825, 2018.
• [35] H. B. McMahan and M. Streeter, “Adaptive bound optimization for online convex optimization,” arXiv preprint arXiv:1002.4908, 2010.
• [36] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al., “Large scale distributed deep networks,” in Advances in neural information processing systems, pp. 1223–1231, 2012.
• [37] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server.,” in OSDI, vol. 14, pp. 583–598, 2014.
• [38] D. Mateos-Núnez and J. Cortés, “Distributed online convex optimization over jointly connected digraphs,” IEEE Transactions on Network Science and Engineering, vol. 1, no. 1, pp. 23–37, 2014.
• [39] E. Hazan, K. Singh, and C. Zhang, “Efficient regret minimization in non-convex games,” arXiv preprint arXiv:1708.00075, 2017.
• [40] A. Nedić, A. Olshevsky, and M. G. Rabbat, “Network topology and communication-computation tradeoffs in decentralized optimization,” Proceedings of the IEEE, vol. 106, no. 5, pp. 953–976, 2018.
• [41] S. De, A. Mukherjee, and E. Ullah, “Convergence guarantees for rmsprop and adam in non-convex optimization and an empirical comparison to nesterov acceleration,” 2018.
• [42] X. Chen, S. Liu, R. Sun, and M. Hong, “On the convergence of a class of adam-type algorithms for non-convex optimization,” arXiv preprint arXiv:1808.02941, 2018.
• [43] H. Robbins and S. Monro, “A stochastic approximation method,” in Herbert Robbins Selected Papers, pp. 102–109, Springer, 1985.
• [44] S. Boyd, P. Diaconis, and L. Xiao, “Fastest mixing markov chain on a graph,” SIAM review, vol. 46, no. 4, pp. 667–689, 2004.
• [45] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al., “Communication-efficient learning of deep networks from decentralized data,” arXiv preprint arXiv:1602.05629, 2016.
• [46] A. Beck and M. Teboulle, “Mirror descent and nonlinear projected subgradient methods for convex optimization,” Operations Research Letters, vol. 31, no. 3, pp. 167–175, 2003.
• [47] S. Shahrampour and A. Jadbabaie, “An online optimization approach for multi-agent tracking of dynamic parameters in the presence of adversarial noise,” in American Control Conference (ACC), 2017, pp. 3306–3311, IEEE, 2017.
• [48] R. A. Horn and C. R. Johnson, Matrix analysis. Cambridge university press, 1990.
• [49] H. Attouch, J. Bolte, and B. F. Svaiter, “Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized gauss–seidel methods,” Mathematical Programming, vol. 137, no. 1-2, pp. 91–129, 2013.
• [50] H. Kasai, “Sgdlibrary: A matlab library for stochastic optimization algorithms,” Journal of Machine Learning Research, vol. 18, no. 215, pp. 1–5, 2018.

## 7 Supplementary Material

Next, we establish a series of lemmas used in the proofs of the main theorems.

###### Lemma 12.

[46] Let be a nonempty closed convex set in . Then, for any , we have

$$\langle x^{*} - d,\, a \rangle \le \frac{1}{2}\|d-c\|^2 - \frac{1}{2}\|d-x^{*}\|^2 - \frac{1}{2}\|x^{*}-c\|^2,$$

where

$$x^{*} = \operatorname*{argmin}_{x\in\mathcal{X}} \Big\{ \langle a, x\rangle + \frac{1}{2}\|x-c\|^2 \Big\}.$$

###### Lemma 13.

[35] For any and convex feasible set suppose , we have

$$\big\|A^{1/2}(a_1 - a_2)\big\| \le \big\|A^{1/2}(b_1 - b_2)\big\|.$$

###### Lemma 14.

For all if satisfy , then we have

 T∑t=1p∑d=1αtm2i,t,d^υi,t,d≤α√1+logT(1−β1)(1−η)√(1−β2)p∑d=1∥gi,1:T,d∥,

where for all .

###### Proof.

Using the update rules in Algorithm 1, we have

 T∑t=1αm2i,t,d√t^υi,t,d =T−1∑t=1αm2i,t,d√t^υi,t,d+αTm2i,T,d√(1−β3)max{^υi,T−1,d,υi,T,d}+β3υi,T,d ≤T−1∑t=1αm2i,t,d√t^υi,t,d+αTm2i,T,d√(1−β3)υi,T,d+β3υi,T,d (i)=T−1∑t=1αm2i,t,d√t^υi,t,d+α(∑Tl=1(1−β1)βT−l1gi,l,d)2√T∑Tl=1(1−β2)βT−l2g2i,l,d (ii)≤T−1∑t=1αm2i,t,d√t^υi,t,d+α√T(1−β2)(∑Tl=1βT−l1)(∑Tl=1βT−l1g2i,l,d)√∑Tl=1βT−l2g2i,l,d (iii)≤T−1∑t=1αm2i,t,d√t^υi,t,d+α(1−β1)√T(1−β2)T∑l=1βT−l1g2i,l,d√βT−l2g2i,