DADAM: A Consensus-based Distributed Adaptive Gradient Method for Online Optimization
Adaptive gradient-based optimization methods such as ADAGRAD, RMSPROP, and ADAM are widely used for solving large-scale machine learning problems, including deep learning. A number of schemes have been proposed in the literature aiming to parallelize them, based on communication between peripheral nodes and a central node, but these incur high communication costs. To address this issue, we develop a novel consensus-based distributed adaptive moment estimation method (DADAM) for online optimization over a decentralized network that enables data parallelization as well as decentralized computation. The method is particularly useful, since it can accommodate settings where only access to local data is allowed. Further, as established theoretically in this work, it can outperform centralized adaptive algorithms for certain classes of loss functions used in applications. We analyze the convergence properties of the proposed algorithm and provide a dynamic regret bound on the convergence rate of adaptive moment estimation methods in both stochastic and deterministic settings. Empirical results demonstrate that DADAM also works well in practice and compares favorably to competing online optimization methods.
Online optimization is a fundamental procedure for solving a wide range of machine learning problems [1, 2]. It can be formulated as a repeated game between a learner (algorithm) and an adversary. The learner receives a streaming data sequence and sequentially selects actions, and the adversary reveals the convex or nonconvex losses to the learner. A standard performance metric for an online algorithm is regret, which measures the performance of the algorithm versus a static benchmark [3, 2]. For example, the benchmark could be an optimal point of the online average of the loss (local cost) functions, had the learner known all the losses in advance. In a broad sense, if the benchmark is a fixed sequence, the regret is called static. Recent work on online optimization has investigated the notion of dynamic regret [3, 4, 5], which can take the form of the cumulative difference between the instantaneous loss and the minimum loss. For convex functions, previous studies have shown that the dynamic regret of online gradient-based methods can be upper bounded by $\mathcal{O}(\sqrt{T}(1+P_T))$, where $P_T$ is a measure of regularity of the comparator sequence or the function sequence [3, 4, 6]. This bound can be improved to $\mathcal{O}(1+P_T)$ [7, 8] when the cost function is strongly convex and smooth.
Decentralized nonlinear programming has received a lot of interest in diverse scientific and engineering fields [9, 10, 11, 12]. The key problem involves optimizing a cost function $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, where each $f_i$ is known only to the individual agent $i$ in a connected network of $n$ agents. The agents collaborate by successively sharing information with other agents located in their neighborhood, with the goal of jointly converging to the network-wide optimal argument. Compared to optimization procedures involving a fusion center that collects data and performs the computation, decentralized nonlinear programming enjoys the advantages of scalability with the size of the network, robustness to the network topology, and privacy preservation in data-sensitive applications.
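The consensus scheme described above can be sketched in a few lines: each agent mixes its local copy with its neighbors' copies through a mixing matrix and then takes a local gradient step. This is an illustrative sketch of plain decentralized gradient descent with assumed quadratic local costs, not the paper's algorithm:

```python
import numpy as np

def decentralized_gd(grads, W, x0, lr=0.1, iters=200):
    """Decentralized gradient descent sketch: each agent takes a local
    gradient step, then averages with neighbors via the mixing matrix W.
    `grads[i]` is agent i's local gradient oracle."""
    n = len(grads)
    X = np.tile(np.asarray(x0, dtype=float), (n, 1))      # row i = agent i's copy
    for _ in range(iters):
        G = np.stack([grads[i](X[i]) for i in range(n)])  # local gradients
        X = W @ (X - lr * G)                              # local step, then mixing
    return X

# Toy network objective: f(x) = (1/n) * sum_i 0.5*(x - c_i)^2, minimizer mean(c).
c = np.array([0.0, 1.0, 2.0])
grads = [lambda x, ci=ci: x - ci for ci in c]
W = np.full((3, 3), 1.0 / 3.0)                            # fully connected mixing
X = decentralized_gd(grads, W, x0=[0.0])                  # all rows approach 1.0
```

With a fully connected mixing matrix, all local copies converge to the network-wide minimizer, the average of the $c_i$.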
A popular algorithm in decentralized optimization is decentralized gradient descent, which has been studied in [13, 14]. Convergence results for convex problems with bounded gradients are given in [15], while analogous convergence results, even for nonconvex problems, are given in [16]. Convergence is accelerated by using corrected update rules and momentum techniques [17, 18, 19, 20]. The primal-dual [21, 22], ADMM [23, 24], and zero-order approaches are related to the dual decentralized gradient method. We also point out recent work on a very efficient consensus-based decentralized stochastic gradient (DSGD) method for deep learning over fixed-topology networks, and earlier work on decentralized gradient methods for nonconvex deep learning problems. Further, under some mild assumptions, it has been shown that decentralized algorithms can be faster than their centralized counterparts for certain stochastic nonconvex loss functions.
Appropriately choosing the learning rate that scales the coordinates of the gradient, as well as the way of updating them, are crucial issues driving the performance of first- and second-order optimization procedures. Indeed, the understanding that adaptation of the learning rate is advantageous, particularly in a dynamic fashion and on a per-parameter basis, led to the development of a family of widely used adaptive gradient methods including ADAGRAD, ADADELTA, RMSPROP, ADAM, and AMSGRAD. The ADAM optimizer computes adaptive learning rates for different parameters from estimates of the first and second moments of the gradients and performs a local optimization. Numerical results show that ADAM can achieve significantly better performance than ADAGRAD, ADADELTA, RMSPROP, and other gradient descent procedures when the gradients are sparse or, in general, small in magnitude. However, its performance has been observed to deteriorate with either nonconvex loss functions or dense gradients. Further, there is currently a gap in the theoretical understanding of ADAM, especially in the nonconvex and stochastic setting [33, 34].
In this paper, we develop and analyze a new consensus-based distributed adaptive moment estimation (DADAM) method that incorporates decentralized optimization and uses a variant of adaptive moment estimation methods [27, 31, 35]. Existing distributed stochastic and adaptive gradient methods for deep learning are mostly designed for a central network topology [36, 37]. The main bottleneck of such a topology lies in the communication overhead on the central node, since all nodes need to communicate with it concurrently. Hence, performance can be significantly degraded when network bandwidth is limited. These considerations motivate us to study an adaptive algorithm for network topologies where all nodes can only communicate with their neighbors and none of the nodes is designated as “central”. Therefore, the proposed method is suitable for large-scale machine learning problems, since it enables both data parallelization and decentralized computation.
Next, we briefly summarize the main technical contributions of the work.
Our first main result (Theorem 5) provides guarantees for DADAM for constrained convex minimization problems defined over a closed convex set $\mathcal{X}$. We provide the convergence bound in terms of dynamic regret and show that when the data features are sparse and have bounded gradients, our algorithm’s regret bound can be considerably better than the ones provided by standard mirror descent and gradient descent methods [13, 4, 5]. It is worth mentioning that the regret bounds provided for adaptive gradient methods in the literature are static, and our results generalize them to dynamic settings.
In Theorem 8, we give a new local regret analysis for distributed online gradient-based algorithms for constrained nonconvex minimization problems computed over a network of agents. Specifically, we prove that under certain regularity conditions, DADAM achieves a sublinear local regret bound for nonconvex distributed optimization. To the best of our knowledge, rigorous extensions of existing adaptive gradient methods to the distributed nonconvex setting considered in this work are not available.
In this paper, we also present regret analysis for distributed stochastic optimization problems computed over a network of agents. Theorems 6 and 10 provide regret bounds for DADAM for minimization problem (2) with stochastic gradients, and indicate that the results of Theorems 5 and 8 hold true in expectation. Further, in Corollary 11 we show that DADAM achieves a sublinear local regret bound for nonconvex distributed stochastic optimization, where the bound depends on $\sigma^2$, an upper bound on the variance of the stochastic gradient. Hence, DADAM outperforms centralized adaptive algorithms such as ADAM for certain realistic classes of loss functions when $T$ is sufficiently large.
In summary, a distinguishing feature of this work is the incorporation of adaptive learning with data parallelization, as well as the extension to the stochastic setting with both convex and nonconvex objective functions. Further, the established technical results differ from those in [13, 14, 15, 16] through the notion of adaptive constrained optimization in online and dynamic settings.
The remainder of the paper is organized as follows. Section 2 gives a detailed description of DADAM, while Section 3 establishes its theoretical results. Section 4 explains a network correction technique for our proposed algorithm. Section 5 illustrates the proposed framework on a number of synthetic and real data sets. Finally, Section 6 concludes the paper.
The detailed proofs of the main results established are delegated to the Appendix.
Throughout the paper, $\mathbb{R}^d$ denotes the $d$-dimensional real space. For any pair of vectors $x, y \in \mathbb{R}^d$, $\langle x, y \rangle$ indicates the standard Euclidean inner product. We denote the $\ell_1$ norm by $\|\cdot\|_1$, the infinity norm by $\|\cdot\|_\infty$, and the Euclidean norm by $\|\cdot\|$; for matrix arguments, these reduce to the vector norms when the argument is a vector. The diameter of the set $\mathcal{X}$ is given by $D = \max_{x, y \in \mathcal{X}} \|x - y\|$.
Let $\mathcal{S}^d_{+}$ be the set of all positive definite $d \times d$ matrices. $\Pi_{\mathcal{X}, A}(\cdot)$ denotes the Euclidean projection of a vector onto $\mathcal{X}$ for $A \in \mathcal{S}^d_{+}$:
$\Pi_{\mathcal{X}, A}(y) = \operatorname{argmin}_{x \in \mathcal{X}} \|A^{1/2}(x - y)\|$.
The subscript $t$ is often used to denote the time step, while $x_{t,i}$ stands for the $i$-th element of $x_t$. Further, for a vector $x$, $\sqrt{x}$ denotes its elementwise square root.
We let $\nabla f(x)$ denote the gradient of $f$ at $x$. The $i$-th largest singular value of a matrix $W$ is denoted by $\sigma_i(W)$. We denote the element in the $i$-th row and $j$-th column of a matrix $W$ by $[W]_{ij}$. In several theorems, we consider a connected undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with node set $\mathcal{V}$ and edge set $\mathcal{E}$. The matrix $W$ is often used to denote the adjacency matrix of graph $\mathcal{G}$.
The Hadamard (entrywise) and Kronecker products are denoted by $\odot$ and $\otimes$, respectively. Finally, the expectation operator is denoted by $\mathbb{E}$.
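The Euclidean projection above has a simple closed form for basic sets. A minimal sketch for the $\ell_2$ ball, taking $A$ to be the identity (an illustrative special case, not the general weighted projection):

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection of y onto the l2 ball {x : ||x|| <= radius}:
    points inside are unchanged, points outside are rescaled onto the boundary."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

p = project_ball(np.array([3.0, 4.0]), radius=1.0)   # -> [0.6, 0.8]
```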
In this section, we propose a new online adaptive optimization method (DADAM) that employs data parallelization and decentralized computation over a network of agents. Given a connected undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, we let each node $i \in \mathcal{V}$ at time $t$ hold its own measurement and training data, as well as a local copy of the global variable, denoted by $x_{i,t}$. With this setup, we present a distributed adaptive gradient method for solving the minimization problem
where each $f_i$ is a continuously differentiable mapping on the convex set $\mathcal{X}$.
DADAM uses a new distributed adaptive gradient method in which a group of agents aims to solve a sequential version of problem (2). Here, we assume that each component function $f_{i,t}$ becomes available to agent $i$ only after it has made its decision at time $t$. In the $t$-th step, the $i$-th agent chooses a point $x_{i,t}$ corresponding to what it considers a good selection for the network as a whole. After committing to this choice, the agent has access to a cost function $f_{i,t}$, and the network cost is then given by $f_t(x) = \frac{1}{n}\sum_{i=1}^{n} f_{i,t}(x)$. Note that this function is not known to any of the agents and is not available at any single location.
The procedure of our proposed method is outlined in Algorithm 1.
It is worth mentioning that DADAM includes distributed variants of many well-known adaptive algorithms as special cases, such as ADAGRAD, RMSPROP, and AMSGRAD. We also note that DADAM computes adaptive learning rates from estimates of both first and second moments of the gradients, similar to AMSGRAD. However, DADAM uses a larger learning rate than AMSGRAD, yet incorporates the intuition of slowly decaying the effect of previous gradients on the learning rate. The key difference between DADAM and AMSGRAD is that DADAM maintains a maximum of the second moment estimates of the gradient vectors up to time step $t$, discounted by a decay parameter $\beta_3$, and uses this quantity for normalizing the running average of the gradient, instead of the raw second moment estimate used in ADAM and the undiscounted running maximum used in AMSGRAD. The decay parameter $\beta_3$ is an important component of the DADAM framework, since it enables us to develop a convergent adaptive method similar to AMSGRAD, while maintaining the efficiency of ADAM.
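To make this concrete, the local (per-agent, pre-consensus) portion of such an update can be sketched as below. The recursion for the decayed running maximum, the step-size scheme, and the bias corrections are illustrative assumptions here; the exact update is given in Algorithm 1:

```python
import numpy as np

def adaptive_step(x, g, m, v, vhat, lr=0.01, b1=0.9, b2=0.999, b3=0.99, eps=1e-8):
    """One hypothetical local adaptive-moment step (a sketch, not the exact
    DADAM recursion): EMAs of the first and second moments, plus a decayed
    running maximum of the second moment used for normalization.
    Setting b3 = 1 recovers the undiscounted AMSGRAD-style maximum."""
    m = b1 * m + (1 - b1) * g               # first-moment estimate
    v = b2 * v + (1 - b2) * g * g           # second-moment estimate
    vhat = np.maximum(b3 * vhat, v)         # decayed running maximum
    x = x - lr * m / (np.sqrt(vhat) + eps)  # normalized gradient step
    return x, m, v, vhat

# Minimize 0.5*x^2 from x = 1 with the sketched update.
x = np.array([1.0])
m, v, vhat = np.zeros(1), np.zeros(1), np.zeros(1)
for _ in range(500):
    g = x.copy()                            # gradient of 0.5*x^2 at x
    x, m, v, vhat = adaptive_step(x, g, m, v, vhat)
```

On this toy quadratic the iterate settles near the minimizer at zero, with the decayed maximum letting the effective step size recover as old large gradients are forgotten.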
Next, we introduce the measure of regret for assessing the performance of DADAM against a sequence of successive minimizers. In the framework of online convex optimization, the performance of an algorithm is assessed by regret, which measures how competitive the algorithm is with respect to the best fixed solution [38, 2]. However, this notion of regret fails to capture the performance of online algorithms in a dynamic setting. To overcome this issue, we consider a more stringent metric, dynamic regret [4, 5, 3], in which the cumulative loss of the learner is compared against that of the minimizer sequence $\{x_t^*\}$ with $x_t^* \in \operatorname{argmin}_{x \in \mathcal{X}} f_t(x)$, i.e., $\mathrm{Reg}^d_T = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x_t^*)$.
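The dynamic regret comparison can be illustrated numerically. The toy example below assumes drifting quadratic losses and a hypothetical learner that always plays the previous round's minimizer, so each round contributes the squared drift:

```python
import numpy as np

def dynamic_regret(losses, plays, minimizers):
    """Cumulative loss of the plays minus that of the per-round minimizers:
    Reg^d_T = sum_t [ f_t(x_t) - f_t(x_t^*) ]."""
    return sum(f(x) - f(xs) for f, x, xs in zip(losses, plays, minimizers))

# Drifting quadratics f_t(x) = (x - t/10)^2; the learner lags one step behind.
losses = [lambda x, t=t: (x - t / 10.0) ** 2 for t in range(5)]
minimizers = [t / 10.0 for t in range(5)]
plays = [0.0] + minimizers[:-1]          # x_t = previous round's minimizer
reg = dynamic_regret(losses, plays, minimizers)   # 4 rounds of 0.1^2 -> 0.04
```

The regret here grows with how fast the minimizers move, which is exactly what path-length-type regularity measures capture.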
On the other hand, in the framework of nonconvex optimization, it is usual to state convergence guarantees of an algorithm towards an $\epsilon$-approximate stationary point; that is, there exists some iterate $x_t$ for which $\|\nabla f(x_t)\|^2 \leq \epsilon$. Influenced by this line of work, we next give the definition of the projected gradient and introduce local regret, a new notion of regret which quantifies the moving average of gradients over a network.
(Local Regret). Assume $f$ is a differentiable function on a closed convex set $\mathcal{X}$. Given a step-size $\eta > 0$, we define the projected gradient of $f$ at $x$ by
$\nabla_{\mathcal{X}, \eta} f(x) = \frac{1}{\eta}\big(x - \Pi_{\mathcal{X}}(x - \eta \nabla f(x))\big)$.
Then, the local regret of an online algorithm is given by
$\mathrm{Reg}^l_T = \sum_{t=1}^{T} \|\nabla_{\mathcal{X}, \eta} F_t(x_t)\|^2$,
where $F_t$ is an aggregate loss.
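The projected gradient can be computed directly from its definition; when the constraint is inactive it reduces to the ordinary gradient. A small sketch with an assumed box constraint:

```python
import numpy as np

def projected_gradient(x, grad, project, eta=0.1):
    """Projected gradient (1/eta) * (x - Pi_X(x - eta * grad(x))).
    Equals grad(x) whenever the projection leaves the step unchanged."""
    return (x - project(x - eta * grad(x))) / eta

grad = lambda x: 2.0 * x                      # gradient of ||x||^2
project = lambda y: np.clip(y, -1.0, 1.0)     # projection onto the box [-1, 1]^d

inside = projected_gradient(np.array([0.2]), grad, project)  # constraint inactive
```

At x = 0.2 the step stays inside the box, so the projected gradient equals the plain gradient 0.4; a vanishing projected gradient certifies a constrained stationary point.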
We analyze the convergence of DADAM as applied to minimization problem (2) using the regrets $\mathrm{Reg}^d_T$ and $\mathrm{Reg}^l_T$. It is worth mentioning that DADAM is initialized at a common point across agents to keep the presentation of the convergence analysis clear. In general, any initialization can be selected for implementation purposes.
In this section, our aim is to establish convergence properties of DADAM under the following assumptions:
The adjacency matrix $W$ of graph $\mathcal{G}$ is doubly stochastic with positive diagonal. More specifically, the weights $[W]_{ij}$ on the information received from agent $j$ satisfy
$W\mathbf{1} = \mathbf{1}$, $\mathbf{1}^\top W = \mathbf{1}^\top$, and $[W]_{ii} > 0$ for all $i$.
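A matrix satisfying this assumption is easy to construct for any connected undirected graph, for instance with Metropolis-Hastings weights (one standard construction; the paper does not mandate a particular choice):

```python
import numpy as np

def metropolis_weights(adj):
    """Build a symmetric, doubly stochastic mixing matrix with positive
    diagonal from a 0/1 undirected adjacency matrix (Metropolis weights)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()          # remaining mass on the diagonal
    return W

# 4-node ring network.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]])
W = metropolis_weights(adj)
```

Symmetry plus unit row sums gives double stochasticity, and the leftover diagonal mass is strictly positive, as the assumption requires.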
For all $i$ and $t$, the function $f_{i,t}$ is continuously differentiable over $\mathcal{X}$ and has Lipschitz continuous gradient on this set, i.e., there exists a constant $L$ so that
$\|\nabla f_{i,t}(x) - \nabla f_{i,t}(y)\| \leq L \|x - y\|$ for all $x, y \in \mathcal{X}$.
Further, $f_{i,t}$ is Lipschitz continuous on $\mathcal{X}$ with a uniform constant $G$, i.e.,
$|f_{i,t}(x) - f_{i,t}(y)| \leq G \|x - y\|$ for all $x, y \in \mathcal{X}$.
Let $g_{i,t}$ denote the stochastic gradient observed by agent $i$ after computing the estimate $x_{i,t}$. For decentralized stochastic tracking and learning, we need the following assumption, which guarantees that $g_{i,t}$ is unbiased and has bounded second moment.
For all $i$ and $t$, the stochastic gradient $g_{i,t}$ satisfies
$\mathbb{E}[g_{i,t} \mid \mathcal{F}_t] = \nabla f_{i,t}(x_{i,t})$ and $\mathbb{E}[\|g_{i,t} - \nabla f_{i,t}(x_{i,t})\|^2 \mid \mathcal{F}_t] \leq \sigma^2$,
where $\mathcal{F}_t$ is the $\sigma$-field containing all information prior to the outset of round $t$.
Next, we focus on the case where, for all $i$ and $t$, agent $i$ at time $t$ has access to the exact gradient $\nabla f_{i,t}(x_{i,t})$. The following results apply to convex problems, as well as their stochastic variants in a dynamic environment.
Theorems 5 and 6 characterize the hardness of the problem via a complexity measure that captures the pattern of the minimizer sequence $\{x_t^*\}$, where $x_t^* \in \operatorname{argmin}_{x \in \mathcal{X}} f_t(x)$. Specifically, we provide a regret bound in terms of the path length
$P_T = \sum_{t=1}^{T-1} \|x_{t+1}^* - x_t^*\|$,
which represents the variations in $\{x_t^*\}$.
Further, these theorems establish a tight connection between the convergence rate of distributed adaptive methods and the spectral properties of the underlying network. The inverse dependence on the spectral gap $1 - \sigma_2(W)$ is quite natural, and for many families of undirected graphs we can give order-accurate estimates of this gap, which translate into estimates of the convergence time.
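The spectral gap is cheap to compute for any given mixing matrix, and the dependence on topology is easy to see numerically: rings mix slowly, so their gap shrinks as the network grows. A small sketch (uniform-weight ring mixing is an illustrative choice):

```python
import numpy as np

def spectral_gap(W):
    """Spectral gap 1 - sigma_2(W): the second-largest singular value of the
    mixing matrix controls how fast information mixes across the network."""
    s = np.linalg.svd(W, compute_uv=False)   # singular values, descending
    return 1.0 - s[1]

def ring_mixing(n):
    """Uniform mixing matrix on an n-node ring (1/3 on self and neighbors)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

gap_small = spectral_gap(ring_mixing(5))     # larger gap: fast mixing
gap_large = spectral_gap(ring_mixing(50))    # tiny gap: slow mixing
```

The 50-node ring has a far smaller gap than the 5-node ring, which translates into a correspondingly longer consensus (and hence convergence) time.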
We note that when the gradients are sparse, the data-dependent terms in the regret bound are much smaller than their worst-case values. Hence, similar to adaptive algorithms such as ADAM, ADAGRAD, and AMSGRAD, the regret bound of DADAM can be considerably better than the ones provided by standard mirror descent and gradient descent methods in both centralized [3, 4, 5, 6] and decentralized settings.
In this section, we provide convergence guarantees for DADAM for the nonconvex minimization problem (2) defined over a closed convex set $\mathcal{X}$. To do so, we use the Euclidean projection map $\Pi_{\mathcal{X}}$ for updating the parameters (see Algorithm 1 for details).
To analyze the convergence of DADAM in the nonconvex setting, we assume that for all $i$ and $t$,
$\|\nabla f_{i,t}(x)\|_\infty \leq G_\infty$ for all $x \in \mathcal{X}$,
for some finite constant $G_\infty$. This assumption has already been used by several authors [33, 41, 42] to establish the convergence of adaptive methods in the nonconvex setting. The property (8), together with the update rule of $m_{i,t}$, which is an exponential moving average of the gradients, gives $\|m_{i,t}\|_\infty \leq G_\infty$. On the other hand, the moment vector $\hat{v}_{i,t}$ is non-decreasing, so its entries are bounded below by a positive constant. Hence, the effective step-sizes of DADAM remain bounded for all $i$ and $t$.
The following theorem establishes the convergence rate of decentralized adaptive methods in the nonconvex setting.
The following corollary shows that DADAM using a certain step-size leads to a near optimal regret bound for nonconvex functions.
Under the same conditions as in Theorem 8, with suitably chosen diminishing step-sizes for all $t$, the stated near-optimal regret bound follows.
To complete the analysis of our algorithm in the nonconvex setting, we provide the regret bound for DADAM, when stochastic gradients are accessible to the learner.
We next theoretically justify the potential advantage of the proposed decentralized algorithm DADAM over centralized adaptive moment estimation methods such as ADAM. More specifically, the following corollary shows that when $T$ is sufficiently large, the network-dependent term is dominated by the leading term, which leads to a convergence rate that benefits from the number of agents in the network.
Compared to classical centralized algorithms, decentralized algorithms encounter more restrictive assumptions and typically worse convergence rates. Recently, for time-invariant graphs, a corrected decentralized gradient method was introduced in order to cancel the steady-state error of decentralized gradient descent, attaining a linear rate of convergence if the objective function is strongly convex. Analogous convergence results have been given even for the case of time-variant graphs. Similar to [17, 18], we next provide a corrected update rule for adaptive methods, given by
for all $i$ and $t$, where $x_{i,t}$ is generated by Algorithm 1.
We note that a C-DADAM update is a DADAM update with a cumulative correction term. The summation in (15) is necessary, since each individual term is asymptotically vanishing and the terms must work cumulatively.
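The cumulative-correction idea can be sketched for plain decentralized gradient descent, in the style of EXTRA-type methods. This is an illustrative sketch of the correction mechanism (here applied to a non-adaptive update, with $\tilde{W} = (I+W)/2$ an assumed choice), not the exact C-DADAM recursion:

```python
import numpy as np

def corrected_dgd(grads, W, x0, lr=0.1, iters=300):
    """Decentralized gradient descent with an EXTRA-style cumulative correction
    term that cancels the steady-state consensus error of plain DGD.
    Each term (W - Wt) @ X vanishes at consensus, so the sum must act
    cumulatively - mirroring the role of the summation in (15)."""
    n = len(grads)
    X = np.tile(np.asarray(x0, dtype=float), (n, 1))
    C = np.zeros_like(X)                      # accumulated correction
    Wt = (np.eye(n) + W) / 2.0                # assumed choice: (I + W) / 2
    for _ in range(iters):
        G = np.stack([grads[i](X[i]) for i in range(n)])
        X_new = W @ X - lr * G + C            # corrected update
        C = C + (W - Wt) @ X                  # grow the cumulative correction
        X = X_new
    return X

# Quadratic local costs 0.5*(x - c_i)^2; exact network minimizer is mean(c) = 1.
c = np.array([0.0, 1.0, 2.0])
grads = [lambda x, ci=ci: x - ci for ci in c]
W = np.full((3, 3), 1.0 / 3.0)
X = corrected_dgd(grads, W, x0=[5.0])
```

With a fixed step-size, plain DGD would stall at a steady-state error, whereas the corrected iterates reach the exact minimizer, which is the motivation for adding the analogous correction to DADAM.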
In this section, we evaluate the effectiveness of the proposed DADAM-type algorithms, namely DADAGRAD, DADADELTA, DRMSPROP, and DADAM, by comparing them with SGD, decentralized SGD (DSGD) [13, 20, 26], and corrected decentralized SGD (C-DSGD) [17, 19].
The corrected variants of the DADAM-type algorithms are denoted by C-DADAGRAD, C-DADADELTA, C-DRMSPROP, and C-DADAM. We also note that if the mixing matrix in Algorithm 1 is chosen to be the identity matrix, then the above algorithms reduce to their centralized counterparts. These algorithms are implemented with their default settings (https://keras.io/optimizers/).
All algorithms have been run on a Mac machine equipped with a 1.8 GHz Intel Core i5 processor and 8 GB of 1600 MHz DDR3 memory. Code to reproduce the experiments can be found at https://github.com/Tarzanagh/DADAM.
Next, we mainly focus on the convergence rate of the algorithms instead of their running time. This is because the implementation of the DADAM-type algorithms is a minor change over standard decentralized stochastic algorithms such as DSGD and C-DSGD; thus, they take almost the same time to finish one epoch of training, and both are faster than centralized stochastic algorithms such as ADAM and SGD. We note that with high network latency, if a decentralized algorithm (DADAM or DSGD) converges in a similar running time as its centralized counterpart, it can be up to one order of magnitude faster. However, the convergence rate, which depends on the “adaptiveness,” differs between the two types of algorithms.
Consider the following online distributed learning setting: at each time $t$, randomly generated data points are given to every agent $i$ in the form of feature-label pairs. Our goal is to learn the model parameter by solving the regularized finite-sum minimization problem (2), with each local cost composed of a loss term and a regularization term, where $\ell$ is the loss function and $\lambda$ is the regularization parameter.
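As one concrete instance of such a regularized local cost (an illustrative choice; the experiments' exact loss and regularizer are specified above only abstractly), consider $\ell_2$-regularized logistic regression:

```python
import numpy as np

def logistic_loss_grad(x, A, b, lam):
    """Loss and gradient of l2-regularized logistic regression:
    f(x) = mean_j log(1 + exp(-b_j * a_j^T x)) + (lam/2) * ||x||^2,
    with features A (rows a_j) and labels b in {-1, +1}."""
    z = b * (A @ x)
    loss = np.mean(np.log1p(np.exp(-z))) + 0.5 * lam * (x @ x)
    grad = -(A.T @ (b / (1.0 + np.exp(z)))) / len(b) + lam * x
    return loss, grad

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3))
b = np.sign(rng.standard_normal(10))
loss, grad = logistic_loss_grad(np.zeros(3), A, b, lam=0.01)  # loss = log(2) at x = 0
```

At the zero vector every example is maximally uncertain, so the unregularized loss equals log 2 regardless of the data, which is a handy sanity check for the implementation.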
For the constraint set $\mathcal{X}$, we consider the $\ell_1$ ball $\{x : \|x\|_1 \leq r\}$ when a sparse classifier is preferred.
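Projection onto the $\ell_1$ ball, needed for the constrained updates in this setting, has an efficient sort-based solution via soft-thresholding (a standard construction in the style of Duchi et al.'s simplex projection; this sketch is not taken from the paper):

```python
import numpy as np

def project_l1_ball(y, r=1.0):
    """Euclidean projection onto {x : ||x||_1 <= r}: find the soft-threshold
    theta such that sum(max(|y_i| - theta, 0)) = r, then shrink toward zero.
    Encourages sparse iterates, as desired for a sparse classifier."""
    if np.abs(y).sum() <= r:
        return y.copy()                        # already feasible
    u = np.sort(np.abs(y))[::-1]               # magnitudes, descending
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(y) + 1) > (css - r))[0][-1]
    theta = (css[k] - r) / (k + 1.0)
    return np.sign(y) * np.maximum(np.abs(y) - theta, 0.0)

p = project_l1_ball(np.array([0.5, -1.5, 0.0]), r=1.0)   # -> [0, -1, 0]
```

Note how the projection zeroes out the small coordinate entirely, which is exactly the sparsity-inducing behavior that motivates the $\ell_1$ constraint.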
Following Theorem 5, we choose a constant step-size as well as diminishing step-sizes in order to evaluate the adaptive strategies. All other parameters of the algorithms and problems are kept fixed across runs; in particular, the minibatch size is set to 10, along with fixed values of the regularization parameter and the dimension of the model parameter.
The numerical results are illustrated in Figure 1 for the synthetic datasets. It can be seen that the distributed adaptive algorithms significantly outperform DSGD and its corrected variants.
Next, we present experimental results for the MNIST digit recognition task. The model for training a simple multilayer perceptron (MLP) on the MNIST dataset was taken from the Keras GitHub repository (https://github.com/keras-team/keras). In our implementation, the model has 15 dense layers of size 64. A small regularization term with regularization parameter 0.00001 is added to the weights of the network, and the minibatch size is set to 32.
We compare the accuracy of DADAM with that of DSGD and the Federated Averaging (FedAvg) algorithm, which also performs data parallelization but without decentralized computation. The parameters for DADAM are selected in a way similar to the previous experiments. In our implementation, we use the same number of agents, and choose the FedAvg parameters so that its setting is close to the connected-topology scenario considered for DADAM and ADAM. It can easily be seen from Figure 2 that DADAM achieves high accuracy in comparison with DSGD and FedAvg.
A decentralized adaptive moment estimation method (DADAM) was proposed for the distributed learning of deep networks, based on adaptive estimates of the first and second moments of the gradients. Convergence properties of the proposed algorithm were established for convex and nonconvex functions in both stochastic and deterministic settings. Numerical results on synthetic and real datasets demonstrate the efficiency and effectiveness of the proposed method in practice.
The second author is grateful for a discussion of decentralized methods with Davood Hajinezhad.
X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent,” in Advances in Neural Information Processing Systems, pp. 5330–5340, 2017.
J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.
T. Tieleman and G. Hinton, “Lecture 6.5-RMSPROP: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, 2012.
S. Boyd, P. Diaconis, and L. Xiao, “Fastest mixing Markov chain on a graph,” SIAM Review, vol. 46, no. 4, pp. 667–689, 2004.
Next, we establish a series of lemmas used in the proof of main theorems.
Let $\mathcal{X}$ be a nonempty closed convex set in $\mathbb{R}^d$. Then, for any $x$, we have
 For any and convex feasible set suppose , we have
For all if satisfy , then we have
where for all .
Using the update rules in Algorithm 1, we have