1 Introduction
It often takes hours to train large deep learning tasks such as ImageNet, even with hundreds of GPUs (Goyal et al., 2017). At this scale, how workers communicate becomes a crucial design choice. Most existing systems such as TensorFlow (Abadi et al., 2016), MXNet (Chen et al., 2015), and CNTK (Seide and Agarwal, 2016) support two communication modes: (1) synchronous communication via parameter servers or AllReduce, or (2) asynchronous communication via parameter servers. When there are stragglers (i.e., slower workers) in the system, which is common especially at the scale of hundreds of devices, asynchronous approaches are more robust. However, most asynchronous implementations have a centralized design, as illustrated in Figure 1(a): a central server holds the shared model for all other workers. Each worker calculates its own gradients and updates the shared model asynchronously. The parameter server may become a communication bottleneck and slow down convergence. We focus on the question: Can we remove the central server bottleneck in asynchronous distributed learning systems while maintaining the best possible convergence rate?
Table 1: Comparison of the algorithms.
Algorithm  |  Communication complexity (n.t./n.h.)  |  Idle time  |
SPSGD (Ghadimi et al., 2016)  |  Long (/)  |  Long  |
APSGD (Lian et al., 2015)  |  Long (/)  |  Short  |
AllReduceSGD (Luehr, 2016)  |  Medium ()  |  Long  |
DPSGD (Lian et al., 2017)  |  Short ()  |  Long  |
ADPSGD (this paper)  |  Short ()  |  Short  |
Note: n.t. means the number of gradients/models transferred at the busiest worker per (minibatch of) stochastic gradients updated. n.h. means the number of handshakes at the busiest worker per (minibatch of) stochastic gradients updated.
Recent work (Lian et al., 2017) shows that synchronous decentralized parallel stochastic gradient descent (DPSGD) can achieve a convergence rate comparable to that of its centralized counterparts without any central bottleneck. Figure 1(b) illustrates one communication topology of DPSGD in which each worker only talks to its neighbors. However, the synchronous nature of DPSGD makes it vulnerable to stragglers because of the synchronization barrier at each iteration among all workers. Is it possible to get the best of both worlds of asynchronous SGD and decentralized SGD?
In this paper, we propose the asynchronous decentralized parallel stochastic gradient descent algorithm (ADPSGD), which is theoretically justified to keep the advantages of both asynchronous SGD and decentralized SGD. In ADPSGD, workers do not wait for all others and only communicate in a decentralized fashion. ADPSGD can achieve linear speedup with respect to the number of workers and admits a convergence rate of $O(1/\sqrt{K})$, where $K$ is the number of updates. This rate is consistent with DPSGD and centralized parallel SGD. By design, ADPSGD enables wait-free computation and communication, which ensures ADPSGD always converges better (w.r.t. epochs or wall time) than DPSGD, as the former allows much more frequent information exchange.
In practice, we found that ADPSGD is particularly useful in heterogeneous computing environments such as cloud computing, where the speed of computing/communication devices often varies. We implement ADPSGD in Torch and MPI and evaluate it on an IBM S822LC cluster of up to 128 P100 GPUs. We show that, on real-world datasets such as ImageNet, ADPSGD has the same empirical convergence rate as its centralized and/or synchronous counterpart. In heterogeneous environments, ADPSGD can be faster than its fastest synchronous counterparts by orders of magnitude. On an HPC cluster with homogeneous computing devices but a shared network, ADPSGD can still outperform its synchronous counterparts by 4X-8X.
Both the theoretical analysis and system implementations of ADPSGD are nontrivial, and they form the two technical contributions of this work.
2 Related work
We review related work in this section. In the following, $K$ and $n$ refer to the number of iterations and the number of workers, respectively. A comparison of the algorithms can be found in Table 1.
Stochastic Gradient Descent (SGD) (Nemirovski et al., 2009; Moulines and Bach, 2011; Ghadimi and Lan, 2013) is a powerful approach for solving large-scale machine learning problems, with the optimal convergence rate $O(1/\sqrt{K})$ on nonconvex problems.
For Synchronous Parallel Stochastic Gradient Descent (SPSGD), every worker fetches the model saved in a parameter server and computes a minibatch of stochastic gradients. Then they push the stochastic gradients to the parameter server. The parameter server synchronizes all the stochastic gradients and updates their average into the model saved in the parameter server, which completes one iteration. The convergence rate is proved to be $O(1/\sqrt{nK})$ on nonconvex problems (Ghadimi et al., 2016). Results on convex objectives can be found in Dekel et al. (2012). Due to the synchronization step, all other workers have to stay idle to wait for the slowest one. In each iteration the parameter server needs to synchronize $O(n)$ workers, which causes high communication cost at the parameter server, especially when $n$ is large.
The Asynchronous Parallel Stochastic Gradient Descent (APSGD) (Recht et al., 2011; Agarwal and Duchi, 2011; Feyzmahdavian et al., 2016; Paine et al., 2013) breaks the synchronization in SPSGD by allowing workers to use stale weights to compute gradients. Asynchronous algorithms significantly reduce the communication overhead by avoiding idling any worker and can still work well when part of the computing workers are down. On nonconvex problems, when the staleness of the weights used is upper bounded, APSGD is proved to admit the same convergence rate as SPSGD (Lian et al., 2015, 2016).
In the AllReduce Stochastic Gradient Descent implementation (AllReduceSGD) (Luehr, 2016; Patarasuk and Yuan, 2009; MPI contributors, 2015), the update rule per iteration is exactly the same as in SPSGD, so they share the same convergence rate. However, there is no parameter server in AllReduceSGD. The workers are connected with a ring network and each worker keeps the same local copy of the model. In each iteration, each worker calculates a minibatch of stochastic gradients. Then all the workers use AllReduce to synchronize the stochastic gradients, after which each worker will get the average of all stochastic gradients. In this procedure, only an $O(1)$ amount of gradient is sent/received per worker, but $O(n)$ handshakes are needed on each worker. This makes AllReduce slow on high-latency networks. At the end of the iteration the averaged gradient is updated into the local model of each worker. Since we still have synchronization in each iteration, the idle time is still high, as in SPSGD.
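To make the per-worker traffic of ring AllReduce concrete, the following is a minimal single-process sketch of one common realization (reduce-scatter followed by all-gather). It is not the implementation used in this paper; the function name `ring_allreduce_average` and the requirement that the vector length be divisible by the number of workers are assumptions made for illustration.

```python
def ring_allreduce_average(grads):
    """Average equal-length gradient vectors held by n workers on a ring.

    Returns (per-worker averaged copies, messages sent per worker,
    vector elements sent per worker).
    """
    n = len(grads)
    d = len(grads[0])
    assert d % n == 0, "sketch assumes the vector splits into n equal chunks"
    c = d // n
    buf = [list(g) for g in grads]
    messages = 0   # handshakes initiated by each worker
    elements = 0   # vector elements sent by each worker

    def chunk(idx):
        return range(idx * c, (idx + 1) * c)

    # Reduce-scatter: after n-1 steps worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n) for i in range(n)]
        payloads = [[buf[i][j] for j in chunk(k)] for i, k in sends]
        for (i, k), payload in zip(sends, payloads):
            dst = (i + 1) % n
            for off, j in enumerate(chunk(k)):
                buf[dst][j] += payload[off]
        messages += 1
        elements += c

    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n) for i in range(n)]
        payloads = [[buf[i][j] for j in chunk(k)] for i, k in sends]
        for (i, k), payload in zip(sends, payloads):
            dst = (i + 1) % n
            for off, j in enumerate(chunk(k)):
                buf[dst][j] = payload[off]
        messages += 1
        elements += c

    averaged = [[v / n for v in b] for b in buf]
    return averaged, messages, elements


grads = [[float(i + 1)] * 8 for i in range(4)]   # 4 workers, 8-element gradients
averaged, messages, elements = ring_allreduce_average(grads)
print(averaged[0], messages, elements)
```

In this sketch each worker sends 2(n-1) messages but only 2(n-1)/n of one gradient vector in total, which is one way to read the "constant data volume, linear number of handshakes" trade-off described above.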
In Decentralized Parallel Stochastic Gradient Descent (DPSGD) (Lian et al., 2017), all workers are connected with a network that forms a connected graph $G$. Every worker has its local copy of the model. In each iteration, each worker computes stochastic gradients locally and at the same time averages its local model with those of its neighbors. Finally the locally computed stochastic gradients are updated into the local models. In this procedure, the busiest worker only sends/receives $O(\deg(G))$ models and has $O(\deg(G))$ handshakes per iteration. Note that in DPSGD the computation and communication can be done in parallel, which means that when the communication time is smaller than the computation time, the communication can be completely hidden. The idle time is still high in DPSGD because all workers need to finish updating before stepping into the next iteration. Before Lian et al. (2017) there were also studies on decentralized stochastic algorithms (both synchronous and asynchronous versions), though none of them is proved to have speedup when the number of workers increases. For example, Lan et al. (2017) proposed a decentralized stochastic primal-dual type algorithm with a computational complexity of $O(n/\epsilon^2)$ for general convex objectives and $O(n/\epsilon)$ for strongly convex objectives. Sirb and Ye (2016) proposed an asynchronous decentralized stochastic algorithm with an $O(n/\epsilon^2)$ complexity for convex objectives. These bounds do not imply any speedup for decentralized algorithms. Bianchi et al. (2013) proposed a similar decentralized stochastic algorithm. The authors provided a convergence rate for the consensus of the local models when the local models are bounded. The convergence to a solution was provided by using the central limit theorem. However, they did not provide the convergence rate to the solution. A very recent paper (Tang et al., 2018) extended DPSGD so that it works better on data with high variance.
Ram et al. (2010) proposed an asynchronous subgradient variation of the decentralized stochastic optimization algorithm for convex problems. The asynchrony was modeled by viewing the update event as a Poisson process, and the convergence to the solution was shown. The works of Srivastava and Nedic (2011) and Sundhar Ram et al. (2010) are similar. The main differences from this work are: 1) we take into consideration the situation where a worker calculates gradients based on an old model, which is the case in the asynchronous setting; 2) we prove that our algorithm can achieve linear speedup when we increase the number of workers, which is important if we want to use the algorithm to accelerate training; 3) our implementation guarantees deadlock-free, wait-free computation and communication. Nair and Gupta (2017) proposed another distributed stochastic algorithm, but it requires a centralized arbitrator to decide which two workers are exchanging weights and it lacks convergence analysis. Tsianos and Rabbat (2016) proposed a gossip-based dual averaging algorithm that achieves linear speedup in the computational complexity, but in each iteration it requires multiple rounds of communication to limit the difference between all workers within a small constant.
We next briefly review decentralized algorithms. Decentralized algorithms were initially studied by the control community for solving the consensus problem, where the goal is to compute the mean of all the data distributed on multiple nodes (Boyd et al., 2005; Carli et al., 2010; Aysal et al., 2009; Fagnani and Zampieri, 2008; Olfati-Saber et al., 2007; Schenato and Gamba, 2007). For decentralized algorithms used for optimization problems, Lu et al. (2010) proposed two non-gradient-based algorithms for solving one-dimensional unconstrained convex optimization problems where the objective on each node is strictly convex, by calculating the inverse function of the derivative of the local objectives and transmitting the gradients or local objectives to neighbors; the algorithms can be used over networks with time-varying topologies. A convergence rate was not shown, but the authors did prove that the algorithms will converge to the solution eventually. Mokhtari and Ribeiro (2016) proposed a fast decentralized variance-reduced algorithm for strongly convex optimization problems. The algorithm is proved to have a linear convergence rate, and a nice stochastic saddle point method interpretation is given. However, the speedup property is unclear and a table of stochastic gradients needs to be stored. Yuan et al. (2016) studied decentralized gradient descent on convex and strongly convex objectives. The algorithm in each iteration averages the models of the nodes with their neighbors' and then updates the full gradient of the local objective function on each node. The subgradient version was considered in Nedic and Ozdaglar (2009); Ram et al. (2009). The algorithm is intuitive and easy to understand. However, the limitation of the algorithm is that it does not converge to the exact solution, because the exact solution is not a fixed point of the algorithm's update rule. This issue was fixed later by Shi et al. (2015a); Wu et al. (2016) by using the gradients of the last two iterates instead of one in each iteration, which was later improved in Shi et al. (2015b); Li et al. (2017) by considering proximal gradients. Decentralized ADMM algorithms were analyzed in Zhang and Kwok (2014); Shi et al.; Aybat et al. (2015). Wang et al. (2016) developed a decentralized algorithm for recursive least-squares problems.
3 Algorithm
We introduce the ADPSGD algorithm in this section.
Definitions and notations
Throughout this paper, we use the following notation and definitions:

$\|\cdot\|_F$ denotes the matrix Frobenius norm.

$\nabla f(\cdot)$ denotes the gradient of a function $f$.

$\mathbf{1}_n$ denotes the column vector in $\mathbb{R}^n$ with $1$ for all elements.

$f^*$ denotes the optimal solution to (1).

$\lambda_i(\cdot)$ denotes the $i$th largest eigenvalue of a matrix.

$e_i$ denotes the $i$th element of the standard basis of $\mathbb{R}^n$.
3.1 Problem definition
The decentralized communication topology is represented as an undirected graph $(V, E)$, where $V := \{1, 2, \ldots, n\}$ denotes the set of $n$ workers and $E \subseteq V \times V$ is the set of the edges in the graph. Each worker represents a machine/GPU owning its local data (or a sensor collecting local data online) such that each worker is associated with a local loss function
$$f_i(x) := \mathbb{E}_{\xi \sim \mathcal{D}_i} F_i(x; \xi),$$
where $\mathcal{D}_i$ is a distribution associated with the local data at worker $i$ and $\xi$ is a data point sampled via $\mathcal{D}_i$. An edge means that the two connected workers can exchange information. For the ADPSGD algorithm, the overall optimization problem it solves is
$$\min_{x \in \mathbb{R}^N} f(x) := \mathbb{E}_{i \sim \mathcal{I}} f_i(x) = \sum_{i=1}^{n} p_i f_i(x), \qquad (1)$$
where the $p_i$'s define a distribution, that is, $p_i \geq 0$ and $\sum_i p_i = 1$, and $p_i$ indicates the updating frequency of worker $i$ or the percentage of the updates performed by worker $i$. The faster a worker, the higher the corresponding $p_i$. The intuition is that if a worker is faster than another worker, then the faster worker will run more epochs given the same amount of time, and consequently the corresponding worker has a larger impact.
Remark 1.
To solve the common form of objectives in machine learning using ADPSGD
$$\min_{x \in \mathbb{R}^N} \mathbb{E}_{\xi \sim \mathcal{D}} F(x; \xi),$$
we can appropriately distribute data such that Eq. (1) solves the target objective above:
 Strategy-1

Let $\mathcal{D}_i = \mathcal{D}$ and $F_i(\cdot; \cdot) = F(\cdot; \cdot)$, that is, all workers can access all data, and consequently $f_i(\cdot) = f(\cdot)$, that is, all $f_i(\cdot)$'s are the same;
 Strategy-2

Split the data among all workers appropriately such that the $p_i$ portion of the data is on worker $i$, and define $\mathcal{D}_i$ to be the uniform distribution over the assigned data samples.
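As a small sanity check of the second strategy, the sketch below verifies on toy data that weighting each worker's average loss by its share of the data recovers the average loss over the whole dataset. The squared loss, the toy data, and all function names are hypothetical choices made only for this illustration.

```python
from fractions import Fraction


def loss(x, xi):
    """Hypothetical per-sample loss F(x; xi) = (x - xi)^2."""
    return (x - xi) ** 2


def global_objective(x, data):
    """Average loss over the whole dataset (uniform distribution over all data)."""
    return sum(loss(x, xi) for xi in data) / len(data)


def weighted_objective(x, shards):
    """sum_i p_i * f_i(x), where p_i is shard i's share of the data and
    f_i is the average loss over shard i (uniform distribution over the shard)."""
    total = sum(len(shard) for shard in shards)
    value = Fraction(0)
    for shard in shards:
        p_i = Fraction(len(shard), total)
        f_i = sum(loss(x, xi) for xi in shard) / len(shard)
        value += p_i * f_i
    return value


data = [Fraction(v) for v in (1, 2, 4, 7, 8, 9)]
shards = [data[:1], data[1:3], data[3:]]   # unequal shares: 1/6, 2/6, 3/6
x = Fraction(5)
print(weighted_objective(x, shards), global_objective(x, data))
```

Exact rational arithmetic is used so that the two quantities can be compared for equality rather than up to rounding.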
3.2 ADPSGD algorithm
The ADPSGD algorithm can be described as follows: each worker maintains a local model $x$ in its local memory and (using worker $i$ as an example) repeats the following steps:

Sample data: Sample a minibatch of training data denoted by $\{\xi_m^i\}_{m=1}^{M}$, where $M$ is the batch size.

Compute gradients: Use the sampled data to compute the stochastic gradient $\sum_{m=1}^{M} \nabla F(\hat{x}^i; \xi_m^i)$, where $\hat{x}^i$ is read from the model in the local memory.

Gradient update: Update the model in the local memory by $x^i \leftarrow x^i - \gamma \sum_{m=1}^{M} \nabla F(\hat{x}^i; \xi_m^i)$, where $\gamma$ is the learning rate. Note that $\hat{x}^i$ may not be the same as $x^i$, as the local model may be modified by other workers in the averaging step.

Averaging: Randomly select a neighbor (e.g. worker $i'$) and average the local model with worker $i'$'s model $x^{i'}$ (the models on both workers are updated to the averaged model). More specifically, $x^i, x^{i'} \leftarrow \frac{x^i}{2} + \frac{x^{i'}}{2}$.
Note that each worker runs the procedure above on its own without any global synchronization. This reduces the idle time of each worker and the training process will still be fast even if part of the network or workers slow down.
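The per-worker loop can be imitated in a single process. The sketch below is only an illustration under several assumptions that are not taken from this paper: a scalar model, a hypothetical quadratic loss whose minimizer is the mean of `DATA`, a ring of workers, a gradient averaged (rather than summed) over the minibatch, and asynchrony emulated by activating one randomly chosen worker at a time. The names `simulate_adpsgd`, `DATA`, and `x_hat` are hypothetical.

```python
import random

DATA = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy dataset shared by all workers; mean is 3.0


def simulate_adpsgd(n=8, iters=4000, gamma=0.05, batch=4, seed=0):
    """Emulate asynchronous decentralized SGD on a ring of n workers.

    Hypothetical loss: F(x; xi) = 0.5 * (x - xi)^2, so the gradient is x - xi.
    """
    rng = random.Random(seed)
    x = [0.0] * n       # local models
    x_hat = [0.0] * n   # possibly stale copies used to compute gradients
    for _ in range(iters):
        i = rng.randrange(n)                              # worker doing this update
        xis = [rng.choice(DATA) for _ in range(batch)]    # sample data
        grad = sum(x_hat[i] - xi for xi in xis) / batch   # gradient at the stale copy
        x[i] -= gamma * grad                              # gradient update
        j = rng.choice(((i - 1) % n, (i + 1) % n))        # random ring neighbor
        x[i] = x[j] = 0.5 * (x[i] + x[j])                 # averaging on both workers
        x_hat[i] = x[i]                                   # read the model for the next gradient
    return x


models = simulate_adpsgd()
print(min(models), max(models))
```

Between two activations of the same worker, its local model can be changed by a neighbor's averaging step while `x_hat` stays fixed, which is how the sketch mimics gradients computed on a stale model.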
The averaging step can be generalized into the following update for all workers:
$$[x^1, x^2, \ldots, x^n] \leftarrow [x^1, x^2, \ldots, x^n] W,$$
where $W$ can be an arbitrary doubly stochastic matrix. This generalization gives us plenty of flexibility in implementation without hurting our analysis.
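For instance, averaging two workers corresponds to a particular doubly stochastic matrix. The sketch below builds that matrix for scalar models and checks the property; the helper names are hypothetical and plain Python lists are used instead of a linear algebra library.

```python
def pairwise_averaging_matrix(n, i, j):
    """Identity matrix, except that workers i and j average their models."""
    W = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    W[i][i] = W[j][j] = W[i][j] = W[j][i] = 0.5
    return W


def is_doubly_stochastic(W, tol=1e-12):
    """Check nonnegativity and that every row and column sums to one."""
    n = len(W)
    rows_ok = all(abs(sum(W[r]) - 1.0) <= tol for r in range(n))
    cols_ok = all(abs(sum(W[r][c] for r in range(n)) - 1.0) <= tol for c in range(n))
    nonneg = all(v >= 0 for row in W for v in row)
    return rows_ok and cols_ok and nonneg


def apply_averaging(models, W):
    """Compute the row vector of models times W (one scalar model per worker)."""
    n = len(models)
    return [sum(models[r] * W[r][c] for r in range(n)) for c in range(n)]


W = pairwise_averaging_matrix(4, 0, 2)
print(is_doubly_stochastic(W))
print(apply_averaging([1.0, 2.0, 3.0, 4.0], W))
```

Workers 0 and 2 both end up with the mean of their two models, while the other workers are left unchanged.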
All workers run the procedure above simultaneously, as shown in Algorithm 1. We use a virtual counter $k$ to denote the iteration counter: every single gradient update, no matter on which worker it happens, increases $k$ by $1$. $i_k$ denotes the worker performing the $k$th update.
3.3 Implementation details
We briefly describe two interesting aspects of system designs and leave more discussions to Appendix A.
3.3.1 Deadlock avoidance
A naive implementation of the above algorithm may cause deadlock, because the averaging step needs to be atomic and involves updating two workers (the selected worker and one of its neighbors). As an example, given three fully connected workers $A$, $B$, and $C$: $A$ sends its local model $x_A$ to $B$ and waits for $x_B$ from $B$; $B$ has already sent out $x_B$ to $C$ and waits for $C$'s response; and $C$ has sent out $x_C$ to $A$ and waits for $x_A$ from $A$.
We prevent the deadlock in the following way: the communication network is designed to be a bipartite graph, that is, the worker set $V$ can be split into two disjoint sets $A$ (active set) and $P$ (passive set) such that any edge in the graph connects one worker in $A$ and one worker in $P$. Due to the property of the bipartite graph, the neighbors of any active worker can only be passive workers and the neighbors of any passive worker can only be active workers. This implementation avoids deadlock but still fits in the general Algorithm 1 that we analyze. We leave more discussion and a detailed implementation for wait-free training to Appendix A.
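The argument can be illustrated with a wait-for graph: a deadlock corresponds to a cycle of workers each waiting on the next. The sketch below assumes, beyond what is stated here, that only active workers initiate an averaging step and passive workers only respond; the helper names and the even/odd split of a ring are hypothetical choices for illustration.

```python
def has_cycle(wait_for):
    """Detect a cycle in a wait-for graph given as {worker: worker_it_waits_on}."""
    for start in wait_for:
        seen = set()
        node = start
        while node in wait_for:
            if node in seen:
                return True
            seen.add(node)
            node = wait_for[node]
    return False


def bipartite_ring(n):
    """Split an even-sized ring into active (even ids) and passive (odd ids) workers."""
    assert n % 2 == 0, "a ring is bipartite only when its size is even"
    active = {i for i in range(n) if i % 2 == 0}
    passive = set(range(n)) - active
    edges = [(i, (i + 1) % n) for i in range(n)]
    return active, passive, edges


# Three fully connected workers that all initiate at once: cyclic wait.
triangle = {"A": "B", "B": "C", "C": "A"}
print(has_cycle(triangle))

# Bipartite ring: only active workers wait, and only on passive workers.
active, passive, edges = bipartite_ring(8)
wait_for = {a: (a + 1) % 8 for a in active}
print(has_cycle(wait_for))
```

Because passive workers never wait on anyone in this model, every wait-for chain has length one and no cycle can form.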
3.3.2 Communication topology
The simplest realization of the ADPSGD algorithm is a ring-based topology. To accelerate information exchange, we also implement a communication topology in which each sender communicates with a receiver that is $2^i + 1$ hops away in the ring, where $i$ is an integer from 0 to $\log(n-1)$ ($n$ is the number of learners). It is easy to see that it takes at most $O(\log n)$ steps for any pair of workers to exchange information, instead of $O(n)$ in the simple ring-based topology. In this way, $\rho$ (as defined in Section 4) becomes smaller and the scalability of ADPSGD improves. This implementation also enables robustness against slow or failed network links because there are multiple routes for a worker to disseminate its information.
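A quick way to see the effect of long-range links is to compare graph diameters. The sketch below assumes a specific hop rule (each worker links to the worker 2**i + 1 positions ahead on the ring, for i from 0 to floor(log2(n-1))); this rule and all function names are assumptions for illustration and may differ from the actual implementation.

```python
from collections import deque
from math import floor, log2


def ring_edges(n):
    """Plain ring: each worker is linked to its two immediate neighbors."""
    return {s: {(s - 1) % n, (s + 1) % n} for s in range(n)}


def hop_edges(n):
    """Assumed rule: worker s is linked to the worker 2**i + 1 hops ahead."""
    adj = {s: set() for s in range(n)}
    for s in range(n):
        for i in range(floor(log2(n - 1)) + 1):
            r = (s + 2 ** i + 1) % n
            if r != s:
                adj[s].add(r)
                adj[r].add(s)
    return adj


def diameter(adj):
    """Largest shortest-path length over all pairs, via breadth-first search."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst


ring_diam = diameter(ring_edges(16))
hop_diam = diameter(hop_edges(16))
print(ring_diam, hop_diam)
```

With 16 workers the plain ring needs up to 8 exchanges for information to travel between the two most distant workers, while the assumed hop topology needs far fewer.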
4 Theoretical analysis
In this section we provide theoretical analysis for the ADPSGD algorithm. We will show that the convergence rate of ADPSGD is consistent with SGD and DPSGD.
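Note that by counting each update of stochastic gradients as one iteration, the update of each iteration in Algorithm 1 can be viewed as
$$X_{k+1} = X_k W_k - \gamma\, \partial g(\hat{X}_k; \xi_k^{i_k}, i_k),$$
where $k$ is the iteration number, $x_k^i$ is the local model of the $i$th worker at the $k$th iteration, and
$$X_k = [x_k^1 \ \cdots \ x_k^n] \in \mathbb{R}^{N \times n}, \quad \hat{X}_k = [\hat{x}_k^1 \ \cdots \ \hat{x}_k^n] \in \mathbb{R}^{N \times n}, \quad \partial g(\hat{X}_k; \xi_k^{i_k}, i_k) = \Big[0 \ \cdots \ \sum_{j=1}^{M} \nabla F(\hat{x}_k^{i_k}; \xi_{k,j}^{i_k}) \ \cdots \ 0\Big] \in \mathbb{R}^{N \times n},$$
and $\hat{X}_k = X_{k - \tau_k}$ for some nonnegative integer $\tau_k$.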
Assumption 1.
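Throughout this paper, we make the following commonly used assumptions:

Lipschitzian gradient: All functions $f_i(\cdot)$ have $L$-Lipschitzian gradients.

Doubly stochastic averaging: $W_k$ is doubly stochastic for all $k$.

Spectral gap: There exists a $\rho \in [0, 1)$ such that
$$\max\{ |\lambda_2(\mathbb{E}[W_k^\top W_k])|, |\lambda_n(\mathbb{E}[W_k^\top W_k])| \} \leq \rho, \quad \forall k. \qquad (2)$$
(A smaller $\rho$ means faster information spreading in the network, leading to faster convergence.)

Unbiased estimation:
$$\mathbb{E}_{\xi \sim \mathcal{D}_i} \nabla F(x; \xi) = \nabla f_i(x), \qquad (3)$$
$$\mathbb{E}_{i \sim \mathcal{I}} \mathbb{E}_{\xi \sim \mathcal{D}_i} \nabla F(x; \xi) = \nabla f(x). \qquad (4)$$
(Note that this is easily satisfied when all workers can access all data. When each worker can only access part of the data, we can also meet these assumptions by appropriately distributing data.)

Bounded variance: Assume the variance of the stochastic gradient
$$\mathbb{E}_{i \sim \mathcal{I}} \mathbb{E}_{\xi \sim \mathcal{D}_i} \|\nabla F(x; \xi) - \nabla f(x)\|^2$$
is bounded for any $x$ with $i$ sampled from the distribution $\mathcal{I}$ and $\xi$ from the distribution $\mathcal{D}_i$. This implies there exist constants $\sigma$ and $\varsigma$ such that
$$\mathbb{E}_{\xi \sim \mathcal{D}_i} \|\nabla F(x; \xi) - \nabla f_i(x)\|^2 \leq \sigma^2, \quad \forall i, \forall x, \qquad (5)$$
$$\mathbb{E}_{i \sim \mathcal{I}} \|\nabla f_i(x) - \nabla f(x)\|^2 \leq \varsigma^2, \quad \forall x. \qquad (6)$$
Note that if all workers can access all data, then $\varsigma = 0$.

Dependence of random variables: $\xi_k$, $i_k$, $k \in \{0, 1, 2, \ldots\}$ are independent random variables. $W_k$ is a random variable dependent on $i_k$.

Bounded staleness: $\hat{X}_k = X_{k - \tau_k}$ and there exists a constant $T$ such that $\max_k \tau_k \leq T$.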
Throughout this paper, we define the following notations for simpler notation
Under Assumption 1 we have the following results:
Theorem 1 (Main theorem).
While and and are satisfied we have
Noting that , this theorem characterizes the convergence of the average of all local models. By appropriately choosing the learning rate, we obtain the following corollary
Corollary 2.
Let . We have the following convergence rate
(7) 
if the total number of iterations is sufficiently large, in particular,
(8) 
This corollary indicates that if the iteration number is big enough, ADPSGD's convergence rate is $O(1/\sqrt{K})$. We compare the convergence rate of ADPSGD with existing results for SGD and DPSGD to show the tightness of the proved convergence rate. We will also show the efficiency and the linear speedup property of ADPSGD w.r.t. batch size, number of workers, and staleness, respectively. Further discussion of communication topology and intuition is provided at the end of this section.
Remark 2 (Consistency with SGD).
Remark 3 (Linear speedup w.r.t. batch size).
When $K$ is large enough, the second term on the RHS of (7) dominates the first term. Note that the second term converges at a rate $O(1/\sqrt{MK})$ if $\varsigma = 0$, which means the convergence efficiency is boosted linearly as the minibatch size increases. This observation indicates the linear speedup w.r.t. the batch size and matches the results of minibatch SGD. (Note that when $\varsigma^2 \neq 0$, ADPSGD does not admit this linear speedup w.r.t. batch size. This is unavoidable because increasing the minibatch size only decreases the variance of the stochastic gradients within each worker, while $\varsigma^2$ characterizes the variance of the stochastic gradients among different workers, independent of the batch size.)
Remark 4 (Linear speedup w.r.t. number of workers).
Note that every single stochastic gradient update counts as one iteration in our analysis, and our convergence rate in Corollary 2 is consistent with SGD / minibatch SGD. This means that the number of stochastic gradient updates required to achieve a certain precision is consistent with SGD / minibatch SGD, as long as the total number of iterations is large enough. It further indicates the linear speedup with respect to the number of workers $n$ ($n$ workers will make the iteration number advance $n$ times faster in the sense of wall-clock time, which means we will converge $n$ times faster). To the best of our knowledge, the linear speedup property w.r.t. the number of workers for decentralized algorithms had not been recognized until the recent analysis for DPSGD by Lian et al. (2017). Our analysis reveals that by breaking the synchronization ADPSGD can maintain linear speedup, reduce the idle time, and improve the robustness in heterogeneous computing environments.
Remark 5 (Linear speedup w.r.t. the staleness).
From (8) we can also see that as long as the staleness $T$ is bounded by $O(K^{1/4})$ (if other parameters are considered to be constants), linear speedup is achievable.
5 Experiments
We describe our experimental methodologies in Section 5.1 and we evaluate the ADPSGD algorithm in the following sections:

Section 5.2: Compare ADPSGD’s convergence rate (w.r.t epochs) with other algorithms.

Section 5.3: Compare ADPSGD’s convergence rate (w.r.t runtime) and its speedup with other algorithms.

Section 5.4: Compare ADPSGD’s robustness to other algorithms in heterogeneous computing and heterogeneous communication environments.

Appendix B: Evaluate ADPSGD on an IBM proprietary natural language processing dataset and model.
5.1 Experiments methodology
5.1.1 Dataset, model, and software
We use CIFAR10 and ImageNet1K as the evaluation dataset and we use Torch7 as our deep learning framework. We use MPI to implement the communication scheme. For CIFAR10, we evaluate both VGG (Simonyan and Zisserman, 2015) and ResNet20 (He et al., 2016) models. VGG, whose size is about 60MB, represents a communication intensive workload and ResNet20, whose size is about 1MB, represents a computation intensive workload. For the ImageNet1K dataset, we use the ResNet50 model whose size is about 100MB.
Additionally, we experimented on IBM proprietary natural language processing datasets and models (Zhang et al., 2017) in Appendix B.
5.1.2 Hardware
We evaluate ADPSGD in two different environments:

IBM S822LC HPC cluster: Each node has 4 Nvidia P100 GPUs, 160 Power8 cores (8-way SMT), and 500GB memory. The nodes are connected by a 100Gbit/s Mellanox EDR InfiniBand network. We use 32 such nodes.

x86-based cluster: This cluster is a cloud-like environment with a 10Gbit/s Ethernet connection. Each node has 4 Nvidia P100 GPUs, 56 Xeon E5-2680 cores (2-way SMT), and 1TB DRAM. We use 4 such nodes.
5.1.3 Compared algorithms
We compare the proposed ADPSGD algorithm to AllReduceSGD, DPSGD (Lian et al., 2017), and a state-of-the-art asynchronous SGD implementation, EAMSGD (Zhang et al., 2015). (In this paper, we use ASGD and EAMSGD interchangeably.) In EAMSGD, each worker can communicate with the parameter server less frequently by increasing the “communication period” parameter.
5.2 Convergence w.r.t. epochs
Table 2: Test accuracy on CIFAR10.
Model  |  AllReduce  |  DPSGD  |  EAMSGD  |  ADPSGD  |
VGG  |  87.04%  |  86.48%  |  85.75%  |  88.58%  |
ResNet20  |  90.72%  |  90.81%  |  89.82%  |  91.49%  |
Figure 4 plots training loss w.r.t. epochs for each algorithm, which is evaluated for VGG and ResNet20 models on CIFAR10 dataset with 16 workers. Table 2 reports the test accuracy of all algorithms.
For EAMSGD, we did extensive hyperparameter tuning to get the best possible model, where . We set momentum moving average to be (where is the number of workers) as recommended in Zhang et al. (2015) for EAMSGD.
For other algorithms, we use the following hyperparameter setup as prescribed in Zagoruyko (2015) and FAIR (2017):

Batch size: 128 per worker for VGG, 32 for ResNet20.

Learning rate: For VGG start from 1 and reduce by half every 25 epochs. For ResNet20 start from 0.1 and decay by a factor of 10 at the 81st epoch and the 122nd epoch.

Momentum: 0.9.

Weight decay: .
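The two learning rate schedules in the list above can be written as small functions. This is a sketch rather than the configuration code used in the experiments; it assumes epochs are numbered from 1 and that a decay "at the 81st epoch" takes effect from epoch 81 onward. The function names are hypothetical.

```python
def vgg_lr(epoch, base=1.0, period=25):
    """Start from `base` and halve every `period` epochs (epochs counted from 1)."""
    return base * 0.5 ** ((epoch - 1) // period)


def resnet20_lr(epoch, base=0.1, milestones=(81, 122)):
    """Divide the rate by 10 at each milestone epoch (epochs counted from 1)."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base / (10 ** drops)


for epoch in (1, 25, 26, 51):
    print("VGG", epoch, vgg_lr(epoch))
for epoch in (1, 80, 81, 122):
    print("ResNet20", epoch, resnet20_lr(epoch))
```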
Figure 4 shows that w.r.t. epochs, AllReduceSGD, DPSGD and ADPSGD converge similarly, while ASGD converges worse. Table 2 shows that ADPSGD does not sacrifice test accuracy.
Table 3: ImageNet.
Workers  |  AllReduce  |  DPSGD  |  ADPSGD  |
16 Workers  |  74.86%  |  74.74%  |  75.28%  |
32 Workers  |  74.78%  |  73.66%  |  74.66%  |
64 Workers  |  74.90%  |  71.18%  |  74.20%  |
128 Workers  |  74.78%  |  70.90%  |  74.23%  |
We further evaluate the ADPSGD’s convergence rate w.r.t. epochs using ImageNet1K and ResNet50 model. We compare ADPSGD with AllReduceSGD and DPSGD as they tend to converge better than APSGD.
Figure 11 and Table 3 demonstrate that w.r.t. epochs ADPSGD converges similarly to AllReduce and converges better than DPSGD when running with 16, 32, 64, and 128 workers. How to maintain convergence while increasing $Mn$ ($M$ is the minibatch size per worker and $n$ is the number of workers) is an active ongoing research area (Zhang et al., 2016; Goyal et al., 2017) and it is orthogonal to the topic of this paper. For 64 and 128 workers, we adopted a learning rate tuning scheme similar to the one proposed in Goyal et al. (2017) (i.e., learning rate warmup and linear scaling). (In ADPSGD, we decay the learning rate every 25 epochs instead of 30 epochs as in AllReduce.) It is worth noting that we could further increase the scalability of ADPSGD by combining learners on the same computing node as a super-learner (via Nvidia NCCL AllReduce collectives). In this way, a 128-worker system can easily scale up to 512 GPUs or more, depending on the GPU count on a node.
Above results show ADPSGD converges similarly to AllReduceSGD w.r.t epochs and better than DPSGD. Techniques used for tuning learning rate for AllReduceSGD can be applied to ADPSGD when batch size is large.
5.3 Speedup and convergence w.r.t runtime
On CIFAR10, Figure 5 shows the runtime convergence results on both IBM HPC and x86 system. The EAMSGD implementation deploys parameter server sharding to mitigate the network bottleneck at the parameter servers. However, the central parameter server quickly becomes a bottleneck on a slow network with a large model as shown in Figure 5(b).
Figure 12 shows the speedup for different algorithms w.r.t. number of workers. The speedup for ResNet20 is better than VGG because ResNet20 is a computation intensive workload.
Above results show that regardless of workload type (computation intensive or communication intensive) and communication networks (fast or slow), ADPSGD consistently converges the fastest w.r.t. runtime and achieves the best speedup.
5.4 Robustness in a heterogeneous environment
In a heterogeneous environment, the speed of computation device and communication device may often vary, subject to architectural features (e.g., over/underclocking, caching, paging), resourcesharing (e.g., cloud computing) and hardware malfunctions. Synchronous algorithms like AllReduceSGD and DPSGD perform poorly when workers’ computation and/or communication speeds vary. Centralized asynchronous algorithms, such as APSGD, do poorly when the parameter server’s network links slow down. In contrast, ADPSGD localizes the impact of slower workers or network links.
On ImageNet, Figure (e) shows the epoch-wise training time of ADPSGD, DPSGD and AllReduce run over 64 GPUs (16 nodes) over a reserved window of 10 hours when the job shares network links with other jobs on IBM HPC. ADPSGD finishes each epoch in 264 seconds, whereas AllReduceSGD and DPSGD can take over 1000 sec/epoch.
We then evaluate ADPSGD's robustness under different situations by randomly slowing down 1 of the 16 workers and its incoming/outgoing network links. Due to space limits, we discuss the results for the ResNet20 model on the CIFAR10 dataset, as the VGG results are similar.
Robustness against slow computation

Slowdown of one worker  |  ADPSGD time/epoch (sec)  |  ADPSGD speedup  |  AllReduce/DPSGD time/epoch (sec)  |  AllReduce/DPSGD speedup  |
no slowdown  |  1.22  |  14.78  |  1.47/1.45  |  12.27/12.44  |
2X  |  1.28  |  14.09  |  2.6/2.36  |  6.93/7.64  |
10X  |  1.33  |  13.56  |  11.51/11.24  |  1.56/1.60  |
100X  |  1.33  |  13.56  |  100.4/100.4  |  0.18/0.18  |
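The table does not state how speedup is computed. One plausible reading, checked in the sketch below, is that speedup is the single-worker epoch time divided by the measured epoch time; under that assumption the product of the two columns should be roughly constant. The numbers are copied from the table above, while the interpretation and the names `ROWS` and `implied_baselines` are assumptions.

```python
# (time per epoch in seconds, speedup) pairs copied from the table above,
# ordered as: no slowdown, 2X, 10X, 100X.
ROWS = {
    "ADPSGD": [(1.22, 14.78), (1.28, 14.09), (1.33, 13.56), (1.33, 13.56)],
    "AllReduce": [(1.47, 12.27), (2.6, 6.93), (11.51, 1.56), (100.4, 0.18)],
    "DPSGD": [(1.45, 12.44), (2.36, 7.64), (11.24, 1.60), (100.4, 0.18)],
}


def implied_baselines(rows):
    """Multiply time per epoch by speedup to get the implied single-worker time."""
    return {name: [round(t * s, 2) for t, s in pairs] for name, pairs in rows.items()}


baselines = implied_baselines(ROWS)
for name, values in baselines.items():
    print(name, values)
```

All products land close to 18 seconds, which is consistent with a common single-worker baseline of about 18 sec/epoch, although the table itself does not report that number.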
Robustness against slow communication
Figure 16 shows that ADPSGD is robust when one worker is connected to slower network links. In contrast, centralized asynchronous algorithm EAMSGD uses a larger communication period to overcome slower links, which significantly slows down the convergence.
These results show only ADPSGD is robust against both heterogeneous computation and heterogeneous communication.
6 Conclusion
This paper proposes an asynchronous decentralized stochastic gradient descent algorithm (ADPSGD). The algorithm is robust in heterogeneous environments because it combines decentralization and asynchrony. It is also theoretically justified to have the same convergence rate as its synchronous and/or centralized counterparts and can achieve linear speedup w.r.t. the number of workers. Extensive experiments validate the proposed algorithm.
Acknowledgment
This project is supported in part by NSF CCF-1718513, NEC fellowship, IBM faculty award, Swiss NSF NRP 75 407540_167266, IBM Zurich, Mercedes-Benz Research & Development North America, Oracle Labs, Swisscom, Zurich Insurance, and Chinese Scholarship Council. We thank David Grove, Hillery Hunter, and Ravi Nair for providing valuable feedback. We thank Anthony Giordano and Paul Crumley for the well-maintained computing infrastructure that enabled the experiments conducted in this paper.
References
 Abadi et al. [2016] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 Agarwal and Duchi [2011] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In NIPS, 2011.
 Aybat et al. [2015] N. S. Aybat, Z. Wang, T. Lin, and S. Ma. Distributed linearized alternating direction method of multipliers for composite convex consensus optimization. arXiv preprint arXiv:1512.08122, 2015.
 Aysal et al. [2009] T. C. Aysal, M. E. Yildiz, A. D. Sarwate, and A. Scaglione. Broadcast gossip algorithms for consensus. IEEE Transactions on Signal processing, 2009.
 Bianchi et al. [2013] P. Bianchi, G. Fort, and W. Hachem. Performance of a distributed stochastic approximation algorithm. IEEE Transactions on Information Theory, 2013.
 Boyd et al. [2005] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Gossip algorithms: Design, analysis and applications. In INFOCOM, 2005.
 Carli et al. [2010] R. Carli, F. Fagnani, P. Frasca, and S. Zampieri. Gossip consensus algorithms via quantized communication. Automatica, 2010.
 Chen et al. [2015] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
 Dekel et al. [2012] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 2012.
 Fagnani and Zampieri [2008] F. Fagnani and S. Zampieri. Randomized consensus algorithms over large scale networks. IEEE Journal on Selected Areas in Communications, 2008.
 FAIR [2017] FAIR. ResNet in Torch. https://github.com/facebook/fb.resnet.torch, 2017.
 Feyzmahdavian et al. [2016] H. R. Feyzmahdavian, A. Aytekin, and M. Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Transactions on Automatic Control, 2016.
 Ghadimi and Lan [2013] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 2013.
 Ghadimi et al. [2016] S. Ghadimi, G. Lan, and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 2016.
 Goyal et al. [2017] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677.
 He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 Lan et al. [2017] G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. arXiv preprint arXiv:1701.03961, 2017.
 Li et al. [2017] Z. Li, W. Shi, and M. Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. arXiv preprint arXiv:1704.07807, 2017.
 Lian et al. [2015] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In NIPS, 2015.
 Lian et al. [2016] X. Lian, H. Zhang, C.-J. Hsieh, Y. Huang, and J. Liu. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, NIPS. Curran Associates, Inc., 2016.
 Lian et al. [2017] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent, 2017.
 Lu et al. [2010] J. Lu, C. Y. Tang, P. R. Regier, and T. D. Bow. A gossip algorithm for convex consensus optimization over networks. In ACC. IEEE, 2010.
 Luehr [2016] N. Luehr. Fast multi-GPU collectives with NCCL, 2016. URL https://devblogs.nvidia.com/parallelforall/fast-multi-gpu-collectives-nccl/.
 Mokhtari and Ribeiro [2016] A. Mokhtari and A. Ribeiro. DSA: decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 2016.

 Moulines and Bach [2011] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In NIPS, 2011.
 MPI contributors [2015] MPI contributors. MPI AllReduce, 2015. URL http://mpi-forum.org/docs/.
 Nair and Gupta [2017] R. Nair and S. Gupta. Wildfire: Approximate synchronization of parameters in distributed deep learning. IBM Journal of Research and Development, 61(4/5):7:1–7:9, July 2017. ISSN 0018-8646. doi: 10.1147/JRD.2017.2709198.
 Nedic and Ozdaglar [2009] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 2009.
 Nemirovski et al. [2009] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009.
 Olfati-Saber et al. [2007] R. Olfati-Saber, J. A. Fax, and R. M. Murray. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 2007.
 Paine et al. [2013] T. Paine, H. Jin, J. Yang, Z. Lin, and T. Huang. GPU asynchronous stochastic gradient descent to speed up neural network training. arXiv preprint arXiv:1312.6186, 2013.
 Patarasuk and Yuan [2009] P. Patarasuk and X. Yuan. Bandwidth optimal allreduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing, 2009.
 Ram et al. [2009] S. S. Ram, A. Nedic, and V. V. Veeravalli. Distributed subgradient projection algorithm for convex optimization. In ICASSP. IEEE, 2009.
 Ram et al. [2010] S. S. Ram, A. Nedić, and V. V. Veeravalli. Asynchronous gossip algorithm for stochastic optimization: Constant stepsize analysis. In Recent Advances in Optimization and its Applications in Engineering. Springer, 2010.
 Recht et al. [2011] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 2011.
 Schenato and Gamba [2007] L. Schenato and G. Gamba. A distributed consensus protocol for clock synchronization in wireless sensor network. In CDC. IEEE, 2007.
 Seide and Agarwal [2016] F. Seide and A. Agarwal. CNTK: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16. ACM, 2016.
 [38] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the admm in decentralized consensus optimization.
 Shi et al. [2015a] W. Shi, Q. Ling, G. Wu, and W. Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 2015a.
 Shi et al. [2015b] W. Shi, Q. Ling, G. Wu, and W. Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015b.
 Simonyan and Zisserman [2015] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 Sirb and Ye [2016] B. Sirb and X. Ye. Consensus optimization with delayed and stochastic gradients on decentralized networks. In Big Data, 2016.
 Srivastava and Nedic [2011] K. Srivastava and A. Nedic. Distributed asynchronous constrained stochastic optimization. IEEE Journal of Selected Topics in Signal Processing, 2011.
 Sundhar Ram et al. [2010] S. Sundhar Ram, A. Nedić, and V. Veeravalli. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of optimization theory and applications, 2010.
 Tang et al. [2018] H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu. D2: Decentralized training over decentralized data. arXiv preprint arXiv:1803.07068, 2018.
 Tsianos and Rabbat [2016] K. I. Tsianos and M. G. Rabbat. Efficient distributed online prediction and stochastic optimization with approximate distributed averaging. IEEE Transactions on Signal and Information Processing over Networks, 2(4):489–506, 2016.
 Wang et al. [2016] Z. Wang, Z. Yu, Q. Ling, D. Berberidis, and G. B. Giannakis. Decentralized RLS with data-adaptive censoring for regressions over large-scale networks. arXiv preprint arXiv:1612.08263, 2016.
 Wu et al. [2016] T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed. Decentralized consensus optimization with asynchrony and delays. arXiv preprint arXiv:1612.00150, 2016.
 Yuan et al. [2016] K. Yuan, Q. Ling, and W. Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 2016.
 Zagoruyko [2015] S. Zagoruyko. CIFAR VGG in Torch. https://github.com/szagoruyko/cifar.torch, 2015.
 Zhang and Kwok [2014] R. Zhang and J. Kwok. Asynchronous distributed ADMM for consensus optimization. In ICML, 2014.
 Zhang et al. [2015] S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685–693, 2015.
 Zhang et al. [2016] W. Zhang, S. Gupta, and F. Wang. Model accuracy and runtime tradeoff in distributed deep learning: A systematic study. In IEEE International Conference on Data Mining, 2016.
 Zhang et al. [2017] W. Zhang, M. Feng, Y. Zheng, Y. Ren, Y. Wang, J. Liu, P. Liu, B. Xiang, L. Zhang, B. Zhou, and F. Wang. GaDei: On scale-up training as a service for deep learning. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. The IEEE International Conference on Data Mining series (ICDM’2017), 2017.
Appendix A Wait-free (continuous) training and communication
The theoretical guarantee of ADPSGD relies on the doubly stochastic property of the averaging matrix. The implication is that the averaging of the weights between two workers must be atomic. This poses a special challenge for current distributed deep learning frameworks, where the computation (gradient calculation and weight update) runs on GPU devices and the communication runs on the CPU (or its peripherals such as InfiniBand or RDMA): while averaging is happening on a worker, the GPU is not allowed to apply gradients to the weights. This can be solved by using the CPU to update the weights while the GPUs only calculate gradients. Every worker (both active and passive) runs two threads in parallel with a shared buffer, one thread for computation and the other for communication. Algorithm 2, Algorithm 3, and Algorithm 4 illustrate the task on each thread. The communication thread is run by the CPU, while the computation thread is run by the GPUs. In this way the GPUs can continuously calculate new gradients, putting the results in the CPU’s buffer regardless of whether averaging is happening. Recall that in DPSGD, communication occurs only once per iteration. In contrast, with this implementation ADPSGD can exchange weights at any time.
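The two-thread arrangement described above can be sketched in a few lines of Python. This is a minimal single-process illustration under stated assumptions, not the implementation used in the paper: the names `Worker`, `computation_step`, `flush_buffer`, `average_pair`, and `run` are hypothetical, NumPy arrays stand in for GPU tensors, Python threads stand in for the GPU and CPU sides, and a fixed lock order replaces the active/passive worker protocol of Algorithms 2 to 4.

```python
import threading
import numpy as np

class Worker:
    """Hypothetical worker: weights plus a gradient buffer shared by two threads."""
    def __init__(self, wid, dim, lr=0.01):
        self.wid = wid
        self.lr = lr
        self.weights = np.zeros(dim)
        self.buffer = np.zeros(dim)            # shared buffer between the threads
        self.buffer_lock = threading.Lock()    # guards the buffer hand-off only
        self.weights_lock = threading.Lock()   # guards weight updates and averaging

def computation_step(worker, grad_fn):
    """'GPU' side: compute a gradient and deposit it in the shared buffer."""
    snapshot = worker.weights.copy()           # unlocked read, may be slightly stale
    grad = grad_fn(snapshot)
    with worker.buffer_lock:
        worker.buffer += grad

def flush_buffer(worker):
    """'CPU' side: drain accumulated gradients and apply them to the weights."""
    with worker.buffer_lock:
        grad = worker.buffer.copy()
        worker.buffer[:] = 0.0
    with worker.weights_lock:
        worker.weights -= worker.lr * grad

def average_pair(a, b):
    """Atomically replace both weight vectors by their mean (a and b must differ)."""
    first, second = sorted((a, b), key=lambda w: w.wid)  # fixed order avoids deadlock
    with first.weights_lock, second.weights_lock:
        mean = 0.5 * (a.weights + b.weights)
        a.weights[:] = mean
        b.weights[:] = mean

def run(workers, grad_fn, steps=50):
    """Start one computation and one communication thread per worker."""
    def compute(w):
        for _ in range(steps):
            computation_step(w, grad_fn)

    def communicate(w, neighbor):
        for _ in range(steps):
            flush_buffer(w)
            average_pair(w, neighbor)

    threads = []
    for i, w in enumerate(workers):
        neighbor = workers[(i + 1) % len(workers)]  # ring topology, needs >= 2 workers
        threads.append(threading.Thread(target=compute, args=(w,)))
        threads.append(threading.Thread(target=communicate, args=(w, neighbor)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    for w in workers:
        flush_buffer(w)                        # apply any gradients left in the buffer
```

In this sketch only the communication thread ever writes to the weights, so pairwise averaging never interleaves with a gradient update, while the computation thread keeps producing gradients and blocks only for the short buffer hand-off. Averaging two vectors to their mean preserves their sum, which is the pairwise form of a doubly stochastic update.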
Appendix B NLC experiments
In this section, we use an IBM proprietary natural language processing dataset and model to evaluate ADPSGD against other algorithms.
The IBM NLC task is to classify input sentences into a target category in a predefined label set. The NLC model is a CNN model that has a word-embedding lookup table layer, a convolutional layer, and a fully connected layer with a softmax output layer. We use two datasets in our evaluation. The first dataset, Joule, is an in-house customer dataset that has 2.5K training samples, 1K test samples, and 311 different classes. The second dataset, Yelp, is a public dataset that has 500K training samples, 2K test samples, and 5 different classes.
Figure 19 shows that ADPSGD converges (w.r.t. epochs) similarly to AllReduceSGD and DPSGD on the IBM NLC workload, which is an example of a proprietary workload.
Appendix C Proofs
In the following analysis we define
(9) 
and
(10) 
We also define