1 Introduction
Distributed learning has attracted significant interest from both academia and industry. Over the last decade, researchers have come up with a range of designs for more efficient learning systems. An important subset of this work focuses on understanding the impact of different system relaxations on the convergence and performance of distributed stochastic gradient descent, such as the compression of communication, e.g., Seide and Agarwal (2016), decentralized communication Lian et al. (2017a); Sirb and Ye (2016); Lan et al. (2017); Tang et al. (2018a); Stich et al. (2018), and asynchronous communication Lian et al. (2017b); Zhang et al. (2013); Lian et al. (2015). Most of these works are motivated by real-world system bottlenecks, abstracted as general problems for the purposes of analysis.
In this paper, we are motivated by a new type of system relaxation: the reliability of the communication channel. We abstract this problem as a theoretical one, conduct a novel convergence analysis for this scenario, and then validate our results in a practical setting. Specifically, considering a network of nodes, we model unreliable communication as a setting where the communication channel between any two machines has a probability of not delivering a message. Alternatively, this setting can be abstracted as the simple, general problem of learning in a decentralized multi-worker system, over a random topology which changes upon every single communication.
We assume a standard decentralized system implementation for aggregating gradients or models, executing standard data-parallel stochastic gradient descent (SGD). In a nutshell, the algorithm, which we call RPS (for Robust Parameter Server), would work as follows on a reliable network. Given $n$ machines, each maintaining its own local model, each machine alternates local SGD steps with global communication steps, in which machines exchange their local models. This communication is performed in two steps: first, in the ReduceScatter step, the model is partitioned into $n$ blocks, one for each machine; for each block $i$, the machines average their model on the block by sending it to the corresponding machine. In the subsequent AllGather step, each machine broadcasts its block to all others, so that all machines have a consistent model copy. Our modeling covers two standard distributed settings: the Parameter Server model (Li et al., 2014; Abadi et al., 2016), as well as standard implementations of the AllReduce averaging operation in a decentralized setting (Seide and Agarwal, 2016; Renggli et al., 2018). (Footnote 1: Our modeling above fits the case of $n$ workers and $n$ parameter servers, although our analysis will extend to any setting of these parameters.)
If the underlying network is unreliable (He et al., 2018), the two aggregation steps change as follows. In the ReduceScatter step, a uniform random subset of machines will average their model on each model block $i$. In the AllGather step, it is again a uniform random subset of machines which receive the resulting average. Specifically, machines not chosen for the ReduceScatter step do not contribute to the average, and all machines that are not chosen for the AllGather step will not receive updates on their model block $i$. This is a realistic model of running an AllReduce operator implemented with ReduceScatter/AllGather on an unreliable network.
Our main technical contribution is characterizing the convergence properties of the RPS algorithm. To the best of our knowledge, this is a novel theoretical analysis of this faulty communication model. Compared with previous work on decentralized learning, e.g., Lian et al. (2017a), this paper considers the impact of a random topology on convergence. Specifically, we prove that decentralized learning over random topologies can converge as efficiently as decentralized learning over a fixed topology, and admits linear speedup in the number of machines. Moreover, we also prove that the impact of the packet drop rate diminishes as the number of workers increases.
We apply our theoretical result to a real-world use case, illustrating the potential benefit of allowing an unreliable network. We focus on a realistic scenario where the network is shared among multiple applications or tenants, for instance in a data center, so that the learning job and the other applications communicate over the same network. In this case, if the machine learning traffic is tolerant to some packet loss, the other application can potentially be made faster by receiving priority for its network traffic. Via network simulations, we find that tolerating packet drops in the learning traffic can make a simple (emulated) Web service faster. (Even small speedups are significant for such services; for example, Google actively pursues minimizing its Web services’ response latency.) At the same time, this degree of loss does not impact the convergence rate for a range of machine learning applications, such as image classification and natural language processing.
2 Related Work
There has been significant work on distributing deep learning, e.g., Seide and Agarwal (2016); Abadi et al. (2016); Goyal et al. (2017); Colin et al. (2016). Due to space constraints, we mainly focus on work considering data-parallel SGD with decentralized randomized communication.
2.1 Distributed Learning
In Scaman et al. (2017), the optimal convergence rate for both centralized and decentralized distributed learning is given, with the time cost of communication included. Lin et al. (2018) and Stich (2018) investigate the trade-off between processing more mini-batches and communicating more often. To save communication cost, several sparsity-based distributed learning algorithms have been proposed (Shen et al., 2018b; Wu et al., 2018; Wen et al., 2017; McMahan et al., 2016; Wang et al., 2016). Recent works indicate that many distributed learning algorithms are delay-tolerant under an asynchronous setting (Zhou et al., 2018; Lian et al., 2015; Sra et al., 2015; Leblond et al., 2016). Also, Blanchard et al. (2017); Yin et al. (2018); Alistarh et al. (2018) study Byzantine-robust distributed learning, where Byzantine workers are included in the network.
Many optimization algorithms have been proved to achieve much better performance with more workers. For example, Hajinezhad et al. (2016) utilize a primal-dual based method for optimizing a finite-sum objective function and prove that it is possible to achieve a speedup corresponding to the number of workers. In Xu et al. (2017), an adaptive consensus ADMM is proposed, and Goldstein et al. (2016) study the performance of transpose ADMM on an entire distributed dataset.
2.2 Gossip-like Communication
Closest to this paper is a line of work considering gossip-like communication patterns for distributed learning. Specifically, Jin et al. (2016) propose to scale the gradient aggregation process via a gossip-like mechanism. Blot et al. (2016) consider a more radical approach, called GoSGD, where each worker exchanges gradients with a random subset of other workers in each round. They show that GoSGD can be faster than Elastic Averaging SGD (Zhang et al., 2015) on CIFAR-10, but provide no large-scale experiments or theoretical justification. Recently, Daily et al. (2018) proposed GossipGrad, a more complex gossip-based scheme with upper bounds on the time for workers to communicate indirectly, periodic rotation of partners, and shuffling of the input data, which provides strong empirical results on large-scale deployments. The authors also provide an informal justification for why GossipGrad should converge. Despite the promising empirical results, very little is known in terms of convergence guarantees.
2.3 AllReduce SGD and Decentralized Learning
To overcome the limitations of the parameter-server model (Li et al., 2014), recent research, e.g., Iandola et al. (2016), has explored communication patterns which do not depend on the existence of a coordinator node. These references typically consider an all-to-all topology, where each worker in the network can communicate reliably with all others.
Another direction of related work considers decentralized learning over fixed, but not fully connected, graph topologies. A recent result by Lian et al. (2017a) provided strong convergence bounds for an algorithm similar to the one we are considering, in a setting where the communication graph is fixed and regular. Tang et al. (2018b) study a new approach that admits better performance than decentralized SGD when the data among workers is very different. Shen et al. (2018a) generalize the decentralized optimization problem to a monotone operator. Here, we consider random, dynamically changing topologies, and therefore require a different analytic approach.
2.4 Random Topology Decentralized Algorithms
In Boyd et al. (2006), a randomized decentralized SGD is studied. The weight matrix for randomized algorithms can be time-varying, which means workers are allowed to change the communication network based on the availability of the network. However, most of the previous work (Li and Zhang, 2010; Lobel and Ozdaglar, 2011) assumes that the randomized network is doubly stochastic. This assumption is not satisfied in our setting. Recently, Nedic et al. (2017); Nedić and Olshevsky (2015) relax this assumption and consider the situation where the communication network is directed, but the row sums of the weight matrix are still required to be 1. Those works focus on the case where the time-varying communication pattern is pre-designed, which does not extend to our setting.
In this paper, we consider a general communication model, which covers both the Parameter Server (Li et al., 2014) and AllReduce (Seide and Agarwal, 2016) distribution strategies. We specifically include the uncertainty of the network in our theoretical analysis, which is the first that does not require a doubly stochastic communication matrix. In addition, our analysis highlights the fact that the system can handle additional packet drops as we increase the number of worker nodes.
3 Problem Setup
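We consider the following distributed optimization problem:

$$\min_{x} \; f(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\xi \sim \mathcal{D}_i} F_i(x; \xi), \tag{1}$$

where $n$ is the number of workers, $\mathcal{D}_i$ is the local data distribution for worker $i$ (in other words, we do not assume that all nodes can access the same data set), and $F_i(x; \xi)$ is the local loss function of model $x$ given data $\xi$ for worker $i$.
Unreliable Network Connection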
Nodes can communicate with all other workers, but with packet drop rate $p$ (we do not use the commonly used phrase “packet loss rate” because we use “loss” to refer to the loss function). That is, whenever any node forwards models or data to any other worker, the destination worker fails to receive it with probability $p$. For simplicity, we assume that all packet drop events are independent and that they occur with the same probability $p$.
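The drop model above can be simulated in a few lines. The following is a minimal sketch, not part of the original implementation: the names `send` and `delivery_fraction` are hypothetical, and `drop_prob` plays the role of the packet drop rate. Each message is dropped independently with the same probability, as assumed above.

```python
import random


def send(message, drop_prob, rng):
    """Return the message, or None if the unreliable channel drops it."""
    # rng.random() is uniform on [0, 1), so the drop happens with probability drop_prob.
    return None if rng.random() < drop_prob else message


def delivery_fraction(num_messages, drop_prob, seed=0):
    """Empirical fraction of messages delivered over independent sends."""
    rng = random.Random(seed)
    delivered = sum(
        send(k, drop_prob, rng) is not None for k in range(num_messages)
    )
    return delivered / num_messages
```

With a drop probability of 0.1, the delivered fraction concentrates around 0.9 as the number of messages grows.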
Definitions and notations
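Throughout, we use the following notation and definitions:

$\nabla f(\cdot)$ denotes the gradient of the function $f$.

$\lambda_i(\cdot)$ is the $i$th largest eigenvalue of a matrix.

$\mathbf{1}_n = [1, 1, \dots, 1]^\top \in \mathbb{R}^n$ is the full-one vector.

$A_n := \mathbf{1}_n \mathbf{1}_n^\top / n$ denotes the all $1/n$'s $n$ by $n$ matrix.

$\|\cdot\|$ denotes the $\ell_2$ norm for vectors.

$\|\cdot\|_F$ denotes the Frobenius norm of matrices.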
4 Algorithm
In this section, we describe the standard RPS algorithm in detail, followed by its interpretation from a global view.
4.1 The RPS Algorithm
In the RPS algorithm, each worker maintains an individual local model. We use $x_t^{(i)}$ to denote the local model on worker $i$ at time step $t$. At each iteration $t$, each worker first performs a regular SGD step

$$v_t^{(i)} = x_t^{(i)} - \gamma \nabla F_i\big(x_t^{(i)}; \xi_t^{(i)}\big),$$

where $\gamma$ is the learning rate and $\xi_t^{(i)}$ are the data samples of worker $i$ at iteration $t$.
We would like to reliably average the vector $v_t^{(i)}$ among all workers, via the RPS procedure. In brief, the RS step performs communication-efficient model averaging, and the AG step performs communication-efficient model sharing.
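As a minimal illustration of the local update, the sketch below applies one SGD step to a model stored as a Python list. The toy quadratic loss and the helper names `sgd_step` and `quadratic_grad` are assumptions made for illustration only; they do not come from the paper's implementation.

```python
def sgd_step(model, grad, learning_rate):
    """Return model - learning_rate * grad, elementwise, as a new list."""
    return [x - learning_rate * g for x, g in zip(model, grad)]


def quadratic_grad(model, sample):
    """Gradient of the toy loss 0.5 * ||model - sample||^2 (illustrative stand-in)."""
    return [x - s for x, s in zip(model, sample)]


# One local step on a three-dimensional model with a single data sample.
x = [1.0, -2.0, 0.5]
sample = [0.0, 0.0, 0.0]
v = sgd_step(x, quadratic_grad(x, sample), learning_rate=0.5)
```

The resulting vector `v` is the intermediate model that the communication steps then try to average across workers.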
The ReduceScatter (RS) step:
In this step, each worker $i$ divides its updated model $v_t^{(i)}$ into $n$ equally sized blocks:

$$v_t^{(i)} = \Big( \big(v_t^{(i,1)}\big)^\top, \big(v_t^{(i,2)}\big)^\top, \dots, \big(v_t^{(i,n)}\big)^\top \Big)^\top. \tag{2}$$

The reason for this division is to reduce the communication cost and parallelize model averaging, since we assign each worker to average only one of those blocks. For example, one worker can be assigned to average the first block while another might be assigned the third block. For simplicity, we proceed with our discussion in the case where worker $i$ is assigned to average the $i$th block.
After the division, each worker $j$ sends its $i$th block $v_t^{(j,i)}$ to worker $i$, for every $i$. Once it has received those blocks, each worker averages all the blocks it received. As noted, some packets might be dropped. In this case, worker $i$ averages the blocks that arrived using

$$\tilde{v}_t^{(i)} = \frac{1}{|\mathcal{N}_t^{(i)}|} \sum_{j \in \mathcal{N}_t^{(i)}} v_t^{(j,i)},$$

where $\mathcal{N}_t^{(i)}$ is the set of workers whose packets worker $i$ receives (including itself).
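The RS step with packet drops can be sketched as follows. This is a simplified single-process simulation under stated assumptions, not the MPI implementation used in the experiments: models are plain Python lists, worker `i` is responsible for block `i`, a worker's own block always reaches itself, and the function names are hypothetical.

```python
import random


def split_blocks(vector, n):
    """Split a list into n equally sized contiguous blocks."""
    if len(vector) % n != 0:
        raise ValueError("vector length must be divisible by n")
    size = len(vector) // n
    return [vector[k * size:(k + 1) * size] for k in range(n)]


def reduce_scatter(models, drop_prob, rng):
    """Worker i averages the i-th blocks that reach it.

    Returns (averaged, received_sets): averaged[i] is the average held by
    worker i, and received_sets[i] lists the senders whose block arrived.
    """
    n = len(models)
    blocks = [split_blocks(m, n) for m in models]  # blocks[j][i]: worker j's i-th block
    averaged, received_sets = [], []
    for i in range(n):
        # Worker i's own block never crosses the network, so it always counts.
        received = [j for j in range(n) if j == i or rng.random() >= drop_prob]
        size = len(blocks[i][i])
        avg = [
            sum(blocks[j][i][d] for j in received) / len(received)
            for d in range(size)
        ]
        averaged.append(avg)
        received_sets.append(received)
    return averaged, received_sets
```

Because the divisor is the number of blocks that actually arrived, the result is always a proper average of the received blocks rather than a sum scaled by the total number of workers.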
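The AllGather (AG) step:
After computing $\tilde{v}_t^{(i)}$, each worker $i$ attempts to broadcast $\tilde{v}_t^{(i)}$ to all other workers, which use the averaged blocks to recover the averaged original vector by concatenation:

$$x_{t+1}^{(i)} = \Big( \big(\tilde{v}_t^{(1)}\big)^\top, \big(\tilde{v}_t^{(2)}\big)^\top, \dots, \big(\tilde{v}_t^{(n)}\big)^\top \Big)^\top.$$

Note that some workers in the network may fail to receive some of the averaged blocks. In this case, they just use the original block. Formally,

$$x_{t+1}^{(i)} = \Big( \big(x_{t+1}^{(i,1)}\big)^\top, \big(x_{t+1}^{(i,2)}\big)^\top, \dots, \big(x_{t+1}^{(i,n)}\big)^\top \Big)^\top, \tag{3}$$

where

$$x_{t+1}^{(i,j)} = \begin{cases} \tilde{v}_t^{(j)} & \text{if } j \in \widetilde{\mathcal{N}}_t^{(i)}, \\ v_t^{(i,j)} & \text{otherwise,} \end{cases}$$

and $\widetilde{\mathcal{N}}_t^{(i)}$ denotes the set of averaged blocks that worker $i$ receives. Each worker thus simply replaces the corresponding blocks of $v_t^{(i)}$ with the averaged blocks it received. The complete algorithm is summarized in Algorithm 1.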
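The fallback rule of the AG step can be sketched in the same simplified setting. The function name `all_gather` and its argument layout are hypothetical, and we assume that a worker always has access to the averaged block it computed itself.

```python
import random


def all_gather(local_blocks, averaged_blocks, drop_prob, rng):
    """Rebuild each worker's model from averaged blocks, falling back on drops.

    local_blocks[i][j]: worker i's own j-th block after its local SGD step.
    averaged_blocks[j]: the average held by worker j after the RS step.
    """
    n = len(local_blocks)
    new_models = []
    for i in range(n):
        model = []
        for j in range(n):
            # The broadcast from worker j to worker i is lost with probability drop_prob.
            arrived = (j == i) or rng.random() >= drop_prob
            model.extend(averaged_blocks[j] if arrived else local_blocks[i][j])
        new_models.append(model)
    return new_models
```

With no drops every worker ends up with the same concatenated average; with drops, workers can disagree on exactly those blocks whose broadcast they missed.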
4.2 RPS From a Global Viewpoint
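At time step $t$, the $j$th block of worker $i$’s local model, that is, $x_{t+1}^{(i,j)}$, is a linear combination of the $j$th blocks of all workers’ intermediate models $v_t^{(k,j)}$ ($k \in \{1, \dots, n\}$),

$$X_{t+1}^{(j)} = V_t^{(j)} W_t^{(j)}, \tag{4}$$

where

$$X_{t+1}^{(j)} := \big( x_{t+1}^{(1,j)}, x_{t+1}^{(2,j)}, \dots, x_{t+1}^{(n,j)} \big), \qquad V_t^{(j)} := \big( v_t^{(1,j)}, v_t^{(2,j)}, \dots, v_t^{(n,j)} \big),$$

and $W_t^{(j)}$ is the coefficient matrix indicating the communication outcome at time step $t$. The $(m,k)$th element of $W_t^{(j)}$ is denoted by $[W_t^{(j)}]_{m,k}$. $[W_t^{(j)}]_{m,k} \neq 0$ means that worker $k$ receives worker $m$’s individual $j$th block (that is, $v_t^{(m,j)}$), whereas $[W_t^{(j)}]_{m,k} = 0$ means that the packet was dropped either in the RS step (worker $m$ fails to send) or in the AG step (worker $k$ fails to receive). So $W_t^{(j)}$ is time-varying because of the randomness of the packet drops. Also, $W_t^{(j)}$ is not doubly stochastic (in general) because the packet drops in the RS step and the AG step are independent.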
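To make the structure of the coefficient matrix concrete, the sketch below builds it for a single block from one communication outcome. The function `mixing_matrix` and its arguments are hypothetical; the indexing convention (entry `W[m][k]` is the weight of worker `m`'s block in worker `k`'s updated block) is our reading of the description above.

```python
def mixing_matrix(n, owner, rs_received, ag_received):
    """Build the n-by-n coefficient matrix for the block averaged by `owner`.

    rs_received: workers whose copy of the block reached `owner` in the RS step.
    ag_received: workers that received the averaged block in the AG step.
    Both sets must contain `owner`.
    """
    if owner not in rs_received or owner not in ag_received:
        raise ValueError("the owner always holds its own block and its own average")
    W = [[0.0] * n for _ in range(n)]
    for k in range(n):
        if k in ag_received:
            # Worker k adopts the average of the blocks that reached the owner.
            for m in rs_received:
                W[m][k] = 1.0 / len(rs_received)
        else:
            # Worker k missed the broadcast and keeps its own block.
            W[k][k] = 1.0
    return W


# Three workers; worker 2's block is lost in RS, worker 1 misses the AG broadcast.
W = mixing_matrix(n=3, owner=0, rs_received={0, 1}, ag_received={0, 2})
column_sums = [sum(W[m][k] for m in range(3)) for k in range(3)]
row_sums = [sum(W[m]) for m in range(3)]
```

In this example every column sums to 1, since each updated block is an average of received blocks, but the row sums are 1, 2, and 0, which illustrates why the matrix is not doubly stochastic in general.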
In fact, it can be shown that all ’s () satisfy the following properties
(5)  
(6) 
for some constants and satisfying in Lemmas 6, 7, and 8 respectively (see Supplementary Material for proof details) to make the algorithm convergent.
Simply speaking, we have and with regards to . It can be proved that is always larger than . Our result also indicates that and when and and when , which proves the tightness of our bound for and . Here we plot and in Figure 4.2 and Figure 4.2 . Detailed discussion is included in Section D in Supplementary Material.
5 Theoretical Guarantees and Discussion
Below we show that, for certain parameter values, RPS under unreliable communication admits the same convergence rate as the standard algorithms. In other words, the impact of network unreliability may be seen as negligible.
We begin by stating our analytic assumptions, which are commonly used in analyzing stochastic optimization algorithms.
Assumption 1.
We make the following commonly used assumptions:

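Lipschitzian gradient: Each worker's expected local loss $f_i(\cdot)$ has an $L$-Lipschitzian gradient, which means $\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|$ for all $x$ and $y$.

Start from 0: We assume that every local model is initialized to zero, for simplicity and without loss of generality.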
Next we are ready to show our main result.
Theorem 1 (Convergence of Algorithm 1).
To make the result clearer, we choose the learning rate appropriately as follows:
Corollary 2.
We discuss our theoretical results below.

(Linear Speedup) The leading term of the convergence rate improves with the number of workers, which suggests that our algorithm admits the linear speedup property with respect to the number of workers $n$.

(Better performance for larger networks) For a fixed packet drop rate (whose effect enters the bound implicitly; see Section D), the convergence rate improves as the network grows, because of how the leading terms depend on the number of workers. This indicates that the effect of the packet drop rate diminishes as we increase the number of workers and parameter servers.
6 Experiments: Convergence of RPS
We now validate empirically the scalability and accuracy of the RPS algorithm, given reasonable message arrival rates.
6.1 Experimental Setup
Datasets and models
We evaluate our algorithm on two state-of-the-art machine learning tasks: (1) image classification and (2) natural language understanding (NLU). We train ResNet (He et al., 2016) with different numbers of layers on CIFAR-10 (Krizhevsky and Hinton, 2009) for classifying images. We perform the NLU task on the Air Travel Information System (ATIS) corpus with a one-layer LSTM network.
Implementation
We simulate packet losses by adapting the latest version (2.5) of the Microsoft Cognitive Toolkit (CNTK) (Seide and Agarwal, 2016). We implement the RPS algorithm using MPI. During training, we use a local batch size of 32 samples per worker for image classification. We adjust the learning rate by applying a linear scaling rule (Goyal et al., 2017) and a decay of 10 percent after 80 and 120 epochs, respectively. To achieve the best possible convergence, we apply a gradual warmup strategy (Goyal et al., 2017) during the first 5 epochs. We deliberately do not use any regularization or momentum during the experiments, in order to be consistent with the described algorithm and proof. The NLU experiments are conducted with the default parameters given by the CNTK examples, scaling the learning rate accordingly and omitting momentum and regularization terms on purpose. The models are trained on 16 NVIDIA TITAN Xp GPUs, with each GPU acting as one worker. The workers are connected by Gigabit Ethernet. We describe the results in terms of training loss convergence, although the validation trends are similar.
Convergence of Image Classification
We perform convergence tests using the analyzed algorithm, model averaging SGD, on both ResNet-110 and ResNet-20 with CIFAR-10. Figure 2(a,b) shows the result. We vary the probability of each packet being correctly delivered at each worker among 80%, 90%, 95% and 99%. The baseline is a 100% message delivery rate. The baseline achieves a training loss of 0.02 using ResNet-110 and 0.09 using ResNet-20. Dropping 1% of packets does not increase the training loss achieved after 160 epochs. With 5% dropped, the training loss is identical on ResNet-110 and increases by 0.02 on ResNet-20. A 90% probability of arrival leads to a training loss of 0.03 for ResNet-110 and 0.11 for ResNet-20.
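The learning-rate schedule used for these image classification runs (linear scaling, 5 warmup epochs, decays after epochs 80 and 120) can be sketched as below. This is an illustrative reading of the setup, not the authors' code: `base_lr` is a hypothetical single-worker learning rate, and `decay_factor=0.1` is an assumption, since the text does not pin down the exact decay multiplier.

```python
def learning_rate(epoch, base_lr, num_workers, warmup_epochs=5,
                  decay_epochs=(80, 120), decay_factor=0.1):
    """Per-epoch learning rate: linear scaling, gradual warmup, step decay."""
    # Linear scaling rule: the target rate grows with the number of workers.
    target = base_lr * num_workers
    if epoch < warmup_epochs:
        # Gradual warmup: ramp linearly from base_lr up to the scaled target.
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    # Step decay: apply decay_factor once for every milestone already passed.
    passed = sum(epoch >= e for e in decay_epochs)
    return target * decay_factor ** passed
```

For example, with 16 workers and a hypothetical `base_lr` of 0.1, the rate ramps from 0.1 to 1.6 over the first 5 epochs and is then multiplied by the decay factor at epochs 80 and 120.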
Convergence of NLU
We perform full convergence tests for the NLU task on the ATIS corpus with a single-layer LSTM. Figure 2(c) shows the result. The baseline achieves a training loss of 0.01. Dropping 1, 5 or 10 percent of the communicated partial vectors results in an increase of 0.01 in training loss.
Comparison with Gradient Averaging
We conduct experiments with an identical setup and a 99 percent probability of arrival using a gradient averaging method instead of model averaging. When running data-distributed SGD, gradient averaging is the most widely used technique in practice, and is implemented by default in most deep learning frameworks (Abadi et al., 2016; Seide and Agarwal, 2016). As expected, the baseline (all transmissions are successful) converges to the same training loss as its model averaging counterpart when momentum and regularization terms are omitted. As seen in Figure 3(a,b), a communication loss of even 1 percent results in worse convergence in terms of accuracy for both ResNet architectures on CIFAR-10. This behavior is not visible in the NLU examples. The reason for achieving similar training loss when dropping gradients randomly in this case most probably lies in the naturally sparse nature of the gradients and model for this task. Nevertheless, this insight suggests that one should favor a model averaging algorithm over gradient averaging if the underlying network connection is unreliable.
7 Case Study: Speeding up Co-located Applications
Our results on the resilience of distributed learning to losses of model updates open up an interesting use case. That model updates can be lost (within some tolerance) without deteriorating model convergence implies that model updates transmitted over the physical network can be de-prioritized compared to other, more “inflexible,” delay-sensitive traffic, such as that of Web services. Thus, we can co-locate other applications with the training workloads, and reduce infrastructure costs for running them. Equivalently, workloads that are co-located with learning workers can benefit from prioritized network traffic (at the expense of some model update losses), and thus achieve lower latency.
To demonstrate this in practice, we perform a packet simulation over 16 servers, each connected with a Gbps link to a network switch. Over this network of servers, we run two workloads: (a) replaying traces from the machine learning process of ResNet-110 on CIFAR-10 (which translates to a load of 2.4 Gbps), which is sent unreliably, and (b) a simple emulated Web service running on all servers. Web services often produce significant background traffic between servers within the data center, consisting typically of small messages fetching distributed pieces of content to compose a response (e.g., a Google query response potentially consists of advertisements, search results, and images). We emulate this intra-data-center traffic for the Web service as all-to-all traffic between these servers, with small, KB-sized messages (a reasonable size for such services) sent reliably between these servers. Message arrivals follow a Poisson process, parametrized by the expected message rate, $\lambda$ (aggregated across the servers).
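The message arrival pattern of the emulated Web service can be generated as follows. This is a minimal sketch rather than the simulator used in the case study: the function name and the parameters `rate_per_sec` and `duration_sec` are hypothetical. It relies on the fact that a Poisson process with a given rate has independent, exponentially distributed inter-arrival times.

```python
import random


def poisson_arrival_times(rate_per_sec, duration_sec, seed=0):
    """Arrival timestamps of a Poisson process with the given rate on [0, duration)."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        # Inter-arrival gaps are exponential with mean 1 / rate_per_sec.
        t += rng.expovariate(rate_per_sec)
        if t >= duration_sec:
            return arrivals
        arrivals.append(t)
```

Over a window of 50 seconds at 100 messages per second, the number of generated arrivals is close to 5000, matching the expected message rate times the duration.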
Different degrees of prioritization of the Web service traffic over learning traffic result in different degrees of loss in learning updates transmitted over the network. As the Web service is prioritized to a greater extent, its performance improves: its message exchanges take less time. We refer to this reduction in (average) completion time for these messages as a speedup. Note that even small speedups are significant for such services; for example, Google actively pursues minimizing its Web services’ response latency. An alternative method of quantifying the benefit for the co-located Web service is to measure how many additional messages the Web service can send while maintaining a fixed average completion time. This translates to running more Web service queries and achieving more throughput over the same infrastructure, thus reducing the cost per request or message.
Fig. 7 and Fig. 7 show results for the above described Web service speedup and cost reduction, respectively. In Fig. 7, the arrival rate of Web service messages, $\lambda$, is held fixed. As the network prioritizes the Web service more and more over learning update traffic, more learning traffic suffers losses, but performance for the Web service improves. Even a small loss rate for learning updates allows the Web service to be sped up noticeably.
In Fig. 7, we set a target average transmission time (in ms) for the Web service’s messages, and increase the message arrival rate, $\lambda$, thus causing more and more losses for learning updates. Accommodating a higher $\lambda$ over the same infrastructure translates to a lower cost of running the Web service.
Thus, tolerating small amounts of loss in model update traffic can result in significant benefits for colocated services, while not deteriorating convergence.
8 Conclusion
We present a novel analysis for a general model of distributed machine learning, under a realistic unreliable communication model. Mathematically, the problem we consider is that of decentralized, distributed learning over a randomized topology. We present a novel theoretical analysis for such a scenario, and evaluate it while training neural networks on both image and natural language datasets. We also provide a case study of application co-location, to illustrate the potential benefit of allowing learning algorithms to take advantage of unreliable communication channels.
References
 Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 Alistarh et al. (2018) Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. CoRR, abs/1803.08917, 2018.
 Blanchard et al. (2017) Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, pages 119–129, 2017.
 Blot et al. (2016) Michael Blot, David Picard, Matthieu Cord, and Nicolas Thome. Gossip training for deep learning. arXiv preprint arXiv:1611.09726, 2016.
 Boyd et al. (2006) Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE transactions on information theory, 52(6):2508–2530, 2006.
 Colin et al. (2016) Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. arXiv preprint arXiv:1606.02421, 2016.
 Daily et al. (2018) Jeff Daily, Abhinav Vishnu, Charles Siegel, Thomas Warfel, and Vinay Amatya. Gossipgrad: Scalable deep learning using gossip communication based asynchronous gradient descent. arXiv preprint arXiv:1803.05880, 2018.
 Goldstein et al. (2016) Tom Goldstein, Gavin Taylor, Kawika Barabin, and Kent Sayre. Unwrapping admm: efficient distributed computing via transpose reduction. In Artificial Intelligence and Statistics, pages 1151–1158, 2016.
 Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 Hajinezhad et al. (2016) Davood Hajinezhad, Mingyi Hong, Tuo Zhao, and Zhaoran Wang. Nestt: A nonconvex primal-dual splitting method for distributed and stochastic optimization. In Advances in Neural Information Processing Systems, pages 3215–3223, 2016.

 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 He et al. (2018) Lie He, An Bian, and Martin Jaggi. Cola: Decentralized linear learning. 08 2018. URL https://arxiv.org/pdf/1808.04883.
 Iandola et al. (2016) Forrest N Iandola, Matthew W Moskewicz, Khalid Ashraf, and Kurt Keutzer. Firecaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2592–2600, 2016.
 Jin et al. (2016) Peter H Jin, Qiaochu Yuan, Forrest Iandola, and Kurt Keutzer. How to scale distributed deep learning? arXiv preprint arXiv:1611.04581, 2016.
 Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
 Lan et al. (2017) Guanghui Lan, Soomin Lee, and Yi Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. arXiv preprint arXiv:1701.03961, 2017.
 Leblond et al. (2016) Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. Asaga: asynchronous parallel saga. arXiv preprint arXiv:1606.04809, 2016.
 Li et al. (2014) Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598, 2014.
 Li and Zhang (2010) Tao Li and Ji-Feng Zhang. Consensus conditions of multi-agent systems with time-varying topologies and stochastic communication noises. IEEE Transactions on Automatic Control, 55(9):2043–2057, 2010.
 Lian et al. (2015) Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
 Lian et al. (2017a) Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5336–5346, 2017a.
 Lian et al. (2017b) Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017b.
 Lin et al. (2018) Tao Lin, Sebastian U. Stich, and Martin Jaggi. Don’t use large minibatches, use local sgd. 08 2018. URL https://arxiv.org/pdf/1808.07217.
 Lobel and Ozdaglar (2011) Ilan Lobel and Asuman Ozdaglar. Distributed subgradient methods for convex optimization over random networks. IEEE Transactions on Automatic Control, 56(6):1291, 2011.
 McMahan et al. (2016) H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
 Nedić and Olshevsky (2015) Angelia Nedić and Alex Olshevsky. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.
 Nedic et al. (2017) Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.
 Renggli et al. (2018) Cédric Renggli, Dan Alistarh, and Torsten Hoefler. Sparcml: High-performance sparse communication for machine learning. arXiv preprint arXiv:1802.08021, 2018.
 Scaman et al. (2017) Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. arXiv preprint arXiv:1702.08704, 2017.
 Seide and Agarwal (2016) Frank Seide and Amit Agarwal. Cntk: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2135–2135. ACM, 2016.
 Shen et al. (2018a) Zebang Shen, Aryan Mokhtari, Tengfei Zhou, Peilin Zhao, and Hui Qian. Towards more efficient stochastic decentralized learning: Faster convergence and sparse communication. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4624–4633, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018a. PMLR. URL http://proceedings.mlr.press/v80/shen18a.html.
 Shen et al. (2018b) Zebang Shen, Aryan Mokhtari, Tengfei Zhou, Peilin Zhao, and Hui Qian. Towards more efficient stochastic decentralized learning: Faster convergence and sparse communication. arXiv preprint arXiv:1805.09969, 2018b.
 Sirb and Ye (2016) Benjamin Sirb and Xiaojing Ye. Consensus optimization with delayed and stochastic gradients on decentralized networks. In Big Data (Big Data), 2016 IEEE International Conference on, pages 76–85. IEEE, 2016.
 Sra et al. (2015) Suvrit Sra, Adams Wei Yu, Mu Li, and Alexander J Smola. Adadelay: Delay adaptive distributed stochastic convex optimization. arXiv preprint arXiv:1508.05003, 2015.
 Stich (2018) Sebastian U. Stich. Local sgd converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
 Stich et al. (2018) Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. 09 2018. URL https://arxiv.org/pdf/1809.07599.
 Tang et al. (2018a) Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. Communication compression for decentralized training. 03 2018a. URL https://arxiv.org/pdf/1803.06443.
 Tang et al. (2018b) Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D2: Decentralized training over decentralized data. arXiv preprint arXiv:1803.07068, 2018b.
 Wang et al. (2016) Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Efficient distributed learning with sparsity. arXiv preprint arXiv:1605.07991, 2016.
 Wen et al. (2017) Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017.
 Wu et al. (2018) Jiaxiang Wu, Weidong Huang, Junzhou Huang, and Tong Zhang. Error compensated quantized sgd and its applications to large-scale distributed optimization. arXiv preprint arXiv:1806.08054, 2018.
 Xu et al. (2017) Zheng Xu, Gavin Taylor, Hao Li, Mario Figueiredo, Xiaoming Yuan, and Tom Goldstein. Adaptive consensus admm for distributed optimization. arXiv preprint arXiv:1706.02869, 2017.
 Yin et al. (2018) Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In ICML, 2018.
 Zhang et al. (2013) Shanshan Zhang, Ce Zhang, Zhao You, Rong Zheng, and Bo Xu. Asynchronous stochastic gradient descent for dnn training. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6660–6663. IEEE, 2013.
 Zhang et al. (2015) Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging sgd. In Advances in Neural Information Processing Systems, pages 685–693, 2015.
 Zhou et al. (2018) Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Peter Glynn, Yinyu Ye, Li-Jia Li, and Li Fei-Fei. Distributed asynchronous optimization with unbounded delays: How slow can you go? In International Conference on Machine Learning, pages 5965–5974, 2018.
Supplemental Materials
Appendix A Notations
To unify notation, we define the following gradient-related quantities:
We define as the identity matrix, as and as . Also, we suppose the packet drop rate is .
The following equation is used frequently:
(8) 
A.1 Matrix Notations
We aggregate vectors into matrices and use matrix notation to simplify the proof.
A.2 Averaged Notations
We define averaged vectors as follows:
(9)  
(10)  
(11)  
A.3 Block Notations
A.4 Aggregated Block Notations
We now define some additional notation used throughout the following proof:
A.5 Relations between Notations
We have the following relations between these notations:
(12)  
(13)  
(14)  
(15)  
(16)  
(17) 
A.6 Expectation Notations
Expectations in the proof are taken under different conditions, which we list below:
Denote taking the expectation over the stochastic gradient computation at the th iteration, conditioned on the history information before the th iteration.
Denote taking the expectation over the packet drops that occur when sending and receiving blocks at the th iteration, conditioned on the history information before the th iteration and on the SGD procedure at the th iteration.
Denote taking the expectation over all procedures during the th iteration, conditioned on the history information before the th iteration.
Denote taking the expectation over all history information.
Appendix B Proof to Theorem 1
The critical requirement for a decentralized algorithm to succeed is that the local model on each node converges to the average model. We summarize this critical property in the next lemma.
We prove this critical property first. Then, after proving several auxiliary lemmas, we prove the final theorem. Throughout the proof, we use the properties of the weighted matrix, which are shown in Section D.
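To build intuition for the consensus property above, the following is a minimal simulation sketch, not the RPS implementation analyzed in this paper. It assumes scalar blocks, omits the local SGD step, and makes two illustrative assumptions about drop handling: in the reduce-scatter phase the owner of a block averages only the copies that arrive, and in the all-gather phase a worker that misses a broadcast keeps its stale value. The names `rps_round`, `max_disagreement`, and `drop_prob` are hypothetical.

```python
import random


def rps_round(models, drop_prob, rng):
    """One communication round over an unreliable network (illustrative only).

    models[i][j] is worker i's local copy of block j; there are n workers
    and n blocks, and worker j is the owner of block j.
    """
    n = len(models)

    # Reduce-scatter: the owner of block j averages the copies that arrive.
    # Assumption: the owner's own copy is always available.
    averaged = []
    for j in range(n):
        received = [models[j][j]]
        for i in range(n):
            if i != j and rng.random() >= drop_prob:
                received.append(models[i][j])
        averaged.append(sum(received) / len(received))

    # All-gather: each owner broadcasts its averaged block.
    # Assumption: on a dropped packet the receiver keeps its stale value.
    new_models = []
    for i in range(n):
        row = []
        for j in range(n):
            if i == j or rng.random() >= drop_prob:
                row.append(averaged[j])
            else:
                row.append(models[i][j])
        new_models.append(row)
    return new_models


def max_disagreement(models):
    """Largest spread of any block across the workers' local models."""
    n = len(models)
    return max(
        max(m[j] for m in models) - min(m[j] for m in models)
        for j in range(n)
    )


if __name__ == "__main__":
    rng = random.Random(0)
    models = [[float(3 * i + j) for j in range(3)] for i in range(3)]
    for step in range(10):
        models = rps_round(models, drop_prob=0.3, rng=rng)
        print(step, max_disagreement(models))
```

Under these assumptions every new local value is either a stale value or an average of old values for the same block, so the spread of each block across workers cannot grow, up to floating-point rounding. The sketch says nothing about convergence rates, which are the subject of the analysis that follows.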