1 Introduction
When training machine learning models in a distributed fashion, the underlying constraints on how workers (or nodes) communicate have a significant impact on the training algorithm. When workers cannot form a fully connected communication topology or the communication latency is high (e.g., in sensor networks or mobile networks), decentralizing the communication comes to the rescue. On the other hand, when the amount of data sent through the network is an optimization objective (e.g., to lower the cost or energy consumption), or the network bandwidth is low, compressing the traffic, either via sparsification (Wangni et al., 2017; Konečnỳ and Richtárik, 2016) or quantization (Zhang et al., 2017a; Suresh et al., 2017), is a popular strategy. Our goal in this paper is to develop a novel framework that works robustly in environments where both decentralization and communication compression could be beneficial. We focus on quantization, the process of lowering the precision of data representation, often in a stochastically unbiased way; the same techniques would apply to other unbiased compression schemes such as sparsification.
Both decentralized training and quantized (or, more generally, compressed) training have attracted intense interest recently (Yuan et al., 2016; Zhao and Song, 2016; Lian et al., 2017a; Konečnỳ and Richtárik, 2016; Alistarh et al., 2017). Decentralized algorithms usually exchange local models among nodes, which consumes the main communication budget; quantized algorithms, on the other hand, usually exchange quantized gradients and update an unquantized model. A straightforward idea to combine the two is to directly quantize the models sent through the network during decentralized training. However, this simple strategy does not converge to the right solution, because the quantization error accumulates during training. The technical contribution of this paper is to develop novel algorithms that combine decentralized training and quantized training.
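Why the naive strategy fails can be illustrated with a toy simulation (a sketch under our own assumptions: a 4-node ring, scalar models, and the helper names `quantize` and `consensus` are ours, not the paper's setup). Even with an unbiased quantizer, fresh quantization noise is injected at every gossip round, so the network average performs a random walk instead of settling at the true mean:

```python
import numpy as np

def quantize(x, levels=4):
    """Unbiased stochastic quantization onto a grid of spacing 1/levels."""
    s = x * levels
    low = np.floor(s)
    # round up with probability equal to the fractional part, so E[Q(x)] = x
    up = np.random.rand(*x.shape) < (s - low)
    return (low + up) / levels

def consensus(x0, rounds, compress=False):
    """Decentralized averaging on a 4-node ring: x <- W @ (possibly quantized) x."""
    n = len(x0)
    W = np.zeros((n, n))
    for i in range(n):  # ring topology: self plus two neighbors, weight 1/3 each
        W[i, [i, (i - 1) % n, (i + 1) % n]] = 1.0 / 3
    x = x0.copy()
    for _ in range(rounds):
        msg = quantize(x) if compress else x
        x = W @ msg
    return x

np.random.seed(0)
x0 = np.random.rand(4)
exact = consensus(x0, rounds=200)                  # reaches exact consensus
noisy = consensus(x0, rounds=200, compress=True)   # quantization noise accumulates
print(np.abs(exact - x0.mean()).max())
print(np.abs(noisy - x0.mean()).max())
```

The uncompressed run converges to the true mean up to machine precision, while the compressed run stays bounded away from it: unbiasedness alone does not prevent the accumulated noise from moving the consensus value.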
Problem Formulation.
We consider the following decentralized optimization problem:

(1) $\min_{x \in \mathbb{R}^N} \; f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad f_i(x) := \mathbb{E}_{\xi \sim \mathcal{D}_i} F_i(x; \xi),$

where $n$ is the number of nodes and $\mathcal{D}_i$ is the local data distribution for node $i$. The $n$ nodes form a connected graph, and each node can only communicate with its neighbors. Here we only assume that the $f_i$'s have $L$-Lipschitzian gradients.
Summary of Technical Contributions.
In this paper, we propose two decentralized parallel stochastic gradient descent (DPSGD) algorithms: extrapolation compression DPSGD (ECDPSGD) and difference compression DPSGD (DCDPSGD). Both algorithms can be proven to converge at a rate of roughly $O\left(\frac{1}{\sqrt{nT}}\right)$, where $T$ is the number of iterations and $n$ is the number of nodes. The convergence rates are consistent with two special cases: centralized parallel stochastic gradient descent (CPSGD) and DPSGD. To the best of our knowledge, this is the first work to combine quantization algorithms and decentralized algorithms for generic optimization.
The key difference between ECDPSGD and DCDPSGD is that DCDPSGD quantizes the difference between the last two local models, while ECDPSGD quantizes an extrapolation of the last two local models. DCDPSGD admits a slightly better convergence rate than ECDPSGD when the data variation among nodes is very large. On the other hand, ECDPSGD is more robust to aggressive quantization: extremely low precision quantization can cause DCDPSGD to diverge, since DCDPSGD places a strict constraint on the quantization level. In this paper, we analyze both algorithms and empirically validate our theory. We also show that when the underlying network has both high latency and low bandwidth, both algorithms significantly outperform state-of-the-art methods. We present both algorithms because we believe both are theoretically interesting; in practice, ECDPSGD could be the more robust choice.
Definitions and notations
Throughout this paper, we use the following notations and definitions:

$\nabla f(\cdot)$ denotes the gradient of a function $f$.

$f^*$ denotes the optimal value of (1).

$\lambda_i(\cdot)$ denotes the $i$-th largest eigenvalue of a matrix.

$\mathbf{1}_n = [1, 1, \ldots, 1]^\top \in \mathbb{R}^n$ denotes the full-one vector.

$\|\cdot\|$ denotes the $\ell_2$ norm for vectors.

$\|\cdot\|_F$ denotes the Frobenius norm of matrices.

$C(\cdot)$ denotes the compression operator.

$f_i(x) := \mathbb{E}_{\xi \sim \mathcal{D}_i} F_i(x; \xi)$.
2 Related work
Stochastic gradient descent
Stochastic Gradient Descent (SGD) (Ghadimi and Lan, 2013; Moulines and Bach, 2011; Nemirovski et al., 2009), a stochastic variant of the gradient descent method, has been widely used for solving large-scale machine learning problems (Bottou, 2010). It admits the optimal $O(1/\sqrt{T})$ convergence rate for nonconvex functions.
Centralized algorithms
Centralized algorithms are a widely used scheme for parallel computation, adopted by systems such as Tensorflow (Abadi et al., 2016), MXNet (Chen et al., 2015), and CNTK (Seide and Agarwal, 2016). A central node controls all leaf nodes. In Centralized Parallel Stochastic Gradient Descent (CPSGD), the central node performs parameter updates while the leaf nodes compute stochastic gradients based on local information in parallel. In Agarwal and Duchi (2011); Zinkevich et al. (2010), the effectiveness of CPSGD is studied with latency taken into consideration. Distributed mini-batch SGD, which requires each leaf node to compute the stochastic gradient more than once before each parameter update, is studied in Dekel et al. (2012). Recht et al. (2011) proposed a variant of CPSGD, HOGWILD!, and proved that it still works even when memory is shared and the private model may be overwritten by others. Asynchronous nonconvex CPSGD optimization is studied in Lian et al. (2015). Zheng et al. (2016) proposed an algorithm to improve the performance of asynchronous CPSGD. In Alistarh et al. (2017); De Sa et al. (2017), quantized SGD is proposed to save communication cost for both convex and nonconvex objective functions. The convergence rate of CPSGD is $O(1/\sqrt{nT})$. The trade-off between the number of mini-batches and the number of local SGD steps is studied in Lin et al. (2018); Stich (2018).
Decentralized algorithms
Recently, decentralized training algorithms have attracted a significant amount of attention. Decentralized algorithms are mostly applied to solve the consensus problem (Zhang et al., 2017b; Lian et al., 2017a; Sirb and Ye, 2016), where the network topology is decentralized. A recent work shows that decentralized algorithms can outperform their centralized counterparts for distributed training (Lian et al., 2017a). The main advantage of decentralized algorithms over centralized algorithms lies in avoiding the communication traffic at the central node. In particular, decentralized algorithms can be much more efficient than centralized algorithms when the network bandwidth is small and the latency is large. Decentralized algorithms (also called gossip algorithms in some literature under certain scenarios (Colin et al., 2016)) only assume a connected computational network, without using a central node to collect information from all nodes. Each node owns its local data and can only exchange information with its neighbors. The goal is still to learn a model over all distributed data. The decentralized structure can be applied to multi-task multi-agent reinforcement learning (Omidshafiei et al., 2017; Mhamdi et al., 2017). Boyd et al. (2006) used a randomized weight matrix and studied its effectiveness in different situations. Two methods (Li et al., 2017; Shi et al., 2015) were proposed to reduce the steady-state error in decentralized gradient descent for convex optimization. Dobbe et al. (2017) applied an information-theoretic framework to decentralized analysis. The performance of decentralized algorithms depends on the second largest eigenvalue of the weight matrix. He et al. (2018) proposed a gradient-descent-based algorithm (CoLA) for decentralized learning of linear classification and regression models, and proved its convergence rate for strongly convex and general convex cases.
Decentralized parallel stochastic gradient descent
Decentralized Parallel Stochastic Gradient Descent (DPSGD) (Nedic and Ozdaglar, 2009; Yuan et al., 2016) requires each node to exchange its own stochastic gradient and update its parameters using the information it receives. In Nedic and Ozdaglar (2009), convergence for a time-varying topology was proved under the assumption that the maximum of the subgradients is bounded. In Lan et al. (2017), a new decentralized primal-dual type method with a provable computational complexity is proposed for general convex objectives. The linear speedup of DPSGD is proved in Lian et al. (2017a). The asynchronous variant of DPSGD is studied in Lian et al. (2017b).
Compression
To guarantee convergence and correctness, this paper only considers unbiased stochastic compression techniques. Existing methods include randomized quantization (Zhang et al., 2017a; Suresh et al., 2017) and randomized sparsification (Wangni et al., 2017; Konečnỳ and Richtárik, 2016). Other compression methods can be found in Kashyap et al. (2007); Lavaei and Murray (2012); Nedic et al. (2009). In Drumond et al. (2018), a compressed DNN training algorithm is proposed. In Stich et al. (2018), a centralized biased sparsified parallel SGD with memory is studied and proven to converge at essentially the rate of vanilla SGD.
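The two unbiased schemes named above, randomized quantization and randomized sparsification, can be sketched in a few lines. This is an illustrative sketch: the function names and the parameters `s` (number of quantization intervals) and `p` (keep probability) are our choices, not taken from the cited papers. The empirical means illustrate the unbiasedness property that the rest of the paper relies on:

```python
import numpy as np

def random_quantize(x, s=4):
    """Unbiased stochastic quantization onto thresholds {0, 1/s, ..., 1}.
    A value between l/s and (l+1)/s is rounded up with probability s*x - l,
    so that E[Q(x)] = x (inputs assumed normalized into [0, 1])."""
    scaled = x * s
    low = np.floor(scaled)
    round_up = np.random.rand(*x.shape) < (scaled - low)
    return np.clip((low + round_up) / s, 0.0, 1.0)

def random_sparsify(x, p=0.25):
    """Unbiased sparsification: keep x/p with probability p, else set to 0."""
    mask = np.random.rand(*x.shape) < p
    return mask * x / p

np.random.seed(2)
x = np.random.rand(10)
# empirical means over many independent compressions approach x (unbiasedness)
q_mean = np.mean([random_quantize(x) for _ in range(20000)], axis=0)
s_mean = np.mean([random_sparsify(x) for _ in range(20000)], axis=0)
print(np.abs(q_mean - x).max(), np.abs(s_mean - x).max())
```

Both operators are lossy (their outputs have nonzero variance around the input), but their expectation equals the input, which is exactly the "lossy but unbiased" property required later.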
3 Preliminary: decentralized parallel stochastic gradient descent (DPSGD)
Unlike traditional (centralized) parallel stochastic gradient descent (CPSGD), which requires a central node to compute the average value of all leaf nodes, the decentralized parallel stochastic gradient descent (DPSGD) algorithm does not need such a central node. Each node (say, node $i$) only exchanges its local model $x^{(i)}$ with its neighbors to take a weighted average $\sum_j W_{ij} x^{(j)}$, where $W_{ij} \in [0, 1]$ in general and $W_{ij} = 0$ means that node $i$ and node $j$ are not connected. At the $t$-th iteration, DPSGD consists of three steps ($i$ is the node index):
1. Each node computes the stochastic gradient $\nabla F_i(x_t^{(i)}; \xi_t^{(i)})$, where $\xi_t^{(i)}$ is a sample from its local data set and $x_t^{(i)}$ is the local model on node $i$.
2. Each node queries its neighbors' variables and updates its local model using $x_{t+\frac{1}{2}}^{(i)} = \sum_j W_{ij} x_t^{(j)}$.
3. Each node updates its local model using the stochastic gradient: $x_{t+1}^{(i)} = x_{t+\frac{1}{2}}^{(i)} - \gamma \nabla F_i(x_t^{(i)}; \xi_t^{(i)})$, where $\gamma$ is the learning rate.
To look at the DPSGD algorithm from a global view, define $X_t := [x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(n)}]$ and $G(X_t; \xi_t) := [\nabla F_1(x_t^{(1)}; \xi_t^{(1)}), \ldots, \nabla F_n(x_t^{(n)}; \xi_t^{(n)})]$;
then DPSGD can be summarized in the form $X_{t+1} = X_t W - \gamma G(X_t; \xi_t)$.
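As a sanity check, the global update $X_{t+1} = X_t W - \gamma G(X_t; \xi_t)$ can be simulated in a single process. This is a sketch under our own illustrative assumptions (quadratic local losses $f_i(x) = \frac{1}{2}\|x - a_i\|^2$, an 8-node ring, Gaussian gradient noise, and the helper `ring_matrix` are ours, not from the paper):

```python
import numpy as np

def ring_matrix(n):
    """Symmetric doubly stochastic weight matrix for a ring: self + 2 neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, [i, (i - 1) % n, (i + 1) % n]] = 1.0 / 3
    return W

np.random.seed(1)
n, d, gamma, T = 8, 5, 0.02, 800
targets = np.random.randn(n, d)          # node i holds f_i(x) = 0.5 * ||x - a_i||^2
X = np.zeros((n, d))                     # row i is node i's local model
W = ring_matrix(n)

for t in range(T):
    # stochastic gradient of f_i at the local model, with additive sampling noise
    G = (X - targets) + 0.1 * np.random.randn(n, d)
    X = W @ X - gamma * G                # gossip-average, then local SGD step

x_star = targets.mean(axis=0)            # minimizer of (1/n) * sum_i f_i
print(np.abs(X - x_star).max())          # all local models end up near x_star
```

The weighted averaging pulls the local models toward consensus while the gradient steps pull the consensus toward the global minimizer; with a constant step size the residual error stays proportional to the step size, matching the usual D-PSGD picture.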
4 Quantized, Decentralized Algorithms
We introduce two quantized decentralized algorithms that compress the information exchanged between nodes. All communication in decentralized algorithms consists of exchanging the local models $x^{(i)}$.
To reduce the communication cost, a straightforward idea is to compress the information exchanged within the decentralized network, just like centralized algorithms send compressed stochastic gradients (Alistarh et al., 2017). Unfortunately, such a naive combination does not work, even with unbiased stochastic compression and a diminishing learning rate, as shown in Figure 1. The reason can be seen from the detailed derivation (please find it in the Supplement).
Before proposing our solutions to this issue, let us first make some common optimization assumptions for analyzing decentralized stochastic algorithms (Lian et al., 2017b).
Assumption 1.
Throughout this paper, we make the following commonly used assumptions:

Lipschitzian gradient: All functions $f_i$ have $L$-Lipschitzian gradients.

Symmetric doubly stochastic matrix: The weight matrix $W$ is a real symmetric doubly stochastic matrix, that is, $W = W^\top$ and $W \mathbf{1}_n = \mathbf{1}_n$.

Spectral gap: Given the symmetric doubly stochastic matrix $W$, we define $\rho := \max\{|\lambda_2(W)|, |\lambda_n(W)|\}$ and assume $\rho < 1$.

Bounded variance: Assume the variance of the stochastic gradient is bounded: $\mathbb{E}_{\xi \sim \mathcal{D}_i} \|\nabla F_i(x; \xi) - \nabla f_i(x)\|^2 \le \sigma^2$ for all $i$ and $x$, and $\frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f(x)\|^2 \le \zeta^2$ for all $x$.

Independent and unbiased stochastic compression: The stochastic compression operation is unbiased, that is, $\mathbb{E}\, C(Z) = Z$ for any $Z$, and the stochastic compressions are independent across different workers and different time points.
The last assumption essentially restricts the compression to be lossy but unbiased. Biased stochastic compression generally makes it hard to ensure convergence, while lossless compression can be combined with any algorithm; both are beyond the scope of this paper. Commonly used unbiased stochastic compression schemes include random quantization^{1} (Zhang et al., 2017a) and sparsification^{2} (Wangni et al., 2017; Konečnỳ and Richtárik, 2016).
^{1} A real number is randomly quantized into one of the two closest thresholds. For example, given the thresholds $\{0, \frac{1}{s}, \frac{2}{s}, \ldots, 1\}$, a number $z \in [\frac{l}{s}, \frac{l+1}{s}]$ is quantized to $\frac{l+1}{s}$ with probability $sz - l$ and to $\frac{l}{s}$ with probability $1 - (sz - l)$. Here, we assume that all numbers have been normalized into the range $[0, 1]$.
^{2} A real number $z$ is set to $\frac{z}{p}$ with probability $p$ and to $0$ with probability $1 - p$.
4.1 Difference compression approach
1: Input: Initial point $x_0^{(i)} = x_0$, initial replicas $\hat{x}_0^{(i)} = x_0$, step length $\gamma$, weight matrix $W$, and number of total iterations $T$
2: for $t = 0, 1, 2, \ldots, T-1$ do
3: Randomly sample $\xi_t^{(i)}$ from the local data of the $i$-th node
4: Compute the local stochastic gradient $\nabla F_i(x_t^{(i)}; \xi_t^{(i)})$ using $\xi_t^{(i)}$ and the current optimization variable $x_t^{(i)}$
5: Update the local model using the local stochastic gradient and the weighted average of its connected neighbors' replicas (denoted by $\hat{x}_t^{(j)}$): $x_{t+\frac{1}{2}}^{(i)} = \sum_j W_{ij} \hat{x}_t^{(j)} - \gamma \nabla F_i(x_t^{(i)}; \xi_t^{(i)})$
6: Each node computes the difference $z_t^{(i)} = x_{t+\frac{1}{2}}^{(i)} - \hat{x}_t^{(i)}$ and compresses it into $C(z_t^{(i)})$
7: Update the local optimization variable: $x_{t+1}^{(i)} = \hat{x}_t^{(i)} + C(z_t^{(i)})$
8: Send $C(z_t^{(i)})$ to the connected neighbors, and update the replicas of the connected neighbors' values: $\hat{x}_{t+1}^{(j)} = \hat{x}_t^{(j)} + C(z_t^{(j)})$
9: end for
10: Output: $\bar{x}_T = \frac{1}{n} \sum_{i=1}^{n} x_T^{(i)}$

1: Input: Initial point $x_0^{(i)} = x_0$, initial estimates $\hat{x}_0^{(i)} = x_0$, step length $\gamma$, weight matrix $W$, and number of total iterations $T$.
2: for $t = 0, 1, 2, \ldots, T-1$ do
3: Randomly sample $\xi_t^{(i)}$ from the local data of the $i$-th node
4: Compute the local stochastic gradient $\nabla F_i(x_t^{(i)}; \xi_t^{(i)})$ using $\xi_t^{(i)}$ and the current optimization variable $x_t^{(i)}$
5: Compute the neighborhood weighted average using the estimated values $\hat{x}_t^{(j)}$ of the connected neighbors: $x_{t+\frac{1}{2}}^{(i)} = \sum_j W_{ij} \hat{x}_t^{(j)}$
6: Update the local model: $x_{t+1}^{(i)} = x_{t+\frac{1}{2}}^{(i)} - \gamma \nabla F_i(x_t^{(i)}; \xi_t^{(i)})$
7: Each node computes its own extrapolated value $z_{t+1}^{(i)} = \left(1 - \frac{t+1}{2}\right) x_t^{(i)} + \frac{t+1}{2} x_{t+1}^{(i)}$ and sends the compressed value $C(z_{t+1}^{(i)})$ to the connected neighbors
8: Each node updates the estimates of its connected neighbors: $\hat{x}_{t+1}^{(j)} = \left(1 - \frac{2}{t+1}\right) \hat{x}_t^{(j)} + \frac{2}{t+1} C(z_{t+1}^{(j)})$
9: end for
10: Output: $\bar{x}_T = \frac{1}{n} \sum_{i=1}^{n} x_T^{(i)}$

In this section, we introduce a difference-based approach, namely difference compression DPSGD (DCDPSGD), to ensure efficient convergence.
DCDPSGD basically follows the framework of DPSGD, except that nodes exchange the compressed difference of their local models between two successive iterations, instead of exchanging the local models themselves. More specifically, each node needs to store its neighbors' models from the last iteration and follow the steps below:

take the weighted average and apply a stochastic gradient descent step: $x_{t+\frac{1}{2}}^{(i)} = \sum_j W_{ij} \hat{x}_t^{(j)} - \gamma \nabla F_i(x_t^{(i)}; \xi_t^{(i)})$, where $\hat{x}_t^{(j)}$ is just a replica of $x_t^{(j)}$ stored on node $i$^{3};

compress the difference $z_t^{(i)} = x_{t+\frac{1}{2}}^{(i)} - \hat{x}_t^{(i)}$ between $x_{t+\frac{1}{2}}^{(i)}$ and $\hat{x}_t^{(i)}$, and update the local model: $x_{t+1}^{(i)} = \hat{x}_t^{(i)} + C(z_t^{(i)})$;

send $C(z_t^{(i)})$ and query the neighbors' $C(z_t^{(j)})$ to update the local replicas: $\hat{x}_{t+1}^{(j)} = \hat{x}_t^{(j)} + C(z_t^{(j)})$ for each neighbor $j$.
^{3} Actually, each neighbor of node $j$ maintains a replica of $x_t^{(j)}$.
The full DCDPSGD algorithm is described in Algorithm 1.
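The three steps above can be sketched in a single-process simulation, under our own illustrative assumptions (quadratic local losses, an 8-node ring, and a coarse unbiased rounding quantizer; none of these choices come from the paper's experiments). Note that every node applies the same compressed difference to its replica, so each local model and its replica coincide, as in the algorithm:

```python
import numpy as np

def coarse_quantize(z, delta=0.05):
    """Unbiased stochastic rounding of each entry onto a grid of spacing delta."""
    s = z / delta
    low = np.floor(s)
    up = np.random.rand(*z.shape) < (s - low)
    return (low + up) * delta

def ring_matrix(n):
    """Symmetric doubly stochastic weight matrix for a ring topology."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, [i, (i - 1) % n, (i + 1) % n]] = 1.0 / 3
    return W

np.random.seed(3)
n, d, gamma, T = 8, 5, 0.02, 800
targets = np.random.randn(n, d)      # f_i(x) = 0.5 * ||x - a_i||^2
W = ring_matrix(n)
X = np.zeros((n, d))                 # local models x_t^{(i)}
X_hat = np.zeros((n, d))             # replicas shared with the neighbors

for t in range(T):
    G = (X - targets) + 0.1 * np.random.randn(n, d)
    X_half = W @ X_hat - gamma * G               # gossip on replicas + SGD step
    Z = coarse_quantize(X_half - X_hat)          # compress only the *difference*
    X = X_hat + Z                                # local model update
    X_hat = X_hat + Z                            # all replicas apply the same delta

x_star = targets.mean(axis=0)
print(np.abs(X - x_star).max())      # local models end up near the global optimum
```

Because only the difference between successive models is quantized, the quantization error is applied consistently to model and replicas and does not accumulate the way naive model quantization does.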
To ensure convergence, we need to impose some restrictions on the compression operator $C(\cdot)$. Again, this compression operator could be random quantization, random sparsification, or any other unbiased operator. We introduce a signal-to-noise related parameter that bounds the compression noise relative to the signal. We have the following theorem.
Theorem 1.
To make the result clearer, we appropriately choose the step length in the following corollary:
Corollary 2.
The leading term of the convergence rate is $O\left(\frac{1}{\sqrt{nT}}\right)$, and we also prove the convergence rate for a general step length (see (27) in the Supplementary Material). We shall see the tightness of our result in the following discussion.
Linear speedup
The leading term of the convergence rate is $O\left(\frac{1}{\sqrt{nT}}\right)$ when $T$ is large, which is consistent with the convergence rate of CPSGD; this indicates that we achieve a linear speedup with respect to the number of nodes.
Consistency with DPSGD
In the special case without compression (matching the scenario of DPSGD), DCDPSGD admits a rate that is slightly better than the rate of DPSGD proved in Lian et al. (2017b). The non-leading terms' dependence on the spectral gap is also consistent with the result for DPSGD.
4.2 Extrapolation compression approach
From Theorem 1, we can see that there is an upper bound on the compression level in DCDPSGD. Moreover, since the spectral gap decreases as the number of workers grows, DCDPSGD will fail to work under very aggressive compression. So in this section we propose another approach, namely ECDPSGD, that removes the restriction on the compression level, at a small sacrifice in computational efficiency.
For ECDPSGD, we make the following assumption that the noise brought by compression is bounded.
Assumption 2.
(Bounded compression noise) We assume the noise due to compression is unbiased and its variance is bounded, that is, $\mathbb{E}\, C(z) = z$ and $\mathbb{E}\, \|C(z) - z\|^2 \le \tilde{\sigma}^2$ for any $z$.
Instead of sending the local model $x_t^{(i)}$ directly to its neighbors, each node sends a value extrapolated from $x_{t-1}^{(i)}$ and $x_t^{(i)}$ at each iteration, and each node estimates its neighbors' values from these compressed values. This procedure ensures a diminishing estimation error.
At the $t$-th iteration, node $i$ performs the following steps so that its neighbors can estimate $x_t^{(i)}$ by $\hat{x}_t^{(i)}$:

Node $i$ computes the value $z_t^{(i)}$ obtained through extrapolation:

(3) $z_t^{(i)} = \left(1 - \frac{t}{2}\right) x_{t-1}^{(i)} + \frac{t}{2} x_t^{(i)}$.

Node $i$ compresses $z_t^{(i)}$ and sends it to its neighbors, say node $j$. Node $j$ computes $\hat{x}_t^{(i)}$ using

(4) $\hat{x}_t^{(i)} = \left(1 - \frac{2}{t}\right) \hat{x}_{t-1}^{(i)} + \frac{2}{t} C\left(z_t^{(i)}\right)$.

Note that the recursion is exact in the noiseless case: if $\hat{x}_{t-1}^{(i)} = x_{t-1}^{(i)}$, then $\left(1 - \frac{2}{t}\right) x_{t-1}^{(i)} + \frac{2}{t} z_t^{(i)} = x_t^{(i)}$. Using Lemma 12 (see the Supplemental Materials), if the compression noise has variance globally bounded by $\tilde{\sigma}^2$, the estimation error $\mathbb{E}\|\hat{x}_t^{(i)} - x_t^{(i)}\|^2$ diminishes as $t$ grows.
This way of estimating the neighbors' local models leads to the equivalent update form $X_{t+1} = \hat{X}_t W - \gamma G(X_t; \xi_t)$.
The full extrapolation compression DPSGD (ECDPSGD) algorithm is summarized in Algorithm 2.
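The estimation steps (3) and (4) can be sketched in isolation (the drifting toy trajectory and the Gaussian stand-in for compression noise are our assumptions, not the paper's setup): because the extrapolation makes the recursion exact in the noiseless case, only the damped compression noise remains, and the estimation error shrinks as $t$ grows.

```python
import numpy as np

np.random.seed(4)
d, T = 5, 2000
target = np.random.randn(d)
x_prev = np.zeros(d)         # x_{t-1}, also the initial estimate
x_hat = np.zeros(d)          # neighbor-side estimate of the local model
errors = []

for t in range(1, T + 1):
    # a stand-in for the local model trajectory: drifts toward a fixed target
    x = x_prev + 0.01 * (target - x_prev)
    # step (3): extrapolate from the last two local models
    z = (1 - t / 2.0) * x_prev + (t / 2.0) * x
    # "compression": unbiased noise with bounded variance (Assumption 2)
    z_compressed = z + 0.05 * np.random.randn(d)
    # step (4): the neighbor updates its estimate from the compressed value
    x_hat = (1 - 2.0 / t) * x_hat + (2.0 / t) * z_compressed
    errors.append(np.linalg.norm(x_hat - x))
    x_prev = x

print(errors[9], errors[-1])   # the estimation error decays as t grows
```

The $1 - 2/t$ damping means each injected noise term is weighted down over time instead of accumulating, which is exactly what allows ECDPSGD to tolerate arbitrarily coarse (but unbiased) compression.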
Below we show that the ECDPSGD algorithm admits the same convergence rate and the same computational complexity as DPSGD.
Theorem 3 (Convergence of Algorithm 2).
To make the result clearer, we appropriately choose the step length in the following corollary:
Corollary 4.
In Algorithm 2, choose the step length appropriately. Then it admits the following convergence rate (with the Lipschitz constant, initial suboptimality, and variance bounds treated as constants).
(6) 
This result suggests that the algorithm converges roughly at the rate $O\left(\frac{1}{\sqrt{nT}}\right)$; we also prove the convergence rate for a general step length (see (36) in the Supplementary Material). The following analysis gives a more detailed interpretation to show the tightness of our result.
Linear speedup
The leading term of the convergence rate is $O\left(\frac{1}{\sqrt{nT}}\right)$ when $T$ is large, which is consistent with the convergence rate of CPSGD; this indicates that we achieve a linear speedup with respect to the number of nodes.
Consistency with DPSGD
In the special case without compression noise (matching the scenario of DPSGD), ECDPSGD admits a rate that is slightly better than the rate of DPSGD proved in Lian et al. (2017b). The non-leading terms' dependence on the spectral gap is also consistent with the result for DPSGD.
Comparison between DCDPSGD and ECDPSGD
On one side, in terms of the convergence rate, ECDPSGD is slightly worse than DCDPSGD due to additional terms: if the compression noise is relatively large compared with the stochastic gradient variance, the additional terms dominate the convergence rate. On the other side, DCDPSGD does not allow overly aggressive compression or quantization and may diverge when the compression is too aggressive, while ECDPSGD is quite robust to aggressive compression or quantization.
5 Experiments
In this section, we evaluate the two decentralized algorithms by comparing them with an Allreduce implementation of centralized SGD. We run experiments under diverse network conditions and show that decentralized algorithms with low precision can speed up training without hurting convergence.
5.1 Experimental Setup
We choose an image classification task as the benchmark to evaluate our theory. We train ResNet20 (He et al., 2016) on the CIFAR10 dataset, which has 50,000 training images and 10,000 test images. The two proposed algorithms are implemented in Microsoft CNTK and compared with CNTK's original implementation of distributed SGD:


Centralized: This implementation is based on MPI Allreduce primitive with full precision (32 bits). It is the standard training method for multiple nodes in CNTK.

Decentralized_32bits/8bits: The implementation of the proposed decentralized approach with OpenMPI. The full precision is 32 bits, and the compressed precision is 8 bits.

In this paper, we omit the comparison with quantized centralized training because the difference between Decentralized 8bits and Centralized 8bits would be similar to that in the original decentralized training paper (Lian et al., 2017a): when the network latency is high, the decentralized algorithm outperforms the centralized algorithm in terms of the time per epoch.
We run experiments on 8 Amazon EC2 instances, each with one Nvidia K80 GPU. We use each GPU as a node. In the decentralized cases, the 8 nodes are connected in a ring topology, so each node communicates only with its two neighbors. The batch size for each node is the same as the default configuration in CNTK. We also tune the learning rate for each variant.
5.2 Convergence and Run Time Performance
We first study the convergence of our algorithms. Figure 2(a) shows the convergence w.r.t. the number of epochs for the centralized and decentralized cases. We only show ECDPSGD in the figure (and call it Decentralized) because DCDPSGD has almost identical convergence behavior in this experiment. We can see that with our algorithms, decentralization and compression do not hurt the convergence rate.
We then compare the runtime performance. Figure 2(b, c, d) demonstrates how the training loss decreases with the run time under different network conditions. We use a traffic-control command to change the bandwidth and latency of the underlying network. By default, 1.4 Gbps bandwidth and 0.13 ms latency are the best network conditions we can get in this cluster. In this setting, all implementations have very similar runtime performance because communication is not the bottleneck of the system. When the latency is high, as shown in Figure 2(c), decentralized algorithms in both low and full precision outperform the Allreduce method because of the smaller number of communication rounds. In the low bandwidth case, however, training time is mainly dominated by the amount of communicated data, so the low precision method is markedly faster than the full precision methods.
5.3 Speedup in Diverse Network Conditions
To better understand the influence of bandwidth and latency on speedup, we compare the time of one epoch under various network conditions. Figure 3(a, b) shows the trend of epoch time as bandwidth decreases from 1.4 Gbps to 5 Mbps. When the latency is low (Figure 3(a)), the low precision algorithm is faster than its full precision counterpart because it only needs to exchange around one fourth of the data exchanged by the full precision method. Note that although decentralized, the full precision case has no advantage over Allreduce in this situation, because they exchange exactly the same amount of data. In the high latency case shown in Figure 3(b), both full and low precision cases are much better than Allreduce in the beginning, but the full precision method degrades dramatically as the bandwidth declines.
Figure 3(c, d) shows how latency influences the epoch time under good and bad bandwidth conditions. When bandwidth is not the bottleneck (Figure 3(c)), the decentralized approaches with full and low precision have similar epoch times because they have the same number of communication rounds. As expected, Allreduce is slower in this case. When bandwidth is very low (Figure 3(d)), only the decentralized algorithm with low precision achieves the best performance among all implementations.
5.4 Discussion
Our previous experiments validate the efficiency of the decentralized algorithms on 8 nodes with 8 bits. However, we wonder whether we can scale to more nodes or compress the exchanged data even more aggressively. We first conducted experiments on 16 nodes with 8 bits, as before. According to Figure 4(a), Alg. 1 and Alg. 2 on 16 nodes still achieve basically the same convergence rate as Allreduce, which shows the scalability of our algorithms. However, with 4 bits they are no longer comparable to Allreduce, as shown in Figure 4(b). What is noteworthy is that the two compression approaches behave quite differently at 4 bits. Alg. 1 converges much slower than Allreduce, but its training loss keeps decreasing; Alg. 2, in contrast, diverges at the very beginning of training. This observation is consistent with our theoretical analysis.
6 Conclusion
In this paper, we studied the problem of combining two tricks for training distributed stochastic gradient descent under imperfect network conditions: quantization and decentralization. We developed two novel algorithms for quantized, decentralized training, analyzed the theoretical properties of both algorithms, and empirically studied their performance in various network-condition settings. We found that when the underlying communication network has both high latency and low bandwidth, the quantized, decentralized algorithms outperform other strategies significantly.
References
 Abadi et al. [2016] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for largescale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 Agarwal and Duchi [2011] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
 Alistarh et al. [2017] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: CommunicationEfficient SGD via Gradient Quantization and Encoding. NIPS, 2017.
 Bottou [2010] L. Bottou. Largescale machine learning with stochastic gradient descent. Proc. of the International Conference on Computational Statistics (COMPSTAT), 2010.
 Boyd et al. [2006] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE/ACM Trans. Netw., 14(SI):2508–2530, June 2006. ISSN 10636692. doi: 10.1109/TIT.2006.874516.
 Chen et al. [2015] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
 Colin et al. [2016] I. Colin, A. Bellet, J. Salmon, and S. Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
 De Sa et al. [2017] C. De Sa, M. Feldman, C. Ré, and K. Olukotun. Understanding and optimizing asynchronous lowprecision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 561–574. ACM, 2017.
 Dekel et al. [2012] O. Dekel, R. GiladBachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using minibatches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
 Dobbe et al. [2017] R. Dobbe, D. FridovichKeil, and C. Tomlin. Fully decentralized policies for multiagent systems: An information theoretic approach. In Advances in Neural Information Processing Systems, pages 2945–2954, 2017.
 Drumond et al. [2018] M. Drumond, T. Lin, M. Jaggi, and B. Falsafi. Training dnns with hybrid block floating point. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 38 December 2018, Montréal, Canada., pages 451–461, 2018. URL http://papers.nips.cc/paper/7327trainingdnnswithhybridblockfloatingpoint.
 Ghadimi and Lan [2013] S. Ghadimi and G. Lan. Stochastic first and zerothorder methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013. doi: 10.1137/120880811.

 He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 He et al. [2018] L. He, A. Bian, and M. Jaggi. Cola: Decentralized linear learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4541–4551. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7705coladecentralizedlinearlearning.pdf.
 Kashyap et al. [2007] A. Kashyap, T. Başar, and R. Srikant. Quantized consensus. Automatica, 43(7):1192–1203, 2007.
 Konečnỳ and Richtárik [2016] J. Konečnỳ and P. Richtárik. Randomized distributed mean estimation: Accuracy vs communication. arXiv preprint arXiv:1611.07555, 2016.
 Lan et al. [2017] G. Lan, S. Lee, and Y. Zhou. Communicationefficient algorithms for decentralized and stochastic optimization. 01 2017.
 Lavaei and Murray [2012] J. Lavaei and R. M. Murray. Quantized consensus by means of gossip algorithm. IEEE Transactions on Automatic Control, 57(1):19–32, 2012.
 Li et al. [2017] Z. Li, W. Shi, and M. Yan. A decentralized proximalgradient method with network independent stepsizes and separated convergence rates. arXiv preprint arXiv:1704.07807, 2017.
 Lian et al. [2015] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
 Lian et al. [2017a] X. Lian, C. Zhang, H. Zhang, C.J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. 05 2017a.
 Lian et al. [2017b] X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. 10 2017b.
 Lin et al. [2018] T. Lin, S. U. Stich, and M. Jaggi. Don’t use large minibatches, use local SGD. CoRR, abs/1808.07217, 2018. URL http://arxiv.org/abs/1808.07217.
 Mhamdi et al. [2017] E. Mhamdi, E. Mahdi, H. Hendrikx, R. Guerraoui, and A. D. O. Maurer. Dynamic safe interruptibility for decentralized multiagent reinforcement learning. Technical report, EPFL, 2017.

 Moulines and Bach [2011] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011.
 Nedic and Ozdaglar [2009] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multiagent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
 Nedic et al. [2009] A. Nedic, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis. On distributed averaging algorithms and quantization effects. IEEE Transactions on Automatic Control, 54(11):2506–2517, 2009.
 Nemirovski et al. [2009] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. doi: 10.1137/070704277.
 Omidshafiei et al. [2017] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multitask multiagent rl under partial observability. arXiv preprint arXiv:1703.06182, 2017.
 Recht et al. [2011] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lockfree approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
 Seide and Agarwal [2016] F. Seide and A. Agarwal. Cntk: Microsoft’s opensource deeplearning toolkit. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 2135–2135, New York, NY, USA, 2016. ACM. ISBN 9781450342322. doi: 10.1145/2939672.2945397.
 Shi et al. [2015] W. Shi, Q. Ling, G. Wu, and W. Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015.
 Sirb and Ye [2016] B. Sirb and X. Ye. Consensus optimization with delayed and stochastic gradients on decentralized networks. In 2016 IEEE International Conference on Big Data (Big Data), pages 76–85, 2016.
 Stich [2018] S. U. Stich. Local SGD converges fast and communicates little. Technical Report, page arXiv:1805.00982, 2018.
 Stich et al. [2018] S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4452–4463. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7697-sparsified-sgd-with-memory.pdf.
 Suresh et al. [2017] A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan. Distributed mean estimation with limited communication. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3329–3337, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/suresh17a.html.
 Wangni et al. [2017] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. arXiv preprint arXiv:1710.09854, 2017.
 Yuan et al. [2016] K. Yuan, Q. Ling, and W. Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016. doi: 10.1137/130943170.
 Zhang et al. [2017a] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pages 4035–4043, 2017a.
 Zhang et al. [2017b] W. Zhang, P. Zhao, W. Zhu, S. C. Hoi, and T. Zhang. Projection-free distributed online learning in networks. In International Conference on Machine Learning, pages 4054–4062, 2017b.
 Zhao and Song [2016] L. Zhao and W. Song. Decentralized consensus in distributed networks. International Journal of Parallel, Emergent and Distributed Systems, pages 1–20, 2016.
 Zheng et al. [2016] S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z.-M. Ma, and T.-Y. Liu. Asynchronous stochastic gradient descent with delay compensation for distributed deep learning. arXiv preprint arXiv:1609.08326, 2016.
 Zinkevich et al. [2010] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.
Appendix A General bound with compression noise
In this section, to see the influence of the compression more clearly, we prove two general bounds (see Lemma 7 and Lemma 8) for compressed D-PSGD under the same updating rule as (7). These bounds will be very helpful in the subsequent proofs for our algorithms.
The most challenging part of analyzing a decentralized algorithm, unlike a centralized one, is that we need to ensure the local model $x_t^{(i)}$ on each node converges to the average value $\overline{X_t} := \frac{1}{n}\sum_{i=1}^{n} x_t^{(i)}$. So we start with an analysis of the quantity $\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left\|\overline{X_t} - x_t^{(i)}\right\|^2$ and its influence on the final convergence rate. For both ECD-PSGD and DCD-PSGD, we prove bounds on this quantity (stated precisely in Lemma 7 and Lemma 8), from which we can see that the extra compression noise term decides how efficiently the local models $x_t^{(i)}$ converge to the average $\overline{X_t}$.
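To make this quantity concrete, here is a minimal pure-Python sketch (with made-up local models, not values from any experiment) that computes the average model and the consensus error $\frac{1}{n}\sum_{i}\|\overline{X_t} - x_t^{(i)}\|^2$:

```python
# Toy illustration of the consensus error (1/n) * sum_i ||xbar - x^(i)||^2,
# the quantity bounded in this appendix. The local models are hypothetical.
local_models = [
    [1.0, 2.0],   # x^(1)
    [3.0, 2.0],   # x^(2)
    [2.0, 5.0],   # x^(3)
]
n = len(local_models)
d = len(local_models[0])

# Average model: xbar = (1/n) * sum_i x^(i)
xbar = [sum(x[j] for x in local_models) / n for j in range(d)]

# Consensus error: (1/n) * sum_i ||xbar - x^(i)||^2
consensus_error = sum(
    sum((xbar[j] - x[j]) ** 2 for j in range(d)) for x in local_models
) / n
```

Decentralized training drives this error to zero as the nodes mix; the bounds below quantify how compression noise slows that mixing.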
The proof of the general bound for (7) is divided into two parts. In subsection A.1, we provide a new perspective for understanding decentralization, which considerably simplifies the subsequent proofs. In subsection A.2, we give the detailed proof of the general bound.
A.1 A more intuitive way to understand decentralization
To better understand how decentralized algorithms work, and how consensus among the local variables on each node is ensured, we provide a new perspective on decentralization based on a coordinate transformation, which simplifies the analysis that follows.
The confusion matrix $W \in \mathbb{R}^{n \times n}$ is symmetric and doubly stochastic, so we can decompose it into $W = P\Lambda P^{\top}$, where $P = (v_1, v_2, \ldots, v_n)$ satisfies $P^{\top}P = I$. Without loss of generality, we can assume $1 = \lambda_1 > \lambda_2 \geq \cdots \geq \lambda_n > -1$. Then, right-multiplying the updating rule (7) by $P$ and using $WP = P\Lambda$, we have the following equalities:
$$X_{t+1}P = X_t W P - \gamma G(X_t; \xi_t)P + C_t P = X_t P \Lambda - \gamma G(X_t; \xi_t)P + C_t P.$$
Consider the coordinate transformation using $P$ as the base change matrix, and denote $Y_t := X_t P$, $\tilde{G}_t := G(X_t; \xi_t)P$, $\tilde{C}_t := C_t P$, where $C_t$ collects the compression noise at iteration $t$. Then the above equation can be rewritten as
$$Y_{t+1} = Y_t \Lambda - \gamma \tilde{G}_t + \tilde{C}_t. \qquad (8)$$
Since $\Lambda$ is a diagonal matrix, we use $y_t^{(i)}$, $\tilde{g}_t^{(i)}$, $\tilde{c}_t^{(i)}$ to indicate the $i$th column of $Y_t$, $\tilde{G}_t$, $\tilde{C}_t$, respectively. Then (8) becomes
$$y_{t+1}^{(i)} = \lambda_i y_t^{(i)} - \gamma \tilde{g}_t^{(i)} + \tilde{c}_t^{(i)}, \quad i = 1, \ldots, n. \qquad (9)$$
(9) offers us a much more intuitive way to analyze the algorithm. Since all eigenvalues of $W$, except $\lambda_1 = 1$, satisfy $|\lambda_i| < 1$, the corresponding $y_t^{(i)}$ would “decay to zero” due to the scaling factor $\lambda_i^t$.
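This decay can be checked numerically. The sketch below runs pure gossip, i.e., the update with the gradient and compression terms dropped, on an assumed 3-node ring whose confusion matrix has spectrum $\{1, 1/4, 1/4\}$ (this matrix and its eigenvectors are illustrative choices, not taken from the paper); the first transformed coordinate is preserved while the others shrink by the factor $1/4$ per step:

```python
import math

# Gossip on a 3-node ring: a sketch of the coordinate-transformation view.
# W is an assumed symmetric doubly stochastic confusion matrix whose
# eigenvalues are {1, 1/4, 1/4}.
W = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]

# Orthonormal eigenvectors of W, the columns of the base-change matrix P:
# v1 = (1,1,1)/sqrt(3) (eigenvalue 1), the rest span the decaying subspace.
P = [[1 / math.sqrt(3), -1 / math.sqrt(2),  1 / math.sqrt(6)],
     [1 / math.sqrt(3),  0.0,              -2 / math.sqrt(6)],
     [1 / math.sqrt(3),  1 / math.sqrt(2),  1 / math.sqrt(6)]]

def row_times_matrix(v, M):
    # (v M)_j = sum_i v_i * M[i][j]
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

x = [1.0, 5.0, 0.0]            # one scalar parameter per node
avg = sum(x) / len(x)          # the consensus value, 2.0
y0 = row_times_matrix(x, P)    # transformed coordinates y = x P
for _ in range(4):             # four pure gossip steps: x <- x W
    x = row_times_matrix(x, W)
y4 = row_times_matrix(x, P)
```

Here `y0[0]` equals $\sqrt{n}$ times the average and is left unchanged by gossip, while `y0[1]` and `y0[2]` are multiplied by $(1/4)^4$ after four steps.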
Moreover, since the eigenvector corresponding to $\lambda_1 = 1$ is $v_1 = \frac{1}{\sqrt{n}}\mathbf{1}_n$, we have $y_t^{(1)} = X_t v_1 = \sqrt{n}\,\overline{X_t}$, where $\overline{X_t} := \frac{1}{n}X_t\mathbf{1}_n$. So, if $y_t^{(i)} = 0$ for all $i \neq 1$, then $X_t = Y_t P^{\top} = y_t^{(1)} v_1^{\top} = \overline{X_t}\mathbf{1}_n^{\top}$, i.e., every local model equals the average. This whole process shows how the confusion matrix works under a coordinate transformation.

A.2 Analysis for the general updating form in (7)
Lemma 5.
For any matrix $X \in \mathbb{R}^{N \times n}$, decompose the confusion matrix $W$ as $W = P\Lambda P^{\top}$, where $P = (v_1, v_2, \ldots, v_n)$, $v_i$ is the normalized eigenvector of $\lambda_i$, and $\Lambda$ is a diagonal matrix with $\lambda_i$ as its $i$th diagonal element. Then for any integer $k \geq 0$ we have
$$\left\|XW^{k} - X\frac{\mathbf{1}_n\mathbf{1}_n^{\top}}{n}\right\|_F \leq \rho^{k}\left\|X\right\|_F,$$
where $\rho$ follows the definition in Theorem 3.
Proof.
Since $W^{k} = P\Lambda^{k}P^{\top}$ and $v_1 = \frac{1}{\sqrt{n}}\mathbf{1}_n$, we have
$$XW^{k} - X\frac{\mathbf{1}_n\mathbf{1}_n^{\top}}{n} = XP\Lambda^{k}P^{\top} - XPE_1P^{\top} = XP\left(\Lambda^{k} - E_1\right)P^{\top},$$
where $E_1 := \mathrm{diag}(1, 0, \ldots, 0)$, noting that $PE_1P^{\top} = v_1v_1^{\top} = \frac{\mathbf{1}_n\mathbf{1}_n^{\top}}{n}$. Since $P$ is orthogonal and $\Lambda^{k} - E_1 = \mathrm{diag}(0, \lambda_2^{k}, \ldots, \lambda_n^{k})$, we have
$$\left\|XW^{k} - X\frac{\mathbf{1}_n\mathbf{1}_n^{\top}}{n}\right\|_F \leq \max_{i \geq 2}|\lambda_i|^{k}\left\|X\right\|_F = \rho^{k}\left\|X\right\|_F, \qquad (10)$$
where $\rho = \max\{|\lambda_2|, |\lambda_n|\}$. ∎
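As a sanity check of Lemma 5, the sketch below applies the inequality to the centered matrix $X - X\frac{\mathbf{1}\mathbf{1}^{\top}}{n}$ (an equivalent but tighter application, since centering is preserved by $W$) on the same assumed 3-node confusion matrix, for which $\rho = 1/4$:

```python
import math

# Numerical check of the Lemma-5-style decay bound on a toy example.
# W (assumed example, not from the paper) is symmetric doubly stochastic
# with eigenvalues {1, 1/4, 1/4}, so rho = max(|lambda_2|, |lambda_n|) = 0.25.
W = [[0.50, 0.25, 0.25],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]
RHO = 0.25

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def frob(A):
    return math.sqrt(sum(a * a for row in A for a in row))

X = [[1.0, 2.0, 3.0],   # each column is one node's local model
     [4.0, 0.0, 8.0]]
n = 3
# X * (11^T / n): every column replaced by the row-wise average.
Xbar = [[sum(row) / n] * n for row in X]

dev0 = frob([[X[i][j] - Xbar[i][j] for j in range(n)] for i in range(len(X))])
Xk = X
bounds_hold = True
for k in range(1, 6):
    Xk = mat_mul(Xk, W)
    dev = frob([[Xk[i][j] - Xbar[i][j] for j in range(n)] for i in range(len(X))])
    # Checked inequality: ||X W^k - X 11^T/n||_F <= rho^k * ||X - X 11^T/n||_F
    bounds_hold = bounds_hold and dev <= RHO ** k * dev0 + 1e-12
```

For this particular $W$ the whole deviation subspace has eigenvalue $1/4$, so the bound is met with equality at every step.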
Lemma 6.
Given two nonnegative sequences and