Decentralization Meets Quantization

03/17/2018 · Hanlin Tang, et al. · ETH Zurich, University of Rochester

Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: quantization for low-bandwidth networks, and decentralization for high-latency networks. In this paper, we explore a natural question: can the combination of decentralization and quantization lead to a system that is robust to both bandwidth and latency? Although the system implication of such a combination is trivial, the underlying theoretical principle and algorithm design are challenging: simply quantizing the data sent in a decentralized training algorithm accumulates the quantization error. In this paper, we develop a framework of quantized, decentralized training and propose two different strategies, which we call extrapolation compression and difference compression. We analyze both algorithms and prove that both converge at the rate of O(1/√(nT)), where n is the number of workers and T is the number of iterations, matching the convergence rate of full-precision, centralized training. We evaluate our algorithms on training deep learning models, and find that the proposed algorithms outperform the best of the merely decentralized and merely quantized algorithms significantly for networks with both high latency and low bandwidth.


1 Introduction

When training machine learning models in a distributed fashion, the underlying constraints on how workers (or nodes) communicate have a significant impact on the training algorithm. When workers cannot form a fully connected communication topology or the communication latency is high (e.g., in sensor networks or mobile networks), decentralizing the communication comes to the rescue. On the other hand, when the amount of data sent through the network is an optimization objective (maybe to lower the cost or energy consumption), or the network bandwidth is low, compressing the traffic, either via sparsification (Wangni et al., 2017; Konečnỳ and Richtárik, 2016) or quantization (Zhang et al., 2017a; Suresh et al., 2017), is a popular strategy. Our goal in this paper is to develop a novel framework that works robustly in an environment where both decentralization and communication compression could be beneficial. We focus on quantization, the process of lowering the precision of data representation, often in a stochastically unbiased way, but the same techniques apply to other unbiased compression schemes such as sparsification.

Both decentralized training and quantized (or, more generally, compressed) training have attracted intensive interest recently (Yuan et al., 2016; Zhao and Song, 2016; Lian et al., 2017a; Konečnỳ and Richtárik, 2016; Alistarh et al., 2017). Decentralized algorithms usually exchange local models among nodes, which consumes the main communication budget; quantized algorithms, on the other hand, usually exchange quantized gradients and update an un-quantized model. A straightforward idea to combine the two is to directly quantize the models sent through the network during decentralized training. However, this simple strategy does not converge to the right solution because the quantization error accumulates during training. The technical contribution of this paper is to develop novel algorithms that combine decentralized training and quantized training.

Problem Formulation.

We consider the following decentralized optimization:

$$\min_{x\in\mathbb{R}^N}\; f(x) := \frac{1}{n}\sum_{i=1}^{n}\underbrace{\mathbb{E}_{\xi\sim\mathcal{D}_i} F_i(x;\xi)}_{=: f_i(x)} \qquad (1)$$

where $n$ is the number of nodes and $\mathcal{D}_i$ is the local data distribution for node $i$. The $n$ nodes form a connected graph and each node can only communicate with its neighbors. Here we only assume that the $f_i$'s have $L$-Lipschitzian gradients.

Summary of Technical Contributions.

In this paper, we propose two quantized decentralized parallel stochastic gradient descent (D-PSGD) algorithms: extrapolation compression D-PSGD (ECD-PSGD) and difference compression D-PSGD (DCD-PSGD). Both algorithms can be proven to converge at a rate of roughly $O(1/\sqrt{nT})$, where $T$ is the number of iterations and $n$ is the number of workers. The convergence rates are consistent with two special cases: centralized parallel stochastic gradient descent (C-PSGD) and D-PSGD. To the best of our knowledge, this is the first work to combine quantization algorithms and decentralized algorithms for generic optimization.

The key difference between ECD-PSGD and DCD-PSGD is that DCD-PSGD quantizes the difference between the last two local models, while ECD-PSGD quantizes an extrapolation of the last two local models. DCD-PSGD admits a slightly better convergence rate than ECD-PSGD when the data variation among nodes is very large. On the other hand, ECD-PSGD is more robust to aggressive quantization, as extremely low-precision quantization can cause DCD-PSGD to diverge, since DCD-PSGD places a strict constraint on the quantization. In this paper, we analyze both algorithms and empirically validate our theory. We also show that when the underlying network has both high latency and low bandwidth, both algorithms outperform the state of the art significantly. We present both algorithms because we believe both are theoretically interesting. In practice, ECD-PSGD could potentially be the more robust choice.

Definitions and notations

Throughout this paper, we use the following notations and definitions:

  • $\nabla f(\cdot)$ denotes the gradient of a function $f$.

  • $f^*$ denotes the optimal value of (1).

  • $\lambda_i(\cdot)$ denotes the $i$-th largest eigenvalue of a matrix.

  • $\mathbf{1}_n = [1, 1, \dots, 1]^\top$ denotes the full-one vector.

  • $\|\cdot\|$ denotes the $\ell_2$ norm for vectors.

  • $\|\cdot\|_F$ denotes the Frobenius norm of matrices.

  • $C_\omega(\cdot)$ denotes the compression operator.

  • $f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x)$.

2 Related work

Stochastic gradient descent

Stochastic Gradient Descent (SGD) (Ghadimi and Lan, 2013; Moulines and Bach, 2011; Nemirovski et al., 2009), a stochastic variant of the gradient descent method, has been widely used for solving large-scale machine learning problems (Bottou, 2010). It admits the optimal $O(1/\sqrt{T})$ convergence rate for non-convex functions.

Centralized algorithms

The centralized algorithm is a widely used scheme for parallel computation, adopted by systems such as TensorFlow (Abadi et al., 2016), MXNet (Chen et al., 2015), and CNTK (Seide and Agarwal, 2016). It uses a central node to control all leaf nodes. In Centralized Parallel Stochastic Gradient Descent (C-PSGD), the central node performs parameter updates while the leaf nodes compute stochastic gradients based on local information in parallel. In Agarwal and Duchi (2011) and Zinkevich et al. (2010), the effectiveness of C-PSGD is studied with latency taken into consideration. Distributed mini-batch SGD, which requires each leaf node to compute the stochastic gradient more than once before each parameter update, is studied in Dekel et al. (2012). Recht et al. (2011) proposed a variant of C-PSGD, HOGWILD!, and proved that it still works even if memory is shared and the private model can be overwritten by others. Asynchronous non-convex C-PSGD optimization is studied in Lian et al. (2015). Zheng et al. (2016) proposed an algorithm to improve the performance of asynchronous C-PSGD. In Alistarh et al. (2017) and De Sa et al. (2017), quantized SGD is proposed to save communication cost for both convex and non-convex objective functions. The convergence rate of C-PSGD is $O(1/\sqrt{nT})$. The tradeoff between the mini-batch size and the number of local SGD steps is studied in Lin et al. (2018); Stich (2018).

Decentralized algorithms

Recently, decentralized training algorithms have attracted a significant amount of attention. Decentralized algorithms are mostly applied to solve the consensus problem (Zhang et al., 2017b; Lian et al., 2017a; Sirb and Ye, 2016), where the network topology is decentralized. A recent work shows that decentralized algorithms can outperform their centralized counterparts for distributed training (Lian et al., 2017a). The main advantage of decentralized algorithms over centralized algorithms lies in avoiding the communication traffic at the central node. In particular, decentralized algorithms can be much more efficient than centralized algorithms when the network bandwidth is small and the latency is large. Decentralized algorithms (also named gossip algorithms in some literature under certain scenarios (Colin et al., 2016)) only assume a connected computational network, without using a central node to collect information from all nodes. Each node owns its local data and can only exchange information with its neighbors. The goal is still to learn a model over all distributed data. The decentralized structure can be applied to multi-task multi-agent reinforcement learning (Omidshafiei et al., 2017; Mhamdi et al., 2017). Boyd et al. (2006) used a randomized weighted matrix and studied its effectiveness in different situations. Two methods (Li et al., 2017; Shi et al., 2015) were proposed to reduce the steady-state error in decentralized gradient descent for convex optimization. Dobbe et al. (2017) applied an information-theoretic framework to decentralized analysis. The performance of decentralized algorithms depends on the second largest eigenvalue of the weighted matrix. He et al. (2018) proposed a gradient-descent-based algorithm (CoLA) for decentralized learning of linear classification and regression models, and proved its convergence rate for strongly convex and general convex cases.

Decentralized parallel stochastic gradient descent

Decentralized Parallel Stochastic Gradient Descent (D-PSGD) (Nedic and Ozdaglar, 2009; Yuan et al., 2016) requires each node to exchange its own stochastic gradient and update its parameters using the information it receives. In Nedic and Ozdaglar (2009), the convergence rate for a time-varying topology was proved under the assumption that the maximum of the subgradient is bounded. In Lan et al. (2017), a new decentralized primal-dual type method with a computational complexity guarantee is proposed for general convex objectives. The linear speedup of D-PSGD, with a leading complexity term of $O(1/\sqrt{nT})$, is proved in Lian et al. (2017a). An asynchronous variant of D-PSGD is studied in Lian et al. (2017b).

Compression

To guarantee convergence and correctness, this paper only considers unbiased stochastic compression techniques. Existing methods include randomized quantization (Zhang et al., 2017a; Suresh et al., 2017) and randomized sparsification (Wangni et al., 2017; Konečnỳ and Richtárik, 2016). Other compression methods can be found in Kashyap et al. (2007); Lavaei and Murray (2012); Nedic et al. (2009). In Drumond et al. (2018), a compressed DNN training algorithm is proposed. In Stich et al. (2018), a centralized biased sparsified parallel SGD with memory is studied and proved to admit an acceleration factor.

Figure 1: D-PSGD vs. D-PSGD with naive compression

3 Preliminary: decentralized parallel stochastic gradient descent (D-PSGD)

Unlike traditional (centralized) parallel stochastic gradient descent (C-PSGD), which requires a central node to compute the average value over all leaf nodes, the decentralized parallel stochastic gradient descent (D-PSGD) algorithm does not need such a central node. Each node (say node $i$) only exchanges its local model $x^{(i)}$ with its neighbors to take a weighted average $\sum_j W_{ij} x^{(j)}$, where $W_{ij} \ge 0$ in general and $W_{ij} = 0$ means that node $i$ and node $j$ are not connected. At the $t$-th iteration, D-PSGD consists of three steps ($i$ is the node index):

1. Each node computes the stochastic gradient $\nabla F_i(x_t^{(i)}; \xi_t^{(i)})$, where $\xi_t^{(i)}$ is a sample drawn from its local data set and $x_t^{(i)}$ is the local model on node $i$.

2. Each node queries its neighbors' variables and updates its local model using the weighted average $x_{t+1/2}^{(i)} = \sum_j W_{ij}\, x_t^{(j)}$.

3. Each node updates its local model using the stochastic gradient: $x_{t+1}^{(i)} = x_{t+1/2}^{(i)} - \gamma\, \nabla F_i(x_t^{(i)}; \xi_t^{(i)})$, where $\gamma$ is the learning rate.

To look at the D-PSGD algorithm from a global view, define the concatenation of all local models $X_t := [x_t^{(1)}, x_t^{(2)}, \dots, x_t^{(n)}]$ and of all local stochastic gradients $G(X_t;\xi_t) := [\nabla F_1(x_t^{(1)};\xi_t^{(1)}), \dots, \nabla F_n(x_t^{(n)};\xi_t^{(n)})]$; then D-PSGD can be summarized into the form $X_{t+1} = X_t W - \gamma\, G(X_t;\xi_t)$.

The convergence rate of D-PSGD can be shown to have a leading term of order $1/\sqrt{nT}$ (without assuming convexity), where both the within-node stochastic variance $\sigma^2$ and the across-node data variation $\zeta^2$ (please refer to Assumption 1 for detailed definitions) enter the bound, if the learning rate is chosen appropriately.
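As a concrete illustration of the three steps above, the following minimal NumPy sketch simulates D-PSGD on a ring of workers for a toy least-squares problem. The ring weight matrix, toy objective, and hyperparameters are illustrative assumptions for exposition, not the paper's experimental setup.

```python
import numpy as np

def ring_weight_matrix(n):
    """Symmetric doubly stochastic matrix for a ring topology:
    each node averages itself with its two neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def d_psgd(n=8, d=10, samples=20, T=2000, gamma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    W = ring_weight_matrix(n)
    # Toy local objectives: f_i(x) = 0.5 * ||A_i x - b_i||^2 (illustrative only).
    A = rng.normal(size=(n, samples, d))
    b = rng.normal(size=(n, samples))
    X = np.zeros((n, d))          # row i holds the local model of node i
    for _ in range(T):
        # Step 1: each node draws one local sample and computes a stochastic gradient.
        k = rng.integers(0, samples, size=n)
        G = np.stack([A[i, k[i]] * (A[i, k[i]] @ X[i] - b[i, k[i]]) for i in range(n)])
        # Step 2: weighted averaging with neighbors (gossip step), X <- W X.
        X = W @ X
        # Step 3: local stochastic gradient update with learning rate gamma.
        X = X - gamma * G
    return X.mean(axis=0)         # average of the local models

print(d_psgd())
```

In this sketch the gossip step only touches each node's two neighbors, which reflects why decentralization helps when latency, rather than bandwidth, is the bottleneck.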

4 Quantized, Decentralized Algorithms

We introduce two quantized decentralized algorithms that compress the information exchanged between nodes. All communication in decentralized algorithms consists of exchanging the local models $x^{(i)}$.

To reduce the communication cost, a straightforward idea is to compress the information exchanged within the decentralized network, just as centralized algorithms send compressed stochastic gradients (Alistarh et al., 2017). Unfortunately, such a naive combination does not work, even with unbiased stochastic compression and a diminishing learning rate, as shown in Figure 1. The reason can be seen from the detailed derivation in the Supplement.

Before proposing our solutions to this issue, let us first state some common optimization assumptions for analyzing decentralized stochastic algorithms (Lian et al., 2017b).

Assumption 1.

Throughout this paper, we make the following commonly used assumptions:

  1. Lipschitzian gradient: All functions $f_i$ have $L$-Lipschitzian gradients.

  2. Symmetric doubly stochastic matrix: The weighted matrix $W$ is a real symmetric doubly stochastic matrix, that is, $W = W^\top$ and $W\mathbf{1}_n = \mathbf{1}_n$.

  3. Spectral gap: Given the symmetric doubly stochastic matrix $W$, we define $\rho := \max\{|\lambda_2(W)|, |\lambda_n(W)|\}$ and assume $\rho < 1$.

  4. Bounded variance: Assume the variance of the stochastic gradient is bounded: $\mathbb{E}_{\xi\sim\mathcal{D}_i}\|\nabla F_i(x;\xi) - \nabla f_i(x)\|^2 \le \sigma^2$ and $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x) - \nabla f(x)\|^2 \le \zeta^2$.

  5. Independent and unbiased stochastic compression: The stochastic compression operation is unbiased, that is, $\mathbb{E}_\omega C_\omega(z) = z$ for any $z$, and the stochastic compressions are independent on different workers and at different time points.

The last assumption essentially restricts the compression to be lossy but unbiased. Biased stochastic compression generally makes it hard to ensure convergence, while lossless compression can be combined with any algorithm; both are beyond the scope of this paper. Commonly used unbiased stochastic compression schemes include random quantization (a real number, assumed normalized into a bounded range, is randomly rounded to one of its two closest quantization thresholds, with probabilities chosen so that the expectation equals the original number) (Zhang et al., 2017a) and sparsification (a real number is rescaled and kept with a certain probability and set to zero otherwise, so that the expectation equals the original number) (Wangni et al., 2017; Konečnỳ and Richtárik, 2016).
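For concreteness, here is a minimal sketch of the two unbiased compression operators just described: stochastic rounding onto a uniform grid (for inputs assumed normalized to [0, 1]) and random sparsification. The number of levels s and the keep probability p are illustrative parameters.

```python
import numpy as np

def random_quantize(x, s=256, rng=np.random.default_rng()):
    """Unbiased stochastic rounding of x (assumed normalized to [0, 1]) onto a
    uniform grid with s levels: each entry is rounded to one of its two closest
    thresholds with probabilities chosen so that E[output] = x."""
    scaled = x * (s - 1)
    low = np.floor(scaled)
    prob_up = scaled - low                      # P(round up) = fractional part
    return (low + (rng.random(x.shape) < prob_up)) / (s - 1)

def random_sparsify(x, p=0.25, rng=np.random.default_rng()):
    """Unbiased random sparsification: keep each coordinate with probability p
    and rescale it by 1/p, so that E[output] = x while most entries become 0."""
    mask = rng.random(x.shape) < p
    return np.where(mask, x / p, 0.0)

x = np.random.default_rng(0).random(6)
print(x)
print(random_quantize(x))
print(random_sparsify(x))
```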

4.1 Difference compression approach

1:   Input: Initial point $x_0^{(i)}$, initial replicas $\hat{x}_0^{(j)} = x_0^{(j)}$, iteration step length $\gamma$, weighted matrix $W$, and number of total iterations $T$
2:   for t = 1, 2, …, T do
3:       Randomly sample $\xi_t^{(i)}$ from the local data of the $i$-th node
4:       Compute the local stochastic gradient $\nabla F_i(x_t^{(i)}; \xi_t^{(i)})$ using $\xi_t^{(i)}$ and the current optimization variable $x_t^{(i)}$
5:       Update the local model using the local stochastic gradient and the weighted average of its connected neighbors' replicas (denoted $\hat{x}_t^{(j)}$): $x_{t+1/2}^{(i)} = \sum_j W_{ij}\hat{x}_t^{(j)} - \gamma\nabla F_i(x_t^{(i)}; \xi_t^{(i)})$
6:       Each node computes $z_t^{(i)} = x_{t+1/2}^{(i)} - x_t^{(i)}$ and compresses it into $C_\omega(z_t^{(i)})$
7:       Update the local optimization variable: $x_{t+1}^{(i)} = x_t^{(i)} + C_\omega(z_t^{(i)})$
8:       Send $C_\omega(z_t^{(i)})$ to its connected neighbors, and update the replicas of its connected neighbors' values: $\hat{x}_{t+1}^{(j)} = \hat{x}_t^{(j)} + C_\omega(z_t^{(j)})$
9:   end for
10:   Output: the averaged model $\frac{1}{n}\sum_{i=1}^{n} x^{(i)}$ over all nodes
Algorithm 1 DCD-PSGD
1:   Input: Initial point $x_0^{(i)}$, initial estimates $\hat{x}_0^{(j)} = x_0^{(j)}$, iteration step length $\gamma$, weighted matrix $W$, and number of total iterations $T$.
2:   for t = 1, 2, …, T do
3:       Randomly sample $\xi_t^{(i)}$ from the local data of the $i$-th node
4:       Compute the local stochastic gradient $\nabla F_i(x_t^{(i)}; \xi_t^{(i)})$ using $\xi_t^{(i)}$ and the current optimization variable $x_t^{(i)}$
5:       Compute the neighborhood weighted average using the estimated values $\hat{x}_t^{(j)}$ of the connected neighbors: $x_{t+1/2}^{(i)} = \sum_j W_{ij}\hat{x}_t^{(j)}$
6:       Update the local model: $x_{t+1}^{(i)} = x_{t+1/2}^{(i)} - \gamma\nabla F_i(x_t^{(i)}; \xi_t^{(i)})$
7:       Each node computes its own $z$-value by extrapolating its last two local models as in (3), and compresses it into $C_\omega(z^{(i)})$
8:       Each node updates the estimates for its connected neighbors using the received compressed $z$-values, as in (4)
9:   end for
10:   Output: the averaged model $\frac{1}{n}\sum_{i=1}^{n} x^{(i)}$ over all nodes
Algorithm 2 ECD-PSGD

In this section, we introduce a difference-based approach, namely difference compression D-PSGD (DCD-PSGD), to ensure efficient convergence.

DCD-PSGD basically follows the framework of D-PSGD, except that nodes exchange the compressed difference of their local models between two successive iterations, instead of exchanging the local models themselves. More specifically, each node needs to store replicas of its neighbors' models from the last iteration and follows these steps:

  1. take the weighted average and apply a stochastic gradient descent step: $x_{t+1/2}^{(i)} = \sum_j W_{ij}\hat{x}_t^{(j)} - \gamma\nabla F_i(x_t^{(i)};\xi_t^{(i)})$, where $\hat{x}_t^{(j)}$ is just a replica of $x_t^{(j)}$ stored on node $i$ (actually, each neighbor of node $j$ maintains a replica of $x_t^{(j)}$);

  2. compress the difference between $x_{t+1/2}^{(i)}$ and $x_t^{(i)}$ and update the local model: $x_{t+1}^{(i)} = x_t^{(i)} + C_\omega\big(x_{t+1/2}^{(i)} - x_t^{(i)}\big)$;

  3. send $C_\omega\big(x_{t+1/2}^{(i)} - x_t^{(i)}\big)$ and query neighbors' compressed differences to update the local replicas: $\hat{x}_{t+1}^{(j)} = \hat{x}_t^{(j)} + C_\omega\big(x_{t+1/2}^{(j)} - x_t^{(j)}\big)$ for each neighbor $j$.

The full DCD-PSGD algorithm is described in Algorithm 1.
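To make the three steps above concrete, the sketch below implements one DCD-PSGD iteration from the point of view of a single node, assuming an unbiased compress operator (such as the sparsification sketch above) and dictionaries holding the neighbor replicas and weights; the function names and signatures are illustrative, not the paper's implementation.

```python
def dcd_psgd_step(i, x_i, replicas, W_row, grad_i, gamma, compress):
    """One DCD-PSGD iteration on node i (sketch).

    x_i      : current local model of node i
    replicas : {j: locally stored replica of neighbor j's model}
    W_row    : {j: W_ij} for j = i and for every neighbor j of i
    grad_i   : stochastic gradient of f_i evaluated at x_i
    compress : unbiased compression operator C(.)
    Returns the new local model and the compressed difference to broadcast.
    """
    # Step 1: weighted average over x_i and the neighbor replicas,
    # followed by a local stochastic gradient step.
    x_half = W_row[i] * x_i + sum(W_row[j] * replicas[j] for j in replicas)
    x_half = x_half - gamma * grad_i
    # Step 2: compress the *difference* between successive local models
    # and apply the compressed difference to the local model.
    z_i = compress(x_half - x_i)
    x_new = x_i + z_i
    # Step 3: z_i is what gets sent to the neighbors.
    return x_new, z_i

def dcd_psgd_receive(replicas, j, z_j):
    """Apply the compressed difference received from neighbor j to its replica,
    keeping the replica equal to neighbor j's actual local model."""
    replicas[j] = replicas[j] + z_j
```

Because the sender applies exactly the same compressed difference to its own model that every receiver applies to its replica, the replicas never drift away from the true local models, which is what keeps the compression error from accumulating.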

To ensure convergence, we need to place some restriction on the compression operator $C_\omega$. Again, this compression operator could be random quantization, random sparsification, or any other unbiased operator. We introduce a signal-to-noise related parameter $\alpha$ that bounds the ratio between the compression error and the compressed signal. We have the following theorem.

Theorem 1.

Under Assumption 1, if the compression operator satisfies the required bound on the signal-to-noise parameter $\alpha$ and the step length $\gamma$ is chosen to satisfy the stated condition, then Algorithm 1 admits the convergence rate in (2), whose leading term is of order $1/\sqrt{nT}$ and whose remaining terms depend on the spectral gap, the compression level, and the variance parameters.

To make the result clearer, we choose the step length appropriately in the following corollary:

Corollary 2.

Choose the step length in Algorithm 1 as specified. If the compression-related parameter is small enough to satisfy the required condition, then the bound in Theorem 1 simplifies accordingly, where all quantities follow the same definitions as in Theorem 1 and the problem-dependent constants are treated as constants.

The leading term of the convergence rate is $O(1/\sqrt{nT})$, and we also prove a more complete form of the convergence rate (see (27) in the Supplement). We shall see the tightness of our result in the following discussion.

Linear speedup

Since the leading term of the convergence rate is $O(1/\sqrt{nT})$ when $T$ is large, which is consistent with the convergence rate of C-PSGD, this indicates that we achieve a linear speedup with respect to the number of nodes.

Consistence with D-PSGD

Setting the compression to be exact to match the scenario of D-PSGD, DCD-PSGD admits a rate that is slightly better than the rate of D-PSGD proved in Lian et al. (2017b). The non-leading terms' dependence on the spectral gap is also consistent with the result for D-PSGD.

4.2 Extrapolation compression approach

From Theorem 1, we can see that there is an upper bound on the compression level in DCD-PSGD. Moreover, since the spectral gap shrinks as the number of workers grows, DCD-PSGD will fail to work under very aggressive compression. In this section, we therefore propose another approach, namely ECD-PSGD, which removes the restriction on the compression level at a small cost in computational efficiency.

For ECD-PSGD, we make the following assumption that the noise brought by compression is bounded.

Assumption 2.

(Bounded compression noise) We assume the noise due to compression is unbiased and its variance is bounded, that is, $\mathbb{E}_\omega C_\omega(z) = z$ and $\mathbb{E}_\omega\|C_\omega(z) - z\|^2 \le \sigma_c^2$ for all $z$.

Instead of sending the local model $x_t^{(i)}$ directly to the neighbors, each node sends a $z$-value that is extrapolated from $x_{t-1}^{(i)}$ and $x_t^{(i)}$ at each iteration. Each node (say, node $j$) estimates its neighbors' values from the compressed $z$-values at the $t$-th iteration. This procedure ensures a diminishing estimation error.

At the $t$-th iteration, node $j$ performs the following steps so that its neighbors can estimate $x_t^{(j)}$ by $\hat{x}_t^{(j)}$:

  • Node $j$ computes the $z$-value $z_t^{(j)}$ that is obtained through extrapolation of its last two local models $x_{t-1}^{(j)}$ and $x_t^{(j)}$:

    (3)
  • Compress $z_t^{(j)}$ and send it to its neighbors, say node $i$. Node $i$ computes the estimate $\hat{x}_t^{(j)}$ by combining its previous estimate with $C_\omega(z_t^{(j)})$:

    (4)

Using Lemma 12 (see the Supplemental Materials), if the compression noise has variance globally bounded by $\sigma_c^2$, the estimation error $\mathbb{E}\|\hat{x}_t^{(j)} - x_t^{(j)}\|^2$ diminishes as $t$ grows.

Estimating the neighbors' local models in this way leads to an updating form equivalent to that of D-PSGD, with an additional, diminishing noise term coming from the estimation.

The full extrapolation compression D-PSGD (ECD-PSGD) algorithm is summarized in Algorithm 2.
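The sketch below illustrates the sender and receiver sides of this estimation scheme. The specific t/2-style extrapolation and 2/t averaging weights are an illustrative assumption chosen so that the estimate stays unbiased while the weight on the freshly compressed (noisy) term decays over time; the exact coefficients used by ECD-PSGD are those of Eqs. (3) and (4).

```python
def ecd_send(x_prev, x_curr, t, compress):
    """Sender side: extrapolate the last two local models into a z-value and
    compress it.  The t/2-style weights here are an illustrative choice, not
    necessarily the exact coefficients of Eq. (3)."""
    alpha = t / 2.0
    z = (1.0 - alpha) * x_prev + alpha * x_curr     # extrapolation of x_{t-1}, x_t
    return compress(z)

def ecd_receive(x_hat_prev, z_compressed, t):
    """Receiver side: fold the compressed z-value into the running estimate of
    the neighbor's model (an Eq. (4)-style weighted update).  The weight on the
    new, noisy term decays like 2/t, so the compression noise in the estimate
    shrinks as training proceeds."""
    beta = 2.0 / t
    return (1.0 - beta) * x_hat_prev + beta * z_compressed
```

With these illustrative weights, if the previous estimate equals $x_{t-1}^{(j)}$ exactly, then the expectation of the updated estimate is exactly $x_t^{(j)}$, while the compression noise enters only through the decaying 2/t factor.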

Below we show that the ECD-PSGD algorithm admits the same convergence rate and the same computational complexity as D-PSGD.

Theorem 3 (Convergence of Algorithm 2).

Under Assumptions 1 and 2, choosing the step length $\gamma$ in Algorithm 2 to be a constant satisfying the stated condition, Algorithm 2 admits the convergence rate in (5), whose leading term is of order $1/\sqrt{nT}$ and whose remaining terms depend on the spectral gap and the compression-noise bound.

To make the result clearer, we choose the step length appropriately in the following corollary:

Corollary 4.

In Algorithm 2, choose the step length as specified. Then it admits the convergence rate in (6), with the problem-dependent constants treated as constants.

This result suggests that the algorithm converges roughly at the rate $O(1/\sqrt{nT})$, and we also prove a more complete form of the convergence rate (see (36) in the Supplement). The following analysis gives a more detailed interpretation and shows the tightness of our result.

Linear speedup

Since the leading term of the convergence rate is $O(1/\sqrt{nT})$ when $T$ is large, which is consistent with the convergence rate of C-PSGD, this indicates that we achieve a linear speedup with respect to the number of nodes.

Consistence with D-PSGD

Setting the compression noise to zero to match the scenario of D-PSGD, ECD-PSGD admits a rate that is slightly better than the rate of D-PSGD proved in Lian et al. (2017b). The non-leading terms' dependence on the spectral gap is also consistent with the result for D-PSGD.

Comparison between DCD-PSGD and ECD-PSGD

In terms of convergence rate, ECD-PSGD is slightly worse than DCD-PSGD due to additional terms depending on the compression noise: when the compression noise is relatively large compared with the stochastic gradient variance, those additional terms dominate the convergence rate. On the other hand, DCD-PSGD does not allow overly aggressive compression or quantization and may diverge when its condition on the compression operator is violated, while ECD-PSGD is quite robust to aggressive compression or quantization.

Figure 2: Performance Comparison between Decentralized and AllReduce implementations.

5 Experiments

In this section we evaluate the two decentralized algorithms by comparing them with an Allreduce implementation of centralized SGD. We run experiments under diverse network conditions and show that decentralized algorithms with low precision can speed up training without hurting convergence.

5.1 Experimental Setup

We choose an image classification task as the benchmark to evaluate our theory. We train ResNet-20 (He et al., 2016) on the CIFAR-10 dataset, which has 50,000 images for training and 10,000 images for testing. The two proposed algorithms are implemented in Microsoft CNTK and compared with CNTK's original implementation of distributed SGD:

Figure 3: Performance Comparison in Diverse Network Conditions.

  • Centralized: This implementation is based on MPI Allreduce primitive with full precision (32 bits). It is the standard training method for multiple nodes in CNTK.

  • Decentralized_32bits/8bits: The implementation of the proposed decentralized approach with OpenMPI. The full precision is 32 bits, and the compressed precision is 8 bits.

  • In this paper, we omit the comparison with quantized centralized training because the difference between Decentralized 8bits and Centralized 8bits would be similar to that reported in the original decentralized training paper (Lian et al., 2017a): when the network latency is high, the decentralized algorithm outperforms the centralized algorithm in terms of the time for each epoch.

We run experiments on 8 Amazon EC2 instances, each of which has one Nvidia K80 GPU. We use each GPU as a node. In the decentralized cases, the 8 nodes are connected in a ring topology, which means that each node only communicates with its two neighbors. The batch size for each node is the same as the default configuration in CNTK. We also tune the learning rate for each variant.

5.2 Convergence and Run Time Performance

We first study the convergence of our algorithms. Figure 2(a) shows the convergence w.r.t. the number of epochs for the centralized and decentralized cases. We only show ECD-PSGD in the figure (and call it Decentralized) because DCD-PSGD has almost identical convergence behavior in this experiment. We can see that with our algorithms, decentralization and compression do not hurt the convergence rate.

We then compare the runtime performance. Figure 2(b, c, d) demonstrates how the training loss decreases with the run time under different network conditions. We use a traffic-control command to change the bandwidth and latency of the underlying network. By default, 1.4 Gbps bandwidth and 0.13 ms latency is the best network condition we can get in this cluster. In this setting, all implementations have very similar runtime performance because communication is not the bottleneck of the system. When the latency is high, as shown in Figure 2(c), the decentralized algorithms in both low and full precision outperform the Allreduce method because of the smaller number of communication rounds. However, in the low-bandwidth case, training time is mainly dominated by the amount of communicated data, so the low-precision method is clearly faster than the full-precision methods.

5.3 Speedup in Diverse Network Conditions

To better understand the influence of bandwidth and latency on speedup, we compare the time of one epoch under various network conditions. Figure 3(a, b) shows the trend of epoch time as the bandwidth decreases from 1.4 Gbps to 5 Mbps. When the latency is low (Figure 3(a)), the low-precision algorithm is faster than its full-precision counterpart because it only needs to exchange around one fourth of the data exchanged by the full-precision method. Note that, although decentralized, the full-precision case has no advantage over Allreduce in this situation, because they exchange exactly the same amount of data. When it comes to the high-latency setting shown in Figure 3(b), both the full- and low-precision cases are much better than Allreduce in the beginning, but the full-precision method degrades dramatically as the bandwidth declines.

Figure 3(c, d) shows how latency influences the epoch time under good and bad bandwidth conditions. When bandwidth is not the bottleneck (Figure 3(c)), the decentralized approaches with full and low precision have similar epoch times because they have the same number of communication rounds. As expected, Allreduce is slower in this case. When the bandwidth is very low (Figure 3(d)), only the decentralized algorithm with low precision achieves the best performance among all implementations.

Figure 4: Comparison of Alg. 1 and Alg. 2

5.4 Discussion

Our previous experiments validate the efficiency of the decentralized algorithms on 8 nodes with 8 bits. However, we wonder whether we can scale to more nodes or compress the exchanged data even more aggressively. We first conducted experiments on 16 nodes with 8 bits as before. According to Figure 4(a), Alg. 1 and Alg. 2 on 16 nodes can still achieve basically the same convergence rate as Allreduce, which shows the scalability of our algorithms. However, they are not comparable to Allreduce with 4 bits, as shown in Figure 4(b). It is noteworthy that the two compression approaches have quite different behaviors with 4 bits. Although Alg. 1 converges much more slowly than Allreduce, its training loss keeps decreasing, whereas Alg. 2 diverges at the beginning of training. This observation is consistent with our theoretical analysis.

6 Conclusion

In this paper, we studied the problem of combining two techniques for training with distributed stochastic gradient descent under imperfect network conditions: quantization and decentralization. We developed two novel algorithms for quantized, decentralized training, analyzed the theoretical properties of both algorithms, and empirically studied their performance in various settings of network conditions. We found that when the underlying communication network has both high latency and low bandwidth, the quantized, decentralized algorithms outperform other strategies significantly.

References

  • Abadi et al. [2016] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
  • Agarwal and Duchi [2011] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
  • Alistarh et al. [2017] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1707–1718, 2017.
  • Bottou [2010] L. Bottou. Large-scale machine learning with stochastic gradient descent. Proc. of the International Conference on Computational Statistics (COMPSTAT), 2010.
  • Boyd et al. [2006] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE/ACM Trans. Netw., 14(SI):2508–2530, June 2006. ISSN 1063-6692. doi: 10.1109/TIT.2006.874516.
  • Chen et al. [2015] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
  • Colin et al. [2016] I. Colin, A. Bellet, J. Salmon, and S. Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
  • De Sa et al. [2017] C. De Sa, M. Feldman, C. Ré, and K. Olukotun. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pages 561–574. ACM, 2017.
  • Dekel et al. [2012] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
  • Dobbe et al. [2017] R. Dobbe, D. Fridovich-Keil, and C. Tomlin. Fully decentralized policies for multi-agent systems: An information theoretic approach. In Advances in Neural Information Processing Systems, pages 2945–2954, 2017.
  • Drumond et al. [2018] M. Drumond, T. Lin, M. Jaggi, and B. Falsafi. Training dnns with hybrid block floating point. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 451–461, 2018. URL http://papers.nips.cc/paper/7327-training-dnns-with-hybrid-block-floating-point.
  • Ghadimi and Lan [2013] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013. doi: 10.1137/120880811.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • He et al. [2018] L. He, A. Bian, and M. Jaggi. Cola: Decentralized linear learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4541–4551. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7705-cola-decentralized-linear-learning.pdf.
  • Kashyap et al. [2007] A. Kashyap, T. Başar, and R. Srikant. Quantized consensus. Automatica, 43(7):1192–1203, 2007.
  • Konečnỳ and Richtárik [2016] J. Konečnỳ and P. Richtárik. Randomized distributed mean estimation: Accuracy vs communication. arXiv preprint arXiv:1611.07555, 2016.
  • Lan et al. [2017] G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. 01 2017.
  • Lavaei and Murray [2012] J. Lavaei and R. M. Murray. Quantized consensus by means of gossip algorithm. IEEE Transactions on Automatic Control, 57(1):19–32, 2012.
  • Li et al. [2017] Z. Li, W. Shi, and M. Yan. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. arXiv preprint arXiv:1704.07807, 2017.
  • Lian et al. [2015] X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
  • Lian et al. [2017a] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. 05 2017a.
  • Lian et al. [2017b] X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. 10 2017b.
  • Lin et al. [2018] T. Lin, S. U. Stich, and M. Jaggi. Don’t use large mini-batches, use local SGD. CoRR, abs/1808.07217, 2018. URL http://arxiv.org/abs/1808.07217.
  • Mhamdi et al. [2017] E. Mhamdi, E. Mahdi, H. Hendrikx, R. Guerraoui, and A. D. O. Maurer. Dynamic safe interruptibility for decentralized multi-agent reinforcement learning. Technical report, EPFL, 2017.
  • Moulines and Bach [2011] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011.
  • Nedic and Ozdaglar [2009] A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
  • Nedic et al. [2009] A. Nedic, A. Olshevsky, A. Ozdaglar, and J. N. Tsitsiklis. On distributed averaging algorithms and quantization effects. IEEE Transactions on Automatic Control, 54(11):2506–2517, 2009.
  • Nemirovski et al. [2009] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. doi: 10.1137/070704277.
  • Omidshafiei et al. [2017] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent rl under partial observability. arXiv preprint arXiv:1703.06182, 2017.
  • Recht et al. [2011] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
  • Seide and Agarwal [2016] F. Seide and A. Agarwal. Cntk: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 2135–2135, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2945397.
  • Shi et al. [2015] W. Shi, Q. Ling, G. Wu, and W. Yin. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015.
  • Sirb and Ye [2016] B. Sirb and X. Ye. Consensus optimization with delayed and stochastic gradients on decentralized networks. In 2016 IEEE International Conference on Big Data (Big Data), pages 76–85, 2016.
  • Stich [2018] S. U. Stich. Local SGD converges fast and communicates little. Technical Report, page arXiv:1805.00982, 2018.
  • Stich et al. [2018] S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified sgd with memory. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4452–4463. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7697-sparsified-sgd-with-memory.pdf.
  • Suresh et al. [2017] A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan. Distributed mean estimation with limited communication. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3329–3337, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/suresh17a.html.
  • Wangni et al. [2017] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. arXiv preprint arXiv:1710.09854, 2017.
  • Yuan et al. [2016] K. Yuan, Q. Ling, and W. Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016. doi: 10.1137/130943170.
  • Zhang et al. [2017a] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning, pages 4035–4043, 2017a.
  • Zhang et al. [2017b] W. Zhang, P. Zhao, W. Zhu, S. C. Hoi, and T. Zhang. Projection-free distributed online learning in networks. In International Conference on Machine Learning, pages 4054–4062, 2017b.
  • Zhao and Song [2016] L. Zhao and W. Song. Decentralized consensus in distributed networks. International Journal of Parallel, Emergent and Distributed Systems, pages 1–20, 2016.
  • Zheng et al. [2016] S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z.-M. Ma, and T.-Y. Liu. Asynchronous stochastic gradient descent with delay compensation for distributed deep learning. arXiv preprint arXiv:1609.08326, 2016.
  • Zinkevich et al. [2010] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.

Appendix A General bound with compression noise

In this section, to see the influence of the compression more clearly, we prove two general bounds (see Lemma 7 and Lemma 8) for compressed D-PSGD with the same updating rule as (7). These bounds are very helpful in the subsequent proofs for our algorithms.

The most challenging part of a decentralized algorithm, unlike a centralized algorithm, is to ensure that the local model on each node converges to the average value $\overline{x}_t := \frac{1}{n}\sum_{i=1}^{n} x_t^{(i)}$. So we start with an analysis of the deviation of the local models from this average and its influence on the final convergence rate. For both ECD-PSGD and DCD-PSGD, we prove two bounds on this deviation, from which we can see that the extra noise term introduced by compression determines how efficiently $x_t^{(i)}$ converges to the average $\overline{x}_t$.

The proof of the general bound for (7) is divided into two parts. In Subsection A.1, we provide a new perspective for understanding decentralization, which greatly simplifies the subsequent proof. In Subsection A.2, we give the detailed proof of the general bound.

A.1 A more intuitive way to understand decentralization

To better understand how decentralized algorithms work and how a consensus among the local variables on all nodes can be ensured, we provide a new perspective on decentralization using a coordinate transformation, which simplifies the analysis that follows.

The confusion matrix $W$ is symmetric and doubly stochastic, so we can decompose it as $W = P\Lambda P^\top$, where $P$ is orthogonal ($P^\top P = I$) and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ collects the eigenvalues of $W$. Without loss of generality, we can assume $1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Then the general updating rule (7) can be rewritten using this decomposition.

Consider the coordinate transformation that uses $P$ as the change-of-basis matrix, and denote by $Y_t$, $\hat{Y}_t$, and $G_t$ the transformed model, estimate, and gradient matrices. Then the above update can be rewritten in the new coordinates as

(8)

Since $\Lambda$ is a diagonal matrix, we use $y_t^{(i)}$, $\hat{y}_t^{(i)}$, and $g_t^{(i)}$ to indicate the $i$-th columns of $Y_t$, $\hat{Y}_t$, and $G_t$. Then (8) decouples column-wise into

(9)

Equation (9) offers a much more intuitive way to analyze the algorithm. Since all eigenvalues of $W$, except $\lambda_1 = 1$, satisfy $|\lambda_i| < 1$, the corresponding components $y_t^{(i)}$ would "decay to zero" due to the scaling factor $\lambda_i^t$.

Moreover, since the eigenvector corresponding to $\lambda_1 = 1$ is $\frac{1}{\sqrt{n}}\mathbf{1}_n$, the first transformed component is proportional to the average of the local models. So, intuitively, as the components corresponding to $\lambda_i$ with $i \ge 2$ vanish, only the averaged component survives and all local models approach the common average. This whole process shows how the confusion matrix drives consensus under a coordinate transformation.
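This decay argument is easy to check numerically. The sketch below builds a symmetric doubly stochastic ring matrix (an illustrative choice), diagonalizes it as $W = P\Lambda P^\top$, and confirms that, after $t$ rounds of multiplication by $W$, every eigen-component except the one aligned with $\mathbf{1}_n$ has shrunk by at least a factor $\rho^t$.

```python
import numpy as np

n = 8
W = np.zeros((n, n))
for i in range(n):                        # symmetric doubly stochastic ring matrix
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

lam, P = np.linalg.eigh(W)                # W = P diag(lam) P^T, with P orthogonal
order = np.argsort(-lam)                  # sort eigenvalues so lam[0] = 1 comes first
lam, P = lam[order], P[:, order]
rho = max(abs(lam[1]), abs(lam[-1]))      # spectral-gap parameter, rho < 1

X = np.random.default_rng(0).normal(size=(n, 3))   # arbitrary local values (3 coords)
Y = P.T @ X                               # coordinates in the eigenbasis of W
for t in (1, 10, 100):
    Yt = (lam ** t)[:, None] * Y          # effect of t gossip rounds in the eigenbasis
    # Every row except the first (eigenvalue 1) shrinks at least as fast as rho**t.
    print(t, rho ** t, np.abs(Yt[1:]).max())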

A.2 Analysis of the general updating form in (7)

Lemma 5.

For any matrix $A$, decompose the confusion matrix as $W = P\Lambda P^\top$, where the columns of $P$ are the normalized eigenvectors of $W$ and $\Lambda$ is a diagonal matrix with $\lambda_i$ as its $i$-th diagonal element. We have the following bound, where the quantities follow the definitions in Theorem 3.

Proof.

Since $W = P\Lambda P^\top$ with $P$ orthogonal, we can expand the quantity of interest in the eigenbasis of $W$ and bound each component by the corresponding eigenvalue, which gives the claimed bound. Specializing to the case needed later, we obtain

(10)

∎

Lemma 6.

Given two non-negative sequences and