DeepSqueeze: Decentralization Meets Error-Compensated Compression

07/17/2019, by Hanlin Tang, et al.

Communication is a key bottleneck in distributed training. Recently, an error-compensated compression technology was designed specifically for centralized learning and achieved great success, showing significant advantages over state-of-the-art compression based methods in saving communication cost. Since decentralized training has been shown to be superior to traditional centralized training in communication restricted scenarios, a natural question to ask is: how can the error-compensated technology be applied to decentralized learning to further reduce the communication cost? However, a trivial extension of compression based centralized training algorithms does not exist for the decentralized scenario: key differences between centralized and decentralized training make this extension highly non-trivial. In this paper, we propose an elegant algorithmic design, named DeepSqueeze, that employs error-compensated stochastic gradient descent in the decentralized scenario. Both theoretical analysis and empirical study show that the proposed DeepSqueeze algorithm outperforms existing compression based decentralized learning algorithms. To the best of our knowledge, this is the first work to apply error-compensated compression to decentralized learning.


1 Introduction

We consider the following decentralized optimization problem:

$$\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n} \sum_{i=1}^{n} \underbrace{\mathbb{E}_{\xi \sim \mathcal{D}_i} F_i(x; \xi)}_{=: f_i(x)}, \qquad (1)$$

where $n$ is the number of nodes and $\mathcal{D}_i$ is the local data distribution for node $i$. The $n$ nodes form a connected graph, and each node can only communicate with its neighbors.

Communication is a key bottleneck in distributed training (Seide and Agarwal, 2016; Abadi et al., 2016). A popular strategy to reduce the communication cost is to compress the gradients computed on the local workers before sending them to the parameter server or a central node. To ensure convergence, the compression usually needs to be unbiased, and the compression ratio needs to be chosen cautiously (it cannot be too aggressive), since the bias or noise caused by the compression may significantly degrade convergence. Recently, an error-compensation technology has been designed to make aggressive compression possible. The key idea is to store the compression error from the previous step and send the compressed sum of the current gradient and the remaining compression error (Tang et al., 2019):

$v_t^{(i)} = C\big[g_t^{(i)} + \delta_{t-1}^{(i)}\big]$ (compressed gradient on the local workers)
$\delta_t^{(i)} = g_t^{(i)} + \delta_{t-1}^{(i)} - v_t^{(i)}$ (remaining error on the local workers)
$x_{t+1} = x_t - \frac{\gamma}{n} \sum_{i=1}^{n} v_t^{(i)}$ (update on the parameter server)

where $g_t^{(i)}$ is the stochastic gradient computed on worker $i$. This idea has been proven to be very effective and may significantly outperform traditional (non-error-compensated) compression based centralized training algorithms such as Zhang et al. (2017) and Jiang and Agrawal (2018).
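To make the error-compensation loop concrete, the following minimal NumPy sketch simulates it on a toy quadratic objective; the top-k compressor, the objective, and all variable names are illustrative assumptions, not the implementation of Tang et al. (2019).

```python
import numpy as np

def topk_compress(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest (a biased compressor)."""
    idx = np.argsort(np.abs(v))[-k:]
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(0)
d, n_workers, lr, steps = 50, 4, 0.1, 200
target = rng.normal(size=d)                        # toy objective: f(x) = 0.5*||x - target||^2
x = np.zeros(d)                                    # model kept on the parameter server
errors = [np.zeros(d) for _ in range(n_workers)]   # per-worker compression error (delta)

for t in range(steps):
    compressed = []
    for i in range(n_workers):
        g = (x - target) + 0.1 * rng.normal(size=d)    # noisy local stochastic gradient
        v = topk_compress(g + errors[i], k=5)          # compress gradient + previous error
        errors[i] = g + errors[i] - v                  # store what the compressor dropped
        compressed.append(v)
    x = x - lr * np.mean(compressed, axis=0)           # parameter-server update

print("final suboptimality:", 0.5 * np.linalg.norm(x - target) ** 2)
```

Even though top-k compression is biased, the stored error re-injects whatever the compressor drops, so the accumulated bias stays controlled over the iterations.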

Decentralized training has been proven to be superior to centralized training in terms of reducing the communication cost (Boyd et al., 2006), especially when the model is huge and the network bandwidth and latency are limited. The key reason is that in the decentralized learning framework workers only need to communicate with their individual neighbors, while in centralized learning all workers are required to talk to the central node (or the parameter server). Moreover, the iteration complexity (or the total computational complexity) of decentralized learning is proven to be comparable to that of its centralized counterpart (Lian et al., 2017).

Therefore, given the recent successes of the error-compensated technology for the centralized learning, it motivates us to ask a natural question: How to apply the error-compensated technology to the decentralized learning to further reduce the communication cost?

However, key differences exist between centralized and decentralized training, and it is highly non-trivial to extend the error-compensated technology to decentralized learning. In this paper, we propose an error-compensated decentralized stochastic gradient method named DeepSqueeze by combining these two strategies. Both theoretical analysis and empirical study are provided to show the advantage of the proposed DeepSqueeze algorithm over the existing compression based decentralized learning algorithms, including Tang et al. (2018); Koloskova et al. (2019).

Our Contribution:

  • To the best of our knowledge, this is the first work to apply the error-compensated compression technology to decentralized learning.

  • Our algorithm is compatible with almost all compression strategies and admits a much higher tolerance for aggressive compression ratios than existing work (for example, Tang et al. (2018)), in both theory and empirical study.

Notations and definitions

Throughout this paper, we use the following notations:

  • $\nabla f(\cdot)$ denotes the gradient of a function $f$.

  • $f^*$ denotes the optimal value of the minimization problem (1).

  • $f_i(x) := \mathbb{E}_{\xi \sim \mathcal{D}_i} F_i(x; \xi)$.

  • $\lambda_i(\cdot)$ denotes the $i$-th largest eigenvalue of a symmetric matrix.

  • $\mathbf{1}_n = [1, 1, \dots, 1]^\top$ denotes the full-one vector.

  • $\mathbf{1}\mathbf{1}^\top$ is an all-one square matrix.

  • $\|\cdot\|$ denotes the $\ell_2$ norm for vectors and the spectral norm for matrices.

  • $\|\cdot\|_F$ denotes the Frobenius norm of matrices.

  • $C_\omega[\cdot]$ denotes the compressing operator.

The rest of the paper is organized as follows. We first review related studies, then discuss our proposed method in detail and provide the key theoretical results. We further validate our method empirically, and finally conclude the paper.

2 Related Work

In this section, we review the recent works in a few areas that are closely related to the topic of this study: distributed learning, decentralized learning, compression based learning, and error-compensated compression methods.

2.1 Centralized Parallel Learning

Distributed learning is a widely used acceleration strategy for training deep neural networks with high computational cost. Two main designs have been developed to parallelize the computation: 1) centralized parallel learning (Agarwal and Duchi, 2011; Recht et al., 2011), where all workers are guaranteed to obtain information from all others; and 2) decentralized parallel learning (He et al., 2018; Li and Yan, 2017; Lian et al., 2017; Dvurechenskii et al., 2018), where each worker only gathers information from a fraction of its neighbors. Centralized parallel learning requires a supporting network architecture to aggregate all local models or gradients in each iteration. Various implementations have been developed for information aggregation in centralized systems, for example, the parameter server (Li et al., 2014; Abadi et al., 2016) and AllReduce (Seide and Agarwal, 2016; Renggli et al., 2018). Related lines of work include adaptive distributed learning, differentially private distributed learning, distributed proximal primal-dual algorithms, non-smooth distributed optimization, projection-free distributed online learning, and parallel back-propagation for deep learning.

2.2 Decentralized Parallel Learning

Unlike centralized learning, decentralized learning does not require obtaining information from all workers in each step. Therefore, the network structure has fewer restrictions than in centralized parallel learning. Due to this high flexibility in network design, decentralized parallel learning has been the focus of many recent studies.

In decentralized parallel learning, workers only communicate with their neighbors. There are two main types of decentralized learning algorithms, depending on whether the network topology is fixed (He et al., 2018) or time-varying (Nedić and Olshevsky, 2015; Lian et al., 2018) during training. Wu et al. (2017) and Shen et al. (2018) show that decentralized SGD converges at a rate comparable to the centralized algorithm while requiring less communication, which makes large-scale model training feasible. Li et al. (2018) provide a systematic analysis of the decentralized learning pipeline.

2.3 Compressed Communication Distributed Learning

To further reduce the communication overhead, one promising direction is to compress the variables that are sent between different workers (Chen et al., 2018; Wang et al., 2018). Previous works mostly focus on the centralized scenario, where the compression is assumed to be unbiased (Wangni et al., 2018; Shen et al., 2018; Zhang et al., 2017; Wen et al., 2017; Jiang and Agrawal, 2018). A general theoretical analysis of centralized compressed parallel SGD can be found in Alistarh et al. (2017). Beyond this, some biased compression methods have also been proposed and proven to be quite efficient in reducing the communication cost. One example is 1-Bit SGD (Seide et al., 2014), which compresses each entry of the gradient vector into a single bit according to its sign. A theoretical guarantee for this type of method is given in Bernstein et al. (2018).
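As an illustration of such a biased sign-based compressor, the sketch below keeps only the sign of each entry plus a single scaling factor; the particular scale (the mean absolute value) is our assumption for this example, not a claim about the exact scheme of Seide et al. (2014).

```python
import numpy as np

def sign_compress(v):
    """1-bit style compressor: transmit only the sign of each entry plus one scale.

    Using the mean absolute value as the scale is an illustrative choice; the
    receiver reconstructs scale * sign(v)."""
    scale = np.mean(np.abs(v))       # single float sent alongside the bit vector
    return scale * np.sign(v)

v = np.array([0.3, -1.2, 0.05, 0.7])
print(sign_compress(v))              # [ 0.5625 -0.5625  0.5625  0.5625]
```

Each 32-bit entry is thus reduced to one bit plus a shared float, while the scale keeps the reconstructed magnitudes in a reasonable range.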

Recently, another emerging area of compressed distributed learning is decentralized compressed learning. Unlike the centralized setting, decentralized learning requires each worker to share its model parameters instead of its gradients. This difference can potentially invalidate convergence (see Tang et al. (2018) for an example). Many methods have been proposed to address this problem. One idea is to use the difference of the model updates as the shared information (Tang et al., 2018). Koloskova et al. (2019) further reduce the communication error by sharing the difference of the model difference, but their result only covers strongly convex loss functions, which is not the case for modern deep neural networks.

Tang et al. (2018) also propose an extrapolation-like strategy to control the compression level. None of these works employ the error-compensation strategy.

2.4 Error-Compensated SGD

Interestingly, an error-compensation strategy for AllReduce 1Bit-SGD was proposed in Seide et al. (2014); it compensates the compression error in each iteration and incurs much less accuracy drop than training without error compensation. The error-compensation strategy has been proven to potentially improve training efficiency for strongly convex loss functions (Stich et al., 2018). Wu et al. (2018) further study an error-compensated SGD for quadratic optimization by adding two hyperparameters to compensate the error, but do not theoretically verify the advantage of using error compensation. Most recently, Tang et al. (2019) give a theoretical analysis showing that, in centralized learning, the error-compensation strategy is compatible with an arbitrary compression technique, and fundamentally explain why error compensation admits an improved convergence rate with linear speedup for general non-convex loss functions. However, their work only studies centralized distributed training. Whether the error-compensation strategy works, both theoretically and empirically, for decentralized learning remains to be investigated. We answer this question in this paper.

3 Algorithm

1:  Initialize: local models $x_0^{(i)} = x_0$, learning rate $\gamma$, averaging rate $\eta$, initial error $\delta_{-1}^{(i)} = 0$, number of total iterations $T$, and weight matrix $W$.
2:  for $t = 0, 1, \dots, T-1$ do
3:     (On the $i$-th node)
4:     Randomly sample $\xi_t^{(i)}$ and compute the local stochastic gradient $g_t^{(i)} := \nabla F_i(x_t^{(i)}; \xi_t^{(i)})$.
5:     Compute the error-compensated variable $v_t^{(i)} = x_t^{(i)} - \gamma g_t^{(i)} + \delta_{t-1}^{(i)}$.
6:     Compress $v_t^{(i)}$ into $C_\omega[v_t^{(i)}]$ and update the error $\delta_t^{(i)} = v_t^{(i)} - C_\omega[v_t^{(i)}]$.
7:     Send the compressed variable $C_\omega[v_t^{(i)}]$ to the immediate neighbors of the $i$-th node.
8:     Receive $C_\omega[v_t^{(j)}]$ from all neighbors $j$ of the $i$-th node with $W_{ij} > 0$ (notice that $W_{ii} > 0$).
9:     Update the local model $x_{t+1}^{(i)} = x_t^{(i)} - \gamma g_t^{(i)} + \eta \sum_{j} (W_{ij} - I_{ij})\, C_\omega[v_t^{(j)}]$.
10:  end for
11:  Output: $\bar{x}_T = \frac{1}{n} \sum_{i=1}^{n} x_T^{(i)}$.
Algorithm 1 DeepSqueeze

We introduce the details of our proposed DeepSqueeze algorithm below.

Consider $n$ workers in the network, where each individual worker (say, the $i$-th worker) can only communicate with its immediate neighbors, denoted by $\mathcal{N}(i)$. The connections can be represented by a symmetric doubly stochastic matrix, the weight matrix $W$, and we denote by $W_{ij}$ the weight between the $i$-th and $j$-th workers.

At time $t$, each worker, say worker $i$, maintains three main variables: the local model $x_t^{(i)}$, the compression error $\delta_t^{(i)}$, and the error-compensated variable $v_t^{(i)}$. These variables are updated in the following steps:

  1. Local Computation: Update the error-compensated variable by $v_t^{(i)} = x_t^{(i)} - \gamma g_t^{(i)} + \delta_{t-1}^{(i)}$, where $g_t^{(i)} = \nabla F_i(x_t^{(i)}; \xi_t^{(i)})$ is the stochastic gradient with randomly sampled $\xi_t^{(i)}$, and $\gamma$ is the learning rate.

  2. Local Compression: Compress the error-compensated variable into $C_\omega[v_t^{(i)}]$, where $C_\omega[\cdot]$ is the compression operator, and update the compression error by $\delta_t^{(i)} = v_t^{(i)} - C_\omega[v_t^{(i)}]$.

  3. Global Communication: Send the compressed variable $C_\omega[v_t^{(i)}]$ to the neighbors of the $i$-th node, namely $\mathcal{N}(i)$.

  4. Local Update: Update the local model by $x_{t+1}^{(i)} = x_t^{(i)} - \gamma g_t^{(i)} + \eta \sum_{j} (W_{ij} - I_{ij})\, C_\omega[v_t^{(j)}]$, where $I$ denotes the identity matrix and $\eta$ is the averaging rate.

Finally, the proposed DeepSqueeze algorithm is summarized in Algorithm 1.
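For intuition, the following NumPy simulation runs this style of error-compensated decentralized update on a toy least-squares problem with a ring topology; the quadratic local objectives, the ring mixing matrix, the top-k compressor, and the exact per-node update are illustrative assumptions based on the description above, not the authors' implementation.

```python
import numpy as np

def topk(v, k):
    """Top-k sparsification: one example of a compression operator C[.]."""
    idx = np.argsort(np.abs(v))[-k:]
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(1)
n, d = 8, 20                                   # 8 workers on a ring, 20-dim model
gamma, eta, T = 0.05, 0.5, 300                 # learning rate and averaging rate
W = np.zeros((n, n))                           # doubly stochastic ring mixing matrix
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

targets = rng.normal(size=(n, d))              # worker i minimizes 0.5*||x - targets[i]||^2
X = np.zeros((n, d))                           # local models x^(i)
D = np.zeros((n, d))                           # compression errors delta^(i)

for t in range(T):
    G = (X - targets) + 0.1 * rng.normal(size=(n, d))   # local stochastic gradients
    V = X - gamma * G + D                               # error-compensated variables
    C = np.stack([topk(V[i], k=4) for i in range(n)])   # compressed messages
    D = V - C                                           # updated compression errors
    X = X - gamma * G + eta * (W - np.eye(n)) @ C       # mix compressed info from neighbors

print("consensus distance :", np.linalg.norm(X - X.mean(axis=0)))
print("distance to optimum:", np.linalg.norm(X.mean(axis=0) - targets.mean(axis=0)))
```

In this sketch, increasing eta speeds up averaging but amplifies the compression-error term, while a smaller eta damps it, which mirrors the role of the averaging rate discussed in the analysis below.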

4 Theoretical Analysis

In this section, we present our theoretical analysis of DeepSqueeze. We first write out the updating rule of our algorithm in a compact form to give an intuitive explanation of how it works, and then present the final convergence rate.

4.1 Mathematical formulation

In order to get a global view of DeepSqueeze, we define the concatenated matrices

$$X_t := \big[x_t^{(1)}, \dots, x_t^{(n)}\big], \quad G_t := \big[g_t^{(1)}, \dots, g_t^{(n)}\big], \quad \Delta_t := \big[\delta_t^{(1)}, \dots, \delta_t^{(n)}\big].$$

Then the updating rule of DeepSqueeze follows

$$X_{t+1} = X_t - \gamma G_t + \eta\, C_\omega[V_t]\,(W - I), \qquad V_t := X_t - \gamma G_t + \Delta_{t-1},$$

which can also be rewritten as

$$X_{t+1} = (X_t - \gamma G_t)\,\widetilde{W} + \eta\,(\Delta_{t-1} - \Delta_t)(W - I),$$

where we denote $\widetilde{W} := (1 - \eta) I + \eta W$.

4.2 Why DeepSqueeze is better

We can get a better understanding of why DeepSqueeze outperforms other compression based decentralized algorithms by analyzing the closed forms of their updating rules. For simplicity, we assume the initial values are zero. Skipping the detailed derivation, we obtain the closed forms:

  • D-PSGD (Lian et al., 2017) (without compression):

  • DCD-PSGD (Tang et al., 2018) (with compression):

  • CHOCO-SGD (Koloskova et al., 2019) (with compression):

  • Updating rule of DeepSqueeze (with error-compensated compression):

We can see that if the compression error is zero, all compression based methods reduce to D-PSGD. Therefore, the efficiency of a compression based method depends on the magnitude of its compression-error term (for DCD-PSGD) or of $\eta(\Delta_{t-1} - \Delta_t)(W - I)$ (for DeepSqueeze and CHOCO-SGD).

Using the fact that $W$ is doubly stochastic and symmetric, we have $\|W - I\| \le 2$, which leads to

$$\big\|\eta(\Delta_{t-1} - \Delta_t)(W - I)\big\|_F \le 2\eta\big(\|\Delta_{t-1}\|_F + \|\Delta_t\|_F\big).$$

This means our algorithm can significantly reduce the influence of the historical compression error by controlling $\eta$. Actually, it will become clear from our analysis that $\eta$ has to be small enough if we compress the information very aggressively.
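The following short check builds the doubly stochastic mixing matrix of an 8-node ring (matching the topology used in our experiments) and numerically confirms the two spectral facts used here: the eigenvalues of W lie in [-1, 1], so the spectral gap quantity is below 1 and the spectral norm of W - I is at most 2. The 1/3 neighbor weights are an illustrative choice.

```python
import numpy as np

n = 8
W = np.zeros((n, n))                     # doubly stochastic ring: self + two neighbors
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

eig = np.sort(np.linalg.eigvalsh(W))[::-1]       # eigenvalues, largest first
rho = max(abs(eig[1]), abs(eig[-1]))             # spectral gap quantity, should be < 1
print("eigenvalues of W :", np.round(eig, 3))
print("rho              :", round(rho, 3))
print("||W - I||_2      :", round(np.linalg.norm(W - np.eye(n), 2), 3))  # at most 2
```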

Before introducing the final convergence rate of DeepSqueeze, we first state the assumptions used in our theoretical analysis.

Assumption 1.

Throughout this paper, we make the following commonly used assumptions:

  1. Lipschitzian gradient: All functions $f_i(\cdot)$ have $L$-Lipschitzian gradients.

  2. Symmetric doubly stochastic matrix: The weight matrix $W$ is a real doubly stochastic matrix that satisfies $W = W^\top$ and $W\mathbf{1} = \mathbf{1}$.

  3. Spectral gap: Given the symmetric doubly stochastic matrix $W$, we assume $\rho := \max\{|\lambda_2(W)|, |\lambda_n(W)|\} < 1$.

  4. Start from 0: We assume $x_0 = 0$. This assumption simplifies the proof w.l.o.g.

  5. Bounded variance: Assume the variance of the stochastic gradient is bounded: $\mathbb{E}_{\xi \sim \mathcal{D}_i} \|\nabla F_i(x; \xi) - \nabla f_i(x)\|^2 \le \sigma^2$ for all $i$ and $x$.

  6. Bounded signal-to-noise factor: The magnitude of the compression error is assumed to be bounded by the original vector's magnitude: $\mathbb{E}_\omega \|C_\omega[x] - x\|^2 \le \alpha \|x\|^2$ for all $x$, with $\alpha < 1$.

Notice that these assumptions are quite standard for analyzing decentralized algorithms in previous works (Lian et al., 2017). Unlike Koloskova et al. (2019), we do not require the gradient to be bounded, which makes our result applicable to a more general case. It is worth pointing out that many previous compression strategies satisfy the bounded signal-to-noise factor assumption, such as GSpar, random sparsification (Wangni et al., 2018), top-$k$ sparsification, and random quantization (Koloskova et al., 2019).
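As a quick numerical sanity check of the bounded signal-to-noise assumption, the snippet below estimates the worst-case ratio ||C[x] - x||^2 / ||x||^2 for top-k sparsification (without rescaling) on random vectors and compares it against the deterministic bound 1 - k/d; the test distribution and the specific compressor are illustrative assumptions.

```python
import numpy as np

def topk(v, k):
    """Top-k sparsification without rescaling."""
    idx = np.argsort(np.abs(v))[-k:]
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(2)
d, k = 1000, 100
worst = 0.0
for _ in range(200):
    x = rng.standard_t(df=3, size=d)             # heavy-tailed test vectors
    ratio = np.linalg.norm(topk(x, k) - x) ** 2 / np.linalg.norm(x) ** 2
    worst = max(worst, ratio)

print("worst observed ||C[x]-x||^2 / ||x||^2:", round(worst, 3))
print("deterministic bound 1 - k/d          :", 1 - k / d)
```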

Now we are ready to show the main theorem of DeepSqueeze.

Theorem 1.

For DeepSqueeze, if the averaging rate $\eta$ and the learning rate $\gamma$ satisfy

then we have

where

To make the result clearer, we choose the learning rate appropriately in the following corollary:

Corollary 1.

According to Theorem 1, choosing and , we have the following convergence rate for DeepSqueeze

where we have

treating the remaining problem-dependent quantities as constants.

This result suggests that:

Linear speedup: The asymptotic convergence rate of DeepSqueeze is $O(1/\sqrt{nT})$, which is exactly the same as that of centralized parallel SGD.

Consistency with D-PSGD: Setting the compression error to zero, DeepSqueeze reduces to the standard decentralized SGD (D-PSGD), and our result indicates a convergence rate that is slightly better than the previous one in Lian et al. (2017).

Superiority over DCD-PSGD and ECD-PSGD: By using the error-compensation strategy, we prove a milder dependence on the compression ratio, and DeepSqueeze is robust to any compression operator satisfying the bounded signal-to-noise assumption, whereas the previous work (Tang et al., 2018) only ensures a worse dependence and imposes a stricter restriction on the compression ratio.

Balance between communication and computation: Our result indicates that when the compression is very aggressive, we need to set $\eta$ small enough to ensure the convergence of our algorithm. However, a too small $\eta$ leads to a slower convergence rate, which means more iterations are needed for training. Our result can therefore serve as guidance for balancing the communication and computation costs of decentralized learning in different situations.

5 Experiments

In this section, we further demonstrate the superiority of the DeepSqueeze algorithm through an empirical study. Under aggressive compression ratios, we show the epoch-wise convergence of DeepSqueeze and compare it with centralized SGD (AllReduce), decentralized SGD without compression (D-PSGD (Lian et al., 2017)), and other existing decentralized compression algorithms (ECD-PSGD and DCD-PSGD (Tang et al., 2018), and CHOCO-SGD (Koloskova et al., 2019)).

5.1 Experimental Setup

Dataset

We benchmark the algorithms on a standard image classification task: CIFAR10 with ResNet-20. This dataset has a training set of 50,000 images and a test set of 10,000 images, where each image is assigned one of 10 labels.

Communication

The communication is implemented based on NVIDIA NCCL for AllReduce and gloo for all other algorithms. The workers are connected in a ring network so that each worker has two neighbors.

Compression

We use bit compression for all the compression algorithms, where every element of a tensor sent or received by a worker is compressed to 4 bits or 2 bits. A number representing the Euclidean norm of the original tensor is sent along with it, so that when a worker receives the compressed tensor, it can rescale it to obtain a tensor with the same Euclidean norm as the uncompressed tensor.
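A minimal sketch of such a norm-preserving k-bit quantizer is given below; the uniform signed-integer bucketing and the exact rescaling rule are our assumptions for illustration, not necessarily the precise scheme used in our implementation.

```python
import numpy as np

def kbit_compress(v, bits):
    """Quantize each entry to a small signed integer, then rescale so the
    reconstruction has the same Euclidean norm as the original tensor.
    Only the integers and one norm value need to be transmitted."""
    levels = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit, 1 for 2-bit
    vmax = np.max(np.abs(v)) + 1e-12
    q = np.round(v / vmax * levels)              # signed integers in [-levels, levels]
    norm = np.linalg.norm(v)                     # the single float sent alongside q
    return q * norm / (np.linalg.norm(q) + 1e-12)

v = np.random.default_rng(3).normal(size=1000)
for bits in (4, 2):
    c = kbit_compress(v, bits)
    print(f"{bits}-bit: norm preserved = {np.isclose(np.linalg.norm(c), np.linalg.norm(v))}, "
          f"relative error = {np.linalg.norm(c - v) / np.linalg.norm(v):.3f}")
```

The receiver only needs the small integers and one norm value to reconstruct a tensor with the correct Euclidean norm.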

Hardware

The experiments are done on 8 2080Ti GPUs, where each GPU is treated as a single worker in the algorithm.

5.2 Convergence Efficiency

We compare the algorithms with 4-bit and 2-bit compression; the convergence results are shown in Figure 1. We use a batch size of 128 and tune the learning rate for each algorithm over a candidate set. The learning rate is decreased by 5x every 60 epochs.

Note that, compared with the uncompressed methods, 4-bit compression saves 7/8 of the communication cost while 2-bit compression reduces the cost by 93.75%. The results show that under 4-bit compression, DeepSqueeze converges similarly to the uncompressed algorithms. DeepSqueeze is slightly better than CHOCO-SGD (Koloskova et al., 2019), and converges much faster than DCD-PSGD and ECD-PSGD. Note that DCD-PSGD does not converge under 4-bit compression (as stated in the original paper (Tang et al., 2018)), and ECD-PSGD does not converge under 2-bit compression. Under 2-bit compression, DeepSqueeze converges slower than the uncompressed algorithms, but is still faster than any other compression algorithm. The reason is that the fewer bits we use, the more we gain from error compensation in DeepSqueeze, and consequently DeepSqueeze converges faster.

The experimental results further confirm that DeepSqueeze achieves outstanding communication cost reduction (up to 93.75%) without much sacrifice on convergence.

Figure 1: Epoch-wise convergence of the algorithms with 8 workers connected on a ring network; 4-bit compression is on the left and 2-bit compression is on the right. Note that DCD-PSGD (marked with *) does not converge under 4-bit compression, as stated in the original paper (Tang et al., 2018), and ECD-PSGD does not converge under 2-bit compression, so they are not shown in the corresponding figures.

6 Conclusion

This paper applies the error-compensated compression strategy, which has achieved great success in centralized training, to decentralized training. Through an elaborate algorithmic design, the proposed DeepSqueeze algorithm is proven to admit a higher tolerance for aggressive compression than existing compression based decentralized training methods. Both theoretical analysis and empirical study validate our algorithm and theory.

References

  • Abadi et al. [2016] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association. ISBN 978-1-931971-33-1.
  • Agarwal and Duchi [2011] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
  • Alistarh et al. [2017] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: communication-efficient SGD via gradient quantization and encoding. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1707–1718, 2017.
  • Bernstein et al. [2018] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar. signsgd with majority vote is communication efficient and byzantine fault tolerant. 10 2018.
  • Boyd et al. [2006] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE transactions on information theory, 52(6):2508–2530, 2006.
  • Chen et al. [2018] T. Chen, G. Giannakis, T. Sun, and W. Yin. Lag: Lazily aggregated gradient for communication-efficient distributed learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5055–5065. Curran Associates, Inc., 2018.
  • Dvurechenskii et al. [2018] P. Dvurechenskii, D. Dvinskikh, A. Gasnikov, C. Uribe, and A. Nedich. Decentralize and randomize: Faster algorithm for wasserstein barycenters. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10760–10770. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8274-decentralize-and-randomize-faster-algorithm-for-wasserstein-barycenters.pdf.
  • He et al. [2018] L. He, A. Bian, and M. Jaggi. Cola: Decentralized linear learning. In Advances in Neural Information Processing Systems, pages 4541–4551, 2018.
  • Jiang and Agrawal [2018] P. Jiang and G. Agrawal. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2530–2541. Curran Associates, Inc., 2018.
  • Koloskova et al. [2019] A. Koloskova, S. U. Stich, and M. Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. CoRR, abs/1902.00340, 2019. URL http://arxiv.org/abs/1902.00340.
  • Li et al. [2014] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598, 2014.
  • Li et al. [2018] Y. Li, M. Yu, S. Li, S. Avestimehr, N. S. Kim, and A. Schwing. Pipe-sgd: A decentralized pipelined sgd framework for distributed deep net training. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8056–8067. Curran Associates, Inc., 2018.
  • Li and Yan [2017] Z. Li and M. Yan. A primal-dual algorithm with optimal stepsizes and its application in decentralized consensus optimization. arXiv preprint arXiv:1711.06785, 2017.
  • Lian et al. [2017] X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
  • Lian et al. [2018] X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning, 2018.
  • Nedić and Olshevsky [2015] A. Nedić and A. Olshevsky. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.
  • Recht et al. [2011] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
  • Renggli et al. [2018] C. Renggli, D. Alistarh, and T. Hoefler. Sparcml: High-performance sparse communication for machine learning. arXiv preprint arXiv:1802.08021, 2018.
  • Seide and Agarwal [2016] F. Seide and A. Agarwal. Cntk: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2135–2135. ACM, 2016.
  • Seide et al. [2014] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech dnns. In Interspeech 2014, September 2014.
  • Shen et al. [2018] Z. Shen, A. Mokhtari, T. Zhou, P. Zhao, and H. Qian. Towards more efficient stochastic decentralized learning: Faster convergence and sparse communication. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4624–4633, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • Stich et al. [2018] S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified sgd with memory. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4452–4463. Curran Associates, Inc., 2018.
  • Tang et al. [2018] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu. Communication compression for decentralized training. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7663–7673. Curran Associates, Inc., 2018.
  • Tang et al. [2019] H. Tang, X. Lian, T. Zhang, and J. Liu. Doublesqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In Thirty-sixth International Conference on Machine Learning, 2019.
  • Wang et al. [2018] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright. Atomo: Communication-efficient learning via atomic sparsification. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9872–9883. Curran Associates, Inc., 2018.
  • Wangni et al. [2018] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1306–1316. Curran Associates, Inc., 2018.
  • Wen et al. [2017] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1509–1519. Curran Associates, Inc., 2017.
  • Wu et al. [2018] J. Wu, W. Huang, J. Huang, and T. Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5325–5333, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
  • Wu et al. [2017] T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed. Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks, PP:1–1, 04 2017. doi: 10.1109/TSIPN.2017.2695121.
  • Zhang et al. [2017] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4035–4043, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Preliminaries

Below we use some basic properties of matrices and vectors.

  • For any and , if , then we have

  • For any two vectors and , we have

  • For any two matrices and , we have

Before presenting our proof, we first rewrite our updating rule as follows

(2)

where

Proof of Lemma 1

Lemma 1 is the key lemma for proving our Theorem 1.

Lemma 1.

For any algorithm that admits the updating rule (2), if

then we have

where

and we use the corresponding shorthand notation for brevity.

We outline our proof of Lemma 1 as follows.

The most challenging part of analyzing a decentralized algorithm, unlike a centralized one, is to ensure that the local model on each node converges to the average value. This is because

(3)

So we first prove that

and that the compression error can be upper bounded by

With these two important observations, we finally prove that

Plugging the equation above into (3) directly proves Theorem 1.

Now we introduce the detailed proof for Lemma 1.

Proof of Lemma 1

Proof.

The updating rule of DeepSqueeze can be written as

The corresponding updating rule can then be derived using the equation above:

So we have

Using

the equation above becomes

(4)

The difference between and can be upper bounded by

So (4) becomes

Further using Lemma 7 to give an upper bound, we have

which can be rewritten as

Summing up the inequality above over $t$, we get

This concludes the proof. ∎

Below are some critical lemmas for the proof of Lemma 1.

Lemma 2.

Given two non-negative sequences that satisfy