1 Introduction
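We consider the following decentralized optimization problem:
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{\xi \sim \mathcal{D}_i} F(x; \xi), \tag{1}$$
where $n$ is the number of nodes and $\mathcal{D}_i$ is the local data distribution for node $i$. The $n$ nodes form a connected graph, and each node can only communicate with its neighbors.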
Communication is a key bottleneck in distributed training (Seide and Agarwal, 2016; Abadi et al., 2016). A popular strategy for reducing communication cost is to compress the gradients computed on the local workers before sending them to the parameter server or a central node. To ensure convergence, the compression usually needs to be unbiased and the compression ratio must be chosen cautiously (it cannot be too aggressive), since the bias or noise introduced by compression may significantly degrade convergence. Recently, an error-compensated technique has been designed to make aggressive compression possible. The key idea is to store the compression error from the previous step and send the compressed sum of the gradient and that remaining error (Tang et al., 2019):
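$v_t = C_\omega[g_t + \delta_{t-1}]$ (compressed gradient on local worker(s))
$\delta_t = g_t + \delta_{t-1} - v_t$ (remaining error on the local worker(s))
$x_{t+1} = x_t - \gamma v_t$ (update on the parameter server)
where $g_t$ is the computed stochastic gradient, $C_\omega[\cdot]$ is the compression operator, and $\gamma$ is the learning rate. This idea has been proven to be very effective and may significantly outperform traditional (non-error-compensated) compressed centralized training algorithms such as Zhang et al. (2017); Jiang and Agrawal (2018).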
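The error-compensation idea described above can be illustrated with a minimal single-worker sketch. The top-k compressor, the toy quadratic objective, and all function names here are assumptions chosen for illustration; they are not taken from the cited papers.

```python
import numpy as np


def top_k(v, k):
    """Biased compressor: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out


def error_compensated_sgd(grad_fn, x0, lr, steps, k):
    """Single-worker error-compensated SGD; returns the iterate and bookkeeping sums."""
    x = np.array(x0, dtype=float)
    delta = np.zeros_like(x)   # compression error carried over from the previous step
    sum_g = np.zeros_like(x)   # sum of all raw gradients (for inspection only)
    sum_v = np.zeros_like(x)   # sum of all transmitted compressed vectors
    for _ in range(steps):
        g = grad_fn(x)
        v = top_k(g + delta, k)  # compress the gradient plus the leftover error
        delta = g + delta - v    # what the compressor dropped in this step
        x = x - lr * v           # the update uses only the compressed vector
        sum_g += g
        sum_v += v
    return x, delta, sum_g, sum_v


# Hypothetical toy objective 0.5 * ||x - target||^2, whose gradient is x - target.
target = np.arange(1.0, 11.0)
x, delta, sum_g, sum_v = error_compensated_sgd(
    lambda x: x - target, np.zeros(10), lr=0.01, steps=500, k=1
)
```

The point of the sketch is the bookkeeping: everything the compressor drops stays in `delta`, so the transmitted vectors plus the current error always add up to the sum of the raw gradients, and no gradient information is permanently lost.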
Decentralized training has been shown to be superior to centralized training in terms of communication cost (Boyd et al., 2006), especially when the model is large and network bandwidth and latency are poor. The key reason is that in the decentralized framework workers only need to communicate with their individual neighbors, while in centralized learning all workers must talk to the central node (or the parameter server). Moreover, the iteration complexity (or the total computational complexity) of decentralized learning has been proven to be comparable to that of its centralized counterpart (Lian et al., 2017).
Given the recent successes of the error-compensated technique in centralized learning, a natural question arises: how can the error-compensated technique be applied to decentralized learning to further reduce the communication cost?
However, key differences exist between centralized and decentralized training, and extending the error-compensated technique to decentralized learning is highly nontrivial. In this paper, we propose an error-compensated decentralized stochastic gradient method named DeepSqueeze that combines these two strategies. Both theoretical analysis and an empirical study are provided to show the advantage of the proposed DeepSqueeze algorithm over existing compression-based decentralized learning algorithms, including Tang et al. (2018); Koloskova et al. (2019).
Our Contribution:

To the best of our knowledge, this is the first work to apply the error-compensated compression technique to decentralized learning.

Our algorithm is compatible with almost all compression strategies, and admits a much higher tolerance of aggressive compression ratios than existing work (for example, Tang et al. (2018)) in both theory and empirical study.
Notations and definitions
Throughout this paper, we use the following notations:

$\nabla f(\cdot)$ denotes the gradient of a function $f$.

$f^*$ denotes the optimal value of the minimization problem (1).

.

$\lambda_i(\cdot)$ denotes the $i$th largest eigenvalue of a symmetric matrix.

$\mathbf{1}$ denotes the all-ones vector.

is an all square matrix.

$\|\cdot\|$ denotes the $\ell_2$ norm for vectors and the spectral norm for matrices.

$\|\cdot\|_F$ denotes the Frobenius norm of matrices.

$C_\omega[\cdot]$ denotes the compressing operator.
The rest of the paper is organized as follows. We first review the related studies. We then discuss our proposed method in detail and provide the key theoretical results. We further validate our method experimentally, and finally we conclude the paper.
2 Related Work
In this section, we review recent work in a few areas that are closely related to the topic of this study: distributed learning, decentralized learning, compression-based learning, and error-compensated compression methods.
2.1 Centralized Parallel Learning
Distributed learning is a widely used acceleration strategy for training deep neural networks with high computational cost. Two main designs have been developed to parallelize the computation: 1) centralized parallel learning (Agarwal and Duchi, 2011; Recht et al., 2011), where every worker is guaranteed to obtain information from all others; 2) decentralized parallel learning (He et al., 2018; Li and Yan, 2017; Lian et al., 2017; Dvurechenskii et al., 2018), where each worker can only gather information from a fraction of its neighbors. Centralized parallel learning requires a supporting network architecture to aggregate all local models or gradients in each iteration. Various implementations have been developed for information aggregation in centralized systems, for example the parameter server (Li et al., 2014; Abadi et al., 2016), AllReduce (Seide and Agarwal, 2016; Renggli et al., 2018), adaptive distributed learning, differentially private distributed learning, distributed proximal primal-dual algorithms, nonsmooth distributed optimization, projection-free distributed online learning, and parallel backpropagation for deep learning.
2.2 Decentralized Parallel Learning
Unlike centralized learning, decentralized learning does not require obtaining information from all workers in each step. The network structure is therefore subject to fewer restrictions than in centralized parallel learning. Because of this flexibility in network design, decentralized parallel learning has been the focus of many recent studies.
In decentralized parallel learning, workers only communicate with their neighbors. There are two main types of decentralized learning algorithms: those whose network topology is fixed (He et al., 2018) and those whose topology is time-varying (Nedić and Olshevsky, 2015; Lian et al., 2018) during training. Wu et al. (2017); Shen et al. (2018) show that decentralized SGD converges at a rate comparable to that of the centralized algorithm while needing less communication, which makes large-scale model training feasible. Li et al. (2018) provide a systematic analysis of the decentralized learning pipeline.
2.3 Compressed Communication Distributed Learning
To further reduce the communication overhead, one promising direction is to compress the variables that are sent between workers (Chen et al., 2018; Wang et al., 2018). Previous works mostly focus on the centralized scenario, where the compression is assumed to be unbiased (Wangni et al., 2018; Shen et al., 2018; Zhang et al., 2017; Wen et al., 2017; Jiang and Agrawal, 2018). A general theoretical analysis of centralized compressed parallel SGD can be found in Alistarh et al. (2017). Beyond this, some biased compression methods have also been proposed and proven to be quite efficient in reducing the communication cost. One example is 1-bit SGD (Seide et al., 2014), which compresses each entry of the gradient vector into $\pm 1$ depending on its sign. The theoretical guarantee of this method is given in Bernstein et al. (2018).
Recently, decentralized compressed learning has emerged as another area of compressed distributed learning. Unlike the centralized setting, decentralized learning requires each worker to share its model parameters instead of its gradients. This difference could potentially invalidate convergence (see Tang et al. (2018) for an example). Many methods have been proposed to solve this problem. One idea is to use the difference of the model updates as the shared information (Tang et al., 2018). Koloskova et al. (2019) further reduce the communication error by sharing the difference of the model difference, but their result only covers strongly convex loss functions, which is not the case for today's deep neural networks. Tang et al. (2018) also propose an extrapolation-like strategy to control the compression level. None of these works employs the error-compensation strategy.
2.4 Error-Compensated SGD
Interestingly, an error-compensation strategy for AllReduce 1-bit SGD is proposed in Seide et al. (2014), which compensates the error in each iteration and yields a much smaller accuracy drop than training without error compensation. The error-compensation strategy has been proved to potentially improve training efficiency for strongly convex loss functions (Stich et al., 2018). Wu et al. (2018) further study an error-compensated SGD for quadratic optimization that adds two hyperparameters to compensate the error, but do not theoretically verify the advantage of using error compensation. Most recently, Tang et al. (2019) give a theoretical analysis showing that, in centralized learning, the error-compensation strategy is compatible with an arbitrary compression technique, and explaining why error compensation admits an improved convergence rate with linear speedup for general nonconvex loss functions. However, their work only studies centralized distributed training. Whether the error-compensation strategy works both theoretically and empirically for decentralized learning remains to be investigated. We try to answer these questions in this paper.
3 Algorithm
We describe the details of the proposed DeepSqueeze algorithm below.
Consider a network of $n$ workers, where each individual worker (e.g., the $i$th worker) can only communicate with its immediate neighbors ($\mathcal{N}(i)$ denotes worker $i$'s immediate neighbors). The connection can be represented by a symmetric doubly stochastic matrix, the weight matrix $W$. We denote by $W_{ij}$ the weight between the $i$th and $j$th workers.
At time $t$, each worker, say worker $i$, needs to maintain three main variables: the local model $x_t^{(i)}$, the compression error $\delta_t^{(i)}$, and the error-compensated variable $v_t^{(i)}$. These variables are updated in the following steps:

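Local Computation: Update the error-compensated variable by $v_t^{(i)} = x_t^{(i)} - \gamma \nabla F(x_t^{(i)}; \xi_t^{(i)}) + \delta_{t-1}^{(i)}$, where $\nabla F(x_t^{(i)}; \xi_t^{(i)})$ is the stochastic gradient with randomly sampled $\xi_t^{(i)}$, and $\gamma$ is the learning rate.

Local Compression: Compress the error-compensated variable $v_t^{(i)}$ into $C_\omega[v_t^{(i)}]$, where $C_\omega[\cdot]$ is the compression operation, and update the compression error by $\delta_t^{(i)} = v_t^{(i)} - C_\omega[v_t^{(i)}]$.

Global Communication: Send the compressed variable $C_\omega[v_t^{(i)}]$ to the neighbors of the $i$th node, namely $\mathcal{N}(i)$.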
Finally, the proposed DeepSqueeze algorithm is summarized in Algorithm 1.
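The three steps above can be sketched for all workers at once by stacking the local variables as rows of a matrix. This is only a sketch under stated assumptions: the form of the error-compensated variable (local model minus the scaled gradient plus the previous error) and the final averaging step, which mixes the received compressed variables through `W - I` scaled by an averaging rate `eta`, are assumptions made here for illustration and are not spelled out in the text above. The function and argument names are hypothetical.

```python
import numpy as np


def deepsqueeze_step(X, delta, grads, W, compress, lr, eta):
    """One synchronous iteration over all workers.

    X, delta, grads: arrays of shape (n_workers, dim), one row per worker.
    W: symmetric doubly stochastic weight matrix of shape (n_workers, n_workers).
    compress: function mapping a vector to its compressed version.
    """
    n = W.shape[0]
    # Local computation: error-compensated variable (assumed form).
    V = X - lr * grads + delta
    # Local compression: compress each worker's variable and record what was lost.
    C = np.stack([compress(v) for v in V])
    new_delta = V - C
    # Global communication and averaging (assumed form): each worker combines
    # the compressed variables of its neighbors, weighted by W - I.
    X_new = X - lr * grads + eta * (W - np.eye(n)) @ C
    return X_new, new_delta
```

Because the columns of `W - I` sum to zero for a doubly stochastic `W`, the mixing term in this sketch does not change the average of the local models, whatever compressor is used; the average simply follows a plain SGD step on the averaged gradient.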
4 Theoretical Analysis
In this section, we present our theoretical analysis of DeepSqueeze. We first derive the updating rule of our algorithm to give an intuitive explanation of how it works, and then we present the final convergence rate.
4.1 Mathematical formulation
In order to get a global view of DeepSqueeze, we define
Then the updating rule of DeepSqueeze follows
which can also be rewritten as
where we denote .
4.2 Why DeepSqueeze is better
Analyzing the closed forms of the updating rules gives a better understanding of why DeepSqueeze is better than the other compression-based decentralized algorithms. For simplicity, we assume the initial values are . Skipping the detailed derivation, we can obtain the closed forms:

D-PSGD (Lian et al., 2017) (without compression):

DCD-PSGD (Tang et al., 2018) (with compression):

CHOCO-SGD (Koloskova et al., 2019) (with compression):

Updating rule of DeepSqueeze (with errorcompensated compression):
Notice that if , that is, there is no compression error, all compression-based methods reduce to D-PSGD. Therefore the efficiency of compression-based methods depends on the magnitude of (for DCD-PSGD) or (for DeepSqueeze and CHOCO-SGD).
Using the fact that is doubly stochastic and symmetric, we have
which leads to
which means our algorithm can significantly reduce the influence of the historical compression error by controlling . In fact, as will become clear from our analysis, has to be small enough if we compress the information very aggressively.
Before presenting the final convergence rate of DeepSqueeze, we first introduce some assumptions that are used in the theoretical analysis.
Assumption 1.
Throughout this paper, we make the following commonly used assumptions:

Lipschitzian gradient: All functions $f_i(\cdot)$ have $L$-Lipschitzian gradients.

Symmetric doubly stochastic matrix: The weight matrix $W$ is a real doubly stochastic matrix that satisfies $W = W^\top$ and $W\mathbf{1} = \mathbf{1}$.

Spectral gap: Given the symmetric doubly stochastic matrix , we assume that .

Start from 0: We assume . This assumption simplifies the proof w.l.o.g.

Bounded Signaltonoise factor: The magnitude of the compression error is assumed to be bounded by the original vector’s magnitude:
Notice that these assumptions are quite standard for analyzing decentralized algorithms in previous works (Lian et al., 2017). Unlike Koloskova et al. (2019), we do not require the gradient to be bounded (), which makes our result applicable to a more general case. It is worth pointing out that many existing compression strategies satisfy the bounded signal-to-noise factor assumption, such as GSpar, random sparsification (Wangni et al., 2018), top-$k$ sparsification and random quantization (Koloskova et al., 2019).
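As a quick numerical illustration of a bounded signal-to-noise factor, the sketch below measures the ratio between the norm of the compression error and the norm of the original vector for top-$k$ sparsification. Reading the assumption as a bound on this ratio is an assumption made here, and the function names are hypothetical. For top-$k$ on a $d$-dimensional vector, the ratio is at most $\sqrt{1 - k/d}$, since the $d - k$ dropped entries are the smallest in magnitude.

```python
import numpy as np


def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out


def compression_noise_ratio(x, compress):
    """Ratio ||compress(x) - x|| / ||x|| for a nonzero vector x."""
    return np.linalg.norm(compress(x) - x) / np.linalg.norm(x)


rng = np.random.default_rng(0)
d, k = 100, 10
ratios = [
    compression_noise_ratio(rng.normal(size=d), lambda v: top_k(v, k))
    for _ in range(20)
]
worst_case = np.sqrt(1.0 - k / d)  # bound for top-k on any d-dimensional vector
```

Every entry of `ratios` stays below `worst_case`, which is itself strictly below one whenever `k >= 1`.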
Now we are ready to show the main theorem of DeepSqueeze.
Theorem 1.
For DeepSqueeze, if the averaging rate and learning rate satisfy
then we have
where
To make the result clearer, we choose the learning rate appropriately in the following:
Corollary 1.
According to Theorem 1, choosing and , we have the following convergence rate for DeepSqueeze
where we have
where is treated as a constant.
This result suggests that:
Linear speedup The asymptotic convergence rate of DeepSqueeze is , which is exactly the same as that of centralized parallel SGD.
Consistency with D-PSGD Setting , DeepSqueeze reduces to the standard decentralized SGD (D-PSGD). Our result indicates that the convergence rate admits , which is slightly better than the previous one in Lian et al. (2017).
Superiority over DCD-PSGD and ECD-PSGD By using the error-compensation strategy, we prove that the dependence on the compression ratio is , and is robust to any compression operator with . In contrast, the previous work (Tang et al., 2018) only ensures a dependence, with the restriction that .
Balance between Communication and Computation: Our result indicates that when , we need to set to be small enough to ensure the convergence of our algorithm. However, too small leads to a slower convergence rate, which means more iterations are needed for training. Our result can serve as guidance for balancing the communication cost and computation cost of decentralized learning in different situations.
5 Experiments
In this section, we further demonstrate the superiority of the DeepSqueeze algorithm through an empirical study. Under aggressive compression ratios, we show the epoch-wise convergence results of DeepSqueeze and compare them with centralized SGD (AllReduce), decentralized SGD without compression (D-PSGD (Lian et al., 2017)), and other existing decentralized compression algorithms (ECD-PSGD, DCD-PSGD (Tang et al., 2018), and CHOCO-SGD (Koloskova et al., 2019)).
5.1 Experimental Setup
Dataset
We benchmark the algorithms on a standard image classification task: CIFAR-10 using ResNet-20. This dataset has a training set of 50,000 images and a test set of 10,000 images, where each image is given one of the 10 labels.
Communication
The communication is implemented based on NVIDIA NCCL for AllReduce and gloo for all other algorithms. The workers are connected in a ring network so that each worker has two neighbors.
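For concreteness, a ring topology in which each worker has two neighbors can be encoded as a symmetric doubly stochastic weight matrix. The sketch below assumes uniform weights of 1/3 over a worker and its two neighbors; the actual weights used in the experiments are not stated, so this choice and the function name are assumptions.

```python
import numpy as np


def ring_mixing_matrix(n):
    """Weight matrix for a ring of n >= 3 workers with uniform 1/3 weights (assumed)."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):  # left neighbor, self, right neighbor
            W[i, j % n] = 1.0 / 3.0
    return W


W = ring_mixing_matrix(8)
eigenvalues = np.linalg.eigvalsh(W)  # W is symmetric, so eigvalsh applies
# Second-largest eigenvalue magnitude; values below 1 indicate a connected graph.
second_largest = np.sort(np.abs(eigenvalues))[-2]
```

For eight workers with these weights, the eigenvalues are $(1 + 2\cos(2\pi k/8))/3$ for $k = 0, \dots, 7$, so `second_largest` equals $(1 + \sqrt{2})/3 \approx 0.80$.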
Compression
We use bit compression for all the compression algorithms, where every element of the tensor sent out or received by a worker is compressed to 4 bits or 2 bits. A number representing the Euclidean norm of the original tensor is also sent together, so that when a worker receives the compressed tensor, it can scale it to get a tensor with the same Euclidean norm as the uncompressed tensor.
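A minimal sketch of this kind of bit compression with norm rescaling is given below. The uniform deterministic quantization grid over the range of the largest absolute entry is an assumption, since the exact quantizer is not specified above; the function names are hypothetical.

```python
import numpy as np


def quantize(x, bits):
    """Map each entry of x to one of 2**bits integer codes; also return ||x||."""
    levels = 2 ** bits
    scale = np.max(np.abs(x))
    if scale == 0:
        return np.zeros(x.shape, dtype=np.uint8), 0.0
    # Map [-scale, scale] linearly onto the integer codes 0 .. levels - 1.
    codes = np.round((x / scale + 1.0) / 2.0 * (levels - 1)).astype(np.uint8)
    return codes, float(np.linalg.norm(x))


def dequantize(codes, bits, norm):
    """Recover a tensor from its codes and rescale it to the transmitted norm."""
    levels = 2 ** bits
    y = codes.astype(np.float64) / (levels - 1) * 2.0 - 1.0  # back to [-1, 1]
    y_norm = np.linalg.norm(y)
    if y_norm == 0:
        return np.zeros_like(y)
    return y * (norm / y_norm)


rng = np.random.default_rng(0)
x = rng.normal(size=1000)
codes, norm = quantize(x, bits=4)
x_hat = dequantize(codes, bits=4, norm=norm)
```

Only the integer codes and a single floating-point norm need to be transmitted. Each code fits in the stated number of bits, although this sketch stores codes in `uint8` and does not pack them.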
Hardware
The experiments are done on 8 2080Ti GPUs, where each GPU is treated as a single worker in the algorithm.
5.2 Convergence Efficiency
We compare the algorithms with 4-bit and 2-bit compression, and the convergence results are shown in Figure 1. We use a batch size of 128 and tune the learning rate in a set for each algorithm. The learning rate is decreased by 5x every 60 epochs.
Note that, compared with uncompressed methods, 4-bit compression saves 7/8 of the communication cost, while 2-bit compression reduces the cost by 93.75%. The results show that under 4-bit compression, DeepSqueeze converges similarly to the uncompressed algorithms. DeepSqueeze is slightly better than CHOCO-SGD (Koloskova et al., 2019), and converges much faster than DCD-PSGD and ECD-PSGD. Note that DCD-PSGD does not converge under 4-bit compression (as stated in the original paper (Tang et al., 2018)), and ECD-PSGD does not converge under 2-bit compression. Under 2-bit compression, DeepSqueeze converges more slowly than the uncompressed algorithms, but is still faster than all the other compression algorithms. The reason is that the fewer bits we have, the more we gain from the error compensation in DeepSqueeze, and consequently DeepSqueeze converges faster.
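The quoted savings follow from simple arithmetic if the uncompressed baseline sends 32-bit floating-point values, which is an assumption here. The small extra cost of transmitting the norm scalar is ignored in this sketch.

```python
def communication_saving(bits, full_precision_bits=32):
    """Fraction of communication saved when each element uses `bits` bits."""
    return 1.0 - bits / full_precision_bits


savings = {bits: communication_saving(bits) for bits in (4, 2)}
# 4 bits: 1 - 4/32 = 7/8 = 0.875; 2 bits: 1 - 2/32 = 15/16 = 0.9375
```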
The experimental results further confirm that DeepSqueeze achieves outstanding communication cost reduction (up to 93.75%) without much sacrifice on convergence.
6 Conclusion
This paper applies the error-compensated compression strategy, which has been very successful in centralized training, to decentralized training. Through a carefully designed algorithm, the proposed DeepSqueeze is proven to admit a higher tolerance of the compression ratio than existing compression-based decentralized training methods. Both theoretical analysis and an empirical study validate our algorithm and theorem.
References
 Abadi et al. [2016] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association. ISBN 9781931971331.
 Agarwal and Duchi [2011] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
 Alistarh et al. [2017] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: communication-efficient SGD via gradient quantization and encoding. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 1707–1718, 2017.
 Bernstein et al. [2018] J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar. signsgd with majority vote is communication efficient and byzantine fault tolerant. 10 2018.
 Boyd et al. [2006] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE transactions on information theory, 52(6):2508–2530, 2006.
 Chen et al. [2018] T. Chen, G. Giannakis, T. Sun, and W. Yin. Lag: Lazily aggregated gradient for communicationefficient distributed learning. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5055–5065. Curran Associates, Inc., 2018.
 Dvurechenskii et al. [2018] P. Dvurechenskii, D. Dvinskikh, A. Gasnikov, C. Uribe, and A. Nedich. Decentralize and randomize: Faster algorithm for wasserstein barycenters. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10760–10770. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8274decentralizeandrandomizefasteralgorithmforwassersteinbarycenters.pdf.
 He et al. [2018] L. He, A. Bian, and M. Jaggi. Cola: Decentralized linear learning. In Advances in Neural Information Processing Systems, pages 4541–4551, 2018.
 Jiang and Agrawal [2018] P. Jiang and G. Agrawal. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2530–2541. Curran Associates, Inc., 2018.
 Koloskova et al. [2019] A. Koloskova, S. U. Stich, and M. Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. CoRR, abs/1902.00340, 2019. URL http://arxiv.org/abs/1902.00340.
 Li et al. [2014] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598, 2014.
 Li et al. [2018] Y. Li, M. Yu, S. Li, S. Avestimehr, N. S. Kim, and A. Schwing. Pipesgd: A decentralized pipelined sgd framework for distributed deep net training. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 8056–8067. Curran Associates, Inc., 2018.
 Li and Yan [2017] Z. Li and M. Yan. A primaldual algorithm with optimal stepsizes and its application in decentralized consensus optimization. arXiv preprint arXiv:1711.06785, 2017.
 Lian et al. [2017] X. Lian, C. Zhang, H. Zhang, C.J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.
 Lian et al. [2018] X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. In International Conference on Machine Learning, 2018.
 Nedić and Olshevsky [2015] A. Nedić and A. Olshevsky. Distributed optimization over timevarying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.
 Recht et al. [2011] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lockfree approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
 Renggli et al. [2018] C. Renggli, D. Alistarh, and T. Hoefler. Sparcml: Highperformance sparse communication for machine learning. arXiv preprint arXiv:1802.08021, 2018.
 Seide and Agarwal [2016] F. Seide and A. Agarwal. Cntk: Microsoft’s opensource deeplearning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2135–2135. ACM, 2016.
 Seide et al. [2014] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1bit stochastic gradient descent and application to dataparallel distributed training of speech dnns. In Interspeech 2014, September 2014.
 Shen et al. [2018] Z. Shen, A. Mokhtari, T. Zhou, P. Zhao, and H. Qian. Towards more efficient stochastic decentralized learning: Faster convergence and sparse communication. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4624–4633, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
 Stich et al. [2018] S. U. Stich, J.B. Cordonnier, and M. Jaggi. Sparsified sgd with memory. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4452–4463. Curran Associates, Inc., 2018.
 Tang et al. [2018] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu. Communication compression for decentralized training. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7663–7673. Curran Associates, Inc., 2018.
 Tang et al. [2019] H. Tang, X. Lian, T. Zhang, and J. Liu. Doublesqueeze: Parallel stochastic gradient descent with doublepass errorcompensated compression. In Thirtysixth International Conference on Machine Learning, 2019.
 Wang et al. [2018] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright. Atomo: Communicationefficient learning via atomic sparsification. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 9872–9883. Curran Associates, Inc., 2018.
 Wangni et al. [2018] J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communicationefficient distributed optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1306–1316. Curran Associates, Inc., 2018.
 Wen et al. [2017] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1509–1519. Curran Associates, Inc., 2017.
 Wu et al. [2018] J. Wu, W. Huang, J. Huang, and T. Zhang. Error compensated quantized SGD and its applications to largescale distributed optimization. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5325–5333, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR.
 Wu et al. [2017] T. Wu, K. Yuan, Q. Ling, W. Yin, and A. H. Sayed. Decentralized consensus optimization with asynchrony and delays. IEEE Transactions on Signal and Information Processing over Networks, PP:1–1, 04 2017. doi: 10.1109/TSIPN.2017.2695121.
 Zhang et al. [2017] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang. ZipML: Training linear models with endtoend low precision, and a little bit of deep learning. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 4035–4043, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
Preliminaries
Below we use some basic properties of matrices and vectors.

For any and , if , then we have

For any two vectors and , we have

For any two matrices and , we have
Before presenting our proof, we first rewrite our updating rule as follows
(2) 
where
Proof of Lemma 1
Lemma 1.
We outline our proof of Lemma 1 as follows.
The most challenging part of analyzing a decentralized algorithm, unlike a centralized one, is that we need to ensure that the local model on each node converges to the average value . This is because
(3) 
So we first prove that
and the compressing error can be upper bounded by
With these two important observations, we finally prove that
Substituting the equation above into (3), we can directly prove Theorem 1.
Now we introduce the detailed proof for Lemma 1.
Proof of Lemma 1
Proof.
The updating rule of DeepSqueeze can be written as
The updating rule of can be derived from the equation above
So we have
Using
then the equation above becomes
(4) 
The difference between and can be upper bounded by
So (4) becomes
Using Lemma 7 again to upper bound , we have
which can be rewritten as
Summing up the equation above from to we get
This concludes the proof. ∎
Below are some critical lemmas for the proof of Lemma 1.
Lemma 2.
Given two nonnegative sequences and that satisfy
(5) 
with , we have