1 Introduction
In typical modern machine learning tasks, we often encounter large-scale optimization problems that require enormous computational time to solve. Hence, reducing the computational time of optimization processes is practically important and is a main interest in the optimization community.
To tackle large-scale problems, a gold-standard approach is the Stochastic Gradient Descent (SGD) method Robbins and Monro (1951). In each iteration, SGD updates the current solution using a stochastic gradient, that is, the average of the gradients of the loss functions corresponding to a random subset of the dataset (a minibatch) rather than the whole dataset. This stochastic minibatch approach allows SGD to be faster than deterministic full-batch methods in terms of computational time Dekel et al. (2012); Li et al. (2014). Furthermore, the Stochastic Nesterov's Accelerated Gradient (SNAG) method and its variants have been proposed Hu et al. (2009); Chen et al. (2012); Ghadimi and Lan (2016), which combine SGD with Nesterov's acceleration Nesterov (2013b, a); Tseng (2008). Minibatch SNAG theoretically outperforms vanilla minibatch SGD for moderate optimization accuracy, though its asymptotic convergence rate matches that of SGD. For further scalability, distributed optimization has received much research attention Bekkerman et al. (2011); Duchi et al. (2011); Jaggi et al. (2014); Gemulla et al. (2011); Dean et al. (2012); Ho et al. (2013); Arjevani and Shamir (2015); Chen et al. (2016); Goyal et al. (2017).
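The minibatch SGD update described above can be sketched as follows; this is a minimal illustration under our own naming (`sgd_step`, `grad_fn` are not from the paper), not the paper's algorithm:

```python
import numpy as np

def sgd_step(w, grad_fn, data, batch_size, lr, rng):
    """One minibatch SGD step: the stochastic gradient is the average of
    per-example gradients over a random subset (minibatch) of the data."""
    idx = rng.choice(len(data), size=batch_size, replace=False)
    g = np.mean([grad_fn(w, data[i]) for i in idx], axis=0)
    return w - lr * g
```

With `batch_size` equal to the dataset size this reduces to deterministic full-batch gradient descent; smaller minibatches trade gradient accuracy for cheaper iterations.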
Distributed optimization methods are mainly classified by their communication type as synchronous centralized Zinkevich et al. (2010); Dekel et al. (2012); Shamir and Srebro (2014), asynchronous centralized Recht et al. (2011); Agarwal and Duchi (2011); Lian et al. (2015); Liu et al. (2015); Zheng et al. (2017), synchronous decentralized Nedic and Ozdaglar; Yuan et al. (2016); Lian et al. (2017a); Lan et al.; Uribe et al. (2017); Scaman et al. (2018), and asynchronous decentralized Lian et al. (2017b); Lan and Zhou (2018) methods. In this paper, we particularly focus on data-parallel stochastic gradient methods for synchronous centralized distributed optimization of a smooth objective function given as the average of local objectives, where each local objective corresponds to the data partition of the whole dataset assigned to one node (or processor). In this setting, each processor first computes a stochastic gradient of its local objective, and then the nodes send these gradients to each other. Finally, the current solution is updated on each processor using the averaged gradient. Here we assume node-to-node broadcasts, but it is also possible to utilize an intermediate parameter server.

A main concern in synchronous distributed optimization is communication cost, because it can easily become a bottleneck in optimization processes. Theoretically, naive parallel minibatch SGD achieves linear speed-up with respect to the number of processors Dekel et al. (2012); Li et al. (2014), but empirically it does not due to this cost Shamir and Srebro (2014); Chen et al. (2016). To leverage the power of parallel computing, it is essential to reduce the communication cost.
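The synchronous data-parallel scheme just described (local gradients on shards, gradient averaging, identical update on every node) can be simulated in a few lines; this is an illustrative single-process sketch with our own names, not a real communication implementation:

```python
import numpy as np

def local_grad(w, shard):
    # full local gradient of f_p(w) = mean over the shard of (w - x)^2 / 2
    return w - shard.mean()

def parallel_sgd_step(w, shards, lr):
    grads = [local_grad(w, s) for s in shards]  # computed on each node
    g = np.mean(grads, axis=0)                  # simulated all-to-all averaging
    return w - lr * g                           # identical update on every node
```

In a real system the `np.mean` over per-node gradients is an all-reduce (or a parameter-server round trip), which is exactly the communication step whose cost the paper aims to reduce.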
One fascinating technique for reducing communication cost in distributed optimization is compression of the communicated gradients Aji and Heafield (2017); Lin et al. (2017); Wangni et al. (2018); Alistarh et al. (2018); Stich et al. (2018); Shi et al. (2019); Karimireddy et al. (2019); Seide et al. (2014); Wen et al. (2017); Alistarh et al. (2017); Wu et al. (2018). Sparsification is an approach in which each local node sparsifies its gradient before communication Aji and Heafield (2017); Lin et al. (2017); Wangni et al. (2018); Alistarh et al. (2018); Stich et al. (2018); Shi et al. (2019); Karimireddy et al. (2019). The top-k algorithm, which keeps only the k components of the gradient with largest absolute value and drops the rest, has typically been used for sparsification. Another example of compression is quantization, a technique that limits the number of bits used to represent the communicated gradients. Several works have demonstrated that parallel SGD with quantized gradients has good practical performance Seide et al. (2014); Wen et al. (2017); Alistarh et al. (2017); Wu et al. (2018). In particular, Alistarh et al. (2017) have proposed Quantized SGD (QSGD), the first quantization algorithm with a theoretical convergence rate. QSGD is based on unbiased quantization of the communicated gradient.
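The two compression styles just described can be sketched as follows. These are illustrative implementations only: `top_k` matches the standard top-k operator, while `sign_quantize` is a crude 1-bit-style stand-in, not the QSGD scheme of Alistarh et al. (2017):

```python
import numpy as np

def top_k(g, k):
    """Keep only the k largest-magnitude coordinates of g; zero the rest."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def sign_quantize(g):
    """Crude 1-bit-style quantization: transmit only coordinate signs plus
    a single scale (the mean absolute value). Illustrative, not QSGD."""
    scale = np.mean(np.abs(g))
    return scale * np.sign(g)
```

Either operator shrinks the message (k indices and values, or one float plus a bit vector) at the cost of a compression error, which motivates the error feedback scheme below.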
However, there theoretically exists an essential tradeoff between communication cost and convergence speed when we use naive gradient compression schemes. Specifically, naive compression (including sparsification and quantization) causes large variances, and the resulting methods are theoretically always slower than vanilla SGD, though they surely reduce the communication cost Stich et al. (2018); Alistarh et al. (2017).

The error feedback scheme partially resolves this tradeoff. Some works have considered using compressed gradients together with the locally accumulated compression errors in each node, and the effectiveness of this scheme has been validated empirically Aji and Heafield (2017); Lin et al. (2017); Wu et al. (2018). Very recently, several works have attempted to analyse and justify the effectiveness of error feedback from a theoretical viewpoint Alistarh et al. (2018); Stich et al. (2018); Karimireddy et al. (2019). Surprisingly, it has been shown that sparsified SGD with error feedback asymptotically achieves the same rate as non-sparsified SGD.
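Error feedback keeps the compression residual in a local buffer and adds it back before the next compression, so no gradient mass is permanently lost. A minimal single-node sketch, with our own names (`ef_step`, `compress` pluggable):

```python
import numpy as np

def ef_step(w, g, e, lr, compress):
    """One error-feedback step: add the accumulated error to the fresh
    gradient, transmit only the compressed vector, and keep what was
    dropped for future iterations."""
    p = g + e          # error-corrected gradient
    c = compress(p)    # only c is communicated
    return w - lr * c, p - c
```

With the identity compressor this is exactly SGD; with an aggressive compressor such as top-1, coordinates that are repeatedly dropped accumulate in the buffer until they are large enough to be transmitted.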
Nevertheless, from a theoretical point of view, the analysis in previous work is still unsatisfactory: either no analysis has been given for distributed settings Karimireddy et al. (2019); Stich et al. (2018), or the analysis focuses only on top-k sparsification and does not establish the linear speed-up property with respect to the number of nodes Alistarh et al. (2018). Also, previous work has not taken nonasymptotic iteration complexities into consideration. Such consideration is practically important because the additional iteration complexity caused by sparsification typically carries a factor that can be very large, particularly in high compression settings.
There exist two open questions.

Does sparsified SGD with error feedback asymptotically achieve the same rate as non-sparsified parallel SGD in distributed optimization settings?

Are there any better algorithms than sparsified SGD with error feedback in terms of nonasymptotic iteration complexity?
We answer these questions affirmatively in this work.
Main contribution
We propose and analyse the Sparsified Stochastic Nesterov's Accelerated Gradient method with Error Feedback (SSNAGEF), based on the combination of (i) unbiased compression of the stochastic gradients; (ii) the error feedback scheme; and (iii) Nesterov's acceleration technique. The main features of our method are as follows:

(Linear speed-up w.r.t. #nodes) Our method possesses the linear speed-up property with respect to the number of processors in distributed optimization settings, in the sense that it asymptotically achieves the same rate as non-sparsified parallel SGD. To the best of our knowledge, this property has not been shown for any previous method, in particular top-k sparsified SGD and its variants.

(Low iteration complexity for moderate accuracy) It is shown that our proposed method achieves strictly better iteration complexity than SSGDEF for a wide range of desired optimization accuracies, which is practically meaningful in high compression settings, though its asymptotic iteration complexity matches that of SSGDEF.
We also analyse non-accelerated sparsified SGD with error feedback (SSGDEF) in parallel computing settings and show that SSGDEF possesses the former of the two properties above.
The comparison of our method with the most relevant previous methods is summarized in Table 1. From Table 1, we can make the following observations:
Method  |  general convex  |  strongly convex  |  general nonconvex  |  Parallel?
SGD  |    |    |    |  Yes
SNAG  |    |    |  No Analysis  |  Yes
SSGD  |    |    |    |  Yes
SSNAG  |    |    |  No Analysis  |  Yes
MEMSGD Stich et al. (2018)  |    |    |  No Analysis  |  No
EFSGD Karimireddy et al. (2019)  |    |  No Analysis  |    |  No
SSGDEF  |    |    |    |  Yes
SSNAGEF  |    |    |    |  Yes

SSGD vs. SGD: The iteration complexities of SSGD are always worse than those of SGD by a multiplicative factor, because of the correspondingly larger variances of the randomly compressed stochastic gradients.

SSGDEF vs. SSGD: SSGDEF has a better dependence on the desired accuracy than SSGD in the sparsification error terms. Asymptotically, the iteration complexities of SSGDEF improve on those of SSGD by a multiplicative factor.

SSGDEF vs. MEMSGD: On a single node, the rates of the two methods coincide for convex cases. However, SSGDEF is applicable to parallel settings and achieves linear speed-up in the number of processors with respect to the asymptotically dominant term.

SSGDEF vs. EFSGD: For general nonconvex cases, the rate of SSGDEF is always better than that of EFSGD. Note that for general convex cases, EFSGD is applicable to nonsmooth objectives, and the rates cannot be directly compared.

SSNAGEF vs. SSGDEF: For general convex cases, the rate of SSNAGEF is strictly better than that of SSGDEF over a certain range of desired accuracies, though the two rates are asymptotically the same. For general nonconvex cases, the rate of SSNAGEF is likewise strictly better than that of SSGDEF over a certain range. For high compression settings, these ranges are wide and meaningful.
To compare the theoretical iteration complexities of SSGDEF and SSNAGEF more closely, we illustrate them in Figure 1.
2 Notation and Assumptions
We use the following notation in this paper.

‖·‖ denotes the Euclidean norm: ‖x‖ = (∑ᵢ xᵢ²)^{1/2}.

For a natural number n, [n] denotes the set {1, …, n}.

We define q_c as the quadratic function with center c, i.e., q_c(x) = (1/2)‖x − c‖².

A sparsification operator comp_k is defined by [comp_k(x)]_i = x_i for i in Q and [comp_k(x)]_i = 0 otherwise, where Q is a uniformly random size-k subset of [d].
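As a concrete sketch of a random-k operator of this kind, here rescaled by d/k so that it is unbiased (the d/k rescaling is our assumption for illustration; the paper may instead apply this scaling as part of its gradient estimator):

```python
import numpy as np

def rand_k(x, k, rng):
    """Random-k sparsification: keep a uniformly random size-k coordinate
    subset Q and zero the rest. The d/k rescaling makes the operator
    unbiased: E[rand_k(x)] = x."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out
```

Unlike top-k, this operator is unbiased, which is the property the analysis in this paper relies on; the price is a variance that grows as the kept fraction k/d shrinks.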
The following are the theoretical assumptions for our analysis. They are standard in the optimization literature. We always assume the first three.
Assumption 1.
The objective f has a minimizer x*.
Assumption 2.
f is L-smooth (L > 0), i.e., ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y.
Assumption 3.
The stochastic gradient g(x) has bounded variance, i.e., E‖g(x) − ∇f(x)‖² ≤ σ² for all x.
Assumption 4.
f is μ-strongly convex (μ > 0), i.e., f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (μ/2)‖y − x‖² for all x, y.
3 Algorithm Descriptions
In this section, we describe our proposed algorithms in detail.
3.1 Sparsified Stochastic Gradient Descent with Error Feedback
The algorithm of Sparsified SGD with Error Feedback (SSGDEF) for convex and nonconvex objectives is provided in Algorithm 1. In lines 3–7, roughly speaking, we construct a gradient estimator using the error feedback scheme, compress it into a sparse vector, and update a cumulative compression error, all in parallel. More specifically, each node first computes an i.i.d. stochastic gradient with respect to its corresponding data partition. Second, the cumulative compression error is added to the stochastic gradient (we call this process error feedback), and then we construct an unbiasedly sparsified gradient estimator by randomly picking nonzero coordinates of the error-corrected stochastic gradient. Finally, the cumulative compression error is updated for the subsequent iterations. In line 8, each node broadcasts its compressed gradient estimator to, and receives those of, the other nodes. In lines 9–10, each node updates the solution using the average of the received compressed gradients. Note that every node holds the same updated solution in each iteration.

Remark (Difference from previous algorithms).
Algorithm 1 can be regarded as an extension of MemSGD Stich et al. (2018) or EFSGD Karimireddy et al. (2019) to parallel computing settings, though these two methods mainly utilize top-k compression for gradient sparsification. We instead use unbiased random compression. This difference is essential for our analysis.
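A minimal simulation of one SSGDEF iteration over a few simulated nodes, with a pluggable compressor; this is a sketch of the structure described above under our own naming, not the paper's pseudocode:

```python
import numpy as np

def ssgd_ef_step(w, stoch_grads, errors, lr, compress):
    """One SSGDEF-style iteration: each node error-corrects its stochastic
    gradient, compresses it, and updates its local error buffer; the
    compressed vectors are averaged and applied identically everywhere."""
    msgs = []
    for p, g in enumerate(stoch_grads):
        corrected = g + errors[p]     # error feedback
        c = compress(corrected)       # sparsified message
        errors[p] = corrected - c     # cumulative compression error
        msgs.append(c)                # broadcast to all nodes
    w_new = w - lr * np.mean(msgs, axis=0)  # identical update on every node
    return w_new, errors
```

Only the compressed messages cross the network; each node's error buffer stays local, which is why the scheme adds no communication overhead.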
3.2 Sparsified Stochastic Nesterov's Accelerated Gradient Descent with Error Feedback
The procedure of SSNAGEF for convex objectives is provided in Algorithm 2. In line 5, we compress two different gradient estimators by randomly picking coordinates for each. Also, in line 6, we update three cumulative compression errors. Why are different compressed estimators and cumulative errors necessary for appropriate updates? In a typical acceleration algorithm, we construct two different solution paths and their aggregation, as in line 10. Aggregating the "conservative" solution (updated with a small learning rate) and the "aggressive" solution (updated with a large learning rate) is the essence of Nesterov's acceleration. On the other hand, from a theoretical point of view, the impact of error feedback on the vanilla stochastic gradient should be scaled by the inverse of the learning rate, as in line 5. Therefore, using two different learning rates makes it necessary to construct two compressed gradient estimators, and hence three compression errors.

Generally, SSNAGEF has no theoretical guarantee for nonconvex objectives. However, utilizing a regularization technique, the convergence of RegSNAGEF (Algorithm 3) to a stationary point is guaranteed. Specifically, Algorithm 3 repeatedly minimizes a "regularized" objective centered at the current solution by using SSNAGEF.
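To illustrate why two learning rates force separately compressed estimators with their own error buffers, here is a purely schematic step. The aggregation weight and learning rates are illustrative placeholders, not the schedules of Algorithm 2, and Algorithm 2's third error buffer is omitted for brevity:

```python
import numpy as np

def ssnag_ef_step(w, z, e_cons, e_aggr, g, lr_cons, lr_aggr, compress):
    """Schematic accelerated step with error feedback (not Algorithm 2
    verbatim): each learning rate gets its own compressed estimator and
    error buffer, since error feedback is scaled relative to its rate."""
    c_cons = compress(g + e_cons)   # estimator for the conservative path
    e_cons = g + e_cons - c_cons
    c_aggr = compress(g + e_aggr)   # estimator for the aggressive path
    e_aggr = g + e_aggr - c_aggr
    y = w - lr_cons * c_cons        # "conservative" solution (small rate)
    z = z - lr_aggr * c_aggr        # "aggressive" solution (large rate)
    w = 0.5 * (y + z)               # aggregation (illustrative weight)
    return w, z, e_cons, e_aggr
```

The point of the sketch is structural: a single shared error buffer cannot be consistent with both updates once the two paths use different learning rates.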
Remark (Parameter tuning).
It may seem that Algorithm 2 has many tuning parameters, but this is not the case. Specifically, as Theorem 4.8 in Section 4 indicates, the parameters that actually require tuning are only the constant learning rate and the strong convexity parameter; the other parameters are theoretically determined. This means that the only additional tuning parameter compared to SSGDEF is essentially the strong convexity parameter. In practice, simply fixing this parameter works well.
4 Convergence Analysis
In this section, we provide convergence analyses of SSGDEF and SSNAGEF. For convex cases, we assume strong convexity of the objective. For non-strongly convex cases, the convergence rates can be immediately derived from the strongly convex ones via the standard dummy-regularizer approach, so we omit them here.
Let ē_t denote the mean of the cumulative compression errors of all the nodes at the t-th iteration. We use Õ notation to hide additional logarithmic factors for simplicity.
4.1 Analysis of SSGDEF
In this subsection, we provide the analysis of SSGDEF. The proofs of the statements are found in Section A of supplementary material.
The following proposition holds for strongly convex objective .
Proposition 4.1 (Strongly convex).
The first term is the deterministic term and the second term is the stochastic error term. The last term is the compression error term and we can further bound it by the following proposition.
Proposition 4.2.
Suppose that Assumption 3 holds. Let the learning rate be sufficiently small. Then SSGDEF satisfies
Remark.
Importantly, the expected accumulated compression error scales linearly with respect to the number of nodes.
Theorem 4.3 (Strongly convex).
Remark.
Theorem 4.3 implies that SSGDEF asymptotically achieves the iteration complexity of non-sparsified parallel SGD, because the last compression error term has a milder dependence on the desired accuracy than the stochastic term. Also note that the last term scales favorably with the number of nodes, which is a desirable property for distributed optimization. However, the last term carries a factor that may be large and can dominate the other terms for moderate accuracies. Thus, consideration of nonasymptotic behavior is also important, particularly for high compression settings.
For nonconvex objectives, we can derive the following proposition.
Proposition 4.4 (General nonconvex).
Suppose that Assumptions 1, 2 and 3 hold, together with a mild condition on the learning rate. Then SSGDEF satisfies the bound stated below with high probability.

Theorem 4.5 (General nonconvex).
Similarly to the convex cases, SSGDEF asymptotically achieves the same rate as non-sparsified SGD.
4.2 Analysis of SSNAGEF
Here, theoretical analysis of our proposed SSNAGEF is provided. For the proofs of the statements, see supplementary material (Section B).
The following proposition holds for strongly convex objective .
Proposition 4.6 (Strongly convex).
Remark.
Thanks to the acceleration scheme, the first deterministic error term decays at a faster rate than that of SSGDEF, at the expense of a larger stochastic error term (the second term).
The third and last terms are bounded by the following proposition.
Proposition 4.7.
Suppose that Assumption 3 holds. Let the learning rates be sufficiently small and monotonically nonincreasing. Then SSNAGEF satisfies
Theorem 4.8 (Strongly convex).
Remark.
The terms after the third one have a better dependence on the desired accuracy than the second (stochastic error) term. Hence, for very small target accuracies, we can ignore the compression error terms and the rate asymptotically matches that of vanilla SGD. Additionally, the compression error terms have better dependences on the desired accuracy than those of SSGDEF.
Remark.
Compared with the rate of SSGDEF, we can easily see that the rate of SSNAGEF is strictly better than that of SSGDEF over a certain range of target accuracies, under a mild additional assumption.
We can derive a convergence rate of RegSSNAGEF for general nonconvex objectives by iteratively applying Theorem 4.8 to the pseudo-regularized objective.
Theorem 4.9 (General nonconvex).
Remark.
Comparing with Theorem 4.5, we can see that even in nonconvex cases acceleration can be beneficial. Indeed, the compression error terms (the third and fourth terms) have a better dependence on the desired accuracy than those of SSGDEF.
5 Related Work
In this section, we briefly describe the papers most relevant to this work. Stich et al. (2018) first provided a theoretical analysis of sparsified SGD with error feedback (called MEMSGD) and showed that MEMSGD asymptotically achieves the rate of non-sparsified SGD. However, their analysis is limited to convex cases in serial computing settings, i.e., a single node. Independently, Alistarh et al. (2018) also theoretically considered sparsified SGD with error feedback in parallel settings for convex and nonconvex objectives. However, their analysis is still unsatisfactory for several reasons. First, it relies on an artificial analytic assumption due to the usage of the top-k algorithm for gradient compression, though they experimentally tried to validate this assumption. Second, it is unclear from their results whether the algorithm asymptotically possesses the linear speed-up property with respect to the number of nodes. Recently, Karimireddy et al. (2019) also analysed a variant of sparsified SGD with error feedback (called EFSGD) for convex and nonconvex cases in serial computing settings. The derived rate for nonconvex cases is worse than our result for SSGDEF (under a mild condition). Differently from ours, their analysis allows nonsmoothness of the objective for convex cases, though the convergence rate is then always worse than that of vanilla SGD and the algorithm does not possess the asymptotic optimality.
6 Conclusion and Future Work
In this paper, we mainly considered an accelerated sparsified SGD with error feedback in parallel computing settings. We gave theoretical analyses for convex and nonconvex objectives and showed that our proposed algorithm achieves (i) asymptotic linear speed-up with respect to the number of nodes, and (ii) lower iteration complexity for moderate accuracies than the non-accelerated algorithm, thanks to Nesterov's acceleration.
One interesting question is whether our theoretical results are tight. Deriving lower bounds on the iteration complexity of sparsification (or, more generally, compression) methods in distributed settings with limited communication is quite important. Another interesting direction for future work is to extend our results to proximal settings, which allow nonsmooth regularizers, since the usage of nonsmooth regularizers in machine learning tasks is very popular for both convex and nonconvex problems. Constructing proximal versions of our algorithms and analysing them are nontrivial and definitely meaningful. We conjecture that the asymptotic optimality is still guaranteed in this setting.
7 Acknowledgement
TS was partially supported by MEXT Kakenhi (15H05707, 18K19793 and 18H03201), Japan Digital Design, and JST CREST.
References
 Agarwal and Duchi (2011) A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
 Aji and Heafield (2017) A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
 Alistarh et al. (2017) D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
 Alistarh et al. (2018) D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5973–5983, 2018.
 Arjevani and Shamir (2015) Y. Arjevani and O. Shamir. Communication complexity of distributed convex learning and optimization. In Advances in neural information processing systems, pages 1756–1764, 2015.
 Bekkerman et al. (2011) R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge Univ Pr, 2011.
 Chen et al. (2016) J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous sgd. arXiv preprint arXiv:1604.00981, 2016.
 Chen et al. (2012) X. Chen, Q. Lin, and J. Pena. Optimal regularized dual averaging methods for stochastic optimization. In Advances in Neural Information Processing Systems, pages 395–403, 2012.
 Dean et al. (2012) J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
 Dekel et al. (2012) O. Dekel, R. GiladBachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using minibatches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
 Duchi et al. (2011) J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic control, 57(3):592–606, 2011.
 Gemulla et al. (2011) R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Largescale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 69–77. ACM, 2011.
 Ghadimi and Lan (2016) S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(12):59–99, 2016.
 Goyal et al. (2017) P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 Ho et al. (2013) Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems, pages 1223–1231, 2013.
 Hu et al. (2009) C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pages 781–789, 2009.
 Jaggi et al. (2014) M. Jaggi, V. Smith, M. Takác, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan. Communicationefficient distributed dual coordinate ascent. In Advances in neural information processing systems, pages 3068–3076, 2014.
 Karimireddy et al. (2019) S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi. Error feedback fixes signsgd and other gradient compression schemes. arXiv preprint arXiv:1901.09847, 2019.
 Lan and Zhou (2018) G. Lan and Y. Zhou. Asynchronous decentralized accelerated stochastic gradient descent. arXiv preprint arXiv:1809.09258, 2018.
 Lan et al. G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, pages 1–48.
 Li et al. (2014) M. Li, T. Zhang, Y. Chen, and A. J. Smola. Efficient minibatch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670. ACM, 2014.
 Lian et al. (2015) X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
 Lian et al. (2017a) X. Lian, C. Zhang, H. Zhang, C.J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017a.
 Lian et al. (2017b) X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017b.
 Lin et al. (2017) Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.
 Liu et al. (2015) J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. The Journal of Machine Learning Research, 16(1):285–322, 2015.
 Nedic and Ozdaglar A. Nedic and A. Ozdaglar. Distributed subgradient methods for multiagent optimization.
 Nesterov (2013a) Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013a.
 Nesterov (2013b) Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013b.
 Recht et al. (2011) B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lockfree approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
 Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
 Scaman et al. (2018) K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee. Optimal algorithms for nonsmooth distributed optimization in networks. In Advances in Neural Information Processing Systems, pages 2740–2749, 2018.
 Seide et al. (2014) F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1bit stochastic gradient descent and its application to dataparallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 Shamir and Srebro (2014) O. Shamir and N. Srebro. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 850–857. IEEE, 2014.
 Shi et al. (2019) S. Shi, Q. Wang, K. Zhao, Z. Tang, Y. Wang, X. Huang, and X. Chu. A distributed synchronous sgd algorithm with global top sparsification for low bandwidth networks. arXiv preprint arXiv:1901.04359, 2019.
 Stich et al. (2018) S. U. Stich, J.B. Cordonnier, and M. Jaggi. Sparsified sgd with memory. In Advances in Neural Information Processing Systems, pages 4447–4458, 2018.
 Tseng (2008) P. Tseng. On accelerated proximal gradient methods for convexconcave optimization. submitted to SIAM Journal on Optimization, 2:3, 2008.
 Uribe et al. (2017) C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedić. Optimal algorithms for distributed optimization. arXiv preprint arXiv:1712.00232, 2017.
 Wangni et al. (2018) J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communicationefficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.

 Wen et al. (2017) W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017.
 Wu et al. (2018) J. Wu, W. Huang, J. Huang, and T. Zhang. Error compensated quantized SGD and its applications to large-scale distributed optimization. arXiv preprint arXiv:1806.08054, 2018.
 Yuan et al. (2016) K. Yuan, Q. Ling, and W. Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
 Zheng et al. (2017) S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z.M. Ma, and T.Y. Liu. Asynchronous stochastic gradient descent with delay compensation. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 4120–4129. JMLR. org, 2017.
 Zinkevich et al. (2010) M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.
Appendix A Analysis of SSGDEF
A.1 Analysis of
Lemma A.1.
where the expectations are taken with respect to , which are the random choices of the coordinates for constructing conditioned on .
Proof.
First note that and . Since , where and , we have
Here the expectations are taken with respect to , which are the random choices of the coordinates for constructing conditioned on . Since each
is an independent unbiased estimator of
for , we have
The last equality follows from the independence of . ∎
Now we need to bound the variance term .
Lemma A.2.
For ,