In typical modern machine learning tasks, we often encounter large-scale optimization problems that require enormous computational time to solve. Hence, reducing the computational time of optimization processes is of great practical importance and is a main interest of the optimization community.
To reduce the loss, stochastic gradient descent (SGD) updates the current solution in each iteration by using a stochastic gradient, that is, the average of the gradients of the loss functions corresponding to a random subset of the dataset (a mini-batch) rather than the whole dataset. This (stochastic) mini-batch approach allows SGD to be faster than deterministic full-batch methods in terms of computational time Dekel et al. (2012); Li et al. (2014). Furthermore, the Stochastic Nesterov’s Accelerated Gradient (SNAG) method and its variants, which combine SGD with Nesterov’s acceleration Nesterov (2013b, a); Tseng (2008), have been proposed Hu et al. (2009); Chen et al. (2012); Ghadimi and Lan (2016). Mini-batch SNAG theoretically outperforms vanilla mini-batch SGD for moderate optimization accuracy, though its asymptotic convergence rate matches that of SGD.
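For concreteness, a mini-batch SGD step on a hypothetical least-squares problem (our own illustrative example, not from the paper) can be sketched as follows:

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, batch=8, iters=300, seed=0):
    """Minimize the least-squares loss (1/n)*||Xw - y||^2 with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(n, size=batch, replace=False)  # sample a mini-batch
        # stochastic gradient: averaged over the mini-batch only
        g = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w -= lr * g
    return w

# usage: recover w_true = [1, -2] from noiseless synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0])
w = minibatch_sgd(X, y)
```

Each step touches only `batch` of the `n` samples, which is exactly the source of the per-iteration savings over full-batch methods.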
For further scalability, distributed optimization has received much research attention Bekkerman et al. (2011); Duchi et al. (2011); Jaggi et al. (2014); Gemulla et al. (2011); Dean et al. (2012); Ho et al. (2013); Arjevani and Shamir (2015); Chen et al. (2016); Goyal et al. (2017). Distributed optimization methods are mainly classified, by their communication type, into synchronous centralized Zinkevich et al. (2010); Dekel et al. (2012); Shamir and Srebro (2014), asynchronous centralized Recht et al. (2011); Agarwal and Duchi (2011); Lian et al. (2015); Liu et al. (2015); Zheng et al. (2017), synchronous decentralized Nedic and Ozdaglar; Yuan et al. (2016); Lian et al. (2017a); Lan et al.; Uribe et al. (2017); Scaman et al. (2018), and asynchronous decentralized Lian et al. (2017b); Lan and Zhou (2018) methods. In this paper, we focus on data-parallel stochastic gradient methods for synchronous centralized distributed optimization with a smooth objective function $f = \frac{1}{P}\sum_{p=1}^{P} f_p$, where each $f_p$ corresponds to the data partition of the whole dataset assigned to the $p$-th node (or processor). In this setting, each processor first computes a stochastic gradient of its local $f_p$, and then the nodes send the gradients to each other. Finally, the current solution is updated using the averaged gradient on each processor. Here we assume that node-to-node broadcasts are used, but it is also possible to utilize an intermediate parameter server.
A main concern in synchronous distributed optimization is the communication cost, because it can easily become a bottleneck of the optimization process. Theoretically, naive parallel mini-batch SGD achieves linear speed-up with respect to the number of processors Dekel et al. (2012); Li et al. (2014), but not empirically, due to this cost Shamir and Srebro (2014); Chen et al. (2016). To leverage the power of parallel computing, it is essential to reduce the communication cost.
One fascinating technique for reducing the communication cost in distributed optimization is compression of the communicated gradients Aji and Heafield (2017); Lin et al. (2017); Wangni et al. (2018); Alistarh et al. (2018); Stich et al. (2018); Shi et al. (2019); Karimireddy et al. (2019); Seide et al. (2014); Wen et al. (2017); Alistarh et al. (2017); Wu et al. (2018). Sparsification is an approach in which each local node sparsifies the gradient before communication Aji and Heafield (2017); Lin et al. (2017); Wangni et al. (2018); Alistarh et al. (2018); Stich et al. (2018); Shi et al. (2019); Karimireddy et al. (2019). For sparsifying a gradient, the top-$k$ algorithm, which drops all but the $k$ largest components of the gradient in absolute value, has typically been used. Another example of compression is quantization, a technique that limits the number of bits used to represent the communicated gradients. Several works have demonstrated that parallel SGD with quantized gradients has good practical performance Seide et al. (2014); Wen et al. (2017); Alistarh et al. (2017); Wu et al. (2018). In particular, Alistarh et al. (2017) have proposed Quantized SGD (QSGD), the first quantization algorithm with a theoretical convergence rate; QSGD is based on unbiased quantization of the communicated gradient.
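As a minimal illustrative sketch (not the exact implementation from the cited works), top-$k$ sparsification keeps only the $k$ largest-magnitude coordinates and zeroes out the rest:

```python
import numpy as np

def top_k(g, k):
    """Keep the k largest-magnitude entries of g; zero out the rest."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]  # indices of the k largest |g_i|
    out[idx] = g[idx]
    return out

g = np.array([0.1, -3.0, 0.5, 2.0, -0.2])
sparse_g = top_k(g, 2)  # only -3.0 and 2.0 survive
```

Only the $k$ surviving values (and their indices) need to be communicated, which is the source of the bandwidth savings.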
However, there theoretically exists an essential trade-off between communication cost and convergence speed when naive gradient compression schemes are used. Specifically, naive compression (including sparsification and quantization) causes large variance and is theoretically always slower than vanilla SGD, though it surely reduces the communication cost Stich et al. (2018); Alistarh et al. (2017).
The error feedback scheme partially resolves this trade-off. Some works have considered using compressed gradients together with the locally accumulated compression errors in each node, and the effectiveness of this scheme has been validated empirically Aji and Heafield (2017); Lin et al. (2017); Wu et al. (2018). Very recently, several works have analysed and justified the effectiveness of error feedback from a theoretical viewpoint Alistarh et al. (2018); Stich et al. (2018); Karimireddy et al. (2019). Surprisingly, it has been shown that sparsified SGD with error feedback asymptotically achieves the same rate as non-sparsified SGD.
Nevertheless, from a theoretical point of view, the analysis in previous work is still unsatisfactory: either no analysis has been given for distributed settings Karimireddy et al. (2019); Stich et al. (2018), or the analysis focuses only on top-$k$ sparsification and never establishes the linear speed-up property with respect to the number of nodes Alistarh et al. (2018). Also, previous work has not taken non-asymptotic iteration complexities into consideration. This consideration is practically important because the additional iteration complexity caused by sparsification typically carries a factor of $d/k$, which can be very large, particularly in high compression settings.
There exist two open questions.
Does sparsified SGD with error feedback asymptotically achieve the same rate as non-sparsified parallel SGD in distributed optimization settings?
Are there algorithms better than sparsified SGD with error feedback in terms of non-asymptotic iteration complexity?
We answer these questions affirmatively in this work.
We propose and analyse Sparsified Stochastic Nesterov’s Accelerated Gradient method (S-SNAG-EF) based on the combination of (i) unbiased compression of the stochastic gradients; (ii) error feedback scheme; and (iii) Nesterov’s acceleration technique. The main features of our method are as follows:
(Linear speed-up w.r.t. #Nodes) Our method possesses the linear speed-up property with respect to the number of processors in distributed optimization settings, in the sense that it asymptotically achieves the same rate as non-sparsified parallel SGD. To the best of our knowledge, this property has not been shown for any previous method, in particular top-$k$ sparsified SGD and its variants.
(Low iteration complexity for moderate accuracy) We show that our proposed method achieves strictly better iteration complexity than S-SGD-EF for a wide range of desired optimization accuracies, which is practically meaningful for high compression settings, though its asymptotic iteration complexity matches that of S-SGD-EF.
We also analyse non-accelerated sparsified SGD with error feedback (S-SGD-EF) in parallel computing settings and show that S-SGD-EF possesses the first property above.
| Method | general convex | strongly convex | general nonconvex | Parallel? |
| --- | --- | --- | --- | --- |
| MEM-SGD Stich et al. (2018) | | | No Analysis | No |
| EF-SGD Karimireddy et al. (2019) | | No Analysis | | No |
S-SGD vs. SGD: The iteration complexities of S-SGD are always $d/k$ times worse than those of SGD because of the $d/k$ times larger variance of the randomly compressed stochastic gradients.
S-SGD-EF vs. S-SGD: S-SGD-EF has a better dependence on the desired accuracy than S-SGD in the sparsification error terms. Asymptotically, the iteration complexities of S-SGD-EF are $d/k$ times better than those of S-SGD.
S-SGD-EF vs. MEM-SGD: When $P = 1$, the rates of the two methods coincide in the convex case. However, S-SGD-EF is applicable to parallel settings and achieves linear speed-up in the number of processors with respect to the asymptotically dominant term.
S-SGD-EF vs. EF-SGD: For general nonconvex cases with $P = 1$, the rate of S-SGD-EF is always better than that of EF-SGD, because the compression error term of EF-SGD has a worse dependence on the compression ratio. Note that for general convex cases, EF-SGD is applicable to non-smooth objectives and the rates cannot be directly compared.
S-SNAG-EF vs. S-SGD-EF: For general convex cases, the rate of S-SNAG-EF is strictly better than that of S-SGD-EF for a wide range of target accuracies, though the rates of the two methods are asymptotically the same. For general nonconvex cases, the rate of S-SNAG-EF is likewise strictly better than that of S-SGD-EF over a certain accuracy range. For high compression settings (i.e., $k \ll d$), these ranges are wide and meaningful.
To compare the theoretical iteration complexities of S-SGD-EF and S-SNAG-EF more closely, we illustrate them in Figure 1.
2 Notation and Assumptions
We use the following notation in this paper.
$\|\cdot\|$ denotes the Euclidean norm: $\|x\| = \sqrt{\sum_{i=1}^d x_i^2}$.
For a natural number $n$, $[n]$ denotes the set $\{1, 2, \ldots, n\}$.
We define as the quadratic function with center , i.e., .
A sparsification operator $\mathrm{comp}_k$ is defined as $[\mathrm{comp}_k(x)]_i = (d/k)\, x_i$ for $i \in Q$ and $[\mathrm{comp}_k(x)]_i = 0$ otherwise, where $Q$ is a uniformly random size-$k$ subset of $[d]$.
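A sketch of this operator (under our reading that the surviving coordinates are rescaled by $d/k$ to make the operator unbiased), together with an empirical check of the unbiasedness:

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased random-k sparsification: keep k uniformly random
    coordinates, rescaled by d/k so that E[rand_k(x)] = x."""
    d = x.size
    out = np.zeros_like(x)
    Q = rng.choice(d, size=k, replace=False)  # uniformly random subset of [d]
    out[Q] = (d / k) * x[Q]
    return out

# empirical check of unbiasedness: the average over many draws approaches x
rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])
avg = np.mean([rand_k(x, 2, rng) for _ in range(20000)], axis=0)
```

The $d/k$ rescaling is what makes the estimator unbiased, at the price of a $d/k$ times larger variance.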
The following are theoretical assumptions for our analysis. They are very standard in the optimization literature. We always assume the first three assumptions.
$f$ has a minimizer $x_*$.
$f$ is $L$-smooth ($L > 0$), i.e., $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y$.
The stochastic gradient has $\sigma^2$-bounded variance, i.e., $\mathbb{E}\|\nabla f(x; z) - \nabla f(x)\|^2 \le \sigma^2$.
$f$ is $\mu$-strongly convex ($\mu \ge 0$), i.e., $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$ for all $x, y$.
3 Algorithm Descriptions
In this section, we describe our proposed algorithms in detail.
3.1 Sparsified Stochastic Gradient Descent with Error Feedback
The algorithm of Sparsified SGD with Error Feedback (S-SGD-EF) for convex and nonconvex objectives is provided in Algorithm 1. In lines 3-7, roughly speaking, we construct a gradient estimator using the error feedback scheme, compress it to a sparse vector, and update a cumulative compression error, all in parallel. More specifically, each node first computes an i.i.d. stochastic gradient with respect to its corresponding data partition. Second, the cumulative compression error is added to the stochastic gradient (we call this process error feedback), and then we construct an unbiasedly sparsified gradient estimator by randomly picking $k$ nonzero coordinates of the stochastic gradient with error feedback. Finally, the cumulative compression error is updated for the subsequent iterations. In line 8, each node broadcasts its compressed gradient estimator to, and receives those of, the other nodes. In lines 9-10, we update the solution using the average of the received compressed gradients in each node. Note that every node holds the same updated solution in each iteration.
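A serial simulation of the per-node logic may clarify the loop. The sketch below follows the classic memory-style error feedback scheme (as in MEM-SGD) with random coordinate selection on a hypothetical quadratic problem; the exact scaling of the error term in Algorithm 1 may differ.

```python
import numpy as np

def select_k(v, k, rng):
    """Keep k uniformly random coordinates of v; zero out the rest."""
    out = np.zeros_like(v)
    Q = rng.choice(v.size, size=k, replace=False)
    out[Q] = v[Q]
    return out

def s_sgd_ef(grad_fns, x0, lr=0.05, k=1, iters=1500, seed=0):
    """Sketch of sparsified SGD with error feedback (serial simulation
    of P nodes). Each node sparsifies (gradient + accumulated error),
    keeps the compression residual locally, and the sparse messages
    are averaged to update the common solution."""
    rng = np.random.default_rng(seed)
    P, d = len(grad_fns), x0.size
    x = x0.copy()
    err = np.zeros((P, d))            # per-node cumulative compression error
    for _ in range(iters):
        msgs = []
        for p in range(P):
            g = grad_fns[p](x, rng)   # local stochastic gradient
            v = g + err[p]            # error feedback
            c = select_k(v, k, rng)   # compressed message
            err[p] = v - c            # residual kept for later iterations
            msgs.append(c)            # would be broadcast between nodes
        x = x - lr * np.mean(msgs, axis=0)  # averaged sparse update
    return x

# usage: two "nodes" sharing the quadratic f(x) = ||x - 1||^2, noisy gradients
grad = lambda x, rng: 2.0 * (x - 1.0) + 0.1 * rng.normal(size=x.size)
x = s_sgd_ef([grad, grad], np.zeros(4))
```

The residual `err[p]` never leaves node `p`, so the communicated message stays $k$-sparse while no gradient information is permanently discarded.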
Remark (Difference from previous algorithms).
Algorithm 1 can be regarded as an extension of MEM-SGD Stich et al. (2018) or EF-SGD Karimireddy et al. (2019) to parallel computing settings, though these two methods mainly utilize top-$k$ compression for gradient sparsification, whereas we use unbiased random compression. This difference is essential for our analysis.
3.2 Sparsified Stochastic Nesterov Accelerated Gradient descent with Error Feedback
The procedure of S-SNAG-EF for convex objectives is provided in Algorithm 2. In line 5, we compress two different gradient estimators by randomly picking $k$ coordinates for each. Also, in line 6, we update three cumulative compression errors. Why are different compressed estimators and cumulative errors necessary for appropriate updates? In a typical acceleration algorithm, we construct two different solution paths and their aggregation, as in line 10. The aggregation of the "conservative" solution (because of its small learning rate) and the "aggressive" solution (because of its large learning rate) is the essence of Nesterov’s acceleration. On the other hand, from a theoretical point of view, the impact of error feedback on the vanilla stochastic gradient should be scaled by the inverse of the learning rate, as in line 5. Therefore, since two different learning rates are used, it is necessary to construct two compressed gradient estimators and hence three compression errors. Generally, S-SNAG-EF has no theoretical guarantee for nonconvex objectives. However, utilizing a regularization technique, the convergence of Reg-SNAG-EF (Algorithm 3) to a stationary point is guaranteed. Specifically, Algorithm 3 repeatedly minimizes a "regularized" objective centered at the current solution by using S-SNAG-EF.
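To illustrate the conservative/aggressive structure in isolation, here is plain deterministic Nesterov acceleration in its standard constant-momentum form (our own sketch; it omits the sparsification and error feedback of S-SNAG-EF):

```python
import numpy as np

def nesterov(grad, x0, L, mu, iters=200):
    """Nesterov's accelerated gradient for an L-smooth, mu-strongly
    convex objective: a conservative gradient step from the query
    point y, followed by an aggressive extrapolation."""
    beta = (np.sqrt(L / mu) - 1.0) / (np.sqrt(L / mu) + 1.0)  # momentum weight
    x = x0.copy()
    y = x0.copy()
    for _ in range(iters):
        x_new = y - grad(y) / L          # conservative step (step size 1/L)
        y = x_new + beta * (x_new - x)   # aggressive extrapolation
        x = x_new
    return x

# usage: ill-conditioned quadratic f(x) = 0.5 * x^T diag(1, 100) x
D = np.array([1.0, 100.0])
x = nesterov(lambda v: D * v, np.array([1.0, 1.0]), L=100.0, mu=1.0)
```

The two coupled sequences (`x` and the extrapolated query point `y`) mirror the two solution paths aggregated in line 10 of Algorithm 2.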
Remark (Parameter tuning).
It may seem that Algorithm 2 has many tuning parameters, but this is not the case. Specifically, as Theorem 4.8 in Section 4 indicates, the actual tuning parameters are only the constant learning rate, the strong convexity parameter, and the sparsity size $k$; the other parameters are theoretically determined. This means that the additional tuning parameter compared to S-SGD-EF is essentially only the strong convexity parameter. Practically, fixing it to a default value works well.
4 Convergence Analysis
In this section, we provide the convergence analysis of S-SGD-EF and S-SNAG-EF. For convex cases, we assume strong convexity of the objective. For non-strongly convex cases, the convergence rates can be immediately derived from those for strongly convex cases via the standard dummy-regularizer approach, which we omit here.
Let $\bar{e}_t$ be the mean of the cumulative compression errors of all the nodes at the $t$-th iteration, i.e., $\bar{e}_t = \frac{1}{P}\sum_{p=1}^{P} e_t^{(p)}$. We use the notation $\tilde{O}$ to hide additional logarithmic factors for simplicity.
4.1 Analysis of S-SGD-EF
In this subsection, we provide the analysis of S-SGD-EF. The proofs of the statements can be found in Section A of the supplementary material.
The following proposition holds for a strongly convex objective.
Proposition 4.1 (Strongly convex).
The first term is the deterministic term and the second term is the stochastic error term. The last term is the compression error term and we can further bound it by the following proposition.
Suppose that Assumptions 1-3 hold. Let the learning rate be sufficiently small. Then S-SGD-EF satisfies
Importantly, the expected accumulated compression error scales as $1/P$, i.e., it decreases linearly with the number of nodes.
Theorem 4.3 (Strongly convex).
Theorem 4.3 implies that S-SGD-EF asymptotically achieves the iteration complexity of non-sparsified parallel SGD, because the last compression error term has a better dependence on the target accuracy than the stochastic term. Also note that the last term scales as $1/P$, which is a desirable property for distributed optimization with many nodes. However, the last term also carries a factor of $d/k$, which may be large and can dominate the other terms for moderate accuracy. Thus, consideration of the non-asymptotic behavior is important, particularly for high compression settings.
For nonconvex objectives, we can derive the following proposition.
Proposition 4.4 (General nonconvex).
Theorem 4.5 (General nonconvex).
Similar to convex cases, S-SGD-EF asymptotically achieves the same rate as non-sparsified SGD.
4.2 Analysis of S-SNAG-EF
Here, we provide the theoretical analysis of our proposed S-SNAG-EF. For the proofs of the statements, see the supplementary material (Section B).
The following proposition holds for a strongly convex objective.
Proposition 4.6 (Strongly convex).
The first deterministic error term enjoys a better dependence thanks to the acceleration scheme, at the expense of a larger stochastic error term (the second term) than that of S-SGD-EF.
The third and last terms are bounded by the following proposition.
Suppose that Assumptions 1-3 hold. Let the learning rates be sufficiently small and monotonically non-increasing. Then S-SNAG-EF satisfies
Theorem 4.8 (Strongly convex).
The terms after the third one have a better dependence on the target accuracy than the second stochastic error term. Hence, for very small target accuracy, we can ignore the compression error terms, and the rate asymptotically matches that of vanilla SGD. Additionally, the compression error terms have a better dependence on the target accuracy than those of S-SGD-EF.
Compared with the rate of S-SGD-EF, we can easily see that the rate of S-SNAG-EF is strictly better for a range of target accuracies, under a mild assumption on the problem parameters.
We can derive a convergence rate of Reg-S-SNAG-EF for general nonconvex objectives by iteratively applying Theorem 4.8 to the pseudo-regularized objective.
Theorem 4.9 (General nonconvex).
Comparing Theorem 4.9 with Theorem 4.5, we can see that even in nonconvex cases acceleration can be beneficial. Indeed, the compression error terms (the third and fourth terms) have a better dependence on the target accuracy than those of S-SGD-EF.
5 Related Work
In this section, we briefly describe the papers most relevant to this work. Stich et al. (2018) first provided a theoretical analysis of sparsified SGD with error feedback (called MEM-SGD) and showed that MEM-SGD asymptotically achieves the rate of non-sparsified SGD. However, their analysis is limited to convex cases in serial computing settings, i.e., a single node. Independently, Alistarh et al. (2018) also theoretically considered sparsified SGD with error feedback in parallel settings for convex and nonconvex objectives. However, their analysis is still unsatisfactory for two reasons. First, it relies on an artificial analytic assumption due to the usage of the top-$k$ algorithm as gradient compression, though they experimentally tried to validate it. Second, it is unclear from their results whether the algorithm asymptotically possesses the linear speed-up property with respect to the number of nodes. Recently, Karimireddy et al. (2019) also analysed a variant of sparsified SGD with error feedback (called EF-SGD) for convex and nonconvex cases in serial computing settings. Their derived rate for nonconvex cases is worse than our result for S-SGD-EF. Differently from ours, their analysis allows non-smoothness of the objective in convex cases, though the convergence rate is then always worse than that of vanilla SGD and the algorithm does not possess asymptotic optimality.
6 Conclusion and Future Work
In this paper, we mainly considered an accelerated sparsified SGD with error feedback in parallel computing settings. We gave a theoretical analysis for convex and nonconvex objectives and showed that our proposed algorithm achieves (i) asymptotic linear speed-up with respect to the number of nodes and (ii) lower iteration complexity for moderate accuracy than the non-accelerated algorithm, thanks to Nesterov’s acceleration.
One interesting question is whether our theoretical results are tight. Deriving lower bounds on the iteration complexity of sparsification (or, more generally, compression) methods in distributed settings with limited communication is quite important. Another interesting direction for future work is to extend our results to proximal settings, which allow a non-smooth regularizer, e.g., the $\ell_1$ regularizer, since non-smooth regularizers are very popular in machine learning for both convex and nonconvex problems. The construction of proximal versions of our algorithms and their analysis are non-trivial and definitely meaningful. We conjecture that the asymptotic optimality is still guaranteed in this setting.
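For instance, the proximal step for the $\ell_1$ regularizer mentioned above reduces to elementwise soft-thresholding; a minimal sketch (the proximal versions of our algorithms themselves are left to future work):

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t*||.||_1 (soft-thresholding):
    argmin_x 0.5*||x - v||^2 + t*||x||_1, computed elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# components with |v_i| <= t are zeroed; the rest shrink toward zero by t
z = prox_l1(np.array([3.0, -0.5, 1.0]), 1.0)
```

In a proximal variant, such a step would be applied after each (averaged) compressed gradient update.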
TS was partially supported by MEXT Kakenhi (15H05707, 18K19793 and 18H03201), Japan Digital Design, and JST-CREST.
- Agarwal and Duchi (2011) A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
- Aji and Heafield (2017) A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
- Alistarh et al. (2017) D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
- Alistarh et al. (2018) D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5973–5983, 2018.
- Arjevani and Shamir (2015) Y. Arjevani and O. Shamir. Communication complexity of distributed convex learning and optimization. In Advances in neural information processing systems, pages 1756–1764, 2015.
- Bekkerman et al. (2011) R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge Univ Pr, 2011.
- Chen et al. (2016) J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous sgd. arXiv preprint arXiv:1604.00981, 2016.
- Chen et al. (2012) X. Chen, Q. Lin, and J. Pena. Optimal regularized dual averaging methods for stochastic optimization. In Advances in Neural Information Processing Systems, pages 395–403, 2012.
- Dean et al. (2012) J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
- Dekel et al. (2012) O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
- Duchi et al. (2011) J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic control, 57(3):592–606, 2011.
- Gemulla et al. (2011) R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 69–77. ACM, 2011.
- Ghadimi and Lan (2016) S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
- Goyal et al. (2017) P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- Ho et al. (2013) Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems, pages 1223–1231, 2013.
- Hu et al. (2009) C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pages 781–789, 2009.
- Jaggi et al. (2014) M. Jaggi, V. Smith, M. Takác, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in neural information processing systems, pages 3068–3076, 2014.
- Karimireddy et al. (2019) S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi. Error feedback fixes signsgd and other gradient compression schemes. arXiv preprint arXiv:1901.09847, 2019.
- Lan and Zhou (2018) G. Lan and Y. Zhou. Asynchronous decentralized accelerated stochastic gradient descent. arXiv preprint arXiv:1809.09258, 2018.
- (20) G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, pages 1–48.
- Li et al. (2014) M. Li, T. Zhang, Y. Chen, and A. J. Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670. ACM, 2014.
- Lian et al. (2015) X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
- Lian et al. (2017a) X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017a.
- Lian et al. (2017b) X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017b.
- Lin et al. (2017) Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.
- Liu et al. (2015) J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. The Journal of Machine Learning Research, 16(1):285–322, 2015.
- (27) A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization.
- Nesterov (2013a) Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013a.
- Nesterov (2013b) Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013b.
- Recht et al. (2011) B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
- Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
- Scaman et al. (2018) K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems, pages 2740–2749, 2018.
- Seide et al. (2014) F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
- Shamir and Srebro (2014) O. Shamir and N. Srebro. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 850–857. IEEE, 2014.
- Shi et al. (2019) S. Shi, Q. Wang, K. Zhao, Z. Tang, Y. Wang, X. Huang, and X. Chu. A distributed synchronous sgd algorithm with global top-$k$ sparsification for low bandwidth networks. arXiv preprint arXiv:1901.04359, 2019.
- Stich et al. (2018) S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified sgd with memory. In Advances in Neural Information Processing Systems, pages 4447–4458, 2018.
- Tseng (2008) P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. submitted to SIAM Journal on Optimization, 2:3, 2008.
- Uribe et al. (2017) C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedić. Optimal algorithms for distributed optimization. arXiv preprint arXiv:1712.00232, 2017.
- Wangni et al. (2018) J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.
- Wen et al. (2017) W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017.
- Wu et al. (2018) J. Wu, W. Huang, J. Huang, and T. Zhang. Error compensated quantized sgd and its applications to large-scale distributed optimization. arXiv preprint arXiv:1806.08054, 2018.
- Yuan et al. (2016) K. Yuan, Q. Ling, and W. Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
- Zheng et al. (2017) S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z.-M. Ma, and T.-Y. Liu. Asynchronous stochastic gradient descent with delay compensation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 4120–4129. JMLR. org, 2017.
- Zinkevich et al. (2010) M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.
Appendix A Analysis of S-SGD-EF
A.1 Analysis of the compression error
where the expectations are taken with respect to , which are the random choices of the coordinates for constructing conditioned on .
First note that and . Since , where and , we have
Here the expectations are taken with respect to , which are the random choices of the coordinates for constructing conditioned on . Since each
is an independent unbiased estimator of for , we have
The last equality is from the independence of . ∎
Now we need to bound the variance term .
where and each
is i.i.d. according to the uniform distribution on . Since they are i.i.d., we have