Accelerated Sparsified SGD with Error Feedback

We study a stochastic gradient method for synchronous distributed optimization. To reduce communication cost, we are interested in compressing the communicated gradients. Our main focus is a sparsified stochastic gradient method with an error feedback scheme combined with Nesterov's acceleration. Both a strong theoretical analysis of sparsified SGD with error feedback in parallel computing settings and the application of an acceleration scheme to sparsified SGD with error feedback are new. We show that (i) our method asymptotically achieves the same iteration complexity as non-sparsified SGD even in parallel computing settings; and (ii) Nesterov's acceleration improves the iteration complexity of the non-accelerated method for moderate optimization accuracy, in both convex and nonconvex optimization problems.

1 Introduction

In typical modern machine learning tasks, we often encounter large-scale optimization problems that require enormous computational time to solve. Hence, reducing the computational time of optimization processes is of great practical importance and is a central interest of the optimization community.

To tackle large-scale problems, the gold-standard approach is the Stochastic Gradient Descent (SGD) method Robbins and Monro (1951). To reduce the loss, SGD updates the current solution in each iteration using a stochastic gradient, that is, the average of the gradients of the loss functions corresponding to a random subset of the dataset (a mini-batch) rather than the whole dataset. This stochastic mini-batch approach allows SGD to be faster than deterministic full-batch methods in terms of computational time Dekel et al. (2012); Li et al. (2014). Furthermore, the Stochastic Nesterov's Accelerated Gradient (SNAG) method and its variants have been proposed Hu et al. (2009); Chen et al. (2012); Ghadimi and Lan (2016), which combine SGD with Nesterov's acceleration Nesterov (2013b, a); Tseng (2008). Mini-batch SNAG theoretically outperforms vanilla mini-batch SGD for moderate optimization accuracy, though its asymptotic convergence rate matches that of SGD.
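To make the accelerated update concrete, the following is a minimal sketch of one common "lookahead" formulation of a stochastic Nesterov step; the function and parameter names (`snag_step`, `lr`, `momentum`) are ours, and the cited papers differ in details:

```python
def snag_step(x, v, stoch_grad, lr=0.01, momentum=0.9):
    """One stochastic Nesterov step in a common lookahead formulation.

    x: current iterate; v: velocity buffer; stoch_grad(point) returns a
    mini-batch gradient. With momentum=0 this reduces to plain SGD.
    """
    g = stoch_grad(x + momentum * v)   # gradient at the extrapolated point
    v = momentum * v - lr * g          # velocity update
    return x + v, v                    # accelerated step
```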

For further scalability, distributed optimization has received much research attention Bekkerman et al. (2011); Duchi et al. (2011); Jaggi et al. (2014); Gemulla et al. (2011); Dean et al. (2012); Ho et al. (2013); Arjevani and Shamir (2015); Chen et al. (2016); Goyal et al. (2017). Distributed optimization methods are mainly classified by their communication type as synchronous centralized Zinkevich et al. (2010); Dekel et al. (2012); Shamir and Srebro (2014), asynchronous centralized Recht et al. (2011); Agarwal and Duchi (2011); Lian et al. (2015); Liu et al. (2015); Zheng et al. (2017), synchronous decentralized Nedic and Ozdaglar; Yuan et al. (2016); Lian et al. (2017a); Lan et al.; Uribe et al. (2017); Scaman et al. (2018), and asynchronous decentralized Lian et al. (2017b); Lan and Zhou (2018). In this paper, we particularly focus on data-parallel stochastic gradient methods for synchronous centralized distributed optimization of a smooth objective function $f = \frac{1}{P}\sum_{p=1}^{P} f_p$, where each $f_p$ corresponds to the data partition of the whole dataset assigned to the $p$-th node (or processor). In this setting, each processor first computes a stochastic gradient of its own $f_p$, and then the nodes send their gradients to each other. Finally, the current solution is updated on each processor using the averaged gradient. Here we assume that node-to-node broadcasts are used, but it is also possible to utilize an intermediate parameter server.
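The following sketch simulates this synchronous data-parallel step serially, for illustration; the list averaging stands in for the broadcast-and-average communication, and all names are hypothetical:

```python
def parallel_sgd_step(x, partitions, stoch_grad, lr=0.01):
    """One synchronous data-parallel SGD step, simulated serially.

    Each "node" p computes a stochastic gradient on its own data
    partition; averaging the gradients stands in for the exchange,
    after which every node applies the identical update, so all nodes
    keep the same iterate.
    """
    grads = [stoch_grad(x, part) for part in partitions]
    return x - lr * sum(grads) / len(grads)
```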

A main concern in synchronous distributed optimization is communication cost, because it can easily become a bottleneck in optimization processes. Theoretically, naive parallel mini-batch SGD achieves a linear speedup with respect to the number of processors Dekel et al. (2012); Li et al. (2014), but often not empirically, due to communication cost Shamir and Srebro (2014); Chen et al. (2016). To leverage the power of parallel computing, it is essential to reduce the communication cost.

One fascinating technique for reducing communication cost in distributed optimization is compression of the communicated gradients Aji and Heafield (2017); Lin et al. (2017); Wangni et al. (2018); Alistarh et al. (2018); Stich et al. (2018); Shi et al. (2019); Karimireddy et al. (2019); Seide et al. (2014); Wen et al. (2017); Alistarh et al. (2017); Wu et al. (2018). Sparsification is an approach in which each local node sparsifies its gradient before communication Aji and Heafield (2017); Lin et al. (2017); Wangni et al. (2018); Alistarh et al. (2018); Stich et al. (2018); Shi et al. (2019); Karimireddy et al. (2019). For sparsifying a gradient, the top-$k$ algorithm, which keeps only the $k$ components of the gradient that are largest in absolute value and drops the rest, has typically been used. Another example of compression is quantization, a technique that limits the number of bits used to represent the communicated gradients. Several works have demonstrated that parallel SGD with quantized gradients has good practical performance Seide et al. (2014); Wen et al. (2017); Alistarh et al. (2017); Wu et al. (2018). In particular, Alistarh et al. Alistarh et al. (2017) proposed Quantized SGD (QSGD), the first quantization algorithm with a theoretical convergence rate; QSGD is based on unbiased quantization of the communicated gradient.
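To make the two sparsifiers concrete, here is a minimal NumPy sketch of the (biased) top-$k$ operator and of an unbiased random-$k$ operator with $d/k$ rescaling; the function names are our own:

```python
import numpy as np

def top_k(g, k):
    """Keep the k entries of g that are largest in absolute value; zero the rest."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def rand_k(g, k, rng):
    """Unbiased random sparsification: keep k uniformly chosen entries,
    rescaled by d/k so that E[rand_k(g)] = g."""
    d = g.size
    out = np.zeros_like(g)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * g[idx]
    return out
```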

However, there is an essential theoretical trade-off between communication cost and convergence speed when naive gradient compression schemes are used. Specifically, naive compression (including sparsification and quantization) causes large variances, and the resulting methods are theoretically always slower than vanilla SGD, though they surely reduce the communication cost Stich et al. (2018); Alistarh et al. (2017).

The error feedback scheme partially resolves this trade-off. Some works have considered using compressed gradients together with the locally accumulated compression errors on each node, and the effectiveness of this scheme has been validated empirically Aji and Heafield (2017); Lin et al. (2017); Wu et al. (2018). Very recently, several works have analysed and justified the effectiveness of error feedback from a theoretical viewpoint Alistarh et al. (2018); Stich et al. (2018); Karimireddy et al. (2019). Surprisingly, it has been shown that sparsified SGD with error feedback asymptotically achieves the same rate as non-sparsified SGD.
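A minimal sketch of the error feedback idea in its common form (the accumulated residual is added back before compression); the names are hypothetical, and the papers above differ in how the error is scaled:

```python
def ef_compress(g, error, compress):
    """One error feedback step: compress the gradient plus the
    accumulated residual, and carry the new loss forward.

    compress can be any (possibly biased) compressor, e.g.
    compress = lambda v: top_k(v, 64) from the sketch above.
    """
    corrected = g + error            # feed the past compression error back in
    sent = compress(corrected)       # this sparse vector is communicated
    return sent, corrected - sent    # what was lost is re-injected next time
```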

Nevertheless, from a theoretical point of view, the analysis in previous work is still unsatisfactory: either no analysis has been given for distributed settings Karimireddy et al. (2019); Stich et al. (2018), or the analysis focuses only on top-$k$ sparsification and does not establish the linear speedup property with respect to the number of nodes Alistarh et al. (2018). Moreover, previous work has not taken non-asymptotic iteration complexities into consideration. This consideration is practically important because the additional iteration complexity caused by sparsification typically carries a factor of $d/k$, which can be very large, particularly in high compression settings.

There exist two open questions.

  • Does sparsified SGD with error feedback asymptotically achieve the same rate as non-sparsified parallel SGD in distributed optimization settings?

  • Are there any better algorithms than sparsified SGD with error feedback in terms of non-asymptotic iteration complexity?

We answer both questions positively in this work.

Main contribution

We propose and analyse the Sparsified Stochastic Nesterov's Accelerated Gradient method (S-SNAG-EF), based on the combination of (i) unbiased compression of the stochastic gradients; (ii) an error feedback scheme; and (iii) Nesterov's acceleration technique. The main features of our method are as follows:

  • (Linear speedup w.r.t. #nodes) Our method possesses the linear speedup property with respect to the number of processors in distributed optimization settings, in the sense that it asymptotically achieves the same rate as non-sparsified parallel SGD. To the best of our knowledge, this property has not been shown for any previous method, in particular top-$k$ sparsified SGD and its variants.

  • (Low iteration complexity for moderate accuracy) Our proposed method achieves strictly better iteration complexity than S-SGD-EF for a wide range of desired optimization accuracies, which is practically meaningful in high compression settings, though its asymptotic iteration complexity matches that of S-SGD-EF.

We also analyse non-accelerated sparsified SGD with error feedback (S-SGD-EF) in parallel computing settings and show that S-SGD-EF enjoys the former property above.

A comparison of our methods with the most relevant previous methods is summarized in Table 1, from which we can make the following observations:

Method                            | general convex | strongly convex | general nonconvex | Para?
SGD                               |                |                 |                   | Yes
SNAG                              |                |                 | No Analysis       | Yes
S-SGD                             |                |                 |                   | Yes
S-SNAG                            |                |                 | No Analysis       | Yes
MEM-SGD Stich et al. (2018)       |                |                 | No Analysis       | No
EF-SGD Karimireddy et al. (2019)  |                | No Analysis     |                   | No
S-SGD-EF                          |                |                 |                   | Yes
S-SNAG-EF                         |                |                 |                   | Yes
Table 1: Comparison of the iteration complexities of our methods with relevant previous ones. "Para?" indicates whether the algorithm has theoretical guarantees in parallel (multi-processor) settings. $\epsilon$ is the desired accuracy, $d$ is the problem dimensionality, $k$ is the number of non-zero components of the communicated stochastic gradients in each iteration, $P$ is the number of processors, and $\mu$ is the strong convexity parameter. For simple comparison, we assume that the remaining problem constants are $\Theta(1)$. Also, extra logarithmic factors are ignored.
  • S-SGD vs. SGD: The iteration complexities of S-SGD are always $d/k$ times worse than those of SGD because of the $d/k$ times larger variance of the randomly compressed stochastic gradients.

  • S-SGD-EF vs. S-SGD: S-SGD-EF has a better dependence on the desired accuracy than S-SGD in the sparsification error terms. Asymptotically, the iteration complexities of S-SGD-EF are $d/k$ times better than those of S-SGD.

  • S-SGD-EF vs. MEM-SGD: When $P = 1$, the rates of the two methods are the same for convex cases. However, S-SGD-EF is applicable to parallel settings and achieves a linear speedup in the number of processors with respect to the asymptotically dominant term.

  • S-SGD-EF vs. EF-SGD: When $P = 1$, the rate of S-SGD-EF for general nonconvex cases is always better than that of EF-SGD. Note that for general convex cases, EF-SGD is applicable to non-smooth objectives, so the rates cannot be directly compared.

  • S-SNAG-EF vs. S-SGD-EF: For general convex cases, the rate of S-SNAG-EF is strictly better than that of S-SGD-EF for a range of moderate accuracies (the precise condition is given in Section 4), though the two rates are asymptotically the same. The same holds for general nonconvex cases, with a different range. For high compression settings (i.e., $k \ll d$), these ranges are wide and meaningful.

To look more closely at the theoretical iteration complexities of S-SGD-EF and S-SNAG-EF, we illustrate their comparison in Figure 1.

Figure 1: Comparison of the theoretical iteration complexities of SGD, S-SGD, S-SGD-EF and S-SNAG-EF.

2 Notation and Assumptions

We use the following notation in this paper.

  • $\|\cdot\|$ denotes the Euclidean norm: $\|x\| = \sqrt{\sum_{i=1}^{d} x_i^2}$.

  • For a natural number $n$, $[n]$ denotes the set $\{1, \ldots, n\}$.

  • We define $q_c$ as the quadratic function with center $c$, i.e., $q_c(x) = \frac{1}{2}\|x - c\|^2$.

  • The sparsification operator $\mathrm{comp}_k$ is defined by $[\mathrm{comp}_k(x)]_i = \frac{d}{k} x_i$ for $i \in Q$ and $[\mathrm{comp}_k(x)]_i = 0$ otherwise, where $Q$ is a uniformly random subset of $[d]$ with $|Q| = k$ (an empirical sanity check follows this list).
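A quick empirical check of the unbiasedness of this operator, assuming the $d/k$ rescaling stated above; all names are ours:

```python
import numpy as np

# Empirical check that the rescaled random sparsifier is unbiased:
# averaging comp_k(x) over many draws should recover x.
rng = np.random.default_rng(0)
d, k = 10, 3
x = rng.standard_normal(d)

def comp_k(v, k, rng):
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)   # uniform subset Q, |Q| = k
    out[idx] = (v.size / k) * v[idx]                  # d/k rescaling makes it unbiased
    return out

avg = np.mean([comp_k(x, k, rng) for _ in range(100_000)], axis=0)
print(np.max(np.abs(avg - x)))   # small; shrinks as the number of draws grows
```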

The following assumptions underpin our analysis; they are standard in the optimization literature. We always assume the first three.

Assumption 1.

$f$ has a minimizer $x_* \in \mathbb{R}^d$.

Assumption 2.

$f$ is $L$-smooth ($L > 0$), i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y \in \mathbb{R}^d$.

Assumption 3.

The stochastic gradients have $\sigma^2$-bounded variance ($\sigma \ge 0$), i.e., $\mathbb{E}\|g_p(x) - \nabla f_p(x)\|^2 \le \sigma^2$ for every node $p$ and every $x$.

Assumption 4.

$f$ is $\mu$-strongly convex ($\mu > 0$), i.e., $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$ for all $x, y \in \mathbb{R}^d$.

3 Algorithm Descriptions

In this section, we describe our proposed algorithms in detail.

3.1 Sparsified Stochastic Gradient Descent with Error Feedback

1:  Set: $x_0 \in \mathbb{R}^d$, $e_0^p = 0$ for all $p \in [P]$.
2:  for $t = 0$ to $T - 1$ do
3:     for $p = 1$ to $P$ in parallel do
4:        Compute an i.i.d. stochastic gradient of the partition $f_p$ of $f$: $g_t^p$.
5:        Compress: $\hat{g}_t^p = \mathrm{comp}_k(g_t^p + e_t^p)$.
6:        Update cumulative compression error: $e_{t+1}^p = g_t^p + e_t^p - \hat{g}_t^p$.
7:     end for
8:     Broadcast and Receive: $\{\hat{g}_t^p\}_{p=1}^{P}$.
9:     for $p = 1$ to $P$ in parallel do
10:        Update solution: $x_{t+1} = x_t - \frac{\eta}{P}\sum_{p=1}^{P} \hat{g}_t^p$.
11:     end for
12:  end for
13:  return $x_T$.
Algorithm 1 S-SGD-EF

The algorithm of Sparsified SGD with Error Feedback (S-SGD-EF) for convex and nonconvex objectives is given in Algorithm 1. In lines 3-7, roughly speaking, we construct a gradient estimator using the error feedback scheme, compress it to a sparse vector, and update a cumulative compression error, all in parallel. More specifically, each node first computes an i.i.d. stochastic gradient with respect to its data partition. Second, the cumulative compression error is added to the stochastic gradient (we call this process error feedback), and an unbiasedly sparsified gradient estimator is constructed by randomly picking $k$ non-zero coordinates of the error-corrected stochastic gradient. Finally, the cumulative compression error is updated for subsequent iterations. In line 8, every node broadcasts its compressed gradient estimator and receives those of the other nodes. In lines 9-10, each node updates the solution using the average of the received compressed gradients. Note that every node holds the same updated solution in each iteration.
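For concreteness, the following serial simulation mirrors the structure of Algorithm 1 as described above; the variable names and the exact error-update convention are our assumptions, not the paper's notation:

```python
import numpy as np

def s_sgd_ef(x0, partitions, stoch_grad, comp, lr, T, rng):
    """Serial simulation of the S-SGD-EF loop described above.

    stoch_grad(x, part) returns a stochastic gradient of a node's
    partition at x; comp(v, rng) is an unbiased sparsifier, e.g. a
    rand-k operator with a fixed k bound in via a closure.
    """
    P = len(partitions)
    x = x0.copy()
    errors = [np.zeros_like(x0) for _ in range(P)]     # per-node cumulative errors
    for _ in range(T):
        sent = []
        for p in range(P):                             # runs in parallel in Algorithm 1
            g = stoch_grad(x, partitions[p])           # i.i.d. stochastic gradient (line 4)
            corrected = g + errors[p]                  # error feedback (line 5)
            c = comp(corrected, rng)                   # unbiased sparsification (line 5)
            errors[p] = corrected - c                  # cumulative error update (line 6)
            sent.append(c)                             # the sparse vector that gets broadcast
        x = x - lr * sum(sent) / P                     # averaged update (lines 8-10)
    return x
```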

Remark (Difference from previous algorithms).

Algorithm 1 can be regarded as an extension of MEM-SGD Stich et al. (2018) or EF-SGD Karimireddy et al. (2019) to parallel computing settings, though those two methods mainly utilize top-$k$ compression for gradient sparsification, whereas we use unbiased random compression. This difference is essential for our analysis.

3.2 Sparsified Stochastic Nesterov Accelerated Gradient descent with Error Feedback

1:  Set: $x_0 = y_0 = z_0$, and the three cumulative compression errors to $0$ on every node.
2:  for $t = 0$ to $T - 1$ do
3:     for $p = 1$ to $P$ in parallel do
4:        Compute an i.i.d. stochastic gradient of the partition $f_p$ of $f$: $g_t^p$.
5:        Compress: construct two sparsified estimators of $g_t^p$ with error feedback, each scaled by the inverse of its learning rate.
6:        Update cumulative compression errors: update the three error buffers accordingly.
7:     end for
8:     Broadcast and Receive: the compressed estimators of all nodes.
9:     for $p = 1$ to $P$ in parallel do
10:        Update solutions: update the conservative sequence $y$, the aggressive sequence $z$, and their aggregation $x$.
11:     end for
12:  end for
13:  return $x_T$.
Algorithm 2 S-SNAG-EF
  Set: $\tilde{x}_0 = x_0$.
  for $s = 0$ to $S - 1$ do
     Run: S-SNAG-EF on the regularized objective $f(x) + \lambda\|x - \tilde{x}_s\|^2$ to obtain $\tilde{x}_{s+1}$.
  end for
  return $\tilde{x}_S$.
Algorithm 3 Reg-S-SNAG-EF

The procedure of S-SNAG-EF for convex objectives is given in Algorithm 2. In line 5, we compress two different gradient estimators by randomly picking $k$ coordinates for each. In line 6, we update three cumulative compression errors. Why are different compressed estimators and cumulative errors necessary for appropriate updates? In a typical acceleration algorithm, we construct two different solution paths and their aggregation, as in line 10. The aggregation of the "conservative" solution (updated with a small learning rate) and the "aggressive" solution (updated with a large learning rate) is the essence of Nesterov's acceleration. On the other hand, from a theoretical point of view, the impact of the error feedback on the vanilla stochastic gradient should be scaled by the inverse of the learning rate, as in line 5. Therefore, using two different learning rates makes it necessary to construct two compressed gradient estimators and hence three compression errors. Generally, S-SNAG-EF has no theoretical guarantee for nonconvex objectives. However, utilizing a regularization technique, the convergence of Reg-S-SNAG-EF (Algorithm 3) to a stationary point is guaranteed. Specifically, Algorithm 3 repeatedly minimizes the "regularized" objective $f(x) + \lambda\|x - \tilde{x}_s\|^2$ by using S-SNAG-EF, where $\tilde{x}_s$ is the current solution.
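The following heavily simplified sketch conveys the two-sequence structure; it omits the inverse-learning-rate scaling of the feedback and the third error buffer, so it should be read as an illustration of the coupling rather than as Algorithm 2 itself (all names are ours):

```python
def s_snag_ef_step(y, z, e_y, e_z, stoch_grad, comp, lr_small, lr_large, tau, rng):
    """One simplified accelerated step with per-estimator error feedback.

    y: "conservative" sequence (small step); z: "aggressive" sequence
    (large step); tau in [0, 1] is the Nesterov aggregation weight.
    comp(v, rng) is an unbiased sparsifier such as rand-k.
    """
    x = tau * z + (1.0 - tau) * y        # aggregate the two solution paths
    g = stoch_grad(x)                    # shared stochastic gradient
    c_y = comp(g + e_y, rng)             # compressed estimator for the y-update...
    e_y = g + e_y - c_y                  # ...with its own error feedback
    c_z = comp(g + e_z, rng)             # compressed estimator for the z-update
    e_z = g + e_z - c_z
    y = x - lr_small * c_y               # conservative update
    z = z - lr_large * c_z               # aggressive update
    return y, z, e_y, e_z
```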

Remark (Parameter tuning).

Algorithm 2 may appear to have many tuning parameters, but this is not the case. Specifically, as Theorem 4.8 in Section 4 indicates, the parameters that actually require tuning are only the constant learning rate and the strong convexity parameter $\mu$; the other parameters are theoretically determined. This means that the only essential tuning parameter added on top of S-SGD-EF is the strong convexity parameter $\mu$. In practice, fixing it to a default value works well.

4 Convergence Analysis

In this section, we provide the convergence analysis of S-SGD-EF and S-SNAG-EF. For convex cases, we assume strong convexity of the objective; for non-strongly convex cases, the convergence rates can be derived immediately from the strongly convex ones via the standard dummy-regularizer approach, so we omit them here.

Let $\bar{e}_t$ be the mean of the cumulative compression errors over all nodes at the $t$-th iteration, i.e., $\bar{e}_t = \frac{1}{P}\sum_{p=1}^{P} e_t^p$. We use the notation $\tilde{O}(\cdot)$ to hide additional logarithmic factors for simplicity.

4.1 Analysis of S-SGD-EF

In this subsection, we provide the analysis of S-SGD-EF. The proofs of the statements are found in Section A of the supplementary material.

The following proposition holds for strongly convex objectives.

Proposition 4.1 (Strongly convex).

Suppose that Assumptions 1, 2, 3 and 4 hold, and let the learning rate be chosen appropriately. Then S-SGD-EF satisfies

where and according to .

The first term is the deterministic term and the second is the stochastic error term. The last term is the compression error term, which can be further bounded by the following proposition.

Proposition 4.2.

Suppose that Assumption 3 holds and let the learning rate be sufficiently small. Then S-SGD-EF satisfies

Remark.

Importantly, the expected accumulated compression error scales as $1/P$, i.e., it decreases linearly with respect to the number of nodes.

Combining Proposition 4.1 with Proposition 4.2 yields the following theorem.

Theorem 4.3 (Strongly convex).

Suppose that Assumptions 1, 2, 3 and 4 hold. Then, for a sufficiently small learning rate and appropriately chosen parameters, the iteration complexity of S-SGD-EF for achieving accuracy $\epsilon$ is

where the remaining quantities are defined in Proposition 4.1.

Remark.

Theorem 4.3 implies that S-SGD-EF asymptotically achieves the iteration complexity of non-sparsified parallel SGD, because the last compression error term has a milder dependence on $\epsilon$ than the stochastic error term. Also note that the last term scales as $1/P$, which is a desirable property for distributed optimization with many nodes. However, the last term carries a factor of $d/k$, which may be large and can dominate the other terms for moderate accuracy $\epsilon$. Thus, consideration of non-asymptotic behavior is also important, particularly for high compression settings.

For nonconvex objectives, we can derive the following proposition.

Proposition 4.4 (General nonconvex).

Suppose that Assumptions 1, 2 and 3 hold, together with a mild condition on the step size. Then S-SGD-EF satisfies the stated bound with high probability.

Combining Proposition A.6 with Proposition 4.2 yields the following theorem.

Theorem 4.5 (General nonconvex).

Suppose that Assumptions 1, 2 and 3 hold. Let the learning rate be the same as in Theorem 4.3. Then the iteration complexity of S-SGD-EF with appropriate parameters to achieve accuracy $\epsilon$ is

where the remaining quantities are defined in Proposition A.6.

Similarly to the convex case, S-SGD-EF asymptotically achieves the same rate as non-sparsified SGD.

4.2 Analysis of S-SNAG-EF

Here, we provide the theoretical analysis of our proposed S-SNAG-EF. For the proofs of the statements, see the supplementary material (Section B).

The following proposition holds for strongly convex objectives.

Proposition 4.6 (Strongly convex).

Suppose that Assumptions 1, 2, 3 and 4 hold, and let the learning rates and momentum parameters be chosen appropriately. Then S-SNAG-EF satisfies

where .

Remark.

The first deterministic error term scales with $\sqrt{L/\mu}$ rather than $L/\mu$ thanks to the acceleration scheme, at the expense of a larger stochastic error term (the second term) than that of S-SGD-EF.

The third and last terms are bounded by the following proposition.

Proposition 4.7.

Suppose that Assumption 3 holds. Let the learning rates be sufficiently small and monotonically non-increasing. Then S-SNAG-EF satisfies

Combining Propositions 4.6 and 4.7 yields the following theorem.

Theorem 4.8 (Strongly convex).

Suppose that Assumptions 1, 2, 3 and 4 hold. Let the parameters be the same as in Proposition 4.6 and the learning rate be sufficiently small. Then the iteration complexity of S-SNAG-EF with appropriate parameters to achieve accuracy $\epsilon$ is

where .

Remark.

The terms after the third have a better dependence on $\epsilon$ than the second (stochastic error) term. Hence, for very small $\epsilon$, we can ignore the compression error terms, and the rate asymptotically matches that of vanilla SGD. Additionally, the compression error terms have better dependences on $\epsilon$ than those of S-SGD-EF.

Remark.

Compared with the rate of S-SGD-EF, we can see that the rate of S-SNAG-EF is strictly better for a range of moderate accuracies $\epsilon$, under a mild assumption on the problem constants.

We can derive a convergence rate of Reg-S-SNAG-EF for general nonconvex objectives by iteratively applying Theorem 4.8 to the regularized objective.

Theorem 4.9 (General nonconvex).

Suppose that Assumptions 1, 2 and 3 hold. Let the parameters be the same as in Theorem 4.8 (applied with the regularization-induced strong convexity), and let the number of stages be sufficiently large. Then the iteration complexity of Reg-S-SNAG-EF with appropriate parameters for achieving accuracy $\epsilon$ is

where and according to .

Remark.

From Theorems 4.5 and 4.9, we can see that acceleration can be beneficial even in nonconvex cases. Indeed, the compression error terms (the third and fourth terms) have a better dependence on $\epsilon$ than those of S-SGD-EF.

5 Related Work

In this section, we briefly describe the papers most relevant to this work. Stich et al. Stich et al. (2018) first provided a theoretical analysis of sparsified SGD with error feedback (called MEM-SGD) and showed that MEM-SGD asymptotically achieves the rate of non-sparsified SGD. However, their analysis is limited to convex cases in serial computing settings, i.e., $P = 1$. Independently, Alistarh et al. Alistarh et al. (2018) also theoretically studied sparsified SGD with error feedback in parallel settings for convex and nonconvex objectives. However, their analysis is still unsatisfactory for several reasons. First, it relies on an artificial analytic assumption due to the usage of the top-$k$ algorithm for gradient compression, though they experimentally tried to validate it. Second, it is unclear from their results whether the algorithm asymptotically possesses the linear speedup property with respect to the number of nodes. Recently, Karimireddy et al. Karimireddy et al. (2019) also analysed a variant of sparsified SGD with error feedback (called EF-SGD) for convex and nonconvex cases in serial computing settings. Their rate for nonconvex cases is worse than our result for S-SGD-EF when $P = 1$. Differently from ours, their analysis allows non-smooth objectives in convex cases, though the resulting convergence rate is always worse than that of vanilla SGD and the algorithm does not possess asymptotic optimality.

6 Conclusion and Future Work

In this paper, we considered an accelerated sparsified SGD with error feedback in parallel computing settings. We gave a theoretical analysis for convex and nonconvex objectives and showed that our proposed algorithm achieves (i) asymptotic linear speedup with respect to the number of nodes; and (ii) lower iteration complexity for moderate accuracy than the non-accelerated algorithm, thanks to Nesterov's acceleration.

One interesting question is whether our theoretical results are tight. Deriving lower bounds on the iteration complexity of sparsification (or, more generally, compression) methods in distributed settings with limited communication is quite important. Another interesting direction for future work is to extend our results to proximal settings, which allow a non-smooth regularizer such as the $\ell_1$ regularizer, since non-smooth regularizers are very popular in machine learning for both convex and nonconvex problems. Constructing proximal versions of our algorithms and analysing them is non-trivial and definitely meaningful. We conjecture that asymptotic optimality is still guaranteed in this setting.

7 Acknowledgement

TS was partially supported by MEXT Kakenhi (15H05707, 18K19793 and 18H03201), Japan Digital Design, and JST-CREST.

References

  • Agarwal and Duchi (2011) A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.
  • Aji and Heafield (2017) A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
  • Alistarh et al. (2017) D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1709–1720, 2017.
  • Alistarh et al. (2018) D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5973–5983, 2018.
  • Arjevani and Shamir (2015) Y. Arjevani and O. Shamir. Communication complexity of distributed convex learning and optimization. In Advances in neural information processing systems, pages 1756–1764, 2015.
  • Bekkerman et al. (2011) R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge Univ Pr, 2011.
  • Chen et al. (2016) J. Chen, X. Pan, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous sgd. arXiv preprint arXiv:1604.00981, 2016.
  • Chen et al. (2012) X. Chen, Q. Lin, and J. Pena. Optimal regularized dual averaging methods for stochastic optimization. In Advances in Neural Information Processing Systems, pages 395–403, 2012.
  • Dean et al. (2012) J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems, pages 1223–1231, 2012.
  • Dekel et al. (2012) O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
  • Duchi et al. (2011) J. C. Duchi, A. Agarwal, and M. J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic control, 57(3):592–606, 2011.
  • Gemulla et al. (2011) R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 69–77. ACM, 2011.
  • Ghadimi and Lan (2016) S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
  • Goyal et al. (2017) P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
  • Ho et al. (2013) Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems, pages 1223–1231, 2013.
  • Hu et al. (2009) C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pages 781–789, 2009.
  • Jaggi et al. (2014) M. Jaggi, V. Smith, M. Takác, J. Terhorst, S. Krishnan, T. Hofmann, and M. I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in neural information processing systems, pages 3068–3076, 2014.
  • Karimireddy et al. (2019) S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi. Error feedback fixes signsgd and other gradient compression schemes. arXiv preprint arXiv:1901.09847, 2019.
  • Lan and Zhou (2018) G. Lan and Y. Zhou. Asynchronous decentralized accelerated stochastic gradient descent. arXiv preprint arXiv:1809.09258, 2018.
  • (20) G. Lan, S. Lee, and Y. Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. Mathematical Programming, pages 1–48.
  • Li et al. (2014) M. Li, T. Zhang, Y. Chen, and A. J. Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670. ACM, 2014.
  • Lian et al. (2015) X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2737–2745, 2015.
  • Lian et al. (2017a) X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017a.
  • Lian et al. (2017b) X. Lian, W. Zhang, C. Zhang, and J. Liu. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017b.
  • Lin et al. (2017) Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.
  • Liu et al. (2015) J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. The Journal of Machine Learning Research, 16(1):285–322, 2015.
  • (27) A. Nedic and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization.
  • Nesterov (2013a) Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013a.
  • Nesterov (2013b) Y. Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013b.
  • Recht et al. (2011) B. Recht, C. Re, S. Wright, and F. Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in neural information processing systems, pages 693–701, 2011.
  • Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • Scaman et al. (2018) K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y. T. Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems, pages 2740–2749, 2018.
  • Seide et al. (2014) F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • Shamir and Srebro (2014) O. Shamir and N. Srebro. Distributed stochastic optimization and learning. In 2014 52nd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 850–857. IEEE, 2014.
  • Shi et al. (2019) S. Shi, Q. Wang, K. Zhao, Z. Tang, Y. Wang, X. Huang, and X. Chu. A distributed synchronous sgd algorithm with global top- sparsification for low bandwidth networks. arXiv preprint arXiv:1901.04359, 2019.
  • Stich et al. (2018) S. U. Stich, J.-B. Cordonnier, and M. Jaggi. Sparsified sgd with memory. In Advances in Neural Information Processing Systems, pages 4447–4458, 2018.
  • Tseng (2008) P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. submitted to SIAM Journal on Optimization, 2:3, 2008.
  • Uribe et al. (2017) C. A. Uribe, S. Lee, A. Gasnikov, and A. Nedić. Optimal algorithms for distributed optimization. arXiv preprint arXiv:1712.00232, 2017.
  • Wangni et al. (2018) J. Wangni, J. Wang, J. Liu, and T. Zhang. Gradient sparsification for communication-efficient distributed optimization. In Advances in Neural Information Processing Systems, pages 1299–1309, 2018.
  • Wen et al. (2017) W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017.
  • Wu et al. (2018) J. Wu, W. Huang, J. Huang, and T. Zhang. Error compensated quantized sgd and its applications to large-scale distributed optimization. arXiv preprint arXiv:1806.08054, 2018.
  • Yuan et al. (2016) K. Yuan, Q. Ling, and W. Yin. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016.
  • Zheng et al. (2017) S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z.-M. Ma, and T.-Y. Liu. Asynchronous stochastic gradient descent with delay compensation. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 4120–4129. JMLR. org, 2017.
  • Zinkevich et al. (2010) M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in neural information processing systems, pages 2595–2603, 2010.

Appendix A Analysis of S-SGD-EF

A.1 Analysis of the compression operator $\mathrm{comp}_k$

Lemma A.1.

where the expectations are taken with respect to the random choices of the coordinates used to construct the compressed vectors, conditioned on the past iterates.

Proof.

First note the basic identities for the compressed estimators and decompose each estimator into its coordinate-wise terms. Here the expectations are taken with respect to the random choices of the coordinates used to construct the compressed vectors, conditioned on the past iterates. Since each coordinate-wise term is an independent unbiased estimator of the corresponding coordinate, the claimed identity follows; the last equality uses the independence of the coordinate choices. ∎
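For concreteness, a short calculation under the rand-$k$ operator with $d/k$ rescaling from Section 2 (our reading of the operator's definition) shows the unbiasedness and the variance inflation used here:

```latex
\mathbb{E}\big[[\mathrm{comp}_k(x)]_i\big]
  = \frac{d}{k}\,x_i\,\Pr[i \in Q]
  = \frac{d}{k}\,x_i\cdot\frac{k}{d} = x_i,
\qquad
\mathbb{E}\big\|\mathrm{comp}_k(x)\big\|^2
  = \sum_{i=1}^{d}\Big(\frac{d}{k}\Big)^2 x_i^2\cdot\frac{k}{d}
  = \frac{d}{k}\,\|x\|^2,
```

so that $\mathbb{E}\|\mathrm{comp}_k(x) - x\|^2 = (d/k - 1)\,\|x\|^2$; this $d/k$ factor is the source of the variance blow-up discussed in Section 1.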

Now we need to bound the variance term.

Lemma A.2.

For every iteration $t$,

Proof.

Recall the definition of the compressed estimator: each selected coordinate index is drawn i.i.d. from the uniform distribution on $[d]$. Since the indices are i.i.d., we have