Deep neural networks have been highly successful in recent years graves2013generating ; he2016identity ; silver2016mastering ; vaswani2017attention ; zaremba2014recurrent . To achieve state-of-the-art performance, they often have to leverage the computing power of multiple machines during training dean-12 ; xing2015petuum ; zinkevich2010parallelized . Popular approaches include distributed synchronous SGD and its momentum variant SGDM, in which the computational load for evaluating a mini-batch gradient is distributed among the workers. However, their scalability is limited by the possibly overwhelming cost of communicating the gradients and model parameters li2014communication . Let $d$ be the gradient/parameter dimensionality, and $M$ be the number of workers. With 32-bit floats, $64Md$ bits need to be transferred between the workers and server in each iteration.
To mitigate this communication bottleneck, the two common approaches are gradient sparsification and gradient quantization. Gradient sparsification only sends the most significant, information-preserving gradient entries. A heuristic algorithm was first introduced in seide20141 , in which only the large entries are transmitted. On training a neural machine translation model with 4 GPUs, this greatly reduces the communication overhead and achieves a 22% speedup aji2017sparse . Deep gradient compression lin2017deep is another heuristic method that combines gradient sparsification with other techniques such as momentum correction, local gradient clipping, and momentum factor masking, achieving a significant reduction in communication cost. MEM-SGD stich2018sparsified combines top-$k$ sparsification with error correction. By keeping track of the accumulated errors, these can be added back to the gradient estimator before each transmission. MEM-SGD converges at the same rate as SGD on convex problems, whilst reducing the communication overhead by a factor equal to the problem dimensionality.
On the other hand, gradient quantization mitigates the communication bottleneck by lowering the gradient’s floating-point precision with a smaller bit width. 1-bit SGD achieves state-of-the-art results on acoustic modeling while dramatically reducing the communication cost seide20141 ; strom2015scalable . TernGrad wen2017terngrad quantizes the gradients to the ternary levels $\{-1, 0, 1\}$. QSGD alistarh2017qsgd employs stochastic randomized rounding to ensure unbiasedness of the estimator. Recently, Bernstein et al. proposed signSGD with majority vote pmlr-v80-bernstein18a , which only transmits the 1-bit gradient sign between workers and server. A variant using momentum, called signum with majority vote, is also introduced, though without convergence analysis bernstein2019signsgd . Using the majority vote, signSGD achieves a notion of Byzantine fault tolerance bernstein2019signsgd . Moreover, it converges at the same rate as distributed SGD, though it has to rely on the unrealistic assumptions of having a large mini-batch and unimodal symmetric gradient noise. Indeed, signSGD can diverge in some simple cases when these assumptions are violated karimireddy2019error . With only a single worker, this divergence issue can be fixed by using the error correction technique in MEM-SGD, leading to SGD with error-feedback (EF-SGD) karimireddy2019error .
While only a single worker is considered in EF-SGD, we study in this paper the more interesting distributed setting with a parameter server architecture. To ensure efficient communication, we consider two-way gradient compression, in which gradients in both directions (server to/from workers) are compressed. Note that existing works (except signSGD/signum with majority vote pmlr-v80-bernstein18a ; bernstein2019signsgd ) do not compress the aggregated gradients before sending them back to the workers. Moreover, as gradients in a deep network typically have similar magnitudes in each layer, each layer-wise gradient can be sufficiently represented using a sign vector and its average $\ell_1$-norm. This layer-wise (or blockwise in general) compressor achieves nearly 32x reduction in the communication cost. The resultant procedure is called communication-efficient distributed SGD with error-feedback (dist-EF-SGD). Analogous to SGDM, we also propose a stochastic variant dist-EF-SGDM with Nesterov’s momentum nesterov1983method .
The convergence properties of dist-EF-SGD(M) are studied theoretically. For dist-EF-SGD, we provide a bound with general stepsize schedule for a class of compressors (including the commonly used sign-operator and top-$k$ sparsification). In particular, without relying on the unrealistic assumptions in pmlr-v80-bernstein18a ; bernstein2019signsgd , we show that dist-EF-SGD with constant/decreasing/increasing stepsize converges at an optimal rate, which matches that of distributed synchronous SGD. dist-EF-SGDM with constant stepsize also achieves this rate. To the best of our knowledge, these are the first convergence results on two-way gradient compression with Nesterov’s momentum. Experimental results show that the proposed algorithms are efficient without losing prediction accuracy.
Notations. For a vector $u$, $\|u\|_1$ and $\|u\|_2$ are its $\ell_1$- and $\ell_2$-norms, respectively. $\text{sign}(u)$ outputs a vector in which each element is the sign of the corresponding entry of $u$. For two vectors $u, v$, $\langle u, v \rangle$ denotes the dot product. For a function $f$, its gradient is $\nabla f$.
2 Related Work: SGD with Error-Feedback
In machine learning, one is often interested in minimizing the expected risk $F(x) = \mathbb{E}_{\xi}[f(x, \xi)]$, which directly measures the generalization error bottou2004large . Here, $x \in \mathbb{R}^d$ is the model parameter, $\xi$ is drawn from some unknown distribution, and $f(x, \xi)$ is the possibly nonconvex risk due to $x$. When the expectation is taken over a training set of size $n$, the expected risk reduces to the empirical risk.
Recently, Karimireddy et al. karimireddy2019error introduced SGD with error-feedback (EF-SGD), which combines gradient compression with error correction (Algorithm 1). A single machine is considered, which keeps the gradient differences that are not used for the parameter update in the current step. In the next iteration $t$, the accumulated residual $e_t$ is added to the current gradient. The corrected gradient $p_t$ is then fed into a $\delta$-approximate compressor.
Definition 1 ( karimireddy2019error ). An operator $\mathcal{C}: \mathbb{R}^d \to \mathbb{R}^d$ is a $\delta$-approximate compressor for $\delta \in (0, 1]$ if $\|\mathcal{C}(x) - x\|_2^2 \le (1 - \delta)\|x\|_2^2$.
Examples of $\delta$-approximate compressors include the scaled sign operator karimireddy2019error and the top-$k$ operator (which only preserves the $k$ coordinates with the largest absolute values) stich2018sparsified . One can also have randomized compressors that only satisfy Definition 1 in expectation. Obviously, it is desirable to have a large $\delta$ while achieving low communication cost.
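As a minimal sketch of these two compressors (not code from the cited works), the snippet below assumes Definition 1 takes the usual form $\|\mathcal{C}(x) - x\|_2^2 \le (1 - \delta)\|x\|_2^2$ and measures the largest $\delta$ achieved on one input vector. The function names, the helper `achieved_delta`, and the choice of mapping a zero entry to the positive sign are illustrative assumptions.

```python
def sign(x):
    # Assumption: a zero entry is mapped to +1 so each sign fits in one bit.
    return 1.0 if x >= 0 else -1.0

def scaled_sign(v):
    """Scaled sign operator: (||v||_1 / d) * sign(v)."""
    scale = sum(abs(x) for x in v) / len(v)
    return [scale * sign(x) for x in v]

def top_k(v, k):
    """Keep the k entries of largest magnitude and zero out the rest."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]

def achieved_delta(v, cv):
    """Largest delta with ||C(v) - v||^2 <= (1 - delta) * ||v||^2 for this v."""
    err = sum((a - b) ** 2 for a, b in zip(cv, v))
    return 1.0 - err / sum(x * x for x in v)

v = [0.5, -2.0, 0.1, 3.0, -0.2, 0.0, 1.5, -0.7]
delta_sign = achieved_delta(v, scaled_sign(v))
delta_topk = achieved_delta(v, top_k(v, 2))
print(delta_sign, delta_topk)
```

For the scaled sign operator the measured value coincides with $\|v\|_1^2 / (d\|v\|_2^2)$, and for top-$k$ it is the fraction of squared norm carried by the kept entries.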
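EF-SGD achieves the same convergence rate as SGD. To obtain this convergence guarantee, an important observation is that the error-corrected iterate $\tilde{x}_t = x_t - e_t$ satisfies the recurrence $\tilde{x}_{t+1} = \tilde{x}_t - \eta g_t$, where $g_t$ is the stochastic gradient at $x_t$ and $\eta$ is the stepsize. This is similar to the recurrence of SGD, and allows utilizing the convergence proof of SGD once the gradient difference $\|\nabla F(\tilde{x}_t) - \nabla F(x_t)\|_2$ is bounded.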
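The recurrence can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes the single-worker update $p_t = \eta g_t + e_t$, $x_{t+1} = x_t - \mathcal{C}(p_t)$, $e_{t+1} = p_t - \mathcal{C}(p_t)$, uses the scaled sign operator as the compressor, and runs on a toy objective $F(x) = \frac{1}{2}\|x\|_2^2$ with Gaussian gradient noise. All function names are hypothetical.

```python
import random

def scaled_sign(v):
    scale = sum(abs(x) for x in v) / len(v)
    return [scale if x >= 0 else -scale for x in v]

def noisy_grad(x, rng):
    # Stochastic gradient of the toy objective F(x) = 0.5 * ||x||^2.
    return [xi + rng.gauss(0.0, 0.1) for xi in x]

def ef_sgd(x0, eta, steps, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    e = [0.0] * len(x)                                  # accumulated compression error
    max_gap = 0.0
    for _ in range(steps):
        g = noisy_grad(x, rng)
        x_tilde = [xi - ei for xi, ei in zip(x, e)]     # error-corrected iterate
        p = [eta * gi + ei for gi, ei in zip(g, e)]     # corrected gradient
        delta = scaled_sign(p)                          # compressed update
        x = [xi - di for xi, di in zip(x, delta)]
        e = [pi - di for pi, di in zip(p, delta)]       # residual kept for the next step
        # Compare x_tilde_{t+1} with x_tilde_t - eta * g_t.
        nxt = [xi - ei for xi, ei in zip(x, e)]
        ref = [xt - eta * gi for xt, gi in zip(x_tilde, g)]
        max_gap = max(max_gap, max(abs(a - b) for a, b in zip(nxt, ref)))
    return x, max_gap

x_final, gap = ef_sgd([1.0, -2.0, 0.5, 3.0], eta=0.1, steps=200)
print(gap)
```

The reported gap is only floating-point round-off, which reflects that the compressed update and the stored residual always sum to the uncompressed step.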
3 Distributed Blockwise Momentum SGD with Error-Feedback
3.1 Distributed SGD with Error-Feedback
The proposed procedure, which extends EF-SGD to the distributed setting, is shown in Algorithm 2. The computational workload is distributed over $M$ workers. A local accumulated error vector $e_{t,i}$ and a local corrected gradient vector $p_{t,i}$ are stored in the memory of worker $i$. At iteration $t$, worker $i$ pushes the compressed signal $\Delta_{t,i} = \mathcal{C}(p_{t,i})$ to the parameter server. On the server side, all workers’ $\Delta_{t,i}$’s are aggregated and used to update its global error-corrected vector $\tilde{p}_t$. Before sending back the final update direction $\tilde{\Delta}_t$ to each worker, compression is performed to ensure a comparable amount of communication costs between the push and pull operations. Due to gradient compression on the server, we also employ a global accumulated error vector $\tilde{e}_t$. Unlike EF-SGD in Algorithm 1, we do not multiply the gradient by the stepsize $\eta_t$ before compression. The two cases make no difference when $\eta_t$ is constant. However, when the stepsize is changing over time, this would affect convergence. We also rescale the local accumulated error by $\eta_{t-1}/\eta_t$. This modification, together with the use of error correction on both workers and server, allows us to obtain Lemma 1. Because of these differences, note that dist-EF-SGD does not reduce to EF-SGD when $M = 1$. When $\mathcal{C}(\cdot)$ is the identity mapping, dist-EF-SGD reduces to full-precision distributed SGD.
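The procedure described above can be sketched as follows. This is one reading of the description rather than the authors' code: the rescaling factor $\eta_{t-1}/\eta_t$ with $\eta_{-1} = 0$, the order of the worker and server steps, the toy gradient oracle `toy_grads`, and all function names are assumptions made for illustration.

```python
import random

def scaled_sign(v):
    scale = sum(abs(x) for x in v) / len(v)
    return [scale if x >= 0 else -scale for x in v]

def identity(v):
    return list(v)

def toy_grads(num_workers, noise=0.1, seed=0):
    """Hypothetical oracle: noisy gradients of F(x) = 0.5 * ||x||^2, one per worker."""
    def grads_fn(x, t):
        rng = random.Random(seed * 100003 + t)   # noise depends only on the step t
        return [[xj + rng.gauss(0.0, noise) for xj in x] for _ in range(num_workers)]
    return grads_fn

def dist_ef_sgd(x0, grads_fn, etas, num_workers, compress):
    """Two-way compressed SGD with error feedback on the workers and the server."""
    d = len(x0)
    x = list(x0)
    e_local = [[0.0] * d for _ in range(num_workers)]   # worker residuals
    e_server = [0.0] * d                                # server residual
    prev_eta = 0.0                                      # assumed eta_{-1} = 0
    for t, eta in enumerate(etas):
        ratio = prev_eta / eta
        grads = grads_fn(x, t)
        pushed = []
        for i in range(num_workers):
            # Worker: correct the raw gradient (not scaled by eta), compress, push.
            p = [g + ratio * e for g, e in zip(grads[i], e_local[i])]
            delta = compress(p)
            e_local[i] = [pj - dj for pj, dj in zip(p, delta)]
            pushed.append(delta)
        # Server: average, add its own residual, compress, broadcast.
        avg = [sum(col) / num_workers for col in zip(*pushed)]
        p_srv = [a + ratio * e for a, e in zip(avg, e_server)]
        delta_srv = compress(p_srv)
        e_server = [pj - dj for pj, dj in zip(p_srv, delta_srv)]
        # Every worker applies the same compressed direction.
        x = [xj - eta * dj for xj, dj in zip(x, delta_srv)]
        prev_eta = eta
    return x

def dist_sgd(x0, grads_fn, etas, num_workers):
    """Full-precision distributed synchronous SGD, for comparison."""
    x = list(x0)
    for t, eta in enumerate(etas):
        avg = [sum(col) / num_workers for col in zip(*grads_fn(x, t))]
        x = [xj - eta * aj for xj, aj in zip(x, avg)]
    return x

x0 = [1.0, -2.0, 0.5, 3.0]
etas = [0.1 / (t + 1) ** 0.5 for t in range(50)]
x_identity = dist_ef_sgd(x0, toy_grads(4), etas, 4, identity)
x_plain = dist_sgd(x0, toy_grads(4), etas, 4)
x_sign = dist_ef_sgd(x0, toy_grads(4), etas, 4, scaled_sign)
```

With the identity compressor all residuals stay at zero, so `x_identity` and `x_plain` coincide, matching the remark that the method then reduces to full-precision distributed SGD.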
In the following, we investigate the convergence of dist-EF-SGD. We make the following assumptions, which are common in the stochastic approximation literature.
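Assumption 1. $F$ is lower-bounded (i.e., $F_* = \inf_{x \in \mathbb{R}^d} F(x) > -\infty$) and $L$-smooth (i.e., $F(x) \le F(y) + \langle \nabla F(y), x - y \rangle + \frac{L}{2}\|x - y\|_2^2$ for $x, y \in \mathbb{R}^d$).
Assumption 2. The stochastic gradient $g_{t,i}$ has bounded variance: $\mathbb{E}_t[\|g_{t,i} - \nabla F(x_t)\|_2^2] \le \sigma^2$.
Assumption 3. The full gradient $\nabla F$ is uniformly bounded: $\|\nabla F(x_t)\|_2^2 \le \omega^2$.
This implies the bounded second moment, i.e., $\mathbb{E}_t[\|g_{t,i}\|_2^2] \le G^2 := \sigma^2 + \omega^2$.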
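Lemma 1. Consider the error-corrected iterate $\tilde{x}_t = x_t - \eta_{t-1}\big(\tilde{e}_t + \frac{1}{M}\sum_{i=1}^{M} e_{t,i}\big)$, where $x_t$, $\tilde{e}_t$, and $e_{t,i}$’s are generated from Algorithm 2. It satisfies the recurrence: $\tilde{x}_{t+1} = \tilde{x}_t - \eta_t \frac{1}{M}\sum_{i=1}^{M} g_{t,i}$.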
The above Lemma shows that $\tilde{x}_t$ is very similar to the distributed SGD iterate, except that the stochastic gradients are evaluated at $x_t$ instead of $\tilde{x}_t$. This connection allows us to utilize the analysis of full-precision distributed SGD. In particular, we have the following Lemma.
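The recurrence for the error-corrected iterate follows from a short telescoping argument. The derivation below is a sketch that assumes the updates of Algorithm 2 take the form listed in its first lines (our reading of Section 3.1, with the symbols $p_{t,i}$, $\Delta_{t,i}$, $\tilde{p}_t$, $\tilde{\Delta}_t$ introduced here for illustration and $\eta_{-1} = 0$).

```latex
% Assumed updates at iteration t:
%   worker i: p_{t,i} = g_{t,i} + (\eta_{t-1}/\eta_t) e_{t,i},
%             \Delta_{t,i} = C(p_{t,i}),   e_{t+1,i} = p_{t,i} - \Delta_{t,i}
%   server:   \tilde p_t = (1/M) \sum_i \Delta_{t,i} + (\eta_{t-1}/\eta_t) \tilde e_t,
%             \tilde\Delta_t = C(\tilde p_t),   \tilde e_{t+1} = \tilde p_t - \tilde\Delta_t
%   all:      x_{t+1} = x_t - \eta_t \tilde\Delta_t
\begin{align*}
\tilde{x}_{t+1}
  &= x_{t+1} - \eta_t \Big( \tilde{e}_{t+1} + \tfrac{1}{M} \sum_{i=1}^{M} e_{t+1,i} \Big) \\
  &= x_t - \eta_t \tilde{\Delta}_t - \eta_t (\tilde{p}_t - \tilde{\Delta}_t)
     - \tfrac{\eta_t}{M} \sum_{i=1}^{M} (p_{t,i} - \Delta_{t,i}) \\
  &= x_t - \eta_t \tilde{p}_t + \tfrac{\eta_t}{M} \sum_{i=1}^{M} \Delta_{t,i}
     - \tfrac{\eta_t}{M} \sum_{i=1}^{M} p_{t,i} \\
  &= x_t - \eta_{t-1} \tilde{e}_t
     - \tfrac{\eta_t}{M} \sum_{i=1}^{M} \Big( g_{t,i} + \tfrac{\eta_{t-1}}{\eta_t} e_{t,i} \Big) \\
  &= \tilde{x}_t - \tfrac{\eta_t}{M} \sum_{i=1}^{M} g_{t,i}.
\end{align*}
```

The compressed quantities cancel in pairs, which is why rescaling the stored errors by $\eta_{t-1}/\eta_t$ matters when the stepsize changes over time.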
for any .
This implies that by Assumption 1. Given the above results, we can prove convergence of the proposed method by utilizing tools used on the full-precision distributed SGD.
The first term on the RHS shows decay of the initial value. The second term is related to the variance, and the proposed algorithm enjoys variance reduction with more workers. The last term is due to gradient compression. A large $\delta$ (less compression) makes this term smaller and thus convergence faster. Similar to the results in karimireddy2019error , our bound also holds for unbiased compressors of the form $\mathcal{C}(\cdot) = c\,U(\cdot)$, where $\mathbb{E}[U(x)] = x$ and $\mathbb{E}[\|U(x)\|_2^2] \le \frac{1}{c}\|x\|_2^2$ for some $0 < c < 1$. Then, $c\,U(\cdot)$ is a $c$-approximate compressor in expectation.
The following Corollary shows that dist-EF-SGD has a convergence rate of , leading to a iteration complexity for satisfying .
Let stepsize for some . Then, . In comparison, under the same assumptions, distributed synchronous SGD achieves .
Thus, the convergence rate of dist-EF-SGD matches that of distributed synchronous SGD (with full-precision gradients) after iterations, even though gradient compression is used. Moreover, more workers (larger ) leads to faster convergence. Note that the bound above does not reduce to that of EF-SGD when , as we have two-way compression. When , our bound also differs from Remark 4 in karimireddy2019error in that our last term is , while theirs is (which is for single machine with one-way compression). Ours is worse by a factor of , which is the price to pay for two-way compression and a linear speedup of using workers. Moreover, unlike signSGD with majority vote pmlr-v80-bernstein18a , we achieve a convergence rate of without assuming a large mini-batch size () and unimodal symmetric gradient noise.
Theorem 1 only requires for all . This thus allows the use of any decreasing, increasing, or hybrid stepsize schedule. In particular, we have the following Corollary.
Let (decreasing stepsize) with or (increasing stepsize) with . Then, dist-EF-SGD converges to a stationary point at a rate of .
To the best of our knowledge, this is the first such result for distributed compressed SGD with decreasing/increasing stepsize on nonconvex problems. These two stepsize schedules can also be used together. For example, one can use an increasing stepsize at the beginning of training as warm-up, and then a decreasing stepsize afterwards.
3.2 Blockwise Compressor
A commonly used compressor is karimireddy2019error :
$\mathcal{C}(v) = \frac{\|v\|_1}{d}\,\text{sign}(v). \quad (1)$
Compared to using only the sign operator as in signSGD, the factor $\|v\|_1/d$ can preserve the gradient’s magnitude. However, as shown in karimireddy2019error , its $\delta$ in Definition 1 is $\frac{\|v\|_1^2}{d\|v\|_2^2}$, and can be particularly small when $v$ is sparse. When $\delta$ is closer to $1$, the bound in Corollary 1 becomes smaller and thus convergence is faster. In this section, we achieve this by proposing a blockwise extension of (1).
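Specifically, we partition the compressor input $v$ into $B$ blocks, where each block $b$ has $d_b$ elements indexed by $\mathcal{G}_b$. Block $b$ is then compressed with scaling factor $\|v_{\mathcal{G}_b}\|_1/d_b$ (where $v_{\mathcal{G}_b}$ is the subvector of $v$ with elements in block $b$), leading to: $\mathcal{C}_B(v) = \big[\frac{\|v_{\mathcal{G}_1}\|_1}{d_1}\text{sign}(v_{\mathcal{G}_1}), \dots, \frac{\|v_{\mathcal{G}_B}\|_1}{d_B}\text{sign}(v_{\mathcal{G}_B})\big]$. A similar compression scheme, with each layer being a block, is considered in the experiments of karimireddy2019error . However, they provide no theoretical justification.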
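First, Proposition 1 shows that $\mathcal{C}_B$ is also an approximate compressor. For convenience, dist-EF-SGD using a blockwise compressor will be called dist-EF-blockSGD in the sequel. By replacing $\delta$ with the approximation factor given in Proposition 1, the convergence results of dist-EF-SGD can be directly applied.
Proposition 1. Let $[B] = \{1, \dots, B\}$. $\mathcal{C}_B$ is a $\phi(v)$-approximate compressor, where $\phi(v) = \min_{b \in [B]} \frac{\|v_{\mathcal{G}_b}\|_1^2}{d_b\|v_{\mathcal{G}_b}\|_2^2}$.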
There are many ways to partition the gradient into blocks. In practice, one can simply consider each parameter tensor/matrix/vector in the deep network as a block. The intuition is that (i) gradients in the same parameter tensor/matrix/vector typically have similar magnitudes, and (ii) the corresponding scaling factors can thus be tighter than the scaling factor obtained on the whole parameter, leading to a larger $\delta$. As an illustration of (i), Figure 1(a) shows the coefficient of variation (defined as the ratio of the standard deviation to the mean) of the absolute gradient values in each block, averaged over all blocks and iterations in an epoch, obtained from ResNet-20 on the CIFAR-100 dataset (with a mini-batch size of 16 per worker; the detailed experimental setup is in Section 4.1). A value smaller than $1$ indicates that the absolute gradient values in each block concentrate around the mean. As for point (ii) above, consider the case where all the blocks are of the same size ($d_b = \tilde{d}$ for all $b$), elements in the same block have the same magnitude ($|v_i| = c_b$ for all $i \in \mathcal{G}_b$, for some $c_b$), and the magnitude is increasing across blocks ($c_b/c_{b+1} = \alpha$ for some $\alpha < 1$). For the standard compressor in (1), $\delta = \frac{(1+\alpha)(1-\alpha^B)}{(1-\alpha)(1+\alpha^B)B} \approx \frac{1+\alpha}{(1-\alpha)B}$ for a sufficiently large $B$; whereas for the proposed blockwise compressor, the approximation factor is $1$. Figure 1(b) shows the empirical estimates of the two approximation factors in the ResNet-20 experiment. As can be seen, the blockwise compressor attains the larger value.
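To make point (ii) concrete, here is a small sketch of a blockwise scaled sign compressor applied to a structured vector of the kind described above. It assumes contiguous blocks, a geometric growth of magnitudes across blocks with a hypothetical ratio `alpha`, and the usual form of Definition 1; the function names are illustrative and not taken from the paper's code.

```python
def scaled_sign(v):
    scale = sum(abs(x) for x in v) / len(v)
    return [scale if x >= 0 else -scale for x in v]

def blockwise_scaled_sign(v, block_sizes):
    """Apply the scaled sign operator to each contiguous block separately."""
    assert sum(block_sizes) == len(v)
    out, start = [], 0
    for size in block_sizes:
        out.extend(scaled_sign(v[start:start + size]))
        start += size
    return out

def achieved_delta(v, cv):
    """Largest delta with ||C(v) - v||^2 <= (1 - delta) * ||v||^2 for this v."""
    err = sum((a - b) ** 2 for a, b in zip(cv, v))
    return 1.0 - err / sum(x * x for x in v)

# B equal-size blocks, constant magnitude inside a block,
# magnitudes growing geometrically from one block to the next.
B, block_len, alpha = 8, 4, 0.5
v = []
for b in range(B):
    mag = alpha ** (B - 1 - b)
    v.extend([mag if j % 2 == 0 else -mag for j in range(block_len)])

delta_full = achieved_delta(v, scaled_sign(v))
delta_block = achieved_delta(v, blockwise_scaled_sign(v, [block_len] * B))
print(delta_full, delta_block)
```

On this input the blockwise compressor reproduces the vector exactly, while the single global scale loses most of the small-magnitude blocks.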
The per-iteration communication costs of the various distributed algorithms are shown in Table 1. Compared to signSGD with majority vote pmlr-v80-bernstein18a , dist-EF-blockSGD requires an extra $64MB$ bits for transmitting the blockwise scaling factors (each factor is stored in float32 format and transmitted twice in each iteration). By treating each vector/matrix/tensor parameter as a block, $B$ is typically in the order of hundreds. For most problems of interest, $B \ll d$. The reduction in communication cost compared to full-precision distributed SGD is thus nearly 32x.
|algorithm||#bits per iteration|
|full-precision SGD||$64Md$|
|signSGD with majority vote||$2Md$|
|dist-EF-blockSGD||$2Md + 64MB$|
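The bit counts can be reproduced with a few lines of arithmetic. The sketch below assumes float32 values for full precision, one bit per entry for the sign-based methods, one float32 scaling factor per block, and one push plus one pull per worker in each iteration. The model size, worker count, and block count in the usage line are hypothetical values chosen only for illustration.

```python
def bits_per_iteration(d, M, B):
    """Bits exchanged per iteration for d parameters, M workers, and B blocks."""
    full_precision = 2 * 32 * M * d              # float32, push and pull
    sign_majority = 2 * 1 * M * d                # 1 bit per entry, push and pull
    blockwise = sign_majority + 2 * 32 * M * B   # plus one float32 scale per block, sent twice
    return {
        "full_precision_sgd": full_precision,
        "signsgd_majority_vote": sign_majority,
        "dist_ef_blocksgd": blockwise,
    }

# Hypothetical setting: 25M parameters, 8 workers, 160 parameter tensors.
costs = bits_per_iteration(d=25_000_000, M=8, B=160)
reduction = costs["full_precision_sgd"] / costs["dist_ef_blocksgd"]
print(costs, reduction)
```

Because the number of blocks is tiny compared with the number of parameters, the reduction factor stays just below 32.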
3.3 Nesterov’s Momentum
Momentum has been widely used in deep networks sutskever2013importance . Standard distributed SGD with Nesterov’s momentum nesterov1983method and full-precision gradients uses the update: $m_{t,i} = \mu m_{t-1,i} + g_{t,i}$ and $x_{t+1} = x_t - \eta_t \frac{1}{M}\sum_{i=1}^{M}(\mu m_{t,i} + g_{t,i})$, where $m_{t,i}$ is a local momentum vector maintained by each worker $i$ at time $t$ (with $m_{-1,i} = 0$), and $\mu \in [0, 1)$ is the momentum parameter. In this section, we extend the proposed dist-EF-SGD with momentum. Instead of sending the compressed $g_{t,i} + \frac{\eta_{t-1}}{\eta_t}e_{t,i}$ to the server, the compressed $\mu m_{t,i} + g_{t,i} + \frac{\eta_{t-1}}{\eta_t}e_{t,i}$ is sent. The server merges all the workers’ results and sends the merged result back to each worker. The resultant procedure with blockwise compressor is called dist-EF-blockSGDM (Algorithm 3), and has the same communication cost as dist-EF-blockSGD. The corresponding non-block variant is analogous.
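For reference, one synchronous step of full-precision distributed SGD with Nesterov's momentum can be sketched as below. The sketch assumes the commonly used formulation in which each worker keeps a momentum buffer $m \leftarrow \mu m + g$ and contributes the look-ahead direction $\mu m + g$; the function name and the toy numbers are illustrative.

```python
def dist_sgdm_step(x, momenta, grads, eta, mu):
    """One step; momenta and grads each hold one vector per worker."""
    num_workers = len(grads)
    new_momenta, directions = [], []
    for m, g in zip(momenta, grads):
        m_new = [mu * mj + gj for mj, gj in zip(m, g)]                 # local momentum buffer
        directions.append([mu * mj + gj for mj, gj in zip(m_new, g)])  # Nesterov look-ahead
        new_momenta.append(m_new)
    avg = [sum(col) / num_workers for col in zip(*directions)]         # server-side average
    x_new = [xj - eta * aj for xj, aj in zip(x, avg)]
    return x_new, new_momenta

x = [1.0, -1.0]
momenta = [[0.0, 0.0], [0.0, 0.0]]
grads = [[1.0, 2.0], [3.0, 4.0]]
x1, momenta1 = dist_sgdm_step(x, momenta, grads, eta=0.1, mu=0.9)
x_plain, _ = dist_sgdm_step(x, momenta, grads, eta=0.1, mu=0.0)
print(x1, x_plain)
```

Setting `mu=0.0` recovers plain distributed SGD, which is how the momentum-free variant relates to the momentum one.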
Similar to Lemma 1, the following Lemma shows that the error-corrected iterate is very similar to Nesterov’s accelerated gradient iterate, except that the momentum is computed based on .
The error-corrected iterate , where , , and ’s are generated from Algorithm 3, satisfies the recurrence: .
As in Section 3.1, it can be shown that is bounded and . The following Theorem shows the convergence rate of the proposed dist-EF-blockSGDM.
Compared to Theorem 1, using a larger momentum parameter makes the first term (which depends on the initial condition) smaller but a worse variance term (second term) and error term due to gradient compression (last term). Similar to Theorem 1, a larger makes the third term larger. The following Corollary shows that the proposed dist-EF-blockSGDM achieves a convergence rate of .
Let for some . For any , .
4.1 Multi-GPU Experiment on CIFAR-100
In this experiment, we demonstrate that the proposed dist-EF-blockSGDM and dist-EF-blockSGD ($\mu = 0$ in Algorithm 3), though using fewer bits for gradient transmission, still have good convergence. For faster experimentation, we use a single node with multiple GPUs (an AWS P3.16 instance with 8 Nvidia V100 GPUs, each GPU being a worker) instead of a distributed setting. Note that the convergence w.r.t. the number of epochs is the same in both the distributed and multi-GPU settings. Convergence w.r.t. time in a truly distributed setting will be studied in Section 4.2.
The experiment is performed on the CIFAR-100 dataset, with 50K training images and 10K test images. We use a 20-layer ResNet he2016identity . Each parameter tensor/matrix/vector is treated as a block in dist-EF-blockSGDM. It is compared with (i) distributed synchronous SGD (with full-precision gradient); (ii) distributed synchronous SGD (full-precision gradient) with momentum (SGDM); (iii) signSGD with majority vote pmlr-v80-bernstein18a ; and (iv) signum with majority vote bernstein2019signsgd . All the algorithms are implemented in MXNet. We vary the mini-batch size per worker in $\{8, 16, 32\}$. Results are averaged over 3 repetitions. More details of the experiments are shown in Appendix A.1.
Figure 2 shows convergence of the testing accuracy w.r.t. the number of epochs. As can be seen, dist-EF-blockSGD converges as fast as distributed SGD and has slightly better accuracy, while signSGD performs poorly. In particular, dist-EF-blockSGD is robust to the mini-batch size, while the performance of signSGD degrades with smaller mini-batch size (which agrees with the results in pmlr-v80-bernstein18a ). Momentum does not offer SGD and dist-EF-blockSGD obvious acceleration, but significantly improves signSGD. However, signum is still much worse than SGDM and dist-EF-blockSGDM.
4.2 Distributed Training on ImageNet
In this section, we perform distributed optimization on ImageNet russakovsky2015imagenet using a 50-layer ResNet. Each worker is an AWS P3.2 instance with 1 GPU, and the parameter server is housed in one node. We use the publicly available code (https://github.com/PermiJW/signSGD-with-Majority-Vote) in bernstein2019signsgd , and the default communication library Gloo (https://github.com/facebookincubator/gloo) in PyTorch. As in bernstein2019signsgd , we use its allreduce implementation for SGDM, which is faster.
As signum is much better than signSGD in Section 4.1, we only compare the momentum variants here. The proposed dist-EF-blockSGDM is compared with (i) distributed synchronous SGD with momentum (SGDM); and (ii) signum with majority vote bernstein2019signsgd . The number of workers is varied in $\{7, 15\}$. With an odd number of workers, a majority vote will not produce zero, and so signum does not lose accuracy by using 1-bit compression. More details of the setup are in Appendix A.2.
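The remark about odd worker counts is easy to see in a small sketch of element-wise majority voting. The function below is an illustration, not the implementation used in the experiments; it assumes every worker sends signs in $\{-1, +1\}$.

```python
def majority_vote(sign_vectors):
    """Element-wise majority vote over the workers' sign vectors."""
    totals = [sum(col) for col in zip(*sign_vectors)]
    return [(t > 0) - (t < 0) for t in totals]   # 0 marks a tie

odd_workers = [[1, -1, 1], [-1, -1, 1], [1, 1, -1]]
even_workers = odd_workers + [[1, 1, -1]]
vote_odd = majority_vote(odd_workers)
vote_even = majority_vote(even_workers)
print(vote_odd, vote_even)
```

A sum of an odd number of $\pm 1$ values is odd and hence never zero, so ties can only occur when the number of workers is even.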
Figure 3 shows the testing accuracy w.r.t. the number of epochs and wall clock time. As in Section 4.1, SGDM and dist-EF-blockSGDM have comparable accuracies, while signum is inferior. When 7 workers are used, dist-EF-blockSGDM has higher accuracy than SGDM (76.77% vs 76.27%). dist-EF-blockSGDM reaches SGDM’s highest accuracy in around 13 hours, while SGDM takes 24 hours (Figure 3(b)), leading to a speedup. With 15 machines, the improvement is smaller (Figure 3(e)). This is because the burden on the parameter server is heavier. We expect that a speedup comparable to the 7-worker setting can be obtained by using more parameter servers. In both cases, signum converges fast but its test accuracy is worse.
Figures 3(c) and 3(f) show a breakdown of wall clock time into computation and communication time (following bernstein2019signsgd , communication time includes the extra computation time for error feedback and compression). All methods have comparable computation costs, but signum and dist-EF-blockSGDM have lower communication costs than SGDM. The communication costs for signum and dist-EF-blockSGDM are comparable for 7 workers, but for 15 workers signum is lower. We speculate that this is because the sign vectors and scaling factors are sent separately to the server in our implementation, which causes more latency on the server with more workers. This may be alleviated if the two operations are fused.
In this paper, we proposed a distributed blockwise SGD algorithm with error feedback and momentum. By partitioning the gradients into blocks, we can transmit each block of gradient using 1-bit quantization with its average $\ell_1$-norm. The proposed methods are communication-efficient and have the same convergence rates as full-precision distributed SGD/SGDM for nonconvex objectives. Experimental results show that the proposed methods have fast convergence and achieve the same test accuracy as SGD/SGDM, while signSGD and signum only achieve much worse accuracies.
- (1) A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 440–445, 2017.
- (2) D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient sgd via gradient quantization and encoding. In Proceedings of the Neural Information Processing Systems Conference, pages 1709–1720, 2017.
- (3) J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proceedings of the International Conference on Machine Learning, pages 560–569, 2018.
- (4) J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. In Proceedings of the International Conference on Learning Representations, 2019.
- (5) L. Bottou and Y. Lecun. Large scale online learning. In Proceedings of the Neural Information Processing Systems Conference, pages 217–224, 2004.
- (6) J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, and A. Ng. Large scale distributed deep networks. In Proceedings of the Neural Information Processing Systems Conference, pages 1223–1231, 2012.
- (7) A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- (8) K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision, pages 630–645, 2016.
- (9) S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi. Error feedback fixes signSGD and other gradient compression schemes. arXiv preprint arXiv:1901.09847, 2019.
- (10) M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In Proceedings of the Neural Information Processing Systems Conference, pages 19–27, 2014.
- (11) Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.
- (12) Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
- (13) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
- (14) F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proceedings of the Annual Conference of the International Speech Communication Association, 2014.
- (15) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- (16) S. U. Stich, J. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In Proceedings of the Neural Information Processing Systems Conference, pages 4452–4463, 2018.
- (17) N. Strom. Scalable distributed dnn training using commodity gpu cloud computing. In Proceedings of the Annual Conference of the International Speech Communication Association, 2015.
- (18) I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning, pages 1139–1147, 2013.
- (19) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Proceedings of the Neural Information Processing Systems Conference, pages 5998–6008, 2017.
- (20) W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Proceedings of the Neural Information Processing Systems Conference, pages 1509–1519, 2017.
- (21) E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.
- (22) W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
- (23) M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Proceedings of the Neural Information Processing Systems Conference, pages 2595–2603, 2010.
Appendix A Experimental Setup
As weight decay is the same for all the machines, it is not necessary to compress it. In the experiment, for dist-EF-blockSGD, the weight decay is not added to , instead, we add it to . For dist-EF-blockSGDM, we maintain an extra momentum for weight decay on each machine. Specifically, we perform the following update on each worker:
where $\lambda$ is the weight decay parameter. In the experiment, the sign is mapped to $\{-1, 1\}$ and takes 1 bit. Note that the gradient sign has zero probability of being zero.
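A one-bit encoding of the signs can be sketched with plain byte packing. This is an illustrative sketch rather than the encoding used in the experiments; the bit order and the choice of treating an exact zero as positive are assumptions.

```python
def pack_signs(v):
    """Pack the signs of v into bytes: bit 1 for non-negative, bit 0 for negative."""
    packed = bytearray((len(v) + 7) // 8)
    for i, x in enumerate(v):
        if x >= 0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)

def unpack_signs(packed, n):
    """Recover a list of n signs in {-1, +1} from the packed bytes."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(n)]

g = [0.3, -1.2, 0.0, 4.0, -0.1, -2.0, 0.7, 0.9, -5.0, 1.0]
payload = pack_signs(g)
signs = unpack_signs(payload, len(g))
print(len(payload), signs)
```

Ten gradient entries fit in two bytes here, compared with forty bytes when sent as float32 values.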
A.1 Setup: Multi-GPU Experiment on CIFAR-100
Each algorithm is run for 200 epochs. The weight decay parameter is fixed to , and the momentum parameter . We only tune the initial stepsize, using a validation set with 5K images that is carved out from the training set. For dist-EF-blockSGD/dist-EF-blockSGDM, we use the stepsize tuned for SGD/SGDM. The stepsize is divided by at the th and th epochs. The stepsize with the best validation set performance is used to run the algorithm on the full training set. When mini-batch size is per worker, for both SGD and SGDM, the stepsize is tuned from , and for signSGD and signum, the stepsize is chosen from . When we obtain the best stepsizes tuned with mini-batch size 16 per worker, we compare them to the ones that are divided and multiplied both by 2 for mini-batch sizes 8 and 32 per worker, respectively. The best stepsizes are listed in Table 2
|mini-batch size per worker|
A.2 Setup: Distributed Training on ImageNet
We use the default hyperparameters for SGDM and signum in the code base, which have been tuned for the ImageNet experiment in. Specifically, the momentum parameter is , and weight decay parameter is . A mini-batch size of 128 per worker is employed.
For SGDM, we use (used for SGDM on the ImageNet experiment in the code base). For signum, (used for signum on the ImageNet experiment in the code base) on 7 workers and on 15 workers. For dist-EF-blockSGDM, we also use and a weight decay of . Its stepsize is 555We observe that is too large for dist-EF-blockSGDM, while SGDM with performs worse than SGDM with . for 7 workers and for 15 workers.
Suppose that for any sequence . Consider the error-corrected iterate , it satisfies the recurrence:
Appendix C Proof of Theorem 1
By the smoothness of the function , we have
where in the second equality we use Lemma 1, and the second-to-last inequality follows the fact . In the last inequality, we use the variance bound of the mini-batch gradient, i.e., . Then, we get