1 Introduction
Deep neural networks have been highly successful in recent years [7, 8, 15, 19, 22]. To achieve state-of-the-art performance, they often have to leverage the computing power of multiple machines during training [6, 21, 23]. Popular approaches include distributed synchronous SGD and its momentum variant SGDM, in which the computational load for evaluating a minibatch gradient is distributed among the workers. However, scalability is limited by the possibly overwhelming cost of communicating gradients and model parameters [10]. With d the gradient/parameter dimensionality and n the number of workers, O(nd) bits need to be transferred between the workers and the server in each iteration.
To mitigate this communication bottleneck, the two common approaches are gradient sparsification and gradient quantization. Gradient sparsification sends only the most significant, information-preserving gradient entries. A heuristic algorithm was first introduced in [14], in which only the large entries are transmitted. On training a neural machine translation model with 4 GPUs, this greatly reduces the communication overhead and achieves a 22% speedup [1]. Deep gradient compression [11] is another heuristic method that combines gradient sparsification with techniques such as momentum correction, local gradient clipping, and momentum factor masking, achieving a significant reduction in communication cost. MEM-SGD [16] combines top-k sparsification with error correction. By keeping track of the accumulated errors, these can be added back to the gradient estimator before each transmission. MEM-SGD converges at the same rate as SGD on convex problems, while reducing the communication overhead by a factor equal to the problem dimensionality.
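The mechanism behind MEM-SGD can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function names (`top_k`, `mem_sgd_step`) and the single-worker setting are assumptions made for clarity.

```python
import numpy as np

def top_k(x, k):
    """Keep only the k entries of largest magnitude; zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def mem_sgd_step(x, grad, memory, lr, k):
    """One MEM-SGD step: add the accumulated error back to the scaled
    gradient, transmit only the top-k entries, and keep the residual."""
    corrected = lr * grad + memory   # error correction before compression
    update = top_k(corrected, k)     # the part that is actually transmitted
    new_memory = corrected - update  # residual remembered for the next step
    return x - update, new_memory
```

The key invariant is that nothing is discarded permanently: the transmitted update plus the retained residual always equals the error-corrected gradient.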
On the other hand, gradient quantization mitigates the communication bottleneck by lowering the gradient's floating-point precision to a smaller bit width. 1-bit SGD achieves state-of-the-art results on acoustic modeling while dramatically reducing the communication cost [14, 17]. TernGrad [20] quantizes the gradients to the three ternary levels {-1, 0, 1}. QSGD [2] employs stochastic randomized rounding to ensure unbiasedness of the estimator. Recently, Bernstein et al. proposed signSGD with majority vote [3], which transmits only the 1-bit gradient sign between workers and server. A variant with momentum, called signum with majority vote, was also introduced, though without convergence analysis [4]. Using the majority vote, signSGD achieves a notion of Byzantine fault tolerance [4]. Moreover, it converges at the same rate as distributed SGD, though it has to rely on the unrealistic assumptions of a large minibatch and unimodal symmetric gradient noise. Indeed, signSGD can diverge in some simple cases when these assumptions are violated [9]. With a single worker, this divergence issue can be fixed by the error correction technique in MEM-SGD, leading to SGD with error-feedback (EF-SGD) [9].
While only a single worker is considered in EF-SGD, in this paper we study the more interesting distributed setting with a parameter server architecture. To ensure efficient communication, we consider two-way gradient compression, in which gradients in both directions (server to/from workers) are compressed. Note that existing works (except signSGD/signum with majority vote [3, 4]) do not compress the aggregated gradients before sending them back to the workers. Moreover, as gradients in a deep network typically have similar magnitudes in each layer, each layerwise gradient can be sufficiently represented by a sign vector together with its average ℓ1-norm. This layerwise (or, in general, blockwise) compressor achieves nearly 32x reduction in the communication cost. The resultant procedure is called communication-efficient distributed SGD with error-feedback (dist-EF-SGD). Analogous to SGDM, we also propose a stochastic variant dist-EF-SGDM with Nesterov's momentum [12]. The convergence properties of dist-EF-SGD(M) are studied theoretically. For dist-EF-SGD, we provide a bound with a general stepsize schedule for a class of compressors (including the commonly used sign operator and top-k sparsification). In particular, without relying on the unrealistic assumptions in [3, 4], we show that dist-EF-SGD with constant/decreasing/increasing stepsizes converges at an optimal O(1/√(nT)) rate, which matches that of distributed synchronous SGD. Dist-EF-SGDM with a constant stepsize also achieves this rate. To the best of our knowledge, these are the first convergence results on two-way gradient compression with Nesterov's momentum. Experimental results show that the proposed algorithms are efficient without losing prediction accuracy.
Notations. For a vector x, ‖x‖₁ and ‖x‖₂ are its ℓ1- and ℓ2-norms, respectively. sign(x) outputs a vector in which each element is the sign of the corresponding entry of x. For two vectors x and y, x⊤y denotes their dot product. For a function f, its gradient is ∇f.
2 Related Work: SGD with Error-Feedback
In machine learning, one is often interested in minimizing the expected risk F(x) = E_ξ[f(x; ξ)], which directly measures the generalization error [5]. Here, x is the model parameter, ξ is drawn from some unknown distribution, and f(·; ξ) is the (possibly nonconvex) risk induced by ξ. When the expectation is taken over a training set of finite size, the expected risk reduces to the empirical risk.
Recently, Karimireddy et al. [9] introduced SGD with error-feedback (EF-SGD), which combines gradient compression with error correction (Algorithm 1). A single machine is considered, which keeps the portion of the (scaled) gradient that is not used for the parameter update in the current step. In the next iteration, this accumulated residual is added to the current gradient. The corrected gradient is then fed into an approximate compressor.
Definition 1.
[9] An operator C: ℝ^d → ℝ^d is a δ-approximate compressor for δ ∈ (0, 1] if ‖C(x) − x‖₂² ≤ (1 − δ)‖x‖₂² for all x ∈ ℝ^d.
Examples of δ-approximate compressors include the scaled sign operator (‖x‖₁/d) sign(x) [9] and the top-k operator (which preserves only the k coordinates with the largest absolute values) [16]. One can also have randomized compressors that satisfy Definition 1 only in expectation. Obviously, it is desirable to have a large δ while achieving low communication cost.
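Definition 1 is easy to check numerically. The sketch below implements the scaled sign compressor and measures its empirical contraction factor; the helper names (`scaled_sign`, `delta_of`) are illustrative, not from the paper.

```python
import numpy as np

def scaled_sign(x):
    """The scaled sign compressor: (||x||_1 / d) * sign(x)."""
    d = x.size
    return (np.linalg.norm(x, 1) / d) * np.sign(x)

def delta_of(x, cx):
    """Empirical delta in Definition 1: ||C(x)-x||^2 <= (1-delta)||x||^2."""
    return 1.0 - np.linalg.norm(cx - x) ** 2 / np.linalg.norm(x) ** 2
```

For the scaled sign operator, this empirical value equals the closed form ‖x‖₁²/(d‖x‖₂²) discussed in Section 3.2, and lies in (0, 1] by the Cauchy-Schwarz inequality.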
EF-SGD achieves the same O(1/√T) rate as SGD. To obtain this convergence guarantee, an important observation is that the error-corrected iterate x̃_t = x_t − e_t (where e_t is the accumulated error) satisfies the recurrence x̃_{t+1} = x̃_t − η g_t, which is similar to that of SGD. This allows the convergence proof of SGD to be reused, after bounding the difference between the gradients at x_t and x̃_t.
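The single-machine EF-SGD loop can be sketched in a few lines. This is a minimal illustration on a deterministic quadratic, with the scaled sign compressor; the function name `ef_sgd` and the toy objective are assumptions for the example, not the paper's setup.

```python
import numpy as np

def ef_sgd(grad_fn, x0, lr=0.1, steps=300):
    """EF-SGD sketch: compress the error-corrected (scaled) gradient,
    update the iterate with it, and keep the residual for next time."""
    x = x0.astype(float)
    e = np.zeros_like(x)
    for _ in range(steps):
        p = lr * grad_fn(x) + e                            # error-corrected step
        c = (np.linalg.norm(p, 1) / p.size) * np.sign(p)   # scaled sign compressor
        x = x - c                                          # apply compressed step
        e = p - c                                          # residual fed back later
    return x
```

On f(x) = ½‖x‖₂² (gradient x), the iterates converge to the minimizer even though each transmitted step is heavily compressed.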
3 Distributed Blockwise Momentum SGD with Error-Feedback
3.1 Distributed SGD with Error-Feedback
The proposed procedure, which extends EF-SGD to the distributed setting, is shown in Algorithm 2. The computational workload is distributed over n workers. A local accumulated error vector and a local corrected gradient vector are stored in the memory of each worker i. At iteration t, worker i pushes its compressed error-corrected gradient to the parameter server. On the server side, all the workers' compressed vectors are aggregated and used to update the server's global error-corrected vector. Before sending the final update direction back to each worker, compression is performed again to ensure comparable communication costs between the push and pull operations. Because of this gradient compression on the server, we also employ a global accumulated error vector. Unlike EF-SGD in Algorithm 1, we do not multiply the gradient by the stepsize before compression. The two choices make no difference when the stepsize η_t is constant; however, when the stepsize changes over time, this affects convergence. We also rescale the local accumulated error by the ratio of consecutive stepsizes. This modification, together with the use of error correction on both workers and server, allows us to obtain Lemma 1. Because of these differences, note that dist-EF-SGD does not reduce to EF-SGD when n = 1. When the compressor is the identity mapping, dist-EF-SGD reduces to full-precision distributed SGD.
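The push-pull round just described can be sketched as a single function. This is a simulation-style illustration under stated assumptions (a constant stepsize, so the error rescaling is omitted, and the nonblock scaled sign compressor); the names `compress` and `dist_ef_sgd_step` are hypothetical.

```python
import numpy as np

def compress(x):
    """Scaled sign compressor (nonblock version)."""
    return (np.linalg.norm(x, 1) / x.size) * np.sign(x)

def dist_ef_sgd_step(x, grads, e_local, e_server, lr):
    """One round of two-way compression with error feedback: workers
    push compressed error-corrected gradients; the server aggregates,
    compresses again, and pushes back. Both sides keep residuals."""
    n = len(grads)
    deltas = []
    for i in range(n):
        p = grads[i] + e_local[i]      # worker-side error correction
        deltas.append(compress(p))     # pushed to the server
        e_local[i] = p - deltas[i]     # worker residual
    agg = sum(deltas) / n + e_server   # server-side error correction
    v = compress(agg)                  # pulled back by every worker
    e_server = agg - v                 # server residual
    return x - lr * v, e_local, e_server
```

One can verify by hand that the error-corrected iterate (the parameter minus the stepsize times the sum of all residuals) follows the plain distributed SGD recurrence, which is exactly the content of Lemma 1.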
In the following, we investigate the convergence of dist-EF-SGD. We make the following assumptions, which are common in the stochastic approximation literature.
Assumption 1.
f is lower-bounded (i.e., f(x) ≥ f* > −∞ for all x) and L-smooth (i.e., ‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂ for all x, y).
Assumption 2.
The stochastic gradient g_{i,t} computed by each worker i is unbiased (i.e., E[g_{i,t}] = ∇f(x_t)) with bounded variance: E‖g_{i,t} − ∇f(x_t)‖₂² ≤ σ².
Assumption 3.
The full gradient is uniformly bounded: ‖∇f(x_t)‖₂ ≤ G.
Together with Assumption 2, this implies a bounded second moment, i.e., E‖g_{i,t}‖₂² ≤ σ² + G².
Lemma 1.
Consider the error-corrected iterate x̃_t = x_t − (1/n)∑_{i=1}^n e_{i,t} − ē_t, where e_{i,t} is worker i's local accumulated error, ē_t is the server's global accumulated error, and both are generated by Algorithm 2. It satisfies the recurrence x̃_{t+1} = x̃_t − (η_t/n)∑_{i=1}^n g_{i,t}, where g_{i,t} is worker i's stochastic gradient at x_t.
The above Lemma shows that x̃_t behaves very similarly to the distributed SGD iterate, except that the stochastic gradients are evaluated at x_t instead of x̃_t. This connection allows us to utilize the analysis of full-precision distributed SGD. In particular, we have the following Lemma.
Lemma 2.
E‖x_t − x̃_t‖₂² = O(η_t²) for any t.
By the smoothness in Assumption 1, this implies that E‖∇f(x_t) − ∇f(x̃_t)‖₂² = O(η_t²). Given the above results, we can prove convergence of the proposed method by utilizing tools from the analysis of full-precision distributed SGD.
Theorem 1.
The first term on the RHS of the bound reflects the decay of the initial condition. The second term is related to the variance, and the proposed algorithm enjoys variance reduction with more workers. The last term is due to gradient compression: a larger δ (less compression) makes this term smaller and thus convergence faster. Similar to the results in [9], our bound also holds for unbiased compressors of the form C(x)/ζ, where E[C(x)] = x and E‖C(x)‖₂² ≤ ζ‖x‖₂² for some ζ ≥ 1; C(x)/ζ is then a (1/ζ)-approximate compressor in expectation.
The following Corollary shows that dist-EF-SGD has a convergence rate of O(1/√(nT)), leading to an O(1/(nε⁴)) iteration complexity for achieving min_t E‖∇f(x_t)‖₂² ≤ ε².
Corollary 1.
Let the stepsize be constant, η_t = η = Θ(√(n/T)), for a sufficiently large T. Then, min_{t=1,…,T} E‖∇f(x_t)‖₂² = O(1/√(nT)). In comparison, under the same assumptions, distributed synchronous SGD achieves the same O(1/√(nT)) rate.
Thus, the convergence rate of dist-EF-SGD matches that of distributed synchronous SGD (with full-precision gradients) after sufficiently many iterations, even though gradient compression is used. Moreover, more workers (larger n) lead to faster convergence. Note that the bound above does not reduce to that of EF-SGD when n = 1, as we have two-way compression. When n = 1, our bound also differs from Remark 4 in [9] in that our last (compression-related) term decays more slowly than theirs (which is for a single machine with one-way compression); this gap is the price to pay for two-way compression and a linear speedup with n workers. Moreover, unlike signSGD with majority vote [3], we achieve a convergence rate of O(1/√(nT)) without assuming a large minibatch size or unimodal symmetric gradient noise.
Theorem 1 imposes only a mild condition on the stepsize η_t at each iteration t. This thus allows the use of any decreasing, increasing, or hybrid stepsize schedule. In particular, we have the following Corollary.
Corollary 2.
Let the stepsize be polynomially decreasing (η_t ∝ 1/t^p) or polynomially increasing (η_t ∝ t^p) with a suitable exponent p. Then, dist-EF-SGD converges to a stationary point at the same asymptotic rate (up to logarithmic factors).
To the best of our knowledge, this is the first such result for distributed compressed SGD with decreasing/increasing stepsizes on nonconvex problems. The two schedules can also be combined; for example, one can use an increasing stepsize at the beginning of training as warmup, and a decreasing stepsize afterwards.
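A hybrid warmup-then-decay schedule of the kind described above can be written as a small function. This is an illustrative sketch only; the linear warmup, the 1/√t decay, and all parameter values are assumptions, not the paper's tuned schedule.

```python
def hybrid_stepsize(t, warmup=100, base_lr=0.1):
    """Linear warmup for the first `warmup` iterations (increasing
    stepsize), then 1/sqrt(t) decay (decreasing stepsize). The two
    phases are continuous at t = warmup."""
    if t <= warmup:
        return base_lr * t / warmup
    return base_lr * (warmup ** 0.5) / (t ** 0.5)
```

Both phases are of the polynomial form covered by Corollary 2, so each is individually admissible under Theorem 1's per-iteration stepsize condition.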
3.2 Blockwise Compressor
A commonly used compressor is [9]:

C(x) = (‖x‖₁/d) sign(x).  (1)

Compared to using only the sign operator as in signSGD, the factor ‖x‖₁/d preserves the gradient's magnitude. However, as shown in [9], its δ in Definition 1 is ‖x‖₁²/(d‖x‖₂²), which can be particularly small when x is sparse. As δ gets closer to 1, the bound in Corollary 1 becomes smaller and convergence faster. In this section, we achieve this by proposing a blockwise extension of (1).
Specifically, we partition the compressor input x into B blocks, where block b has d_b elements indexed by the set I_b. Block b is then compressed with scaling factor ‖x_{I_b}‖₁/d_b (where x_{I_b} is the subvector of x with elements in block b), so that the restriction of the output C_B(x) to block b is (‖x_{I_b}‖₁/d_b) sign(x_{I_b}). A similar compression scheme, with each layer being a block, is considered in the experiments of [9]; however, no theoretical justification is provided there.
First, Proposition 1 shows that C_B is also an approximate compressor. For convenience, dist-EF-SGD with a blockwise compressor will be called dist-EF-blockSGD in the sequel. By replacing δ with the δ_B of Proposition 1, the convergence results of dist-EF-SGD can be directly applied.
Proposition 1.
Let δ_b = ‖x_{I_b}‖₁²/(d_b‖x_{I_b}‖₂²). C_B is a δ_B-approximate compressor, where δ_B = min_b δ_b.
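The blockwise compressor is a direct generalization of (1) and is easy to implement. The sketch below is an illustration, assuming `blocks` is given as a list of index arrays partitioning the input; the function name is hypothetical.

```python
import numpy as np

def blockwise_scaled_sign(x, blocks):
    """Compress each block with its own scaling factor ||x_b||_1 / d_b."""
    out = np.empty_like(x)
    for idx in blocks:
        xb = x[idx]
        out[idx] = (np.linalg.norm(xb, 1) / xb.size) * np.sign(xb)
    return out
```

When entries within a block share the same magnitude but magnitudes differ across blocks, each block is compressed exactly (δ_B = 1), whereas a single global scaling factor is far from every block.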
There are many ways to partition the gradient into blocks. In practice, one can simply treat each parameter tensor/matrix/vector in the deep network as a block. The intuition is that (i) gradients in the same parameter tensor/matrix/vector typically have similar magnitudes, and (ii) the corresponding per-block scaling factors can thus be tighter than a single scaling factor computed over the whole parameter vector, leading to a larger δ. As an illustration of (i), Figure 1(a) shows the coefficient of variation (the ratio of the standard deviation to the mean) of the absolute gradient values, averaged over all blocks and iterations in an epoch, obtained from ResNet-20 on the CIFAR-100 dataset (with a minibatch size of 16 per worker; the detailed experimental setup is in Section 4.1). A value smaller than 1 indicates that the absolute gradient values in each block concentrate around their mean. As for point (ii), consider the case where all blocks have the same size, elements in the same block have the same magnitude, and the magnitude increases across blocks. For the standard compressor in (1), δ can be much smaller than 1 when the number of blocks is sufficiently large; whereas for the proposed blockwise compressor, δ_B = 1, as each block is compressed exactly. Figure 1(b) shows empirical estimates of δ and δ_B in the ResNet-20 experiment. As can be seen, δ_B is consistently much larger than δ.

The per-iteration communication costs of the various distributed algorithms are shown in Table 1. Compared to signSGD with majority vote [3], dist-EF-blockSGD requires an extra 64B bits for transmitting the blockwise scaling factors (each factor is stored in float32 format and transmitted twice in each iteration). By treating each vector/matrix/tensor parameter as a block, B is typically on the order of hundreds, so for most problems of interest d ≫ 64B. The reduction in communication cost compared to full-precision distributed SGD is thus nearly 32x.
algorithm | #bits per iteration

full-precision SGD | 64nd
signSGD with majority vote | 2nd
dist-EF-blockSGD | 2n(d + 32B)
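The communication accounting above is simple arithmetic. The sketch below makes it explicit under the stated assumptions (float32 full-precision values, 1 bit per sign, one float32 scaling factor per block, every message sent both ways); the function name and the example sizes are illustrative only.

```python
def bits_per_iteration(d, n, num_blocks):
    """Per-iteration communication cost (in bits) of full-precision
    distributed SGD vs dist-EF-blockSGD, as counted in Table 1."""
    full_precision = 2 * 32 * n * d             # push + pull of 32-bit floats
    block_sign = 2 * n * (d + 32 * num_blocks)  # signs plus per-block scales
    return full_precision, block_sign
```

With a million-dimensional model, hundreds of blocks contribute only a few thousand extra bits per worker, so the saving stays just below 32x.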
3.3 Nesterov’s Momentum
Momentum has been widely used in deep networks [18]. Standard distributed SGD with Nesterov's momentum [12] and full-precision gradients maintains on each worker i a local momentum vector m_{i,t} = μ m_{i,t−1} + g_{i,t} (with m_{i,0} = 0), where μ ∈ [0, 1) is the momentum parameter, and updates the parameter along the average of the workers' directions μ m_{i,t} + g_{i,t}. In this section, we extend the proposed dist-EF-SGD with momentum. Instead of sending the compressed error-corrected gradient to the server, each worker sends its compressed error-corrected momentum direction. The server merges all the workers' results and sends the (compressed) aggregate back to each worker. The resultant procedure with the blockwise compressor is called dist-EF-blockSGDM (Algorithm 3), and has the same communication cost as dist-EF-blockSGD. The corresponding nonblock variant is analogous.
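The worker-side step of the momentum variant can be sketched as follows. This is an illustration only, using the nonblock scaled sign compressor; the function name `worker_push` and the exact direction being compressed are assumptions made for clarity, not the paper's Algorithm 3 verbatim.

```python
import numpy as np

def worker_push(grad, m, e, mu):
    """Worker-side momentum step (sketch): update the local momentum,
    error-correct the Nesterov-style direction, and compress it."""
    m = mu * m + grad                    # momentum accumulation
    p = mu * m + grad + e                # Nesterov direction + residual
    delta = (np.linalg.norm(p, 1) / p.size) * np.sign(p)  # pushed to server
    return delta, m, p - delta           # transmitted, momentum, new residual
```

With μ = 0, this reduces to the plain error-corrected gradient push of dist-EF-SGD, mirroring the relation between Algorithms 2 and 3.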
Similar to Lemma 1, the following Lemma shows that the error-corrected iterate x̃_t is very similar to Nesterov's accelerated gradient iterate, except that the momentum is computed from gradients evaluated at x_t.
Lemma 3.
The error-corrected iterate x̃_t = x_t − (1/n)∑_{i=1}^n e_{i,t} − ē_t, with all quantities generated from Algorithm 3, satisfies the recurrence x̃_{t+1} = x̃_t − (η/n)∑_{i=1}^n (μ m_{i,t} + g_{i,t}).
As in Section 3.1, it can be shown that E‖x_t − x̃_t‖₂² remains bounded and of order η². The following Theorem shows the convergence rate of the proposed dist-EF-blockSGDM.
Theorem 2.
Compared to Theorem 1, a larger momentum parameter μ makes the first term (which depends on the initial condition) smaller, but worsens the variance term (second term) and the error term due to gradient compression (last term). As in Theorem 1, more aggressive compression (smaller δ_B) makes the compression term larger. The following Corollary shows that the proposed dist-EF-blockSGDM also achieves a convergence rate of O(1/√(nT)).
Corollary 3.
Let the stepsize be constant, η_t = η = Θ(√(n/T)). Then, for any sufficiently large T, min_{t=1,…,T} E‖∇f(x_t)‖₂² = O(1/√(nT)).
4 Experiments
4.1 Multi-GPU Experiment on CIFAR-100
In this experiment, we demonstrate that the proposed dist-EF-blockSGDM and dist-EF-blockSGD (μ = 0 in Algorithm 3), though using far fewer bits for gradient transmission, still converge well. For faster experimentation, we use a single node with multiple GPUs (an AWS P3.16 instance with 8 Nvidia V100 GPUs, each GPU being a worker) instead of a distributed setting. Note that convergence w.r.t. the number of epochs is the same in both the distributed and multi-GPU settings. Convergence w.r.t. time in a truly distributed setting will be studied in Section 4.2.
The experiment is performed on the CIFAR-100 dataset, with 50K training images and 10K test images. We use a 20-layer ResNet [8]. Each parameter tensor/matrix/vector is treated as a block in dist-EF-blockSGDM. It is compared with (i) distributed synchronous SGD (with full-precision gradients); (ii) distributed synchronous SGD (full-precision gradients) with momentum (SGDM); (iii) signSGD with majority vote [3]; and (iv) signum with majority vote [4]. All the algorithms are implemented in MXNet. We vary the minibatch size per worker in {8, 16, 32}. Results are averaged over 3 repetitions. More details of the experimental setup are given in Appendix A.1.
Figure 2 shows convergence of the test accuracy w.r.t. the number of epochs. As can be seen, dist-EF-blockSGD converges as fast as distributed SGD and has slightly better accuracy, while signSGD performs poorly. In particular, dist-EF-blockSGD is robust to the minibatch size, while the performance of signSGD degrades with smaller minibatches (which agrees with the results in [3]). Momentum does not offer SGD and dist-EF-blockSGD obvious acceleration, but significantly improves signSGD. However, signum is still much worse than SGDM and dist-EF-blockSGDM.
4.2 Distributed Training on ImageNet
In this section, we perform distributed optimization on ImageNet [13] using a 50-layer ResNet. Each worker is an AWS P3.2 instance with 1 GPU, and the parameter server is housed in one node. We use the publicly available code of [4] (https://github.com/PermiJW/signSGDwithMajorityVote) and Gloo (https://github.com/facebookincubator/gloo), the default communication library in PyTorch. As in [4], we use its all-reduce implementation for SGDM, which is faster. Since signum is much better than signSGD in Section 4.1, we only compare the momentum variants here. The proposed dist-EF-blockSGDM is compared with (i) distributed synchronous SGD with momentum (SGDM); and (ii) signum with majority vote [4]. The number of workers is varied in {7, 15}. With an odd number of workers, a majority vote can never produce zero, and so signum does not lose accuracy through its 1-bit compression. More details of the setup are in Appendix A.2.

Figure 3 shows the test accuracy w.r.t. the number of epochs and wall clock time. As in Section 4.1, SGDM and dist-EF-blockSGDM have comparable accuracies, while signum is inferior. With 7 workers, dist-EF-blockSGDM attains higher accuracy than SGDM (76.77% vs 76.27%). dist-EF-blockSGDM reaches SGDM's highest accuracy in around 13 hours, while SGDM takes 24 hours (Figure 3(b)), a speedup of about 1.8x. With 15 machines, the improvement is smaller (Figure 3(e)), because the burden on the parameter server is heavier. We expect a speedup comparable to the 7-worker setting could be obtained by using more parameter servers. In both cases, signum converges fast but its final test accuracies are noticeably worse.
Figures 3(c) and 3(f) show a breakdown of wall clock time into computation and communication time (following [4], communication time includes the extra computation for error feedback and compression). All methods have comparable computation costs, but signum and dist-EF-blockSGDM have lower communication costs than SGDM. The communication costs of signum and dist-EF-blockSGDM are comparable with 7 workers, but with 15 workers signum's is lower. We speculate that this is because the sign vectors and scaling factors are sent separately to the server in our implementation, causing more latency on the server as the number of workers grows. This may be alleviated by fusing the two operations.
5 Conclusion
In this paper, we proposed a distributed blockwise momentum SGD algorithm with error feedback. By partitioning the gradient into blocks, each block can be transmitted with 1-bit quantization together with its average ℓ1-norm. The proposed methods are communication-efficient and have the same convergence rates as full-precision distributed SGD/SGDM on nonconvex objectives. Experimental results show that the proposed methods converge quickly and achieve the same test accuracy as SGD/SGDM, while signSGD and signum attain much worse accuracies.
References

(1) A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 440–445, 2017.
(2) D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proceedings of the Neural Information Processing Systems Conference, pages 1709–1720, 2017.
 (3) J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for nonconvex problems. In Proceedings of the International Conference on Machine Learning, pages 560–569, 2018.
(4) J. Bernstein, J. Zhao, K. Azizzadenesheli, and A. Anandkumar. signSGD with majority vote is communication efficient and fault tolerant. In Proceedings of the International Conference on Learning Representations, 2019.
(5) L. Bottou and Y. LeCun. Large scale online learning. In Proceedings of the Neural Information Processing Systems Conference, pages 217–224, 2004.
 (6) J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, and A. Ng. Large scale distributed deep networks. In Proceedings of the Neural Information Processing Systems Conference, pages 1223–1231, 2012.
 (7) A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

(8) K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision, pages 630–645, 2016.
(9) S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi. Error feedback fixes signSGD and other gradient compression schemes. arXiv preprint arXiv:1901.09847, 2019.
 (10) M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In Proceedings of the Neural Information Processing Systems Conference, pages 19–27, 2014.
 (11) Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.
(12) Y. E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
 (13) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.

(14) F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Proceedings of the Annual Conference of the International Speech Communication Association, 2014.
(15) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 (16) S. U. Stich, J. Cordonnier, and M. Jaggi. Sparsified SGD with memory. In Proceedings of the Neural Information Processing Systems Conference, pages 4452–4463, 2018.
(17) N. Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Proceedings of the Annual Conference of the International Speech Communication Association, 2015.

(18) I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning, pages 1139–1147, 2013.
(19) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Proceedings of the Neural Information Processing Systems Conference, pages 5998–6008, 2017.
 (20) W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Proceedings of the Neural Information Processing Systems Conference, pages 1509–1519, 2017.
 (21) E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.
 (22) W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
 (23) M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Proceedings of the Neural Information Processing Systems Conference, pages 2595–2603, 2010.
Appendix A Experimental Setup
As the weight decay term is the same on all machines, it is not necessary to compress it. In the experiments, for dist-EF-blockSGD, the weight decay term is therefore not added to the gradient before compression; instead, it is added directly to the final update direction. For dist-EF-blockSGDM, we maintain an extra momentum vector for the weight decay on each machine; the corresponding per-worker update uses the weight decay parameter λ. In the experiments, each sign is mapped to {0, 1} and thus takes 1 bit. Note that the gradient sign has zero probability of being exactly zero.
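The {0, 1} encoding above can be realized with simple bit packing. This is an illustrative sketch (the helper names `pack_signs`/`unpack_signs` are hypothetical); it maps positive entries to bit 1 and everything else to bit 0, which is harmless since an exactly zero gradient entry has zero probability.

```python
import numpy as np

def pack_signs(x):
    """Map sign(x) in {-1, +1} to bits {0, 1} and pack 8 signs per byte."""
    bits = (x > 0).astype(np.uint8)   # +1 -> 1, otherwise -> 0
    return np.packbits(bits), x.size

def unpack_signs(packed, size):
    """Recover the sign vector in {-1, +1} from the packed bits."""
    bits = np.unpackbits(packed)[:size]
    return bits.astype(np.int8) * 2 - 1
```

Each sign thus costs exactly one bit on the wire, plus padding to the nearest byte.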
A.1 Setup: Multi-GPU Experiment on CIFAR-100
Each algorithm is run for 200 epochs. The weight decay and momentum parameters are fixed throughout. We tune only the initial stepsize, using a validation set of 5K images carved out from the training set; for dist-EF-blockSGD/dist-EF-blockSGDM, we use the stepsize tuned for SGD/SGDM. The stepsize is divided by a fixed factor at two scheduled epochs. The stepsize with the best validation set performance is then used to run the algorithm on the full training set. With a minibatch size of 16 per worker, the stepsizes of SGD and SGDM are tuned over one grid, and those of signSGD and signum over another. Given the best stepsizes tuned with minibatch size 16 per worker, for minibatch sizes 8 and 32 per worker we compare them against those stepsizes divided and multiplied by 2, respectively. The best stepsizes are listed in Table 2.
algorithm \ minibatch size per worker | 8 | 16 | 32

full-precision SGD | | |
full-precision SGDM | | |
dist-EF-blockSGD | | |
dist-EF-blockSGDM | | |
signSGD | | |
signum | | |
A.2 Setup: Distributed Training on ImageNet
We use the default hyperparameters for SGDM and signum in the code base, which were tuned for the ImageNet experiment in [4], and a minibatch size of 128 per worker. The momentum and weight decay parameters are shared by all methods. For SGDM and signum, we use the stepsizes used for the corresponding ImageNet experiments in the code base, with signum's stepsize reduced on 15 workers. For dist-EF-blockSGDM, the stepsize is smaller than SGDM's for 7 workers and reduced further for 15 workers. (We observe that SGDM's default stepsize is too large for dist-EF-blockSGDM, while SGDM run with the smaller stepsize performs worse than SGDM with its default.)
Appendix B Proof of Lemmas 1 and 3
Lemma 4.
Suppose the updates of Algorithms 2 and 3 are run with an arbitrary sequence of update directions in place of the stochastic gradients. The corresponding error-corrected iterate then satisfies a recurrence of the same form as in Lemmas 1 and 3, which therefore follow as special cases.
Appendix C Proof of Theorem 1
Proof.
By the smoothness of the function f, we have the standard descent inequality,
where in the second equality we use Lemma 1, and the second-to-last inequality follows from the fact 2⟨a, b⟩ ≤ ‖a‖₂² + ‖b‖₂². In the last inequality, we use the variance bound of the averaged minibatch gradient, i.e., E‖(1/n)∑_{i=1}^n g_{i,t} − ∇f(x_t)‖₂² ≤ σ²/n. Then, we get