1 Introduction
Stochastic Gradient Descent (SGD) and its variants are commonly used for training deep neural networks. To accelerate training, it is common to distribute the computation across multiple GPUs/machines, which results in parallel versions of SGD. There are various ways to parallelize SGD in a distributed manner. A typical solution is to synchronously compute the gradients on multiple worker nodes and average them on the server node. Such a solution is equivalent to single-threaded SGD with a large minibatch size (Goyal et al., 2017; You et al., 2017a, b, 2019). Increasing the number of workers reduces the overall training time.
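This equivalence can be made concrete with a minimal NumPy sketch (the function name and the toy gradient are our own illustration, not from the cited work): each worker computes a minibatch gradient, and the server applies their average, which matches a single large-minibatch step.

```python
import numpy as np

def sync_sgd_step(x, worker_batches, grad_fn, lr=0.1):
    """One synchronous distributed SGD step: each worker computes the
    gradient on its own minibatch, the server averages the gradients,
    and the averaged gradient is applied to the model."""
    grads = [grad_fn(x, batch) for batch in worker_batches]  # computed in parallel in practice
    return x - lr * np.mean(grads, axis=0)

# With equal-sized minibatches, one step equals a single-threaded SGD
# step on the concatenation of all the minibatches.
batches = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
grad_fn = lambda x, b: np.mean(x - b)  # gradient of 0.5 * mean((x - b)^2)
x_dist = sync_sgd_step(0.0, batches, grad_fn)
x_big = 0.0 - 0.1 * grad_fn(0.0, np.concatenate(batches))
```

Here `x_dist` and `x_big` coincide, which is the sense in which synchronous distributed SGD is equivalent to large-minibatch SGD.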
However, in practice, it is difficult to achieve the ideal scalability of distributed SGD due to the communication overhead, which increases with the number of workers. When the number of workers is large enough, the communication overhead becomes the bottleneck of the distributed learning system. Thus, to achieve better scalability, it is necessary to reduce the communication overhead.
Various approaches have been proposed to reduce the communication overhead of distributed SGD, such as quantization (Seide et al., 2014; Strom, 2015; Wen et al., 2017; Alistarh et al., 2016; Bernstein et al., 2018; Karimireddy et al., 2019; Zheng et al., 2019) and sparsification (Aji and Heafield, 2017; Stich et al., 2018; Jiang and Agrawal, 2018). In this paper, we focus on local SGD, which reduces the communication overhead by skipping communication rounds, i.e., less frequent synchronization, and periodically averaging the models across the workers (Stich, 2018; Lin et al., 2018; Yu et al., 2018; Wang and Joshi, 2018; Yu et al., 2019).
Adaptive learning rate methods adapt coordinate-wise dynamic learning rates by accumulating the historical gradients. Examples include AdaGrad (McMahan and Streeter, 2010; Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012), AdaDelta (Zeiler, 2012), and Adam (Kingma and Ba, 2014). Along similar lines, recent research has shown that AdaGrad can converge without explicitly decreasing the learning rate (Ward et al., 2019; Zou et al., 2019). Nevertheless, it remains unclear how to apply adaptive learning rates in distributed SGD with infrequent synchronization. In this paper, we answer this question by combining local SGD and adaptive learning rates. We propose a novel variant of AdaGrad, and combine it with the concept of local SGD. To the best of our knowledge, this paper is the first to theoretically and empirically study local SGD with adaptive learning rates.
The main contributions of our paper are as follows:

We propose a new SGD algorithm with adaptive learning rate: AdaAlter, with provable convergence.

We propose a variant of AdaAlter: local AdaAlter, which reduces the communication overhead via infrequent synchronization.

We prove the convergence of the proposed algorithms for nonconvex problems and non-IID workers.

We show empirically that the proposed algorithms converge quickly and scale well in practical applications.
2 Related Work
In this paper, we consider a centralized server-worker architecture, also known as the Parameter Server (PS) architecture (Li et al., 2014a, b; Ho et al., 2013). In general, PS is a distributed key-value store used for exchanging blocks of model parameters between the workers and the servers (Peng et al., 2019). A common alternative to PS is the AllReduce algorithm, which is typically implemented via MPI (Sergeev and Balso, 2018; Walker and Dongarra, 1996). Most of the existing large-scale distributed deep-learning frameworks, such as TensorFlow (Abadi et al., 2016), PyTorch (Steiner et al., 2019), and MXNet (Chen et al., 2015), support either PS or AllReduce.
Similar to local SGD, there are other SGD variants that reduce the communication overhead by skipping synchronization rounds. For example, federated learning (Konečný et al., 2016; McMahan et al., 2016) adopts local SGD with heterogeneous numbers of local steps and worker subsampling to train models on edge devices. EASGD (Zhang et al., 2014) periodically synchronizes the models on the workers and the servers via a moving average.
In addition to communication compression, there are other approaches to improve scalability and accelerate training. For example, decentralized SGD (Shi et al., 2014; Yuan et al., 2013; Lian et al., 2017) avoids congestion at the central server node and improves scalability by removing the server entirely and letting the workers communicate with their neighbours only. Another technique is pipelining (Li et al., 2018), which overlaps computation and communication to hide the communication overhead.
In this paper, we focus on synchronous training, which blocks the global update until all the workers respond. In contrast, asynchronous training (Zinkevich et al., 2009; Niu et al., 2011; Zhao and Li, 2016) updates the global model immediately after any worker responds. Theoretical and empirical analysis (Dutta et al., 2018) suggests that synchronous training is more stable with less noise, but can also be slowed down by the global barrier across all the workers. Asynchronous training is generally faster, but needs to address instability and noisiness due to staleness.
3 Problem Formulation
We consider the following optimization problem:
where , for , is sampled from the local dataset on the th worker.
We solve this problem in a distributed manner with workers. Each worker trains the model on its local dataset. In each iteration, the th worker will sample a minibatch of independent data points from the dataset , and compute the stochastic gradient , where .
Note that different devices have different local datasets, i.e., . Thus, samples drawn from different workers may have different expectations, i.e., in general, .
Notation  Description
  Model parameter
  Overall loss function in expectation
  Total number and index of iterations
  Stochastic gradient
  The th coordinate of ,
  The th coordinate of ,
  on the th worker, ,
  The th coordinate of ,
  Hadamard (coordinate-wise) product
(all vectors are column vectors)
4 Methodology
Before formally introducing the proposed algorithms, we review two SGD variants that are closely related to our work: AdaGrad and local SGD. We then propose a new SGD variant with adaptive learning rates, AdaAlter, and combine it with the concept of local SGD, which yields another new variant: local AdaAlter.
4.1 Preliminary
To help understand our proposed algorithm, we first introduce the classic SGD variant with adaptive learning rates: AdaGrad. The detailed algorithm is shown in Algorithm 1. The general idea is to accumulate the squared gradients in a coordinate-wise manner, and use this accumulation as the denominator to normalize the gradients. The accumulation grows with the number of iterations, so that we do not need to explicitly decrease the learning rate .
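The per-coordinate update can be sketched as follows (a simplified NumPy illustration of the idea behind Algorithm 1; the toy objective and constants are our own):

```python
import numpy as np

def adagrad_step(x, g, accum, lr=0.5, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients coordinate-wise,
    then normalize the step by the square root of the accumulation."""
    accum = accum + g * g                    # grows with the iteration count
    x = x - lr * g / (np.sqrt(accum) + eps)  # effective step size shrinks automatically
    return x, accum

# Toy usage: minimize 0.5 * ||x||^2, whose gradient is x itself.
x = np.array([1.0, -2.0])
accum = np.zeros_like(x)
for _ in range(200):
    x, accum = adagrad_step(x, x, accum)
```

Because the accumulator only grows, the effective per-coordinate step size decays automatically, which is why no explicit learning rate decay is needed.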
In addition to AdaGrad, we also adopt the concept of local SGD to reduce the communication overhead. The vanilla local SGD algorithm is shown in Algorithm 2. Local SGD skips the communication rounds, and synchronizes/averages the model parameters for every iterations. Thus, on average, the communication overhead is reduced to , compared to fully synchronous SGD.
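A minimal sketch of vanilla local SGD (Algorithm 2), simulating the workers sequentially; the noise magnitude and constants are our own illustration:

```python
import numpy as np

def local_sgd(grad_fn, x0, n_workers=4, n_rounds=10, local_steps=8, lr=0.1, seed=0):
    """Vanilla local SGD: each worker takes local_steps SGD steps on its own
    noisy gradients, then all workers average their models (one communication
    round), so communication is reduced by a factor of local_steps."""
    rng = np.random.default_rng(seed)
    workers = [x0.copy() for _ in range(n_workers)]
    for _ in range(n_rounds):
        for k in range(n_workers):              # workers run in parallel in practice
            for _ in range(local_steps):
                g = grad_fn(workers[k]) + 0.01 * rng.normal(size=x0.shape)
                workers[k] = workers[k] - lr * g
        avg = np.mean(workers, axis=0)          # the only communication round
        workers = [avg.copy() for _ in range(n_workers)]
    return workers[0]

# Toy usage: minimize 0.5 * ||x||^2 with noisy gradients.
x = local_sgd(lambda x: x, np.array([1.0, -2.0]))
```

Only one synchronization happens per `local_steps` gradient steps, which is the source of the communication savings.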
4.2 AdaAlter
In this section, we formally introduce AdaAlter, an alternative to AdaGrad. AdaAlter accumulates the denominators similarly to AdaGrad. The major difference is that AdaAlter updates the model parameters before accumulating the denominator. The detailed algorithm is shown in Algorithm 3. Note that AdaGrad updates the denominator first and then updates the model parameters, while AdaAlter updates the model parameters first and then updates the denominator. This simple modification ensures that AdaAlter behaves similarly to AdaGrad, yet makes it much easier to combine with local SGD. The practical importance of switching the order of operations is discussed in detail after we introduce the local AdaAlter algorithm in the next section.
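The reversed order can be sketched as follows (our simplified reading of Algorithm 3; here `eps` plays the role of the placeholder constant discussed in Section 4.3):

```python
import numpy as np

def adaalter_step(x, g, accum, lr=0.5, eps=1.0):
    """One AdaAlter step: the model is updated using the *previous*
    accumulation plus the placeholder eps, and only afterwards is the
    squared gradient folded into the accumulation."""
    x = x - lr * g / np.sqrt(accum + eps)  # old denominator + placeholder
    accum = accum + g * g                  # accumulated after the update
    return x, accum

# Toy usage: minimize 0.5 * ||x||^2.
x = np.array([1.0, -2.0])
accum = np.zeros_like(x)
for _ in range(200):
    x, accum = adaalter_step(x, x, accum)
```

Because the denominator used at step t does not depend on the gradient computed at step t, it can be kept identical across workers between synchronizations, which is what local AdaAlter exploits.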
4.3 Local AdaAlter
We propose a variant of AdaAlter, namely local AdaAlter, which skips synchronization rounds and averages the model parameters and the accumulated denominators after every iterations. The detailed algorithm is shown in Algorithm 4. Note that in the communication rounds, local AdaAlter has to synchronize not only the model parameters, but also the accumulated denominators across the workers. Thus, compared to distributed AdaGrad (Algorithm 1), local AdaAlter (Algorithm 4) reduces the communication overhead to on average.
In AdaGrad, a small positive constant is added for numerical stability, in case the denominator is too small. In AdaAlter, however, acts as a placeholder for the yet-to-be-added . Thus, after local steps without synchronization, the placeholder becomes . The denominators are updated in the synchronization rounds only, which guarantees that the denominators are the same on all workers during the local iterations.
Similar to fully synchronous AdaAlter, local AdaAlter updates the denominator after updating the model parameters. Switching the order of updates is essential for local AdaAlter: it enables lazy updates of the denominator while keeping the denominator synchronized across the workers. The key idea is to use to substitute for the actual accumulated denominator before synchronization. This is also key to the convergence proof of local AdaAlter.
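One synchronization period of Algorithm 4 can be sketched as follows (our simplified reading, with our own toy constants; the exact bookkeeping of the placeholder at the period boundary follows our interpretation of the text above): during the local steps every worker divides by the shared accumulator plus j·eps placeholders, so the denominators stay identical across workers, while each worker buffers its own squared gradients; at the synchronization round both the models and the accumulators are averaged.

```python
import numpy as np

def local_adaalter_round(workers, accums, grad_fn, H=4, lr=0.5, eps=1.0, rng=None):
    """One synchronization period (H local steps) of local AdaAlter.
    Between synchronizations each worker uses the shared accumulator
    plus j * eps placeholders as the denominator; squared gradients are
    buffered and only folded in (and averaged) at the sync round."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(workers)
    buffers = [np.zeros_like(workers[0]) for _ in range(n)]
    for j in range(1, H + 1):                   # local steps, no communication
        for k in range(n):
            g = grad_fn(workers[k]) + 0.01 * rng.normal(size=workers[k].shape)
            workers[k] = workers[k] - lr * g / np.sqrt(accums[k] + j * eps)
            buffers[k] = buffers[k] + g * g     # lazy denominator update
    # synchronization round: average both the models and the accumulators
    x_avg = np.mean(workers, axis=0)
    b_avg = np.mean([a + b for a, b in zip(accums, buffers)], axis=0)
    return [x_avg.copy() for _ in range(n)], [b_avg.copy() for _ in range(n)]

# Toy usage: 4 workers minimizing 0.5 * ||x||^2 with noisy gradients.
rng = np.random.default_rng(1)
workers = [np.array([1.0, -2.0]) for _ in range(4)]
accums = [np.zeros(2) for _ in range(4)]
for _ in range(20):
    workers, accums = local_adaalter_round(workers, accums, lambda x: x, rng=rng)
```

Note that both the models and the accumulators are exchanged at each sync round, which is the extra cost relative to local SGD.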
5 Theoretical Analysis
In this section, we prove the convergence of Algorithm 3 and Algorithm 4 for smooth but nonconvex problems, with constant learning rate .
5.1 Assumptions
First, we introduce some assumptions, and a useful lemma for our convergence analysis.
Assumption 1.
(Smoothness) We assume that and are smooth:
Assumption 2.
(Bounded gradients) For any stochastic gradient , we assume bounded coordinates , or simply .
Lemma 1.
(Zou et al. (2019), Lemma 15) For any nonnegative sequence , we have
5.2 Main results
Based on the assumptions and lemma above, we have the following convergence guarantees. Detailed proofs can be found in the appendix.
We first prove the convergence of fully synchronous AdaAlter for smooth but nonconvex problems.
Theorem 1.
When , . When , . Thus, AdaAlter converges to a critical point when . Increasing the number of workers reduces the variance.
Now, we prove the convergence of local AdaAlter for smooth but nonconvex problems. To analyze Algorithm 4, we introduce the following auxiliary variable:
We show that the sequence converges to a critical point.
Theorem 2.
When , . When , . Thus, local AdaAlter converges to a critical point when . Increasing the number of workers reduces the variance. Compared to fully synchronous AdaAlter, local AdaAlter incurs extra noise proportional to , which means that less frequent synchronization results in larger noise. Thus, there is an inevitable trade-off between reducing the noise and reducing the communication overhead.
6 Experiments
In this section, we empirically evaluate the proposed algorithms.
6.1 Datasets and Model Architecture
We conduct experiments on the 1B Word Benchmark dataset (Chelba et al., 2013), a publicly available benchmark for language models. The dataset is composed of about 0.8B words with a vocabulary of 793,471 words, including sentence boundary markers. As a standard preprocessing procedure, all sentences are shuffled and duplicates are removed. We train the so-called Big LSTM model with 10% dropout (LSTM-2048-512 in Józefowicz et al. (2016)).
6.2 Evaluation Setup
Our experiments are conducted on a cluster of machines, each equipped with 8 NVIDIA V100 GPUs (16GB memory each). In the default setting, the model is trained on a single machine with 8 GPU workers, where the local batch size at each GPU is 256. We tune the learning rates in the range of , and report the best results on the test dataset. Each experiment consists of 50 epochs. In each epoch, the algorithm processes data samples. We repeat each experiment 5 times and take the average.
The standard measure for language models is perplexity (PPL), computed from the average per-word log-probability on the test dataset:
where is the predicted probability of word in the language model. We follow the standard procedure and sum over all the words.
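Concretely, PPL exponentiates the negative average per-word log-probability (a small sketch; the input is a list of per-word log-probabilities):

```python
import math

def perplexity(word_log_probs):
    """PPL = exp(-(1/N) * sum of per-word log-probabilities)."""
    return math.exp(-sum(word_log_probs) / len(word_log_probs))

# A model that assigns probability 1/4 to every word has perplexity 4.
ppl = perplexity([math.log(0.25)] * 100)
```

Lower perplexity means the model assigns higher probability to the held-out words.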
6.2.1 Practical Remarks for AdaAlter
There are some additional remarks for using (local) AdaAlter in practice.
Warmup Learning Rates: When using AdaAlter, we observe that its behavior is almost identical to that of AdaGrad. The only exception is that, at the very beginning, the denominator is too small for AdaAlter. Thus, we add a warmup mechanism for AdaAlter:
where is a hyperparameter. In the first iterations, the learning rate gradually increases from to . In our default setting with 8 GPU workers and batch size , we take and .
Scaling Learning Rates: The original baseline is conducted on 4 GPU workers with batch size and learning rate , where the actual overall batch size is . When the batch size increases by , it is a common strategy to rescale the learning rate by or (Goyal et al., 2017; You et al., 2017a, b, 2019). In our experiments, we use 8 GPU workers with batch size , where the actual overall batch size is . Thus, it is reasonable to rescale the learning rate in the range of . When tuning the learning rates, we found that taking yields the best performance.
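A common realization of such a warmup schedule is a linear ramp (a sketch; the paper's exact start and end values are elided above, so the endpoints here are illustrative):

```python
def warmup_lr(t, base_lr, warmup_steps):
    """Linearly increase the learning rate from 0 to base_lr over the
    first warmup_steps iterations, then keep it constant."""
    return base_lr * min(1.0, t / warmup_steps)
```

The rescaled base learning rate from the scaling rule above would then be passed in as `base_lr`.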
6.3 Empirical Results
We evaluate the following performance metrics to test the reduction of communication overhead and the convergence of the proposed algorithms:

The time consumed by one epoch versus the number of GPU workers.

The throughput (the overall number of samples processed per second) versus the number of GPU workers.

Perplexity on the test dataset versus time.

Perplexity on the test dataset versus the number of epochs.
Note that in all the experiments, we take , .
6.3.1 Reduction of Communication
We first evaluate the reduction of the communication overhead achieved by the proposed algorithms. In Figures 1 and 2, we illustrate the time consumed by one epoch and the throughput for different numbers of workers and different algorithms. We test local AdaAlter with different synchronization periods . The results show that local AdaAlter efficiently reduces the communication overhead.
The baseline “Local AdaAlter, ” is evaluated by manually removing the communication, i.e., synchronization never happens. The baseline “Ideal computation-only overhead” is evaluated by additionally removing the data loading, using a batch of dummy data to avoid the overhead of loading real data samples. These two baselines illustrate the ideal lower bounds of the training time, obtained by removing all overheads other than computation.
6.3.2 Convergence
We test the convergence of the proposed algorithms. In Figure 3, we illustrate the perplexity on the test dataset for the different algorithms. Compared to vanilla distributed AdaGrad, local AdaAlter converges almost identically with the same number of epochs, but takes much less time. To reach the same perplexity, local AdaAlter reduces the training time by almost 30%.
In Table 2, we report the perplexity and the time consumed at the end of training for the different algorithms. Local AdaAlter achieves performance comparable to fully synchronous AdaGrad and AdaAlter on the test dataset, with much less training time and acceptable variance. Note that we do not illustrate the standard deviation in Figure 3, since it is too small to recognize.
Method  Test PPL  Time (hours)
AdaGrad    98.05
AdaAlter    98.47
Local AdaAlter    69.17
Local AdaAlter    67.41
Local AdaAlter    65.49
Local AdaAlter    64.22
6.4 Discussion
We can see that fully synchronous AdaGrad and AdaAlter are very slow. Local AdaAlter reduces the training time by almost 30% compared to them.
As expected, Figure 3 and Table 2 show that a larger reduces more communication overhead, but also results in worse perplexity, which validates our theoretical analysis in Theorem 2: when increases, the noise in the convergence also increases. Taking gives the best trade-off between the communication overhead and the variance.
Interestingly, as shown in Figure 3(b), local AdaAlter achieves slightly better perplexity on the test dataset than fully synchronous AdaGrad and AdaAlter. Although our theorems indicate that local AdaAlter has larger variance than the fully synchronous version, that conclusion only applies to the training loss. In fact, previous work (Lin et al., 2018) shows that local SGD potentially generalizes better than fully synchronous SGD, so our results are not surprising. We also notice that when is too large, this benefit is overwhelmed by the large noise.
We also observe that almost none of the algorithms scale well when moving from 4 to 8 workers. The major reason is that all the workers are placed in the same machine, which has limited CPU resources. With too many workers, data loading also becomes a bottleneck, as shown by the gap between “Local AdaAlter, ” and “Ideal computation-only overhead” in Figure 1. This is also why different does not show much difference when using 8 GPU workers.
7 Conclusion
We propose a novel SGD algorithm, AdaAlter, and its variant with reduced communication, namely local AdaAlter. We show that the algorithms provably converge. Our empirical results also show that the proposed algorithms accelerate training. In future work, we will apply our algorithms to other datasets and applications, and optimize the system-level performance.
References
 Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P. A., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zhang, X. (2016). Tensorflow: A system for largescale machine learning. In OSDI.
 Aji and Heafield (2017) Aji, A. F. and Heafield, K. (2017). Sparse communication for distributed gradient descent. In EMNLP.
 Alistarh et al. (2016) Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. (2016). Qsgd: Communicationefficient sgd via gradient quantization and encoding. In NIPS.
 Bernstein et al. (2018) Bernstein, J., Wang, Y.X., Azizzadenesheli, K., and Anandkumar, A. (2018). signsgd: compressed optimisation for nonconvex problems. In ICML.
 Chelba et al. (2013) Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., and Koehn, P. (2013). One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH.
 Chen et al. (2015) Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. (2015). Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. ArXiv, abs/1512.01274.
 Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
 Dutta et al. (2018) Dutta, S., Joshi, G., Ghosh, S., Dube, P., and Nagpurkar, P. (2018). Slow and stale gradients can win the race: Errorruntime tradeoffs in distributed sgd. In AISTATS.
 Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). Accurate, large minibatch sgd: Training imagenet in 1 hour. ArXiv, abs/1706.02677.
 Ho et al. (2013) Ho, Q., Cipar, J., Cui, H., Lee, S., Kim, J. K., Gibbons, P. B., Gibson, G. A., Ganger, G., and Xing, E. P. (2013). More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems, pages 1223–1231.
 Jiang and Agrawal (2018) Jiang, P. and Agrawal, G. (2018). A linear speedup analysis of distributed deep learning with sparse and quantized communication. In NeurIPS.
 Józefowicz et al. (2016) Józefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. (2016). Exploring the limits of language modeling. ArXiv, abs/1602.02410.
 Karimireddy et al. (2019) Karimireddy, S. P., Rebjock, Q., Stich, S. U., and Jaggi, M. (2019). Error feedback fixes signsgd and other gradient compression schemes. In ICML.
 Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
 Konečný et al. (2016) Konečný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. (2016). Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
 Li et al. (2014a) Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.Y. (2014a). Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598.
 Li et al. (2014b) Li, M., Andersen, D. G., Smola, A. J., and Yu, K. (2014b). Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27.
 Li et al. (2018) Li, Y., Yu, M., Li, S., Avestimehr, A. S., Kim, N. S., and Schwing, A. G. (2018). Pipesgd: A decentralized pipelined sgd framework for distributed deep net training. ArXiv, abs/1811.03619.
 Lian et al. (2017) Lian, X., Zhang, C., Zhang, H., Hsieh, C.J., Zhang, W., and Liu, J. (2017). Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. ArXiv, abs/1705.09056.
 Lin et al. (2018) Lin, T., Stich, S. U., and Jaggi, M. (2018). Don’t use large minibatches, use local sgd. ArXiv, abs/1808.07217.
 McMahan et al. (2016) McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. (2016). Communicationefficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
 McMahan and Streeter (2010) McMahan, H. B. and Streeter, M. J. (2010). Adaptive bound optimization for online convex optimization. In COLT.
 Niu et al. (2011) Niu, F., Recht, B., Ré, C., and Wright, S. J. (2011). Hogwild!: A lockfree approach to parallelizing stochastic gradient descent. In NIPS.
 Peng et al. (2019) Peng, Y., Zhu, Y., Chen, Y., Bao, Y., Yi, B., Lan, C., Wu, C., and Guo, C. (2019). A generic communication scheduler for distributed dnn training acceleration. In the 27th ACM Symposium on Operating Systems Principles (ACM SOSP 2019).
 Seide et al. (2014) Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. (2014). 1bit stochastic gradient descent and its application to dataparallel distributed training of speech dnns. In INTERSPEECH.
 Sergeev and Balso (2018) Sergeev, A. and Balso, M. D. (2018). Horovod: fast and easy distributed deep learning in tensorflow. ArXiv, abs/1802.05799.
 Shi et al. (2014) Shi, W., Ling, Q., Wu, G., and Yin, W. (2014). Extra: An exact firstorder algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25:944–966.
 Steiner et al. (2019) Steiner, B., DeVito, Z., Chintala, S., Gross, S., Paszke, A., Massa, F., Lerer, A., Chanan, G., Lin, Z., Yang, E., Desmaison, A., Tejani, A., Kopf, A., Bradbury, J., Antiga, L., Raison, M., Gimelshein, N., Chilamkurthy, S., Killeen, T., Fang, L., and Bai, J. (2019). Pytorch: An imperative style, highperformance deep learning library. In NeurIPS.
 Stich (2018) Stich, S. U. (2018). Local sgd converges fast and communicates little. ArXiv, abs/1805.09767.
 Stich et al. (2018) Stich, S. U., Cordonnier, J.B., and Jaggi, M. (2018). Sparsified sgd with memory. ArXiv, abs/1809.07599.
 Strom (2015) Strom, N. (2015). Scalable distributed dnn training using commodity gpu cloud computing. In INTERSPEECH.
 Tieleman and Hinton (2012) Tieleman, T. and Hinton, G. (2012). Lecture 6.5rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31.
 Walker and Dongarra (1996) Walker, D. W. and Dongarra, J. J. (1996). Mpi: a standard message passing interface. Supercomputer, 12:56–68.
 Wang and Joshi (2018) Wang, J. and Joshi, G. (2018). Cooperative sgd: A unified framework for the design and analysis of communicationefficient sgd algorithms. ArXiv, abs/1808.07576.
 Ward et al. (2019) Ward, R., Wu, X., and Bottou, L. (2019). Adagrad stepsizes: sharp convergence over nonconvex landscapes. In International Conference on Machine Learning, pages 6677–6686.
 Wen et al. (2017) Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. (2017). Terngrad: Ternary gradients to reduce communication in distributed deep learning. In NIPS.
 You et al. (2017a) You, Y., Gitman, I., and Ginsburg, B. (2017a). Scaling sgd batch size to 32k for imagenet training. ArXiv, abs/1708.03888.
 You et al. (2019) You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., and Hsieh, C.J. (2019). Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962.
 You et al. (2017b) You, Y., Zhang, Z., Hsieh, C.J., Demmel, J., and Keutzer, K. (2017b). Imagenet training in minutes. In ICPP.
 Yu et al. (2019) Yu, H., Jin, R., and Yang, S. X. (2019). On the linear speedup analysis of communication efficient momentum sgd for distributed nonconvex optimization. In ICML.
 Yu et al. (2018) Yu, H., Yang, S. X., and Zhu, S. (2018). Parallel restarted sgd with faster convergence and less communication: Demystifying why model averaging works for deep learning. In AAAI.
 Yuan et al. (2013) Yuan, K., Ling, Q., and Yin, W. (2013). On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26:1835–1854.
 Zeiler (2012) Zeiler, M. D. (2012). Adadelta: An adaptive learning rate method. ArXiv, abs/1212.5701.
 Zhang et al. (2014) Zhang, S., Choromanska, A., and LeCun, Y. (2014). Deep learning with elastic averaging sgd. In ICLR.

 Zhao and Li (2016) Zhao, S.-Y. and Li, W.-J. (2016). Fast asynchronous parallel stochastic gradient descent: A lock-free approach with convergence guarantee. In AAAI.
 Zheng et al. (2019) Zheng, S., Huang, Z., and Kwok, J. T. (2019). Communication-efficient distributed blockwise momentum sgd with error-feedback. ArXiv, abs/1905.10936.
 Zinkevich et al. (2009) Zinkevich, M., Smola, A. J., and Langford, J. (2009). Slow learners are fast. In NIPS.

 Zou et al. (2019) Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. (2019). A sufficient condition for convergences of adam and rmsprop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11127–11135.
Appendix
References
 Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P. A., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zhang, X. (2016). Tensorflow: A system for largescale machine learning. In OSDI.
 Aji and Heafield (2017) Aji, A. F. and Heafield, K. (2017). Sparse communication for distributed gradient descent. In EMNLP.
 Alistarh et al. (2016) Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. (2016). Qsgd: Communicationefficient sgd via gradient quantization and encoding. In NIPS.
 Bernstein et al. (2018) Bernstein, J., Wang, Y.X., Azizzadenesheli, K., and Anandkumar, A. (2018). signsgd: compressed optimisation for nonconvex problems. In ICML.
 Chelba et al. (2013) Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., and Koehn, P. (2013). One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH.
 Chen et al. (2015) Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. (2015). Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. ArXiv, abs/1512.01274.
 Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
 Dutta et al. (2018) Dutta, S., Joshi, G., Ghosh, S., Dube, P., and Nagpurkar, P. (2018). Slow and stale gradients can win the race: Errorruntime tradeoffs in distributed sgd. In AISTATS.
 Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). Accurate, large minibatch sgd: Training imagenet in 1 hour. ArXiv, abs/1706.02677.
 Ho et al. (2013) Ho, Q., Cipar, J., Cui, H., Lee, S., Kim, J. K., Gibbons, P. B., Gibson, G. A., Ganger, G., and Xing, E. P. (2013). More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems, pages 1223–1231.
 Jiang and Agrawal (2018) Jiang, P. and Agrawal, G. (2018). A linear speedup analysis of distributed deep learning with sparse and quantized communication. In NeurIPS.
 Józefowicz et al. (2016) Józefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu, Y. (2016). Exploring the limits of language modeling. ArXiv, abs/1602.02410.
 Karimireddy et al. (2019) Karimireddy, S. P., Rebjock, Q., Stich, S. U., and Jaggi, M. (2019). Error feedback fixes signsgd and other gradient compression schemes. In ICML.
 Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
 Konevcnỳ et al. (2016) Konevcnỳ, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. (2016). Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
 Li et al. (2014a) Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.Y. (2014a). Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pages 583–598.
 Li et al. (2014b) Li, M., Andersen, D. G., Smola, A. J., and Yu, K. (2014b). Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems, pages 19–27.
 Li et al. (2018) Li, Y., Yu, M., Li, S., Avestimehr, A. S., Kim, N. S., and Schwing, A. G. (2018). Pipesgd: A decentralized pipelined sgd framework for distributed deep net training. ArXiv, abs/1811.03619.
 Lian et al. (2017) Lian, X., Zhang, C., Zhang, H., Hsieh, C.J., Zhang, W., and Liu, J. (2017). Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. ArXiv, abs/1705.09056.
 Lin et al. (2018) Lin, T., Stich, S. U., and Jaggi, M. (2018). Don’t use large minibatches, use local sgd. ArXiv, abs/1808.07217.
 McMahan et al. (2016) McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. (2016). Communicationefficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
 McMahan and Streeter (2010) McMahan, H. B. and Streeter, M. J. (2010). Adaptive bound optimization for online convex optimization. In COLT.
 Niu et al. (2011) Niu, F., Recht, B., Ré, C., and Wright, S. J. (2011). Hogwild!: A lockfree approach to parallelizing stochastic gradient descent. In NIPS.
 Peng et al. (2019) Peng, Y., Zhu, Y., Chen, Y., Bao, Y., Yi, B., Lan, C., Wu, C., and Guo, C. (2019). A generic communication scheduler for distributed dnn training acceleration. In the 27th ACM Symposium on Operating Systems Principles (ACM SOSP 2019).
 Seide et al. (2014) Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. (2014). 1bit stochastic gradient descent and its application to dataparallel distributed training of speech dnns. In INTERSPEECH.
 Sergeev and Balso (2018) Sergeev, A. and Balso, M. D. (2018). Horovod: fast and easy distributed deep learning in tensorflow. ArXiv, abs/1802.05799.
 Shi et al. (2014) Shi, W., Ling, Q., Wu, G., and Yin, W. (2014). EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25:944–966.
 Steiner et al. (2019) Steiner, B., DeVito, Z., Chintala, S., Gross, S., Paszke, A., Massa, F., Lerer, A., Chanan, G., Lin, Z., Yang, E., Desmaison, A., Tejani, A., Kopf, A., Bradbury, J., Antiga, L., Raison, M., Gimelshein, N., Chilamkurthy, S., Killeen, T., Fang, L., and Bai, J. (2019). PyTorch: An imperative style, high-performance deep learning library. In NeurIPS.
 Stich (2018) Stich, S. U. (2018). Local SGD converges fast and communicates little. ArXiv, abs/1805.09767.
 Stich et al. (2018) Stich, S. U., Cordonnier, J.-B., and Jaggi, M. (2018). Sparsified SGD with memory. ArXiv, abs/1809.07599.
 Strom (2015) Strom, N. (2015). Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH.
 Tieleman and Hinton (2012) Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31.
 Walker and Dongarra (1996) Walker, D. W. and Dongarra, J. J. (1996). MPI: A standard message passing interface. Supercomputer, 12:56–68.
 Wang and Joshi (2018) Wang, J. and Joshi, G. (2018). Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. ArXiv, abs/1808.07576.
 Ward et al. (2019) Ward, R., Wu, X., and Bottou, L. (2019). AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In International Conference on Machine Learning, pages 6677–6686.
 Wen et al. (2017) Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. (2017). TernGrad: Ternary gradients to reduce communication in distributed deep learning. In NIPS.
 You et al. (2017a) You, Y., Gitman, I., and Ginsburg, B. (2017a). Scaling SGD batch size to 32K for ImageNet training. ArXiv, abs/1708.03888.
 You et al. (2019) You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., and Hsieh, C.-J. (2019). Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.
 You et al. (2017b) You, Y., Zhang, Z., Hsieh, C.-J., Demmel, J., and Keutzer, K. (2017b). ImageNet training in minutes. In ICPP.
 Yu et al. (2019) Yu, H., Jin, R., and Yang, S. X. (2019). On the linear speedup analysis of communication efficient momentum SGD for distributed nonconvex optimization. In ICML.
 Yu et al. (2018) Yu, H., Yang, S. X., and Zhu, S. (2018). Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In AAAI.
 Yuan et al. (2013) Yuan, K., Ling, Q., and Yin, W. (2013). On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26:1835–1854.
 Zeiler (2012) Zeiler, M. D. (2012). AdaDelta: An adaptive learning rate method. ArXiv, abs/1212.5701.
 Zhang et al. (2014) Zhang, S., Choromanska, A., and LeCun, Y. (2014). Deep learning with elastic averaging SGD. In ICLR.
 Zhao and Li (2016) Zhao, S.-Y. and Li, W.-J. (2016). Fast asynchronous parallel stochastic gradient descent: A lock-free approach with convergence guarantee. In AAAI.
 Zheng et al. (2019) Zheng, S., Huang, Z., and Kwok, J. T. (2019). Communication-efficient distributed blockwise momentum SGD with error-feedback. ArXiv, abs/1905.10936.
 Zinkevich et al. (2009) Zinkevich, M., Smola, A. J., and Langford, J. (2009). Slow learners are fast. In NIPS.
 Zou et al. (2019) Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. (2019). A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11127–11135.
Appendix A Proofs
Theorem 1.
Proof.
For convenience, we denote as the th coordinate of the gradient . Using smoothness, we have
Note that , where .
Conditional on and , taking expectation on both sides, using , we have
Note that , thus, we have . Then, we have
If , then we have
Otherwise, denoting , we have
Thus, denoting , we have
By rearranging the terms, we have
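The analysis above concerns local SGD with coordinate-wise adaptive learning rates. As a concrete illustration of that setting, the sketch below simulates several workers that each run AdaGrad-style updates locally and average their models only every few iterations. This is an assumption-laden toy, not the paper's exact algorithm: the quadratic objective, the choice to average the accumulators alongside the iterates at synchronization, and all hyperparameters are made up for illustration.

```python
import numpy as np

# Illustrative sketch: local SGD with per-coordinate AdaGrad-style
# learning rates. NOT the paper's exact algorithm; the toy objective,
# the averaging of the accumulators v, and the hyperparameters below
# are assumptions for illustration only.

rng = np.random.default_rng(0)
d, K, H, T = 5, 4, 10, 200      # dimension, workers, sync period, total steps
eta, eps = 0.5, 1e-8            # base step size, numerical floor

A = np.diag(np.arange(1.0, d + 1))       # toy quadratic f(x) = 0.5 x^T A x
x = np.tile(rng.normal(size=d), (K, 1))  # all workers start from the same point
v = np.zeros((K, d))                     # per-worker sums of squared gradients

for t in range(1, T + 1):
    for k in range(K):
        g = A @ x[k] + 0.01 * rng.normal(size=d)  # stochastic gradient
        v[k] += g ** 2                            # AdaGrad accumulation
        x[k] -= eta * g / (np.sqrt(v[k]) + eps)   # coordinate-wise adaptive step
    if t % H == 0:                  # infrequent synchronization:
        x[:] = x.mean(axis=0)       # average the models across workers
        v[:] = v.mean(axis=0)       # averaging v too is one possible choice

print(float(0.5 * x[0] @ A @ x[0]))  # objective should be close to zero
```

Skipping the synchronization step every iteration is exactly what reduces the communication cost; the open question the paper addresses is how the per-worker accumulators (here `v`) should behave when models are only periodically averaged.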