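Many machine learning models can be formulated as the following empirical risk minimization problem:
$$\min_{w \in \mathbb{R}^d} F(w) := \frac{1}{n} \sum_{i=1}^{n} f(w; \xi_i), \qquad (1)$$
where $w$ denotes the model parameter, $\xi_i$ denotes the $i$th training data, $n$ is the number of training data, and $d$ is the size of the model. For example, let $\xi_i = (a_i, y_i)$, where $a_i$ denotes the feature of the $i$th training data and $y_i$ denotes the label. Then $f(w; \xi_i) = \log(1 + e^{-y_i a_i^\top w})$ in logistic regression, and $f(w; \xi_i) = \max(0, 1 - y_i a_i^\top w)$ in SVM. Many deep learning models can also be formulated as (1).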
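As an illustration of (1), the following minimal sketch evaluates the empirical risk for the two example losses. It assumes labels in $\{-1, +1\}$ and plain unregularized losses; the function names and the toy data are hypothetical and chosen only for this example.

```python
import math

def logistic_loss(w, a, y):
    """Logistic loss log(1 + exp(-y * <a, w>)) for a label y in {-1, +1}."""
    margin = y * sum(ai * wi for ai, wi in zip(a, w))
    # Numerically stable evaluation for both signs of the margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def hinge_loss(w, a, y):
    """SVM hinge loss max(0, 1 - y * <a, w>)."""
    margin = y * sum(ai * wi for ai, wi in zip(a, w))
    return max(0.0, 1.0 - margin)

def empirical_risk(w, data, loss):
    """F(w) = (1/n) * sum_i f(w; xi_i), with xi_i = (a_i, y_i)."""
    return sum(loss(w, a, y) for a, y in data) / len(data)

# Hypothetical toy data set with n = 3 samples and d = 2 features.
data = [([1.0, 2.0], 1), ([-1.0, 0.5], -1), ([0.5, -1.5], -1)]
w = [0.0, 0.0]
risk_lr = empirical_risk(w, data, logistic_loss)
risk_svm = empirical_risk(w, data, hinge_loss)
```

At $w = 0$ every margin is zero, so the logistic risk equals $\log 2$ and the hinge risk equals $1$, which gives a quick sanity check of the two loss definitions.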
One of the efficient ways to solve (1) is stochastic gradient descent (SGD) (Robbins and Monro, 1951). In each iteration, SGD calculates one stochastic gradient $\nabla f(w_t; \xi_i)$ and updates $w$ by $w_{t+1} = w_t - \eta_t \nabla f(w_t; \xi_i)$ with learning rate $\eta_t$, or updates $w$ with a mini-batch of stochastic gradients. Inspired by momentum and Nesterov's accelerated gradient descent, momentum SGD (MSGD) (Polyak, 1964; Tseng, 1998; Lan, 2012; Kingma and Ba, 2015) has been proposed and widely used in machine learning. In practice, MSGD can achieve better performance than SGD (Krizhevsky et al., 2012; Sutskever et al., 2013).
With the rapid growth of data, distributed SGD (DSGD) (Dekel et al., 2012; Li et al., 2014b) has attracted much attention since it can calculate a batch of stochastic gradients in parallel. DSGD can be formulated as follows:
$$w_{t+1} = w_t - \frac{\eta_t}{p} \sum_{k=1}^{p} g_{t,k},$$
where $\eta_t$ is the learning rate, $p$ is the number of workers, and $g_{t,k}$ is the stochastic gradient (or a mini-batch of stochastic gradients) calculated by the $k$th worker. DSGD can be implemented on distributed frameworks like parameter server and all-reduce. Each worker calculates $g_{t,k}$ and sends it to the server or other workers for updating $w$. Recently, more and more large models, such as deep learning models, are used in machine learning to improve the generalization ability. This makes $g_{t,k}$ a high-dimensional vector. Due to the latency and limited bandwidth of the network, communication cost has become the bottleneck of traditional DSGD and distributed MSGD (DMSGD). For example, when we implement DSGD on a parameter server, the server needs to receive $p$ high-dimensional vectors from the workers, which leads to communication congestion and slows down the convergence of DSGD. Hence, we need to compress $g_{t,k}$ to reduce the communication cost.
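The DSGD update can be sketched as a single-process simulation. This is only an illustration under stated assumptions: the toy loss $f(w; \xi) = \frac{1}{2}\|w - \xi\|^2$, the step size $\eta_t = 1/(t+1)$, the shard contents, and the name `dsgd` are all hypothetical, and the "server" is just a local average of the $p$ gradients.

```python
import random

def grad(w, xi):
    # Gradient of the toy loss f(w; xi) = 0.5 * ||w - xi||^2 (an assumption).
    return [wj - xj for wj, xj in zip(w, xi)]

def dsgd(shards, steps, seed=0):
    """Simulate DSGD: each of the p workers holds one shard of the data."""
    rng = random.Random(seed)
    p = len(shards)
    d = len(shards[0][0])
    w = [0.0] * d
    for t in range(steps):
        eta = 1.0 / (t + 1)
        # Worker k draws one sample from its shard and computes its gradient.
        grads = [grad(w, rng.choice(shard)) for shard in shards]
        # The server receives p dense d-dimensional vectors and averages them.
        avg = [sum(g[j] for g in grads) / p for j in range(d)]
        w = [wj - eta * aj for wj, aj in zip(w, avg)]
    return w

# Four workers, two 2-dimensional samples each; the global mean is (2, 1).
shards = [
    [[0.0, 1.0], [2.0, 3.0]],
    [[4.0, 1.0], [2.0, -1.0]],
    [[1.0, 0.0], [3.0, 2.0]],
    [[2.0, 2.0], [2.0, 0.0]],
]
w_final = dsgd(shards, steps=2000)
```

For this toy loss the minimizer of $F$ is the mean of all samples, so `w_final` should end up close to $(2, 1)$. Every iteration moves $p \cdot d$ values to the server, which is the cost that compression methods try to reduce.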
Recently, researchers have proposed two main categories of communication compression techniques for reducing communication cost in DSGD and DMSGD. The first category is quantization (Wen et al., 2017; Alistarh et al., 2017; Jiang and Agrawal, 2018). In machine learning problems, 32-bit floating-point numbers are typically adopted for representation. Quantization methods quantize the value (gradient or parameter) representation from 32 bits to some low bit-width like 8 bits or 4 bits. Since the quantized gradients in most methods are unbiased estimates of the original ones, the convergence rate of these methods has the same order of magnitude as that of DSGD, but is slower due to the extra quantization variance. Since at least 1 bit is needed per value, the communication cost can be reduced to at best 1/32 of the original in the ideal case. In practice, at least 4 bits should be adopted for representation in most cases to keep the original accuracy. In these cases, the communication cost is reduced to 1/8 of the original.
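To make the unbiasedness and the bit-width arithmetic concrete, here is a minimal sketch of one possible unbiased quantizer based on stochastic rounding onto a uniform grid. It is not the scheme of any of the cited methods; the functions `quantize` and `dequantize`, the max-magnitude scaling, and the test vector are assumptions made for illustration.

```python
import random

def quantize(v, bits, rng):
    """Stochastically round each entry of v onto a uniform grid with 2**bits levels."""
    levels = 2 ** bits - 1                   # number of grid intervals
    scale = max(abs(x) for x in v) or 1.0    # the grid covers [-scale, scale]
    codes = []
    for x in v:
        pos = (x + scale) / (2 * scale) * levels   # real-valued grid position
        low = min(int(pos), levels)
        # Round up with probability equal to the fractional part, so E[code] = pos.
        codes.append(low + (1 if rng.random() < pos - low else 0))
    return codes, scale

def dequantize(codes, scale, bits):
    levels = 2 ** bits - 1
    return [c / levels * 2 * scale - scale for c in codes]

rng = random.Random(0)
v = [0.3, -1.0, 0.55]
bits = 4
trials = 20000
mean = [0.0] * len(v)
for _ in range(trials):
    codes, scale = quantize(v, bits, rng)
    for j, x in enumerate(dequantize(codes, scale, bits)):
        mean[j] += x / trials

# Ratio of bits per value before and after quantization (scale overhead ignored).
compression = 32 / bits
```

Averaged over many draws, the dequantized vector approaches `v`, which is the unbiasedness property mentioned above; each single draw still carries extra variance. With `bits = 4`, each value needs one eighth of the original 32 bits.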
The other category is based on sparsified gradients (Aji and Heafield, 2017; Alistarh et al., 2018; Stich et al., 2018), which is usually called sparse communication. In sparse communication, after calculating the update vector $g_{t,k}$ at each iteration, each worker only sends a subset of coordinates in $g_{t,k}$, denoted as $\mathcal{S}(g_{t,k})$. Here, $\mathcal{S}(g_{t,k})$ is a sparse vector, and hence it can reduce the communication cost. In recent works (Aji and Heafield, 2017; Lin et al., 2018), each worker will typically remember those values which are not sent, i.e., $g_{t,k} - \mathcal{S}(g_{t,k})$, and store them in memory rather than dropping them. The vector $g_{t,k} - \mathcal{S}(g_{t,k})$ is called the memory gradient and it will be used to calculate the next update vector $g_{t+1,k}$. This is intuitively necessary because a subset of coordinates of one stochastic gradient cannot reflect the real descent direction and is more likely to be misleading than the original stochastic gradient. This memory gradient based sparse communication strategy has been widely adopted by recent communication compression methods (Seide et al., 2014; Aji and Heafield, 2017; Lin et al., 2018), and has achieved better performance than quantization methods and other sparse communication methods without memory gradient. Among these memory gradient based sparse communication methods, some are for vanilla DSGD (Aji and Heafield, 2017; Alistarh et al., 2018; Stich et al., 2018). The convergence rate of vanilla DSGD with sparse communication has been proved in (Alistarh et al., 2018; Stich et al., 2018). Very recently, one sparse communication method for distributed MSGD (DMSGD) has appeared, called deep gradient compression (DGC) (Lin et al., 2018), which has achieved better performance than vanilla DSGD with sparse communication in practice. However, a theory for the convergence of DGC is still lacking. Furthermore, although DGC uses momentum SGD, the momentum in DGC is calculated by each worker and hence it is a local momentum without global information.
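The memory mechanism described above can be sketched for a single worker as follows. This is a minimal illustration assuming a top-$k$ selection rule; the function name `topk_with_memory` and the example vectors are hypothetical.

```python
def topk_with_memory(g, memory, k):
    """Send the k largest-magnitude coordinates of g + memory, keep the rest."""
    v = [gi + mi for gi, mi in zip(g, memory)]
    keep = set(sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:k])
    sent = {j: v[j] for j in keep}    # sparse message: index -> value
    new_memory = [0.0 if j in keep else v[j] for j in range(len(v))]
    return sent, new_memory

memory = [0.0, 0.0, 0.0, 0.0]
sent1, memory = topk_with_memory([0.5, -2.0, 0.25, 1.0], memory, k=1)
# sent1 == {1: -2.0}; memory == [0.5, 0.0, 0.25, 1.0]
sent2, memory = topk_with_memory([0.5, 0.25, 0.25, 0.25], memory, k=1)
# sent2 == {3: 1.25}; memory == [1.0, 0.25, 0.5, 0.0]
```

In the second call, coordinate 3 is selected only because the value 1.0 left over from the first call was remembered. Without the memory, that value would have been dropped and coordinate 0 would have been sent instead.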
In this paper, we propose a novel method, called global momentum compression (GMC), for sparse communication in DMSGD which includes DSGD as a special case. The main contributions of this paper are summarized as follows:
- GMC combines memory gradient and momentum SGD to achieve sparse communication for DMSGD (DSGD). Unlike DGC, which adopts local momentum, GMC adopts global momentum.
- We theoretically prove the convergence rate of GMC for both convex and non-convex problems. To the best of our knowledge, this is the first work that proves the convergence of DMSGD with sparse communication and memory gradient.
- Empirical results show that, compared with the DMSGD counterpart without sparse communication, GMC can reduce the communication cost by approximately 100 fold without loss of generalization accuracy.
- GMC can also achieve comparable (sometimes better) performance compared with DGC, with an extra theoretical guarantee.
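In this paper, we use $\|\cdot\|$ to denote the $L_2$ norm, use $w^*$ to denote the optimal solution of (1), use $\nabla f(w; \mathcal{I})$ to denote one stochastic gradient with respect to a mini-batch of samples $\mathcal{I}$ such that $\nabla f(w; \mathcal{I}) = \frac{1}{|\mathcal{I}|} \sum_{\xi_i \in \mathcal{I}} \nabla f(w; \xi_i)$ and $\mathbb{E}[\nabla f(w; \mathcal{I}) \mid w] = \nabla F(w)$, use $\odot$ to denote element-wise product, use $\mathbf{1}$ to denote the vector $(1, 1, \ldots, 1)^\top \in \mathbb{R}^d$, and use $I$ to denote an identity matrix. For a vector $a$, we use $a^{(j)}$ to denote its $j$th coordinate value.
Definition 1 (bounded gradient) $\nabla f(w; \mathcal{I})$ is called a $G$-bounded stochastic gradient of function $F$ if $\mathbb{E}[\nabla f(w; \mathcal{I}) \mid w] = \nabla F(w)$ and $\mathbb{E}\|\nabla f(w; \mathcal{I})\|^2 \le G^2$.
Definition 2 (smooth function) Function $F$ is $L$-smooth ($L > 0$) if $\|\nabla F(w) - \nabla F(w')\| \le L \|w - w'\|$ for all $w, w'$, or equivalently $F(w') \le F(w) + \nabla F(w)^\top (w' - w) + \frac{L}{2} \|w' - w\|^2$.
Definition 3 (strongly convex function) Function $F$ is $\mu$-strongly convex ($\mu > 0$) if $F(w') \ge F(w) + \nabla F(w)^\top (w' - w) + \frac{\mu}{2} \|w' - w\|^2$ for all $w, w'$.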
2.1 Distributed Momentum SGD
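The widely used momentum SGD (MSGD) (Polyak, 1964) for solving (1) can be written as
$$g_t = \beta g_{t-1} + \eta_t \nabla f(w_t; \mathcal{I}_t), \qquad w_{t+1} = w_t - g_t,$$
where $\beta \in [0, 1)$ is the momentum coefficient. The $g_t$ is the Polyak's momentum and $\nabla f(w_t; \mathcal{I}_t)$ is an unbiased estimated stochastic gradient of $F(w_t)$. Since $g_{t-1} = w_{t-1} - w_t$, it can also be written as
$$w_{t+1} = w_t - \eta_t \nabla f(w_t; \mathcal{I}_t) + \beta (w_t - w_{t-1}). \qquad (3)$$
Please note that if $\beta = 0$, MSGD degenerates to SGD.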
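The equivalence between the momentum form and the rewritten form can be checked numerically. The sketch below assumes the common heavy-ball convention $g_t = \beta g_{t-1} + \eta \nabla F(w_t)$, $w_{t+1} = w_t - g_t$ with $g_{-1} = 0$, and uses a deterministic gradient of a toy quadratic in place of a stochastic gradient; the function names and constants are hypothetical.

```python
def grad(w):
    # Gradient of the toy objective F(w) = 0.5 * (w1^2 + 10 * w2^2) (an assumption).
    return [c * wj for c, wj in zip((1.0, 10.0), w)]

def msgd_momentum_form(w0, eta, beta, steps):
    """g_t = beta * g_{t-1} + eta * grad(w_t);  w_{t+1} = w_t - g_t."""
    w, g = list(w0), [0.0] * len(w0)
    for _ in range(steps):
        g = [beta * gj + eta * dj for gj, dj in zip(g, grad(w))]
        w = [wj - gj for wj, gj in zip(w, g)]
    return w

def msgd_heavy_ball_form(w0, eta, beta, steps):
    """w_{t+1} = w_t - eta * grad(w_t) + beta * (w_t - w_{t-1})."""
    w_prev, w = list(w0), list(w0)   # w_{-1} = w_0, so the first step has no momentum
    for _ in range(steps):
        d = grad(w)
        w_next = [wj - eta * dj + beta * (wj - pj)
                  for wj, dj, pj in zip(w, d, w_prev)]
        w_prev, w = w, w_next
    return w

w_a = msgd_momentum_form([1.0, 1.0], eta=0.05, beta=0.9, steps=50)
w_b = msgd_heavy_ball_form([1.0, 1.0], eta=0.05, beta=0.9, steps=50)
w_sgd = msgd_momentum_form([1.0, 1.0], eta=0.05, beta=0.0, steps=1)
```

Both forms produce the same iterates up to floating-point rounding, and with `beta=0.0` a step reduces to a plain gradient step.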
One simple way to implement distributed MSGD (DMSGD) is that each worker calculates some stochastic gradients in parallel, and then the stochastic gradients of all workers are aggregated to get $\nabla f(w_t; \mathcal{I}_t)$. The update process of $w_t$ in this way is totally equivalent to the serial MSGD. We call $g_t$ the global momentum, because it captures the global information from all workers.
Another way to implement DMSGD is using local momentum:
where is the stochastic gradient calculated by the th worker and . is the local momentum. We will find that DGC (Lin et al., 2018) degenerates to this DMSGD with local momentum when it does not adopt sparse communication. Since , this DMSGD with local momentum can also be written as the formulation in (3). Hence, the global momentum contains all information of local momentum. Please note that if sparse communication is adopted, the update rule of DGC cannot capture all the information in global momentum.
In a later section, we will see that global momentum is better than local momentum when using memory gradient for sparse communication. Recently, another distributed SGD method using local momentum has appeared (Yu et al., 2019). The method of Yu et al. (2019) also needs to unify the local momentums on each worker to get the global momentum after several iterations, which means local momentum cannot be independently applied for too many iterations.
3 Global Momentum Compression
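In this section, we introduce our method, global momentum compression (GMC). Assume we have $p$ workers. The training data are divided into $p$ subsets $\mathcal{D}_1, \ldots, \mathcal{D}_p$, where $\mathcal{D} = \bigcup_{k=1}^{p} \mathcal{D}_k$. Each $\mathcal{D}_k$ is stored on the $k$th worker. GMC mainly performs the following operations:
- Each worker calculates $g_{t,k} = \nabla f(w_t; \mathcal{I}_{t,k}) - \frac{\beta}{\eta_t} (w_t - w_{t-1})$, where $\mathcal{I}_{t,k} \subseteq \mathcal{D}_k$ is a mini-batch;
- Each worker generates a sparse vector $m_{t,k} \in \{0, 1\}^d$ and sends $m_{t,k} \odot (g_{t,k} + u_{t,k})$;
- Each worker updates $u_{t+1,k} = (\mathbf{1} - m_{t,k}) \odot (g_{t,k} + u_{t,k})$;
- Update parameter $w_{t+1} = w_t - \frac{\eta_t}{p} \sum_{k=1}^{p} m_{t,k} \odot (g_{t,k} + u_{t,k})$.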
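The four operations can be sketched as one simulated iteration. The exact scaling used below is an assumption made for illustration: each worker forms its local gradient minus $(\beta/\eta)(w_t - w_{t-1})$, adds its memory, masks the result with a top-$k$ mask, and the masked vectors are averaged over the $p$ workers. The names `top_k_mask`, `gmc_step`, and `run`, as well as the toy loss $\frac{1}{2}\|w - \xi_k\|^2$ on each worker, are hypothetical.

```python
def top_k_mask(v, k):
    """0/1 mask selecting the k coordinates of v with the largest magnitude."""
    keep = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:k]
    mask = [0.0] * len(v)
    for j in keep:
        mask[j] = 1.0
    return mask

def gmc_step(w, w_prev, memories, local_grads, eta, beta, k):
    """One simulated iteration with global momentum and memory gradient."""
    p, d = len(local_grads), len(w)
    total = [0.0] * d
    new_memories = []
    for grad_k, u_k in zip(local_grads, memories):
        # Local gradient corrected by the global momentum term (w_t - w_{t-1}).
        g_k = [grad_k[j] - beta / eta * (w[j] - w_prev[j]) for j in range(d)]
        v_k = [g_k[j] + u_k[j] for j in range(d)]
        m_k = top_k_mask(v_k, k)
        # Only the masked coordinates are communicated.
        total = [total[j] + m_k[j] * v_k[j] for j in range(d)]
        # Coordinates that were not sent are stored in the memory.
        new_memories.append([(1.0 - m_k[j]) * v_k[j] for j in range(d)])
    w_next = [w[j] - eta / p * total[j] for j in range(d)]
    return w_next, new_memories

def run(targets, steps, eta, beta, k):
    d = len(targets[0])
    w_prev, w = [0.0] * d, [0.0] * d
    memories = [[0.0] * d for _ in targets]
    for _ in range(steps):
        # Worker k uses the gradient of 0.5 * ||w - target_k||^2.
        local_grads = [[w[j] - t[j] for j in range(d)] for t in targets]
        w_next, memories = gmc_step(w, w_prev, memories, local_grads, eta, beta, k)
        w_prev, w = w, w_next
    return w, memories

targets = [[1.0, -2.0, 0.5], [3.0, 0.0, -0.5]]
w_dense, _ = run(targets, steps=20, eta=0.1, beta=0.5, k=3)     # nothing is dropped
w_sparse, mem = run(targets, steps=20, eta=0.1, beta=0.5, k=1)  # one coordinate per worker
```

With `k` equal to the dimension, every mask is all ones, the memories stay at zero, and the iterates coincide with heavy-ball momentum SGD on the averaged gradient. With `k=1`, each worker sends a single coordinate per iteration and keeps the remainder in its memory.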
3.1 Framework of GMC
GMC can be easily implemented on an all-reduce distributed framework, in which each worker sends its sparse vector to all the other workers and then updates $w_{t+1}$ after receiving the sparse vectors from the other workers.
Recently, parameter server (Li et al., 2014a) has been one of the most popular distributed frameworks in machine learning due to its scalability. GMC can also be implemented on parameter server. The details are shown in Algorithm 1. The difference between GMC and traditional DSGD on parameter server is that in GMC after updating , server will send to workers instead of . Since is sparse, is sparse as well. Then sending can reduce the communication cost. In our experiments, we find that GMC can make without loss of accuracy when training large scale models. Here, denotes the number of non-zero values in . Workers can get by .
3.2 Memory Gradient
In GMC, after sending a sparse vector $m_{t,k} \odot (g_{t,k} + u_{t,k})$, each worker will remember the coordinates which are not sent and store them in $u_{t+1,k}$:
$$u_{t+1,k} = (\mathbf{1} - m_{t,k}) \odot (g_{t,k} + u_{t,k}). \qquad (4)$$
So we call $u_{t,k}$ the memory gradient, which is important for the convergence guarantee of GMC. Here, we give an intuitive explanation of why GMC needs to remember the coordinates which are not sent. We consider the simple case that $\beta = 0$, which means $g_{t,k}$ is a stochastic gradient of $F(w_t)$ and GMC degenerates to gradient dropping (Aji and Heafield, 2017).
Since is a sparse vector, GMC can be seen as a method achieving sparse communication by combining stochastic coordinate descent (SCD) (Nesterov, 2012) and DSGD. In SCD, each denotes a true descent direction. When we use a stochastic gradient to replace , will make error with high probability, especially when adopts the top-K strategy (choose the coordinates with larger absolute values (Alistarh et al., 2018)).
For example, let , and
where , is on the first worker, is on the second worker. Then we run GMC to solve . Let adopts the top- strategy.
If we do not use the memory gradient, which means each worker directly sends , then the first worker will send , the second worker will send and . We observe that will never be updated. This is due to the pseudo large gradient values which cheat . Since , we can see that the second coordinate is the true larger gradient and we should have mainly updated . However, in the two stochastic functions , the first coordinate has larger absolute value, so they cheat and lead to the error.
If we use memory gradient, at first . After some iterations, they will be and due to the memory gradient. Specifically, let be two integers satisfying , then it is easy to verify that . It implies that if we use the memory gradient, both and will be updated, so GMC can make converge to the optimum .
Hence, the memory gradient is necessary for sparse communication. It can overcome the disadvantage of combining DSGD and SCD.
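The effect can be reproduced on a small two-worker problem. The instance below is a hypothetical one constructed for this sketch, not the example discussed above: $f_1(w) = 2w^{(1)} + \frac{1}{2}(w^{(2)} - 1)^2$ on the first worker and $f_2(w) = -2w^{(1)} + \frac{1}{2}(w^{(2)} - 1)^2$ on the second, so that $F = \frac{1}{2}(f_1 + f_2)$ depends only on $w^{(2)}$. Each worker sends one coordinate per iteration (top-1), and no momentum is used.

```python
def run(use_memory, steps=200, eta=0.05):
    signs = [2.0, -2.0]                    # first-coordinate gradients of f_1 and f_2
    w = [0.0, 0.0]
    memory = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(steps):
        update = [0.0, 0.0]
        for k, s in enumerate(signs):
            g = [s, w[1] - 1.0]            # local gradient on worker k
            v = [g[0] + memory[k][0], g[1] + memory[k][1]]
            j = 0 if abs(v[0]) >= abs(v[1]) else 1   # top-1 coordinate
            update[j] += v[j] / 2          # average over the two workers
            if use_memory:
                # Keep the coordinate that was not sent.
                memory[k] = [0.0 if i == j else v[i] for i in range(2)]
        w = [w[0] - eta * update[0], w[1] - eta * update[1]]
    return w

w_no_memory = run(use_memory=False)
w_with_memory = run(use_memory=True)
```

Without memory, both workers always send their first coordinate; the two values cancel after averaging, so $w^{(2)}$ never moves from its initial value even though it is the only coordinate that matters for $F$. With memory, the unsent second coordinate accumulates until it exceeds the first one in magnitude, at which point it is sent and $w^{(2)}$ moves toward the minimizer $w^{(2)} = 1$. With a constant step size, the iterate in this toy run only reaches a neighborhood of the minimizer.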
3.3 Global Momentum
In GMC, each worker calculates as . When , it degenerates to that of gradient dropping (Aji and Heafield, 2017), denoted as . We can see that GMC uses the global momentum . While in DGC (Lin et al., 2018), the is calculated by , which uses the local momentum .
If we set , then DGC is equivalent to GMC and . If is sparse, according to the update rule for memory gradient in (4), in GMC each will contain partial information of the global momentum, while in DGC each only contains partial information of the local momentum. Assume that converges to , then denotes the descent direction with high probability. Since , which only contains partial information of the whole training data, the global momentum can make compensation for the error between the stochastic gradient and full gradient. Specifically, if , then we get that
It implies that is a better estimation of than . However, it is hard to judge whether is better than from the update rule for .
4 Convergence of GMC
In this section, we prove the convergence rate of GMC for both convex and non-convex problems. For convenience, we define a diagonal matrix such that to replace the symbol . Then the update rule for GMC can be written as follows:
According to the above two equations, we get that
where . Denoting , we have
First, we re-write the above update rule in the following lemma: Let , then we have
For the new variable in Lemma 4, the gap between and has the following property: Assume . If , then
If , then
where , satisfies . The learning rate proposed in Lemma 4 are common in the convergence analysis of DSGD and it tells us that . For convenience, below we use the constant to denote
(strongly convex case) Let be the variable defined in Lemma 4. Assume . We further assume is -smooth and -strongly convex with -bounded stochastic gradient. By setting , , then
where . It implies that GMC has a convergence rate of , if the objective function is strongly convex.
(convex case) Let be the variable defined in Lemma 4. Assume . And assume is convex with -bounded stochastic gradient. By setting , we get that
where . It implies that GMC has a convergence rate of , if the objective function is convex.
(non-convex case) Let be the variable defined in Lemma 4. Assume . And assume is -smooth with -bounded stochastic gradient. By setting , we get that
where . By taking , it is easy to find that GMC has a convergence rate of , if the objective function is non-convex.
In the previous theorems, we need . According to its definition, we only need to be bounded. The bound is mainly related to the choice of . Since , we can get by: randomly choosing the coordinates of (random strategy), or choosing the coordinates which have (approximate) top-K absolute values (top-K strategy). Specifically, we have the following theorem: If adopts random strategy or top-K strategy, and , then we have .
We conduct experiments on a PyTorch based parameter server with one server and eight workers. Each worker has access to one K40 GPU. We compare with distributed momentum SGD (DMSGD) and DGC (Lin et al., 2018). In DGC, the momentum factor masking is used (Lin et al., 2018). We set for GMC and DGC. In our experiments, we consider the communication cost on the server which is the busiest node. It includes receiving vectors from the workers and sending one vector to the workers. So the cost of DMSGD is . In GMC and DGC, since is sparse, workers send the vectors using the structure of . The cost of each is . Server sends using this structure as well. Hence, the cost of GMC and DGC is . Hence, the communication compression ratio (CR) is: . Here, all numbers have the same unit (float value).
Convex model. We use the MNIST dataset and logistic regression (LR) to evaluate GMC on a convex problem. Since the dataset and the model are small, directly adopts the top-K strategy with where or . We use 4 workers for this experiment. We train LR (weight decay: 0.0001, batch size: 128) for 30 epochs. The results are in Table 5. We can see that GMC gets the same training loss and test accuracy as DMSGD under different sparsity. According to the definition of CR, the communication compression ratio is smaller than . So the compression ratio in Table 5 is consistent with it, which is proportional to . We can find that, compared with DMSGD, GMC can reduce the communication cost by more than 200 fold without loss of accuracy. Furthermore, GMC achieves performance comparable to that of DGC.
Non-convex model. We use the CIFAR-10 dataset and two popular deep models (AlexNet, ResNet20) to evaluate GMC on non-convex problems. Since the model size is large, in GMC and DGC we use the approximate top-K strategy for : given a vector , we first randomly choose some indices with . We get the threshold such that . Then we choose the indices . It implies that is approximately . We use both and workers with the total batch size 128. The results are shown in Figure 1 and Table 5. First, according to Figure 1 (a), GMC and DGC have the same training loss and test accuracy as DMSGD on ResNet20. Compared to ResNet20, AlexNet has more parameters. In Figure 1 (b), we can see that GMC also gets the same loss and accuracy as DMSGD. When using 4 workers, GMC is better than DGC on test accuracy. From Table 5, we can find that, compared with DMSGD, GMC can reduce the communication cost by more than 100 fold without loss of accuracy. Furthermore, GMC achieves comparable (sometimes better) accuracy with a comparable communication compression ratio, compared with DGC.
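The approximate top-K selection described above (sample some indices, derive a magnitude threshold from the sample, then keep every coordinate at or above the threshold) can be sketched as follows. The function name, the sampling size, and the target ratio of 1% are hypothetical values chosen for illustration, not the settings used in the experiments.

```python
import random

def approx_top_k_indices(a, ratio, sample_size, rng):
    """Select roughly ratio * len(a) indices with the largest |a_j| via a sampled threshold."""
    sample = rng.sample(range(len(a)), sample_size)
    sample_abs = sorted((abs(a[j]) for j in sample), reverse=True)
    n_keep = max(1, round(ratio * sample_size))
    theta = sample_abs[n_keep - 1]     # threshold estimated from the sample only
    return [j for j, x in enumerate(a) if abs(x) >= theta]

rng = random.Random(0)
a = [rng.gauss(0.0, 1.0) for _ in range(10000)]
selected = approx_top_k_indices(a, ratio=0.01, sample_size=1000, rng=rng)
```

Sorting only the sample avoids a full sort of the $d$-dimensional vector, at the price of selecting approximately, rather than exactly, the requested number of coordinates.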
In this paper, we propose a novel method, called global momentum compression (GMC), for sparse communication in DMSGD (DSGD). To the best of our knowledge, this is the first work that proves the convergence of DMSGD with sparse communication and memory gradient. Empirical results show that GMC can achieve state-of-the-art performance.
- Aji and Heafield (2017) Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445, 2017.
- Alistarh et al. (2017) Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1707–1718, 2017.
- Alistarh et al. (2018) Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5977–5987, 2018.
- Dekel et al. (2012) Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.
- Jiang and Agrawal (2018) Peng Jiang and Gagan Agrawal. A linear speedup analysis of distributed deep learning with sparse and quantized communication. In Advances in Neural Information Processing Systems, pages 2530–2541, 2018.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, 2015.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1106–1114, 2012.
- Lan (2012) Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
- Li et al. (2014a) Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation, pages 583–598, 2014a.
- Li et al. (2014b) Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. Efficient mini-batch training for stochastic optimization. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 661–670, 2014b.
- Lin et al. (2018) Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In 6th International Conference on Learning Representations, 2018.
- Nesterov (2012) Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
- Polyak (1964) Boris Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4:1–17, 1964.
- Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
- Seide et al. (2014) Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, pages 1058–1062, 2014.
- Stich et al. (2018) Sebastian U. Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pages 4452–4463, 2018.
- Sutskever et al. (2013) Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1139–1147, 2013.
- Tseng (1998) Paul Tseng. An incremental gradient(-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization, 8(2):506–531, 1998.
- Wen et al. (2017) Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1508–1518, 2017.
- Yu et al. (2019) Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In Proceedings of the 36th International Conference on Machine Learning, 2019.