1 Introduction
Many machine learning problems can be formulated as the following optimization problem:
(1)   $\min_{w} F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$.

In (1), $w$ refers to the model parameter, $n$ is the number of training instances, and each $f_i(w)$ is the loss function defined on the $i$th instance. For example, given labeled training data $\{(x_i, y_i)\}_{i=1}^{n}$, if we set $f_i(w) = \log(1 + e^{-y_i x_i^T w})$, the problem is known as logistic regression (LR). Many deep learning models, like ResNet (He et al., 2016), can also be formulated in the form of (1).

Stochastic gradient descent (SGD) has been one of the most powerful optimization methods for solving (1). In the $t$th iteration, SGD randomly selects one minibatch of training data indexed by $B_t$ (Li et al., 2014) and updates the parameter as follows:

(2)   $w_{t+1} = w_t - \frac{\eta_t}{|B_t|} \sum_{i \in B_t} \nabla f_i(w_t)$,

where $w_t$ is the parameter value at the $t$th iteration, $B_t$ is the minibatch sampled at the $t$th iteration, and $\eta_t$ is the learning rate.
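As a concrete illustration, here is a minimal NumPy sketch of the update (2) for the logistic-regression example above. The function names, step size, and toy data are illustrative choices, not part of the paper.

```python
import numpy as np

def lr_loss_grad(w, X, y):
    # Gradient of the averaged logistic loss over the minibatch (X, y).
    z = -y * (X @ w)
    sigma = 1.0 / (1.0 + np.exp(-z))   # probability assigned to the wrong label
    return (X * (-y * sigma)[:, None]).mean(axis=0)

def sgd(X, y, eta, n_iters, batch_size, seed=0):
    # Minibatch SGD as in (2): sample B_t, step along the averaged gradient.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        idx = rng.choice(len(y), size=batch_size, replace=False)  # minibatch B_t
        w = w - eta * lr_loss_grad(w, X[idx], y[idx])
    return w

# Toy separable data: the label is the sign of the first feature.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
y = np.sign(X[:, 0])
w = sgd(X, y, eta=0.5, n_iters=500, batch_size=16)
print(np.mean(np.sign(X @ w) == y))   # training accuracy
```

On this toy problem the learned $w$ aligns with the first coordinate and the training accuracy approaches one.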
Recently, several variants of SGD (Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013; Defazio et al., 2014; Schmidt et al., 2017) have been proposed and have achieved better performance than traditional SGD in some cases, such as linear models. However, in other cases, such as deep learning models, these variants are not necessarily better than traditional SGD. Hence, the formulation in (2) is still the most widely used optimization method for general machine learning. Furthermore, the SGD in (2) is also easy to implement on distributed platforms such as Parameter Server: each worker calculates a minibatch gradient and sends it to the server; the server aggregates these gradients and updates the parameters. Hence, SGD is one of the most popular optimization methods for distributed learning.
Recently, quantization has attracted much attention since it can reduce the storage of data and models, as well as the cost of computation and communication for distributed learning (Zhang et al., 2017). Researchers have proposed many methods to combine quantization and SGD. In particular, for training neural networks, many heuristic methods have been proposed (Gupta et al., 2015; Chen et al., 2015; Hubara et al., 2016; Rastegari et al., 2016; Aji and Heafield, 2017; Lin et al., 2018) which quantize parameters, activations and gradients during the training procedure. Most of these methods are heuristic and lack theoretical guarantees. Recently, the authors of (Wen et al., 2017; Alistarh et al., 2017) proposed quantized SGD (QSGD), which compresses gradients with an unbiasedness guarantee. Using previous theory of SGD (Bach and Moulines, 2011; Shamir and Zhang, 2013), the methods in (Wen et al., 2017; Alistarh et al., 2017) converge well. For distributed learning, they only need to communicate low-precision gradients, which saves much communication cost, one of the biggest bottlenecks in distributed learning. The method in (Zhang et al., 2017) compresses the training data by executing multiple independent quantizations to make QSGD efficient. However, to obtain an unbiased quantization vector, all of the above methods introduce extra variance. Although the authors of (Zhang et al., 2017) propose an optimal quantization to reduce this variance, the variance still exists. Combined with the natural variance of stochastic gradients, these algorithms may not perform as well as SGD, and it seems hard to reduce the variance asymptotically when compressing gradients. To further reduce the quantization variance, researchers recently proposed to compress parameters instead of gradients (Sa et al., 2018; Tang et al., 2018). The method in (Sa et al., 2018) focuses on a variant of SGD called SVRG (Johnson and Zhang, 2013). In each epoch, it needs to pass through the training data three times, which is quite slow and inefficient for some models like deep learning models. The method in (Tang et al., 2018) focuses on the decentralized distributed framework.

In this paper, we propose a new quantized SGD method, called Quantized Epoch-SGD (QESGD), for communication-efficient distributed learning. QESGD adopts quantization on epoch-SGD (Hazan and Kale, 2014; Xu et al., 2017). QESGD compresses (quantizes) the parameter with variance reduction, so that it can achieve almost the same performance as SGD with less communication cost. Compared to existing QSGD methods, which need to decrease the learning rate after each iteration or set a quite small constant learning rate, QESGD only needs to decrease the learning rate after one or more epochs. Hence, the learning-rate schedule is more similar to the successful practical procedure adopted by existing deep learning platforms like PyTorch and TensorFlow. QESGD is implemented on the Parameter Server framework, and empirical results on distributed deep learning show that QESGD can outperform other state-of-the-art quantization methods and achieve the best performance.
2 Preliminary
In this paper, we use $w^*$ to denote the optimal solution of (1) and use $\|\cdot\|$ to denote the $L_2$ norm. We also make the following common assumptions throughout the paper.
Assumption 1
We assume that each $f_i$ is $L$-smooth ($L > 0$), which means $\forall w, w'$, $\|\nabla f_i(w) - \nabla f_i(w')\| \le L \|w - w'\|$.
Assumption 2
We assume that each $f_i$ is $\mu$-strongly convex ($\mu > 0$), which means $\forall w, w'$, $f_i(w) \ge f_i(w') + \nabla f_i(w')^T (w - w') + \frac{\mu}{2}\|w - w'\|^2$.
Assumption 3
We assume that the second moment of the stochastic gradients is bounded: there exists $G > 0$ such that $\mathbb{E}\|\nabla f_i(w)\|^2 \le G^2$, $\forall w$.
2.1 Quantization
For simplicity, we use uniform quantization (Alistarh et al., 2017; Sa et al., 2018) in this paper. For any scalar $v$, we use $Q_{\delta,b}(v)$ to denote the quantization result of $v$, where $\delta > 0$ is the quantization step and $b$ is the number of bits,

(3)   $\mathrm{clip}_{\delta,b}(v) = \max\{-2^{b-1}\delta,\ \min\{v,\ (2^{b-1}-1)\delta\}\}$,

and

(4)   $Q_{\delta,b}(v) = \delta \lfloor \bar v / \delta \rfloor + \delta \cdot \mathrm{Bernoulli}\!\left(\bar v/\delta - \lfloor \bar v/\delta \rfloor\right), \quad \bar v = \mathrm{clip}_{\delta,b}(v)$.

That is, $v$ is first clipped to the representable range $[-2^{b-1}\delta, (2^{b-1}-1)\delta]$ and then stochastically rounded to one of the two nearest grid points. For any vector, we also use $Q_{\delta,b}(\cdot)$ to denote the quantization result where each coordinate is quantized according to (4) independently. Then we have the following lemma about the quantization variance:

Lemma 2.1. Given fixed scalars $\delta$ and $b$, we have

(5)   $\mathbb{E}\left(Q_{\delta,b}(v) - \mathrm{clip}_{\delta,b}(v)\right)^2 \le \delta^2/4$.
The proof is similar to that of (Sa et al., 2018). If $v \notin [-2^{b-1}\delta, (2^{b-1}-1)\delta]$, then $\mathrm{clip}_{\delta,b}(v)$ is itself a grid point, so $Q_{\delta,b}(v) = \mathrm{clip}_{\delta,b}(v)$ and it is easy to get that the left-hand side of (5) is zero.
If $v \in [-2^{b-1}\delta, (2^{b-1}-1)\delta]$, let $p = v/\delta - \lfloor v/\delta \rfloor$; according to (4), we obtain $\mathbb{E}(Q_{\delta,b}(v) - v)^2 = \delta^2 p(1-p) \le \delta^2/4$.
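A small NumPy sketch of this quantizer, with an empirical check of the unbiasedness inside the range and of the variance bound. The function name and the exact clipping convention are our assumptions for illustration.

```python
import numpy as np

def quantize(v, delta, b, rng):
    # (3): clip to the representable range [-2^{b-1} delta, (2^{b-1}-1) delta].
    v = np.clip(v, -2 ** (b - 1) * delta, (2 ** (b - 1) - 1) * delta)
    # (4): stochastic rounding to one of the two nearest grid points.
    k = np.floor(v / delta)
    p = v / delta - k                       # probability of rounding up
    return delta * (k + (rng.random(np.shape(v)) < p))

rng = np.random.default_rng(0)
v, delta, b = 0.37, 0.1, 8
samples = quantize(np.full(100000, v), delta, b, rng)
print(samples.mean())                   # close to v: unbiased inside the range
print(samples.var() <= delta ** 2 / 4)  # variance bound of Lemma 2.1
```

Outside the representable range the result is deterministic (the clipped boundary value), which is exactly the biased case discussed below.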
Although such a quantization is a biased estimate ($\mathbb{E}\,Q_{\delta,b}(v) \neq v$ in general), Lemma 2.1 implies that the quantization operation defined in (4) does not take the result far away from the representable range $[-2^{b-1}\delta, (2^{b-1}-1)\delta]$, and if $v$ lies in this range, the quantization variance can be bounded by $\delta^2/4$.

2.2 Epoch SGD and Motivation
Epoch-SGD (Hazan and Kale, 2014) is presented in Algorithm 1. Epoch-SGD updates the parameter using a fixed learning rate $\eta_s$ within each epoch. After each epoch, it decreases the learning rate and increases the epoch length $T_s$. Such a training procedure is more practical than that in (2), in which the learning rate descends quickly to zero so that it is hard to get a good result.
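A common concrete schedule (an assumption here for illustration; the analysis only needs the learning rate to decrease and the epoch length to increase) halves $\eta_s$ and doubles $T_s$ after each epoch, which keeps the product $\eta_s T_s$ constant:

```latex
\eta_s = \frac{\eta_1}{2^{s-1}}, \qquad T_s = 2^{s-1} T_1
\quad\Longrightarrow\quad
\eta_s T_s = \eta_1 T_1, \qquad \sum_{s=1}^{S} T_s = \left(2^{S}-1\right) T_1 .
```

Hence the total number of inner iterations grows geometrically with the number of epochs $S$, while the learning rate changes only $S$ times, matching the stage-wise schedules used in practice.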
According to Algorithm 1, we can consider the $s$th outer iteration (epoch) as optimizing the following subproblem:

(6)   $\min_{z} F_s(z) := F(w_{s-1} + z)$,

using SGD with initialization $z = 0$ and a fixed learning rate $\eta_s$, where $w_{s-1}$ denotes the parameter produced by the previous epoch. Although SGD cannot reach the optimal solution with a fixed learning rate, it can get a good estimate. Furthermore, if $w_{s-1}$ gets close to $w^*$, then the optimal solution $\hat z_s = w^* - w_{s-1}$ of (6) gets close to $0$. Then we can use the bit centering technique (Sa et al., 2018) and compress the offset variable $z = w - w_{s-1}$ in Algorithm 1, which stays small and can therefore be quantized with a small step $\delta$.
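The procedure above can be sketched in a few lines. This is a toy sketch under the halving/doubling schedule; the function names and the noisy quadratic objective are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def epoch_sgd(grad, w0, eta1, T1, n_epochs, rng):
    # Epoch-SGD sketch: fixed learning rate within an epoch; afterwards the
    # rate is decreased and the epoch length increased (here halved/doubled).
    w, eta, T = np.array(w0, dtype=float), eta1, T1
    for _ in range(n_epochs):
        for _ in range(T):
            w = w - eta * grad(w, rng)   # inner SGD step with fixed eta
        eta, T = eta / 2, T * 2          # schedule update after the epoch
    return w

# Noisy gradient of the strongly convex toy objective F(w) = ||w||^2 / 2.
grad = lambda w, rng: w + 0.1 * rng.standard_normal(w.shape)
rng = np.random.default_rng(0)
w = epoch_sgd(grad, np.ones(5), eta1=0.5, T1=10, n_epochs=6, rng=rng)
print(np.linalg.norm(w))   # much smaller than the initial norm sqrt(5)
```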
3 QESGD
Now we present our new quantized SGD method, called QESGD, in Algorithm 2. In the $t$th inner iteration, it updates the to-be-quantized variable using the stochastic gradient and then compresses it according to (4). Using the quantized vector, it recovers the real model parameter and proceeds to the next update. For the choice of the parameters, we give details in a later section; this choice leads to the convergence of QESGD.
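A minimal single-machine sketch of this inner loop, assuming the quantizer of (3)-(4) and an epoch anchor around which the offset is compressed. The variable names and the toy problem are ours, not the paper's.

```python
import numpy as np

def quantize(v, delta, b, rng):
    # Uniform stochastic quantization as in (3)-(4).
    v = np.clip(v, -2 ** (b - 1) * delta, (2 ** (b - 1) - 1) * delta)
    k = np.floor(v / delta)
    return delta * (k + (rng.random(np.shape(v)) < v / delta - k))

def qesgd_epoch(grad, w_anchor, eta, T, delta, b, rng):
    # One epoch of QESGD (sketch): the offset z = w - w_anchor is updated by
    # SGD and re-quantized after every step, so only low-precision values of z
    # ever need to be stored or communicated.
    z = np.zeros_like(w_anchor)
    for _ in range(T):
        g = grad(w_anchor + z, rng)               # stochastic gradient at w
        z = quantize(z - eta * g, delta, b, rng)  # update offset, then compress
    return w_anchor + z                           # recover the real parameter

# Toy check on F(w) = ||w||^2 / 2 with noisy gradients, anchored at w = 1.
grad = lambda w, rng: w + 0.1 * rng.standard_normal(w.shape)
rng = np.random.default_rng(0)
w = qesgd_epoch(grad, np.ones(5), eta=0.1, T=200, delta=0.01, b=8, rng=rng)
print(np.linalg.norm(w))   # close to the optimum at 0
```

Because the anchor approaches the optimum across epochs, the offset shrinks and a smaller $\delta$ can be used, which is the variance-reduction effect discussed above.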
QESGD is also easy to implement on Parameter Server. The distributed version of QESGD is presented in Algorithm 3. Servers send the quantized vector to workers, which reduces much communication cost. For the convergence guarantee and the asymptotic reduction of quantization variance in theory, we do not compress the gradients in Algorithm 3. In practice, users can compress gradients carefully so that the communication cost is also reduced when workers push the gradients. For example, in the experiments of Alistarh et al. (2017), the authors split the vector into buckets and compress the buckets individually, which reduces the quantization variance. QESGD can also use this trick when compressing the parameter variable.
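As an illustration of why sending the quantized vector is cheap, here is a minimal sketch (hypothetical helper names, 8-bit case only) of packing grid values into one byte per coordinate before the server broadcasts them:

```python
import numpy as np

def encode(z, delta, b=8):
    # Pack grid values z (multiples of delta) into b-bit unsigned codes.
    # Only b = 8 is handled here; one byte per coordinate instead of four.
    return (np.round(z / delta).astype(np.int64) + 2 ** (b - 1)).astype(np.uint8)

def decode(codes, delta, b=8):
    return (codes.astype(np.int64) - 2 ** (b - 1)) * delta

delta = 0.01
z = np.array([-1.28, -0.05, 0.0, 0.42, 1.27])   # already on the quantization grid
codes = encode(z, delta)
print(codes.nbytes)                              # 5 bytes vs 20 for float32
print(np.allclose(decode(codes, delta), z))
```

The worker only needs the shared step $\delta$ to decode, so the per-coordinate payload drops from 32 bits to 8.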
4 Convergence analysis
In this section, we give the convergence analysis of QESGD and details on choosing the parameters in Algorithm 2. First, consider $\hat z_s$, the optimal solution of the subproblem defined in (6); we have the following lemma: with a suitable choice of the quantization step $\delta_s$, each coordinate of the compressed variable belongs to the representable range $[-2^{b-1}\delta_s, (2^{b-1}-1)\delta_s]$, so by the definition of $Q_{\delta_s,b}$ the clipping in (3) is inactive.
Let $\{w_{s,t}\}$ be the sequences generated by Algorithm 2. Under Assumptions 1, 2 and 3, we have the following result.
Let $\hat z_s$ be the optimal solution of the subproblem defined in (6); then the distance between the inner iterates and $\hat z_s$ can be bounded as follows.
The first inequality uses Lemma 2.1 and the lemma above. The last inequality uses the fact that the clipping is inactive. By the convexity of $F$, we obtain
(7) 
Summing up the above inequality from $t = 0$ to $T_s - 1$ and taking the expectation, we obtain
where the last inequality uses the smoothness property in Assumption 1.
Now let us give details on the choice of parameters for the final convergence. Taking the learning rate $\eta_s$ to be halved and the epoch length $T_s$ to be doubled after each epoch, we have
the following corollary. For convenience, we introduce a constant to absorb the problem-dependent factors. Then we have
(8) 
We prove the result by induction, assuming a constant that satisfies the base-case conditions. For the inductive step, we have
(9) 
Combining the above, we obtain the stated bound.
On the choice of $\delta_s$, it is in principle related to the full gradient. Computing the full gradient in every epoch is unacceptable because of the large scale of the training data. In fact, according to the corollary, the required step can be bounded in terms of $\eta_s$, which implies that we can directly set $\delta_s$ proportional to $\eta_s$ and thereby avoid the full gradient computation. At the same time, it implies that the quantization variance decreases to zero. On the choice of the number of bits $b$, the required value increases quite slowly as training proceeds, so within a finite training time we can consider it a constant. In our experiments, we find that setting $b = 8$ is good enough, which leads to the same performance as that of SGD.
5 Experiments
We use deep neural networks to evaluate QESGD. We run experiments on PyTorch with a TITAN Xp GPU. We compare our method with SGD and QSGD. Since the distributed versions of these methods take a synchronous strategy, their performance is equivalent to that on a single machine. Hence, we run experiments on a single machine to verify the impact of quantization on training and testing results. To evaluate the efficiency of the variance reduction of quantization in our method, we compress the whole vector directly using uniform quantization, without any other tricks, for both QESGD and QSGD.
CNN. First, we choose two CNN models, ResNet20 and ResNet56, and use the dataset CIFAR10. For QESGD, we set the quantization step using the full gradient at the initialization (we only calculate the full gradient w.r.t. the initialization), where the constant is chosen from a small candidate set. For QSGD, we set the quantization step according to the gradient that needs to be compressed. The learning rates of QESGD and QSGD are the same as that of SGD. The results are shown in Figure 1. We can see that QESGD achieves almost the same performance as SGD on both training and testing. Due to the quantization variance, QSGD is weaker, and the gap between QSGD and SGD is pronounced. We also train ResNet18 on ImageNet; the results are shown in Figure 2.

Much evidence has shown that weight decay affects the distribution of model parameters and gradients. Since we use uniform quantization, weight decay affects the quantization variance, and so does the number of bits. We therefore evaluate these methods on the larger model ResNet56 under different weight decays and numbers of quantization bits. The performance is reported in Table 1. We can see that: (a) under the same settings, QESGD is always better than QSGD; (b) without weight decay, the quantization methods are slightly weaker than SGD; (c) with fewer bits, the quantization methods deteriorate markedly, especially QSGD.
Table 1: Test accuracy (%) of ResNet56 on CIFAR10 under different weight decays and numbers of quantization bits.

Method | weight decay | bits | test accuracy
SGD    | 0            | 32   | 90.87
SGD    | 0.0001       | 32   | 92.61
QSGD   | 0            | 8    | 89.91
QSGD   | 0.0001       | 8    | 91.15
QSGD   | 0.0001       | 4    | 80.29
QESGD  | 0            | 8    | 90.64
QESGD  | 0.0001       | 8    | 92.61
QESGD  | 0.0001       | 4    | 87.72
RNN. We also evaluate our method on RNNs. We choose an LSTM model with two hidden layers, each with 128 units, and the dataset TinyShakespeare¹. The choice of the quantization step for QESGD and QSGD is the same as in the CNN experiments. The results are shown in Figure 3. QESGD still achieves almost the same performance as SGD, and is sometimes even better. In this experiment, the gap between QSGD and SGD is smaller than in the CNN experiments. This is due to the gradient clipping technique, which is common in the training of RNNs: it reduces the quantization variance of the gradients, so QSGD performs well.

¹ https://github.com/karpathy/char-rnn
Distributed training. We also evaluate the communication efficiency of distributed QESGD on Parameter Server. We conduct experiments on docker with 8 K40 GPUs and 1 server. We use three models: ResNet56, AlexNet and VGG19. The results are shown in Table 2. The speedup is defined as (time per epoch of SGD)/(time per epoch of QESGD) under the same number of GPUs. Since workers push full-precision gradients in both methods, while QESGD pulls 8-bit parameters and SGD pulls 32-bit parameters, the ideal speedup is 2/(1 + 8/32) = 1.6. Due to the computation cost, the results in Table 2 are smaller than 1.6. Overall, our method reduces communication efficiently.
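The ideal-speedup arithmetic can be checked directly; the split into push and pull traffic below follows the description above (gradients uncompressed, parameters quantized):

```python
# Bits communicated per coordinate per round on a synchronous Parameter Server.
push_sgd, pull_sgd = 32, 32  # SGD: full-precision gradients up, parameters down
push_q, pull_q = 32, 8       # QESGD: gradients full precision, parameters 8-bit

ideal_speedup = (push_sgd + pull_sgd) / (push_q + pull_q)
print(ideal_speedup)         # 1.6 = 2 / (1 + 8/32)
```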
Table 2: Speedup of distributed QESGD over SGD (ideal 1.6).

Model    | Parameters | GPUs | Speedup
ResNet56 | 0.85M      | 4    | 1.12
ResNet56 | 0.85M      | 8    | 1.31
AlexNet  | 57M        | 4    | 1.34
AlexNet  | 57M        | 8    | 1.49
VGG19    | 140M       | 4    | 1.39
VGG19    | 140M       | 8    | 1.38
6 Conclusion
In this paper, we propose a new quantized SGD method called QESGD. It reduces the quantization variance by compressing parameters instead of gradients. It is also easy to implement on distributed platforms, so that it can reduce communication via quantization.
References

Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 440–445, 2017.

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pages 1707–1718, 2017.

Francis R. Bach and Eric Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.

Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In Proceedings of the 32nd International Conference on Machine Learning, pages 2285–2294, 2015.

Aaron Defazio, Francis R. Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning, pages 1737–1746, 2015.

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1):2489–2512, 2014.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. CoRR, abs/1609.07061, 2016.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J. Smola. Efficient mini-batch training for stochastic optimization. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 661–670, 2014.

Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.

Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Computer Vision - ECCV 2016 - 14th European Conference, pages 525–542, 2016.

Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R. Aberger, Kunle Olukotun, and Christopher Ré. High-accuracy low-precision training. CoRR, abs/1803.03383, 2018.

Mark W. Schmidt, Nicolas Le Roux, and Francis R. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.

Ohad Shamir and Tong Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning, pages 71–79, 2013.

Hanlin Tang, Ce Zhang, Shaoduo Gan, Tong Zhang, and Ji Liu. Decentralization meets quantization. CoRR, abs/1803.06443, 2018.

Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pages 1508–1518, 2017.

Yi Xu, Qihang Lin, and Tianbao Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In Proceedings of the 34th International Conference on Machine Learning, pages 3821–3830, 2017.

Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proceedings of the 34th International Conference on Machine Learning, pages 4035–4043, 2017.