1 Introduction
Deep learning is booming thanks to enormous datasets and very large models. In fact, the largest datasets and models can no longer be trained on a single machine. One solution to this problem is to use distributed systems. The most common algorithms underlying deep learning are stochastic gradient descent (SGD) and its variants. As such, the problem of building and understanding distributed versions of SGD is being intensely studied.
Implementations of SGD on distributed systems and data-parallel versions of SGD are scalable and take advantage of multi-GPU systems. Data-parallel SGD, in particular, has received significant attention due to its excellent scalability properties (Zinkevich et al., 2010; Bekkerman et al., 2011; Recht et al., 2011; Dean et al., 2012; Coates et al., 2013; Chilimbi et al., 2014; Li et al., 2014; Duchi et al., 2015; Xing et al., 2015; Zhang et al., 2015; Alistarh et al., 2017). In data-parallel SGD, a large dataset is partitioned among processors, which work together to minimize an objective function. Each processor has access to the current parameter vector of the model. At each SGD iteration, each processor computes a stochastic gradient using its own local data and shares the gradient update with its peers. The processors collect and aggregate the stochastic gradients to compute the updated parameter vector.
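As a concrete sketch of this loop, here is a single-process simulation on a toy least-squares problem (the names `data_parallel_sgd_step` and `grad_fn` and the sharding are illustrative, not from any particular system):

```python
import numpy as np

def data_parallel_sgd_step(w, local_batches, grad_fn, lr):
    """One synchronous data-parallel SGD step (single-process simulation).

    Each 'processor' computes a stochastic gradient on its own shard,
    the gradients are aggregated (averaged), and every copy of the
    parameter vector receives the same update.
    """
    grads = [grad_fn(w, batch) for batch in local_batches]  # local gradients
    agg = np.mean(grads, axis=0)                            # aggregation step
    return w - lr * agg

# Toy example: least-squares loss f(w) = mean((X@w - y)^2), sharded over 4 workers.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 8)), rng.normal(size=64)

def grad_fn(w, idx):
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

shards = np.array_split(np.arange(64), 4)   # partition the dataset
w = np.zeros(8)
for _ in range(200):
    w = data_parallel_sgd_step(w, shards, grad_fn, lr=0.05)
```

Because every processor applies the same aggregated update, all local copies of the parameter vector stay synchronized.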
Increasing the number of processing machines reduces computational costs significantly. However, the communication costs of sharing and synchronizing huge gradient vectors and parameters increase dramatically as the size of the distributed system grows, and may thwart the anticipated benefits of reducing computational costs. Indeed, in practical scenarios, the communication time required to share stochastic gradients and parameters is the main performance bottleneck (Recht et al., 2011; Li et al., 2014; Seide et al., 2014; Strom, 2015; Alistarh et al., 2017). Reducing communication costs in data-parallel SGD is therefore an important problem.
One possible solution is inference acceleration through network/weight compression, i.e., using sparse and quantized deep neural networks (Wen et al., 2016; Hubara et al., 2016; Park et al., 2017). However, these techniques can make training more difficult in terms of matching the accuracy of the original networks (Wen et al., 2017).
Another promising solution to the problem of reducing communication costs of data-parallel SGD is gradient compression, e.g., through quantization (Dean et al., 2012; Seide et al., 2014; Sa et al., 2015; Gupta et al., 2015; Abadi et al., 2016; Zhou et al., 2016; Alistarh et al., 2017; Wen et al., 2017; Bernstein et al., 2018). Unlike full-precision data-parallel SGD, where each processor is required to broadcast its local gradient in full precision, i.e., transmit and receive huge full-precision vectors at each iteration, quantization requires each processor to transmit only a few communication bits per iteration for each component of the stochastic gradient.
One such proposal for combining quantization and SGD is quantized SGD (QSGD), due to Alistarh et al. (2017). In QSGD, stochastic gradient vectors are normalized to have unit norm and then compressed by quantizing each element to a uniform grid of quantization levels using a randomized method. Most lossy compression schemes do not provide convergence guarantees under standard assumptions. QSGD's quantization scheme, however, is designed to be unbiased, which implies that the quantized stochastic gradient is itself a stochastic gradient, only with higher variance determined by the dimension and the number of quantization levels. As a result,
Alistarh et al. (2017) are able to establish a number of theoretical guarantees for QSGD, including that it converges under standard assumptions. By changing the number of quantization levels, QSGD allows the user to trade off communication bandwidth and convergence time. Despite their theoretical guarantees based on quantizing after $\ell_2$ normalization, Alistarh et al. opt to present empirical results using $\ell_\infty$ normalization. We call this variation QSGDinf. While the empirical performance of QSGDinf is strong, their theoretical guarantees no longer apply. Indeed, in our own empirical evaluation of QSGD, we find that the variance induced by quantization is substantial, and the performance is far from that of SGD and QSGDinf.
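A minimal sketch of the uniform stochastic quantization step described above may clarify the construction (our own simplified rendering, without bucketing or the encoding stage; the function name and parameters are illustrative):

```python
import numpy as np

def qsgd_quantize(g, s, rng):
    """Unbiased uniform quantization with s levels in [0, 1] (a sketch of
    the scheme of Alistarh et al., 2017). Each |g_i|/||g|| is rounded to a
    neighboring grid point with probabilities chosen so that E[Q(g)] = g."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)
    r = np.abs(g) / norm * s          # position on the grid of s levels
    lower = np.floor(r)
    p_up = r - lower                  # probability of rounding up
    level = lower + (rng.random(g.shape) < p_up)
    return norm * np.sign(g) * level / s

# Unbiasedness: averaging many independent quantizations recovers g.
rng = np.random.default_rng(1)
g = rng.normal(size=1000)
est = np.mean([qsgd_quantize(g, s=4, rng=rng) for _ in range(2000)], axis=0)
```

The stochastic rounding is exactly what makes the quantized gradient an unbiased estimate, at the cost of extra variance.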
An important question is whether one can obtain guarantees as strong as those of QSGD while matching the performance of QSGDinf. In this work, we answer this question in the affirmative by modifying the quantization scheme underlying QSGD in a way that allows us to establish stronger theoretical guarantees on the variance, bandwidth, and cost to achieve a prescribed suboptimality gap. Instead of QSGD's uniform quantization scheme, we use an unbiased nonuniform logarithmic scheme, similar to those introduced in telephony systems for audio compression (Cattermole, 1969). We call the resulting algorithm nonuniformly quantized stochastic gradient descent (NUQSGD). Like QSGD, NUQSGD is a quantized data-parallel SGD algorithm with strong theoretical guarantees that allows the user to trade off communication costs with convergence speed. Unlike QSGD, NUQSGD has strong empirical performance on deep models and large datasets, matching that of QSGDinf.
The intuition behind the nonuniform quantization scheme underlying NUQSGD is that, after normalization, many elements of the normalized stochastic gradient will be near zero. By concentrating quantization levels near zero, we are able to establish stronger bounds on the excess variance. In the overparametrized regime of interest, these bounds decrease rapidly as the number of quantization levels increases. Combined with a bound on the expected code-length, we obtain a bound on the total communication cost of achieving an expected suboptimality gap. This bound is slightly stronger than the corresponding bound for QSGD.
To study how quantization affects convergence on state-of-the-art deep models, we compare NUQSGD, QSGD, and QSGDinf, focusing on training loss, variance, and test accuracy on standard deep models and large datasets. Using the same number of bits per iteration, experimental results show that NUQSGD has smaller variance than QSGD, as predicted by our theoretical results. This smaller variance also translates to improved optimization performance, in terms of both training loss and test accuracy. We also observe that NUQSGD matches the performance of QSGDinf in terms of variance and loss/accuracy.
1.1 Summary of Contributions


We establish stronger theoretical guarantees for the excess variance and communication costs of our gradient quantization method than those available for QSGD’s uniform quantization method.

We then establish stronger convergence guarantees for the resulting algorithm, NUQSGD, under standard assumptions.

We demonstrate that NUQSGD has strong empirical performance on deep models and large datasets. NUQSGD closes the gap between the theoretical guarantees of QSGD and the empirical performance of QSGDinf.
1.2 Related Work
Seide et al. (2014) proposed signSGD, an efficient heuristic scheme that reduces communication costs drastically by quantizing each gradient component to two values. Bernstein et al. (2018) later provided convergence guarantees for signSGD. Note that the quantization employed by signSGD is not unbiased, and so a new analysis was required. As the number of levels is fixed, signSGD does not provide any tradeoff between communication costs and convergence speed.
Sa et al. (2015) introduced Buckwild!, a lossy compressed SGD with convergence guarantees. The authors provided bounds on the error probability of SGD, assuming convexity and gradient sparsity.
Wen et al. (2017) proposed TernGrad, a stochastic quantization scheme with three levels. TernGrad also significantly reduces communication costs and obtains reasonable accuracy, with a small degradation in performance compared to full-precision SGD. Convergence guarantees for TernGrad rely on a nonstandard gradient norm assumption.
NUQSGD uses a logarithmic quantization scheme. Such schemes have long been used in telephony systems for audio compression (Cattermole, 1969). Logarithmic quantization schemes have also appeared in other contexts recently: Hou and Kwok (2018) studied weight distributions of long short-term memory networks and proposed to use logarithmic quantization for network compression.
Zhang et al. (2017) proposed a gradient compression scheme and introduced an optimal quantization scheme, but for the setting where the points to be quantized are known in advance. As a result, their scheme is not applicable to the communication setting of quantized data-parallel SGD.
2 Preliminaries: Data-parallel SGD and Convergence
We consider a high-dimensional machine learning model, parametrized by a vector $\mathbf{w} \in \mathbb{R}^d$. Let $\Omega \subseteq \mathbb{R}^d$ denote a closed and convex set. Our objective is to minimize $f : \Omega \to \mathbb{R}$, which is an unknown, differentiable, convex, and smooth function. The following summary is based on (Alistarh et al., 2017).
Setting some notation, denote by $\mathbb{E}[\cdot]$ the expectation operator; by $\|\mathbf{v}\|$ and $\|\mathbf{v}\|_0$ the Euclidean norm and the number of nonzero elements of a vector $\mathbf{v}$, respectively; and by $|\cdot|$ the length of a binary string, the length of a vector, or the cardinality of a set, depending on context. We use lowercase bold letters to denote vectors. Sets are typeset in a calligraphic font. The base-2 logarithm is denoted by $\log$, and the set of binary strings is denoted by $\{0,1\}^*$.
A function $f$ is $\beta$-smooth if, for all $\mathbf{u}, \mathbf{v} \in \Omega$, we have $\|\nabla f(\mathbf{u}) - \nabla f(\mathbf{v})\| \le \beta \|\mathbf{u} - \mathbf{v}\|$. We consider a probability space to represent the randomness in updates of the stochastic algorithm. Assume we have access to stochastic gradients of $f$, i.e., to a function $g$ such that $\mathbb{E}[g(\mathbf{w})] = \nabla f(\mathbf{w})$ for all $\mathbf{w} \in \Omega$. In the rest of the paper, we denote the stochastic gradient by $g(\mathbf{w})$ for notational simplicity. The update rule for conventional full-precision projected SGD is
$$\mathbf{w}_{t+1} = P_{\Omega}\big(\mathbf{w}_t - \alpha\, g(\mathbf{w}_t)\big), \qquad (1)$$
where $\mathbf{w}_t$ is the current parameter vector, $\alpha$ is the learning rate, and $P_{\Omega}$ is the Euclidean projection onto $\Omega$.
The stochastic gradient has a second-moment upper bound $B$ when $\mathbb{E}[\|g(\mathbf{w})\|^2] \le B$ for all $\mathbf{w} \in \Omega$. The stochastic gradient has a variance upper bound $\sigma^2$ when $\mathbb{E}[\|g(\mathbf{w}) - \nabla f(\mathbf{w})\|^2] \le \sigma^2$ for all $\mathbf{w} \in \Omega$. Note that a second-moment upper bound implies a variance upper bound, because the stochastic gradient is unbiased.
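This implication can be verified in one line, using unbiasedness, $\mathbb{E}[g(\mathbf{w})] = \nabla f(\mathbf{w})$:

```latex
\mathbb{E}\big[\|g(\mathbf{w}) - \nabla f(\mathbf{w})\|^2\big]
  = \mathbb{E}\big[\|g(\mathbf{w})\|^2\big] - \|\nabla f(\mathbf{w})\|^2
  \le \mathbb{E}\big[\|g(\mathbf{w})\|^2\big] \le B,
```

so a second-moment upper bound $B$ is also a variance upper bound.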
We have classical convergence guarantees for conventional fullprecision SGD given access to stochastic gradients at each iteration:
Theorem 1 (Bubeck 2015, Theorem 6.3).
Let $f : \Omega \to \mathbb{R}$ denote a convex and $\beta$-smooth function and let $R^2 \triangleq \sup_{\mathbf{w} \in \Omega} \|\mathbf{w} - \mathbf{w}_0\|^2$. Suppose that the projected SGD update (1) is executed for $T$ iterations with $\alpha = 1/(\beta + 1/\gamma)$ where $\gamma = (R/\sigma)\sqrt{2/T}$. Given repeated and independent access to stochastic gradients with a variance upper bound $\sigma^2$, projected SGD satisfies
$$\mathbb{E}\Big[f\Big(\frac{1}{T}\sum_{t=0}^{T}\mathbf{w}_t\Big)\Big] - \min_{\mathbf{w}\in\Omega} f(\mathbf{w}) \le R\sqrt{\frac{2\sigma^2}{T}} + \frac{\beta R^2}{T}. \qquad (2)$$
Minibatch SGD (with larger batch sizes) and data-parallel SGD are two common SGD variants used in practice to reduce variance and improve the computational efficiency of conventional SGD.
Following Alistarh et al. (2017), we consider data-parallel SGD, a synchronous distributed framework consisting of $P$ processors that partition a large dataset among themselves. This framework models real-world systems with multiple GPU resources. Each processor keeps a local copy of the parameter vector and has access to independent, private stochastic gradients of $f$.
At each iteration, each processor computes its own stochastic gradient based on its local data and then broadcasts it to all peers. Each processor receives and aggregates the stochastic gradients from all peers to obtain the updated parameter vector. In detail, the update rule for full-precision data-parallel SGD is
$$\mathbf{w}_{t+1} = P_{\Omega}\Big(\mathbf{w}_t - \frac{\alpha}{P}\sum_{\ell=1}^{P} g_\ell(\mathbf{w}_t)\Big),$$
where $g_\ell(\mathbf{w}_t)$ is the stochastic gradient computed and broadcast by processor $\ell$. Provided that each $g_\ell(\mathbf{w}_t)$ is a stochastic gradient with a variance upper bound $\sigma^2$, the average $\frac{1}{P}\sum_{\ell=1}^{P} g_\ell(\mathbf{w}_t)$ is a stochastic gradient with a variance upper bound $\sigma^2/P$. Thus, aggregation improves convergence of SGD by reducing the first term of the upper bound in (2). Assume each processor computes a minibatch gradient of size $b$. Then, this update rule is essentially a minibatch update of size $Pb$.
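The variance reduction from aggregation follows directly from the independence of the local gradients (writing $P$ for the number of processors):

```latex
\operatorname{Var}\!\Big(\frac{1}{P}\sum_{\ell=1}^{P} g_\ell(\mathbf{w})\Big)
  = \frac{1}{P^2}\sum_{\ell=1}^{P} \operatorname{Var}\big(g_\ell(\mathbf{w})\big)
  \le \frac{\sigma^2}{P}.
```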
Data-parallel SGD is described in Algorithm 1. Full-precision data-parallel SGD is a special case of Algorithm 1 with identity encoding and decoding mappings. Otherwise, the decoded stochastic gradient is likely to differ from the original local stochastic gradient.
Applying Theorem 1, we have the following convergence guarantees for fullprecision dataparallel SGD:
Corollary 1 (Alistarh et al. 2017, Corollary 2.2).
3 Nonuniformly Quantized Stochastic Gradient Descent (NUQSGD)
Data-parallel SGD reduces computational costs significantly. However, the communication cost of broadcasting stochastic gradients is the main performance bottleneck in large-scale distributed systems. In order to reduce communication costs and accelerate training, Alistarh et al. (2017) introduced a compression scheme that produces a compressed and unbiased stochastic gradient, suitable for use in SGD.
At each iteration of QSGD, each processor broadcasts an encoding of its own compressed stochastic gradient, decodes the stochastic gradients received from the other processors, and sums all the quantized vectors to produce a stochastic gradient. To compress the gradients, every coordinate (with respect to the standard basis) of the stochastic gradient is normalized by the Euclidean norm of the gradient and then stochastically quantized to one of a small number of quantization levels distributed uniformly in the unit interval. The stochasticity of the quantization is necessary to avoid introducing bias.
Alistarh et al. (2017) give a simple argument that provides a lower bound on the number of coordinates that are quantized to zero in expectation. Encoding these zeros efficiently provides communication savings at each iteration. However, the cost of their scheme is greatly increased variance in the gradient, and thus slower overall convergence. In order to optimize overall performance, we must balance communication savings with variance.
By simple counting arguments, the distribution of the (normalized) coordinates cannot be uniform. Indeed, this is the basis of the lower bound on the number of zeros. These arguments make no assumptions on the data distribution, and rely entirely on the fact that the quantities being quantized are the coordinates of a unitnorm vector. Uniform quantization does not capture the properties of such vectors, leading to substantial gradient variance.
3.1 Nonuniform Quantization
In this paper, we propose and study a new scheme to quantize normalized gradient vectors. Instead of uniformly distributed quantization levels, as proposed by Alistarh et al. (2017), we consider quantization levels that are nonuniformly distributed in the unit interval, as depicted in Figure 1. In order to obtain a quantized gradient that is suitable for SGD, we need the quantized gradient to remain unbiased. Alistarh et al. (2017) achieve this via a randomized quantization scheme, which can easily be generalized to the case of nonuniform quantization levels.
Using a carefully parametrized generalization of the unbiased quantization scheme introduced by Alistarh et al., we can control both the cost of communication and the variance of the gradient. Compared to a uniform quantization scheme, our nonuniform scheme reduces quantization error and variance by better matching the properties of normalized vectors. In particular, by increasing the number of quantization levels near zero, we obtain a stronger variance bound. Empirically, our scheme also better matches the distribution of normalized coordinates observed on real datasets and networks.
We now describe the nonuniform quantization scheme: Let $s$ be the number of internal quantization levels, and let $\mathcal{L} = (l_0, l_1, \ldots, l_{s+1})$ denote the sequence of quantization levels, where $0 = l_0 < l_1 < \cdots < l_{s+1} = 1$. For $r \in [0, 1]$, let $\tau(r)$ denote the index of the level interval containing $r$, i.e., $l_{\tau(r)} \le r < l_{\tau(r)+1}$, and let $p(r) = \frac{r - l_{\tau(r)}}{l_{\tau(r)+1} - l_{\tau(r)}}$ denote the relative position of $r$ within that interval. Note that $r = (1 - p(r))\, l_{\tau(r)} + p(r)\, l_{\tau(r)+1}$.
Definition 1.
The nonuniform quantization of a vector $\mathbf{v} \in \mathbb{R}^d$ is
$$Q(\mathbf{v}) = \big[\, \|\mathbf{v}\| \cdot \mathrm{sign}(v_1) \cdot h_1, \;\ldots,\; \|\mathbf{v}\| \cdot \mathrm{sign}(v_d) \cdot h_d \,\big]^\top, \qquad (3)$$
where, letting $r_i = |v_i| / \|\mathbf{v}\|$, the $h_i$'s are independent random variables given by
$$h_i = \begin{cases} l_{\tau(r_i)} & \text{with probability } 1 - p(r_i),\\ l_{\tau(r_i)+1} & \text{with probability } p(r_i). \end{cases} \qquad (4)$$
We note that the distribution of $h_i$ in (4) satisfies $\mathbb{E}[h_i] = r_i$ and achieves the minimum variance over all distributions supported on $\mathcal{L}$ that satisfy $\mathbb{E}[h_i] = r_i$.
In the following, we focus on a special case of nonuniform quantization with $\mathcal{L} = (0, 2^{1-s}, \ldots, 2^{-2}, 2^{-1}, 1)$ as the quantization levels.
The intuition behind this quantization scheme is that large values of the normalized coordinates are very unlikely to be observed in the stochastic gradient vectors of machine learning models. Stochastic gradients are observed to be dense vectors (Bernstein et al., 2018). Hence, it is natural to use fine intervals for small values to reduce quantization error and control the variance.
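A minimal sketch of this nonuniform quantizer, assuming the exponential level grid above (the function name is illustrative, and bucketing and encoding are omitted):

```python
import numpy as np

def nuq_quantize(g, s, rng):
    """Unbiased nonuniform quantization of a gradient vector (a sketch).

    Uses the exponential level grid {0, 2^(1-s), ..., 2^-2, 2^-1, 1} on the
    normalized magnitudes |g_i|/||g||, rounding each magnitude to one of the
    two neighboring levels with probabilities chosen so that E[Q(g)] = g.
    """
    levels = np.concatenate(([0.0], 2.0 ** np.arange(1 - s, 1)))  # s + 1 values
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)
    r = np.abs(g) / norm
    # Index j of the interval [levels[j], levels[j+1]] containing each r_i.
    j = np.clip(np.searchsorted(levels, r, side="right") - 1, 0, len(levels) - 2)
    lo, hi = levels[j], levels[j + 1]
    p_up = (r - lo) / (hi - lo)        # round up with this probability (unbiased)
    q = np.where(rng.random(g.shape) < p_up, hi, lo)
    return norm * np.sign(g) * q

# Averaging many independent quantizations recovers the original gradient.
rng = np.random.default_rng(0)
g = rng.normal(size=100)
est = np.mean([nuq_quantize(g, s=3, rng=rng) for _ in range(2000)], axis=0)
```

Note how the grid spacing shrinks geometrically toward zero, which is exactly where most normalized coordinates fall.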
3.2 Encoding
After quantizing the stochastic gradient with a small number of discrete levels, each processor must encode its local gradient into a binary string for broadcasting. We now describe this encoding.
By inspection, the quantized gradient $Q(\mathbf{v})$ is determined by the tuple $(\|\mathbf{v}\|, \boldsymbol{\sigma}, \mathbf{h})$, where $\|\mathbf{v}\|$ is the norm of the gradient, $\boldsymbol{\sigma}$ is the vector of signs of the coordinates $v_i$, and $\mathbf{h}$ is the vector of quantized normalized coordinates. We can describe the ENCODE function (for Algorithm 1) in terms of this tuple and a scheme $(\Phi, \Phi^{-1})$ for encoding and decoding positive integers, which we define below.
The encoding of a stochastic gradient is as follows: We first encode the norm $\|\mathbf{v}\|$ using a fixed number of bits; in practice, we use standard 32-bit floating-point encoding. We then proceed in rounds, $j = 0, 1, \ldots$. On round $j$, having transmitted all nonzero coordinates up to and including coordinate $i_j$, we transmit $\Phi(i_{j+1} - i_j)$, where $i_{j+1}$ is either (i) the index of the first nonzero coordinate after $i_j$ (with $i_0 = 0$) or (ii) the index of the last nonzero coordinate. In the former case, we then transmit one bit encoding the sign of that coordinate, transmit its quantization level using $\Phi$, and proceed to the next round. In the latter case, the encoding is complete after transmitting the sign and the quantization level.
The DECODE function (for Algorithm 1) first reads the fixed number of bits to reconstruct $\|\mathbf{v}\|$. Using $\Phi^{-1}$, it decodes the index of the first nonzero coordinate, reads the bit indicating the sign, and then uses $\Phi^{-1}$ again to determine the quantization level of this first nonzero coordinate. The process proceeds in rounds, mimicking the encoding process, finishing when all coordinates have been decoded.
Like Alistarh et al. (2017), we use Elias recursive coding (Elias, 1975, ERC) to encode positive integers. ERC is simple and has several desirable properties, including the property that the coding scheme assigns shorter codes to smaller values, which makes sense in our scheme as they are more likely to occur. Elias coding is a universal lossless integer coding scheme with a recursive encoding and decoding structure.
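For illustration, Elias omega (recursive) coding of positive integers can be sketched as follows; this is a textbook rendering, not the paper's implementation:

```python
def elias_omega_encode(n: int) -> str:
    """Elias recursive (omega) code for a positive integer, as a bit string.

    Smaller integers get shorter codes: 1 -> '0', 2 -> '100', 3 -> '110'.
    """
    assert n >= 1
    code = "0"                      # terminating zero bit
    while n > 1:
        b = bin(n)[2:]              # binary representation, MSB first
        code = b + code             # prepend each recursion group
        n = len(b) - 1
    return code

def elias_omega_decode(bits: str) -> tuple:
    """Decode one integer from the front of a bit string.

    Returns (value, number_of_bits_consumed), so codes can be concatenated.
    """
    i, n = 0, 1
    while bits[i] == "1":           # a '0' bit terminates the code
        length = n + 1
        n = int(bits[i:i + length], 2)
        i += length
    return n, i + 1
```

Because decoding reports how many bits it consumed, a stream of concatenated codes (as in the round-based encoding above) can be read back unambiguously.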
4 Theoretical Guarantees
In this section, we provide theoretical guarantees for NUQSGD, giving variance and code-length bounds, and using these in turn to compare NUQSGD and QSGD. The proofs of Theorems 2 and 3 are provided in Appendices B and C, respectively.
Theorem 2 (Variance bound).
Let $\mathbf{v} \in \mathbb{R}^d$. The nonuniform quantization $Q(\mathbf{v})$ of $\mathbf{v}$ satisfies $\mathbb{E}[Q(\mathbf{v})] = \mathbf{v}$, i.e., it is unbiased. Furthermore, provided that $d$ is sufficiently large, we have
$$\mathbb{E}\big[\|Q(\mathbf{v}) - \mathbf{v}\|^2\big] \le \epsilon_Q \|\mathbf{v}\|^2, \qquad (5)$$
where $\epsilon_Q$ is a factor determined by the dimension $d$ and the number of quantization levels $s$.
The result in Theorem 2 implies that if $g(\mathbf{w})$ is a stochastic gradient with a second-moment bound $B$, then $Q(g(\mathbf{w}))$ is a stochastic gradient with a variance upper bound $\epsilon_Q B$. In the range of interest where $d$ is sufficiently large, the variance upper bound decreases as the number of quantization levels increases. To obtain this data-independent bound, we establish upper bounds on the number of coordinates of $\mathbf{v}$ falling into the intervals defined by the quantization levels. We note that, for large values of $s$, the variance bound becomes loose, although this is not the range of interest.
Theorem 3 (Codelength bound).
Let $\mathbf{v} \in \mathbb{R}^d$. Provided $d$ is sufficiently large, the expected number of communication bits needed to transmit $Q(\mathbf{v})$ is bounded above by an expression, (6), determined by the dimension $d$ and the number of quantization levels $s$.
Theorem 3 provides a bound on the expected number of communication bits needed to encode the quantized stochastic gradient. The requirement that $d$ be large is a mild assumption in practice. As one would expect, the bound (6) increases monotonically in $d$ and $s$. In the sparse case, choosing a small number of levels yields a correspondingly small upper bound on the expected code-length.
Combining the upper bounds above on the variance and codelength, Corollary 1 implies the following guarantees for NUQSGD:
Theorem 4 (NUQSGD for smooth convex optimization).
Let $f$, $R$, $\sigma$, and $\gamma$ be defined as in Theorem 1, and let $\epsilon_Q$ be defined as in Theorem 2. With ENCODE and DECODE defined as in Section 3.2, suppose that Algorithm 1 is executed for $T$ iterations with a learning rate $\alpha$ on $P$ processors, each with access to independent stochastic gradients of $f$ with a second-moment bound $B$. Then a number of iterations $T$, determined by the target suboptimality gap, $R$, $B$, $\epsilon_Q$, and $P$, suffices to guarantee
(7)
In addition, the expected number of communication bits NUQSGD requires per iteration is bounded by (6).
Proof. The result follows by combining Corollary 1 with the variance bound of Theorem 2 and the code-length bound of Theorem 3. ∎
Note that we can also apply NUQSGD to nonconvex problems and provide convergence guarantees as is done for QSGD (Alistarh et al., 2017, Theorem 3.5).
4.1 NUQSGD vs QSGD
In the following, we compare QSGD and NUQSGD in terms of bounds on the expected number of communication bits required to achieve a given suboptimality gap $\epsilon$.
The quantity that controls our guarantee on the convergence speed in both algorithms is the variance upper bound, which in turn is controlled by the quantization schemes. Note that the number of quantization levels, $s$, is usually small in practice. On the other hand, the dimension, $d$, can be very large, especially in overparameterized networks. In Figure 2, we show that the quantization scheme underlying NUQSGD results in substantially smaller variance upper bounds for plausible ranges of $s$ and $d$. Note that these bounds do not make any assumptions about the dataset or the structure of the network.
For any (nonrandom) number of iterations $T$, an upper bound $N_A$, holding uniformly over iterations, on the expected number of bits used by an algorithm $A$ to communicate the gradient on iteration $t$ yields an upper bound $T N_A$ on the expected number of bits communicated over $T$ iterations by algorithm $A$. Taking $T = T_{A,\epsilon}$ to be the (minimum) number of iterations needed to guarantee an expected suboptimality gap of $\epsilon$ based on the properties of $A$, we obtain an upper bound, $\mathcal{N}_{A,\epsilon} = T_{A,\epsilon} N_A$, on the expected number of bits communicated on a run expected to achieve a suboptimality gap of at most $\epsilon$.
Theorem 5 (Expected number of communication bits).
Provided that $d$ is sufficiently large and the number of quantization levels is held fixed, the upper bound on the expected number of communication bits required by NUQSGD to achieve an expected suboptimality gap of $\epsilon$ is smaller than the corresponding bound for QSGD.
Proof.
Assuming $d$ is large and ignoring all but the terms depending on $d$ and $s$, the bound for NUQSGD follows from Theorems 2 and 3, while the bound for QSGD follows from the corresponding variance and code-length bounds of Alistarh et al. (2017). In overparameterized networks, where $d$ is large, comparing the dominant terms of the two bounds yields the claim. ∎
Focusing on the dominant terms in the expressions for the overall number of communication bits required to guarantee a suboptimality gap of $\epsilon$, we observe that NUQSGD provides stronger guarantees. Note that our stronger guarantees come without any assumptions about the data.
5 Empirical Evaluation
Figure 3: Training loss for the entire training set on CIFAR10 (left) and minibatch training loss on ImageNet (right) for ResNet models trained from random initialization until convergence. QSGD, QSGDinf, and NUQSGD are trained by simulating the quantization and dequantization of the gradients from 8 GPUs on CIFAR10 and 2 GPUs on ImageNet. SGD refers to single-GPU training and is shown to highlight the significance of the gap between QSGD and QSGDinf. SuperSGD refers to simulating full-precision distributed training without quantization; SuperSGD is impractical in scenarios with limited bandwidth.

The main purpose of our work is to close the gap between theory and practice using our method, NUQSGD. Alistarh et al. (2017) introduced two quantization methods: QSGD, which has theoretical guarantees, and a slight modification, QSGDinf, which performs well in practice but lacks theoretical guarantees. QSGDinf is a uniform quantization scheme in which the Euclidean norm is replaced by the infinity norm. Our method has theoretical guarantees and, as we show in this section, matches the performance of QSGDinf, while QSGD has inferior performance. We compare the performance of these distributed methods to full-precision (32-bit) single-GPU SGD and distributed full-precision SGD (SuperSGD). We investigate the impact of quantization on training performance by measuring loss, variance, and accuracy for ResNet models (He et al., 2016) applied to ImageNet (Deng et al., 2009) and CIFAR10 (Krizhevsky, 2009).
Given fixed communication time, convergence speed is the essential performance indicator of synchronized distributed algorithms. We configure all quantization methods to use the same number of communication bits per iteration, so they incur the same communication cost, use the same bandwidth, and have the same wall-clock time per iteration.
We evaluate these methods on two image classification datasets: ImageNet and CIFAR10. We train ResNet110 on CIFAR10 with a fixed minibatch size and base learning rate. On ImageNet, we train ResNet34, a smaller model chosen to lessen the cost of the experiments, with the base learning rate that we found works best for all methods. Momentum and weight decay are fixed across all experiments, as are the bucket size and the number of quantization bits. (We observe similar results in experiments with various bucket sizes and numbers of bits.) We simulate a multi-GPU scenario for all three quantization methods by estimating the gradient from independent minibatches and aggregating them after quantization and dequantization.
In Figure 3, we show the training loss on CIFAR10 on the entire training set and the minibatch training loss on ImageNet, with 8 GPUs and 2 GPUs, respectively. We observe that NUQSGD and QSGDinf improve training loss compared to QSGD on ImageNet. We observe a significant gap in training loss on CIFAR10, and the gap grows as training proceeds. We also observe similar performance gaps in test accuracy (provided in Appendix D). In particular, unlike NUQSGD, QSGD does not achieve the test accuracy of full-precision SGD.
We also measure the variance and normalized variance at fixed snapshots during training by evaluating multiple gradient estimates using each quantization method. All methods are evaluated on the same trajectory traversed by the single-GPU SGD. These plots answer a specific question: What would the variance of the first gradient estimate be if one were to train using SGD for any number of iterations and then continue the optimization using another method? The entire future trajectory may change by taking a single good or bad step. We can study the variance along any trajectory; however, the trajectory of SGD is particularly interesting because it covers a subset of points in the parameter space that is likely to be traversed by any first-order optimizer. For the multidimensional parameter space, we average the variance over dimensions.
Figure 4 (left) shows the variance of the gradient estimates on the trajectory of single-GPU SGD on CIFAR10. We observe that QSGD has particularly high variance, while QSGDinf and NUQSGD have lower variance than single-GPU SGD.
We also propose another measure of stochasticity, the normalized variance, i.e., the variance normalized by the norm of the gradient. The mean normalized variance is expressed in terms of the loss of the model, parametrized by the weight vector, on each sample, where the variance is taken over the randomness of the algorithm, e.g., randomness in sampling and quantization. The normalized variance can be interpreted as the inverse of the signal-to-noise ratio (SNR) for each dimension. We argue that noise in optimization is more troubling when it is significantly larger than the gradient. For sources of noise, such as quantization, that stay constant during training, the negative impact might only be observed when the norm of the gradient becomes small.
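A sketch of this measurement (illustrative names and synthetic data; we normalize by the squared gradient norm so that the ratio is scale-free, which is one plausible reading of the definition):

```python
import numpy as np

def mean_normalized_variance(grad_samples, true_grad):
    """Per-dimension variance of a gradient estimator, plus the same
    quantity normalized by the squared gradient norm (an inverse SNR).

    grad_samples: array of shape (num_samples, d) of stochastic gradients.
    true_grad: the reference gradient, shape (d,).
    """
    var_per_dim = np.var(grad_samples, axis=0)   # variance of each coordinate
    mean_var = float(np.mean(var_per_dim))       # averaged over dimensions
    norm_sq = float(np.dot(true_grad, true_grad))
    return mean_var, mean_var / norm_sq

# Example with synthetic noise of known scale around a fixed gradient.
rng = np.random.default_rng(0)
true_grad = rng.normal(size=50)
samples = true_grad + 0.1 * rng.normal(size=(10000, 50))
mean_var, norm_var = mean_normalized_variance(samples, true_grad)
```

With a constant noise scale, the normalized variance grows as the gradient norm shrinks, matching the observation that constant quantization noise hurts most late in training.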
Figure 4 (right) shows the mean normalized variance of the gradient versus the training iteration. Observe that the normalized variance for QSGD stays relatively constant, while the unnormalized variance of QSGD drops after the learning rate drops. This suggests that the quantization noise of QSGD can cause slower convergence at the end of training than at the beginning.
These observations validate our theoretical results that NUQSGD has smaller variance for large models with a small number of quantization bits.
6 Conclusions
We study a data-parallel and communication-efficient version of stochastic gradient descent. Building on QSGD (Alistarh et al., 2017), we study a nonuniform quantization scheme and establish upper bounds on the variance of nonuniform quantization and on the expected code-length. In the overparametrized regime of interest, the former decreases as the number of quantization levels increases, while the latter increases with the number of quantization levels. Thus, this scheme provides a trade-off between communication efficiency and convergence speed. We compare NUQSGD and QSGD in terms of their variance bounds and the expected number of communication bits required to meet a given convergence error, and show that NUQSGD provides stronger guarantees. Experimental results are consistent with our theoretical results and confirm that NUQSGD matches the performance of QSGDinf when applied to practical deep models and datasets, including ImageNet. Thus, NUQSGD closes the gap between the theoretical guarantees of QSGD and the empirical performance of QSGDinf.
Acknowledgement
The authors would like to thank Dan Alistarh and Shaoduo Gan for helpful discussions and access to code. ARK was supported by an NSERC Postdoctoral Fellowship. DMR was supported by an NSERC Discovery Grant and an Ontario Early Researcher Award.
References
 Alistarh et al. [2017] D. Alistarh, D. Grubic, J. Z. Li, R. Tomioka, and M. Vojnovic. QSGD: Communicationefficient SGD via gradient quantization and encoding. In Proc. Advances in Neural Information Processing Systems (NIPS), 2017.
 Zinkevich et al. [2010] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Proc. Advances in Neural Information Processing Systems (NIPS), 2010.
 Bekkerman et al. [2011] R. Bekkerman, M. Bilenko, and J. Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.
 Recht et al. [2011] B. Recht, C. Ré, S. Wright, and F. Niu. Hogwild: A lockfree approach to parallelizing stochastic gradient descent. In Proc. Advances in Neural Information Processing Systems (NIPS), 2011.
 Dean et al. [2012] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In Proc. Advances in Neural Information Processing Systems (NIPS), 2012.
 Coates et al. [2013] A. Coates, B. Huval, T. Wang, D. Wu, B. Catanzaro, and A. Ng. Deep learning with COTS HPC systems. In Proc. International Conference on Machine Learning (ICML), 2013.
 Chilimbi et al. [2014] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In Proc. USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
 Li et al. [2014] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.Y. Su. Scaling distributed machine learning with the parameter server. In Proc. USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.
 Duchi et al. [2015] J. C Duchi, S. Chaturapruek, and C. Ré. Asynchronous stochastic convex optimization. In Proc. Advances in Neural Information Processing Systems (NIPS), 2015.
 Xing et al. [2015] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.
 Zhang et al. [2015] S. Zhang, A. E. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. In Proc. Advances in Neural Information Processing Systems (NIPS), 2015.
 Seide et al. [2014] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1bit stochastic gradient descent and its application to dataparallel distributed training of speech DNNs. In Proc. INTERSPEECH, 2014.
 Strom [2015] N. Strom. Scalable distributed DNN training using commodity GPU cloud computing. In Proc. INTERSPEECH, 2015.
 Wen et al. [2016] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In Proc. Advances in Neural Information Processing Systems (NIPS), 2016.
 Hubara et al. [2016] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Proc. Advances in Neural Information Processing Systems (NIPS), 2016.
 Park et al. [2017] J. Park, S. Li, W. Wen, P. Tang, H. Li, Y. Chen, and P. Dubey. Faster CNNs with direct sparse convolutions and guided pruning. In Proc. International Conference on Learning Representations (ICLR), 2017.
 Wen et al. [2017] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Proc. Advances in Neural Information Processing Systems (NIPS), 2017.
 Sa et al. [2015] C. M. D. Sa, C. Zhang, K. Olukotun, and C. Ré. Taming the wild: A unified analysis of Hogwild-style algorithms. In Proc. Advances in Neural Information Processing Systems (NIPS), 2015.
 Gupta et al. [2015] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. In Proc. International Conference on Machine Learning (ICML), 2015.
 Abadi et al. [2016] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, and M. Devin. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
 Zhou et al. [2016] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160, 2016.
 Bernstein et al. [2018] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proc. International Conference on Machine Learning (ICML), 2018.
 Cattermole [1969] K. W. Cattermole. Principles of pulse code modulation. Iliffe, 1969.
 Hou and Kwok [2018] L. Hou and J. T. Kwok. Lossaware weight quantization of deep networks. In Proc. International Conference on Learning Representations (ICLR), 2018.
 Zhang et al. [2017] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang. ZipML: Training linear models with end-to-end low precision, and a little bit of deep learning. In Proc. International Conference on Machine Learning (ICML), 2017.
 Bubeck [2015] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4):231–358, 2015.
 Elias [1975] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975.

 He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 Deng et al. [2009] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
 Krizhevsky [2009] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Appendix A Elias Recursive Coding Pseudocode
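Elias recursive coding (ERC, also known as Elias omega coding) encodes a positive integer by recursively prefixing each binary group with the encoding of its length, terminated by a zero bit. A minimal Python sketch, assuming the standard omega construction (the function names here are ours):

```python
def elias_omega_encode(n: int) -> str:
    """Elias recursive (omega) code of a positive integer, as a bit string."""
    assert n >= 1
    code = "0"                       # terminating zero bit
    while n > 1:
        b = bin(n)[2:]               # binary representation, starts with '1'
        code = b + code              # prepend this group
        n = len(b) - 1               # recurse on (group length - 1)
    return code


def elias_omega_decode(bits: str) -> int:
    """Decode the first Elias-omega codeword in a bit string."""
    n, i = 1, 0
    while bits[i] == "1":
        group = bits[i : i + n + 1]  # the leading '1' plus n more bits
        i += n + 1
        n = int(group, 2)
    return n
```

For example, `elias_omega_encode(1)` is `"0"`, and `elias_omega_encode(17)` concatenates the groups `"10"`, `"100"`, `"10001"`, and the terminating `"0"`; decoding reads groups of growing length until it reaches the zero bit.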
Appendix B Proof of Theorem 2 (Variance Bound)
We first find a simple expression for the variance of for an arbitrary quantization scheme in the following lemma:
Lemma 1.
Let , , and fix . The variance of for a general sequence of quantization levels is given by
(9) 
where and are defined in Section 3.1.
Proof.
Noting that the random quantization is i.i.d. over the elements of a stochastic gradient, we can decompose as:
(10) 
where . Computing the variance of (4), we can show that . ∎
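As a quick numerical sanity check on the per-coordinate term in Lemma 1, the sketch below simulates unbiased stochastic rounding between two adjacent levels `lo` and `hi`; the empirical variance should match `p * (1 - p) * (hi - lo)**2`, where `p` is the relative position of the value inside the interval (all names are ours, for illustration only):

```python
import random

def stochastic_round(v, lo, hi, rng=random):
    """Round v in [lo, hi] up to hi with probability (v - lo)/(hi - lo),
    otherwise down to lo; this rule is unbiased: E[result] = v."""
    p = (v - lo) / (hi - lo)
    return hi if rng.random() < p else lo

random.seed(0)
lo, hi, v = 0.25, 0.5, 0.4
samples = [stochastic_round(v, lo, hi) for _ in range(200_000)]
p = (v - lo) / (hi - lo)                               # 0.6 in this example
mean = sum(samples) / len(samples)                     # close to v (unbiased)
emp_var = sum((s - v) ** 2 for s in samples) / len(samples)
ana_var = p * (1 - p) * (hi - lo) ** 2                 # the per-coordinate term
```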
In the following, we consider the NUQSGD algorithm with as the quantization levels. Then ’s are defined in two cases, depending on which quantization interval falls into:
1) If , then
(11) 
where
2) If for , then
(12) 
where Note that .
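The case split above can be made concrete in code by stochastically quantizing each normalized coordinate onto exponentially spaced levels. In the sketch below, the specific level sequence `{0, 2**-s, ..., 2**-1, 1}` scaled by the Euclidean norm is our assumption for illustration and may differ in detail from the paper's exact definition:

```python
import random

def nuq_quantize(v, s, rng=random):
    """Unbiased stochastic quantization of a nonzero vector v onto the
    (assumed) nonuniform levels {0, 2^-s, ..., 2^-1, 1}, scaled by ||v||_2."""
    norm = sum(x * x for x in v) ** 0.5
    levels = [0.0] + [2.0 ** (j - s) for j in range(s + 1)]   # 0, 2^-s, ..., 1
    out = []
    for x in v:
        r, sign = abs(x) / norm, (1 if x >= 0 else -1)
        t = max(j for j in range(len(levels) - 1) if levels[j] <= r)
        lo, hi = levels[t], levels[t + 1]                     # bin containing r
        p = (r - lo) / (hi - lo)                              # round-up probability
        q = hi if rng.random() < p else lo
        out.append(sign * norm * q)
    return out
```

Because the rounding is unbiased within each bin, the quantized vector has expectation equal to the input, which is exactly the property the variance bound in this appendix controls.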
Let denote the coordinates of vector whose elements fall into the th bin, i.e., and for . Let . Applying the result of Lemma 1, we have
(13) 
where for .
Substituting and for into (13), we have
(14) 
We note that
(15) 
since . Similarly, we have
(16) 
Substituting the upper bounds in (15) and (16) into (B), an upper bound on the variance of is given by
(17) 
The upper bound in (17) cannot be used directly as it depends on . Note that the ’s depend on the quantization intervals. In the following, we obtain an upper bound on that depends only on and . To do so, we use the following lemma, inspired by [Alistarh et al., 2017, Lemma A.5]:
Lemma 2.
Let . The expected number of nonzeros in is bounded above by
Proof.
Note that since
(18) 
For each , becomes zero with probability , which results in
(19) 
∎
Using a similar argument as in the proof of Lemma 2, we have
(20) 
for . Define for . Then
(21) 
Note that .
We define
(22) 
Note that , , , , and .
Appendix C Proof of Theorem 3 (Codelength Bound)
In this section, we find an upper bound on , i.e., the expected number of communication bits per iteration. Recall from Section 3.2 that the quantized gradient is determined by the tuple . Write for the indices of the nonzero entries of . Let .
The encoding produced by can be partitioned into two parts, and , such that, for ,

contains the codewords encoding the runs of zeros; and

contains the sign bits and codewords encoding the normalized quantized coordinates.
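A minimal sketch of this two-part layout, using Elias gamma codes for brevity in place of the paper's recursive coding (here `levels[i]` is a hypothetical nonnegative integer level index, with 0 meaning a zeroed coordinate; all names are ours):

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code of n >= 1: unary length prefix, then binary."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def encode_quantized(signs, levels):
    """Part 1: Elias-coded gaps between successive nonzeros (runs of zeros).
    Part 2: one sign bit plus an Elias-coded level index per nonzero."""
    nz = [i for i, l in enumerate(levels) if l != 0]
    gaps, prev = [], -1
    for i in nz:
        gaps.append(i - prev)        # gap >= 1, so it is Elias-encodable
        prev = i
    part1 = "".join(elias_gamma(g) for g in gaps)
    part2 = "".join(("1" if signs[i] > 0 else "0") + elias_gamma(levels[i])
                    for i in nz)
    return part1 + part2
```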
Note that . Thus, by [Alistarh et al., 2017, Lemma A.3], the properties of Elias encoding imply that
(23) 
We now turn to bounding . The following result is inspired by [Alistarh et al., 2017, Lemma A.3].
Lemma 3.
Fix a vector such that , let be the indices of its nonzero entries, and assume each nonzero entry is of the form , for some positive integer . Then
Proof.
Applying property (1) for ERC (end of Section 3.2), we have
where the last bound is obtained by Jensen’s inequality. ∎
It is not difficult to show that, for all , is concave. Note that is an increasing function up to .
Appendix D Additional Experiments
In this section, we present further experimental results in a similar setting to Section 5.
In Figure 5, we show the test accuracy for training ResNet-110 on CIFAR-10 and the validation accuracy for training ResNet-34 on ImageNet, from random initialization until convergence (discussed in Section 5). As with the training loss, we observe that NUQSGD and QSGDinf outperform QSGD in terms of test accuracy in both experiments. In both experiments, unlike NUQSGD, QSGD does not recover the test accuracy of SGD, and the gap between NUQSGD and QSGD on ImageNet is significant. We argue that this is achieved because NUQSGD and QSGDinf have lower variance than QSGD: both the training loss and the generalization error benefit from the reduced variance.
In Figure 6, we show the mean normalized variance of the gradient versus training iteration on CIFAR-10 and ImageNet. For each method, the variance is measured along its own trajectory. Since the variance depends on the optimization trajectory, these curves are not directly comparable; rather, the general trend should be studied.
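One way to estimate such a normalized variance at a fixed gradient `v` is a Monte Carlo average of `||Q(v) - v||^2 / ||v||^2` over fresh quantizations. The sketch below does this for a toy unbiased grid quantizer; the exact normalization used in the figure, and the quantizer here, are our assumptions for illustration:

```python
import random

def normalized_quant_variance(v, quantize, trials=50_000, rng=random):
    """Monte Carlo estimate of E||Q(v) - v||^2 / ||v||^2 for an
    unbiased randomized quantizer Q."""
    nsq = sum(x * x for x in v)
    acc = 0.0
    for _ in range(trials):
        acc += sum((a - b) ** 2 for a, b in zip(quantize(v, rng), v))
    return acc / (trials * nsq)

def round_grid(v, rng, step=0.25):
    """Toy unbiased quantizer: stochastic rounding onto a uniform grid."""
    out = []
    for x in v:
        lo = step * (x // step)              # grid point at or below x
        p = (x - lo) / step                  # probability of rounding up
        out.append(lo + step if rng.random() < p else lo)
    return out
```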