1 Introduction
Large-scale machine learning systems often require data to be distributed across multiple devices. For example, in federated learning
(Kairouz et al., 2021), data is distributed across user devices such as cell phones, and the machine learning models are trained with adaptive stochastic gradient descent methods or their variations like federated averaging
(McMahan et al., 2017). Such algorithms require multiple rounds of communication between the devices and a centralized server. At each round, the devices send model updates to the server, and the server aggregates the updates and outputs a new model. In many scenarios, including federated learning, the data from devices are sent to the server over wireless channels, and communication between the devices and the server, especially the uplink, is a bottleneck. This has motivated a series of works on compression and quantization methods to reduce the communication cost (Konečnỳ et al., 2016; Lin et al., 2018; Alistarh et al., 2017).
At the heart of these algorithms is the distributed mean estimation protocol, where each client has a model update (in the form of a vector). Each client compresses its update and transmits the compressed version to the server. The server then decompresses and aggregates the updates to approximate their mean. In this work, we study the problem of distributed mean estimation and provide the first algorithm, requiring no additional side information, whose error depends on the variance of the inputs rather than only their absolute range. We then use these results to provide improved convergence guarantees for distributed optimization protocols. Before we proceed further, we need a few definitions.
Distributed mean estimation.
Let be the input space and be the data points where each . For most results, we assume , the ball of radius . We denote the mean of the vectors as
In compression, these vectors are encoded at the clients and then decoded at the server (Mayekar and Tyagi, 2020). , the quantizer (encoder) at client is a (possibly randomized) mapping from , where is the quantized space. With a slight abuse of notation, let , be the set of quantizers. Each client encodes and sends to the server. The server then decodes to get an estimate of the mean . Following earlier works (Suresh et al., 2017; Mayekar and Tyagi, 2020), we measure the performance of the quantizer in terms of the mean squared error,
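To make the encode/aggregate/decode pipeline concrete, here is a minimal sketch (an illustration with our own function names, using plain stochastic binary quantization as the per-client encoder rather than the paper's algorithm), with the squared error of the mean estimate measured at the end:

```python
import numpy as np

def stochastic_binary_quantize(x, rng):
    # Unbiased 1-bit encoder: coordinate j is sent as 1 with probability x[j].
    # Assumes every coordinate of x lies in [0, 1].
    return (rng.random(x.shape) < x).astype(np.float64)

def server_estimate(messages):
    # The server decodes trivially and averages the clients' messages.
    return np.mean(messages, axis=0)

rng = np.random.default_rng(0)
clients = [rng.random(4) for _ in range(1000)]   # 1000 clients, dimension 4
true_mean = np.mean(clients, axis=0)
estimate = server_estimate([stochastic_binary_quantize(x, rng)
                            for x in clients])
squared_error = float(np.sum((estimate - true_mean) ** 2))
```

Because each message is unbiased, the averaged estimate concentrates around the true mean as the number of clients grows.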
Distributed optimization.
We consider solving the following optimization task using distributed stochastic gradient descent (SGD) methods. Let be an objective function; the goal is to minimize it over a bounded space . Motivated by the federated learning setting, we assume there are rounds of communication. At round , a set of clients is involved, and each of them has access to a stochastic gradient oracle of , denoted by . Following the stochastic optimization literature, we assume the oracle satisfies ,

Unbiasedness: .

Lipschitzness: .

Bounded variance: .
In distributed SGD, after selecting a random initialization , at round , each client queries the oracle at . Under communication constraints, clients must quantize their gradients with limited bits and send them to the server. The server then uses these quantized messages to estimate the true gradient of , which is the average of the local gradients. We denote this estimate as . The server then updates the parameter with some learning rate
and project it back to , which is sent to all clients in the next round.
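The round structure just described can be sketched as follows (a toy instantiation: the quadratic objective, the clipping to [-1, 1], and the 1-bit quantizer below are our own illustrative choices, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)
c = np.array([0.3, 0.7])   # hypothetical minimizer of f(w) = ||w - c||^2 / 2
n_clients, rounds, lr = 10, 300, 0.1

def stochastic_gradient(w):
    # Each client's oracle: the true gradient plus bounded-variance noise.
    return (w - c) + 0.05 * rng.standard_normal(w.shape)

def one_bit_quantize(g):
    # Unbiased stochastic quantization of g in [-1, 1] onto {-1, +1}.
    p_up = (g + 1.0) / 2.0
    return 2.0 * (rng.random(g.shape) < p_up) - 1.0

w = np.zeros(2)
for _ in range(rounds):
    # Server averages the clients' quantized gradients to estimate the mean.
    g_hat = np.mean([one_bit_quantize(np.clip(stochastic_gradient(w), -1, 1))
                     for _ in range(n_clients)], axis=0)
    w = np.clip(w - lr * g_hat, 0.0, 1.0)   # gradient step + projection
```

Even with one bit per coordinate per client, the averaged unbiased quantized gradients drive the iterate into a small neighborhood of the minimizer.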
Standard results in optimization, e.g., Bubeck (2015), allow us to obtain convergence results given mean squared error guarantees for the mean estimation primitive . Hence we focus on analyzing error guarantees for the mean estimation task and discuss their implications for distributed optimization.
2 Related works
The goal of distributed mean estimation is to estimate the empirical mean without making distributional assumptions about the data. This differs from works estimating the mean of an underlying distributional model (Zhang et al., 2013; Garg et al., 2014; Braverman et al., 2016; Cai and Wei, 2020; Acharya et al., 2021). To achieve guarantees in terms of the deviation of the data, those techniques rely on distributional assumptions, which are not applicable in our setting.
The classic algorithm for this problem is stochastic scalar quantization, where each dimension of the data is stochastically quantized to one of several fixed values (such as 0 or 1 in stochastic binary quantization). This provides an unbiased estimate with reduced communication cost. It has been shown that adding a random rotation reduces the quantization error (Suresh et al., 2017) and that variable-length coding provides a near-optimal communication-error trade-off (Alistarh et al., 2017; Suresh et al., 2017). Many variants and improvements of scalar quantization algorithms exist. For example, TernGrad (Wen et al., 2017) and 3LC (Lim et al., 2019) use a three-level stochastic quantization strategy. SignSGD uses the sign of each gradient coordinate rather than quantizing it (Bernstein et al., 2018). 1-bit SGD uses error feedback as a mechanism to reduce the quantization error (Seide et al., 2014); error feedback is orthogonal to our work and can potentially be used in combination with it. Mitchell et al. (2022) propose to learn the quantizer, leveraging the data distribution across the clients using rate-distortion theory. Vargaftik et al. (2021) propose an improvement of the random rotation method that replaces stochastic binary quantization by the sign operator; this method is shown to outperform other variants of scalar quantization.

Beyond scalar quantization, vector quantization may lead to higher worst-case communication cost (Gandikota et al., 2021). Kashin’s representation has been used to quantize a dimensional vector using less than bits (Caldas et al., 2018b; Chen et al., 2020; Safaryan et al., 2021). Davies et al. (2021) use a lattice quantization method, which will be discussed below.
More broadly, our work is also related to non-quantization methods for improving the communication cost of distributed mean estimation, often in the context of distributed optimization. Examples include gradient sparsification (Aji and Heafield, 2017; Lin et al., 2018; Basu et al., 2019) and low-rank decomposition (Wang et al., 2018; Vogels et al., 2019). These methods require assumptions on the data such as high sparsity or low rank. The idea of using correlation between local compressors has also been considered by Szlendak et al. (2021) for gradient sparsification, where it is shown to be advantageous over independent masking.
Perhaps closest to our work are Davies et al. (2021) and Mayekar et al. (2021), who proposed algorithms whose error depends on the variance of the inputs. However, these works all need certain side information about the inputs and deviate from ours in several ways: first, in Davies et al. (2021), the clients need to know the input variance; second, both Davies et al. (2021) and Mayekar et al. (2021) require the server to know one of the client values to high accuracy; finally, their information-theoretically optimal algorithm is not computationally efficient, and their efficient algorithm is suboptimal by logarithmic factors.
3 Our contributions
We propose correlated quantization, which requires only a simple modification of the standard stochastic quantization algorithm. Correlated quantization uses shared randomness to introduce correlation between the local quantizers at the clients, which results in improved error bounds. In the absence of shared randomness, it can be simulated by the server sending a random seed to all the clients. We first state the error guarantees below.
In one dimension, if all values lie in the range , the error of standard stochastic quantization with levels scales as (Suresh et al., 2017)
In this work, we show that the modified algorithm (Algorithm 2) has error that scales as
where is the empirical mean absolute deviation of the points, defined below:
(1) 
Informally, models how concentrated the data points are. Compared to other commonly used concentration measures, such as the worst-case deviation (Mayekar et al., 2021),
it holds that . Hence our result implies bounds in terms of these concentration measures as well.
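For concreteness, the mean absolute deviation can be computed as follows (a sketch under one natural reading of Eq. (1); the helper name is ours):

```python
import numpy as np

def mean_absolute_deviation(xs):
    # Average absolute distance of the points from their empirical mean.
    # It is zero iff all points coincide, and by Jensen's inequality it is
    # never larger than the standard deviation.
    xs = np.asarray(xs, dtype=float)
    return float(np.mean(np.abs(xs - xs.mean())))
```

For identical points the deviation vanishes, which is exactly the regime where correlated quantization wins.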
When , i.e., the data points are “close” to each other, the proposed algorithm has smaller error than stochastic quantization. It was shown in previous works that better error guarantees can be obtained when the data points have better concentration properties. However, these works rely on knowing a bound on the concentration radius (Davies et al., 2021) or side information such as a crude bound on the location of the mean (Davies et al., 2021; Mayekar et al., 2021). In contrast, our proposed scheme does not require any side information. Moreover, our scheme only requires a simple modification of how randomness is generated in the standard stochastic quantization algorithm, whereas those algorithms rely on sophisticated encoding schemes built on such prior information.
Lower bound.
We further show that in the one-dimensional case, our bound is optimal for any interval quantizer (see Definition 1), a class commonly used in many state-of-the-art compression algorithms for distributed optimization, including stochastic quantization and our proposed algorithm. Moreover, when each client is only allowed to use one bit (or a constant number of bits), the bound is information-theoretically optimal for any quantizer.
Extension to higher dimensions.
In high dimensions, if all values lie in , the error of the minimax-optimal algorithms with levels of quantization scales as (Suresh et al., 2017)
We show an improvement similar to the previous result:
where is the average distance between data points,
(2) 
Similar to the one-dimensional case, it can be shown that is upper bounded by and . Hence the same bound holds for these measures as well.
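One plausible reading of the average-distance measure in Eq. (2) (the paper's exact normalization may differ) is the mean pairwise Euclidean distance:

```python
import numpy as np

def average_pairwise_distance(xs):
    # Mean Euclidean distance over all (ordered) pairs of points; zero iff
    # all clients hold the same vector.
    xs = np.asarray(xs, dtype=float)
    n = len(xs)
    dists = np.linalg.norm(xs[:, None, :] - xs[None, :, :], axis=-1)
    return float(dists.sum() / (n * n))
```

Like the one-dimensional mean absolute deviation, this quantity vanishes for identical inputs and grows with their spread.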
We note that our dependence on the standard deviation is linear. This coincides with the results of Mayekar et al. (2021) for mean estimation with side information. While the results of one setting are not directly applicable to the other, understanding the relationship between the two results in the high-dimensional case remains an interesting direction to explore.
Distributed optimization.
Turning to the task of distributed optimization, the proposed mean estimation algorithm can be used as a primitive in distributed SGD algorithms. Using standard results on smooth convex optimization (e.g., Theorem 6.3 in Bubeck (2015)) and the estimation error guarantee in Corollary 1, we obtain the following bound.
Theorem 1.
Suppose the objective function is convex and smooth. Using correlated quantization with levels as the mean estimation primitive, distributed SGD with rounds and learning rate yields
where . Moreover, each client only needs to send bits (constant bits per dimension) in each round.
Optimization algorithms based on stochastic rounding (Alistarh et al., 2017; Suresh et al., 2017) have convergence rates of the same form with , under the same communication complexity. Notice that the mean squared error (and therefore the convergence rate) we obtain is always better than that of the classic stochastic quantization algorithm, and when , the new algorithm gives faster convergence. This corresponds to the case where the clients’ local gradients are better concentrated than their absolute range. This is a mild assumption, especially in practical federated learning settings, where in each communication round only a small number of clients participates ( is small) and the gradient of each client is computed over a large set of data points accumulated since its previous participation ( is small).
4 A toy example
Before we proceed further, we motivate our result with a simple example. Suppose there are devices and device has . Further, consider the simple scenario in which each client can only send one bit, i.e., can take only two values. A popular algorithm for mean estimation is stochastic quantization, in which client sends given by
(3) 
Notice that takes only two values and the server computes the mean by computing their average
We first note that is unbiased
The mean squared error of this protocol is
To motivate our approach, we consider the special case when and further assume . In this case, the mean squared error of the stochastic quantizer is
(4) 
The key insight of correlated quantization is that if the first client rounds up its value, the second client should round down its value to reduce the mean squared error. To see this, we first rewrite the original stochastic quantization slightly differently. For each , let
be an independent uniform random variable in the range
. Then we can rewrite as

To see the equivalence of the above definition and (3), observe that the probability of is . Hence is one with probability and zero otherwise.
For the special case of , we propose to modify the estimator by making the s dependent on each other. In particular, let be a uniform random variable in the range and let . This can be implemented using shared randomness. Let the modified estimator be as before
Since is an unbiased estimator, is also an unbiased estimator. We now compute the mean squared error of
We plot the mean squared error of the original quantizer and the new quantizer in Figure 1. Observe that the above mean squared error is uniformly upper bounded by the mean squared error of the original quantizer (4). Our goal is to propose a similar estimator for devices, in dimensions, that has uniformly lower error compared to even when the devices have different values of .
5 Correlated quantization in one dimension
We first extend the above idea and provide a new quantizer for bounded scalar values. For simplicity, we first assume each . Recall that the goal is to estimate

Let be some quantization of . We approximate the average by

As we discussed in the previous section, we propose to use
where the s are uniform random variables between zero and one, but are now correlated across the clients. We generate the s as follows. Let denote a random permutation of . Let denote the value of this permutation. Let be a uniform random variable between . With these definitions, we let
Observe that for each , is a uniform random variable over . However, they are now correlated across clients. Hence the quantizer can be written as
We illustrate why this quantizer is better with an example. If all clients have the same value , the new quantizer can be written as
where uses the fact that the value of does not change for this example and uses the fact that is a random permutation of . We have shown that the correlated quantizer has zero error in the above example. In contrast, the standard stochastic quantizer outputs
where follows from the fact that is an integer and , a uniform random variable from the set . Moreover, s are independent.
The above example also provides an alternative view of the proposed quantizer. If , , then the random variables in the standard stochastic quantizer can be viewed as sampling with replacement from the set , while the random variables in the correlated quantizer can be viewed as sampling without replacement from the set . Since sampling without replacement has smaller variance, the proposed estimator performs better, and in this particular example has zero error.
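A minimal sketch of the resulting one-dimensional one-bit scheme (our own implementation of the construction described above; variable names are ours). When every client holds the same value k/n, exactly k of the stratified uniforms land below it, so the estimate is exact:

```python
import numpy as np

def correlated_one_bit(xs, rng):
    # Client i compares x_i against u_i = (pi(i) + gamma_i) / n, where pi is
    # a shared random permutation of {0, ..., n-1} and gamma_i ~ Unif[0, 1).
    # Marginally each u_i ~ Unif[0, 1], but jointly they are stratified:
    # exactly one u_i falls in each interval [j/n, (j+1)/n).
    xs = np.asarray(xs, dtype=float)
    n = len(xs)
    u = (rng.permutation(n) + rng.random(n)) / n
    return (u < xs).astype(float)

rng = np.random.default_rng(0)
n = 8
xs = np.full(n, 5 / n)          # all clients hold the same value 5/8
estimate = float(correlated_one_bit(xs, rng).mean())
```

The stratification is precisely the sampling-without-replacement effect: the number of uniforms below any threshold is (nearly) deterministic rather than binomial.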
We will generalize the above result and show a data-dependent bound on its error in Theorem 2. For completeness, the full algorithm when each input belongs to the range is given in Algorithm 1.
Theorem 2.
If all the inputs lie in the range , the proposed estimator OneDimOneBitCQ is unbiased and the mean squared error is upper bounded by
where is defined in (1).
We provide the proof of the theorem in Appendix A. We now extend the results to levels. A standard way of extending one-bit quantizers to multiple bits is to divide the interval into multiple fixed-length intervals and use stochastic quantization within each interval. While this does provide good worst-case bounds, it cannot yield data-dependent bounds in terms of the mean deviation: there can be examples in which the samples lie in two different intervals, and using stochastic quantization in each interval separately can increase the error. For example, suppose and we divide the range into the intervals , , . If all points lie in the set , then they belong to two different intervals, which yields looser bounds.
We overcome this by dividing the input space into randomized intervals. Even with randomized intervals the points may lie in different intervals, but the chance of this happening is small, which lets us obtain bounds in terms of the absolute deviation. More formally, let be levels such that
is uniformly distributed in the interval
and where . Observe that with these definitions,

Let
If , we use the following algorithm:
where is the two-level quantizer in Algorithm 1. The full algorithm when each input belongs to the range is given in Algorithm 2, and we provide its theoretical guarantee in Theorem 3. We provide the proof of the theorem in Appendix B.
Theorem 3.
If all the inputs lie in the range , , the proposed estimator OneDimKLevelsCQ is unbiased and the mean squared error is upper bounded by
where is defined in (1).
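As an illustration of the randomized-interval idea (a hypothetical rendering: the exact level construction in Algorithm 2 may differ; here a single shared random offset shifts an evenly spaced grid on [0, 1]):

```python
import numpy as np

def randomized_levels(k, rng):
    # k evenly spaced levels shifted by a shared random offset b in [0, 1/k),
    # plus the endpoints 0 and 1, so no fixed boundary sits adversarially
    # between nearby data points with probability one.
    b = rng.random() / k
    return np.concatenate(([0.0], b + np.arange(k) / k, [1.0]))

def stochastic_round(x, levels, rng):
    # Round x in [0, 1] to one of its two adjacent levels, unbiasedly.
    j = min(np.searchsorted(levels, x, side="right") - 1, len(levels) - 2)
    lo, hi = levels[j], levels[j + 1]
    return hi if rng.random() < (x - lo) / (hi - lo) else lo

rng = np.random.default_rng(0)
samples = [stochastic_round(0.37, randomized_levels(4, rng), rng)
           for _ in range(100_000)]
empirical_mean = float(np.mean(samples))   # close to 0.37 (unbiased)
```

Because the rounding is conditionally unbiased for any fixed set of levels, the overall estimator stays unbiased, while the random offset prevents any fixed boundary from splitting a tight cluster of points with certainty.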
6 Extensions to high dimensions
To extend the algorithm to high-dimensional data, we can quantize each coordinate independently using the above quantization scheme. However, such an approach is suboptimal. In this section, we show that the two approaches used in Suresh et al. (2017), namely variable-length coding and random rotation, can be used here.

6.1 Variable-length coding
One natural way to extend the above algorithm to high dimensions is to use OneDimKLevelsCQ on each coordinate using bits. Suresh et al. (2017); Alistarh et al. (2017) observed that while each coordinate is quantized with bits, instead of using bits of communication, one can reduce the communication cost by using variable-length codes such as Elias gamma codes or arithmetic coding. We refer to this algorithm as EntropyCQ. We use the same approach and show the following corollary.
Corollary 1.
If all the inputs lie in the range , the proposed estimator EntropyCQ is unbiased and the mean squared error is upper bounded by
where is defined in (2) and is a constant. Furthermore, the quantized vector can be communicated to the server in bits per client in expectation.
Proof.
Based on the above corollary, we can set bits and have a quantizer that uses bits and has error
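For concreteness, one standard variable-length code of the kind referenced above is the Elias gamma code (whether EntropyCQ uses exactly this code or arithmetic coding is left open in the text); it encodes a positive integer n in 2*floor(log2 n) + 1 bits:

```python
def elias_gamma(n):
    # Elias gamma code: floor(log2 n) zeros, then n in binary.
    # Small integers get short codewords, which is what makes
    # variable-length coding pay off when most quantized values are small.
    assert n >= 1
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary
```

When most quantized coordinates are small, the expected codeword length is far below the fixed-length cost, which is the source of the communication savings.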
6.2 Random rotation
Instead of using variable-length codes, one can use a random rotation matrix to reduce the norm of the vectors. We use this approach and show the following result; the proof is given in Appendix C. Similar to Suresh et al. (2017), one can use the efficient Walsh-Hadamard rotation, which takes time to compute (Algorithm 3).
Corollary 2.
If all the inputs lie in the range , the proposed estimator WalshHadamardCQ has bias at most and the mean squared error is upper bounded by
where is defined in (1) and is a constant. Furthermore, the quantized vector can be communicated to the server in bits per client in expectation.
We note that with a communication cost of bits, the bounds with random rotation are suboptimal by a logarithmic factor compared to the variable-length coding approach. However, it may be the preferred approach in practice due to ease of use or computation cost (Suresh et al., 2017; Vargaftik et al., 2021).
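A compact reference implementation of the fast Walsh-Hadamard transform with the usual random diagonal sign flip (a sketch of the standard structured rotation, not Algorithm 3 verbatim; the dimension must be a power of two):

```python
import numpy as np

def fwht(x):
    # Fast Walsh-Hadamard transform in O(d log d); scaled by 1/sqrt(d) the
    # matrix is orthonormal and its own inverse, so it preserves norms.
    x = np.array(x, dtype=float)
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))

rng = np.random.default_rng(0)
d = 8                                    # must be a power of two
signs = rng.choice([-1.0, 1.0], size=d)  # shared random diagonal D
v = rng.standard_normal(d)
rotated = fwht(signs * v)                # the structured rotation H D v
```

The rotation spreads a vector's mass roughly evenly across coordinates while preserving its norm, which is what makes uniform per-coordinate quantization effective afterwards.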
7 Lower Bound
In this section, we discuss information-theoretic lower bounds on the quantization error. We focus on the one-dimensional case and show that correlated quantization is optimal in terms of the dependence on and under mild assumptions. For the general dimensional case, whether the dependence on and is tight is an interesting question to explore.
In the one-dimensional case with one bit (or a constant number of bits) per client, we obtain the following lower bound, which shows that the upper bound in Theorem 2 is tight up to constant factors in terms of the dependence on when . Note that this condition is mild, since when , the second term in Theorem 2 is . As shown in Theorem 5, the dependence cannot be improved for interval quantizers; whether it can be improved for general quantizers is an interesting future direction.
Theorem 4.
For any and , and any onebit quantizer , when , there exists a dataset with mean absolute deviation , such that
Turning to the level ( bits) case, we show that our estimator is optimal up to constant factors for the more restricted class of interval quantizers. Interval quantizers include all quantization schemes under which the preimage of each quantized message, ignoring common randomness, is an interval. To make the definition formal, we slightly abuse notation and assume each quantizer admits another argument , which incorporates all randomness in the quantizer. Fixing , is a deterministic function of .
Definition 1.
A quantizer is said to be a interval quantizer if , there exist non-overlapping intervals which partition , and , ,
The class of interval quantizers includes many common compression algorithms used in distributed optimization, such as stochastic quantization and our proposed algorithm. For this restricted class of schemes, we obtain the following.
Theorem 5.
Given and , for any interval quantizer , there exists a dataset with mean absolute deviation , such that
We defer the proof of the theorems to Appendix D.
8 Experiments
Algorithm  Synthetic  MNIST 

Independent  
Independent + Rotation  
TernGrad ( bits)  
Structured DRIVE  
Correlated  
Correlated + Rotation 
Algorithm  Dist. means  Dist. Power Iteration  FedAvg 

Objective  Error  Accuracy %  
None  
Independent  
Independent + Rotation  
TernGrad ( bits)  
Structured DRIVE  
Correlated  
Correlated + Rotation 
We demonstrate that the proposed algorithm outperforms existing baselines on several distributed tasks. Before we conduct a full comparison, we first empirically demonstrate that our correlated quantization algorithm is better than existing independent quantization schemes (Suresh et al., 2017; Alistarh et al., 2017). We implement all algorithms and experiments using the open-source JAX (Bradbury et al., 2018) and FedJAX (Ro et al., 2021) libraries.

Correlated vs. independent stochastic quantization. We first compare correlated and independent stochastic quantization on a simple mean estimation task. Let be i.i.d. samples over , where coordinate is sampled independently according to , where is a uniform random variable between and is fixed for all clients, and is an independent random variable for each client in the range . Note that this distribution has a mean deviation of . We first fix the number of clients to be , , and vary . We then fix , and vary . Finally, we fix , and vary . The results are given in Figure 2. The experiments are averaged over ten runs for statistical consistency. Observe that in all the experiments, correlated quantization outperforms independent stochastic quantization.
Effect of random rotation. We next demonstrate that correlated quantization benefits from random rotation, similar to independent quantization. Let be i.i.d. samples over , where coordinate is independently sampled according to , where is an independent random variable for each client in the range . We let , , and . We compare the results as a function of in Figure 2. Observe that random rotation improves the performance of both correlated and independent quantization. Furthermore, rotated correlated quantization outperforms the remaining schemes.
In the following experiments, we compare our correlated quantization algorithm with several quantization baselines: Independent, Independent + Rotation (Suresh et al., 2017), TernGrad (Wen et al., 2017), and Structured DRIVE (Vargaftik et al., 2021). Since the focus of this paper is quantization, we only compare lossy quantization schemes and do not evaluate lossless compression schemes such as arithmetic coding or Huffman coding, which can be applied on top of any quantizer. We use 2-level quantization (one bit) for all the algorithms, except TernGrad, which uses levels and hence requires bits per coordinate per client.
Distributed mean estimation. We next compare our proposed algorithm to existing baselines on a sparse mean estimation task. Let be i.i.d. samples over , where coordinate is sampled independently according to , where is a sparse vector with sparse entries that is fixed for all clients, and is an independent random variable for each client in the range . We also compare quantizers on the distributed mean estimation task for the MNIST () dataset distributed over clients. The results for both, over repeated trials, are in Table 1. Observe that the correlated quantizers perform best.
Distributed k-means clustering.
In the distributed Lloyd’s (k-means) algorithm, each client has access to a subset of data points and the goal of the server is to learn centers by repeatedly interacting with the clients. We employ quantizers to reduce the uplink communication cost from clients to the server, use the MNIST () dataset, and set both the number of centers and the number of clients to 10. We split the examples evenly among the clients and use k-means++ to select the initial cluster centers. The results for communication rounds over repeated trials are in Table 2. Observe that correlated quantization performs the best.

Distributed power iteration.
In distributed power iteration, each client has access to a subset of the data and the goal of the server is to learn the top eigenvector. As in the distributed k-means clustering experiment, we use quantization to reduce the communication cost from clients to the server and use the MNIST () dataset distributed evenly over clients. The results for communication rounds over repeated trials are in Table 2. Observe that correlated quantizers outperform all baselines.

Federated learning. We finally evaluate the effectiveness of the proposed algorithm in reducing the uplink communication cost in federated learning (McMahan et al., 2017). We use the image recognition task for the Federated MNIST dataset (Caldas et al., 2018a) provided by TensorFlow Federated (Bonawitz et al., 2019). This dataset consists of training examples with label classes distributed across clients. We use quantizers to reduce the uplink communication cost from clients to the server and train a logistic regression model for communication rounds of federated averaging over repeated trials. The results are in Table 2. Observe that correlated quantization with rotation achieves the highest test accuracy.

9 Conclusion
We proposed a new quantizer for distributed mean estimation and showed that the error guarantee depends on the deviation of data points instead of their absolute range. We further used this result to provide fast convergence rates for distributed optimization under communication constraints. Experimental results show that our proposed algorithm outperforms existing compression protocols on several tasks. We also proved the optimality of the proposed approach under mild assumptions in one dimension. Extending the lower bounds to high dimensions remains an interesting open question.
References

Acharya et al. [2021] J. Acharya, C. L. Canonne, Y. Liu, Z. Sun, and H. Tyagi. Distributed estimation with multiple samples per user: Sharp rates and phase transition. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
Aji and Heafield [2017] A. F. Aji and K. Heafield. Sparse communication for distributed gradient descent. In Empirical Methods in Natural Language Processing, 2017.
Alistarh et al. [2017] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30:1709–1720, 2017.
Basu et al. [2019] D. Basu, D. Data, C. Karakus, and S. Diggavi. Qsparse-local-sgd: Distributed sgd with quantization, sparsification and local computations. Advances in Neural Information Processing Systems, 32, 2019.
 Bernstein et al. [2018] J. Bernstein, Y.X. Wang, K. Azizzadenesheli, and A. Anandkumar. signsgd: Compressed optimisation for nonconvex problems. In International Conference on Machine Learning, pages 560–569. PMLR, 2018.
 Bonawitz et al. [2019] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, B. McMahan, T. Van Overveldt, D. Petrou, D. Ramage, and J. Roselander. Towards federated learning at scale: System design. In A. Talwalkar, V. Smith, and M. Zaharia, editors, Proceedings of Machine Learning and Systems, volume 1, pages 374–388, 2019. URL https://proceedings.mlsys.org/paper/2019/file/bd686fd640be98efaae0091fa301e613Paper.pdf.
 Bradbury et al. [2018] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. WandermanMilne, and Q. Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Braverman et al. [2016] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 1011–1020, 2016.
Bubeck [2015] S. Bubeck. Convex Optimization: Algorithms and Complexity. Foundations and Trends in Machine Learning. Now Publishers, 2015. ISBN 9781601988607. URL https://books.google.com/books?id=B4JbjwEACAAJ.
 Cai and Wei [2020] T. T. Cai and H. Wei. Distributed gaussian mean estimation under communication constraints: Optimal rates and communicationefficient algorithms. arXiv preprint arXiv:2001.08877, 2020.
 Caldas et al. [2018a] S. Caldas, S. M. K. Duddu, P. Wu, T. Li, J. Konečnỳ, H. B. McMahan, V. Smith, and A. Talwalkar. Leaf: A benchmark for federated settings. arXiv preprint arXiv:1812.01097, 2018a.
 Caldas et al. [2018b] S. Caldas, J. Konečny, H. B. McMahan, and A. Talwalkar. Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210, 2018b.
Chen et al. [2020] W.-N. Chen, P. Kairouz, and A. Ozgur. Breaking the communication-privacy-accuracy trilemma. Advances in Neural Information Processing Systems, 33:3312–3324, 2020.
 Davies et al. [2021] P. Davies, V. Gurunanthan, N. Moshrefi, S. Ashkboos, and D. Alistarh. New bounds for distributed mean estimation and variance reduction. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=t86MwoUCCNe.

Gandikota et al. [2021] V. Gandikota, D. Kane, R. K. Maity, and A. Mazumdar. vqSGD: Vector quantized stochastic gradient descent. In International Conference on Artificial Intelligence and Statistics, pages 2197–2205. PMLR, 2021.
Garg et al. [2014] A. Garg, T. Ma, and H. Nguyen. On communication cost of distributed statistical estimation and dimensionality. Advances in Neural Information Processing Systems, 27:2726–2734, 2014.
 Kairouz et al. [2021] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021. ISSN 19358237. doi: 10.1561/2200000083. URL http://dx.doi.org/10.1561/2200000083.
 Konečnỳ et al. [2016] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
 Lim et al. [2019] H. Lim, D. G. Andersen, and M. Kaminsky. 3lc: Lightweight and effective traffic compression for distributed machine learning. Proceedings of Machine Learning and Systems, 1:53–64, 2019.
 Lin et al. [2018] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SkhQHMW0W.
 Mayekar and Tyagi [2020] P. Mayekar and H. Tyagi. RATQ: A universal fixed-length quantizer for stochastic optimization. In International Conference on Artificial Intelligence and Statistics, pages 1399–1409. PMLR, 2020.
 Mayekar et al. [2021] P. Mayekar, A. T. Suresh, and H. Tyagi. Wyner-Ziv estimators: Efficient distributed mean estimation with side-information. In International Conference on Artificial Intelligence and Statistics, pages 3502–3510. PMLR, 2021.
 McMahan et al. [2017] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
 Mitchell et al. [2022] N. Mitchell, J. Ballé, Z. Charles, and J. Konečnỳ. Optimizing the communication-accuracy trade-off in federated learning with rate-distortion theory. arXiv preprint arXiv:2201.02664, 2022.
 Ro et al. [2021] J. H. Ro, A. T. Suresh, and K. Wu. FedJAX: Federated learning simulation with JAX. arXiv preprint arXiv:2108.02117, 2021.
 Safaryan et al. [2021] M. Safaryan, E. Shulgin, and P. Richtárik. Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor. Information and Inference: A Journal of the IMA, 04 2021. ISSN 2049-8772. doi: 10.1093/imaiai/iaab006. URL https://doi.org/10.1093/imaiai/iaab006.
 Seide et al. [2014] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 Suresh et al. [2017] A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan. Distributed mean estimation with limited communication. In International Conference on Machine Learning, pages 3329–3337. PMLR, 2017.
 Szlendak et al. [2021] R. Szlendak, A. Tyurin, and P. Richtárik. Permutation compressors for provably faster distributed nonconvex optimization. arXiv preprint arXiv:2110.03300, 2021.
 Vargaftik et al. [2021] S. Vargaftik, R. Ben-Basat, A. Portnoy, G. Mendelson, Y. Ben-Itzhak, and M. Mitzenmacher. DRIVE: One-bit distributed mean estimation. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=KXRTmcv3dQ8.
 Vogels et al. [2019] T. Vogels, S. P. Karimireddy, and M. Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. Advances in Neural Information Processing Systems, 32:14259–14268, 2019.
 Wang et al. [2018] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright. ATOMO: Communication-efficient learning via atomic sparsification. Advances in Neural Information Processing Systems, 31:9850–9861, 2018.
 Wen et al. [2017] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li. TernGrad: Ternary gradients to reduce communication in distributed deep learning. Advances in Neural Information Processing Systems, 30, 2017.
 Zhang et al. [2013] Y. Zhang, J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. Advances in Neural Information Processing Systems, pages 2328–2336, 2013.
Appendix A Proof of Theorem 2
Theorem 2 (restated).
Proof.
We first show the result when and ; the final result can then be obtained by rescaling the quantizer via
Hence, in the following, without loss of generality, let . We first show that is an unbiased quantizer.
Observe that is a uniform random variable between . Hence,
We now bound its variance. To this end let
We can rewrite the difference between the estimate and the true sum as
Since ,
We now bound each of the three terms in the above summation.
First term: Observe that for all
Hence,
Therefore,
Second term: To bound the second term,
where the second equality uses the fact that is a Bernoulli random variable with parameter . We now bound for . Observe that
However, since is an integral multiple of , and , if and only if . Hence,
Hence,
Since is a random permutation,
Hence,
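The unbiasedness step in the proof above is the standard stochastic-rounding calculation: each coordinate is rounded up with probability equal to its fractional offset from the grid. As a minimal numerical sketch (not the exact quantizer analyzed in the theorem; the grid, input range, and function name here are illustrative assumptions), the following snippet demonstrates that such a quantizer estimates the mean without bias:

```python
import numpy as np

def stochastic_quantize(x, levels, rng):
    """Stochastically round each coordinate of x (assumed to lie in
    [0, 1]) to one of `levels` uniformly spaced grid points."""
    step = 1.0 / (levels - 1)
    scaled = x / step
    low = np.floor(scaled)
    # Round up with probability equal to the fractional part; this
    # makes the quantizer unbiased, E[Q(x)] = x, coordinate-wise.
    up = rng.random(x.shape) < (scaled - low)
    return (low + up) * step

rng = np.random.default_rng(0)
x = rng.random(1000)  # inputs in [0, 1]
# Average the quantized mean over many independent repetitions;
# by unbiasedness it concentrates around the true mean of x.
reps = [stochastic_quantize(x, levels=4, rng=rng).mean() for _ in range(2000)]
est = float(np.mean(reps))
print(abs(est - x.mean()))  # small: the mean estimate is unbiased
```

The per-coordinate error of such a quantizer is at most a grid step, which is why, absent variance-dependent schemes such as the one analyzed here, the mean-squared error scales with the input range rather than the spread of the inputs.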