Correlated quantization for distributed mean estimation and optimization

03/09/2022
by   Ananda Theertha Suresh, et al.
Google

We study the problem of distributed mean estimation and optimization under communication constraints. We propose a correlated quantization protocol whose error guarantee depends on the deviation of the data points rather than their absolute range. The design does not require any prior knowledge of the concentration properties of the dataset, which previous works needed in order to obtain such dependence. We show that applying the proposed protocol as a subroutine in distributed optimization algorithms leads to better convergence rates. We also prove the optimality of our protocol under mild assumptions. Experimental results show that our proposed algorithm outperforms existing mean estimation protocols on a diverse set of tasks.


1 Introduction

Large-scale machine learning systems often require distributing data across multiple devices. For example, in federated learning (Kairouz et al., 2021), data is distributed across user devices such as cell phones, and machine learning models are trained with adaptive stochastic gradient descent methods or variants such as federated averaging (McMahan et al., 2017). Such algorithms require multiple rounds of communication between the devices and the centralized server. At each round, the devices send model updates to the server, and the server aggregates the updates and outputs a new model.

In many scenarios such as federated learning, the data from the devices is sent to the server over wireless channels. Communication between the devices and the server, especially the uplink communication, is a bottleneck. This has motivated a series of works on compression and quantization methods to reduce the communication cost (Konečnỳ et al., 2016; Lin et al., 2018; Alistarh et al., 2017).

At the heart of these algorithms is the distributed mean estimation protocol, where each client has a model update (in the form of a vector). Each client compresses its update and transmits the compressed version to the server. The server then decompresses and aggregates the updates to approximate their mean. In this work, we study the problem of distributed mean estimation and provide the first algorithm whose error depends on the variance of the inputs rather than only their absolute range, without requiring additional side information. We then use these results to provide improved convergence guarantees for distributed optimization protocols. Before we proceed further, we need a few definitions.

Distributed mean estimation.

Let be the input space and be the data points where each . For most results, we assume , the ball of radius . We denote the mean of the vectors as

In compression, these vectors are encoded at the clients and then decoded at the server (Mayekar and Tyagi, 2020). The quantizer (encoder) at each client is a (possibly randomized) mapping from the input space to the quantized space. With a slight abuse of notation, let the collection of these mappings denote the set of quantizers. Each client encodes its vector and sends the encoded message to the server. The server then decodes the messages to get an estimate of the mean. Following earlier works (Suresh et al., 2017; Mayekar and Tyagi, 2020), we measure the performance of the quantizer in terms of the mean squared error,
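
For concreteness, writing the client vectors as x_1, ..., x_n (the notation here is assumed, since it is standard in this line of work rather than taken verbatim from the paper), the two quantities referred to above are

\[
\bar{x} \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\mathcal{E}(Q_1,\dots,Q_n) \;=\; \mathbb{E}\,\big\|\hat{\bar{x}} - \bar{x}\big\|_2^2,
\]

where \hat{\bar{x}} denotes the server's estimate decoded from the quantized messages.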

Distributed optimization.

We consider solving the following optimization task using distributed stochastic gradient descent (SGD) methods. Let be an objective function; the goal is to minimize over an -bounded space . Motivated by the federated learning setting, we assume there are rounds of communication. At round , a set of clients is involved, and each of them has access to a stochastic gradient oracle of , denoted by . Following the stochastic optimization literature, we assume the oracle satisfies the following conditions:

  • Unbiasedness: .

  • Lipschitzness: .

  • Bounded variance: .

In distributed SGD, after selecting a random initialization , at round each client queries the oracle at . Under communication constraints, clients must quantize the obtained gradient with a limited number of bits and send it to the server. The server then uses these quantized messages to estimate the true gradient of , which is the average of the local gradients. We denote this estimate as . The server then updates the parameter with some learning rate and projects it back to ; the new iterate is sent to all clients in the next round.
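
As a minimal illustration of this loop, the following sketch runs server-side SGD on aggregated client gradients. The gradient oracle, the projection radius, the learning rate, and the exact_mean stand-in for the quantized mean-estimation primitive are illustrative assumptions, not the paper's experimental setup.

```python
# A minimal sketch of the quantized distributed SGD loop above. The gradient
# oracle, projection radius, learning rate, and the exact_mean stand-in for the
# quantized mean-estimation primitive are illustrative assumptions.
import numpy as np


def distributed_sgd(grad_oracle, quantized_mean, x0, radius, lr, rounds, n_clients, rng):
    """Run `rounds` steps of server-side SGD on (quantized) client gradients."""
    x = np.array(x0, dtype=float)
    for _ in range(rounds):
        # Each participating client queries its stochastic gradient oracle at x.
        grads = np.stack([grad_oracle(x, rng) for _ in range(n_clients)])
        # The server recovers an estimate of the average gradient from the
        # clients' (quantized) messages, e.g., via correlated quantization.
        g_hat = quantized_mean(grads, rng)
        # Gradient step, then projection back onto the ball of the given radius.
        x = x - lr * g_hat
        norm = np.linalg.norm(x)
        if norm > radius:
            x = x * (radius / norm)
    return x


# Usage on a toy quadratic objective f(x) = 0.5 * ||x||^2 (minimizer at 0).
rng = np.random.default_rng(0)
oracle = lambda x, rng: x + 0.1 * rng.normal(size=x.shape)  # unbiased, bounded variance
exact_mean = lambda g, rng: g.mean(axis=0)                  # stand-in for a quantizer
x_T = distributed_sgd(oracle, exact_mean, np.ones(4), radius=10.0, lr=0.1,
                      rounds=200, n_clients=8, rng=rng)
print(np.linalg.norm(x_T))   # close to 0: the iterates approach the minimizer
```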

Standard results in optimization, e.g., Bubeck (2015), allow us to obtain convergence results given mean squared error guarantees for the mean estimation primitive. Hence we focus on analyzing error guarantees for the mean estimation task and discuss their implications for distributed optimization.

2 Related works

The goal of distributed mean estimation is to estimate the empirical mean without making distributional assumptions about the data. This is different from works that estimate the mean of an underlying distributional model (Zhang et al., 2013; Garg et al., 2014; Braverman et al., 2016; Cai and Wei, 2020; Acharya et al., 2021). To achieve guarantees in terms of the deviation of the data, those techniques rely on distributional assumptions, which are not applicable in our setting.

The classic algorithm for this problem is stochastic scalar quantization, where each dimension of the data is stochastically quantized to one of a fixed set of values (such as 0 or 1 in stochastic binary quantization). This provides an unbiased estimate with reduced communication cost. It has been shown that adding a random rotation reduces the quantization error (Suresh et al., 2017) and that variable-length coding provides a near-optimal communication-error trade-off (Alistarh et al., 2017; Suresh et al., 2017). Many variants and improvements of scalar quantization algorithms exist. For example, Terngrad (Wen et al., 2017) and 3LC (Lim et al., 2019) use a three-level stochastic quantization strategy. SignSGD uses the sign of each gradient coordinate rather than quantizing it (Bernstein et al., 2018). 1-bit SGD uses error feedback as a mechanism to reduce the quantization error (Seide et al., 2014); error feedback is orthogonal to our work and can potentially be used in combination with it. Mitchell et al. (2022) propose to learn the quantizer from the data distribution across the clients using rate-distortion theory. Vargaftik et al. (2021) propose an improvement of the random rotation method that replaces stochastic binary quantization with the sign operator; this method is shown to outperform other variants of scalar quantization.

Beyond scalar quantization, vector quantization may lead to higher worst-case communication cost (Gandikota et al., 2021). Kashin’s representation has been used to quantize a -dimensional vector using less than bits (Caldas et al., 2018b; Chen et al., 2020; Safaryan et al., 2021). Davies et al. (2021) use a lattice quantization method, which is discussed below.

More broadly, our work is also related to non-quantization methods for reducing the communication cost of distributed mean estimation, often in the context of distributed optimization. Examples include gradient sparsification (Aji and Heafield, 2017; Lin et al., 2018; Basu et al., 2019) and low-rank decomposition (Wang et al., 2018; Vogels et al., 2019). These methods require assumptions on the data such as high sparsity or low rank. The idea of using correlation between local compressors has also been considered by Szlendak et al. (2021) for gradient sparsification, where it is shown to be advantageous over independent masking.

Perhaps closest to our work are Davies et al. (2021) and Mayekar et al. (2021), who proposed algorithms whose error depends on the variance of the inputs. However, these works all need certain side information about the inputs and differ from our work in two ways: first, in Davies et al. (2021), the clients need to know the input variance; second, both Davies et al. (2021) and Mayekar et al. (2021) require the server to know one of the client values to high accuracy. Moreover, their information-theoretically optimal algorithm is not computationally efficient, and their efficient algorithm is sub-optimal by logarithmic factors.

3 Our contributions

We propose correlated quantization, which requires only a simple modification of the standard stochastic quantization algorithm. Correlated quantization uses shared randomness to introduce correlation between the local quantizers at the clients, which results in improved error bounds. In the absence of shared randomness, it can be simulated by the server sending a random seed to all the clients. We first state the error guarantees below.

In one dimension, if all values lie in the range , the error of standard stochastic quantization with levels scales as (Suresh et al., 2017)

In this work, we show that the modified algorithm (Algorithm 2) has error that scales as

where is the empirical mean absolute deviation of points defined below:

(1)
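
One standard form consistent with this description (the paper's exact definition in (1) may differ, for example in its normalization) is the average absolute deviation of the points from their empirical mean:

\[
s \;=\; \frac{1}{n}\sum_{i=1}^{n} \left| x_i - \bar{x} \right|,
\qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .
\]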

Informally, this quantity models how concentrated the data points are. It is upper bounded by other commonly used concentration measures, such as the worst-case deviation (Mayekar et al., 2021) and the standard deviation (Davies et al., 2021). Hence our result implies bounds in terms of these concentration measures as well.

When , i.e., when the data points are “close” to each other, the proposed algorithm has smaller error than stochastic quantization. Previous works have shown that better error guarantees can be obtained when the data points have better concentration properties. However, these works rely on knowing a bound on the concentration radius (Davies et al., 2021) or on side information such as a crude bound on the location of the mean (Davies et al., 2021; Mayekar et al., 2021). In contrast, our proposed scheme does not require any side information. Moreover, our scheme only requires a simple modification of how randomness is generated in the standard stochastic quantization algorithm, whereas these algorithms are based on sophisticated encoding schemes built on such prior information.

Lower bound.

We further show that in the one-dimensional case, our obtained bound is optimal for any -interval quantizer (see Definition 1), a class that includes many state-of-the-art compression algorithms used in distributed optimization, including stochastic quantization and our proposed algorithm. Moreover, when each client is only allowed to use one bit (or a constant number of bits), the obtained bound is information-theoretically optimal for any quantizer.

Extension to higher dimensions.

In high dimensions, if all values lie in , the error of the min-max optimal algorithms with levels of quantization scales as (Suresh et al., 2017)

We show an improvement similar to the previous result:

where is the average distance between data points,

(2)
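
One natural reading of “average distance between data points” (again, the exact definition in (2) may differ, for example in normalization) is the average pairwise distance

\[
\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left\| x_i - x_j \right\|_2 .
\]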

Similar to the one-dimensional case, it can be shown that this quantity is upper bounded by and , and hence the same bound holds for these measures as well.

We note that our dependence on standard deviation is linear. This coincides with the results of Mayekar et al. (2021) for mean estimation with side information. While the results of one are not directly applicable to the other, understanding the two results in the high-dimensional case remains an interesting direction to explore.

Distributed optimization.

Turning to the task of distributed optimization, the proposed mean estimation algorithm can be used as a primitive in distributed SGD. Using standard results on smooth convex optimization (e.g., Theorem 6.3 in Bubeck (2015)) and the estimation error guarantee in Corollary 1, we obtain the following bound.

Theorem 1.

Suppose the objective function is convex and -smooth. Using correlated quantization with levels as the mean estimation primitive, distributed SGD with rounds and learning rate yields

where . Moreover, each client only needs to send bits (constant bits per dimension) in each round.

Optimization algorithms based on stochastic rounding (Alistarh et al., 2017; Suresh et al., 2017) have a convergence rate of the same form with under the same communication complexity. Notice that the mean squared error (and therefore the convergence rate) we obtain is always better than that of the classic stochastic quantization algorithm, and when , the new algorithm converges faster. This corresponds to the case where the clients’ local gradients are more concentrated than their absolute range would suggest. This is a mild assumption, especially in practical federated learning settings, where in each communication round only a small number of clients participate ( is small), and the gradient of each client is computed over a large set of data points accumulated since its previous participation ( is small).

4 A toy example

Figure 1: Mean squared error for the toy problem.

Before we proceed further, we motivate our result with a simple example. Suppose there are devices and device has . Further, let us consider the simple scenario in which each client can only send one bit, i.e., the quantized value can take only two values. The popular algorithm for mean estimation in this setting is stochastic quantization, in which client sends given by

(3)

Notice that takes only two values and the server computes the mean by computing their average

We first note that is unbiased

The mean squared error of this protocol is

To motivate our approach, we consider the special case when and further assume . In this case, the mean squared error of the stochastic quantizer is

(4)

The key insight of correlated quantization is that if the first client rounds up its value, the second client should round down its value to reduce the mean squared error. To see this, we first rewrite the original stochastic quantization slightly differently. For each , let be an independent uniform random variable in the range . Then we can rewrite as

To see the equivalence of the above definition and (3), observe that the probability of is . Hence is one with probability and zero otherwise.

For the special case of , we propose to modify the estimator by making the dither variables dependent on each other. In particular, let be a uniform random variable in the range and let . This can be implemented using shared randomness. Let the modified estimator be as before

Since is an unbiased estimator, is also an unbiased estimator. We now compute the mean squared error of

We plot the mean squared error of the original quantizer and the new quantizer in Figure 1. Observe that the above mean squared error is uniformly upper bounded by the mean squared error of the original quantizer (4). Our goal is to propose a similar estimator for devices, in dimensions, that has uniformly low error compared to even when the devices have different values of .
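
The following Monte Carlo sketch reproduces the spirit of Figure 1 for the two-client case above, with both clients holding the same value p in [0, 1]. The antithetic coupling u2 = 1 - u1 is one simple way to correlate the dithers in line with the description in this section; the exact coupling used for Figure 1 in the paper may differ slightly, and all other details are illustrative assumptions.

```python
# Monte Carlo check of the two-client toy example: both clients hold the same
# value p in [0, 1] and send one bit each. The antithetic coupling u2 = 1 - u1
# is one simple way to correlate the dithers; the exact coupling used for
# Figure 1 in the paper may differ slightly.
import numpy as np

rng = np.random.default_rng(0)


def mse_independent(p, trials=200_000):
    """MSE of the mean estimate when both clients dither independently."""
    u1, u2 = rng.random(trials), rng.random(trials)
    est = ((p >= u1).astype(float) + (p >= u2).astype(float)) / 2.0
    return float(np.mean((est - p) ** 2))


def mse_correlated(p, trials=200_000):
    """MSE when the second client uses the antithetic dither u2 = 1 - u1."""
    u1 = rng.random(trials)
    u2 = 1.0 - u1
    est = ((p >= u1).astype(float) + (p >= u2).astype(float)) / 2.0
    return float(np.mean((est - p) ** 2))


for p in (0.1, 0.25, 0.5, 0.75):
    # Correlated dithering is never worse, and has zero error at p = 0.5.
    print(f"p={p}: independent={mse_independent(p):.4f}, "
          f"correlated={mse_correlated(p):.4f}")
```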

5 Correlated quantization in one dimension

We first extend the above idea and provide a new quantizer for bounded scalar values. For simplicity, we first assume each . Recall that the goal is to estimate . Let be some quantization of . We approximate the average by . As we discussed in the previous section, we propose to use

where the dither variables are uniform random variables between zero and one, but are now correlated across the clients. We generate them as follows. Let denote a random permutation of . Let denote the value of this permutation. Let be a uniform random variable between . With these definitions, we let

Observe that for each , is a uniform random variable over . However, they are now correlated across clients. Hence the quantizer can be written as

We illustrate why this quantizer is better with an example. If all clients have the same value , the new quantizer can be written as

where uses the fact that the value of does not change for this example and uses the fact that is a random permutation of . We have shown that the correlated quantizer has zero error in the above example. In contrast, the standard stochastic quantizer outputs

where follows from the fact that is an integer and , a uniform random variable from the set . Moreover, s are independent.

The above example also provides an alternative view of the proposed quantizer. If , , then the random variables in the standard stochastic quantizer can be viewed as sampling with replacement from the set , while the random variables in the correlated quantizer can be viewed as sampling without replacement from the same set. Since sampling without replacement has smaller variance, the proposed estimator performs better, and in this particular example it has zero error.

We generalize the above result and show a data-dependent bound on the error in Theorem 2. For completeness, the full algorithm when each input belongs to the range is given in Algorithm 1.

Input: .
Generate , a random permutation of .
For :
. , where . . Output: .
Algorithm 1 OneDimOneBitCQ
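
A sketch of this construction follows, assuming for simplicity that the inputs have already been rescaled to [0, 1] (Algorithm 1 handles a general range by rescaling). The shared permutation and per-client offsets follow the description in the text; variable names are ours.

```python
# A sketch of the one-bit correlated quantizer for inputs already rescaled to
# [0, 1]; Algorithm 1 handles a general range [-r, r] by rescaling. Variable
# names are ours; the shared permutation pi and the offsets gamma_i follow the
# construction described in the text.
import numpy as np


def one_dim_one_bit_cq(x, rng):
    """Quantize each x_i in [0, 1] to a single bit with correlated dithers."""
    n = x.shape[0]
    pi = rng.permutation(n)                  # shared random permutation of {0, ..., n-1}
    gamma = rng.uniform(0.0, 1.0 / n, n)     # independent offset within each stratum
    u = pi / n + gamma                       # each u_i is marginally uniform on [0, 1),
                                             # and jointly the u_i cover all n strata
    return (x >= u).astype(float)            # client i transmits the single bit y_i


rng = np.random.default_rng(0)
x = np.full(16, 0.5)                         # every client holds the same value
y = one_dim_one_bit_cq(x, rng)
print(y.mean())                              # 0.5: zero error in this example
```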
Theorem 2.

If all the inputs lie in the range , the proposed estimator OneDimOneBitCQ is unbiased and the mean squared error is upper bounded by

where is defined in (1).

We provide the proof of the theorem in Appendix A. We now extend the results to levels. A standard way of extending one-bit quantizers to multiple bits is to divide the interval into multiple fixed-length intervals and use stochastic quantization in each interval. While this does provide good worst-case bounds, it cannot yield data-dependent bounds in terms of the mean deviation. This is because there can be examples in which the samples lie in two different intervals, and using stochastic quantization in each interval separately can increase the error. For example, if we divide the range into fixed intervals and all the points are concentrated around a boundary between two of them, then the points belong to two different intervals, which yields looser bounds.

We overcome this by dividing the input space into randomized intervals. Even with randomized intervals the points may still lie in different intervals, but we use the fact that the chance of this happening is small to obtain bounds in terms of the absolute deviation. More formally, let be levels such that is uniformly distributed in the interval and where . Observe that with these definitions,

Let

If , we use the following algorithm:

where is the two-level quantizer in Algorithm 1. The full algorithm when each input belongs to the range is given in Algorithm 2, and we provide its theoretical guarantee in Theorem 3. We provide the proof of the theorem in Appendix B.

Input: .
Let be levels such that is uniformly distributed in the interval and where .
Generate , a random permutation of .
For :
. . , where . . Output: .
Algorithm 2 OneDimKLevelsCQ
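
A sketch of the k-level construction is given below, again assuming inputs in [0, 1]. The randomized level grid here uses a single shared uniform offset; the number and exact randomization of the boundaries are assumptions chosen to match the description above. Within its interval, each client applies the one-bit correlated rule with the shared permutation-based dithers.

```python
# A sketch of the k-level correlated quantizer for inputs in [0, 1]. The level
# grid below is regularly spaced with a single shared random offset; the exact
# randomization of the boundaries in the paper may differ. Within its interval,
# each client applies the one-bit correlated rule with the shared dithers.
import numpy as np


def one_dim_k_levels_cq(x, k, rng):
    """Quantize each x_i in [0, 1] to a boundary of its (randomized) interval."""
    n = x.shape[0]
    delta = 1.0 / (k - 1)                    # spacing of the level grid
    b0 = -rng.uniform(0.0, delta)            # shared random offset of the grid
    levels = b0 + delta * np.arange(k + 1)   # boundaries covering [0, 1]

    pi = rng.permutation(n)                  # shared permutation, as in Algorithm 1
    gamma = rng.uniform(0.0, 1.0 / n, n)
    u = pi / n + gamma                       # correlated dithers, marginally uniform

    idx = np.clip(((x - b0) // delta).astype(int), 0, k - 1)
    lo = levels[idx]                         # lower boundary of x_i's interval
    frac = (x - lo) / delta                  # position of x_i within its interval
    return lo + delta * (frac >= u)          # round up or down, correlated across clients


rng = np.random.default_rng(0)
x = rng.uniform(0.30, 0.35, size=32)         # well-concentrated inputs
y = one_dim_k_levels_cq(x, k=4, rng=rng)
print(abs(y.mean() - x.mean()))              # small: error tracks the data's deviation
```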
Theorem 3.

If all the inputs lie in the range , , the proposed estimator OneDimKLevelsCQ is unbiased and the mean squared error is upper bounded by

where is defined in (1).

6 Extensions to high dimensions

To extend the algorithm to high-dimensional data, we can quantize each coordinate independently using the above quantization scheme. However, such an approach is suboptimal. In this section, we show that the two approaches used in Suresh et al. (2017), namely variable-length coding and random rotation, can also be used here.

6.1 Variable length coding

One natural way to extend the above algorithm to high dimensions is to use OneDimKLevelsCQ on each coordinate. Suresh et al. (2017) and Alistarh et al. (2017) observed that, instead of spending a fixed number of bits per coordinate, one can reduce the communication cost by using variable-length codes such as Elias gamma codes or arithmetic coding. We refer to this algorithm as EntropyCQ. We use the same approach and show the following corollary.
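
As a toy illustration of why variable-length coding helps here: when the data are concentrated, most clients transmit the same (or nearby) level indices, so a universal integer code spends far fewer bits on the common symbols than a fixed-length code would. Elias gamma is used below purely as an example of such a code, and the skewed symbol distribution is made up for illustration.

```python
# Toy illustration of variable-length coding of quantization indices. Elias
# gamma is used purely as an example of a universal integer code; the skewed
# symbol distribution below is made up to mimic concentrated data, where most
# clients fall in the same interval and transmit the same index.
def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer, returned as a bit string."""
    assert n >= 1
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


indices = [1] * 90 + [2] * 8 + [5] * 2        # mostly the same (small) index
avg_bits = sum(len(elias_gamma(i)) for i in indices) / len(indices)
print(avg_bits)   # about 1.24 bits per symbol, far below a fixed-length code
```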

Corollary 1.

If all the inputs lie in the range , the proposed estimator EntropyCQ is unbiased and the mean squared error is upper bounded by

where is defined in (2) and is a constant. Furthermore, the quantized vector can be communicated to the server in bits per client in expectation.

Proof.

The proof of unbiasedness and variance follows directly from Theorem 3 applied to each coordinate. The proof of communication cost is similar to that of (Suresh et al., 2017, Theorem 4) and omitted. ∎

Based on the above corollary, we can set bits and have a quantizer that uses bits and has error

6.2 Random rotation

Input: .
For , let , a random permutation of .
Let , where is a diagonal matrix with independent Rademacher entries.
For :
. For : . For : . Output: .
Algorithm 3 WalshHadamardCQ
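
A sketch of the randomized Hadamard rotation underlying Algorithm 3 is shown below, assuming the dimension is a power of two (otherwise one pads with zeros). The per-coordinate correlated quantizer of Algorithm 2 would be applied between the client-side rotation and the server-side inverse rotation; the function names and the toy check are ours.

```python
# A sketch of the randomized Hadamard rotation underlying Algorithm 3, assuming
# the dimension d is a power of two (pad with zeros otherwise). The per-
# coordinate correlated quantizer of Algorithm 2 would be applied between the
# client-side rotate() and the server-side unrotate(); names here are ours.
import numpy as np


def fwht(a):
    """Fast Walsh-Hadamard transform (unnormalized), O(d log d)."""
    a = a.copy()
    d = a.shape[-1]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            x = a[..., i:i + h].copy()
            y = a[..., i + h:i + 2 * h].copy()
            a[..., i:i + h] = x + y
            a[..., i + h:i + 2 * h] = x - y
        h *= 2
    return a


def rotate(x, signs):
    """Client side: multiply by the random diagonal D, then by H / sqrt(d)."""
    return fwht(x * signs) / np.sqrt(x.shape[-1])


def unrotate(z, signs):
    """Server side: invert the rotation (H is symmetric and H H = d I)."""
    return fwht(z) / np.sqrt(z.shape[-1]) * signs


rng = np.random.default_rng(0)
d = 8
signs = rng.choice([-1.0, 1.0], size=d)       # shared Rademacher diagonal D
x = rng.normal(size=d)
z = rotate(x, signs)
print(np.allclose(unrotate(z, signs), x))     # True: the rotation is invertible
print(np.isclose(np.linalg.norm(z), np.linalg.norm(x)))  # True: norm preserved
```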

Instead of using a variable-length code, one can use a random rotation matrix to reduce the norm of the vectors. We use this approach and show the following result; the proof is given in Appendix C. Similar to Suresh et al. (2017), one can use the efficient Walsh-Hadamard rotation, which takes O(d log d) time to compute (Algorithm 3).

Corollary 2.

If all the inputs lie in the range , the proposed estimator WalshHadamardCQ has bias at most and the mean squared error is upper bounded by

where is defined in (1) and is a constant. Furthermore, the quantized vector can be communicated to the server in bits per client in expectation.

We note that with a communication cost of bits, the bounds with random rotation are sub-optimal by a logarithmic factor compared to the variable-length coding approach. However, random rotation may be the preferred approach in practice due to its ease of use and computational cost (Suresh et al., 2017; Vargaftik et al., 2021).

7 Lower Bound

In this section, we discuss information-theoretic lower bounds on the quantization error. We will focus on the one-dimensional case and show that correlated quantization is optimal in terms of the dependence on and under mild assumptions. For the general -dimensional case, whether the dependence on and is tight is an interesting question to explore.

In the one-dimensional case with one bit (or a constant number of bits) per client, we obtain the following lower bound, which shows that the upper bound in Theorem 2 is tight up to constant factors in terms of the dependence on when . Note that the condition is mild since when , the second term in Theorem 2, . As shown in Theorem 5, the dependence cannot be improved for -interval quantizers. Whether it can be improved for general quantizers is an interesting future direction.

Theorem 4.

For any and , and any one-bit quantizer , when , there exists a dataset with mean absolute deviation , such that

Turning to the -level ( bits) case, we are able to show that our estimator is optimal up to constant factors within the more restricted class of -interval quantizers. -interval quantizers include all quantization schemes under which the preimage of each quantized message, ignoring common randomness, is an interval. To make the definition formal, we slightly abuse notation and assume each quantizer admits another argument , which incorporates all the randomness in the quantizer. With fixed, is a deterministic function of .

Definition 1.

A quantizer is said to be a -interval quantizer if , there exist non-overlapping intervals which partition , and , ,

The class of -interval quantizers includes many common compression algorithms used in distributed optimization such as stochastic quantization and our proposed algorithm. For this restricted class of schemes, we get

Theorem 5.

Given and , for any -interval quantizer , there exists a dataset with mean absolute deviation , such that

We defer the proofs of the theorems to Appendix D.

8 Experiments

Figure 2: Comparison of compression algorithms on the mean estimation task. Panels (a)–(c) show the RMSE as a function of different problem parameters; panel (d) shows the RMSE with random rotation.
Algorithm Synthetic MNIST
Independent
Independent + Rotation
TernGrad ( bits)
Structured DRIVE
Correlated
Correlated + Rotation
Table 1: Comparison of compression algorithms on distributed mean estimation.
Algorithm Dist. -means Dist. Power Iteration FedAvg
Objective Error Accuracy %
None
Independent
Independent + Rotation
TernGrad ( bits)
Structured DRIVE
Correlated
Correlated + Rotation
Table 2: Comparison of compression algorithms on a variety of tasks: distributed mean estimation, distributed -means clustering, distributed power iteration, and federated averaging on the MNIST dataset. For all tasks, we set the number of quantization levels to two (one bit), except TernGrad which uses three quantization levels.

We demonstrate that the proposed algorithm outperforms existing baselines on several distributed tasks. Before we conduct a full comparison, we first empirically demonstrate that our correlated quantization algorithm is better than existing independent quantization schemes (Suresh et al., 2017; Alistarh et al., 2017). We implement all algorithms and experiments using the open-source JAX (Bradbury et al., 2018) and FedJAX (Ro et al., 2021) libraries.

Correlated vs independent stochastic quantization. We first compare correlated and independent stochastic quantization on a simple mean estimation task. Let be i.i.d. samples over , where coordinate is sampled independently according to , where is a uniform random variable between and is fixed for all clients, and is an independent random variable for each client in the range . Note that this distribution has a mean deviation of . We first fix the number of clients to be , , and vary . We then fix , and vary . Finally, we fix , and vary . The results are given in Figures 2(a)–(c), respectively. The experiments are averaged over ten runs for statistical consistency. Observe that in all the experiments, correlated quantization outperforms independent stochastic quantization.

Effect of random rotation. We next demonstrate that correlated quantization benefits from random rotation, similar to independent quantization. Let be i.i.d. samples over , where coordinate is independently sampled according to , where is an independent random variable for each client in the range . We let , , and . We compare the results as a function of in Figure 2(d). Observe that random rotation improves the performance of both correlated and independent quantization. Furthermore, rotated correlated quantization outperforms the remaining schemes.

In the following experiments, we compare our correlated quantization algorithm with several quantization baselines: Independent, Independent+Rotation (Suresh et al., 2017), TernGrad (Wen et al., 2017), and Structured DRIVE (Vargaftik et al., 2021). Since the focus of the paper is quantization, we only compare lossy quantization schemes and do not evaluate lossless compression schemes such as arithmetic coding or Huffman coding, which can be applied on top of any quantizer. We use 2-level quantization (one bit) for all the algorithms, except TernGrad, which uses three levels and hence requires more than one bit per coordinate per client.

Distributed mean estimation. We next compare our proposed algorithm to existing baselines on a sparse mean estimation task. Let be i.i.d. samples over , where coordinate is sampled independently according to , where is a sparse vector with sparse entries that is fixed for all clients, and is an independent random variable for each client in the range . We also compare quantizers on the distributed mean estimation task for the MNIST () dataset distributed over clients. The results for both tasks over repeated trials are in Table 1. Observe that correlated quantizers perform best.

Distributed k-means clustering.

In the distributed Lloyd’s (k-means) algorithm, each client has access to a subset of the data points and the goal of the server is to learn centers by repeatedly interacting with the clients. We employ quantizers to reduce the uplink communication cost from the clients to the server. We use the MNIST () dataset, set both the number of centers and the number of clients to 10, split the examples evenly amongst the clients, and use k-means++ to select the initial cluster centers. The results over repeated trials of communication rounds are in Table 2. Observe that correlated quantization performs best.

Distributed power iteration.

In distributed power iteration, each client has access to a subset of the data and the goal of the server is to learn the top eigenvector. As in the distributed k-means clustering above, we use quantization to reduce the communication cost from the clients to the server, and we use the MNIST () dataset distributed evenly over the clients. The results over repeated trials of communication rounds are in Table 2. Observe that correlated quantizers outperform all baselines.

Federated learning. We finally evaluate the effectiveness of the proposed algorithm in reducing the uplink communication cost in federated learning (McMahan et al., 2017). We use the image recognition task for the Federated MNIST dataset (Caldas et al., 2018a) provided by TensorFlow Federated (Bonawitz et al., 2019). This dataset consists of training examples with label classes distributed across clients. We use quantizers to reduce the uplink communication cost from the clients to the server and train a logistic regression model over communication rounds of federated averaging, with repeated trials. The results are in Table 2. Observe that correlated quantization with rotation achieves the highest test accuracy.

9 Conclusion

We proposed a new quantizer for distributed mean estimation and showed that the error guarantee depends on the deviation of data points instead of their absolute range. We further used this result to provide fast convergence rates for distributed optimization under communication constraints. Experimental results show that our proposed algorithm outperforms existing compression protocols on several tasks. We also proved the optimality of the proposed approach under mild assumptions in one dimension. Extending the lower bounds to high dimensions remains an interesting open question.

References

Appendix A Proof of Theorem 2

See Theorem 2.

Proof.

We first show the result when and ; one can obtain the final result by rescaling the quantizer via

Hence, in the following, without loss of generality, let . We first show that is an unbiased quantizer

Observe that is a uniform random variable between . Hence,

We now bound its variance. To this end let

We can rewrite the difference between estimate and the true sum as

Since ,

We now bound each of the three terms in the above summation.

First term: Observe that for all

Hence,

Therefore,

Second term: To bound the second term,

where the second equality uses the fact that is a Bernoulli random variable with parameter . We now bound for . Observe that

However, since is an integral multiple of , and , if and only if . Hence,

Hence,

Since is a random permutation,

Hence,