Hyper-Sphere Quantization: Communication-Efficient SGD for Federated Learning

11/12/2019 · by Xinyan Dai et al.

The high cost of communicating gradients is a major bottleneck for federated learning, as the bandwidth of the participating user devices is limited. Existing gradient compression algorithms are mainly designed for data centers with high-speed networks and achieve a per-iteration communication cost of O(√d log d) at best, where d is the size of the model. We propose hyper-sphere quantization (HSQ), a general framework that can be configured to achieve a continuum of trade-offs between communication efficiency and gradient accuracy. In particular, at the high compression ratio end, HSQ provides a low per-iteration communication cost of O(log d), which is favorable for federated learning. We prove the convergence of HSQ theoretically and show by experiments that HSQ significantly reduces the communication cost of model training without hurting convergence accuracy.

1 Introduction

Machine learning usually solves the optimization problem $\min_{w \in \mathbb{R}^d} f(w)$, where $f(w)$ is usually the average loss over the training samples, to obtain the model parameter $w$, where $d$ is the size of the model. Currently, stochastic gradient descent (SGD) (Robbins and Monro, 1951) is the most popular algorithm for this purpose, especially for training deep neural networks. Given an unbiased stochastic gradient $g(w)$ such that $\mathbb{E}[g(w)] = \nabla f(w)$, SGD iteratively updates the model by

$$w_{t+1} = w_t - \eta_t\, g(w_t) \qquad (1)$$

where $\eta_t$ is the learning rate and $w_t$ is the model parameter at the $t$-th iteration.
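To make update rule (1) concrete, the following minimal NumPy sketch runs SGD on a toy least-squares problem; the objective, batch size, and step size are illustrative assumptions rather than settings from the paper.

  import numpy as np

  # Toy least-squares objective f(w) = (1/n) sum_i (a_i^T w - b_i)^2 (illustrative).
  rng = np.random.default_rng(0)
  n, d = 1000, 20
  A = rng.normal(size=(n, d))
  w_true = rng.normal(size=d)
  b = A @ w_true + 0.01 * rng.normal(size=n)

  w = np.zeros(d)    # w_0
  eta = 0.01         # learning rate eta_t (kept constant here)
  for t in range(2000):
      idx = rng.integers(n, size=32)                       # mini-batch of samples
      g = 2 * A[idx].T @ (A[idx] @ w - b[idx]) / len(idx)  # unbiased stochastic gradient g(w_t)
      w = w - eta * g                                      # update (1): w_{t+1} = w_t - eta_t g(w_t)

  print("distance to the true parameter:", np.linalg.norm(w - w_true))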

Federated learning is an emerging machine learning paradigm in which many user devices (e.g., tablets and smart phones, also called clients) cooperate to train a model (Konečnỳ et al., 2016; McMahan and Ramage, 2017; McMahan et al., 2016). In the typical setting of federated learning, user devices calculate gradients (or local updates; we use "gradient" to refer to the stochastic gradient for conciseness) on their local samples and transmit the gradients to a central coordinator for the model update. Federated learning is gaining increasing attention thanks to its unique advantages over data-center-based training (Li et al., 2014; Patarasuk and Yuan, 2009): sensitive user data do not need to be uploaded to a data center, which better motivates users to participate (Shokri and Shmatikov, 2015). Moreover, labels for some supervised tasks (e.g., next word prediction) can be inferred naturally from user interaction and used efficiently for local training (McMahan et al., 2016).

As modern models are usually large (e.g., millions or even billions of parameters for deep neural networks), communicating the gradient is a major bottleneck for federated learning, as user devices commonly use wireless networks and have limited bandwidth. For federated learning, a low per-iteration communication cost [1] is important because users who cannot afford the per-iteration cost will not participate in the training at all. On the contrary, if the per-iteration cost is low, more users will be willing to participate in federated learning. This also gives federated learning greater flexibility to access different user devices in different iterations during training, thereby reducing the amount of communication conducted by an individual user.

[1] We define the per-iteration communication cost as the number of bits needed to communicate a gradient $g(w)$.

Name | SGD | TernGrad | SignSGD | QSGD | HSQ
Cost | $32d$ | $d\log_2 3$ | $d$ | $O(\sqrt{d}\log d)$ | $O(\log d)$
Table 1: Per-iteration communication cost (in bits), assuming that a floating point number has 32 bits.

A number of gradient compression algorithms have been proposed to reduce the cost of gradient communication (Alistarh et al., 2017; Bernstein et al., 2018; Wen et al., 2017; Yu et al., 2018; Chen et al., 2018; Wang et al., 2019; Stich et al., 2018; Suresh et al., 2017; Sattler et al., 2019; Tang et al., 2018, 2019; Assran et al., 2019; Acharya et al., 2019). However, these algorithms are designed for data-center-based training (e.g., clusters connected via high-speed networks) and their compression is not sufficient for federated learning given its stringent requirement on per-iteration communication cost. Table 1 lists the per-iteration communication cost of some representative algorithms. Generally, there is a trade-off between communication efficiency and gradient accuracy, where a lower communication cost is achieved by transmitting less accurate gradients. QSGD (Alistarh et al., 2017) achieves the state-of-the-art per-iteration communication cost of $O(\sqrt{d}\log d)$, with the variance bound of the gradient blown up by $\sqrt{d}$ [2]. Some heuristics can give a lower communication cost, but they do not come with convergence guarantees. To facilitate communication-efficient SGD for federated learning, we ask the following questions.

[2] Let $\nabla f(w)$ be the actual gradient, and let $g(w)$ and $\tilde{g}(w)$ be an unbiased stochastic gradient and its compressed approximation, respectively. If the original variance bound is $\sigma^2$ (i.e., $\mathbb{E}\|g(w)-\nabla f(w)\|^2 \le \sigma^2$) and the variance after quantization can be bounded by $c\sigma^2$, then the blow-up is said to be $c$.

Can we achieve a lower per-iteration communication cost than QSGD and still guarantee convergence? What cost (in gradient accuracy) do we need to pay for an extremely low communication cost?

To answer the above questions, we propose hyper-sphere quantization (HSQ), a general framework for gradient compression that can be configured to achieve various trade-offs between communication efficiency and gradient accuracy. At the high compression ratio end, HSQ achieves a communication cost of $O(\log d)$ and the variance bound of the gradient is blown up by $d$, which is favorable for federated learning. With a per-iteration communication cost of $O(\sqrt{d}\log d)$, HSQ achieves the same variance scaling as QSGD at $\sqrt{d}$. At the other extreme end, with an $O(d)$ per-iteration communication cost, HSQ only increases the variance bound by a small constant.

Inspired by vector quantization techniques (Chen et al., 2010; Ge et al., 2013; Jegou et al., 2011; Wu et al., 2017), HSQ adopts a paradigm that is fundamentally different from existing gradient compression algorithms (Bernstein et al., 2018; Wen et al., 2017). Instead of quantizing each element of the gradient vector individually or relying on sparsity, HSQ quantizes the gradient as a whole using a vector codebook shared between the user devices and the central coordinator. HSQ chooses a codeword to approximate the gradient in a probabilistic manner and achieves low communication cost by only sending the index of the selected codeword. We prove that HSQ converges for both smooth convex and non-convex cost functions by analyzing its variance bound. We also show that some existing algorithms (e.g., SignSGD (Bernstein et al., 2018), TernGrad (Wen et al., 2017)) can be regarded as special cases of HSQ under specific configurations. Experiments on state-of-the-art neural networks show that model training with HSQ converges smoothly. In terms of the total amount of communication needed to train the networks to convergence, HSQ significantly outperforms SGD and existing gradient compression algorithms.

Contributions Our contributions are threefold. First, we provide a new paradigm that quantizes the gradient using a vector codebook. Vector quantization (VQ) has been extensively studied (Chen et al., 2010; Jegou et al., 2011; Wang et al., 2018b) and this work may inspire the adoption of many effective VQ techniques for gradient compression. Second, by providing a continuum of trade-offs between communication efficiency and gradient accuracy, HSQ can be used in a diverse set of scenarios and helps understand the relation between variance and compression ratio in gradient quantization. Third, HSQ reduces the state-of-the-art per-iteration communication cost from $O(\sqrt{d}\log d)$ to $O(\log d)$, which benefits federated learning.

Notations We use plain lower-case letters for scalars and vectors, e.g., $g$. Matrices are denoted by bold upper-case letters, e.g., $\mathbf{U}$. $\|g\|$ is the Euclidean norm of vector $g$ while $\|g\|_1$ is the $\ell_1$-norm of vector $g$. $|a|$ is the absolute value of scalar $a$. We denote a vector with all zeros by $\mathbf{0}$.

2 Related Work

Gradient Compression for Data-Center-Based Training It is widely known that gradient communication can easily become the bottleneck of data-center-based distributed machine learning when the model is large (Alistarh et al., 2017; Seide et al., 2014). To reduce the communication cost, the most intuitive idea is to transmit gradients with reduced precision. TernGrad (Wen et al., 2017) quantizes each element of a gradient vector to one of three numerical values $\{-s, 0, +s\}$, where $s$ is a scaling factor. The gradient approximation is shown to be unbiased and training converges under the assumption of a bound on the gradients. SignSGD (Bernstein et al., 2018) quantizes each element of the gradient to $\{-1, +1\}$ according to its sign. Although its gradient approximation is biased, SignSGD converges with respect to the $\ell_1$-norm of the gradients. Both TernGrad and SignSGD achieve an $O(d)$ communication cost at best as they quantize each dimension of the gradient vector individually.
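For contrast with HSQ's vector-level approach, here is a minimal NumPy sketch of the element-wise quantizers behind SignSGD and TernGrad (an illustrative reading, not the reference implementations); both touch every coordinate, which is why their cost remains $O(d)$ bits.

  import numpy as np

  def signsgd_quantize(g):
      """SignSGD-style quantizer: keep only the sign of each coordinate (1 bit per element)."""
      return np.sign(g)

  def terngrad_quantize(g, rng=np.random.default_rng()):
      """TernGrad-style quantizer: map each coordinate to {-s, 0, +s} with s = max|g_i|,
      zeroing coordinates at random so that the result stays unbiased (~2 bits per element)."""
      s = np.max(np.abs(g))
      if s == 0:
          return np.zeros_like(g)
      keep = rng.random(g.shape) < np.abs(g) / s   # keep coordinate i with probability |g_i|/s
      return s * np.sign(g) * keep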

QSGD (Alistarh et al., 2017) scales the gradient vector $g$ by its Euclidean norm $\|g\|$ and quantizes each element of $g/\|g\|$ independently using uniformly spaced levels in $[0, 1]$. As there are at most $O(\sqrt{d})$ non-zero elements (in expectation) in the quantized gradient when a small number of levels is used, QSGD achieves a per-iteration communication cost of $O(\sqrt{d}\log d)$. GradiVeQ (Yu et al., 2018) compresses the gradient using its projections on the eigenvectors (those corresponding to large eigenvalues) of the gradient covariance matrix. However, a training phase that communicates uncompressed gradients is required to learn the eigenvectors, and GradiVeQ does not come with a theoretical analysis. Similar to QSGD and GradiVeQ, ATOMO (Wang et al., 2018a) also relies on sparsity. It proposes an atomic decomposition of the gradient and reduces communication cost by only sending atoms with non-zero (or large) weights. HSQ is different from these algorithms as it neither quantizes each element of the gradient independently nor relies on sparsity. It is difficult to apply error-feedback-based algorithms (Lin et al., 2018; Karimireddy et al., 2019) to federated learning as a user may not be selected in successive iterations and the error feedback may become stale. Moreover, HSQ is orthogonal to and can be combined with these algorithms.

Gradient Compression for Federated Learning As communication cost is more critical for federated learning than for data-center-based distributed training (McMahan et al., 2016), heuristics have been proposed to achieve even lower per-iteration communication cost than the algorithms listed in Table 1. Konečnỳ et al. (2016) proposed structured updates and random selection. A structured update constrains the local gradient matrix to be low-rank, i.e., $\mathbf{G} = \mathbf{A}\mathbf{B}$, such that the low-rank factors $\mathbf{A}$ and $\mathbf{B}$ can be reported at lower cost than $\mathbf{G}$. Random selection randomly chooses some elements of the gradient vector to report. Selective SGD (Shokri and Shmatikov, 2015) communicates the gradient elements with the largest absolute values. Although these heuristics may work in practice, they do not come with theoretical analysis and it is not clear how they trade gradient accuracy for communication efficiency. With an explicit variance bound, HSQ makes it possible to adjust the balance between communication efficiency and gradient accuracy.

Vector Quantization Vector quantization (VQ) techniques (Wang et al., 2018b), including product quantization (PQ) (Jegou et al., 2011), optimized product quantization (OPQ) (Ge et al., 2013) and residual quantization (RQ) (Chen et al., 2010), are widely used to compress large datasets and to conduct efficient similarity search. To compress a dataset containing vectors in $\mathbb{R}^m$, VQ uses one (or several) vector codebook(s) $\mathbf{U} \in \mathbb{R}^{m \times k}$, in which each codeword (column) $u_i \in \mathbb{R}^m$. For a vector $x$, only the index of its nearest codeword in the codebook (i.e., $\arg\min_i \|x - u_i\|$) is stored. As the codebook is shared over the entire dataset, the storage cost for each vector is reduced from $32m$ bits to $\lceil\log_2 k\rceil$ bits.
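The following NumPy sketch illustrates the basic VQ encode/decode step described above; the random codebook and all function names are illustrative assumptions (real VQ methods would learn the codebook, e.g., with k-means).

  import numpy as np

  def vq_encode(X, U):
      """Encode each row of X (n x m) by the index of its nearest codeword.
      U is an m x k codebook whose columns are the codewords."""
      # Squared distances ||x - u_j||^2 = ||x||^2 - 2 x^T u_j + ||u_j||^2
      d2 = (np.sum(X ** 2, axis=1, keepdims=True)
            - 2 * X @ U
            + np.sum(U ** 2, axis=0, keepdims=True))
      return np.argmin(d2, axis=1)          # n integer codes, ~log2(k) bits each

  def vq_decode(codes, U):
      """Reconstruct approximate vectors from the stored codes."""
      return U[:, codes].T

  # Usage (illustrative): compress 10,000 64-dimensional vectors with 256 codewords.
  rng = np.random.default_rng(0)
  X = rng.normal(size=(10000, 64))
  U = rng.normal(size=(64, 256))
  X_hat = vq_decode(vq_encode(X, U), U)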

Although the gradient is inherently a vector, directly applying VQ techniques to gradient compression is difficult. First, all VQ algorithms need to learn the codebook on the dataset to be compressed (e.g., using the k-means algorithm). However, we cannot obtain the gradients for codebook learning before model training starts. Second, the VQ algorithms do not provide guarantees on the quality of their approximations, which makes convergence analysis difficult.

3 Hyper-Sphere Quantization

We consider the typical setting of federated learning with a central coordinator (or server) and many participating devices (or clients). Each device computes gradients using local training samples and transmits the (compressed) gradients to the coordinator. The coordinator aggregates the gradients from the devices and sends the model updates back to them. In the analysis, we assume that the devices use HSQ for gradient reporting while the coordinator transmits uncompressed model updates to the devices. In our experiments, model training also converges when the coordinator uses HSQ to transmit model updates.

3.1 HSQ Algorithm

HSQ partitions the original $d$-dimensional gradient vector into segments of length $s$ (assume $d$ is divisible by $s$) and quantizes each segment individually. For simplicity of presentation, we also use $g$ to denote a gradient segment. HSQ uses a vector codebook $\mathbf{U} = [u_1, \ldots, u_k] \in \mathbb{R}^{s \times k}$ with $k$ codewords, in which each codeword (column) $u_i$ is an $s$-dimensional unit-norm vector, i.e., $u_i \in \mathbb{R}^s$ and $\|u_i\| = 1$ for $i = 1, \ldots, k$. We also require $\mathbf{U}$ to be a full-row-rank matrix, which means $\mathrm{rank}(\mathbf{U}) = s$ and normally $k \ge s$. Note that the codebook is shared among the coordinator and the devices such that a device only needs to transmit the index of the selected codeword. Moreover, the same codebook can be reused for all gradient segments. For a gradient segment $g$, HSQ approximates it using a tuple $(\hat{r}, u)$, in which $\hat{r}$ is called the pseudo-norm and $u$ is a codeword chosen from the codebook $\mathbf{U}$. Intuitively, $u$ encodes the unit-norm direction vector of $g$ while $\hat{r}$ encodes the norm of $g$. Therefore, the HSQ-based approximation of $g$ is given by $\hat{g} = \hat{r}u$. We introduce two variants of HSQ, the unbiased version and the greedy version, in Algorithm 1 and Algorithm 2, respectively.

  Input: a gradient segment $g$ to compress
  Output: tuple $(\hat{r}, u)$ to approximate $g$
  if $g = \mathbf{0}$ then
     $\hat{r} = 0$, $u = u_1$, return $(\hat{r}, u)$
  end if
  Calculate $w = \mathbf{U}^{\dagger} g$, where $\mathbf{U}^{\dagger} = \mathbf{U}^{\top}(\mathbf{U}\mathbf{U}^{\top})^{-1}$
  Get $p$ such that $p_i = |w_i| / \|w\|_1$, for $i = 1, \ldots, k$
  Select codeword $u$ as $u_i$ with probability $p_i$; if $u_i$ is selected, set $r = \mathrm{sign}(w_i)\,\|w\|_1$
  Quantize $r$ to obtain $\hat{r}$ as $\hat{r} = Q(r)$
Algorithm 1 Hyper-Sphere Quantization: Unbiased Version

Algorithm 1 gives the unbiased version of the HSQ algorithm, which chooses the tuple $(\hat{r}, u)$ to approximate $g$ in a probabilistic manner. We will show that the gradient approximation provided by Algorithm 1 is unbiased in Section 4.1. Note that if a gradient segment is an all-zero vector, we set $\hat{r} = 0$ and $u = u_1$. A special configuration of Unbiased-HSQ is that $k = s$ and $\mathbf{U}$ is an orthonormal matrix, which means $\mathbf{U}^{\dagger} = \mathbf{U}^{\top}$ and $w = \mathbf{U}^{\top}g$. In this case, codeword $u_i$ is selected with a probability proportional to $|u_i^{\top}g|$, its correlation with $g$.
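A minimal NumPy sketch of one possible reading of Algorithm 1 (pseudo-norm quantization omitted); the explicit pseudo-inverse and the empirical check are simplifications for illustration.

  import numpy as np

  def unbiased_hsq(g, U, rng=np.random.default_rng()):
      """Return (r, index) for one gradient segment g so that E[r * U[:, index]] = g.
      U is an s x k full-row-rank codebook with unit-norm columns."""
      if not np.any(g):
          return 0.0, 0
      w = np.linalg.pinv(U) @ g                    # w = U^+ g, so that U w = g
      p = np.abs(w) / np.sum(np.abs(w))            # selection probabilities p_i
      i = rng.choice(len(p), p=p)                  # pick a codeword index
      r = np.sign(w[i]) * np.sum(np.abs(w))        # pseudo-norm r
      return r, i

  # Empirical check of unbiasedness with an orthonormal codebook (k = s = 8).
  rng = np.random.default_rng(0)
  U, _ = np.linalg.qr(rng.normal(size=(8, 8)))
  g = rng.normal(size=8)
  samples = [unbiased_hsq(g, U, rng) for _ in range(20000)]
  est = np.mean([r * U[:, i] for r, i in samples], axis=0)
  print("max deviation from g:", np.max(np.abs(est - g)))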

There are $d/s$ gradient segments in total and we assume that the minimum value and maximum value of the pseudo-norm $r$ over these segments are $r_{\min}$ and $r_{\max}$, respectively. Each $r$ is quantized to one of $2^b$ uniformly spaced levels between $r_{\min}$ and $r_{\max}$ (inclusive). The quantization function is similar to the one in QSGD and can be expressed as

$$Q(r) = r_{\min} + (\lfloor\xi\rfloor + \theta)\,\Delta \qquad (2)$$

where $\Delta = (r_{\max} - r_{\min})/(2^b - 1)$ is the spacing between adjacent levels, $\xi = (r - r_{\min})/\Delta$, and $\theta \in \{0, 1\}$ is a Bernoulli random variable with $\Pr[\theta = 1] = \xi - \lfloor\xi\rfloor$, so that $\mathbb{E}[Q(r)] = r$. Therefore, HSQ takes $b + \lceil\log_2 k\rceil$ bits to communicate a gradient segment, in which $b$ bits are used for the pseudo-norm and $\lceil\log_2 k\rceil$ bits are used to transmit the index of the selected codeword. Denoting the HSQ-quantized gradient of the $i$-th device as $\tilde{g}_i$ [3], the coordinator simply aggregates the gradients as $\tilde{g} = \frac{1}{N}\sum_{i=1}^{N}\tilde{g}_i$, where $N$ is the total number of participating user devices. We call SGD that uses HSQ for gradient reporting HSQ-SGD.

[3] A device also sends $r_{\min}$ and $r_{\max}$ to the coordinator so that $\hat{r}$ can be decoded.
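One possible reading of the pseudo-norm quantizer (2), in the spirit of QSGD's stochastic rounding; the exact level layout in the paper may differ, so treat this as a sketch.

  import numpy as np

  def quantize_pseudo_norm(r, r_min, r_max, b, rng=np.random.default_rng()):
      """Stochastically round r to one of 2**b uniformly spaced levels in
      [r_min, r_max] so that the result is unbiased (sketch of Eq. (2))."""
      if r_max == r_min:
          return r_min
      delta = (r_max - r_min) / (2 ** b - 1)   # spacing between adjacent levels
      pos = (r - r_min) / delta                # fractional level index
      low = np.floor(pos)
      up = rng.random() < (pos - low)          # round up with prob. equal to the fractional part
      return r_min + (low + up) * delta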

  Input: a gradient segment $g$ to compress
  Output: tuple $(\hat{r}, u)$ to approximate $g$
  if $g = \mathbf{0}$ then
     $\hat{r} = 0$, $u = u_1$, return $(\hat{r}, u)$
  end if
  Calculate the correlation vector $c = \mathbf{U}^{\top}g$
  Select codeword $u$ as $u_{i^*}$, where $i^* = \arg\max_i |c_i|$
  Set $r = c_{i^*}$
  Quantize $r$ to obtain $\hat{r}$ as $\hat{r} = Q(r)$ using (2)
Algorithm 2 Hyper-Sphere Quantization: Greedy Version

Algorithm 2 gives the greedy version of the HSQ algorithm. The difference from Algorithm 1 is that codeword selection is no longer probabilistic: the codeword that has the largest correlation with the gradient is chosen. Although the gradient approximation provided by Greedy-HSQ is biased, we show that training converges with a growing batch size in Section 4.2. Unbiased-HSQ and Greedy-HSQ have the same per-iteration communication cost, but Greedy-HSQ usually performs better in the experiments.
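A matching NumPy sketch of one possible reading of Algorithm 2 (again without pseudo-norm quantization); the function name is illustrative.

  import numpy as np

  def greedy_hsq(g, U):
      """Return (r, index) for one gradient segment g: pick the codeword most
      correlated with g and use the correlation as the pseudo-norm."""
      if not np.any(g):
          return 0.0, 0
      c = U.T @ g                       # correlations with all codewords
      i = int(np.argmax(np.abs(c)))     # index of the most correlated codeword
      return c[i], i                    # decoded approximation is c[i] * U[:, i]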

The paradigm of HSQ, which approximates the gradient vector with a direction vector and a pseudo-norm, is quite general. Several existing gradient compression algorithms, such as SignSGD and TernGrad, can actually be viewed as special cases of HSQ with specific configurations of the segment length $s$, the codebook $\mathbf{U}$, and the method of pseudo-norm quantization. We provide more discussion of the relation between HSQ and these algorithms in Section 1 of the supplementary material.

3.2 Typical Configurations

HSQ can achieve different trade-offs between communication efficiency and gradient accuracy by configuring its parameters, i.e., the length $s$ of the gradient segments and the number of quantization levels $2^b$ for the pseudo-norm. Based on the analytical results for Unbiased-HSQ in Section 4.1, we show three representative configurations without pseudo-norm quantization, which means that the exact pseudo-norm $r$ is transmitted.

Extreme Compression By quantizing the $d$-dimensional gradient vector as a whole (i.e., using a single segment with $s = d$) and using an orthogonal matrix as the codebook $\mathbf{U}$ (i.e., $k = s = d$), HSQ uses $32 + \lceil\log_2 d\rceil$ bits. This configuration achieves the current best per-iteration communication cost of $O(\log d)$ and blows up the variance bound of the gradient by $d$.

Compact Compression With $s = \sqrt{d}$ and $k = s$, HSQ takes $\sqrt{d}\,(32 + \lceil\log_2\sqrt{d}\rceil)$ bits, i.e., $O(\sqrt{d}\log d)$, to transmit the entire gradient vector. The variance bound of the gradient is scaled up by $\sqrt{d}$. This configuration resembles the sparse case of QSGD, which gives the previously known best communication cost of $O(\sqrt{d}\log d)$ with the same variance blow-up of $\sqrt{d}$.

High Precision Setting $s$ to a small positive integer independent of $d$ and $k = s$, HSQ has a communication cost of $O(d)$ and the variance bound of the gradient is scaled up by only a constant. Under this configuration, HSQ has a communication cost similar to the algorithms that quantize each element of the gradient vector individually, and the constant variance scaling resembles the dense configuration of QSGD.

The three configurations show a clear trade-off between communication efficiency and gradient accuracy. A higher compression ratio leads to a larger variance bound on the gradients, while reporting gradients more accurately incurs a higher communication cost. The number of user devices in federated learning is much larger than the number of machines in data-center-based distributed training, so variance can be reduced by averaging the gradients reported by a large number of devices rather than by requiring each device to pay a high per-iteration communication cost. This makes the extreme compression configuration of HSQ appealing for federated learning.
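For a rough sense of scale, the sketch below tallies per-iteration bits for the three configurations, assuming one unquantized 32-bit pseudo-norm plus a $\lceil\log_2 k\rceil$-bit index per segment; the concrete model size is an illustrative assumption.

  import math

  def hsq_bits(d, s, k, norm_bits=32):
      """Bits per iteration: one (pseudo-norm, index) pair per segment of length s."""
      return math.ceil(d / s) * (norm_bits + math.ceil(math.log2(k)))

  d = 25_000_000  # illustrative model size
  print("extreme    (s = d,       k = d):      ", hsq_bits(d, d, d))                          # O(log d)
  print("compact    (s = sqrt(d), k = sqrt(d)):", hsq_bits(d, int(d ** 0.5), int(d ** 0.5)))  # O(sqrt(d) log d)
  print("high prec. (s = 8,       k = 8):      ", hsq_bits(d, 8, 8))                          # O(d)
  print("vanilla SGD (32-bit floats):          ", 32 * d)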

3.3 Codebook and Practical Considerations

Consider Greedy-HSQ in Algorithm 2: the approximation error is small when the correlation $|u^{\top}g|$ is large. Note that the norm of $g$ only serves as a scaling factor in the approximation error, so we can assume $\|g\| = 1$ without loss of generality. As we do not have any knowledge of $g$ when designing the codebook, we assume $g$ can appear anywhere on the unit hypersphere and optimize the worst-case value of $\max_i |u_i^{\top}g|$, which leads to the following formulation of the codebook design problem

$$\max_{\mathbf{U}} \; \min_{g \in \mathbb{S}^{s-1}} \; \max_{1 \le i \le k} |u_i^{\top}g| \qquad (3)$$

where $\mathbb{S}^{s-1}$ denotes the unit hypersphere in the $s$-dimensional space. Although problem (3) is difficult to solve, it requires that any vector on the unit hypersphere should have a codeword close to it in $\mathbf{U}$. Intuitively, the codewords in $\mathbf{U}$ should be uniformly located on the unit hypersphere such that the region covered by each codeword (i.e., the set of vectors closest to it) has identical area. When $k = s$, any orthonormal basis is uniformly located on the unit hypersphere and should be a good choice for the codebook. Empirically, we observed no difference in performance when setting $\mathbf{U}$ to different random rotations of the standard orthonormal basis. However, using the standard orthonormal basis itself gives slightly worse performance, possibly because it only allows each device to update one element in a gradient segment. We discuss how to generate codewords uniformly located on the unit hypersphere when $k > s$ in Section 3 of the supplementary material.
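One way to realize a random rotation of the standard orthonormal basis for the $k = s$ case is to orthonormalize a Gaussian matrix; a minimal sketch (not necessarily the construction used in the paper):

  import numpy as np

  def random_rotation_codebook(s, rng=np.random.default_rng(0)):
      """Generate an s x s orthonormal codebook, i.e. a random rotation of the
      standard orthonormal basis."""
      Q, R = np.linalg.qr(rng.normal(size=(s, s)))
      # Sign correction so the rotation is uniformly distributed (Haar measure).
      return Q * np.sign(np.diag(R))   # columns are unit-norm, mutually orthogonal codewords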

Although the core idea of HSQ is a shared codebook between the coordinator and the devices, the coordinator does not need to actually transmit $\mathbf{U}$ to the devices. Instead, a random seed can be issued to the devices so that they generate the codebook on their own, which reduces communication. The overall complexity of the matrix-vector multiplications ($\mathbf{U}^{\dagger}g$ or $\mathbf{U}^{\top}g$) is $O(dk)$ for all segments of a gradient vector and can be controlled by configuring $s$ and $k$ [4]. When large $s$ and $k$ are used for a high compression ratio, the complexity can be reduced in two ways. The first is to project both the codebook and the gradient segments with a shared matrix $\mathbf{\Phi}$ and compute $(\mathbf{\Phi}\mathbf{U})^{\top}(\mathbf{\Phi}g)$ instead of $\mathbf{U}^{\top}g$, in which $\mathbf{\Phi}$ is an $m \times s$ matrix ($m \ll s$) whose entries follow an i.i.d. Gaussian distribution. If the matrix transformation $\mathbf{\Phi}\mathbf{U}$ is conducted beforehand and $m \ll s$, the complexity to process a gradient vector is greatly reduced. According to the Johnson-Lindenstrauss lemma (Vempala, 2005), this transformation preserves the inner products between $g$ and the codewords with high probability if $m$ is not too small. The other is to use the standard orthonormal basis as the codebook, i.e., $\mathbf{U} = \mathbf{I}$: Unbiased-HSQ simply selects codeword $u_i$ with a probability proportional to $|g_i|$, while Greedy-HSQ selects the codeword with $i^* = \arg\max_i |g_i|$, both of which can be done very efficiently.

[4] Remember that $k \ge s$ is required.
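A sketch of the dimensionality-reduction trick described above: project the codebook and each gradient segment with a shared Gaussian matrix and correlate in the reduced space; the symbol names and the choice of $m$ are assumptions.

  import numpy as np

  def jl_correlations(g, U, m, rng=np.random.default_rng(0)):
      """Approximate c = U^T g using an m-dimensional Johnson-Lindenstrauss sketch.
      Phi @ U can be precomputed once and reused for every segment."""
      s = len(g)
      Phi = rng.normal(size=(m, s)) / np.sqrt(m)   # i.i.d. Gaussian projection
      U_proj = Phi @ U                             # m x k, precomputable
      g_proj = Phi @ g                             # m-dimensional sketch of the segment
      return U_proj.T @ g_proj                     # approximate correlations, O(mk) per segment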

4 Convergence Analysis

In this section, we present convergence results for both Algorithm 1 and Algorithm 2. We begin our analysis by presenting the necessary definitions and assumptions.

Definition 1 ($L$-smooth).

A function $f$ is said to be $L$-smooth if for all $x, y \in \mathbb{R}^d$, it holds that $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$.

Assumption 1 (Convex setting).

The objective function $f$ is $L$-smooth and convex.

Assumption 2 (Non-convex setting).

The objective function $f$ is $L$-smooth but potentially non-convex.

Assumption 3 (Bounded second moment).

For all $w$, let $g(w)$ denote the unbiased stochastic gradient at $w$. We require a $B$-bounded second moment for all segments $g_j(w)$ of $g(w)$, i.e., $\mathbb{E}\|g_j(w)\|^2 \le B$ for $j = 1, \ldots, d/s$. Thus, the second moment of $g(w)$ is bounded by $(d/s)B$.

Remark: The assumption of a bounded second moment is also required in QSGD (Definition 2.1). For a segment $g_j(w)$ of the stochastic gradient with second moment bound $B$, the variance bound can be deduced as $\mathbb{E}\|g_j(w) - \nabla_j f(w)\|^2 \le B$ if $g_j(w)$ is unbiased.

4.1 Analysis for Unbiased-HSQ

Now we present our main lemmas and theorems for Algorithm 1. The proofs can be found in Section 2 of the suppl. material. First, we show that the Unbiased-HSQ approximation of a gradient segment is unbiased.

Lemma 1 (Unbiasedness).

Using the notations in Algorithm 1, given a full-row-rank codebook $\mathbf{U}$, for any gradient segment $g$, the vector $p$ defines a probability distribution over the indices $\{1, \ldots, k\}$, and the approximation $\hat{r}u$ is unbiased with respect to the randomness of quantization, i.e., $\mathbb{E}[\hat{r}u] = g$.

Next, we analyze the variance brought by the quantization process of Unbiased-HSQ in Algorithm 1.

Lemma 2 (Quantization Variance).

Using the notations in Algorithm 1, if Assumption 3 holds, the second moment of the quantized gradient segment $\hat{r}u$ can be bounded in terms of the segment bound $B$, the largest singular value of the pseudo-inverse $\mathbf{U}^{\dagger}$, and the pseudo-norm quantization error. Moreover, with the quantized gradient at $w$ denoted as $\tilde{g}(w)$, its variance can be upper bounded accordingly.

A variance bound that does not depend on $r_{\min}$ and $r_{\max}$ can also be obtained by relating the two values to the second moment of the gradient. However, this bound is loose and we use the one related to $r_{\min}$ and $r_{\max}$ in the proof of convergence. Setting $k = s$ and assuming that $\mathbf{U}$ is an orthonormal matrix, Lemma 2 can be rewritten in a simpler form.

Several observations can be made from Lemma 2. First, the variances introduced by direction quantization and pseudo-norm quantization are additive. Second, if $\mathbf{U}$ is orthonormal and $k = s$, the variance is blown up by $s$ when the pseudo-norms are not quantized. This observation immediately leads to the configurations reported in Section 3.2. According to Lemma 2, one would not use more codewords than the length of a segment (i.e., $k > s$) as this would blow up the variance bound. However, we observed empirically that increasing $k$ beyond $s$ improves training performance. This is because the bound used in the proof of Lemma 2 is loose. With the variance bound in Lemma 2, we can obtain the following convergence results for Unbiased-HSQ.

Theorem 1 (Convex, Theorem 6.3, (Bubeck and others, 2015)).

If Assumptions 1 and 3 hold, then using Lemmas 1 and 2, for a positive integer $T$ and initial point $w_0$ of iterative scheme (1), choosing a suitable constant step size $\eta$, after running (1) for $T$ iterations the expected suboptimality of the averaged iterate is bounded, with the usual $O(1/\sqrt{T})$ convex SGD rate inflated by the variance blow-up given in Lemma 2.

Theorem 2 (Non-convex).

If Assumptions 2 and 3 hold, then using Lemmas 1 and 2, for a positive integer $T$ and initial point $w_0$ of iterative scheme (1), choosing a suitable constant step size $\eta$, after running (1) for $T$ iterations the expected squared gradient norm, averaged over the iterations, is bounded, with the bound inflated by the variance blow-up given in Lemma 2.

4.2 Analysis for Greedy-HSQ

Definition 2 ($\delta$-Compressor).

A compressor $Q$ is called a $\delta$-compressor if

$$\|Q(g) - g\|^2 \le \delta\|g\|^2 \quad \text{for any } g,$$

where $\delta$ is a constant and $0 \le \delta < 1$.

Lemma 3 (Quantization Error).

The direction quantizer $Q$ of Greedy-HSQ, i.e., $Q(g) = (u_{i^*}^{\top}g)\,u_{i^*}$ with $i^* = \arg\max_i |u_i^{\top}g|$, is a $\delta$-compressor, where $\delta = 1 - \sigma_{\min}^2(\mathbf{U})/k$ and $\sigma_{\min}(\mathbf{U})$ is the minimum singular value of the codebook matrix $\mathbf{U}$.

Lemma 3 shows that the direction quantizer in Greedy-HSQ is a $\delta$-compressor. In fact, we can substitute the direction quantizer in Greedy-HSQ with any $\delta$-compressor and still preserve the convergence results in Theorem 3.

Theorem 3 (Greedy version, Non-convex).

If Assumptions 2 and 3 hold, and the learning rate and batch size are chosen as suitable functions of the total number of iterations $T$ (with the batch size growing with $T$) and of the compressor constant $\delta$ in Lemma 3, then Greedy-HSQ attains a convergence rate of $O(1/\sqrt{T})$ for the expected gradient norm.

The proof can be found in the supplementary material; the main technical challenge is that the gradient approximation given by a $\delta$-compressor may be biased. The convergence rate of Greedy-HSQ is $O(1/\sqrt{T})$, but convergence is faster with small $\delta$. This is intuitive as $\delta$ models the error introduced by the compressor and a smaller $\delta$ means less error. Although the convergence of Algorithm 2 relies on a relatively large batch size in Theorem 3 (still smaller than that required by SignSGD, where the batch size grows with the number of iterations), empirically we observed that Greedy-HSQ converges well with a small batch size.

5 Experimental Results

We experimented with popular deep neural networks, VGG (Simonyan and Zisserman, 2014) and ResNet (He et al., 2016), and report the performance of training image classifiers on ILSVRC-12 (Russakovsky et al., 2015), CIFAR-10, CIFAR-100 (Krizhevsky and Hinton, 2009) and Fashion-MNIST (Xiao et al., 2017) in the main paper and Section 6 of the supplementary material. The greedy version of HSQ is used due to its better empirical performance. The codebook is generated by k-means on random Gaussian vectors. Detailed experimental settings (e.g., learning rate scheduling and data augmentation) can be found in Section 4 of the supplementary material. We focus on the communication cost in federated learning and report it as the main performance metric. All code will be made public.

Figure 1: Test accuracy vs. epoch (left) and communication cost (right) in federated learning setting for training ResNet50 on Fashion MNIST (best viewed in color)
Figure 2: Test accuracy vs. communication cost for VGG19 (left) and ResNet101 (right) training with SGD, QSGD, TernGrad, SignSGD and HSQ (best viewed in color)
Algorithm | SGD | SignSGD | TernGrad | QSGD (4 bit) | QSGD (8 bit) | HSQ (s=8) | HSQ (s=16) | HSQ (s=64)
Comp. Ratio | 1 | 32 | 20.2 | 8 | 4 | 18.3 | 36.6 | 146.3
VGG19 | 92.65 | 90.79 | 91.10 | 92.60 | 92.71 | 92.76 | 92.38 | 91.13
ResNet50 | 94.19 | 92.60 | 93.29 | 94.64 | 94.03 | 94.68 | 94.77 | 93.77
ResNet101 | 94.63 | 92.01 | 93.15 | 94.35 | 94.67 | 94.48 | 94.70 | 93.87
Table 2: Compression ratio and convergence accuracy (%) of the algorithms on CIFAR-10

Simulated experiments in the federated learning setting For the experiments on Fashion-MNIST (Xiao et al., 2017), the training samples were randomly partitioned among 1000 users and 100 users were selected randomly in each iteration to simulate the scenario of federated learning. We report the test accuracy against the epoch count and the total amount of up-link communication conducted by user devices for gradient reporting in Figure 1. We compared HSQ with SGD, SignSGD (Bernstein et al., 2018), TernGrad (Wen et al., 2017), and QSGD (Alistarh et al., 2017). The bucket size was set as 512 for QSGD as in (Alistarh et al., 2017). HSQ used 6 bits for the pseudo-norm and a codebook with 256 codewords for all configurations. The results show that training converges smoothly with HSQ and that HSQ significantly reduces the amount of communication needed to reach the same test accuracy compared with the baselines. Setting $s = 256$, HSQ reduces the communication cost of SGD by about 585x with only a loss of 0.8% in final classification accuracy. With smaller values of $s$, HSQ achieves the same or slightly higher final classification accuracy compared with SGD, but its communication cost is much lower.

Figure 3: Test accuracy vs. epoch (left) and per-epoch time (right) for SGD and HSQ (s=8) for training ResNet50 on ILSVRC-12 (best viewed in color)

We conducted more experiments on CIFAR-10 and report the results in Figure 2. For a clearer demonstration, we also list the compression ratio (with vanilla SGD as the baseline) and the convergence accuracy of the algorithms (with more configurations than shown in Figure 2) in Table 2. The results show that HSQ often outperforms the baselines in both convergence accuracy and compression ratio. With $s = 64$, the compression ratio of HSQ is significantly higher than that of the other algorithms and the degradation in convergence accuracy is small. We plotted the test accuracy against the iteration count for the algorithms in Section 5.1 of the supplementary material, which shows that training converges smoothly using HSQ.

Timing experiments for distributed training Although HSQ is designed for federated learning, where low communication cost is critical, we also report its performance for data-center-based distributed training in Figure 3. The dataset is ILSVRC-12 (Russakovsky et al., 2015) and the 4 GPUs used for training are connected by a high-speed PCIe bus. The results show that HSQ reduces the per-epoch time of SGD by 14.4% due to its smaller communication cost, and the degradation in final test accuracy is very small (0.5%).

(a) Value of $s$
(b) Greedy vs. Unbiased HSQ
Figure 4: Effect of the parameter configurations in HSQ for training ResNet50 on Fashion-MNIST (best viewed in color)
(a) Codebook generation
(b) Number of bits for the pseudo-norm
Figure 5: Effect of the parameter configurations in HSQ for training ResNet50 on CIFAR-10 (best viewed in color)

Influence of the parameters Keeping the codebook size fixed at $k = 256$, we plot the value of the loss function against the epoch count in Figure 4(a) under different values of $s$. Note that a larger $s$ results in a higher compression ratio. The results show that when $s$ is too large (e.g., 512), the decrease of the loss becomes unstable, which can be explained by the high variance of the gradient. However, practical federated learning will involve a much larger number of users than we simulated in the experiments, and averaging the gradients from different users reduces the variance. Thus, HSQ may use a much larger $s$ (and hence a higher compression ratio) in practical federated learning scenarios than reported in our experiments. We compared Greedy-HSQ and Unbiased-HSQ in Figure 4(b), which shows that Greedy-HSQ outperforms Unbiased-HSQ. Although the gradient approximation of Greedy-HSQ is biased, its performance is better, possibly because its variance is smaller than that of Unbiased-HSQ.

We also tested different codebook generation methods, including the standard orthonormal basis (SOB), random rotations of the SOB (RR), random Gaussian, and K-means Gaussian, with $s = 32$. Random Gaussian generates Gaussian vectors and normalizes them to unit norm. K-means Gaussian generates a large number of Gaussian vectors, runs K-means with $k$ centers on them, and normalizes the centers to unit norm. Figure 5(a) shows that RR (RR-1 and RR-2 are two different random rotations), Gaussian and K-means Gaussian have almost the same performance, while SOB performs slightly worse. Figure 5(b) shows that using 4, 6 or 32 bits for pseudo-norm quantization provides almost the same performance, but using only 2 bits hurts the final test accuracy.
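A sketch of the "K-means Gaussian" codebook construction described above, using scikit-learn's KMeans; the number of sampled Gaussian vectors and the seed are illustrative assumptions.

  import numpy as np
  from sklearn.cluster import KMeans

  def kmeans_gaussian_codebook(s, k, n_samples=50000, seed=0):
      """Run k-means on random Gaussian vectors and normalize the k centers
      to unit norm; returns an s x k codebook."""
      rng = np.random.default_rng(seed)
      X = rng.normal(size=(n_samples, s))
      centers = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X).cluster_centers_
      return (centers / np.linalg.norm(centers, axis=1, keepdims=True)).T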

Additional experiments Due to the page limit, we report additional experimental results in Section 5 of the supplementary material. For the experiments in the main paper, the coordinator transmits uncompressed model updates. We show that the degradation in convergence accuracy is small (about 1%) when the coordinator also uses HSQ to compress the model updates. We also examined the influence of the number of codewords (i.e., $k$) with $s$ fixed at 32. The results show that increasing the number of codewords (using a larger $k$ and paying more communication cost) provides better performance.

6 Conclusions

We presented hyper-sphere quantization (HSQ), a general framework for gradient quantization that offers a range of trade-offs between communication efficiency and gradient accuracy via different configurations. HSQ achieves an extremely low per-iteration communication cost of $O(\log d)$, where $d$ is the size of the model, and is guaranteed to converge for both smooth convex and smooth non-convex cost functions. The low per-iteration cost of HSQ is appealing for federated learning as it lowers the communication threshold to join the training and encourages more users to participate. With HSQ, we demonstrate that vector quantization techniques can be effectively used for gradient compression. Given the rich literature on vector quantization and the fact that gradients are inherently high-dimensional vectors, the idea of HSQ may stimulate more research in this direction.

References

  • J. Acharya, C. De Sa, D. Foster, and K. Sridharan (2019) Distributed learning with sublinear communication. In International Conference on Machine Learning, pp. 40–50.
  • D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic (2017) QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720.
  • M. Assran, N. Loizou, N. Ballas, and M. Rabbat (2019) Stochastic gradient push for distributed deep learning. In International Conference on Machine Learning, pp. 344–353.
  • J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar (2018) signSGD: Compressed optimisation for non-convex problems. arXiv preprint arXiv:1802.04434.
  • S. Bubeck et al. (2015) Convex optimization: algorithms and complexity. Foundations and Trends in Machine Learning 8 (3–4), pp. 231–357.
  • C. Chen, J. Choi, D. Brand, A. Agrawal, W. Zhang, and K. Gopalakrishnan (2018) AdaComp: Adaptive residual gradient compression for data-parallel distributed training. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Y. Chen, T. Guan, and C. Wang (2010) Approximate nearest neighbor search by residual vector quantization. Sensors 10 (12), pp. 11259–11273.
  • T. Ge, K. He, Q. Ke, and J. Sun (2013) Optimized product quantization for approximate nearest neighbor search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2946–2953.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • H. Jegou, M. Douze, and C. Schmid (2011) Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1), pp. 117–128.
  • S. P. Karimireddy, Q. Rebjock, S. Stich, and M. Jaggi (2019) Error feedback fixes SignSGD and other gradient compression schemes. In International Conference on Machine Learning, pp. 3252–3261.
  • J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer.
  • M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su (2014) Scaling distributed machine learning with the parameter server. In OSDI, Vol. 14, pp. 583–598.
  • Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally (2018) Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations.
  • B. McMahan and D. Ramage (2017) Federated learning: Collaborative machine learning without centralized training data. Google Research Blog.
  • H. B. McMahan, E. Moore, D. Ramage, S. Hampson, et al. (2016) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
  • P. Patarasuk and X. Yuan (2009) Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing 69 (2), pp. 117–124.
  • H. Robbins and S. Monro (1951) A stochastic approximation method. Annals of Mathematical Statistics 22 (3), pp. 400–407.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • F. Sattler, S. Wiedemann, K. Müller, and W. Samek (2019) Robust and communication-efficient federated learning from non-iid data. arXiv preprint arXiv:1903.02891.
  • F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association.
  • R. Shokri and V. Shmatikov (2015) Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1310–1321.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • S. U. Stich, J. Cordonnier, and M. Jaggi (2018) Sparsified SGD with memory. In Advances in Neural Information Processing Systems, pp. 4447–4458.
  • A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan (2017) Distributed mean estimation with limited communication. In Proceedings of the 34th International Conference on Machine Learning, pp. 3329–3337.
  • H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu (2018) Communication compression for decentralized training. In Advances in Neural Information Processing Systems, pp. 7652–7662.
  • H. Tang, C. Yu, X. Lian, T. Zhang, and J. Liu (2019) DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning, pp. 6155–6165.
  • S. S. Vempala (2005) The random projection method. Vol. 65, American Mathematical Society.
  • H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright (2018a) ATOMO: Communication-efficient learning via atomic sparsification. In Advances in Neural Information Processing Systems, pp. 9850–9861.
  • J. Wang, T. Zhang, N. Sebe, H. T. Shen, et al. (2018b) A survey on learning to hash. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 769–790.
  • S. Wang, A. Pi, and X. Zhou (2019) Scalable distributed DL training: Batching communication and computation.
  • W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li (2017) TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, pp. 1509–1519.
  • X. Wu, R. Guo, A. T. Suresh, S. Kumar, D. N. Holtmann-Rice, D. Simcha, and F. Yu (2017) Multiscale quantization for fast similarity search. In Advances in Neural Information Processing Systems, pp. 5745–5755.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
  • M. Yu, Z. Lin, K. Narra, S. Li, Y. Li, N. S. Kim, A. Schwing, M. Annavaram, and S. Avestimehr (2018) GradiVeQ: Vector quantization for bandwidth-efficient gradient aggregation in distributed CNN training. In Advances in Neural Information Processing Systems, pp. 5129–5139.