1 Introduction
†This work was supported by US NSF through grants CAREER 1651492, CNS 1715947, and by the Keysight Early Career Professor Award.
Federated Learning (FL) is a distributed machine learning (ML) framework in which distributed machines, or users, collaboratively train an ML model with the help of a parameter server (PS). Typically, users compute gradients for a global model on their local data and send them to the PS, which aggregates them and updates the model in an iterative fashion. FL has gained recent attention because it allows natural parallelization and can be more efficient than centralized approaches in terms of storage. However, the communication overhead caused by exchanging gradients remains an issue that needs to be addressed.
Previous works alleviate the communication bottleneck by compressing gradients before transmission. Two commonly used gradient compression approaches are quantization and sparsification. Gradient quantization follows the idea of lossy compression: gradients are described using a small number of bits, and these low-precision gradients are transmitted to the PS. One extreme is to send just 1 bit of information per value [1]. A similar idea is used in signSGD [2] and TernGrad [3], which describe each value by its sign and by one of three levels, respectively. In gradient sparsification, some coordinates of the gradient vector are dropped based on certain criteria [4, 5], which can depend, for instance, on the variance and informativeness of the gradients. Other quantization/sparsification techniques include [6, 7, 8, 9, 10]. However, these stand-alone compression techniques are not tuned to the underlying communication channel over which the users and the PS exchange information, and may not utilize the channel resources to the fullest.
Another line of recent work studies FL over wireless channels and, more generally, multiple access channels (MACs). The superposition nature of wireless channels allows gradients to be aggregated “over-the-air”, which allows for much more efficient training. Recent works include [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. The approaches can be broadly categorized into digital and analog schemes, depending on how the gradients are transmitted over the channel. In analog schemes, the local gradients are scaled and directly transmitted over the wireless channel, allowing the PS to directly receive a noisy version of the aggregated gradient. In digital schemes, gradients from users are decoded individually, but transmission still occurs over a MAC. Although it has been shown that analog schemes can be superior to digital schemes in terms of bandwidth efficiency [11, 13], we argue that digital schemes have the following advantages: they are backward compatible, since they can be easily implemented on existing digital systems; they are less prone to slow users; they are more reliable, since various error control codes can be used; and they do not require the tight synchronization needed by analog transmission.
Main Contributions: Motivated by the above discussion, we consider FL over a MAC and focus on the design of digital gradient transmission schemes, where gradients at each user are first quantized and then transmitted over a MAC to be decoded individually at the PS. When designing digital FL schemes over MACs, we show that there are new opportunities to assign different amounts of resources (such as rate or bandwidth) to different users based on a) the informativeness of the gradients at each user, and b) the underlying channel conditions. We propose a stochastic gradient quantization scheme in which the quantization parameters are optimized based on the capacity region of the MAC. We show that such channel aware quantization for FL outperforms channel unaware quantization schemes (such as uniform allocation), particularly when users experience different channel conditions and when they have gradients with varying levels of informativeness.
2 System Model
We consider a distributed machine learning system with a parameter server (PS) and $K$ users, where the users are connected to the PS through a Gaussian MAC as shown in Fig. 1. Users want to collaboratively train a machine learning model $w \in \mathbb{R}^d$ with the help of the PS by minimizing an empirical loss function,
$$F(w) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|D_k|} \sum_{i=1}^{|D_k|} f(w; x_{k,i}), \tag{1}$$
where $D_k$ denotes the local data set at user $k$, $x_{k,i}$ is the $i$th data point in $D_k$, and $f$ is the loss function. The minimization is done using the gradient descent (GD) algorithm. Each user $k$ computes the local gradient $g_k(w_t)$ on its local data set $D_k$, where $w_t$ is the vector of model parameters at iteration $t$, and
$$g_k(w_t) = \frac{1}{|D_k|} \sum_{i=1}^{|D_k|} \nabla f(w_t; x_{k,i}). \tag{2}$$
At each iteration, each user sends a function $\phi_{k,t}(g_k(w_t))$ of its computed gradient back to the PS through $n$ channel uses of the MAC, where $\phi_{k,t}$ is some preprocessing function that the PS assigned to user $k$ at iteration $t$. We note that the capacity region of a Gaussian MAC can be described as follows [24],
$$\sum_{k \in S} R_k \le C_S, \quad \forall S \subseteq \{1, \ldots, K\}, \tag{3}$$
where $R_k$ denotes the transmission rate of user $k$ and $C_S$ denotes the sum capacity of the users in subset $S$. We assume an average transmit power constraint $P_k$ for user $k$, and in this case, $C_S = \frac{1}{2} \log_2\left(1 + \frac{\sum_{k \in S} P_k}{\sigma^2}\right)$, where $\sigma^2$ denotes the variance of the channel noise.
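The capacity-region condition in (3) can be checked by enumerating every non-empty subset of users. The following is a minimal sketch, assuming the real-valued Gaussian MAC sum capacity $\frac{1}{2}\log_2(1 + \sum_{k \in S} P_k / \sigma^2)$ from [24]; the function names and the numerical values are hypothetical and chosen only for illustration.

```python
import itertools
import math


def sum_capacity(powers, noise_var):
    """Sum capacity (bits per channel use) of a set of users on a real Gaussian MAC.

    Assumes the form 0.5 * log2(1 + total power / noise variance).
    """
    return 0.5 * math.log2(1.0 + sum(powers) / noise_var)


def in_capacity_region(rates, powers, noise_var, tol=1e-12):
    """Return True if every non-empty subset of users meets its sum-rate constraint."""
    users = range(len(rates))
    for size in range(1, len(rates) + 1):
        for subset in itertools.combinations(users, size):
            total_rate = sum(rates[k] for k in subset)
            cap = sum_capacity([powers[k] for k in subset], noise_var)
            if total_rate > cap + tol:
                return False
    return True


# Hypothetical two-user example: P1 = 3, P2 = 15, unit noise variance.
# Individual capacities are 1 and 2 bits; the sum capacity is about 2.12 bits.
powers = [3.0, 15.0]
print(in_capacity_region([1.0, 1.0], powers, 1.0))  # True
print(in_capacity_region([1.0, 2.0], powers, 1.0))  # False: the sum rate is exceeded
```

The second rate pair satisfies both individual constraints but violates the sum-rate constraint, which is why a rate allocation has to be checked against all subsets rather than per user.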
At iteration $t$, the received signal $y_t$ at the PS is a function of all $\phi_{k,t}(g_k(w_t))$. The goal of the PS is to recover the average of the local gradients, $g(w_t) = \frac{1}{K} \sum_{k=1}^{K} g_k(w_t)$, from $y_t$ using some postprocessing function $\psi_t$. However, due to the pre- and postprocessing and the capacity region of the MAC, the PS can only recover noisy versions $\hat{g}_k(w_t)$ of the local gradients and thus a noisy version $\hat{g}(w_t)$ of the average gradient. Therefore, the transmission from the users must ensure that the gradients received at the PS are unbiased estimators of $g(w_t)$ and have bounded variance, i.e.,
$$E[\hat{g}(w_t)] = g(w_t), \qquad E\left[\|\hat{g}(w_t) - g(w_t)\|^2\right] \le V_t, \tag{4}$$
where the variance bound $V_t$ should be as small as possible.
Problem Statement: When jointly transmitting over a MAC, it is critical to allocate resources efficiently to ensure that the gradient aggregation can be done in a timely manner and that the training error is low. Let $\{R_k\}_{k=1}^{K}$ be the set of rates allocated to the users for gradient transmission over the MAC. In this work, we want to understand how one should allocate these rates as a function of the capacity region of the MAC and the underlying informativeness of the gradients at different users. Furthermore, we want to characterize the resulting tradeoff between the underlying channel conditions of the MAC and the convergence rate of GD algorithms.
3 Main Results
In this section, we present our proposed stochastic gradient quantization scheme for GD, which is inspired by the schemes in [10, 25]. In this scheme, the PS asks users to quantize their local gradients before sending them, based on individual quantization budgets. The PS finds the quantization budgets by solving an optimization problem that aims to minimize the variance of the aggregated gradients while satisfying the transmission rate constraints imposed by the MAC. The distinction between our scheme and the scheme in [10] is that we allow each user to have its own quantization budget. We first present the proposed scheme for any number of users $K$, analyze its convergence rate, and present a general optimization problem for quantization budget allocation based on the capacity of the MAC. We then show an example with two users and solve for the optimal quantization budgets and communication rates.
3.1 Stochastic Multilevel Gradient Quantization
At each iteration $t$, each user $k$ computes the local gradient vector $g_k(w_t)$ using its local data set $D_k$. For simplicity of notation, we drop the iteration index $t$ in describing the quantization scheme. Each user computes the dynamic range of its local gradient, i.e., $\Delta_k = g_k^{\max} - g_k^{\min}$, where $g_k^{\max}$ and $g_k^{\min}$ are the maximum and minimum values of the local gradient vector at user $k$. The user then quantizes its local gradient vector using the stochastic multilevel quantization scheme described next. For every integer $r \in \{0, 1, \ldots, 2^{b_k} - 1\}$, we define
$$B_k(r) = g_k^{\min} + \frac{r \Delta_k}{2^{b_k} - 1}, \tag{5}$$
where $b_k$ is the quantization budget for user $k$. For each element $g_k(j)$ in the local gradient vector, if $g_k(j) \in [B_k(r), B_k(r+1))$, then $g_k(j)$ is quantized as follows,
$$\hat{g}_k(j) = \begin{cases} B_k(r+1) & \text{with probability } \dfrac{g_k(j) - B_k(r)}{B_k(r+1) - B_k(r)}, \\ B_k(r) & \text{otherwise.} \end{cases} \tag{6}$$
This operation is shown in Fig. 2. Once the entire gradient vector is quantized, user $k$ sends its quantized gradient vector $\hat{g}_k$ to the PS over the Gaussian MAC. We assume that before each iteration, each user describes the scalars $g_k^{\max}$ and $g_k^{\min}$ (which describe the dynamic range of the local gradient) at full resolution to the PS. In addition, as each element in the gradient vector is quantized to one of the $2^{b_k}$ levels, a total of $d\, b_k$ bits are required to describe the quantized gradient vector, where $d$ is the dimension of the gradient. The PS recovers all the quantized gradient vectors by performing optimal decoding over the MAC. Thus, for reliable decoding, the transmission rates of the users, i.e., $R_1, \ldots, R_K$, must be within the MAC capacity region.
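The quantization step described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: it assumes that a budget of `bits` bits per element corresponds to `2**bits` evenly spaced levels between the minimum and maximum gradient values, and the function name, the example vector, and the number of trials are hypothetical.

```python
import random


def stochastic_quantize(grad, bits, rng):
    """Quantize each entry of `grad` to one of 2**bits evenly spaced levels.

    Each entry is rounded to one of its two neighbouring levels at random,
    with probabilities chosen so that the quantized value is unbiased.
    """
    g_min, g_max = min(grad), max(grad)
    num_intervals = 2 ** bits - 1
    step = (g_max - g_min) / num_intervals
    if step == 0.0:  # constant vector: every entry already sits on a level
        return list(grad)
    quantized = []
    for x in grad:
        # Index of the level just below x (clamped so that g_max has an upper level).
        r = min(int((x - g_min) / step), num_intervals - 1)
        lower = g_min + r * step
        p_up = (x - lower) / step  # probability of rounding up
        quantized.append(lower + step if rng.random() < p_up else lower)
    return quantized


# Empirical check of unbiasedness: average many independent quantizations.
rng = random.Random(0)
grad = [-0.8, -0.1, 0.05, 0.3, 1.2]
trials = 20000
mean = [0.0] * len(grad)
for _ in range(trials):
    q = stochastic_quantize(grad, bits=2, rng=rng)
    mean = [m + v / trials for m, v in zip(mean, q)]
print([round(m, 2) for m in mean])  # should be close to the entries of grad
```

Averaging over many draws recovers the original vector up to sampling noise, which is the unbiasedness property the scheme relies on; a single draw only ever contains values from the four levels available at 2 bits.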
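The PS then aggregates the quantized gradients as
$$\hat{g}(w_t) = \frac{1}{K} \sum_{k=1}^{K} \hat{g}_k(w_t), \tag{7}$$
and updates the model using
$$w_{t+1} = w_t - \eta_t\, \hat{g}(w_t), \tag{8}$$
where $\eta_t$ is the learning rate. The updated model is then transmitted back to the users for subsequent iterations.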
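Suppose that in the $t$th iteration, the dynamic range of the gradient vector of user $k$ is $\Delta_{k,t}$, and the number of quantization levels used is $2^{b_{k,t}}$. Then, it can be readily checked that $\hat{g}_k(w_t)$ is an unbiased estimator of $g_k(w_t)$, i.e., $E[\hat{g}_k(w_t)] = g_k(w_t)$. The variance can be computed as $E\left[\|\hat{g}_k(w_t) - g_k(w_t)\|^2\right] = \sum_{j=1}^{d} \left(B_k(r_j+1) - g_k(j)\right)\left(g_k(j) - B_k(r_j)\right)$, where $B_k(r_j)$ and $B_k(r_j+1)$ are the quantization levels in (5) that bracket the $j$th element $g_k(j)$. Therefore, the variance of the quantized gradient vector at user $k$ in iteration $t$ can be bounded as
$$E\left[\|\hat{g}_k(w_t) - g_k(w_t)\|^2\right] \le \frac{d\, \Delta_{k,t}^2}{4 \left(2^{b_{k,t}} - 1\right)^2}. \tag{9}$$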
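The per-element bound behind (9) can be verified in a few lines. The following is a sketch under stated assumptions: a single element $x \in [a, a+s]$ is rounded up to $a+s$ with probability $p = (x-a)/s$ and down to $a$ otherwise, and the step size is $s = \Delta/(2^{b}-1)$ for dynamic range $\Delta$ and budget $b$. The symbols $a$, $s$, $p$, $u$ and $Q$ are introduced here for illustration only.

```latex
\begin{align*}
E[Q] &= p\,(a+s) + (1-p)\,a = a + s\,p = x, \\
E\big[(Q-x)^2\big] &= p\,(a+s-x)^2 + (1-p)\,(x-a)^2 \\
&= \frac{u}{s}\,(s-u)^2 + \frac{s-u}{s}\,u^2, \qquad u = x - a, \\
&= u\,(s-u) \;\le\; \frac{s^2}{4} \;=\; \frac{\Delta^2}{4\,(2^{b}-1)^2}.
\end{align*}
```

The inequality holds because $u(s-u)$ is maximized at $u = s/2$. Summing the per-element bound over all elements of the gradient vector multiplies it by the gradient dimension.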
We next present our first result which shows how the convergence of the above algorithm depends on the parameters of multilevel stochastic quantization at the users.
Theorem 1.
If the loss function is $\lambda$-strongly convex and $\mu$-smooth, with Lipschitz gradients, then by using a time-varying learning rate of $\eta_t = 1/(\lambda t)$, we have the following convergence result:
(10) 
The proof of this Theorem is presented in Appendix I.
From Theorem 1, we observe that the convergence rate depends directly on two factors: the dynamic range of the gradients computed by the users, and the quantization levels assigned to the users in each iteration. The traditional approach is to assign equal quantization levels to all users, i.e., the same budget for every user. However, the above expression shows that in order to maximize the rate of convergence, users whose gradients have a higher dynamic range must be assigned a higher quantization budget. On the other hand, if the users communicate with the PS in a communication-constrained setting, such as a MAC, then the quantization budget, which is directly related to the transmission rate, cannot exceed the constraints imposed by the capacity region of the MAC.
3.2 MAC Aware Gradient Quantization
Motivated by the above discussion, we propose MAC aware gradient quantization, which works as follows. In each iteration, users compute their local gradients and describe the maximum and minimum values of these gradients to the PS. Using these scalars, the PS computes the dynamic ranges of the gradients for all the users and performs the optimization described in Theorem 2. Subsequently, the PS assigns individual quantization budgets (transmission rates) to each user; the users then quantize their gradients and transmit them over the MAC. In the following theorem, we present the optimization problem from which we can determine the optimal quantization budgets that maximize the convergence rate.
Theorem 2.
At each iteration, the optimal quantization budgets that give the best convergence rate can be found by solving the following optimization problem,
(11)  
s.t.  (12)  
(13) 
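where $R_k$ denotes the transmission rate of user $k$ and $C_S$ denotes the sum capacity of the users in subset $S$, i.e., $C_S = \frac{1}{2} \log_2\left(1 + \frac{\sum_{k \in S} P_k}{\sigma^2}\right)$, where $\sigma^2$ denotes the variance of the channel noise.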
The above optimization problem falls into the category of constrained integer programming, since the quantization budgets take nonnegative integer values. In general, integer programming is an NP-hard problem [26]. However, one could obtain suboptimal solutions by relaxing the integer constraint on the budgets. For instance, by allowing the budgets to be real numbers greater than or equal to 1 (so that each user gets at least 1 bit), it is easy to verify that the above problem becomes a convex optimization problem. One could then either use convex solvers or solve the convex problem analytically by checking the KKT conditions, and round the results. We next show an example for two users, and solve the convex relaxation analytically to gain insight into how the dynamic ranges of the gradients and the capacity region of the MAC impact the resulting quantization budgets.
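The relaxation described above can also be explored numerically. The sketch below is not the paper's solver; it rests on two assumptions that should be read as hypothetical. First, the objective is taken to be the sum of $\Delta_k^2 / (2^{b_k} - 1)^2$ over users, in line with the variance bound for multilevel quantization. Second, the MAC capacity region is assumed to translate into three upper limits `cap1`, `cap2` and `cap_sum` on $b_1$, $b_2$ and $b_1 + b_2$ (in bits per gradient element). With the sum constraint met with equality, the two-user problem reduces to a one-dimensional search.

```python
def variance_proxy(budgets, ranges):
    """Assumed objective: sum over users of Delta_k^2 / (2^b_k - 1)^2."""
    return sum(d ** 2 / (2.0 ** b - 1.0) ** 2 for b, d in zip(budgets, ranges))


def relaxed_two_user_allocation(ranges, cap1, cap2, cap_sum, grid=10001):
    """Grid-search real-valued budgets (b1, b2) on the line b1 + b2 = cap_sum.

    cap1, cap2 and cap_sum are hypothetical limits on b1, b2 and b1 + b2
    implied by the MAC capacity region; each user keeps at least 1 bit.
    """
    lo = max(1.0, cap_sum - cap2)  # smallest b1 that keeps b2 <= cap2
    hi = min(cap1, cap_sum - 1.0)  # largest b1 that keeps b2 >= 1
    if lo > hi:
        raise ValueError("no feasible point with at least 1 bit per user")
    best = None
    for i in range(grid):
        b1 = lo + (hi - lo) * i / (grid - 1)
        b2 = cap_sum - b1
        value = variance_proxy((b1, b2), ranges)
        if best is None or value < best[0]:
            best = (value, b1, b2)
    return best[1], best[2]


# Hypothetical example: user 1's gradient has four times the dynamic range of user 2's.
b1, b2 = relaxed_two_user_allocation((4.0, 1.0), cap1=6.0, cap2=6.0, cap_sum=8.0)
print(round(b1, 2), round(b2, 2))
print(int(b1), int(b2))  # rounding down keeps the integer budgets feasible
```

A grid search is used here only to keep the sketch dependency-free; since the relaxed objective is convex, a bisection on its derivative or an off-the-shelf convex solver would serve equally well.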
3.3 Solution for the Relaxed Optimization Problem with Two Users
For two users, the relaxed optimization problem (with real-valued budgets of at least 1 bit per user) is given as follows:
(14)  
s.t. 
The three constraints on rates can be rearranged as follows:
(15) 
where and . As mentioned earlier, the objective function being minimized is a convex function when and are both greater or equal to . The user case can be solved analytically by first forming the following Lagrangian function,
(16) 
We note that to fully utilize the channel, the sum-rate constraint should be satisfied with equality. By taking the partial derivatives of the Lagrangian with respect to the two quantization budgets and checking the KKT conditions, we obtain,
(17) 
Using (17) and the sum-rate constraint, we can solve for the optimal quantization budgets.
Theorem 3.
For a user Gaussian MAC, the optimal quantization budgets and for can be found by solving
(18) 
and subsequently , where and are dynamic ranges of gradients at users and .
We solve and numerically with the following parameters: we let , , so that the individual and sum capacities for this setting are and . These lead to and . We fix and vary from to to understand the impact of the ratio of dynamic range on the quantization budgets. It can be seen in Fig. 3 and Table 1 that by using proposed MAC aware scheme, the PS allocates more rate towards the user whose gradients are more informative (higher dynamic range). For instance, when , gradients from both users are equally informative, and both users are assigned equal quantization budgets . On one extreme, when , gradients from user are considered more useful than user , the optimal allocation is , . On the other extreme, if , gradients from user are more informative, hence we see that , and .
4 Experiments
To show the performance of our proposed scheme, we consider the MNIST image classification task using single-layer neural networks trained on training and testing samples with two users, and a cross-entropy loss function. The dimensionality of the classifier model is . We assume that user ’s data set consists of images belonging to digits ’0’ and ’1’, whereas the data set of user consists of all the digits. The channel noise variance is set as , and the total transmit power per iteration is set as . We use the MAC for channel uses for each iteration.
In Fig. 4, we let and , and compare the proposed MAC aware gradient quantization scheme with the following schemes: uniform rate allocation subject to MAC capacity constraints; a recently proposed digital scheme in [11]; SignSGD, which uses 1-bit quantization per dimension for each user [2]; and TernGrad [3], which uses three levels to quantize each dimension of the gradient. We also plot the non-quantized full resolution scheme as a baseline. In the digital scheme proposed in [11], all but the highest and lowest gradient values are set to zero. The remaining gradient values are then split into two groups depending on their signs. The mean of the elements in each group is computed, denoted by and . If (), all remaining positive (negative) values will be set to (). Each user then transmits the location of the nonzero values and a scalar (using bits) to describe the average value at each iteration. Therefore, the communication cost is . This scheme [11] is fundamentally different from the one proposed in this paper and, moreover, its quantization budget is the same for all users. As shown in Fig. 4, the proposed MAC aware multilevel scheme outperforms the uniform multilevel scheme, the scheme in [11], SignSGD and TernGrad. This is due to the fact that the number of quantization levels grows exponentially as the quantization budget increases. In addition, under uniform allocation the rates are limited by the user with the worst channel. Therefore, as it reaches the capacity of the user with the worst channel, is still small compared to . Other schemes such as SignSGD and TernGrad suffer from underutilization of channel resources, as they use a fixed quantization budget (1 bit for SignSGD and three levels for TernGrad per gradient dimension).
We also show the testing accuracy of each scheme at the end of training (see Table 2). These results are consistent with Fig. 4, where our proposed scheme is the closest to full resolution.
For Fig. 5, we set , and , and vary to see the impact of increasing power, and thus, a larger capacity region. It can be seen in Fig. 5 that the performance improves monotonically with the increase in total power. The testing accuracy at the end of iterations is shown in Table 3 as a function of the total power.
5 Conclusions
In this paper, we considered the problem of MAC aware gradient quantization for federated learning. We showed that when designing digital FL schemes over MACs, there are new opportunities to assign different amounts of resources (such as quantization rates) to different users based on a) the informativeness of the gradients at each user, captured by their dynamic range, and b) the underlying channel conditions. We studied and analyzed a channel aware quantization scheme and showed that it outperforms uniform quantization and other existing digital schemes. An interesting future direction is to explore whether other quantization schemes (for instance, the scheme in [11], or the gradient sparsification schemes in [4, 5]) can be optimized (with limited interaction with the PS) as a function of the underlying communication channel, such as a MAC.
Appendix I: Proof of Theorem 1
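Standard convergence results in [27] have shown that for a loss function $F$ that is $\lambda$-strongly convex and $\mu$-smooth w.r.t. the optimum $w^*$, SGD with stochastic unbiased gradients that have bounded second-order moments, i.e., $E\left[\|\hat{g}(w_t)\|^2\right] \le G^2$, and a learning rate of $\eta_t = 1/(\lambda t)$ can achieve the convergence result:
$$E\left[F(w_T) - F(w^*)\right] \le \frac{2 \mu G^2}{\lambda^2 T}. \tag{19}$$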
There are two distinctions between our bound and (19). First, the randomness in our scheme comes from quantizing the gradients instead of randomly selecting data points. Second, as users can have different quantization budgets per iteration, the resulting variance is iteration dependent, i.e., . By slightly modifying the proof in [27], it is possible to prove the following convergence result (proof omitted due to space):
(20) 
Theorem 1 now follows directly by plugging in the values of , which can be computed as:
(21) 
where (a) follows from (9) and Lipschitz assumption, i.e., .
References

[1] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, “1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs,” in Interspeech 2014, September 2014.
[2] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, “signSGD: Compressed optimisation for non-convex problems,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80, 10–15 Jul 2018, pp. 560–569.
[3] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, “TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning,” in Advances in Neural Information Processing Systems 30, 2017, pp. 1509–1519.
[4] A. F. Aji and K. Heafield, “Sparse communication for distributed gradient descent,” CoRR, vol. abs/1704.05021, 2017. [Online]. Available: http://arxiv.org/abs/1704.05021
[5] J. Wangni, J. Wang, J. Liu, and T. Zhang, “Gradient sparsification for communication-efficient distributed optimization,” in Advances in Neural Information Processing Systems 31, 2018, pp. 1299–1309.
[6] N. Dryden, S. A. Jacobs, T. Moon, and B. Van Essen, “Communication quantization for data-parallel training of deep neural networks,” in Proceedings of the Workshop on Machine Learning in High Performance Computing Environments, ser. MLHPC ’16. Piscataway, NJ, USA: IEEE Press, 2016, pp. 1–8. [Online]. Available: https://doi.org/10.1109/MLHPC.2016.4
[7] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth for distributed training,” CoRR, vol. abs/1712.01887, 2017. [Online]. Available: http://arxiv.org/abs/1712.01887
[8] F. Sattler, S. Wiedemann, K. Müller, and W. Samek, “Sparse binary compression: Towards distributed deep learning with minimal communication,” CoRR, vol. abs/1805.08768, 2018. [Online]. Available: http://arxiv.org/abs/1805.08768
[9] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding,” in Advances in Neural Information Processing Systems 30, 2017, pp. 1709–1720.
[10] A. T. Suresh, F. X. Yu, S. Kumar, and H. B. McMahan, “Distributed mean estimation with limited communication,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 3329–3337.
[11] M. M. Amiri and D. Gündüz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” CoRR, vol. abs/1901.00844, 2019. [Online]. Available: http://arxiv.org/abs/1901.00844
[12] M. M. Amiri and D. Gündüz, “Over-the-air machine learning at the wireless edge,” in 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), July 2019, pp. 1–5.
[13] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” CoRR, vol. abs/1907.09769, 2019. [Online]. Available: http://arxiv.org/abs/1907.09769
[14] M. M. Amiri, T. M. Duman, and D. Gündüz, “Collaborative machine learning at the wireless edge with blind transmitters,” CoRR, vol. abs/1907.03909, 2019. [Online]. Available: http://arxiv.org/abs/1907.03909
[15] M. S. H. Abad, E. Ozfatura, D. Gündüz, and O. Ercetin, “Hierarchical federated learning across heterogeneous cellular networks,” CoRR, vol. abs/1909.02362, 2019. [Online]. Available: http://arxiv.org/abs/1909.02362
[16] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, “A joint learning and communications framework for federated learning over wireless networks,” CoRR, vol. abs/1909.07972, 2019. [Online]. Available: http://arxiv.org/abs/1909.07972
[17] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” CoRR, vol. abs/1812.11750, 2018. [Online]. Available: http://arxiv.org/abs/1812.11750
[18] Q. Zeng, Y. Du, K. K. Leung, and K. Huang, “Energy-efficient radio resource allocation for federated edge learning,” CoRR, vol. abs/1907.06040, 2019. [Online]. Available: http://arxiv.org/abs/1907.06040
[19] G. Zhu, Y. Wang, and K. Huang, “Broadband analog aggregation for low-latency federated edge learning,” IEEE Transactions on Wireless Communications, vol. 19, no. 1, pp. 491–506, Jan. 2020.
[20] Y. Sun, S. Zhou, and D. Gündüz, “Energy-aware analog aggregation for federated learning with redundant data,” CoRR, vol. abs/1911.00188, 2019. [Online]. Available: http://arxiv.org/abs/1911.00188
[21] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “When edge meets learning: Adaptive control for resource-constrained distributed machine learning,” CoRR, vol. abs/1804.05271, 2018. [Online]. Available: http://arxiv.org/abs/1804.05271
[22] T. Sery and K. Cohen, “On analog gradient descent learning over multiple access fading channels,” CoRR, vol. abs/1908.07463, 2019. [Online]. Available: http://arxiv.org/abs/1908.07463
[23] T. Sery and K. Cohen, “A sequential gradient-based multiple access for distributed learning over fading channels,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sep. 2019, pp. 303–307.
[24] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, 2006.
[25] N. Agarwal, A. T. Suresh, F. Yu, S. Kumar, and H. B. McMahan, “cpSGD: Communication-efficient and differentially-private distributed SGD,” CoRR, vol. abs/1805.10559, 2018. [Online]. Available: http://arxiv.org/abs/1805.10559
[26] A. Schrijver, Theory of linear and integer programming. John Wiley & Sons, 1998.
[27] A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” CoRR, vol. abs/1109.5647, 2012. [Online]. Available: http://arxiv.org/abs/1109.5647