1 Introduction
† This work was supported by US NSF through grants CAREER 1651492, CNS 1715947, and by the Keysight Early Career Professor Award.

Federated learning (FL) [1] is a framework that enables multiple users to jointly train a learning model. In prototypical FL, a central server interacts with multiple users to train an ML model iteratively as follows: users compute gradients of the ML model on their local datasets, and these gradients are subsequently exchanged for model updates. There are several motivating factors behind the surging popularity of FL: a) centralized approaches can be inefficient in terms of storage/computation, whereas FL provides natural parallelization for training and can leverage the increasing computational power of devices; and b) local data at each user is never shared; only gradient computations from each user are collected. However, even though local data is never shared by a user in FL, exchanging gradients in raw form can still leak information, as shown in recent works [2, 3, 4].
Motivated by these factors, there has been a recent surge in designing FL algorithms with rigorous privacy guarantees. Differential privacy (DP) [5] has been adopted as the de facto standard notion for private data analysis and aggregation. Within the context of FL, the notion of local differential privacy (LDP) is more suitable, in which a user locally perturbs and discloses the data to an untrusted data curator/aggregator [6]. LDP has already been adopted in deployed applications, including Google's RAPPOR [7] for website browsing history aggregation, and by Microsoft for privately collecting telemetry data [8]. In the literature, there have been several research efforts to design FL algorithms satisfying LDP [9, 10, 11, 12, 13, 14, 15]. While LDP provides stronger privacy guarantees (compared to a centralized solution), this comes at the cost of lower utility. In particular, to achieve the same level of privacy attained by a centralized solution, a significantly higher amount of noise/perturbation is needed [16, 17, 18, 19, 20].
Another parallel recent trend is to study the feasibility of FL over wireless channels. As the prototypical computation for FL training involves gradient aggregation from multiple users, the superposition property of the wireless channel can naturally support this operation much more efficiently. This has led to several recent works [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31] under the umbrella of FL at the wireless edge, where distributed users interact with a parameter server (PS) over a shared wireless medium for training ML models. Several methodologies have been proposed to study wireless FL, which can be broadly categorized into either digital or analog aggregation schemes. In digital schemes, quantized gradients from each user are individually transmitted to the PS using orthogonal transmission. For analog schemes, on the other hand, the gradient computations are rescaled and transmitted directly over the air by all users simultaneously. The superposition nature of the wireless medium makes analog schemes more bandwidth efficient compared to digital ones.
In this paper, we focus on the following question: Can the superposition property of the wireless medium also be beneficial for privacy? If yes, how can we optimally utilize the wireless resources, and what are the tradeoffs between convergence of FL training, wireless resources, and privacy?
Main Contributions: In this paper, we consider the problem of FL training over a flat-fading Gaussian multiple access channel (MAC), subject to LDP constraints. We propose and study analog aggregation schemes, in which each user transmits a linear combination of a) local gradients and b) artificial Gaussian noise, subject to power constraints. The local gradients are processed as a function of the channel gains to align the resulting gradients at the PS, whereas the artificial noise parameters are selected to satisfy the privacy constraints. We show that the privacy level per user scales as $\mathcal{O}(1/\sqrt{K})$, where $K$ is the number of users,
compared to orthogonal transmission in which the privacy leakage scales as a constant. We also provide the privacy-convergence tradeoffs for smooth and convex loss functions through convergence analysis of the distributed gradient descent algorithm. We show that the training error decreases as the number of users increases, and converges to that of the centralized algorithm in which all data points are available at the PS. To the best of our knowledge, this is the first result on wireless FL with LDP constraints.
2 System Model & Problem Statement
Wireless Channel Model: We consider a single-antenna wireless FL system with $K$ users and a central PS as shown in Fig. 1. The input-output relationship at time $t$ is
$y(t) = \sum_{k=1}^{K} h_k x_k(t) + m(t),$  (1)
where $x_k(t)$ is the signal transmitted by user $k$ at time $t$, and $y(t)$ is the received signal at the PS. Here, $h_k$ is the complex-valued channel coefficient between the $k$th user and the PS, and $m(t)$ is independent additive zero-mean unit-variance Gaussian noise (AWGN). The channel coefficients are assumed to be time invariant, and each user $k$ transmits subject to a maximum power constraint of $P_k$. Each user is assumed to know its local channel gain, whereas we assume that the PS has global channel state information.

Federated Learning Problem: Each user $k$ has a private local dataset of $|\mathcal{D}_k|$ data points, denoted as $\mathcal{D}_k = \{(\mathbf{u}_{k,i}, v_{k,i})\}_{i=1}^{|\mathcal{D}_k|}$, where $\mathbf{u}_{k,i}$ is the $i$th data point and $v_{k,i}$ is the corresponding label at user $k$. Users communicate with the PS through the Gaussian MAC described above in order to train a model $\mathbf{w}$ by minimizing the loss function $F(\mathbf{w})$, i.e.,

$\mathbf{w}^* = \arg\min_{\mathbf{w} \in \mathbb{R}^d} \sum_{k=1}^{K} \frac{|\mathcal{D}_k|}{|\mathcal{D}|} F_k(\mathbf{w}),$
where $\mathbf{w} \in \mathbb{R}^d$ is the parameter vector to be optimized, $F_k(\mathbf{w})$ is the loss function for user $k$, and $\mathcal{D} = \cup_{k=1}^{K} \mathcal{D}_k$ denotes the entire dataset used for training. The minimization of $F(\mathbf{w})$ is carried out iteratively through a distributed gradient descent (GD) algorithm. More specifically, in the $t$th training iteration, the PS broadcasts the global parameter vector $\mathbf{w}_t$ from the last iteration to all users. Each user $k$ computes its local gradient over its local data points, i.e., $g_k(\mathbf{w}_t) = \nabla F_k(\mathbf{w}_t)$, and sends the computed gradient back to the PS. For the scope of this paper, we assume that $|\mathcal{D}_k| = |\mathcal{D}|/K$, and therefore $F(\mathbf{w}) = \frac{1}{K}\sum_{k=1}^{K} F_k(\mathbf{w})$. The global parameter is updated according to

$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \cdot \frac{1}{K}\sum_{k=1}^{K} g_k(\mathbf{w}_t),$  (2)

where $\eta_t$ is the learning rate of the distributed GD algorithm at iteration $t$. The iterative process continues until convergence.
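To make the update rule (2) concrete, the following is a minimal sketch of distributed GD with exact (noiseless) gradient aggregation on a toy objective of our choosing; the per-user quadratic losses and the learning-rate schedule are illustrative assumptions, not the paper's setup:

```python
import random

# Toy illustration (assumed setup): K users, user k holds a scalar
# quadratic loss F_k(w) = 0.5 * (w - a_k)^2, so g_k(w) = w - a_k and
# the global minimizer w* is the mean of the a_k's.
random.seed(0)
K = 10
a = [random.uniform(-1.0, 1.0) for _ in range(K)]

w = 0.0
for t in range(1, 201):
    eta = 1.0 / t                       # decaying learning rate eta_t
    grads = [w - a_k for a_k in a]      # local gradients g_k(w_t)
    g_avg = sum(grads) / K              # PS aggregates (1/K) * sum_k g_k
    w -= eta * g_avg                    # global update, Eq. (2)

w_star = sum(a) / K                     # centralized minimizer for comparison
```

The iterate converges to the same minimizer a centralized solver would find, which is the baseline against which the private, over-the-air variant is later compared.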
In addition, the gradient descent (GD) algorithm for wireless FL should also satisfy local differential privacy (LDP) constraints for each user, as defined next.
Definition 1.
(LDP [32]) A randomized mechanism $M: \mathcal{X} \to \mathcal{Y}$ is $(\epsilon, \delta)$-LDP if for any pair of inputs $x, x' \in \mathcal{X}$ and any measurable subset $\mathcal{S} \subseteq \mathcal{Y}$, we have

$\Pr[M(x) \in \mathcal{S}] \le e^{\epsilon} \Pr[M(x') \in \mathcal{S}] + \delta.$  (3)

The case of $\delta = 0$ is called pure $\epsilon$-LDP.
Problem Statement. The main goal of this paper is to explore the benefits of wireless gradient aggregation for privacy in FL. In addition, we investigate tradeoffs between the convergence rate of GD, wireless channel conditions and resources (such as power, SNR), subject to the privacy budgets of the users.
3 Main Results & Discussions
In this Section, we present a general gradient aggregation scheme for wireless FL, where each user transmits a linear combination of its local gradients and artificial noise. We then specialize this scheme so that the part of the transmission containing the gradients is designed to be aligned at the PS. We analyze this scheme and obtain the privacy leakage under LDP for each user as a function of the wireless channel conditions and the transmission parameters. Finally, we present the convergence rate of the private FL algorithm, and maximize it by optimizing the local perturbations of each user.
3.1 FL Transmission Scheme over Gaussian MAC
The overall FL scheme consists of $T$ training iterations, where each iteration comprises $d$ uses of the wireless channel described in (1), one per gradient dimension. At each iteration $t$, each user $k$ transmits the computed gradient vector $g_k(\mathbf{w}_t)$ together with additive Gaussian noise for privacy. In particular, the transmitted signal of user $k$ at iteration $t$ is given as:
$x_k^{(t)} = \frac{h_k^*}{|h_k|}\left[\sqrt{\alpha_k P_k}\, \frac{g_k(\mathbf{w}_t)}{L} + \sqrt{\beta_k P_k}\, n_k^{(t)}\right].$  (4)
Here, each user performs local phase correction (i.e., the input is multiplied by $h_k^*/|h_k|$) so that the effective channel coefficient is non-negative, i.e., equal to $|h_k|$. We assume that the gradient vectors have a bounded norm, i.e., $\|g_k(\mathbf{w}_t)\| \le L$, and normalize the gradient vector by $L$. Here, $\alpha_k$ denotes the fraction of power dedicated to the gradient vector $g_k(\mathbf{w}_t)$, whereas $\beta_k$ is the fraction of power dedicated to the artificial Gaussian noise $n_k^{(t)}$, whose elements are i.i.d. and drawn from $\mathcal{N}(0, 1/d)$. These parameters satisfy $\alpha_k + \beta_k \le 1$ so that the maximum power constraint of $P_k$ is satisfied. From (1) and (4), the received signal at the PS can be written as:
$y^{(t)} = \sum_{k=1}^{K} \frac{|h_k|\sqrt{\alpha_k P_k}}{L}\, g_k(\mathbf{w}_t) + \sum_{k=1}^{K} |h_k|\sqrt{\beta_k P_k}\, n_k^{(t)} + m^{(t)},$  (5)
where $m^{(t)}$ is the independent channel noise, whose elements are i.i.d. drawn from $\mathcal{N}(0, 1)$. In order to carry out the summation of the local gradients over-the-air, and to receive an unbiased estimate of the true aggregated gradient, all users pick the coefficients $\alpha_k$'s in order to align their transmitted local gradient estimates. Specifically, user $k$ picks $\alpha_k$ so that

$\frac{|h_k|\sqrt{\alpha_k P_k}}{L} = c, \quad \forall k,$  (6)
where $c$ is a constant. From (6), we obtain $\alpha_k = c^2 L^2/(|h_k|^2 P_k)$, and using the fact that $\alpha_k \le 1$ for all $k$, we can upper bound the constant as follows: $c \le \min_k |h_k|\sqrt{P_k}/L$. To maximize the signal power of the aligned gradient, we choose $c$ to match this upper bound, i.e.,
$c = \min_{k} \frac{|h_k|\sqrt{P_k}}{L}.$  (7)
Plugging this back into (6), we obtain the choice of $\alpha_k$ as
$\alpha_k = \frac{\min_j |h_j|^2 P_j}{|h_k|^2 P_k}.$  (8)
The above choice shows that the alignment of gradients is effectively limited by the user with the worst effective SNR, i.e., $\min_k |h_k|^2 P_k$. For the alignment scheme described above, the received signal at the PS in iteration $t$ in (5) simplifies to:
$y^{(t)} = c \sum_{k=1}^{K} g_k(\mathbf{w}_t) + \sum_{k=1}^{K} |h_k|\sqrt{\beta_k P_k}\, n_k^{(t)} + m^{(t)}.$  (9)
The PS subsequently performs post-processing on $y^{(t)}$ as follows:
$\hat{g}(\mathbf{w}_t) = \frac{y^{(t)}}{cK} = \frac{1}{K}\sum_{k=1}^{K} g_k(\mathbf{w}_t) + \tilde{z}^{(t)},$  (10)
where $\tilde{z}^{(t)} = \frac{1}{cK}\big(\sum_{k=1}^{K} |h_k|\sqrt{\beta_k P_k}\, n_k^{(t)} + m^{(t)}\big)$ is the effective noise at the PS. As $\tilde{z}^{(t)}$ is zero mean, $\hat{g}(\mathbf{w}_t)$ is an unbiased estimate of the true average gradient $\frac{1}{K}\sum_{k} g_k(\mathbf{w}_t)$, with the per-element variance of $\tilde{z}^{(t)}$ being equal to $\frac{1}{c^2K^2}\big(\frac{1}{d}\sum_{k=1}^{K} |h_k|^2 \beta_k P_k + 1\big)$.
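The unbiasedness of the post-processed estimate in (10) can be checked numerically. The sketch below uses arbitrary made-up values for the (scalar) gradients, the alignment constant $c$, and the effective noise level, and averages the PS estimate over many channel realizations:

```python
import random

random.seed(1)
K, trials = 5, 20000
g = [0.3, -0.1, 0.5, 0.2, -0.4]   # true local gradients (scalars for simplicity)
c = 0.7                            # alignment constant from (7), assumed value
sigma_eff = 0.5                    # std of artificial + channel noise, assumed

est = 0.0
for _ in range(trials):
    y = c * sum(g) + random.gauss(0.0, sigma_eff)   # received signal, Eq. (9)
    est += y / (c * K)                              # post-processing, Eq. (10)
est /= trials

true_avg = sum(g) / K
```

Averaged over realizations, the post-processed signal recovers the true average gradient up to Monte Carlo error, which is exactly the unbiasedness property used in the convergence analysis.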
3.2 Local Differential Privacy Analysis
We next analyze the privacy level achieved by the transmission scheme for each user, as per the definition of LDP. Recall that the local perturbation noise is drawn from a Gaussian distribution. This well-known technique, the Gaussian mechanism, provides rigorous privacy guarantees under LDP, as defined next.
Definition 2.
(Gaussian Mechanism, Appendix A of [32]) Suppose a user wants to release a function $f(x)$ of an input $x$ subject to $(\epsilon, \delta)$-LDP. The Gaussian release mechanism is defined as:
$M(x) = f(x) + \mathcal{N}(0, \sigma^2 \mathbf{I}_d).$  (11)
If the sensitivity of the function $f$ is bounded by $\Delta$, i.e., $\|f(x) - f(x')\| \le \Delta$, $\forall x, x'$, then for any $\delta \in (0, 1]$, the Gaussian mechanism satisfies $(\epsilon, \delta)$-LDP, where
$\epsilon = \frac{\Delta}{\sigma}\sqrt{2\log\frac{1.25}{\delta}}.$  (12)
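As a quick sanity check on (12), the sketch below evaluates the Gaussian-mechanism privacy level and verifies that doubling the noise standard deviation halves $\epsilon$ (the parameter values are arbitrary):

```python
import math

def gaussian_mechanism_eps(sensitivity, sigma, delta):
    # epsilon = (Delta / sigma) * sqrt(2 * log(1.25 / delta)), Eq. (12)
    return (sensitivity / sigma) * math.sqrt(2.0 * math.log(1.25 / delta))

eps1 = gaussian_mechanism_eps(1.0, 1.0, 1e-5)   # baseline noise level
eps2 = gaussian_mechanism_eps(1.0, 2.0, 1e-5)   # doubled noise std
```

The inverse proportionality between $\epsilon$ and $\sigma$ is what the aggregation scheme exploits: the effective noise at the PS grows with the number of users, driving each user's $\epsilon$ down.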
In the next Theorem, we make use of the above result and present the per-user privacy achieved by the proposed wireless FL scheme as a function of the noise power allocation parameters $\beta_k$, the transmit powers $P_k$, and the channel coefficients $h_k$.
Theorem 1.
For each user $k$, the proposed transmission scheme achieves $(\epsilon_k, \delta)$-LDP per iteration, where
$\epsilon_k = \frac{2\min_j |h_j|\sqrt{P_j}}{\sqrt{\frac{1}{d}\sum_{j=1}^{K} |h_j|^2 \beta_j P_j + 1}}\sqrt{2\log\frac{1.25}{\delta}}.$  (13)
Proof.
The final received signal at the PS from (9) can be expressed as: $y^{(t)} = c\sum_{k} g_k(\mathbf{w}_t) + \sum_{k} |h_k|\sqrt{\beta_k P_k}\, n_k^{(t)} + m^{(t)}$. We first observe that the per-element variance of the effective Gaussian noise, i.e., of $\sum_{k} |h_k|\sqrt{\beta_k P_k}\, n_k^{(t)} + m^{(t)}$, is $\sigma_{\text{eff}}^2 = \frac{1}{d}\sum_{k} |h_k|^2 \beta_k P_k + 1$. In order to invoke the result of the Gaussian mechanism, we next obtain a bound on the sensitivity for user $k$. To bound the local sensitivity of user $k$, consider any two different local datasets $\mathcal{D}_k$ and $\mathcal{D}'_k$ at user $k$, while fixing the datasets (and thus the gradients) of the remaining users. The local sensitivity of user $k$ can then be bounded as
$\Delta_k = \max_{\mathcal{D}_k, \mathcal{D}'_k} \big\| c\, g_k(\mathbf{w}_t) - c\, g'_k(\mathbf{w}_t) \big\| \overset{(a)}{\le} 2cL \overset{(b)}{=} 2\min_j |h_j|\sqrt{P_j},$  (14)
where in step (a) we used the fact that $\|g_k(\mathbf{w}_t)\| \le L$, and (b) follows from (7). Hence, using the sensitivity bound in (14) together with the variance expression in (12), we arrive at the proof of Theorem 1.
∎
Remark 1.
From Theorem 1, we can observe the privacy benefits of wireless gradient aggregation. We can further upper bound the achievable $\epsilon_k$ in Theorem 1 as follows:

$\epsilon_k \le \frac{2\min_j |h_j|\sqrt{P_j}}{\sqrt{\frac{1}{d}\sum_{j=1}^{K} |h_j|^2 \beta_j P_j}}\sqrt{2\log\frac{1.25}{\delta}},$

which shows that, asymptotically, the per-user privacy level behaves like $\mathcal{O}(1/\sqrt{K})$. In contrast, the privacy achieved by orthogonal transmission can be shown to be:
$\epsilon_k^{\text{orth}} = \frac{2|h_k|\sqrt{\alpha_k P_k}}{\sqrt{\frac{1}{d}|h_k|^2 \beta_k P_k + 1}}\sqrt{2\log\frac{1.25}{\delta}},$  (15)
which scales as a constant and does not decay with $K$.
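The contrast between the two scalings can be illustrated numerically. Assuming a symmetric setup where every user has the same received SNR $|h_k|^2 P_k$ and the same noise fraction $\beta$ (illustrative values only, chosen by us), the per-user $\epsilon$ of the aggregated scheme decays roughly as $1/\sqrt{K}$:

```python
import math

def eps_wireless(K, snr=10.0, beta=0.5, d=10, delta=1e-5):
    # Theorem 1 specialized to a symmetric setup (assumed, for illustration):
    # sensitivity 2 * min_j |h_j| sqrt(P_j) stays fixed, while the
    # artificial noise aggregates over all K users at the PS.
    sensitivity = 2.0 * math.sqrt(snr)
    noise_std = math.sqrt(K * snr * beta / d + 1.0)
    return (sensitivity / noise_std) * math.sqrt(2.0 * math.log(1.25 / delta))
```

Because the numerator is fixed by the worst-channel user while the denominator grows with the total injected noise, quadrupling $K$ roughly halves the per-user leakage; under orthogonal transmission only the user's own noise appears in the denominator, so no such decay occurs.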
Remark 2.
While Theorem 1 shows the per-iteration leakage, we can use advanced composition results for LDP with the Gaussian mechanism to obtain the total privacy leakage when the wireless FL algorithm is run for $T$ iterations. Using existing results in [33], it can be readily shown that the total leakage over $T$ iterations (per-user) of the proposed scheme is $(\bar{\epsilon}_k, \bar{\delta})$-LDP, where

$\bar{\epsilon}_k = \sqrt{2T\log(1/\delta')}\,\epsilon_k + T\epsilon_k\big(e^{\epsilon_k} - 1\big), \qquad \bar{\delta} = T\delta + \delta'.$  (16)
We illustrate the total per-user privacy leakage as a function of $K$, the number of users, in Fig. 2 for various parameter settings. As is clearly evident, the leakage provided by wireless FL goes asymptotically to zero as $K \to \infty$.
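For reference, the advanced composition bound of [33] used in (16) can be evaluated as follows; for small per-iteration $\epsilon$, the total leakage grows like $\sqrt{T}$ rather than $T$ (the numbers below are illustrative):

```python
import math

def advanced_composition_eps(eps, T, delta_prime):
    # Total epsilon after composing T (eps, delta)-DP mechanisms [33]:
    # sqrt(2 T log(1/delta')) * eps + T * eps * (exp(eps) - 1)
    return math.sqrt(2.0 * T * math.log(1.0 / delta_prime)) * eps \
        + T * eps * (math.exp(eps) - 1.0)

total = advanced_composition_eps(0.01, T=1000, delta_prime=1e-5)
basic = 1000 * 0.01   # basic composition would give T * eps
```

Since the per-iteration $\epsilon_k$ of the proposed scheme already decays as $\mathcal{O}(1/\sqrt{K})$, composing it over $T$ iterations preserves the decay with $K$ shown in Fig. 2.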
3.3 Convergence rate of private FL
We next analyze the performance of private wireless FL under the assumption that the global loss function is smooth and strongly convex. Due to the privacy requirements and the noisy nature of the wireless channel, the convergence rate is penalized, as shown in the following Theorem.
Theorem 2.
Suppose the loss function $F(\mathbf{w})$ is $\mu$-strongly convex and $\Lambda$-smooth with respect to $\mathbf{w}$. Then, for a learning rate $\eta_t = 1/(\mu t)$ and a number of iterations $T$, the convergence rate of the private wireless FL algorithm is
$\mathbb{E}[F(\mathbf{w}_T)] - F(\mathbf{w}^*) \le \frac{2\Lambda}{\mu^2 T}\left(L^2 + \frac{\sum_{k=1}^{K} |h_k|^2 \beta_k P_k + d}{c^2 K^2}\right).$  (17)
Theorem 2 is proved in Appendix I. We next show that the artificial noise parameters $\{\beta_k\}$ can be optimized to maximize the convergence rate in (17) while satisfying a desired privacy level $(\hat{\epsilon}_k, \delta)$-LDP at each user.
Theorem 3.
The optimized convergence rate of the private wireless FL algorithm is given as follows:

$\mathbb{E}[F(\mathbf{w}_T)] - F(\mathbf{w}^*) \le \frac{2\Lambda}{\mu^2 T}\left(L^2 + \frac{\sum_{k=1}^{K} |h_k|^2 \beta_k^* P_k + d}{c^2 K^2}\right),$  (18)

where $\{\beta_k^*\}$ is the noise power allocation given in (19), which meets the privacy constraints of all users with the minimum total artificial noise power.
Proof.
Maximizing the convergence rate in (17) is equivalent to minimizing the term that depends on the noise allocation $\{\beta_k\}$. Therefore, we solve the following optimization problem:

$\min_{\{\beta_k\}} \sum_{k=1}^{K} |h_k|^2 \beta_k P_k \quad \text{s.t.} \quad \epsilon_k \le \hat{\epsilon}_k, \;\; 0 \le \beta_k \le 1 - \alpha_k, \;\; \forall k.$

From (13), the privacy constraints are equivalent to a lower bound on the total artificial noise power. For given target privacy levels $\{\hat{\epsilon}_k\}$, this is feasible when the total leftover power after gradient alignment, $\sum_{k} |h_k|^2 (1-\alpha_k) P_k$, is at least as large as the required noise power. We design $\{\beta_k^*\}$ to meet the required total noise power with equality:

$\sum_{k=1}^{K} |h_k|^2 \beta_k^* P_k = d\left(\frac{8\log(1.25/\delta)\,\min_j |h_j|^2 P_j}{\min_k \hat{\epsilon}_k^2} - 1\right).$  (19)

As seen in Fig. 3, we first rank the leftover powers from the users after aligning the gradients, i.e., $(1-\alpha_k)P_k$, in ascending order. We then allocate the noise powers such that a subset of users saturates its leftover power, i.e., $\beta_k^* = 1 - \alpha_k$, so as to satisfy the privacy constraints. This completes the proof of Theorem 3. ∎
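The allocation step above can be sketched as a simple greedy routine; this is a hypothetical implementation of the idea rather than the paper's exact rule, with made-up leftover powers and noise-power target:

```python
def allocate_noise(leftover, required_total):
    """Greedy noise-power allocation sketch (assumed form).

    leftover[k] is user k's leftover power after gradient alignment,
    (1 - alpha_k) * P_k. Users with the largest leftover power are
    saturated first until the required total artificial-noise power is
    met; if leftover is insufficient, the problem is infeasible and a
    partial allocation is returned.
    """
    powers = [0.0] * len(leftover)
    for k in sorted(range(len(leftover)), key=lambda k: -leftover[k]):
        take = min(leftover[k], required_total)
        powers[k] = take
        required_total -= take
        if required_total <= 0:
            break
    return powers

alloc = allocate_noise([1.0, 4.0, 2.5, 0.5], required_total=5.0)
```

Any feasible allocation that meets the required total with equality is optimal here, since the objective and the binding constraint involve the same sum; the greedy order simply keeps the set of noise-contributing users small.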
4 Simulation Results
In this Section, we provide simulation results to assess the performance of the private wireless FL model. We consider a linear regression task on a synthetic dataset. The regularized loss function at the $k$th user is given as:

$F_k(\mathbf{w}) = \frac{1}{2|\mathcal{D}_k|}\sum_{i=1}^{|\mathcal{D}_k|} \big(\langle \mathbf{w}, \mathbf{u}_{k,i} \rangle - v_{k,i}\big)^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2.$  (20)

Our synthetic dataset consists of 3000 i.i.d. samples $(\mathbf{u}_i, v_i)$, where $\mathbf{u}_i$ is the feature vector and $v_i$ is a noisy linear observation of it. We assume that each user has $|\mathcal{D}_k| = 3000/K$ data points. For the GD algorithm, we fix the regularization parameter $\lambda$ and the number of training iterations $T$. The channel coefficients are drawn i.i.d. from a complex Gaussian distribution, and the channel noise variance is normalized to unity. Also, we assume that each user requires the same privacy level $(\epsilon, \delta)$-LDP.
In Fig. 4(a), we show the impact of the number of users $K$ on the training loss for a fixed transmit power $P_k = P$ for all $k$. As we increase the number of users, the training loss decays faster with the iteration index $t$. In Fig. 4(b), we compare with the private orthogonal scheme for the same number of iterations $T$ and the same transmit power for all users. Interestingly, the non-orthogonal scheme is more efficient in terms of both bandwidth and accuracy. In Fig. 4(c), we show the impact of the transmit power on the training loss, where the error decays faster with $t$ as we increase the transmit power.
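A minimal end-to-end version of this experiment can be sketched as follows; the dimension, sample size, noise levels, and regularizer are illustrative stand-ins for the unspecified values above. GD on the regularized linear regression loss (20), with Gaussian perturbation added to the aggregated gradient to emulate the privacy/channel noise, still drives the training loss down:

```python
import random

random.seed(2)
d, n, lam = 2, 300, 0.01          # assumed sizes, not the paper's
w_true = [1.0, -0.5]
X = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [sum(wt * xj for wt, xj in zip(w_true, x)) + random.gauss(0.0, 0.1)
     for x in X]

def loss(w):
    # regularized least-squares loss, cf. Eq. (20)
    res = sum((sum(wj * xj for wj, xj in zip(w, x)) - yi) ** 2
              for x, yi in zip(X, y))
    return res / (2.0 * n) + 0.5 * lam * sum(wj * wj for wj in w)

def grad(w):
    g = [lam * wj for wj in w]
    for x, yi in zip(X, y):
        err = sum(wj * xj for wj, xj in zip(w, x)) - yi
        for j in range(d):
            g[j] += err * x[j] / n
    return g

w = [0.0] * d
start = loss(w)
for t in range(1, 301):
    g = grad(w)
    g = [gj + random.gauss(0.0, 0.05) for gj in g]   # privacy/channel noise
    w = [wj - (1.0 / t) * gj for wj, gj in zip(w, g)]
```

With a decaying learning rate, the injected noise is progressively averaged out, mirroring the qualitative behavior reported in Fig. 4.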
5 Conclusion & Future Directions
We studied the problem of wireless federated learning subject to local differential privacy (LDP) constraints. We showed that the wireless channel provides a dual benefit of bandwidth efficiency together with strong LDP guarantees. Using the proposed wireless aggregation scheme, the privacy leakage was shown to scale as $\mathcal{O}(1/\sqrt{K})$, compared to orthogonal transmission in which the privacy leakage scales as a constant. We also analyzed and optimized the convergence rate of the proposed private FL training algorithm and studied the tradeoffs between wireless resources, convergence, and privacy.
There are several interesting directions for future work, such as generalization to multiple antennas at the users and the PS. In the proposed scheme, all users align their gradients, which limits the effective SNR to that of the user with the worst channel conditions. A possible direction would be to explore generalizations of this scheme by selecting and aligning gradients from smaller subsets of users.
Appendix I: Proof of Theorem 2
To prove the convergence rate of the proposed algorithm, we recall that the gradient estimate at the PS in (10) satisfies: (a) Unbiasedness, i.e., $\mathbb{E}[\hat{g}(\mathbf{w}_t)] = \frac{1}{K}\sum_{k} g_k(\mathbf{w}_t)$, since the total additive noise is zero mean; and (b) Bounded second moment, $\mathbb{E}\|\hat{g}(\mathbf{w}_t)\|^2 \le G^2$, which we prove as follows:

$\mathbb{E}\|\hat{g}(\mathbf{w}_t)\|^2 \overset{(a)}{=} \Big\|\frac{1}{K}\sum_{k} g_k(\mathbf{w}_t)\Big\|^2 + \mathbb{E}\|\tilde{z}^{(t)}\|^2 \overset{(b)}{\le} \frac{1}{K}\sum_{k}\|g_k(\mathbf{w}_t)\|^2 + \frac{\sum_{k}|h_k|^2\beta_k P_k + d}{c^2K^2} \overset{(c)}{\le} L^2 + \frac{\sum_{k}|h_k|^2\beta_k P_k + d}{c^2K^2} \triangleq G^2,$  (21)

where (a) follows from the fact that $\tilde{z}^{(t)}$ is zero mean and independent of the gradients, (b) follows from the Cauchy-Schwarz inequality, and (c) from the assumption that $\|g_k(\mathbf{w}_t)\| \le L$, i.e., the Lipschitz constant is $L$. We next invoke standard results [34] on the convergence of SGD for smooth and strongly convex loss functions, which state that

$\mathbb{E}[F(\mathbf{w}_T)] - F(\mathbf{w}^*) \le \frac{2\Lambda G^2}{\mu^2 T}.$  (22)
References
 [1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
 [2] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in 2017 IEEE Symposium on Security and Privacy (S&P), May 2017, pp. 3–18.
 [3] J. Hayes, L. Melis, G. Danezis, and E. De Cristofaro, “LOGAN: Membership inference attacks against generative models,” Proceedings on Privacy Enhancing Technologies, vol. 2019, no. 1, pp. 133–152, 2019.
 [4] L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov, “Exploiting unintended feature leakage in collaborative learning,” in 2019 IEEE Symposium on Security and Privacy (S&P), May 2019, pp. 691–706.
 [5] C. Dwork, “Differential privacy,” in Automata, Languages and Programming: 33rd International Colloquium, ICALP 2006, Part II, M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, Eds., 2006, pp. 1–12. [Online]. Available: https://doi.org/10.1007/11787006_1
 [6] M. Joseph, A. Roth, J. Ullman, and B. Waggoner, “Local differential privacy for evolving data,” in Advances in Neural Information Processing Systems, 2018, pp. 2375–2384.
 [7] G. Fanti, V. Pihur, and Ú. Erlingsson, “Building a RAPPOR with the unknown: Privacy-preserving learning of associations and data dictionaries,” Proceedings on Privacy Enhancing Technologies, vol. 2016, no. 3, pp. 41–61, 2016.
 [8] B. Ding, J. Kulkarni, and S. Yekhanin, “Collecting telemetry data privately,” in Advances in Neural Information Processing Systems, 2017, pp. 3571–3580.
 [9] A. Triastcyn and B. Faltings, “Federated learning with Bayesian differential privacy,” arXiv preprint arXiv:1911.10071, 2019.
 [10] R. C. Geyer, T. Klein, and M. Nabi, “Differentially private federated learning: A client level perspective,” arXiv preprint arXiv:1712.07557, 2017.
 [11] E. Bagdasaryan, O. Poursaeed, and V. Shmatikov, “Differential privacy has disparate impact on model accuracy,” in Advances in Neural Information Processing Systems, 2019, pp. 15 453–15 462.
 [12] C. Wu, F. Zhang, and F. Wu, “Distributed modelling approaches for data privacy preserving,” in IEEE Fifth International Conference on Multimedia Big Data (BigMM), September 2019, pp. 357–365.
 [13] O. Choudhury, A. Gkoulalas-Divanis, T. Salonidis, I. Sylla, Y. Park, G. Hsu, and A. Das, “Differential privacy-enabled federated learning for sensitive health data,” arXiv preprint arXiv:1910.02578, 2019.
 [14] K. Wei, J. Li, M. Ding, C. Ma, H. H. Yang, F. Farhad, S. Jin, T. Q. Quek, and H. V. Poor, “Performance analysis on federated learning with differential privacy,” arXiv preprint arXiv:1911.00222, 2019.
 [15] N. Agarwal, A. T. Suresh, F. X. X. Yu, S. Kumar, and B. McMahan, “cpSGD: Communication-efficient and differentially-private distributed SGD,” in Advances in Neural Information Processing Systems, 2018, pp. 7564–7575.
 [16] G. Cormode, S. Jha, T. Kulkarni, N. Li, D. Srivastava, and T. Wang, “Privacy at scale: Local differential privacy in practice,” in Proceedings of the 2018 International Conference on Management of Data. ACM, 2018, pp. 1655–1658.
 [17] D. Wang, M. Gaboardi, and J. Xu, “Empirical risk minimization in non-interactive local differential privacy revisited,” in Advances in Neural Information Processing Systems, 2018, pp. 965–974.
 [18] R. Bassily, “Linear queries estimation with local differential privacy,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 721–729.

 [19] R. Bassily and A. Smith, “Local, private, efficient protocols for succinct histograms,” in Proceedings of the forty-seventh annual ACM symposium on Theory of computing. ACM, June 2015, pp. 127–135.  [20] R. Bassily, K. Nissim, U. Stemmer, and A. G. Thakurta, “Practical locally private heavy hitters,” in Advances in Neural Information Processing Systems, 2017, pp. 2288–2296.
 [21] M. M. Amiri and D. Gunduz, “Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air,” arXiv preprint arXiv:1901.00844, 2019.
 [22] G. Zhu, Y. Wang, and K. Huang, “Low-latency broadband analog aggregation for federated edge learning,” arXiv preprint arXiv:1812.11494, 2018.
 [23] Q. Zeng, Y. Du, K. K. Leung, and K. Huang, “Energy-efficient radio resource allocation for federated edge learning,” arXiv preprint arXiv:1907.06040, 2019.
 [24] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” arXiv preprint arXiv:1812.11750, 2018.
 [25] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “Adaptive federated learning in resource constrained edge computing systems,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, March 2019.
 [26] M. M. Amiri and D. Gunduz, “Federated learning over wireless fading channels,” arXiv preprint arXiv:1907.09769, 2019.
 [27] T. Sery and K. Cohen, “On analog gradient descent learning over multiple access fading channels,” arXiv preprint arXiv:1908.07463, 2019.
 [28] ——, “A sequential gradient-based multiple access for distributed learning over fading channels,” in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), September 2019, pp. 303–307.
 [29] M. S. H. Abad, E. Ozfatura, D. Gunduz, and O. Ercetin, “Hierarchical federated learning across heterogeneous cellular networks,” arXiv preprint arXiv:1909.02362, 2019.
 [30] L. U. Khan, N. H. Tran, S. R. Pandey, W. Saad, Z. Han, M. N. Nguyen, and C. S. Hong, “Federated learning for edge networks: Resource optimization and incentive mechanism,” arXiv preprint arXiv:1911.05642, 2019.
 [31] M. M. Amiri and D. Gündüz, “Over-the-air machine learning at the wireless edge,” in 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), July 2019, pp. 1–5.
 [32] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
 [33] C. Dwork, G. N. Rothblum, and S. Vadhan, “Boosting and differential privacy,” in 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, October 2010, pp. 51–60.
 [34] A. Rakhlin, O. Shamir, and K. Sridharan, “Making gradient descent optimal for strongly convex stochastic optimization,” in Proceedings of the 29th International Conference on Machine Learning. Omnipress, 2012, pp. 1571–1578.