I Introduction
With the development of Internet of Things (IoT) technologies, smart devices with built-in sensors, Internet connectivity, and programmable computation capability have proliferated and generated huge volumes of data at the network edge over the past few years. These data can be collected and analyzed to build machine learning models that enable a wide range of intelligent services, such as personal fitness tracking [1], traffic monitoring [2], smart home security [3], and renewable energy integration [4]. However, the data in many of these services are sensitive, such as the heart rate monitored by smart watches, and can leak a great deal of personal information about the users. Because of this privacy concern, users may be unwilling to share their data, which hinders the deployment of these intelligent services. Federated learning is a novel machine learning paradigm in which a group of edge devices collaboratively learn a shared model under the orchestration of a central server without sharing their training data. It mitigates many of the privacy risks of the traditional, centralized machine learning paradigm and has received significant attention recently.
Although promising, federated learning faces several challenges, among which communication overhead is a major one [5]. Specifically, at each iteration of federated learning, edge devices download the shared model from the server and compute updates to it using their own datasets, and these updates are then gathered by the server to renew the shared model. Although only model updates, rather than raw data, are transmitted between edge devices and the server, such updates can contain hundreds of millions of parameters in modern machine learning models such as deep neural networks, resulting in high bandwidth usage per iteration. Moreover, many federated learning schemes require a large number of iterations to achieve high model accuracy, and hence the communication cost of the whole training process is high. Since most edge devices are resource-constrained, the bandwidth between the server and edge devices is rather limited, especially for uplink transmissions. Therefore, it is crucial to make federated learning communication-efficient.
Besides communication efficiency, privacy leakage is another core challenge in federated learning [5]. Although edge devices in federated learning keep their data local and only exchange ephemeral model updates, which contain less information than raw data, this is not sufficient to guarantee data privacy. For example, by observing the model updates from an edge device, an adversary can recover the private dataset on that device using a reconstruction attack [6] or infer whether a sample is in the dataset of that device using a membership inference attack [7]. In particular, if the server is not fully trusted, it can easily infer private information about edge devices from the model updates received during training by employing existing attack methods. Therefore, protecting against those advanced privacy attacks and providing a rigorous privacy guarantee for each participant in federated learning without a fully trusted server is challenging and needs to be addressed.
In order to motivate and retain edge devices in federated learning, it is desirable to achieve both communication efficiency and a data privacy guarantee. Many prior efforts have considered either communication efficiency [8, 9, 10, 11] or privacy [12, 13] in federated learning, but not both. In this paper, we propose a novel distributed learning scheme called Communication-efficient and Privacy-preserving Federated learning (CPFed) that both reduces communication cost and provides a formal privacy guarantee without assuming a fully trusted server. To save communication cost, we reduce the number of communications from two perspectives. We limit the number of participating devices per iteration through device selection, and we decrease the number of communication rounds by allowing selected devices to perform multiple local iterations before sending their computation results out. Client selection and more local computation can significantly reduce the communication cost; however, they make a rigorous convergence analysis harder and affect the privacy guarantee. To preserve the privacy of devices without a fully trusted server, we leverage the concept of local differential privacy and ask each device to add random noise to perturb its local computation results before transmission. However, when combined directly with our communication-reduction strategy, local differential privacy adds too much noise to the model updates and leads to low model accuracy. In our proposed scheme, we therefore use a secure aggregation protocol with low communication overhead to aggregate devices’ model updates, which improves the model accuracy under the same differential privacy guarantee.
In summary, the main contributions of this paper are as follows.

We propose a novel federated learning scheme called CPFed for communication-efficient and differentially private learning over distributed data without a fully trusted server.

CPFed reduces the number of communications by allowing only a subset of devices to participate in the training at each iteration and to communicate with the server periodically.

Without much degradation of the model accuracy, CPFed rigorously protects the data privacy of each device by integrating secure aggregation and differential privacy techniques.

Instead of providing only a per-iteration differential privacy guarantee, we tightly account for the end-to-end privacy loss of CPFed using zero-concentrated differential privacy.

We perform a convergence analysis of CPFed for both strongly convex and non-convex loss functions and conduct extensive evaluations based on a real-world dataset.
The rest of the paper is organized as follows. Related work and background on the privacy notions used in this paper are described in Section VIII and Section II, respectively. Section III introduces the system setting and the problem formulation. Section IV presents our CPFed learning scheme and the corresponding algorithm. The privacy guarantee and convergence property of CPFed are rigorously analyzed in Section V and Section VI, respectively. Finally, Section VII shows the evaluation results based on a real-world dataset, and Section IX concludes the paper.
II Preliminaries
In what follows, we briefly describe the basics of differential privacy and its properties. Differential privacy (DP) is a rigorous notion of privacy and has become the de facto standard for measuring privacy risk.
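$(\epsilon,\delta)$-Differential Privacy [14]. $(\epsilon,\delta)$-DP is the classic DP notion with the following definition:
Definition 1 ($(\epsilon,\delta)$-DP).
A randomized algorithm $\mathcal{M}: \mathcal{D} \rightarrow \mathcal{R}$ with domain $\mathcal{D}$ and range $\mathcal{R}$ is $(\epsilon,\delta)$-differentially private if for any two adjacent datasets $D, D' \in \mathcal{D}$ that differ in at most one data sample and any subset of outputs $S \subseteq \mathcal{R}$, it satisfies that:
$$\Pr[\mathcal{M}(D) \in S] \leq e^{\epsilon} \Pr[\mathcal{M}(D') \in S] + \delta. \tag{1}$$
The above definition reduces to $\epsilon$-DP when $\delta = 0$. Here the parameter $\epsilon$ is also called the privacy budget. Given any function $f$ that maps a dataset $D \in \mathcal{D}$ into a scalar $o \in \mathbb{R}$, we can achieve $(\epsilon,\delta)$-DP by adding Gaussian noise to the output scalar $o$, where the noise magnitude is proportional to the sensitivity of $f$, given as $\Delta_2(f) = \sup_{D, D'} \|f(D) - f(D')\|_2$.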
Zero-Concentrated Differential Privacy [15]. Zero-concentrated differential privacy (zCDP) is a relaxed version of $(\epsilon,\delta)$-DP. zCDP has a tight composition bound and is more suitable for analyzing the end-to-end privacy loss of iterative algorithms. To define zCDP, we first define the privacy loss. Given an output $o \in \mathcal{R}$, the privacy loss $Z$ of the mechanism $\mathcal{M}$ is a random variable defined as:
$$Z := \log \frac{\Pr[\mathcal{M}(D) = o]}{\Pr[\mathcal{M}(D') = o]}. \tag{2}$$
zCDP imposes a bound on the moment generating function of the privacy loss $Z$. Formally, a randomized mechanism $\mathcal{M}$ satisfies $\rho$-zCDP if for any two adjacent datasets $D, D' \in \mathcal{D}$, it holds that for all $\alpha \in (1, \infty)$,
$$\mathbb{E}\left[e^{(\alpha - 1) Z}\right] \leq e^{(\alpha - 1)\alpha\rho}. \tag{3}$$
Here, (3) requires the privacy loss $Z$ to be concentrated around zero, and hence it is unlikely that one can distinguish $D$ from $D'$ given their outputs. zCDP has the following properties [15]:
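Lemma 1.
Let $f$ be any real-valued function with sensitivity $\Delta_2(f)$. Then the Gaussian mechanism, which returns $f(D) + \mathcal{N}(0, \sigma^2)$, satisfies $\Delta_2(f)^2/(2\sigma^2)$-zCDP.
Lemma 2.
Suppose two mechanisms satisfy $\rho_1$-zCDP and $\rho_2$-zCDP. Then their composition satisfies $(\rho_1 + \rho_2)$-zCDP.
Lemma 3.
If $\mathcal{M}$ is a mechanism that provides $\rho$-zCDP, then $\mathcal{M}$ is $(\rho + 2\sqrt{\rho \log(1/\delta)}, \delta)$-DP for any $\delta > 0$.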
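The three zCDP properties can be turned into a small privacy accountant. The following is a minimal sketch, assuming the standard forms of these results from [15]: a Gaussian mechanism with sensitivity Δ and noise variance σ² is (Δ²/(2σ²))-zCDP, zCDP parameters add under composition, and ρ-zCDP implies (ρ + 2√(ρ log(1/δ)), δ)-DP. The function names and the numeric values are illustrative choices and do not come from the paper.

```python
import math


def gaussian_zcdp(sensitivity: float, sigma: float) -> float:
    """Lemma 1: zCDP parameter rho of a Gaussian mechanism with noise N(0, sigma^2)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return sensitivity ** 2 / (2.0 * sigma ** 2)


def compose_zcdp(rhos) -> float:
    """Lemma 2: zCDP parameters add up under composition."""
    return sum(rhos)


def zcdp_to_dp(rho: float, delta: float) -> float:
    """Lemma 3: epsilon such that rho-zCDP implies (epsilon, delta)-DP."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


# Illustrative values (assumptions): 100 releases, sensitivity 1, sigma 10.
rho_step = gaussian_zcdp(sensitivity=1.0, sigma=10.0)
rho_total = compose_zcdp([rho_step] * 100)
epsilon = zcdp_to_dp(rho_total, delta=1e-5)
print(f"rho per step = {rho_step}, total rho = {rho_total}, epsilon = {epsilon:.3f}")
```

Accounting in ρ and converting to (ε, δ) only once at the end is what makes zCDP convenient for iterative algorithms: the per-step costs simply add.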
III System Modeling and Problem Formulation
III-A Federated Learning System
Consider a federated learning setting that consists of a central server and $n$ clients that are able to communicate with the server. Each client $i$ has a local dataset $D_i$, a collection of datapoints from its edge device. The clients want to collaboratively learn a shared model $\theta$ using their data under the orchestration of the central server. Because of privacy concerns and the high latency of uploading all local datapoints to the server, federated learning allows clients to train the model while keeping their data local. Specifically, the shared model $\theta$ is learned by minimizing the overall empirical risk of the loss on the union of all local datasets, that is,
$$\min_{\theta} f(\theta) := \frac{1}{n}\sum_{i=1}^{n} f_i(\theta), \quad f_i(\theta) := \mathbb{E}_{z \sim D_i}\left[l(\theta; z)\right]. \tag{4}$$
Here, $f_i$ represents the local objective function of client $i$, and $l(\theta; z)$ is the loss of the model $\theta$ at a datapoint $z$ sampled from local dataset $D_i$.
In federated learning, the central server is responsible for coordinating the training process across all clients and maintaining the shared model $\theta$. The system architecture of federated learning is shown in Fig. 1. At the beginning of each iteration, clients download the shared model $\theta$ from the server and compute local updates on $\theta$ using their local datasets. Then, each client uploads its computed result to the server, where the received local results are aggregated to update the shared model $\theta$. This procedure repeats until certain convergence criteria are satisfied.
III-B Threat Model
We assume that the adversary can be the “honest-but-curious” central server or clients in the system. The central server will honestly follow the designed training protocol but is curious about the clients’ private data and may infer it from the shared messages. Furthermore, some malicious clients could collude with the central server or with each other to infer private information about a specific client. In addition, the adversary can be a passive outside attacker. Such attackers can eavesdrop on all shared messages during the execution of the training protocol but will not actively inject false messages into, or interrupt, the message transmission. Malicious clients who, for instance, launch data pollution attacks by lying about their private datasets or returning incorrect computed results to disrupt the training process are out of the scope of this paper and are left as future work.
III-C Design Goals
Our goal is to design a scheme that enables multiple clients to jointly learn an accurate model for a given machine learning task with low communication cost. Moreover, the differential privacy guarantee for each client should be provided without sacrificing much accuracy of the trained model.
IV Proposed CPFed Scheme
In this section, we propose our method called CPFed to address the communication overhead and privacy leakage issue in federated learning. In what follows, we first describe how to save the communication cost of training the model, using the periodic averaging method. Then we discuss how to preserve the data privacy of each client in the system with differential privacy. Next, we improve the accuracy of our method with secure aggregation. Finally, we present the overall algorithm that captures all these components.
IV-A Improving Communication Efficiency with Periodic Averaging
In the vanilla distributed stochastic gradient descent (SGD) approach that solves Problem (4), the server collects the gradients of local objectives from all clients and updates the shared model using a gradient descent iteration given by
$$\theta^{k+1} = \theta^{k} - \eta \, \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(\theta^{k}), \tag{5}$$
where $\theta^{k}$ represents the shared model at iteration $k$, $\eta$ is the stepsize, and $\nabla f_i(\theta^{k})$ represents the gradient of local objective function $f_i$ based on the local dataset $D_i$.
The above distributed SGD method, however, requires many rounds of communication between clients and the server [5]. Federated learning systems potentially comprise a massive number of devices, e.g., millions of smartphones, and communication can be slower than local computation by many orders of magnitude. Therefore, a large number of communication rounds leads to inefficient training. More precisely, assume the number of iterations for training the model is and at each iteration client shares its local gradient with the server to update the model. Then, the total number of communications is .
To save the communication cost, we propose a communication-reduction method that simultaneously reduces the number of communication rounds and the number of clients involved per round, as shown in Fig. 2. In our method, the server first selects a set of clients uniformly at random and then lets the selected clients perform multiple iterations to minimize the local objectives before sending their local computation results to the server. Specifically, at round $t$, a set $\Omega_t$ of $m$ clients is selected to download the current shared model $\theta^{t}$ from the server and perform $\tau$ local iterations on $\theta^{t}$. Let $\theta_i^{t,s}$ denote the local model of client $i$ at the $s$-th local iteration of the $t$-th round. At each local iteration $s = 0, \dots, \tau - 1$, client $i$ updates its model by
$$\theta_i^{t,s+1} = \theta_i^{t,s} - \eta \, g(\theta_i^{t,s}), \tag{6}$$
where $g(\theta_i^{t,s})$ represents the mini-batch stochastic gradient computed based on a batch of datapoints sampled from the local dataset $D_i$. Note that when $s = 0$, the local model $\theta_i^{t,0} = \theta^{t}$ for all clients in $\Omega_t$. After $\tau$ local iterations, the selected clients upload their local models $\theta_i^{t,\tau}$ to the server, where the shared model is updated by
$$\theta^{t+1} = \frac{1}{m}\sum_{i \in \Omega_t} \theta_i^{t,\tau}. \tag{7}$$
In this case, each client is selected to communicate with the server with probability and only needs to periodically communicate for times in total. Hence, the number of communication rounds is reduced by a factor of .
IV-B Preventing Privacy Leakage with Differential Privacy
The aforementioned communication-reduction method prevents direct information leakage by keeping the raw data local; however, it cannot prevent more advanced attacks [6, 7] that infer private information about local training data by observing the messages communicated between clients and the server. According to our threat model described in Section III-B, clients and the server in the system are “honest-but-curious”, and attackers outside the system can eavesdrop on the transmitted messages. These attackers are able to obtain the latest shared model sent from the server to clients and the local models sent from clients to the server, both of which contain private information about clients’ training data. Our goal is to prevent privacy leakage from these two types of messages with differential privacy.
In our setting, a straightforward approach would be the Gaussian mechanism. Specifically, each client adds enough Gaussian noise to the shared information (i.e., the local model to be uploaded) directly before releasing it. In this case, attackers would not be able to learn much about an individual sample in $D_i$ from the received messages. Accordingly, at each local iteration, client $i$ updates its local model by
$$\theta_i^{t,s+1} = \theta_i^{t,s} - \eta \left( g(\theta_i^{t,s}) + b_i^{t,s} \right), \tag{8}$$
where $b_i^{t,s}$ is the Gaussian noise sampled at the $s$-th local iteration of the $t$-th round from the distribution $\mathcal{N}(0, \sigma^2 \mathbf{I})$. Here, the local model $\theta_i^{t,\tau}$ will preserve a certain level of differential privacy guarantee for client $i$, which is proportional to the size of the noise $\sigma$. Due to the post-processing property of differential privacy [14], the sum of local models, i.e., the updated model $\theta^{t+1}$, preserves the same level of differential privacy guarantee for client $i$ as the local model.
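The training loop described so far (uniform client selection, several noisy local SGD steps per selected client, then averaging at the server) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the model is a single scalar, the loss is the hypothetical quadratic 0.5(θ − z)², the batch size is one, and all names and parameter values (`local_update`, `run_rounds`, `m`, `tau`, `eta`, `sigma`) are chosen here for illustration.

```python
import random


def local_update(theta, data, eta, tau, sigma, rng):
    """Run tau noisy SGD steps on one client, starting from the shared model."""
    for _ in range(tau):
        z = rng.choice(data)             # sample one datapoint (batch size 1)
        grad = theta - z                 # gradient of the toy loss 0.5 * (theta - z)**2
        noise = rng.gauss(0.0, sigma)    # Gaussian perturbation of the gradient
        theta = theta - eta * (grad + noise)
    return theta


def run_rounds(datasets, rounds, m, tau, eta, sigma, seed=0):
    """Periodic averaging with m uniformly selected clients per round."""
    rng = random.Random(seed)
    theta = 0.0      # shared model kept by the server
    uploads = 0      # number of client-to-server transmissions
    for _ in range(rounds):
        selected = rng.sample(range(len(datasets)), m)
        local_models = [
            local_update(theta, datasets[i], eta, tau, sigma, rng) for i in selected
        ]
        theta = sum(local_models) / m    # server averages the uploaded local models
        uploads += m
    return theta, uploads


# Toy data (assumption): client i holds five copies of the value i.
datasets = [[float(i)] * 5 for i in range(10)]
theta, uploads = run_rounds(datasets, rounds=200, m=5, tau=4, eta=0.1, sigma=0.01)
print(f"final shared model: {theta:.3f}, uploads: {uploads}")
```

With 200 rounds of 4 local steps each, the clients perform 800 local iterations but upload only 5 models per round, which illustrates how selection and local computation jointly cut communication. The shared model fluctuates around the average of the client optima because only a random subset participates in each round.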
IV-C Improving Model Accuracy with Secure Aggregation
Although differential privacy can be achieved using the Gaussian mechanism, the accuracy of the learned model will degrade significantly. At each round of training, all uploaded local models are exposed to the attacker, leading to a large amount of information leakage. However, we observe that the server only needs to know the average of the local models. Intuitively, one can reduce the privacy loss of clients by hiding the individual local models and restricting the server to receive only the sum of local models, without disturbing the learning process. This can be achieved via a secure aggregation protocol so that the server can only decrypt the sum of the encrypted local models of the selected clients without knowing each client’s local model. In the following, we design such a protocol based on secret sharing, which is efficient in terms of the amortized computation and communication overhead across all communication rounds. The effect of secure aggregation in reducing privacy loss will be rigorously analyzed in Section V.
In our setting, a secure aggregation protocol should be able to 1) hide individual messages for clients, 2) recover the sum of individual messages of a random set of clients at each round, and 3) incur low communication cost for participating clients. Denote by the plaintext message of client (i.e., local model parameters ) that needs to be uploaded to the server. Our proposed protocol involves few interactions between clients and the server during each round and consists of the following two main steps:

Encryption uploading: Clients in upload their own encrypted local models to the server.

Decryption: The server decrypts the sum of the messages received from clients in .
The basic idea of the protocol is to protect the message $x_i$ of client $i$ by hiding it with a random number $r_i$ in the plaintext space, i.e., $c_i = x_i + r_i$. However, the challenge here is how to remove the random number $r_i$ from the received ciphertext at the server. To this end, we require that all the $r_i$ sum up to $0$, i.e., $\sum_i r_i = 0$, which prevents the attacker from recovering each individual message $x_i$ but enables the server to recover $\sum_i x_i$. However, this requires the clients to communicate with each other in order to generate such secrets $r_i$, which is inefficient in terms of communication overhead.
To save the communication overhead, we introduce a pseudorandom function (PRF) here. The PRF takes a random seed that clients $i$ and $j$ agree on during initialization and the round number $t$, and outputs a different pseudorandom number at each round. Client $i$ can calculate the shared secret without interacting with client $j$ at each round as long as they both use the same seed and round number, and thus each client can calculate $r_i$ without interactions. This procedure greatly reduces the amortized communication overhead of our protocol over multiple rounds.
The detailed protocol is depicted in Fig. 3. All clients need to go through an initialization step upon enrollment which involves pairwise communications with all other clients (which can be facilitated by the server) to generate a random seed . After this initialization step, all enrolled clients could upload their messages through the encryption uploading step. At each round, only a subset of selected clients would upload their messages. Clients send a notification signal to the server once they are ready to upload their local models, and the server waits until receiving notifications from enough clients. The server then broadcasts the information to all clients in . Client would first compute its secret at the current round as follows:
(9) 
where is a secret known by both client and . Client could then generate the ciphertext for by
(10) 
In the decryption step, the server receives from all selected clients. The server could then recover the sum of plaintext messages from clients in as follows:
(11) 
Note that in the above protocol, we assume all clients in have stable connection to the server. In the rest of the paper, for the ease of expression, we use to denote the encryption of the local model parameters .
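The masking idea behind the protocol can be sketched as follows. This is a hedged illustration rather than the paper's exact construction: HMAC-SHA256 stands in for the PRF, the modulus `P`, the rule that the lower-indexed client of each pair adds the shared mask while the other subtracts it, and the helper names (`prf`, `mask`, `encrypt`, `aggregate`) are all assumptions made here. Messages are plain integers, so real-valued model parameters would first need a fixed-point encoding.

```python
import hashlib
import hmac
import secrets

P = 2 ** 61 - 1  # hypothetical modulus defining the plaintext space


def prf(seed: bytes, round_no: int) -> int:
    """Pseudorandom value derived from a pairwise seed and the round number."""
    digest = hmac.new(seed, round_no.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % P


def mask(i, selected, seeds, round_no):
    """Mask of client i; the masks of all selected clients sum to 0 modulo P."""
    r = 0
    for j in selected:
        if j == i:
            continue
        shared = prf(seeds[frozenset((i, j))], round_no)
        r += shared if i < j else -shared   # each pair contributes +shared and -shared
    return r % P


def encrypt(x, i, selected, seeds, round_no):
    """Ciphertext of client i: its message hidden by its mask."""
    return (x + mask(i, selected, seeds, round_no)) % P


def aggregate(ciphertexts):
    """Server side: the masks cancel, leaving only the sum of the messages."""
    return sum(ciphertexts) % P


# Initialization (assumption): every pair of the 4 enrolled clients shares a random seed.
seeds = {
    frozenset((i, j)): secrets.token_bytes(16)
    for i in range(4) for j in range(i + 1, 4)
}
selected = [0, 2, 3]                  # clients chosen in this round
messages = {0: 11, 2: 22, 3: 33}      # toy integer messages
ciphertexts = [encrypt(messages[i], i, selected, seeds, round_no=1) for i in selected]
print(aggregate(ciphertexts))         # sum of the plaintext messages
```

Because the masks depend only on the pairwise seeds, the round number, and the announced set of selected clients, no client-to-client interaction is needed after initialization. As in the protocol above, the sketch assumes that every selected client actually uploads; a dropped client would leave its masks uncancelled.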
IV-D The Overall Scheme of CPFed
The overall scheme of our CPFed is summarized in Algorithm 1. Our scheme consists of communication rounds and during each round, a set of clients is selected to perform local iterations, which results in iterations in total. More precisely, at each round , the server first picks clients uniformly at random which we denote by . The server then broadcasts its current shared model to all the clients in and each client performs local iterations using its local dataset according to (8). Note that clients in start with a common model initialization, i.e., , at the beginning of each round. After local iterations, each client in uploads an encrypted local model to the server. The server finally aggregates the encrypted messages to compute the next shared model, and the procedure repeats for rounds.
V Privacy Analysis
As we mentioned before, our goal of using differential privacy techniques is to prevent the attacker outside the system or the “honestbutcurious” server and clients from learning sensitive information about the local data of a client. Under the secure aggregation protocol, the local model is encrypted and the attacker will only obtain the sum of local models. Thus, as long as the sum of local models is differentially private, we can prevent the attacks launched by the attacker.
Instead of using DP directly, we use zCDP to tightly account for the end-to-end privacy loss of CPFed and then convert it to a DP guarantee. Accordingly, we first show that the sum of local models achieves zCDP at each round, and then we account for the overall zCDP guarantee after rounds. To do so, we compute the sensitivity of the stochastic gradient of client at a single local iteration (as given in Corollary 1) and the sensitivity of (as given in Lemma 4).
Corollary 1.
The sensitivity of the stochastic gradient of client at the th local iteration of the th round is bounded by .
Proof:
For client , given any two neighboring datasets and of size that differ only in the th data sample, the sensitivity of the stochastic gradient computed at each local iteration in Algorithm 1 can be computed as
Since the loss function is Lipschitz continuous, the sensitivity of can be estimated as . ∎
Lemma 4.
The sensitivity of the sum of uploaded local models at round is bounded by .
Proof:
Without adding noise, the local model of client after local iterations at round can be written as
(12) 
According to the sensitivity of given in Corollary 1, we have that
Now, it is easy to find that the sensitivity of the sum of uploaded local models is . ∎
By Lemma 1 and 4, we can obtain the zCDP guarantee at each round if we can measure the magnitude of noise added on the sum of local models. According to Algorithm 1, Gaussian noise is added to the stochastic gradient. Thus, after local iterations, client obtains a noisy local model that is
The server will receive such local models at each round and each of them contains an independent Gaussian noise drawn from the distribution . Therefore, we have that
where we can see the magnitude of Gaussian noise added on the sum of uploaded local models is . By Lemma 1 and Lemma 4, each round of Algorithm 1 achieves zCDP for each selected client. Finally, we compute the overall privacy guarantee for a client after rounds of training and give the DP guarantee in Theorem 1.
Theorem 1.
Proof:
We have shown that each round of Algorithm 1 achieves zCDP for the client in . Due to the client selection, not all clients will upload their models to the server at round . If their models are not sent out, they do not lose their privacy at that round. Indeed, every client in the system only participates in the training with probability at each round. Therefore, by Lemma 2, the overall zCDP guarantee of each client in the system after rounds of training is . Theorem 1 then follows by Lemma 3. ∎
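The accounting argument in this proof can be illustrated numerically. The sketch below is an assumption-laden simulation, not the bound of Theorem 1: it takes a hypothetical per-round zCDP cost `rho_round` as input, charges it only to the clients selected in a round (Lemma 2), and converts each client's accumulated cost to an ε at a chosen δ using the standard zCDP-to-DP conversion (Lemma 3). All names and values are illustrative.

```python
import math
import random


def client_privacy_loss(rounds, n, m, rho_round, delta, seed=0):
    """Track the zCDP cost of each of n clients when m are selected per round."""
    rng = random.Random(seed)
    rho = [0.0] * n
    for _ in range(rounds):
        for i in rng.sample(range(n), m):
            rho[i] += rho_round    # only participating clients incur privacy loss
    # Convert each client's accumulated rho into epsilon at the given delta.
    eps = [r + 2.0 * math.sqrt(r * math.log(1.0 / delta)) for r in rho]
    return rho, eps


# Illustrative values (assumptions): 20 clients, 5 selected per round, 100 rounds.
rho, eps = client_privacy_loss(rounds=100, n=20, m=5, rho_round=0.01, delta=1e-5)
print(f"average rho per client: {sum(rho) / len(rho):.3f}")
print(f"largest epsilon over clients: {max(eps):.3f}")
```

Averaged over clients, the accumulated cost equals the selection probability times the number of rounds times the per-round cost, which is the intuition behind scaling the per-round guarantee by the participation probability.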
VI Convergence Analysis
In this section, we present our main theoretical results on the convergence properties of the proposed CPFed algorithm. We first consider strongly convex loss functions and state the convergence rate of CPFed for such losses in Theorem 2. Then, in Theorem 3, we present the convergence rate of CPFed for non-convex losses.
Before stating our results, we give some assumptions for both the convex and non-convex cases.
Assumption 1 (Smoothness).
The loss function is smooth, i.e., for any , we have .
Assumption 2 (Unbiased gradients).
The local stochastic gradients with are unbiased, i.e., for any and , .
Assumption 3 (Bounded divergence).
The local stochastic gradients will not diverge a lot from the exact gradient, i.e., for any and , .
The condition in Assumption 1 implies that the local loss function and the global loss function are smooth. The condition on the bias of stochastic gradients in Assumption 2 is customary for SGD algorithms. Assumption 3 ensures that the divergence between local stochastic gradients is bounded. This condition is assumed in most federated learning settings. Under Assumptions 2 and 3, we obtain Lemma 5, which bounds the divergence of local gradients. Note that in the rest of this paper we consider the stochastic gradient with batch size , but it is easy to extend our conclusions to stochastic gradient descent with a larger batch size.
Lemma 5.
The variance of local stochastic gradient is bounded, and the local gradient will not diverge a lot from the exact gradient, i.e., for any and :

;

.
Proof:
To prove the convergence of CPFed, we first represent the update rule of CPFed in a general manner. In Algorithm 1, the total number of iterations is , i.e., . At iteration where , each client evaluates the stochastic gradient based on its local dataset and updates current model . Thus, clients have different versions of the model. After local iterations, clients upload their encrypted local models to the server and then update their models with the updated shared model downloaded from the server, i.e., with , where since does not change during the local iteration.
Now, we can present a virtual update rule that captures all the features in Algorithm 1. Define matrices for that concatenate all local models, gradients and noises:
If client is not selected to upload its model at iteration , . Besides, define matrix with element if and otherwise. Unless otherwise stated, is a column vector of size with element if and otherwise. To capture periodic averaging, we define as
where is a identity matrix. Then a general update rule of CPFed can be expressed as follows:
(14) 
Note that the secure aggregation does not change the sum of local models. Multiplying on both sides of (14), we have
(15) 
Then define the averaged model at iteration as
After rewriting (15), one yields
(16) 
Since client is picked at random to perform updating at each round and is the stochastic gradient computed on one data sample , the randomness in our federated learning system comes from the client selection, stochastic gradient, and Gaussian noise. In the following, we bound the expectation of several intermediate random variables, which we denote by . For ease of expression, we use instead of in the rest of the paper, unless otherwise stated. Specifically, as given in Lemma 6 and Lemma 7, we compute the upper bound of the expectation of the perturbed stochastic gradients and the network error that captures the divergence between local models and the averaged model.
Lemma 6.
The expectation and variance of the averaged perturbed stochastic gradients at iteration are
(17) 
and
(18) 
Proof:
Lemma 7.
Assume , the expected network error at iteration is bounded as follows:
(19) 
Proof:
Since and all clients in start from the same model received from the server to update, i.e., . For client , we have
(20) 
Given that , one yields ,
where we use the inequality . By Lemma 6 and the fact that , we have that
which shows that the upper bound of is not related to the index of client . Thus, the expected network error at iteration is
and Lemma 7 is finally obtained by relaxing the constant of the second term. ∎
VI-A Convex Setting
This subsection describes the convergence rate of CPFed for smooth and strongly convex loss functions. The strong convexity is defined as follows:
Assumption 4.
The loss function is strongly convex if for any we have for some .
This assumption implies that the local loss functions and the global loss function are also strongly convex.
Convergence Criterion. In the convergence rate analysis of CPFed for convex loss functions, we use the expected optimality gap as the convergence criterion, i.e., after iterations the algorithm achieves an expected suboptimal solution if
(21) 
where is arbitrarily small and is the objective value at optimal solution . Specifically, we have the following main convergence results:
Theorem 2 (Convergence of CPFed for Convex Losses).
For the CPFed algorithm, suppose the total number of iterations where is the number of communication rounds and is the round length. Under Assumptions 1-4, if the learning rate satisfies and all clients are initialized at the same point , then after iterations, the expected optimality gap is bounded as follows
(22) 
where