With the dramatic development of the internet-of-things (IoT), data from intelligent devices is growing at an unprecedented scale [1, 2]. Conventional machine learning (ML) is often no longer capable of efficiently processing data in a centralized manner. To tackle this challenge, several distributed ML architectures, e.g., federated learning (FL), have been proposed with different approaches to aggregating models [3, 4, 5, 6, 7, 8]. In FL, all clients with the same data structure collaboratively learn a shared model with the help of a server, i.e., training the model at clients and aggregating model parameters at the server. Owing to the local training, FL does not require clients to upload their private data, thereby effectively reducing transmission overhead as well as preserving clients’ privacy. As such, FL is applicable to a variety of scenarios where data are either sensitive or expensive to transmit to the server, e.g., health-care records, private images, and personally identifiable information [9, 10, 11].
Although FL can preserve private data from being exposed to the public, hidden adversaries may attack the learning model by eavesdropping and analyzing the shared parameters, e.g., via a reconstruction attack and inference attack [12, 13, 14]. For example, a malicious classifier may reveal the features of clients’ data and reconstruct data points from a given FL model. Some common attack strategies can be found in recent studies. The work in recovered the private data based on the observation that the gradient of weights is proportional to that of the bias, and their ratio approximates to the training input. The work in considered collaborative learning and proposed a generative adversarial network (GAN) based reconstruction attack. Furthermore, this reconstruction attack utilized the shared model as the discriminator to train a GAN model that generates prototypical samples of the training data. In , Melis et al. demonstrated that the shared models in FL leak unintended information from participants’ training data, and they developed passive and active inference attacks to exploit this leakage. The work in proposed a framework incorporating GAN with a multi-task discriminator to explore user-level privacy, which simultaneously discriminates category, reality, and client identity of input samples. The work in showed that it is possible to obtain the private training data from the publicly shared gradients, namely, deep leakage from gradient, which was empirically validated on both computer vision and natural language processing tasks.
Therefore, to preserve data contributors’ privacy, it is crucial to design an effective algorithm that mitigates the privacy concerns of data sharing without damaging the quality of trained FL models. In this paper, in order to prevent information leakage from the shared model parameters and improve the training efficiency, we develop a novel privacy-preserving FL framework. We first borrow the concept of local differential privacy (LDP), and introduce a client-level differential privacy (CDP) algorithm. Then, we prove that the CDP satisfies the requirement of DP under a certain privacy level by properly adapting the variance of the Gaussian mechanism. We also develop a theoretical convergence bound of the CDP algorithm, which reveals that there exists an optimal number of communication rounds in terms of convergence performance for a given privacy level. Furthermore, we design an online optimization method, termed communication rounds discounting (CRD), which can obtain a better tradeoff between complexity and convergence performance compared with the offline heuristic searching method. Evaluations demonstrate that our theoretical results are consistent with simulations. Extensive experiments validate that our discounting method in CDP can achieve performance equivalent to that of offline searching in terms of loss function values with a much lower complexity.
The remainder of this paper is organized as follows. The related work is introduced in Section II, and the threat model and backgrounds on DP and FL are explained in Section III. Then, we present the details of the proposed CDP algorithm and its convergence bound in Section IV, and the noise recalculation and CRD methods in Section V. The analytical and experimental results are shown in Section VI. Finally, conclusions are drawn in Section VII.
II Related Work
In this section, we present related work on ML with DP and privacy-preserving FL.
II-A Machine Learning with Differential Privacy
Privacy-preserving ML has attracted intensive attention over the last decade, as the emergence of centralized searchable data repositories and open data applications may lead to the leakage of private information. The work in first proposed the concept of deep learning with differential privacy (DP), providing an evaluation criterion for privacy guarantees. The work in designed a functional mechanism to perturb the objective function of the neural network in the training process to achieve a certain DP level. However, this work lacks a tight analysis of the cumulative privacy loss because the privacy budget is uniformly distributed over training epochs and the gaps become large when the training epochs increase. The work in improved the DP based stochastic gradient descent (SGD) algorithms by carefully allocating the privacy budget at each training iteration. In , the privacy budget and step size for each iteration are dynamically determined at runtime based on the quality of the noisy statistics (e.g., gradient) obtained for the current training iteration.
Recently, ML with distributed frameworks has become increasingly popular, due to the ever-increasing capability of IoT devices. Therefore, privacy issues are more critical in distributed ML. The work in introduced the notion of DP and proposed a distributed online learning algorithm to improve the learning performance for a given privacy level. The work in analyzed the privacy loss in a DP-based distributed ML framework, and provided the explicit convergence performance. The work in presented a theoretical analysis of DP-based SGD algorithms, and provided an approach for analyzing the quality of ML models related to the privacy level and the size of the datasets.
II-B Privacy Preserving in Federated Learning
A formal treatment of privacy risks in FL calls for a holistic and interdisciplinary approach [26, 27]. The work in proposed a privacy-preserving FL framework to maintain a given privacy level with a low performance loss when there is a sufficiently large number of participating clients. The work in presented a novel approach to combining DP with secure multiparty computation, thereby reducing the growth of noise injection when the number of participants increases without sacrificing privacy. The work in proposed a novel sketch-based FL framework to obtain provable DP benefits for clients, which compresses the transmitted messages via sketches to achieve high communication efficiency.
|A randomized mechanism for DP|
|The parameters related to DP|
|The -th client|
|The dataset held by the owner|
|The -th sample in|
|The dataset held by all the clients|
|The cardinality of a set|
|Total number of all users|
|The number of chosen clients ()|
|The index of the -th communication round|
|The number of communication rounds|
|The vector of model parameters|
|Global loss function|
|Local loss function from the -th client|
|Local training parameters of the -th client|
|Local training parameters after adding noises|
|Initial parameters of the global model|
|Global parameters generated from local parameters at the -th communication round|
|The optimal parameters that minimize|
In this section, we present the threat model and backgrounds on DP and FL.
III-A Threat Model and DP Guarantee
The inference attack [12, 13] and reconstruction attack aim to infer private features and to reconstruct the training data from a learning model, respectively. In this paper, we assume that the server is honest-but-curious, i.e., it may try to recover the training datasets or infer private features from the locally uploaded parameters. Concretely, this curious server may train a GAN framework, e.g., multi-task GAN-AI , which simultaneously discriminates category, reality, and client identity of input samples. The novel discrimination on client identity enables the generator to recover users’ private data. This curious server may also be interested in learning whether a given sample belongs to the training datasets, which can be inferred from the difference between the model outputs obtained when the model is and is not trained on this sample. For example, when conducting a clinical experiment, a person may not want the observer to know whether they are involved in the experiment (a client may hold several individuals’ records). This is because the observer can link the test results to the appearance/disappearance of a certain person and inflict harm on that person.
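A DP mechanism with parameters $\epsilon$ and $\delta$ provides a strong criterion for the privacy preservation of distributed data processing systems. Here, $\epsilon$ is the distinguishable bound of all outputs on neighboring datasets in a database, and $\delta$ represents the probability of the event that the ratio of the probabilities for two adjacent datasets cannot be bounded by $e^{\epsilon}$ after adding a privacy-preserving mechanism. With an arbitrarily given $\delta$, a larger $\epsilon$ gives a clearer distinguishability of neighboring datasets and hence a higher risk of privacy violation. Now, we formally define DP as follows.
($(\epsilon, \delta)$-DP): A randomized mechanism $\mathcal{M}: \mathcal{X} \rightarrow \mathcal{R}$ with domain $\mathcal{X}$ and range $\mathcal{R}$ satisfies $(\epsilon, \delta)$-DP, if for all measurable sets $\mathcal{S} \subseteq \mathcal{R}$ and for any two adjacent datasets $\mathcal{D}, \mathcal{D}' \in \mathcal{X}$,
$$\Pr[\mathcal{M}(\mathcal{D}) \in \mathcal{S}] \le e^{\epsilon} \Pr[\mathcal{M}(\mathcal{D}') \in \mathcal{S}] + \delta.$$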
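In this paper, we choose the Gaussian mechanism that adopts $L_2$ norm sensitivity. The sensitivity of a function $s$ can be expressed as
$$\Delta s = \max_{\mathcal{D}, \mathcal{D}'} \| s(\mathcal{D}) - s(\mathcal{D}') \|_2,$$
where $\mathcal{D}$ and $\mathcal{D}'$ are adjacent datasets. The mechanism adds zero-mean Gaussian noise with variance $\sigma^2$ to each coordinate of the output $s(\mathcal{D})$ as
$$\mathcal{M}(\mathcal{D}) = s(\mathcal{D}) + \mathcal{N}(0, \sigma^2 \mathbf{I}),$$
where $\mathbf{I}$ is an identity matrix with the same size as $s(\mathcal{D})$. It satisfies $(\epsilon, \delta)$-DP when we properly select the value of $\sigma$.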
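As a minimal illustration of the Gaussian mechanism described above, the following Python sketch adds independent zero-mean Gaussian noise to each coordinate of an output vector. The function name `gaussian_mechanism`, the example vector and the noise STD are illustrative assumptions; selecting the STD that meets a given privacy level is the subject of the later analysis and is not shown here.

```python
import random


def gaussian_mechanism(output, sigma, rng):
    """Return a copy of `output` with i.i.d. N(0, sigma^2) noise on each coordinate."""
    return [value + rng.gauss(0.0, sigma) for value in output]


# Illustrative values only: a 3-dimensional output and an arbitrary noise STD.
rng = random.Random(0)  # seeded so that the sketch is reproducible
clean = [0.5, -1.0, 2.0]
noisy = gaussian_mechanism(clean, sigma=0.1, rng=rng)
print(noisy)
```

A larger `sigma` hides the contribution of any single sample more strongly, at the cost of a less accurate output.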
III-B Federated Learning
Let us consider a general FL system consisting of an honest-but-curious server and clients. Let denote the local datasets held by client , where and . At the server, the goal is to learn a model over data that resides at the associated clients. Formally, this FL task can be expressed as
where , is the local loss function of the -th client, with and is the total size of all data samples. Generally, the local loss function is given by local empirical risks and has the same expression for different clients. At the server, a model aggregation is performed over the chosen clients. In more detail, is the global model parameter, given by
where is the parameter vector trained at the -th client, and is a subset of , with clients chosen out of clients to participate in the model aggregation.
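The aggregation step can be sketched in a few lines of Python. This is a hedged illustration rather than the exact rule above: it assumes that each chosen client is weighted by its local dataset size, normalized over the chosen clients, and the name `aggregate` and the toy inputs are hypothetical.

```python
def aggregate(local_params, dataset_sizes):
    """Weighted average of the chosen clients' parameter vectors.

    Assumption: the weight of a client is its local dataset size divided by
    the total dataset size of the chosen clients.
    """
    total_size = sum(dataset_sizes)
    dimension = len(local_params[0])
    global_params = [0.0] * dimension
    for params, size in zip(local_params, dataset_sizes):
        weight = size / total_size
        for j in range(dimension):
            global_params[j] += weight * params[j]
    return global_params


# Two chosen clients holding 1 and 3 samples, respectively.
global_params = aggregate([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_params)
```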
IV Privacy and Convergence Analysis
In this section, we first introduce a client-level differential privacy (CDP) algorithm in FL against curious servers. Then, we develop an improved method for analyzing the loss moment under the moments accountant method. We also show how to select the standard deviation (STD) of additive noises in CDP under a certain DP guarantee. Finally, we analyze the convergence bound of the CDP in FL, which reveals an explicit tradeoff between the privacy level and convergence performance.
IV-A Client-level DP
In this subsection, we introduce a CDP algorithm, which can prevent information leakage from uploaded parameters, as depicted in Fig. 1. Algorithm 1 outlines our CDP algorithm for training an effective model with the DP guarantee. We denote by the number of communication rounds, by the initial global parameter, by the STD of additive noises and by the random sampling ratio (). At the beginning of this algorithm, the server broadcasts the initial global parameter to clients. In the -th aggregation, active clients respectively train the parameters by using local databases with preset terminal conditions. After completing the local training and clipping of with threshold , the -th client, , will add noises to the trained parameters , and upload the noised parameters to the server for aggregation. Note that the noised parameters are calculated as , where is the variance of artificial noises and this value is calculated by the clients according to the privacy level , sampling ratio and the number of communication rounds .
Then the server updates the global parameters by aggregating the local parameters integrated with different weights according to (5) and broadcasts them to the clients. Based on the received global parameters, each client will estimate the accuracy using local testing databases and start the next round of the training process based on these received parameters. The FL completes when the aggregation time reaches a preset number of communication rounds and the algorithm returns . Owing to the local perturbations, it will be difficult for the curious server to infer private information of the -th client. In this case, in order to protect the clients’ privacy with -DP, the STD of additive noises will be analyzed in the following sections.
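The client-side part of one CDP round (clipping followed by perturbation) can be sketched as follows. This is a minimal sketch under stated assumptions: the clipping rule is taken to be the common form that divides the parameter vector by max(1, norm / threshold), and the names `clip_by_norm`, `perturb` and `client_upload`, as well as the numerical values, are hypothetical.

```python
import math
import random


def clip_by_norm(params, clip_threshold):
    """Scale `params` so that its L2 norm is at most `clip_threshold`."""
    norm = math.sqrt(sum(p * p for p in params))
    scale = max(1.0, norm / clip_threshold)
    return [p / scale for p in params]


def perturb(params, sigma, rng):
    """Add i.i.d. zero-mean Gaussian noise with STD `sigma` to each coordinate."""
    return [p + rng.gauss(0.0, sigma) for p in params]


def client_upload(trained_params, clip_threshold, sigma, rng):
    """Clip the locally trained parameters, then add noise before uploading."""
    return perturb(clip_by_norm(trained_params, clip_threshold), sigma, rng)


rng = random.Random(0)
clipped = clip_by_norm([3.0, 4.0], clip_threshold=1.0)  # the norm 5 is scaled down to 1
noised = client_upload([3.0, 4.0], clip_threshold=1.0, sigma=0.05, rng=rng)
print(clipped, noised)
```

Clipping before perturbation is what bounds the influence of a single client's parameters, so the noise STD can be calibrated to the threshold rather than to an unbounded norm.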
IV-B Bound of the Moment
According to , using Gaussian mechanism, we can define privacy loss by
and moment generating function by
where is any positive integer,
denotes the probability density function (PDF) of,
denotes the mixture of two Gaussian distributions, is the random sampling ratio,
Considering two Gaussian distributions and used in the moments accountant method, they satisfy the following relationship:
See Appendix A.
Note that the only difference between the and is the factor of and on the exponent. With Lemma IV-B, we can obtain that .
IV-C Sensitivity and Privacy Analysis
According to the definition of DP, we consider two adjacent databases in the -th user, where and have the same size, and only differ by one sample.
Assume that the batch size in the local training is equal to the number of training samples.
Consequently, for the -th client with the training dataset , the local training process can be written into the following form:
where denotes the local training process. Therefore, the sensitivity of the local training process can be given as
Based on Assumption IV-C, we consider the general GD method and have
Note that we can obtain the sensitivity as
where is the clipping threshold to bound .
From (14), we can observe that when the size of training datasets is large, the sensitivity of the aggregation process is comparably small. The intuition is that the maximum difference one single data sample can make is and such a maximum difference is scaled down by a factor of in a local training process since every data sample in contributes equally to the resulting local model. For example, if a model was trained on the records of patients with a certain disease, learning that an individual’s record was among them directly affects his or her privacy. If we increase the size of training datasets, it will be more difficult to determine whether this record was used as part of the model’s training datasets or not.
With this sensitivity, we can design the Gaussian mechanism with -DP requirement in terms of the sampling ratio and the number of communication rounds . The STD of the Gaussian noises can be derived according to the following theorem.
Given the sampling ratio and the number of communication rounds , to guarantee -DP with respect to all the data used in the training datasets, the STD of noises from Gaussian mechanism should satisfy
See Appendix B.
Theorem IV-C quantifies the relation between the noise level and the privacy level . It shows that for a fixed perturbation on gradient, a larger leads to a weaker privacy guarantee (i.e., a larger ). This is because when more clients are involved in computing at each communication round, each client participates in more aggregations. Also, for a given , a larger in the total training process leads to a higher chance of information leakage because the observer may obtain more information about the training datasets. Furthermore, for a given privacy protection level and , a larger value of leads to a larger value of , which helps reduce clients’ concerns about participating in FL because clients are allowed to add more noise to the trained local models. This requires us to choose the parameters carefully in order to have a reasonable privacy level.
IV-D Convergence for Private Federated Learning
In this subsection, we analyze the convergence of the CDP algorithm in FL.
We make the following assumptions on the global loss function defined by , and the -th local loss function for the analysis:
satisfies the Polyak-Lojasiewicz condition with the positive parameter , which implies that , where is the optimal result;
is -Lipschitz smooth, i.e., for any , , , where is a constant determined by the practical loss function;
, where is the learning rate;
For any and , , where is the divergence metric.
Based on the above assumptions and Theorem IV-C, we can obtain the following result, which characterizes the convergence performance of the CDP algorithm.
To guarantee -DP, the convergence upper bound of the CDP algorithm after communication rounds is given by
where , and .
See Appendix C.
Theorem IV-D shows an explicit tradeoff between convergence performance and privacy: When the privacy guarantee is weak (large values of and ), the convergence bound will be small, which indicates a tight convergence to the optimal weights. From our assumptions, we know that and obtain . Next, we examine the relationship between the convergence upper bound and the number of total clients , the number of participant clients and the number of communication rounds .
The upper bound of the loss function is a convex function of for a given and a sufficiently large . In more detail, a larger has a negative impact on the model quality by increasing the amount of noise added in each communication round for a given privacy level (in line with (15)), but it also has a positive impact on the convergence because it reduces the loss function value with more iterations.
With a slight abuse of notation, we consider continuous values of , and . Let denote the RHS of (16) and we have
It can be seen that the first term and fourth term of on the RHS of (17) are always positive. When , and are set to be large enough, we can see that the second term of on the RHS of (17) is small. In this case, we have and the upper bound is convex for .
Then we consider the condition , we have
If and are set to be large enough, we have and the upper bound is convex for .
According to our analysis above, there exists an optimal value of for a given privacy level . However, this optimal value cannot be derived directly from (16), since some parameters, e.g., , and , are difficult to obtain accurately. One possible method for obtaining the optimal is through exhaustive search, i.e., trying different values of and choosing the one with the highest convergence performance. However, exhaustive search is time-consuming and computationally complex. In the next section, we will propose an efficient algorithm for finding a good value of to achieve a high convergence performance.
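To make the cost of the exhaustive search concrete, the following Python sketch evaluates every candidate number of communication rounds and keeps the one with the lowest loss. The function `toy_loss` is a hypothetical stand-in for a complete FL training run: it only mimics the qualitative shape discussed above (an optimization term that shrinks with more rounds plus a noise term that grows with them) and is not the bound in (16).

```python
def exhaustive_search(candidate_rounds, evaluate_loss):
    """Return the candidate with the lowest loss, together with that loss."""
    best_rounds, best_loss = None, float("inf")
    for rounds in candidate_rounds:
        loss = evaluate_loss(rounds)  # in practice, one full training run per candidate
        if loss < best_loss:
            best_rounds, best_loss = rounds, loss
    return best_rounds, best_loss


def toy_loss(rounds):
    """Hypothetical convex stand-in for the final training loss."""
    return 10.0 / rounds + 0.1 * rounds


best_rounds, best_loss = exhaustive_search(range(1, 51), toy_loss)
print(best_rounds, best_loss)
```

Since every candidate requires a separate training run, the cost grows linearly with the size of the search grid, which motivates an online alternative.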
V Discounting Method in CDP
In this section, we first propose a discounting method to improve convergence performance in the training process. Then, we provide a noise recalculation method for obtaining the noise STD in the case of varying during the FL training. Finally, we will summarize our proposed CRD algorithm.
V-A Proposed Discounting Method
Based on the analysis above, we note that we can obtain a smaller STD which will improve the training performance if we reduce the value of slightly when the training performance stops improving. Therefore, we design a CRD algorithm by adjusting the number of communication rounds with a discounting method during the training process to achieve a high convergence performance. The training process of such a CRD algorithm in CDP contains the following steps:
Initializing: The server broadcasts the initial parameters , , and ;
Local training: All active clients locally compute training parameters with local datasets and the global parameter;
Norm clipping: In order to prove the DP guarantee, the influence of each individual example on local parameters should be bounded with a clipping threshold . Each parameter vector will be clipped in norm, i.e., the -th local parameter vector at the -th communication round is replaced by . We can remark that parameter clipping of this form is a popular ingredient of SGD and ML for non-privacy reasons;
Noise adding: Artificial Gaussian noises with a certain STD will be added to the local trained parameters to guarantee -DP;
Parameter uploading: All active clients upload the noised parameters to the server for aggregation;
Model aggregation: The server performs aggregation over the uploaded parameters from clients;
Model broadcasting: The server broadcasts the aggregated parameters and the number of communication rounds to all clients;
Model updating: All clients update their respective models with the aggregated parameters, then test the performance of the updated models and upload the performance to the server;
Communication rounds discounting: When the convergence performance stops improving, the discounting method will be triggered in the server. The server will obtain a smaller than the previous one with a linear discounting factor . This factor can control the decaying speed of . The FL process will be completed when the aggregation time reaches a preset .
In this method, the value of is determined iteratively to ensure a high convergence performance in FL training. Obviously, when the value of is adjusted, we must recalculate a new STD of additive noises in terms of previous training process. The diagrammatic expression of this method is shown in Fig. 2. Therefore, we will develop a noise recalculation method to update the STD of additive noises and alternately in the following subsection.
In order to help clients calculate proper STDs of the noises added to their locally trained parameters, the server broadcasts the number of communication rounds to all clients along with the aggregated parameters.
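The server-side discounting logic in the steps above can be sketched as follows. This is a simplified illustration under several assumptions: the round budget is reduced to the floor of the discounting factor times its current value, the trigger is "the loss did not improve on the best value seen so far", and the losses are supplied as a fixed list instead of being produced by training. The names `discount_rounds` and `run_crd` are hypothetical, and the noise recalculation is only marked by a comment.

```python
import math


def discount_rounds(total_rounds, discount_factor):
    """Assumed linear discounting rule: floor(factor * rounds), at least 1."""
    return max(1, math.floor(discount_factor * total_rounds))


def run_crd(observed_losses, initial_rounds, discount_factor):
    """Track how the round budget evolves; returns (round index, budget) pairs."""
    total_rounds = initial_rounds
    best_loss = float("inf")
    history = []
    for current_round, loss in enumerate(observed_losses, start=1):
        if current_round > total_rounds:
            break  # the (possibly discounted) budget is exhausted
        if loss < best_loss:
            best_loss = loss
        else:
            # Performance stopped improving: shrink the remaining budget.
            total_rounds = discount_rounds(total_rounds, discount_factor)
            # The noise STD would be recalculated here for the new budget.
        history.append((current_round, total_rounds))
    return history


losses = [1.0, 0.8, 0.9, 0.7, 0.75, 0.6, 0.5, 0.4]
history = run_crd(losses, initial_rounds=8, discount_factor=0.5)
print(history)
```

In this toy trace the loss fails to improve at the third round, the budget is halved from 8 to 4, and training stops after the fourth round.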
V-B Noise Recalculation for Varying
Now, let be the index of the current communication round and () is the STD of additive noises at the -th communication round. In our CRD algorithm, with a new , if is greater than , the training will stop. If is less than , we need to recalculate the STD of noises and add them on the local parameters in the following communication round. Considering this, we obtain the following theorem.
After () communication rounds and a new , the STD of additive noises can be given as
See Appendix D.
In Theorem V-B, we can obtain a proper STD of the additive noises based on the previous training process and the value of . From this result, we can find that if we have large STDs (strong privacy guarantee) in the previous training process, i.e., is large, the recalculated STD will be small (weak privacy guarantee), i.e., is small.
We can also note that if the value of is not changed in this communication round, the value of STD will remain unchanged. Considering , and unchanged , from equation (19), we can obtain
VI Experimental Results
In this section, we evaluate the accuracy of our analytical results for different learning tasks. Then, we evaluate our proposed CRD method, and demonstrate the effectiveness of various parameter settings, such as the privacy level, the initial value of and discounting factor.
VI-A Evaluation of Numerical Results
In this subsection, we first describe our numerical validation of the bound in (39) with by varying from to with a step size of . In Fig. 3 (a), we set , , , and . In our validation, our bound is close to but always higher than the real divergence. We also tested cases such as a smaller sampling ratio , a smaller size of local datasets , , and a larger number of clients in Fig. 3 (b), (c) and (d), respectively. We found that the empirical bound always holds and is close under the given conditions, especially for a large enough . Therefore, we conjecture that our bound is valid for the analytical moments accountant of the sampled Gaussian mechanism and leave its formal proof to our future work.
VI-B FL Models and Datasets
We evaluate the training of two different machine learning models on different datasets, which represent a large variety of both small and large models and datasets. The models include support vector machine (SVM) and multi-layer perceptron (MLP).
1) SVM is trained on the IPUMS-US dataset, which are census data extracted from  and contain individual records with attributes including age, education level and so on. The categorical attributes in this dataset are denoted by various integers. The label of each sample describes whether the annual income of this individual is over k. In this model, the loss function of SVM is given by
where is a regularization coefficient, is the -th sample in , is the dataset of -th client and for .
2) MLP network with a single hidden layer containing hidden units, where ReLU units and softmax of classes (corresponding to the digits) are applied. We use the cross-entropy loss function, which can capture the error of the model on the training data. We conduct experiments on the standard MNIST dataset for handwritten digit recognition consisting of training examples and testing examples. Each example is a size gray-level image of handwritten digits from to . In Fig. 4, we show several samples of the standard MNIST dataset with a visual illustration.
Among them, the loss function for SVM is convex and satisfies the first assumption in Assumption IV-D, whereas the loss function for MLP is non-convex and thus does not satisfy this condition. The values of , and are determined by the specific loss function, and we can use estimated values in our experiments . The experimental results in this setting show that our theoretical results and proposed algorithm also work well for models (such as MLP) whose loss functions do not satisfy Assumption IV-D.
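For the SVM model in item 1) above, a local loss of this type can be sketched in Python. The sketch assumes the standard L2-regularized hinge loss with labels in {+1, -1}, averaged over the local samples; the function name `svm_loss` and the toy data are illustrative, and the expression given for the SVM loss above takes precedence wherever the two differ.

```python
def svm_loss(weights, samples, labels, reg_coeff):
    """Assumed form: (reg_coeff / 2) * ||w||^2 + mean hinge loss over the samples."""
    regularizer = 0.5 * reg_coeff * sum(w * w for w in weights)
    hinge_sum = 0.0
    for features, label in zip(samples, labels):
        margin = label * sum(w * x for w, x in zip(weights, features))
        hinge_sum += max(0.0, 1.0 - margin)
    return regularizer + hinge_sum / len(samples)


# Toy local dataset with two samples and labels in {+1, -1}.
weights = [1.0, 0.0]
samples = [[2.0, 0.0], [0.5, 0.0]]
labels = [1, -1]
print(svm_loss(weights, samples, labels, reg_coeff=0.0))
print(svm_loss(weights, samples, labels, reg_coeff=2.0))
```

This loss is convex in the weights, in line with the remark above that the SVM case satisfies the first assumption.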
VI-C Evaluation of Theoretical Results
In this subsection, we verify our theoretical results with SVM and MLP models. In Fig. 5 and 6, we show experimental results of training loss as a function of with various privacy levels under CDP algorithm in FL using SVM and MLP models, respectively. The number of clients, the size of local samples and the privacy parameter are set as , and , respectively. This observation is in line with Remark IV-D, and the reason comes from the fact that a lower privacy level decreases the standard deviation of additive noises and the server can obtain better quality ML model parameters from the clients. Fig. 5 and Fig. 6 also imply that an optimal value of increases almost with respect to the increasing .
Figs. 7 and 8 illustrate the expectation of the training loss of the CDP algorithm in FL using MLP models, by varying the privacy level and the value of . In the first scenario in Fig. 7, where there is no sampling, a large and a small may lead to poor performance. The second scenario in Fig. 7, with sampling ratio , retains the same property.
VI-D Evaluation of the Proposed CRD Algorithm
In this subsection, to evaluate our CRD algorithm, we apply the MLP model with the standard MNIST dataset. Several experimental settings in our case study are introduced as follows. The settings include the following main parts: 1) various initial values of using the discounting method; 2) various privacy levels using the CRD algorithm; 3) various discounting factors using the CRD algorithm. We evaluate the effectiveness of the discounting method, and compare the results with a uniform privacy budget allocation algorithm adopted by Abadi et al. in the CDP algorithm, named original CDP. Furthermore, we also use the CDP algorithm with the optimal (obtained by searching) as a benchmark, named optimal CDP.
Initial Value of . In the previous experiments, the initial value of is set as the default. To examine the effect of initial , we vary its value from to and measure the model convergence performance under several fixed privacy levels. We also choose this handwritten digit recognition task with the number of clients , the size of local samples , the discounting factor and the DP parameter . Fig. 9 shows that when is closer to the optimal value of (the optimal value of by searching is shown in the above subsection), we will obtain a better convergence performance. If we can obtain the optimal value of at the beginning of the CDP algorithm, we will have the best convergence performance. This observation is also in line with Remark IV-D.
Privacy Level. We choose a handwritten digit recognition task with the number of clients , the initial number of communication rounds , the size of local samples and the DP parameter . We also set two different sampling ratios and , which correspond to Fig. 10 (a) and (b), respectively. In Fig. 10, we describe how the value of the loss function changes with various values of the privacy level under original CDP (Algorithm 1), CDP with CRD (Algorithm 2) and optimal CDP (obtaining the optimal by searching). Fig. 11 shows the test accuracy corresponding to Fig. 10. We can note that using Algorithm 2 with discounting factor can greatly improve the convergence performance in Fig. 10 (a) and (b), which is close to the optimal results. In Fig. 12, we choose the same parameters as in Fig. 10 and show the change of during training in one experiment under the CDP algorithm with CRD method (). In Fig. 12, we can note that a smaller privacy level leads to an earlier end of training in the CDP algorithm with CRD method. The intuition is that a larger can lead to a higher chance of information leakage and a larger STD of additive noises. Then, the CRD method may be triggered and a decreased will be broadcast to the chosen clients from the server.
Discounting Factor. In Fig. 13, we vary the value of from to and plot the convergence results of the loss function. The number of clients, the size of local samples, the initial number of communication rounds and the DP parameter are set as , , and , respectively. We observe that when the privacy level is fixed, a larger results in a slower decay of , which means more careful adjustments of during training and benefits convergence performance. Among the various values of , when we choose , the CDP algorithm with CRD method has the best convergence performance. However, this convergence performance is still worse than that of the optimal CDP, which is consistent with Remark IV-D. In Fig. 14, we show the communication round consumption (the number of required communication rounds) with various discounting factors using the same parameters as in Fig. 13. We find that more careful adjustments (corresponding to a larger ) lead to a higher communication round consumption. In Fig. 14, we can also note that the optimal CDP has the best convergence performance but a large communication round consumption. Hence, we can conclude that there is a tradeoff between communication round consumption and convergence performance in choosing . As future work, it is of great interest to analytically evaluate the optimum value of for which the loss function is minimized.
In this paper, we have introduced a CDP algorithm in FL to preserve clients’ privacy and proved that the CDP algorithm can satisfy the requirement of DP under a certain privacy level by properly selecting the STD of additive noises. Then, we have developed a convergence bound on the loss function in the CDP algorithm. Our developed bound reveals that there is an optimal value of communication rounds () in terms of convergence performance for a given protection level, which inspires us to design an intelligent scheme for adaptively choosing the value of . To tackle this problem, we have proposed a CRD method for training FL models, which is triggered when the convergence performance stops improving. Our experiments have demonstrated that this discounting method in the CDP algorithm can obtain a better tradeoff between convergence performance and privacy levels. We can also note that various initial values of and discounting factors lead to different convergence results.
Appendix A Proof of Lemma IV-B
Here, we want to compare the values of and . Hence, we examine the property of and rewrite it as inequality (23).
Transforming the negative part of this integral, we have inequality (24).
Hence, we can obtain that
Then, let us use as
In order to develop the monotonicity of , we define , and then have equation (29).
Then, we define and we consider the condition that , we know
Therefore, we can obtain
And then, we can rewrite as equation (32) on the top of the next page.
Because , we know . Considering , we can conclude that and . This completes the proof.
Appendix B Proof of Theorem IV-C
In this proof, we define , , and . Then, we have and . Hence, the -th moment can be expressed as
Here, we want to bound and .
Based on Lemma IV-B, let us consider , we have
Assuming for , we have