I Introduction
With AlphaGo's remarkable success, big-data-driven artificial intelligence (AI) is expected to be applied soon in all aspects of daily life, including medical care, food and agriculture, and intelligent transportation systems. At the same time, the rapid proliferation of the Internet of Things (IoT) calls for secure and reliable data mining and learning in distributed systems [1, 2, 3]. When integrating AI into a variety of IoT applications, distributed machine learning (ML) is remarkably effective for many data processing tasks, defining parameterized functions from inputs to outputs as compositions of basic building blocks [4, 5]. Federated learning (FL), a recent advance in distributed ML, acquires and processes data locally at the client side, and then transmits the updated ML parameters to a central server for aggregation, i.e., averaging of these parameters [6, 7, 8]. Typically, clients in FL are distributed devices such as sensors, wearable devices, or mobile phones. The goal of FL is to fit a model generated by an empirical risk minimization (ERM) objective. However, FL also poses several key challenges, such as private information leakage, expensive communication costs between servers and clients, and device variability [9, 10, 11, 12, 13, 14].
Generally, distributed stochastic gradient descent (SGD) is adopted in FL for training ML models. In [15, 16], bounds for FL convergence performance were developed based on distributed SGD, with a one-step local update before global aggregations. The work in [17] considered partially global aggregations, where after each local update step, parameter aggregation is performed over a non-empty subset of the set of clients. In order to analyze the convergence more effectively, federated proximal (FedProx) was proposed [18] by adding regularization on each local loss function. The work in [19] obtained the convergence bound of SGD-based FL that incorporates non-independent-and-identically-distributed (non-i.i.d.) data distributions among clients.
At the same time, with the ever-increasing awareness of the security of personal information, privacy preservation has become a significant worldwide issue, especially for big data applications and distributed learning systems. One prominent advantage of FL is that it enables local training without personal data exchange between the server and clients, thereby protecting clients' data from being eavesdropped by hidden adversaries. Nevertheless, private information can still be divulged to some extent when adversaries analyze the differences of related parameters trained and uploaded by the clients, e.g., weights trained in neural networks [20, 21, 22].
A natural approach to preventing differential attacks on private information is to add artificial noises, known as differentially private (DP) techniques [23, 24]. Existing works on DP-based learning algorithms include local DP (LDP) [25, 26, 27], DP-based distributed SGD [28, 29], and DP meta learning [30]. In LDP, each client perturbs its information locally and only sends a randomized version to a server, thereby protecting both the clients and the server against private information leakage. The work in [26] proposed solutions for building an LDP-compliant SGD, which powers a variety of important ML tasks. The work in [27] considered distribution estimation at the server over data uploaded from clients while protecting these data with LDP. The work in [28] improved the computational efficiency of DP-based SGD by tracking detailed information of the privacy loss, and obtained accurate estimates of the overall privacy loss. The work in [29] proposed novel DP-based SGD algorithms and analyzed their performance bounds, which are shown to be related to privacy levels and the sizes of datasets. Also, the work in [30] focused on the class of gradient-based parameter-transfer methods and developed a DP-based meta learning algorithm that not only satisfies the privacy requirement but also retains provable learning performance in convex settings.
More specifically, DP-based FL approaches are usually devoted to capturing the tradeoff between privacy and convergence performance in the training process. The work in [31] proposed an FL algorithm that takes the preservation of clients' privacy into account. This algorithm can achieve a good training performance at a given privacy level, especially when there is a sufficiently large number of participating clients. The work in [32] presented an alternative approach that utilizes both DP and secure multiparty computation (SMC) to prevent differential attacks. However, these two works on DP-based FL design have not taken into account privacy protection during the parameter uploading stage, i.e., the clients' private information can potentially be intercepted by hidden adversaries when uploading the training results to the server. Moreover, these two works only showed empirical results by simulations, and lacked a theoretical analysis of the FL system, such as the tradeoff between privacy, convergence performance, and convergence rate. To date, a theoretical analysis of the convergence behavior of FL with privacy-preserving noise perturbations has not been detailed in the existing literature, and it is the major focus of this paper.
In this paper, to effectively prevent differential attacks, we propose a novel framework based on the concept of differential privacy (DP), in which each client perturbs its trained parameters locally by purposely adding noises before uploading them to the server for aggregation, namely, noising before model aggregation FL (NbAFL). To the best of the authors' knowledge, this is the first work of its kind that theoretically analyzes the convergence property of differentially private FL algorithms. First, we prove that the proposed NbAFL scheme satisfies the requirement of DP in terms of global data under a certain noise perturbation level with Gaussian noises by properly adapting their variances. Then, we theoretically develop a convergence bound on the loss function of the trained FL model in the NbAFL with artificial Gaussian noises. Our developed bound reveals the following three key properties: 1) There is a tradeoff between the convergence performance and privacy protection levels, i.e., a better convergence performance leads to a lower protection level; 2) Increasing the number $N$ of overall clients participating in FL can improve the convergence performance, given a fixed privacy protection level; 3) There is an optimal number of maximum aggregation times in terms of convergence performance for a given protection level. Furthermore, we propose a random scheduling strategy, where $K$ ($1 < K < N$) clients are randomly selected from the $N$ overall clients to participate in each aggregation. We also develop the corresponding convergence bound on the loss function in this case. From our analysis, the random scheduling strategy retains the above three properties. Also, we find that there exists an optimal value of $K$ that achieves the best convergence performance at a fixed privacy level. Evaluations demonstrate that our theoretical results are consistent with simulations. Therefore, our analytical results are helpful for the design of privacy-preserving FL architectures with different tradeoff requirements on convergence performance and privacy levels.
The remainder of this paper is organized as follows. In Section II, we introduce background on FL, DP, and a conventional DP-based FL algorithm. In Section III, we detail the proposed NbAFL and analyze its privacy performance based on DP. In Section IV, we analyze the convergence bound of NbAFL and reveal the relationship between privacy levels, convergence performance, the number of clients, and the number of global aggregations. In Section V, we propose the random scheduling scheme and develop its convergence bound. We show the analytical results and simulations in Section VI. We conclude the paper in Section VII. A summary of basic concepts and notations is provided in Tab. I.
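Table I: Summary of main notation
$\mathcal{M}$ : A randomized mechanism for DP
$\mathcal{D}_i, \mathcal{D}_i'$ : Adjacent databases
$\epsilon, \delta$ : The parameters related to DP
$\mathcal{C}_i$ : The $i$-th client
$\mathcal{D}_i$ : The database held by the owner $\mathcal{C}_i$
$\mathcal{D}$ : The database held by all the clients
$|\cdot|$ : The cardinality of a set
$N$ : Total number of all clients
$K$ : The number of chosen clients ($1 < K < N$)
$t$ : The index of the $t$-th aggregation
$T$ : The number of aggregation times
$\mathbf{w}$ : The vector of model parameters
$F(\mathbf{w})$ : Global loss function
$F_i(\mathbf{w})$ : Local loss function from the $i$-th client
$\mu$ : A presetting constant of the proximal term
$\mathbf{w}_i^{(t)}$ : Local uploading parameters of the $i$-th client
$\mathbf{w}^{(0)}$ : Initial parameters of the global model
$\mathbf{w}^{(t)}$ : Global parameters generated from all $N$ local parameters at the $t$-th aggregation
$\mathbf{v}^{(t)}$ : Global parameters generated from $K$ clients' parameters at the $t$-th aggregation
$\mathbf{w}^{*}$ : True optimal model parameters that minimize $F(\mathbf{w})$
$\mathcal{W}$ : The set of all the local parameters
$\widetilde{\mathcal{W}}$ : The set of all local parameters with perturbation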
II Preliminaries
In this section, we present preliminaries and related background knowledge on FL and DP. We also introduce a conventional DP-based FL algorithm that serves as a benchmark in the analysis that follows.
II-A Federated Learning
Let us consider a general FL system consisting of one server and $N$ clients, as depicted in Fig. 1. Let $\mathcal{D}_i$ denote the local database held by the client $\mathcal{C}_i$, where $i \in \{1, 2, \ldots, N\}$. At the server, the goal is to learn a model over data that reside at the $N$ associated clients. An active client participating in the local training needs to find a vector $\mathbf{w}$ of an AI model to minimize a certain loss function. Formally, the server aggregates the weights sent from the $N$ clients as
$$\mathbf{w} = \sum_{i=1}^{N} p_i \mathbf{w}_i, \tag{1}$$
where $\mathbf{w}_i$ is the parameter vector trained at the $i$-th client, and $\mathbf{w}$ is the parameter vector after aggregating at the server. Such an optimization problem can be formulated as
$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \sum_{i=1}^{N} p_i F_i(\mathbf{w}, \mathcal{D}_i), \tag{2}$$
where $F_i(\cdot)$ is the local loss function of the $i$-th client, $N$ is the number of clients, $p_i = |\mathcal{D}_i|/|\mathcal{D}| \ge 0$ with $\sum_{i=1}^{N} p_i = 1$, and $|\mathcal{D}| = \sum_{i=1}^{N} |\mathcal{D}_i|$ is the total size of all data samples. Generally, the local loss function $F_i(\cdot)$ is given by local empirical risks. The training process of such an FL system usually contains the following four steps:
1) Local training: All active clients locally compute training gradients or parameters and send locally trained ML parameters to the server;
2) Model aggregating: The server performs secure aggregation over the uploaded parameters from the $N$ clients without learning local information;
3) Parameters broadcasting: The server broadcasts the aggregated parameters to the $N$ clients;
4) Model updating: All clients update their respective models with the aggregated parameters and test the performance of the updated models.
In the FL process, clients with the same data structure collaboratively learn an ML model with the help of a cloud server. After a sufficient number of local training and update exchanges between the server and its associated clients, the solution to the optimization problem (2) converges to that of the globally optimal learning model.
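The aggregation step in (1) can be sketched in a few lines. The example below is a minimal illustration only: the function name `aggregate` and the toy parameter vectors are assumptions, and the weights are taken to be the dataset-size fractions described in the text.

```python
from typing import List, Sequence


def aggregate(local_params: Sequence[Sequence[float]],
              dataset_sizes: Sequence[int]) -> List[float]:
    """Weighted average of client parameter vectors.

    Each client i is weighted by p_i = |D_i| / |D|, where |D| is the
    total number of samples over all clients (illustrative helper).
    """
    if not local_params or len(local_params) != len(dataset_sizes):
        raise ValueError("need exactly one dataset size per client")
    total = sum(dataset_sizes)
    dim = len(local_params[0])
    aggregated = [0.0] * dim
    for params, size in zip(local_params, dataset_sizes):
        p_i = size / total
        for j in range(dim):
            aggregated[j] += p_i * params[j]
    return aggregated


# Toy example: three clients, the third holds half of all samples.
clients = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 100, 200]
global_params = aggregate(clients, sizes)
print(global_params)
```

Clients with larger local datasets pull the aggregate toward their own parameters, which is the behaviour the weights in (1) are meant to capture.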
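II-B Differential Privacy
$(\epsilon, \delta)$-DP provides a strong criterion for privacy preservation of distributed data processing systems. Here, $\epsilon > 0$ is the distinguishable bound of all outputs on neighboring datasets $\mathcal{D}_i, \mathcal{D}_i'$ in a database, and $\delta$ represents the probability of the event that the ratio of the probabilities for two adjacent datasets $\mathcal{D}_i, \mathcal{D}_i'$ cannot be bounded by $e^{\epsilon}$ after adding a privacy preserving mechanism. With an arbitrarily given $\delta$, a privacy preserving mechanism with a larger $\epsilon$ gives a clearer distinguishability of neighboring datasets and hence a higher risk of privacy violation. We now formally define DP as follows.
Definition 1 ($(\epsilon, \delta)$-DP [23]): A randomized mechanism $\mathcal{M}: \mathcal{X} \rightarrow \mathcal{R}$ with domain $\mathcal{X}$ and range $\mathcal{R}$ satisfies $(\epsilon, \delta)$-DP, if for all measurable sets $\mathcal{S} \subseteq \mathcal{R}$ and for any two adjacent databases $\mathcal{D}_i, \mathcal{D}_i' \in \mathcal{X}$,
$$\Pr[\mathcal{M}(\mathcal{D}_i) \in \mathcal{S}] \le e^{\epsilon} \Pr[\mathcal{M}(\mathcal{D}_i') \in \mathcal{S}] + \delta. \tag{3}$$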
For numerical data, a Gaussian mechanism defined in [23] can be used to guarantee $(\epsilon, \delta)$-DP. According to [23], we present the following DP mechanism by adding artificial Gaussian noises.
In order to ensure that the given noise distribution $n \sim \mathcal{N}(0, \sigma^2)$ preserves $(\epsilon, \delta)$-DP, where $\mathcal{N}$ represents the Gaussian distribution, we choose the noise scale $\sigma \ge c\Delta s/\epsilon$ and the constant $c \ge \sqrt{2\ln(1.25/\delta)}$ for $\epsilon \in (0, 1)$. In this result, $n$ is the value of an additive noise sample for a data point in the dataset, $\Delta s$ is the sensitivity of the function $s$ given by $\Delta s = \max_{\mathcal{D}_i, \mathcal{D}_i'} \|s(\mathcal{D}_i) - s(\mathcal{D}_i')\|$, and $s$ is a real-valued function.
Considering the above DP mechanism, how to choose an appropriate level of noise remains a significant research problem, since it affects both the privacy guarantee of clients and the convergence rate of the FL process.
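As a concrete illustration of the Gaussian mechanism of [23], the sketch below computes the smallest admissible noise scale, $\sigma = c\Delta s/\epsilon$ with $c = \sqrt{2\ln(1.25/\delta)}$, and perturbs a small parameter vector. The function names and the example values are assumptions made for illustration, not part of the original algorithm description.

```python
import math
import random
from typing import List, Sequence


def gaussian_noise_scale(sensitivity: float, epsilon: float, delta: float) -> float:
    """Smallest noise std allowed by the Gaussian mechanism bound.

    Uses sigma = c * sensitivity / epsilon with c = sqrt(2 ln(1.25 / delta)),
    which is stated for 0 < epsilon < 1.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("the bound is stated for 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    c = math.sqrt(2.0 * math.log(1.25 / delta))
    return c * sensitivity / epsilon


def perturb(values: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Add independent zero-mean Gaussian noise to every entry."""
    return [v + rng.gauss(0.0, sigma) for v in values]


# Example values are arbitrary; a seeded generator keeps the run repeatable.
sigma = gaussian_noise_scale(sensitivity=2.0, epsilon=0.5, delta=0.01)
noisy = perturb([0.1, -0.3, 0.7], sigma, random.Random(0))
print(sigma, noisy)
```

Halving $\epsilon$ doubles the required noise scale, which is the linear relation between the protection level and the noise standard deviation that the later sections rely on.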
II-C Noising after Aggregation FL (NaAFL)
Conventionally, adding overly conservative noises to the aggregated parameters, i.e., $\widetilde{\mathbf{w}} = \mathbf{w} + \mathbf{n}$, is an effective method in FL to protect privacy. This DP mechanism, termed noising after model aggregation FL (NaAFL), can be described as
$$\widetilde{\mathbf{w}} = \sum_{i=1}^{N} p_i \frac{\mathbf{w}_i}{\max\left(1, \|\mathbf{w}_i\|/C\right)} + \mathbf{n}, \tag{4}$$
where $\mathbf{n}$ is the noise vector and $C$ is a clipping threshold for bounding $\mathbf{w}_i$. Each element in $\mathbf{n}$ represents an additive zero-mean Gaussian noise sample with variance $\sigma^2$. The NaAFL algorithm averages over all the individual training results and adds a Gaussian noise to each of the averaged model parameters. In other words, it attempts to protect the privacy of training data by working only on the aggregated parameters that result from the training process. In contrast to the conventional NaAFL, our proposed NbAFL applies local noise perturbations at the clients, which prevents hidden adversaries from inferring the clients' information by analyzing their uploaded parameters, as detailed in the following.
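The NaAFL procedure described above (clip each local result, average, then add noise once at the server) can be sketched as follows. This is an illustration under stated assumptions: clipping is assumed to rescale a vector by $\max(1, \|\mathbf{w}_i\|/C)$, the weights are assumed to sum to one, and all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def clip(params: Sequence[float], C: float) -> List[float]:
    """Rescale params so that its Euclidean norm is at most C (assumed clipping rule)."""
    norm = math.sqrt(sum(x * x for x in params))
    scale = max(1.0, norm / C)
    return [x / scale for x in params]


def naafl_aggregate(local_params: Sequence[Sequence[float]],
                    weights: Sequence[float],
                    C: float,
                    sigma: float,
                    rng: random.Random) -> List[float]:
    """Average clipped client parameters, then add Gaussian noise at the server."""
    dim = len(local_params[0])
    averaged = [0.0] * dim
    for params, p in zip(local_params, weights):
        for j, x in enumerate(clip(params, C)):
            averaged[j] += p * x
    return [a + rng.gauss(0.0, sigma) for a in averaged]


clipped = clip([3.0, 4.0], C=1.0)
noisy_global = naafl_aggregate([[3.0, 4.0], [0.3, 0.4]], [0.5, 0.5],
                               C=1.0, sigma=0.1, rng=random.Random(0))
print(clipped, noisy_global)
```

Note that the individual uploads are never perturbed in this scheme, so anyone observing the uplink sees the exact clipped parameters. This is the gap that noising before aggregation is meant to close.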
III Federated Learning with Differential Privacy
In this section, we first introduce the concept of global DP and analyze the DP performance in the context of FL. Then we propose the NbAFL scheme that can satisfy the DP requirement by adding proper noisy perturbations at both the clients and the server.
III-A Threat Model
The fundamental purpose of FL is to build an ML model based on databases that reside locally in multiple clients while preventing the leakage of personal data. However, potential adversaries may still be able to infer clients' private information by analyzing the parameters uploaded by the clients, as depicted in Fig. 1. The privacy leakage via analyzing parameters can happen in both the uploading (through uplink channels) and broadcasting (through downlink channels) phases.
The server in this paper is assumed to be honest. However, there are external adversaries targeting clients' private information by analyzing the ML parameters. In addition, several smart adversaries may be disguised as clients to access private information of honest clients. We assume that uplink channels are more secure than downlink broadcast channels, since clients can be assigned to different channels (e.g., time slots, frequency bands) dynamically in each uploading time, while downlink channels are broadcast. Hence, we assume at most $L$ ($L \le T$) exposures of uploaded parameters from each client in the uplink and $T$ exposures of aggregated parameters in the downlink, where $T$ is the number of aggregation times.
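III-B Global Differential Privacy
Here, we define a global $(\epsilon, \delta)$-DP requirement for both uplink and downlink channels. From the uplink perspective, using a clipping technique, we can ensure that $\|\mathbf{w}_i\| \le C$, where $\mathbf{w}_i$ denotes the $i$-th client's local training parameters without perturbation and $C$ is a clipping threshold for bounding $\mathbf{w}_i$. We define $s_{\mathrm{U}}^{\mathcal{D}_i} \triangleq \mathbf{w}_i = \arg\min_{\mathbf{w}} F_i(\mathbf{w}, \mathcal{D}_i)$, where $\mathcal{D}_i$ is the $i$-th client's database. Thus, the sensitivity of $s_{\mathrm{U}}^{\mathcal{D}_i}$ can be expressed as
$$\Delta s_{\mathrm{U}} = \max_{\mathcal{D}_i, \mathcal{D}_i'} \left\| s_{\mathrm{U}}^{\mathcal{D}_i} - s_{\mathrm{U}}^{\mathcal{D}_i'} \right\| = 2C. \tag{5}$$
To ensure $(\epsilon, \delta)$-DP for each client in the uplink in one exposure, we set the noise scale, represented by the standard deviation of the additive Gaussian noise, as $\sigma_{\mathrm{U}} = c\Delta s_{\mathrm{U}}/\epsilon$. Considering $L$ exposures of local parameters, we need to set $\sigma_{\mathrm{U}} = cL\Delta s_{\mathrm{U}}/\epsilon$ due to the linear relation between $\epsilon$ and $\sigma_{\mathrm{U}}$ with the Gaussian mechanism.
From the downlink perspective, the aggregation operation at the server can be expressed as
$$\mathbf{w} = s_{\mathrm{D}}(\mathcal{W}) \triangleq p_1 \mathbf{w}_1 + \cdots + p_i \mathbf{w}_i + \cdots + p_N \mathbf{w}_N, \tag{6}$$
where $\mathcal{W} = \{\mathbf{w}_1, \ldots, \mathbf{w}_N\}$ is the set of all uploaded parameters from the $N$ clients, and $\mathbf{w}$ is the aggregated parameters at the server to be broadcast to the clients. Regarding the sensitivity of $s_{\mathrm{D}}$ with respect to the database of the $i$-th client, i.e., $\Delta s_{\mathrm{D}}^{\mathcal{D}_i}$, we have the following lemma.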
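Lemma 1 (Sensitivity of the aggregation operation):
In the FL training process, the sensitivity of the aggregation operation with respect to the database of the $i$-th client is given by
$$\Delta s_{\mathrm{D}}^{\mathcal{D}_i} = 2Cp_i. \tag{7}$$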
Proof:
See Appendix A.
Remark 1:
From the above lemma, the worst-case sensitivity over all clients is governed by the largest weight $p_i$. To achieve a small sensitivity $\Delta s_{\mathrm{D}}$, the ideal condition is therefore that all the clients use the same size of local datasets for training, i.e., $p_i = 1/N$.
From the above remark, when setting $p_i = 1/N$, $\forall i$, we can obtain the optimal value of the sensitivity $\Delta s_{\mathrm{D}} = 2C/N$. With this optimal value, we have the following theorem regarding how to add noises to the aggregated parameters at the server to satisfy the $(\epsilon, \delta)$-DP criterion in the downlink channels.
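Theorem 1 (DP guarantee for downlink channels):
To ensure $(\epsilon, \delta)$-DP in the downlink channels with $T$ aggregations, the standard deviation of the Gaussian noises that are added to the aggregated parameter $\mathbf{w}$ by the server can be given as
$$\sigma_{\mathrm{D}} = \begin{cases} \dfrac{2cC\sqrt{T^2 - L^2 N}}{N\epsilon}, & T > L\sqrt{N}, \\ 0, & T \le L\sqrt{N}. \end{cases} \tag{8}$$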
Proof:
See Appendix B.
Theorem 1 shows that to satisfy an $(\epsilon, \delta)$-DP requirement for the downlink channels, additional noises need to be added by the server. For a given $L$, the standard deviation of the additional noises depends on the relationship between the number of aggregation times $T$ and the number of clients $N$. The intuition is that a larger $T$ can lead to a higher chance of information leakage, while a larger number of clients is helpful for hiding their private information. This theorem also provides the variance of the noises that should be added to the aggregated parameters. Based on the above results, we propose the following NbAFL algorithm.
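The dependence of the server-side noise on the number of aggregations, exposures, and clients can be explored numerically. The sketch below is a reading of the argument in Appendix B under explicit assumptions: equal dataset sizes ($p_i = 1/N$), an uplink sensitivity of $2C$, a downlink sensitivity of $2C/N$, a total required downlink standard deviation of $cT \cdot 2C/(N\epsilon)$, and uplink noises that already contribute $\sigma_{\mathrm{U}}/\sqrt{N}$ after averaging. The function name and example values are hypothetical.

```python
import math


def downlink_noise_std(C: float, epsilon: float, delta: float,
                       N: int, T: int, L: int) -> float:
    """Extra Gaussian noise std the server adds before broadcasting.

    required: std needed in the downlink for T exposures (assumed sensitivity 2C/N).
    from_uplink: std already present after averaging N uplink noises
    (assumed sensitivity 2C, L exposures). The server only fills the gap.
    """
    c = math.sqrt(2.0 * math.log(1.25 / delta))
    required = c * T * (2.0 * C / N) / epsilon
    from_uplink = (c * L * 2.0 * C / epsilon) / math.sqrt(N)
    gap = required ** 2 - from_uplink ** 2
    return math.sqrt(gap) if gap > 0.0 else 0.0


few_rounds = downlink_noise_std(C=1.0, epsilon=0.5, delta=0.01, N=100, T=5, L=1)
many_rounds = downlink_noise_std(C=1.0, epsilon=0.5, delta=0.01, N=100, T=25, L=1)
print(few_rounds, many_rounds)
```

With few aggregation rounds the averaged uplink noise already suffices and the server adds nothing. Once $T$ exceeds $L\sqrt{N}$, the server has to make up the difference.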
III-C Proposed NbAFL
Algorithm 1 outlines our NbAFL for training an effective model with a global $(\epsilon, \delta)$-DP requirement. We denote by $\mu$ the presetting constant of the proximal term and by $\mathbf{w}^{(0)}$ the initial global parameter. At the beginning of this algorithm, the required privacy level parameters $(\epsilon, \delta)$ are set and the initial global parameter $\mathbf{w}^{(0)}$ is sent to the clients. In the $t$-th aggregation, the $N$ active clients respectively train the parameters by using local databases with preset termination conditions. After completing the local training, the $i$-th client, $i \in \{1, \ldots, N\}$, adds noises to the trained parameters $\mathbf{w}_i^{(t)}$, and uploads the noised parameters $\widetilde{\mathbf{w}}_i^{(t)}$ to the server for aggregation.
Then the server updates the global parameters $\mathbf{w}^{(t)}$ by aggregating the local parameters integrated with different weights. Additive noises are added to $\mathbf{w}^{(t)}$ according to Theorem 1 before being broadcast to the clients. Based on the received global parameters $\widetilde{\mathbf{w}}^{(t)}$, each client estimates the accuracy by using local testing databases and starts the next round of the training process based on these received parameters. The FL process completes after the aggregation time reaches a preset number $T$ and the algorithm returns $\widetilde{\mathbf{w}}^{(T)}$.
Now, let us focus on the privacy preservation performance of the NbAFL. First, the set of all local parameters, denoted by $\mathcal{W}$, is received by the server. Owing to the local perturbations in the NbAFL, it will be difficult for malicious adversaries to infer the information at the $i$-th client from its uploaded parameters $\widetilde{\mathbf{w}}_i$. After the model aggregation, the aggregated parameters $\mathbf{w}$ are sent back to the clients via broadcast channels. This poses a threat to clients' privacy, as potential adversaries may reveal sensitive information about individual clients from $\mathbf{w}$. In this case, additive noises may be added to $\mathbf{w}$ based on Theorem 1.
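The round structure described above (local training with a proximal term, clipping, client-side noising, aggregation, server-side noising, broadcast) can be summarised in a toy simulation. This is not the authors' Algorithm 1: it assumes scalar parameters, a hypothetical quadratic local loss $\frac{1}{2}(w - a_i)^2$ whose proximal update has a closed form, equal client weights, and the noise scales derived under the equal-dataset-size assumption. All names are illustrative.

```python
import math
import random
from typing import List, Sequence


def clip(w: float, C: float) -> float:
    """Scale a scalar parameter so that |w| <= C."""
    return w / max(1.0, abs(w) / C)


def nbafl(targets: Sequence[float], T: int, L: int, epsilon: float,
          delta: float, C: float, mu: float, seed: int = 0) -> List[float]:
    """Toy noising-before-aggregation loop; returns the broadcast value per round."""
    rng = random.Random(seed)
    N = len(targets)
    c = math.sqrt(2.0 * math.log(1.25 / delta))
    sigma_u = c * L * 2.0 * C / epsilon            # client-side std (assumed sensitivity 2C)
    gap = T * T - L * L * N
    sigma_d = 2.0 * c * C * math.sqrt(gap) / (N * epsilon) if gap > 0 else 0.0
    w_global = 0.0
    history = []
    for _ in range(T):
        uploads = []
        for a in targets:
            # Closed-form minimiser of 0.5*(w - a)^2 + 0.5*mu*(w - w_global)^2.
            w_local = (a + mu * w_global) / (1.0 + mu)
            uploads.append(clip(w_local, C) + rng.gauss(0.0, sigma_u))
        # Equal-weight aggregation, then server-side noise before broadcast.
        w_global = sum(uploads) / N + rng.gauss(0.0, sigma_d)
        history.append(w_global)
    return history


history = nbafl([0.2, 0.4, 0.6], T=5, L=1, epsilon=0.9, delta=0.01, C=1.0, mu=0.1)
print(history)
```

Because every upload is perturbed before it leaves the client, an observer of the uplink never sees the exact local result, while the server-side noise covers the broadcast.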
IV Convergence Analysis on NbAFL
In this section, we are ready to analyze the convergence performance of the proposed NbAFL. First, we analyze the expected increment of adjacent aggregations in the loss function with Gaussian noises. Then, we focus on deriving the convergence property under the global DP requirement.
For the convenience of the analysis, we make the following assumptions on the loss function and network parameters.
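Assumption 1:
We make assumptions on the global loss function $F(\cdot)$ and the $i$-th local loss function $F_i(\cdot)$ as follows:
1) $F_i(\mathbf{w})$ is convex;
2) $F_i(\mathbf{w})$ satisfies the Polyak-Lojasiewicz condition with the positive parameter $l$, which implies that $F(\mathbf{w}) - F(\mathbf{w}^{*}) \le \frac{1}{2l}\|\nabla F(\mathbf{w})\|^2$, where $\mathbf{w}^{*}$ is the optimal result;
3) $F(\mathbf{w}^{(0)}) - F(\mathbf{w}^{*}) = \Theta$;
4) $F_i(\mathbf{w})$ is $\beta$-Lipschitz, i.e., $\|F_i(\mathbf{w}) - F_i(\mathbf{w}')\| \le \beta\|\mathbf{w} - \mathbf{w}'\|$, for any $\mathbf{w}$, $\mathbf{w}'$;
5) $F_i(\mathbf{w})$ is $\rho$-Lipschitz smooth, i.e., $\|\nabla F_i(\mathbf{w}) - \nabla F_i(\mathbf{w}')\| \le \rho\|\mathbf{w} - \mathbf{w}'\|$, for any $\mathbf{w}$, $\mathbf{w}'$, where $\rho$ is a constant determined by the practical loss function;
6) For any $i$ and $\mathbf{w}$, $\|\nabla F_i(\mathbf{w}) - \nabla F(\mathbf{w})\| \le \varepsilon_i$, where $\varepsilon_i$ is the divergence metric.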
Similar to the gradient divergence, the divergence metric is the metric to capture the divergence between the gradients of the local loss functions and that of the aggregated loss function, which is essential for analyzing SGD. The divergence is related to how the data is distributed at different nodes. Using Assumption 1, we then have the following lemma.
Lemma 2 (Dissimilarity of various clients):
For a given ML parameter , there exists satisfying
(9) 
Proof:
See Appendix C.
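The divergence between local gradients and the gradient of the aggregated loss can be made concrete with a toy model. The sketch below assumes hypothetical scalar quadratic local losses $F_i(w) = \frac{1}{2}(w - a_i)^2$, so that $\nabla F_i(w) = w - a_i$ and the global gradient is the weighted average of the local ones. The function name and values are illustrative only.

```python
from typing import List, Sequence


def gradient_divergence(targets: Sequence[float],
                        weights: Sequence[float],
                        w: float) -> List[float]:
    """Per-client gap |grad F_i(w) - grad F(w)| for F_i(w) = 0.5 * (w - a_i)^2."""
    local_grads = [w - a for a in targets]
    global_grad = sum(p * g for p, g in zip(weights, local_grads))
    return [abs(g - global_grad) for g in local_grads]


# Heterogeneous clients: the local optima differ, so the gradients diverge.
heterogeneous = gradient_divergence([1.0, 3.0], [0.5, 0.5], w=0.0)
# Identical clients: every local gradient equals the global one.
identical = gradient_divergence([2.0, 2.0], [0.5, 0.5], w=0.0)
print(heterogeneous, identical)
```

In this toy model the divergence does not depend on $w$ and vanishes exactly when all clients share the same local loss, which matches the role of the metric as a measure of how differently the data are distributed across nodes.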
Lemma 2 comes from the assumption of the divergence metric and demonstrates the statistical heterogeneity of all clients. As mentioned earlier, the values of and are determined by the specific global loss function in practice and the training parameters . With the above preparation, we are now ready to analyze the convergence property of NbAFL. First, we present the following lemma to derive an expected increment bound on the loss function during each iteration of parameters with artificial noises.
Lemma 3 (Expected increment in the loss function):
After receiving updates, from the $t$-th to the $(t+1)$-th aggregation, the expected difference in the loss function can be upper-bounded by
(10) 
where
(11) 
(12) 
and are the equivalent noises imposed on the parameters after the th aggregation, given by
(13) 
Proof:
See Appendix D.
In this lemma, the value of an additive noise sample in vector satisfies the following Gaussian distribution . Also, we can obtain from Section III. From the right-hand side (RHS) of the above inequality, we can see that it is crucial to select a proper proximal term to achieve a low upper bound. It is clear that artificial noises with a large variance may improve the DP performance in terms of privacy protection. However, from the RHS of (10), a large noise variance may enlarge the expected difference of the loss function between two consecutive aggregations, leading to a deterioration of convergence performance.
Furthermore, to satisfy the global $(\epsilon, \delta)$-DP, by using Theorem 1, we have
(14) 
Next, we will analyze the convergence property of NbAFL with the DP requirement.
Theorem 2 (Convergence upper bound of the NbAFL):
With required protection level $\epsilon$, the convergence upper bound of Algorithm 1 after $T$ aggregations is given by
(15) 
where , and .
Proof:
See Appendix E.
Theorem 2 reveals an important relationship between privacy and utility by taking into account the protection level $\epsilon$ and the number of aggregation times $T$. As the number of aggregation times $T$ increases, the first term of the upper bound decreases but the second term increases. Furthermore, by viewing $T$ as a continuous variable and by writing the RHS of (15) as $h(T)$, we have
(16) 
It can be seen that the second term and third term of on the RHS of (16) are always positive. When and are set to be large enough, we can see that and are small, and thus the first term can also be positive. In this case, we have and the upper bound is convex for .
Remark 2:
As can be seen from this theorem, the expected gap between the achieved loss function and the minimum one is a decreasing function of $\epsilon$. By increasing $\epsilon$, i.e., relaxing the privacy protection level, the performance of the NbAFL algorithm improves. This is reasonable because the variance of the artificial noises decreases, thereby improving the convergence performance.
Remark 3:
The number of clients $N$ also affects the iterative convergence performance, i.e., a larger $N$ achieves a better convergence performance. This is because a larger $N$ leads to a lower variance of the artificial noises.
Remark 4:
There is an optimal number of maximum aggregation times $T$ in terms of convergence performance for given $\epsilon$ and $N$. In more detail, a larger $T$ may lead to a higher variance of artificial noises, and thus have a negative impact on convergence performance. On the other hand, more iterations can generally boost the convergence performance if the noises are not too large. In this sense, there is a tradeoff in choosing a proper $T$.
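The tradeoff in choosing the number of aggregation times can be illustrated with a small grid search. The sketch below deliberately does not use the bound in (15). Instead it assumes a hypothetical surrogate with the same qualitative shape: a term that decays geometrically with $T$ plus a noise penalty that grows as $(T/\epsilon)^2$. The constants `gap0`, `decay` and `noise_coeff` are arbitrary illustrative choices.

```python
def surrogate_bound(T: int, epsilon: float, gap0: float = 10.0,
                    decay: float = 0.8, noise_coeff: float = 0.01) -> float:
    """Hypothetical stand-in for a convergence bound, NOT the bound in (15).

    First term: optimisation error shrinking with more aggregations.
    Second term: noise penalty growing with T and shrinking with epsilon.
    """
    return gap0 * decay ** T + noise_coeff * (T / epsilon) ** 2


def best_aggregation_times(epsilon: float, T_max: int = 100) -> int:
    """Grid search for the T in 1..T_max that minimises the surrogate."""
    return min(range(1, T_max + 1), key=lambda T: surrogate_bound(T, epsilon))


strict = best_aggregation_times(epsilon=1.0)
relaxed = best_aggregation_times(epsilon=2.0)
print(strict, relaxed)
```

Under this surrogate, relaxing the protection level weakens the noise penalty, so the minimiser moves to a larger number of aggregations. The qualitative behaviour depends on the assumed functional form and should not be read as a property proved by the sketch.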
V Client Random Scheduling Policy
In this section, we consider the case where only $K$ clients are selected to participate in the aggregation process, namely random scheduling.
We now discuss how to add artificial noises in random scheduling to satisfy a global $(\epsilon, \delta)$-DP. It is natural that in the uplink channels, each of the $K$ scheduled clients should add noises with scale $\sigma_{\mathrm{U}} = cL\Delta s_{\mathrm{U}}/\epsilon$ for achieving $(\epsilon, \delta)$-DP. This is equivalent to the noise scale in the all-clients selection case in Section III, since each client only considers its own privacy for uplink channels in both cases. However, the derivation of the noise scale in the downlink will be different for random scheduling. As an extension of Theorem 1, we present the following lemma in the case of random scheduling on how to obtain $\sigma_{\mathrm{D}}$.
Lemma 4 (DP guarantee in random scheduling):
In the NbAFL algorithm with random scheduling, to satisfy a global $(\epsilon, \delta)$-DP, the standard deviation of additive Gaussian noises for downlink channels should be set as
(17) 
where
(18) 
Proof:
See Appendix F.
Lemma 4 recalculates $\sigma_{\mathrm{D}}$ by considering the number of chosen clients $K$. Generally, the number of clients $N$ is fixed, so we focus on the effect of $K$. Based on the DP analysis in Lemma 4, we can obtain the following theorem.
Theorem 3 (Convergence under random scheduling):
With required protection level and the number of chosen clients , for any , the convergence upper bound after aggregation times is given by
(19)  
where
(20) 
(21) 
and
(22) 
Proof:
See Appendix G.
The above theorem provides the convergence upper bound on the gap between the achieved loss and the minimum loss under random scheduling. Using random scheduling, we can obtain an important relationship between privacy and utility by taking into account the protection level $\epsilon$, the number of aggregation times $T$, and the number of chosen clients $K$.
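The scheduling step itself is simple to sketch: in each aggregation, a subset of clients is drawn uniformly at random without replacement and only their parameters are averaged. The example below assumes equal weights of $1/K$ over the chosen clients and omits the noise terms. The function name and values are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def schedule_and_aggregate(local_params: Sequence[Sequence[float]],
                           K: int,
                           rng: random.Random) -> Tuple[List[int], List[float]]:
    """Pick K of the N clients uniformly at random and average their parameters."""
    N = len(local_params)
    if not 0 < K <= N:
        raise ValueError("K must satisfy 0 < K <= N")
    chosen = sorted(rng.sample(range(N), K))
    dim = len(local_params[0])
    averaged = [sum(local_params[i][j] for i in chosen) / K for j in range(dim)]
    return chosen, averaged


params = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0], [4.0, 3.0]]
chosen, averaged = schedule_and_aggregate(params, K=2, rng=random.Random(0))
print(chosen, averaged)
```

A smaller subset averages over fewer noisy uploads but also exposes each client less often, which is the tension behind the choice of the number of scheduled clients.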
Remark 5:
Naturally, we can conclude that by increasing $\epsilon$, the performance under client random scheduling improves, because a larger $\epsilon$ leads to a lower variance of the artificial noises.
Remark 6:
From the bound derived in Theorem 3, we conclude that there is an optimal $K$ between $0$ and $N$ that achieves the optimal convergence performance. That is, by finding a proper $K$, the random scheduling policy is superior to the one in which all $N$ clients participate in the FL aggregations.
VI Simulation Results
In this section, we evaluate the proposed NbAFL by using a multi-layer perceptron (MLP) and real-world federated datasets. In order to characterize the convergence property of NbAFL, we conduct experiments by comparing various protection levels $\epsilon$, the number of clients $N$, the number of maximum aggregation times $T$, and the number of chosen clients $K$.
We conduct experiments on the standard MNIST dataset for handwritten digit recognition, consisting of 60,000 training examples and 10,000 testing examples [33]. Each example is a $28 \times 28$ gray-level image. Our baseline model uses an MLP network with a single hidden layer containing 256 hidden units. In this feed-forward neural network, we use ReLU units and a softmax over 10 classes (corresponding to the 10 digits) with the cross-entropy loss function. For the optimizer of networks, we set the learning rate to . The values of , , and are determined by the specific loss function, and we will use estimated values in our simulations [19].
VI-A Performance Evaluation on Protection Levels
In Fig. 2 and Fig. 3, we choose various protection levels , and to show their values of the loss function and testing accuracies in NbAFL. Furthermore, we also include a non-private approach for comparison with our NbAFL. In this experiment, we set , and , and compute the values of the loss function as a function of the aggregation time. As shown in Fig. 2, values of the loss function in NbAFL decrease as the privacy guarantee is relaxed (increasing $\epsilon$). Meanwhile, in Fig. 3, testing accuracies increase as the privacy guarantee is relaxed. These observations are consistent with Remark 2.
Considering the client random scheduling, in Fig. 4 and Fig. 5, we investigate the performance with various protection levels , and . For simulation parameters, we set , , , and . As shown in Fig. 4 and Fig. 5, the convergence performance under the client random scheduling improves with increasing $\epsilon$, which corresponds to Remark 5.
VI-B Impact of the number of clients
Fig. 6 and Fig. 7 compare the convergence performance of NbAFL under required protection level and as a function of the number of clients, $N$. In this experiment, we set , , and . We notice that the performance among different numbers of clients is consistent with Remark 3. This is because more clients not only provide larger global datasets for training, but also bring down the standard deviation of additive noises due to the aggregation.
VI-C Impact of the number of maximum aggregation times
In Fig. 8, we show the theoretical upper bound of training loss as a function of maximum aggregation times with various privacy levels , and under the NbAFL algorithm. Fig. 9 compares the theoretical upper bound using the dotted line and experimental results using the solid line with and . Fig. 8 and Fig. 9 reveal that under a low privacy level (a large $\epsilon$), NbAFL achieves a large improvement in convergence performance. This observation is in line with Remark 2, and the reason is that a lower privacy level decreases the standard deviation of additive noises, so the server can obtain better quality ML model parameters from the clients. Fig. 8 also implies that the optimal number of maximum aggregation times generally increases with $\epsilon$, which coincides with the experimental results.
Fig. 10 compares the normal NbAFL and random scheduling based NbAFL for a given protection level. In Fig. 10, we plot the values of the loss function in NbAFL with various numbers of maximum aggregation times. This figure implies that the value of the loss function is a convex function of the maximum aggregation times for a given protection level under the NbAFL algorithm, which coincides with Remark 4. From Fig. 10, we can also notice that for a given $\epsilon$, the random scheduling based NbAFL algorithm has a better convergence performance than the normal NbAFL algorithm when $T$ is large. This is because random scheduling can bring down the variance of artificial noises with little performance loss.
VI-D Impact of the number of chosen clients
In Fig. 11, we plot values of the loss function with various numbers of chosen clients $K$ under the random scheduling policy in NbAFL. The number of clients is , and $K$ clients are randomly chosen to participate in training and aggregation in each iteration. In this experiment, we set , , and . Meanwhile, we also show the performance of the non-private approach with various numbers of chosen clients $K$. Note that an optimal $K$ that further improves the convergence performance exists for various protection levels, due to a tradeoff between enhancing privacy protection and involving larger global training datasets in each updating round. The figure shows that in NbAFL, for a given protection level $\epsilon$, random scheduling can obtain a better tradeoff than the normal selection policy.
VII Conclusions
In this paper, we have focused on differential attacks in SGD-based FL. We first define a global DP requirement for both uplink and downlink channels, and develop variances of artificial noises at the client and server sides. Then, we propose a novel framework based on the concept of global DP, named NbAFL. We theoretically develop a convergence bound on the loss function of the trained FL model in the NbAFL. From the theoretical convergence bounds, we obtain the following results: 1) There is a tradeoff between the convergence performance and privacy protection levels, i.e., a better convergence performance leads to a lower protection level; 2) Increasing the number of overall clients participating in FL can improve the convergence performance, given a fixed privacy protection level; 3) There is an optimal number of maximum aggregation times in terms of convergence performance for a given protection level. Furthermore, we propose a random scheduling strategy and also develop the corresponding convergence bound on the loss function in this case. In addition to the above three properties, we find that there exists an optimal value of $K$ that achieves the best convergence performance at a fixed privacy level. Extensive simulation results confirm the correctness of our analysis. Therefore, our analytical results are helpful for the design of privacy-preserving FL architectures with different tradeoff requirements on convergence performance and privacy levels.
Appendix A Proof of Lemma 1
From the downlink perspective, the aggregation operation at the server can be expressed as , where . For all and which differ in a single entry, we can obtain
(23)  
where . Hence, we obtain . This completes the proof.
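The sensitivity argument can be checked numerically on a toy case. The sketch below assumes scalar client parameters clipped to $|w_i| \le C$ and a weighted aggregate $\sum_i p_i w_i$. Replacing one client's parameter by any other admissible value then moves the aggregate by at most $2Cp_i$, with equality when the parameter flips from $C$ to $-C$. All names and values are illustrative.

```python
from typing import Sequence


def aggregate(params: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of scalar client parameters."""
    return sum(p * w for p, w in zip(weights, params))


C = 1.0
weights = [0.5, 0.25, 0.25]
params = [1.0, -0.5, 0.2]           # every entry satisfies |w_i| <= C

i = 0                               # client whose database is changed
neighbor = list(params)
neighbor[i] = -C                    # worst case: the parameter flips sign at the clip bound

change = abs(aggregate(params, weights) - aggregate(neighbor, weights))
bound = 2.0 * C * weights[i]
print(change, bound)
```

Because the bound scales with the weight of the affected client, the worst case over clients is smallest when all weights are equal.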
Appendix B Proof of Theorem 1
To ensure a global DP in the uplink channels, the standard deviation of additive noises in client sides can be set to due to the linear relation between and with Gaussian mechanism, where is the sensitivity for the aggregation operation. We then set the sample in the th local noise vector to a same distribution (i.i.d for all ) because each client is coincident with the same global DP. The aggregation process with artificial noises added by clients can be expressed as
(24) 
The distribution of can be expressed as
(25) 
where , and is convolutional operation.
When we use Gaussian mechanism for with noise scale , the distribution of is also Gaussian distribution. To obtain a small sensitivity , we set . Furthermore, the noise scale of the Gaussian distribution can be calculated. To ensure a global DP in downlink channels, we know the standard deviation of additive noises can be set to , where . Hence, we can obtain the standard deviation of additive noises at the server as
(26) 
Hence, Theorem 1 has been proved.
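A key step in this proof is that a weighted sum of independent Gaussian noises is again Gaussian, with variance equal to the weighted sum of the individual variances. The Monte Carlo sketch below checks this for i.i.d. client noises with equal weights $1/N$, for which the standard deviation after aggregation is the client-side standard deviation divided by $\sqrt{N}$. The function names, the sample count and the seed are arbitrary assumptions.

```python
import math
import random
from typing import Sequence


def aggregated_noise_std(weights: Sequence[float], sigma_u: float) -> float:
    """Analytic std of sum_i p_i * n_i for i.i.d. n_i ~ N(0, sigma_u^2)."""
    return sigma_u * math.sqrt(sum(p * p for p in weights))


def empirical_std(weights: Sequence[float], sigma_u: float,
                  trials: int, rng: random.Random) -> float:
    """Sample std of the aggregated noise over many independent draws."""
    samples = [sum(p * rng.gauss(0.0, sigma_u) for p in weights)
               for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((x - mean) ** 2 for x in samples) / (trials - 1)
    return math.sqrt(var)


N = 10
weights = [1.0 / N] * N
sigma_u = 2.0
analytic = aggregated_noise_std(weights, sigma_u)
measured = empirical_std(weights, sigma_u, trials=20000, rng=random.Random(1))
print(analytic, measured)
```

This residual noise is what the server can count toward the downlink requirement, so it only has to add the remaining amount.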
Appendix C Proof of Lemma 2
Due to Assumption 1, we have
(27) 
and
(28)  
Considering (27), (28) and , we have
(29)  
Note that when , there exists
(30) 
which satisfies the equation. We can notice that a smaller value of implies that the local loss functions are more locally similar. When all the local loss functions are the same, then , for all . This completes the proof.
Appendix D Proof of Lemma 3
Considering the aggregation process with artificial noises added by clients and the server in the th aggregation, we have
(31) 
where
(32) 
Because is Lipschitz smooth, we know
(33) 
for all , . Combining and , we have
(34) 
We define
(35) 
Then, we know
(36) 
and
(37) 
Because is Lipschitz smooth, we can obtain
(38) 
Now, let us bound . We know
(39) 
where . Let us define , then we know is convex. Based on this, we can obtain
(40) 
and
(41) 
where denotes a solution of [18]. Now, we can use the inequality (40) and (41) to obtain
(42) 
Therefore,
(43)  
(44)  