With AlphaGo’s remarkable success, big-data-driven artificial intelligence (AI) is expected to be applied soon in all aspects of our daily life, including medical care, food and agriculture, and intelligent transportation systems. At the same time, the rapid proliferation of the Internet of Things (IoT) calls for secure and reliable data mining and learning in distributed systems [1, 2, 3]. When integrating AI into a variety of IoT applications, distributed machine learning (ML) is remarkably effective for many data processing tasks, as it defines parameterized functions from inputs to outputs as compositions of basic building blocks [4, 5]. Federated learning (FL), a recent advance in distributed ML, acquires and processes data locally at the client side, and then transmits the updated ML parameters to a central server for aggregation, i.e., averaging of these parameters [6, 7, 8]. Typically, clients in FL are distributed devices such as sensors, wearable devices, or mobile phones. The goal of FL is to fit a model generated by an empirical risk minimization (ERM) objective. However, FL also poses several key challenges, such as private information leakage, expensive communication costs between servers and clients, and device variability [9, 10, 11, 12, 13, 14].
Generally, distributed stochastic gradient descent (SGD) is adopted in FL for training ML models. In [15, 16], bounds for FL convergence performance were developed based on distributed SGD, with a one-step local update before global aggregations. The work in  considered partial global aggregations, where after each local update step, parameter aggregation is performed over a non-empty subset of the client set. In order to analyze the convergence more effectively, federated proximal (FedProx) was proposed  by adding a regularization term to each local loss function. The work in  obtained the convergence bound of SGD-based FL that incorporates non-independent-and-identically-distributed (non-i.i.d.) data distributions among clients.
At the same time, with the ever-increasing awareness of the data security of personal information, privacy preservation has become a significant worldwide issue, especially for big data applications and distributed learning systems. One prominent advantage of FL is that it enables local training without personal data exchange between the server and clients, thereby protecting clients’ data from being eavesdropped by hidden adversaries. Nevertheless, private information can still be divulged to some extent when adversaries analyze the differences among related parameters trained and uploaded by the clients, e.g., weights trained in neural networks [20, 21, 22].
A natural approach to preventing differential attacks on private information is to add artificial noises, known as differentially private (DP) techniques [23, 24]. Existing works on DP-based learning algorithms include local DP (LDP) [25, 26, 27], DP-based distributed SGD [28, 29] and DP meta learning . In LDP, each client perturbs its information locally and only sends a randomized version to a server, thereby protecting both the clients and the server against private information leakage. The work in  proposed solutions for building an LDP-compliant SGD, which powers a variety of important ML tasks. The work in  considered the distribution estimation at the server over uploaded data from clients while providing protections on these data with LDP. The work in  improved the computational efficiency of DP-based SGD by tracking detailed information of the privacy loss, and obtained accurate estimates of the overall privacy loss. The work in  proposed novel DP-based SGD algorithms and analyzed their performance bounds, which are shown to be related to privacy levels and the sizes of datasets. Also, the work in  focused on the class of gradient-based parameter-transfer methods and developed a DP-based meta learning algorithm that not only satisfies the privacy requirement but also retains provable learning performance in convex settings.
More specifically, DP-based FL approaches are usually devoted to capturing the tradeoff between privacy and convergence performance in the training process. The work in  proposed an FL algorithm that takes the preservation of clients’ privacy into account. This algorithm can achieve a good training performance at a given privacy level, especially when there is a sufficiently large number of participating clients. The work in  presented an alternative approach that utilizes both DP and secure multiparty computation (SMC) to prevent differential attacks. However, the above two works on DP-based FL design have not taken into account the privacy protection during the parameter uploading stage, i.e., the clients’ private information can potentially be intercepted by hidden adversaries when uploading the training results to the server. Moreover, these two works only showed empirical results by simulations, but lacked a theoretical analysis of the FL system, such as the tradeoff between privacy, convergence performance, and convergence rate. Up to now, a theoretical analysis of the convergence behavior of FL with privacy-preserving noise perturbations has not yet been detailed in the existing literature, which will be the major focus of our work in this paper.
In this paper, to effectively prevent differential attacks, we propose a novel framework based on the concept of differential privacy (DP), in which each client perturbs its trained parameters locally by purposely adding noises before uploading them to the server for aggregation, namely, noising before model aggregation FL (NbAFL). To the best of the authors’ knowledge, this is the first work of its kind that theoretically analyzes the convergence property of differentially private FL algorithms. First, we prove that the proposed NbAFL scheme satisfies the requirement of DP in terms of global data under a certain noise perturbation level with Gaussian noises by properly adapting their variances. Then, we theoretically develop a convergence bound of the loss function of the trained FL model in the NbAFL with artificial Gaussian noises. Our developed bound reveals the following three key properties: 1) There is a tradeoff between the convergence performance and privacy protection levels, i.e., a better convergence performance leads to a lower protection level; 2) Increasing the number of overall clients participating in FL can improve the convergence performance, given a fixed privacy protection level; 3) There is an optimal number of maximum aggregation times in terms of convergence performance for a given protection level. Furthermore, we propose a $K$-random scheduling strategy, where $K$ ($1 < K < N$) clients are randomly selected from the $N$ overall clients to participate in each aggregation. We also develop the corresponding convergence bound of the loss function in this case. From our analysis, the $K$-random scheduling strategy can retain the above three properties. Also, we find that there exists an optimal value of $K$ that achieves the best convergence performance at a fixed privacy level. Evaluations demonstrate that our theoretical results are consistent with simulations. Therefore, our analytical results are helpful for the design of privacy-preserving FL architectures with different tradeoff requirements on convergence performance and privacy levels.
The remainder of this paper is organized as follows. In Section II, we introduce background on FL, DP and a conventional DP-based FL algorithm. In Section III, we detail the proposed NbAFL and analyze the privacy performance based on DP. In Section IV, we analyze the convergence bound of NbAFL and reveal the relationship between privacy levels, convergence performance, the number of clients, and the number of global aggregations. In Section V, we propose the $K$-random scheduling scheme and develop the convergence bound. We show the analytical results and simulations in Section VI. We conclude the paper in Section VII. A summary of basic concepts and notations is provided in Tab. I.
|A randomized mechanism for DP|
|The parameters related to DP|
|The $i$-th client|
|The database held by the owner|
|The database held by all the clients|
|The cardinality of a set|
|Total number of all clients|
|The number of chosen clients|
|The index of the $t$-th aggregation|
|The number of aggregation times|
|The vector of model parameters|
|Global loss function|
|Local loss function from the $i$-th client|
|A presetting constant of the proximal term|
|Local uploading parameters of the $i$-th client|
|Initial parameters of the global model|
|Global parameters generated from all local parameters at the $t$-th aggregation|
|Global parameters generated from the chosen clients’ parameters at the $t$-th aggregation|
|True optimal model parameters that minimize the global loss function|
|The set of all the local parameters|
|The set of all local parameters with perturbation|
In this section, we will present preliminaries and related background knowledge on FL and DP. Also, we introduce a conventional DP-based FL algorithm that will be discussed in our following analysis as a benchmark.
II-A Federated Learning
Let us consider a general FL system consisting of one server and $N$ clients, as depicted in Fig. 1. Let $\mathcal{D}_i$ denote the local database held by the client $\mathcal{C}_i$, where $i \in \{1, 2, \ldots, N\}$. At the server, the goal is to learn a model over data that reside at the $N$ associated clients. An active client participating in the local training needs to find a vector $\mathbf{w}$ of an AI model that minimizes a certain loss function. Formally, the server aggregates the weights sent from the $N$ clients as
$$\mathbf{w} = \sum_{i=1}^{N} p_i \mathbf{w}_i, \qquad (1)$$
where $\mathbf{w}_i$ is the parameter vector trained at the $i$-th client, and $\mathbf{w}$ is the parameter vector after aggregating at the server. Such an optimization problem can be formulated as
$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \sum_{i=1}^{N} p_i F_i(\mathbf{w}), \qquad (2)$$
where $F_i(\cdot)$ is the local loss function of the $i$-th client, $N$ is the number of clients, $p_i = |\mathcal{D}_i| / |\mathcal{D}| \geq 0$ with $\sum_{i=1}^{N} p_i = 1$, and $|\mathcal{D}| = \sum_{i=1}^{N} |\mathcal{D}_i|$ is the total size of all data samples. Generally, the local loss function is given by local empirical risks. The training process of such an FL system usually contains the following four steps:
Local training: All active clients locally compute training gradients or parameters and send locally trained ML parameters to the server;
Model aggregating: The server performs secure aggregation over the uploaded parameters from clients without learning local information;
Parameters broadcasting: The server broadcasts the aggregated parameters to the clients;
Model updating: All clients update their respective models with the aggregated parameters and test the performance of the updated models.
In the FL process, the clients with the same data structure collaboratively learn an ML model with the help of a cloud server. After a sufficient number of local training and update exchanges between the server and its associated clients, the solution to the optimization problem (2) is able to converge to that of the global optimal learning model.
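The aggregation step of this process can be sketched in a few lines of Python. This is a minimal illustration only, assuming that each client is weighted by its share of the total number of data samples; the function name `aggregate` and the toy values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def aggregate(local_params: Sequence[Sequence[float]],
              dataset_sizes: Sequence[int]) -> List[float]:
    """Weighted average of client parameter vectors.

    Assumption: client i is weighted by p_i = |D_i| / |D|, where |D| is
    the total number of samples over all clients.
    """
    if not local_params or len(local_params) != len(dataset_sizes):
        raise ValueError("need one dataset size per client")
    total = sum(dataset_sizes)
    weights = [size / total for size in dataset_sizes]
    dim = len(local_params[0])
    return [sum(p * w[j] for p, w in zip(weights, local_params))
            for j in range(dim)]


# Two toy clients holding 100 and 300 samples, respectively.
global_params = aggregate([[1.0, 2.0], [3.0, 4.0]], [100, 300])
print(global_params)
```

Because the weights sum to one, the aggregated vector always lies in the convex hull of the uploaded vectors.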
II-B Differential Privacy
$(\epsilon, \delta)$-DP provides a strong criterion for privacy preservation of distributed data processing systems. Here, $\epsilon > 0$ is the distinguishable bound of all outputs on neighboring datasets in a database, and $\delta$ represents the probability of the event that the ratio of the probabilities for two adjacent datasets cannot be bounded by $e^{\epsilon}$ after adding a privacy preserving mechanism. With an arbitrarily given $\delta$, a privacy preserving mechanism with a larger $\epsilon$ gives a clearer distinguishability of neighboring datasets and hence a higher risk of privacy violation. Now, we formally define DP as follows.
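Definition ($(\epsilon, \delta)$-DP ): A randomized mechanism $\mathcal{M}: \mathcal{X} \rightarrow \mathcal{R}$ with domain $\mathcal{X}$ and range $\mathcal{R}$ satisfies $(\epsilon, \delta)$-DP, if for all measurable sets $\mathcal{S} \subseteq \mathcal{R}$ and for any two adjacent databases $\mathcal{D}_i, \mathcal{D}_i' \in \mathcal{X}$,
$$\Pr[\mathcal{M}(\mathcal{D}_i) \in \mathcal{S}] \leq e^{\epsilon} \Pr[\mathcal{M}(\mathcal{D}_i') \in \mathcal{S}] + \delta.$$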
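In order to ensure that the given noise distribution $n \sim \mathcal{N}(0, \sigma^2)$ preserves $(\epsilon, \delta)$-DP, where $\mathcal{N}$ represents the Gaussian distribution, we choose the noise scale $\sigma \geq c \Delta s / \epsilon$ and the constant $c \geq \sqrt{2 \ln(1.25/\delta)}$ for $\epsilon \in (0, 1)$. In this result, $n$ is the value of an additive noise sample for a data point in the dataset, $\Delta s$ is the sensitivity of the function $s$ given by $\Delta s = \max_{\mathcal{D}_i, \mathcal{D}_i'} \| s(\mathcal{D}_i) - s(\mathcal{D}_i') \|$, and $s$ is a real-valued function.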
Considering the above DP mechanism, how to choose an appropriate level of noise remains a significant research problem, since it affects both the privacy guarantee of clients and the convergence rate of the FL process.
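As a rough illustration of how such a noise level might be computed, the following Python sketch assumes the classical Gaussian-mechanism calibration, i.e., a standard deviation of $c \Delta s / \epsilon$ with $c = \sqrt{2 \ln(1.25/\delta)}$ and $\epsilon \in (0, 1)$. The function names and the example values are hypothetical, and the sketch is not meant to reproduce the exact noise scales derived later in the paper.

```python
import math
import random
from typing import List, Sequence


def gaussian_noise_scale(sensitivity: float, epsilon: float, delta: float) -> float:
    """Standard deviation for the classical Gaussian mechanism.

    Assumption: sigma = c * sensitivity / epsilon with
    c = sqrt(2 * ln(1.25 / delta)), valid for 0 < epsilon < 1.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    c = math.sqrt(2.0 * math.log(1.25 / delta))
    return c * sensitivity / epsilon


def add_gaussian_noise(values: Sequence[float], sigma: float,
                       rng: random.Random) -> List[float]:
    """Add independent zero-mean Gaussian noise to every coordinate."""
    return [v + rng.gauss(0.0, sigma) for v in values]


sigma = gaussian_noise_scale(sensitivity=1.0, epsilon=0.5, delta=1e-5)
print(sigma)
print(add_gaussian_noise([0.2, -0.1, 0.4], sigma, random.Random(0)))
```

The sketch makes the tradeoff visible: halving $\epsilon$ doubles the standard deviation, so stronger privacy directly means noisier parameters.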
II-C Noising after Aggregation FL (NaAFL)
Conventionally, adding overly conservative noises to the aggregated parameters, i.e., , is an effective method in FL to protect privacy. This DP mechanism, termed as noising after model aggregation FL (NaAFL), can be described as
where is the noise vector and is a clipping threshold for bounding . Each element in represents an additive zero-mean Gaussian noise sample with variance . The NaAFL algorithm averages over all the individual training results and adds a Gaussian noise to each of the averaged model parameters. In other words, it attempts to protect the privacy of training data by working only on the aggregated parameters that result from the training process. Superior to the conventional NaAFL, our proposed NbAFL with local noisy perturbations at the clients will prevent hidden adversaries from inferring the clients’ information by analyzing their uploaded parameters, which will be detailed in the following.
III Federated Learning with Differential Privacy
In this section, we first introduce the concept of global DP and analyze the DP performance in the context of FL. Then we propose the NbAFL scheme that can satisfy the DP requirement by adding proper noisy perturbations at both the clients and the server.
III-A Threat Model
The fundamental purpose of FL is to build an ML model based on databases that reside locally in multiple clients while preventing the leakage of personal data. However, potential adversaries may still be able to infer clients’ private information by analyzing the parameters uploaded by the clients, as depicted in Fig. 1. The privacy leakage via analyzing parameters can happen in both the uploading (through uplink channels) and broadcasting (through downlink channels) phases.
The server in this paper is assumed to be honest. However, there are external adversaries targeting clients’ private information by analyzing the ML parameters. In addition, several smart adversaries may be disguised as clients to access private information of honest clients. We assume that uplink channels are more secure than downlink broadcast channels, since clients can be assigned to different channels (e.g., time slots, frequency bands) dynamically at each upload, while downlink channels are broadcast. Hence, we assume at most $L$ ($L \leq T$) exposures of uploaded parameters from each client in the uplink and $T$ exposures of aggregated parameters in the downlink, where $T$ is the number of aggregation times.
III-B Global Differential Privacy
Here, we define a global -DP requirement for both uplink and downlink channels. From the uplink perspective, using a clipping technique, we can obtain that , where denotes the -th client’s local training parameters without perturbation and is a clipping threshold for bounding . We define , where is the -th client’s database. Thus, the sensitivity of can be expressed as
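To ensure $(\epsilon, \delta)$-DP for each client in the uplink in one exposure, we set the noise scale, represented by the standard deviation of the additive Gaussian noise, as $\sigma_U = c \Delta s_U / \epsilon$, where $\Delta s_U$ denotes the uplink sensitivity. Considering $L$ exposures of local parameters, we need to set $\sigma_U = c L \Delta s_U / \epsilon$ due to the linear relation between $\epsilon$ and $\sigma_U$ with the Gaussian mechanism.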
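The clipping technique mentioned above can be sketched as follows. The rescaling rule used here, dividing the vector by $\max(1, \|\mathbf{w}\| / C)$, is a common choice and is an assumption of this sketch, since the exact clipping expression is not shown in this section; the name `clip_by_norm` is hypothetical.

```python
import math
from typing import List, Sequence


def clip_by_norm(w: Sequence[float], clip_threshold: float) -> List[float]:
    """Rescale w so that its Euclidean norm is at most clip_threshold.

    Assumption: the clipped vector is w / max(1, ||w|| / C); vectors that
    are already inside the ball of radius C are returned unchanged.
    """
    if clip_threshold <= 0.0:
        raise ValueError("clip_threshold must be positive")
    norm = math.sqrt(sum(x * x for x in w))
    scale = max(1.0, norm / clip_threshold)
    return [x / scale for x in w]


print(clip_by_norm([3.0, 4.0], 1.0))   # norm 5 is scaled down to norm 1
print(clip_by_norm([0.3, 0.4], 1.0))   # norm 0.5 is left untouched
```

Bounding the norm in this way is what makes the sensitivity finite, and hence what allows a finite noise scale to be chosen.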
From the downlink perspective, the aggregation operation at the server can be expressed as
where is the set of all uploaded parameters from the clients, and is the aggregated parameters at the server to be broadcast to the clients. Regarding the sensitivity of , i.e., , we have the following lemma.
Lemma (Sensitivity of the aggregation operation):
In FL training process, the sensitivity of the aggregation operation is given by
See Appendix A.
From the above lemma, to achieve a small sensitivity , the ideal condition is that all the clients should use the same size of local datasets for training, i.e., .
From the above remark, when setting , , we can obtain the optimal value of the sensitivity . With this optimal value, we have the following theorem regarding how to add noises to the aggregated parameters at the server to satisfy the -DP criterion in the downlink channels.
Theorem (DP guarantee for downlink channels):
To ensure -DP in the downlink channels with aggregations, the standard deviation of Gaussian noises that are added to the aggregated parameter w by the server can be given as
See Appendix B.
Theorem III-B shows that to satisfy an $(\epsilon, \delta)$-DP requirement for the downlink channels, additional noises need to be added by the server. With a certain $L$, the standard deviation of additional noises depends on the relationship between the number of aggregation times $T$ and the number of clients $N$. The intuition is that a larger $T$ can lead to a higher chance of information leakage, while a larger number of clients is helpful for hiding their private information. This theorem also provides the variance value of the noises that should be added to the aggregated parameters. Based on the above results, we propose the following NbAFL algorithm.
III-C Proposed NbAFL
Algorithm 1 outlines our NbAFL for training an effective model with a global $(\epsilon, \delta)$-DP requirement. We denote by $\mu$ the presetting constant of the proximal term and by $\mathbf{w}^{(0)}$ the initial global parameter. At the beginning of this algorithm, the required privacy level parameters $(\epsilon, \delta)$ are set and the initial global parameter $\mathbf{w}^{(0)}$ is sent to the clients. In the $t$-th aggregation, the active clients respectively train the parameters by using local databases with preset termination conditions. After completing the local training, the $i$-th client, $\mathcal{C}_i$, will add noises to the trained parameters $\mathbf{w}_i^{(t)}$, and upload the noised parameters $\widetilde{\mathbf{w}}_i^{(t)}$ to the server for aggregation.
Then the server updates the global parameters $\mathbf{w}^{(t)}$ by aggregating the local parameters integrated with different weights. Additive noises are added to $\mathbf{w}^{(t)}$ according to Theorem III-B before being broadcast to the clients. Based on the received global parameters $\widetilde{\mathbf{w}}^{(t)}$, each client will estimate the accuracy by using local testing databases and start the next round of the training process based on these received parameters. The FL process completes after the aggregation time reaches a preset number $T$ and the algorithm returns $\widetilde{\mathbf{w}}^{(T)}$.
Now, let us focus on the privacy preservation performance of the NbAFL. First, the set of all noised local parameters is received by the server. Owing to the local perturbations in the NbAFL, it will be difficult for malicious adversaries to infer the information at the $i$-th client from its uploaded parameters. After the model aggregation, the aggregated parameters will be sent back to the clients via broadcast channels. This poses threats to clients’ privacy, as potential adversaries may infer sensitive information about individual clients from the aggregated parameters. In this case, additive noises may be added to the aggregated parameters based on Theorem III-B.
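One aggregation round of this kind can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the paper's Algorithm 1: local training is omitted, the clipping rule $\mathbf{w} / \max(1, \|\mathbf{w}\|/C)$ is assumed, and the uplink and downlink noise standard deviations are passed in as plain arguments rather than computed from the theorems. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def clip(w: Sequence[float], threshold: float) -> List[float]:
    """Assumed clipping rule: w / max(1, ||w|| / threshold)."""
    norm = math.sqrt(sum(x * x for x in w))
    scale = max(1.0, norm / threshold)
    return [x / scale for x in w]


def perturb(w: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in w]


def noising_before_aggregation_round(local_params: Sequence[Sequence[float]],
                                     weights: Sequence[float],
                                     clip_threshold: float,
                                     sigma_uplink: float,
                                     sigma_downlink: float,
                                     rng: random.Random) -> List[float]:
    # Client side: clip the trained parameters, then add noise before upload.
    uploaded = [perturb(clip(w, clip_threshold), sigma_uplink, rng)
                for w in local_params]
    # Server side: weighted aggregation of the already-noisy uploads.
    dim = len(uploaded[0])
    aggregated = [sum(p * w[j] for p, w in zip(weights, uploaded))
                  for j in range(dim)]
    # Server side: extra noise (possibly zero) before broadcasting.
    return perturb(aggregated, sigma_downlink, rng)


rng = random.Random(42)
trained = [[3.0, 4.0], [0.0, 1.0]]          # stand-ins for trained parameters
print(noising_before_aggregation_round(trained, [0.5, 0.5], 1.0, 0.1, 0.05, rng))
```

The key structural point is the order of operations: the server never sees an unperturbed client vector, which is what distinguishes this scheme from adding noise only after aggregation.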
IV Convergence Analysis on NbAFL
In this section, we are ready to analyze the convergence performance of the proposed NbAFL. First, we analyze the expected increment of adjacent aggregations in the loss function with Gaussian noises. Then, we focus on deriving the convergence property under the global -DP requirement.
For the convenience of the analysis, we make the following assumptions on the loss function and network parameters.
We make assumptions on the global loss function and the -th local loss function as follows:
satisfies the Polyak-Lojasiewicz condition with the positive parameter , which implies that , where is the optimal result;
is -Lipschitz, i.e., , for any , ;
is -Lipschitz smooth, i.e., , for any , , where is a constant determined by the practical loss function;
For any and , , where is the divergence metric.
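For reference, the textbook forms of these four conditions can be written as follows. The constant names $l$, $\beta$, $\rho$ and $\varepsilon_i$, as well as the choice of which function each condition is attached to, are assumptions of this sketch and may differ from the paper's exact statement.

```latex
% Assumed notation: l, \beta, \rho, \varepsilon_i are placeholder constants;
% F is the global loss, F_i the i-th local loss, w^* a minimizer of F.
\begin{align}
  F(\mathbf{w}) - F(\mathbf{w}^{*})
    &\le \frac{1}{2l}\,\bigl\|\nabla F(\mathbf{w})\bigr\|^{2}
    && \text{(Polyak-Lojasiewicz)} \\
  \bigl\|F_i(\mathbf{w}) - F_i(\mathbf{w}')\bigr\|
    &\le \beta\,\bigl\|\mathbf{w} - \mathbf{w}'\bigr\|
    && \text{(Lipschitz)} \\
  \bigl\|\nabla F_i(\mathbf{w}) - \nabla F_i(\mathbf{w}')\bigr\|
    &\le \rho\,\bigl\|\mathbf{w} - \mathbf{w}'\bigr\|
    && \text{(Lipschitz smooth)} \\
  \bigl\|\nabla F_i(\mathbf{w}) - \nabla F(\mathbf{w})\bigr\|
    &\le \varepsilon_i
    && \text{(bounded gradient divergence)}
\end{align}
```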
Similar to the gradient divergence, the divergence metric is the metric to capture the divergence between the gradients of the local loss functions and that of the aggregated loss function, which is essential for analyzing SGD. The divergence is related to how the data is distributed at different nodes. Using Assumption 1, we then have the following lemma.
Lemma (-dissimilarity of various clients):
For a given ML parameter , there exists satisfying
See Appendix C.
Lemma IV comes from the assumption of the divergence metric and demonstrates the statistical heterogeneity of all clients. As mentioned earlier, the values of and are determined by the specific global loss function in practice and the training parameters . With the above preparation, we are now ready to analyze the convergence property of NbAFL. First, we present the following lemma to derive an expected increment bound on the loss function during each iteration of parameters with artificial noises.
Lemma (Expected increment in the loss function):
After receiving updates, from the -th to the -th aggregation, the expected difference in the loss function can be upper-bounded by
and are the equivalent noises imposed on the parameters after the -th aggregation, given by
See Appendix D.
In this lemma, the value of an additive noise sample in vector satisfies the following Gaussian distribution . Also, we can obtain from Section III. From the right hand side (RHS) of the above inequality, we can see that it is crucial to select a proper proximal term to achieve a low upper bound. It is clear that artificial noises with a large standard deviation may improve the DP performance in terms of privacy protection. However, from the RHS of (10), a large standard deviation may enlarge the expected difference of the loss function between two consecutive aggregations, leading to a deterioration of convergence performance.
Furthermore, to satisfy the global -DP, by using Theorem III-B, we have
Next, we will analyze the convergence property of NbAFL with the -DP requirement.
Theorem (Convergence upper bound of the NbAFL):
With required protection level , the convergence upper bound of Algorithm 1 after aggregations is given by
where , and .
See Appendix E.
Theorem IV reveals an important relationship between privacy and utility by taking into account the protection level $\epsilon$ and the number of aggregation times $T$. As the number of aggregation times $T$ increases, the first term of the upper bound decreases but the second term increases. Furthermore, by viewing $T$ as a continuous variable and by writing the RHS of (15) as a function of $T$, we have
It can be seen that the second term and third term of on the RHS of (16) are always positive. When and are set to be large enough, we can see that and are small, and thus the first term can also be positive. In this case, we have and the upper bound is convex for .
As can be seen from this theorem, the expected gap between the achieved loss function and the minimum one is a decreasing function of $\epsilon$. By increasing $\epsilon$, i.e., relaxing the privacy protection level, the performance of the NbAFL algorithm will improve. This is reasonable because the variance of artificial noises decreases, thereby improving the convergence performance.
The number of clients $N$ will also affect the iterative convergence performance, i.e., a larger $N$ would achieve a better convergence performance. This is because a larger $N$ leads to a lower variance of the artificial noises.
There is an optimal number of maximum aggregation times in terms of convergence performance for given and . In more detail, a larger may lead to a higher variance of artificial noises, and thus pose a negative impact on convergence performance. On the other hand, more iterations can generally boost the convergence performance if noises are not large enough. In this sense, there is a tradeoff on choosing a proper .
V $K$-Client Random Scheduling Policy
In this section, we consider the case where only $K$ clients are selected to participate in the aggregation process, namely $K$-random scheduling.
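The selection step can be sketched in Python as below, assuming that the scheduled clients are drawn uniformly at random without replacement in every aggregation round. The function name `schedule_clients` and the example sizes are hypothetical.

```python
import random
from typing import List


def schedule_clients(num_clients: int, num_chosen: int,
                     rng: random.Random) -> List[int]:
    """Pick num_chosen distinct client indices uniformly at random.

    Assumption: sampling is uniform and without replacement, and is
    repeated independently in every aggregation round.
    """
    if not 1 <= num_chosen <= num_clients:
        raise ValueError("need 1 <= num_chosen <= num_clients")
    return sorted(rng.sample(range(num_clients), num_chosen))


rng = random.Random(7)
for round_index in range(3):
    # A fresh subset of 10 out of 50 clients is drawn in each round.
    print(round_index, schedule_clients(50, 10, rng))
```

Only the clients in the returned subset would train, perturb and upload in that round; the remaining clients simply receive the next broadcast.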
We now discuss how to add artificial noises in the $K$-random scheduling to satisfy a global $(\epsilon, \delta)$-DP. It is natural that in the uplink channels, each of the $K$ scheduled clients should add noises with scale $\sigma_U = c L \Delta s_U / \epsilon$ for achieving $(\epsilon, \delta)$-DP. This is equivalent to the noise scale in the all-clients selection case in Section III, since each client only considers its own privacy for uplink channels in both cases. However, the derivation of the noise scale in the downlink will be different for the $K$-random scheduling. As an extension of Theorem 1, we present the following lemma in the case of $K$-random scheduling on how to obtain the downlink noise scale.
Lemma (DP guarantee in -random scheduling):
In the NbAFL algorithm with -random scheduling, to satisfy a global -DP, and the standard deviation of additive Gaussian noises for downlink channels should be set as
See Appendix F.
Lemma V recalculates the downlink noise scale by considering the number of chosen clients $K$. Since the number of clients $N$ is generally fixed, we focus on the effect of $K$. Based on the DP analysis in Lemma V, we can obtain the following theorem.
Theorem (Convergence under -random scheduling):
With required protection level and the number of chosen clients , for any , the convergence upper bound after aggregation times is given by
See Appendix G.
The above theorem provides the convergence upper bound between and under -random scheduling. Using -random scheduling, we can obtain an important relationship between privacy and utility by taking into account the protection level , the number of aggregation times and the number of chosen clients .
Naturally, we can conclude that by increasing $\epsilon$, the performance under $K$-client random scheduling will improve because a larger $\epsilon$ leads to a lower variance of the artificial noises.
From the bound derived in Theorem V, we conclude that there is an optimal in between and that achieves the optimal convergence performance. That is, by finding a proper , the -random scheduling policy is superior to the one that all clients participate in the FL aggregations.
VI Simulation Results
In this section, we evaluate the proposed NbAFL by using a multi-layer perceptron (MLP) and real-world federated datasets. In order to characterize the convergence property of NbAFL, we conduct experiments by comparing various protection levels $\epsilon$, the number of clients $N$, the number of maximum aggregation times $T$ and the number of chosen clients $K$.
We conduct experiments on the standard MNIST dataset for handwritten digit recognition, consisting of 60,000 training examples and 10,000 testing examples . Each example is a $28 \times 28$ gray-level image, and the network classifies it into one of 10 classes (corresponding to the 10 digits) with the cross-entropy loss function. For the optimizer of networks, we set the learning rate to . The values of , , and are determined by the specific loss function, and we will use estimated values in our simulations .
VI-A Performance Evaluation on Protection Levels
In Fig. 2 and Fig. 3, we choose various protection levels , and to show their values of the loss function and testing accuracies in NbAFL. Furthermore, we also involve a non-private approach to compare with our NbAFL. In this experiment, we set , and , and compute the values of the loss function as a function of the aggregation time. As shown in Fig. 2, values of the loss function in NbAFL are decreasing with relaxing the privacy guarantees (increasing ). Meanwhile, in Fig. 3, testing accuracies are increasing with relaxing the privacy guarantees. The observation results above are consistent with Remark IV.
Considering the -client random scheduling, in Fig. 4 and Fig. 5, we investigate the performances with various protection levels , and . For simulation parameters, we set , , , and . As shown in Fig. 4 and Fig. 5, the convergence performance under the -client random scheduling is improved with increasing , which is corresponding to Remark V.
VI-B Impact of the number of clients
Fig. 6 and Fig. 7 compare the convergence performance of NbAFL under required protection level and as a function of the number of clients, $N$. In this experiment, we set , , and . We notice that the performance among different numbers of clients is consistent with Remark IV. This is because more clients not only provide larger global datasets for training, but also bring down the standard deviation of additive noises due to the aggregation.
VI-C Impact of the number of maximum aggregation times
In Fig. 8, we show the theoretical upper bound of training loss as a function of maximum aggregation times with various privacy levels , and under the NbAFL algorithm. Fig. 9 compares the theoretical upper bound using the dotted line and experimental results using the solid line with and . Fig. 8 and Fig. 9 reveal that under a low privacy level (a large $\epsilon$), running NbAFL yields a large improvement in the convergence performance. This observation is in line with Remark IV, and the reason comes from the fact that a lower privacy level decreases the standard deviation of additive noises, so that the server can obtain better quality ML model parameters from the clients. Fig. 8 also implies that the optimal number of maximum aggregation times tends to increase with increasing $\epsilon$, which coincides with the experimental results.
Fig. 10 compares the normal NbAFL and $K$-random scheduling based NbAFL for a given protection level. In Fig. 10, we plot the values of the loss function in NbAFL with various numbers of maximum aggregation times. This figure implies that the value of the loss function is a convex function of the maximum aggregation times for a given protection level under the NbAFL algorithm, which coincides with Remark IV. From Fig. 10, we can also notice that for a given protection level, the $K$-random scheduling based NbAFL algorithm has a better convergence performance than the normal NbAFL algorithm when the number of maximum aggregation times is large. This is because $K$-random scheduling can bring down the variance of artificial noises with little performance loss.
VI-D Impact of the number of chosen clients
In Fig. 11, we plot values of the loss function with various numbers of chosen clients $K$ under the random scheduling policy in NbAFL. The number of clients is , and $K$ clients are randomly chosen to participate in training and aggregation in each iteration. In this experiment, we set , , and . Meanwhile, we also show the performance of the non-private approach with various numbers of chosen clients $K$. Note that an optimal $K$ which further improves the convergence performance exists for various protection levels, due to a tradeoff between enhancing privacy protection and attaining larger global training datasets in each updating round. The figure shows that in NbAFL, for a given protection level, $K$-random scheduling can obtain a better tradeoff than the normal selection policy.
In this paper, we have focused on differential attacks in SGD-based FL. We first define a global $(\epsilon, \delta)$-DP requirement for both uplink and downlink channels, and develop variances of artificial noises at the client and server sides. Then, we propose a novel framework based on the concept of global $(\epsilon, \delta)$-DP, named NbAFL. We theoretically develop a convergence bound of the loss function of the trained FL model in the NbAFL. From the theoretical convergence bounds, we obtain the following results: 1) There is a tradeoff between the convergence performance and privacy protection levels, i.e., a better convergence performance leads to a lower protection level; 2) Increasing the number of overall clients participating in FL can improve the convergence performance, given a fixed privacy protection level; 3) There is an optimal number of maximum aggregation times in terms of convergence performance for a given protection level. Furthermore, we propose a $K$-random scheduling strategy and also develop the corresponding convergence bound of the loss function in this case. In addition to the above three properties, we find that there exists an optimal value of $K$ that achieves the best convergence performance at a fixed privacy level. Extensive simulation results confirm the correctness of our analysis. Therefore, our analytical results are helpful for the design of privacy-preserving FL architectures with different tradeoff requirements on convergence performance and privacy levels.
Appendix A Proof of Lemma III-B
From the downlink perspective, the aggregation operation at the server can be expressed as , where . For all and which differ in a single entry, we can obtain
where . Hence, we obtain . This completes the proof.
Appendix B Proof of Theorem III-B
To ensure a global -DP in the uplink channels, the standard deviation of additive noises in client sides can be set to due to the linear relation between and with Gaussian mechanism, where is the sensitivity for the aggregation operation. We then set the sample in the -th local noise vector to a same distribution (i.i.d for all ) because each client is coincident with the same global -DP. The aggregation process with artificial noises added by clients can be expressed as
The distribution of can be expressed as
where , and is convolutional operation.
When we use Gaussian mechanism for with noise scale , the distribution of is also Gaussian distribution. To obtain a small sensitivity , we set . Furthermore, the noise scale of the Gaussian distribution can be calculated. To ensure a global -DP in downlink channels, we know the standard deviation of additive noises can be set to , where . Hence, we can obtain the standard deviation of additive noises at the server as
Hence, Theorem III-B has been proved.
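The step in this proof that combines independent Gaussian client noises can be checked numerically. The Python sketch below assumes that the aggregated noise is a weighted sum of independent zero-mean Gaussian samples sharing one standard deviation, so that its standard deviation is that value times the square root of the sum of squared weights; with equal weights $1/N$ this reduces to a factor of $1/\sqrt{N}$. The function names are hypothetical.

```python
import math
import random
import statistics
from typing import Sequence


def aggregated_noise_std(weights: Sequence[float], sigma: float) -> float:
    """Std of sum_i p_i * n_i for independent n_i ~ N(0, sigma^2)."""
    return sigma * math.sqrt(sum(p * p for p in weights))


def empirical_noise_std(weights: Sequence[float], sigma: float,
                        num_trials: int, rng: random.Random) -> float:
    """Monte Carlo estimate of the same quantity, for a sanity check."""
    samples = [sum(p * rng.gauss(0.0, sigma) for p in weights)
               for _ in range(num_trials)]
    return statistics.pstdev(samples)


equal_weights = [0.25] * 4                      # four clients, equal data sizes
print(aggregated_noise_std(equal_weights, 2.0))
print(empirical_noise_std(equal_weights, 2.0, 20000, random.Random(0)))
```

This is the mechanism behind the observation that more participating clients lower the effective noise variance after aggregation.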
Appendix C Proof of Lemma IV
Due to Assumption IV, we have
Note that when , there exists
which satisfies the equation. We can notice that a smaller value of implies that the local loss functions are more locally similar. When all the local loss functions are the same, then , for all . This completes the proof.
Appendix D Proof of Lemma IV
Considering the aggregation process with artificial noises added by clients and the server in the -th aggregation, we have
Because is -Lipschitz smooth, we know
for all , . Combining and , we have
Then, we know
Because is -Lipschitz smooth, we can obtain
Now, let us bound . We know
where . Let us define , then we know is -convexity. Based on this, we can obtain