# Robust Federated Learning with Noisy Communication

Federated learning is a communication-efficient training process that alternates between local training at the edge devices and averaging of the updated local models at the central server. However, perfect acquisition of the local models over wireless links is impractical due to noise, which in turn seriously degrades federated learning. To tackle this challenge, we propose a robust design for federated learning that alleviates the effects of noise. Accounting for noise in the two aforementioned steps, we first formulate the training problem as a parallel optimization for each node under the expectation-based model and the worst-case model. Due to the non-convexity of the problem, a regularization-based loss function approximation method is proposed to make it tractable under the expectation-based model. For the worst-case model, we develop a feasible training scheme that employs the sampling-based successive convex approximation algorithm to handle the unavailable worst-case noise condition and the non-convexity of the objective function. Furthermore, the convergence rates of both new designs are analyzed theoretically. Finally, simulations demonstrate the improvement in prediction accuracy and the reduction of the loss function achieved by the proposed designs.


## I Introduction

Future wireless computing applications demand higher bandwidth, lower latency, and more reliable connections for numerous devices [19]. With the burgeoning development of artificial intelligence technologies, edge devices generate a sheer volume of raw data that must be transmitted to the center, which results in excessive latency and privacy concerns [36, 18]. To solve this problem, federated learning has been proposed, marking a paradigm shift from computing at the center to computing at the edge devices [21].

Federated learning can be traced back to federated optimization, which decouples data acquisition and computation at the central server [16]. Federated optimization has recently been extended to deep learning platforms, where it became known as federated learning [21, 17]. Federated learning was designed as an iterative process between distributed learning at the edge devices and averaging of the updated local models at the central server. In contrast to conventional centralized training, federated learning is more communication-efficient because it uploads no raw data but only local models. To further exploit the enormous data available at edge devices, federated learning has been adopted in several scenarios of future wireless networks [32, 24, 33, 5]. Using federated learning in distributed MEC systems, the authors of [32] studied the trade-off between local computing and global aggregation under a given resource-constrained model. Moreover, the attractive property of lower latency drew attention to exploiting federated learning in latency-sensitive networks, such as vehicular networks [24, 5].

Due to the high-dimensional local models, as well as the long-term training process, the updating step of federated learning still consumes substantial communication resources. The key issues are to reduce the overhead of the updating steps and to accelerate the training process. One line of research on reducing the updating overhead transmits compressed gradient vectors via quantization schemes [2, 20]. Another line focuses on scheduling the edge devices to save transmission bandwidth [34, 9, 8, 14, 22]. Specifically, novel updating rules were devised that only allow the edge devices with significant training improvement [9], or the fastest-responding devices [8], to transmit their gradient vectors in each uploading round. Adapting the maximum number of transmission-permitted edge devices is another effective strategy when time is limited [22]. Furthermore, the authors of [20, 1] developed a momentum method and a cp-stochastic gradient descent algorithm to accelerate the local training process at each edge device. Exploiting the different computation capabilities of the nodes, an asynchronous federated learning scheme was proposed to reduce the training delay in [26].

The aforementioned pioneering works all assume that the received signals at both the central server and the edge nodes are perfectly detected. In practice, this is difficult in wireless communications due to imperfect channel estimation, feedback quantization, or delay in signal acquisition over fading channels. In other words, noise is unavoidable during the training process. Furthermore, neural networks have been shown to be far from robust to noise, which delays the training process [28].

In conventional centralized learning, a branch of research has been dedicated to eliminating the effects of noise. Several works used denoising autoencoders to filter noise, such as contractive auto-encoders and denoising auto-encoders [23, 29], while others represented the effect of noise as a penalty imposed during the training process, known as regularization [7, 6, 12, 13, 27]. In particular, adding noise with infinitesimal variance to the training inputs was proved to be equivalent to penalizing the norm of the weights for some training models [7, 6], whereas noise added to the model was shown to act as a regularizer appended to the loss function that pushes the model toward minima in flat regions [12, 13]. Besides, the key idea of the Dropout method is to randomly drop units from the neural network during training, which acts as a form of regularization [27]. However, to the best of our knowledge, noise reduction has not been studied for federated learning and it remains an open problem.

Motivated by these observations, we propose a robust federated learning method to alleviate the effects of noise in the training process. Robust designs are first introduced using the expectation-based model and the worst-case model. More specifically, the former model is based on the statistical properties of the noise uncertainty and the latter model represents the fixed uncertainty sets of noise. Furthermore, the corresponding convergence analysis is provided to illustrate the performance of the proposed designs. The main contributions of this work are summarized as follows.

• Robust design under the expectation-based model. Considering noise at the central server and the edge nodes, we formulate the training problem under the expectation-based model as parallel optimization problems, one for each edge node. To handle the statistical property of the noise, as well as the non-convexity of the objective function, we propose a regularization for loss function approximation (RLA) algorithm to approximate the objective function and develop the corresponding training process. The proposed solution is superior to the conventional scheme that ignores noise in terms of both prediction accuracy and the value of the loss function.

• Robust design under the worst-case model. The training problem under the worst-case model faces two challenges: the unavailable worst-case noise condition and the non-convexity of the objective function. We solve the former via a sampling method and tackle the latter by utilizing the successive convex approximation (SCA) algorithm to generate a feasible descent direction for the training process. Simulation results show that the proposed design outperforms the conventional one in prediction accuracy and loss function values.

• Convergence analysis for the proposed designs. The convergence properties of all proposed designs are derived. Specifically, the proposed training process under the expectation-based model converges at the same rate as the centralized training scheme that ignores noise, and the convergence of the proposed robust design under the worst-case model outperforms the conventional centralized one.

The remainder of the paper is organized as follows. Section II introduces the system model of the federated learning considering noise. Section III presents the formulated problem under the expectation-based model and the worst-case model. The robust design under the expectation-based model and its convergence analysis are developed in Section IV. Section V shows the robust design under the worst-case model and the corresponding convergence analysis. Simulation results are provided in Section VI.

Throughout the paper, we use boldface lowercase letters for vectors and lowercase letters for scalars. (⋅)^T denotes the transpose of a vector, |⋅| denotes the size of a set, 0 denotes the zero matrix, I denotes the identity matrix, and E{⋅} is the expectation operator.

## II System Model

We consider a distributed learning system consisting of a single central server and N edge nodes, as shown in Fig. 1. A shared learning process with the global model w is trained collaboratively by the edge nodes. Each node j collects a fraction of the labelled training data, denoted by the dataset D_j.

The loss function facilitates the learning, and we define it as f_i(w; x_i, y_i) for each data sample i, which consists of the input vector x_i and the output scalar y_i. For convenience, we rewrite it as f_i(w). Then the global loss function on all distributed datasets can be defined as

 F(w) = ∑_{j∈D} f_j(w) / |∪_i D_i|,   (1)

where |∪_i D_i| denotes the total size of the datasets, and the datasets are disjoint, i.e., D_i ∩ D_j = ∅ when i ≠ j. The training target is to minimize the global loss function via distributed learning, i.e., to find

 w∗=argminF(w). (2)

One way to search for the optimal w∗ is to upload the datasets of the distributed nodes, which contain the input vectors and output scalars, to the center; this is called centralized learning. The center completes the training process using the whole dataset and broadcasts the optimal model obtained from (1) and (2) to all nodes. However, the datasets are generally large in machine learning. Therefore, centralized learning requires substantial communication resources to collect the whole datasets. In other words, the training process will be limited by the communication rates.

Another way to solve (2) is in a distributed manner, as demonstrated in Fig. 1, which focuses on model-averaging for the global model w and is called federated learning. The global loss function cannot be directly computed without sharing datasets among the edge nodes in federated learning. The federated learning algorithm alternates between two stages. In the first stage, the local models w_j at each node are sent to the center for model-averaging via wireless links, and the center updates the global model w. In the second stage, the center broadcasts the current model w to all edge nodes at each iteration. Based on the received global model w, each node updates its own model to minimize the local loss function using its own dataset. The updating rules follow:

 Center: w = ∑_{j=1}^{N} D_j w_j / D,   (3a)
 Local:  w_j = argmin_w F_j(w),  j = 1, 2, ..., N,   (3b)

where w_j denotes the local model of node j, D denotes the size of the whole dataset, D_j denotes (with slight abuse of notation) the size of the dataset D_j, and F_j(w) is the local loss function of node j with dataset D_j, which can be written as

 F_j(w) = (1/|D_j|) ∑_{i∈D_j} f_i(w) = (1/D_j) ∑_{i∈D_j} f_i(w).   (4)

The training process iterates between (3b) and (3a) until convergence, after which each node obtains the optimal model w∗.
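To make the two-stage iteration concrete, here is a minimal sketch of the loop in (3a) and (3b), assuming simple quadratic local losses F_j(w) = 0.5‖w − c_j‖² (our own illustrative choice, not from the paper), for which the global optimum is the data-size-weighted mean of the c_j:

```python
import numpy as np

# Illustrative sketch of the federated averaging loop in (3a)/(3b), using
# quadratic local losses F_j(w) = 0.5*||w - c_j||^2 so that the global
# optimum is the data-size-weighted mean of the c_j. All names and
# parameter values here are our own assumptions.
rng = np.random.default_rng(0)
N, dim = 5, 3                      # number of edge nodes, model dimension
c = rng.normal(size=(N, dim))      # per-node loss minimizers
D_j = rng.integers(10, 100, N)     # local dataset sizes
D = D_j.sum()

def local_grad(j, w):
    return w - c[j]                # gradient of 0.5*||w - c_j||^2

w = np.zeros(dim)                  # global model
eta = 0.5                          # local step size
for t in range(200):
    # Local stage (3b): each node refines the broadcast model
    # (here, a single gradient step stands in for the local argmin).
    w_local = np.stack([w - eta * local_grad(j, w) for j in range(N)])
    # Center stage (3a): data-size-weighted model averaging.
    w = (D_j[:, None] * w_local).sum(axis=0) / D

w_star = (D_j[:, None] * c).sum(axis=0) / D   # weighted optimum
```

With these losses, each local step contracts toward c_j and the weighted average converges to the minimizer of (1).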

Since the center and the nodes are connected via wireless links, noise is inevitably introduced. The signal received at the center via local updating contains aggregation noise, and the broadcast global model received by each node contains broadcast noise; in each iteration they can be modeled as

 Aggregation: w̃ = w + Δw̃,   (5)
 Broadcast:   w̃_j = w̃ + Δw̃_j,  j = 1, 2, ..., N,

where Δw̃ refers to the aggregation noise at the center, and Δw̃_j refers to the broadcast noise for node j.
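As a quick numerical illustration of the model in (5), the following sketch adds independent zero-mean Gaussian aggregation and broadcast noise (variances chosen arbitrarily here) and checks that the total perturbation seen by a node has per-entry variance σ² + σ_j²:

```python
import numpy as np

# Sketch of the noisy exchange in (5): node j receives the global model w
# plus independent zero-mean aggregation and broadcast noise, so the
# combined perturbation has per-entry variance sigma^2 + sigma_j^2.
# The variances and dimensions here are illustrative assumptions.
rng = np.random.default_rng(1)
dim, trials = 4, 200_000
sigma, sigma_j = 0.1, 0.2          # aggregation / broadcast noise std
w = np.ones(dim)                   # current global model

agg_noise = rng.normal(0.0, sigma, (trials, dim))    # noise at the center
bc_noise = rng.normal(0.0, sigma_j, (trials, dim))   # noise for node j
w_tilde_j = w + agg_noise + bc_noise                 # received model

emp_var = (w_tilde_j - w).var()    # empirical per-entry variance
```

For independent zero-mean noise the variances simply add, which is why the two noise terms can be merged into one in the analysis that follows.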

Imperfect estimation is a major problem in wireless communication. In federated learning, it changes the optimization in the local update process. Estimation noise in the model blurs the output and makes it difficult for neural networks to fit the input data points precisely. Furthermore, neural networks have been shown to lack robustness against noise. In other words, the performance of the learning scheme may be significantly degraded by noise. To solve this problem, a robust design is proposed to guarantee a certain performance level under the uncertainty model.

## III Problem Formulation

In this section, we formulate the robust problem using two robust models. Owing to the different characteristics of the two models, the corresponding problems differ substantially; we formulate them in the following.

The aggregation noise and broadcast noise in (5) can be modelled as either stochastic or deterministic; the former leads to the expectation-based model and the latter to the worst-case model. Accordingly, each node updates its own model from a different initial point w̃_j, the corresponding local loss function is rewritten as F_j(w̃_j), j = 1, 2, ..., N, and the global loss function is rewritten as F(w̃). The iteration process still follows (3a) and (3b).

### III-A Training Under Expectation-based Model

The expectation-based model is a stochastic method to represent the random condition, which can only be used when the statistical properties of the noise are available [4]. The stochastic model assumes that the estimated value is a random quantity whose instantaneous value is unknown, but whose statistical properties, such as the mean and covariance, are available. In this case, the robust design usually aims at optimizing either the long-term average performance or the outage performance. The corresponding robust model is called the expectation-based model and is defined as follows.

###### Definition 1 (Expectation-based Robust Model [30, 3])

The expectation-based robust model describes the stochastic property of the noise, as shown in Fig. 2 (a). For node j, the entries of the uncertainty vector Δw̃_j are assumed to be Gaussian distributed with E{Δw̃_j} = 0 and E{Δw̃_j Δw̃_j^T} = σ_j² I, j = 1, 2, ..., N, and the aggregation noise at the center is assumed to satisfy E{Δw̃} = 0 and E{Δw̃ Δw̃^T} = σ² I.

With the assumption that the aggregation noise and the broadcast noise are Gaussian, we can combine them into a single Gaussian noise term Δw_j = Δw̃ + Δw̃_j, so that the received value for node j can be expressed as

 ~wj=w+Δwj,j=1,2,...,N, (6)

and Δw_j is Gaussian with E{Δw_j} = 0 and E{Δw_j Δw_j^T} = σ_ej² I, j = 1, 2, ..., N, where σ_ej² = σ² + σ_j².

Therefore, using the stochastic property of the noise, we should focus on improving the stochastic performance of the network. Furthermore, the optimization objective in federated learning is to find the local optimal model in (3b) and to utilize the combination method to find the global optimal model in (3a).

Since the combination method is deterministic, we only need to optimize the local model for each node. Based on the aforementioned analysis, we formulate the robust training problem under the expectation-based model for each node as

 P1: min_w  E‖F_j(w + Δw_j)‖²   (7)
     s.t.   E{Δw_j} = 0,  j = 1, 2, ..., N,
            E{Δw_j Δw_j^T} = σ_ej² I,  j = 1, 2, ..., N,

where the constraints in P1 represent the stochastic characteristics of the noise caused by imperfect estimation in wireless communication.

We aim at improving the stochastic performance of the training process. Due to the expectation calculation, the objective function is non-convex. To tackle this challenge, we add a regularizer to the loss function to approximate the objective function and represent the effect of noise. The corresponding federated learning process is provided in Section IV.

### III-B Training Under Worst-Case Robust Model

In contrast to the expectation-based model, the worst-case model is a deterministic method to represent the instantaneous condition: the noise has fixed uncertainty sets, and the performance is maximized under the worst uncertainty [25, 31]. Using the worst-case robust design, we can guarantee a performance level for any estimation realization in the uncertainty region. It applies to designs with strict constraints, and is more suitable for characterizing instantaneous estimated values with errors. The worst-case approach assumes that the actual estimated value lies in the neighborhood of a known nominal estimated value. The size of this region represents the amount of estimation uncertainty, i.e., the bigger the region, the more uncertainty there is. A brief definition of the worst-case model follows.

###### Definition 2 (Worst-Case Robust Model [30, 3])

The worst-case robust model assumes that the estimate lies in a known set of possible values, shown in Fig. 2 (b), although its exact value is unknown. The norms of the uncertainty vectors Δw̃_j and Δw̃ are bounded by spherical regions, which can be expressed as

 ‖Δw̃_j‖² ≤ σ_j²,  j = 1, 2, ..., N,   (8)
 ‖Δw̃‖² ≤ σ²,

where σ_j denotes the radius of the spherical uncertainty region of the broadcast noise, and σ denotes that of the aggregation noise.

Considering the superposition of the two noise terms, the uncertainty expands to a larger region with radius σ_wj. Therefore, we reformulate the received value at node j as

 ~wj=w+Δwj,j=1,2,...,N, (9)

where Δw_j = Δw̃ + Δw̃_j denotes the overall noise and satisfies ‖Δw_j‖² ≤ σ_wj².

Similarly, the optimization is to find the local optimal model in (3b), and follows the aggregation rules in (3a). Therefore, we formulate the robust training problem under the worst-case model as a min-max problem for each node

 P2: min_w max_{Δw_j}  F_j(w + Δw_j)   (10)
     s.t.  ‖Δw_j‖² ≤ σ_wj²,  j = 1, 2, ..., N,

where the constraint in P2 restricts the noise to a spherical region with radius σ_wj.

One challenge in solving the problem is that the worst-case noise condition may not be available; the other is the non-convexity of the objective function. We settle both challenges using a sampling method and the SCA algorithm to generate a feasible descent direction for the learning process in Section V.

## IV Robust Design Using Expectation-based Model

In this section, we consider the robust design in federated learning using the expectation-based model. We propose the corresponding RLA algorithm to represent the effects of noise for the expectation-based model so that the local optimal model can be found via optimization.

### IV-A Proposed Training Algorithm

We first model the noise under the expectation-based noise model, which is a stochastic method to represent the random condition, as shown in Definition 1. We aim at optimizing the average performance based on the expectation-based model. However, the random noise results in the non-convexity and the uncertain value of the local loss function.

To solve this problem, we propose the RLA algorithm to approximate the non-convex local loss function and utilize distributed gradient descent to find the optimal global model. The approximation method is inspired by previous works in which training with noise was approximated via regularization to enhance the robustness of neural networks [12]. We give a brief introduction in the following.

###### Lemma 1

Training with noise is equivalent to adding a regularizer Ω(w), which can be expressed as

 F(~w)≈F(w)+λΩ(w), (11)

where F(⋅) denotes the loss function, Ω(⋅) is the designed regularization function, w is the learning model, w̃ represents the learning model including noise, and λ is a constant.

###### Proof:

Refer to [11].

There are many regularization strategies in the aforementioned works [7, 6, 12, 13]. However, no specific regularizer is universally better than the others for a learning algorithm; in other words, there is no best form of regularization. We therefore need to develop a specific form of Ω(w) for the expectation-based model.

Motivated by this observation, we propose a new regularization term to approximate the original loss function for federated learning in the training process. Using the expectation-based model, we intend to reduce the impact of noise on the training process. Due to the stochastic property of the noise, we aim at optimizing the average performance in P1. We propose the corresponding training problem in the following.

###### Proposition 1 (Robust Training Under Expectation-based Model)

The robust training problem under the expectation-based model in P1 can be reformulated for each node as

 P3:minw Fej(w), (12)

where F_ej(w) denotes the new loss function for node j and can be written as

 F_ej(w) = F_j(w) + σ_e²‖∇F_j(w)‖².   (13)
###### Proof:

Under the expectation-based model, we can expand the objective function of P1 using a Taylor expansion, following [6], so that the objective loss function of the optimization problem is written as

 E‖F_j(w + Δw_j)‖² = E‖F_j(w) + Δw_j^T∇F_j(w) + o(Δw_j)‖²   (14)
                   ≈ E‖F_j(w)‖² + σ_e²‖∇F_j(w)‖².

The first term refers to the training process with perfect estimation in (3b), and the second term is the additional cost to the loss function in training, which is determined by the noise. Therefore, the objective loss function under the expectation-based model equals the original loss plus the regularizer σ_e²‖∇F_j(w)‖².
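The approximation in (14) can be checked numerically in a case where it is exact: for a linear loss F(w) = aᵀw + b the Taylor expansion has no higher-order terms, so E‖F(w + Δ)‖² = F(w)² + σ_e²‖∇F(w)‖² holds exactly. A small Monte Carlo sketch (all values illustrative):

```python
import numpy as np

# Monte Carlo check of (14) in a case where the Taylor expansion is exact:
# for a linear loss F(w) = a.T @ w + b, F(w + noise) = F(w) + noise.T @ a
# with no remainder, so E[F(w + noise)^2] = F(w)^2 + sigma_e^2 * ||a||^2.
# All values below are illustrative assumptions.
rng = np.random.default_rng(2)
dim, trials = 5, 400_000
a = np.ones(dim)                   # gradient of the linear loss
b = 0.3
w = np.full(dim, 0.5)
sigma_e = 0.1

F = a @ w + b                      # noiseless loss value
noise = rng.normal(0.0, sigma_e, (trials, dim))
vals = (w + noise) @ a + b         # noisy loss values F(w + noise)
emp = np.mean(vals**2)             # empirical E[F(w + noise)^2]
approx = F**2 + sigma_e**2 * (a @ a)
```

For losses with curvature the Hessian contributes additional O(σ_e²) terms, which (14) neglects; the regularizer captures only the gradient-norm penalty.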

###### Remark 1

The penalty on the first-order derivative of the loss function yields a preference for mappings that are locally invariant at the training points and pushes the global model into a flat region.

To solve the training problem in (12), we utilize the gradient descent algorithm to find the optimal local model for each node, and the details are shown as follows.

In each iteration, the local update at each node is performed based on the previous iteration and the first gradient of the proposed loss function, and the center aggregates the distributed models to find the optimal global model for the next iteration. Therefore, the update rules of the gradient descent can be written as:

 Center: w^{t+1} = ∑_{j=1}^{N} D_j w_j^{t+1} / D,   (15a)
 Local:  w_j^{t+1} = w^t − η∇F_ej(w^t),  j = 1, 2, ..., N,   (15b)

where η is the step size for all nodes. The iteration is repeated until a specific stopping condition is satisfied. This process is illustrated in Algorithm 1.
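A minimal sketch of the update rules (15a)/(15b), again assuming quadratic local losses F_j(w) = 0.5‖w − c_j‖² (our illustrative choice, not the paper's): the regularizer σ_e²‖∇F_j(w)‖² then contributes the gradient 2σ_e²(w − c_j), so the minimizer is unchanged and only the effective step size grows:

```python
import numpy as np

# Sketch of the RLA update rules (15a)/(15b) on quadratic local losses
# F_j(w) = 0.5*||w - c_j||^2. The regularized loss is
# F_ej(w) = F_j(w) + sigma_e^2*||grad F_j(w)||^2, whose gradient adds
# 2*sigma_e^2*(w - c_j). Parameter values are illustrative assumptions.
rng = np.random.default_rng(3)
N, dim = 4, 2
c = rng.normal(size=(N, dim))
D_j = rng.integers(5, 50, N).astype(float)
D = D_j.sum()
sigma_e2, eta = 0.05, 0.3

def grad_F_ej(j, w):
    g = w - c[j]                          # grad of F_j
    return g + 2.0 * sigma_e2 * g         # plus grad of the regularizer

w = np.zeros(dim)
for t in range(300):
    w_local = np.stack([w - eta * grad_F_ej(j, w) for j in range(N)])  # (15b)
    w = (D_j[:, None] * w_local).sum(axis=0) / D                       # (15a)

w_star = (D_j[:, None] * c).sum(axis=0) / D   # weighted optimum
```

For these losses the regularizer scales the gradient by (1 + 2σ_e²), consistent with the effective Lipschitz constant growing with σ_e² in the convergence analysis below.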

To solve the robust problem, we develop the training process by adding the regularizer to approximate the original loss function. We transform the stochastic and non-convex problem into a deterministic and convex one, so that we can utilize the gradient descent method to find the optimal global model w∗. The corresponding performance is shown through simulations in Section VI.

### IV-B Convergence Analysis

In this subsection, we derive the convergence property of the proposed design under the expectation-based model. To obtain the convergence rate of the proposed scheme under the expectation-based model, we first prove that the proposed federated learning is equivalent to a centralized learning, and then derive the corresponding convergence rate.

We start with the essential assumption of the loss function, which can be satisfied normally.

###### Assumption 1

We assume the following conditions for the loss function of all nodes:
(1) F_j(w) is convex,
(2) ∇F_j(w) is β-Lipschitz continuous, i.e., for any w, w′, ‖∇F_j(w) − ∇F_j(w′)‖ ≤ β‖w − w′‖,
(3) F_j(w) is L-Lipschitz continuous, i.e., for any w, w′, ‖F_j(w) − F_j(w′)‖ ≤ L‖w − w′‖.

Then, we give a brief definition of centralized learning.

###### Definition 3 (Centralized learning problem under expectation-based model)

Given the proposed local loss function in (13), the global loss function can be written as

 F_e(w) = ∑_{i=1}^{N} D_i F_ei(w) / D,   (16)

so that we aim at minimizing F_e(w) at the center using the same whole dataset. Therefore, the centralized learning problem is to find the optimal global model as

 P4:minw Fe(w). (17)

The optimization can be easily solved by using the gradient descent, and the center completes the iteration until the specific condition is met. We derive that the proposed federated learning is equivalent to the centralized learning problem under the expectation-based model as follows.

###### Lemma 2

Given the same initialization w⁰ and step size η under the expectation-based model, the proposed federated learning is equivalent to the centralized learning at each iteration t, which can be written as

 wt+1=wt−η∇Fe(wt). (18)
###### Proof:

Considering the global aggregation, we can obtain that

 w^{t+1} = ∑_{j=1}^{N} D_j w_j^{t+1} / D   (19)
         = ∑_{j=1}^{N} D_j (w^t − η∇F_ej(w^t)) / D
         = ∑_{j=1}^{N} D_j w^t / D − η ∑_{j=1}^{N} D_j ∇F_ej(w^t) / D
         = w^t − η∇F_e(w^t).
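The identity in (19) can also be verified numerically: averaging one local gradient step per node coincides with one centralized step using the weighted-average gradient, since ∇F_e is exactly that weighted average. A quick check with illustrative quadratic losses:

```python
import numpy as np

# Numeric check of the equivalence in (19): averaging one local gradient
# step per node equals one centralized step on the weighted-average loss,
# because grad F_e(w) = sum_j D_j * grad F_ej(w) / D. Quadratic losses
# with gradient (w - c_j) are used purely for illustration.
rng = np.random.default_rng(6)
N, dim, eta = 3, 4, 0.1
c = rng.normal(size=(N, dim))
D_j = rng.integers(1, 20, N).astype(float)
D = D_j.sum()
w = rng.normal(size=dim)

grads = np.stack([w - c[j] for j in range(N)])        # local gradients at w
# Federated side: local steps, then weighted averaging at the center.
w_fed = (D_j[:, None] * (w - eta * grads)).sum(axis=0) / D
# Centralized side: one step with the weighted-average gradient.
w_cen = w - eta * (D_j[:, None] * grads).sum(axis=0) / D
```

The two iterates coincide exactly, which is what reduces the convergence analysis to that of the equivalent centralized scheme.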

To prove the convergence of the proposed distributed learning, we only need to derive that the equivalent centralized learning is convergent.

###### Lemma 3

Given the original loss function under Assumption 1, for the constants β and η the loss function satisfies

 F_i(w^t) − F_i(w∗) ≤ ‖w⁰ − w∗‖² ⋅ 1/(η(1 − βη/2)) ⋅ 1/t,   (20)

where w⁰ is the initialization point of the training process.

Refer to [10].

###### Lemma 4

F_e(w) is convex, and ∇F_e(w) is Lipschitz continuous with constant (1 + σ_e²)β.

###### Proof:

We can see that F_e(w) is a linear combination of the F_ej(w) via (16). The lemma then follows straightforwardly from the convexity property.

###### Proposition 2 (Convergence Under Expectation-based Model)

Algorithm 1 yields the following convergence property for the optimization of the global loss function under the expectation-based model

 F_e(w^t) − F_e(w∗) ≤ ‖w⁰ − w∗‖² ⋅ 1/(η(1 − (1 + σ_e²)βη/2)) ⋅ 1/t,   (21)

where w⁰ is the initialization point of the training process. This means the convergence rate is O(1/t).

###### Proof:

The proposed loss function of the node is

 Fej(w)=Fj(w)+σ2e∥∇Fj(w)∥2. (22)

Taking its derivative, we can obtain

 ∇F_ej(w) = ∇F_j(w) + σ_e²∇tr(∇F_j(w)∇F_j(w)^T)   (23)
          = ∇F_j(w) + σ_e²∇F_j(w)
          = (1 + σ_e²)∇F_j(w).

Following Lemma 4, the loss function F_ej of node j is convex with a (1 + σ_e²)β-Lipschitz gradient. Therefore, F_ej satisfies

 F_ej(w) − F_ej(w∗) ≤ ‖w⁰ − w∗‖² ⋅ 1/(η(1 − (1 + σ_e²)βη/2)) ⋅ 1/t.   (24)

Furthermore, we conclude that the global loss function F_e satisfies (21). The optimization of the global loss function thus converges at O(1/t).

###### Remark 2

The proposed robust design under the expectation-based model converges at O(1/t). The convergence bound in (21) reduces to the one in (20) as σ_e → 0, i.e., it is then equivalent to the convergence behavior of training without noise. The convergence rate decreases as σ_e² increases, and in particular the proposed design cannot converge when η ≥ 2/((1 + σ_e²)β). The comparison between the proposed design and centralized training is simulated in Section VI.

## V Robust Design Using Worst-case Model

In this section, we solve the optimization problem using the worst-case model. To solve the uncertainty of noise and the non-convexity problem, we utilize the sampling-based SCA method to represent noise and approximate the objective loss function. We then propose the training process for the robust federated learning and finally derive the convergence property of the proposed design.

### V-A Proposed Training Algorithm

The training process is proposed to solve the learning problem under the worst-case model. We utilize the sampling-based SCA method to approximate the original objective function, and develop the corresponding updating rules.

The feasible sets of both the local model and the noise are convex, and a saddle point always exists. However, the unavailability of the noise makes finding the global minimum point NP-hard in general. Therefore, the problem faces two main issues: i) the impossibility of estimating the accurate worst-case value of the noise; ii) the non-convexity of the objective function, which makes the optimization intractable.

Considering the uncertainty of noise, it is often possible to obtain a sample of the random noise, either from past data or from computer simulation as shown in [15]. Consequently, one may consider an approximate solution to the problem based on sampling, known as the sample average approximation (SAA) method, and we give a brief introduction as follows.

###### Lemma 5

The SAA method finds the optimal x∗ for the stochastic objective in the optimization problem

 x∗ = argmin_x E[f(x; ξ)],   (25)

where f is a given function affected by the random vector ξ, which follows the distribution P. However, the distribution is unknown, and only sample values of the random vector are available. To solve this problem, the SAA approach approximates the problem by solving

 x̂∗ = argmin_x (1/N) ∑_{j=1}^{N} f(x; ξ_j),   (26)

where ξ_j is the j-th random sample of the random vector ξ, and the collection of realizations is independent and identically distributed.

###### Proof:

Refer to [15].
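A minimal SAA sketch: for f(x; ξ) = (x − ξ)² with ξ ~ N(μ, 1), the true minimizer of E[f] in (25) is μ, and the SAA minimizer in (26) is the sample mean, which approaches μ as the sample size grows (all values illustrative):

```python
import numpy as np

# Minimal sketch of the sample average approximation (SAA) of (25)-(26):
# for f(x; xi) = (x - xi)^2 with xi ~ N(mu, 1), the true minimizer of
# E[f] is mu, while the minimizer of the sample average is the sample
# mean (closed form). The distribution and mu are illustrative.
rng = np.random.default_rng(4)
mu = 1.5
samples = rng.normal(mu, 1.0, 50_000)   # realizations of xi

# argmin_x (1/N) sum_j (x - xi_j)^2  ==  sample mean
x_hat = samples.mean()
```

As the number of samples grows, x̂∗ converges to the true minimizer μ, which is the guarantee the sampling-based scheme below relies on.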

Motivated by this method, we consider sampling the noise in the objective function of P2, and we can easily see that the worst-case noise occurs on the boundary of the uncertainty region. Based on this, we propose the sampling-based method. At each iteration of each node, a new realization of the noise is obtained and the objective function is updated via the loss function as follows,

 F_j(w + Δw_j) ≈ F_j(w + Δw_j^t),  t = 1, 2, ...,   (27)

where the sample Δw_j^t satisfies ‖Δw_j^t‖² = σ_wj².
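Drawing noise samples on the boundary of the spherical region can be sketched by normalizing a Gaussian vector to radius σ_wj, a standard way to sample uniformly from a sphere (the dimension and radius here are illustrative):

```python
import numpy as np

# Sketch of drawing a noise sample on the boundary of the spherical
# uncertainty region used in (27): normalize a Gaussian vector so that it
# lies exactly on the sphere of radius sigma_wj, where the worst case of
# the constraint ||noise||^2 <= sigma_wj^2 occurs. Values illustrative.
rng = np.random.default_rng(5)
dim, sigma_wj = 6, 0.3

def sample_boundary_noise():
    v = rng.normal(size=dim)
    return sigma_wj * v / np.linalg.norm(v)   # uniform on the sphere

d = sample_boundary_noise()
```

Normalizing an isotropic Gaussian yields a direction uniform on the sphere, so repeated samples cover the boundary of the uncertainty region.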

This provides a simple way to approximate the objective function under perfect estimation, but the non-convexity of the objective function remains. To tackle this challenge, we utilize the SCA scheme to construct convex surrogates of the objective function.

###### Lemma 6

The SCA algorithm approximates an arbitrary function f(x) by an expansion around a given point x^t in the feasible set. It can be simply written as

 f(x) ≈ f̃(x, x^t) = ρ^t f(x) + (1 − ρ^t)⟨x − x^t, g(x^t)⟩,   (28)

where ρ^t is a weight sequence, and g(x^t) is the weighted average of the gradients, expressed as

 g(xt)=(1−ρt)g(xt−1)+ρt∇f(xt). (29)
###### Proof:

Refer to [35].

Combining the SAA and SCA methods, we propose the sampling-based SCA algorithm to solve the robust training problem under the worst-case model of P2 in the following.

###### Proposition 3 (Robust Training Under Worst-case Model)

For the robust training problem under the worst-case model in P2, the optimization problem of each node can be reformulated as

 P5:minw Fwj(w;wt,Δwtj) (30)

where Δw_j^t is a sequence obtained by sampling the noise such that ‖Δw_j^t‖² = σ_wj², and F_wj(w; w^t, Δw_j^t) denotes the surrogate loss function for node j, expressed as

 F_wj(w; w^t, Δw_j^t) = ρ^t F_j(w + Δw_j^t) + λ‖w − w^t‖²   (31)
                        + (1 − ρ^t)⟨w − w^t, G_j^{t−1}⟩,

and G_j^t is an accumulation vector updated recursively according to

 G_j^t = (1 − ρ^t)G_j^{t−1} + ρ^t∇_w F_j(w + Δw_j^t),   (32)

with ρ^t being a sequence to be properly chosen, t = 1, 2, ....

###### Proof:

Following the standard construction of the SCA algorithm, the objective function at iteration t is determined by the latest updated model w^t and consists of the original function F_j and the accumulated gradient G_j^{t−1}. We develop the objective function as follows,

 F_wj(w; w^t, Δw_j^t) = ρ^t F_j(w + Δw_j^t) + (1 − ρ^t)⟨w − w^t, G_j^{t−1}⟩,   (33)

and G_j^t is an accumulation vector updated recursively according to

 G_j^t = (1 − ρ^t)G_j^{t−1} + ρ^t∇_w F_j(w + Δw_j^t),   (34)

with ρ^t being a sequence to be properly chosen at each iteration.

Notice that the expansion holds only when $\mathbf{w}$ is close to $\mathbf{w}^t$. We therefore add a regularizer as the cost of shrinking the gap between $\mathbf{w}$ and $\mathbf{w}^t$:

 \Omega_2(\mathbf{w}) = \|\mathbf{w}-\mathbf{w}^t\|^2. (35)

Therefore, we propose the local loss function as in (31).

###### Remark 3

Generally speaking, each node minimizes a sample approximation of the original intractable function. The first term in (31) is the sampled objective function. The second term is the cost that controls the pace of each iteration. The vector $\mathbf{G}_j^t$ in the last term represents the incremental estimate of the unknown gradient built from the samples collected over the iterations; when the parameter $\rho^t$ is properly chosen, the estimation accuracy increases as $t$ increases.
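The three terms of (31) and the accumulator update (32) can be sketched for a single node. The sketch below assumes a toy quadratic local loss $F_j(\mathbf{w})=\tfrac{1}{2}\|\mathbf{w}-\mathbf{c}_j\|^2$; the values of $\mathbf{c}_j$, $\lambda$, $\rho^t$, and the sampled noise are illustrative placeholders, not quantities from the paper.

```python
import numpy as np

c_j = np.array([1.0, 2.0])   # hypothetical per-node optimum
lam = 0.1                    # regularization weight lambda

def F_j(w):
    # Toy local loss standing in for the true F_j.
    return 0.5 * float(np.dot(w - c_j, w - c_j))

def grad_F_j(w):
    return w - c_j

def surrogate_loss(w, w_t, dw_t, G_prev, rho_t):
    # Eq. (31): sampled loss + proximal regularizer + linear gradient term.
    return (rho_t * F_j(w + dw_t)
            + lam * float(np.dot(w - w_t, w - w_t))
            + (1.0 - rho_t) * float(np.dot(w - w_t, G_prev)))

def update_G(G_prev, w, dw_t, rho_t):
    # Eq. (32): recursive accumulation of sampled gradients.
    return (1.0 - rho_t) * G_prev + rho_t * grad_F_j(w + dw_t)

w_t = np.zeros(2)
G = np.zeros(2)
rho = 0.5
dw = np.array([0.01, -0.01])          # one sampled noise realization
G = update_G(G, w_t, dw, rho)
print(surrogate_loss(w_t, w_t, dw, G, rho))
```

At $\mathbf{w}=\mathbf{w}^t$ the regularizer and linear term vanish, so the surrogate reduces to $\rho^t F_j(\mathbf{w}^t+\Delta\mathbf{w}_j^t)$, which matches the pacing interpretation in the remark above.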

Since the past optimized model $\mathbf{w}^t$ is involved, we utilize the conditional gradient descent method at each node. As before, the local updates are aggregated at the center and the new global model is broadcast to all nodes for the next iteration; this process repeats until a stopping condition is met. The iteration rule is briefly written as follows.

 \mathrm{Center}: \mathbf{w}^{t+1} = \frac{\sum_{j=1}^N D_j \mathbf{w}_j^{t+1}}{D}, (36a)
 \mathrm{Local}: \mathbf{w}_j^{t+1} = \mathbf{w}^t + \gamma^{t+1}\big(\hat{\mathbf{w}}_j^w(\mathbf{w}^t, \Delta\mathbf{w}_j^t) - \mathbf{w}^t\big), (36b)

where $\gamma^{t+1}$ is the step size, $\hat{\mathbf{w}}_j^w(\mathbf{w}^t, \Delta\mathbf{w}_j^t)$ is the minimizer of the local surrogate in (31), and $D = \sum_{j=1}^N D_j$. The iteration follows the process illustrated in Algorithm 2.

We develop the training process by utilizing the sampling-based SCA algorithm to approximate the training objective function of each node. By alternating between the conditional gradient descent step and the aggregation step, we obtain the optimal global model $\mathbf{w}^*$. The corresponding performance is shown through simulations in Section VI.
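The full loop of (36a)–(36b) can be sketched end to end. The sketch below is a hedged toy version: each node's surrogate is a quadratic with a closed-form minimizer, the worst-case noise is sampled on a ball of radius `eps`, and the names `local_minimizer`, `sample_noise`, `targets` are illustrative assumptions, not the paper's quantities.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, eps = 4, 3, 0.05
D = np.array([10.0, 20.0, 30.0, 40.0])        # local dataset sizes D_j
targets = rng.normal(size=(N, d))             # per-node optima of toy losses

def sample_noise():
    # Draw a noise realization on the boundary of the eps-ball
    # (worst-case noise lies on the boundary, per the discussion above).
    v = rng.normal(size=d)
    return eps * v / np.linalg.norm(v)

def local_minimizer(w_t, dw_t, j):
    # Closed-form minimizer of the toy surrogate
    # 0.5*||w + dw - target_j||^2 + lam*||w - w_t||^2, with lam = 1.
    lam = 1.0
    return (targets[j] - dw_t + 2 * lam * w_t) / (1 + 2 * lam)

w = np.zeros(d)
for t in range(1, 300):
    gamma = 2.0 / (t + 2)                      # diminishing step size
    local = []
    for j in range(N):
        w_hat = local_minimizer(w, sample_noise(), j)
        local.append(w + gamma * (w_hat - w))  # local update, eq. (36b)
    w = np.average(local, axis=0, weights=D)   # center aggregation, eq. (36a)
```

With the diminishing step size, the sampled noise averages out and the global model drifts toward the $D_j$-weighted average of the per-node optima, mirroring the aggregation behavior described above.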

### V-B Convergence Analysis

To obtain the convergence rate of the proposed scheme under the worst-case model, we similarly prove that the proposed federated learning is equivalent to centralized learning, and then derive the corresponding convergence rate.

Without loss of generality, we first give some assumptions before the further analysis.

###### Assumption 2

We assume the following conditions for the loss function $F_j^w$ of each node $j$:
(1) $F_j^w$ is convex,
(2) $F_j^w$ is Lipschitz continuous, i.e., $\|F_j^w(\mathbf{x}) - F_j^w(\mathbf{y})\| \le G\|\mathbf{x}-\mathbf{y}\|$ for any $\mathbf{x}$ and $\mathbf{y}$,
(3) $F_j^w$ is smooth, i.e., $\|\nabla F_j^w(\mathbf{x}) - \nabla F_j^w(\mathbf{y})\| \le L\|\mathbf{x}-\mathbf{y}\|$ for any $\mathbf{x}$ and $\mathbf{y}$.

We first give a brief introduction of the optimization problem in centralized learning under the worst-case model.

###### Definition 4 (Centralized learning problem under worst-case model)

Given the local loss function in (31), the global loss function at iteration $t$ is

 F^w(\mathbf{w}; \mathbf{w}^t, \Delta\mathbf{w}^t) = \frac{\sum_{i=1}^N D_i F_i^w(\mathbf{w}; \mathbf{w}^t, \Delta\mathbf{w}^t)}{D}, (37)

where $\mathbf{w}^t$ is the global model of the last iteration $t$, and $\Delta\mathbf{w}^t$ denotes the noise sampled in the last iteration $t$ from the feasible noise region.

Since we aim at minimizing the global loss function, the centralized learning problem is to find the optimal global model at iteration $t$, i.e.,

 \mathrm{P6}: \min_{\mathbf{w}} \; F^w(\mathbf{w}; \mathbf{w}^t, \Delta\mathbf{w}^t). (38)

The problem can be solved by the SCA algorithm, and the center repeats the iteration until a stopping condition is met.
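The weighted combination in (37) is simple to compute. A minimal sketch follows, where `local_surrogate` is an illustrative placeholder for the per-node loss $F_j^w$ in (31) and the dataset sizes `D` are assumed values.

```python
import numpy as np

# Dataset sizes D_j; the global weight of node j is D_j / D.
D = np.array([10.0, 30.0, 60.0])

def local_surrogate(j, w):
    # Hypothetical per-node surrogate values; stands in for F_j^w in (31).
    return (j + 1) * float(np.dot(w, w))

def global_loss(w):
    # Eq. (37): dataset-size-weighted average of the local losses.
    vals = np.array([local_surrogate(j, w) for j in range(len(D))])
    return float(np.dot(D, vals) / D.sum())

w = np.array([1.0, 1.0])
print(global_loss(w))  # → 5.0 for these placeholder losses
```

Because the combination is linear with non-negative weights, convexity of the local losses carries over to `global_loss`, which is exactly the property used in Lemma 8 below.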

In the following, we first prove that the federated learning is equivalent to the centralized learning under the worst-case model. Then, we show that the centralized learning under the worst-case model converges.

###### Lemma 7

Given the problem under Assumption 2, suppose that the weight sequence $\rho^t$ and the step size $\gamma^t$ are properly chosen; then the distributed learning equals the centralized learning at each iteration $t$, which is expressed as

 \hat{\mathbf{w}}^w(\mathbf{w}^t, \Delta\mathbf{w}^t) = \arg\min_{\mathbf{w}} F^w(\mathbf{w}; \mathbf{w}^t, \Delta\mathbf{w}^t), (39)

and the global model aggregation obeys the updating rules as

 \mathbf{w}^{t+1} = \mathbf{w}^t + \gamma^{t+1}\big(\hat{\mathbf{w}}^w(\mathbf{w}^t, \Delta\mathbf{w}^t) - \mathbf{w}^t\big). (40)
###### Proof:

For any iteration $t$, $\mathbf{w}^{t+1}$ satisfies

 \mathbf{w}^{t+1} = \frac{\sum_{j=1}^N D_j \mathbf{w}_j^{t+1}}{D} (41)
 = \frac{\sum_{j=1}^N D_j \big[\mathbf{w}^t + \gamma^{t+1}\big(\hat{\mathbf{w}}_j^w(\mathbf{w}^t, \Delta\mathbf{w}_j^t) - \mathbf{w}^t\big)\big]}{D}
 = \frac{\sum_{j=1}^N D_j \mathbf{w}^t}{D} + \gamma^{t+1}\frac{\sum_{j=1}^N D_j \hat{\mathbf{w}}_j^w(\mathbf{w}^t, \Delta\mathbf{w}_j^t)}{D} - \gamma^{t+1}\frac{\sum_{j=1}^N D_j \mathbf{w}^t}{D}
 = \mathbf{w}^t + \gamma^{t+1}\big(\hat{\mathbf{w}}^w(\mathbf{w}^t, \Delta\mathbf{w}^t) - \mathbf{w}^t\big).

To prove the convergence of the distributed learning, it thus suffices to prove that the equivalent centralized learning converges.

###### Lemma 8

Given the problem under Assumption 2, the global loss function $F^w$ also satisfies Assumption 2.

###### Proof:

According to the aggregation rule, the global loss function in (37) is a non-negative linear combination of the local loss functions $F_j^w$. Since convexity, Lipschitz continuity, and smoothness are preserved under such combinations, the conclusion follows.

###### Proposition 4 (Convergence Under Worst-case Model)

Given the problem under Assumption 2, suppose that the weight sequence $\rho^t$ and the step size $\gamma^t$ are properly chosen for the centralized learning. Let $\{\mathbf{w}^t\}$ be the sequence generated by the algorithm and $\mathbf{w}^*$ be the optimal global model. The global loss function converges at rate $O(\gamma^t)$, so that there exists a constant $M$ satisfying

 F^w(\mathbf{w}^t) - F^w(\mathbf{w}^*) \le M\gamma^t. (42)
###### Proof:

Firstly, the iterates are obtained via the updating rules in (39) and (40). Furthermore, according to Lemma 8, $F^w$ also satisfies Assumption 2. Invoking the first-order optimality conditions of the minimizer $\hat{\mathbf{w}}^{w,t}$, we have

 \rho^t\langle \mathbf{w}^t - \hat{\mathbf{w}}^{w,t}, \nabla\tilde{F}(\hat{\mathbf{w}}^{w,t}, \Delta\mathbf{w}^t)\rangle + \lambda\|\mathbf{w}^t - \hat{\mathbf{w}}^{w,t}\|^2 + (1-\rho^t)\langle \mathbf{w}^t - \hat{\mathbf{w}}^{w,t}, \mathbf{G}^{t-1}\rangle (43)
 = \rho^t\langle \mathbf{w}^t - \hat{\mathbf{w}}^{w,t}, \nabla\tilde{F}(\hat{\mathbf{w}}^{w,t}, \Delta\mathbf{w}^t) - \nabla\tilde{F}(\mathbf{w}^t, \Delta\mathbf{w}^t)\rangle + \langle \mathbf{w}^t - \hat{\mathbf{w}}^{w,t}, \mathbf{G}^t\rangle + \lambda\|\mathbf{w}^t - \hat{\mathbf{w}}^{w,t}\|^2 \ge 0

Considering the convexity of $\tilde{F}$, we can obtain that

 \langle \mathbf{w}^t - \hat{\mathbf{w}}^{w,t}, \mathbf{G}^t\rangle \le -\lambda\|\mathbf{w}^t - \hat{\mathbf{w}}^{w,t}\|^2. (44)

Given the smoothness in Assumption 2, there exists a constant $L$ so that

 F^w(\mathbf{w}^{t+1}) \le F^w(\mathbf{w}^t) + \gamma^{t+1}\langle \mathbf{w}^t - \hat{\mathbf{w}}^{w,t}, \nabla F^w(\mathbf{w}^t)\rangle + L(\gamma^{t+1})^2\|\mathbf{w}^t - \hat{\mathbf{w}}^{w,t}\|^2 (45)
 = F^w(\mathbf{w}^t) + L(\gamma^{t+1})^2\|\mathbf{w}^t - \hat{\mathbf{w}}^{w,t}\|^2 + \gamma^{t+1}\langle \mathbf{w}^t - \hat{\mathbf{w}}^{w,t}, \nabla F^w(\mathbf{w}^t) - \mathbf{G}^t + \mathbf{G}^t\rangle
 \le F^w(\mathbf{w}^t) - \gamma^{t+1}(\lambda - L\gamma^{t+1})\|\mathbf{w}^t - \hat{\mathbf{w}}^{w,t}\|^2 + \gamma^{t+1}\|\mathbf{w}^t - \hat{\mathbf{w}}^{w,t}\|\|\nabla F^w(\mathbf{w}^t) - \mathbf{G}^t\|