I Introduction
Future wireless computing applications demand higher bandwidth, lower latency, and more reliable connections with numerous devices [19]. With the rapid development of artificial intelligence technologies, edge devices generate a large volume of raw data that must be transmitted to the center, which results in excessive latency and privacy concerns [36, 18]. To solve this problem, federated learning has been proposed, bringing a paradigm shift from computing at the center to computing at the edge devices [21].
Federated learning can be traced back to federated optimization, which decouples data acquisition from computation at the central server [16]. Federated optimization has recently been extended to deep learning platforms, where it is known as federated learning [21, 17]. Federated learning was designed as an iterative process between distributed learning at the edge devices and averaging of the updated local models at the central server. In contrast to conventional centralized training, federated learning is more communication-efficient because it uploads only local models and no raw data. To further exploit the enormous amount of data available at edge devices, federated learning has been adopted in several scenarios of future wireless networks [32, 24, 33, 5]. Using federated learning and distributed mobile edge computing (MEC) systems, the authors of [32] studied the tradeoff between local computing and global aggregation under a given resource-constrained model. Moreover, the attractive property of lower latency has drawn attention to exploiting federated learning in latency-sensitive networks, such as vehicular networks [24, 5].
Due to the high-dimensional local models and the long-term training process, the updating step of federated learning still consumes substantial communication resources. The key issues are to reduce the overhead of the updating steps and to accelerate the training process. One line of research reduces the overhead of the updating step by transmitting a compressed gradient vector obtained through quantization [2, 20]. Another line focuses on scheduling the edge devices to save transmission bandwidth [34, 9, 8, 14, 22]. Specifically, some novel updating rules allow only the edge devices with significant training improvement [9], or the fast-responding devices [8], to transmit their gradient vectors in each uploading round. Adaptively setting the maximum number of edge devices permitted to transmit is another effective approach when time is limited [22]. Furthermore, a momentum method and a cp-stochastic gradient descent algorithm were developed in [20, 1] to accelerate local training at each edge device. Exploiting the different computation capabilities of the nodes, an asynchronous federated learning scheme was proposed in [26] to reduce the training delay.
The aforementioned pioneering works all assume that the received signals at both the central server and the edge nodes are perfectly detected. In practice, this is difficult to achieve in wireless communications due to imperfect channel estimation, feedback quantization, or delay in signal acquisition on fading channels. In other words, noise is unavoidable during the training process. Furthermore, neural networks have been shown not to be very robust to noise, which delays the training process [28].
In conventional centralized learning, a branch of research has been dedicated to eliminating the effects of noise. Several works used autoencoders to filter noise, such as contractive autoencoders and denoising autoencoders [23, 29], while others represented the effect of noise as a penalty imposed during the training process, known as the regularization scheme [7, 6, 12, 13, 27]. In particular, adding noise with infinitesimal variance to the input of the training dataset was proved to be equivalent to penalizing the norm of the weights for some training models [7, 6], whereas noise added to the model was shown to be equivalent to appending a regularizer to the loss function that pushes the model toward minima in flat regions [12, 13]. Besides, the key idea of the Dropout method is to randomly drop units from the neural network during training to simulate regularization [27]. However, to the best of our knowledge, noise reduction has not been studied for federated learning, and it remains an open problem.
Motivated by these observations, we propose a robust federated learning method to alleviate the effects of noise in the training process. Robust designs are first introduced using the expectation-based model and the worst-case model. More specifically, the former is based on the statistical properties of the noise uncertainty, and the latter represents fixed uncertainty sets of the noise. Furthermore, the corresponding convergence analysis is provided to illustrate the performance of the proposed designs. The main contributions of this work are summarized as follows.

Robust design under the expectation-based model. Taking into account the noise at the central server and the edge nodes, we formulate the training problem under the expectation-based model as parallel optimization problems, one for each edge node. To handle the statistical property of the noise, as well as the nonconvexity of the objective function, we propose a regularization for loss function approximation (RLA) algorithm to approximate the objective function and develop the corresponding training process. The proposed solution is superior to the conventional scheme that ignores noise in terms of both prediction accuracy and loss function value.

Robust design under the worst-case model. The training problem under the worst-case model faces two challenges: the worst-case (maximal or minimal) noise condition is unavailable, and the objective function is nonconvex. We solve the former via the sampling method and tackle the latter by using the successive convex approximation (SCA) algorithm to generate a feasible descent direction for the training process. The simulation results show that the proposed design outperforms the conventional one in both prediction accuracy and loss function value.

Convergence analysis for the proposed designs. The convergence properties of all proposed designs are derived. Specifically, it is found that the proposed training process under the expectation-based model converges at the same rate as the centralized training scheme that ignores noise, and that the convergence of the proposed robust design under the worst-case model outperforms that of the conventional centralized one.
The remainder of the paper is organized as follows. Section II introduces the system model of federated learning with noise. Section III presents the formulated problem under the expectation-based model and the worst-case model. The robust design under the expectation-based model and its convergence analysis are developed in Section IV. Section V presents the robust design under the worst-case model and the corresponding convergence analysis. Simulation results are provided in Section VI.
Throughout the paper, we use boldface lowercase letters to refer to vectors and lowercase letters to refer to scalars. Let $(\cdot)^{T}$ denote the transpose of a vector and $|\cdot|$ the size of a set; $\mathbf{0}$ denotes the zero matrix, $\mathbf{I}$ denotes the identity matrix, and $\mathbb{E}[\cdot]$ is the expectation operator.
II System Model
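We consider a distributed learning system consisting of a single central server and $N$ edge nodes, as shown in Fig. 1. A shared learning process with the global model $\mathbf{w}$ is trained collaboratively by the edge nodes. Each node $i$ collects a fraction $\mathcal{D}_i$ of the labelled training data.
The loss function facilitates the learning, and we define it as $f(\mathbf{w}, \mathbf{x}_j, y_j)$ for each data sample $j$, which consists of the input vector $\mathbf{x}_j$ and the output scalar $y_j$. For convenience, we rewrite $f(\mathbf{w}, \mathbf{x}_j, y_j)$ as $f_j(\mathbf{w})$. Then the global loss function on all distributed datasets can be defined as
$$F(\mathbf{w}) = \frac{1}{\left|\bigcup_{i} \mathcal{D}_i\right|} \sum_{j \in \bigcup_{i} \mathcal{D}_i} f_j(\mathbf{w}), \tag{1}$$
where $|\cdot|$ denotes the size of a dataset and the datasets satisfy $\mathcal{D}_i \cap \mathcal{D}_k = \emptyset$ when $i \neq k$. The training target is to minimize the global loss function through distributed learning, i.e., to find
$$\mathbf{w}^{*} = \arg\min_{\mathbf{w}} F(\mathbf{w}). \tag{2}$$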
One way to search for the optimal model is to upload the datasets of the distributed nodes, which contain only the input vectors and the output scalars, to the center; this is called centralized learning. The center completes the training process using the whole dataset and broadcasts the optimal model obtained from (1) and (2) to all nodes. However, the datasets in machine learning are generally large, so centralized learning requires substantial communication resources to collect them. In other words, the training process is limited by the communication rate.
Another way to solve (2) is the distributed approach demonstrated in Fig. 1, which focuses on model averaging for the global model $\mathbf{w}$ and is called federated learning. In federated learning, the global loss function cannot be computed directly without sharing datasets among the edge nodes. The federated learning algorithm alternates between two stages. In the first stage, the local models at each node are sent to the center over wireless links for model averaging, and the center updates the global model $\mathbf{w}$. In the second stage, the center broadcasts the current model to all edge nodes. Based on the received global model, each node updates its own model to minimize the local loss function using its own dataset. The updating rules are
$$\mathbf{w} = \frac{1}{D} \sum_{i=1}^{N} D_i \mathbf{w}_i, \tag{3a}$$
$$\mathbf{w}_i = \arg\min_{\mathbf{w}_i} F_i(\mathbf{w}_i), \tag{3b}$$
where $\mathbf{w}_i$ denotes the local model of node $i$, $D$ denotes the size of the whole dataset $\bigcup_i \mathcal{D}_i$, $D_i$ denotes the size of the dataset $\mathcal{D}_i$, and $F_i(\mathbf{w})$ is the local loss function of node $i$ with dataset $\mathcal{D}_i$, which can be written as
$$F_i(\mathbf{w}) = \frac{1}{D_i} \sum_{j \in \mathcal{D}_i} f_j(\mathbf{w}), \tag{4}$$
with $f_j(\mathbf{w})$ the loss on data sample $j$.
The training process iterates between (3b) and (3a) until convergence, after which each node obtains the optimal model.
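The alternation between the local step (3b) and the averaging step (3a) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a least-squares local loss, approximates the local minimization in (3b) by a fixed number of gradient steps, and uses hypothetical names (`local_update`, `aggregate`, `federated_training`) and synthetic noiseless data.

```python
import numpy as np

def local_update(w_global, X, y, step=0.1, n_steps=50):
    """Approximate (3b): gradient descent on an assumed local least-squares loss."""
    w = w_global.copy()
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= step * grad
    return w

def aggregate(local_models, sizes):
    """Model averaging as in (3a): weight each local model by its dataset size."""
    sizes = np.asarray(sizes, dtype=float)
    return np.average(np.stack(local_models), axis=0, weights=sizes)

def federated_training(datasets, dim, n_rounds=30):
    """Alternate between local updates at the nodes and averaging at the center."""
    w = np.zeros(dim)
    for _ in range(n_rounds):
        local_models = [local_update(w, X, y) for X, y in datasets]
        w = aggregate(local_models, [len(y) for _, y in datasets])
    return w

# Synthetic example (assumption): three nodes with noiseless linear labels.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
datasets = []
for n in (40, 60, 100):
    X = rng.normal(size=(n, 3))
    datasets.append((X, X @ w_true))

w_fed = federated_training(datasets, dim=3)
```

Only model vectors cross the link between nodes and center in this sketch; the raw pairs `(X, y)` never leave the node that holds them.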
Since the center and the nodes are connected through wireless links, noise is inevitably introduced. The model aggregated at the center from the local updates is therefore corrupted by aggregation noise, and the global model broadcast to each node in every iteration is corrupted by broadcast noise. The two noisy models can be expressed as
(5)  
where refers to the aggregation noise at the center, and refers to the broadcast noise for node .
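As a rough illustration of the two noise sources, the sketch below perturbs the aggregated model at the center and then perturbs each broadcast copy independently. The additive form, the Gaussian distribution, and the names `noisy_aggregate`, `noisy_broadcast`, `sigma_agg`, and `sigma_bc` are assumptions made for this example only.

```python
import numpy as np

def noisy_aggregate(local_models, sizes, sigma_agg, rng):
    """Size-weighted model average plus additive aggregation noise at the center."""
    weights = np.asarray(sizes, dtype=float)
    w = np.average(np.stack(local_models), axis=0, weights=weights)
    return w + rng.normal(scale=sigma_agg, size=w.shape)

def noisy_broadcast(w_global, n_nodes, sigma_bc, rng):
    """Each node receives its own independently perturbed copy of the global model."""
    return [w_global + rng.normal(scale=sigma_bc, size=w_global.shape)
            for _ in range(n_nodes)]

rng = np.random.default_rng(0)
local_models = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
w_center = noisy_aggregate(local_models, sizes=[10, 30], sigma_agg=0.01, rng=rng)
received = noisy_broadcast(w_center, n_nodes=2, sigma_bc=0.01, rng=rng)
```

Each node therefore starts its next local update from a slightly different point, which is what the robust designs in the following sections are meant to tolerate.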
Imperfect estimation is a major problem in wireless communication. In federated learning, it changes the optimization in the local update process. The estimation error of the model blurs the output data points and makes it difficult for a neural network to fit the input data points precisely. Furthermore, neural networks have been shown not to be robust against noise. In other words, the performance of the learning scheme may be significantly reduced by noise. To solve this problem, robust design is proposed to ensure a certain level of performance under the uncertainty model.
III Problem Formulation
In this section, we formulate the robust problem using two robust models. Because the two models have different characteristics, the corresponding problems are also different. We state them in the following.
The aggregation noise and broadcast noise in (5) can be modeled either stochastically or deterministically. The former leads to the expectation-based model and the latter to the worst-case model. In both cases, each node updates its own model from a different, noise-perturbed initial point, and the local and global loss functions are rewritten accordingly. The iteration process still follows (3a) and (3b).
III-A Training Under Expectation-Based Model
The expectation-based model is a stochastic method for representing random conditions, which can be used only when the statistical properties of the noise are available [4]. The stochastic model assumes that the estimated value is a random quantity whose instantaneous value is unknown but whose statistical properties, such as the mean and the covariance, are available. In this case, the robust design usually aims at optimizing either the long-term average performance or the outage performance. The corresponding robust model is called the expectation-based model and is defined as follows.
Definition 1 (Expectation-Based Robust Model [30, 3])
The expectation-based robust model refers to the stochastic property of noise as shown in Fig. 2 (a). For each node, the entries of the broadcast uncertainty vector are assumed to be Gaussian distributed with known mean and covariance, and the aggregation noise at the center is assumed to follow the same type of Gaussian model.
With the assumption that the aggregation noise and the broadcast noise are Gaussian, their sum is another Gaussian noise, so that the received value for each node can be expressed as
(6) 
and is Gaussian with , and , , where .
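A quick numerical check of the summed-noise statement is given below. It assumes, for illustration only, that the two noise terms are independent and zero-mean with standard deviations `sigma_agg` and `sigma_bc` (hypothetical names and values); under that assumption the variance of the sum is the sum of the variances.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_agg, sigma_bc = 0.3, 0.4   # assumed standard deviations of the two noise terms
n = 200_000

# Draw independent aggregation and broadcast noise and add them.
total = rng.normal(scale=sigma_agg, size=n) + rng.normal(scale=sigma_bc, size=n)

empirical_var = total.var()
predicted_var = sigma_agg ** 2 + sigma_bc ** 2   # variances add under independence
```

If the two noise terms were correlated, a covariance term would have to be added to `predicted_var`.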
Therefore, given the stochastic property of the noise, we focus on improving the stochastic performance of the network. Furthermore, the optimization objective in federated learning is to find the local optimal model in (3b) and to use the combination rule in (3a) to find the global optimal model.
Since the combination rule is deterministic, we only need to optimize the local model for each node. Based on the aforementioned analysis, we formulate the robust training problem using the expectation-based model for each node as
(7)  
where the constraints in (7) represent the stochastic characteristics of the noise caused by imperfect estimation in wireless communication.
We aim at improving the stochastic performance for the training process. Due to the expectation calculation, the objective function is nonconvex. To tackle this challenge, we consider adding the regularizer into the loss function to approximate the objective function and to represent the effect of noise. We provide the corresponding federated learning process in Section IV.
III-B Training Under Worst-Case Robust Model
In contrast to the expectation-based model, the worst-case model is a deterministic method for representing instantaneous conditions: it uses fixed uncertainty sets and maximizes the performance under the worst uncertainty [25, 31]. Using the worst-case robust design, we can guarantee a performance level for any realization of the estimate in the uncertainty region. It applies to designs that require strict constraints and is more suitable for characterizing an instantaneous estimate with errors. The worst-case approach assumes that the actual value lies in an uncertainty region around a known nominal estimate. The size of this region represents the amount of uncertainty in the estimate, i.e., the larger the region, the greater the uncertainty. We give a brief definition of the worst-case model as follows.
Definition 2 (Worst-Case Robust Model [30, 3])
The worst-case robust model assumes that the estimate, which cannot be known exactly, lies in a known set of possible values, as shown in Fig. 2 (b). The norms of the broadcast and aggregation uncertainty vectors are bounded by spherical regions, which can be expressed as
(8)  
where denotes the radius of the spherical uncertainty region of the broadcast noise, while denotes the aggregation noise.
Considering the superposition of the two noise terms, the uncertainty expands to a larger spherical region whose radius is bounded by the sum of the two radii. Therefore, we reformulate the received value at each node as
(9) 
where denotes the whole noise and satisfies .
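The bound on the combined noise can be checked numerically with the triangle inequality. The sketch below is illustrative only: `sample_on_sphere`, `r_agg`, and `r_bc` are hypothetical names, and the radii are arbitrary values standing in for the two uncertainty radii in (8).

```python
import numpy as np

def sample_on_sphere(radius, dim, rng):
    """Draw a vector uniformly from the sphere of the given radius."""
    v = rng.normal(size=dim)
    return radius * v / np.linalg.norm(v)

rng = np.random.default_rng(2)
r_agg, r_bc = 0.2, 0.5   # assumed radii of the aggregation and broadcast noise regions

# Norm of the summed noise for many boundary realizations of both terms.
norms = [
    np.linalg.norm(sample_on_sphere(r_agg, 10, rng) + sample_on_sphere(r_bc, 10, rng))
    for _ in range(1000)
]
max_norm = max(norms)
```

Even when both terms sit on the boundary of their own regions, the summed noise never leaves the ball of radius `r_agg + r_bc`.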
Similarly, the optimization is to find the local optimal model in (3b), following the aggregation rule in (3a). Therefore, we formulate the robust training problem under the worst-case model as a min-max problem for each node
(10)  
where the constraints in represent the noise lies in a spherical region with radius .
One challenge in solving this problem is that the worst-case noise condition may not be available; the other is the nonconvex objective function. In Section V, we address these challenges using the sampling method and the SCA algorithm to generate a feasible descent direction for the learning process.
IV Robust Design Using Expectation-Based Model
In this section, we consider the robust design in federated learning using the expectation-based model. We propose the corresponding RLA algorithm to represent the effects of noise under the expectation-based model so that the local optimal model can be found via optimization.
IV-A Proposed Training Algorithm
We first model the noise under the expectation-based model, which is a stochastic method for representing random conditions, as shown in (7). We aim at optimizing the average performance based on the expectation-based model. However, the random noise makes the local loss function nonconvex and its value uncertain.
To solve this problem, we propose the RLA to approximate the nonconvex local loss function and use distributed gradient descent to find the optimal global model. The approximation method is inspired by previous works in which training with noise was approximated via regularization to enhance the robustness of neural networks [12]. We give a brief introduction in the following.
Lemma 1
Training with noise is equal to adding a regularizer , which can be expressed as
(11) 
where denotes the loss function, is the designed function, is the learning model, represents the learning model including noise, and is a constant.
Proof:
Refer to [11].
There are many regularization strategies in the aforementioned works [7, 6, 12, 13]. However, there is no specific regularizer that is universally better than any others for the learning algorithm. In other words, there is no best form of regularization. We need to develop a specific form of using the expectationbased model.
Motivated by this observation, we propose a new regularization term to approximate the original loss function for federated learning in the training process. Using the expectation-based model, we intend to reduce the impact of noise on the training process. Due to the stochastic property of the noise, we aim at optimizing the average performance in (7). We propose the corresponding training problem in the following.
Proposition 1 (Robust Training Under Expectation-Based Model)
The robust training problem under the expectation-based model in (7) for each node can be reformulated as
(12) 
where denotes the new loss function for node and can be written as
(13) 
Proof:
Under the expectation-based model, we can obtain the objective function of (7) using a Taylor expansion, following the work in [6], so that the objective loss function of the optimization problem is written as
(14)  
The first term corresponds to the training process with perfect estimation in (3b), and the second term is the additional cost in the training loss, which is determined by the noise. Therefore, the objective loss function under the expectation-based model is equivalent to the original loss with an added regularizer.
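To make the Taylor-expansion step concrete, the following is a generic second-order sketch, not necessarily the exact form of (14). It assumes a twice-differentiable loss $F$ and a perturbation $\Delta\mathbf{w}$ with zero mean and covariance $\sigma^{2}\mathbf{I}$; the symbols $F$, $\Delta\mathbf{w}$, and $\sigma^{2}$ are notation chosen here for illustration.

```latex
\begin{align}
\mathbb{E}\left[F(\mathbf{w}+\Delta\mathbf{w})\right]
 &\approx F(\mathbf{w})
   + \nabla F(\mathbf{w})^{T}\,\mathbb{E}[\Delta\mathbf{w}]
   + \tfrac{1}{2}\,\mathbb{E}\left[\Delta\mathbf{w}^{T}\,\nabla^{2}F(\mathbf{w})\,\Delta\mathbf{w}\right] \\
 &= F(\mathbf{w}) + \tfrac{\sigma^{2}}{2}\,\operatorname{tr}\left(\nabla^{2}F(\mathbf{w})\right).
\end{align}
```

The first-order term vanishes because the noise has zero mean, so the noise enters only through a variance-weighted penalty added to the noise-free loss. The regularizer actually used in (13) may be a further approximation of this penalty.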
Remark 1
The penalty on the first-order derivative of the loss function yields a preference for mappings that are locally invariant at the training points and drives the global model into a flat region.
To solve the training problem in (12), we utilize the gradient descent algorithm to find the optimal local model for each node, and the details are shown as follows.
In each iteration, the local update at each node is performed based on the previous iterate and the gradient of the proposed loss function, and the center aggregates the distributed models to obtain the global model for the next iteration. Therefore, the update rules of the gradient descent can be written as:
(15a)  
(15b) 
where is the step size for all nodes. The iteration is executed and it will stop if a specific condition is satisfied. This process is illustrated in Algorithm 1.
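One round of regularized local descent followed by aggregation might look as follows. This is a sketch under explicit assumptions rather than a transcription of Algorithm 1: the local loss is least squares, the regularizer is taken to be `lam` times the squared norm of the loss gradient (one possible first-order penalty), each node takes a single gradient step per round, and all names are hypothetical.

```python
import numpy as np

def grad_loss(w, X, y):
    """Gradient of the least-squares loss F(w) = ||Xw - y||^2 / (2n)."""
    return X.T @ (X @ w - y) / len(y)

def regularized_loss(w, X, y, lam):
    """Assumed robust loss: F(w) + lam * ||grad F(w)||^2."""
    residual = X @ w - y
    g = grad_loss(w, X, y)
    return 0.5 * np.mean(residual ** 2) + lam * g @ g

def grad_regularized(w, X, y, lam):
    """Gradient of the assumed robust loss; the Hessian of F is X^T X / n."""
    hessian = X.T @ X / len(y)
    g = grad_loss(w, X, y)
    return g + 2.0 * lam * hessian @ g

def robust_round(w_global, datasets, lam, step, sigma, rng):
    """Noisy broadcast, one regularized gradient step per node, then averaging."""
    local_models = []
    for X, y in datasets:
        w_recv = w_global + rng.normal(scale=sigma, size=w_global.shape)
        local_models.append(w_recv - step * grad_regularized(w_recv, X, y, lam))
    sizes = [len(y) for _, y in datasets]
    return np.average(np.stack(local_models), axis=0, weights=sizes)

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
datasets = []
for n in (50, 80):
    X = rng.normal(size=(n, 3))
    datasets.append((X, X @ w_true))

w = np.zeros(3)
for _ in range(1000):
    w = robust_round(w, datasets, lam=0.1, step=0.1, sigma=0.0, rng=rng)
w_noiseless = w
```

With `sigma=0.0` the rounds reduce to noise-free training and recover `w_true` on this synthetic data; setting `sigma > 0` makes each node start from a perturbed copy of the global model.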
To solve the robust problem, we develop the training process by adding a regularizer to approximate the original loss function. We transform the stochastic, nonconvex problem into a deterministic, convex one so that we can use the gradient descent method to find the optimal global model. The corresponding performance is shown through simulation in Section VI.
IV-B Convergence Analysis
In this subsection, we derive the convergence property of the proposed design under the expectation-based model. To obtain the convergence rate of the proposed scheme, we first prove that the proposed federated learning is equivalent to centralized learning, and then derive the corresponding convergence rate.
We start with the essential assumptions on the loss function, which are normally satisfied.
Assumption 1
We assume the following conditions for the loss function of all nodes:
(1) is ,
(2) is , i.e. for any , ,
(3) is , i.e. for any ,
Then, we give a brief definition of centralized learning.
Definition 3 (Centralized learning problem under expectation-based model)
Given the proposed local loss function in (13), the global loss function can be written as
(16) 
so that we aim at minimizing at the center by using the same whole datasets. Therefore, the centralized learning problem is to find the optimal global model as
(17) 
The optimization can be easily solved by gradient descent, and the center iterates until the stopping condition is met. We show that the proposed federated learning is equivalent to the centralized learning problem under the expectation-based model as follows.
Lemma 2
Given and under the expectationbased model, the proposed federated learning is equal to the centralized learning for each iteration , , which can be written as
(18) 
Proof:
Considering the global aggregation, we can obtain that
(19)  
To prove the convergence of the proposed distributed learning, we only need to derive that the equivalent centralized learning is convergent.
Lemma 3
Given the original loss function under Assumption 1, there exist constants and so that the loss function satisfies that
(20) 
where is the initialization point of .
Proof:
Refer to [10].
Lemma 4
is , and .
Proof:
We can obtain that is the linear combination of via (16). Straightforwardly from the convexity property, this lemma holds.
Proposition 2 (Convergence Under Expectation-Based Model)
Algorithm 1 yields the following convergence property for the optimization of the global loss function under the expectation-based model
(21) 
where is the initialization point of . It means the convergence rate is .
Proof:
The proposed loss function of the node is
(22) 
Taking its derivative, we obtain
(23)  
Following the Lemma 4, we can obtain that the loss function of the node is with . Therefore, satisfies
(24) 
Furthermore, we can develop the conclusion that is to satisfy (21). The optimization of the global loss function converges at .
Remark 2
The proposed robust design under the expectationbased model converges at . The convergence property as (21) is reduced to the one in (20) as , i.e., it is equivalent to the convergence property that is training without noise. The convergence rate will decrease with the increase in and the proposed design cannot converge when specifically. The comparison between the proposed design and the centralized training is simulated specifically in Section VI.
V Robust Design Using Worst-Case Model
In this section, we solve the optimization problem using the worst-case model. To handle the uncertainty of the noise and the nonconvexity of the problem, we use the sampling-based SCA method to represent the noise and approximate the objective loss function. We then propose the training process for robust federated learning and finally derive the convergence property of the proposed design.
V-A Proposed Training Algorithm
The training process is proposed to solve the learning problem under the worst-case model. We use the sampling-based SCA method to approximate the original objective function and develop the corresponding updating rules.
The feasible sets of both the local model and the noise are convex sets, and there always exists a saddle point. However, because the worst-case noise is unavailable, finding the global minimum point is, in general, NP-hard. Therefore, the problem faces two main issues: i) the impossibility of accurately estimating the noise under the worst condition; ii) the nonconvexity of the objective function, which makes direct optimization unavailable.
Considering the uncertainty of noise, it is often possible to obtain a sample of the random noise, either from past data or from computer simulation as shown in [15]. Consequently, one may consider an approximate solution to the problem based on sampling, known as the sample average approximation (SAA) method, and we give a brief introduction as follows.
Lemma 5
The SAA method is to find the optimal for the stochastic objective in the optimization problem as,
(25) 
where is a given function and affected by the random vector which follows the distribution . However, the distribution is unknown, and only sample values of the random vector are available. To solve this problem, the SAA approach approximates the problem by solving
(26) 
where is the random sample of the random vector , and the collection of realizations satisfies independent and identically distributed.
Proof:
Refer to [15].
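The sample average idea can be illustrated on a toy problem where the answer is known in closed form. The sketch assumes the objective $f(\mathbf{w}, \boldsymbol{\xi}) = \|\mathbf{w} - \boldsymbol{\xi}\|^{2}$, for which the sample-average objective is minimized by the sample mean; the function names and the Gaussian sampling distribution are chosen here for illustration and are not taken from [15].

```python
import numpy as np

def saa_objective(w, samples):
    """Sample average of f(w, xi) = ||w - xi||^2 over the drawn realizations."""
    return np.mean(np.sum((samples - w) ** 2, axis=1))

def saa_minimizer(samples):
    """For this quadratic f, the sample-average objective is minimized by the sample mean."""
    return samples.mean(axis=0)

# The optimizer only sees i.i.d. samples, never the distribution itself.
rng = np.random.default_rng(3)
true_mean = np.array([0.5, -1.0])
samples = rng.normal(loc=true_mean, scale=1.0, size=(5000, 2))
w_saa = saa_minimizer(samples)
```

As the number of samples grows, `w_saa` approaches the minimizer of the true expected objective, which for this choice of `f` is the distribution mean.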
Motivated by this method, we sample the noise in the objective function, and it can easily be shown that the worst-case noise occurs on the boundary of the uncertainty region. Based on the above consideration, we propose the sampling-based method. At each iteration at each node, a new realization of the noise is obtained and the objective function is updated via the loss function as follows,
(27) 
where satisfies .
It provides a simple way to approach the objective function under the perfect estimation, but the nonconvexity of the objective function is still not resolved. To tackle this challenge, we utilize the SCA scheme to maintain the convexity of the objective functions.
Lemma 6
The SCA algorithm approximates an arbitrary function by an expansion around a given point in the feasible set. It can be simply written as
(28) 
where is a sequence, and is the weight average of the first gradient and can be expressed as
(29) 
Proof:
Refer to [35].
Combining the SAA and SCA methods, we propose the sampling-based SCA algorithm to solve the robust training problem (10) under the worst-case model as follows.
Proposition 3 (Robust Training Under Worst-Case Model)
For the robust training problem under the worst-case model in (10), the optimization problem of each node can be reformulated as
(30) 
where is a sequence by sampling the noise satisfying that , is denoted as the loss function for the node , and expressed as
(31)  
and is an accumulation vector updated recursively according to
(32) 
with being a sequence to be properly chosen , .
Proof:
Following the SCA algorithm, the objective function at each iteration is determined by the latest updated model and consists of the original function and its gradient. We develop the objective function as follows,
(33) 
and is an accumulation vector updated recursively according to
(34) 
with being a sequence to be properly chosen at iterations respectively.
Notice that the expansion is established only when is close to . We add a regularizer as the cost of shrinking the gap between and as:
(35) 
Therefore, we propose the local loss function as in (31).
Remark 3
Generally speaking, each node minimizes the sample approximation of the original unstable function. The first term in (31) refers to the sample objective function. The second term refers to the cost which controls the pace for each iteration. The vector in the last term represents the incremental estimate of the unknown by samples collection over the iterations. When the parameter is properly chosen, and the estimation accuracy increases as increases.
Because the past optimized model is involved, we use the conditional gradient descent method at each node. Similarly, we aggregate the local updates at the center and broadcast the new global model for the next iteration. The aggregated model is broadcast to all nodes and used to complete the next iteration until the stopping condition is met. Given , the iteration rule is briefly written as follows.
(36a)  
(36b) 
where , . The iteration follows the process illustrated in Algorithm 2.
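The overall loop of sampling the noise, accumulating gradients, and aggregating can be sketched as below. This is a simplified stand-in for Algorithm 2, not a transcription of it: the surrogate is fully linearized with a proximal term (so its minimizer is available in closed form), the local loss is least squares, the noise is drawn on the boundary of an assumed uncertainty sphere, and the step-size sequences `rho` and `gamma`, the constant `tau`, and all function names are choices made for this illustration.

```python
import numpy as np

def sample_on_sphere(radius, dim, rng):
    """Draw a noise realization on the boundary of the uncertainty region."""
    v = rng.normal(size=dim)
    return radius * v / np.linalg.norm(v)

def grad_loss(w, X, y):
    """Gradient of the assumed least-squares local loss."""
    return X.T @ (X @ w - y) / len(y)

def sca_federated(datasets, dim, radius, n_iters=1000, tau=2.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    d = [np.zeros(dim) for _ in datasets]      # accumulation vector per node
    sizes = [len(y) for _, y in datasets]
    for t in range(1, n_iters + 1):
        rho, gamma = t ** -0.55, t ** -0.7     # assumed diminishing sequences
        local_models = []
        for i, (X, y) in enumerate(datasets):
            delta = sample_on_sphere(radius, dim, rng)
            g = grad_loss(w + delta, X, y)     # gradient at the perturbed model
            d[i] = (1.0 - rho) * d[i] + rho * g
            w_hat = w - d[i] / tau             # minimizer of the linearized surrogate
            local_models.append(w + gamma * (w_hat - w))
        w = np.average(np.stack(local_models), axis=0, weights=sizes)
    return w

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
datasets = []
for n in (50, 80):
    X = rng.normal(size=(n, 3))
    datasets.append((X, X @ w_true))

w_sca = sca_federated(datasets, dim=3, radius=0.05)
```

The recursive average `d[i]` smooths out the sampled noise across iterations, which is why the per-iteration gradient can be evaluated at a randomly perturbed model without destabilizing the update.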
We develop the training process by using the sampling-based SCA algorithm to approximate the training objective function for each node. By iterating between the conditional gradient descent and the aggregation step, we can obtain the optimal global model. The corresponding performance is shown through simulations in Section VI.
V-B Convergence Analysis
To obtain the convergence rate of the proposed scheme under the worst-case model, we similarly prove that the proposed federated learning is equivalent to centralized learning, and then derive the corresponding convergence rate.
Without loss of generality, we first state some assumptions before the further analysis.
Assumption 2
We assume the following conditions for the loss function of all nodes
(1) is convex,
(2) is , i.e., for any , and ,
(3) is , i.e., for any , and .
We first give a brief introduction to the optimization problem in centralized learning under the worst-case model.
Definition 4 (Centralized learning problem under worst-case model)
Given the local loss function in (31), we can obtain that the global loss function in iteration is
(37) 
where is the global model in last iteration , and denotes the sampled noise in last iteration , which satisfies .
Due to the fact that we aim at minimizing the global loss function, the centralized learning problem is to find the optimal global model in iteration , i.e.,
(38) 
The problem can be solved by the SCA algorithm, and the center completes the iteration until it meets the specific condition.
In the following, we first prove that federated learning is equivalent to centralized learning under the worst-case model. Secondly, we show that centralized learning under the worst-case model is convergent.
Lemma 7
Given the problem under Assumption 2, suppose that and step size and are chosen as and , , so that the distributed learning equals the centralized learning at iteration , which is expressed as
(39) 
and the global model aggregation obeys the updating rules as
(40) 
Proof:
For any iteration , satisfies
(41)  
To prove the convergence of the distributed learning, we only need to prove that the equivalent centralized learning is convergent.
Lemma 8
Given the problem under Assumption 2, we can achieve that the global loss function satisfies Assumption 2.
Proof:
According to the aggregation rules, the global loss function is written in (37), which is the linear combination of the local loss function . Straightforwardly from the convexity property, we can derive the conclusion.
Proposition 4 (Convergence Under Worst-Case Model)
Given problem under Assumption 2, suppose that and step size and are chosen as and , for the centralized learning. Let be the sequence generated by algorithm, be and be . The global loss function converges at so that there exists a constant satisfying
(42) 
Proof:
Firstly, we can obtain that , and via the updating rules. Furthermore, according to lemma, we have that also satisfies the Assumption 2. Invoking the firstorder optimality conditions of , we have
(43)  
Considering the convexity of the , we can obtain that
(44) 
Given under the Assumption 2, there will exist a constant so that
(45)  