With the development of mobile devices, huge quantities and diverse types of data have been generated, which promotes the utilization of machine learning technologies. However, when aggregated for centralized model training, privacy-sensitive data can lead to serious privacy leakage. To address this challenge, federated learning (FL) has been proposed to aggregate local model parameters, which are trained only on local raw data, into a global model to improve performance. In FL, mobile devices act as clients and upload their own models to the server. As a decentralized paradigm, FL significantly reduces the risk of privacy leakage by allowing clients to access only their own raw data.
Although FL realizes both efficient data utilization and data privacy protection in mobile network applications, it is fragile when various defects affect the global model during an FL process, such as malicious updates, poisoning attacks, low-quality data, and unstable network environments. Unfortunately, conventional approaches pay little attention to most of these defects. Therefore, an efficient approach to alleviating performance degradation caused by defective local models is strongly needed in FL. Existing research on blockchain-based FL has defined the concept of reputation, which reflects the reliability of each local model. Similarly, we evaluate model quality to measure how trustworthy a local model is. After learning the quality of each local model, we are motivated to design a deep neural network (DNN) to assign optimal weights to local models, so that the global model can maintain considerable performance whether or not defects exist.
In this paper, we propose DEfect-AwaRe federated soft actor-critic (DearFSAC), a novel FL approach based on deep reinforcement learning (DRL) that guarantees good performance of the FL process by dynamically assigning optimal weights to defective local models through model quality evaluation. Since DRL algorithms often fall into local optima, we adopt soft actor-critic (SAC) to find near-optimal solutions with more stable performance. Besides, as unbalanced data distribution in the replay buffer may deteriorate the training of the DRL model, two popular importance sampling techniques, prioritized experience replay (PER) and emphasizing recent experience (ERE), are employed. Furthermore, as local raw data is not accessible to the server, the high-dimensional model parameters trained on local raw data are the only information uploaded and fed into the DRL model. To avoid the curse of dimensionality, we design an embedding network using the auto-encoder framework to generate low-dimensional vectors containing model-quality features.
In summary, the main contributions of this paper are as follows:
To the best of our knowledge, we are the first to propose an approach that dynamically assigns weights to local models in defective scenarios based on DRL.
We design an auto-encoder based on network embedding techniques to evaluate the quality of local models. This module also accelerates the DRL training process.
Experimental results show that DearFSAC outperforms existing approaches and maintains considerable performance when encountering defects.
2.1 Federated Learning
Suppose we have one server and $N$ clients, where the data of the $k$-th client is sampled from its local raw dataset $\mathcal{D}_k$. The model parameters of the $k$-th client and the server are denoted as $\theta_k^t \in \mathbb{R}^d$ and $\theta_g^t \in \mathbb{R}^d$ at round $t$ respectively, where $d$ is the total number of parameters of one model. Then, the objective of the clients is converted into an empirical risk minimization as follows:

$$F_k(\theta_g^t) = \mathbb{E}_{x \sim \mathcal{D}_k}\big[\ell(\theta_g^t; x)\big],$$

where $\theta_g^t$ is downloaded from the server by the clients, and $\ell(\theta_g^t; x)$ represents the loss of $\theta_g^t$ on the local data $x$ sampled from $\mathcal{D}_k$. For the server, the objective is to find the optimal global model parameters:

$$\theta_g^{*} = \arg\min_{\theta_g} \sum_{k=1}^{N} \frac{|\mathcal{D}_k|}{\sum_{j=1}^{N} |\mathcal{D}_j|}\, F_k(\theta_g).$$
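In practice, the server-side objective above reduces to a dataset-size-weighted average of the clients' parameters, as in FedAvg. A minimal sketch in plain Python, with toy parameter vectors (the function name and data are illustrative, not the paper's implementation):

```python
def aggregate(local_params, sizes):
    """Weighted average of client parameter vectors,
    with weights proportional to local dataset sizes |D_k|."""
    total = sum(sizes)
    dim = len(local_params[0])
    return [sum(p[i] * n / total for p, n in zip(local_params, sizes))
            for i in range(dim)]

# Two toy clients with 2-dimensional "models".
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [10, 30]  # |D_1|, |D_2|
global_params = aggregate(clients, sizes)  # -> [2.5, 3.5]
```

The second client contributes three quarters of the average because it holds three quarters of the data.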
2.2 Deep Reinforcement Learning
In DRL, an agent, usually in DNN form, interacts with the environment by carrying out actions and obtaining rewards. The whole process can be modelled as a Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, in which $\mathcal{S}$ denotes the set of states and $\mathcal{A}$ denotes the set of actions.
$P(s_{t+1} \mid s_t, a_t)$ is the state transition function used to compute the probability of the next state $s_{t+1}$ given the current state $s_t$ and action $a_t$. The reward at time step $t$ is computed by the reward function $r(s_t, a_t)$, and future rewards are discounted by the factor $\gamma \in [0, 1]$.
At each time step $t$, the agent observes the state $s_t$ and then interacts with the environment by carrying out an action $a_t$ sampled from the policy $\pi(a_t \mid s_t)$, which is a distribution over actions given $s_t$. After that, the agent obtains a reward $r_t$ and observes the next state $s_{t+1}$. The goal is to find an optimal policy $\pi^{*}$ which maximizes the cumulative return $R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$.
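The cumulative return can be evaluated backwards from the tail of a reward sequence, which is how it is typically computed over a finite episode. A short sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = sum_k gamma^k * r_{t+k}, accumulated from the last reward backwards."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# With gamma = 0.5: 1 + 0.5 * (1 + 0.5 * 1) = 1.75
assert discounted_return([1.0, 1.0, 1.0], gamma=0.5) == 1.75
```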
In this section, we discuss the details of DearFSAC, a DRL-based approach to assigning optimal weights to defective local models in FL. In Section 3.1, we describe the entire process of our approach. In Section 3.2, we design the quality evaluation embedding network (QEEN) for dimension reduction and model quality evaluation. In Section 3.3, we adopt SAC to optimize the weight assignment policy, which achieves more stable convergence and more sufficient exploration than other actor-critic algorithms.
3.1 Overall Architecture of DearFSAC
The overall architecture of DearFSAC is shown in Fig. 2. At the first round, the global model parameters and the DRL action are randomly initialized. Then all clients train their own models locally, and $K$ of them are randomly selected to upload their model parameters $\theta_k^t$ and local training losses $l_k^t$. After receiving the uploaded information, the server feeds the local model parameters into QEEN and obtains the embedding vectors $e_k^t$. Next, the embedding vectors, local losses, and the last action are concatenated and fed into the actor network of the DRL model to get the current action $a_t$. Finally, using $a_t$, the server aggregates the local model parameters into the global model parameters and shares them with all clients. The whole process loops until convergence.
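One communication round of this loop can be sketched at a high level as follows. Every helper here (the QEEN embedding, the actor, the aggregation rule) is a toy stand-in for illustration, not the paper's actual networks:

```python
# Sketch of one DearFSAC communication round. All helpers are
# simplified stand-ins (assumptions), not the paper's trained models.

def qeen_embed(params):
    # Stand-in for QEEN: compress a parameter vector to one summary value.
    return sum(params) / len(params)

def actor(state, k):
    # Stand-in for the SAC actor: here it just returns uniform weights.
    return [1.0 / k] * k

def aggregate(param_list, weights):
    dim = len(param_list[0])
    return [sum(w * p[i] for w, p in zip(weights, param_list))
            for i in range(dim)]

def fl_round(global_params, local_updates, prev_action):
    # local_updates: list of (params, local_loss) from the selected clients.
    k = len(local_updates)
    embeds = [qeen_embed(p) for p, _ in local_updates]
    losses = [l for _, l in local_updates]
    state = embeds + losses + list(prev_action)  # concatenated DRL state
    action = actor(state, k)                     # weights for aggregation
    new_global = aggregate([p for p, _ in local_updates], action)
    return new_global, action

g, a = fl_round([0.0, 0.0], [([1.0, 1.0], 0.5), ([3.0, 3.0], 0.7)], [0.5, 0.5])
```

With the uniform-weight stand-in actor, the two toy clients are averaged, so `g` is `[2.0, 2.0]`.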
3.2 Dimension Reduction and Quality Evaluation
Based on an auto-encoder structure, QEEN is designed for both dimension reduction and quality evaluation. To train QEEN efficiently, all local model parameters are uploaded to the server, and several types of defects are added to half of them. We then design a reconstruction loss $L_{de}$ for the embedding of model parameters and a quality evaluation loss $L_{q}$. The auto-encoder on the server receives the model parameters as training data and is trained using both $L_{de}$ and $L_{q}$.
We feed each $\theta_k$ into the encoder, composed of two fully connected (FC) layers, and get the embedding vector of the $k$-th model:

$$e_k = f_{\mathrm{enc}}(\theta_k).$$

After obtaining all embedding vectors, we put $e_k$ into the decoder to produce a decoded representation $\hat{\theta}_k$ which approximates $\theta_k$. Different from conventional auto-encoders, we adopt network embedding and design the decoder as $L$ parallel FC layers $\{g_1, \dots, g_L\}$, where $L$ is the number of layers of the original model and $g_j$ is the $j$-th parallel FC layer corresponding to the $j$-th layer of the original model structure, $j \in \{1, \dots, L\}$. Next, for the $k$-th model, the embedding vector is fed into the $j$-th parallel layer to get the decoded parameters of the $j$-th layer of the original model:

$$\hat{\theta}_k^{(j)} = g_j(e_k),$$

and we concatenate the $\hat{\theta}_k^{(j)}$ layer by layer to obtain the entire decoded model parameters:

$$\hat{\theta}_k = \big[\hat{\theta}_k^{(1)}; \dots; \hat{\theta}_k^{(L)}\big].$$

To compute $L_{de}$, we use the mean square error (MSE) loss function:

$$L_{de} = \frac{1}{N} \sum_{k=1}^{N} \big\|\hat{\theta}_k - \theta_k\big\|_2^2.$$

As multiple defects have different impacts on local models, we define defect marks as the ground truth, denoted as $m_k$, whose value reflects the degree of defect. Next, we compare the defect marks with the quality evaluation marks $\hat{m}_k$. We feed $e_k$ into the quality evaluation module, composed of two FC layers, to get $\hat{m}_k$, which predicts the quality of the $k$-th model:

$$\hat{m}_k = f_{\mathrm{q}}(e_k).$$

Then we compute $L_q$:

$$L_q = \frac{1}{N} \sum_{k=1}^{N} \big(\hat{m}_k - m_k\big)^2.$$
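The per-layer parallel decoding and the reconstruction loss can be sketched as follows; the toy linear maps stand in for the parallel FC layers $g_j$ (an assumption for illustration only):

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def decode(embedding, layer_decoders):
    """Each decoder g_j maps the embedding to the j-th layer's parameters;
    concatenating the pieces yields the full decoded parameter vector."""
    decoded = []
    for g in layer_decoders:
        decoded.extend(g(embedding))
    return decoded

g1 = lambda e: [2.0 * e]            # toy decoder for layer 1 (1 parameter)
g2 = lambda e: [e + 1.0, e - 1.0]   # toy decoder for layer 2 (2 parameters)

theta_hat = decode(1.0, [g1, g2])   # -> [2.0, 2.0, 0.0]
loss_de = mse(theta_hat, [2.0, 2.0, 0.0])  # perfect reconstruction -> 0.0
```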
3.3 DRL for Optimal Weight Assignment
3.3.1 MDP Modelling:
To guarantee communication efficiency and fast convergence, $K$ clients are randomly selected among the $N$ clients at each round and upload their models to the server. After receiving this information as the current state, the DRL model outputs an action containing the weights of all selected models. The details of the state $s_t$, action $a_t$, and reward $r_t$ are defined as follows:
State $s_t$: At round $t$, the state can be denoted as a vector $s_t = [e_1^t, \dots, e_K^t, e_g^t, l_1^t, \dots, l_K^t, a_{t-1}]$, where $e_k^t$ denotes the embedding vector of the $k$-th client's model parameters, $e_g^t$ denotes the embedding vector of the server's model parameters, $l_k^t$ denotes the local training loss of the $k$-th local model, and $a_{t-1}$ denotes the action at the last round.
Action $a_t$: The action, denoted as $a_t = [w_1^t, \dots, w_K^t]$, is a weight vector calculated by the DRL agent for the randomly selected subset of model parameters at round $t$. All the weights in $a_t$ are within $[0, 1]$ and satisfy the constraint $\sum_{k=1}^{K} w_k^t = 1$. After obtaining the weight vector, the server aggregates the local model parameters into the global model as follows:

$$\theta_g^{t+1} = \sum_{k \in \mathcal{K}_t} w_k^t \, \theta_k^t,$$

where $\mathcal{K}_t$ is the set of all selected local models.
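One common way to enforce the simplex constraint on the action is to pass the actor's raw outputs through a softmax before aggregating; the sketch below assumes this mapping (the paper does not spell out the squashing function, so this is illustrative):

```python
import math

def softmax(xs):
    """Map raw actor outputs to weights in [0, 1] that sum to 1."""
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def aggregate(param_list, weights):
    """Weighted sum of local parameter vectors -> global parameters."""
    dim = len(param_list[0])
    return [sum(w * p[i] for w, p in zip(weights, param_list))
            for i in range(dim)]

w = softmax([0.0, 0.0, 0.0])  # identical logits -> uniform weights
theta_g = aggregate([[3.0], [6.0], [9.0]], w)
```

Equal logits give equal weights, so the aggregate lands at the mean of the three toy models.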
Reward $r_t$: The goal of DRL is to maximize the cumulative reward over the total time steps $T$, which is equivalent to finding the model with minimum loss as shown in Eq. (1). Therefore, we design a compound reward $r_t$ by combining three sub-rewards with appropriate weights.
In Eq. (13), the first sub-reward is defined to maximize the global model's accuracy. The exponential terms represent the accuracy gap, where $A_t$ is the global model's accuracy on the held-out validation set at round $t$, $A^{*}$ is the target accuracy, and $A_{avg}$ is the accuracy of the model aggregated by FedAvg; a positive constant in the exponent ensures exponential growth of the sub-reward. The second term is used as a time penalty at each round to encourage faster convergence.
Eq. (14) aims to provide auxiliary information for the agent to reduce exploration time. After obtaining the quality prediction mark $\hat{m}_k$ of the $k$-th local model from QEEN, we compute the MSE loss between the normalized action and the quality marks. Similarly, as a time penalty, Eq. (14) is set to be negative.
3.3.2 Adopting SAC to Solve MDP:
First, $K$ locally trained models are randomly selected to upload their parameters and local losses to the server. Through QEEN, we obtain the embedding vectors as part of the current state $s_t$. By feeding $s_t$ into the actor network, we obtain the current action $a_t$. After model aggregation, we get the reward $r_t$ and the next state $s_{t+1}$. At the end of each round, the tuple $(s_t, a_t, r_t, s_{t+1})$, denoted as $d_t$, is recorded in the buffer.
For each iteration, SAC samples a batch of tuples $d_t$ from the buffer and updates the DRL network parameters. To deal with poor sampling efficiency and data unbalance in DRL, we adopt two replay buffer techniques named ERE and PER to sample data with emphasis and priority. For the $k$-th update out of $K_u$ updates, we sample data uniformly from the most recent $c_k$ data points, defined as:

$$c_k = \max\Big( N_b \cdot \eta^{\,k \cdot 1000 / K_u},\; c_{\min} \Big),$$

where $\eta$ represents the degree of emphasis on recent data and $N_b$ is the maximum size of the buffer. After obtaining an emphasizing buffer according to $c_k$, the sampling probability of the $i$-th data point in PER is computed as:

$$P(i) = \frac{p_i^{\alpha}}{\sum_j p_j^{\alpha}}.$$
Here $\alpha$ is a hyperparameter determining the effect of the priority, and $\beta$ (used below) is a hyperparameter controlling the strength of the importance sampling correction. $p_i$ in Eq. (17) is the priority value of the $i$-th data point, defined via the temporal-difference error as:

$$p_i = \big| r_t + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big| + \epsilon,$$

where $\epsilon$ is a small positive bias and $Q$ is the action-value function formulated as:

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}\big[ Q(s_{t+1}, a_{t+1}) + \lambda \, \mathcal{H}\big( \pi(\cdot \mid s_{t+1}) \big) \big],$$

where $\mathcal{H}(\pi(\cdot \mid s))$ is the entropy of the policy $\pi$, formulated as:

$$\mathcal{H}\big( \pi(\cdot \mid s) \big) = - \mathbb{E}_{a \sim \pi}\big[ \log \pi(a \mid s) \big].$$
Next, we compute the importance sampling weight of the $i$-th data point, normalized by its maximum for stability:

$$w_i = \frac{1}{\max_j w_j} \Big( \frac{1}{N_b} \cdot \frac{1}{P(i)} \Big)^{\beta}.$$
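The ERE window, PER sampling probabilities, and importance weights above can be sketched together; the default values of $\eta$, $\alpha$, and $\beta$ below follow common choices in the cited papers and are assumptions, not the paper's tuned settings:

```python
# Sketch of ERE + PER sampling quantities (hyperparameter values assumed).

def ere_window(num_updates, k, buf_len, eta=0.996, c_min=64):
    """Most recent c_k points to sample from, for the k-th of K_u updates."""
    return max(int(buf_len * eta ** (k * 1000 / num_updates)),
               min(c_min, buf_len))

def per_probs(priorities, alpha=0.6):
    """P(i) proportional to p_i ** alpha."""
    scaled = [p ** alpha for p in priorities]
    s = sum(scaled)
    return [x / s for x in scaled]

def is_weights(probs, beta=0.4):
    """Importance sampling weights, normalized by their maximum."""
    n = len(probs)
    w = [(1.0 / (n * p)) ** beta for p in probs]
    m = max(w)
    return [x / m for x in w]

probs = per_probs([1.0, 1.0, 2.0])   # higher-priority point sampled more often
weights = is_weights(probs)          # rare samples get larger correction weights
```

Early updates draw from the whole buffer (`ere_window(K_u, 0, N_b) == N_b`), and the window shrinks toward recent data as `k` grows.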
After sampling, SAC updates the DRL model and aims to find a policy $\pi^{*}$ that maximizes both the total reward and the entropy, which leads to more stable convergence and more sufficient exploration:

$$\pi^{*} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \big[ r(s_t, a_t) + \lambda \, \mathcal{H}\big( \pi(\cdot \mid s_t) \big) \big],$$

where $\lambda$ is a trade-off coefficient.
In this section, we conduct various experiments to validate the performance of DearFSAC on defective local models. Specifically, we compare the test accuracy of DearFSAC with that of different approaches on four datasets in Section 4.2. Then, we try different numbers of defective models and degrees of defect to show the robustness of DearFSAC in Section 4.3. Besides, in Section 4.4, we discuss the effectiveness of QEEN through ablation experiments.
4.1 Experimental Setup
We validate the proposed DRL model on four datasets: MNIST, CIFAR-10, KMNIST, and FashionMNIST. For convenience, we call the three MNIST-style datasets X-MNIST. The setup is illustrated in Table 1. The X-MNIST datasets contain both IID and non-IID data, while the CIFAR-10 dataset contains only IID data.
4.1.2 Defect Types
We define the number of defective models and the degree of defect as controllable parameters. Then we design three types of defect:
Data contamination: We add standard Gaussian noise to each pixel in an image to obtain defective pixels.
Communication loss: We add standard Gaussian noise to each parameter in the last two layers to obtain defective parameters.
Malicious attack: For both IID and non-IID datasets, we shuffle the labels of each local training batch.
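The three defect types can be sketched as follows. Scaling the Gaussian noise by a degree-of-defect factor `mu` is an assumption for illustration; the helper names are not from the paper:

```python
import random

def data_contamination(pixels, mu, rng):
    """Add Gaussian noise (scaled by the degree of defect mu) to each pixel."""
    return [x + mu * rng.gauss(0.0, 1.0) for x in pixels]

def communication_loss(last_two_layers, mu, rng):
    """Add Gaussian noise to each parameter of the last two layers."""
    return [[p + mu * rng.gauss(0.0, 1.0) for p in layer]
            for layer in last_two_layers]

def malicious_attack(labels, rng):
    """Shuffle the labels of a local training batch."""
    shuffled = labels[:]
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)  # fixed seed for reproducibility
noisy = data_contamination([0.5, 0.5], 1.0, rng)
noisy_layers = communication_loss([[0.1], [0.2]], 0.5, rng)
flipped = malicious_attack([0, 1, 2, 3], rng)
```

Note that the malicious attack preserves the label set but destroys the image-label correspondence.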
To evaluate the performance of DearFSAC and compare it with other weight assignment approaches, we mainly identify three performance metrics as follows:
Test accuracy: the average accuracy on the test datasets over multiple runs.
Convergence round: the number of communication rounds needed to first achieve the target accuracy on the corresponding dataset.
Cumulative reward: the cumulative reward of the DRL approaches in each episode.
4.2 Comparisons across Different Datasets
In this subsection, we compare our approach with FedAvg, a rule-based strategy, and a supervised learning (SL) model. The rule-based strategy assigns weights only to models with no defects. The SL model consists of FC layers and is trained with defect marks. We compose the three types of defects at the same time to obtain the composite defect, which is adopted in both the DRL training process and the FL test. We conduct experiments on the FL training dataset for 100 rounds.
As shown in Table 2, we carry out the 100-round FL training process ten times and compare the test accuracy and convergence round of each approach. The results show that our approach significantly outperforms the other three approaches on all four IID datasets. Furthermore, we compare our approach with FedAvg with no defects in local models and find that our approach performs almost the same as FedAvg in the defectless setting. This is because the data distribution is IID, so that averaging weights is a near-optimal strategy, which shows that our approach converges to FedAvg in the simplest setting.
On the other hand, our approach also performs best on the non-IID datasets. As the data distributions differ greatly, the test accuracy of each approach decreases noticeably, especially for the rule-based strategy and the SL model. On non-IID KMNIST, the performance of the rule-based strategy is similar to that of FedAvg. These results show that fixed weights are not feasible on non-IID datasets. Besides, FedAvg with no defects needs more communication rounds on non-IID datasets than DearFSAC, which shows the speed advantage of DearFSAC.
All the above results show that our approach performs best whether or not there exist defects in local models, which verifies the generalization of our approach.
4.3 Defect Impact
In this subsection, we compare the performance of the above approaches on non-IID MNIST to study the impact of different numbers of defective models and degrees of defect.
First, we vary the number of defective models to study how it impacts performance. Fig. 4 shows that as the number of defective models increases, the accuracy decreases dramatically. When it is small, defects cause little impact on the global model; on the contrary, when it is relatively large, the accuracy of the global model becomes sensitive to defects. This also shows that FL has limited capability to resist defects. Compared with FedAvg, our approach performs more robustly even with many defective models.
In Table 3, we study how the degree of the composite defect affects performance. As the degree increases, the accuracy of FedAvg decreases dramatically while our approach holds a high and stable accuracy, which indicates that the accuracy of the FedAvg global model is quite sensitive to the degree of defect.
All the above experiments show that our approach is capable of adapting to various numbers and degrees of the composite defect, validating its robustness.
4.4 Effectiveness of QEEN
In this subsection, we study the effectiveness of QEEN by comparing the cumulative reward and accuracy of DearFSAC with those of the original SAC and an embedding SAC, where the embedding SAC adopts only an embedding network for dimension reduction. We compare the three versions of the DRL model on the IID and non-IID MNIST datasets, where training consists of multiple episodes and each episode contains a fixed number of rounds.
4.4.1 Cumulative Reward
In Fig. 5, for both IID and non-IID MNIST, the cumulative reward of DearFSAC increases rapidly at the beginning and gradually converges, while that of the embedding SAC fluctuates dramatically, which indicates that quality evaluation not only largely improves the accuracy, but also guarantees convergence speed and stability in DearFSAC. Besides, the cumulative reward of the original SAC is the worst, which means that the embedding network also matters for the good performance of DearFSAC.
In Fig. 6, the test accuracy of DearFSAC is significantly higher than that of the original SAC and the embedding SAC, which again supports our conclusions.
5 Conclusion and Future Work
In this paper, we propose DearFSAC, which assigns optimal weights to local models to alleviate performance degradation caused by defects. For model quality evaluation and dimension reduction, an auto-encoder named QEEN is designed. After receiving the embedding vectors generated by QEEN, the DRL agent optimizes the assignment policy via the SAC algorithm. In the experiments, we evaluate the performance of DearFSAC on four image datasets in different settings. The results show that DearFSAC outperforms FedAvg, the rule-based strategy, and the SL model. Specifically, our model exhibits high accuracy, stable convergence, and fast training speed whether or not there exist defects in the FL process.
In the future, it is worthwhile to investigate how to extend DearFSAC to a multi-agent framework for personalized FL in defective situations.
-  (2018) Deep learning for classical japanese literature. arXiv preprint arXiv:1812.01718. Cited by: §4.1.1.
-  (2018) A survey on network embedding. IEEE Transactions on Knowledge and Data Engineering 31 (5), pp. 833–852. Cited by: item 2, §4.2.
-  (2010) Cosine similarity scoring without score normalization techniques. In Odyssey, pp. 15. Cited by: §3.3.1.
-  (2018) Mitigating sybils in federated learning poisoning. arXiv preprint arXiv:1808.04866. Cited by: §1.
-  (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp. 1861–1870. Cited by: §1, §3.3.2.
-  (2009) Overview of supervised learning. In The elements of statistical learning, pp. 9–41. Cited by: §3.2.
-  (2019) Incentive mechanism for reliable federated learning: a joint optimization approach to combining reputation and contract theory. IEEE Internet of Things Journal 6 (6), pp. 10700–10714. Cited by: §1.
-  (2020) Reliable federated learning for mobile networks. IEEE Wireless Communications 27 (2), pp. 72–80. Cited by: §1.
-  (2000) Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §2.2.
-  (2009) Learning multiple layers of features from tiny images. Cited by: §4.1.1.
-  (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.1.1.
-  (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. Cited by: §2.1.
-  (2017) Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892. Cited by: §2.2.
-  (2015) Prioritized experience replay. arXiv preprint arXiv:1511.05952. Cited by: §1, §3.3.2.
-  (2018) Biscotti: a ledger for private and secure peer-to-peer machine learning. arXiv preprint arXiv:1811.09904. Cited by: §1.
-  Auto-encoder based data clustering. In Iberoamerican Congress on Pattern Recognition, pp. 117–124. Cited by: §1.
-  (2018) Reinforcement learning: an introduction. MIT press. Cited by: §2.2.
-  Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5552–5560. Cited by: §3.2.
-  (2019) Boosting soft actor-critic: emphasizing recent experience without forgetting the past. arXiv preprint arXiv:1906.04009. Cited by: §1, §3.3.2.
-  (2016) Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1225–1234. Cited by: §3.2.
-  (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §4.1.1.
-  (2019) Federated learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 13 (3), pp. 1–207. Cited by: §2.1.
-  (2018) Blockchain-based privacy preserving deep learning. In International Conference on Information Security and Cryptology, pp. 370–383. Cited by: §1.
-  Parallelized stochastic gradient descent. In NIPS, Vol. 4, pp. 4. Cited by: §3.2.