Prophet: Proactive Candidate-Selection for Federated Learning by Predicting the Qualities of Training and Reporting Phases

Federated Learning (FL) is viewed as a promising technique for future distributed machine learning. It permits a large number of mobile devices to collaboratively train a global model without exposing their local private data. Although network-connection constraints will be much relieved in the 5G/B5G era, training latency is still an obstacle preventing FL from being widely adopted. One of the most fundamental causes of large training latency is poor candidate selection of FL participants. To the best of our knowledge, the existing candidate-selection algorithms are all reactive: the FL parameter server only knows the currently-observed resources of all candidates. In a dynamic FL environment, the mobile devices selected by reactive candidate-selection algorithms are very likely to fail to complete the training and reporting phases of FL. To this end, we study proactive candidate-selection for FL in this paper. We first let each candidate device locally predict the qualities of both its training and reporting phases using an LSTM network. Then, the proposed candidate-selection algorithm is implemented with a Deep Reinforcement Learning (DRL) framework, which can adapt to the dynamically varying factors in the metropolitan edge computing environment. Finally, real-world trace-driven experiments show that the proposed proactive approach outperforms the existing reactive algorithms with respect to the ratio of valid participants and the test accuracy of the aggregated global FL model.


I Introduction

Federated Learning (FL) [mcmahan2016communication, bonawitz2019towards] is a branch of distributed machine learning that enables a group of distributed devices to train local models on their own local datasets. Thus, FL is a promising computing paradigm for future intelligent applications, especially under fifth-generation (5G) and beyond-5G (B5G) communication networks. For example, the FederatedAveraging (FedAvg) algorithm [mcmahan2016communication] helps predict the next word for users of Google's Gboard [hard2018federated] on their smartphones.

To realize a large-scale federated learning framework, a number of challenges must be addressed. For example, when attracting a large number of candidate devices to participate in federated learning, the communication latency and the network bandwidth between each distributed device and the FL parameter server are viewed as two typical bottleneck resources.

Although those two bottleneck resources will be much relieved in the 5G/B5G era, another particular challenge is to handle the highly dynamic mobility of mobile users. The uncertainties include the dynamic trajectories of mobile users, their individual daily smartphone-usage habits, unexpected cellular network fluctuations, etc. Both the mobility of mobile users in a metropolitan area and their smartphone-usage habits are difficult to predict accurately. Even if a group of candidate devices with good communication and computing resources is selected to participate in federated learning at a timeslot, no one can ensure that they will complete the local training and upload the updated model to the FL parameter server in time.

Fig. 1: Candidate selection for Federated Learning based on the predicted qualities of both network connections and computing capabilities within a training round (measured in timeslot).

Recently, Google's team [bonawitz2019towards] has implemented a scalable federated-learning production system by exploiting mobile devices. According to their technical report [bonawitz2019towards], each round of synchronous federated learning includes three major phases: (1) the selection of candidate devices; (2) the configuration of mobile devices and local training; and (3) the reporting of local training results.

Candidate selection, as the first essential phase, determines which group of candidate devices is chosen to participate in an FL training task. How to select the most appropriate group of mobile devices for an FL task is still an open issue, especially when the number of devices reaches billions in the near future.

To the best of our knowledge, the existing candidate-selection algorithms [mcmahan2016communication, chen2019joint, bonawitz2019towards] for distributed federated learning are all reactive. That is, candidate devices are selected only according to their currently-observed resources, such as the network-connection quality and the remaining computing capability. However, we argue that candidate selection that depends only on the currently-observed resources is short-sighted. For example, a device with good connection and computing resources at the selection phase may turn out to have a low computing capability during the training phase, or a low-quality network connection during the reporting phase. Either situation can cause a round of FL training to fail on that device. As a result, all the configuration effort previously devoted to this mobile device is in vain.

Therefore, in this paper, we propose an efficient proactive algorithm that performs a prophetic selection for federated learning by predicting mobile users' future qualities in both the training and reporting phases. Our proposed approach mainly includes two parts. The first part is the prediction of users' patterns with respect to their mobility trajectories and their App-usage habits on smartphones. The second part is a decision-making scheduling algorithm based on a deep reinforcement learning (DRL) framework.

The contributions of this paper are summarized as follows.

  • Firstly, different from the existing reactive approaches, we study the proactive candidate-selection for Federated Learning by predicting the qualities of both the training and reporting phases.

  • Secondly, to solve the proposed profit-maximization problem, we design a DRL-based algorithm that is able to cope with the highly uncertain events in the dynamic FL environment.

  • Finally, we developed a smartphone App to collect trace data for our experiments. The real-world trace dataset was then collected from our campus for 6 weeks. The trace-driven experimental results show that the proposed proactive candidate-selection algorithm outperforms the state-of-the-art baselines.

The remainder of this paper is organized as follows. Section II reviews the state-of-the-art studies. Section III describes the system model and problem formulation. Section IV elaborates the prediction scheme of the proposed approach. Then, Section V discusses the DRL-based candidate-selection algorithm. Section VI shows the real-world trace-driven experiments. Finally, Section VII concludes this paper.

II Related Work

II-A Proactive Provisioning based on Sequence Prediction

Understanding human mobility [gonzalez2008understanding] is a classic scientific problem. Many kinds of proactive service provisioning can be conducted by predicting human mobility. The most popular approaches to mobility prediction are based on Recurrent Neural Networks (RNN), such as Long Short-Term Memory (LSTM) [feng2018deepmove, zhou2019predictive, jiang2019deepurbanevent] and Gated Recurrent Unit (GRU) [feng2018deepmove, liao2018predicting]. Several representative studies are reviewed as follows. Feng et al. [feng2018deepmove] proposed DeepMove, which captures the complicated sequential transitions in human mobility using a multi-modal embedding RNN. Zhou et al. [zhou2019predictive] studied the cost-efficient server provisioning problem, which aims to proactively schedule the activation and deactivation of servers in edge computing environments. The proactive schedule is achieved by predicting the arriving data streams using an online learning algorithm. Next, aiming to predict the activity and the location of mobile users, Liao et al. [liao2018predicting] proposed a spatial activity topic model that brings the multi-task context into their RNN-based prediction approach. To forecast citywide crowd events, Jiang et al. [jiang2019deepurbanevent] proposed an LSTM-based multitask encoder-decoder framework, on which a data-driven dynamics prediction system has been built.

In addition, the Hidden Markov Model (HMM) is another popular method for modeling personalized sequences such as users' mobility and living habits. For example, based on a large-scale mobility dataset collected from LTE networks in a southern Chinese city, Lv et al. [Lv2017BigData] investigated spatiotemporal prediction and users' next destinations by employing an HMM-based predictor. Shi et al. [shi2019state] studied personalized sequential patterns based on a state-sharing sparse HMM under data scarcity.

Given the strengths of LSTM in predicting spatiotemporal sequences, we also adopt an LSTM-based prediction approach in the proposed proactive candidate-selection algorithm.

II-B Candidate Selection for Federated Learning

As the pioneering studies of federated learning, McMahan et al. [mcmahan2016communication] and Bonawitz et al. [bonawitz2017practical] introduced the FedAvg algorithm. For example, in [mcmahan2016communication], the authors proposed a communication-efficient learning framework that selects participating devices to train a global model using their rich, decentralized local datasets. The FL server then averages the local models that each mobile device obtains by stochastic gradient descent (SGD) and reports back.
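The server-side averaging step of FedAvg can be sketched as follows. This is a minimal illustration that assumes each reported model is a flat list of floats and that updates are weighted by local dataset size; the function name `fedavg` and the toy values are hypothetical.

```python
def fedavg(updates):
    """Sample-weighted average of local models.

    updates: list of (num_local_samples, weights) pairs, where weights is a
    flat list of floats with the same length for every device.
    """
    total = sum(n for n, _ in updates)
    dim = len(updates[0][1])
    return [sum(n * w[j] for n, w in updates) / total for j in range(dim)]

# Two devices: the second holds three times as much data as the first.
global_model = fedavg([(1, [0.0, 2.0]), (3, [4.0, 2.0])])
print(global_model)  # [3.0, 2.0]
```

Weighting by sample count means that a device with more local data pulls the global model further toward its own update.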

Based on the original FL protocol FedAvg described in [mcmahan2016communication], Nishio et al. [nishio2019client] proposed a new FL protocol named FedCS, which aims at efficient FL with heterogeneous mobile users in a mobile edge computing environment. In particular, the candidate selection in FedCS is performed by requesting the resource information of mobile users in real time. Thus, the FedCS protocol is essentially reactive.

Later, Chen et al. [chen2019joint] studied the implementation of an FL framework over wireless networks. To jointly consider candidate selection and resource allocation while minimizing the FL loss function, the authors first formulated an optimization problem. Then, they derived a closed-form expression of the expected convergence rate of the FL algorithm, aiming to quantify the effect of wireless factors on FL training.

Google's team [bonawitz2019towards] has recently published a technical report describing their experience of deploying a large-scale FL architecture. However, in their candidate-selection phase, the participants are chosen by a simple reservoir-sampling approach. Therefore, the FL protocol proposed in [bonawitz2019towards] leaves considerable room for designing a more sophisticated selection scheme with respect to a certain bias.

From this review of the state-of-the-art selection approaches, we make the following two observations.

  • To the best of our knowledge, the existing candidate-selection algorithms for FL are reactive. That is, the candidate selection over all mobile participants is performed only after their current device information has been received. Thus, the reactive manner incurs an inevitable resource-requesting latency. Moreover, the reactive manner can only see the computing environment of the current timeslot. That is why we call it short-sighted.

  • The existing studies usually assume that all mobile users have the same usage-behavior pattern, and thus simulate the CPU and network-connection states of mobile devices with statistical distribution models such as the Poisson and uniform distributions. However, this simple assumption could be unrealistic for capturing real-world user behavior in a highly uncertain computing environment.

In contrast, our proposed approach is proactive: the candidates are selected according to the predicted qualities of both the training and reporting phases. In addition, the traces of mobile users used in our experiments were collected from real volunteers over 6 weeks. These real traces give us confidence in the design of our proactive candidate-selection algorithm.

III System Model and Problem Formulation

III-A System Model

As illustrated in Fig. 1, we consider a mobile edge computing (MEC) environment built on metropolitan cellular networks. The synchronous computing architecture consists of a central FL parameter server, a group of base stations, and a large number of mobile devices with heterogeneous resources. The central FL parameter server initializes a global model for a specific computing task such as image classification. For each training task, the FL server recruits a group of mobile users (MUs) to join the task. The selected MUs are viewed as FL participants, which collaboratively train a global model. All MUs willing to join a specific FL task are denoted by the set $\mathcal{I}$.

Generally, each round of the synchronous FL training [bonawitz2019towards] includes three major phases: 1) selection, 2) training, and 3) aggregation, i.e., model reporting. At the beginning of each selection phase, each MU sends a participating-request message to the FL server. This request message contains the current resource information of the device (e.g., CPU utilization, state of the wireless connection). For convenience, we also call each round of FL training a timeslot, as illustrated in Fig. 2. All timeslots are recorded in the set $\mathcal{T}$. Thus, the deadline for each round of FL training is actually the length of a timeslot. In a real experimental setting, the length of a timeslot could be measured in milliseconds.

We assume that MU $i \in \mathcal{I}$ has an available CPU capacity $c_i^t$ at timeslot $t$, which represents the computing resource that it can currently apply to FL training. The network connection quality $q_i^t$ of each MU is measured by the cellular parameters that can be perceived by its smartphone.

| Symbol | Description |
|---|---|
| $\mathcal{I}$ | the set of all candidate devices for FL |
| $K$ | the maximum number of selected participants for each FL round |
| $\mathcal{T}$ | the set of FL training rounds (i.e., all timeslots) |
| $q_i^t$ | currently-observed connection quality of MU $i$ at timeslot $t$ |
| $\hat{q}_i^t$ | predicted connection quality of MU $i$ within round $t$ |
| $c_i^t$ | currently-observed available CPU capacity of MU $i$ at timeslot $t$ |
| $\hat{c}_i^t$ | predicted available CPU capacity of MU $i$ within round $t$ |
| $\xi_i^t$ | a binary indicator of whether device $i$ uploads its updated model parameters successfully to the FL server at the end of the reporting phase of round $t$ |
| $x_i^t$ | a binary variable that indicates whether candidate device $i$ is selected by the FL server at round $t$ |

TABLE I: Symbols and Notations

When the number of the above-mentioned candidate MUs reaches a certain amount, the FL server begins to select participants and sends the global model to the selected participants. A total deadline is set for all training rounds. Due to the limited wireless bandwidth and the deadline for each round, it is not a good idea to select as many participants as possible. Therefore, the FL server must select an appropriate subset of all candidate MUs at the very beginning of an FL training task. A good candidate selection not only maximizes the training efficiency, but also avoids waiting too long for slow participants that fail to complete their local training and/or to upload their local models before the deadline of a round.

Particularly, due to the mobility of these MUs, they may traverse different locations at different timeslots such that the quality fluctuations of cellular connection between MUs and the FL server are inevitable. Meanwhile, each of these MUs may use different applications (APPs) at different timeslots, such as playing high-computation games or watching high-resolution online videos, resulting in the dynamic changes of the remaining computing resources (i.e., the available CPU capacity) for the FL tasks.

As we can see from Fig. 1, not all selected participants can complete their local training. This is because only the current training resources can be observed at the beginning of each round (Fig. 2). Due to the highly dynamic mobility of MUs in the MEC environment, the qualities of both network connections and available computing capabilities could be very different at the end of each round of FL training. Therefore, if there exists an algorithm that can predict the qualities of those two factors, a good group of candidates can be selected in a proactive manner. Accordingly, we use $\hat{q}_i^t$ and $\hat{c}_i^t$ to denote the predicted qualities of the network connection and the CPU capacity of MU $i$ within timeslot $t$, respectively.

The other important symbols and notations are explained in Table I.

Fig. 2: The system model when deciding whether to select candidate $i$.

Iii-B Problem Formulation

As shown in Fig. 2, to denote whether a candidate device is selected by the FL server at the beginning of each timeslot, we define the following binary variable:

$$x_i^t = \begin{cases} 1, & \text{if candidate } i \in \mathcal{I} \text{ is selected by the FL server at round } t \in \mathcal{T}; \\ 0, & \text{otherwise.} \end{cases}$$

To represent whether a chosen candidate, i.e., a participant, successfully uploads its updated model to the FL parameter server, we then define a binary indicator $\xi_i^t$ as follows:

$$\xi_i^t = \begin{cases} 1, & \text{if participant } i \text{ uploads its updated model to the FL server by the end of round } t; \\ 0, & \text{otherwise.} \end{cases}$$

We then have the following formulation of the problem Prophet: proactive candidate-selection by predicting the qualities of training and reporting phases.

$$\max \; \sum_{i \in \mathcal{I}} x_i^t \, \xi_i^t, \quad \forall t \in \mathcal{T} \tag{1}$$

$$\text{s.t.} \; \sum_{i \in \mathcal{I}} x_i^t \le K, \quad \forall t \in \mathcal{T} \tag{2}$$

We call a participant valid if it successfully completes the training and reports its updated local model to the FL server within a timeslot. Thus, the objective function (1) states that the total number of valid participants should be maximized for each FL round $t$. Constraint (2) bounds the number of participants selected by the FL server in each round of training by the maximum number $K$. Note that, to solve this problem in a proactive manner, $\xi_i^t$ must be predicted using a certain prediction approach.
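As a minimal illustration of objective (1) and constraint (2), the sketch below counts the valid participants of one round and checks the selection budget. The function names, the dictionary layout, and the toy values are assumptions for illustration only; the upload indicator is supplied directly here rather than predicted.

```python
from typing import Dict

def is_feasible(selected: Dict[str, int], max_participants: int) -> bool:
    """Constraint (2): at most max_participants devices are selected in a round."""
    return sum(selected.values()) <= max_participants

def valid_participants(selected: Dict[str, int], uploaded: Dict[str, int]) -> int:
    """Objective (1) for one round: selected devices that also report in time."""
    return sum(selected[dev] * uploaded.get(dev, 0) for dev in selected)

# Toy round with four candidates (hypothetical device ids).
selected = {"mu1": 1, "mu2": 1, "mu3": 0, "mu4": 1}
uploaded = {"mu1": 1, "mu2": 0, "mu3": 1, "mu4": 1}

print(is_feasible(selected, max_participants=3))  # True
print(valid_participants(selected, uploaded))     # 2
```

In this toy round, device mu2 is selected but misses the deadline, so it consumes one of the three selection slots without contributing to the objective.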

IV Predicting the Next-State of Candidates

This section presents how the next-state of mobile devices, i.e., the qualities of their network connections and computing capabilities within a timeslot, is predicted with a Long Short-Term Memory (LSTM) network. The predicted next-state is the foundation of the proposed proactive candidate-selection algorithm.

Three important factors are believed to be the bottlenecks for a participant when training a task in an FL round: the size of the heterogeneous local data, the computing capability, and the network conditions [nishio2019client]. Here, we assume that the data size of each mobile device is relatively stable during the training of an FL task. Consequently, the major prediction targets are the other two factors, i.e., the computing capability and the network connection in the reporting phase, which together form the next-state of a mobile device.

IV-A Local Prediction for the Next-State of Each MU

When deciding where to perform the next-state prediction, we have two choices: 1) let all mobile devices report their local trace data to a central server, which then predicts the next-state of all MUs; or 2) let the mobile devices predict their individual next-state using their local datasets. Considering users' privacy, it is unsuitable, and sometimes illegal, to aggregate such sensitive trajectory data in a central server. Furthermore, when the number of MUs reaches a large scale, the central server will not be able to predict the next-state of every MU under the first choice. Thus, we adopt the second choice, which is shown in Fig. 8. Each mobile device executes the LSTM-based local prediction algorithm to predict its own next-state, which is then fed back to the candidate-selection algorithm.

IV-B Trace Collection using a Self-Developed App

To collect the trace dataset of mobile users, we first developed an Android App named Tracer. This App, together with the technical report on the dataset collection, is available for download from the homepage of our lab. We then recruited about 100 students as volunteers, who helped us collect their daily routines and App-usage habits on smartphones. All trace datasets were collected on the campus of Sun Yat-Sen University (Guangzhou) for over 6 weeks. Each volunteer's smartphone reported the collected data to the central server located at our lab every day. In detail, the dataset mainly includes three categories:

  • The GPS data, i.e., the latitude and longitude, of each MU, which are used to calculate the Places of Interest (POIs) [Lv2017BigData] of candidates.

  • The cellular parameters of channel signals, i.e., the RSSI, RSRP, RSRQ, SINR perceived by each smartphone. These signal parameters are used to quantify the network connection quality of each mobile device.

  • The CPU workload consumed by the top 10 Apps in each mobile device. This type of data is used to infer the available computing capability that can be spent on the local Federated Learning task on each candidate device.

Due to space limitations, Fig. 3 shows only a part of the raw trace data.

Fig. 3: Raw trace collected from our campus by one of the 100 volunteers. The raw trace includes latitude, longitude, time, cellular parameters, and CPU workload consumed by the top 10 Apps in a smartphone.
Fig. 4: Pre-processing the GPS traces into the trajectories of mobile users.
Fig. 5: The framework of LSTM-based local prediction towards the next-state of each mobile device.

IV-C Pre-processing of Trajectory Traces

The raw datasets must be pre-processed before they can be used to predict a user's next state in a timeslot $t$.

First, each mobile user collects its raw GPS data using the smartphone App Tracer. Every raw GPS trace is represented by a tuple $(\mathit{lat}, \mathit{lon}, t)$, which records the latitude and longitude of the user's position at time $t$. However, the raw trace data contain noise because of the inherent signal deviation of GPS. To retrieve trajectories of MUs that can be directly provided to the prediction algorithm, we perform the pre-processing shown in Fig. 4. The four main steps are as follows.

  1. Filter the GPS noise and adjust the sampling rate over all trace data items.

  2. Using the filtered GPS data from step 1, cluster the POIs of mobile users based on the density of their daily routines. Fig. 6 shows all POIs generated by this step from the datasets collected from all volunteers on our campus.

  3. Add date information, such as the day type and the day of the week.

  4. Combine the POIs with the date information obtained in step 3. After this step, the trajectory of each mobile user is obtained.

Please refer to our technical report [OurTechniqueReport] for more details of the data pre-processing.
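The four pre-processing steps can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions: noise is removed with a simple speed threshold, the density-based POI clustering is replaced by a much simpler grid-count stand-in, sampling-rate adjustment and holidays are not handled, and all function names, thresholds, and sample coordinates are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime, timezone

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two GPS fixes."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def filter_noise(points, max_speed_mps=50.0):
    """Step 1: drop fixes implying an implausible speed from the last kept fix."""
    kept = []
    for lat, lon, ts in sorted(points, key=lambda p: p[2]):
        if kept:
            plat, plon, pts = kept[-1]
            dt = ts - pts
            if dt <= 0 or haversine_m(plat, plon, lat, lon) / dt > max_speed_mps:
                continue
        kept.append((lat, lon, ts))
    return kept

def grid_cell(lat, lon, cell_deg=0.001):
    """Map a fix to a square grid cell (roughly 100 m wide)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def cluster_pois(points, cell_deg=0.001, min_points=3):
    """Step 2 (simplified): cells with at least min_points fixes become POIs."""
    counts = Counter(grid_cell(lat, lon, cell_deg) for lat, lon, _ in points)
    dense = sorted(cell for cell, n in counts.items() if n >= min_points)
    return {cell: poi_id for poi_id, cell in enumerate(dense)}

def to_trajectory(points, pois, cell_deg=0.001):
    """Steps 3-4: tag each fix inside a POI with day type and day of week."""
    trajectory = []
    for lat, lon, ts in points:
        cell = grid_cell(lat, lon, cell_deg)
        if cell in pois:
            day = datetime.fromtimestamp(ts, tz=timezone.utc)
            day_type = "weekend" if day.weekday() >= 5 else "workday"
            trajectory.append((pois[cell], day_type, day.weekday(), ts))
    return trajectory

T0 = 1578268800  # 2020-01-06 00:00:00 UTC, a Monday
raw = [
    (23.0664, 113.3904, T0),
    (23.0665, 113.3905, T0 + 60),
    (23.0666, 113.3906, T0 + 120),
    (24.5000, 113.3900, T0 + 180),  # GPS jump implying more than 2 km/s: dropped
    (23.0665, 113.3905, T0 + 240),
]
clean = filter_noise(raw)
pois = cluster_pois(clean)
trajectory = to_trajectory(clean, pois)
print(len(clean), pois, trajectory[0])
```

A grid count is used here only because it needs no external library; any density-based clustering method could take its place in step 2.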

Fig. 6: POIs of all candidate devices located in our campus.
Fig. 7: Accuracy of prediction results using LSTM w.r.t POI and Sojourn duration of candidate devices.

IV-D Local Next-State Prediction

Fig. 5 shows the framework of state prediction. Here the state includes the qualities of the network connection and the available CPU capacity of each candidate device. Once the trajectory history has been generated by the pre-processing described above, an LSTM-based model is used in the prediction module. In this module, we use a many-to-one architecture to learn each candidate's mobility pattern from its trajectory history.

To satisfy the input format of the LSTM network, all trajectory data are converted by the training data generator module. In this data generator, the duration of a mobile user's stay at a location along his/her trajectory is changed from the original data format (year-month-day hour:minute:second) into the Unix timestamp format. We then use min-max scaling to normalize the duration attribute. After normalization, each trajectory item is converted to [POI, day_type, what_day, $t_{\mathrm{start}}$, $t_{\mathrm{end}}$], in which day_type indicates whether a day is a working day, weekend, or holiday; what_day stands for the day of the week; and $t_{\mathrm{start}}$ and $t_{\mathrm{end}}$ denote the starting and ending times at which a candidate device arrives at and leaves a POI, respectively.
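The timestamp conversion and min-max scaling performed by the data generator can be sketched as follows. This is a minimal illustration; the helper names, the assumption that timestamps are in UTC, and the sample visits are hypothetical.

```python
from datetime import datetime, timezone

def to_unix(ts: str) -> float:
    """Parse 'year-month-day hour:minute:second' (assumed UTC) to a Unix timestamp."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return dt.timestamp()

def min_max_scale(values):
    """Rescale values linearly into [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Arrival and departure times at three POIs (hypothetical sample).
visits = [
    ("2020-01-06 08:00:00", "2020-01-06 08:30:00"),  # 1800 s
    ("2020-01-06 09:00:00", "2020-01-06 11:00:00"),  # 7200 s
    ("2020-01-06 12:00:00", "2020-01-06 13:00:00"),  # 3600 s
]
durations = [to_unix(end) - to_unix(start) for start, end in visits]
scaled = min_max_scale(durations)
print(durations)  # [1800.0, 7200.0, 3600.0]
print(scaled)     # shortest stay maps to 0.0, longest to 1.0
```

Min-max scaling keeps all duration features in the same [0, 1] range, which is the usual reason for applying it before feeding values to a recurrent network.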

With the data processed by the training data generator module, we then use two fully connected layers to predict the mobile users' location-based information in a timeslot $t$. The output layers of the prediction module yield two prediction results, i.e., the POI and the Sojourn duration of each mobile user.

The POIs and Sojourn durations of mobile users, combined with the traces of App usage and the traces of cellular networks, are used to infer the output state of candidates, i.e., the tuples $[q_i^t, \hat{q}_i^t, c_i^t, \hat{c}_i^t]$ of observed and predicted connection quality and available CPU capacity. Due to space limitations, we elaborate on how to infer this state in our technical report [OurTechniqueReport].

Finally, Fig. 7 illustrates the prediction accuracy of the LSTM-based model with respect to (w.r.t.) POI and Sojourn duration. Within 200 epochs, the accuracies for POI and Sojourn duration reach 90% and 82% on average, respectively. Although these accuracies are not especially high, the proposed proactive candidate-selection algorithm still achieves the best FL training results among the compared baseline algorithms. The performance comparison is presented in Section VI.

V DRL-based Candidate-Selection Algorithm

Although the network state in the near future can be predicted with high accuracy, significant uncertainties can still arise quite often in a dynamic federated learning environment. For example, cellular networks may fail during disasters, and mobile users may suddenly go offline. Thus, to handle unexpected dynamic events that cannot be accurately predicted, we design a Deep Reinforcement Learning (DRL)-based algorithm. Generally, reinforcement learning [mnih2013playing] is a trial-and-error search, in which the agent not only exploits the good actions that have already been tried in the past, but also explores new actions that may yield higher rewards in the future. When the policy network is built with a deep neural network (DNN), reinforcement learning becomes DRL. In particular, our proposed DRL-based algorithm is implemented with the Double DQN (DDQN) framework [van2016deep]. In the following, we describe how to construct the proposed DDQN-based algorithm.

Fig. 8: Interactions between DRL Core and Federated Learning modules.

V-A Overview of DRL-based Framework

As Fig. 8 shows, the proposed DRL-based framework mainly includes two modules: 1) the DRL Core, and 2) the Federated Learning module. In the DRL Core, the DDQN-based algorithm contains an offline phase that constructs a deep neural network to approximate the action-value function over the corresponding states and actions, an online dynamic deep Q-learning phase for action filtering, and dynamic network updating. For each round of FL, the role of the DRL Core is to conduct the candidate selection for the Federated Learning module in a proactive manner.

In the Federated Learning module, each mobile device provides its next-state to the FL server. All the predicted states can be observed by the DRL Core as experience items. These experience items are aggregated in the Replay Memory of the DDQN algorithm. Through the training of the DRL framework, a good policy is yielded as the Action, which is then treated as a solution to the candidate selection in the FL server. In detail, the training algorithm of the Q-network is presented in Algorithm 1.

V-B Problem Reformulation based on DRL

In the DRL Core, the DDQN-based algorithm essentially solves a stochastic optimization problem. This problem can be represented by the tuple $(\mathcal{S}, \mathcal{A}, r)$, where $\mathcal{S}$, $\mathcal{A}$, and $r(\cdot)$ are the state space, action space, and reward function of the DDQN-based algorithm, respectively. We define them for our problem as follows.

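State Space: Combining all the states of candidates collected by the FL server at the beginning of round $t$, the state at round $t$ is denoted by $\mathcal{S}_t = \{s_i^t \mid i \in \mathcal{I}\}$, where $s_i^t$ is the state-tuple of each candidate $i$. Recall that each candidate's state is the output of the local LSTM-based prediction model shown in Fig. 5. Thus, the state of candidate $i$ observed by the FL server is denoted by:

$$s_i^t = [q_i^t, \hat{q}_i^t, c_i^t, \hat{c}_i^t],$$

where $q_i^t$ and $c_i^t$ are the currently-observed connection quality and available CPU capacity of candidate $i$, and $\hat{q}_i^t$ and $\hat{c}_i^t$ are their predicted values within round $t$. Therefore, we have the state space $\mathcal{S} = \{\mathcal{S}_t \mid t \in \mathcal{T}\}$. The $s$ and $s'$ that appear in Fig. 8 and Alg. 1 are short for $\mathcal{S}_t$ and $\mathcal{S}_{t+1}$, respectively.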

Action Space: When observing the state set $\mathcal{S}_t$, the DRL Core makes the candidate-selection decisions for the FL server. The FL server then distributes the global model to each participant for training on its local dataset. The action space of the candidate selection in timeslot $t$ is defined as:

$$\mathcal{A}_t = \{x_i^t \mid i \in \mathcal{I}\},$$

where $x_i^t = 1$ or $0$ means that candidate $i$ is selected to participate in FL round $t$ or not. Note that constraint (2), i.e., $\sum_{i \in \mathcal{I}} x_i^t \le K$, must be satisfied. Similarly, we have the action space $\mathcal{A} = \{\mathcal{A}_t \mid t \in \mathcal{T}\}$.

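Initialize evaluation and target Q-network parameters with $\theta$, $\theta^-$
Initialize replay memory $D$ with the size of $N$
for episode $= 1$ to $E$ do
    for round $t = 1$ to $|\mathcal{T}|$ do
        Candidates send participating requests to the FL server
        $s_i^t \leftarrow [q_i^t, \hat{q}_i^t, c_i^t, \hat{c}_i^t]$, $\forall i \in \mathcal{I}$    // FL server observes states
        Select action $a$ according to the $\epsilon$-greedy policy
        Execute action $a$: choose candidates to distribute the FL model
        Observe reward $r$ and the state $s'$ of the next timeslot $t+1$
        Update Replay Memory: $D \leftarrow D \cup \{(s, a, r, s')\}$
        Sample a random mini-batch of $M$ experience items $(s_j, a_j, r_j, s'_j)$ from $D$
        for $j = 1$ to $M$ do
            $a_j^{\max} \leftarrow \arg\max_{a'} Q(s'_j, a'; \theta)$
            $y_j \leftarrow r_j + \gamma \, Q(s'_j, a_j^{\max}; \theta^-)$
        $L(\theta) \leftarrow \frac{1}{M} \sum_{j} \big(y_j - Q(s_j, a_j; \theta)\big)^2$
        Perform a gradient descent step on $L(\theta)$ to update $\theta$
        Reset: $\theta^- \leftarrow \theta$ in every $C$ steps
Algorithm 1 DDQN-based Proactive Candidate-Selection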
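Two core steps of Algorithm 1, the $\epsilon$-greedy action selection and the Double DQN target, can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: Q-values are given as plain lists in which each index stands for one admissible candidate selection, and the function names and toy numbers are hypothetical rather than taken from our implementation.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon explore a random action, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def ddqn_target(reward, next_q_online, next_q_target, gamma, done=False):
    """Double DQN target: the evaluation network picks the next action,
    the target network provides the value of that action."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]

# Toy numbers: the evaluation network prefers action 2,
# which the target network values at 0.5.
y = ddqn_target(reward=1.0,
                next_q_online=[0.2, 0.1, 0.9],
                next_q_target=[0.75, 0.25, 0.5],
                gamma=0.5)
print(y)  # 1.25
```

A plain DQN target would instead use the maximum target-network value (0.75 here, giving 1.375); decoupling action choice from action evaluation is what reduces the overestimation of Q-values.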

Reward Function: Based on the original objective function (1), we design a reward function for the Prophet problem. The reward function is defined as follows:


where () is the set of the candidates randomly chosen from . Particularly, represents the opposite situations of , i.e., when , , and vice versa. We see that the reward contains two terms. The first term is a positive reward, which is essentially the objective (1). Then, the second term is a timeout penalty, which is used to quantify the impact brought by the participants who failed to complete the FL training task within timeslot . As mentioned, the reasons behind the failure include two situations, i.e., a participant failed to upload the updated model to FL server due to low-quality cellular connections, or a participant took too much training time because of its low computing capability. Thus, the parameter is a scale factor that decreases as the model-update time of the valid participant increases, while is the negative scale factor of the timeout penalty.
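A two-term reward of this kind can be sketched as follows. The linear form, the parameter name `beta`, the linear decay of the positive scale factor with the model-update time, and the toy values are all assumptions made for illustration, not the exact reward definition.

```python
def round_reward(selected, uploaded, update_time, deadline, beta=-1.0):
    """Positive term for valid participants plus a penalty for timed-out ones."""
    reward = 0.0
    for dev, chosen in selected.items():
        if not chosen:
            continue
        if uploaded[dev]:
            # Assumed scale factor: shrinks linearly as the update takes longer.
            alpha = max(0.0, 1.0 - update_time[dev] / deadline)
            reward += alpha
        else:
            # Timed-out participant: beta is negative, so the reward decreases.
            reward += beta
    return reward

# Hypothetical round: device "a" reports quickly, "b" times out, "c" is not selected.
selected = {"a": 1, "b": 1, "c": 0}
uploaded = {"a": 1, "b": 0, "c": 1}
update_time = {"a": 2.5, "b": 20.0, "c": 1.0}
print(round_reward(selected, uploaded, update_time, deadline=10.0, beta=-0.5))  # 0.25
```

Because unselected devices contribute nothing, the agent is rewarded only for choosing devices that finish early and is penalized for each selected device that misses the deadline.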

VI Performance Evaluation

VI-A Methodologies and Settings

Basic Settings: In the experiments, we use a total of 100 mobile devices as the candidates for federated learning. At most 10 of them are selected to participate in each round. At the beginning of each round, all candidates report their device information, which includes their individual network connection quality and current computing capability, to the FL parameter server. Next, the parameter server selects a group of devices as the participants for training an FL task, according to the candidate-selection algorithm in use. Under the proposed DDQN-based algorithm, after a number of training episodes with the other parameters configured as in Table II, the DRL Core learns a good policy for candidate selection. This policy is then provided to the FL server to guide the distribution of the global FL model.

Settings for Policy Training: The trace dataset of each candidate is divided into two parts. The first part is used to train the DDQN-based model, while the second part provides the trajectory of a mobile device during the trace-driven experiments of online FL training. While the DDQN model is trained towards the best candidate-selection policy, the candidates chosen by the FL server are not altered until a good policy is obtained within a certain number of episodes.

Settings for Global FL Training: To train the global Federated Learning model, we choose two well-known image datasets, MNIST [lecun1998mnist] and CIFAR-10 [krizhevsky2009learning]. According to [mcmahan2016communication], the distribution of data samples in a dataset substantially affects the training quality of distributed learning. Because the image data on a smartphone are in practice not independent and identically distributed (non-i.i.d.), we evaluate the test accuracy of the algorithms using non-i.i.d. datasets.

Parameters      Value
NN size
Optimizer       Adam
Learning rate
ε-greedy        0-0.9
TABLE II: Experiment Configurations
(a) Numerical reward of algorithms
(b) Reward vs Uncertainties
Fig. 9: Numerical reward while striving for candidate-selection policy.

VI-B Baseline Algorithms

We compare the proposed DDQN-based algorithm with the following baselines. The basic idea of each is explained as follows.

FedCS [nishio2019client]: The FL server chooses participants based only on the currently-observed CPU capability and channel conditions of the candidates. The better the computing capability and network connection observed at the current timeslot, the higher the chance of being selected by the FL server.

FedAvg [mcmahan2016communication]: This is probably the original FL aggregation strategy. Under this algorithm, the FL server chooses random mobile devices as participants to train each round of FL tasks.

Offline: We also implement an offline algorithm, in which the participants are selected in each FL round depending on their trace-driven states throughout all timeslots . Thus, this offline scheme yields the optimal solutions.
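The difference between the reactive and the random baselines can be sketched in a few lines. This is a simplified illustration, not the published FedCS or FedAvg procedures: the reactive rule here simply ranks candidates by the sum of their currently observed computing and link scores, and the names `reactive_select`, `random_select`, and the score tuples are hypothetical.

```python
import random
from typing import List, Tuple


def reactive_select(observed: List[Tuple[float, float]], k: int) -> List[int]:
    """Pick the k candidates with the best currently observed state.

    observed[i] is a (computing_score, link_score) pair for candidate i.
    Assumption: a higher sum of the two scores means a better candidate.
    """
    def score(i: int) -> float:
        return observed[i][0] + observed[i][1]

    ranked = sorted(range(len(observed)), key=score, reverse=True)
    return sorted(ranked[:k])


def random_select(num_candidates: int, k: int, rng: random.Random) -> List[int]:
    """Pick k distinct candidates uniformly at random."""
    return sorted(rng.sample(range(num_candidates), k))


observed = [(0.9, 0.8), (0.2, 0.3), (0.6, 0.9), (0.4, 0.1)]
print(reactive_select(observed, k=2))  # [0, 2]
print(random_select(len(observed), k=2, rng=random.Random(0)))
```

Neither rule looks beyond the current timeslot, which is the limitation the proactive scheme targets.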

VI-C Metrics of Evaluation

Numerical Reward: The numerical reward, which is calculated by equation (3), closely relates to the objective function (1).

Ratio of Valid Participants: Another factor that highly reflects the objective function (1) is the ratio of valid participants selected by algorithms. A higher ratio of the valid participants selected by an algorithm indicates that this algorithm can choose a better group of candidates.

Test Accuracy of Aggregated FL Model: We also evaluate the test accuracy of the aggregated global FL model, which is trained by all valid participants collaboratively.
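As a small illustration of the second metric, the ratio of valid participants can be computed from per-round counts. The sketch assumes the ratio is accumulated over all rounds (total valid divided by total selected); the function name `valid_ratio` and the input layout are hypothetical.

```python
from typing import List, Tuple


def valid_ratio(rounds: List[Tuple[int, int]]) -> float:
    """Return total valid participants divided by total selected ones.

    rounds: one (num_selected, num_valid) pair per FL round.
    """
    selected = sum(s for s, _ in rounds)
    valid = sum(v for _, v in rounds)
    return valid / selected if selected else 0.0


# Three rounds: 26 valid participants out of 28 selected.
print(round(valid_ratio([(10, 9), (10, 10), (8, 7)]), 3))  # 0.929
```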

(a) Accumulated # of valid participants (Parti.)
(b) Ratio of valid participants
Fig. 10: Performance of valid participants during FL training.

VI-D Numerical Reward

Fig. 9(a) shows the numerical rewards of all algorithms. First, FedAvg shows the lowest reward, because this algorithm selects all the 10 candidates to participate in each round of the FL training. However, a large portion of those participants cannot complete their FL training tasks in every round due to the unknown dynamics of the computing environment. The participants who fail in a round contribute negative reward, which is why FedAvg has the lowest numerical reward.

Next, the FedCS algorithm shows a numerical reward of around 950, which is much higher than that of FedAvg. This is because FedCS has the FL server select a number of the best participants among the 10 candidates, depending only on the currently-observed devices’ state [, ]. However, this reactive manner of selection sees only the state of the current timeslot; the future state of the candidates is unknown at selection time. Thus, some selected participants are likely to fail to train the FL task in the subsequent timeslots. These failed participants incur negative rewards under FedCS.

We now analyze the performance of the proposed DDQN-based algorithm. This algorithm trains the candidate-selection model on both the currently-observed and the predicted states within each timeslot, i.e., [, , , ]. At the beginning of selection-model training, DDQN selects a random number of candidates, not greater than , to participate in the FL training. Thus, the reward is low at the beginning. Through repeated exploration, DDQN keeps learning from the experience items, and the candidate-selection policy keeps improving. We observe that the reward of DDQN 1) exceeds that of FedCS at around the 3000th episode, and 2) converges to a high numerical reward of around 1900 after the 7500th episode. The final converged reward is very close to that of the offline solution, which shows that DDQN has learned a near-optimal candidate-selection policy. This result indicates that predicting the state [, ] in the near future within a timeslot enables the FL server to select candidates whose currently observed states are bad but whose states will improve in the subsequent timeslots. With such proactive prediction, the proposed scheme benefits from this group of potential candidates, which bring positive rewards.

VI-E How Does DDQN Perform under Uncertain Environments?

A natural question is how the algorithms perform under highly uncertain environments. We show this using Fig. 9(b). The left-hand half of Fig. 9(b) shows that DDQN has learned a high-reward policy by around the 7000th episode. At the 10000th episode, we deliberately and drastically change the cellular connection parameters to simulate unexpected uncertainties, e.g., the crash of cellular base stations. Although this change does not affect the prediction accuracy of POIs, it makes the network connection qualities at those POIs change dramatically. Consequently, the prediction of network connection quality based on the historical trace data is no longer accurate. As shown at the 10000th episode, the numerical rewards of all algorithms decrease, especially that of the proactive DDQN, which depends heavily on the prediction results. When the uncertain events occur, the proposed proactive approach, misled by the false prediction, still selects the group of devices that are going to suffer from low connection quality. Those candidates cannot complete the FL training tasks in the subsequent timeslots, which is why the reward of DDQN drops drastically right after the uncertainties. Although DDQN still selects that group of devices immediately after the uncertainties occur, the Q-network quickly perceives the decreasing rewards. As a consequence, DDQN adaptively changes its selection policy towards higher rewards. Finally, DDQN converges to a high reward that is again close to the offline optimal performance. This result shows the advantage of the DDQN algorithm when performing distributed Federated Learning in a highly dynamic computing environment.

(a) Test accuracy with MNIST [lecun1998mnist]
(b) Test accuracy with CIFAR-10 [krizhevsky2009learning]
Fig. 11: Test accuracies of algorithms under the non-i.i.d. datasets of MNIST [lecun1998mnist] and CIFAR-10 [krizhevsky2009learning].

VI-F Ratio of Valid Participants

Fig. 10(a) shows the performance of all 4 algorithms in terms of the accumulated number of valid participants during the rounds of FL training. Recall that we call a candidate a valid participant if it successfully reports its updated local model to the FL server within a round of FL training. For each algorithm, the dark part of the bar shows the accumulated number of valid participants, while the light part indicates the number of invalid ones. FedAvg has the largest number of invalid participants, while the DDQN-based algorithm always shows the highest number of valid participants. Fig. 10(b) illustrates the ratio of valid participants under each algorithm. Once again, the proposed DDQN-based algorithm achieves a ratio as high as 95%. By contrast, FedAvg leads to a very low ratio of valid participants throughout the FL training rounds. These results show that the proposed proactive candidate-selection algorithm consistently selects a better group of candidates for training the global FL model than the baselines.

VI-G Test Accuracy of Global FL Model using 2 Open Datasets

In the experiments using the non-i.i.d. datasets, each participant holds only 20%-40% of all image samples as its training dataset. The number of images in each participant’s dataset varies from 2500 to 5000. We use a thread to simulate each mobile candidate, which adopts the preprocessed trajectory as its mobility data. Those candidates participate in the training of a global model residing in the FL server using their local non-i.i.d. image datasets. Note that we do not evaluate the test accuracy of the FedAvg algorithm: since FedAvg always selects all the 10 candidates as participants, it would be unfair to compare its test accuracy with that of the other algorithms.
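One common way to build such non-i.i.d. local datasets is label skew, where each participant draws samples from only a few classes. The sketch below illustrates that idea under stated assumptions; the text does not specify the exact partitioning procedure, so the function `non_iid_partition`, its parameters, and the choice to let participants' samples overlap are all hypothetical. The toy sizes are scaled down from the 2500 to 5000 images mentioned above.

```python
import random
from collections import defaultdict
from typing import Dict, List


def non_iid_partition(labels: List[int], num_participants: int,
                      classes_per_participant: int, min_size: int,
                      max_size: int, rng: random.Random) -> Dict[int, List[int]]:
    """Assign each participant sample indices from a few classes only.

    Assumption: datasets of different participants may overlap, since each
    participant samples independently from the full pool of its classes.
    """
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    all_classes = sorted(by_class)

    partition: Dict[int, List[int]] = {}
    for p in range(num_participants):
        classes = rng.sample(all_classes, classes_per_participant)
        pool = [i for c in classes for i in by_class[c]]
        size = min(rng.randint(min_size, max_size), len(pool))
        partition[p] = rng.sample(pool, size)
    return partition


# Toy dataset: 1000 samples, 10 balanced classes.
labels = [i % 10 for i in range(1000)]
parts = non_iid_partition(labels, num_participants=5, classes_per_participant=2,
                          min_size=50, max_size=100, rng=random.Random(42))
for p, idxs in parts.items():
    print(p, len(idxs), sorted({labels[i] for i in idxs}))
```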

Fig. 11 shows the time-varying test accuracies evaluated on each participant device after model aggregation on the FL server. Different candidate-selection algorithms yield different test accuracies. For example, in Fig. 11(a), the proposed DDQN-based algorithm performs very close to the optimal offline scheme, and achieves 10%-20% higher time-varying test accuracies than the reactive baseline FedCS. This is because the ratio of valid participants is higher under the proposed DDQN-based algorithm than under FedCS. Thus, the FL server receives many more updated model parameters and weights, which results in higher test accuracies of the aggregated global model. Similar results can be observed in Fig. 11(b). The reasons are the same and are thus omitted here for brevity.
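The link between valid participants and the aggregated model can be made concrete with a sketch of sample-weighted averaging in the style of FedAvg [mcmahan2016communication]. It assumes that only updates from valid participants reach the server and that model parameters are flat lists of floats; the function name `aggregate` and the input layout are hypothetical, and the paper's exact aggregation rule may differ.

```python
from typing import List, Tuple


def aggregate(updates: List[Tuple[List[float], int]]) -> List[float]:
    """Average parameter vectors, weighted by local dataset size.

    updates: one (parameters, num_samples) pair per valid participant.
    Updates from participants that timed out are simply absent.
    """
    if not updates:
        raise ValueError("no valid participant reported an update")
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[d] * n for w, n in updates) / total for d in range(dim)]


# Two valid participants holding 100 and 300 samples respectively.
print(aggregate([([1.0, 2.0], 100), ([3.0, 4.0], 300)]))  # [2.5, 3.5]
```

With fewer valid participants, the average is taken over fewer local models and less data, which is one way to read the accuracy gap between the proactive and reactive schemes.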

VII Conclusion

Candidate-selection is the first essential phase of each round of synchronous Federated Learning. To the best of our knowledge, we are the first to propose proactive candidate-selection by predicting the qualities of both the training and reporting phases. The proposed DRL-based candidate-selection algorithm can adaptively handle the unexpected dynamic events that occur in the metropolitan mobile edge computing environment. To evaluate the effectiveness of the proposed proactive approach, we first collected real-world user traces using a self-developed Android App. The trace-driven experiments show that the proposed DDQN-based algorithm yields a near-optimal reward, and outperforms the other baselines with respect to multiple metrics, including the ratio of valid participants and the ability to adapt to highly uncertain events. Using two open image datasets, the experiment results show that the test accuracies of the proposed DDQN-based algorithm are higher than those of the existing candidate-selection baselines by around 10%-20% during the online FL training.

We believe this work will shed new light on synchronous Federated Learning. The proposed DDQN-based algorithm can serve as a reliable approach that provides a proactive candidate-selection mechanism for Federated Learning.