In traditional machine learning approaches, neural network models are trained at a server or a data center. Such centralized learning approaches typically require the raw data collected by mobile devices, e.g., photos and location information, to be centralized at the server. They therefore face serious issues, including privacy risks, long propagation delay, and a heavy burden on the backbone network.
Recently, federated learning has been proposed as a decentralized machine learning approach to address the above issues [9, 3]. In federated learning, mobile devices, i.e., workers, collaboratively train the neural network model of the model owner. (In the rest of the paper, “model owner” refers to the server that creates the global model, and “worker” refers to a mobile device that trains the global model.) In particular, the model owner first transmits its global model to the workers. The workers then use their data to train the model locally and send the model updates to the model owner. The model owner aggregates the model updates from the workers into a new global model and transmits it back to the workers for further training. The model owner and the workers periodically exchange and update the model until a target accuracy is achieved. By exchanging model updates rather than raw data, federated learning alleviates many challenging problems of traditional machine learning, e.g., privacy issues and the burden on the backbone network. However, federated learning faces two limitations.
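The exchange described above can be sketched in a few lines. This is only an illustrative toy under stated assumptions: the paper does not specify the aggregation rule, so the unweighted average in `aggregate` is an assumption, and `local_update` is a hypothetical stand-in for real local training.

```python
from typing import List

Weights = List[float]


def local_update(global_w: Weights, data: List[float], lr: float = 0.1) -> Weights:
    """Hypothetical local training: one step that pulls every weight toward
    the mean of the worker's private data (a stand-in for real training)."""
    target = sum(data) / len(data)
    return [w - lr * (w - target) for w in global_w]


def aggregate(updates: List[Weights]) -> Weights:
    """Assumed aggregation rule: unweighted average of the workers' models."""
    n = len(updates)
    return [sum(ws) / n for ws in zip(*updates)]


def federated_round(global_w: Weights, worker_data: List[List[float]]) -> Weights:
    """One round: broadcast the global model, train locally, aggregate."""
    updates = [local_update(global_w, data) for data in worker_data]
    return aggregate(updates)


# Two workers whose raw data never leaves the device; only models are exchanged.
global_w = [0.0, 0.0]
worker_data = [[1.0, 1.0], [3.0, 3.0]]
for _ in range(50):
    global_w = federated_round(global_w, worker_data)
```

In this toy setting the global weights drift toward the average of the workers' local targets, although the model owner never sees the raw data.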
The first limitation is that the workers, as mobile devices, have energy constraints, which can make them inactive. To address this limitation, a power beacon can be used to recharge the workers. However, the model owner may need to pay an energy cost to the power beacon for the recharging. Thus, the model owner must decide appropriate amounts of energy to recharge to the workers. The second limitation is that, to achieve the target accuracy of the global model, a number of global model transmissions from the model owner to the workers may be required. This incurs a high network resource cost, i.e., the bandwidth cost, for the model owner. To reduce the network resource cost, the model owner can use a WiFi or Bluetooth channel, called the default channel, for the global model transmissions. As such, the model owner can transmit its global model to the workers at no access cost. However, the default channel sometimes has low quality, and the global model transmissions may be unreliable. Moreover, the coverage of the default channel is very short, which may result in communication interruptions between the model owner and the workers due to the mobility of the workers. The model owner can thus use so-called special channels, such as LTE channels or TV White Space channels, for the global model transmissions. However, using such a special channel incurs a high communication cost, i.e., the channel cost.
As such, the problem of the model owner is (i) to decide the amounts of energy to recharge to the workers and (ii) to choose channels, i.e., the default channel or the special channels, for the global model transmissions, so as to maximize the number of successful transmissions while minimizing the energy cost and the channel cost. This is challenging for the model owner since the mobile environment is stochastic, i.e., the channel, energy, and mobility states are uncertain and unpredictable. In this paper, we thus propose to employ the Deep Q-Network (DQN), which enables the model owner to find the optimal decisions on the energy and the channels without prior knowledge of the network. In particular, we first formulate a stochastic optimization problem of the model owner that maximizes the number of successful global model transmissions while minimizing the energy cost and the channel cost. Then, the Deep Q-Learning (DQL) algorithm with Double Deep Q-Network (DDQN) is adopted to derive the optimal policy for the model owner. Simulation results show that the proposed DQL algorithm achieves a higher reward than the baseline algorithms.
II System Model
We consider a downlink model of a Federated Learning Network (FLNet) as shown in Fig. 1. The network consists of a model owner and a set of workers, i.e., . The model owner generates a global model, i.e., weight parameters, and transmits the model to the workers for training. The training requires multiple iterations until the target global accuracy is achieved. Each worker is equipped with a capacity-limited battery that can store a maximum number of energy units. Note that when little energy remains in the battery, the worker may become inactive and unable to communicate with the server. The worker’s battery can be recharged by a power beacon, and the model owner pays an energy cost to the power beacon for the recharging. To minimize the energy cost while reducing the inactivity probability of the workers, the model owner needs to determine appropriate amounts of energy to recharge to the workers, e.g., through energy control links. Note that the workers may be at different distances from the power beacon and may consume different amounts of energy for uploading their local models. Let denote the weighted metric of recharging energy to worker . Without loss of generality, we assume that workers with a higher index are farther from the power beacon and have higher weighted metrics, i.e., . As such, the weighted metric is used to guarantee “fairness” among the workers.
Because the workers are mobile devices, they may move, which can result in link disconnection and communication interruption when a worker is out of the coverage of the power beacon. Denote as the probability that worker is present in the coverage. Within the coverage, the model owner transmits its global model to the workers over the WiFi or Bluetooth channel, i.e., the default channel, at no access cost. The default channel has an unstable connection due to its low quality. Let denote the probability that the model owner transmits its global model over the default channel successfully. This means that the transmission fails with a probability of . To mitigate the communication interruption between the model owner and the workers, special channels, e.g., LTE channels and TV White Space channels, can be employed. Denote as a set of special channels, i.e., . When the model owner decides to use special channel , it is charged an access cost that is proportional to the channel quality. The special channels may have different qualities, and we assume that . Denote as the probability that the global model transmission over special channel is successful.
The problem of the model owner is to decide which channel, i.e., the default channel or one of the special channels, to use to transmit its global model to each worker so as to maximize the number of successful transmissions and minimize the channel cost. In addition, the model owner needs to decide appropriate amounts of energy to recharge to the workers to minimize the energy cost while reducing the inactivity probability of the workers.
III Problem Formulation
Under the uncertainty of the states of the channels and the workers, the problem of the model owner can be formulated as a stochastic optimization problem that is defined by a tuple , where
: The state space of the network.
: The action space of the model owner.
: The state transition probability function, where the current state transits to the next state with probability when action is executed.
: The reward function of the model owner.
First, we consider the action space of the model owner. Given the problem mentioned in Section II, the action space of the model owner can be expressed as follows:
where means that the model owner does not transmit the global model to worker , means that the default channel is used for transmitting the global model to worker , means that special channel is used for transmitting the model to worker , and refers to the amount of energy recharged to worker .
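To make the size of this action space concrete, the following sketch enumerates it for a small network. The integer encoding is an assumption chosen for illustration (0 for no transmission, 1 for the default channel, 2 and above for the special channels), as are the names `num_special` and `max_recharge`; the paper's own symbols are not reproduced here.

```python
from itertools import product
from typing import List, Tuple

# Per-worker action: (channel choice, units of energy recharged).
WorkerAction = Tuple[int, int]


def worker_actions(num_special: int, max_recharge: int) -> List[WorkerAction]:
    """All actions toward a single worker.

    Assumed encoding: channel 0 = do not transmit, 1 = default channel,
    2 .. num_special + 1 = one of the special channels.
    """
    channels = range(num_special + 2)
    energies = range(max_recharge + 1)
    return [(c, e) for c in channels for e in energies]


def joint_actions(num_workers: int, num_special: int, max_recharge: int):
    """Joint action of the model owner: one per-worker action for each worker."""
    per_worker = worker_actions(num_special, max_recharge)
    return list(product(per_worker, repeat=num_workers))


# 2 special channels, recharge of 0 or 1 unit: 4 * 2 = 8 actions per worker,
# hence 8 ** 3 = 512 joint actions for 3 workers.
actions = joint_actions(num_workers=3, num_special=2, max_recharge=1)
```

The joint action space grows exponentially with the number of workers, which is what later makes a tabular method impractical.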
Next, we consider the state space of the network. The state space of the network can be regarded as a combination of state spaces of workers, i.e., , where is the Cartesian product, and is the state space of worker that is defined as follows:
where is the state of channel that the model owner uses to transmit its global model to worker , if the channel is good, and otherwise. refers to the energy state of worker that is the current number of energy units in the battery. is the mobility state of worker , means that worker is in the communication range of the model owner, and otherwise.
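A minimal sketch of this per-worker state and of the joint state space built from it is given below. The field names and the 1/0 encoding of the channel and mobility states are assumptions for illustration, and `capacity` stands for the maximum number of energy units in the battery.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class WorkerState:
    channel_good: int  # assumed encoding: 1 if the channel is good, 0 otherwise
    energy: int        # current number of energy units in the battery
    in_range: int      # assumed encoding: 1 if in communication range, 0 otherwise


def worker_states(capacity: int) -> List[WorkerState]:
    """All states of one worker with a battery of `capacity` energy units."""
    return [
        WorkerState(c, e, m)
        for c, e, m in product((0, 1), range(capacity + 1), (0, 1))
    ]


def joint_states(num_workers: int, capacity: int):
    """Cartesian product of the per-worker state spaces."""
    return list(product(worker_states(capacity), repeat=num_workers))


# capacity 3 gives 2 * 4 * 2 = 16 states per worker, 16 ** 2 = 256 for 2 workers.
states = joint_states(num_workers=2, capacity=3)
```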
Now, we consider the state transition of the model owner. For the energy transition, each worker consumes its energy for training the model required by the model owner and for running its local applications. In general, the amount of consumed energy is unknown to the model owner. To model the transition of the energy state of the worker, we use a Markov chain. We assume that in each training iteration, the worker consumes at least one energy unit and at most two energy units for training the model and running local applications. The energy recharging of the workers happens after the training in each iteration. The energy state transition of worker is shown in Fig. 2, where the energy is reduced by either one unit with a probability of or two units with a probability of under the condition of . Note that at , the energy state can directly transition to state with the probability of due to the energy consumption of running local applications. For the channel and mobility, we model the channel state and the mobility state as Bernoulli processes, which take value with probability or and , respectively.
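The transitions described above can be simulated with a short sketch. Several details are assumptions rather than statements from the paper: `p_one` is a hypothetical name for the probability of consuming one unit (two units are consumed otherwise), a battery holding a single unit is taken to drain to zero, an empty battery stays empty until recharged, and the recharged energy is clipped at the battery capacity.

```python
import random


def consume(energy: int, p_one: float, rng: random.Random) -> int:
    """Energy left after one training iteration (before recharging)."""
    if energy <= 1:
        # Assumed boundary behaviour: zero stays zero, one unit drains to zero.
        return 0
    drop = 1 if rng.random() < p_one else 2
    return energy - drop


def step_energy(energy: int, recharge: int, capacity: int,
                p_one: float, rng: random.Random) -> int:
    """Consume first, then recharge, clipping at the battery capacity."""
    return min(capacity, consume(energy, p_one, rng) + recharge)


def bernoulli(p: float, rng: random.Random) -> int:
    """Channel or mobility state: 1 with probability p, 0 otherwise."""
    return 1 if rng.random() < p else 0


rng = random.Random(0)
energy = 5
trace = []
for _ in range(4):
    energy = step_energy(energy, recharge=1, capacity=5, p_one=0.7, rng=rng)
    trace.append((energy, bernoulli(0.8, rng), bernoulli(0.6, rng)))
```

Each entry of `trace` holds a sampled (energy, channel state, mobility state) triple for one iteration.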
Finally, we define the reward of the model owner. One of the objectives is to maximize the number of successful transmissions of the model owner. We assume that the model owner earns a positive utility from the successful transmission to each worker. Let denote the utility that the model owner receives for transmitting its global model to worker . can be expressed as follows:
When the model owner decides to use special channel to transmit the global model to worker , the model owner must pay channel access cost , i.e., to a network provider. Otherwise, when the model owner chooses not to transmit or chooses the default channel to transmit the global model to worker , . Thus, is determined as follows:
When the model owner decides to transmit its global model to worker , i.e., by using either the default channel or the special channel, the worker consumes energy for training and uploading the local model. The worker’s battery can be recharged by the power beacon, and there is energy cost . is the cost that the model owner pays the power beacon for recharging energy to worker . is a function of the weighted metric and the amount of recharging energy . is given by
where is the weighted metric for recharging energy from outside the coverage. Note that the model owner pays a higher cost to recharge the energy if the worker is outside of the coverage, meaning that .
The reward of the model owner is defined as a function of state and action as follows:
where , and are the scale factors. and are the maximum values of the total utility, channel cost and energy cost, respectively.
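As a rough illustration of how the three terms combine, the sketch below computes a reward as the scaled, normalized total utility minus the scaled, normalized channel and energy costs. The exact functional forms are assumptions: the linear combination, the product form of `energy_cost`, and all parameter names (`w_u`, `w_c`, `w_e`, `weight_in`, `weight_out`) are hypothetical and only mirror the description in the text.

```python
from typing import Sequence


def energy_cost(weight_in: float, weight_out: float,
                in_coverage: bool, recharge: int) -> float:
    """Assumed form: weighted metric times the recharged amount, with the
    larger metric applied when the worker is outside the coverage."""
    weight = weight_in if in_coverage else weight_out
    return weight * recharge


def reward(utilities: Sequence[float], channel_costs: Sequence[float],
           energy_costs: Sequence[float], u_max: float, c_max: float,
           e_max: float, w_u: float = 1.0, w_c: float = 1.0,
           w_e: float = 1.0) -> float:
    """Scaled utility minus scaled costs, each normalized by its maximum."""
    return (w_u * sum(utilities) / u_max
            - w_c * sum(channel_costs) / c_max
            - w_e * sum(energy_costs) / e_max)


# Two workers: both transmissions succeed, one special channel is paid for,
# and worker 2 is recharged from outside the coverage at a higher metric.
costs = [energy_cost(1.0, 2.0, True, 1), energy_cost(1.0, 2.0, False, 1)]
r = reward(utilities=[1.0, 1.0], channel_costs=[0.0, 2.0],
           energy_costs=costs, u_max=2.0, c_max=4.0, e_max=4.0)
```

Normalizing each term by its maximum keeps the three components on comparable scales, so the scale factors alone express the trade-off between successful transmissions and costs.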
Given each state , the model owner must determine the optimal action to maximize the accumulated reward. The output is the optimal policy, which is defined as . To obtain the optimal policy , the conventional Q-Learning (QL) algorithm can be utilized. The main idea of the QL algorithm is to update the Q-values, i.e., , of state-action pairs in a Q-table by using Bellman’s equation as follows:
Thus, the Q-value is updated as follows:
where is the learning rate, and is the discount factor, .
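For reference, the standard tabular Q-learning update can be written as a short function. The notation here is the conventional one and is an assumption about the paper's symbols: `alpha` is the learning rate and `gamma` the discount factor.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def q_update(q: QTable, state, action, reward: float, next_state,
             actions: Iterable, alpha: float, gamma: float) -> float:
    """Standard rule: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Unvisited state-action pairs default to a Q-value of 0.
q = defaultdict(float)
q[("s1", 0)] = 1.0
new_value = q_update(q, "s0", 1, reward=1.0, next_state="s1",
                     actions=[0, 1], alpha=0.5, gamma=0.9)
# new_value = 0 + 0.5 * (1.0 + 0.9 * 1.0 - 0) = 0.95
```

The table needs one entry per state-action pair, so its size is the product of the sizes of the state and action spaces.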
After updating the Q-values, the model owner can rely on the Q-table to determine the optimal action from any state to maximize the accumulated reward. However, the QL algorithm is only feasible for networks with small state and action spaces. As the number of workers increases, the problem of the model owner becomes high-dimensional because of the large state and action spaces. Therefore, the Deep Q-Learning (DQL) algorithm, which is a combination of a deep neural network (DNN) and QL, is adopted to find the optimal policy for the model owner.
IV Deep Q-Learning Algorithm
Different from the QL algorithm, the DQL algorithm uses a DNN to derive approximate Q-values, i.e., , instead of the Q-table. The input of the DNN is one of the states of the model owner, and the output includes the Q-values of all possible actions, where is the weights of the DNN. To obtain the approximate values , the DNN needs to be trained by using experiences . In particular, the DQL algorithm updates the weights of the DQN to minimize the loss function defined as follows:
where is the target value that is given by
where is the weights of the DNN from the previous iteration and is the current reward. Note that action is selected according to the ε-greedy policy. From (5), we observe that the max operator uses the same Q-value for both action selection and action evaluation. As a result, the derived policy may be inaccurate due to the over-optimistic estimation.
To address the over-optimistic problem, the action selection should be decoupled from the action evaluation. For this, the Double Deep Q-Network (DDQN) can be used. The main feature of DDQN is the use of two separate DNNs, i.e., an online network with weights and a target network with weights . The weights of the online network are updated at each iteration, while those of the target network are kept constant between resets. Every iterations, the target network’s weights are reset to . The target function of DDQN is defined by
As seen from (6), the weights of the online network, i.e., , are used to select an action, while those of the target network, i.e., , are used to evaluate the action. Both the online network and the target network use the next state to compute the optimal value . Given and , the target value is calculated based on (6). Then, a gradient descent step is performed to update the weights of the online network based on the loss function in (4). To guarantee the stability of the learning, the DQL algorithm employs an experience replay memory , where a mini-batch of experiences is sampled at each iteration to train the DNNs. Algorithm 1 shows how to implement the DQL algorithm.
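The difference between the two target values, together with the replay memory, can be illustrated without any deep learning library. In this sketch the networks are replaced by plain lists of Q-values for the next state, which is an assumption made only to keep the example self-contained; it is not Algorithm 1, and the names `dqn_target`, `ddqn_target`, and `ReplayMemory` are hypothetical.

```python
import random
from collections import deque
from typing import List, Sequence


def dqn_target(reward: float, gamma: float,
               q_target_next: Sequence[float]) -> float:
    """Single-network target: the same values select and evaluate the action."""
    return reward + gamma * max(q_target_next)


def ddqn_target(reward: float, gamma: float, q_online_next: Sequence[float],
                q_target_next: Sequence[float]) -> float:
    """Double DQN target: the online values select the action,
    the target values evaluate it."""
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) experiences."""

    def __init__(self, capacity: int) -> None:
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped

    def push(self, state, action, reward, next_state) -> None:
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int, rng: random.Random) -> List[tuple]:
        """Uniformly sample a mini-batch without replacement."""
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)


# Next-state Q-values from the two (here: pretend) networks.
q_online_next = [1.0, 2.0, 0.5]
q_target_next = [3.0, 1.5, 0.2]
y_dqn = dqn_target(1.0, 0.9, q_target_next)                   # 1 + 0.9 * 3.0
y_ddqn = ddqn_target(1.0, 0.9, q_online_next, q_target_next)  # 1 + 0.9 * 1.5

memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(t, 0, 0.0, t + 1)
batch = memory.sample(2, random.Random(0))
```

In this example the double estimator yields the smaller target, because the action preferred by the online values is not the one the target values rate highest.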
V Numerical Results
In this section, we present experimental results to evaluate the performance of the proposed DQL algorithm. For comparison, we use the QL, greedy, and random algorithms as baseline schemes. For the greedy algorithm, the model owner recharges the maximum amount of energy to each worker and selects the special channel with the highest quality to transmit the global model to the worker. For the random algorithm, the channel and the amount of recharged energy for each worker are selected at random. The algorithms are implemented by using the TensorFlow deep learning library. In particular, for the DQL, we employ two DNNs, and each DNN has a size of . The Adam optimizer, which adapts the learning rate during the training phase, is used. The learning rate is set to to avoid overshooting local minima. The DQL algorithm prefers the long-term reward, and thus the discount factor is set to . We use the ε-greedy policy with , which balances exploration and exploitation. During the training phase, ε is linearly reduced to zero, which shifts the policy from exploration to exploitation. The probabilities of energy consumption and mobility of the workers are set as follows: and . Other parameters are shown in Table I.
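The ε-greedy policy with a linearly decaying ε can be sketched as follows. The starting value of ε is not recoverable from the text, so the value 0.9 used below is purely illustrative, and the function names are hypothetical.

```python
import random
from typing import Sequence


def linear_epsilon(step: int, total_steps: int, eps_start: float) -> float:
    """Exploration rate that falls linearly from eps_start to zero."""
    frac = min(1.0, step / total_steps)
    return eps_start * (1.0 - frac)


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: random.Random) -> int:
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
q_values = [0.1, 0.7, 0.3]
chosen = [
    epsilon_greedy(q_values, linear_epsilon(t, 100, eps_start=0.9), rng)
    for t in range(101)
]
```

Early steps mostly explore, while the last step always returns the greedy action because ε has reached zero.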
In Fig. 3, we compare the algorithms in terms of convergence speed and reward. As seen, the DQL algorithm converges much faster than the QL algorithm. Specifically, the DQL algorithm converges to a stable value of the reward within episodes, while the QL algorithm needs around episodes to converge. Also, the reward obtained by the DQL is much higher than those obtained by the baseline algorithms. In particular, the rewards obtained by the DQL, QL, greedy, and random algorithms are , , , and , respectively. These results show that the DQL algorithm enables the model owner to learn the optimal policy. With the greedy algorithm, the model owner employs the special channel with the highest quality and recharges the maximum amount of energy to each worker. The greedy algorithm can enable the model owner to improve its utility. However, it incurs high channel and energy costs that significantly reduce the reward. For the random algorithm, the model owner randomly selects channels and amounts of energy. This may reduce the number of successful transmissions, and the workers may often face low energy states. Therefore, the random algorithm obtains the worst performance.
The performance improvement of the DQL algorithm over the baseline algorithms is maintained even when the mobility parameter varies, as shown in Fig. 4. Note that as the mobility parameter increases, the rewards obtained by all the algorithms generally increase. This is because, as approaches , most workers are in the coverage and the default channel is used. This significantly reduces the channel and energy costs and increases the reward.
Next, we evaluate the DQL algorithm as the number of workers varies. Fig. 5 shows the average utility obtained by the DQL algorithm for different numbers of workers. As seen, the DQL algorithm converges more slowly as the number of workers increases. The reason is that as the number of workers increases, the action and state spaces grow, which reduces the convergence speed of the algorithm. It is worth noting that the DQL algorithm converges to the same average utility regardless of the number of workers. This is because the model owner has learned the optimal policy to obtain the maximum utility.
Finally, it is interesting to consider how the model owner selects the channel for transmitting its global model to each worker given the worker’s mobility state. Without loss of generality, we consider worker 1. As shown in Fig. 6, when the worker is outside of the coverage, , the model owner selects one special channel. Specifically, special channel , i.e., , is selected since it has a lower cost than the other special channels. When the worker is present in the coverage, both types of channels, i.e., the special channels and the default channel, can be used. However, the selection frequency of the default channel is higher than that of the special channels. The reason is that the default channel has sufficient quality, e.g., , and using the default channel reduces the channel cost.
In this paper, we have presented the DQL algorithm for resource allocation in the mobility-aware federated learning network. In particular, we have first formulated the channel selection and energy decision of the model owner as a stochastic optimization problem. The optimization problem aims to maximize the number of successful transmissions of the model owner while minimizing the energy and channel costs. To solve the problem, we have developed the DQL algorithm with DDQN. The simulation results show that the reward obtained by the proposed DQL is significantly higher than those obtained by the conventional algorithms. This means that the proposed DQL algorithm enables the model owner to learn the optimal decisions under the stochasticity and uncertainty of the network environment.
- (2016) TensorFlow: A system for large-scale machine learning. In Proc. 12th USENIX Symp. Operating Syst. Des. Implementation, pp. 265–283.
- (2018-10) Unlocking 5G spectrum potential for intelligent IoT: Opportunities, challenges, and solutions. IEEE Commun. Mag. 56 (10), pp. 92–93.
- (2019-05) Efficient training management for mobile crowd-machine learning: A deep reinforcement learning approach. IEEE Wirel. Commun. Lett.
- (2018) Deep reinforcement learning for time scheduling in RF-powered backscatter cognitive radio networks. arXiv preprint arXiv:1810.04520.
- (2017-10) CrowdTracker: Optimized urban moving object tracking using mobile crowd sensing. IEEE Internet Things J. 5 (5), pp. 3452–3463.
- (2019-09) Federated learning in mobile edge networks: A comprehensive survey. arXiv preprint arXiv:1909.11875.
- (2017-08) A survey on mobile edge computing: The communication perspective. IEEE Commun. Surveys Tuts. 19 (4), pp. 2322–2358.
- (2017-04) Federated learning: Collaborative machine learning without centralized training data. Google Research Blog 3.
- (2016-02) Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629.
- (2015-02) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
- (2019-04) Federated learning over wireless networks: Optimization model design and analysis. In IEEE INFOCOM 2019-IEEE Conference on Computer Communications, pp. 1387–1395.
- (2016-03) Deep reinforcement learning with double Q-learning. In AAAI, Phoenix, Arizona, pp. 1–7.
- (1992-05) Q-learning. Machine Learning 8 (3-4), pp. 279–292.
- (2016-03) Enterprise LTE and WiFi interworking system and a proposed network selection solution. In Proc. Symp. Archit. Netw. Commun. Syst., pp. 137–138.