I Introduction
Massive connectivity is one of the most challenging requirements of Internet-of-Things (IoT) networks, and it necessitates efficient, scalable, and low-complexity network resource management. Furthermore, due to the limited computation and battery capacity of IoT devices, it is often impossible for them to process their resource-intensive tasks within a predefined deadline. To address this limitation, mobile cloud computing (MCC) and mobile edge computing (MEC) enable IoT devices to offload their tasks to cloud or edge servers and access their substantial processing capabilities, at the expense of having to transmit the tasks over dynamic wireless channels. Consequently, to take full advantage of the MCC and MEC paradigms, it becomes essential to carefully optimize offloading decisions, communication, and computation resources.
Most existing research works solved the joint offloading decision, communication, and computation resource allocation problem using tools from optimization theory [1, 2]. However, the resulting algorithms were typically non-scalable, time-consuming, and computationally expensive. Unlike optimization frameworks, deep reinforcement learning (DRL) enables agents to learn by interacting with the environment. This approach to learning makes DRL a well-suited problem-solving tool in dynamic environments. Yet, most DRL algorithms are centralized and thus suffer from a lack of scalability when the number of devices grows. Also, the computational complexity of finding an optimal policy may increase exponentially as the state space and action space grow. Furthermore, centralized learning requires IoT devices to share their information in order to train the global model, which may violate their privacy and create unnecessary communication overhead on the already scarce frequency spectrum.
Recently, federated learning (FDL) has emerged as a new paradigm for cooperative learning, where multiple nodes contribute to training a single global model. The devices use their local datasets to train local models and then send these models to the central unit for global aggregation. FDL enhances the cooperation between agents and the scalability of network resource management algorithms. Furthermore, FDL does not require local agents to share their data with any external entity, thereby preserving the privacy of each agent [3].
To date, several research works have explored the problem of minimizing the delay and energy consumption of IoT devices in an FDL system [4, 5, 6]. Specifically, in these works, FDL features were incorporated into the problem formulation and the problem was then solved using traditional optimization theory. Similarly, [7, 8, 9] focused on optimizing different aspects of FDL, such as compression of weights, convergence analysis, reduction in the number of iterations, and incentive mechanisms. In these research works, the capabilities of FDL as a part of the resource allocation solution approach were not investigated. In [10], the problem of computation resource allocation was addressed considering an FDL system. However, FDL was only used to formulate an optimization problem, which was then solved by a centralized actor-critic agent without using FDL in the solution approach.
None of the aforementioned research works applied FDL to enhance the efficacy of solving a realistic wireless resource allocation problem. Very recently, [11, 12] adopted FDL to facilitate the learning process in DRL, i.e., local DRL models were trained and then integrated to cooperatively develop a comprehensive global DRL model. However, in [11], a cooperative caching scheme was proposed and offloading decisions were not considered. In [12], computational offloading was considered; however, the network was modeled as a queuing system, and transmit power was modeled as an integer variable whose maximum value is equal to the maximum length of the energy queue. Also, in [12], no explicit quality-of-service (QoS) was guaranteed for users’ tasks and computation resource allocation was overlooked.
In this paper, we employ federated DRL (FedRL) to address the problem of jointly minimizing the expected task completion delay and energy consumption of IoT devices with offloading decisions, computation resources, and transmit powers as variables. Considering the mixed-integer nonlinear programming (MINLP) nature of the problem, we first reformulate our problem as a multi-agent DRL problem and solve it using a double deep Q-network (DDQN). In this problem, the offloading decisions are the actions, and the immediate cost is calculated by solving either the transmit power or the local computation resource optimization. To improve the learning quality and speed of DRL, we incorporate FDL at the end of each episode. Using FDL results in a privacy-preserving and scalable framework and creates a context for cooperation between agents. Our numerical results demonstrate the efficacy of our federated DDQN framework in terms of learning speed compared to federated deep Q-network (DQN) and non-federated DDQN algorithms.
The rest of this paper is organized as follows: Section II describes the system model and the problem is formulated in Section III. Then, in Section IV, our proposed algorithm is presented. Finally, simulation results and related discussions are provided in Section V followed by conclusions in Section VI.
II System Model and Assumptions
We consider a network containing one MEC server, one cloud server, and a set of IoT devices with limited computation and energy resources. We consider a given time horizon which is divided into time steps. At each time , device needs to process one of the tasks in its queue, defined with the tuple , where is the size of the task (in bits), denotes the CPU cycle requirement of the task, and denotes the maximum delay threshold of the task. At any time , devices can either execute their task locally or offload it to edge or cloud server.
Let us denote local offloading decision of device at time as , where means the task would be performed locally, and , otherwise. Similarly, we define MCC and MEC offloading variable of device by and , respectively. If device offloads its task to the cloud and if it offloads the task to the edge server . As a binary offloading decision is considered, we have:
(1) 
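For concreteness, constraint (1) can be sketched as follows. The binary indicators $x^{l}_{k}(t)$, $x^{e}_{k}(t)$, and $x^{c}_{k}(t)$ for local, edge, and cloud processing of device $k$ at time $t$ are assumed notation introduced only for this illustration:

```latex
x^{l}_{k}(t) + x^{e}_{k}(t) + x^{c}_{k}(t) = 1,
\qquad
x^{l}_{k}(t),\; x^{e}_{k}(t),\; x^{c}_{k}(t) \in \{0, 1\}.
```

Under this form, exactly one of the three processing options is active for each device at each time step.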
When device offloads its task (whether to the MEC server or to the cloud), the delay and energy consumption depend on the channel condition, the size of the task, and the power with which the device transmits its task; in the case of local computation, they depend on computation resource utilization. In what follows, we model the delay and energy consumption that an IoT device would experience, given its offloading decision.
If the device decides to offload its task, it should first transmit it to the MEC-enabled base station (BS) through wireless channels. At time step , the transmission data rate of this user, denoted by , is calculated as follows:
(2) 
where and denote the bandwidth and transmit power of device at time step , respectively. Also, and represent the path gain of device at time and the receiver noise. Thus, the communication delay and energy consumption of device while offloading are given, respectively, as follows:
(3) 
(4) 
If device offloads its task to the edge server the computation delay would be where denotes the average computation capacity of edge server. Also, if device offloads the task to cloud server, the computation delay would be where represents the average computation capacity of the cloud server.
From the perspective of IoT device, the energy that is consumed for processing a task when it is offloaded to either of the servers, is the energy spent on the transfer of the task. Therefore, both cloud and edge computing energy utilization at step , denoted by and , would be equal to .
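As a rough numerical illustration of the offloading model above, the following Python sketch assumes a Shannon-type rate expression for (2), a transmission delay equal to task size divided by rate for (3), and a transmission energy equal to transmit power times transmission delay for (4). These forms, together with all function and parameter names (including the optional cloud access delay), are illustrative assumptions and not necessarily the exact expressions used in this paper.

```python
import math


def transmission_rate(bandwidth_hz, tx_power_w, path_gain, noise_w):
    """Assumed Shannon-type rate in bit/s: B * log2(1 + p * h / noise)."""
    return bandwidth_hz * math.log2(1.0 + tx_power_w * path_gain / noise_w)


def offloading_delay_energy(task_bits, task_cycles, bandwidth_hz, tx_power_w,
                            path_gain, noise_w, server_cycles_per_s,
                            access_delay_s=0.0):
    """Return (total delay in s, device-side energy in J) for an offloaded task.

    access_delay_s is a hypothetical extra delay for reaching the cloud;
    leave it at 0.0 when the task is processed at the edge server.
    """
    rate = transmission_rate(bandwidth_hz, tx_power_w, path_gain, noise_w)
    tx_delay = task_bits / rate                         # time to upload the task
    tx_energy = tx_power_w * tx_delay                   # energy spent by the device
    compute_delay = task_cycles / server_cycles_per_s   # processing time at the server
    return tx_delay + compute_delay + access_delay_s, tx_energy
```

The device-side energy is identical for edge and cloud offloading in this sketch, since only the upload is paid for by the device; the two options differ only in the server capacity and the access delay.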
If device chooses to perform its task locally, the local computation delay and energy consumption would depend on the amount of computation resource allocated to process the task at time , which we denote by . Thus, the local delay and energy consumption of the device is modeled as follows:
(5) 
where is a constant coefficient that depends on the chip architecture of the devices. Note that higher resource utilization (transmit power or computation capacity) decreases the task completion delay at the expense of increased energy consumption. Therefore, this trade-off must be carefully managed through efficient offloading decision making and precise optimization of and in the case of local computation and offloading, respectively.
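The delay and energy trade-off for local computation can be illustrated with a small sketch. It assumes the widely used model in which the local delay equals the required CPU cycles divided by the allocated CPU frequency and the local energy equals a chip coefficient times the cycles times the squared frequency. This form and the names `task_cycles`, `cpu_cycles_per_s`, and `kappa` are assumptions made for illustration; the exact expression in (5) may differ.

```python
def local_delay_energy(task_cycles, cpu_cycles_per_s, kappa):
    """Return (delay in s, energy in J) for local execution.

    Assumed model: delay = C / f and energy = kappa * C * f**2, where C is the
    CPU cycle requirement, f the allocated frequency, and kappa a chip constant.
    """
    delay = task_cycles / cpu_cycles_per_s
    energy = kappa * task_cycles * cpu_cycles_per_s ** 2
    return delay, energy
```

Under this assumed model, doubling the allocated frequency halves the delay but quadruples the energy, which is the trade-off described above.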
III Multi-objective Problem Statement
In this section, we formulate the multi-objective problem of jointly minimizing the long-term delay and energy consumption of an IoT device in a decentralized manner over a specified time horizon . The long-term expected cost (weighted sum of delay and energy consumption) for each IoT device is formulated, respectively, as follows:
(6) 
(7) 
where , , , , and , represent the vectors of transmit powers, computation resource allocation, local computing, edge offloading, and cloud offloading decisions of device , respectively. As the cloud server is generally located far from the IoT devices, the delay of accessing the cloud is commonly larger than the delay of offloading to the edge server, which is located at the edge of the network. In the cost expression above, denotes the delay of accessing the cloud server, including the time necessary to transfer the task from the BS to the cloud, the possible routing in the path, and the response delay.
For any given device at time , we model our problem as:
(8)  
In the above optimization problem, is a weighting factor whose value should be carefully selected based on the heterogeneity of resources available at each individual IoT device. If device is more restricted in its energy resource than in its computation resource, the value of should be set to a larger number. Otherwise, should be a small number. Furthermore, constraint C1 bounds the local computation capacity by the maximum threshold . Constraint C2 represents the restriction on the energy resource of the device, i.e., energy utilization should not exceed . Furthermore, constraints C4 and C5 define the binary offloading scheme adopted in this paper. It can be proven that the two cost expressions given before (8) are convex with respect to the variables and , respectively. However, with the binary offloading variables (, , and ) included, (8) turns into an MINLP that cannot be solved in an acceptable time span.
IV Proposed Federated DDQN Algorithm
To solve (8) at each IoT device, we propose a DDQN algorithm, and solve the problem in the following two phases:
Offloading Decision Optimization: Since each IoT device has three options to process a task (namely local, edge server, or cloud server computing), there are almost possible offloading options (from the perspective of a centralized controller) at each given time step. As the number of devices increases, this complexity grows exponentially. To address this problem, we apply a multi-agent DDQN framework in which each IoT device trains its local DDQN model using its local data.
Computing and Communication Resource allocation: Given the offloading decision, we optimize computation capacity or transmit power of the devices to minimize the weighted sum of energy consumption and delay. We use optimization theory to address this part of the problem and then feed the results into the DDQN framework as the immediate cost function. In this way, we provide the learning agent with a real sense of the quality of the adopted offloading policy that reflects many important aspects of the system model (such as limitation of resources in each device and their QoS demands).
After the DDQN agent is trained through the above-mentioned process for one training round, we apply a federated learning framework in which each IoT device trains its DDQN model, shares the model with the central aggregating unit, and then updates its local model with the aggregated global model returned by that unit. This mechanism is detailed in the flowchart provided in Fig. 1.
In what follows, we first focus on developing local models through DDQN algorithm and then explain how FDL would be deployed.
IV-A Double deep Q-network for offloading decision making
In the first step of our algorithm, we model our problem as a multi-agent DDQN problem. For each device (DRL agent), we have the following components:

State space: the state space for each agent , denoted by , consists of the following components: the length of the task queue of device (tasks that are not yet processed or are not successfully processed, would be kept in this queue) which is denoted by , the path gain of the IoT device , the size of the task currently being processed , its CPU cycle requirement, , and available resources. Thus, . If a task is successfully processed under a given offloading decision policy (its QoS requirement is satisfied), it would be removed from the task queue of the device. Otherwise, it will remain at the top of the queue to be processed under another offloading decision.

Action space: The action space of agents, denoted by , contains possible offloading decisions, i.e., whether to process the task locally or offload.

Cost: (8) suggests that the cost of an agent is equal to the weighted summation of the delay and energy consumption given in the objective function. The value of this objective function and thus the cost depends on the value of in the case of offloading and if local computation is selected. Therefore, to ensure that the cost function accurately reflects the benefit of a given offloading decision, these variables should be carefully optimized. To this end, when local computation is selected (), the cost would be calculated by solving the instantaneous optimization problem below:
(9) subject to: C1, C2, C3.
In the case offloading is selected ( or ), the transmit power is optimized by solving the following optimization problem:
(10) subject to: C2, C3.
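The two single-variable problems above can be sketched numerically as follows. The sketch assumes that the cost is the delay plus a weight times the energy, that local execution follows delay = C/f and energy = kappa * C * f**2, that C3 is a deadline constraint, and that the offloading energy equals transmit power times transmission delay under a Shannon-type rate. None of these forms, nor the function and parameter names, is confirmed by the text above; they are illustrative assumptions. The local problem then has a closed-form stationary point that is clipped to the feasible interval, while the transmit power is found by a generic ternary search that relies only on the cost being unimodal on caller-supplied bounds (standing in for C2 and C3).

```python
import math


def optimal_local_frequency(task_cycles, kappa, weight, f_max, energy_max, deadline_s):
    """Minimise C/f + weight * kappa * C * f**2 over the feasible frequencies.

    Returns the CPU frequency in cycles/s, or None if no frequency meets all limits.
    """
    lower = task_cycles / deadline_s                                    # deadline: C/f <= deadline
    upper = min(f_max, math.sqrt(energy_max / (kappa * task_cycles)))   # capacity and energy caps
    if lower > upper:
        return None
    stationary = (2.0 * weight * kappa) ** (-1.0 / 3.0)   # root of the cost derivative
    return min(max(stationary, lower), upper)             # convex cost: clip to the interval


def minimise_unimodal(cost, lower, upper, iterations=200):
    """Ternary search for the minimiser of a unimodal function on [lower, upper]."""
    for _ in range(iterations):
        third = (upper - lower) / 3.0
        left, right = lower + third, upper - third
        if cost(left) <= cost(right):
            upper = right
        else:
            lower = left
    return 0.5 * (lower + upper)


def offloading_cost(tx_power_w, task_bits, bandwidth_hz, gain_over_noise, weight):
    """Assumed transmission cost: delay + weight * energy, with energy = p * delay."""
    delay = task_bits / (bandwidth_hz * math.log2(1.0 + tx_power_w * gain_over_noise))
    return delay * (1.0 + weight * tx_power_w)
```

For example, `minimise_unimodal(lambda p: offloading_cost(p, 1e6, 1e6, 10.0, 1.0), 0.01, 2.0)` searches the transmit power on a hypothetical feasible range of 0.01 W to 2 W; the minimized cost value would then serve as the immediate cost fed to the learning agent.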
The design of the state and cost function has a significant impact on the success of DDQN in finding the optimal offloading policy, . By using a multi-agent approach, we in fact limit the state and action spaces and focus on each device separately. Also, by modeling the cost function as an optimization problem, not only can we optimize the local resource utilization and enforce system constraints, but we can also provide the agent with an accurate estimate of the quality of an offloading decision. Note that it can be easily proved that both (9) and (10) are convex single-variable optimization problems that can be solved using standard software.
Let us denote the immediate cost of each device obtained from the solution of the above-mentioned optimization process as . Using the Bellman equation, the action-state value is:
(11) 
where , , and are the set of states, the transition probability function, and the discount factor, respectively. To overcome the need for a full model of the environment and for calculating the transition probability function, and to obtain a more stable learning process, DDQN is employed in this work. Each agent has two neural networks working alongside each other, one called the online network with parameters and the other called the target network with parameters . At each training iteration, the target value for training the online network in device is calculated as:
(12) 
While is updated at every iteration, the frequency of change in is typically much lower and only once in every rounds, would be set equal to .
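The interplay between the online and target networks can be sketched as follows, using the standard double Q-learning target adapted to cost minimization: the online network selects the next action (the one with the lowest estimated cost) and the target network evaluates it. Whether (12) uses exactly this form is an assumption, and the function names, the list-based Q-value representation, and the `sync_every` parameter are hypothetical.

```python
def ddqn_target(cost, next_q_online, next_q_target, gamma, terminal=False):
    """Double-DQN target for a cost-minimising agent.

    next_q_online / next_q_target: lists of Q-values of the next state, one
    entry per action, produced by the online and target networks respectively.
    """
    if terminal:
        return cost
    # Action selection with the online network (lowest estimated cost).
    best_action = min(range(len(next_q_online)), key=lambda a: next_q_online[a])
    # Action evaluation with the target network.
    return cost + gamma * next_q_target[best_action]


def sync_target(online_weights, target_weights, step, sync_every):
    """Copy the online weights into the target network once every sync_every steps."""
    if step % sync_every == 0:
        return list(online_weights)
    return target_weights
```

Decoupling selection from evaluation in this way is what distinguishes the double Q-learning target from the plain DQN target, in which the target network both selects and evaluates the next action.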
As discussed before, training a DRL agent in a centralized manner can lead to critical issues related to scalability, agents’ privacy, and additional communication overheads. On the other hand, training a DRL agent in a distributed manner can impact the overall performance gains (e.g., an agent might consume a longer time to train its model). As such, we consider FDL to combine the benefits of both centralized and distributed learning. FDL enables each agent to train its own local model, using its own local data. Then these local models are sent to a central aggregation unit to be combined together. This process continues until a criterion is met.
IV-B Federated DRL Approach
The steps to train the FedRL agents are presented in the following:
IV-B1 Device selection strategy
At the beginning of each iteration of FDL, a set of IoT agents are selected to participate in the FDL. Thus, of all devices in the network, only a small subset, denoted by is selected to contribute in FDL. In this paper, the device selection is done based on the following criterion:
(13) 
where represents the distance of device from the BS and the function Var stands for variance. This criterion helps in identifying devices whose experiences are more heterogeneous and can thus contribute more to the learning process.
IV-B2 Training local models
As explained previously, all IoT devices use DDQN to train their local models. After this local training is finished (no unprocessed task remains in the queue), the weights of the online network, , are extracted in each agent and then sent to the central aggregating unit.
IV-B3 Model Aggregation
When central unit receives the models of participating IoT devices, it would aggregate the models which results in a single global model that would be then transmitted to all agents. For the purpose of aggregation, we utilize FedAvg [13], and perform model aggregation as:
(14) 
This global model, which has integrated the experiences of all devices, is then transmitted back to the IoT devices, and the three steps above are repeated. The details of our proposed framework are provided in Algorithm 1 as well as the flowchart given in Fig. 1.
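The aggregation step can be illustrated with a minimal FedAvg-style sketch. It represents each local model as a flat list of weights and averages them element-wise, optionally weighting each device by a sample count. The flat-list representation, the function name `fed_avg`, and the `sample_counts` argument are assumptions made for illustration; the exact weighting used in (14) is not recoverable from the text above.

```python
def fed_avg(local_models, sample_counts=None):
    """Element-wise (optionally weighted) average of local model weights.

    local_models: list of flat weight lists, one per participating device.
    sample_counts: optional per-device weights; defaults to a plain average.
    """
    if sample_counts is None:
        sample_counts = [1] * len(local_models)
    total = float(sum(sample_counts))
    dim = len(local_models[0])
    return [
        sum(count * model[i] for model, count in zip(local_models, sample_counts)) / total
        for i in range(dim)
    ]
```

The returned list plays the role of the global model that is sent back to the devices, each of which would overwrite its online network weights with it before the next round.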
V Simulation results
Here, we present our simulation results and extract useful insights related to the performance of our proposed federated DDQN framework in comparison to federated DQN and distributed DDQN algorithms. In addition, we investigate the impact of batch size, network layers, and target network update frequency on the convergence of the FDL. In what follows, we first focus on the impact of the DRL parameters on the learning speed of our proposed algorithm and then present the comparison of the proposed algorithm with the benchmarks.
To simulate our system, we consider a network of 100 IoT devices among which only 20 devices are selected in each round to contribute in the FDL process. Without loss of generality and for the sake of fair comparison, we assume the maximum computation capacity and energy consumption limit of the IoT devices are 1 Gbps and 23 dBm, respectively.
Fig. 2 demonstrates the effect of network architecture on the convergence of our proposed FedRL algorithm. Here we have some shallow networks with up to five layers and deeper networks that are obtained by stacking multiple layers with [16,32,32] neurons on each other. We note that by increasing the number of layers, faster model training can be achieved. The reason behind this observation is that, by exploiting deeper neural networks, we can better find the patterns in data (here devices’ experiences), which subsequently improves the quality of our local models. Thus, the global model is trained much faster as its underlying components, local models, are more accurate.
However, since our algorithm will be executed on IoT devices that often lack necessary resources to train a deep network, it may be infeasible to implement deeper neural networks. Therefore, in the next figure, we select a rather simple network architecture with [30,64,16,32,32] neurons in each layer and instead look for other parameters that may facilitate the learning process.
The other parameter we focus on, in Fig. 3, is the batch size. We can observe from this figure that as the batch size increases, the convergence of our proposed FedRL algorithm becomes faster. When the batch size is extremely small (equal to 10), it takes up to 200 iterations to converge to a global model, whereas with a batch size of 30, convergence is achieved at around iteration 40. By increasing the batch size, we are basically training our model using more data instances, which enhances the quality of the local models and speeds up the training process.
Similar to network architecture and the stated concern regarding the limited computation capacity, memory is another bottleneck in learning process of IoT devices. Larger batch size means higher memory consumption. If the device is limited both in CPU and memory capacity, neither very deep neural networks nor increased batch size can be a proper solution to facilitate deployment of FedRL on IoT devices.
To this end, in Fig. 4, we illustrate the effect of one of the parameters of DDQN, namely the frequency of updating the target network with the online network. We can observe that, while the effect of this parameter on the performance of the DDQN algorithm is well investigated, it also has a considerable effect on the performance of federated DDQNs. Since many of the components in our state space, such as the path gain, the QoS of tasks, and the length of the task queue, are constantly changing, an efficient choice of the target network update frequency can stabilize the environment enough for the agent to track it better and obtain a better solution. This effect on the local models carries over notably to the FDL as well.
In Fig. 5, we compare the performance of our proposed federated DDQN approach with those of federated DQN and simple distributed DDQN without any aggregation. As can be seen, the performance of federated DDQN is superior to federated DQN in terms of learning speed. As previously explained, the main advantage of DDQN over DQN is the capability to keep the target network stationary, helping with the tractability of states’ values and subsequently a faster convergence to the correct estimation of them. The impact of this approach is even more notable when DRL is combined with FDL, since if the local models are not correctly trained, their errors would be propagated to other devices’ local models through aggregation. Therefore, aggregation can in fact negatively affect the result.
The comparison between distributed DDQN and federated DDQN underlines that the benefits of federated DRL are not limited to its scalability and privacy preservation. The aggregation incorporated in FDL provides IoT devices a great context to cooperatively train their models and merge their intelligence together while preserving privacy of their information. Exploiting federated learning, at every training round is almost as if devices’ models are trained with times more data than their local information. The significance of this share of knowledge and not data is quite notable in Fig. 5, where as the result of this aggregation step, federated DDQN is working much better and faster than simple distributed DDQN. Note that, averaging is used as the smoothing function.
VI Conclusion
In this paper, we investigated the problem of joint delay and energy minimization in an IoT network with a three-tier offloading scheme. To solve this problem, we combined FDL, DDQN, and optimization theory. The combination of these tools helped us to achieve a scalable, privacy-preserving, and computationally efficient framework for joint power and computation resource allocation and offloading decision optimization. In the simulation results, we compared our work with 1) federated DQN, to demonstrate the superiority of DDQN, especially in dynamic environments, and 2) distributed DDQN, to highlight the impact of the aggregation step incorporated in FDL on the performance of the framework.
References
 [1] S. Zarandi and H. Tabassum, “Delay minimization in sliced multicell mobile edge computing (mec) systems,” IEEE Communications Letters, pp. 1–1, 2021.
 [2] A. Khalili, S. Zarandi, and M. Rasti, “Joint resource allocation and offloading decision in mobile edge computing,” IEEE Communications Letters, vol. 23, no. 4, pp. 684–687, 2019.
 [3] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y. C. Liang, Q. Yang, D. Niyato, and C. Miao, “Federated learning in mobile edge networks: A comprehensive survey,” IEEE Communications Surveys Tutorials, vol. 22, no. 3, pp. 2031–2063, 2020.
 [4] S. Luo, X. Chen, Q. Wu, Z. Zhou, and S. Yu, “Hfel: Joint edge association and resource allocation for cost-efficient hierarchical federated edge learning,” IEEE Transactions on Wireless Communications, vol. 19, no. 10, pp. 6535–6548, Oct 2020.
 [5] J. X. X. Mo, “Energy-efficient federated edge learning with joint communication and computation design,” arXiv, Feb 2020.
 [6] J. Yao and N. Ansari, “Enhancing federated learning in fog-aided iot by cpu frequency and wireless power control,” IEEE Internet of Things Journal, pp. 1–1, 2020.
 [7] Y. Chen, X. Sun, and Y. Jin, “Communication-efficient federated deep learning with layerwise asynchronous model update and temporally weighted aggregation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 10, pp. 4229–4238, 2020.
 [8] J. Mills, J. Hu, and G. Min, “Communication-efficient federated learning for wireless edge intelligence in iot,” IEEE Internet of Things Journal, vol. 7, no. 7, pp. 5986–5994, 2020.
 [9] Y. Zhan, P. Li, Z. Qu, D. Zeng, and S. Guo, “A learning-based incentive mechanism for federated learning,” IEEE Internet of Things Journal, vol. 7, no. 7, pp. 6360–6368, 2020.
 [10] Y. Zhan, P. Li, and S. Guo, “Experience-driven computational resource allocation of federated learning by deep reinforcement learning,” in 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020, pp. 234–243.
 [11] X. Wang, C. Wang, X. Li, V. C. M. Leung, and T. Taleb, “Federated deep reinforcement learning for internet of things with decentralized cooperative edge caching,” IEEE Internet of Things Journal, vol. 7, no. 10, pp. 9441–9455, 2020.
 [12] J. Ren, H. Wang, T. Hou, S. Zheng, and C. Tang, “Federated learning-based computation offloading optimization in edge computing-supported internet of things,” IEEE Access, vol. 7, pp. 69 194–69 201, 2019.
 [13] D. R. H. B. McMahan, E. Moore and B. A. Y. Arcas, “Federated learning of deep networks using model averaging,” arXiv preprint arXiv:1602.05629, 2016.