I Introduction
Over the past decade, moving computing, control, and data storage into the cloud has been an important trend, as it provides the abundant computing resources needed to handle explosive traffic demands. However, this relocation introduces network delays that pose significant challenges to meeting the latency requirements of critical applications. To overcome these disadvantages of the cloud, fog computing, which selectively moves computation, communication, control, and decision making close to the network edge where data is generated, has become inevitable [1]. One of the key benefits of fog computing stems from its highly virtualized platform, which offers computing capacities that allow various applications to run anywhere. Hence, fog computing resolves the problems of cloud-only solutions for applications that require a real-time response with low latency, e.g., mission-critical applications [2, 3]. Given the substantial benefits that can be drawn from this technology, fog computing is expected to play a crucial role in IoT, 5G, and other advanced distributed and connected systems [4, 5, 6, 7].
In fog networks, where fog nodes and cloud data centers present heterogeneous resources (e.g., computational, bandwidth, and memory), service tasks are classified according to various performance requirements and heterogeneous resource configurations. In contrast to cloud servers, the computing capacity of fog nodes is usually limited and inhomogeneous. Thus, computation-intensive tasks often exhibit poor performance when they are processed by fog nodes with extremely limited resource capacities [8, 9]. In this context, offloading and distributing tasks over the network while guaranteeing the Quality-of-Service (QoS) requirements of the users can be particularly useful. Given that fog nodes are located relatively close to each other, offloading from the originally requested fog node to a neighbor node with available resources can be an attainable solution even for delay-critical services. Moreover, it is critical for the fog nodes and the cloud to be able to cope with heterogeneous tasks when deciding which service tasks should be deployed and where. Hence, the fog and the cloud should complement each other in a distributed way to fulfill service needs. To this end, the hierarchical fog architecture was introduced to support a better distribution of the computing resources with an elastic and flexible placement of resources [10].

On the other hand, as wireless services and applications become more sophisticated and intelligent, there is a pressing need for efficient management of the execution of increasingly complex tasks based on the application requirements. Specifically, the selection of suitable nodes and proper resource assignments are critical in fog networks, where various types of applications run simultaneously over the same network [11, 12]. The problem is exacerbated by the highly volatile service demands of end-users and the uncertainties associated with resource availability at the fog nodes. When fog nodes handle significantly different traffic demands, the imbalance between fog nodes can lead to inefficient resource usage and inequitable QoS [13]. Furthermore, each node cannot acquire full knowledge of the other nodes due to the nontrivial signaling overhead and communication latency this would incur.
Therefore, how to optimally distribute the computing resources throughout the network, and how to design algorithms based on local knowledge that yield globally emergent system characteristics such as agility, efficiency, and reliability, are the central questions that guide this paper [14].
I-A Related work
Recently, significant efforts have been centered on fog computing implementations to tackle the limitations of traditional cloud platforms. Specifically, many approaches have been proposed in the literature to enhance task offloading and resource allocation in fog networks. Yousefpour et al. [15] introduce a delay-minimizing offloading policy for fog nodes in IoT-fog-cloud application scenarios, where the policy considers not only the length of the queue but also the different request types that vary in processing times. Accordingly, a fog node decides whether or not to offload each task: if its estimated waiting time is greater than an acceptance threshold, it offloads the request to its best neighbor fog node. Zhang et al. [16] formulate the resource allocation problem between fog nodes, data service operators, and data service subscribers. First, service subscribers purchase the optimal number of computing resource blocks from service operators through a Stackelberg game. Each subscriber competes for the required computing resource blocks owned by the nearby fog nodes. With a many-to-many matching game between service operators and fog nodes, each operator determines which of its fog nodes have computing resources to sell. With another many-to-many matching framework, resource blocks of fog nodes are allocated to the service subscribers.

Although some promising works have been dedicated to studying task offloading and resource allocation in fog computing and edge computing networks, it is necessary to jointly address the two issues to improve the overall performance. Wang et al. [17] propose to jointly address computation offloading, resource allocation, and content caching in wireless cellular networks with mobile edge computing. First, they transform the original nonconvex problem into a convex problem and prove the convexity of the transformed problem. Then, they decompose the problem and apply an alternating direction method of multipliers to solve it in a distributed and practical way. Alameddine et al. [18] address task offloading with joint resource allocation and scheduling, specifically focused on delay-sensitive IoT services. They mathematically formulate the problem as a mixed-integer program and present a decomposition approach that achieves faster run times while providing the optimal solution. For heterogeneous real-time tasks, the authors in [19] study task offloading and resource allocation in a three-tier fog system with a parallel virtual queueing model. They apply an adaptive queuing-weight resource allocation policy based on the Lyapunov function. Moreover, they propose multi-objective sorting policies in terms of both the laxity and the execution times of the tasks to achieve a tradeoff between throughput and task completion ratio optimization.
However, the computation offloading and resource allocation designs in [15, 17, 18, 19] are mostly based on one-shot optimization and may not be able to achieve a long-term stable performance in dynamic situations. Moreover, since most optimization problems that arise in network resource management are nonconvex and NP-hard, these algorithms generally impose restrictions on the network to simplify nontrivial mathematical formulations [20]. Nevertheless, such assumptions would require a revision of the objective functions, or even the system models, that led to these problem formulations in the first place.
Furthermore, there are related works using different metaheuristic methods [21, 22, 23, 24]. Mishra et al. [22] introduce the scheduling of service requests to virtual machines (VMs) with minimum energy consumption at the fog servers. They formulate the service allocation problem for the heterogeneous fog server system using three metaheuristic methods, namely particle swarm optimization (PSO), binary PSO, and the bat algorithm. Moreover, the authors in [24] introduce a new evolutionary algorithm (EA) that combines PSO and a genetic algorithm (GA) for the joint design of computation offloading and fog node provisioning.
However, in metaheuristic algorithms, the memory required to maintain a population of candidate solutions becomes vast as the problem size increases. Specifically, due to the larger search space in large-scale problems, almost every state encountered will never have been seen before, which makes it impossible to converge within a limited number of time steps. In that respect, as the system model becomes more complex, metaheuristic methods can no longer be applied. In this context, to make sensible decisions in such large search spaces, it is necessary to generalize from previous encounters with states that are in some sense similar to the current one.
In order to cope with the unprecedented level of complexity that arises when many parameters are used to accurately model the system, embedding versatile machine intelligence into future wireless networks is drawing unparalleled research interest [25, 26]. Many recent works address the resource allocation problem in IoT networks by using supervised machine learning, e.g., Support Vector Machines (SVMs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs) [27]. Nevertheless, supervised learning learns from a fixed data set. Thus, the algorithm does not directly interact with the environment in which it operates, which is not adequate for dynamically provisioning on-demand resources, especially for highly volatile IoT application demands. Moreover, in the context of resource management for IoT networks, the lack of sufficient labeled data is another factor that hinders the practicality of supervised learning-based algorithms. On the other hand, a machine learning technique that falls in neither the supervised nor the unsupervised category is reinforcement learning (RL)
[28]. One of the key features of reinforcement learning is that it explicitly considers the problem of a goal-directed agent interacting with an uncertain environment. Therefore, RL-based algorithms can continually adapt as the environment changes without needing explicit system models. To tackle the curse of dimensionality of RL, deep reinforcement learning (DRL) was recently introduced [29]. DRL embraces deep neural networks to train the learning process, thereby improving the learning speed and the performance of RL-based algorithms. Therefore, DRL can provide efficient solutions for future IoT networks [30].

The authors in [31] introduce an optimal computation offloading policy for mobile edge computing (MEC) in an ultra-dense system based on a deep Q-network (DQN), without prior knowledge of the dynamic statistics. Pan et al. [32] study dependency-aware task offloading decisions in MEC based on Q-learning, aiming to minimize the execution time of mobile services under limited battery power consumption. Huynh et al. [33]
develop an optimal and fast resource slicing framework based on a semi-Markov decision process (MDP) that jointly allocates the computing, storage, and radio resources of the network provider to different slice requests. To further enhance the performance, they propose a deep Q-learning approach with a deep dueling neural network, which improves the training process and outperforms other reinforcement learning techniques in managing network slicing. Chen et al. [34] consider a software-defined radio access network where multiple service providers (SPs) compete to acquire channel access for their subscribed mobile users, whereby each mobile user proceeds to offload tasks and schedule queued packets over the assigned channel. They transform the stochastic game between non-cooperative SPs into an abstract stochastic game and propose a linear decomposition approach to simplify decision making. Also, a DRL algorithm is leveraged to tackle the huge state space. Sun et al. [35] propose a DRL-based joint communication mode selection and resource management approach with the objective of minimizing the network power consumption. This approach helps the network controller learn the environment from collected data and make fast, intelligent decisions to reduce power consumption. Moreover, the tremendous growth in data traffic over next-generation networks can be substantially reduced via caching, which proactively stores reusable contents in geographically distributed memory storage [36, 37, 38, 39]. The authors in [37] study the joint cache and radio resource optimization on different timescales in fog access networks. The optimization problem is modeled as a Stackelberg game. To solve the problem, single-agent RL and multi-agent RL are utilized and rigorously analyzed. Meanwhile, the authors in [39] exploit the power allocation problem in non-orthogonal multiple access for a system with cache-enabled users. They propose a DRL-based scheme, which responds quickly to requests from users and allows all users to share the full bandwidth. Also, they show that applying iterative optimization algorithms is not suitable for satisfying a short-response-time requirement from the base station to users.

I-B Contributions
This paper focuses on resource management in a fog system with the aim of guaranteeing the specific quality of service of each task as well as maximizing resource utilization through cooperation between fog computing nodes. To address this problem, we design a joint heterogeneous task offloading and resource allocation algorithm whose goal is to maximize the number of tasks completed within their delay time limits. More precisely, we consider an independent multi-agent decision-making problem that is cooperative and partially observable. To solve this problem, we propose a deep recurrent Q-network (DRQN) based learning algorithm, namely deep Q-learning combined with a recurrent layer. The DRQN-based algorithm aims to cope with partially observable environments by maintaining internal states. In particular, to guarantee the convergence and accuracy of the neural network, the proposed DRQN-based algorithm adopts an adjusted exploration-exploitation scheduling method, which efficiently avoids the exploitation of incorrect actions as the learning progresses. To the best of our knowledge, this is the first work that introduces DRQN to solve the joint task offloading and resource allocation problem in fog computing networks. The key contributions of this paper are summarized as follows.
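The adjusted exploration-exploitation scheduling is detailed later in the paper; as a rough illustration of the general idea, a plain linear annealing schedule (parameter names and values here are hypothetical, not the paper's) can be sketched as:

```python
def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal the exploration rate from eps_start down to eps_end.

    After decay_steps the rate stays at eps_end, so late training mostly
    exploits the learned Q-values instead of exploring random actions.
    """
    frac = min(step / decay_steps, 1.0)
    return eps_start + (eps_end - eps_start) * frac
```

At each decision epoch, the agent would pick a random action with probability `epsilon_schedule(step)` and the greedy (arg-max Q) action otherwise.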

The proposed algorithm considers two levels of heterogeneity of service tasks in terms of QoS requirements and resource demand characteristics. In real IoT scenarios, various service tasks demanding different resource sizes can require different service performances. In order to capture these heterogeneities, we propose a fog network slicing model that manages different types of tasks separately and partitions physical resources across slices.

Regarding the feedback and memory overhead, we consider cooperative scenarios where the independent multi-fog nodes perceive a common reward that is associated with each joint action, while estimating the value of their individual actions solely based on the rewards that they receive for those actions. This reduces the feedback and memory overheads considerably compared to joint-action learning schemes, where fog nodes require the reward, observation, and action sets of the others.

To deal with partial observability, we apply a DRQN approach to approximate the optimal value functions. The DRQN-based algorithm can tackle partial observability by enabling the agent to perform temporal integration of observations. This solution is more robust than DQN and deep convolutional Q-network (DCQN) based methods in that the neural network with a recurrent layer can condition its output on the temporal pattern of observations by maintaining a hidden state; thus it can keep internal states and aggregate observations. Moreover, to guarantee the convergence and accuracy of the neural network, an adjusted exploration-exploitation method is adopted.

Numerical experiments using TensorFlow are presented to support the model and the proposed algorithm. The proposed DRQN-based algorithm requires much less memory and computation than conventional Q-learning and metaheuristic algorithms, which are impractical for solving the problem considered in this paper. In particular, the proposed DRQN-based algorithm is compared to the DQN and DCQN approaches, and it is demonstrated that the performance in terms of average success rate, average overflow rate, and task delay can be significantly enhanced by using the proposed DRQN-based algorithm.
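To illustrate the property that distinguishes a recurrent Q-network from a feed-forward DQN, the following toy NumPy model (not the paper's actual TensorFlow architecture; all dimensions and the plain tanh cell are illustrative) shows how a maintained hidden state makes the Q-values depend on the history of partial observations:

```python
import numpy as np

class TinyDRQN:
    """Toy recurrent Q-network: the hidden state integrates the history of
    partial observations, so Q-values depend on the temporal pattern rather
    than on a single observation (the property DRQN adds over DQN)."""

    def __init__(self, obs_dim, hidden_dim, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.Wx = rng.normal(0.0, 0.1, (hidden_dim, obs_dim))
        self.Wh = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.Wq = rng.normal(0.0, 0.1, (n_actions, hidden_dim))
        self.h = np.zeros(hidden_dim)

    def reset(self):
        # clear the internal state at the start of an episode
        self.h = np.zeros_like(self.h)

    def step(self, obs):
        # plain tanh RNN cell (an LSTM/GRU layer would be used in practice)
        self.h = np.tanh(self.Wx @ obs + self.Wh @ self.h)
        return self.Wq @ self.h  # one Q-value per action
```

Feeding the same observation twice yields different Q-values the second time, because the hidden state has changed; a memoryless DQN would return identical values.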
I-C Organization
The remainder of this article is organized as follows. In Section II, the system description is presented. The formulation of the offloading and resource allocation problem as a partially observable MDP (POMDP) is detailed in Section III. In Section IV, we propose the cooperative decision-making problem between independent multi-nodes and derive a deep reinforcement learning scheme to solve the problem formulated in Section III. Simulation results are presented in Section V. Finally, Section VI concludes this paper and provides insight on possible future work.
II System description
In this section, we introduce a three-layer fog network system model that supports the integration of different services while serving the best of each dissimilar service characteristic, such as CPU processing density and delay requirements, through a hierarchical model. The time horizon is divided into decision epochs of equal duration (in milliseconds), indexed by an integer.

Symbol  Definition
The set of fog nodes
The set of cloud servers
The set of fog slices at fog node
The packet size of the task of slice
The maximum delay budget of the task of slice
The arrival rate of tasks of slice at fog node
The number of tasks in the buffer of slice at fog node
The CPU processing density demanded for the task of slice
The memory size demanded for the task of slice
Total CPU resource capacity of fog node
Total memory resource capacity of fog node
Total CPU resource capacity of cloud server
Total memory resource capacity of cloud server
The allocation unit of CPU resource at fog node
The allocation unit of memory resource at fog node
The allocation unit of CPU resource at cloud server
The allocation unit of memory resource at cloud server
The transmission bandwidth of fog node
The transmission power of fog node
The path loss constant and exponent
The available CPU resource units at fog node
The available memory resource units at fog node
The local reward observed by fog node
II-A Three-layer fog network system
To improve scalability and resource utilization in fog networks, a three-layer hierarchy is the most commonly considered architecture [5, 10]. A three-layer fog network consists of an end-device layer, a fog layer, and a cloud layer. The end-device layer includes various types of devices, known as unintelligent devices, that only produce data and are not equipped with processing capacity. Therefore, devices can request the nearest fog node to run their tasks on their behalf. These tasks can be classified into types according to two characteristics: the task performance requirement (also called QoS) and the heterogeneous resource configuration (e.g., different CPU and memory configurations).
The fog layer is composed of multiple fog nodes. As illustrated in Fig. 1, we consider a fog layer where the physical network infrastructure is split into multiple virtual networks to serve heterogeneous service requests from different types of end-device segments, a technique known as network slicing [40]. With network slicing technology, fog nodes can set up customized slices to guarantee the specific latency and resource requirements of the supported services. Fog nodes are formed by one or more physical devices with high processing capabilities, which are aggregated as one single logical entity able to seamlessly execute distributed services as if on a single device. The shared physical resources (e.g., bandwidth, CPU, and memory) on fog nodes are partitioned into fog slices to enable running the network functions that meet certain required slice characteristics (e.g., ultra-low latency, ultra-reliability, etc.).
Finally, the cloud layer includes various servers that are equipped with ample storage and computational resources but are physically remote from end-devices. In our architecture, the fog network comprises a single cloud server that interacts with all fog nodes.
II-B SDN-based fog nodes and inter-SDN communications
To provide more distributed and scalable management, fog nodes make use of software-defined networking (SDN) techniques, where the control plane is capable of decision-making while the data plane simply serves forwarding and computing tasks [41, 42, 43]. Furthermore, many different applications operate concurrently in the SDN application plane. Besides, as individual SDN controllers are located in separate fog nodes, we apply the concept of inter-SDN communications, which interconnects controllers to share information and coordinate their decisions [44]. It is noted that the need for inter-SDN communications grows as the explosive increase in the task demands of end devices requires networks formed by more than one SDN controller [45]. In our system model, each fog node defines, deploys, and adapts independent decision-making in its separate SDN controller, and communication between the SDN controllers of the fog nodes aims to exchange the feedback information required by the independent decision-making processes. Details about the information exchange are provided in Section III.
II-C Fog slices based on heterogeneous task models
We consider a fog network with fog nodes deploying the same set of logical fog slices for different task types over separate infrastructures. Let be the set of slices available in the node, where each slice processes a specific type of task in a separate buffer. A task demanded from the end devices by a slice has a size of bits. Since the time slot duration is relatively small, at most one task arrives at each slice of a fog node within a time slot. At the beginning of each time slot , let be the arrived task, where if a task of slice arrives at fog node , and otherwise. Hence, the probability that a new task of slice arrives at fog node within a time slot follows a Bernoulli distribution with parameter . The number of tasks in the buffer of slice at fog node and time slot is , of which tasks are in progress in time slot ; the maximum buffer size is .

Classified tasks in each slice have specific QoS requirements as well as different resource configurations. We assume that tasks delivered from end-devices are classified into the corresponding slices according to their characteristics without manual intervention, since classifying tasks to predict the type of application [46] is outside the scope of this paper. In terms of QoS, we categorize tasks into three classes: 1) the delay-critical class (e.g., self-driving cars, live streaming), 2) the delay-sensitive class (e.g., augmented reality/virtual reality (AR/VR), smartphone applications), and 3) the delay-tolerant class (e.g., IoT sensors). The priority of tasks is determined in a way that provides maximum reliability within an acceptable delay, proper to each slice.
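Assuming a single slice with a hypothetical arrival probability, the Bernoulli arrival process described above can be sketched as:

```python
import random

def simulate_arrivals(p_arrival, n_slots, seed=0):
    """Draw Bernoulli task arrivals for one slice of one fog node:
    at most one task can arrive per time slot, with probability p_arrival."""
    rnd = random.Random(seed)
    return [1 if rnd.random() < p_arrival else 0 for _ in range(n_slots)]
```

Over many slots, the empirical arrival rate converges to the Bernoulli parameter, which is how the per-slice arrival rates in the notation table would be realized in simulation.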
On the other hand, with regard to resource configurations, tasks of each slice demand two types of resources (i.e., CPU and memory). To process a task of slice , we denote the CPU processing density (in cycles/bit) and memory (in Mbyte) demands as and , respectively. Therefore, even if tasks have the same QoS demands, their resource demands can be dissimilar [47]. One use case example of the delay-critical class is a live sports-streaming application requiring high throughput, while another from the same class would be an emergency signal for self-driving cars, which does not necessarily require high throughput. Thus, they are processed through different slices.
Furthermore, the total resource capacities (i.e., CPU and memory) of fog node and of the cloud server are , and , respectively, where the superscripts and indicate CPU speed (in cycles/s) and memory size (in Mbyte), respectively. We assume that the cloud server is much more computationally powerful than its associated fog nodes, i.e., , , and provides limitless storage, i.e., . The fog nodes and the cloud server allocate their resources on a resource unit basis. Hence, the total number of resource units that can be allocated by fog node to all slices can be computed as , where and stand for the number of allocatable units of computing and memory resources at fog node , respectively, and denotes the floor function. Likewise, and indicate the number of allocatable units of computing and memory resources at cloud server , and the total number of resource units of the cloud server is unlimited.
Thus, for a given node at time , the occupied resource units of all slices can be calculated as
(1) 
where is the ceiling function, since a minimum number of memory units greater than or equal to must be allocated to execute the task of slice . At every time slot , fog nodes only monitor their own available resources, which correspond to the total resources minus the sum of the resources already allocated to tasks of all slices. Hence, the available resource units at fog node and time can be measured as
(2) 
where and . Once the task processing is completed during the time slot, the finished task will be eliminated from the buffer and the resource allocated to this task will return to the available resource pool in the next time slot.
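The floor/ceiling unit accounting described above can be sketched as follows (function and parameter names are illustrative):

```python
import math

def total_units(capacity, unit_size):
    # a node can hand out resources only in whole units (floor of capacity/unit)
    return math.floor(capacity / unit_size)

def units_for_task(demand, unit_size):
    # a task must receive enough whole units to cover its demand (ceiling)
    return math.ceil(demand / unit_size)

def available_units(capacity, unit_size, allocated_demands):
    """Available units = total units minus the units held by in-progress tasks,
    mirroring Eqs. (1)-(2): occupied units use the ceiling per task, total
    units use the floor of the node capacity."""
    used = sum(units_for_task(d, unit_size) for d in allocated_demands)
    return total_units(capacity, unit_size) - used
```

When a task finishes, its demand is removed from `allocated_demands`, so its units return to the available pool in the next slot.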
IiD Calculation of task latency
In our architecture, at the beginning of each time slot , each fog node uses an independent offloading policy to decide whether to process arrived tasks locally or to offload them to another node, either a neighboring fog node or the cloud server. Furthermore, decisions on resource allocation are simultaneously made by fog nodes with regard to their own resources through separate allocation policies.
We define the task latency to enable different delay constraints for tasks, thereby minimizing timeout failures that result from the high transmission latency of offloading to a remote node or from long waiting delays due to insufficient resource capacities at the processing node. Formally, the task latency is the sum of the transmission delay, the waiting delay, and the processing delay. We assume that fog nodes have information about the distance to neighboring fog nodes in the fog network as well as to the cloud server. To model the transmission delay of offloading, tasks are transmitted to the selected node over a wireless channel. The transmission delay for fog node to forward the task of slice can then be defined as
(3) 
where if the selected node is a fog or cloud node, respectively, and indicates the total number of tasks offloaded by fog node at time slot . Moreover, represents the transmission rate from fog node to the selected node , which is given by [8]:
(4) 
where , , and are the distance between the two nodes, the path loss constant, and the path loss exponent, respectively. The variable denotes the transmission power of fog node and is the noise power spectral density. Additionally, the bandwidth is given by , which means that the total bandwidth of the fog node is equally shared by the offloaded tasks. For example, when a fog node offloads a total of two different tasks during a time slot, each task is transmitted with half of the total bandwidth over a separate channel. When , a fog node processes this task locally, and thus there is no transmission delay. Moreover, in most cases, the size of a task after processing is small; thus, the transmission delay of the result after processing can be ignored.
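Under the equal bandwidth sharing and distance-based path-loss model described above, a sketch of the rate and transmission-delay computation (all parameter values hypothetical) is:

```python
import math

def tx_rate_bps(total_bw_hz, n_offloaded, tx_power_w, dist_m,
                pl_const, pl_exp, noise_psd_w_per_hz):
    """Per-task transmission rate: the node's bandwidth is split equally among
    the tasks it offloads in the slot, and the channel gain follows a simple
    distance-based path-loss model, mirroring the structure of Eq. (4)."""
    w = total_bw_hz / n_offloaded                # equal bandwidth share
    gain = pl_const * dist_m ** (-pl_exp)        # path-loss channel gain
    snr = tx_power_w * gain / (noise_psd_w_per_hz * w)
    return w * math.log2(1.0 + snr)

def tx_delay_s(task_bits, rate_bps):
    # Eq. (3): transmission delay is the task size divided by the rate
    return task_bits / rate_bps
```

Note that offloading more tasks in the same slot shrinks each task's bandwidth share and hence its rate, which is the coupling captured in Eq. (3).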
Next, when the slice task arrives in the corresponding buffer at node , the waiting delay can be calculated as
(5) 
where is the number of tasks already in the buffer and is the service rate (i.e., the rate at which tasks leave the buffer). However, since a fog node does not have prior information about the buffer status of the other nodes when its offloaded tasks arrive in their buffers, and given that the service rate varies depending on the resource scheduling process, the waiting delay cannot be calculated in advance. On the other hand, we assume that the waiting time in the buffer of the cloud can be disregarded because the cloud is equipped with a much larger number of cores than a fog node. This indicates that the cloud initiates the computation of received tasks without queueing delay.
When a task is computed by fog nodes, the processing delay can be denoted as
(6) 
where refers to the number of CPU cycles required to complete the execution of a task of slice . When the task is offloaded to a fog node , the task of slice is executed with the CPU speed . Likewise, when the task is offloaded to the cloud server , the processing delay is formulated in the bottom equation of (6), where is the CPU speed of the cloud server. Thus, the processing delay depends on both the resource configuration of the task and the amount of allocated resources.
In essence, if a slice task is offloaded from a fog node to a neighboring fog node , the latency is obtained as
(7) 
If a slice task is computed locally by a fog node , the latency becomes
(8) 
Finally, if a slice task is offloaded from a fog node to the cloud server , the latency is
(9) 
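The three latency cases in (7)-(9) can be summarized in a small helper (names are illustrative):

```python
def task_latency(target, t_tx, t_wait, t_proc):
    """Total latency of a slice task, mirroring Eqs. (7)-(9): local execution
    skips transmission, and the cloud is assumed to have no queueing delay."""
    if target == "local":
        return t_wait + t_proc           # Eq. (8)
    if target == "neighbor_fog":
        return t_tx + t_wait + t_proc    # Eq. (7)
    if target == "cloud":
        return t_tx + t_proc             # Eq. (9)
    raise ValueError(f"unknown target: {target}")
```

This makes the tradeoff explicit: offloading adds transmission delay but may reduce waiting and processing delays if the remote node has more free resources.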
III Problem formulation
In this section, we define the problem of heterogeneous task offloading and resource allocation in a system with multiple fog nodes as a POMDP across the time horizon.
III-A Partially observable MDP-based problem formulation
The main goal of the system is to make an optimal offloading and resource allocation decision at each node with the objective of maximizing the number of successfully processed tasks while guaranteeing the corresponding delay constraint of each task. Therefore, the joint offloading and resource allocation decision is achieved by finding proper processing nodes for the tasks and an optimal allocation of each node's resources to all individual slices. We assume that the joint offloading and resource allocation decisions of all fog nodes are made simultaneously at every time slot . To this end, each node repeatedly observes its own system state at the beginning of each time slot. The local observation by fog node is defined as
(10) 
where , , and are the set of arrived tasks, the number of tasks in the buffer, and the number of tasks in progress among from all slices at the fog node and time , respectively. Moreover, is the available resource units at the fog node at time .
Note that the underlying states of the system, including the states of the other fog nodes, are not accessible to a fog node. Instead, only the aforementioned state set in (10) can be observed, and thus the system becomes a POMDP. We suppose that the observations are limited by the measurement accuracy of the state but are sufficient to provide usable state data for a POMDP system.
In the presence of uncertainties stemming from the task demands and resource availability at the fog nodes, we formulate the POMDP-based problem across the time horizon as a stochastic game in which each node selects actions as a function of its local observation. In our model, a fog node's offloading and resource allocation policy operates independently of the other nodes' policies. Thus, each fog node does not have any prior information on the task demands, buffer status, and resource availability of the other fog nodes. Accordingly, the actions are defined as
(11) 
where and denote the offloading decision and the resource allocation decision, respectively. represents by whom the task will be processed: if the slice task does not arrive at the fog node at time , if the slice task arrives and will be computed locally, and if the slice task arrives and will be offloaded to another node ( implies that the fog node will offload this task to the cloud server). The resource allocation decision represents how many tasks will be initiated by being allocated resources, where is the maximum number of tasks that can be simultaneously processed by fog node . For example, indicates that, at time , fog node allocates its resources to slice to execute two tasks that are not yet in progress in the buffer. Each node takes an action only among the ones allowed in its observation, i.e., . We apply the following constraints on the offloading and resource allocation at time :
(12) 
(13) 
(14) 
(15) 
to ensure that the fog node cannot offload a task that has not arrived (by (12)), cannot initiate more tasks than are waiting for allocation (by (13)), and cannot allocate more resources than are currently available (by (14) and (15)).
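The feasibility rules above can be sketched in code. This is a hedged illustration, assuming a simple encoding in which `arrived` flags a task arrival, `buffer_waiting` counts waiting tasks per slice, `demand_units` gives each slice's per-task resource demand, and the offloading decision is a string; none of these names come from the paper.

```python
def is_valid_action(arrived, buffer_waiting, avail_units, demand_units,
                    offload_decision, alloc_counts):
    """Check constraints (12)-(15) for one fog node at one time slot."""
    # (12): no offloading (or local) decision for a task that never arrived
    if offload_decision != "none" and not arrived:
        return False
    if offload_decision == "none" and arrived:
        return False
    # (13): cannot start more tasks than are waiting in each slice buffer
    for s, n in enumerate(alloc_counts):
        if n > buffer_waiting[s]:
            return False
    # (14)-(15): newly allocated resource units must fit the available units
    total = sum(n * demand_units[s] for s, n in enumerate(alloc_counts))
    return total <= avail_units
```

A learning agent would enumerate only the actions for which this check returns `True`.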
Given that each node is in state and action is chosen, the transition probability is given by (16), where denotes the set of actions occurring at time . From (16), and only depend on the action of fog node , while is determined regardless of the action. Since one node's offloading decisions increase the buffers of other nodes, the sequence of each node's buffer status depends on the actions of all agents .
(16)  
(17) 
Based on the set of actions in local observation , we define the local reward in (17), where is the maximum delay budget of a task in slice ; a task is discarded if its processing is not completed within this budget. The first term of the summation in (17) represents the success reward: a positive reward if the task is successfully completed and a negative reward if a timeout failure is encountered, which depends on both the offloading decision of the fog node where the task arrived and the resource allocation decision of the processing fog node. The second term describes the overflow cost, which captures whether the task is dropped because the slice buffer is already full; it is therefore related to the buffer status of the processing fog node . Moreover, is a constant weighting factor that balances the importance of overflow failures for tasks of slice .
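A minimal sketch of how such a per-task reward could be computed, assuming unit-magnitude success/timeout rewards (the actual magnitudes are not specified here) and a slice-dependent overflow weight `w`:

```python
def local_reward(completed, timed_out, overflowed, w):
    """Success term (+1 completed, -1 timeout) plus a weighted overflow cost.

    The +/-1 magnitudes and the flag-based interface are illustrative
    assumptions, not the paper's exact reward definition.
    """
    success_term = (1 if completed else 0) - (1 if timed_out else 0)
    overflow_term = -w * (1 if overflowed else 0)
    return success_term + overflow_term
```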
III-B Cooperative games by independent learners
Although each fog node's main goal is to optimize its own service performance and resource interests, the fog nodes must still coordinate the resource flows between neighboring nodes in order to achieve a meaningful solution from an overall system perspective [49]. In addition, the service performance experienced by service tasks during processing is determined by the offloading and resource allocation decisions of all fog nodes. Therefore, our stochastic game (sometimes called a Markov game) follows a cooperative setting that maximizes a common goal, rather than a competitive game in which the fog nodes have opposing goals [50, 51].
More precisely, we apply cooperative scenarios between independent fog nodes in which the nodes share their local rewards with the others as feedback information. This decision-making setup implies that independent fog nodes perceive the common reward associated with each joint action while estimating the value of their individual actions solely based on the rewards they receive for those actions. This reduces the feedback and memory overheads considerably compared to joint-action learning schemes, where the fog nodes share their reward, observation, and action sets with the others to maintain a model of the other agents' strategies. As such, at each time step, each node executes an individual action with the joint goal of maximizing the average reward of all nodes, which can be formally formulated as
(18) 
Thus, each node's reward is drawn from the same distribution, reflecting the value assessment of all nodes [53]. Moreover, the convergence performance of joint-action learning schemes is not enhanced dramatically by the availability of more information, owing to the exploration strategy [53]. As detailed in Section II, the reward feedback is transmitted through inter-SDN communications to the SDN controllers of all the fog nodes for the decision-making process. In summary, the decision-making process at each fog node is fully distributed for real-time task offloading and resource management, while communications between SDN controllers exchange the less time-sensitive reward information.
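The common feedback signal of (18) is then simply the average of the local rewards collected from all nodes; a trivial sketch, with the reward list as an assumed input:

```python
def shared_reward(local_rewards):
    """Average the local rewards of all fog nodes into the common signal."""
    return sum(local_rewards) / len(local_rewards)
```

Each node would receive this single scalar as its reward for the joint step, regardless of which action it took individually.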
IV Learning the optimal offloading and resource allocation policies
In this section, we propose a Q-learning-based optimal policy solution to address the limitations of the traditional approaches, and we discuss deep recurrent Q-networks (DRQNs), which can better approximate actual Q-values from sequences of observations, leading to better policies in a partially observable environment.
IV-A Optimal policy solution using Q-learning
In the case where the system has access to the transition probability functions and rewards for any state-action pair, the MDP can be solved through dynamic programming (DP) approaches to find the optimal control policy [55, 56]. However, in our case, the system cannot precisely predict the transition probability distributions and rewards. To address this limitation, reinforcement learning is employed, in which the lack of information is overcome by making observations from experience [28]. Among the different reinforcement learning techniques, Q-learning can find the optimal state-action values for any MDP without a model of the underlying system. Given the controlled system, the learning node repeatedly observes the current state , takes an action that incurs a transition, and then observes the new state and the reward . From these observations, it updates its estimate of the Q-function for state and action as follows:
(19)
where is the learning rate (between 0 and 1), balancing the weight of what has already been learned against the weight of the new observation, and is the discount factor (between 0 and 1). The most common action selection rule is the epsilon-greedy algorithm, which behaves greedily most of the time (greedy selection) and explores by selecting a random action with a small probability . The greedy selection and the random selection with probability are called exploitation and exploration, respectively. Non-optimal action selection during exploration can be uniform (as in epsilon-greedy) or biased by the magnitudes of the Q-values (as in Boltzmann exploration).
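The update rule (19) together with epsilon-greedy selection can be sketched as follows; the state/action encodings and the hyperparameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning with epsilon-greedy action selection."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, eps=0.1):
        self.q = defaultdict(float)  # Q[(state, action)], defaults to 0
        self.actions = actions
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def select(self, s):
        if random.random() < self.eps:               # exploration
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])  # exploitation

    def update(self, s, a, r, s_next):
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        td_error = r + self.gamma * best_next - self.q[(s, a)]
        self.q[(s, a)] += self.alpha * td_error
```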
Moreover, we discuss the computational complexity of the Q-learning algorithm. Q-learning requires storing a table of Q-values, i.e., one entry for every and . In our problem, the sizes of the local observation and local action spaces are and , respectively. When , , , , , and , one node has to update a total of Q-function values, which makes it impossible for the conventional Q-learning process to converge within a limited number of time steps. This problem is even more pronounced in multi-agent scenarios, where the number of joint actions grows exponentially with the number of agents in the system.
IV-B Convergence to equilibrium
An equilibrium is reached when the policy of a node is the optimal policy with respect to those of the other nodes. Recall from fictitious play [54] that the exploration strategy is required to be asymptotically myopic to ensure that a Nash equilibrium is reached in multi-agent RL strategies. An action selection rule is said to be asymptotically myopic if the loss from agent 's choice of actions at every given history goes to zero as proceeds [54]:
(20) 
as , where denotes the reward function. Therefore, the independent multi-agent Q-learning in cooperative systems will converge to an equilibrium almost surely when the following conditions are satisfied [52]:

1) The learning rate decreases over time such that .
2) Each node visits every action infinitely often.
3) The probability that node selects action is nonzero, .
4) The exploration strategy of each node is exploitative, such that , where is a random variable denoting that a non-optimal action was taken based on the estimated Q-values of node at time .
The first two conditions are required for convergence of Q-learning, while the third ensures that nodes explore with a positive probability at all times, which in turn guarantees the second condition. Last but not least, the fourth condition guarantees that agents exploit their knowledge as the number of time steps increases. In fact, the convergence of Q-learning does not depend on the particular exploration strategy used, which implies that there is no restriction on how actions are chosen as long as every action is visited infinitely often. However, effective exploration strategies encourage the long-run optimal equilibrium [52]. To this end, we propose an adjusted exploration-exploitation method in the next subsection.
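One standard learning-rate schedule that satisfies the decreasing-rate condition (the sum of rates diverges while the sum of their squares converges) is the per-pair harmonic rate 1/n(s, a); a small illustrative sketch, not taken from the paper:

```python
from collections import defaultdict

class VisitRate:
    """Learning rate 1/n(s,a): sum over n diverges, sum of squares converges."""

    def __init__(self):
        self.n = defaultdict(int)  # visit counts per (state, action) pair

    def alpha(self, s, a):
        self.n[(s, a)] += 1
        return 1.0 / self.n[(s, a)]
```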
IV-C Deep Q-learning with nonlinear transformation
(21) 
(22) 
To solve the scalability issues of Q-learning, we adopt Q-learning with a neural network, called a deep Q-network (DQN). The DQN embraces the advantages of deep neural networks (DNNs) to train the learning process at each node , thereby improving the learning speed and the performance of the Q-learning algorithm.
The Q-network can be trained by iteratively adjusting the weights to minimize a sequence of loss functions , where the loss function at time slot is defined in (21). Precisely, given a transition , the weights of the Q-network of node are updated in a way that minimizes the squared error loss between the current predicted Q-value of and the target Q-value of . The gradient of the loss function in (21) with respect to the weights is given by (22). Moreover, in the DQN algorithm, the experience replay technique is adopted as the training method to address the instability of the Q-network caused by the use of nonlinear approximation functions [29]. Hence, the transition experiences are stored in a replay buffer , where is the replay buffer capacity. Due to possible delays in the reward feedback between fog nodes, past transition experiences may need to wait in a temporary replay buffer until their rewards can be combined (as in (18)), after which they are transferred to the replay buffer as soon as they are ready. At each time step, instead of the most recent transition , a random minibatch of transitions from the replay memory is chosen by node to train the Q-network. Furthermore, every time steps, the network is duplicated to obtain a target network , which is used for generating the target Q-values for the following updates to .
In addition, to guarantee the convergence and accuracy of the neural network, we adopt an adjusted exploration-exploitation scheduling method. At the beginning of the process, the agent, following a standard epsilon-greedy algorithm, selects more random actions with a probability to encourage initial exploration. The exploration rate is then asymptotically decayed with until it reaches a certain minimum value, which is preserved until the last iteration. Since this minimum is a very small number, after the initial exploration phase most decisions take the action with the highest estimated value at the present iteration. However, this often leads to a suboptimal policy, owing to the exploitation of bad Q-value estimates learned during the early iterations and to insufficient exploration in cases with large state-action spaces. To deal with this problem, the adjusted exploration-exploitation method allows the agent to shift back into exploratory mode every time slots, where the starting exploration probability is decreased by a factor (between 0 and 1) at every renewal. Therefore, this method efficiently avoids the exploitation of incorrect actions by selecting better Q-value estimates as the learning progresses. The optimal control policy learning implementation using the DQN algorithm is illustrated in Fig. 2.
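The adjusted schedule can be sketched as follows; the multiplicative decay style and the parameter names are assumptions, while the renewal idea (periodic resets to a shrinking starting probability) mirrors the description above.

```python
class AdjustedEpsilon:
    """Epsilon decays toward a floor; every `renew_every` steps it resets to
    a starting value that itself shrinks by `renew_factor`."""

    def __init__(self, start=1.0, floor=0.01, decay=0.999,
                 renew_every=5000, renew_factor=0.9):
        self.start, self.floor, self.decay = start, floor, decay
        self.renew_every, self.renew_factor = renew_every, renew_factor
        self.eps, self.t = start, 0

    def step(self):
        self.t += 1
        if self.t % self.renew_every == 0:
            self.start *= self.renew_factor   # weaker exploration each renewal
            self.eps = self.start
        else:
            self.eps = max(self.floor, self.eps * self.decay)
        return self.eps
```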
IV-D Deep recurrent Q-learning for partial observability
Another problem is that estimating a Q-value from an immediate observation in DQN can be arbitrarily wrong, since the network states are only partially observed and hinge upon multiple users [58]. Any system that requires a memory of more than an immediate observation will appear to be non-Markovian, because the future system states depend on more than just the current input. This issue can be solved by allowing the agent to perform temporal integration of observations. The solution adopted in [29] stacks the last four observations in memory and feeds them to a convolutional neural network (called DCQN) instead of a single observation at a time. However, the DCQN takes a fixed-size vector as input (a stack of 4 observations in [29]), which limits its usage in situations that involve a sequence-type input with no predetermined size. To address this issue, we implement a DRQN, which replaces the DCQN's first fully connected layer with a recurrent layer. By utilizing a recurrent layer, the neural network is able to condition its output on the temporal pattern of observations by maintaining a hidden state that it computes at every iteration. The recurrent layer can feed the hidden state back into itself, and thus it can maintain internal states and aggregate observations.
However, during backpropagation, vanilla recurrent neural networks suffer from the vanishing gradient problem, which makes layers that receive a small gradient value stop learning, so the network may forget important information from the beginning of a sequence. To tackle this problem, we use a Gated Recurrent Unit (GRU) for the recurrent layer. Like long short-term memory (LSTM), the GRU was introduced as a solution to the short-term memory of vanilla recurrent neural networks [59]. The main concept of the GRU is a gating mechanism that can learn which information is relevant to keep or forget during training in the recurrent network. The GRU has two gates (reset and update) and is known to be computationally more efficient and faster than the LSTM, which consists of three gates (forget, input, and output) and a cell state, while achieving comparable performance [59, 60]. The proposed DRQN structure is illustrated in Fig. 3.
(23)
where and are the reset and update gates, respectively. With these, the recurrent network can learn how to use some of its units to selectively cancel out irrelevant information and protect the state. The sigmoid and tanh activation functions make these decisions by filtering the values of each state element (the sigmoid gates squash their inputs to values between 0 and 1).
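For intuition, a scalar toy version of the GRU update in (23): the reset gate r gates the previous state inside the candidate, and the update gate z interpolates between the old and candidate states. The scalar weights stand in for the learned matrices and are purely illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, w):
    """One GRU step with scalar input, state, and weights."""
    r = sigmoid(w["Wr"] * x + w["Ur"] * h_prev)            # reset gate
    z = sigmoid(w["Wz"] * x + w["Uz"] * h_prev)            # update gate
    h_cand = math.tanh(w["Wh"] * x + w["Uh"] * (r * h_prev))
    return (1.0 - z) * h_prev + z * h_cand                 # interpolated state
```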
Algorithm 1 details the procedure of the proposed learning algorithm at fog node . The neural network takes the state sequence as the input to the first convolutional layer. Since the valid action space depends on the current state value, we include a step in the action selection that sets the probability of invalid actions to zero and renormalizes the probabilities of the remaining actions so that they sum to 1.
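The masking-and-renormalization step can be sketched directly; `probs` and `valid` are assumed to be aligned lists over the action set, with at least one valid action.

```python
def mask_and_renormalize(probs, valid):
    """Zero out invalid actions' probabilities and renormalize the rest."""
    masked = [p if v else 0.0 for p, v in zip(probs, valid)]
    total = sum(masked)  # assumes at least one valid action, so total > 0
    return [p / total for p in masked]
```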
V Performance evaluation
In this section, we quantify the performance gains of the proposed DRQN-based learning algorithm for the heterogeneous task offloading and resource allocation problem in multi-fog networks using numerical experiments based on a Python/TensorFlow simulator. We used three different environments, equipped with an Intel(R) Core i7-7500U CPU @ 2.7GHz (64-bit OS), an Intel(R) Xeon(R) CPU E3-1225 v6 @ 3.3GHz (64-bit OS), and an AMD Ryzen Threadripper 1920X 12-core processor, respectively.
V-A Simulation settings
For our simulations, we consider a fog layer consisting of five fog nodes that are randomly distributed within a network area of . In addition, a total of three different slices are created on top of each fog node. Slice characteristics are customized by two levels of heterogeneity, namely the resource demand types and the delay constraints, which are summarized in Table II. As an example of the slice characteristics in Table II, a slice can be dedicated to Standard resource type services meeting the delay-critical constraint. To obtain realistic values for the processing capacities of fog nodes, we use the CPU processing densities and memory sizes from [47], which used real application data, including a YouTube video data set from [57]. For slice at fog node , the task arrivals follow a Bernoulli distribution with parameter (in tasks/slot) and the packet size is 5 bits. Additionally, the buffer size in each slice is 10, which means that a maximum of 10 slice tasks can stay in the buffer concurrently until processing terminates. The path loss constant and exponent are set to and 4, respectively. The bandwidth of each fog node is 1 MHz. The transmission power of the fog node is 20 dBm, while the noise power density is -174 dBm/Hz [8]. In regard to the resource capacity distribution at the fog nodes, the CPU speed of a fog node is randomly sampled from [5GHz, 6GHz, 7GHz, 8GHz, 9GHz, 10GHz], and the memory size of a fog node is randomly sampled from [2.4GB, 4GB, 8GB]. The allocation units of CPU and memory resources are 1 GHz and 400 MB, respectively.

Table II. Slice characteristics defined by resource demand type (e.g., Standard) and delay constraint (Critical, Sensitive, Tolerant).
To evaluate the performance of different neural network settings, three neural networks are considered to estimate the Q-values in our simulation. For all of them, the output layer is a fully connected layer of units, where represents the dimension of the action set. Additionally, the activation function of the output layer is a linear activation function, whose output corresponds to the predicted Q-values of all possible actions. These neural networks differ from each other in the input layer and the hidden layers, as detailed below.

DRQN: for the design of the deep recurrent Q-network, the input to the network consists of , where is the dimension of the state set and is the sequence size for the 1D convolutional network. The first hidden layer convolves 32 filters with a kernel size of 3 and applies a Rectified Linear Unit (ReLU). The second hidden layer convolves 64 filters with a kernel size of 3, again followed by a ReLU. This is followed by a recurrent layer, for which we use a GRU. The number of units in the GRU cell is 128 and the sequence length is 10. The final GRU state is followed by a fully connected layer with ReLU, which has 64 units.

DCQN: the deep convolutional Q-network is almost identical to the deep recurrent Q-network except for the recurrent hidden layer. The resulting activations from the second convolutional hidden layer are followed by two fully connected layers with ReLU; the first has 128 units and the second has 64 units.

DQN: we use four fully connected hidden layers consisting of 64, 128, 128, and 64 units with ReLU.
In all the experiments, we use the Adam optimizer with a learning rate of 0.001, and learning starts after iterations. A discount factor of 0.98 is used in the Q-learning update. The replay memory size is . The target network parameters are updated every time slots. We use a minibatch size of 32 transition samples per update. The renewal factor , the renewal rate , and the starting and minimum exploration probabilities are set to 0.9, 5000, 1, and 0.01, respectively.
For performance comparisons, the existing methods are simulated as baseline schemes. Given the large state and action spaces in the problem considered, we compare methods that are practicable using limited computational resources. Specifically, one baseline offloading method is used as follows:

Threshold Offloading with Nearest Node selection: the node offloads its tasks only if the buffer occupancy is above a certain threshold; we set the threshold to 0.8, which implies that a task is offloaded only if the buffer is more than 80% full. The node then selects the nearest neighboring node, aiming to minimize communication delay and energy consumption, an offloading strategy widely used in IoT and device-to-device communications.
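A sketch of this baseline, assuming a buffer occupancy check against the 0.8 threshold and a distance map to the neighbors; all names are illustrative.

```python
def baseline_offload(buffer_len, buffer_cap, neighbor_dist, threshold=0.8):
    """Offload to the nearest neighbor only when the buffer is > 80% full."""
    if buffer_len / buffer_cap <= threshold:
        return None                                   # process locally
    return min(neighbor_dist, key=neighbor_dist.get)  # nearest neighbor id
```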
On the other hand, two conventional resource allocation algorithms are simulated as baseline schemes, namely:

Round Robin (RR): this algorithm allows every slice that has tasks in the queue to take turns processing on a shared resource in a periodically repeated order.

Priority Queuing (PQ): this algorithm schedules the tasks following a priority-based model. Tasks are scheduled for processing from the head of a given queue only if all queues of higher priority are empty, where the priority is determined by the delay constraint.
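The two baseline allocators can be sketched as simple queue-selection rules; the queue layout (one list per slice) and the explicit priority order are assumptions.

```python
def round_robin(queues, start):
    """Return the next non-empty slice queue at or after `start`, cycling."""
    n = len(queues)
    for i in range(n):
        idx = (start + i) % n
        if queues[idx]:
            return idx
    return None

def priority_queuing(queues, priority_order):
    """Serve the highest-priority (most delay-critical) non-empty queue."""
    for idx in priority_order:
        if queues[idx]:
            return idx
    return None
```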
Table III. Simulation cases: the slice characteristics (resource demand type and delay constraint) assigned to fog slices 1-3 in cases 1-3.
For the different evaluation scenarios, we specifically assign three different cases of slice characteristics to analyze how the slices' different resource demands and delay priorities are interrelated. The three simulation cases, defined by the characteristics of the three slices, are summarized in Table III. Note that this evaluation can easily be extended by reconfiguring Table II and Table III to suit the needs of the fog network.
V-B Performance analysis
In this subsection, we evaluate the performance of the proposed algorithm by comparing the simulation results under different system parameters.
V-B1 Complexity analysis
In this subsection, the memory complexity and processing time of the proposed algorithm are investigated. The proposed DRQN-based learning algorithm described in Section IV-C requires storing a replay buffer which consists of the state , action , reward , next state , and the valid action space of the next state for the target Q-network. In a single transition experience, the state, action, and reward require storing a size- array of decimal values, a single integer, and a single float, respectively. Moreover, the valid action space of the next state is a size- array of binary values, where if is invalid in state and otherwise. Finally, using the parameters in Section IV-A, the proposed algorithm requires much less memory than the conventional Q-learning algorithm, i.e., approximately 2.8 GB compared to 7.24 TB. It is worth mentioning that we leverage a Python-based simulator in which an array as a whole is an object that stores the float data in consecutive regions of memory, and thus the memory size of an individual float is not explicit. Therefore, the compared memory usage corresponds to the data contained in the objects.
Furthermore, multi-node learning alleviates the overhead on the network infrastructure and improves the system response time compared to a centralized architecture. In our simulation, assuming each node is equipped with a single CPU, the processing time per iteration is around 0.04 s. Moreover, the proposed neural network model can be trained in parallel on multiple CPUs or GPUs to further improve the training time and memory usage.
V-B2 Convergence performance
In this experiment, we evaluate the convergence property of the different neural networks with the above parameter settings to confirm whether the proposed deep reinforcement learning-based algorithm can achieve stable performance. To quantify the impact of the task traffic status on the convergence performance, we implement two different average task arrival rates, categorized as normal ( = 0.6) and heavy ( = 0.8), where is the average task arrival rate per slice at the fog node. Since a uniform random policy runs for iterations at all nodes before learning starts, the total average reward value is not enhanced during this period, and thus we show the average total reward of the fog nodes after they start learning. Once each node starts learning its own state-action value function with its preassigned neural network, the total average reward increases and asymptotically converges after around iterations, as shown in Fig. 4. With regard to the average task arrival rate, when the nodes receive a smaller number of tasks per time slot, the average total reward value is larger over all simulation cases. The main reason is that the number of successfully processed tasks with the limited resources of fog nodes is higher when fewer tasks are waiting in the buffer, and the buffers are also less likely to overflow. Given the findings from this experiment, the proposed algorithm using DRQN achieves a greater total reward than DQN and DCQN. This result implies that DRQN controllers can handle partial observability by retaining some memory of previous observations, helping the nodes make better decisions.
V-B3 Performance under different characteristics of fog slices
This experiment mainly aims to demonstrate the performance in terms of the average processing success rate, the average overflow rate, and the average task latency under the different characteristics of fog slices shown in Table III. As in the previous experiment, two kinds of traffic demand, normal and heavy, are used for the evaluation. Fig. 5 and Fig. 6 illustrate the average success rate of tasks and the average overflow rate per fog node. From Fig. 5 (a) and Fig. 6 (a), it can be observed that the proposed scheme using DRQN achieves the highest average success rate. Moreover, as the requested traffic increases, a larger number of tasks fail to attain their delay performance requirements due to the lack of resources. Comparing the three cases, when the nodes are requested the same traffic rate but with high computation and memory demands (Case 1), the fog nodes are more likely to experience a lower success rate. This is because processing takes longer with limited resources, which also leads to failures of the delay requirements.
In Fig. 5 (b) and Fig. 6 (b), it is shown that, when implementing the baseline methods, extremely large numbers of tasks are dropped due to overloaded buffers. This is because the fog nodes always select the most proximate fog node to offload their tasks, so the same fog node can be selected by several neighbors and its buffers quickly fill up with tasks from multiple neighboring nodes. Furthermore, the resource allocation methods also affect the overflow performance. As mentioned, slices with large resource demands take more processing time than slices with small resource demands. Thus, when implementing the RR and PQ methods, there is unfairness in the allocation toward slices with small resource demands, which induces a high number of task drops from overloaded buffers. On the other hand, even though the slices of Case 2 comprise tasks with the same resource demands, their overflow rate is higher than that of Case 1 and Case 3. The difference in Case 2 is that slice 3 of the fog nodes is dedicated to delay-tolerant tasks, which can stay in the buffer until they exceed their large delay limit. Hence, the average processing success rate of Case 2 is higher due to the relatively generous delay limits, while the average overflow performance is worse. However, the proposed algorithm can offload tasks to different neighboring nodes depending on their buffer and resource status and can avoid unfairness among slices with different priorities in the resource allocation process, leading to an increased average success rate. Moreover, Fig. 5 and Fig. 6 show that the variances of the success and overflow rates, indicated by the error bars, vary from one algorithm to another. The error bars in these figures represent the largest value as the upper limit and the smallest value as the lower limit among all the nodes. For example, in Fig. 5 (a), the success rate of Case 2 using DRQN has a mean of 95.6% and varies between 95.3% and 96.1%, while the success rate of Case 2 using nearest node selection with PQ resource allocation has a mean of 62.1% and varies between 36.8% and 90.8%. We can clearly see that the variability of the task success rate between fog nodes is greater for the baseline methods than for the proposed algorithm, and the same trend is observed for the overflow rate. This result indicates that the proposed algorithm discourages selfish behavior in nodes and achieves a win-win cooperation between fog nodes by making rational offloading and resource allocation decisions. Fig. 7 illustrates the average task delay under the different cases. In contrast to the baseline methods, the proposed algorithm decreases the average task delay by selecting a neighboring fog node that minimizes the transmission delay as well as the waiting time in the buffers, and by allowing distinct resource allocation with respect to the characteristics of each slice.
V-B4 Performance under different task arrival rates
In this experiment, we consider how the task arrival rate impacts the average performance in terms of the average processing success rate, the average overflow rate, and the average task latency per slice. In this simulation, we set the slice characteristics of all fog nodes to Case 2. In Fig. 8 (a), as the task arrival rate increases, more tasks fail to be completed within their delay limits. Similar observations can be made from Fig. 8 (b), where more tasks are dropped as the task arrival rate increases from 0.5 to 0.9. The reason is that, as the task arrival rate increases, the waiting time becomes longer due to the larger number of tasks in the buffer, which means that tasks are more likely to miss their delay requirements or be dropped if they arrive when the buffer is full. Meanwhile, over the whole range of task arrival rates, the maximum success rate and minimum overflow rate are achieved by the proposed algorithm. Moreover, as shown in Fig. 8 (c), the DRQN-based algorithm has the lowest average delay in Case 2. These empirical results show that the temporal integration of observations by a recurrent network allows the nodes to coordinate their choices without knowing the explicit state and action sets of the others, which makes the proposed DRQN-based algorithm relatively robust to the dynamics of the partially observable environment. These results also demonstrate that intelligently distributing resources to slices with different delay constraints has a significant impact on the overall system performance.
VI Conclusion
In this paper, we devised a joint heterogeneous task offloading and resource allocation algorithm whose goal is to maximize the number of tasks completed within their delay constraints while minimizing the task drops caused by buffer overflows. The SDN-based fog network we consider has multiple fog nodes that coordinate to achieve the best overall network performance without knowing the explicit status of the other fog nodes. In the presence of uncertainties stemming from task demands and resource status, we formulated the problem as a partially observable stochastic game and applied cooperative multi-agent deep reinforcement learning with a global reward that maximizes the common goal of the nodes and stabilizes the convergence behavior. Further, we implemented a recurrent neural network to tackle the partial observability by maintaining internal states and aggregating temporal observations. The simulation results show that the proposed DRQN-based algorithm can achieve a higher average success rate and a lower average overflow rate than DQN and DCQN, as well as non-deep-learning-based baseline methods. In the future, we will extend the multi-agent learning to scenarios with agents in large-scale fog networks with differing reward functions.
References
 [1] A. Zaidi, Y. Hussain, M. Hogan, and C. Kuhlins, ”Cellular IoT evolution for industry digitalization,” Ericsson, White Paper, 2019. [Online]. Available: https://www.ericsson.com/en/reportsandpapers/whitepapers/cellulariotevolutionforindustrydigitalization
 [2] M. Peng, S. Yan, K. Zhang, and C. Wang, “Fog-computing-based radio access networks: Issues and challenges,” IEEE Network, vol. 30, no. 4, pp. 46–53, Jul. 2016.
 [3] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet of Things J., vol. 3, Dec. 2016, pp. 854–864.
 [4] Y. J. Ku, D. Y. Lin, C. F. Lee, P. J. Hsieh, H. Y. Wei, C. T. Chou, and A. C. Pang, “5G radio access network design with fog paradigm: Confluence of communications and computing,” IEEE Commun. Mag., Apr. 2017.
 [5] C. Mouradian, D. Naboulsi, S. Yangui, R. H. Glitho, M. J. Morrow, and P. A. Polakos, “A comprehensive survey on fog computing: State-of-the-art and research challenges,” IEEE Commun. Surveys Tuts., vol. 20, no. 1, pp. 416–464, Nov. 2017.
 [6] A. V. Dastjerdi and R. Buyya, Internet of Things: Principles and Paradigms, Elsevier, 2016. [E-book]
 [7] A. A. Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, and M. Ayyash, “Internet of things: A survey on enabling technologies, protocols, and applications,” IEEE Commun. Surveys Tuts., vol. 17, no. 4, 2015.
 [8] G. Lee, W. Saad, and M. Bennis, “An online optimization framework for distributed fog network formation with minimal latency,” arXiv preprint arXiv:1710.05239, Oct. 2017.
 [9] X. Xu, S. Fu, Q. Cai, W. Tian, W. Liu, W. Dou, X. Sun, and A. Liu, “Dynamic resource allocation for load balancing in fog environment,” Wireless Communications and Mobile Computing, pp. 1–15, 2018.
 [10] W. Zhang, Z. Zhang, and H. Chao, “Cooperative fog computing for dealing with big data in the Internet of Vehicles: Architecture and hierarchical resource management,” IEEE Commun. Mag., vol. 55, no. 12, pp. 60–67, 2017.
 [11] M. Aazam, S. Zeadally, and K. A. Harras, “Offloading in fog computing for IoT: Review, enabling technologies, and research opportunities,” Future Generation Comput. Syst., vol. 87, pp. 278–289, Oct. 2018.
 [12] P. Mach and Z. Becvar, “Mobile edge computing: A survey on architecture and computation offloading,” IEEE Commun. Surveys Tuts., vol. 19, no. 3, 2017.
 [13] L. Yu, L. Chen, Z. Cai, H. Shen, Y. Liang, and Y. Pan, “Stochastic load balancing for virtual resource management in datacenters,” IEEE Trans. Cloud Comput., Nov. 2014.
 [14] J. Y. Baek, G. Kaddoum, S. Garg, K. Kaur, and V. Gravel, “Managing fog networks using reinforcement learning based load balancing algorithm,” in Proc. 2019 IEEE Wireless Communications and Networking Conference (WCNC), Apr. 2019.
 [15] A. Yousefpour, G. Ishigaki, R. Gour, and J. P. Jue, “On reducing IoT service delay via fog offloading,” IEEE Internet of Things J., vol. 5, no. 2, pp. 998–1010, Apr. 2018.
 [16] H. Zhang, Y. Xiao, S. Bu, D. Niyato, F. R. Yu, and Z. Han, “Computing resource allocation in three-tier IoT fog networks: A joint optimization approach combining Stackelberg game and matching,” IEEE Internet of Things J., vol. 4, no. 5, pp. 1204–1215, Oct. 2017.
 [17] C. Wang, C. Liang, F. R. Yu, Q. Chen, and L. Tang, “Computation offloading and resource allocation in wireless cellular networks with mobile edge computing,” IEEE Trans. Wireless Commun., vol. 16, no. 8, pp. 4924–4938, Aug. 2017.
 [18] H. A. Alameddine, S. Sharafeddine, S. Sebbah, S. Ayoubi, and C. Assi, “Dynamic task offloading and scheduling for low-latency IoT services in multi-access edge computing,” IEEE J. Sel. Areas Commun., vol. 37, no. 3, pp. 668–682, Mar. 2019.
 [19] L. Li, Q. Guan, L. Jin, and M. Guo, “Resource allocation and task offloading for heterogeneous real-time tasks with uncertain duration time in a fog queueing system,” IEEE Access, vol. 7, pp. 9912–9925, 2019.
 [20] Z.-Q. Luo and W. Yu, “An introduction to convex optimization for communications and signal processing,” IEEE J. Sel. Areas Commun., vol. 24, no. 8, pp. 1426–1438, Aug. 2006.
 [21] S. K. Mishra, D. Puthal, J. J. P. C. Rodrigues, B. Sahoo, and E. Dutkiewicz, “Sustainable service allocation using metaheuristic technique in fog server for industrial applications,” IEEE Trans. Ind. Informat., vol. 14, no. 10, Oct. 2018.
 [22] C. W. Tsai and J. J. Rodrigues, “Metaheuristic scheduling for cloud: A survey,” IEEE Syst. J., vol. 8, no. 1, pp. 279–291, Mar. 2014.
 [23] N. Bergmann, Y. Y. Chung, and X. Yang, “Using swarm intelligence to optimize the energy consumption for distributed systems,” Mod. Appl. Sci., vol. 7, no. 6, pp. 59–66, 2013.
 [24] D. Zhang, F. Haider, M. St-Hilaire, and C. Makaya, “Model and algorithms for the planning of fog computing networks,” IEEE Internet of Things J., vol. 6, no. 2, Apr. 2019.
 [25] J. Wang, C. Jiang, H. Zhang, Y. Ren, K. C. Chen, and L. Hanzo, “Thirty years of machine learning: The road to Pareto-optimal wireless networks,” IEEE Commun. Surveys Tuts., early access, Jan. 13, 2020.
 [26] C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile and wireless networking: A survey,” IEEE Commun. Surveys Tuts., Mar. 2019.
 [27] Y. Sun, M. Peng, Y. Zhou, Y. Huang, and S. Mao, “Application of machine learning in wireless networks: Key techniques and open issues,” IEEE Commun. Surveys Tuts., vol. 21, no. 4, pp. 3072–3108, 2019.
 [28] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
 [29] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Humanlevel control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
 [30] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y. C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Commun. Surveys Tuts., vol. 21, no. 4, pp. 3133–3174, May 2019.
 [31] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, “Performance optimization in mobile-edge computing via deep reinforcement learning,” arXiv preprint arXiv:1804.00514, Mar. 2018.
 [32] S. Pan, Z. Zhang, Z. Zhang, and D. Zeng, “Dependency-aware computation offloading in mobile edge computing: A reinforcement learning approach,” IEEE Access, vol. 7, Sep. 2019.
 [33] N. V. Huynh, D. T. Hoang, D. N. Nguyen, and E. Dutkiewicz, “Optimal and fast real-time resource slicing with deep dueling neural networks,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, Jun. 2019.
 [34] X. Chen, Z. Zhao, C. Wu, M. Bennis, H. Liu, Y. Ji, and H. Zhang, “Multi-tenant cross-slice resource orchestration: A deep reinforcement learning approach,” IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2377–2392, Aug. 2019.
 [35] Y. Sun, M. Peng, and S. Mao, “Deep reinforcement learning based mode selection and resource management for green fog radio access networks,” IEEE Internet of Things J., vol. 6, no. 2, pp. 1960–1971, Apr. 2019.
 [36] A. Sadeghi, G. Wang, and G. B. Giannakis, “Deep reinforcement learning for adaptive caching in hierarchical content delivery networks,” IEEE Trans. Cogn. Commun. Netw., vol. 5, no. 4, Dec. 2019.
 [37] Y. Sun, M. Peng, and S. Mao, “A game-theoretic approach to cache and radio resource management in fog radio access networks,” IEEE Trans. Veh. Technol., vol. 68, Oct. 2019.
 [38] Y. He, C. Liang, F. R. Yu, N. Zhao, and H. Yin, “Deep-reinforcement-learning-based optimization for cache-enabled opportunistic interference alignment wireless networks,” IEEE Trans. Veh. Technol., vol. 66, no. 11, Nov. 2017.
 [39] K. N. Doan, M. Vaezi, W. Shin, H. V. Poor, H. Shin, and T. Q. S. Quek, “Power allocation in cache-aided NOMA systems: Optimization and deep reinforcement learning approaches,” IEEE Trans. Commun., doi: 10.1109/TCOMM.2019.2947418.
 [40] R. Jain and S. Paul, “Network virtualization and software defined networking for cloud computing: A survey,” IEEE Commun. Mag., vol. 51, no. 11, Nov. 2013, pp. 24–31.
 [41] N. McKeown, “Software-defined networking,” INFOCOM Keynote Talk, vol. 17, no. 2, pp. 30–32, 2009.
 [42] K. Mahmood, A. Chilwan, O. Østerbø, and M. Jarschel, “Modelling of OpenFlow-based software-defined networks: The multiple node case,” IET Netw., vol. 4, no. 5, pp. 278–284, 2015.
 [43] S. Tomovic, K. Yoshigoe, I. Maljevic, and I. Radusinovic, “Software-defined fog network architecture for IoT,” Wireless Personal Commun., Springer, 2016.
 [44] D. Gupta and R. Jahan, “Inter-SDN controller communication: Using border gateway protocol,” White Paper by Tata Consultancy Services (TCS), Jun. 2014.
 [45] X. Hou, W. Muqing, L. Bo, and L. Yifeng, “Multi-controller deployment algorithm in hierarchical architecture for SD-WAN,” IEEE Access, vol. 7, pp. 65839–65851, 2019.
 [46] P. Wang, S. C. Lin, and M. Luo, “A framework for QoS-aware traffic classification using semi-supervised machine learning in SDNs,” in Proc. IEEE Int. Conf. on Services Computing (SCC), CA, USA, Jun. 2016, pp. 760–765.
 [47] J. Kwak, O. Choi, and S. Chong, “Processor-network speed scaling for energy–delay tradeoff in smartphone applications,” IEEE/ACM Trans. Netw., vol. 24, no. 3, Jun. 2016.
 [48] M. Verma, M. Bhardawaj, and A. K. Yadav, “An architecture for load balancing techniques for fog computing environment,” Int. J. Computer Sci. and Commun., vol. 6, no. 2, pp. 269–274, Apr./Sep. 2015.
 [49] L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Trans. Syst., Man, Cybern. C, vol. 38, no. 2, pp. 156–172, 2008.
 [50] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” arXiv preprint arXiv:1706.02275, 2017.
 [51] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” in Proc. AAAI Fall Symp. Series, 2015.
 [52] C. Claus and C. Boutilier, “The dynamics of reinforcement learning in cooperative multiagent systems,” in Proc. 15th Nat. Conf. Artif. Intell. / 10th Conf. Innov. Appl. Artif. Intell. (AAAI/IAAI-98), Madison, WI, Jul. 1998, pp. 746–752.
 [53] S. Kapetanakis and D. Kudenko, “Reinforcement learning of coordination in cooperative multiagent systems,” in Proc. 18th Nat. Conf. Artif. Intell. / 14th Conf. Innov. Appl. Artif. Intell. (AAAI/IAAI-02), Menlo Park, CA, Jul. 2002, pp. 326–331.
 [54] D. Fudenberg and D. M. Kreps, Lectures on Learning and Equilibrium in Strategic Form Games, CORE Foundation, Louvain-la-Neuve, Belgium, 1992.
 [55] R. E. Bellman, Dynamic Programming, Princeton, NJ: Princeton University Press, 1957.
 [56] M. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, New York, NY: John Wiley & Sons, Inc., 1994.
 [57] Dataset for Statistics and Social Network of YouTube Videos. [Online]. Available: http://netsg.cs.sfu.ca/youtubedata/
 [58] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep decentralized multi-task multi-agent reinforcement learning under partial observability,” in Proc. Int. Conf. Mach. Learn. (ICML), 2017, pp. 2681–2690.
 [59] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” CoRR, 2014. [Online]. Available: http://arxiv.org/abs/1412.3555
 [60] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical exploration of recurrent network architectures,” in Proc. 32nd Int. Conf. on Machine Learning, vol. 37, Jul. 2015, pp. 2342–2350.