The Internet of Things (IoT) connects a huge number of devices to the Internet, which generate massive amounts of sensory data reflecting the status of the physical world. These data can be processed and analyzed with machine learning techniques, with the objective of making informed decisions to control the reactions of IoT devices to the physical world. In other words, IoT devices become autonomous with ambient intelligence by integrating IoT, machine learning and autonomous control. For example, smart thermostats can learn to autonomously control central heating systems based on the presence of users and their routines. IoT and autonomous control systems (ACS) were originally independent concepts, and the realization of one does not necessarily require the other. The concept of autonomous IoT (AIoT) was proposed as the next wave of IoT that can explore its future potential.
The AIoT systems provide a dynamic and interactive environment with a number of AIoT devices, which sense the environment and make control decisions to react. As shown in Fig. 1, an AIoT system typically includes a physical system where AIoT devices with sensors and actuators are deployed. The IoT devices are usually connected by wireless networks to an access point (AP) such as a mobile base station (BS), which acts as a gateway to the Internet where cloud servers are deployed. Moreover, edge/fog servers, with limited data processing and storage capabilities compared to the cloud servers, may be deployed at the APs. After the IoT devices acquire data from sensors that represent the full or partial status of the physical system, they need to process the data and generate control decisions for the actuators to react. The data processing tasks can be executed locally at the IoT devices, at the edge/fog servers, or at the cloud servers.
Reinforcement learning (RL) introduces ambient intelligence into AIoT systems by providing a class of solution methods to the closed-loop problem of processing the sensory data to generate control decisions to react. Specifically, the agents interact with the environment to learn optimal policies that map status or states to actions. The learning agent must be able to sense the current state of the environment to some extent (e.g., sensing room temperature) and take the corresponding action (e.g., turning the thermostat on or off) to affect the new state and the immediate reward so that a long-term reward over an extended time period is maximized (e.g., keeping room temperature at a target value). Different from most forms of machine learning, e.g., supervised learning, the learner is not told which actions to take but must discover which actions yield the most long-term reward by trying them out.
While RL has been successfully applied to a variety of domains, it confronts a main challenge when tackling problems with real-world complexity, i.e., the agents must efficiently represent the state of the environment from high-dimensional sensory data, and use this information to learn optimal policies. Therefore, deep reinforcement learning (DRL), a combination of RL with deep learning (DL), has been developed to overcome the challenge. One of the most famous applications of DRL is AlphaGo, the first computer program to beat a human professional Go player on a full-sized board.
It turns out that the formulation of RL/DRL models for real-world AIoT systems is not as straightforward as it may appear to be. There are two types of entities in an RL/DRL model as discussed above - environment and agent. Firstly, the environment in RL/DRL can be restricted to reflect only the physical system, or be extended to include the wireless networks, the edge/fog servers and cloud servers as well. This is because the network and computation performance, such as communication/computation delay, power consumption and network reliability, has an important impact on the control performance of the physical system. Therefore, the control actions in RL/DRL can be divided into two levels: (physical system) actuator control and (communications/computation) resource control, as shown in Fig. 1. The two levels of control can be learned and optimized separately or jointly. Secondly, the agent in RL is a logical concept that makes decisions on action selection. In AIoT systems, the agent with ambient intelligence can reside in IoT devices, edge/fog servers, and/or cloud servers as shown in Fig. 1. The time sensitivity of the IoT application is an important factor in determining the location of the agents. For example, in autonomous driving, images from an autonomous vehicle’s camera need to be processed in real time to avoid an accident. In this case, the agent should reside locally in the vehicle to make fast decisions, instead of transmitting the sensory data to the cloud and returning the predictions to the vehicle. However, there are many scenarios in which it is not easy to determine the optimal locations for the agents, which may involve solving an RL problem in itself. Moreover, when there are multiple agents distributed in the IoT devices, the cooperation of the agents is also an important and challenging issue.
Although AIoT is a relatively new concept, related research works already exist in IoT and ACS, respectively. In this paper, we review the state-of-the-art research, and identify the model and challenges for the application of DRL in AIoT. Although several recent articles discuss the application of machine learning in general IoT systems [6, 3], this paper focuses on a specific type of machine learning - reinforcement learning, and its application to a promising type of IoT system - AIoT.
The remainder of the paper is organized as follows. In Section II, we review the RL/DRL methodologies. Section III introduces a general model for RL in AIoT with a detailed discussion on the key elements. In Section IV, the existing works will be surveyed according to the different IoT applications. The challenges and open issues are identified and highlighted in Section V. Finally, the conclusion is given in Section VI.
II Overview of Deep Reinforcement Learning
In this section, we first introduce the basic concepts of RL and DL, based on which DRL is developed. Then, we classify the DRL algorithms into two broad categories, i.e., value-based and policy gradient methods, and show that different elements in RL are approximated by deep neural networks. Finally, we introduce some advanced DRL techniques that are envisioned to be extremely useful in addressing the open issues in AIoT, which will be discussed in Section V.
II-A Building Block of DRL - RL
Generally, RL is a class of machine learning algorithms that can achieve optimal control of a Markov Decision Process (MDP). As discussed in Section I, there are generally two entities in RL as shown in Fig. 2 - an agent and an environment. The environment evolves over time in a stochastic manner and may be in one of the states within a state space at any point in time. The agent acts as the action executor and interacts with the environment. When it performs an action in a certain state, the environment generates a reward as a signal of positive or negative behaviour. Moreover, the action also affects the next state that the environment transitions to. The stochastic evolution of the state-action pair over time forms an MDP, which consists of the following elements:
state $s \in \mathcal{S}$, which is used to represent a specific status of the environment in a possible state space $\mathcal{S}$. In an MDP, the state comprises all the necessary information of the environment for the agent to choose the optimal action from the action space.
action $a \in \mathcal{A}$, which is chosen by the agent from an action space $\mathcal{A}$ in a specific state $s$. An RL agent interacts with the environment and learns how to behave in different states by observing the consequences of its actions.
reward $r(s, a)$, which is generated when the agent takes a certain action $a$ in a state $s$. The reward indicates the intrinsic desirability of an action in a certain state.
transition probability $P(s' \mid s, a)$, which is the conditional probability that the next state of the system will be $s'$ given the current state $s$ and action $a$. In model-based RL, this transition probability is considered to be known by the agent, while the agent does not require this information in model-free RL.
A policy determines how an agent selects actions in different states, and can be categorized as either a stochastic policy or a deterministic policy. In the stochastic case, the policy is described by $\pi(a \mid s)$, which denotes the probability that action $a$ is chosen in state $s$. In the deterministic case, the policy is described by $a = \pi(s)$, which denotes the action that must be chosen in state $s$.
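For simplicity of introduction, we focus on the discrete time model, where the agent interacts with the environment at each of a sequence of discrete time steps $t = 0, 1, 2, \ldots$. The goal of the agent is to learn how to map states to actions, i.e., to find a policy $\pi$ that optimizes the value function $V^{\pi}(s)$ for any state $s$. The value function is the expected reward when policy $\pi$ is followed from an initial state $s_0 = s$, i.e.,
$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ r(\tau) \mid s_0 = s \right], \tag{1}$$
where $\tau = \{(s_t, a_t, r_t)\}_{t=0}^{T}$ is a trajectory or sequence of triplets, with $a_t \sim \pi(\cdot \mid s_t)$ or $a_t = \pi(s_t)$, and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. $r(\tau)$ can be the total reward, discounted total reward, or average reward of trajectory $\tau$, where $T$ is the terminal time step, which can be $\infty$.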
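Apart from the value function, another important function is the Q function $Q^{\pi}(s, a)$, which is the expected reward for taking action $a$ in state $s$ and thereafter following policy $\pi$. When $\pi$ is the optimal policy $\pi^{*}$, the value function and Q function are denoted by $V^{*}(s)$ and $Q^{*}(s, a)$, respectively. Note that $V^{*}(s) = \max_{a} Q^{*}(s, a)$. If the Q functions $Q^{*}(s, a)$, $a \in \mathcal{A}$, are given, the optimal policy can be easily found by $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$.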
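In order to learn the value functions or Q functions, the Bellman optimality equations are usually used. Taking the discounted MDP with a discount factor of $\gamma$ for example, the Bellman optimality equations for the value function and Q function are
$$V^{*}(s) = \max_{a} \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma V^{*}(s') \right], \tag{2}$$
$$Q^{*}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[ r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \right]. \tag{3}$$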
Bellman equations represent the relation between the value/Q functions of the current state and the next state. For example, it can be inferred from (3) that the expected reward equals the sum of the immediate reward and the maximum expected reward thereafter. Once the future expected reward is obtained, the expected reward from the current state can be calculated. Bellman equations are the basis of an important class of RL algorithms using the “bootstrap” method, such as Q-learning and Temporal-Difference (TD) learning. During the learning process, the agent first initializes the value/Q functions to some random values. Then, it iteratively repeats the policy prediction and policy evaluation phases until the value/Q functions converge. In the policy prediction phase, the agent chooses an action according to the current value/Q functions, which results in an immediate reward and a new state. In the policy evaluation phase, it updates the value/Q functions according to the Bellman equations (2) or (3) given the immediate reward and the new state.
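The bootstrap loop described above can be sketched with tabular Q-learning. This is only an illustrative sketch: the four-state chain environment `chain_step`, the function name `q_learning`, and all hyper-parameter values are hypothetical choices made for the example, not taken from the surveyed works.

```python
import random

def q_learning(n_states, n_actions, step, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning: alternate action selection and Bellman-style updates."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Policy prediction: mostly greedy, sometimes random (ties broken randomly).
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(q[s])
                a = rng.choice([x for x in range(n_actions) if q[s][x] == best])
            s_next, r, done = step(s, a)
            # Policy evaluation: move Q(s, a) towards r + gamma * max_a' Q(s', a').
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q

def chain_step(s, a):
    """Hypothetical chain 0-1-2-3: action 1 moves right, action 0 moves left.
    Reaching state 3 ends the episode with reward 1."""
    s_next = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    done = s_next == 3
    return s_next, (1.0 if done else 0.0), done

q_table = q_learning(4, 2, chain_step)
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(3)]
```

In this toy chain the learned greedy policy moves right in every non-terminal state, which is the optimal behaviour for this particular reward design.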
In the policy prediction phase, instead of always selecting the greedy action that maximizes the current value/Q functions, a “soft” policy such as $\epsilon$-greedy, $\epsilon$-soft, or softmax is usually used to explore the environment, seeking the potential to learn a better policy. Moreover, according to the method adopted in the policy evaluation phase, RL algorithms can be either on-policy or off-policy, depending on whether the value/Q functions of the predicted policy or of a hypothetical (e.g., greedy) policy are estimated.
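To make the distinction concrete, the sketch below shows the action probabilities of an epsilon-greedy policy and contrasts an on-policy target (as in SARSA, which evaluates the action actually selected next) with an off-policy target (as in Q-learning, which evaluates the greedy action). The function names are hypothetical and the numbers are arbitrary.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy for one state."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n          # every action gets a share of epsilon
    probs[best] += 1.0 - epsilon       # the greedy action gets the remainder
    return probs

def on_policy_target(reward, gamma, q_next, a_next):
    """SARSA-style target: uses the action the soft policy actually chose."""
    return reward + gamma * q_next[a_next]

def off_policy_target(reward, gamma, q_next):
    """Q-learning-style target: uses the hypothetical greedy action."""
    return reward + gamma * max(q_next)

q_next = [1.0, 3.0, 2.0]
probs = epsilon_greedy_probs(q_next, epsilon=0.3)
sarsa_y = on_policy_target(1.0, 0.9, q_next, a_next=2)
qlearn_y = off_policy_target(1.0, 0.9, q_next)
```

The two targets differ whenever the soft policy explores, which is exactly when the estimated policy and the behaviour policy diverge.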
Usually, a large amount of memory is required to store the value functions and Q functions. In some cases when only small, finite state sets are involved, it is possible to store these in the form of tables or arrays. This method is called the tabular method.
However, in most real-world problems, the state sets are large, sometimes infinite, which makes it impossible to store the value functions or Q functions in the form of tables. Therefore, learning through trial-and-error interaction with the environment becomes hard due to the formidable computational complexity and storage requirements. Even if learning is possible, huge computing resources will be spent on it. This is where DL comes into the picture - some functions of RL such as Q functions or policy functions are approximated with a smaller set of parameters by the application of DL. The combination of RL and DL results in the more powerful DRL.
II-B Building Block of DRL - DL
Deep learning refers to a subset of machine learning algorithms and techniques that leverage artificial neural networks (ANN) to learn from large amounts of data in an autonomous way. It performs well in tasks like regression and classification. A regression task deals with predicting a continuous value, while a classification task predicts the output from a finite set of categorical values. Given input data $x$ and output data $y$, neural network (NN) models can be viewed as mathematical models defining a function $f: x \mapsto y$ or a distribution over $x$ or both $x$ and $y$. The learning rule of an NN modifies its parameters $\theta$ so that, for a given input $x$, the network produces an output $\hat{y}$ that best approximates the target output data.
A general feedforward neural network (NN), as shown in Fig. 3, is constructed from an input layer, one or more hidden layers and an output layer. Each layer consists of one or multiple neurons, which represent different non-linear nodes in the model. As illustrated in Fig. 3, neuron $j$ in layer $l$ has a vector of weights $w_{j}^{l}$ for the connections from layer $l-1$ to itself and a bias value $b_{j}^{l}$. It also has an activation function $f(\cdot)$ such as Sigmoid, Logistic, Tanh and ReLU. The output of neuron $j$ in layer $l$ equals $o_{j}^{l} = f\left( w_{j}^{l} \cdot o^{l-1} + b_{j}^{l} \right)$, where $o^{l-1}$ is the vector of outputs from neurons in layer $l-1$. The typical parameters of an NN are the weights and bias of every node. Any NN with two or more hidden layers can be called a Deep Neural Network (DNN).
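As a small illustration of the neuron computation just described (weighted sum of the previous layer's outputs, plus bias, passed through an activation), the following sketch evaluates one hidden layer and one output layer. The weights, biases and function names are made up for the example.

```python
import math

def relu(z):
    return max(0.0, z)

def layer_forward(weights, biases, inputs, activation):
    """Output of every neuron in a layer: activation(w_j . inputs + b_j)."""
    return [
        activation(sum(w * x for w, x in zip(w_j, inputs)) + b_j)
        for w_j, b_j in zip(weights, biases)
    ]

# Hypothetical network: 2 inputs -> 2 hidden ReLU neurons -> 1 tanh output neuron.
x = [2.0, 1.0]
hidden = layer_forward([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], x, relu)
output = layer_forward([[1.0, 2.0]], [0.5], hidden, math.tanh)
```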
A feedforward network has no notion of order in time, and the only input it considers is the current input data it has been exposed to. A recurrent neural network (RNN) is a special type of NN that can process sequences of inputs by using internal memory. In an RNN, the prior output of the neurons in the hidden state can be used as input along with the current input data, which enables the network to learn from history. The basic architecture of an RNN is illustrated in Fig. 4. At each time step $t$, the input $x_t$ of the RNN is propagated through the feedforward NN. First, it is multiplied by the weight matrix $W_x$. Meanwhile, the hidden state $h_{t-1}$ of the previous time step is multiplied by another weight matrix $W_h$. Then, those two parts are added together and passed through the activation function of the neuron. A special RNN architecture called Long Short-Term Memory (LSTM) is widely used. LSTM is designed to mitigate shortcomings of the basic RNN, i.e., vanishing gradients, exploding gradients and difficulty in learning long-term dependencies.
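The recurrent step described above can be sketched in a few lines. The names `w_x`, `w_h` and `bias`, the tanh activation, and the one-dimensional example weights are assumptions made for illustration only.

```python
import math

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(w_x, w_h, bias, x_t, h_prev):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    from_input = matvec(w_x, x_t)
    from_memory = matvec(w_h, h_prev)
    return [math.tanh(a + b + c) for a, b, c in zip(from_input, from_memory, bias)]

def run_rnn(w_x, w_h, bias, inputs, h0):
    """Feed a whole input sequence through the recurrent cell."""
    h = h0
    for x_t in inputs:
        h = rnn_step(w_x, w_h, bias, x_t, h)
    return h

# Hypothetical 1-dimensional cell: the same inputs in a different order
# give a different final hidden state, i.e. the network is order-aware.
w_x, w_h, bias = [[1.0]], [[0.5]], [0.0]
h_forward = run_rnn(w_x, w_h, bias, [[1.0], [0.0]], [0.0])
h_reverse = run_rnn(w_x, w_h, bias, [[0.0], [1.0]], [0.0])
```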
Usually, a loss function $L(\hat{y}, y)$ is used in deep learning, which is a function of the output $\hat{y}$ from the NN and the target output $y$. The loss function evaluates how well a specific NN, along with its currently learned parameter values, models the given data. Loss functions can be classified into two major categories depending on the type of learning task - regression losses and classification losses. Common regression loss functions include Mean Square Error (MSE), Mean Absolute Error (MAE) and Mean Bias Error (MBE). Common classification loss functions include Support Vector Machine (SVM) loss and cross-entropy loss.
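For reference, the losses named above can be written out directly. This is a minimal sketch; the sign convention of MBE (prediction minus target) and the single-sample form of the cross-entropy loss are assumptions, since conventions vary between texts.

```python
import math

def mse(y_hat, y):
    """Mean Square Error."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

def mae(y_hat, y):
    """Mean Absolute Error."""
    return sum(abs(p - t) for p, t in zip(y_hat, y)) / len(y)

def mbe(y_hat, y):
    """Mean Bias Error (assumed sign: prediction minus target)."""
    return sum(p - t for p, t in zip(y_hat, y)) / len(y)

def cross_entropy(class_probs, true_class):
    """Cross-entropy loss for one sample given predicted class probabilities."""
    return -math.log(class_probs[true_class])

predictions, targets = [2.0, 4.0], [1.0, 6.0]
losses = (mse(predictions, targets), mae(predictions, targets), mbe(predictions, targets))
```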
The objective of the NN is to minimize the loss function, i.e., $\min_{\theta} L(\theta)$. For this purpose, the parameters in NNs are updated by a method called gradient descent. Given a function $L(\theta)$, the simple gradient $\nabla_{\theta} L(\theta)$ is usually used to update the parameters. The gradient descent method starts from an initial point $\theta_0$. As a mini-batch of input data is fed to the NN, the average loss over all input data in the mini-batch is derived, and used to move towards the minimum of $L(\theta)$ by taking a step along the descent direction, i.e.,
$$\theta_{k+1} = \theta_{k} - \alpha \nabla_{\theta} L(\theta_{k}), \tag{4}$$
where $\alpha$ is a hyper-parameter named step size. It determines how fast the parameter values move along the descent direction. The above process is repeated iteratively as more mini-batches of input data are fed to the NN until convergence.
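A minimal sketch of the mini-batch procedure, assuming a one-parameter model $\hat{y} = w x$ with an MSE loss; the data, step size and batch size are arbitrary illustrative values.

```python
def minibatch_gradient_descent(xs, ys, step_size=0.05, batch_size=2,
                               epochs=50, w0=0.0):
    """Fit y = w * x by repeatedly stepping against the mini-batch MSE gradient."""
    w = w0
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            # Gradient of the average squared error over the mini-batch w.r.t. w.
            grad = sum(2.0 * (w * x - y) * x
                       for x, y in zip(batch_x, batch_y)) / len(batch_x)
            w -= step_size * grad      # step along the descent direction
    return w

# Hypothetical data generated by y = 3x; the fitted weight should approach 3.
w_fit = minibatch_gradient_descent([1.0, 2.0, 3.0, 4.0], [3.0, 6.0, 9.0, 12.0])
```

With a much larger step size the same loop overshoots and diverges on this data, which illustrates why the step size has to be chosen with care.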
The simple gradient is easy to derive, but simple gradient descent is often not the most efficient method to optimize the loss function. During training, an appropriate step size should be set: if the value is too big, the update may overshoot and fail to reach the local minimum, and if the value is too small, it may take too much time to reach the local optimum. Natural gradients, on the other hand, do not follow the usual steepest direction in the parameter space, but the steepest descent direction with respect to the Fisher metric in the space of distributions. Specifically, the Fisher information matrix $F(\theta)$ is used to rescale the step, so that the natural gradient is $\tilde{\nabla}_{\theta} L(\theta) = F(\theta)^{-1} \nabla_{\theta} L(\theta)$. Then, (4) can be used to update the parameters by replacing $\nabla_{\theta} L(\theta)$ with $\tilde{\nabla}_{\theta} L(\theta)$.
II-C DRL Basics - Value-Based Methods
In value-based methods for DRL as illustrated in Fig. 5, the states or state-action pairs are used as inputs to NNs, while Q functions or value functions are approximated by the parameters $\theta$ of the NNs. An NN returns the approximated Q functions or value functions for the input states or state-action pairs. There can be a single output neuron (as shown in Fig. 5(a)) or multiple output neurons (as shown in Fig. 5(b)). In the former case, the output can be either $V(s; \theta)$ or $Q(s, a; \theta)$, corresponding to the input $s$ or $(s, a)$. In the latter case, the outputs are the Q functions for state $s$ combined with every action, i.e., $Q(s, a; \theta), \forall a \in \mathcal{A}$.
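To derive the loss functions, $y^{Q}$ and $y^{V}$ are defined as the target values of Q functions and value functions, respectively. The regression losses
$$L(\theta) = \mathbb{E}\left[ \left( y^{Q} - Q(s, a; \theta) \right)^{2} \right], \tag{5}$$
$$L(\theta) = \mathbb{E}\left[ \left( y^{V} - V(s; \theta) \right)^{2} \right] \tag{6}$$
can be used to evaluate how well the NN approximates Q functions or value functions in value-based methods.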
II-C1 Deep Q-Networks
Based on the idea of NN-fitted Q functions, the Deep Q-Network (DQN) algorithm was introduced by Mnih et al. in 2015 and achieved strong performance in Atari games. DQN is illustrated in Fig. 6. The NN in DQN takes a state as input, and returns approximated Q functions for every action under the input state.
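In DQN, the algorithm first randomly initializes the parameters of the network as $\theta_0$. The target Q function is given by (7) according to the Bellman equation as
$$y_{i}^{DQN} = r_{i} + \gamma \max_{a'} Q(s_{i+1}, a'; \theta_{i}), \tag{7}$$
where the subscripts $i$ or $i+1$ refer to the values of the corresponding variables at the $i$-th or $(i+1)$-th iteration.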
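The parameters in DQN are updated by minimizing the loss function $L(\theta_i)$, which can be derived from (5) by replacing $y^{Q}$ with $y_{i}^{DQN}$.
By applying stochastic gradient descent, the parameters are updated as
$$\theta_{i+1} = \theta_{i} - \alpha \nabla_{\theta_{i}} L(\theta_{i}), \tag{8}$$
where $\alpha$ is the learning rate.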
In order to deal with the limitations of DRL, two important techniques, freezing target networks and experience replay, are applied in DQN. To make the training process more stable and controllable, a target network, whose parameters are kept fixed for a period of time, is used to evaluate the Q function of the next state, i.e., instead of (7), we have
$$y_{i}^{DQN} = r_{i} + \gamma \max_{a'} Q(s_{i+1}, a'; \theta^{-}), \tag{9}$$
where $\theta^{-}$ denotes the parameters of the target network. The parameters of the online network are updated after each iteration. After a certain number of iterations, the online network copies its parameters to the target network. This reduces the risk of divergence and prevents the instabilities that result from overly quick propagation of updates.
To perform experience replay, the experience of the agent at each time step is stored in a data set. Then, the updates are made on samples from this data set, which removes correlations in the observation sequence and smooths over changes in the data distribution. This technique allows the updates to cover a wide range of the state-action space and makes larger updates of the parameters possible.
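A replay data set of this kind is often implemented as a fixed-capacity buffer with uniform random sampling. The sketch below is one possible minimal form; the class name `ReplayBuffer`, the transition tuple layout and the capacity are assumptions for illustration rather than the design used in any specific DQN implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)   # oldest entries are evicted first
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of consecutive steps.
        return self._rng.sample(list(self._buffer), batch_size)

    def __len__(self):
        return len(self._buffer)

buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0, reward=0.0, next_state=step + 1, done=False)
batch = buffer.sample(3)
```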
II-C2 Double DQN
In DQN, the Q function evaluated by the target network is used both to select and to evaluate an action, which makes it more likely to overestimate the Q function of an action. The estimation error becomes larger if there are more actions. To overcome this problem, van Hasselt et al. proposed the Double DQN (DDQN) method in 2016, where two sets of parameters are used to derive the target value as shown in Fig. 7. Compared with (7), the target Q-value in DDQN can be rewritten as
$$y_{i}^{DDQN} = r_{i} + \gamma\, Q\left( s_{i+1}, \arg\max_{a'} Q(s_{i+1}, a'; \theta_{i}); \theta^{-} \right), \tag{10}$$
where the action is selected using the parameters $\theta_i$ of the online network and evaluated using the parameters $\theta^{-}$ of the target network. This reduces the overestimation of Q values and improves the stability and performance of DRL methods. The loss function can be derived from (5) by replacing $y^{Q}$ with $y_{i}^{DDQN}$, and the parameters can be updated accordingly. The DDQN algorithm gets the benefit of double Q-learning and keeps the rest of the DQN algorithm.
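The difference between the two targets can be seen on a single transition. In this sketch the lists `q_online_next` and `q_target_next` stand in for the outputs of the online and target networks at the next state; their values, and the function names, are invented for the example.

```python
def dqn_target(reward, gamma, q_target_next, done=False):
    """DQN target: the target network both selects and evaluates the action."""
    return reward if done else reward + gamma * max(q_target_next)

def ddqn_target(reward, gamma, q_online_next, q_target_next, done=False):
    """DDQN target: the online network selects, the target network evaluates."""
    if done:
        return reward
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[a_star]

# Hypothetical Q estimates for the next state under the two parameter sets.
q_online_next = [1.0, 2.0, 1.5]
q_target_next = [1.2, 1.8, 2.5]
y_dqn = dqn_target(0.5, 0.9, q_target_next)
y_ddqn = ddqn_target(0.5, 0.9, q_online_next, q_target_next)
```

Because the DDQN target evaluates an action that is not necessarily the maximizer under the target network, it can never exceed the DQN target computed from the same target-network estimates.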
Apart from DQN and DDQN, there are also other value-based methods, some of which build on DQN and DDQN with further improvements, such as DDQN with proportional prioritization, DDQN with a dueling architecture, etc.
Remark 1 (Pros and cons of value-based DRL methods)
Although DQN and its improved versions have been widely adopted in the existing literature as discussed in Section IV, mainly due to their relative simplicity and good performance, value-based DRL methods have some limitations. First, they cannot solve RL problems with large or continuous action spaces. Second, they cannot solve RL problems where the optimal policy is stochastic and requires specific action probabilities. Since value-based methods can only learn deterministic policies, the majority of the algorithms are off-policy, such as DQN.
II-D DRL Basics - Policy Gradient Methods
According to a policy $\pi$, action $a$ is selected when the environment is in state $s$. In policy gradient methods, NNs can be applied to directly approximate a policy as a function of state, i.e., $\pi(s)$. As shown in Fig. 8, the states are used as inputs to the NNs, while the policy is approximated by the parameters $\theta$ of the NNs as $\pi_{\theta}$.
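To evaluate the performance of the current policy, the objective function is defined as
$$J(\theta) = V^{\pi_{\theta}}(s_0) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[ r(\tau) \right], \tag{11}$$
where $V^{\pi_{\theta}}$ is the value function of policy $\pi_{\theta}$ as shown in (1), and $\tau$ refers to the sampling trajectory with an initial state $s_0$. If we can find the parameters $\theta$ for policy $\pi_{\theta}$ so that the objective function $J(\theta)$ is maximized, we can solve the task. The basic idea of policy gradient methods is to adjust the parameters in the direction of greater expected reward. For this purpose, we can set the loss function of the NN to be
$$L(\theta) = -J(\theta) = -\mathbb{E}_{\tau \sim \pi_{\theta}}\left[ r(\tau) \right]. \tag{12}$$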
In order to update the parameters, we need to express the gradient of $J(\theta)$ with respect to the parameter $\theta$ as an expectation of stochastic estimates based on (11). As mentioned in Section II-A, the policy in RL can be classified into two categories, i.e., the stochastic policy and the deterministic policy. Hence, the stochastic policy gradient (SPG) method and deterministic policy gradient (DPG) method are correspondingly discussed below.
II-D1 Stochastic Policy Gradients
By applying DRL, a stochastic policy is approximated as $\pi_{\theta}(a \mid s)$, which gives the probability that a specific action $a$ is taken in a specific state $s$ when the agent follows the policy parameterized by $\theta$. The policy parameters are usually the weights and biases of a neural network. For a DRL model with discrete state/action spaces, the softmax function is a typical choice for the probability distribution over actions. In the case of continuous state/action spaces, a Gaussian distribution is generally used to characterize the policy: an NN is applied to approximate the mean, and a set of parameters specifies the standard deviation of the Gaussian distribution.
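The two policy parameterizations can be sketched as follows. The action preferences passed to `softmax_policy` and the mean and standard deviation passed to `sample_gaussian_action` stand in for NN outputs; all names and values are illustrative assumptions.

```python
import math
import random

def softmax_policy(preferences):
    """Turn per-action preferences (e.g. NN outputs) into action probabilities."""
    shift = max(preferences)                       # subtract max for numerical stability
    exps = [math.exp(p - shift) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def sample_discrete_action(probs, rng):
    """Sample an action index according to the softmax probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_gaussian_action(mean, std, rng):
    """Sample a continuous action from a Gaussian policy N(mean, std^2)."""
    return rng.gauss(mean, std)

rng = random.Random(0)
action_probs = softmax_policy([0.0, 0.0, math.log(2.0)])
discrete_action = sample_discrete_action(action_probs, rng)
continuous_action = sample_gaussian_action(mean=0.5, std=0.1, rng=rng)
```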
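According to the policy gradient theorem, we have
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, Q^{\pi_{\theta}}(s_t, a_t) \right]. \tag{13}$$
By applying stochastic gradient descent to the loss, i.e., ascending along the gradient of $J(\theta)$, the parameters are updated as
$$\theta \leftarrow \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, Q^{\pi_{\theta}}(s_t, a_t), \tag{14}$$
where $\alpha$ is the learning rate. In this way, $\theta$ is adjusted to enlarge the probability of trajectories $\tau$ with higher total reward.
From the perspective of the NN, we give the loss function of the SPG algorithm as
$$L(\theta) = -\mathbb{E}_{\pi_{\theta}}\left[ \log \pi_{\theta}(a_t \mid s_t)\, Q^{\pi_{\theta}}(s_t, a_t) \right]. \tag{15}$$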
Monte Carlo Policy Gradient
A typical algorithm of the SPG methods is the REINFORCE algorithm. Based on the Monte Carlo approach, a trajectory $\tau$ is first sampled by running the current policy $\pi_{\theta}$ from an initial state $s_0$. Then for each time step $t$, the total reward $G_t$ starting from time step $t$ is calculated, which is multiplied with the policy gradient to update the parameters according to (14). The above procedure is repeated over multiple runs, while in each run a different trajectory is sampled.
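Moreover, in order to reduce the variance of the policy gradient, a baseline function $b(s_t)$ which is independent of $a_t$ is introduced. Based on this, the REINFORCE algorithm with baseline is introduced, and its loss function can be formulated as
$$L(\theta) = -\mathbb{E}_{\pi_{\theta}}\left[ \log \pi_{\theta}(a_t \mid s_t) \left( Q^{\pi_{\theta}}(s_t, a_t) - b(s_t) \right) \right]. \tag{16}$$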
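The quantities that multiply the policy gradient in REINFORCE can be computed from a sampled trajectory as sketched below. The discount factor is an assumption here (setting `gamma=1.0` gives the undiscounted total reward), and the reward sequence and baseline values are made up.

```python
def returns_to_go(rewards, gamma):
    """Total (discounted) reward collected from each time step onwards."""
    running, out = 0.0, []
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

def reinforce_weights(rewards, gamma, baselines):
    """Return-minus-baseline terms that scale grad log pi(a_t | s_t) at each step."""
    return [g - b for g, b in zip(returns_to_go(rewards, gamma), baselines)]

# Hypothetical 3-step trajectory with a constant baseline of 1.0.
episode_rewards = [1.0, 0.0, 2.0]
g = returns_to_go(episode_rewards, gamma=0.5)
weights = reinforce_weights(episode_rewards, gamma=0.5, baselines=[1.0, 1.0, 1.0])
```

Subtracting the baseline shifts the weights towards zero without changing which actions are favoured on average, which is the source of the variance reduction.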
Remark 2 (Pros and cons of Monte Carlo policy gradient DRL methods)
In contrast to value-based DRL methods, policy gradient methods for DRL learn a direct mapping from state to action, which leads to better convergence properties and higher efficiency in high-dimensional or continuous action spaces. Moreover, they can learn stochastic policies, which perform better than deterministic policies in some situations. However, Monte Carlo policy gradient methods suffer from high variance of estimations. As on-policy methods, they require on-policy samples, which makes them very sample intensive.
Actor-critic methods, combining the advantages of both policy gradient and value-based methods, have been widely studied in DRL. As illustrated in Fig. 10, an actor-critic method is generally realized by two NNs, i.e., the actor network and the critic network, which share parameters with each other. The actor network is similar to the NN of the policy gradient method, while the critic network is similar to the NN of the value-based method. During the learning process, the critic updates the parameters of the value functions, i.e., $w$, according to the policy given by the actor. Meanwhile, the actor updates the parameters of the policy, i.e., $\theta$, according to the value functions evaluated by the critic. Generally, two learning rates need to be predefined for the updates of $w$ and $\theta$, respectively.
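As mentioned previously, one important task in the policy gradient method is to obtain the value of $Q^{\pi_{\theta}}(s_t, a_t)$ in (15). In the actor-critic method, the critic network is used for this purpose. Specifically, the baseline $b(s_t)$ in (16) is set to the value function $V^{\pi_{\theta}}(s_t)$, which is approximated by the critic network as $V(s_t; w)$ with a loss function as given in (6). In a state $s_t$, the agent selects an action $a_t$ according to the current policy given by the actor network, receives a reward $r_t$, and the state transits to $s_{t+1}$. Similar to (6) in the value-based method, the loss function for the critic network can be expressed as
$$L(w) = \mathbb{E}\left[ \left( r_t + \gamma V(s_{t+1}; w) - V(s_t; w) \right)^{2} \right]. \tag{17}$$
Similar to (8) in DQN, the parameters of the critic network are updated as
$$w \leftarrow w + \alpha_{w}\, \delta_t \nabla_{w} V(s_t; w), \quad \delta_t = r_t + \gamma V(s_{t+1}; w) - V(s_t; w), \tag{18}$$
where $\alpha_{w}$ is the learning rate for the critic, the target $r_t + \gamma V(s_{t+1}; w)$ is treated as fixed when taking the gradient, and the constant factor is absorbed into $\alpha_{w}$.
Note that $r_t + \gamma V(s_{t+1}; w)$ is an estimate of $Q^{\pi_{\theta}}(s_t, a_t)$. Therefore, given the value functions evaluated by the critic network, the value of $Q^{\pi_{\theta}}(s_t, a_t) - b(s_t)$ in (16) can be replaced by $\delta_t$ in (18), which can be seen as an estimate of the advantage of action $a_t$ in state $s_t$. The loss function of the actor network can be defined similar to (16), i.e.,
$$L(\theta) = -\mathbb{E}_{\pi_{\theta}}\left[ \log \pi_{\theta}(a_t \mid s_t)\, \delta_t \right]. \tag{19}$$
Similar to (14) in the policy gradient method, the parameters of the actor network are updated as
$$\theta \leftarrow \theta + \alpha_{\theta}\, \delta_t \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t), \tag{20}$$
where $\alpha_{\theta}$ is the learning rate for the actor.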
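The interplay of the critic and actor updates can be illustrated with a one-step, tabular version in which the critic is a table of state values and the actor is a table of softmax action preferences. This tabular simplification, the function names and all numbers are assumptions made for the sketch; the methods discussed in the text use NNs for both parts.

```python
import math

def softmax(prefs):
    shift = max(prefs)
    exps = [math.exp(p - shift) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def actor_critic_step(theta, v, s, a, r, s_next, done,
                      gamma, alpha_actor, alpha_critic):
    """One transition: TD error from the critic, then critic and actor updates."""
    target = r if done else r + gamma * v[s_next]
    delta = target - v[s]                     # TD error, used as advantage estimate
    v[s] += alpha_critic * delta              # critic update
    probs = softmax(theta[s])
    for b in range(len(theta[s])):
        # Gradient of log softmax w.r.t. preference b: 1{b == a} - pi(b | s).
        grad_log_pi = (1.0 if b == a else 0.0) - probs[b]
        theta[s][b] += alpha_actor * delta * grad_log_pi   # actor update
    return delta

theta = [[0.0, 0.0], [0.0, 0.0]]   # action preferences per state (actor)
v = [0.0, 0.0]                     # state values (critic)
td_error = actor_critic_step(theta, v, s=0, a=1, r=1.0, s_next=1, done=False,
                             gamma=0.9, alpha_actor=0.5, alpha_critic=0.1)
```

A positive TD error raises both the value estimate of the visited state and the probability of the action that produced it.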
Through the update processes in the actor-critic algorithm, the critic can make the approximation of the value functions more accurate, while the actor can choose better actions to get higher reward.
Typical actor-critic methods for SPG include the asynchronous advantage actor-critic (A3C) algorithm and soft actor-critic (SAC). The former mainly focuses on the parallel training of multiple actors that share global parameters. The latter involves a soft Q function, a tractable stochastic policy and off-policy updates. SAC achieves good performance on a range of continuous control tasks.
Remark 3 (Pros and cons of Actor-Critic DRL methods)
Actor-critic methods combine the advantages of both value-based and Monte Carlo policy gradient methods. They can be either on-policy or off-policy. Compared with Monte Carlo methods, they require far fewer samples to learn from and fewer computational resources to select an action, especially when the action space is continuous. Compared with value-based methods, they can learn stochastic policies and solve RL problems with continuous actions. However, they are prone to instability due to the recursive use of value estimates.
II-D2 Deterministic Policy Gradient
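Different from the stochastic policy gradient, where the policy is modeled as a probability distribution over actions, the deterministic policy gradient models the policy as a deterministic decision, i.e., $a = \pi_{\theta}(s)$. According to the objective function given in (11) and the deterministic policy gradient theorem, we have
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}}\left[ \nabla_{\theta} \pi_{\theta}(s)\, \nabla_{a} Q^{\pi}(s, a) \big|_{a = \pi_{\theta}(s)} \right], \tag{21}$$
where the policy improvement is decomposed into the gradient of the Q-function with respect to actions, and the gradient of the policy with respect to the policy parameters. $\rho^{\pi}$ is the state distribution following policy $\pi_{\theta}$. Thus, the parameters are updated as
$$\theta \leftarrow \theta + \alpha\, \nabla_{\theta} \pi_{\theta}(s)\, \nabla_{a} Q^{\pi}(s, a) \big|_{a = \pi_{\theta}(s)}. \tag{22}$$
A differentiable function $Q(s, a; w)$ can be used as an approximator of $Q^{\pi}(s, a)$, and then the gradient $\nabla_{a} Q^{\pi}(s, a)$ can be replaced by $\nabla_{a} Q(s, a; w)$. The approximator is compatible with the deterministic policy, and is achieved as .
From the perspective of the NN, the loss function of the DPG algorithm is set as
$$L(\theta) = -\mathbb{E}_{s \sim \rho^{\pi}}\left[ Q(s, \pi_{\theta}(s); w) \right]. \tag{23}$$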
One typical actor-critic method for DPG is the deep deterministic policy gradient (DDPG) algorithm, as shown in Fig. 11. The DDPG algorithm is a model-free off-policy actor-critic algorithm, which combines the ideas of DPG and DQN. It was first proposed by Lillicrap et al. in 2015. Besides the online critic network with parameters $w$ and the online actor network with parameters $\theta$, the target networks $Q'$ and $\pi'$ in the DDPG algorithm are specified with parameters $w^{-}$ and $\theta^{-}$, respectively. The parameters of these four NNs need to be updated in the learning process. The gradient $\nabla_{a} Q(s, a; w)$ is obtained by the critic network.
Based on DDPG, several algorithms have been proposed in recent years, such as Distributed Distributional Deep Deterministic Policy Gradients (D4PG), Twin Delayed Deep Deterministic (TD3), Multi-Agent DDPG (MADDPG), and Recurrent Deterministic Policy Gradients (RDPG).
Remark 4 (Pros and cons of DPG DRL methods)
DPG methods are a special type of actor-critic methods that focus on deterministic policies. Compared with their SPG counterparts, they require fewer samples in computing the deterministic policy gradient, especially if the action space has many dimensions. Different from value-based methods that can only solve RL problems with discrete actions, DPG-based methods focus on and work well for high-dimensional continuous action problems. They usually work off-policy to guarantee enough exploration unless there is sufficient noise in the environment. Also, when combined with DQN, a DPG-based method is good only when the Q function is accurate.
II-D3 Further Improvements
Natural Policy Gradient
The policy gradient methods discussed above all use a simple gradient of the loss function to update the parameters of the NN. The natural policy gradient (NPG) method, on the other hand, updates the parameters of the NN using the natural gradient discussed in Section II-B instead of the simple gradient, to provide a more efficient solution.
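The loss function of NPG is the same as that of SPG, whose general expression is given in (12). The parameters are updated as
$$\theta_{k+1} = \theta_{k} - \alpha F(\theta_{k})^{-1} \nabla_{\theta} L(\theta_{k}), \tag{24}$$
where $F(\theta)$ is the Fisher information matrix used to measure the step size for the update.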
The NPG method defines a new form of step size that specifies how much the parameters should be adjusted, and therefore provides a more stable and effective update. However, the drawback of NPG is that when a complicated NN with a large number of parameters is used to approximate the policy, it is impractical to calculate or store the Fisher information matrix. Methods that originate from NPG, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), solve this problem to some extent and are widely used for DRL in practice. Moreover, there are algorithms applying NPG to actor-critic methods, such as Actor Critic using Kronecker-Factored Trust Region (ACKTR) and Actor-Critic with Experience Replay (ACER).
Combining Monte Carlo Policy Gradient and Actor-Critic
In Monte Carlo methods, the policy gradient is unbiased but has high variance, while in actor-critic methods it has lower variance but is biased. Therefore, an effective way is to combine these two types of methods. Q-prop is such an efficient and stable algorithm, proposed by Gu et al. in 2016. It constructs a new estimator that addresses the high sample complexity and combines the advantages of on-policy and off-policy methods.
Q-prop can be directly combined with a number of prior policy gradient DRL methods, such as DDPG and TRPO. Compared with actor-critic methods such as DDPG, Q-prop has achieved higher stability in DRL tasks in real-world problems. One limitation of Q-prop is that critic training slows down the computation when data is collected quickly.
II-E DRL Beyond MDP - POMDP-based DRL
In the previous sections, we considered RL in a Markovian environment, which implies that knowledge of the current state is always sufficient for optimal control. However, in many real-world problems, the agent cannot accurately observe the complete environment information, usually due to limitations in sensing and communication capabilities. An agent acting under partial observability can model the environment as a partially observable Markov decision process (POMDP). RL tasks in realistic environments need to deal with the incomplete and noisy state information that results from partial observability.
A POMDP can be seen as an extension of an MDP that adds a finite set of observations and a corresponding observation model. A POMDP is usually defined as a six-tuple $(\mathcal{S}, \mathcal{A}, P, R, \Omega, O)$, where the state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability $P$, and reward $R$ are defined previously as elements of the MDP,
$\Omega$ is the observation space, where $o \in \Omega$ is a possible observation.
$O(o \mid s', a)$ is the conditional probability that taking an action $a$ leading to a new state $s'$ will result in an observation $o$.
Similar to an MDP, the agent chooses an action $a$ according to its policy, which results in the environment transiting to a new state $s'$ with probability $P(s' \mid s, a)$, and the agent receives a reward $r$. Different from an MDP, the agent cannot directly observe the system state, but instead receives an observation $o$ that depends on the new state of the environment with probability $O(o \mid s', a)$. Also, the policy and Q function are modified as $\pi(a \mid o)$ and $Q(o, a)$, respectively.
Since the agent cannot directly observe the underlying state, it needs to exploit history information to reduce uncertainty about the current state. The observation history at time step $t$ can be defined as $h_t = (o_1, o_2, \ldots, o_t)$. When the environment model is known to the RL agent, the optimal approach is for the agent to compute a belief $b$ that provides a probability distribution over states. By introducing the belief state, a POMDP problem can be converted into a belief-based MDP problem. On the other hand, if the environment model is not available to the agent, a straightforward approach is to use the last $k$ observations as input to the policy. However, using a finite history may cause important information from the past to be forgotten or overlooked. To overcome this challenge, an RNN appears to be a good solution, because it is designed to deal with time series and can maintain long-term memory.
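To make the belief computation mentioned above concrete, the following sketch applies one step of the standard Bayes filter for a known POMDP model: the belief is first propagated through the transition probabilities and then corrected by the likelihood of the received observation. The array layout, the function name, and the toy numbers are assumptions chosen for illustration only.

```python
import numpy as np

def belief_update(belief, action, observation, P, O):
    """One Bayes-filter step: b'(s') is proportional to O[a, s', o] * sum_s P[a, s, s'] * b(s)."""
    predicted = belief @ P[action]                      # predict: distribution over next states
    unnorm = O[action, :, observation] * predicted      # correct: weight by observation likelihood
    total = unnorm.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnorm / total

# Toy model (all numbers are made up): 2 states, 1 action, 2 observations.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])      # P[a, s, s']: transition probabilities
O = np.array([[[0.8, 0.2],
               [0.3, 0.7]]])      # O[a, s', o]: observation probabilities
b0 = np.array([0.5, 0.5])         # uniform initial belief
b1 = belief_update(b0, action=0, observation=1, P=P, O=O)
```

In this toy model, observation 1 is more likely in state 1, so the updated belief shifts its mass toward state 1.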
Several typical existing methods of solving POMDP problems are listed as follows.
Deep Recurrent Q-Network (DRQN)
An agent is able to choose actions in complex tasks using value-based DRL methods, e.g., DQN and DDQN. However, it is hard for those methods to achieve outstanding performance when the agent cannot perceive the complete state information. To address this problem, Hausknecht et al. proposed the Deep Recurrent Q-Network (DRQN) in 2015 to integrate information through time and enhance DQN's performance. DRQN adds recurrency to DQN by replacing DQN's first fully-connected layer with an LSTM layer.
In partially observed cases, the agent does not have access to the state $s_t$. The Q function is therefore defined in terms of the history as $Q(h_t, a_t)$, which is the output of the NN. The input to the NN is $o_t$, while the rest of the information in $h_t$ apart from $o_t$, i.e., $h_{t-1}$, is captured by the hidden state of the RNN. Here, the Bellman optimality equation for the Q function is
$$Q^*(h_t, a_t) = \mathbb{E}\left[ r_t + \gamma \max_{a'} Q^*(h_{t+1}, a') \right].$$
Compared with DQN, where tuples $(s_t, a_t, r_t, s_{t+1})$ are stored in memory and sampled for training, in DRQN the tuples are modified as $(o_t, a_t, r_t, o_{t+1})$ and sampled for two types of updates, referred to as bootstrapped sequential updates and bootstrapped random updates, respectively. In both methods, episodes are selected randomly from the replay memory. In bootstrapped sequential updates, updates start at the beginning of the episode and sample experiences sequentially through the entire episode, while the RNN's hidden state is carried forward throughout the episode. In bootstrapped random updates, updates start at random points in the episode, and the RNN's initial state is zeroed at the start of the update. The sequential update method learns faster but violates DQN's random sampling policy. Both methods show good performance in experiments.
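The difference between the two update types lies only in how a training sequence is drawn from the replay memory. The sketch below illustrates this under simplifying assumptions: the replay memory is a plain list of episodes, each episode is a list of transition tuples, and the helper names and the `unroll_len` parameter are hypothetical. The recurrent network itself is omitted.

```python
import random

def sample_sequential(replay, rng):
    """Bootstrapped sequential update: a random episode, replayed from its first step.
    The RNN hidden state would be carried forward across the whole episode."""
    episode = rng.choice(replay)
    return 0, list(episode)

def sample_random(replay, rng, unroll_len):
    """Bootstrapped random update: a random episode and a random start point.
    The RNN hidden state would be zeroed at the start of the returned chunk."""
    episode = rng.choice(replay)
    start = rng.randrange(0, max(1, len(episode) - unroll_len + 1))
    return start, list(episode[start:start + unroll_len])

# Hypothetical replay memory: each transition is (o_t, a_t, r_t, o_next),
# with the time index standing in for the observation.
replay = [[(t, 0, 0.0, t + 1) for t in range(n)] for n in (5, 8, 12)]
rng = random.Random(0)
start_seq, seq = sample_sequential(replay, rng)
start_rand, chunk = sample_random(replay, rng, unroll_len=4)
```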
Recurrent Policy Gradients (RPG)
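RPG methods are policy gradient methods in which NNs are used to approximate policies and the parameters are updated in the direction of higher expected total reward. As mentioned in Section II-D, in policy gradient methods, $\pi_\theta(a \mid s)$ or $\mu_\theta(s)$ is a direct mapping from state $s$ to action $a$. In RPG, however, the goal of the agent is to learn a policy that maps the history $h$ to action $a$, which is denoted as $\pi_\theta(a \mid h)$ or $\mu_\theta(h)$. The policy gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid h_t)\, Q^{\pi}(h_t, a_t) \right]$$
for stochastic policies, and
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[ \sum_t \nabla_a Q^{\mu}(h_t, a)\big|_{a = \mu_\theta(h_t)}\, \nabla_\theta \mu_\theta(h_t) \right]$$
for deterministic policies, respectively, where $\tau$ refers to the sampled trajectory of the history $h_t$. Here, the RNN is trained to extract information from $h_t$ through its recurrent state and to compute $\mu_\theta(h_t)$ as well as $Q^{\mu}(h_t, a_t)$.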
RPG methods have been applied to many partially observed physical control problems, e.g., system identification with variable and unknown information, short-term integration of sensor information to estimate the system state, and long-term memory problems. A typical algorithm, Recurrent Deterministic Policy Gradient (RDPG), was proposed by N. Heess, J. J. Hunt et al. based on RPG methods.
Memory, RL, and Inference Network (MERLIN)
In RL algorithms, extensive memory can be used to solve POMDP tasks. MERLIN algorithm focuses on memory-dependent policies which output the action distribution based on the entire observation sequence in the past . The ideas for MERLIN, including predictive sensory coding, hippocampal representation theory and temporal context model, mainly originate in neuroscience and psychology.
MERLIN is mainly composed of two basic components: a memory-based predictor and a policy. The memory-based predictor is mainly used to compress the input observation into a low-dimensional state variable $z_t$ that represents the state. In each time step, the recurrent network outputs a prior distribution to predict the next state variable. Next, a posterior distribution is obtained based on the observation and the prior distribution. The posterior distribution corrects the prior distribution to form a more accurate estimate of the state variable. Then, $z_t$ is sampled from the posterior distribution. It is used by the other basic component, the policy, to select an action, and is stored in the memory for the next-step prediction.
Deep Belief Q-Network (DBQN)
DBQN is a model-based method that uses DQN to map a belief to an action. When $P$, $O$, and $R$ in a POMDP model are known, the belief $b$ can be estimated accurately with Bayes' theorem and sent to the NN as input. The Bellman optimality equation for beliefs is
$$Q^*(b, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(b', a') \right].$$
During updating, this approach usually leads to divergence. To stabilize the learning, techniques like experience replay, target network and an adaptive learning method are used. For experience replay, tuples are stored in memory and sampled uniformly. The adaptive learning method is used to regulate the parameter adjustment rate of the network .
II-F DRL Beyond MDP - Multi-Agent DRL
In the previous sections, we mainly discuss the DRL methods for single-agent cases. In practice, there are situations where multiple agents need to work together, e.g. the manipulation in multi-robot systems, the cooperative driving of multiple vehicles. In these cases, DRL methods for multi-agent systems are designed.
A multi-agent system consists of a group of autonomous, interacting agents sharing a common environment, and has a good degree of robustness and scalability. The multiple agents in the system can interact with each other in cooperative or competitive settings, and hence the concept of a stochastic game is introduced to extend the MDP to the multi-agent setting. A stochastic game or multi-agent MDP with $N$ agents is defined as a tuple $(\mathcal{S}, \mathcal{A}_1, \ldots, \mathcal{A}_N, P, R_1, \ldots, R_N)$, where
$\mathcal{S}$ is the discrete set of states,
$\mathcal{A}_i$, $i = 1, \ldots, N$, are the discrete sets of actions available to the agents, yielding the joint action set $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$,
$P$ is the state transition probability function,
$R_i$, $i = 1, \ldots, N$, are the reward functions for the agents.
In a multi-agent MDP, the state transitions depend on the joint action of all the agents $a = (a_1, \ldots, a_N)$, where $a \in \mathcal{A}$ and $a_i \in \mathcal{A}_i$. In fully collaborative problems, all the agents share the same reward, i.e., $R_1 = \cdots = R_N$. In fully competitive problems, the agents have opposite rewards with $R_1 + \cdots + R_N = 0$. Therefore, in the typical scenario with two agents, $R_1 = -R_2$. Multi-agent MDP problems that are neither fully collaborative nor fully competitive are mixed games.
In multi-agent RL, each agent learns to improve its own policy by interacting with the environment to obtain rewards. For each agent, the environment is usually complex and dynamic, and the system may encounter the action space explosion problem. Since multiple agents are learning at the same time, for a particular agent, when the policies of other agents change, the optimal policy of itself may also change. This may affect the convergence of the learning algorithm and cause instability.
The simplest approach to learning in multi-agent settings is to use independently learning agents. For example, independent Q-learning is an algorithm in which each agent independently learns its own policy, treating other agents as part of the environment. However, independent Q-learning cannot deal with the non-stationary environment problem. By combining game theory with RL, several typical algorithms for Multi-Agent RL (MARL) have been studied that aim to solve the problems mentioned above. The Minimax-Q algorithm is an approach that incorporates the zero-sum game of two players and the TD method in Q-learning. The Nash Q-Learning algorithm extends the Minimax-Q algorithm from a zero-sum game to a general-sum game for multiple players. The Friend-or-foe Q-Learning (FFQ) algorithm is also derived from the Minimax-Q algorithm, and transforms the general-sum game of a multi-agent system into a zero-sum game of two agents. Note that all three methods mentioned above need to maintain the Q functions in the learning process, so each agent needs a very large space to store them. In order to reduce the space dimension, WoLF-PHC combines the "Win or Learn Fast (WoLF)" rule with policy hill-climbing (PHC), where each agent is expected to maintain the Q functions only by knowing its own actions.
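As an illustration of independent Q-learning, the sketch below trains two agents on a single-state cooperative game in which both receive a reward of 1 only when they both choose action 1. The game, the hyperparameters, and the function name are assumptions made for this example. Each agent keeps a Q table over its own actions only, so the other agent's changing behavior appears to it as part of a non-stationary environment.

```python
import numpy as np

def train_independent_q(steps=5000, alpha=0.1, epsilon=0.2, seed=0):
    """Two epsilon-greedy learners on a stateless cooperative matrix game."""
    rng = np.random.default_rng(seed)
    q = np.zeros((2, 2))                  # q[i, a]: agent i's value for its own action a
    for _ in range(steps):
        actions = []
        for i in range(2):
            if rng.random() < epsilon:
                actions.append(int(rng.integers(2)))     # explore
            else:
                actions.append(int(np.argmax(q[i])))     # exploit own Q table
        reward = 1.0 if actions == [1, 1] else 0.0       # shared reward
        for i, a in enumerate(actions):
            # Stateless game, so there is no bootstrapped next-state term.
            q[i, a] += alpha * (reward - q[i, a])
    return q

q_tables = train_independent_q()
greedy_actions = q_tables.argmax(axis=1)
```

This particular game is easy to coordinate on because action 0 never yields a reward; in games with several equally good joint actions, independent learners may fail to agree on one, which is part of the motivation for the game-theoretic methods above.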
In recent years, the DRL methods for the single-agent case have been extended to the multi-agent case, as discussed below.
Multi-Agent Value-Based Methods
The experience replay mechanism in DQN algorithm is not designed for the non-stationary environment in multi-agent systems. Several variants of DQN have been proposed to deal with this problem.
Foerster et al.  introduced two methods for stabilizing experience replay of DQN in multi-agent DRL. In the multi-agent importance sampling (MAIS) algorithm, off-environment importance sampling is introduced to stabilize experience replay, where obsolete data is supposed to decay naturally. In the multi-agent fingerprints (MAF) algorithm, each agent needs to be able to condition on only those values that actually occur in its replay memory to stabilize experience replay. A low-dimensional fingerprint is designed to contain this information, and to disambiguate the age of the samples retrieved from the replay memory.
In , a coordinated multi-agent DRL method is designed based on DQN. Faster and more scalable learning is realized by using transfer planning. To coordinate between multiple agents, the global Q-function is factorized as a linear combination of local sub-problems. Then, the max-plus coordination algorithm is applied to optimize the joint global action over the entire coordination graph.
Multi-Agent Policy-Gradient Methods
Policy gradient methods usually exhibit very high variance when coordination of multiple agents is required. In order to overcome this challenge, several algorithms adopt the framework of centralized training with decentralized execution.
In the counterfactual multi-agent policy gradient (COMAPG) algorithm, a centralized critic is used to estimate the Q-function, and decentralized actors are used to optimize the policies of multiple agents. The core idea of the COMAPG algorithm is to apply a counterfactual baseline, which can marginalize out a single agent’s action and keep the other agents’ actions fixed. Moreover, a critic representation is introduced for efficiently evaluating the counterfactual baseline in a single forward pass. The experiments in  show that the COMAPG algorithm has a good final performance and an efficient training speed.
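The counterfactual baseline can be sketched in a few lines. Assuming the centralized critic has already produced, for one agent, the Q-values of each of its possible actions with the other agents' actions held fixed, the baseline is the expectation of those Q-values under the agent's own policy, and the advantage is the Q-value of the action actually taken minus this baseline. The function name and the numbers below are hypothetical.

```python
import numpy as np

def counterfactual_advantage(q_values, policy, taken_action):
    """q_values[u]: critic output Q(s, (u, u_others)) for each action u of one agent,
    with the other agents' actions fixed. policy[u]: that agent's action probabilities."""
    baseline = float(np.dot(policy, q_values))   # marginalize out the agent's own action
    return float(q_values[taken_action]) - baseline

q_values = np.array([1.0, 3.0, 2.0])   # made-up critic outputs
policy = np.array([0.2, 0.5, 0.3])     # made-up action probabilities
adv = counterfactual_advantage(q_values, policy, taken_action=1)
expected_adv = sum(policy[u] * counterfactual_advantage(q_values, policy, u) for u in range(3))
```

Because the baseline is an expectation under the agent's own policy, the advantage averages to zero over that policy, so subtracting it does not change the expected gradient.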
Multi-agent deep deterministic policy gradient (MADDPG) is essentially a DPG algorithm that trains each agent with a critic that requires global information and an actor that requires only local information. It allows each agent to have its own reward function, so it can be used for cooperative or competitive tasks. The core idea of the MADDPG algorithm is centralized training with distributed execution: the critic and actor are trained centrally, whereas during execution the actor only needs local information. The critic requires policy information from other agents. The study in  gives a method of estimating other agents' policies using only their observations and actions. In addition, a policy ensemble is used to learn multiple policies for each agent, and the overall effect of all policies is optimized to improve the stability and robustness of the algorithm.
Based on the introduction above, we list classical algorithms for DRL in Table I and summarize the characteristics of each algorithm.
|Classification|Classical algorithms|Feature|Monte Carlo/Actor-critic method|Action space|
|---|---|---|---|---|
|Value-based|Deep Q-network (DQN)|-|-|discrete|
|Value-based|Double Deep Q-network (DDQN)|-|-|discrete|
|Value-based|DDQN with dueling architecture|-|-|discrete|
|Value-based|DDQN with Proportional Prioritization|-|-|discrete|
|Value-based|Deep Belief Q-network (DBQN)|POMDP|-|discrete|
|Value-based|Deep Recurrent Q-network (DRQN)|POMDP|-|discrete|
|Value-based|Multi-agent Importance Sampling (MAIS)|MA|-|discrete|
|Value-based|Coordinated Multi-agent DQN|MA|-|discrete|
|Value-based|Multi-agent Fingerprints (MAF)|MA|-|discrete|
|Policy gradient|REINFORCE|-|Monte Carlo| |
|Policy gradient|Soft Actor-Critic (SAC)|-|Actor-critic| |
|Policy gradient|Q-Prop|-|Monte Carlo and Actor-critic| |
III General Reinforcement Learning Model for Autonomous IoT
Before discussing the RL model for AIoT systems, we first examine that of a wireless sensor and actuator network (WSAN), which can be considered as an element or a simplified version of AIoT. A WSAN consists of a group of sensors that gather information about their environment, and a group of actuators that interact with and act on the environment. All elements communicate wirelessly. In the RL model for WSAN illustrated in Fig. 12, an agent observes aspects of its environment through sensors, and chooses control actions that are implemented by the actuators. The chosen action determines the value of the immediate reward and influences the dynamics of the environment. The agent communicates with the sensors and actuators to receive state information and send control commands.
Compared with WSAN, the AIoT has a more complex ecosystem that encompasses identification, sensing, communication, computation, and services. A typical AIoT architecture consists of three fundamental building blocks as shown in Fig. 13:
Perception layer: corresponds to the physical autonomous systems in which IoT devices with sensors and actuators interact with the environment to acquire data and exert control actions;
Network layer: corresponds to the IoT communication networks including wireless access networks and the Internet that discover and connect the IoT devices to the edge/fog servers and cloud servers for data and control command transmission;
Application layer: corresponds to the IoT edge/fog/cloud computing systems for data processing/storage and control actions determination.
Owing to the more sophisticated system architecture, the RL models for AIoT systems are more complex than those of WSAN illustrated in Fig. 12. The environment can include one or more layers of the AIoT architecture. The agent(s) can be located at the IoT devices, the edge/fog/cloud servers, and the wireless access points. In the following, we first define the basic RL elements such as state, action, and reward for each layer. Then, we define the RL elements when the environment includes all three layers as an integrated whole.
III-A Perception Layer
When the environment only includes the perception layer, the physical system dynamics are modelled by a controlled stochastic process with the following state, action, and reward.
Physical system state, e.g., the on-off status of the actuators, the RGB images of the system, the locations of the agents;
Actuator control action, e.g., controlling the movement of a robot, adjusting the driving speed and direction of a vehicle, turning on/off a device;
Physical system performance, e.g., energy consumption in a power grid, how fast a mobile agent such as a robot or a vehicle can move, or whether it is away from obstacles.
III-B Network Layer
When the environment only includes the network layer, the network dynamics are modelled by a controlled stochastic process with the following state, action, and reward.
Network resource state, e.g., the amount of allocated bandwidth, the signal to interference plus noise ratio, the channel vector of a finite state Markov channel model;
Communication resource control action, e.g., the power allocation, the multi-user scheduling, the subchannel allocation in an OFDM system;
Network performance, e.g., the transmission delay, the transmission error probability, the transmission power consumption.
III-C Application Layer
When the environment only includes the application layer, the edge/fog/cloud computing system dynamics are modelled by a controlled stochastic process with the following state, action, and reward.
Computing resource state, e.g., the number of virtual machines (VMs) that currently run, the number of tasks buffered in the queue for processing;
Computing resource control action, e.g., the caching selection, the task offloading decisions, the virtual machine allocation;
Computing system performance, e.g., the utilization rate of the computing resources, the processing delay of the offloaded tasks.
III-D Integration of Three Layers
When the environment includes all the three layers of AIoT architecture, the RL models generally include elements defined as follows.
AIoT state is the aggregation of the physical system state, the network resource state, and the computing resource state;
AIoT action is the aggregation of the actuator control action, the communication resource control action, and the computing resource control action;
AIoT reward is normally set to optimize the physical system performance, which can be expressed as a function of the network performance and the computing system performance.
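One possible way to represent the aggregated elements listed above in code is sketched here. The class, its field names, the flattening order, and in particular the form of the reward (a physical performance value penalized by network and computing delays) are assumptions made for illustration; the actual dependence of physical performance on the other two layers is application specific.

```python
from dataclasses import dataclass
from typing import Tuple

Vector = Tuple[float, ...]

@dataclass(frozen=True)
class LayeredTuple:
    """Aggregate of per-layer components, usable for both states and actions."""
    physical: Vector     # perception layer part, e.g. robot position
    network: Vector      # network layer part, e.g. allocated bandwidth
    computing: Vector    # application layer part, e.g. number of queued tasks

    def flatten(self) -> Vector:
        """Concatenate the three parts into one vector, e.g. as NN input."""
        return self.physical + self.network + self.computing

def aiot_reward(base_performance, transmission_delay, processing_delay, delay_penalty=0.5):
    """Hypothetical reward: physical performance degraded by end-to-end delay."""
    return base_performance - delay_penalty * (transmission_delay + processing_delay)

state = LayeredTuple(physical=(1.0, 2.0), network=(20.0,), computing=(3.0,))
action = LayeredTuple(physical=(0.5,), network=(0.1,), computing=(1.0,))
reward = aiot_reward(base_performance=10.0, transmission_delay=2.0, processing_delay=4.0)
```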
As the agent in RL is a logical concept, the RL problem in each layer can be solved by the agent in its respective layer, which observes the states and rewards from its environment and learns policies to determine the corresponding actions, as shown in Fig. 13. However, the physical location of an agent can be different from its logical layer. We classify the devices in which an agent may be located according to their physical locations as
perception layer devices, i.e., IoT devices;
network layer devices, i.e., wireless access points;
application layer devices, i.e., edge/fog/cloud servers.
Fig. 13 shows the mapping between the logical layer of an agent and its physical location. A perception layer agent may be located in IoT devices and/or edge/fog/cloud servers. A network layer agent may be located in wireless access points and/or IoT devices (e.g., for Device-to-Device communications). An application layer agent may be located in edge/fog/cloud servers and/or even IoT devices (e.g., to perform task offloading).
When the environment of an RL problem includes more than one layer, the agents of different layers need to share information and jointly optimize their policies. For example, the network layer may provide transmission delay information to the perception layer to be included as part of the system state; or, the perception layer may provide its optimization objective to the network layer to formulate the reward function. When the agents of different layers share the same physical location, e.g., when both the perception layer agent and the application layer agent are located at the cloud servers, a single logical agent combining the agents of different layers can be considered for the RL problem.
IV Applications of Deep Reinforcement Learning in Autonomous IoT
Although AIoT is a new trend in IoT that has not been adequately studied in existing research, the applications of DRL in each of the three layers of the AIoT architecture have been widely studied in recent works. Therefore, we provide a literature review of the applications of DRL in the perception layer (physical autonomous systems), the network layer (IoT communication networks), and the application layer (IoT edge/fog/cloud computing systems) in this section. As there is a great variety of physical autonomous systems, we focus on the three types of systems that have received the most attention in DRL research for the perception layer, i.e., autonomous robots, smart vehicles, and smart grid. The framework of the literature review is given in Fig. 14. Note that some IoT communication network technologies and IoT edge/fog/cloud computing technologies are designed specifically for a particular physical autonomous system, e.g., vehicular edge/fog/cloud computing and vehicular networks for smart vehicles, and cloud robotics for autonomous robots. In the following survey, we discuss these technologies in the respective physical autonomous system subsection, while the technologies discussed in the IoT communication networks and IoT edge/fog/cloud computing systems subsections are those common to various types of autonomous physical systems.
IV-A Perception Layer - Autonomous Robots
The applications of DRL methods in autonomous robots have been widely discussed. Existing research covers the mobile behavior control of robots, robotic manipulation, the management of multi-robot systems, and cloud robotics.
IV-A1 Mobile Behavior Control
The mobile behavior control mainly refers to the path planning, navigation, and general movement control of robots. DRL approaches have been applied in many existing works for this purpose.
In , the authors apply DQN to the robot behavior learning simulation environment, so that mobile robots can learn to obtain good mobile behavior by using high-dimensional visual information as input data. The authors incorporate profit sharing methods into DQN to speed up learning, and the method reuses the best target network in the case of a sudden drop in learning performance. In order to solve the problem of mobile robot path planning, DQN is designed in  and DDPG is applied in . A mobile robot navigation problem in  is solved by applying the hybrid A3C method.
IV-A2 Robotic Manipulation
Since intelligent robots usually help to perform some operation tasks in practice, appropriate controlling schemes for them are necessary for successful manipulations. The problem of controlling robots to solve compound tasks is solved by a hierarchical DRL algorithm in . In , the authors demonstrate that the DRL algorithm based on off-policy training of deep Q functions can be applied to complex three-dimensional (3D) operation tasks, and can effectively learn DNN strategies to train real physical robots. The policy updates are pooled asynchronously to decrease the training time. Similarly, the problem of learning vision-based dynamic manipulation skills is solved by using a scalable DQN approach in . In , two proposed sample efficient DRL algorithms, i.e., deep P-network (DPN) and dueling deep P-network (DDPN), are applied to real robotic cloth manipulation tasks.
IV-A3 Multi-Robot System
In some cases, multiple robots are required to collaborate properly to fulfil some tasks that are difficult to be accomplished by an individual robot. A review on multi-agent RL in multi-robot systems is provided in . The research in  investigates a DRL approach to the collective behavior acquisition of swarm robotics systems. The multiple robots are expected to collect information in parallel and share their experience for accelerating the learning. In , the authors propose a collaborative multi-robot RL method, which realizes task learning and the emergence of heterogeneous roles under a unified framework. The method interleaves online execution and relearning to accommodate environmental uncertainty and improve performance. The study in  extends the A3C algorithm in single agent problems to a multi-robot scenario, where the robots work together toward a common goal. The policy and critic learning are centralized, while the policy execution is decentralized. A decentralized sensor-level collision avoidance policy for multi-robot systems is proposed in . A multi-scenario multi-stage training framework based on policy gradient methods is used to learn the optimal policy for a large number of robots in a rich, complex environment. The expanding of learning space is an issue to be tackled in multi-robot system. The methodology in  is proposed to minimize the learning space through the use of behaviors and conditions.
IV-A4 Cloud Robotics
The concept of cloud robotics allows the robotic system to offload computing-intensive tasks from the robots to the cloud. Cloud robotics applications include perception and computer vision applications, navigation, grasping or manipulation, manufacturing or service robotics, etc. In , an effective transfer learning scheme based on lifelong federated reinforcement learning (LFRL) is proposed for navigation in cloud robotic systems, where the robots can effectively use prior knowledge and quickly adapt to new environments. The authors in  propose an RL-based resource allocation scheme, which can help the cloud to decide whether a request should be accepted and how many resources should be allocated. The scheme realizes autonomous management of computing resources through online learning, reduces human participation in scheme planning, and improves the overall utility of the system in the long run.
IV-B AIoT Perception Layer - Smart Vehicles
The development of the IoT technology has promoted the development of intelligent transportation systems (ITS). In Internet of Vehicles (IoV), smart vehicles with IoT capabilities including sensing, communications, and data processing can possess artificial intelligence to enhance driving aid. The existing works on the applications of DRL in the smart vehicles mainly include the studies on autonomous driving, vehicular networks, and vehicular edge/fog computing.
IV-B1 Autonomous Driving
The application of DRL methods for the control of the autonomous vehicles is addressed in a number of existing works. The autonomous driving problem can be formulated as an MDP, where the driving status such as position and velocity of the autonomous vehicles as well as other non-autonomous vehicles in proximity are usually characterized as the states, and the driving decisions of the autonomous vehicles such as acceleration and changing lanes are characterized as actions. The rewards are usually related to assessment criteria of the driving operations, such as smoothness and speed.
In  and , deep Q-learning is applied to control simulated cars via a DRL-based algorithm. In , the authors address the autonomous driving issues by presenting an RL-based approach, which is combined with formal safety verification to ensure that only safe actions are chosen at any time. A DRL agent learns to drive as close as possible to a desired velocity by executing reasonable lane changes on simulated highways with an arbitrary number of lanes. Leveraging the advances in DRL, the authors in  use Flow to develop reliable controllers in a mixed-autonomy traffic scenario. In , the leading vehicle and the traffic signal timing condition are taken into account when applying an RL-based method to control the speed of a vehicle. The problem of autonomous vehicle navigation between lanes is formulated as an MDP and solved via RL-based methods in . In , the road geometry is taken into account in the MDP model in order to be applicable to more diverse driving styles. The authors in  apply a continuous, model-free DRL algorithm for autonomous driving. The distance travelled by the autonomous vehicle is used to evaluate the reward in the model. The study in  aims to optimize the driving utility of the autonomous vehicle, and enables the autonomous vehicle to jointly select the motion planning action performed on the road and the communication action of querying the sensed information from the infrastructure. The problem of ramp merging in autonomous driving is tackled in , where LSTM is applied to produce an internal state containing historical driving information, and DQN is applied for Q-function approximation. The authors in  review the applications and address the challenges of real-world deployment of DRL in autonomous driving.
There are also studies on cooperative driving of multiple vehicles. In , the authors present a novel method of cooperative movement planning. RL is applied to solve this decision-making task of how two cars coordinate their movements to avoid collisions and then return to their intended path. A multi-agent multi-objective RL traffic signal control framework is proposed in , which simulates the driver’s behavior, e.g., acceleration or deceleration, continuously in space and time dimensions.
IV-B2 Vehicular Networks
The concept of vehicular networking brings a new level of connectivity to vehicles, and has become a key driver of ITS. The control functionalities in vehicular network can be divided into three parts according to their usages, including communication control, computing control and storage control . In vehicular networks, problems such as resource allocation, caching, and networking, can be formulated and solved via DRL. In  and , the applications of machine learning in studying the dynamics of vehicular networks and making informed decisions to optimize network performance are discussed. In , the authors use a DRL approach to perform joint resource allocation and scheduling in vehicle-to-vehicle (V2V) broadcast communications. In the system, each vehicle makes a decision based on its local observations without the need of waiting for global information. A DRL algorithm based on echo state network (ESN) cells is proposed in , in order to provide an interference-aware path planning scheme for a network of cellular-connected unmanned aerial vehicles (UAVs). In , the authors develop an integration framework that enables dynamic orchestration of networking, caching, and computing resources to improve the performance of vehicular networks. The resource allocation strategy is formulated as a joint optimization problem, in which the gains of networking, caching and computing are all taken into consideration. To solve the problem, a double-dueling-deep Q-network algorithm is proposed. Similarly, deep Q-Learning is applied in  to learn a scheduling policy, which can guarantee both safety and quality-of-service (QoS) concerns in an efficient vehicular network.
IV-B3 Vehicular Edge/Fog/Cloud Computing
Emerging vehicular applications require more computing and communication capabilities to perform well in computing-intensive and latency-sensitive tasks. Vehicular Cloud Computing (VCC) provides a new paradigm in which vehicles interact and collaborate to sense the environment, process the data, propagate the results and more generally share resources. Moreover, Vehicular Edge/Fog Computing (VEC/VFC) focuses on moving computing resources to the edge of the network to resolve latency constraints and reduce cloud ingress traffic [86, 87, 88]. The studies in  and  focus on the service offloading issues in the IoV. The determination of offloading decisions for multiple tasks is considered as a long-term planning problem. Service offloading decision frameworks are proposed, which can provide the optimal policy via DRL. The authors in  propose an optimal computing resource allocation scheme to maximize the total long-term expected return of the vehicular cloud computing system. With multi-access edge computing techniques, roadside units (RSUs) can provide fast caching services to moving vehicles for content providers. In , the authors apply the MDP to model the caching strategy, and propose a heuristic Q-learning solution together with vehicle movement predictions based on an LSTM network.
IV-C AIoT Perception Layer - Smart Grid
The integration of distributed renewable energy sources (DRES) into the power grid introduces the need for autonomous and smart energy management capabilities in the smart grid, due to the intermittent and stochastic nature of the renewable energy sources (RES). With advanced metering infrastructure (AMI) and various types of sensors in the power grid to collect real-time power generation and demand data, RL and DRL provide promising methods to learn efficient energy management policies autonomously in such a complex environment with uncertainty. Specifically, the historical data can be leveraged by powerful DRL algorithms in learning optimal decisions to cope with the high uncertainty of the electrical patterns. A review of machine learning applications in smart grids is presented in . Different from , we focus only on the applications of RL and DRL to the energy management problem with DRES.
IV-C1 Energy Storage Management
One promising method to deal with the lack of knowledge on future electricity generation and consumption is through energy storage. Direct energy storage such as in the battery is one of the energy storage options. RL/DRL applications in microgrid with energy storage system (ESS) to determine the optimal charging/discharging policy have been studied in some recent literature.
In , the problem of optimally activating the storage devices is formulated as a sequential decision making problem. The problem is then solved by a DQN-based approach, without knowing the future electricity consumption and weather-dependent PV production at each step. The authors in  develop an intelligent dynamic energy management system (I-DEMS) for a smart microgrid. The system can effectively schedule the backup battery energy storage and give a robust performance under different battery energy storage conditions. The authors in  design an interconnection topology and an RL-based algorithm to optimize the coordination of different energy storage systems (ESSs) in a microgrid. In , a novel dynamic energy management system is proposed to deal with real-time dispatch problems in microgrids. The developed energy management system can optimize the long-term operational costs of microgrids without long-term forecasts or distribution information about the uncertainty. The authors in  present a multi-agent-based energy and load management approach for distributed energy resources in a microgrid. The suppliers and consumers of electricity maximize their profit by using a model-free Q-learning algorithm. A framework based on RL is presented in  to control the operation, i.e., charging and discharging, of a battery storage device. The objective is to minimize the amount of energy bought from or sold to a microgrid, where a residential consumer, a photovoltaic (PV) system, inverters and a battery storage facility are considered.
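To make the sequential decision structure of the storage problem concrete, the following is a minimal sketch of a single environment transition for a battery in a microgrid. It is not taken from any of the cited works: the function name `battery_step`, the 10 kWh capacity, the lossless battery, and the choice of reward (the negative cost of energy imported from the grid, with no revenue for exports) are all illustrative assumptions.

```python
def battery_step(soc_kwh, action_kwh, pv_kwh, load_kwh, price, capacity_kwh=10.0):
    """One transition of a toy battery-storage MDP (hypothetical model).

    soc_kwh    : current state of charge of the battery
    action_kwh : requested energy; > 0 charges, < 0 discharges
    pv_kwh     : PV production during this step (unknown to the agent in advance)
    load_kwh   : consumer demand during this step
    price      : price per kWh of energy bought from the grid
    Returns (next state of charge, reward).
    """
    # Clip the request to what the battery can actually absorb or deliver.
    action_kwh = max(-soc_kwh, min(action_kwh, capacity_kwh - soc_kwh))
    next_soc = soc_kwh + action_kwh
    # Energy that must be imported (positive) or could be exported (negative).
    net_demand = load_kwh - pv_kwh + action_kwh
    grid_import = max(net_demand, 0.0)
    # Assumed reward: pay for imports only; exports earn nothing in this sketch.
    reward = 0.0 - price * grid_import
    return next_soc, reward


# Discharging 2 kWh exactly covers the deficit between load and PV production.
soc_a, reward_a = battery_step(5.0, -2.0, pv_kwh=1.0, load_kwh=3.0, price=0.3)
# A 5 kWh charge request is clipped to the 1 kWh of remaining capacity.
soc_b, reward_b = battery_step(9.0, 5.0, pv_kwh=0.0, load_kwh=1.0, price=0.2)
```

An RL agent would call such a step function repeatedly, choosing `action_kwh` from the observed state of charge and recent production and demand, and would learn a charging/discharging policy from the accumulated rewards.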
IV-C2 Demand Response
Another method to support the integration of DRES is through demand response (DR) systems, which dynamically adjust electrical demand in response to changing electrical energy prices or other grid signals. Thermostatically controlled loads (TCLs) such as electric water heaters are a prominent example of loads that offer flexibility at the residential level. In fact, TCLs can be seen as a type of energy storage entity through power-to-heat conversion, in contrast to a direct energy storage entity such as a battery. DR can be divided into direct DR and price-driven DR: the energy consumption profile of a user is adjusted according to a utility in the former, and according to the price in the latter. In either case, the energy consumers need to make a continuing sequence of decisions on whether to consume energy at the current (known) utility/price or to defer power consumption until later, at possibly unknown utility/prices.
An energy optimization problem in a smart grid is formulated in . An on-line energy scheduling strategy is learned using deep Q-learning and deep policy gradient methods. For the DR problem, the authors in  propose a new formulation that casts the rescheduling problem of a fully automated energy management system (EMS) as an RL problem, and argue that this problem can be solved approximately by decomposing the RL problem over the device clusters. The control scheme in  applies a model-free batch RL algorithm in combination with a market-based heuristic, which is tested in a stochastic setting, without prior information or system dynamics. In , a stochastic modeling framework based on MDP is presented, in order to employ adaptive control strategies for short-term ancillary services to the power grid by using a population of heterogeneous thermostatically controlled loads. The authors in  propose a novel approach that uses a CNN to extract hidden state-time features in a load control problem. The CNN is used as a function approximator to estimate the Q function in the supervised learning step of fitted Q-iteration. In , the authors study the energy supply plan of a microgrid to support the operation of a Mobile Edge Computing (MEC) system , with the goal of minimizing the energy consumption in the MEC system. The optimization problem is decomposed into two sub-problems: an energy-efficient task allocation problem and an energy supply planning problem. The output of the first sub-problem is used as input to solve the second sub-problem, and model-based deep reinforcement learning (MDRL) is applied to solve them.
IV-C3 Energy Trading
The integration of the DRES into the power grid blurs the distinction between an energy provider and an energy consumer. This is especially true for a microgrid, which may constantly switch its role between provider and consumer depending on whether its generated energy exceeds or falls short of its demanded energy. In fact, a key goal of smart grid design is to facilitate two-way flow of electricity by enhancing the ability of distributed small-scale electricity producers, such as small wind farms or households with solar panels, to sell energy into the power grid. Due to the unpredictability of the DRES, an autonomous control mechanism to ensure power supply/demand balance is essential. One promising method is the introduction of Broker Agents, who buy electricity from distributed producers and also sell electricity to consumers. RL/DRL can be applied for the Broker Agents to learn pricing decisions that effectively maintain that balance and earn profits while doing so, thereby contributing to the stability of the grid through their continued participation.
To overcome the challenges of implementing dynamic pricing and energy scheduling, the authors in  and  study RL algorithms that allow each service provider and each customer to learn their policy with no need of prior information about the microgrid. A microgrid energy trading scheme based on RL is proposed in , which applies the DQN to improve the utility of the microgrids for the case of microgrids with a large number of connections. In , an adaptive learning algorithm is designed to find the Nash equilibrium (NE) of a constrained energy trading game among individual strategic participants with incomplete information. Each player’s goal is to maximize its own average utility by generating a motion probability distribution based on its private information using a learning automaton scheme. In , the authors employ MDP and RL to investigate the learning of pricing strategies for an autonomous Broker Agent to profitably participate in a Tariff Market.
IV-D AIoT Network Layer - IoT Communication Networks
A reliable and efficient wireless communication network is an essential part of the IoT ecosystem. Such wireless networks range from short-range local area networks such as Bluetooth, Zigbee/IEEE 802.15.4, and IEEE 802.11 to long-range wide area networks such as Narrowband Internet of Things (NB-IoT) and LoRaWAN. When designing resource control mechanisms to efficiently utilize the scarce radio resources in transmitting the huge amount of IoT data, the IoT networks need to consider the characteristics of IoT devices, such as their massive number and their limited energy, memory and computation resources. Moreover, the requirements of IoT applications such as low latency and high reliability have to be considered. One of the promising approaches to develop resource control mechanisms tailored for IoT is to enable IoT devices to operate autonomously in a dynamic environment by using learning frameworks such as DRL .
IV-D1 Wireless Sensor Networks
Wireless sensor networks (WSNs) offer practical applications that can directly benefit from artificial intelligence technology. A large-scale IoT application requires a huge number of sensors. In , RL is used for modelling the sensors in the physical, routing and network layers. The routing and network layers deal with the communication capabilities of the sensors. The resource scheduling issues among the sensors are solved in order to optimize the lifetime of the sensors, energy usage and communication costs. A multi-agent system approach on wireless sensor networks is able to tackle the resource constraints in these networks by efficiently coordinating the activities among the nodes. In , the authors consider the coordinated sensing coverage problem and study the behavior and performance of four distributed DRL algorithms, i.e., fully distributed Q-learning, distributed value function (DVF), optimistic DRL, and frequency maximum Q-learning (FMQ). Their performance in terms of communication and computational costs, energy consumption, and sensor coverage levels is evaluated and compared. The authors in  leverage DRL for router selection in wireless networks with heavy traffic. Compared with existing routing algorithms, the proposed algorithms achieve higher network throughput due to the low congestion probability.
IV-D2 Wireless Sensor and Actuator Networks
Wireless sensor and actuator networks (WSANs), e.g., ISA SP100.11a and WirelessHART, have special devices known as network managers which perform tasks such as admission control of devices, definition of routes, and allocation of communication resources. The authors in  present the design and implementation of a simulation system based on DQN for mobile actor node control in a WSAN. In , a global routing agent with Q-Learning is proposed for weight adjustment of the state-of-the-art routing algorithm, aiming at achieving a balance between the overall delay and the lifetime of the network. The study in  focuses on a DRL-based sensor scheduling problem for allocating wireless channels to sensors for the purposes of remote state estimation of dynamical systems. The algorithm can be run online, and is model-free with respect to the wireless channel parameters.
NB-IoT is a technology proposed by 3GPP in Release-13. It offers low energy consumption and extensive coverage to meet the requirements of a variety of social, industrial and environmental IoT applications. Compared to legacy LTE technologies, NB-IoT increases the number of transmission repetitions to serve users in deep coverage. However, a large number of repetitions can reduce system throughput and increase the energy consumption of IoT devices, which can shorten their battery life and increase their maintenance costs. In , the authors propose a new method based on an RL algorithm to enhance NB-IoT coverage. Instead of employing a random spectrum access procedure, dynamic spectrum access can reduce the number of required repetitions, increase the coverage, and reduce the energy consumption. A cooperative multi-agent deep neural network based Q-learning (CMA-DQN) approach is developed in , in which each DQN agent independently controls a configuration variable for each group, in order to maximize the long-term average number of working IoT devices in NB-IoT.
IV-D4 Energy Harvesting
Energy Harvesting (EH) is a promising technology for long-term and self-sustainable operation of the IoT devices. While it can extend the lifetime of IoT devices, it also brings new challenges to resource control due to the stochastic nature of the harvested energy. The work in  studies the joint access control and battery prediction problem in a small-cell IoT system including multiple EH user equipments (UEs) and a base station (BS) with limited uplink access channels. A DQN-based scheduling algorithm that maximizes the uplink transmission sum rate is proposed. For the battery prediction problem, using a fixed round-robin access control policy, an RL-based algorithm is developed to minimize the prediction loss without any model knowledge about the energy source and energy arrival process. In , the energy management policy in an industrial wireless sensor network is investigated to minimize the weighted packet loss rate under the delay constraint, where the packet loss rate accounts for packets lost during both the sensing and the delivering processes. The problem is formulated as an MDP model, and stochastic online learning with a post-decision state is applied to derive a distributed energy allocation algorithm with a water-filling structure and a scheduling algorithm based on an auction mechanism.
|RL/DRL element|Category|Examples|Related works|
|---|---|---|---|
|State|Physical system state|Smart grid: e.g. energy demand/storage/consumption, battery discharge efficiency||
|||Robotics: e.g. position/velocity of the robot, camera image||
|||Vehicles: e.g. position/velocity/orientation angle of the vehicle, distance headways between vehicles||
||Network resource state|Channel information: e.g. SINR, selection of sub-channel||
|||Queue information: e.g. queue length of each user’s data buffer||
||Computation resource state|Available virtual machines||
|||Queue information: e.g. queue length of the task buffer||
|Action|Resource control action|Power allocation||
||Actuator control action|Smart grid: e.g. turning on/off devices, prioritizing the power dispatch||
|||Robotics: e.g. moving direction of robots, opening/closing of grippers||
|||Vehicles: e.g. moving direction/velocity of vehicles||
|Reward|Physical system performance|Power/energy consumption||
|||Manipulation objectives: e.g. away from obstacles, reaching the target, a successful/failed grasp||
||Network system performance|Transmission delay||
|||Transmission reliability: e.g. error probability, packet loss rate||
||Computing system performance|Processing delay||
|||Utilization rate of computing resources||
IV-E AIoT Application Layer - IoT Edge/Fog/Cloud Computing Systems
Edge/fog/cloud computing is a helpful technique in realizing IoT. In such systems, multiple users can offload the computationally intensive tasks to the edge/fog/cloud servers.
IV-E1 Task Offloading and Resource Allocation
Decisions must be made on whether to offload the computing tasks to the edge/fog/cloud servers or to perform them locally at the IoT devices, and on the amount of resources allocated to each IoT device. The problems of task offloading and resource allocation in edge/fog/cloud computing are widely discussed. The resources to allocate include both the computing resources and the communication resources. In , a real-time adaptive policy based on deep Q-learning is learned in a MEC system. The policy allocates computing resources for the offloaded tasks of multiple users. To meet the reliability requirements of end-to-end services, the objective is to reduce the delay violation probability and the decoding error probability. Similarly, a joint task offloading decision and bandwidth allocation optimization method based on a DQN is designed for the MEC system in . The overall offloading cost is evaluated in terms of energy cost, computation cost, and delay cost. Besides the commonly considered service delay, the utilization rate of the physical machine and the power consumption are also taken into account in  , respectively, when designing the offloading policies via DRL approaches. In , a scheme named deep reinforcement learning based resource allocation (DRLRA) is proposed to allocate computing and network resources adaptively, in order to reduce the delay and balance the use of resources under a varying MEC environment. In , several RL methods, e.g., Q-learning, SARSA, Expected SARSA, and Monte Carlo, are each applied to solve the Fog-RAN resource allocation issues. The performance and applicability of the methods are verified. In , a joint computation offloading and multi-user scheduling algorithm in an NB-IoT edge computing system is proposed to minimize the long-term average weighted sum of delay and power consumption. Linear value-function approximation and TD learning with a post-decision state and the semi-gradient descent method are applied to derive a simple algorithm for the solution. In , a DRL-based approach is applied to manage the mode selection in fog radio access networks (F-RANs). In , the authors present a novel DRL-based framework for power-efficient resource allocation in cloud radio access networks (C-RANs). The authors in  propose a DRL-based approach that is able to manage data migration in MEC scenarios by learning during the system evolution. In , a DRL-based computing offloading approach is proposed to learn the optimal offloading policy in a space-air-ground integrated network (SAGIN).
Caching IoT data at the network edge can alleviate the congestion and delays in transmitting IoT data through wireless networks. The research in  solves the problem of caching IoT data at the edge with the help of DRL. The proposed data caching policy aims to strike a balance between the communication cost and the loss of data freshness. In , the issue of caching strategy is tackled together with the offloading policy and resource allocation.
Based on the above literature review, we summarize and list some typical values of states, actions, and rewards in Table II, arranged in different categories as given in Section III corresponding to the three layers in AIoT architecture.
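As an illustration of how entries from these categories might be combined in one model, the sketch below encodes a state, an action, and a reward for a hypothetical task-offloading problem. All names (`OffloadState`, `OffloadAction`, `reward`) and the weights are assumptions made for this example and do not come from any cited work; the reward is simply a negative weighted cost in delay, energy, and packet loss.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffloadState:
    channel_sinr_db: float   # network resource state (channel information)
    task_queue_length: int   # computation resource state (queue information)
    battery_level: float     # physical system state, as a fraction in [0, 1]


@dataclass(frozen=True)
class OffloadAction:
    offload: bool            # resource control: run at the edge server or locally
    tx_power_w: float        # resource control: power allocation


def reward(delay_s, energy_j, packet_lost,
           w_delay=1.0, w_energy=0.5, loss_penalty=10.0):
    """Negative weighted cost; the weights are illustrative assumptions."""
    cost = w_delay * delay_s + w_energy * energy_j
    if packet_lost:
        cost += loss_penalty  # transmission reliability term
    return -cost


state = OffloadState(channel_sinr_db=12.0, task_queue_length=3, battery_level=0.8)
action = OffloadAction(offload=True, tx_power_w=0.1)
r_ok = reward(delay_s=0.2, energy_j=0.4, packet_lost=False)
r_lost = reward(delay_s=0.2, energy_j=0.4, packet_lost=True)
```

Frozen dataclasses are hashable, so such states and actions can serve directly as keys of a tabular Q function, or be flattened into a feature vector for a neural network.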
V Challenges, Open Issues, and Future Research Directions
Although DRL is a powerful theoretical tool that is well-suited to the task of introducing artificial intelligence to AIoT systems, there are still a lot of challenges and open issues to be overcome and addressed. The following lists some of the future research directions in this area.
V-A Incomplete Perception Problem
In AIoT systems, it might not be possible for the agent to have perfect and complete perception of the state of the environment. This could be due to:
limited sensing capabilities of sensors in the perception layer;
information loss due to limited transmission capability in the network layer.
An important challenge in applying DRL to AIoT systems is to learn with incomplete perception or partially observable states. The MDP model is no longer valid, as the state information is no longer sufficient to support the decision on the optimal action. The action can be improved if more information is available to the agent in addition to the state information. Although the DRL algorithms and methods introduced in Section II.E can be applied, there are still some open issues with the POMDP-based DRL algorithms. Firstly, an agent in a POMDP needs to select an action based on the observation history, whose space grows exponentially. Approaches proposed for this problem require large memory and can only work well for small discrete observation spaces . Secondly, when a belief state is introduced to POMDP problems, the belief space does not grow exponentially, but knowledge of the model becomes essential for the agent, which is not suitable for many complicated scenarios. Finally, nearly all algorithms for POMDP problems face a challenge referred to as the information gathering and exploitation dilemma. In a POMDP, the agent does not know exactly what the current state is. It needs to decide whether to gather more information about the true state first or to exploit its current knowledge first. Consequently, in order to find the optimal policy, an agent in a POMDP needs more interactions with the environment. Apart from the above challenges associated with POMDP-based DRL problems, the DRL model formulation and parameter optimization differ case by case across AIoT systems. Moreover, more efficient algorithms could be designed according to the specific characteristics of AIoT systems.
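The dependence of the belief-state approach on model knowledge can be seen in the standard Bayes-filter belief update, sketched below for a discrete POMDP. The update needs both the transition model `T` and the observation model `O`, which is exactly the knowledge that may be unavailable in complicated AIoT scenarios. The function name, the nested-list layout of the models, and the two-state device example are assumptions made for illustration.

```python
def belief_update(belief, action, observation, T, O):
    """Bayes-filter update: b'(s') is proportional to
    O[a][s'][o] * sum_s T[a][s][s'] * b(s)."""
    n = len(belief)
    unnormalized = []
    for s_next in range(n):
        # Prediction step: probability of reaching s_next under the transition model.
        predicted = sum(T[action][s][s_next] * belief[s] for s in range(n))
        # Correction step: weight by the likelihood of the received observation.
        unnormalized.append(O[action][s_next][observation] * predicted)
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [p / total for p in unnormalized]


# Hypothetical example: a device is either ok (state 0) or faulty (state 1).
T = {"wait": [[0.9, 0.1],    # ok -> ok / faulty
              [0.0, 1.0]]}   # faulty stays faulty
O = {"wait": [[0.8, 0.2],    # state ok: normal reading (0) / alarm (1)
              [0.3, 0.7]]}   # state faulty: normal reading / alarm
b0 = [0.5, 0.5]
b1 = belief_update(b0, "wait", 1, T, O)  # an alarm shifts belief towards "faulty"
```

The belief vector has a fixed size equal to the number of states, in contrast to an observation history that grows with every time step.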
V-B Delayed Control Problem
In DRL problems, we normally consider that an action is exerted as soon as it is selected by the agent, and a corresponding reward is immediately available at the agent. However, a challenge in applying DRL to real-world AIoT system is to learn despite the existence of control delay, i.e., the delay between measuring a system’s state and acting upon it. Control delay is always present in real systems due to transporting measurement data to the learning agent, computing the next action, and changing the state of the actuator. Therefore, it is important to design RL/DRL algorithms which take the control delay into account.
Most existing RL algorithms do not consider the control delay. At each time step $t$, the state $s_t$ of the environment is observed, and an action $a_t$ is immediately determined by the agent. However, in practice, the action actually executed at time step $t$ might be the action generated $d$ time steps before, i.e., $a_{t-d}$. In this case, the next state $s_{t+1}$ depends on the current state and a previously determined action, i.e., $(s_t, a_{t-d})$, instead of the current state and currently determined action pair $(s_t, a_t)$, so the state transition violates the Markov property. Therefore, the MDP model on which RL/DRL algorithms are based is no longer valid, and a POMDP model is more appropriate.
In order to deal with the delayed control problem, existing works in RL have developed several methods [139, 140, 141]. The first method  incorporates the past actions taken during the length of the delay into the current state in formulating an MDP model, so that classical RL methods such as TD-learning and Q-learning can be applied. However, this method results in a larger state space, with the state dimensionality depending on the number of time steps of the delay. The second method  learns a state transition model so that it can predict the state at which the currently selected action is actually going to be executed. Then, a model-based RL algorithm can be applied. However, the learning process of the underlying model is usually time-consuming and will itself incur additional delay. Finally, in the third method , classical model-free RL algorithms such as TD-learning and Q-learning are applied, except that at each time step $t$, the Q function is updated with respect to the current state and the action actually executed, i.e., $Q(s_t, a_{t-d})$ for a delay of $d$ time steps, instead of the usual $Q(s_t, a_t)$ with respect to the current state and the currently generated action.
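The first and third methods can be sketched in a few lines for a constant delay. The sketch below is a simplified illustration rather than the algorithms of the cited works: the class `DelayedActionBuffer`, the function `q_update_executed`, and the Q-learning-style bootstrap with a maximum over next actions are assumptions made for this example.

```python
from collections import defaultdict, deque


class DelayedActionBuffer:
    """Holds actions that were selected but not yet executed (constant delay)."""

    def __init__(self, delay, default_action):
        # Before any selected action arrives, the actuator runs a default action.
        self.pending = deque([default_action] * delay)

    def push(self, selected_action):
        """Queue the newly selected action and return the one executed now."""
        self.pending.append(selected_action)
        return self.pending.popleft()

    def augmented_state(self, state):
        """First method: append the pending actions to the observed state,
        so that the augmented process is Markov again."""
        return (state, tuple(self.pending))


def q_update_executed(Q, s, a_executed, reward, s_next, actions,
                      alpha=0.1, gamma=0.9):
    """Third method: update the Q value of the action that actually took
    effect in state s, rather than the action just selected."""
    best_next = max(Q[(s_next, a)] for a in actions)
    Q[(s, a_executed)] += alpha * (reward + gamma * best_next - Q[(s, a_executed)])


buf = DelayedActionBuffer(delay=2, default_action="noop")
executed = [buf.push(a) for a in ["left", "right", "left"]]
aug = buf.augmented_state("s3")

Q = defaultdict(float)
q_update_executed(Q, "s0", executed[0], 1.0, "s1", ["left", "right", "noop"])
```

With a delay of two steps, the first two executed actions are the default action and the third is the action selected at the first step. The augmented state grows by one action per step of delay, which is the state-space growth mentioned above.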
The above methods mostly focus on the constant delay problem. However, the actual delay in an AIoT system is likely to be stochastic. Moreover, the delay can depend on the communication and computation resource control actions in the IoT communications networks and edge/fog/cloud servers. Therefore, developing RL algorithms to consider stochastic control delay or control delay that depends on other parameters is an open issue. Another important challenge is how to extend the above algorithms from RL to DRL leveraging the powerful neural networks while dealing with the intrinsic complexities.
V-C Multi-Agent Control Problem
The agent in RL is a virtual concept that learns the optimal policy by interacting with the environment. In AIoT system, agents can be implemented in IoT devices, edge/fog servers, and cloud servers as discussed previously. For a single RL task, there are some typical scenarios for the implementation of agents:
centralized architecture: a single agent in a cloud server, edge/fog node, or an IoT device;
distributed architecture: multiple agents with each agent implemented in an IoT device or edge/fog server;
semi-distributed architecture: one centralized agent in a cloud server or edge/fog server and multiple distributed agents in edge/fog servers or IoT devices.
For distributed and semi-distributed architectures, it is an important challenge to enable efficient collaboration and fair competition among multiple agents in a single RL task. The tasks of the agents in a multi-agent system may be different, and they are coupled to each other. Therefore, the design of a reasonable joint reward function becomes a challenge, which may directly affect the performance of the learned policy. Compared to the stable environment in the single-agent RL problem, the environment in multi-agent RL is complex and dynamic, which brings challenges to the design of multi-agent DRL approaches.
In most existing multi-agent DRL methods, the agents are assumed to have the same capability. For example, the robots in a multi-robot system have the same manipulation ability, or the multiple vehicles in a cooperative driving scenario have the same kinematic performance. Thus, the application of DRL in heterogeneous multi-agent systems remains to be further studied. The heterogeneity makes cooperative decisions more complex, since each agent needs to model other agents when their capabilities are unknown. Although the multi-agent DRL algorithms and methods introduced in Section II.F can be applied to solve the problem of space explosion and guarantee the convergence of the algorithm, the DRL model formulation, parameter optimization, as well as algorithm adaptation and improvement remain open issues. Moreover, significant progress in the field of multi-agent reinforcement learning can be achieved by more intensive cross-domain research between the fields of machine learning, game theory, and control theory.
V-D Joint Resource and Actuator Control Problem
In AIoT systems, there are two levels of control, i.e., resource control and actuator control as discussed previously. Although the ultimate objective is to optimize the long-term reward of the physical system by selecting appropriate actuator control actions, the computation and network resource control actions will impact the physical system performance through their effects on the network and computation system performances. For example, an efficient network resource control policy can result in larger data transmission rates for the sensory data, and thus allow more information to be available at the cloud server for the agent to derive an improved policy. Currently, most existing research works either optimize the computation and/or network performances for IoT systems, or optimize the physical system performance considering an ideal communication and computation environment. Therefore, how to jointly optimize the two levels of control actions to achieve an optimized physical system performance is an important open issue for applying DRL in AIoT system.
When the RL/DRL environment includes more than one layer in the AIoT architecture, the corresponding RL/DRL model will be more complex, as discussed in Section III. For example, instead of optimizing normal network performance such as transmission delay, transmission power, and packet loss rate in the network layer, the communication resource control actions need to be selected to optimize the control performance of a physical autonomous system, which may be a function of the network performance. In order to optimize the control performance, the best trade-off between several network performance metrics may need to be considered. For example, a larger amount of sensory data may be transmitted at the cost of a larger transmission delay, which relieves the incomplete perception problem but aggravates the delayed control problem discussed above. There are many challenges in modeling and solving such complex RL/DRL problems. Firstly, feature selection is a crucial task. An appropriate feature selection can lead to better generalization, which in most scenarios helps with the bias-overfitting tradeoff discussed below. When too many features are taken into consideration, it is hard for the agent to determine which features are indispensable. Although some features may play a key role in the reconstruction of the observation, they may be discarded because they are not directly related to the current task. Secondly, the selection of the algorithm and function approximator is also a difficult task. The function approximator used for the value function or policy converts the features into a higher-level abstraction. Sometimes the approximator is too simple to avoid bias, while sometimes it is too complex to generalize well from the limited dataset, i.e., it overfits. Errors resulting from this bias/overfitting problem need to be overcome, so an appropriate approximator needs to be chosen according to the current task.
Thirdly, in such complex RL/DRL problems, the objective function needs to be modified. Typical approaches include reward shaping and discount factor tuning. Reward shaping adds an additional shaping function to the original reward function. It is mainly used for DRL problems with sparse and delayed rewards . Discount factor tuning helps to adjust the impact of temporally distant rewards. When the discount factor is high, the training process tends to be unstable in convergence, and when the discount factor is low, some potential rewards will be discarded . Hence, modifying the objective function can help to tackle the above problems to some extent.
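Both modifications can be illustrated on a toy episode with a single sparse reward at the end. The sketch below uses potential-based shaping, in which the added term is the discounted difference of a potential function between successive states; this is one common form of reward shaping and is chosen here as an assumption, as are the corridor example and the distance-based potential.

```python
def shaped_reward(reward, phi_s, phi_s_next, gamma):
    """Potential-based shaping: r' = r + gamma * Phi(s') - Phi(s)."""
    return reward + gamma * phi_s_next - phi_s


def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence, accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical corridor: positions 0..5, reward 1 only on reaching position 5.
path = [0, 1, 2, 3, 4, 5]
sparse = [0.0, 0.0, 0.0, 0.0, 1.0]


def phi(pos):
    """Assumed potential: negative distance to the goal at position 5."""
    return -(5 - pos)


# Shaping turns the sparse signal into a dense one that rewards each step forward.
shaped = [shaped_reward(r, phi(s), phi(s_next), 1.0)
          for r, s, s_next in zip(sparse, path[:-1], path[1:])]

# A low discount factor nearly discards the distant reward; a high one keeps it.
low = discounted_return(sparse, 0.5)
high = discounted_return(sparse, 0.99)
```

In this example the return seen from the start state is 0.0625 with a discount factor of 0.5 and about 0.96 with 0.99, which shows how a low discount factor discards temporally distant rewards.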
VI Conclusion
This paper has presented the model, applications and challenges of DRL in AIoT systems. Firstly, a summary of the existing RL/DRL methods has been provided. Then, the general model of the AIoT system has been proposed, including the DRL framework for AIoT based on the three-layer structure of IoT. The applications of DRL in AIoT have been classified into several categories, and the applied methods and the typical state/action/reward in the models have been summarized. Finally, the challenges and open issues for future research have been identified.
-  P. J. Antsaklis, K. M. Passino, and S. Wang, “An introduction to autonomous control systems,” IEEE Control Syst. Mag., vol. 11, no. 4, pp. 5–13, 1991.
-  (2018) Smarter Things: The autonomous IoT. [Online]. Available: http://gdruk.com/smarter-things-autonomous-iot/
-  M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, “Deep learning for IoT big data and streaming analytics: A survey,” IEEE Communications Surveys & Tutorials, vol. 20, no. 4, pp. 2923–2960, 2018.
-  R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
-  M. S. Mahdavinejad, M. Rezvan, M. Barekatain, P. Adibi, P. Barnaghi, and A. P. Sheth, “Machine learning for Internet of Things data analysis: A survey,” Digital Communications and Networks, vol. 4, no. 3, pp. 161–175, 2018.
-  V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, J. Pineau et al., “An introduction to deep reinforcement learning,” Foundations and Trends® in Machine Learning, vol. 11, no. 3-4, pp. 219–354, 2018.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber et al., “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies,” 2001.
-  H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
-  T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
-  Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, “Dueling network architectures for deep reinforcement learning,” arXiv preprint arXiv:1511.06581, 2015.
-  S.-I. Amari, “Natural gradient works efficiently in learning,” Neural computation, vol. 10, no. 2, pp. 251–276, 1998.
-  S. M. Kakade, “A natural policy gradient,” in Advances in neural information processing systems, 2002, pp. 1531–1538.
-  J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
-  R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
-  V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems, 2000, pp. 1008–1014.
-  Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, “Sample efficient actor-critic with experience replay,” arXiv preprint arXiv:1611.01224, 2016.
-  V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
-  T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
-  D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in ICML, 2014.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
-  G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap, “Distributed distributional deterministic policy gradients,” arXiv preprint arXiv:1804.08617, 2018.
-  S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” arXiv preprint arXiv:1802.09477, 2018.
-  R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
-  N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver, “Memory-based control with recurrent neural networks,” arXiv preprint arXiv:1512.04455, 2015.
-  J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
-  Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba, “Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation,” in Advances in neural information processing systems, 2017, pp. 5279–5288.
-  S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, “Q-prop: Sample-efficient policy gradient with an off-policy critic,” arXiv preprint arXiv:1611.02247, 2016.
-  G. Shani, J. Pineau, and R. Kaplow, “A survey of point-based POMDP solvers,” Autonomous Agents and Multi-Agent Systems, vol. 27, no. 1, pp. 1–51, 2013.
-  P. Dai, C. H. Lin, D. S. Weld et al., “POMDP-based control of workflows for crowdsourcing,” Artificial Intelligence, vol. 202, pp. 52–85, 2013.
-  D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber, “Solving deep memory POMDPs with recurrent policy gradients,” in International Conference on Artificial Neural Networks. Springer, 2007, pp. 697–706.
-  K. P. Murphy, “A survey of POMDP solution techniques,” environment, vol. 2, p. X3, 2000.
-  M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” in 2015 AAAI Fall Symposium Series, 2015.
-  G. Wayne, C.-C. Hung, D. Amos, M. Mirza, A. Ahuja, A. Grabska-Barwinska, J. Rae, P. Mirowski, J. Z. Leibo, A. Santoro et al., “Unsupervised predictive memory in a goal-directed agent,” arXiv preprint arXiv:1803.10760, 2018.
-  M. Egorov, “Deep reinforcement learning with POMDPs,” 2015.
-  P. Zhu, X. Li, P. Poupart, and G. Miao, “On improving deep reinforcement learning for POMDPs,” arXiv preprint arXiv:1804.06309, 2018.
-  J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate to solve riddles with deep distributed recurrent Q-networks,” arXiv preprint arXiv:1602.02672, 2016.
-  L. Bu, R. Babu, B. De Schutter et al., “A comprehensive survey of multiagent reinforcement learning,” IEEE Trans. Syst., Man, Cybern. C, Appl., Rev., vol. 38, no. 2, pp. 156–172, 2008.
-  M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative learning,” 1997.
-  M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine learning proceedings 1994. Elsevier, 1994, pp. 157–163.
-  J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003.
-  M. L. Littman, “Friend-or-foe Q-learning in general-sum games,” in ICML, vol. 1, 2001, pp. 322–328.
-  M. Bowling and M. Veloso, “Rational and convergent learning in stochastic games,” in International joint conference on artificial intelligence, vol. 17, no. 1. Lawrence Erlbaum Associates Ltd, 2001, pp. 1021–1026.
-  J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, “Stabilising experience replay for deep multi-agent reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 1146–1155.
-  E. Van der Pol and F. A. Oliehoek, “Coordinated deep reinforcement learners for traffic light control,” Proceedings of Learning, Inference and of Multi-Agent Systems (at NIPS 2016), 2016.
-  J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  H. Sasaki, T. Horiuchi, and S. Kato, “A study on vision-based mobile robot learning by deep Q-network,” in 2017 56th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE). IEEE, 2017, pp. 799–804.
-  J. Xin, H. Zhao, D. Liu, and M. Li, “Application of deep reinforcement learning in mobile robot path planning,” in 2017 Chinese Automation Congress (CAC). IEEE, 2017, pp. 7112–7116.
-  T. Yan, Y. Zhang, and B. Wang, “Path planning for mobile robot’s continuous action space based on deep reinforcement learning,” in 2018 International Conference on Big Data and Artificial Intelligence (BDAI). IEEE, 2018, pp. 42–46.
-  T. Tongloy, S. Chuwongin, K. Jaksukam, C. Chousangsuntorn, and S. Boonsang, “Asynchronous deep reinforcement learning for the mobile robot navigation with supervised auxiliary tasks,” in 2017 2nd International Conference on Robotics and Automation Engineering (ICRAE). IEEE, 2017, pp. 68–72.
-  Z. Yang, K. Merrick, L. Jin, and H. A. Abbass, “Hierarchical deep reinforcement learning for continuous action control,” IEEE Trans. Neural Netw. Learn. Syst., no. 99, pp. 1–11, 2018.
-  S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3389–3396.
-  D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., “Scalable deep reinforcement learning for vision-based robotic manipulation,” in Conference on Robot Learning, 2018, pp. 651–673.
-  Y. Tsurumine, Y. Cui, E. Uchibe, and T. Matsubara, “Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation,” Robotics and Autonomous Systems, vol. 112, pp. 72–83, 2019.
-  E. Yang and D. Gu, “A survey on multiagent reinforcement learning towards multi-robot systems.” in CIG, 2005.
-  T. Yasuda and K. Ohkura, “Collective behavior acquisition of real robotic swarms using deep reinforcement learning,” in 2018 Second IEEE International Conference on Robotic Computing (IRC). IEEE, 2018, pp. 179–180.
-  X. Sun, T. Mao, J. D. Kralik, and L. E. Ray, “Cooperative multi-robot reinforcement learning: A framework in hybrid state space,” in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2009, pp. 1190–1196.
-  G. Sartoretti, Y. Wu, W. Paivine, T. S. Kumar, S. Koenig, and H. Choset, “Distributed reinforcement learning for multi-robot decentralized collective construction,” in Distributed Autonomous Robotic Systems. Springer, 2019, pp. 35–49.
-  P. Long, T. Fanl, X. Liao, W. Liu, H. Zhang, and J. Pan, “Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 6252–6259.
-  M. J. Matarić, “Reinforcement learning in the multi-robot domain,” in Robot colonies. Springer, 1997, pp. 73–83.
-  O. Saha and P. Dasgupta, “A comprehensive survey of recent trends in cloud robotics architectures and applications,” Robotics, vol. 7, no. 3, p. 47, 2018.
-  B. Liu, L. Wang, M. Liu, and C. Xu, “Lifelong federated reinforcement learning: A learning architecture for navigation in cloud robotic systems,” arXiv preprint arXiv:1901.06455, 2019.
-  H. Liu, S. Liu, and K. Zheng, “A reinforcement learning-based resource allocation scheme for cloud robotics,” IEEE Access, vol. 6, pp. 17 215–17 222, 2018.
-  A. Yu, R. Palefsky-Smith, and R. Bedi, “Deep reinforcement learning for simulated autonomous vehicle control,” Course Project Reports: Winter, pp. 1–7, 2016.
-  M. Vitelli and A. Nayebi, “CARMA: A deep reinforcement learning approach to autonomous driving,” Tech. rep. Stanford University, Tech. Rep., 2016.
-  B. Mirchevska, C. Pek, M. Werling, M. Althoff, and J. Boedecker, “High-level decision making for safe and reasonable autonomous lane changing using reinforcement learning,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2156–2162.
-  C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, “Flow: Architecture and benchmarking for reinforcement learning in traffic control,” arXiv preprint arXiv:1710.05465, 2017.
-  H. D. Gamage and J. B. Lee, “Reinforcement learning based driving speed control for two vehicle scenario,” in Australasian Transport Research Forum (ATRF), 39th, 2017, Auckland, New Zealand, 2017.
-  M. Mueller, “Reinforcement learning: MDP applied to autonomous navigation,” 2017.
-  C. You, J. Lu, D. Filev, and P. Tsiotras, “Highway traffic modeling and decision making for autonomous vehicle using reinforcement learning,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1227–1232.
-  A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V.-D. Lam, A. Bewley, and A. Shah, “Learning to drive in a day,” arXiv preprint arXiv:1807.00412, 2018.
-  M. K. Pal, R. Bhati, A. Sharma, S. K. Kaul, S. Anand, and P. Sujit, “A reinforcement learning approach to jointly adapt vehicular communications and planning for optimized driving,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 3287–3293.
-  P. Wang and C.-Y. Chan, “Formulation of deep reinforcement learning architecture toward autonomous driving for on-ramp merge,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 1–6.
-  V. Talpaert, I. Sobh, B. R. Kiran, P. Mannion, S. Yogamani, A. El-Sallab, and P. Perez, “Exploring applications of deep reinforcement learning for real-world autonomous driving systems,” arXiv preprint arXiv:1901.01536, 2019.
-  Q. Wang and C. Phillips, “Cooperative collision avoidance for multi-vehicle systems using reinforcement learning,” in 2013 18th International Conference on Methods & Models in Automation & Robotics (MMAR). IEEE, 2013, pp. 98–102.
-  M. A. Khamis and W. Gomaa, “Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework,” Engineering Applications of Artificial Intelligence, vol. 29, pp. 134–151, 2014.
-  K. Zheng, L. Hou, H. Meng, Q. Zheng, N. Lu, and L. Lei, “Soft-defined heterogeneous vehicular network: architecture and challenges,” IEEE Network, vol. 30, no. 4, pp. 72–80, 2016.
-  L. Liang, H. Ye, and G. Y. Li, “Toward intelligent vehicular networks: A machine learning framework,” IEEE Internet of Things Journal, vol. 6, no. 1, pp. 124–135, 2019.
-  H. Ye, L. Liang, G. Y. Li, J. Kim, L. Lu, and M. Wu, “Machine learning for vehicular networks,” arXiv preprint arXiv:1712.07143, 2017.
-  H. Ye and G. Y. Li, “Deep reinforcement learning based distributed resource allocation for V2V broadcasting,” in 2018 14th International Wireless Communications & Mobile Computing Conference (IWCMC). IEEE, 2018, pp. 440–445.
-  U. Challita, W. Saad, and C. Bettstetter, “Interference management for cellular-connected uavs: A deep reinforcement learning approach,” IEEE Trans. Wirel. Commun., 2019.
-  Y. He, N. Zhao, and H. Yin, “Integrated networking, caching, and computing for connected vehicles: A deep reinforcement learning approach,” IEEE Trans. Veh. Technol., vol. 67, no. 1, pp. 44–55, 2018.
-  R. F. Atallah, C. M. Assi, and M. J. Khabbaz, “Scheduling the operation of a connected vehicular network using deep reinforcement learning,” IEEE Trans. Intell. Transp. Syst., no. 99, pp. 1–14, 2018.
-  A. Mehmood, S. H. Ahmed, and M. Sarkar, “Cyber-physical systems in vehicular communications,” in Handbook of Research on Advanced Trends in Microwave and Communication Engineering. IGI Global, 2017, pp. 477–497.
-  Y. Xiao and C. Zhu, “Vehicular fog computing: Vision and challenges,” in 2017 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops). IEEE, 2017, pp. 6–9.
-  J. C. Nobre, A. M. de Souza, D. Rosario, C. Both, L. A. Villas, E. Cerqueira, T. Braun, and M. Gerla, “Vehicular software-defined networking and fog computing: integration and design principles,” Ad Hoc Networks, vol. 82, pp. 172–181, 2019.
-  Z. Ning, J. Huang, and X. Wang, “Vehicular fog computing: Enabling real-time traffic management for smart cities,” IEEE Wirel. Commun., vol. 26, no. 1, pp. 87–93, 2019.
-  Q. Qi, J. Wang, Z. Ma, H. Sun, Y. Cao, L. Zhang, and J. Liao, “Knowledge-driven service offloading decision for vehicular edge computing: A deep reinforcement learning approach,” IEEE Trans. Veh. Technol., 2019.
-  Q. Qi and Z. Ma, “Vehicular edge computing via deep reinforcement learning,” arXiv preprint arXiv:1901.04290, 2018.
-  K. Zheng, H. Meng, P. Chatzimisios, L. Lei, and X. Shen, “An SMDP-based resource allocation in vehicular cloud computing systems,” IEEE Trans. Ind. Electron., vol. 62, no. 12, pp. 7920–7928, 2015.
-  L. Hou, L. Lei, K. Zheng, and X. Wang, “A Q-learning based proactive caching strategy for non-safety related services in vehicular networks,” IEEE Internet Things J., 2018.
-  D. Zhang, X. Han, and C. Deng, “Review on the research and practice of deep learning and reinforcement learning in smart grids,” CSEE Journal of Power and Energy Systems, vol. 4, no. 3, pp. 362–370, 2018.
-  V. François-Lavet, D. Taralla, D. Ernst, and R. Fonteneau, “Deep reinforcement learning solutions for energy microgrids management,” in European Workshop on Reinforcement Learning (EWRL 2016), 2016.
-  G. K. Venayagamoorthy, R. K. Sharma, P. K. Gautam, and A. Ahmadi, “Dynamic energy management system for a smart microgrid,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 8, pp. 1643–1656, 2016.
-  X. Qiu, T. A. Nguyen, and M. L. Crow, “Heterogeneous energy storage optimization for microgrids,” IEEE Trans. Smart Grid, vol. 7, no. 3, pp. 1453–1461, 2016.
-  P. Zeng, H. Li, H. He, and S. Li, “Dynamic energy management of a microgrid using approximate dynamic programming and deep recurrent neural network learning,” IEEE Transactions on Smart Grid, 2018.
-  E. Foruzan, L.-K. Soh, and S. Asgarpoor, “Reinforcement learning approach for optimal distributed energy management in a microgrid,” IEEE Transactions on Power Systems, vol. 33, no. 5, pp. 5749–5758, 2018.
-  B. Mbuwir, F. Ruelens, F. Spiessens, and G. Deconinck, “Reinforcement learning-based battery energy management in a solar microgrid,” Energy-Open, vol. 2, no. 4, p. 36, 2017.
-  E. Mocanu, D. C. Mocanu, P. H. Nguyen, A. Liotta, M. E. Webber, M. Gibescu, and J. G. Slootweg, “On-line building energy optimization using deep reinforcement learning,” IEEE Trans. Smart Grid, 2018.
-  Z. Wen, D. O’Neill, and H. Maei, “Optimal demand response using device-based reinforcement learning,” IEEE Trans. Smart Grid, vol. 6, no. 5, pp. 2312–2324, 2015.
-  F. Ruelens, B. J. Claessens, S. Vandael, S. Iacovella, P. Vingerhoets, and R. Belmans, “Demand response of a heterogeneous cluster of electric water heaters using batch reinforcement learning,” in 2014 Power Systems Computation Conference. IEEE, 2014, pp. 1–7.
-  E. C. Kara, M. Berges, B. Krogh, and S. Kar, “Using smart devices for system-level management and control in the smart grid: A reinforcement learning framework,” in 2012 IEEE Third International Conference on Smart Grid Communications (SmartGridComm). IEEE, 2012, pp. 85–90.
-  B. J. Claessens, P. Vrancx, and F. Ruelens, “Convolutional neural networks for automatic state-time feature extraction in reinforcement learning applied to residential load control,” arXiv preprint arXiv:1604.08382, 2016.
-  M. S. Munir, S. F. Abedin, N. H. Tran, and C. S. Hong, “When edge computing meets microgrid: A deep reinforcement learning approach,” IEEE Internet Things J., 2019.
-  B.-G. Kim, Y. Zhang, M. Van Der Schaar, and J.-W. Lee, “Dynamic pricing for smart grid with reinforcement learning,” in 2014 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2014, pp. 640–645.
-  B. G. Kim, Y. Zhang, M. Van Der Schaar, and J.-W. Lee, “Dynamic pricing and energy consumption scheduling with reinforcement learning,” IEEE Trans. Smart Grid, vol. 7, no. 5, pp. 2187–2198, 2016.
-  L. Xiao, X. Xiao, C. Dai, M. Pengy, L. Wang, and H. V. Poor, “Reinforcement learning-based energy trading for microgrids,” arXiv preprint arXiv:1801.06285, 2018.
-  H. Wang, T. Huang, X. Liao, H. Abu-Rub, and G. Chen, “Reinforcement learning for constrained energy trading games with incomplete information,” IEEE Trans. Cybern., vol. 47, no. 10, pp. 3404–3416, 2017.
-  P. P. Reddy and M. M. Veloso, “Strategy learning for autonomous agents in smart grid markets,” in Twenty-second international joint conference on artificial intelligence, 2011.
-  T. Park, N. Abuzainab, and W. Saad, “Learning how to communicate in the Internet of Things: Finite resources and heterogeneity,” IEEE Access, vol. 4, pp. 7063–7073, 2016.
-  T. P. Kumar and P. V. Krishna, “Power modelling of sensors for IoT using reinforcement learning,” International Journal of Advanced Intelligence Paradigms, vol. 10, no. 1-2, pp. 3–22, 2018.
-  J.-C. Renaud and C.-K. Tham, “Coordinated sensing coverage in sensor networks using distributed reinforcement learning,” in 2006 14th IEEE International Conference on Networks, vol. 1. IEEE, 2006, pp. 1–6.
-  R. Ding, Y. Xu, F. Gao, X. Shen, and W. Wu, “Deep reinforcement learning for router selection in network with heavy traffic,” IEEE Access, vol. 7, pp. 37 109–37 120, 2019.
-  T. Oda, R. Obukata, M. Ikeda, L. Barolli, and M. Takizawa, “Design and implementation of a simulation system based on deep Q-network for mobile actor node control in wireless sensor and actor networks,” in 2017 31st International Conference on Advanced Information Networking and Applications Workshops (WAINA). IEEE, 2017, pp. 195–200.
-  G. Künzel, G. P. Cainelli, I. Müller, and C. E. Pereira, “Weight adjustments in a routing algorithm for wireless sensor and actuator networks using Q-learning,” IFAC-PapersOnLine, vol. 51, no. 10, pp. 58–63, 2018.
-  A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi, “Deep reinforcement learning for wireless sensor scheduling in cyber-physical systems,” arXiv preprint arXiv:1809.05149, 2018.
-  M. Chafii, F. Bader, and J. Palicot, “Enhancing coverage in narrow band-IoT using machine learning,” in 2018 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2018, pp. 1–6.
-  N. Jiang, Y. Deng, O. Simeone, and A. Nallanathan, “Cooperative deep reinforcement learning for multiple-group NB-IoT networks optimization,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 8424–8428.
-  M. Chu, H. Li, X. Liao, and S. Cui, “Reinforcement learning based multi-access control and battery prediction with energy harvesting in IoT systems,” IEEE Internet Things J., 2018.
-  L. Lei, Y. Kuang, X. S. Shen, K. Yang, J. Qiao, and Z. Zhong, “Optimal reliability in energy harvesting industrial wireless sensor networks,” IEEE Trans. Wirel. Commun., vol. 15, no. 8, pp. 5399–5413, 2016.
-  J. Zhu, Y. Song, D. Jiang, and H. Song, “A new deep Q-learning based transmission scheduling mechanism for the cognitive Internet of Things,” IEEE Internet Things J., vol. 5, no. 4, pp. 2375–2385, 2018.
-  T. Yang, Y. Hu, M. C. Gursoy, A. Schmeink, and R. Mathar, “Deep reinforcement learning based resource allocation in low latency edge computing networks,” in 2018 15th International Symposium on Wireless Communication Systems (ISWCS). IEEE, 2018, pp. 1–5.
-  Y. Wei, F. R. Yu, M. Song, and Z. Han, “Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor-critic deep reinforcement learning,” IEEE Internet Things J., 2018.
-  Z. Chen and X. Wang, “Decentralized computation offloading for multi-user mobile edge computing: A deep reinforcement learning approach,” arXiv preprint arXiv:1812.07394, 2018.
-  L. Quan, Z. Wang, and F. Ren, “A novel two-layered reinforcement learning for task offloading with tradeoff between physical machine utilization rate and delay,” Future Internet, vol. 10, no. 7, p. 60, 2018.
-  F. De Vita, D. Bruneo, A. Puliafito, G. Nardini, A. Virdis, and G. Stea, “A deep reinforcement learning approach for data migration in multi-access edge computing,” in 2018 ITU Kaleidoscope: Machine Learning for a 5G Future (ITU K). IEEE, 2018, pp. 1–8.
-  J. Wang, L. Zhao, J. Liu, and N. Kato, “Smart resource allocation for mobile edge computing: A deep reinforcement learning approach,” IEEE Trans. Emerg. Top. Comput., 2019.
-  H. Zhu, Y. Cao, X. Wei, W. Wang, T. Jiang, and S. Jin, “Caching transient data for Internet of Things: A deep reinforcement learning approach,” IEEE Internet Things J., 2018.
-  Z. Wang, L. Li, Y. Xu, H. Tian, and S. Cui, “Handover control in wireless systems via asynchronous multiuser deep reinforcement learning,” IEEE Internet Things J., vol. 5, no. 6, pp. 4296–4307, 2018.
-  Y. Sun, M. Peng, and S. Mao, “Deep reinforcement learning based mode selection and resource management for green fog radio access networks,” IEEE Internet Things J., 2018.
-  L. Huang, X. Feng, C. Zhang, L. Qian, and Y. Wu, “Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing,” Digital Communications and Networks, vol. 5, no. 1, pp. 10–17, 2019.
-  S. Chinchali, P. Hu, T. Chu, M. Sharma, M. Bansal, R. Misra, M. Pavone, and S. Katti, “Cellular network traffic scheduling with deep reinforcement learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  A. T. Nassar and Y. Yilmaz, “Reinforcement-learning-based resource allocation in fog radio access networks for various IoT environments,” arXiv preprint arXiv:1806.04582, 2018.
-  L. Lei, H. Xu, X. Xiong, K. Zheng, and W. Xiang, “Joint computation offloading and multi-user scheduling using approximate dynamic programming in NB-IoT edge computing system,” IEEE Internet Things J., 2019.
-  Z. Xu, Y. Wang, J. Tang, J. Wang, and M. C. Gursoy, “A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs,” in 2017 IEEE International Conference on Communications (ICC). IEEE, 2017, pp. 1–6.
-  N. Cheng, F. Lyu, W. Quan, C. Zhou, H. He, W. Shi, and X. Shen, “Space/aerial-assisted computing offloading for IoT applications: A learning-based approach,” IEEE J. Sel. Areas Commun., vol. 37, no. 5, pp. 1117–1129, 2019.
-  M. Igl, L. Zintgraf, T. A. Le, F. Wood, and S. Whiteson, “Deep variational reinforcement learning for POMDPs,” arXiv preprint arXiv:1806.02426, 2018.
-  K. V. Katsikopoulos and S. E. Engelbrecht, “Markov decision processes with delays and asynchronous cost collection,” IEEE Trans. Automat. Contr., vol. 48, no. 4, pp. 568–574, 2003.
-  T. J. Walsh, A. Nouri, L. Li, and M. L. Littman, “Learning and planning in environments with delayed feedback,” AUTON. AGENT MULTI-AG., vol. 18, no. 1, p. 83, 2009.
-  E. Schuitema, L. Buşoniu, R. Babuška, and P. Jonker, “Control delay in reinforcement learning for real-time dynamic systems: a memoryless approach,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 3226–3231.
-  G. Lample and D. S. Chaplot, “Playing FPS games with deep reinforcement learning,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.