I Introduction
The Internet of Things (IoT) connects a huge number of devices to the Internet, which generate massive amounts of sensory data reflecting the status of the physical world. These data can be processed and analyzed by leveraging machine learning techniques, with the objective of making informed decisions that control the reactions of IoT devices to the physical world. In other words, IoT devices become autonomous with ambient intelligence by integrating IoT, machine learning, and autonomous control. For example, smart thermostats can learn to autonomously control central heating systems based on the presence of users and their routines. IoT and autonomous control systems (ACS)
[1] are originally independent concepts, and the realization of one does not necessarily require the other. The concept of autonomous IoT (AIoT) was proposed as the next wave of IoT that can explore its future potential [2]. AIoT systems provide a dynamic and interactive environment with a number of AIoT devices, which sense the environment and make control decisions to react. As shown in Fig. 1, an AIoT system typically includes a physical system where AIoT devices with sensors and actuators are deployed. The IoT devices are usually connected by wireless networks to an access point (AP), such as a mobile base station (BS), which acts as a gateway to the Internet where cloud servers are deployed. Moreover, edge/fog servers, with limited data processing and storage capabilities compared to the cloud servers, may be deployed at the APs [3]. After the IoT devices acquire data from sensors that represent the full or partial status of the physical system, they need to process the data and generate control decisions for the actuators to react. The data processing tasks can be executed locally at the IoT devices, at the edge/fog servers, or at the cloud servers.
Reinforcement learning (RL) introduces ambient intelligence into AIoT systems by providing a class of solution methods to the closed-loop problem of processing the sensory data to generate control decisions. Specifically, agents interact with the environment to learn optimal policies that map states to actions [4]. The learning agent must be able to sense the current state of the environment to some extent (e.g., sensing the room temperature) and take a corresponding action (e.g., turning the thermostat on or off) that affects the next state and the immediate reward, so that a long-term reward over an extended time period is maximized (e.g., keeping the room temperature at a target value). Unlike most forms of machine learning, e.g., supervised learning, the learner is not told which actions to take but must discover which actions yield the most long-term reward by trying them out.
While RL has been successfully applied to a variety of domains, it faces a major challenge when tackling problems with real-world complexity: the agents must efficiently represent the state of the environment from high-dimensional sensory data and use this information to learn optimal policies. Therefore, deep reinforcement learning (DRL), a combination of RL with deep learning (DL), has been developed to overcome this challenge
[5]. One of the most famous applications of DRL is AlphaGo, the first computer program to beat a human professional on a full-sized board. It turns out that the formulation of RL/DRL models for real-world AIoT systems is not as straightforward as it may appear. As discussed above, there are two types of entities in an RL/DRL model: the environment and the agent. Firstly, the environment in RL/DRL can be restricted to reflect only the physical system, or be extended to include the wireless networks, the edge/fog servers, and the cloud servers as well. This is because network and computation performance, such as communication/computation delay, power consumption, and network reliability, has important impacts on the control performance of the physical system. Therefore, the control actions in RL/DRL can be divided into two levels: (physical system) actuator control and (communication/computation) resource control, as shown in Fig. 1. The two levels of control can be learned and optimized separately or jointly. Secondly, the agent in RL is a logical concept that makes decisions on action selection. In AIoT systems, agents with ambient intelligence can reside in IoT devices, edge/fog servers, and/or cloud servers, as shown in Fig. 1. The time sensitivity of the IoT application is an important factor in determining the location of the agents. For example, in autonomous driving, images from an autonomous vehicle's camera need to be processed in real time to avoid an accident. In this case, the agent should reside locally in the vehicle to make fast decisions, instead of transmitting the sensory data to the cloud and returning the predictions back to the vehicle. However, there are many scenarios in which it is not easy to determine the optimal locations for the agents, which may involve solving an RL problem in itself. Moreover, when there are multiple agents distributed across the IoT devices, the cooperation of the agents is also an important and challenging issue.
Although AIoT is a relatively new concept, related research already exists in IoT and ACS, respectively. In this paper, we review the state-of-the-art research and identify the models and challenges for the application of DRL in AIoT. Although several recent articles discuss the application of machine learning in general IoT systems [6, 3], this paper focuses on a specific type of machine learning, namely reinforcement learning, and its application to a promising type of IoT system, namely AIoT.
The remainder of the paper is organized as follows. In Section II, we review the RL/DRL methodologies. Section III introduces a general model for RL in AIoT with a detailed discussion on the key elements. In Section IV, the existing works will be surveyed according to the different IoT applications. The challenges and open issues are identified and highlighted in Section V. Finally, the conclusion is given in Section VI.
II Overview of Deep Reinforcement Learning
In this section, we first introduce the basic concepts of RL and DL, based on which DRL is developed. Then, we classify DRL algorithms into two broad categories, i.e., value-based and policy gradient methods, and show how different elements of RL are approximated by deep neural networks. Finally, we introduce some advanced DRL techniques that are envisioned to be extremely useful in addressing the open issues in AIoT, which will be discussed in Section V.
II-A Building Block of DRL: RL
Generally, RL is a class of machine learning algorithms that can achieve optimal control of a Markov decision process (MDP)
[4]. As discussed in Section I, there are generally two entities in RL, as shown in Fig. 2: an agent and an environment. The environment evolves over time in a stochastic manner and may be in one of the states within a state space at any point in time. The agent performs as the action executor and interacts with the environment. When it performs an action in a certain state, the environment generates a reward as a signal for positive or negative behaviour. Moreover, the action also affects the next state that the environment will transit to. The stochastic evolution of the state-action pair over time forms an MDP, which consists of the following elements:
state s, which represents a specific status of the environment within a possible state space S. In an MDP, the state comprises all the necessary information about the environment for the agent to choose the optimal action from the action space.

action a, which is chosen by the agent from an action space A in a specific state s. An RL agent interacts with the environment and learns how to behave in different states by observing the consequences of its actions.

reward r, which is generated when the agent takes a certain action a in a state s. The reward indicates the intrinsic desirability of an action in a certain state.

transition probability P(s' | s, a), which is the conditional probability that the next state of the system will be s', given the current state s and action a. In model-based RL, this transition probability is considered to be known by the agent, while the agent does not require this information in model-free RL.
A policy determines how an agent selects actions in different states, and can be categorized as either a stochastic policy or a deterministic policy [7]. In the stochastic case, the policy is described by π(a | s), which denotes the probability that action a is chosen in state s. In the deterministic case, the policy is described by a = μ(s), which denotes the action that must be chosen in state s.
For simplicity of introduction, we focus on the discrete-time model, where the agent interacts with the environment at each of a sequence of discrete time steps t = 0, 1, 2, …. The goal of the agent is to learn how to map states to actions, i.e., to find a policy π to optimize the value function V^π(s) for any state s. The value function is the expected return when policy π is followed from an initial state s, i.e.,
V^π(s) = E_{τ∼π}[ R(τ) | s_0 = s ],   (1)
where τ is a trajectory, i.e., a sequence of triplets (s_t, a_t, r_t) with s_t ∈ S, a_t ∈ A and reward r_t, for t = 0, 1, …, T. R(τ) can be the total reward, discounted total reward, or average reward of trajectory τ, where T is the terminal time step, which can be ∞.
Apart from the value function, another important function is the Q function Q^π(s, a), which is the expected return for taking action a in state s and thereafter following a policy π. When policy π is the optimal policy π*, the value function and Q function are denoted by V*(s) and Q*(s, a), respectively. Note that V*(s) = max_a Q*(s, a). If the Q functions Q*(s, a), a ∈ A, are given, the optimal policy can easily be found by π*(s) = argmax_a Q*(s, a).
In order to learn the value functions or Q functions, the Bellman optimality equations are usually used. Taking the discounted MDP with a discount factor γ ∈ [0, 1) for example, the Bellman optimality equations for the value function and Q function are
V*(s) = max_a E[ r + γ V*(s') | s, a ]   (2)
and
Q*(s, a) = E[ r + γ max_{a'} Q*(s', a') | s, a ],   (3)
respectively.
Bellman equations represent the relation between the value/Q functions of the current state and the next state. For example, it can be inferred from (3) that the expected return equals the sum of the immediate reward and the maximum expected return thereafter. Once the future expected return is obtained, the expected return from the current state can be calculated. Bellman equations are the basis of an important class of RL algorithms using the "bootstrap" method, such as Q-learning and Temporal-Difference (TD) learning. During the learning process, the agent first initializes the value/Q functions to some random values. Then, it iteratively repeats the policy prediction and policy evaluation phases until the convergence of the value/Q functions. In the policy prediction phase, the agent chooses an action according to the current value/Q functions, which results in an immediate reward and a new state. In the policy evaluation phase, it updates the value/Q functions according to the Bellman equations (2) or (3) given the immediate reward and the new state.
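As a concrete illustration of the bootstrap method, the following Python sketch runs tabular Q-learning on a hypothetical two-state, two-action MDP; the environment, rewards, and hyperparameters are all invented for illustration. Action 1 always leads to state 1, which pays a reward of 1, so the learned greedy policy should pick action 1 everywhere.

```python
import numpy as np

# Toy MDP (invented for illustration): action 1 always leads to state 1,
# which pays reward 1; everything else pays 0.
n_states, n_actions, gamma, alpha, eps = 2, 2, 0.9, 0.5, 0.1
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s_next = 1 if a == 1 else 0
    return s_next, (1.0 if s_next == 1 else 0.0)

s = 0
for _ in range(2000):
    # policy prediction: epsilon-greedy action selection
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # policy evaluation: bootstrap update based on the Bellman optimality equation (3)
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

greedy_policy = np.argmax(Q, axis=1)  # should select action 1 in both states
```

With γ = 0.9, the fixed points are Q*(s, 1) = 10 and Q*(s, 0) = 9, so the greedy policy recovers the optimal behaviour.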
In the policy prediction phase, instead of always selecting the greedy action that maximizes the current value/Q functions, a "soft" policy such as ε-greedy, ε-soft, or softmax is usually used to explore the environment, seeking the potential to learn a better policy. Moreover, according to the method adopted in the policy evaluation phase, RL algorithms can be either on-policy or off-policy, depending on whether the value/Q functions of the predicted policy or of a hypothetical (e.g., greedy) policy are estimated.
Usually, a large amount of memory is required to store the value functions and Q functions. In some cases when only small, finite state sets are involved, it is possible to store these in the form of tables or arrays. This method is called the tabular method.
However, in most real-world problems, the state sets are large, sometimes infinite, which makes it impossible to store the value functions or Q functions in the form of tables. As a result, learning through trial-and-error interaction with the environment becomes intractable due to formidable computational complexity and storage requirements; even when learning is possible, huge computing resources are spent on it. This is where DL comes into the picture: some functions of RL, such as Q functions or policy functions, are approximated with a smaller set of parameters by applying DL. The combination of RL and DL results in the more powerful DRL.
II-B Building Block of DRL: DL
Deep learning refers to a subset of machine learning algorithms and techniques that leverage artificial neural networks (ANNs) to learn from large amounts of data in an autonomous way. It performs well in tasks such as regression and classification. A regression task predicts a continuous value, while a classification task predicts the output from a finite set of categorical values. Given input data x and target output data y, neural network (NN) models can be viewed as mathematical models defining a function y = f(x; θ), a conditional distribution P(y | x; θ), or both. The learning rule of an NN modifies its parameters θ so that, for a given input x, the network produces an output ŷ that best approximates the target output y.

A general feedforward NN, as shown in Fig. 3, is constructed from an input layer, one or more hidden layers, and an output layer. Each layer consists of one or more neurons, which represent different nonlinear nodes in the model. As illustrated in Fig. 3, neuron i in layer l has a vector of weights w_i^l for the connections from layer l-1 to itself and a bias value b_i^l. It also has an activation function g, such as Sigmoid, Logistic, Tanh, or ReLU. The output of neuron i in layer l equals g(w_i^l · z^{l-1} + b_i^l), where z^{l-1} is the vector of outputs from the neurons in layer l-1. The typical parameters of an NN are the weights and biases of every node. Any NN with two or more hidden layers can be called a deep neural network (DNN).

A feedforward network has no notion of order in time; the only input it considers is the current input data it has been exposed to. A recurrent neural network (RNN) is a special type of NN that can process sequences of inputs by using internal memory. In an RNN, the prior output of the neurons in the hidden state is used as input along with the current input data, which enables the network to learn from history. The basic architecture of an RNN is illustrated in Fig. 4. At each time step, the input of the RNN is propagated through the feedforward NN: first, it is multiplied by a weight matrix; meanwhile, the hidden state of the previous time step is multiplied by another weight matrix; then, the two parts are added together and activated by the neuron. A special RNN architecture called Long Short-Term Memory (LSTM) is widely used [8]. LSTM addresses shortcomings of the basic RNN, i.e., vanishing gradients, exploding gradients, and long-term dependencies [9].

Usually, a loss function L(ŷ, y) is used in deep learning, which is a function of the output ŷ from the NN and the target output y. The loss function evaluates how well a specific NN, along with its currently learned parameter values, models the given data. Loss functions can be classified into two major categories depending on the type of learning task: regression losses and classification losses. Common regression loss functions include Mean Square Error (MSE), Mean Absolute Error (MAE), and Mean Bias Error (MBE); common classification loss functions include the Support Vector Machine (SVM) loss and the cross-entropy loss.
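As a small worked example, the feedforward computation described above, where each neuron outputs g(w · z + b), can be sketched in NumPy; the layer sizes and random weights below are purely illustrative.

```python
import numpy as np

def relu(x):
    # ReLU activation, one of the activation functions mentioned above
    return np.maximum(0.0, x)

# Illustrative 2-layer feedforward network with arbitrary random weights
rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)  # hidden layer: 3 inputs -> 4 neurons
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)  # output layer: 4 -> 2 outputs

def forward(x):
    z1 = relu(W1 @ x + b1)   # hidden-layer activations g(W z + b)
    return W2 @ z1 + b2      # linear output layer, e.g., for a regression task

y_hat = forward(np.array([0.5, -1.0, 2.0]))
```

Each matrix row here plays the role of one neuron's weight vector w_i^l, and b holds the biases.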
The objective of the NN is to minimize the loss function, i.e., min_θ L(θ). For this purpose, the parameters in NNs are updated by a method called gradient descent. Given a function L(θ), the simple gradient ∇_θ L(θ) is usually used to update the parameters. The gradient descent method starts from an initial point θ_0. As a mini-batch of input data is fed to the NN, the average loss function over all inputs in the mini-batch is derived and used to find the minimum of L by taking a step along the descent direction, i.e.,
θ_{k+1} = θ_k − α ∇_θ L(θ_k),   (4)
where α is a hyperparameter named the step size. It determines how fast the parameter values move towards the optimal direction. The above process is repeated iteratively as more mini-batches of input data are fed to the NN until convergence.
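The iteration in (4) can be sketched on a toy one-parameter quadratic loss; both the loss and the step size below are chosen purely for illustration.

```python
# Gradient descent (4) on the toy loss L(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3); the minimizer is theta = 3.
alpha = 0.1   # step size
theta = 0.0   # initial point theta_0
for _ in range(200):
    grad = 2.0 * (theta - 3.0)
    theta -= alpha * grad   # theta_{k+1} = theta_k - alpha * grad
```

With this step size, the error shrinks by a constant factor per iteration, so theta converges to the minimizer 3.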
The simple gradient is easy to derive, but simple gradient descent is often not the most efficient method to optimize the loss function. During training, an appropriate value of the step size must be set: if the value is too big, the iteration may fail to reach the local minimum, and if the value is too small, it may take too much time to reach the local optimum. On the other hand, natural gradients do not follow the usual steepest direction in the parameter space, but the steepest descent direction with respect to the Fisher metric in the space of distributions. Specifically, the Fisher information matrix F(θ) is used to rescale the update, so that the natural gradient is F(θ)^{-1} ∇_θ L(θ). Then, (4) can be used to update the parameters by replacing the simple gradient ∇_θ L(θ) with the natural gradient.
II-C DRL Basics: Value-Based Methods
In value-based methods for DRL, as illustrated in Fig. 5, the states or state-action pairs are used as inputs to the NNs, while the Q functions or value functions are approximated by the parameters θ of the NNs. An NN returns the approximated Q functions or value functions for the input states or state-action pairs. There can be a single output neuron (as shown in Fig. 5(a)) or multiple output neurons (as shown in Fig. 5(b)). In the former case, the output is either Q(s, a; θ) or V(s; θ), corresponding to the input (s, a) or s. In the latter case, the outputs are the Q functions for the input state combined with every action, i.e., Q(s, a_1; θ), …, Q(s, a_|A|; θ).
To derive the loss functions, y_Q and y_V are defined as the target values of the Q functions and value functions, respectively. The regression loss
L(θ) = E[ (y_Q − Q(s, a; θ))² ]   (5)
or
L(θ) = E[ (y_V − V(s; θ))² ]   (6)
can be used to evaluate how well the NN approximates the Q functions or value functions in value-based methods.
II-C1 Deep Q-Networks
Based on the idea of NN-fitted Q functions, the Deep Q-Network (DQN) algorithm was introduced by Mnih et al. in 2015, achieving strong performance in Atari games [5]. An illustration of DQN is shown in Fig. 6. The NN in DQN takes a state as input and returns the approximated Q function for every action under the input state.
In DQN, the algorithm first randomly initializes the network parameters as θ_0. The target Q function is given by (7) according to the Bellman equation as
y_Q = r + γ max_{a'} Q(s', a'; θ_k),   (7)
where the subscript k or k+1 refers to the value of the corresponding variable at the k-th or (k+1)-th iteration.
The parameters in DQN are updated by minimizing the loss function L(θ), which can be derived from (5) by replacing y_Q with the target in (7).
By applying stochastic gradient descent, the parameters are updated as
θ_{k+1} = θ_k − α ∇_θ L(θ_k),   (8)
where α is the learning rate.
In order to deal with the limitations of DRL, two important techniques, freezing target networks and experience replay, are applied in DQN. To make the training process more stable and controllable, target networks, whose parameters θ⁻ are kept fixed for a period of time, are used to evaluate the Q function of the next state, i.e., instead of (7), we have
y_Q = r + γ max_{a'} Q(s', a'; θ⁻).   (9)
The parameters of the online network are updated after each iteration. After a certain number of iterations, the online network copies its parameters to the target network. This reduces the risk of divergence and prevents the instabilities resulting from overly fast propagation of value estimates.
To perform experience replay, the experience of the agent at each time step is stored in a data set. The updates are then made on samples from this data set, which removes correlations in the observation sequence and smooths over changes in the data distribution. This technique allows the updates to cover a wide range of the state-action space and provides more opportunity to make larger updates of the parameters.
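A minimal experience-replay buffer can be sketched as follows; the class layout and transition format are illustrative rather than taken from the original DQN implementation.

```python
import random
from collections import deque

# Transitions (s, a, r, s_next) are stored in a bounded buffer and
# sampled uniformly to break temporal correlations between updates.
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform sampling without replacement from stored experience
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.push(t, 0, 1.0, t + 1)  # dummy transitions for illustration
batch = buf.sample(8)           # mini-batch used for the DQN update
```

The bounded deque also implements the common practice of discarding the oldest experience once the buffer is full.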
II-C2 Double DQN
In DQN, the Q function evaluated by the target network is used both to select and to evaluate an action, which makes it more likely to overestimate the Q function of an action. The estimation error becomes larger when there are more actions. To overcome this problem, van Hasselt et al. proposed the Double DQN (DDQN) method in 2016, where two sets of parameters are used to derive the target value, as shown in Fig. 7 [10]. Compared with (7), the target Q-value in DDQN can be rewritten as
y_Q = r + γ Q(s', argmax_{a'} Q(s', a'; θ_k); θ⁻),   (10)
where the selection of the action is due to the parameters θ_k of the online network and the evaluation of the selected action is due to the parameters θ⁻ of the target network. This means there is less overestimation of the Q-learning values and more stability, improving the performance of the DRL methods [7]. The loss function can be derived from (5) by replacing y_Q with the target in (10), and the parameters can be updated accordingly. The DDQN algorithm gets the benefit of double Q-learning and keeps the rest of the DQN algorithm.
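The difference between the targets (9) and (10) can be illustrated numerically; the Q-values below are hand-picked stand-ins for network outputs, not learned values.

```python
import numpy as np

gamma, r = 0.9, 1.0
q_online = np.array([0.5, 2.0, 1.0])  # Q(s', a; theta_k), online network
q_target = np.array([1.5, 0.3, 1.2])  # Q(s', a; theta^-), target network

# DQN target (9): the target network both selects and evaluates the action
y_dqn = r + gamma * np.max(q_target)            # approx. 2.35

# Double DQN target (10): the online network selects, the target evaluates
a_star = int(np.argmax(q_online))               # online network picks action 1
y_ddqn = r + gamma * q_target[a_star]           # approx. 1.27
```

Here the action that looks best to the online network is valued modestly by the target network, so the DDQN target is smaller, illustrating how decoupling selection from evaluation damps overestimation.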
Apart from DQN and DDQN, there are other value-based methods, some of which are developed from DQN and DDQN with further improvements, such as DDQN with proportional prioritization [11] and DDQN with a dueling architecture [12].
Remark 1 (Pros and cons of valuebased DRL methods)
Although DQN and its improved versions have been widely adopted in the existing literature as discussed in Section IV, mainly due to their relative simplicity and good performance, there are some limitations with value-based DRL methods. First, they cannot solve RL problems with large or continuous action spaces. Second, they cannot solve RL problems where the optimal policy is stochastic, requiring specific action probabilities. Since value-based methods can only learn deterministic policies, the majority of such algorithms are off-policy, such as DQN.
II-D DRL Basics: Policy Gradient Methods
According to a policy π, an action a is selected when the environment is in state s. In policy gradient methods, NNs can be applied to directly approximate the policy as a function of state. As shown in Fig. 8, the states are used as inputs to the NNs, while the policy is approximated by the parameters θ of the NNs as π_θ.
To evaluate the performance of the current policy, the objective function is defined as
J(θ) = V^{π_θ}(s_0) = E_{τ∼π_θ}[ R(τ) ],   (11)
where V^{π_θ} is the value function of policy π_θ as shown in (1), and τ refers to a sampled trajectory with an initial state s_0. If we can find the parameters θ for the policy such that the objective function is maximized, we can solve the task. The basic idea of policy gradient methods is to adjust the parameters in the direction of greater expected reward [13]. For this purpose, we can set the loss function of the NN to be
L(θ) = −J(θ).   (12)
In order to update the parameters, we need to express the gradient of J(θ) with respect to the parameters θ as an expectation of stochastic estimates based on (11). As mentioned in Section II-A, the policy in RL can be classified into two categories, i.e., the stochastic policy and the deterministic policy. Hence, the stochastic policy gradient (SPG) method and the deterministic policy gradient (DPG) method are discussed below in turn.
II-D1 Stochastic Policy Gradients
By applying DRL, a stochastic policy is approximated as π_θ(a | s), which gives the probability that a specific action a is taken in a specific state s when the agent follows the policy parameterized by θ. The policy parameters θ are usually the weights and biases of a neural network [7]. For a DRL model with discrete state/action spaces, the softmax function is a typical probability density function. In the case of continuous state/action spaces, a Gaussian distribution is generally used to characterize the policy: an NN is applied to approximate the mean, and a separate set of parameters specifies the standard deviation of the Gaussian distribution [14][15]. According to the policy gradient theorem, we have
∇_θ J(θ) = E_{τ∼π_θ}[ Σ_t ∇_θ log π_θ(a_t | s_t) R(τ) ].   (13)
By applying stochastic gradient ascent on J(θ), i.e., descent on the loss (12), the parameters are updated as
θ_{k+1} = θ_k + α ∇_θ J(θ_k),   (14)
where α is the learning rate. In this way, θ is adjusted to enlarge the probability of trajectories with higher total reward.
From the perspective of the NN, we give the loss function of the SPG algorithm as
L(θ) = − Σ_t log π_θ(a_t | s_t) Q^{π_θ}(s_t, a_t).   (15)
In (15), the value of Q^{π_θ}(s_t, a_t) needs to be derived. This can be achieved either by the Monte Carlo policy gradient method or by the actor-critic method, as illustrated in Fig. 9 and Fig. 10, respectively.
Monte Carlo Policy Gradient
A typical algorithm of the SPG methods is the REINFORCE algorithm proposed in [16]. Based on the Monte Carlo approach, a trajectory τ is first sampled by running the current policy from an initial state s_0. Then, for each time step t, the total reward R_t accumulated from time step t onward is calculated and multiplied with the policy gradient ∇_θ log π_θ(a_t | s_t) to update the parameters according to (14). The above procedure is repeated over multiple runs, while in each run a different trajectory is sampled.
Moreover, in order to reduce the variance of the policy gradient, a baseline function b(s_t), which is independent of the action, is introduced. Based on this, the REINFORCE algorithm with baseline is obtained, whose loss function can be formulated as
L(θ) = − Σ_t log π_θ(a_t | s_t) (R_t − b(s_t)).   (16)
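The Monte Carlo return computation at the heart of REINFORCE can be sketched as follows; the rewards and discount factor are invented for illustration, and the average return serves as a simple constant baseline.

```python
# Compute the discounted return R_t from each time step of a sampled
# trajectory, then subtract a baseline to reduce variance, as in (16).
def returns_to_go(rewards, gamma):
    R, out = 0.0, []
    for r in reversed(rewards):   # accumulate from the end of the trajectory
        R = r + gamma * R
        out.append(R)
    return list(reversed(out))

rewards = [1.0, 0.0, 2.0]
R = returns_to_go(rewards, gamma=0.5)
# R[2] = 2.0, R[1] = 0 + 0.5*2 = 1.0, R[0] = 1 + 0.5*1 = 1.5

baseline = sum(R) / len(R)                      # simple average baseline
advantages = [Rt - baseline for Rt in R]        # weights for log pi terms in (16)
```

Each advantage then multiplies the corresponding log π_θ(a_t | s_t) term when forming the loss.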
Remark 2 (Pros and cons of Monte Carlo policy gradient DRL methods)
In contrast to value-based DRL methods, the policy gradient methods for DRL learn a direct mapping from state to action, which leads to better convergence properties and higher efficiency in high-dimensional or continuous action spaces [7]. Moreover, they can learn stochastic policies, which perform better than deterministic policies in some situations. However, Monte Carlo policy gradient methods suffer from high variance in their estimates. As on-policy methods, they require on-policy samples, which makes them very sample intensive.
Actor-Critic
Actor-critic methods, combining the advantages of both policy gradient and value-based methods, have been widely studied in DRL. As illustrated in Fig. 10, an actor-critic method is generally realized by two NNs, i.e., the actor network and the critic network, which may share parameters with each other. The actor network is similar to the NN of the policy gradient method, while the critic network is similar to the NN of the value-based method. During the learning process, the critic updates the parameters ω of the value function according to the policy given by the actor. Meanwhile, the actor updates the parameters θ of the policy according to the value functions evaluated by the critic. Generally, two learning rates need to be predefined for the updates of ω and θ, respectively [17].
As mentioned previously, one important task in the policy gradient method is to obtain the value of Q^{π_θ}(s_t, a_t) in (15). In the actor-critic method, the critic network is used for this purpose. Specifically, the baseline b(s_t) in (16) is set to the value function V(s_t; ω), which is approximated by the critic network with a loss function as given in (6). In a state s_t, the agent selects an action a_t according to the current policy given by the actor network, receives a reward r_t, and the state transits to s_{t+1}. Similar to (6) in the value-based method, the loss function for the critic network can be expressed as
L(ω) = δ_t²,   (17)
where
δ_t = r_t + γ V(s_{t+1}; ω) − V(s_t; ω).   (18)
Similar to (8) in DQN, the parameters of the critic network are updated as
ω_{k+1} = ω_k − α_ω ∇_ω L(ω_k),   (19)
where α_ω is the learning rate for the critic.
Note that r_t + γ V(s_{t+1}; ω) is an estimate of Q^{π_θ}(s_t, a_t). Therefore, given the value functions evaluated by the critic network, the value of R_t − b(s_t) in (16) can be replaced by the TD error δ_t in (18), which can be seen as an estimate of the advantage of action a_t in state s_t [18]. The loss function of the actor network can be defined similarly to (16), i.e.,
L(θ) = − log π_θ(a_t | s_t) δ_t.   (20)
Similar to (14) in the policy gradient method, the parameters of the actor network are updated as
θ_{k+1} = θ_k − α_θ ∇_θ L(θ_k),   (21)
where α_θ is the learning rate for the actor.
Through the update processes of the actor-critic algorithm, the critic approximates the value functions more and more accurately, while the actor learns to choose better actions and obtain higher reward.
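A single update step of the scheme in (17)-(21) can be sketched with a tabular critic and a softmax actor; the tiny problem, learning rates, and sampled transition are all illustrative.

```python
import numpy as np

# One actor-critic update: the critic is a table V[s]; the actor is a
# softmax over action preferences h[s, a] (a stand-in for NN parameters).
gamma, alpha_w, alpha_t = 0.9, 0.1, 0.1
V = np.zeros(2)        # critic parameters
h = np.zeros((2, 2))   # actor parameters (action preferences)

def policy(s):
    e = np.exp(h[s] - np.max(h[s]))
    return e / e.sum()

# one sampled transition (s, a, r, s_next), invented for illustration
s, a, r, s_next = 0, 1, 1.0, 1

delta = r + gamma * V[s_next] - V[s]   # TD error, as in (18)
V[s] += alpha_w * delta                # critic update, as in (19)

grad_log_pi = -policy(s)               # d log pi(a|s) / dh[s] for softmax
grad_log_pi[a] += 1.0
h[s] += alpha_t * delta * grad_log_pi  # actor update, as in (21)
```

Because the TD error is positive here, the update raises the probability of the rewarded action in state 0.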
Typical actorcritic methods for SPG include the asynchronous advantage actorcritic (A3C) algorithm and soft actorcritic (SAC). The former mainly focuses on the parallel training of multiple actors that share global parameters [19]. The latter involves a soft Q function, a tractable stochastic policy and offpolicy updates [20]. SAC achieves good performance on a range of continuous control tasks.
Remark 3 (Pros and cons of ActorCritic DRL methods)
Actor-critic methods combine the advantages of both value-based and Monte Carlo policy gradient methods. They can be either on-policy or off-policy. Compared with Monte Carlo methods, they require far fewer samples to learn from and fewer computational resources to select an action, especially when the action space is continuous. Compared with value-based methods, they can learn stochastic policies and solve RL problems with continuous actions. However, they are prone to instability due to the recursive use of value estimates.
II-D2 Deterministic Policy Gradient
Different from the stochastic policy gradient, where the policy is modeled as a probability distribution over actions, the deterministic policy gradient models the policy as a deterministic decision, i.e., a = μ_θ(s). According to the objective function given in (11) and the deterministic policy gradient theorem, we have
∇_θ J(θ) = E_{s∼ρ^μ}[ ∇_θ μ_θ(s) ∇_a Q^μ(s, a)|_{a=μ_θ(s)} ],   (22)
where the policy improvement is decomposed into the gradient of the Q function with respect to actions and the gradient of the policy with respect to the policy parameters, and ρ^μ is the state distribution under policy μ. Thus, the parameters are updated as
θ_{k+1} = θ_k + α E_{s∼ρ^μ}[ ∇_θ μ_θ(s) ∇_a Q^μ(s, a)|_{a=μ_θ(s)} ].   (23)
A differentiable function Q(s, a; ω) can be used as an approximator of Q^μ(s, a), and then the gradient ∇_a Q^μ(s, a) can be replaced by ∇_a Q(s, a; ω). Such an approximator is compatible with the deterministic policy when the conditions given in [21] are satisfied.
From the perspective of the NN, the loss function of the DPG algorithm is set as
L(θ) = −E[ Q(s, μ_θ(s); ω) ].   (24)
One typical actor-critic method for DPG is the deep deterministic policy gradient (DDPG) algorithm, as shown in Fig. 11. The DDPG algorithm is a model-free, off-policy actor-critic algorithm that combines the ideas of DPG and DQN. It was first proposed by Lillicrap et al. in 2015 [22]. Besides the online critic network with parameters ω and the online actor network with parameters θ, the DDPG algorithm specifies target critic and actor networks with parameters ω⁻ and θ⁻, respectively. The parameters of these four NNs are updated during the learning process. The gradient ∇_a Q(s, a; ω) is obtained from the critic network.
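In practice, DDPG implementations commonly replace the periodic hard copy used in DQN with a "soft" target update, in which the target parameters slowly track the online ones; this detail is common practice rather than something stated above, and the parameter values below are illustrative.

```python
import numpy as np

# Soft target update commonly used in DDPG:
#   theta_target <- tau * theta_online + (1 - tau) * theta_target
tau = 0.01
theta_online = np.array([1.0, -2.0, 0.5])  # stand-in for actor weights
theta_target = np.zeros(3)                  # target network starts apart

for _ in range(5):
    theta_target = tau * theta_online + (1.0 - tau) * theta_target
```

After k steps with a fixed online network, the target equals theta_online * (1 - (1 - tau)^k), i.e., it converges geometrically toward the online parameters.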
Based on DDPG, several algorithms have been proposed in recent years, such as Distributed Distributional Deep Deterministic Policy Gradients (D4PG) [23], Twin Delayed Deep Deterministic policy gradient (TD3) [24], Multi-Agent DDPG (MADDPG) [25], and Recurrent Deterministic Policy Gradients (RDPG) [26].
Remark 4 (Pros and cons of DPG DRL methods)
DPG methods are a special type of actor-critic methods that focus on deterministic policies. Compared with their SPG counterparts, they require fewer samples to compute the deterministic policy gradient, especially if the action space has many dimensions. Different from value-based methods, which can only solve RL problems with discrete actions, DPG-based methods focus on and work well for high-dimensional continuous action problems. They usually work off-policy to guarantee sufficient exploration unless there is sufficient noise in the environment. Also, when combined with DQN, DPG-based methods perform well only when the Q function estimate is accurate.
II-D3 Further Improvements
Natural Policy Gradient
The policy gradient methods discussed above all use a simple gradient of the loss function to update the parameters of the NN. In contrast, the natural policy gradient (NPG) method updates the parameters in the NN using the natural gradient discussed in Section II-B instead of the simple gradient, providing a more efficient solution [14].
The loss function of NPG is the same as that of SPG, whose general expression is given in (12). The parameters are updated as
θ_{k+1} = θ_k − α F(θ_k)^{-1} ∇_θ L(θ_k),   (25)
where
F(θ) = E_{π_θ}[ ∇_θ log π_θ(a | s) (∇_θ log π_θ(a | s))^T ]   (26)
is the Fisher information matrix used to rescale the update step [15].
The NPG method defines a new form of update step that specifies how much the parameters should be adjusted, and therefore provides a more stable and effective update. However, the drawback of NPG is that when a complicated NN with a large number of parameters is used to approximate the policy, it is impractical to calculate the Fisher information matrix or store it appropriately [7]. Methods derived from NPG, such as Trust Region Policy Optimization (TRPO) [15] and Proximal Policy Optimization (PPO) [27], solve the above problem to some extent and are widely used for DRL in practice. Moreover, there are algorithms applying NPG to actor-critic methods, such as Actor-Critic using Kronecker-Factored Trust Region (ACKTR) [28] and Actor-Critic with Experience Replay (ACER) [18].
Combining Monte Carlo Policy Gradient and Actor-Critic
In Monte Carlo methods, the policy gradient is unbiased but has high variance, while in actor-critic methods it is deterministic but biased. Therefore, an effective way is to combine these two types of methods. Q-Prop is such an efficient and stable algorithm, proposed by S. Gu, T. Lillicrap et al. in 2016 [29]. It constructs a new estimator that provides a solution to high sample complexity and combines the advantages of on-policy and off-policy methods.
Q-Prop can be directly combined with a number of prior policy gradient DRL methods, such as DDPG and TRPO. Compared with actor-critic methods such as the DDPG algorithm, Q-Prop has achieved higher stability in DRL tasks for real-world problems. One limitation of Q-Prop is that the computation can be bottlenecked by the critic training when data collection is fast.
II-E DRL Beyond MDP: POMDP-based DRL
In the previous sections, we considered RL in a Markovian environment, which implies that knowledge of the current state is always sufficient for optimal control. However, in many real-world problems, the complete environment information cannot be observed by the agent accurately, usually due to limitations in sensing and communication capabilities. An agent acting under partial observability can model the environment as a partially observable Markov decision process (POMDP) [30]. RL tasks in realistic environments need to deal with the incomplete and noisy state information resulting from a POMDP.
POMDP can be seen as an extension of MDP by adding a finite set of observations and a corresponding observation model [31]. A POMDP is usually defined as a six-tuple (S, A, P, R, Ω, O), where the state space S, action space A, transition probability P, and reward R are defined previously as elements in MDP,

Ω is the observation space, where o ∈ Ω is a possible observation,

O(o|s′, a) is the conditional probability that taking an action a leading to a new state s′ will result in an observation o.
Similar to MDP, an agent chooses an action a according to policy π, which results in the environment transiting to a new state s′ with probability P(s′|s, a), and the agent receives a reward r. Different from MDP, the agent cannot directly observe the system state, but instead receives an observation o, which depends on the new state of the environment with probability O(o|s′, a). Also, the policy and Q function are modified as π(a|h) and Q(h, a), respectively, where h denotes the observation history.
Since the agent cannot directly observe the underlying state, it needs to exploit history information to reduce the uncertainty about the current state [32]. The observation history at time step t can be defined as h_t = (o_1, a_1, o_2, a_2, …, a_{t−1}, o_t). When the environment model is known to the RL agent, the optimal approach is for the agent to compute a belief state b_t that provides a probability distribution over states. By introducing the belief state, a POMDP problem can be converted to a belief-based MDP problem. On the other hand, if the environment model is not available to the agent, it is straightforward to use the last k observations as input to the policy. However, using a finite history may result in important information from the past being forgotten and overlooked. In order to overcome this challenge, the RNN appears to be a good solution, because it is designed to deal with time series and can maintain long-term memory [33].
Several typical existing methods of solving POMDP problems are listed as follows.
Deep Recurrent Q-Network (DRQN)
An agent is able to choose actions in complex tasks using value-based methods for DRL, e.g., DQN and DDQN. However, it is hard for those methods to achieve outstanding performance when the agent cannot perceive the complete state information. To address this problem, Hausknecht et al. proposed the Deep Recurrent Q-Network (DRQN) in 2015 to integrate information through time and enhance DQN's performance [34]. DRQN adds recurrency to DQN by replacing DQN's first fully-connected layer with an LSTM layer.
In the partially observed cases, the agent does not have access to the state s_t. So the Q function is defined in terms of the history as Q(h_t, a_t), which is the output of the NN [26]. The input to the NN is o_t, while the rest of the information in h_t apart from o_t, i.e., h_{t−1}, is captured by the hidden states of the RNN. Here, the Bellman optimality equation for the Q function is

Q*(h_t, a_t) = E[r_t + γ max_{a_{t+1}} Q*(h_{t+1}, a_{t+1})].   (27)
Compared with DQN, where tuples (s_t, a_t, r_t, s_{t+1}) are stored in memory and sampled for training, in DRQN the tuples are modified as (o_t, a_t, r_t, o_{t+1}) and sampled for two types of updates, referred to as bootstrapped sequential updates and bootstrapped random updates, respectively. In both methods, episodes are selected randomly from the replay memory. In bootstrapped sequential updates, updates start at the beginning of the episode and sample experiences sequentially through the entire episode, while the RNN's hidden state is carried forward throughout the episode. In bootstrapped random updates, updates start at random points in the episodes, while the RNN's initial state is zeroed at the start of the update. The sequential updates method learns faster but violates DQN's random sampling policy. Both methods show good performance in experiments.
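The difference between the two replay schemes lies purely in where the update starts and how the hidden state is initialized, which can be sketched as follows. Here `train_step`, the episode format, and all numbers are hypothetical stand-ins for the actual recurrent Q-update.

```python
# Sketch of DRQN's two replay schemes (hypothetical helper names; the
# recurrent Q-learning update itself is abstracted away as train_step).
import random

random.seed(1)

def train_step(hidden, transition):
    """Placeholder for one recurrent Q-update on (o, a, r, o'); returns
    the RNN hidden state to carry into the next transition."""
    o, a, r, o_next = transition
    return hidden + 1          # stand-in: just count processed steps

def sequential_update(episode):
    # Bootstrapped sequential updates: start at the beginning of the
    # episode and carry the hidden state through the entire episode.
    hidden = 0                 # stand-in for the RNN's initial state
    for transition in episode:
        hidden = train_step(hidden, transition)
    return hidden

def random_update(episode, unroll=4):
    # Bootstrapped random updates: start at a random point in the
    # episode and zero the hidden state at the start of the update.
    start = random.randrange(len(episode))
    hidden = 0                 # zeroed initial state
    for transition in episode[start:start + unroll]:
        hidden = train_step(hidden, transition)
    return hidden

# Replay memory of whole episodes of (o, a, r, o') tuples (toy data)
replay_memory = [[(i, 0, 0.0, i + 1) for i in range(10)] for _ in range(5)]
episode = random.choice(replay_memory)   # episodes are sampled randomly
print(sequential_update(episode), random_update(episode))
```

The sequential variant touches every transition in order, while the random variant processes at most a short unroll starting from a zeroed hidden state, which restores (approximately) DQN's random sampling at the cost of a cold-started recurrent state.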
Recurrent Policy Gradients (RPG)
RPG methods belong to policy gradient methods, where NNs are used to approximate policies and the parameters are updated in the direction of higher expected total reward [32]. As mentioned in Section II-D, in policy gradient methods, π_θ(a|s) or μ_θ(s) is a direct mapping from state s to action a. But in RPG, the goal of the agent is to learn a policy that maps history h to action a, which is denoted as π_θ(a|h) or μ_θ(h). The policy gradient is

∇_θ J(θ) = E_h[∇_θ log π_θ(a|h) Q^π(h, a)]   (28)

for stochastic policies, and

∇_θ J(θ) = E_h[∇_θ μ_θ(h) ∇_a Q^μ(h, a)|_{a=μ_θ(h)}]   (29)

for deterministic policies, respectively, where the expectation is taken over the sampling trajectory of histories h. Here, the RNN is trained to obtain information from h via its recurrent state and to compute π_θ(a|h) as well as μ_θ(h).
RPG methods are applied to many partially observed physical control problems, e.g., system identification with variable and unknown information, short-term integration of sensor information to estimate the system state, as well as long-term memory problems. A typical algorithm, Recurrent Deterministic Policy Gradient (RDPG), is proposed by N. Heess, J. J. Hunt et al. based on RPG methods [26].
Memory, RL, and Inference Network (MERLIN)
In RL algorithms, extensive memory can be used to solve POMDP tasks. MERLIN algorithm focuses on memorydependent policies which output the action distribution based on the entire observation sequence in the past [35]. The ideas for MERLIN, including predictive sensory coding, hippocampal representation theory and temporal context model, mainly originate in neuroscience and psychology.
MERLIN is mainly composed of two basic components: a memory-based predictor and a policy. The memory-based predictor is mainly used to compress the input observation into a low-dimensional state variable to represent a state. In each time step, the recurrent network outputs a prior distribution to predict the next state variable. Next, a posterior distribution is obtained based on the current observation and the prior distribution. The posterior distribution corrects the prior distribution to form a more accurate estimate of the state variable. Then, a state variable is sampled from the posterior distribution, used by the policy component to select an action, and stored in the memory for next-step prediction.
Deep Belief Q-Network (DBQN)
DBQN is a model-based method that uses DQN to map a belief b to an action. When the models P, R, and O in a POMDP are known, the belief b can be estimated accurately with Bayes' theorem and sent to the NN as input [36]. The Bellman optimality equation for beliefs is

Q*(b, a) = r(b, a) + γ Σ_{o∈Ω} P(o|b, a) max_{a′} Q*(b′, a′).   (30)
During updating, this approach usually leads to divergence. To stabilize the learning, techniques like experience replay, a target network, and an adaptive learning method are used. For experience replay, tuples (b_t, a_t, r_t, b_{t+1}) are stored in memory and sampled uniformly. The adaptive learning method is used to regulate the parameter adjustment rate of the network [36].
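The belief update that DBQN relies on follows directly from Bayes' theorem: b′(s′) ∝ O(o|s′) Σ_s P(s′|s, a) b(s). The sketch below uses a hypothetical two-state model; the transition and observation probabilities are arbitrary assumptions.

```python
# Bayes' rule belief update for a POMDP with known models (illustrative
# two-state example; the probability tables are arbitrary assumptions).
def belief_update(b, a, o, P, O):
    """Return (b', P(o|b,a)) with b'(s') ∝ O(o|s') * sum_s P(s'|s,a) b(s)."""
    states = list(b)
    b_next = {}
    for s2 in states:
        b_next[s2] = O[s2][o] * sum(P[s][a][s2] * b[s] for s in states)
    norm = sum(b_next.values())   # P(o|b,a), the weight in the Bellman sum
    return {s2: v / norm for s2, v in b_next.items()}, norm

# Two states, one action (0), two observations ("ok", "alarm")
P = {"good": {0: {"good": 0.9, "bad": 0.1}},
     "bad":  {0: {"good": 0.2, "bad": 0.8}}}
O = {"good": {"ok": 0.8, "alarm": 0.2},
     "bad":  {"ok": 0.3, "alarm": 0.7}}

b = {"good": 0.5, "bad": 0.5}
b, p_o = belief_update(b, 0, "alarm", P, O)
print(b)   # the belief shifts toward "bad" after an "alarm" observation
```

In DBQN the resulting belief vector b′ is what gets fed to the Q-network as input, so the network only ever sees a fully observable (belief-)state.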
II-F DRL Beyond MDP: Multi-Agent DRL
In the previous sections, we mainly discussed the DRL methods for single-agent cases. In practice, there are situations where multiple agents need to work together, e.g., manipulation in multi-robot systems and cooperative driving of multiple vehicles. For these cases, DRL methods for multi-agent systems are designed.
A multi-agent system consists of a group of autonomous, interacting agents sharing a common environment, and has a good degree of robustness and scalability [39]. The multiple agents in the system can interact with each other in cooperative or competitive settings, and hence the concept of stochastic game is introduced to extend MDP into the multi-agent setting. A stochastic game or multi-agent MDP with N agents is defined as a tuple (S, A_1, …, A_N, P, R_1, …, R_N), where

S is the discrete set of states,

A_i, i = 1, …, N, are the discrete sets of actions available to the agents, yielding the joint action set A = A_1 × ⋯ × A_N,

P: S × A × S → [0, 1] is the state transition probability function,

R_i, i = 1, …, N, are the reward functions for the agents.
In multi-agent MDP, the state transitions depend on the joint action of all the agents a = (a_1, …, a_N), where a ∈ A and a_i ∈ A_i. In fully-collaborative problems, all the agents share the same reward, i.e., R_1 = R_2 = ⋯ = R_N. In fully-competitive problems, the agents have opposite rewards with Σ_{i=1}^{N} R_i = 0; therefore, R_1 = −R_2 in the typical scenario with two agents [39]. Multi-agent MDP problems that are neither fully collaborative nor fully competitive are mixed games.
In multi-agent RL, each agent learns to improve its own policy by interacting with the environment to obtain rewards. For each agent, the environment is usually complex and dynamic, and the system may encounter the action space explosion problem. Moreover, since multiple agents are learning at the same time, when the policies of the other agents change, the optimal policy of a particular agent may also change. This may affect the convergence of the learning algorithm and cause instability.
The simplest approach to learning in multi-agent settings is to use independently learning agents. For example, independent Q-learning is an algorithm in which each agent independently learns its own policy, treating the other agents as part of the environment [40]. However, independent Q-learning cannot deal with the non-stationary environment problem. Combining game theory with RL, some typical algorithms for multi-agent RL (MARL) have been studied, aiming to solve the problems mentioned above. The Minimax-Q algorithm is an approach that incorporates the two-player zero-sum game and the TD method of Q-learning [41]. The Nash Q-Learning algorithm extends the Minimax-Q algorithm from a zero-sum game to a general-sum game for multiple players [42]. The Friend-or-Foe Q-Learning (FFQ) algorithm is also derived from the Minimax-Q algorithm, and transforms the general-sum game of a multi-agent system into a zero-sum game of two agents [43]. Note that all three methods mentioned above need to maintain the Q functions in the learning process, so each agent needs a very large space to store them. In order to reduce the space dimension, WoLF-PHC combines the "Win or Learn Fast (WoLF)" rule with policy hill-climbing (PHC), where each agent maintains the Q functions only by knowing its own actions [44]. In recent years, the DRL methods for single-agent cases have been extended to the multi-agent cases as discussed below.
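As a minimal sketch of the independent Q-learning baseline discussed above, consider two agents repeatedly playing a stateless coordination game, each treating the other as part of its environment. The payoffs, learning rate, and exploration rate are arbitrary illustrative choices.

```python
# Independent Q-learning in a two-agent coordination game (toy sketch;
# payoffs and learning parameters are arbitrary).
import random

random.seed(0)

def reward(a1, a2):
    # Both agents get reward 1 if they choose the same action, else 0
    return 1.0 if a1 == a2 else 0.0

Q = [[0.0, 0.0], [0.0, 0.0]]   # Q[agent][action]; the game is stateless
alpha, eps = 0.1, 0.2

def act(agent):
    if random.random() < eps:                      # epsilon-greedy
        return random.randrange(2)
    return max(range(2), key=lambda a: Q[agent][a])

for _ in range(5000):
    a1, a2 = act(0), act(1)
    r = reward(a1, a2)
    # Each agent updates its OWN Q independently, treating the other
    # agent as part of the environment (the source of non-stationarity)
    Q[0][a1] += alpha * (r - Q[0][a1])
    Q[1][a2] += alpha * (r - Q[1][a2])

print(Q)  # both agents come to favor the same action
```

The non-stationarity problem is visible here in miniature: each agent's Q-values estimate a reward that depends on the other agent's changing policy, which is exactly what Minimax-Q, Nash Q-Learning, and the DRL variants below try to account for explicitly.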
Multi-Agent Value-Based Methods
The experience replay mechanism in the DQN algorithm is not designed for the non-stationary environment in multi-agent systems. Several variants of DQN have been proposed to deal with this problem.
Foerster et al. [45] introduced two methods for stabilizing experience replay of DQN in multi-agent DRL. In the multi-agent importance sampling (MAIS) algorithm, off-environment importance sampling is introduced to stabilize experience replay, where obsolete data decays naturally. In the multi-agent fingerprints (MAF) algorithm, each agent conditions only on those values that actually occur in its replay memory to stabilize experience replay; a low-dimensional fingerprint is designed to contain this information and to disambiguate the age of the samples retrieved from the replay memory.
In [46], a coordinated multi-agent DRL method is designed based on DQN. Faster and more scalable learning is realized by using transfer planning. To coordinate between multiple agents, the global Q-function is factorized as a linear combination of local subproblems. Then, the max-plus coordination algorithm is applied to optimize the joint global action over the entire coordination graph.
Multi-Agent Policy-Gradient Methods
Policy gradient methods usually exhibit very high variance when coordination of multiple agents is required. In order to overcome this challenge, several algorithms adopt the framework of centralized training with decentralized execution.
In the counterfactual multi-agent policy gradient (COMA-PG) algorithm, a centralized critic is used to estimate the Q-function, and decentralized actors are used to optimize the policies of multiple agents. The core idea of the COMA-PG algorithm is to apply a counterfactual baseline, which marginalizes out a single agent's action while keeping the other agents' actions fixed. Moreover, a critic representation is introduced for efficiently evaluating the counterfactual baseline in a single forward pass. The experiments in [47] show that the COMA-PG algorithm has good final performance and an efficient training speed.
Multi-agent deep deterministic policy gradient (MADDPG) is essentially a DPG algorithm that trains each agent with a critic that requires global information and an actor that requires only local information. It allows each agent to have its own reward function, so that it can be used for cooperative or competitive tasks. The core idea of the MADDPG algorithm is centralized training with distributed execution: the training process uses centralized learning to train the critic and the actor, while at execution time the actor only needs to know the local information. The critic requires the policy information of the other agents; the study in [25] gives a method of estimating the other agents' policies that uses only their observations and actions. By using a policy ensemble to learn multiple policies for each agent and optimizing the overall effect of all the policies, the stability and robustness of the algorithm are improved.
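The division of information between centralized critics and decentralized actors can be sketched structurally. The functions below are stand-ins (real MADDPG uses neural networks and the DDPG updates); all names and numbers are illustrative assumptions.

```python
# Structural sketch of centralized training with decentralized execution
# (function shapes only; actual MADDPG trains these as neural networks).
N = 3                                   # number of agents

def actor(i, local_obs):
    """Decentralized: agent i maps ONLY its own observation to an action."""
    return -0.5 * local_obs             # stand-in for a learned policy

def critic(i, all_obs, all_actions):
    """Centralized: agent i's critic scores the JOINT observation-action
    pair, which removes the non-stationarity an independent learner sees."""
    return sum(o * a for o, a in zip(all_obs, all_actions))  # stand-in Q_i

# Training time: every critic sees all observations and all actions
obs = [1.0, 2.0, -1.0]
acts = [actor(i, obs[i]) for i in range(N)]
q_values = [critic(i, obs, acts) for i in range(N)]

# Execution time: each actor needs only its local observation
decisions = [actor(i, obs[i]) for i in range(N)]
print(q_values, decisions)
```

The point of the sketch is the signatures: `critic` takes the joint (obs, actions) and exists only during training, while `actor` depends solely on local input and is all that is deployed.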
Based on the introduction above, we list classical algorithms for DRL in Table I and summarize the characteristics of each algorithm.
Classification  Classical algorithms  Feature  Monte Carlo/Actor-critic method  Action space
Value-based  Deep Q-network (DQN) [5]  \  \  discrete
Value-based  Double Deep Q-network (DDQN) [10]  \  \  discrete
Value-based  DDQN with duel architecture [12]  \  \  discrete
Value-based  DDQN with Proportional Prioritization [11]  \  \  discrete
Value-based  Deep Belief Q-network (DBQN) [37]  POMDP  \  discrete
Value-based  Deep Recurrent Q-network (DRQN) [34]  POMDP  \  discrete
Value-based  Multi-agent Importance Sampling (MAIS) [45]  MA  \  discrete
Value-based  Coordinated Multi-agent DQN [46]  MA  \  discrete
Value-based  Multi-agent Fingerprints (MAF) [45]  MA  \  discrete
Policy-based  REINFORCE [16]  \  Monte Carlo
Policy-based  Soft Actor-Critic (SAC) [20]  \  Actor-critic
Policy-based  Q-Prop [29]  \  Monte Carlo and Actor-critic
III General Reinforcement Learning Model for Autonomous IoT
Before we discuss the RL model for AIoT systems, we first examine that of a wireless sensor and actuator network (WSAN), which can be considered as an element or a simplified version of AIoT. A WSAN consists of a group of sensors that gather information about their environment, and a group of actuators that interact with and act on the environment, where all elements communicate wirelessly. In the RL model for WSAN illustrated in Fig. 12, an agent observes aspects of its environment through the sensors, and chooses control actions that are implemented by the actuators. The chosen action determines the value of the immediate reward as well as influences the dynamics of the environment. The agent communicates with the sensors and actuators to receive state information and send control commands.
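The sense-decide-actuate loop of the WSAN RL model can be written schematically as follows. The environment, the placeholder policy, and the reward below are hypothetical choices for illustration, not a learned controller.

```python
# Schematic RL loop for a WSAN agent (hypothetical interfaces; the
# policy here is a fixed proportional rule, not a learned one).
import random

random.seed(0)

class WSANEnvironment:
    """Stand-in environment: the state is a scalar the actuators nudge."""
    def __init__(self):
        self.state = 5.0
    def sense(self):                   # sensors report the current state
        return self.state
    def actuate(self, action):        # actuators act on the environment
        self.state += action + random.uniform(-0.1, 0.1)  # dynamics + noise
        return -abs(self.state)       # reward: keep the state near zero

env = WSANEnvironment()
total_reward = 0.0
for step in range(100):
    state = env.sense()                 # agent receives state via sensors
    action = -0.2 * state               # placeholder policy (not learned)
    total_reward += env.actuate(action) # control command sent to actuators
print(total_reward, env.sense())
```

An RL agent would replace the fixed `action = -0.2 * state` rule with a policy improved from the observed (state, action, reward) stream.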
Compared with WSAN, the AIoT has a more complex ecosystem that encompasses identification, sensing, communication, computation, and services. A typical AIoT architecture consists of three fundamental building blocks as shown in Fig. 13:

Perception layer: corresponds to the physical autonomous systems in which IoT devices with sensors and actuators interact with the environment to acquire data and exert control actions;

Network layer: corresponds to the IoT communication networks including wireless access networks and the Internet that discover and connect the IoT devices to the edge/fog servers and cloud servers for data and control command transmission;

Application layer: corresponds to the IoT edge/fog/cloud computing systems for data processing/storage and control actions determination.
Due to the more sophisticated system architecture, the RL models for AIoT systems are more complex than that of WSAN illustrated in Fig. 12. The environment can include one or more layers of the AIoT architecture. The agent(s) can be located at the IoT devices, the edge/fog/cloud servers, and the wireless access points. In the following, we first define the basic RL elements such as state, action, and reward for each layer, respectively. Then, we define the RL elements when the environment includes all the three layers as an integrated whole.
III-A Perception Layer
When the environment only includes the perception layer, the physical system dynamics are modelled by a controlled stochastic process with the following state, action, and reward.

Physical system state s_P, e.g., the on-off status of the actuators, the RGB images of the system, the locations of the agents;

Actuator control action a_P, e.g., controlling the movement of a robot, adjusting the driving speed and direction of a vehicle, turning on/off a device;

Physical system performance reward r_P, e.g., the energy consumption in a power grid, how fast a mobile agent such as a robot or a vehicle can move, or whether it stays away from obstacles.
III-B Network Layer
When the environment only includes the network layer, the network dynamics are modelled by a controlled stochastic process with the following state, action, and reward.

Network resource state s_N, e.g., the amount of allocated bandwidth, the signal-to-interference-plus-noise ratio, the channel vector of a finite-state Markov channel model;

Communication resource control action a_N, e.g., the power allocation, the multi-user scheduling, the subchannel allocation in an OFDM system;

Network performance reward r_N, e.g., the transmission delay, the transmission error probability, the transmission power consumption.
III-C Application Layer
When the environment only includes the application layer, the edge/fog/cloud computing system dynamics are modelled by a controlled stochastic process with the following state, action, and reward.

Computing resource state s_C, e.g., the number of virtual machines (VMs) currently running, the number of tasks buffered in the queue for processing;

Computing resource control action a_C, e.g., the caching selection, the task offloading decisions, the virtual machine allocation;

Computing system performance reward r_C, e.g., the utilization rate of the computing resources, the processing delay of the offloaded tasks.
III-D Integration of Three Layers
When the environment includes all the three layers of AIoT architecture, the RL models generally include elements defined as follows.

AIoT state (s) includes the aggregation of the physical system state s_P, the network resource state s_N, and the computing resource state s_C, i.e., s = (s_P, s_N, s_C);

AIoT action (a) includes the aggregation of the actuator control action a_P, the communication resource control action a_N, and the computing resource control action a_C, i.e., a = (a_P, a_N, a_C);

AIoT reward (r) is normally set to optimize the physical system performance r_P, which can be expressed as a function of the network performance r_N and the computing system performance r_C, i.e., r = r_P(r_N, r_C).
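The three-layer aggregation can be expressed directly in code. The field names, example values, and the reward combination below are illustrative assumptions rather than notation from this paper.

```python
# Illustrative aggregation of the three-layer AIoT state, action, and
# reward (field names and the reward weighting are assumptions).
from dataclasses import dataclass

@dataclass
class AIoTState:
    physical: dict      # e.g. {"actuator_on": True, "location": (3, 4)}
    network: dict       # e.g. {"bandwidth_mhz": 20, "sinr_db": 12.5}
    computing: dict     # e.g. {"running_vms": 4, "queued_tasks": 7}

@dataclass
class AIoTAction:
    actuator: str       # e.g. "turn_off"
    communication: dict # e.g. {"subchannel": 3, "tx_power_dbm": 10}
    computing: dict     # e.g. {"offload_to": "edge"}

def aiot_reward(physical_perf, network_perf, computing_perf):
    """Reward targets physical performance, expressed as a function of
    network and computing performance (the 0.1 weights are arbitrary)."""
    return physical_perf - 0.1 * network_perf["delay_ms"] \
                         - 0.1 * computing_perf["proc_delay_ms"]

s = AIoTState({"actuator_on": True, "location": (3, 4)},
              {"bandwidth_mhz": 20, "sinr_db": 12.5},
              {"running_vms": 4, "queued_tasks": 7})
a = AIoTAction("turn_off", {"subchannel": 3, "tx_power_dbm": 10},
               {"offload_to": "edge"})
r = aiot_reward(10.0, {"delay_ms": 15.0}, {"proc_delay_ms": 25.0})
print(r)
```

In a concrete system each field would be filled by the corresponding layer's agent, which is what makes the information sharing discussed below necessary.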
As the agent in RL is a logical concept, the RL problem in each layer can be solved by the agent in its respective layer, which observes the states and rewards from its environment and learns policies to determine the corresponding actions, as shown in Fig. 13. However, the physical location of an agent can be different from its logical layer. We classify the devices in which an agent may be located, according to the physical locations of the devices, as

perception layer devices, i.e., IoT devices;

network layer devices, i.e., wireless access points;

application layer devices, i.e., edge/fog/cloud servers.
As shown in Fig. 13, the mapping between the logical layer of an agent and its physical locations is given. A perception layer agent may be located in IoT devices and/or edge/fog/cloud servers. A network layer agent may be located in wireless access points and/or IoT devices (e.g., for Device-to-Device communications). An application layer agent may be located in edge/fog/cloud servers and/or even IoT devices (e.g., to perform task offloading).
When the environment of an RL problem includes more than one layer, the agents of different layers need to share information and jointly optimize their policies. For example, the network layer may provide transmission delay information to the perception layer to be included as part of the system state; or, the perception layer may provide its optimization objective to the network layer to formulate the reward function. When the physical locations of the agents of different layers are the same, e.g., when both the perception layer agent and the application layer agent are located at the cloud servers, a single logical agent combining the agents of different layers can be considered for the RL problem.
IV Applications of Deep Reinforcement Learning in Autonomous IoT
Although AIoT is a new trend in IoT that has not been adequately studied by existing research, the respective applications of DRL in each of the three layers of the AIoT architecture have been widely studied by recent works. Therefore, in this section we provide a literature review of the applications of DRL in the perception layer (physical autonomous systems), the network layer (IoT communication networks), and the application layer (IoT edge/fog/cloud computing systems). As there is a great variety of physical autonomous systems, we focus on three types of systems that have received the most attention in DRL research for the perception layer, i.e., autonomous robots, smart vehicles, and smart grid. The framework of the literature review is given in Fig. 14. Note that some IoT communication network technologies and IoT edge/fog/cloud computing technologies are designed specifically for a particular physical autonomous system, e.g., vehicular edge/fog/cloud computing and vehicular networks for smart vehicles, and cloud robotics for autonomous robots. In the following survey, we discuss these technologies in the respective physical autonomous system subsections, while the technologies discussed under IoT communication networks and IoT edge/fog/cloud computing systems are those universal to various types of physical autonomous systems.
IV-A Perception Layer: Autonomous Robots
The applications of DRL methods in autonomous robots have been widely discussed. Existing research includes the mobile behavior control of robots, robotic manipulation, the management of multi-robot systems, and cloud robotics.
IV-A1 Mobile Behavior Control
The mobile behavior control mainly refers to the path planning, navigation, and general movement control of robots. DRL approaches have been applied in many existing works for this purpose.
In [48], the authors apply DQN to a robot behavior learning simulation environment, so that mobile robots can learn good mobile behavior by using high-dimensional visual information as input data. The authors incorporate profit sharing methods into DQN to speed up learning, and the method reuses the best target network in the case of a sudden drop in learning performance. To solve the problem of mobile robot path planning, DQN is designed in [49] and DDPG is applied in [50]. A mobile robot navigation problem in [51] is solved by applying a hybrid A3C method.
IV-A2 Robotic Manipulation
Since intelligent robots usually help to perform operation tasks in practice, appropriate control schemes are necessary for successful manipulation. The problem of controlling robots to solve compound tasks is solved by a hierarchical DRL algorithm in [52]. In [53], the authors demonstrate that a DRL algorithm based on off-policy training of deep Q functions can be applied to complex three-dimensional (3D) operation tasks, and can effectively learn DNN strategies to train real physical robots; the policy updates are pooled asynchronously to decrease the training time. Similarly, the problem of learning vision-based dynamic manipulation skills is solved by using a scalable DQN approach in [54]. In [55], two sample-efficient DRL algorithms, i.e., deep P-network (DPN) and dueling deep P-network (DDPN), are applied to real robotic cloth manipulation tasks.
IV-A3 Multi-Robot System
In some cases, multiple robots are required to collaborate properly to fulfil tasks that are difficult for an individual robot to accomplish. A review on multi-agent RL in multi-robot systems is provided in [56]. The research in [57] investigates a DRL approach to the collective behavior acquisition of swarm robotics systems, where the multiple robots are expected to collect information in parallel and share their experience to accelerate the learning. In [58], the authors propose a collaborative multi-robot RL method, which realizes task learning and the emergence of heterogeneous roles under a unified framework; the method interleaves online execution and relearning to accommodate environmental uncertainty and improve performance. The study in [59] extends the A3C algorithm from single-agent problems to a multi-robot scenario, where the robots work together toward a common goal; the policy and critic learning are centralized, while the policy execution is decentralized. A decentralized sensor-level collision avoidance policy for multi-robot systems is proposed in [60], in which a multi-scenario multi-stage training framework based on policy gradient methods is used to learn the optimal policy for a large number of robots in a rich, complex environment. The expansion of the learning space is an issue to be tackled in multi-robot systems; the methodology in [61] is proposed to minimize the learning space through the use of behaviors and conditions.
IV-A4 Cloud Robotics
The concept of cloud robotics allows the robotic system to offload computing-intensive tasks from the robots to the cloud [62]. Cloud robotics applications include perception and computer vision, navigation, grasping or manipulation, and manufacturing or service robotics. In [63], an effective transfer learning scheme based on lifelong federated reinforcement learning (LFRL) is proposed for navigation in cloud robotic systems, where the robots can effectively use prior knowledge and quickly adapt to new environments. The authors in [64] propose an RL-based resource allocation scheme, which can help the cloud decide whether a request should be accepted and how many resources should be allocated. The scheme realizes autonomous management of computing resources through online learning, reduces human participation in scheme planning, and improves the overall utility of the system in the long run.
IV-B AIoT Perception Layer: Smart Vehicles
The development of IoT technology has promoted the development of intelligent transportation systems (ITS). In the Internet of Vehicles (IoV), smart vehicles with IoT capabilities, including sensing, communications, and data processing, can possess artificial intelligence to enhance driving assistance. The existing works on the applications of DRL in smart vehicles mainly include studies on autonomous driving, vehicular networks, and vehicular edge/fog computing.
IV-B1 Autonomous Driving
The application of DRL methods to the control of autonomous vehicles is addressed in a number of existing works. The autonomous driving problem can be formulated as an MDP, where the driving status, such as the position and velocity of the autonomous vehicle as well as of other non-autonomous vehicles in proximity, is usually characterized as the state, and the driving decisions of the autonomous vehicle, such as accelerating and changing lanes, are characterized as actions. The rewards are usually related to the assessment criteria of the driving operations, such as smoothness and speed.
In [65] and [66], deep Q-learning is applied to control simulated cars via a DRL-based algorithm. In [67], the authors address autonomous driving issues by presenting an RL-based approach combined with formal safety verification to ensure that only safe actions are chosen at any time; a DRL agent learns to drive as close as possible to a desired velocity by executing reasonable lane changes on simulated highways with an arbitrary number of lanes. Leveraging the advances in DRL, the authors in [68] use Flow to develop reliable controllers in a mixed-autonomy traffic scenario. In [69], the leading vehicle and the traffic signal timing condition are taken into account when applying an RL-based method to control the speed of a vehicle. The problem of autonomous vehicle navigation between lanes is formulated as an MDP and solved via RL-based methods in [70]. In [71], the road geometry is taken into account in the MDP model in order to be applicable to more diverse driving styles. The authors in [72] apply a continuous, model-free DRL algorithm to autonomous driving, where the distance travelled by the autonomous vehicle is used to evaluate the reward. The study in [73] aims to optimize the driving utility of the autonomous vehicle, and enables the vehicle to jointly select the motion planning action performed on the road and the communication action of querying the sensed information from the infrastructure. The problem of ramp merging in autonomous driving is tackled in [74], where an LSTM is applied to produce an internal state containing historical driving information, and DQN is applied for Q-function approximation. The authors in [75] review the applications and address the challenges of real-world deployment of DRL in autonomous driving.
There are also studies on the cooperative driving of multiple vehicles. In [76], the authors present a novel method of cooperative movement planning, where RL is applied to solve the decision-making task of how two cars coordinate their movements to avoid collisions and then return to their intended paths. A multi-agent multi-objective RL traffic signal control framework is proposed in [77], which simulates the driver's behavior, e.g., acceleration or deceleration, continuously in the space and time dimensions.
IV-B2 Vehicular Networks
The concept of vehicular networking brings a new level of connectivity to vehicles, and has become a key driver of ITS. The control functionalities in a vehicular network can be divided into three parts according to their usages: communication control, computing control, and storage control [78]. In vehicular networks, problems such as resource allocation, caching, and networking can be formulated and solved via DRL. In [79] and [80], the applications of machine learning in studying the dynamics of vehicular networks and making informed decisions to optimize network performance are discussed. In [81], the authors use a DRL approach to perform joint resource allocation and scheduling in vehicle-to-vehicle (V2V) broadcast communications; in the system, each vehicle makes a decision based on its local observations without the need to wait for global information. A DRL algorithm based on echo state network (ESN) cells is proposed in [82] to provide an interference-aware path planning scheme for a network of cellular-connected unmanned aerial vehicles (UAVs). In [83], the authors develop an integration framework that enables dynamic orchestration of networking, caching, and computing resources to improve the performance of vehicular networks. The resource allocation strategy is formulated as a joint optimization problem, in which the gains of networking, caching, and computing are all taken into consideration, and a double-dueling deep Q-network algorithm is proposed to solve it. Similarly, deep Q-learning is applied in [84] to learn a scheduling policy that can guarantee both safety and quality-of-service (QoS) concerns in an efficient vehicular network.
IV-B3 Vehicular Edge/Fog/Cloud Computing
Emerging vehicular applications require more computing and communication capabilities to perform well in computing-intensive and latency-sensitive tasks. Vehicular Cloud Computing (VCC) provides a new paradigm in which vehicles interact and collaborate to sense the environment, process the data, propagate the results, and more generally share resources [85]. Moreover, Vehicular Edge/Fog Computing (VEC/VFC) focuses on moving computing resources to the edge of the network to resolve latency constraints and reduce cloud ingress traffic [86, 87, 88]. The studies in [89] and [90] focus on the service offloading issues in the IoV. The determination of offloading decisions for multiple tasks is considered as a long-term planning problem. Service offloading decision frameworks are proposed that can provide the optimal policy via DRL. The authors in [91] propose an optimal computing resource allocation scheme to maximize the total long-term expected return of the vehicular cloud computing system. With multi-access edge computing techniques, roadside units (RSUs) can provide fast caching services to moving vehicles on behalf of content providers. In [92], the authors apply an MDP to model the caching strategy, and propose a heuristic Q-learning solution together with vehicle movement prediction based on an LSTM network.
IV-C AIoT Perception Layer – Smart Grid
The integration of distributed renewable energy sources (DRES) into the power grid introduces the need for autonomous and smart energy management capabilities in the smart grid, due to the intermittent and stochastic nature of renewable energy sources (RES). With advanced metering infrastructure (AMI) and various types of sensors in the power grid collecting real-time power generation and demand data, RL and DRL provide promising methods to autonomously learn efficient energy management policies in such a complex environment with uncertainty. Specifically, the historical data can be leveraged by powerful DRL algorithms to learn optimal decisions that cope with the high uncertainty of the electrical patterns. A review of machine learning applications in smart grids is presented in [93]. Different from [93], we focus only on the applications of RL and DRL to the energy management problem with DRES.
IV-C1 Energy Storage Management
One promising method to deal with the lack of knowledge of future electricity generation and consumption is energy storage. Direct energy storage, such as in a battery, is one of the energy storage options. RL/DRL applications in microgrids with energy storage systems (ESSs), where the goal is to determine the optimal charging/discharging policy, have been studied in recent literature.
In [94], the problem of optimally activating the storage devices is formulated as a sequential decision-making problem. The problem is then solved by a DQN-based approach, without knowing the future electricity consumption and weather-dependent PV production at each step. The authors in [95] develop an intelligent dynamic energy management system (IDEMS) for a smart microgrid. The system can effectively schedule the backup battery energy storage and gives a robust performance under different battery energy storage conditions. The authors in [96] design an interconnection topology and an RL-based algorithm to optimize the coordination of different energy storage systems (ESSs) in a microgrid. In [97], a novel dynamic energy management system is proposed to deal with the real-time dispatch problems of microgrids. The developed energy management system can optimize the long-term operational costs of microgrids without long-term forecasts or distribution information about the uncertainty. The authors in [98] present a multi-agent-based energy and load management approach for distributed energy resources in a microgrid. The suppliers and consumers of electricity maximize their profit by using a model-free Q-learning algorithm. A framework based on RL is presented in [99] to control the operation, i.e., charging and discharging, of a battery storage device. The objective is to minimize the amount of energy bought from or sold to a microgrid, where a residential consumer, photovoltaic (PV) system, inverters, and battery storage facility are considered.
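As a concrete illustration of this line of work, the following minimal sketch applies tabular Q-learning to a toy battery charging/discharging problem. The environment (five state-of-charge levels, two random price levels, and a revenue-based reward) is an illustrative assumption of ours, not the exact model of any of the works cited above.

```python
import random

# Tabular Q-learning for a toy battery charging/discharging problem.
# The environment below (5 state-of-charge levels, 2 random price levels,
# revenue-based reward) is an illustrative assumption.
LEVELS = 5            # discretized battery state of charge, 0..4
ACTIONS = (-1, 0, 1)  # discharge one unit, idle, charge one unit
PRICES = (1.0, 3.0)   # low/high electricity spot price

def step(soc, action, price):
    """Apply a charge/discharge action; discharging earns revenue and
    charging costs money at the current spot price."""
    new_soc = min(max(soc + action, 0), LEVELS - 1)
    return new_soc, -(new_soc - soc) * price

def train(episodes=2000, alpha=0.1, gamma=0.95, eps=0.1):
    q = {((s, p), a): 0.0 for s in range(LEVELS)
         for p in range(len(PRICES)) for a in ACTIONS}
    for _ in range(episodes):
        soc, p = LEVELS // 2, random.randrange(len(PRICES))
        for _ in range(24):  # one day of hourly decisions
            a = (random.choice(ACTIONS) if random.random() < eps
                 else max(ACTIONS, key=lambda x: q[(soc, p), x]))
            nxt, r = step(soc, a, PRICES[p])
            p_next = random.randrange(len(PRICES))  # next hour's price
            # Model-free Q-learning update: no forecast model needed.
            target = r + gamma * max(q[(nxt, p_next), b] for b in ACTIONS)
            q[(soc, p), a] += alpha * (target - q[(soc, p), a])
            soc, p = nxt, p_next
    return q
```

Because the price level is part of the state, the learned Q-table tends to prefer discharging at the high price and charging at the low price, without any forecast of future prices.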
IV-C2 Demand Response
Another method to support the integration of DRES is through demand response (DR) systems, which dynamically adjust electrical demand in response to changing electrical energy prices or other grid signals. Thermostatically controlled loads (TCLs), such as electric water heaters, are a prominent example of loads that offer flexibility at the residential level. In fact, TCLs can be seen as a type of energy storage entity through power-to-heat conversion, in contrast to direct energy storage entities such as batteries. DR can be divided into direct DR and price-driven DR: in the former, the energy consumption profile of a user is adjusted according to a utility signal, while in the latter it is adjusted according to the price. In either case, the energy consumers need to make a continuing sequence of decisions as to whether to consume energy at the current (known) utility/price or to defer power consumption until later, at possibly unknown utility/prices.
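The consume-or-defer decision just described can be written as a small MDP. The following sketch is a toy formulation of ours (two price levels and a linear comfort penalty on deferred load are illustrative assumptions, not the model of any of the cited works): the state is the currently known price together with the backlog of deferred load, and the reward trades off energy cost against discomfort.

```python
from dataclasses import dataclass
import random

# A toy MDP formulation of price-driven demand response: at each hour, the
# consumer either serves the backlog of deferrable load now, at the current
# (known) price, or defers it and pays a comfort penalty per deferred unit.
# The price levels and the penalty weight are illustrative assumptions.
@dataclass
class DemandResponseEnv:
    prices: tuple = (1.0, 3.0)  # possible spot prices
    penalty: float = 0.5        # discomfort cost per deferred load unit
    price: float = 1.0
    backlog: int = 0            # number of deferred load units

    def reset(self):
        self.backlog = 0
        self.price = random.choice(self.prices)
        return (self.price, self.backlog)  # state: known price + backlog

    def step(self, consume: bool):
        if consume:
            cost = self.price * (self.backlog + 1)  # clear backlog + new demand
            self.backlog = 0
        else:
            cost = 0.0
            self.backlog += 1                       # defer this hour's demand
        # Reward = negative energy cost minus comfort penalty on the backlog.
        reward = -cost - self.penalty * self.backlog
        self.price = random.choice(self.prices)     # next price, unknown ahead
        return (self.price, self.backlog), reward
```

Any of the RL/DRL methods surveyed above can then be run against such an environment to learn when deferring consumption pays off.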
An energy optimization problem in a smart grid is formulated in [100], where an online energy scheduling strategy is learned using deep Q-learning and deep policy gradient methods. For the DR problem, the authors in [101] propose a new formulation that casts the fully automated energy management system (EMS) rescheduling problem as an RL problem, and argue that this problem can be solved approximately by decomposing the RL problem over the device clusters. The control scheme in [102] applies a model-free batch RL algorithm in combination with a market-based heuristic, which is tested in a stochastic setting without prior information or system dynamics. In [103], a stochastic modeling framework based on an MDP is presented, in order to employ adaptive control strategies for short-term ancillary services to the power grid using a population of heterogeneous thermostatically controlled loads. The authors in [104] propose a novel approach that uses a CNN to extract hidden state-time features in a load control problem. The CNN is used as a function approximator to estimate the Q function in the supervised learning step of fitted Q-iteration. In [105], the authors study the energy supply plan of a microgrid that supports the operation of a Mobile Edge Computing (MEC) system, with the goal of minimizing the energy consumption in the MEC system. The optimization problem is decomposed into two subproblems: an energy-efficient task allocation problem and an energy supply planning problem. The output of the first subproblem is used as input to solve the second, and model-based deep reinforcement learning (MDRL) is applied to solve them.
IV-C3 Energy Trading
The integration of the DRES into the power grid blurs the distinction between an energy provider and a consumer. This is especially true for a microgrid, which may constantly switch its role between provider and consumer depending on whether its generated energy exceeds or falls short of its demanded energy. In fact, a key goal of smart grid design is to facilitate two-way flow of electricity by enhancing the ability of distributed small-scale electricity producers, such as small wind farms or households with solar panels, to sell energy into the power grid. Due to the unpredictability of the DRES, autonomous control mechanisms to ensure power supply/demand balance are essential. One promising method is the introduction of Broker Agents, which buy electricity from distributed producers and sell electricity to consumers. RL/DRL can be applied for the Broker Agents to learn pricing decisions that effectively maintain this balance, earn profits while doing so, and contribute to the stability of the grid through their continued participation.
To overcome the challenges of implementing dynamic pricing and energy scheduling, the authors in [106] and [107] study RL algorithms that allow each service provider and each customer to learn their policy without prior information about the microgrid. A microgrid energy trading scheme based on RL is proposed in [108], which applies the DQN to improve the utility of the microgrids for the case of microgrids with a large number of connections. In [109], an adaptive learning algorithm is designed to find the Nash equilibrium (NE) of a constrained energy trading game among individual strategic participants with incomplete information. Each player's goal is to maximize its own average utility by generating an action probability distribution based on its private information using a learning automaton scheme. In [110], the authors employ an MDP and RL to investigate the learning of pricing strategies for an autonomous Broker Agent to profitably participate in a Tariff Market.
IV-D AIoT Network Layer – IoT Communication Networks
A reliable and efficient wireless communication network is an essential part of the IoT ecosystem. Such wireless networks range from short-range local area networks, such as Bluetooth, Zigbee/IEEE 802.15.4, and IEEE 802.11, to long-range wide area networks, such as Narrowband Internet of Things (NB-IoT) and LoRaWAN. When designing resource control mechanisms to efficiently utilize the scarce radio resources in transmitting the huge amount of IoT data, IoT networks need to consider the characteristics of IoT devices, such as their massive numbers and their limited energy, memory, and computation resources. Moreover, the requirements of IoT applications, such as low latency and high reliability, have to be considered. One promising approach to developing resource control mechanisms tailored for IoT is to enable IoT devices to operate autonomously in a dynamic environment by using learning frameworks such as DRL [111].
IV-D1 Wireless Sensor Networks
Wireless sensor networks (WSNs) offer practical applications that can directly benefit from artificial intelligence technology. For a large-scale IoT application, sensors are needed in huge numbers. In [112], RL is used for modelling the sensors in the physical, routing, and network layers, where the routing and network layers deal with the communication capabilities of the sensors. The resource scheduling issues among the sensors are solved in order to optimize the lifetime of the sensors, energy usage, and communication costs. A multi-agent system approach on wireless sensor networks is able to tackle the resource constraints in these networks by efficiently coordinating the activities among the nodes. In [113], the authors consider the coordinated sensing coverage problem and study the behavior and performance of four distributed DRL algorithms, i.e., fully distributed Q-learning, distributed value function (DVF), optimistic DRL, and frequency maximum Q-learning (FMQ). Their performance in terms of communication and computational costs, energy consumption, and sensor coverage levels is evaluated and compared. The authors in [114] leverage DRL for router selection in wireless networks with heavy traffic. Compared with existing routing algorithms, the proposed algorithms achieve higher network throughput due to the lower congestion probability.
IV-D2 Wireless Sensor and Actuator Networks
Wireless sensor and actuator networks (WSANs), e.g., ISA SP100.11a and WirelessHART, have special devices known as network managers, which perform tasks such as admission control of devices, definition of routes, and allocation of communication resources. The authors in [115] present the design and implementation of a simulation system based on DQN for mobile actor node control in a WSAN. In [116], a global routing agent with Q-learning is proposed for weight adjustment of a state-of-the-art routing algorithm, aiming to achieve a balance between the overall delay and the lifetime of the network. The study in [117] focuses on a DRL-based sensor scheduling problem for allocating wireless channels to sensors for the purpose of remote state estimation of dynamical systems. The algorithm can be run online, and is model-free with respect to the wireless channel parameters.
IV-D3 NB-IoT
NB-IoT is a technology proposed by 3GPP in Release 13. It offers low energy consumption and extensive coverage to meet the requirements of a variety of social, industrial, and environmental IoT applications. Compared to legacy LTE technologies, NB-IoT increases the number of transmission repetitions to serve users in deep coverage. However, large numbers of repetitions can reduce system throughput and increase the energy consumption of IoT devices, which can shorten their battery life and increase their maintenance costs. In [118], the authors propose a new method based on an RL algorithm to enhance NB-IoT coverage. Instead of employing a random spectrum access procedure, dynamic spectrum access can reduce the number of required repetitions, increase the coverage, and reduce the energy consumption. A cooperative multi-agent deep neural network based Q-learning (CMA-DQN) approach is developed in [119], in which each DQN agent independently controls a configuration variable for each group, in order to maximize the long-term average number of working IoT devices in NB-IoT.
IV-D4 Energy Harvesting
Energy harvesting (EH) is a promising technology for long-term and self-sustainable operation of IoT devices. While energy harvesting can extend the lifetime of IoT devices, it also brings new challenges to resource control due to the stochastic nature of the harvested energy. The study in [120] considers the joint access control and battery prediction problem in a small-cell IoT system including multiple EH user equipments (UEs) and a base station (BS) with limited uplink access channels. A DQN-based scheduling algorithm that maximizes the uplink transmission sum rate is proposed. For the battery prediction problem, under a fixed round-robin access control policy, an RL-based algorithm is developed to minimize the prediction loss without any model knowledge of the energy source and energy arrival process. In [121], the energy management policy in an industrial wireless sensor network is investigated to minimize the weighted packet loss rate under a delay constraint, where the packet loss rate accounts for packets lost during both the sensing and delivering processes. The problem is formulated as an MDP, and stochastic online learning with a post-decision state is applied to derive a distributed energy allocation algorithm with a water-filling structure and a scheduling algorithm based on an auction mechanism.
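The core dynamics that make EH resource control hard can be captured by a minimal battery model: harvested energy arrives stochastically, and a finite battery clips any surplus. The capacity, arrival distribution, and the half-the-battery policy below are illustrative assumptions of ours, not the exact formulations of [120] or [121].

```python
import random

# A minimal energy-harvesting battery model: harvested energy e_t arrives
# stochastically, the agent spends transmit energy a_t, and the finite
# battery of capacity B clips any surplus (overflow is lost). All numbers
# and the rollout policy are illustrative assumptions.
B = 10  # battery capacity in energy units

def battery_step(b, a, e):
    """Battery evolution b_{t+1} = min(b_t - a_t + e_t, B), with the
    allocation a_t capped by the currently stored energy b_t."""
    a = min(a, b)  # cannot spend more energy than is stored
    return min(b - a + e, B), a

def simulate(steps=100, policy=lambda b: b // 2, seed=0):
    """Roll out a simple policy (spend half of the stored energy per slot)."""
    rng = random.Random(seed)
    b, spent = B // 2, 0
    for _ in range(steps):
        e = rng.randint(0, 3)  # stochastic harvested energy this slot
        b, a = battery_step(b, policy(b), e)
        spent += a
    return b, spent
```

An RL agent replaces the fixed `policy` with a learned mapping from the battery state (and possibly channel state) to an energy allocation, balancing immediate throughput against the risk of an empty battery.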
Table II: Typical states, actions, and rewards of RL/DRL models in AIoT

RL/DRL elements | Subcategory | Examples | Related Works
State | Physical system state | Smart grid: e.g., energy demand/storage/consumption, battery discharge efficiency | [94][95][100][105]
 | | Robotics: e.g., position/velocity of the robot, camera image | [48][49][50][51][53][54]
 | | Vehicles: e.g., position/velocity/orientation angle of the vehicle, distance headways between vehicles | [89][65][82]
 | Network resource state | Channel information: e.g., SINR, selection of subchannel | [81][122][123][124][83]
 | | Queue information: e.g., queue length of each user's data buffer | [125]
 | Computation resource state | Available virtual machines | [126][127]
 | | Queue information: e.g., queue length of the task buffer | [126]
Action | Resource control action | Power allocation | [122]
 | | Bandwidth/channel allocation | [89][122][128]
 | | Cache allocation | [83][129]
 | | Access/handover decision | [120][130]
 | Actuator control action | Smart grid: e.g., turning on/off devices, prioritizing the power dispatch | [95][100]
 | | Robotics: e.g., moving direction of robots, opening/closing of grippers | [48][49][50][54]
 | | Vehicles: e.g., moving direction/velocity of vehicles | [65][82][67]
Reward | Physical system performance | Power/energy consumption | [105][125][131][132]
 | | Manipulation objectives: e.g., away from obstacles, reaching the target, a successful/failed grasp | [48][49]
 | Network system performance | Transmission delay | [124][125][126][132]
 | | Transmission reliability: e.g., error probability, packet loss rate | [123][127][133]
 | Computing system performance | Processing delay | [124][128][132]
 | | Utilization rate of computing resources | [126]
IV-E AIoT Application Layer – IoT Edge/Fog/Cloud Computing Systems
Edge/fog/cloud computing is a helpful technique in realizing IoT. In such systems, multiple users can offload their computationally intensive tasks to the edge/fog/cloud servers.
IV-E1 Task Offloading and Resource Allocation
Reasonable decisions need to be made on whether to offload the computing tasks to the edge/fog/cloud servers or perform them locally at the IoT devices, and on the amount of resources allocated to each IoT device. The problems of task offloading and resource allocation in edge/fog/cloud computing are widely discussed. The resources to allocate include both computing resources and communication resources. In [123], a real-time adaptive policy based on deep Q-learning is learned in an MEC system. The policy allocates computing resources for the offloaded tasks of multiple users. In order to meet the reliability requirements of end-to-end services, the objective is to reduce the delay violation probability and the decoding error probability. Similarly, a joint task offloading decision and bandwidth allocation optimization method based on a DQN is designed for the MEC system in [132]. The overall offloading cost is evaluated in terms of energy cost, computation cost, and delay cost. Besides the most commonly considered service delay, the utilization rate of the physical machine and the power consumption are also taken into account when designing the offloading policies via DRL approaches in [126] and [125], respectively. In [128], a deep reinforcement learning based resource allocation (DRLRA) scheme is proposed to allocate computing and network resources adaptively, in order to reduce the delay and balance the use of resources under a varying MEC environment. In [134], several RL methods, e.g., Q-learning, SARSA, Expected SARSA, and Monte Carlo, are applied to solve the Fog RAN resource allocation issues, and the performance and applicability of the methods are verified. In [135], a joint computation offloading and multi-user scheduling algorithm in an NB-IoT edge computing system is proposed to minimize the long-term average weighted sum of delay and power consumption.
Linear value-function approximation and TD learning with a post-decision state and the semi-gradient descent method are applied to derive a simple algorithm for the solution. In [131], a DRL-based approach is applied to manage the mode selection in fog radio access networks (F-RANs). In [136], the authors present a novel DRL-based framework for power-efficient resource allocation in cloud radio access networks (C-RANs). The authors in [127] propose a DRL-based approach that is able to manage data migration in MEC scenarios by learning during the system evolution. In [137], a DRL-based computing offloading approach is proposed to learn the optimal offloading policy in a space-air-ground integrated network (SAGIN).
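Underlying many of the offloading policies above is a per-task comparison between a local execution cost and an offloading cost, each a weighted sum of delay and device energy. The back-of-the-envelope sketch below uses a common parametric form; all parameter names and values (CPU frequencies, uplink rate, the energy coefficient kappa) are made-up illustrative assumptions, not the exact models of [123] or [132].

```python
# Local-vs-offload cost comparison: each task is described by its input size
# (bits) and required CPU cycles; each option is scored by a weighted sum of
# delay and device energy. All parameter values are illustrative assumptions.

def local_cost(cycles, f_local=1e9, kappa=1e-27, w_delay=1.0, w_energy=1.0):
    delay = cycles / f_local                # seconds to compute on the device
    energy = kappa * cycles * f_local ** 2  # common CPU energy model
    return w_delay * delay + w_energy * energy

def offload_cost(bits, cycles, rate=5e6, p_tx=0.5, f_edge=1e10,
                 w_delay=1.0, w_energy=1.0):
    t_up = bits / rate        # uplink transmission time
    t_exec = cycles / f_edge  # execution time at the faster edge server
    energy = p_tx * t_up      # device energy: radio transmission only
    return w_delay * (t_up + t_exec) + w_energy * energy

def decide(bits, cycles):
    """Greedy one-shot decision; DRL-based schemes instead learn this mapping
    from the observed system state, without a closed-form cost model."""
    return "offload" if offload_cost(bits, cycles) < local_cost(cycles) else "local"
```

With these illustrative numbers, compute-heavy tasks with small inputs favor offloading, while light tasks with large inputs favor local execution; DRL approaches learn this trade-off under time-varying channels and server loads, where no such closed-form model is available.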
IV-E2 Caching
Caching IoT data at the network edge is considered able to alleviate the congestion and delays incurred in transmitting IoT data through wireless networks. The research in [129] solves the problem of caching IoT data at the edge with the help of DRL. The proposed data caching policy aims to strike a balance between the communication cost and the loss of data freshness. In [124], the caching strategy is tackled jointly with the offloading policy and resource allocation.
Based on the above literature review, we summarize some typical values of states, actions, and rewards in Table II, arranged in the categories given in Section III corresponding to the three layers of the AIoT architecture.
V Challenges, Open Issues, and Future Research Directions
Although DRL is a powerful theoretical tool that is well-suited to the task of introducing artificial intelligence to AIoT systems, many challenges and open issues remain to be addressed. The following lists some of the future research directions in this area.
V-A Incomplete Perception Problem
In AIoT systems, it might not be possible for the agent to have a perfect and complete perception of the state of the environment. This could be due to

limited sensing capabilities of sensors in the perception layer;

information loss due to limited transmission capability in the network layer.
An important challenge in applying DRL to AIoT systems is to learn with incomplete perception or partially observable states. The MDP model is no longer valid, as the state information is no longer sufficient to support the decision on the optimal action. The action can be improved if more information is available to the agent in addition to the state information. Although the DRL algorithms and methods introduced in Section II.E can be applied, there are still some open issues with the POMDP-based DRL algorithms. Firstly, an agent in a POMDP needs to select an action based on the observation history, whose space grows exponentially. Approaches proposed for this problem require large memory and only work well for small discrete observation spaces [138]. Secondly, when introducing the belief state to POMDP problems, the belief space does not grow exponentially, but knowledge of the model becomes essential for the agent, which is not suitable for many complicated scenarios. Finally, nearly all algorithms for POMDP problems face a challenge referred to as the information gathering and exploitation dilemma. In a POMDP, the agent does not know exactly what the current state is. It needs to decide whether to gather more information about the true state first or to exploit its current knowledge first. Obviously, in order to find the optimal policy, an agent in a POMDP needs more interactions with the environment. Apart from the above challenges associated with POMDP-based DRL problems, the DRL model formulation and parameter optimization differ from one AIoT system to another. Moreover, more efficient algorithms could be designed according to the specific characteristics of AIoT systems.
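To make the belief-state alternative concrete, the following minimal sketch implements the standard Bayes-filter belief update b'(s') ∝ O(o | s', a) Σ_s T(s' | s, a) b(s) for a discrete POMDP. It sidesteps the exponentially growing observation history, but, as noted above, it requires the transition model T and observation model O; the two-state numbers in the usage below are purely illustrative.

```python
# Bayes-filter belief update for a discrete POMDP. The belief replaces the
# exponentially growing observation history, but requires the transition
# model T[(s, a)][s'] and observation model O[s'][o]. Numbers illustrative.

def belief_update(belief, action, obs, T, O):
    """Return b'(s') proportional to O[s'][obs] * sum_s T[(s, action)][s'] * b(s)."""
    n = len(belief)
    new_b = [O[s2][obs] * sum(T[(s, action)][s2] * belief[s] for s in range(n))
             for s2 in range(n)]
    z = sum(new_b)  # normalizing constant: probability of observing obs
    return [x / z for x in new_b]
```

For a two-state example with a "sticky" transition model and an 80%-accurate sensor, starting from a uniform belief, observing the first symbol shifts the belief sharply toward the first state; repeated updates implement exactly the information gathering discussed above.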
V-B Delayed Control Problem
In DRL problems, we normally assume that an action is exerted as soon as it is selected by the agent, and that the corresponding reward is immediately available at the agent. However, a challenge in applying DRL to real-world AIoT systems is to learn despite the existence of control delay, i.e., the delay between measuring a system's state and acting upon it. Control delay is always present in real systems due to the time needed to transport measurement data to the learning agent, compute the next action, and change the state of the actuator. Therefore, it is important to design RL/DRL algorithms that take the control delay into account.
Most existing RL algorithms do not consider the control delay. At each time step t, the state s_t of the environment is observed, and an action a_t is immediately determined by the agent. In practice, however, the action actually executed at time step t might be the action generated d time steps before, i.e., a_{t-d}. In this case, the next state depends on the current state and a previously determined action, i.e., the pair (s_t, a_{t-d}), instead of the current state and currently determined action pair (s_t, a_t), which makes the state transition violate the Markov property. Therefore, the MDP model on which RL/DRL algorithms are based is no longer valid, and a POMDP model is more appropriate.
In order to deal with the delayed control problem, existing works in RL have developed several methods [139, 140, 141]. The first method [139] incorporates the past actions taken during the length of the delay into the current state when formulating an MDP model, so that classical RL methods such as TD-learning and Q-learning can be applied. However, this method results in a larger state space, with the state dimensionality depending on the number of delay time steps. The second method [140] learns a state transition model so that it can predict the state at which the currently selected action will actually be executed. Then, a model-based RL algorithm can be applied. However, the learning process of the underlying model is usually time-consuming and incurs additional delay itself. Finally, in the third method [141], the classical model-free RL algorithms such as TD-learning and Q-learning are applied, except that at each time step t, the Q function Q(s_t, a_{t-d}) with respect to the current state and the actually executed action is updated, instead of the usual Q(s_t, a_t) with respect to the current state and the currently generated action.
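The first method, state augmentation, can be sketched as a thin wrapper around the environment: the agent's observation is extended with the queue of d actions still in flight, which restores the Markov property at the cost of a state space that grows with d. The gym-like step(action) -> (state, reward) interface below is a simplifying assumption for illustration, not the exact construction of [139].

```python
from collections import deque

# Constant-delay wrapper implementing the state-augmentation idea: the agent
# observes the raw state plus the d actions still "in flight". The wrapped
# environment's step(action) -> (state, reward) interface is an assumption.

class DelayedEnv:
    def __init__(self, env, delay, default_action):
        self.env = env
        self.pending = deque([default_action] * delay)  # actions in flight

    def step(self, action):
        self.pending.append(action)
        executed = self.pending.popleft()  # the action chosen d steps ago
        state, reward = self.env.step(executed)
        # Augmented state: raw state plus the queue of pending actions.
        return (state, tuple(self.pending)), reward
```

A standard MDP-based learner can then be trained on the augmented observations unchanged, at the cost of a state space that grows exponentially in d for discrete actions.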
The above methods mostly focus on the constant delay problem. However, the actual delay in an AIoT system is likely to be stochastic. Moreover, the delay can depend on the communication and computation resource control actions in the IoT communications networks and edge/fog/cloud servers. Therefore, developing RL algorithms to consider stochastic control delay or control delay that depends on other parameters is an open issue. Another important challenge is how to extend the above algorithms from RL to DRL leveraging the powerful neural networks while dealing with the intrinsic complexities.
V-C Multi-Agent Control Problem
The agent in RL is a virtual concept that learns the optimal policy by interacting with the environment. In AIoT system, agents can be implemented in IoT devices, edge/fog servers, and cloud servers as discussed previously. For a single RL task, there are some typical scenarios for the implementation of agents:

centralized architecture: a single agent in a cloud server, edge/fog node, or an IoT device;

distributed architecture: multiple agents with each agent implemented in an IoT device or edge/fog server;

semi-distributed architecture: one centralized agent in a cloud server or edge/fog server and multiple distributed agents in edge/fog servers or IoT devices.
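As a minimal illustration of the distributed architecture and the role of a joint reward, the sketch below trains several independent Q-learners on a toy repeated coordination game in which the team is rewarded only when all agents choose the same action. The game and all parameters are illustrative assumptions of ours; real multi-agent DRL must additionally cope with the non-stationarity that each learner induces in the others' environment.

```python
import random

# Independent Q-learners sharing one joint reward in a toy coordination
# game (reward 1 only when all agents pick the same action). Illustrative
# only: this stateless setting hides the non-stationarity and credit
# assignment difficulties discussed in the text.
N_AGENTS, N_ACTIONS = 3, 2

def train(episodes=5000, alpha=0.1, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_AGENTS)]  # one table per agent
    for _ in range(episodes):
        acts = [rng.randrange(N_ACTIONS) if rng.random() < eps
                else max(range(N_ACTIONS), key=lambda a: q[i][a])
                for i in range(N_AGENTS)]
        reward = 1.0 if len(set(acts)) == 1 else 0.0  # joint team reward
        for i, a in enumerate(acts):                  # independent updates
            q[i][a] += alpha * (reward - q[i][a])
    return q
```

Each agent updates only its own table from the shared reward, mirroring the distributed architecture above; even in this tiny game, an agent's learned values depend on the exploration of its teammates, which is precisely the coupling that complicates joint reward design.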
For the distributed and semi-distributed architectures, enabling efficient collaboration and fair competition among multiple agents in a single RL task is an important challenge. The tasks of the agents in a multi-agent system may differ, yet they are coupled to each other. Therefore, the design of a reasonable joint reward function becomes a challenge, as it may directly affect the performance of the learned policy. Compared to the stationary environment in the single-agent RL problem, the environment in multi-agent RL is complex and dynamic, which brings challenges to the design of multi-agent DRL approaches.
In most existing multi-agent DRL methods, the agents are assumed to have the same capability. For example, the robots in a multi-robot system have the same manipulation ability, or the vehicles in a cooperative driving scenario have the same kinematic performance. Thus, the application of DRL in heterogeneous multi-agent systems remains to be further studied. The heterogeneity makes cooperative decision making more complex, since each agent needs to model the other agents when their capabilities are unknown. Although the multi-agent DRL algorithms and methods introduced in Section II.F can be applied to address the space explosion problem and guarantee the convergence of the algorithm, the DRL model formulation, parameter optimization, and algorithm adaptation and improvement remain open issues. Moreover, significant progress in the field of multi-agent reinforcement learning can be achieved by more intensive cross-domain research between the fields of machine learning, game theory, and control theory.
V-D Joint Resource and Actuator Control Problem
In AIoT systems, there are two levels of control, i.e., resource control and actuator control, as discussed previously. Although the ultimate objective is to optimize the long-term reward of the physical system by selecting appropriate actuator control actions, the computation and network resource control actions also impact the physical system performance through their effects on the network and computation system performance. For example, an efficient network resource control policy can result in larger data transmission rates for the sensory data, and thus allow more information to be available at the cloud server for the agent to derive an improved policy. Currently, most existing research works either optimize the computation and/or network performance for IoT systems, or optimize the physical system performance assuming an ideal communication and computation environment. Therefore, how to jointly optimize the two levels of control actions to achieve an optimized physical system performance is an important open issue for applying DRL in AIoT systems.
When the RL/DRL environment includes more than one layer of the AIoT architecture, the corresponding RL/DRL model becomes more complex, as discussed in Section III. For example, instead of optimizing conventional network performance metrics such as transmission delay, transmission power, and packet loss rate in the network layer, the communication resource control actions need to be selected to optimize the control performance of a physical autonomous system, which may be a function of the network performance. In order to optimize the control performance, the best tradeoff between several network performance metrics may need to be considered. For example, a larger amount of sensory data may be transmitted at the cost of a larger transmission delay, which relieves the incomplete perception problem but aggravates the delayed control problem discussed above. There are many challenges in modeling and solving such complex RL/DRL problems. Firstly, feature selection is a crucial task. An appropriate feature selection can lead to better generalization, which in most scenarios helps with the bias-overfitting tradeoff. When too many features are taken into consideration, it is hard for the agent to determine which features are indispensable. Although some features may play a key role in reconstructing the observation, they may be discarded because they are not directly related to the current task. Secondly, the selection of the algorithm and the function approximator is also a tough task. The function approximator used for the value function or policy converts the features into higher-level abstractions. Sometimes the approximator is too simple to avoid bias, while at other times it is too complex to obtain a good generalization result from the limited dataset, i.e., it overfits. Errors resulting from this bias/overfitting problem need to be overcome, so an appropriate approximator needs to be chosen according to the current task.
Thirdly, in such complex RL/DRL problems, the objective function may need to be modified. Typical approaches include reward shaping and discount factor tuning. Reward shaping adds an additional shaping function F(s, a, s') to the original reward function r(s, a). It is mainly used for DRL problems with sparse and delayed rewards [142]. Discount factor tuning helps to adjust the impact of temporally distant rewards: when the discount factor is high, the training process tends to be unstable in convergence, and when the discount factor is low, some potential rewards are discarded [7]. Hence, modifying the objective function can help to tackle the above problems to some extent.

VI Conclusion
This paper has presented the model, applications, and challenges of DRL in AIoT systems. Firstly, a summary of the existing RL/DRL methods has been provided. Then, a general model of the AIoT system has been proposed, including the DRL framework for AIoT based on the three-layer structure of IoT. The applications of DRL in AIoT have been classified into several categories, and the applied methods and the typical states/actions/rewards in the models have been summarized. Finally, the challenges and open issues for future research have been identified.
References
 [1] P. J. Antsaklis, K. M. Passino, and S. Wang, “An introduction to autonomous control systems,” IEEE Control Syst. Mag., vol. 11, no. 4, pp. 5–13, 1991.
 [2] (2018) Smarter Things: The autonomous IoT. [Online]. Available: http://gdruk.com/smarterthingsautonomousiot/
 [3] M. Mohammadi, A. Al-Fuqaha, S. Sorour, and M. Guizani, “Deep learning for IoT big data and streaming analytics: A survey,” IEEE Communications Surveys & Tutorials, vol. 20, no. 4, pp. 2923–2960, 2018.
 [4] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
 [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
 [6] M. S. Mahdavinejad, M. Rezvan, M. Barekatain, P. Adibi, P. Barnaghi, and A. P. Sheth, “Machine learning for Internet of Things data analysis: A survey,” Digital Communications and Networks, vol. 4, no. 3, pp. 161–175, 2018.
 [7] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, J. Pineau et al., “An introduction to deep reinforcement learning,” Foundations and Trends® in Machine Learning, vol. 11, no. 3–4, pp. 219–354, 2018.
 [8] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [9] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber et al., “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies,” 2001.
 [10] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [11] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
 [12] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, “Dueling network architectures for deep reinforcement learning,” arXiv preprint arXiv:1511.06581, 2015.
 [13] S.-I. Amari, “Natural gradient works efficiently in learning,” Neural computation, vol. 10, no. 2, pp. 251–276, 1998.
 [14] S. M. Kakade, “A natural policy gradient,” in Advances in neural information processing systems, 2002, pp. 1531–1538.
 [15] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
 [16] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3–4, pp. 229–256, 1992.
 [17] V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems, 2000, pp. 1008–1014.
 [18] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, “Sample efficient actor-critic with experience replay,” arXiv preprint arXiv:1611.01224, 2016.
 [19] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International conference on machine learning, 2016, pp. 1928–1937.
 [20] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
 [21] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in ICML, 2014.
 [22] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 [23] G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, A. Muldal, N. Heess, and T. Lillicrap, “Distributed distributional deterministic policy gradients,” arXiv preprint arXiv:1804.08617, 2018.
 [24] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” arXiv preprint arXiv:1802.09477, 2018.
 [25] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
 [26] N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver, “Memory-based control with recurrent neural networks,” arXiv preprint arXiv:1512.04455, 2015.
 [27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 [28] Y. Wu, E. Mansimov, R. B. Grosse, S. Liao, and J. Ba, “Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation,” in Advances in neural information processing systems, 2017, pp. 5279–5288.
 [29] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine, “Q-prop: Sample-efficient policy gradient with an off-policy critic,” arXiv preprint arXiv:1611.02247, 2016.
 [30] G. Shani, J. Pineau, and R. Kaplow, “A survey of point-based POMDP solvers,” Autonomous Agents and Multi-Agent Systems, vol. 27, no. 1, pp. 1–51, 2013.
 [31] P. Dai, C. H. Lin, D. S. Weld et al., “POMDP-based control of workflows for crowdsourcing,” Artificial Intelligence, vol. 202, pp. 52–85, 2013.
 [32] D. Wierstra, A. Foerster, J. Peters, and J. Schmidhuber, “Solving deep memory POMDPs with recurrent policy gradients,” in International Conference on Artificial Neural Networks. Springer, 2007, pp. 697–706.
 [33] K. P. Murphy, “A survey of POMDP solution techniques,” environment, vol. 2, p. X3, 2000.
 [34] M. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” in 2015 AAAI Fall Symposium Series, 2015.
 [35] G. Wayne, C.-C. Hung, D. Amos, M. Mirza, A. Ahuja, A. Grabska-Barwinska, J. Rae, P. Mirowski, J. Z. Leibo, A. Santoro et al., “Unsupervised predictive memory in a goal-directed agent,” arXiv preprint arXiv:1803.10760, 2018.
 [36] M. Egorov, “Deep reinforcement learning with POMDPs,” 2015.
 [37] P. Zhu, X. Li, P. Poupart, and G. Miao, “On improving deep reinforcement learning for POMDPs,” arXiv preprint arXiv:1804.06309, 2018.
 [38] J. N. Foerster, Y. M. Assael, N. de Freitas, and S. Whiteson, “Learning to communicate to solve riddles with deep distributed recurrent Q-networks,” arXiv preprint arXiv:1602.02672, 2016.
 [39] L. Buşoniu, R. Babuška, and B. De Schutter, “A comprehensive survey of multi-agent reinforcement learning,” IEEE Trans. Syst., Man, Cybern. C, Appl., Rev., vol. 38, no. 2, pp. 156–172, 2008.
 [40] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative learning,” 1997.
 [41] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine learning proceedings 1994. Elsevier, 1994, pp. 157–163.
 [42] J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal of machine learning research, vol. 4, no. Nov, pp. 1039–1069, 2003.
 [43] M. L. Littman, “Friend-or-foe Q-learning in general-sum games,” in ICML, vol. 1, 2001, pp. 322–328.
 [44] M. Bowling and M. Veloso, “Rational and convergent learning in stochastic games,” in International joint conference on artificial intelligence, vol. 17, no. 1. Lawrence Erlbaum Associates Ltd, 2001, pp. 1021–1026.
 [45] J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. Torr, P. Kohli, and S. Whiteson, “Stabilising experience replay for deep multi-agent reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning–Volume 70. JMLR.org, 2017, pp. 1146–1155.
 [46] E. Van der Pol and F. A. Oliehoek, “Coordinated deep reinforcement learners for traffic light control,” Proceedings of Learning, Inference and Control of Multi-Agent Systems (at NIPS 2016), 2016.
 [47] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [48] H. Sasaki, T. Horiuchi, and S. Kato, “A study on vision-based mobile robot learning by deep Q-network,” in 2017 56th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE). IEEE, 2017, pp. 799–804.
 [49] J. Xin, H. Zhao, D. Liu, and M. Li, “Application of deep reinforcement learning in mobile robot path planning,” in 2017 Chinese Automation Congress (CAC). IEEE, 2017, pp. 7112–7116.
 [50] T. Yan, Y. Zhang, and B. Wang, “Path planning for mobile robot’s continuous action space based on deep reinforcement learning,” in 2018 International Conference on Big Data and Artificial Intelligence (BDAI). IEEE, 2018, pp. 42–46.
 [51] T. Tongloy, S. Chuwongin, K. Jaksukam, C. Chousangsuntorn, and S. Boonsang, “Asynchronous deep reinforcement learning for the mobile robot navigation with supervised auxiliary tasks,” in 2017 2nd International Conference on Robotics and Automation Engineering (ICRAE). IEEE, 2017, pp. 68–72.
 [52] Z. Yang, K. Merrick, L. Jin, and H. A. Abbass, “Hierarchical deep reinforcement learning for continuous action control,” IEEE Trans. Neural Netw. Learn. Syst., no. 99, pp. 1–11, 2018.
 [53] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3389–3396.
 [54] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke et al., “Scalable deep reinforcement learning for vision-based robotic manipulation,” in Conference on Robot Learning, 2018, pp. 651–673.
 [55] Y. Tsurumine, Y. Cui, E. Uchibe, and T. Matsubara, “Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation,” Robotics and Autonomous Systems, vol. 112, pp. 72–83, 2019.
 [56] E. Yang and D. Gu, “A survey on multi-agent reinforcement learning towards multi-robot systems,” in CIG, 2005.
 [57] T. Yasuda and K. Ohkura, “Collective behavior acquisition of real robotic swarms using deep reinforcement learning,” in 2018 Second IEEE International Conference on Robotic Computing (IRC). IEEE, 2018, pp. 179–180.
 [58] X. Sun, T. Mao, J. D. Kralik, and L. E. Ray, “Cooperative multi-robot reinforcement learning: A framework in hybrid state space,” in 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2009, pp. 1190–1196.
 [59] G. Sartoretti, Y. Wu, W. Paivine, T. S. Kumar, S. Koenig, and H. Choset, “Distributed reinforcement learning for multi-robot decentralized collective construction,” in Distributed Autonomous Robotic Systems. Springer, 2019, pp. 35–49.
 [60] P. Long, T. Fan, X. Liao, W. Liu, H. Zhang, and J. Pan, “Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 6252–6259.
 [61] M. J. Matarić, “Reinforcement learning in the multi-robot domain,” in Robot colonies. Springer, 1997, pp. 73–83.
 [62] O. Saha and P. Dasgupta, “A comprehensive survey of recent trends in cloud robotics architectures and applications,” Robotics, vol. 7, no. 3, p. 47, 2018.
 [63] B. Liu, L. Wang, M. Liu, and C. Xu, “Lifelong federated reinforcement learning: A learning architecture for navigation in cloud robotic systems,” arXiv preprint arXiv:1901.06455, 2019.
 [64] H. Liu, S. Liu, and K. Zheng, “A reinforcement learning-based resource allocation scheme for cloud robotics,” IEEE Access, vol. 6, pp. 17215–17222, 2018.
 [65] A. Yu, R. Palefsky-Smith, and R. Bedi, “Deep reinforcement learning for simulated autonomous vehicle control,” Course Project Reports: Winter, pp. 1–7, 2016.
 [66] M. Vitelli and A. Nayebi, “CARMA: A deep reinforcement learning approach to autonomous driving,” Stanford University, Tech. Rep., 2016.
 [67] B. Mirchevska, C. Pek, M. Werling, M. Althoff, and J. Boedecker, “High-level decision making for safe and reasonable autonomous lane changing using reinforcement learning,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2156–2162.
 [68] C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, “Flow: Architecture and benchmarking for reinforcement learning in traffic control,” arXiv preprint arXiv:1710.05465, 2017.
 [69] H. D. Gamage and J. B. Lee, “Reinforcement learning based driving speed control for two vehicle scenario,” in Australasian Transport Research Forum (ATRF), 39th, 2017, Auckland, New Zealand, 2017.
 [70] M. Mueller, “Reinforcement learning: MDP applied to autonomous navigation,” 2017.
 [71] C. You, J. Lu, D. Filev, and P. Tsiotras, “Highway traffic modeling and decision making for autonomous vehicle using reinforcement learning,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1227–1232.
 [72] A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V.-D. Lam, A. Bewley, and A. Shah, “Learning to drive in a day,” arXiv preprint arXiv:1807.00412, 2018.
 [73] M. K. Pal, R. Bhati, A. Sharma, S. K. Kaul, S. Anand, and P. Sujit, “A reinforcement learning approach to jointly adapt vehicular communications and planning for optimized driving,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 3287–3293.
 [74] P. Wang and C.-Y. Chan, “Formulation of deep reinforcement learning architecture toward autonomous driving for on-ramp merge,” in 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2017, pp. 1–6.
 [75] V. Talpaert, I. Sobh, B. R. Kiran, P. Mannion, S. Yogamani, A. El-Sallab, and P. Perez, “Exploring applications of deep reinforcement learning for real-world autonomous driving systems,” arXiv preprint arXiv:1901.01536, 2019.
 [76] Q. Wang and C. Phillips, “Cooperative collision avoidance for multi-vehicle systems using reinforcement learning,” in 2013 18th International Conference on Methods & Models in Automation & Robotics (MMAR). IEEE, 2013, pp. 98–102.
 [77] M. A. Khamis and W. Gomaa, “Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework,” Engineering Applications of Artificial Intelligence, vol. 29, pp. 134–151, 2014.
 [78] K. Zheng, L. Hou, H. Meng, Q. Zheng, N. Lu, and L. Lei, “Soft-defined heterogeneous vehicular network: architecture and challenges,” IEEE Network, vol. 30, no. 4, pp. 72–80, 2016.
 [79] L. Liang, H. Ye, and G. Y. Li, “Toward intelligent vehicular networks: A machine learning framework,” IEEE Internet of Things Journal, vol. 6, no. 1, pp. 124–135, 2019.
 [80] H. Ye, L. Liang, G. Y. Li, J. Kim, L. Lu, and M. Wu, “Machine learning for vehicular networks,” arXiv preprint arXiv:1712.07143, 2017.
 [81] H. Ye and G. Y. Li, “Deep reinforcement learning based distributed resource allocation for V2V broadcasting,” in 2018 14th International Wireless Communications & Mobile Computing Conference (IWCMC). IEEE, 2018, pp. 440–445.
 [82] U. Challita, W. Saad, and C. Bettstetter, “Interference management for cellular-connected UAVs: A deep reinforcement learning approach,” IEEE Trans. Wirel. Commun., 2019.
 [83] Y. He, N. Zhao, and H. Yin, “Integrated networking, caching, and computing for connected vehicles: A deep reinforcement learning approach,” IEEE Trans. Veh. Technol., vol. 67, no. 1, pp. 44–55, 2018.
 [84] R. F. Atallah, C. M. Assi, and M. J. Khabbaz, “Scheduling the operation of a connected vehicular network using deep reinforcement learning,” IEEE Trans. Intell. Transp. Syst., no. 99, pp. 1–14, 2018.
 [85] A. Mehmood, S. H. Ahmed, and M. Sarkar, “Cyber-physical systems in vehicular communications,” in Handbook of Research on Advanced Trends in Microwave and Communication Engineering. IGI Global, 2017, pp. 477–497.
 [86] Y. Xiao and C. Zhu, “Vehicular fog computing: Vision and challenges,” in 2017 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops). IEEE, 2017, pp. 6–9.
 [87] J. C. Nobre, A. M. de Souza, D. Rosario, C. Both, L. A. Villas, E. Cerqueira, T. Braun, and M. Gerla, “Vehicular software-defined networking and fog computing: integration and design principles,” Ad Hoc Networks, vol. 82, pp. 172–181, 2019.
 [88] Z. Ning, J. Huang, and X. Wang, “Vehicular fog computing: Enabling real-time traffic management for smart cities,” IEEE Wirel. Commun., vol. 26, no. 1, pp. 87–93, 2019.
 [89] Q. Qi, J. Wang, Z. Ma, H. Sun, Y. Cao, L. Zhang, and J. Liao, “Knowledge-driven service offloading decision for vehicular edge computing: A deep reinforcement learning approach,” IEEE Trans. Veh. Technol., 2019.
 [90] Q. Qi and Z. Ma, “Vehicular edge computing via deep reinforcement learning,” arXiv preprint arXiv:1901.04290, 2018.
 [91] K. Zheng, H. Meng, P. Chatzimisios, L. Lei, and X. Shen, “An SMDP-based resource allocation in vehicular cloud computing systems,” IEEE Trans. Ind. Electron., vol. 62, no. 12, pp. 7920–7928, 2015.
 [92] L. Hou, L. Lei, K. Zheng, and X. Wang, “A Q-learning based proactive caching strategy for non-safety related services in vehicular networks,” IEEE Internet Things J., 2018.
 [93] D. Zhang, X. Han, and C. Deng, “Review on the research and practice of deep learning and reinforcement learning in smart grids,” CSEE Journal of Power and Energy Systems, vol. 4, no. 3, pp. 362–370, 2018.
 [94] V. François-Lavet, D. Taralla, D. Ernst, and R. Fonteneau, “Deep reinforcement learning solutions for energy microgrids management,” in European Workshop on Reinforcement Learning (EWRL 2016), 2016.
 [95] G. K. Venayagamoorthy, R. K. Sharma, P. K. Gautam, and A. Ahmadi, “Dynamic energy management system for a smart microgrid,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 8, pp. 1643–1656, 2016.
 [96] X. Qiu, T. A. Nguyen, and M. L. Crow, “Heterogeneous energy storage optimization for microgrids,” IEEE Trans. Smart Grid, vol. 7, no. 3, pp. 1453–1461, 2016.
 [97] P. Zeng, H. Li, H. He, and S. Li, “Dynamic energy management of a microgrid using approximate dynamic programming and deep recurrent neural network learning,” IEEE Transactions on Smart Grid, 2018.
 [98] E. Foruzan, L.-K. Soh, and S. Asgarpoor, “Reinforcement learning approach for optimal distributed energy management in a microgrid,” IEEE Transactions on Power Systems, vol. 33, no. 5, pp. 5749–5758, 2018.
 [99] B. Mbuwir, F. Ruelens, F. Spiessens, and G. Deconinck, “Reinforcement learning-based battery energy management in a solar microgrid,” EnergyOpen, vol. 2, no. 4, p. 36, 2017.
 [100] E. Mocanu, D. C. Mocanu, P. H. Nguyen, A. Liotta, M. E. Webber, M. Gibescu, and J. G. Slootweg, “Online building energy optimization using deep reinforcement learning,” IEEE Trans. Smart Grid, 2018.
 [101] Z. Wen, D. O’Neill, and H. Maei, “Optimal demand response using device-based reinforcement learning,” IEEE Trans. Smart Grid, vol. 6, no. 5, pp. 2312–2324, 2015.
 [102] F. Ruelens, B. J. Claessens, S. Vandael, S. Iacovella, P. Vingerhoets, and R. Belmans, “Demand response of a heterogeneous cluster of electric water heaters using batch reinforcement learning,” in 2014 Power Systems Computation Conference. IEEE, 2014, pp. 1–7.
 [103] E. C. Kara, M. Berges, B. Krogh, and S. Kar, “Using smart devices for system-level management and control in the smart grid: A reinforcement learning framework,” in 2012 IEEE Third International Conference on Smart Grid Communications (SmartGridComm). IEEE, 2012, pp. 85–90.
 [104] B. J. Claessens, P. Vrancx, and F. Ruelens, “Convolutional neural networks for automatic state-time feature extraction in reinforcement learning applied to residential load control,” arXiv preprint arXiv:1604.08382, 2016.
 [105] M. S. Munir, S. F. Abedin, N. H. Tran, and C. S. Hong, “When edge computing meets microgrid: A deep reinforcement learning approach,” IEEE Internet Things J., 2019.
 [106] B.-G. Kim, Y. Zhang, M. Van Der Schaar, and J.-W. Lee, “Dynamic pricing for smart grid with reinforcement learning,” in 2014 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE, 2014, pp. 640–645.
 [107] B.-G. Kim, Y. Zhang, M. Van Der Schaar, and J.-W. Lee, “Dynamic pricing and energy consumption scheduling with reinforcement learning,” IEEE Trans. Smart Grid, vol. 7, no. 5, pp. 2187–2198, 2016.
 [108] L. Xiao, X. Xiao, C. Dai, M. Peng, L. Wang, and H. V. Poor, “Reinforcement learning-based energy trading for microgrids,” arXiv preprint arXiv:1801.06285, 2018.
 [109] H. Wang, T. Huang, X. Liao, H. Abu-Rub, and G. Chen, “Reinforcement learning for constrained energy trading games with incomplete information,” IEEE Trans. Cybern., vol. 47, no. 10, pp. 3404–3416, 2017.
 [110] P. P. Reddy and M. M. Veloso, “Strategy learning for autonomous agents in smart grid markets,” in Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
 [111] T. Park, N. Abuzainab, and W. Saad, “Learning how to communicate in the Internet of Things: Finite resources and heterogeneity,” IEEE Access, vol. 4, pp. 7063–7073, 2016.
 [112] T. P. Kumar and P. V. Krishna, “Power modelling of sensors for IoT using reinforcement learning,” International Journal of Advanced Intelligence Paradigms, vol. 10, no. 1–2, pp. 3–22, 2018.
 [113] J.-C. Renaud and C.-K. Tham, “Coordinated sensing coverage in sensor networks using distributed reinforcement learning,” in 2006 14th IEEE International Conference on Networks, vol. 1. IEEE, 2006, pp. 1–6.
 [114] R. Ding, Y. Xu, F. Gao, X. Shen, and W. Wu, “Deep reinforcement learning for router selection in network with heavy traffic,” IEEE Access, vol. 7, pp. 37 109–37 120, 2019.
 [115] T. Oda, R. Obukata, M. Ikeda, L. Barolli, and M. Takizawa, “Design and implementation of a simulation system based on deep Q-network for mobile actor node control in wireless sensor and actor networks,” in 2017 31st International Conference on Advanced Information Networking and Applications Workshops (WAINA). IEEE, 2017, pp. 195–200.
 [116] G. Künzel, G. P. Cainelli, I. Müller, and C. E. Pereira, “Weight adjustments in a routing algorithm for wireless sensor and actuator networks using Q-learning,” IFAC-PapersOnLine, vol. 51, no. 10, pp. 58–63, 2018.
 [117] A. S. Leong, A. Ramaswamy, D. E. Quevedo, H. Karl, and L. Shi, “Deep reinforcement learning for wireless sensor scheduling in cyber-physical systems,” arXiv preprint arXiv:1809.05149, 2018.
 [118] M. Chafii, F. Bader, and J. Palicot, “Enhancing coverage in narrow band-IoT using machine learning,” in 2018 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2018, pp. 1–6.
 [119] N. Jiang, Y. Deng, O. Simeone, and A. Nallanathan, “Cooperative deep reinforcement learning for multiple-group NB-IoT networks optimization,” in ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 8424–8428.
 [120] M. Chu, H. Li, X. Liao, and S. Cui, “Reinforcement learning based multi-access control and battery prediction with energy harvesting in IoT systems,” IEEE Internet Things J., 2018.
 [121] L. Lei, Y. Kuang, X. S. Shen, K. Yang, J. Qiao, and Z. Zhong, “Optimal reliability in energy harvesting industrial wireless sensor networks,” IEEE Trans. Wirel. Commun., vol. 15, no. 8, pp. 5399–5413, 2016.
 [122] J. Zhu, Y. Song, D. Jiang, and H. Song, “A new deep Q-learning based transmission scheduling mechanism for the cognitive Internet of Things,” IEEE Internet Things J., vol. 5, no. 4, pp. 2375–2385, 2018.
 [123] T. Yang, Y. Hu, M. C. Gursoy, A. Schmeink, and R. Mathar, “Deep reinforcement learning based resource allocation in low latency edge computing networks,” in 2018 15th International Symposium on Wireless Communication Systems (ISWCS). IEEE, 2018, pp. 1–5.
 [124] Y. Wei, F. R. Yu, M. Song, and Z. Han, “Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor-critic deep reinforcement learning,” IEEE Internet Things J., 2018.
 [125] Z. Chen and X. Wang, “Decentralized computation offloading for multiuser mobile edge computing: A deep reinforcement learning approach,” arXiv preprint arXiv:1812.07394, 2018.
 [126] L. Quan, Z. Wang, and F. Ren, “A novel two-layered reinforcement learning for task offloading with tradeoff between physical machine utilization rate and delay,” Future Internet, vol. 10, no. 7, p. 60, 2018.
 [127] F. De Vita, D. Bruneo, A. Puliafito, G. Nardini, A. Virdis, and G. Stea, “A deep reinforcement learning approach for data migration in multi-access edge computing,” in 2018 ITU Kaleidoscope: Machine Learning for a 5G Future (ITU K). IEEE, 2018, pp. 1–8.
 [128] J. Wang, L. Zhao, J. Liu, and N. Kato, “Smart resource allocation for mobile edge computing: A deep reinforcement learning approach,” IEEE Trans. Emerg. Top. Comput., 2019.
 [129] H. Zhu, Y. Cao, X. Wei, W. Wang, T. Jiang, and S. Jin, “Caching transient data for Internet of Things: A deep reinforcement learning approach,” IEEE Internet Things J., 2018.
 [130] Z. Wang, L. Li, Y. Xu, H. Tian, and S. Cui, “Handover control in wireless systems via asynchronous multi-user deep reinforcement learning,” IEEE Internet Things J., vol. 5, no. 6, pp. 4296–4307, 2018.
 [131] Y. Sun, M. Peng, and S. Mao, “Deep reinforcement learning based mode selection and resource management for green fog radio access networks,” IEEE Internet Things J., 2018.
 [132] L. Huang, X. Feng, C. Zhang, L. Qian, and Y. Wu, “Deep reinforcement learning-based joint task offloading and bandwidth allocation for multi-user mobile edge computing,” Digital Communications and Networks, vol. 5, no. 1, pp. 10–17, 2019.
 [133] S. Chinchali, P. Hu, T. Chu, M. Sharma, M. Bansal, R. Misra, M. Pavone, and S. Katti, “Cellular network traffic scheduling with deep reinforcement learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [134] A. T. Nassar and Y. Yilmaz, “Reinforcement-learning-based resource allocation in fog radio access networks for various IoT environments,” arXiv preprint arXiv:1806.04582, 2018.
 [135] L. Lei, H. Xu, X. Xiong, K. Zheng, and W. Xiang, “Joint computation offloading and multi-user scheduling using approximate dynamic programming in NB-IoT edge computing system,” IEEE Internet Things J., 2019.
 [136] Z. Xu, Y. Wang, J. Tang, J. Wang, and M. C. Gursoy, “A deep reinforcement learning based framework for power-efficient resource allocation in cloud RANs,” in 2017 IEEE International Conference on Communications (ICC). IEEE, 2017, pp. 1–6.
 [137] N. Cheng, F. Lyu, W. Quan, C. Zhou, H. He, W. Shi, and X. Shen, “Space/aerial-assisted computing offloading for IoT applications: A learning-based approach,” IEEE J. Sel. Areas Commun., vol. 37, no. 5, pp. 1117–1129, 2019.
 [138] M. Igl, L. Zintgraf, T. A. Le, F. Wood, and S. Whiteson, “Deep variational reinforcement learning for POMDPs,” arXiv preprint arXiv:1806.02426, 2018.
 [139] K. V. Katsikopoulos and S. E. Engelbrecht, “Markov decision processes with delays and asynchronous cost collection,” IEEE Trans. Automat. Contr., vol. 48, no. 4, pp. 568–574, 2003.
 [140] T. J. Walsh, A. Nouri, L. Li, and M. L. Littman, “Learning and planning in environments with delayed feedback,” Auton. Agents Multi-Agent Syst., vol. 18, no. 1, p. 83, 2009.
 [141] E. Schuitema, L. Buşoniu, R. Babuška, and P. Jonker, “Control delay in reinforcement learning for realtime dynamic systems: a memoryless approach,” in 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2010, pp. 3226–3231.
 [142] G. Lample and D. S. Chaplot, “Playing FPS games with deep reinforcement learning,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.