I Introduction
Deep reinforcement learning (DRL) has made rapid progress in games [23, 33] and robotic control [32, 7, 13]. However, most algorithms are evaluated in turn-based simulators like Gym [4] and MuJoCo [35], where the action selection and actuation of the agent are assumed to be instantaneous. Action delay, although prevalent in many areas of the real world, including robotic systems [14, 16, 3], communication networks [25], and parallel computing [12], cannot be directly handled in this scheme.
Previous research has shown that delays not only degrade the performance of the agent but also induce instability in dynamic systems [10, 8, 6], which is a fatal threat in safety-critical systems like connected and autonomous vehicles (CAVs) [9]. For instance, it usually takes more than 0.4 seconds for a hydraulic automotive brake system to generate the desired deceleration [3], which can have a huge impact on the planning and control modules of CAVs [29]. The control community has proposed several methods to address this problem, such as the Smith predictor [2, 21], Artstein reduction [1, 26], finite spectrum assignment [20, 24], and dead-time controllers [22]. Most of these methods depend on accurate models [27, 10], which are usually not available in real-world applications.
Recently, DRL has offered the potential to resolve this issue. The problems that DRL solves are usually modeled as Markov Decision Processes (MDPs). However, ignoring the delay of agents violates the Markov property and results in partially observable MDPs (POMDPs), with historical actions as hidden states. It has been shown that solving POMDPs without estimating hidden states can lead to arbitrarily suboptimal policies [34]. To retrieve the Markov property, the delayed system has been reformulated as an augmented MDP problem, as in [17, 37]. While the problem was elegantly formulated, the computational cost increases exponentially as the delay increases. Travnik et al. [36] showed that the traditional MDP framework is ill-defined under delays, but did not provide a theoretical analysis. Ramstedt & Pal [30] proposed an off-policy model-free algorithm known as Real-Time Actor-Critic to address the delayed problem by adapting Q-learning to state-value learning. The delay issue can also be relieved in a model-based manner by learning a dynamics model to predict the future state, as in [37]. However, that work mainly focused on discrete tasks and could suffer from the curse of dimensionality when discretizing the state and action space for continuous control tasks [19].

In this paper, we further explore reinforcement learning methods for delayed systems in the following three aspects: 1) we formally define the multi-step delayed MDP and prove that it can be converted to a standard MDP via the Markov reward process; 2) we propose a general framework of delay-aware model-based reinforcement learning for continuous control tasks; 3) by synthesizing state-of-the-art modeling and planning algorithms, we develop the Delay-Aware Trajectory Sampling (DATS) algorithm, which can efficiently solve delayed MDPs with minimal degradation of performance.
The rest of the paper is organized as follows. We first review the preliminaries in Section II, including the definition of the Delay-Aware Markov Decision Process (DAMDP). In Section III, we formally define the Delay-Aware Markov Reward Process (DAMRP) and prove its solidity. In Section IV, we introduce the proposed framework of delay-aware model-based reinforcement learning for DAMDPs with a concrete algorithm: Delay-Aware Trajectory Sampling (DATS). In Section V, we demonstrate the performance of the proposed algorithm on challenging continuous control tasks on the Gym and MuJoCo platforms.
II Preliminaries
II-A Delay-Free MDP and Reinforcement Learning
The delay-free MDP framework is suitable for modeling games like chess and Go, where the state remains still until a new action is executed. The definition of a delay-free MDP is:
Definition 1.
A Markov Decision Process (MDP) is characterized by a tuple $(\mathcal{S}, \mathcal{A}, \rho, p, r)$ with
(1) state space $\mathcal{S}$, (2) action space $\mathcal{A}$,
(3) initial state distribution $\rho(s_0): \mathcal{S} \to \mathbb{R}$,
(4) transition distribution $p(s_{t+1} \mid s_t, a_t): \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to \mathbb{R}$,
(5) reward function $r(s_t, a_t): \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.
In the framework of reinforcement learning, the problem is often modeled as an MDP, and the agent is represented by a policy $\pi$ that directs the action selection, given the current observation. The objective is to find the optimal policy $\pi^*$ that maximizes the expected cumulative discounted reward $\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]$, where $\gamma \in (0, 1]$ is the discount factor. Throughout this paper, we assume that we know the reward function $r$ and do not know the transition distribution $p$.
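As a minimal illustration (not from the paper), the objective above can be evaluated for one recorded episode; the helper `discounted_return` and the sample rewards are hypothetical:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward: sum over t of gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A constant reward of 1 per step approaches 1 / (1 - gamma) for long horizons.
print(discounted_return([1.0] * 1000, gamma=0.9))
```

For $\gamma = 0.9$ and a long constant-reward episode, the value approaches $1/(1-\gamma) = 10$.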
II-B Delay-Aware MDP
The delay-free MDP framework is problematic in the presence of agent delays and could lead to arbitrarily suboptimal policies [34]. To retrieve the Markov property, the Delay-Aware MDP (DAMDP) is proposed:
Definition 2.
A Delay-Aware Markov Decision Process $\mathrm{DAMDP}(E, n) = (\mathcal{X}, \mathcal{A}, \bm{\rho}, \bm{p}, \bm{r})$ augments a Markov Decision Process $E = (\mathcal{S}, \mathcal{A}, \rho, p, r)$, such that
(1) state space $\mathcal{X} = \mathcal{S} \times \mathcal{A}^{n}$, where $n$ denotes the delay step,
(2) action space $\mathcal{A}$,
(3) initial state distribution
$$\bm{\rho}(\bm{x}_0) = \rho(s_0) \prod_{i=1}^{n} \delta\!\left(a_i - c_i\right),$$
where $(c_1, \dots, c_n)$ denotes the initial action sequence,
(4) transition distribution
$$\bm{p}(\bm{x}_{t+1} \mid \bm{x}_t, \bm{a}_t) = p\!\left(s_{t+1} \mid s_t, a_t^{(1)}\right) \prod_{i=1}^{n-1} \delta\!\left(a_{t+1}^{(i)} - a_t^{(i+1)}\right) \delta\!\left(a_{t+1}^{(n)} - \bm{a}_t\right),$$
(5) reward function
$$\bm{r}(\bm{x}_t, \bm{a}_t) = r\!\left(s_t, a_t^{(1)}\right).$$
The state vector $\bm{x}_t = (s_t, a_t^{(1)}, \dots, a_t^{(n)})$ of a DAMDP is augmented with the action sequence being executed in the next $n$ steps, where $n$ is the delay duration. The superscript of $a_t^{(i)}$ indicates that the action is the $i$-th element of the action sequence in $\bm{x}_t$, and the subscript indicates the time; $a_t^{(i)}$ is executed at time $t + i - 1$. $\bm{a}_t$ is the action taken at time $t$ in a DAMDP but executed at time $t + n$ due to the $n$-step action delay, i.e., $a_t = \bm{a}_{t-n}$. Policies interacting with DAMDPs, which also need to be augmented since the dimension of the state vector has changed, are denoted by bold $\bm{\pi}$. Fig. 1 compares the interaction schemes of MDP and DAMDP, showing how the DAMDP state vector is augmented with the action sequence to be executed in the next $n$ steps.
It should be noted that both action delay and observation delay can exist in real-world systems. However, it has been proved that, from the point of view of the learning agent, observation and action delays form the same mathematical problem, since both lead to a delay between the moment of measurement and the actual action [17]. For simplicity, we focus on the action delay in this paper, and the algorithm and conclusions should generalize to systems with observation delays.

We divide the action delay into two main parts: action selection and action actuation. For action selection, the time length depends on the complexity of the algorithm and the computing power of the processor. System users can limit the action selection time by constraining the search depth, as in AlphaGo [33]. For action actuation, on the other hand, the actuators (e.g., motors, hydraulic machines) also need time to respond to the selected action. For instance, it usually takes more than 0.4 seconds for a hydraulic automotive brake system to generate the desired deceleration [3]. The actuation delay is usually determined by the hardware.

To formulate a delayed system as a DAMDP, we must select a proper time step for discretely updating the environment. As shown in Fig. 1(c), the action $\bm{a}_t$ selected at the current time step will be encapsulated in $\bm{x}_{t+1}$. Thus, $\bm{a}_t$ must be accessible at time $t+1$ since the agent needs it as part of the state, which requires the action selection delay to be at most one time step. We satisfy this requirement by making the time step of the DAMDP larger than the action selection duration.
The above definition of DAMDP assumes that the delay time of the agent is an integer multiple of the time step of the system, which is often not true for real-world tasks like robotic control. To address this, Schuitema et al. [31] proposed an approximation approach by assuming a virtual effective action at each discrete system time step, which achieves first-order equivalence in linearizable systems with arbitrary delay time. With this approximation, the above DAMDP structure can be adapted to systems with arbitrary delay values.
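The DAMDP interaction scheme can be sketched as a wrapper that buffers actions before execution. This is an illustrative sketch, not the paper's implementation; `DelayedEnv` and the Gym-style `reset()`/`step()` interface are assumptions:

```python
from collections import deque

class DelayedEnv:
    """Wraps a delay-free environment so that each selected action is executed
    n steps later, as in the DAMDP interaction scheme. The augmented
    observation is (state, pending action buffer)."""

    def __init__(self, env, n, initial_actions):
        assert len(initial_actions) == n
        self.env, self.n = env, n
        self._init = list(initial_actions)

    def reset(self):
        s = self.env.reset()
        self.buffer = deque(self._init)
        return (s, tuple(self.buffer))      # augmented state x_0

    def step(self, action):
        self.buffer.append(action)          # a_t joins the queue ...
        executed = self.buffer.popleft()    # ... while a_{t-n} is actuated now
        s_next, r, done, info = self.env.step(executed)
        return (s_next, tuple(self.buffer)), r, done, info
```

Note that with `n = 0` the append/popleft pair returns the new action immediately, so the wrapper degenerates to the delay-free case.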
III Delay-Aware Markov Reward Process
Our first step is to show that an MDP with multi-step action delays can be converted to a regular MDP problem by state augmentation. We prove the equivalence of the two by comparing their corresponding Markov Reward Processes (MRPs). The delay-free MRP is:
Definition 3.
A Markov Reward Process $\mathrm{MRP}(E, \pi) = (\mathcal{S}, \rho, \kappa, \bar{r})$ can be recovered from a Markov Decision Process $E = (\mathcal{S}, \mathcal{A}, \rho, p, r)$ with a policy $\pi$, such that
$$\kappa(s_{t+1} \mid s_t) = \int_{\mathcal{A}} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)\, da_t,$$
$$\bar{r}(s_t) = \int_{\mathcal{A}} r(s_t, a_t)\, \pi(a_t \mid s_t)\, da_t,$$
where $\kappa$ is the state transition distribution and $\bar{r}$ is the state reward function of the MRP. $E$ is the original environment without delays.
In the delayfree framework, at each time step, the agent selects an action based on the current observation. The action will immediately be executed in the environment to generate the next observation. However, if an action delay exists, the interaction manner between the environment and the agent changes, and a different MRP is generated. An illustration of the delayed interaction between agents and the environment is shown in Fig. 2. The agent interacts with the environment not directly but through an action buffer.
Based on the delayed interaction manner between the agent and the environment, the Delay-Aware MRP (DAMRP) is defined as follows.
Definition 4.
A Delay-Aware Markov Reward Process $\mathrm{DAMRP}(E, \bm{\pi}, n) = (\mathcal{X}, \bm{\rho}, \bm{\kappa}, \bar{\bm{r}})$ can be recovered from a Markov Decision Process $E = (\mathcal{S}, \mathcal{A}, \rho, p, r)$ with a policy $\bm{\pi}$ and $n$-step action delay, such that
(1) state space $\mathcal{X} = \mathcal{S} \times \mathcal{A}^{n}$,
(2) initial state distribution
$$\bm{\rho}(\bm{x}_0) = \rho(s_0) \prod_{i=1}^{n} \delta\!\left(a_i - c_i\right),$$
where $(c_1, \dots, c_n)$ denotes the initial action sequence,
(3) state transition distribution
$$\bm{\kappa}(\bm{x}_{t+1} \mid \bm{x}_t) = \int_{\mathcal{A}} \bm{p}(\bm{x}_{t+1} \mid \bm{x}_t, \bm{a}_t)\, \bm{\pi}(\bm{a}_t \mid \bm{x}_t)\, d\bm{a}_t,$$
(4) state-reward function
$$\bar{\bm{r}}(\bm{x}_t) = \int_{\mathcal{A}} \bm{r}(\bm{x}_t, \bm{a}_t)\, \bm{\pi}(\bm{a}_t \mid \bm{x}_t)\, d\bm{a}_t = r\!\left(s_t, a_t^{(1)}\right).$$
With Defs. 1–4, we are ready to prove that DAMDP is a correct augmentation of MDP with delay, as stated in Theorem 1.
Theorem 1.
A policy $\bm{\pi}$ interacting with $\mathrm{DAMDP}(E, n)$ in the delay-free manner produces the same Markov Reward Process as $\bm{\pi}$ interacting with $E$ with $n$-step action delays, i.e.,
$$\mathrm{MRP}(\mathrm{DAMDP}(E, n), \bm{\pi}) = \mathrm{DAMRP}(E, \bm{\pi}, n). \tag{1}$$
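The equivalence in Theorem 1 can be checked empirically on a small deterministic example: rolling a policy through an explicit action buffer on the original environment yields exactly the same reward sequence as acting delay-free on the augmented MDP of Definition 2. The helper functions below are an illustrative sketch with deterministic dynamics `f` and reward `r`:

```python
def rollout_delayed(f, r, policy, s0, init_buf, T):
    """Policy acts on the raw environment through an n-step action buffer."""
    s, buf, rewards = s0, list(init_buf), []
    for _ in range(T):
        a = policy(s, tuple(buf))       # policy sees the augmented state
        buf.append(a)                   # a_t joins the queue ...
        a_exec = buf.pop(0)             # ... while a_{t-n} is executed now
        rewards.append(r(s, a_exec))
        s = f(s, a_exec)
    return rewards

def rollout_augmented(f, r, policy, s0, init_buf, T):
    """Policy acts delay-free on the augmented MDP of Definition 2."""
    x, rewards = (s0, tuple(init_buf)), []
    for _ in range(T):
        s, buf = x
        a = policy(s, buf)
        rewards.append(r(s, buf[0]))            # r(x_t, a_t) = r(s_t, a_t^(1))
        x = (f(s, buf[0]), buf[1:] + (a,))      # shift the buffer, enqueue a
    return rewards
```

For any deterministic `f`, `r`, and `policy`, the two rollouts produce identical reward sequences, which is the content of Eq. (1) in the deterministic special case.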
IV Delay-Aware Model-Based Reinforcement Learning
Theorem 1 shows that instead of solving MDPs with action delays, we can alternatively solve the corresponding DAMDPs. From the transition function of a DAMDP with multi-step delays,
$$\bm{p}(\bm{x}_{t+1} \mid \bm{x}_t, \bm{a}_t) = p\!\left(s_{t+1} \mid s_t, a_t^{(1)}\right) \prod_{i=1}^{n-1} \delta\!\left(a_{t+1}^{(i)} - a_t^{(i+1)}\right) \delta\!\left(a_{t+1}^{(n)} - \bm{a}_t\right), \tag{2}$$
we see that the dynamics can be divided into the unknown original dynamics $p(s_{t+1} \mid s_t, a_t^{(1)})$ and the known dynamics caused by the action delay. Thus, solving DAMDPs with standard reinforcement learning algorithms will suffer from the curse of dimensionality if a completely unknown environment is assumed. In this section, we propose a delay-aware model-based reinforcement learning framework to achieve high computational efficiency.
As mentioned, RTAC [30] has been proposed to deal with the delay problem. However, we will show that this method is only efficient for a 1-step delay. When $n = 1$, any transition $(\bm{x}_t, \bm{a}_t, r_t, \bm{x}_{t+1})$ in the replay buffer is always a valid transition in the Bellman equation with the state-value function as
$$v(\bm{x}_t) = \mathbb{E}\left[ r(s_t, a_t) + \gamma\, v(\bm{x}_{t+1}) \right],$$
where $\bm{x}_t = (s_t, a_t)$, $\bm{x}_{t+1} = (s_{t+1}, a_{t+1})$, and $a_{t+1} = \bm{a}_t$. However, when considering the multi-step delay, i.e., $n > 1$, it is challenging to use off-policy model-free reinforcement learning because augmented transitions need to be stored, and the effect of an action on the state-value function is only learned after $n$ steps of updates of the Bellman equation. Also, the dimension of the state vector increases with the delay step $n$, resulting in exponential growth of the state space.
Another limitation of model-free methods for DAMDPs is that it can be difficult to transfer the learned knowledge (e.g., value functions, policies) when the action delay step changes, because the input dimensions of the value functions and policies depend on the delay step $n$. The agent must learn from scratch whenever the system delay changes, which is common in real-world systems.
The problems of model-free methods have motivated us to develop model-based reinforcement learning (MBRL) methods to combat the action delay. MBRL tries to solve MDPs by learning the dynamics model of the environment. Intuitively, we can inject our knowledge into the learned model without extra learning effort. Based on this intuition, we propose a delay-aware MBRL framework to solve multi-step DAMDPs that efficiently alleviates the aforementioned two problems of model-free methods. From Eq. 2, the unknown part $p(s_{t+1} \mid s_t, a_t^{(1)})$ is exactly the dynamics that we learn in MBRL algorithms for delay-free MDPs. In our proposed framework, only this original dynamics is learned, and the dynamics caused by the delay is combined with the learned model by adding action delays to the interaction scheme. As mentioned, the learned dynamics model is transferable between systems with different delay steps, since we can adjust the interaction scheme based on the delay step (see Section V-C for an explanation of the transfer performance).
The proposed framework of delay-aware MBRL is shown in Algorithm 1. In the for loop, we solve a planning problem, given a dynamics model and an initial action sequence. The learned model is used not only for optimal control but also for state prediction to compensate for the delay effect. By training iteratively, we gradually improve the model accuracy and obtain better planning performance, especially in high-reward regions.
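The two uses of the learned model — predicting the state at which a newly selected action will actually take effect, and planning from that state — can be sketched as follows. The random-shooting planner here is a deliberately simplified stand-in for the CEM/PETS planner used in the paper, and all names are illustrative:

```python
import numpy as np

def predict_future_state(model, s, action_buffer):
    """Compensate for the delay: roll the learned delay-free model f(s, a)
    through the actions already committed but not yet executed."""
    for a in action_buffer:
        s = model(s, a)
    return s

def plan_with_delay(model, reward_fn, s, action_buffer, horizon, n_samples, rng):
    """Random-shooting MPC from the delay-compensated state; returns the
    first action of the best sampled sequence."""
    s_future = predict_future_state(model, s, action_buffer)
    best_a, best_ret = None, -np.inf
    for _ in range(n_samples):
        seq = rng.uniform(-1.0, 1.0, size=horizon)
        sim_s, ret = s_future, 0.0
        for a in seq:
            ret += reward_fn(sim_s, a)
            sim_s = model(sim_s, a)
        if ret > best_ret:
            best_ret, best_a = ret, seq[0]
    return best_a
```

Only the unknown original dynamics is modeled; the known delay dynamics enters through the explicit `action_buffer` loop.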
IV-A Delay-Aware Trajectory Sampling
Recently, several MBRL algorithms have been proposed that match the asymptotic performance of model-free algorithms on challenging benchmark tasks, including probabilistic ensembles with trajectory sampling (PETS) [5], model-based policy optimization (MBPO) [15], and model-based planning with policy networks (POPLIN) [38]. In this section, we combine the state-of-the-art PETS algorithm with the proposed delay-aware MBRL framework to generate a new method for solving DAMDPs. We name this method Delay-Aware Trajectory Sampling (DATS).
In DATS, the dynamics model is represented by an ensemble of probabilistic neural networks that output Gaussian distributions, which helps model the aleatoric uncertainty. The ensemble helps incorporate the epistemic uncertainty of the dynamics model and approximate the Bayesian posterior [28, 18]. The planning of action sequences applies the concept of model predictive control (MPC) with the cross-entropy method (CEM) for elite selection of the sampled action sequences. In the innermost for loop of Algorithm 2, given the current state, we first propagate state particles with the same buffered action sequence to make various estimates of the future state, and then use sampled action sequences to predict trajectories for each particle. In this way, the uncertainty of the learned model is considered in both the state-prediction and planning phases, which improves the robustness of the algorithm. The complete algorithm is shown in Algorithm 2.

Model-based methods have a natural advantage over model-free methods when dealing with multi-step DAMDPs. With model-free methods, the effect of an action on the state-value function can only be learned after $n$ updates of the Bellman equation. The agent implicitly wastes both time and effort to learn the known part of the system dynamics caused by the action delay, since it does not understand the meaning of the elements in the state vectors. As mentioned, the advantage of model-based methods is that they incorporate the delay effect into the system dynamics without extra learning (see Section V-B for a performance comparison between model-free and model-based methods).
Performance (means and standard deviations of rewards) of different MBRL algorithms in Gym environments. The environment is non-delayed for SAC and PETS and one-step-delayed for the other algorithms. DATS is the proposed algorithm. The results indicate that the performance degradation caused by the environment action delay is minimal when using DATS.

V Experiments
V-A Reinforcement Learning in Delayed Systems
Experiments are conducted across four OpenAI Gym/MuJoCo [4, 35] environments for continuous control: Pendulum, CartPole, Walker2d, and Ant, as shown in Fig. 3. The details of the environments are described below.
In Pendulum, a single-link pendulum is fixed at one end, with an actuator located at the joint. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up and keep it upright. Observations include the joint angle and the joint angular velocity. The reward penalizes deviations of position and speed from the upright equilibrium as well as the magnitude of the control input.
In CartPole, a pole is connected to a cart through an unactuated joint, and the cart moves along a frictionless track. The system is controlled by applying a real-valued force to the cart. The pole starts upright, and the goal is to prevent it from falling over. Let $\theta_t$ be the angle of the pole away from the upright vertical position, and $x_t$ be the displacement of the cart from the center of the rail at time $t$. The 4-dimensional observation at time $t$ is $(x_t, \dot{x}_t, \theta_t, \dot{\theta}_t)$. A reward of +1 is provided for every time step that the pole remains upright.
Walker2d is a 2-dimensional bipedal robot consisting of 7 rigid links, including a torso and 2 legs. There are 6 actuators, 3 for each leg. The observations include the (angular) positions and speeds of all joints. The reward is the forward speed plus penalties on the distance to a target height and the magnitude of the control input. The goal is to walk forward as fast as possible while keeping the standing height with minimal control input.
Ant is a 3-dimensional 4-legged robot with 13 rigid links (including a torso and 4 legs). There are 8 actuators at the joints, 2 for each leg. The observations include the (angular) positions and speeds of all joints. The reward is the forward speed plus penalties on the distance to a target height and the magnitude of the control input. The goal is to walk forward as fast as possible while approximately maintaining the normal standing height with minimal control input.
Among the 4 continuous control tasks, Walker2d and Ant are considered more challenging than Pendulum and CartPole, as indicated by the dimensions of their dynamics.
In the experiments, we add delays manually by modifying the interaction framework between the agents and the environments where needed.
To show the advantage of DATS, we compare five algorithms:

SAC: Soft Actor-Critic [11] is a state-of-the-art model-free reinforcement learning algorithm, serving as a model-free baseline. Only the performance at the maximum time step is visualized.

PETS (non-delayed): The PETS algorithm [5] is implemented in the non-delayed environment without action delays, providing a performance upper bound for algorithms in delayed environments.

PETS (delayed): The PETS algorithm is blindly implemented in the 1-step delayed environment without modeling action delays.

W-PETS: The PETS algorithm is augmented to solve DAMDPs with $n = 1$. However, it inefficiently tries to learn the whole dynamics in Eq. 2, including the known part caused by action delays.

DATS: The proposed algorithm, implemented in the 1-step delayed environment.
Each algorithm is run with 10 random seeds in each environment. Fig. 4 shows the algorithmic performance. As the model-free baseline, SAC is not as efficient as PETS in the four environments when there are no delays. While PETS (delayed) has the worst performance because the agent does not consider the action delay and learns the wrong dynamics, it can still make some improvements in simple environments like Pendulum (Fig. 4(a)) and CartPole (Fig. 4(b)) due to the correlation of transitions. PETS (delayed) performs poorly in tasks that need accurate transition dynamics for planning, namely Walker2d (Fig. 4(c)) and Ant (Fig. 4(d)). W-PETS achieves performance similar to PETS in Pendulum and CartPole, but its performance degrades considerably when the task gets more difficult, since it has to learn the dynamics of the extra state dimensions caused by the action delays (Fig. 4(c) and 4(d)). DATS performs on par with PETS in all four tasks, i.e., action delays barely affect DATS.
The reason DATS in a delayed environment matches the asymptotic performance of PETS in the non-delayed environment is that the quality and quantity of the transitions used for model training in DATS are almost the same as in PETS, despite the action delay. The slight difference is due to the distribution shift caused by the predefined initial actions, which has minimal influence on overall performance if the task horizon is long compared to the action delay step.
V-B Model-Based vs. Model-Free
To show the advantage of the proposed delay-aware MBRL framework when dealing with multi-step delays, we compare the model-free algorithm RTAC [30] with the proposed model-based DATS. RTAC is suitable for solving DAMDPs and is built on SAC, but as explained in Section IV, RTAC avoids extra learning only when the action delay is exactly one step.
We test them in a simple environment and a complex environment with various delay steps $n$. The learning curves in Fig. 5 show that DATS outperforms RTAC in efficiency and stability. DATS maintains consistent performance while RTAC degrades significantly as the delay step increases, even for the simple task, as shown in Fig. 5(b). The reason is that, with the original dynamics of the two environments fixed, the extra dynamics caused by the action delay rapidly dominates the state space of the learning problem as the delay step increases, and exponentially more transitions need to be sampled and learned.


V-C Transferable Knowledge
In this section, we show the transferability of the knowledge learned by DATS. We first learn several dynamics models $\{f_{n^*}\}$ with DATS in two environments, where $n^*$ denotes the action delay step during training. The learned models are then tested in environments with $n$-step action delays. We train the dynamics model in each environment with the same number of transitions: 2,000 for the simpler task and 200,000 for the more complex one. The planning method and hyperparameters stay the same as those in Algorithm 2. RTAC provides the model-free baseline for each environment. Recall that since RTAC is a model-free algorithm, it must learn from scratch whenever the delay step changes.
The reward matrix in Table I shows that DATS performs well even when the test delay step $n$ is twice as large as the maximum delay step used during model training. We infer that the learned knowledge (the dynamics, in this case) is transferable: when the action delay of the system changes, the estimated dynamics remain useful by simply adjusting the known part of the dynamics caused by the action delay. On the other hand, RTAC performs poorly as the delay step increases, since the dimension of the state space grows and the agent has to spend more effort learning the delay dynamics. Notably, the learned knowledge of model-free methods cannot be transferred when the delay step changes.
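The transfer mechanism can be made concrete: the same learned delay-free model serves any delay step, because only the explicit buffer-propagation loop changes with $n$. The following is a hypothetical sketch; `model` stands in for the learned dynamics and `plan_fn` for any planner:

```python
def act_with_model(model, plan_fn, s, action_buffer):
    """Reuse one learned delay-free model f(s, a) under any delay step n:
    the known delay dynamics is applied explicitly by rolling the model
    through the buffered (committed but unexecuted) actions before planning."""
    s_pred = s
    for a in action_buffer:
        s_pred = model(s_pred, a)
    return plan_fn(s_pred)
```

Since nothing in `model` depends on the buffer length, changing the system's delay step only changes how many times the loop runs, not what was learned.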
The results suggest that the transferability of DATS makes it suitable for sim-to-real tasks when there are action delays in real systems, and that the delay step during model training does not have to equal the delay step of the real system. Therefore, if the delay steps of real-world tasks are known and fixed, we can incorporate the delay effect into the original dynamics learned in a delay-free simulator and obtain highly efficient sim-to-real transfer.
VI Conclusion
This paper proposed a general delay-aware MBRL framework that solves multi-step DAMDPs with high efficiency and transferability. Our key insight is that the dynamics of a DAMDP can be divided into two parts: the known part caused by delays, and the unknown part inherited from the original delay-free MDP. The proposed delay-aware MBRL framework learns the original unknown dynamics and incorporates the known part of the dynamics explicitly. We also provided an efficient implementation of delay-aware MBRL as DATS by combining it with a state-of-the-art modeling and planning method, PETS. The experimental results showed that the performance of DATS in delayed environments is similar to that of PETS in instantaneous environments, regardless of the delay duration. Moreover, the dynamics learned by DATS are transferable when the duration of the action delay changes, making DATS the preferred algorithm for tasks in real-world systems.
References
 [1] (1982) Linear systems with delayed controls: a reduction. IEEE Transactions on Automatic control 27 (4), pp. 869–879. Cited by: §I.
 [2] (1994) A new smith predictor for controlling a process with an integrator and long deadtime. IEEE transactions on Automatic Control 39 (2), pp. 343–345. Cited by: §I.
 [3] (2009) Brake timing measurements for a tractorsemitrailer under emergency braking. SAE International Journal of Commercial Vehicles 2 (2009012918), pp. 245–255. Cited by: §I, §I, §IIB.
 [4] (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §I, §VA.
 [5] (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765. Cited by: §IVA, 2nd item.
 [6] (1995) Timedelay control of structures. Earthquake Engineering & Structural Dynamics 24 (5), pp. 687–701. Cited by: §I.
 [7] (2016) Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338. Cited by: §I.
 [8] (1998) Stability and control of timedelay systems. Vol. 228, Springer. Cited by: §I.
 [9] (2016) Constrained optimization and distributed computation based car following control of a connected and autonomous vehicle platoon. Transportation Research Part B: Methodological 94, pp. 314–334. Cited by: §I.
 [10] (2003) Survey on recent results in the stability and control of timedelay systems. Journal of dynamic systems, measurement, and control 125 (2), pp. 158–165. Cited by: §I.
 [11] (2018) Soft actorcritic: offpolicy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: 1st item.
 [12] (2018) On unbounded delays in asynchronous parallel fixedpoint algorithms. Journal of Scientific Computing 76 (1), pp. 299–326. Cited by: §I.
 [13] (2017) Control of a quadrotor with reinforcement learning. IEEE Robotics and Automation Letters 2 (4), pp. 2096–2103. Cited by: §I.
 [14] (2004) Groundspace bilateral teleoperation of etsvii robot arm by direct bilateral coupling under 7s time delay condition. IEEE Transactions on Robotics and Automation 20 (3), pp. 499–511. Cited by: §I.
 [15] (2019) When to trust your model: modelbased policy optimization. arXiv preprint arXiv:1906.08253. Cited by: §IVA.
 [16] (2008) Robust compliant motion control of robot with nonlinear friction using timedelay estimation. IEEE Transactions on Industrial Electronics 55 (1), pp. 258–269. Cited by: §I.
 [17] (2003) Markov decision processes with delays and asynchronous cost collection. IEEE transactions on automatic control 48 (4), pp. 568–574. Cited by: §I, §IIB.
 [18] (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413. Cited by: §IVA.
 [19] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §I.
 [20] (1979) Finite spectrum assignment problem for systems with delays. IEEE transactions on Automatic Control 24 (4), pp. 541–552. Cited by: §I.
 [21] (1999) On the modified smith predictor for controlling a process with an integrator and long deadtime. IEEE Transactions on Automatic Control 44 (8), pp. 1603–1606. Cited by: §I.
 [22] (2000) On the extraction of deadtime controllers from delayfree parametrizations. IFAC Proceedings Volumes 33 (23), pp. 169–174. Cited by: §I.
 [23] (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §I.
 [24] (2003) Finite spectrum assignment of unstable timedelay systems with a safe implementation. IEEE Transactions on Automatic Control 48 (12), pp. 2207–2212. Cited by: §I.
 [25] (1999) Estimation and removal of clock skew from network delay measurements. In IEEE INFOCOM'99. Conference on Computer Communications. Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. The Future is Now (Cat. No. 99CH36320), Vol. 1, pp. 227–234. Cited by: §I.
 [26] (2008) Finitetime stability and stabilization of timedelay systems. Systems & Control Letters 57 (7), pp. 561–566. Cited by: §I.
 [27] (2001) Delay effects on stability: a robust control approach. Vol. 269, Springer Science & Business Media. Cited by: §I.
 [28] (2016) Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pp. 4026–4034. Cited by: §IVA.
 [29] (2013) Lp string stability of cascaded systems: application to vehicle platooning. IEEE Transactions on Control Systems Technology 22 (2), pp. 786–793. Cited by: §I.
 [30] (2019) Realtime reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3067–3076. Cited by: §I, §IV, §VB.
 [31] (2010) Control delay in reinforcement learning for realtime dynamic systems: a memoryless approach. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3226–3231. Cited by: §IIB.
 [32] (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §I.
 [33] (2016) Mastering the game of go with deep neural networks and tree search. nature 529 (7587), pp. 484. Cited by: §I, §IIB.
 [34] (1994) Learning without stateestimation in partially observable markovian decision processes. In Machine Learning Proceedings 1994, pp. 284–292. Cited by: §I, §IIB.
 [35] (2012) Mujoco: a physics engine for modelbased control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §I, §VA.
 [36] (2018) Reactive reinforcement learning in asynchronous environments. Frontiers in Robotics and AI 5, pp. 79. Cited by: §I.
 [37] (2009) Learning and planning in environments with delayed feedback. Autonomous Agents and MultiAgent Systems 18 (1), pp. 83. Cited by: §I.
 [38] (2019) Exploring modelbased planning with policy networks. arXiv preprint arXiv:1906.08649. Cited by: §IVA.