Delay-Aware Model-Based Reinforcement Learning for Continuous Control

Action delays degrade the performance of reinforcement learning in many real-world systems. This paper proposes a formal definition of the delay-aware Markov Decision Process and proves that it can be converted into a standard MDP with augmented states via the Markov reward process. We develop a delay-aware model-based reinforcement learning framework that incorporates multi-step delays into the learned system models without extra learning effort. Experiments on the Gym and MuJoCo platforms show that the proposed delay-aware model-based algorithm is more efficient in training and more transferable between systems with various delay durations than off-policy model-free reinforcement learning methods. Code is available at:



I Introduction

Deep reinforcement learning has made rapid progress in games [23, 33] and robotic control [32, 7, 13]. However, most algorithms are evaluated in turn-based simulators like Gym [4] and MuJoCo [35], where the action selection and actuation of the agent are assumed to be instantaneous. Action delay, although prevalent in many areas of the real world, including robotic systems [14, 16, 3], communication networks [25] and parallel computing [12], may not be directly handled in this scheme.

Previous research has shown that delays not only degrade the performance of the agent but also induce instability in dynamic systems [10, 8, 6], which is a fatal threat in safety-critical systems like connected and autonomous vehicles (CAVs) [9]. For instance, it usually takes more than 0.4 seconds for a hydraulic automotive brake system to generate the desired deceleration [3], which can have a large impact on the planning and control modules of CAVs [29]. The control community has proposed several methods to address this problem, such as the Smith predictor [2, 21], Artstein reduction [1, 26], finite spectrum assignment [20, 24], and $H_\infty$ controllers [22]. Most of these methods depend on accurate models [27, 10], which are usually not available in real-world applications.

Recently, DRL has offered the potential to resolve this issue. The problems that DRL solves are usually modeled as Markov Decision Processes (MDPs). However, ignoring the delay of agents violates the Markov property and results in partially observable MDPs (POMDPs) with historical actions as hidden states. It is shown in [34] that solving POMDPs without estimating hidden states can lead to arbitrarily suboptimal policies. To retrieve the Markov property, the delayed system was reformulated as an augmented MDP problem, as in [17, 37]. While the problem was elegantly formulated, the computational cost increases exponentially as the delay increases. Travnik et al. [36] showed that the traditional MDP framework is ill-defined in the presence of delays, but did not provide a theoretical analysis. Ramstedt & Pal [30] proposed Real-Time Actor-Critic, an off-policy model-free algorithm, to address the delayed problem by adapting Q-learning to state-value learning. The delay issue could also be relieved in a model-based manner by learning a dynamics model to predict future states, as in [37]. However, that work mainly focused on discrete tasks and could suffer from the curse of dimensionality when discretizing the state and action spaces for continuous control tasks.


In this paper, we further explore reinforcement learning methods for delayed systems in the following three aspects: 1) we formally define the multi-step delayed MDP and prove it can be converted to a standard MDP via the Markov reward process; 2) we propose a general framework of delay-aware model-based reinforcement learning for continuous control tasks; 3) by synthesizing state-of-the-art modeling and planning algorithms, we develop the Delay-Aware Trajectory Sampling (DATS) algorithm, which can efficiently solve delayed MDPs with minimal degradation of performance.

The rest of the paper is organized as follows. We first review the preliminaries in Section II, including the definition of the Delay-Aware Markov Decision Process (DA-MDP). In Section III, we formally define the Delay-Aware Markov Reward Process (DA-MRP) and prove its solidity. In Section IV, we introduce the proposed framework of delay-aware model-based reinforcement learning for DA-MDPs with a concrete algorithm: Delay-Aware Trajectory Sampling (DATS). In Section V, we demonstrate the performance of the proposed algorithm in challenging continuous control tasks on the Gym and MuJoCo platforms.

II Preliminaries

II-A Delay-Free MDP and Reinforcement Learning

The delay-free MDP framework is suitable for modeling games like chess and Go, where the state remains unchanged until a new action is executed. The definition of a delay-free MDP is:

Definition 1.

A Markov Decision Process (MDP) is characterized by a tuple $E = (\mathcal{S}, \mathcal{A}, \rho, p, r)$ with
(1) state space $\mathcal{S}$,    (2) action space $\mathcal{A}$,
(3) initial state distribution $\rho : \mathcal{S} \to \mathbb{R}$,
(4) transition distribution $p(s_{t+1} \mid s_t, a_t)$,
(5) reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$.

In the framework of reinforcement learning, the problem is often modeled as an MDP, and the agent is represented by a policy $\pi$ that directs the action selection given the current observation. The objective is to find the optimal policy $\pi^*$ that maximizes the expected cumulative discounted reward $\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$. Throughout this paper, we assume that the reward function $r$ is known and the transition distribution $p$ is unknown.
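As a concrete illustration of the objective above, the cumulative discounted reward can be computed for a finite episode as follows (a minimal sketch; the function name and the use of NumPy are ours, not from the paper):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward: sum_t gamma^t * r_t over one episode."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```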

II-B Delay-Aware MDP

The delay-free MDP is problematic with agent delays and could lead to arbitrarily suboptimal policies [34]. To retrieve the Markov property, Delay-Aware MDP (DA-MDP) is proposed:

Definition 2.

A Delay-Aware Markov Decision Process $\mathrm{DAMDP}(E, n)$ augments a Markov Decision Process $E = (\mathcal{S}, \mathcal{A}, \rho, p, r)$, such that
(1) state space $\boldsymbol{\mathcal{S}} = \mathcal{S} \times \mathcal{A}^n$, where $n$ denotes the delay step,
(2) action space $\boldsymbol{\mathcal{A}} = \mathcal{A}$,
(3) initial state distribution

$\boldsymbol{\rho}(\mathbf{s}_0) = \rho(s_0) \prod_{i=1}^{n} \delta(a_i - c_i),$

where $(c_1, \dots, c_n)$ denotes the initial action sequence,
(4) transition distribution

$\mathbf{p}(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t) = p(s_{t+1} \mid s_t, a_t^{(1)}) \prod_{i=1}^{n-1} \delta(a_{t+1}^{(i)} - a_t^{(i+1)})\, \delta(a_{t+1}^{(n)} - \mathbf{a}_t),$

(5) reward function

$\mathbf{r}(\mathbf{s}_t, \mathbf{a}_t) = r(s_t, a_t^{(1)}).$

The state vector $\mathbf{s}_t = (s_t, a_t^{(1)}, \dots, a_t^{(n)})$ of a DA-MDP is augmented with the action sequence being executed in the next $n$ steps, where $n$ is the delay duration. The superscript of $a_t^{(i)}$ means that the action is the $i$-th element of the action sequence in $\mathbf{s}_t$, and the subscript represents the time at which the state is observed. $\mathbf{a}_t$ is the action taken at time $t$ in a DA-MDP but executed at time $t + n$ due to the $n$-step action delay, i.e. $\mathbf{a}_t = a_{t+n}$.

Policies interacting with DA-MDPs, which also need to be augmented since the dimension of the state vector has changed, are denoted by bold $\boldsymbol{\pi}$. Fig. 1, which compares MDP and DA-MDP, shows that the state vector of a DA-MDP is augmented with the action sequence to be executed in the next $n$ steps.

Fig. 1: Comparison between an MDP $E$ and the corresponding DA-MDPs $\mathrm{DAMDP}(E, 1)$ and $\mathrm{DAMDP}(E, n)$. $n$ denotes the action delay step. $s_t$ denotes the state observed and $a_t$ the action executed, both at time $t$. Arrows represent how the action selected in the current time step is included in the future state.

It should be noted that both action delay and observation delay can exist in real-world systems. However, it has been proved that, from the point of view of the learning agent, observation and action delays form the same mathematical problem, since they both lead to a delay between the moment of measurement and the actual action [17]. For simplicity, we focus on action delay in this paper; the algorithm and conclusions should generalize to systems with observation delays. We divide the action delay into two main parts: action selection and action actuation. For action selection, the time length depends on the complexity of the algorithm and the computing power of the processor. System users can limit the action selection time by constraining the search depth, as in AlphaGo [33]. For action actuation, on the other hand, the actuators (e.g., motors, hydraulic machines) also need time to respond to the selected action. For instance, it usually takes more than 0.4 seconds for a hydraulic automotive brake system to generate the desired deceleration [3]. The actuation delay is usually determined by the hardware.

To formulate a delayed system as a DA-MDP, we must select a proper time step for discretely updating the environment. As shown in Fig. 1(c), the action $\mathbf{a}_t$ selected at the current time step will be encapsulated in the next state $\mathbf{s}_{t+1}$. Thus, $\mathbf{a}_t$ must be available at time $t + 1$ since the agent needs it as part of the state, which requires the action selection delay to be at most one time step. We satisfy this requirement by making the time step of the DA-MDP larger than the action selection duration.

The above definition of DA-MDP assumes that the delay time of the agent is an integer multiple of the time step of the system, which is usually not true for many real-world tasks like robotic control. For that, Schuitema et al. [31] have proposed an approximation approach that assumes a virtual effective action at each discrete system time step, which achieves first-order equivalence in linearizable systems with arbitrary delay time. With this approximation, the above DA-MDP structure can be adapted to systems with arbitrary-valued delays.
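The augmentation of Definition 2 can be realized in code as a thin wrapper that maintains the $n$-step action buffer between the agent and the original environment. The sketch below assumes a minimal environment interface with `reset()` and `step(action)` returning only a state and a reward; all class and method names are illustrative, not from the paper:

```python
from collections import deque

class DelayedEnv:
    """Wrap a delay-free environment into an n-step delayed one (Def. 2).

    The augmented state (s_t, a_t^(1), ..., a_t^(n)) is the current
    observation plus the n buffered actions still awaiting execution.
    The interface is simplified: step() returns only (state, reward).
    """
    def __init__(self, env, n, initial_actions):
        assert len(initial_actions) == n
        self.env, self.n = env, n
        self.buffer = deque(initial_actions, maxlen=n)

    def reset(self):
        return (self.env.reset(), tuple(self.buffer))

    def step(self, action):
        # The oldest buffered action a_t^(1) reaches the plant now; the
        # newly selected action only takes effect n steps later.
        executed = self.buffer.popleft()
        self.buffer.append(action)
        s_next, reward = self.env.step(executed)
        return (s_next, tuple(self.buffer)), reward

class IntegratorEnv:
    """Toy delay-free environment: the state accumulates executed actions."""
    def reset(self):
        self.s = 0.0
        return self.s
    def step(self, a):
        self.s += a
        return self.s, -abs(self.s)

env = DelayedEnv(IntegratorEnv(), n=2, initial_actions=[0.0, 0.0])
env.reset()
(s1, _), _ = env.step(5.0)   # executes buffered 0.0 -> state stays 0
(s2, _), _ = env.step(7.0)   # executes buffered 0.0 -> state stays 0
(s3, _), _ = env.step(1.0)   # executes 5.0, selected two steps earlier
print(s1, s2, s3)            # 0.0 0.0 5.0
```

Note that the newly selected action is observable to the agent immediately (it enters the buffer, hence the next augmented state), but only affects the plant $n$ steps later, exactly as in the transition distribution of Definition 2.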

III Delay-Aware Markov Reward Process

Our first step is to show that an MDP with multi-step action delays can be converted to a regular MDP problem by state augmentation. We prove the equivalence of these two by comparing their corresponding Markov Reward Processes (MRPs). The delay-free MRP is:

Definition 3.

A Markov Reward Process $\mathrm{MRP}(E, \pi) = (\mathcal{S}, \kappa, \rho, \bar{r})$ can be recovered from a Markov Decision Process $E = (\mathcal{S}, \mathcal{A}, \rho, p, r)$ with a policy $\pi$, such that

$\kappa(s_{t+1} \mid s_t) = \int_{\mathcal{A}} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)\, \mathrm{d}a_t,$
$\bar{r}(s_t) = \int_{\mathcal{A}} r(s_t, a_t)\, \pi(a_t \mid s_t)\, \mathrm{d}a_t,$

where $\kappa$ is the state transition distribution and $\bar{r}$ is the state reward function of the MRP. $E$ is the original environment without delays.

In the delay-free framework, at each time step, the agent selects an action based on the current observation. The action will immediately be executed in the environment to generate the next observation. However, if an action delay exists, the interaction manner between the environment and the agent changes, and a different MRP is generated. An illustration of the delayed interaction between agents and the environment is shown in Fig. 2. The agent interacts with the environment not directly but through an action buffer.

Fig. 2: Interaction between a delayed agent and the environment. The agent interacts with the environment not directly but through an action buffer. At time $t$, the agent gets the observation $s_t$ from the environment as well as the future action sequence from the action buffer. The agent then decides its future action and stores it in the action buffer, and the action buffer pops the current action to be executed in the environment.

Based on the delayed interaction manner between the agent and the environment, the Delay-Aware MRP (DA-MRP) is defined as below.

Definition 4.

A Delay-Aware Markov Reward Process $\mathrm{DAMRP}(E, \boldsymbol{\pi}, n) = (\boldsymbol{\mathcal{S}}, \boldsymbol{\kappa}, \boldsymbol{\rho}, \bar{\mathbf{r}})$ can be recovered from a Markov Decision Process $E = (\mathcal{S}, \mathcal{A}, \rho, p, r)$ with a policy $\boldsymbol{\pi}$ and $n$-step action delay, such that
(1) state space $\boldsymbol{\mathcal{S}} = \mathcal{S} \times \mathcal{A}^n$,
(2) initial state distribution

$\boldsymbol{\rho}(\mathbf{s}_0) = \rho(s_0) \prod_{i=1}^{n} \delta(a_i - c_i),$

where $(c_1, \dots, c_n)$ denotes the initial action sequence,
(3) state transition distribution

$\boldsymbol{\kappa}(\mathbf{s}_{t+1} \mid \mathbf{s}_t) = p(s_{t+1} \mid s_t, a_t^{(1)})\, \boldsymbol{\pi}(a_{t+1}^{(n)} \mid \mathbf{s}_t) \prod_{i=1}^{n-1} \delta(a_{t+1}^{(i)} - a_t^{(i+1)}),$

(4) state-reward function

$\bar{\mathbf{r}}(\mathbf{s}_t) = r(s_t, a_t^{(1)}).$

With Defs. 1–4, we are ready to prove that a DA-MDP is a correct augmentation of an MDP with delay, as stated in Theorem 1.

Theorem 1.

A policy $\boldsymbol{\pi}$ interacting with $\mathrm{DAMDP}(E, n)$ in the delay-free manner produces the same Markov Reward Process as $\boldsymbol{\pi}$ interacting with $E$ with $n$-step action delays, i.e.

$\mathrm{DAMRP}(E, \boldsymbol{\pi}, n) = \mathrm{MRP}(\mathrm{DAMDP}(E, n), \boldsymbol{\pi}). \qquad (1)$

Proof. For any $E = (\mathcal{S}, \mathcal{A}, \rho, p, r)$, we need to prove that the above two MRPs are the same. Referring to Defs. 2 and 3, for $\mathrm{MRP}(\mathrm{DAMDP}(E, n), \boldsymbol{\pi})$, we have
(1) state space $\boldsymbol{\mathcal{S}} = \mathcal{S} \times \mathcal{A}^n$,
(2) initial distribution

$\boldsymbol{\rho}(\mathbf{s}_0) = \rho(s_0) \prod_{i=1}^{n} \delta(a_i - c_i),$

(3) transition kernel

$\boldsymbol{\kappa}(\mathbf{s}_{t+1} \mid \mathbf{s}_t) = \int_{\boldsymbol{\mathcal{A}}} \mathbf{p}(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t)\, \boldsymbol{\pi}(\mathbf{a}_t \mid \mathbf{s}_t)\, \mathrm{d}\mathbf{a}_t = p(s_{t+1} \mid s_t, a_t^{(1)})\, \boldsymbol{\pi}(a_{t+1}^{(n)} \mid \mathbf{s}_t) \prod_{i=1}^{n-1} \delta(a_{t+1}^{(i)} - a_t^{(i+1)}),$

(4) state-reward function

$\bar{\mathbf{r}}(\mathbf{s}_t) = \int_{\boldsymbol{\mathcal{A}}} \mathbf{r}(\mathbf{s}_t, \mathbf{a}_t)\, \boldsymbol{\pi}(\mathbf{a}_t \mid \mathbf{s}_t)\, \mathrm{d}\mathbf{a}_t = r(s_t, a_t^{(1)}).$

Since the expanded terms of $\mathrm{MRP}(\mathrm{DAMDP}(E, n), \boldsymbol{\pi})$ match the corresponding terms of $\mathrm{DAMRP}(E, \boldsymbol{\pi}, n)$ (Def. 4), Eq. 1 holds. ∎

IV Delay-Aware Model-Based Reinforcement Learning

Theorem 1 shows that instead of solving MDPs with action delays, we can equivalently solve the corresponding DA-MDPs. From the transition function of a DA-MDP with multi-step delays,

$\mathbf{p}(\mathbf{s}_{t+1} \mid \mathbf{s}_t, \mathbf{a}_t) = \underbrace{p(s_{t+1} \mid s_t, a_t^{(1)})}_{\text{unknown}}\ \underbrace{\prod_{i=1}^{n-1} \delta(a_{t+1}^{(i)} - a_t^{(i+1)})\, \delta(a_{t+1}^{(n)} - \mathbf{a}_t)}_{\text{known}}, \qquad (2)$

we see that the dynamics is divided into the unknown original dynamics $p$ and the known dynamics caused by the action delays. Thus, solving DA-MDPs with standard reinforcement learning algorithms that assume a completely unknown environment will suffer from the curse of dimensionality. In this section, we propose a delay-aware model-based reinforcement learning framework to achieve high computational efficiency.

As mentioned, RTAC [30] has been proposed to deal with the delay problem. However, we will show that this method is only efficient for 1-step delays. When $n = 1$, any transition $(\mathbf{s}_t, \mathbf{a}_t, r_t, \mathbf{s}_{t+1})$ in the replay buffer is always a valid transition in the Bellman equation with the state-value function as

$v(\mathbf{s}_t) = \mathbb{E}\left[ \mathbf{r}(\mathbf{s}_t, \mathbf{a}_t) + \gamma\, v(\mathbf{s}_{t+1}) \right],$

where $\mathbf{s}_t = (s_t, a_t)$ and $\mathbf{a}_t = a_{t+1}$. However, when considering multi-step delays, i.e., $n > 1$, it is challenging to use off-policy model-free reinforcement learning, because augmented transitions need to be stored and the effect of an action on the state-value function is only learned after $n$-step updates of the Bellman equation. Also, the dimension of the state vector increases with the delay step $n$, resulting in exponential growth of the state space.

Another limitation of model-free methods for DA-MDPs is that it can be difficult to transfer the learned knowledge (e.g., value functions, policies) when the action delay step changes, because the input dimensions of the value functions and policies depend on the delay step $n$. The agent must learn from scratch whenever the system delay changes, which is common in real-world systems.

The problems of model-free methods have motivated us to develop model-based reinforcement learning (MBRL) methods to combat the action delay. MBRL tries to solve MDPs by learning the dynamics model of the environment. Intuitively, we can inject our knowledge of the delay structure into the learned model without extra learning effort. Based on this intuition, we propose a delay-aware MBRL framework to solve multi-step DA-MDPs which efficiently alleviates the aforementioned two problems of model-free methods. From Eq. 2, the unknown part $p(s_{t+1} \mid s_t, a_t^{(1)})$ is exactly the dynamics that we learn in MBRL algorithms for delay-free MDPs. In our proposed framework, only $p$ is learned, and the dynamics caused by the delay is combined with the learned model by adding action delays to the interaction scheme. As mentioned, the learned dynamics model is transferable between systems with different delay steps, since we can adjust the interaction scheme based on the delay step (see Section V-C for an analysis of the transfer performance).

The proposed framework of delay-aware MBRL is shown in Algorithm 1. In the for loop, we solve a planning problem, given a dynamics model with an initial action sequence. For that, the learned model is used not only for the optimal control but also for the state prediction that compensates for the delay effect. By training iteratively, we gradually improve the model accuracy and obtain better planning performance, especially in high-reward regions.

  Input: action delay step $n$, initial actions $(c_1, \dots, c_n)$, and task horizon $T$
  Output: learned transition probability $\hat{p}$
  Initialize replay buffer $\mathcal{D}$ with a random controller for one trial.
  for Episode $k = 1$ to $K$ do
     Train a dynamics model $\hat{p}$ given $\mathcal{D}$.
     Optimize the action sequence with the initial actions $(c_1, \dots, c_n)$ and the estimated system dynamics $\hat{p}$.
     Record experience in $\mathcal{D}$.
  end for
Algorithm 1 Delay-Aware Model-Based Reinforcement Learning
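The planning step of Algorithm 1 must account for the fact that the $n$ buffered actions are applied before any newly chosen action takes effect, so the learned model is first rolled through the buffer to predict the state at which new actions begin to matter. A minimal deterministic sketch (the learned model is stubbed as a function argument; all names are ours):

```python
def predict_delayed_state(model, s_t, buffered_actions):
    """Roll the learned dynamics model through the n buffered actions to
    estimate s_{t+n}, the first state a newly planned action can affect."""
    s = s_t
    for a in buffered_actions:   # known part of the dynamics: a fixed shift
        s = model(s, a)          # s_{k+1} = model(s_k, a_k)
    return s

# Toy linear model s' = s + a: from state 0 with buffer [1, 2, 3],
# the predicted delayed state is 0 + 1 + 2 + 3 = 6.
print(predict_delayed_state(lambda s, a: s + a, 0, [1, 2, 3]))
```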

IV-A Delay-Aware Trajectory Sampling

Recently, several MBRL algorithms have been proposed to match the asymptotic performance of model-free algorithms on challenging benchmark tasks, including probabilistic ensembles with trajectory sampling (PETS) [5], model-based policy optimization (MBPO) [15], and model-based planning with policy networks (POPLIN) [38]. In this section, we combine the state-of-the-art PETS algorithm with the proposed delay-aware MBRL framework to generate a new method for solving DA-MDPs, which we name Delay-Aware Trajectory Sampling (DATS).

In DATS, the dynamics model is represented by an ensemble of probabilistic neural networks that output Gaussian distributions, which helps model the aleatoric uncertainty. The ensemble helps incorporate the epistemic uncertainty of the dynamics model and approximate the Bayesian posterior [28, 18]. The planning of action sequences applies the concept of model predictive control (MPC) with the cross-entropy method (CEM) for elite selection of the sampled action sequences. In the innermost for loop of Algorithm 2, given the current state $s_t$, we first propagate state particles with the same buffered action sequence to obtain estimates of the future state $s_{t+n}$, and then use sampled action sequences to predict subsequent trajectories for each particle. In this way, the uncertainty of the learned model is considered in both the state-prediction and planning phases, which improves the robustness of the algorithm. The complete algorithm is shown in Algorithm 2.

  Input: action delay step $n$, initial actions $(c_1, \dots, c_n)$, task horizon $T$, planning horizon $H$
  Output: learned transition probability $\hat{p}$
  Initialize transition buffer $\mathcal{D}$ with a random controller for one trial.
  for Trial $k = 1$ to $K$ do
     Train a probabilistic dynamics model $\hat{p}$ given $\mathcal{D}$.
     Initialize the action buffer with $(c_1, \dots, c_n)$.
     for Time $t = 0$ to $T$ do
        for Sampled action sequence $a_{t+n : t+n+H} \sim \mathrm{CEM}$ do
           Concatenate with the buffered actions $(a_t^{(1)}, \dots, a_t^{(n)})$.
           Propagate state particles using $\hat{p}$.
           Evaluate the actions by the predicted cumulative reward.
           Update the CEM distribution.
        end for
        Pick the first action $a_{t+n}$ from the optimal action sequence and store it in the action buffer.
     end for
     Record experience in $\mathcal{D}$.
  end for
Algorithm 2 Delay-Aware Trajectory Sampling
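The inner loop of Algorithm 2 is a cross-entropy-method search over action sequences evaluated under the learned model. A simplified single-particle version, without the ensemble or particle propagation of PETS, is sketched below; all names and hyperparameters are illustrative:

```python
import numpy as np

def cem_plan(model, reward_fn, s0, horizon, iters=5, pop=64, elite=8, seed=0):
    """Pick a scalar-action sequence maximizing predicted reward under `model`."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        # Sample candidate action sequences from the current distribution.
        cands = rng.normal(mu, sigma, size=(pop, horizon))
        returns = np.empty(pop)
        for i, seq in enumerate(cands):
            s, total = s0, 0.0
            for a in seq:              # roll out under the learned model
                s = model(s, a)
                total += reward_fn(s, a)
            returns[i] = total
        # Refit the sampling distribution to the elite sequences.
        elites = cands[np.argsort(returns)[-elite:]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

# Toy task: drive a scalar state toward 1 with dynamics s' = s + 0.1 a.
plan = cem_plan(lambda s, a: s + 0.1 * a,
                lambda s, a: -(s - 1.0) ** 2, s0=0.0, horizon=3)
```

In DATS proper, each candidate sequence is first concatenated with the $n$ buffered actions, so the return is evaluated from the predicted delayed state onward rather than from $s_t$.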

Model-based methods have a natural advantage over model-free methods when dealing with multi-step DA-MDPs. With model-free methods, the effect of an action on the state-value function can only be learned after $n$-step updates of the Bellman equation. The agent implicitly wastes both time and effort learning the known part of the system dynamics caused by the action delay, since it does not understand the meaning of the elements in the state vectors. As mentioned, the advantage of model-based methods is that they incorporate the delay effect into the system dynamics without extra learning (see Section V-B for a performance comparison between model-free and model-based methods).

(a) Pendulum
(b) CartPole
(c) Walker2d
(d) Ant
Fig. 3: Benchmark environments.
(a) Pendulum-v0
(b) CartPole-v1
(c) Walker2d-v1
(d) Ant-v1
Fig. 4: Performances (means and standard deviations of rewards) of different algorithms in Gym environments. The environment is non-delayed for SAC and PETS ($n = 0$) and one-step-delayed for the other algorithms. DATS is the proposed algorithm. The results indicate that the performance degradation resulting from the action delay is minimal when using DATS.

V Experiments

V-A Reinforcement Learning in Delayed Systems

Experiments are conducted across four OpenAI Gym/MuJoCo [4, 35] environments for continuous control: Pendulum, CartPole, Walker2d, and Ant, as shown in Fig. 3. The details of the environments are described below.

In Pendulum, a single-link pendulum is fixed at one end, with an actuator located at the joint. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up and keep it upright. Observations include the joint angle and the joint angular velocity. The reward penalizes deviations of position and speed from the upright equilibrium as well as the magnitude of the control input.

In CartPole, a pole is connected to a cart through an un-actuated joint, and the cart moves along a frictionless track. The system is controlled by applying a real-valued force to the cart. The pole starts upright, and the goal is to prevent it from falling over. Let $\theta_t$ be the angle of the pole away from the upright vertical position, and $x_t$ be the displacement of the cart from the center of the rail at time $t$. The 4-dimensional observation at time $t$ is $(x_t, \dot{x}_t, \theta_t, \dot{\theta}_t)$. A reward of +1 is provided for every timestep that the pole remains upright.

Walker2d is a 2-dimensional bipedal robot consisting of 7 rigid links, including a torso and 2 legs. There are 6 actuators, 3 for each leg. The observations include the (angular) positions and speeds of all joints. The reward is the forward speed minus penalties for the distance to a target height and the magnitude of the control input. The goal is to walk forward as fast as possible while keeping the standing height with minimal control input.

Ant is a 3-dimensional 4-legged robot with 13 rigid links (including a torso and 4 legs). There are 8 actuators at the joints, 2 for each leg. The observations include the (angular) positions and speeds of all joints. The reward is the forward speed minus penalties for the distance to a target height and the magnitude of the control input. The goal is to walk forward as fast as possible while approximately maintaining the normal standing height with minimal control input.

Among the 4 continuous control tasks, Walker2d and Ant are considered more challenging than Pendulum and CartPole, as indicated by the dimension of their dynamics.

In experiments, we add delays manually by revising the interaction framework between the agents and the environments if needed.

To show the advantage of DATS, we use 5 algorithms:

  • SAC ($n = 0$): Soft actor-critic [11] is a state-of-the-art model-free reinforcement learning algorithm serving as the model-free baseline. Only the performance at the maximum time step is visualized.

  • PETS ($n = 0$): The PETS algorithm [5] is implemented in the non-delayed environment without action delays, providing a performance upper bound for algorithms in delayed environments.

  • PETS ($n = 1$): The PETS algorithm is blindly implemented in the 1-step delayed environment without modeling action delays.

  • W-PETS ($n = 1$): The PETS algorithm is augmented to solve DA-MDPs. However, it inefficiently tries to learn the whole dynamics in Eq. 2, including the known part caused by action delays.

  • DATS ($n = 1$): DATS is our proposed method, as in Algorithm 2. It incorporates the action delay into the framework and only learns the unknown original dynamics in Eq. 2.

Each algorithm is run with 10 random seeds in each environment. Fig. 4 shows the algorithmic performances. As the model-free baseline, SAC is not as efficient as PETS in the four environments when there are no delays. While PETS ($n = 1$) has the worst performance because the agent does not consider the action delay and learns the wrong dynamics, it can still make some improvement in simple environments like Pendulum (Fig. 4(a)) and CartPole (Fig. 4(b)) due to the correlation of transitions. PETS ($n = 1$) performs poorly in Walker2d (Fig. 4(c)) and Ant (Fig. 4(d)), tasks that need accurate transition dynamics for planning. W-PETS achieves performance similar to PETS in Pendulum and CartPole, but its performance also degrades considerably as the task gets more difficult, since it has to learn the dynamics of the extra state dimensions caused by the $n$-step action delays (Fig. 4(c) and 4(d)). DATS performs on par with PETS across the four tasks, i.e., action delays barely affect DATS.

(a) DATS in Pendulum-v0
(b) RTAC in Pendulum-v0
(c) DATS in Walker2d-v1
(d) RTAC in Walker2d-v1
Fig. 5: Performances (means and standard deviations of rewards) of DATS and RTAC in Gym environments with different action delay steps. The model-based algorithm DATS outperforms the model-free algorithm RTAC in terms of efficiency and stability. RTAC degrades significantly as the delay step increases.

The reason why DATS in the delayed environment matches the asymptotic performance of PETS in the non-delayed environment is that the quality and quantity of the transitions used for model training in DATS are almost the same as in PETS, despite the action delay. The slight difference is due to the distribution shift caused by the predefined initial actions, which has minimal influence on overall performance if the task horizon is long enough compared to the action delay step.

V-B Model-Based vs Model-Free

To show the advantage of the proposed delay-aware MBRL framework when dealing with multi-step delays, we compare the model-free algorithm RTAC [30] with the proposed model-based DATS. RTAC, modified from SAC, is suitable for solving DA-MDPs, but as explained in Section IV, it avoids extra learning only when the action delay is exactly one step.

We test them in the simple environment Pendulum and the complex environment Walker2d with various delay steps $n \in \{1, 2, 4, 8, 16\}$. The learning curves in Fig. 5 show that DATS outperforms RTAC in efficiency and stability. DATS maintains consistent performance while RTAC degrades significantly as the delay step increases, even for the simple task Pendulum, as shown in Fig. 5(b). The reason is that, with the original dynamics of Pendulum and Walker2d fixed, the extra dynamics caused by the action delay rapidly dominates the dimension of the state space as the delay step increases, and exponentially more transitions are needed for sampling and learning.
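The state-space growth can be made concrete: by Definition 2, the augmented state has dimension $\dim(\mathcal{S}) + n \cdot \dim(\mathcal{A})$. The quick calculation below uses approximate dimensions for Walker2d (a 17-dimensional observation and 6-dimensional action, our assumption based on standard Gym versions):

```python
def augmented_dim(dim_s, dim_a, n):
    # Dimension of the DA-MDP state (s_t, a_t^(1), ..., a_t^(n)).
    return dim_s + n * dim_a

# Walker2d: roughly a 17-dim observation and a 6-dim action space.
for n in (1, 4, 16):
    print(n, augmented_dim(17, 6, n))   # -> 23, 41, 113
```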

n  | DATS ($n^*=1$) | DATS ($n^*=2$) | DATS ($n^*=4$) | DATS ($n^*=8$) | RTAC
1  | 154.10±14.86 | 156.37±13.29 | 163.29±16.03 | 149.78±13.78 | 121.36±12.63
2  | 163.92±15.23 | 162.93±14.26 | 155.90±16.11 | 160.07±18.30 | 109.44±12.58
4  | 160.39±12.63 | 162.87±16.21 | 171.53±10.85 | 166.29±14.22 | 80.15±27.94
8  | 163.29±15.53 | 151.20±13.44 | 166.37±13.32 | 166.59±10.59 | -110.28±58.89
16 | 153.41±17.35 | 159.09±19.88 | 153.89±14.22 | 149.90±16.86 | -122.98±64.82
(a) Pendulum-v0
1  | 471.34±426.26 | 524.76±387.67 | 496.13±442.89 | 395.78±409.98 | -471.13±896.28
2  | 549.73±410.76 | 487.32±334.49 | 527.98±477.19 | 492.56±490.01 | -754.42±722.79
4  | 485.29±438.98 | 439.23±529.39 | 248.60±611.82 | 552.91±410.76 | -1252.47±710.10
8  | 356.93±431.58 | 438.82±563.13 | 482.09±316.34 | 247.97±595.63 | -1766.85±404.28
16 | 292.38±521.86 | 311.44±409.80 | 473.97±309.81 | 401.34±634.12 | -2173.87±625.76
(b) Walker2d-v1
TABLE I: Reward matrix (mean±std) of DATS (trained with delay step $n^*$) and RTAC, tested with delay step $n$.

V-C Transferable Knowledge

In this section, we show the transferability of the knowledge learned by DATS. We first learn several dynamics models $\{\hat{p}_{n^*}\}$ in Pendulum and Walker2d with DATS, where $n^*$ denotes the action delay step during training. The learned models are then tested in environments with $n$-step action delays ($n = 1, 2, 4, 8, 16$). We train the dynamics model in each environment with the same amount of transitions: 2,000 for Pendulum and 200,000 for Walker2d. The planning method and hyper-parameters stay the same as those in Algorithm 2. RTAC provides the model-free baseline for each environment. Recall that since RTAC is a model-free algorithm, it must learn from scratch when the delay step changes.

The reward matrix in Table I shows that DATS performs well even when the test delay step is twice the maximum step used during model training ($n = 16$ vs. $n^* = 8$) for both Pendulum and Walker2d. We infer that the learned knowledge (the dynamics, in this case) is transferable: when the action delay of the system changes, the estimated dynamics remain useful by simply adjusting the known part of the dynamics caused by the action delay. In contrast, RTAC performs poorly as the delay step increases, since the dimension of the state space grows and the agent has to spend more effort learning the delay dynamics. Notably, the learned knowledge of model-free methods cannot be transferred when the delay step changes.

The results suggest that the transferability of DATS makes it suitable for Sim-to-Real tasks when there are action delays in real systems, and that the delay step during model training does not have to equal the delay step of the real system. Therefore, if the delay steps of real-world tasks are known and fixed, we can combine the delay effect with the original dynamics learned in a delay-free simulator and obtain highly efficient Sim-to-Real transfer.

VI Conclusion

This paper proposed a general delay-aware MBRL framework that solves multi-step DA-MDPs with high efficiency and transferability. Our key insight is that the dynamics of a DA-MDP can be divided into two parts: the known part caused by delays, and the unknown part inherited from the original delay-free MDP. The proposed delay-aware MBRL framework learns the unknown original dynamics and incorporates the known part of the dynamics explicitly. We also provided an efficient implementation of delay-aware MBRL, DATS, by combining it with a state-of-the-art modeling and planning method, PETS. The experimental results showed that the performance of DATS in delayed environments matches that of PETS in non-delayed environments, regardless of the delay duration. Moreover, the dynamics learned by DATS are transferable when the action delay changes, making DATS a preferred algorithm for tasks in real-world systems.


  • [1] Z. Artstein (1982) Linear systems with delayed controls: a reduction. IEEE Transactions on Automatic control 27 (4), pp. 869–879. Cited by: §I.
  • [2] K. J. Astrom, C. C. Hang, and B. Lim (1994) A new smith predictor for controlling a process with an integrator and long dead-time. IEEE transactions on Automatic Control 39 (2), pp. 343–345. Cited by: §I.
  • [3] F. P. Bayan, A. D. Cornetto, A. Dunn, and E. Sauer (2009) Brake timing measurements for a tractor-semitrailer under emergency braking. SAE International Journal of Commercial Vehicles 2 (2009-01-2918), pp. 245–255. Cited by: §I, §I, §II-B.
  • [4] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) Openai gym. arXiv preprint arXiv:1606.01540. Cited by: §I, §V-A.
  • [5] K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765. Cited by: §IV-A, 2nd item.
  • [6] L. Chung, C. Lin, and K. Lu (1995) Time-delay control of structures. Earthquake Engineering & Structural Dynamics 24 (5), pp. 687–701. Cited by: §I.
  • [7] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338. Cited by: §I.
  • [8] L. Dugard and E. I. Verriest (1998) Stability and control of time-delay systems. Vol. 228, Springer. Cited by: §I.
  • [9] S. Gong, J. Shen, and L. Du (2016) Constrained optimization and distributed computation based car following control of a connected and autonomous vehicle platoon. Transportation Research Part B: Methodological 94, pp. 314–334. Cited by: §I.
  • [10] K. Gu and S. Niculescu (2003) Survey on recent results in the stability and control of time-delay systems. Journal of dynamic systems, measurement, and control 125 (2), pp. 158–165. Cited by: §I.
  • [11] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: 1st item.
  • [12] R. Hannah and W. Yin (2018) On unbounded delays in asynchronous parallel fixed-point algorithms. Journal of Scientific Computing 76 (1), pp. 299–326. Cited by: §I.
  • [13] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter (2017) Control of a quadrotor with reinforcement learning. IEEE Robotics and Automation Letters 2 (4), pp. 2096–2103. Cited by: §I.
  • [14] T. Imaida, Y. Yokokohji, T. Doi, M. Oda, and T. Yoshikawa (2004) Ground-space bilateral teleoperation of ets-vii robot arm by direct bilateral coupling under 7-s time delay condition. IEEE Transactions on Robotics and Automation 20 (3), pp. 499–511. Cited by: §I.
  • [15] M. Janner, J. Fu, M. Zhang, and S. Levine (2019) When to trust your model: model-based policy optimization. arXiv preprint arXiv:1906.08253. Cited by: §IV-A.
  • [16] M. Jin, S. H. Kang, and P. H. Chang (2008) Robust compliant motion control of robot with nonlinear friction using time-delay estimation. IEEE Transactions on Industrial Electronics 55 (1), pp. 258–269. Cited by: §I.
  • [17] K. V. Katsikopoulos and S. E. Engelbrecht (2003) Markov decision processes with delays and asynchronous cost collection. IEEE Transactions on Automatic Control 48 (4), pp. 568–574. Cited by: §I, §II-B.
  • [18] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413. Cited by: §IV-A.
  • [19] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §I.
  • [20] A. Manitius and A. Olbrot (1979) Finite spectrum assignment problem for systems with delays. IEEE Transactions on Automatic Control 24 (4), pp. 541–552. Cited by: §I.
  • [21] M. R. Matausek and A. Micic (1999) On the modified Smith predictor for controlling a process with an integrator and long dead-time. IEEE Transactions on Automatic Control 44 (8), pp. 1603–1606. Cited by: §I.
  • [22] L. Mirkin (2000) On the extraction of dead-time controllers from delay-free parametrizations. IFAC Proceedings Volumes 33 (23), pp. 169–174. Cited by: §I.
  • [23] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §I.
  • [24] S. Mondié and W. Michiels (2003) Finite spectrum assignment of unstable time-delay systems with a safe implementation. IEEE Transactions on Automatic Control 48 (12), pp. 2207–2212. Cited by: §I.
  • [25] S. B. Moon, P. Skelly, and D. Towsley (1999) Estimation and removal of clock skew from network delay measurements. In IEEE INFOCOM'99. Conference on Computer Communications. Proceedings. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. The Future is Now (Cat. No. 99CH36320), Vol. 1, pp. 227–234. Cited by: §I.
  • [26] E. Moulay, M. Dambrine, N. Yeganefar, and W. Perruquetti (2008) Finite-time stability and stabilization of time-delay systems. Systems & Control Letters 57 (7), pp. 561–566. Cited by: §I.
  • [27] S. Niculescu (2001) Delay effects on stability: a robust control approach. Vol. 269, Springer Science & Business Media. Cited by: §I.
  • [28] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy (2016) Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034. Cited by: §IV-A.
  • [29] J. Ploeg, N. Van De Wouw, and H. Nijmeijer (2013) Lp string stability of cascaded systems: application to vehicle platooning. IEEE Transactions on Control Systems Technology 22 (2), pp. 786–793. Cited by: §I.
  • [30] S. Ramstedt and C. Pal (2019) Real-time reinforcement learning. In Advances in Neural Information Processing Systems, pp. 3067–3076. Cited by: §I, §IV, §V-B.
  • [31] E. Schuitema, L. Buşoniu, R. Babuška, and P. Jonker (2010) Control delay in reinforcement learning for real-time dynamic systems: a memoryless approach. In 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3226–3231. Cited by: §II-B.
  • [32] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §I.
  • [33] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §I, §II-B.
  • [34] S. P. Singh, T. Jaakkola, and M. I. Jordan (1994) Learning without state-estimation in partially observable Markovian decision processes. In Machine Learning Proceedings 1994, pp. 284–292. Cited by: §I, §II-B.
  • [35] E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §I, §V-A.
  • [36] J. B. Travnik, K. W. Mathewson, R. S. Sutton, and P. M. Pilarski (2018) Reactive reinforcement learning in asynchronous environments. Frontiers in Robotics and AI 5, pp. 79. Cited by: §I.
  • [37] T. J. Walsh, A. Nouri, L. Li, and M. L. Littman (2009) Learning and planning in environments with delayed feedback. Autonomous Agents and Multi-Agent Systems 18 (1), pp. 83. Cited by: §I.
  • [38] T. Wang and J. Ba (2019) Exploring model-based planning with policy networks. arXiv preprint arXiv:1906.08649. Cited by: §IV-A.