Thinking While Moving: Deep Reinforcement Learning with Concurrent Control

04/13/2020 ∙ by Ted Xiao, et al. ∙ Google ∙ The Team at X

We study reinforcement learning in settings where sampling an action from the policy must be done concurrently with the time evolution of the controlled system, such as when a robot must decide on the next action while still performing the previous action. Much like a person or an animal, the robot must think and move at the same time, deciding on its next action before the previous one has completed. In order to develop an algorithmic framework for such concurrent control problems, we start with a continuous-time formulation of the Bellman equations, and then discretize them in a way that is aware of system delays. We instantiate this new class of approximate dynamic programming methods via a simple architectural extension to existing value-based deep reinforcement learning algorithms. We evaluate our methods on simulated benchmark tasks and a large-scale robotic grasping task where the robot must "think while moving".




1 Introduction

In recent years, Deep Reinforcement Learning (DRL) methods have achieved tremendous success on a variety of diverse environments, including video games (Mnih et al., 2015), zero-sum games (Silver et al., 2016), robotic grasping (Kalashnikov et al., 2018), and in-hand manipulation tasks (OpenAI et al., 2018). While impressive, all of these examples use a blocking observe-think-act paradigm: the agent assumes that the environment will remain static while it thinks, so that its actions will be executed on the same states from which they were computed. This assumption breaks down in the concurrent real world, where the environment state evolves substantially as the agent processes observations and plans its next actions. As an example, consider a dynamic task such as catching a ball: it is not possible to pause the ball mid-air while waiting for the agent to decide on the next control to command. In addition to solving dynamic tasks where blocking models would fail, thinking and acting concurrently can provide benefits such as smoother, human-like motions and the ability to seamlessly plan the next action while executing the current one.

Despite these potential benefits, most DRL approaches are evaluated in blocking simulation environments, which assume that the environment state will not change between when the state is observed and when the action is executed. This assumption holds in most simulated environments, including popular domains such as Atari (Mnih et al., 2013) and the Gym control benchmarks (Brockman et al., 2016). The system is treated in a sequential manner: the agent observes a state, freezes time while computing an action, and finally applies the action and unfreezes time. However, in dynamic real-time environments such as real-world robotics, the synchronous environment assumption is no longer valid. After observing the state of the environment and computing an action, the agent often finds that by the time it executes the action, the environment state has evolved from what it initially observed; we call such an environment a concurrent environment.

In this paper, we introduce an algorithmic framework that can handle concurrent environments in the context of DRL. In particular, we derive a modified Bellman operator for concurrent MDPs and present the minimal set of information that we must augment state observations with in order to recover blocking performance with Q-learning. We introduce experiments on different simulated environments that incorporate concurrent actions, ranging from common simple control domains to vision-based robotic grasping tasks. Finally, we show an agent that acts concurrently in a real-world robotic grasping task is able to achieve comparable task success to a blocking baseline while acting faster.

2 Related Work

Minimizing Concurrent Effects

Although real-world robotic systems are inherently concurrent, it is sometimes possible to engineer them into approximately blocking systems. For example, using low-latency hardware (Abbeel et al., 2006) and low-footprint controllers (Cruz et al., 2017) minimizes the time spent during state capture and policy inference. Another option is to design actions that are executed to completion via closed-loop feedback controllers, with the system velocity decelerated to zero before a state is recorded (Kalashnikov et al., 2018). In contrast to these works, we tackle concurrent action execution directly in the learning algorithm. Our approach can be applied to tasks where it is not possible to wait for the system to come to rest before deciding on new actions.

Algorithmic Approaches

Other works utilize algorithmic modifications to directly overcome the challenges of concurrent control. Previous work in this area can be grouped into five approaches: (1) learning policies that are robust to variable latencies (Tan et al., 2018), (2) including past history such as frame-stacking (Haarnoja et al., 2018), (3) learning dynamics models to predict the future state at which the action will be executed (Firoiu et al., 2018; Amiranashvili et al., 2018), (4) using a time-delayed MDP framework (Walsh et al., 2007; Firoiu et al., 2018; Schuitema et al., 2010; Ramstedt and Pal, 2019), and (5) temporally-aware architectures such as Spiking Neural Networks (Vasilaki et al., 2009; Frémaux et al., 2013), point processes (Upadhyay et al., 2018; Li et al., 2018), and adaptive skip intervals (Neitz et al., 2018). In contrast to these works, our approach is able to (1) optimize for a specific latency regime, as opposed to being robust to all of them, (2) consider the properties of the source of latency, as opposed to forcing the network to learn them from high-dimensional inputs, (3) avoid learning explicit forward dynamics models in high-dimensional spaces, which can be costly and challenging, and (4) consider environments where actions are interrupted, as opposed to discrete-time time-delayed environments where multiple actions are queued and each action is executed until completion. A recent work, Ramstedt and Pal (2019), extends constant-delay MDPs to actor-critic methods on high-dimensional image-based tasks. The approaches in (5) show promise in enabling asynchronous agents, but are still active areas of research that have not yet been extended to high-dimensional, image-based robotic tasks.

Continuous-time Reinforcement Learning

While the previously mentioned related works largely operate in discrete-time environments, framing concurrent environments as continuous-time systems is a natural fit. In the realm of continuous-time optimal control, path integral solutions (Kappen, 2005; Theodorou et al., 2010) are linked to different noise levels in system dynamics, which could potentially include latency that results in concurrent properties. Finite differences can approximate the Bellman update in continuous-time stochastic control problems (Munos and Bourgine, 1998), and continuous-time temporal difference learning methods (Doya, 2000) can utilize neural networks as function approximators (Coulom, 2002). The effect of time discretization (converting continuous-time environments to discrete-time environments) is studied in Tallec et al. (2019), where the advantage update is scaled by the time discretization parameter. While these approaches are promising, it remains untested how they apply to image-based DRL problems. Nonetheless, we build on many of the theoretical formulations in these works, which motivate our application of deep reinforcement learning methods to more complex, vision-based robotics tasks.

3 Value-based Reinforcement Learning in Concurrent Environments

In this section, we first introduce the concept of concurrent environments, and then describe the preliminaries necessary for discrete- and continuous-time RL formulations. We then describe the MDP modifications sufficient to represent concurrent actions and finally, present value-based RL algorithms that can cope with concurrent environments.

The main idea behind our method is simple and can be implemented with small modifications to standard value-based algorithms. It centers on adding additional information to the learning algorithm (in our case, extra information about the previous action provided to the Q-function) that allows it to cope with concurrent actions. Below, we provide theoretical justification for why these modifications are necessary, and we specify the details of the algorithm in Alg. 1.

While concurrent environments affect DRL methods beyond model-free value-based RL, we focus our scope on model-free value-based methods due to their attractive sample-efficiency and off-policy properties for real-world vision-based robotic tasks.

3.1 Concurrent Action Environments

In blocking environments (Figure 4a in the Appendix), actions are executed in a sequential, blocking fashion that assumes the environment state does not change between when the state is observed and when the action is executed. Equivalently, state capture and policy inference are viewed as instantaneous from the perspective of the agent. In contrast, concurrent environments (Figure 4b in the Appendix) do not assume a fixed environment during state capture and policy inference, but instead allow the environment to evolve during these time segments.

3.2 Discrete-Time Reinforcement Learning Preliminaries

We use standard reinforcement learning formulations in both discrete-time and continuous-time settings (Sutton and Barto, 1998). In the discrete-time case, at each time step $t$, the agent receives a state $s_t$ from a set of possible states $\mathcal{S}$ and selects an action $a_t$ from a set of possible actions $\mathcal{A}$ according to its policy $\pi$, where $\pi$ is a mapping from $\mathcal{S}$ to $\mathcal{A}$. The environment returns the next state $s_{t+1}$, sampled from a transition distribution $p(s_{t+1} \mid s_t, a_t)$, and a reward $r(s_t, a_t)$. The return for a given trajectory of states and actions is the total discounted return from time step $t$ with discount factor $\gamma \in (0, 1]$: $R_t = \sum_{i=0}^{\infty} \gamma^{i} r(s_{t+i}, a_{t+i})$. The goal of the agent is to maximize the expected return from each state $s_t$. The Q-function for a given stationary policy $\pi$ gives the expected return when selecting action $a$ at state $s$: $Q^{\pi}(s, a) = \mathbb{E}\left[R_t \mid s_t = s, a_t = a\right]$. Similarly, the value function gives the expected return from state $s$: $V^{\pi}(s) = \mathbb{E}\left[R_t \mid s_t = s\right]$.
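As a concrete illustration of these definitions (our own sketch, not code from the paper), the discounted return $R_t$ and a Monte Carlo estimate of $Q^{\pi}$ can be computed directly from sampled trajectories; the toy rollouts below are hypothetical:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_i gamma^i * r_{t+i}, evaluated at the start of the rollout."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

def mc_q_estimate(rollout_rewards, gamma=0.99):
    """Monte Carlo estimate of Q(s, a): average the discounted returns of
    rollouts that start in state s and take action a first."""
    return float(np.mean([discounted_return(r, gamma) for r in rollout_rewards]))

# Two hypothetical reward sequences from rollouts starting at (s, a):
q_hat = mc_q_estimate([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]], gamma=0.9)
```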

The default blocking environment formulation is detailed in Figure 1a.

3.3 Value Functions and Policies in Continuous Time

For the continuous-time case, we start by formalizing a continuous-time MDP with the stochastic differential equation:

$$ds = F(s, a)\,dt + G(s, a)\,d\beta, \tag{1}$$

where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $F$ and $G$ describe the stochastic dynamics of the environment, and $\beta$ is a Wiener process (Ross et al., 1996). In the continuous-time setting, $s(t)$ is analogous to the discrete-time $s_t$ defined in Section 3.2. Continuous-time functions $s(t)$ and $a_i(t)$ specify the state and the $i$-th action taken by the agent. The agent interacts with the environment through a state-dependent, deterministic policy function $\pi$, and the return of a trajectory $\tau$ is given by (Doya, 2000):

$$R(\tau) = \int_{0}^{\infty} \gamma^{t}\, r(s(t), a(t))\, dt, \tag{2}$$

which leads to a continuous-time value function (Tallec et al., 2019):

$$V^{\pi}(s(t)) = \mathbb{E}_{p}\left[\int_{t}^{\infty} \gamma^{u-t}\, r(s(u), a(u))\, du\right], \tag{3}$$

and similarly, a continuous Q-function:

$$Q^{\pi}(s(t), a(t), H) = \mathbb{E}_{p}\left[\int_{t}^{t+H} \gamma^{u-t}\, r(s(u), a(u))\, du + \gamma^{H} V^{\pi}(s(t+H))\right], \tag{4}$$

where $H$ is the constant sampling period between state captures (i.e., the duration of an action trajectory) and $a(\cdot)$ refers to the continuous action function that is applied between $t$ and $t+H$. The expectations are computed with respect to the stochastic process defined in Eq. 1.
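To make the continuous-time quantities concrete, the dynamics of Eq. 1 and the discounted reward integral inside Eq. 4 can be approximated with an Euler–Maruyama discretization. This is an illustrative sketch, not the paper's implementation; the drift, diffusion, reward, and policy functions are stand-ins supplied by the caller:

```python
import numpy as np

def simulate_sde(s0, policy, drift, diffusion, r, H=1.0, dt=0.01,
                 gamma=0.99, rng=None):
    """Euler-Maruyama rollout of ds = F(s,a)dt + G(s,a)dbeta over one
    sampling period H, accumulating the discounted reward integral
    int_0^H gamma^u r(s(u), a(u)) du. Returns (final state, integral)."""
    rng = rng or np.random.default_rng(0)
    s, ret = np.asarray(s0, dtype=float), 0.0
    for k in range(int(H / dt)):
        u = k * dt
        a = policy(s)
        ret += gamma**u * r(s, a) * dt              # reward integral term
        dbeta = rng.normal(0.0, np.sqrt(dt), size=s.shape)  # Wiener increment
        s = s + drift(s, a) * dt + diffusion(s, a) * dbeta
    return s, ret
```

With zero diffusion this reduces to a deterministic ODE rollout, which makes the reward integral easy to sanity-check by hand.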

3.4 Concurrent Action Markov Decision Processes

Figure 1:

Shaded nodes represent observed variables and unshaded nodes represent unobserved random variables.

(a): In “blocking” MDPs, the environment state does not change while the agent records the current state and selects an action. (b): In “concurrent” MDPs, state and action dynamics are continuous-time stochastic processes $s(t)$ and $a_i(t)$. At time $t$, the agent observes the state of the world $s(t)$, but by the time it selects an action $a_i(t)$, the previous continuous-time action function $a_{i-1}(t)$ has “rolled over” to an unobserved state $s(t + t_{AS})$. An agent that concurrently selects actions from old states while in motion may need to interrupt a previous action before it has finished executing its current trajectory.

We consider Markov Decision Processes (MDPs) with concurrent actions, where actions are not executed to full completion.

More specifically, concurrent action environments capture the system state while the previous action is still being executed. After state capture, the policy selects an action that is executed in the environment regardless of whether the previous action has completed, as shown in Figure 4 in the Appendix. In the continuous-time MDP case, concurrent actions can be considered as horizontally translating the action function along the time dimension (Walsh et al., 2007); the effect of concurrent actions is illustrated in Figure 1b. Although we derive Bellman equations for handling delays in both continuous- and discrete-time RL, our experiments extend existing DRL implementations, which are based on discrete time.

3.5 Value-based Concurrent Reinforcement Learning Algorithms in Continuous and Discrete-Time

We start our derivation from this continuous-time reinforcement learning standpoint, as it allows us to easily characterize the concurrent nature of the system. We then demonstrate that the conclusions drawn for the continuous case also apply to the more commonly-used discrete setting that we then use in all of our experiments.

Continuous Formulation

In order to further analyze the concurrent setting, we introduce the following notation. As shown in Figure 1b, an agent selects $N$ action trajectories during an episode, $\{a_1(t), \ldots, a_N(t)\}$, where each $a_i(t)$ is a continuous function generating controls as a function of time $t$. Let $t_{AS}$ be the time duration of state capture, policy inference, and any additional communication latencies. At time $t$, an agent begins computing the $i$-th trajectory $a_i(t)$ from state $s(t)$, while concurrently executing the previously selected trajectory $a_{i-1}(t)$ over the time interval $(t, t + t_{AS})$. At time $t + t_{AS}$, where $t_{AS} \leq H$, the agent switches to executing actions from $a_i(t)$. The continuous-time Q-function for the concurrent case from Eq. 4 can be expressed as follows:

$$Q^{\pi}(s(t), a_{i-1}, a_i, t, t_{AS}, H) = \mathbb{E}_{p}\left[\int_{t}^{t+t_{AS}} \gamma^{u-t} r(s(u), a_{i-1}(u))\, du\right] + \mathbb{E}_{p}\left[\int_{t+t_{AS}}^{t+H} \gamma^{u-t} r(s(u), a_i(u))\, du\right] + \mathbb{E}_{p}\left[\gamma^{H} V^{\pi}(s(t+H))\right]. \tag{5}$$
The first two terms correspond to the expected discounted returns for executing the action trajectory $a_{i-1}(t)$ from time $t$ to $t + t_{AS}$ and the trajectory $a_i(t)$ from time $t + t_{AS}$ to $t + H$. We can obtain a single-sample Monte Carlo estimator $\hat{Q}$ by sampling random function values, which simply correspond to policy rollouts:

$$\hat{Q}^{\pi}(s(t), a_{i-1}, a_i, t, t_{AS}, H) = \int_{t}^{t+t_{AS}} \gamma^{u-t} r(s(u), a_{i-1}(u))\, du + \int_{t+t_{AS}}^{t+H} \gamma^{u-t} r(s(u), a_i(u))\, du + \gamma^{H} \hat{V}^{\pi}(s(t+H)). \tag{6}$$
Next, for the continuous-time case, we define a new concurrent Bellman backup operator:

$$(\mathcal{B}_{c} Q)(s(t), a_{i-1}, a_i, t, t_{AS}, H) = \mathbb{E}_{p}\left[\int_{t}^{t+t_{AS}} \gamma^{u-t} r(s(u), a_{i-1}(u))\, du\right] + \gamma^{t_{AS}}\, \mathbb{E}_{p}\left[\max_{a_{i+1}} Q(s(t+t_{AS}), a_i, a_{i+1}, t+t_{AS}, t_{AS}, H)\right]. \tag{7}$$
In addition to expanding the Bellman operator to take concurrent actions into account, we demonstrate that this modified operator maintains the contraction property that is crucial for Q-learning convergence.

Lemma 3.1.

The concurrent continuous-time Bellman operator is a contraction.


See Appendix A.2. ∎

Discrete Formulation

In order to simplify the notation for the discrete-time case, where the distinction between the action function $a_i(t)$ and the value of that function at time step $t$ is not necessary, we refer to the current state, current action, and previous action as $s_t$, $a_t$, $a_{t-1}$, respectively, replacing subindex $i$ with $t$. Following this notation, we define the concurrent Q-function for the discrete-time case:

$$Q_{c}^{\pi}(s_t, a_{t-1}, a_t, t_{AS}) = \mathbb{E}_{p}\left[r(s_t, a_{t-1}, a_t, t_{AS})\right] + \gamma\, \mathbb{E}_{p, \pi}\left[Q_{c}^{\pi}(s_{t+1}, a_t, a_{t+1}, t_{AS}')\right], \tag{8}$$

where $t_{AS}$ is the “spillover duration” for the action $a_{t-1}$ that is still executing when state $s_t$ is captured (see Figure 1b). The concurrent Bellman operator, specified by a subscript $c$, is as follows:

$$(\mathcal{B}_{c} Q)(s_t, a_{t-1}, a_t, t_{AS}) = \mathbb{E}_{p}\left[r(s_t, a_{t-1}, a_t, t_{AS})\right] + \gamma\, \mathbb{E}_{p}\left[\max_{a_{t+1}} Q(s_{t+1}, a_t, a_{t+1}, t_{AS}')\right]. \tag{9}$$
Similarly to the continuous-time case, we demonstrate that this Bellman operator is a contraction.

Lemma 3.2.

The concurrent discrete-time Bellman operator is a contraction.


See Appendix A.2. ∎

We refer the reader to Appendix A.1 for more detailed derivations of the Q-functions and Bellman operators. Crucially, Equation 9 implies that we can extend a conventional discrete-time Q-learning framework to handle MDPs with concurrent actions by providing the Q-function with the values of $a_{t-1}$ and $t_{AS}$, in addition to the standard inputs $(s_t, a_t)$.
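As a minimal sketch of what this implies in practice (our illustration, not the paper's implementation), a concurrent Q-learning target differs from the standard one only in the extra inputs: the Q-function is additionally conditioned on the previous action and the spillover time, and the action executed at the current step becomes the previous-action input at the next step. The function names and the discrete candidate-action set here are hypothetical:

```python
import numpy as np

def concurrent_td_target(q_fn, next_s, a, next_t_as, r, candidate_actions,
                         gamma=0.99):
    """TD target r + gamma * max_{a'} Q(s', a', a_prev=a, t_AS').
    The action a executed at this step becomes the previous-action input
    of the Q-function at the next step (cf. Equation 9)."""
    q_next = np.array([q_fn(next_s, a_next, a, next_t_as)
                       for a_next in candidate_actions])
    return r + gamma * q_next.max()
```

For a continuous action space, the max over a discrete candidate set would be replaced by an optimizer such as CEM, as in QT-Opt.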

3.6 Deep -Learning with Concurrent Knowledge

While we have shown that knowledge of the concurrent system properties ($t_{AS}$ and $a_{t-1}$, as defined previously for the discrete-time case) is theoretically sufficient, it is often hard to accurately predict $t_{AS}$ during inference on a complex robotic system. In order to allow practical implementation of our algorithm on a wide range of RL agents, we consider three additional features encapsulating concurrent knowledge used to condition the Q-function: (1) the previous action $a_{t-1}$, (2) the action selection time $t_{AS}$, and (3) Vector-to-go ($VTG$), which we define as the remaining action to be executed at the instant the state is measured. We limit our analysis to environments where $a_{t-1}$, $t_{AS}$, and $VTG$ are all obtainable and $H$ is held constant. See Appendix A.3 for details.
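For a displacement-controlled manipulator, VTG has a particularly simple form: it is the portion of the previously commanded action that has not yet been executed when the new state is captured. A hedged sketch (our notation, not the paper's code):

```python
import numpy as np

def vector_to_go(commanded_displacement, executed_displacement):
    """VTG: the remainder of the previous commanded action at the instant
    the new state is measured. For a displacement action this is simply
    the commanded displacement minus what has executed so far."""
    return np.asarray(commanded_displacement) - np.asarray(executed_displacement)

# E.g. the policy commanded a 10 cm move along x, and the arm had covered
# 6 cm of it by the time the next state was captured:
vtg = vector_to_go([0.10, 0.0, 0.0], [0.06, 0.0, 0.0])
```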

4 Experiments

In our experimental evaluation we aim to study the following questions: (1) Is concurrent knowledge, as defined in Section 3.6, both necessary and sufficient for a Q-function to recover the performance of a blocking unconditioned Q-function when acting in a concurrent environment? (2) Which representations of concurrent knowledge are most useful for a Q-function acting in a concurrent environment? (3) Can concurrent models improve the smoothness and execution speed of a real-robot policy in a realistic, vision-based manipulation task?

4.1 Toy First-Order Control Problems

(a) Cartpole
(b) Pendulum
Figure 2:

In concurrent versions of Cartpole and Pendulum, we observe that providing the critic with VTG leads to more robust performance across all hyperparameters. (a) Environment rewards achieved by DQN with different network architectures (either a feedforward network (FNN) or a Long Short-Term Memory (LSTM) network) and different concurrent knowledge features (unconditioned, Vector-to-go (VTG), or previous action and $t_{AS}$) on the concurrent Cartpole task, for every hyperparameter in a sweep, sorted in decreasing order. (b) Environment rewards achieved by DQN with an FNN and different frame-stacking and concurrent knowledge parameters on the concurrent Pendulum task, for every hyperparameter in a sweep, sorted in decreasing order. A larger area under the curve implies more robustness to hyperparameter choices. Enlarged figures are provided in Appendix A.5.

First, we illustrate the effects of a concurrent control paradigm on value-based DRL methods through an ablation study on concurrent versions of the standard Cartpole and Pendulum environments. We use the 3D MuJoCo-based implementations in the DeepMind Control Suite (Tassa et al., 2018) for both tasks. For the baseline learning algorithms, we use the TF-Agents (Guadarrama et al., 2018) implementations of a Deep Q-Network (DQN) agent, which utilizes a feedforward neural network (FNN), and a Deep Recurrent Q-Network agent, which utilizes a Long Short-Term Memory (LSTM) network. To approximate different difficulty levels of latency in concurrent environments, we utilize different parameter combinations for the action execution duration and the action selection duration ($t_{AS}$). The action execution duration is selected from {0ms, 5ms, 25ms, 50ms} once at environment initialization. $t_{AS}$ is selected from {0ms, 5ms, 10ms, 25ms, 50ms}, either once at environment initialization or repeatedly at every episode reset. In addition to environment parameters, we allow trials to vary across model parameters: the number of previous actions to store, the number of previous states to store, whether to use VTG, whether to use $t_{AS}$, the Q-network architecture, and the number of discretized actions. Further details are described in Appendix A.4.1.

To estimate the relative importance of different concurrent knowledge representations, we conduct an analysis of the sensitivity of each type of concurrent knowledge representation to combinations of the other hyperparameter values, shown in Figure 2a. While all combinations of concurrent knowledge representations increase learning performance over baselines that do not leverage this information, the clearest difference stems from including VTG. In Figure 2b we conduct a similar analysis on a Pendulum environment where $t_{AS}$ is fixed for every environment; thus, we do not focus on $t_{AS}$ for this analysis but instead compare the importance of VTG with frame-stacking previous actions and observations. While frame-stacking helps nominally, the majority of the performance increase results from utilizing VTG.

4.2 Concurrent QT-Opt on Large-Scale Robotic Grasping

(a) Simulation
(b) Real
Figure 3: An overview of the robotic grasping task. A static manipulator arm attempts to grasp objects placed in bins in front of it. In simulation, the objects are procedurally generated.

Next, we evaluate the scalability of our approach on a practical robotic grasping task. We simulate a 7 DoF arm with an over-the-shoulder camera, where a bin in front of the robot is filled with procedurally generated objects to be picked up. A binary reward is assigned if an object is lifted off the bin at the end of an episode. We train a policy with QT-Opt (Kalashnikov et al., 2018), a deep Q-learning method that utilizes the cross-entropy method (CEM) to support continuous actions. In the blocking mode, a displacement action is executed until completion: the robot uses a closed-loop controller to fully execute an action, decelerating and coming to rest before observing the next state. In the concurrent mode, an action is triggered and executed without waiting, which means that the next state is observed while the robot remains in motion. Further details of the algorithm and experimental setup are shown in Figure 3 and explained in Appendix A.4.2.

Table 1 summarizes the performance of blocking and concurrent modes, comparing unconditioned models against the concurrent knowledge models described in Section 3.6. Our results indicate that the VTG model acting in concurrent mode is able to recover the task performance of the blocking-execution unconditioned baseline, while the unconditioned baseline acting in concurrent mode suffers some performance loss. In addition to the success rate of the grasping policy, we also evaluate the speed and smoothness of the learned policy behavior. Concurrent knowledge models are able to learn faster trajectories: episode duration, which measures the total amount of wall time used for an episode, is reduced when comparing concurrent knowledge models with blocking unconditioned models, even those that utilize a shaped timestep penalty rewarding faster policies. When switching from blocking execution mode to concurrent execution mode, we see a significantly lower action completion, measured as the ratio of executed gripper displacement to commanded displacement, which, as expected, indicates a switch to a concurrent environment. The concurrent knowledge models have higher action completion than the unconditioned model in the concurrent environment, which suggests that they are able to utilize more efficient motions, resulting in smoother trajectories. The qualitative benefits of faster, smoother trajectories are drastically apparent when viewing video playback of learned policies.

Real robot results

In addition, we evaluate the qualitative policy behaviors of concurrent models compared to blocking models on a real-world robot grasping task, shown in Figure 3b. As seen in Table 2, the models achieve comparable grasp success, but the concurrent model is faster than the blocking model in terms of policy duration, which measures the total execution time of the policy (this excludes the infrastructure setup and teardown times accounted for in episode duration, which cannot be optimized with concurrent actions). In addition, the concurrent VTG model executes smoother and faster trajectories than the blocking unconditioned baseline, which is clear in video playback.

| Blocking Actions | Timestep Penalty | VTG | Previous Action | Grasp Success | Episode Duration | Action Completion |
| --- | --- | --- | --- | --- | --- | --- |
| Yes | No | No | No | … | … | … |
| Yes | Yes | No | No | … | … | … |
| No | No | No | No | … | … | … |
| No | Yes | No | No | … | … | … |
| No | Yes | Yes | No | … | … | … |
| No | Yes | No | Yes | … | … | … |
| No | Yes | Yes | Yes | … | … | … |

Table 1: Large-Scale Simulated Robotic Grasping Results
| Blocking Actions | VTG | Grasp Success | Policy Duration |
| --- | --- | --- | --- |
| Yes | No | … | … |
| No | Yes | … | … |

Table 2: Real-World Robotic Grasping Results.

5 Discussion and Future Work

We presented a theoretical framework for analyzing concurrent systems where an agent must “think while moving”. Viewing this formulation through the lens of continuous-time value-based reinforcement learning, we showed that when conditioned on concurrent knowledge about the time delay $t_{AS}$ and the previous action, the concurrent continuous-time and discrete-time Bellman operators remain contractions and thus maintain Q-learning convergence guarantees. While more information than $t_{AS}$ and the previous action may be helpful, we showed that $t_{AS}$ and the previous action (and different representations of this information) are the sole theoretical requirements for good learning performance. In addition, we introduced Vector-to-go (VTG), which incorporates the remaining portion of the previous action still to be executed, as an alternative representation of the information that the previous action and $t_{AS}$ contain.

Our theoretical findings are supported by experimental results with Q-learning models acting in simulated control tasks engineered to support concurrent action execution. We conducted large-scale ablation studies on toy concurrent 3D Cartpole and Pendulum environments, across model parameters as well as concurrent environment parameters. Our results indicate that VTG is the least hyperparameter-sensitive representation and is able to recover blocking learning performance in concurrent settings. We extended these results to a complex, large-scale simulated concurrent robotic grasping task, where we showed that the concurrent models were able to recover the success rate of the blocking-execution baseline model while acting faster. We analyzed the qualitative benefits of concurrent models through a real-world robotic grasping task, where we showed that a concurrent model with grasp success comparable to a blocking baseline learned smoother trajectories that were also faster.

An interesting topic to explore in future work is the possibility of increased data efficiency when training on off-policy data from various latency regimes. Another natural extension of this work is to evaluate DRL methods beyond value-based algorithms, such as on-policy learning and policy gradient approaches. Finally, concurrent methods may allow robotic control in dynamic environments where it is not possible for the robot to stop the environment before computing the action. In these scenarios, robots must truly think and act at the same time.


  • P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng (2006) An application of reinforcement learning to aerobatic helicopter flight.. In NIPS, B. Schölkopf, J. C. Platt, and T. Hofmann (Eds.), pp. 1–8. External Links: ISBN 0-262-19568-2, Link Cited by: §2.
  • A. Amiranashvili, A. Dosovitskiy, V. Koltun, and T. Brox (2018) Motion perception in reinforcement learning with dynamic objects. In CoRL, Proceedings of Machine Learning Research, Vol. 87, pp. 156–168. External Links: Link Cited by: §2.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. Note: cite arxiv:1606.01540 External Links: Link Cited by: §1.
  • R. Coulom (2002) Reinforcement learning using neural networks, with applications to motor control. Ph.D. Thesis, Institut National Polytechnique de Grenoble-INPG. Cited by: §2.
  • N. Cruz, K. Lobos-Tsunekawa, and J. R. del Solar (2017) Using convolutional neural networks in robots with limited computational resources: detecting nao robots while playing soccer. CoRR abs/1706.06702. External Links: Link Cited by: §2.
  • K. Doya (2000) Reinforcement learning in continuous time and space.. Neural Computation 12 (1), pp. 219–245. External Links: Link Cited by: §2, §3.3.
  • V. Firoiu, T. Ju, and J. Tenenbaum (2018) At Human Speed: Deep Reinforcement Learning with Action Delay. arXiv e-prints. External Links: 1810.07286 Cited by: §2.
  • N. Frémaux, H. Sprekeler, and W. Gerstner (2013) Reinforcement learning using a continuous time actor-critic framework with spiking neurons. PLoS computational biology 9, pp. e1003024. External Links: Document Cited by: §2.
  • S. Guadarrama, A. Korattikara, O. Ramirez, P. Castro, E. Holly, S. Fishman, K. Wang, E. Gonina, C. Harris, V. Vanhoucke, et al. (2018) TF-Agents: a library for reinforcement learning in TensorFlow. Cited by: §A.4.1, §4.1.
  • T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, and S. Levine (2018) Soft actor-critic algorithms and applications.. CoRR abs/1812.05905. External Links: Link Cited by: §2.
  • D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine (2018) QT-opt: scalable deep reinforcement learning for vision-based robotic manipulation.. CoRR abs/1806.10293. External Links: Link Cited by: §A.4.2, §A.4.2, §1, §2, §4.2.
  • H. J. Kappen (2005) Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment 2005 (11), pp. P11011–P11011. External Links: Document, Link Cited by: §2.
  • S. Li, S. Xiao, S. Zhu, N. Du, Y. Xie, and L. Song (2018) Learning temporal point processes via reinforcement learning. In Proceedings of the 32Nd International Conference on Neural Information Processing Systems, NIPS’18, USA, pp. 10804–10814. External Links: Link Cited by: §2.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. Note: cite arxiv:1312.5602. Comment: NIPS Deep Learning Workshop 2013. External Links: Link Cited by: §1.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. External Links: ISSN 00280836, Link Cited by: §1.
  • R. Munos and P. Bourgine (1998) Reinforcement learning for continuous stochastic control problems. In Advances in Neural Information Processing Systems 10, M. I. Jordan, M. J. Kearns, and S. A. Solla (Eds.), pp. 1029–1035. External Links: Link Cited by: §2.
  • A. Neitz, G. Parascandolo, S. Bauer, and B. Schölkopf (2018) Adaptive skip intervals: temporal abstraction for recurrent dynamical models. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 9816–9826. External Links: Link Cited by: §2.
  • OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Józefowicz, B. McGrew, J. W. Pachocki, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, and W. Zaremba (2018) Learning dexterous in-hand manipulation.. CoRR abs/1808.00177. External Links: Link Cited by: §1.
  • S. Ramstedt and C. Pal (2019) Real-time reinforcement learning. External Links: 1911.04448 Cited by: §2.
  • S. M. Ross, J. J. Kelly, R. J. Sullivan, W. J. Perry, D. Mercer, R. M. Davis, T. D. Washburn, E. V. Sager, J. B. Boyce, and V. L. Bristow (1996) Stochastic processes. Vol. 2, Wiley New York. Cited by: §3.3.
  • E. Schuitema, L. Busoniu, R. Babuka, and P. P. Jonker (2010) Control delay in reinforcement learning for real-time dynamic systems: a memoryless approach. 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3226–3231. Cited by: §2.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of go with deep neural networks and tree search. Nature 529, pp. 484–. External Links: Link Cited by: §1.
  • R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. The MIT Press. Note: Hardcover External Links: ISBN 0262193981, Link Cited by: §3.2.
  • C. Tallec, L. Blier, and Y. Ollivier (2019) Making Deep Q-learning Methods Robust to Time Discretization. arXiv e-prints. External Links: 1901.09732 Cited by: §2, §3.3.
  • J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke (2018) Sim-to-Real: Learning Agile Locomotion For Quadruped Robots. arXiv e-prints. External Links: 1804.10332 Cited by: §2.
  • Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. P. Lillicrap, and M. A. Riedmiller (2018) DeepMind control suite.. CoRR abs/1801.00690. External Links: Link Cited by: §A.4.1, §4.1.
  • E. Theodorou, J. Buchli, and S. Schaal (2010) Reinforcement learning of motor skills in high dimensions: a path integral approach. pp. 2397 – 2403. External Links: Document Cited by: §2.
  • U. Upadhyay, A. De, and M. Gomez-Rodriguez (2018) Deep reinforcement learning of marked temporal point processes. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, USA, pp. 3172–3182. External Links: Link Cited by: §2.
  • E. Vasilaki, N. Frémaux, R. Urbanczik, W. Senn, and W. Gerstner (2009) Spike-based reinforcement learning in continuous state and action space: when policy gradient methods fail. PLoS computational biology 5, pp. e1000586. External Links: Document Cited by: §2.
  • T. J. Walsh, A. Nouri, L. Li, and M. L. Littman (2007) Planning and learning in environments with delayed feedback.. In ECML, J. N. Kok, J. Koronacki, R. L. de Mántaras, S. Matwin, D. Mladenic, and A. Skowron (Eds.), Lecture Notes in Computer Science, Vol. 4701, pp. 442–453. External Links: ISBN 978-3-540-74957-8, Link Cited by: §2, §3.4.

Appendix A Appendix

a.1 Defining Blocking Bellman operators

As introduced in Section 3.5, we define a continuous-time Q-function estimator with concurrent actions.


We observe that the second term of this equation is itself a Q-function, evaluated at a later time. Since the future state, action, and reward values at that time are not known at the current time, we take the following expectation:


which indicates that the Q-function in this setting is not just the expected sum of discounted future rewards, but rather an expected future Q-function.

In order to show the discrete-time version of the problem, we parameterize the discrete-time concurrent Q-function as:


which, when the action selection latency is zero, corresponds to a synchronous environment.

Using this parameterization, we can rewrite the discrete-time Q-function with concurrent actions as:


a.2 Contraction Proofs for the Blocking Bellman operators

Proof of the Discrete-time Blocking Bellman Update

Lemma A.1.

The traditional Bellman operator is a contraction, i.e.:


where Q1 and Q2 are two arbitrary Q-functions and the discount factor satisfies 0 ≤ γ < 1.


In the original formulation, we can show that this is the case as follows:


with and . ∎
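For reference, the derivation is the standard sup-norm contraction argument (textbook material, not specific to the concurrent setting), with $\mathcal{T}$ denoting the Bellman optimality operator and discount factor $\gamma \in [0,1)$:

```latex
\begin{aligned}
\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_\infty
  &= \max_{s,a}\Big|\, \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s,a)}
     \big[ \max_{a'} Q_1(s',a') - \max_{a'} Q_2(s',a') \big] \Big| \\
  &\le \gamma\, \max_{s'} \big| \max_{a'} Q_1(s',a') - \max_{a'} Q_2(s',a') \big| \\
  &\le \gamma\, \max_{s',a'} \big| Q_1(s',a') - Q_2(s',a') \big|
   \;=\; \gamma\, \|Q_1 - Q_2\|_\infty .
\end{aligned}
```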

Similarly, we can show that the updated Bellman operators introduced in Section 3.5 are contractions as well.

Proof of Lemma 3.2


Proof of Lemma 3.1


To prove that the continuous-time Bellman operator is a contraction, we can follow the discrete-time proof, from which it follows:


a.3 Concurrent Knowledge Representation

Figure 4: The execution order of the different stages is shown relative to the sampling period as well as the latency. (a): In “blocking” environments, state capture and policy inference are assumed to be instantaneous. (b): In “concurrent” environments, state capture and policy inference are assumed to proceed concurrently with action execution.
Figure 5: Concurrent knowledge representations can be visualized through an example of a 2-D pointmass discrete-time toy task. Vector-to-go represents the remaining action that may be executed when the current state is observed. Previous action represents the full commanded action from the previous timestep.

We analyze three different representations of concurrent knowledge in discrete-time concurrent environments, described in Section 3.6. Previous action is the action that the agent executed at the previous timestep. Action selection time is a measure of how long action selection takes, which can be represented as either a categorical or continuous variable; in our experiments, which take advantage of a bounded latency regime, we normalize action selection time using these known bounds. Vector-to-go (VTG) combines the previous action and the action selection time by encoding the remaining portion of the previous action left to execute. See Figure 5 for a visual comparison.
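As a rough illustration of these three features for the 2-D pointmass example, a minimal sketch follows (all names, shapes, and the normalization scheme are hypothetical; the actual feature encoding may differ):

```python
import numpy as np

def concurrent_features(prev_action, action_completed, t_as, t_as_max):
    """Build the three concurrent-knowledge features for a 2-D pointmass.

    prev_action: the full action commanded at the previous timestep.
    action_completed: the portion of that action already executed when
        the current state is observed.
    t_as: measured action selection latency, normalized by a known upper
        bound t_as_max (the bounded-latency regime assumed above).
    """
    prev_action = np.asarray(prev_action, dtype=np.float64)
    # Vector-to-go: the remaining portion of the previous action.
    vtg = prev_action - np.asarray(action_completed, dtype=np.float64)
    t_norm = np.clip(t_as / t_as_max, 0.0, 1.0)
    return prev_action, np.array([t_norm]), vtg

# A previous command of (1, 0) that is 40% executed leaves VTG = (0.6, 0).
prev, t_norm, vtg = concurrent_features([1.0, 0.0], [0.4, 0.0],
                                        t_as=0.02, t_as_max=0.05)
```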

We note that the previous action is available across the vast majority of environments and is easy to obtain. Using the action selection time, which encompasses state capture, communication latency, and policy inference, relies on having some knowledge of the concurrent properties of the system. Calculating VTG requires access to some measure of action completion at the exact moment when the state is observed. When utilizing a first-order control action space, such as joint angle or desired pose, VTG is easily computable if proprioceptive state is measured and synchronized with the state observation. In these cases, VTG is an alternate representation of the same information encapsulated by the previous action and the current state.
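Under first-order (position) control, this computation reduces to a subtraction between the last commanded target and the proprioceptive measurement captured at observation time. A minimal sketch (function name and units hypothetical):

```python
import numpy as np

def vector_to_go(commanded_joint_angles, measured_joint_angles):
    """VTG under first-order control: the displacement remaining between
    the last commanded joint target and the joint positions measured at
    the instant the state observation is captured."""
    return (np.asarray(commanded_joint_angles, dtype=np.float64)
            - np.asarray(measured_joint_angles, dtype=np.float64))

# Commanded target 0.50 rad, measured 0.35 rad at observation time:
# 0.15 rad of the commanded motion is still left to execute.
remaining = vector_to_go([0.50], [0.35])
```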

a.4 Experiment Implementation Details

a.4.1 Cartpole and Pendulum Ablation Studies

Here, we describe the implementation details of the toy task Cartpole and Pendulum experiments in Section 4.1.

For the environments, we use the 3D MuJoCo implementations of the Cartpole-Swingup and Pendulum-Swingup tasks in DeepMind Control Suite (Tassa et al., 2018). We use discretized action spaces for first-order control of joint position actuators. For the observation space of both tasks, we use the default state space of ground truth positions and velocities.

For the baseline learning algorithms, we use the TensorFlow Agents (Guadarrama et al., 2018) implementations of a Deep Q-Network (DQN) agent, which utilizes a Feed-forward Neural Network (FNN), and a Deep Q-Recurrent Neural Network agent, which utilizes a Long Short-Term Memory (LSTM) network. Learning parameters such as learning_rate, lstm_size, and fc_layer_size were selected through hyperparameter sweeps.

To approximate different difficulty levels of latency in concurrent environments, we utilize different parameter combinations of action execution time and action selection time. The action execution time is selected from {0ms, 5ms, 25ms, 50ms} once at environment initialization. The action selection time is selected from {0ms, 5ms, 10ms, 25ms, 50ms}, either once at environment initialization or repeatedly at every episode reset. The selected action selection time is implemented in the environment as additional physics steps that update the system during simulated action selection.
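The simulated-latency mechanism described above, extra physics steps during which the previously commanded action keeps executing while the agent "thinks", can be sketched as a gym-style wrapper (a sketch under assumed interfaces, not the actual implementation):

```python
class ConcurrentLatencyWrapper:
    """Approximates action selection latency in a simulator by advancing
    extra physics steps before the newly commanded action takes effect.
    `env` is assumed to expose gym-style reset()/step(); all names are
    hypothetical."""

    def __init__(self, env, selection_steps):
        self.env = env
        self.selection_steps = selection_steps  # simulated latency, in physics steps

    def reset(self):
        return self.env.reset()

    def step(self, action, held_action):
        # While the agent selects `action`, the previously commanded
        # `held_action` continues executing for `selection_steps` steps.
        total_reward = 0.0
        for _ in range(self.selection_steps):
            obs, r, done, info = self.env.step(held_action)
            total_reward += r
            if done:
                return obs, total_reward, done, info
        obs, r, done, info = self.env.step(action)
        return obs, total_reward + r, done, info

# Tiny demo with a stub environment that counts physics steps (hypothetical).
class _CountingEnv:
    def __init__(self):
        self.steps = 0
    def reset(self):
        self.steps = 0
        return self.steps
    def step(self, action):
        self.steps += 1
        return self.steps, 1.0, False, {}

wrapped = ConcurrentLatencyWrapper(_CountingEnv(), selection_steps=3)
wrapped.reset()
obs, reward, done, info = wrapped.step(action=1.0, held_action=0.5)
```

With three latency steps, four physics steps elapse per agent step: three under the held action and one under the new command.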

Frame-stacking parameters affect the observation space by saving previous observations and actions. The number of previous actions to store and the number of previous observations to store are selected independently from a swept range. Concurrent knowledge parameters, as described in Section 4, include whether to use VTG and whether to use the action selection time; the previous action is already covered by the frame-stacking feature. Finally, the number of discrete actions used to discretize the continuous action space is also swept.
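A minimal sketch of the frame-stacking feature, zero-padded histories of the last few observations and actions flattened into one vector (all names and the flat-vector layout are hypothetical):

```python
from collections import deque
import numpy as np

class FrameActionStacker:
    """Stacks the last `n_obs` observations and `n_act` actions into one
    flat feature vector, zero-padded at episode start."""

    def __init__(self, obs_dim, act_dim, n_obs, n_act):
        self.obs_hist = deque([np.zeros(obs_dim)] * n_obs, maxlen=n_obs)
        self.act_hist = deque([np.zeros(act_dim)] * n_act, maxlen=n_act)

    def push(self, obs, action):
        # Oldest entries fall off the deque automatically (maxlen).
        self.obs_hist.append(np.asarray(obs, dtype=np.float64))
        self.act_hist.append(np.asarray(action, dtype=np.float64))
        return np.concatenate(list(self.obs_hist) + list(self.act_hist))

# Two stacked observations of dim 2 and two stacked actions of dim 1:
# the first push yields zero-padding for the missing history.
stacker = FrameActionStacker(obs_dim=2, act_dim=1, n_obs=2, n_act=2)
v = stacker.push(obs=[1.0, 2.0], action=[3.0])
```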

a.4.2 Large Scale Robotic Grasping

Simulated Environment

We simulate a 7 DoF arm with an over-the-shoulder camera (see Figure 3a). A bin in front of the robot is filled with procedurally generated objects to be picked up, and a sparse binary reward is assigned if an object is lifted off the bin at the end of an episode. States are represented in the form of RGB images, and actions are continuous Cartesian displacements of the gripper's 3D position and yaw. In addition, the policy commands discrete gripper open and close actions and may terminate an episode. In blocking mode, a displacement action is executed until completion: the robot uses a closed-loop controller to fully execute an action, decelerating and coming to rest before observing the next state. In concurrent mode, an action is triggered and executed without waiting, which means that the next state is observed while the robot remains in motion. It should be noted that in blocking mode, action completion is close to 100% unless the gripper's motion is blocked by contact with the environment or objects; this causes average blocking-mode action completion to be lower than 100%, as seen in Table 1.

Real Environment

Similar to the simulated setup, we use a 7 DoF robotic arm with an over-the-shoulder camera (see Figure 3b). The main difference in the physical setup is that objects are selected from a set of common household objects.


We train a policy with QT-Opt (Kalashnikov et al., 2018), a Deep Q-Learning method that utilizes the Cross-Entropy Method (CEM) to support continuous actions. A Convolutional Neural Network (CNN) is trained to learn the Q-function conditioned on an image input along with a CEM-sampled continuous control action. At policy inference time, the agent sends an image of the environment and batches of CEM-sampled actions to the CNN Q-network. The highest-scoring action is then used as the policy's selected action. Compared to the formulation in Kalashnikov et al. (2018), we also add a concurrent knowledge feature of VTG and/or previous action as additional input to the Q-network. Algorithm 1 shows the modified QT-Opt procedure.
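The CEM inner loop of this style of action selection can be sketched as follows (a simplified sketch that omits the image input and network details; parameter values are illustrative, not the paper's):

```python
import numpy as np

def cem_select_action(q_fn, act_dim, iterations=4, samples=64, elites=6, seed=0):
    """Cross-Entropy Method action selection: iteratively fit a Gaussian
    over actions to the top-scoring samples under the Q-function.
    `q_fn` maps a batch of actions (samples x act_dim) to scores."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(act_dim), np.ones(act_dim)
    for _ in range(iterations):
        actions = rng.normal(mean, std, size=(samples, act_dim))
        scores = q_fn(actions)
        # Refit the sampling distribution to the highest-scoring actions.
        elite = actions[np.argsort(scores)[-elites:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# With a toy Q-function peaked at a = (0.5, -0.5), CEM converges near it.
best = cem_select_action(
    lambda a: -((a - np.array([0.5, -0.5])) ** 2).sum(axis=1), act_dim=2)
```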

Initialize replay buffer D;
Initialize a random start state and receive the initial image;
Initialize concurrent knowledge features;
Initialize environment state;
Initialize the action-value function with random weights;
Initialize the target action-value function with the same weights;
while training do
       for t = 1, T do
             Select a random action with probability ε, else select the best action found by CEM;
             Execute the action in the environment; receive the next image, reward, and termination signal;
             Process the necessary concurrent knowledge features, such as VTG, the previous action, or the action selection time;
             Set the next state from the image and the concurrent knowledge features;
             Store the transition in D;
             if episode terminates then
                   Reset the environment to a random initialization state;
                   Reset the concurrent knowledge features to 0;
             end if
            Sample a batch of transitions from D;
             for each transition in batch do
                   if terminal transition then
                         Set the target value to the reward;
                   end if
                  Perform SGD on the Q-network weights with respect to the Bellman error;
             end for
            Update the target network parameters from the online network periodically;
       end for
end while
Algorithm 1 QT-Opt with Concurrent Knowledge

For simplicity, the algorithm is described as if run synchronously on a single machine. In practice, episode generation, Bellman updates, and Q-fitting are distributed across many machines and run asynchronously; refer to Kalashnikov et al. (2018) for more details. Standard DRL hyperparameters such as the random exploration probability (ε), reward discount (γ), and learning rate are tuned through a hyperparameter sweep. For the time-penalized baselines in Table 1, we manually tune a timestep penalty that returns a fixed negative reward at every timestep; empirically, we find that a small timestep penalty relative to the binary sparse reward encourages faster policies. For the non-penalized baselines, we set the timestep penalty to 0.
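The effect of the timestep penalty on the episode return can be illustrated with a short sketch (the penalty magnitude of -0.01 below is purely illustrative, not the tuned value):

```python
def penalized_reward(sparse_reward, timestep_penalty):
    """Apply a fixed per-step penalty to a sparse binary grasp reward,
    as in the time-penalized baselines."""
    return sparse_reward + timestep_penalty

# Every step pays the penalty, so a successful 10-step episode still nets
# a positive return when the penalty is small relative to the sparse reward.
episode = [0.0] * 9 + [1.0]  # sparse success on the final step
ret = sum(penalized_reward(r, -0.01) for r in episode)
```

Because the penalty accrues once per step, shorter successful episodes earn strictly higher returns, which is what pushes the policy toward faster behavior.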

a.5 Figures

See Figure 6 and Figure 7.

Figure 6: Environment rewards achieved by DQN with different network architectures [either a feed-forward network (FNN) or a Long Short-Term Memory (LSTM) network] and different concurrent knowledge features [unconditioned, vector-to-go (VTG), or previous action and action selection time] on the concurrent Cartpole task for every hyperparameter in a sweep, sorted in decreasing order. Providing the critic with VTG information leads to more robust performance across all hyperparameters. This figure is a larger version of Figure 1(a).
Figure 7: Environment rewards achieved by DQN with a FNN and different frame-stacking and concurrent knowledge parameters on the concurrent Pendulum task for every hyperparameter in a sweep, sorted in decreasing order.