Perception-Prediction-Reaction Agents for Deep Reinforcement Learning

by Adam Stooke, et al.
UC Berkeley

We introduce a new recurrent agent architecture and associated auxiliary losses which improve reinforcement learning in partially observable tasks requiring long-term memory. We employ a temporal hierarchy, using a slow-ticking recurrent core to allow information to flow more easily over long time spans, and three fast-ticking recurrent cores with connections designed to create an information asymmetry. The reaction core incorporates new observations with input from the slow core to produce the agent's policy; the perception core accesses only short-term observations and informs the slow core; lastly, the prediction core accesses only long-term memory. An auxiliary loss regularizes policies drawn from all three cores against each other, enacting the prior that the policy should be expressible from either recent or long-term memory. We present the resulting Perception-Prediction-Reaction (PPR) agent and demonstrate its improved performance over a strong LSTM-agent baseline in DMLab-30, particularly in tasks requiring long-term memory. We further show significant improvements in Capture the Flag, an environment requiring agents to acquire a complicated mixture of skills over long time scales. In a series of ablation experiments, we probe the importance of each component of the PPR agent, establishing that the entire, novel combination is necessary for this intriguing result.




1 Introduction

In the reinforcement learning (RL) problem, an agent is trained to solve an environment cast as a Markov decision process (MDP), specified as a tuple of states, actions, transition probabilities, and rewards: (S, A, P, R). The agent must learn which states and actions lead to the best rewards without prior knowledge of P. In many interesting RL problems, however, the agent receives an observation, o_t, which does not completely specify the state of the MDP at that time step, resulting in partial observability. Therefore, for partially observable Markov decision processes (POMDPs) (Astrom, 1965; Kaelbling et al., 1998), a focus of agent design is how to integrate the sequence of historical observations to best approximate the state and produce a policy π(a_t | o_1, …, o_t) to maximise future rewards. In deep RL, recurrent neural networks (RNNs) allow integrating observations over time with constant computational complexity per time step (Mnih et al., 2016). Agents based on traditional recurrent networks, e.g. LSTMs (Hochreiter and Schmidhuber, 1997), are widely effective, but they sometimes struggle to learn in more complex environments involving long-term memory retention.

In this paper, we introduce a recurrent agent architecture, and associated auxiliary losses (Jaderberg et al., 2016), which aim to improve deep RL in partially observable environments, particularly those requiring long-term memory. Specifically, we introduce a slowly ticking recurrent core to augment the standard fast ticking agent core, to allow a pathway for long-term memory storage and ease the backwards flow of gradients over long time spans. In addition, we construct two auxiliary policies, the first of which is required to use only current observations without long-term memory (perception), and the second which must only use the long-term memory without current observations (prediction). These auxiliary policies are trained jointly with the full-information policy (reaction), with all three policies regularizing each other and shaping the representation of the slow-ticking recurrent core.

We evaluate this agent, dubbed the Perception-Prediction-Reaction (PPR) agent, on a suite of experiments in 3D, partially observable environments (Beattie et al., 2016), and show consistent improvement compared to strong baselines, in particular on tasks requiring long-term memory. Ablation studies highlight the efficacy of each of the structural priors introduced in this paper. Finally, we apply this agent to the challenging DMLab-30 domain (one agent which must learn across 30 different POMDPs simultaneously) and Capture the Flag (Jaderberg et al., 2018), and show that even in these highly varied RL domains, the PPR agent can improve performance. It is striking that our simple regularising losses can have such a strong effect on learning dynamics. Without intending to replicate any biological system, we drew loose inspiration for our architecture from concepts in hierarchical sensorimotor control; we refer the interested reader to related articles such as (Loeb et al., 1999; Ting, 2007; Ting et al., 2015; Merel et al., 2019).

2 The Perception-Prediction-Reaction Agent

Figure 1: (a) A regular recurrent agent core (Mnih et al., 2016), which can integrate historical experience of observations using an RNN to produce a policy π. (b) A minimal temporally hierarchical agent core, featuring fast- and slow-ticking recurrences (Jaderberg et al., 2018). The slow-ticking core skips large portions of time, facilitating BPTT. (c) The PPR agent recurrent structure introduced in this paper, featuring a slow-ticking core and three fast-ticking cores. The perception and prediction fast-ticking branches have different information hidden relative to the reaction fast-ticking core, which has full information and produces the behaviour policy. All fast cores can share the same NN weights. An auxiliary loss encourages the fast-ticking branches to predict the same policy with different information, where KL_sym is the symmetrized Kullback-Leibler divergence.

This section introduces the structural and objective priors which constitute the PPR agent. We start with background on recurrent neural network-based agents for reinforcement learning, followed by discussion of a minimal hierarchical agent as an intermediate concept.

2.1 Reinforcement Learning and Recurrent Agents

In an MDP, the goal of the RL agent is to find a policy over actions, π(a|s), conditioned on the state, that maximizes the expected discounted sum of future rewards, E[Σ_t γ^t r_t]. The objective remains the same under partial observability (POMDPs), however the agent does not have access to the full state, instead receiving incomplete information gleaned from observations, o_t.

In POMDPs, recurrent agents can improve their internal understanding of the current state by carrying information from past observations, o_{<t}, in an internal state, h_t, to complement the current observation, o_t. The agent updates its internal state by h_{t+1} = f(h_t, o_t), and the remaining network layers receive h_{t+1} as conditioning to produce the policy; see Figure 1 (a). Training by backpropagation through time (BPTT) (Werbos, 1990; Rumelhart et al., 1988) allows rewards to influence the processing of observations and internal state over earlier time steps. Sophisticated recurrent functions, f, can extend the agent's ability to handle longer (and hence more difficult) sequences. LSTM-based agents have succeeded in a range of partially observable environments, including ones with rich visual observations (Mnih et al., 2016; Espeholt et al., 2018), but many such tasks remain difficult to master or are learned slowly with the traditional architecture.
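As an illustration of this update, the generic recurrent-agent step can be sketched as below. This is only a sketch, with a tanh RNN standing in for the LSTM and randomly initialized placeholder weights, not the paper's trained agent:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

class RecurrentAgent:
    """Minimal flat recurrent agent: h_{t+1} = f(h_t, o_t), pi = g(h_{t+1}).

    A tanh RNN stands in for the LSTM used in the paper; all weights are
    random placeholders, not trained parameters.
    """
    def __init__(self, obs_dim, hidden_dim, num_actions):
        self.W_h = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        self.W_o = rng.normal(0.0, 0.1, (hidden_dim, obs_dim))
        self.W_pi = rng.normal(0.0, 0.1, (num_actions, hidden_dim))
        self.h = np.zeros(hidden_dim)

    def step(self, obs):
        # integrate the new observation into the internal state
        self.h = np.tanh(self.W_h @ self.h + self.W_o @ obs)
        # read the policy out of the updated state
        return softmax(self.W_pi @ self.h)

agent = RecurrentAgent(obs_dim=4, hidden_dim=8, num_actions=3)
probs = agent.step(np.ones(4))
assert probs.shape == (3,) and abs(probs.sum() - 1.0) < 1e-9
```

Because the state h persists across calls to step, rewards at late time steps can, under BPTT, shape how early observations were processed.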

2.2 Minimal Temporally Hierarchical Agent

Temporal hierarchy promises to further improve the processing of long sequences by dividing responsibilities for short- and long-term memory over different recurrent cores, simplifying the roles of each. See Related Work for numerous examples of architectures with different hidden neurons operating at different time scales. As a special case of this concept applied to RL, we consider employing an additional recurrent unit operating at a rate slower than the MDP. This unit reduces the number of intermediate computations between distant time steps and allows error gradients to skip backwards through long segments of time.

An example of such a hierarchical agent is shown in Figure 1 (b): the slow core advances every τ time steps; during the interim it provides a fixed output to modulate the fast core; the fast core provides summary information to the slow core. As depicted, the recurrence equations could take the following form:

h^s_{t+1} = f^s(h^s_t, h^f_t)  if t mod τ = 0,  else h^s_{t+1} = h^s_t
h^f_{t+1} = f^f(0, h^s_{t+1}, o_t)  if t mod τ = 0,  else h^f_{t+1} = f^f(h^f_t, h^s_{t+1}, o_t)

where the superscripts s and f denote slow and fast cores, respectively, h^s_t and h^f_t are the recurrent states, o_t is the observation, and 0 denotes a vector of zeros (i.e. the initial recurrent state). The policy could generically depend on the recurrent states, π(a_t | h^s_{t+1}, h^f_{t+1}) (in our case π is an MLP). The internal state of the fast core is periodically reset to 0 so as to divide memory responsibilities by time-scale; all information originating prior to the most recent reset must have routed through the slow core.
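A minimal numerical sketch of this two-timescale recurrence follows; the tanh updates are placeholders standing in for the LSTM cores, and τ = 3 is chosen only for illustration:

```python
import numpy as np

TAU = 3  # slow-core interval (illustrative; the paper typically uses 16)

def fast_step(h_fast, h_slow, obs):
    # placeholder fast-core update f^f(h^f, h^s, o_t)
    return np.tanh(h_fast + h_slow + obs)

def slow_step(h_slow, h_fast):
    # placeholder slow-core update f^s(h^s, h^f)
    return np.tanh(h_slow + h_fast)

def rollout(observations, dim=4):
    h_slow = np.zeros(dim)
    h_fast = np.zeros(dim)
    states = []
    for t, obs in enumerate(observations):
        if t % TAU == 0:
            # the slow core ticks on the fast core's summary,
            # then the fast state is reset to zeros
            h_slow = slow_step(h_slow, h_fast)
            h_fast = np.zeros(dim)
        h_fast = fast_step(h_fast, h_slow, obs)
        states.append((h_slow.copy(), h_fast.copy()))
    return states

states = rollout([np.full(4, 0.1)] * 7)
assert np.allclose(states[1][0], states[2][0])  # slow state held between ticks
```

Note how a gradient flowing into the slow state at a late time step reaches early time steps through only a handful of slow-core transitions, rather than one transition per environment step.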

This minimal hierarchical agent does not on its own guarantee efficient training of long-term memory in a way that improves overall learning relative to the flat agent, see ablations in Figure 3 (b). Indeed, previous examples of temporally hierarchical agents (Vezhnevets et al., 2017; Jaderberg et al., 2018) introduce auxiliary objectives to best make use of similar hierarchical structures.

2.3 The PPR Agent

Our contribution is to construct a new hierarchical structure and associated auxiliary losses which enhance recurrent learning. It relies on architectural elements designed to create an information asymmetry (as in Galashov et al. (2018)) which permits leveraging certain policy priors to simultaneously (a) shape the representation of the slow-ticking core, (b) maximize information extracted from observations, and (c) balance the importance of both in the policy.

The PPR agent is depicted in Figure 1 (c), and we build its description starting from the minimal hierarchical agent. First, we eliminate the possibility of a trivial feed-through connection from fast to slow to fast. Rather than attempt a partial information bottleneck, we prevent the fast core (reaction), which receives input from the slow core, from passing any output back to the slow core. We introduce another fast-ticking core (perception) which feeds its output into the slow core but does not take input from it. Resetting the fast internal states at the interval τ forms branches in the graph. The reaction branch produces the agent's behavior policy by integrating new observations together with the slow core's output. The slow core assumes a central role in representing information originating prior to the most recent branch point, as it receives periodic, short-term summaries from the perception branch, which also integrates observations. This forms a Perception-Reaction agent without auxiliary losses, a baseline in our ablation experiments.

The final architectural element of the PPR agent is an additional fast recurrent core (prediction). It branches simultaneously with reaction and receives the slow core's output and possibly partial information from the observation. This creates an information asymmetry against the perception branch, which lacks long-term memory, and the fully-informed reaction branch. We can leverage this asymmetry to enhance recurrent learning. We do so by drawing auxiliary policies, π^per and π^pre, from the perception and prediction branches, respectively, to form the auxiliary loss:

L_aux = D(π, π^per) + D(π, π^pre) + D(π^per, π^pre)

where D is a statistical distance – we use the symmetrized Kullback-Leibler divergence. All three branches are regularized against each other; L_aux encourages their policies to agree as much as possible despite their differences in access to information. Rather than apply a loss directly on the recurrent state, which may assume somewhat arbitrary values, the policy distribution space offers grounding in the environment.
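This pairwise regularization can be sketched numerically as follows. Treating the branch policies as plain probability vectors and weighting the three terms equally are illustrative simplifications (in practice each branch-pair has its own tuned weight):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # KL(p || q) for discrete distributions, with a small epsilon for safety
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def kl_sym(p, q):
    # symmetrized Kullback-Leibler divergence
    return kl(p, q) + kl(q, p)

def ppr_aux_loss(pi_reaction, pi_perception, pi_prediction):
    # regularize all three branch policies pairwise against each other
    return (kl_sym(pi_reaction, pi_perception)
            + kl_sym(pi_reaction, pi_prediction)
            + kl_sym(pi_perception, pi_prediction))

identical = [0.2, 0.3, 0.5]
assert ppr_aux_loss(identical, identical, identical) < 1e-9
assert ppr_aux_loss([0.9, 0.05, 0.05], identical, identical) > 0.0
```

The loss is zero only when all three branches emit the same action distribution, which is what enacts the prior that the policy be expressible from either recent observations or long-term memory alone.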

The recurrence equations and policy structure of the PPR agent are summarized in Table 1. Loosely speaking, reaction is a short-term sensory-motor loop, perception a sensory loop, prediction a motor loop, and the slow core a long-term memory loop, all of which are decoupled in forward operation. The auxiliary divergence losses can be seen as imposing two priors on the fully informed reaction branch – that the policy should be expressible from only recent observations (perception) and from only long-term memory (prediction).

Table 1: Recurrence and policy equations of the PPR agent.


Although the contents of partial information remain flexible in our definition, deliberate selection of this quantity may be required to enable useful regularization. For visual environments, the recurrent A3C agent (Mnih et al., 2016) suggests a convenient delineation, which we use in our experiments: the partial observation consists of the previous action and reward in the environment, (a_{t-1}, r_{t-1}). This is compared to the full observation provided to the agent, which additionally includes the screen pixels, x_t. The actions provide critical information for the forward model of the prediction branch, and they hold natural appeal as information internal to the agent.

In practice, the PPR architecture is implemented as a self-contained recurrent neural network core, and training only requires an additional loss term computed on the current training batch, allowing the agent to be easily incorporated in most existing deep RL frameworks. In our experiments we found it possible to use the same recurrent network weights in all branches, as reflected in Table 1. Weight sharing limits the increase in recurrent parameters over the flat agent, using the same core size.[1]

[1] In our experiments, we found no effect in flat agents from increasing the number of LSTM parameters.
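The weight sharing can be sketched as below; a toy tanh core stands in for the LSTM used in the paper, and the branch inputs are arbitrary placeholders (in particular, summing the observation and slow-core output is only a stand-in for the real concatenated inputs):

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedFastCore:
    """One set of recurrent weights reused by the perception, reaction,
    and prediction branches (toy tanh core, not an LSTM)."""
    def __init__(self, dim):
        self.W = rng.normal(0.0, 0.1, (dim, 2 * dim))

    def __call__(self, h, x):
        # one recurrent step: new state from old state and input
        return np.tanh(self.W @ np.concatenate([h, x]))

dim = 8
core = SharedFastCore(dim)
zeros = np.zeros(dim)
obs, slow_out = np.ones(dim), 0.5 * np.ones(dim)

# each branch advances through the SAME parameters, on different inputs:
h_perception = core(zeros, obs)           # observations only
h_reaction = core(zeros, obs + slow_out)  # observations + slow-core output
h_prediction = core(zeros, slow_out)      # slow-core output, no pixels
assert h_perception.shape == h_reaction.shape == h_prediction.shape
```

The information asymmetry between branches thus comes entirely from the routing of inputs, not from separate parameters; the parameter count grows only with the added slow core.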

3 Related Work

Recurrent networks with multiple time scales have appeared in numerous forms for supervised learning of long sequences, with the intent of leveraging the prior that time-series data is hierarchical. In the Sequence Chunker (Schmidhuber, 1992), a high recurrent level receives a reduced description of the input history from a low level, and the high level operates only at time-steps when the low level is unable to adequately predict its inputs. Hierarchical recurrent networks (El Hihi and Bengio, 1996) were constructed with various fixed structures for multiple time scales. Hierarchical Multiscale RNNs (Chung et al., 2016) extended this idea to include learnable hierarchy by allowing layers in a stacked RNN to influence the temporal behavior of higher layers. They introduce three possible operations in their modified LSTM: FLUSH–feed output to the higher level and reset the recurrent state; UPDATE–receive input from the lower level and advance the recurrent state; and COPY–propagate the exact recurrent state. Our architecture can be understood in these terms, but we utilize a specific, new structure with multiple low levels assuming different roles, implemented by a fixed choice of when to perform each operation. Clockwork RNNs (CW-RNNs) (Koutník et al., 2014) and Phased LSTMs (Neil et al., 2016) perform hierarchical learning by using a range of fixed timescales for groupings of neurons within a recurrent layer. In contrast to these two methods, we construct a distinct routing of information between components (e.g., in CW-RNNs, information flows generically from slow to fast groups), and we require only two time scales to be effective.

Relative to vanilla RNNs, several works have improved learning on long sequences by introducing new recurrence formulas to address vanishing gradients or provide use of explicit memory. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) is a widely-used standard which we employ in all our experiments. More recent developments include: the Differentiable Neural Computer (Graves et al., 2016), an explicit memory-augmented architecture; Relational RNNs (Santoro et al., 2018), which supplement LSTMs with memory-interaction operations; plastic neural networks using Hebbian learning (Miconi et al., 2018); and the Gated Transformer-XL (GTrXL) (Parisotto et al., 2019), which adapted a purely self-attention based approach (Vaswani et al., 2017) for RL. The GTrXL agent was measured on a similar benchmark to ours (although using a more recent learning algorithm) and showed similar or better improvements over a 3-layer LSTM agent. However, this improvement came at the cost of orders of magnitude more parameters in their 12-layer self-attention architecture, and the other architectures likewise share the drawback of significantly increased computational burden over LSTMs. Still, any of them could be employed as the recurrent core within the PPR agent—future work could seek compounding gains.

In RL specifically, our work relates closely to the For-The-Win (FTW) agent of Jaderberg et al. (2018). The FTW agent features a slow-ticking and a fast-ticking core, similar to what is depicted in Figure 1 (b), and includes a prior to regularise the hidden state distribution between slow and fast cores via an auxiliary loss. Our work also builds on recent approaches to learning priors with information asymmetry for RL. In Galashov et al. (2018), a "default" policy is learned simultaneously with the agent's policy through a regularisation loss based on their KL divergence, which replaces entropy regularisation against the uniform prior. In their experiments, hiding certain input information from the default policy, such as past or current states, was beneficial to the agent's learning. Distral (Teh et al., 2017) promotes multitask learning by distillation and regularisation of each task policy against a shared, central policy.

Other works utilise a combination of memory modules and new learning algorithms for better learning through time (Hung et al., 2018; Wayne et al., 2018), and a wealth of previous work exists on more explicit hierarchical RL which often exploits temporal priors (Sutton et al., 1999; Heess et al., 2016; Vezhnevets et al., 2017). Unlike these methods, we impose minimal change on the RL algorithm, requiring only auxiliary losses computed using components of the same form as those already present in the standard deep RL agent.

4 Experiments

Figure 2: Learning curves of the PPR agent (blue) compared to the baseline recurrent agent (black) (Espeholt et al., 2018) on four representative DMLab tasks. The PPR agent can achieve higher scores and faster learning on long-term memory tasks (e.g. emstm_non_match, emstm_watermaze, nav_maze_random_goal_03), while not degrading in performance on more reactive tasks, such as lasertag (lt_hallway_slope). More levels can be found in the Appendix.

We conducted experiments seeking to answer the questions: (a) does the PPR hierarchy lead to improved learning relative to flat architectures, and if so, (b) which kinds of tasks is it most effective at accelerating, and (c) what are the effects of the different components of the architecture. We report here experiments on levels within the DMLab-30 suite (Beattie et al., 2016). It includes a collection of visually rich, 3D environments for a point-body agent with a discrete action space. The tasks vary in character across memory-, navigation-, and reactive agility-based challenges. Language-based tasks are also included. Next, we report the PPR agent's performance on the recent Capture the Flag environment, which combines elements of memory, navigation, reflex, and teamwork (Jaderberg et al., 2018). Lastly, we present an in-depth study isolating the prediction component of the agent in order to further clarify the effect of the combined architecture.

In our experiments, we used LSTM recurrent cores with hidden size 256 and shared weights among the three fast branches. We trained our agents and baseline using the V-Trace algorithm (Espeholt et al., 2018) on trajectory segments of length 100 agent time-steps, using action-repeat 4 in the environment. We introduced hyperparameters to weight each auxiliary loss term, one for each branch-pair, and included these in the set of hyperparameters tuned by population-based training (PBT) (Jaderberg et al., 2017). For visual levels, our convolutional network was a 15-layer residual network as in Espeholt et al. (2018), and our baselines all used the identical architecture except with a flat LSTM core for memory. We typically fixed the slow core interval, τ, to 16.

4.1 PPR Agents in DMLab

DMLab Individual Levels.

We tested PPR agents on 12 DMLab levels. For each level, we trained a PPR and a baseline agent for 2 billion environment frames. Figure 2 highlights results from four tasks. Compared to the baseline, the PPR agent learned significantly faster and reached higher scores in tasks requiring long-term memory. In all 12 levels we tested, the PPR agent achieved a score equal to or higher than the baseline's—full results are in the Appendix.

In emstm_non_match, the agent sees an object and must memorize it to later choose to collect any different object. The PPR agent demonstrated proper memorisation, scoring 65 average reward, whereas the baseline agent did not, scoring 35. In emstm_watermaze, the agent is rewarded for reaching an invisible platform in an empty room and can repeatedly visit it from random respawn locations within an episode. The second rise in learning corresponds to memorisation of the platform location and efficient navigation of return visits, which the PPR agent begins to do at roughly half the number of samples as the baseline. The level nav_maze_random_goal_03 is similar in terms of resets but takes place in a walled maze environment with a visible goal object. Here as well, the PPR agent exhibits significantly accelerated learning, surpassing the final score of the baseline using less than half as many samples.

In contrast, lt_hallway_slope is a laser tag level requiring quick reactions without reliance on long-term memory; the PPR agent is not expected to improve learning here. Significantly, the equal scores show that the hierarchy did not degrade reactive performance. While experimenting with agent architectures, it was a design challenge to increase performance on memory tasks without decreasing performance on reactive ones, the main difficulty being a learning mode in which the policy became less dynamic so as to be easier to predict. One effective way we found to mitigate this phenomenon is to apply the auxiliary loss to only a (random) subset of training batches (a technique found concurrently by Bansal et al. (2018)), permitting the policy to sometimes update toward pure reward-seeking behavior. Through experimentation, we also found that rescaling the auxiliary loss by a factor randomly sampled for each batch worked well. We used these techniques in all our experiments.
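This stochastic gating and rescaling of the auxiliary loss can be sketched as follows; the application probability and scale range here are illustrative placeholders, not the values tuned in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_aux_weight(apply_prob=0.5, low=0.1, high=1.0):
    """Per-batch multiplier for the auxiliary loss.

    With probability 1 - apply_prob the loss is skipped entirely, letting
    the policy update toward pure reward-seeking behavior; otherwise it is
    rescaled by a factor drawn uniformly at random. apply_prob, low, and
    high are illustrative values, not those used in the paper.
    """
    if rng.random() >= apply_prob:
        return 0.0
    return float(rng.uniform(low, high))

weights = [effective_aux_weight() for _ in range(1000)]
assert any(w == 0.0 for w in weights)  # some batches skip the loss entirely
assert all(0.0 <= w <= 1.0 for w in weights)
```

In a training loop, the total loss for a batch would then be the RL loss plus effective_aux_weight() times the auxiliary divergence term.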


We next tested the PPR agent on a multi-task learning problem—the entire DMLab-30 suite—to test whether benefits could extend across the range of tasks while using a single set of agent weights and hyperparameters for all levels. Indeed, the PPR agent outperformed the flat LSTM baseline, achieving a mean capped human-normalized score across levels of 72.0% (across 8 independent runs), compared to 64.3% with the baseline (Espeholt et al., 2018); see Figure 3. The Appendix contains per-level scores from these learning runs. This difference, while modest, is difficult to achieve over the highly tuned baseline agent and represents a significant improvement.

(a) (b) (c)
Figure 3: (a) Learning curves on the DMLab-30 task domain with the PPR agent (blue) and recurrent agent baseline (black) (Espeholt et al., 2018). The PPR agent consistently outperforms the IMPALA baseline (Espeholt et al., 2018) on this challenging domain. (b) Ablation study on losses. (c) Ablation study on time periods.


To determine the effects of individual components of the PPR agent, we returned to experimenting on individual DMLab levels. First, Figure 3 (c) shows results from emstm_watermaze for the slow core interval, τ, ranging from 2 to 32. A wide range of settings worked well. We also experimented with evolving τ using PBT but did not observe improved performance.

In a separate experiment, with fixed τ, we activated different combinations of the three PPR auxiliary loss terms. Using no auxiliary loss reverts to the bare Perception-Reaction architecture. Figure 3 (b) shows results, also on emstm_watermaze, with the full PPR agent performing best. The prediction branch, which is trained only via the auxiliary loss, is revealed to be crucial to the learning gains, as is inclusion of the behavior policy in the auxiliary loss. Although using two of the three auxiliary losses was sometimes effective in our experiments, we measured more consistent results with all three active.

4.2 PPR Agents in Capture the Flag

Figure 4: Win rates against bots (skill level 4) on Capture the Flag procedural levels, of baseline recurrent agent (blue), PR agent without auxiliary losses (green), and various settings of PPR agents.

The Capture the Flag environment is a first-person, 2-vs-2 multiplayer game based on the Quake III engine, developed in Jaderberg et al. (2018). The RL agent controls an individual player, and must learn to coordinate with a teammate to retrieve a flag from the opponent base, while the opponent team attempts to do the same. Players can “tag” opponents (as in laser tag), removing them from the game temporarily until they respawn at their base. Human-level performance by RL was first achieved in this game by Jaderberg et al. (2018). They trained agents from scratch using a combination of techniques including PBT, careful opponent selection for playing thousands of matches in parallel, and a temporally hierarchical agent architecture with an associated auxiliary loss and Differentiable Neural Computer recurrent cores (Graves et al., 2016). The PBT-evolved parameters included internal agent weightings for several possible reward events, to provide denser reward than only capturing a flag or winning/losing a match, which lasts 5 minutes. Together, these advancements comprised their For-The-Win (FTW) agent.

In our experiment, we used all components as in the FTW agent and training, except for the recurrent structure. Our baseline used the flat LSTM architecture, and the PPR agent used LSTM cores. Figure 4 shows win rates against bots, evaluated during training on the procedurally generated levels (see Appendix for bots of other skill levels). Without auxiliary losses, the Perception-Reaction hierarchical agent performed slightly above the baseline, reaching 62% and 50% win rates, respectively. By including the auxiliary losses, however, the full PPR agent dramatically accelerated learning and reached higher asymptotic performance, nearly 90% by 1.5B steps. Slow time scales of either 8 or 16 gave similar performance, showing low sensitivity to this hyperparameter. Using a reduced auxiliary loss rate (0.1 and 0.05 shown) further improved performance, with the best learning resulting from evolving the rate by PBT. In this experiment, the PPR agent demonstrated improved learning of the complicated mix of memory, navigation, and precision-control skills required to master this domain.

4.3 Flat, Prediction Agent (Ablation)

Figure 5: Prediction agent architecture and training loss.

In another ablation, we sought performance gains for a flat LSTM agent by training with the auxiliary regularisation loss of a prediction branch, Figure 5. During training we rolled out predictions up to 10 steps ahead, and for some agents we included training samples from the prediction policy. We then evaluated final agent performance under three different behavior schemes: i) the baseline, using only the reaction policy; ii) using the prediction policy at a fixed number of steps after branching; and iii) using the prediction policy along a branch from its starting point. Figure 6 (a) and (b) show evaluations using 3-step and 7-step predictions, respectively, using the same trained agents, in rat_goal_driven_large.[2] Using the branch-following scheme, the best 7-step prediction agents scored above 300, close to the baseline agent (around 340). In contrast, additional baseline agents we trained with frame-skip 32 performed significantly worse—score 50, Figure 6 (c)—despite using the same refresh rate for incorporating new observations into the policy.

[2] This environment is similar to nav_maze_random_goal_03 from above.
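The branch-following evaluation scheme can be sketched as below: the agent refreshes its state from a fresh observation only at branch points and acts open loop in between. The policy function here is a hypothetical uniform stand-in, not the trained prediction policy:

```python
import numpy as np

rng = np.random.default_rng(0)

def act_with_stale_state(policy_fn, observations, refresh_every):
    """Act with a policy whose internal state refreshes only every
    `refresh_every` steps, following the branch in between (scheme iii)."""
    actions = []
    state = None
    for t, obs in enumerate(observations):
        if t % refresh_every == 0:
            state = obs  # incorporate a fresh observation at the branch point
        probs = policy_fn(state, steps_since_refresh=t % refresh_every)
        actions.append(int(rng.choice(len(probs), p=probs)))
    return actions

def dummy_policy(state, steps_since_refresh):
    # placeholder: uniform over 3 actions regardless of inputs
    return np.ones(3) / 3.0

acts = act_with_stale_state(dummy_policy, [np.zeros(2)] * 10, refresh_every=3)
assert len(acts) == 10 and all(0 <= a <= 2 for a in acts)
```

The point of the ablation is that a well-trained prediction policy can keep scoring well under exactly this kind of delayed observation refresh, whereas simply slowing the observation rate (large frame-skip) cannot.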

These experiments show that the fast perception and control loops are essential, although they can operate more loosely coupled than in the baseline agent. Clearly, the auxiliary policy encodes future sequences of high-reward behavior, despite lacking access to input observations over similar time scales as used for the PPR agent’s hierarchy. Yet in no case did we observe any improvement in the base agent’s learning. Evidently, the full PPR agent is needed to accelerate learning.

(a) (b) (c)
Figure 6: Final evaluation scores for trained flat, prediction agent in rat_goal_driven_large. Various schemes used for drawing the behavior policy from the prediction auxiliary policy, many of which perform similarly to the baseline (reactive) agent. (a) Agents executing prediction policy up to 3 time steps without new observations. (b) The same trained agents, but evaluated up to 7 time steps without new observations. (c) Long action-repeat trained agent.

5 Conclusion

In this paper we introduced a new agent to learn in partially observable environments, the PPR agent, which incorporates a temporally hierarchical recurrent structure, as well as imposing priors on the behaviour policy to be both predictable from long-term memory only, and from current observations only. This agent was evaluated on a diverse set of 3D partially observable RL problems, and showed improved performance, in particular on tasks involving long-term memory. We ablated the various components of the agent, demonstrating the efficacy of each. We hope future work can build upon these ideas and continue exploring structural- and loss-based priors to further improve deep RL in partially observable environments.

Broader Impact

The immediate societal impacts of this work are limited, to the extent that the methods herein remain bound to RL tasks tailored for research in a virtual setting. Further in the future, the success of these and related methods at producing agents capable of reasoning over extended time periods could influence how decisions are made in a variety of systems, hopefully with the effect of improving efficiency. We advocate no particular applications in this research.

A more immediate impact is the environmental cost of running our experiments, primarily the atmospheric emissions of greenhouse gases due to electricity usage. Our experiments were of moderate but non-negligible scale, as our methods apply to sophisticated RL tasks utilizing CPU-based simulators and neural networks running on more energy-efficient GPU/TPU hardware. Our research aims to reduce sample complexity in RL, permitting fewer compute cycles to be used for future developments. Otherwise, we rely on broader efforts to mitigate effects at the data-center level, such as the use of non-emitting, renewable energy sources. Perhaps the greatest risk is that future RL techniques are capable enough to engender widespread adoption, but without significantly improved computational efficiency, resulting in a net increase in overall energy usage and emissions.

Adam Stooke gratefully acknowledges previous support from the Fannie and John Hertz Foundation, and the authors thank Simon Osindero, Sasha Vezhnevets, Nicolas Heess, Greg Wayne, and others for insightful research discussions. Compute resources were provided by DeepMind.

References
  • K. J. Astrom (1965) Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications 10 (1), pp. 174–205. Cited by: §1.
  • M. Bansal, A. Krizhevsky, and A. S. Ogale (2018) ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. CoRR abs/1812.03079. External Links: Link, 1812.03079 Cited by: §4.1.
  • C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen (2016) DeepMind lab. CoRR abs/1612.03801. External Links: Link, 1612.03801 Cited by: §1, §4.
  • J. Chung, S. Ahn, and Y. Bengio (2016) Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704. Cited by: §3.
  • S. El Hihi and Y. Bengio (1996) Hierarchical recurrent neural networks for long-term dependencies. In Proceedings of Annual Conference on Neural Information Processing Systems, pp. 493–499. Cited by: §3.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu (2018) IMPALA: scalable distributed deep-rl with importance weighted actor-learner architectures. CoRR abs/1802.01561. External Links: Link, 1802.01561 Cited by: Figure 7, Figure 8, §2.1, Figure 2, Figure 3, §4.1, §4.
  • A. Galashov, S. M. Jayakumar, L. Hasenclever, D. Tirumala, J. Schwarz, G. Desjardins, W. M. Czarnecki, Y. W. Teh, R. Pascanu, and N. Heess (2018) Information asymmetry in kl-regularized rl. Cited by: §2.3, §3.
  • A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. (2016) Hybrid computing using a neural network with dynamic external memory. Nature 538 (7626), pp. 471–476. Cited by: §3, §4.2.
  • N. Heess, G. Wayne, Y. Tassa, T. P. Lillicrap, M. A. Riedmiller, and D. Silver (2016) Learning and transfer of modulated locomotor controllers. CoRR abs/1610.05182. Cited by: §3.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1, §3.
  • C. Hung, T. Lillicrap, J. Abramson, Y. Wu, M. Mirza, F. Carnevale, A. Ahuja, and G. Wayne (2018) Optimizing agent behavior over long time scales by transporting value. arXiv preprint arXiv:1810.06721. Cited by: §3.
  • M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castaneda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, et al. (2018) Human-level performance in first-person multiplayer games with population-based deep reinforcement learning. arXiv preprint arXiv:1807.01281. Cited by: §1, Figure 1, §2.2, §3, §4.2, §4.
  • M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu (2017) Population based training of neural networks. CoRR abs/1711.09846. External Links: Link, 1711.09846 Cited by: §4.
  • M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu (2016) Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397. Cited by: §1.
  • L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artif. Intell. 101 (1-2), pp. 99–134. External Links: ISSN 0004-3702, Link, Document Cited by: §1.
  • J. Koutník, K. Greff, F. Gomez, and J. Schmidhuber (2014) A clockwork RNN. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pp. II–1863–II–1871. External Links: Link Cited by: §3.
  • G. E. Loeb, I. E. Brown, and E. J. Cheng (1999) A hierarchical foundation for models of sensorimotor control. Experimental brain research 126 (1), pp. 1–18. Cited by: §1.
  • J. Merel, M. Botvinick, and G. Wayne (2019) Hierarchical motor control in mammals and machines. Nature Communications 10 (1), pp. 1–12. Cited by: §1.
  • T. Miconi, J. Clune, and K. O. Stanley (2018) Differentiable plasticity: training plastic neural networks with backpropagation. CoRR abs/1804.02464. External Links: Link, 1804.02464 Cited by: §3.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §1, Figure 1, §2.1, §2.3.
  • D. Neil, M. Pfeiffer, and S. Liu (2016) Phased lstm: accelerating recurrent network training for long or event-based sequences. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, USA, pp. 3889–3897. External Links: ISBN 978-1-5108-3881-9, Link Cited by: §3.
  • E. Parisotto, H. F. Song, J. W. Rae, R. Pascanu, C. Gulcehre, S. M. Jayakumar, M. Jaderberg, R. L. Kaufman, A. Clark, S. Noury, M. M. Botvinick, N. Heess, and R. Hadsell (2019) Stabilizing transformers for reinforcement learning. External Links: 1910.06764 Cited by: §3.
  • D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1988) Learning internal representations by error propagation. Cited by: §2.1.
  • A. Santoro, R. Faulkner, D. Raposo, J. W. Rae, M. Chrzanowski, T. Weber, D. Wierstra, O. Vinyals, R. Pascanu, and T. P. Lillicrap (2018) Relational recurrent neural networks. CoRR abs/1806.01822. External Links: Link, 1806.01822 Cited by: §3.
  • J. Schmidhuber (1992) Learning complex, extended sequences using the principle of history compression. Neural Computation 4 (2), pp. 234–242. Cited by: §3.
  • R. S. Sutton, D. Precup, and S. P. Singh (1999) Between MDPs and Semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1-2), pp. 181–211. External Links: Document Cited by: §3.
  • Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu (2017) Distral: robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pp. 4496–4506. Cited by: §3.
  • L. H. Ting, H. J. Chiel, R. D. Trumbower, J. L. Allen, J. L. McKay, M. E. Hackney, and T. M. Kesar (2015) Neuromechanical principles underlying movement modularity and their implications for rehabilitation. Neuron 86 (1), pp. 38–54. Cited by: §1.
  • L. H. Ting (2007) Dimensional reduction in sensorimotor systems: a framework for understanding muscle coordination of posture. Progress in brain research 165, pp. 299–321. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.
  • A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu (2017) FeUdal networks for hierarchical reinforcement learning. In Proceedings of International Conference on Machine Learning, pp. 3540–3549. Cited by: §2.2, §3.
  • G. Wayne, C. Hung, D. Amos, M. Mirza, A. Ahuja, A. Grabska-Barwinska, J. W. Rae, P. Mirowski, J. Z. Leibo, A. Santoro, M. Gemici, M. Reynolds, T. Harley, J. Abramson, S. Mohamed, D. J. Rezende, D. Saxton, A. Cain, C. Hillier, D. Silver, K. Kavukcuoglu, M. Botvinick, D. Hassabis, and T. P. Lillicrap (2018) Unsupervised predictive memory in a goal-directed agent. CoRR abs/1803.10760. External Links: Link, 1803.10760 Cited by: §3.
  • P. J. Werbos (1990) Backpropagation through time: what it does and how to do it. Cited by: §2.1.

APPENDIX: Additional Learning Curves

Figure 7: Learning curves of the PPR agent (blue) compared to the baseline recurrent agent (black) (Espeholt et al., 2018) on various individual DMLab tasks.
Figure 8: Learning curves on the DMLab-30 task domain with the PPR agent (blue) and recurrent agent baseline (black) (Espeholt et al., 2018), separated by level. Shaded area shows standard error of the mean. The PPR agent consistently outperforms the baseline on this challenging domain.
Figure 9: Learning curves on Capture the Flag procedural levels, evaluated against bot skill levels of increasing difficulty. Various settings of the PPR agent outperform the PR agent (without auxiliary losses; green) and the baseline flat agent (green) in this multi-faceted domain. Top row: win rate; bottom row: win rate plus one-half draw rate.