Recurrent World Models Facilitate Policy Evolution

09/04/2018 ∙ by David Ha, et al. ∙ Google

A generative recurrent neural network is quickly trained in an unsupervised manner to model popular reinforcement learning environments through compressed spatio-temporal representations. The world model's extracted features are fed into compact and simple policies trained by evolution, achieving state of the art results in various environments. We also train our agent entirely inside of an environment generated by its own internal world model, and transfer this policy back into the actual environment. Interactive version of paper at




1 Introduction

Humans develop a mental model of the world based on what they are able to perceive with their limited senses, learning abstract representations of both spatial and temporal aspects of sensory inputs. For instance, we are able to observe a scene and remember an abstract description thereof facial_identity_primate_brain ; single_neuron_viz . Our decisions and actions are influenced by our internal predictive model. For example, what we perceive at any given moment seems to be governed by our predictions of the future primary_viz_cortex_past_present ; mt_motion . One way of understanding the predictive model inside our brains is that it might not simply be about predicting the future in general, but predicting future sensory data given our current motor actions Keller2012 ; Leinweber2017 . We are able to instinctively act on this predictive model and perform fast reflexive behaviours when we face danger survival_optimization , without the need to consciously plan out a course of action mt_motion .

For many reinforcement learning (RL) problems Kaelbling:96 ; sutton_barto ; wiering2012 , an artificial RL agent may also benefit from a predictive model (M) of the future Werbos87specifications ; sutton1990dyna (model-based RL). The backpropagation algorithm Linnainmaa:1970 ; Kelley:1960 ; werbos1982sensitivity can be used to train a large M in the form of a neural network (NN). In partially observable environments, we can implement M through a recurrent neural network (RNN) s05_making_the_world_differentiable ; s05a_cm ; s05b_rl ; Lin:93 to allow for better predictions based on memories of previous observation sequences.


Figure 1: We build probabilistic generative models of OpenAI Gym openai_gym environments. These models can mimic the actual environments (left). We test trained policies in the actual environments (right).

In fact, our M will be a large RNN that learns to predict the future given the past in an unsupervised manner. M’s internal representations of memories of past observations and actions are perceived and exploited by another NN called the controller (C) which learns through RL to perform some task without a teacher. A small and simple C limits C’s credit assignment problem to a comparatively small search space, without sacrificing the capacity and expressiveness of the large and complex M. We combine several key concepts from a series of papers from 1990–2015 on RNN-based world models and controllers s05_making_the_world_differentiable ; s05a_cm ; s05b_rl ; s05c_boredom ; learning_to_think with more recent tools from probabilistic modelling, and present a simplified approach to test some of those key concepts in modern RL environments openai_gym . Experiments show that our approach can be used to solve a challenging race car navigation from pixels task that previously has not been solved using more traditional methods.

Most existing model-based RL approaches learn a model of the RL environment, but still train on the actual environment. Here, we also explore fully replacing an actual RL environment with a generated one, training our agent’s controller C only inside of the environment generated by its own internal world model M, and transfer this policy back into the actual environment.

To overcome the problem of an agent exploiting imperfections of the generated environments, we adjust a temperature parameter of M to control the amount of uncertainty of the generated environments. We train C inside of a noisier and more uncertain version of its generated environment, and demonstrate that this approach helps prevent C from taking advantage of the imperfections of M. We will also discuss other related works in the model-based RL literature that share similar ideas of learning a dynamics model and training an agent using this model.

2 Agent Model

Our simple model is inspired by our own cognitive system. Our agent has a visual sensory component V that compresses what it sees into a small representative code. It also has a memory component M that makes predictions about future codes based on historical information. Finally, our agent has a decision-making component C that decides what actions to take based only on the representations created by its vision and memory components.


Figure 2: Flow diagram showing how V, M, and C interact with the environment (left). Pseudocode for how our agent model is used in the OpenAI Gym openai_gym environment (right).

The environment provides our agent with a high dimensional input observation at each time step. This input is usually a 2D image frame that is part of a video sequence. The role of V is to learn an abstract, compressed representation of each observed input frame. Here, we use a Variational Autoencoder (VAE) vae ; vae_dm as V to compress each image frame into a latent vector $z$.


While V’s role is to compress what the agent sees at each time frame, we also want to compress what happens over time. The RNN M serves as a predictive model of the future $z$ vectors that V is expected to produce. Since many complex environments are stochastic in nature, we train our RNN to output a probability density function $p(z)$ instead of a deterministic prediction of $z$.

In our approach, we approximate $p(z)$ as a mixture of Gaussian distributions, and train M to output the probability distribution of the next latent vector $z_{t+1}$ given the current and past information made available to it. More specifically, the RNN will model $P(z_{t+1} \mid a_t, z_t, h_t)$, where $a_t$ is the action taken at time $t$ and $h_t$ is the hidden state of the RNN at time $t$. During sampling, we can adjust a temperature parameter $\tau$ to control model uncertainty, as done in previous work sketchrnn . We will find adjusting $\tau$ to be useful for training our controller later on. This approach is known as a Mixture Density Network bishop combined with an RNN (MDN-RNN) graves_rnn , and has been applied in the past for sequence generation problems such as generating handwriting graves_rnn and sketches sketchrnn .
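To make the role of the temperature concrete, the following is a minimal sketch of sampling one scalar latent dimension from a Gaussian mixture. It assumes the convention, as we understand it from the sketch-rnn line of work, of dividing the mixture logits by $\tau$ and scaling each standard deviation by $\sqrt{\tau}$; the function names and the list-based interface are hypothetical and not taken from the paper.

```python
import math
import random


def adjust_mixture_weights(logits, tau):
    """Softmax over mixture logits after dividing them by the temperature tau."""
    scaled = [logit / tau for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_latent(logits, means, stds, tau=1.0, rng=random):
    """Sample one latent value from a temperature-adjusted Gaussian mixture.

    Assumption: logits are divided by tau and each std is scaled by sqrt(tau).
    """
    weights = adjust_mixture_weights(logits, tau)
    k = rng.choices(range(len(weights)), weights=weights)[0]  # pick a mode
    return rng.gauss(means[k], stds[k] * math.sqrt(tau))


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_next_latent([0.0, 0.0], [-1.0, 1.0], [0.1, 0.1], tau=1.15, rng=rng))
```

Under this assumption, a larger $\tau$ flattens the mixture weights and widens each component, while a small $\tau$ concentrates the samples on the dominant mode.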

C is responsible for determining the course of actions to take in order to maximize the expected cumulative reward of the agent during a rollout of the environment. In our experiments, we deliberately make C as simple and small as possible, and train it separately from V and M, so that most of our agent’s complexity resides in V and M. C is a simple single layer linear model that maps $z_t$ and $h_t$ directly to action $a_t$ at each time step: $a_t = W_c [z_t \; h_t] + b_c$. In this linear model, $W_c$ and $b_c$ are the parameters that map the concatenated input vector $[z_t \; h_t]$ to the output action vector $a_t$.
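The linear controller can be sketched in a few lines. This is an illustrative version using plain Python lists; the function name and argument layout are hypothetical, with the weight matrix stored as one row per action dimension.

```python
def controller_action(z, h, W_c, b_c):
    """Compute the action W_c [z h] + b_c for a single time step.

    z, h : lists of floats (latent vector and RNN hidden state)
    W_c  : list of rows, one row per action dimension
    b_c  : list of biases, one per action dimension
    """
    x = list(z) + list(h)  # concatenated input vector [z h]
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_c, b_c)]


if __name__ == "__main__":
    # Toy sizes: 2 latent dimensions, 1 hidden dimension, 2 action dimensions.
    print(controller_action([1.0, 2.0], [3.0], [[1, 0, 0], [0, 1, 1]], [0.5, -1.0]))
```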

This minimal design for C also offers important practical benefits. Advances in deep learning provided us with the tools to train large, sophisticated models efficiently, provided we can define a well-behaved, differentiable loss function. V and M are designed to be trained efficiently with the backpropagation algorithm using modern GPU accelerators, so we would like most of the model’s complexity and parameters to reside in V and M. The number of parameters of C, a linear model, is minimal in comparison. This choice allows us to explore more unconventional ways to train C – for example, even using evolution strategies (ES) Rechenberg1973 ; Schwefel1977 to tackle more challenging RL tasks where the credit assignment problem is difficult.

To optimize the parameters of C, we chose the Covariance-Matrix Adaptation Evolution Strategy (CMA-ES) cmaes ; cmaes_original as our optimization algorithm since it is known to work well for solution spaces of up to a few thousand parameters. We evolve parameters of C on a single machine with multiple CPU cores running multiple rollouts of the environment in parallel. For more information about the models, training procedures, and experiment configurations, please see the Supplementary Materials.
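To illustrate the general shape of evolving a small parameter vector from rollout returns alone, here is a deliberately simplified evolution strategy. It is not CMA-ES: it keeps a fixed step size and no covariance matrix, and only shows the sample, evaluate, select, and recombine loop. The function name, population size, and elite fraction are assumptions chosen for illustration.

```python
import random


def simple_es(fitness, num_params, population=16, sigma=0.1, generations=50, seed=0):
    """Maximize fitness(params) with a basic fixed-step evolution strategy."""
    rng = random.Random(seed)
    mean = [0.0] * num_params
    elite = max(1, population // 4)  # number of top candidates to keep
    for _ in range(generations):
        candidates = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
            for _ in range(population)
        ]
        # Each fitness call is independent (one rollout per candidate),
        # so these evaluations could be distributed over CPU workers.
        ranked = sorted(candidates, key=fitness, reverse=True)
        best = ranked[:elite]
        # Recombine: the new mean is the average of the elite candidates.
        mean = [sum(c[i] for c in best) / elite for i in range(num_params)]
    return mean


if __name__ == "__main__":
    # Toy objective with its maximum at (1, 1), standing in for a rollout return.
    toy = lambda p: -sum((x - 1.0) ** 2 for x in p)
    print(simple_es(toy, num_params=2))
```

The optimizer only needs one scalar per candidate, which is why the cumulative reward of a rollout is a sufficient training signal here.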

3 Car Racing Experiment: World Model for Feature Extraction

In this section, we describe how we can train the Agent model described earlier to solve a car racing task. To our knowledge, our agent is the first known to solve this task.[1]

[1] We find this task interesting because although it is not difficult to train an agent to wobble around randomly generated tracks and obtain a mediocre score, CarRacing-v0 defines solving as getting an average reward of 900 over 100 consecutive trials, which means the agent can only afford very few driving mistakes.

Frame compressor V and predictive model M can help us extract useful representations of space and time. By using these features as inputs of C, we can train a compact C to perform a continuous control task, such as learning to drive from pixel inputs for a top-down car racing environment called CarRacing-v0 carracing_v0 . In this environment, the tracks are randomly generated for each trial, and our agent is rewarded for visiting as many tiles as possible in the least amount of time. The agent controls three continuous actions: steering left/right, acceleration, and brake.

  1. Collect 10,000 rollouts from a random policy.
  2. Train VAE (V) to encode frames into $z$.
  3. Train MDN-RNN (M) to model $P(z_{t+1} \mid a_t, z_t, h_t)$.
  4. Evolve controller (C) to maximize the expected cumulative reward of a rollout.
Algorithm 1 Training procedure in our experiments.
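The quantity maximized in step 4, the cumulative reward of one rollout, can be sketched as a loop over V, C, and M. The callables `encode`, `act`, and `rnn_step` are hypothetical stand-ins for the trained components, the environment is assumed to follow the classic Gym convention of `step` returning four values, and `CountdownEnv` is a toy environment included only so the sketch runs.

```python
def rollout(env, encode, act, rnn_step, initial_hidden, max_steps=1000):
    """Run one episode and return the cumulative reward.

    env      : object with reset() -> obs and step(a) -> (obs, reward, done, info)
    encode   : stand-in for V, maps an observation to a latent vector z
    act      : stand-in for C, maps (z, h) to an action a
    rnn_step : stand-in for M, maps (a, z, h) to the next hidden state
    """
    obs = env.reset()
    h = initial_hidden
    total = 0.0
    for _ in range(max_steps):
        z = encode(obs)            # compress the current frame
        a = act(z, h)              # choose an action from z and h
        obs, reward, done, _ = env.step(a)
        total += reward
        h = rnn_step(a, z, h)      # update the memory of M
        if done:
            break
    return total


class CountdownEnv:
    """Toy stand-in environment: reward 1 per step, ends after `length` steps."""

    def __init__(self, length=5):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        return [float(self.t)], 1.0, self.t >= self.length, {}


if __name__ == "__main__":
    print(rollout(CountdownEnv(5), lambda o: o, lambda z, h: 0.0,
                  lambda a, z, h: h, initial_hidden=[0.0]))
```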

To train V, we first collect a dataset of 10k random rollouts of the environment. We first have an agent act randomly to explore the environment multiple times, and record the random actions taken and the resulting observations from the environment. We use this dataset to train our VAE to encode each frame into a low dimensional latent vector $z$ by minimizing the difference between a given frame and the reconstructed version of the frame produced by the decoder from $z$. We can now use our trained V to pre-process each frame at time $t$ into $z_t$ to train our M. Using this pre-processed data, along with the recorded random actions taken, our MDN-RNN can now be trained to model $P(z_{t+1} \mid a_t, z_t, h_t)$ as a mixture of Gaussians.[2]

[2] Although in principle, we can train V and M together in an end-to-end manner, we found that training each separately is more practical, achieves satisfactory results, and does not require exhaustive hyperparameter tuning. As images are not required to train M on its own, we can even train on large batches of long sequences of latent vectors encoding the entire 1000 frames of an episode to capture longer term dependencies, on a single GPU.
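As a rough illustration of the objective V is trained on, the sketch below computes a standard VAE loss for one frame: a reconstruction term plus the KL divergence between a diagonal Gaussian posterior and a standard normal prior. The squared-error reconstruction term, the equal weighting of the two terms, and the function names are assumptions made for this example; the text above only specifies that the difference between a frame and its reconstruction is minimized.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def vae_loss(frame, reconstruction, mu, log_var):
    """Squared reconstruction error plus KL(q(z|x) || N(0, I)).

    frame, reconstruction : flattened pixel values
    mu, log_var           : parameters of the diagonal Gaussian posterior
    """
    recon = sum((x - y) ** 2 for x, y in zip(frame, reconstruction))
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + kl


if __name__ == "__main__":
    z = reparameterize([0.0, 0.0], [0.0, 0.0], rng=random.Random(0))
    print(z, vae_loss([0.0], [1.0], [1.0], [0.0]))
```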

In this experiment, V and M have no knowledge about the actual reward signals from the environment. Their task is simply to compress and predict the sequence of image frames observed. Only C has access to the reward information from the environment. Since there are a mere 867 parameters inside the linear C, evolutionary algorithms such as CMA-ES are well suited for this optimization task.
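The count of 867 parameters is consistent with a linear map from a 32-dimensional $z_t$ and a 256-dimensional $h_t$ to the three continuous actions, plus one bias per action. These two dimensions are an assumption here, since this section does not state them; the arithmetic is shown only as a sanity check.

```latex
% Assumed sizes: N_z = 32 (latent), N_h = 256 (hidden), N_a = 3 (actions).
\[
  |W_c| + |b_c| = N_a (N_z + N_h) + N_a = 3 \times (32 + 256) + 3 = 867
\]
```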

3.1 Experiment Results

V without M

Training an agent to drive is not a difficult task if we have a good representation of the observation. Previous works browser_car ; keras_car have shown that with a good set of hand-engineered information about the observation, such as LIDAR information, angles, positions and velocities, one can easily train a small feed-forward network to take this hand-engineered input and output a satisfactory navigation policy. For this reason, we first want to test our agent by handicapping C to only have access to V but not M, so we define our controller as $a_t = W_c z_t + b_c$.

Although the agent is still able to navigate the race track in this setting, we notice it wobbles around and misses the tracks on sharper corners, e.g., see Figure 1 (right). This handicapped agent achieved an average score of 632 ± 251, in line with the performance of other agents on OpenAI Gym’s leaderboard carracing_v0 and traditional Deep RL methods such as A3C carracing_cs221 ; carracing_cs234 . Adding a hidden layer to C’s policy network helps to improve the results to 788 ± 141, but not enough to solve this environment.

Table 1: CarRacing-v0 results over 100 trials.
Method                              Average Score
DQN dqn_racecar                     343 ± 18
A3C (continuous) carracing_cs234    591 ± 45
A3C (discrete) carracing_cs221      652 ± 10
Gym Leader carracing_v0             838 ± 11
V model                             632 ± 251
V model with hidden layer           788 ± 141
Full World Model                    906 ± 21

Table 2: DoomTakeCover-v0 results, varying $\tau$.
Temperature $\tau$      Virtual Score    Actual Score
0.10                    2086 ± 140       193 ± 58
0.50                    2060 ± 277       196 ± 50
1.00                    1145 ± 690       868 ± 511
1.15                    918 ± 546        1092 ± 556
1.30                    732 ± 269        753 ± 139
Random Policy           N/A
Gym Leader takecover    N/A              820 ± 58

Full World Model (V and M)

The representation $z_t$ provided by V only captures a representation at a moment in time and does not have much predictive power. In contrast, M is trained to do one thing, and to do it really well, which is to predict $z_{t+1}$. Since M’s prediction of $z_{t+1}$ is produced from the RNN’s hidden state $h_t$ at time $t$, $h_t$ is a good candidate for a feature vector we can give to our agent. Combining $z_t$ with $h_t$ gives C a good representation of both the current observation, and what to expect in the future.

We see that allowing the agent to access both $z_t$ and $h_t$ greatly improves its driving capability. The driving is more stable, and the agent is able to seemingly attack the sharp corners effectively. Furthermore, we see that in making these fast reflexive driving decisions during a car race, the agent does not need to plan ahead and roll out hypothetical scenarios of the future. Since $h_t$ contains information about the probability distribution of the future, the agent can just re-use the RNN’s internal representation instinctively to guide its action decisions. Like a Formula One driver or a baseball player hitting a fastball mt_motion , the agent can instinctively predict when and where to navigate in the heat of the moment.

Our agent is able to achieve a score of 906 ± 21, effectively solving the task and obtaining new state of the art results. Previous attempts carracing_cs221 ; carracing_cs234 using Deep RL methods obtained average scores in the 591–652 range, and the best reported solution on the leaderboard obtained an average score of 838 ± 11. Traditional Deep RL methods often require pre-processing of each frame, such as employing edge-detection carracing_cs234 , in addition to stacking a few recent frames carracing_cs221 ; carracing_cs234 into the input. In contrast, our agent’s V and M take in a stream of raw RGB pixel images and directly learn a spatio-temporal representation. To our knowledge, our method is the first reported solution to solve this task.

Since our agent’s world model is able to model the future, we can use it to come up with hypothetical car racing scenarios on its own. We can use it to produce the probability distribution of $z_{t+1}$ given the current states, sample a $z_{t+1}$ and use this sample as the real observation. We can put our trained C back into this generated environment. Figure 1 (left) shows a screenshot of the generated car racing environment. The interactive version of this work includes a demo of the generated environments.

4 VizDoom Experiment: Learning Inside of a Generated Environment

We have just seen that a policy learned inside of the real environment appears to somewhat function inside of the generated environment. This raises the question – can we train our agent to learn inside of its own generated environment, and transfer this policy back to the actual environment?

If our world model is sufficiently accurate for its purpose, and complete enough for the problem at hand, we should be able to substitute the actual environment with this world model. After all, our agent does not directly observe the reality, but merely sees what the world model lets it see. In this experiment, we train an agent inside the environment generated by its world model trained to mimic a VizDoom vizdoom environment. In DoomTakeCover-v0 takecover , the agent must learn to avoid fireballs shot by monsters from the other side of the room with the sole intent of killing the agent. The cumulative reward is defined to be the number of time steps the agent manages to stay alive during a rollout. Each rollout of the environment runs for a maximum of 2100 time steps, and the task is considered solved if the average survival time over 100 consecutive rollouts is greater than 750 time steps.
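The solving criterion can be written as a small check over a list of per-rollout returns. This is an illustrative helper with a hypothetical name; it treats the task as solved if any window of 100 consecutive rollouts averages strictly above the threshold of 750 time steps.

```python
def is_solved(returns, threshold=750.0, window=100):
    """Return True if some `window` consecutive rollouts average above `threshold`."""
    if len(returns) < window:
        return False
    running = sum(returns[:window])
    if running / window > threshold:
        return True
    for i in range(window, len(returns)):
        # Slide the window by one rollout: add the newest, drop the oldest.
        running += returns[i] - returns[i - window]
        if running / window > threshold:
            return True
    return False


if __name__ == "__main__":
    print(is_solved([800] * 100), is_solved([750] * 100))
```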

4.1 Experiment Setup

The setup of our VizDoom experiment is largely the same as the Car Racing task, except for a few key differences. In the Car Racing task, M is only trained to model the next $z_t$. Since we want to build a world model we can train our agent in, our M model here will also predict whether the agent dies in the next frame (as a binary event $done_t$), in addition to the next frame $z_t$.

Since M can predict the $done$ state in addition to the next observation, we now have all of the ingredients needed to make a full RL environment to mimic DoomTakeCover-v0 takecover . We first build an OpenAI Gym environment interface by wrapping a gym.Env openai_gym interface over our M as if it were a real Gym environment, and then train our agent inside of this virtual environment instead of using the actual environment. Thus in our simulation, we do not need the V model to encode any real pixel frames during the generation process, so our agent trains entirely in a more efficient latent space environment. Both virtual and actual environments share an identical interface, so after the agent learns a satisfactory policy inside of the virtual environment, we can easily deploy this policy back into the actual environment to see how well the policy transfers over.
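A minimal sketch of such a wrapper is shown below. It mimics the classic Gym `reset`/`step` interface by duck typing rather than subclassing gym.Env, so that it runs without the library installed. The class name, the `model_step` signature, and the fixed initial state are hypothetical; in particular, `model_step` stands in for sampling the next latent vector and the death event from M.

```python
import random


class LatentDoomEnv:
    """Gym-style environment whose dynamics come from a learned model, not the game."""

    def __init__(self, model_step, initial_z, initial_h, max_steps=2100, seed=None):
        # model_step(z, a, h, rng) -> (next_z, died, next_h) is a stand-in for M.
        self.model_step = model_step
        self.initial_z = list(initial_z)
        self.initial_h = list(initial_h)
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.z, self.h, self.t = list(initial_z), list(initial_h), 0

    def reset(self):
        """Restart the episode and return the initial latent observation."""
        self.z, self.h, self.t = list(self.initial_z), list(self.initial_h), 0
        return self.z

    def step(self, action):
        """Advance the model by one step; the observation is a latent vector."""
        self.z, died, self.h = self.model_step(self.z, action, self.h, self.rng)
        self.t += 1
        done = bool(died) or self.t >= self.max_steps
        reward = 1.0  # assumption: one unit of reward per time step survived
        return self.z, reward, done, {}


if __name__ == "__main__":
    # Toy model: the latent drifts by the action and the agent "dies" at z >= 2.
    toy_model = lambda z, a, h, rng: ([z[0] + a], z[0] + a >= 2, h)
    env = LatentDoomEnv(toy_model, [0.0], [0.0], seed=0)
    env.reset()
    print(env.step(1), env.step(1))
```

Because no pixel frames are rendered or encoded, each step costs only one forward pass of the model, and the hidden state is available on the environment object as `h` for a controller that uses it.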

Here, our RNN-based world model is trained to mimic a complete game environment designed by human programmers. By learning only from raw image data collected from random episodes, it learns how to simulate the essential aspects of the game, such as the game logic, enemy behaviour, physics, and also the 3D graphics rendering. We can even play inside of this generated environment.

Unlike the actual game environment, however, we note that it is possible to add extra uncertainty into the virtual environment, thus making the game more challenging in the generated environment. We can do this by increasing the temperature parameter $\tau$ during the sampling process of $z_{t+1}$. By increasing the uncertainty, our generated environment becomes more difficult compared to the actual environment. The fireballs may move more randomly in a less predictable path compared to the actual game. Sometimes the agent may even die due to sheer misfortune, without explanation.

After training, our controller learns to navigate around the virtual environment and escape from deadly fireballs launched by monsters generated by M. Our agent achieved an average score of 918 time steps in the virtual environment. We then took the agent trained inside of the virtual environment and tested its performance on the original VizDoom environment. The agent obtained an average score of 1092 time steps, far beyond the required score of 750 time steps, and also much higher than the score obtained inside the more difficult virtual environment. The full results are listed in Table 2.

We see that even though V is not able to capture all of the details of each frame correctly, for instance, getting the number of monsters correct, C is still able to learn to navigate in the real environment. As the virtual environment cannot even keep track of the exact number of monsters in the first place, an agent that is able to survive a noisier and uncertain generated environment can thrive in the original, cleaner environment. We also find agents that perform well in higher temperature settings generally perform better in the normal setting. In fact, increasing $\tau$ helps prevent our controller from taking advantage of the imperfections of our world model. We will discuss this in depth in the next section.

4.2 Cheating the World Model

In our childhood, we may have encountered ways to exploit video games in ways that were not intended by the original game designer video_game_exploits . Players discover ways to collect unlimited lives or health, and by taking advantage of these exploits, they can easily complete an otherwise difficult game. However, in the process of doing so, they may have forfeited the opportunity to learn the skill required to master the game as intended by the game designer. In our initial experiments, we noticed that our agent discovered an adversarial policy to move around in such a way that the monsters in this virtual environment governed by M never shoot a single fireball during some rollouts. Even when there are signs of a fireball forming, the agent moves in a way to extinguish the fireballs.

Because M is only an approximate probabilistic model of the environment, it will occasionally generate trajectories that do not follow the laws governing the actual environment. As we previously pointed out, even the number of monsters on the other side of the room in the actual environment is not exactly reproduced by M. For this reason, our world model will be exploitable by C, even if such exploits do not exist in the actual environment.

As a result of using M to generate a virtual environment for our agent, we are also giving the controller access to all of the hidden states of M. This is essentially granting our agent access to all of the internal states and memory of the game engine, rather than only the game observations that the player gets to see. Therefore our agent can efficiently explore ways to directly manipulate the hidden states of the game engine in its quest to maximize its expected cumulative reward. The weakness of this approach of learning a policy inside of a learned dynamics model is that our agent can easily find an adversarial policy that can fool our dynamics model – it will find a policy that looks good under our dynamics model, but will fail in the actual environment, usually because it visits states where the model is wrong because they are away from the training distribution.

This weakness could be the reason that many previous works that learn dynamics models of RL environments do not actually use those models to fully replace the actual environments action_conditional_video_prediction ; recurrent_env_sim . Like in the M model proposed in s05_making_the_world_differentiable ; s05a_cm ; s05b_rl , the dynamics model is deterministic, making it easily exploitable by the agent if it is not perfect. Using Bayesian models, as in PILCO pilco , helps to address this issue with the uncertainty estimates to some extent; however, they do not fully solve the problem. Recent work Nagabandi2017 combines the model-based approach with traditional model-free RL training by first initializing the policy network with the learned policy, but must subsequently rely on model-free methods to fine-tune this policy in the actual environment.

To make it more difficult for our C to exploit deficiencies of M, we chose to use the MDN-RNN as the dynamics model of the distribution of possible outcomes in the actual environment, rather than merely predicting a deterministic future. Even if the actual environment is deterministic, the MDN-RNN would in effect approximate it as a stochastic environment. This has the advantage of allowing us to train C inside a more stochastic version of any environment – we can simply adjust the temperature parameter to control the amount of randomness in M, hence controlling the tradeoff between realism and exploitability.

Using a mixture-of-Gaussians model may seem excessive given that the latent space encoded with the VAE model is just a single diagonal Gaussian distribution. However, the discrete modes in a mixture density model are useful for environments with random discrete events, such as whether a monster decides to shoot a fireball or stay put. While a single diagonal Gaussian might be sufficient to encode individual frames, an RNN with a mixture density output layer makes it easier to model the logic behind a more complicated environment with discrete random states.

For instance, if we set the temperature parameter to a very low value of $\tau = 0.1$, effectively training our C with an M that is almost identical to a deterministic LSTM, the monsters inside this generated environment fail to shoot fireballs, no matter what the agent does, due to mode collapse. M is not able to transition to another mode in the mixture-of-Gaussians model where fireballs are formed and shot. Whatever policy is learned inside of this generated environment will achieve a perfect score of 2100 most of the time, but will obviously fail when unleashed into the harsh reality of the actual world, underperforming even a random policy.
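One way to see why a low temperature can suppress a rare mode is to look at the mixture weights after temperature scaling. The sketch below assumes that the logits are divided by $\tau$ before the softmax, and uses a hypothetical two-mode mixture in which a "shoot" mode has probability 0.1 at $\tau = 1$; the numbers are illustrative and not taken from the trained model.

```python
import math


def mode_probabilities(logits, tau):
    """Mixture weights after dividing the logits by the temperature tau."""
    scaled = [logit / tau for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical modes: index 0 = monster stays put, index 1 = monster shoots.
    logits = [math.log(0.9), math.log(0.1)]
    for tau in (0.10, 0.50, 1.00, 1.15, 1.30):
        p_shoot = mode_probabilities(logits, tau)[1]
        print(f"tau={tau:.2f}  P(shoot mode)={p_shoot:.2e}")
```

Under this assumption the rare mode becomes vanishingly unlikely at low temperature, so sampled trajectories almost never enter it, while temperatures above 1 make it more likely than in the unadjusted model.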

By making the temperature $\tau$ an adjustable parameter of M, we can see the effect of training C inside of virtual environments with different levels of uncertainty, and see how well they transfer over to the actual environment. We experiment with varying $\tau$ of the virtual environment, training an agent inside of this virtual environment, and observing its performance when inside the actual environment.

In Table 2, while we see that increasing $\tau$ of M makes it more difficult for C to find adversarial policies, increasing it too much will make the virtual environment too difficult for the agent to learn anything, hence in practice it is a hyperparameter we can tune. The temperature also affects the types of strategies the agent discovers. For example, although the best score obtained is 1092 ± 556 with $\tau = 1.15$, increasing $\tau$ a notch to 1.30 results in a lower score but at the same time a less risky strategy with a lower variance of returns. For comparison, the best reported score takecover is 820 ± 58.

5 Related Work

There is extensive literature on learning a dynamics model, and using this model to train a policy. Many basic concepts first explored in the 1980s for feed-forward neural networks (FNNs) Werbos87specifications ; Munro87 ; RobinsonFallside89 ; Werbos89identification ; NguyenWidrow89 and in the 1990s for RNNs s05_making_the_world_differentiable ; s05a_cm ; s05b_rl ; s05c_boredom laid some of the groundwork for Learning to Think learning_to_think . The more recent PILCO pilco ; McAllister2017 is a probabilistic model-based search policy method designed to solve difficult control problems. Using data collected from the environment, PILCO uses a Gaussian process (GP) model to learn the system dynamics, and uses this model to sample many trajectories in order to train a controller to perform a desired task, such as swinging up a pendulum.

While GPs work well with a small set of low dimensional data, their computational complexity makes them difficult to scale up to model a large history of high dimensional observations. Other recent works deep_pilco ; Depeweg2017 use Bayesian neural networks instead of GPs to learn a dynamics model. These methods have demonstrated promising results on challenging control tasks Hein2017 , where the states are well defined, and the observation is relatively low dimensional. Here we are interested in modelling dynamics observed from high dimensional visual data, as a sequence of raw pixel frames.

In robotic control applications, the ability to learn the dynamics of a system from observing only camera-based video inputs is a challenging but important problem. Early work on RL for active vision trained an FNN to take the current image frame of a video sequence to predict the next frame s04_trajectories , and used this predictive model to train a fovea-shifting control network trying to find targets in a visual scene. To get around the difficulty of training a dynamical model to learn directly from high-dimensional pixel images, researchers explored using neural networks to first learn a compressed representation of the video frames. Recent work along these lines learning_deep_dynamical_models_from_image_pixels ; from_pixels_to_torques was able to train controllers using the bottleneck hidden layer of an autoencoder as low-dimensional feature vectors to control a pendulum from pixel inputs. Learning a model of the dynamics from a compressed latent space enables RL algorithms to be much more data-efficient deep_spacial_autoencoders ; embed_to_control .

Video game environments are also popular in model-based RL research as a testbed for new ideas. Previous work game_engine_learning used a feed-forward convolutional neural network (CNN) to learn a forward simulation model of a video game. Learning to predict how different actions affect future states in the environment is useful for game-play agents, since if our agent can predict what happens in the future given its current state and action, it can simply select the best action that suits its goal. This has been demonstrated not only in early work NguyenWidrow89 ; s04_trajectories (when compute was a million times more expensive than today) but also in recent studies learn_to_act_by_predicting_future on several competitive VizDoom environments.

The works mentioned above use FNNs to predict the next video frame. We may want to use models that can capture longer term time dependencies. RNNs are powerful models suitable for sequence modelling graves_rnn . Using RNNs to develop internal models to reason about the future has been explored as early as 1990 s05_making_the_world_differentiable , and then further explored in s05a_cm ; s05b_rl ; s05c_boredom . A more recent work learning_to_think presented a unifying framework for building an RNN-based general problem solver that can learn a world model of its environment and also learn to reason about the future using this model. Subsequent works have used RNN-based models to generate many frames into the future recurrent_env_sim ; action_conditional_video_prediction ; Denton2017 ; graves_lecture , and also as an internal model to reason about the future Silver2016 ; imagination_agent ; Watters2017 .

In this work, we used evolution strategies (ES) to train our controller, as this offers many benefits. For instance, we only need to provide the optimizer with the final cumulative reward, rather than the entire history. ES is also easy to parallelize – we can launch many instances of rollout with different solutions to many workers and quickly compute a set of cumulative rewards in parallel. Recent works pathnet ; openai ; stablees ; DeepNeuroevolution2017 have demonstrated that ES is a viable alternative to traditional Deep RL methods on many strong baselines. Before the popularity of Deep RL methods dqn , evolution-based algorithms have been shown to be effective at solving RL tasks neat ; gom5_ne_accelerated ; gom2_coevolve ; hyperneat ; pepg . Evolution-based algorithms have even been able to solve difficult RL tasks from high dimensional pixel inputs kou1_torcs ; hausknecht ; parker2012 ; vae_evolution .

6 Discussion

We have demonstrated the possibility of training an agent to perform tasks entirely inside of its simulated latent space world. This approach offers many practical benefits. For instance, video game engines typically require heavy compute resources for rendering the game states into image frames, or calculating physics not immediately relevant to the game. We may not want to waste cycles training an agent in the actual environment, but instead train the agent as many times as we want inside its simulated environment. Agents that are trained incrementally to simulate reality may prove to be useful for transferring policies back to the real world. Our approach may complement sim2real approaches outlined in previous work Bousmalis2017 ; Higgins2017 .

The choice of implementing V as a VAE and training it as a standalone model also has its limitations, since it may encode parts of the observations that are not relevant to a task. After all, unsupervised learning cannot, by definition, know what will be useful for the task at hand. For instance, our VAE reproduced unimportant detailed brick tile patterns on the side walls in the Doom environment, but failed to reproduce task-relevant tiles on the road in the Car Racing environment. By training together with an M that predicts rewards, the VAE may learn to focus on task-relevant areas of the image, but the tradeoff here is that we may not be able to reuse the VAE effectively for new tasks without retraining. Learning task-relevant features has connections to neuroscience as well. Primary sensory neurons are released from inhibition when rewards are received, which suggests that they generally learn task-relevant features, rather than just any features, at least in adulthood Pi2013 .

In our experiments, the tasks are relatively simple, so a reasonable world model can be trained using a dataset collected from a random policy. But what if our environments become more sophisticated? In any difficult environment, some parts of the world become available to the agent only after it learns how to strategically navigate through its world. For more complicated tasks, an iterative training procedure is required. We need our agent to be able to explore its world, and constantly collect new observations so that its world model can be improved and refined over time. Future work will incorporate an iterative training procedure learning_to_think , where our controller actively explores parts of the environment that are beneficial for improving its world model. An exciting research direction is to look at ways to incorporate artificial curiosity and intrinsic motivation schmidhuber_creativity ; s07_intrinsic ; s08_curiousity ; pathak2017 ; intrinsic_motivation and information seeking SchmidhuberStorck:94 ; Gottlieb2013 abilities in an agent to encourage novel exploration Lehman2011 . In particular, we can augment the reward function based on improvement in compression quality schmidhuber_creativity ; s07_intrinsic ; s08_curiousity ; learning_to_think .

Another concern is the limited capacity of our world model. While modern storage devices can store large amounts of historical data generated using an iterative training procedure, our LSTM lstm ; s12_lstm_forget -based world model may not be able to store all of the recorded information inside of its weight connections. While the human brain can hold decades and even centuries of memories to some resolution brain_capacity , our neural networks trained with backpropagation have more limited capacity and suffer from issues such as catastrophic forgetting Ratcliff1990 ; French1994 ; Kirkpatrick2016 . Future work will explore replacing the VAE and MDN-RNN with higher capacity models  outrageously_large_neural_nets ; hypernetworks ; suarez2017 ; wavenet ; attention , or incorporating an external memory module Gemici2017 ; Wu2018 , if we want our agent to learn to explore more complicated worlds.

Like early RNN-based C–M systems  s05_making_the_world_differentiable ; s05a_cm ; s05b_rl ; s05c_boredom , ours simulates possible futures time step by time step, without profiting from human-like hierarchical planning or abstract reasoning, which often ignores irrelevant spatio-temporal details. However, the more general Learning To Think learning_to_think approach is not limited to this rather naive approach. Instead it allows a recurrent C to learn to address subroutines of the recurrent M, and reuse them for problem solving in arbitrary computable ways, e.g., through hierarchical planning or other kinds of exploiting parts of M’s program-like weight matrix. A recent One Big Net onebignet2018 extension of the C–M approach collapses C and M into a single network, and uses PowerPlay-like s10_powerplay ; s11_powerplay behavioural replay (where the behaviour of a teacher net is compressed into a student net chunker91and92 ) to avoid forgetting old prediction and control skills when learning new ones. Experiments with those more general approaches are left for future work.


We would like to thank Blake Richards, Kory Mathewson, Chris Olah, Kai Arulkumaran, Denny Britz, Kyle McDonald, Ankur Handa, Elwin Ha, Nikhil Thorat, Daniel Smilkov, Alex Graves, Douglas Eck, Mike Schuster, Rajat Monga, Vincent Vanhoucke, Jeff Dean and Natasha Jaques for their thoughtful feedback, and for offering their valuable insights from their respective areas of expertise.

Appendix A Supplementary Materials

In this section we describe in more detail the models and training methods used in this work.

A.1 Comparing V, M, C Model Sizes

Model        Parameter Count
VAE          4,348,547
MDN-RNN      422,368
Controller   867
Table 3: CarRacing-v0 Parameter Count

Model        Parameter Count
VAE          4,446,915
MDN-RNN      1,678,785
Controller   1,088
Table 4: DoomTakeCover-v0 Parameter Count
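As a sanity check, the Controller and MDN-RNN counts of 867 and 422,368 can be reproduced with a few lines of arithmetic from the Car Racing sizes given in the appendix sections that follow (latent size 32, 256 LSTM hidden units, 3 action values, 5 mixtures). This is only a sketch under assumptions the tables do not state: a standard LSTM with four gates and one bias vector per gate, an LSTM input formed by concatenating the latent vector with the action, and an MDN output layer that emits a mixing weight, a mean and a standard deviation for every mixture of every latent dimension. The helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def lstm_params(n_in, n_hidden):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (n_hidden * (n_in + n_hidden) + n_hidden)


# Car Racing sizes: latent, LSTM hidden units, action values, mixtures.
N_Z, N_H, N_A, N_MIX = 32, 256, 3, 5

# C is a single linear layer from [z, h] to the action.
controller = dense_params(N_Z + N_H, N_A)

# M is an LSTM over [z, a] followed by a linear MDN output layer
# (assumed layout: 3 values per mixture per latent dimension).
mdn_rnn = lstm_params(N_Z + N_A, N_H) + dense_params(N_H, 3 * N_MIX * N_Z)

print(controller)  # 867
print(mdn_rnn)     # 422368
```

Both numbers agree with the counts listed for the smaller of the two configurations, which supports reading that table as the Car Racing one.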

A.2 Variational Autoencoder


Figure 3: Description of tensor shapes for each layer of our ConvVAE (left). MDN-RNN similar to the one used in graves_rnn ; sketchrnn ; carter2016experiments (right).

We trained a Convolutional Variational Autoencoder (ConvVAE) model as our agent’s V, as illustrated in Figure 3 (left). Unlike vanilla autoencoders, enforcing a Gaussian prior over the latent vector also limits the amount of information capacity for compressing each frame, but this Gaussian prior also makes the world model more robust to unrealistic vectors generated by M.

As the environment may give us observations as high dimensional pixel images, we first resize each image to 64x64 pixels and use this resized image as V’s observation. Each pixel is stored as three floating point values between 0 and 1 to represent each of the RGB channels. The ConvVAE takes in this 64x64x3 input tensor and passes it through 4 convolutional layers to encode it into low dimension vectors $\mu$ and $\sigma$, each of size $N_z$. The latent vector $z$ is sampled from the Gaussian prior $N(\mu, \sigma^2 I)$. In the Car Racing task, $N_z$ is 32 while for the Doom task $N_z$ is 64. The latent vector $z$ is passed through 4 deconvolution layers used to decode and reconstruct the image.

Each convolution and deconvolution layer uses a stride of 2. The layers are indicated in the diagram in italics as Activation-type Output Channels x Filter Size. All convolutional and deconvolutional layers use relu activations except for the output layer, as we need the output to be between 0 and 1. We trained the model for 1 epoch over the data collected from a random policy, using L2 distance between the input image and the reconstruction to quantify the reconstruction loss we optimize for, in addition to KL loss.
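The sampling step and the two loss terms can be written out as follows. This is a minimal sketch, not the training code used in this work: it assumes a squared L2 reconstruction term summed over pixels, the closed-form KL divergence between a diagonal Gaussian and a standard normal, and an unweighted sum of the two. The function names are hypothetical, and images and vectors are plain flat lists to keep the example dependency-free.

```python
import math
import random


def sample_latent(mu, sigma, rng=random):
    """Reparameterised sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def l2_reconstruction_loss(x, x_hat):
    """Sum of squared differences between input pixels and reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2 I) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))


def vae_loss(x, x_hat, mu, sigma):
    # Assumption: the two terms are added with equal weight.
    return l2_reconstruction_loss(x, x_hat) + kl_to_standard_normal(mu, sigma)
```

The KL term is what limits how much information each latent vector can carry: it is zero only when the encoder outputs a zero mean and unit standard deviation, and grows as the encoding moves away from that.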

A.3 Mixture Density Network + Recurrent Neural Network

To implement M, we use an LSTM lstm recurrent neural network combined with a Mixture Density Network bishop as the output layer, as illustrated in Figure 3 (right). We use this network to model the probability distribution of $z$ in the next time step as a mixture of Gaussians. This approach is very similar to previous work graves_rnn in the Unconditional Handwriting Generation section and also the decoder-only section of SketchRNN sketchrnn . The only difference is that we did not model the correlation parameter between each element of $z$, and instead had the MDN-RNN output a diagonal covariance matrix of a factored Gaussian distribution.

Unlike the handwriting and sketch generation works, rather than using the MDN-RNN to model the pdf of the next pen stroke, we model instead the pdf of the next latent vector $z$. We would sample from this pdf at each time step to generate the environments. In the Doom task, we also use the MDN-RNN to predict the probability of whether the agent has died in this frame. If that probability is above 50%, then we set done to be true in the virtual environment. Given that death is a low probability event at each time step, we find the cutoff approach to be more stable compared to sampling from the Bernoulli distribution.
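The difference between the cutoff rule and Bernoulli sampling can be illustrated with a small calculation. The sketch below uses hypothetical numbers (a constant 2% predicted death probability over 1000 steps) and hypothetical function names; it is meant only to show why sampling a low-probability event at every step ends episodes spuriously, not to reproduce the behaviour of the trained model.

```python
import random


def done_by_cutoff(p_death, cutoff=0.5):
    """Deterministic rule: end the episode only when death is more likely than not."""
    return p_death > cutoff


def done_by_sampling(p_death, rng=random):
    """Stochastic alternative: draw the done flag from Bernoulli(p_death)."""
    return rng.random() < p_death


def survival_probability_when_sampling(p_death, n_steps):
    """Chance that Bernoulli sampling never fires over n_steps independent steps."""
    return (1.0 - p_death) ** n_steps


# Hypothetical scenario: the model predicts a 2% death probability at every step.
p, steps = 0.02, 1000
print(done_by_cutoff(p))                             # False at every step
print(survival_probability_when_sampling(p, steps))  # vanishingly small
```

Under sampling, an episode in this scenario almost surely terminates early even though the model never considers death likely at any individual step, whereas the cutoff rule lets it run.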

The MDN-RNNs were trained for 20 epochs on the data collected from a random policy agent. In the Car Racing task, the LSTM used 256 hidden units, and in the Doom task 512 hidden units. In both tasks, we used 5 Gaussian mixtures and did not model the correlation parameter, hence $z$ is sampled from a factored mixture of Gaussian distributions.
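Sampling from a factored mixture of Gaussians can be sketched as below. The layout is an assumption rather than something stated above: each latent dimension is given its own set of mixture logits, means and standard deviations, and a component is chosen independently per dimension. The function names are hypothetical and the example uses only the standard library.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_factored_mog(logits, means, sigmas, rng=random):
    """Sample one latent vector from a factored mixture of Gaussians.

    logits, means, sigmas: lists of length N_z; each entry is a list with
    one value per mixture component (5 in the setup described above).
    """
    z = []
    for dim_logits, dim_means, dim_sigmas in zip(logits, means, sigmas):
        weights = softmax(dim_logits)
        # Pick a component for this dimension, then draw from its Gaussian.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        z.append(rng.gauss(dim_means[k], dim_sigmas[k]))
    return z


# Usage with hypothetical MDN outputs for N_z = 32 and 5 mixtures.
n_z, n_mix = 32, 5
rng = random.Random(0)
logits = [[0.0] * n_mix for _ in range(n_z)]
means = [[float(k) for k in range(n_mix)] for _ in range(n_z)]
sigmas = [[0.1] * n_mix for _ in range(n_z)]
z_next = sample_factored_mog(logits, means, sigmas, rng)
```

Because no correlation parameter is modelled, every dimension is drawn independently, which is what makes the covariance of each component diagonal.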

When training the MDN-RNN using teacher forcing from the recorded data, we store a pre-computed set of $\mu$ and $\sigma$ for each of the frames, and sample an input $z \sim N(\mu, \sigma^2 I)$ each time we construct a training batch, to prevent overfitting our MDN-RNN to a specific sampled $z$.
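The resampling step can be sketched as follows, assuming the encoder outputs for each recorded frame are kept as plain lists of means and standard deviations. The function and variable names are hypothetical; the point is only that the latent inputs are redrawn every time a batch is built rather than fixed once.

```python
import random


def make_training_batch(mu_frames, sigma_frames, batch_indices, rng=random):
    """Draw a fresh latent sample for every selected frame.

    mu_frames, sigma_frames: per-frame lists of length N_z, pre-computed by V.
    batch_indices: indices of the frames that form this batch.
    """
    batch = []
    for i in batch_indices:
        z = [m + s * rng.gauss(0.0, 1.0)
             for m, s in zip(mu_frames[i], sigma_frames[i])]
        batch.append(z)
    return batch


# Two batches built from the same frames contain different latent samples.
rng = random.Random(0)
mu_frames = [[0.0, 1.0], [2.0, 3.0]]
sigma_frames = [[0.5, 0.5], [0.5, 0.5]]
first = make_training_batch(mu_frames, sigma_frames, [0, 1], rng)
second = make_training_batch(mu_frames, sigma_frames, [0, 1], rng)
```

Storing the means and standard deviations instead of one sampled vector per frame costs twice the memory, but lets the same recorded frame yield a slightly different input in every epoch.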

A.4 Controller

For both environments, we applied nonlinearities to clip and bound the action space to the appropriate ranges. For instance, in the Car Racing task, the steering wheel has a range from -1.0 to 1.0, the acceleration pedal from 0.0 to 1.0, and the brakes from 0.0 to 1.0. In the Doom environment, we converted the discrete actions into a continuous action space between -1.0 and 1.0, and divided this range into thirds to indicate whether the agent is moving left, staying where it is, or moving to the right. We would give C a feature vector as its input, consisting of $z$ and the hidden state of the MDN-RNN. In the Car Racing task, this hidden state is the output vector of the LSTM, while for the Doom task it is both the cell vector and the output vector of the LSTM.
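One way to realise this mapping is sketched below. The specific squashing functions are assumptions, since the text only says that nonlinearities bound the actions: tanh is used for ranges of -1.0 to 1.0 and a logistic sigmoid for ranges of 0.0 to 1.0. The function names and the "left" / "stay" / "right" labels are hypothetical.

```python
import math


def linear_controller(features, weights, biases):
    """One raw output per action dimension: W [z, h] + b."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def sigmoid(x):
    """Logistic sigmoid written via tanh so it cannot overflow."""
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def car_racing_action(raw):
    """Bound the three raw outputs to the steering, gas and brake ranges."""
    steer = math.tanh(raw[0])   # -1.0 to 1.0
    gas = sigmoid(raw[1])       #  0.0 to 1.0
    brake = sigmoid(raw[2])     #  0.0 to 1.0
    return [steer, gas, brake]


def doom_action(raw):
    """Squash to the range -1.0 to 1.0, then split that range into thirds."""
    a = math.tanh(raw)
    if a < -1.0 / 3.0:
        return "left"
    if a > 1.0 / 3.0:
        return "right"
    return "stay"
```

For Car Racing the feature list would hold the latent vector followed by the LSTM output vector; for Doom it would additionally include the LSTM cell vector, as described above.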

A.5 Evolution Strategies

We used Covariance-Matrix Adaptation Evolution Strategy (CMA-ES) cmaes to evolve C’s weights. Following the approach described in Evolving Stable Strategies stablees , we used a population size of 64, and had each agent perform the task 16 times with different initial random seeds. The agent’s fitness value is the average cumulative reward of the 16 random rollouts. Figure 4 (left) charts the best performer, worst performer, and mean fitness of the population of 64 agents at each generation:


Figure 4: Training progress of CarRacing-v0 (left). Histogram of cumulative rewards. Score is 906 ± 21 (right).
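The evaluation scheme described in this section, a population of 64 candidates each scored by the average cumulative reward of 16 seeded rollouts, can be sketched with a much simpler optimizer. The code below is deliberately not CMA-ES: it is a plain Gaussian evolution strategy that keeps the mean of the best quarter of each generation, and the `toy_rollout` function is a hypothetical stand-in for running the agent in an environment. All names and hyperparameters other than the population size and rollout count are assumptions.

```python
import random


def fitness(params, rollout, n_rollouts=16, base_seed=0):
    """Average cumulative reward over n_rollouts differently seeded rollouts."""
    total = 0.0
    for k in range(n_rollouts):
        total += rollout(params, seed=base_seed + k)
    return total / n_rollouts


def simple_es(rollout, n_params, pop_size=64, n_generations=10,
              sigma=0.1, elite_frac=0.25, seed=0):
    """Simplified Gaussian ES (not CMA-ES): sample, evaluate, average the elite."""
    rng = random.Random(seed)
    mean = [0.0] * n_params
    n_elite = max(1, int(pop_size * elite_frac))
    for _ in range(n_generations):
        population = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(pop_size)]
        # Candidates are evaluated independently of each other, so this is
        # the step that could be distributed across parallel workers.
        scores = [fitness(p, rollout) for p in population]
        ranked = sorted(zip(scores, population),
                        key=lambda pair: pair[0], reverse=True)
        elite = [p for _, p in ranked[:n_elite]]
        mean = [sum(col) / n_elite for col in zip(*elite)]
    return mean


def toy_rollout(params, seed):
    """Hypothetical environment: noisy reward that peaks when all params are 1."""
    noise = random.Random(seed).gauss(0.0, 0.01)
    return -sum((p - 1.0) ** 2 for p in params) + noise


best = simple_es(toy_rollout, n_params=3, n_generations=30)
```

The optimizer only ever sees one scalar per candidate, which is the property that makes this family of methods easy to parallelize.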

Since the requirement of this environment is to have an agent achieve an average score above 900 over 100 random rollouts, we took the best performing agent at the end of every 25 generations, and tested it over 1024 random rollout scenarios to record this average on the red line. After 1800 generations, an agent was able to achieve an average score of 900.46 over 1024 random rollouts. We used 1024 random rollouts rather than 100 because each process of the 64 core machine had been configured to run 16 times already, effectively using a full generation of compute after every 25 generations to evaluate the best agent 1024 times. In Figure 4 (right), we plot the results of the same agent evaluated over 100 rollouts.


Figure 5: When agent sees only $z$ but not $h$, score is 632 ± 251 (left). If we add a hidden layer on top of only $z$, score increases to 788 ± 141 (right).

We also experimented with an agent that has access to only the $z$ vector from the VAE, but not the RNN’s hidden states. We tried 2 variations. In the first variation, C maps $z$ directly to the action space $a$. In the second variation, we added a hidden layer with 40 activations between $z$ and $a$, increasing the number of model parameters of C to 1443, making it more comparable with the original setup. These results are shown in Figure 5.
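The figure of 1443 parameters follows from the layer sizes, assuming fully connected layers with biases, the Car Racing latent size of 32 and its 3 action values. The helper name is hypothetical, and the count for the direct mapping is derived under the same assumptions rather than quoted from the text.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# Car Racing latent size, hidden units in the second variation, action values.
N_Z, N_HIDDEN, N_A = 32, 40, 3

direct = dense_params(N_Z, N_A)                                        # z -> a
with_hidden = dense_params(N_Z, N_HIDDEN) + dense_params(N_HIDDEN, N_A)

print(direct)       # 99
print(with_hidden)  # 1443
```

The hidden layer brings the latent-only controller to a size of the same order as a linear controller that also receives the 256-dimensional hidden state.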

A.6 DoomRNN

We conducted a similar experiment on the generated Doom environment we called DoomRNN. Please note that we have not actually attempted to train our agent on the actual VizDoom environment, and had only used VizDoom for the purpose of collecting training data using a random policy. DoomRNN is more computationally efficient compared to VizDoom as it only operates in latent space without the need to render an image at each time step, and we do not need to run the actual Doom game engine.


Figure 6: Training of DoomRNN (left). Histogram of time steps survived in the actual VizDoom environment over 100 consecutive trials. Score is 1092 ± 556 (right).

In our virtual DoomRNN environment we increased the temperature slightly to make the agent learn in a more challenging environment. The best agent managed to obtain an average score of 959 over 1024 random rollouts. This is the highest score of the red line in Figure 6 (left). This same agent achieved an average score of 1092 ± 556 over 100 random rollouts when deployed to the actual DoomTakeCover-v0 takecover environment, as shown in Figure 6 (right).