Learning and Querying Fast Generative Models for Reinforcement Learning

02/08/2018 ∙ by Lars Buesing, et al. ∙ 0

A key challenge in model-based reinforcement learning (RL) is to synthesize computationally efficient and accurate environment models. We show that carefully designed generative models that learn and operate on compact state representations, so-called state-space models, substantially reduce the computational costs for predicting outcomes of sequences of actions. Extensive experiments establish that state-space models accurately capture the dynamics of Atari games from the Arcade Learning Environment from raw pixels. The computational speed-up of state-space models while maintaining high accuracy makes their application in RL feasible: We demonstrate that agents which query these models for decision making outperform strong model-free baselines on the game MSPACMAN, demonstrating the potential of using learned environment models for planning.



page 14

page 15

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Deep reinforcement learning has demonstrated remarkable progress in recent years, achieving high levels of performance across a wide array of challenging tasks, including Atari games (Mnih et al., 2015), locomotion (Schulman et al., 2015), and 3D navigation (Mnih et al., 2016)

. Many of these advances have relied on combining deep learning methods with model-free RL algorithms. A critical drawback of this approach is the vast amount of experience required to achieve good performance, as only weak prior knowledge is encoded in the agents’ networks (e.g., spatial translation invariance via convolutions).

The promise of model-based reinforcement learning is to improve sample-efficiency by making use of explicit models of the environment. The idea is that given a model of the environment (which can possibly be learned in the absence of rewards or from observational data only), an agent can learn task-specific policies rapidly by leveraging this model e.g., by trajectory optimization (Betts, 1998), search (Browne et al., 2012; Silver et al., 2016a), dynamic programming (Bertsekas et al., 1995) or generating synthetic experiences (Sutton, 1991). However, model-based RL algorithms typically pose strong requirements on the environment models, namely that they make predictions about the future state of the environment efficiently and accurately.

In this paper we aim to address the challenge of learning accurate, computationally efficient models of complex domains and using them to solve RL problems. First, we advocate the use computationally efficient state-space environment models that make predictions at a higher level of abstraction, both spatially and temporally, than at the level of raw pixel observations. Such models substantially reduce the amount of computation required to make predictions, as future states can be represented much more compactly. Second, in order to increase model accuracy, we examine the benefits of explicitly modeling uncertainty in state transitions. Finally we demonstrate that the computational efficiency of state-space models enables us to apply them to challenging RL domains: Extending a recent RL architecture (Weber et al., 2017), we propose an agent that learns to query a state-space model to anticipate outcomes of actions and aid decision making.

The main contributions of the paper are as follows: 1) we provide the first comparison of deterministic and stochastic, pixel-space and state-space models w.r.t. speed and accuracy, applied to challenging environments from the Arcade Learning Environment (ALE, Bellemare et al., 2013); 2) we demonstrate state-of-the-art environment modeling accuracy (as measured by log-likelihoods) with stochastic state-space models that efficiently produce diverse yet consistent rollouts; 3) using state-space models, we show model-based RL results on MS_PACMAN, and obtain significantly improved performance compared to strong model-free baselines, and 4) we show that learning to query the model further increases policy performance.

2 Environment models

In the following, for any sequence of variables , we use (or ) to denote all elements of the sequences up to , excluding (respectively including) . We write subsequences as . We consider an environment that outputs at each time step an observation and a reward . We also refer to the observations as pixels or frames, to give the intuition that they can be high-dimensional and highly redundant in many domains of interest. To ease the notation, in the following we will also write for the observations and rewards unless explicitly stated otherwise. Given action

, the environment transitions into a new, unobserved state and returns a sample of the observation and reward at the next time step with probability

. A main challenge in model-based RL is to learn a model of the environment that allows for computationally cheap and accurate predictions about the results of taking actions.

2.1 Model taxonomy



Figure 1:

The graphical models representing the architectures of different environment models. Boxes are deterministic nodes, circles are random variables and filled circles represent variables observed during training.

In the following, we discuss different environment models that can be learned in an unsupervised way from observations conditioned on actions . In particular, we will focus on how fast and accurately models can predict, at time step , some future statistics over a horizon that can later be used for decision making. We will simply call predictions and the rollout horizon or depth. Concretely, we will assume that we are interested at every time step in generating samples by doing Monte-Carlo rollouts of the model given an arbitrary sequence of actions (which will later be sampled from a rollout policy). The structure of the models we consider are illustrated in Fig. 1.

Auto-regressive models

A straight-forward choice is the family of temporally auto-regressive models over the observations , which we write in the following way:

If is given by a first-in-first-out (FIFO) buffer of the last observations and actions , the above definition is a regular auto-regressive model (of order ), which we denote by AR. Rolling out AR models is slow for two reasons: 1) we have to sequentially sample, or “render”, all pixels explicitly, which is particularly computationally demanding for high-dimensional observations, and 2) vanilla AR models without any additional structure do not reuse any computations from evaluating for evaluating . To speed-up AR models, we address the latter concern by considering the following model variant: we allow to be a recurrent mapping that recursively updates sufficient statistics , therefore reusing the previously computed statistics . We call these models recurrent auto-regressive models (RAR); if

is parameterized as a neural network, RARs are equivalent to recurrent neural networks (RNNs). Although faster, we still expect Monte-Carlo rollouts of RARs to be slow, as they still need to explicitly render observations

in order to make any predictions , which could be taken to be pixels or recurrent states .

State-space models: abstraction in space

As discussed above, rolling out ARs is computationally demanding as it requires sampling, or “rendering” all observations . State-space models (SSMs) circumvent this by positing that there is a compact state representation that captures all essential aspects of the environment on an abstract level: it is assumed that can be “rolled out”, i.e. predicted, from the previous state and action alone, without the help of previous pixels or any action other than : . Furthermore, we assume that is sufficient to predict , i.e. . Hence SSMs allow for the following factorization of the predictive distribution:

where is the initial state distribution. This modelling choice implies that the latent states are, by construction, sufficient to generate any predictions . Hence, we never have to directly sample pixel observations.

Transition model

We consider two flavors of SSMs: deterministic SSMs (dSSMs) and stochastic SSMs (sSSMs). For dSSMs, the latent transition is a deterministic function of the past, whereas for sSSMs, we consider transition distributions that explicitly model uncertainty over the state . sSSMs are a strictly larger model class than dSSMs, and we illustrate their difference in capacity for modelling stochastic time-series in the Appendix. We parameterize sSSMs by introducing for every a latent variable whose distribution depends on and , and by making the state a deterministic function of the past state, action, and latent variable:

Observation model

The observation model, or decoder, computes the conditional distribution . It either takes as input the state (deterministic decoder), or the state and latent (stochastic decoder). For sSSMs, we always use the stochastic decoder. For dSSMs, we can use either the deterministic decoder (dSSM-DET), or the stochastic decoder (dSSM-VAE). The latter can capture joint uncertainty over pixels in a given observation , but not across time steps. The former is a fully deterministic model, incapable of modeling joint uncertainties (in time or in space). Further details can be found in section in the Appendix.

2.2 Jumpy models: abstraction in time

To further reduce the computational time required for sampling a rollout of horizon , we also consider modelling environment transitions at a coarser time scale. To this end, we sub-sample observations by a factor of , i.e. for , we replace sequences , by the subsampled sequence

. We “chunk” the actions by concatenating them into a vector

, and sum the rewards . We refer to models trained on data pre-processed in this way as jumpy models. Jumpy training is a convenient way to inject temporal abstraction over at a time scale into environment models. This approach allows us to further reduce the computational load for Monte-Carlo rollouts roughly by a factor of .

2.3 Model architectures, inference and training

Here, we describe the parametric architectures for the models outlined above. We discuss the architecture of the sSSM in detail, and then briefly explain the modifications of this model used to implement RARs and dSSMs.

The states , latent variables and observations are all shaped like convolutional feature maps and are generated by transition modules , and the decoder

respectively. All latent variables are constrained to be normal with diagonal covariances. All modules consist of stacks of convolutional neural networks with ReLU nonlinearities. The transition modules use size-preserving convolutions, the decoder, size-expanding ones. To overcome the limitations of small receptive fields associated with convolutions, for modelling global effects of the environment dynamics, we use pool-and-inject layers introduced by

Weber et al. (2017)

: they perform max-pooling over their input feature maps, tile the results and concatenate them back to the inputs. Using these layers we can induce long-range spatial dependencies in the state

. All modules are illustrated in detail in the Appendix.

We train the AR, RAR and dSSM-DET models by maximum likelihood estimation (MLE), i.e. by maximizing

over model parameters , where and denotes some initial context (in our experiments ). We initialize the state with a convolutional network including an observation encoder . This encoder uses convolutions that reduce the size of the feature maps from the size of the observation to the size of the state.

For the models containing latent variables, i.e. dSSM-VAE and sSSM, we cannot evaluate in closed form in general. We maximize instead the evidence lower bound , where denotes an approximate posterior distribution:

where now denotes the union of the model parameters and the parameters of . Here, we used that the structure of the sSSM to assume without loss of generality that is Markovian in (see Krishnan et al. (2015) for an in-depth discussion). Furthermore, we restrict ourselves to the filtering distribution

, which we model as normal distribution with diagonal covariance matrix. We did not observe improvements in experiments by using the full smoothing distribution

. We share parameters between the prior and the posterior by making the posterior a function of the state computed by the prior transition module , as follows:

The posterior uses the observation encoder on ; the resulting feature maps are then concatenated to

, and a number of additional convolutions compute the posterior mean and standard deviation of

. For all latent variable models, we use the reparameterized representation of the computation graph (Kingma & Welling, 2013; Rezende et al., 2014) and a single posterior sample to obtain unbiased gradient estimators of the ELBO.

We can restrict the above sSSM to a dSSM-VAE, by not feeding samples of into the transition model . To ensure a fair model comparison (identical number of parameters and same amount of computation), we numerically implement this by feeding the mean of into the transition function instead. If we also do not feed (but the mean ) into the decoder for rendering , we arrive at the dSSM-DET, which does not contain any samples of . We implement the RAR based on the dSSM-DET by modifiying the transition model to , where denotes an encoder with the same architecture as the one of sSSM and dSSM-VAE.

3 RL agents with state-space models


model-free path
Figure 2: The architecture of the Imagination-Augmented Agent, which computes its policy and value function , by combining information from a model-free path with information from Monte-Carlo rollouts of its environment model.

Here we discuss how we can use a state-space model to help solve RL problems. A naive approach would be e.g. the following: Given a perfect model and unlimited computational resources, an agent could perform in principle a brute-force search for the optimal open-loop policy in any state by computing (assuming undiscounted reward over a finite horizon up to ), where is the expectation under the environment model . In practice, however, this optimization is costly and brittle. Quite generally, it has been observed that model-based planning often leads to catastrophic outcomes given unavoidable imperfections of when modelling complex environments (Talvitie, 2015).

Recent, Weber et al. (2017) proposed to combine model-free and model-based methods to increase robustness to model imperfections: the Imagination-Augmented Agent (I2A) queries its internal, pre-trained model via Monte-Carlo rollouts under a rollout policy. It then uses features (called imaginations) computed from these rollouts to anticipate the outcomes of taking different actions, thereby informing its decision-making. RL is used to learn to interpret the model’s predictions; this was shown to greatly diminish the susceptibility of planning to model imperfections.

In the following, we briefly recapitulate the I2A architecture and discuss how it can be extended to query state-space environment models.

3.1 Imagination-Augmented Agent

We briefly describe the agent, which is illustrated in Fig. 2; for details see Weber et al. (2017). The I2A is an RL agent with an actor-critic architecture, i.e. at each time step , it explicitly computes its policy over the next action to take and an approximate value function , and it is trained using standard policy gradient methods (Mnih et al., 2016). Its policy and value function are informed by the outputs of two separate pathways: 1) a model-free path, that tries to estimate the value and which action to take directly from the latest observation using a convolutional neural network (CNN); and 2) a model-based path, which we describe in the next paragraph.

The model-based path of an I2A is designed in the following way. The I2A is endowed with a pre-trained, fixed environment model . At every time , conditioned on past observations and actions , it uses the model to simulate possible futures (”rollouts”) represented by some features, so-called imaginations, over a horizon , under a rollout policy . It then extracts information from the rollout imaginations , and uses it, together with the results from the model-free path, to compute and . It has been shown that I2As are robust to model imperfections: they learn to interpret imaginations produced from the internal models in order to inform decision making as part of standard return maximization. More precisely, the model-based path is computed by executing the following steps (also see Fig. 2):

  • The I2A updates the state of its internal model by sampling from the initial model distribution . We denote this sample to clearly indicate the real environment information is contained in that sample up to time .

  • The I2A draws samples from the distribution . Here, denotes the model distribution with internal actions being sampled from the rollout policy . For SSMs, we require the rollout policy to only depend on the state so that rollouts can be computed purely in abstract space.

  • The imaginations are summarized by a ”summarizer” module (e.g. an LSTM), then combined with the model-free output and finally used to compute and .

Which imaginations the model predicts and passes to the agent is a design choice, which strongly depends on the model itself. For auto-regressive models (AR, RAR), we choose the imaginations to be rendered pixel predictions . For SSM, we are free to use predicted pixels or predicted abstract states as imaginations, the latter being much cheaper to compute.

Apart from the choice of environment model, a key ingredient to I2As is the choice of internal actions applied to the model. How to best design a rollout policy that extracts useful information from a given environment model remains an open question, which also depends on the choice of model itself. In the following and we investigate in the following different possibilities.

3.2 Distillation

In Weber et al. (2017), the authors propose to train the rollout policy to imitate the agent’s model-based behavioral policy . We call the resulting agent the distillation agent

. Concretely, we minimize the Kullback-Leibler divergence between

and :

where is the expectation over states and actions when following policy .

is a hyperparameter that trades off reward maximization with the distillation loss.

3.3 Learning to Query by Backpropagation

An obvious alternative to distillation is to learn the parameters of jointly with the other parameters of the agents by policy gradient methods. As the rollout actions sampled from

are discrete random variables, this optimization would require “internal” RL – i.e. redefining the action space to include the internal actions and learning a joint policy over external and internal actions. However, we expect the credit assignment of the rewards to the internal actions to be a difficult problem, resulting in slow learning. Therefore, we take a heurisitic approach similar to

Henaff et al. (2017) (and related to Bengio et al., 2013): Instead of feeding the sampled one-hot environment action to the model, we can instead directly feed the probability vector into the environment model during rollouts. This can be considered as a relaxation of the discrete internal RL optimization problem. Concretely, we back-propagate the RL policy gradients through the entire rollout into . This is possible since the environment model is fully differentiable thanks to the reparametrization trick, and the simulation policy is differentiable thanks to the relaxation of discrete actions. Parameters of the environment model are not optimized but kept constant, however. As the model was only trained on one-hot representation , and not on continuous actions probabilities, it is not guaranteed a-priori that the model generalizes appropriately. We explore promoting rollout probabilities to be close to one-hot action vectors, and therefore are numerically closer to the training data of the model, by introducing an entropy penalty.

3.4 Modulation agent

When learning the rollout policy (either by distillation or back-propagation), we learn to choose internal actions such that the simulated rollouts provide useful information to the agent. In these approaches, we do not change the environment model itself, which, by construction, aims to capture the true frequencies of possible outcomes. We can, however, go even one step further based on the following consideration: It might be beneficial for the agent to preferentially “imagine” extreme outcomes, e.g. rare (or even impossible) but highly rewarding or catastrophic transitions for a sequence of actions; hence to change the environment model itself in an informative way. For instance, in the game of MS_PACMAN, an agent might profit form imagining the ghosts moving in a particularly adversarial way, in order to choose actions safely. We can combine this consideration with the learning-to-query approach above, by learning an informative joint “imagination” distribution over actions and outcomes.

We implement this in the following way. First, we train an unconditional sSSM on environment transitions, i.e. a model that does not depend on the executed actions (this can simply be done by not providing the actions as inputs to the components of our state-space models). As a result, the sSSM has to jointly capture the uncertainty over the environment and the policy (the policy under which the training data was collected) in the latent variables

. This latent space is hence a compact, distributed representation over possible futures, i.e. trajectories, under

. We then let the I2A search over for informative trajectories, by replacing the learned prior module with a different distribution

. The model is fully differentiable and we simply backpropagate the policy gradients through the entire model; the remaining weights of the model are left unchanged, except for those of

. In our experiments, we simply replace the neural network parameterizing with a new one of the same size for , but with freshly initialized weights.

4 Results

Here, we apply the above models and agents to domains from the Arcade Learning Environment (ALE, Bellemare et al., 2013). In spite of significant progress (Hessel et al., 2017), some games are still considered very challenging environments for RL agents, e.g. MS_PACMAN, especially when not using any privileged information. All results are based on slightly cropped but full resolution ALE observations, i.e. .

4.1 Comparison of environment models

AR 1.9 —- 1.0
RAR -0.9 3.4 5.6 0.3 4.3 0.5 -4.7 12.2 2.0
dSSM-DET 0.4 0.0 3.5 0.2 0.4 0.3 -0.4 0.1 5.2
dSSM-VAE 0.5 0.0 5.0 1.3 2.4 3.0 0.7 0.0 5.2
sSSM 0.6 0.0 5.6 1.0 4.3 0.3 0.9 0.2 5.2
sSSM (jumpy) 3.0 2.0 13.6
Table 1:

Improvement of test likelihoods of environment models over a baseline model (standard variational autoencoder, VAE), on 4 different ALE domains. Stochastic models with state uncertainty (RAR, sSSM) outperform models without uncertainty representation. Furthermore, state-space models (dSSM, sSSM) show a substantial speed-up over auto-regressive models. Results are given as mean

standard deviation, in units of .

We applied auto-regressive and state-space models to four games of the ALE, namely BOWLING, CENTIPEDE, MS_PACMAN and SURROUND. These environment where chosen to cover a broad range of environment dynamics. The data was obtained by a running a pre-trained baseline policy and collecting sequences of actions, observations and rewards of length . Results are computed on held-out test data. We optimized model hyper-parameters (learning rate, weight decay, mini-batch size) on one game (MS_PACMAN) for each model separately and report mean likelihoods over five runs with the best hyper-parameter settings. In Tab. 1, we report likelihood improvements over a baseline model, being a Variational Autoencoder (VAE) that models frames as independent (conditioned on three initial frames).

In general, we found that, although operating on an abstract level, SSMs are competitive with, or even outperform, auto-regressive models. The sSSM, which take uncertainty into account, achieves consistently higher likelihoods in all games compared to models with deterministic state transitions, namely dSSM-DET and dSSM-VAE, in spite of having the same number of parameters and operations. An example from MS_PACMAN illustrating the differences in modelling capacity is shown in the Appendix: the prediction of dSSM-DET exhibits “sprite splitting” (and eventually, “sprite melting”) at corridors, whereas multiple samples from the sSSM show that the model has a reasonable and consistent representation of uncertainty in this situation.

We also report the relative computation time of rolling out, i.e. sampling from, the models. We observe that SSMs, which avoid computing pixel renderings at each rollout step, exhibit a speedup of over the standard AR model. We want to point out that our AR implementation is already quite efficient compared to a naive one, as it reuses costly vision pre-processing for rollouts where possible. Furthermore, we show that a jumpy sSSM, which learns a temporally and spatially abstracted state representation, is faster than the AR model by more than an order of magnitude, while exhibiting comparable performance as shown in Tab. 1. This shows that using an appropriate model architecture, we can learn highly predictive and compact dynamic state abstractions. Qualitatively, we observe that the best models capture the dynamics of ALE games well, even faithfully predicting global, yet subtle effects such as pixel representation of games scores over tens of steps (see figure in the Appendix)

4.2 RL with state-space models on MS_PACMAN

Here, we apply the I2A to a slightly simplified variant of the MS_PACMAN domain with five instead of eighteen actions. As environment models we use jumpy SSMs, since they exhibit a very favourable speed-accuracy trade-off as shown in the previous section; in fact I2As with AR models proved too expensive to run. In the following we compare the performance of I2As with different variants of SSMs, as well as various baselines. All agents we trained with an action repeat of 4 (Mnih et al., 2015). We report results in terms of averaged episode returns as a function of experience (in number of environment steps), averaged over the best hyper-parameter settings. All I2As do (equal to the number of actions) rollouts per time step. Rollout depth was treated as a hyper-parameter and varied over ; this corresponds to 24, 36 and 48 environment steps (due to action repeats and jumpy training), allowing I2As to plan over a substantial horizon. Learning curves for all agents with deterministic dSSMs are shown in Fig.3. Results and detailed discussion for agents with sSSMs can be found in the Appendix.

We first establish that all I2A agents, irrespective of the models they use, perform better than the model-free baseline agent; the latter is equivalent to an I2A without a model-based pathway. The improved performance of I2As is not simply due to having access to a larger number of input features: an I2A agent with an untrained environment model performs substantially worse (data not shown). A final baseline consists in using an I2A agent for which all imaginations are set to the initial state representation . The agent has the exact same architecture, number of weights (forward model excluded), and operations as the I2A agent (denoted as ”baseline copy model” in the figure legend). This agent performs substantially worse than the I2A agent, showing that environment rollouts lead to better decisions. It performs better however than the random model agent, which suggests that simply providing the initial state representation to the agent is already beneficial, emphasizing the usefulness of abstract dynamic state representations.

Figure 3: Learning curves of different agents on the MS_PACMAN environment. Model-based Imagination-Augmented Agents (I2As) outperform the model-free baseline by a large margin. Furthermore, learning the rollout policy , either by back-propagation or distillation provides the best results.

A surprising result is that I2As with the deterministic state-space models dSSM outperform their stochastic counterparts with sSSMs by a large margin. Although sSSMs capture the environment dynamics better than dSSM, learning from their outputs seems to be more challenging for the agents. We hypothesize that this could be due to the fact that we only produce only a small number of samples (5 in our simulations), resulting in highly variable features that are passed to the I2As.

For the agents with deterministic models, we find that a uniform random rollout policy is a strong baseline. It is outperformed by the distillation strategy, itself narrowly outperformed by the learning-to-query strategy. This demonstrates that “imagining” behaviors different from the agents’ policy can be beneficial for planning. Furthermore, we found that in general deeper rollouts with proved to outperfrom more shallow rollouts for all I2As with deterministic SSMs.

A final experiment consists of running the I2A agent with distillation, but instead of providing the abstract state features to the agent, we provide rendered pixel observations instead (as was done in Weber et al., 2017), and strengthen the summarizer (by adding convolutions). This model has to decode and re-encode observations at every imagination step, which makes it our slowest agent. We find that reasoning at pixel level eventually outperforms the copy and model-free baselines. It is however significantly outperformed by all variants of I2A which work at the abstract level, showing that the dynamics state abstractions, learned in an unsupervised way by a state-space model, are highly informative features about future outcomes, while being cheap to compute at the same time.

5 Related Work

Generative sequence models

We build directly on a plethora of recent work exploring the continuum of models ranging from standard recurrent neural networks (RNNs) to fully stochastic models with uncertainty (Chung et al., 2015; Archer et al., 2015; Fraccaro et al., 2016; Krishnan et al., 2015; Gu et al., 2015). Chung et al. (2015) explore a model class equivalent to what we called RARs here. Archer et al. (2015); Fraccaro et al. (2016) train stochastic state-space models, without however investigating their computational efficiency and their applicability to model-based RL. Most of the above work focuses on modelling music, speech or other low-dimensional data, whereas here we present stochastic sequence models on high-dimensional pixel-based observations; noteworthy exception are found in (Watter et al., 2015; Wahlström et al., 2015). There, the authors chose a two-stage approach by first learning a latent representation and then learning a transition model in this representation. Multiple studies investigate the graphical model structure of the prior and posterior graphs and stress the possible importance of smoothing over filtering inference distributions (e.g. Krishnan et al., 2015); in our investigations we did not find a difference between these distributions. Furthermore, to the best our knowledge, this is the first study applying stochastic state-space models as action-conditional environment models. Most previous work on learning simulators for ALE games apply deterministic models, and do not consider learning state-space models for efficient Monte-Carlo rollouts (Oh et al., 2015). Chiappa et al. (2017) successfully train deterministic state-space models for ALE modelling (largely equivalent to the considered dSSMs here); they however do not explore the computational complexity advantage of SSMs, and do not study RL applications of their models. Independently from our work, Babaeizadeh et al. (2018) develop a stochastic sequence model and illustrate its representational power compared to deterministic models, using an example similar to the one we present in the Appendix. Although designed for application in RL, they do not show RL results.

Model-based reinforcement learning

Most model-based RL with neural network models has previously focused on training the models on a given, compact state-representations. Directly learning models from pixels for RL is still an under-explored topic due to high demands on model accuracy and computational budget, but see (Finn & Levine, 2017; Watter et al., 2015; Wahlström et al., 2015). Finn & Levine (2017) train an action-conditional video-prediction network and use it for model-predictive control (MPC) of a robot arm. The applied model requires explicit pixel rendering for long-term predictions and does not operate in abstract space. Agrawal et al. (2016) propose to learn a forward and inverse dynamics model from pixels with applications to robotics. Our work is related to multiple approaches in RL which aim to implicitly learn a model on the environment using model-free methods. Tamar et al. (2016) propose an architecture that is designed to learn the value-iteration algorithm which requires knowledge about environment transitions. The Predictron is another implicit planning network, trained in a supervised way directly from raw pixels, mimicking Bellman updates / iterations (Silver et al., 2016b). A generalization of the Predictron to the controlled setting was introduced by Oh et al. (2017). Similar to these methods, our agent constructs an implicit plan; however, it uses an explicit environment model learned from sensory observations in an unsupervised fashion. Another approach, presented by Jaderberg et al. (2016), is to add auxiliary prediction losses to the RL training criterion in order to encourage implicit learning of environment dynamics. van Seijen et al. (2017) obtain state of the art performance on MS_PACMAN with a model free architecture, but they however rely on privileged information (object identity and positions, and decomposition of the reward function).

6 Discussion

We have shown that state-space models directly learned from raw pixel observations are good candidates for model-based RL: 1) they are powerful enough to capture complex environment dynamics, exhibiting similar accuracy to frame-auto-regressive models; 2) they allow for computationally efficient Monte-Carlo rollouts; 3) their learned dynamic state-representations are excellent features for evaluating and anticipating future outcomes compared to raw pixels. This enabled Imagination-Augemented Agents to outperform strong model-free baselines. On a conceptual level, we present (to the best of our knowledge) the first results on what we termed learning-to-query: We show a learning a rollout policy by backpropagating policy gradients leads to consistent (if modest) improvements.

Here, we adopted the I2A assumption of having access to a pre-trained envronment model. In future work, we plan to drop this assumption and jointly learn the model and the agent. Also, further speeding up environment models is a major direction of research; we think that learning models with the capacity of learning adaptive temporal abstractions is a particularly promising direction for achieving agents that plan to react flexibly to their environment.


Appendix A Details on environment models

a.1 Architectures

We show the structures the inference distributions of the models with latent variables in Fig. 4 and Fig. 5.

Figure 4: The architecture of the inference model for the dSSM-VAE.

Figure 5: The architecture of the inference model for the sSSM.

a.2 Detail in the observation model

For all models (auto-regressive and state-space), we interpret the three color channels of each pixel in the observation (with frame height and width ) as pseudo-probabilities; we score these using their KL divergence with model predictions. We model the reward with a separate distribution: we first compute a binary representation of the reward and model the coefficients as independent Bernoulli variables (conditioned on

). We also add two extra binary variables: one for the sign of the reward, and the indicator function of the reward being equal to


a.3 Details of neural network implementations

Here we show the concrete neural network layouts used to implement the sSSM. We first introduce three higher level build blocks:

  • a three layer deep convolutional stack , with kernel sizes and channels sizes , shown in Fig. 6;

  • a three layer deep residual convolutional stack with fixed sizes, shown in Fig. 7;

  • the Pool & Inject layer, shown in Fig. 8.

Based on these building blocks, we define all modules in Fig. 9 to Fig. 14.




Figure 6: Definition of the basic convolutional stack with kernel size parameters and channel parameters . Here, a box with the label denotes a convolution with a square kernel of size with

output channels; strides are always





Figure 7: Definition of the residual convolutional stack .


max pool



Figure 8: Definition of the Pool & Inject layer.


res conv


pool & inject

res conv

Figure 9: Transition module for computing the state transition function .


conv stack: (1, 32), (5, 32), (3, 64)

depth to space (2)

conv stack: (3, 64), (3, 64), (1, 48)

depth to space (4)






Figure 10:

Decoder module for computing the log-odds statistics of the Bernoulli distributions over the pixels

and the binary coefficients of the reward .


conv stack: (3, 32), (5, 32), (3, 64)

space to depth (2)

conv stack: (3, 16), (5, 16), (3, 64)

space to depth (4)

Figure 11: Encoder module computing an embedding of an observation (not including the reward).


conv stack: (1, 32), (3, 32), (3, 64)

Figure 12: Prior module for computing mean

and diagonal variance

of the normal distribution .


conv stack: (1, 32), (3, 32), (3, 64)

Figure 13: Posterior module for computing mean and diagonal variance of the normal distribution . The posterior gets as additional inputs the prior statistics .


conv stack: (1, 64), (3, 64), (3, 64)

Figure 14: Initial state module for computing the first initial state as a function of the embedding for of three previous observations.

a.4 Collection of training data

We train a standard DQN agents on the four games BOWLING, CENTIPEDE, MS PACMAN and SURROUND from the ALE as detailed by Mnih et al. (2015) using the original action space of 18 actions. After training, we collect a training set of and a test set of environment transitions for each game by executing the learned policies. Actions are represented by one-hot vectors and are tiled to yield convolutional feature maps of appropriate size. Pixel observations were cropped to pixels and normalized by 255 to lie in the unit cube . Because the DQN agent were trained with an action-repeat of four, we only model every fourth frame.

a.5 Training details

All models were optimized using Adam (Kingma & Ba, 2014) with a mini-batch size of 16.

Figure 15: Learning curves of the environment models on MS_PACMAN.

a.6 Comparison of deterministic and stochastic state-space models

We illustrate the difference in modelling capacity between deterministic (dSSM) and stochastic (sSSM) state-space models, by training both on a toy data set. It consists of small -pixel image sequences of a bouncing ball with a drift and a small random diffusion term. As shown in Fig. 16, after training, pixels rendered from the rollouts of a sSSM depict a plausible realization of a trajectory of the ball, whereas the dSSM produces blurry samples, as conditioned on any number of previously observed frames, the state of the ball is not entirely predictable due to diffusion. A dSSM (trained with an approximate maximum likelihood criterion, see above) will “hedge its bets” by producing a blurry prediction. A similar result can be observed in rollouts from models trained on ALE games, see Fig. 17.

Figure 16: Rollouts from a deterministic (dSSM, above) and a stochastic (sSSM, below) state-space model trained on a bouncing ball dataset with diffusion.


stochastic rollout # 2

stochastic rollout # 1
Figure 17: Two rollouts of length from a stochastic (sSSM, top two rows) and one rollout from a deterministic (dSSM) state-space model for the MS PACMAN environment, given the same initial frames and the same sequence of five actions.

Appendix B Appendix: Details on Agents

b.1 MS PACMAN environment variant

For the RL experiments in the paper, we consider a slightly simplified version of the MS PACMAN environment with only five actions (UP, LEFT, DOWN, RIGHT, NOOP). Furthermore, all agents have an action-repeat of four, and only observe every fourth frame from the environment.

b.2 Architecture

We re-implemented closely the agent architecture presented by Weber et al. (2017). In the following we list the changes in the architecture necessitated by the different environments and environment models.

Model-free baseline

The model-free baseline consisted of a four-layer CNN operating on with sizes (4, 2, 16), (8, 4, 32), (4, 2, 64) and (3, 1, 64), where donates a CNN layer with square kernel size , stride and output channels

; each CNN layer is followed by a relu nonlinearity. The output of the CNN is flatten and passed trough a fully-connected (FC) layer with 512 hidden units; the final output is a value function approximation and the logits of the policy at time


Imagination-Augmented Agent (I2A)

The model-free path consists of a CNN with the same size as the one of the model-free agent (including the FC layer with 512 units). The model-based path is designed as follows: The rollout outputs for each imagined time step are encoded with a two layer CNN with sizes (4, 1, 32) and (4, 1, 16), then flattened and passed to a fully-connected (FC) layer with 128 outputs. These rollout statistics are then summarized (in reversed order) with an LSTM with 256 hidden units and concatenated with the outputs of the model-free path.

Rollout policies

Trainable rollout policies that operate on the state are given by a two layer CNN with sizes (4, 1, 32) and (4, 1, 32), followed by an FC layer with 128 units. Pixel-based rollout policies have the same neural network sizes as the model-free baseline, except that the last two CNN layers have 32 feature maps each.

Appendix C Results I2A with stochastic state-space models

Learning curves for I2As with sSSMs are shown in Fig. 18. Both, I2As with learing-to-query and distillation rollout policies outperform a uniform random rollout policy. The learning-to-query agent shows weak initial performance, but eventually outperforms the other agents. This shows that learning-to-sample informative outcomes is beneficial for agent performance.

Figure 18: Results for I2A agents with stochastic state-space models on MS_PACMAN (the modulation agent is denoted as ”parasite” in the legend).