1 Introduction
Deep reinforcement learning has demonstrated remarkable progress in recent years, achieving high levels of performance across a wide array of challenging tasks, including Atari games (Mnih et al., 2015), locomotion (Schulman et al., 2015), and 3D navigation (Mnih et al., 2016). Many of these advances have relied on combining deep learning methods with model-free RL algorithms. A critical drawback of this approach is the vast amount of experience required to achieve good performance, as only weak prior knowledge is encoded in the agents’ networks (e.g., spatial translation invariance via convolutions).
The promise of model-based reinforcement learning is to improve sample-efficiency by making use of explicit models of the environment. The idea is that given a model of the environment (which can possibly be learned in the absence of rewards or from observational data only), an agent can learn task-specific policies rapidly by leveraging this model, e.g., by trajectory optimization (Betts, 1998), search (Browne et al., 2012; Silver et al., 2016a), dynamic programming (Bertsekas et al., 1995) or generating synthetic experiences (Sutton, 1991). However, model-based RL algorithms typically pose strong requirements on the environment models, namely that they make predictions about the future state of the environment efficiently and accurately.
In this paper we aim to address the challenge of learning accurate, computationally efficient models of complex domains and using them to solve RL problems. First, we advocate the use of computationally efficient state-space environment models that make predictions at a higher level of abstraction, both spatially and temporally, than at the level of raw pixel observations. Such models substantially reduce the amount of computation required to make predictions, as future states can be represented much more compactly. Second, in order to increase model accuracy, we examine the benefits of explicitly modeling uncertainty in state transitions. Finally, we demonstrate that the computational efficiency of state-space models enables us to apply them to challenging RL domains: extending a recent RL architecture (Weber et al., 2017), we propose an agent that learns to query a state-space model to anticipate outcomes of actions and aid decision making.
The main contributions of the paper are as follows: 1) we provide the first comparison of deterministic and stochastic, pixel-space and state-space models w.r.t. speed and accuracy, applied to challenging environments from the Arcade Learning Environment (ALE, Bellemare et al., 2013); 2) we demonstrate state-of-the-art environment modeling accuracy (as measured by log-likelihoods) with stochastic state-space models that efficiently produce diverse yet consistent rollouts; 3) using state-space models, we show model-based RL results on MS_PACMAN, and obtain significantly improved performance compared to strong model-free baselines; and 4) we show that learning to query the model further increases policy performance.
2 Environment models
In the following, for any sequence of variables , we use (or ) to denote all elements of the sequences up to , excluding (respectively including) . We write subsequences as . We consider an environment that outputs at each time step an observation and a reward . We also refer to the observations as pixels or frames, to give the intuition that they can be high-dimensional and highly redundant in many domains of interest. To ease the notation, in the following we will also write for the observations and rewards unless explicitly stated otherwise. Given action , the environment transitions into a new, unobserved state and returns a sample of the observation and reward at the next time step with probability . A main challenge in model-based RL is to learn a model of the environment that allows for computationally cheap and accurate predictions about the results of taking actions.
2.1 Model taxonomy
Figure 1: The graphical models representing the architectures of different environment models (panels: RAR, dSSM-DET, dSSM-VAE, sSSM). Boxes are deterministic nodes, circles are random variables and filled circles represent variables observed during training.
In the following, we discuss different environment models that can be learned in an unsupervised way from observations conditioned on actions . In particular, we will focus on how fast and accurately models can predict, at time step , some future statistics over a horizon that can later be used for decision making. We will simply call predictions and the rollout horizon or depth. Concretely, we will assume that we are interested at every time step in generating samples by doing Monte-Carlo rollouts of the model given an arbitrary sequence of actions (which will later be sampled from a rollout policy). The structure of the models we consider is illustrated in Fig. 1.
Autoregressive models
A straightforward choice is the family of temporally autoregressive models over the observations , which we write in the following way:
If is given by a first-in-first-out (FIFO) buffer of the last observations and actions , the above definition is a regular autoregressive model (of order ), which we denote by AR. Rolling out AR models is slow for two reasons: 1) we have to sequentially sample, or “render”, all pixels explicitly, which is particularly computationally demanding for high-dimensional observations, and 2) vanilla AR models without any additional structure do not reuse any computations from evaluating for evaluating . To speed up AR models, we address the latter concern by considering the following model variant: we allow to be a recurrent mapping that recursively updates sufficient statistics , therefore reusing the previously computed statistics . We call these models recurrent autoregressive models (RAR); if is parameterized as a neural network, RARs are equivalent to recurrent neural networks (RNNs). Although faster, we still expect Monte-Carlo rollouts of RARs to be slow, as they still need to explicitly render observations in order to make any predictions , which could be taken to be pixels or recurrent states .
State-space models: abstraction in space
As discussed above, rolling out ARs is computationally demanding as it requires sampling, or “rendering”, all observations . State-space models (SSMs) circumvent this by positing that there is a compact state representation that captures all essential aspects of the environment on an abstract level: it is assumed that can be “rolled out”, i.e. predicted, from the previous state and action alone, without the help of previous pixels or any action other than : . Furthermore, we assume that is sufficient to predict , i.e. . Hence SSMs allow for the following factorization of the predictive distribution:
where is the initial state distribution. This modelling choice implies that the latent states are, by construction, sufficient to generate any predictions . Hence, we never have to directly sample pixel observations.
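For reference, one way to write the factorization described above is sketched here. The symbols are assumptions introduced for illustration, since the original notation is not reproduced in this text: $s_t$ stands for the abstract state, $o_t$ for the observation (including the reward), $a_t$ for the action, and $\tau$ for the rollout horizon.

```latex
p(o_{t+1:t+\tau} \mid o_{\le t}, a_{<t+\tau})
  = \int p(s_t \mid o_{\le t}, a_{<t})
    \prod_{r=t+1}^{t+\tau} p(s_r \mid s_{r-1}, a_{r-1}) \, p(o_r \mid s_r)
    \; \mathrm{d}s_{t:t+\tau}
```

In this form the first factor plays the role of the initial state distribution, each transition factor depends only on the previous state and action, and each observation factor depends only on the current state, so a rollout can proceed in state space without sampling any $o_r$.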
Transition model
We consider two flavors of SSMs: deterministic SSMs (dSSMs) and stochastic SSMs (sSSMs). For dSSMs, the latent transition is a deterministic function of the past, whereas for sSSMs, we consider transition distributions that explicitly model uncertainty over the state . sSSMs are a strictly larger model class than dSSMs, and we illustrate their difference in capacity for modelling stochastic time series in the Appendix. We parameterize sSSMs by introducing for every a latent variable whose distribution depends on and , and by making the state a deterministic function of the past state, action, and latent variable:
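As a rough illustration of this parameterization, the sketch below implements one transition step with small dense layers on vectors. This is a toy stand-in, not the convolutional modules used in the paper: the function names, the weight layout, and the use of `tanh` are all assumptions made for the example. A latent is drawn from a diagonal normal whose parameters depend on the previous state and action, and the next state is a deterministic function of state, action and latent.

```python
import numpy as np

def prior(state, action, w_mu, w_sigma):
    """Hypothetical prior module: mean and std of the latent given (state, action)."""
    h = np.concatenate([state, action])
    mu = w_mu @ h
    sigma = np.exp(w_sigma @ h)  # exponentiate to keep the std positive
    return mu, sigma

def transition(state, action, latent, w_g):
    """Hypothetical deterministic map from (state, action, latent) to the next state."""
    h = np.concatenate([state, action, latent])
    return np.tanh(w_g @ h)

def sssm_step(state, action, params, rng, stochastic=True):
    """One step: sample the latent (stochastic) or use its mean (deterministic)."""
    mu, sigma = prior(state, action, params["w_mu"], params["w_sigma"])
    if stochastic:
        latent = mu + sigma * rng.standard_normal(mu.shape)
    else:
        latent = mu
    next_state = transition(state, action, latent, params["w_g"])
    return next_state, latent

def make_params(state_dim, action_dim, latent_dim, rng):
    """Random toy weights; the shapes are illustrative only."""
    in_dim = state_dim + action_dim
    return {
        "w_mu": 0.1 * rng.standard_normal((latent_dim, in_dim)),
        "w_sigma": 0.1 * rng.standard_normal((latent_dim, in_dim)),
        "w_g": 0.1 * rng.standard_normal((state_dim, in_dim + latent_dim)),
    }
```

With `stochastic=False` the same weights give a deterministic transition, since the mean of the latent is used in place of a sample.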
Observation model
The observation model, or decoder, computes the conditional distribution . It either takes as input the state (deterministic decoder), or the state and latent (stochastic decoder). For sSSMs, we always use the stochastic decoder. For dSSMs, we can use either the deterministic decoder (dSSM-DET), or the stochastic decoder (dSSM-VAE). The latter can capture joint uncertainty over pixels in a given observation , but not across time steps. The former is a fully deterministic model, incapable of modeling joint uncertainties (in time or in space). Further details can be found in the Appendix.
2.2 Jumpy models: abstraction in time
To further reduce the computational time required for sampling a rollout of horizon , we also consider modelling environment transitions at a coarser time scale. To this end, we subsample observations by a factor of , i.e. for , we replace sequences , by the subsampled sequence . We “chunk” the actions by concatenating them into a vector , and sum the rewards . We refer to models trained on data preprocessed in this way as jumpy models. Jumpy training is a convenient way to inject temporal abstraction over at a time scale into environment models. This approach allows us to further reduce the computational load for Monte-Carlo rollouts roughly by a factor of .
2.3 Model architectures, inference and training
Here, we describe the parametric architectures for the models outlined above. We discuss the architecture of the sSSM in detail, and then briefly explain the modifications of this model used to implement RARs and dSSMs.
The states , latent variables and observations are all shaped like convolutional feature maps and are generated by transition modules , and the decoder respectively. All latent variables are constrained to be normal with diagonal covariances. All modules consist of stacks of convolutional neural networks with ReLU nonlinearities. The transition modules use size-preserving convolutions, the decoder size-expanding ones. To overcome the limitations of small receptive fields associated with convolutions, for modelling global effects of the environment dynamics, we use pool-and-inject layers introduced by Weber et al. (2017): they perform max-pooling over their input feature maps, tile the results and concatenate them back to the inputs. Using these layers we can induce long-range spatial dependencies in the state . All modules are illustrated in detail in the Appendix.
We train the AR, RAR and dSSM-DET models by maximum likelihood estimation (MLE), i.e. by maximizing
over model parameters , where and denotes some initial context (in our experiments ). We initialize the state with a convolutional network including an observation encoder . This encoder uses convolutions that reduce the size of the feature maps from the size of the observation to the size of the state.
For the models containing latent variables, i.e. dSSM-VAE and sSSM, we cannot evaluate in closed form in general. We maximize instead the evidence lower bound , where denotes an approximate posterior distribution:
where now denotes the union of the model parameters and the parameters of . Here, we used the structure of the sSSM to assume without loss of generality that is Markovian in (see Krishnan et al. (2015) for an in-depth discussion). Furthermore, we restrict ourselves to the filtering distribution , which we model as a normal distribution with diagonal covariance matrix. We did not observe improvements in experiments by using the full smoothing distribution . We share parameters between the prior and the posterior by making the posterior a function of the state computed by the prior transition module , as follows:
The posterior uses the observation encoder on ; the resulting feature maps are then concatenated to , and a number of additional convolutions compute the posterior mean and standard deviation of . For all latent variable models, we use the reparameterized representation of the computation graph (Kingma & Welling, 2013; Rezende et al., 2014) and a single posterior sample to obtain unbiased gradient estimators of the ELBO.
We can restrict the above sSSM to a dSSM-VAE, by not feeding samples of into the transition model . To ensure a fair model comparison (identical number of parameters and same amount of computation), we numerically implement this by feeding the mean of into the transition function instead. If we also do not feed (but the mean ) into the decoder for rendering , we arrive at the dSSM-DET, which does not contain any samples of . We implement the RAR based on the dSSM-DET by modifying the transition model to , where denotes an encoder with the same architecture as the one of sSSM and dSSM-VAE.
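The pool-and-inject layer mentioned in this section (max-pool, tile, concatenate) can be sketched in a few lines. The version below is a minimal NumPy illustration under the assumption that the pooling is global over the spatial dimensions and that feature maps are laid out as (channels, height, width); the layers used in the paper may differ in these details.

```python
import numpy as np

def pool_and_inject(x):
    """Max-pool each channel globally, tile the result, and concatenate to the input.

    x: feature maps of shape (channels, height, width).
    Returns an array of shape (2 * channels, height, width).
    """
    c, h, w = x.shape
    pooled = x.max(axis=(1, 2), keepdims=True)   # (c, 1, 1): one value per channel
    tiled = np.broadcast_to(pooled, (c, h, w))   # repeat it at every spatial position
    return np.concatenate([x, tiled], axis=0)    # original maps plus global summary
```

After such a layer, every spatial location has access to a summary of the whole feature map, which is how a stack of local convolutions can pick up long-range dependencies.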
3 RL agents with state-space models
Here we discuss how we can use a state-space model to help solve RL problems. A naive approach would be e.g. the following: given a perfect model and unlimited computational resources, an agent could in principle perform a brute-force search for the optimal open-loop policy in any state by computing (assuming undiscounted reward over a finite horizon up to ), where is the expectation under the environment model . In practice, however, this optimization is costly and brittle. Quite generally, it has been observed that model-based planning often leads to catastrophic outcomes given unavoidable imperfections of when modelling complex environments (Talvitie, 2015).
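To make the cost of this naive approach concrete, the following sketch enumerates every open-loop action sequence of a given length and scores it with a model. The model here (`toy_step`, a walk on a line) is a hypothetical deterministic stand-in, so the expectation reduces to a single rollout; a stochastic model would additionally require averaging over sampled rollouts.

```python
import itertools

def rollout_return(model_step, state, actions):
    """Undiscounted return of an action sequence under the model."""
    total = 0.0
    for action in actions:
        state, reward = model_step(state, action)
        total += reward
    return total

def brute_force_open_loop(model_step, state, action_set, horizon):
    """Score all len(action_set) ** horizon action sequences and keep the best one."""
    best_actions, best_return = None, float("-inf")
    for actions in itertools.product(action_set, repeat=horizon):
        ret = rollout_return(model_step, state, actions)
        if ret > best_return:
            best_actions, best_return = actions, ret
    return best_actions, best_return

def toy_step(position, action):
    """Hypothetical model: move along a line, reward 1 whenever position 3 is reached."""
    new_position = position + action
    return new_position, 1.0 if new_position == 3 else 0.0
```

The number of candidate sequences grows exponentially with the horizon, which is one reason the search is costly; and because each candidate is scored purely by the model, any model error feeds directly into the chosen plan.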
Recently, Weber et al. (2017) proposed to combine model-free and model-based methods to increase robustness to model imperfections: the Imagination-Augmented Agent (I2A) queries its internal, pre-trained model via Monte-Carlo rollouts under a rollout policy. It then uses features (called imaginations) computed from these rollouts to anticipate the outcomes of taking different actions, thereby informing its decision-making. RL is used to learn to interpret the model’s predictions; this was shown to greatly diminish the susceptibility of planning to model imperfections.
In the following, we briefly recapitulate the I2A architecture and discuss how it can be extended to query state-space environment models.
3.1 Imagination-Augmented Agent
We briefly describe the agent, which is illustrated in Fig. 2; for details see Weber et al. (2017). The I2A is an RL agent with an actor-critic architecture, i.e. at each time step , it explicitly computes its policy over the next action to take and an approximate value function , and it is trained using standard policy gradient methods (Mnih et al., 2016). Its policy and value function are informed by the outputs of two separate pathways: 1) a model-free path, which tries to estimate the value and which action to take directly from the latest observation using a convolutional neural network (CNN); and 2) a model-based path, which we describe in the next paragraph.
The model-based path of an I2A is designed in the following way. The I2A is endowed with a pre-trained, fixed environment model . At every time , conditioned on past observations and actions , it uses the model to simulate possible futures (“rollouts”) represented by some features, so-called imaginations, over a horizon , under a rollout policy . It then extracts information from the rollout imaginations , and uses it, together with the results from the model-free path, to compute and . It has been shown that I2As are robust to model imperfections: they learn to interpret imaginations produced from the internal models in order to inform decision making as part of standard return maximization. More precisely, the model-based path is computed by executing the following steps (also see Fig. 2):
1) The I2A updates the state of its internal model by sampling from the initial model distribution . We denote this sample to clearly indicate the real environment information is contained in that sample up to time .
2) The I2A draws samples from the distribution . Here, denotes the model distribution with internal actions being sampled from the rollout policy . For SSMs, we require the rollout policy to only depend on the state so that rollouts can be computed purely in abstract space.
3) The imaginations are summarized by a “summarizer” module (e.g. an LSTM), then combined with the model-free output and finally used to compute and .
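The steps above can be sketched as follows for a state-space model, with all rollouts computed in abstract state space. This is a schematic illustration only: `model_step`, `rollout_policy` and `summarizer` are hypothetical callables supplied by the caller, the initial state is assumed to have already been sampled from the model (step 1), and the summarizer is left abstract rather than being an LSTM.

```python
import numpy as np

def imagine(state, model_step, rollout_policy, depth, rng):
    """Step 2: roll the model forward from `state`, sampling internal actions."""
    imaginations = []
    for _ in range(depth):
        probs = rollout_policy(state)               # depends on the state only
        action = rng.choice(len(probs), p=probs)    # sampled internal action
        state = model_step(state, action)           # next abstract state
        imaginations.append(state)
    return np.stack(imaginations)                   # (depth, state_dim)

def model_based_path(initial_state, model_step, rollout_policy, summarizer,
                     num_rollouts, depth, rng):
    """Step 3 (first half): summarize each rollout and concatenate the summaries."""
    summaries = [
        summarizer(imagine(initial_state, model_step, rollout_policy, depth, rng))
        for _ in range(num_rollouts)
    ]
    return np.concatenate(summaries)
```

The returned feature vector would then be concatenated with the output of the model-free path before the policy and value heads; that final step is omitted here.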
Which imaginations the model predicts and passes to the agent is a design choice, which strongly depends on the model itself. For autoregressive models (AR, RAR), we choose the imaginations to be rendered pixel predictions . For SSMs, we are free to use predicted pixels or predicted abstract states as imaginations, the latter being much cheaper to compute.
Apart from the choice of environment model, a key ingredient of I2As is the choice of internal actions applied to the model. How to best design a rollout policy that extracts useful information from a given environment model remains an open question, which also depends on the choice of model itself. In the following, we investigate different possibilities.
3.2 Distillation
In Weber et al. (2017), the authors propose to train the rollout policy to imitate the agent’s model-based behavioral policy . We call the resulting agent the distillation agent. Concretely, we minimize the Kullback-Leibler divergence between and :
where is the expectation over states and actions when following policy . is a hyperparameter that trades off reward maximization against the distillation loss.
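A minimal sketch of such a distillation term is given below for categorical policies represented by logits. Two points are assumptions of this example rather than statements about the paper: the divergence is taken from the agent's policy to the rollout policy, and the expectation over visited states is approximated by a batch mean. The name `weight` stands in for the trade-off hyperparameter.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax along the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for categorical distributions along the last axis."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def distillation_loss(agent_logits, rollout_logits, weight):
    """Weighted batch-mean KL from the agent policy to the rollout policy."""
    p = softmax(agent_logits)      # agent's behavioral policy (the target)
    q = softmax(rollout_logits)    # rollout policy being trained
    return weight * kl_divergence(p, q).mean()
```

In training, this term would be added to the RL loss, with gradients flowing only into the rollout policy's parameters.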
3.3 Learning to Query by Backpropagation
An obvious alternative to distillation is to learn the parameters of jointly with the other parameters of the agents by policy gradient methods. As the rollout actions sampled from are discrete random variables, this optimization would require “internal” RL, i.e. redefining the action space to include the internal actions and learning a joint policy over external and internal actions. However, we expect the credit assignment of the rewards to the internal actions to be a difficult problem, resulting in slow learning. Therefore, we take a heuristic approach similar to Henaff et al. (2017) (and related to Bengio et al., 2013): instead of feeding the sampled one-hot environment action to the model, we directly feed the probability vector into the environment model during rollouts. This can be considered as a relaxation of the discrete internal RL optimization problem. Concretely, we backpropagate the RL policy gradients through the entire rollout into . This is possible since the environment model is fully differentiable thanks to the reparametrization trick, and the simulation policy is differentiable thanks to the relaxation of discrete actions. Parameters of the environment model are not optimized but kept constant, however. As the model was only trained on one-hot representations , and not on continuous action probabilities, it is not guaranteed a priori that the model generalizes appropriately. We explore encouraging rollout probabilities to be close to one-hot action vectors, and therefore numerically closer to the training data of the model, by introducing an entropy penalty.
3.4 Modulation agent
When learning the rollout policy (either by distillation or backpropagation), we learn to choose internal actions such that the simulated rollouts provide useful information to the agent. In these approaches, we do not change the environment model itself, which, by construction, aims to capture the true frequencies of possible outcomes. We can, however, go even one step further based on the following consideration: it might be beneficial for the agent to preferentially “imagine” extreme outcomes, e.g. rare (or even impossible) but highly rewarding or catastrophic transitions for a sequence of actions; hence to change the environment model itself in an informative way. For instance, in the game of MS_PACMAN, an agent might profit from imagining the ghosts moving in a particularly adversarial way, in order to choose actions safely. We can combine this consideration with the learning-to-query approach above, by learning an informative joint “imagination” distribution over actions and outcomes.
We implement this in the following way. First, we train an unconditional sSSM on environment transitions, i.e. a model that does not depend on the executed actions (this can simply be done by not providing the actions as inputs to the components of our state-space models). As a result, the sSSM has to jointly capture the uncertainty over the environment and the policy (the policy under which the training data was collected) in the latent variables . This latent space is hence a compact, distributed representation over possible futures, i.e. trajectories, under . We then let the I2A search over for informative trajectories, by replacing the learned prior module with a different distribution. The model is fully differentiable and we simply backpropagate the policy gradients through the entire model; the remaining weights of the model are left unchanged, except for those of . In our experiments, we simply replace the neural network parameterizing with a new one of the same size for , but with freshly initialized weights.
4 Results
Here, we apply the above models and agents to domains from the Arcade Learning Environment (ALE, Bellemare et al., 2013). In spite of significant progress (Hessel et al., 2017), some games are still considered very challenging environments for RL agents, e.g. MS_PACMAN, especially when not using any privileged information. All results are based on slightly cropped but full resolution ALE observations, i.e. .
4.1 Comparison of environment models
Model        | BOWLING   | CENTIPEDE | MS_PACMAN | SURROUND   | rel. speed
AR           | –         | –         | 1.9 ± –   | –          | 1.0
RAR          | 0.9 ± 3.4 | 5.6 ± 0.3 | 4.3 ± 0.5 | 4.7 ± 12.2 | 2.0
dSSM-DET     | 0.4 ± 0.0 | 3.5 ± 0.2 | 0.4 ± 0.3 | 0.4 ± 0.1  | 5.2
dSSM-VAE     | 0.5 ± 0.0 | 5.0 ± 1.3 | 2.4 ± 3.0 | 0.7 ± 0.0  | 5.2
sSSM         | 0.6 ± 0.0 | 5.6 ± 1.0 | 4.3 ± 0.3 | 0.9 ± 0.2  | 5.2
sSSM (jumpy) | –         | –         | 3.0 ± 2.0 | –          | 13.6
Table 1: Improvement of test likelihoods of environment models over a baseline model (standard variational autoencoder, VAE), on 4 different ALE domains. Stochastic models with state uncertainty (RAR, sSSM) outperform models without uncertainty representation. Furthermore, state-space models (dSSM, sSSM) show a substantial speed-up over autoregressive models. Results are given as mean ± standard deviation, in units of .
We applied autoregressive and state-space models to four games of the ALE, namely BOWLING, CENTIPEDE, MS_PACMAN and SURROUND. These environments were chosen to cover a broad range of environment dynamics. The data was obtained by running a pre-trained baseline policy and collecting sequences of actions, observations and rewards of length . Results are computed on held-out test data. We optimized model hyperparameters (learning rate, weight decay, minibatch size) on one game (MS_PACMAN) for each model separately and report mean likelihoods over five runs with the best hyperparameter settings. In Tab. 1, we report likelihood improvements over a baseline model, a Variational Autoencoder (VAE) that models frames as independent (conditioned on three initial frames).
In general, we found that, although operating on an abstract level, SSMs are competitive with, or even outperform, autoregressive models. The sSSM, which takes uncertainty into account, achieves consistently higher likelihoods in all games compared to models with deterministic state transitions, namely dSSM-DET and dSSM-VAE, in spite of having the same number of parameters and operations. An example from MS_PACMAN illustrating the differences in modelling capacity is shown in the Appendix: the prediction of dSSM-DET exhibits “sprite splitting” (and eventually, “sprite melting”) at corridors, whereas multiple samples from the sSSM show that the model has a reasonable and consistent representation of uncertainty in this situation.
We also report the relative computation time of rolling out, i.e. sampling from, the models. We observe that SSMs, which avoid computing pixel renderings at each rollout step, exhibit a speedup of more than 5 over the standard AR model (relative speed 5.2 in Tab. 1). We want to point out that our AR implementation is already quite efficient compared to a naive one, as it reuses costly vision preprocessing for rollouts where possible. Furthermore, we show that a jumpy sSSM, which learns a temporally and spatially abstracted state representation, is faster than the AR model by more than an order of magnitude, while exhibiting comparable performance as shown in Tab. 1. This shows that, using an appropriate model architecture, we can learn highly predictive and compact dynamic state abstractions. Qualitatively, we observe that the best models capture the dynamics of ALE games well, even faithfully predicting global, yet subtle effects such as the pixel representation of game scores over tens of steps (see figure in the Appendix).
4.2 RL with state-space models on MS_PACMAN
Here, we apply the I2A to a slightly simplified variant of the MS_PACMAN domain with five instead of eighteen actions. As environment models we use jumpy SSMs, since they exhibit a very favourable speed-accuracy trade-off as shown in the previous section; in fact I2As with AR models proved too expensive to run. In the following we compare the performance of I2As with different variants of SSMs, as well as various baselines. All agents were trained with an action repeat of 4 (Mnih et al., 2015). We report results in terms of averaged episode returns as a function of experience (in number of environment steps), averaged over the best hyperparameter settings. All I2As perform five rollouts per time step (equal to the number of actions). Rollout depth was treated as a hyperparameter and varied over ; this corresponds to 24, 36 and 48 environment steps (due to action repeats and jumpy training), allowing I2As to plan over a substantial horizon. Learning curves for all agents with deterministic dSSMs are shown in Fig. 3. Results and detailed discussion for agents with sSSMs can be found in the Appendix.
We first establish that all I2A agents, irrespective of the models they use, perform better than the model-free baseline agent; the latter is equivalent to an I2A without a model-based pathway. The improved performance of I2As is not simply due to having access to a larger number of input features: an I2A agent with an untrained environment model performs substantially worse (data not shown). A final baseline consists of using an I2A agent for which all imaginations are set to the initial state representation . The agent has the exact same architecture, number of weights (forward model excluded), and operations as the I2A agent (denoted as “baseline copy model” in the figure legend). This agent performs substantially worse than the I2A agent, showing that environment rollouts lead to better decisions. It performs better, however, than the random model agent, which suggests that simply providing the initial state representation to the agent is already beneficial, emphasizing the usefulness of abstract dynamic state representations.
A surprising result is that I2As with the deterministic state-space models (dSSMs) outperform their stochastic counterparts with sSSMs by a large margin. Although sSSMs capture the environment dynamics better than dSSMs, learning from their outputs seems to be more challenging for the agents. We hypothesize that this could be due to the fact that we produce only a small number of samples (5 in our simulations), resulting in highly variable features that are passed to the I2As.
For the agents with deterministic models, we find that a uniform random rollout policy is a strong baseline. It is outperformed by the distillation strategy, itself narrowly outperformed by the learning-to-query strategy. This demonstrates that “imagining” behaviors different from the agent’s policy can be beneficial for planning. Furthermore, we found that in general deeper rollouts with outperform shallower rollouts for all I2As with deterministic SSMs.
A final experiment consists of running the I2A agent with distillation, but instead of providing the abstract state features to the agent, we provide rendered pixel observations (as was done in Weber et al., 2017), and strengthen the summarizer (by adding convolutions). This model has to decode and re-encode observations at every imagination step, which makes it our slowest agent. We find that reasoning at pixel level eventually outperforms the copy and model-free baselines. It is however significantly outperformed by all variants of I2A which work at the abstract level, showing that the dynamic state abstractions, learned in an unsupervised way by a state-space model, are highly informative features about future outcomes, while being cheap to compute at the same time.
5 Related Work
Generative sequence models
We build directly on a plethora of recent work exploring the continuum of models ranging from standard recurrent neural networks (RNNs) to fully stochastic models with uncertainty (Chung et al., 2015; Archer et al., 2015; Fraccaro et al., 2016; Krishnan et al., 2015; Gu et al., 2015). Chung et al. (2015) explore a model class equivalent to what we called RARs here. Archer et al. (2015); Fraccaro et al. (2016) train stochastic state-space models, without however investigating their computational efficiency and their applicability to model-based RL. Most of the above work focuses on modelling music, speech or other low-dimensional data, whereas here we present stochastic sequence models on high-dimensional pixel-based observations; noteworthy exceptions are found in (Watter et al., 2015; Wahlström et al., 2015). There, the authors chose a two-stage approach by first learning a latent representation and then learning a transition model in this representation. Multiple studies investigate the graphical model structure of the prior and posterior graphs and stress the possible importance of smoothing over filtering inference distributions (e.g. Krishnan et al., 2015); in our investigations we did not find a difference between these distributions. Furthermore, to the best of our knowledge, this is the first study applying stochastic state-space models as action-conditional environment models. Most previous work on learning simulators for ALE games applies deterministic models, and does not consider learning state-space models for efficient Monte-Carlo rollouts (Oh et al., 2015). Chiappa et al. (2017) successfully train deterministic state-space models for ALE modelling (largely equivalent to the dSSMs considered here); they however do not explore the computational complexity advantage of SSMs, and do not study RL applications of their models. Independently from our work, Babaeizadeh et al. (2018) develop a stochastic sequence model and illustrate its representational power compared to deterministic models, using an example similar to the one we present in the Appendix. Although their model is designed for application in RL, they do not show RL results.
Model-based reinforcement learning
Most model-based RL with neural network models has previously focused on training the models on given, compact state representations. Directly learning models from pixels for RL is still an under-explored topic due to high demands on model accuracy and computational budget, but see (Finn & Levine, 2017; Watter et al., 2015; Wahlström et al., 2015). Finn & Levine (2017) train an action-conditional video-prediction network and use it for model-predictive control (MPC) of a robot arm. The applied model requires explicit pixel rendering for long-term predictions and does not operate in abstract space. Agrawal et al. (2016) propose to learn a forward and inverse dynamics model from pixels with applications to robotics. Our work is related to multiple approaches in RL which aim to implicitly learn a model of the environment using model-free methods. Tamar et al. (2016) propose an architecture that is designed to learn the value-iteration algorithm, which requires knowledge about environment transitions. The Predictron is another implicit planning network, trained in a supervised way directly from raw pixels, mimicking Bellman updates / iterations (Silver et al., 2016b). A generalization of the Predictron to the controlled setting was introduced by Oh et al. (2017). Similar to these methods, our agent constructs an implicit plan; however, it uses an explicit environment model learned from sensory observations in an unsupervised fashion. Another approach, presented by Jaderberg et al. (2016), is to add auxiliary prediction losses to the RL training criterion in order to encourage implicit learning of environment dynamics. van Seijen et al. (2017) obtain state-of-the-art performance on MS_PACMAN with a model-free architecture, but they rely on privileged information (object identity and positions, and decomposition of the reward function).
6 Discussion
We have shown that state-space models directly learned from raw pixel observations are good candidates for model-based RL: 1) they are powerful enough to capture complex environment dynamics, exhibiting similar accuracy to frame-autoregressive models; 2) they allow for computationally efficient Monte Carlo rollouts; 3) their learned dynamic state representations are excellent features for evaluating and anticipating future outcomes compared to raw pixels. This enabled Imagination-Augmented Agents to outperform strong model-free baselines. On a conceptual level, we present (to the best of our knowledge) the first results on what we termed learning-to-query: we show that learning a rollout policy by backpropagating policy gradients leads to consistent (if modest) improvements.
Here, we adopted the I2A assumption of having access to a pretrained environment model. In future work, we plan to drop this assumption and jointly learn the model and the agent. Also, further speeding up environment models is a major direction of research; we think that models with the capacity to learn adaptive temporal abstractions are a particularly promising direction for achieving agents that plan to react flexibly to their environment.
References
 Agrawal et al. (2016) Agrawal, Pulkit, Nair, Ashvin V, Abbeel, Pieter, Malik, Jitendra, and Levine, Sergey. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pp. 5074–5082, 2016.
 Archer et al. (2015) Archer, Evan, Park, Il Memming, Buesing, Lars, Cunningham, John, and Paninski, Liam. Black box variational inference for state space models. arXiv preprint arXiv:1511.07367, 2015.
 Babaeizadeh et al. (2018) Babaeizadeh, Mohammad, Finn, Chelsea, Erhan, Dumitru, Campbell, Roy H., and Levine, Sergey. Stochastic variational video prediction. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rk49MgCW.
 Bellemare et al. (2013) Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An evaluation platform for general agents. J. Artif. Intell. Res.(JAIR), 47:253–279, 2013.
 Bengio et al. (2013) Bengio, Yoshua, Léonard, Nicholas, and Courville, Aaron. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
 Bertsekas et al. (1995) Bertsekas, Dimitri P. Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 1995.
 Betts (1998) Betts, John T. Survey of numerical methods for trajectory optimization. Journal of Guidance control and dynamics, 21(2):193–207, 1998.
 Browne et al. (2012) Browne, Cameron B, Powley, Edward, Whitehouse, Daniel, Lucas, Simon M, Cowling, Peter I, Rohlfshagen, Philipp, Tavener, Stephen, Perez, Diego, Samothrakis, Spyridon, and Colton, Simon. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
 Chiappa et al. (2017) Chiappa, Silvia, Racaniere, Sébastien, Wierstra, Daan, and Mohamed, Shakir. Recurrent environment simulators. arXiv preprint arXiv:1704.02254, 2017.
 Chung et al. (2015) Chung, Junyoung, Kastner, Kyle, Dinh, Laurent, Goel, Kratarth, Courville, Aaron C, and Bengio, Yoshua. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pp. 2980–2988, 2015.
 Finn & Levine (2017) Finn, Chelsea and Levine, Sergey. Deep visual foresight for planning robot motion. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 2786–2793. IEEE, 2017.
 Fraccaro et al. (2016) Fraccaro, Marco, Sønderby, Søren Kaae, Paquet, Ulrich, and Winther, Ole. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pp. 2199–2207, 2016.
 Gu et al. (2015) Gu, Shixiang, Ghahramani, Zoubin, and Turner, Richard E. Neural adaptive sequential monte carlo. In Advances in Neural Information Processing Systems, pp. 2629–2637, 2015.
 Henaff et al. (2017) Henaff, Mikael, Whitney, William F, and LeCun, Yann. Model-based planning in discrete action spaces. arXiv preprint arXiv:1705.07177, 2017.
 Hessel et al. (2017) Hessel, Matteo, Modayil, Joseph, van Hasselt, Hado, Schaul, Tom, Ostrovski, Georg, Dabney, Will, Horgan, Dan, Piot, Bilal, Azar, Mohammad, and Silver, David. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298, 2017.
 Jaderberg et al. (2016) Jaderberg, Max, Mnih, Volodymyr, Czarnecki, Wojciech Marian, Schaul, Tom, Leibo, Joel Z, Silver, David, and Kavukcuoglu, Koray. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Krishnan et al. (2015) Krishnan, Rahul G, Shalit, Uri, and Sontag, David. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.
 Mnih et al. (2016) Mnih, V., Puigdomenech Badia, A., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. ArXiv preprint arXiv:1602.01783, 2016.
 Mnih et al. (2015) Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Oh et al. (2015) Oh, Junhyuk, Guo, Xiaoxiao, Lee, Honglak, Lewis, Richard L, and Singh, Satinder. Action-conditional video prediction using deep networks in atari games. In Advances in Neural Information Processing Systems, pp. 2863–2871, 2015.
 Oh et al. (2017) Oh, Junhyuk, Singh, Satinder, and Lee, Honglak. Value prediction network. arXiv preprint arXiv:1707.03497, 2017.
 Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

 Schulman et al. (2015) Schulman, John, Levine, Sergey, Abbeel, Pieter, Jordan, Michael, and Moritz, Philipp. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1889–1897, 2015.
 Silver et al. (2016a) Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016a.
 Silver et al. (2016b) Silver, David, van Hasselt, Hado, Hessel, Matteo, Schaul, Tom, Guez, Arthur, Harley, Tim, Dulac-Arnold, Gabriel, Reichert, David, Rabinowitz, Neil, Barreto, Andre, et al. The predictron: End-to-end learning and planning. arXiv preprint arXiv:1612.08810, 2016b.
 Sutton (1991) Sutton, Richard S. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
 Talvitie (2015) Talvitie, Erik. Agnostic system identification for monte carlo planning. In AAAI, pp. 2986–2992, 2015.
 Tamar et al. (2016) Tamar, Aviv, Wu, Yi, Thomas, Garrett, Levine, Sergey, and Abbeel, Pieter. Value iteration networks. In Advances in Neural Information Processing Systems, pp. 2154–2162, 2016.
 van Seijen et al. (2017) van Seijen, Harm, Fatemi, Mehdi, Romoff, Joshua, Laroche, Romain, Barnes, Tavian, and Tsang, Jeffrey. Hybrid reward architecture for reinforcement learning. arXiv preprint arXiv:1706.04208, 2017.
 Wahlström et al. (2015) Wahlström, Niklas, Schön, Thomas B, and Deisenroth, Marc Peter. From pixels to torques: Policy learning with deep dynamical models. arXiv preprint arXiv:1502.02251, 2015.
 Watter et al. (2015) Watter, Manuel, Springenberg, Jost, Boedecker, Joschka, and Riedmiller, Martin. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pp. 2746–2754, 2015.
 Weber et al. (2017) Weber, Théophane, Racanière, Sébastien, Reichert, David P, Buesing, Lars, Guez, Arthur, Rezende, Danilo Jimenez, Badia, Adria Puigdomènech, Vinyals, Oriol, Heess, Nicolas, Li, Yujia, et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.
Appendix A Details on environment models
a.1 Architectures
a.2 Detail in the observation model
For all models (autoregressive and state-space), we interpret the three color channels of each pixel in the observation (with frame height and width) as pseudo-probabilities; we score these using their KL divergence with model predictions. We model the reward with a separate distribution: we first compute a binary representation of the reward and model the coefficients as independent Bernoulli variables (conditioned on ). We also add two extra binary variables: one for the sign of the reward, and the indicator function of the reward being equal to .
a.3 Details of neural network implementations
Here we show the concrete neural network layouts used to implement the sSSM. We first introduce three higher-level building blocks:

a three-layer deep convolutional stack, with kernel sizes and channel sizes, shown in Fig. 6;

a three-layer deep residual convolutional stack with fixed sizes, shown in Fig. 7;

the Pool & Inject layer, shown in Fig. 8.
Based on these building blocks, we define all modules in Fig. 9 to Fig. 14.
a.4 Collection of training data
We train standard DQN agents on the four games BOWLING, CENTIPEDE, MS PACMAN and SURROUND from the ALE, as detailed by Mnih et al. (2015), using the original action space of 18 actions. After training, we collect a training set of and a test set of environment transitions for each game by executing the learned policies. Actions are represented by one-hot vectors and are tiled to yield convolutional feature maps of appropriate size. Pixel observations were cropped to pixels and normalized by 255 to lie in the unit cube. Because the DQN agents were trained with an action-repeat of four, we only model every fourth frame.
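The preprocessing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the experiments: the function names are hypothetical, the frame size in the example is arbitrary (the crop size is not recoverable from the text), and which of every four frames is kept is an assumption.

```python
import numpy as np

NUM_ACTIONS = 18  # size of the original ALE action space, as stated in the text


def normalize_frame(frame_uint8):
    """Map a uint8 RGB frame of shape (H, W, 3) into the unit cube [0, 1]."""
    return frame_uint8.astype(np.float32) / 255.0


def tile_action(action, height, width, num_actions=NUM_ACTIONS):
    """Turn an integer action into a one-hot vector tiled over an (H, W) grid.

    The result has shape (height, width, num_actions), so it can be
    concatenated with convolutional feature maps along the channel axis.
    """
    one_hot = np.zeros(num_actions, dtype=np.float32)
    one_hot[action] = 1.0
    return np.broadcast_to(one_hot, (height, width, num_actions)).copy()


def subsample_frames(frames, action_repeat=4):
    """Keep every `action_repeat`-th frame (assumption: starting at index 0)."""
    return frames[::action_repeat]


# Hypothetical 8x8 frame, used only to show how the pieces fit together.
frame = np.full((8, 8, 3), 255, dtype=np.uint8)
model_input = np.concatenate(
    [normalize_frame(frame), tile_action(3, 8, 8)], axis=-1
)  # shape (8, 8, 3 + 18)
```

Tiling the one-hot vector over the spatial grid is one simple way to feed a discrete action into a convolutional network, since every spatial location then sees the same action code.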
a.5 Training details
All models were optimized using Adam (Kingma & Ba, 2014) with a minibatch size of 16.
a.6 Comparison of deterministic and stochastic state-space models
We illustrate the difference in modelling capacity between deterministic (dSSM) and stochastic (sSSM) state-space models by training both on a toy data set. It consists of small pixel image sequences of a bouncing ball with a drift and a small random diffusion term. As shown in Fig. 16, after training, pixels rendered from the rollouts of an sSSM depict a plausible realization of a trajectory of the ball, whereas the dSSM produces blurry samples. The reason is that, conditioned on any number of previously observed frames, the state of the ball is not entirely predictable due to diffusion, so a dSSM (trained with an approximate maximum likelihood criterion, see above) will “hedge its bets” by producing a blurry prediction. A similar result can be observed in rollouts from models trained on ALE games, see Fig. 17.
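A toy data set of this kind could be generated along the following lines. This is only a sketch: the image size, sequence length, drift, diffusion scale, and the single-pixel rendering of the ball are all illustrative assumptions, since the text does not specify them.

```python
import numpy as np


def bouncing_ball_sequence(num_frames=20, size=16, drift=(1.0, 0.5),
                           diffusion=0.3, seed=0):
    """Render a (num_frames, size, size) sequence of a drifting, diffusing ball.

    All default values are illustrative assumptions, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    pos = rng.uniform(0, size - 1, size=2)
    vel = np.array(drift, dtype=np.float64)
    frames = np.zeros((num_frames, size, size), dtype=np.float32)
    for t in range(num_frames):
        row, col = np.rint(pos).astype(int)
        frames[t, row, col] = 1.0  # the ball is drawn as a single bright pixel
        # Deterministic drift plus a small random diffusion term.
        pos = pos + vel + diffusion * rng.normal(size=2)
        # Reflect at the walls: mirror the position and flip the velocity.
        for d in range(2):
            if pos[d] < 0:
                pos[d], vel[d] = -pos[d], -vel[d]
            elif pos[d] > size - 1:
                pos[d], vel[d] = 2 * (size - 1) - pos[d], -vel[d]
        pos = np.clip(pos, 0, size - 1)  # safeguard against rare large noise
    return frames


frames = bouncing_ball_sequence()
```

Because of the diffusion term, two sequences that share the same past can diverge in the future, which is the kind of uncertainty a deterministic model cannot represent with a single sharp prediction.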
Appendix B Details on agents
b.1 MS PACMAN environment variant
For the RL experiments in the paper, we consider a slightly simplified version of the MS PACMAN environment with only five actions (UP, LEFT, DOWN, RIGHT, NOOP). Furthermore, all agents have an action-repeat of four and only observe every fourth frame from the environment.
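The environment variant above can be pictured as a thin wrapper around the underlying game. The sketch below is hypothetical: the `step(action)` interface returning `(observation, reward, done)`, the summation of rewards over skipped frames, and the `CountingEnv` stand-in are all assumptions made for illustration.

```python
class ActionRepeat:
    """Repeat each agent action `repeat` times and return only the last frame.

    `env` is a hypothetical environment whose `step(action)` returns
    `(observation, reward, done)`; this interface is an assumption.
    """

    ACTIONS = ("UP", "LEFT", "DOWN", "RIGHT", "NOOP")  # reduced action set

    def __init__(self, env, repeat=4):
        self.env = env
        self.repeat = repeat

    def step(self, action):
        if action not in self.ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        total_reward = 0.0
        observation, done = None, False
        for _ in range(self.repeat):
            observation, reward, done = self.env.step(action)
            total_reward += reward  # assumption: skipped-frame rewards are summed
            if done:
                break
        return observation, total_reward, done


class CountingEnv:
    """Toy stand-in environment: the observation is simply the frame index."""

    def __init__(self, episode_length=10):
        self.t = 0
        self.episode_length = episode_length

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.episode_length


env = ActionRepeat(CountingEnv())
first = env.step("UP")  # the agent sees frame 4 and the reward of four frames
```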
b.2 Architecture
We closely reimplemented the agent architecture presented by Weber et al. (2017). In the following, we list the changes in the architecture necessitated by the different environments and environment models.
Model-free baseline
The model-free baseline consisted of a four-layer CNN operating on the observation with sizes (4, 2, 16), (8, 4, 32), (4, 2, 64) and (3, 1, 64), where (k, s, c) denotes a CNN layer with square kernel size k, stride s and c output channels; each CNN layer is followed by a relu nonlinearity. The output of the CNN is flattened and passed through a fully-connected (FC) layer with 512 hidden units; the final output is a value function approximation and the logits of the policy at the current time step.
Imagination-Augmented Agent (I2A)
The model-free path consists of a CNN with the same size as that of the model-free agent (including the FC layer with 512 units). The model-based path is designed as follows: the rollout outputs for each imagined time step are encoded with a two-layer CNN with sizes (4, 1, 32) and (4, 1, 16), then flattened and passed to a fully-connected (FC) layer with 128 outputs. These rollout statistics are then summarized (in reversed order) with an LSTM with 256 hidden units and concatenated with the outputs of the model-free path.
Rollout policies
Trainable rollout policies that operate on the state are given by a two-layer CNN with sizes (4, 1, 32) and (4, 1, 32), followed by an FC layer with 128 units. Pixel-based rollout policies have the same neural network sizes as the model-free baseline, except that the last two CNN layers have 32 feature maps each.
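To make the layer specifications in this section easier to check, the sketch below collects them as (kernel size, stride, output channels) triples and traces the spatial size of the feature maps through each stack. It assumes square inputs and convolutions without padding, neither of which is stated in the text, and the input sizes in the example are arbitrary.

```python
# (kernel size, stride, output channels) per layer, as listed in the text.
MODEL_FREE_CNN = [(4, 2, 16), (8, 4, 32), (4, 2, 64), (3, 1, 64)]
ROLLOUT_ENCODER_CNN = [(4, 1, 32), (4, 1, 16)]
STATE_ROLLOUT_POLICY_CNN = [(4, 1, 32), (4, 1, 32)]


def conv_output_size(size, kernel, stride):
    """Spatial size after one convolution without padding (an assumption)."""
    return (size - kernel) // stride + 1


def trace_shapes(input_size, layers):
    """Return the (height, width, channels) shape after every layer."""
    size = input_size
    shapes = []
    for kernel, stride, out_channels in layers:
        size = conv_output_size(size, kernel, stride)
        if size < 1:
            raise ValueError("input too small for this stack of layers")
        shapes.append((size, size, out_channels))
    return shapes


# Hypothetical 80x80 input; the true input size is not given in the text.
baseline_shapes = trace_shapes(80, MODEL_FREE_CNN)
# -> [(39, 39, 16), (8, 8, 32), (3, 3, 64), (1, 1, 64)]
```

Under these assumptions the flattened CNN output would then feed the FC layer (512 units for the model-free path, 128 for the rollout encoder and the state-based rollout policy).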
Appendix C Results for I2A with stochastic state-space models
Learning curves for I2As with sSSMs are shown in Fig. 18. I2As with learning-to-query and with distillation rollout policies both outperform a uniform random rollout policy. The learning-to-query agent shows weak initial performance, but eventually outperforms the other agents. This shows that learning to sample informative outcomes is beneficial for agent performance.