Shaping Belief States with Generative Environment Models for RL

by Karol Gregor, et al.

When agents interact with a complex environment, they must form and maintain beliefs about the relevant aspects of that environment. We propose a way to efficiently train expressive generative models in complex environments. We show that a predictive algorithm with an expressive generative model can form stable belief-states in visually rich and dynamic 3D environments. More precisely, we show that the learned representation captures the layout of the environment as well as the position and orientation of the agent. Our experiments show that the model substantially improves data-efficiency on a number of reinforcement learning (RL) tasks compared with strong model-free baseline agents. We find that predicting multiple steps into the future (overshooting), in combination with an expressive generative model, is critical for stable representations to emerge. In practice, using expressive generative models in RL is computationally expensive and we propose a scheme to reduce this computational burden, allowing us to build agents that are competitive with model-free baselines.





1 Introduction

We are interested in making agents that can solve a wide range of tasks in complex and dynamic environments. While tasks may be vastly different from each other, there is a large amount of structure in the world that can be captured and used by the agents in a task-independent manner. This observation is consistent with the view that such general agents must understand the world around them bengio2013representation. The collection of algorithms that learn representations by exploiting structure in the data, where those representations are general enough to support a wide range of downstream tasks, is what we refer to as unsupervised learning or self-supervised learning. We hypothesize that an ideal unsupervised learning algorithm should use past observations to create a stable representation of the environment, that is, a representation that captures the global factors of variation of the environment in a temporally coherent way. As an example, consider an agent navigating in a complex landscape. At any given time, only a small part of the environment is observable from the perspective of the agent. The frames that this agent observes can vary significantly over time, even though the global structure of the environment is relatively static with only a few moving objects. A useful representation of such an environment would contain, for example, a map describing the overall layout of the terrain. Our goal is to learn such representations in a general manner.

Predictive models have long been hypothesized as a general mechanism to produce useful representations based on which an agent can perform a wide variety of tasks in partially observed worlds elias1955predictive ; schmidhuber1991curious. A formal way of describing agents in partially observed environments is through the notion of partially observable Markov decision processes (POMDPs) kaelbling1998planning ; astrom1965optimal. A key concept in POMDPs is the notion of a belief-state, which is defined as the sufficient statistics of future states rabiner1989tutorial ; jaakkola1995reinforcement ; hauskrecht2000value. In this paper we use the term belief-state for any vector representation that is sufficient to predict future observations, as in Gregor2018TemporalDV ; moreno2018nb.

A fundamental problem in building useful environment models, which we want to address in this work, is long-term consistency Chiappa2017RecurrentES ; fraccaro2018generative . This problem is characterized by many models’ failure to perform coherent long-term predictions, while performing accurate short-term predictions, even in trivial but partially observed environments Chiappa2017RecurrentES ; kalchbrenner2017video ; oh2015action .

We argue that this is not merely a model capacity issue. As previous work has shown, globally coherent environment maps can be learned by position-conditioned models eslami2018neural . We thus propose that this problem is better diagnosed as a failure in model conditioning combined with a weak objective, which we discuss in more detail in Section 2.1.

The main contributions of this paper are: 1) we demonstrate that an expressive generative model of rich 3D environments can be learned from merely first-person views and actions to capture long-term consistency; 2) we provide an analysis of three different belief-state architectures (LSTM hochreiter1997long , LSTM + Kanerva Memory wu2018kanerva and LSTM + slot-based Memory hung2018optimizing ) on the ability to decode the environment’s map and the agent’s location; 3) we design and perform experiments to test the effects of both overshooting and memory, demonstrating that generative models benefit from these components more than deterministic models; 4) we show that agents that share their belief-state with the generative model have substantially increased data-efficiency compared to a strong model-free baseline espeholt2018impala ; hung2018optimizing , without significantly affecting the training speed; 5) we show one of the first agents that learns to collect construction materials in order to build complex structures from a first-person perspective in a 3D environment.

The remainder of this paper is organized as follows. In Section 2 we introduce the main components of our agent’s architecture and discuss some key challenges in using expressive generative models in RL, namely the problem of conditioning (Section 2.1) and why next-step models are not sufficient (Section 2.2). In Section 3 we discuss related research. Finally, we describe our experiments in Section 4.

2 Methods

In this section we describe our proposed agent and model architectures. Our agent has two main components. The first is a recurrent neural network (RNN) as in espeholt2018impala, which observes frames, processes them through a feed-forward network and aggregates the resulting outputs by updating its recurrent state. From this updated state, the agent core then outputs the action logits, a sampled action and the value function baseline.

Figure 1: Diagram of the agent and model. The agent receives observations from the environment, processes them through a feed-forward residual network (green) and forms a state using a recurrent network (blue), online. This state is a belief state and is used to calculate policy and value as well as being the starting point for predictions of the future. These predictions are made using a second recurrent network (orange), a simulation network (SimCore) that simulates into the future seeing only the actions. The simulated state is used as conditioning for a generative model (red) of a future frame.

The second component is the unsupervised model, which can be: (i) a contrastive loss based on action-conditional CPC Guo2018NeuralPB ; (ii) a deterministic predictive model (similar to Chiappa2017RecurrentES ); or (iii) an expressive generative predictive model based on ConvDRAW gregor2016towards . We also investigate different memory architectures in the form of slot-based memory (as used in the reconstructive memory agent, RMA) hung2018optimizing and compressive memory (Kanerva) wu2018kanerva . The unsupervised model consists of a recurrent network, which we refer to as the simulation network or SimCore, that starts with a given belief state at some random time and then simulates deterministically forward, seeing only the actions of the agent. After simulating for a number of steps, we use the resulting state to predict the distribution of frames (in cases (ii) and (iii)). A diagram illustrating this agent is provided in Figure 1 and the precise computation steps are provided in Table 1.

Agent Core:
  Belief State Update
  Value and Policy Logits
SimCore:
  Simulation State Initialization
  Simulation State Update
  Simulation Starting Times (sampled uniformly)
  Likelihood Evaluation Times (sampled uniformly)
  Predictive Loss
Table 1: Agent and simulation core definitions. The number of points in the future used to evaluate the predictive loss is typically 2, the number of random points along the trajectory where we unroll the predictive model is typically 6, and the overshoot length, which is the maximum time-length used to train the predictive model, is typically 4-32. These settings are chosen to maintain a low computational cost.
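The computation listed in Table 1 can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the indexing convention (the belief at time t is formed after observing frame t, and action t leads to frame t+1) and the unit-variance Gaussian likelihood are all assumptions made for illustration. The default counts (6 start times, 2 evaluation times) follow the typical values quoted in the caption.

```python
import math
import random


def sample_training_indices(traj_len, overshoot, n_starts=6, n_evals=2, rng=random):
    """Pick simulation start times and, for each, future offsets to score.

    Returns a list of (t, [k, ...]) pairs with 1 <= k <= overshoot and
    t + k <= traj_len - 1, so every target frame exists.
    """
    last_start = traj_len - 1 - overshoot
    if last_start < 0:
        raise ValueError("trajectory is shorter than the overshoot length")
    plan = []
    for _ in range(n_starts):
        t = rng.randint(0, last_start)  # uniform start time (inclusive bounds)
        ks = [rng.randint(1, overshoot) for _ in range(n_evals)]  # uniform offsets
        plan.append((t, ks))
    return plan


def simulate(state, actions, step_fn):
    """Roll the simulation state forward seeing only actions.

    Returns the list [s_0, s_1, ..., s_n], where s_k is the state after k actions.
    """
    states = [state]
    for action in actions:
        state = step_fn(state, action)
        states.append(state)
    return states


def gaussian_log_prob(frame, mean):
    """Log-density of a unit-variance Gaussian (an assumed stand-in likelihood)."""
    return -0.5 * (frame - mean) ** 2 - 0.5 * math.log(2.0 * math.pi)


def predictive_loss(beliefs, actions, frames, plan, init_fn, step_fn, log_prob_fn):
    """Negative log-likelihood of future frames under the simulated states."""
    total = 0.0
    for t, ks in plan:
        # Initialise from the belief at time t, then unroll only as far as needed.
        states = simulate(init_fn(beliefs[t]), actions[t:t + max(ks)], step_fn)
        for k in ks:
            total -= log_prob_fn(frames[t + k], states[k])
    return total
```

Only a handful of (start time, offset) pairs are scored per trajectory, which is what keeps the cost of the expensive likelihood evaluation low relative to scoring every future frame.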

2.1 Frame generative models and the problem of conditioning.

It is known that expressive frame generative models are hard to condition liu2019conditional ; alemi2017fixing ; salimans2017pixelcnn++ ; razavi2019preventing . This is especially problematic for learning belief-states, because it is this conditioning that provides the learning signal for the formation of belief-states. If the generative model ignores the conditioning information, it will not be possible to optimize the belief-states. More precisely, if the generative model fails to use the conditioning, the predictive distribution of a frame $x$ becomes independent of the belief-state $b$, that is $p(x \mid b) \approx p(x)$, and thus the gradient of the loss with respect to $b$ is approximately zero; consequently, learning the belief-state is not possible.

We observed this problem when experimenting with expressive state-of-the-art deep generative models such as PixelCNN van2016conditional , ConvDRAW gregor2016towards and RNVP dinh2016density . We found that a modified ConvDRAW in combination with GECO rezendegeneralized works well in practice, allowing us to learn stable belief-states while maintaining good sample quality. As a result, we can use our model to consistently simulate many steps into the future. More details of the modified ConvDRAW architecture and GECO optimization are provided in Appendix A and Appendix G.

2.2 Why is next-step prediction not sufficient?

A theoretically sufficient and conceptually simple way to form a belief state is to train a next-step prediction model $p(x_{t+1} \mid b_t)$, where the belief state $b_t$ summarizes the past observations $x_1, \dots, x_t$. Under an optimal solution, $b_t$ contains all the information needed to predict the future, because any joint distribution can be factorized as a product of such conditionals:

$$p(x_1, \dots, x_T) = \prod_{t=0}^{T-1} p(x_{t+1} \mid x_1, \dots, x_t)$$

This reasoning motivates a lot of research using next-step prediction in RL, e.g. buesing2018learning ; ha2018world .

We argue that next-step prediction exacerbates the conditioning problem described in Section 2.1. In a physically realistic environment the immediate future observation can be predicted, with high accuracy, as a near-deterministic function of the immediate past observations. This intuition can be expressed by saying that the conditional entropy of the next observation given the immediate past is close to zero. That is, the immediate past weakens the dependency on belief-state vectors, resulting in a weak learning signal for the belief-state. Predicting the distant future, in contrast, requires knowledge of the global structure of the environment, encouraging the formation of belief-states that contain that information.

Generative environment models trained with overshooting have been explored in the context of model-based planning co2018self ; hafner2018learning ; Silver2017ThePE ; amos2018learning . But evidence of the effect of overshooting has been restricted to the agent’s performance evaluation Silver2017ThePE ; co2018self . While there is some evidence that overshooting improves the ability to predict the long-term future Chiappa2017RecurrentES , there is no extensive study examining which aspects of the environment are retained by these models.

As noted above, for a given belief-state the entropy of the distribution of target observations increases with the overshoot length (due to partial observability and/or randomness in the environment), going from a near deterministic (uni-modal) distribution to a highly multi-modal distribution. This leads us to hypothesize that deterministic prediction models should benefit less from growing the overshooting length compared to generative prediction models. Our experiments below support this hypothesis.
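The growth of predictive entropy with the prediction horizon can be checked on a toy example. The sketch below is an illustration only and does not come from the paper's experiments: it assumes an agent on a ring of positions that advances one cell per step but stays in place with a small slip probability, and it computes the entropy of the position distribution after each number of steps.

```python
import math


def step_distribution(dist, p_slip):
    """Advance a distribution over ring positions by one noisy step."""
    n = len(dist)
    out = [0.0] * n
    for i, mass in enumerate(dist):
        out[(i + 1) % n] += (1.0 - p_slip) * mass  # intended move: one cell forward
        out[i] += p_slip * mass                    # slip: stay in place
    return out


def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def predictive_entropies(n_positions, p_slip, max_horizon):
    """Entropy of the k-step-ahead position distribution for k = 1..max_horizon."""
    dist = [0.0] * n_positions
    dist[0] = 1.0  # the current position is known exactly
    entropies = []
    for _ in range(max_horizon):
        dist = step_distribution(dist, p_slip)
        entropies.append(entropy(dist))
    return entropies
```

With a small slip probability the one-step distribution is nearly a point mass, so a model that outputs a single prediction loses little at short horizons. At longer horizons the mass spreads over several positions, which is the regime where a model able to represent a multi-modal distribution has an advantage.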

2.3 Belief-states, Localization and Mapping

Extracting a consistent high-level representation of the environment such as the bird’s eye view "map" from merely first-person views and actions in a completely unsupervised manner is a notoriously difficult task fraccaro2018generative and a lot of the success in addressing this problem is due to the injection of a substantial amount of human prior knowledge in the models zhou2017unsupervised ; iyer2018geometric ; zhang2017neural ; parisotto2018 ; kayalibay2018navigation ; gupta2017cognitive .

While previous work has primarily focused on extracting human-interpretable maps of the environment, our approach is to decode the position, the orientation and the top-down view or layout of the environment from the agent’s belief-state. This decoding process does not interfere with the agent’s training, and is not restricted to 2D map-layouts.

We use a one-layer MLP to predict the discretized position and orientation and a convolutional network to predict the top-down view (map decoder). When the belief-state is learned by an LSTM, it is composed of the LSTM hidden state and the LSTM cell state. Since the location and map decoders need access to the full belief-state, we condition them on the vector formed by concatenating these two states. When using the episodic, slot-based RMA memory, we first take a number of reads from the memory conditioned on the current belief-state and concatenate them with the vector defined above. For the Kanerva memory we learn a fixed set of read vectors and concatenate the retrieved memories with the same vector.
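As a rough sketch of the decoding setup described above, the following Python code builds the decoder input by concatenating the hidden state, the cell state and any memory reads, and applies a single affine layer followed by a softmax over discretized position bins. The function names, the argument layout and the choice of a softmax output are assumptions made for illustration; the text does not specify the decoder beyond calling it a one-layer MLP.

```python
import math


def decoder_input(hidden, cell, memory_reads=()):
    """Concatenate LSTM hidden state, cell state and optional memory reads."""
    features = list(hidden) + list(cell)
    for read in memory_reads:
        features.extend(read)
    return features


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def position_decoder(features, weights, biases):
    """Single affine layer plus softmax: one probability per position bin.

    weights holds one row per bin; each row has len(features) entries.
    """
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)
```

Because the decoder only reads the belief-state, it can be trained with its own loss while gradients are stopped at its input, which matches the statement that decoding does not interfere with the agent's training.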

3 Other Related Work

The idea of learning general world models to support decision-making is probably one of the most pervasive ideas in AI research ha2018world ; Chiappa2017RecurrentES ; schmidhuber2010formal ; xie2016model ; oh2015action ; munro1987dual ; werbos1987learning ; nguyen1990truck ; Deisenroth2011PILCOAM ; Silver2017ThePE ; lin1992memory ; li2015recurrent ; schmidhuber1991curious ; diuk2008object ; igl2018deep . In spite of a vast literature supporting the potential advantages of model-based RL, it has been a challenge to demonstrate the benefits of model-based agents in visually rich, dynamic, 3D environments. This challenge has compelled some researchers to use privileged information such as camera locations eslami2018neural , depth information mirowski2016learning , and other ground-truth state variables of the simulator lample2017playing ; diuk2008object . On the other hand, some work has provided evidence that we may not need very expressive models to benefit to some degree ha2018world .

Our proposed model could in principle be used in combination with planning algorithms. But we take a step back from planning and focus more on the effect of various model choices on the learned representations and belief states. This approach is similar to having a representation shaped via auxiliary unsupervised losses for a model-free RL agent. Combining auxiliary losses with reinforcement learning is an active area of research and a variety of auxiliary losses have been explored. A non-exhaustive list includes pixel control Jaderberg2017ReinforcementLW , contrastive predictive coding (CPC) Oord2018RepresentationLW , action-conditional CPC Guo2018NeuralPB , frame reconstruction zhou2017unsupervised ; higgins2017darla , next-frame prediction hung2018optimizing ; racaniere2017imagination ; Guo2018NeuralPB ; dosovitskiy2016learning and successor representations barreto2017successor ; dayan1993improving ; kulkarni2016deep .

As in Guo2018NeuralPB ; moreno2018nb ; igl2018deep our proposed architecture has a shared belief-state between the generative model and the agent’s policy network. The closest paper to our work is Guo2018NeuralPB , where a comparison is made between action-conditional CPC and next-step prediction using a deterministic next-step model. There are several key differences between this paper and our work: (i) We analyze the decoding of the environment’s map from the belief state; (ii) We show that while next-frame prediction may be sufficient to encode position and orientation, it is necessary to combine expressive generative models with overshoot to form a consistent map representation; (iii) We demonstrate that an expressive generative model can be trained to simulate visually rich 3D environments for several steps in the future; (iv) We also analyze the impact on RL performance of various model choices. We also discuss and propose solutions to the general problem of conditioning expressive generative models.

4 Experiments

We analyze our agent’s performance with respect to both its ability to represent the environment (e.g. whether it knows its position and the map layout) and its RL performance. Our experiments span three main axes of variation: (i) the choice of unsupervised loss for the overshoots (e.g. deterministic prediction, generative prediction and contrastive); (ii) the choice of overshoot length; and (iii) the choice of architecture for the belief-state and simulation RNNs (LSTM hochreiter1997long , LSTM with Kanerva memory wu2018kanerva and LSTM with the memory from the reconstructive memory agent (RMA) hung2018optimizing ). RMA uses a slot-based memory that stores all past vectors, whereas Kanerva memory updates existing memories with new information in a compressive way; see Appendix B for more details.

The agent is trained using the IMPALA framework espeholt2018impala , a variant of policy gradients; see Appendix D for details. The model is trained jointly with the agent, sharing the belief network. We find that the running speed decreases only slightly compared to the agent without the model. We use Adam for optimization kingma2014adam. The detailed choice of various hyperparameters is provided in Appendix F.

Our experiments are performed using four families of procedural environments: (a) DeepMind-Lab levels beattie2016deepmind and three new environments that we created using the Unity Engine: (b) Random City; (c) Block building environment; (d) Random Terrain.

4.1 Random City

The Random City is a procedurally generated 3D navigation environment. Figure 2 shows the first-person view (top row) and the top-down view (second row). At the beginning of an episode, a random number of randomly colored boxes (i.e. “buildings”) is placed on top of a 2D plane. We used this environment primarily as a way to analyze how different model architectures affect the formation of consistent belief-states. We generated training data for the models using a fixed handcrafted policy that chooses random locations and uses path planning to move between them. We then analyzed the model and the content of the belief state (there is no RL in this experiment).

Figure 2: Random City environment. Rows: 1. Input sequence to the model, starting from the beginning of the episode. 2. Top-down view (a map). 3. Top-down view decoded from the belief state. The belief state was not trained with this decoding signal, but only from the first-person view (top row). We see that the model is able to fill in the map as it sees new frames. 4. Frames later in the sequence (after 170 steps). 5. Rollout from the model. The model knows what it will see as the agent rotates. See supplementary video.

In the third row of Figure 2 we show the top down view reconstructions from the belief state (to emphasize, the belief-state was not trained with this information). We see that whenever the agent sees a new building, the building appears on a map, and it still preserves the other buildings seen so far even if they are not in its current field-of-view. Rows four and five show a later part of an input sequence (when the model has seen a large part of the environment) and a rollout from the model. We see that the model is able to preserve the information in its state and use this information to correctly simulate forward.

We analyze the effects of self-supervised loss type, overshoot length and memory architecture on position prediction and map decoding accuracy. The results are shown in Figure 3 and Figure 4. We make the following observations: (i) an increase in the overshoot length improves the ability to decode the agent’s position and the map layout for all losses (up to a certain length); (ii) the contrastive loss provides the best decoding of the agent’s position for all overshoot lengths (Figure 3(a)); (iii) the generative prediction loss provides the best map decoding and is the most sensitive to the overshoot length with respect to map decoding error (Figure 3(b)); (iv) the combination of a generative model with Kanerva memory provides the best map decoding accuracy.

We also see that the contrastive loss is very good at localization but poor at mapping. This loss is trained to distinguish a given state from others within the simulated sequence and from other elements of the mini-batch. We hypothesize that keeping track of location very accurately allows the network to distinguish a given time point from others, but that in a varied environment it is easy for the network to distinguish one batch element from others without forming a map of the environment.
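The discrimination task behind the contrastive loss can be sketched as a softmax classification over one positive and several negatives. The code below is a generic InfoNCE-style loss with dot-product scores, written as an assumption about the general shape of such an objective; the exact action-conditional CPC formulation used in the experiments is not given in this text. Here the negatives would be embeddings from other time steps of the simulated sequence and from other mini-batch elements.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(query, positive, negatives):
    """Cross-entropy of picking the positive among positive + negatives.

    query:     embedding predicted from the simulated state
    positive:  embedding of the true future observation
    negatives: embeddings of other time steps or other batch elements
    """
    candidates = [positive] + list(negatives)
    scores = [dot(query, c) for c in candidates]
    peak = max(scores)
    log_partition = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_partition - scores[0]
```

In this form the loss is low whenever the positive scores higher than every negative. If negatives from other episodes look very different from the positive, they are easy to reject without any map-like information, which is consistent with the hypothesis stated above.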

We also see that Kanerva memory works significantly better than the pure LSTM and the slot-based memory. However, the latter result might be due to a limitation of the method used to analyze the content of the belief state. In fact it is likely that the information is in the state, since slot-based memory stores all the past vectors, but that it is hard to extract this information. This also raises an interesting question: what is a belief state? Storing all past data retains all the information a model can have. We suggest that what we are after is a more compact representation that is stored in an easy-to-access way. Kanerva memory aims not only to store the past information but also to integrate it with already stored information in a compressed way.

(a) Position decoding
(b) Map decoding
(c) Map samples
Figure 3: The choice of model and overshoot length has a significant impact on the state representation. (a) All models benefit from an increase in the overshoot length with respect to position decoding, with the Contrastive model reaching higher accuracy; (b) the Generative models are the most sensitive to overshoot length with respect to map decoding MSE. A substantial reduction in map decoding MSE is obtained by using architectures with memory; (c) examples of decoded maps. Each block shows real maps (top row) and decoded maps (bottom row). Top block: Contrastive model samples (MSE of approx. 160); bottom block: Generative + Kanerva samples (MSE of approx. 117). The difference in the level of detail between the two models is clearly visible.
Figure 4: Effect of overshoot on environment’s map decoding. This analysis shows that Generative and Generative + Kanerva benefit the most from an increase in overshoot length in contrast to Deterministic and Contrastive architectures. In particular, we observe that Generative + Kanerva architecture is particularly good at forming belief-states that contain a map of the environment.

4.2 DeepMind Lab

DeepMind Lab beattie2016deepmind is a collection of 3D, first-person view environments where agents perform a diverse set of tasks. We use a subset of DeepMind Lab (rat_goal_driven, rat_goal_doors, rat_goal_driven_large and keys_doors_random) to investigate how the addition of the generative prediction loss with overshoot affects the agent’s representation or belief-state as well as its RL performance.

We compare four agents in the following experiments. The first, termed LSTM, is the standard IMPALA agent with an LSTM recurrent core. The next agent, termed RMA, is the agent of hung2018optimizing , the core of which consists of an LSTM and a slot-based episodic memory. The final two agents, termed LSTM+SimCore and RMA+SimCore, are the same as the LSTM and RMA agents, but with the model loss added.

The results of our experiments are shown in Figure 5. We see that adding the model loss improves performance for both LSTM and RMA agents.

Figure 5: Generative SimCore results in substantial data-efficiency gains for agents in DeepMind-Lab relative to a strong model-free baseline. We also observe that model-free agents have substantially higher variance in their scores. See supplementary video.


We found that the map reconstruction loss varies significantly during training. This could be due to policy gradients affecting the belief state, to the changing policy, or to changes in the way the model represents the information, with the decoder having a hard time keeping up. We found that longer overshoot lengths perform better than shorter ones, but that this did not translate into improved RL performance. This could also be an artifact of the environment: there are permanent features present on the horizon, and the agent does not need to know the map to navigate to the goal. The model is able to roll out correctly for a number of steps (Figure 6), knowing where the rewarding object is (the object on the bottom right).

Figure 6: The input and the rollout in DeepMind Lab. The model is able to roll out correctly for many steps and to remember where the rewarding object is (the object in the bottom right frames).

While Kanerva memory significantly helps when the model is trained on pre-generated data (as in Section 4.1), we found it to be unstable in the RL setting. More work is required to solve this problem.

4.3 Voxel environment

We want to create an environment that requires agents to learn complex behaviours to solve tasks. For this, we created a voxel-based, procedural environment with Unity that can be modified by the agents via building mechanisms, resulting in a large combinatorial space of possible voxel configurations and behavioural strategies.

This environment consists of blocks of different types and appearances that are placed in a three-dimensional grid. The agent moves continuously through the environment, obeying the Unity engine’s physics. The agent has a simple inventory, can pick up blocks of certain types, place them back into the world and interact with certain blocks. We build four levels (Figure 7, top), where the goal is to consume all yellow blocks (‘food’). The levels are:

Food: Five food blocks are placed at random locations in a plane. This is a curriculum level for the agent to quickly learn that yellow blocks give reward.

HighFood: The same setting, but the food is also placed at a random height. If the food is slightly high, the agent needs to look up and jump to get it. If the food is even higher, the agent needs to place blocks on the ground, jump on them, look up at the food and jump.

Cliff: There is a cliff of random height with food at the top. The agent needs to first pick up blocks and then build structures to climb and reach the top of the cliff. Interestingly, the agent learns to pick them up and build stairs on the side of the cliff.

Bridge: The agent needs to get to the other side of a randomly sized gap, either by building a bridge or by falling down and then building a tower to climb back up. The agent learns the latter.

We also trained the agent on more complex versions of the levels, where it shows rather competent abilities at building high structures and climbing them; see Appendix J.

We compared the LSTM and LSTM+SimCore agents on these levels. In this case, a single agent plays all four levels at the same time. From Figure 7 we see that SimCore significantly improves the performance on all the levels. In addition, we found that the performance is much less sensitive to the Adam hyper-parameters as well as to the unroll length. We also found that the model is able to simulate the agent’s movement, building behaviours and block picking; see Appendix J for samples.

Figure 7: Top: Voxel levels. There are four levels: BridgeFood, Cliff, Food and HighFood. For each level, four views are shown: early frame first-person view, early frame third-person view, later frame first-person view, later frame third-person view. The agent sees only the first-person view and its goal is to pick up yellow blocks, which it needs to get to. The agent has blocks that it can place. The agent learns how to build towers (BridgeFood and HighFood) and stairs (Cliff) to climb to the food. Bottom: Training the agent with SimCore substantially increases data-efficiency. See supplementary video.

Finally, we tested the map-building ability in a more naturalistic, procedurally generated terrain (Figure 12). This environment is harder than the city, because it takes significantly more steps to cross the environment. We also analyzed a simple RL setting of picking up randomly placed blocks. We found that the belief state of an LSTM agent contains an approximate map, but that information not seen for a while gradually fades away. We hope to scale up these experiments in the future.

5 Discussion

In this paper we introduced a scheme to train expressive generative environment models with RL agents. We found that expressive generative models in combination with overshoot can form stable belief states in 3D environments from first person views, with little prior knowledge about the structure of these environments. We also showed that by sharing the belief-states of the model with the agent we substantially increase the data-efficiency in a variety of RL tasks relative to strong baselines. There are more elements that need to be investigated in the future. First, we found that training the belief state together with the agent makes it harder to either form a belief state or decode the map from it. This could result from the effect of policy gradients or changing of policy or changing the way the belief is represented. Additionally we aim towards scaling up the system, either through better training or better use of memory architectures. Finally, it would be good to use the model not only for representation learning but for planning as well.


Appendix A Modified ConvDRAW

Convolutional DRAW [21] is an instance of a deep variational auto-encoder [63, 64]. It has a recurrent encoder and recurrent decoder with latent variables at each step, forming auto-regressive distributions over the latent variables, Figure 8.

The decoder has a special layer named the canvas, which accumulates the results into a distribution over the inputs. All the operations are convolutional, and all the states are three-dimensional (spatial dimensions times feature dimension).

We introduce several modifications to the original formulation. First, we replaced the LSTM recurrence by a product of tanh and sigmoid non-linearities, which halves the number of operations for a given number of feature maps. This product is part of the LSTM operation and was also used in [26]. Second, instead of passing a conditioning vector into the decoder, we create a separate network that defines the prior over the latent variables. We find that this helps with conditioning. Finally, we adaptively scale the input loss relative to the latent loss so as to achieve a set target accuracy on the reconstructions, as described in [28].
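Two of these modifications are easy to illustrate in isolation. The Python sketch below shows the tanh-sigmoid product applied to a pair of pre-activations, and a simple multiplier update that raises the weight on the reconstruction loss while the reconstruction error is above a target and lowers it otherwise. The class name, the moving-average decay, the update rate and the multiplicative form are assumptions chosen for illustration; the precise scheme is the one described in [28], and this sketch should not be read as a faithful copy of it.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_unit(content, gate):
    """Product of tanh and sigmoid non-linearities on two pre-activations."""
    return math.tanh(content) * sigmoid(gate)


class AdaptiveLossScale:
    """Scales the reconstruction term so its error approaches a target value."""

    def __init__(self, target_error, rate=0.01, decay=0.99, init_scale=1.0):
        self.target_error = target_error
        self.rate = rate
        self.decay = decay
        self.scale = init_scale
        self.avg_gap = None  # moving average of (error - target)

    def update(self, recon_error):
        """Update the scale from the latest reconstruction error and return it."""
        gap = recon_error - self.target_error
        if self.avg_gap is None:
            self.avg_gap = gap
        else:
            self.avg_gap = self.decay * self.avg_gap + (1.0 - self.decay) * gap
        # Multiplicative update keeps the scale positive.
        self.scale *= math.exp(self.rate * self.avg_gap)
        return self.scale

    def total_loss(self, recon_error, latent_loss):
        """Latent loss plus the scaled constraint on the reconstruction error."""
        return latent_loss + self.scale * (recon_error - self.target_error)
```

The intent of such a scheme is that the reconstruction quality is pinned to a chosen level, so the model cannot trade reconstruction accuracy away entirely in favour of a small latent loss.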

Figure 8: Diagram of ConvDraw’s likelihood computation.

We use 8 iterations of the repeated operations (two are displayed in Figure 8). The input is processed through a two-layer convolutional network with rectified linear units. All the circles have the tanh-sigmoid nonlinearity. The size of the encoder recurrent state (blue circles) is measured after the tanh-sigmoid multiplication, and the other recurrent parts (orange and red circles) share a common size. The decoder convolutional network has two hidden layers. The canvas is the size of the image and contains the current reconstruction. We don’t model the variance. The error vector is the difference between the current canvas and the reconstruction. The kernel sizes of the convolutions between layers that change stride differ from those between layers that don’t. The first conditioning operation (the first red circle) is applied four times before producing the first prior on the latent variables. No weights are shared.

Appendix B Memory Architectures

Here we discuss the architectures used to aggregate the observations and to perform prediction. Such networks need to store new information quickly and incorporate it into their belief state. There are two places where recurrent networks store information: in activations and in weights. For classic recurrent networks, the weights are updated by back-propagation, which is a slow process that results in small updates at every time step. Therefore such networks need to form the belief state in activations. For this type of network, we use the standard LSTM [16] in our experiments.

We use two models that were developed as one-shot memory architectures. The first one is that of [18]. It is a slot-based memory that, at each time step, takes a vector produced from the input and an LSTM and stores it in a new memory slot. The memory starts empty at the beginning of the episode and a new slot is allocated at every time step. Reading from memory is done using an attention-based mechanism. The advantage of such a memory is that nothing is forgotten and one does not need to learn how to write into the memory. The disadvantage is that new slots are allocated and written to even if nothing new is happening.
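The append-only behaviour described above can be sketched as follows. This is a toy illustration, not the architecture of [18]: the class name `SlotMemory`, the dot-product scoring and the softmax over the top-k slots are assumptions, and stored vectors are assumed to have the same length as the query.

```python
import math

class SlotMemory:
    """Append-only memory with top-k dot-product attention reads (illustrative)."""

    def __init__(self):
        self.slots = []  # starts empty at the beginning of an episode

    def write(self, vector):
        # A new slot is allocated on every call; nothing is overwritten.
        self.slots.append(list(vector))

    def read(self, query, top_k):
        if not self.slots:
            return [0.0] * len(query)
        scores = [sum(q * s for q, s in zip(query, slot)) for slot in self.slots]
        # Keep the indices of the top_k highest-scoring slots.
        best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        # Softmax over the selected scores (shifted by the max for stability).
        m = max(scores[i] for i in best)
        weights = [math.exp(scores[i] - m) for i in best]
        total = sum(weights)
        return [
            sum(w * self.slots[i][d] for w, i in zip(weights, best)) / total
            for d in range(len(query))
        ]

memory = SlotMemory()
memory.write([1.0, 0.0])
memory.write([0.0, 1.0])
readout = memory.read([10.0, 0.0], top_k=1)
```

Because `write` never overwrites, the memory grows by one slot per step regardless of whether the input carries new information, which is the drawback noted above.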

More appealing would be a mechanism that compares the current input to what is already stored in the memory, and makes only the updates necessary to incorporate the new information present in that input. For such a mechanism we use Kanerva machines [17], specifically the version in [65]. The Kanerva machine is a generative model of exchangeable observations, in which the global latent variable is used as a memory. Due to its statistical interpretation, writing into the memory is equivalent to inferring the posterior distribution of this global latent variable given a new observation. This inference is exact and efficient, since the linear Gaussian model underlying a Kanerva machine is analytically tractable.
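To make the "writing is posterior inference" statement concrete, the sketch below applies the standard conditioning formulas for a linear Gaussian model (a Kalman-style update) to a memory with mean `R` and row covariance `U`. The function name, argument names and fixed addressing weights `w` are assumptions for illustration; the parameterization in [65] may differ in its details.

```python
def kanerva_write(R, U, w, z, noise_var):
    """One linear-Gaussian posterior update of the memory given word z.

    R: K x C memory mean (list of rows), U: K x K row covariance,
    w: length-K addressing weights, z: length-C observed word.
    Returns the updated (R, U) without modifying the inputs.
    """
    K, C = len(R), len(z)
    # Prediction error between the word and the current read-out w^T R.
    read = [sum(w[k] * R[k][c] for k in range(K)) for c in range(C)]
    delta = [z[c] - read[c] for c in range(C)]
    # Cross-covariance U w and scalar predictive variance w^T U w + noise.
    sigma_c = [sum(U[k][j] * w[j] for j in range(K)) for k in range(K)]
    sigma_z = sum(w[k] * sigma_c[k] for k in range(K)) + noise_var
    gain = [s / sigma_z for s in sigma_c]
    new_R = [[R[k][c] + gain[k] * delta[c] for c in range(C)] for k in range(K)]
    new_U = [[U[k][j] - gain[k] * sigma_c[j] for j in range(K)] for k in range(K)]
    return new_R, new_U

R0 = [[0.0, 0.0], [0.0, 0.0]]
U0 = [[1.0, 0.0], [0.0, 1.0]]
R1, U1 = kanerva_write(R0, U0, w=[1.0, 0.0], z=[2.0, -4.0], noise_var=1.0)
```

In this toy call the addressed row moves halfway towards the observed word and its variance halves, while the unaddressed row is untouched. Writing the same word again would change the memory less, which is the "only update what is new" behaviour motivated above.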

For all types of RNN (LSTM, LSTM+RMA, LSTM+Kanerva), we use the same network architecture for both the belief-state and the simulation networks. For the networks with memory, we turn off writing into the memory in the simulation network. In all cases, the LSTMs do not share weights between the belief-state and simulation networks.

Appendix C Integrating Kanerva Machines with the RNNs

This section describes how to integrate a Kanerva Machine (KM) with an RNN, so that it functions as an external memory for the SimCore (LSTM + Kanerva in the main text). KMs were originally proposed as unsupervised models trained with lower bounds of log-likelihoods [17, 65]. Here we present a simplification that works as a black-box module, which is trained end-to-end by back-propagation with the rest of the model without incurring any auxiliary loss function.

We use the dynamic version as in [65], which uses fully content-based addressing, requiring only a word as the input for either reading or writing. The optimal location for memory access is obtained by solving a least-squares problem involving both the word and the memory mean matrix [65]. At each time step, the memory module takes as input the RNN's previous output, the current action and the embedding of the observation. They are concatenated and linearly projected to obtain a read word. The read word is used to query the memory for a read-out, which is then used as an input to the RNN, together with the action and the observation embedding, to update the RNN state. To update the memory, we obtain a write word by passing a similar concatenated vector, built from the new state, through an MLP. The memory is then updated to the posterior distribution conditioned on the write word. These operations are illustrated in Figure 9.

Figure 9: Illustration of a dynamic Kanerva machine integrated with an RNN.

Appendix D Reinforcement learning framework

For all reinforcement learning experiments, we assume a standard setup in which the agent interacts with the environment in discrete time steps using a discrete set of actions. At each time step, the agent, holding a belief-state, observes a frame, produces an action and receives a reward. The goal of the optimization is to maximize the discounted sum of rewards, subject to entropy regularization of the policy.
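The two quantities in this objective can be computed as in the following minimal sketch. The function names and the example discount are illustrative assumptions; the discount and entropy coefficient used in the experiments are not restated here.

```python
import math

def discounted_returns(rewards, discount):
    """Return G_t = r_t + discount * G_{t+1} for every step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + discount * running
        returns.append(running)
    return returns[::-1]

def policy_entropy(probs):
    """Shannon entropy of a discrete action distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

returns = discounted_returns([1.0, 1.0, 1.0], discount=0.5)
entropy = policy_entropy([0.25, 0.25, 0.25, 0.25])
```

In an entropy-regularized objective, the entropy term is scaled by a small coefficient and added to the expected return, which discourages the policy from collapsing onto a single action too early.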

The agents are trained in a distributed setting using the IMPALA framework [19], which we describe here briefly. There are N parallel ‘actors’ acting in an environment and collecting their experience in a replay buffer. There is one learner which takes subsets of trajectories, forms a batch and performs learning. The learning step consists of unrolling the recurrent core of the agent and computing the losses and gradients. A given piece of experience is used only once and is only slightly off policy. The policy loss and model loss are computed in the same networks, both passing gradients into the main recurrent core.

Appendix E Data pre-processing for plotting

For all plots showing RL scores, position accuracy and map MSE, we first smooth each individual curve using an exponential moving window of length 10. We then sub-sample the curves using linear interpolation, re-evaluating them at 128 points uniformly covering the x-axis range. The error bars shown correspond to the confidence region.
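The smoothing and resampling steps can be sketched as below. The mapping from a window length to the smoothing factor (`alpha = 2 / (window + 1)`) is an assumed convention, and the function names are illustrative; the original plotting code may differ.

```python
def ema_smooth(values, window=10):
    """Exponential moving average; alpha = 2 / (window + 1) is an assumed convention."""
    alpha = 2.0 / (window + 1.0)
    smoothed, current = [], values[0]
    for v in values:
        current = alpha * v + (1.0 - alpha) * current
        smoothed.append(current)
    return smoothed

def resample_linear(xs, ys, num_points=128):
    """Re-evaluate a curve at num_points uniformly spaced x values (xs must be increasing)."""
    lo, hi = xs[0], xs[-1]
    out_x, out_y, j = [], [], 0
    for i in range(num_points):
        x = lo + (hi - lo) * i / (num_points - 1)
        # Advance to the segment [xs[j], xs[j + 1]] that contains x.
        while j < len(xs) - 2 and x > xs[j + 1]:
            j += 1
        t = (x - xs[j]) / (xs[j + 1] - xs[j])
        out_x.append(x)
        out_y.append(ys[j] + t * (ys[j + 1] - ys[j]))
    return out_x, out_y

grid_x, grid_y = resample_linear([0.0, 1.0, 2.0], [0.0, 10.0, 20.0], num_points=5)
flat = ema_smooth([3.0] * 20)
```

Resampling every curve onto the same grid makes it possible to average runs and compute per-point error bars even when the runs were logged at different steps.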

Appendix F Hyperparameters

The hyper-parameter values used in experiments are reported in Table 2. In addition, Table 3 and Table 4 report the parameters for slot-based and Kanerva memory respectively. Please refer to [18] and [65] for explanations of the memory models’ parameters.

Hyper-parameter                                                       Range
learning rate                                                         [0.0001-0.0002]
policy entropy regularization                                         [0.03-0.0005]
Adam β1                                                               [0, 0.95]
Adam β2                                                               [0.99, 0.999]
Overshoot Length                                                      {1, 2, 3, 6, 12}
Unroll Length                                                         [24-100]
Number of points used to evaluate the generative loss per trajectory
Number of points used to evaluate the generative loss per overshoot
Number of ConvDRAW Steps                                              [8]
Number of units in LSTM                                               [512-1024]

Table 2: Hyper-parameters used. Each reported experiment was repeated at least 3 times with different random seeds. The reported curves for each model are the best we found with the hyper-parameters in the ranges shown above.

Hyper-parameter           Value
memory size               1350
word size                 200
number of reads           3
top k entries for read    50

Table 3: Hyper-parameters used for RMA memory.

Hyper-parameter           Value
memory size               32
word size                 512
initial noise variance    1.0
write projection          linear
read projection           2-layer MLP with hidden layer size 400

Table 4: Hyper-parameters used for Kanerva memory.

Appendix G GECO

It has been shown [28] that latent variable models defining a conditional density, such as ConvDRAW and VAEs, can achieve better sample quality if we constrain the reconstruction error to be no larger than a given threshold. A Lagrangian formulation of this constrained optimization problem can be written as a min-max optimization instead of direct ELBO maximization. More specifically, we train ConvDRAW following

where is an approximation to the posterior distribution with parameters .
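For orientation, a constrained objective of this kind is commonly written in the following Lagrangian form. This is a sketch rather than the exact expression used in this work: the symbols $x$ (input), $z$ (latent variables), $q_\phi$ (approximate posterior), $p_\theta$ (prior), $g_\theta$ (decoder mean), $\kappa$ (threshold) and $\lambda$ (Lagrange multiplier) are assumed notation, and any conditioning on the belief state is omitted.

```latex
\min_{\theta,\phi}\;\max_{\lambda \ge 0}\;
\mathbb{E}_{x}\Big[
  \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big)
  + \lambda\,\Big(\mathbb{E}_{q_\phi(z \mid x)}\big[\lVert x - g_\theta(z)\rVert^2\big] - \kappa\Big)
\Big]
```

In this form, the inner maximization increases $\lambda$ while the reconstruction error exceeds the threshold, which puts more weight on reconstruction. Once the constraint is satisfied, $\lambda$ decreases and the KL term dominates.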

We examine the effect of the choice of threshold in our belief-state model and observe empirically that there is an optimal range of values with respect to map reconstruction. This analysis is shown in Figure 10.

Figure 10: Effect of the choice of GECO’s threshold on map reconstruction using SimCore. We find that a value of 1e-3 produces the best results for map reconstruction.

Appendix H Map decoding on Random City environment

Here we show additional map decoding samples for each model in Figure 11.

Figure 11: Additional map decoding samples for each model. All models were trained for the same number of iterations. For each model we show 8 samples for overshoot length (left) and 8 samples for (right). We used unroll length and the maps were extracted at time-step . These results confirm that the best models allow the maps to be decoded in substantially more detail than the other baselines.

Appendix I Procedurally generated terrain

We created a procedurally generated terrain to analyze the belief state in a more naturalistic environment (Figure 12). See the caption for a description and analysis.

In Figure 13 we also show the map decoding MSE and a few map decoding samples along a single trajectory.

Figure 12: Terrain. The agent moves around a procedurally generated terrain. The first row in each of the three sections shows the frames seen by the agent (only a fraction of the frames are shown, so that a large part of an episode can be displayed). The second row shows the top-down view. Conditioned on the state, we trained the decoder to predict the top-down view (without passing gradients into belief-state formation). The third row of each section shows the map reconstructed from the belief state. We also trained ConvDRAW as a model of the map conditioned on the belief state. In the last three rows we show samples from the model. If the agent is uncertain about parts of the map, the model should sample random pieces of map in those locations. This environment is harder than the city environment in that it is larger, the agent moves more slowly, and it is run in an RL setting (with the goal of collecting 20 randomly placed yellow blocks). We see that the agent does form a map that persists for some time, but the map also fades away slowly the longer the agent does not see that part of the environment.
Figure 13: Illustration of the map decoding MSE along a single trajectory in the procedural terrain environment. In each inset we can see, from top to bottom: the true top view, the decoded top view and the first-person view at the same time. This graph shows that the map MSE decreases along a single trajectory, indicating that the model was successful at accumulating and remembering evidence about the environment’s layout.

Appendix J Extra levels and samples from the model

We also trained the agent on harder versions of the levels. The first is a higher version of the cliff level, in which the agent learns to build a longer staircase. The remaining two place food at even higher locations, and show the agent building tall structures.

We show a number of rollouts from the model in different environments (Figure 14 and Figure 15). To make a rollout, we simulate deterministically in latent space. To obtain a frame, we sample from the ConvDRAW model. If the simulation knows well what should be in a given frame, the sample should match the actual frame. If it does not, either because of a limitation of the model or because it has not seen that part of the environment, it should sample something consistent with its knowledge, but (most likely) different from the actual frame.
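The rollout procedure can be sketched as a short loop. The callables `transition` and `sample_frame` are hypothetical stand-ins for the simulation network and the ConvDRAW decoder, and the toy scalar state below is purely illustrative.

```python
import random

def rollout(state, actions, transition, sample_frame, rng):
    """Advance the latent state deterministically; sample one frame per step."""
    frames = []
    for action in actions:
        state = transition(state, action)        # deterministic latent step
        frames.append(sample_frame(state, rng))  # stochastic decoding
    return state, frames

# Toy stand-ins (hypothetical): a scalar latent state and a noisy "decoder".
toy_transition = lambda s, a: s + a
toy_decoder = lambda s, rng: s + rng.gauss(0.0, 0.1)

final_a, frames_a = rollout(0.0, [1.0, 1.0, -0.5], toy_transition, toy_decoder, random.Random(0))
final_b, frames_b = rollout(0.0, [1.0, 1.0, -0.5], toy_transition, toy_decoder, random.Random(1))
```

The two calls end in the same latent state because the transition is deterministic, while the decoded frames differ because each frame is sampled. In this setup, the spread of the samples is what reflects the model's uncertainty about a given frame.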

Figure 14: Inputs and samples from the building levels. The first two rows show the input and samples from the model; these continue into rows three and four. The process then repeats for another example. The top rows show the agent simulating building stairs. The second set shows the level with food placed at a high location, and the last set shows the agent building a tower to climb to a platform.
Figure 15: Inputs and samples from the terrain environment.