
COBRA: Data-Efficient Model-Based RL through Unsupervised Object Discovery and Curiosity-Driven Exploration

by Nicholas Watters, et al.

Data efficiency and robustness to task-irrelevant perturbations are long-standing challenges for deep reinforcement learning algorithms. Here we introduce a modular approach to addressing these challenges in a continuous control environment, without using hand-crafted or supervised information. Our Curious Object-Based seaRch Agent (COBRA) uses task-free intrinsically motivated exploration and unsupervised learning to build object-based models of its environment and action space. Subsequently, it can learn a variety of tasks through model-based search in very few steps and excel on structured hold-out tests of policy robustness.





Code Repositories


Spriteworld: a flexible, configurable python-based reinforcement learning environment


1 Introduction

Recent advances in deep reinforcement learning (RL) have shown remarkable success on challenging tasks (Mnih et al., 2015; Silver et al., 2016; Andrychowicz et al., 2018). However, data efficiency and robustness to new contexts remain persistent challenges for deep RL algorithms, especially when the goal is for agents to learn practical tasks with limited supervision. Drawing inspiration from self-supervised “play” in human development (Gopnik et al., 1999; Settles, 2011), we introduce an agent that learns object-centric representations of its environment without supervision and subsequently harnesses these to learn policies efficiently and robustly.

Our agent, which we call Curious Object-Based seaRch Agent (COBRA), brings together three key ingredients: (i) learning representations of the world in terms of objects, (ii) curiosity-driven exploration, and (iii) model-based RL. The benefits of this synthesis are data efficiency and policy robustness. To put this into practice, we introduce the following technical contributions:

  • A method for learning action-conditioned dynamics over slot-structured object-centric representations that requires no supervision and is trained from raw pixels.

  • A method for learning a distribution over a multi-dimensional continuous action space. This learned distribution can be sampled efficiently.

  • An integrated continuous control agent architecture that combines unsupervised learning, adversarial learning through exploration, and model-based RL.

COBRA is trained in two phases. During the initial exploration phase it explores its environment, in which it can move objects freely with a continuous action space but is not rewarded for its actions. In this phase, it learns how to see, predict, and act in a task-free setting. It uses these capacities during a subsequent task phase, in which it is trained through model-based RL and quickly learns to solve tasks.

2 Related Work

Our work builds upon three areas of prior research: curiosity-driven exploration, object-oriented RL, and model-based RL.

Curiosity-driven exploration

Curiosity-driven exploration was first proposed in (Schmidhuber, 1990b, 1991) and expanded upon in (Schmidhuber, 2010). These works introduce the notion of learning a “world model”, which can be used to plan or explore given a “curiosity” measure. More recently, some models (Haber et al., 2018; Laversanne-Finot et al., 2018) have shown that interesting exploration behaviors emerge through task-free curiosity-driven exploration in agents. Others have shown benefits of intrinsic motivation for learning RL tasks (Kulkarni et al., 2016; Eysenbach et al., 2018; Pathak et al., 2017), though the connections to policy robustness and object-based representations are not explored in these works.

Object-Oriented RL

Proposed as an alternative to classical MDPs, object-oriented RL (Diuk et al., 2008) promises data efficiency and generalization by leveraging representations of objects and their relations (Keramati et al., 2018). Often these approaches are in line with our intuition that representations factored in terms of objects support structured reasoning and facilitate solving many tasks. They can also be readily extended to be compatible with hierarchical RL (Roderick et al., 2017; Vezhnevets et al., 2017). However, most existing approaches require hand-crafting object representations or their relations (Diuk et al., 2008; Cobo et al., 2013; Garnelo et al., 2016; Lázaro-Gredilla et al., 2019). In contrast, our approach addresses the core problem of automatically discovering these from data without hand-crafting or supervision.

Model-based RL

The promises of learning and leveraging a model of the environment have been discussed before (Sutton and Barto, 2018), yet doing so in practice for complex environments remains a challenge (Deisenroth and Rasmussen, 2011). Recent works have made progress in this regard (Finn et al., 2016; Ha and Schmidhuber, 2018; Zhang et al., 2018; Kurutach et al., 2018; Kaiser et al., 2019), but model-free alternatives are still hard to beat in terms of asymptotic performance. Concurrent work by Hafner et al. (2018) combines intrinsic motivation with model-based RL on continuous control tasks. Our model differs in that it learns to infer a more structured latent representation by decomposing scenes into their constituent objects, which we hypothesize will help us scale up to more complex situations. However, our current policy learning approach is more limited, and hence a careful comparison with the Hafner et al. (2018) model would be interesting for future work.

3 The COBRA model

Figure 1: COBRA model schematic
A. Entire model. The vision module (scene encoder and decoder), transition model, and exploration policy are all trained in a pure exploration phase with no reward. B. Transition model architecture. An action-conditioned slot-wise MLP learns one-step future-prediction. This is trained by applying the scene decoder to the predicted next-step scene representation, through which gradients from a pixel loss are passed. An auxiliary transition error prediction provides a more direct path to the pixel loss and makes adversarial training with the exploration policy more efficient. C. Adversarial training of transition model and exploration policy, through which the behavior of moving objects emerges.

Our model (Figure 1) consists of four components: The vision module, transition model, and exploration policy are trained during an unsupervised exploration phase without rewards, while the reward predictor is trained during a subsequent task phase (with reward).

To present the parametrization of these components, we must first briefly discuss the environment: We are interested in environments containing objects that can be manipulated, in which agents can learn purposeful exploration without reward and a diverse distribution of tasks can be defined. In all of our experiments we use a 2-dimensional virtual “touch-screen” environment that contains objects with configurable shape, position, and color. The agent can move any visible object by clicking on the object and clicking in a direction for the object to move. Hence the action space is continuous and 4-dimensional, namely a pair of clicks. If the first click does not land in an object then nothing will move, regardless of the second click, so the action space is sparse. Despite its apparent simplicity, this environment supports highly diverse tasks that can challenge existing SOTA agents. See Section 4 for details about our environment and tasks.

3.1 Unsupervised exploration phase

We want our agent to learn how its environment “works” in a task-free exploration phase, where no extrinsic reward signals are available. If this is done well, then the knowledge thus gained should be useful and generalizable for subsequent tasks. Specifically, in the exploration phase COBRA learns representations of objects, dynamics, and its own action space, which it can subsequently use for any task in this environment. For the exploration phase we let COBRA explore our touch-screen environment in short episodes. Each episode is initialized with 1-7 objects with randomly sampled shape, position, orientation and color.

Vision Module

We use MONet (Burgess et al., 2019) to learn a static scene representation. MONet is an auto-regressive VAE that learns to decompose a scene into entities, such as objects, without supervision. We can view it as having an encoder and a decoder. The encoder maps an image to a tensor representing the full scene. This representation consists of a fixed number of entities, each encoded by the mean of MONet’s latent distribution. We hereafter call each row of the scene representation a “slot.” The decoder maps this tensor to a reconstructed image.

Critically, MONet learns to put each object from a scene into a different slot. Moreover, the MONet architecture ensures that each slot has the same representational semantics, so there is a common learned latent space to which each object is mapped. Each slot represents meaningful properties of the objects, such as position, angle, color and shape. See Figure S7 in Appendix G for an example. MONet can handle scene datasets with a variable number of objects by letting some slots’ codes lie in a region of latent space that represents no object. See Figure 2-A for scene decomposition results, indicating that MONet successfully learns to decompose scenes from our environment into objects.

While our environment is visually quite simple, in more complex environments MONet’s slots learn to represent other elements of a scene (e.g. backgrounds or walls in 3D scenes) and its latent space can capture more complex visual properties (e.g. surface glossiness). See (Burgess et al., 2019) for examples.

Figure 2: Scene decomposition, transition model, and exploration policy results
A. Vision module (MONet) decomposing scenes into objects (one column per sample scene). (First row) Data samples. (Second row) Reconstruction of the full scene. (Other rows) Individual reconstructions of each slot in the learned scene representation. Some slots are decoded as blank images by the decoder. B. Rollouts of transition model, treated as an RNN, compared to ground truth on two scenes. In each scene, one single item (indicated by dotted circle) is being moved along a line in multiple steps. C. Exploration policy. (Left) Position click component of random samples from the trained exploration policy, which learns to click on (and hence move) objects. (Middle) Slice through the first two dimensions (position click) of the action sampler’s quantile function, showing deformations applied on a grid of first clicks with randomized second click. (Right) Slice through second two dimensions (motion click). There is virtually no deformation, indicating the exploration policy learns to sample motions randomly.

Transition model

We introduce a method that maps an action and the scene representation at the current time step to a predicted next-step scene representation. This model applies a shared MLP slot-wise to the concatenation of each slot with the action. This is sufficient to predict the next-step representation, because our environment has no physical interactions between objects and any action will affect at most one object. To train this transition model, we cannot easily use a loss in representation space, because the encoding of the next image may have a different ordering of its slots than the predicted representation, and solving the resulting matching problem in a general, robust way is non-trivial. Instead, we circumvent this problem by applying the visual decoder (through which gradients can be passed) to the predicted representation and using a pixel loss in image space, computed between the decoded prediction and the observed next image.

Additionally, we train an extra network to predict the output of the pixel loss, which we use as a measure of curiosity when training the exploration policy, as we found this to work better and be more stable than using the pixel loss directly. See Appendix D for alternative transition models considered, including those that do not use pixel loss.

See Figure 2-B for examples of transition model rollouts after training in tandem with the exploration policy (covered below).

Exploration Policy

In many environments a uniformly random policy is insufficient to produce action and observation sequences representative of those useful for tasks. Consequently, for an agent to learn, through pure exploration, a transition model that transfers accurately to a task-driven setting, it should not take uniformly random actions. In our environment, this manifests itself in the sparseness of the action space. If an agent were to take uniformly random actions, it would rarely move any object because the agent must click within an object to move it and the objects only take up a small portion of the image ( per object). This is shown in Figure 3 (top) and Appendix A, and our transition model does not get enough examples of objects moving to be trained well in this condition. Hence we need our agent to learn in the exploration phase a policy that clicks on and moves objects more frequently.
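The sparseness argument above can be checked with a quick simulation. The sketch below is illustrative only: it assumes disc-shaped objects in a unit-square arena, and the radius passed in is a hypothetical value, not the object size used in our environment. It estimates how often a uniformly random position click lands inside any object.

```python
import math
import random


def random_hit_rate(centers, radius, n_samples=20000, seed=0):
    """Estimate the fraction of uniform position clicks that land in an object.

    Assumptions (not from the paper): objects are discs of equal radius and
    the arena is the unit square.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        px, py = rng.random(), rng.random()
        if any((px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2
               for cx, cy in centers):
            hits += 1
    return hits / n_samples


def disc_area(radius):
    """Exact hit probability for a single disc lying fully inside the arena."""
    return math.pi * radius ** 2
```

For a single disc that lies fully inside the arena, the estimate should approach the disc's area, so a small object yields only a small fraction of actions that move anything.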

Figure 3: Random policy and exploration policy
Observations and actions taken by an agent during the unsupervised exploratory phase. Actions are shown with small green arrows. (Top) Random agent, which rarely moves any object, provides a bad source of data for the transition model. (Bottom) trained exploration policy, which frequently moves objects, provides a good source of data for the transition model.

Our approach is to train the transition model adversarially with an exploration policy that learns to take actions on which the transition model has a high error. Such difficult-to-predict actions should be those that move objects (given that others would leave the scene unchanged). In this way the exploration policy and the transition model learn together in a virtuous cycle. This approach is a form of curiosity-driven exploration, as previously described in both the psychology (Gopnik et al., 1999) and the reinforcement learning literature (Schmidhuber, 1990a, b; Pathak et al., 2017). We apply this adversarial training of the transition model and exploration policy in an exploration environment with randomly generated objects (see Appendix C).

To put this idea into practice, we must learn a distribution over the 4-dimensional action space from which we can sample. To do this, we take inspiration from distributional RL (Bellemare et al., 2017; Dabney et al., 2017, 2018) and learn an approximate quantile function that maps uniform samples to non-uniform samples in our action space. This can be thought of as a perturbation of each point in the action space, parameterized by an MLP. We train this MLP to maximize the predicted error of the transition model’s prediction for the perturbed action, subject to a regularization on the perturbation magnitude. The individual loss terms are listed in Appendix B.

Note that this method does not pressure the learned function to be exactly a quantile function. Instead it is incentivized to learn a discontinuous approximation of the quantile function. We tried various approaches to parameterizing the exploration policy (see Appendix D) and found this one to work best.

While the vision module, transition model, and exploration policy can in principle all be trained simultaneously, in practice for the sake of simplicity in this current work we pre-trained the vision module on random frames from the exploration environment. We then reloaded this vision module with frozen variables while training the transition model and exploration policy in tandem in the active adversarial exploration phase just described. See Appendices B-C for architecture and training details.

See Figure 2-C for examples of samples from the trained exploration policy and a visualization of its deformation. Images of the exploration policy in action can be seen in Figure 3 (bottom), and see Appendix A for videos.

3.2 Task phase

Once the exploration phase is complete, COBRA enters the task phase, in which it is trained to solve several tasks independently. We freeze the trained vision module, transition model, and exploration policy, training only a separate reward predictor independently for each task. See Section 4 for task details.

Reward Predictor

For each task we train a model-based agent that uses the components learned in the exploration phase and only learns a reward predictor. The reward predictor is trained from a replay buffer of (observation, reward) pairs (see Algorithm 1).

Our agent acts with a simple 1-step search, sampling a batch of actions from the exploration policy, rolling them out for one step with the transition model, evaluating the predicted states with the reward predictor and selecting the best action (with epsilon-greedy). This is effectively a myopic 1-step Model Predictive Control (MPC) policy. It is sufficient to act optimally given the dense rewards in our tasks, but could readily be extended to multi-step MPC, other existing planning algorithms as in Hafner et al. (2018) or Monte-Carlo Tree Search (Silver et al., 2016). Also, see Appendix F for an extension that works with sparse rewards.

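Input: branching factor, training factor
Output: trained reward predictor
  initialize replay buffer
  for each step up to num_steps do
     encode the current observation into a scene representation
     initialize the lists of candidate actions and predicted rewards
     for each candidate up to the branching factor do
        sample an action from the exploration policy
        predict the next scene representation with the transition model
        predict the reward of the predicted scene with the reward predictor
     end for
     select the best action
     for each update up to the training factor do
        sample a minibatch from the replay buffer
        train the reward predictor on the minibatch
     end for
  end for
Algorithm 1 Task phase agent training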
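A minimal Python sketch of Algorithm 1 follows. It is not our implementation: every component (`reset`, `step`, `encode`, `sample_action`, `transition`, `predict_reward`, `train_reward`) is a hypothetical callable standing in for the environment and the learned modules, and the batch size, epsilon value, and the omission of episode resets are simplifying assumptions.

```python
import random


def one_step_search(scene, sample_action, transition, predict_reward,
                    branching_factor, epsilon, rng):
    """Myopic 1-step search over actions sampled from the exploration policy."""
    candidates = [sample_action(rng) for _ in range(branching_factor)]
    if rng.random() < epsilon:
        return rng.choice(candidates)  # epsilon-greedy exploration
    # Roll each candidate out for one step and score the predicted scene.
    scores = [predict_reward(transition(scene, a)) for a in candidates]
    return candidates[scores.index(max(scores))]


def train_task_phase(reset, step, encode, sample_action, transition,
                     predict_reward, train_reward, num_steps,
                     branching_factor, training_factor,
                     batch_size=16, epsilon=0.1, seed=0):
    """Train only the reward predictor; all other modules stay frozen."""
    rng = random.Random(seed)
    replay = []  # (observation, reward) pairs
    observation = reset()
    for _ in range(num_steps):
        scene = encode(observation)
        action = one_step_search(scene, sample_action, transition,
                                 predict_reward, branching_factor,
                                 epsilon, rng)
        observation, reward = step(action)
        replay.append((observation, reward))
        for _ in range(training_factor):
            batch = [rng.choice(replay) for _ in range(batch_size)]
            train_reward([(encode(o), r) for o, r in batch])
    return replay
```

Passing the modules in as callables keeps the sketch independent of any particular network library; the search itself only needs to sample, roll out one step, and take an argmax.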

4 Environment and Tasks

Our environment is a 2-dimensional square arena with a variable number of colored sprites, freely placed and rendered with occlusion but no collisions (see Figures 1-2). Agents use a continuous 4-dimensional click-and-push action space, where an action is a point in a 4-dimensional hypercube. The first two coordinates of an action are a "position click" and the second two are a "motion click". If the position click, treated as a point within the environment, falls within an object’s boundaries, then that object will take a small step in the direction specified by the motion click. If the position click does not fall within an object then no object moves, regardless of the motion click.
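The click-and-push semantics can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Spriteworld implementation: sprites are modelled as discs, actions are assumed to lie in the unit hypercube, the motion click is converted to a signed displacement by centring it at 0.5, and `step_scale` is a hypothetical constant.

```python
from dataclasses import dataclass


@dataclass
class Sprite:
    x: float
    y: float
    radius: float  # hypothetical: sprites modelled as discs


def step(sprites, action, step_scale=0.25):
    """Apply one action (px, py, mx, my); return True if a sprite moved."""
    px, py, mx, my = action
    # Later sprites are assumed to be drawn on top, so they get the click first.
    for sprite in reversed(sprites):
        if (px - sprite.x) ** 2 + (py - sprite.y) ** 2 <= sprite.radius ** 2:
            # A motion click centred at 0.5 gives a signed direction.
            sprite.x = min(1.0, max(0.0, sprite.x + step_scale * (mx - 0.5)))
            sprite.y = min(1.0, max(0.0, sprite.y + step_scale * (my - 0.5)))
            return True
    return False  # the position click missed every sprite, so nothing moves
```

Note that the motion click is ignored entirely when the position click misses, which is what makes the action space sparse.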

While this environment looks visually simple, it has some important features:

  • The multi-object arena reflects the compositionality of the real world, with cluttered scenes of objects that can share features yet move independently. This also provides ways to test robustness to task-irrelevant features/objects and combinatorial generalization.

  • The structure of the continuous click-and-push action space reflects the structure of space and motion in the world. It also allows the agent to move any visible object in any direction.

  • The notion of an object is not provided in any privileged way (e.g. no object-specific components of the action space) and can be fully discovered by agents.

Furthermore, difficult tasks can be designed in this environment that challenge state-of-the-art continuous control agents. We consider a suite of six tasks, grouped into three categories. For each task we have a held-out extrapolation set to test the agent’s policy robustness to task-irrelevant properties of the environment:

  • Goal-Finding. The agent must bring a set of target objects (identifiable by some feature, e.g. "green") to a hidden location on the screen, ignoring distractor objects (e.g. those that are not green). This goal location is fixed across episodes. To examine robustness, we test extrapolation over the number of targets, number of distractors, and task-irrelevant target features.

  • Sorting. The agent must bring each object to a goal location based on the object’s color. To examine robustness, we test extrapolation to an object combination that was not seen during training time (e.g. trained on {blue, red} pairs and {red, green} pairs, tested on {blue, green} object pairs). This tests whether the agent can learn to factorize and compose independent goals.

  • Clustering. The agent must arrange the objects in clusters according to their color. We test robustness to colors different from those used for training.

See Figure 5 for a visualization of our agent solving them (note that the targets are only shown for visualization purposes and are not provided to the agent). Some of our robustness tests are rather mild (e.g. robustness to different shapes when only color matters), while others are more demanding (e.g. increasing the number of target objects, or clustering novel color combinations). However, we believe all are intuitively reasonable behaviors to desire. See Appendix E for more details about our environment, tasks, and reward functions.

5 Results

We now show results on solving the tasks described above. As explained before, we are interested both in data efficiency (in terms of rewarded environment steps), as well as the robustness of the policies to task-irrelevant perturbations.

As our environment uses sparse continuous actions, finding appropriate baselines is challenging. We compare our agent to two baselines:

  • MPO raw: Maximum a Posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018), a state-of-the-art model-free continuous control algorithm known for its data-efficiency.

  • MPO handheld: MPO endowed with our agent’s vision module and exploration policy. This agent applies the vision model’s encoder to its image observations received from the environment and applies the exploration policy’s deformation to its actions before passing them to the environment.

Figure 4: Performance, Robustness, and Data Efficiency
A. Performance and robustness of agents after training until convergence. Top row shows test-time performance of agents on random environments sampled from the training distribution (higher is better). Bottom row shows robustness tests to out-of-distribution task-irrelevant environment perturbations (see main text for details). B. Data efficiency (lower is better). Computed as smallest number of on-task environment steps needed to reach and sustain average performance over 30 consecutive episodes. The corresponding number of episodes varies depending on task and agent performance, but for COBRA ranges from (Goal finding new shapes) to (sorting). Gray bars indicate no agent reached performance.
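The data-efficiency measure in panel B can be computed as sketched below. The performance threshold is left as a parameter, and reading "reach and sustain" as "the first window of consecutive episodes whose mean performance meets the threshold" is an assumption of this sketch.

```python
def steps_to_threshold(episode_steps, episode_scores, threshold, window=30):
    """Cumulative environment steps at the end of the first qualifying window.

    Returns None if no window of `window` consecutive episodes has a mean
    score of at least `threshold`.
    """
    cumulative = 0
    window_sum = 0.0
    for i, (steps, score) in enumerate(zip(episode_steps, episode_scores)):
        cumulative += steps
        window_sum += score
        if i >= window:
            window_sum -= episode_scores[i - window]  # drop the oldest episode
        if i + 1 >= window and window_sum / window >= threshold:
            return cumulative
    return None
```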

Because MPO is model-free, the MPO raw agent must be entirely re-trained for all tasks. This makes comparing data efficiency to our model difficult because our model’s unsupervised exploration phase only needs to be trained once and can then be used for any task. The data-efficiency comparison with MPO handheld is fairer because MPO handheld re-uses the same amount of pre-training as our agent. We posit that our model’s extreme on-task data efficiency (see log-scale in Figure 4-B) justifies our training paradigm, and note that even our agent’s exploration phase is more data efficient than MPO raw on any of the tasks. COBRA learns to solve most of our tasks using only 100-1000 rewarded environment steps, compared to for all MPO baselines. We can expect these gains to especially add up when the number of tasks to be achieved in this environment increases, as we effectively amortize the cost of our pretraining across all future tasks.

Our model performs well on the robustness tests (Figure 4-A), far exceeding the baselines on many of them. We presume this arises because the task-specific portion of our model is minimal, reducing over-fitting. Specifically, our model learns only a reward function, not a policy network. In general, we conjecture that learning goals and using a model to plan a policy results in more robustness than can be achieved with model-free RL. Example behaviors of our agent solving tasks are shown in Figure 5 (see also Appendix A for links to videos).

Figure 5: COBRA solving our tasks
Demonstration of a trained COBRA agent solving different tasks and its behaviour on robustness tests. Agent actions are shown with white arrows, and target goals are shown with crossed circles. These targets are only shown for visualization purposes and are not provided to the agent. See Appendix A for links to videos. Only five steps are displayed for each episode. (Top) Solving a “Goal finding” task. Having been trained only with a single distractor, COBRA is robust to the addition of a second distractor at test time. (Middle) Solving a “Sorting by color” task. COBRA has been trained to bring objects to different goals depending on their colours, seeing all pairs of colors except (blue, red). It is robust to testing on this held out combination of objects and successfully brings them to their targets. (Bottom) Solving a “Cluster by color” task. Having been trained only on clustering green/blue objects, COBRA successfully extrapolates its reward predictor, and hence its policy, to clustering red/yellow objects.

6 Discussion

We have introduced COBRA, which to our knowledge is the first agent to combine unsupervised learning of object-centric representations, curiosity-driven exploration, and model-based RL all together into one architecture. We demonstrated how this approach can be used to both achieve very high data-efficiency when learning tasks and yield policies that are robust to task-irrelevant perturbations.

We used an environment with a continuous action space where the agent can move any object, without hand-crafting the notion of object in any way. We considered instead using a discretized version of our touch-screen action space consisting of a fine mesh over the action space. While this would have allowed us to compare to numerous model-based discrete-action-space baselines, we found in early experiments that it was difficult to get any agent to train with such a large discrete action space. Furthermore, we believe that the continuous version provides a more realistic geometric relationship between action space and environment. We also considered using the action space of Haber et al. (2018), which has a distinct component controlling each object in the scene (according to an arbitrary ordering). While that would have made exploration much easier, we believe it builds in the notion of an object in a way that circumvents the important problem of object discovery and is unlikely to scale to large numbers of objects.

Our current version of the transition model does not take into account interactions between objects, though we think this limitation can be readily overcome by incorporating a GraphNet (Battaglia et al., 2018). Furthermore, it might seem that training the transition model using a pixel loss could give rise to problems for small objects. However, as we do not update the variables of MONet based on the transition model’s pixel loss, our transition model does not suffer from this issue. One limitation of our transition model is its lack of memory. While this is fine in a fully observable environment like the one we use, in more complex environments we would like a transition model that remembers out-of-view objects. Exploring this would be an interesting direction of future work.

Our exploration policy is a form of distributional RL (Bellemare et al., 2017; Dabney et al., 2018, 2017), though we found that the Huber quantile regression loss used in existing distributional RL work scaled poorly with dimension and was unable to learn the sparse multimodal 4-dimensional distributions needed by COBRA. This motivated our alternative deformation-based approximate quantile loss, which we found to train quite robustly. Note, however, that our method is not pressured to learn a smooth quantile function, and is better thought of as learning basins of attraction around high-value (i.e. high-curiosity) actions. While we found this to scale well with dimensionality, it has no pressure to force the basins of attraction to all have the same volume, hence in our environment it prefers moving objects surrounded by empty space to moving objects next to a boundary or other object. We did not find this to impact agent performance, though it might be a concern in more cluttered environments.

Our use of transition model error as a metric of curiosity is quite simple. While it works on our environment, it would not be suitable for environments with unpredictable dynamics. The Intrinsic Curiosity Module of Pathak et al. (2017) is complementary to our model and would be a natural way to address this limitation. The approach in (Laversanne-Finot et al., 2018), which samples goals according to predicted policy improvement, is also appealing, though parameterizing the goal space is challenging in environments with a variable number of objects.

While COBRA’s 1-step search planning policy is sufficient for our tasks, to solve long-term credit assignment for complicated tasks we would likely need to do multi-step rollouts, such as Model Predictive Control or Monte-Carlo Tree Search. One orthogonal but important limitation of stateless rollout-based planning is the inability to solve tasks that require memory, such as a task where a goal location temporarily flashes on the screen.

In summary, natural directions for future work include: more complex environments (e.g. 3 dimensions, physics, and richer visuals); more complex avatars (e.g. embodied, with multiple joints/limbs); learning to set goals; multi-step planning in the exploration phase; and learning a policy at task-time rather than simple search.


We thank Matt Botvinick, Tiago Ramalho, and Tejas Kulkarni for helpful discussions and insights.


Appendix A Agent Videos

Follow this link to see videos of our agent’s performance on the training and robustness testing modes of all of our tasks and a README summarizing them:

Appendix B Model Architectures

In all networks we used ReLU activations, weights initialized by a truncated normal (Ioffe and Szegedy, 2015), and biases initialized to zero.

B.1 Vision Module

For the vision model we use MONet, with nearly all the same hyperparameters as in Burgess et al. (2019). The only differences were: (i) we use a 3-layer Spatial Broadcast Decoder (instead of 4-layer) (ii) we use and . These differences seemed to improve disentangling and decomposition slightly for our dataset, though not much; the default MONet parameters worked well out-of-the-box.

We set the number of entities in the model to 8. Note that this is more than the maximum number of objects in our exploration phase environment (which is 6), but MONet always uses one slot to encode the background (even though that is black in our case) and an extra slot does not hurt the model.

We preprocess each input image to MONet by rescaling its (R, G, B) color channels by (1.25, 1.0, 0.75) respectively. This makes training slightly more stable and efficient by asymmetrizing the color generative factors in the dataset, helping the VAE’s color latents emerge more easily.
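As a concrete illustration of this preprocessing step, the sketch below rescales the channels of an image stored as nested lists of (R, G, B) tuples. The image layout is an assumption of the sketch, and no clipping is applied afterwards since none is described here.

```python
def rescale_colors(image, scale=(1.25, 1.0, 0.75)):
    """Multiply the (R, G, B) channels of every pixel by per-channel factors.

    `image` is assumed to be a list of rows, each a list of (R, G, B) tuples.
    """
    return [[tuple(c * s for c, s in zip(pixel, scale)) for pixel in row]
            for row in image]
```

After rescaling, a gray pixel no longer has equal channel values, which is the asymmetry between color factors that the preprocessing is meant to introduce.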

B.2 Transition Module

Our transition model consists of a slot-wise MLP which outputs both the predicted next scene and a transition error prediction. The model itself is a single network, which is an MLP with 3 hidden layers each of size 512 and an output of size 9. We treat the first 8 components of this output as the delta of the scene representation and the last component as an error prediction contribution, which we sum over slots to get the error prediction. Specifically, the scene representation prediction is obtained by adding this delta to the current scene representation (for the sake of clarity we did not mention this in the main text). We use the error prediction output as a proxy for the transition model’s error when training the exploration policy adversarially because it is more efficient and stable to pass gradients through than the full MONet decoder.

The model has 3 loss terms:

  • Future-prediction loss .

  • Error-prediction loss .

  • Regularization (since predicts the delta of the scene representation).

The total loss for is the sum of these terms (with no reweighting/coefficients).

We believe that predicting the delta of the scene representation is not necessary. Namely, we expect that letting itself be the output of and removing the regularization loss term would work just as well, though we have not tried this simplification of the model.

B.3 Exploration Policy

The exploration policy has a single MLP with 2 hidden layers each of size 64. The output size of this MLP is 8, because (while not mentioned in the main text) this network outputs the mean and scale of a 4-dimensional Gaussian distribution, from which we sample to get a deformation. We found that using a distribution like this rather than predicting the deformation deterministically helps training efficiency and stability.
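Sampling a deformation from the 8-dimensional output could look like the sketch below. The text only states that the output parameterizes the mean and scale of a 4-dimensional Gaussian. The ordering of the outputs (means first, then scales) and the softplus used to keep the scales positive are assumptions made for illustration.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x)); maps any real number to a positive one."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def sample_deformation(mlp_output, rng=random):
    """Split an 8-vector into (mean, scale) and draw one 4-D Gaussian sample."""
    assert len(mlp_output) == 8
    means = mlp_output[:4]
    # Assumption: the raw outputs are passed through softplus to obtain scales.
    scales = [softplus(raw) for raw in mlp_output[4:]]
    sample = [rng.gauss(mu, sigma) for mu, sigma in zip(means, scales)]
    return sample, means, scales


# Seeded generator so that the draw is reproducible.
deformation, means, scales = sample_deformation([0.0] * 8, random.Random(0))
```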

We train this network during the exploration phase. Given an action (where is sampled uniformly in ), we use the following 3 loss terms:

  • Transition model error prediction , where the gradients are passed through the transition model without updating the transition model network itself.

  • Deformation regularization . For this we use a coefficient of .

  • If the scale of any coordinate of the deformation distribution is less than , we add a penalty of times that coordinate’s scale. This ensures stability.

B.4 Reward Predictor

The reward predictor must map a tensor of slots representing an environment state to a scalar reward prediction . To take advantage of the slot-structured representation of objects in , we use a Relation Network (Santoro et al., 2017), a sub-type of graph network (Battaglia et al., 2018). Specifically, we use two MLPs and with layer sizes (128, 128) and (128, 1) respectively. We compute the reward prediction as .

B.5 Baselines

For all MPO-based models, we used the hyperparameters in Table S1. We used trajectories of length 2 in the replay (longer trajectories yielded lower performance on our tasks, likely because some of the tasks can often be solved in 3 or 4 steps).

For the MPO from slot representations, we used a slot-structured network similar to our agent’s reward predictor for both the actor network and the critic network. Specifically, we applied a per-slot MLP to each slot, then summed across slots, then applied a global MLP. For the actor network, the per-slot MLP had output sizes (128, 128) and the global MLP had output sizes (128, 128, 8) (the final output size must be 8 since the action space is 4-dimensional). For the critic network, the per-slot MLP had output sizes (512, 512) and the global MLP had output sizes (256, 1).
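The "per-slot MLP, sum across slots, global MLP" pattern can be sketched with toy stand-ins for the two networks. The functions `toy_per_slot` and `toy_global` are hypothetical placeholders rather than the trained MLPs; the point of the example is that summing over slots makes the output independent of slot order.

```python
def slot_structured_net(slots, per_slot_fn, global_fn):
    """Apply per_slot_fn to each slot, sum the results, then apply global_fn."""
    embedded = [per_slot_fn(slot) for slot in slots]
    pooled = [sum(values) for values in zip(*embedded)]
    return global_fn(pooled)


# Toy stand-ins for the two MLPs (hypothetical, not the trained networks).
def toy_per_slot(slot):
    return [max(0.0, slot[0] - slot[1]), max(0.0, slot[0] + slot[1])]


def toy_global(pooled):
    return 2.0 * pooled[0] - pooled[1]


slots = [[1.0, 0.5], [0.0, 2.0], [3.0, 1.0]]
out_original = slot_structured_net(slots, toy_per_slot, toy_global)
out_reversed = slot_structured_net(list(reversed(slots)), toy_per_slot, toy_global)
```

Both calls return the same value here, which is the property that makes this architecture a natural fit for slot representations whose ordering is arbitrary.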

For the MPO from image, both the actor and critic had the same architecture: We applied a 2-layer CNN with kernel sizes 3x3, 32 channels per layer, and stride 2 to the input image, then flattened and applied a 3-layer MLP with hidden sizes (256, 256).

We chose these network sizes after numerous hyperparameter sweeps, enough that we can confidently say that neither changing the size of any MLP layer by a factor of two nor changing the number of layers in any of these networks improves overall performance.

target update period batch size
0.2 0.99 200 512
Table S1: MPO hyperparameters.

All MPO-based models were trained for gradient steps with batch size 512 using the Adam optimizer with learning rate .

Appendix C Training details

For our agent, we pre-trained the vision module on a dataset of random static frames from the exploration environment. We did not try training it online with the transition model and exploration policy but, because of the modularity of the model, we believe that would work similarly (as long as the transition model and exploration policy don’t get caught in local minima while the vision module is training).

We trained the vision module using the RMSProp optimizer with learning rate as in (Burgess et al., 2019) and batch size 16 for gradient steps.

We trained the transition model and exploration policy using the Adam optimizer (Kingma and Ba, 2015) with learning rate for gradient steps with batch size 16. The exploration environment was initialized with a random number of objects per episode (between 1 and 7), each with random shape, color and initial location. Exploration environment episodes lasted 10 steps. For efficiency, we used a distributed actor setup: 1 learner running on GPU trained the transition model and exploration policy networks, while 32 actors running on separate CPUs used the exploration policy to collect environment transitions and write them to a replay with capacity . Each actor fetched exploration policy variables from the learner once per 50 environment steps. Thus our agent was trained off-policy during the exploration phase.

The total number of environment steps used by our model’s exploration phase is strictly upper-bounded by , though due to the distributed nature of the setup it is difficult to determine the number precisely. Note that for COBRA this exploration phase need only be run once, after which any number of tasks can be learned. Note also that the MPO handheld baseline requires this exploration phase as well, to train the vision representations and exploration policy.

For the task-phase, we trained our COBRA agent with branching factor B = 128, training factor N = 10, and batch_size = 16. We used an epsilon-greedy training policy with epsilon = 0.2 and trained the agent for 1000 environment episodes. Depending on the task and how quickly the agent learned, this corresponded to somewhere between 5,000 and 30,000 environment steps. This was sufficiently long for COBRA to reach what appeared to be asymptotic performance on all the tasks.
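A one-step search with branching factor B = 128 and an epsilon-greedy policy with epsilon = 0.2 might be organised as in the sketch below. The callables `sample_action`, `transition_fn` and `reward_fn` are hypothetical stand-ins for the exploration policy, transition model and reward predictor. How the exploratory action is drawn is not specified in the text, so picking a random candidate is an assumption.

```python
import random


def choose_action(state, sample_action, transition_fn, reward_fn,
                  branching_factor=128, epsilon=0.2, rng=random):
    """One-step model-based search with epsilon-greedy exploration."""
    candidates = [sample_action(rng) for _ in range(branching_factor)]
    if rng.random() < epsilon:
        # Assumption: the exploratory move is a random candidate.
        return rng.choice(candidates)
    # Otherwise imagine the outcome of each candidate and keep the best one.
    return max(candidates, key=lambda a: reward_fn(transition_fn(state, a)))


# Toy 1-D usage: the state is a position and the goal sits at 0.3.
toy_rng = random.Random(0)
action = choose_action(
    state=0.0,
    sample_action=lambda r: r.uniform(-1.0, 1.0),
    transition_fn=lambda s, a: s + a,
    reward_fn=lambda s: -abs(s - 0.3),
    rng=toy_rng,
)
```

The sketch covers only action selection; training of the reward predictor and the role of the training factor N are left out.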

All MPO-based models were trained for gradient steps with batch size 512 using the Adam optimizer (Kingma and Ba, 2015) with learning rate . This was sufficiently long to reach what appeared to be asymptotic performance on all the tasks. Like our agent’s exploration policy, this used a distributed actor/learner setup with 1 GPU learner and 32 actors.

The distributed-actor nature of both the MPO models and our agent during its exploration phase makes drawing conclusive claims about data efficiency inherently difficult. We swept over the number of actors and found a significant degradation of MPO learning efficiency with fewer actors (i.e. when the learner must reuse a greater portion of the replay during training). However, fully exploring the performance/data efficiency trade-off as the number of actors is varied was beyond our scope. We do note that the rate of learner gradient steps for all MPO baselines far exceeds the rate of data collection by the actors (by a factor of roughly 2 in the vision-based baselines and roughly 100 in the state-based baselines), so they were certainly reusing their replay buffer to a large extent. In fact, on most tasks the replay reuse probability of the state-based MPO models was very similar to the replay reuse probability of our agent at task-time.

Consequently, while the exact data efficiency numbers we report should be taken with a grain of salt, we’re confident that the data efficiency difference between our model and the baselines is very significant (given that it is several orders of magnitude).

Appendix D Model Variations Considered

Here we recount some of the model variations we explored before arriving at the model described in the main text. We focus on the transition model and exploration policy, which are the primary algorithmic contributions of this work. We hope this may be useful for researchers exploring closely related models.

Transition Model

Before converging on the pixel-loss-via-decoder method described in Section 3, we tried a variety of methods to train the transition model with a loss in latent space. The aim here was to pressure to be similar to .

However, as mentioned in the main text, the slot ordering of may differ from that of , because the order in which MONet encodes the slots is a complicated implicit function that is sensitive to small changes in the visual input. Thus, to compute a loss in latent space, we need to solve a matching problem.

There are many ways to find such a matching just by looking at the slot representations and without considering all possible slot orderings. One that worked well for us was to take each slot of and match it with the slot of with which it had lowest mean squared error (allowing for double-assignment). Using KL divergence instead of mean squared error also worked.
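The lowest-MSE matching with double-assignment can be sketched as follows. The sketch assumes that each predicted slot is matched to the closest encoded slot of the next observation (the direction of the matching is lost in the text above), and the function names are illustrative.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def match_slots(predicted, encoded):
    """For each predicted slot, return the index of the nearest encoded slot.

    Several predicted slots may map to the same encoded slot, i.e.
    double-assignment is allowed.
    """
    return [
        min(range(len(encoded)), key=lambda j: mse(pred, encoded[j]))
        for pred in predicted
    ]


def matched_loss(predicted, encoded):
    """Mean over predicted slots of the MSE to their matched encoded slot."""
    matches = match_slots(predicted, encoded)
    total = sum(mse(p, encoded[j]) for p, j in zip(predicted, matches))
    return total / len(predicted)


predicted = [[0.0, 0.0], [1.0, 1.0]]
encoded = [[0.9, 1.1], [0.1, -0.1]]  # same objects, slots swapped
matches = match_slots(predicted, encoded)
loss = matched_loss(predicted, encoded)
```

This costs K x K comparisons for K slots and never enumerates the K! possible orderings. It also only compares static representations, which is exactly the drawback discussed next.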

However, we were bothered by a subtle drawback of these matching methods: They are based purely on static scenes across time without considering how objects move. For example, if two similar-looking objects cross paths and seem to “switch places” in consecutive timesteps, they could be incorrectly matched, which would throw off the transition model’s training.

Thus we find such matching approaches unnatural. We prefer using pixel loss through the MONet decoder, as it is a very general principle and uses simplicity of dynamic prediction to determine the slot-ordering of future timesteps.

Exploration Policy

As mentioned in Section 6, our exploration policy is a form of distributional RL.

The first approach we tried for the exploration policy was to parameterize the action distribution density function by a neural network and sample actions via rejection sampling. This is effectively the method used in (Haber et al., 2018) and works when the curiosity signals are dense. However, we found the sampling to be computationally expensive in our setting of sparse curiosity signals, and were intrigued by the prospect of learning a way to sample directly without rejection, hence quantile regression.

As mentioned in Section 6, we initially tried using a multi-dimensional version of the Huber quantile function loss typically used in distributional RL, but this was very slow to train. Controlled experiments with artificial curiosity functions showed it to scale poorly both with sparsity of the curiosity function and with dimensionality of the action space.

We then arrived at our deformation method, which trained quickly and scaled surprisingly well: In experiments with artificial curiosity signals we found it to work well even in 50-dimensional action spaces. In addition to working well in our context, we find the prospect of learning attractor dynamics in an action space quite appealing, and imagine similar approaches may be more generally useful in continuous control.

Curiosity Signal

As mentioned in Section 6, our use of the transition model error as a curiosity signal is rudimentary, though could easily be extended to a more general method as in (Pathak et al., 2017).

However, we originally hoped to use an entirely different method: the intrinsic uncertainty of the transition model as a metric of curiosity. Namely, we aimed to build a stochastic transition model and use its variance as curiosity. We came very close to getting this to work, but were ultimately foiled by some subtleties of MONet’s representation.

Specifically, making the transition model stochastic was easy — that can be done by letting the transition model parameterize the mean and variance of the slots in . It can be trained via a KL penalty with the distribution of if using a slot-matching loss, or if using the decoder-pass-through pixel loss can be trained by sampling from before decoding.

The trouble comes when trying to use the variance of the transition model as uncertainty. As shown in Figure 2, in order to handle a variable number of objects MONet lets some of its slots encode no object (e.g. a blank image). This implies that there is a region in latent space that is decoded to a blank image. Consequently, when MONet infers a blank slot it can afford to use high variance in the latent representation. In practice, the variance for blank slot encodings in MONet is much higher than that for non-blank slots, and this is amplified with the number of blank slots (which, given our exploration environment had 1-6 objects and 8 slots, was more than 50%). Hence the high variance for blank slots in the stochastic transition model drowns out the signal from uncertainty about objects moving, so is a poor curiosity signal.

There are certainly some tricks one could use to circumvent this problem (e.g. explicitly excluding blank slots from the uncertainty calculation, or dividing by inferred variance). However, we found these unsatisfactory and did not see an elegant solution so ended up using the pixel loss, which is very simple but works.

Appendix E Environment details

Our environment was a square, 2-dimensional world with geometric objects varying in position, shape, and color. To render this world, we used the PyGame renderer with an anti-aliasing factor of 5 and output shape (64, 64, 3). Objects could occlude (in a consistent manner with each object living on a different z-layer) but could not collide/interact in any other way. Object size did not vary in the environment. Instead, each shape’s size was fixed so that its area was (in units of squared frame-width).

As mentioned in the main text, the action space for this environment is the continuous hypercube . We view the first two components of an action as a "position click" and the last two as a "motion click" . If the position click, when viewed as a point in the rendered environment, lands within the boundary of an object, that object is moved in the direction of the motion click (centered so that a motion click of corresponds to zero motion). The magnitude of the motion is scaled by a factor of 0.25 and we add scale-0.05 Gaussian noise to motions, so . Note that because the motion click is centered, and are both in , hence the furthest an object can move in each direction is 0.125 (and the furthest overall is ), so the agent can move an object from any point in the environment to any other point within 8 steps. Note also that including the Gaussian noise added to the motion did help the transition model/exploration policy training loop during our agent’s exploration phase, as it introduced noise to object motions (hence unavoidable prediction error when objects moved). However, even without this noise the exploration policy did learn to click on objects, though it showed some bias toward clicking near object borders.

We ensure that objects never exit the frame. To do this, after an object moves we clip each coordinate of its position to .
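The motion-click semantics can be sketched in a few lines. The factor 0.25 and the noise scale 0.05 come from the text. The click range [0, 1], the zero-motion click at 0.5, and the clipping range [0, 1] are inferred from the stated maximum displacement of 0.125 per direction and the 8-step bound. Adding the noise after scaling is an assumption, and hit-testing of the position click against object boundaries is omitted.

```python
import random

MOTION_SCALE = 0.25  # from the text
NOISE_SCALE = 0.05   # from the text


def motion_from_click(motion_click, rng=None):
    """Convert a motion click in [0, 1]^2 into a displacement vector.

    With rng=None the motion is noise-free; otherwise Gaussian noise is
    added after scaling (assumption).
    """
    if rng is None:
        noise = (0.0, 0.0)
    else:
        noise = (rng.gauss(0.0, NOISE_SCALE), rng.gauss(0.0, NOISE_SCALE))
    return tuple(MOTION_SCALE * (m - 0.5) + n for m, n in zip(motion_click, noise))


def move_object(position, motion_click, rng=None, low=0.0, high=1.0):
    """Move an object and clip each coordinate to the frame.

    Assumption: the frame spans [0, 1] in each coordinate.
    """
    delta = motion_from_click(motion_click, rng)
    return tuple(min(high, max(low, p + d)) for p, d in zip(position, delta))


# Noise-free: a maximal motion click crosses the whole frame in 8 steps.
position = (0.0, 0.0)
for _ in range(8):
    position = move_object(position, (1.0, 1.0))

# Noisy version of a single step.
noisy_position = move_object((0.5, 0.5), (1.0, 0.5), rng=random.Random(0))
```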

For each task, the environment procedurally generated each episode from an initialization distribution. Table S2 summarizes some of the properties of the tasks. The details not easily tabulated are detailed as follows:

  • Shape Robustness. During training the object’s shape is a square. During robustness testing it is in {circle, triangle}. We used a shaped reward which is linearly inversely proportional to the distance from the object to the goal location, which is always the center of the frame. If the object is brought within a distance 0.075 (in units of the frame width) of the center of the frame, the episode is considered a success and resets (assuming this happens before the maximum episode length timeout).

  • Position Robustness. One object with hue in is the target. One object with hue in is the task-irrelevant distractor. During training, the target’s position is initialized uniformly at random in the frame except the lower-right quadrant. For robustness testing, the target’s position is initialized uniformly at random in the lower-right quadrant. The distractor’s position is unconstrained in for both modes. The reward and termination criterion are the same as in the shape robustness task, as a function of the target only.

  • # Targets Robustness. The target(s) has/have shape square. The distractors have shape in {circle, triangle}. During training there is 1 target and 2 distractors. During robustness testing there are 2 targets and 2 distractors. Again, the reward is linearly inversely proportional to the distance from the target to the center of the frame. When there are two targets, the reward is the sum of the per-target rewards, and the episode terminates only when both meet the termination criterion, i.e. both are within a distance 0.075 of the frame center.

  • # Distractors Robustness. The targets have shape square. The distractor(s) has/have shape in {circle, triangle}. During training there are 2 targets and 1 distractor. During robustness testing there are 2 targets and 2 distractors. Again, the reward is the sum over targets of their goal-finding rewards.

  • Sorting. There are 5 narrow hue distributions: red [0.9, 1], blue [0.55, 0.65], green [0.27, 0.37], purple [0.73, 0.83], and yellow [0.12, 0.22]. Each color has its own goal location; these are [0.75, 0.75], [0.75, 0.25], [0.25, 0.75], [0.25, 0.25], and [0.5, 0.5] respectively. During training a random pair of colors is sampled, excluding the pair (red, blue). Objects of these colors are generated, and the agent must bring each to its respective goal location. For robustness testing, the held-out (red, blue) color pair is used. This tests robustness to an unseen combination of objects and the agent’s ability to correctly factor and recompose the objects’ respective goals. The reward here is the sum of the per-object goal-finding rewards of the two objects in each episode, except that here the goal location is the center of the frame only for yellow. Agents are also given a bonus reward if they successfully complete the task. This did not affect the performance of our agent at all, but did help the MPO baselines.

  • Clustering. Given the color hue distributions in the sorting task, the environment samples a pair of colors for each episode and generates 2 objects of each color. The agent is rewarded for clustering the objects by color, namely bringing each pair of similarly-colored objects together while creating a sufficient inter-pair distance. During training, the color pair is always (blue, green). For robustness testing, the color pair is (purple, yellow). To compute the reward we used the inverse of the Davies-Bouldin clustering metric (Davies and Bouldin, 1979) and terminated the episode when this inverse clustering metric is higher than 2.5, in which case the agent received a bonus reward (which, as in the sorting task, helped only the MPO baselines).
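The clustering criterion can be illustrated with a small sketch of the Davies-Bouldin index, using the standard definition with Euclidean distances, and the 2.5 threshold on its inverse. Grouping objects into clusters by color follows the task description. The function names, the use of mean point-to-centroid distance as the scatter term, and the handling of degenerate cases are assumptions.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index of two or more point clusters (lower is better)."""
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its cluster centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    worst_ratios = []
    for i in range(len(clusters)):
        ratios = []
        for j in range(len(clusters)):
            if j == i:
                continue
            separation = math.dist(centers[i], centers[j])
            # Coinciding centroids count as infinitely bad separation.
            ratios.append(
                math.inf if separation == 0 else (scatter[i] + scatter[j]) / separation
            )
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / len(clusters)


def episode_done(clusters, threshold=2.5):
    """True once the inverse index exceeds the threshold from the text."""
    index = davies_bouldin(clusters)
    return index == 0 or 1.0 / index > threshold


# Two colors, two objects each: tightly clustered versus mixed up.
tight = [[(0.2, 0.2), (0.3, 0.2)], [(0.8, 0.8), (0.7, 0.8)]]
mixed = [[(0.2, 0.2), (0.8, 0.8)], [(0.2, 0.8), (0.9, 0.2)]]
tight_done = episode_done(tight)
mixed_done = episode_done(mixed)
```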

 | | (H, S, V) | initial position (x, y) | | number of
exploration | 2 | | ([0, 1], [0, 1]) | [square, circle, triangle] | [1, 6]
shape robustness | 20 | | ([0, 1], [0, 1]) | see text | 1
init position robustness | 20 | see text | see text | [square, circle, triangle] | 2
# targets robustness | 20 | ([0, 0.5], [0.3, 1], 1) | ([0, 1], [0, 1]) | see text | 3
# distractors robustness | 20 | see text | ([0, 1], [0, 1]) | see text | 3
sorting | 50 | ([0, 0.5], [0.3, 1], 1) | ([0, 1], [0, 1]) | [square, circle, triangle] | 5
clustering | 50 | see text | ([0, 1], [0, 1]) | [square, circle, triangle] | 4
Table S2: Specifications for each task. These are the easily-summarized components of the task specifications. See text for the boxes that are more complicated.

Appendix F Sparse rewards

Figure S6: Training curve in Sparse reward situation

Comparison between a COBRA agent using a Reward predictor or a Value predictor, when solving Goal Finding tasks with sparse terminal rewards only. Moving average over 50 episodes; shaded color indicates one standard deviation around the median, over 5 replicas.

In the tasks we presented so far, we used dense, shaped rewards. This meant that a single step search algorithm and a reward predictor are sufficient to perform optimally. However, we could relax this requirement, and use sparse rewards instead. This requires deeper search rollouts, or the use of a value function.

We implemented a Value function version of our search agent, where the Reward predictor is replaced by a Value predictor. This can be trained in a similar fashion, using TD-learning:


We found that using a sum of these three losses offered the fastest and most stable estimation of the Value function in our setting; however, in some situations we also had good success with using only . Finally, we act so that actions which maximize are selected with our 1-step search.

We demonstrate this on a modified version of the Goal Finding - Shape robustness task, where we provide a reward of 1 only when the target reaches the goal, and 0 everywhere else. As can be expected, this increases the difficulty of the task dramatically (see the training curves in Figure S6). The Reward predictor version of COBRA only reaches 50% success, as it can only bring sprites that appear close to the goal to the goal. This is because some actions will lead to predicted scenes where the target is close to the center, and due to the smoothness of our reward predictor, there is a non-zero reward signal to follow. Using a value predictor instead successfully solves all situations, without this requirement for the sprite to be close to the goal.

Appendix G Vision Module Disentangling

As shown in Burgess et al. (2019), the MONet model not only learns a scene decomposition but can also learn a disentangled latent space for object representations. We found that indeed this was the case for our dataset, as can be seen from the latent-space traversals from the model shown in Figure S7. We believe that this disentangling may have helped our agent be robust to some perturbations (e.g. perturbing irrelevant features), though we did not explicitly incorporate any attention mechanisms in our reward predictor or encourage it in any way to be invariant to any latent components.

Figure S7: Vision module disentangling. Each row shows the effect on reconstructions of sweeping one latent component from -1.5 to 1.5 (keeping all other latent components fixed). Only the 6 most significant latents are shown; the model learns two more non-coding "none" latents that revert to their Gaussian prior. The latent space is disentangled, as indicated by the labels (assigned post-hoc by eye) on the right.

Appendix H Ablations and Additional Baselines

In the main text we compare our COBRA agent to two baselines: (i) Raw MPO from pixels and (ii) MPO with slot/object-structured MONet features and our agent’s learned exploration policy. While the combination of MONet features and exploration policy improved MPO’s performance, it is unclear which factor was primarily responsible for this improvement. Consequently, here we analyze them independently, running both an MPO with only MONet features and an MPO with only our agent’s exploration policy. These results are shown in Figure S8. In terms of data efficiency, both of these perform in between the raw MPO and the MPO with both MONet features and exploration policy. In terms of asymptotic performance and robustness/generalization, it seems that the exploration policy provides a bigger boost than MONet features on our task suite.

Figure S8: Ablation study for MPO baselines
(left) Performance and Robustness. Test-time agent performance on environments sampled from the training distribution (top) and from the robustness generalization tasks (bottom), analogous to main text Figure 4-A. (right) Data efficiency. Number of environment steps to reach 90% test performance on the training task distributions. This shows MPO agents with each combination of visual slot-structured features (from our model) and exploration policy (from our model). The one with both features and exploration policy is what we refer to in the main text as "MPO handheld". Note the log scale and axis range of the data efficiency plot.
Figure S9: Ablation of Exploration policy for our agent
Our COBRA agent is shown in green, and the same agent without using the exploration policy at task time is shown in blue. Without the exploration policy, this ablated agent samples actions uniformly at random in when doing model-based search. It does, however, use the exploration policy during the exploration phase (this is always necessary to learn a good transition model). Note that unlike Figures 4 and S8, the data efficiency plot here does not use a log scale. (left) Robustness. Test-time agent performance on environments sampled from the training distribution (top) and from the robustness generalization tasks (bottom), analogous to main text Figure 4-A. (right) Data efficiency. Number of environment steps to reach 90% test performance on the training task distributions.

Appendix I Additional Transition Model Rollouts

In Figure 2 in the main text we show 2 examples of transition model rollouts. Here in Figure S10 we show many more.

Figure S10: Transition model rollouts. As in Figure 2, each pair of rows compares the ground truth environment observations from a sequence of actions to rollouts from the transition model on the same sequence of actions (using the vision module decoder as a renderer). The actions are indicated by the red arrows (which display the motion click vector centered at the position click location). These action sequences were generated by using the same motion click repeatedly while moving the position click to the ground-truth object location at each step.