Entity Abstraction in Visual Model-Based Reinforcement Learning

10/28/2019 ∙ by Rishi Veerapaneni, et al.

This paper tests the hypothesis that modeling a scene in terms of entities and their local interactions, as opposed to modeling the scene globally, provides a significant benefit in generalizing to physical tasks in a combinatorial space the learner has not encountered before. We present object-centric perception, prediction, and planning (OP3), which to the best of our knowledge is the first entity-centric dynamic latent variable framework for model-based reinforcement learning that acquires entity representations from raw visual observations without supervision and uses them to predict and plan. OP3 enforces entity-abstraction – symmetric processing of each entity representation with the same locally-scoped function – which enables it to scale to model different numbers and configurations of objects from those in training. Our approach to solving the key technical challenge of grounding these entity representations to actual objects in the environment is to frame this variable binding problem as an inference problem, and we develop an interactive inference algorithm that uses temporal continuity and interactive feedback to bind information about object properties to the entity variables. On block-stacking tasks, OP3 generalizes to novel block configurations and more objects than observed during training, outperforming an oracle model that assumes access to object supervision and achieving two to three times better accuracy than a state-of-the-art video prediction model.




1 Introduction

A powerful tool for modeling the complexity of the physical world is to frame this complexity as the composition of simpler entities and processes. For example, the study of classical mechanics in terms of macroscopic objects and a small set of laws governing their motion has enabled not only an explanation of natural phenomena like apples falling from trees but the invention of structures that never before existed in human history, such as skyscrapers. Paradoxically, the creative variation of such physical constructions in human society is due in part to the uniformity with which human models of physical laws apply to the literal building blocks that comprise such structures – the reuse of the same simpler models that apply to primitive entities and their relations in different ways obviates the need, and cost, of designing custom solutions from scratch for each construction instance.

The challenge of scaling the generalization abilities of learning robots follows a similar characteristic to the challenges of modeling physical phenomena: the complexity of the task space may scale combinatorially with the configurations and number of objects, but if all scene instances share the same set of objects that follow the same physical laws, then transforming the problem of modeling scenes into a problem of modeling objects and the local physical processes that govern their interactions may provide a significant benefit in generalizing to solving novel physical tasks the learner has not encountered before. This is the central hypothesis of this paper.

We test this hypothesis by defining models for perceiving and predicting raw observations that are themselves compositions of simpler functions that operate locally on entities rather than globally on scenes. Importantly, the symmetry that all objects follow the same physical laws enables us to define these learnable entity-centric functions to take as input argument a variable that represents a generic entity, the specific instantiations of which are all processed by the same function. We use the term entity abstraction to refer to the abstraction barrier that isolates the abstract variable, which the entity-centric function is defined with respect to, from its concrete instantiation, which contains information about the appearance and dynamics of an object that modulates the function’s behavior.

Figure 1: (a) OP3 can infer a set of entity variables from a series of interactions (interactive entity grounding) or a single image (entity grounding). OP3 rollouts predict the future entity states given a sequence of actions. We evaluate these rollouts during planning by scoring these predictions against inferred goal entity-states. (b) OP3 enforces the entity abstraction, factorizing the latent state into local entity states, each of which is symmetrically processed with the same function that takes in a generic entity as an argument. In contrast, prior work either (c) processes a global latent state [22] or (d) assumes a fixed set of entities, each processed with a different function [13, 34, 58]. (e-g) Enforcing the entity-abstraction on modeling the (f) dynamics and (g) observation distributions of a POMDP, and on the (e) interactive inference procedure for grounding the entity variables in raw visual observations. Actions are not shown to reduce clutter.

Defining the observation and dynamics models of a model-based reinforcement learner as neural network functions of abstract entity variables allows for symbolic computation in the space of entities, but the key challenge for realizing this is to ground the values of these variables in the world from raw visual observations. Fortunately, the language of partially observable Markov decision processes (POMDP) enables us to represent these entity variables as latent random state variables in a state-factorized POMDP. This transforms the variable binding problem into an inference problem, for which we can build upon state-of-the-art techniques in amortized iterative variational inference [39, 38, 19] to use temporal continuity and interactive feedback to infer the posterior distribution of the entity variables given a sequence of observations and actions.

We present a framework for object-centric perception, prediction, and planning (OP3), a model-based reinforcement learner that predicts and plans over entity variables inferred via an interactive inference algorithm from raw visual observations. Empirically, OP3 learns to discover and bind information about actual objects in the environment to these entity variables without any supervision on what these variables should correspond to. As all computation within the entity-centric function is local in scope with respect to its input entity, the process of modeling the dynamics or appearance of each object is protected from the computations involved in modeling other objects, which allows OP3 to generalize to modeling a variable number of objects in a variety of contexts with no re-training.

Contributions: Our conceptual contribution is the use of entity abstraction to integrate graphical models, symbolic computation, and neural networks in a model-based RL agent. This is enabled by our technical contribution: defining models as the composition of locally-scoped entity-centric functions and the interactive inference algorithm for grounding the abstract entity variables in raw visual observations without any supervision on object identity. Empirically, we find that OP3 achieves two to three times greater accuracy than state-of-the-art video prediction models in solving novel single and multi-step block stacking tasks.

2 Related Work

Representation learning for visual model-based reinforcement learning: Prior works have proposed learning video prediction models [55, 7, 35, 13] to improve exploration [43] and planning [14] in reinforcement learning. However, such works and others [22, 62, 40, 42] that represent the scene with a single representation vector may be susceptible to the binding problem [20, 46] and must rely on data to learn that the same object in two different contexts can be modeled similarly. But processing a disentangled latent state with a single function [54, 6, 33, 34, 18] or processing each disentangled factor with a different function [35, 59] (1) assumes a fixed number of entities that cannot be dynamically adjusted for generalizing to more objects than in training and (2) has no constraints to enforce that multiple instances of the same entity in the scene be modeled in the same way. For generalization, often the particular arrangement of objects in a scene does not matter so much as what is constant across scenes – properties of individual objects and inter-object relationships – which the inductive biases of these prior works do not capture. The entity abstraction in OP3 enforces symmetric processing of entity representations, thereby overcoming the limitations of these prior works.

Unsupervised grounding of abstract entity variables in concrete objects: Prior works that model entities and their interactions often pre-specify the identity of the entities [5, 3, 23, 26, 41, 2, 1], provide additional supervision [17, 24, 53, 60], or provide additional specification such as segmentations [25], crops [15], or a simulator [57, 29]. Those that do not assume such additional information often factorize the entire scene into pixel-level entities [48, 61, 9], which do not model objects as coherent wholes. None of these works solve the problem of grounding the entities in raw observation, which is crucial for autonomous learning and interaction. OP3 builds upon recently proposed ideas in grounding entity representations via inference on a symmetrically factorized generative model of static [20, 21, 19] and dynamic [51] scenes, whose advantage over other methods for grounding [64, 12, 4, 32] is the ability to refine the grounding with new information. In contrast to other methods for binding in neural networks [37, 28, 49, 52], formulating inference as a mechanism for variable binding allows us to model uncertainty in the values of the variables.

3 Problem Formulation

Let denote a physical scene and denote the objects in the scene. Let and be random variables for the image observation of the scene and the agent’s actions respectively. In contrast to prior works [22] that use a single latent variable to represent the state of the scene, we use a set of latent random variables to represent the state of the objects . We use the term object to refer to , which is part of the physical world, and the term entity to refer to , which is part of our model of the physical world. The generative distribution of observations and latent entities from taking actions is modeled as:


where and are the observation and dynamics distribution respectively shared across all timesteps . Our goal is to build a model that, from simply observing raw observations of random interactions, can generalize to solve novel compositional object manipulation problems that the learner was never trained to do, such as building various block towers during test time from only training to predict how blocks fall during training time.
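Written out under assumed notation, the factorization described above would plausibly take the following form. The symbols are assumptions introduced here for illustration: X^{(t)} for the observation at timestep t, A^{(t)} for the action, and H_{1:K}^{(t)} for the K latent entity variables. The structure (an initial prior, a dynamics term shared across timesteps, and an observation term shared across timesteps) is the standard one for a state-space model with actions.

```latex
p\big(X^{(0:T)}, H_{1:K}^{(0:T)} \mid A^{(0:T-1)}\big)
  = p\big(H_{1:K}^{(0)}\big)
    \prod_{t=1}^{T} p\big(H_{1:K}^{(t)} \mid H_{1:K}^{(t-1)}, A^{(t-1)}\big)
    \prod_{t=0}^{T} p\big(X^{(t)} \mid H_{1:K}^{(t)}\big)
```

Here the second factor plays the role of the dynamics distribution and the third factor the role of the observation distribution.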

When all tasks follow the same dynamics, we can achieve such generalization with a planning algorithm if, given a sequence of actions, we could compute the posterior predictive distribution of observations a given number of steps into the future. Approximating this predictive distribution can be cast as a variational inference problem (Appdx. B) for learning the parameters of an approximate observation distribution , dynamics distribution , and a time-factorized recognition distribution that maximize the evidence lower bound (ELBO), given by , where

The ELBO pushes to produce states of the entities that contain information useful for not only reconstructing the observations via in but also for predicting the entities’ future states via in . Sec. 4 will next offer our method for incorporating entity abstraction into modeling the generative distribution and optimizing the ELBO.
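For orientation, a sequential ELBO of the kind described here is commonly written as a per-timestep reconstruction term minus a per-timestep complexity term. The sketch below uses assumed names that are not taken from the text: G for the approximate observation distribution, D for the approximate dynamics distribution, Q for the recognition distribution, and lowercase letters for sampled or observed values. It should be read as one standard form consistent with the description, not as the exact expression of Appdx. B.

```latex
\mathcal{L} = \sum_{t=0}^{T} \Big( \mathcal{L}_{\mathrm{r}}^{(t)} - \mathcal{L}_{\mathrm{c}}^{(t)} \Big),
\qquad
\mathcal{L}_{\mathrm{r}}^{(t)} = \mathbb{E}_{h_{1:K}^{(t)} \sim Q}
  \Big[ \log G\big(x^{(t)} \mid h_{1:K}^{(t)}\big) \Big],
\qquad
\mathcal{L}_{\mathrm{c}}^{(t)} = \mathbb{E}_{h_{1:K}^{(t-1)} \sim Q}
  \Big[ D_{\mathrm{KL}}\Big( Q\big(H_{1:K}^{(t)} \mid h_{1:K}^{(t-1)}, x^{(t)}, a^{(t-1)}\big)
  \,\Big\|\, D\big(H_{1:K}^{(t)} \mid h_{1:K}^{(t-1)}, a^{(t-1)}\big) \Big) \Big]
```

Under this reading, the reconstruction term rewards entity states that explain the current observation through G, and the complexity term rewards entity states whose evolution is predictable by D. At t = 0 the dynamics term would be replaced by a prior over the initial entity states.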

4 Object-Centric Perception, Prediction, and Planning (OP3)

The entity abstraction is derived from an assumption about symmetry: that the problem of modeling a dynamic scene of multiple entities can be reduced to the problem of (1) modeling a single entity and its interactions with an entity-centric function and (2) applying this function to every entity in the scene. Our choice to represent a scene as a set of entities exposes an avenue for directly encoding such a prior about symmetry that would otherwise not be straightforward with a global state representation.

As shown in Fig. 1, a function that respects the entity abstraction requires two ingredients. The first ingredient (Sec. 4.1) is that the function is expressed in part as the higher-order operation that broadcasts the same entity-centric function to every entity variable. This yields the benefit of automatically transferring learned knowledge for modeling an individual entity to all entities in the scene rather than learning such symmetry from data. As the entity-centric function takes in a single generic entity variable as argument, the second ingredient (Sec. 4.2) should be a mechanism that binds information from the raw observation about a particular object to that variable.
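The first ingredient can be illustrated with a minimal Python sketch. The names `broadcast`, `shift_right`, and the list-of-floats entity encoding are hypothetical and chosen only for illustration; the point is that one function written for a single generic entity applies unchanged to scenes with different numbers of entities.

```python
from typing import Callable, List

Entity = List[float]  # hypothetical encoding: one small feature vector per entity


def broadcast(entity_fn: Callable[[Entity], Entity], entities: List[Entity]) -> List[Entity]:
    """Apply the same entity-centric function to every entity in the scene."""
    return [entity_fn(h) for h in entities]


def shift_right(h: Entity) -> Entity:
    """Toy entity-centric function: move one entity by +1 along its first coordinate."""
    return [h[0] + 1.0] + h[1:]


# The same function handles a 2-entity scene and a 4-entity scene without changes.
scene_small = [[0.0, 1.0], [2.0, 3.0]]
scene_large = scene_small + [[4.0, 5.0], [6.0, 7.0]]

out_small = broadcast(shift_right, scene_small)
out_large = broadcast(shift_right, scene_large)
print(out_small)
print(len(out_large))
```

Because `broadcast` never inspects how many entities it receives, nothing about the entity-centric function needs to be relearned when the entity count changes.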

4.1 Entity Abstraction in the Observation and Dynamics Models

The functions of interest in model-based RL are the observation and dynamics models and with which we seek to approximate the data-generating distribution in equation 1.

Figure 2: (a) The observation model models an observation image as a composition of sub-images weighted by segmentation masks. The shades of gray in the masks indicate the depth from the camera of the object that the sub-image depicts. (b) The graphical model of the generative model of observations, where indexes the entity, and indexes the pixel. is the indicator variable that signifies whether an object’s depth at a pixel is the closest to the camera.

Observation Model: The observation model approximates the distribution , which models how the observation is caused by the combination of entities . We enforce the entity abstraction in (in Fig. 1g) by applying the same entity-centric function to each entity , which we can implement using a mixture model at each pixel :


where computes the mixture components that model how each individual entity is independently generated, combined via mixture weights that model the entities’ relative depth from the camera, the derivation of which is in Appdx. A.
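A pixel-wise mixture of this kind can be sketched as follows. This is a simplified illustration under stated assumptions: each entity contributes a scalar sub-image value and a mask logit per pixel, the mixture weights are a softmax over entities, and only the mixture mean is computed. The function names are hypothetical and the learned networks that would produce these quantities are omitted.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over the entity axis."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compose_pixel(component_means: List[float], mask_logits: List[float]) -> float:
    """Mean of one pixel under a K-component mixture, one component per entity."""
    weights = softmax(mask_logits)
    return sum(w * mu for w, mu in zip(weights, component_means))


def compose_image(sub_images: List[List[float]], mask_logit_maps: List[List[float]]) -> List[float]:
    """sub_images[k][i] and mask_logit_maps[k][i] index entity k and pixel i."""
    num_pixels = len(sub_images[0])
    return [
        compose_pixel([s[i] for s in sub_images], [m[i] for m in mask_logit_maps])
        for i in range(num_pixels)
    ]


# Two entities, three pixels: entity 0 is white (1.0), entity 1 is black (0.0).
sub_images = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
mask_logit_maps = [[10.0, 0.0, -10.0], [-10.0, 0.0, 10.0]]
image = compose_image(sub_images, mask_logit_maps)
print(image)
```

The first pixel is dominated by entity 0, the last by entity 1, and the middle pixel is an even blend, which mirrors how the masks decide which entity is visible at each location.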

Dynamics Model: The dynamics model approximates the distribution , which models how an action intervenes on the entities to produce their future values . We enforce the entity abstraction in (in Fig. 1f) by applying the same entity-centric function to each entity , which reduces the problem of modeling how an action affects a scene with a combinatorially large space of object configurations to the problem of simply modeling how an action affects a single generic entity and its interactions with the list of other entities . Modeling the action as a finer-grained intervention on a single entity rather than the entire scene is a benefit of using local representations of entities rather than global representations of scenes.

However, at this point we still have to model the combinatorially large space of interactions that a single entity could participate in. Therefore, we can further enforce a pairwise entity abstraction on by applying the same pairwise function to each entity pair , for . Omitting the action to reduce clutter (the full form is written in Appdx. F.2), the structure of the therefore follows this form:


The entity abstraction therefore provides the flexibility to scale to modeling a variable number of objects by solely learning a function that operates on a single generic entity and a function that operates on a single generic entity pair, both of which can be re-used across all entity instances.
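The structure of such a dynamics step can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: entities are scalar positions, the pairwise function is a hand-written short-range repulsion rather than a learned network, and actions are omitted as in the text. Only the pattern is the point: one per-entity update that sums the output of one shared pairwise function over all other entities.

```python
from typing import List


def pairwise_effect(h_k: float, h_i: float) -> float:
    """Toy shared pairwise function: entities closer than 1.0 push each other apart."""
    gap = h_k - h_i
    return 0.5 * gap if abs(gap) < 1.0 else 0.0


def entity_step(k: int, entities: List[float]) -> float:
    """Update entity k from its own state plus the summed effects of all other entities."""
    total_effect = 0.0
    for i, h_i in enumerate(entities):
        if i != k:
            total_effect += pairwise_effect(entities[k], h_i)
    return entities[k] + total_effect


def dynamics_step(entities: List[float]) -> List[float]:
    """Apply the same entity-centric update to every entity in the scene."""
    return [entity_step(k, entities) for k in range(len(entities))]


scene = [0.0, 0.5, 5.0]
next_scene = dynamics_step(scene)
print(next_scene)

# The same two functions handle a scene with a different number of entities.
print(dynamics_step([0.0, 0.5, 5.0, 5.5, 9.0]))
```

Since the update sums over the other entities, reordering the input list only reorders the output, which is the symmetry the entity abstraction is meant to enforce.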

Figure 3: The dynamics model models the time evolution of every object by symmetrically applying the function to each object. For a given object, models the individual dynamics of that object , embeds the action vector , computes the action’s effect on that object , computes each of the other objects’ effect on that object , and aggregates these effects together .

4.2 Interactive Inference for Binding Object Properties to Latent Variables

For the observation and dynamics models to operate from raw pixels hinges on the ability to bind the properties of specific physical objects to the entity variables . For latent variable models, we frame this variable binding problem as an inference problem: binding information about to can be cast as a problem of inferring the parameters of , the posterior distribution of given a sequence of interactions. Maximizing the ELBO in Sec. 3 offers a method for learning the parameters of the observation and dynamics models while simultaneously learning an approximation to the posterior , which we have chosen to factorize into a per-timestep recognition distribution shared across timesteps. We also choose to enforce the entity abstraction on the process that computes the recognition distribution (in Fig. 1e) by decomposing it into a recognition distribution applied to each entity:


Whereas a neural network encoder is often used to approximate the posterior [22, 58, 34], a single forward pass that computes in parallel for each entity is insufficient to break the symmetry for dividing responsibility of modeling different objects among the entity variables [63] because the entities do not have the opportunity to communicate about which part of the scene they are representing.

We therefore adopt an iterative inference approach [39] to compute the recognition distribution , which has been shown to break symmetry among modeling objects in static scenes [19]. Iterative inference computes the recognition distribution via a procedure, rather than a single forward pass of an encoder, that iteratively refines an initial guess for the posterior parameters by using gradients from how well the generative model is able to predict the observation based on the current posterior estimate. The initial guess provides the noise to break the symmetry.
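The role of the initial guess in breaking symmetry can be seen in a toy example. The sketch below is not the learned refinement procedure described above: it swaps in plain gradient ascent on the log-likelihood of a two-component Gaussian mixture over one-dimensional data, with hypothetical function names and hand-picked step sizes. Two "entity" means that start identical receive identical gradients and never separate, whereas a small perturbation lets each one claim a different data point.

```python
import math
from typing import List


def responsibilities(x: float, means: List[float], sigma: float) -> List[float]:
    """Posterior probability that each component explains data point x."""
    logits = [-((x - m) ** 2) / (2.0 * sigma ** 2) for m in means]
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refine(means: List[float], data: List[float], sigma: float = 0.5,
           step_size: float = 0.1, num_steps: int = 50) -> List[float]:
    """Iteratively refine the component means with log-likelihood gradients."""
    means = list(means)
    for _ in range(num_steps):
        grads = [0.0] * len(means)
        for x in data:
            resp = responsibilities(x, means, sigma)
            for k, m in enumerate(means):
                grads[k] += resp[k] * (x - m) / sigma ** 2
        means = [m + step_size * g for m, g in zip(means, grads)]
    return means


data = [-1.0, 1.0]
tied = refine([0.0, 0.0], data)     # identical initial guesses stay identical
split = refine([-0.1, 0.1], data)   # a small perturbation breaks the symmetry
print(tied, split)
```

With the perturbed start, the two means converge close to -1 and 1 respectively, one per data point, while the tied start remains stuck at the shared midpoint.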

For scenes where position and color are enough for disambiguating objects, a static image may be sufficient for inferring . However, in interactive environments disambiguating objects is more underconstrained because what constitutes an object depends on the goals of the agent. We therefore incorporate actions into the amortized variational filtering framework [38] to develop an interactive inference algorithm (Appdx. D and Fig. 4) that uses temporal continuity and interactive feedback to disambiguate objects. Another benefit of enforcing entity abstraction is that preserving temporal consistency on entities comes for free: information about each object remains bound to its respective through time, mixing with information about other entities only through explicitly defined avenues, such as in the dynamics model.
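The control flow of such an interactive inference procedure can be sketched as an alternation between refinement steps on the current observation and a dynamics step that carries the belief to the next timestep. The sketch below is a schematic under stated assumptions: `refine` and `predict` are hypothetical callables standing in for the learned refinement and dynamics networks, the belief is a single scalar, and the default step counts (four refinements on the first frame, two after each prediction) follow the schedule given in the Fig. 6 caption.

```python
from typing import Callable, List, Tuple


def interactive_inference(
    observations: List[float],
    actions: List[float],
    refine: Callable[[float, float], float],
    predict: Callable[[float, float], float],
    initial_belief: float,
    first_refine_steps: int = 4,
    later_refine_steps: int = 2,
) -> Tuple[float, List[str]]:
    """Alternate refinement and dynamics steps over a sequence of interactions."""
    belief = initial_belief
    trace = []
    # Ground the belief in the first observation.
    for _ in range(first_refine_steps):
        belief = refine(belief, observations[0])
        trace.append("refine")
    # For each later timestep: predict forward with the action, then correct.
    for observation, action in zip(observations[1:], actions):
        belief = predict(belief, action)
        trace.append("predict")
        for _ in range(later_refine_steps):
            belief = refine(belief, observation)
            trace.append("refine")
    return belief, trace


# Toy stand-ins: refinement moves halfway to the observation, dynamics adds the action.
belief, trace = interactive_inference(
    observations=[1.0, 2.0, 3.0],
    actions=[1.0, 1.0],
    refine=lambda b, x: b + 0.5 * (x - b),
    predict=lambda b, a: b + a,
    initial_belief=0.0,
)
print(belief, trace)
```

Because the dynamics step supplies the starting point for each later refinement, fewer refinement steps are needed after the first frame, and the belief stays tied to the same underlying state across time.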

4.3 Training at Different Timescales

The variational parameters are the interface through which the neural networks , , that respectively output the distribution parameters of , , and communicate. For a particular dynamic scene, the execution of interactive inference optimizes the variational parameters . Across scene instances, we train the weights of , , by backpropagating the ELBO through the entire inference procedure, spanning multiple timesteps. OP3 thus learns at three different timescales: the variational parameters learn (1) across steps of inference within a single timestep and (2) across timesteps within a scene instance, and the network weights learn (3) across different scene instances.

Beyond next-step prediction, we can directly train to compute the posterior predictive distribution by sampling from the approximate posterior of with , rolling out the dynamics model in latent space from these samples with a sequence of actions, and predicting the observation with the observation model . This approach to action-conditioned video prediction predicts future observations directly from observations and actions, but with a bottleneck of time-persistent entity-variables with which the dynamics model performs symbolic relational computation.

Figure 4: Amortized interactive inference alternates between refinement (pink) and dynamics (orange) steps, iteratively updating the belief of over time. corresponds to the output of the dynamics network, which serves as the initial estimate of that is subsequently refined by and . denotes the feedback used in the refinement process, which includes gradient information and auxiliary inputs (Appdx. D).

4.4 Object-Centric Planning

OP3 rollouts, computed as the posterior predictive distribution, can be integrated into the standard visual model-predictive control [14] framework. Since interactive inference grounds the entities in the actual objects depicted in the raw observation, this grounding essentially gives OP3 access to a pointer to each object, enabling the rollouts to be in the space of entities and their relations. These pointers enable OP3 to not merely predict in the space of entities, but give OP3 access to an object-centric action space: for example, instead of being restricted to the standard (pick_xy, place_xy) action space common to many manipulation tasks, which often requires biased picking with a scripted policy [36, 27], these pointers enable us to compute a mapping (Appdx. G.2) between entity_id and pick_xy, allowing OP3 to automatically use an (entity_id, place_xy) action space without needing a scripted policy.
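One plausible way to realize a mapping from entity_id to pick_xy is to take the centroid of the pixels that an entity's segmentation mask claims. This is an assumption made for illustration and is not necessarily the procedure of Appdx. G.2; the function name `pick_xy_from_entity` and the nested-list mask layout are likewise hypothetical.

```python
from typing import List, Optional, Tuple

Mask = List[List[float]]  # mask[y][x]: weight of one entity at pixel (x, y)


def pick_xy_from_entity(masks: List[Mask], entity_id: int) -> Optional[Tuple[float, float]]:
    """Return the centroid of the pixels where entity_id has the largest mask weight."""
    height = len(masks[0])
    width = len(masks[0][0])
    xs, ys = [], []
    for y in range(height):
        for x in range(width):
            owner = max(range(len(masks)), key=lambda k: masks[k][y][x])
            if owner == entity_id:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # the entity owns no pixels, so there is nothing to pick
    return (sum(xs) / len(xs), sum(ys) / len(ys))


# Two entities on a 3x2 image: entity 0 is background, entity 1 covers two top pixels.
masks = [
    [[0.9, 0.1, 0.1], [0.9, 0.9, 0.9]],
    [[0.1, 0.9, 0.9], [0.1, 0.1, 0.1]],
]
pick = pick_xy_from_entity(masks, entity_id=1)
print(pick)
```

A planner can then sample entity_id values instead of raw pixel coordinates, so every sampled pick lands on some inferred object rather than on empty space.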

4.5 Generalization to Various Tasks

We consider tasks defined in the same environment with the same physical laws that govern appearance and dynamics. Tasks are differentiated by goals, in particular goal configurations of objects. Building good cost functions for real world tasks is generally difficult [16] because the underlying state of the environment is always unobserved and can only be modeled through modeling observations. However, by representing the environment state as the state of its entities, we may obtain finer-grained goal-specification without the need for manual annotations [11]. Having rolled out OP3 to a particular timestep, we construct a cost function to compare the predicted entity states with the entity states inferred from a goal image by considering pairwise distances between the entities, another example of enforcing the pairwise entity abstraction. Letting and denote the set of goal and predicted entities respectively, we define the form of the cost function via a composition of the task specific distance function operating on entity-pairs:


in which we pair each goal entity with the closest predicted entity and sum over the costs of these pairs. Assuming a single action suffices to move an object to its desired goal position, we can greedily plan each timestep by defining the cost to be , the pair with minimum distance, and removing the corresponding goal entity from further consideration for future planning.
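The pairing described above can be sketched as follows. The squared Euclidean distance on two-dimensional entity states is an assumption standing in for the task-specific distance function, and the function names are hypothetical. `full_cost` pairs each goal entity with its closest predicted entity and sums the pair costs; `greedy_cost` returns only the single best pair, together with the index of the goal entity to remove from later planning steps.

```python
from typing import List, Tuple

State = Tuple[float, float]  # hypothetical entity state, e.g. a 2D position


def distance(goal: State, predicted: State) -> float:
    """Assumed task-specific distance: squared Euclidean distance."""
    return sum((g - p) ** 2 for g, p in zip(goal, predicted))


def full_cost(goals: List[State], predictions: List[State]) -> float:
    """Pair each goal entity with its closest predicted entity and sum the costs."""
    return sum(min(distance(g, p) for p in predictions) for g in goals)


def greedy_cost(goals: List[State], predictions: List[State]) -> Tuple[float, int]:
    """Return the minimum pair cost and the index of the goal entity it satisfies."""
    best_cost, best_goal = min(
        (min(distance(g, p) for p in predictions), index)
        for index, g in enumerate(goals)
    )
    return best_cost, best_goal


goals = [(0.0, 0.0), (2.0, 0.0)]
predictions = [(0.0, 1.0), (2.0, 0.0), (5.0, 5.0)]
total = full_cost(goals, predictions)
best_cost, matched_goal = greedy_cost(goals, predictions)
print(total, best_cost, matched_goal)
```

In the greedy variant, the matched goal entity would be dropped from `goals` before the next action is planned, so each step works toward one remaining goal entity.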

5 Experiments

Our experiments aim to study to what degree entity abstraction improves generalization, planning, and modeling. Sec. 5.1 shows that from only training to predict how objects fall, OP3 generalizes to solve various novel block stacking tasks with two to three times better accuracy than a state-of-the-art video prediction model. Sec. 5.2 shows that OP3 can plan for multiple steps in a difficult multi-object environment. Sec. 5.3 shows that OP3 learns to ground its abstract entities in objects from real world videos.

5.1 Combinatorial Generalization without Object Supervision

We first investigate how well OP3 can learn object-based representations without additional object supervision, as well as how well OP3’s factorized representation can enable combinatorial generalization for scenes with many objects.

Domain: In the MuJoCo [50] block stacking task introduced by Janner et al. [25] for the O2P2 model, a block is raised in the air and the model must predict the steady-state effects of dropping the block on a surface with multiple objects, which implicitly requires modeling the effects of gravity and collisions. The agent is never trained to stack blocks, but is tested on a suite of tasks where it must construct a block tower specified by a goal image. Janner et al. [25] showed that an object-centric model with access to ground truth object segmentations can solve these tasks with about 76% accuracy. We now consider whether OP3 can do better, but without any supervision on object identity.

Table 1: Accuracy (%) of block tower builds by the SAVP baseline, the O2P2 oracle, and our approach. O2P2 uses image segmentations whereas OP3 uses only raw images as input.

SAVP    O2P2    OP3 (ours)
24%     76%     82%

Table 2: Accuracy (%) of multi-step planning for building block towers. (xy) means (pick_xy, place_xy) action space while (entity) means (entity_id, place_xy) action space.

# Blocks    SAVP    OP3 (xy)    OP3 (entity)
1           54%     73%         91%
2           28%     55%         80%
3           28%     41%         55%

Setup: We train OP3 on the same dataset and evaluate on the same goal images as Janner et al. [25]. While the training set contains up to five objects, the test set contains up to nine objects, which are placed in specific structures (bridge, pyramid, etc.) not seen during training. The actions are optimized using the cross-entropy method (CEM) [47], with each sampled action evaluated by the greedy cost function described in Sec. 4.5. Accuracy is evaluated using the metric defined by Janner et al. [25], which checks that all blocks are within some threshold error of the goal.
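For readers unfamiliar with CEM, the sketch below shows the basic loop on a toy problem: sample candidate actions from a Gaussian, keep the lowest-cost elites, refit the Gaussian to them, and repeat. The population size, elite count, iteration count, and quadratic toy cost are illustrative assumptions and are not the settings used in the experiments; in the planning setting the cost would instead come from scoring a model rollout against the goal.

```python
import random
import statistics
from typing import Callable, List


def cross_entropy_method(
    cost: Callable[[List[float]], float],
    dim: int,
    iterations: int = 20,
    population: int = 50,
    num_elites: int = 10,
    seed: int = 0,
) -> List[float]:
    """Minimize cost over a dim-dimensional action with a diagonal Gaussian sampler."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(iterations):
        samples = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(population)
        ]
        elites = sorted(samples, key=cost)[:num_elites]
        # Refit the sampling distribution to the elite set.
        mean = [statistics.fmean([e[d] for e in elites]) for d in range(dim)]
        std = [statistics.pstdev([e[d] for e in elites]) + 1e-6 for d in range(dim)]
    return mean


# Toy cost: squared distance of a 2D placement from a hypothetical target location.
target = [0.3, -0.7]


def toy_cost(action: List[float]) -> float:
    return sum((a - t) ** 2 for a, t in zip(action, target))


best_action = cross_entropy_method(toy_cost, dim=2)
print(best_action)
```

The small constant added to the standard deviation keeps the sampler from collapsing to a point before the final iteration.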

Results: The two baselines, SAVP [35] and O2P2, represent the state-of-the-art in video prediction and symmetric object-centric planning methods, respectively. SAVP models objects with a fixed number of convolutional filters and does not process entities symmetrically. O2P2 does process entities symmetrically, but requires access to ground truth object segmentations. As shown in Table 1, OP3 achieves better accuracy than O2P2, even without any ground truth supervision on object identity, possibly because grounding the entities in the raw image may provide a richer contextual representation than encoding each entity separately without such global context as O2P2 does. OP3 achieves three times the accuracy of SAVP, which suggests that symmetric modeling of entities enables the flexibility to transfer knowledge of dynamics of a single object to novel scenes with different configurations, heights, color combinations, and numbers of objects than those from the training distribution. Fig. 7 and Fig. 8 in the Appendix show that, by grounding its entities in objects of the scene through inference, OP3’s predictions isolate only one object at a time without affecting the predictions of other objects.

Figure 5: (a) In the block stacking task from [25] with single-step greedy planning, OP3 generalizes better than both O2P2, an oracle model with access to image segmentations, and SAVP, which does not enforce entity abstraction. (b) OP3 exhibits better multi-step planning with objects already present in the scene. When planning with MPC using random pick locations (SAVP and OP3 (xy)), the sparsity of objects in the scene makes it rare for random pick locations to actually pick the objects. However, because OP3 has access to pointers to the latent entities, we can use these to automatically bias the pick locations to be at the object location, without any supervision (OP3 (entity)).

5.2 Multi-Step Planning

The goal of our second experiment is to understand how well OP3 can perform multi-step planning by manipulating objects already present in the scene. We modify the block stacking task by changing the action space to represent a picking and dropping location. This requires reasoning over extended action sequences since moving objects out of place may be necessary.

Goals are specified with a goal image, and the initial scene contains all of the blocks needed to build the desired structure. This task is more difficult because the agent may have to move blocks out of the way before placing other ones, which requires multi-step planning. Furthermore, an action only successfully picks up a block if it intersects with the block’s outline, which makes searching through the combinatorial space of plans a challenge. As stated in Sec. 4.4, having a pointer to each object enables OP3 to plan in the space of entities. We compare two different action spaces, (pick_xy, place_xy) and (entity_id, place_xy), to understand how automatically filtering for pick locations at actual locations of objects enables better efficiency and performance in planning. Details for determining the pick_xy from entity_id are in Appdx. G.2.

Results: We compare with SAVP, which uses the (pick_xy, place_xy) action space. With this standard action space (Table 2), OP3 achieves roughly 1.4 to 2 times the accuracy of SAVP. This performance gap widens to roughly 1.7 to 2.9 times when OP3 uses the (entity_id, place_xy) action space. The low performance of SAVP with only two blocks highlights the difficulty of such combinatorial tasks for model-based RL methods, and highlights both the generalization and localization benefits of a model with entity abstraction. Fig. 5b shows that OP3 is able to plan more efficiently, suggesting that OP3 may be a more effective model than SAVP in modeling combinatorial scenes. Fig. 6a shows the execution of interactive inference during training, where OP3 alternates between four refinement steps and one prediction step. Notice that OP3 infers entity representations that decompose the scene into coherent objects and that entities that do not model objects model the background. We also observe in the last column that OP3 predicts the appearance of the green block even though the green block was partially occluded in the previous timestep, which shows its ability to retain information across time.
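As a quick arithmetic check, the per-row accuracy ratios relative to SAVP can be recomputed from the Table 2 figures. The dictionary layout and helper name below are introduced here for illustration; the numbers are copied from Table 2.

```python
# Accuracy (%) from Table 2, keyed by number of blocks.
table2 = {
    1: {"SAVP": 54, "OP3 (xy)": 73, "OP3 (entity)": 91},
    2: {"SAVP": 28, "OP3 (xy)": 55, "OP3 (entity)": 80},
    3: {"SAVP": 28, "OP3 (xy)": 41, "OP3 (entity)": 55},
}


def ratio_to_savp(column: str) -> dict:
    """Accuracy of the given column divided by SAVP accuracy, per number of blocks."""
    return {
        blocks: round(row[column] / row["SAVP"], 2) for blocks, row in table2.items()
    }


xy_ratios = ratio_to_savp("OP3 (xy)")
entity_ratios = ratio_to_savp("OP3 (entity)")
print(xy_ratios)      # {1: 1.35, 2: 1.96, 3: 1.46}
print(entity_ratios)  # {1: 1.69, 2: 2.86, 3: 1.96}
```

The largest gains appear in the two-block setting, where SAVP accuracy drops to 28% while OP3 with the entity-based action space remains at 80%.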

Figure 6: Visualization of interactive inference for block-manipulation and real-world videos [10]. Here, OP3 interacts with the objects by executing pre-specified actions in order to disambiguate objects already present in the scene by taking advantage of temporal continuity and receiving feedback from how well its prediction of how an action affects an object compares with the ground truth result. (a) OP3 does four refinement steps on the first image, and then two refinement steps after each prediction. (b) We compare OP3, applied on dynamic videos, with IODINE, applied independently to each frame of the video, to illustrate that using a dynamics model to propagate information across time enables better object disambiguation. We observe that initially, OP3 (green circle) and IODINE (cyan circles) both disambiguate objects via color segmentation because color is the only signal in a static image to group pixels. However, we observe that as time progresses, OP3 separates the arm, object, and background into separate latents (purple) by using its currently estimated latents to predict the next observation and comparing this prediction with the actually observed next observation. In contrast, applying IODINE on a per-frame basis does not yield benefits of temporal consistency and interactive feedback (red).

5.3 Real World Evaluation

The previous tasks used simulated environments with monochromatic objects. Now we study how well OP3 scales to real world data with cluttered scenes, object ambiguity, and occlusions. We evaluate OP3 on the dataset from Ebert et al. [10] which contains videos of a robotic arm moving cloths and other deformable and multipart objects with varying textures.

We evaluate qualitative performance by visualizing the object segmentations and compare against vanilla IODINE, which does not incorporate an interaction-based dynamics model into the inference process. Fig. 6b highlights the strength of OP3 in preserving temporal continuity and disambiguating objects in real world scenes. While IODINE can disambiguate monochromatic objects in static images, we observe that it struggles to do more than just color segmentation on more complicated images where movement is required to disambiguate objects. In contrast, OP3 is able to use temporal information to obtain more accurate segmentations, as seen in Fig. 6b where it initially performs color segmentation by grouping the towel, arm, and dark container edges together, and then by observing the effects of moving the arm, separates these entities into different groups.

6 Discussion

We have shown that enforcing the entity abstraction in a model-based reinforcement learner improves generalization, planning, and modeling across various compositional multi-object tasks. In particular, enforcing the entity abstraction provides the learner with a pointer to each entity variable, enabling us to define functions that are local in scope with respect to a particular entity, allowing knowledge about an entity in one context to directly transfer to modeling the same entity in different contexts. In the physical world, entities are often manifested as objects, and generalization in physical tasks such as robotic manipulation often may require symbolic reasoning about objects and their interactions. However, the general difficulty with using purely symbolic, abstract representations is that it is unclear how to continuously update these representations with more raw data. OP3 frames such symbolic entities as random variables in a dynamic latent variable model and infers and refines the posterior of these entities over time with neural networks. This suggests a potential bridge to connect abstract symbolic variables with the noisy, continuous, high-dimensional physical world, opening a path to scaling robotic learning to more combinatorially complex tasks.

The authors would like to thank the anonymous reviewers for their helpful feedback and comments. The authors would also like to thank Sjoerd van Steenkiste, Nalini Singh and Marvin Zhang for helpful discussions on the graphical model, Klaus Greff for help in implementing IODINE, Alex Lee for help in running SAVP, Tom Griffiths, Karl Persch, and Oleg Rybkin for feedback on earlier drafts, Joe Marino for discussions on iterative inference, and Sam Toyer, Anirudh Goyal, Jessica Hamrick, and Peter Battaglia for insightful discussions. This research was supported in part by the National Science Foundation under IIS-1651843, IIS-1700697, and IIS-1700696, the Office of Naval Research, ARL DCIST CRA W911NF-17-2-0181, DARPA, Berkeley DeepDrive, Google, Amazon, and NVIDIA.


  • [1] A. Ajay, M. Bauza, J. Wu, N. Fazeli, J. B. Tenenbaum, A. Rodriguez, and L. P. Kaelbling (2019) Combining physical simulators and object-based networks for control. arXiv:1904.06580. Cited by: §2.
  • [2] V. Bapst, A. Sanchez-Gonzalez, C. Doersch, K. L. Stachenfeld, P. Kohli, P. W. Battaglia, and J. B. Hamrick (2019) Structured agents for physical construction. arXiv:1904.03177. Cited by: §2.
  • [3] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. (2016) Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pp. 4502–4510. Cited by: §2.
  • [4] C. P. Burgess, L. Matthey, N. Watters, R. Kabra, I. Higgins, M. Botvinick, and A. Lerchner (2019) MONet: unsupervised scene decomposition and representation. arXiv:1901.11390. Cited by: §2.
  • [5] M. B. Chang, T. Ullman, A. Torralba, and J. B. Tenenbaum (2016) A compositional object-based approach to learning physical dynamics. arXiv:1612.00341. Cited by: §2.
  • [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180. Cited by: §2.
  • [7] E. L. Denton et al. (2017) Unsupervised learning of disentangled representations from video. In Advances in neural information processing systems, pp. 4414–4423. Cited by: §2.
  • [8] C. Doersch (2016) Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908. Cited by: Appendix B.
  • [9] Y. Du and K. Narasimhan (2019) Task-agnostic dynamics priors for deep reinforcement learning. arXiv:1905.04819. Cited by: §2.
  • [10] F. Ebert, S. Dasari, A. X. Lee, S. Levine, and C. Finn (2018) Robustness via retrying: closed-loop robotic manipulation with self-supervised learning. arXiv:1810.03043. Cited by: Figure 6, §5.3.
  • [11] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine (2018) Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv:1812.00568. Cited by: §4.5.
  • [12] S. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, G. E. Hinton, et al. (2016) Attend, infer, repeat: fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pp. 3225–3233. Cited by: §2.
  • [13] C. Finn, I. Goodfellow, and S. Levine (2016) Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems, pp. 64–72. Cited by: Figure 1, §2.
  • [14] C. Finn and S. Levine (2017) Deep visual foresight for planning robot motion. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 2786–2793. Cited by: §2, §4.4.
  • [15] K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik (2015) Learning visual predictive models of physics for playing billiards. arXiv:1511.07404. Cited by: §2.
  • [16] J. Fu, A. Singh, D. Ghosh, L. Yang, and S. Levine (2018) Variational inverse control with events: a general framework for data-driven reward definition. In Advances in Neural Information Processing Systems, pp. 8538–8547. Cited by: §4.5.
  • [17] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §2.
  • [18] V. Goel, J. Weng, and P. Poupart (2018) Unsupervised video object segmentation for deep reinforcement learning. arXiv:1805.07780. Cited by: §2.
  • [19] K. Greff, R. L. Kaufmann, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, and A. Lerchner (2019) Multi-object representation learning with iterative variational inference. arXiv:1903.00450. Cited by: Appendix D, Appendix F, §1, §2, §4.2.
  • [20] K. Greff, R. K. Srivastava, and J. Schmidhuber (2015) Binding via reconstruction clustering. arXiv:1511.06418. Cited by: §2, §2.
  • [21] K. Greff, S. van Steenkiste, and J. Schmidhuber (2017) Neural expectation maximization. Cited by: §2.
  • [22] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2018) Learning latent dynamics for planning from pixels. arXiv:1811.04551. Cited by: §F.2, Appendix F, Figure 1, §2, §3, §4.2.
  • [23] J. B. Hamrick, A. J. Ballard, R. Pascanu, O. Vinyals, N. Heess, and P. W. Battaglia (2017) Metacontrol for adaptive imagination-based optimization. arXiv:1705.02670. Cited by: §2.
  • [24] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §2.
  • [25] M. Janner, S. Levine, W. T. Freeman, J. B. Tenenbaum, C. Finn, and J. Wu (2018) Reasoning about physical interactions with object-oriented prediction and planning. arXiv:1812.10972. Cited by: Figure 7, Figure 8, §G.1, Appendix H, §2, Figure 5, §5.1, §5.1.
  • [26] M. Janner, K. Narasimhan, and R. Barzilay (2018) Representation learning for grounded spatial reasoning. Transactions of the Association for Computational Linguistics 6, pp. 49–61. Cited by: §2.
  • [27] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. (2018) Qt-opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293. Cited by: §4.4.
  • [28] P. Kanerva (2009) Hyperdimensional computing: an introduction to computing in distributed representation with high-dimensional random vectors. Cognitive computation. Cited by: §2.
  • [29] K. Kansky, T. Silver, D. A. Mély, M. Eldawy, M. Lázaro-Gredilla, X. Lou, N. Dorfman, S. Sidor, S. Phoenix, and D. George (2017) Schema networks: zero-shot transfer with a generative causal model of intuitive physics. arXiv:1706.04317. Cited by: §2.
  • [30] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980. Cited by: Appendix F.
  • [31] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv:1312.6114. Cited by: Appendix B.
  • [32] A. R. Kosiorek, H. Kim, I. Posner, and Y. W. Teh (2018) Sequential attend, infer, repeat: generative modelling of moving objects. arXiv:1806.01794. Cited by: §2.
  • [33] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum (2015) Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pp. 2539–2547. Cited by: §2.
  • [34] T. Kulkarni, A. Gupta, C. Ionescu, S. Borgeaud, M. Reynolds, A. Zisserman, and V. Mnih (2019) Unsupervised learning of object keypoints for perception and control. arXiv:1906.11883. Cited by: Figure 1, §2, §4.2.
  • [35] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine (2018) Stochastic adversarial video prediction. arXiv:1804.01523. Cited by: §2, §5.1.
  • [36] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen (2018) Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436. Cited by: §4.4.
  • [37] S. D. Levy and R. Gayler (2008) Vector symbolic architectures: a new building material for artificial general intelligence. In Conference on Artificial General Intelligence, Cited by: §2.
  • [38] J. Marino, M. Cvitkovic, and Y. Yue (2018) A general method for amortizing variational filtering. In Advances in Neural Information Processing Systems, pp. 7857–7868. Cited by: §1, §4.2.
  • [39] J. Marino, Y. Yue, and S. Mandt (2018) Iterative amortized inference. arXiv:1807.09356. Cited by: §1, §4.2.
  • [40] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §2.
  • [41] K. Narasimhan, R. Barzilay, and T. Jaakkola (2018) Grounding language for transfer in deep reinforcement learning. Journal of Artificial Intelligence Research 63, pp. 849–874. Cited by: §2.
  • [42] J. Oh, V. Chockalingam, S. Singh, and H. Lee (2016) Control of memory, active perception, and action in minecraft. arXiv:1605.09128. Cited by: §2.
  • [43] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh (2015) Action-conditional video prediction using deep networks in atari games. In Advances in neural information processing systems, pp. 2863–2871. Cited by: §2.
  • [44] R. Pascanu, T. Mikolov, and Y. Bengio (2012) Understanding the exploding gradient problem. ArXiv abs/1211.5063. Cited by: Appendix F.
  • [45] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082. Cited by: Appendix B.
  • [46] F. Rosenblatt (1961) Principles of neurodynamics. Perceptrons and the theory of brain mechanisms. Technical report, Cornell Aeronautical Lab Inc., Buffalo, NY. Cited by: §2.
  • [47] R. Y. Rubinstein and D. P. Kroese (2004) The cross-entropy method. In Information Science and Statistics, Cited by: §5.1.
  • [48] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap (2017) A simple neural network module for relational reasoning. arXiv:1706.01427. Cited by: §2.
  • [49] P. Smolensky (1990) Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence 46 (1-2), pp. 159–216. Cited by: §2.
  • [50] E. Todorov, T. Erez, and Y. Tassa (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §5.1.
  • [51] S. van Steenkiste, M. Chang, K. Greff, and J. Schmidhuber (2018) Relational neural expectation maximization: unsupervised discovery of objects and their interactions. arXiv:1802.10353. Cited by: Appendix D, §2.
  • [52] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.
  • [53] D. Wang, C. Devin, Q. Cai, F. Yu, and T. Darrell (2018) Deep object centric policies for autonomous driving. arXiv:1811.05432. Cited by: §2.
  • [54] W. F. Whitney, M. Chang, T. Kulkarni, and J. B. Tenenbaum (2016) Understanding visual concepts with continuation learning. arXiv:1602.06822. Cited by: §2.
  • [55] N. Wichers, R. Villegas, D. Erhan, and H. Lee (2018) Hierarchical long-term video prediction without supervision. arXiv:1806.04768. Cited by: §2.
  • [56] R. J. Williams and D. Zipser (1989) A learning algorithm for continually running fully recurrent neural networks. Neural computation 1 (2), pp. 270–280. Cited by: Appendix C.
  • [57] J. Wu, E. Lu, P. Kohli, B. Freeman, and J. Tenenbaum (2017) Learning to see physics via visual de-animation. In Advances in Neural Information Processing Systems, pp. 153–164. Cited by: §2.
  • [58] Z. Xu, Z. Liu, C. Sun, K. Murphy, W. T. Freeman, J. B. Tenenbaum, and J. Wu (2018) Modeling parts, structure, and system dynamics via predictive learning. Cited by: Figure 1, §4.2.
  • [59] Z. Xu, Z. Liu, C. Sun, K. Murphy, W. T. Freeman, J. B. Tenenbaum, and J. Wu (2019) Unsupervised discovery of parts, structure, and dynamics. arXiv:1903.05136. Cited by: §2.
  • [60] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi (2018) Visual semantic navigation using scene priors. arXiv:1810.06543. Cited by: §2.
  • [61] V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart, et al. (2018) Deep reinforcement learning with relational inductive biases. Cited by: §2.
  • [62] M. Zhang, S. Vikram, L. Smith, P. Abbeel, M. J. Johnson, and S. Levine (2018) Solar: deep structured latent representations for model-based reinforcement learning. arXiv:1808.09105. Cited by: §2.
  • [63] Y. Zhang, J. Hare, and P. Adam (2019) Deep set prediction networks. arXiv:1906.06565. Cited by: §4.2.
  • [64] G. Zhu, J. Wang, Z. Ren, and C. Zhang (2019) Object-oriented dynamics learning through multi-level abstraction. arXiv:1904.07482. Cited by: §2.

Appendix A Observation Model

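The observation model $G$ models how the objects $H_{1:K}$ cause the image observation $X \in \mathbb{R}^{N \times M}$. Each object $H_k$ is rendered independently as the sub-image $I_k$ and the resulting $K$ sub-images are combined to form the final image observation $X$. To combine the sub-images, each pixel $I_{k(ij)}$ in each sub-image is assigned a depth $\delta_{k(ij)}$ that specifies the distance of object $k$ from the camera at coordinate $(ij)$ of the image plane. Thus the pixel $X_{(ij)}$ takes on the value of its corresponding pixel $I_{k(ij)}$ in the sub-image $I_k$ if object $k$ is closer to the camera than the other objects, such that

$$X_{(ij)} = \sum_{k=1}^{K} Z_{k(ij)} \cdot I_{k(ij)},$$

where $Z_{k(ij)}$ is the indicator random variable $\mathbb{1}\left[k = \arg\min_{k' \in \{1,\dots,K\}} \delta_{k'(ij)}\right]$, allowing us to intuitively interpret $Z_k$ as segmentation masks and $I_k$ as color maps. In reality we do not directly observe the depth values, so we must construct a probabilistic model to model our uncertainty:

$$G(X \mid H_{1:K}) = \prod_{i,j=1}^{N,M} \sum_{k=1}^{K} m_{(ij)}(H_k) \cdot g\left(X_{(ij)} \mid H_k\right),$$

where every pixel $(ij)$ is modeled through a set of mixture components $g(X_{(ij)} \mid H_k) := p(X_{(ij)} \mid Z_{k(ij)} = 1, H_k)$ that model how pixels of the individual sub-images $I_k$ are generated, as well as through the mixture weights $m_{(ij)}(H_k) := p(Z_{k(ij)} = 1 \mid H_k)$ that model which point of each object is closest to the camera.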
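The compositing rule and its probabilistic relaxation can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the softmax parameterisation of the mixture weights, and the fixed-variance Gaussian mixture components are all assumptions made for illustration.

```python
import numpy as np

def logsumexp(a, axis=0):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def hard_composite(sub_images, depth):
    """Depth-based compositing: each pixel copies the sub-image of the nearest object.

    sub_images: (K, H, W, C) color maps; depth: (K, H, W) per-pixel depths.
    """
    nearest = depth.argmin(axis=0)                                   # (H, W)
    # One-hot indicator over objects, playing the role of a segmentation mask.
    indicator = np.arange(sub_images.shape[0])[:, None, None] == nearest
    return (indicator[..., None] * sub_images).sum(axis=0)

def log_mask_weights(mask_logits):
    """Assumed parameterisation: mixture weights are a softmax over objects."""
    return mask_logits - logsumexp(mask_logits, axis=0)[None]

def mixture_log_likelihood(x, sub_images, mask_logits, sigma=0.1):
    """Sum over pixels of log sum_k weight_k * N(x_ij; sub_image_k_ij, sigma^2 I).

    The Gaussian components with a fixed sigma are an assumption of this sketch.
    """
    channels = x.shape[-1]
    sq_err = ((x[None] - sub_images) ** 2).sum(axis=-1)              # (K, H, W)
    log_g = -0.5 * sq_err / sigma**2 - 0.5 * channels * np.log(2 * np.pi * sigma**2)
    return float(logsumexp(log_mask_weights(mask_logits) + log_g, axis=0).sum())
```

When the depths are known, `hard_composite` reproduces the indicator-weighted sum; `mixture_log_likelihood` replaces the unobserved indicator with per-pixel mixture weights, so an image that agrees with the weighted sub-images scores higher than one that does not.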

Appendix B Evidence Lower Bound

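Here we provide a derivation of the evidence lower bound. We begin with the log probability of the observations $x^{(0:T)}$ conditioned on a sequence of actions $a^{(0:T-1)}$:

$$\log p\left(x^{(0:T)} \mid a^{(0:T-1)}\right) = \log \int p\left(x^{(0:T)}, h^{(0:T)} \mid a^{(0:T-1)}\right) dh^{(0:T)} = \log \mathbb{E}_{h^{(0:T)} \sim q(H^{(0:T)} \mid \cdot)}\left[\frac{p\left(x^{(0:T)}, h^{(0:T)} \mid a^{(0:T-1)}\right)}{q\left(h^{(0:T)} \mid \cdot\right)}\right] \geq \mathbb{E}_{h^{(0:T)} \sim q(H^{(0:T)} \mid \cdot)}\left[\log \frac{p\left(x^{(0:T)}, h^{(0:T)} \mid a^{(0:T-1)}\right)}{q\left(h^{(0:T)} \mid \cdot\right)}\right]. \tag{8}$$

We have freedom to choose the approximating distribution $q(H^{(0:T)} \mid \cdot)$, so we choose it to be conditioned on the past states and actions, factorized across time:

$$q\left(H^{(0:T)} \mid x^{(0:T)}, a^{(0:T-1)}\right) = q\left(H^{(0)} \mid x^{(0)}\right) \prod_{t=1}^{T} q\left(H^{(t)} \mid H^{(t-1)}, x^{(t)}, a^{(t-1)}\right).$$

With this factorization, we can use linearity of expectation to decouple Equation 8 across timesteps:

$$\mathbb{E}_{h^{(0:T)} \sim q(H^{(0:T)} \mid \cdot)}\left[\log \frac{p\left(x^{(0:T)}, h^{(0:T)} \mid a^{(0:T-1)}\right)}{q\left(h^{(0:T)} \mid \cdot\right)}\right] = \sum_{t=0}^{T} \left(\mathcal{L}_r^{(t)} - \mathcal{L}_c^{(t)}\right),$$

where at the first timestep

$$\mathcal{L}_r^{(0)} = \mathbb{E}_{h^{(0)} \sim q(H^{(0)} \mid x^{(0)})}\left[\log p\left(x^{(0)} \mid h^{(0)}\right)\right], \qquad \mathcal{L}_c^{(0)} = D_{\mathrm{KL}}\left(q\left(H^{(0)} \mid x^{(0)}\right) \,\|\, p\left(H^{(0)}\right)\right),$$

and at subsequent timesteps

$$\mathcal{L}_r^{(t)} = \mathbb{E}_{h^{(t)} \sim q(H^{(t)} \mid x^{(0:t)}, a^{(0:t-1)})}\left[\log p\left(x^{(t)} \mid h^{(t)}\right)\right],$$

$$\mathcal{L}_c^{(t)} = \mathbb{E}_{h^{(t-1)} \sim q(H^{(t-1)} \mid x^{(0:t-1)}, a^{(0:t-2)})}\left[D_{\mathrm{KL}}\left(q\left(H^{(t)} \mid h^{(t-1)}, x^{(t)}, a^{(t-1)}\right) \,\|\, p\left(H^{(t)} \mid h^{(t-1)}, a^{(t-1)}\right)\right)\right].$$

By the Markov property, the marginal $q(H^{(t)} \mid x^{(0:t)}, a^{(0:t-1)})$ is computed recursively as

$$q\left(H^{(t)} \mid x^{(0:t)}, a^{(0:t-1)}\right) = \mathbb{E}_{h^{(t-1)} \sim q(H^{(t-1)} \mid x^{(0:t-1)}, a^{(0:t-2)})}\left[q\left(H^{(t)} \mid h^{(t-1)}, x^{(t)}, a^{(t-1)}\right)\right],$$

whose base case is $q(H^{(0)} \mid x^{(0)})$ when $t = 0$.

We approximate the observation distribution $p(X^{(t)} \mid H^{(t)})$ and the dynamics distribution $p(H^{(t)} \mid H^{(t-1)}, a^{(t-1)})$ by learning the parameters of the observation model and dynamics model, respectively, as outputs of neural networks. We approximate the recognition distribution $q(H^{(t)} \mid h^{(t-1)}, x^{(t)}, a^{(t-1)})$ via an inference procedure that iteratively refines estimates of the posterior parameters, computed as an output of a neural network. To compute the expectation in the marginal $q(H^{(t)} \mid x^{(0:t)}, a^{(0:t-1)})$, we follow standard practice in amortized variational inference by approximating the expectation with a single sample of the sequence $h^{(0:t-1)}$, obtained by sequentially sampling the latents for one timestep given latents from the previous timestep, and optimizing the ELBO via stochastic gradient ascent [8, 31, 45].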
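The single-sample estimate of the per-timestep reconstruction and complexity terms can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the callables `prior`, `dynamics`, `posterior`, and `decode` are hypothetical stand-ins for the learned networks, all distributions are taken to be diagonal Gaussians, and the sketch only evaluates the estimator (the gradient step is left to an autodiff framework).

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over dimensions."""
    return float(np.sum(np.log(sigma_p / sigma_q)
                        + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2)
                        - 0.5))

def gaussian_log_prob(x, mu, sigma):
    """Log-density of a diagonal Gaussian, summed over dimensions."""
    return float(np.sum(-0.5 * ((x - mu) / sigma)**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)))

def elbo_single_sample(xs, actions, prior, dynamics, posterior, decode, rng, obs_sigma=0.1):
    """One-sample ELBO estimate for T+1 observations `xs` and T actions `actions`.

    Hypothetical interfaces (assumptions of this sketch):
      prior() -> (mu, sigma); dynamics(h_prev, a) -> (mu, sigma);
      posterior(h_prev, x, a) -> (mu, sigma), with h_prev = a = None at t = 0;
      decode(h) -> mean of the observation distribution.
    """
    elbo, h_prev = 0.0, None
    for t, x in enumerate(xs):
        a = actions[t - 1] if t > 0 else None
        mu_p, sigma_p = prior() if t == 0 else dynamics(h_prev, a)
        mu_q, sigma_q = posterior(h_prev, x, a)
        # Reparameterised sample of the latent for this timestep.
        h = mu_q + sigma_q * rng.standard_normal(mu_q.shape)
        elbo += gaussian_log_prob(x, decode(h), obs_sigma)       # reconstruction term
        elbo -= gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)        # complexity term
        h_prev = h                                               # condition the next step on this sample
    return elbo
```

Feeding each sampled latent into the next timestep is what implements the recursive marginal with a single sample of the sequence.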

Appendix C Posterior Predictive Distribution

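Here we provide a derivation of the posterior predictive distribution for the dynamic latent variable model with multiple latent states. Appendix B described how we compute the distributions $p(X^{(t)} \mid H^{(t)})$, $p(H^{(t)} \mid H^{(t-1)}, a^{(t-1)})$, $q(H^{(0)} \mid x^{(0)})$, and $q(H^{(t)} \mid H^{(t-1)}, x^{(t)}, a^{(t-1)})$. Here we show that these distributions can be used to approximate the predictive posterior distribution $p(X^{(T+1:T+d)} \mid x^{(0:T)}, a^{(0:T+d-1)})$ by maximizing the following lower bound:

$$\log p\left(X^{(T+1:T+d)} \mid x^{(0:T)}, a^{(0:T+d-1)}\right) = \log \int p\left(X^{(T+1:T+d)}, h^{(0:T+d)} \mid x^{(0:T)}, a^{(0:T+d-1)}\right) dh^{(0:T+d)} \geq \mathbb{E}_{h^{(0:T+d)} \sim q(H^{(0:T+d)} \mid \cdot)}\left[\log \frac{p\left(X^{(T+1:T+d)}, h^{(0:T+d)} \mid x^{(0:T)}, a^{(0:T+d-1)}\right)}{q\left(h^{(0:T+d)} \mid \cdot\right)}\right]. \tag{9}$$

The numerator can be decomposed into two terms, one of which involves the posterior $p(h^{(0:T+d)} \mid x^{(0:T)}, a^{(0:T+d-1)})$:

$$p\left(X^{(T+1:T+d)}, h^{(0:T+d)} \mid x^{(0:T)}, a^{(0:T+d-1)}\right) = p\left(X^{(T+1:T+d)} \mid h^{(T+1:T+d)}\right) p\left(h^{(0:T+d)} \mid x^{(0:T)}, a^{(0:T+d-1)}\right).$$

This allows Equation 9 to be broken up into two terms:

$$\mathbb{E}_{h^{(0:T+d)} \sim q(H^{(0:T+d)} \mid \cdot)}\left[\log p\left(X^{(T+1:T+d)} \mid h^{(T+1:T+d)}\right)\right] - D_{\mathrm{KL}}\left(q\left(H^{(0:T+d)} \mid \cdot\right) \,\|\, p\left(H^{(0:T+d)} \mid x^{(0:T)}, a^{(0:T+d-1)}\right)\right).$$

Maximizing the second term, the negative KL divergence between the variational distribution and the posterior, is the same as maximizing the following lower bound:

$$\mathbb{E}_{h^{(0:T+d)} \sim q(H^{(0:T+d)} \mid \cdot)}\left[\log p\left(x^{(0:T)} \mid h^{(0:T)}\right)\right] - D_{\mathrm{KL}}\left(q\left(H^{(0:T+d)} \mid \cdot\right) \,\|\, p\left(H^{(0:T+d)} \mid a^{(0:T+d-1)}\right)\right), \tag{10}$$

where the first term is due to the conditional independence between $X^{(0:T)}$ and the future states $H^{(T+1:T+d)}$ and actions $a^{(T:T+d-1)}$. The two objectives differ only by $\log p(x^{(0:T)} \mid a^{(0:T+d-1)})$, which does not depend on $q$. Note that Equation 10 is not the same as the ELBO in Equation 8 because the KL divergence term is with respect to distributions over $H^{(0:T+d)}$, not $H^{(0:T)}$. We choose to express $q(H^{(0:T+d)} \mid \cdot)$ as conditioned on past states and actions, factorized across time:

$$q\left(H^{(0:T+d)} \mid x^{(0:T)}, a^{(0:T+d-1)}\right) = q\left(H^{(0)} \mid x^{(0)}\right) \prod_{t=1}^{T} q\left(H^{(t)} \mid H^{(t-1)}, x^{(t)}, a^{(t-1)}\right) \prod_{t=T+1}^{T+d} q\left(H^{(t)} \mid H^{(t-1)}, a^{(t-1)}\right).$$

In summary, Equation 9 can be expressed, up to an additive constant that does not depend on $q$, as

$$\mathbb{E}_{q}\left[\log p\left(x^{(0:T)} \mid h^{(0:T)}\right)\right] + \mathbb{E}_{q}\left[\log p\left(X^{(T+1:T+d)} \mid h^{(T+1:T+d)}\right)\right] - D_{\mathrm{KL}}\left(q\left(H^{(0:T+d)} \mid \cdot\right) \,\|\, p\left(H^{(0:T+d)} \mid a^{(0:T+d-1)}\right)\right),$$

which can be interpreted as a reconstruction term for timesteps $0{:}T$, a reconstruction term for timesteps $T+1{:}T+d$, and a complexity term for all timesteps. We can maximize this using the same techniques as maximizing Equation 8.

Whereas approximating the ELBO in Equation 8 can be implemented by rolling out OP3 to predict the next observation via teacher forcing [56], approximating the posterior predictive distribution in Equation 9 can be implemented by rolling out the dynamics model $d$ steps beyond the last observation and using the observation model to predict the future observations.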
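The difference between the two rollouts can be sketched as follows: the latents are first filtered through the observed frames, then the dynamics model alone is rolled forward past the last observation and each predicted latent is decoded. The callables `posterior`, `dynamics`, and `decode` are hypothetical placeholders for the learned networks, and the diagonal-Gaussian sampling is an assumption of this sketch.

```python
import numpy as np

def predict_future(xs, actions, posterior, dynamics, decode, rng, horizon):
    """Predict `horizon` future observations after the last frame in `xs`.

    xs: list of T+1 observed frames; actions: list of T + horizon actions.
    Hypothetical interfaces (assumptions of this sketch):
      posterior(h_prev, x, a) -> (mu, sigma), with h_prev = a = None at t = 0;
      dynamics(h_prev, a) -> (mu, sigma); decode(h) -> predicted observation.
    """
    # Phase 1: filter through the observed frames, conditioning on each real observation.
    h = None
    for t, x in enumerate(xs):
        a = actions[t - 1] if t > 0 else None
        mu, sigma = posterior(h, x, a)
        h = mu + sigma * rng.standard_normal(mu.shape)

    # Phase 2: no observations are available, so only the dynamics model is applied.
    last = len(xs) - 1
    predictions = []
    for k in range(horizon):
        mu, sigma = dynamics(h, actions[last + k])
        h = mu + sigma * rng.standard_normal(mu.shape)
        predictions.append(decode(h))
    return predictions
```

In phase 2 the prediction errors can compound, since each step conditions on a sampled latent rather than on a refined posterior.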

Appendix D Interactive Inference

Algorithms 1 and 2 detail $M$ steps of the interactive inference algorithm at timestep $0$ and $t \in [1, T]$ respectively. Algorithm 1 is equivalent to the IODINE algorithm described in [19]. Recalling that $\lambda_{1:K}$ are the parameters for the distribution of the random variables $H_{1:K}$, we consider in this paper the case where this distribution is an isotropic Gaussian (e.g. $\mathcal{N}(\lambda_k)$ where $\lambda_k = (\mu_k, \sigma_k)$), although OP3 need not be restricted to the Gaussian distribution. The refinement network $f_q$ produces the parameters for the distribution $q(H_k^{(t)} \mid h_k^{(t-1)}, x^{(t)}, a^{(t-1)})$. The dynamics network $f_d$ produces the parameters for the distribution $d(H_k^{(t)} \mid h_k^{(t-1)}, h_{[\neq k]}^{(t-1)}, a^{(t-1)})$. To implement $q$, we repurpose the dynamics model to transform $h_k^{(t-1)}$ into the initial posterior estimate $\lambda_k^{(0)}$ and then use $f_q$ to iteratively update this parameter estimate. $\beta_k$ indicates the auxiliary inputs into the refinement network used in [19]. We mark the major areas where the algorithm at timestep $t$ differs from the algorithm at timestep $0$ in blue.
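The control flow described above, with a default initialisation at the first timestep and a dynamics-based initialisation afterwards, can be sketched as follows. This is a deliberately simplified illustration: `dynamics` and `refine` are hypothetical callables, the sketch operates directly on the posterior parameters rather than on sampled latents, and the default step counts of four and two are taken from the schedule reported in the Figure 6 caption.

```python
import numpy as np

def interactive_inference(xs, actions, lam_init, dynamics, refine,
                          steps_first=4, steps_later=2):
    """Return one refined posterior-parameter estimate per observed frame.

    xs: list of T+1 frames; actions: list of T actions.
    Hypothetical interfaces (assumptions of this sketch):
      dynamics(lam_prev, a) -> initial parameter estimate for the next timestep;
      refine(lam, x) -> updated parameter estimate given the current frame.
    """
    estimates = []
    lam = lam_init
    for t, x in enumerate(xs):
        if t > 0:
            # After the first frame, the dynamics prediction replaces the default initialisation.
            lam = dynamics(lam, actions[t - 1])
        num_steps = steps_first if t == 0 else steps_later
        for _ in range(num_steps):
            lam = refine(lam, x)          # iterative refinement against the current frame
        estimates.append(lam)
    return estimates
```

Fewer refinement steps suffice after the first frame because the dynamics prediction already starts the estimate close to the new posterior.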