Entity Abstraction in Visual Model-Based Reinforcement Learning
This paper tests the hypothesis that modeling a scene in terms of entities and their local interactions, as opposed to modeling the scene globally, provides a significant benefit in generalizing to physical tasks in a combinatorial space the learner has not encountered before. We present object-centric perception, prediction, and planning (OP3), which to the best of our knowledge is the first entity-centric dynamic latent variable framework for model-based reinforcement learning that acquires entity representations from raw visual observations without supervision and uses them to predict and plan. OP3 enforces entity abstraction, the symmetric processing of each entity representation with the same locally-scoped function, which enables it to scale to model different numbers and configurations of objects from those in training. Our approach to solving the key technical challenge of grounding these entity representations to actual objects in the environment is to frame this variable binding problem as an inference problem, and we develop an interactive inference algorithm that uses temporal continuity and interactive feedback to bind information about object properties to the entity variables. On block-stacking tasks, OP3 generalizes to novel block configurations and more objects than observed during training, outperforming an oracle model that assumes access to object supervision and achieving two to three times better accuracy than a state-of-the-art video prediction model.
A powerful tool for modeling the complexity of the physical world is to frame this complexity as the composition of simpler entities and processes. For example, the study of classical mechanics in terms of macroscopic objects and a small set of laws governing their motion has enabled not only an explanation of natural phenomena like apples falling from trees but the invention of structures that never before existed in human history, such as skyscrapers. Paradoxically, the creative variation of such physical constructions in human society is due in part to the uniformity with which human models of physical laws apply to the literal building blocks that comprise such structures – the reuse of the same simpler models that apply to primitive entities and their relations in different ways obviates the need, and cost, of designing custom solutions from scratch for each construction instance.
The challenge of scaling the generalization abilities of learning robots follows a similar characteristic to the challenges of modeling physical phenomena: the complexity of the task space may scale combinatorially with the configurations and number of objects, but if all scene instances share the same set of objects that follow the same physical laws, then transforming the problem of modeling scenes into a problem of modeling objects and the local physical processes that govern their interactions may provide a significant benefit in generalizing to solving novel physical tasks the learner has not encountered before. This is the central hypothesis of this paper.
We test this hypothesis by defining models for perceiving and predicting raw observations that are themselves compositions of simpler functions that operate locally on entities rather than globally on scenes. Importantly, the symmetry that all objects follow the same physical laws enables us to define these learnable entity-centric functions to take as input argument a variable that represents a generic entity, the specific instantiations of which are all processed by the same function. We use the term entity abstraction to refer to the abstraction barrier that isolates the abstract variable, which the entity-centric function is defined with respect to, from its concrete instantiation, which contains information about the appearance and dynamics of an object that modulates the function’s behavior.
Defining the observation and dynamic models of a model-based reinforcement learner as neural network functions of abstract entity variables allows for symbolic computation in the space of entities, but the key challenge for realizing this is to ground the values of these variables in the world from raw visual observations. Fortunately, the language of partially observable Markov decision processes (POMDP) enables us to represent these entity variables as latent random state variables in a state-factorized POMDP, thereby transforming the variable binding problem into an inference problem. This lets us build upon state-of-the-art techniques in amortized iterative variational inference [39, 38, 19] to use temporal continuity and interactive feedback to infer the posterior distribution of the entity variables given a sequence of observations and actions.
We present a framework for object-centric perception, prediction, and planning (OP3), a model-based reinforcement learner that predicts and plans over entity variables inferred via an interactive inference algorithm from raw visual observations. Empirically OP3 learns to discover and bind information about actual objects in the environment to these entity variables without any supervision on what these variables should correspond to. As all computation within the entity-centric function is local in scope with respect to its input entity, the process of modeling the dynamics or appearance of each object is protected from the computations involved in modeling other objects, which allows OP3 to generalize to modeling a variable number of objects in a variety of contexts with no re-training.
Contributions: Our conceptual contribution is the use of entity abstraction to integrate graphical models, symbolic computation, and neural networks in a model-based RL agent. This is enabled by our technical contribution: defining models as the composition of locally-scoped entity-centric functions and the interactive inference algorithm for grounding the abstract entity variables in raw visual observations without any supervision on object identity. Empirically, we find that OP3 achieves two to three times greater accuracy than state of the art video prediction models in solving novel single and multi-step block stacking tasks.
Representation learning for visual model-based reinforcement learning: Prior works have proposed learning video prediction models [55, 7, 35, 13] to improve exploration and planning in reinforcement learning. However, such works and others [22, 62, 40, 42] that represent the scene with a single representation vector may be susceptible to the binding problem [20, 46] and must rely on data to learn that the same object in two different contexts can be modeled similarly. But processing a disentangled latent state with a single function [54, 6, 33, 34, 18] or processing each disentangled factor with a different function [35, 59] (1) assumes a fixed number of entities that cannot be dynamically adjusted for generalizing to more objects than in training and (2) has no constraints to enforce that multiple instances of the same entity in the scene be modeled in the same way. For generalization, often the particular arrangement of objects in a scene does not matter so much as what is constant across scenes, namely properties of individual objects and inter-object relationships, which the inductive biases of these prior works do not capture. The entity abstraction in OP3 enforces symmetric processing of entity representations, thereby overcoming the limitations of these prior works.
Unsupervised grounding of abstract entity variables in concrete objects: Prior works that model entities and their interactions often pre-specify the identity of the entities [5, 3, 23, 26, 41, 2, 1], provide additional supervision [17, 24, 53, 60], or provide additional specification such as segmentations, crops, or a simulator [57, 29]. Those that do not assume such additional information often factorize the entire scene into pixel-level entities [48, 61, 9], which do not model objects as coherent wholes. None of these works solve the problem of grounding the entities in raw observation, which is crucial for autonomous learning and interaction. OP3 builds upon recently proposed ideas in grounding entity representations via inference on a symmetrically factorized generative model of static [20, 21, 19] and dynamic scenes, whose advantage over other methods for grounding [64, 12, 4, 32] is the ability to refine the grounding with new information. In contrast to other methods for binding in neural networks [37, 28, 49, 52], formulating inference as a mechanism for variable binding allows us to model uncertainty in the values of the variables.
We consider a physical scene that contains a set of objects. The image observation of the scene and the agent's actions are treated as random variables. In contrast to prior works that use a single latent variable to represent the state of the scene, we use a set of latent random variables, one per object, to represent the state of the objects. We use the term object to refer to something that is part of the physical world, and the term entity to refer to the latent variable that represents it, which is part of our model of the physical world. The generative distribution of observations and latent entities from taking actions is modeled as:
The observation distribution and the dynamics distribution are each shared across all timesteps. Our goal is to build a model that, from simply observing raw observations of random interactions, can generalize to solve novel compositional object manipulation problems that the learner was never trained to do, such as building various block towers at test time after training only to predict how blocks fall.
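The factorization described above is that of a state-space model whose latent state is a set of entities. As a sketch only, under notation assumed here rather than recovered from the text ($X^{(t)}$ for the observation, $H^{(t)}_{1:K}$ for the $K$ entity variables, and $a^{(t)}$ for the action at timestep $t$), it can be written as:

```latex
p\big(X^{(0:T)}, H^{(0:T)}_{1:K} \,\big|\, a^{(0:T-1)}\big)
  = p\big(H^{(0)}_{1:K}\big)
    \prod_{t=1}^{T} p\big(H^{(t)}_{1:K} \,\big|\, H^{(t-1)}_{1:K}, a^{(t-1)}\big)
    \prod_{t=0}^{T} p\big(X^{(t)} \,\big|\, H^{(t)}_{1:K}\big)
```

The first product is the dynamics distribution and the second is the observation distribution; both are reused unchanged at every timestep.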
When all tasks follow the same dynamics, we can achieve such generalization with a planning algorithm if, given a sequence of actions, we could compute the posterior predictive distribution of observations several steps into the future. Approximating this predictive distribution can be cast as a variational inference problem (Appdx. B) for learning the parameters of an approximate observation distribution, an approximate dynamics distribution, and a time-factorized recognition distribution that together maximize the evidence lower bound (ELBO), which sums, over timesteps, a reconstruction term minus a complexity term.
The ELBO pushes the recognition distribution to produce states of the entities that contain information useful not only for reconstructing the observations via the observation distribution in the reconstruction term but also for predicting the entities' future states via the dynamics distribution in the complexity term. Sec. 4 presents our method for incorporating entity abstraction into modeling the generative distribution and optimizing the ELBO.
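For reference, a sequential ELBO of this kind typically takes the following form. This is a hedged sketch: the symbols $\mathcal{G}$, $\mathcal{D}$, and $\mathcal{Q}$ for the approximate observation, dynamics, and recognition distributions, and $q$ for the marginal over past latents, are assumed notation.

```latex
\begin{aligned}
\mathcal{L} &= \sum_{t=0}^{T} \Big( \mathcal{L}_r^{(t)} - \mathcal{L}_c^{(t)} \Big), \\
\mathcal{L}_r^{(t)} &= \mathbb{E}_{h^{(t)}_{1:K} \sim q}
  \Big[ \log \mathcal{G}\big(x^{(t)} \,\big|\, h^{(t)}_{1:K}\big) \Big], \\
\mathcal{L}_c^{(t)} &= \mathbb{E}_{h^{(t-1)}_{1:K} \sim q}
  \Big[ D_{\mathrm{KL}}\Big( \mathcal{Q}\big(H^{(t)}_{1:K} \,\big|\, h^{(t-1)}_{1:K}, x^{(t)}, a^{(t-1)}\big)
  \,\Big\|\, \mathcal{D}\big(H^{(t)}_{1:K} \,\big|\, h^{(t-1)}_{1:K}, a^{(t-1)}\big) \Big) \Big]
\end{aligned}
```

At the first timestep there is no previous state, so the KL term compares the recognition distribution against the prior over initial entity states instead of the dynamics distribution.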
The entity abstraction is derived from an assumption about symmetry: that the problem of modeling a dynamic scene of multiple entities can be reduced to the problem of (1) modeling a single entity and its interactions with an entity-centric function and (2) applying this function to every entity in the scene. Our choice to represent a scene as a set of entities exposes an avenue for directly encoding such a prior about symmetry that would otherwise not be straightforward with a global state representation.
As shown in Fig. 1, a function that respects the entity abstraction requires two ingredients. The first ingredient (Sec. 4.1) is that the function is expressed in part as the higher-order operation that broadcasts the same entity-centric function to every entity variable. This yields the benefit of automatically transferring learned knowledge for modeling an individual entity to all entities in the scene rather than learning such symmetry from data. As the entity-centric function takes a single generic entity variable as its argument, the second ingredient (Sec. 4.2) should be a mechanism that binds information from the raw observation about a particular object to that variable.
The functions of interest in model-based RL are the observation and dynamics models, with which we seek to approximate the data-generating distribution in Equation 1.
Observation Model: The observation model approximates the observation distribution, which models how the observation is caused by the combination of entities. We enforce the entity abstraction in the observation model (Fig. 1g) by applying the same entity-centric function to each entity, which we can implement using a mixture model at each pixel:
where the entity-centric function computes the mixture components that model how each individual entity is independently generated. The components are combined via mixture weights that model the entities' relative depth from the camera; the derivation is in Appdx. A.
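The per-pixel mixture can be illustrated with a small sketch. Everything here is an assumption made for illustration: the 1-D "image", the hand-written `decode_entity` (standing in for a learned decoder), and the entity fields `color`, `center`, `width`, and `depth_bias`. The point is only that one decoder is applied to every entity and that a softmax over the entities' mask logits yields the mixture weights at each pixel.

```python
import math

def decode_entity(entity, num_pixels):
    """Toy stand-in for the shared entity-centric decoder (hypothetical fields)."""
    decoded = []
    for i in range(num_pixels):
        inside = abs(i - entity["center"]) <= entity["width"]
        # High logit where the entity is present, low logit elsewhere.
        mask_logit = entity["depth_bias"] if inside else -5.0
        decoded.append((entity["color"], mask_logit))
    return decoded

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def observe(image, entities, std=0.1):
    """Return per-pixel mixture likelihoods and mixture weights (soft masks)."""
    # The same decoder is broadcast over all entities, whatever their number.
    decoded = [decode_entity(e, len(image)) for e in entities]
    likelihoods, masks = [], []
    for i, pixel in enumerate(image):
        weights = softmax([d[i][1] for d in decoded])
        components = [gaussian_pdf(pixel, d[i][0], std) for d in decoded]
        likelihoods.append(sum(w * c for w, c in zip(weights, components)))
        masks.append(weights)
    return likelihoods, masks

left = {"color": 0.2, "center": 1, "width": 1, "depth_bias": 2.0}
right = {"color": 0.9, "center": 4, "width": 1, "depth_bias": 2.0}
image = [0.2, 0.2, 0.2, 0.9, 0.9, 0.9]
likelihoods, masks = observe(image, [left, right])
```

Because `observe` never refers to a fixed number of entities, passing three entities instead of two requires no change to the decoder.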
Dynamics Model: The dynamics model approximates the dynamics distribution, which models how an action intervenes on the entities to produce their future values. We enforce the entity abstraction in the dynamics model (Fig. 1f) by applying the same entity-centric function to each entity, which reduces the problem of modeling how an action affects a scene with a combinatorially large space of object configurations to the problem of simply modeling how an action affects a single generic entity and its interactions with the list of other entities. Modeling the action as a finer-grained intervention on a single entity rather than on the entire scene is a benefit of using local representations of entities rather than global representations of scenes.
However, at this point we still have to model the combinatorially large space of interactions that a single entity could participate in. We therefore further enforce a pairwise entity abstraction on the entity-centric dynamics function by applying the same pairwise function to each pair formed by an entity and one of the other entities. Omitting the action to reduce clutter (the full form is written in Appdx. F.2), the structure of the dynamics model follows this form:
The entity abstraction therefore provides the flexibility to scale to modeling a variable number of objects by solely learning a function that operates on a single generic entity and a function that operates on a single generic entity pair, both of which can be reused across all entity instances.
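The structure described above, one function per entity plus one function per entity pair, can be sketched as follows. This is a toy illustration under assumptions: entities are scalar positions, `pairwise_effect` is a hand-written placeholder for a learned pairwise function, and the action is simply added to every entity rather than applied selectively.

```python
def pairwise_effect(focus, other):
    """Hypothetical shared pairwise function: a push that decays with distance."""
    gap = focus - other
    return 0.1 * gap / (1.0 + gap * gap)

def step_entity(focus, others, action):
    """Shared entity-centric function: own update plus summed pairwise effects."""
    interaction = sum(pairwise_effect(focus, other) for other in others)
    return focus + action + interaction

def step_scene(entities, action):
    """Broadcast the same entity-centric function over every entity."""
    return [
        step_entity(entity, entities[:k] + entities[k + 1:], action)
        for k, entity in enumerate(entities)
    ]

three_blocks = step_scene([0.0, 1.0, 2.5], action=0.5)
five_blocks = step_scene([0.0, 1.0, 2.0, 3.0, 4.0], action=0.5)
```

The same two functions handle three or five entities, and reordering the input list only reorders the output, which is the symmetry the entity abstraction is meant to enforce.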
Whether the observation and dynamics models can operate from raw pixels hinges on the ability to bind the properties of specific physical objects to the entity variables. For latent variable models, we frame this variable binding problem as an inference problem: binding information about the objects to the entity variables can be cast as a problem of inferring the parameters of the posterior distribution of the entity variables given a sequence of interactions. Maximizing the ELBO in Sec. 3 offers a method for learning the parameters of the observation and dynamics models while simultaneously learning an approximation to the posterior, which we have chosen to factorize into a per-timestep recognition distribution shared across timesteps. We also choose to enforce the entity abstraction on the process that computes the recognition distribution (Fig. 1e) by decomposing it into a recognition distribution applied to each entity:
Whereas a neural network encoder is often used to approximate the posterior [22, 58, 34], a single forward pass that computes the recognition distribution in parallel for each entity is insufficient to break the symmetry for dividing responsibility of modeling different objects among the entity variables, because the entities do not have the opportunity to communicate about which part of the scene they are representing.
We therefore adopt an iterative inference approach to compute the recognition distribution, which has been shown to break symmetry among modeling objects in static scenes. Iterative inference computes the recognition distribution via a procedure, rather than a single forward pass of an encoder, that iteratively refines an initial guess for the posterior parameters by using gradients from how well the generative model is able to predict the observation based on the current posterior estimate. The initial guess provides the noise to break the symmetry.
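The role of the noisy initial guess can be seen in a toy example. This sketch is not the learned refinement network: it swaps in a plain damped, responsibility-weighted update on a 1-D mixture, and the names `refine`, `sigma`, and `step_size` are hypothetical. Two "entities" each hold one posterior mean and must divide six "pixels" that fall into two clusters.

```python
import math

def refine(pixels, means, steps=50, sigma=0.3, step_size=0.5):
    """Iteratively refine per-entity means using reconstruction feedback."""
    means = list(means)
    for _ in range(steps):
        totals = [0.0] * len(means)  # summed responsibility per entity
        pulls = [0.0] * len(means)   # responsibility-weighted error per entity
        for x in pixels:
            scores = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in means]
            norm = sum(scores)
            for k, score in enumerate(scores):
                resp = score / norm
                totals[k] += resp
                pulls[k] += resp * (x - means[k])
        # Move each mean toward the pixels it currently explains.
        means = [m + step_size * p / t for m, p, t in zip(means, pulls, totals)]
    return means

pixels = [0.0, 0.1, -0.1, 1.0, 1.1, 0.9]
tied = refine(pixels, [0.5, 0.5])    # identical initial guesses
split = refine(pixels, [0.4, 0.6])   # slightly different initial guesses
```

With identical initial guesses both entities receive identical updates at every step and never specialize; with a small perturbation they separate and each settles on one cluster.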
For scenes where position and color are enough for disambiguating objects, a static image may be sufficient for inferring the entity variables. However, in interactive environments disambiguating objects is more underconstrained because what constitutes an object depends on the goals of the agent. We therefore incorporate actions into the amortized variational filtering framework to develop an interactive inference algorithm (Appdx. D and Fig. 4) that uses temporal continuity and interactive feedback to disambiguate objects. Another benefit of enforcing entity abstraction is that preserving temporal consistency on entities comes for free: information about each object remains bound to its respective entity variable through time, mixing with information about other entities only through explicitly defined avenues, such as in the dynamics model.
The variational parameters are the interface through which the neural networks that respectively output the distribution parameters of the observation, dynamics, and recognition distributions communicate. For a particular dynamic scene, the execution of interactive inference optimizes the variational parameters. Across scene instances, we train the weights of these networks by backpropagating the ELBO through the entire inference procedure, spanning multiple timesteps. OP3 thus learns at three different timescales: the variational parameters learn (1) across steps of inference within a single timestep and (2) across timesteps within a scene instance, and the network weights learn (3) across different scene instances.
Beyond next-step prediction, we can directly train to compute the posterior predictive distribution by sampling from the approximate posterior of the entity variables with the recognition distribution, rolling out the dynamics model in latent space from these samples with a sequence of actions, and predicting the observation with the observation model. This approach to action-conditioned video prediction predicts future observations directly from observations and actions, but with a bottleneck of time-persistent entity variables with which the dynamics model performs symbolic relational computation.
OP3 rollouts, computed as the posterior predictive distribution, can be integrated into the standard visual model-predictive control framework. Since interactive inference grounds the entities in the actual objects depicted in the raw observation, this grounding essentially gives OP3 access to a pointer to each object, enabling the rollouts to be in the space of entities and their relations. These pointers enable OP3 to not merely predict in the space of entities, but give OP3 access to an object-centric action space: for example, instead of being restricted to the standard (pick_xy, place_xy) action space common to many manipulation tasks, which often requires biased picking with a scripted policy [36, 27], these pointers enable us to compute a mapping (Appdx. G.2) between entity_id and pick_xy, allowing OP3 to automatically use an (entity_id, place_xy) action space without needing a scripted policy.
We consider tasks defined in the same environment with the same physical laws that govern appearance and dynamics. Tasks are differentiated by goals, in particular goal configurations of objects. Building good cost functions for real world tasks is generally difficult because the underlying state of the environment is always unobserved and can only be modeled through modeling observations. However, by representing the environment state as the state of its entities, we may obtain finer-grained goal-specification without the need for manual annotations. Having rolled out OP3 to a particular timestep, we construct a cost function to compare the predicted entity states with the entity states inferred from a goal image by considering pairwise distances between the entities, another example of enforcing the pairwise entity abstraction. Given the set of goal entities and the set of predicted entities, we define the form of the cost function via a composition of the task-specific distance function operating on entity pairs:
in which we pair each goal entity with the closest predicted entity and sum over the costs of these pairs. Assuming a single action suffices to move an object to its desired goal position, we can greedily plan each timestep by defining the cost to be that of the pair with minimum distance, and removing the corresponding goal entity from further consideration for future planning.
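The matching cost and its greedy variant can be sketched in a few lines. This is illustrative only: entities are plain tuples, and the squared Euclidean `distance` is an assumed placeholder for the task-specific distance function, which the text does not spell out here.

```python
def distance(goal, predicted):
    """Hypothetical task-specific distance: squared Euclidean on entity vectors."""
    return sum((g - p) ** 2 for g, p in zip(goal, predicted))

def scene_cost(goal_entities, predicted_entities):
    """Pair each goal entity with its closest predicted entity and sum the costs."""
    return sum(
        min(distance(goal, pred) for pred in predicted_entities)
        for goal in goal_entities
    )

def greedy_pair(goal_entities, predicted_entities):
    """Return (cost, goal_index) for the single closest goal/prediction pair."""
    return min(
        (distance(goal, pred), goal_index)
        for goal_index, goal in enumerate(goal_entities)
        for pred in predicted_entities
    )

goals = [(0.0, 0.0), (1.0, 1.0)]
predictions = [(0.0, 0.1), (1.0, 1.0)]
total = scene_cost(goals, predictions)
best_cost, satisfied_goal = greedy_pair(goals, predictions)
```

In a greedy planner, the goal at `satisfied_goal` would be removed from `goals` before planning the next step, so that each action is scored against the goals still outstanding.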
Our experiments aim to study to what degree entity abstraction improves generalization, planning, and modeling. Sec. 5.1 shows that from only training to predict how objects fall, OP3 generalizes to solve various novel block stacking tasks with two to three times better accuracy than a state-of-the-art video prediction model. Sec. 5.2 shows that OP3 can plan for multiple steps in a difficult multi-object environment. Sec. 5.3 shows that OP3 learns to ground its abstract entities in objects from real world videos.
We first investigate how well OP3 can learn object-based representations without additional object supervision, as well as how well OP3’s factorized representation can enable combinatorial generalization for scenes with many objects.
Domain: In the MuJoCo block stacking task introduced by Janner et al. for the O2P2 model, a block is raised in the air and the model must predict the steady-state effects of dropping the block on a surface with multiple objects, which implicitly requires modeling the effects of gravity and collisions. The agent is never trained to stack blocks, but is tested on a suite of tasks where it must construct a block tower specified by a goal image. Janner et al. showed that an object-centric model with access to ground truth object segmentations can solve these tasks with about 76% accuracy. We now consider whether OP3 can do better, but without any supervision on object identity.
Setup: We train OP3 on the same dataset and evaluate on the same goal images as Janner et al. While the training set contains up to five objects, the test set contains up to nine objects, which are placed in specific structures (bridge, pyramid, etc.) not seen during training. The actions are optimized using the cross-entropy method (CEM), with each sampled action evaluated by the greedy cost function described in Sec. 4.5. Accuracy is evaluated using the metric defined by Janner et al., which checks that all blocks are within some threshold error of the goal.
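For readers unfamiliar with the cross-entropy method, a generic sketch follows. It is not the configuration used in the experiments: the population size, elite count, iteration count, and the quadratic `toy_cost` are all assumptions chosen for illustration. In the setting above, the cost function would instead roll out the model for a candidate action and score the predicted entities against the goal.

```python
import random
import statistics

def cem(cost_fn, dim, iterations=20, population=64, num_elite=8, seed=0):
    """Minimize cost_fn over a dim-dimensional action with the cross-entropy method."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(iterations):
        # Sample candidate actions from the current Gaussian.
        samples = [
            [rng.gauss(m, s) for m, s in zip(mean, std)]
            for _ in range(population)
        ]
        # Keep the lowest-cost candidates and refit the Gaussian to them.
        elites = sorted(samples, key=cost_fn)[:num_elite]
        mean = [statistics.fmean(e[d] for e in elites) for d in range(dim)]
        std = [statistics.pstdev([e[d] for e in elites]) + 1e-6 for d in range(dim)]
    return mean

def toy_cost(action):
    """Hypothetical cost: squared distance to a target drop location."""
    return (action[0] - 0.3) ** 2 + (action[1] + 0.5) ** 2

best_action = cem(toy_cost, dim=2)
```

The small constant added to the standard deviation keeps the sampler from collapsing to a single point once the elites agree.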
Results: The two baselines, SAVP and O2P2, represent the state-of-the-art in video prediction and symmetric object-centric planning methods, respectively. SAVP models objects with a fixed number of convolutional filters and does not process entities symmetrically. O2P2 does process entities symmetrically, but requires access to ground truth object segmentations. As shown in Table 1, OP3 achieves better accuracy than O2P2, even without any ground truth supervision on object identity, possibly because grounding the entities in the raw image may provide a richer contextual representation than encoding each entity separately without such global context as O2P2 does. OP3 achieves three times the accuracy of SAVP, which suggests that symmetric modeling of entities enables the flexibility to transfer knowledge of dynamics of a single object to novel scenes with different configurations, heights, color combinations, and numbers of objects than those from the training distribution. Fig. 7 and Fig. 8 in the Appendix show that, by grounding its entities in objects of the scene through inference, OP3's predictions isolate only one object at a time without affecting the predictions of other objects.
The goal of our second experiment is to understand how well OP3 can perform multi-step planning by manipulating objects already present in the scene. We modify the block stacking task by changing the action space to represent a picking and dropping location. This requires reasoning over extended action sequences since moving objects out of place may be necessary.
Goals are specified with a goal image, and the initial scene contains all of the blocks needed to build the desired structure. This task is more difficult because the agent may have to move blocks out of the way before placing other ones, which would require multi-step planning. Furthermore, an action only successfully picks up a block if it intersects with the block's outline, which makes searching through the combinatorial space of plans a challenge. As stated in Sec. 4.4, having a pointer to each object enables OP3 to plan in the space of entities. We compare two different action spaces, (pick_xy, place_xy) and (entity_id, place_xy), to understand how automatically filtering for pick locations at actual locations of objects enables better efficiency and performance in planning. Details for determining the pick_xy from entity_id are in Appdx. G.2.
Results: We compare with SAVP, which uses the (pick_xy, place_xy) action space. With this standard action space (Table 2), OP3 achieves between 1.5 and 2 times the accuracy of SAVP. This performance gap increases to 2 to 3 times the accuracy when OP3 uses the (entity_id, place_xy) action space. The low performance of SAVP with only two blocks highlights the difficulty of such combinatorial tasks for model-based RL methods, and highlights both the generalization and localization benefits of a model with entity abstraction. Fig. 5b shows that OP3 is able to plan more efficiently, suggesting that OP3 may be a more effective model than SAVP in modeling combinatorial scenes. Fig. 6a shows the execution of interactive inference during training, where OP3 alternates between four refinement steps and one prediction step. Notice that OP3 infers entity representations that decompose the scene into coherent objects and that entities that do not model objects model the background. We also observe in the last column that OP3 predicts the appearance of the green block even though the green block was partially occluded in the previous timestep, which shows its ability to retain information across time.
The previous tasks used simulated environments with monochromatic objects. Now we study how well OP3 scales to real world data with cluttered scenes, object ambiguity, and occlusions. We evaluate OP3 on the dataset from Ebert et al., which contains videos of a robotic arm moving cloths and other deformable and multipart objects with varying textures.
We evaluate qualitative performance by visualizing the object segmentations and compare against vanilla IODINE, which does not incorporate an interaction-based dynamics model into the inference process. Fig. 6b highlights the strength of OP3 in preserving temporal continuity and disambiguating objects in real world scenes. While IODINE can disambiguate monochromatic objects in static images, we observe that it struggles to do more than just color segmentation on more complicated images where movement is required to disambiguate objects. In contrast, OP3 is able to use temporal information to obtain more accurate segmentations, as seen in Fig. 6b where it initially performs color segmentation by grouping the towel, arm, and dark container edges together, and then by observing the effects of moving the arm, separates these entities into different groups.
We have shown that enforcing the entity abstraction in a model-based reinforcement learner improves generalization, planning, and modeling across various compositional multi-object tasks. In particular, enforcing the entity abstraction provides the learner with a pointer to each entity variable, enabling us to define functions that are local in scope with respect to a particular entity, allowing knowledge about an entity in one context to directly transfer to modeling the same entity in different contexts. In the physical world, entities are often manifested as objects, and generalization in physical tasks such as robotic manipulation often may require symbolic reasoning about objects and their interactions. However, the general difficulty with using purely symbolic, abstract representations is that it is unclear how to continuously update these representations with more raw data. OP3 frames such symbolic entities as random variables in a dynamic latent variable model and infers and refines the posterior of these entities over time with neural networks. This suggests a potential bridge to connect abstract symbolic variables with the noisy, continuous, high-dimensional physical world, opening a path to scaling robotic learning to more combinatorially complex tasks.
The authors would like to thank the anonymous reviewers for their helpful feedback and comments. The authors would also like to thank Sjoerd van Steenkiste, Nalini Singh and Marvin Zhang for helpful discussions on the graphical model, Klaus Greff for help in implementing IODINE, Alex Lee for help in running SAVP, Tom Griffiths, Karl Persch, and Oleg Rybkin for feedback on earlier drafts, Joe Marino for discussions on iterative inference, and Sam Toyer, Anirudh Goyal, Jessica Hamrick, and Peter Battaglia for insightful discussions. This research was supported in part by the National Science Foundation under IIS-1651843, IIS-1700697, and IIS-1700696, the Office of Naval Research, ARL DCIST CRA W911NF-17-2-0181, DARPA, Berkeley DeepDrive, Google, Amazon, and NVIDIA.
Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908. Cited by: Appendix B.
Robustness via retrying: closed-loop robotic manipulation with self-supervised learning. arXiv:1810.03043. Cited by: Figure 6, §5.3.
Attend, infer, repeat: fast scene understanding with generative models. In Advances in Neural Information Processing Systems, pp. 3225–3233. Cited by: §2.
Neural expectation maximization. Cited by: §2.
Hyperdimensional computing: an introduction to computing in distributed representation with high-dimensional random vectors. Cognitive computation. Cited by: §2.
Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research 37 (4-5), pp. 421–436. Cited by: §4.4.
Journal of Artificial Intelligence Research 63, pp. 849–874. Cited by: §2.
Understanding the exploding gradient problem. ArXiv abs/1211.5063. Cited by: Appendix F.
Principles of neurodynamics: perceptrons and the theory of brain mechanisms. Technical report, Cornell Aeronautical Lab Inc., Buffalo, NY. Cited by: §2.
A learning algorithm for continually running fully recurrent neural networks. Neural computation 1 (2), pp. 270–280. Cited by: Appendix C.
The observation model describes how the objects cause the image observation. Each object is rendered independently as a sub-image, and the resulting sub-images are combined to form the final image observation. To combine the sub-images, each pixel in each sub-image is assigned a depth that specifies the distance of the corresponding object from the camera at that coordinate of the image plane. Thus each pixel of the observation takes on the value of its corresponding pixel in the sub-image of whichever object is closest to the camera at that pixel. Writing this selection with an indicator random variable per object and pixel allows us to intuitively interpret the indicators as segmentation masks and the sub-images as color maps. In reality we do not directly observe the depth values, so we must construct a probabilistic model to model our uncertainty:
where every pixel is modeled through a set of mixture components that model how pixels of the individual sub-images are generated, as well as through the mixture weights that model which point of each object is closest to the camera.
Here we provide a derivation of the evidence lower bound. We begin with the log probability of the observations conditioned on a sequence of actions:
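The starting point of such a derivation is the usual Jensen bound. As a sketch in assumed notation ($X$ for the observation sequence, $h$ for the sequence of entity states, $a$ for the action sequence, and $q$ for any approximating distribution over $h$):

```latex
\log p\big(X \,\big|\, a\big)
  = \log \int p\big(X, h \,\big|\, a\big)\, dh
  = \log \mathbb{E}_{h \sim q}\!\left[ \frac{p\big(X, h \,\big|\, a\big)}{q(h)} \right]
  \;\geq\; \mathbb{E}_{h \sim q}\!\left[ \log \frac{p\big(X, h \,\big|\, a\big)}{q(h)} \right]
```

The inequality holds for any choice of $q$ whose support covers that of the posterior, which is what leaves room to pick a form of $q$ that factorizes across time.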
We have freedom to choose the approximating distribution so we choose it to be conditioned on the past states and actions, factorized across time:
With this factorization, we can use linearity of expectation to decouple Equation 8 across timesteps:
where at the first timestep
and at subsequent timesteps
By the Markov property, the marginal is computed recursively as
whose base case is when .
We approximate the observation distribution and the dynamics distribution by learning the parameters of the observation model and dynamics model respectively as outputs of neural networks. We approximate the recognition distribution via an inference procedure that iteratively refines estimates of the posterior parameters, computed as an output of a neural network. To compute the expectation in the marginal, we follow standard practice in amortized variational inference by approximating the expectation with a single sample of the sequence, obtained by sequentially sampling the latents for one timestep given latents from the previous timestep, and optimizing the ELBO via stochastic gradient ascent [8, 31, 45].
Here we provide a derivation of the posterior predictive distribution for the dynamic latent variable model with multiple latent states. Section B described how we compute the distributions , , , and . Here we show that these distributions can be used to approximate the predictive posterior distribution by maximizing the following lower bound:
The numerator can be decomposed into two terms, one of which involves the posterior:
This allows Equation 9 to be broken up into two terms:
Maximizing the second term, the negative KL-divergence between the variational distribution and the posterior, is the same as maximizing the following lower bound:
where the first term is due to the conditional independence between and the future states and actions . Note that Equation 10 is not the same as the ELBO in Equation 8 because the KL divergence term is with respect to distributions over , not . We choose to express as conditioned on past states and actions, factorized across time:
In summary, Equation 9 can be expressed as
which can be interpreted as a reconstruction term for timesteps , a reconstruction term for timesteps , and a complexity term for all timesteps. We can maximize this using the same techniques as maximizing Equation 8.
Whereas approximating the ELBO in Equation 8 can be implemented by rolling out OP3 to predict the next observation via teacher forcing, approximating the posterior predictive distribution in Equation 9 can be implemented by rolling out the dynamics model several steps beyond the last observation and using the observation model to predict the future observations.
Algorithms 1 and 2 detail steps of the interactive inference algorithm at the first timestep and at subsequent timesteps respectively. Algorithm 1 is equivalent to the IODINE algorithm. Recalling that the variational parameters are the parameters for the distribution of the entity random variables, we consider in this paper the case where this distribution is an isotropic Gaussian, although OP3 need not be restricted to the Gaussian distribution. The refinement network produces the parameters for the recognition distribution. The dynamics network produces the parameters for the dynamics distribution. To implement the recognition distribution at subsequent timesteps, we repurpose the dynamics model to transform the previous timestep's entity state into the initial posterior estimate and then use the refinement network to iteratively update this parameter estimate. An additional symbol in the algorithms indicates the auxiliary inputs into the refinement network. We mark in blue the major areas where the algorithm at subsequent timesteps differs from the algorithm at the first timestep.