Learning Long-term Visual Dynamics with Region Proposal Interaction Networks

by Haozhi Qi, et al.

Learning long-term dynamics models is the key to understanding physical common sense. Most existing approaches to learning dynamics from visual input sidestep long-term predictions by resorting to rapid re-planning with short-term models. This not only requires such models to be highly accurate but also limits them to tasks where an agent can continuously obtain feedback and take action at each step until completion. In this paper, we aim to leverage the ideas from success stories in visual recognition tasks to build object representations that can capture inter-object and object-environment interactions over a long range. To this end, we propose Region Proposal Interaction Networks (RPIN), which reason about each object's trajectory in a latent region-proposal feature space. Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin both in terms of prediction quality and the ability to plan for downstream tasks, and also generalizes well to novel environments. Our code is available at https://github.com/HaozhiQi/RPIN.




1 Introduction

As argued by Kenneth Craik, if an organism carries a model of external reality and of its own possible actions within its head, it is able to react in a much fuller, safer, and more competent manner to the emergencies which face it craik1952nature. Indeed, building prediction models has long been studied in computer vision and intuitive physics. In vision, most approaches make predictions in pixel space denton2018stochastic; lee2018stochastic; ebert2018visual; jayaraman2018time; walker2016uncertain, which ends up capturing optical flow walker2016uncertain and is difficult to generalize to long horizons. In intuitive physics, a common approach is to learn the dynamics directly in an abstracted state space of objects to capture Newtonian physics battaglia2016interaction; chang2016compositional; sanchez2020learning. However, such states end up being detached from raw sensory perception. Unfortunately, these two extremes have barely been connected. In this paper, we argue for a middle ground that treats images as a window into the world, i.e., objects exist but can be accessed only via images. Images are neither to be used for predicting pixels nor to be isolated from dynamics. We operationalize this by learning to extract a rich state representation directly from images and building dynamics on top of the extracted state representations.

It is difficult to make predictions, especially about the future — Niels Bohr

Contrary to Niels Bohr, predictions are, in fact, easy if made only for the short term. The predictions that are truly difficult to make, and that actually matter, are the ones made over the long term. Consider the example of "Three-cushion Billiards" in Figure 1. The goal is to hit the cue ball in such a way that it touches the other two balls and contacts the wall thrice before hitting the last ball. This task is extremely challenging even for human experts because successful trajectories are very sparse. Do players perform classical Newtonian physics calculations to obtain the best action before each shot, or do they just memorize the solution by practicing through exponentially many configurations? Neither extreme is impossible, but both are often impractical. Players instead build a physical understanding through experience mccloskey1983intuitive; mccloskey1983intuitive2; kubricht2017intuitive and plan by making intuitive, yet accurate, long-term predictions.

Figure 1: Long-term dynamics prediction tasks. Left: three-cushion billiards. Right: PHYRE intuitive-physics dataset bakhtin2019phyre. Our proposed approach makes accurate long-term predictions that do not necessarily align with the ground truth but provide strong signal for planning.

Learning such long-term prediction models is arguably the "Achilles' heel" of modern machine learning methods. Current approaches to learning the physical dynamics of the world cleverly side-step the long-term dependency by re-planning at each step via model-predictive control (MPC) allgower2012nonlinear; camacho2013model. The common practice is to train short-term dynamics models (usually 1-step) in a simulator. However, small errors in short-term predictions accumulate over time in MPC. Hence, in this work, we focus primarily on the long-term aspect of prediction by considering only environments, such as the three-cushion billiards example or PHYRE bakhtin2019phyre in Figure 1, where an agent is allowed to take a single action at the beginning, precluding any scope for re-planning.

Our objective is to build data-driven prediction models for intuitive physics mccloskey1983intuitive that can both: (a) model long-term interactions over time to plan successfully for new instances, and (b) work from raw visual input in real-world scenarios. The question we ask is: how to leverage the ideas from success stories in computer vision tasks (e.g., object detection girshick2015fast; ren2015faster) to build long-term physical prediction models in a real-world environment? Our main idea is to represent each video frame with an object-centered representation where each object is treated as an individual entity. However, instead of performing prediction in the pixel space as most prior methods do ye2019cvp; walker2016uncertain, we predict the state (e.g., location, shape) for each object entity in latent feature space which is similar in spirit to the idea of region proposals in detection methods ren2015faster.

How do we extract object-centric features in an end-to-end fashion? We leverage region-of-interest (RoI) pooling girshick2015fast to extract object representations from the frame-level features, and build an interaction module to perform reasoning among the objects. Object feature extraction based on region proposals has achieved huge success in computer vision girshick2015fast; he2017mask; dai2017deformable; gkioxari2019mesh, and yet is surprisingly under-explored in the field of intuitive physics. By using RoI pooling, each object feature contains not only its own object information but also the context of the environment. We show in Section 5 that this contextual information is critical for dealing with interactions in complex environments. The interaction module and the object feature extraction are trained end-to-end by minimizing the distance between predicted and ground-truth object trajectories. We name our approach Region Proposal Interaction Networks (RPIN), illustrated in Figure 2.

Notably, our approach is simple, yet outperforms the state-of-the-art object feature extraction methods in both simulation and real datasets. In Section 5, we thoroughly evaluate our approach across four datasets to study scientific questions related to a) prediction quality, b) generalization to time horizons longer than training, c) generalization to unseen configurations, d) the role of each design choice and, e) planning ability for downstream tasks.

2 Related Work

Physical Reasoning and Intuitive Physics.

Learning models that can predict the changing dynamics of the scene is the key to building physical common sense. Such models date back to "NeuroAnimator" grzeszczuk1998neuroanimator for simulating articulated objects. Several methods in recent years have leveraged deep networks to build data-driven models of intuitive physics bhattacharyya2016long; ehrhardt2017learning; fragkiadaki2015learning; chang2016compositional; stewart2017label. However, these methods either require access to the underlying ground-truth state space or do not scale to long-range prediction due to the absence of interaction reasoning. A more generic yet explicit approach has been to leverage graph neural networks scarselli2009graph to capture interactions between entities in a scene battaglia2018relational; chang2016compositional. Closest to our approach are interaction models that scale to pixels and reason about object interaction watters2017visual; ye2019cvp. However, these approaches either reason about object crops with no surrounding context or can only deal with a predetermined number and ordering of objects.

Other common ways to measure physical understanding are to predict future judgments given a scene image, e.g., predicting the stability of a configuration groth2018shapestacks; jia20153d; lerer2016learning; li2016fall; li2016visual. Several hybrid methods take a data-driven approach to estimate Newtonian parameters from raw images brubaker2009estimating; wu2016physics; bhat2002computing; wu2015galileo, or model Newtonian physics via latent variables to predict motion trajectories in images mottaghi2016newtonian; mottaghi2016happens; ye2018interpretable. An extreme example is to use an actual simulator to do inference over objects hamrick2011internal. The reliance on explicit Newtonian physics makes these methods infeasible on real-world data and in un-instrumented settings. In contrast, we take into account the context around each object via RoI pooling and explicitly model objects' interactions with each other and with the environment without relying on Newtonian physics, making our approach easily scalable to real videos for long-range prediction.

Video Prediction.

Instead of modeling physics from raw images, an alternative is to treat visual reasoning as an image translation problem. This approach has been adopted in the line of work that falls under video prediction. The most common theme is to leverage latent-variable models for predicting the future lee2018savp; denton2018stochastic; babaeizadeh2017stochastic. Since predicting pixels is difficult, several methods leverage auxiliary information such as background/foreground villegas2017decomposing; tulyakov2017mocogan; vondrick2016generating, optical flow walker2016uncertain; liu2017video, and appearance transformations jia2016dynamic; finn2016unsupervised; chen2017video; xue2016visual. These inductive biases help over a short interval but do not capture long-range behavior as needed in several scenarios, like playing billiards, due to the lack of explicit reasoning. Some approaches scale to relatively longer horizons but are domain-specific, e.g., a pre-defined human-pose space villegas2017learning; walker2017pose. Furthermore, the primary evaluation of these methods is either rendering quality or representation quality mathieu2015deep; denton2018stochastic. In contrast, our goal is to model long-term interactions not only for prediction but also to facilitate planning for downstream tasks.

Learning Dynamics Models.

Unlike video prediction, dynamics models take actions into account when predicting the future; these are also known as forward models jordan1992forward. Learning forward dynamics models from images has recently become popular in robotics both for specific tasks wahlstrom2015pixels; agrawal2016learning; oh2015action; finn2016unsupervised and for exploration pathak2017curiosity; burdaICLR19largescale. In contrast to these methods, where a deep network directly predicts the whole outcome, we leverage our proposed region-proposal interaction module to capture each object's trajectory explicitly to learn long-range forward dynamics as well as video prediction models.

Planning via Learned Models.

Leveraging models to plan is the standard approach in control for obtaining task-specific behavior. A common approach is to re-plan after each action via model-predictive control allgower2012nonlinear; camacho2013model; deisenroth2011pilco. Scaling these models and planning in a high-dimensional space is a challenging problem. With deep learning, several approaches have shown promising results on real-world robotic tasks finn2016unsupervised; finn2017deep; agrawal2016learning; pathak2018zero. However, the horizon of these approaches is still very short, and re-planning over the long term drifts away in practice. Some methods try to alleviate this issue via object modeling janner2018reasoning; li2019propagation or skip connections ebert2018robustness, but assume the models are trained with state-action pairs. In contrast to prior works where a short-range dynamics model is unrolled in time, we learn our long-range models from passive data and then couple them with short-range forward models to infer actions during planning.

Figure 2: Our Region Proposal Interaction Network. Given frames as inputs, we forward them to an encoder network, and then extract the foreground object features with RoIPooling (different colors represent different instances). We then perform interaction reasoning on top of the region proposal features (gray box on the bottom right). We predict each future object feature based on the previous time steps. We then estimate the object location from each object feature.

3 Region Proposal Interaction Networks

Our model takes video frames as inputs and predicts the object locations for future timesteps, as illustrated in Figure 2. We first extract an image feature representation with a ConvNet for each frame, and then apply RoI pooling to obtain object-centric visual features. These object feature representations are forwarded to the interaction modules to perform interaction reasoning and predict future object locations. The whole pipeline is trained end-to-end by minimizing the loss between the predicted and ground-truth object locations. Since the parameters of the interaction module are shared across time, we can apply this process recurrently to predict arbitrarily far into the future during testing.

3.1 Representation and Prediction Modules

Object-Centric Visual Representation. We first apply an hourglass network newell2016stacked to extract the image features. Given an input image of size $H \times W$, the extracted feature map has dimension $d \times \frac{H}{s} \times \frac{W}{s}$, where $s$ is the spatial stride of the network and $d$ is the dimension of the visual feature. On top of this feature map, we use the RoI pooling operator to extract the object features. Each object feature is then flattened and forwarded to a fully connected layer to output a $d$-dimensional feature vector. Besides the visual features, we also forward the object center location to a two-layer fully connected encoder to get an object position embedding. The final object feature is the concatenation of the visual feature and the position feature. We use $x_t^i$ to denote the feature at the $t$-th timestep for the $i$-th object.
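The object-feature extraction above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: a crude average pool over feature cells stands in for RoI pooling, and `W_fc`/`W_pos` are hypothetical learned weight matrices.

```python
import numpy as np

def roi_avg_pool(fmap, box, stride=4):
    # fmap: (d, H/s, W/s) feature map; box: (x1, y1, x2, y2) in image pixels.
    # Crude stand-in for RoI pooling: average the feature cells under the box.
    x1, y1, x2, y2 = (int(v // stride) for v in box)
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)  # keep at least one cell
    return fmap[:, y1:y2, x1:x2].mean(axis=(1, 2))  # (d,)

def object_feature(fmap, box, W_fc, W_pos, stride=4, img_size=64.0):
    # Visual part: RoI-pooled feature projected by a fully connected layer.
    visual = W_fc @ roi_avg_pool(fmap, box, stride)
    # Position part: normalized box center through a linear "embedding".
    cx = (box[0] + box[2]) / 2 / img_size
    cy = (box[1] + box[3]) / 2 / img_size
    pos = W_pos @ np.array([cx, cy])
    # Final object feature x_t^i: concatenation of visual and position parts.
    return np.concatenate([visual, pos])
```

Because the pooled region is cut from the full-frame feature map, each object feature also carries context from the surrounding environment, which is the property the text emphasizes.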

Interaction Module. The interaction module is shown in the gray box on the bottom right of Figure 2. Our interaction reasoning is applied directly on the latent feature representation of each object. Assume we have $N$ objects at time step $t$, with features $\{x_t^1, \dots, x_t^N\}$. Interaction reasoning is performed between every two objects:

$$e_t^{ij} = W_e\,[x_t^i, x_t^j],$$

where $W_e$ is a learnable weight and $[\cdot,\cdot]$ denotes concatenation. Note that the interaction reasoning is only applied when the Euclidean distance between the two objects is smaller than a pre-defined threshold chang2016compositional; sanchez2020learning. For object $i$, we define $\mathcal{N}(i) = \{j \neq i : \|p_t^i - p_t^j\|_2 < \delta\}$ as the set of objects satisfying this constraint, where $p_t^j$ is the position of object $j$ and $\delta$ is the threshold hyperparameter. The updated feature for the $i$-th object is then given by:

$$\bar{x}_t^i = W_s\,x_t^i + W_r \sum_{j \in \mathcal{N}(i)} e_t^{ij},$$

where $\bar{x}_t^i$ is the updated feature and $W_s$ and $W_r$ are learnable weights implemented via fully connected layers. The feature dimension of $\bar{x}_t^i$ is the same as that of $x_t^i$. For simplicity, we denote the interaction reasoning process on $x_t^i$ as $\bar{x}_t^i = \mathcal{I}(x_t^i)$.
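A minimal sketch of such a distance-gated interaction step. The concatenation-based pairwise term and the weight names `W_e`, `W_s`, `W_r` are our assumptions for illustration, not necessarily the paper's exact parameterization.

```python
import numpy as np

def interact(feats, positions, W_e, W_s, W_r, delta=10.0):
    """One interaction-reasoning step over N object features.
    feats: (N, d) object features; positions: (N, 2) object centers.
    Pairwise terms are computed only for objects closer than delta."""
    N, d = feats.shape
    out = np.empty_like(feats)
    for i in range(N):
        agg = np.zeros(d)
        for j in range(N):
            # Only interact with neighbors within the distance threshold.
            if j != i and np.linalg.norm(positions[i] - positions[j]) < delta:
                e_ij = W_e @ np.concatenate([feats[i], feats[j]])
                agg += e_ij
        # Updated feature has the same dimension as the input feature.
        out[i] = W_s @ feats[i] + W_r @ agg
    return out
```

With no neighbors inside the threshold, the update reduces to a per-object linear transform, which matches the gating behavior described in the text.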

Prediction Model. Given the individual object representations from the interaction module over the past few time steps, we predict the future object state representation by:

$$x_{t+1}^i = W_f\,[\bar{x}_{t-k+1}^i, \dots, \bar{x}_t^i],$$

where we first concatenate the features of object $i$ over the past $k$ time steps, and then forward the concatenated feature to a fully connected layer with weights $W_f$. Note that although Figure 2 (dashed black rectangle) shows an example with a small $k$, the model easily generalizes to integrating more time steps. We apply this prediction model recurrently over time until all object features in the future frames have been predicted.
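The recurrent rollout can be sketched as follows, with linear layers as plain matrices; `W_f` is a hypothetical learned weight of shape `(d, k*d)`.

```python
import numpy as np

def predict_next(past_feats, W_f):
    # past_feats: list of k per-object feature matrices, each (N, d).
    # Concatenate along the feature axis and apply one FC layer: (N, k*d) -> (N, d).
    concat = np.concatenate(past_feats, axis=1)
    return concat @ W_f.T

def rollout(history, W_f, k, steps):
    # Apply the prediction model recurrently: each predicted step is
    # appended to the history and fed back in for the next step.
    feats = list(history)  # history of (N, d) matrices
    for _ in range(steps):
        feats.append(predict_next(feats[-k:], W_f))
    return feats
```

Because `W_f` is shared across timesteps, the same function is reused at every step of the rollout, which is what allows prediction beyond the training horizon.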

3.2 Learning Region Proposal Interaction Networks (RPIN)

Instead of predicting pixels, we train our model by predicting the future locations of each object, since we believe this is the key to performing planning tasks. Given the predicted feature $x_t^i$ for the $i$-th object at time $t$, we estimate its spatial location coordinates with a simple one-layer decoder: $\hat{p}_t^i = W_d\,x_t^i$. The ground-truth coordinate $p_t^i$ is a 2-dimensional coordinate normalized by the size of the input image. To facilitate training, besides the object location, we also predict its relative location (offset) $\hat{o}_t^i$ with another fully connected layer. We use the Euclidean distance between the predicted object locations and the ground-truth locations as the training objective:

$$\mathcal{L} = \sum_{t} \gamma^{t} \sum_{i} \left( \|\hat{p}_t^i - p_t^i\|_2^2 + \lambda\,\|\hat{o}_t^i - o_t^i\|_2^2 \right),$$

where $\lambda$ is a constant balancing the two losses. Following watters2017visual, the discount factor $\gamma$ down-weights later timesteps to mitigate the effect of inaccurate predictions at an early training stage.
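A discounted trajectory loss of this form can be sketched as follows; the balancing constant and discount values below are placeholders, not the paper's tuned hyperparameters.

```python
import numpy as np

def trajectory_loss(pred_pos, gt_pos, pred_off, gt_off, lam=0.1, gamma=0.95):
    """Discounted L2 trajectory loss (a sketch).
    pred_pos/gt_pos: (T, N, 2) positions; pred_off/gt_off: (T, N, 2) offsets.
    lam balances the position and offset terms; gamma discounts later steps."""
    T = pred_pos.shape[0]
    loss = 0.0
    for t in range(T):
        pos_err = np.sum((pred_pos[t] - gt_pos[t]) ** 2)
        off_err = np.sum((pred_off[t] - gt_off[t]) ** 2)
        loss += gamma ** t * (pos_err + lam * off_err)
    return loss
```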

Uncertainty Modeling. Our model can also be adopted in cases where only a single image input is available, by using a single input timestep. We incorporate uncertainty estimation following ye2019cvp, modeling the latent distribution with a variational auto-encoder kingma2013auto. For the complete details, we refer the reader to ye2019cvp; here we give only a summary. We build an encoder $g$ that takes the image features of the first and last frames of a video sequence as input. The output of $g$ parameterizes a distribution, denoted $(\mu, \sigma)$. Given a sample from this distribution, we recover the latent variable by feeding it into a one-layer LSTM and merge it into the object features. In this case, our pipeline is trained with an additional loss that minimizes the KL divergence between the predicted distribution and the standard normal distribution $\mathcal{N}(0, I)$.


4 Experimental Setup


We evaluate our method’s prediction performance on four different datasets, and demonstrate the ability to perform planning for downstream tasks on two of them. We briefly introduce the four datasets below. The full dataset details are in the appendix.

Simulation Billiards (SimB): We use the simulation environment extended from sutskever2009recurrent; kossen2019structured. The image size is 64×64. Three balls of different colors and a fixed radius are randomly placed in the image. To set the initial velocity, we randomly sample from 5 different magnitudes and 12 directions and apply it to one of the balls. We generate 1,000 video sequences for training and 1,000 for testing, with 100 frames per sequence. We also evaluate the ability to generalize to more balls and differently sized balls in the experiment section.

Real World Billiards (RealB): This dataset contains "Three-cushion Billiards" videos from three separate professional games with different viewpoints, downloaded from YouTube. There are training videos with frames, and testing videos with frames. To get the bounding box annotations, we use an off-the-shelf detector lin2017feature; wu2019detectron2 to detect the billiard balls. The detector is initialized from a ResNet-101 FPN model pretrained on the COCO lin2014microsoft dataset and fine-tuned on a subset of images from our dataset.

PHYRE: We select 13 out of 25 tasks from the PHYRE benchmark bakhtin2019phyre. We treat all the moving balls as objects and other static bodies as background. For each task, we split the provided 100 templates into 80 training and 20 testing templates. This setting is called within-task generalization (PHYRE-W), where the testing environments contain the same object categories but different sizes and positions. We also evaluate our model's performance on environments containing objects and context it has never seen during training, called cross-task generalization (PHYRE-C). In this setting, ten tasks are used for training and the remaining three for testing.

ShapeStacks (SS): This is a synthetic dataset of multiple stacked objects (cubes, cylinders, or balls) ye2019cvp. Only the objects' center positions are provided. Following ye2019cvp, we assume each object bounding box is square and of size 70×70. There are 1,320 training videos and 296 testing videos, with 32 frames per video. On this dataset, we use a single input timestep and incorporate uncertainty estimation.

Baseline Comparisons.

We consider the following baselines. Since these baselines were originally tuned on different datasets with different architectures, a direct comparison is difficult. To mitigate this, we re-implement them using the same network structure and hyperparameters as ours, so that only the way of obtaining visual object features changes.

Visual Interaction Network (VIN) kipf2019contrastive; watters2017visual: Instead of using object-centric spatial pooling to extract object features, it directly assigns different channels of the image feature to different objects. This approach requires specifying a fixed number of objects and a fixed mapping between feature channels and object identity, which limits its ability to generalize to a different number of objects and different appearances.

Object Masking (OM) wu2017learning; veerapaneni2019entity; janner2018reasoning: This approach takes one image and $N$ object proposals as input. For each proposal, only the pixels inside the proposal are kept while the others are set to zero, leading to $N$ masked images. This approach assumes no background information is needed, and thus fails to predict accurate trajectories in complex environments such as PHYRE. It also costs $N$ times the computational resources.
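The masking operation in this baseline can be sketched as follows (a toy NumPy version; the box coordinates and image layout are illustrative assumptions):

```python
import numpy as np

def masked_inputs(image, boxes):
    # OM baseline: one masked copy of the image per object proposal.
    # Pixels outside the proposal box are zeroed, so all background
    # context is discarded, and compute scales with the number of objects.
    out = []
    for (x1, y1, x2, y2) in boxes:
        m = np.zeros_like(image)
        m[y1:y2, x1:x2] = image[y1:y2, x1:x2]
        out.append(m)
    return np.stack(out)  # (N, H, W, C)
```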

Compositional Video Prediction (CVP) ye2019cvp: The object feature is extracted by cropping the object image patch and forwarding it to an encoder. Since the object features are extracted directly from raw image patches, context information is ignored. We use CVP* to denote our re-implementation. For the ShapeStacks dataset, we consider both our re-implementation and the original model published with ye2019cvp.


Evaluation Metric. Given predicted locations $\hat{p}_t^i$ and ground-truth locations $p_t^i$ for $N$ objects, the prediction error at time step $t$ is

$$\text{err}_t = \frac{1}{N} \sum_{i=1}^{N} \|\hat{p}_t^i - p_t^i\|_2^2.$$

In the results, we report the average error over two horizons: the training horizon $t \in [0, T]$ and the longer extrapolation horizon $t > T$.
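This metric can be computed as follows (a sketch assuming the squared Euclidean distance averaged over objects, as in the formula above):

```python
import numpy as np

def prediction_error(pred, gt):
    # pred, gt: (T, N, 2) trajectories. Returns the per-timestep squared
    # distance averaged over the N objects (the tables scale this by 1,000).
    return ((pred - gt) ** 2).sum(axis=2).mean(axis=1)  # shape (T,)
```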

5 Evaluation Results: Prediction, Generalization, and Planning

Figure 3 shows some qualitative prediction results. More results are in the supplementary material and our project website. We organize this section and analyze our results by discussing five scientific questions related to the prediction quality, generalization ability across time & environment configurations, different design choices, and the ability to plan actions for downstream tasks.

Figure 3: Visualization on all of the four datasets. The first row is our prediction results and the second row is the corresponding ground-truth trajectories. Our method accurately predicts long-term future even after complex interactions.

5.1 How accurate is the predicted dynamics?

To evaluate how well the world dynamics is modeled, we report the prediction errors on the test split over the same time horizon the model is trained on, i.e., $t \in [0, T]$. The results are shown in Table 1 (left half). In this setting, the OM method performs relatively better than the other baselines on the billiards and PHYRE datasets, since it explicitly models objects by instance masking. In contrast, VIN needs to learn to attend to object features from a global image, which may make learning accurate object representations harder, leading to worse performance. Moreover, VIN requires a fixed number of objects, so it is not even trainable on the PHYRE dataset, which contains a variable number of objects. The CVP* method performs poorly on simulated billiards due to the complex dynamics of this dataset (i.e., there are more interactions, as shown in Figure 3). The performance of CVP* is reasonable on RealB and PHYRE, but still worse than OM. On the SS dataset, all the baselines including CVP work decently well. One possible explanation is that the object size is large, so the cropped image regions already provide enough context. Note that our re-implemented CVP method performs similarly to the originally reported numbers, which shows our re-implementation is sound. Finally, our method, with both explicit object modeling and contextual feature learning, achieves the best results on all four datasets. This demonstrates the advantage of using rich state representations. Note that on the ShapeStacks dataset, neither the baselines nor our method uses the pixel-wise supervision and stacked interaction networks of ye2019cvp. Thanks to the rich state representation, we achieve much more accurate object trajectory prediction even with much simpler interaction modeling.

Method | $t \in [0, T]$: SimB RealB PHYRE SS | $t > T$: SimB RealB PHYRE SS
VIN kipf2019contrastive; watters2017visual 3.89 1.02 N.A. 2.47 29.51 5.11 N.A. 7.77
OM wu2017learning; janner2018reasoning 3.58 0.59 11.31 3.01 27.87 3.23 22.96 9.51
CVP* ye2019cvp 82.15 0.85 22.63 2.84 112.09 4.26 35.84 7.72
CVP ye2019cvp - - - 1.95 - - - 11.42
Ours 2.44 0.34 4.46 1.59 22.20 2.19 12.20 6.83
Table 1: We compare our method with different baselines as well as ye2019cvp on all four datasets. The left half shows the prediction error when the rollout horizon is the same as at training time. The right half shows generalization to a longer horizon unseen during training. Errors are scaled by 1,000. * denotes our re-implementation for fair comparison.

5.2 Does learned model generalize to longer horizon than training?

As the parameters of our interaction module and prediction module are shared over time, our model can predict sequences longer than those seen at training time. In Table 1 (right half), we show the average prediction error for $t > T$. The results in this setting are consistent with our findings in Section 5.1: OM performs better on the billiards and PHYRE datasets, and the CVP* method performs poorly on SimB and somewhat better on RealB and PHYRE. On the ShapeStacks dataset, an interesting observation is that all the baselines outperform ye2019cvp. We hypothesize that this is because the network's representational power is spent entirely on predicting accurate locations, instead of being balanced against visual quality as in ye2019cvp. Still, our method achieves the best performance against all baselines as well as ye2019cvp. This again validates our hypothesis that the key to accurate long-term feature prediction is a rich state representation extracted from the image.

5.3 Does the learned model generalize to unseen configurations?

Method SimB-5 SimB-L PHYRE-C SS-4
VIN kipf2019contrastive; watters2017visual N.A. 54.77 N.A. N.A.
OM wu2017learning; janner2018reasoning 59.70 39.42 19.83 36.30
CVP* ye2019cvp 113.39 102.34 73.72 36.02
CVP ye2019cvp - - - 15.96
Ours 15.56 38.65 11.36 13.97
Table 2: The ability to generalize to novel environments. We show the average prediction error for . Our method achieves significantly better results compared to previous methods.
(a) pos only: 14.45
(b) pos + local: 15.18
(c) visual only: 4.86
(d) visual + pos: 4.63
(e) visual + pos + local: 4.46
Table 3: Ablation on PHYRE-W. We compare the effect of applying local interaction and position features to our baseline.

The general applicability of RoI pooling has been extensively verified in the computer vision community. Thanks to the object-centric representations, our method can generalize to novel environment configurations without any modification. We test this claim on several environments unseen during training. Specifically, we construct: 1) a simulated billiards dataset containing 5 balls of radius 2 (SimB-5); 2) a simulated billiards dataset containing 3 balls with radii enlarged from 2 to 5 (SimB-L); 3) PHYRE-C, where the test tasks are not seen during training; and 4) ShapeStacks with 4 stacked blocks (SS-4). The results are shown in Table 2.

Since VIN needs a fixed number of objects as input, it cannot generalize to a different number of objects; we therefore do not report its performance on SimB-5, PHYRE-C, and SS-4. Its generalization to larger objects is also poor due to its lack of explicit object modeling. The OM method performs better than the other baselines. One surprising finding is that the baselines are worse than ye2019cvp on SS-4, in contrast to our findings in Table 1. We hypothesize that the pixel-wise supervision in ye2019cvp provides regularization, which reduces overfitting and improves generalization to novel environments. Our method, although without such regularization, still achieves better performance.

5.4 How does model performance vary with respect to different design choices?

In Table 3, we analyze the effect of several network components on the PHYRE-W dataset, including position encoding and the local interaction constraint. First, (a) shows that with only position features, the prediction error is very high. This is because the positions of objects alone cannot represent the complex environment settings of the PHYRE dataset. With the local interaction constraint added (b), the error is even higher. In contrast, (c) shows that our method achieves significantly better results using only visual features, which demonstrates the effectiveness of simultaneously modeling the environment and the objects. (d) shows that adding position encoding features leads to another 0.23 improvement, indicating that the two features are complementary. Finally, adding the local interaction constraint (e) improves performance by about 0.17, suggesting the effectiveness of prior knowledge in facilitating interaction learning.

5.5 How well can the learned model be used for planning actions?

Method | Target State Error | Hitting Accuracy | PHYRE-W | PHYRE-C
Random Policy 36.91 9.50% 0.0% / 46.9% 0.0% / 30.0%
VIN kipf2019contrastive; watters2017visual 8.03 62.1% N.A. N.A.
OM wu2017learning; janner2018reasoning 7.79 64.5% 29.2% / 80.4% 15.3% / 45.0%
CVP* ye2019cvp 29.65 23.8% 4.2% / 40.0% 2.7% / 34.3%
Ours 6.86 68.8% 33.1% / 83.5% 18.3% / 74.7%
Table 4: Planning results for Simulation Billiards and PHYRE. From left to right: (i) init-end state error (lower is better); (ii) hitting accuracy (higher is better); (iii) PHYRE within-task success rate (higher is better); (iv) PHYRE cross-task success rate. For PHYRE, we show success rates for top-1 / top-100 action trials.

The advantage of a general-purpose prediction model is that it can be used for downstream planning tasks without any adaptation. We evaluate our prediction model on simulated billiards and a subset of PHYRE. To analyze the long-term prediction ability of our model under a controlled setting, we use the environment to generate the first few frames given one initial configuration and one candidate action. The resulting frames are used as input to our predictive model. We score each action according to the similarity between the generated trajectory and the goal state, and select the action with the highest score. The full planning algorithm and implementation details are included in the appendix. We evaluate planning performance on the following tasks:

Billiard Target State. Given an initial configuration and a target configuration after 40 timesteps, the goal is to find one action that leads to the target configuration. We report the smallest distance between the predicted trajectory during timesteps 35-45 and the target position.
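As a concrete rendering of this metric, the error can be computed as the minimum over the evaluation window of the summed per-ball distance to the target; the array names and shapes below are illustrative sketches, not taken from our code:

```python
import numpy as np

def target_state_error(traj, target, t_lo=35, t_hi=45):
    """traj: (T, K, 2) rolled-out object centers; target: (K, 2) goal centers.
    Returns the smallest summed distance to the target over timesteps 35-45."""
    window = traj[t_lo:t_hi + 1]                          # timesteps 35..45
    # per-timestep total distance between all K objects and their targets
    dists = np.linalg.norm(window - target[None], axis=2).sum(axis=1)
    return float(dists.min())                             # report the smallest
```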

Billiard Hitting. Given the initial configuration, the goal is to find an action that hits the other two balls within 50 timesteps. We report the average success rate over all configurations.

PHYRE. In this task, we need to place a red ball to achieve a specified goal in each environment (see Figure 1, right, for an example). The action space consists of the position and size of the red ball. We uniformly sample actions from the continuous action space and score each one according to the similarity between the generated trajectory and the goal state. We report the success rate for both the top-1 and top-100 actions.

The results are shown in Table 4. The planning accuracy is consistent with the prediction performance: our method performs significantly better than the baselines on both the simulated billiards planning and PHYRE tasks, especially on PHYRE cross-task generalization.

6 Conclusions

In this paper, we leverage modern computer vision techniques to propose Region Proposal Interaction Networks for physical interaction reasoning from visual inputs. We show that our general yet simple method achieves significant improvements and generalizes across both simulated and real-world environments for long-range prediction and planning. We believe this method may serve as a good benchmark for developing future methods in the field of learning intuitive physics, as well as for their application to real-world robotics.


This work is supported in part by DARPA MCS and DARPA LwLL. We would like to thank the members of BAIR for fruitful discussions and comments.


Appendix A Implementation Details

A.1 Network Backbone

To keep the comparison as fair as possible, we use the same hourglass network as the image feature extractor for our method and the baselines (VIN, OM, CVP*). Given an input image, we apply a 7×7 stride-2 convolution, three residual blocks with channel dimension , and a stride-2 max pooling. This intermediate feature representation is then fed into one hourglass module. In each hourglass, the feature maps are down-sampled with 3 stride-2 residual blocks and then up-sampled with nearest-neighbor interpolation. The input and output channel dimensions of each residual block are . Finally, the output features are transformed into object-centric representations. For the Simulated Billiards dataset, we use since the environment is relatively simple; for the Real-World Billiards, PHYRE, and ShapeStacks datasets, . We use batch normalization before each convolutional layer for the Simulated Billiards and ShapeStacks datasets; since the batch size for PHYRE and Real-World Billiards is relatively small, we use group normalization wu2018group instead. Normalization layers are not used after the network backbone.
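The decoder-side nearest-neighbor upsampling can be sketched as a minimal NumPy illustration of the operation (not the actual network code):

```python
import numpy as np

def nn_upsample_2x(feat):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map, as used
    on the up-sampling path of the hourglass module: every spatial
    location is duplicated into a 2x2 block."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)
```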

A.2 Dataset Details

SimB: To get the initial velocity, the magnitude (number of pixels moved per timestep) is sampled from and the direction is sampled from .

RealB: We found that the bounding box prediction results are accurate enough to serve as the ground-truth. After running the detector, we also manually go through the dataset and filter out images with incorrect detections.

PHYRE: The ids of the 13 selected tasks are {0, 1, 2, 7, 11, 12, 13, 14, 15, 16, 19, 20, 24}. We use the fold id provided by the dataset to split it into training and testing sets. For within-task generalization (PHYRE-W), the training set contains 80 templates from each task and the testing set contains the remaining 20 templates from each task. For cross-task generalization (PHYRE-C), the training set contains 100 templates from each of {0, 1, 2, 7, 11, 12, 13, 16, 20, 24}, while the test set contains 100 templates from each of {14, 15, 19}. For each template, we randomly sample a maximum of successful and failed actions to collect the trajectories to train our model. The image sequence is temporally downsampled by .
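The two splits can be sketched as follows; the template counts and task ids follow the text above, while the pair-list construction (`(task, template)` tuples) is purely illustrative:

```python
# Task ids from the text; the split-building code itself is a sketch.
TASKS = [0, 1, 2, 7, 11, 12, 13, 14, 15, 16, 19, 20, 24]
CROSS_TEST_TASKS = [14, 15, 19]

def phyre_within_split(num_templates=100, num_train=80):
    """PHYRE-W: 80 training / 20 testing templates per task."""
    train = [(t, i) for t in TASKS for i in range(num_train)]
    test = [(t, i) for t in TASKS for i in range(num_train, num_templates)]
    return train, test

def phyre_cross_split(num_templates=100):
    """PHYRE-C: all templates of 10 tasks for training, 3 held-out tasks for testing."""
    train_tasks = [t for t in TASKS if t not in CROSS_TEST_TASKS]
    train = [(t, i) for t in train_tasks for i in range(num_templates)]
    test = [(t, i) for t in CROSS_TEST_TASKS for i in range(num_templates)]
    return train, test
```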

A.3 Hyperparameters

We use the Adam optimizer kingma2014adam with cosine learning-rate decay loshchilov2016sgdr to train our networks. The default number of input frames is , except for ShapeStacks. We set to 256, except for Simulated Billiards where it is 64. During training, (denoted as ) is set to , except for ShapeStacks where we use 15 for a fair comparison with ye2019cvp. The discount factor is set to .

Simulated Billiards. The image size is 64×64. We train the model for 100K iterations with a learning rate of 210 and batch size 200. The local constraint threshold is times the object size.

Real-World Billiards. The image is resized to 192×64. We train the model for 240K iterations with a learning rate of 110 and batch size 20. The local constraint threshold is times the object size.

PHYRE. The image is resized to 128×128. We train the model for 150K iterations with a learning rate of 210 and batch size 20. The local constraint threshold is times the object size.

ShapeStacks. The image is resized to 224×224. We train the model for 25K iterations with a learning rate of 210 and batch size 40. In this dataset, we apply uncertainty modeling as described in Section 3.2. The loss weight of the KL-divergence is . During inference, following ye2019cvp, we randomly sample outputs from our model and select the best of them (in terms of distance to the ground truth) as our model's output. The local constraint threshold is times the object size.
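The best-of-K evaluation protocol described above (sample several rollouts, keep the one closest to ground truth) can be sketched as:

```python
import numpy as np

def best_of_k(samples, gt):
    """Given K sampled rollouts and the ground-truth trajectory, return the
    sample with the smallest L2 error, as in the evaluation protocol of
    ye2019cvp. Array shapes are illustrative."""
    errs = [np.linalg.norm(s - gt) for s in samples]
    return samples[int(np.argmin(errs))]
```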

Appendix B Planning Details

Given an initial state (represented by an image) and a goal, we aim to produce an action that leads to the goal from the initial state. Our planning algorithm works in a similar way to visual imagination fragkiadaki2015learning: first, we select a candidate action from a candidate action set . Then we generate six input images using the corresponding simulator. After that, we forward the images through our prediction model to obtain future object trajectories for each object and each timestep : . The score of each action is calculated by a score function designed for each task, and we select the action that maximizes the score.

We introduce the action set for each task in Section B.1 and the distance functions in Section B.2. A summary of our algorithm is given in Algorithm 1.

B.1 Candidate Action Sets

For Simulated Billiards, the action is 3-dimensional: the first two dimensions specify the direction of the force and the last dimension specifies its magnitude. During planning, we enumerate 5 different magnitudes and 12 different angles, leading to 60 possible actions. Every initial condition is guaranteed to have a solution.

For PHYRE, the action is also 3-dimensional: the first two dimensions specify the location at which the red ball is placed, and the last dimension specifies its radius. During planning, we randomly draw 2000 actions from a uniform distribution.
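Putting the two action sets into code, a hypothetical sketch might look like this; the force-magnitude range `max_force` and the unit action cube are assumptions, while the 5 × 12 grid and the 2000 uniform samples come from the text:

```python
import math
import random

def billiard_actions(num_angles=12, num_magnitudes=5, max_force=1.0):
    """Enumerate the 5 x 12 = 60 billiard candidate actions:
    a unit force direction (fx, fy) plus a force magnitude."""
    actions = []
    for k in range(num_angles):
        theta = 2 * math.pi * k / num_angles
        for m in range(1, num_magnitudes + 1):
            mag = max_force * m / num_magnitudes
            actions.append((math.cos(theta), math.sin(theta), mag))
    return actions

def phyre_actions(n=2000, seed=0):
    """Draw n random (x, y, radius) actions uniformly from [0, 1]^3."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random(), rng.random()) for _ in range(n)]
```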

B.2 Distance Functions

Init-End State Error. Denote the given target locations of the objects as . We use the following distance function, which measures the distance between the final rollout locations and the target locations:


Hitting Accuracy. Denote the given initial locations of the objects as . We apply the force at object . We use the following distance function, which prefers larger moving distances for objects other than :


PHYRE task. In this task, we are required to place a red ball such that a given green ball touches a certain goal object, either another moving ball or a purple wall. We denote the center position of the goal object as and the index of the green ball as . We then define the following distance functions, which consider the horizontal and vertical distances between the green ball and the goal position, respectively:


For task , we use as our score function. For task , we use as our score function. For the remaining tasks, we use as our score function.
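Since the exact formulas above were task-specific, the sketches below are only hedged illustrations of two of the scores (the reductions and array shapes are our assumptions, not the paper's definitions):

```python
import numpy as np

def hitting_score(traj, init, j):
    """Prefer actions that move the balls other than the struck ball j.
    traj: (T, K, 2) rolled-out centers; init: (K, 2) initial centers."""
    moved = np.linalg.norm(traj[-1] - init, axis=1)   # per-ball displacement
    return float(np.sum(np.delete(moved, j)))         # ignore the struck ball

def phyre_distances(traj, goal_center, g):
    """Horizontal / vertical distances between green ball g and the goal."""
    dx = abs(traj[-1, g, 0] - goal_center[0])
    dy = abs(traj[-1, g, 1] - goal_center[1])
    return dx, dy
```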

B.3 Planning Algorithm

Input: candidate actions, initial state, end state (optional)
Output: best action
best score ← −∞;
for each action in the candidate set do
       frames = Simulation(initial state, action);
       trajectories = PredictionModel(frames);
       calculate the score according to the task as in B.2;
       if score > best score then
              best score ← score; best action ← action;
       end if
end for
Algorithm 1 Planning Algorithm for Simulated Billiards and PHYRE
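The loop above can be rendered in Python as follows; `simulate` and `prediction_model` are stand-ins for the simulator and the learned predictor, and `score_fn` is the task-specific score:

```python
# Minimal sketch of the planning loop: simulate each candidate action,
# predict the resulting trajectories, and keep the highest-scoring action.
def plan(actions, init_state, score_fn, simulate, prediction_model):
    best_action, best_score = None, float("-inf")
    for a in actions:
        frames = simulate(init_state, a)      # generate the input frames
        traj = prediction_model(frames)       # roll out future trajectories
        s = score_fn(traj)                    # task-specific score (Sec. B.2)
        if s > best_score:
            best_score, best_action = s, a
    return best_action
```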

Appendix C Qualitative Experiments

We show some qualitative results in this section. For more results, we refer the reader to our project website: https://haozhiqi.github.io/RPIN/.

Figure 4: Visualization results of the Simulated Billiard dataset. We visualize the first input image and the trajectories in future timesteps.
Figure 5: Visualization results of the Real-World Billiard dataset. We visualize the first input image and the trajectories in future timesteps.
Figure 6: Visualization results of the PHYRE-W dataset. We visualize the first input image and the trajectories in future timesteps.
Figure 7: Visualization results of the PHYRE-C dataset. These environments are never seen in the training set. We visualize the first input image and the trajectories in future timesteps.
Figure 8: Visualization results of the ShapeStacks dataset. We visualize the first input image and the trajectories in future timesteps.