Unsupervised Intuitive Physics from Visual Observations

05/14/2018 ∙ by Sebastien Ehrhardt, et al. ∙ University of Oxford ∙ UCL

While learning models of intuitive physics is an increasingly active area of research, current approaches still fall short of natural intelligences in one important regard: they require external supervision, such as explicit access to physical states, at training and sometimes even at test time. Some authors have relaxed such requirements by supplementing the model with a handcrafted physical simulator. Still, the resulting methods are unable to automatically learn new complex environments and to understand physical interactions within them. In this work, we demonstrate for the first time learning such predictors directly from raw visual observations and without relying on simulators. We do so in two steps: first, we learn to track mechanically-salient objects in videos using causality and equivariance, two unsupervised learning principles that do not require auto-encoding. Second, we demonstrate that the extracted positions are sufficient to successfully train visual motion predictors that can take the underlying environment into account. We validate our predictors on synthetic datasets; then, we introduce a new dataset, ROLL4REAL, consisting of real objects rolling on complex terrains (pool table, elliptical bowl, and random height-field). We show that in all such cases it is possible to learn reliable extrapolators of the object trajectories from raw videos alone, without any form of external supervision and with no more prior knowledge than the choice of a convolutional neural network architecture.




1 Introduction

A striking property of natural intelligences is their ability to perform accurate and rapid predictions of physical phenomena using only noisy sensory inputs. Even more remarkable is the fact that such predictors are learned without explicit supervision; rather, natural intelligences induce their internal representation of physics automatically from experience.

Several authors have recently looked into the problem of learning physical predictors using deep neural networks in order to partially mimic this functionality. Early attempts predicted trajectories in hand-crafted spaces of physical parameters, such as positions and velocities, assuming that the ground-truth values of such parameters are fully observable during training. Others have considered performing predictions from visual observations, but used full supervision for training. Furthermore, while several papers [7, 3] make use of simulators as a way to generate the required supervisory signals, limited work has been done in transferring such models to real data.

In this paper, we also investigate learning physical predictors using deep neural networks. However, we do so in a fully unsupervised manner, learning from observations of unlabelled video sequences. In contrast to approaches such as the recent de-animation method of [39], we do not require synthetic data, nor do we rely on any handcrafted physical simulator for prediction. Our models are built directly from real data and learn intuitive physics models that empirically outperform more principled, but more brittle, models based on physical parameters [31].

Importantly, our goal is not to merely predict future frames in a video, a problem addressed before by several authors [19]. While we also predict future dynamics from a video stream, our goal is not to estimate appearance changes, but physical quantities such as object positions and velocities. So, where future frame prediction generates an image, our goal is to extract meaningful and actionable physical parameters from the data.

As a working example, we consider video footage of balls rolling on various surfaces, such as pool tables, bowls and random height-fields. Balls interact with the underlying environment (e.g., roll around obstacles) and among themselves (e.g., collide with each other). For rigorous assessment, in addition to considering several synthetic datasets, we also contribute a new public dataset, Roll4Real, containing a large number of such sequences captured in real-life. Methodologically, we make two contributions. First, inspired by [23], we show that an object detector can be learned in an unsupervised manner by tuning a convolutional detector to extract tracks that are maximally characteristic of the natural, causal ordering of the frames in a video. Second, we use these trajectories to learn visual predictors that automatically learn an internal representation of physics and can extrapolate the trajectory of the balls more reliably than even supervised approaches such as Interaction Networks (IN) [3] that use direct measurements of physical parameters.

Note that our goal, similar to other papers in this area, is not to come up with the best possible method for physical prediction. A handcrafted solution heavily engineered to use supervision, off-the-shelf trackers, and/or physical simulators may do better in raw predictive performance (although the task is in fact not simple, particularly as our terrains are complex and somewhat deformable). Rather, we focus on developing machines that can learn such physical predictors from raw input.

Empirically, we show that vision-based models handle observation noise more gracefully than approaches such as [7, 3] that are learned using physical ground-truth parameters extracted from simulated scenarios. We also show that the Visual Interaction Network (VIN) of [22], which also proposes a vision-based physical predictor, fails to account for the interaction of the objects and their environment, whereas a more distributed, tensor-based approach succeeds.

The rest of the paper is organized as follows. We discuss related work in section 2. We then present the technical details of our approach in section 3. Next, we introduce the new Roll4Real data in section 4 and use the latter as well as several existing synthetic datasets to evaluate the approach in section 5. We summarise our findings in section 6.

2 Related Work

Existing work in learning physics can be organised according to several axes.

Nature of the Representation of Physics:

A natural way to represent physics is to manually encode every object's parameters and physical properties (mass, velocity, positions, etc.). From the earliest approaches [4], this has been widely used to represent physics and propagate it [3, 7, 26, 32]. Some focus on representing a small subset of physical parameters such as positions and velocities [37, 38]. However, other approaches try to learn an implicit representation of physics, inspired by the success of implicit representations of dynamics [29, 28, 8, 5]. Implicit physics are usually represented as activations in a deep neural network [10, 36, 20].

Hand-crafted vs Learned Dynamics:

Some approaches [37], including simulation-based ones [4, 40], use physics by explicitly integrating parameters such as velocities. While this generally requires extensive knowledge of the environment and object properties, other methods [3, 7, 26, 32, 10] integrate parameters of the scenarios through recurrent learnable predictors to make long-term physical predictions.

Physical vs Visual Observations:

Many approaches [3, 7] assume direct access to physical quantities such as positions and velocities for prediction. While this enables very accurate predictions, it is unlikely that such accuracy can be reached in the real world. Others [4, 20, 21, 40, 36, 13, 33, 17] take as input one or several frames of a scene to deduce physical properties (intuitive or explicit) or predict the next state of a system.

Qualitative vs Quantitative Predictions:

While most of the papers discussed above consider quantitative predictions such as extrapolating trajectory, others have considered qualitative predictions focusing on intuitive physics, such as the stability of stacks of objects [4, 20, 21], the likelihood of a scenario [30] or the forces acting behind a scene [40]. Other papers are in between, and learn plausible if not accurate physical predictions [35, 18, 24], often for 3D computer graphics.

Nature of the Supervision:

Most approaches are passive and supervised, as they are passive observers of physical scenarios and use ground-truth information about key physical parameters (positions, velocities, stability) during training. While these approaches require expensive annotation of data, some works have tried to learn from unsupervised data, either through active manipulation [2, 9] or using the laws of physics [33].


Two favorite scenarios in such experiments are bouncing balls, including billiard-like environments [14], and block towers [20]. As a variant, [36] consider balls subject to gravitational pulls, ignoring harder-to-model collisions. Most papers make use of simulated data, with limited validation on real data. A different approach [25] is to predict qualitative object forces and trajectories in fully-unconstrained natural images. The approach of [2] considers instead learning from active poking using a real-life robot. In most cases experiments are done on synthetic data. However, approaches such as [37, 38, 21] also used real data; [38] also contributed a dataset of videos of short physical experiments called Phys-101.

We relate to such previous work in that we also make physical predictions of the trajectory of ball-like objects. However, we differ in two significant ways. First, our approach, while using only passive observations, is fully unsupervised, and yet competitive with, if not more accurate than, supervised counterparts. In particular, while [33, 40] also do not use image labels, they use a-priori knowledge of physics for training (a fully-fledged simulator and renderer in the case of [40]). Second, we systematically test on several real-life scenarios, both in training and testing, using our new Roll4Real dataset. Compared to datasets such as Phys-101, ours allows testing long-term ball-rolling prediction in complex scenarios.

3 Method

Figure 1: Overview of our unsupervised object tracker

Each training point consists of a sequence of five video frames. Top: the sequence is randomly permuted with probability 50%. The position extractor (a) computes a probability map for the object location, whose entropy is penalised by the entropy loss. The reconstructed trajectory is then fed to a causal/non-causal discriminator network (b) that determines whether the sequence is causal or not, encouraged by the causality loss. The bottom Siamese branch (c) of the architecture takes a randomly warped version of the video and is expected by the equivariance loss to extract correspondingly-warped positions in (d). Blue and green blocks contain learnable weights, and green blocks are the shared Siamese ones. At test time only the position extractor is retained.

Our goal is to construct a machine that can, given only raw videos and no supervision, learn physical parameters such as the position of the objects in the videos as well as proxies to physical laws that allow predicting the evolution of such parameters over time. For this, predicting appearance changes is not sufficient; instead, we decompose the problem into two steps. The first one is a method to discover and learn to extract object positions, using as a cue the fact that they should have non-trivial causal dynamics (section 3.1). This tracker scales well to large datasets and is able to detect different types of objects without any further specification. Then, we use the resulting object trajectories to learn visual predictors that can extrapolate the object positions through time, embodying a proxy to the laws of mechanics (section 3.2).

3.1 Unsupervised Detection and Tracking of Dynamic Objects

Single-object Detection.

Assume we are given video sequences of RGB frames, each initially containing a single object moving in an environment, such as a rolling ball. Our goal is to learn a detector function that extracts the 2D position of the moving object at any given time (Fig. 1). The challenge is to do so without access to any label for supervision or any a-priori information about object shape.

We start by implementing the detector as a shallow Convolutional Neural Network (CNN) that extracts a scalar score for each image pixel, resulting in a heat map. This is then normalised to a probability distribution using the softmax operator, and the location of the object is obtained as the expected value of the pixel coordinates under this distribution [13].
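The heat-map-to-position step can be illustrated with a minimal NumPy sketch. It assumes the CNN output is already available as a 2D array of scores; the function name `soft_argmax` and the `(x, y)` output convention are illustrative choices, not taken from the paper.

```python
import numpy as np

def soft_argmax(scores):
    """Expected (x, y) pixel location under the softmax of a 2D score map."""
    # Subtract the maximum before exponentiating for numerical stability.
    prob = np.exp(scores - scores.max())
    prob /= prob.sum()
    # Pixel coordinate grids: ys holds row indices, xs holds column indices.
    ys, xs = np.mgrid[0:scores.shape[0], 0:scores.shape[1]]
    return np.array([(prob * xs).sum(), (prob * ys).sum()])

# A sharply peaked heat map puts the expected location on the peak.
scores = np.zeros((8, 8))
scores[3, 5] = 50.0  # row 3, column 5
position = soft_argmax(scores)
```

Because the expectation is differentiable in the scores, a position obtained this way can be trained through by gradient descent, unlike a hard arg-max.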

We learn the detector by combining two learning principles. The first one is causality. Applied to a video sequence, the detector produces a trajectory of 2D positions. We expect that, when the detector locks properly onto the rolling object, the trajectory is physically plausible (e.g., causal/smooth); at the same time, if the frames are shuffled by a random permutation, the resulting trajectory should not be plausible anymore. We incorporate this constraint by learning a discriminator network that, for a subsequence, can distinguish between the natural ordering of the frames and a random shuffle (top row of Fig. 1). The permutation is sampled with 50% probability as a consecutive sequence of 5 frames and with 50% probability uniformly at random. The discriminator is a three-layer multi-layer perceptron followed by a sigmoid, and its output defines the causality loss.
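As a rough illustration of how ordered and shuffled subsequences could be sampled for the discriminator, consider the sketch below. It assumes that the shuffled case permutes the same window of five consecutive frames; the function name `sample_training_order` and the labelling rule are assumptions made for this example, not details given in the text.

```python
import numpy as np

def sample_training_order(num_frames, length=5, rng=None):
    """Sample frame indices for one training example and its causal label."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick a window of `length` consecutive frames.
    start = int(rng.integers(0, num_frames - length + 1))
    order = np.arange(start, start + length)
    # With probability 0.5 keep the natural order, otherwise shuffle it.
    if rng.random() >= 0.5:
        order = rng.permutation(order)
    # A random shuffle can coincide with the natural order, so the label
    # is derived from the sampled order itself.
    is_causal = bool(np.all(np.diff(order) == 1))
    return order, is_causal

rng = np.random.default_rng(0)
samples = [sample_training_order(30, rng=rng) for _ in range(200)]
```

The detector would then be applied to the frames in the sampled order, and the resulting trajectory passed to the discriminator together with the label.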

The second learning principle is equivariance (cf. [34, 27]). This principle suggests that, if a transformation is applied to a frame (e.g., a rotation), then the output of the detector should change accordingly: the position detected in the transformed frame should equal the transformed position detected in the original frame. This is implemented as a Siamese branch (bottom row in Fig. 1) extracting 2D positions from the rotated frames and comparing them to the rotated 2D positions extracted from the original frames; the discrepancy between the two defines the equivariance loss.
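The equivariance constraint can be checked numerically with a small sketch. It assumes a 90-degree rotation as the transformation, a squared-error penalty, and a brightness-weighted centroid standing in for the learned detector; all three are simplifications chosen for illustration, and the names `toy_detector`, `rotate_position_90`, and `equivariance_loss` are hypothetical.

```python
import numpy as np

def toy_detector(frame):
    """Stand-in detector: brightness-weighted centroid, returned as (x, y)."""
    weights = frame / frame.sum()
    ys, xs = np.mgrid[0:frame.shape[0], 0:frame.shape[1]]
    return np.array([(weights * xs).sum(), (weights * ys).sum()])

def rotate_position_90(position, width):
    """Map an (x, y) position to its location after np.rot90 of the frame."""
    x, y = position
    # np.rot90 moves pixel (row r, col c) to (row width - 1 - c, col r).
    return np.array([y, width - 1 - x])

def equivariance_loss(detector, frame):
    """Squared distance between detect(rotate(frame)) and rotate(detect(frame))."""
    position_of_warped = detector(np.rot90(frame))
    warped_position = rotate_position_90(detector(frame), frame.shape[1])
    return float(np.sum((position_of_warped - warped_position) ** 2))

rng = np.random.default_rng(0)
frame = rng.random((6, 9)) + 0.1
loss_equivariant = equivariance_loss(toy_detector, frame)

# A detector with a constant offset is not equivariant and is penalised.
biased_detector = lambda f: toy_detector(f) + np.array([1.0, 0.0])
loss_biased = equivariance_loss(biased_detector, frame)
```

The centroid commutes with the rotation, so its loss vanishes, while the offset detector incurs a non-zero penalty. This is the kind of degenerate solution (for instance, a constant output) that the equivariance term discourages.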

Finally, in order to encourage the softmax operator to produce peaky distributions, we minimise the entropy of the resulting distribution. The final loss is therefore a combination of the causality, equivariance, and entropy losses, each with a fixed weight in our experiments.
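For reference, the entropy of the softmax-normalised heat map can be computed as in the following sketch. The function name `heatmap_entropy` is illustrative, and the use of natural logarithms is an assumption.

```python
import numpy as np

def heatmap_entropy(scores):
    """Shannon entropy (in nats) of the softmax distribution over all pixels."""
    flat = scores.reshape(-1) - scores.max()
    # Log-softmax computed directly to avoid taking the log of tiny numbers.
    log_prob = flat - np.log(np.exp(flat).sum())
    return float(-(np.exp(log_prob) * log_prob).sum())

flat_map = np.zeros((4, 4))
peaky_map = np.zeros((4, 4))
peaky_map[1, 2] = 50.0

entropy_flat = heatmap_entropy(flat_map)    # maximal: log(16)
entropy_peaky = heatmap_entropy(peaky_map)  # close to zero
```

Minimising this quantity pushes the distribution towards a single sharp peak, which is what makes the expected location a meaningful object position.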

Multi-object Tracking.

We now extend the method from detection of single objects to tracking of multiple objects. In order to do so, the network is fine-tuned to videos containing two or more moving objects of different appearance.

Since the network produces only a single pair of coordinates, it can formally estimate the location of a single object in the image. However, when multiple objects are present, the unsupervised learning process could still converge to an undesirable result, such as predicting the center of mass of several objects combined, or randomly jumping between objects over time. The first is discouraged by the entropy loss, which prefers sharp heat maps. The second is discouraged by the causality loss, as discontinuous trajectories would not look plausibly ordered and consistent.

In practice, our model learns to consistently track a single object selected at random among the visible ones. Once this is done, in the next iteration, a second object is detected by suppressing (setting to zero) a circular region of fixed radius around the first object location in the activations immediately preceding the softmax operator, and the process is repeated for further object occurrences. Before the suppression we also add a positive bias to the activations, so that the previously detected objects receive near-zero probability in the new heatmap. Note that we consider the number of objects as given, since estimating it is in itself a very challenging task that is under active research [12].
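The suppression step can be sketched as follows. The bias value, the radius, and the choice to shift the activations so that their minimum equals the bias are assumptions made for this example; the function names are hypothetical.

```python
import numpy as np

def soft_argmax(scores):
    """Expected (x, y) location under the softmax of a 2D score map."""
    prob = np.exp(scores - scores.max())
    prob /= prob.sum()
    ys, xs = np.mgrid[0:scores.shape[0], 0:scores.shape[1]]
    return np.array([(prob * xs).sum(), (prob * ys).sum()])

def suppress_detected(scores, centers, radius, bias=20.0):
    """Add a positive bias, then zero a disk around each detected center."""
    ys, xs = np.mgrid[0:scores.shape[0], 0:scores.shape[1]]
    # Assumed form of the bias: shift so every activation is at least `bias`,
    # which makes the zeroed region negligible after the softmax.
    masked = scores - scores.min() + bias
    for cx, cy in centers:
        inside = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
        masked[inside] = 0.0
    return masked

# Two objects: a stronger response at (x=2, y=2), a weaker one at (x=12, y=9).
scores = np.zeros((16, 16))
scores[2, 2] = 40.0
scores[9, 12] = 30.0

first = soft_argmax(scores)
second = soft_argmax(suppress_detected(scores, [first], radius=3))
```

Without the bias, zeroed activations would still map to `exp(0)` under the softmax and could compete with low background scores, which is why the bias is applied before the mask.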


Figure 2: Multiple object unsupervised tracker (a) We first extract an object heatmap with the method described in section 3.1. (b) Then we mask the objects detected by the previously trained trackers on the heatmap by zeroing out the values in a circular area around their centers. (c) Finally, we extract the position from this last heatmap with masked values.

3.2 Trajectory Extrapolation Networks

We consider existing network modules for physical prediction. While these modules use external supervision in the original papers, here we apply them to the output of the unsupervised tracker of section 3.1, hence training such physical extrapolators in a fully unsupervised manner for the first time.

We experiment in particular with PosNet, DispNet, and ProbNet from [11], configuring them to take as input the first four frames of a sequence and to produce as output the prediction of future object positions. These models learn an implicit representation of physics, which is extrapolated automatically by a recurrent propagation layer and used to extract estimates of the object positions. The difference between the models is that PosNet regresses positions from state, while DispNet and ProbNet regress displacements from state. Furthermore, ProbNet produces a probability estimate over trajectories.

We also consider the Visual Interaction Network (VIN) module and its variant Interaction Network from State (IFS) [36]. While VIN uses only visual inputs for prediction just like the other networks, IFS works with an explicit state vector of physical parameters, which we set as the stacking of the 2D positions for the four past frames, starting with the positions extracted from our tracker. Additionally, in the synthetic experiments (first part of Table 2), IFS uses velocity, and in the BowlS experiments the ground-truth ellipsoid axes parameters are appended to the state to inform the model of the shape of the ground. IFS and VIN are trained following [11]; in particular, this means that VIN uses the same settings as the original paper, including the input image resolution.

We also note that, while VIN and the models from [11] share essentially the same core concepts (they consist of a feature extractor module to extract an implicit physical state, a recurrent propagation module to propagate the state, and an extractor module to get the desired physical parameters from the state), their main difference resides in the structure of the propagated state. While VIN uses a vector state representation, each of PosNet, DispNet, and ProbNet uses a tensor representation.

All such models are trained by showing the network four initial frames of a sequence and the output of the unsupervised tracker up to a fixed training horizon. At test time, the networks, which are recurrent, are used to extrapolate the trajectory up to an arbitrary time, also starting from four video frames. We test in particular at the training horizon and beyond it to assess the generalization capabilities of the models learned by the network.

In addition, for some experiments on a single object we also consider linear and quadratic extrapolators as baselines. In both cases we fit a first (respectively second) order polynomial to the first positions given as input (hence with a significant advantage compared to the networks, which only observe four frames).
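These least-squares baselines are simple enough to sketch directly. The example below fits one polynomial per coordinate with `numpy.polyfit` and evaluates it at future time steps; the function name `polynomial_extrapolate` and the unit time step are assumptions for illustration.

```python
import numpy as np

def polynomial_extrapolate(positions, horizon, degree):
    """Fit a polynomial in time to each coordinate and extrapolate `horizon` steps."""
    positions = np.asarray(positions, dtype=float)
    t_observed = np.arange(len(positions))
    t_future = np.arange(len(positions), len(positions) + horizon)
    predicted = []
    for coordinate in positions.T:  # x first, then y
        coefficients = np.polyfit(t_observed, coordinate, degree)
        predicted.append(np.polyval(coefficients, t_future))
    return np.stack(predicted, axis=1)  # shape: (horizon, 2)

# Constant-velocity motion is recovered exactly by the linear baseline.
t = np.arange(6)
line = np.stack([2.0 * t, 1.0 + t], axis=1)
linear_prediction = polynomial_extrapolate(line, horizon=3, degree=1)

# Constant-acceleration motion needs the quadratic baseline.
parabola = np.stack([t ** 2.0, np.full(6, 3.0)], axis=1)
quadratic_prediction = polynomial_extrapolate(parabola, horizon=1, degree=2)
```

Such extrapolators have no notion of the environment, so they cannot anticipate bounces off walls or changes of slope, which is where learned predictors are expected to help.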

4 Roll4Real: A New Benchmark Dataset

Figure 3: Physical setup In each of the three real-world scenarios (PoolR, BowlR, HeightR), we show the experimental setup (left) and a sample data frame (right).

In the absence of a suitable real-world dataset to evaluate intuitive physics on objects rolling on complex terrains, we created a new benchmark, Roll4Real (R4R).

Dataset Content.

R4R consists of short videos containing one or two balls rolling on three types of terrains (Fig. 3): a flat pool table (PoolR), a large ellipsoidal ‘bowl’ (BowlR), and an irregular height-field (HeightR). More specifically, there are videos (avg. frames/video) for the PoolR dataset with one ball; videos ( frames/video) for the BowlR dataset with one ball; videos ( frames/video) for the HeightR dataset with one ball; and videos ( frames/video) for the HeightR dataset with two balls. We rolled a total of 7 differently colored balls for the HeightR and BowlR datasets, varying from  cm to  cm in diameter. The height-field surface fits into a  cm bounding box, with  cm diameter. The bowl was created using a  cm diameter ball, and is  cm high. Videos were randomly split into train, validation, and test sets. Ground-truth annotations are provided for the test split.

Dataset Collection.

Both the bowl and height-field terrains were modeled using paper mâché on scaffolds, using a large inflatable ball and a custom-made wire-mesh frame, respectively. For the PoolR dataset, balls were rolled on the table, while for the other settings, balls were manually dropped from a small height and allowed to roll on. The setup was imaged using a fixed camera (Samsung Galaxy S8) from the top. The PoolR dataset was captured at 30fps (due to low light), while all the others were captured at 240fps in order to reduce motion blur and later downsampled to 80fps. Videos were cropped to only focus on the scenario of interest, i.e., ball(s) and terrain, and trimmed to retain the portion of the video containing motion. We rolled a total of 7 different balls: a pink foam ball ( cm diameter), a fluorescent yellow tennis ball ( cm), a blue and an orange ping-pong ball ( cm), a black squash ball with two yellow dots ( cm), and a green and a brown cork ball ( cm).

In order to create ground-truth tracks for the ball centers, we used a template-based tracker using zero-normalized cross-correlation in the LAB color space, and tracked each frame along with a smoothness term over time. The setup was manually initialised by providing a suitable template. The raw results were then manually inspected, corrected, and saved as ground-truth. We found that, due to environment jitter (the ball rolling on the different terrains often created vibration or deformation in the BowlR and HeightR datasets), differences in lighting across some experiments, and different ball colors, the template-based tracker was not perfect and manual inspection was required.
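To make the matching criterion concrete, here is a minimal brute-force sketch of zero-normalized cross-correlation (ZNCC) template matching on a single-channel image. It omits the LAB conversion and the temporal smoothness term described above, and the function names are illustrative.

```python
import numpy as np

def zncc(patch, template):
    """Zero-normalized cross-correlation between two equally sized arrays."""
    p = patch - patch.mean()
    t = template - template.mean()
    denominator = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    # Flat patches have no defined correlation; treat them as no match.
    return 0.0 if denominator == 0 else float((p * t).sum() / denominator)

def match_template(image, template):
    """Return the top-left (row, col) of the best match and its ZNCC score."""
    th, tw = template.shape
    best_score, best_position = -np.inf, (0, 0)
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            score = zncc(image[r:r + th, c:c + tw], template)
            if score > best_score:
                best_score, best_position = score, (r, c)
    return best_position, best_score

rng = np.random.default_rng(0)
image = rng.random((20, 20))
# The template is a crop with a gain and offset change, as under new lighting.
template = 2.0 * image[5:10, 7:12] + 3.0
position, score = match_template(image, template)
```

Because both the patch and the template are mean-centred and normalised, the score is invariant to affine brightness changes, which partly explains why the tracker copes with some lighting variation but still needs manual checks when the appearance changes more drastically.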

It is worth noting that, while this process was enough to produce ground truth annotations for the test set, the method does not scale due to the need for manual verification and correction. While our aim here is to show the feasibility of learning physics in an unsupervised manner, such problems show that our deep tracker also has an applicative advantage compared to these traditional handcrafted approaches.

5 Results and Discussion

Implementation Details

For all networks trained on every dataset, weights were initialised using Xavier initialization [15]. The learning rate was progressively decreased from its initial value by a factor of 10 whenever no improvement was found over a fixed number of epochs (with a different patience for the synthetic datasets). Training was stopped when the loss did not decrease for a fixed number of consecutive epochs. Before processing images, we resized all dataset images to a fixed resolution to fit in the GPU memory. We used TensorFlow [1] on a single NVIDIA Titan X GPU for all the experiments.
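The training schedule described above can be sketched in plain Python. The initial learning rate and both patience values below are hypothetical placeholders chosen for the example, not the values used in the experiments, and the class name `PlateauSchedule` is illustrative.

```python
class PlateauSchedule:
    """Divide the learning rate by 10 on plateaus and signal early stopping."""

    def __init__(self, learning_rate, decay_patience, stop_patience, factor=0.1):
        self.learning_rate = learning_rate
        self.decay_patience = decay_patience  # epochs without gain before a decay
        self.stop_patience = stop_patience    # epochs without gain before stopping
        self.factor = factor
        self.best_loss = float("inf")
        self.epochs_since_best = 0

    def step(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
            if self.epochs_since_best % self.decay_patience == 0:
                self.learning_rate *= self.factor
        return self.epochs_since_best >= self.stop_patience

# Hypothetical settings: start at 1e-3, decay after 2 stale epochs, stop after 5.
schedule = PlateauSchedule(learning_rate=1e-3, decay_patience=2, stop_patience=5)
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
stop_flags = [schedule.step(loss) for loss in losses]
```

In this toy run the loss stalls after the second epoch, so the rate is divided by 10 twice before the stop signal is raised on the seventh epoch.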

5.1 Unsupervised Tracker

We first evaluate our unsupervised object detector and tracker and compare it against current state-of-the-art trackers. We report results in Table 1 against the following trackers: 1. Optical Flow Lucas-Kanade (OFLK) from the OpenCV library [6]; 2a. FlowNet2-simple, which computes pairwise flow fields using FlowNet2 [16] and follows the velocity vectors; 2b. FlowNet2-blob, where, after computing the flow fields with FlowNet2 [16], we update the positions as the centers of the blobs found in the flow field, falling back to 2a if no blob is detected; 3. LAB, a template tracker similar to the one in section 4 but without any manual corrections. Note that these methods need either manual initialization at the object positions (except for LAB) or templating, both of which require more work as the number and variety of objects grow. In addition to PoolR, BowlR, and HeightR from Roll4Real, we also consider two synthetic datasets from [11] in Fig. 4: BowlS for the ellipsoidal bowl with one or two balls and HeightS for the random height-fields. Fig. 4-left reports the mean and 99th-percentile pixel error of the extracted object positions against ground-truth, averaged over multiple runs of our experiments.

Figure 4: Tracker errors and Ablation study Left: Tracker errors on different datasets. The errors are consistently small across datasets and show that our tracker can perform well on a wide range of real situations. Right: Ablation study. We try different combinations of tracker losses on the BowlR dataset. ‘Const.’ indicates that we are predicting a constant point at the center of the image for reference. For left and right, position errors are reported in pixels. The number of balls in the datasets is appended to the name of the dataset.
| Method | PoolR | BowlR | HeightR | HeightR 2B. |
| --- | --- | --- | --- | --- |
| 1. Optical Flow Lucas-Kanade | 23.3 / 965 | 5.6 / 275 | 2.7 / 12.9 | 2.0 / 5.3 |
| 2a. FlowNet2-simple | 41.4 / 767 | 30.4 / 715 | 16.6 / 206 | - / - |
| 2b. FlowNet2-blob | 3.9 / 12.1 | 2.2 / 4.8 | 4.6 / 28.7 | - / - |
| 3. LAB w/o manual correction | 0.3 / 0.1 | 16.4 / 247 | 8.3 / 104 | 21.7 / 102 |
| 4. Ours | 1.9 / 0.2 | 4.1 / 0.5 | 3.3 / 0.5 | 3.4 / 1.2 |

Table 1: Tracker results across real datasets

The reported numbers are the average (left) and the variance (right) of the pixel error. All numbers refer to 128 × 128 images.

Even though the baseline trackers perform well in practice, they suffer from large variance. For example, OFLK went off-track 15% of the time on the BowlR dataset, 10% on HeightR, and 30% on PoolR. In contrast, ours never loses track of the object. The 99th percentile reported in Fig. 4 shows that the offset is almost constant, generally because the detection occurs on the edge of the objects. Overall, our method learns to track objects robustly in a diverse range of complex scenarios.

Importantly, since our tracker does not use any manual annotations it scales easily to larger synthetic datasets, multiple objects, and different object appearances within the same dataset by just providing more example data.

We also conducted an ablation study on the BowlR dataset to measure the impact of each loss term. Fig. 4-right shows that, while each loss contributes to the final results, the best performance is obtained when all the terms are combined.

Figure 5: Qualitative performance comparison for the various methods against ground-truth trajectories Top-to-bottom: two balls colliding on an ellipsoidal bowl; single ball colliding against the walls of a pool table; single ball rolling on an ellipsoidal bowl; single ball rolling on a complex height-field; and two balls rolling on a complex height-field. The top row is on synthetic data, while the other rows are on real data. The green ellipsoids in the last column show the variance of the predictions estimated by ProbNet at selected locations.

5.2 Unsupervised Physics Extrapolation

Supervised vs Unsupervised (Single Ball Synthetic Datasets).

We now compare training predictors using either ground-truth object positions or the output of the unsupervised tracker. All predictors observe only four frames as input (either positions or video frames), except VIN and the least-squares baselines, which use their own input lengths. All the networks were trained to predict positions up to the training horizon. Table 2 reports the average errors at the training horizon and beyond it, to measure the ability of the predictors to generalise beyond the training regime.

We see that the Net models (ProbNet, DispNet, PosNet) perform well using either ground-truth positions or the unsupervised tracker outputs (e.g. PosNet error for BowlS/HeightS is 2.9/6.4 supervised vs 4.9/6.9 unsupervised), whereas IFS does not handle the transition well (3.3/10.4 to 13.3/23.1), and Linear, Quadratic and VIN are not competitive. The latter result shows a clear advantage of tensor-based state representations over vector-based ones. This suggests that object positions are better modelled by a spatially distributed representation. IFS also seems very sensitive to defects in the supplied annotations: since its knowledge of the environment is very limited, error correction is very challenging for it.

With positions from simulator

| Method | Input | State | BowlS- = 20 | HeightS- = 20 |
| --- | --- | --- | --- | --- |
| Linear | 2D pos. | Exp. | | |
| Quadratic | 2D pos. | Exp. | | |
| IFS | 2D pos. | Exp. | | |
| VIN | Visual | Imp. | | |
| PosNet | Visual | Imp. | | |
| DispNet | Visual | Imp. | | |
| ProbNet | Visual | Imp. | (32.1) (54.0) | (9.5) (12.7) |

With positions from unsupervised tracker

| Method | Input | State | BowlS- = 20 | HeightS- = 20 |
| --- | --- | --- | --- | --- |
| IFS | 2D pos. | Exp. | | |
| VIN | Visual | Imp. | | |
| PosNet | Visual | Imp. | | |
| DispNet | Visual | Imp. | | |
| ProbNet | Visual | Imp. | (6.3) (20.6) | (8.3) (13.4) |

Table 2: Long term predictions compared on synthetic datasets with model trained with ground-truth from simulator All the models (except VIN, Linear, and Quadratic) are given four frames as input and trained to predict the first positions. We report the average pixel error, and the perplexity for the ProbNet model, at two different times. Perplexity, shown in brackets, is computed from the estimated posterior distribution. State indicates whether the carried-forward state is a physical quantity (Exp.) or an implicit vector or tensor (Imp.).

The main weakness of the Net models is that their performance degrades as prediction extends beyond the training horizon, whereas IFS generalizes better. At least ProbNet explicitly indicates that the model is uncertain when this occurs.

Synthetic vs Real (One Ball Datasets).

On real datasets (Table 3), the Net models uniformly outperform the others at both evaluated times, with errors comparable to the synthetic case. Note that the real datasets in Roll4Real are particularly challenging due to the non-idealities of the surface (e.g. the BowlR surface is slightly elastic and wobbles as the ball rolls).

| Method | PoolR- | HeightR- | BowlR- |
| --- | --- | --- | --- |
| ProbNet | (6.3) (11.3) | (5.8) (22.5) | (6.8) (13.8) |

Table 3: Long term predictions using one ball and real data The table has the same format as Table 2. All models are trained using the unsupervised tracker, input and state are the same as in Table 2, and we report pixel error (perplexity) at two different times.

One vs Multiple Balls (Real and Synthetic Datasets).

Finally, we move to cases where the balls are interacting with the environment and with each other through collisions. This is particularly challenging when no ground-truth is used, as multiple object tracking is much harder to achieve in an unsupervised setting than tracking a single object.

| Method | BowlS 2b.- | HeightR 2b.- |
| --- | --- | --- |
| ProbNet | (7.3) (13.7) | (7.9) (12.4) |

Table 4: Long term predictions using two balls on real and synthetic data Table layout and measures are the same as in Table 2. Models are trained with positions from the tracker, input and state are the same as in Table 2, and we report pixel error (perplexity) at two different times.

As shown in Table 4, the Net models still perform well. Due to memory limitations, models were trained for a slightly shorter time span; since the corresponding predictions are shorter term, their errors are a little lower than before. Overall, the results show that neither perfect ground-truth annotations nor a very large dataset is required to train a reliable physical extrapolator. Still, we noticed that collisions were difficult to predict in the HeightR dataset (see the bottom row of Fig. 5), probably because such events are rare during training. In contrast, this seems to be much better handled by the models in the synthetic dataset (first row of Fig. 5).

5.3 Unsupervised Physics Interpolation

As in [11], we also study the interpolation problem, considering their InterpNet configuration. We compare the latter to the extrapolation network DispNet trained over a longer horizon. InterpNet has the same architecture as DispNet, with the difference that, in addition to the first frames of the sequence, InterpNet takes as input the last video frame as well. The first extracted state is used to regress the first positions as well as the position at the final time, so that this state is explicitly encouraged to encode information about the last position of the object. In Table 5 and Table 6 we see that InterpNet manages to reduce the error in most cases. However, compared to the results in [11], InterpNet performs worse on the synthetic datasets, and estimation of the intermediate states seems to be more challenging. Our interpretation is that the imperfect nature of the training data creates several possible paths that this model is unable to disambiguate. Finally, we also noticed that the height-field datasets seem to be very challenging, as training for longer horizons did not reduce the error as much as it does on the bowl.

Table 5: Extrapolation vs. interpolation on one-ball datasets, synthetic and real. Models are trained with positions from the tracker. Pixel error at different times . Columns: PoolR (10, 20, 30); BowlS, HeightS, BowlR, and HeightR (10, 20, 30, 40 each).

Table 6: Extrapolation vs. interpolation on two-ball datasets, synthetic and real. Models are trained with positions from the tracker. Pixel error at different times . Columns: BowlS 2b and HeightR 2b (10, 20, 30 each).
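As a rough illustration of the measure reported in these tables, the sketch below computes a pixel error at selected time steps. It assumes the pixel error is the mean Euclidean distance between predicted and reference positions, and that time steps are 0-based indices into the trajectory; the function name `pixel_error_at` and the array layout are hypothetical.

```python
import numpy as np

def pixel_error_at(pred, truth, horizons):
    """Mean Euclidean distance in pixels at the selected time steps.

    pred, truth: arrays of shape (N, T, 2) holding N trajectories
    horizons:    iterable of time indices, e.g. (10, 20, 30)
    """
    # Per-trajectory, per-time-step distance between prediction and reference.
    dist = np.linalg.norm(pred - truth, axis=-1)   # shape (N, T)
    # Average over trajectories at each requested time step.
    return {t: float(dist[:, t].mean()) for t in horizons}

# Example: every prediction is off by (3, 4) pixels, so the error is 5.0.
truth = np.zeros((2, 41, 2))
pred = truth + np.array([3.0, 4.0])
errors = pixel_error_at(pred, truth, (10, 20, 30, 40))
```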

6 Conclusions

We presented a method that can learn to track physical objects such as balls rolling on complex terrains using only raw video sequences and no supervision. Combined with recent neural networks that can learn an implicit representation of physics, such a system is able to extrapolate object trajectories over time while accounting for object-environment and object-object interactions. To the best of our knowledge, this is the first time that learning long-term physics extrapolation without access to supervision or handcrafted simulators has been demonstrated. Through an extensive benchmark we also demonstrated the superiority of tensor-based state representations, which were able to produce satisfactory results on real data without the need for large datasets.

We also contributed a new dataset, Roll4Real, of real-life video sequences for complex scenarios such as balls rolling on pool tables, bowls, and height-fields, showing that all such methods are applicable to the real world. This data will be made publicly available.

In this work we used differently colored objects to make them distinguishable, which in practice is one of the main limitations of our work. We plan to address this issue by using same-colored objects and building a tracker trained to detect all objects at once, removing the need for iterative training.

Finally, we also plan to train the tracker and the extrapolator end-to-end, further improving tracking of multiple objects. We also aim to improve the generalisation of the predictors beyond the training regime; we believe that the key is to factor knowledge about the environment and the object dynamics, allowing the models to remember the former better over longer time spans.

Acknowledgements. The authors would like to gratefully acknowledge the support of ERC 677195-IDIU and ERC SmartGeometry StG-2013-335373 grants.


  • [1] Abadi, M., et al.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015), software available from tensorflow.org
  • [2] Agrawal, P., et al.: Learning to Poke by Poking: Experiential Learning of Intuitive Physics. In: Proc. NIPS. pp. 5074–5082 (2016)
  • [3] Battaglia, P., et al.: Interaction networks for learning about objects, relations and physics. In: Proc. NIPS. pp. 4502–4510 (2016)
  • [4] Battaglia, P., Hamrick, J., Tenenbaum, J.: Simulation as an engine of physical scene understanding. PNAS 110(45), 18327–18332 (2013)
  • [5] Bhattacharyya, A., et al.: Long-term image boundary prediction. In: Thirty-Second AAAI Conference on Artificial Intelligence. AAAI (2018)
  • [6] Bradski, G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000)
  • [7] Chang, M.B., et al.: A compositional object-based approach to learning physical dynamics. In: Proc. ICLR (2017)
  • [8] Chiappa, S., et al.: Recurrent environment simulators (2017)
  • [9] Denil, M., et al.: Learning to perform physics experiments via deep reinforcement learning. Deep Reinforcement Learning Workshop, NIPS (2016)
  • [10] Ehrhardt, S., et al.: Learning A Physical Long-term Predictor. arXiv e-prints arXiv:1703.00247 (Mar 2017)
  • [11] Ehrhardt, S., et al.: Learning to Represent Mechanics via Long-term Extrapolation and Interpolation. arXiv preprint arXiv:1706.02179 (Jun 2017)
  • [12] Eslami, S.A., et al.: Attend, infer, repeat: Fast scene understanding with generative models. In: Advances in Neural Information Processing Systems. pp. 3225–3233 (2016)
  • [13] Finn, C., et al.: Deep spatial autoencoders for visuomotor learning. In: Robotics and Automation (ICRA), 2016 IEEE International Conference on. pp. 512–519. IEEE (2016)
  • [14] Fragkiadaki, K., et al.: Learning visual predictive models of physics for playing billiards. In: Proc. NIPS (2016)
  • [15] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. pp. 249–256 (2010)
  • [16] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks
  • [17] Kansky, K., et al.: Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In: International Conference on Machine Learning. pp. 1809–1818 (2017)
  • [18] Ladický, et al.: Data-driven fluid simulations using regression forests. ACM Trans. on Graphics (TOG) 34(6), 199 (2015)
  • [19] Lee, A.X., et al.: Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523 (2018)
  • [20] Lerer, A., Gross, S., Fergus, R.: Learning physical intuition of block towers by example. In: Proceedings of the 33rd International Conference on Machine Learning - Volume 48. pp. 430–438 (2016)
  • [21] Li, W., Leonardis, A., Fritz, M.: Visual stability prediction and its application to manipulation. AAAI (2017)
  • [22] Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper into the future of semantic segmentation. ICCV (2017)
  • [23] Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: unsupervised learning using temporal order verification. In: European Conference on Computer Vision. pp. 527–544. Springer (2016)
  • [24] Monszpart, A., Thuerey, N., Mitra, N.: SMASH: Physics-guided Reconstruction of Collisions from Videos. ACM Trans. on Graphics (TOG) (2016)
  • [25] Mottaghi, R., et al.: Newtonian scene understanding: Unfolding the dynamics of objects in static images. In: IEEE CVPR (2016)
  • [26] Mrowca, D., et al.: Flexible Neural Representation for Physics Prediction. ArXiv e-prints (2018)
  • [27] Novotny, D., et al.: Self-supervised learning of geometrically stable features through probabilistic introspection (2018)
  • [28] Oh, J., et al.: Action-conditional video prediction using deep networks in atari games. In: Advances in Neural Information Processing Systems. pp. 2863–2871 (2015)
  • [29] Ondruska, P., Posner, I.: Deep tracking: Seeing beyond seeing using recurrent neural networks. In: Proc. AAAI (2016)
  • [30] Riochet, R., et al.: IntPhys: A Framework and Benchmark for Visual Intuitive Physics Reasoning. ArXiv e-prints (2018)
  • [31] Sanborn, A.N., Mansinghka, V.K., Griffiths, T.L.: Reconciling intuitive physics and Newtonian mechanics for colliding objects. Psychological Review 120(2), 411 (2013)
  • [32] Sanchez-Gonzalez, A., et al.: Graph networks as learnable physics engines for inference and control (2018)
  • [33] Stewart, R., Ermon, S.: Label-free supervision of neural networks with physics and domain knowledge. In: AAAI. pp. 2576–2582 (2017)
  • [34] Thewlis, J., Bilen, H., Vedaldi, A.: Unsupervised learning of object frames by dense equivariant image labelling. In: Advances in Neural Information Processing Systems (NIPS). pp. 844–855 (2017)
  • [35] Tompson, J., et al.: Accelerating Eulerian Fluid Simulation With Convolutional Networks. ArXiv e-print arXiv:1607.03597 (2016)
  • [36] Watters, N., et al.: Visual interaction networks: Learning a physics simulator from video. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 4542–4550. Curran Associates, Inc. (2017)
  • [37] Wu, J., et al.: Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In: Proc. NIPS. pp. 127–135 (2015)
  • [38] Wu, J., et al.: Physics 101: Learning physical object properties from unlabeled videos. In: Proc. BMVC (2016)
  • [39] Wu, J., et al.: Learning to see physics via visual de-animation. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems (NIPS) 30, pp. 153–164. Curran Associates, Inc. (2017)
  • [40] Wu, J., et al.: Learning to see physics via visual de-animation. In: Proc. NIPS (2017)