Neural Allocentric Intuitive Physics Prediction from Real Videos

Zhihua Wang, et al. (09/07/2018)

Humans are able to make rich predictions about the future dynamics of physical objects from a glance. On the other hand, most existing computer vision approaches require strong assumptions about the underlying system, ad-hoc modeling, or annotated datasets, to carry out even simple predictions. To tackle this gap, we propose a new perspective on the problem of learning intuitive physics that is inspired by the spatial memory representation of objects and spaces in human brains, in particular the co-existence of egocentric and allocentric spatial representations. We present a generic framework that learns a layered representation of the physical world, using a cascade of invertible modules. In this framework, real images are first converted to a synthetic domain representation that reduces complexity arising from lighting and texture. Then, an allocentric viewpoint transformer removes viewpoint complexity by projecting images to a canonical view. Finally, a novel Recurrent Latent Variation Network (RLVN) architecture learns the dynamics of the objects interacting with the environment and predicts future motion, leveraging the availability of unlimited synthetic simulations. Predicted frames are then projected back to the original camera view and translated back to the real world domain. Experimental results show the ability of the framework to consistently and accurately predict several frames in the future and the ability to adapt to real images.


Introduction

Figure 1: Overview of the framework.

Humans have certain expectations about the physical world and learn to estimate the mass and velocity of objects at an early stage of development through observation [Spelke and Kinzler2007]. The problem of learning the intuitive dynamics of objects from data is often referred to as learning intuitive physics. Applications of intuitive physics are especially promising in the field of robotics, including manipulation, navigation and co-working scenarios. A robot equipped with intuitive physics understanding is able to navigate the environment and perform nuanced actions, such as carrying a cup of coffee without spilling it or catching a falling tool.

At a more generic level, physical understanding is a core domain of human knowledge and among the earliest topics in artificial intelligence. However, devising systems for physical reasoning that can learn from a few real, unlabeled images is still an open problem. One major challenge is the limited amount of data collected from the real world, which is generally sparse and lacks annotations. In addition, the quality of real-world datasets is usually affected by factors such as illumination, occlusion and perspective. Not only can objects move within the scene, but the observer can be moving as well, which further complicates prediction. To learn intuitive physics with machines, we therefore seek inspiration from the human brain.

At the core of physical reasoning lies a spatial representation of objects and the environment. In the human brain, spatial representation is necessary for navigating through known or unknown environments, locating objects and interacting with them. The idea, introduced in the mid 20th century (Figure 3), is that information in these cognitive maps is represented with two types of frames (the term frame here bears a more relaxed meaning than geometric reference frames). The egocentric frame represents information from the perspective of the observer in the environment, while the allocentric frame represents the spatial relationships of objects relative to each other. The egocentric representation focuses on subject-to-object relationships; it is view-dependent and generally believed to be learnt first during early development. The allocentric frame is based on world (global) coordinates and is believed to be acquired later in life [Colombo et al.2017]. These systems have already informed approaches in the computer vision community for tasks such as mapping [Henriques and Vedaldi2018].

Two models of spatial cognition have been proposed. In the two-system model, an allocentric representation of spatial relationships between objects is stored in the long-term memory, while a self-reference system keeps track of egocentric relations to each object. In the three-system model, a dynamic egocentric system stores relationships between the observer and each object in its neighborhood, while a second system maintains allocentric representation in the long-term memory and a third one stores visual snapshots of the environment at different times [Avraamides and Kelly2008].

By taking inspiration from these two models, we propose a framework (Figure 1) for learning the intuitive dynamics of objects interacting with the environment. It starts from egocentric observations, warps them into an allocentric view of the scene, and learns the dynamics using a recurrent latent variation model. The predictor makes predictions about future observations in the allocentric frame, which are then translated back to the egocentric view of the observer (Figure 1).

In order to learn to predict future frames, we use a realistic physics simulator to generate synthetic observations (the framework is data-driven and hence agnostic to the simulator used). Using synthetic data has two advantages: the egocentric to allocentric warping module can be trained in a supervised way, and real-world images can be abstracted into synthetic images, which removes most of the lighting, color and texture artifacts that are irrelevant to predicting object dynamics.

The main contributions of this work are the following: a neuro-inspired framework that tackles the problem of predicting the future state of objects from real video inputs with arbitrary viewpoints, by projecting the scene to and from an allocentric representation; and a Recurrent Latent Variation Network (RLVN), based on Convolutional LSTM networks (ConvLSTMs) [Xingjian et al.2015], able to predict the dynamics of objects interacting among themselves and with the surrounding environment, with improved long-term prediction capability compared to other state-of-the-art approaches. In particular, we demonstrate the performance of this framework on the problem of predicting the motion of billiard balls.

Related Work

Prediction of the dynamics of physical objects lies at the intersection between two bodies of work: future frame prediction, which focuses on global frames, and learning intuitive physics, which often focuses on object-based representations.

Learning intuitive physics

In an early work [Wu et al.2015], the authors proposed to use deep generative models for learning the effect of gravity and friction on rolling objects by inverting a physics engine, in order to estimate the dynamics from observations. Deep neural networks have later been used for predicting the stability of tower blocks [Lerer, Gross, and Fergus2016], the motion of billiard balls [Fragkiadaki et al.2016] and other object dynamics [Mottaghi et al.2016].

Differentiable physics engines have been proposed in [Chang et al.2016].

Applications of intuitive physics to robotics have been recently explored in [Byravan and Fox2017, Wang et al.2018b], for predicting rigid and non-rigid body motion of objects subjected to forces, or in [Li, Leonardis, and Fritz2017] for stability prediction in stacking blocks.

Recently, [Wu et al.2017] proposed to decouple the prediction problem by learning an abstract physical representation of the world with a perception network, and then feeding this representation to a physics engine and a rendering engine to generate visual data, which can then be matched to the visual input. One advantage of such approaches is that they generate very sharp predictions. A disadvantage is that different simulation engines and renderers are required for different tasks, which limits generalization.

Interaction Networks [Battaglia et al.2016] combine a relational reasoning network and an object reasoning network to predict object dynamics in a similar fashion to simulators. By summing the vector outputs of all object interactions, a global interaction vector is obtained, which is used together with object features to predict the future velocity of each object. Visual Interaction Networks [Watters et al.2017] learn to predict future trajectories of objects in a physical system from video frames by jointly training a perceptual front-end based on convolutional networks and a dynamics predictor based on interaction networks. In [Ehrhardt et al.2017], the focus is on learning the motion of balls on non-homogeneous surfaces.

More recently, Relational-NEM [van Steenkiste et al.2018] proposed a compositional approach for unsupervised learning of the dynamics of multiple bouncing balls. The approach focuses on interactions between multiple objects, so an estimate of the boundary conditions of the problem is required (e.g., the number of objects). Moreover, noise is injected into the input images to help learn object grouping; tuning the injected noise can be an issue when dealing with small objects. Pred-RNN++ [Wang et al.2018a] proposes causal LSTM cells and addresses the problem of balancing long-term prediction against the induced difficulty in back-propagation with a Gradient Highway architecture, which provides alternative routes for gradient flow.

In [Bhattacharyya et al.2018], the authors propose a context-based model for predicting image boundaries in future frames, and apply it to the problem of predicting the motion of billiard balls, among other scenarios.

Predicting future video frames

Among the first works on future frame prediction, [Mathieu, Couprie, and LeCun2015] proposed a CNN architecture with adversarial training, while [Srivastava, Mansimov, and Salakhudinov2015] uses multi-layer LSTMs for unsupervised learning. In [Xue et al.2016], the authors propose an architecture called Cross Convolutional Networks, which encodes image and motion information separately, as feature maps and convolutional kernels respectively.

Predictive Neural Network (PredNet) architectures [Lotter, Kreiman, and Cox2016] are inspired by the concept of predictive coding from a neuroscience perspective. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and forwarding deviations from those predictions to successive network layers.

Method

The framework is composed of three modules: the domain transfer module works at the lowest level and translates image appearance between the real world and a simplified synthetic domain; the egocentric to allocentric transformer works at an intermediate level and maps egocentric images to a canonical allocentric view; the physics predictor module works at the physical level and learns the properties of the objects and the scene. An overview of the framework is shown in Figure 1.

In the following, we first describe the domain transfer module and the allocentric transformer, which take us from real-world data to the simulated domain, and then present the physics predictor network.
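As a minimal sketch of how the cascade of invertible modules composes at inference time (module names and signatures here are illustrative, not the actual implementation), the pipeline can be expressed as:

```python
from typing import Callable, List, Sequence, Tuple
import numpy as np

Frame = np.ndarray  # H x W x C image


def predict_future_frames(
    real_frames: Sequence[Frame],
    real_to_synth: Callable[[Frame], Frame],          # domain transfer, real -> synthetic
    ego_to_allo: Callable[[Sequence[Frame]], Tuple[List[Frame], np.ndarray]],
    physics_predictor: Callable[[Sequence[Frame], int], List[Frame]],
    allo_to_ego: Callable[[Frame, np.ndarray], Frame],
    synth_to_real: Callable[[Frame], Frame],          # domain transfer, synthetic -> real
    n_future: int = 10,
) -> List[Frame]:
    # 1. Strip real-world appearance (lighting, texture) via domain transfer.
    synth = [real_to_synth(f) for f in real_frames]
    # 2. Warp to the canonical allocentric view; keep the warp to invert it later.
    allo, warp = ego_to_allo(synth)
    # 3. Predict future frames in the simplified allocentric domain.
    future_allo = physics_predictor(allo, n_future)
    # 4. Map predictions back to the original viewpoint and the real domain.
    return [synth_to_real(allo_to_ego(f, warp)) for f in future_allo]
```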

Real-world to Synthetic Data Domain Transfer

In order to transfer between the real-image and synthetic-image domains, we use unpaired images to carry out image-to-image translation. The objective is to learn mapping functions $G: X \rightarrow Y$ and $F: Y \rightarrow X$ between the real domain $X$ and the synthetic domain $Y$, given two sets of unpaired training samples $\{x_i\} \subset X$ and $\{y_j\} \subset Y$. Two discriminators $D_X$ and $D_Y$ classify the images produced by $F$ and $G$, respectively, as real or fake by learning a perception-level representation of the inputs (Figure 2).

A cycle-consistency loss term ([Zhu et al.2017]) is added in order to add structure to the adversarial losses $\mathcal{L}_{GAN}(G, D_Y)$ and $\mathcal{L}_{GAN}(F, D_X)$:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim p_{data}(y)}\left[\lVert G(F(y)) - y \rVert_1\right]. \quad (1)$$

Similar to [Bousmalis et al.2018], in order to anchor the translated images in the synthetic domain at a semantic level that preserves object position and identity, it is necessary to add auxiliary loss functions. We add an auxiliary mask loss $\mathcal{L}_{mask}$, but we use a different approach compared to [Bousmalis et al.2018]. In particular, we extract object segmentation masks from the synthetic domain images. We let the generators output a segmentation mask in addition to the domain-adapted output, and we compute an L2 loss against the ground-truth mask. As a result, the mask loss informs both generators, which share the same latent representation, enforcing the semantics of the image (i.e., the spatial positions of objects) to be preserved. The advantage of this approach is that semantic segmentation masks are natively available from the simulator, and are sufficient for the semantic consistency loss to be back-propagated through the whole network.

Thus, the total domain transfer loss is:

$$\mathcal{L} = \mathcal{L}_{GAN}(G, D_Y) + \mathcal{L}_{GAN}(F, D_X) + \lambda_{cyc}\,\mathcal{L}_{cyc}(G, F) + \lambda_{mask}\,\mathcal{L}_{mask}. \quad (2)$$
Figure 2: The domain transfer module. Bottom: mapping functions between domains. Top: detailed view of F.
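A minimal NumPy sketch of the loss terms described above, assuming generators G and F are given as callables and the adversarial terms are computed elsewhere; the weighting factors lam_cyc and lam_mask are illustrative placeholders, not values from the paper:

```python
import numpy as np


def l1(a, b):
    return float(np.mean(np.abs(a - b)))


def l2(a, b):
    return float(np.mean((a - b) ** 2))


def cycle_consistency_loss(x, y, G, F):
    """L_cyc: F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)


def mask_loss(pred_mask, gt_mask):
    """Auxiliary semantic-consistency term: L2 between the mask predicted by a
    generator and the ground-truth segmentation mask from the simulator."""
    return l2(pred_mask, gt_mask)


def total_domain_transfer_loss(adv_G, adv_F, cyc, mask, lam_cyc=10.0, lam_mask=1.0):
    # Adversarial terms + weighted cycle and mask terms (weights are assumptions).
    return adv_G + adv_F + lam_cyc * cyc + lam_mask * mask
```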

Egocentric to Allocentric Viewpoint Transform

In order to make our framework invariant to the camera perspective of the input video, we use a Spatial Transformer Network (STN) [Jaderberg et al.2015] as a learnable image warping module that warps the input images to a canonical view of the scene. In the billiards example, we warp the input images to a bird's-eye orthographic view of the table. This removes perspective artifacts and provides the maximum amount of information to the downstream network.

Figure 3: Spatial memory representations.
Figure 4: Egocentric to allocentric view transform module.

STNs are differentiable modules that allow the spatial manipulation of data within the network, giving neural networks the ability to spatially transform feature maps. The action of the spatial transformer module is conditioned on individual data samples, with the appropriate behaviour learnt during training for the task in question. Unlike pooling layers, whose receptive fields are fixed and local, the spatial transformer module is a dynamic mechanism that can actively spatially transform an image or a feature map by producing an appropriate transformation for each input sample. The transformation is performed on the entire feature map (non-locally) and can include scaling, cropping, rotations, as well as non-rigid deformations. This allows networks that include spatial transformers not only to select the regions of an image that are most relevant (implementing an attention mechanism), but also to transform those regions to a canonical, expected view, simplifying inference in subsequent layers. Spatial transformers can be trained with standard back-propagation, allowing for end-to-end training of the models into which they are injected.

In [Lin and Lucey2017] the structure is modified such that the network propagates warp parameters instead of warped images directly. This solves the boundary effect problem of STNs and enables a natural recurrent implementation by composing a series of warp transformations. The warp operation can represent any transformation (e.g., affine, perspective).

A spatial transformer learns a warping $p$ of an input image $I$, conditioned on the image itself:

$$p = f(I \circ p_0), \quad (3)$$

where $p_0$ is the identity warp and $I \circ p$ denotes the image warped by $p$. In the original STN this is achieved by a localisation network $f$ that outputs a transformation $\theta$, a parametrised sampling grid, and a differentiable image sampling module. $f$ corresponds to a linear regressor $W$ plus a bias term $b$, such that:

$$p = W \cdot I + b. \quad (4)$$

The egocentric to allocentric transformation module is shown in Figure 4.
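The sketch below illustrates, in NumPy, the two sampling steps such a module performs (grid generation for a 2x3 affine transform and bilinear sampling); in the actual module the transform parameters are regressed by the localisation network and learnt end-to-end, so this is only an illustration of the warping operation, not the trained transformer:

```python
import numpy as np


def affine_grid(theta, out_h, out_w):
    """Build a sampling grid for a 2x3 affine warp, mapping normalised target
    coordinates in [-1, 1]^2 to source coordinates (as in the original STN)."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    tgt = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # (H, W, 3) homogeneous coords
    return tgt @ theta.T                                   # (H, W, 2) source coords


def bilinear_sample(img, grid):
    """Sample a grayscale image at normalised grid coords with bilinear interpolation."""
    h, w = img.shape[:2]
    x = (grid[..., 0] + 1.0) * 0.5 * (w - 1)
    y = (grid[..., 1] + 1.0) * 0.5 * (h - 1)
    x0f, y0f = np.floor(x), np.floor(y)
    wx, wy = x - x0f, y - y0f                              # fractional offsets
    x0 = np.clip(x0f.astype(int), 0, w - 1)
    x1 = np.clip(x0f.astype(int) + 1, 0, w - 1)
    y0 = np.clip(y0f.astype(int), 0, h - 1)
    y1 = np.clip(y0f.astype(int) + 1, 0, h - 1)
    return ((1 - wx) * (1 - wy) * img[y0, x0] + (1 - wx) * wy * img[y1, x0]
            + wx * (1 - wy) * img[y0, x1] + wx * wy * img[y1, x1])


# Example: the identity warp approximately reproduces the input image.
theta_identity = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])
img = np.random.rand(64, 64)
warped = bilinear_sample(img, affine_grid(theta_identity, 64, 64))
```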

Recurrent Prediction of Physical Interactions

To carry out the prediction of physical interactions, we propose the Recurrent Latent Variation Network (RLVN). The model is built on Convolutional LSTM networks, which extend traditional LSTMs with convolutional structures in the input-to-state and state-to-state transitions. We decompose the recurrent model into three components: encoder, latent variation and decoder.

At each time step $t$, the encoder $E$ takes an image $x_t$ as input and produces a dense representation. Then, conditioned on this dense representation, a latent distribution is constructed to capture the physical variation $z_t$. The decoder combines the information from the encoder, via skip-connections, with the latent variation $z_t$ to generate the predicted image $\hat{x}_{t+1}$ through up-convolutions. We employ variational inference to learn the latent variable model. The variational lower bound can be derived as:

$$\log p_\theta(x_{t+1} \mid x_{\le t}) \ge \mathbb{E}_{q_\phi(z_t \mid x_{\le t+1})}\left[\log p_\theta(x_{t+1} \mid x_{\le t}, z_t)\right] - D_{KL}\left(q_\phi(z_t \mid x_{\le t+1}) \,\|\, p_\theta(z_t \mid x_{\le t})\right), \quad (5)$$

where $x_t$ is the image at the $t$-th time step and $z_t$ is the latent variation. The generative distribution is defined as a parameterised diagonal Gaussian $p_\theta(z_t \mid x_{\le t}) = \mathcal{N}(\mu_t, \sigma_t^2 I)$, where $\mu_t$ and $\sigma_t$ are generated by the encoder $E$. Similarly, the variational distribution is constructed as $q_\phi(z_t \mid x_{\le t+1}) = \mathcal{N}(\mu'_t, \sigma'^2_t I)$ in order to approximate the posterior $p(z_t \mid x_{\le t+1})$, where $\mu'_t$ and $\sigma'_t$ are generated from both the input $x_{\le t}$ and the observation $x_{t+1}$. Both the generative distribution and the variational distribution are learned jointly while optimising the variational lower bound.

Hence, during variational inference, we apply the reparameterisation trick [Kingma and Welling2014] by sampling $\epsilon \sim \mathcal{N}(0, I)$, which yields $z_t = \mu'_t + \sigma'_t \odot \epsilon$. The estimated lower bound can be derived as:

$$\tilde{\mathcal{L}}(\theta, \phi) = \frac{1}{L}\sum_{l=1}^{L} \log p_\theta\!\left(x_{t+1} \mid x_{\le t}, z_t^{(l)}\right) - D_{KL}\left(q_\phi(z_t \mid x_{\le t+1}) \,\|\, p_\theta(z_t \mid x_{\le t})\right), \qquad z_t^{(l)} = \mu'_t + \sigma'_t \odot \epsilon^{(l)}, \quad (6)$$

where the Kullback-Leibler divergence term is integrated as a Gaussian KLD, and the gradients can be directly constructed and back-propagated through the neural network.
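A short sketch of the two ingredients discussed above, the reparameterised sampling of the latent variation and the closed-form KL divergence between two diagonal Gaussians (variable names are illustrative):

```python
import numpy as np


def reparameterise(mu, log_var, rng=np.random):
    """Sample z = mu + sigma * eps with eps ~ N(0, I); in an autodiff framework
    this keeps the sample differentiable w.r.t. mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def gaussian_kld(mu_q, log_var_q, mu_p, log_var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims."""
    return 0.5 * np.sum(
        log_var_p - log_var_q
        + (np.exp(log_var_q) + (mu_q - mu_p) ** 2) / np.exp(log_var_p)
        - 1.0
    )
```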

The combination of variational inference and the U-Net-shaped encoder-decoder allows the network to learn expressive representations of the scenes. Intuitively, by decomposing the latent variation from the network, we place an inductive bias in the model that separates the learning of scene/object appearance from the learning of dynamic interactions (including the position, velocity, mass and friction of the objects). In this way, the U-Net (shown in Figure 5) is encouraged to learn the scene/object appearance as a deterministic representation, while the latent variation captures the dynamic interactions as a stochastic representation. The decoder can thus construct the predicted images by combining the two representations.

Compared to the deterministic counterpart, which has no latent variation (i.e., $z_t$ is not drawn from a latent distribution but is generated directly by the encoder), our framework has a better capacity for predicting complex dynamic physical interactions. In addition, the deterministic model overfits the dataset very quickly: it gradually ignores the prediction of dynamic interactions and focuses only on learning the scene/object appearance.
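A schematic tf.keras sketch of the encoder / latent-variation / decoder decomposition described above is shown below; layer sizes, the latent dimensionality and the skip-connection pattern are placeholders and do not reproduce the exact architecture of Table 1:

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_rlvn_sketch(seq_len=10, size=64, latent_dim=32):
    """Schematic encoder / latent variation / decoder model over image sequences."""
    frames = tf.keras.Input(shape=(seq_len, size, size, 1))

    # Encoder: convolutions shared across time steps (dense representation per frame).
    e1 = layers.TimeDistributed(
        layers.Conv2D(32, 5, strides=2, padding="same", activation="relu"))(frames)
    e2 = layers.TimeDistributed(
        layers.Conv2D(64, 5, strides=2, padding="same", activation="relu"))(e1)

    # Recurrent core: a ConvLSTM propagates the representation through time.
    h = layers.ConvLSTM2D(64, 3, padding="same", return_sequences=True)(e2)

    # Latent variation: parameters of a diagonal Gaussian, sampled with the
    # reparameterisation trick.
    mu = layers.TimeDistributed(layers.Conv2D(latent_dim, 3, padding="same"))(h)
    log_var = layers.TimeDistributed(layers.Conv2D(latent_dim, 3, padding="same"))(h)
    z = layers.Lambda(
        lambda t: t[0] + tf.exp(0.5 * t[1]) * tf.random.normal(tf.shape(t[0]))
    )([mu, log_var])

    # Decoder: up-convolutions combining the deterministic recurrent features with
    # the stochastic latent variation and an encoder skip connection.
    d1 = layers.TimeDistributed(
        layers.Conv2DTranspose(32, 5, strides=2, padding="same", activation="relu"))(
        layers.Concatenate()([h, z]))
    out = layers.TimeDistributed(
        layers.Conv2DTranspose(1, 5, strides=2, padding="same", activation="sigmoid"))(
        layers.Concatenate()([d1, e1]))

    return tf.keras.Model(frames, [out, mu, log_var])
```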

Figure 5: The RLVN physics predictor module.
Domain transfer module
  Encoder layer 1: Conv., stride, ReLU activ.
  Encoder layers 2-3: Conv., stride, ReLU activ.
  Encoder layers 4-9: Residual
  Decoder layers 10-11: Upconv., stride, ReLU activ.
  Decoder layer 12: Conv., stride, ReLU activ.
  Discriminator layers 1-4: Conv., stride, ReLU activ.
View transform module (x4 recursion)
  Layers 1-2: Conv., stride, ReLU activ.
  Layer 3: FC 48
  Layer 4: FC 8
  Layer 5: Warp op. layer 1
RLVN predictor
  Encoder layer 1: Conv., stride, ReLU activ.
  Encoder layer 2: Conv., stride, ReLU activ.
  Encoder layer 7: FC 1000
  Encoder layer 8: FC 400, FC 400
  Encoder layer 9: FC 2048
  Encoder layer 10: ConvLSTM
  Decoder layer 10: Deconv., stride, CONCAT
  Decoder layer 11: Deconv., stride, CONCAT
  Conv layers 1-8: same as Encoder layers 1-8
Table 1: Implementation details for network modules.

Experimental Results

We first validate our proposed physics predictor by comparing it to different state-of-the-art baselines on synthetic billiard videos. The billiard scenario is ideal for evaluating long-term prediction, since it is a chaotic system; even non-skilled humans have difficulty making medium-term predictions. We then test the complete framework in a more complex real scenario: real billiard videos captured from multiple camera positions. The network was implemented using TensorFlow. The implementation details are reported in Table 1.

Data generation. Simplified 2D billiard-like bouncing-ball scenarios are a common benchmark for physics prediction [Fragkiadaki et al.2016, van Steenkiste et al.2018, Bhattacharyya et al.2018]. For our experiments we generate a more realistic 3D dataset using Blender, with Bullet as the underlying physics engine. In each video, four balls of similar mass and size are placed at random positions with random velocities on a billiard table. The balls and billiard table behave realistically, including friction and restitution forces. Each video is composed of 20 frames. We train the network on the first 10 frames and predict the following 10, evaluating the predictions against the true frames. We generate 10k video episodes for training and 1k for testing.

We then capture several real billiard videos from different camera angles and under different lighting conditions, with four balls of the same color. The videos are captured from three different viewpoints, and the video sequences are manually cut in order to remove players occluding the image. The longer video sequences are then segmented into sequences of 20 frames.
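A minimal sketch of how an episode (synthetic, or a 20-frame segment of a real video) is split into the 10 conditioning frames and the 10 prediction targets used throughout the experiments (function names are illustrative):

```python
import numpy as np


def segment_video(frames, episode_len=20):
    """Cut a long video into non-overlapping episodes of episode_len frames each."""
    frames = np.asarray(frames)
    n_episodes = len(frames) // episode_len
    return [frames[i * episode_len:(i + 1) * episode_len] for i in range(n_episodes)]


def split_episode(episode, n_context=10, n_target=10):
    """Return the observed frames fed to the network and the future frames to predict."""
    assert len(episode) >= n_context + n_target
    return episode[:n_context], episode[n_context:n_context + n_target]
```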

Figure 6: Comparison of our approach to PredNet, R-NEM and Pred-RNN++ over 10 prediction steps. For each method the first line shows the ground-truth and the second line shows the predicted sequence. All networks were fed a sequence of 10 steps in order to predict the next 10. Sequences on the left: best results; sequences on the right: worst results.

Validating the physics predictor

We explore how our predictor learns the physical object state (position, velocity, mass, and friction), and we compare its ability to predict long sequences against three recent state-of-the-art baselines (PredNet [Lotter, Kreiman, and Cox2016], Relational-NEM [van Steenkiste et al.2018] and Pred-RNN++ [Wang et al.2018a]) on the synthetic billiard dataset. All networks are trained on 64×64 grayscale images with a batch size of 8, with the exception of Relational-NEM, which was trained with binarized inputs as in the original implementation. Given a sequence of 10 frames, we predict the next 10 and compare against the ground truth. PredNet was trained for next-frame prediction and successively fine-tuned on full sequences of 20 frames.

Figure 6 shows best and worst qualitative results on random test sequences, while Figure 7 reports the Intersection Over Union (IOU) and Binary Cross-Entropy (BCE) scores for the four approaches over 10 future predicted time steps.

The IOU score is defined as:

$$\mathrm{IOU} = \frac{\sum_i \mathbb{1}[\hat{y}_i \ge \tau]\,\mathbb{1}[y_i \ge \tau]}{\sum_i \left(\mathbb{1}[\hat{y}_i \ge \tau] + \mathbb{1}[y_i \ge \tau] - \mathbb{1}[\hat{y}_i \ge \tau]\,\mathbb{1}[y_i \ge \tau]\right)},$$

where $\mathbb{1}[\cdot]$ is the indicator function, $\hat{y}_i$ is the predicted value at pixel $i$, $y_i$ is the corresponding true value, and $\tau$ is a binarization threshold. In our experiments, $\tau$ is set to 0.8.
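A small NumPy sketch of this thresholded IOU (an illustration of the metric as reconstructed above, not the authors' evaluation code):

```python
import numpy as np


def thresholded_iou(pred, target, tau=0.8):
    """IOU between predicted and ground-truth frames after binarising at tau."""
    p = pred >= tau
    t = target >= tau
    union = np.logical_or(p, t).sum()
    if union == 0:
        return 1.0  # both frames empty above the threshold
    return np.logical_and(p, t).sum() / union
```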

It can be seen that PredNet is able to predict future frames with reasonable accuracy up to 4-5 steps into the future, in line with the results of [Lotter, Kreiman, and Cox2016]. Relational-NEM generates accurate predictions for most sequences, but fails in a few cases where the network does not learn disentangled representations for the four balls. Pred-RNN++ is able to generate accurate predictions for most sequences on our dataset. As expected, the prediction accuracy of all methods decreases over time, with PredNet degrading the fastest.

Our method shows consistently more accurate predictions over time for all sequences compared to the competing approaches, with a higher IOU and an accuracy degradation over time comparable to Pred-RNN++.

Figure 7: Average IOU and BCE scores for the four approaches over 10 prediction steps.
Figure 8: Results for the multi-view camera scenario. Given a sequence of 10 input frames, the network predicts 10 frames in the future. First row: predicted allocentric view; second row: predicted view in the original egocentric frame; third row: final prediction in the real domain; last row: ground truth. First column: last input frame; following three columns: predicted time steps.

Billiard Tables with Multiple Camera Views

We now evaluate the complete framework on a series of real videos from different viewpoints. The camera remains static for the duration of each video sequence. We first train the domain transfer module on an unpaired training set of 40k real and synthetic samples. We then train the allocentric viewpoint transform module on a synthetic dataset of 20k samples. Finally, the physics predictor is trained on a synthetic dataset of 10k video sequences, for a total of 200k samples.

In Figure 8 we show qualitative results on sequences with different out-of-sample viewpoints. It is possible to see how the images are projected to the allocentric representation and the predictions are then projected back to realistic images. It should be noted that occlusions (e.g., the player's hand occluding part of the table) can have a negative effect on the predictions by propagating through the network. Figure 9 shows some of the effects of failures in the domain transfer module on the predicted images.

Figure 9: Examples of failed predictions due to foreign objects. Top: a hand with a cue stick (top-left) causes a hallucinated object in the synthetic and real predicted images (top-middle and top-right). Bottom: the tip of the cue stick is incorrectly recognized as a ball and causes the prediction to fail.

Conclusion

Generative models for intuitive physics understanding are showing promising performance, but they are still mostly limited to toy examples and require many thousands of examples in order to transfer to real-world scenarios or to deal with different viewpoints. We proposed a neuro-inspired framework that can learn from fewer examples by projecting arbitrary viewpoints to a canonical allocentric view of the scene and to a canonical image domain. The domain transformation from real images to canonical images is learned from unpaired image sets, while the projection from arbitrary to canonical views can be trained with pairs of synthetic images only, making the module independent of labeled real data. In addition, we proposed RLVN, a novel physics predictor that decomposes the learning of the scene representation from the learning of object dynamics, which enables robust prediction of physical dynamics in different scenarios. By decomposing the latent variation from the network, the latent variation is free to learn the stochastic physical properties of the objects, and thus their interactions, while the network is encouraged to learn the deterministic object and scene appearance. A natural next step is to investigate a two-step approach in which the architecture is bootstrapped on synthetic data and then trained by observing real data in an unsupervised manner.

References

  • [Avraamides and Kelly2008] Avraamides, M. N., and Kelly, J. W. 2008. Multiple systems of spatial memory and action. Cognitive processing 9(2):93–106.
  • [Battaglia et al.2016] Battaglia, P.; Pascanu, R.; Lai, M.; Rezende, D. J.; et al. 2016. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, 4502–4510.
  • [Bhattacharyya et al.2018] Bhattacharyya, A.; Malinowski, M.; Schiele, B.; and Fritz, M. 2018. Long-term image boundary prediction. In AAAI.
  • [Bousmalis et al.2018] Bousmalis, K.; Irpan, A.; Wohlhart, P.; Bai, Y.; Kelcey, M.; Kalakrishnan, M.; Downs, L.; Ibarz, J.; Sampedro, P. P.; Konolige, K.; Levine, S.; and Vanhoucke, V. 2018. Using simulation and domain adaptation to improve efficiency of deep robotic grasping.
  • [Byravan and Fox2017] Byravan, A., and Fox, D. 2017. Se3-nets: Learning rigid body motion using deep neural networks. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, 173–180. IEEE.
  • [Chang et al.2016] Chang, M. B.; Ullman, T.; Torralba, A.; and Tenenbaum, J. B. 2016. A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341.
  • [Colombo et al.2017] Colombo, D.; Serino, S.; Tuena, C.; Pedroli, E.; Dakanalis, A.; Cipresso, P.; and Riva, G. 2017. Egocentric and allocentric spatial reference frames in aging: A systematic review. Neuroscience & Biobehavioral Reviews 80:605–621.
  • [Ehrhardt et al.2017] Ehrhardt, S.; Monszpart, A.; Mitra, N. J.; and Vedaldi, A. 2017. Taking visual motion prediction to new heightfields. CoRR abs/1712.09448.
  • [Fragkiadaki et al.2016] Fragkiadaki, K.; Agrawal, P.; Levine, S.; and Malik, J. 2016. Learning visual predictive models of physics for playing billiards. In International Conference on Learning Representations.
  • [Henriques and Vedaldi2018] Henriques, J. F., and Vedaldi, A. 2018. Mapnet: An allocentric spatial memory for mapping environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • [Jaderberg et al.2015] Jaderberg, M.; Simonyan, K.; Zisserman, A.; et al. 2015. Spatial transformer networks. In Advances in neural information processing systems, 2017–2025.
  • [Kingma and Welling2014] Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. In ICLR.
  • [Lerer, Gross, and Fergus2016] Lerer, A.; Gross, S.; and Fergus, R. 2016. Learning physical intuition of block towers by example. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, ICML'16, 430–438.
  • [Li, Leonardis, and Fritz2017] Li, W.; Leonardis, A.; and Fritz, M. 2017. Visual stability prediction for robotic manipulation. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, 2606–2613. IEEE.
  • [Lin and Lucey2017] Lin, C.-H., and Lucey, S. 2017. Inverse compositional spatial transformer networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Lotter, Kreiman, and Cox2016] Lotter, W.; Kreiman, G.; and Cox, D. 2016. Deep predictive coding networks for video prediction and unsupervised learning. In ICLR.
  • [Mathieu, Couprie, and LeCun2015] Mathieu, M.; Couprie, C.; and LeCun, Y. 2015. Deep multi-scale video prediction beyond mean square error. In ICLR.
  • [Mottaghi et al.2016] Mottaghi, R.; Bagherinezhad, H.; Rastegari, M.; and Farhadi, A. 2016. Newtonian scene understanding: Unfolding the dynamics of objects in static images. In CVPR, 3521–3529.
  • [Spelke and Kinzler2007] Spelke, E. S., and Kinzler, K. D. 2007. Core knowledge. Developmental science 10(1):89–96.
  • [Srivastava, Mansimov, and Salakhudinov2015] Srivastava, N.; Mansimov, E.; and Salakhudinov, R. 2015. Unsupervised learning of video representations using lstms. In International conference on machine learning, 843–852.
  • [van Steenkiste et al.2018] van Steenkiste, S.; Chang, M.; Greff, K.; and Schmidhuber, J. 2018. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. In International Conference on Learning Representations.
  • [Wang et al.2018a] Wang, Y.; Gao, Z.; Long, M.; Wang, J.; and Yu, P. S. 2018a. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In Dy, J., and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 5123–5132. PMLR.
  • [Wang et al.2018b] Wang, Z.; Rosa, S.; Yang, B.; Wang, S.; Trigoni, N.; and Markham, A. 2018b. 3d-physnet: Learning the intuitive physics of non-rigid object deformations. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-18, 4958–4964. International Joint Conferences on Artificial Intelligence Organization.
  • [Watters et al.2017] Watters, N.; Zoran, D.; Weber, T.; Battaglia, P.; Pascanu, R.; and Tacchetti, A. 2017. Visual interaction networks: Learning a physics simulator from video. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30. Curran Associates, Inc. 4539–4547.
  • [Wu et al.2015] Wu, J.; Yildirim, I.; Lim, J. J.; Freeman, B.; and Tenenbaum, J. 2015. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. In Advances in neural information processing systems, 127–135.
  • [Wu et al.2017] Wu, J.; Lu, E.; Kohli, P.; Freeman, W. T.; and Tenenbaum, J. B. 2017. Learning to see physics via visual de-animation. In Advances in Neural Information Processing Systems.
  • [Xingjian et al.2015] Xingjian, S.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-K.; and Woo, W.-c. 2015. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, 802–810.
  • [Xue et al.2016] Xue, T.; Wu, J.; Bouman, K.; and Freeman, B. 2016. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, 91–99.
  • [Zhu et al.2017] Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2223–2232.