1 Introduction
What makes an object an object? Researchers in cognitive science have made profound investigations into this fundamental problem; results suggest that humans, even young infants, recognize objects as continuous, integrated regions that move together (Carey, 2009; Spelke & Kinzler, 2007). Watching objects move, infants gradually build the internal notion of objects in their mind. The whole process requires little external supervision from experts.
Motion gives us not only the concept of objects and parts, but also their hierarchical structure. The classic study of Johansson (1973) reveals that humans recognize the structure of a human body from a few moving dots representing the keypoints on a human skeleton. This connects to the classic Gestalt theory in psychology (Koffka, 2013), which argues that human perception is holistic and generative, explaining scenes as a whole rather than as isolated parts. In addition to being unsupervised and hierarchical, our perception gives us concepts that are fully interpretable and disentangled. With an object-based representation, we are able to reason about object motion, predict what is going to happen in the near future, and imagine counterfactuals like “what happens if?” (Spelke & Kinzler, 2007).
How can we build machines of such competency? Would it be possible to build an artificial system that learns an interpretable, hierarchical representation with system dynamics, purely from raw visual data with no human annotations? Recent research in unsupervised and generative deep representation learning has been making progress in this direction: there have been models that efficiently explain multiple objects in a scene (Huang & Murphy, 2015; Eslami et al., 2016), some simultaneously learning an interpretable representation (Chen et al., 2016). Most existing models, however, either do not produce a structured, hierarchical object representation, or do not characterize system dynamics.
In this paper, we propose a novel formulation that learns an interpretable, hierarchical object representation and scene dynamics by predicting the future. Our model requires no human annotations, learning purely from unlabeled videos of paired frames. During training, the model sees videos of objects moving; during testing, it recognizes and segments each object and its parts, builds their hierarchical structure, and models their motion distribution for future frame synthesis, all from a single image.
Our model, named Parts, Structure, and Dynamics (PSD), learns to recognize the object parts via a layered image representation. PSD learns their hierarchy via a structural descriptor that composes low-level concepts into a hierarchical structure. Formulated as a fully differentiable module, the structural descriptor can be trained end-to-end within a neural network. PSD learns to model the system dynamics by predicting the future.
We evaluate our model along three axes. On real and synthetic datasets, we first examine its ability to learn the concept of objects and segment them. We then compute the likelihood that it correctly captures the hierarchical structure in the data. We finally validate how well it characterizes the object motion distribution and predicts the future. Our system works well on all these tasks, with minimal input requirements (two frames during training, and one during testing). While previous state-of-the-art methods that jointly discover objects and relations and predict future frames only work on binary images of shapes and digits, our PSD model works well on complex real-world RGB images and requires fewer input frames.
2 Related Work
Our work is closely related to the research on learning an interpretable representation with a neural network (Hinton & Van Camp, 1993; Kulkarni et al., 2015b; Chen et al., 2016; Higgins et al., 2017, 2018). Recent papers explored using deep networks to efficiently explain an object (Kulkarni et al., 2015a; Rezende et al., 2016; Chen et al., 2018), a scene with multiple objects (Ba et al., 2015; Huang & Murphy, 2015; Eslami et al., 2016), or sequential data (Li & Mandt, 2018; Hsu et al., 2017). In particular, Chen et al. (2016) proposed to learn a disentangled representation without direct supervision. Wu et al. (2017) studied video de-animation, building an object-based, structured representation from a video. Higgins et al. (2018) learned an implicit hierarchy of abstract concepts from a few symbol-image pairs. Compared with these approaches, our model not only learns to explain observations, but also builds a dynamics model that can be used for future prediction.
There has also been extensive research on hierarchical motion decomposition (Ross & Zemel, 2006; Ross et al., 2010; Grundmann et al., 2010; Xu et al., 2012; Flores-Mangas & Jepson, 2013; Jain et al., 2014; Ochs et al., 2014; Pérez-Rúa et al., 2016; Gershman et al., 2016; Esmaeili et al., 2018). These papers focus on segmenting objects or parts from videos and inferring their hierarchical structure. In this paper, we propose a model that learns not only to segment parts and infer their structure, but also to capture each part’s dynamics for synthesizing possible future frames.
Physical scene understanding has attracted increasing attention in recent years (Fragkiadaki et al., 2016; Battaglia et al., 2016; Chang et al., 2017; Finn et al., 2016; Ehrhardt et al., 2017; Shao et al., 2014). Researchers have attempted to go beyond the traditional goals of high-level computer vision, inferring “what is where”, to capture the physics needed to predict the immediate future of dynamic scenes, and to infer the actions an agent should take to achieve a goal. Most of these efforts do not attempt to learn physical object representations from raw observations. Some systems emphasize learning from pixels but without an explicitly object-based representation (Fragkiadaki et al., 2016; Agrawal et al., 2016), which makes generalization challenging. Others learn a flexible model of the dynamics of object interactions, but assume a decomposition of the scene into physical objects and their properties rather than learning directly from images (Chang et al., 2017; Battaglia et al., 2016; Kipf et al., 2018). A few very recent papers have proposed to jointly learn a perception module and a dynamics model (Watters et al., 2017; Wu et al., 2017; van Steenkiste et al., 2018). Our model moves further by simultaneously discovering the hierarchical structure of object parts.
Another line of related work is on future state prediction in either image pixels (Xue et al., 2016; Mathieu et al., 2016; Lotter et al., 2017; Lee et al., 2018; Balakrishnan et al., 2018b) or object trajectories (Kitani et al., 2017; Walker et al., 2016). Some of these papers, like our model, draw insights from classical computer vision research on layered motion representations (Wang & Adelson, 1993). However, they often do not model the object hierarchy. There has also been abundant research making use of physical models for human or scene tracking (Salzmann & Urtasun, 2011; Kyriazis & Argyros, 2013; Vondrak et al., 2013; Brubaker et al., 2009). Compared with these papers, our model learns to discover the hierarchical structure of object parts purely from visual observations, without resorting to prior knowledge.
3 Formulation
By observing objects move, we aim to learn the concept of object parts and their relationships. Take the human body as an example (Figure 1). We want our model to parse human parts (e.g., torso, hands, and legs) and to learn their structure (e.g., hands and legs are both parts of the human body).
Formally, given a pair of images $\{I_1, I_2\}$, let $M$ be the Lagrangian motion map (i.e., optical flow) between them. Consider a system that learns to segment object parts and to capture their motions, without modeling their structure. Its goal is to find a segment decomposition of $I_1$, $\{O_1, O_2, \ldots, O_n\}$, where each segment $O_k$ corresponds to an object part with distinct motion. Let $\{M^g_1, M^g_2, \ldots, M^g_n\}$ be their corresponding motions.
Beyond that, we assume that these object parts form a hierarchical tree structure: each part $k$ has a parent $p_k$, unless it is itself the root of a motion tree. Its motion $M^g_k$ can therefore be decomposed into its parent’s motion $M^g_{p_k}$ and a local motion component $M^l_k$ within its parent’s reference frame. Specifically, $M^g_k = M^g_{p_k} + M^l_k$ if $k$ is not a root. Here we make use of the fact that the Lagrangian motion components $M^l_k$ and $M^g_{p_k}$ are additive.
Figure 2 gives an intuitive example: knowing that the legs are part of the human body, the legs’ motion can be written as the sum of the body’s motion (e.g., moving to the left) and the legs’ local motion (e.g., moving to the lower or upper left). Therefore, the objective of our model is, in addition to identifying the object components $\{O_k\}$, to learn the hierarchical tree structure $\{p_k\}$ that effectively and efficiently explains the object’s motion.
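The additive decomposition can be illustrated with a small sketch. It assumes that each part's motion is a single 2D displacement rather than a dense flow field; the part names, the `parent` map, and all displacement values are hypothetical and chosen only for illustration.

```python
# Hypothetical parts: each stores its parent (None for a root) and a
# local motion expressed in the parent's reference frame.
parent = {"body": None, "leg": "body", "foot": "leg"}
local_motion = {"body": (-2.0, 0.0), "leg": (-1.0, 1.0), "foot": (0.0, 0.5)}


def global_motion(part):
    """Global motion = own local motion + global motion of the parent."""
    dx, dy = local_motion[part]
    if parent[part] is None:          # a root has no parent motion to add
        return (dx, dy)
    pdx, pdy = global_motion(parent[part])
    return (pdx + dx, pdy + dy)


motions = {part: global_motion(part) for part in parent}
```

Here the leg inherits the body's leftward motion and adds its own local component, which mirrors the example of Figure 2.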
Such an assumption makes it possible to decompose complex object motions into simple and disentangled local motion components. Reusing local components along the hierarchical structure helps to reduce the description length of the motion map. Therefore, such a decomposition should naturally emerge within a design with an information bottleneck that encourages compact, disentangled representations. In the next section, we introduce the general philosophy behind our model design and the individual components within.
4 Method
In this section, we discuss our approach to learning the disentangled, hierarchical representation. Our model learns by predicting future motions and synthesizing future frames without manual annotations. Figure 3 shows an overview of our Parts, Structure, and Dynamics (PSD) model.
4.1 Overview
Motion can be decomposed in a layer-wise manner, separately modeling each object component’s movement (Wang & Adelson, 1993). Motivated by this, our model first decomposes the input frame $I_1$ into multiple feature maps using an image encoder (Figure 3c). Intuitively, these feature maps correspond to separate object components. Our model then performs convolutions (Figure 3d) on these feature maps using separate kernels obtained from a kernel decoder (Figure 3b), and synthesizes the local motions $\{M^l_k\}$ of separate object components with a motion decoder (Figure 3e). After that, our model employs a structural descriptor (Figure 3f) to recover the global motions $\{M^g_k\}$ from the local motions $\{M^l_k\}$, and then computes the overall motion $\hat{M}$. Finally, our model uses an image decoder (Figure 3g) to synthesize the next frame $\hat{I}_2$ from the input frame $I_1$ and the overall motion $\hat{M}$.
Our PSD model can be seen as a conditional variational autoencoder. During training, it employs an additional motion encoder (Figure 3a) to encode the motion into the latent representation $z$; during testing, it instead samples the representation $z$ from its prior distribution, which is assumed to be a multivariate Gaussian distribution, where each dimension is i.i.d., zero-mean, and unit-variance. We emphasize the different behaviors of training and testing in Algorithms 1 and 2.
4.2 Network Structure
We now introduce each component.
Dimensionality.
The hyperparameter $d$ is set to 32, which determines the maximum number of objects we are able to deal with. During training, the variational loss encourages our model to use as few dimensions in the latent representation $z$ as possible. Consequently, only a few dimensions learn useful representations, each of which corresponds to one particular object, while all the other dimensions stay very close to Gaussian noise.
Motion Encoder.
Our motion encoder takes the flow field between two consecutive frames as input, at a resolution of 128×128. It applies seven convolutional layers with {16, 16, 32, 32, 64, 64, 64} channels, kernel sizes of 5×5, and strides of 2×2. Between convolutional layers, there are batch normalizations (Ioffe & Szegedy, 2015) and Leaky ReLUs (Maas et al., 2013) with slope 0.2. The output has a size of 64×1×1. It is then reshaped into a $d$-dimensional mean vector $z_{\text{mean}}$ and a $d$-dimensional variance vector $z_{\text{var}}$. Finally, the latent motion representation $z$ is sampled from $\mathcal{N}(z_{\text{mean}}, z_{\text{var}})$.
Kernel Decoder.
Our kernel decoder consists of $d$ separate fully connected networks, decoding the latent motion representation $z$ to $d$ convolutional kernels of size 5×5. Therefore, each kernel corresponds to one dimension in the latent motion representation $z$. Within each network, we use four fully connected layers with {64, 128, 64, 25} hidden units. In between, there are batch normalizations and ReLU layers.
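The dimensions quoted for the motion encoder and kernel decoder can be checked with a few lines of arithmetic. This is only a sanity-check sketch: the padding of 2 is an assumption (the text does not state the padding), and `conv_out` is a hypothetical helper implementing the standard output-size formula for a strided convolution.

```python
import math


def conv_out(size, kernel=5, stride=2, padding=2):
    """Spatial size after one convolution (standard floor formula).

    padding=2 is an assumption; kernel and stride follow the text.
    """
    return (size + 2 * padding - kernel) // stride + 1


channels = [16, 16, 32, 32, 64, 64, 64]   # seven layers, from the text
size = 128                                # input flow resolution
sizes = []
for _ in channels:
    size = conv_out(size)
    sizes.append(size)

d = 32                                    # latent dimensionality
flat = channels[-1] * size * size         # 64 x 1 x 1 output values
mean_dims, var_dims = d, flat - d         # split into mean and variance
kernel_side = math.isqrt(25)              # last FC layer has 25 units
```

Under this padding assumption, seven stride-2 layers reduce 128×128 to 1×1, the 64 output values split evenly into a 32-dimensional mean and a 32-dimensional variance, and the 25 outputs of each kernel-decoder network reshape to one 5×5 kernel.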
Image Encoder.
Our image encoder applies six convolutional layers to the image, with {32, 32, 64, 64, 32, 32} channels and kernel sizes of 5×5; two of the layers have strides of 2×2. The output is a 32-channel feature map, matching the 32 channels of the last layer. We then upsample the feature maps by 4× with nearest-neighbor sampling, so that their final resolution is 128×128.
Cross Convolution.
The cross convolution layer (Xue et al., 2016) applies the convolutional kernels learned by the kernel decoder to the feature maps learned by the image encoder. Here, the convolution operations are carried out in a channel-wise manner (also known as depthwise separable convolutions in Chollet (2017)): it applies each of the $d$ convolutional kernels to its corresponding channel in the feature map. The output is a $d$-channel transformed feature map.
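The channel-wise operation can be sketched without any framework. The code below is an illustrative, assumption-laden sketch rather than the implementation used in the paper: it uses zero padding, plain nested lists, and 3×3 kernels for brevity (the text uses 5×5), and the function names are hypothetical. As in most deep learning libraries, "convolution" here means cross-correlation.

```python
def conv2d_same(channel, kernel):
    """Correlate one 2D channel with one odd-sized kernel (zero padding)."""
    h, w = len(channel), len(channel[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(-r, r + 1):
                for kx in range(-r, r + 1):
                    yy, xx = y + ky, x + kx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += channel[yy][xx] * kernel[ky + r][kx + r]
            out[y][x] = acc
    return out


def cross_convolution(feature_maps, kernels):
    """Apply kernel k to channel k only; channels are never mixed."""
    return [conv2d_same(c, k) for c, k in zip(feature_maps, kernels)]


channel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]    # leaves a channel unchanged
shift = [[0, 0, 0], [1, 0, 0], [0, 0, 0]]       # moves content one pixel right
outputs = cross_convolution([channel, channel], [identity, shift])
```

Because every kernel touches only its own channel, a kernel decoded from one latent dimension can move one object component without affecting the others.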
Motion Decoder.
Our motion decoder takes the transformed feature map as input and estimates the $x$-axis and $y$-axis motions separately. For each axis, the network applies two 9×9, two 5×5, and two 1×1 depthwise separable convolutional layers, all with 32 channels. We stack the outputs from the two branches together. The output motion has a size of 128×128×2. Note that the local motion $M^l_k$ is determined by $z_k$ only.
Structural Descriptor.
Our structural descriptor recovers the global motions $\{M^g_k\}$ from the local motions $\{M^l_k\}$ and the hierarchical tree structure $P$ using
$$M^g_k = M^l_k + M^g_{p_k} = M^l_k + \left(M^l_{p_k} + M^g_{p_{p_k}}\right) = \cdots \qquad (1)$$
$$\phantom{M^g_k} = M^l_k + \sum_{i \neq k} [i \in P_k] \cdot M^l_i, \qquad (2)$$
where $P_k$ denotes the set of ancestors of part $k$. Then, we define the structural matrix $S$ as $S_{ik} = [i \in P_k]$, where each binary indicator $S_{ik}$ represents whether $O_i$ is an ancestor of $O_k$. This is what we aim to learn, and it is shared across different data points. In practice, we relax the binary constraints on $S_{ik}$ to $[0, 1]$ to make this module differentiable: $S_{ik} = \mathrm{sigmoid}(W_{ik})$, where $W_{ik}$ are trainable parameters. Finally, the overall motion can be simply computed as $\hat{M} = \sum_k O_k \times M^g_k$.
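The structural descriptor can be sketched as a weighted sum over ancestors. The sketch below rests on stated assumptions: each local motion is a single 2D displacement instead of a dense field, the relaxation of the binary indicators is taken to be a sigmoid over trainable logits, and the names `global_motions` and `W` as well as all numeric values are hypothetical.

```python
import math


def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))


def global_motions(local, W):
    """local[k]: (dx, dy) local motion of part k.
    W[i][k]: logit for "part i is an ancestor of part k" (assumed relaxation).
    """
    n = len(local)
    S = [[sigmoid(W[i][k]) for k in range(n)] for i in range(n)]
    out = []
    for k in range(n):
        gx, gy = local[k]
        for i in range(n):
            if i != k:                       # sum over soft ancestors only
                gx += S[i][k] * local[i][0]
                gy += S[i][k] * local[i][1]
        out.append((gx, gy))
    return out, S


# Chain 0 -> 1 -> 2: large positive logits mark ancestors.
BIG = 20.0
W = [[-BIG, BIG, BIG],
     [-BIG, -BIG, BIG],
     [-BIG, -BIG, -BIG]]
local = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
motions, S = global_motions(local, W)
```

With saturated logits the soft matrix behaves like the binary ancestor indicator, so part 2 accumulates the local motions of parts 0 and 1 on top of its own.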
Image Decoder.
Given the input frame $I_1$ and the predicted overall motion $\hat{M}$, we employ a U-Net (Ronneberger et al., 2015) as our image decoder to synthesize the future image frame $\hat{I}_2$.
4.3 Training Details
Our objective function is a weighted sum over three separate components:
$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \cdot \mathcal{L}_{\text{reg}} + \gamma \cdot \mathcal{L}_{\text{struct}}. \qquad (3)$$
The first component is the pixel-wise reconstruction loss, which forces our model to accurately estimate the motion $\hat{M}$ and synthesize the future frame $\hat{I}_2$. We have $\mathcal{L}_{\text{recon}} = \|\hat{M} - M\| + \lambda \cdot \|\hat{I}_2 - I_2\|$, where $\lambda$ is a fixed weighting factor.
The second component is the variational loss, which encourages our model to use as few dimensions in the latent representation $z$ as possible (Xue et al., 2016; Higgins et al., 2017). We have $\mathcal{L}_{\text{reg}} = D_{\text{KL}}\left(\mathcal{N}(z_{\text{mean}}, z_{\text{var}}) \,\|\, p(z)\right)$, where $D_{\text{KL}}$ is the KL-divergence and $p(z)$ is the prior distribution of the latent representation (which is set to the standard normal distribution in our experiments).
The last component is the structural loss, which encourages our model to learn the hierarchical tree structure so that the motions are represented efficiently: $\mathcal{L}_{\text{struct}} = \sum_k \|M^l_k\|$. Note that we apply the structural loss on the local motion fields, not on the structural matrix. In this way, the structural loss serves as a regularization, encouraging the local motion fields to have small values.
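How the three components combine can be sketched numerically. This is a minimal sketch under stated assumptions: the weights are called `beta` and `gamma` here, the prior is the standard normal, the encoder distribution is a diagonal Gaussian, and the structural term is taken to be a sum of L2 norms of flattened local motion fields (the choice of norm is an assumption). The reconstruction term is passed in as a precomputed number.

```python
import math


def kl_to_standard_normal(mean, var):
    """KL( N(mean, diag(var)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mean, var))


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def total_loss(recon, mean, var, local_motions, beta, gamma):
    """recon + beta * KL + gamma * (sum of norms of local motion fields)."""
    reg = kl_to_standard_normal(mean, var)
    struct = sum(l2_norm(m) for m in local_motions)
    return recon + beta * reg + gamma * struct


# A latent that already matches the prior incurs no variational penalty.
loss = total_loss(recon=0.5, mean=[0.0, 0.0], var=[1.0, 1.0],
                  local_motions=[[3.0, 4.0], [0.0, 0.0]],
                  beta=2.0, gamma=0.1)
```

Dimensions whose mean and variance match the prior contribute zero to the variational term, which is why unused latent dimensions drift toward Gaussian noise.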
We implement our PSD model in PyTorch (Paszke et al., 2017). Optimization is carried out using ADAM (Kingma & Ba, 2015) with a fixed learning rate and a mini-batch size of 32. We propose a two-stage optimization scheme, which first learns the disentangled representation and then learns the hierarchical representation.
In the first stage, we encourage the model to learn a disentangled representation (without structure). We fix the weight $\gamma$ of the structural loss in Equation 3 and fix the structural matrix $S$ to the identity. The weight $\beta$ of the variational loss in Equation 3 is the same as the one in the $\beta$-VAE (Higgins et al., 2017), and therefore, larger $\beta$’s encourage the model to learn a more disentangled representation. We initialize $\beta$ and then adaptively double its value whenever the reconstruction loss reaches a preset threshold.
In the second stage, we train the model to learn the hierarchical representation. We fix the weights of the motion encoder and kernel decoder, and fix $\beta$. We initialize the structural matrix $S$ and optimize it jointly with the image encoder and motion decoder. We adaptively tune the value of $\gamma$ in the same way as $\beta$ in the first stage.
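The adaptive schedule for the loss weights can be sketched as follows. The sketch assumes that "reaches a preset threshold" means the reconstruction loss has dropped to or below the threshold; the starting weight, the threshold, and the loss values are all hypothetical.

```python
def update_weight(weight, recon_loss, threshold):
    """Double the loss weight once reconstruction is good enough.

    Assumption: "reaches the threshold" means recon_loss <= threshold.
    """
    return weight * 2.0 if recon_loss <= threshold else weight


weight = 0.5                                   # hypothetical initial value
history = []
for recon_loss in [0.9, 0.4, 0.8, 0.3, 0.2]:   # hypothetical loss trace
    weight = update_weight(weight, recon_loss, threshold=0.4)
    history.append(weight)
```

The weight only grows while the model can still reconstruct well, which trades reconstruction quality against disentanglement (first stage) or structure (second stage) in a controlled way.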
5 Experiments
We evaluate our model in three diverse settings: i) simple yet nontrivial shapes and digits, ii) Atari games of basketball playing, and iii) real-world human motions.
5.1 Movement of Shapes and Digits
We first evaluate our method on shapes and digits. For each dataset, we rendered a total of 100,000 pairs for training and 10,000 for testing, with random visual appearance (i.e., sizes, positions, and colors).
For the shapes dataset, we use three types of shapes: circles, triangles, and squares. Circles always move diagonally, while the other two shapes’ movements consist of two sub-movements: moving together with the circles and moving in their own directions (triangles horizontally, and squares vertically). Figure A3 shows the motion distributions of each shape. The complex global motions (after the structural descriptor) are decomposed into several simple local motions (before the structural descriptor). These local motions are much easier to represent.
We also construct an additional dataset with up to nine different shapes. We assign these shapes to four different groups: i) square and two types of parallelograms, ii) circle and two types of triangles, iii) two types of trapezoids, and iv) pentagon. The movements of shapes in the same group have intrinsic relations, while shapes in different groups are independent of each other. Each of the nine shapes has its own motion direction. In the first group, the tree structure is the same as that of our original shapes dataset: replacing circles with squares, triangles with left parallelograms, and squares with right parallelograms. In the second group, the circle and two types of triangles form a chain-like structure, which is similar to the one in our digits dataset. In the third group, the structure is a chain containing the two types of trapezoids. In the last group, there is only a pentagon.
As for the digits dataset, we use six types of handwritten digits from MNIST (LeCun et al., 1998). These digits are divided into two groups: 0’s, 1’s and 2’s are in the first group, and 3’s, 4’s and 5’s in the second group. The movements of digits in the same group have some intrinsic relations, while digits in different groups are independent of each other. In the first group, the tree structure is the same as that of our shapes dataset: replacing circles with 0’s, triangles with 1’s, and squares with 2’s. The second group has a chain-like structure: 3’s move diagonally, 4’s move together with 3’s and move horizontally at the same time, and 5’s move with 4’s and move vertically at the same time.
After training, our model should be able to synthesize future frames, segment different objects (i.e., shapes and digits), and discover the relationship between these objects.
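Table 1: Segmentation IoU on the shapes and digits datasets.
Method  |  Shapes: Circles  |  Shapes: Squares  |  Shapes: Triangles  |  Digits: 0’s  |  Digits: 1’s  |  Digits: 2’s  |  Digits: 3’s  |  Digits: 4’s  |  Digits: 5’s
NEM  |  0.368  |  0.457  |  0.348  |  0.470  |  0.229  |  0.322  |  0.512  |  0.295  |  0.251
RNEM  |  0.540  |  0.559  |  0.583  |  0.323  |  0.416  |  0.339  |  0.448  |  0.352  |  0.326
PSD (ours)  |  0.935  |  0.816  |  0.905  |  0.750  |  0.742  |  0.739  |  0.739  |  0.472  |  0.641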
Future Prediction.
In Figure 4d and Figure 6d, we present some qualitative results of synthesizing future frames. Our PSD model captures the different motion patterns for each object and synthesizes multiple possible future frames. Figure A3 summarizes the distribution of sampled motion of these shapes; our model learns to approximate each shape’s dynamics in the training set.
Latent Representation.
After analyzing the representation $z$, we observe that it is extremely sparse: only a few of its dimensions are actually used. On the shapes dataset, there are three dimensions learning meaningful representations, each of which corresponds to one particular shape, while all the other dimensions are very close to Gaussian noise. Similarly, on the digits dataset, there are six dimensions, corresponding to the different digits. In further discussions, we will only focus on these meaningful dimensions.
Object Segmentation.
For each meaningful dimension, the feature map can be considered as the segmentation mask of one particular object (by thresholding). We evaluate our model’s ability to learn the concept of objects and segment them by computing the intersection over union (IoU) between the model’s prediction and the ground-truth instance mask. We compare our model with Neural Expectation Maximization (NEM) proposed by Greff et al. (2017) and Relational Neural Expectation Maximization (RNEM) proposed by van Steenkiste et al. (2018). As these two methods both take a sequence of frames as input, we feed the two input frames repeatedly ($I_1$, $I_2$, $I_1$, $I_2$, $I_1$, $I_2$, …) into these models for a fair comparison. Besides, as these methods do not learn the correspondence of objects across data points, we manually iterate over all possible mappings and report the one with the best performance.
We present qualitative results in Figure 7 and Figure 5b, and quantitative results in Table 1. Our PSD model significantly outperforms the two baselines. In particular, RNEM and our PSD model focus on complementary topics: RNEM learns to identify instances through temporal reasoning, using signals across the entire video to group pixels into objects; our PSD model learns the appearance prior of objects: by watching how they move, it learns to recognize how object parts can be grouped based on their appearance, and it can be applied to static images. As the videos in our dataset have only two frames, temporal signals alone are often not enough to tell objects apart. This explains the less compelling results from RNEM. We include a more systematic study in Section A.3 to verify this.
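The evaluation protocol described above (thresholding, IoU, and a search over all mappings for methods without a fixed object correspondence) can be sketched as follows. The threshold level of 0.5 for feature maps, the assumption that the number of predicted masks equals the number of objects, and all mask values are illustrative and not taken from the paper.

```python
from itertools import permutations


def threshold(feature_map, level=0.5):
    """Binarize a feature map; the level 0.5 is an assumption."""
    return [[1 if v > level else 0 for v in row] for row in feature_map]


def iou(pred, truth):
    """Intersection over union of two binary masks (nested lists)."""
    inter = union = 0
    for prow, trow in zip(pred, truth):
        for p, t in zip(prow, trow):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0


def best_mapping_iou(pred_masks, true_masks):
    """Try every assignment of predictions to objects; keep the best mean IoU."""
    best = 0.0
    for perm in permutations(range(len(pred_masks))):
        score = sum(iou(pred_masks[p], true_masks[t]) for t, p in enumerate(perm))
        best = max(best, score / len(true_masks))
    return best


true_masks = [[[1, 1], [0, 0]], [[0, 0], [1, 1]]]
pred_masks = [[[0, 0], [1, 0]], [[1, 1], [0, 0]]]   # slots in swapped order
score = best_mapping_iou(pred_masks, true_masks)
```

The exhaustive search grows factorially with the number of objects, which is acceptable for the handful of objects per scene considered here.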
To evaluate the generalization ability, we train our PSD model on a dataset with two squares, among other shapes, and test it on a dataset with three squares. In each piece of data, all squares move together and have the same motion. Other settings are the same as the original shapes dataset. Figure 8 shows segmentation results on these two datasets. Our model generalizes to recognize the three squares simultaneously, despite having seen up to two in training.
Hierarchical Structure.
To discover the tree structure between these dimensions, we binarize the structural matrix $S$ with a threshold of 0.5 and recover the hierarchical structure from it. We compare our PSD model with RNEM and Neural Relational Inference (NRI) proposed by Kipf et al. (2018). As the NRI model requires objects’ feature vectors (i.e., location and velocity) as input, we directly feed it the coordinates of the different objects in the two frames and ask it to infer the underlying interaction graph. In Figure 4f and Figure 6f, we visualize the hierarchical tree structure obtained from these models. Our model is capable of discovering the underlying structure, whereas the two baselines fail to learn any meaningful relationships. This might be because NRI and RNEM both assume that the system dynamics are fully characterized by the current states and interactions, and therefore cannot model the uncertainties in the system dynamics. On the challenging dataset with more shapes, our PSD model is still able to discover the underlying structure among them (see Figure 5c).
5.2 Atari Games of Playing Basketball
We then evaluate our model on a dataset of Atari games. In particular, we select the Basketball game from the Atari 2600. In this game, there are two players competing with each other. Each player can move in eight different directions. The offensive player constantly dribbles the ball and throws it at some moment, while the defensive player tries to steal the ball from the opponent. We download a video of this game being played from YouTube and construct a dataset with 5,000 pairs for training and 500 for testing.
Our PSD model discovers three meaningful dimensions in the latent representation $z$. We visualize the feature maps in these three dimensions in Figure 9. We observe that one dimension (in Figure 9d) is learning the offensive player with the ball, another (in Figure 9e) is learning the ball, and the other (in Figure 9f) is learning the defensive player. We construct the hierarchical tree structure among these three dimensions from the structural matrix $S$. As illustrated in Figure 9g, our PSD model is able to discover the relationship between the ball and the players: the offensive player controls the ball. This is because our model observes that the ball always moves along with the offensive player.
5.3 Movement of Humans
We finally evaluate our method on two datasets of real-world human motions: the human exercise dataset used in Xue et al. (2016) and the yoga dataset used in Balakrishnan et al. (2018a). We estimate the optical flows between frames with an off-the-shelf package (Liu, 2009). Compared with the previous datasets, these two require much more complicated visual perception, and they have challenging hierarchical structures. In the human exercise dataset, there are 50,000 pairs of frames used for training and 500 for testing. As for the yoga dataset, there are 4,720 pairs of frames for training and 526 for testing.
Future Prediction.
In Figure 10 and Figure 11, we present qualitative results of synthesizing future frames. Our model is capable of predicting multiple future frames, each with a different motion. We compare with 3DcVAE (Li et al., 2018), which takes one frame as input and predicts the next 16 frames. As our training dataset only has paired frames, for a fair comparison, we use the repetition of two frames as input: ($I_1$, $I_2$, $I_1$, $I_2$, …, $I_1$, $I_2$). We also use the same optical flow (Liu, 2009) for both methods. In Figure 12, the future frames predicted by 3DcVAE have many more artifacts than those of our PSD model.
Object Segmentation.
In Figure 13 and Figure 14, we visualize the feature maps corresponding to the active latent dimensions. It turns out that each of these dimensions corresponds to one particular human part: full torsos (13c, 14c), upper torsos (13d), arms (13e), left arms (14d), right arms (14e), right legs (13f, 14g), and left legs (13g, 14f). Note that it is extremely challenging to distinguish different parts from motions, because different parts (e.g., arms and legs) might have similar motions (see Figure 13b). RNEM is not able to segment any meaningful parts, let alone structure, while our PSD model gives imperfect yet reasonable part segmentation results. For quantitative evaluation, we collect the ground-truth part segmentation for 30 images and compute the intersection over union (IoU) between the ground truth and the predictions of our model and of the two baselines (NEM, RNEM). The quantitative results are presented in Table 2. Our PSD model significantly outperforms the two baselines.
Hierarchical Structure.
We recover the hierarchical tree structure among these dimensions from the structural matrix $S$. From Figure 13h, our PSD model is able to discover that the upper torso and the legs are parts of the full torso, and that the arm is part of the upper torso; from Figure 14h, our PSD model discovers that the arms and legs are parts of the full torso.
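One way to turn a learned structural matrix into a tree is sketched below. It applies the 0.5 threshold mentioned in Section 5.1 and then picks, for each part, the ancestor that itself has the most ancestors as its parent. This parent-selection rule is an assumption made for illustration, and the matrix values are made up to mimic the structure reported for Figure 13h.

```python
def recover_parents(S, threshold=0.5):
    """S[i][k] scores "part i is an ancestor of part k".

    Returns, for each part, the index of its parent or None for a root.
    """
    n = len(S)
    anc = [[i != k and S[i][k] > threshold for k in range(n)] for i in range(n)]
    n_anc = [sum(anc[i][k] for i in range(n)) for k in range(n)]
    parents = []
    for k in range(n):
        candidates = [i for i in range(n) if anc[i][k]]
        # Assumed rule: the parent is the deepest ancestor,
        # i.e. the candidate with the most ancestors of its own.
        parents.append(max(candidates, key=lambda i: n_anc[i]) if candidates else None)
    return parents


# Hypothetical values. Parts: 0 full torso, 1 upper torso, 2 arm, 3 leg.
S = [[0.0, 0.9, 0.8, 0.95],
     [0.1, 0.0, 0.7, 0.2],
     [0.0, 0.1, 0.0, 0.1],
     [0.1, 0.0, 0.2, 0.0]]
parents = recover_parents(S)
```

In this toy matrix the arm has two ancestors (full torso and upper torso), and the rule correctly attaches it to the upper torso rather than directly to the root.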
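Table 2: Part segmentation IoU on human motions.
Method  |  Full torso  |  Upper torso  |  Arm  |  Left leg  |  Right leg  |  Overall
NEM  |  0.298  |  0.347  |  0.125  |  0.264  |  0.222  |  0.251
RNEM  |  0.321  |  0.319  |  0.220  |  0.294  |  0.228  |  0.276
PSD (ours)  |  0.697  |  0.574  |  0.391  |  0.374  |  0.336  |  0.474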
6 Conclusion
We have presented a novel formulation that simultaneously discovers object parts, their hierarchical structure, and the system dynamics from unlabeled videos. Our model uses a layered image representation to discover basic concepts and a structural descriptor to compose them. Experiments suggest that it works well on both real and synthetic datasets for part segmentation, hierarchical structure recovery, and motion prediction. We hope our work will inspire future research along the direction of learning structural object representations from raw sensory inputs.
Acknowledgements.
We thank Michael Chang and Sjoerd van Steenkiste for helpful discussions and suggestions. This work was supported in part by NSF #1231216, NSF #1447476, ONR MURI N000141612007, and Facebook.
References
 Agrawal et al. (2016) Pulkit Agrawal, Ashvin Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. Learning to poke by poking: Experiential learning of intuitive physics. In NeurIPS, 2016.
 Ba et al. (2015) Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015.
 Balakrishnan et al. (2018a) Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. Synthesizing images of humans in unseen poses. In CVPR, 2018a.
 Balakrishnan et al. (2018b) Guha Balakrishnan, Amy Zhao, Adrian V Dalca, Fredo Durand, and John Guttag. Synthesizing images of humans in unseen poses. In CVPR, 2018b.
 Battaglia et al. (2016) Peter W. Battaglia, Razvan Pascanu, Matthew Lai, Danilo Rezende, and Koray Kavukcuoglu. Interaction networks for learning about objects, relations and physics. In NeurIPS, 2016.
 Brubaker et al. (2009) Marcus A. Brubaker, David J. Fleet, and Aaron Hertzmann. Physics-based person tracking using the anthropomorphic walker. IJCV, 87(1-2):140–155, Aug 2009. doi: 10.1007/s11263-009-0274-5.
 Carey (2009) Susan Carey. The origin of concepts. Oxford University Press, 2009.
 Chang et al. (2017) Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B Tenenbaum. A compositional object-based approach to learning physical dynamics. In ICLR, 2017.
 Chen et al. (2018) Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv:1802.04942, 2018.
 Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, 2016.

 Chollet (2017) François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
 Ehrhardt et al. (2017) Sebastien Ehrhardt, Aron Monszpart, Niloy J Mitra, and Andrea Vedaldi. Learning a physical long-term predictor. arXiv:1703.00247, 2017.
 Eslami et al. (2016) SM Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, and Geoffrey E Hinton. Attend, infer, repeat: Fast scene understanding with generative models. In NeurIPS, 2016.
 Esmaeili et al. (2018) Babak Esmaeili, Hao Wu, Sarthak Jain, Siddharth Narayanaswamy, Brooks Paige, and Jan-Willem van de Meent. Hierarchical disentangled representations. arXiv:1804.02086, 2018.
 Finn et al. (2016) Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In NeurIPS, 2016.
 Flores-Mangas & Jepson (2013) Fernando Flores-Mangas and Allan D Jepson. Fast rigid motion segmentation via incrementally-complex local models. In CVPR, 2013.
 Fragkiadaki et al. (2016) Katerina Fragkiadaki, Pulkit Agrawal, Sergey Levine, and Jitendra Malik. Learning visual predictive models of physics for playing billiards. In ICLR, 2016.
 Gershman et al. (2016) Samuel J Gershman, Joshua B Tenenbaum, and Frank Jäkel. Discovering hierarchical motion structure. Vis. Res., 126:232–241, 2016.
 Greff et al. (2017) Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Neural expectation maximization. In NeurIPS, 2017.
 Grundmann et al. (2010) Matthias Grundmann, Vivek Kwatra, Mei Han, and Irfan Essa. Efficient hierarchical graph-based video segmentation. In CVPR, 2010.
 Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher P Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
 Higgins et al. (2018) Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matthew Botvinick, Demis Hassabis, and Alexander Lerchner. SCAN: Learning abstract hierarchical compositional visual concepts. In ICLR, 2018.
 Hinton & Van Camp (1993) Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In COLT, 1993.
 Hsu et al. (2017) Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In NeurIPS, 2017.
 Huang & Murphy (2015) Jonathan Huang and Kevin Murphy. Efficient inference in occlusion-aware generative models of images. In ICLR Workshop, 2015.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 Jain et al. (2014) Mihir Jain, Jan Van Gemert, Hervé Jégou, Patrick Bouthemy, and Cees GM Snoek. Action localization with tubelets from motion. In CVPR, 2014.
 Johansson (1973) Gunnar Johansson. Visual perception of biological motion and a model for its analysis. Perception & Psychophysics, 14(2):201–211, 1973.
 Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 Kipf et al. (2018) Thomas N Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard S Zemel. Neural relational inference for interacting systems. arXiv:1802.04687, 2018.
 Kitani et al. (2017) Kris M. Kitani, De-An Huang, and Wei-Chiu Ma. Activity forecasting. In Group and Crowd Behavior for Computer Vision, pp. 273–294. Elsevier, 2017. doi: 10.1016/b978-0-12-809276-7.00014-x.
 Koffka (2013) Kurt Koffka. Principles of Gestalt psychology. Routledge, 2013.
 Kulkarni et al. (2015a) Tejas D Kulkarni, Pushmeet Kohli, Joshua B Tenenbaum, and Vikash Mansinghka. Picture: A probabilistic programming language for scene perception. In CVPR, 2015a.
 Kulkarni et al. (2015b) Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In NeurIPS, 2015b.
 Kyriazis & Argyros (2013) Nikolaos Kyriazis and Antonis Argyros. Physically plausible 3d scene tracking: The single actor hypothesis. In CVPR, 2013.
 LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86(11):2278–2324, 1998.
 Lee et al. (2018) Jungbeom Lee, Jangho Lee, Sungmin Lee, and Sungroh Yoon. Msnet: Mutual suppression network for disentangled video representations. arXiv:1804.04810, 2018.
 Li et al. (2018) Yijun Li, Chen Fang, Jimei Yang, Zhaowen Wang, Xin Lu, and Ming-Hsuan Yang. Flow-grounded spatial-temporal video prediction from still images. In ECCV, 2018.
 Li & Mandt (2018) Yingzhen Li and Stephan Mandt. A deep generative model for disentangled representations of sequential data. arXiv:1803.02991, 2018.
 Liu (2009) Ce Liu. Beyond pixels: exploring new representations and applications for motion analysis. PhD thesis, Massachusetts Institute of Technology, 2009.
 Lotter et al. (2017) William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, 2017.
 Maas et al. (2013) Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
 Mathieu et al. (2016) Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
 Ochs et al. (2014) Peter Ochs, Jitendra Malik, and Thomas Brox. Segmentation of moving objects by long term video analysis. TPAMI, 2014.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NeurIPS Workshop, 2017.
 Pérez-Rúa et al. (2016) Juan-Manuel Pérez-Rúa, Tomas Crivelli, Patrick Pérez, and Patrick Bouthemy. Discovering motion hierarchies via tree-structured coding of trajectories. In BMVC, 2016.
 Rezende et al. (2016) Danilo Jimenez Rezende, SM Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In NeurIPS, 2016.
 Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
 Ross & Zemel (2006) David A Ross and Richard S Zemel. Learning parts-based representations of data. JMLR, 2006.
 Ross et al. (2010) David A Ross, Daniel Tarlow, and Richard S Zemel. Learning articulated structure and motion. IJCV, 2010.
 Salzmann & Urtasun (2011) Mathieu Salzmann and Raquel Urtasun. Physically-based motion models for 3d tracking: A convex formulation. In ICCV, 2011.
 Shao et al. (2014) Tianjia Shao, Aron Monszpart, Youyi Zheng, Bongjin Koo, Weiwei Xu, Kun Zhou, and Niloy J Mitra. Imagining the unseen: Stability-based cuboid arrangements for scene understanding. ACM TOG, 33(6), 2014.
 Spelke & Kinzler (2007) Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Dev. Sci., 10(1):89–96, 2007.
 van Steenkiste et al. (2018) Sjoerd van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. In ICLR, 2018.
 Vondrak et al. (2013) Marek Vondrak, Leonid Sigal, and Odest Chadwicke Jenkins. Dynamical simulation priors for human motion tracking. IEEE TPAMI, 35(1):52–65, 2013.
 Walker et al. (2016) Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV, 2016.
 Wang & Adelson (1993) John YA Wang and Edward H Adelson. Layered representation for motion analysis. In CVPR, 1993.
 Watters et al. (2017) Nicholas Watters, Andrea Tacchetti, Theophane Weber, Razvan Pascanu, Peter Battaglia, and Daniel Zoran. Visual interaction networks. In NeurIPS, 2017.
 Wu et al. (2017) Jiajun Wu, Erika Lu, Pushmeet Kohli, William T Freeman, and Joshua B Tenenbaum. Learning to see physics via visual deanimation. In NeurIPS, 2017.
 Xu et al. (2012) Chenliang Xu, Caiming Xiong, and Jason J Corso. Streaming hierarchical video segmentation. In ECCV, 2012.
 Xue et al. (2016) Tianfan Xue, Jiajun Wu, Katherine L Bouman, and William T Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NeurIPS, 2016.
Appendix A.1 More qualitative results
Appendix A.2 Motion Distribution of shape dataset
Figure A3 shows the motion distribution of each shape.
Appendix A.3 Additional Results of RNEM
As mentioned in the main paper, RNEM and our PSD model address complementary problems. RNEM learns to identify instances through temporal reasoning, using signals across the entire video to group pixels into objects. Our PSD model learns an appearance prior of objects: by watching how they move, it learns how object parts can be grouped based on their appearance, and it can therefore be applied to static images. Because the videos in our dataset have only two frames, temporal signals alone are often not enough to tell objects apart, which may explain the less compelling results from RNEM.
Here, we include a more systematic study to verify this. We train RNEM with three types of input: 1) a single frame; 2) two input frames shown repeatedly (the setup we used on our dataset, where videos have only two frames); and 3) longer videos with 20 sequential frames. Figure A4 and Table A1 show that results on the 20-frame input are substantially better than those on the other two. RNEM handles occluded objects when trajectories are long, so that each object appears without occlusion in at least one of the frames.
Table A1: RNEM results for the three input types.

Input      Circles  Squares  Triangles  Overall
1 frame    0.418    0.511    0.559      0.501
2 frames   0.513    0.552    0.612      0.558
20 frames  0.760    0.850    0.871      0.833
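
The three RNEM input settings compared above can be sketched as follows. This is only an illustrative sketch, not the code used for the experiments: the helper name `make_rnem_inputs`, the default `seq_len`, and the choice to repeat the single frame so that all three inputs have the same length are assumptions made here for illustration.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")  # any per-frame object, e.g. an image array


def make_rnem_inputs(
    video: Sequence[Frame], seq_len: int = 20
) -> Tuple[List[Frame], List[Frame], List[Frame]]:
    """Build the three input variants from one video (hypothetical helper).

    Returns (one_frame, two_frames, long_video), each of length seq_len.
    """
    if seq_len < 2 or len(video) < seq_len:
        raise ValueError("need at least seq_len >= 2 frames in the video")

    # 1) Only one frame: the first frame, repeated (repetition is an assumption).
    one_frame = [video[0]] * seq_len

    # 2) Two frames shown repeatedly: f0, f1, f0, f1, ...
    two_frames = [video[t % 2] for t in range(seq_len)]

    # 3) Longer video: the first seq_len sequential frames.
    long_video = list(video[:seq_len])

    return one_frame, two_frames, long_video


# Usage with placeholder frame labels instead of images.
frames = [f"f{t}" for t in range(20)]
one, two, many = make_rnem_inputs(frames, seq_len=20)
print(one[:4])   # first four entries of the single-frame input
print(two[:4])   # first four entries of the alternating two-frame input
print(many[:4])  # first four entries of the sequential input
```

Only the third variant exposes the model to new temporal evidence at every step, which is consistent with the gap between the 20-frame row and the other two rows of Table A1.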