Unsupervised Learning of Object Structure and Dynamics from Videos

by Matthias Minderer, et al.

Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning. To address this challenge, we adopt a keypoint-based image representation and learn a stochastic dynamics model of the keypoints. Future frames are reconstructed from the keypoints and a reference frame. By modeling dynamics in the keypoint coordinate space, we achieve stable learning and avoid compounding of errors in pixel space. Our method improves upon unstructured representations both for pixel-level video prediction and for downstream tasks requiring object-level understanding of motion dynamics. We evaluate our model on diverse datasets: a multi-agent sports dataset, the Human3.6M dataset, and datasets based on continuous control tasks from the DeepMind Control Suite. The spatially structured representation outperforms unstructured representations on a range of motion-related tasks such as object tracking, action recognition and reward prediction.





1 Introduction

Videos provide rich visual information to understand the dynamics of the world. However, extracting a useful representation from videos (e.g. detection and tracking of objects) remains challenging and typically requires expensive human annotations. In this work, we focus on unsupervised learning of object structure and dynamics from videos.

One approach for unsupervised video understanding is to learn to predict future frames Oh et al. (2015); Mathieu et al. (2016); Finn et al. (2016); Lotter et al. (2017); Villegas et al. (2017a); Xue et al. (2016); Denton and Fergus (2018); Babaeizadeh et al. (2018); Lee et al. (2018). Based on this body of work, we identify two main challenges: First, it is hard to make pixel-level predictions because motion in videos becomes highly stochastic for horizons beyond about a second. Since semantically insignificant deviations can lead to large error in pixel space, it is often difficult to distinguish good from bad predictions based on pixel losses. Second, even if good pixel-level prediction is achieved, this is rarely the desired final task. The representations of a model trained for pixel-level reconstruction are not guaranteed to be useful for downstream tasks such as tracking, motion prediction and control.

Here, we address both of these challenges by using an explicit, interpretable keypoint-based representation of object structure as the core of our model. Keypoints are a natural representation of dynamic objects, commonly used for face and pose tracking. Training keypoint detectors, however, generally requires supervision. We learn the keypoint-based representation directly from video, without any supervision beyond the pixel data, in two steps: first encode individual frames to keypoints, then model the dynamics of those points. As a result, the representation of the dynamics model is spatially structured, though the model is trained only with a pixel reconstruction loss. We show that enforcing spatial structure significantly improves video prediction quality and performance for tasks such as action recognition and reward prediction.

By decoupling pixel generation from dynamics prediction, we avoid compounding errors in pixel space because we never condition on predicted pixels. This approach has been shown to be beneficial for supervised video prediction Villegas et al. (2017b). Furthermore, modeling dynamics in keypoint coordinate space allows us to sample and evaluate predictions efficiently. Errors in coordinate space are more meaningful than in pixel space, since distance between keypoints is more closely related to semantically relevant differences than pixel-space distance. We exploit this by using a best-of-many-samples objective Bhattacharyya et al. (2018) during training to achieve stochastic predictions that are both highly diverse and of high quality, outperforming the predictions of models lacking spatial structure.

Finally, because we build spatial structure into our model a priori, its internal representation is biased to contain object-level information that is useful for downstream applications. This bias leads to better results on tasks such as trajectory prediction, action recognition and reward prediction.

Our contributions are: (1) a novel architecture and optimization techniques for unsupervised video prediction with a structured internal representation; (2) a model that outperforms recent work Denton and Fergus (2018); Wichers et al. (2018) and our unstructured baseline in pixel-level video prediction; (3) improved performance vs. unstructured models on downstream tasks requiring object-level understanding.

2 Related work

Unsupervised learning of keypoints. Previous work explores learning to find keypoints in an image by applying an autoencoding architecture with keypoint coordinates as a representational bottleneck Jakab et al. (2018); Zhang et al. (2018). The bottleneck forces the image to be encoded in a small number of points. We build on these methods by extending them to the video setting.

Stochastic sequence prediction. Successful video prediction requires modeling uncertainty. We adopt the VRNN Chung et al. (2015) architecture, which adds latent random variables to the standard RNN architecture, to sample from possible futures. More sophisticated approaches to stochastic prediction of keypoints have been recently explored Yan et al. (2018); Sun et al. (2019), but we find the basic VRNN architecture sufficient for our applications.

Unsupervised video prediction. A large body of work explores learning to predict video frames using only a pixel-reconstruction loss Ranzato et al. (2014); Srivastava et al. (2015); Oh et al. (2015); Finn et al. (2016); Villegas et al. (2017a); Denton and Birodkar (2017). Most similar to our work are approaches that perform deterministic image generation from a latent sample produced by stochastic sampling from a prior conditioned on previous timesteps Denton and Fergus (2018); Babaeizadeh et al. (2018); Lee et al. (2018). Our approach replaces the unstructured image representation with a structured set of keypoints, improving performance on video prediction and downstream tasks compared with SVG Denton and Fergus (2018) (Section 5).

Recent methods also apply adversarial training to improve prediction quality and diversity of samples Tulyakov et al. (2018); Lee et al. (2018). EPVA Wichers et al. (2018) predicts dynamics in a high-level feature space and applies an adversarial loss to the predicted features. We compare against EPVA and show improvement without adversarial training, but adversarial training is compatible with our method and is a promising future direction.

Video prediction with spatially structured representations. Like our approach, several recent methods explore explicit, spatially structured representations for video prediction. Vid2Vid Wang et al. (2018) proposed a video-to-video translation network from segmentation masks, edge masks and human pose. The method is also used for predicting a few frames into the future by predicting the structure representations first. Villegas et al. (2017b) proposed to train a human pose predictor and then use the predicted pose to generate future frames of human motion. Walker et al. (2018) predict future human pose using a stochastic network and then use the pose to generate future frames. Recent methods on video generation have used spatially structured representations for video motion transfer between humans Aberman et al. (2019); Chan et al. (2018). In contrast, our model is able to find a spatially structured representation without supervision while using video frames as the only learning signal.

3 Architecture

Our model is composed of two parts: a keypoint detector that encodes each frame into a low-dimensional, keypoint-based representation, and a dynamics model that predicts dynamics in the keypoint space (Figure 1).

3.1 Unsupervised keypoint detector

Figure 1: Architecture of our model. Variables are black, functions blue, losses red. Some arrows are omitted for clarity, see Equations 1 to 4 for details.

The keypoint detection architecture is inspired by Jakab et al. (2018), which we adapt for the video setting. Let $v_{1:T}$ be a video sequence of length $T$. Our goal is to learn a keypoint detector $\varphi^{\text{det}}(v_t) = x_t$ that captures the spatial structure of the objects in each frame in a set of keypoints $x_t$.

The detector $\varphi^{\text{det}}$ is a convolutional neural network that produces $K$ feature maps, one for each keypoint. Each feature map is normalized and condensed into a single $(x, y)$-coordinate by computing the spatial expectation of the map. The number of heatmaps $K$ is a hyperparameter that represents the maximum expected number of keypoints necessary to model the data.
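The spatial expectation step can be illustrated with a minimal sketch. It assumes a softplus to make activations non-negative and coordinates scaled to [-1, 1]; both choices, and the function name `keypoint_from_map`, are illustrative assumptions rather than details stated in this section.

```python
import math


def keypoint_from_map(feature_map):
    """Condense one 2-D feature map (list of rows) into an (x, y) coordinate.

    Assumptions of this sketch: softplus keeps activations non-negative,
    coordinates run from -1 (left/top) to 1 (right/bottom), and the map has
    at least two rows and two columns.
    """
    height, width = len(feature_map), len(feature_map[0])
    # Softplus: log(1 + exp(v)); fine for moderate activation values.
    act = [[math.log1p(math.exp(v)) for v in row] for row in feature_map]
    total = sum(sum(row) for row in act)
    x = y = 0.0
    for i, row in enumerate(act):
        for j, a in enumerate(row):
            w = a / total  # normalized weight; weights sum to 1 over the map
            x += w * (2.0 * j / (width - 1) - 1.0)
            y += w * (2.0 * i / (height - 1) - 1.0)
    return x, y


# Usage: a map with one strong peak at row 1, column 3 of a 5x5 grid.
fm = [[-20.0] * 5 for _ in range(5)]
fm[1][3] = 20.0
print(keypoint_from_map(fm))
```

Because the coordinate is an expectation rather than an argmax, it varies smoothly with the activations, which is what makes the bottleneck trainable by gradient descent.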

For image reconstruction, we learn a generator $\varphi^{\text{rec}}$ that reconstructs frame $v_t$ from its keypoint representation. The generator also receives the first frame of the sequence $v_1$ to capture the static appearance of the scene: $v_t = \varphi^{\text{rec}}(v_1, x_t)$. Together, the keypoint detector and generator form an autoencoder architecture with a representational bottleneck that forces the structure of each frame to be encoded in a keypoint representation Jakab et al. (2018).

The generator is also a convolutional neural network. To supply the keypoints to the network, each point is converted into a heatmap with a Gaussian-shaped blob at the keypoint. The heatmaps are concatenated with feature maps from the first frame $v_1$. We also concatenate the keypoint heatmaps for the first frame to the decoder input for subsequent frames $v_t$, to help the decoder to "inpaint" background regions that were occluded in the first frame. The resulting tensor forms the input to the generator. We add skip connections from the first frame of the sequence to the generator output such that the actual task of the generator is to predict $v_t - v_1$.


We use the mean intensity $\mu_k$ of each keypoint feature map returned by the detector as a continuous-valued indicator of the presence of the modeled object. When converting keypoints back into heatmaps, each map is scaled by the corresponding $\mu_k$. The model can use $\mu_k$ to encode the presence or absence of individual objects on a frame-by-frame basis.
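As an illustration of the keypoint-to-heatmap conversion, the sketch below renders one keypoint as a Gaussian blob scaled by its presence indicator. The map size, the blob width `sigma`, and the [-1, 1] coordinate convention are assumptions of this example, not values from the paper.

```python
import math


def render_heatmap(x, y, scale, size=16, sigma=0.1):
    """Render one keypoint as a size x size map with a Gaussian blob.

    x, y are coordinates in [-1, 1]; scale is the keypoint's presence
    indicator (mean feature-map intensity). size and sigma are illustrative.
    """
    grid = [2.0 * i / (size - 1) - 1.0 for i in range(size)]
    return [
        [scale * math.exp(-((gx - x) ** 2 + (gy - y) ** 2) / (2.0 * sigma ** 2))
         for gx in grid]      # columns correspond to x
        for gy in grid        # rows correspond to y
    ]


# Usage: a keypoint in the bottom-left corner with scale 0.5.
heatmap = render_heatmap(-1.0, 1.0, 0.5, size=5)
print(heatmap[4][0])
```

A keypoint with scale zero produces an all-zero map, so the generator receives no signal from inactive keypoints.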

3.2 Stochastic dynamics model

To model the dynamics in the video, we use a variational recurrent neural network (VRNN) Chung et al. (2015). The core of the dynamics model is a latent belief $z$ over keypoint locations $x$. In the VRNN architecture, the prior belief is conditioned on all previous timesteps through the hidden state $h_{t-1}$ of an RNN, and thus represents a prediction of the current keypoint locations before observing the image:

$$p(z_t \mid x_{<t}, z_{<t}) = \varphi^{\text{prior}}(h_{t-1}) \tag{1}$$

We obtain the posterior belief by combining the previous hidden state with the unsupervised keypoint coordinates $x_t$ detected in the current frame:

$$q(z_t \mid x_{\le t}, z_{<t}) = \varphi^{\text{enc}}(h_{t-1}, x_t) \tag{2}$$

Predictions are made by decoding the latent belief:

$$p(x_t \mid z_{\le t}, x_{<t}) = \varphi^{\text{dec}}(z_t, h_{t-1}) \tag{3}$$

Finally, the RNN is updated to pass information forward in time:

$$h_t = \varphi^{\text{RNN}}(x_t, z_t, h_{t-1}) \tag{4}$$

Note that to compute the posterior (Eq. 2), we obtain $x_t$ from the keypoint detector, but for the recurrence in Eq. 4, we obtain $x_t$ by decoding the latent belief. We can therefore predict into the future without observing images by decoding from the prior belief. Because the model has both deterministic and stochastic pathways across time, predictions can account for long-term dependencies as well as future uncertainty Hafner et al. (2019); Chung et al. (2015).
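The control flow described above (posterior belief while frames are observed, prior belief when predicting, and a recurrence driven by decoded keypoints) can be sketched as follows. The four callables are hypothetical stand-ins for the learned networks, and the belief is a scalar Gaussian purely to keep the example short.

```python
import random


def rollout(h0, observed_x, num_future, prior, encoder, decoder, rnn, rng):
    """Run a VRNN-style model over observed steps, then into the future.

    prior(h) and encoder(h, x) return (mean, std) of a scalar Gaussian belief,
    decoder(z, h) returns keypoint coordinates, and rnn(x, z, h) returns the
    next hidden state. All four are hypothetical placeholders.
    """
    h, decoded = h0, []
    steps = list(observed_x) + [None] * num_future
    for x_obs in steps:
        if x_obs is not None:
            mean, std = encoder(h, x_obs)  # posterior: uses detected keypoints
        else:
            mean, std = prior(h)           # prior: no image observed
        z = rng.gauss(mean, std)           # sample the latent belief
        x_dec = decoder(z, h)              # decode keypoint coordinates
        h = rnn(x_dec, z, h)               # recurrence uses the decoded x
        decoded.append(x_dec)
    return decoded, h


# Toy usage: a noise-free "constant velocity" model on a 1-D coordinate.
decoded, h_final = rollout(
    0.0, [0.0, 1.0], 2,
    prior=lambda h: (h, 0.0),
    encoder=lambda h, x: (x, 0.0),
    decoder=lambda z, h: z,
    rnn=lambda x, z, h: x + 1.0,
    rng=random.Random(0),
)
print(decoded)
```

In the toy usage the hidden state simply stores the expected next position, so the two future steps continue the observed motion without any further observations.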

4 Training

We train the keypoint detector and dynamics model simultaneously, but with separate losses.

4.1 Keypoint detector

The keypoint detector is trained with a simple L2 image reconstruction loss $\mathcal{L}_{\text{image}} = \sum_t \lVert v_t - \hat{v}_t \rVert_2^2$, where $v_t$ is the true and $\hat{v}_t$ is the reconstructed image. Errors from the dynamics model are not backpropagated into the keypoint detector, which is essential to prevent the model from trading image reconstruction quality for easy-to-predict keypoint dynamics.

Ideally, the representation should use as few keypoints as possible to encode each object. To encourage such parsimony, we add two additional losses to the keypoint detector:

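Temporal separation loss. Image features whose motion is highly correlated are likely to belong to the same object and should ideally be represented jointly by a single keypoint. We therefore add a separation loss that encourages keypoint trajectories to be decorrelated in time. The loss penalizes "overlap" between trajectories within a Gaussian radius $\sigma_{\text{sep}}$:

$$\mathcal{L}_{\text{sep}} = \sum_k \sum_{k'} \exp\left(-\frac{d_{kk'}}{2\sigma_{\text{sep}}^2}\right) \tag{5}$$

where $d_{kk'} = \frac{1}{T} \sum_t \lVert (x_{t,k} - \langle x_k \rangle) - (x_{t,k'} - \langle x_{k'} \rangle) \rVert_2^2$ is the distance between the trajectories of keypoints $k$ and $k'$, computed after subtracting the temporal mean $\langle x \rangle$ from each trajectory. $\lVert \cdot \rVert_2^2$ denotes the squared Euclidean norm.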

Keypoint sparsity loss. For similar reasons, we add an L1 penalty $\mathcal{L}_{\text{sparse}} = \sum_k |\mu_k|$ on the keypoint scales $\mu_k$ to encourage keypoints to be sparsely active. We found that training speed and stability were improved by choosing $K$ to be significantly larger than the expected number of objects, and then reducing the number of active keypoints with this penalty. Both the separation loss and the sparsity loss contribute to the stability and final performance of our model (Figure S9).
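A minimal sketch of the two regularizers described above. It assumes a Gaussian kernel applied to the time-averaged squared distance between mean-subtracted trajectories, and it sums over all keypoint pairs including each keypoint with itself (which only adds a constant). Whether the original implementation averages over time or excludes the diagonal is not stated here, so treat both as assumptions.

```python
import math


def separation_loss(trajectories, sigma):
    """trajectories[k][t] is the (x, y) position of keypoint k at time t."""
    centered = []
    for traj in trajectories:
        mean_x = sum(p[0] for p in traj) / len(traj)
        mean_y = sum(p[1] for p in traj) / len(traj)
        # Subtract the temporal mean so only the motion pattern remains.
        centered.append([(px - mean_x, py - mean_y) for px, py in traj])
    loss = 0.0
    for a in centered:
        for b in centered:
            # Time-averaged squared distance between the two motion patterns.
            d = sum((ax - bx) ** 2 + (ay - by) ** 2
                    for (ax, ay), (bx, by) in zip(a, b)) / len(a)
            loss += math.exp(-d / (2.0 * sigma ** 2))
    return loss


def sparsity_loss(scales):
    """L1 penalty on the per-keypoint scales."""
    return sum(abs(s) for s in scales)


# Two keypoints that move identically (offset only) overlap maximally.
same_motion = [[(0.0, 0.0), (1.0, 0.0)], [(5.0, 5.0), (6.0, 5.0)]]
opposite = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (-1.0, 0.0)]]
print(separation_loss(same_motion, 1.0), separation_loss(opposite, 1.0))
```

Keypoints with identical motion incur the largest penalty regardless of their spatial offset, which is what pushes correlated image features toward a single keypoint.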

4.2 Dynamics model

The standard VRNN Chung et al. (2015) is trained to encode the detected keypoints by maximizing the evidence lower bound (ELBO), which is composed of a reconstruction loss and a KL term between the Gaussian prior $p(z_t \mid x_{<t}, z_{<t})$ and posterior distribution $q(z_t \mid x_{\le t}, z_{<t})$:

$$\mathcal{L}_{\text{vrnn}} = -\sum_{t=1}^{T} \mathbb{E}\left[\log p(x_t \mid z_{\le t}, x_{<t})\right] + \beta \, \mathrm{KL}\left(q(z_t \mid x_{\le t}, z_{<t}) \,\|\, p(z_t \mid x_{<t}, z_{<t})\right) \tag{6}$$

The KL term regularizes the latent representation. In the VRNN architecture, it is also responsible for training the RNN, since it encourages the prior to predict the posterior based on past information. To balance reliance on predictions with fidelity to observations, we add the hyperparameter $\beta$ (see also Alemi et al. (2018)). We found it essential to tune $\beta$ for each dataset to achieve a balance between reconstruction quality (lower $\beta$) and prediction diversity.
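To make the trade-off concrete, here is a minimal per-timestep sketch of a KL-weighted objective with diagonal Gaussian beliefs. The weight is called `beta` here, and the squared-error reconstruction term (a unit-variance Gaussian likelihood up to a constant) is an assumption of this example.

```python
import math


def gaussian_kl(mean_q, std_q, mean_p, std_p):
    """KL(q || p) for diagonal Gaussians given as per-dimension lists."""
    kl = 0.0
    for mq, sq, mp, sp in zip(mean_q, std_q, mean_p, std_p):
        kl += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return kl


def step_loss(x_true, x_decoded, posterior, prior, beta):
    """Reconstruction error plus beta-weighted KL for one timestep.

    posterior and prior are (mean_list, std_list) pairs.
    """
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x_true, x_decoded))
    return recon + beta * gaussian_kl(*posterior, *prior)


# Usage: posterior N(1, 1) against prior N(0, 1) in one dimension.
print(step_loss([1.0], [0.0], ([1.0], [1.0]), ([0.0], [1.0]), beta=2.0))
```

A smaller `beta` lets the posterior drift further from the prior, which favors reconstruction of the observed keypoints; a larger `beta` pulls the posterior toward what the prior can already predict.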

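The KL term only trains the dynamics model for single-step predictions because the model receives observations after each step Hafner et al. (2019). To encourage learning of long-term dependencies, we add a pure reconstruction loss, without the KL term, for multiple future timesteps $T+1, \dots, T+\Delta T$:

$$\mathcal{L}_{\text{future}} = -\sum_{t=T+1}^{T+\Delta T} \mathbb{E}\left[\log p(x_t \mid z_{\le t}, x_{\le T})\right] \tag{7}$$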

The standard approach to estimate $\mathbb{E}\left[\log p(x_t \mid z_{\le t}, x_{<t})\right]$ in Eq. 6 and 7 is to sample a single $z_t$. To further encourage diverse predictions, we instead use the best of a number of samples Bhattacharyya et al. (2018) at each timestep during training:

$$\max_i \left[\log p(x_t \mid z_{i,\le t}, x_{<t})\right] \tag{8}$$

where $z_{i,t} \sim q(z_t \mid x_{\le t}, z_{<t})$ for observed steps and $z_{i,t} \sim p(z_t \mid x_{<t}, z_{<t})$ for predicted steps. By giving the model several chances to make a good prediction, it is encouraged to cover a range of likely data modes, rather than just the most likely. Sampling and evaluating several predictions at each timestep would be expensive in pixel space. However, since we learn the dynamics in the low-dimensional keypoint space, we can evaluate sampled predictions without reconstructing pixels and thus with relatively small computational cost. As shown in Section 5, the best-of-many objective is crucial to the performance of our model.
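The best-of-many idea can be sketched in a few lines: draw several predictions and keep only the error of the closest one. The sketch assumes squared error in keypoint space as the score (equivalent to log-likelihood under a fixed-variance Gaussian), and `sample_prediction` is a hypothetical callable standing in for sampling a latent and decoding it.

```python
import random


def best_of_many_error(x_true, sample_prediction, num_samples, rng):
    """Return the smallest squared error among num_samples sampled predictions."""
    best = float("inf")
    for _ in range(num_samples):
        x_pred = sample_prediction(rng)  # hypothetical: sample z, decode to x
        err = sum((a - b) ** 2 for a, b in zip(x_true, x_pred))
        best = min(best, err)
    return best


# Usage: a toy sampler that predicts a noisy 2-D keypoint around the origin.
def toy_sampler(rng):
    return [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]


print(best_of_many_error([0.5, -0.5], toy_sampler, 10, random.Random(0)))
```

Only the winning sample contributes to the loss, so the remaining samples are free to cover other plausible futures. Each evaluation touches a handful of coordinates rather than a full image, which is why this is cheap in keypoint space.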

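The combined loss of the whole model is:

$$\mathcal{L} = \mathcal{L}_{\text{image}} + \lambda_{\text{sep}} \mathcal{L}_{\text{sep}} + \lambda_{\text{sparse}} \mathcal{L}_{\text{sparse}} + \mathcal{L}_{\text{vrnn}} + \mathcal{L}_{\text{future}} \tag{9}$$

where $\lambda_{\text{sep}}$ and $\lambda_{\text{sparse}}$ are scale parameters for the keypoint separation and sparsity losses. See Section S1 for implementation details, including a list of hyperparameters and tuning ranges (Table S1).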

5 Results

We first show that the structured representation of our model improves prediction quality on two video datasets, then show that it is more useful than unstructured representations for downstream tasks that require object-level information.

5.1 Structured representation improves video prediction

(a) Basketball
(b) Human3.6M
Figure 2: Main datasets used in our experiments. First row shows ground truth images. In the second row, dots indicate the decoded coordinates (see Figure 1), lines indicate past trajectories. Third row shows the image reconstructed from the decoded coordinates. Green borders indicate observed frames, red indicate predicted frames.

We evaluate frame prediction on two video datasets (Figure 2). The Basketball dataset consists of a synthetic top-down view of a basketball court containing five offensive players and the ball, all drawn as colored dots. The videos are generated from real basketball player trajectories Zhan et al. (2019), testing the ability of our model to detect and stably represent multiple objects with complex dynamics. The dataset contains 107,146 training and 13,845 test sequences. The Human3.6M dataset Ionescu et al. (2014) contains video sequences of human actors performing various actions. We use subjects S1, S5, S6, S7, and S8 for training (600 videos), and subjects S9 and S11 for evaluation (239 videos). For both datasets, ground-truth object coordinates are available for evaluation, but are not used by the model. The Basketball dataset contains the coordinates of each of the 5 players and the ball. The Human3.6M dataset contains 32 motion capture points, of which we select 12 for evaluation.

We compare the full model (Struct-VRNN) to various baselines and ablations: the Struct-VRNN (no BoM) model was trained without the best-of-many objective; the Struct-RNN is deterministic; the CNN-VRNN architecture uses the same stochastic dynamics model as the Struct-VRNN, but uses an unstructured deep feature vector as its internal representation instead of structured keypoints. All models use for Basketball, and for Human3.6M, and were conditioned on 8 frames and trained to predict 8 future frames. Finally, we compare to two published models: SVG Denton and Fergus (2018) and EPVA Wichers et al. (2018) (Figure 3).

The Struct-VRNN model matches or outperforms the other models in perceptual image and video quality as measured by VGG Simonyan and Zisserman (2014) feature cosine similarity and Fréchet Video Distance Unterthiner et al. (2018) (Figure 3). Results for the lower-level metrics SSIM and PSNR are similar (see supplemental material). The difference is especially large for the Fréchet Video Distance, which correlates better than SSIM and PSNR with human judgment of video quality Unterthiner et al. (2018).
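For reference, the cosine similarity used in the first metric compares two feature vectors by the angle between them. The sketch below shows only that final step; extracting VGG features from the true and predicted frames is omitted, and the inputs are assumed to be flattened, non-zero feature vectors.

```python
import math


def cosine_similarity(features_a, features_b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(features_a, features_b))
    norm_a = math.sqrt(sum(a * a for a in features_a))
    norm_b = math.sqrt(sum(b * b for b in features_b))
    return dot / (norm_a * norm_b)


# Usage: vectors pointing the same way score close to 1 regardless of magnitude.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Because the score ignores vector magnitude, it reflects which features are active rather than how strongly, which makes it less sensitive to overall brightness or contrast than pixel-wise errors.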

The ablations suggest that the structured representation, the stochastic belief, and the best-of-many objective all contribute to model performance. The full Struct-VRNN model generates the best reconstructions of ground-truth, and also generates the most diverse samples (i.e., samples that are furthest from ground-truth; Figure 3 bottom left). In contrast, the ablated models and SVG show both lower best-case accuracy and smaller differences between closest and furthest samples, indicating less diverse samples. Qualitatively, Struct-VRNN exhibits sharper images and longer object permanence than the unstructured models (Figure 3, top; note limb detail and dynamics).

Figure 3: Video generation quality on Human3.6M. Our stochastic structured model (Struct-VRNN) outperforms our deterministic baseline (Struct-RNN), our unstructured baseline (CNN-VRNN), and the SVG model Denton and Fergus (2018). Top: Example input (green borders) and predicted (red borders) frames (best viewed as video: https://mjlm.github.io/video_structure/). Bottom left: Reconstruction accuracy under VGG cosine similarity for the closest-from-GT and furthest-from-GT of 100 samples. Higher is better. Plots show mean performance across 5 model initializations, with the 95% confidence interval shaded. Bottom right: Fréchet Video Distance Unterthiner et al. (2018), using all samples. Lower is better. Each dot represents a separate model initialization. EPVA Wichers et al. (2018) is not stochastic, so we compare performance with a single sample from our method on their test set.

5.2 The learned keypoints track objects

Figure 4: Prediction error for GT trajectories by linear regression from internal network features. (sup.) indicates supervised baseline.

We now examine how well the learned keypoints track the location of objects. Since we do not expect the keypoints to align exactly with human-labeled objects, we fit a linear regression from the keypoints to the ground-truth object positions and measure trajectory prediction error on held-out sequences (Figure 4). The trajectory error is the average distance between true and predicted coordinates at each timestep. To account for stochasticity, we sample 50 predictions and report the error of the best. For Human3.6M, we choose the best sample based on the average error of all coordinates. For Basketball, we choose the best sample separately for each player.
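The two ways of picking the best sample can be sketched as follows. The sketch assumes the sampled predictions have already been mapped to ground-truth coordinates (the linear regression step is omitted), and the function names are illustrative.

```python
import math


def per_object_errors(true_traj, pred_traj):
    """Mean Euclidean distance per object; trajectories are [t][k] -> (x, y)."""
    num_objects = len(true_traj[0])
    errors = [0.0] * num_objects
    for true_t, pred_t in zip(true_traj, pred_traj):
        for k in range(num_objects):
            errors[k] += math.dist(true_t[k], pred_t[k])
    return [e / len(true_traj) for e in errors]


def best_joint(true_traj, samples):
    """Pick one best sample by its average error over all objects."""
    return min(sum(e) / len(e)
               for e in (per_object_errors(true_traj, s) for s in samples))


def best_per_object(true_traj, samples):
    """Pick the best sample separately for each object, then average."""
    per_sample = [per_object_errors(true_traj, s) for s in samples]
    best = [min(column) for column in zip(*per_sample)]
    return sum(best) / len(best)


# Usage: two objects, one timestep; each sample gets a different object right.
truth = [[(0.0, 0.0), (0.0, 0.0)]]
samples = [[[(0.0, 0.0), (3.0, 4.0)]], [[(3.0, 4.0), (0.0, 0.0)]]]
print(best_joint(truth, samples), best_per_object(truth, samples))
```

Per-object selection can only report an error that is equal to or lower than joint selection, so the two numbers are not directly comparable across datasets.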

Figure 5: Unsupervised keypoints allow human-guided exploration of object dynamics. We manipulated the observed coordinates for Player 1 (arrow) to change the original (blue) trajectory. The other players were not manipulated. The dynamics were then rolled out into the future to predict how the players will behave in the manipulated (red) scenario. Light-colored parts of the trajectories are observed, dark-colored parts are predicted. Dots indicate final position. Lines of the same color indicate different samples conditioned on the same observed coordinates.
Figure 6: Keypoints learned by our method may be manipulated to change people’s poses. Note that both manipulations and effects are spatially local. Best viewed in video (https://mjlm.github.io/video_structure/).

As a baseline, we train Struct-VRNN and CNN-VRNN models with additional supervision that forces the learned keypoints to match the GT keypoints. The keypoints learned by the unsupervised Struct-VRNN model are nearly as predictive as those trained with supervision, indicating that the learned keypoints represent useful spatial information. In contrast, prediction from the internal representation of the unsupervised CNN-VRNN is poor. When trained with supervision, however, the CNN-VRNN reaches performance similar to that of the supervised Struct-VRNN. In other words, both the Struct-VRNN and the CNN-VRNN can learn a spatial internal representation, but the Struct-VRNN learns it without supervision.

As expected, the less diverse predictions of the Struct-VRNN (no BoM) and Struct-RNN perform worse on the coordinate regression task. Finally, for comparison, we remove the dynamics model entirely and simply predict the last observed keypoint locations for all future timepoints. All models except unsupervised CNN-VRNN outperform this baseline.

Note that for both video prediction (Figure 3) and trajectory prediction error (Figure 4), the variability between model initializations is low for the Struct-VRNN compared to the baselines. We found the keypoint separation and sparsity losses to contribute to this stability (Figure S9).

5.3 Manipulation of keypoints allows interaction with the model

Since the learned keypoints track objects, the model’s predictions can be intuitively manipulated by directly adjusting the keypoints.

On the Basketball dataset, we can explore counterfactual scenarios such as predicting how the other players react if one player moves left as opposed to right (Figure 5). We simply manipulate the observed keypoint locations before they are passed to the RNN, thus conditioning the predictions on the manipulated observations.

For the Human3.6M dataset, we can independently manipulate body parts and generate poses that are not present in the training set (Figure 6; please see https://mjlm.github.io/video_structure/ for videos). The model learns to associate keypoints with local areas of the body, such that moving keypoints near an arm moves the arm without changing the rest of the image.

5.4 Structured representation retains more semantic information

Figure 7: Action recognition on the Human3.6M dataset. Solid line: null model (predict the most frequent action). Dashed line: prediction from ground-truth coordinates. Sup., supervised.
Figure 8: Predicting rewards on the DeepMind Control Suite continuous control domains. We chose domains with dense rewards to ensure the random policy would provide a sufficient reward signal for this analysis. To make scales comparable across domains, errors are normalized to a null model which predicts the mean training-set-reward at all timesteps. Lines show the mean across test-set examples and 5 random model initializations, with the 95% confidence interval shaded.

The learned keypoints are also useful for downstream tasks such as action recognition and reward prediction in reinforcement learning.

To test action recognition performance, we train a simple 3-layer RNN to classify Human3.6M actions from a sequence of keypoints (see Section S2.1 for model details).

The keypoints learned by the structured models perform better than the unstructured features learned by the CNN-VRNN (Figure 7). Future prediction is not needed, so the RNN and VRNN models perform similarly.

One major application we anticipate for our model is planning and reinforcement learning of spatially defined tasks. As a first step, we trained our model on a dataset collected from six tasks in the DeepMind Control Suite (DMCS), a set of simulated continuous control environments (Figure 8). Image observations and rewards were collected from the DMCS environments using random actions, and we modified our model to condition predictions on the agent’s actions by feeding the actions as an additional input to the RNN. Models were trained without access to the task reward function. We used the latent state of the dynamics model as an input to a separate reward prediction model for each task (see Section S2.2 for details). The dynamics learned by the Struct-VRNN give better reward prediction performance than the unstructured CNN-VRNN baseline, suggesting our architecture may be a useful addition to planning and reinforcement learning models.
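The normalization mentioned in the Figure 8 caption (errors relative to a null model that always predicts the mean training-set reward) can be sketched as below. Squared error is an assumption of this example; the section does not state which error measure is used.

```python
def normalized_reward_error(predicted, target, train_rewards):
    """Model error divided by the error of a constant mean-reward predictor.

    Assumes squared error and that the test rewards are not all equal to the
    training mean (otherwise the null error is zero).
    """
    null_prediction = sum(train_rewards) / len(train_rewards)
    model_error = sum((p - t) ** 2 for p, t in zip(predicted, target))
    null_error = sum((null_prediction - t) ** 2 for t in target)
    return model_error / null_error


# Usage: values below 1 mean the model beats the constant baseline.
print(normalized_reward_error([0.0, 1.0], [0.0, 2.0], [1.0, 1.0]))
```

Dividing by the null-model error puts domains with very different reward scales on a common axis, where 1 corresponds to no improvement over predicting the mean.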

6 Discussion

A major question in machine learning is to what degree prior knowledge should be built into a model, as opposed to learning it from the data. This question is especially important for unsupervised vision models trained on raw pixels, which are typically far removed from the information that is of interest for downstream tasks. We propose a model with a spatial inductive bias, resulting in a structured, keypoint-based internal representation. We show that this structure leads to superior results on downstream tasks compared to a representation derived from a CNN without a keypoint-based representational bottleneck.

The proposed spatial prior using keypoints represents a middle ground between unstructured representations and an explicitly object-centric approach. For example, we do not explicitly model object masks, occlusions, or depth. Our architecture either leaves these phenomena unmodeled, or learns them from the data. By choosing not to build this kind of structure into the architecture, we keep our model simple and achieve stable training (see variability across initializations in Figures 3, 4, and S9) on diverse datasets, including multiple objects and complex, articulated human shapes.

Because of its simplicity, our architecture is straightforward to combine with existing architectures that may benefit from spatial structure, such as planning and reinforcement learning for control tasks. Applying our model to such tasks is an important future direction of this work.


  • K. Aberman, R. Wu, D. Lischinski, B. Chen, and D. Cohen-Or (2019) Learning Character-Agnostic Motion for Motion Retargeting in 2D. In SIGGRAPH, Cited by: §2.
  • A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy (2018) Fixing a Broken ELBO. In ICML, Cited by: §4.2.
  • M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine (2018) Stochastic variational video prediction. In ICLR, Cited by: §1, §2.
  • S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer (2015) Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. In NeurIPS, Cited by: §S1.3.
  • A. Bhattacharyya, B. Schiele, and M. Fritz (2018) Accurate and Diverse Sampling of Sequences based on a "Best of Many" Sample Objective. In CVPR, Cited by: §1, §4.2.
  • S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio (2016) Generating Sentences from a Continuous Space. In CONLL, Cited by: §S1.2.
  • C. Chan, S. Ginosar, T. Zhou, and A. A. Efros (2018) Everybody Dance Now. In CoRR, Vol. abs/1808.07371. Cited by: §2.
  • J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, and Y. Bengio (2015) A Recurrent Latent Variable Model for Sequential Data. In NeurIPS, Cited by: §2, §3.2, §4.2.
  • E. Denton and V. Birodkar (2017) Unsupervised Learning of Disentangled Representations from Video. In NeurIPS, Cited by: §2.
  • E. Denton and R. Fergus (2018) Stochastic Video Generation with a Learned Prior. In ICML, Cited by: Figure S11, §1, §1, §2, Figure 3, §5.1.
  • C. Finn, I. Goodfellow, and S. Levine (2016) Unsupervised learning for physical interaction through video prediction. In NeurIPS, Cited by: §1, §2.
  • D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley (2017) Google vizier: a service for black-box optimization. In KDD, Cited by: §S1.4.
  • D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2019) Learning Latent Dynamics for Planning from Pixels. In ICML, Cited by: §3.2, §4.2.
  • C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2014) Human3.6m: large scale datasets and predictive methods for 3d human sensing in natural environments. In PAMI, Cited by: §5.1.
  • T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi (2018) Conditional Image Generation for Learning the Structure of Visual Objects. In NeurIPS, Cited by: §S1.1.1, §2, §3.1, §3.1.
  • D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In ICLR, Cited by: §S1.2.
  • A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine (2018) Stochastic Adversarial Video Prediction. In CoRR, Vol. abs/1804.01523. Cited by: §1, §2, §2.
  • W. Lotter, G. Kreiman, and D. Cox (2017) Deep predictive coding networks for video prediction and unsupervised learning. In ICLR, Cited by: §1.
  • M. Mathieu, C. Couprie, and Y. LeCun (2016) Deep multi-scale video prediction beyond mean square error. In ICLR, Cited by: §1.
  • J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh (2015) Action-conditional video prediction using deep networks in atari games. In NeurIPS, Cited by: §1, §2.
  • M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra (2014) Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint:1412.6604. Cited by: §2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. In CoRR, Cited by: §5.1.
  • N. Srivastava, E. Mansimov, and R. Salakhutdinov (2015) Unsupervised Learning of Video Representations using LSTMs. In ICML, Cited by: §2.
  • C. Sun, P. Karlsson, J. Wu, J. B. Tenenbaum, and K. Murphy (2019) Stochastic Prediction of Multi-Agent Interactions From Partial Observations. In ICLR, Cited by: §2.
  • Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller (2018) DeepMind Control Suite. In arXiv: 1801.00690, Cited by: §S2.2.
  • S. Tulyakov, M. Liu, X. Yang, and J. Kautz (2018) MoCoGAN: Decomposing motion and content for video generation. In CVPR, Cited by: §2.
  • T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly (2018) Towards Accurate Generative Models of Video: A New Metric & Challenges. In CoRR, Cited by: Figure S11, Figure 3, §5.1.
  • R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee (2017a) Decomposing Motion and Content for Natural Video Sequence Prediction. In ICLR, Cited by: §1, §2.
  • R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee (2017b) Learning to Generate Long-term Future via Hierarchical Prediction. In ICML, Cited by: §1, §2.
  • J. Walker, K. Marino, A. Gupta, and M. Hebert (2018) The Pose Knows: Video Forecasting by Generating Pose Futures. In NeurIPS, Cited by: §2.
  • T. Wang, M. Liu, J. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro (2018) Video-to-Video Synthesis. In NeurIPS, Cited by: §2.
  • N. Wichers, R. Villegas, D. Erhan, and H. Lee (2018) Hierarchical Long-term Video Prediction without Supervision. In ICML, Cited by: §1, §2, Figure 3, §5.1.
  • T. Xue, J. Wu, K. Bouman, and B. Freeman (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In NeurIPS, Cited by: §1.
  • X. Yan, A. Rastogi, R. Villegas, K. Sunkavalli, E. Shechtman, S. Hadap, E. Yumer, and H. Lee (2018) MT-VAE: learning motion transformations to generate multimodal human dynamics. In ECCV, Cited by: §2.
  • E. Zhan, S. Zheng, Y. Yue, L. Sha, and P. Lucey (2019) Generating Multi-Agent Trajectories using Programmatic Weak Supervision. In ICLR, Cited by: §5.1.
  • Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee (2018) Unsupervised Discovery of Object Landmarks as Structural Representations. In CVPR, Cited by: §S1.1.1, §2.

Appendix S1 Model implementation details

S1.1 Architecture

S1.1.1 Keypoint detector

Our goal is to encode the input image in terms of the locations of objects in the image Jakab et al. [2018], Zhang et al. [2018]. To this end, we use a keypoint detector network (Figure 1, bottom) that consists of $K$ keypoint detectors. Ideally, each detector learns to detect a distinct object or object part. The detector network is implemented as a series of convolutional layers with stride 2 that reduce the input image into a stack of keypoint detection score maps $\mathcal{R}$ with $K$ channels, one per keypoint. We use the softplus function on the activations of the final layer to ensure the maps are positive.

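The raw maps $\mathcal{R}$ are normalized to obtain detection weight maps $\mathcal{D}$,

$$\mathcal{D}_k(u, v) = \frac{\mathcal{R}_k(u, v)}{\sum_{u', v'} \mathcal{R}_k(u', v')},$$

where $\mathcal{R}_k(u, v)$ is the value of the $k$-th channel of $\mathcal{R}$ at pixel $(u, v)$. We then reduce each $\mathcal{D}_k$ to a single $(x, y)$-coordinate by computing the weighted mean over pixel coordinates:

$$(x_k, y_k) = \sum_{u, v} \mathcal{D}_k(u, v) \, (u, v).$$

To model keypoint presence or absence, we compute the mean value of the raw detection score maps,

$$\mu_k = \frac{1}{H' W'} \sum_{u, v} \mathcal{R}_k(u, v),$$

where $H' \times W'$ is the spatial size of the score maps. In summary, each keypoint is represented by an $(x, y, \mu)$-triplet encoding its location in the image and its scale.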
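The reduction from score maps to keypoint triplets described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names, the use of pixel-index coordinates, and the toy input are assumptions made for the example.

```python
import numpy as np

def softplus(a):
    # Numerically stable softplus; keeps the detection scores positive.
    return np.logaddexp(0.0, a)

def maps_to_keypoints(score_maps):
    """Reduce raw score maps of shape (H, W, K) to K (x, y, mu) triplets."""
    height, width, _ = score_maps.shape
    # Pixel-index coordinates (an assumed convention for this sketch).
    xs = np.arange(width, dtype=float)
    ys = np.arange(height, dtype=float)
    # Normalize each channel so that its weights sum to one.
    weights = score_maps / score_maps.sum(axis=(0, 1), keepdims=True)
    # The weighted mean over pixel coordinates gives one location per keypoint.
    x = np.einsum("hwk,w->k", weights, xs)
    y = np.einsum("hwk,h->k", weights, ys)
    # The mean raw score models keypoint presence or absence.
    mu = score_maps.mean(axis=(0, 1))
    return np.stack([x, y, mu], axis=-1)

# Toy input: weak background activations plus one strong response
# for keypoint 0 at row 4, column 12.
rng = np.random.default_rng(0)
activations = rng.normal(size=(16, 16, 3)) - 8.0
activations[4, 12, 0] = 20.0
keypoints = maps_to_keypoints(softplus(activations))
```

In this toy input the first keypoint lands close to column 12 and row 4, and its mean score is much larger than that of the two channels without a strong response.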

For image reconstruction, each keypoint is converted back into a pixel representation by creating a map $\mathcal{G}_k$ containing a Gaussian blob with standard deviation $\sigma$ at the location $(x_k, y_k)$ of the keypoint, scaled by its mean detection score $\mu_k$:

$$\mathcal{G}_k(u, v) = \mu_k \exp\left(-\frac{\lVert (u, v) - (x_k, y_k) \rVert^2}{2\sigma^2}\right).$$

The map $\mathcal{G}_k$ contains the same information as the keypoint tuple $(x_k, y_k, \mu_k)$, but in a pixel representation that is suitable as input to the convolutional reconstructor network. The image is reconstructed by passing the keypoint maps, concatenated channel-wise with image features extracted from the first frame, through the reconstructor. The reconstructor applies alternating convolutional layers and twofold bilinear upsampling until the maps are expanded to the original image resolution. The image features, which capture appearance information of the scene, are extracted by a network with the same architecture as the keypoint detector (except for the final softmax nonlinearity).
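The conversion of keypoints into Gaussian maps and the channel-wise concatenation with appearance features described above can be illustrated as follows. This is a minimal sketch under stated assumptions: coordinates are pixel indices, the blob width of 1.5 pixels is taken from Table S1, and the appearance features are a zero-filled placeholder for the output of the feature network.

```python
import numpy as np

def keypoints_to_maps(keypoints, height, width, sigma):
    """Render K (x, y, mu) triplets as Gaussian blobs of shape (H, W, K)."""
    xs = np.arange(width, dtype=float)[None, :, None]    # shape (1, W, 1)
    ys = np.arange(height, dtype=float)[:, None, None]   # shape (H, 1, 1)
    x, y, mu = keypoints[:, 0], keypoints[:, 1], keypoints[:, 2]
    # Squared distance of every pixel to every keypoint, shape (H, W, K).
    sq_dist = (xs - x) ** 2 + (ys - y) ** 2
    # Gaussian blob at the keypoint location, scaled by mu.
    return mu * np.exp(-sq_dist / (2.0 * sigma ** 2))

# Two hypothetical keypoints: one present (mu = 0.8), one absent (mu = 0).
keypoints = np.array([[12.0, 4.0, 0.8],
                      [3.0, 10.0, 0.0]])
maps = keypoints_to_maps(keypoints, height=16, width=16, sigma=1.5)

# Placeholder for appearance features of the first frame (hypothetical size).
appearance = np.zeros((16, 16, 8))
reconstructor_input = np.concatenate([maps, appearance], axis=-1)
```

A keypoint with a mean score of zero produces an empty map, which is how absence is conveyed to the reconstructor in this sketch.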

The internal layers of the convolutional encoder and decoder are connected through leaky rectified linear units. L2 weight decay is applied to all convolutional kernels. To increase model capacity, we add one (for Basketball and DMCS) or two (for Human3.6M) additional size-preserving (stride 1) convolutional layers at each resolution scale of the detector and reconstructor. The image resolution is 64 × 64 pixels.

S1.1.2 Dynamics model

The dynamics model (Figure 1, top) has the following components:

The prior network consists of a dense layer with ReLU activation functions (for number of units, see Prior net size in Table S1), followed by a dense layer that projects the activations to the mean and standard deviation that parameterize the prior latent distribution.

The encoder network consists of a dense layer with 128 units and ReLU activation functions, followed by a dense layer that projects the activations to the mean and standard deviation that parameterize the posterior latent distribution.

The decoder network consists of a dense layer with 128 units and ReLU activation functions, followed by a dense layer that projects the activations to the linearized keypoint vector of length $3K$ (containing the $x$, $y$ and $\mu$ components of the $K$ keypoints). The latent code passed to the decoder is sampled from the posterior for observed steps and from the prior for predicted steps.

The recurrent network consists of a GRU layer with 512 units.

For the action-conditional model used for reward prediction (Figure 8), the input to the recurrent network additionally includes the vector of random actions used to generate the corresponding frame of the DeepMind Control Suite dataset.

The sizes of the prior network and of the latent representation were optimized as hyperparameters (see Table S1).
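To make the interplay of the four components concrete, the following NumPy sketch runs one timestep of such a dynamics model. It is not the authors' code. The wiring is an assumption based on the usual variational RNN layout (prior from the previous hidden state, posterior from hidden state and current keypoints, decoder from latent code and hidden state, GRU over keypoints and latent code), the softplus used for the standard deviation is an assumption, and random weights stand in for trained parameters. Layer sizes follow the text and the Basketball column of Table S1.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    # Random weights and zero biases stand in for trained parameters.
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def relu(a):
    return np.maximum(a, 0.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class GaussianNet:
    """Dense ReLU layer, then a projection to a mean and a standard deviation."""
    def __init__(self, n_in, n_hidden, n_latent):
        self.w1, self.b1 = dense(n_in, n_hidden)
        self.w2, self.b2 = dense(n_hidden, 2 * n_latent)

    def __call__(self, inputs):
        out = relu(inputs @ self.w1 + self.b1) @ self.w2 + self.b2
        mean, raw_std = np.split(out, 2)
        return mean, np.logaddexp(0.0, raw_std)  # softplus (assumed) keeps std positive

class GRUCell:
    def __init__(self, n_in, n_hidden):
        self.wz, self.bz = dense(n_in + n_hidden, n_hidden)
        self.wr, self.br = dense(n_in + n_hidden, n_hidden)
        self.wh, self.bh = dense(n_in + n_hidden, n_hidden)

    def __call__(self, inputs, h):
        xh = np.concatenate([inputs, h])
        z = sigmoid(xh @ self.wz + self.bz)  # update gate
        r = sigmoid(xh @ self.wr + self.br)  # reset gate
        cand = np.tanh(np.concatenate([inputs, r * h]) @ self.wh + self.bh)
        return (1.0 - z) * h + z * cand

K, LATENT, HIDDEN = 12, 16, 512
prior = GaussianNet(HIDDEN, 16, LATENT)
posterior = GaussianNet(HIDDEN + 3 * K, 128, LATENT)
dec_w1, dec_b1 = dense(LATENT + HIDDEN, 128)
dec_w2, dec_b2 = dense(128, 3 * K)
gru = GRUCell(3 * K + LATENT, HIDDEN)

def step(h, observed_keypoints=None):
    """One timestep: pass keypoints for observed steps, None for predicted steps."""
    if observed_keypoints is not None:
        mean, std = posterior(np.concatenate([h, observed_keypoints]))
    else:
        mean, std = prior(h)
    z = mean + std * rng.normal(size=mean.shape)  # reparameterized sample
    decoded = relu(np.concatenate([z, h]) @ dec_w1 + dec_b1) @ dec_w2 + dec_b2
    rnn_keypoints = decoded if observed_keypoints is None else observed_keypoints
    return gru(np.concatenate([rnn_keypoints, z]), h), decoded

# Eight observed steps (random stand-in keypoints), then eight predicted steps.
h = np.zeros(HIDDEN)
for _ in range(8):
    h, _ = step(h, rng.normal(size=3 * K))
predictions = []
for _ in range(8):
    h, predicted_keypoints = step(h)
    predictions.append(predicted_keypoints)
```

An action-conditional variant would, under the same assumptions, append the action vector to the GRU input.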

S1.2 Optimization

We used the ADAM optimizer Kingma and Ba [2015] and trained on batches of size 32. The learning rate was halved at regular intervals during training. We used L2 weight decay on the weights of the convolutional layers in the image encoder and decoder. Weights were initialized using the "He uniform" method as implemented by Keras. Models were trained on a single Nvidia P100 GPU. Training took approximately 12 hours.

During training, we linearly annealed the KL loss scale up to its final value over the initial training steps, as in Bowman et al. [2016].
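The two schedules described above, a learning rate that is halved at fixed intervals and a KL loss scale that is annealed linearly, can be written as small helper functions. This is a sketch only: the function names are hypothetical, and the numeric values in the usage lines are placeholders rather than the settings used in the paper.

```python
def halved_learning_rate(step, initial_rate, halve_every):
    """Learning rate that is cut in half every `halve_every` steps."""
    return initial_rate * 0.5 ** (step // halve_every)

def annealed_kl_scale(step, start_scale, final_scale, anneal_steps):
    """KL loss scale that changes linearly during the first `anneal_steps` steps."""
    fraction = min(step / anneal_steps, 1.0)
    return start_scale + fraction * (final_scale - start_scale)

# Placeholder values, chosen only to illustrate the shape of the schedules.
learning_rate = halved_learning_rate(step=2500, initial_rate=1e-3, halve_every=1000)
kl_scale = annealed_kl_scale(step=500, start_scale=0.0, final_scale=0.01, anneal_steps=1000)
```

After the annealing period the KL scale stays at its final value, while the learning rate keeps decaying in steps.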

S1.3 Scheduled sampling

When training an RNN for many timesteps, the initially large errors compound over time, leading to slow learning. Therefore, during training, we initially supplied the observed keypoint coordinates as input to the RNN, instead of the RNN’s own predictions. This is similar to teacher forcing, although we note that we used the output of the unsupervised keypoint detector, rather than the ground truth.

We found that teacher forcing causes the model to make more dynamic predictions which are qualitatively realistic, but may have poor error metrics because of the mismatch between the training and test distributions. We therefore gradually switched to using samples from the model over the course of training (scheduled sampling, Bengio et al. [2015]). We linearly increased the probability of choosing samples from the model to a final value over the course of training, using separate final probabilities for the observed and the predicted timesteps.
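The scheduled sampling procedure can be sketched as a probability schedule plus a per-step choice between the detector output and the model's own sample. The names below are hypothetical, the starting probability of zero is an assumption, and the final probability in the usage lines is a placeholder.

```python
import numpy as np

def sampling_probability(step, total_steps, final_probability, initial_probability=0.0):
    """Probability of feeding the model's own sample, increased linearly over training."""
    fraction = min(step / total_steps, 1.0)
    return initial_probability + fraction * (final_probability - initial_probability)

def choose_rnn_input(detected_keypoints, model_sample, probability, rng):
    """Return the model sample with the given probability, else the detector output."""
    use_sample = rng.random() < probability
    return model_sample if use_sample else detected_keypoints

rng = np.random.default_rng(0)
detected = np.zeros(36)  # stand-in for keypoints from the unsupervised detector
sampled = np.ones(36)    # stand-in for keypoints sampled from the dynamics model
p = sampling_probability(step=50, total_steps=100, final_probability=0.5)
rnn_input = choose_rnn_input(detected, sampled, p, rng)
```

Separate final probabilities for observed and predicted timesteps can be obtained by calling `sampling_probability` with a different `final_probability` for each group of timesteps.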

S1.4 Hyperparameter optimization

We used a black-box optimization tool based on Gaussian process bandits Golovin et al. [2017] to tune several of the hyperparameters of our model. See Table S1 for parameters and their tuning ranges.

Parameter name Symbol Tuning range Basketball Human3.6M DMCS
Batch size - 32 32 32
Init. learning rate -
Input steps - 8 8 8
Predicted steps 8 8 8
Num. keypoints varied 12 48 64
Keypoint sparsity scale 5
Separation loss scale
Separation loss width
Keypoint blob width (pix) 1.5 1.5 1.5
Latent code size 16 16 128
KL loss scale
Prior net size 16 4 16
Posterior net size - 128 128 128
Num. RNN units 512 512 512
Num. samp. for BoM loss 50 50 50
Table S1: Hyperparameters

Appendix S2 Experimental details

S2.1 Human3.6M action recognition

To understand how much semantically useful information the representations of our models contain, we predicted the actions performed in the Human3.6M dataset from the model representations (Figure 7). We first used trained models to extract keypoints (or unstructured image representations) for sequences of 8 observed steps from the Human3.6M test set. These keypoint sequences represented the dataset used for action recognition. The action recognition training set comprised 881 sequences, the test set 279 sequences. We ensured that no test sequences came from the same original videos as those used in the training set.

We then trained a separate recurrent neural network to classify each sequence into one of the 15 action categories (Walking, Sitting, Eating, Discussions, …) in the Human3.6M dataset. No categories were excluded. The network consisted of two GRU layers (128 units), followed by a dense layer (15 units) and a softmax layer. We used 25% dropout after each GRU layer. The model was trained for 100 epochs using the ADAM optimizer with a starting learning rate of 0.01 that was successively reduced to 0.0001. We report the mean action recognition accuracy (fraction correct) on the 279 test sequences.

S2.2 DeepMind Control Suite reward prediction

To explore whether the structured representation learned by our model may be useful for planning, we used our model to predict rewards in DeepMind Control Suite Tassa et al. [2018] continuous control tasks (Figure 8). We chose tasks that have dense rewards and thus provide a strong signal for evaluation (Acrobot Swingup, Cartpole Balance, Cheetah Run, Reacher Easy, Walker Stand, Walker Walk).

We generated a dataset based on DeepMind Control Suite (DMCS) continuous control tasks by performing random actions and recording 64 by 64 pixel observations, the actions, and the rewards. We then trained our model variants on this dataset. Importantly, we trained a single model on data from all domains, to test the generality of our approach. We modified our models to be action-conditional by passing the vector of actions as an additional input to the RNN at each timestep.

To predict rewards, we used the RNN hidden state of our models as a representation of the dynamics learned by the model. We first collected the hidden states of the trained models for 10,000 length-20 sequences from the test split for each of the six domains in our DMCS dataset. We then trained a separate, smaller reward prediction model to predict rewards for each of the six domains. The reward prediction models took the sequence of RNN hidden states as input and returned a sequence of scalar reward values as output. The model consisted of a fully connected layer (128 units), two GRU layers (128 units) and a dense layer (1 unit), all connected through rectified linear units. The reward prediction model was trained on 80% of the data with the ADAM optimizer with a starting learning rate of 0.001 that was successively reduced to 0.0001. We report the mean squared error of the predicted reward on the remaining 20% of the data.

Figure S9: Ablating either the temporal separation loss or the keypoint sparsity loss reduces model performance and stability. Plots show the coordinate prediction error when regressing the ground-truth object coordinates on the discovered keypoints. Lines show the mean of five model initializations, with the 95% confidence intervals shaded.
Figure S10: Additional video metrics on Human3.6M: structural similarity (SSIM) and peak signal-to-noise ratio (PSNR). The models were conditioned on 8 frames and trained to predict 8 future frames. Higher is better (closer to ground truth). Top row shows the mean across all test-set examples of the closest-to-GT of 100 stochastic samples, bottom shows the furthest. Lines show the mean across 5 random model initializations, with the 95% confidence interval shaded.
Figure S11: Video generation quality on Basketball. The models were conditioned on 8 frames and trained to predict 8 future frames. Our stochastic structured model (Struct-VRNN) outperforms our deterministic baseline (Struct-RNN), our unstructured baseline (CNN-VRNN), and the SVG model Denton and Fergus [2018] in the FVD metric and qualitatively. Top: Example input (green borders) and predicted (red borders) frames. Bottom left: Fréchet Video Distance (FVD) Unterthiner et al. [2018]. Lower is better. Each dot represents a separate model initialization. For SVG, the FVD for several runs was greater than 1700. The example at the top comes from the best run. Bottom right: VGG feature cosine similarity, structural similarity (SSIM), and peak signal-to-noise ratio (PSNR). Higher is better. Lines show the mean across 5 random model initializations, with the 95% confidence interval shaded. The SVG model fails to represent objects stably at later timepoints. This is captured by the FVD metric, causing a large difference to our models. However, it is not captured by the other metrics, suggesting that they are not informative on this synthetic dataset. Also see videos in supplemental material or at https://mjlm.github.io/video_structure/.
(a) Struct-VRNN
(b) Struct-VRNN without best-of-many-samples objective
(c) Struct-RNN (deterministic dynamics)
Figure S12: Effect of stochastic belief and best-of-many-samples objective on sample diversity. Each row shows one example Basketball play, with the trajectories for one player in each column. The black line indicates the true trajectory, the colored lines indicate 20 stochastic predictions, all conditioned on the same observed steps. Trajectory endpoints are marked with dots. The model trained with the best-of-many-samples objective (a) produces more diverse samples than the model without (b). As expected, the deterministic model (c) lacks diversity completely. Players were matched to detected keypoints by finding, for each player, the keypoint which was closest to that player on average.
Figure S13: The Struct-VRNN model generates plausible and diverse predictions. Each block shows the true sequence in the top row, followed by three samples conditioned on the same initial frames (green outlines). Also see videos in supplemental material or at https://mjlm.github.io/video_structure/.
(a) Acrobot
(b) Cartpole
(c) Cheetah
(d) Reacher
(e) Walker
Figure S14: Action-conditional predictions for the DeepMind Control Suite domains. Even though the CNN-VRNN has enough capacity to encode the observed frames (green outlines) well, it struggles to make future predictions (red outlines), in contrast to the Struct-VRNN. Also see videos in supplemental material or at https://mjlm.github.io/video_structure/.