Learning Disentangled Representations of Video with Missing Data

06/23/2020
by Armand Comas Massague et al.
Northeastern University

Missing data poses significant challenges while learning representations of video sequences. We present Disentangled Imputed Video autoEncoder (DIVE), a deep generative model that imputes and predicts future video frames in the presence of missing data. Specifically, DIVE introduces a missingness latent variable, disentangles the hidden video representations into static and dynamic appearance, pose, and missingness factors for each object, and imputes each object's trajectory where data is missing. On a moving MNIST dataset with various missing scenarios, DIVE outperforms state-of-the-art baselines by a substantial margin. We also present comparisons on the real-world MOTSChallenge pedestrian dataset, which demonstrate the practical value of our method in a more realistic setting.


Code repository: DIVE (Disentangled Imputed Video autoEncoder)

1 Introduction

Videos contain rich structured information about our physical world. Learning representations from video enables intelligent machines to reason about their surroundings and is essential to a range of tasks in machine learning and computer vision, including activity recognition karpathy2014large, video prediction mathieu2016deep and spatiotemporal reasoning jang2017tgif. One of the fundamental challenges in video representation learning is the high-dimensional, dynamic, multi-modal distribution of pixels. Recent research in deep generative models denton2017unsupervised; ddpae; kosiorek2018sequential; ye2019compositional tackles this challenge by exploiting the inductive biases of videos and projecting the high-dimensional data into a substantially lower-dimensional space. These methods search for disentangled representations by decomposing the latent representation of video frames into semantically meaningful factors locatello2018challenging.

Unfortunately, existing methods cannot reason about objects when they are missing from a video. In contrast, a five-month-old child understands that objects continue to exist even when they are unseen, a phenomenon known as “object permanence” baillargeon1985object. Towards making intelligent machines, we study learning disentangled representations of videos with missing data. We consider a variety of missing scenarios that might occur in natural videos: objects can be partially occluded; objects can disappear from a scene and reappear; objects can also become missing while changing their size, shape, color and brightness. The ability to disentangle these factors and learn appropriate representations is an important step toward spatiotemporal decision making in complex environments.

In this work, we build on the deep generative model of ddpae, which integrates structured graphical models into deep neural networks. Our model, which we call Disentangled-Imputed-Video-autoEncoder (dive), (i) learns representations that factorize into appearance, pose and missingness latent variables; (ii) imputes missing data by sampling from the learned latent variables; and (iii) performs unsupervised stochastic video prediction using the imputed hidden representation. Besides imputation, another salient feature of our model is (iv) its ability to robustly generate objects even when their appearances are changing, by modeling static and dynamic appearance separately. This makes our technique more applicable to real-world problems.

We demonstrate the effectiveness of our method on a moving MNIST dataset with a variety of missing data scenarios, including partial occlusions, out-of-scene objects, and missing frames with varying appearances. We further evaluate on the Multi-Object Tracking and Segmentation (MOTSChallenge) dataset. We show that dive is able to accurately infer missing data, perform video imputation, reconstruct input frames and generate future predictions. Compared with the baselines, our approach is robust to missing data and achieves significant improvements in video prediction performance.

2 Related Work

Disentangled Representation.

Unsupervised learning of disentangled representations for sequences generally falls into three categories: VAE-based models hsu2017unsupervised; kosiorek2018sequential; ddpae; ye2019compositional; kossen2020structured; stanic2019r, GAN-like models villegas2017decomposing; denton2018stochastic; kurutach2018learning and Sum-Product networks kossen2020structured; pmlr-v97-stelzner19a. For video data, a common practice is to encode a video frame into latent variables and disentangle the latent representation into content and dynamics factors. For example, ddpae assumes the content (objects, background) of a video is fixed across frames, while the position of the content can change over time. In most cases, these models can only handle complete video sequences without missing data. One exception is sqair kosiorek2018sequential, a generalization of air Eslami2016AttendIR, which uses a latent variable to explicitly encode the presence of the respective object. sqair has been further extended with an accelerated training scheme pmlr-v97-stelzner19a and with stronger relational inductive biases kossen2020structured; stanic2019r. However, sqair and its extensions have no mechanism to recall an object, so an object that reappears in the scene is re-discovered as a new one.

Video Prediction.

Conditioned on past frames, video prediction models are trained to reconstruct the input sequence and predict future frames. Many video prediction methods use dynamical modeling liu2018dyan or deep neural networks to learn a deterministic transformation from input to output, including LSTM srivastava2015unsupervised, Convolutional LSTM finn2016unsupervised and PredRNN wang2018predrnn++. These methods often suffer from blurry predictions and cannot properly model the inherently uncertain future kalchbrenner2017video. In contrast to deterministic prediction, we prefer stochastic video prediction mathieu2016deep; xue2016visual; kalchbrenner2017video; babaeizadeh2018stochastic; denton2018stochastic; wangprobabilistic, which is more suitable for capturing the stochastic dynamics of the environment. For instance, kalchbrenner2017video proposes an auto-regressive model to generate pixels sequentially. denton2018stochastic generalizes the vae to video data with a learned prior. kumar2019videoflow develops a normalizing-flow video prediction model. wangprobabilistic proposes a Bayesian Predictive Network that learns the prior distribution from noisy videos but does not use disentangled representations. Our main goal is to learn disentangled latent representations from video that are both interpretable and robust to missing data.

Missing Value Imputation.

Missing value imputation is the process of replacing missing entries in a sequence with estimates of their true values. It is a central challenge of sequence modeling. Statistical methods often impose strong assumptions on the missing patterns. For example, mean/median averaging acuna2004treatment and MICE buuren2010mice can only handle data missing at random. Latent variable models with the EM algorithm nelwamondo2007missing can impute data missing not at random but are restricted to certain parametric models. Deep generative models offer a flexible framework for missing data imputation. For instance, yoon2018estimating; che2018recurrent; cao2018brits develop variants of recurrent neural networks to impute time series, and yoon2018gain; luo2018multivariate; liu2019naomi propose GAN-like models to learn missing patterns in multivariate time series. Unfortunately, to the best of our knowledge, all recent developments in generative modeling for missing value imputation have focused on low-dimensional time series and are not directly applicable to high-dimensional video with complex scene dynamics.

3 Disentangled-Imputed-Video-autoEncoder (DIVE)

Videos often capture multiple objects moving with complex dynamics. In this work, we assume that each video contains at most a fixed number of objects; we observe a video sequence of $T$ time steps and aim to predict the following steps. The key component of dive is the decomposition and disentanglement of the object representations within a VAE framework, with recursive modules similar to those in ddpae. Specifically, we decompose the objects in a video and assign three sets of latent variables to each object: appearance, pose and missingness, representing distinct attributes. During inference, dive encodes the input video into latent representations, performs sequence imputation in the latent space and updates the hidden representations. The generative model then samples from the latent variables to reconstruct the input and generate future predictions. Figure 1 depicts the overall pipeline of our model.

Denote a video sequence with missing data as $x_{1:T} = \{x_1, \dots, x_T\}$, where each $x_t$ is a frame. We assume an object in a video is described by its appearance, pose (position and scale), and missingness. For each object $i$ in frame $t$, we aim to learn a latent representation $z^i_t$ and disentangle it into three latent variables:

$z^i_t = \left[\, z^{a,i}_t,\ z^{p,i}_t,\ z^{m,i}_t \,\right]$   (1)

where $z^{a,i}_t$ is the appearance vector, $z^{p,i}_t$ is the pose vector containing the $x$ and $y$ coordinates and the scale, and $z^{m,i}_t$ is the binary missingness label: $z^{m,i}_t = 1$ if the object is occluded or missing.
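To make the factorization concrete, below is a minimal PyTorch-style sketch of the per-object latent structure. The names (`ObjectLatents`, `split_latent`, `appearance_dim`) and the flat-vector layout are illustrative assumptions, not the paper's implementation.

```python
from typing import NamedTuple
import torch

class ObjectLatents(NamedTuple):
    """Disentangled latent variables for one object at one time step."""
    appearance: torch.Tensor  # z^a: appearance vector
    pose: torch.Tensor        # z^p: (x, y, scale)
    missing: torch.Tensor     # z^m: binary missingness label

def split_latent(z: torch.Tensor, appearance_dim: int) -> ObjectLatents:
    """Split a flat latent vector into appearance, pose and missingness parts.

    Layout assumption (illustrative): [appearance | x, y, scale | missing].
    """
    appearance = z[..., :appearance_dim]
    pose = z[..., appearance_dim:appearance_dim + 3]
    missing = z[..., appearance_dim + 3:appearance_dim + 4]
    return ObjectLatents(appearance, pose, missing)

# Example: batch of 1, two objects, appearance dimension 8.
z = torch.randn(1, 2, 8 + 3 + 1)
latents = split_latent(z, appearance_dim=8)
print(latents.pose.shape)  # torch.Size([1, 2, 3])
```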

Figure 1: Overall architecture of dive, which takes the input video with missing data and infers the missingness (red), pose (green) and appearance (blue) latent variables. Two separate decoders reconstruct the input and predict the future sequences. The model is trained following the VAE framework.

3.1 Imputation Model

The imputation model leverages the missingness variable to update the hidden states. When there is no missing data, the encoded hidden state given the input frame is $h^i_t$, where we enforce separate representations for each object. We implement the encoding function with a bidirectional LSTM to propagate the hidden state over time. In the presence of missing data, however, this hidden state is unreliable and needs imputation. Denote the imputed hidden state as $\hat{h}^i_t$, which will be discussed shortly. We update the latent-space vector $\tilde{h}^i_t$ by selecting the corresponding hidden state, given the sampled missingness variable:

$\tilde{h}^i_t = \left(1 - z^{m,i}_t\right) h^i_t + z^{m,i}_t\, \hat{h}^i_t$   (2)

Note that we additionally mix the input hidden state and the imputed hidden state according to a Bernoulli probability. In our experiments, we found this mixed strategy helpful in mitigating covariate shift bickel2009discriminative. It forces the model to learn the correct imputation with self-supervision, which is reminiscent of the scheduled sampling bengio2015scheduled technique for sequence prediction.

The pose hidden states are obtained by propagating the updated latent representation through an LSTM network. For prediction, where no new frames are observed, the LSTM receives only its previous hidden representation as input at each time step. We obtain the imputed hidden state $\hat{h}^i_t$ by means of auto-regression, based on the assumption that a video sequence is locally stationary and that the most recent history is predictive of the future. Given the updated latent representation $\tilde{h}^i_t$ at time $t$, the imputed hidden state at the next time step is:

$\hat{h}^i_{t+1} = f\!\left(\tilde{h}^i_t\right)$   (3)

where $f$ is a fully connected layer. This approach is similar in spirit to the time-series imputation method in cao2018brits; however, instead of imputing in the observation space, we perform imputation in the space of latent representations.
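As a concrete illustration of the imputation model, here is a minimal PyTorch sketch, not the authors' implementation: the class name, the Bernoulli mixing probability `p_impute` and the zero initialization of the imputed state are assumptions; only the select-or-impute update and the fully connected auto-regression follow the description above.

```python
import torch
import torch.nn as nn

class LatentImputer(nn.Module):
    """Sketch of latent-space imputation: select between the encoded hidden
    state and an auto-regressively imputed one, per object and time step."""

    def __init__(self, hidden_dim: int, p_impute: float = 0.1):
        super().__init__()
        # Fully connected layer predicting the next hidden state from the
        # current updated one, i.e. the auto-regressive imputation f(.).
        self.f = nn.Linear(hidden_dim, hidden_dim)
        self.p_impute = p_impute  # Bernoulli mixing probability (assumed value)

    def forward(self, h: torch.Tensor, missing: torch.Tensor) -> torch.Tensor:
        """h: encoded hidden states (T, B, D); missing: binary labels (T, B, 1)."""
        T = h.shape[0]
        h_hat = torch.zeros_like(h[0])   # imputed state, initialized to zero
        updated = []
        for t in range(T):
            # Use the imputed state when the object is missing; otherwise still
            # swap it in with a small probability (self-supervised imputation).
            use_imputed = torch.clamp(
                missing[t]
                + torch.bernoulli(torch.full_like(missing[t], self.p_impute)),
                max=1.0)
            h_tilde = use_imputed * h_hat + (1.0 - use_imputed) * h[t]
            updated.append(h_tilde)
            h_hat = self.f(h_tilde)      # impute the next step auto-regressively
        return torch.stack(updated)

# Example usage with random data: 10 steps, batch of 4, 16-d hidden states.
imputer = LatentImputer(hidden_dim=16)
h = torch.randn(10, 4, 16)
missing = torch.bernoulli(torch.full((10, 4, 1), 0.2))
print(imputer(h, missing).shape)  # torch.Size([10, 4, 16])
```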

3.2 Inference Model

Missingness Inference.

For the missingness variable $z^{m,i}_t$, we also leverage the input encoding. We use a Heaviside step function $H(\cdot)$ to make it binary:

$z^{m,i}_t = H\!\left(\mu^{m,i}_t + \sigma^{m,i}_t\, \epsilon\right), \quad \epsilon \sim \mathcal{N}(0, 1)$   (4)

where $\sigma^{m,i}_t$ is the standard deviation of the noise, which is obtained from the hidden representation.
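A minimal sketch of this missingness inference, under the assumption that both the logit and the noise standard deviation are predicted from the object's hidden state; the layer names are illustrative. Keeping the soft (sigmoid) logit alongside the hard label mirrors the gradient-propagation trick mentioned in Appendix A.

```python
import torch
import torch.nn as nn

class MissingnessHead(nn.Module):
    """Sketch: infer a binary missingness label from an object's hidden state.

    A noisy logit is produced from the hidden representation and binarized
    with a Heaviside step function; the soft logit can be kept wherever
    gradients need to flow (see Appendix A).
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.to_logit = nn.Linear(hidden_dim, 1)   # mean of the logit
        self.to_logstd = nn.Linear(hidden_dim, 1)  # log std of the noise

    def forward(self, h: torch.Tensor):
        mu = self.to_logit(h)
        std = self.to_logstd(h).exp()
        noisy_logit = mu + std * torch.randn_like(mu)  # reparameterized noise
        hard = torch.heaviside(noisy_logit, values=torch.zeros_like(noisy_logit))
        return hard, torch.sigmoid(noisy_logit)        # hard label, soft label

head = MissingnessHead(hidden_dim=16)
hard, soft = head(torch.randn(4, 16))
print(hard.squeeze(-1), soft.squeeze(-1))
```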

Pose Inference.

The pose variable (position and scale) encodes the spatiotemporal dynamics of the video. We follow the variational inference technique for state-space representations of sequences chung2015recurrent. That is, instead of directly inferring the pose $z^{p,i}_t$ from the input frames, we use a stochastic variable to reparameterize the state transition probability:

$z^{p,i}_t = g\!\left(z^{p,i}_{t-1},\, \beta^i_t\right), \qquad \beta^i_t \sim \mathcal{N}\!\left(\mu^{\beta,i}_t,\, \sigma^{\beta,i}_t\right)$   (5)

where the state transition $g$ is a deterministic mapping from the previous state to the next time step, and the stochastic transition variable $\beta^i_t$ is sampled from a Gaussian distribution parameterized by a mean $\mu^{\beta,i}_t$ and variance $\sigma^{\beta,i}_t$.
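The following sketch illustrates the reparameterized state transition of Equation (5). The MLP form of the deterministic mapping g, the 3-dimensional pose (x, y, scale) and the layer names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PoseTransition(nn.Module):
    """Sketch of the state-space pose model: a deterministic transition of the
    previous pose, driven by a stochastic Gaussian variable (cf. Eq. (5))."""

    def __init__(self, hidden_dim: int, pose_dim: int = 3):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, pose_dim)      # mean of beta_t
        self.to_logvar = nn.Linear(hidden_dim, pose_dim)  # log-variance of beta_t
        # Deterministic transition g(z_{t-1}, beta_t), here a small MLP.
        self.g = nn.Sequential(nn.Linear(2 * pose_dim, 32), nn.Tanh(),
                               nn.Linear(32, pose_dim))

    def forward(self, z_prev: torch.Tensor, h_pose: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.to_mu(h_pose), self.to_logvar(h_pose)
        beta = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterize
        return self.g(torch.cat([z_prev, beta], dim=-1))

model = PoseTransition(hidden_dim=16)
z = torch.zeros(4, 3)                 # initial pose (x, y, scale), batch of 4
for _ in range(10):                   # roll the pose forward for 10 steps
    z = model(z, torch.randn(4, 16))
print(z.shape)                        # torch.Size([4, 3])
```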

Dynamic Appearance.

Another novel feature of our approach is its ability to robustly generate objects even when their appearance changes across frames, i.e., the appearance latent variable is time-varying. In particular, we decompose the appearance latent variable into a static component and a dynamic component, which we model separately. The static component captures the inherent semantics of the object, while the dynamic component models the nuanced variations in shape.

For the static component, we follow the procedure in ddpae and perform an inverse affine spatial transformation, given the pose of the object, to center the object in the frame and rectify the image to a selected crop size. Future prediction is done in an autoregressive fashion:

(6)

Here the appearance hidden state is propagated through an LSTM, whose last output is used to infer the static appearance. Similar to the poses, we use a state-space representation for the dynamic component, but directly model the difference in appearance between consecutive time steps, which helps stabilize training:

(7)

The final appearance variable is sampled from a Gaussian distribution parameterized by the concatenation of the static and dynamic components, which are randomly mixed according to a Bernoulli probability:

(8)

This mixing strategy helps mitigate covariate shift and forces the static component to learn the inherent semantics of the objects across frames.
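Since the exact form of Equations (6)-(8) is not reproduced here, the sketch below only illustrates the mixing idea: a static appearance combined with accumulated per-step appearance differences, gated by a Bernoulli probability. The cumulative-sum formulation and all names are assumptions, not the paper's parameterization.

```python
import torch

def mix_appearance(static: torch.Tensor,
                   dynamic: torch.Tensor,
                   p_dynamic: float) -> torch.Tensor:
    """Sketch of static/dynamic appearance mixing.

    With probability p_dynamic the time-varying appearance (static plus
    accumulated differences) is used; otherwise only the static component is
    kept, which encourages the static part to carry the object's identity.
    `dynamic` holds per-step appearance differences.
    """
    varying = static.unsqueeze(0) + torch.cumsum(dynamic, dim=0)  # (T, B, D)
    use_dynamic = torch.bernoulli(
        torch.full(varying.shape[:1] + (1, 1), p_dynamic))        # per time step
    return use_dynamic * varying + (1.0 - use_dynamic) * static.unsqueeze(0)

static = torch.randn(4, 8)            # one static appearance per object in batch
deltas = 0.1 * torch.randn(10, 4, 8)  # small per-step appearance differences
appearance = mix_appearance(static, deltas, p_dynamic=0.8)
print(appearance.shape)               # torch.Size([10, 4, 8])
```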

Figure 2: A graphical representation of dive. From top to bottom: inference of the missingness variable, the missing-data imputation model, inference of the pose vectors, and inference of the appearance variable using dynamic appearance inference.

3.3 Generative Model and Learning

Given a video with missing data $x_{1:T}$, denote the underlying complete video as $\bar{x}_{1:T}$. The generative distribution of the video sequence is then given by:

(9)

In unsupervised learning of video representations, we simultaneously reconstruct the input video and predict future frames. Given the inferred latent variables, we generate and predict each object sequentially. In particular, we first generate the rectified object in the center of the frame, given its appearance; the decoder is parameterized by a deconvolutional layer. We then apply a spatial transformer to rescale and place the object according to its pose. For each object, the generative model is:

(10)

Future prediction is similar to reconstruction, except that we assume the predicted video is always complete. The generated frame is the summation of the outputs over all objects. Following the VAE framework, we train the model by maximizing the evidence lower bound (ELBO); please see the Appendix for details.
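To illustrate the rendering step (decode a centered object patch, place it with a spatial transformer according to its pose, and sum over objects), here is a minimal PyTorch sketch. The normalized pose parameterization and the helper name are assumptions, and the deconvolutional appearance decoder is omitted.

```python
import torch
import torch.nn.functional as F

def place_object(patch: torch.Tensor, pose: torch.Tensor, out_size: int) -> torch.Tensor:
    """Sketch of the spatial-transformer step: scale and translate a decoded,
    centered object patch into the full frame according to its pose.

    patch: (B, 1, h, w) rectified object; pose: (B, 3) = (x, y, scale) in
    normalized coordinates. The affine parameterization is an assumption.
    """
    b = patch.shape[0]
    x, y, s = pose[:, 0], pose[:, 1], pose[:, 2].clamp(min=1e-3)
    # Inverse transform used by grid_sample: frame coords -> patch coords.
    theta = torch.zeros(b, 2, 3, device=patch.device)
    theta[:, 0, 0] = 1.0 / s
    theta[:, 1, 1] = 1.0 / s
    theta[:, 0, 2] = -x / s
    theta[:, 1, 2] = -y / s
    grid = F.affine_grid(theta, [b, 1, out_size, out_size], align_corners=False)
    return F.grid_sample(patch, grid, align_corners=False)

# Compose a frame as the sum of per-object renderings, as described above.
patches = torch.rand(2, 3, 1, 28, 28)   # batch of 2 videos, 3 objects each
poses = torch.tensor([[[0.3, 0.0, 0.5], [-0.4, 0.2, 0.5], [0.0, -0.5, 0.5]]] * 2)
frame = sum(place_object(patches[:, i], poses[:, i], out_size=64) for i in range(3))
frame = frame.clamp(max=1.0)            # keep pixel values in [0, 1]
print(frame.shape)                      # torch.Size([2, 1, 64, 64])
```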

4 Experiments

4.1 Experimental Setup

We evaluate our method on variations of moving MNIST and on the MOTSChallenge multi-object tracking dataset. The prediction task is to generate 10 future frames given an input of 10 frames. The baselines include established state-of-the-art stochastic video prediction methods: drnet denton2017unsupervised, ddpae ddpae and sqair kosiorek2018sequential.

Evaluation Metrics.

We use common evaluation metrics for video quality on the visible pixels: pixel-level binary cross-entropy (BCE) per frame, mean squared error (MSE) per frame, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). Additionally, since dive is a probabilistic model, we also report the negative evidence lower bound (NELBO). As dive simultaneously imputes missing data and generates improved predictions, we report reconstruction and prediction performance separately. For implementation details of the experiments, please see the Appendix.
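For reference, a minimal sketch of the per-frame BCE, MSE and PSNR computations on videos with pixel values in [0, 1]; the exact reduction conventions are assumptions, and SSIM is typically taken from an off-the-shelf implementation.

```python
import torch
import torch.nn.functional as F

def per_frame_metrics(pred: torch.Tensor, target: torch.Tensor):
    """Sketch of per-frame metrics (BCE, MSE, PSNR) for videos of shape
    (B, T, C, H, W) with pixel values in [0, 1].

    The reductions (sum over pixels for BCE/MSE, mean over frames) are
    assumptions about the exact convention; SSIM is omitted here.
    """
    b, t = pred.shape[:2]
    flat_pred = pred.reshape(b * t, -1).clamp(1e-6, 1 - 1e-6)
    flat_target = target.reshape(b * t, -1)
    bce = F.binary_cross_entropy(flat_pred, flat_target, reduction="none").sum(-1).mean()
    mse = ((flat_pred - flat_target) ** 2).sum(-1).mean()
    psnr = (-10.0 * torch.log10(((flat_pred - flat_target) ** 2).mean(-1))).mean()
    return {"BCE": bce.item(), "MSE": mse.item(), "PSNR": psnr.item()}

pred = torch.rand(2, 10, 1, 64, 64)
target = (torch.rand(2, 10, 1, 64, 64) > 0.5).float()
print(per_frame_metrics(pred, target))
```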

4.2 Moving MNIST Experiments

Data Description.

Moving MNIST srivastava2015unsupervised is a synthetic dataset consisting of two digits moving independently within a frame. Each sequence is generated on the fly by sampling MNIST digits and synthesizing trajectories with a fixed velocity, a randomly sampled angle and a random initial position. We train the model for 300 epochs in scenarios 1 and 2, and 600 epochs in scenario 3. For each epoch we generate 10k sequences; the test set contains 1,024 fixed sequences. We simulate a variety of missing data scenarios (a minimal simulation sketch is given after the list below):


  • Partial Occlusion: we occlude the upper rows of the pixel frame to simulate the effect of objects being partially occluded at the boundaries of the frame.

  • Out of Scene: we randomly select an initial time step $t$ and remove the object from the frame at steps $t$ and $t+1$ to simulate the out-of-scene phenomenon for two consecutive steps.

  • Missing with Varying Appearance: we apply an elastic transformation 1227801 to change the appearance of each object individually. The transformation grid is chosen randomly for each sequence, and the intensity of the deformation filter is reduced linearly to 0 (no transformation) along the steps of the sequence. We remove each object for one time step following the same logic as in scenario 2.
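The sketch below, referenced in the list above, illustrates how the partial-occlusion and out-of-scene corruptions can be simulated on per-object frames before compositing. The frame size, number of occluded rows and helper name are illustrative assumptions.

```python
from typing import Optional
import numpy as np

def simulate_missing(object_frames: np.ndarray,
                     occlude_rows: int = 0,
                     out_of_scene_steps: int = 0,
                     rng: Optional[np.random.Generator] = None) -> np.ndarray:
    """Sketch of the missing-data simulation, applied to per-object frames of
    shape (T, N, H, W) before compositing. Parameter values are illustrative.

    - Partial occlusion: zero the top `occlude_rows` rows of every frame.
    - Out of scene: drop each object entirely for `out_of_scene_steps`
      consecutive, randomly chosen time steps.
    """
    rng = rng or np.random.default_rng()
    frames = object_frames.copy()
    T, N = frames.shape[:2]
    if occlude_rows > 0:
        frames[:, :, :occlude_rows, :] = 0.0
    if out_of_scene_steps > 0:
        for i in range(N):
            t0 = rng.integers(0, T - out_of_scene_steps + 1)
            frames[t0:t0 + out_of_scene_steps, i] = 0.0
    return frames

# Two objects over 10 steps in an assumed 64x64 frame.
objs = np.random.rand(10, 2, 64, 64)
occluded = simulate_missing(objs, occlude_rows=16)           # Scenario 1
out_of_scene = simulate_missing(objs, out_of_scene_steps=2)  # Scenario 2
video = occluded.max(axis=1)   # composite per-object frames into a video
print(video.shape)             # (10, 64, 64)
```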

Scenario 1: Partial occlusion.

The top portion of Table 1 shows the quantitative comparison of all methods for the partial occlusion scenario. Our model outperforms all baselines except on the prediction BCE. This is because dive generates sharper shapes which, in case of misalignment with the ground truth, have a larger effect on the pixel-level BCE. For reconstruction, our method often outperforms the baselines by a large margin, which highlights the significance of missing data imputation. Note that sqair performs well in reconstruction but fails in prediction: prolonged full occlusions cause sqair to lose track of an object and re-identify it as a new one when it reappears. Figure 3 shows a visualization of the predictions from dive and the baseline models. The bottom three rows show the decomposed representations from dive for each object and the missingness labels for the objects in the corresponding order. We observe that drnet and sqair fail to predict the objects' positions and appearances, while ddpae generates blurry predictions with the correct pose. These failure cases rarely occur for dive.

Scenario 1
Model | BCE Rec | BCE Pred | MSE Rec | MSE Pred | PSNR Rec | PSNR Pred | SSIM Rec | SSIM Pred | NELBO
drnet denton2017unsupervised | 482.07 | 852.59 | 72.21 | 96.36 | 7.99 | 6.89 | 0.76 | 0.72 | /
sqair kosiorek2018sequential | 178.71 | 967.20 | 21.84 | 84.73 | 13.19 | 9.96 | 0.90 | 0.73 | -0.16
ddpae ddpae | 182.66 | 417.00 | 39.09 | 67.41 | 17.56 | 15.49 | 0.77 | 0.72 | -0.09
dive | 119.25 | 459.10 | 19.73 | 64.49 | 20.64 | 15.85 | 0.90 | 0.78 | -0.18
Scenario 2
drnet | 392.33 | 1402.45 | 90.64 | 187.72 | 9.59 | 9.88 | 0.80 | 0.67 | /
sqair | 468.22 | 927.09 | 73.13 | 137.04 | 10.33 | 8.21 | 0.84 | 0.69 | -0.17
ddpae | 266.03 | 409.26 | 58.37 | 89.57 | 18.64 | 16.94 | 0.87 | 0.77 | -0.17
dive | 165.42 | 321.29 | 27.03 | 64.17 | 22.15 | 18.56 | 0.93 | 0.83 | -0.21
Scenario 3
drnet | 421.72 | 1304.53 | 90.46 | 176.28 | 9.91 | 7.33 | 0.75 | 0.70 | /
sqair | 560.51 | 1518.61 | 74.30 | 163.25 | 10.80 | 7.64 | 0.83 | 0.62 | -0.16
ddpae | 322.23 | 403.48 | 63.63 | 82.71 | 18.29 | 17.22 | 0.81 | 0.78 | -0.18
dive | 272.74 | 374.59 | 42.81 | 74.87 | 20.08 | 17.61 | 0.87 | 0.78 | -0.19
Table 1: Quantitative comparison of all methods for the three missing scenarios, w.r.t. reconstruction (Rec) and prediction (Pred). From top to bottom: partial occlusion, out of scene, and complete missing with varying appearance. The improvements of our method DIVE are evident for all scenarios.
Figure 3: Partial missing qualitative results. Obj 1 and Obj 2 show dive's individual object generations, and the missing labels indicate whether each object is estimated to be completely missing from the scene. Note that the objects are well decomposed and sharply generated, and the labels are properly predicted.
Figure 4: Qualitative results for out of scene missing scenario for two time steps.
Figure 5: Qualitative results for one time step complete missing with varying appearance.

Scenario 2: Out of Scene.

The middle portion of Table 1 shows the quantitative performance of all methods for scenario 2. Our method achieves significant improvements across all metrics, which implies that our imputation of missing data is accurate and can drastically improve the predictions. Figure 4 shows the prediction results of all methods for the out-of-scene case. We observe that drnet and sqair fail to predict the future pose, and the quality of the generated object appearance is poor. The qualitative comparison with ddpae reveals that the objects generated by our model have higher brightness and sharpness. As the baselines cannot infer object missingness, they may misidentify the missing object as any other object that is present, which leads to confusion when modeling the pose and appearance. The figure also reveals how dive is able to predict the missing labels and hallucinate the pose of the objects while they are missing, allowing for accurate predictions.

Scenario 3: Missing with Varying Appearance.

Quantitative results for one-time-step complete missingness with varying appearance are shown in the bottom portion of Table 1. Our method again achieves the best performance for all metrics. The difference between our model and the baselines is quite significant given the difficulty of the task: besides the completely missing frame, the varying appearance of the objects introduces an additional layer of complexity which can misguide the inference. Despite these challenges, dive learns the appearance variation and successfully recognizes the correct object in most cases. Figure 5 visualizes the model predictions in a tough case where two seemingly different digits (“2” and “6”) are progressively transformed into the same digit (“6”). sqair and drnet have the ability to model varying appearance, but fail to generate reasonable predictions for similar reasons as before. ddpae correctly predicts the pose after the missing step, but misidentifies the objects' appearance before that; moreover, ddpae simply cannot model appearance variation. dive correctly estimates the pose and appearance variation of each object, while maintaining their identities throughout the sequence.

4.3 Pedestrian Experiments

Figure 6: MOTS data set qualitative results. Note that our method successfully identifies the missing time step, decomposes the objects and keeps track of the missing pedestrians.

The Multi-Object Tracking and Segmentation (MOTS) Challenge MOT16 dataset consists of real-world video sequences of pedestrians and cars. We use 2 ground-truth sequences in which pedestrians have been fully segmented and annotated Voigtlaender19CVPR_MOTS. The annotated sequences are further processed into shorter 20-frame sub-sequences that are binarized and contain at most 3 unique pedestrians. The smallest objects are scaled up, and the sequences are augmented by simulating constant camera motion and a one-time-step complete camera occlusion; further details are deferred to the Appendix.

Model | BCE | MSE | PSNR | SSIM | NELBO
ddpae | 2495.08 | 560.37 | 22.22 | 0.90 | -0.24
dive | 1355.89 | 328.96 | 24.82 | 0.96 | -0.26
Table 2: Quantitative comparison on MOTS pedestrian dataset for ddpae and dive.

Table 2 shows the quantitative metrics compared with the best-performing baseline, ddpae. This dataset mimics missing scenarios 1 (partial occlusion) and 3 (missing with varying appearance), because the appearance of walking pedestrians is constantly changing. dive outperforms ddpae across all evaluation metrics. Figure 6 shows the outputs from both models as well as the decomposed objects and missingness labels from dive. Our method accurately recognizes the 3 objects (pedestrians), infers their missingness and estimates their varying appearance. ddpae fails to decompose them due to its rigid assumption of fixed appearances and the inherent complexity of the scenario.

5 Conclusion and Discussion

We propose a novel deep generative model that simultaneously performs object decomposition, latent space disentanglement, missing data imputation, and video forecasting. The key novelties of our method are the detection and imputation of missing data in the hidden representations, as well as a robust way of dealing with dynamic appearances. Extensive experiments on moving MNIST demonstrate that dive can impute missing data without supervision and generate videos of significantly higher quality. Future work will focus on improving our model so that it can handle the complexity and dynamics of real-world videos with an unknown number of objects and colored scenes.

Broader Impact

Videos provide a window into the physics of the world we live in. They contain abundant visual information about what objects are, how they move, and what happens when cameras move relative to the scene. Being able to learn a representation that disentangles these factors is fundamental to AI that can understand and act in spatiotemporal environments. Despite the wealth of methods for video prediction, state-of-the-art approaches are sensitive to missing data, which is very common in real-world videos. Our proposed model significantly improves the robustness of video prediction methods against missing data, thereby increasing the practical value of video prediction techniques and our trust in AI. Video surveillance systems can potentially be abused for discriminatory targeting, and we remained cognizant of the bias in our training data. To reduce this potential risk, we pre-processed the MOTSChallenge videos to greyscale.

References

Appendix A Model Implementation Details

sqair

The sqair model is sensitive to hyper-parameters [kosiorek2018sequential]. We tried different combinations of hyper-parameters to reproduce the best performance of the model. However, as confirmed through our communication with the authors, sqair is not designed for the missing data scenario, and we were therefore not able to reach the level of performance reported in [kosiorek2018sequential]. In order to obtain the best performance of sqair on our dataset, we trained and evaluated the reconstruction model and the prediction model separately, as we found sqair to be more stable on the reconstruction task, which allowed longer training (300 epochs and more). For the prediction task, the main issue we encountered was that the gradients would vanish for some combinations of hyper-parameters, and the model was unable to make predictions after a certain number of training epochs (this number fluctuates). To obtain the best performance of sqair, we kept training until the model could generate predictions and selected the checkpoint with the best performance. We used RMSprop to optimize the sqair model, which uses an importance-weighted auto-encoder [Burda2016ImportanceWA] with 5 particles as its general structure. For more implementation details, please refer to the sqair GitHub repository [kosiorek2018sequential]. It is also important to note that the training time per 100 epochs for sqair is at least 5 times longer than that of our DIVE model.

drnet

The original version of the drnet model only uses the first four frames for training. In order to adapt drnet to our prediction task, we changed the scene discriminator in drnet to train on all frames in the sequence. This modification is more suitable for our missing data scenario: if there were missing data in the first four frames, a scene discriminator trained only on those frames would easily fail. After this modification, the probability of the scene discriminator successfully recognizing the scene also increases. Apart from this modification, the rest of the model was kept exactly the same as the authors' implementation to better reproduce their results. It is also important to note that the main network and the lstm in drnet are trained separately: the main network is trained first, and the lstm is then trained on the outputs of the main network. Therefore, if the main network fails to recognize the objects, the lstm also fails to learn the trajectories. We used the default Adam optimizer in drnet to train the model. The scene discriminator was trained with the BCE loss; the main network and the lstm were trained with the MSE loss. For more implementation details, please refer to the drnet GitHub repository [denton2017unsupervised].

ddpae

We used the code provided by the authors and kept the hyperparameters of the public version unchanged. We followed the instructions in their GitHub repository (https://github.com/jthsieh/DDPAE-video-prediction) for the Moving MNIST experiments. However, for some of the experiments, we added features to produce better results. For the pedestrian experiments, we aligned the hyperparameters that are semantically similar to those of the dive implementation, and we constrained the pose size less than in the default setting so that the model can adapt to a highly varying dataset.

dive

The dimensions of the main latent variables were chosen after a manual sweep over a range of hyperparameters. The learning rate was reduced once at a fixed fraction of the training iterations, and we used a batch size of 64. The Bernoulli probability of the imputation model takes different values in training and in testing. For the appearance model, the Bernoulli probability was increased after 3,000 iterations during training, and a separate value was used for testing. The exact values can be found in the provided codebase. The missingness latent variable was implemented with a Heaviside step function in the pose encoding model, with a bias in the logit. However, to allow the gradients to propagate, we did not binarize the variable for the decoder but simply used the logit.

To adapt to the three missing scenarios, we made minor changes to our implementation. For missing scenarios 1 (partial occlusion) and 2 (out of scene) of the Moving MNIST experiments, the objects' appearance remains static, so we did not include the dynamic appearance component and adjusted the appearance encoding accordingly. We followed Equation 6 to generate the static appearance, but skipped the input frames and hidden states at the time steps where we predicted missingness. For partial-occlusion training on the Moving MNIST dataset, we used a scheduling mechanism to evaluate the loss only on the visible area of the frame; we applied the same procedure to all baselines for a fair comparison. For the pedestrian dataset, similarly to ddpae, we relaxed the pose size constraint to accommodate the highly dynamic pose sizes in real-world videos. With this implementation, training takes 91 minutes per 100 epochs, during which we process 1 million samples in batches of 64.

Software

We implemented this method using Ubuntu 18.04, Python 3.6, Pytorch 1.2.0, Cuda 10.0 and Pyro 0.2 as a framework for probabilistic programming.

Hardware

For each of our experiments we used 1 GPU RTX 2080 Ti (Blower Edition) with 128GB of memory.

Appendix B Datasets Details

Moving MNIST with elastic deformation.

In order to simulate slowly varying appearance in Scenario 3, we applied an elastic deformation to the objects in the scene. Given a uniform grid that represents the object pixel coordinates, we generated a distortion: we created a random displacement field controlled by an intensity parameter and a smoothing parameter, as described in [1227801]. The displacement field was added to the uniform grid and used to deform the coordinates of the given digit. The transformation was applied independently to every digit. We fixed the smoothing parameter and decreased the deformation intensity linearly to zero along the sequence.
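A minimal SciPy-based sketch of this elastic deformation; the specific alpha and sigma values and the 10-step linear schedule below are illustrative placeholders, not the paper's settings.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image: np.ndarray, alpha: float, sigma: float,
                   rng: np.random.Generator) -> np.ndarray:
    """Sketch of the elastic deformation of [1227801]: a random displacement
    field, smoothed with a Gaussian filter (sigma) and scaled by the intensity
    alpha, is added to a uniform coordinate grid used to resample the image."""
    h, w = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys + dy, xs + dx])
    return map_coordinates(image, coords, order=1, mode="reflect")

rng = np.random.default_rng(0)
digit = rng.random((28, 28))
# Reduce the deformation intensity linearly to zero along a 10-step sequence.
sequence = [elastic_deform(digit, alpha=8 * (1 - t / 9), sigma=3, rng=rng)
            for t in range(10)]
print(sequence[0].shape)  # (28, 28)
```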

MOTS Challenge pre-processing.

The Multi-Object Tracking and Segmentation (MOTS) Challenge [MOT16] dataset extends the task of multi-object tracking to multi-object tracking and segmentation. It provides dense pixel-level annotations for two existing tracking datasets, comprising 65,213 pixel masks for 977 distinct objects (cars and pedestrians) in 10,870 video frames. For our task, we used 2 scenes with only pedestrians. Each scene was processed as follows. We kept the dense annotations as the shapes of the objects and discarded all remaining content (such as the background). Given the large variance in object sizes, we resized objects below the average size in the scene to the average, plus a small random margin. Each scene was divided into sequences of 20 frames, reducing the sampling rate by a factor of 5 to increase the displacement of objects. For each sequence, we selected all combinations of 3 objects to augment the data. We binarized the grey values of all sequences. Each sequence was padded randomly to fit a square and resized to a fixed square resolution. Finally, during training we added on-the-fly transformations to the clips: we removed all content for one random time step and sequentially affine-transformed the frames to simulate a full camera occlusion and constant camera motion. The same was done when generating the fixed testing sequences. As a result, we used 4,416 sequences for training and 675 for testing, while making sure they belonged to different scenes.
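A rough sketch of the sub-sequence extraction and object-combination augmentation described above; the mask shapes, the non-overlapping clip split and the compositing rule are assumptions, and the resizing, padding and camera-motion steps are omitted.

```python
import itertools
import numpy as np

def make_subsequences(masks: np.ndarray, seq_len: int = 20, rate: int = 5,
                      max_objects: int = 3) -> np.ndarray:
    """Sketch of the MOTS pre-processing: sub-sample the scene in time, cut it
    into fixed-length clips and enumerate all combinations of `max_objects`
    pedestrians per clip. `masks` holds binary per-object masks of shape
    (T, N, H, W)."""
    masks = masks[::rate]                                   # reduce sampling rate
    T, N = masks.shape[:2]
    clips = []
    for start in range(0, T - seq_len + 1, seq_len):        # non-overlapping clips
        window = masks[start:start + seq_len]
        for combo in itertools.combinations(range(N), max_objects):
            # Composite the selected objects and binarize the result.
            frame = (window[:, list(combo)].sum(axis=1) > 0).astype(np.float32)
            clips.append(frame)
    return np.stack(clips) if clips else np.empty((0, seq_len) + masks.shape[2:])

masks = (np.random.rand(220, 4, 96, 96) > 0.98).astype(np.float32)
print(make_subsequences(masks).shape)  # e.g. (8, 20, 96, 96)
```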

Appendix C Ablation Study

Figure 7: Ablation study for static and dynamic appearance modeling for missing Scenario 3 of the Moving MNIST experiments. ddpae results were also shown for comparison purposes.

In order to highlight the significance of dynamic appearance modeling, we performed an ablation study of dive, focusing on Scenario 3 of Section 4.2. In particular, we compared two cases. (1) Dynamic appearance: this is our main configuration. Missingness is estimated with hard labels, binarized with a step function during encoding, and the appearance is modeled as in Equation (8), with a Bernoulli probability at test time such that the dynamic appearance is explicitly modeled. (2) Static appearance: we altered the original configuration by setting the Bernoulli probability in Equation (8) so that only static (constant) appearance generation is allowed.

For each case, we trained the model for 600 epochs and saved checkpoints at regular epoch intervals, using the same training and testing setup as previously reported for this scenario. We used the ddpae results trained for 600 epochs as a baseline. We tested the models and report the BCE and MSE per frame, separately for input reconstruction and output prediction.

Figure 7 shows the quantitative results for both BCE (left) and MSE (right). For reconstruction, adding the dynamic appearance component significantly reduces the error, especially for MSE. This is because Scenario 3 contains digits with high-intensity distortions at the input, so a more flexible appearance model adapts better to the changing shapes. However, predicting the sequence into the future inevitably introduces uncertainty, leading to blurrier predictions. Static modeling captures the constant appearance shared across the sequence, and the output frames only contain low-intensity deformations; it therefore does not suffer from high-intensity appearance variations and generates sharper shapes in prediction. In either case, since the baseline ddpae provides no mechanism for missing data imputation or varying appearance, both of our configurations outperform ddpae by a large margin, even in the early stages of training.

We also conducted an ablation study on the missingness variable. In our implementation, we chose a Heaviside step function to obtain “hard” labels. One can also use a sigmoid activation function to obtain “soft” labels for encoding the missingness. We tested both and found that the model learns the labels correctly in either case; the performance difference was not statistically significant.

Appendix D Objective

The optimization objective is to maximize the evidence lower bound (ELBO), as in the common VAE framework.

DIVE uses self-supervision to reconstruct the corrupted input and predict the complete output. We add a regularization term that minimizes the KL-divergence between our latent-space representation and a Gaussian prior, parameterized by its mean and variance. The assumed maximum number of objects per scene acts as a prior on the number of objects.
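As a generic sketch consistent with the description above, the objective can be written as follows, where $x_{1:T}$ is the corrupted input, $\bar{x}$ the underlying complete video, $K$ the number of predicted steps and $z$ the disentangled latent variables; this notation is ours, and the exact weighting used in the paper may differ.

```latex
\mathcal{L}(\theta, \phi)
  = \mathbb{E}_{q_\phi(z \mid x_{1:T})}\!\left[
      \log p_\theta\!\left(\bar{x}_{1:T} \mid z\right)
    + \log p_\theta\!\left(\bar{x}_{T+1:T+K} \mid z\right)\right]
  - \mathrm{KL}\!\left(q_\phi(z \mid x_{1:T}) \,\middle\|\, p(z)\right)
```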

Appendix E More examples and failure cases of DIVE

In this section, we provide more examples, including failure cases, from the three missing-scenario experiments. For each example, the first 10 frames are the inputs, followed by the 10 predicted frames. The top row is the ground truth, and the second-to-last row shows the reconstructions/predictions from dive. We also show the decomposed objects and the learned missingness labels.

(a) A failure case where our DIVE model cannot reconstruct and predict digit “7”, as it doesn’t appear in the input.
(b) A success case where our model recovers the heavily corrupted digit.
Figure 8: More examples for missing Scenario 1: Partial occlusion experiment. The rows for each figure from top to bottom are (1) ground truth, (2) first object, (3) second object, (4) dive predictions, (5) predicted missing labels for each object. We use the same display format for all Moving MNIST examples below.

Figure 8 shows examples for missing scenario 1 (partial occlusion). Figure 8(a) shows a failure case where dive cannot recognize and generate digit “7”, as it only reappears at the very end. This is partially due to our imputation mechanism, which only uses past information, not future information. Figure 8(b) shows a success case where, even though one of the digits is heavily corrupted in the input frames, dive can still reconstruct it. In this case, digit “5” is completely missing in five input frames and is heavily corrupted or overlaps with the other digit in the remaining input frames. Our DIVE model successfully reconstructs and predicts it in almost all frames. It is also important to note that the imputation of the missing digit “5” from the second to the seventh frame is smooth and accurate (see the third row of the figure).

(a) Failure case with two digits overlapping in all input frames.
(b) Success case with two digits overlapping frequently.
Figure 9: Examples for Scenario 2: Out of scene for two time steps.
(a) A failure case where the model misrecognizes the digits.
(b) A success case on two similar digits.
Figure 10: Examples for Scenario 3: varying appearance experiment.

Figure 9 shows more examples from missing Scenario 2 (out of scene). A failure case where DIVE cannot distinguish the two digits is shown in Figure 9(a): the digits overlap with each other in almost every frame, and thus the model recognizes them as one object. We also show a success case where the two digits overlap frequently in Figure 9(b). From these two cases, we conclude that the model needs at least one frame where the two digits are separable in order to generate decent results.

Figure 10 displays more examples for missing Scenario 3 (complete missing with varying appearance). Figure 10(a) shows a failure case where, after the fifth frame, our dive model mis-recognizes the two digits. The switch happens when digit “8” is missing from the scene in the fifth frame, and digits “8” and “0” have similar appearances. After the switch, the model fails to recover the initial assignment of the objects. Although in this example our model still generates decent results, we cannot overlook the potential issue: especially when the trajectories of objects are complex and heterogeneous, confusion in appearance could lead to inaccurate trajectory predictions. Figure 10(b) shows a success case where the two digits are similar.

(a) A failure case with a split object.
(b) A success in a difficult case of overlapping objects.
Figure 11: Qualitative examples for pedestrian (MOTS) dataset. The rows from top to bottom are: (1) ground truth, (2) first object, (3) second object, (4) third object, (5) dive prediction, (6) combined predicted missing labels for each object.

More examples from the MOTS dataset are shown in Figure 11. Figure 11(a) shows a failure case where an object/pedestrian is partly present in a row that should be empty. Given the low displacement of the objects, the model sometimes has problems inferring which entities are independent, which can create duplicated content when we decompose the frames. It can also happen that two objects that are static or move at the same velocity are encoded as a single entity. The failure case also shows that the model cannot predict the appearance of a new object that has not been identified in the input; this is not surprising, as a human would not have been able to make such a prediction either. Figure 11(b) shows a success case where our model encodes each pedestrian properly and generates reasonable predictions. This case is especially difficult because, although no frame is fully missing, two of the objects overlap for several frames in the input sequence.