## 1 Introduction

Deep neural networks have delivered impressive performance on a range of perceptual tasks, but their distributed representations are difficult to interpret and pose a challenge for problems involving reasoning. Motivated by this, the deep learning community has recently explored methods [1–4, 6–8, 12, 15–17, 20, 21] for learning distributed representations which are *disentangled*, i.e. in which a unit (or small group of units) within the latent feature vector is exclusively responsible for capturing a distinct concept within the input signal. This work proposes such an approach for the video domain, where the temporal structure of the signal is leveraged to automatically separate the input into factors that vary independently of one another over time. Our results demonstrate that these factors correspond to distinct objects within the video, thus providing a natural representation for making predictions about future motions of the objects and for subsequent high-level reasoning tasks.

Related work. [22] leverages structure at different time-scales to factor signals into independent components; our approach can handle multiple factors at the same time-scale, instead relying on prediction as the factoring mechanism. Outside of the video domain, [1] proposed to disentangle factors that tend to change independently and sparsely in real-world inputs, while preserving information about them. [15] learned to disentangle factors of variation in synthetic images using weak supervision, and [21] extended the method to be fully unsupervised. Similar to our work, both [3] and [4] propose unsupervised schemes for disentangling video, the latter using a variational approach. However, ours differs in that it uncovers general factors, rather than specific ones such as identity and pose or static/dynamic features as those approaches do.

## 2 Generative model and inference

The intuition behind our model is simple: in any real-world scene, most objects do not physically interact with one another, so their motion can be modeled independently. To find a representation of videos with these same independences, we introduce an approach based on a temporal version [14] of the variational auto-encoder [10], shown in Figure 1. In this model, each video comprises a sequence of frames $x_1, \dots, x_T$. $Z_t$ represents all the latent factors at time $t$, where each factor $z^i_t$ is represented by a vector. Each of these factors evolves independently from the others, and they combine to produce the observation $x_t$ at timestep $t$. Instead of directly maximizing the likelihood $p(x_{1:T})$, we use variational inference [5] to optimize the evidence lower bound (ELBO), that is

$$\mathcal{L} = \sum_{t} \mathbb{E}_{q}\!\left[\log p(x_t \mid Z_t)\right] - \sum_{t} \mathbb{E}_{q}\!\left[D_{\mathrm{KL}}\!\left(q(Z_t \mid Z_{t-1}, x_t) \,\|\, p(Z_t \mid Z_{t-1})\right)\right].$$

For a derivation, see Appendix B. This lower bound naturally splits into two terms. The first, $\mathbb{E}_{q}[\log p(x_t \mid Z_t)]$, is the log-likelihood of the data under our model when sampling from the approximate posterior, that is, the “reconstruction” of $x_t$ from $Z_t$. The second term is the KL divergence between the learned prior $p(Z_t \mid Z_{t-1})$ and the approximate posterior $q(Z_t \mid Z_{t-1}, x_t)$. It represents how far the predictions given by $p(Z_t \mid Z_{t-1})$ are from the values inferred by $q(Z_t \mid Z_{t-1}, x_t)$; it is the prediction error in the latent space. Our goal is to optimize the space of $Z$ to make both reconstruction and prediction possible.
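Because every distribution in the model is a diagonal Gaussian (Section 2), both ELBO terms have simple closed forms. The sketch below is illustrative only; the function names and the per-timestep decomposition are ours, not the paper's.

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag var_q) || N(mu_p, diag var_p) ), summed over dimensions.
    This is the latent prediction-error term of the ELBO."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def gauss_log_likelihood(x, mu, var):
    """Diagonal-Gaussian log-likelihood log p(x | mu, var): the reconstruction term."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def elbo_step(x, recon_mu, recon_var, post_mu, post_var, prior_mu, prior_var):
    """One timestep's ELBO contribution: reconstruction minus latent prediction error."""
    return (gauss_log_likelihood(x, recon_mu, recon_var)
            - kl_diag_gauss(post_mu, post_var, prior_mu, prior_var))
```

When the approximate posterior exactly matches the learned prior, the KL term vanishes and only the reconstruction term remains.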

For the variational approximation, we choose

$$q(Z_{1:T} \mid x_{1:T}) = \prod_{t} q(Z_t \mid Z_{t-1}, x_t),$$

which, analogous to a Kalman filter, considers only the past state and the current observation. This approximation marginalizes over the future, in that $Z_t$ must encode sufficient information to allow correct predictions of the future as well as fit the prior and the current observation. This allows us to do inference on a single frame and ensures our representation retains as much information about the future as possible. If we wish to use our learned representation for a downstream task, we may discard the generative model and use the inference network alone. This inference network can provide a factorized representation given a single image, or use a sequence of images to produce increasingly tight estimates of the latent variables.

Neural network parameterization. We choose all of the distributions in our model to be Gaussian with diagonal covariance. To allow our model to scale to complex nonlinear observations and dynamics, we parameterize each distribution with a neural network. Table 1 describes each of these parameterizations. As each distribution is a diagonal Gaussian, each network produces two outputs, corresponding to the mean and the variance of the distribution.

Training. This model can be thought of as a series of variational autoencoders with a learned prior, and the training procedure is largely similar to [10]. At each timestep $t$ in a sequence, we compute the prior over the latent space, using $p(Z_t \mid Z_{t-1})$ for $t > 1$. We infer the approximate posterior $q(Z_t \mid Z_{t-1}, x_t)$ by observing $x_t$ and compute the KL divergence $D_{\mathrm{KL}}(q \,\|\, p)$. We then sample $Z_t \sim q(Z_t \mid Z_{t-1}, x_t)$ using the reparameterization trick and compute $\log p(x_t \mid Z_t)$. At the end of a sequence we update our parameters by backpropagation to maximize the ELBO (defined above).

| distribution | parameterization | parameter sharing? |
|---|---|---|
| $p(z^i_1)$ | N/A (but same for each $i$) | |
| $p(z^i_t \mid z^i_{t-1})$ | MLP | across $t$ (different params for each $i$) |
| $p(x_t \mid Z_t)$ | DCGAN generator | across $t$ |
| $q(Z_t \mid Z_{t-1}, x_t)$ | DCGAN discriminator | across $t$ |

Table 1: Parameterization of each distribution in the model.
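The training rollout described above can be sketched as a single pass over a sequence. The toy `transition`, `inference`, and `decode` functions below are illustrative stand-ins for the networks in Table 1 (an MLP, a DCGAN-style encoder, and a DCGAN generator in the paper); only the control flow — prior, posterior, reparameterized sample, reconstruction, KL — mirrors the described procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def transition(z_prev):                    # stand-in for p(Z_t | Z_{t-1}) -> (mu, var)
    return 0.9 * z_prev, np.full_like(z_prev, 0.1)

def inference(x_t, prior_mu, prior_var):   # stand-in for q(Z_t | Z_{t-1}, x_t) -> (mu, var)
    return 0.5 * (x_t + prior_mu), 0.5 * prior_var + 0.05

def decode(z_t):                           # stand-in for p(x_t | Z_t): mean, fixed variance
    return z_t, 0.25

def kl(mu_q, var_q, mu_p, var_p):
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def sequence_elbo(xs, z_dim):
    """Accumulate the ELBO over one sequence, as in the training procedure."""
    z_prev = np.zeros(z_dim)
    elbo = 0.0
    for x_t in xs:
        prior_mu, prior_var = transition(z_prev)            # prior over latent space
        post_mu, post_var = inference(x_t, prior_mu, prior_var)
        eps = rng.standard_normal(z_dim)                    # reparameterization trick
        z_t = post_mu + np.sqrt(post_var) * eps
        recon_mu, recon_var = decode(z_t)
        elbo += -0.5 * np.sum(np.log(2 * np.pi * recon_var)
                              + (x_t - recon_mu) ** 2 / recon_var)
        elbo -= kl(post_mu, post_var, prior_mu, prior_var)  # latent prediction error
        z_prev = z_t
    return elbo
```

In the real model the accumulated ELBO is maximized by backpropagation at the end of the sequence; here the stand-in networks have no trainable parameters, so the sketch only shows the forward pass.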

## 3 Experiments

Datasets. We apply this model to two video datasets: the widely-used moving MNIST [19] and a new dataset of real-world videos of 5th Avenue recorded from above, which was collected by the authors. More details about these datasets, including a sample frame from 5th Avenue, can be found in Appendix C.

Baseline. In each of these evaluations we compare with a model
which is identical to the factored model except that its transition
function is not factored (referred to as the
*entangled* model in our experiments). It can be viewed as a
special case of our factored model which has a single high-dimensional
latent factor, and its latent space has the same total dimension as the
corresponding factored model in each experiment. This comparison is
intended to be as tight as possible, with any differences between the
factored model and the baseline coming exclusively from the
factorization in the latent space.

### 3.1 Evaluation

Lower bound. We compare the variational lower bound achieved by our factorized model with that of a non-factorized but otherwise identical model. These experiments reveal the price in terms of data fidelity that we pay for representing the data as independently-changing factors. Table 2 shows that the factored model achieves a lower bound on par with the entangled model.

Correlation structure. By plotting the correlation between samples from the approximate posterior over the latent variables at time $t$ and at time $t+1$, we may observe whether our model has been able to learn a representation for which $z^i_{t+1}$ really does depend only on $z^i_t$ and not on the other latent factors. Note that this only captures the *linear* dependence of these variables; this analysis helps to illustrate the structure of our latent space, but should not be considered definitive. Figure 4 shows that each latent factor is much more correlated with its own previous state than with the other latent factors.
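Such a correlation plot can be computed directly from posterior samples collected into `(num_samples, num_units)` arrays; the helper below is an illustrative sketch, with a function name of our choosing.

```python
import numpy as np

def crosstime_correlation(z_t, z_next):
    """Correlation between each latent unit at time t and each unit at t+1.

    z_t, z_next: arrays of shape (num_samples, num_units) holding posterior
    samples at consecutive timesteps. Returns a (num_units, num_units) matrix
    whose entry (i, j) is corr(z_t[:, i], z_next[:, j]); a block-diagonal
    structure indicates factorized dynamics.
    """
    n_units = z_t.shape[1]
    # np.corrcoef treats rows as variables, so stack the two timesteps
    # and take the off-diagonal block of the (2n x 2n) result.
    full = np.corrcoef(z_t.T, z_next.T)
    return full[:n_units, n_units:]
```

A factored model should show large values on the diagonal blocks (a unit with its own past) and values near zero elsewhere.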

| dataset, model | $I(z^i_t;\, z^i_{t+1})$ | $I(z^i_t;\, z^j_{t+1}),\ j \neq i$ | ELBO |
|---|---|---|---|
| moving MNIST, two factors | 4.61 | 0.75 | -2902 |
| moving MNIST, entangled | 4.66 | 2.61 | -2896 |
| 5th Ave, eight factors | 2.26 | 0.28 | 6830 |
| 5th Ave, entangled | 2.08 | 0.38 | 6800 |

Table 2: Approximate mutual information between each latent factor and its own future versus other factors' futures, together with the ELBO achieved by each model.

Approximate mutual information. We use Kraskov’s method for estimating mutual information [13] to approximate $I(z^i_t;\, z^i_{t+1})$ and compare it to $I(z^i_t;\, z^j_{t+1})$ for $j \neq i$. A latent factor should be much more informative about its own future than a different factor is. The results, shown in Table 2, reveal that the evolution of each factor in the factored models depends almost exclusively on its own past.

For the entangled models, which do not have separate factors, there is
no *a priori* subdivision into high- and low-mutual-information
segments of units. The reported scores were generated by creating 20
random partitionings of the latent units into factors, then
reporting the mutual information numbers for the partitioning that had
the greatest difference between same-factor and cross-factor
information.
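This partition-selection procedure can be sketched as follows. Here `mi` is any pairwise mutual-information estimator (the paper uses Kraskov's method [13]); the function name and the exact scoring bookkeeping are our own illustrative choices.

```python
import itertools
import numpy as np

def best_partition_scores(z_t, z_next, n_factors, n_trials, mi, seed=0):
    """Randomly partition latent units into factors; keep the partition that
    maximizes (same-factor MI) - (cross-factor MI).

    z_t, z_next: (num_samples, num_units) arrays of consecutive samples.
    mi(a, b): mutual-information estimator over two sample matrices.
    Returns (difference, same_factor_mi, cross_factor_mi) for the best partition.
    """
    rng = np.random.default_rng(seed)
    n_units = z_t.shape[1]
    best = None
    for _ in range(n_trials):
        perm = rng.permutation(n_units)
        factors = np.array_split(perm, n_factors)       # one random partitioning
        same = np.mean([mi(z_t[:, f], z_next[:, f]) for f in factors])
        cross = np.mean([mi(z_t[:, a], z_next[:, b])
                         for a, b in itertools.permutations(factors, 2)])
        if best is None or same - cross > best[0]:
            best = (same - cross, same, cross)
    return best
```

This gives the entangled baseline its most favorable possible grouping, making the comparison in Table 2 conservative.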

The entangled models show much more cross-factor information than the factored models in our tests on moving MNIST, where we use two latent factors. However, as the number of factors increases, the average information between a pair of factors naturally diminishes. As a result, on 5th Avenue, where we subdivide the latent units into eight factors, the entangled model shows cross-factor information almost as low as the factored model's.

Independent generations. Finally, we qualitatively evaluate the representations learned by our model. We infer the approximate posterior $q(Z_1 \mid x_1)$, then hold all of the latent variables fixed at their inferred values. We then produce a sequence of generations by picking a single factor $z^i$ to vary, and for each timestep drawing a sample from the approximate posterior for that factor alone, so that only the single factor $z^i_t$ varies with $t$. That is, we hold all but one of the latent factors fixed and allow the single remaining one to vary with the posterior. This allows us to see exactly what that single factor represents in the video. The images generated by this process are shown in Figure 7.
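The procedure can be sketched generically. Below, `decode` and `posterior_sample` stand in for the generator network and for drawing the varying factor from the approximate posterior; both names and the list-of-factors representation are illustrative assumptions, not the paper's API.

```python
import numpy as np

def independent_generation(decode, posterior_sample, z_init, vary_idx, steps):
    """Hold every latent factor fixed at its value from the first frame and
    let a single factor `vary_idx` follow the posterior over time.

    decode(z)               -> generated image for a flat latent vector z.
    posterior_sample(t, i)  -> posterior sample of factor i at timestep t.
    z_init                  -> list of per-factor vectors from q(Z_1 | x_1).
    """
    frames = []
    z = [f.copy() for f in z_init]          # all factors frozen at their t=1 values
    for t in range(steps):
        z[vary_idx] = posterior_sample(t, vary_idx)   # only this factor moves
        frames.append(decode(np.concatenate(z)))
    return frames
```

If factor `i` captures one object, the resulting frames should show only that object moving while the rest of the scene stays static.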

## 4 Discussion

By taking advantage of the structure present in video, our model can pull apart latent factors which change independently and produce a representation composed of semantically meaningful variables. The approach is conceptually simple and based on the insight that if two objects do not interact, they can be predicted independently. In the future we hope to apply a richer family of approximations to scale to more complex data.

## References

[1] Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. *IEEE Transactions on Pattern Analysis and Machine Intelligence* 35, 8 (2013), 1798–1828.

[2] Michael B Chang, Tomer Ullman, Antonio Torralba, and Joshua B
Tenenbaum. 2016. A compositional object-based approach to learning
physical dynamics. *arXiv preprint arXiv:1612.00341* (2016).

[3] Emily Denton and Vighnesh Birodkar. 2017. Unsupervised learning of disentangled representations from video. *CoRR* abs/1705.10915 (2017).

[4] Will Grathwohl and Aaron Wilson. 2016. Disentangling space and
time in video with hierarchical variational auto-encoders. *arXiv
preprint arXiv:1612.04440* (2016).

[5] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. *The Journal of Machine Learning Research* 14 (2013), 1303–1347.

[6] Wei-Ning Hsu, Yu Zhang, and James R. Glass. 2017. Unsupervised
learning of disentangled and interpretable representations from
sequential data. *CoRR* abs/1709.07902, (2017).

[7] Aapo Hyvarinen and Hiroshi Morioka. 2016. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In *NIPS*.

[8] M. Janner and J. Wu. 2017. Learning to generalize intrinsic images with a structured disentangling autoencoder. In *NIPS*.

[9] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for
stochastic optimization. *arXiv preprint arXiv:1412.6980* (2014).

[10] Diederik P Kingma and Max Welling. 2013. Auto-encoding
variational bayes. *arXiv preprint arXiv:1312.6114* (2013).

[11] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp
Hochreiter. 2017. Self-normalizing neural networks. *arXiv preprint
arXiv:1706.02515* (2017).

[12] Klaus Greff, Antti Rasmus, Mathias Berglund, and Juergen Schmidhuber. 2016. Deep unsupervised perceptual grouping. In *NIPS*.

[13] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger.
2004. Estimating mutual information. *Physical review E* 69, 6
(2004), 066138.

[14] Rahul G Krishnan, Uri Shalit, and David Sontag. 2017.
Structured inference networks for nonlinear state space models. In
*AAAI*, 2101–2109.

[15] Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh
Tenenbaum. 2015. Deep convolutional inverse graphics network. In
*Advances in neural information processing systems*, 2530–2538.

[16] Marco Fraccaro, Simon Kamronn, and Ulrich Paquet. 2017. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In *NIPS*.

[17] N. Siddharth, Brooks Paige, and Jan-Willem Van de Meent. 2017. Learning disentangled representations with semi-supervised deep generative models. In *NIPS*.

[18] Alec Radford, Luke Metz, and Soumith Chintala. 2015.
Unsupervised representation learning with deep convolutional generative
adversarial networks. *arXiv preprint arXiv:1511.06434* (2015).

[19] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov.
2015. Unsupervised learning of video representations using lstms. In
*International conference on machine learning*, 843–852.

[20] Valentin Thomas, Jules Pondard, Emmanuel Bengio, Marc Sarfati,
Philippe Beaudoin, Marie-Jean Meurs, Joelle Pineau, Doina Precup, and
Yoshua Bengio. 2017. Independently controllable factors. *arXiv
1708.01289* (2017).

## Appendix A: Network architecture details

All models are trained using the Adam optimizer [9].

### Inference network

For the inference network, which parameterizes the function $q(Z_t \mid Z_{t-1}, x_t)$, we use an architecture derived from the DCGAN discriminator [18]. We use the discriminator architecture to encode the input image $x_t$, then add an additional input to take in the value of $Z_{t-1}$. We pass the inference network the *predicted values* $\mu(Z_t \mid Z_{t-1})$ and $\sigma(Z_t \mid Z_{t-1})$ instead of having it do inference directly from $Z_{t-1}$. We found that this greatly sped up training, as the inference network does not have to learn the transition function in order to fit to it. This is equivalent to sharing parameters between the transition network and the inference network, though the transition parameters are not updated here.

DCGAN image encoder:

```
Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
Conv2d(256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
Conv2d(512, 32, kernel_size=(4, 4), stride=(1, 1))
```

Combining information about $x_t$ with information about $Z_{t-1}$:

```
transformed_x     = Linear(32 -> z_dim)(encoder_output)
transformed_mu    = Linear(z_dim -> z_dim)( mu(Z_t | Z_{t-1}) )
transformed_sigma = Linear(z_dim -> z_dim)( sigma(Z_t | Z_{t-1}) )
latent       = Linear(z_dim * 3 -> z_dim)(transformed_x, transformed_mu, transformed_sigma)
output_mu    = Linear(z_dim -> z_dim)(latent)
output_sigma = Linear(z_dim -> z_dim)(latent)
```

where mu(Z_t | Z_{t-1}) and sigma(Z_t | Z_{t-1}) are the mean and variance vectors, respectively, of the prediction, and z_dim is the number of latent factors times the dimensionality of each factor. output_mu and output_sigma are the mean and diagonal covariance of the approximate posterior; output_sigma is actually produced as a log-variance for numerical reasons. Each layer is followed by a Leaky ReLU activation.

At $t = 1$, when there is no $Z_0$, we pass all-zero vectors instead.

### Generator network

The generator network takes in a latent vector and produces a pixelwise mean for an output image. It has this form:

```
Linear(z_dim -> z_dim)
ConvTranspose2d(z_dim, 512, kernel_size=(4, 4), stride=(1, 1))
BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)
ConvTranspose2d(512, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)
ConvTranspose2d(256, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)
ConvTranspose2d(128, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True)
ConvTranspose2d(64, 3, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
```

where z_dim is the number of latent factors times the dimensionality of each factor. Each layer is followed by a ReLU activation. We use a fixed variance in our Normal observation model of 0.25 for moving MNIST and 0.05 for 5th Avenue. This variance hyperparameter may be tuned to balance the tradeoff between fitting the predictions and making tight reconstructions.

### Transition network

Each latent factor $z^i$ has its own transition function $p(z^i_t \mid z^i_{t-1})$. Each of these has the following form:

```
Linear(latent_dim -> 128)
Linear(128 -> 128)
Linear(128 -> 128)
Linear(128 -> 128)
Linear(128 -> latent_dim * 2)
```

where latent_dim is the dimensionality of a single latent factor and each layer is followed by a SELU activation [11]. The latent_dim * 2 output provides the mean and (diagonal) variance vectors for the Normal distribution. The variance is produced as a log-variance for numerical reasons.

## Appendix B: Derivation of the evidence lower bound

For this section the factored form of the transitions is not relevant, so our development here will use the more general non-factored form; we can substitute in our factored special case later. For simplicity of notation we will use $p$ in place of $p_\theta$, our generative model with parameters $\theta$. Likewise we use $q$ in place of $q_\phi$ to represent our variational approximation function with parameters $\phi$.

We begin with the form of our latent-variable generative model:

$$\log p(x_{1:T}) = \log \int_{Z_{1:T}} p(x_{1:T}, Z_{1:T}) \, dZ_{1:T}.$$

Since the series is Markov,

$$\log p(x_{1:T}) = \log \int_{Z_{1:T}} \prod_{t=1}^{T} p(x_t \mid Z_t) \, p(Z_t \mid Z_{t-1}) \, dZ_{1:T},$$

where we write $p(Z_1 \mid Z_0)$ for the initial prior $p(Z_1)$, and similarly $q(Z_1 \mid Z_0, x_1)$ for $q(Z_1 \mid x_1)$. We introduce our variational auxiliary functions:

$$\log p(x_{1:T}) = \log \int_{Z_{1:T}} \prod_{t=1}^{T} \frac{q(Z_t \mid Z_{t-1}, x_t)}{q(Z_t \mid Z_{t-1}, x_t)} \, p(x_t \mid Z_t) \, p(Z_t \mid Z_{t-1}) \, dZ_{1:T}.$$

We may now convert this integral to an expectation with respect to $q$:

$$\log p(x_{1:T}) = \log \mathbb{E}_{q}\!\left[ \prod_{t=1}^{T} \frac{p(x_t \mid Z_t) \, p(Z_t \mid Z_{t-1})}{q(Z_t \mid Z_{t-1}, x_t)} \right].$$

By Jensen’s inequality,

$$\log p(x_{1:T}) \geq \mathbb{E}_{q}\!\left[ \sum_{t=1}^{T} \log p(x_t \mid Z_t) + \log p(Z_t \mid Z_{t-1}) - \log q(Z_t \mid Z_{t-1}, x_t) \right].$$

Realizing that the expectations over the prior and posterior terms are KL divergences gives us an objective we can optimize:

$$\mathcal{L} = \sum_{t=1}^{T} \mathbb{E}_{q}\!\left[\log p(x_t \mid Z_t)\right] - \sum_{t=1}^{T} \mathbb{E}_{q}\!\left[ D_{\mathrm{KL}}\!\left( q(Z_t \mid Z_{t-1}, x_t) \,\|\, p(Z_t \mid Z_{t-1}) \right) \right].$$

This lower bound lies below the true log-probability by an additive term of $D_{\mathrm{KL}}\!\left( q(Z_{1:T} \mid x_{1:T}) \,\|\, p(Z_{1:T} \mid x_{1:T}) \right)$ [10]. As the variational approximation improves (i.e., approaches the true posterior), this lower bound approaches the true log-likelihood of the data.

This bound holds for any transition function $p(Z_t \mid Z_{t-1})$. Our model factors this general transition function:

$$p(Z_t \mid Z_{t-1}) = \prod_{i} p(z^i_t \mid z^i_{t-1}),$$

which corresponds to a hidden Markov model with multiple Markov chains running in parallel in the latent space.
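A factored transition of this form amounts to running $K$ independent per-factor samplers in parallel. The sketch below is an illustrative rendering of that structure, not the paper's implementation; each `transitions[i]` plays the role of the per-factor MLP.

```python
import numpy as np

def factored_transition_sample(z_prev_factors, transitions, rng):
    """Sample Z_t = (z^1_t, ..., z^K_t) under the factored transition
    p(Z_t | Z_{t-1}) = prod_i p(z^i_t | z^i_{t-1}).

    z_prev_factors: list of K per-factor vectors for Z_{t-1}.
    transitions[i](z) -> (mu, var) for factor i; factors never mix.
    """
    out = []
    for z_i, trans_i in zip(z_prev_factors, transitions):
        mu, var = trans_i(z_i)                               # factor i sees only itself
        out.append(mu + np.sqrt(var) * rng.standard_normal(mu.shape))
    return out
```

Because each factor's sample depends only on that factor's own past, the latent space behaves as parallel Markov chains, exactly as described above.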

## Appendix C: Datasets

### Moving MNIST

This dataset consists of two digits from the MNIST dataset bouncing in a 64x64 pixel frame. Each digit is on a separate plane of the input (i.e., one is red and the other is green). The digits have randomized starting location and velocity vector for each sequence, but their motion is deterministic over the course of the sequence and the digits do not interact.

### 5th Avenue

The 5th Avenue dataset has greater complexity in its visuals and its dynamics than moving MNIST, but was designed to be simple enough to model with some fidelity using contemporary techniques. It consists of around 20 hours of video sampled at 2 frames per second. The videos were recorded from the 5th floor of a building overlooking 5th Avenue in Manhattan and show the busy street scene below including pedestrians and passing cars. Each video was recorded with a fixed camera position; between videos the camera position is nearly the same but may vary slightly. The data includes global variations such as time of day and weather. It was recorded on an iPhone 7 at 1080p resolution, though in our experiments we resize it to 64x64. A representative example image is shown in Figure 8.
