Disentangling Representations using Gaussian Processes in Variational Autoencoders for Video Prediction

We introduce MGP-VAE, a variational autoencoder which uses Gaussian processes (GPs) to model the latent space distribution. We employ MGP-VAE for the unsupervised learning of video sequences to obtain disentangled representations. Previous work in this area has mainly been confined to separating dynamic information from static content. We improve on previous results by establishing a framework by which multiple features, static or dynamic, can be disentangled. Specifically, we use fractional Brownian motions (fBM) and Brownian bridges (BB) to enforce an inter-frame correlation structure in each independent channel, and show that varying this correlation structure enables one to capture different aspects of variation in the data. We demonstrate the quality of our disentangled representations through numerous experiments on three publicly available datasets, and also perform quantitative tests on a video prediction task. In addition, we introduce a novel geodesic loss function which takes into account the curvature of the data manifold to improve learning in the prediction task. Our experiments show quantitatively that the combination of our improved disentangled representations with the novel loss function enables MGP-VAE to outperform the state-of-the-art in video prediction.





1 Introduction

Finding good representations for data is one of the main goals of unsupervised machine learning [3]. Ideally, these representations reduce the dimensionality of the data, and are structured such that the different factors of variation in the data get distilled into different channels. This process of disentanglement in generative models is useful because, in addition to making the data interpretable, the disentangled representations can also be used to improve downstream tasks such as prediction.

In prior work on the unsupervised learning of video sequences, a fair amount of effort has been devoted to separating motion, or dynamic information, from static content [7, 11, 14, 20, 28]. To achieve this goal, the model is typically structured with dual pathways, e.g. two separate networks that capture motion and semantic content respectively [7, 28].

Such frameworks may be restrictive as it is not immediately clear how to extend them to extract multiple static and dynamic features. Furthermore, in complex videos, there usually is not a clear dichotomy between motion and content, e.g. in videos containing dynamic information ranging over different time-scales.

In this paper, we address this challenge by proposing a new variational autoencoder, MGP-VAE, for the unsupervised learning of video sequences. It utilizes a latent prior distribution that consists of multiple channels of fractional Brownian motions and Brownian bridges. By varying the correlation structure along the time dimension in each channel to pick up different static or dynamic features, while maintaining independence between channels, MGP-VAE is able to learn multiple disentangled factors.

We then demonstrate quantitatively the quality of our disentangled representations using a frame prediction task. To improve prediction quality, we also employ a novel geodesic loss function which incorporates the manifold structure of the data to enhance the learning process.

Figure 1: Network illustration of MGP-VAE

Our main contributions can be summarized as follows:

  • We use Gaussian processes as the latent prior distribution in our model MGP-VAE to obtain disentangled representations for video sequences. Specifically, we structure the latent space by varying the correlation between video frame distributions so as to extract multiple factors of variation from the data.

  • We introduce a novel loss function which utilizes the structure of the data manifold to improve prediction. In particular, the actual geodesic distance between the predicted point and its target on the manifold is used instead of squared-Euclidean distance in the latent space.

  • We test MGP-VAE against various other state-of-the-art models on three datasets and demonstrate quantitatively that our model outperforms the competition in video prediction.

2 Related Work

2.1 Disentangled representation learning for video sequences

There are several methods for improving the disentanglement of latent representations in generative models. InfoGAN [6] augments generative adversarial networks [10] by additionally maximizing the mutual information between a subset of the latent variables and the recognition network output. beta-VAE [13] adds a single coefficient ($\beta$) to the KL divergence term in the evidence lower bound of a VAE. It has been demonstrated that increasing $\beta$ beyond unity improves disentanglement, but this also comes at the price of increased reconstruction loss [16]. To counteract this trade-off, both FactorVAE [16] and $\beta$-TCVAE [5] further decompose the KL divergence term, and identify a total correlation term which, when penalized, directly encourages factorization in the latent distribution.

With regard to the unsupervised learning of sequences, there have been several attempts to separate dynamic information from static content [7, 11, 14, 20, 28]. In [20], one latent variable is set aside to represent content, separate from another set of variables used to encode dynamic information, and they employ this graphical model for the generation of new video and audio sequences.

[28] proposes MCnet, which uses a convolutional LSTM for encoding motion and a separate CNN to encode static content. The network is trained using standard loss plus a GAN term to generate sharper frames. DRNet [7] adopts a similar architecture, but uses a novel adversarial loss which penalizes semantic content in the dynamic pathway to learn pose features.

[14] proposes DDPAE, a model with a VAE structure that performs decomposition on video sequences with multiple objects in addition to disentanglement. In their experiments, they show quantitatively that DDPAE outperforms MCnet and DRNet in video prediction on the Moving MNIST dataset.

2.2 VAEs and Gaussian process priors

In [11], a variational auto-encoder which structures its latent space distribution into two components is used for video sequence learning. The “slow” channel extracts static features from the video, and the “fast” channel captures dynamic motion. Our approach is inspired by this method, and we go further by giving a principled way to shape the latent space prior so as to disentangle multiple features.

Outside of video analysis, VAEs with a Gaussian process prior have also been explored. In [4], they propose GPPVAE and train it on image datasets of different objects in various views. The latent representation is a function of an object vector and a view vector, and has a Gaussian prior imposed on it. They also introduce an efficient method to speed up computation of the covariance matrices.

In [8], a deep VAE architecture is used in conjunction with a Gaussian process to model correlations in multivariate time series such that inference can be performed on missing data-points.

Bayes-Factor VAE [17] uses a hierarchical Bayesian model to extend the VAE. As with our work, they recognize the limitations of restricting the latent prior distribution to standard normal, but they adopt heavy-tailed distributions as an alternative rather than Gaussian processes.

2.3 Data manifold learning

Recent work has shown that distances in latent space are not representative of the true distances between data-points [1, 19, 25]. Rather, deep generative models learn a mapping from the latent space to the data manifold, a smoothly varying lower-dimensional subset of the original data space.

In [21], closed curves are abstractly represented as points on a shape manifold which incorporates the constraints of scale, rotational and translational invariance. The geodesic distance between points on this manifold is then used to give an improved measure of dissimilarity. In [26], several metrics are proposed to quantify the curvature of data manifolds arising from VAEs and GANs.

3 Method

In this section, we review the preliminaries on VAEs and Gaussian processes, and describe our model MGP-VAE in detail.

3.1 VAEs

Variational autoencoders [18] are powerful generative models which reformulate autoencoders in the framework of variational inference. Given latent variables $z$, the decoder, typically a neural network, models the generative distribution $p_\theta(x \mid z)$, where $x$ denotes the data. Due to the intractability of computing the posterior distribution $p_\theta(z \mid x)$, an approximation $q_\phi(z \mid x)$, again parameterized by another neural network called the encoder, is used. Maximizing the log-likelihood of the data can be achieved by maximizing the evidence lower bound

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right],$$

which is equal to

$$\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right), \qquad (1)$$

with $p(z)$ denoting the prior distribution of the latent variables.

The negative of the first term in (1) is the reconstruction loss, and can be approximated by

$$-\frac{1}{L} \sum_{l=1}^{L} \log p_\theta\!\left(x \mid z^{(l)}\right), \qquad z^{(l)} \sim q_\phi(z \mid x),$$

where $z^{(l)}$ is drawn ($L$ times) from the latent distribution, although typically only one sample is required in each pass as long as the batch size is sufficiently large [18]. If $p_\theta(x \mid z)$ is modeled to be Gaussian, then this is simply mean-squared error.
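As a concrete sketch of these two terms, the snippet below (our own illustration, with a stand-in `decode` function and illustrative shapes rather than the paper's networks) computes a one-sample Monte Carlo estimate of the reconstruction term and the closed-form KL for a diagonal-Gaussian encoder under a standard-normal prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_terms(x, mu, log_var, decode):
    """One-sample estimate of the two ELBO terms for a Gaussian decoder
    and a standard-normal prior (sketch; `decode` stands in for the
    decoder network)."""
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps       # reparameterization trick
    recon = np.mean((x - decode(z)) ** 2)      # -E_q[log p(x|z)] up to constants
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon, kl

# Toy check: an encoder output matching the prior gives zero KL.
mu, log_var = np.zeros(4), np.zeros(4)
recon, kl = elbo_terms(np.zeros(4), mu, log_var, decode=lambda z: z)
```

In practice both terms are computed on mini-batches and the KL term is evaluated against the Gaussian-process prior described in Section 3.3 rather than $\mathcal{N}(0, I)$.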

3.2 Gaussian processes

Given an index set $S$, $\{X_t\}_{t \in S}$ is a Gaussian process [12, 29] if for any finite set of indices $\{t_1, \ldots, t_n\}$ of $S$, $(X_{t_1}, \ldots, X_{t_n})$ is a multivariate normal random variable. In this paper, we are concerned primarily with the case where $S$ indexes time, i.e. $S = [0, T]$ or $S = \{0, 1, \ldots, T\}$, in which case the process can be uniquely characterized by its mean and covariance functions

$$m(t) = \mathbb{E}[X_t], \qquad k(s, t) = \operatorname{Cov}(X_s, X_t).$$

The following Gaussian processes are frequently encountered in stochastic models, e.g. in financial modeling [2, 9], and the prior distributions employed in MGP-VAE will be the appropriately discretized versions of these processes.

Figure 2: Sample paths for various Gaussian processes. Top-left: Brownian bridge from -2 to 2; top-right: fBM with H = 0.1; bottom-left: standard Brownian motion; bottom-right: fBM with H = 0.9

Fractional Brownian motion (fBM). Fractional Brownian motion [22] is a Gaussian process $B^H_t$ parameterized by a Hurst parameter $H \in (0, 1)$, with mean and covariance functions given by

$$\mathbb{E}\left[B^H_t\right] = 0, \qquad \operatorname{Cov}\left(B^H_s, B^H_t\right) = \tfrac{1}{2}\left(|s|^{2H} + |t|^{2H} - |s - t|^{2H}\right). \qquad (2)$$

When $H = \tfrac{1}{2}$, $B^H_t$ is standard Brownian motion [12] with independent increments, i.e. the discrete sequence $\{B^{1/2}_n\}_{n \geq 0}$ is a simple symmetric random walk with increments $B^{1/2}_{n+1} - B^{1/2}_n \sim \mathcal{N}(0, 1)$.

Most notably, when $H \neq \tfrac{1}{2}$, the process is not Markovian. When $H > \tfrac{1}{2}$, the disjoint increments of the process are positively correlated, whereas when $H < \tfrac{1}{2}$, they are negatively correlated. We will demonstrate in our experiments how tuning $H$ affects the clustering of the latent code.
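The covariance in (2) is straightforward to discretize over frame times. The sketch below (our own illustration, not the paper's code) builds the covariance matrix and checks that $H = 0.5$ recovers the Brownian-motion covariance $\min(s, t)$:

```python
import numpy as np

def fbm_cov(times, hurst):
    """Covariance matrix of fractional Brownian motion, eq. (2):
    Cov(B^H_s, B^H_t) = 0.5 * (s^{2H} + t^{2H} - |s - t|^{2H})."""
    s = np.asarray(times, dtype=float)[:, None]
    t = np.asarray(times, dtype=float)[None, :]
    h2 = 2.0 * hurst
    return 0.5 * (s ** h2 + t ** h2 - np.abs(s - t) ** h2)

times = np.arange(1, 6)               # five "frames"
cov_bm = fbm_cov(times, hurst=0.5)    # equals min(s, t): standard Brownian motion
```

The correlation of disjoint increments can be read off this matrix directly: e.g. $\operatorname{Cov}(B_2 - B_1, B_4 - B_3) = k(2,4) - k(2,3) - k(1,4) + k(1,3)$ is positive for $H = 0.9$ and negative for $H = 0.1$.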

Brownian bridge (BB). The Brownian bridge [9, 15] from $a$ to $b$ on the domain $[0, T]$ is the Gaussian process defined as

$$X_t = a\left(1 - \frac{t}{T}\right) + b\,\frac{t}{T} + \left(W_t - \frac{t}{T} W_T\right). \qquad (3)$$

The mean function of its stochastic part is identically zero, and its covariance function is given by

$$\operatorname{Cov}(X_s, X_t) = \min(s, t) - \frac{st}{T}. \qquad (4)$$

It can also be represented as the solution to the stochastic differential equation [15]

$$dX_t = \frac{b - X_t}{T - t}\,dt + dW_t, \qquad X_0 = a,$$

with solution

$$X_t = a\left(1 - \frac{t}{T}\right) + b\,\frac{t}{T} + (T - t)\int_0^t \frac{dW_s}{T - s}.$$

From (3), its defining characteristic is that it is pinned at the start and the end such that $X_0 = a$ and $X_T = b$ almost surely.
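A discretized bridge can be sampled directly from (3) and (4). The snippet below (a sketch under the assumption of a unit horizon $T = 1$) draws a path whose mean interpolates linearly from $a$ to $b$ and whose fluctuations vanish at the endpoints:

```python
import numpy as np

rng = np.random.default_rng(0)

def bridge_path(a, b, times, T=1.0):
    """Sample a Brownian bridge from a to b at interior times 0 < t < T:
    the deterministic line of eq. (3) plus a centered bridge with
    covariance min(s, t) - s*t/T (eq. (4))."""
    t = np.asarray(times, dtype=float)
    cov = np.minimum.outer(t, t) - np.outer(t, t) / T
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(len(t)))  # jitter for stability
    line = a * (1.0 - t / T) + b * (t / T)                # mean of the a-to-b bridge
    return line + L @ rng.standard_normal(len(t))

times = np.linspace(0.1, 0.9, 9)
path = bridge_path(-2.0, 2.0, times)   # pinned near -2 as t -> 0 and 2 as t -> 1
```

Averaging many such paths recovers the straight line from $a$ to $b$, while individual paths fluctuate most in the middle of the interval.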

3.3 MGP-VAE

For VAEs in the unsupervised learning of static images, the latent distribution $p(z)$ is typically a simple Gaussian distribution, i.e. $z \sim \mathcal{N}(0, I)$. For a video sequence input with $f$ frames, we model the corresponding latent code as

$$z = \left(z^{(1)}, \ldots, z^{(c)}\right), \qquad z^{(i)} \sim \mathcal{N}(\mu_i, \Sigma_i) \in \mathbb{R}^f.$$

Here $c$ denotes the number of channels, where one channel corresponds to one sampled Gaussian path, and for each channel $i$, $\mu_i$ and $\Sigma_i$ are the mean and covariance of $(B^H_1, \ldots, B^H_f)$ in the case of fBM or $(X_1, \ldots, X_f)$ in the case of the Brownian bridge. $B^H_0$ and $X_0$ are the initial distributions, and $X_T$ is the terminal distribution for the Brownian bridge; they are set to be standard normal, and we experiment with different values for $H$. The covariances can be computed using (2) and (4) and are not necessarily diagonal, which enables us to model more complex inter-frame correlations.

Rewriting $\mathcal{N}(\mu_i, \Sigma_i)$ as $\mu_i + \mathcal{N}(0, \Sigma_i)$, for each channel $i$, we sample $z^{(i)}$ by sampling $\epsilon \sim \mathcal{N}(0, I)$ from a standard normal and computing

$$z^{(i)} = \mu_i + L_i\,\epsilon,$$

where $L_i$ is the lower-triangular Cholesky factor of $\Sigma_i$, i.e. $\Sigma_i = L_i L_i^\top$.
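In code, this per-channel sampling step is a standard reparameterized draw with a full inter-frame covariance (a sketch; the Brownian-motion covariance below is only an example prior):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_channel(mu, cov):
    """Reparameterized sample of one latent channel with full
    covariance: z = mu + L @ eps, where cov = L L^T."""
    L = np.linalg.cholesky(cov)
    eps = rng.standard_normal(len(mu))
    return mu + L @ eps

# e.g. a 5-frame channel with the Brownian-motion covariance min(s, t)
frames = np.arange(1.0, 6.0)
cov = np.minimum.outer(frames, frames)
z = sample_channel(np.zeros(5), cov)
```

Because the Cholesky factor is lower-triangular, gradients flow through the sample exactly as in the diagonal-covariance reparameterization trick.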

The output of the encoder is a mean vector $\tilde{\mu}$ and a symmetric positive-definite matrix $\tilde{\Sigma}$, i.e. $q_\phi(z \mid x) = \mathcal{N}(\tilde{\mu}, \tilde{\Sigma})$, and to compute the KL divergence term in (1), we use the formula

$$D_{KL}\!\left(\mathcal{N}(\tilde{\mu}, \tilde{\Sigma}) \,\middle\|\, \mathcal{N}(\mu, \Sigma)\right) = \frac{1}{2}\left(\operatorname{tr}\!\left(\Sigma^{-1}\tilde{\Sigma}\right) + (\mu - \tilde{\mu})^\top \Sigma^{-1} (\mu - \tilde{\mu}) - f + \ln\frac{\det \Sigma}{\det \tilde{\Sigma}}\right).$$

Following [13], we add a factor $\beta$ to the KL divergence term to improve disentanglement. We will describe the details of the network architecture of MGP-VAE in Section 4.
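This closed form can be checked directly; the helper below (our own sketch, using log-determinants for numerical stability) implements the KL divergence between two full-covariance Gaussians:

```python
import numpy as np

def kl_mvn(mu_q, cov_q, mu_p, cov_p):
    """KL( N(mu_q, cov_q) || N(mu_p, cov_p) ) for full covariance
    matrices, following the closed form above."""
    f = len(mu_q)
    cov_p_inv = np.linalg.inv(cov_p)
    diff = mu_p - mu_q
    _, logdet_q = np.linalg.slogdet(cov_q)
    _, logdet_p = np.linalg.slogdet(cov_p)
    return 0.5 * (np.trace(cov_p_inv @ cov_q)
                  + diff @ cov_p_inv @ diff
                  - f + logdet_p - logdet_q)

I3 = np.eye(3)
kl_same = kl_mvn(np.zeros(3), I3, np.zeros(3), I3)   # identical Gaussians
```

As a sanity check, the divergence between identical Gaussians is zero, and it is strictly positive whenever the two distributions differ.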

3.4 Geodesic loss function

                     Moving MNIST               Coloured dSprites   Sprites
Gaussian processes   Channel 1: fBM (H = 0.1)   5 channels of BBs   5 channels of BBs
                     Channel 2: fBM (H = 0.9)
                     0.25                       0.25                0.25
                     2                          2                   2
Learning rate        0.001                      0.008               0.010
No. of epochs        200                        120                 150

Table 1: Hyperparameter settings for all datasets

For video prediction, we use a simple three-layer MLP with ReLU activations to predict the last $k$ frames of the sequence given the first $f - k$ frames as input. The MLP is trained on the output of the encoder, i.e. on points in the latent space, rather than on the actual frame data, so as to best utilize the disentangled representations.
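As an illustration of such a predictor (with made-up layer widths and latent sizes, since the text does not pin these down), a three-layer ReLU MLP acting on flattened latent codes looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: e.g. 4 observed frames x 5 channels in, one frame's code out.
sizes = (20, 64, 64, 5)
params = [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_predict(z_in):
    """Three-layer ReLU MLP: maps the latent codes of the observed
    frames to the latent code of the frame to predict."""
    h = z_in
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)   # ReLU on the two hidden layers
    return h

z_pred = mlp_predict(rng.standard_normal(20))   # predicted latent code
```

The predicted latent code is then passed through the decoder to produce the predicted frame.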

The action of the decoder can be viewed as a differentiable map $g$ from the latent space to the data manifold $M$. Given an encoder output $z_0$ and a target $z_N$, we use the geodesic distance between $g(z_0)$ and $g(z_N)$ on $M$ as the loss function instead of the usual squared-distance $\|z_0 - z_N\|^2$ in the latent space. We use the following algorithm from [25] to compute the geodesic distance.

Input: two points $z_0, z_N$; learning rate $\alpha$
Output: discrete geodesic path $z_0, z_1, \ldots, z_N$
Initialize $z_1, \ldots, z_{N-1}$ as the linear interpolation between $z_0$ and $z_N$
while the change in the energy $E$ exceeds a threshold do
       for $i = 1, \ldots, N - 1$ do
             Compute the gradient $\nabla_{z_i} E$ using (6) and update $z_i \leftarrow z_i - \alpha \nabla_{z_i} E$
       end for
end while
Algorithm 1 Geodesic Interpolation

This algorithm finds the minimum of the energy of the path (and thus the geodesic)

$$E(z_0, \ldots, z_N) = \frac{1}{2} \sum_{i=0}^{N-1} \left\| g(z_{i+1}) - g(z_i) \right\|^2 \qquad (5)$$

by computing its gradient at each interior point,

$$\nabla_{z_i} E = -J_g(z_i)^\top \left( g(z_{i+1}) - 2\,g(z_i) + g(z_{i-1}) \right), \qquad (6)$$

where $J_g$ denotes the Jacobian of the decoder $g$.

Algorithm 1 initializes $z_1, \ldots, z_{N-1}$ to be uniformly-spaced points along the line between $z_0$ and $z_N$ and gradually modifies them until the change in energy falls below a predetermined threshold. At this point, we use $z_1$ as the target instead of $z_N$, as $z_1 - z_0$ is more representative of the direction in which to update the prediction such that the geodesic distance is minimized (see Figure 3 for an illustration).
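To make the procedure concrete, the sketch below runs Algorithm 1 with a toy one-dimensional "decoder" $g(z) = (z, z^2)$ in place of the trained network; the straight latent-space line is bent by gradient descent on (5) using the gradient (6):

```python
import numpy as np

def g(z):
    """Toy 'decoder' mapping a 1-D latent onto a parabola in data space
    (a stand-in for the trained decoder)."""
    return np.stack([z, z ** 2], axis=-1)

def jac_g(z):
    """Jacobian dg/dz of the toy decoder, shape (..., 2)."""
    return np.stack([np.ones_like(z), 2.0 * z], axis=-1)

def path_energy(zs):
    d = np.diff(g(zs), axis=0)
    return 0.5 * np.sum(d ** 2)

def geodesic_interpolate(z0, zN, n_pts=6, lr=0.05, n_iters=200):
    """Algorithm 1 (sketch): start from the straight line in latent
    space and descend the energy gradient at the interior points."""
    zs = np.linspace(z0, zN, n_pts)
    for _ in range(n_iters):
        gz = g(zs)
        # eq. (6): grad_i E = -J_g(z_i)^T (g_{i+1} - 2 g_i + g_{i-1})
        lap = gz[2:] - 2.0 * gz[1:-1] + gz[:-2]
        grad = -np.sum(jac_g(zs[1:-1]) * lap, axis=-1)
        zs[1:-1] -= lr * grad
    return zs

zs = geodesic_interpolate(-1.0, 1.0)   # endpoints stay fixed
```

After optimization the interior points are spaced so that consecutive images $g(z_i)$ are roughly equidistant along the curve, which is exactly what makes $z_1$ a sensible surrogate target.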

Figure 3: Using the geodesic loss function as compared to squared-distance loss for prediction. By setting the target as $z_1$ instead of $z_N$, the model is constrained to improve its predictions along the manifold.

4 Experiments

In this section, we present experiments which demonstrate MGP-VAE’s ability to disentangle multiple factors of variation in video sequences.

4.1 Datasets

Moving MNIST (http://www.cs.toronto.edu/~nitish/unsupervised_video) [27] consists of moving gray-scale hand-written digits. We generate 60,000 sequences for training, each with a single digit moving in a random direction across frames and bouncing off edges.

Coloured dSprites is a modification of the dSprites dataset (https://github.com/deepmind/dsprites-dataset) [13]. It consists of 2D shapes (square, ellipse, heart) with 6 values for scale and 40 values for orientation. We modify the dataset by adding 3 variations for colour (red, green, blue) and constrain the motion in each video sequence to be simple horizontal or vertical motion.

For each sequence, the scale is set to gradually increase or decrease a notch in each frame. Similarly, after an initial random selection for orientation, the subsequent frames rotate the shape clockwise or anti-clockwise one notch per frame. The final dataset consists of a total of approximately 100,000 datapoints.

Sprites [24] comprises around 17,000 animations of synthetically rendered animated caricatures. There are 7 attributes: body type, sex, hair type, arm type, armor type, greaves type, and weapon type, with a total of 672 unique characters. In each animation, the physical traits of the sprite remain constant while the pose (hand movement, leg movement, orientation) is varied.

4.2 Network architecture and implementation details

For the encoder, we use 8 convolutional layers with batch normalization between each layer. The number of filters begins with 16 in the first layer and increases to a maximum of 128 in the last layer. An MLP layer follows the last convolutional layer, and this is followed by another batch normalization layer. Two separate MLP layers are then applied: one outputs a lower-triangular matrix which represents the Cholesky factor of the covariance matrix of the latent code, and the other outputs the mean vector.

For the decoder, we have 7 deconvolutional layers, with batch normalization between each layer. The first layer begins with 64 filters and this decreases to 16 filters by the last layer. We use ELU for the activation functions between all layers to ensure differentiability, with the exception of the last layer, where we use a hyperbolic tangent function.

Figure 4: Results from swapping latent channels: Moving MNIST

Table 1 lists the settings for the parameters in the experiments. All channels utilizing Brownian bridge (BB) are conditioned to start at -2 and end at 2.

4.3 Qualitative analysis

Moving MNIST. From the results in Figure 4, we see that channel 1 (fBM, $H = 0.1$) captures the digit identity, whereas channel 2 (fBM, $H = 0.9$) captures the motion.

(a) fBM, H = 0.1
(b) fBM, H = 0.9
Figure 5: Latent space visualization of fBM channels for 6 videos. Each point represents one frame of a video.

Figure 5 gives a visualization of the latent space (here we use two channels with $H = 0.1$ and two channels with $H = 0.9$). In our experiments, we observe that fBM channels with $H = 0.9$ are able to better capture motion in comparison to setting $H = 0.5$ (simple symmetric random walk, cf. [11]). We hypothesize that shifting the value of $H$ away from that of the static channel sets the distributions apart and allows for better disentanglement.

Figure 6: Results from swapping latent channels: Sprites

Sprites. In Figure 6, it is clear that channel 1 captures hair type, channel 2 captures armor type, channel 3 captures weapon type, and channel 4 captures body orientation.

Coloured dSprites. In the examples in Figure 7, we observe that channel 2 captures shape, channel 3 captures scale, channel 4 captures orientation and position, and channel 5 captures color.

Remarks. The disentanglement results were the best for Moving MNIST, where we were able to achieve perfect disentanglement in more than 90% of the cases. We were also able to consistently disentangle three or more features in Coloured dSprites and Sprites, but disentanglement of four or more features occurred less frequently due to their complexity.

We found that including more channels than the number of factors of variation in the dataset improved disentanglement, even though the extra channels did not encode anything new. Finally, for the Coloured dSprites and Sprites datasets, we originally experimented with different combinations of fBMs (with varying $H$) and Brownian bridges, but found that simply using 4-5 channels of Brownian bridges gave comparable results.

4.4 Prediction results

For prediction training with the geodesic loss function, we set the number of interpolated points to be four. In addition, to speed up the algorithm for faster training, we ran the loop in Algorithm 1 for a fixed number of iterations (15-20) instead of until convergence.

We compute the pixel-wise mean-squared error (MSE) and binary cross-entropy (BCE) between the $k$ predicted frames and the actual last $k$ frames, given the first $f - k$ frames as input. We ran the tests for each dataset with the values of $k$ reported in Tables 2 and 3.
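For reference, the two scores can be computed per frame as below (a sketch; we sum over pixels here, though the reduction convention is an assumption on our part):

```python
import numpy as np

def frame_metrics(pred, target, eps=1e-7):
    """Pixel-wise MSE and binary cross-entropy between a predicted
    frame and the ground-truth frame, with pixel values in [0, 1]."""
    p = np.clip(pred, eps, 1.0 - eps)   # avoid log(0) in the BCE term
    mse = np.sum((pred - target) ** 2)
    bce = -np.sum(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return mse, bce
```

A perfect prediction gives zero MSE and a BCE that vanishes up to the clipping constant.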

We compare the results of MGP-VAE against MCnet (https://github.com/rubenvillegas/iclr2017mcnet) [28], DRNet (https://github.com/ap229997/DRNET) [7], DDPAE (https://github.com/jthsieh/DDPAE-video-prediction) [14] and the model in [11] across all three datasets. We report the results for $k = 1$ and $k = 2$ for Moving MNIST in Table 2, and for the other two datasets in Table 3.

                               k = 1             k = 2
                               MSE     BCE       MSE     BCE
MCnet [28]                     50.1    248.2     91.1    595.5
DRNet [7]                      45.2    236.7     86.3    586.7
DDPAE [14]                     35.2    201.6     75.6    556.2
Grathwohl, Wilson [11]         59.3    291.2     112.3   657.2
MGP-VAE                        25.4    198.4     72.2    554.2
MGP-VAE (with geodesic loss)   18.5    185.1     69.2    531.4

Table 2: Prediction results on Moving MNIST.
                               Coloured dSprites   Sprites
                               MSE     BCE         MSE     BCE
MCnet [28]                     20.2    229.5       100.3   2822.6
DRNet [7]                      15.2    185.2       94.4    2632.1
DDPAE [14]                     12.6    163.1       75.4    2204.1
MGP-VAE                        6.1     85.2        68.8    1522.5
MGP-VAE (with geodesic loss)   4.5     70.3        61.6    1444.4

Table 3: Last-frame prediction results for Coloured dSprites and Sprites.

The results show that MGP-VAE (the code will be made available online upon publication of this paper), even without the geodesic loss function, outperforms the other models. Using the geodesic loss function further lowers MSE and BCE. DDPAE, a state-of-the-art model, achieves comparable results, although we note that we had to train it considerably longer on the Coloured dSprites and Sprites datasets, as compared to Moving MNIST, to reach the same performance.

Figure 7: Results from swapping latent channels: Coloured dSprites

Using the geodesic loss function during training also leads to qualitatively better results. As we ensure that each prediction is updated such that it lies on the data manifold, we are able to produce more well-formed images. In Figures 8, 9 and 10, we select sequences with large MSE and BCE losses, where the predicted point generates an image frame which can differ considerably from the actual frame when the standard loss function is used. The results show that this is rectified when we use the geodesic loss function. (In each of these figures, the top row depicts the original video and the bottom row depicts the video with the predicted last frame.)

(a) Without geodesic loss
(b) With geodesic loss
Figure 8: Qualitative improvements from using the geodesic loss function: Moving MNIST
(a) Without geodesic loss
(b) With geodesic loss
Figure 9: Qualitative improvements from using the geodesic loss function: Sprites
(a) Without geodesic loss
(b) With geodesic loss
Figure 10: Qualitative improvements from using the geodesic loss function: Coloured dSprites

5 Conclusion

We introduce MGP-VAE, a variational autoencoder for obtaining disentangled representations from video sequences in an unsupervised manner. MGP-VAE uses Gaussian processes, such as fractional Brownian motion and Brownian bridge, as a prior distribution for the latent space. We demonstrate that different parameterizations of these Gaussian processes allow one to extract different static and time-varying features from the data.

After training the encoder which outputs a disentangled representation of the input, we demonstrate the efficiency of the latent code by using it as input to a MLP for video prediction. We run experiments on three different datasets and demonstrate that MGP-VAE outperforms other state-of-the-art models in video frame prediction. To further improve the results, we introduce a novel geodesic loss function which takes into account the curvature of the data manifold, and show qualitatively as well as quantitatively the improvement this brings.

For future work, we will continue to experiment with various combinations of Gaussian processes. (We experimented briefly with the Ornstein-Uhlenbeck process [23], but did not include it in our final results as its performance was not superior.) In addition, combining our approach with more recent methods of disentanglement such as FactorVAE or $\beta$-TCVAE may lead to further improvements.


  • [1] Georgios Arvanitidis, Lars Kai Hansen, and Søren Hauberg. Latent Space Oddity: on the Curvature of Deep Generative Models. In International Conference on Learning Representations (ICLR), 2017.
  • [2] Christian Bayer, Peter Friz, and Jim Gatheral. Pricing under rough volatility. Quantitative Finance, 16(6):887–904, 2016.
  • [3] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:1798–1828, 2012.
  • [4] Francesco Paolo Casale, Adrian V. Dalca, Luca Saglietti, Jennifer Listgarten, and Nicoló Fusi. Gaussian Process Prior Variational Autoencoders. In Conference on Neural Information Processing Systems (NeurIPS), 2018.
  • [5] Ricky T. Q. Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating Sources of Disentanglement in VAEs. In Advances in Neural Information Processing Systems, 2018.
  • [6] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Conference on Neural Information Processing Systems (NIPS), 2016.
  • [7] Emily L. Denton and Vighnesh Birodkar. Unsupervised Learning of Disentangled Representations from Video. In Conference on Neural Information Processing Systems (NIPS), 2017.
  • [8] Vincent Fortuin, Gunnar Rätsch, and Stephan Mandt. Multivariate Time Series Imputation with Variational Autoencoders. ArXiv, abs/1907.04155, 2019.
  • [9] Paul Glasserman. Monte-Carlo Methods in Financial Engineering. Springer-Verlag, NY, 2003.
  • [10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, 2014.
  • [11] Will Grathwohl and Aaron Wilson. Disentangling Space and Time in Video with Hierarchical Variational Auto-encoders. ArXiv, abs/1612.04440, 2016.
  • [12] Takeyuki Hida and Masuyuki Hitsuda. Gaussian Processes. American Mathematical Society, 2008.
  • [13] Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In International Conference on Learning Representations (ICLR), 2017.
  • [14] Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Fei-Fei Li, and Juan Carlos Niebles. Learning to Decompose and Disentangle Representations for Video Prediction. Conference on Neural Information Processing Systems (NeurIPS), 2018.
  • [15] Ioannis Karatzas and Steven E. Shreve. Brownian Motion and Stochastic Calculus. Springer-Verlag, NY, 1998.
  • [16] Hyunjik Kim and Andriy Mnih. Disentangling by Factorising. In International Conference on Machine Learning (ICML), 2018.
  • [17] Minyoung Kim, Yuting Wang, Pritish Sahu, and Vladimir Pavlovic. Bayes-Factor-VAE: Hierarchical Bayesian Deep Auto-Encoder Models for Factor Disentanglement. ArXiv, abs/1909.02820, 2019.
  • [18] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2013.
  • [19] Line Kühnel, Tom E Fletcher, Sarang C. Joshi, and Stefan Sommer. Latent Space Non-Linear Statistics. ArXiv, abs/1805.07632, 2018.
  • [20] Yingzhen Li and Stephan Mandt. Disentangled Sequential Autoencoder. In International Conference on Machine Learning (ICML), 2018.
  • [21] Nengli Lim, Tianxia Gong, Li Cheng, Hwee Kuan Lee, et al. Finding distinctive shape features for automatic hematoma classification in head CT images from traumatic brain injuries. In International Conference on Tools with Artificial Intelligence (ICTAI), 2013.
  • [22] Benoit B. Mandelbrot and John W. Van Ness. Fractional brownian motions, fractional noises and applications. SIAM Review, 10(4):422–437, 1968.
  • [23] Leonard S. Ornstein and George E. Uhlenbeck. On the theory of Brownian motion. Physical Review, 36:823–841, 1930.
  • [24] Scott E. Reed, Yi Zhang, Yuting Zhang, and Honglak Lee. Deep Visual Analogy-Making. In Conference on Neural Information Processing Systems (NIPS), 2015.
  • [25] Hang Shao, Abhishek Kumar, and P. Thomas Fletcher. The Riemannian Geometry of Deep Generative Models. In Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018.
  • [26] Ankita Shukla, Shagun Uppal, Sarthak Bhagat, Saket Anand, and Pavan K. Turaga. Geometry of Deep Generative Models for Disentangled Representations. ArXiv, abs/1902.06964, 2019.
  • [27] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised Learning of Video Representations using LSTMs. In International Conference on Machine Learning (ICML), 2015.
  • [28] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing Motion and Content for Natural Video Sequence Prediction. International Conference on Learning Representations (ICLR), 2017.
  • [29] Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning, volume 2. MIT press Cambridge, MA, 2006.