MGPVAE
We introduce MGPVAE, a variational autoencoder which uses Gaussian processes (GP) to model the latent space distribution. We employ MGPVAE for the unsupervised learning of video sequences to obtain disentangled representations. Previous work in this area has mainly been confined to separating dynamic information from static content. We improve on previous results by establishing a framework by which multiple features, static or dynamic, can be disentangled. Specifically, we use fractional Brownian motions (fBM) and Brownian bridges (BB) to enforce an inter-frame correlation structure in each independent channel. We show that varying this correlation structure enables one to capture different aspects of variation in the data. We demonstrate the quality of our disentangled representations in numerous experiments on three publicly available datasets, and also perform quantitative tests on a video prediction task. In addition, we introduce a novel geodesic loss function which takes into account the curvature of the data manifold to improve learning in the prediction task. Our experiments show quantitatively that the combination of our improved disentangled representations with the novel loss function enables MGPVAE to outperform the state-of-the-art in video prediction.
Finding good representations for data is one of the main goals of unsupervised machine learning [3]. Ideally, these representations reduce the dimensionality of the data, and are structured such that the different factors of variation in the data get distilled into different channels. This process of disentanglement in generative models is useful because, in addition to making the data interpretable, the disentangled representations can also be used to improve downstream tasks such as prediction.

In prior work on the unsupervised learning of video sequences, a fair amount of effort has been devoted to separating motion, or dynamic information, from static content [7, 11, 14, 20, 28]. To achieve this goal, the model is typically structured to consist of dual pathways, e.g. using two separate networks to separately capture motion and semantic content [7, 28].
Such frameworks may be restrictive as it is not immediately clear how to extend them to extract multiple static and dynamic features. Furthermore, in complex videos, there usually is not a clear dichotomy between motion and content, e.g. in videos containing dynamic information ranging over different timescales.
In this paper, we address this challenge by proposing a new variational autoencoder, MGPVAE, for the unsupervised learning of video sequences. It utilizes a latent prior distribution that consists of multiple channels of fractional Brownian motions and Brownian bridges. By varying the correlation structure along the time dimension in each channel to pick up different static or dynamic features, while maintaining independence between channels, MGPVAE is able to learn multiple disentangled factors.
We then demonstrate quantitatively the quality of our disentangled representations using a frame prediction task. To improve prediction quality, we also employ a novel geodesic loss function which incorporates the manifold structure of the data to enhance the learning process.
Our main contributions can be summarized as follows:
We use Gaussian processes as the latent prior distribution in our model MGPVAE to obtain disentangled representations for video sequences. Specifically, we structure the latent space by varying the correlation between video frame distributions so as to extract multiple factors of variation from the data.
We introduce a novel loss function which utilizes the structure of the data manifold to improve prediction. In particular, the actual geodesic distance between the predicted point and its target on the manifold is used instead of squared Euclidean distance in the latent space.
We test MGPVAE against various other state-of-the-art models on three datasets and demonstrate quantitatively that our model outperforms the competition in video prediction.
There are several methods for improving the disentanglement of latent representations in generative models. InfoGAN [6] augments generative adversarial networks [10] by additionally maximizing the mutual information between a subset of the latent variables and the recognition network output. β-VAE [13] adds a simple coefficient ($\beta$) to the KL divergence term in the evidence lower bound of a VAE. It has been demonstrated that increasing $\beta$ beyond unity improves disentanglement, but also comes at the price of increased reconstruction loss [16]. To counteract this trade-off, both FactorVAE [16] and β-TCVAE [5] further decompose the KL divergence term, and identify a total correlation term which, when penalized, directly encourages factorization in the latent distribution.
With regard to the unsupervised learning of sequences, there have been several attempts to separate dynamic information from static content [7, 11, 14, 20, 28]. In [20], one latent variable is set aside to represent content, separate from another set of variables used to encode dynamic information, and they employ this graphical model for the generation of new video and audio sequences.
[28] proposes MCnet, which uses a convolutional LSTM for encoding motion and a separate CNN to encode static content. The network is trained using standard loss plus a GAN term to generate sharper frames. DRNet [7] adopts a similar architecture, but uses a novel adversarial loss which penalizes semantic content in the dynamic pathway to learn pose features.
[14] proposes DDPAE, a model with a VAE structure that performs decomposition on video sequences with multiple objects in addition to disentanglement. In their experiments, they show quantitatively that DDPAE outperforms MCnet and DRNet in video prediction on the Moving MNIST dataset.
In [11], a variational autoencoder which structures its latent space distribution into two components is used for video sequence learning. The “slow” channel extracts static features from the video, and the “fast” channel captures dynamic motion. Our approach is inspired by this method, and we go further by giving a principled way to shape the latent space prior so as to disentangle multiple features.
Outside of video analysis, VAEs with a Gaussian process prior have also been explored. In [4], the authors propose GPPVAE and train it on image datasets of different objects in various views. The latent representation is a function of an object vector and a view vector, and has a Gaussian prior imposed on it. They also introduce an efficient method to speed up computation of the covariance matrices.
In [8], a deep VAE architecture is used in conjunction with a Gaussian process to model correlations in multivariate time series such that inference can be performed on missing datapoints.
BayesFactor VAE [17] uses a hierarchical Bayesian model to extend the VAE. As with our work, they recognize the limitations of restricting the latent prior distribution to standard normal, but they adopt heavy-tailed distributions as an alternative rather than Gaussian processes.
Recent work has shown that distances in latent space are not representative of the true distance between datapoints [1, 19, 25]. Rather, deep generative models learn a mapping from the latent space to the data manifold, a smoothly varying lower-dimensional subset of the original data space.
In [21], closed curves are abstractly represented as points on a shape manifold which incorporates the constraints of scale, rotational and translational invariance. The geodesic distance between points on this manifold is then used to give an improved measure of dissimilarity. In [26], several metrics are proposed to quantify the curvature of data manifolds arising from VAEs and GANs.
In this section, we review the preliminaries on VAEs and Gaussian processes, and describe our model MGPVAE in detail.
Variational autoencoders [18] are powerful generative models which reformulate autoencoders in the framework of variational inference. Given latent variables $z$, the decoder, typically a neural network, models the generative distribution $p_\theta(x \mid z)$, where $x$ denotes the data. Due to the intractability of computing the posterior distribution $p_\theta(z \mid x)$, an approximation $q_\phi(z \mid x)$, again parameterized by another neural network called the encoder, is used. Maximizing the log-likelihood of the data can be achieved by maximizing the evidence lower bound, which is equal to

$$\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left[q_\phi(z \mid x) \,\|\, p(z)\right], \tag{1}$$

with $p(z)$ denoting the prior distribution of the latent variables.

The negative of the first term in (1) is the reconstruction loss, and can be approximated by

$$-\frac{1}{L} \sum_{l=1}^{L} \log p_\theta\left(x \mid z^{(l)}\right),$$

where $z^{(l)}$ is drawn ($L$ times) from the latent distribution $q_\phi(z \mid x)$, although typically only one sample is required in each pass as long as the batch size is sufficiently large [18]. If $p_\theta(x \mid z)$ is modeled to be Gaussian, then this is simply mean-squared error (up to a scale factor and an additive constant).
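To spell out the last remark, here is a short derivation under an assumption that the text does not state explicitly: an isotropic Gaussian decoder with mean $g(z)$ and fixed variance $\sigma^2$, where $D$ (a symbol introduced here) is the number of pixels.

```latex
-\log p(x \mid z)
  = -\log \mathcal{N}\!\left(x;\, g(z),\, \sigma^2 I_D\right)
  = \frac{1}{2\sigma^2}\,\lVert x - g(z) \rVert^2
    + \frac{D}{2}\log\!\left(2\pi\sigma^2\right)
```

Since the second term does not depend on the network parameters, minimizing the negative log-likelihood is equivalent to minimizing the squared error between $x$ and the decoder output.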
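Given an index set $\mathcal{T}$, $\{X_t\}_{t \in \mathcal{T}}$ is a Gaussian process [12, 29] if for any finite set of indices $\{t_1, \dots, t_n\}$ of $\mathcal{T}$, $(X_{t_1}, \dots, X_{t_n})$ is a multivariate normal random variable. In this paper, we are concerned primarily with the case where $\mathcal{T}$ indexes time, i.e. $\mathcal{T} = \mathbb{R}^+$ or $\mathcal{T} = \mathbb{Z}^+$, in which case $X$ can be uniquely characterized by its mean and covariance functions

$$\mu(t) = \mathbb{E}[X_t], \qquad R(s, t) = \mathbb{E}\left[(X_s - \mu(s))(X_t - \mu(t))\right].$$

The following Gaussian processes are frequently encountered in stochastic models, e.g. in financial modeling [2, 9], and the prior distributions employed in MGPVAE will be the appropriately discretized versions of these processes.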
Fractional Brownian motion (fBM). Fractional Brownian motion $\{B^H_t\}$ [22] is a Gaussian process parameterized by a Hurst parameter $H \in (0, 1)$, with mean and covariance functions given by

$$\mu(t) = 0, \qquad R(s, t) = \frac{1}{2}\left(s^{2H} + t^{2H} - |t - s|^{2H}\right). \tag{2}$$

When $H = 1/2$, $B^H_t$ is standard Brownian motion [12] with independent increments, i.e. the discrete sequence $(B_0, B_1, B_2, \dots)$ is a simple symmetric random walk where $B_{n+1} \sim \mathcal{N}(B_n, 1)$.

Most notably, when $H \neq 1/2$, the process is not Markovian. When $H > 1/2$, the disjoint increments of the process are positively correlated, whereas when $H < 1/2$, they are negatively correlated. We will demonstrate in our experiments how tuning $H$ affects the clustering of the latent code.
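The effect of the Hurst parameter on increment correlation can be checked numerically. The sketch below assumes the standard fBM covariance $R(s,t) = \tfrac{1}{2}(|s|^{2H} + |t|^{2H} - |t-s|^{2H})$; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def fbm_covariance(times, hurst):
    """Covariance matrix of fractional Brownian motion evaluated at the given times."""
    t = np.asarray(times, dtype=float)
    s, u = np.meshgrid(t, t, indexing="ij")
    two_h = 2.0 * hurst
    return 0.5 * (np.abs(s) ** two_h + np.abs(u) ** two_h - np.abs(s - u) ** two_h)

def increment_covariance(hurst):
    """Covariance between the disjoint increments B(1) - B(0) and B(2) - B(1)."""
    k = fbm_covariance([1.0, 2.0], hurst)
    # Cov(B1, B2 - B1) = Cov(B1, B2) - Var(B1), using B(0) = 0.
    return k[0, 1] - k[0, 0]

for h in (0.1, 0.5, 0.9):
    print(f"H = {h}: increment covariance = {increment_covariance(h):+.4f}")
```

The printed value is negative for $H = 0.1$, zero for $H = 0.5$ and positive for $H = 0.9$, matching the sign pattern described in the text.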
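Brownian bridge (BB). The Brownian bridge [9, 15] from $a$ to $b$ on the domain $[0, T]$ is the Gaussian process defined as

$$X_t = a\left(1 - \frac{t}{T}\right) + b\,\frac{t}{T} + W_t - \frac{t}{T} W_T, \tag{3}$$

where $W_t$ is standard Brownian motion. Its mean function is $a(1 - t/T) + b\,t/T$, which is identically zero when $a = b = 0$, and its covariance function is given by

$$R(s, t) = \min(s, t) - \frac{st}{T}. \tag{4}$$

It can also be represented as the solution to the stochastic differential equation [15]

$$dX_t = \frac{b - X_t}{T - t}\,dt + dW_t, \qquad X_0 = a,$$

with solution

$$X_t = a\left(1 - \frac{t}{T}\right) + b\,\frac{t}{T} + (T - t)\int_0^t \frac{dW_s}{T - s}.$$

From (3), its defining characteristic is that it is pinned at the start and the end such that $X_0 = a$ and $X_T = b$ almost surely.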
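The pinning property can be illustrated with a small simulation. This is a sketch that assumes the standard representation $X_t = a(1 - t/T) + b\,t/T + W_t - (t/T)W_T$ and covariance $\min(s,t) - st/T$; the function names and the endpoint values in the usage lines are hypothetical.

```python
import numpy as np

def bridge_mean(times, a, b, T):
    """Mean of a Brownian bridge from a to b on [0, T]: linear interpolation."""
    t = np.asarray(times, dtype=float)
    return a * (1.0 - t / T) + b * t / T

def bridge_covariance(times, T):
    """Covariance min(s, t) - s * t / T of a Brownian bridge on [0, T]."""
    t = np.asarray(times, dtype=float)
    s, u = np.meshgrid(t, t, indexing="ij")
    return np.minimum(s, u) - s * u / T

def sample_bridge(a, b, T, n_steps, rng):
    """Sample one bridge path on a uniform grid from a simulated Brownian path W."""
    t = np.linspace(0.0, T, n_steps + 1)
    dw = rng.normal(scale=np.sqrt(T / n_steps), size=n_steps)
    w = np.concatenate([[0.0], np.cumsum(dw)])
    return t, bridge_mean(t, a, b, T) + w - (t / T) * w[-1]

# Illustrative endpoints only; not the values used in the experiments.
t, x = sample_bridge(a=-1.0, b=2.0, T=1.0, n_steps=50, rng=np.random.default_rng(0))
print(x[0], x[-1])
```

Whatever the random draw, the first and last values of the path equal `a` and `b`, and the covariance vanishes at both endpoints.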
For VAEs in the unsupervised learning of static images, the latent distribution $p(z)$ is typically a simple Gaussian distribution, i.e. $z \sim \mathcal{N}(0, I)$. For a video sequence input with $n$ frames, we model the corresponding latent code as

$$z = (z^1, \dots, z^d), \qquad z^i = (z^i_1, \dots, z^i_n) \sim \mathcal{N}(\mu_i, \Sigma_i).$$

Here $d$ denotes the number of channels, where one channel corresponds to one sampled Gaussian path, and for each channel $i$, $\mu_i, \Sigma_i$ are the mean and covariance of the discretized process, which is either an fBM or a Brownian bridge. The initial distributions, and the terminal distribution for the Brownian bridge, are set to be standard normal, and we experiment with different values for the process parameters. The covariances can be computed using (2) and (4) and are not necessarily diagonal, which enables us to model more complex inter-frame correlations.
Rewriting $z$ as $(z^1, \dots, z^d)$, for each channel $i$, we sample $z^i$ by sampling $\varepsilon$ from a standard normal and computing

$$z^i = \mu_i + L_i\,\varepsilon,$$

where $L_i$ is the lower-triangular Cholesky factor of $\Sigma_i$.
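The sampling step can be sketched in NumPy as follows. The function name, the jitter term and the example covariance (Brownian motion at three time points) are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def sample_paths(mu, sigma, n_samples, rng, jitter=1e-10):
    """Draw paths z = mu + L @ eps, with L the lower-triangular Cholesky factor of sigma."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # A small diagonal jitter (an assumption, not from the text) guards against
    # round-off making sigma numerically non-positive-definite.
    chol = np.linalg.cholesky(sigma + jitter * np.eye(len(mu)))
    eps = rng.standard_normal((n_samples, len(mu)))
    return mu + eps @ chol.T

# Hypothetical example: a 3-frame channel with Brownian-motion covariance min(s, t).
times = np.array([1.0, 2.0, 3.0])
sigma = np.minimum.outer(times, times)
paths = sample_paths(np.zeros(3), sigma, n_samples=50_000, rng=np.random.default_rng(0))
empirical = np.cov(paths, rowvar=False)
print(np.round(empirical, 2))
```

Because the noise enters only through `eps`, the sample is a differentiable function of the mean and the Cholesky factor, which is what makes this form convenient for training.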
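The output of the encoder is, for each channel $i$, a mean vector $\hat{\mu}_i$ and a symmetric positive-definite matrix $\hat{\Sigma}_i$, i.e.

$$q(z^i \mid x) = \mathcal{N}(\hat{\mu}_i, \hat{\Sigma}_i),$$

and to compute the KL divergence term in (1), we use the formula

$$\mathrm{KL}\left[q \,\|\, p\right] = \frac{1}{2} \sum_{i=1}^{d} \left[ \operatorname{tr}\left(\Sigma_i^{-1} \hat{\Sigma}_i\right) + (\mu_i - \hat{\mu}_i)^\top \Sigma_i^{-1} (\mu_i - \hat{\mu}_i) - n + \log \frac{\det \Sigma_i}{\det \hat{\Sigma}_i} \right],$$

where $\mu_i, \Sigma_i$ are the prior mean and covariance of channel $i$, $d$ is the number of channels and $n$ is the number of frames.

Following [13], we add a $\beta$ factor to the KL divergence term to improve disentanglement. We will describe the details of the network architecture of MGPVAE in Section 4.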
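For a single channel, the closed-form KL divergence between two multivariate Gaussians can be computed as in the sketch below. The function name is illustrative, and the value of $\beta$ in the usage lines is a placeholder rather than the setting used in the experiments.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL[N(mu_q, sigma_q) || N(mu_p, sigma_p)] for full covariance matrices."""
    n = mu_q.shape[0]
    diff = mu_p - mu_q
    # Solve linear systems instead of forming the explicit inverse of sigma_p.
    trace_term = np.trace(np.linalg.solve(sigma_p, sigma_q))
    quad_term = diff @ np.linalg.solve(sigma_p, diff)
    _, logdet_p = np.linalg.slogdet(sigma_p)
    _, logdet_q = np.linalg.slogdet(sigma_q)
    return 0.5 * (trace_term + quad_term - n + logdet_p - logdet_q)

# Usage: unit-covariance Gaussians in 2-D whose means differ by (1, 1).
mu_q, mu_p = np.zeros(2), np.ones(2)
kl = gaussian_kl(mu_q, np.eye(2), mu_p, np.eye(2))
beta = 2.0  # placeholder weight on the KL term
print(kl, beta * kl)
```

In a multi-channel model the per-channel values would be summed before being scaled by $\beta$, since the channels are independent.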

Table 1: Parameter settings for the experiments.

Parameter | Moving MNIST | Coloured dSprites | Sprites
Gaussian processes | Channel 1: fBM (H = 0.1); Channel 2: fBM (H = 0.9) | 5 channels of BBs | 5 channels of BBs
 | 0.25 | 0.25 | 0.25
 | 2 | 2 | 2
Learning rate | 0.001 | 0.008 | 0.010
No. of epochs | 200 | 120 | 150
For video prediction, we use a simple three-layer MLP with ReLU activation to predict the last $k$ frames of the sequence given the first $n - k$ frames (of $n$ in total) as input. The MLP is trained on the output of the encoder, i.e. on points in the latent space, rather than on the actual frame data so as to best utilize the disentangled representations.

The action of the decoder can be viewed as a differentiable map $g$ of the latent space to the data manifold. Given an encoder output $z_0$ and a target $z_N$, we use the geodesic distance between $g(z_0)$ and $g(z_N)$ as the loss function instead of the usual squared distance $\lVert z_0 - z_N \rVert^2$. We use Algorithm 1, taken from [25], to compute the geodesic distance.
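This algorithm finds the minimum of the energy of the discretized path $z_0, z_1, \dots, z_N$ (and thus the geodesic)

$$E = \frac{1}{2} \sum_{i=0}^{N-1} \left\lVert g(z_{i+1}) - g(z_i) \right\rVert^2 \tag{5}$$

by computing its gradient

$$\nabla_{z_i} E = -\left(D g(z_i)\right)^\top \left[ g(z_{i+1}) - 2 g(z_i) + g(z_{i-1}) \right], \tag{6}$$

where $g$ denotes the decoder and $D g$ its Jacobian.

Algorithm 1 initializes $z_1, \dots, z_{N-1}$ to be uniformly-spaced points along the line between $z_0$ and $z_N$ and gradually modifies them until the change in energy falls below a predetermined threshold. At this point, we use $z_1$ as the target instead of $z_N$, as $z_1 - z_0$ is more representative of the direction in which to update the prediction such that the geodesic distance is minimized (see Figure 3 for an illustration).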
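The procedure can be sketched as plain gradient descent on a discrete path energy, here taken to be half the sum of squared distances between consecutive decoded points. Everything specific below is an assumption made for illustration: the toy decoder (a paraboloid standing in for the trained network), its analytic Jacobian, the step size and the fixed iteration count. It is not the authors' implementation.

```python
import numpy as np

def decoder(z):
    """Toy stand-in for the decoder g: maps a 2-D latent point onto a paraboloid in R^3."""
    return np.array([z[0], z[1], z[0] ** 2 + z[1] ** 2])

def jacobian(z):
    """Analytic Jacobian of the toy decoder, shape (3, 2)."""
    return np.array([[1.0, 0.0], [0.0, 1.0], [2.0 * z[0], 2.0 * z[1]]])

def path_energy(path):
    """Discrete energy: half the sum of squared distances between consecutive decoded points."""
    decoded = np.array([decoder(z) for z in path])
    return 0.5 * np.sum((decoded[1:] - decoded[:-1]) ** 2)

def geodesic_path(z_start, z_end, n_interior=4, step=0.02, n_iters=20):
    """Gradient descent on the interior points of a path with fixed endpoints."""
    path = np.linspace(z_start, z_end, n_interior + 2)  # straight-line initialisation
    for _ in range(n_iters):
        decoded = np.array([decoder(z) for z in path])
        for i in range(1, len(path) - 1):
            second_diff = decoded[i + 1] - 2.0 * decoded[i] + decoded[i - 1]
            grad = -jacobian(path[i]).T @ second_diff
            path[i] = path[i] - step * grad
    return path

z_pred, z_target = np.array([-1.0, 0.5]), np.array([1.0, 0.5])
straight = np.linspace(z_pred, z_target, 6)
curved = geodesic_path(z_pred, z_target)
print(path_energy(straight), path_energy(curved))
```

With a real decoder the Jacobian-transpose product would normally come from automatic differentiation rather than a hand-written Jacobian. The first interior point, `curved[1]`, plays the role of the intermediate target described in the text.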
In this section, we present experiments which demonstrate MGPVAE’s ability to disentangle multiple factors of variation in video sequences.
Moving MNIST [27] (http://www.cs.toronto.edu/~nitish/unsupervised_video) comprises moving grayscale handwritten digits. We generate 60,000 sequences for training, each with a single digit moving in a random direction across frames and bouncing off edges.
Coloured dSprites is a modification of the dSprites [13] dataset (https://github.com/deepmind/dsprites-dataset). It consists of 2D shapes (square, ellipse, heart) with 6 values for scale and 40 values for orientation. We modify the dataset by adding 3 variations for colour (red, green, blue) and constrain the motion of each video sequence to be simple horizontal or vertical motion.
For each sequence, the scale is set to gradually increase or decrease a notch in each frame. Similarly, after an initial random selection for orientation, the subsequent frames rotate the shape clockwise or anticlockwise one notch per frame. The final dataset consists of a total of approximately 100,000 datapoints.
Sprites [24] comprises around 17,000 animations of synthetically rendered animated caricatures. There are 7 attributes: body type, sex, hair type, arm type, armor type, greaves type, and weapon type, with a total of 672 unique characters. In each animation, the physical traits of the sprite remain constant while the pose (hand movement, leg movement, orientation) is varied.
For the encoder, we use 8 convolutional layers with batch normalization between each layer. The number of filters begins with 16 in the first layer and increases to a maximum of 128 in the last layer. An MLP layer follows the last layer, and this is followed by another batch normalization layer. Two separate MLP layers are then applied: one outputs a lower-triangular matrix which represents the Cholesky factor of the covariance matrix of $q(z \mid x)$, and the other outputs the mean vector.

For the decoder, we have 7 deconvolutional layers, with batch normalization between each layer. The first layer begins with 64 filters and this decreases to 16 filters by the last layer. We use ELU for the activation functions between all layers to ensure differentiability, with the exception of the last layer, where we use a hyperbolic tangent function.
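As an illustration of how an encoder head can emit a valid Cholesky factor, the sketch below fills a lower-triangular matrix from a flat vector. Exponentiating the diagonal to keep it positive is an assumption made here; the text does not say how positivity is enforced, and the function name is hypothetical.

```python
import numpy as np

def cholesky_from_vector(v, n):
    """Arrange a flat vector of length n(n+1)/2 into a lower-triangular factor."""
    L = np.zeros((n, n))
    L[np.tril_indices(n)] = v
    diag = np.diag_indices(n)
    # Assumption: exponentiate the diagonal so the factor is invertible and
    # the resulting covariance L @ L.T is positive definite.
    L[diag] = np.exp(L[diag])
    return L

raw = np.arange(6) * 0.1          # stand-in for the raw network output
L = cholesky_from_vector(raw, 3)
sigma = L @ L.T                   # covariance implied by the factor
print(np.round(sigma, 3))
```

Parameterizing the factor rather than the covariance itself means the sampling step needs no further decomposition during training.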
Table 1 lists the settings for the parameters in the experiments. All channels utilizing Brownian bridge (BB) are conditioned to start at 2 and end at 2.
Moving MNIST. From the results in Figure 4, we see that channel 1 (fBM($H = 0.1$)) captures the digit identity, whereas channel 2 (fBM($H = 0.9$)) captures the motion.
Figure 5 gives a visualization of the latent space (here we use four channels, two for each of two values of $H$). In our experiments, we observe that fBM channels with $H = 0.9$ are able to better capture motion in comparison to setting $H = 1/2$ (simple symmetric random walk, cf. [11]). We hypothesize that shifting the value of $H$ away from that of the static channel sets the distributions apart and allows for better disentanglement.
Sprites. In Figure 6, it is clear that channel 1 captures hair type, channel 2 captures armor type, channel 3 captures weapon type, and channel 4 captures body orientation.
Coloured dSprites. In the examples in Figure 7, we observe that channel 2 captures shape, channel 3 captures scale, channel 4 captures orientation and position, and channel 5 captures color.
Remarks. The disentanglement results were the best for Moving MNIST, where we were able to achieve perfect disentanglement in more than 90% of the cases. We were also able to consistently disentangle three or more features in Coloured dSprites and Sprites, but disentanglement of four or more features occurred less frequently due to their complexity.
We found that including more channels than the number of factors of variation in the dataset improved disentanglement, even though the extra channels did not encode anything new. Finally, for the Coloured dSprites and Sprites datasets, we originally experimented with different combinations of fBMs (with varying $H$) and Brownian bridges, but found that simply using 4-5 channels of Brownian bridges gave comparable results.
For prediction training with the geodesic loss function, we set the number of interpolated points to be four. In addition, to speed up the algorithm for faster training, we ran the loop in Algorithm 1 for a fixed number of iterations (15-20) instead of until convergence.
We compute the pixel-wise mean-squared error and binary cross-entropy between the predicted $k$ frames and the actual last $k$ frames, given the first $n - k$ frames (of $n$ in total) as input. We ran the tests with for Moving MNIST and Coloured dSprites, and for Sprites.
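The two metrics can be sketched as below. The aggregation convention (sum over the pixels of each frame, then average over frames) is an assumption, since the text does not state how the pixel-wise values are reduced; the function names and the clipping constant are likewise illustrative.

```python
import numpy as np

def frame_mse(pred, target):
    """Squared error summed over the pixels of each frame, averaged over frames."""
    per_frame = np.sum((pred - target) ** 2, axis=(-2, -1))
    return float(np.mean(per_frame))

def frame_bce(pred, target, eps=1e-7):
    """Binary cross-entropy summed over the pixels of each frame, averaged over frames."""
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    per_pixel = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return float(np.mean(np.sum(per_pixel, axis=(-2, -1))))

# Usage on a single hypothetical 2x2 binary frame and a maximally uncertain prediction.
target = np.array([[[0.0, 1.0], [1.0, 0.0]]])
pred = np.full_like(target, 0.5)
print(frame_mse(pred, target), frame_bce(pred, target))
```

Both functions expect arrays whose last two axes are the image height and width, with pixel intensities scaled to $[0, 1]$.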
We compare the results of MGPVAE against MCnet [28] (https://github.com/rubenvillegas/iclr2017mcnet), DRNet [7] (https://github.com/ap229997/DRNET), DDPAE [14] (https://github.com/jthsieh/DDPAE-video-prediction) and the model in [11] across all three datasets. We report the results for $k = 1$ and $k = 2$ for Moving MNIST in Table 2, and the results for the other two datasets in Table 3.
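Table 2: Prediction results on Moving MNIST.

Model | MSE (k = 1) | BCE (k = 1) | MSE (k = 2) | BCE (k = 2)
MCnet [28] | 50.1 | 248.2 | 91.1 | 595.5
DRNet [7] | 45.2 | 236.7 | 86.3 | 586.7
DDPAE [14] | 35.2 | 201.6 | 75.6 | 556.2
Grathwohl, Wilson [11] | 59.3 | 291.2 | 112.3 | 657.2
MGPVAE | 25.4 | 198.4 | 72.2 | 554.2
MGPVAE (with geodesic loss) | 18.5 | 185.1 | 69.2 | 531.4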
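Table 3: Prediction results on Coloured dSprites and Sprites.

Model | MSE (Coloured dSprites) | BCE (Coloured dSprites) | MSE (Sprites) | BCE (Sprites)
MCnet [28] | 20.2 | 229.5 | 100.3 | 2822.6
DRNet [7] | 15.2 | 185.2 | 94.4 | 2632.1
DDPAE [14] | 12.6 | 163.1 | 75.4 | 2204.1
MGPVAE | 6.1 | 85.2 | 68.8 | 1522.5
MGPVAE (with geodesic loss) | 4.5 | 70.3 | 61.6 | 1444.4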
The results show that MGPVAE, even without using the geodesic loss function, outperforms the other models. Using the geodesic loss function further lowers MSE and BCE. DDPAE, a state-of-the-art model, achieves comparable results, although we note that we had to train the model considerably longer on the Coloured dSprites and Sprites datasets as compared to Moving MNIST to get the same performance. (The code will be available online upon publication of this paper.)
Using the geodesic loss function during training also leads to qualitatively better results. As we are ensuring that each prediction gets updated such that it lies on the data manifold, we are able to produce more well-formed images. In Figures 8, 9 and 10, we select sequences with large MSE and BCE losses, where the predicted point generates an image frame which can differ considerably from the actual image frame when the normal loss function is used. The results show that this is rectified when we use the geodesic loss function. (In each of these figures, the top row depicts the original video and the bottom row depicts the video with the predicted last frame.)
We introduce MGPVAE, a variational autoencoder for obtaining disentangled representations from video sequences in an unsupervised manner. MGPVAE uses Gaussian processes, such as fractional Brownian motion and Brownian bridge, as a prior distribution for the latent space. We demonstrate that different parameterizations of these Gaussian processes allow one to extract different static and time-varying features from the data.
After training the encoder which outputs a disentangled representation of the input, we demonstrate the efficiency of the latent code by using it as input to an MLP for video prediction. We run experiments on three different datasets and demonstrate that MGPVAE outperforms other state-of-the-art models in video frame prediction. To further improve the results, we introduce a novel geodesic loss function which takes into account the curvature of the data manifold, and show qualitatively as well as quantitatively the improvement this brings.
For future work, we will continue to experiment with various combinations of Gaussian processes. (We experimented briefly with the Ornstein-Uhlenbeck process [23], but did not include it in our final results as its performance was not superior.) In addition, combining our approach with more recent methods of disentanglement such as FactorVAE or β-TCVAE may lead to further improvements.