Diffusion Variational Autoencoders

by Luis A. Pérez Rey, et al.

A standard Variational Autoencoder, with a Euclidean latent space, is structurally incapable of capturing topological properties of certain datasets. To remove topological obstructions, we introduce Diffusion Variational Autoencoders with arbitrary manifolds as a latent space. A Diffusion Variational Autoencoder uses transition kernels of Brownian motion on the manifold. In particular, it uses properties of the Brownian motion to implement the reparametrization trick and fast approximations to the KL divergence. We show that the Diffusion Variational Autoencoder is capable of capturing topological properties of synthetic datasets. Additionally, we train Diffusion Variational Autoencoders on MNIST with latent spaces given by spheres, tori, projective spaces, SO(3), and a torus embedded in R^3. Although a natural dataset like MNIST does not have latent variables with a clear-cut topological structure, training it on a manifold latent space can still highlight topological and geometrical properties.




1 Introduction

A large part of unsupervised learning is devoted to the extraction of meaningful latent factors that explain a certain data set. The terminology around Variational Autoencoders suggests that they are a good tool for this task.

A Variational Autoencoder (Kingma & Welling, 2014; Rezende et al., 2014) consists of a "latent space" Z (often just Euclidean space), a probability measure P_Z on Z, an encoder from the data space X to Z, and a decoder from Z to X. The term "latent space" suggests that an element z in Z captures semantic information about its decoded data point x. But is this interpretation actually warranted? Especially considering that in practice, a Variational Autoencoder is often trained without any knowledge about an underlying generative process. The only available information is the dataset itself. Nonetheless, one may train a Variational Autoencoder with any "latent space" Z, and purely based on this terminology, one could be tempted to think of elements in Z as latent variables and factors explaining the data.

Figure 1: MNIST trained on a sphere and on a torus

Certainly, there is a heuristic argument that gives a partial justification of this interpretation. The starting point and guiding principle is Occam's razor: such latent factors might arise from the construction of simple, low-complexity models that explain the data (Portegies, 2018). Strict formalizations of Occam's razor exist, through Kolmogorov complexity and inductive inference (Solomonoff, 1964; Schmidhuber, 1997), but are often computationally intractable. A more practical approach leads to the principle of minimum description length (Rissanen, 1978; Hinton & Zemel, 1994) and variational inference (Honkela & Valpola, 2004).

From a different angle, part of the interpretation as latent space is warranted by the loss function of the Variational Autoencoder, which stimulates a continuous dependence between the latent variables and the corresponding data points. Close-by points in data space should also be close-by in latent space. This suggests that a Variational Autoencoder could capture topological and geometrical properties of a data set.

However, a standard Variational Autoencoder (with a Euclidean latent space) is at times structurally incapable of accurately capturing topological properties of a data set. Take for example the case of a spinning object placed on a turntable and recorded by a camera from a fixed position. The data set for this example is the collection of all frames. The true latent factor is the angle of the turntable. However, the space of angles is topologically and geometrically different from Euclidean space. In an extreme example, if we train a Variational Autoencoder with a one-dimensional latent space on the pictures of the object on the turntable, pictures taken from almost the same angle will end up in completely different parts of the latent space.

This phenomenon has been called manifold mismatch (Davidson et al., 2018; Falorsi et al., 2018). To match the latent space with the data structure, Davidson et al. implemented spheres as latent spaces, whereas Falorsi et al. implemented the special orthogonal group SO(3).

As further examples of datasets with topologically nontrivial latent factors, we can think of many translations of the same periodic picture, where the translation is the latent variable, or many pictures of the same object which has been rotated arbitrarily. In these cases, there are still clear latent variables, but their topological and geometrical structure is neither that of Euclidean space nor that of a sphere, but rather that of a torus and of SO(3), respectively.

To address the problem of manifold mismatch, we developed the Diffusion Variational Autoencoder (ΔVAE), which allows for an arbitrary manifold as a latent space. Our implementation includes a version of the reparametrization trick, and a fast approximation of the evaluation of the KL divergence in the loss.

We implemented ΔVAEs with latent spaces given by n-dimensional spheres, a two-dimensional flat torus, a torus embedded in R^3, the special orthogonal group SO(3), and the real projective spaces RP^2 and RP^3.

We trained on a synthetic data set of translated images, with a flat torus as latent space. Our results show that the ΔVAE can capture the topological properties of the data. We further observed that the success rate of the ΔVAE in capturing global topological properties depends on the weight of higher Fourier components in the image: in data sets with more pronounced lower Fourier components, the ΔVAE is more successful in capturing the topology.

Mainly as a proof of concept, we also trained ΔVAEs on MNIST using the manifolds mentioned above.

2 Related work

Our work originated out of the search for algorithms that find semantically meaningful latent factors of data. The use of VAEs and their extensions to this end has mostly taken place in the context of disentanglement of latent factors (Higgins et al., 2017, 2018; Burgess et al., 2018). Examples of extensions that aim at disentangling latent factors are the β-VAE (Higgins et al., 2017), the factor-VAE (Kim & Mnih, 2018), the β-TCVAE (Chen et al., 2018) and the DIP-VAE (Kumar et al., 2018).

However, the examples in the introduction already show that in some situations, the topological structure of the latent space makes it practically impossible to disentangle latent factors. The latent factors are inherently, topologically entangled: in the case of a 3d rotation of an object, one cannot assign globally linearly independent angles of rotation.

Still, it is exactly global topological properties that we feel a VAE has a chance of capturing. What do we mean by this? One instance of ‘capturing’ topological structure is when the encoder and decoder of the VAE provide bijective, continuous maps between data and latent space, also called homeomorphic auto-encoding (Falorsi et al., 2018; de Haan & Falorsi, 2018). This can only be done when the latent space has a particular topological structure, for instance that of a particular manifold.

One of the main challenges when implementing a manifold as a latent space is the design of the reparametrization trick. In (Davidson et al., 2018), a VAE was implemented with a hyperspherical latent space. To our understanding, they implemented a reparametrization function which was discontinuous; see also Section 3.5 below.

If a manifold has the additional structure of a Lie group, this structure allows for a more straightforward implementation of the reparametrization trick (Falorsi et al., 2018). In our work, we do not assume the additional structure of a Lie group, but develop a reparametrization trick that works for general submanifolds of Euclidean space, and therefore by the Whitney (respectively Nash) embedding theorem, for general closed (Riemannian) manifolds.

The method that we use has similarities with the approach of Hamiltonian Variational Inference (Salimans et al., 2015). Moreover, the implementation of a manifold as a latent space can be seen as enabling a particular, informative, prior distribution. In that sense, our work relates to (Dilokthanakul et al., 2016; Tomczak & Welling, 2017). The prior distribution we implement is very degenerate, in that it does not assign weight to points outside of the manifold.

There are also other ways to implement approximate Bayesian inference on Riemannian manifolds. For instance, Liu and Zhu adapted the Stein variational gradient method to enable training on a Riemannian manifold (Liu & Zhu, 2017). However, their proposed method is rather expensive computationally.

The family of approximate posteriors that we implement is a direct generalization of the standard choice for a Euclidean VAE. Indeed, the Gaussian distributions are solutions to the heat equations, i.e. they are transition kernels of Brownian motion. One may want to increase the flexibility of the family of approximate posterior distributions, for instance by applying normalizing flows (Rezende & Mohamed, 2015; Kingma et al., 2016; Gemici et al., 2016).

3 Methods

3.1 Variational autoencoders

A VAE generally has the following ingredients:

  • a latent space Z,

  • a prior probability distribution P_Z on Z,

  • a family of encoder distributions Q_φ on Z, parametrized by φ in a parameter space Φ,

  • a family of decoder distributions P_θ on data space X, parametrized by θ in a parameter space Θ; in the usual setup, and in our paper, Θ in fact corresponds to the data space, and θ refers to the mean of a Gaussian distribution with identity covariance,

  • an encoder neural network f which maps from data space X to the parameter space Φ,

  • a decoder neural network g which maps from latent space Z to parameter space Θ.

The weights of these neural networks are optimized so as to minimize the negated evidence lower bound (ELBO),

  L(x) = -E_{z ~ Q_{f(x)}}[ log p_{g(z)}(x) ] + D_KL( Q_{f(x)} || P_Z ).

The first term is called the reconstruction error (RE); up to additive and multiplicative constants it equals the mean squared error (MSE). The second term is called the KL loss.

In a very common implementation, both latent space and data space are Euclidean, and the families of decoder and encoder distributions are multivariate Gaussian. The encoder and decoder networks then assign to a datapoint or a latent variable a mean and a variance, respectively.
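For reference, in that Euclidean Gaussian case the KL term of the loss has a well-known closed form; a minimal sketch (the function name is ours):

```python
import numpy as np

def gaussian_kl(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), the closed-form KL term of a
    standard Gaussian VAE. No such closed form is available once the
    latent space is a general manifold."""
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))

kl0 = gaussian_kl(np.zeros(2), np.ones(2))   # -> 0.0: posterior equals prior
```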

When we implement the latent space as a Riemannian manifold M, we need to find an appropriate prior distribution, for which we will choose the normalized Riemannian volume measure; a family of encoder distributions, for which we will take transition kernels of Brownian motion; and an encoder network mapping to the correct parameters.

3.2 Brownian motion on a Riemannian manifold

We will briefly discuss Brownian motion on a Riemannian manifold, recommending lecture notes by Hsu (Hsu, 2008) as a more extensive introduction.

In this paper, we always assume that M is a smooth Riemannian submanifold of Euclidean space which is closed, i.e. compact and without boundary.

There are many different, equivalent definitions of Brownian motion. We present here the definition that is closest to our eventual approximation and implementation.

Figure 2: Random walk on a (one-dimensional) submanifold of the plane, with time step δ.

We will construct Brownian motion out of random walks on a manifold. We first fix a small time step δ. We will imagine a particle, jumping from point to point on the manifold after each time step, see also Fig. 2. It will start off at a point z0 on M. We describe the first jump, after which the process just repeats. After time δ, the particle makes a random jump √δ ε from its current position into the surrounding space, where ε is distributed according to a radially symmetric distribution with identity covariance matrix. The position of the particle after the jump, z0 + √δ ε, will therefore in general not be on the manifold, so we project the particle back: the particle's new position will be

  z1 = proj_M( z0 + √δ ε ),

where the closest-point projection proj_M assigns to every point y the point in M that is closest to y. After another time δ, the particle makes a new, independent jump according to the same radially symmetric distribution, and its new position will be z2 = proj_M( z1 + √δ ε' ). This process just repeats.

Key to this construction, and also to our implementation, is the projection map proj_M. It has nice properties that follow from the general theory of smooth manifolds. In particular, proj_M(y) depends smoothly on y, as long as y is not too far away from M.
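For the manifolds considered here, proj_M is cheap to evaluate in closed form; a minimal sketch for the unit sphere and an embedded torus (our own illustrative code; the radii R and r are assumed values, and the maps are only defined away from the cut locus, e.g. not at the origin):

```python
import numpy as np

def proj_sphere(y):
    """Closest-point projection of y in R^m onto the unit sphere S^{m-1}."""
    return y / np.linalg.norm(y)

def proj_torus(y, R=2.0, r=0.5):
    """Closest-point projection of y in R^3 onto the torus with major
    radius R and minor radius r (illustrative radii)."""
    c = np.zeros(3)
    c[:2] = R * y[:2] / np.linalg.norm(y[:2])  # nearest point on the core circle
    d = y - c
    return c + r * d / np.linalg.norm(d)

z = proj_sphere(np.array([3.0, 4.0, 0.0]))   # -> [0.6, 0.8, 0.0]
```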

This way, for δ fixed, we have constructed a random walk, a random path on the manifold. We can think of this path as a discretized version of Brownian motion. Let now (δ_i) be a sequence of time steps converging to 0 as i → ∞. For each i, we can construct a random walk with time step δ_i, and get a random path.

The random paths converge as i → ∞ to a random path (in distribution). This random path is called Brownian motion. The convergence statement can be made precise by, for instance, combining powerful, general results by (Jørgensen, 1975) with standard facts from Riemannian geometry. But because Riemannian manifolds are locally, i.e. when you zoom in far enough, very similar to Euclidean space, the convergence result essentially comes down to the central limit theorem and its upgraded version, Donsker's invariance theorem.

In fact, the limiting process can be interpreted as a Markov process, and even as a diffusion process. If A is a subset of M, the probability that the Brownian motion started at z0 is in the set A at time t is measured by a probability measure Q_{z0,t} applied to the set A. We denote the density of this measure with respect to the standard Riemannian volume measure by p(t, z0, ·). The function p is sometimes referred to as the heat kernel.

Let us close this subsection with an alternative description of the function p. It is also characterized by the fact that for every continuous function f : M → R, the solution u of the partial differential equation

  ∂u/∂t = (1/2) Δ_M u,   u(0, ·) = f,

is given by

  u(t, x) = ∫_M p(t, x, y) f(y) dvol(y).
3.3 Riemannian manifold as latent space

A ΔVAE is a VAE with a Riemannian submanifold M of Euclidean space as a latent space, and the transition probability measures Q_{z,t} of Brownian motion as a parametric family of encoder distributions. We propose the uniform distribution for the prior, i.e. the normalized standard volume measure on the Riemannian manifold (although the choice of prior distribution could easily be generalized).

As in the standard VAE, we then implement the encoder f and the decoder g as neural networks.

3.4 ELBO

We optimize the weights in the network, aiming to minimize the average over the dataset of the loss function

  L(x) = - ∫_M log p_{g(z)}(x) dQ_{f(x)}(z) + D_KL( Q_{f(x)} || P_Z ).

The first integral can often only be approached by sampling, and in that case it is often advantageous to perform a change of variables, commonly known as the reparametrization trick (Kingma & Welling, 2014).

3.5 Reparametrization trick

The reparametrization trick requires a space of noise variables ε, distributed according to a fixed distribution, and a function h such that h(f(x), ε) is (approximately) distributed according to Q_{f(x)}.

The following example illustrates that finding such a function can be a nontrivial task, even for a relatively simple case: the two-dimensional unit sphere S^2 in R^3. In an attempt to construct a random variable distributed according to Q_{z,t}, we may first sample a random tangent vector ε at the south pole s according to an appropriate distribution, and then walk along the great circle in that direction for a distance given by the length of the vector. In other words, the new point is obtained by applying the exponential map at the south pole to the random sample. Next, we compose with a rotation of the sphere that brings the south pole to the point z on the sphere. More precisely, we select a map z ↦ R_z of rotations such that every z is the image of the south pole under the rotation R_z. Then, a proposed reparametrization is given by

  h(z, ε) = R_z( exp_s(ε) ).
In some sense, this reparametrization works: one could find a distribution on the tangent space at the south pole such that the distribution of h(z, ε) is Q_{z,t}. However, for topological reasons, in particular by the hairy ball theorem, it is impossible to find a continuous map z ↦ R_z with this property.
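To make this construction concrete, the rotation R_z and the exponential-map step can be written out for S^2 with Rodrigues' formula; a minimal NumPy sketch (our code, not the cited implementation; note the division by 1 + s·z, which blows up at the north pole and makes the unavoidable discontinuity explicit):

```python
import numpy as np

def rotation_from_south_pole(mu):
    """Rotation matrix taking the south pole s = (0,0,-1) to mu on S^2,
    rotating about the axis s x mu (Rodrigues' formula). Ill-defined,
    hence discontinuous, when mu is the north pole."""
    s = np.array([0.0, 0.0, -1.0])
    v = np.cross(s, mu)
    c = np.dot(s, mu)
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])        # cross-product matrix of v
    return np.eye(3) + K + K @ K / (1.0 + c)

def reparam_exp(mu, eps):
    """Apply the exponential map at the south pole to the tangent vector
    eps = (e1, e2), then rotate the south pole to mu."""
    s = np.array([0.0, 0.0, -1.0])
    r = np.linalg.norm(eps)
    if r < 1e-12:
        p = s
    else:
        u = np.array([eps[0], eps[1], 0.0]) / r  # unit tangent at the south pole
        p = np.cos(r) * s + np.sin(r) * u        # walk along the great circle
    return rotation_from_south_pole(mu) @ p
```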

3.6 Approximate reparametrization by random walk

Instead, we construct an approximate reparametrization map by approximating Brownian motion with a random walk, similar to the construction above. Starting from a point z on the manifold, we take a random step in the ambient Euclidean space. We then project back to the manifold and repeat: we take a new step and project back to the manifold. In total, we take K steps, see Fig. 2.

We define the function h by setting z0 := z and

  z_k := proj_M( z_{k-1} + √(t/K) ε_k ),   k = 1, …, K,

and h(z, t, ε_1, …, ε_K) := z_K. If we take the ε_k as i.i.d. random variables, distributed according to a radially symmetric distribution, then h(z, t, ε_1, …, ε_K) is approximately distributed as a random variable with density p(t, z, ·). This approximation is very accurate for small times t, even for small values of K, if we take the ε_k approximately Gaussian. The observation that for small times the diffusion kernel is approximately Gaussian is also very helpful in approximating the KL term in the loss.
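The K-step projected random walk described above can be sketched directly; a minimal NumPy version for the unit sphere, assuming standard Gaussian steps (the function names are ours, not the authors' code):

```python
import numpy as np

def reparametrize(mu, t, eps):
    """Approximate sample from the Brownian transition kernel p(t, mu, .)
    on the unit sphere, via a K-step projected random walk.
    mu  : point on the sphere (unit vector in R^m)
    t   : diffusion time
    eps : array of shape (K, m), i.i.d. standard normal noise"""
    K = eps.shape[0]
    z = mu
    for k in range(K):
        z = z + np.sqrt(t / K) * eps[k]   # Euclidean jump in ambient space
        z = z / np.linalg.norm(z)         # project back onto the manifold
    return z

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0, -1.0])
z = reparametrize(mu, t=0.01, eps=rng.standard_normal((4, 3)))
```

Because every operation is differentiable in mu and t (away from the cut locus), gradients can flow through the sampling step, which is the point of the trick.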

3.7 Approximation of the KL-divergence

Unlike the standard VAE, or the hyperspherical VAE with the von Mises-Fisher distribution, the KL term cannot be computed exactly for the ΔVAE. There are several techniques one could use to obtain, nonetheless, a good approximation of the KL divergence. We have implemented an asymptotic approximation, which we describe first.

Asymptotic approximation

We can use short-time asymptotics, i.e. a parametrix expansion, of the heat kernel on Riemannian manifolds to obtain asymptotic expansions of the entropy.

For a d-dimensional sphere of radius R, the scalar curvature equals d(d-1)/R². By using a parametrix expansion of the heat kernel, see for instance (Zhao & Song, 2017), we find that

  p(t, z0, z) = (2πt)^(-d/2) exp( -r²/(2t) ) (1 + O(t)),

where r is the geodesic distance between z0 and z.

We then derive, for small t, the following asymptotic behavior for the KL divergence:

  D_KL( Q_{z,t} || P_Z ) = log Vol(M) - (d/2) log(2πt) - d/2 + O(t).

We conjecture that this asymptotic formula holds for general d-dimensional closed Riemannian manifolds. In our implementation, we restrict t so that it cannot become too large, thus ensuring a certain accuracy of the asymptotic expansion.
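Under the small-time Gaussian approximation of the heat kernel, the leading-order KL term against the uniform prior takes only a few lines to evaluate; a sketch (our reconstruction, with curvature corrections of order t dropped):

```python
import numpy as np

def kl_leading_order(t, d, vol_m):
    """Leading-order small-t approximation of KL(Q_{z,t} || uniform prior)
    on a d-dimensional closed manifold of Riemannian volume vol_m.
    Curvature terms of order t are omitted."""
    return np.log(vol_m) - 0.5 * d * np.log(2 * np.pi * t) - 0.5 * d

# Example: unit 2-sphere, Vol(S^2) = 4*pi
kl = kl_leading_order(t=0.01, d=2, vol_m=4 * np.pi)
```

As expected, the approximate KL decreases as the diffusion time t grows, i.e. as the posterior spreads out toward the uniform prior.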

Figure 3: Latent space representation of MNIST for RP^2 (top-left), RP^3 (bottom-left), the flat torus (top-right), and SO(3) (bottom-right). The flat torus is represented as a square with periodic boundary conditions. The projective spaces are represented by a 2- and a 3-dimensional ball respectively, for which every point on the boundary is identified with its reflection through the center. The effect of this identification can be seen, since the same digits that map close to a point on the boundary also map close to the reflected point.

Manifold         LL               ELBO             KL            MSE
—                -738.76 ± 0.08   -739.25 ± 0.09   5.16 ± 0.13   3.48 ± 0.05
Embedded Torus   -738.58 ± 0.08   -740.97 ± 0.10   6.57 ± 0.01   3.55 ± 0.02
Flat Torus       -738.97 ± 0.08   -741.37 ± 0.11   6.42 ± 0.00   3.70 ± 0.02
—                -738.81 ± 0.02   -739.32 ± 0.03   5.65 ± 0.08   3.37 ± 0.02
—                -740.17 ± 0.35   -741.19 ± 0.53   3.15 ± 0.56   4.49 ± 0.02
—                -738.27 ± 0.03   -738.85 ± 0.04   5.74 ± 0.05   3.23 ± 0.02
—                -738.83 ± 0.08   -739.35 ± 0.11   4.95 ± 0.04   3.56 ± 0.04

Table 1: Numerical results for ΔVAEs trained on (non-binarized) MNIST. The values indicate mean and standard deviation over repeated runs. The columns represent the (data-averaged) log-likelihood estimate (LL), Evidence Lower Bound (ELBO), KL divergence (KL) and mean squared error (MSE).

Manifold         LL                ELBO              KL             MSE
—                -3779.75 ± 2.25   -3825.11 ± 4.21   3.45 ± 0.01    2.82 ± 0.21
Embedded Torus   -3787.34 ± 11.8   -3809.33 ± 23.0   11.2 ± 1.20    1.85 ± 1.18
Flat Torus       -3773.99 ± 1.70   -3813.00 ± 5.16   6.42 ± 0.00    0.90 ± 0.25
—                -3774.61 ± 0.53   -3821.25 ± 2.44   3.70 ± 0.00    2.62 ± 0.12
—                -3789.13 ± 8.36   -3850.19 ± 33.7   3.92 ± 1.97    4.02 ± 1.74
—                -3779.67 ± 4.06   -3783.03 ± 5.71   9.73 ± 0.45    0.46 ± 0.26
—                -3785.64 ± 5.26   -3789.60 ± 6.19   8.26 ± 0.53    0.85 ± 0.28

Table 2: Numerical results for ΔVAEs trained on a simple picture consisting of the lowest non-trivial Fourier components. The values indicate mean and standard deviation over repeated runs. The columns represent the (data-averaged) log-likelihood estimate (LL), Evidence Lower Bound (ELBO), KL divergence (KL) and mean squared error (MSE).

Numerical approximation and integration

For some manifolds, such as spheres or flat tori, exact expressions for the heat kernel are available, usually in terms of infinite series and special functions. For other, low-dimensional, Riemannian manifolds, the heat kernel may be accurately computed numerically. In both cases, the integration in the KL term may be performed numerically. If the dimensionality of the underlying manifold gets too large, one may have to resort to Monte Carlo approximation of the integral.

4 Experiments

We have implemented ΔVAEs with latent spaces given by n-dimensional spheres, a flat two-dimensional torus, a torus embedded in R^3, the special orthogonal group SO(3), and the real projective spaces RP^2 and RP^3.

For all our experiments, we used multi-layer perceptrons for the encoder and decoder, with three and two hidden layers respectively. Recall that the encoder needs to produce both a point z on the manifold and a time t for the transition kernel. These functions share all layers, except for the final step, where we project, with the projection map proj_M, from the last hidden layer to the manifold to get z, and use an output layer with a suitable activation function to obtain t.
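Such output heads can be sketched in a few lines; the layer sizes and the choice of a sigmoid scaled into (0, t_max) for the time head are our assumptions (the source does not name the activation), the latter being consistent with restricting t as in Section 3.7:

```python
import numpy as np

def encoder_heads(h, W_z, W_t, b_t, t_max=1.0):
    """Map the last hidden layer h to (z, t):
    z by projecting a linear head onto the unit sphere,
    t by a sigmoid squashed into (0, t_max)."""
    y = W_z @ h                       # linear head into the ambient space
    z = y / np.linalg.norm(y)         # closest-point projection onto the sphere
    t = t_max / (1.0 + np.exp(-(W_t @ h + b_t)))  # bounded diffusion time
    return z, float(t)

rng = np.random.default_rng(1)
h = rng.standard_normal(8)            # hypothetical last hidden layer
z, t = encoder_heads(h, rng.standard_normal((3, 8)), rng.standard_normal(8), 0.0)
```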

The encoder and decoder are connected by a sampling layer, in which we approximate sampling from the transition kernel of Brownian motion according to the reparametrization trick described in Section 3.6.

4.1 ΔVAEs for MNIST

We then trained ΔVAEs on MNIST. We show the manifolds as latent space with encoded MNIST digits in Figs. 1 and 3. When ΔVAEs with different latent spaces are trained on MNIST, different adjacency structures between digits may become apparent, providing topological information.

The group SO(3) is isometric to a scaling of RP^3 (with natural choices of Riemannian metrics). Although we have implemented an embedding and projection map for SO(3) directly, based on an SVD decomposition to find the nearest orthogonal matrix, training on RP^3 with the following trick was faster, and we only present these results.

For training the projective spaces RP^n, we used an additional trick: instead of embedding RP^n in a Euclidean space, we embed the sphere S^n in R^{n+1} and make the decoder neural network even by construction (i.e. the decoder applied to a point z on the sphere equals the decoder applied to the antipodal point -z). An encoder and decoder to and from RP^n are then defined implicitly. However, it must be noted that this setup does not allow for a homeomorphic encoding (because RP^n does not embed in R^{n+1}).

The numerically computed ELBO, reconstruction error, KL divergence and mean squared error are shown in Table 1, together with the estimated log-likelihood for a test dataset of MNIST.

Estimation of log-likelihood

For the evaluation of the proposed methods, we estimated the log-likelihood of the test dataset using the importance sampling presented in (Burda et al., 2016). The approximate log-likelihood of a datapoint x is calculated by sampling N latent variables z_n according to the approximate posterior Q_{f(x)}, and is given by

  log p(x) ≈ log[ (1/N) Σ_{n=1}^{N} p(x | z_n) p(z_n) / q(z_n | x) ].

The log-likelihood estimates presented in Table 1 are obtained with N samples for each datapoint, averaged over all datapoints.
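Given the three log-densities at the sampled latent variables, this estimator is most stably evaluated with a log-sum-exp; a minimal sketch (our function names):

```python
import numpy as np

def log_likelihood_is(log_p_x_given_z, log_p_z, log_q_z_given_x):
    """Importance-sampling estimate of log p(x) from N posterior samples
    (Burda et al., 2016):
      log p(x) ~= log (1/N) sum_n exp(log p(x|z_n) + log p(z_n) - log q(z_n|x)),
    computed stably via log-sum-exp."""
    w = log_p_x_given_z + log_p_z - log_q_z_given_x   # log importance weights
    m = np.max(w)
    return m + np.log(np.mean(np.exp(w - m)))

# constant weights: the estimate reduces to the common value
ll = log_likelihood_is(np.array([-1.0, -1.0]), np.zeros(2), np.zeros(2))  # -> -1.0
```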

(a) Latent space flat torus
(b) Reconstruction flat torus

(c) Latent space sphere
(d) Reconstruction sphere
Figure 4: Results of training ΔVAEs on a simple picture consisting only of the lowest non-trivial Fourier components. The figures on the left show the latent space manifolds of a flat torus and a sphere, with encoded images of translated original pictures, color-coded according to translation following the color scheme presented in Fig. 5. The figures on the right are reconstructions: they are placed on a grid in latent space and show the decoded images for the gridpoints. The reconstruction of the sphere is in spherical coordinates, with the left and right side of Fig. 4(d) corresponding to the black arc on the sphere in Fig. 4(c).

4.2 Translations of periodic pictures

To test whether a ΔVAE can capture topological properties, we trained it on synthetic datasets consisting of translations of the same periodic picture.

Our input pictures were discretized to 64×64 pixels, but to ease presentation, we discuss them below as if they were continuous. Note that with the choice of multi-layer perceptrons as encoder and decoder networks, the network has no information about which pixels are contiguous.

Simple picture

To illustrate the idea, we start with a very simple picture, consisting only of the lowest Fourier components in each direction. By translating the picture, i.e. by considering different phases, we obtain a submanifold of the space of all pictures. When we interpret this space of pictures as a Euclidean space, we get an embedding of the flat torus which is isometric up to a scaling. In other words, if we train the ΔVAE with a flat torus as latent space, we essentially try to find the identity map.
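To make the synthetic setup concrete, a dataset of this kind can be generated in a few lines; the particular sum-of-sines picture below is an illustrative assumption, not necessarily the exact picture used for the experiments:

```python
import numpy as np

def translated_picture(phase_x, phase_y, n=64):
    """One n-by-n sample: the lowest Fourier component in each direction,
    translated by (phase_x, phase_y). Illustrative picture choice."""
    x = np.arange(n) / n
    X, Y = np.meshgrid(x, x, indexing="ij")
    return np.sin(2 * np.pi * (X - phase_x)) + np.sin(2 * np.pi * (Y - phase_y))

# dataset: translations of the same picture, parametrized by the flat torus
data = np.stack([translated_picture(a, b)
                 for a in np.linspace(0, 1, 8, endpoint=False)
                 for b in np.linspace(0, 1, 8, endpoint=False)])
```

The phases live on a flat torus, so shifting either phase by a full period returns the same picture.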

Figs. 4(a) and 4(b) illustrate that for this simple case, the ΔVAE with a flat torus as latent space indeed captures the translation of the picture as a latent variable. The fact that Fig. 4(a) is practically a reflection and translation of the legend in Fig. 5 shows that there is an almost isometric correspondence between the translation of the original picture and the encoded latent variable.

Although the figure does not provide a proof that the embedding is homeomorphic, it is in principle possible to give a guarantee that the encoder map is surjective (more precisely, that it has degree 1 or -1).

Figure 5: Colors used to color encoded pictures in latent space. The horizontal direction represents horizontal translation, the vertical direction represents vertical translation of the original, periodic picture. The boundary conditions are periodic.

We contrast this with training a ΔVAE with a sphere as latent space on the same dataset. Fig. 4(c) displays typical results, showing that large parts of the sphere are not covered.

We present further numerical results in Table 2. We note that of all the closed manifolds, the MSE is by far the lowest for the flat torus. The KL divergence is lower for the positively curved spaces, but the MSE for those is about a factor three higher. For Euclidean R^2 as latent space, the MSE is even lower, but it is hard to compare, because the KL term is not directly comparable, and it is the combination of the two that is optimized.

More complicated pictures

As an exploratory analysis of the extent to which the ΔVAE can capture the topological structure, we trained on fixed datasets, each consisting of translations of a fixed random picture. These random pictures, viewed as periodic functions of two variables, are created by randomly drawing complex Fourier coefficients c_{kl} from a Gaussian distribution and weighting them as

  P(x, y) = Re Σ_{k,l} α^{|k|+|l|} c_{kl} e^{2πi(kx + ly)},

where α is a discounting factor. The ΔVAE with a flat torus as latent space is also capable of capturing the translations when the picture is generated this way, as long as the higher Fourier components do not carry too much weight, see Fig. 6.

Figure 6: Latent space and reconstruction for a ΔVAE with a flat torus as latent space, trained on a picture consisting of several Fourier components.

Too complicated pictures

Generically, we see that when there is too much weight on the higher Fourier components in the picture (i.e. when the discounting factor α is close to 1), the ΔVAE is no longer capable of capturing the translations, see Fig. 7.

We only tested a very simple setup. It is likely that convolutional neural networks would significantly improve the performance in capturing the translations, since implicit information is added to the network about the underlying geometry of the pictures.

Initial experiments with pretraining the network on simple Fourier images also show an increased success rate of capturing the topological structure.

Figure 7: Flat torus latent space and reconstruction for a ΔVAE trained on a picture with too much weight on higher Fourier components for the ΔVAE to capture the topological structure.

Interesting patterns

For moderate weights on Fourier components, the data manifold is often mapped to the latent space in very interesting ways, see Fig. 8. The patterns that occur in this manner are very structured, suggesting that it may be possible to find underlying mathematical reasons for this structure. Moreover, it gives some indication that we may understand how the network arrives at such patterns, and how we may control them.

Figure 8: Interesting patterns may occur when translations of a picture with not too much weight on higher Fourier components are encoded into the latent space of a flat torus.

5 Conclusion

We developed and implemented Diffusion Variational Autoencoders (ΔVAEs), which allow for arbitrary closed manifolds as a latent space; code is available at https://github.com/LuisArmandoPerez/DiffusionVAE. Our original motivation was to investigate to what extent VAEs find semantically meaningful latent variables, and more specifically, whether they can capture topological and geometrical structure in datasets. By allowing for an arbitrary manifold as a latent space, ΔVAEs can remove obstructions to capturing such structure.

Indeed, our experiments with translations of periodic images show that a simple implementation of a ΔVAE with a flat torus as latent space is capable of capturing topological properties.

However, we have also observed that when we use periodic images with significant higher Fourier components, it is more challenging for the ΔVAE to capture the topological structure. We will investigate further whether this can be resolved by better network design, by pretraining, by different loss functions, or by more extensive modifications of the VAE algorithm.

Our exploratory analysis shows that we can ask well-defined questions about whether a VAE can capture topological structure, and that such questions are amenable to a statistical approach. Moreover, the patterns observed in latent space in Fig. 6 and Fig. 8 suggest that it may be possible to capture in mathematical theorems what the image of a data manifold in latent space will look like when the networks (almost) minimize the loss. Combining these different perspectives, we may eventually develop more understanding of how to design algorithms that capture semantically meaningful latent variables.