A large part of unsupervised learning is devoted to the extraction of meaningful latent factors that explain a certain data set. The terminology around Variational Autoencoders suggests that they are a good tool for this task.
A Variational Autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) consists of a latent space $Z$ (often just Euclidean space), a probability measure $\mathbb{P}_Z$ on $Z$, an encoder from the data space $X$ to $Z$ and a decoder from $Z$ to $X$. The term “latent space” suggests that an element $z$ in $Z$ captures semantic information about its decoded data point $x$. But is this interpretation actually warranted, especially considering that in practice a Variational Autoencoder is often trained without any knowledge about an underlying generative process? The only available information is the dataset itself. Nonetheless, one may train a Variational Autoencoder with any “latent space” $Z$, and purely based on this terminology one could be tempted to think of elements in $Z$ as latent variables and factors explaining the data.
Certainly, there is a heuristic argument that gives a partial justification of this interpretation. The starting point and guiding principle is that of Occam’s razor: such latent factors might arise from the construction of simple, low-complexity models that explain the data (Portegies, 2018). Strict formalizations of Occam’s razor exist, through Kolmogorov complexity and inductive inference (Solomonoff, 1964; Schmidhuber, 1997), but are often computationally intractable. A more practical approach leads to the principle of minimum description length (Rissanen, 1978; Hinton & Zemel, 1994) and variational inference (Honkela & Valpola, 2004).
From a different angle, part of the interpretation as latent space is warranted by the loss function of the Variational Autoencoder, which stimulates a continuous dependence between the latent variables and the corresponding data points. Close-by points in data space should also be close-by in latent space. This suggests that a Variational Autoencoder could capture topological and geometrical properties of a data set.
However, a standard Variational Autoencoder (with a Euclidean latent space) is at times structurally incapable of accurately capturing topological properties of a data set. Take for example the case of a spinning object placed on a turntable and being recorded by a camera from a fixed position. The data set for this example is the collection of all frames. The true latent factor is the angle of the turntable. However, the space of angles is topologically and geometrically different from Euclidean space. In an extreme example, if we train a Variational Autoencoder with a one-dimensional latent space on the pictures from the object on the turntable, there will be pictures taken from almost the same angle ending up at completely different parts of the latent space.
This phenomenon has been called manifold mismatch (Davidson et al., 2018; Falorsi et al., 2018). To match the latent space with the data structure, Davidson et al. implemented spheres as latent spaces, whereas Falorsi et al. implemented the special orthogonal group $SO(3)$.
As further examples of datasets with topologically nontrivial latent factors, we can think of many translations of the same periodic picture, where the translation is the latent variable, or many pictures of the same object which has been rotated arbitrarily. In these cases, there are still clear latent variables, but their topological and geometrical structure is neither that of Euclidean space nor that of a sphere, but rather that of a torus and that of $SO(3)$, respectively.
To address the problem of manifold mismatch, we developed the Diffusion Variational Autoencoder (ΔVAE), which allows for an arbitrary manifold as a latent space. Our implementation includes a version of the reparametrization trick, and a fast approximation of the evaluation of the KL divergence in the loss.
We implemented ΔVAEs with latent spaces of $d$-dimensional spheres, a two-dimensional flat torus, an embedded torus in $\mathbb{R}^3$, the special orthogonal group $SO(3)$ and the real projective spaces $\mathbb{RP}^d$.
We trained a ΔVAE with a flat torus as latent space on a synthetic data set of translated images. Our results show that the ΔVAE can capture the topological properties of the data. We further observed that the success rate of the ΔVAE in capturing global topological properties depends on the weight of higher Fourier components in the image. In data sets with more pronounced lower Fourier components the ΔVAE is more successful in capturing the topology.
Mainly as a proof of concept, we trained on MNIST using the manifolds mentioned above.
2 Related work
Our work originated out of the search for algorithms that find semantically meaningful latent factors of data. The use of VAEs and their extensions to this end has mostly taken place in the context of disentanglement of latent factors (Higgins et al., 2017, 2018; Burgess et al., 2018). Examples of extensions that aim at disentangling latent factors are the β-VAE (Higgins et al., 2017), the factor-VAE (Kim & Mnih, 2018), the β-TCVAE (Chen et al., 2018) and the DIP-VAE (Kumar et al., 2018).
However, the examples in the introduction already show that in some situations, the topological structure of the latent space makes it practically impossible to disentangle latent factors. The latent factors are inherently, topologically entangled: in the case of a 3d rotation of an object, one cannot assign globally linearly independent angles of rotation.
Still, it is exactly global topological properties that we feel a VAE has a chance of capturing. What do we mean by this? One instance of ‘capturing’ topological structure is when the encoder and decoder of the VAE provide bijective, continuous maps between data and latent space, also called homeomorphic auto-encoding (Falorsi et al., 2018; de Haan & Falorsi, 2018). This can only be done when the latent space has a particular topological structure, for instance that of a particular manifold.
One of the main challenges when implementing a manifold as a latent space is the design of the reparametrization trick. In (Davidson et al., 2018), a VAE was implemented with a hyperspherical latent space. To our understanding, they implemented a reparametrization function which was discontinuous; see also Section 3.5 below.
If a manifold has the additional structure of a Lie group, this structure allows for a more straightforward implementation of the reparametrization trick (Falorsi et al., 2018). In our work, we do not assume the additional structure of a Lie group, but develop a reparametrization trick that works for general submanifolds of Euclidean space, and therefore by the Whitney (respectively Nash) embedding theorem, for general closed (Riemannian) manifolds.
The method that we use has similarities with the approach of Hamiltonian Variational Inference (Salimans et al., 2015). Moreover, the implementation of a manifold as a latent space can be seen as enabling a particular, informative, prior distribution. In that sense, our work relates to (Dilokthanakul et al., 2016; Tomczak & Welling, 2017). The prior distribution we implement is very degenerate, in that it does not assign weight to points outside of the manifold.
There are also other ways to implement approximate Bayesian inference on Riemannian manifolds. For instance, Liu and Zhu adapted the Stein variational gradient method to enable training on a Riemannian manifold (Liu & Zhu, 2017). However, their proposed method is rather expensive computationally.
The family of approximate posteriors that we implement is a direct generalization of the standard choice for a Euclidean VAE. Indeed, the Gaussian distributions are solutions to the heat equation, i.e. they are transition kernels of Brownian motion. One may want to increase the flexibility of the family of approximate posterior distributions, for instance by applying normalizing flows (Rezende & Mohamed, 2015; Kingma et al., 2016; Gemici et al., 2016).
3.1 Variational autoencoders
A VAE generally has the following ingredients:
a latent space $Z$,
a prior probability distribution $\mathbb{P}_Z$ on $Z$,
a family of encoder distributions $\mathbb{Q}_Z^{\alpha}$ on $Z$, parametrized by $\alpha$ in a parameter space $A$,
a family of decoder distributions $\mathbb{P}_X^{\beta}$ on data space $X$, parametrized by $\beta$ in a parameter space $B$; in the usual setup, and in our paper, $B$ in fact corresponds to the data space $X$, and $\beta$ refers to the mean of a Gaussian distribution with identity covariance,
an encoder neural network $\alpha$ which maps from data space $X$ to the parameter space $A$,
a decoder neural network $\beta$ which maps from latent space $Z$ to parameter space $B$.
The weights of these neural networks are optimized so as to minimize the negated evidence lower bound (ELBO)

$$\mathcal{L}(x) = -\mathbb{E}_{z \sim \mathbb{Q}_Z^{\alpha(x)}}\left[\log p_X^{\beta(z)}(x)\right] + D_{\mathrm{KL}}\left(\mathbb{Q}_Z^{\alpha(x)} \,\|\, \mathbb{P}_Z\right).$$
The first term is called reconstruction error (RE); up to additive and multiplicative constants it equals the mean squared error (MSE). The second term is called the KL-loss.
In a very common implementation, both latent space $Z$ and data space $X$ are Euclidean, and the families of decoder and encoder distributions are multivariate Gaussian. The encoder and decoder networks then assign to a datapoint or a latent variable a mean and a variance respectively.
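For the Euclidean case just described, the negated ELBO can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a diagonal Gaussian encoder, a standard normal prior, a Gaussian decoder with identity covariance, and a single Monte Carlo sample for the reconstruction term. The names `gaussian_kl`, `negated_elbo` and the `decoder` callable are hypothetical.

```python
import numpy as np


def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)


def negated_elbo(x, mu, log_var, decoder, rng):
    """One-sample Monte Carlo estimate of the negated ELBO for a batch x.

    mu, log_var : encoder outputs (assumed diagonal Gaussian posterior)
    decoder     : hypothetical callable mapping latent points to decoder means
    """
    # Reparametrization trick in Euclidean space: z = mu + sigma * eps.
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps
    x_hat = decoder(z)
    # Reconstruction error: -log N(x; x_hat, I), i.e. the squared error
    # up to additive and multiplicative constants.
    dim_x = x.shape[-1]
    recon = 0.5 * np.sum((x - x_hat) ** 2, axis=-1) + 0.5 * dim_x * np.log(2 * np.pi)
    return recon + gaussian_kl(mu, log_var)
```

The first summand corresponds to the reconstruction error and the second to the KL-loss; for a manifold-valued latent space both the sampling step and the KL term need to be replaced.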
When we implement $Z$ as a Riemannian manifold, we need to find an appropriate prior distribution $\mathbb{P}_Z$, for which we will choose the normalized Riemannian volume measure, a family of encoder distributions $\mathbb{Q}_Z^{\alpha}$, for which we will take transition kernels of Brownian motion, and an encoder network mapping to the correct parameters.
3.2 Brownian motion on a Riemannian manifold
We will briefly discuss Brownian motion on a Riemannian manifold, recommending lecture notes by Hsu (Hsu, 2008) as a more extensive introduction.
In the paper, we always assume that $Z$ is a smooth Riemannian submanifold of Euclidean space $\mathbb{R}^n$, which is closed, i.e. it is compact and has no boundary.
There are many different, equivalent definitions of Brownian motion. We present here the definition that is closest to our eventual approximation and implementation.
We will construct Brownian motion out of random walks on a manifold. We first fix a small time step $\tau > 0$. We imagine a particle jumping from point to point on the manifold after each time step, see also Fig. 2. It starts off at a point $z \in Z$. We describe the first jump, after which the process just repeats. After time $\tau$, the particle makes a random jump $\sqrt{\tau}\,\epsilon_1$ from its current position into the surrounding space, where $\epsilon_1$ is distributed according to a radially symmetric distribution in $\mathbb{R}^n$ with identity covariance matrix. The position of the particle after the jump, $z + \sqrt{\tau}\,\epsilon_1$, will therefore in general not be on the manifold, so we project the particle back: the particle’s new position will be

$$z_1 = P\left(z + \sqrt{\tau}\,\epsilon_1\right),$$

where the closest-point projection $P : \mathbb{R}^n \to Z$ assigns to every point $x \in \mathbb{R}^n$ the point in $Z$ that is closest to $x$. After another time $\tau$ the particle makes a new, independent jump $\sqrt{\tau}\,\epsilon_2$ according to the same radially symmetric distribution, and its new position will be $z_2 = P(z_1 + \sqrt{\tau}\,\epsilon_2)$. This process just repeats.
Key to this construction, and also to our implementation, is the projection map $P$. It has nice properties that follow from the general theory of smooth manifolds. In particular, $P(x)$ depends smoothly on $x$, as long as $x$ is not too far away from $Z$.
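As an illustration of a single jump-and-project step, consider the unit sphere, where the closest-point projection is simply normalization. The sketch below is illustrative only: the function names are hypothetical, the jump is drawn from a standard Gaussian (one example of a radially symmetric distribution with identity covariance), and the scaling of the jump by $\sqrt{\tau}$ is an assumption consistent with Brownian scaling.

```python
import numpy as np


def project_to_sphere(x):
    """Closest-point projection onto the unit sphere (undefined at the origin)."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def random_walk_step(z, tau, rng):
    """One step of the random walk: jump into ambient space, then project back."""
    eps = rng.standard_normal(z.shape)  # radially symmetric, identity covariance
    return project_to_sphere(z + np.sqrt(tau) * eps)


rng = np.random.default_rng(0)
south_pole = np.array([[0.0, 0.0, -1.0]])
new_position = random_walk_step(south_pole, tau=0.01, rng=rng)
```

The projection is smooth everywhere except at the origin, which matches the remark that smoothness holds as long as the point is not too far away from the manifold.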
This way, for $\tau$ fixed, we have constructed a random walk, a random path on the manifold. We can think of this path as a discretized version of Brownian motion. Let now $\tau_k$ be a sequence converging to $0$ as $k \to \infty$. For fixed $k$, we can construct a random walk with time step $\tau_k$, and get a random path $W^k$.
The random paths $W^k$ converge as $k \to \infty$ to a random path $W$ (in distribution). This random path is called Brownian motion. The convergence statement can be made precise by, for instance, combining powerful, general results by Jørgensen (1975) with standard facts from Riemannian geometry. But because Riemannian manifolds are locally, i.e. when you zoom in far enough, very similar to Euclidean space, the convergence result essentially comes down to the central limit theorem and its upgraded version, Donsker’s invariance theorem.
In fact, $W$ can be interpreted as a Markov process, and even as a diffusion process. If $A$ is a subset of $Z$, the probability that the Brownian motion started at $z$ is in the set $A$ at time $t$ is measured by a probability measure $\mathbb{Q}_Z^{t,z}$ applied to the set $A$. We denote the density of this measure with respect to the standard Riemannian volume measure by $q_Z(t, z, \cdot)$. The function $q_Z$ is sometimes referred to as the heat kernel.
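Let us close this subsection with an alternative description of the function $q_Z$. It is also characterized by the fact that for every function $u_0 : Z \to \mathbb{R}$, the solution to the partial differential equation

$$\partial_t u = \tfrac{1}{2}\Delta u, \qquad u(0, \cdot) = u_0,$$

where $\Delta$ denotes the Laplace-Beltrami operator on $Z$, is given by

$$u(t, z) = \int_Z q_Z(t, z, w)\, u_0(w)\, dw.$$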
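As a sanity check of this characterization, one can look at the flat case $Z = \mathbb{R}^d$ (not a closed manifold, so this only illustrates the local picture). Assuming the normalization in which Brownian motion has identity covariance per unit time, the kernel is the Gaussian density, and a direct computation shows that it solves the heat equation with the factor $\tfrac{1}{2}$:

```latex
q(t, z, w) = (2\pi t)^{-d/2} \exp\!\left(-\frac{|z - w|^2}{2t}\right),
\qquad
\partial_t q = \left(-\frac{d}{2t} + \frac{|z - w|^2}{2t^2}\right) q,
\qquad
\Delta_z q = \left(-\frac{d}{t} + \frac{|z - w|^2}{t^2}\right) q,
```

so that $\partial_t q = \tfrac{1}{2}\Delta_z q$.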
3.3 Riemannian manifold as latent space
A ΔVAE is a VAE with a Riemannian submanifold of Euclidean space as a latent space, and the transition probability measures of Brownian motion $\mathbb{Q}_Z^{t,z}$ as a parametric family of encoder distributions. We propose the uniform distribution for $\mathbb{P}_Z$, which is the normalized standard measure on a Riemannian manifold (although the choice of prior distribution could easily be generalized).
As in the standard VAE, we then implement functions $(z, t) : X \to Z \times (0, \infty)$ and $\beta : Z \to X$ as neural networks.
We optimize the weights in the network, aiming to minimize the average loss for the loss function

$$\mathcal{L}(x) = -\int_Z \log p_X^{\beta(w)}(x)\, q_Z(t(x), z(x), w)\, dw + D_{\mathrm{KL}}\left(\mathbb{Q}_Z^{t(x), z(x)} \,\|\, \mathbb{P}_Z\right).$$
The first integral can often only be approached by sampling, and in that case it is often advantageous to perform a change of variables, commonly known as the reparametrization trick (Kingma & Welling, 2014).
3.5 Reparametrization trick
The reparametrization trick requires a space $E$ of noise variables, distributed according to $\mathbb{P}_E$, and a function $g$ such that $g(z, t, \epsilon)$ is distributed according to $\mathbb{Q}_Z^{t,z}$.
The following example illustrates that finding such a function can be a nontrivial task, even for a relatively simple case where $Z = S^2$, the two-dimensional unit sphere in $\mathbb{R}^3$. In an attempt to construct a random variable distributed according to $\mathbb{Q}_Z^{t,z}$, we may first sample a random tangent vector $v$ at the south pole $s$ according to an appropriate distribution, and then walk along the great circle in that direction for a distance according to the length of the vector. In other words, the new point is obtained from applying the exponential map at the south pole to the random sample. Next, we compose with a rotation of the sphere that brings the south pole to the point $z$ on the sphere. More precisely, we select a map $z \mapsto R_z$ of rotations such that every $z \in S^2$ is the image of the south pole under the rotation $R_z$. Then, a proposed reparametrization is given by

$$g(z, v) = R_z\left(\exp_s(v)\right).$$

In some sense, this reparametrization works: one could find a distribution on the tangent space at the south pole such that the distribution of $g(z, v)$ is $\mathbb{Q}_Z^{t,z}$. However, for topological reasons, in particular by the hairy ball theorem, it is impossible to find a continuous map $z \mapsto R_z$ with this property.
3.6 Approximate reparametrization by random walk
Instead, we construct an approximate reparametrization map by approximating Brownian motion by a random walk, similar to how we defined it in Section 3.2. Starting from a point $z$ on the manifold, we take a random step $\sqrt{t/N}\,\epsilon_1$ in ambient space $\mathbb{R}^n$. We then project back to the manifold and repeat: we take a new step and project back to the manifold. In total, we take $N$ steps, see Fig. 2.
We define the function $g$ by

$$g(z, t, \epsilon_1, \dots, \epsilon_N) = P\Big(\cdots P\Big(P\Big(z + \sqrt{\tfrac{t}{N}}\,\epsilon_1\Big) + \sqrt{\tfrac{t}{N}}\,\epsilon_2\Big) \cdots + \sqrt{\tfrac{t}{N}}\,\epsilon_N\Big).$$

If we take $\epsilon_1, \dots, \epsilon_N$ as i.i.d. random variables, distributed according to a radially symmetric distribution, then $g(z, t, \epsilon_1, \dots, \epsilon_N)$ is approximately distributed as a random variable with density $q_Z(t, z, \cdot)$. This approximation is very accurate for small times $t$, even for small values of $N$, if we take $\epsilon_i$ approximately Gaussian. The observation that for small times the diffusion kernel is approximately Gaussian is also very helpful in approximating the KL term in the loss.
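The random-walk reparametrization can be sketched for any manifold for which a closest-point projection is available. The example below is a hedged illustration rather than the authors' code: it assumes a step size of $\sqrt{t/N}$ per step, Gaussian noise, and a flat torus realized as the product of two unit circles in $\mathbb{R}^4$. The names `project_to_flat_torus` and `reparametrize` are hypothetical.

```python
import numpy as np


def project_to_flat_torus(x):
    """Closest-point projection from R^4 onto S^1 x S^1 (two unit circles)."""
    pairs = x.reshape(x.shape[:-1] + (2, 2))
    pairs = pairs / np.linalg.norm(pairs, axis=-1, keepdims=True)
    return pairs.reshape(x.shape)


def reparametrize(z, t, eps, project):
    """Approximate sample from the Brownian transition kernel started at z.

    z       : points on the manifold, shape (batch, n)
    t       : diffusion times, shape (batch, 1)
    eps     : noise, shape (n_steps, batch, n)
    project : closest-point projection onto the manifold
    """
    n_steps = eps.shape[0]
    step = np.sqrt(t / n_steps)
    for k in range(n_steps):
        # Jump into ambient space, then project back onto the manifold.
        z = project(z + step * eps[k])
    return z


rng = np.random.default_rng(0)
z = project_to_flat_torus(rng.standard_normal((8, 4)))
t = np.full((8, 1), 0.05)
eps = rng.standard_normal((10, 8, 4))
w = reparametrize(z, t, eps, project_to_flat_torus)
```

Every operation is differentiable in `z` and `t` away from the singular set of the projection, so in an automatic differentiation framework gradients can flow through the sampling layer.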
3.7 Approximation of the KL-divergence
Unlike for the standard VAE, or the hyperspherical VAE with the Von-Mises distribution, the KL-term cannot be computed exactly for the ΔVAE. There are several techniques one could use to get, nonetheless, a good approximation of the KL divergence. We have implemented an asymptotic approximation, which we will describe first.
We can use short-term asymptotics, i.e. a parametrix expansion, of the heat kernel on Riemannian manifolds to obtain asymptotic expansions of the entropy.
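For a $d$-dimensional sphere of radius $R$, the scalar curvature equals $\mathrm{Sc} = d(d-1)/R^2$. By using a parametrix expansion of the heat kernel, see for instance (Zhao & Song, 2017), we find that

$$q_Z(t, z, w) = (2\pi t)^{-d/2}\, e^{-r^2/(2t)} \left(u_0(r) + t\, u_1(r) + O(t^2)\right),$$

where $r$ is the geodesic distance between $z$ and $w$, and $u_0$ and $u_1$ denote the first two coefficient functions of the expansion.
We then derive, with $\mathrm{Vol}(Z)$ the Riemannian volume of $Z$, the following asymptotic behavior for the KL divergence as $t \to 0$:

$$D_{\mathrm{KL}}\left(\mathbb{Q}_Z^{t,z} \,\|\, \mathbb{P}_Z\right) = -\frac{d}{2}\log(2\pi t) - \frac{d}{2} + \log \mathrm{Vol}(Z) + \frac{1}{4}\,\mathrm{Sc}\, t + o(t).$$

We conjecture that this asymptotic formula holds for general $d$-dimensional Riemannian manifolds. In our implementation, we restrict $t$ so that it cannot become too large, thus ensuring a certain accuracy of the asymptotic expansion.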
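A small numerical experiment can make the asymptotic approximation concrete. The sketch below assumes the small-time form $-\tfrac{d}{2}\log(2\pi t) - \tfrac{d}{2} + \log \mathrm{Vol}(Z) + \tfrac{1}{4}\mathrm{Sc}\,t$ for the KL divergence with respect to the normalized volume measure, and compares it on the unit circle ($d = 1$, volume $2\pi$, zero curvature) with a direct numerical integration of the wrapped Gaussian heat kernel. The function names are hypothetical.

```python
import numpy as np


def kl_asymptotic(t, dim, volume, scalar_curvature):
    """Assumed small-time approximation of KL(Q^{t,z} || uniform prior)."""
    return (-0.5 * dim * np.log(2 * np.pi * t) - 0.5 * dim
            + np.log(volume) + 0.25 * scalar_curvature * t)


def kl_circle_numeric(t, n_grid=20000, n_wrap=5):
    """KL between the heat kernel on the unit circle and the uniform prior."""
    theta = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    shifts = 2 * np.pi * np.arange(-n_wrap, n_wrap + 1)[:, None]
    # Heat kernel on the circle: Gaussian with variance t, wrapped around.
    q = np.sum(np.exp(-(theta + shifts) ** 2 / (2 * t)), axis=0) / np.sqrt(2 * np.pi * t)
    # The prior density is 1 / (2 pi); clip q to avoid log(0) in the tails.
    integrand = q * np.log(np.maximum(q, 1e-300) * 2 * np.pi)
    return np.sum(integrand) * (2 * np.pi / n_grid)


t = 0.05
approx = kl_asymptotic(t, dim=1, volume=2 * np.pi, scalar_curvature=0.0)
exact = kl_circle_numeric(t)
```

On the circle the curvature term vanishes, so this check only probes the leading terms; the curvature correction matters for spaces such as spheres.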
Table caption (beginning truncated): … runs. The columns represent the (data-averaged) log-likelihood estimate (LL), Evidence Lower Bound (ELBO), KL-divergence (KL) and mean squared error (MSE).
Numerical approximation and integration
For some manifolds such as spheres or flat tori, exact solutions to the heat kernel are available, usually in terms of infinite series and special functions. For other, low-dimensional, Riemannian manifolds, the heat kernel may be accurately computed numerically. In both cases, the integration in the KL-term may be performed numerically. If the dimensionality of the underlying manifold gets too large, one may have to resort to Monte Carlo approximation of the integral.
We have implemented ΔVAEs with latent spaces of $d$-dimensional spheres $S^d$, a flat two-dimensional torus, a torus embedded in $\mathbb{R}^3$, the $SO(3)$ and real projective spaces $\mathbb{RP}^d$.
For all our experiments we used multi-layer perceptrons for the encoder and decoder with three and two hidden layers respectively. Recall that the encoder needs to produce both a point $z$ on the manifold and a time $t$ for the transition kernel. These functions share all layers, except for the final step, where we project, with the projection map $P$, from the last hidden layer to the manifold to get $z$, and use an output layer with an activation function to obtain $t$.
The encoder and decoder are connected by a sampling layer, in which we approximate sampling from the transition kernel of Brownian motion according to the reparametrization trick described in Section 3.6.
4.1 ΔVAEs for MNIST
We then trained ΔVAEs on MNIST. We show the manifolds as latent space with encoded MNIST digits in Figs. 1 and 3. When MNIST is trained on different latent spaces, different adjacency structures between digits may become apparent, providing topological information.
The $\mathbb{RP}^3$ is isometric to a scaling of the $SO(3)$ (with natural choices of Riemannian metrics). Although we have implemented an embedding, and a projection map for this embedding, for the $SO(3)$ directly, based on a singular value decomposition to find the nearest orthogonal matrix, training on the $\mathbb{RP}^3$ with the following trick was faster and we only present these results.
For training the projective spaces, we used an additional trick, where instead of embedding $\mathbb{RP}^d$ in a Euclidean space, we embed $S^d$ in $\mathbb{R}^{d+1}$, and make the decoder neural network even by construction (i.e. the decoder $\beta$ applied to a point $z$ on the sphere equals the decoder applied to the point $-z$). Then, an encoder and decoder to and from the $\mathbb{RP}^d$ are defined implicitly. However, it must be noted that this setup does not allow for a homeomorphic encoding (because $\mathbb{RP}^d$ does not embed in $S^d$).
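The two ideas in this subsection, projecting a matrix to the nearest rotation via a singular value decomposition and making a decoder even by construction, can each be sketched in a few lines. Both functions below are illustrative assumptions and not necessarily how the authors implemented them: `project_to_so3` adds a determinant correction so that the result is a rotation rather than a reflection, and `make_even` symmetrizes an arbitrary decoder by averaging its values at $z$ and $-z$, which is only one of several ways to enforce evenness.

```python
import numpy as np


def project_to_so3(m):
    """Nearest rotation matrix (in Frobenius norm) to a 3x3 matrix m."""
    u, _, vt = np.linalg.svd(m)
    # Flip the last singular direction if u @ vt would be a reflection.
    sign = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, sign]) @ vt


def make_even(decoder):
    """Wrap a decoder so that the result satisfies even(z) == even(-z)."""
    def even_decoder(z):
        return 0.5 * (decoder(z) + decoder(-z))
    return even_decoder


rng = np.random.default_rng(0)
rotation = project_to_so3(rng.standard_normal((3, 3)))

weights = rng.standard_normal((3, 5))  # toy stand-in for a decoder network
even_decoder = make_even(lambda z: np.tanh(z @ weights) + z @ weights)
```

Because an even decoder cannot distinguish antipodal points, it descends to a well-defined map on the projective space, which is the sense in which the encoder and decoder are defined implicitly.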
The numerically computed ELBO, reconstruction error, KL-divergence and mean-squared-error are shown in Table 1 together with the estimated log-likelihood for a test dataset of MNIST.
Estimation of log-likelihood
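For the evaluation of the proposed methods we have estimated the log-likelihood of the test dataset according to the importance sampling presented in (Burda et al., 2016). The approximate log-likelihood of datapoint $x$ is calculated by sampling $k$ latent variables $w_1, \dots, w_k$ according to the approximate posterior $\mathbb{Q}_Z^{t(x), z(x)}$. The estimated log-likelihood for datapoint $x$ is given by

$$\log p(x) \approx \log\left(\frac{1}{k}\sum_{i=1}^{k} \frac{p_X^{\beta(w_i)}(x)\, p_Z(w_i)}{q_Z(t(x), z(x), w_i)}\right),$$

where $p_Z$ denotes the density of the prior.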
The log-likelihood estimates presented in Table 1 are obtained with samples for each datapoint, averaged over all datapoints.
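In practice an importance-sampling estimate of this kind is usually computed in log space to avoid numerical underflow. The following minimal sketch assumes that the log-densities of the decoder, the prior and the approximate posterior have already been evaluated at the sampled latent points; the function names and array layout are hypothetical.

```python
import numpy as np


def log_mean_exp(a, axis=0):
    """Numerically stable log(mean(exp(a))) along the given axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.mean(np.exp(a - m), axis=axis))


def importance_weighted_ll(log_p_x_given_w, log_prior, log_q):
    """Importance-sampling log-likelihood estimate (Burda et al., 2016).

    All arguments have shape (n_samples, n_datapoints) and hold log-densities
    evaluated at latent samples drawn from the approximate posterior.
    """
    log_weights = log_p_x_given_w + log_prior - log_q
    return log_mean_exp(log_weights, axis=0)
```

Averaging the returned values over all datapoints then gives a data-averaged estimate of the kind reported in the table.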
4.2 Translations of periodic pictures
To test whether a VAE can capture topological properties, we trained it on synthetic datasets consisting of translations of the same periodic picture.
Our input pictures were discretized to 64×64 pixels, but to ease presentation, we discuss them below as if they were continuous. Note that with the choice of multi-layer perceptrons as encoder and decoder networks, the network has no information about which pixels are contiguous.
To illustrate the idea, we start with a very simple picture , which only consists of the lowest Fourier components in each direction
By translating the picture, i.e. by considering different phases, we obtain a submanifold of the space of all pictures. When we interpret this space as , we get an embedding of the flat torus which is isometric up to a scaling. In other words, if we train the VAE with a flat torus as latent space, we essentially try to find the identity map.
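A dataset of this type can be generated by sampling a periodic function on a shifted grid. The sketch below is only an illustration: the concrete picture `simple_picture` (a sum of the lowest cosine modes in each direction) is a hypothetical stand-in, and the helper names are assumptions.

```python
import numpy as np


def simple_picture(u, v):
    """Hypothetical picture containing only the lowest Fourier modes."""
    return np.cos(u) + np.cos(v)


def translated_picture(picture, shift, n_pixels=64):
    """Sample a 2*pi-periodic picture, translated by `shift`, on a pixel grid."""
    grid = 2 * np.pi * np.arange(n_pixels) / n_pixels
    u, v = np.meshgrid(grid, grid, indexing="ij")
    return picture(u - shift[0], v - shift[1])


def make_dataset(picture, n_samples, rng, n_pixels=64):
    """Random translations (the latent variables) and the flattened pictures."""
    shifts = rng.uniform(0.0, 2 * np.pi, size=(n_samples, 2))
    images = np.stack([translated_picture(picture, s, n_pixels) for s in shifts])
    return shifts, images.reshape(n_samples, -1)


rng = np.random.default_rng(0)
shifts, images = make_dataset(simple_picture, n_samples=16, rng=rng)
```

Flattening the images matches the setting of multi-layer perceptrons, which receive no information about which pixels are adjacent.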
Fig. 4 illustrates that for this simple case, the ΔVAE with a flat torus as latent space indeed captures the translation of the picture as a latent variable. The fact that Fig. 4 is practically a reflection and translation of the legend in Fig. 5 shows that there is an almost isometric correspondence between the translation of the original picture and the encoded latent variable.
Although the figure does not provide a proof that the embedding is homeomorphic, it is in principle possible to give a guarantee that the encoder map is surjective (more precisely, that it has degree $1$ or $-1$).
We contrast this to when we train the same dataset on a VAE with a sphere as a latent space. Fig. 4 displays typical results, showing that large parts of the sphere are not covered.
We present further numerical results in Table 2. We note that of all the closed manifolds, the MSE is by far the lowest for the flat torus. The KL divergence is lower for the positively-curved spaces , and , but the MSE for those is about a factor three higher. For as latent space the MSE is even lower, but it is hard to compare because the KL term is not directly comparable, and it is the combination of the two that is optimized.
More complicated pictures
As an exploratory analysis of the extent to which the VAE can capture the topological structure, we trained on fixed datasets, each consisting of translations of a fixed random picture. These random pictures, viewed as functions , are created by randomly drawing complex Fourier coefficients from a Gaussian distribution:
where is a discounting factor. The ΔVAE with a flat torus as latent space is also capable of capturing the translations when the picture is generated this way, as long as the higher Fourier components do not carry too much weight, see Fig. 6.
Too complicated pictures
Generically, we see that when there is too much weight on the higher Fourier components in the picture ( with ), the VAE is no longer capable of capturing the translations, see Fig. 7.
We only tested a very simple setup. It is likely that convolutional neural networks would significantly improve the performance in capturing the translations, since implicit information is added to the network about the underlying geometry of the pictures.
Initial experiments with pretraining the network on simple Fourier images also show an increased success rate of capturing the topological structure.
For moderate weights on Fourier components, the data manifold is often mapped to the latent space in very interesting ways, see Fig. 8. The patterns that occur in this manner are very structured, suggesting that it may be possible to find underlying mathematical reasons for this structure. Moreover, it gives some indication that we may understand how the network arrives at such patterns, and how we may control them.
We developed and implemented Diffusion Variational Autoencoders, which allow for arbitrary manifolds as a latent space (code available at https://github.com/LuisArmandoPerez/DiffusionVAE). Our original motivation was to investigate to which extent VAEs find semantically meaningful latent variables, and more specifically, whether they can capture topological and geometrical structure in datasets. By allowing for an arbitrary manifold as a latent space, ΔVAEs can remove obstructions to capturing such structure.
Indeed, our experiments with translations of periodic images show that a simple implementation of a VAE with a flat torus as latent space is capable of capturing topological properties.
However, we have also observed that when we use periodic images with significant high Fourier components, it is more challenging for the VAE to capture the topological structure. We will investigate further whether this can be resolved by better network design, by pretraining, different loss functions, or even more intensive extensions of the VAE algorithm.
Our exploratory analysis shows that we can ask well-defined questions about whether a VAE can capture topological structure, and that such questions are amenable to a statistical approach. Moreover, the patterns observed in latent space in Fig. 6 and Fig. 8, suggest that it may be possible to capture in mathematical theorems what the image of a data manifold in latent space would look like, when the networks are (almost) minimizing the loss. Combining these different perspectives, we may eventually develop more understanding on how to develop algorithms that capture semantically meaningful latent variables.
- Burda et al. (2016) Burda, Y., Grosse, R., and Salakhutdinov, R. Importance weighted autoencoders. ICLR, 2016.
- Burgess et al. (2018) Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-VAE. arXiv preprint, arXiv:1804.03599, 2018.
- Chen et al. (2018) Chen, T. Q., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in Variational Autoencoders. In Advances in Neural Information Processing Systems 31. 2018.
- Davidson et al. (2018) Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. Hyperspherical variational auto-encoders. In 34th Conference on Uncertainty in Artificial Intelligence, 2018.
- de Haan & Falorsi (2018) de Haan, P. and Falorsi, L. Topological constraints on homeomorphic auto-encoding. arXiv preprint, arXiv:1812.10783, 2018.
- Dilokthanakul et al. (2016) Dilokthanakul, N., Mediano, P. A. M., Garnelo, M., Lee, M. C. H., Salimbeni, H., Arulkumaran, K., and Shanahan, M. Deep unsupervised clustering with Gaussian mixture Variational Autoencoders. arXiv preprint, arXiv:1611.02648, 2016.
- Falorsi et al. (2018) Falorsi, L., de Haan, P., Davidson, T. R., De Cao, N., Weiler, M., Forré, P., and Cohen, T. S. Explorations in homeomorphic variational auto-encoding. arXiv preprint, arXiv:1807.04689, 2018.
- Gemici et al. (2016) Gemici, M. C., Rezende, D., and Mohamed, S. Normalizing flows on Riemannian manifolds. arXiv preprint, arXiv:1611.02304, 2016.
- Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. β-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.
- Higgins et al. (2018) Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L., Rezende, D., and Lerchner, A. Towards a definition of disentangled representations. arXiv preprint, arXiv:1812.02230, 2018.
- Hinton & Zemel (1994) Hinton, G. E. and Zemel, R. S. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems, 1994.
- Honkela & Valpola (2004) Honkela, A. and Valpola, H. Variational learning and bits-back coding: An information-theoretic view to Bayesian learning. IEEE Transactions on Neural Networks, 15(4):800–810, 7 2004. ISSN 1045-9227.
- Hsu (2008) Hsu, E. P. A brief introduction to Brownian motion on a Riemannian manifold. Lecture notes, 2008.
- Jørgensen (1975) Jørgensen, E. The central limit problem for geodesic random walks. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 32(1-2):1–64, 1975. ISSN 00443719.
- Kim & Mnih (2018) Kim, H. and Mnih, A. Disentangling by factorising. arXiv preprint, arXiv:1802.05983, 2018.
- Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In ICLR, 2014.
- Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved Variational Inference with Inverse Autoregressive Flow. In Advances in Neural Information Processing Systems 29. 2016.
- Kumar et al. (2018) Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. ICLR, 2018.
- Liu & Zhu (2017) Liu, C. and Zhu, J. Riemannian Stein variational gradient descent for Bayesian inference. arXiv preprint, arXiv:1711.11216, 2017.
- Portegies (2018) Portegies, J. W. Ergo learning. Nieuw Archief voor Wiskunde, 5(3):199–205, 2018.
- Rezende & Mohamed (2015) Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
- Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint, arXiv:1401.4082, 2014.
- Rissanen (1978) Rissanen, J. Modeling by shortest data description. Automatica, 14(5):465–471, 9 1978. ISSN 0005-1098.
- Salimans et al. (2015) Salimans, T., Kingma, D., and Welling, M. Markov Chain Monte Carlo and Variational Inference: Bridging the gap. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
- Schmidhuber (1997) Schmidhuber, J. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857 – 873, 1997. ISSN 0893-6080.
- Solomonoff (1964) Solomonoff, R. A formal theory of inductive inference. Part I. Information and Control, 7(1):1–22, 3 1964. ISSN 0019-9958.
- Tomczak & Welling (2017) Tomczak, J. M. and Welling, M. VAE with a VampPrior. arXiv preprint, arXiv:1705.07120, 2017.
- Zhao & Song (2017) Zhao, C. and Song, J. S. Exact heat kernel on a hypersphere and its applications in kernel SVM. Frontiers in Applied Mathematics and Statistics, 4:1, 1 2017. ISSN 2297-4687.