Variational autoencoders (VAEs) (kingma:iclr:2014; rezende2014stochastic) simultaneously learn a conditional density of high dimensional observations and low dimensional representations giving rise to these observations. In VAEs, a prior distribution is assigned to the latent variables which is typically a standard Gaussian. It has, unfortunately, turned out that this choice of prior is limiting the modelling capacity of VAEs and richer have been proposed instead (tomczak2017vae; van2017neural; bauer2018resampled; klushyn2019learning). In contrast to this popular view, we will argue that the limitations of the prior are not due to lack of capacity, but rather lack of principle.
Using a Riemannian Brownian motion (ours) with trainable (scalar) variance.
Informally, the Gaussian prior has two key problems.
1. The Euclidean representation is arbitrary. Behind the Gaussian prior lies the assumption that the latent space is Euclidean. However, if the decoder is of sufficiently high capacity, then it is always possible to reparameterize the latent space from to , and then let the decoder invert this reparameterization as part of its decoding process (arvanitidis:iclr:2018; hauberg2018only). This implies that we cannot assign any meaning to specific instantiations of the latent variables, and that Euclidean distances carry limited meaning in . This is an identifiability problem and it is well-known that even the most elementary latent variable models are subject to such. For example, Gaussian mixtures can be reparameterized by permuting cluster indices, and principal components can be arbitrarily rotated (Bishop:2006:PRM:1162264).
2. Latent manifolds are mismapped onto . In all but the simplest cases, the latent manifold giving rise to data observations is embedded in . An encoder with adequate capacity will always recover some smoothened form of , which will either result in the latent space containing “holes” of low density or, in being mapped to the whole of under the influence of the prior. Both cases will lead to bad samples or convergence problems. This problem is called manifold mismatch (davidson2018hyperspherical; falorsi2018explorations) and is closely related to distribution mismatch (hoffman2016elbo; bauer2018resampled; rosca2018distribution) where the prior samples from regions to which the variational posterior (or encoder) does not assign any density. A graphical illustration of this situation can be seen on the left panel of Fig. 1, where a VAE is trained on the 1-digits of MNIST under the Gaussian prior. The prior assigns density where there is none.
In this paper, we consider an alternative prior, which is shown in the right panel of Fig. 1. This is a Riemannian Brownian motion model defined over the manifold immersed by the decoder. The Riemannian structure solves the identifiability problem and gives a meaningful representation that is invariant to reparametrizations and at the same time restricts the prior to sample only from the image of onto . The prior generalizes the Gaussian to the Riemannian setting. It only has a single scalar variance parameter, yet it is able to capture intrinsic complexities in the data.
2.1 Variational autoencoders
VAEs learn a generative model by specifying a likelihood of observations conditioned on latent variables and a prior over the latent variables . The marginal likelihood of the observations is intractable. As such, VAEs are trained by maximizing the variational Evidence Lower Bound (ELBO) on the marginal likelihood :
where denotes the variational family. kingma:iclr:2014; rezende2014stochastic
proposed a low variance estimator of stochastic gradients of the ELBO, known asreparameterization trick.
In the VAE framework, both the variational family and the conditional likelihood
are parameterized by neural networks with variational parametersand generative parameters . In the language of autoencoders, these networks are often called encoder and decoder parameterizing the variational family and the generative model respectively. From an autoencoder perspective, Eq. 2.1
can be seen as a loss function involving a data reconstruction term (the generative model) and a regularization term (the KL divergence between the variational family and the prior distribution over the latent variables).
2.2 A primer on Riemannian geometry
The standard Gaussian prior relies on the usual Lebesgue measure which in turn, assumes a Euclidean structure over the latent space . Recently, it has been noted (arvanitidis:iclr:2018; hauberg2018only) that this assumption is mathematically questionable, and that, empirically, Euclidean latent space distances carry little information about the relationship between data points. Rather, a Riemannian interpretation of the latent space appears more promising. Hence we give a short review of Riemannian geometry.
A smooth manifold is a topological manifold endowed with a smooth structure. That is to say is locally homeomorphic to Euclidean space and we are able to do calculus on it. For a point , the tangent space
is a vector space centered onwhich contains all tangent vectors to passing through point
. With this we can give a formal definition of the Riemannian metric tensor which is of central importance to any analysis involving Riemannian geometry.
(Riemannian metric) (docarmo:1992) Given a smooth manifold , a Riemannian metric on assigns on each point an inner product (i.e. a symmetric, positive definite, bilinear form) in the tangent space which varies smoothly in the following sense: if is a local coordinate chart centered at and for , then is a smooth function on .
By generalizing the inner product to Riemannian manifolds, the metric tensor gives meaning to length, angle and volume on manifolds. Central to distributions defined on a Riemannian manifold, the volume measure over an infinitesimal region centered at point is defined as , where is the matrix representation of the metric tensor evaluated at point . Shortest paths on manifolds are represented by geodesic curves, which generalize straight lines in Euclidean space. A geodesic is a constant speed curve and its length can be computed by integrating the norm of its velocity vector under the metric, in other words . For there is a useful map defined on a neighborhood of the origin of called the exponential map. More precisely, the exponential map is a diffeomorphism, i.e. a bijection with a smooth inverse, between an open subset and an open subset . Given and , there is a unique geodesic with and . The exponential map is given by . Note that . The inverse map (from to ) is called the logarithmic map.
Let be an embedded -dimensional manifold and consider local coordinates with an open subset. Then the Euclidean metric on induces a Riemannian metric on . Expressed in terms of the coordinates given by , this metric is known as the pull-back metric on under . For , the pull-back metric at is given by
where denotes the Jacobian matrix of .
2.3 VAE decoders as immersions
We will dedicate this subsection to showing that, under certain architectural choices, VAE decoders induce Riemannian metrics in the latent space. That is to say, they belong to a certain class of maps, called smooth immersions, which give rise to immersed submanifolds. In other words, we will formally describe our intuition about VAEs mapping the latent space back to data space, using the language of smooth manifolds and Riemannian geometry.
The generative and variational distributions can be seen as families of parameterized mappings and with , and and parameters and respectively. The family defined by the generative model is of particular interest. To make the subsequent exposition clearer we will assume a Gaussian generative model and rewrite it in the following form:
, denoting the mean and standard deviation of the generative model parameterized by neural networks with parametersand denoting the Hadamard or element-wise product.
(Smooth immersions) Given smooth manifolds and with dim() dim(), a mapping , a point and its image , the mapping is called an immersion if its differential is injective for all .
We will consider a particular Riemannian metric on induced by and . The architectures of and
are such that these maps are immersions with probability 1. Consider now thediagonal immersion
whose geometry encodes both mean and variance. The random map is a random projection given by of the diagonal immersion. Sampling using the decoder can therefore be seen as first sampling the image of this immersion and then randomly projecting down to (haubergeklund2019). Taking the pull-back metric of to we obtain
where and are the Jacobian matrices of and .
The metric was studied by arvanitidis:iclr:2018 and is known to yield geodesics that follow high density regions in latent space. As an example, Fig. 2 shows geodesics of a VAE trained on 1-digits from MNIST, which follow the data due to the variance term of the metric, which penalizes geodesics going through low density regions of the latent space.
3 Geometric latent priors
It is evident that the geometric structure over the latent space carries significant information about data density that the traditional Euclidean interpretation foregoes. With this in mind, we propose that the prior should be defined with respect to the geometric structure. We could opt for a Riemannian normal distribution
Riemannian normal distribution, which is well-studied (oller1993intrinsic; mardia2000basic; pennec2006intrinsic; arvanitidis2016locally; hauberg2018directional). Unfortunately, computing its normalization constant is expensive and involves Monte Carlo integration. Furthermore, it is equally hard to sample from this distribution, since it generally requires rejection sampling with non-trivial proposal distributions.
Instead we consider a cheap and flexible alternative, namely the heat kernel of a Brownian motion process (hsu2002stochastic). A Brownian motion on a Riemannian manifold can be defined through a stochastic differential equation on Stratonovich form:
Here is a Brownian motion in and denotes the projection of the standard basis of onto the tangent space of at . This way, a Brownian motion on is driven by a Euclidean Brownian motion projected to the tangent space. Fixing an initial point and a time , Brownian motion starting at running for time
gives rise to a random variable on. Its density function is the transition density . An alternative description of Brownian motion on is that is the heat kernel associated to the Laplace-Beltrami operator on . Below we will express the transition density in terms of local coordinates on . Conveniently, we may approximate the transition density by a so-called Parametrix expansion in a power series (hsu2002stochastic). In this paper we will use the zeroth order approximation which gives rise to the following expression for with :
, denotes the duration of the Brownian motion, and corresponds to variance on Euclidean manifolds.
is the dimensionality of .
is the center of the Brownian motion.
is the geodesic distance on the manifold.
is the ratio of the Riemannian volume measure evaluated at points and respectively.
Equation 3.2 can be evaluated reasonably fast as no Monte Carlo integration is required. The most expensive computation is the evaluation of the geodesic distance for which several efficient algorithms exist (hennig:aistats:2014; arvanitidis:aistats:2019). Here we parameterize the geodesic as a cubic spline and perform direct energy minimization.
Since we use the heat kernel density function for the prior , we need the variational family to be defined with respect to the same Riemannian measure. We therefore also use the heat kernel density function for the variational family, which is parameterized by the encoder network with variational parameters . The parameter of the prior is learned through optimization. The ELBO can be derived with respect to the volume measure :
This ELBO can be estimated using Monte Carlo samples from the variational posterior. With no analytical solution to the KL divergence we resort to Monte Carlo integration:
where , .
Thus, the final form of the Monte Carlo evaluation of the KL divergence is:
In the previous section we mentioned that a Brownian motion (BM) on the manifold can be derived by projecting each BM step on the tangent space at . However we will take each step directly on the manifold and avoid having to evaluate the exponential map. Given a manifold with dimension , the immersion , a point and its image under , we take a random step from :
Applying a Taylor expansion we have:
With we have:
Thus an approximation to taking a step directly in the latent space is with and the pseudoinverse of . Since the step can be written:
We consider an isotropic heat kernel so in our case . Furthermore:
This implies that
Thus, to sample from the prior we simply need to run Brownian motion for :
We note that from a practical standpoint for small diffusion times the number of discretized steps can be small.
4 Meaningful variance estimation
We now turn to the problem of restricting our prior to sample from the image of our manifold in . Since typically the geometry of the data is not known a priori, we adopt the Bayesian approach and relate uncertainty estimation in the generative model to the geometry of the latent manifold. Specifically, since the generative model parameterizes we construct it such that the pull-back metric will acquire high values away from the data support and thereby restrict prior samples to high density regions of the latent manifold.
In Sec. 2.3 we described the metric tensor arising from the diagonal immersion . By the form of the metric, it is clear that both and contribute to the manifold geometry. In recent works (arvanitidis:iclr:2018; hauberg2018only; detlefsen2019reliable) it was shown that neural network variance estimates are typically poor in regions away from the training data, due to poor extrapolation properties. Thus, neural networks cannot be trusted to properly estimate the variance of the generative model “off-the-shelf” when the functional form of the immersion (and thus the geometry of the data) is not known a priori. By extension, this leads to poor estimates of latent manifold geometry and latent densities. arvanitidis:iclr:2018 propose to use a radial basis function (RBF) network (que:aistats:2016) to estimate precision, rather than variance. We adopt this approach due to its simplicity and relative numerical stability, however we note that similar approaches for principled variance estimation exist (detlefsen2019reliable).
The influence of the RBF network can be seen in Fig. 3
, where it is compared with a usual neural network variance estimate. Note that the metric creates “borders” demarcating the regions to which the latent codes have been mapped by the encoder. This makes interpolations and random walks generally follow the trend of the latent points instead of wondering off the support. Thus, this regularization scheme restricts prior sampling to such high density regions. A similar effect is not observed in the usual Gaussian VAE, where the prior samples from regions to which the variational posterior has not necessarily placed probability density(hoffman2016elbo; rosca2018distribution).
|d = 2||-1013.31||-957.67||55.63|
|d = 5||-1137.32||-1082.4||54.92|
|d = 10||-1250.5||-1170.06||80.44|
|d = 2||-1053.70||-1036.60||17.70|
|d = 5||-1177.86||-1145.84||32.02|
|d = 10||-1280.94||-1216.51||64.43|
|d = 2||0.94||0.95||0.88||0.67||0.55||0.42||0.86||0.68||0.61||0.53||0.72|
|d = 5||0.95||0.97||0.94||0.90||0.90||0.89||0.95||0.93||0.88||0.87||0.92|
|d = 10||0.98||0.99||0.97||0.94||0.96||0.95||0.98||0.97||0.93||0.94||0.96|
|d = 2||0.95||0.97||0.89||0.68||0.64||0.56||0.88||0.85||0.71||0.64||0.78|
|d = 5||0.95||0.98||0.94||0.91||0.94||0.88||0.95||0.93||0.90||0.89||0.93|
|d = 10||0.98||0.98||0.96||0.95||0.96||0.95||0.97||0.97||0.93||0.94||0.96|
Per digit and average F1 score for a classifier trained on the learned latent codes of VAE and-VAE. Results are averaged over 5 classifier training runs.
5.1 Generative modelling
For our first experiment we train a VAE with a Riemannian Brownian motion prior over the latent variables (denoted by -VAE) for different dimensions of the latent space and compare it to a Euclidean VAE with a standard Normal prior. Table 1 shows the results. -VAE achieves a better lower bound than its Euclidean counterpart in every dimensionality even with an aggressive KL annealing schedule in the Euclidean VAE. This is both due to the fact that -VAE achieves better reconstructions due to more expressive latent codes, and the fact that the prior is learned and thus, adapts to the latent codes, incurring smaller regularization penalties.
We next assess the usefulness of the latent representations of -VAE. Fig. 4 shows the latent code clusters. -VAE has produced more separable clusters in the latent space due to the prior adapting to the latent codes, which results in a less regularized clustering. We quantitatively measured the utility of the -VAE latent codes in different dimensionalities by training a classifier to predict digit labels and measuring the average overall and per-digit F1 score. Table 2 shows the results when comparing against the same classifier trained on latent codes derived by a VAE. It is clear that -VAE has a significant advantage in low dimensions. As dimensionality increases this advantage becomes non-existent. An explanation for this could be that in the case of a Euclidean VAE, the higher the dimensions, the harder it is for the KL regularization to be enforced once it is annealed and the model achieves good reconstructions. This can also be confirmed by higher values of the KL divergence in higher dimensions (see Table 1). As a result, the representations of a Euclidean VAE trained with KL annealing more closely resemble those of a deterministic autoencoder in higher dimensions.
5.3 Qualitative results
Finally we explore the geometric properties of a -VAE with a 2-dimensional latent space. Fig 4 shows the learned manifold. As in Fig. 3, the influence of the variance network on the metric can be seen in the “borders” surrounding the latent code support.
We begin by investigating the behavior of distances on the induced manifold. Fig. 5 shows the geodesic curves between two pairs of random points on the manifold, compared against their Euclidean counterpart. The geodesic interpolation is influenced by the metric tensor, which makes sure that shortest paths will generally avoid areas of low density. This can easily be seen in top left Fig. 5, where the geodesic curve follows a path along a high density region. Contrast this to the Euclidean straight line between the two points traversing a lower density region. Reconstructed images along the curves can be seen in the middle and bottom rows. Even in less apparent cases (top right Fig. 5), reconstructions of latent codes along geodesic curves generally provide smoother transitions between the curve endpoints as can be seen by comparing the middle right and bottom right sections of the figure.
Next, we investigate sampling from -VAE. In Sec. 4 we claimed that a Brownian motion prior coupled with the RBF regularization of the decoder variance network would yield samples that mostly avoid low density regions of the latent space. To empirically prove this, we executed two sets of multiple sampling runs on the latent manifold. In the first set we ran Brownian motion with the learned prior parameters. These runs and the resulting images are displayed in Fig. 6. The random walks generally stay within high density regions of the manifold. Cases where they explore low density regions do exist but they are rare. The samples generally seem clear although sometimes their quality drops, especially when the sampler is transitioning between classes, where variance estimates are higher. This could potentially be rectified with a less aggressive deterministic warm-up scheme, which would result in more concentrated densities with thinner tails, although between-class variance estimates would likely still be higher compared to within-class ones. For the second set of the sampling runs, we increased the duration of the Brownian motion. These runs are displayed along with the sampled images in Fig. 7. The influence of the variance estimates on the metric tensor is clearly shown here. As the sampler is moving farther away from the latent code support, evaluations of the metric tensor increase making these regions harder to traverse. As a result the random walk either oscillates with decreased speed and stops close to the boundary (as in Figures 6(a) and 6(b)) or returns to higher density regions of the manifold. This clearly shows that -VAE mostly avoids the manifold mismatch problem.
6 Related work
In recent literature many works have identified the adverse effects of the KL divergence regularization when the prior is chosen to be a standard Gaussian. As such, there have been many approaches of learning a more flexible prior. chen2016variational propose learning an autoregressive prior by applying an Inverse Autoregressive transformation (kingma2016improved) to a simple prior. nalisnick2016stick propose a non-parametric stick-breaking prior. (tomczak2017vae) propose learning the prior as a mixture of variational posteriors. More recently, bauer2018resampled present a rejection sampling approach with a learned acceptance function, while klushyn2019learning proposed a hierarchical prior through an alternative formulation of the objective.
Non-Euclidean latent space.
arvanitidis:iclr:2018 was one of the first to analyze the latent space of a VAE from a non-Euclidean perspective. This work was inspired by Tosi:UAI:2014 that studied the Riemannian geometry of the Gaussian process latent variable model (gplvm). arvanitidis:iclr:2018 train a Euclidean VAE and fit a latent Riemannian LAND distribution (arvanitidis2016locally) and show that this view of the latent space leads to more accurate statistical estimates, as well as better sample quality.
Since then, a number of other works have appeared in literature that propose learning non-Euclidean latent manifolds. xu2018spherical and davidson2018hyperspherical learn a VAE with a von Mises-Fisher latent distribution, which samples codes on the unit hypersphere. Similarly, mathieu2019continuous and nagano2019differentiable extend VAEs to hyperbolic spaces. mathieu2019continuous
assume a Poincaré ball model as a latent space and present 2 generalizations of the Euclidean Gaussian distribution - a wrapped Normal and the Riemannian Normal distributions, of which only the latter is a maximum entropy generalization. In practice, they perform similarly.nagano2019differentiable assume a Lorentz hyperbolic model as a latent space and also present a wrapped Normal generalization of the Gaussian. While these works have correctly identified the problem of the standard Gaussian not being a truly uninformative prior, due to the Euclidean assumption, they have proposed approaches which are designed for observations with known geometries. Most of the time, however, this information is not available and a more general framework for learning geometrically informed VAEs is needed. In response to this, skopek2019mixed propose VAEs with the latent space modelled as a product of constant cuvature manifolds, where each component curvature is learned. While more general than a model with a fixed curvature latent manifold, this framework still requires the specification of number of component manifolds along with the sign of their respective curvature. Finally, similar to our approach, li2019variational and rey2019diffusion both propose the heat kernel as a variational family representing a Brownian motion process on a Riemannian manifold. They test their approaches on a priori chosen manifolds.
In this paper we presented VAEs with Riemannian manifolds as latent spaces and proposed a Riemannian generalization of the Gaussian along with an efficient sampling scheme. We show that the pull-back metric informs distances in the latent space, remaining invariant to reparameterizations. We further make explicit the relationship between uncertainty estimation and proper latent geometry and qualitatively show that geometrically informed priors avoid manifold mismatch by drawing samples from the image of the manifold in the latent space. Quantitatively, we show that our approach outperforms Euclidean VAEs both in an unsupervised learning task and a classification task, especially in low latent space dimensions.
Appendix A On neural network-based immersions
For the decoder map 2.3 to be a valid immersion, its differential needs to be injective for all as stated in definition 2. The differential of is represented by its Jacobian matrix and for it to be injective for all , it needs to be full rank. This is ensured if for the MLPs representing the decoder and the following are true:
Each hidden layer in the network has an equal or greater number of units to the previous layer ().
All weight matrices in the network are full rank.
The activation functions are continuously differentiable (at least) and strictly monotone.
Then, in our experiments, we ensure the above by opting for the same number of units in each hidden layer of the network and ELU non-linearities. All weight matrices are initialized uniformly (DBLP:conf/iccv/HeZRS15) which practically has zero probability of yielding low rank weight matrices. While, in theory, this could change via the gradient updates of the weights, this would immediately break experiments because of numerical instabilities, which we did not observe.
Appendix B Geodesic estimation
We estimate geodesic distances by minimizing curve energy. In detail, we represent the geodesic curve with a cubic spline with randomly initialized parameters. These parameters are then optimized via gradient descent by minimizing the curve energy:
where is the geodesic curve, is the first derivative of the curve, i.e. its velocity vector and is the matrix representation of the metric tensor evaluated at the curve points. The integral B.1
is computed by numerical approximation, where the partition of the interval can be chosen as a hyperparameter.
Appendix C Experimental setup
The architectures of all model variants are shown below in Tables 3 and 4. The encoder mean and variance, as well as the decoder mean are modelled by 2-layer MLPs as shown below. For a fair comparison, both -VAE and VAE share the same RBF architecture for the decoder precision with the number of the RBF centers set to 350 and the bandwidth set to 0.01 in all cases. Tables 3 and 4 summarize the architectures, listing the activation function for each layer with the units corresponding to each layer in parentheses.
|Network||Layer 1||Layer 2||Output|
|ELU (300)||ELU (300)||Linear (dim())|
|ELU (300)||ELU (300)||Softplus (dim())|
|Network||Layer 1||Layer 2||Output|
|ELU (300)||ELU (300)||Linear (dim())|
|RBF ()||Linear* (dim())||Identity (dim())|
c.1 Section 5.1 experiment
detlefsen2019reliable highlighted the importance of optimizing the mean and variance components separately, when training VAEs with Gaussian generative models. Following this paradigm, in all our experiments we first optimize the encoder components ( and ) along with the decoder . Then, keeping these fixed, we optimize the decoder
. All models were trained for 300 epochs. More specifically, the-VAE was trained as an autoencoder (optimizing only the encoder and and the decoder ) for the first 200 epochs and for the remaining 100 epochs the latent prior and the decoder were optimized. Similarly for a VAE, it was deterministically warmed up for 200 epochs and for the remaining 100 epochs, the decoder was optimized. All experiments were run with the Adam optimizer (DBLP:journals/corr/KingmaB14) with default parameter settings and a fixed learning rate of . The batch size was 100 for all models.
c.2 Section 5.2 experiment
The classifier used on this section was a single, 100-unit layer MLP with ReLU non-linearities, trained for 100 epochs with the Adam optimizer with default parameter settings and a learning rate of. The batch size was set at 64. The architectures of the models giving rise to the latent representations are as in the previous section.