Increasing Expressivity of a Hyperspherical VAE

by Tim R. Davidson, et al.
University of Amsterdam

Learning suitable latent representations for observed, high-dimensional data is an important research topic underlying many recent advances in machine learning. While traditionally the Gaussian distribution has been the go-to latent parameterization, a variety of recent works have successfully proposed the use of manifold-valued latents. In one such work, Davidson et al. (2018) empirically show the potential benefits of using a hyperspherical von Mises-Fisher (vMF) distribution in low dimensions. However, due to the unique distributional form of the vMF, expressivity in higher-dimensional space is limited by its scalar concentration parameter, leading to a 'hyperspherical bottleneck'. In this work we propose to extend the usability of hyperspherical parameterizations to higher dimensions by using a product-space of hyperspheres instead, showing improved results on a selection of image datasets.



1 Introduction

Following the manifold hypothesis, in unsupervised generative learning we strive to recover a distribution on a (low-dimensional) latent manifold, capable of explaining observed, high-dimensional data, e.g. images. One of the most popular frameworks to achieve this goal is the Variational Auto-Encoder (VAE) (kingma-vae13; pmlr-v32-rezende14), a latent variable model which combines variational inference and auto-encoding to directly optimize the parameters of some latent distribution. While originally restricted to 'flat' space using the classic Gaussian distribution, there has recently been a surge in research extending the VAE to distributions defined on manifolds with non-trivial topologies (s-vae; falorsi2018explorations; mathieu2019hierarchical; hyperbolic-exp-map-reparam19; diffusion-vae19; implicit_reparam_figurnov; relie-aistats19). This is fruitful, as most data is not best represented by distributions on flat space, which can lead to undesired 'manifold mismatch' behavior.

In (s-vae), the authors propose a hyperspherical parameterization of the VAE using a von Mises-Fisher (vMF) distribution, demonstrating improved results over the especially poor pairing of the 'blob-like' Gaussian distribution with hyperspherical data. Surprisingly, they further show that these positive results extend to datasets without a clear hyperspherical interpretation, which they mostly attribute to the restricted surface area and the absence of a 'mean-biased' prior in the vMF, as the uniform distribution is feasible in the compact, hyperspherical space. However, as dimensionality increases, performance begins to decrease. This can be explained by taking a closer look at the vMF's functional form

$q(z; \mu, \kappa) = \mathcal{C}_m(\kappa) \exp(\kappa \mu^\top z), \qquad \mathcal{C}_m(\kappa) = \frac{\kappa^{m/2-1}}{(2\pi)^{m/2} I_{m/2-1}(\kappa)},$

where $\mathcal{C}_m(\kappa)$, a scalar, is the normalizing constant, and $I_\nu$ denotes the modified Bessel function of the first kind at order $\nu$. Note that the single scalar concentration parameter $\kappa$ is shared across all dimensions, severely limiting the distribution's expressiveness as dimensionality increases.
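The role of the single scalar $\kappa$ can be made concrete with a short numerical sketch (an illustration of the density above, not the paper's code; the function names are ours), computing the vMF log-density via scipy's exponentially scaled Bessel function for numerical stability:

```python
import numpy as np
from scipy.special import ive  # ive(v, x) = I_v(x) * exp(-x)

def vmf_log_normalizer(m, kappa):
    """log C_m(kappa) for the vMF on S^{m-1}, computed stably via ive."""
    v = m / 2 - 1
    # log I_v(kappa) = log(ive(v, kappa)) + kappa  (undo the exponential scaling)
    log_bessel = np.log(ive(v, kappa)) + kappa
    return v * np.log(kappa) - (m / 2) * np.log(2 * np.pi) - log_bessel

def vmf_log_density(z, mu, kappa):
    """log q(z; mu, kappa) = log C_m(kappa) + kappa * mu^T z, for unit vectors z, mu.
    Note: one scalar kappa governs concentration in every direction on S^{m-1}."""
    m = len(mu)
    return vmf_log_normalizer(m, kappa) + kappa * np.dot(mu, z)
```

For $m = 2$ this reduces to the von Mises distribution on the circle, whose normalizer is $1 / (2\pi I_0(\kappa))$, which gives a quick sanity check of the formula.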

2 Method: A Hyperspherical Product-Space

To improve on the vMF's per-dimension concentration flexibility limitation we propose a simple idea: breaking up the single latent hypersphere into a concatenation of multiple independent hyperspherical distributions. Such a compositional construction increases flexibility through the addition of a new concentration parameter for each hypersphere, as well as providing the possibility of sub-structure forming. Given a hyperspherical random variable $z \in \mathcal{S}^{m-1}$, we want to choose $z_1, \dots, z_n$ in $\mathcal{S}^{k_1}, \dots, \mathcal{S}^{k_n}$ respectively s.t. $\sum_{i=1}^{n}(k_i + 1) = m$, and $z = z_1 \oplus \dots \oplus z_n$, where $\oplus$ denotes concatenation. The probabilistic model becomes:

$p(x, z_1, \dots, z_n) = p(x \mid z_1, \dots, z_n)\, p(z_1, \dots, z_n) \overset{(*)}{=} p(x \mid z_1, \dots, z_n) \prod_{i=1}^{n} p(z_i),$

which factorizes in (*) if we assume independence between the new sub-structures. Assuming conditional independence of the approximate posterior as well, i.e. $q(z_1, \dots, z_n \mid x) = \prod_{i=1}^{n} q(z_i \mid x)$, it can be shown (see Appendix A for the derivation) that the Kullback-Leibler divergence simplifies as

$\mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big) = \sum_{i=1}^{n} \mathrm{KL}\big(q(z_i \mid x) \,\|\, p(z_i)\big).$
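A minimal sketch of this construction (illustrative only; function names are ours, and the per-component KL terms are left abstract): each sub-vector is projected onto its own hypersphere, the latent is their concatenation, and the total KL is a plain sum over components.

```python
import numpy as np

def product_space_latent(raw, composition):
    """Split a raw vector into chunks of size k_i + 1, project each onto
    S^{k_i}, and return their concatenation z = z_1 (+) ... (+) z_n."""
    chunks, start = [], 0
    for k in composition:
        c = raw[start:start + k + 1]
        chunks.append(c / np.linalg.norm(c))  # project onto the unit sphere
        start += k + 1
    assert start == len(raw), "composition must account for every ambient dimension"
    return np.concatenate(chunks)

def product_space_kl(per_component_kls):
    """Under the factorized posterior and prior, the total KL is a sum."""
    return sum(per_component_kls)
```

In a VAE this normalization step would sit between the encoder output and the reparameterized vMF sampling for each shell; the decoder then simply consumes the concatenated vector.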

Flexibility Trade-Off

Given a single hypersphere and keeping the ambient space fixed, each additional 'break' exchanges a degree of freedom for a concentration parameter. In the base case of $\mathcal{S}^k \subset \mathbb{R}^{k+1}$, we can potentially support $k$ 'independent' feature dimensions, which must share a single concentration parameter $\kappa$, and hence are globally restricted in their flexibility per dimension. On the other hand, the moment we break $\mathcal{S}^k$ up into the Cartesian product $\mathcal{S}^{a} \times \mathcal{S}^{b}$ with $a + b = k - 1$, we 'lose' an independent dimension (or degree of freedom), but in exchange the two resulting sub-hyperspheres each share their concentration parameter over fewer dimensions, increasing flexibility. (In the most extreme case, this leads to a latent space of $\mathcal{S}^1 \times \dots \times \mathcal{S}^1$, which is equal to the $n$-torus.)

The reason a vMF is uniquely suited for such a decomposition, as opposed to a Gaussian, is that, assuming a factorized variance, the Gaussian distribution is already equipped with a concentration parameter for each dimension, so breaking it up yields no additional parameters. In the case of the vMF, however, which has only a single concentration parameter $\kappa$ for all dimensions, we do gain flexibility. This is an important distinction: while in both cases all dimensions are implicitly connected through the shared loss objective, in the case of the vMF this connection is amplified through the direct connection of the shared concentration parameter.
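The bookkeeping behind this trade-off is simple enough to state as code (a sketch with a function name of our own choosing): for a composition $[k_1, \dots, k_n]$ the ambient dimensionality is $\sum_i (k_i + 1)$, the degrees of freedom $\sum_i k_i$, and the number of concentration parameters $n$, so at fixed ambient size every extra break trades exactly one degree of freedom for one $\kappa$.

```python
def flexibility_tradeoff(composition):
    """Return (ambient_dim, degrees_of_freedom, num_kappas) for a
    hyperspherical product-space S^{k_1} x ... x S^{k_n}."""
    ambient = sum(k + 1 for k in composition)  # each S^k lives in R^{k+1}
    dof = sum(composition)                     # each S^k has k degrees of freedom
    return ambient, dof, len(composition)
```

For example, a single $\mathcal{S}^{40}$ gives 41 ambient dimensions, 40 degrees of freedom, and one $\kappa$; splitting it into $\mathcal{S}^{20} \times \mathcal{S}^{19}$ keeps the ambient space at 41 but drops to 39 degrees of freedom with two concentration parameters.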

Related Work

The work closest to our model is that of (paquet2018factorial), where a Cartesian product of Gaussian Mixture Models (GMMs) is proposed, with hyperpriors on all separate components to create a fully data-inferred model. They use results from (hoffman2013stochastic; johnson2016composing) on structured VAEs, and extend the work on VAEs with single GMMs of (nalisnick2016approximate; dilokthanakul2016deep; jiang2017variational). Partially following motivations similar to ours, the authors hypothesize and empirically show that the structured compositionality encourages disentanglement. By working with GMMs instead of single Gaussians, they circumvent the factorized single-Gaussian break-up limitation described before. Another recent work proposing to break up a large, single latent representation into a composition of sub-structures, in the context of Bayesian optimization, is (combo-oh19).

3 Experiments and Discussion

To test the ability of a hyperspherical product-space model to improve performance over its single-shell counterpart, we perform product-space interpolations, breaking up a single shell into an increasing number of independent components.

Experimental Setup

We conduct experiments on Static MNIST, Omniglot (lake2015human), and Caltech 101 Silhouettes (marlin2010inductive), mostly following the experimental setup of (s-vae), using a simple MLP encoder-decoder architecture with ReLU activations between layers. We train for 300 epochs using early stopping with a look-ahead of 50 epochs, and a linear warm-up scheme of 100 epochs as per (bowman2015; sonderby2016ladder), during which the KL divergence is annealed from 0 to its final weight (higgins2017beta; alemi2018fixing). Marginal log-likelihood is estimated using importance sampling with 500 sample points per datapoint (burda2015importance), reporting the mean over three random seeds.
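The importance-sampling estimator referred to here is the standard one: $\log p(x) \approx \mathrm{logsumexp}_s\,[\log p(x, z_s) - \log q(z_s \mid x)] - \log S$. As a self-contained sanity check (our own toy Gaussian model with a tractable marginal, not the paper's setup):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def is_log_marginal(x, S=500, rng=None):
    """Estimate log p(x) for the toy model z ~ N(0,1), x|z ~ N(z,1),
    using the proposal q(z|x) = N(x/2, 1) and S importance samples."""
    rng = rng or np.random.default_rng(0)
    z = rng.normal(loc=x / 2, scale=1.0, size=S)
    log_w = (norm.logpdf(z, 0, 1)          # prior       log p(z)
             + norm.logpdf(x, z, 1)        # likelihood  log p(x|z)
             - norm.logpdf(z, x / 2, 1))   # proposal    log q(z|x)
    return logsumexp(log_w) - np.log(S)
```

Since the true marginal of this toy model is $N(0, 2)$, the 500-sample estimate lands close to `norm.logpdf(x, 0, sqrt(2))`, which is exactly the kind of check that motivates reporting IS-estimated log-likelihoods.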

Keeping in mind the flexibility trade-off consideration, we analyze both the effects of keeping the total degrees of freedom fixed (increasing the ambient space dimensionality) and of keeping the ambient space fixed (decreasing the degrees of freedom). We break up $\mathcal{S}^{40}$ respectively into 2, 4, 6, 10, and 40 sub-spaces. In each break-up, we try a balanced, leveled, and unbalanced hyperspherical composition.


Dataset         m    n    LL        ELBO      LL*       ELBO*
Static MNIST    41   4    -92.65    -98.23    -96.32    -104.11
                41   4    -92.59    -98.27
                41   6    -92.25    -98.10
                41   6    -92.71    -98.46
Caltech         41   4    -139.30   -151.67   -143.49   -152.25
                41   4    -140.64   -153.05
Omniglot        41   4    -112.79   -119.17   -113.83   -120.48
                41   6    -112.58   -118.49
                41   10   -112.64   -118.67
Table 1: Overview of best results of various $\mathcal{S}^{40}$ product-space ambient-dimensionality interpolations compared to the best single $\mathcal{S}^m$-VAE, indicated (*). LL represents the estimated marginal log-likelihood, ELBO the evidence lower bound, m the ambient space dimensionality, and n the number of concentration parameters, i.e. breaks.

A summary of best results for fixed ambient space is shown in Table 1, with a summary of best results for fixed degrees of freedom and complete interpolations in Appendix B. Initial inspection shows that partially breaking up a single $\mathcal{S}^{40}$ hypersphere into a hyperspherical product-space indeed allows us to improve performance on all examined datasets. Diving deeper into the results, we find that both the number of breaks and the dimensional composition of those breaks strongly inform performance and learning stability.

A high number of breaks appears to negatively influence both performance and learning stability. Indeed, for most datasets the 'torus' setting, i.e. full factorization into $\mathcal{S}^{1}$ components, proved too unstable to train to convergence. One explanation for this result could be the fact that we omit the REINFORCE part of the vMF reparameterization during training (see (s-vae), Appendix D, for more details). While of only very limited influence for a single hyperspherical distribution, the accumulated bias across many shells might lead to a non-trivial effect. On the other hand, adding as few as four breaks extends the model's expressivity enough to consistently outperform a single shell.

Balance of the sub-space composition plays a key role as well. We find that when the sub-spaces are too unbalanced, e.g. one large $\mathcal{S}^{37}$ shell next to a few $\mathcal{S}^{1}$ shells, the network starts to 'choose' between sub-space channels. Effectively, it will for example start encoding all information in the $\mathcal{S}^{1}$ shells and completely ignore the $\mathcal{S}^{37}$ shell, leading to an effective latent space of 3 degrees of freedom (for a more extended discussion on the interplay between balance and the KL divergence, see Appendix B); see for example Fig. 2(a). On the contrary, better-balanced compositions appear capable of cleanly separating semantically meaningful features across shells, as displayed in Fig. 2(b).

Conclusion and Future Work

In summary, breaking up a single hypersphere into multiple components effectively increases concentration expressiveness, leading to more stable training and improved results. In future work we would like to investigate the possibility of learning an optimal break-up, as opposed to fixing it a priori, as well as mixing sub-spaces with different topologies.


We would like to thank Luca Falorsi and Nicola De Cao for insightful discussions during the development of this work.


Appendix A Dimensionality Decomposition

Given a latent variable $z \in \mathcal{S}^{m-1}$, we choose $z_1, \dots, z_n$ in $\mathcal{S}^{k_1}, \dots, \mathcal{S}^{k_n}$ respectively s.t. $\sum_{i=1}^{n}(k_i + 1) = m$, and $z = z_1 \oplus \dots \oplus z_n$, where $\oplus$ denotes concatenation. The probabilistic model becomes:

$p(x, z_1, \dots, z_n) = p(x \mid z_1, \dots, z_n)\, p(z_1, \dots, z_n) \overset{(*)}{=} p(x \mid z_1, \dots, z_n) \prod_{i=1}^{n} p(z_i),$

which factorizes in (*) if we assume independence. Assuming conditional independence of the approximate posterior as well, i.e. $q(z_1, \dots, z_n \mid x) = \prod_{i=1}^{n} q(z_i \mid x)$, the Kullback-Leibler divergence simplifies as

$\mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big) = \mathbb{E}_q\!\left[\sum_{i=1}^{n} \big(\log q(z_i \mid x) - \log p(z_i)\big)\right] = \sum_{i=1}^{n} \mathrm{KL}\big(q(z_i \mid x) \,\|\, p(z_i)\big),$

where the last step follows because the $i$-th summand depends only on $z_i$.
Appendix B Supplementary Tables and Figures

Figure 1: Value of the von Mises-Fisher Kullback-Leibler divergence, varying the concentration parameter $\kappa$ on the y-axis and the dimensionality $m$ on the x-axis. (Best viewed in color)

Another way of understanding the importance of balance is by examining the KL divergence form of the vMF and its influence on the loss objective: in order to achieve high-quality reconstruction performance, it is necessary for the concentration parameter $\kappa$ to concentrate, i.e. take on a high value. Given the uniform prior setting, i.e. $p(z) = U(\mathcal{S}^{m-1})$, this logically leads to an increase in the KL divergence. The crucial observation, however, is that the strength of the KL divergence is also strongly dependent on the dimensionality, as can be observed in Fig. 1. Hence, during learning over a product-space containing several lower-dimensional components and a single high-dimensional component, if the reconstruction error can be made sufficiently low using the lower-dimensional components, the optimal loss-minimization strategy is to set the concentration parameter of the largest component to 0, effectively ignoring it. A possible strategy to prevent this from happening could be to set separate $\beta$ parameters for each hyperspherical component; however, we fear that this would quickly blow up the hyperparameter search-space.
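The dependence plotted in Fig. 1 can be reproduced numerically. Against a uniform prior, the vMF KL divergence has the closed form $\mathrm{KL} = \kappa\, I_{m/2}(\kappa)/I_{m/2-1}(\kappa) + \log \mathcal{C}_m(\kappa) + \log\!\big(2\pi^{m/2}/\Gamma(m/2)\big)$, derived in (s-vae); a short sketch (function name ours):

```python
import numpy as np
from scipy.special import ive, gammaln  # ive(v, x) = I_v(x) * exp(-x)

def vmf_kl_to_uniform(m, kappa):
    """KL( vMF(mu, kappa) || Uniform(S^{m-1}) ); independent of mu by symmetry."""
    v = m / 2 - 1
    # E[mu^T z] = I_{m/2}(kappa) / I_{m/2-1}(kappa); the exp(-kappa) scalings cancel
    bessel_ratio = ive(v + 1, kappa) / ive(v, kappa)
    # log C_m(kappa), with log I_v(kappa) = log(ive(v, kappa)) + kappa
    log_norm = v * np.log(kappa) - (m / 2) * np.log(2 * np.pi) \
               - (np.log(ive(v, kappa)) + kappa)
    # log surface area of S^{m-1}: 2 * pi^{m/2} / Gamma(m/2)
    log_area = np.log(2) + (m / 2) * np.log(np.pi) - gammaln(m / 2)
    return kappa * bessel_ratio + log_norm + log_area
```

As $\kappa \to 0$ the vMF approaches the uniform prior and the KL vanishes, while for a fixed $\kappa$ the penalty varies strongly with $m$, which is precisely what makes ignoring the largest shell the cheapest strategy.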

B.1 Fixed Ambient Space

m    n    LL        ELBO      RE       KL
41   2    -93.18    -98.72    69.78    28.94
41   2    -95.69    -103.67   71.67    32.01
41   4    -92.65    -98.23    70.55    27.68
41   4    -92.59    -98.27    71.33    26.94
41   4    -108.42   -116.86   99.62    17.23
41   6    -92.25    -98.10    69.78    28.32
41   6    -93.86    -100.99   69.46    31.54
41   6    -92.71    -98.46    70.97    27.48
41   10   -92.93    -99.07    70.67    28.41
41   10   -93.45    -100.29   68.75    31.54
41   10   -93.36    -99.40    71.93    27.47
Table 2: Summary of results of $\mathcal{S}^{40}$ ambient interpolations for the unsupervised model on Static MNIST. Here n indicates the number of breaks; RE and KL correspond respectively to the reconstruction and the KL parts of the ELBO.

m    n    LL        ELBO      RE       KL
41   2    -142.43   -155.24   123.35   31.89
41   2    -147.41   -166.64   130.41   36.22
41   4    -139.30   -151.67   120.82   30.85
41   4    -140.64   -153.05   123.23   29.82
41   4    -168.25   -186.47   170.44   16.03
41   6    -142.84   -156.59   126.59   30.00
41   6    -169.15   -177.23   161.68   15.55
41   6    -139.99   -152.68   121.91   30.77
41   10   -144.73   -159.27   126.14   33.13
41   10   -154.91   -164.90   140.06   24.83
41   10   -144.72   -160.13   126.34   33.79
Table 3: Summary of results of $\mathcal{S}^{40}$ ambient interpolations for the unsupervised model on Caltech.

m    n    LL        ELBO      RE       KL
41   2    -114.32   -120.72   92.10    28.62
41   2    -115.19   -122.30   91.82    30.48
41   4    -113.29   -118.97   88.93    30.05
41   4    -112.79   -119.17   87.94    31.23
41   4    -136.39   -142.03   132.75   9.28
41   6    -114.07   -119.99   91.26    28.72
41   6    -131.55   -137.29   124.66   12.62
41   6    -112.58   -118.49   88.27    30.23
41   10   -113.53   -119.83   90.00    29.83
41   10   -114.95   -121.42   92.42    29.00
41   10   -112.64   -118.67   88.98    29.68
Table 4: Summary of results of $\mathcal{S}^{40}$ ambient interpolations for the unsupervised model on Omniglot.

B.2 Fixed Degrees of Freedom

Dataset         m    n    LL        ELBO      LL*       ELBO*
Static MNIST    44   4    -92.62    -98.26    -96.32    -104.11
                46   6    -92.59    -98.46
                46   6    -92.50    -98.28
                50   10   -92.57    -98.81
Caltech         44   4    -137.95   -150.86   -143.49   -152.25
                46   6    -139.84   -152.92
Omniglot        44   4    -112.28   -118.21   -113.83   -120.48
                46   6    -112.78   -118.84
                50   10   -112.61   -118.70
Table 5: Overview of best results (mean over 3 runs) of $\mathcal{S}^{40}$ product-space interpolations compared to the best single $\mathcal{S}^m$-VAE, indicated (*). Here m indicates the ambient space dimensionality and n the number of concentration parameters, i.e. breaks.

B.3 Ignored and Disentangled Shells

(a) Ignored Sub-space
(b) Thick to Thin
Figure 2: $\mathcal{S}^1$ interpolations of a broken-up $\mathcal{S}^9$. Top: an example of an 'ignored' sub-space, leading to little to no semantic change when decoded. Bottom: a semantically meaningful sub-space that consistently changes the stroke thickness.