1 Introduction
Following the manifold hypothesis, in unsupervised generative learning we strive to recover a distribution on a (low-dimensional) latent manifold capable of explaining observed, high-dimensional data, e.g. images. One of the most popular frameworks to achieve this goal is the Variational Auto-Encoder (VAE) [13, 22], a latent variable model which combines variational inference and auto-encoding to directly optimize the parameters of some latent distribution. While originally restricted to 'flat' space using the classic Gaussian distribution, there has recently been a surge in research extending the VAE to distributions defined on manifolds with non-trivial topologies [4, 6, 16, 17, 21, 8, 7]. This is fruitful, as most data is not best represented by distributions on flat space, which can lead to undesired 'manifold mismatch' behavior.
In [4], the authors propose a hyperspherical parameterization of the VAE using a von Mises-Fisher (vMF) distribution, demonstrating improved results over the especially bad pairing of the 'blob-like' Gaussian distribution with hyperspherical data. Surprisingly, they further show that these positive results extend to datasets without a clear hyperspherical interpretation, which they mostly attribute to the restricted surface area and the absence of a 'mean-biased' prior, as the uniform distribution is feasible in the compact hyperspherical space. However, as dimensionality increases, performance begins to decrease. This can possibly be explained by taking a closer look at the vMF's functional form
(1)   q(z; \mu, \kappa) = \mathcal{C}_m(\kappa) \exp\left(\kappa \mu^\top z\right),
(2)   \mathcal{C}_m(\kappa) = \frac{\kappa^{m/2-1}}{(2\pi)^{m/2}\, I_{m/2-1}(\kappa)},
where z, \mu \in S^{m-1}, \mathcal{C}_m(\kappa), a scalar, is the normalizing constant, and I_v denotes the modified Bessel function of the first kind at order v. Note that the scalar concentration parameter \kappa is fixed across all dimensions, severely limiting the distribution's expressiveness as dimensionality increases.
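As a concrete numerical sketch (ours, not the paper's implementation), the density in Eqs. (1)-(2) can be evaluated with SciPy's Bessel functions; using the exponentially scaled `ive` keeps the computation stable for large \kappa. The function name and interface are our own illustration:

```python
import numpy as np
from scipy.special import ive  # exponentially scaled Bessel: ive(v, k) = I_v(k) * exp(-k)

def vmf_log_density(z, mu, kappa):
    """Log-density of vMF(mu, kappa) on S^{m-1}, the unit sphere in R^m.

    Requires kappa > 0. Uses log I_v(k) = log ive(v, k) + k for stability.
    """
    m = mu.shape[-1]
    v = m / 2.0 - 1.0
    log_iv = np.log(ive(v, kappa)) + kappa
    # log C_m(kappa) = (m/2 - 1) log kappa - (m/2) log(2 pi) - log I_{m/2-1}(kappa)
    log_norm = v * np.log(kappa) - (m / 2.0) * np.log(2.0 * np.pi) - log_iv
    return log_norm + kappa * np.dot(mu, z)

# On S^1 the vMF reduces to the von Mises distribution on the circle.
mu = np.array([1.0, 0.0])
z = np.array([0.0, 1.0])
print(vmf_log_density(z, mu, kappa=1.0))
```

Note the single scalar `kappa`: the same concentration is applied along every direction orthogonal to `mu`, which is exactly the limitation discussed above.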
2 Method: A Hyperspherical Product-Space
To improve on the vMF's per-dimension concentration flexibility limitation we propose a simple idea: breaking up the single latent hypersphere into a concatenation of multiple independent hyperspherical distributions. Such a compositional construction increases flexibility through the addition of a new concentration parameter for each hypersphere, as well as providing the possibility of substructures forming. Given a hyperspherical random variable z \in S^{d-1}, we want to choose m and d_1, \ldots, d_m s.t. \sum_{i=1}^{m} d_i = d, and z = z_1 \oplus \cdots \oplus z_m with z_i \in S^{d_i - 1}, where \oplus denotes concatenation. The probabilistic model becomes:
(3)   p(x, z) = p(x \mid z)\, p(z) \overset{(*)}{=} p(x \mid z) \prod_{i=1}^{m} p(z_i),
which factorizes in (*) if we assume independence between the new substructures. Assuming conditional independence of the approximate posterior as well, i.e. q(z \mid x) = \prod_{i=1}^{m} q(z_i \mid x), it can be shown (see Appendix A for the derivation) that the Kullback-Leibler divergence simplifies as
(4)   \operatorname{KL}\big(q(z \mid x) \,\|\, p(z)\big) = \sum_{i=1}^{m} \operatorname{KL}\big(q(z_i \mid x) \,\|\, p(z_i)\big).
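As an illustration of the construction, a product-space latent can be parameterized by splitting a flat encoder output into blocks and L2-normalizing each block to obtain the mean directions \mu_i (a scalar \kappa_i per block would be predicted separately). This is a minimal sketch with hypothetical names, not the paper's code:

```python
import numpy as np

def split_and_project(h, dims):
    """Map a flat vector h in R^{sum(d_i + 1)} onto S^{d_1} x ... x S^{d_m}.

    Each S^{d_i} lives in R^{d_i + 1}; we slice h into m blocks and
    L2-normalize each block, yielding one unit vector (direction) per
    sub-hypersphere. The concatenation of these unit vectors is the
    product-space point z = z_1 (+) ... (+) z_m.
    """
    out, start = [], 0
    for d in dims:
        block = h[start:start + d + 1]
        out.append(block / np.linalg.norm(block))
        start += d + 1
    return np.concatenate(out)

# Hypothetical composition S^20 x S^9 x S^9 in a 41-dimensional ambient space.
z = split_and_project(np.random.randn(41), [20, 9, 9])
```

Because the blocks are normalized independently, each sub-hypersphere gets its own direction and (in the full model) its own concentration, which is the flexibility gain argued for above.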
Flexibility Trade-Off
Given a single hypersphere and keeping the ambient space fixed, each additional 'break' exchanges a degree of freedom for a concentration parameter. In the base case of S^{k}, we can potentially support k 'independent' feature dimensions, which must share a single concentration parameter and hence are globally restricted in their flexibility per dimension. On the other hand, the moment we break S^{k} up into the Cartesian product S^{k_1} \times S^{k_2} with k_1 + k_2 = k - 1, we 'lose' an independent dimension (or degree of freedom), but in exchange the two resulting sub-hyperspheres each share their concentration parameter over fewer dimensions, increasing flexibility. (In the most extreme case, this leads to a latent space of S^{1} \times \cdots \times S^{1}, which is equal to a torus.)
The reason a vMF is uniquely suited for such a decomposition, as opposed to a Gaussian, is that, assuming a factorized variance, the Gaussian distribution is already equipped with a concentration parameter for each dimension. In the case of the vMF, however, which has only a single concentration parameter for all dimensions, we gain flexibility. This is an important distinction: while all dimensions are implicitly connected through the shared loss objective in both cases, in the case of the vMF this connection is amplified through the direct coupling of the shared concentration parameter.
Related Work
The work closest to our model is that of [20], where a Cartesian product of Gaussian Mixture Models (GMMs) is proposed, with hyperpriors on all separate components to create a fully data-inferred model. They use results from [10, 12] on structured VAEs, and extend the work on VAEs with single GMMs of [18, 5, 11]. Partially following motivations similar to ours, the authors hypothesize and empirically show that the structured compositionality encourages disentanglement. By working with GMMs instead of single Gaussians, they circumvent the factorized single-Gaussian break-up limitation described above. Another recent work proposing to break up a large, single latent representation into a composition of substructures, in the context of Bayesian optimization, is [19].
3 Experiments and Discussion
To test the ability of a hyperspherical product-space model to improve performance over its single-shell counterpart, we perform product-space interpolations, breaking up a single shell into an increasing number of independent components.
Experimental Setup
We conduct experiments on Static MNIST, Omniglot [14], and Caltech 101 Silhouettes [15], mostly following the experimental setup of [4], using a simple MLP encoder-decoder architecture with ReLU activations between layers. We train for 300 epochs using early stopping with a look-ahead of 50 epochs, and a linear warm-up scheme of 100 epochs as per [2, 23], during which the weight on the KL divergence is annealed from 0 to its final value [9, 1]. The marginal log-likelihood is estimated using importance sampling with 500 sample points per datapoint [3], reporting the mean over three random seeds.
Keeping in mind the flexibility trade-off consideration, we analyze both the effects of keeping the total degrees of freedom fixed (increasing the ambient space dimensionality), as well as the case of keeping the ambient space fixed (decreasing the degrees of freedom). We break up S^{40} into 2, 4, 6, 10, and 40 subspaces, respectively. In each break-up, we try a balanced, leveled, and unbalanced hyperspherical composition.
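The dimension bookkeeping behind the two interpolation regimes can be sketched as follows; the helper and the example compositions are hypothetical, chosen only to match the ambient-space and degree-of-freedom counts described above:

```python
def product_space_stats(dims):
    """For a product S^{d_1} x ... x S^{d_m} given as dims = [d_1, ..., d_m],
    return (ambient dimension, degrees of freedom, number of kappa parameters)."""
    ambient = sum(d + 1 for d in dims)  # each S^{d_i} is embedded in R^{d_i + 1}
    dof = sum(dims)                     # d_i free dimensions per sub-hypersphere
    return ambient, dof, len(dims)

# Fixed ambient space (41): every extra break trades one degree of
# freedom for one extra concentration parameter.
assert product_space_stats([40]) == (41, 40, 1)           # single shell S^40
assert product_space_stats([19, 20]) == (41, 39, 2)       # hypothetical 2-way break
assert product_space_stats([9, 9, 9, 10]) == (41, 37, 4)  # hypothetical 4-way break

# Fixed degrees of freedom (40): the ambient dimension grows with m instead.
assert product_space_stats([10, 10, 10, 10]) == (44, 40, 4)
```

The last line matches the fixed-degrees-of-freedom tables in Appendix B, where m = 4, 6, 10 components give ambient dimensions 44, 46, and 50 respectively.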
Results
Table 1: Best results for fixed ambient space (41); negative log-likelihood (−LL) and negative ELBO, lower is better. Starred columns give the single-shell S^{40} baseline.

Static MNIST
Ambient  m  −LL  −ELBO  −LL*  −ELBO*
41  4  92.65  98.23  96.32  104.11
41  4  92.59  98.27
41  6  92.25  98.10
41  6  92.71  98.46

Caltech
41  4  139.30  151.67  143.49  152.25
41  4  140.64  153.05

Omniglot
41  4  112.79  119.17  113.83  120.48
41  6  112.58  118.49
41  10  112.64  118.67
A summary of best results for fixed ambient space is shown in Table 1, with a summary of best results for fixed degrees of freedom and the complete interpolations in Appendix B. Initial inspection shows that partially breaking up a single S^{40} hypersphere into a hyperspherical product space indeed improves performance on all examined datasets. Diving deeper into the results, we find that both the number of breaks and the dimensional composition of these breaks strongly inform performance and learning stability.
A high number of breaks appears to negatively influence both performance and learning stability. Indeed, for most datasets the 'torus' setting, i.e. full factorization into S^{1} components, proved too unstable to train to convergence. One explanation for this result could be that we omit the REINFORCE part of the vMF reparameterization during training (see [4], Appendix D for more details). While of very limited influence for a single hyperspherical distribution, the accumulated bias across many shells might lead to a non-trivial effect. On the other hand, adding as few as four breaks extends the model's expressivity enough to consistently outperform a single shell.
Balance of the subspace composition plays a key role as well. We find that when the subspaces are too unbalanced, e.g. S^{37} \times S^{1} \times S^{1} \times S^{1}, the network starts to 'choose' between subspace channels. Effectively, it will for example encode all information in the S^{1} shells and completely ignore the S^{37} shell, leading to an effective latent space of 3 degrees of freedom (for an extended discussion of the interplay between balance and the KL divergence, see Appendix B); see for example Fig. 2(a). On the contrary, better-balanced compositions appear capable of cleanly separating semantically meaningful features across shells, as displayed in Fig. 2(b).
Conclusion and Future Work
In summary, breaking up a single hypersphere into multiple components effectively increases concentration expressiveness, leading to more stable training and improved results. In future work we would like to investigate the possibility of learning an optimal break-up as opposed to fixing it a priori, as well as mixing subspaces with different topologies.
Acknowledgments
We would like to thank Luca Falorsi and Nicola De Cao for insightful discussions during the development of this work.
References
 (1) Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A. Saurous, and Kevin Murphy. Fixing a broken ELBO. In ICML, pages 159–168, 2018.
 (2) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. In CoNLL, pages 10–21, 2016.
 (3) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. ICLR, 2016.
 (4) Tim R. Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M. Tomczak. Hyperspherical variational auto-encoders. UAI, 2018.
 (5) Nat Dilokthanakul, Pedro A. M. Mediano, Marta Garnelo, Matthew C. H. Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
 (6) Luca Falorsi, Pim de Haan, Tim R. Davidson, Nicola De Cao, Maurice Weiler, Patrick Forré, and Taco S. Cohen. Explorations in homeomorphic variational auto-encoding. ICML Workshop, 2018.
 (7) Luca Falorsi, Pim de Haan, Tim R. Davidson, and Patrick Forré. Reparameterizing distributions on Lie groups. AISTATS, 2019.
 (8) Mikhail Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit reparameterization gradients. In NeurIPS, pages 439–450, 2018.
 (9) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.
 (10) Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 (11) Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In IJCAI, pages 1965–1972, 2017.
 (12) Matthew Johnson, David K. Duvenaud, Alex Wiltschko, Ryan P. Adams, and Sandeep R. Datta. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, pages 2946–2954, 2016.
 (13) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, 2013.
 (14) Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 (15) Benjamin Marlin, Kevin Swersky, Bo Chen, and Nando de Freitas. Inductive principles for restricted Boltzmann machine learning. In AISTATS, pages 509–516, 2010.
 (16) Emile Mathieu, Charline Le Lan, Chris J. Maddison, Ryota Tomioka, and Yee Whye Teh. Hierarchical representations with Poincaré variational auto-encoders. NeurIPS, 2019.
 (17) Yoshihiro Nagano, Shoichiro Yamaguchi, Yasuhiro Fujita, and Masanori Koyama. A differentiable Gaussian-like distribution on hyperbolic space for gradient-based learning. arXiv preprint arXiv:1902.02992, 2019.
 (18) Eric Nalisnick, Lars Hertel, and Padhraic Smyth. Approximate inference for deep latent Gaussian mixtures. In NIPS Workshop on Bayesian Deep Learning, volume 2, 2016.
 (19) Changyong Oh, Jakub M. Tomczak, Efstratios Gavves, and Max Welling. Combinatorial Bayesian optimization using graph representations. ICML Workshop, 2019.
 (20) Ulrich Paquet, Sumedh K. Ghaisas, and Olivier Tieleman. A factorial mixture prior for compositional deep generative models. arXiv preprint arXiv:1812.07480, 2018.
 (21) Luis A. Pérez Rey, Vlado Menkovski, and Jacobus W. Portegies. Diffusion variational autoencoders. arXiv preprint arXiv:1901.08991, 2019.
 (22) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, pages 1278–1286, 2014.
 (23) Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In NIPS, pages 3738–3746, 2016.
Appendix A Dimensionality Decomposition
Given a latent variable z \in S^{d-1}, we choose m and d_1, \ldots, d_m s.t. \sum_{i=1}^{m} d_i = d, and z = z_1 \oplus \cdots \oplus z_m with z_i \in S^{d_i - 1}, where \oplus denotes concatenation. The probabilistic model becomes:
(5)   p(x, z) = p(x \mid z)\, p(z) \overset{(*)}{=} p(x \mid z) \prod_{i=1}^{m} p(z_i),
which factorizes in (*) if we assume independence. Assuming conditional independence of the approximate posterior as well, i.e. q(z \mid x) = \prod_{i=1}^{m} q(z_i \mid x), the Kullback-Leibler divergence simplifies as
(6)   \operatorname{KL}\big(q(z \mid x) \,\|\, p(z)\big) = \mathbb{E}_{q(z \mid x)}\big[\log q(z \mid x) - \log p(z)\big] = \sum_{i=1}^{m} \mathbb{E}_{q(z_i \mid x)}\big[\log q(z_i \mid x) - \log p(z_i)\big] = \sum_{i=1}^{m} \operatorname{KL}\big(q(z_i \mid x) \,\|\, p(z_i)\big).
Appendix B Supplementary Tables and Figures
Another way of understanding the importance of balance is by examining the KL divergence of the vMF and its influence on the loss objective: in order to achieve high-quality reconstruction performance, it is necessary for the concentration parameter \kappa to concentrate, i.e. take on a high value. Given the uniform prior setting, in which p(z_i) = U(S^{d_i - 1}), this logically leads to an increase in the KL divergence. The crucial observation here, however, is that the strength of the KL divergence also depends strongly on the dimensionality, as can be observed in Fig. 1. Hence, when learning over a product space containing several lower-dimensional components and a single high-dimensional component, if the reconstruction error can be made sufficiently low using the lower-dimensional components, the optimal loss minimization strategy is to set the concentration parameter of the largest component to 0, effectively ignoring it. A possible strategy to prevent this from happening could be to set separate \beta weighting parameters for each hyperspherical component; however, we fear this would quickly blow up the hyperparameter search space.
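This dimensionality effect can be made concrete with a small numerical sketch (ours, assuming a vMF posterior with a uniform prior; function names are our own): holding the effective concentration E[\mu^\top z] fixed, the required \kappa, and with it the KL cost, grows quickly with the ambient dimension m, so the largest component is the most expensive one to 'switch on'.

```python
import numpy as np
from scipy.special import ive, gammaln  # scaled Bessel I_v and log-gamma

def kl_vmf_uniform(m, kappa):
    """KL( vMF(mu, kappa) || Uniform(S^{m-1}) ); independent of mu."""
    v = m / 2.0 - 1.0
    log_iv = np.log(ive(v, kappa)) + kappa               # log I_v(kappa)
    log_norm = v * np.log(kappa) - (m / 2.0) * np.log(2 * np.pi) - log_iv
    ratio = ive(v + 1.0, kappa) / ive(v, kappa)          # E[mu^T z]
    log_area = np.log(2.0) + (m / 2.0) * np.log(np.pi) - gammaln(m / 2.0)
    return log_norm + kappa * ratio + log_area

def kappa_for_mean_cosine(m, target, lo=1e-3, hi=1e5):
    """Bisect for the kappa achieving E[mu^T z] = target on S^{m-1}."""
    v = m / 2.0 - 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if ive(v + 1.0, mid) / ive(v, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# KL price of the *same* effective concentration at increasing dimension:
for m in (2, 4, 11, 21, 38):
    kappa = kappa_for_mean_cosine(m, 0.9)
    print(m, round(kappa, 1), round(kl_vmf_uniform(m, kappa), 2))
```

The printed KL values increase with m, illustrating why the optimizer prefers to leave a large sub-hypersphere at \kappa \approx 0 when smaller shells suffice for reconstruction.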
B.1 Fixed Ambient Space
Static MNIST
Ambient  m  −LL  −ELBO  RE  KL
41  2  93.18  98.72  69.78  28.94
41  2  95.69  103.67  71.67  32.01
41  4  92.65  98.23  70.55  27.68
41  4  92.59  98.27  71.33  26.94
41  4  108.42  116.86  99.62  17.23
41  6  92.25  98.10  69.78  28.32
41  6  93.86  100.99  69.46  31.54
41  6  92.71  98.46  70.97  27.48
41  10  92.93  99.07  70.67  28.41
41  10  93.45  100.29  68.75  31.54
41  10  93.36  99.40  71.93  27.47

Caltech
Ambient  m  −LL  −ELBO  RE  KL
41  2  142.43  155.24  123.35  31.89
41  2  147.41  166.64  130.41  36.22
41  4  139.30  151.67  120.82  30.85
41  4  140.64  153.05  123.23  29.82
41  4  168.25  186.47  170.44  16.03
41  6  142.84  156.59  126.59  30.00
41  6  169.15  177.23  161.68  15.55
41  6  139.99  152.68  121.91  30.77
41  10  144.73  159.27  126.14  33.13
41  10  154.91  164.90  140.06  24.83
41  10  144.72  160.13  126.34  33.79

Omniglot
Ambient  m  −LL  −ELBO  RE  KL
41  2  114.32  120.72  92.10  28.62
41  2  115.19  122.30  91.82  30.48
41  4  113.29  118.97  88.93  30.05
41  4  112.79  119.17  87.94  31.23
41  4  136.39  142.03  132.75  9.28
41  6  114.07  119.99  91.26  28.72
41  6  131.55  137.29  124.66  12.62
41  6  112.58  118.49  88.27  30.23
41  10  113.53  119.83  90.00  29.83
41  10  114.95  121.42  92.42  29.00
41  10  112.64  118.67  88.98  29.68
B.2 Fixed Degrees of Freedom
Static MNIST
Ambient  m  −LL  −ELBO  −LL*  −ELBO*
44  4  92.62  98.26  96.32  104.11
46  6  92.59  98.46
46  6  92.50  98.28
50  10  92.57  98.81

Caltech
Ambient  m  −LL  −ELBO  −LL*  −ELBO*
44  4  137.95  150.86  143.49  152.25
46  6  139.84  152.92

Omniglot
Ambient  m  −LL  −ELBO  −LL*  −ELBO*
44  4  112.28  118.21  113.83  120.48
46  6  112.78  118.84
50  10  112.61  118.70