
Hierarchical Representations with Poincaré Variational Auto-Encoders

The Variational Auto-Encoder (VAE) model has become widely popular as a way to learn at once a generative model and embeddings for observations living in a high-dimensional space. In the real world, many such observations may be assumed to be hierarchically structured, such as data from living organisms, which are related through the evolutionary tree. Also, it has been theoretically and empirically shown that data with hierarchical structure can efficiently be embedded in hyperbolic spaces. We therefore endow the VAE with a hyperbolic geometry and empirically show that it can better generalise to unseen data than its Euclidean counterpart, and can qualitatively recover the hierarchical structure.



1 Introduction

Learning from unlabelled raw sensory observations, which are often high-dimensional, is a problem of significant importance in machine learning. VAEs (Kingma and Welling, 2013; Rezende et al., 2014) are probabilistic generative models composed of an encoder stochastically embedding observations in a low-dimensional latent space, and a decoder generating observations from encodings. After training, the encodings constitute a low-dimensional representation of the original raw observations, which can be used as features for a downstream task (e.g. Huang and LeCun, 2006; Coates et al., 2011) or be interpretable for their own sake. Such a representation learning (Bengio et al., 2013) task is of interest in VAEs since a stochastic embedding is learned to faithfully reconstruct observations. Hence, one may as well want such an embedding to be a good representation, i.e. an interpretable representation yielding better generalisation or useful as a pre-processing step. There is empirical evidence that neural networks yield progressively more abstract representations of objects as the information is pushed deeper into the layers (Bengio et al., 2013). For example, feeding a dog's image to an encoder neural network yields an abstract vectorial representation of that dog. Since human beings organise objects hierarchically, this motivates the use of hierarchical latent spaces to embed data.

Many domains of natural and human life exhibit patterns of hierarchical organisation and structure. For example, the theory of evolution (Darwin, 1859) implies that features of living organisms are related in a hierarchical manner given by the evolutionary tree. Also, many grammatical theories from linguistics, such as phrase-structure grammar, involve hierarchy. What is more, almost every system of organisation applied to the world is arranged hierarchically (Kulish, 2002), e.g. governments, families, human needs (Maslow, 1943), science, etc. Hence, explicitly incorporating hierarchical structure in probabilistic models has been a long-standing research topic (e.g. Heller and Ghahramani, 2005). It has been shown that certain classes of symbolic data can be organised according to a latent hierarchy using hyperbolic spaces (Nickel and Kiela, 2017). Trees are intrinsically related to hyperbolic geometry since they can be embedded with arbitrarily low distortion into the Poincaré disc (Sarkar, 2012). The exponential growth of the Poincaré surface area with respect to its radius also mirrors the exponential growth of the number of leaves in a tree with respect to its depth.

Taken together, this suggests that a VAE whose latent space is endowed with a hyperbolic geometry will be more capable of representing and discovering hierarchical features. This is what we consider in this work. Two recent concurrent works (Ovinnikov, 2018; Grattarola et al., 2018) considered a somewhat similar problem and endowed an AE-based model with a hyperbolic latent space. We discuss those works in more detail in Section 5. Our goal is twofold: (a) learn a latent representation that is interpretable in terms of a hierarchical relationship between the observations; (b) yield a more "compact" (than its Euclidean counterpart) representation for low-dimensional latent spaces – therefore yielding better reconstruction – for observations having such an underlying hierarchical structure.

The remainder of this paper is organised as follows: In Section 2, we briefly review Riemannian geometry and the Poincaré ball model of hyperbolic geometry. In Section 3, we introduce the Poincaré VAE model, and discuss its design and how it can be trained. In Section 4, we empirically assess the performance of our model on a synthetic dataset generated from a branching diffusion process, showing that it outperforms its Euclidean counterpart. Finally, in Sections 5 and 6, we discuss related work and future directions.

2 The Poincaré Ball model of hyperbolic geometry

2.1 Brief Riemannian geometry review

A real, smooth manifold $\mathcal{M}$ is a set of points $z$ which is locally similar to a linear space. At each point $z$ of the manifold is attached a real vector space of the same dimensionality as $\mathcal{M}$, called the tangent space at $z$: $\mathcal{T}_z\mathcal{M}$. Intuitively, it contains all the possible directions in which one can tangentially pass through $z$. For each point $z$ of the manifold, the metric tensor $g(z)$ defines an inner product on the associated tangent space: $g(z) = \langle \cdot, \cdot \rangle_z : \mathcal{T}_z\mathcal{M} \times \mathcal{T}_z\mathcal{M} \to \mathbb{R}$. A Riemannian manifold is then defined as the tuple $(\mathcal{M}, g)$ (Petersen, 2006).

The metric tensor gives local notions of angle, length of curves, surface area and volume, from which global quantities can be derived by integrating local contributions. A norm is induced by the inner product on $\mathcal{T}_z\mathcal{M}$: $\|\cdot\|_z = \sqrt{\langle \cdot, \cdot \rangle_z}$. The matrix representation $G(z)$ of the Riemannian metric is defined such that $\langle u, v \rangle_z = u^\top G(z)\, v$. An infinitesimal volume element is induced on each tangent space $\mathcal{T}_z\mathcal{M}$, and thus a measure $\mathrm{d}\mathcal{M}(z) = \sqrt{|G(z)|}\, \mathrm{d}z$ on the manifold.

The length of a curve $\gamma : [0, 1] \to \mathcal{M}$ is given by $L(\gamma) = \int_0^1 \|\gamma'(t)\|_{\gamma(t)}\, \mathrm{d}t$.

The concept of straight lines can then be generalised to geodesics, which are constant-speed curves giving the shortest path between pairs of points of the manifold: $\gamma^* = \arg\min_\gamma L(\gamma)$ with $\gamma(0) = z$ and $\gamma(1) = y$. A global distance is thus induced on $\mathcal{M}$, given by $d(z, y) = \inf_\gamma L(\gamma)$. Endowing $\mathcal{M}$ with that distance consequently defines a metric space $(\mathcal{M}, d)$. The concept of moving along a "straight" curve with constant velocity is given by the exponential map. In particular, there is a unique geodesic $\gamma_v$ satisfying $\gamma_v(0) = z$ with initial tangent vector $\gamma_v'(0) = v$. The corresponding exponential map is then defined by $\exp_z(v) = \gamma_v(1)$, as illustrated in Figure 1 (Left). The logarithm map is its inverse, $\log_z = \exp_z^{-1}$. For geodesically complete manifolds, such as the Poincaré ball, $\exp_z$ is well-defined on the full tangent space $\mathcal{T}_z\mathcal{M}$. For an offset $p$ and a normal vector $a$, the orthogonal projection of a point $z$ onto the hyperplane $H_{a,p}$ is the endpoint of the geodesic of minimal length from $z$ to that hyperplane.

2.2 The hyperbolic space

Figure 1: (Left) Geodesics and exponential maps. (Right) An embedded tree.

From a historical perspective, hyperbolic geometry was created in the first half of the nineteenth century in the midst of attempts to understand Euclid’s axiomatic basis for geometry (Coxeter, 1942). It is one type of non-Euclidean geometry, the one that discards Euclid’s fifth postulate (or parallel postulate) to replace it with the following one: Given a line and a point not on it, there is more than one line going through the given point that is parallel to the given line.

Moreover, in every dimension $d$ there exists (up to isometries) a unique complete, simply connected Riemannian manifold of constant sectional curvature $K$. The three manifolds given by $K > 0$, $K = 0$ and $K < 0$ are respectively the hypersphere $\mathbb{S}^d$, the Euclidean space $\mathbb{R}^d$, and the hyperbolic space $\mathbb{H}^d$. In contrast with $\mathbb{S}^d$ and $\mathbb{R}^d$, the hyperbolic space can be constructed using various isometric models (none of which is prevalent), among them the hyperboloid model, the Beltrami-Klein model, the Poincaré half-plane model and the Poincaré ball, which we will be relying on.

Intuitively, hyperbolic spaces can be thought of as continuous versions of trees or, vice versa, trees can be thought of as "discrete hyperbolic spaces", as illustrated in Figure 1 (Right). Indeed, trees can be embedded with arbitrarily low distortion (distortion is a global metric for graph embeddings, measuring the relative difference between pairwise distances in the embedding space and in the original space, the best distortion being 1) into the Poincaré disc (Sarkar, 2012). In contrast, Bourgain's theorem (Linial et al., 1994) shows that Euclidean space is unable to achieve comparably low distortion for trees – even using an unbounded number of dimensions. This type of structure is naturally suited to hyperbolic spaces since their volume and surface area grow exponentially with their radius.

2.3 The Poincaré ball

The Poincaré ball $\mathcal{B}_c^d$ is a model of $d$-dimensional hyperbolic geometry with curvature $-c < 0$. Its smooth manifold is the open ball of radius $1/\sqrt{c}$ in $\mathbb{R}^d$: $\mathcal{B}_c^d = \{z \in \mathbb{R}^d \mid \|z\| < 1/\sqrt{c}\}$, where $\|\cdot\|$ denotes the Euclidean norm. Its metric tensor is a conformal one (i.e. proportional to the Euclidean metric tensor $g^E$), given by

$$g_z^c = (\lambda_z^c)^2\, g^E,$$

where $\lambda_z^c = \frac{2}{1 - c\|z\|^2}$ is the conformal factor and $g^E = I_d$ denotes the Euclidean metric tensor, i.e. the usual dot product. The Poincaré ball model of hyperbolic space then corresponds to the Riemannian manifold $(\mathcal{B}_c^d, g^c)$. The induced distance function between $z, y \in \mathcal{B}_c^d$ is given by

$$d_c(z, y) = \frac{1}{\sqrt{c}} \cosh^{-1}\left(1 + \frac{2c\,\|z - y\|^2}{(1 - c\|z\|^2)(1 - c\|y\|^2)}\right).$$
The framework of gyrovector spaces provides a non-associative algebraic formalism for hyperbolic geometry, analogous to the vector space structure of Euclidean geometry (Ungar, 2008). The Möbius addition of $z$ and $y$ in $\mathcal{B}_c^d$ is defined as

$$z \oplus_c y = \frac{(1 + 2c\langle z, y\rangle + c\|y\|^2)\, z + (1 - c\|z\|^2)\, y}{1 + 2c\langle z, y\rangle + c^2\|z\|^2\|y\|^2}.$$

One recovers the Euclidean addition of two vectors in the flat limit, $\lim_{c \to 0} z \oplus_c y = z + y$. Building on that framework, Ganea et al. (2018) derived closed-form formulations for the exponential map,

$$\exp_z^c(v) = z \oplus_c \left( \tanh\!\left(\sqrt{c}\, \frac{\lambda_z^c \|v\|}{2}\right) \frac{v}{\sqrt{c}\,\|v\|} \right),$$

and its inverse, the logarithm map,

$$\log_z^c(y) = \frac{2}{\sqrt{c}\, \lambda_z^c} \tanh^{-1}\!\left(\sqrt{c}\, \|{-z} \oplus_c y\|\right) \frac{-z \oplus_c y}{\|{-z} \oplus_c y\|}.$$
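These closed forms can be sketched numerically. The following NumPy helpers (function names are ours) implement Möbius addition, the exponential and logarithm maps, and the induced distance $d_c(z, y) = \frac{2}{\sqrt{c}} \tanh^{-1}(\sqrt{c}\, \|{-z} \oplus_c y\|)$, assuming the formulas of Ganea et al. (2018) as stated above:

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition on the Poincare ball of curvature -c (Ganea et al., 2018)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    return num / (1 + 2 * c * xy + c ** 2 * x2 * y2)

def conformal_factor(x, c=1.0):
    # lambda_x^c = 2 / (1 - c ||x||^2)
    return 2.0 / (1.0 - c * np.dot(x, x))

def exp_map(x, v, c=1.0):
    """Exponential map at x: move along the geodesic with initial velocity v."""
    nv = np.linalg.norm(v)
    if nv < 1e-15:
        return x.copy()
    t = np.tanh(np.sqrt(c) * conformal_factor(x, c) * nv / 2.0)
    return mobius_add(x, t * v / (np.sqrt(c) * nv), c)

def log_map(x, y, c=1.0):
    """Logarithm map at x: the inverse of exp_map."""
    u = mobius_add(-x, y, c)
    nu = np.linalg.norm(u)
    if nu < 1e-15:
        return np.zeros_like(x)
    return (2.0 / (np.sqrt(c) * conformal_factor(x, c))) * np.arctanh(np.sqrt(c) * nu) * u / nu

def distance(x, y, c=1.0):
    """Geodesic distance induced on the ball."""
    return (2.0 / np.sqrt(c)) * np.arctanh(np.sqrt(c) * np.linalg.norm(mobius_add(-x, y, c)))
```

As a sanity check, `log_map` inverts `exp_map`, and in the flat limit $c \to 0$ the operations reduce to ordinary vector addition.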
3 The Poincaré VAE: $\mathcal{P}$-VAE

We consider the problem of mapping an empirical distribution of observations $x \in \mathbb{R}^n$ to a lower-dimensional Poincaré ball $\mathcal{B}_c^d$, as well as learning a map from this latent space back to the observation space. Building on the VAE framework, this $\mathcal{P}$-VAE model only differs by the choice of prior distribution $p(z)$ and posterior distribution $q_\phi(z|x)$, which are defined on $\mathcal{B}_c^d$, and by the encoder and decoder maps, which take into account the geometry of the latent space. Their parameters are learned by maximising the ELBO. Hyperbolic geometry is in a sense an extrapolation of Euclidean geometry; hence our model can also be seen as a generalisation of the usual Euclidean VAE, which we denote by $\mathcal{E}$-VAE, i.e. $\lim_{c \to 0} \mathcal{P}$-VAE $= \mathcal{E}$-VAE.

3.1 Prior and variational posterior distributions

In order to parametrise distributions on the Poincaré ball, we consider a maximum entropy generalisation of normal distributions (Said et al., 2014; Pennec, 2006), i.e. a hyperbolic normal distribution, whose density is given by

$$\mathcal{N}_{\mathcal{B}_c^d}(z \mid \mu, \sigma^2) = \frac{1}{Z(\sigma)} \exp\left(-\frac{d_c(\mu, z)^2}{2\sigma^2}\right),$$

with $\sigma$ a dispersion parameter (not being the standard deviation), $\mu$ the Fréchet expectation (the generalisation of the expectation to manifolds, defined as the minimiser of $\mu \mapsto \mathbb{E}\left[d(\mu, z)^2\right]$), and $Z(\sigma)$ the normalising constant derived in Appendix B.6. We discuss such distributions in Appendices B.1 and B.2. The prior distribution is hence chosen to be a hyperbolic normal distribution with mean zero, $p(z) = \mathcal{N}_{\mathcal{B}_c^d}(z \mid 0, \sigma_0^2)$, and the variational family to be $q_\phi(z|x) = \mathcal{N}_{\mathcal{B}_c^d}(z \mid \mu_\phi(x), \sigma_\phi^2(x))$. Figure 2 shows samples from such hyperbolic normal distributions.

Figure 2: Samples from hyperbolic normal distributions with different expectations (black crosses) and dispersions.

3.2 Encoder and decoder architecture

We make use of two neural networks, a decoder $f_\theta$ and an encoder $g_\phi$, to respectively parametrise the likelihood $p_\theta(x|z)$ and the variational posterior $q_\phi(z|x)$. Such functions are often designed as compositions of decision units, each composed of an affine transformation followed by a non-linearity.

Figure 3: Illustration of an orthogonal projection on a hyperplane in a Poincaré disc (Left) and a Euclidean plane (Right). Those hyperplanes are decision boundaries.

Affine transformations – given by $z \mapsto \langle a, z - p \rangle$ – can be rewritten as being proportional to the distance from $z$ to its orthogonal projection on the hyperplane $H_{a,p} = \{z \mid \langle a, z - p \rangle = 0\}$, which acts as a decision boundary:

$$\langle a, z - p \rangle = \mathrm{sign}(\langle a, z - p \rangle)\, \|a\|\, d(z, H_{a,p}).$$

As highlighted in Ganea et al. (2018), one can analogously define an operator on the Poincaré ball,

$$f_{a,p}^c(z) = \mathrm{sign}\!\left(\langle a, \log_p^c(z) \rangle\right) \|a\|_p\, d_c(z, H_{a,p}^c),$$

with $H_{a,p}^c = \{z \in \mathcal{B}_c^d \mid \langle a, \log_p^c(z) \rangle = 0\}$. A closed-form expression for $d_c(z, H_{a,p}^c)$ has also been derived by Ganea et al. (2018):

$$d_c(z, H_{a,p}^c) = \frac{1}{\sqrt{c}} \sinh^{-1}\!\left( \frac{2\sqrt{c}\, |\langle -p \oplus_c z,\, a \rangle|}{(1 - c\,\|{-p} \oplus_c z\|^2)\, \|a\|} \right).$$

Such hyperplanes – also called gyroplanes – can be interpreted as decision boundaries, being semi-hyperspheres orthogonal to the Poincaré ball's boundary as shown in Figure 3, in contrast to the flat hyperplanes of Euclidean space.
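A small numerical sketch of the gyroplane distance (our function names; the closed form above is assumed). A point of the form $p \oplus_c u$ with $u$ Euclidean-orthogonal to $a$ lies on the gyroplane by the left-cancellation law $-p \oplus_c (p \oplus_c u) = u$, so its distance should vanish:

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    # Mobius addition on the Poincare ball of curvature -c
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y) / (1 + 2 * c * xy + c**2 * x2 * y2)

def gyroplane_distance(z, p, a, c=1.0):
    """Closed-form distance from z to the gyroplane with offset p and normal a."""
    u = mobius_add(-p, z, c)
    num = 2.0 * np.sqrt(c) * abs(np.dot(u, a))
    den = (1.0 - c * np.dot(u, u)) * np.linalg.norm(a)
    return np.arcsinh(num / den) / np.sqrt(c)
```

For instance, with $p = (0.2, 0.1)$ and $a = (1, 0.5)$, the point $p \oplus_c u$ for $u \propto (-0.5, 1)$ sits on the decision boundary, while moving from $p$ along $a$ gives a strictly positive distance.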

We define the decoder's first layer as the concatenation of such operators $f_{a_k, p_k}^c$, which we call gyroplane units. This output is then fed to a generic neural network.

The encoder must have outputs lying within the support of the posterior parameters. For the choice of $q_\phi(z|x) = \mathcal{N}_{\mathcal{B}_c^d}(z \mid \mu, \sigma^2)$, $\sigma > 0$ is required. Constraining $\sigma > 0$ can classically be done via a log-variance parametrisation. The easiest way to ensure that $\mu \in \mathcal{B}_c^d$ is to parametrise it as the image of the exponential map defined at the origin with velocity $\tilde{\mu} \in \mathbb{R}^d$. Hence,

$$\mu = \exp_0^c(\tilde{\mu}).$$

In the flat curvature limit $c \to 0$, the usual parametrisation is recovered. The $\mathcal{P}$-VAE's encoder is therefore a generic encoder whose mean output is mapped to the manifold via the exponential map $\exp_0^c$.
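A minimal sketch of this output parametrisation (the function and names are illustrative, not the paper's code): the mean head is pushed through $\exp_0^c$, where $\lambda_0^c = 2$ gives $\exp_0^c(v) = \tanh(\sqrt{c}\,\|v\|)\, v / (\sqrt{c}\,\|v\|)$, and a log-variance head gives $\sigma > 0$:

```python
import numpy as np

def split_encoder_outputs(mu_tilde, log_var, c=1.0):
    """Map unconstrained network outputs to valid posterior parameters:
    mu = exp_0^c(mu_tilde) lies inside the ball, sigma = exp(log_var / 2) > 0.
    (Illustrative helper; any generic encoder producing (mu_tilde, log_var) works.)"""
    mu_tilde = np.asarray(mu_tilde, float)
    n = np.linalg.norm(mu_tilde)
    if n < 1e-15:
        mu = mu_tilde.copy()
    else:
        # exponential map at the origin, using lambda_0^c = 2
        mu = np.tanh(np.sqrt(c) * n) * mu_tilde / (np.sqrt(c) * n)
    sigma = np.exp(0.5 * float(log_var))
    return mu, sigma
```

However large the raw mean output, the constrained mean stays inside the ball, and for vanishing curvature the map tends to the identity, recovering the Euclidean parametrisation.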

3.3 Training


The ELBO can readily be extended to Riemannian latent spaces by applying Jensen's inequality w.r.t. the measure $\mathrm{d}\mathcal{M}(z)$ (cf. Appendix A), yielding

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - \mathrm{KL}\!\left(q_\phi(z|x)\, \|\, p(z)\right),$$

which can easily be estimated by MC samples.
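Since the KL term has no closed form for hyperbolic normals, it is estimated from posterior samples as $\mathbb{E}_q[\log q(z) - \log p(z)]$. A minimal sanity check of that MC estimator on toy 1-D Gaussians (our example, chosen so the exact KL is available for comparison):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(z, m, s):
    # log-density of a 1-D Gaussian N(m, s^2)
    return -0.5 * np.log(2 * np.pi * s**2) - (z - m)**2 / (2 * s**2)

m_q, s_q, m_p, s_p = 1.0, 0.5, 0.0, 1.0
z = rng.normal(m_q, s_q, size=200_000)                    # z ~ q
kl_mc = np.mean(log_normal(z, m_q, s_q) - log_normal(z, m_p, s_p))
kl_exact = np.log(s_p / s_q) + (s_q**2 + (m_q - m_p)**2) / (2 * s_p**2) - 0.5
```

The same sample-based estimator applies verbatim on the Poincaré ball once the hyperbolic log-densities are available.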

Algorithm 1: Sampling the hyperbolic radius and direction

1: Input: $\mu$, $\sigma$, dimension $d$, curvature $c$
2: $\phi \leftarrow$ truncated normal proposal pdf
3: $\rho \leftarrow$ hyperbolic radius pdf
4: $M \leftarrow$ acceptance-rejection constant
5: while sample not accepted do
6:     Propose $r \sim \phi$
7:     Sample $u \sim$ Uniform distribution on the segment $[0, M\phi(r)]$
8:     if $u \leq \rho(r)$ then (accept-reject step)
9:         Accept sample $r$
10:    else
11:        Reject sample
12:    end if
13: end while
14: Sample direction $\alpha \sim$ Uniform distribution on the hypersphere $\mathbb{S}^{d-1}$
15: $z \leftarrow$ run the geodesic from $\mu$ with direction $\alpha$ and hyperbolic radius $r$

In order to obtain samples and to compute gradients with respect to the parameters $\mu$ and $\sigma$, we rely on the reparameterisation given by hyperbolic polar coordinates: $z$ is the point lying at hyperbolic distance $r$ from $\mu$ along the geodesic with initial direction $\alpha$, with the direction $\alpha$ being uniformly distributed on the hypersphere $\mathbb{S}^{d-1}$, and the hyperbolic radius $r = d_c(\mu, z)$ having density

$$\rho(r) \propto \exp\left(-\frac{r^2}{2\sigma^2}\right) \left(\frac{\sinh(\sqrt{c}\, r)}{\sqrt{c}}\right)^{d-1}, \quad r > 0,$$

as shown in Appendices B.3 and B.4.


Gradients with respect to $\mu$ can then easily be computed thanks to the reparameterisation given by Eq 14, and similarly for gradients with respect to $\sigma$ by additionally relying on an implicit reparameterisation (Figurnov et al., 2018) via the cdf of $r$, as shown in Appendix B.7.

We consequently rely on that reparameterisation to obtain samples through an acceptance-rejection sampling scheme, described in Algorithm 1 and Appendix B.7.
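A runnable sketch of the radius step, under the density $\rho(r) \propto e^{-r^2/2\sigma^2} \left(\sinh(\sqrt{c}\,r)/\sqrt{c}\right)^{d-1}$ stated above; the envelope choice is ours, not necessarily the paper's exact constants. Bounding $\sinh(x) \le e^x/2$ yields a Gaussian proposal $\mathcal{N}(\sigma^2\sqrt{c}\,(d-1),\, \sigma^2)$ restricted to $r > 0$, with acceptance probability $(1 - e^{-2\sqrt{c}\,r})^{d-1}$:

```python
import numpy as np

def sample_radius(sigma, d, c=1.0, rng=None):
    """Accept-reject sampler for the hyperbolic radius.

    Target (unnormalised): exp(-r^2 / (2 sigma^2)) * sinh(sqrt(c) r)^(d-1), r > 0.
    Proposal: Normal(sigma^2 sqrt(c) (d-1), sigma^2) restricted to r > 0; the
    sinh(x) <= exp(x)/2 bound makes the acceptance ratio (1 - exp(-2 sqrt(c) r))^(d-1).
    """
    rng = rng or np.random.default_rng()
    m = sigma**2 * np.sqrt(c) * (d - 1)
    while True:
        r = rng.normal(m, sigma)
        if r <= 0:
            continue  # proposal restricted to the positive half-line
        if rng.uniform() < (1.0 - np.exp(-2.0 * np.sqrt(c) * r)) ** (d - 1):
            return r
```

For $d = 1$ the acceptance probability is identically 1 and the sampler reduces to a half-normal, as the target then has no $\sinh$ factor.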

4 Experiments

Branching diffusion process

To test the $\mathcal{P}$-VAE on data with an underlying hierarchical structure, we generate a synthetic dataset from a branching diffusion process. Nodes are represented in a vector space and sampled according to the following process:

$$z_i \mid z_{\pi(i)} \sim \mathcal{N}\!\left(z_{\pi(i)},\, \sigma^2 I_n\right),$$

with $\pi(i)$ being the index of the $i$th node's ancestor and $d_i$ its depth. Hence, the generated observations have a known hierarchical relationship. Yet, the model will not have access to the hierarchy but only to the vector representations $z_i$.

Then, to assess the generalisation capacity of the models, noisy observations are sampled for each node:

$$x_{i,j} \mid z_i \sim \mathcal{N}\!\left(z_i,\, \sigma_\epsilon^2 I_n\right).$$
The root is set to the null vector for simplicity. The observation dimension is set to . The dataset is centred and normalised to have unit variance; thus, the choice of variance does not matter and it is set to . The number of noisy observations per node is set to , and their variance to . The depth is set to and the branching factor to . Details concerning the optimisation and architectures of the models can be found in Appendix C. In order to prevent the embeddings from lying too close to the centre of the ball – and thus not really making use of the hyperbolic geometry – we set the prior distribution's dispersion to . Indeed, more prior mass is put close to the boundary as the dispersion increases, as illustrated in Figure 2. Consequently, the embeddings are "pushed" towards the boundary of the ball by the regularisation term in the ELBO (Equation 13). We restrict ourselves to a latent dimensionality of 2, i.e. the Poincaré disc.
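A sketch of such a generator (parameter names and default values are illustrative, not the paper's settings):

```python
import numpy as np

def branching_diffusion(depth=4, branching=2, dim=10, sigma=1.0,
                        sigma_eps=0.1, n_noisy=5, seed=0):
    """Toy hierarchical dataset: each child node is a Gaussian perturbation of
    its parent; noisy observations are then drawn around every node."""
    rng = np.random.default_rng(seed)
    nodes = [np.zeros(dim)]   # root at the origin
    parents = [-1]
    frontier = [0]
    for _ in range(depth):
        nxt = []
        for i in frontier:
            for _ in range(branching):
                nodes.append(rng.normal(nodes[i], sigma))  # diffuse from parent
                parents.append(i)
                nxt.append(len(nodes) - 1)
        frontier = nxt
    nodes = np.stack(nodes)
    # n_noisy noisy observations per node; the model only ever sees these
    obs = nodes[:, None, :] + rng.normal(0.0, sigma_eps, (len(nodes), n_noisy, dim))
    return nodes, parents, obs.reshape(-1, dim)
```

With branching factor $b$ and depth $D$, the tree has $\sum_{k=0}^{D} b^k$ nodes, so the number of observations grows exponentially with depth, matching the geometry argument of Section 2.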

Both models – the $\mathcal{P}$-VAE and the $\mathcal{E}$-VAE – are trained on the synthetic dataset previously described. The aim is to assess whether the underlying hierarchy of the observations is recovered in the latent representations. Figure 4 shows that a hierarchical structure is indeed somewhat learned by both models, even though the $\mathcal{P}$-VAE's hierarchy is arguably visually clearer.

Figure 4: Latent representations of a $\mathcal{P}$-VAE and an $\mathcal{E}$-VAE (one per panel), trained on the synthetic dataset. Posterior means are represented by black crosses, and coloured dots are posterior samples. Blue lines represent the true underlying hierarchy.
Figure 5: Latent representations of a $\mathcal{P}$-VAE and an $\mathcal{E}$-VAE (one per panel), with a heatmap of the log-distance to a hyperplane (in pink).

So as to quantitatively assess the models' performance, we compute the marginal likelihood on the test dataset using IWAE (Burda et al., 2015) estimates (with $K$ importance samples). We train several $\mathcal{P}$-VAEs with decreasing curvatures, along with a Euclidean $\mathcal{E}$-VAE as a baseline.
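As a reminder of how the IWAE estimate works, here is a toy conjugate Gaussian model (our illustration, not the paper's model) where the true marginal is known: with importance weights $w_k = p(z_k)\, p(x|z_k) / q(z_k|x)$, the estimate averages $\log \frac{1}{K}\sum_k w_k$, and choosing $q$ as the exact posterior makes it exact for any $K$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy conjugate model: p(z) = N(0, 1), p(x|z) = N(z, 1), so p(x) = N(x; 0, 2)
# and the exact posterior is q(z|x) = N(x/2, 1/2).
def log_n(x, m, v):
    return -0.5 * np.log(2 * np.pi * v) - (x - m)**2 / (2 * v)

def iwae_estimate(x, K, q_mean, q_var, n_rep=1000):
    """Average over n_rep repetitions of log(1/K sum_k w_k)."""
    z = rng.normal(q_mean, np.sqrt(q_var), size=(n_rep, K))
    log_w = log_n(z, 0.0, 1.0) + log_n(x, z, 1.0) - log_n(z, q_mean, q_var)
    return float(np.mean(np.logaddexp.reduce(log_w, axis=1) - np.log(K)))
```

With the prior as a (deliberately poor) proposal, the $K = 1$ estimate is just the ELBO and sits well below $\log p(x)$, while larger $K$ tightens the bound, which is why IWAE is preferred for test marginal likelihoods.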

Table 1: Negative test marginal likelihood estimates on the synthetic dataset, averaged over 20 runs (the smaller the better). Twenty different datasets are generated and the models are trained on each of them.
Figure 6: Latent representations of $\mathcal{P}$-VAEs with decreasing curvatures (Left to Right).

Table 1 shows that on the synthetic dataset, the $\mathcal{P}$-VAE outperforms its Euclidean counterpart in terms of test marginal likelihood. Also, as the curvature decreases, the performance of the $\mathcal{E}$-VAE is recovered. Figure 6 shows latent representations of $\mathcal{P}$-VAEs with different curvatures. With "small" curvatures, we observe that the embeddings lie close to the centre of the ball, where the geometry is close to being Euclidean.

5 Related work

In the Bayesian nonparametrics literature, explicitly modelling the hierarchical structure of data has been a long-standing trend (Teh et al., 2008; Heller and Ghahramani, 2005; Griffiths et al., 2004; Ghahramani et al., 2010; Larsen et al., 2001). Embedding graphs in hyperbolic spaces has been empirically shown (Nickel and Kiela, 2017, 2018; Chamberlain et al., 2017) to yield more compact representations than Euclidean space, especially in low dimensions. De Sa et al. (2018) studied the trade-offs of tree embeddings in the Poincaré disc.

VAEs with a non-Euclidean latent space were first explored by Davidson et al. (2018), endowing the latent space with a hyperspherical geometry and parametrising variational posteriors with von Mises-Fisher distributions. Independent works recently considered endowing AE latent spaces with a hyperbolic geometry in order to yield a hierarchical representation. Grattarola et al. (2018) considered the setting of constant-curvature manifolds (i.e. hyperspherical, Euclidean and hyperbolic geometries), thus making use of the hyperboloid model for the hyperbolic geometry. By building on the adversarial auto-encoder (AAE) framework, observations are generated by mapping prior samples through the decoder, and observations are embedded via a deterministic encoder. Hence, only samples from the prior distribution are needed, which they choose to be a wrapped normal distribution. In order to use generic neural networks, the encoder is regularised so that its outputs lie close to the manifold. On the other hand, the encoder and decoder architectures are therefore not tailored to take into account the geometry of the latent space. The posterior distribution is regularised towards the prior via a discriminator, which requires the design and training of an extra neural network. In a concurrent work, Ovinnikov (2018) proposed to similarly endow the VAE latent space with a Poincaré ball model. They train their model by minimising a Wasserstein Auto-Encoder loss (Tolstikhin et al., 2017) with an MMD regularisation, because they cannot derive a closed-form solution of the ELBO's entropy term. We instead rely on an MC approximation of the ELBO.

6 Future directions

One drawback of our approach is that it is computationally more demanding than the usual VAE. The gyroplane units are not as fast as their linear counterparts, which benefit from optimised implementations. Also, our sampling scheme for hyperbolic normal distributions does not scale well with the dimensionality. Sampling the hyperbolic radius is the only non-trivial part; even though it is a univariate random variable, its dependence on the dimensionality hurts our rejection bound. One may exploit the log-concavity of the hyperbolic radius pdf by applying an adaptive rejection sampling (ARS) scheme (R. Gilks and Wild, 1992; Görür and Teh, 2011), which should scale better.

Another interesting extension would be the use of normalising flows to avoid the previously mentioned difficulties due to hyperbolic normal distributions, while preserving a tractable likelihood. One may consider Hamiltonian flows (Caterini et al., 2018) defined on the Poincaré ball, or generalisations of simple flows (e.g. planar flows).

Further empirical results on real datasets are also needed to strengthen the claim that such models indeed yield interpretable hierarchical representations which are more compact.


We are extremely grateful to Adam Foster, Phillipe Gagnon and Emmanuel Chevallier for their help. EM, YWT’s research leading to these results received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071 and they acknowledge Microsoft Research and EPSRC for partially funding EM’s studentship.


  • Alanis-Lobato and Andrade-Navarro (2016) Alanis-Lobato, G. and Andrade-Navarro, M. A. (2016). Distance Distribution between Complex Network Nodes in Hyperbolic Space. Complex Systems, pages 223–236.
  • Asta and Shalizi (2014) Asta, D. and Shalizi, C. R. (2014). Geometric Network Comparison.
  • Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828.
  • Box and Muller (1958) Box, G. E. P. and Muller, M. E. (1958). A note on the generation of random normal deviates. Ann. Math. Statist., 29(2):610–611.
  • Burda et al. (2015) Burda, Y., Grosse, R., and Salakhutdinov, R. (2015). Importance Weighted Autoencoders.
  • Caterini et al. (2018) Caterini, A. L., Doucet, A., and Sejdinovic, D. (2018). Hamiltonian Variational Auto-Encoder.
  • Chamberlain et al. (2017) Chamberlain, B. P., Clough, J., and Deisenroth, M. P. (2017). Neural Embeddings of Graphs in Hyperbolic Space.
  • Chevallier et al. (2015) Chevallier, E., Barbaresco, F., and Angulo, J. (2015). Probability density estimation on the hyperbolic space applied to radar processing. In Nielsen, F. and Barbaresco, F., editors, Geometric Science of Information, pages 753–761, Cham. Springer International Publishing.
  • Coates et al. (2011) Coates, A., Lee, H., and Ng, A. (2011). An analysis of single-layer networks in unsupervised feature learning. In Gordon, G., Dunson, D., and Dudík, M., editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of JMLR Workshop and Conference Proceedings, pages 215–223. JMLR W&CP.
  • Coxeter (1942) Coxeter, H. S. M. (1942). Non-Euclidean Geometry. University of Toronto Press.
  • Darwin (1859) Darwin, C. (1859). On the Origin of Species by Means of Natural Selection. Murray, London. or the Preservation of Favored Races in the Struggle for Life.
  • Davidson et al. (2018) Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. (2018). Hyperspherical Variational Auto-Encoders.
  • De Sa et al. (2018) De Sa, C., Gu, A., Re, C., and Sala, F. (2018). Representation Tradeoffs for Hyperbolic Embeddings.
  • Figurnov et al. (2018) Figurnov, M., Mohamed, S., and Mnih, A. (2018). Implicit Reparameterization Gradients.
  • Ganea et al. (2018) Ganea, O.-E., Bécigneul, G., and Hofmann, T. (2018). Hyperbolic Neural Networks.
  • Ghahramani et al. (2010) Ghahramani, Z., Jordan, M. I., and Adams, R. P. (2010). Tree-structured stick breaking for hierarchical data. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A., editors, Advances in Neural Information Processing Systems 23, pages 19–27. Curran Associates, Inc.
  • Grattarola et al. (2018) Grattarola, D., Livi, L., and Alippi, C. (2018). Adversarial Autoencoders with Constant-Curvature Latent Manifolds.
  • Griffiths et al. (2004) Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., and Blei, D. M. (2004). Hierarchical topic models and the nested chinese restaurant process. In Thrun, S., Saul, L. K., and Schölkopf, B., editors, Advances in Neural Information Processing Systems 16, pages 17–24. MIT Press.
  • Görür and Teh (2011) Görür, D. and Teh, Y. W. (2011). Concave-convex adaptive rejection sampling. Journal of Computational and Graphical Statistics, 20(3):670–691.
  • Heller and Ghahramani (2005) Heller, K. A. and Ghahramani, Z. (2005). Bayesian hierarchical clustering. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 297–304, New York, NY, USA. ACM.
  • Huang and LeCun (2006) Huang, F. J. and LeCun, Y. (2006). Large-scale learning with SVM and convolutional for generic object categorization. In CVPR (1), pages 284–291. IEEE Computer Society.
  • Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A Method for Stochastic Optimization.
  • Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. CoRR, abs/1312.6114.
  • Krioukov et al. (2010) Krioukov, D., Papadopoulos, F., Kitsak, M., Vahdat, A., and Boguna, M. (2010). Hyperbolic Geometry of Complex Networks. Physical Review E, 82(3):036106.
  • Kulish (2002) Kulish, V. (2002). Hierarchical Methods. Springer Netherlands.
  • Larsen et al. (2001) Larsen, J., Szymkowiak, A., and Hansen, L. K. (2001). Probabilistic hierarchical clustering with labeled and unlabeled data.
  • Linial et al. (1994) Linial, N., London, E., and Rabinovich, Y. (1994). The geometry of graphs and some of its algorithmic applications. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, SFCS ’94, pages 577–591, Washington, DC, USA. IEEE Computer Society.
  • Maslow (1943) Maslow, A. H. (1943). A theory of human motivation. Psychological Review, 50(4):430–437.
  • Nickel and Kiela (2017) Nickel, M. and Kiela, D. (2017). Poincaré Embeddings for Learning Hierarchical Representations.
  • Nickel and Kiela (2018) Nickel, M. and Kiela, D. (2018). Learning Continuous Hierarchies in the Lorentz Model of Hyperbolic Geometry.
  • Ovinnikov (2018) Ovinnikov, I. (2018). Poincaré Wasserstein Autoencoder. NeurIPS Workshop on Bayesian Deep Learning, pages 1–8.
  • Pennec (2006) Pennec, X. (2006). Intrinsic statistics on riemannian manifolds: Basic tools for geometric measurements. Journal of Mathematical Imaging and Vision, 25(1):127.
  • Petersen (2006) Petersen, P. (2006). Riemannian Geometry. Springer-Verlag New York.
  • R. Gilks and Wild (1992) Gilks, W. R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41(2):337–348.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models.
  • Said et al. (2014) Said, S., Bombrun, L., and Berthoumieu, Y. (2014). New riemannian priors on the univariate normal model. Entropy, 16(7):4015–4031.
  • Sarkar (2012) Sarkar, R. (2012). Low distortion delaunay embedding of trees in hyperbolic plane. In van Kreveld, M. and Speckmann, B., editors, Graph Drawing, pages 355–366, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Teh et al. (2008) Teh, Y. W., Daume III, H., and Roy, D. M. (2008). Bayesian agglomerative clustering with coalescents. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, pages 1473–1480. Curran Associates, Inc.
  • Tolstikhin et al. (2017) Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. (2017). Wasserstein Auto-Encoders.
  • Ungar (2008) Ungar, A. A. (2008). A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1–194.

Appendix A Evidence Lower Bound

The ELBO can readily be extended to Riemannian latent spaces by applying Jensen's inequality w.r.t. the measure $\mathrm{d}\mathcal{M}(z)$, which yields

$$\log p_\theta(x) = \log \int_{\mathcal{M}} p_\theta(x|z)\, p(z)\, \mathrm{d}\mathcal{M}(z) \geq \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - \mathrm{KL}\!\left(q_\phi(z|x)\, \|\, p(z)\right).$$

Appendix B Hyperbolic normal distributions

B.1 Probability measures on Riemannian manifolds

Probability measures and random vectors can be defined on Riemannian manifolds so as to model uncertainty on non-Euclidean spaces (Pennec, 2006). The Riemannian metric induces an infinitesimal volume element on each tangent space $\mathcal{T}_z\mathcal{M}$, and thus a measure on the manifold,

$$\mathrm{d}\mathcal{M}(z) = \sqrt{|G(z)|}\, \mathrm{d}\lambda(z),$$

with $\lambda$ being the Lebesgue measure. Random vectors are naturally characterised by the Radon-Nikodym derivative of their measure w.r.t. the Riemannian measure. Thus one can generalise isotropic normal distributions as parametric measures with the following density:

$$\mathcal{N}_{\mathcal{M}}(z \mid \mu, \sigma^2) = \frac{1}{Z(\sigma)} \exp\left(-\frac{d(\mu, z)^2}{2\sigma^2}\right),$$

with $d$ being the Riemannian distance on the manifold induced by the metric tensor. $\sigma$ is called the dispersion (which is not equal to the standard deviation), and $\mu$ the Fréchet expectation. Such a class of distributions is an isotropic special case of the maximum entropy characterisation derived in Pennec (2006). It is also the formulation used by Said et al. (2014) in the Poincaré half-plane.

Another generalisation are wrapped (also called push-forward or exp-map) normal distributions, which consist in mapping a normal distribution through the exponential map of the manifold:

$$z = \exp_\mu(v), \quad v \sim \mathcal{N}(0, \Sigma).$$

Such wrapped distributions usually have tractable formulas, and can easily generalise other well-known distributions. This is the reason why Grattarola et al. (2018) rely on such a prior. Yet, in the Gaussian case, they violate the maximum entropy characterisation. The usual (Euclidean) normal distribution is recovered in the limit of zero curvature for both generalisations.
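A sketch of wrapped-normal sampling on the Poincaré ball (helper names are ours): draw $v \sim \mathcal{N}(0, \sigma^2 I)$ in the tangent space at $\mu$ and push it through $\exp_\mu^c$.

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y) / (1 + 2 * c * xy + c**2 * x2 * y2)

def exp_map(x, v, c=1.0):
    nv = np.linalg.norm(v)
    if nv < 1e-15:
        return np.asarray(x, float).copy()
    lam = 2.0 / (1.0 - c * np.dot(x, x))
    t = np.tanh(np.sqrt(c) * lam * nv / 2.0)
    return mobius_add(x, t * v / (np.sqrt(c) * nv), c)

def sample_wrapped_normal(mu, sigma, c=1.0, rng=None):
    """Wrapped normal on the Poincare ball: push v ~ N(0, sigma^2 I), drawn in
    the tangent space at mu, through the exponential map."""
    rng = rng or np.random.default_rng()
    mu = np.asarray(mu, float)
    v = rng.normal(0.0, sigma, size=mu.shape)
    return exp_map(mu, v, c)
```

Since $\exp_0^c$ contracts through $\tanh$, every sample lands strictly inside the ball, no matter how large the tangent draw, which is exactly what makes wrapped distributions convenient as priors.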

B.2 Maximum entropy hyperbolic normal distributions

In the Poincaré ball $\mathcal{B}_c^d$, the maximum entropy generalisation of the normal distribution from Eq 19 yields

$$\mathcal{N}_{\mathcal{B}_c^d}(z \mid \mu, \sigma^2) = \frac{1}{Z(\sigma)} \exp\left(-\frac{d_c(\mu, z)^2}{2\sigma^2}\right).$$

Such a pdf can easily be computed pointwise once $Z(\sigma)$ is known, which we derive in Appendix B.6. A harder task is to get samples from such a distribution. We propose a rejection sampling algorithm based on hyperbolic polar coordinates, as derived in Appendix B.7.

B.3 Hyperbolic polar coordinates

Euclidean polar coordinates (or hyperspherical coordinates in $\mathbb{R}^d$) express a point $z$ through a radius $r = \|z\|$ and a direction $\alpha = z / \|z\|$ such that $z = r\,\alpha$. Yet, one could choose another pole (or reference point) $\mu$ such that $z = \mu + r\,\alpha$. Consequently, $r = \|z - \mu\|$. An analogous change of variables can also be constructed on Riemannian manifolds, relying on the exponential map as a generalisation of the addition operator. Given a pole $\mu$, the point $z$ of hyperbolic polar coordinates $(r, \alpha)$ is defined as the point at distance $r$ from $\mu$ lying on the geodesic with initial direction $\alpha$.

Below we derive the expression of the Poincaré ball metric in such hyperbolic polar coordinates, for the specific setting $\mu = 0$, $c = 1$ and $d = 2$. Switching to Euclidean polar coordinates $(\rho, \theta)$ yields

$$\mathrm{d}s^2 = \lambda_z^2 \left(\mathrm{d}\rho^2 + \rho^2\, \mathrm{d}\theta^2\right), \quad \lambda_z = \frac{2}{1 - \rho^2}.$$

Let us define $r = d(0, z)$, with the geodesic joining $0$ and $z$ being the segment $[0, z]$. We have

$$r = \int_0^\rho \frac{2}{1 - t^2}\, \mathrm{d}t = 2 \tanh^{-1}(\rho), \quad \text{i.e.} \quad \rho = \tanh(r/2).$$

Plugging $\rho = \tanh(r/2)$ (and $\mathrm{d}\rho = \tfrac{1}{2}(1 - \rho^2)\, \mathrm{d}r$) into the expression above yields

$$\mathrm{d}s^2 = \mathrm{d}r^2 + \sinh^2(r)\, \mathrm{d}\theta^2.$$

Below we recall the result for the general setting $d \geq 2$. In an orthonormal basis of the tangent space at the pole – given by hyperspherical coordinates $(r, \theta_1, \dots, \theta_{d-1})$ – hyperbolic polar coordinates lead to the following expression of the matrix of the metric (Chevallier et al., 2015):

$$\mathrm{d}s^2 = \mathrm{d}r^2 + \sinh^2(r)\, \mathrm{d}\Omega^2,$$

where $\mathrm{d}\Omega^2$ denotes the length element on the unit hypersphere $\mathbb{S}^{d-1}$.

The fact that the length element, and equivalently the metric, only depends on the radius $r$ in hyperbolic polar coordinates is a consequence of the hyperbolic space's isotropy.

B.4 Probability density function

By integrating the density in the tangent space, we get

$$\int_{\mathcal{M}} f(z)\, \mathrm{d}\mathcal{M}(z) = \int_0^\infty \int_{\mathbb{S}^{d-1}} f(r, \alpha)\, \left(\frac{\sinh(\sqrt{c}\, r)}{\sqrt{c}}\right)^{d-1} \mathrm{d}r\, \mathrm{d}\alpha;$$

plugging in the hyperbolic normal density $f(z) \propto \exp\left(-d(\mu, z)^2 / 2\sigma^2\right)$, with the hyperbolic distance $d(\mu, z) = r$, yields

$$\mathcal{N}_{\mathcal{B}_c^d}(z \mid \mu, \sigma^2)\, \mathrm{d}\mathcal{M}(z) \propto \exp\left(-\frac{r^2}{2\sigma^2}\right) \left(\frac{\sinh(\sqrt{c}\, r)}{\sqrt{c}}\right)^{d-1} \mathrm{d}r\, \mathrm{d}\alpha.$$

Hence the distribution factorises as $p(z) = p(\alpha)\, p(r)$, with the direction $\alpha$ being uniformly distributed on the hypersphere $\mathbb{S}^{d-1}$, and the hyperbolic radius $r$ being distributed according to the density (derivative w.r.t. the Lebesgue measure)

$$\rho(r) \propto \exp\left(-\frac{r^2}{2\sigma^2}\right) \left(\frac{\sinh(\sqrt{c}\, r)}{\sqrt{c}}\right)^{d-1}, \quad r > 0.$$

The (Euclidean) normal radius density is recovered as $c \to 0$ since $\lim_{c \to 0} \sinh(\sqrt{c}\, r)/\sqrt{c} = r$.

By expanding the $\left(\sinh(\sqrt{c}\, r)/\sqrt{c}\right)^{d-1}$ term using the binomial formula, we get