A Differentiable Gaussian-like Distribution on Hyperbolic Space for Gradient-Based Learning

Yoshihiro Nagano et al., 02/08/2019

Hyperbolic space is a geometry that is known to be well-suited for representation learning of data with an underlying hierarchical structure. In this paper, we present a novel hyperbolic distribution called pseudo-hyperbolic Gaussian, a Gaussian-like distribution on hyperbolic space whose density can be evaluated analytically and differentiated with respect to the parameters. Our distribution enables the gradient-based learning of the probabilistic models on hyperbolic space that could never have been considered before. Also, we can sample from this hyperbolic probability distribution without resorting to auxiliary means like rejection sampling. As applications of our distribution, we develop a hyperbolic-analog of variational autoencoder and a method of probabilistic word embedding on hyperbolic space. We demonstrate the efficacy of our distribution on various datasets including MNIST, Atari 2600 Breakout, and WordNet.

Code repository: hyperbolic_wrapped_distribution

1 Introduction

Recently, hyperbolic geometry has been drawing attention as a powerful tool for helping deep networks capture fundamental structural properties of data, such as hierarchy. Hyperbolic attention networks (Gülçehre et al., 2019) improved the generalization performance of neural networks on various tasks, including machine translation, by imposing hyperbolic geometry on several parts of the network. Poincaré embeddings (Nickel & Kiela, 2017) succeeded in learning a parsimonious representation of symbolic data by embedding the dataset into Poincaré balls.

(a) A tree representation of the training dataset
(b) Normal VAE
(c) Hyperbolic VAE
Figure 1: Visual results of the Hyperbolic VAE applied to an artificial dataset generated by applying random perturbations to a binary tree. The visualization is done on the Poincaré ball. The red points are the embeddings of the original tree, and the blue points are the embeddings of the noisy observations generated from the tree. The pink point marks the origin of hyperbolic space. The VAE was trained without prior knowledge of the tree structure. Please see Section 6.1 for experimental details.

In the task of data embedding, the choice of the target space determines the properties of the dataset that can be learned from the embedding. For a dataset with a hierarchical structure, in particular, the number of relevant features can grow exponentially with the depth of the hierarchy, and Euclidean space is often inadequate for capturing this structural information (Figure 1). If the target space of the embedding is limited to Euclidean space, one might have to prepare an extremely high-dimensional space to guarantee small distortion. However, the same embedding can be done remarkably well if the destination is hyperbolic space (Sarkar, 2012; Sala et al., 2018).

Now, the next natural question is: "how can we extend these works to probabilistic inference problems on hyperbolic space?" When we know in advance that there is a hierarchical structure in the dataset, a prior distribution on hyperbolic space might serve as a good informative prior. We might also want to make Bayesian inference on a dataset with hierarchical structure by training a variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) with latent variables defined on hyperbolic space, or to conduct probabilistic word embedding into hyperbolic space while taking into account the uncertainty that arises from the underlying hierarchical relationships among words. Finally, it would be best if we could compare different probabilistic models on hyperbolic space based on popular statistical measures, such as divergences, that require the explicit form of the probability density function.

The endeavors mentioned in the previous paragraph all require probability distributions on hyperbolic space that admit a parametrization of the density function that can be evaluated analytically and differentiated with respect to the parameters. We also want to be able to sample from the distribution efficiently; that is, we do not want to resort to auxiliary methods like rejection sampling.

In this study, we present a novel hyperbolic distribution called the pseudo-hyperbolic Gaussian, a Gaussian-like distribution on hyperbolic space that resolves all these problems. We construct this distribution by defining a Gaussian distribution on the tangent space at the origin of hyperbolic space and projecting the distribution onto hyperbolic space after transporting the tangent space to a desired location in the space. This operation can be formalized by a combination of the parallel transport and the exponential map for the Lorentz model of hyperbolic space.

We can use our pseudo-hyperbolic Gaussian distribution to construct a probabilistic model on hyperbolic space that can be trained with gradient-based learning. For example, our distribution can be used as the prior of a VAE (Figure 1, Figure 6). It is also possible to extend existing probabilistic embedding methods, such as probabilistic word embedding, to hyperbolic space using our distribution. We demonstrate the utility of our method through experiments with probabilistic hyperbolic models on benchmark datasets including MNIST, Atari 2600 Breakout, and WordNet.

2 Background

(a) (b) (c)
Figure 2: (a) The one-dimensional Lorentz model (red) and its tangent space (blue). (b) Parallel transport carries a tangent vector at the origin (green) to the tangent space at another point (blue) while preserving its Lorentzian norm. (c) The exponential map projects the transported tangent vector (blue) onto the surface (red). The distance between the base point and the image of the exponential map, measured on the surface, coincides with the Lorentzian norm of the tangent vector.

2.1 Hyperbolic Geometry

Hyperbolic geometry is a non-Euclidean geometry with a constant negative Gaussian curvature, and it can be visualized as the forward sheet of the two-sheeted hyperboloid. There are four common, equivalent models used for hyperbolic geometry: the Klein model, the Poincaré disk model, the Lorentz (hyperboloid/Minkowski) model, and the Poincaré half-plane model. Many applications of hyperbolic space to machine learning to date have adopted the Poincaré disk model as the subject of study (Nickel & Kiela, 2017; Ganea et al., 2018a, b; Sala et al., 2018). In this study, however, we use the Lorentz model, which, as claimed in Nickel & Kiela (2018), comes with a simpler closed form of the geodesics and does not suffer from numerical instabilities in approximating the distance. We also exploit the fact that both the exponential map and the parallel transport have clean closed forms in the Lorentz model.

The Lorentz model $\mathbb{H}^n$ (Figure 2) can be represented as the set of points $z \in \mathbb{R}^{n+1}$ with $z_0 > 0$ whose Lorentzian product (negative Minkowski bilinear form)

$\langle z, z' \rangle_{\mathcal{L}} = -z_0 z'_0 + \sum_{i=1}^{n} z_i z'_i$

with itself is $-1$. That is,

$\mathbb{H}^n = \{ z \in \mathbb{R}^{n+1} : \langle z, z \rangle_{\mathcal{L}} = -1,\; z_0 > 0 \}.$   (1)

The Lorentzian inner product also functions as the metric tensor on hyperbolic space. We will refer to the one-hot vector $\mu_0 = (1, 0, \dots, 0)^\top \in \mathbb{H}^n$ as the origin of the hyperbolic space. The distance between two points $z, z'$ on $\mathbb{H}^n$ is given by $d_\ell(z, z') = \operatorname{arccosh}(-\langle z, z' \rangle_{\mathcal{L}})$, which is also the length of the geodesic that connects $z$ and $z'$.
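To make these definitions concrete, the following is a minimal numpy sketch (ours, not the authors' code) of the Lorentz-model primitives above: the Lorentzian inner product, the origin $\mu_0$, and the geodesic distance.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian (negative Minkowski) inner product on R^{n+1}."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid: arccosh(-<x, y>_L)."""
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

def origin(n):
    """The origin mu_0 = (1, 0, ..., 0) of the n-dimensional Lorentz model."""
    mu0 = np.zeros(n + 1)
    mu0[0] = 1.0
    return mu0

# Example: build a point of H^2 by solving the constraint for the time coordinate.
z = np.array([np.sqrt(1.0 + 0.3 ** 2 + 0.4 ** 2), 0.3, 0.4])
assert np.isclose(lorentz_inner(z, z), -1.0)   # z lies on H^2
print(lorentz_distance(origin(2), z))          # distance from the origin to z
```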

2.2 Parallel Transport and Exponential Map

The rough explanation of our strategy for constructing the pseudo-hyperbolic Gaussian $G(\mu, \Sigma)$ with $\mu \in \mathbb{H}^n$ and a positive definite matrix $\Sigma$ is as follows. We (1) sample a vector from the Gaussian $\mathcal{N}(0, \Sigma)$ defined on the tangent space at the origin $\mu_0$, (2) transport the vector from $\mu_0$ to $\mu$ along the geodesic, and (3) project the vector onto the surface. To formalize this sequence of operations, we need to define the tangent space on hyperbolic space, as well as the way to transport the tangent space and the way to project a vector in the tangent space onto the surface. The transportation of the tangent vector requires the parallel transport, and the projection of the tangent vector onto the surface requires the exponential map.

Tangent space of hyperbolic space

Let us use $T_\mu \mathbb{H}^n$ to denote the tangent space of $\mathbb{H}^n$ at $\mu$ (Figure 2). Representing $T_\mu \mathbb{H}^n$ as a set of vectors in the same ambient space $\mathbb{R}^{n+1}$ into which $\mathbb{H}^n$ is embedded, $T_\mu \mathbb{H}^n$ can be characterized as the set of points satisfying the orthogonality relation with respect to the Lorentzian product:

$T_\mu \mathbb{H}^n = \{ u \in \mathbb{R}^{n+1} : \langle u, \mu \rangle_{\mathcal{L}} = 0 \}.$   (2)

The set $T_\mu \mathbb{H}^n$ can be literally thought of as the tangent space of the forward hyperboloid sheet at $\mu$. Note that $T_{\mu_0} \mathbb{H}^n$ consists of the vectors of the form $u = (0, u_1, \dots, u_n)$, that is, vectors whose first coordinate is zero.

Parallel transport and inverse parallel transport
Next, for an arbitrary pair of points $\nu, \mu \in \mathbb{H}^n$, the parallel transport from $\nu$ to $\mu$ is defined as a map $\mathrm{PT}_{\nu \to \mu}$ from $T_\nu \mathbb{H}^n$ to $T_\mu \mathbb{H}^n$ that carries a vector in $T_\nu \mathbb{H}^n$ along the geodesic from $\nu$ to $\mu$ in a parallel manner without changing its metric tensor. In other words, if $\mathrm{PT}_{\nu \to \mu}$ is the parallel transport on hyperbolic space, then $\langle \mathrm{PT}_{\nu \to \mu}(v), \mathrm{PT}_{\nu \to \mu}(v') \rangle_{\mathcal{L}} = \langle v, v' \rangle_{\mathcal{L}}$.

The explicit formula for the parallel transport on the Lorentz model (Figure 2) is given by:

$\mathrm{PT}_{\nu \to \mu}(v) = v + \frac{\langle \mu - \alpha \nu, v \rangle_{\mathcal{L}}}{\alpha + 1} (\nu + \mu),$   (3)

where $\alpha = -\langle \nu, \mu \rangle_{\mathcal{L}}$. The inverse parallel transport simply carries the vector in $T_\mu \mathbb{H}^n$ back to $T_\nu \mathbb{H}^n$ along the geodesic. That is,

$\mathrm{PT}^{-1}_{\nu \to \mu}(v) = \mathrm{PT}_{\mu \to \nu}(v).$   (4)
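The following is a small numpy sketch we wrote of eq. (3) and eq. (4): parallel transport along the geodesic from $\nu$ to $\mu$, and its inverse, which is simply the transport in the opposite direction.

```python
import numpy as np

def lorentz_inner(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def parallel_transport(v, nu, mu):
    """Transport a tangent vector v from T_nu H^n to T_mu H^n (eq. 3)."""
    alpha = -lorentz_inner(nu, mu)
    return v + lorentz_inner(mu - alpha * nu, v) / (alpha + 1.0) * (nu + mu)

def inverse_parallel_transport(v, nu, mu):
    """Carry a vector in T_mu H^n back to T_nu H^n (eq. 4)."""
    return parallel_transport(v, mu, nu)

# Example on H^2: transport a tangent vector at the origin to the point mu.
mu0 = np.array([1.0, 0.0, 0.0])
mu = np.array([np.sqrt(2.0), 0.0, 1.0])       # a point on H^2
v = np.array([0.0, 0.5, -0.2])                # a tangent vector at mu0
u = parallel_transport(v, mu0, mu)
assert np.isclose(lorentz_inner(u, mu), 0.0)                  # u is tangent at mu
assert np.isclose(lorentz_inner(u, u), lorentz_inner(v, v))   # the norm is preserved
```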

Exponential map and inverse exponential map
Finally, we describe the function that maps a vector in a tangent space to the surface.

According to the basic theory of differential geometry, every $u \in T_\mu \mathbb{H}^n$ determines a unique maximal geodesic $\gamma$ with $\gamma(0) = \mu$ and $\dot{\gamma}(0) = u$. The exponential map $\exp_\mu : T_\mu \mathbb{H}^n \to \mathbb{H}^n$ is defined by $\exp_\mu(u) = \gamma(1)$, and we can use this map to project a vector $u$ in $T_\mu \mathbb{H}^n$ onto $\mathbb{H}^n$ in such a way that the distance from $\mu$ to the destination of the map coincides with $\|u\|_{\mathcal{L}} = \sqrt{\langle u, u \rangle_{\mathcal{L}}}$, the metric norm of $u$. For hyperbolic space, this map (Figure 2) is given by

$\exp_\mu(u) = \cosh(\|u\|_{\mathcal{L}})\, \mu + \sinh(\|u\|_{\mathcal{L}})\, \frac{u}{\|u\|_{\mathcal{L}}}.$   (5)

As we can confirm with a straightforward computation, this exponential map is norm preserving in the sense that $d_\ell(\mu, \exp_\mu(u)) = \|u\|_{\mathcal{L}}$. Now, in order to evaluate the density of a point on hyperbolic space, we need to be able to map the point back to the tangent space on which the distribution is initially defined. We therefore also need to be able to compute the inverse of the exponential map, which is called the logarithm map.

Solving eq. 5 for $u$, we obtain the inverse exponential map as

$\exp^{-1}_\mu(z) = \frac{\operatorname{arccosh}(\alpha)}{\sqrt{\alpha^2 - 1}}\, (z - \alpha \mu),$   (6)

where $\alpha = -\langle \mu, z \rangle_{\mathcal{L}}$. See Appendix A.1 for further details.
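The following numpy sketch (ours, under the definitions above) implements the exponential map (eq. 5) and the logarithm map (eq. 6).

```python
import numpy as np

def lorentz_inner(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def exp_map(u, mu):
    """Project a tangent vector u in T_mu H^n onto H^n (eq. 5)."""
    r = np.sqrt(lorentz_inner(u, u))          # Lorentzian norm of u
    return np.cosh(r) * mu + np.sinh(r) * u / r

def log_map(z, mu):
    """Inverse exponential map: pull z on H^n back to T_mu H^n (eq. 6)."""
    alpha = -lorentz_inner(mu, z)
    return np.arccosh(alpha) / np.sqrt(alpha ** 2 - 1.0) * (z - alpha * mu)

# Round-trip check: log_map inverts exp_map, and the image stays on H^2.
mu = np.array([np.sqrt(2.0), 0.0, 1.0])
u = np.array([-0.2, 0.5, -0.2 * np.sqrt(2.0)])   # a tangent vector at mu
z = exp_map(u, mu)
assert np.isclose(lorentz_inner(z, z), -1.0)
assert np.allclose(log_map(z, mu), u)
```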

3 Pseudo-Hyperbolic Gaussian

3.1 Sampling

Finally, we are ready to formally explain our method of generating samples from the pseudo-hyperbolic Gaussian $G(\mu, \Sigma)$ with $\mu \in \mathbb{H}^n$ and a positive definite $\Sigma$.

In the language of differential geometry, our strategy can be re-described as follows:

  1. Sample a vector $\tilde{v}$ from the Gaussian distribution $\mathcal{N}(0, \Sigma)$ defined over $\mathbb{R}^n$.

  2. Interpret $\tilde{v}$ as an element $v$ of $T_{\mu_0} \mathbb{H}^n \subset \mathbb{R}^{n+1}$ by rewriting $\tilde{v}$ as $v = [0, \tilde{v}]$.

  3. Parallel transport the vector $v$ to $u \in T_\mu \mathbb{H}^n$ along the geodesic from $\mu_0$ to $\mu$.

  4. Map $u$ to $\mathbb{H}^n$ by $\exp_\mu$.

Algorithm 1 is an algorithmic description of the sampling procedure.

  Input: parameter $\mu \in \mathbb{H}^n$, $\Sigma$
  Output: $z \in \mathbb{H}^n$
  Require: $\mu_0 = (1, 0, \dots, 0)^\top \in \mathbb{H}^n$
  Sample $\tilde{v} \sim \mathcal{N}(0, \Sigma) \in \mathbb{R}^n$
  $v = [0, \tilde{v}] \in T_{\mu_0} \mathbb{H}^n$
  Move $v$ to $u = \mathrm{PT}_{\mu_0 \to \mu}(v) \in T_\mu \mathbb{H}^n$ by eq. 3
  Map $u$ to $z = \exp_\mu(u) \in \mathbb{H}^n$ by eq. 5
Algorithm 1 Sampling on hyperbolic space
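The following is a self-contained numpy sketch of Algorithm 1 that we wrote to illustrate the construction (the geometry helpers from the earlier sketches are restated for self-containment; a reference implementation is available in the hyperbolic_wrapped_distribution repository listed above).

```python
import numpy as np

def lorentz_inner(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def parallel_transport(v, nu, mu):
    alpha = -lorentz_inner(nu, mu)
    return v + lorentz_inner(mu - alpha * nu, v) / (alpha + 1.0) * (nu + mu)

def exp_map(u, mu):
    r = np.sqrt(lorentz_inner(u, u))
    return np.cosh(r) * mu + np.sinh(r) * u / r

def sample_pseudo_hyperbolic_gaussian(mu, Sigma, rng):
    """Draw one sample from G(mu, Sigma) on H^n, following Algorithm 1."""
    n = mu.shape[0] - 1
    mu0 = np.zeros(n + 1)
    mu0[0] = 1.0
    v_tilde = rng.multivariate_normal(np.zeros(n), Sigma)  # step 1: v~ ~ N(0, Sigma)
    v = np.concatenate([[0.0], v_tilde])                   # step 2: [0, v~] in T_mu0
    u = parallel_transport(v, mu0, mu)                     # step 3: PT_{mu0 -> mu}
    return exp_map(u, mu)                                  # step 4: exp_mu(u)

rng = np.random.default_rng(0)
mu = np.array([np.sqrt(2.0), 0.0, 1.0])                    # a point on H^2
z = sample_pseudo_hyperbolic_gaussian(mu, 0.1 * np.eye(2), rng)
assert np.isclose(lorentz_inner(z, z), -1.0)               # the sample lies on H^2
```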

The most prominent advantage of this construction is that we can compute the probability density of the resulting distribution analytically.

3.2 Probability Density Function

Note that both $\mathrm{PT}_{\mu_0 \to \mu}$ and $\exp_\mu$ are differentiable functions that can be evaluated analytically. Thus, by the construction of $G(\mu, \Sigma)$, we can compute the probability density of $G(\mu, \Sigma)$ at $z \in \mathbb{H}^n$ using the composition of the differentiable functions $\mathrm{PT}_{\mu_0 \to \mu}$ and $\exp_\mu$. Let $\mathrm{proj}_\mu = \exp_\mu \circ \mathrm{PT}_{\mu_0 \to \mu}$ (Figure 3).

In general, if $X$ is a random variable endowed with the probability density function $p(x)$, and $Y = f(X)$ for an invertible and continuous map $f$, the log likelihood of $Y$ at $y = f(x)$ can be expressed as $\log p(y) = \log p(x) - \log \det \left( \frac{\partial f(x)}{\partial x} \right)$. Thus, all we need in order to evaluate the probability density of $G(\mu, \Sigma)$ at $z = \mathrm{proj}_\mu(v)$ is a way to evaluate

$\log p(z) = \log p(v) - \log \det \left( \frac{\partial\, \mathrm{proj}_\mu(v)}{\partial v} \right).$   (7)

Algorithm 2 is an algorithmic description of the computation of the pdf.

  Input: sample $z \in \mathbb{H}^n$, parameter $\mu \in \mathbb{H}^n$, $\Sigma$
  Output: $\log p(z)$
  Require: $\mu_0 = (1, 0, \dots, 0)^\top \in \mathbb{H}^n$
  Map $z$ to $u = \exp^{-1}_\mu(z) \in T_\mu \mathbb{H}^n$ by eq. 6
  Move $u$ to $v = \mathrm{PT}_{\mu \to \mu_0}(u) \in T_{\mu_0} \mathbb{H}^n$ by eq. 4
  Calculate $\log p(z)$ by eq. 7
Algorithm 2 Calculate log-pdf

For the implementation of Algorithm 1 and Algorithm 2, we need to be able to evaluate not only $\mathrm{PT}_{\mu_0 \to \mu}$, $\exp_\mu$, and their inverses, but also the determinant appearing in (7). We provide an analytic solution to each of them below.

Log-determinant
We compute the log-determinant of the Jacobian of $\mathrm{proj}_\mu = \exp_\mu \circ \mathrm{PT}_{\mu_0 \to \mu}$. This is required in the evaluation of (7).

Appealing to the chain rule and the multiplicativity of the determinant, we can decompose the expression into two components:

$\det \left( \frac{\partial\, \mathrm{proj}_\mu(v)}{\partial v} \right) = \det \left( \frac{\partial \exp_\mu(u)}{\partial u} \right) \cdot \det \left( \frac{\partial\, \mathrm{PT}_{\mu_0 \to \mu}(v)}{\partial v} \right).$   (8)

For the first term, a direct computation (Appendix A.3) shows that the differential of $\exp_\mu$ at $u$ maps the radial direction $u / r$ to a unit-norm vector and scales each of the $n - 1$ tangent directions orthogonal to $u$ by $\sinh(r)/r$, where $r = \|u\|_{\mathcal{L}}$. We therefore obtain

$\det \left( \frac{\partial \exp_\mu(u)}{\partial u} \right) = \left( \frac{\sinh r}{r} \right)^{n-1}.$

See Appendix A.3 for further details.

Next, because the parallel transport preserves the Lorentzian inner product, the second term evaluates to

$\det \left( \frac{\partial\, \mathrm{PT}_{\mu_0 \to \mu}(v)}{\partial v} \right) = 1.$

See Appendix A.4 for further details.
Putting these computations together, we obtain the desired determinant in a simple and clean form:

$\det \left( \frac{\partial\, \mathrm{proj}_\mu(v)}{\partial v} \right) = \left( \frac{\sinh r}{r} \right)^{n-1}.$

Because both $\mathrm{PT}_{\mu_0 \to \mu}$ and $\exp_\mu$ can be computed in $O(n)$, the whole evaluation of the log-determinant can also be computed in $O(n)$.
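The following numpy sketch (ours, not the reference code) implements Algorithm 2 using the log-determinant derived above, $(n-1)\log(\sinh r / r)$ with $r = \|u\|_{\mathcal{L}}$ and $u = \exp^{-1}_\mu(z)$.

```python
import numpy as np

def lorentz_inner(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def parallel_transport(v, nu, mu):
    alpha = -lorentz_inner(nu, mu)
    return v + lorentz_inner(mu - alpha * nu, v) / (alpha + 1.0) * (nu + mu)

def log_map(z, mu):
    alpha = -lorentz_inner(mu, z)
    return np.arccosh(alpha) / np.sqrt(alpha ** 2 - 1.0) * (z - alpha * mu)

def log_prob(z, mu, Sigma):
    """Log-density of the pseudo-hyperbolic Gaussian G(mu, Sigma) at z (Algorithm 2)."""
    n = mu.shape[0] - 1
    mu0 = np.zeros(n + 1)
    mu0[0] = 1.0
    u = log_map(z, mu)                                # map z back to T_mu (eq. 6)
    r = np.sqrt(lorentz_inner(u, u))                  # Lorentzian norm of u
    v = parallel_transport(u, mu, mu0)[1:]            # back to T_mu0; drop the zero coordinate
    # Gaussian log-density of v on the tangent space at the origin ...
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    log_normal = -0.5 * (n * np.log(2 * np.pi) + logdet_Sigma
                         + v @ np.linalg.solve(Sigma, v))
    # ... minus the log-determinant of the Jacobian of proj_mu (eq. 7).
    return log_normal - (n - 1) * np.log(np.sinh(r) / r)

mu = np.array([np.sqrt(2.0), 0.0, 1.0])
z = np.array([np.sqrt(1.25), 0.3, 0.4])               # another point on H^2
print(log_prob(z, mu, 0.5 * np.eye(2)))
```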

Since the metric on the tangent space coincides with the Euclidean metric, we can produce various other types of hyperbolic distributions by applying our construction strategy to other distributions defined on Euclidean space, such as the Laplace and Cauchy distributions.

(a) (b)
Figure 3: Heatmaps of the log-likelihood of pseudo-hyperbolic Gaussians with various $\mu$ and $\Sigma$. We designate the origin of hyperbolic space by a marker. See Appendix B for further details.

4 Applications of $G(\mu, \Sigma)$

4.1 Hyperbolic Variational Autoencoder

As an application of the pseudo-hyperbolic Gaussian $G(\mu, \Sigma)$, we introduce the hyperbolic variational autoencoder (Hyperbolic VAE), a variant of the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) in which the latent variables are defined on hyperbolic space. Given a dataset $\mathcal{D} = \{x_i\}_{i=1}^N$, a variational autoencoder trains a decoder model $p_\theta(x \mid z)$ that can generate a dataset resembling $\mathcal{D}$. The decoder model is trained together with the encoder model $q_\phi(z \mid x)$ by maximizing the sum of the evidence lower bound (ELBO), defined for each $x_i$ as

$\mathcal{L}(\theta, \phi; x_i) = \mathbb{E}_{q_\phi(z \mid x_i)}[\log p_\theta(x_i \mid z)] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x_i) \,\|\, p(z)\right),$   (9)

where $q_\phi(z \mid x)$ is the variational posterior distribution. In the classic VAE, the prior $p(z)$ is the standard normal, and the posterior distribution is variationally approximated by a Gaussian. Hyperbolic VAE is a simple modification of the classic VAE in which both the prior and the variational posterior are pseudo-hyperbolic Gaussians, $p(z) = G(\mu_0, I)$ and $q_\phi(z \mid x) = G(\mu_\phi(x), \Sigma_\phi(x))$. The model of $\mu_\phi$ and $\Sigma_\phi$ is often referred to as the encoder. This parametric formulation of $q_\phi$ is called the reparametrization trick, and it enables the evaluation of the gradient of the objective function with respect to the network parameters. As a baseline to compare our method against, we used β-VAE (Higgins et al., 2017), a variant of VAE that applies a scalar weight β to the KL term in the objective function.

In the Hyperbolic VAE, we ensure that the output of the encoder lies in $\mathbb{H}^n$ by applying $\exp_{\mu_0}$ to the final layer of the encoder. That is, if $h \in \mathbb{R}^n$ is the output of the final layer, we simply use $\mu = \exp_{\mu_0}([0, h])$.

As stated in the previous sections, our distribution allows us to evaluate the ELBO exactly and to take the gradient of the objective function. In a way, our parametrization of the variational posterior is a hyperbolic analog of the reparametrization trick.
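The following is a minimal numpy sketch of the encoder-output projection described above: an unconstrained output $h$ of the last layer is treated as a tangent vector $[0, h]$ at the origin and mapped onto the hyperboloid with $\exp_{\mu_0}$. The function name is ours; in the actual model, $h$ comes from a neural network.

```python
import numpy as np

def exp_map_at_origin(h):
    """exp_{mu0}([0, h]) in closed form: (cosh(|h|), sinh(|h|) * h / |h|)."""
    r = np.linalg.norm(h)
    if r == 0.0:
        out = np.zeros(h.shape[0] + 1)
        out[0] = 1.0
        return out                      # exp_{mu0}(0) is the origin itself
    return np.concatenate([[np.cosh(r)], np.sinh(r) * h / r])

h = np.array([0.3, -1.2, 0.7])          # stand-in for the encoder's last-layer output
mu = exp_map_at_origin(h)               # a valid point of H^3
assert np.isclose(-mu[0] ** 2 + np.dot(mu[1:], mu[1:]), -1.0)
```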

4.2 Word Embedding

We can use our pseudo-hyperbolic Gaussian for probabilistic word embedding. The work of Vilnis & McCallum (2015) attempted to extract the linguistic and contextual properties of words in a dictionary by embedding every word and every context into a Gaussian distribution defined on Euclidean space. We may extend their work by changing the destination of the map to the family of $G(\mu, \Sigma)$. Let us write $w \to c$ to convey that there is a link between a word $w$ and a context $c$, and let us use $q_w$ to designate the distribution assigned to the word $w$. The objective function used in Vilnis & McCallum (2015) is the max-margin ranking loss

$\mathcal{L} = \sum \max\left(0,\; m - E(w, c_p) + E(w, c_n)\right),$

where $c_p$ is a context linked to $w$, $c_n$ is a negative sample, and $E(\cdot, \cdot)$ represents the measure of similarity between a word and a context evaluated with their assigned distributions. In the original work, $q_w$ and $q_c$ were chosen to be Gaussian distributions on Euclidean space. We can incorporate hyperbolic geometry into this idea by choosing $q_w = G(\mu_w, \Sigma_w)$.
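As a rough illustration of this objective, here is a schematic Python sketch we wrote (not the authors' code). The Monte-Carlo negative-KL similarity, the margin value, and the function names are our assumptions; any similarity $E$ computable from the two distributions could be plugged in.

```python
import numpy as np

def ranking_loss(E_pos, E_neg, margin=1.0):
    """max(0, m - E(w, c_pos) + E(w, c_neg)) for one (positive, negative) pair."""
    return max(0.0, margin - E_pos + E_neg)

def mc_negative_kl(log_q_w, log_q_c, samples_from_w):
    """Monte-Carlo estimate of -KL(q_w || q_c) using samples z ~ q_w."""
    return float(np.mean([log_q_c(z) - log_q_w(z) for z in samples_from_w]))

# Usage sketch: plug in the log-density and sampler of pseudo-hyperbolic
# Gaussians (e.g. the log_prob and sampling sketches given earlier) for the
# word and context distributions, then minimize the summed ranking loss by
# gradient descent over the distribution parameters.
```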

5 Related Work

As mentioned in the introduction, most studies to date that use hyperbolic space consider only deterministic mappings (Nickel & Kiela, 2017, 2018; Ganea et al., 2018a, b; Gülçehre et al., 2019).

Very recently, Ovinnikov (2019) proposed an application of the Gaussian distribution on hyperbolic space. However, the formulation of their distribution cannot be directly differentiated or evaluated because of the presence of the error function in their expression of the pdf. For this reason, they resort to the Wasserstein maximum mean discrepancy (Gretton et al., 2012) to train their encoder network. Our distribution has broader application than the distribution of Ovinnikov (2019) because it allows the user to compute its likelihood and its gradient without approximation. Another advantage of our distribution is its representation power: it can be defined for any $\mu \in \mathbb{H}^n$ and any positive definite matrix $\Sigma$, whereas the hyperbolic Gaussian studied in Ovinnikov (2019) can only express a Gaussian with a variance matrix of a restricted form.

For word embedding, several deterministic methods have been proposed to date, including the celebrated Word2Vec (Mikolov et al., 2013). The aforementioned Nickel & Kiela (2017) use deterministic hyperbolic embedding to exploit the hierarchical relationships among words. Probabilistic word embedding was first proposed by Vilnis & McCallum (2015). As stated in the method section, their method maps each word to a Gaussian distribution on Euclidean space, and their work suggests the importance of investigating the uncertainty of word embeddings. In the field of representation learning of word vectors, our work is the first to use a hyperbolic probability distribution for word embedding.

On the other hand, the idea of using a noninformative, non-Gaussian prior in a VAE is not new. For example, Davidson et al. (2018) propose the use of a von Mises-Fisher prior, and Rolfe (2017) and Jang et al. (2017) use discrete distributions as their priors. With normalizing flows (Rezende & Mohamed, 2015), one can construct even more complex priors as well (Kingma et al., 2016). The appropriate choice of the prior depends on the type of dataset. As we show in the experiment section, our distribution is well suited to datasets with underlying tree structure. Another choice of VAE prior that specializes in such datasets has been proposed by Vikram et al. (2018). For sampling, they use the time-marginalized coalescent, a model that samples a random tree structure by a stochastic process. In principle, their method can be used in combination with our approach by replacing their Gaussian random walk with a hyperbolic random walk.

6 Experiments

6.1 Synthetic Binary Tree

We trained a Hyperbolic VAE on an artificial dataset constructed from a binary tree. To construct the artificial dataset, we first obtained a binary representation for each node in the tree such that the Hamming distance between any pair of nodes equals the distance between them on the graph representation of the tree (Figure 1). Let us call the set of binary codes obtained this way $A$. We then generated a set of noisy binary codes $X$ by randomly flipping each coordinate value of the codes in $A$ with a fixed probability. The binary codes were then embedded into the continuous input space of the VAE. We used a multi-layer perceptron (MLP) of depth 3 with 100 hidden units at each layer for both the encoder and the decoder of the VAE.
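For concreteness, the following is a sketch we wrote of one way to generate such a dataset (the tree depth, flip probability, and number of noisy copies below are placeholder values, not the paper's settings): each node is encoded by the indicator vector of itself and its ancestors, which makes the Hamming distance between two codes equal to their graph distance on the tree, and noisy observations are obtained by independent coordinate flips.

```python
import numpy as np

def binary_tree_codes(depth):
    """Indicator-of-ancestors code for every node of a full binary tree."""
    nodes = [()]                                   # the root is the empty path
    for d in range(depth):
        nodes += [p + (b,) for p in nodes if len(p) == d for b in (0, 1)]
    index = {p: i for i, p in enumerate(nodes)}
    codes = np.zeros((len(nodes), len(nodes)), dtype=np.int8)
    for p, i in index.items():
        for k in range(len(p) + 1):                # mark the node and all its ancestors
            codes[i, index[p[:k]]] = 1
    return codes

def add_noise(codes, flip_prob, n_copies, rng):
    """Noisy observations: flip each coordinate independently with flip_prob."""
    tiled = np.repeat(codes, n_copies, axis=0)
    flips = rng.random(tiled.shape) < flip_prob
    return np.bitwise_xor(tiled, flips.astype(np.int8))

rng = np.random.default_rng(0)
A = binary_tree_codes(depth=3)                     # placeholder depth
X = add_noise(A, flip_prob=0.1, n_copies=5, rng=rng)
```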

Table 1 summarizes the quantitative comparison of the Normal VAE against our Hyperbolic VAE. For each pair of points in the tree, we computed their Hamming distance as well as their distance in the latent space of the VAE; that is, we used the hyperbolic distance for the Hyperbolic VAE and the Euclidean distance for the Normal VAE. We used the strength of the correlation between the Hamming distances and the distances in the latent space as the measure of performance. The Hyperbolic VAE performed better both on the original tree and on the artificial dataset generated from the tree. The Normal VAE performed best at an intermediate value of β and collapsed at the largest values of β. The difference between the Normal VAE and the Hyperbolic VAE can be observed with much more clarity in the two-dimensional visualization of the generated dataset on the Poincaré ball (see Figure 1 and Appendix C.1). The red points are the embeddings of $A$, and the blue points are the embeddings of all other points in $X$. The pink mark designates the origin of hyperbolic space. For the visualization, we used the canonical diffeomorphism between the Lorentz model and the Poincaré ball model.

Model         Correlation    Correlation w/ noise
Normal        0.48           0.45
Normal        0.66           0.51
Normal        0.71           0.55
Normal        0.47           0.01
Normal        0.22           0.01
Hyperbolic    0.77           0.60
Table 1: Results of the tree embedding experiments for the Hyperbolic VAE and Normal VAEs trained with different weight constants β for the KL term (one row per setting of β).

6.2 MNIST

n     Normal VAE ELBO    Normal VAE LL    Hyperbolic VAE ELBO    Hyperbolic VAE LL
2     -148.58            -143.38          -145.34                -139.94
5     -110.99            -106.76          -110.36                -105.32
10    -90.40             -85.66           -92.00                 -86.19
20    -82.98             -76.90           -85.24                 -77.47
Table 2: Quantitative comparison of the Hyperbolic VAE against the Normal VAE on the MNIST dataset in terms of ELBO and log-likelihood (LL) for several values of latent space dimension n. LL was computed using 500 samples of latent variables.

We applied the Hyperbolic VAE to a binarized version of MNIST. We used an MLP of depth 3 with 500 hidden units at each layer for both the encoder and the decoder of the VAE. Table 2 shows the quantitative results of the experiments. The log-likelihood was approximated by an empirical integration of the Bayesian predictor with respect to the latent variables (Burda et al., 2016). Our method outperformed the Normal VAE for small latent dimensions. Figure 4(a) shows samples from the Hyperbolic VAE trained with 5-dimensional latent variables, and Figure 4(b) shows Poincaré-ball representations of interpolations produced by the Hyperbolic VAE trained with 2-dimensional latent variables.

(a)
(b)
Figure 4: (a) Samples generated from the Hyperbolic VAE trained on MNIST with latent dimension 5. (b) Interpolation of the MNIST dataset produced by the Hyperbolic VAE with latent dimension 2, represented on the Poincaré ball.

6.3 Atari 2600 Breakout

In reinforcement learning, the number of possible state-action trajectories grows exponentially with the time horizon, and these trajectories often have a tree-like hierarchical structure that starts from the initial states. We applied our Hyperbolic VAE to a set of trajectories that were explored by a trained policy during multiple episodes of Breakout in Atari 2600. To collect the trajectories, we used a pretrained Deep Q-Network (Mnih et al., 2015) with epsilon-greedy exploration. We amassed a set of trajectories whose total length is 100,000 frames, of which we used 80,000 as the training set, 10,000 as the validation set, and 10,000 as the test set. Each frame in the dataset was gray-scaled and resized to 80 × 80. The images in Figure 5 are samples from the dataset. We used a DCGAN-based architecture (Radford et al., 2016) with latent space dimension 20. Please see Appendix D for more details.

Figure 5: Examples of the observed screens in Atari 2600 Breakout.
Figure 6: Samples from the Normal and Hyperbolic VAEs trained on Atari 2600 Breakout screens. Each row was generated by sweeping the norm of the latent variable from 1.0 to 10.0 on a log scale.

Figure 6 visualizes our results. The top three rows are samples from the Normal VAE, and the bottom three rows are samples from the Hyperbolic VAE. Each row consists of samples generated from latent variables that share a common direction and whose norm is a positive scalar swept over the range above; samples in each row are listed in increasing order of the norm. For the Normal VAE, we used the standard normal prior, and for the Hyperbolic VAE, we used $G(\mu_0, I)$ as the prior. We can see that the number of blocks decreases gradually and consistently in each row for the Hyperbolic VAE. Please see Appendix C.2 for more details and more visualizations.

In Breakout, the number of blocks is always finite, and the blocks are located only in a specific region of the screen; let us refer to this region as R. In order to evaluate each model output based on the number of blocks, we binarized each pixel in each output at a prescribed luminance threshold and measured the proportion of pixels within the region R whose binarized value is 1. For each generated image, we used this proportion as the measure of the number of blocks contained in the image.
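As a concrete illustration, here is a small numpy sketch of this block-count proxy; the region bounds and the luminance threshold below are placeholders, not the values used in the paper.

```python
import numpy as np

def block_proportion(frame, region, threshold):
    """Fraction of above-threshold pixels inside the block region R."""
    top, bottom, left, right = region
    patch = frame[top:bottom, left:right]
    return float((patch > threshold).mean())

# Stand-in for a decoded 80x80 gray-scale frame with values in [0, 1].
frame = np.random.default_rng(0).random((80, 80))
print(block_proportion(frame, region=(10, 30, 5, 75), threshold=0.5))
```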

Figure 7 shows the estimated proportions of remaining blocks for the Normal and Hyperbolic VAEs as a function of the norm of the latent variable. For the Normal VAE, even samples generated from latent variables with large norms contained a considerable number of blocks. On the other hand, the number of blocks contained in a sample generated by the Hyperbolic VAE decreased much more consistently with the norm of the latent variable. This fact suggests that the cumulative reward up to a given state can be approximated well by the norm of the Hyperbolic VAE's latent representation. To validate this, we computed the latent representation for each state in the test set and measured its correlation with the cumulative reward. The correlation was 0.8540 for the Hyperbolic VAE and 0.712 for the Normal VAE. We emphasize that no information regarding the reward was used during the training of either VAE.

Figure 7: Estimated proportions of remaining blocks for Normal and Hyperbolic VAEs trained on Atari 2600 Breakout screens as they vary with the norm of latent variables sampled from a prior.

6.4 Word Embeddings

Lastly, we applied the pseudo-hyperbolic Gaussian to the word embedding problem. We trained probabilistic word embedding models on the WordNet nouns dataset (Miller, 1998) and evaluated their reconstruction performance (Table 3). We followed the procedure of Poincaré embeddings (Nickel & Kiela, 2017) and initialized all embeddings in the neighborhood of the origin; in particular, we initialized each weight in the first linear part of the embedding with small values close to zero. We treated the first 50 epochs as a burn-in phase and reduced the learning rate by a constant factor after the burn-in phase.

In Table 3, ‘Euclid’ refers to word embedding with a Gaussian distribution on Euclidean space (Vilnis & McCallum, 2015), and ‘Hyperbolic’ refers to our proposed method based on the pseudo-hyperbolic Gaussian. Our hyperbolic model performed better than its Euclidean counterpart when the latent space is low-dimensional. We used a diagonal variance for both models above. Appendix C.3 shows the results with unit variance; the performance difference at small latent dimensions was much more remarkable when we used unit variance.

       Euclid            Hyperbolic        Nickel & Kiela (2017)
n      MAP     Rank      MAP     Rank      MAP     Rank
5      0.359   22.4      0.544   18.8      0.823   4.9
10     0.773   4.7       0.817   4.6       0.851   4.02
20     0.897   2.2       0.905   2.4       0.855   3.84
50     0.953   1.4       0.969   1.3       0.86    3.98
100    0.955   1.3       0.977   1.2       0.857   3.9
Table 3: Reconstruction performance on the transitive closure of the WordNet noun hierarchy for several latent space dimensions n.

7 Conclusion

In this paper, we proposed a novel parametrization for the density of a Gaussian-like distribution on hyperbolic space that can be both differentiated and evaluated analytically. Our experimental results on hyperbolic word embedding and the hyperbolic VAE suggest that there is much more room left for the application of hyperbolic space. Our parametrization enables gradient-based training of probabilistic models defined on hyperbolic space and opens the door to the investigation of complex models on hyperbolic space that could not have been explored before.

Acknowledgements

We would like to thank Tomohiro Hayase, Kenta Oono, and Masaki Watanabe for helpful discussions. We also thank Takeru Miyato and Sosuke Kobayashi for insightful reviews on the paper. This paper is based on results obtained from Nagano’s internship at Preferred Networks, Inc.

References

  • Burda et al. (2016) Burda, Y., Grosse, R. B., and Salakhutdinov, R. Importance weighted autoencoders. In Proceedings of the 4th International Conference on Learning Representations, 2016.
  • Davidson et al. (2018) Davidson, T. R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J. M. Hyperspherical variational auto-encoders. In 34th Conference on Uncertainty in Artificial Intelligence, 2018.
  • Ganea et al. (2018a) Ganea, O., Bécigneul, G., and Hofmann, T. Hyperbolic entailment cones for learning hierarchical embeddings. In Proceedings of the 35th International Conference on Machine Learning, pp. 1632–1641, 2018a.
  • Ganea et al. (2018b) Ganea, O., Bécigneul, G., and Hofmann, T. Hyperbolic neural networks. In Advances in Neural Information Processing Systems 31, pp. 5350–5360, 2018b.
  • Gretton et al. (2012) Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. A kernel two-sample test. The Journal of Machine Learning Research, 13:723–773, March 2012.
  • Gülçehre et al. (2019) Gülçehre, Ç., Denil, M., Malinowski, M., Razavi, A., Pascanu, R., Hermann, K. M., Battaglia, P., Bapst, V., Raposo, D., Santoro, A., and de Freitas, N. Hyperbolic attention networks. In Proceedings of the 7th International Conference on Learning Representations, 2019.
  • Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. β-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the 5th International Conference on Learning Representations, 2017.
  • Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, volume 37, pp. 448–456, 2015.
  • Jang et al. (2017) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. In Proceedings of the 5th International Conference on Learning Representations, 2017.
  • Kingma & Welling (2014) Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations, 2014.
  • Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29, pp. 4743–4751. 2016.
  • Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119. 2013.
  • Miller (1998) Miller, G. WordNet: An electronic lexical database. MIT press, 1998.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Nickel & Kiela (2017) Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30, pp. 6338–6347. 2017.
  • Nickel & Kiela (2018) Nickel, M. and Kiela, D. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In Proceedings of the 35th International Conference on Machine Learning, pp. 3776–3785, 2018.
  • Ovinnikov (2019) Ovinnikov, I. Poincaré wasserstein autoencoder. CoRR, abs/1901.01427, 2019.
  • Radford et al. (2016) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of the 4th International Conference on Learning Representations, 2016.
  • Rezende & Mohamed (2015) Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1530–1538, 2015.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pp. 1278–1286, 2014.
  • Rolfe (2017) Rolfe, J. T. Discrete variational autoencoders. In Proceedings of the 5th International Conference on Learning Representations, 2017.
  • Sala et al. (2018) Sala, F., De Sa, C., Gu, A., and Re, C. Representation tradeoffs for hyperbolic embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pp. 4460–4469, 2018.
  • Sarkar (2012) Sarkar, R. Low distortion delaunay embedding of trees in hyperbolic plane. In Graph Drawing, pp. 355–366, 2012.
  • Vikram et al. (2018) Vikram, S., Hoffman, M. D., and Johnson, M. J. The loracs prior for vaes: Letting the trees speak for the data. CoRR, abs/1810.06891, 2018.
  • Vilnis & McCallum (2015) Vilnis, L. and McCallum, A. Word representations via gaussian embedding. In Proceedings of the 3rd International Conference on Learning Representations, 2015.

A Derivations

A.1 Inverse Exponential Map

As we mentioned in the main text, the exponential map from $T_\mu \mathbb{H}^n$ to $\mathbb{H}^n$ is given by

$z = \exp_\mu(u) = \cosh(\|u\|_{\mathcal{L}})\, \mu + \sinh(\|u\|_{\mathcal{L}})\, \frac{u}{\|u\|_{\mathcal{L}}}.$

Solving this equation for $u$, we obtain

$u = \frac{\|u\|_{\mathcal{L}}}{\sinh(\|u\|_{\mathcal{L}})} \left( z - \cosh(\|u\|_{\mathcal{L}})\, \mu \right).$

We still need an evaluatable expression for $\|u\|_{\mathcal{L}}$. Using the characterization of the tangent space (main text, (2)), we see that

$\langle \mu, z \rangle_{\mathcal{L}} = \cosh(\|u\|_{\mathcal{L}}) \langle \mu, \mu \rangle_{\mathcal{L}} = -\cosh(\|u\|_{\mathcal{L}}).$

Now, defining $\alpha = -\langle \mu, z \rangle_{\mathcal{L}}$, so that $\|u\|_{\mathcal{L}} = \operatorname{arccosh}(\alpha)$ and $\sinh(\|u\|_{\mathcal{L}}) = \sqrt{\alpha^2 - 1}$, we can obtain the inverse exponential map as

$\exp^{-1}_\mu(z) = \frac{\operatorname{arccosh}(\alpha)}{\sqrt{\alpha^2 - 1}} \left( z - \alpha \mu \right).$

A.2 Inverse Parallel Transport

The parallel transport on the Lorentz model along the geodesic from $\nu$ to $\mu$ is given by

$u = \mathrm{PT}_{\nu \to \mu}(v) = v + \frac{\langle \mu - \alpha \nu, v \rangle_{\mathcal{L}}}{\alpha + 1} (\nu + \mu),$   (10)

where $\alpha = -\langle \nu, \mu \rangle_{\mathcal{L}}$. Next, just as for the exponential map, we need to be able to compute the inverse of the parallel transport. Solving (10) for $v$, we get

$v = u - \frac{\langle \mu - \alpha \nu, v \rangle_{\mathcal{L}}}{\alpha + 1} (\nu + \mu).$

Now, observing that

$\langle \nu - \alpha \mu, u \rangle_{\mathcal{L}} = -\langle \mu - \alpha \nu, v \rangle_{\mathcal{L}},$

we can write the inverse parallel transport as

$\mathrm{PT}^{-1}_{\nu \to \mu}(u) = u + \frac{\langle \nu - \alpha \mu, u \rangle_{\mathcal{L}}}{\alpha + 1} (\mu + \nu) = \mathrm{PT}_{\mu \to \nu}(u).$

The inverse of the parallel transport from $\nu$ to $\mu$ therefore coincides with the parallel transport from $\mu$ to $\nu$.

A.3 Determinant of the Exponential Map

As for the first term of (8) in the main text, write $r = \|u\|_{\mathcal{L}}$ and $\bar{u} = u / r$, so that

$\exp_\mu(u) = \cosh(r)\, \mu + \sinh(r)\, \bar{u}.$

Differentiating this expression, the directional derivative of $\exp_\mu$ at $u$ along $\bar{u}$ is $\sinh(r)\, \mu + \cosh(r)\, \bar{u}$, while the directional derivative along any unit tangent direction $w$ orthogonal to $\bar{u}$ is $\frac{\sinh(r)}{r}\, w$. Using the identity $\cosh^2 r - \sinh^2 r = 1$, the former has Lorentzian norm 1, and we obtain

$\det\left( \frac{\partial \exp_\mu(u)}{\partial u} \right) = 1 \cdot \left( \frac{\sinh r}{r} \right)^{n-1} = \left( \frac{\sinh r}{r} \right)^{n-1}.$

A.4 Determinant of the Parallel Transport

Next, the second term of (8) in the main text can be computed as follows. The parallel transport $\mathrm{PT}_{\mu_0 \to \mu}$ is a linear map that preserves the Lorentzian inner product, i.e., $\langle \mathrm{PT}_{\mu_0 \to \mu}(v), \mathrm{PT}_{\mu_0 \to \mu}(v') \rangle_{\mathcal{L}} = \langle v, v' \rangle_{\mathcal{L}}$. It therefore maps an orthonormal basis of $T_{\mu_0} \mathbb{H}^n$ to an orthonormal basis of $T_\mu \mathbb{H}^n$, and we get

$\det\left( \frac{\partial\, \mathrm{PT}_{\mu_0 \to \mu}(v)}{\partial v} \right) = 1.$

B Visual Examples of Pseudo-Hyperbolic Gaussian

Figure 8 shows examples of the pseudo-hyperbolic Gaussian with various $\mu$ and $\Sigma$. We plot the log-density of these distributions as heatmaps and designate the origin $\mu_0$ by a marker. The right side of each figure shows the log-density on the Poincaré ball model, and the left side shows the same density on the corresponding tangent space.

(a)–(j)
Figure 8: Visual examples of the pseudo-hyperbolic Gaussian on hyperbolic space. The log-density is illustrated on the Poincaré ball by translating each point from the Lorentz model for clarity. We designate the origin of hyperbolic space by a marker.

C Additional Numerical Evaluations

C.1 Synthetic Binary Tree

We qualitatively compared the learned latent spaces of the Normal and Hyperbolic VAEs. Figure 9 shows the embedding vectors of the synthetic binary tree dataset in the two-dimensional latent space. We evaluated the latent spaces of Normal VAEs trained with several values of β, and of the Hyperbolic VAE. Note that the hierarchical relations in the original tree were not used during the training phase. Red points are the embeddings of the noiseless observations. As we mentioned in the main text, we evaluated the correlation coefficient between the Hamming distance in the data space and the hyperbolic (Euclidean for the Normal VAEs) distance in the latent space. Consistently with this metric, the latent space of the Hyperbolic VAE captured the hierarchical structure inherent in the dataset well. Among the Normal VAEs, the latent space captured the hierarchical structure better as β increased, but the posterior distribution of the Normal VAE with the largest β collapsed and lost the structure. The blue points are the embeddings of the noisy observations, and pink represents the origin of the latent space. In the latent spaces of the Normal VAEs, the embeddings of the noisy observations were biased toward the center.

(a) A tree representation of the training dataset
(b)–(e) Normal VAE with different values of β
(f) Hyperbolic VAE
Figure 9: Visual results of the Normal and Hyperbolic VAEs applied to an artificial dataset generated by applying random perturbations to a binary tree. The visualization is done on the Poincaré ball. Red points are the embeddings of the original tree, and the blue points are the embeddings of all other points in the dataset. Pink represents the origin of hyperbolic space. Note that the hierarchical relations in the original tree were not used during the training phase.

C.2 Atari 2600 Breakout

To evaluate the performance of the Hyperbolic VAE on a dataset that is organized hierarchically according to time development, we applied our Hyperbolic VAE to a set of trajectories that were explored by an agent with a trained policy during multiple episodes of Breakout in Atari 2600. We used a pretrained Deep Q-Network to collect the trajectories, and Figure 10 shows examples of the observed screens.

Figure 10: Examples of observed screens in Atari 2600 Breakout.

We showed three trajectories of samples from the prior distribution with scaled norms for both models in the main text. We also visualize more samples in Figures 11 and 12. For both models, we generated samples with the latent norm fixed to 0, 1, 2, 3, 5, and 10.

The Normal VAE tended to generate oversaturated images when the norm was small. Although the model generated several images containing a small number of blocks as the norm increased, it also generated images with a largely unchanged number of blocks even at the largest norm. On the other hand, the number of blocks contained in the images generated by the Hyperbolic VAE gradually decreased with the norm.

(a)–(f)
Figure 11: Images generated by the Normal VAE with constant latent norm; panels (a)–(f) correspond to the norm values listed above.
(a)–(f)
Figure 12: Images generated by the Hyperbolic VAE with constant latent norm; panels (a)–(f) correspond to the norm values listed above.

C.3 Word Embeddings

We showed the experimental results of the probabilistic word embedding models with diagonal variance in the main text. In this section, we show the results with unit variance (Table 4). When the dimension of the latent variable is small, the performance of the model on hyperbolic space did not deteriorate much when changing the variance from diagonal to unit. However, the same change dramatically worsened the performance of the model on Euclidean space.

       Euclid            Hyperbolic
n      MAP     Rank      MAP     Rank
5      0.163   272.8     0.535   29.1
10     0.512   49.9      0.778   5.3
20     0.792   11.3      0.854   2.7
50     0.842   21.4      0.905   1.8
100    0.854   19.3      0.874   2.5
Table 4: Experimental results of the word embedding models with unit variance on the WordNet noun dataset.

D Network Architecture

Table 5 shows the network architecture that we used in the Breakout experiments. We evaluated the Normal and Hyperbolic VAEs with a DCGAN-based architecture (Radford et al., 2016), with the kernel size of the convolution and deconvolution layers set to 3. We used leaky ReLU nonlinearities for the encoder and ReLU nonlinearities for the decoder. We set the latent space dimension to 20. We gradually increased the weight of the KL term from 0.1 to 4.0 linearly during the first 30 epochs. To ensure that the initial embedding vectors are close to the origin, we initialized the scale parameter of the batch normalization layers (Ioffe & Szegedy, 2015) of the encoder to 0.1. We modeled the probability distribution of the data space as a Gaussian, so the decoder outputs a vector twice as large as the original image.

Encoder: Input → Convolution → BatchNormalization → Convolution → BatchNormalization → Convolution → BatchNormalization → Convolution → BatchNormalization → Convolution → BatchNormalization → Convolution → Linear
Decoder: Linear → BatchNormalization → Deconvolution → BatchNormalization → Convolution → BatchNormalization → Deconvolution → BatchNormalization → Convolution → Deconvolution → Convolution
Table 5: Network architecture for Atari 2600 Breakout dataset.