Hyperbolic space is a geometry that is known to be well-suited for representation learning of data with an underlying hierarchical structure. In this paper, we present a novel hyperbolic distribution called pseudo-hyperbolic Gaussian, a Gaussian-like distribution on hyperbolic space whose density can be evaluated analytically and differentiated with respect to the parameters. Our distribution enables the gradient-based learning of the probabilistic models on hyperbolic space that could never have been considered before. Also, we can sample from this hyperbolic probability distribution without resorting to auxiliary means like rejection sampling. As applications of our distribution, we develop a hyperbolic-analog of variational autoencoder and a method of probabilistic word embedding on hyperbolic space. We demonstrate the efficacy of our distribution on various datasets including MNIST, Atari 2600 Breakout, and WordNet.
Recently, hyperbolic geometry has been drawing attention as a powerful geometry to assist deep networks in capturing fundamental structural properties of data such as a hierarchy. Hyperbolic attention network (Gülçehre et al., 2019) improved the generalization performance of neural networks on various tasks including machine translation by imposing the hyperbolic geometry on several parts of neural networks. Poincaré embeddings (Nickel & Kiela, 2017) succeeded in learning a parsimonious representation of symbolic data by embedding the dataset into Poincaré balls.
In the task of data embedding, the choice of the target space determines the properties of the dataset that can be learned from the embedding. For the dataset with a hierarchical structure, in particular, the number of relevant features can grow exponentially with the depth of the hierarchy. Euclidean space is often inadequate for capturing the structural information (Figure 1). If the choice of the target space of the embedding is limited to Euclidean space, one might have to prepare extremely high dimensional space as the target space to guarantee small distortion. However, the same embedding can be done remarkably well if the destination is the hyperbolic space (Sarkar, 2012; Sala et al., 2018).
Now, the next natural question is: “How can we extend these works to probabilistic inference problems on hyperbolic space?” When we know in advance that there is a hierarchical structure in the dataset, a prior distribution on hyperbolic space might serve as a good informative prior. We might also want to perform Bayesian inference on a dataset with hierarchical structure by training a variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) with latent variables defined on hyperbolic space. We might also want to conduct probabilistic word embedding into hyperbolic space while taking into account the uncertainty that arises from the underlying hierarchical relationship among words. Finally, it would be best if we could compare different probabilistic models on hyperbolic space based on popular statistical measures like divergences that require the explicit form of the probability density function.
The endeavors we mentioned in the previous paragraph all require probability distributions on hyperbolic space that admit a parametrization of the density function that can be computed analytically and differentiated with respect to the parameter. Also, we want to be able to sample from the distribution efficiently; that is, we do not want to resort to auxiliary methods like rejection sampling.
In this study, we present a novel hyperbolic distribution called pseudo-hyperbolic Gaussian, a Gaussian-like distribution on hyperbolic space that resolves all these problems. We construct this distribution by defining a Gaussian distribution on the tangent space at the origin of the hyperbolic space and projecting the distribution onto hyperbolic space after transporting the tangent space to a desired location in the space. This operation can be formalized by a combination of the parallel transport and the exponential map for the Lorentz model of hyperbolic space.
We can use our pseudo-hyperbolic Gaussian distribution to construct a probabilistic model on hyperbolic space that can be trained with gradient-based learning. For example, our distribution can be used as a prior of a VAE (Figure 1, Figure 6). It is also possible to extend the existing probabilistic embedding method to hyperbolic space using our distribution, such as probabilistic word embedding. We will demonstrate the utility of our method through the experiments of probabilistic hyperbolic models on benchmark datasets including MNIST, Atari 2600 Breakout, and WordNet.
Hyperbolic geometry is a non-Euclidean geometry with a constant negative Gaussian curvature, and it can be visualized as the forward sheet of the two-sheeted hyperboloid. There are four common equivalent models used for the hyperbolic geometry: the Klein model, the Poincaré disk model, the Lorentz (hyperboloid/Minkowski) model, and the Poincaré half-plane model. Many applications of hyperbolic space to machine learning to date have adopted the Poincaré disk model as the subject of study (Nickel & Kiela, 2017; Ganea et al., 2018a, b; Sala et al., 2018). In this study, however, we will use the Lorentz model that, as claimed in Nickel & Kiela (2018), comes with a simpler closed form of the geodesics and does not suffer from the numerical instabilities in approximating the distance. We will also exploit the fact that both the exponential map and the parallel transport have a clean closed form in the Lorentz model.
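The Lorentz model $\mathbb{H}^n$ (Figure 2) can be represented as the set of points $z \in \mathbb{R}^{n+1}$ with $z_0 > 0$ whose Lorentzian product (negative Minkowski bilinear form)
$$\langle z, z' \rangle_{\mathcal{L}} = -z_0 z'_0 + \sum_{i=1}^{n} z_i z'_i$$
with itself is $-1$. That is,
$$\mathbb{H}^n = \left\{ z \in \mathbb{R}^{n+1} : \langle z, z \rangle_{\mathcal{L}} = -1,\ z_0 > 0 \right\}.$$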
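The definition above can be checked numerically. The following is a minimal sketch in plain Python, assuming points are stored as lists whose first entry is the coordinate $z_0$; the helper names `lorentz_inner`, `lift_to_hyperboloid`, and `on_hyperboloid` are illustrative and do not come from the paper.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian product: -x0*y0 + sum_{i>=1} xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lift_to_hyperboloid(x):
    """Hypothetical helper: place a Euclidean vector x on the forward sheet
    by choosing z0 = sqrt(1 + |x|^2), so that <z, z>_L = -1."""
    return [math.sqrt(1.0 + sum(a * a for a in x))] + list(x)


def on_hyperboloid(z, tol=1e-9):
    """Membership test for the Lorentz model: <z, z>_L = -1 and z0 > 0."""
    return z[0] > 0 and abs(lorentz_inner(z, z) + 1.0) < tol


origin = [1.0, 0.0, 0.0]             # the point [1, 0, ..., 0]
z = lift_to_hyperboloid([0.3, -1.2])
print(on_hyperboloid(origin), on_hyperboloid(z))
```

The lifting helper is only one convenient way to produce valid points for experiments; any point satisfying the two conditions belongs to the model.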
A rough explanation of our strategy for the construction of the pseudo-hyperbolic Gaussian $\mathcal{G}(\mu, \Sigma)$ with $\mu \in \mathbb{H}^n$ and a positive definite matrix $\Sigma$ is as follows. We (1) sample a vector from $\mathcal{N}(0, \Sigma)$, (2) transport the vector from the origin $\mu_0 = [1, 0, \dots, 0]$ to $\mu$ along the geodesic, and (3) project the vector onto the surface. To formalize this sequence of operations, we need to define the tangent space on hyperbolic space as well as the way to transport the tangent space and the way to project a vector in the tangent space to the surface. The transportation of the tangent vector requires parallel transport, and the projection of the tangent vector to the surface requires the definition of the exponential map.
Tangent space of hyperbolic space
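Let us use $T_\mu \mathbb{H}^n$ to denote the tangent space of $\mathbb{H}^n$ at $\mu$ (Figure 2). Representing $T_\mu \mathbb{H}^n$ as a set of vectors in the same ambient space $\mathbb{R}^{n+1}$ into which $\mathbb{H}^n$ is embedded, $T_\mu \mathbb{H}^n$ can be characterized as the set of points satisfying the orthogonality relation with respect to the Lorentzian product:
$$T_\mu \mathbb{H}^n := \left\{ u : \langle u, \mu \rangle_{\mathcal{L}} = 0 \right\}. \tag{2}$$
This set can be literally thought of as the tangent space of the forward hyperboloid sheet at $\mu$. Note that the tangent space at the origin, $T_{\mu_0} \mathbb{H}^n$, consists of $v \in \mathbb{R}^{n+1}$ with $v_0 = 0$, and $\|v\|_{\mathcal{L}} := \sqrt{\langle v, v \rangle_{\mathcal{L}}} = \|v\|_2$.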
Parallel transport and inverse parallel transport
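Next, for an arbitrary pair of points $\mu, \nu \in \mathbb{H}^n$, the parallel transport from $\nu$ to $\mu$ is defined as a map $\mathrm{PT}_{\nu \to \mu}$ from $T_\nu \mathbb{H}^n$ to $T_\mu \mathbb{H}^n$ that carries a vector in $T_\nu \mathbb{H}^n$ along the geodesic from $\nu$ to $\mu$ in a parallel manner without changing its metric tensor. In other words, if $\mathrm{PT}$ is the parallel transport on hyperbolic space, then $\langle \mathrm{PT}_{\nu \to \mu}(v), \mathrm{PT}_{\nu \to \mu}(v') \rangle_{\mathcal{L}} = \langle v, v' \rangle_{\mathcal{L}}$.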
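The explicit formula for the parallel transport on the Lorentz model (Figure 2) is given by:
$$\mathrm{PT}_{\nu \to \mu}(v) = v + \frac{\langle \mu - \alpha \nu, v \rangle_{\mathcal{L}}}{\alpha + 1} (\nu + \mu),$$
where $\alpha = -\langle \nu, \mu \rangle_{\mathcal{L}}$. The inverse parallel transport $\mathrm{PT}^{-1}_{\nu \to \mu}$ simply carries the vector in $T_\mu \mathbb{H}^n$ back to $T_\nu \mathbb{H}^n$ along the geodesic. That is,
$$v = \mathrm{PT}^{-1}_{\nu \to \mu}(u) = \mathrm{PT}_{\mu \to \nu}(u).$$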
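As a rough illustration of the parallel transport on the Lorentz model, the sketch below implements the standard closed form $\mathrm{PT}_{\nu \to \mu}(v) = v + \frac{\langle \mu - \alpha\nu, v\rangle_{\mathcal{L}}}{\alpha + 1}(\nu + \mu)$ with $\alpha = -\langle \nu, \mu\rangle_{\mathcal{L}}$ in plain Python. The function names and the example points are assumptions made for illustration only.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian product: -x0*y0 + sum_{i>=1} xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def parallel_transport(v, nu, mu):
    """Transport v in the tangent space at nu to the tangent space at mu."""
    alpha = -lorentz_inner(nu, mu)
    direction = [m - alpha * n for m, n in zip(mu, nu)]   # mu - alpha * nu
    coef = lorentz_inner(direction, v) / (alpha + 1.0)
    return [vi + coef * (n + m) for vi, n, m in zip(v, nu, mu)]


mu0 = [1.0, 0.0, 0.0]                            # origin of the hyperboloid
mu = [math.cosh(1.0), math.sinh(1.0), 0.0]       # a point at distance 1
v = [0.0, 0.5, -0.7]                             # tangent vector at mu0

u = parallel_transport(v, mu0, mu)               # now tangent at mu
back = parallel_transport(u, mu, mu0)            # inverse = reverse transport

print(lorentz_inner(u, mu))                      # close to 0: u is tangent at mu
print(lorentz_inner(u, u), lorentz_inner(v, v))  # the norm is preserved
```

Transporting back with the roles of the two base points swapped recovers the original vector, which is the inverse relation stated above.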
Exponential map and inverse exponential map
Finally, we will describe a function that maps a vector in a tangent space to its surface.
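According to the basic theory of differential geometry, every $u \in T_\mu \mathbb{H}^n$ determines a unique maximal geodesic $\gamma_\mu : [0, 1] \to \mathbb{H}^n$ with $\gamma_\mu(0) = \mu$ and $\dot{\gamma}_\mu(0) = u$. The exponential map $\exp_\mu : T_\mu \mathbb{H}^n \to \mathbb{H}^n$ is a map defined by $\exp_\mu(u) = \gamma_\mu(1)$, and we can use this map to project a vector $u$ in $T_\mu \mathbb{H}^n$ onto $\mathbb{H}^n$ in a way that the distance from $\mu$ to the destination of the map coincides with $\|u\|_{\mathcal{L}}$, the metric norm of $u$. For hyperbolic space, this map (Figure 2) is given by
$$z = \exp_\mu(u) = \cosh(\|u\|_{\mathcal{L}})\, \mu + \sinh(\|u\|_{\mathcal{L}})\, \frac{u}{\|u\|_{\mathcal{L}}}. \tag{5}$$
As we can confirm with straightforward computation, this exponential map is norm preserving in the sense that the hyperbolic distance $\operatorname{arccosh}(-\langle \mu, \exp_\mu(u) \rangle_{\mathcal{L}})$ between $\mu$ and $\exp_\mu(u)$ equals $\|u\|_{\mathcal{L}}$. Now, in order to evaluate the density of a point on hyperbolic space, we need to be able to map the point back to the tangent space, on which the distribution is initially defined. We, therefore, need to be able to compute the inverse of the exponential map, which is also called the logarithm map.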
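Solving eq. 5 for $u$, we can obtain the inverse exponential map as
$$u = \exp_\mu^{-1}(z) = \frac{\operatorname{arccosh}(\alpha)}{\sqrt{\alpha^2 - 1}} (z - \alpha \mu),$$
where $\alpha = -\langle \mu, z \rangle_{\mathcal{L}}$. See Appendix A.1 for further details.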
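The exponential map and its inverse can be sketched in a few lines. The code below is an illustrative plain-Python version that assumes the standard Lorentz-model formulas $\exp_\mu(u) = \cosh(r)\mu + \sinh(r)\,u/r$ with $r = \|u\|_{\mathcal{L}}$ and $\exp_\mu^{-1}(z) = \frac{\operatorname{arccosh}(\alpha)}{\sqrt{\alpha^2-1}}(z - \alpha\mu)$ with $\alpha = -\langle\mu, z\rangle_{\mathcal{L}}$; the small-norm guards and the function names are choices made here, not part of the paper.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian product: -x0*y0 + sum_{i>=1} xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def exp_map(u, mu):
    """Map a tangent vector u at mu onto the hyperboloid."""
    r = math.sqrt(max(lorentz_inner(u, u), 0.0))
    if r < 1e-12:                      # guard: exp_mu(0) = mu
        return list(mu)
    return [math.cosh(r) * m + math.sinh(r) * ui / r for m, ui in zip(mu, u)]


def log_map(z, mu):
    """Inverse exponential map: send z back to the tangent space at mu."""
    alpha = max(-lorentz_inner(mu, z), 1.0)   # clamp against rounding error
    if alpha - 1.0 < 1e-12:            # guard: z coincides with mu
        return [0.0] * len(z)
    scale = math.acosh(alpha) / math.sqrt(alpha * alpha - 1.0)
    return [scale * (zi - alpha * m) for zi, m in zip(z, mu)]


mu0 = [1.0, 0.0, 0.0]
u = [0.0, 0.3, 0.4]                    # tangent at the origin, norm 0.5
z = exp_map(u, mu0)
dist = math.acosh(-lorentz_inner(mu0, z))   # hyperbolic distance to mu0
u_back = log_map(z, mu0)
print(dist, u_back)
```

The distance from the base point to the image equals the norm of the tangent vector, and the logarithm map recovers the vector.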
Finally, we are ready to formally explain our method of generating our pseudo-hyperbolic Gaussian $\mathcal{G}(\mu, \Sigma)$ with $\mu \in \mathbb{H}^n$ and a positive definite $\Sigma$.
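In the language of differential geometry, our strategy can be re-described as follows:
1. Sample a vector $\tilde{v}$ from the Gaussian distribution $\mathcal{N}(0, \Sigma)$ defined over $\mathbb{R}^n$.
2. Interpret $\tilde{v}$ as an element of $T_{\mu_0} \mathbb{H}^n \subset \mathbb{R}^{n+1}$ by rewriting $\tilde{v}$ as $v = [0, \tilde{v}]$.
3. Parallel transport the vector $v$ to $u \in T_\mu \mathbb{H}^n$ along the geodesic from $\mu_0$ to $\mu$.
4. Map $u$ to $\mathbb{H}^n$ by $\exp_\mu$.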
Algorithm 1 is an algorithmic description of the sampling procedure.
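The sampling steps above can be sketched as follows. This is a hedged, self-contained illustration rather than the authors' implementation: it assumes a diagonal covariance given as a list of standard deviations `sigma`, uses Python's `random.Random.gauss` for the Euclidean Gaussian, and relies on the standard Lorentz-model formulas for parallel transport and the exponential map. All function names are hypothetical.

```python
import math
import random


def lorentz_inner(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def parallel_transport(v, nu, mu):
    alpha = -lorentz_inner(nu, mu)
    direction = [m - alpha * n for m, n in zip(mu, nu)]
    coef = lorentz_inner(direction, v) / (alpha + 1.0)
    return [vi + coef * (n + m) for vi, n, m in zip(v, nu, mu)]


def exp_map(u, mu):
    r = math.sqrt(max(lorentz_inner(u, u), 0.0))
    if r < 1e-12:
        return list(mu)
    return [math.cosh(r) * m + math.sinh(r) * ui / r for m, ui in zip(mu, u)]


def sample_pseudo_hyperbolic_gaussian(mu, sigma, rng):
    """Draw one sample; sigma holds per-coordinate standard deviations
    (diagonal covariance is an assumption of this sketch)."""
    n = len(mu) - 1
    mu0 = [1.0] + [0.0] * n
    v_tilde = [rng.gauss(0.0, s) for s in sigma]   # step 1: Euclidean Gaussian
    v = [0.0] + v_tilde                            # step 2: tangent at origin
    u = parallel_transport(v, mu0, mu)             # step 3: move to T_mu
    return exp_map(u, mu)                          # step 4: project to H^n


rng = random.Random(0)
mu = [math.cosh(1.0), math.sinh(1.0), 0.0]
samples = [sample_pseudo_hyperbolic_gaussian(mu, [0.5, 0.5], rng) for _ in range(5)]
for z in samples:
    print(z, lorentz_inner(z, z))   # each Lorentzian self-product is close to -1
```

Because every step is a deterministic, differentiable function of the Gaussian noise, the same construction can be used with an automatic differentiation library; that extension is not shown here.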
The most prominent advantage of this construction is that we can compute the density of the probability distribution.
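Note that both $\mathrm{PT}_{\mu_0 \to \mu}$ and $\exp_\mu$ are differentiable functions that can be evaluated analytically. Thus, by the construction of $\mathcal{G}(\mu, \Sigma)$, we can compute the probability density of $\mathcal{G}(\mu, \Sigma)$ at $z \in \mathbb{H}^n$ using a composition of differentiable functions, $\mathrm{PT}_{\mu_0 \to \mu}$ and $\exp_\mu$. Let $\mathrm{proj}_\mu := \exp_\mu \circ \mathrm{PT}_{\mu_0 \to \mu}$ (Figure 3).
In general, if $X$ is a random variable endowed with the probability density function $p(x)$, the log likelihood of $Y = f(X)$ at $y$ can be expressed as
$$\log p(y) = \log p(x) - \log \det \left( \frac{\partial f}{\partial x} \right),$$
where $f$ is an invertible and continuous map. Thus, all we need in order to evaluate the probability density of $\mathcal{G}(\mu, \Sigma)$ at $z = \mathrm{proj}_\mu(v)$ is the way to evaluate $\det(\partial\, \mathrm{proj}_\mu(v) / \partial v)$:
$$\log p(z) = \log p(v) - \log \det \left( \frac{\partial\, \mathrm{proj}_\mu(v)}{\partial v} \right). \tag{7}$$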
Algorithm 2 is an algorithmic description for the computation of the pdf.
For the implementation of Algorithm 1 and Algorithm 2, we need to be able to evaluate not only $\exp_\mu(u)$, $\mathrm{PT}_{\mu_0 \to \mu}(v)$, and their inverses, but also the determinant. We provide an analytic solution to each one of them below.
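We compute the log-determinant of the Jacobian of the projection $\mathrm{proj}_\mu = \exp_\mu \circ \mathrm{PT}_{\mu_0 \to \mu}$. This is required in the evaluation of (7).
Appealing to the chain rule and the multiplicative property of the determinant, we can decompose the expression into two components:
$$\det \left( \frac{\partial\, \mathrm{proj}_\mu(v)}{\partial v} \right) = \det \left( \frac{\partial \exp_\mu(u)}{\partial u} \right) \cdot \det \left( \frac{\partial\, \mathrm{PT}_{\mu_0 \to \mu}(v)}{\partial v} \right). \tag{8}$$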
For the first term, we get
where and . Now, using the identity , we obtain
See Appendix A.3 for further details.
Next, the second term can be computed as
$$\det \left( \frac{\partial\, \mathrm{PT}_{\mu_0 \to \mu}(v)}{\partial v} \right) = 1.$$
See Appendix A.4 for further details.
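Putting these computations together, we can obtain the desired determinant in a simple and clean form:
$$\det \left( \frac{\partial\, \mathrm{proj}_\mu(v)}{\partial v} \right) = \left( \frac{\sinh r}{r} \right)^{n-1}, \qquad r = \|u\|_{\mathcal{L}}.$$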
Because both and can be computed in , the whole evaluation of the log determinant can be computed in .
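To make the density evaluation concrete, here is a minimal plain-Python sketch of the log-density. It assumes a diagonal covariance (a list `sigma` of standard deviations), the standard Lorentz-model inverse maps, and the known closed form of the log-determinant, $(n-1)(\log\sinh r - \log r)$ with $r = \|u\|_{\mathcal{L}}$. Function names are hypothetical and the code is not the authors' implementation.

```python
import math


def lorentz_inner(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def parallel_transport(v, nu, mu):
    alpha = -lorentz_inner(nu, mu)
    direction = [m - alpha * n for m, n in zip(mu, nu)]
    coef = lorentz_inner(direction, v) / (alpha + 1.0)
    return [vi + coef * (n + m) for vi, n, m in zip(v, nu, mu)]


def log_map(z, mu):
    alpha = max(-lorentz_inner(mu, z), 1.0)
    if alpha - 1.0 < 1e-12:
        return [0.0] * len(z)
    scale = math.acosh(alpha) / math.sqrt(alpha * alpha - 1.0)
    return [scale * (zi - alpha * m) for zi, m in zip(z, mu)]


def log_prob(z, mu, sigma):
    """Log-density of the pseudo-hyperbolic Gaussian at z (diagonal covariance)."""
    n = len(mu) - 1
    mu0 = [1.0] + [0.0] * n
    u = log_map(z, mu)                      # back to the tangent space at mu
    v = parallel_transport(u, mu, mu0)      # back to the tangent space at origin
    r = math.sqrt(max(lorentz_inner(u, u), 0.0))
    # log N(v_tilde; 0, diag(sigma^2)) evaluated on v[1:], since v[0] is 0
    log_normal = sum(
        -0.5 * (vi / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for vi, s in zip(v[1:], sigma)
    )
    # log-determinant (n - 1) * (log sinh r - log r); its limit is 0 as r -> 0
    log_det = 0.0 if r < 1e-12 else (n - 1) * (math.log(math.sinh(r)) - math.log(r))
    return log_normal - log_det


mu0 = [1.0, 0.0, 0.0]
z = [math.cosh(0.5), math.sinh(0.5) * 0.6, math.sinh(0.5) * 0.8]
print(log_prob(z, mu0, [1.0, 1.0]))
print(log_prob(mu0, mu0, [1.0, 1.0]))       # equals -log(2*pi) at the mean
```

Every operation inside `log_prob` is a loop over $n + 1$ coordinates, so the cost is linear in the dimension.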
Since the metric at the tangent space coincides with the Euclidean metric, we can produce various types of hyperbolic distributions by applying our construction strategy to other distributions defined on Euclidean space, such as the Laplace and Cauchy distributions.
As an application of the pseudo-hyperbolic Gaussian $\mathcal{G}(\mu, \Sigma)$, we will introduce the hyperbolic variational autoencoder (Hyperbolic VAE), a variant of the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) in which the latent variables are defined on hyperbolic space. Given a dataset $\mathcal{D} = \{x^{(i)}\}_{i=1}^{N}$, the method of variational autoencoder aims to train a decoder model $p_\theta(x \mid z)$ that can create from the prior $p_\theta(z)$ a dataset that resembles $\mathcal{D}$. The decoder model is trained together with the encoder model by maximizing the sum of the evidence lower bound (ELBO) that is defined for each $x^{(i)}$:
$$\log p_\theta(x^{(i)}) \geq \mathbb{E}_{q_\phi(z \mid x^{(i)})} \left[ \log p_\theta(x^{(i)} \mid z) \right] - D_{\mathrm{KL}} \left( q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z) \right),$$
where $q_\phi(z \mid x^{(i)})$ is the variational posterior distribution. In the classic VAE, the choice of the prior is the standard normal, and the posterior distribution is also variationally approximated by a Gaussian. Hyperbolic VAE is a simple modification of the classic VAE in which $p_\theta(z) = \mathcal{G}(\mu_0, I)$ and $q_\phi(z \mid x) = \mathcal{G}(\mu, \Sigma)$. The model of $\mu$ and $\Sigma$ is often referred to as the encoder. This parametric formulation of $q_\phi$ is called the reparametrization trick, and it enables the evaluation of the gradient of the objective function with respect to the network parameters. As a baseline for comparison, we used $\beta$-VAE (Higgins et al., 2017), a variant of VAE that applies a scalar weight $\beta$ to the KL term in the objective function.
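In Hyperbolic VAE, we ensure that the output $\mu$ of the encoder is in $\mathbb{H}^n$ by applying $\exp_{\mu_0}$ to the final layer of the encoder. That is, if $h$ is the output, we can simply use
$$\mu = \exp_{\mu_0}(h) = \left( \cosh(\|h\|_2),\ \sinh(\|h\|_2)\, \frac{h}{\|h\|_2} \right)^{\top}.$$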
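The mapping of the encoder output onto the hyperboloid can be sketched as below. This is a minimal plain-Python illustration that assumes the exponential map at the origin takes the form $(\cosh\|h\|_2,\ \sinh(\|h\|_2)\, h / \|h\|_2)$; the function name `to_hyperboloid` and the zero-norm guard are choices made for this example.

```python
import math


def to_hyperboloid(h):
    """Apply the exponential map at the origin to a Euclidean vector h."""
    norm = math.sqrt(sum(a * a for a in h))
    if norm < 1e-12:                       # exp at the origin of 0 is the origin
        return [1.0] + [0.0] * len(h)
    return [math.cosh(norm)] + [math.sinh(norm) * a / norm for a in h]


h = [0.3, 0.4]                             # hypothetical final-layer output
mu = to_hyperboloid(h)
constraint = -mu[0] ** 2 + sum(a * a for a in mu[1:])
print(mu, constraint)                      # constraint is close to -1
```

In a real encoder the same expression would be applied to a batch of outputs inside the network so that gradients flow through it.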
As stated in the previous sections, our distribution allows us to evaluate the ELBO exactly and to take the gradient of the objective function. In a way, our distribution of the variational posterior is a hyperbolic analog of the reparametrization trick.
We can use our pseudo-hyperbolic Gaussian $\mathcal{G}$ for probabilistic word embedding. The work of Vilnis & McCallum (2015) attempted to extract the linguistic and contextual properties of words in a dictionary by embedding every word and every context to a Gaussian distribution defined on Euclidean space. We may extend their work by changing the destination of the map to the family of $\mathcal{G}$. Let us write $s \sim t$ to convey that there is a link between words $s$ and $t$, and let us use $q_s$ to designate the distribution to be assigned to the word $s$. The objective function used in Vilnis & McCallum (2015) is given by
where represents the measure of similarity between and evaluated with . In the original work, and were chosen to be a Gaussian distribution. We can incorporate hyperbolic geometry into this idea by choosing .
Very recently, Ovinnikov (2019) proposed an application of Gaussian distribution on hyperbolic space. However, the formulation of their distribution cannot be directly differentiated nor evaluated because of the presence of the error function in their expression of the pdf. For this reason, they resort to Wasserstein Maximum Mean Discrepancy (Gretton et al., 2012) to train their encoder network. Our distribution has broader application than the distribution of Ovinnikov (2019) because it allows the user to compute its likelihood and its gradient without approximation. One advantage of our distribution is its representation power. Our distribution can be defined for any $\mu$ in $\mathbb{H}^n$ and any positive definite matrix $\Sigma$. Meanwhile, the hyperbolic Gaussian studied in Ovinnikov (2019) can only express Gaussians with a variance matrix of the form $\sigma I$.
For word embedding, several deterministic methods have been proposed to date, including the celebrated Word2Vec (Mikolov et al., 2013). The aforementioned Nickel & Kiela (2017) uses deterministic hyperbolic embedding to exploit the hierarchical relationships among words. The probabilistic word embedding was first proposed by Vilnis & McCallum (2015). As stated in the method section, their method maps each word to a Gaussian distribution on Euclidean space. Their work suggests the importance of investigating the uncertainty of word embedding. In the field of representation learning of word vectors, our work is the first in using hyperbolic probability distribution for word embedding.
On the other hand, the idea to use a noninformative, non-Gaussian prior in VAE is not new. For example, Davidson et al. (2018) proposes the use of von Mises-Fisher prior, and Rolfe (2017); Jang et al. (2017) use discrete distributions as their prior. With the method of Normalizing flow (Rezende & Mohamed, 2015), one can construct even more complex priors as well (Kingma et al., 2016). The appropriate choice of the prior shall depend on the type of dataset. As we will show in the experiment section, our distribution is well suited to the dataset with underlying tree structures. Another choice of the VAE prior that specializes in such dataset has been proposed by Vikram et al. (2018). For the sampling, they use time-marginalized coalescent, a model that samples a random tree structure by a stochastic process. Theoretically, their method can be used in combination with our approach by replacing their Gaussian random walk with a hyperbolic random walk.
We trained Hyperbolic VAE for an artificial dataset constructed from a binary tree of depth . To construct the artificial dataset, we first obtained a binary representation for each node in the tree so that the Hamming distance between any pair of nodes is the same as the distance on the graph representation of the tree (Figure 1). Let us call the set of binaries obtained this way by . We then generated a set of binaries, , by randomly flipping each coordinate value of with probability . The binary set was then embedded into by mapping to
. We used a multilayer perceptron (MLP) of depth 3 with 100 hidden units at each layer for both the encoder and the decoder of the VAE. For activation function we used.
Table 1 summarizes the quantitative comparison of Normal VAE against our Hyperbolic VAE. For each pair of points in the tree, we computed their Hamming distance as well as their distance in the latent space of the VAE. That is, we used the hyperbolic distance for Hyperbolic VAE and the Euclidean distance for Normal VAE. We used the strength of correlation between the Hamming distances and the distances in the latent space as a measure of performance. Hyperbolic VAE performed better both on the original tree and on the artificial dataset generated from the tree. Normal VAE performed the best with , and collapsed with . The difference between Normal VAE and Hyperbolic VAE can be observed with much more clarity using the 2-dimensional visualization of the generated dataset on the Poincaré ball (see Figure 1 and Appendix C.1). The red points are the embeddings of , and the blue points are the embeddings of all other points in . The pink mark designates the origin of hyperbolic space. For the visualization, we used the canonical diffeomorphism between the Lorentz model and the Poincaré ball model.
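For readers who want to reproduce this kind of visualization, the sketch below shows the usual diffeomorphism between the Lorentz model and the Poincaré ball, $p_i = z_i / (1 + z_0)$, together with its inverse. It is a plain-Python illustration with hypothetical function names, not code from the paper.

```python
import math


def lorentz_to_poincare(z):
    """Project a hyperboloid point z = (z0, z1, ..., zn) into the unit ball."""
    return [zi / (1.0 + z[0]) for zi in z[1:]]


def poincare_to_lorentz(p):
    """Inverse map from the open unit ball back to the hyperboloid."""
    sq = sum(a * a for a in p)
    return [(1.0 + sq) / (1.0 - sq)] + [2.0 * a / (1.0 - sq) for a in p]


z = [math.cosh(2.0), math.sinh(2.0), 0.0]   # point at hyperbolic distance 2
p = lorentz_to_poincare(z)
z_back = poincare_to_lorentz(p)
print(p, z_back)
```

A point at hyperbolic distance $d$ from the origin lands at Euclidean radius $\tanh(d/2)$ in the ball, so distant points crowd toward the boundary of the disk.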
|Model||Correlation||Correlation w/ noise|
|Normal VAE||Hyperbolic VAE|
We applied Hyperbolic VAE to a binarized version of MNIST. We used an MLP of depth 3 and 500 hidden units at each layer for both the encoder and the decoder of the VAE. Table 2 shows the quantitative results of the experiments. Log-likelihood was approximated with an empirical integration of the Bayesian predictor with respect to the latent variables (Burda et al., 2016). Our method outperformed Normal VAE with small latent dimension. Figure 4 shows samples from the Hyperbolic VAE that was trained with 5-dimensional latent variables, together with the Poincaré ball representations of the interpolations produced on $\mathbb{H}^2$ by the Hyperbolic VAE that was trained with 2-dimensional latent variables.
In reinforcement learning, the number of possible state-action trajectories grows exponentially with the time horizon. We may say that these trajectories often have a tree-like hierarchical structure that starts from the initial states. We applied our Hyperbolic VAE to a set of trajectories that were explored by a trained policy during multiple episodes of Breakout in Atari 2600. To collect the trajectories, we used a pretrained Deep Q-Network (Mnih et al., 2015), and used epsilon-greedy with . We amassed a set of trajectories whose total length is 100,000, of which we used 80,000 as the training set, 10,000 as the validation set, and 10,000 as the test set. Each frame in the dataset was gray-scaled and resized to 80 × 80. The images in Figure 5 are samples from the dataset. We used a DCGAN-based architecture (Radford et al., 2016) with latent space dimension 20. Please see Appendix D for more details.
Figure 6 is a visualization of our results. The top three rows are the samples from Normal VAE, and the bottom three rows are the samples from Hyperbolic VAE. Each row consists of samples generated from latent variables of the form with positive scalar in range . Samples in each row are listed in increasing order of . For Normal VAE, we used $\mathcal{N}(0, I)$ as the prior. For Hyperbolic VAE, we used $\mathcal{G}(\mu_0, I)$ as the prior. We can see that the number of blocks decreases gradually and consistently in each row for Hyperbolic VAE. Please see Appendix C.2 for more details and more visualizations.
In Breakout, the number of blocks is always finite, and blocks are located only in a specific region. Let us refer to this specific region as . In order to evaluate each model output based on the number of blocks, we binarized each pixel in each output based on a prescribed luminance threshold and measured the proportion of the pixels with pixel value in the region . For each generated image, we used this proportion as the measure of the number of blocks contained in the image.
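The block-counting measure described above might be sketched as follows. The image representation (a list of rows of luminance values in $[0, 1]$), the rectangular region given as `(top, bottom, left, right)`, the threshold of 0.5, and the choice to count pixels above the threshold are all assumptions made for illustration; the paper's actual region and threshold are not reproduced here.

```python
def block_proportion(image, region, threshold=0.5):
    """Fraction of pixels inside `region` whose luminance exceeds `threshold`.

    image  : list of rows, each a list of luminance values (assumed in [0, 1])
    region : (top, bottom, left, right), half-open row and column ranges
    """
    top, bottom, left, right = region
    bright = 0
    total = 0
    for row in image[top:bottom]:
        for pixel in row[left:right]:
            total += 1
            if pixel > threshold:          # binarize by luminance
                bright += 1
    return bright / total if total else 0.0


# Tiny hypothetical 4x4 frame: blocks can only appear in the top two rows.
image = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(block_proportion(image, (0, 2, 0, 4)))
```

Because the measure is a proportion rather than a count, it can be compared across generated images without detecting individual blocks.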
Figure 7 shows the estimated proportions of remaining blocks for Normal and Hyperbolic VAEs with different norm of. For Normal VAE, samples generated from with its norm as large as contained a considerable amount of blocks. On the other hand, the number of blocks contained in a sample generated by Hyperbolic VAE decreased more consistently with the norm of . This fact suggests that the cumulative reward up to a given state can be approximated well by the norm of Hyperbolic VAE’s latent representation. To validate this, we computed the latent representation for each state in the test set and measured its correlation with the cumulative reward. The correlation was 0.8540 for the Hyperbolic VAE. For the Normal VAE, the correlation was 0.712. We emphasize that no information regarding the reward was used during the training of either the Normal or the Hyperbolic VAE.
Lastly, we applied the pseudo-hyperbolic Gaussian to the word embedding problem. We trained probabilistic word embedding models with the WordNet nouns dataset (Miller, 1998) and evaluated their reconstruction performance (Table 3). We followed the procedure of Poincaré embedding (Nickel & Kiela, 2017) and initialized all embeddings in the neighborhood of the origin. In particular, we initialized each weight in the first linear part of the embedding by . We treated the first 50 epochs as a burn-in phase and reduced the learning rate by a factor of after the burn-in phase.
In Table 3, ‘Euclid’ refers to the word embedding with Gaussian distribution on Euclidean space (Vilnis & McCallum, 2015), and ‘Hyperbolic’ refers to our proposed method based on pseudo-hyperbolic Gaussian. Our hyperbolic model performed better than Vilnis’ Euclidean counterpart when the latent space is low dimensional. We used diagonal variance for both models above. Appendix C.3 shows the results with unit variance. The performance difference with small latent dimension was much more remarkable when we use unit variance.
|Euclid||Hyperbolic||Nickel & Kiela (2017)|
In this paper, we proposed a novel parametrization for the density of a Gaussian on hyperbolic space that can both be differentiated and evaluated analytically. Our experimental results on hyperbolic word embedding and hyperbolic VAE suggest that there is much more room left for the application of hyperbolic space. Our parametrization enables gradient-based training of probabilistic models defined on hyperbolic space and opens the door to the investigation of complex models on hyperbolic space that could not have been explored before.
We would like to thank Tomohiro Hayase, Kenta Oono, and Masaki Watanabe for helpful discussions. We also thank Takeru Miyato and Sosuke Kobayashi for insightful reviews on the paper. This paper is based on results obtained from Nagano’s internship at Preferred Networks, Inc.
34th Conference on Uncertainty in Artificial Intelligence, 2018.
Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pp. 1278–1286, 2014.
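As we mentioned in the main text, the exponential map from $T_\mu \mathbb{H}^n$ to $\mathbb{H}^n$ is given by
$$z = \exp_\mu(u) = \cosh(\|u\|_{\mathcal{L}})\, \mu + \sinh(\|u\|_{\mathcal{L}})\, \frac{u}{\|u\|_{\mathcal{L}}}.$$
Solving this equation for $u$, we obtain
$$u = \frac{\|u\|_{\mathcal{L}}}{\sinh(\|u\|_{\mathcal{L}})} \left( z - \cosh(\|u\|_{\mathcal{L}})\, \mu \right).$$
We still need to obtain an evaluatable expression for $\|u\|_{\mathcal{L}}$. Using the characterization of the tangent space (main text, (2)), that is, $\langle \mu, u \rangle_{\mathcal{L}} = 0$, together with $\langle \mu, \mu \rangle_{\mathcal{L}} = -1$, we see that
$$\langle \mu, z \rangle_{\mathcal{L}} = \cosh(\|u\|_{\mathcal{L}}) \langle \mu, \mu \rangle_{\mathcal{L}} + \sinh(\|u\|_{\mathcal{L}}) \frac{\langle \mu, u \rangle_{\mathcal{L}}}{\|u\|_{\mathcal{L}}} = -\cosh(\|u\|_{\mathcal{L}}), \qquad \|u\|_{\mathcal{L}} = \operatorname{arccosh}(-\langle \mu, z \rangle_{\mathcal{L}}).$$
Now, defining $\alpha = -\langle \mu, z \rangle_{\mathcal{L}}$ and noting that $\sinh(\operatorname{arccosh}(\alpha)) = \sqrt{\alpha^2 - 1}$, we can obtain the inverse exponential function as
$$u = \exp_\mu^{-1}(z) = \frac{\operatorname{arccosh}(\alpha)}{\sqrt{\alpha^2 - 1}} (z - \alpha \mu).$$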
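The parallel transport on the Lorentz model along the geodesic from $\nu$ to $\mu$ is given by
$$u = \mathrm{PT}_{\nu \to \mu}(v) = v + \frac{\langle \mu - \alpha \nu, v \rangle_{\mathcal{L}}}{\alpha + 1} (\nu + \mu), \tag{10}$$
where $\alpha = -\langle \nu, \mu \rangle_{\mathcal{L}}$. Next, as with the exponential map, we need to be able to compute the inverse of the parallel transport. Solving (10) for $v$, we get
$$v = u - \frac{\langle \mu - \alpha \nu, v \rangle_{\mathcal{L}}}{\alpha + 1} (\nu + \mu).$$
Now, observing that
$$\langle \nu - \alpha \mu, u \rangle_{\mathcal{L}} = \langle \nu, u \rangle_{\mathcal{L}} = \frac{\langle \mu - \alpha \nu, v \rangle_{\mathcal{L}}}{\alpha + 1} \langle \nu, \nu + \mu \rangle_{\mathcal{L}} = -\langle \mu - \alpha \nu, v \rangle_{\mathcal{L}},$$
where we used $\langle \mu, u \rangle_{\mathcal{L}} = 0$, $\langle \nu, v \rangle_{\mathcal{L}} = 0$, and $\langle \nu, \nu + \mu \rangle_{\mathcal{L}} = -(1 + \alpha)$,
we can write the inverse parallel transport as
$$v = \mathrm{PT}^{-1}_{\nu \to \mu}(u) = u + \frac{\langle \nu - \alpha \mu, u \rangle_{\mathcal{L}}}{\alpha + 1} (\nu + \mu).$$
The inverse of the parallel transport from $\nu$ to $\mu$ coincides with the parallel transport from $\mu$ to $\nu$.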
As for the first term of (8) in the main text, we can write
where we wrote
Now, using the change of variables , , we get
Using the identity , we obtain
Next, the second term (8) in the main text can be computed as
where . Using the identity , we get
Figure 8 shows examples of pseudo-hyperbolic Gaussian $\mathcal{G}(\mu, \Sigma)$ with various $\mu$ and $\Sigma$. We plotted the log-density of these distributions as heatmaps. We designate the by the mark. The right side of these figures shows the log-density on the Poincaré ball model, and the left side shows the same log-density on the corresponding tangent space.
We qualitatively compared the learned latent space of Normal and Hyperbolic VAEs. Figure 9 shows the embedding vectors of the synthetic binary tree dataset on the two-dimensional latent space. We evaluated the latent space of Normal VAE with , and , and Hyperbolic VAE. Note that the hierarchical relations in the original tree were not used during the training phase. Red points are the embeddings of the noiseless observations, blue points are the embeddings of noisy observations, and pink represents the origin of the latent space. As we mentioned in the main text, we evaluated the correlation coefficient between the Hamming distance on the data space and the hyperbolic (Euclidean for Normal VAEs) distance on the latent space. Consistent with this metric, the latent space of the Hyperbolic VAE captured the hierarchical structure inherent in the dataset well. Among the Normal VAEs, the latent space captured the hierarchical structure better as $\beta$ increased. However, the posterior distribution of the Normal VAE with collapsed and lost the structure. In the latent space of Normal VAEs, the embeddings of noisy observations were biased toward the center.
To evaluate the performance of Hyperbolic VAE for hierarchically organized dataset according to time development, we applied our Hyperbolic VAE to a set of trajectories that were explored by an agent with a trained policy during multiple episodes of Breakout in Atari 2600. We used a pretrained Deep Q-Network to collect trajectories, and Figure 10 shows examples of observed screens.
We showed three trajectories of samples from the prior distribution with the scaled norm for both models in the main text. We also visualize more samples in Figure 11 and 12. For both models, we generated samples with 0, 1, 2, 3, 5, and 10.
Normal VAE tended to generate oversaturated images when the norm was small. Although the model generated several images which include a small number of blocks as the norm increases, it also generated images with a constant amount of blocks even . On the other hand, the number of blocks contained in the generated image of Hyperbolic VAE gradually decreased according to the norm.
We showed the experimental results of probabilistic word embedding models with diagonal variance in the main text. In this section, we show the results with unit variance (Table 4). When the dimensions of the latent variable are small, the performance of the model on hyperbolic space did not deteriorate much by changing the variance from diagonal to unit. However, the same change dramatically worsened the performance of the model on Euclidean space.
with the kernel size of the convolution and deconvolution layers as 3. We used leaky ReLU nonlinearities for the encoder and ReLU nonlinearities for the decoder. We set the latent space dimension as 20. We gradually increased $\beta$ from 0.1 to 4.0 linearly during the first 30 epochs. To keep the initial embedding vectors close to the origin, we initialized the scale parameter $\gamma$ of the batch normalization layer (Ioffe & Szegedy, 2015) of the encoder as 0.1. We modeled the probability distribution of the data space as Gaussian, so the decoder outputs a vector twice as large as the original image.
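The linear warm-up from 0.1 to 4.0 over the first 30 epochs could be written as a small schedule function. The sketch below assumes that the annealed quantity is the KL weight $\beta$ of the $\beta$-VAE objective, that epochs are counted from 0, and that the final value is reached exactly at epoch 30; these details and the function name are assumptions, not taken from the paper.

```python
def beta_schedule(epoch, start=0.1, end=4.0, warmup_epochs=30):
    """Linearly interpolate the KL weight during warm-up, then hold it."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs


for epoch in (0, 15, 30, 50):
    print(epoch, beta_schedule(epoch))
```

Starting with a small weight lets the decoder learn to use the latent code before the KL term is fully enforced, which is a common motivation for this kind of warm-up.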
|Encoder layers (in order)||Input, Convolution, BatchNormalization, Convolution, BatchNormalization, Convolution, BatchNormalization, Convolution, BatchNormalization, Convolution, BatchNormalization, Convolution, Linear|
|Decoder layers (in order)||Linear, BatchNormalization, Deconvolution, BatchNormalization, Convolution, BatchNormalization, Deconvolution, BatchNormalization, Convolution, Deconvolution, Convolution|