HyperbolicNF
ICML 2020 Paper: Latent Variable Modelling with Hyperbolic Normalizing Flows
The choice of approximate posterior distributions plays a central role in stochastic variational inference (SVI). One effective solution is the use of normalizing flows defined on Euclidean spaces to construct flexible posterior distributions. However, one key limitation of existing normalizing flows is that they are restricted to the Euclidean space and are ill-equipped to model data with an underlying hierarchical structure. To address this fundamental limitation, we present the first extension of normalizing flows to hyperbolic spaces. We first elevate normalizing flows to hyperbolic spaces using coupling transforms defined on the tangent bundle, termed Tangent Coupling (TC). We further introduce Wrapped Hyperboloid Coupling (WHC), a fully invertible and learnable transformation that explicitly utilizes the geometric structure of hyperbolic spaces, allowing for expressive posteriors while being efficient to sample from. We demonstrate the efficacy of our novel normalizing flow over hyperbolic VAEs and Euclidean normalizing flows. Our approach achieves improved performance on density estimation, as well as reconstruction of real-world graph data, which exhibit a hierarchical structure. Finally, we show that our approach can be used to power a generative model over hierarchical data using hyperbolic latent variables.
Stochastic variational inference (SVI) methods provide an appealing way of scaling probabilistic modeling to large-scale data. These methods transform the problem of computing an intractable posterior distribution into that of finding the best approximation within a class of tractable probability distributions (hoffman2013stochastic). Using tractable classes of approximate distributions, e.g., mean-field and Bethe approximations, facilitates efficient inference, at the cost of limiting the expressiveness of the learned posterior.
In recent years, the power of these SVI methods has been further improved by employing normalizing flows, which greatly increase the flexibility of the approximate posterior distribution. Normalizing flows involve learning a series of invertible transformations, which are used to transform a sample from a simple base distribution to a sample from a richer distribution (rezende2015variational). Indeed, flow-based posteriors enjoy many advantages such as efficient sampling, exact likelihood estimation, and low-variance gradient estimates when the base distribution is reparametrizable, making them ideal for modern machine learning problems. There have been numerous advances in normalizing flow construction in Euclidean spaces, including RealNVP (dinh2016density), NAF (huang2018neural), and FFJORD (grathwohl2018ffjord), to name a few.
However, current normalizing flows are restricted to Euclidean space, and as a result, these approaches are ill-equipped to model data with an underlying hierarchical structure. Many real-world datasets, such as ontologies, social networks, sentences in natural language, and evolutionary relationships between biological entities in phylogenetics, exhibit rich hierarchical or tree-like structure. Hierarchical data of this kind can be naturally represented in hyperbolic spaces, i.e., non-Euclidean spaces with constant negative curvature (Figure 1). But Euclidean normalizing flows fail to incorporate these structural inductive biases, since Euclidean space cannot embed deep hierarchies without suffering from high distortion (sarkar2011low). Furthermore, sampling from densities defined on Euclidean space will inevitably generate points that do not lie on the underlying hyperbolic space.
Present work. To address this fundamental limitation, we present the first extension of normalizing flows to hyperbolic spaces. Prior works have considered learning models with hyperbolic parameters (liu2019hyperbolic; nickel2018learning) as well as variational inference with hyperbolic latent variables (nagano2019wrapped; mathieu2019continuous), but our work represents the first approach to allow flexible density estimation in hyperbolic space.
To define our normalizing flows we leverage the Lorentz model of hyperbolic geometry and introduce two new forms of coupling, Tangent Coupling (TC) and Wrapped Hyperboloid Coupling (WHC). These define flexible and invertible transformations capable of transforming sampled points in the hyperbolic space. We derive the change of volume associated with these transformations and show that it can be computed efficiently with $\mathcal{O}(n)$ cost, where $n$ is the dimension of the hyperbolic space. We empirically validate our proposed normalizing flows on structured density estimation, reconstruction and generation tasks on hierarchical data, highlighting the utility of our proposed approach.
Within the Riemannian geometry framework, hyperbolic spaces are manifolds with constant negative curvature and are of particular interest for embedding hierarchical structures. There are multiple models of $n$-dimensional hyperbolic space, such as the hyperboloid $\mathbb{H}^n_K$, also known as the Lorentz model, or the Poincaré ball $\mathbb{P}^n_K$. Figure 1 illustrates some key properties of these two models, highlighting how distances grow exponentially as you move away from the origin and how the shortest paths between distant points tend to go through the origin, giving rise to a hierarchical or tree-like structure. In the next section, we briefly review the Lorentz model of hyperbolic geometry. We do not assume a background in Riemannian geometry, though Appendix A and Ratcliffe94 may be of use to the interested reader. Henceforth, for notational clarity, we use boldface font to denote points on the hyperboloid manifold.
An $n$-dimensional hyperbolic space, $\mathbb{H}^n_K$, is the unique, complete, simply-connected $n$-dimensional Riemannian manifold of constant negative curvature, $K$. For our purposes, the Lorentz model is the most convenient representation of hyperbolic space, since it is equipped with relatively simple explicit formulas and useful numerical stability properties (nickel2018learning). We choose the 2D Poincaré disk to visualize hyperbolic space because of its conformal mapping to the unit disk. The Lorentz model embeds hyperbolic space within the $(n+1)$-dimensional Minkowski space, defined as the manifold $\mathbb{R}^{n+1}$ equipped with the following inner product:
$$\langle \mathbf{x}, \mathbf{y} \rangle_{\mathcal{L}} := -x_0 y_0 + x_1 y_1 + \dots + x_n y_n \tag{1}$$
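which has the type $\langle \cdot, \cdot \rangle_{\mathcal{L}} : \mathbb{R}^{n+1} \times \mathbb{R}^{n+1} \to \mathbb{R}$. It is common to denote this space as $\mathbb{R}^{1,n}$ to emphasize the distinct role of the zeroth coordinate. In the Lorentz model, we model hyperbolic space as the upper sheet of the hyperboloid embedded in Minkowski space. It is a remarkable fact that though the Lorentzian metric (Eq. 1) is indefinite, the induced Riemannian metric on the unit hyperboloid is positive definite (Ratcliffe94). The $n$-dimensional hyperbolic space with constant negative curvature $K$ and origin $\mathbf{o} = (1/\sqrt{-K}, 0, \dots, 0)$ is a Riemannian manifold $(\mathbb{H}^n_K, g)$ where
$$\mathbb{H}^n_K := \{\mathbf{x} \in \mathbb{R}^{n+1} : \langle \mathbf{x}, \mathbf{x} \rangle_{\mathcal{L}} = 1/K,\ x_0 > 0,\ K < 0\}.$$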
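Equipped with this, the induced distance between two points $\mathbf{x}, \mathbf{y}$ in $\mathbb{H}^n_K$ is given by
$$d(\mathbf{x}, \mathbf{y})_{\mathcal{L}} := \frac{1}{\sqrt{-K}} \operatorname{arccosh}\left(K \langle \mathbf{x}, \mathbf{y} \rangle_{\mathcal{L}}\right) \tag{2}$$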
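As a small illustration of the Minkowski inner product and the induced distance, the sketch below uses plain Python lists for points and assumes the distance takes the form $\operatorname{arccosh}(K\langle \mathbf{x}, \mathbf{y}\rangle_{\mathcal{L}})/\sqrt{-K}$ with $K < 0$. The function names are illustrative and are not taken from the authors' code.

```python
import math

def minkowski_dot(x, y):
    """Lorentz (Minkowski) inner product: -x0*y0 + x1*y1 + ... + xn*yn."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def hyperbolic_distance(x, y, K=-1.0):
    """Induced distance arccosh(K * <x, y>_L) / sqrt(-K) for curvature K < 0."""
    alpha = max(K * minkowski_dot(x, y), 1.0)  # clamp guards against round-off
    return math.acosh(alpha) / math.sqrt(-K)

K = -1.0
origin = [1.0, 0.0, 0.0]                    # o = (1/sqrt(-K), 0, 0)
p = [math.cosh(1.0), math.sinh(1.0), 0.0]   # on the hyperboloid: <p, p>_L = -1
q = [math.cosh(2.0), math.sinh(2.0), 0.0]   # further along the same geodesic

d_op = hyperbolic_distance(origin, p, K)
d_pq = hyperbolic_distance(p, q, K)
d_oq = hyperbolic_distance(origin, q, K)
```

Because the three points lie on one geodesic through the origin, the distances add up along it, which is a quick sanity check on the formula.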
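The tangent space to the hyperboloid at the point $\mathbf{p} \in \mathbb{H}^n_K$ can also be described as an embedded subspace of $\mathbb{R}^{1,n}$. It is given by the set of points satisfying the orthogonality relation with respect to the Minkowski inner product (equivalently known as the Lorentz inner product),
$$\mathcal{T}_{\mathbf{p}} \mathbb{H}^n_K := \{u \in \mathbb{R}^{n+1} : \langle u, \mathbf{p} \rangle_{\mathcal{L}} = 0\}. \tag{3}$$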
Of special interest are vectors in the tangent space at the origin of $\mathbb{H}^n_K$, whose norm under the Minkowski inner product is equivalent to the conventional Euclidean norm. That is, $v \in \mathcal{T}_{\mathbf{o}} \mathbb{H}^n_K$ is a vector such that $v_0 = 0$ and $\|v\|_{\mathcal{L}} = \|v\|_2$. Thus at the origin the partial derivatives with respect to the ambient coordinates, $\mathbb{R}^{n+1}$, define the covariant derivative.
Projections. Starting from the extrinsic view, in which we consider $\mathbb{H}^n_K \subset \mathbb{R}^{n+1}$, we may project any vector $\mathbf{x} \in \mathbb{R}^{n+1}$ onto the hyperboloid using the shortest Euclidean distance:
(4) |
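Furthermore, by definition a point on the hyperboloid satisfies $\langle \mathbf{x}, \mathbf{x} \rangle_{\mathcal{L}} = 1/K$, and thus when provided with $n$ coordinates $\hat{x} = (x_1, \dots, x_n)$ we can always determine the missing coordinate to get a point on $\mathbb{H}^n_K$:
$$x_0 = \sqrt{\|\hat{x}\|_2^2 - \frac{1}{K}}. \tag{5}$$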
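The recovery of the zeroth coordinate can be sketched as follows. This assumes the constraint $\langle \mathbf{x}, \mathbf{x}\rangle_{\mathcal{L}} = 1/K$ with $K < 0$, so that $x_0 = \sqrt{\|\hat{x}\|_2^2 - 1/K}$; the helper name `lift_to_hyperboloid` is hypothetical.

```python
import math

def minkowski_dot(x, y):
    """Lorentz inner product with a negative sign on the zeroth coordinate."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift_to_hyperboloid(spatial, K=-1.0):
    """Prepend the zeroth coordinate that places `spatial` on the hyperboloid."""
    x0 = math.sqrt(sum(a * a for a in spatial) - 1.0 / K)
    return [x0] + list(spatial)

K = -0.5
x = lift_to_hyperboloid([0.3, -1.2, 2.0], K)
constraint = minkowski_dot(x, x)   # should be close to 1/K
```

Taking the positive square root selects the upper sheet, matching the requirement $x_0 > 0$.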
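Exponential Map. The exponential map takes a vector, $v$, in the tangent space of a point $\mathbf{x} \in \mathbb{H}^n_K$ to a point on the manifold, i.e., $\mathbf{y} = \exp^K_{\mathbf{x}}(v) : \mathcal{T}_{\mathbf{x}} \mathbb{H}^n_K \to \mathbb{H}^n_K$, by moving a unit length along the geodesic $\gamma$ (the straightest parametric curve), uniquely defined by $\gamma(0) = \mathbf{x}$ with direction $\gamma'(0) = v$. The closed form expression for the exponential map is then given by
$$\exp^K_{\mathbf{x}}(v) = \cosh\left(\frac{\|v\|_{\mathcal{L}}}{R}\right) \mathbf{x} + \sinh\left(\frac{\|v\|_{\mathcal{L}}}{R}\right) \frac{R\, v}{\|v\|_{\mathcal{L}}}, \tag{6}$$
where we used the generalized radius $R = 1/\sqrt{-K}$ in place of the curvature.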
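A minimal sketch of the exponential map, assuming the closed form $\cosh(\|v\|_{\mathcal{L}}/R)\,\mathbf{x} + \sinh(\|v\|_{\mathcal{L}}/R)\,R\,v/\|v\|_{\mathcal{L}}$ with $R = 1/\sqrt{-K}$. The small-norm guard is an implementation choice made here for illustration, not something stated in the text.

```python
import math

def minkowski_dot(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map(x, v, K=-1.0):
    """Move from x along the geodesic with initial velocity v (tangent at x)."""
    R = 1.0 / math.sqrt(-K)
    r = math.sqrt(max(minkowski_dot(v, v), 0.0))   # Lorentz norm of v
    if r < 1e-12:                                  # zero vector: stay at x
        return list(x)
    return [math.cosh(r / R) * xi + math.sinh(r / R) * R * vi / r
            for xi, vi in zip(x, v)]

K = -1.0
origin = [1.0 / math.sqrt(-K), 0.0, 0.0]
v = [0.0, 0.6, -0.8]          # tangent at the origin: zeroth coordinate is 0
y = exp_map(origin, v, K)
on_manifold = minkowski_dot(y, y)   # should be close to 1/K
```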
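Logarithmic Map. As the inverse of the exponential map, the logarithmic map takes a point, $\mathbf{y}$, on the manifold back to the tangent space of another point $\mathbf{x}$ also on the manifold. In the Lorentz model this is defined as
$$\log^K_{\mathbf{x}}(\mathbf{y}) = \frac{\operatorname{arccosh}(\alpha)}{\sqrt{\alpha^2 - 1}} (\mathbf{y} - \alpha \mathbf{x}), \tag{7}$$
where $\alpha = K \langle \mathbf{x}, \mathbf{y} \rangle_{\mathcal{L}}$.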
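The inverse relationship between the two maps can be checked numerically. The sketch below assumes the closed forms $\exp^K_{\mathbf{x}}(v) = \cosh(\|v\|_{\mathcal{L}}/R)\,\mathbf{x} + \sinh(\|v\|_{\mathcal{L}}/R)\,R\,v/\|v\|_{\mathcal{L}}$ and $\log^K_{\mathbf{x}}(\mathbf{y}) = \operatorname{arccosh}(\alpha)(\mathbf{y} - \alpha\mathbf{x})/\sqrt{\alpha^2 - 1}$ with $\alpha = K\langle \mathbf{x}, \mathbf{y}\rangle_{\mathcal{L}}$; the tangent vector is built by projecting an arbitrary ambient vector, which is an illustrative choice.

```python
import math

def minkowski_dot(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map(x, v, K=-1.0):
    R = 1.0 / math.sqrt(-K)
    r = math.sqrt(max(minkowski_dot(v, v), 0.0))
    if r < 1e-12:
        return list(x)
    return [math.cosh(r / R) * xi + math.sinh(r / R) * R * vi / r
            for xi, vi in zip(x, v)]

def log_map(x, y, K=-1.0):
    """Tangent vector at x that the exponential map sends to y."""
    alpha = K * minkowski_dot(x, y)
    if alpha - 1.0 < 1e-12:                 # y coincides with x
        return [0.0] * len(x)
    coef = math.acosh(alpha) / math.sqrt(alpha * alpha - 1.0)
    return [coef * (yi - alpha * xi) for xi, yi in zip(x, y)]

K = -1.0
x = [math.sqrt(1.34), 0.5, -0.3]            # satisfies <x, x>_L = -1
w = [0.0, 1.0, 0.5]                         # arbitrary ambient vector
inner = minkowski_dot(x, w)
v = [wi - K * inner * xi for wi, xi in zip(w, x)]   # projection onto T_x
y = exp_map(x, v, K)
v_recovered = log_map(x, y, K)
round_trip_error = max(abs(a - b) for a, b in zip(v, v_recovered))
```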
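Parallel Transport. The parallel transport for two points $\mathbf{x}, \mathbf{y} \in \mathbb{H}^n_K$ is a map that carries the vectors in $\mathcal{T}_{\mathbf{x}} \mathbb{H}^n_K$ to corresponding vectors in $\mathcal{T}_{\mathbf{y}} \mathbb{H}^n_K$ along the geodesic. That is, vectors are connected between the two tangent spaces such that the covariant derivative is unchanged. Parallel transport is a map that preserves the metric, i.e., $\langle \mathrm{PT}^K_{\mathbf{x} \to \mathbf{y}}(v), \mathrm{PT}^K_{\mathbf{x} \to \mathbf{y}}(v') \rangle_{\mathcal{L}} = \langle v, v' \rangle_{\mathcal{L}}$, and in the Lorentz model is given by
$$\mathrm{PT}^K_{\mathbf{x} \to \mathbf{y}}(v) = v - \frac{K \langle \mathbf{y}, v \rangle_{\mathcal{L}}}{1 + \alpha} (\mathbf{x} + \mathbf{y}), \tag{8}$$
where $\alpha = K \langle \mathbf{x}, \mathbf{y} \rangle_{\mathcal{L}}$ is as defined above. Another useful property is that the inverse parallel transport simply carries the vectors back along the geodesic and is simply defined as $(\mathrm{PT}^K_{\mathbf{x} \to \mathbf{y}})^{-1}(v) = \mathrm{PT}^K_{\mathbf{y} \to \mathbf{x}}(v)$.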
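The metric-preserving and inverse properties of parallel transport can be verified with a short sketch. It assumes the form $v - K\langle \mathbf{y}, v\rangle_{\mathcal{L}}(\mathbf{x} + \mathbf{y})/(1 + \alpha)$ with $\alpha = K\langle \mathbf{x}, \mathbf{y}\rangle_{\mathcal{L}}$, which reduces to the expression of nagano2019wrapped when $K = -1$; the function name is illustrative.

```python
import math

def minkowski_dot(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def parallel_transport(x, y, v, K=-1.0):
    """Carry v from the tangent space at x to the tangent space at y."""
    alpha = K * minkowski_dot(x, y)
    coef = -K * minkowski_dot(y, v) / (1.0 + alpha)
    return [vi + coef * (xi + yi) for xi, yi, vi in zip(x, y, v)]

K = -1.0
origin = [1.0, 0.0, 0.0]
y = [math.cosh(1.5), math.sinh(1.5) * 0.6, math.sinh(1.5) * 0.8]
v = [0.0, 0.7, -0.2]                         # tangent at the origin

u = parallel_transport(origin, y, v, K)      # now tangent at y
v_back = parallel_transport(y, origin, u, K) # inverse transport

tangency = minkowski_dot(y, u)               # should be close to 0
norm_gap = minkowski_dot(u, u) - minkowski_dot(v, v)
back_error = max(abs(a - b) for a, b in zip(v, v_back))
```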
Probability distributions can be defined on Riemannian manifolds, which include $\mathbb{H}^n_K$ as a special case. One transforms the infinitesimal volume element on the manifold to the corresponding volume element in $\mathbb{R}^n$ as defined by the coordinate charts. In particular, given the Riemannian manifold $\mathcal{M}$ and its metric $g$, we have $\int p(x)\, d\mathcal{M}(x) = \int p(x) \sqrt{|g|}\, dx$, where $dx$ is the Lebesgue measure. We now briefly survey three distinct generalizations of the normal distribution to Riemannian manifolds.
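Riemannian Normal. The first is the Riemannian normal (pennec2006intrinsic; said2014new), which is derived from maximizing the entropy given a Fréchet mean $\mu$ and a dispersion parameter $\sigma$. Specifically, we have $\mathcal{N}_{\mathcal{M}}(z \mid \mu, \sigma^2) = \frac{1}{Z} \exp\left(-d_{\mathcal{M}}(\mu, z)^2 / 2\sigma^2\right)$, where $d_{\mathcal{M}}$ is the induced distance and $Z$ is the normalization constant (said2014new; mathieu2019continuous).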
Restricted Normal. One can also restrict sampled points from the normal distribution in the ambient space to the manifold. One example is the Von Mises distribution on the unit circle and its generalized version, i.e., Von Mises-Fisher distribution on the hypersphere (davidson2018hyperspherical).
Wrapped Normal. Finally, we can define a wrapped normal distribution (falorsi2019reparameterizing; nagano2019wrapped), which is obtained by (1) sampling $\tilde{v}$ from a normal distribution on $\mathbb{R}^n$ and then transforming it to a point $v \in \mathcal{T}_{\mathbf{o}} \mathbb{H}^n_K$ by concatenating $0$ as the zeroth coordinate; (2) parallel transporting the sample $v$ from the tangent space at $\mathbf{o}$ to the tangent space of another point $\boldsymbol{\mu}$ on the manifold to obtain $u$; (3) mapping $u$ from the tangent space to the manifold using the exponential map at $\boldsymbol{\mu}$. Sampling from such a distribution is straightforward and the probability density can be obtained via the change of variable formula,
$$\log p(\mathbf{z}) = \log p(v) - (n - 1) \log\left(\frac{R \sinh\left(\|u\|_{\mathcal{L}} / R\right)}{\|u\|_{\mathcal{L}}}\right), \tag{9}$$
where $p(\mathbf{z})$ is the wrapped normal distribution and $p(v)$ is the normal distribution in the tangent space of $\mathbf{o}$.
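The three sampling steps can be sketched in a few lines. This is an illustrative reading rather than the authors' implementation: it assumes an isotropic normal with standard deviation `sigma` in the tangent space at the origin, the parallel transport and exponential map in their standard Lorentz-model closed forms, and a log-density correction of $(n-1)\log(R\sinh(\|u\|_{\mathcal{L}}/R)/\|u\|_{\mathcal{L}})$.

```python
import math
import random

def minkowski_dot(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map(x, v, K):
    R = 1.0 / math.sqrt(-K)
    r = math.sqrt(max(minkowski_dot(v, v), 0.0))
    if r < 1e-12:
        return list(x)
    return [math.cosh(r / R) * xi + math.sinh(r / R) * R * vi / r
            for xi, vi in zip(x, v)]

def parallel_transport(x, y, v, K):
    coef = -K * minkowski_dot(y, v) / (1.0 + K * minkowski_dot(x, y))
    return [vi + coef * (xi + yi) for xi, yi, vi in zip(x, y, v)]

def sample_wrapped_normal(mu, sigma, K, rng):
    """Return (sample on the manifold, its log density, Euclidean log density)."""
    n = len(mu) - 1
    R = 1.0 / math.sqrt(-K)
    origin = [R] + [0.0] * n
    v_tilde = [rng.gauss(0.0, sigma) for _ in range(n)]   # step 1: sample in R^n
    v = [0.0] + v_tilde                                   # tangent at the origin
    u = parallel_transport(origin, mu, v, K)              # step 2: move to T_mu
    z = exp_map(mu, u, K)                                 # step 3: wrap onto manifold
    log_normal = sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
                     - 0.5 * (a / sigma) ** 2 for a in v_tilde)
    r = math.sqrt(sum(a * a for a in v_tilde))            # equals ||u||_L
    log_det = (n - 1) * math.log(R * math.sinh(r / R) / r) if r > 1e-12 else 0.0
    return z, log_normal - log_det, log_normal

K = -1.0
mu = [math.sqrt(1.0 + 0.8 ** 2 + 0.4 ** 2), 0.8, -0.4]
z, log_p, log_normal = sample_wrapped_normal(mu, 0.5, K, random.Random(0))
```

Since $\sinh(r)/r \ge 1$, the wrapped density never exceeds the tangent-space density at the corresponding point, which reflects the expansion of volume away from $\boldsymbol{\mu}$.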
We seek to define flexible and learnable distributions on $\mathbb{H}^n_K$, which will allow us to learn rich approximate posterior distributions for hierarchical data. To do so, we design a class of invertible parametric hyperbolic functions, $f_i : \mathbb{H}^n_K \to \mathbb{H}^n_K$. A sample from the approximate posterior can then be obtained by first sampling from a simple base distribution $\mathbf{z}_0 \sim p(\mathbf{z})$ defined on $\mathbb{H}^n_K$ and then applying a composition of functions from this class: $\mathbf{z}_j = f_j \circ f_{j-1} \circ \dots \circ f_1(\mathbf{z}_0)$.
In order to ensure effective and tractable learning, the class of functions must satisfy three key desiderata:
1. Each function $f_i$ must be invertible.
2. We must be able to efficiently sample from the final distribution, $p(\mathbf{z}_j)$.
3. We must be able to efficiently compute the associated change in volume, i.e., the Jacobian determinant, of the overall transformation.
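Given these requirements, the final transformed distribution is given by the change of variables formula:
$$\log p(\mathbf{z}_j) = \log p(\mathbf{z}_0) - \sum_{i=1}^{j} \log \left|\det \frac{\partial f_i}{\partial \mathbf{z}_{i-1}}\right|. \tag{10}$$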
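The change of variables formula is easiest to see in one Euclidean dimension, where it can be compared against a closed form. The sketch below is only an illustration of the bookkeeping: the two affine layers are hypothetical, and their composition maps a standard normal sample $z_0$ to $6 z_0$, whose density is known exactly.

```python
import math

def standard_normal_logpdf(z):
    return -0.5 * math.log(2 * math.pi) - 0.5 * z * z

# Each layer is (forward function, log|det Jacobian|); both are hypothetical.
layers = [
    (lambda z: 2.0 * z + 1.0, math.log(2.0)),
    (lambda z: 3.0 * z - 3.0, math.log(3.0)),
]

def flow_log_density(z0):
    """Push z0 through all layers and accumulate the log density."""
    z, log_p = z0, standard_normal_logpdf(z0)
    for forward, log_det in layers:
        z = forward(z)
        log_p -= log_det          # subtract each layer's change in volume
    return z, log_p

z_final, log_p_final = flow_log_density(0.7)

# Closed form: z_final = 6 * z0 is normal with mean 0 and standard deviation 6.
closed_form = (-0.5 * math.log(2 * math.pi) - math.log(6.0)
               - 0.5 * (z_final / 6.0) ** 2)
```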
Functions satisfying desiderata 1-3 in Euclidean space are often termed normalizing flows (Appendix B), and our work extends this idea to hyperbolic spaces. In the following sections, we describe two flows of increasing complexity, Tangent Coupling (TC) and Wrapped Hyperboloid Coupling (WHC). The first approach lifts a standard Euclidean flow to the tangent space at the origin of the hyperboloid. The second approach modifies the flow to explicitly utilize hyperbolic geometry. Figure 2 illustrates synthetic densities as learned by our approach.
Similar to the Wrapped Normal distribution (Section 2.2), one strategy to define a normalizing flow on the hyperboloid is to use the tangent space at the origin. That is, we first sample a point from our base distribution—which we define to be a Wrapped Normal—and use the logarithmic map at the origin to transport it to the corresponding tangent space. Once we arrive at the tangent space we are free to apply any Euclidean flow before finally projecting back to the manifold using the exponential map. This approach leverages the fact that the tangent bundle of a hyperbolic manifold has a well-defined vector space structure, allowing affine transformations and other operations that are ill-defined on the manifold itself.
Following this idea, we build upon one of the earliest and most well-studied flows: the RealNVP flow (dinh2016density). At its core, the RealNVP flow uses a computationally symmetric transformation (the affine coupling layer), which has the benefit of being fast to evaluate and invert due to its lower triangular Jacobian, whose determinant is cheap to compute. Operationally, the coupling layer is implemented using a binary mask and partitions some input $\tilde{x}$ into two sets, where the first set, $\tilde{x}_1 := \tilde{x}_{1:d}$, is transformed elementwise independently of other dimensions. The second set, $\tilde{x}_2 := \tilde{x}_{d+1:n}$, is also transformed elementwise but in a way that depends on the first set (see Appendix B.2 for more details). Since all coupling layer operations occur at $\mathcal{T}_{\mathbf{o}} \mathbb{H}^n_K$, we term this form of coupling Tangent Coupling (TC).
Thus the overall transformation due to one layer of our TC flow is a composition of a logarithmic map, affine coupling defined on $\mathcal{T}_{\mathbf{o}} \mathbb{H}^n_K$, and an exponential map:
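$$\tilde{f}^{TC}(\tilde{x}) = \begin{cases} \tilde{z}_1 = \tilde{x}_1 \\ \tilde{z}_2 = \tilde{x}_2 \odot \sigma(s(\tilde{x}_1)) + t(\tilde{x}_1) \end{cases} \qquad f^{TC}(\mathbf{x}) = \exp^K_{\mathbf{o}}\left(\tilde{f}^{TC}\left(\log^K_{\mathbf{o}}(\mathbf{x})\right)\right) \tag{11}$$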
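where $\tilde{x} = \log^K_{\mathbf{o}}(\mathbf{x})$ is a point on $\mathcal{T}_{\mathbf{o}} \mathbb{H}^n_K$, and $\sigma$ is a pointwise non-linearity such as the exponential function. Functions $s$ and $t$ are parameterized scale and translation functions implemented as neural nets from $\mathcal{T}_{\mathbf{o}} \mathbb{H}^d_K \to \mathcal{T}_{\mathbf{o}} \mathbb{H}^{n-d}_K$. One important detail is that arbitrary operations on a tangent vector may transport the resultant vector outside the tangent space, hampering subsequent operations. To avoid this we can keep the first dimension fixed at $0$ to ensure we remain in $\mathcal{T}_{\mathbf{o}} \mathbb{H}^n_K$.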
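One TC layer and its inverse can be sketched as below. Tangent vectors at the origin are stored by their $n$ spatial coordinates, with the zeroth coordinate implicitly fixed at $0$. The functions `scale_net` and `shift_net` are hypothetical closed-form stand-ins for the neural nets $s$ and $t$, and $\sigma$ is taken to be the exponential function.

```python
import math

def log_o(x, K):
    """Logarithmic map at the origin; returns the n spatial tangent coordinates."""
    R = 1.0 / math.sqrt(-K)
    norm = math.sqrt(sum(a * a for a in x[1:]))
    if norm < 1e-12:
        return [0.0] * (len(x) - 1)
    r = R * math.acosh(max(x[0] / R, 1.0))
    return [r * a / norm for a in x[1:]]

def exp_o(v, K):
    """Exponential map at the origin applied to spatial tangent coordinates."""
    R = 1.0 / math.sqrt(-K)
    r = math.sqrt(sum(a * a for a in v))
    if r < 1e-12:
        return [R] + [0.0] * len(v)
    return [R * math.cosh(r / R)] + [R * math.sinh(r / R) * a / r for a in v]

def scale_net(x1, n2):   # hypothetical stand-in for the scale network s
    return [0.5 * math.tanh(sum(x1) + j) for j in range(n2)]

def shift_net(x1, n2):   # hypothetical stand-in for the translation network t
    return [0.1 * sum(x1) - 0.2 * j for j in range(n2)]

def tc_forward(x, d, K):
    v = log_o(x, K)
    v1, v2 = v[:d], v[d:]
    s, t = scale_net(v1, len(v2)), shift_net(v1, len(v2))
    z2 = [a * math.exp(b) + c for a, b, c in zip(v2, s, t)]
    return exp_o(v1 + z2, K)

def tc_inverse(y, d, K):
    z = log_o(y, K)
    z1, z2 = z[:d], z[d:]
    s, t = scale_net(z1, len(z2)), shift_net(z1, len(z2))
    x2 = [(a - c) * math.exp(-b) for a, b, c in zip(z2, s, t)]
    return exp_o(z1 + x2, K)

K = -1.0
x = exp_o([0.3, -0.5, 0.8, 0.1], K)
y = tc_forward(x, 2, K)
x_rec = tc_inverse(y, 2, K)
inverse_error = max(abs(a - b) for a, b in zip(x, x_rec))
```

The inverse only needs $s$ and $t$ evaluated on the unchanged first partition, so neither network has to be invertible.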
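Similar to the Euclidean RealNVP, we need an efficient expression for the Jacobian determinant of $f^{TC}$.
Proposition 1. The Jacobian determinant of a single TC layer $\mathbf{y} = f^{TC}(\mathbf{x})$ is
$$\left|\det \frac{\partial \mathbf{y}}{\partial \mathbf{x}}\right| = \left(\frac{R \sinh(\|\tilde{z}\|_{\mathcal{L}} / R)}{\|\tilde{z}\|_{\mathcal{L}}}\right)^{n-1} \times \prod_{i=d+1}^{n} \sigma(s(\tilde{x}_1))_i \times \left(\frac{R \sinh(\|\tilde{x}\|_{\mathcal{L}} / R)}{\|\tilde{x}\|_{\mathcal{L}}}\right)^{1-n}, \tag{12}$$
where $\tilde{x} = \log^K_{\mathbf{o}}(\mathbf{x})$ and $\tilde{z} = \tilde{f}^{TC}(\tilde{x})$.
Proof sketch. Here we only provide a sketch of the proof; details can be found in Appendix C. First, observe that the overall transformation is a valid composition of functions: $\mathbf{y} = \exp^K_{\mathbf{o}} \circ \tilde{f}^{TC} \circ \log^K_{\mathbf{o}}(\mathbf{x})$. Thus, the overall determinant can be computed by the chain rule and the identity $\det(AB) = \det(A) \det(B)$. Tackling each function in the composition individually, the Jacobian determinant of the exponential map is $\left(R \sinh(\|\tilde{z}\|_{\mathcal{L}} / R) / \|\tilde{z}\|_{\mathcal{L}}\right)^{n-1}$, as derived in skopek2019mixed. As the logarithmic map is the inverse of the exponential map, its Jacobian determinant is simply the inverse of the determinant of the exponential map, which gives the term with exponent $1 - n$. For the middle term, we must calculate the directional derivative of $\tilde{f}^{TC}$ in an orthonormal basis, w.r.t. the Lorentz inner product, of $\mathcal{T}_{\mathbf{o}} \mathbb{H}^n_K$. Since the standard Euclidean basis vectors are also a basis for $\mathcal{T}_{\mathbf{o}} \mathbb{H}^n_K$, the Jacobian determinant simplifies to that of the RealNVP flow, which is lower triangular and is thus efficiently computable in $\mathcal{O}(n)$ time. ∎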
It is remarkable that the middle term in Proposition 1 is precisely the same change in volume associated with affine coupling in RealNVP. The change in volume due to the hyperbolic space only manifests itself through the exponential and logarithmic maps, each of which can be computed in $\mathcal{O}(n)$ cost. Thus, the overall cost is only slightly larger than the regular Euclidean RealNVP, but still $\mathcal{O}(n)$.
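The three-term structure of the change in volume can be sketched as a log-determinant. The code below is a hedged reading of the proof sketch rather than the authors' implementation: it assumes the exponential map at the origin contributes a factor $(R\sinh(r/R)/r)^{n-1}$ for a tangent vector of norm $r$, that the logarithmic map contributes the inverse factor, and that $\sigma$ is the exponential function so the coupling term is the sum of the scale outputs. All function names are illustrative.

```python
import math

def log_sinh_ratio(r, K):
    """log(R * sinh(r / R) / r); tends to 0 as r goes to 0."""
    R = 1.0 / math.sqrt(-K)
    if r < 1e-8:
        return 0.0
    return math.log(R * math.sinh(r / R) / r)

def tc_log_det(x_tangent, z_tangent, scales, K):
    """Log |det| of one TC layer from tangent coordinates before and after coupling.

    x_tangent, z_tangent: n spatial coordinates in the tangent space at the origin.
    scales: outputs of the scale function s on the first partition (sigma = exp).
    """
    n = len(x_tangent)
    r_in = math.sqrt(sum(a * a for a in x_tangent))
    r_out = math.sqrt(sum(a * a for a in z_tangent))
    exp_term = (n - 1) * log_sinh_ratio(r_out, K)    # final exponential map
    log_term = -(n - 1) * log_sinh_ratio(r_in, K)    # initial logarithmic map
    coupling_term = sum(scales)                      # log prod_i exp(s_i)
    return exp_term + coupling_term + log_term

K = -1.0
x_tan = [0.3, -0.5, 0.8, 0.1]
s_vals = [0.2, -0.1]
z_tan = x_tan[:2] + [0.8 * math.exp(0.2) + 0.05, 0.1 * math.exp(-0.1) - 0.3]
value = tc_log_det(x_tan, z_tan, s_vals, K)
identity_value = tc_log_det(x_tan, x_tan, [0.0, 0.0], K)
flat_limit = tc_log_det([1e-9, 0.0], [2e-9, 0.0], [0.4], K)
```

Each term is a sum or a norm over $n$ coordinates, which is where the linear cost comes from; near the origin the hyperbolic terms vanish and only the Euclidean coupling term remains.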
The hyperbolic normalizing flow with TC layers discussed above operates purely in the tangent space at the origin. This simplifies the computation of the Jacobian determinant, but anchoring the flow at the origin may hinder its expressive power and its ability to leverage disparate regions of the manifold. In this section, we remedy this shortcoming with a new hyperbolic flow that performs translations between tangent spaces via parallel transport.
We term this transformation Wrapped Hyperboloid Coupling (WHC). As with the TC layer, it is a fully invertible transformation with a tractable analytic form for the Jacobian determinant. To define a WHC layer we first use the logarithmic map at the origin to transport a point to the tangent space. We employ the coupling strategy previously discussed and partition our input vector into two components: $\tilde{x}_1 := \tilde{x}_{1:d}$ and $\tilde{x}_2 := \tilde{x}_{d+1:n}$. Let $\tilde{x} = \log^K_{\mathbf{o}}(\mathbf{x})$ be the point on $\mathcal{T}_{\mathbf{o}} \mathbb{H}^n_K$ after the logarithmic map. The remainder of the WHC layer can be defined as follows:
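$$\tilde{f}^{WHC}(\tilde{x}) = \begin{cases} \tilde{z}_1 = \tilde{x}_1 \\ \tilde{z}_2 = \log^K_{\mathbf{o}}\left(\exp^K_{t(\tilde{x}_1)}\left(\mathrm{PT}_{\mathbf{o} \to t(\tilde{x}_1)}(v)\right)\right), \quad v = \tilde{x}_2 \odot \sigma(s(\tilde{x}_1)) \end{cases} \qquad f^{WHC}(\mathbf{x}) = \exp^K_{\mathbf{o}}\left(\tilde{f}^{WHC}\left(\log^K_{\mathbf{o}}(\mathbf{x})\right)\right) \tag{13}$$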
Functions $s$ and $t$ are taken to be arbitrary neural nets, but the role of $t$ when compared to TC is vastly different. In particular, the generalization of translation on Riemannian manifolds can be viewed as parallel transport to a different tangent space. Consequently, in Eq. 13, the function $t$ predicts a point on the manifold that we wish to parallel transport to. This greatly increases the flexibility as we are no longer confined to the tangent space at the origin. The logarithmic map is then used to ensure that both $\tilde{z}_1$ and $\tilde{z}_2$ are in the same tangent space before the final exponential map that projects the point to the manifold.
One important consideration in the construction of $t$ is that it should only parallel transport functions of $\tilde{x}_2$. However, the output of $t$ is a point on $\mathbb{H}^n_K$ and without care this can involve elements in $\tilde{x}_1$. To prevent such a scenario we construct the output of $t$ as $[t_0, 0, \dots, 0, t_{d+1}, \dots, t_n]$, where the elements $t_{d+1:n}$ are used to determine the value of $t_0$ using Eq. 5, such that it is a point on the manifold and every remaining index is set to zero. Such a construction ensures that only components of any function of $\tilde{x}_2$ are parallel transported as desired. Figure 3 illustrates the transformation performed by the WHC layer.
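Inverse of WHC. To invert the flow it is sufficient to show that the argument to the final exponential map at the origin is itself invertible. Furthermore, note that $\tilde{x}_1$ undergoes an identity mapping and is trivially invertible. Thus we need to show that the second partition is invertible, i.e., that the following transformation is invertible:
$$\tilde{z}_2 = \log^K_{\mathbf{o}}\left(\exp^K_{t(\tilde{x}_1)}\left(\mathrm{PT}_{\mathbf{o} \to t(\tilde{x}_1)}(v)\right)\right). \tag{14}$$
As discussed in Section 2, the parallel transport, exponential map, and logarithmic map all have well-defined inverses with closed forms. Thus, the overall transformation is invertible in closed form:
$$v = \mathrm{PT}_{t(\tilde{z}_1) \to \mathbf{o}}\left(\log^K_{t(\tilde{z}_1)}\left(\exp^K_{\mathbf{o}}(\tilde{z}_2)\right)\right), \qquad \tilde{x}_2 = v \odot \sigma(s(\tilde{z}_1))^{-1}.$$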
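The forward pass and the closed-form inverse of the coupling on the second partition can be sketched as follows. This is an illustrative reading of the construction, not the authors' code: `scale_net` and `shift_point` are hypothetical closed-form stand-ins for the neural nets $s$ and $t$, $\sigma$ is taken to be the exponential function, the output of $t$ is zero on the indices of the first partition, and the maps use their standard Lorentz-model closed forms. The layer acts on tangent coordinates at the origin; the outer logarithmic and exponential maps are omitted.

```python
import math

def mdot(x, y):
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map(x, v, K):
    R = 1.0 / math.sqrt(-K)
    r = math.sqrt(max(mdot(v, v), 0.0))
    if r < 1e-12:
        return list(x)
    return [math.cosh(r / R) * xi + math.sinh(r / R) * R * vi / r
            for xi, vi in zip(x, v)]

def log_map(x, y, K):
    alpha = K * mdot(x, y)
    if alpha - 1.0 < 1e-12:
        return [0.0] * len(x)
    coef = math.acosh(alpha) / math.sqrt(alpha * alpha - 1.0)
    return [coef * (yi - alpha * xi) for xi, yi in zip(x, y)]

def transport(x, y, v, K):
    coef = -K * mdot(y, v) / (1.0 + K * mdot(x, y))
    return [vi + coef * (xi + yi) for xi, yi, vi in zip(x, y, v)]

def scale_net(x1, n2):            # hypothetical stand-in for s
    return [0.3 * math.tanh(sum(x1) + j) for j in range(n2)]

def shift_point(x1, n2, K):       # hypothetical stand-in for t, a manifold point
    tail = [0.5 * math.sin(sum(x1) + j) for j in range(n2)]
    t0 = math.sqrt(sum(a * a for a in tail) - 1.0 / K)   # zeroth coordinate
    return [t0] + [0.0] * len(x1) + tail

def whc_forward(x1, x2, K):
    origin = [1.0 / math.sqrt(-K)] + [0.0] * (len(x1) + len(x2))
    s, t = scale_net(x1, len(x2)), shift_point(x1, len(x2), K)
    v = [0.0] * (1 + len(x1)) + [a * math.exp(b) for a, b in zip(x2, s)]
    moved = exp_map(t, transport(origin, t, v, K), K)
    z = log_map(origin, moved, K)
    return list(x1), z[1 + len(x1):]

def whc_inverse(z1, z2, K):
    origin = [1.0 / math.sqrt(-K)] + [0.0] * (len(z1) + len(z2))
    s, t = scale_net(z1, len(z2)), shift_point(z1, len(z2), K)
    q = exp_map(origin, [0.0] * (1 + len(z1)) + list(z2), K)
    v = transport(t, origin, log_map(t, q, K), K)
    return list(z1), [a * math.exp(-b) for a, b in zip(v[1 + len(z1):], s)]

K = -1.0
x1, x2 = [0.2, -0.1], [0.4, 0.7]
z1, z2 = whc_forward(x1, x2, K)
r1, r2 = whc_inverse(z1, z2, K)
inverse_error = max(abs(a - b) for a, b in zip(x2, r2))
```

Because the transported vector and the target point are both zero on the first partition's indices, the result of the inner logarithmic map is zero there as well, so $\tilde{z}_1$ is left untouched.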
Properties of WHC. To compute the Jacobian determinant of the full transformation in Eq. 13 we proceed by analyzing the effect of WHC on valid orthonormal bases w.r.t. the Lorentz inner product for the tangent space at the origin. We state our main result here and provide a sketch of the proof, while the entire proof can be found in Appendix D.
The Jacobian determinant of the function $\tilde{f}^{WHC}$ in Eq. 13 is:
(15) |
where , the constant , is a non-linearity, and .
We first note that the exponential and logarithmic maps applied at the beginning and end of the WHC layer can be dealt with by appealing to the chain rule and the known Jacobian determinants for these functions as used in Proposition 1. Thus, what remains is the Jacobian determinant of the coupling on the second partition in Eq. 14. To evaluate this term we rely on the following Lemma.
Let be a function defined as:
(16) |
Now, define a function which acts on the subspace of corresponding to the standard basis elements as
(17) |
where denotes the portion of the vector corresponding to the standard basis elements and and are constants (which depend on ). In Equation 17, we use to denote the vector corresponding to only the dimensions and similarly for . Then we have that
(18) |
The proof for Lemma 1 is provided in Appendix D. Using Lemma 1, and the fact that the Jacobian determinant of parallel transport is $1$ (nagano2019wrapped), we are left with another composition of functions, but on the lower-dimensional subspace. The Jacobian determinants for these functions are simply those of the logarithmic map, the exponential map, and the argument to the parallel transport, the last of which can be easily computed as $\prod_{i=d+1}^{n} \sigma(s(\tilde{x}_1))_i$. ∎
The cost of computing the change in volume for one WHC layer is $\mathcal{O}(n)$, which is the same as a TC layer plus the added cost of the two new maps that operate on the lower subspace of basis elements.
We evaluate our TC-flow and WHC-flow on three tasks: structured density estimation, graph reconstruction, and graph generation. (Code is included with the submission and will be released.) Throughout our experiments, we rely on three main baselines. In Euclidean space, we use Gaussian latent variables and affine coupling flows (dinh2016density), denoted $\mathcal{N}$ and $\mathcal{N}C$, respectively. In the Lorentz model, we use Wrapped Normal latent variables, $\mathbb{H}$, as an analogous baseline (nagano2019wrapped). Since all model parameters are defined on tangent spaces, models can be trained with conventional optimizers like Adam (kingma2014adam). Following previous work, we also consider the curvature $K$ as a learnable parameter and we clamp the maximum norm of vectors to a fixed value before any logarithmic or exponential map (skopek2019mixed). Appendix E contains details on model architectures and implementation concerns.
We first consider structured density estimation in a canonical VAE setting (kingma2013auto), where we seek to learn rich approximate posteriors using normalizing flows and evaluate the marginal log-likelihood of test data. Following work on hyperbolic VAEs, we test the approaches on a branching diffusion process (BDP) and dynamically binarized MNIST (mathieu2019continuous; skopek2019mixed).
Our results are shown in Tables 1 and 2. On both datasets we observe that our hyperbolic flows provide significant improvements when using latent spaces of low dimension. This result matches theoretical expectations (e.g., that trees can be perfectly embedded in $\mathbb{H}^2_K$) and dovetails with previous work on graph embedding (nickel2017poincare). This highlights that the benefit of leveraging hyperbolic space is most prominent in small dimensions. However, as we increase the latent dimension, the Euclidean approaches can compensate for this intrinsic geometric limitation.
Model | BDP-2 | BDP-4 | BDP-6 |
---|---|---|---|
-VAE | |||
-VAE | |||
Model | MNIST 2 | MNIST 4 | MNIST 6 |
---|---|---|---|
-VAE | |||
-VAE | |||
We evaluate the practical utility of our hyperbolic flows by conducting experiments on the task of link prediction using graph neural networks (GNNs) (scarselli2008graph) as an inference model. Given a simple graph $\mathcal{G} = (\mathcal{V}, A, X)$, defined by a set of nodes $\mathcal{V}$, an adjacency matrix $A \in \mathbb{Z}^{|\mathcal{V}| \times |\mathcal{V}|}$ and node feature matrix $X \in \mathbb{R}^{|\mathcal{V}| \times n}$, we learn a VGAE (kipf2016variational) model whose inference network, $q_\phi$, defines a distribution over node embeddings $q_\phi(Z \mid A, X)$. To score the likelihood of an edge existing between pairs of nodes we use an inner product decoder: $p(A_{u,v} = 1 \mid z_u, z_v) = \sigma(z_u^\top z_v)$ (with dot products computed in $\mathcal{T}_{\mathbf{o}} \mathbb{H}^n_K$ when necessary). Given these components, the inference GNNs are trained to maximize the variational lower bound on a training set of edges.
We use two different disease datasets taken from (chami2019hyperbolic) and (mathieu2019continuous) for evaluation purposes. (We uncovered issues with the two remaining datasets in (mathieu2019continuous) and thus omit them; see Appendix F.) The first dataset, Diseases-I, is composed of a network of disorders and disease genes linked by the known disorder–gene associations (goh2007human). In the second dataset, Diseases-II, we build tree networks of a SIR disease spreading model (anderson1992infectious), where node features determine the susceptibility to the disease. In Table 3 we report the AUC and average precision (AP) on the test set. We observe consistent improvements when using hyperbolic flows. Similar to the structured density estimation setting, the performance gains of the hyperbolic flows are best observed in low-dimensional latent spaces.
Model | Dis-I AUC | Dis-I AP | Dis-II AUC | Dis-II AP |
---|---|---|---|---|
-VAE | ||||
-VAE | ||||
Finally, we explore the utility of our hyperbolic flows for generating hierarchical structures. As a synthetic testbed, we construct datasets containing uniformly random trees as well as uniformly random lobster graphs (golomb1996polyominoes), where each graph contains between 20 and 100 nodes. We then train a generative model to learn the distribution of these graphs. We expect the hyperbolic flows to provide a significant benefit for generating valid random trees, as well as learning the distribution of lobster graphs, which are a special subset of trees.
We follow the two-stage training procedure outlined in Graph Normalizing Flows (liu2019graph), in that we first train an autoencoder to give node-level latents, on which we then train an NF for density estimation. Empirically, we find that using GRevNets (liu2019graph) and defining edge probabilities using a distance-based decoder consistently leads to better generation performance. Thus we define edge probabilities as $p(A_{u,v} = 1 \mid z_u, z_v) = \sigma\left((-d_{\mathcal{G}}(u, v) - b) / \tau\right)$, where $b$ and $\tau$ are learned edge-specific bias and temperature parameters. At inference time, we first sample the number of nodes to generate from the empirical distribution of the dataset. We then independently sample node latents from our prior, beginning with a fully connected graph, and then push these samples through our learned flow to give refined edge probabilities.
To evaluate the various approaches, we construct training graphs for each dataset to train our model. Figure 4 shows representative samples generated by the various approaches. We see that hyperbolic normalizing flows learn to generate tree-like graphs and also match the specific properties of the lobster graph distribution, whereas the Euclidean flow model tends to generate densely connected graphs with many cycles (or else disconnected graphs). To quantify these intuitions, Table 4 contains statistics on how often the different models generate valid trees (denoted by "accuracy"), as well as the average number of triangles and the average global clustering coefficients for the generated graphs. Since the target data is random trees, a perfect model would achieve 100% accuracy, with no triangles, and a global clustering of 0 for all graphs. We see that the hyperbolic models generate valid trees more often, and they generate graphs with fewer triangles and lower clustering on average. Finally, to evaluate how well the models match the specific properties of the lobster graphs, we follow liao2019efficient and report the MMD distance between the generated graphs and a test set for various graph statistics (Figure 5). Again, we see that the hyperbolic approaches significantly outperform the Euclidean normalizing flow.
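The three statistics reported for generated graphs can be computed with a few lines of standard-library code. The sketch below is one plausible way to do so and is not taken from the authors' evaluation code: a graph counts as a valid tree when it is connected and has exactly $|V| - 1$ edges, and global clustering is taken to be three times the number of triangles divided by the number of connected triples.

```python
def graph_stats(num_nodes, edges):
    """Return (is_tree, triangle_count, global_clustering) for an undirected graph."""
    adj = {u: set() for u in range(num_nodes)}
    for u, v in edges:
        if u != v:                       # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    num_edges = sum(len(nb) for nb in adj.values()) // 2

    # Connectivity via depth-first search from node 0.
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    is_tree = len(seen) == num_nodes and num_edges == num_nodes - 1

    # Count each triangle once by enforcing u < v < w.
    triangles = sum(1 for u in adj for v in adj[u] if v > u
                    for w in adj[v] if w > v and w in adj[u])
    # Connected triples: pairs of neighbours around each centre node.
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    clustering = 3.0 * triangles / triples if triples else 0.0
    return is_tree, triangles, clustering

path_stats = graph_stats(4, [(0, 1), (1, 2), (2, 3)])
cyclic_stats = graph_stats(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
```

A path graph is a valid tree with no triangles, while adding a single chord creates one triangle and a non-zero clustering coefficient.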
Model | Accuracy | Avg. Clust. | Avg. GC. |
---|---|---|---|
Hyperbolic Geometry in Machine Learning. The intersection of hyperbolic geometry and machine learning has recently risen to prominence (dhingra2018embedding; tay2018hyperbolic; law2019lorentzian; khrulkov2019hyperbolic; ovinnikov2019poincar). Early prior work proposed to embed data into the Poincaré ball model (nickel2017poincare; chamberlain2017neural). The equivalent Lorentz model was later shown to have better numerical stability properties (nickel2018learning), and recent work has leveraged even more stable tiling approaches (yu2019numerically). Recent works have extended several conventional deep learning modules (e.g., dense neural network layers and GNN architectures) to hyperbolic space (gulcehre2018hyperbolic; ganea2018hyperbolic; chami2019hyperbolic). Latent variable models on hyperbolic space have also been investigated in the context of VAEs, using generalizations of the normal distribution (nagano2019wrapped; mathieu2019continuous). In contrast, our work learns a flexible approximate posterior using a novel normalizing flow designed to use the geometric structure of hyperbolic spaces. In addition to work on hyperbolic VAEs, there are also several works that explore other non-Euclidean spaces (e.g., spherical VAEs) (davidson2018hyperspherical; falorsi2019reparameterizing; grattarola2019adversarial).
Normalizing Flows. Normalizing flows (NFs) (rezende2015variational; dinh2016density) are a class of probabilistic models which use invertible transformations to map samples from a simple base distribution to samples from a more complex learned distribution. While there are many classes of normalizing flows (see the surveys papamakarios2019normalizing; kobyzev2019normalizing), our work largely follows flows designed with partial ordered dependency as found in affine coupling transformations (dinh2016density). Recently, normalizing flows have also been extended to Riemannian manifolds such as spherical spaces in gemici2016normalizing. In a different line of work, the authors of wang-wang-2019-riemannian construct a planar flow (rezende2015variational) on Riemannian manifolds parameterized by the inverse multiquadratics kernel function.
In this paper, we introduce two novel normalizing flows on hyperbolic spaces. We show that our flows are efficient to sample from, easy to invert and require only $\mathcal{O}(n)$ cost to compute the change in volume. We demonstrate the effectiveness of constructing hyperbolic normalizing flows for latent variable modeling of hierarchical data. We empirically observe improvements in structured density estimation, graph reconstruction and also generative modeling of tree-structured data, with large qualitative improvements in generated sample quality compared to Euclidean methods. One important limitation is the numerical error introduced by clamping operations, which prevents the creation of deep flow architectures. We hypothesize that this is an inherent limitation of the Lorentz model, which may be alleviated with newer models of hyperbolic geometry that use integer-based tiling (yu2019numerically). In addition, while we considered hyperbolic generalizations of the coupling transforms to define our normalizing flows, designing new classes of invertible transformations like autoregressive and residual flows on non-Euclidean spaces is an interesting direction for future work.
Funding: AJB is supported by an IVADO Excellence Fellowship. RL was supported by Connaught International Scholarship and RBC Fellowship. WLH is supported by a Canada CIFAR AI Chair. This work was also supported by NSERC Discovery Grants held by WLH and PP. In addition the authors would like to thank Chinwei Huang, Maxime Wabartha, Andre Cianflone and Andrea Madotto for helpful feedback on earlier drafts of this work and Kevin Luk, Laurent Dinh and Niky Kamran for helpful technical discussions.
An $n$-dimensional manifold $\mathcal{M}$ is a topological space that is equipped with a family of open sets $U_i$ which cover the space and a family of functions $\psi_i$ that are homeomorphisms between the $U_i$ and open subsets of $\mathbb{R}^n$. The pairs $(U_i, \psi_i)$ are called charts. A crucial requirement is that if two open sets $U_i$ and $U_j$ intersect in a region, call it $U_{ij}$, then the composite map $\psi_i \circ \psi_j^{-1}$ restricted to $\psi_j(U_{ij})$ is infinitely differentiable. If $\mathcal{M}$ is an $n$-dimensional manifold then a chart, $\psi : U \to V$, on $\mathcal{M}$ maps an open subset $U \subset \mathcal{M}$ to an open subset $V \subset \mathbb{R}^n$. Furthermore, the image of the point $p \in U$, denoted $\psi(p) \in \mathbb{R}^n$, is termed the local coordinates of $p$ on the chart $\psi$. Examples of manifolds include $\mathbb{R}^n$, the hypersphere $\mathbb{S}^n$, the hyperboloid $\mathbb{H}^n$, and the torus. In this paper we take an extrinsic view of the geometry, that is to say a manifold can be thought of as being embedded in a higher dimensional Euclidean space, i.e., $\mathcal{M} \subset \mathbb{R}^{n+1}$, and inherits the coordinate system of the ambient space. This is not how the subject is usually developed, but for spaces of constant curvature one gets convenient formulas.
Tangent Spaces. Let $p$ be a point on an $n$-dimensional smooth manifold $\mathcal{M}$ and let $\gamma(t)$ be a differentiable parametric curve with parameter $t$ passing through the point such that $\gamma(0) = p$. Since $\mathcal{M}$ is a smooth manifold we can trace the curve in local coordinates via a chart $\psi$, and the entire curve is given in local coordinates by $x = \psi \circ \gamma(t)$. The tangent vector to this curve at $p$ is then simply $v = (\psi \circ \gamma)'(0)$. Another interpretation of the tangent vector of $\gamma$ is obtained by viewing the point $p$ as a position vector, in which case the tangent vector is the velocity vector at that point. Using this definition the set of all tangent vectors at $p$ is denoted as $\mathcal{T}_p \mathcal{M}$, and is called the tangent space at $p$.
Riemannian Manifold. A Riemannian metric tensor $g$ on a smooth manifold $\mathcal{M}$ is defined as a family of inner products such that at each point $p \in \mathcal{M}$ the inner product takes vectors from the tangent space at $p$, $g_p = \langle \cdot, \cdot \rangle_p : T_p\mathcal{M} \times T_p\mathcal{M} \to \mathbb{R}$. This means $g$ is defined for every point on $\mathcal{M}$ and varies smoothly. Locally, $g$ can be defined using the basis vectors of the tangent space, $g_{ij}(p) = \langle \frac{\partial}{\partial p_i}, \frac{\partial}{\partial p_j} \rangle_p$. In matrix form the Riemannian metric, $G(p)$, can be expressed as $g_p(u, v) = u^\top G(p)\, v$. A smooth manifold equipped with a Riemannian metric at every point is called a Riemannian manifold. Thus every Riemannian manifold is specified by the tuple $(\mathcal{M}, g)$, which defines the smooth manifold and its associated Riemannian metric tensor.

Armed with a Riemannian manifold, we can now recover some conventional geometric notions such as the length of a parametric curve $\gamma$, the distance between two points on the manifold, a local notion of angle, surface area and volume. We define the length of a curve as $L[\gamma] = \int_a^b \|\gamma'(t)\|_{\gamma(t)}\, dt$. This definition reduces to the familiar length of a curve in Euclidean space when the Riemannian metric is the identity, $G = I_n$. Turning to the distance between points $p$ and $q$, we can reason that it must be the length of the shortest curve between them; such distance-minimizing curves are known in the literature as geodesics. (Strictly, a geodesic is usually defined as a curve whose tangent vector is parallel transported along it; it is then a theorem that it gives the shortest path.) Stated another way: $d(p, q) = \inf_{\gamma} L[\gamma]$ with $\gamma(a) = p$ and $\gamma(b) = q$. A norm is induced on every tangent space by $g_p$ and is defined as $\|v\|_p = \sqrt{g_p(v, v)}$. Finally, we can also define an infinitesimal volume element on each tangent space and, as a result, a measure $d\mathcal{M}(p) = \sqrt{|G(p)|}\, dp$, with $dp$ being the Lebesgue measure.
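As a small numerical illustration of the curve-length definition under the extrinsic view, the sketch below approximates the length of a curve on an embedded manifold by summing chord lengths in the ambient Euclidean space. The unit sphere example and the names `curve_length` and `quarter_circle` are illustrative assumptions, not part of the paper.

```python
import math

def curve_length(gamma, a, b, steps=10_000):
    """Approximate the length of gamma on [a, b] by summing chord lengths.

    Under the extrinsic view the metric is inherited from the ambient
    Euclidean space, so each chord is measured with the Euclidean norm.
    """
    total = 0.0
    prev = gamma(a)
    for i in range(1, steps + 1):
        t = a + (b - a) * i / steps
        cur = gamma(t)
        total += math.dist(prev, cur)
        prev = cur
    return total

def quarter_circle(t):
    # A great-circle arc on the unit sphere S^2 embedded in R^3.
    return (math.cos(t), math.sin(t), 0.0)

length = curve_length(quarter_circle, 0.0, math.pi / 2)
print(length)  # close to pi / 2
```

Because great-circle arcs are geodesics of the sphere, the computed length here also approximates the geodesic distance between the two endpoints.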
Given a parametrized density on $\mathbb{R}^n$, a normalizing flow defines a sequence of invertible transformations to a more complex density over the same space via the change of variable formula for probability distributions (rezende2015variational). Starting from a sample from a base distribution, $z_0 \sim p(z)$, and a mapping $f: \mathbb{R}^n \to \mathbb{R}^n$ with parameters $\theta$ that is both invertible and smooth, the log density of $z' = f(z_0)$ is defined as $\log p_\theta(z') = \log p(z_0) - \log \left|\det \frac{\partial f}{\partial z}(z_0)\right|$, where $p_\theta(z')$ is the probability of the transformed sample and $\frac{\partial f}{\partial z}$ is the Jacobian of $f$. To construct arbitrarily complex densities, a chain of functions of the same form as $f$ can be defined, applying the change of density successively for each invertible transformation in the flow. The final sample from a flow is then given by $z_j = f_j \circ f_{j-1} \circ \dots \circ f_1(z_0)$, and its corresponding density can be determined simply by $\log p_\theta(z_j) = \log p(z_0) - \sum_{i=1}^{j} \log \left|\det \frac{\partial f_i}{\partial z_{i-1}}\right|$. Of practical importance when designing normalizing flows is the cost of computing the log determinant of the Jacobian, which is computationally expensive and can range anywhere from $O(n!)$ to $O(n^3)$ for an arbitrary matrix, depending on the chosen algorithm. However, through an appropriate choice of $f$ this computational cost can be brought down significantly. While there are many different choices for the transformation function $f$, in this work we consider only RealNVP-based flows as presented in (dinh2016density) and (rezende2015variational), due to their simplicity and expressive power in capturing complex data distributions.
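The change of variable bookkeeping can be checked on a case with a known answer. The sketch below is a minimal illustration, assuming one-dimensional affine layers with hypothetical parameters and a standard normal base density; it accumulates the log density through a chain of two layers and compares it with the closed-form Gaussian that the composition produces.

```python
import math

def std_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def normal_logpdf(x, mean, std):
    return std_normal_logpdf((x - mean) / std) - math.log(std)

def affine_flow(z, scale, shift):
    """Invertible 1-D map f(z) = scale * z + shift and its log|det Jacobian|."""
    return scale * z + shift, math.log(abs(scale))

def flow_logprob(z0, layers):
    """Return z_j and log p(z_j) = log p(z_0) - sum_i log|det J_i|."""
    z, logp = z0, std_normal_logpdf(z0)
    for scale, shift in layers:
        z, logdet = affine_flow(z, scale, shift)
        logp -= logdet
    return z, logp

# Hypothetical layer parameters: the composition is z -> 3 * (2 * z + 1) - 3 = 6 * z,
# so the transformed sample is distributed as N(0, 6^2).
layers = [(2.0, 1.0), (3.0, -3.0)]
z_j, logp = flow_logprob(0.3, layers)
print(z_j, logp, normal_logpdf(z_j, 0.0, 6.0))
```

The two printed log densities agree up to floating point error, which is the property the successive change of density is meant to guarantee.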
One obvious use case for normalizing flows is in learning the more expressive, often multi-modal, posterior distribution needed in variational inference. Recall that a variational approximation yields a lower bound on the data log-likelihood. Take, for example, amortized variational inference in a VAE-like setting, whereby the posterior $q(z \mid x)$ is parameterized and is amenable to gradient-based optimization. The overall objective, with both encoder and decoder networks, is:
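$$
\begin{aligned}
\log p(x) &= \log \int p(x \mid z)\, p(z)\, dz && (19) \\
&= \log \int \frac{q(z \mid x)}{q(z \mid x)}\, p(x \mid z)\, p(z)\, dz && (20) \\
&\geq -\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big) + \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] && (21) \\
&= -\mathcal{F}(x) && (22)
\end{aligned}
$$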
The tightness of the Evidence Lower Bound (ELBO), also known as the negative free energy of the system, $-\mathcal{F}(x)$, is determined by how closely the approximate posterior matches the true posterior. Thus, one way to enrich the posterior approximation is to let $q$ be a normalizing flow itself, with the resultant latent code being the output of the transformation. If we denote by $q_0(z_0)$ the probability of the latent code $z_0$ under the base distribution and by $z_j$ the latent code after $j$ flow layers, we may rewrite the free energy as follows:
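$$
\begin{aligned}
\mathcal{F}(x) &= \mathbb{E}_{q(z \mid x)}\big[\log q(z \mid x) - \log p(x, z)\big] && (23) \\
&= \mathbb{E}_{q_0(z_0)}\big[\log q_j(z_j) - \log p(x, z_j)\big] && (24) \\
&= \mathbb{E}_{q_0(z_0)}\Big[\log q_0(z_0) - \sum_{i=1}^{j} \log \Big|\det \frac{\partial f_i}{\partial z_{i-1}}\Big| - \log p(x, z_j)\Big] && (25)
\end{aligned}
$$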
For convenience we may take $q_0 = \mathcal{N}(\mu, \sigma)$, which is a reparametrized Gaussian density, and $p(z) = \mathcal{N}(0, I)$, a standard normal.
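To make the flow-based free energy concrete, the following sketch estimates it by Monte Carlo for a toy model. Everything about the model is an assumption chosen so the answer is known in closed form: a scalar latent with prior $\mathcal{N}(0, 1)$, likelihood $x \mid z \sim \mathcal{N}(z, 1)$, a reparametrized Gaussian base density, and one-dimensional affine flow layers. For this model the marginal is $x \sim \mathcal{N}(0, 2)$ and the exact posterior is $\mathcal{N}(x/2, 1/2)$.

```python
import math
import random

def normal_logpdf(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def free_energy(x, mu, sigma, layers, num_samples=1000, seed=0):
    """Monte Carlo estimate of E_{q0}[log q0(z0) - sum_i log|det J_i| - log p(x, z_j)].

    Toy assumptions: p(z) = N(0, 1), p(x | z) = N(z, 1), and every flow
    layer is a 1-D affine map z -> scale * z + shift.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps                      # reparametrized sample from q0
        log_q = normal_logpdf(z, mu, sigma)       # log q0(z0)
        for scale, shift in layers:
            z = scale * z + shift
            log_q -= math.log(abs(scale))         # subtract log|det Jacobian|
        log_joint = normal_logpdf(z, 0.0, 1.0) + normal_logpdf(x, z, 1.0)
        total += log_q - log_joint
    return total / num_samples

x = 1.2
exact_nll = -normal_logpdf(x, 0.0, math.sqrt(2.0))   # -log p(x)
# A flow mapping N(0, 1) onto the exact posterior N(x/2, 1/2) makes the bound tight.
tight = free_energy(x, 0.0, 1.0, [(math.sqrt(0.5), x / 2.0)])
# Using the prior as the approximate posterior leaves a gap.
loose = free_energy(x, 0.0, 1.0, [])
print(exact_nll, tight, loose)
```

When the flow reproduces the exact posterior, the integrand is constant and the estimate equals $-\log p(x)$; with a poorer posterior the free energy is strictly larger, which is the sense in which a richer flow tightens the bound.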
Computing the Jacobian of functions with high-dimensional domain and codomain, and computing the determinants of large matrices, are in general computationally very expensive. A further complication is that the restriction to bijective functions makes it difficult to model arbitrary distributions. A simple way to significantly reduce the computational burden is to design transformations such that the Jacobian matrix is triangular, so that the determinant is simply the product of the diagonal elements. In (dinh2016density), real-valued non-volume preserving (RealNVP) transformations are introduced as simple bijections that can be stacked while the composition retains a triangular Jacobian. To achieve this, each bijection updates a part of the input vector using a function that is simple to invert, but which depends on the remainder of the input vector in a complex way. Such transformations are called affine coupling layers. Formally, given a $D$-dimensional input $x$ and $d < D$, the output $y$ of an affine coupling layer follows the equations:
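$$
\begin{aligned}
y_{1:d} &= x_{1:d} && (26) \\
y_{d+1:D} &= x_{d+1:D} \odot \exp\big(s(x_{1:d})\big) + t(x_{1:d}) && (27)
\end{aligned}
$$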
where $s$ and $t$ are parameterized scale and translation functions. As the second part of the output depends on the first part of the input, it is easy to see that the Jacobian of this transformation is lower triangular. Similarly, the inverse of this transformation is given by:
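$$
\begin{aligned}
x_{1:d} &= y_{1:d} && (28) \\
x_{d+1:D} &= \big(y_{d+1:D} - t(y_{1:d})\big) \odot \exp\big(-s(y_{1:d})\big) && (29)
\end{aligned}
$$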
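The forward and inverse passes of an affine coupling layer can be sketched in a few lines. In the example below, `s` and `t` are hypothetical hand-written stand-ins for the learned scale and translation networks, and the input values are arbitrary; the point is only that inversion never requires inverting `s` or `t`.

```python
import math

def s(x1):
    # Hypothetical stand-in for a learned scale network.
    return [math.tanh(x1[0] + 2.0 * x1[1]), math.sin(x1[0] * x1[1])]

def t(x1):
    # Hypothetical stand-in for a learned translation network.
    return [x1[0] ** 2, x1[0] - x1[1]]

def coupling_forward(x, d):
    """Copy the first d entries; scale and shift the rest conditioned on them."""
    x1, x2 = x[:d], x[d:]
    scale, shift = s(x1), t(x1)
    y2 = [xi * math.exp(si) + ti for xi, si, ti in zip(x2, scale, shift)]
    return x1 + y2

def coupling_inverse(y, d):
    """Undo the layer; s and t are only ever evaluated, never inverted."""
    y1, y2 = y[:d], y[d:]
    scale, shift = s(y1), t(y1)
    x2 = [(yi - ti) * math.exp(-si) for yi, si, ti in zip(y2, scale, shift)]
    return y1 + x2

x = [0.3, -0.7, 1.1, 0.5]
y = coupling_forward(x, d=2)
x_rec = coupling_inverse(y, d=2)
print(y, x_rec)
```

The recovered `x_rec` matches `x` up to floating point error, and the first `d` entries of `y` are exactly those of `x`.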
Note that the form of the inverse does not depend on calculating the inverses of either $s$ or $t$, allowing them to be complex functions themselves. Further note that with this simple bijection part of the input vector is never touched, which can limit the expressiveness of the model. A simple remedy is to reverse which elements undergo the scale and translation transformations prior to the next coupling layer. Given a stack of couplings, such an alternating pattern ensures that each dimension of the output depends on the input in a complex way, allowing for more expressive models. Finally, the Jacobian of this transformation is a lower triangular matrix,
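$$
\frac{\partial y}{\partial x^\top} =
\begin{bmatrix}
\mathbb{I}_d & 0 \\
\frac{\partial y_{d+1:D}}{\partial x_{1:d}^\top} & \mathrm{diag}\big(\exp(s(x_{1:d}))\big)
\end{bmatrix}
\tag{30}
$$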
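The triangular structure means the log determinant is just the sum of the scale outputs. The sketch below checks this numerically with central finite differences; the functions `s` and `t` are again hypothetical stand-ins for learned networks, and `numerical_jacobian` is a helper written for this illustration only.

```python
import math

def s(x1):
    # Hypothetical stand-in for a learned scale network.
    return [math.tanh(x1[0] + 2.0 * x1[1]), math.sin(x1[0] * x1[1])]

def t(x1):
    # Hypothetical stand-in for a learned translation network.
    return [x1[0] ** 2, x1[0] - x1[1]]

def coupling_forward(x, d):
    """Return y and log|det J|, which is the sum of the scale outputs."""
    x1, x2 = x[:d], x[d:]
    scale, shift = s(x1), t(x1)
    y2 = [xi * math.exp(si) + ti for xi, si, ti in zip(x2, scale, shift)]
    return x1 + y2, sum(scale)

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite differences; entry [i][j] approximates d y_i / d x_j."""
    n = len(x)
    columns = []
    for j in range(n):
        hi, lo = list(x), list(x)
        hi[j] += eps
        lo[j] -= eps
        columns.append([(a - b) / (2.0 * eps) for a, b in zip(f(hi), f(lo))])
    return [[columns[j][i] for j in range(n)] for i in range(n)]

x = [0.3, -0.7, 1.1, 0.5]
_, logdet = coupling_forward(x, d=2)
J = numerical_jacobian(lambda v: coupling_forward(v, 2)[0], x)

# Entries above the diagonal vanish, so det J is the product of the diagonal.
upper = max(abs(J[i][j]) for i in range(4) for j in range(i + 1, 4))
diag_logdet = sum(math.log(J[i][i]) for i in range(4))
print(upper, diag_logdet, logdet)
```

The cost of the log determinant is therefore linear in the dimension, compared with the cubic cost of a generic determinant.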
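We now derive the change in volume formula associated with one layer. Without loss of generality, we first define a binary mask $b$, which we use to partition the elements of a vector in $T_o\mathbb{H}^n$ into two sets. Thus $b$ is defined as

$$
b_j = \begin{cases} 1 & \text{if } j \leq d, \\ 0 & \text{otherwise.} \end{cases}
$$

Note that all layer operations exclude the first dimension, which is always copied over by setting $b_0 = 1$; this ensures that the resulting sample always remains on $T_o\mathbb{H}^n$. Utilizing $b$ we may rewrite Equation 3.1 as,

$$
y = \exp_o\big(\tilde{f}(\tilde{z})\big), \qquad
\tilde{f}(\tilde{z}) = b \odot \tilde{z} + (1 - b) \odot \Big(\tilde{z} \odot \sigma\big(s(b \odot \tilde{z})\big) + t(b \odot \tilde{z})\Big)
\tag{31}
$$

where $\tilde{z} = \log_o(z)$ is a point on the tangent space at $o$ and $\sigma$ is a pointwise non-linearity such as the exponential. Similar to the Euclidean RealNVP, we wish to calculate the Jacobian determinant of this overall transformation. We do so by first observing that the overall transformation is a valid composition of functions: $y := \exp_o \circ \tilde{f} \circ \log_o(z)$, where $\tilde{f}$ is the flow in tangent space. Utilizing the chain rule and the identity that the determinant of a product is the product of the determinants of its constituents, we may decompose the Jacobian determinant as,

$$
\left|\det \frac{\partial y}{\partial z}\right| =
\left|\det \frac{\partial \exp_o\big(\tilde{f}(\tilde{z})\big)}{\partial \tilde{f}(\tilde{z})}\right| \cdot
\left|\det \frac{\partial \tilde{f}(\tilde{z})}{\partial \tilde{z}}\right| \cdot
\left|\det \frac{\partial \log_o(z)}{\partial z}\right|
\tag{32}
$$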
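The composition of logarithmic map, masked coupling in the tangent space at the origin, and exponential map can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the hyperboloid is taken to have unit curvature $-1$ with origin $o = (1, 0, \dots, 0)$, the scale non-linearity is taken to be the exponential, and `s`, `t` and the mask values are hypothetical.

```python
import math

def lorentz_inner(u, v):
    return -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))

def exp_o(v):
    """Exponential map at o = (1, 0, ..., 0); v is tangent at o, so v[0] == 0."""
    r = math.sqrt(sum(a * a for a in v[1:]))
    if r == 0.0:
        return [1.0] + [0.0] * (len(v) - 1)
    return [math.cosh(r)] + [math.sinh(r) * a / r for a in v[1:]]

def log_o(z):
    """Logarithmic map at o, the inverse of exp_o on the hyperboloid."""
    norm = math.sqrt(sum(a * a for a in z[1:]))
    if norm == 0.0:
        return [0.0] * len(z)
    r = math.asinh(norm)               # on the hyperboloid, sinh(r) = |z[1:]|
    return [0.0] + [r * a / norm for a in z[1:]]

def s(m):
    # Hypothetical scale function of the masked tangent vector.
    return [math.tanh(m[1] + k) for k in range(len(m))]

def t(m):
    # Hypothetical translation function of the masked tangent vector.
    return [0.5 * m[1] * k for k in range(len(m))]

def tangent_coupling(z, b):
    zt = log_o(z)
    m = [bi * zi for bi, zi in zip(b, zt)]
    out = [bi * zi + (1 - bi) * (zi * math.exp(si) + ti)
           for bi, zi, si, ti in zip(b, zt, s(m), t(m))]
    return exp_o(out)                  # b[0] == 1 keeps out[0] == 0

def tangent_coupling_inverse(y, b):
    yt = log_o(y)
    m = [bi * yi for bi, yi in zip(b, yt)]
    zt = [bi * yi + (1 - bi) * (yi - ti) * math.exp(-si)
          for bi, yi, si, ti in zip(b, yt, s(m), t(m))]
    return exp_o(zt)

b = [1, 1, 0, 0]                        # first (time-like) coordinate is always copied
z = exp_o([0.0, 0.4, -0.2, 0.9])
y = tangent_coupling(z, b)
z_rec = tangent_coupling_inverse(y, b)
print(lorentz_inner(y, y), z_rec)
```

The output satisfies $\langle y, y \rangle_{\mathcal{L}} = -1$ up to floating point error, so it stays on the hyperboloid, and the inverse recovers the input without inverting `s` or `t`.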
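Tackling each term on the RHS of Eq. 32 individually, the Jacobian determinant of the exponential map is as derived in (nagano2019wrapped). As the logarithmic map is the inverse of the exponential map, its Jacobian determinant is also the inverse, i.e., $\left|\det \frac{\partial \log_o(z)}{\partial z}\right| = \left|\det \frac{\partial \exp_o(\tilde{z})}{\partial \tilde{z}}\right|^{-1}$. For the middle term in Eq. 32 we proceed by selecting the standard basis $\{e_1, \dots, e_n\}$, which is an orthonormal basis with respect to the Lorentz inner product. The directional derivative with respect to a basis element $e_k$ is computed as follows: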
As $b$ is a binary mask, it is easy to see that if $b_k = 1$ then only the first term on the RHS remains and the directional derivative with respect to $e_k$ is simply the basis vector itself. Conversely, if $b_k = 0$ then the first term goes to zero and we are left with the second term,