Neural Manifold Ordinary Differential Equations

Aaron Lou et al. · Cornell University · NeurIPS 2020 · arXiv:2006.10254 · June 18, 2020

To better conform to data geometry, recent deep generative modelling techniques adapt Euclidean constructions to non-Euclidean spaces. In this paper, we study normalizing flows on manifolds. Previous work has developed flow models for specific cases; however, these advancements hand craft layers on a manifold-by-manifold basis, restricting generality and inducing cumbersome design constraints. We overcome these issues by introducing Neural Manifold Ordinary Differential Equations, a manifold generalization of Neural ODEs, which enables the construction of Manifold Continuous Normalizing Flows (MCNFs). MCNFs require only local geometry (therefore generalizing to arbitrary manifolds) and compute probabilities with continuous change of variables (allowing for a simple and expressive flow construction). We find that leveraging continuous manifold dynamics produces a marked improvement for both density estimation and downstream tasks.

1 Introduction

Figure 1: A manifold ODE solution for a given vector field on the sphere.

* indicates equal contribution

Deep generative models are a powerful class of neural networks which fit a probability distribution to produce new, unique samples. While latent variable models such as Generative Adversarial Networks (GANs) Goodfellow et al. (2014) and Variational Autoencoders (VAEs) Kingma and Welling (2014) are capable of producing reasonable samples, computing the exact modeled data posterior is fundamentally intractable. By comparison, normalizing flows Rezende and Mohamed (2015) are capable of learning rich and tractable posteriors by transforming a simple probability distribution through a sequence of invertible mappings. Formally, in a normalizing flow, a complex distribution over x is transformed to a simple distribution over z = f(x) via a diffeomorphism f (i.e. a differentiable bijective map with a differentiable inverse), with probability values given by the change of variables:

    log p(x) = log p(f(x)) + log |det (∂f/∂x)|

To compute this update efficiently, f must be constrained to allow for fast evaluation of the determinant, which in the absence of additional constraints takes O(d^3) time (where d is the dimension of x). Furthermore, to efficiently generate samples, f must have a computationally cheap inverse. Existing literature increases the expressiveness of such models under these computational constraints and oftentimes parameterizes f with deep neural networks Germain et al. (2015); Dinh et al. (2017); Kingma and Dhariwal (2018); Durkan et al. (2019). An important recent advancement, dubbed the Continuous Normalizing Flow (CNF), constructs f using a Neural Ordinary Differential Equation (ODE) with dynamics dx(t)/dt = f_θ(x(t), t) and invokes a continuous change of variables which requires only the trace of the Jacobian of f_θ Chen et al. (2018); Grathwohl et al. (2019).

Since f is a diffeomorphism, the topologies of the base and target distributions must be equivalent. Furthermore, this topology must conform with the underlying latent space, which previous work mostly assumes to be Euclidean. However, topologically nontrivial data often arise in real world examples such as in quantum field theory Wirnsberger et al. (2020), motion estimation Feiten et al. (2013), and protein structure prediction Hamelryck et al. (2006).

In order to go beyond topologically trivial Euclidean space, one can model the latent space with a smooth manifold. An n-dimensional manifold M (all of our manifolds are assumed to be smooth, so we refer to them simply as manifolds) can be thought of as an n-dimensional analogue of a surface. Concretely, this is formalized with charts, which are smooth bijections φ: U → φ(U) ⊆ ℝ^n defined on open sets U ⊆ M, that also satisfy a smoothness condition when passing between charts: for charts φ_1, φ_2 with corresponding domains U_1, U_2, the composed map φ_2 ∘ φ_1^{-1} is a diffeomorphism on φ_1(U_1 ∩ U_2).

Preexisting manifold normalizing flow works (which we present a complete history of in Section 2) do not generalize to arbitrary manifolds. Furthermore, many examples present constructions extrinsic to the manifold. In this work, we solve these issues by introducing Manifold Continuous Normalizing Flows (MCNFs), a manifold analogue of Continuous Normalizing Flows. Concretely, we:

  1. introduce Neural Manifold ODEs as a generalization of Neural ODEs (seen in Figure 1). We leverage existing literature to provide methods for forward mode integration, and we derive a manifold analogue of the adjoint method Pontryagin et al. (1962) for backward mode gradient computation.

  2. develop a dynamic chart method to realize Neural Manifold ODEs in practice. This approach integrates local dynamics in Euclidean space and passes between domains using smooth chart transitions. Because of this, we perform computations efficiently and can accurately compute gradients. Additionally, this allows us to access advanced ODE solvers (without manifold equivalents) and augment the Neural Manifold ODE with existing Neural ODE improvements Dupont et al. (2019); Grathwohl et al. (2019).

  3. construct Manifold Continuous Normalizing Flows (MCNFs). These flows integrate local dynamics to construct diffeomorphisms, and are therefore theoretically complete over general manifolds. Empirically, we find that our method outperforms existing manifold normalizing flows on their specific domains.

2 Related Work

In this section we analyze all major preexisting manifold normalizing flows. Previous methods are, in general, hampered by a lack of generality and rely on cumbersome hand-crafted constructions.

Normalizing Flows on Riemannian Manifolds Gemici et al. (2016). The first manifold normalizing flow work constructs examples on Riemannian manifolds by first projecting onto Euclidean space, applying a predefined Euclidean normalizing flow, and projecting back. Although simple, this construction is theoretically flawed since the initial manifold projection requires the manifold to be diffeomorphic to Euclidean space. This is not always the case, since, for example, the existence of antipodal points on a sphere necessarily implies that the sphere is not diffeomorphic to Euclidean space (this example is generalized by the notion of conjugate points; most manifolds have conjugate points, and those without are topologically equivalent to Euclidean space). As a result, the construction only works on a relatively small and topologically trivial subset of manifolds.

Our work overcomes this problem by integrating local dynamics to construct a global diffeomorphism. By doing so, we do not have to relate our entire manifold with some Euclidean space, but rather only well-behaved local neighborhoods. We test against Gemici et al. (2016) on hyperbolic space, and our results produce a significant improvement.

Latent Variable Modeling with Hyperbolic Normalizing Flows Bose et al. (2020). In a recent manifold normalizing flow paper, the authors propose two normalizing flows on hyperbolic space—a specific Riemannian manifold. These models, which they name the Tangent Coupling (TC) and Wrapped Hyperboloid Coupling (WHC), are not affected by the aforementioned problem since hyperbolic space is diffeomorphic to Euclidean space. However, various shortcomings exist. First, in our experiments we find that the methods do not seem to conclusively outperform Gemici et al. (2016). Second, these methods do not generalize to topologically nontrivial manifolds. This means that these flows produce no additional topological complexity and thus the main benefit of manifold normalizing flows is not realized. Third, the WHC construction is not intrinsic to hyperbolic space since it relies on the hyperboloid equation.

Our method, by contrast, is derived from vector fields, which are natural manifold constructs. This not only allows for generality, but also means that our construction respects the manifold geometry. We compare against Bose et al. (2020) on hyperbolic space and find that our results produce a substantial improvement.

Normalizing Flows on Tori and Spheres Rezende et al. (2020). In another paper, the authors introduce a variety of normalizing flows for tori and spheres. These manifolds are not diffeomorphic to Euclidean space, hence the authors construct explicit global diffeomorphisms. However, their constructions do not generalize and must be intricately designed with manifold topology in mind. In addition, the primary recursive flow makes use of non-global diffeomorphisms to the cylinder, so densities are not defined everywhere. The secondary exponential map-based flow is globally defined but is not computationally tractable for higher dimensions.

In comparison, our work is general, requires only local diffeomorphisms, produces globally defined densities, and is tractable for higher dimensions. We test against Rezende et al. (2020) on the sphere and attain markedly better results.

Other Related Work. In Falorsi et al. (2019), the authors define a probability distribution on Lie Groups, a special type of manifold, and as a by-product construct a normalizing flow. The constructed flow is very similar to that found in Gemici et al. (2016), but the authors include a nonlinearity at the end of the Euclidean flow to constrain the input space and guarantee that the map back to the manifold is injective. We do not compare directly against this work since Lie Groups are not general (the 2-dimensional sphere, for example, is not a Lie Group Stillwell (2008)) while Riemannian manifolds are.

There are some related works such as Wirnsberger et al. (2020); Wang and Wang (2019); Brehmer and Cranmer (2020) that are orthogonal to our considerations as they either (i) develop applications as opposed to theory or (ii) utilize normalizing flows as a tool to study Riemannian metrics.

Concurrent work Mathieu and Nickel (2020); Falorsi and Forré (2020) also investigates the extension of neural ODEs to smooth manifolds.

3 Background

In this section, we present background knowledge to establish naming conventions and intuitively illustrate the constructions used for our work. For a more detailed overview, we recommend consulting a text such as Lee (1997, 2003); do Carmo (1992).

3.1 Differential Geometry

Tangent space. For an n-dimensional manifold M, the tangent space T_x M at a point x ∈ M is a higher-dimensional analogue of a tangent plane at a point on a surface. It is an n-dimensional real vector space for all points x ∈ M.

For our purposes, tangent spaces will play two roles. First, they provide a notion of derivation which is crucial in defining manifold ODEs. Second, they will oftentimes be used in place of ℝ^n for our charts (as we map to some open subset of ℝ^n through a change of basis from T_x M).

Pushforward/Differential. The derivative (or pushforward) of a function f: M → N between two manifolds is denoted by D_x f: T_x M → T_{f(x)} N. This is a generalization of the classical Euclidean Jacobian (as ℝ^n is a manifold), and provides a way to relate tangent spaces at different points.

As one might expect, the pushforward is central in the definition of manifold ODEs (analogous to the importance of the common derivative in Euclidean ODEs). We also use it in our dynamic chart method to map tangent vectors of the manifold to tangent vectors of Euclidean space.

3.2 Riemannian Geometry

While the above theory is general, to concretize some computational aspects (e.g. how to pick charts) and give background on related manifold normalizing flow work, we define relevant concepts from Riemannian geometry.

Riemannian Manifold. The fundamental object of study in Riemannian geometry is the Riemannian manifold. This is a manifold with an additional structure called the Riemannian metric, a smoothly varying inner product on the tangent spaces (often denoted as g). This Riemannian metric allows us to construct a distance function d_M on the manifold M. Furthermore, any manifold can be given a Riemannian metric.

Exponential Map. The exponential map exp_x: T_x M → M can be thought of as taking a tangent vector v ∈ T_x M and following the corresponding direction on the manifold such that the distance traveled is the length of the tangent vector. Specifically, the distance on the manifold matches the induced tangent space norm: d_M(x, exp_x(v)) = ||v||_g. Note that exp_x(0) = x.

The exponential map is crucial in our construction since it acts as a chart. Specifically, if we identify T_x M with ℝ^n, then exp_x is a diffeomorphism when restricted to some local neighborhood of 0, so its inverse serves as a chart around x.

Special Cases. Some special cases of Riemannian manifolds include hyperbolic spaces H^n, spheres S^n, and tori T^n. Specific formulas for Riemannian computations are given in Appendix B. Hyperbolic space is diffeomorphic to Euclidean space, but spheres and tori are not.
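For concreteness, the following is a minimal NumPy sketch (ours, not the released code) of the exponential and logarithmic maps on the unit sphere embedded in R^3, matching the closed-form expressions tabulated in Appendix B:

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere: follow the geodesic from x with initial velocity v."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:                       # exp_x(0) = x
        return x
    return np.cos(norm_v) * x + np.sin(norm_v) * v / norm_v

def sphere_log(x, y):
    """Logarithmic map (inverse of exp_x), defined whenever y is not antipodal to x."""
    cos_theta = np.clip(np.dot(x, y), -1.0, 1.0)
    theta = np.arccos(cos_theta)             # geodesic distance d_M(x, y)
    if theta < 1e-12:
        return np.zeros_like(x)
    u = y - cos_theta * x                    # component of y orthogonal to x
    return theta * u / np.linalg.norm(u)

x = np.array([0.0, 0.0, 1.0])                # a point on S^2
v = np.array([0.5, 0.3, 0.0])                # a tangent vector at x
y = sphere_exp(x, v)
assert np.allclose(sphere_log(x, y), v)      # log inverts exp inside the injectivity radius
```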

3.3 Manifold Ordinary Differential Equations

Manifold ODE. Finally, we introduce the key objects of study: manifold ordinary differential equations. A manifold ODE is an equation which relates a curve z: [t_s, t_e] → M to a vector field f on M and takes the form

    dz(t)/dt = f(z(t), t),    f(z(t), t) ∈ T_{z(t)} M    (1)

A curve z is a solution to the ODE if it satisfies Equation 1 with initial condition z(t_s) = z_s. Similarly to the case of classical ODEs, local solutions are guaranteed to exist under sufficient conditions on f Hairer (2011).

4 Neural Manifold Ordinary Differential Equations

To leverage manifold ODEs in a deep learning framework similar to Neural ODEs Chen et al. (2018), we parameterize the dynamics f_θ by a neural network with parameters θ. We define both forward pass and backward pass gradient computation strategies in this framework. Unlike Neural ODEs, the forward and backward computations do not perfectly mirror each other, since forward mode computation requires explicit manifold methods, while the backward pass can be defined solely through an ODE in Euclidean space.

4.1 Forward Mode Integration

The first step in defining our Neural Manifold ODE block is the forward mode integration. We review and select appropriate solvers from the literature; for a more thorough introduction we recommend consulting a text such as Crouch and Grossman (1993); Hairer (2011). Broadly speaking, these solvers can be classified into two groups:

  1. projection methods, which embed the manifold M into an ambient Euclidean space ℝ^m, integrate with some base Euclidean solver, and project to the manifold after each step. Projection methods require additional manifold structure; in particular, M must be the level set of some smooth function.

  2. implicit methods, which solve the manifold ODE locally using charts. These methods require only intrinsic manifold constructions.

Projection methods are conceptually simple, but ultimately suffer from generality issues. In particular, for manifolds such as the open ball or the upper half-space, there is no principled way to project off-manifold points back on. Furthermore, even in nominally well-defined cases such as the sphere, there may still exist points such as the origin for which the projection is not well-defined.

Implicit methods, by contrast, can be applied to any manifold and do not require a level set representation. Thus, this approach is more amenable to our generality concerns (especially since we wish to work with, for example, hyperbolic space). However, the difficulty of defining charts restricts applicability. For this reason, implicit schemes often employ additional structure to define manifold variants of step-based solvers Bielecki (2002); Crouch and Grossman (1993). For example, on a Riemannian manifold one can define a variant of an Euler method solver with update step z_{t+h} = exp_{z_t}(h · f(z_t, t)) using the Riemannian exponential map. On a Lie group, there are more advanced Runge-Kutta solvers that use the Lie exponential map and coordinate frames Crouch and Grossman (1993); Hairer (2011).
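As an illustrative sketch (our own simplification, not the paper's solver), the Riemannian Euler step above can be written for the unit sphere as follows:

```python
import numpy as np

def sphere_exp(x, v):
    n = np.linalg.norm(v)
    return x if n < 1e-12 else np.cos(n) * x + np.sin(n) * v / n

def project_to_tangent(x, u):
    """Project an ambient vector u onto the tangent space of the unit sphere at x."""
    return u - np.dot(x, u) * x

def manifold_euler(f, z0, t0, t1, steps=100):
    """Riemannian Euler solver: z_{t+h} = exp_{z_t}(h * f(z_t, t))."""
    h = (t1 - t0) / steps
    z, t = z0.copy(), t0
    for _ in range(steps):
        v = project_to_tangent(z, f(z, t))   # keep the dynamics in T_z M
        z = sphere_exp(z, h * v)
        t += h
    return z

# Example: rotation about the z-axis; the iterates stay exactly on the sphere.
f = lambda z, t: np.array([-z[1], z[0], 0.0])
z_end = manifold_euler(f, np.array([1.0, 0.0, 0.0]), 0.0, np.pi / 2)
print(z_end, np.linalg.norm(z_end))          # approx [0, 1, 0], norm 1
```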

4.2 Backward Mode Adjoint Gradient Computation

In order to fully incorporate manifold ODEs into the deep learning framework, we must also efficiently compute gradients. Similar to Chen et al. (2018), we develop an adjoint method to analytically calculate the derivative of a manifold ODE instead of directly differentiating through the solver. Similar to manifold adjoint methods for partial differential equations, we use an ambient space Zahr et al. (2016).

Theorem 4.1.

Suppose we have some manifold ODE as given in Equation 1 and we define some loss function L(z(t_e)). Suppose that there is an embedding of M in some Euclidean space ℝ^m and we identify T_z M with an n-dimensional subspace of ℝ^m. If we define the adjoint state to be a(t) = D_{z(t)} L, then the adjoint satisfies

    da(t)/dt = −a(t) D_{z(t)} f(z(t), t)    (2)
Remark.

This theorem resembles the adjoint method in Chen et al. (2018) precisely because of our ambient space condition. In particular, our curve z(t) can be considered as a curve in the ambient space ℝ^m. Furthermore, we do not lose any generality since such an embedding always exists by the Whitney Embedding Theorem Whitney (1935), e.g. with m = 2n.

The full proof of Theorem 4.1 can be found in Appendix A.2. Through this adjoint state, gradients can be derived for the other parameters in the equation, such as the integration times, the initial condition z(t_s), and the weights θ.
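As an illustrative sketch (not the authors' implementation), the adjoint state can be obtained by integrating Equation 2 backwards in time in the ambient space; a vector-Jacobian product from PyTorch autograd forms a(t) D_z f without materializing the Jacobian:

```python
import torch

def adjoint_dynamics(f, z, t, a):
    """Right-hand side of Equation 2: da/dt = -a D_z f(z, t), computed as a VJP."""
    z = z.detach().requires_grad_(True)
    with torch.enable_grad():
        f_val = f(z, t)
        (vjp,) = torch.autograd.grad(f_val, z, grad_outputs=a)
    return -vjp

def backward_adjoint_euler(f, z_traj, ts, dLdz_end):
    """Euler-integrate the adjoint backwards along a stored trajectory z_traj[i] = z(ts[i])."""
    a = dLdz_end.clone()
    for i in range(len(ts) - 1, 0, -1):
        h = ts[i] - ts[i - 1]
        a = a - h * adjoint_dynamics(f, z_traj[i], ts[i], a)
    return a   # gradient of the loss with respect to z(t_s)
```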

Figure 2: Solving a manifold ODE with our dynamic chart method, passing between a sequence of local charts.

5 Dynamic Chart Method

While our above theory is fully expressive and general, in this section we address certain computational issues and augment applicability with our dynamic chart method.

Given dynamics f, local charts φ_i, starting condition z(t_s) = z_s, and starting/ending times t_s, t_e.
Initialize t ← t_s, z(t) ← z_s, i ← 1
while t < t_e do
       Construct an equivalent differential equation dy/ds = D_{φ_i^{-1}(y)} φ_i ( f(φ_i^{-1}(y), s) ) with initial condition y(t) = φ_i(z(t))
       Solve locally using some numerical integration technique in Euclidean space. Specifically, solve on some interval [t, t_i] over which y stays inside the image φ_i(U_i); then set z(t_i) ← φ_i^{-1}(y(t_i)), t ← t_i, i ← i + 1
      
end while
Algorithm 1 Dynamic Chart Forward Pass

The motivation for our dynamic chart method comes from Lezcano-Casado (2019), where the author introduces a dynamic manifold trivialization technique for Riemannian gradient descent. Here, instead of applying the traditional Riemannian gradient update at every time step, the author repeatedly applies blocks of local updates. Each block consists of a local diffeomorphism to Euclidean space, several equivalent Euclidean gradient updates, and a map back to the manifold. This lowers the number of expensive exponential map calls and allows us to invoke existing Euclidean gradient optimizers such as Adam Kingma and Ba (2014). This is similar in spirit to Gemici et al. (2016), but crucially relies only on local diffeomorphisms rather than a global diffeomorphism.

We can adopt this strategy in our Neural Manifold ODE. Specifically, we develop a generalization of the dynamic manifold trivialization for the manifold ODE forward pass. In a similar spirit, we use a local chart to map to Euclidean space, solve an equivalent Euclidean ODE locally, and map back to the manifold using the chart. The full forward pass is given by Algorithm 1 and is visualized in Figure 2.

We present two propositions which highlight that this algorithm is guaranteed to converge to the manifold ODE solution. The first shows that the chart-wise solution solves the manifold differential equation locally, and the second shows that we can pick a finite collection of charts such that we can integrate all the way to the end time t_e.

Prop 5.1 (Correctness).

If y is a solution to the chart-equivalent ODE dy/dt = D_{φ^{-1}(y)} φ ( f(φ^{-1}(y), t) ) with initial condition y(t_0) = φ(z(t_0)), then φ^{-1} ∘ y is a valid solution to Equation 1 on the chart domain.

Prop 5.2 (Convergence).

There exists a finite collection of charts {φ_i} s.t. Algorithm 1 integrates the manifold ODE from t_s to t_e.

Proofs of these propositions are given in Appendix A.1. Note that this forward integration is implicit, indicating the connections between Hairer and Wanner (2010); Hairer (2011) and Lezcano-Casado (2019). We realize this construction by finding a principled way to pick charts, and we incorporate the method into neural networks by developing a backward gradient computation.

We can intuitively visualize this dynamic chart forward pass as a sequence of Neural ODE solvers with chart transitions connecting them. Here, we see how the smooth chart transition property comes into play, as passing between charts is the necessary step in constructing a global solution. In addition, this construction provides a chart-based backpropagation. Under this dynamic chart forward pass, we can view a Neural Manifold ODE block as the following composition of Neural ODE blocks and chart transitions:

    z(t_e) = (φ_k^{-1} ∘ ODE_k ∘ φ_k) ∘ ⋯ ∘ (φ_1^{-1} ∘ ODE_1 ∘ φ_1)(z(t_s))    (3)

where ODE_i denotes the Euclidean Neural ODE block solved in chart φ_i.

This allows for gradient computation through backpropagation. We may differentiate through the Neural ODE blocks by the Euclidean adjoint method Pontryagin et al. (1962); Chen et al. (2018).

To finalize this method, we give a strategy for picking these charts for Riemannian manifolds (note that we do not lose generality since all manifolds can be given a Riemannian metric). As previously mentioned, the exponential map serves as a local diffeomorphism between the tangent space (which can be identified with ℝ^n) and the manifold, so it acts as a chart. Similar to Falorsi et al. (2019), at each point x there exists a radius r(x) such that exp_x is a diffeomorphism when restricted to the ball of radius r(x) in T_x M. With this information, we can solve the equivalent ODE with solution y(t) in the chart around a point until y(t) approaches the boundary of this ball, at which point we switch to the exponential map chart defined around the current point. Complete details are provided in Appendix D.
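A minimal NumPy sketch of this chart-switching loop on the unit sphere; the chart helpers, the fixed switching radius, the finite-difference pushforward, and the inner Euler solve are our own simplifying assumptions, not the released implementation:

```python
import numpy as np

def sphere_exp(p, v):
    n = np.linalg.norm(v)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * v / n

def sphere_log(p, z):
    c = np.clip(np.dot(p, z), -1.0, 1.0)
    theta = np.arccos(c)
    u = z - c * p
    return np.zeros_like(p) if theta < 1e-12 else theta * u / np.linalg.norm(u)

def chart_pushforward(p, z, v, eps=1e-6):
    """Finite-difference approximation of D_z(exp_p^{-1}) applied to a tangent vector v."""
    z_shift = (z + eps * v) / np.linalg.norm(z + eps * v)   # nearby point on the sphere
    return (sphere_log(p, z_shift) - sphere_log(p, z)) / eps

def dynamic_chart_forward(f, z0, t0, t1, radius=0.5, h=1e-2):
    """Algorithm 1 on the sphere, with exponential-map charts and Euler steps in each chart."""
    t, p = t0, z0.copy()                     # p is the current chart basepoint
    y = np.zeros_like(z0)                    # chart coordinates of z(t): y = exp_p^{-1}(z(t))
    while t < t1:
        z = sphere_exp(p, y)
        y = y + h * chart_pushforward(p, z, f(z, t))   # Euclidean Euler step in the chart
        t += h
        if np.linalg.norm(y) > radius:                 # nearing the chart boundary: switch charts
            p, y = sphere_exp(p, y), np.zeros_like(y)
    return sphere_exp(p, y)

f = lambda z, t: np.array([-z[1], z[0], 0.0])          # rotation about the z-axis
print(dynamic_chart_forward(f, np.array([1.0, 0.0, 0.0]), 0.0, np.pi))   # approx [-1, 0, 0]
```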

Our dynamic chart method is a significant advancement over previous implicit methods since we can easily construct charts as we integrate. Furthermore, it provides many practical improvements over vanilla Neural Manifold ODEs. Specifically, we

  1. can perform faster evaluations. The aforementioned single step algorithms rely on repeated use of the Lie and Riemannian exponential maps. These are expensive to compute, and our method can sidestep this expensive evaluation. In particular, the cost is shifted to the derivative of the chart, but by defining dynamics on the tangent space directly, we can avoid this computation. We use this for our hyperbolic space construction, where we simply solve the equivalent ODE directly in the tangent space.

  2. avoid catastrophic gradient instability. If the dimension of M is less than the dimension of the ambient space, then the tangent spaces are of measure zero. This means that the induced error from the ODE solver will cause our gradients to leave their domain, resulting in a catastrophic failure. However, since the errors in Neural ODE blocks do not cause the gradient to leave their domain, and since our charts induce only a precision error, our dynamic chart method avoids this trap.

  3. access a wider array of ODE advancements. While substantial work has been done on manifold ODE solvers, the vast majority of cutting edge ODE solvers are still restricted to Euclidean space. Our dynamic chart method can make full use of these advanced solvers in the Neural ODE blocks. Additionally, Neural ODE improvements such as Dupont et al. (2019); Jia and Benson (2019) can be directly integrated without additional manifold constructions.

6 Manifold Continuous Normalizing Flows

With our dynamic chart Neural Manifold ODE, we can construct a Manifold Continuous Normalizing Flow (MCNF). The value z(t) can be integrated directly through the ODE, so all that remains to be computed is the change in log probability. Here, we can invoke the continuous change of variables Chen et al. (2018) on each Neural ODE block and use the smooth chart transition property (which guarantees that the charts are diffeomorphisms) to calculate the final change in probability as:

    log p(z(t_e)) = log p(z(t_s)) − Σ_{i=1}^{k} ( log |det D φ_i| + ∫ tr( ∂f_i/∂y ) dt + log |det D φ_i^{-1}| )    (4)

where f_i = D φ_i ∘ f ∘ φ_i^{-1} denotes the dynamics in chart φ_i. Note that we drop derivative and integration arguments for brevity. A full expression is given in Appendix B.

For our purposes, the last required computation is the determinant of the chart differentials D φ_i (in practice, the differentials of the exponential map and its inverse). We find that these can be evaluated analytically, as shown in the cases of the hyperboloid and sphere Nagano et al. (2019); Skopek et al. (2019).

Since the MCNF requires only the local dynamics (which are in practice parameterized by the exponential map), this means that the construction generalizes to arbitrary manifolds. Furthermore, we avoid diffeomorphism issues, such as in the case of two antipodal points on the sphere, simply by restricting our chart domains to never include these conjugate points.
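A schematic Python sketch of how the bookkeeping in Equation 4 composes per chart; `chart` and `cnf` are placeholders for the per-chart transformations and Euclidean CNF solves rather than the authors' API:

```python
def mcnf_log_prob(z, log_p, charts, cnf_blocks):
    """Accumulate Equation 4: pay the chart log-det going in, the CNF trace
    integral inside the chart, and the inverse-chart log-det coming out."""
    for chart, cnf in zip(charts, cnf_blocks):
        y = chart.forward(z)                     # y = phi_i(z)
        log_p = log_p - chart.logdet(z)          # change of variables into the chart
        y, trace_integral = cnf(y)               # Euclidean CNF block: y(t_i) and its trace integral
        log_p = log_p - trace_integral
        z = chart.inverse(y)                     # z = phi_i^{-1}(y)
        log_p = log_p - chart.inverse_logdet(y)  # change of variables back to the manifold
    return z, log_p
```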

Figure 3: Density estimation on the hyperboloid H^2, which is projected to the Poincaré ball for visualization. Columns: target density, WHC Bose et al. (2020), PRNVP Gemici et al. (2016), MCNF (ours).
Figure 4: Density estimation on the sphere S^2, which is projected to two dimensions by the Mollweide projection. Columns: target density, NCPS Rezende et al. (2020), MCNF (ours).

7 Experiments

To test our MCNF models, we run density estimation and variational inference experiments. Though our method is general, we take M to be two spaces of particular interest: hyperbolic space H^n and the sphere S^n. Appendix B concretely details the computation of MCNF in these spaces. Full experimental details can be found in Appendix C.

7.1 Density Estimation

We train normalizing flows for estimation of densities on hyperbolic space H^2 and the sphere S^2, as these spaces induce efficient computations and are easy to visualize. For hyperbolic space, the baselines are Wrapped Hyperboloid Coupling (WHC) Bose et al. (2020) and Projected NVP (PRNVP), which learns RealNVP over the projection of the hyperboloid to Euclidean space Gemici et al. (2016); Dinh et al. (2017). On the sphere, we compare with the recursive construction of Rezende et al. (2020), with non-compact projection used for the flow (NCPS). As visualized in Figures 3 and 4, our normalizing flows are able to match complex target densities with significant improvement over the baselines. MCNF is even able to fit discontinuous and multi-modal target densities; the baseline methods cannot fit these cases, as they struggle to reduce probability mass in areas of low target density.

7.2 Variational Inference

We train a hyperbolic VAE Mathieu et al. (2019) and a Euclidean VAE Kingma and Welling (2014) for variational inference on Binarized Omniglot Lake et al. (2015) and Binarized MNIST Lecun et al. (1998). Both of these datasets have been found to empirically benefit from hyperbolic embeddings in prior work Nagano et al. (2019); Mathieu et al. (2019); Bose et al. (2020); Khrulkov et al. (2019); Skopek et al. (2019). We compare different flow layers in the latent space of the VAEs. For the Euclidean VAE, we compare with RealNVP Dinh et al. (2017) and CNF Chen et al. (2018); Grathwohl et al. (2019). Along with the two hyperbolic baselines used for density estimation, we also compare against the Tangent Coupling (TC) model in Bose et al. (2020). As shown in Table 1, our continuous flow regime is more expressive and learns better than all hyperbolic baselines. In low dimensions, along with the other hyperbolic models, our approach tends to outperform Euclidean models on Omniglot and MNIST. However, in high dimension the hyperbolic models do not reap as much benefit; even the baseline HVAE does not consistently outperform the Euclidean VAE.

Table 1: MNIST and Omniglot average negative test log likelihood (lower is better) and standard deviation over five trials, for latent dimensions 2, 4, and 6. Euclidean models: VAE Kingma and Welling (2014), RealNVP Dinh et al. (2017), CNF Grathwohl et al. (2019). Hyperbolic models: HVAE Mathieu et al. (2019), TC Bose et al. (2020), WHC Bose et al. (2020), PRNVP Gemici et al. (2016), MCNF (ours).

8 Conclusion

We have presented Neural Manifold ODEs, which allow for the construction of continuous time manifold neural networks. In particular, we introduce the relevant theory for defining “pure” Neural Manifold ODEs and augment it with our dynamic chart method. This resolves numerical and computational cost issues while allowing for better integration with modern Neural ODE theory. With this framework of continuous manifold dynamics, we develop Manifold Continuous Normalizing Flows. Empirical evaluation of our flows shows that they outperform existing manifold normalizing flow baselines on density estimation and variational inference tasks. Most importantly, our method is completely general as it does not require anything beyond local manifold structure.

We hope that our work paves the way for future development of manifold-based deep learning. In particular, we anticipate that our general framework will lead to other continuous generalizations of manifold neural networks. Furthermore, we expect to see application of our Manifold Continuous Normalizing Flows for topologically nontrivial data in lattice quantum field theory, motion estimation, and protein structure prediction.

9 Broader Impact

As mentioned in the introduction, our method has applications to physics, robotics, and biology. While there are ethical and social concerns with parts of these fields, our work is too theoretical for us to say for sure what the final impact will be. For deep generative models, there are overarching concerns with generating fake data for malicious use (e.g. deepfake impersonations). However, our work is more concerned with accurately modelling data topology rather than generating hyper-realistic vision or audio samples, so we do not expect there to be any negative consequence in this area.

10 Acknowledgements

We would like to acknowledge Prof. Austin Benson and Junteng Jia for their insightful comments and suggestions. In addition, we would like to thank Facebook AI for funding equipment that made this work possible. We would also like to thank Joey Bose for providing access to his prerelease code.

References

  • A. Bielecki (2002) Estimation of the euler method error on a riemannian manifold. Communications in Numerical Methods in Engineering 18 (11), pp. 757–763. Cited by: §4.1.
  • A. J. Bose, A. Smofsky, R. Liao, P. Panangaden, and W. L. Hamilton (2020) Latent variable modelling with hyperbolic normalizing flows. ArXiv. To appear in ICML 2020 abs/2002.06336. Cited by: §C.3, §2, §2, Figure 4, §7.1, §7.2, Table 1.
  • J. Brehmer and K. Cranmer (2020) Flows for simultaneous manifold learning and density estimation. ArXiv abs/2003.13913. Cited by: §2.
  • T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud (2018) Neural ordinary differential equations. In NeurIPS, Cited by: §A.2, §A.2, §B.1, §B.6, §1, §4.2, §4, §5, §6, §7.2, Remark.
  • P. E. Crouch and R. Grossman (1993) Numerical integration of ordinary differential equations on manifolds. Journal of Nonlinear Science 3 (1), pp. 1–33. External Links: Document, ISBN 1432-1467, Link Cited by: §4.1, §4.1.
  • T. R. Davidson, L. Falorsi, N. D. Cao, T. Kipf, and J. M. Tomczak (2018a) Hyperspherical variational auto-encoders. In UAI, Cited by: item 2.
  • T. R. Davidson, L. Falorsi, N. De Cao, T. Kipf, and J. M. Tomczak (2018b) Hyperspherical variational auto-encoders. 34th Conference on Uncertainty in Artificial Intelligence (UAI-18). Cited by: item 1.
  • L. Dinh, J. Sohl-Dickstein, and S. Bengio (2017) Density estimation using real nvp. ArXiv abs/1605.08803. Cited by: §1, §7.1, §7.2, Table 1.
  • M.P. do Carmo (1992) Riemannian geometry. Mathematics (Boston, Mass.), Birkhäuser. External Links: ISBN 9783764334901, LCCN 91037377, Link Cited by: §3.
  • E. Dupont, A. Doucet, and Y. W. Teh (2019) Augmented neural odes. In NeurIPS, Cited by: item 2, item 3.
  • C. Durkan, A. Bekasov, I. Murray, and G. Papamakarios (2019) Neural spline flows. ArXiv abs/1906.04032. Cited by: §C.2, §1.
  • L. Falorsi, P. de Haan, T. R. Davidson, and P. Forré (2019) Reparameterizing distributions on lie groups. In AISTATS, Cited by: §2, §5.
  • L. Falorsi and P. Forré (2020) Neural ordinary differential equations on manifolds. ArXiv abs/2006.06663. Cited by: §2.
  • W. Feiten, M. Lang, and S. Hirche (2013) Rigid motion estimation using mixtures of projected gaussians. Proceedings of the 16th International Conference on Information Fusion, pp. 1465–1472. Cited by: §1.
  • N. I. Fisher, T. Lewis, and B. J. J. Embleton (1987) Statistical analysis of spherical data. Cambridge University Press. External Links: Document Cited by: item 2, §B.3.
  • M. Gemici, D. J. Rezende, and S. Mohamed (2016) Normalizing flows on riemannian manifolds. ArXiv abs/1611.02304. Cited by: §D.1, §2, §2, §2, §2, §5, Figure 4, §7.1, Table 1.
  • M. Germain, K. Gregor, I. Murray, and H. Larochelle (2015) MADE: masked autoencoder for distribution estimation. In ICML, Cited by: §1.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §1.
  • W. Grathwohl, R. T. Q. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud (2019) FFJORD: free-form continuous dynamics for scalable reversible generative models. ArXiv abs/1810.01367. Cited by: §B.1, §B.6, item 2, §1, §7.2, Table 1.
  • E. Hairer and G. Wanner (2010) Solving ordinary differential equations ii: stiff and differential-algebraic problems. Springer Series in Computational Mathematics, Springer Berlin Heidelberg. External Links: ISBN 9783642052217, Link Cited by: §5.
  • E. Hairer (2011) Solving ordinary differential equations on manifolds. Lecture notes, University of Geneva. Cited by: §3.3, §4.1, §4.1, §5.
  • T. Hamelryck, J. T. Kent, and A. Krogh (2006) Sampling realistic protein conformations using local structural bias. PLoS Computational Biology 2. Cited by: §1.
  • J. Jia and A. R. Benson (2019) Neural jump stochastic differential equations. In Advances in Neural Information Processing Systems 32, pp. 9847–9858. Cited by: item 3.
  • V. Khrulkov, L. Mirvakhabova, E. Ustinova, I. V. Oseledets, and V. S. Lempitsky (2019) Hyperbolic image embeddings. ArXiv abs/1904.02239. Cited by: §7.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §C.2, §5.
  • D. P. Kingma and P. Dhariwal (2018) Glow: generative flow with invertible 1x1 convolutions. In NeurIPS, Cited by: §1.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. CoRR abs/1312.6114. Cited by: §C.3, §1, §7.2, Table 1.
  • B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. External Links: Document, ISSN 0036-8075, Link, https://science.sciencemag.org/content/350/6266/1332.full.pdf Cited by: §7.2.
  • M. Lapaine (2011) Mollweide map projection. KoG 15 (15.), pp. 7–16. Cited by: item 1.
  • Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pp. 2278–2324. Cited by: §7.2.
  • J.M. Lee (1997) Riemannian manifolds: an introduction to curvature. Graduate Texts in Mathematics, Springer New York. External Links: ISBN 9780387982717, LCCN 97014537, Link Cited by: §3.
  • J.M. Lee (2003) Introduction to smooth manifolds. Graduate Texts in Mathematics, Springer. External Links: ISBN 9780387954486, LCCN 2002070454, Link Cited by: §3.
  • M. Lezcano-Casado (2019) Trivializations for gradient-based optimization on manifolds. arXiv: Learning. Cited by: §5, §5.
  • E. Mathieu, C. L. Lan, C. J. Maddison, R. Tomioka, and Y. W. Teh (2019) Hierarchical representations with poincaré variational auto-encoders. ArXiv abs/1901.06033. Cited by: §7.2, Table 1.
  • E. Mathieu and M. Nickel (2020) Riemannian continuous normalizing flows. Cited by: §2.
  • Y. Nagano, S. Yamaguchi, Y. Fujita, and M. Koyama (2019) A wrapped normal distribution on hyperbolic space for gradient-based learning. In ICML, Cited by: item 2, §B.3, §B.6, §6, §7.2.
  • L.S. Pontryagin, V.G. Boltyanskie, K. M. R. Collection, L.W. Neustadt, K.N. Trirogoff, R.V. Gamkrelidze, and E.F. Miŝenko (1962) The mathematical theory of optimal processes. Interscience publishers, Interscience Publishers. External Links: LCCN 62017468, Link Cited by: §B.6, item 1, §5.
  • D. J. Rezende and S. Mohamed (2015) Variational inference with normalizing flows. ArXiv abs/1505.05770. Cited by: §C.2, §1.
  • D. J. Rezende, G. Papamakarios, S. Racanière, M. S. Albergo, G. Kanwar, P. E. Shanahan, and K. Cranmer (2020) Normalizing flows on tori and spheres. ArXiv. To appear in ICML 2020 abs/2002.02428. Cited by: §C.2, §2, §2, Figure 4, §7.1.
  • O. Skopek, O. Ganea, and G. Bécigneul (2019) Mixed-curvature variational autoencoders. ArXiv abs/1911.08411. Cited by: item 1, item 2, §B.3, §C.1, §6, §7.2.
  • J. Stillwell (2008) Naive lie theory. Springer. Cited by: §2.
  • G. Ulrich (1984) Computer generation of distributions on the m-sphere. Journal of the Royal Statistical Society. Series C (Applied Statistics) 33 (2), pp. 158–163. External Links: ISSN 00359254, 14679876, Link Cited by: item 1.
  • P. Z. Wang and W. Y. Wang (2019) Riemannian normalizing flow on variational wasserstein autoencoder for text modeling. ArXiv abs/1904.02399. Cited by: §2.
  • H. Whitney (1935) Differentiable manifolds in euclidean space. Proceedings of the National Academy of Sciences of the United States of America, pp. 462–464. Cited by: Remark.
  • P. Wirnsberger, A. J. Ballard, G. Papamakarios, S. Abercrombie, S. Racanière, A. Pritzel, D. J. Rezende, and C. Blundell (2020) Targeted free energy estimation via learned mappings. arXiv: Computational Physics. Cited by: §1, §2.
  • M. J. Zahr, P. Persson, and J. Wilkening (2016) A fully discrete adjoint method for optimization of flow problems on deforming domains with time-periodicity constraints. Computers & Fluids 139, pp. 130–147. Cited by: §4.2.

Appendix A Proofs of Propositions

A.1 Dynamic Chart Method

Prop A.1 (Correctness).

If y is a solution to the chart-equivalent ODE dy/dt = D_{φ^{-1}(y)} φ ( f(φ^{-1}(y), t) ) with initial condition y(t_0) = φ(z(t_0)), then φ^{-1} ∘ y is a valid solution to Equation 1 on the chart domain.

Proof.

We see that if z = φ^{-1} ∘ y, then for all t

    dz(t)/dt = d/dt [ φ^{-1}(y(t)) ]    (5)
             = D_{y(t)} φ^{-1} ( dy(t)/dt )    (6)
             = D_{y(t)} φ^{-1} ∘ D_{φ^{-1}(y(t))} φ ( f(φ^{-1}(y(t)), t) )    (7)
             = f(z(t), t)    (8)
∎

Prop A.2 (Convergence).

There exists a finite collection of charts {φ_i} s.t. Algorithm 1 integrates the manifold ODE from t_s to t_e.

Proof.

Around each point z(t) of the solution curve, pick a chart φ_t with domain U_t, so that z(t) satisfies z(t) ∈ U_t. But the curve z([t_s, t_e]) is compact since [t_s, t_e] is compact (and z is assumed to be continuous), so we can take a finite subset of the {U_t} that covers the curve. ∎

A.2 Derivation of Gradient for Adjoint State

In this section we prove Theorem 4.1. The proof follows from the analogous one in [4], though we replace certain operations with their manifold counterparts.

Theorem A.3.

Suppose we have some manifold ODE as given in Equation 1 and we define some loss function L(z(t_e)). Suppose that there is an embedding of M in some Euclidean space ℝ^m and we identify T_z M with an n-dimensional subspace of ℝ^m. If we define the adjoint state to be a(t) = D_{z(t)} L, then the adjoint satisfies

    da(t)/dt = −a(t) D_{z(t)} f(z(t), t)    (9)

Proof.

Consider the first order approximation of z(t + ε). Since we are embedding M ⊆ ℝ^m, then under standard computations we have that

    z(t + ε) = z(t) + ε f(z(t), t) + O(ε²)    (10)

We set a(t) = D_{z(t)} L. As in the original adjoint method derivation [4], we utilize the chain rule

    a(t) = a(t + ε) ∂z(t + ε)/∂z(t)    (11)

Using these, we get that

    da(t)/dt = lim_{ε→0⁺} [ a(t + ε) − a(t) ] / ε    (12)
             = lim_{ε→0⁺} [ a(t + ε) − a(t + ε) ∂z(t + ε)/∂z(t) ] / ε    (by Equation 11)    (13)
             = lim_{ε→0⁺} a(t + ε) [ I − (I + ε D_{z(t)} f(z(t), t) + O(ε²)) ] / ε    (by Equation 10)    (14)
             = lim_{ε→0⁺} −a(t + ε) [ ε D_{z(t)} f(z(t), t) + O(ε²) ] / ε    (15)
             = −a(t) D_{z(t)} f(z(t), t)    (16)
∎

Appendix B Computation of Manifold Continuous Normalizing Flows

B.1 Background

In Euclidean space, a normalizing flow is a diffeomorphism that maps a base probability distribution into a more complex probability distribution. Suppose z is a sample from the simple distribution. By the change of variables formula, the target density of a sample x with z = f(x) (the complex distribution) is

    log p(x) = log p(f(x)) + log |det (∂f/∂x)|    (17)

There exist a variety of functions which constrain the log Jacobian determinant to be computationally tractable. An important one is the Continuous Normalizing Flow (CNF). CNFs construct f to be the solution of an ODE [4, 19]. Explicitly, let t_s < t_e be starting and ending times, and consider the ordinary differential equation dx(t)/dt = f_θ(x(t), t), where θ parameterizes the dynamics f_θ. For a sample x(t_s) from the base distribution, solving this ODE with initial condition x(t_s) gives the sample x(t_e) from the target distribution. The change in the log density given by this model satisfies an ordinary differential equation called the instantaneous change of variables formula:

    ∂ log p(x(t)) / ∂t = −tr( ∂f_θ / ∂x(t) )    (18)

We can therefore quantify the final probability as

    log p(x(t_e)) = log p(x(t_s)) − ∫_{t_s}^{t_e} tr( ∂f_θ / ∂x(t) ) dt    (19)
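A small PyTorch sketch (not the released code) of integrating Equations 18-19 jointly with a fixed-step Euler solver; the exact trace is computed with one autograd call per dimension, which is adequate for the low-dimensional spaces considered here:

```python
import torch

def trace_df_dx(f_val, x):
    """Exact trace of the Jacobian of f at x (summed autograd rows)."""
    return sum(torch.autograd.grad(f_val[i], x, retain_graph=True, create_graph=True)[0][i]
               for i in range(x.numel()))

def cnf_euler(f, x0, t0, t1, steps=50):
    """Integrate dx/dt = f(x, t) and d log p / dt = -tr(df/dx) jointly."""
    h = (t1 - t0) / steps
    x, delta_logp, t = x0.clone(), torch.zeros(()), t0
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        f_val = f(x, t)
        delta_logp = delta_logp - h * trace_df_dx(f_val, x)
        x = x + h * f_val
        t = t + h
    return x, delta_logp   # log p(x(t1)) = log p(x(t0)) + delta_logp

# Toy dynamics f(x, t) = tanh(W x); delta_logp accumulates the -trace term.
torch.manual_seed(0)
W = torch.randn(2, 2)
x_end, dlogp = cnf_euler(lambda x, t: torch.tanh(W @ x), torch.randn(2), 0.0, 1.0)
```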

B.2 Manifold Continuous Normalizing Flows

For our MCNF we split up the original time interval [t_s, t_e] into subintervals [t_{i-1}, t_i] for i = 1, …, k, where t_s = t_0 < t_1 < ⋯ < t_k = t_e. From our curve we can select the points z_i = z(t_i), and we have charts φ_i s.t. z([t_{i-1}, t_i]) lies in the domain U_i of φ_i. If our dynamics are determined by f, then this locally takes the form dy/dt = D φ_i ( f(φ_i^{-1}(y), t) ), in which y is a CNF. The update after passing through a chart and integrating is given by

    log p(z_i) = log p(z_{i-1}) − log |det D φ_i| − ∫_{t_{i-1}}^{t_i} tr( ∂f_i/∂y ) dt − log |det D φ_i^{-1}|

where log |det D φ| is a shorthand for the Riemannian probability update induced by the chart. Note that in general this is not literally a determinant (for instance when the map goes from elements in the ambient ℝ^m to elements in ℝ^n with n < m). In practice it ends up being quite similar.

We can consider our manifold ODE as a composition of these updates. Therefore, we have that

    log p(z(t_e)) = log p(z(t_s)) − Σ_{i=1}^{k} ( log |det D φ_i| + ∫_{t_{i-1}}^{t_i} tr( ∂f_i/∂y ) dt + log |det D φ_i^{-1}| )    (20)

For our cases we will be taking the charts φ_i to be (inverses of) exponential maps.

B.3 Base Distributions

Hyperbolic Space. We will use the hyperbolic wrapped normal distribution G(μ, Σ), where μ ∈ H^n and where the covariance Σ has dimension n × n [36].

  1. Sampling. To sample a z ∼ G(μ, Σ), perform the following steps (a code sketch of this procedure follows this list). A priori, fix the origin o of the hyperboloid. First, sample v ∼ N(0, Σ), viewed as an element of T_o H^n. Then parallel transport this vector to the mean, i.e. u = PT_{o→μ}(v). Lastly, project to the manifold using the exponential map, z = exp_μ(u).

  2. Probability Density. The probability density can be found through the composition of the parallel transport map and exponential map. Specifically, we have that

    log p(z) = log p(v) − log |det D( exp_μ ∘ PT_{o→μ} )|    (21)
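A NumPy sketch of the sampling procedure above on the hyperboloid model, using the closed-form parallel transport and exponential map from Table 2 (the helper names are ours, not from the released code):

```python
import numpy as np

def minkowski_dot(x, y):
    """Lorentz inner product <x, y>_L = -x_0 y_0 + sum_i x_i y_i."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def parallel_transport(o, mu, v):
    """Parallel transport of v in T_o H^n to T_mu H^n on the hyperboloid."""
    alpha = -minkowski_dot(o, mu)
    return v + minkowski_dot(mu - alpha * o, v) / (alpha + 1.0) * (o + mu)

def hyperboloid_exp(mu, u):
    """Exponential map on the hyperboloid; the norm is the Lorentz norm of a tangent vector."""
    norm_u = np.sqrt(max(minkowski_dot(u, u), 1e-15))
    return np.cosh(norm_u) * mu + np.sinh(norm_u) * u / norm_u

def sample_wrapped_normal(mu, cov, rng):
    """Sample from the hyperbolic wrapped normal G(mu, cov) on H^n."""
    n = len(mu) - 1
    o = np.zeros_like(mu)
    o[0] = 1.0                                                   # hyperboloid origin (1, 0, ..., 0)
    v = np.concatenate([[0.0], rng.multivariate_normal(np.zeros(n), cov)])  # v in T_o H^n
    u = parallel_transport(o, mu, v)
    return hyperboloid_exp(mu, u)

rng = np.random.default_rng(0)
mu = np.array([np.sqrt(2.0), 1.0, 0.0])                          # a point on H^2
z = sample_wrapped_normal(mu, 0.1 * np.eye(2), rng)
print(z, minkowski_dot(z, z))                                    # Lorentz norm stays approx -1
```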

Spherical Space. We could in principle use a relatively similar wrapped normal distribution [40]. However, we see that this is theoretically flawed since parallel transport between two conjugate points is not well defined.

Instead, we will use the von Mises-Fisher distribution, a distribution on the (m−1)-sphere in ℝ^m originally derived for use in directional statistics [15]. We denote this distribution as vMF(μ, κ), where μ is treated as an element of ℝ^m via the canonical embedding, and κ ≥ 0 is a concentration parameter. Note the following about the von Mises-Fisher distribution:

  1. Sampling. The von Mises-Fisher distribution can be sampled from with efficient methods [42, 7].

  2. Probability Density. From [15, 6], we know that the density is given by

    vMF(z; μ, κ) = C_m(κ) exp(κ μᵀ z)    (22)
    C_m(κ) = κ^{m/2 − 1} / ( (2π)^{m/2} I_{m/2 − 1}(κ) )    (23)

    Note that z lies on the sphere, C_m(κ) is the normalizing constant, and I_ν denotes the modified Bessel function of the first kind of order ν.

These baseline probability distributions are visualized in Figure 5.

Figure 5: Base distributions used for flow models. (a) A wrapped normal on H^2, visualized on the Poincaré ball. (b) A von Mises-Fisher distribution on S^2, visualized by the Mollweide projection.

B.4 Hyperbolic Space

B.4.1 Analytic Derivations

Manifold: H^n = { x ∈ ℝ^{n+1} : ⟨x, x⟩_L = −1, x_0 > 0 }
Tangent space: T_x H^n = { v ∈ ℝ^{n+1} : ⟨x, v⟩_L = 0 }
Exponential map: exp_x(v) = cosh(||v||_L) x + sinh(||v||_L) v / ||v||_L
Logarithmic map: log_x(y) = arccosh(−⟨x, y⟩_L) (y + ⟨x, y⟩_L x) / ||y + ⟨x, y⟩_L x||_L
Parallel transport: PT_{x→y}(v) = v + ⟨y − αx, v⟩_L / (α + 1) (x + y), where α = −⟨x, y⟩_L
Tangent projection: proj_x(u) = u + ⟨x, u⟩_L x
Table 2: Formulas for basic operations in hyperbolic space H^n.

For hyperbolic space, we will work with the hyperboloid H^n ⊆ ℝ^{n+1}. The analytic values of the operations are given in Table 2. Recall that the Lorentz Inner Product and Norm are given by

    ⟨x, y⟩_L = −x_0 y_0 + Σ_{i=1}^{n} x_i y_i,    ||v||_L = sqrt(⟨v, v⟩_L)    (24)

In addition, there are many useful identities which appear in our pipeline.

  1. Stereographic Projection. To visualize on the Poincaré ball, we use the stereographic projection as explained in [40], which maps a point (x_0, x_1, …, x_n) on the hyperboloid to the point (x_1, …, x_n) / (1 + x_0) on the Poincaré ball.

  2. Log Determinants. The log determinant of the derivative of the exponential map is given by [36, 40]:

    log det( D_v exp_x ) = (n − 1) log( sinh(||v||_L) / ||v||_L )    (25)

    The log determinant of the derivative of the log map is the negation of the above by the inverse function theorem, and the log determinant of the derivative of parallel transport is 0.

B.4.2 Numerical Stability

In order to ensure numerical stability, we examine several operations which are inherently numerically unstable and present solutions (a short code sketch of both guards follows this list).

  1. Arccosh. Arccosh has a domain of [1, ∞). In practice, we are concerned with the positive case, although the negative case can be similarly handled. Due to numerical instability, a value that is mathematically at least 1 may be realized as slightly less than 1 in our floating point system. To compensate, we clamp the minimum value to be 1 + ε for a small fixed ε.

  2. Sinh Division. In the exponential and logarithmic maps, there exist terms of the form sinh(x)/x. When x is small, this is numerically unstable. We special case this (and the derivative) for stability by explicitly using the limit value of 1 for these cases.
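A sketch of these two guards (the threshold 1e-7 is an arbitrary illustrative choice):

```python
import numpy as np

EPS = 1e-7  # illustrative clamp / special-case threshold

def safe_arccosh(x):
    """Clamp the argument to at least 1 + EPS before applying arccosh."""
    return np.arccosh(np.maximum(x, 1.0 + EPS))

def sinh_div(x):
    """Numerically stable sinh(x) / x, using the limit value 1 near x = 0."""
    x_safe = np.where(np.abs(x) < EPS, 1.0, x)
    return np.where(np.abs(x) < EPS, 1.0, np.sinh(x_safe) / x_safe)
```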

B.5 Spherical Space

B.5.1 Analytic Derivations

For spherical space, we work with the sphere S^n ⊆ ℝ^{n+1}. The analytic values are given in Table 3. Norms and inner products are assumed to be the Euclidean ones.

Manifold: S^n = { x ∈ ℝ^{n+1} : ⟨x, x⟩ = 1 }
Tangent Space: T_x S^n = { v ∈ ℝ^{n+1} : ⟨x, v⟩ = 0 }
Exponential map: exp_x(v) = cos(||v||) x + sin(||v||) v / ||v||
Logarithmic map: log_x(y) = arccos(⟨x, y⟩) (y − ⟨x, y⟩ x) / ||y − ⟨x, y⟩ x||
Parallel transport: PT_{x→y}(v) = v − ⟨y, v⟩ / (1 + ⟨x, y⟩) (x + y)
Tangent projection: proj_x(u) = u − ⟨x, u⟩ x
Table 3: Formulas for basic operations on the sphere S^n.

Some useful identities are

  1. Mollweide Projection. To visualize S^2, we use the Mollweide projection that is used in cartography. For latitude φ and longitude λ, the sphere is projected to coordinates (x, y) by (where θ is an auxiliary variable solved for from the given equation) [29]:

    x = (2 sqrt(2) / π) λ cos θ,    y = sqrt(2) sin θ,    2θ + sin(2θ) = π sin φ    (26)

  2. Log Determinants. The log determinant for the exponential map of the sphere is given by

    log det( D_v exp_x ) = (n − 1) log( |sin(||v||)| / ||v|| )    (27)

    The log determinant of the derivative of the log map is the negation of the above, and the log determinant of the derivative of parallel transport is 0.

B.5.2 Numerical Stability

In order to ensure numerical stability, we note that several functions are inherently numerically unstable.

  1. Sine division. In the exp and log maps, there are values of the form sin(x)/x. Note that this is numerically unstable when x is near 0. We special case this (and its derivative) to allow for better propagation.

  2. Log Derivative. We find that we need an explicit derivation of the derivative of the logarithmic map for our Manifold ODE on the Sphere (see B.7). Note that this value can be computed using backpropagation, but we derive it explicitly instead due to numerical instability of the higher order derivatives of some of our functions.

    Lemma.

    For , and , if , then

    (28)

    The limit of as is .

    Proof.

    We differentiate the equation of the logarithmic map as given in Table 3. First, suppose that . By the product rule we have,

    The summand on the right is given by

    To compute the left summand, we use the chain rule and differentiate by

    To check that the limit as is , it is enough to compute three separate limits that are all finite. It is clear that

    Since the limit of the other term in the left summand can be shown to be finite, this means that the left summand is zero in the limit. For the right summand, the only term that depends on has a limit

    where the final equality can be seen by L’Hopital’s rule. Thus, the entire limit is . ∎

B.6 Backpropagation

To update the parameters of an MCNF, we need to differentiate the loss with respect to θ, so we need to differentiate each of the summands in (4) with respect to θ. Differentiating through neural ODE blocks is done with the Euclidean adjoint method [37, 4, 19], which, for a loss L depending on the solution x(t) to a differential equation with dynamics f_θ, gives that

    dL/dθ = − ∫_{t_e}^{t_s} a(t)ᵀ ∂f_θ(x(t), t)/∂θ dt    (29)

Differentiating the dynamics is done with standard backpropagation. The adjoint state a(t) is computed by the solution of the adjoint differential equation (2) for Euclidean space, with initial condition a(t_e) = ∂L/∂x(t_e). The derivative of the loss can be computed directly. For MCNF, L is taken to be the negative log likelihood.

For the hyperbolic and spherical cases, the log determinant terms take simple forms (as in equations 25 and 27), and are thus easy to differentiate through. Moreover, for the hyperbolic VAE models, we train by maximizing the evidence lower bound (ELBO) on the log likelihood, so that differentiation is done with a reparameterization as in [36].

B.7 Designing Neural Networks

B.7.1 Construction

In general, we construct the dynamics f as follows. Suppose M is embedded in some ℝ^m. Then we construct f to be a neural network with input of dimension m + 1. The first m values are the manifold input and the last value is the time. The output of the neural network will be some vector in ℝ^m. To finalize, we project this output onto the tangent space using the linear projection proj_{T_z M}.
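A PyTorch sketch of this construction for the sphere S^2 embedded in R^3; the architecture below (hidden width, depth, activation) is illustrative rather than the paper's exact network, and the tangent projection is the one from Table 3:

```python
import torch
import torch.nn as nn

class SphereDynamics(nn.Module):
    """f(z, t): takes an embedded point z in R^3 and time t, returns a tangent vector at z."""
    def __init__(self, ambient_dim=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ambient_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, ambient_dim),
        )

    def forward(self, z, t):
        t = torch.full_like(z[..., :1], float(t))          # broadcast time as an extra input
        out = self.net(torch.cat([z, t], dim=-1))          # unconstrained output in R^3
        return out - (z * out).sum(-1, keepdim=True) * z   # tangent projection: u - <z, u> z

f = SphereDynamics()
z = torch.tensor([[0.0, 0.0, 1.0]])
print((z * f(z, 0.3)).sum())                               # approx 0: output lies in T_z S^2
```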

Hyperbolic Space. Since hyperbolic space is diffeomorphic to Euclidean space, we can parameterize all manifold dynamics with corresponding Euclidean dynamics in the tangent space at an origin o. In general, since we only require one chart, our Manifold ODE can be solved entirely in that chart, which means that we can model our full dynamics only in the tangent space (not on the manifold). By picking a basis, we can represent elements of T_o H^n as elements of ℝ^n. For the ODE block, we parameterize the dynamics using a neural network which takes in an input of dimension n + 1 (a tangent space element and the time) and outputs an element of dimension n.

Spherical Space.

For the spherical case, we use the default construction (with projection), as there is no global diffeomorphism. Note that when passing from the manifold to tangent space dynamics, we require the derivative of the logarithmic map (see B.5.2). We also must invoke a radius of injectivity, as opposed to hyperbolic space. This radius is π for all points.

B.7.2 Existence of a Solution

We construct our networks in such a way that the Picard-Lindelöf theorem holds. Our chart dynamics are given by D φ ∘ f ∘ φ^{-1}, where φ is a chart and f is a neural network. These are well behaved since 1) the neural network dynamics are well behaved, using tanh and other Lipschitz nonlinearities, and 2) the chart is well behaved since we can bound the domain to be compact.

Appendix C Experimental Details

C.1 Data

In our code release, we will include functions to generate samples from the target densities that are used for density estimation in section 7.1.

Hyperbolic Density Estimation We detail the hyperbolic densities in each row of Figure 3.

  1. The hyperbolic density in the first row of Figure 3 is a wrapped normal distribution with mean and covariance .

  2. The second density is built from a mixture of 5 Euclidean gaussians, with means and , and covariance . The resulting hyperbolic density is obtained by viewing as the tangent space , and then projecting the Euclidean density to the Hyperboloid by .

  3. The third density is a projection onto the hyperboloid of a uniform checkerboard density in . The square in the second row and third column of the checkerboard has its lower-left corner at the origin . Each square has side length .

  4. The fourth density is a mixture of four wrapped normal distributions. Letting , and , the wrapped normals are given as:

Spherical Density Estimation Details are given about the densities that were learned in each row of Figure 4.

  1. The density in the first row of Figure 4 is a wrapped normal distribution with mean and covariance .

  2. The second density is built from a mixture of 4 wrapped normals, with means , and ; all components of the mixture have covariance .

  3. The third density is a uniform checkerboard density in spherical coordinates . The rectangle in the second row and third column of the checkerboard has its lower-left corner at . The side length of each rectangle in the -axis is , and the side length in the -axis is .

Variational Inference For variational inference, we dynamically binarize the MNIST and Omniglot images with the same procedure as given in [40]. We resize the Omniglot images to 28 × 28, the same size as the MNIST images.

C.2 Density Estimation

In section 7.1 we train on batches of 200 samples from the target density (or batches of size 100 for the discrete spherical normalizing flows [39]). Our MCNF models and the hyperbolic baselines use at most 1,000,000 samples, with early stopping as needed, since the hyperbolic baselines sometimes diverge when training for too many batches. As suggested in [39], we find that the spherical baseline does indeed need more samples than this (at least 5,000,000), so we allow it to train until the density converges. Although we do not investigate sample efficiency in detail, we find that our MCNF is able to achieve better results than the spherical baseline with, frequently, over an order of magnitude fewer samples. Note that all methods use the Adam optimizer [25]. For our MCNF, our dynamics are given by a neural network of hidden dimension 32 and 4 linear layers with tanh activation; for each integration we use a Runge-Kutta 4 solver.
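A condensed sketch of the density-estimation objective (maximum likelihood on samples from the target); `flow_inverse`, `sample_target`, and `base_log_prob` are placeholders for the inverse flow, the target sampler, and the base density rather than the released training code, while the batch size and optimizer follow the text above:

```python
import torch

def train_density_estimation(flow_inverse, params, sample_target, base_log_prob,
                             batches=1000, batch_size=200, lr=1e-3):
    """Maximize log-likelihood: map target samples back to the base and apply change of variables."""
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(batches):
        x = sample_target(batch_size)                    # samples from the target density
        z, delta_logp = flow_inverse(x)                  # inverse flow plus accumulated log-det terms
        loss = -(base_log_prob(z) + delta_logp).mean()   # negative log-likelihood
        opt.zero_grad()
        loss.backward()
        opt.step()
```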

Hyperbolic Normalizing Flows For the hyperbolic discrete normalizing flows, we train with 4 hidden blocks, hidden flow dimension of 32, and tanh activations. The prior distributions used are given in section B.3 and target distributions are given in section C.1.

Spherical Normalizing Flows For the discrete spherical normalizing flows from [38], we use the recursive flow for . In designing this flow, we use the non-compact projection (NCP) flow for and the autoregressive spline flow from [11] for the interval . To increase expressiveness for the